
fix(numpy):fix the readme and audit

pull/2490/head^2
miguel 11 months ago committed by MSilva95
parent
commit
e429888bf4
  1. 248
      subjects/ai/numpy/README.md
  2. 315
      subjects/ai/numpy/audit/README.md

248
subjects/ai/numpy/README.md

@ -1,20 +1,7 @@
# NumPy ## NumPy
The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way. The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
### Exercises of the day
- Exercise 0: Environment and libraries
- Exercise 1: Your first NumPy array
- Exercise 2: Zeros
- Exercise 3: Slicing
- Exercise 4: Random
- Exercise 5: Split, concatenate, reshape arrays
- Exercise 6: Broadcasting and Slicing
- Exercise 7: NaN
- Exercise 8: Wine
- Exercise 9: Football tournament
### Virtual Environment ### Virtual Environment
- Python 3.x - Python 3.x
@ -26,53 +13,52 @@ I suggest to use the most recent one.
### Resources ### Resources
- https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9 - [Why Should We Use NumPy](https://medium.com/fintechexplained/why-should-we-use-NumPy-c14a4fb03ee9)
- https://numpy.org/doc/ - [NumPy Documentation](https://numpy.org/doc/)
- https://jakevdp.github.io/PythonDataScienceHandbook/ - [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/)
--- ---
--- ---
# Exercise 0: Environment and libraries ## Exercise 0: Environment and libraries
The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow you to write and test code within seconds. However, it is really easy to write unstable and non-reproducible code using notebooks. Keep the notebook and the underlying code clean. An article below details when the Notebook should be used. Notebooks can be used for most of the exercises of the piscine as the goal is to experiment A LOT. But no worries, you'll be asked to build a more robust structure for all the projects. The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow you to write and test code within seconds. However, it is really easy to write unstable and non-reproducible code using notebooks. Keep the notebook and the underlying code clean. Notebooks can be used for most of the exercises of the piscine as the goal is to experiment a lot. But no worries, you'll be asked to build a more robust structure for all the projects.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries. **Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend to use: I suggest utilizing:
- the **last stable versions** of Python. However, for educational purpose you will install a specific version of Python in this exercise. - The **latest stable version** of Python for your work. However, in this exercise, you'll install and use a specific Python version for educational purposes.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science. - Choose a virtual environment that aligns with your familiarity. Common choices among Data Science practitioners are `virtualenv` and `conda`.
- one of the most recent versions of the libraries required - Install the most recent versions of the required libraries to ensure compatibility and access to the latest features
1. Create a virtual environment named `ex00`, with Python `3.8`, with the following libraries: `numpy`, `jupyter`. Save the installed packages in `requirements.txt` in the current directory. 1. Begin by creating a virtual environment named `ex00` that utilizes Python version `3.8`. Install the required libraries `numpy` and `jupyter`. Save the installed packages to a file named `requirements.txt`, located in the current directory.
2. Launch a `jupyter notebook` on port `8891` and create a notebook named `Notebook_ex00`. `JupyterLab` can be used instead of Jupyter Notebook here. 2. Launch a `jupyter` notebook or `JupyterLab` on port `8891`. Create a new notebook named `Notebook_ex00`.
3. Put the text `H1 TITLE` as **heading level 1** and `H2 TITLE` as **heading level 2** in the first cell. 3. In the first cell of the notebook, set `H1 TITLE` as a **heading level 1** and `H2 TITLE` as a **heading level 2**.
4. Run `print("Buy the dip ?")` in the second cell 4. Execute `print("Buy the dip ?")` in the second cell to display the message.
### Resources: ### Resources:
- https://www.python.org/ - [python](https://www.python.org/)
- https://docs.conda.io/ - [Conda Documentation](https://docs.conda.io/)
- https://jupyter.org/ - [jupyter](https://jupyter.org/)
- https://numpy.org/ - [numpy](https://numpy.org/)
- https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330 - [Jupyter Notebook Shortcuts](https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330)
- https://odsc.medium.com/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2 - [Why You Should be Using Jupyter Notebooks](https://odsc.medium.com/why-you-should-be-using-jupyter-notebooks-ea2e568c59f2)
- https://stackoverflow.com/questions/50777849/from-conda-create-requirements-txt-for-pip3
--- ---
--- ---
# Exercise 1: Your first NumPy array ## Exercise 1: Your first NumPy array
The goal of this exercise is to use many Python data types in **NumPy** arrays. **NumPy** arrays are intensively used in **NumPy** and **Pandas**. They are flexible and allow to use optimized **NumPy** underlying functions. The objective of this exercise is to familiarize yourself with incorporating various Python data types into **NumPy** arrays. **NumPy** arrays play a vital role in both **NumPy** and **Pandas**, offering flexibility and optimized functionalities.
1. Create a NumPy array that contains: an integer, a float, a string, a dictionary, a list, a tuple, a set and a boolean. Add the following code at the end of your python file or in a cell of the jupyter notebook: 1. Create a NumPy array that contains: an `integer`, a `float`, a `string`, a `dictionary`, a `list`, a `tuple`, a `set` and a `boolean`. Add the following code at the end of your python file or in a cell of the jupyter notebook:
```python ```python
for i in your_np_array: for i in your_np_array:
@ -83,7 +69,7 @@ for i in your_np_array:
--- ---
# Exercise 2: Zeros ## Exercise 2: Zeros
The goal of this exercise is to learn to create a NumPy array with 0s. The goal of this exercise is to learn to create a NumPy array with 0s.
@ -94,20 +80,44 @@ The goal of this exercise is to learn to create a NumPy array with 0s.
--- ---
# Exercise 3: Slicing ## Exercise 3: Slicing
The goal of this exercise is to learn NumPy indexing/slicing. It lets you access values of a NumPy array efficiently and without a for loop. The goal of this exercise is to learn NumPy indexing/slicing. It lets you access values of a NumPy array efficiently and without a for loop.
1. Create a NumPy array of dimension 1 that contains all integers from 1 to 100 ordered. 1. Create a NumPy array of dimension 1 that contains all integers from 1 to 100 ordered.
2. Without using a for loop and using the array created in Q1, create an array that contains all odd integers. The expected output is: `np.array([1,3,...,99])`. _Hint_: it takes one line
3. Without using a for loop and using the array created in Q1, create an array that contains all even integers reversed. The expected output is: `np.array([100,98,...,2])`. _Hint_: it takes one line 2. Without using a for loop and using the array created in Q1, create an array that contains all odd integers. The expected output is:
4. Using array of Q1, set the value of every 3 elements of the list (starting with the second) to 0. The expected output is: `np.array([[1,0,3,4,0,...,0,99,100]])`
```console
[ 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
97 99]
```
3. Without using a for loop and using the array created in Q1, create an array that contains all even integers reversed. The expected output is:
```console
[100 98 96 94 92 90 88 86 84 82 80 78 76 74 72 70 68 66
64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30
28 26 24 22 20 18 16 14 12 10 8 6 4 2]
```
4. Using the array of Q1, set every third element (starting with the second) to 0. The expected output is:
```console
[ 1 0 3 4 0 6 7 0 9 10 0 12 13 0 15 16 0 18
19 0 21 22 0 24 25 0 27 28 0 30 31 0 33 34 0 36
37 0 39 40 0 42 43 0 45 46 0 48 49 0 51 52 0 54
55 0 57 58 0 60 61 0 63 64 0 66 67 0 69 70 0 72
73 0 75 76 0 78 79 0 81 82 0 84 85 0 87 88 0 90
91 0 93 94 0 96 97 0 99 100]
```
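The three slicing questions above can be sketched as follows. This is one possible approach, not the only accepted one; the variable names (`integers`, `odds`, `evens_reversed`, `every_third`) are illustrative assumptions.

```python
import numpy as np

# Q1: all integers from 1 to 100, built without a for loop.
integers = np.arange(1, 101)

# Q2: start at index 0 and take every second element (the odd values).
odds = integers[::2]

# Q3: start from the last element and step backwards by two (even values, reversed).
evens_reversed = integers[::-2]

# Q4: work on a copy so the original array is left untouched, then
# zero every third element starting with the second one (index 1).
every_third = integers.copy()
every_third[1::3] = 0

print(odds)
print(evens_reversed)
print(every_third)
```

Slices follow the `start:stop:step` pattern, so each question reduces to choosing the right starting index and step.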
--- ---
--- ---
# Exercise 4: Random ## Exercise 4: Random
The goal of this exercise is to learn to generate random data. The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons: In Data Science it is extremely useful to generate random data for many reasons:
@ -118,7 +128,7 @@ NumPy proposes a lot of options to generate random data. In statistics, assumpti
- Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls** it can be done using the normal distribution. In that case, we need two parameters: the mean (1.51 m) and the standard deviation (0.0741 m). NumPy provides `randn` (among others) to draw samples from a normal distribution - Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls** it can be done using the normal distribution. In that case, we need two parameters: the mean (1.51 m) and the standard deviation (0.0741 m). NumPy provides `randn` (among others) to draw samples from a normal distribution
https://numpy.org/doc/stable/reference/random/generator.html [Random Generator](https://numpy.org/doc/stable/reference/random/generator.html)
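As a rough illustration of the heights example above, here is a minimal sketch that draws normally distributed samples in two ways. The sample size of 1000 and the variable names are assumptions made for the illustration; the exact values drawn depend on the NumPy version.

```python
import numpy as np

mean_height = 1.51   # metres, taken from the example above
std_height = 0.0741  # metres, taken from the example above

# Newer Generator API: pass the mean and standard deviation directly.
rng = np.random.default_rng(888)
heights = rng.normal(loc=mean_height, scale=std_height, size=1000)

# Legacy API: `randn` draws from the standard normal distribution
# (mean 0, standard deviation 1), so the samples are rescaled by hand.
np.random.seed(888)
legacy_heights = mean_height + std_height * np.random.randn(1000)

print(heights.shape, heights.mean(), heights.std())
```

Both arrays follow the same distribution but contain different values, because the two APIs use different underlying generators.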
1. Set the seed to 888 1. Set the seed to 888
2. Generate a **one-dimensional** array of size 100 with a normal distribution 2. Generate a **one-dimensional** array of size 100 with a normal distribution
@ -129,7 +139,7 @@ https://numpy.org/doc/stable/reference/random/generator.html
--- ---
# Exercise 5: Split, concatenate, reshape arrays ## Exercise 5: Split, concatenate, reshape arrays
The goal of this exercise is to learn to concatenate and reshape arrays. The goal of this exercise is to learn to concatenate and reshape arrays.
@ -142,21 +152,27 @@ The goal of this exercise is to learn to concatenate and reshape arrays.
4. Reshape the previous array into: 4. Reshape the previous array into:
```console ```console
array([[ 1, ... , 10], [[ 1 2 3 4 5 6 7 8 9 10]
[ 11 12 13 14 15 16 17 18 19 20]
... ...
[ 91, ... , 100]]) [ 81 82 83 84 85 86 87 88 89 90]
[ 91 92 93 94 95 96 97 98 99 100]]
``` ```
Print what you've created in the previous steps.
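A minimal sketch of the concatenate-then-reshape step, assuming the two halves are built with `np.arange` (the variable names are illustrative):

```python
import numpy as np

# Two one-dimensional arrays covering 1..50 and 51..100.
first_half = np.arange(1, 51)
second_half = np.arange(51, 101)

# np.concatenate expects a sequence of arrays, hence the inner tuple.
combined = np.concatenate((first_half, second_half))

# 100 elements can be viewed as 10 rows of 10 columns.
grid = combined.reshape(10, 10)

print(grid)
```

`reshape` only succeeds when the total number of elements is unchanged, which is why 100 values fit a 10x10 grid.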
--- ---
--- ---
# Exercise 6: Broadcasting and Slicing ## Exercise 6: Broadcasting and Slicing
The goal of this exercise is to learn to access values of n-dimensional arrays efficiently. The goal of this exercise is to learn to access values of n-dimensional arrays efficiently.
1. Create an 2-dimensional array size 9,9 of 1s. Each value has to be an `int8`. **Using a for loop is not allowed in this exercise.**
2. Using **slicing**, output this array:
1. Generate a 2-dimensional array of size 9x9, with all elements initialized to 1 and of type `int8`.
2. Using **slicing**, create the following array:
```python ```python
array([[1, 1, 1, 1, 1, 1, 1, 1, 1], array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
@ -170,35 +186,58 @@ The goal of this exercise is to learn to access values of n-dimensional arrays e
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8) [1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8)
``` ```
3. Using **broadcasting** create the ouptu matrix starting from these two arrays: 3. Using **broadcasting** create an output matrix based on the following two arrays:
```python ```python
array_1 = np.array([1,2,3,4,5], dtype=int8) array_1 = np.array([1,2,3,4,5], dtype=int8)
array_2 = np.array([1,2,3], dtype=int8) array_2 = np.array([1,2,3], dtype=int8)
...
# output matrix
array([[ 1, 2, 3],
[ 2, 4, 6],
[ 3, 6, 9],
[ 4, 8, 12],
[ 5, 10, 15]], dtype=int8)
``` ```
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: Computation on Arrays: Broadcasting) Expected output:
--- ```console
[[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]
[1 1 1 1 1 1 1 1 1]]
[[1 1 1 1 1 1 1 1 1]
[1 0 0 0 0 0 0 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 0 1 0 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 0 0 0 0 0 0 1]
[1 1 1 1 1 1 1 1 1]]
[[ 1 2 3]
[ 2 4 6]
[ 3 6 9]
[ 4 8 12]
[ 5 10 15]]
```
### Resources
[Computation on Arrays: Broadcasting](https://jakevdp.github.io/PythonDataScienceHandbook/)
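The slicing and broadcasting questions can be sketched as below. The nested-square assignments follow the possible solution listed in the previous version of the audit; the names `squares` and `product`, and the use of `np.newaxis`, are choices made for this illustration.

```python
import numpy as np

# Q1: 9x9 array of ones stored as int8.
squares = np.ones((9, 9), dtype=np.int8)

# Q2: alternate 0s and 1s on nested squares using slices only.
squares[1:8, 1:8] = 0
squares[2:7, 2:7] = 1
squares[3:6, 3:6] = 0
squares[4, 4] = 1

# Q3: turn array_1 into a (5, 1) column so that multiplying by the
# (3,) row array_2 broadcasts to a (5, 3) result.
array_1 = np.array([1, 2, 3, 4, 5], dtype=np.int8)
array_2 = np.array([1, 2, 3], dtype=np.int8)
product = array_1[:, np.newaxis] * array_2

print(squares)
print(product)
```

Broadcasting stretches the size-1 axis of the column and the missing leading axis of the row, so no explicit loop or copy is needed.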
--- ---
# Exercise 7: NaN ---
The goal of this exercise is to learn to deal with missing data in NumPy and to manipulate NumPy arrays. ## Exercise 7: NaN
Let us consider a 2-dimensional array that contains the grades at the past two exams. Some of the students missed the first exam. As the grade is missing it has been replaced with a `NaN`. The goal of this exercise is to handle missing data in NumPy and manipulate arrays effectively.
1. Using `np.where` create a third column that is equal to the grade of the first exam if it exists and the second else. Add the column as the third column of the array. Let's consider a 2-dimensional array containing grades from the last two exams. Some students missed the first exam, so their grades are replaced with `NaN`.
**Using a for loop or if/else statement is not allowed in this exercise.** To simulate this scenario, we'll create a mock dataset using NumPy. Here's a snippet of code to generate this dataset:
```python ```python
import numpy as np import numpy as np
@ -209,46 +248,83 @@ grades[[1,2,5,7], [0,0,0,0]] = np.nan
print(grades) print(grades)
``` ```
This code returns:
```console
[[ 7. 1.]
[nan 2.]
[nan 8.]
[ 9. 3.]
[ 8. 9.]
[nan 2.]
[ 8. 2.]
[nan 6.]
[ 9. 2.]
[ 8. 5.]]
```
1. Using `np.where`, create a third column that takes the grade of the first exam if available; otherwise, it uses the grade from the second exam. Add this column as the third column of the array.
**Using a for loop or if/else statement is not allowed in this exercise.**
Expected output:
```console
[[ 7. 1. 7.]
[nan 2. 2.]
[nan 8. 8.]
[ 9. 3. 9.]
[ 8. 9. 8.]
[nan 2. 2.]
[ 8. 2. 8.]
[nan 6. 6.]
[ 9. 2. 9.]
[ 8. 5. 8.]]
```
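One way to build the third column is sketched below on a small made-up array; the four rows are illustrative assumptions and are not the dataset generated above.

```python
import numpy as np

# Hypothetical grades: column 0 is the first exam, column 1 the second.
grades = np.array([
    [7.0, 1.0],
    [np.nan, 2.0],
    [9.0, 3.0],
    [np.nan, 6.0],
])

first_exam = grades[:, 0]
second_exam = grades[:, 1]

# Where the first grade is missing, fall back to the second one.
best_available = np.where(np.isnan(first_exam), second_exam, first_exam)

# Append the new vector as a third column.
result = np.column_stack((grades, best_available))

print(result)
```

`np.isnan` is needed here because `NaN` never compares equal to itself, so a test such as `first_exam == np.nan` would always be false.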
--- ---
--- ---
# Exercise 8: Wine ## Exercise 8: Wine
The goal of this exercise is to learn to perform a basic data analysis on real data using NumPy. The goal of this exercise is to perform fundamental data analysis on real data using NumPy.
The data set that will be used for this exercise is the red wine data set. The dataset chosen for this task is the [red wine dataset](https://archive.ics.uci.edu/ml/datasets/wine+quality)
https://archive.ics.uci.edu/ml/datasets/wine+quality 1. Load the data using `genfromtxt`, specifying the delimiter as ';', and optimize the numpy array size by reducing the data types. Ensure that the sum of absolute differences between the original and the "memory" optimized dataset is less than `1e-3`. Use `np.float32` and verify that the resulting numpy array weighs **76800 bytes**.
How to tell if a given 2D array has null columns? 2. Display the 2nd, 7th, and 12th rows as a two-dimensional array.
1. Using `genfromtxt` load the data and reduce the size of the numpy array by optimizing the types. The sum of absolute differences between the original data set and the "memory" optimized one has to be smaller than `1e-3`. I suggest using `np.float32`. Check that the numpy array weighs **76800 bytes**. 3. Determine if there is any wine in the dataset with an alcohol percentage greater than 20%. Return True or False.
2. Print 2nd, 7th and 12th rows as a two dimensional array 4. Calculate the average alcohol percentage across all wines in the dataset. Exclude `np.nan` values if present.
3. Is there any wine with a percentage of alcohol greater than 20% ? Return True or False 5. Compute various statistical measures (minimum, maximum, 25th percentile, 50th percentile, 75th percentile and the mean for the pH values).
4. What is the average % of alcohol on all wines in the data set ? If needed, drop `np.nan` values > _Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`._
5. Compute the minimum, the maximum, the 25th percentile, the 50th percentile, the 75th percentile, the mean of the pH 6. Find the average quality score of wines with the 20% least sulphate content.
6. Compute the average quality of the wines having the 20% least sulphates **Tip:** The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` if the value is smaller than the 20th percentile, then select these rows with the column quality and compute the `mean`.
7. Compute the mean of all variables for wines having the best quality. Same question for the wines having the worst quality 7. Compute the mean of all variables for wines with the best quality. Also, do the same for wines with the worst quality.
--- **Tip:** This can be done in three steps: Get the max, create a boolean mask that indicates rows with max quality, use this mask to subset the rows with the best quality and compute the mean on the axis 0.
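The two tips above rely on the same boolean-mask pattern. The sketch below applies it to a small made-up array with only two columns (sulphates and quality); the values and the column layout are assumptions for illustration and do not come from the red wine dataset.

```python
import numpy as np

# Hypothetical data: column 0 = sulphates, column 1 = quality.
wine = np.array([
    [0.40, 5], [0.55, 6], [0.60, 7], [0.45, 5], [0.80, 8],
    [0.70, 6], [0.50, 4], [0.65, 8], [0.58, 6], [0.90, 7],
], dtype=np.float32)

sulphates = wine[:, 0]
quality = wine[:, 1]

# Question 6 pattern: threshold at the 20th percentile, then mask.
threshold = np.percentile(sulphates, 20)
low_sulphates = sulphates < threshold
mean_quality_low = quality[low_sulphates].mean()

# Question 7 pattern: mask the rows with the maximum quality,
# then average every column along axis 0.
best_rows = quality == quality.max()
best_means = wine[best_rows].mean(axis=0)

# Memory footprint: rows * columns * bytes per float32 value.
print(wine.nbytes, mean_quality_low, best_means)
```

The same `nbytes` attribute is what the first question asks you to check after converting the loaded data to `np.float32`.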
--- ---
# Exercise 9: Football tournament ## Exercise 9: Football tournament
This exercise focuses on utilizing permutations and complex computations.
The goal of this exercise is to learn to use permutations, complex A Football tournament is underway in your city involving 10 teams. The tournament director seeks an engaging first round and has delegated the pairing decisions to you.
A Football tournament is organized in your city. There are 10 teams and the director of the tournaments wants you to create a first round as exciting as possible. To do so, you are allowed to choose the pairs. As a former data scientist, you implemented a model based on teams' current season performance. This models predicts the score difference between two teams. You used this algorithm to predict the score difference for every possible pair. Leveraging your expertise as a former data scientist, you've developed a predictive model based on teams' current season performance. This model forecasts the score difference between any two teams.
The matrix returned is a 2-dimensional array that contains in (i,j) the score difference between team i and j. The matrix is in [model_forecasts.txt](data/model_forecasts.txt).
Using this output, what are the pairs that will give the most interesting matches ? The model generates a 2-dimensional array stored in [model_forecasts.txt](data/model_forecasts.txt). Each (i, j) entry in this matrix signifies the predicted score difference between Team i and Team j.
The objective is to determine the pairs that will result in the most interesting matches.
If a team wins 7-1 the match is obviously less exciting than a match where the winner wins 2-1. If a team wins 7-1 the match is obviously less exciting than a match where the winner wins 2-1.
The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairs that minimize the sum of squared differences** The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairs that minimize the sum of squared differences**
@ -263,6 +339,4 @@ The expected output is:
- m1_t1 stands for match1_team1 - m1_t1 stands for match1_team1
- m1_t1 plays against m1_t2 ... - m1_t1 plays against m1_t2 ...
**Usage of for loop is not allowed, you may need to use the library** `itertools` **to create permutations** **Usage of for loop is not allowed, you may need to use the library [itertools](https://docs.python.org/3.9/library/itertools.html) to create permutations.**
https://docs.python.org/3.9/library/itertools.html
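The minimization criterion can be sketched on a tiny example. The 4-team matrix below is hypothetical (the real forecasts live in `model_forecasts.txt` and cover 10 teams), and the variable names are assumptions; the idea is to enumerate every ordering of the teams, read consecutive entries as a match, and keep the ordering with the smallest sum of squared differences.

```python
import itertools

import numpy as np

# Hypothetical forecasts: entry (i, j) is the predicted score
# difference between team i and team j.
forecasts = np.array([
    [0, 3, 1, 5],
    [-3, 0, 4, 1],
    [-1, -4, 0, 2],
    [-5, -1, -2, 0],
])
n_teams = forecasts.shape[0]

# Every ordering of the teams, one ordering per row.
orderings = np.array(list(itertools.permutations(range(n_teams))))

# Even positions play against the following odd positions.
team_a = orderings[:, 0::2]
team_b = orderings[:, 1::2]

# Fancy indexing fetches all forecasts at once; sum the squares per ordering.
costs = (forecasts[team_a, team_b] ** 2).sum(axis=1)

best_ordering = orderings[costs.argmin()]
print(best_ordering, costs.min())
```

With 10 teams the same code enumerates 10! (about 3.6 million) orderings, many of which describe the same set of matches; this is wasteful but keeps the computation free of explicit for loops.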

315
subjects/ai/numpy/audit/README.md

@ -1,7 +1,5 @@
#### Exercise 0: Environment and libraries #### Exercise 0: Environment and libraries
##### The exercise is validated if all questions of the exercise are validated
##### Install the virtual environment with `requirements.txt` ##### Install the virtual environment with `requirements.txt`
##### Activate the virtual environment. If you used `conda`, run `conda activate ex00` ##### Activate the virtual environment. If you used `conda`, run `conda activate ex00`
@ -33,13 +31,13 @@
#### Exercise 1: Your first NumPy array #### Exercise 1: Your first NumPy array
##### Add cell and run `type(your_numpy_array)` ##### Add a cell and execute `type(your_numpy_array)`.
###### Is the your_numpy_array an NumPy array? It can be checked with that should be equal to `numpy.ndarray`. ###### Is `your_numpy_array` identified as a NumPy array? It should display as `numpy.ndarray`.
##### Run all the cells of the notebook or `python main.py` ##### Execute all the cells within the notebook or use `python main.py`.
###### Are the types printed are as follows? ###### Can you confirm that the types printed match the following:
``` ```
<class 'int'> <class 'int'>
@ -60,11 +58,43 @@
#### Exercise 2: Zeros #### Exercise 2: Zeros
##### The exercise is validated if all questions of the exercise are validated ###### For question 1, does the solution use `np.zeros` and is the shape of the array `(300,)` like below?
```console
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
```
###### For question 1, does the solution use `np.zeros` and is the shape of the array `(300,)`? ###### For question 2, does the solution use `reshape` and is the shape of the array `(3, 100)` like below?
###### For question 2, does the solution use `reshape` and is the shape of the array `(3, 100)`? ```console
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0.]]
```
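For reference while auditing, a minimal sketch that produces both shapes (the variable names are illustrative, and the student's solution may differ):

```python
import numpy as np

# Question 1: one-dimensional array of 300 zeros (float64 by default).
zeros_flat = np.zeros(300)

# Question 2: the same 300 values viewed as 3 rows of 100 columns.
zeros_grid = zeros_flat.reshape(3, 100)

print(zeros_flat.shape, zeros_grid.shape)
```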
--- ---
@ -72,19 +102,44 @@
#### Exercise 3: Slicing #### Exercise 3: Slicing
##### The exercise is validated if all questions of the exercise are validated ###### The exercise is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`. Are the previous requirements fulfilled?
###### For question 1, is validated if the solution doesn't involve a for loop or writing all integers from 1 to 100 and if the array is: `np.array([1,...,100])`. The list from 1 to 100 can be generated with an iterator: `range`. Were the previous requirements fulfilled? ###### For question 1, does the output look like below?
###### For question 2, is the solution `integers[::2]`? ```console
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100]
```
###### For question 3, is the solution `integers[::-2]`? ###### For question 2, does the output look like below?
###### For question 4, is the array `np.array([1,0,3,4,0,...,0,99,100])`? There are at least two ways to get this results without for loop. The first one uses `integers[1::3] = 0` and the second involves creating a boolean array that indexes the array: ```console
[ 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95
97 99]
```
```python ###### For question 3, does the output look like below?
mask = (integers+1)%3 == 0
integers[mask] = 0 ```console
[100 98 96 94 92 90 88 86 84 82 80 78 76 74 72 70 68 66
64 62 60 58 56 54 52 50 48 46 44 42 40 38 36 34 32 30
28 26 24 22 20 18 16 14 12 10 8 6 4 2]
```
###### For question 4, does the output look like below?
```console
[ 1 0 3 4 0 6 7 0 9 10 0 12 13 0 15 16 0 18
19 0 21 22 0 24 25 0 27 28 0 30 31 0 33 34 0 36
37 0 39 40 0 42 43 0 45 46 0 48 49 0 51 52 0 54
55 0 57 58 0 60 61 0 63 64 0 66 67 0 69 70 0 72
73 0 75 76 0 78 79 0 81 82 0 84 85 0 87 88 0 90
91 0 93 94 0 96 97 0 99 100]
``` ```
--- ---
@ -93,18 +148,16 @@ integers[mask] = 0
#### Exercise 4: Random #### Exercise 4: Random
##### The exercise is validated if all questions of the exercise are validated > Note: For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
##### For this exercise, as the results may change depending on the version of the package or the OS, I give the code to correct the exercise. If the code is correct and the output is not the same as mine, it is accepted.
###### For question 1, is the solution `np.random.seed(888)`? ###### For question 1, does the solution contain `np.random.seed(888)`?
###### For question 2, is the output of the solution the same as `np.random.randn(100)`? The value of the first element is `0.17620087373662233`. ###### For question 2, does the solution contain `np.random.randn(100)`?
###### For question 3, is the solution `np.random.randint(1,11,(8,8))`? ###### For question 3, does the solution contain `np.random.randint(1,11,(8,8))`?
```console ```console
Given the NumPy version and the seed, you should have this output: Given the NumPy version and the seed, this is my output:
array([[ 7, 4, 8, 10, 2, 1, 1, 10], array([[ 7, 4, 8, 10, 2, 1, 1, 10],
[ 4, 1, 7, 4, 3, 5, 2, 8], [ 4, 1, 7, 4, 3, 5, 2, 8],
@ -116,10 +169,10 @@ integers[mask] = 0
[ 4, 4, 9, 2, 8, 5, 9, 5]]) [ 4, 4, 9, 2, 8, 5, 9, 5]])
``` ```
###### For question 4, is the solution `np.random.randint(1,18,(4,2,5))`? ###### For question 4, does the solution contain `np.random.randint(1,18,(4,2,5))`?
```console ```console
Given the NumPy version and the seed, you should have this output: Given the NumPy version and the seed, this is my output:
array([[[14, 16, 8, 15, 14], array([[[14, 16, 8, 15, 14],
[17, 13, 1, 4, 17]], [17, 13, 1, 4, 17]],
@ -140,25 +193,34 @@ integers[mask] = 0
#### Exercise 5: Split, concatenate, reshape arrays #### Exercise 5: Split, concatenate, reshape arrays
##### The exercise is validated if all questions of the exercise are validated ###### Run the exercise and check if the output is the same as below:
###### For question 1, is the generated array based on an iterator as `range` or `np.arange`? Check that 50 is part of the array.
###### For question 2, is the generated array based on an iterator as `range` or `np.arange`? Check that 100 is part of the array.
###### For question 3, is the array concatenated this way `np.concatenate((array1, array2))`?
###### For question 4, is the result the following?
```console ```console
array([[ 1, ... , 10], [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
... 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
[ 91, ... , 100]]) 49 50]
[ 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
87 88 89 90 91 92 93 94 95 96 97 98 99 100]
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100]
[[ 1 2 3 4 5 6 7 8 9 10]
[ 11 12 13 14 15 16 17 18 19 20]
[ 21 22 23 24 25 26 27 28 29 30]
[ 31 32 33 34 35 36 37 38 39 40]
[ 41 42 43 44 45 46 47 48 49 50]
[ 51 52 53 54 55 56 57 58 59 60]
[ 61 62 63 64 65 66 67 68 69 70]
[ 71 72 73 74 75 76 77 78 79 80]
[ 81 82 83 84 85 86 87 88 89 90]
[ 91 92 93 94 95 96 97 98 99 100]]
``` ```
The easiest way is to use `array.reshape(10,10)`. ###### Can you confirm that the student didn't just print the expected result?
https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of NumPy Arrays)
--- ---
@ -166,54 +228,44 @@ https://jakevdp.github.io/PythonDataScienceHandbook/ (section: The Basics of Num
#### Exercise 6: Broadcasting and Slicing #### Exercise 6: Broadcasting and Slicing
##### The exercise is validated if all questions of the exercise are validated ###### Run the exercise and check if the output is the same as below:
###### For question 1, is the output the same as the following?
`np.ones([9,9], dtype=np.int8)`
###### For question 2, is the output the following?
```console ```console
array([[1, 1, 1, 1, 1, 1, 1, 1, 1], [[1 1 1 1 1 1 1 1 1]
[1, 0, 0, 0, 0, 0, 0, 0, 1], [1 1 1 1 1 1 1 1 1]
[1, 0, 1, 1, 1, 1, 1, 0, 1], [1 1 1 1 1 1 1 1 1]
[1, 0, 1, 0, 0, 0, 1, 0, 1], [1 1 1 1 1 1 1 1 1]
[1, 0, 1, 0, 1, 0, 1, 0, 1], [1 1 1 1 1 1 1 1 1]
[1, 0, 1, 0, 0, 0, 1, 0, 1], [1 1 1 1 1 1 1 1 1]
[1, 0, 1, 1, 1, 1, 1, 0, 1], [1 1 1 1 1 1 1 1 1]
[1, 0, 0, 0, 0, 0, 0, 0, 1], [1 1 1 1 1 1 1 1 1]
[1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=int8) [1 1 1 1 1 1 1 1 1]]
```
[[1 1 1 1 1 1 1 1 1]
[1 0 0 0 0 0 0 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 0 1 0 1 0 1]
[1 0 1 0 0 0 1 0 1]
[1 0 1 1 1 1 1 0 1]
[1 0 0 0 0 0 0 0 1]
[1 1 1 1 1 1 1 1 1]]
[[ 1 2 3]
[ 2 4 6]
[ 3 6 9]
[ 4 8 12]
[ 5 10 15]]
##### The solution of question 2 is not accepted if the values of the array have been changed one by one manually. Using a for loop is not allowed either.
Here is an example of a possible solution:
```python
x[1:8,1:8] = 0
x[2:7,2:7] = 1
x[3:6,3:6] = 0
x[4,4] = 1
``` ```
###### For question 3, is the output the following? ##### Check the solution for cheating like:
```console - The values of the array have been changed one by one manually.
array([[ 1, 2, 3], - The usage of the for loop, which is not allowed.
[ 2, 4, 6], - Printing the full output given in the readme.
[ 3, 6, 9],
[ 4, 8, 12],
[ 5, 10, 15]], dtype=int8)
```
##### The solution of question 3 is not accepted if the values of the array have been changed one by one manually. Using a for loop is not allowed either.
Here is an example of a possible solution: ###### Can you confirm that there was no cheating in the solution?
```python
np.reshape(arr_1, (5, 1)) * arr_2
```
--- ---
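For reference, a self-contained sketch that reproduces the three expected outputs of this exercise using only slicing and broadcasting. The definitions of `arr_1` and `arr_2` are assumptions inferred from the expected 5x3 table; the exercise statement may define them differently.

```python
import numpy as np

# Question 1: a 9x9 array of ones stored as 8-bit integers.
ones = np.ones([9, 9], dtype=np.int8)

# Question 2: carve nested squares with slice assignment, no loops.
x = ones.copy()
x[1:8, 1:8] = 0
x[2:7, 2:7] = 1
x[3:6, 3:6] = 0
x[4, 4] = 1

# Question 3: broadcasting a (5, 1) column against a (3,) row gives a (5, 3) table.
# arr_1 and arr_2 are assumed values, chosen to match the expected output.
arr_1 = np.arange(1, 6, dtype=np.int8)
arr_2 = np.arange(1, 4, dtype=np.int8)
table = np.reshape(arr_1, (5, 1)) * arr_2

print(ones)
print(x)
print(table)
```

Each slice assignment overwrites a smaller centered square, which is why four statements are enough to build the pattern.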
@ -221,8 +273,6 @@ Here is an example of a possible solution:
#### Exercise 7: NaN #### Exercise 7: NaN
##### The exercise is validated if all questions of the exercise are validated
###### Without having used a for loop or having filled the array manually, is the output the following? ###### Without having used a for loop or having filled the array manually, is the output the following?
```console ```console
@ -238,82 +288,97 @@ Here is an example of a possible solution:
[ 8. 5. 8.]] [ 8. 5. 8.]]
``` ```
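A loop-free way to build the third column can be sketched as follows. The `grades` array here is hypothetical sample data (only its last row matches the expected output shown above); the real exercise uses its own array.

```python
import numpy as np

# Hypothetical grades: column 0 is the first exam, column 1 the second.
grades = np.array([[7.0, 1.0],
                   [np.nan, 2.0],
                   [np.nan, 8.0],
                   [9.0, 3.0],
                   [8.0, 5.0]])

# Take the first exam grade when available, otherwise the second.
new_vector = np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])

# Append the vector as a third column; new_vector[:, None] has shape (5, 1).
result = np.hstack((grades, new_vector[:, None]))

print(result)
```

`np.insert(arr=grades, values=new_vector, axis=1, obj=2)` would give the same array.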
There are two steps in this exercise: ---
---
#### Exercise 8: Wine
- Create the vector that contains the grade of the first exam if available or the second. This can be done using `np.where`: ###### Was the text file successfully loaded into a NumPy array using `genfromtxt('winequality-red.csv', delimiter=';')` and optimized for memory usage, weighing `76800` bytes or less?
```python Use this in the solution to confirm:
np.where(np.isnan(grades[:, 0]), grades[:, 1], grades[:, 0])
```
- Add this vector as the third column of the array. Here are two ways: ```python
```python # Check the optimized data size and absolute differences
np.insert(arr = grades, values = new_vector, axis = 1, obj = 2) optimized_size = optimized_data.nbytes
abs_diff = np.sum(np.abs(original_data - optimized_data))
np.hstack((grades, new_vector[:, None])) # To verify if criteria are met:
if abs_diff < 1e-3 and optimized_size <= 76800:
print("Data optimized successfully.")
else:
print("Optimization criteria not met.")
``` ```
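To illustrate what the memory check for the Wine exercise measures, here is a small sketch on synthetic data. It assumes the optimization is a cast from `float64` to `float32`, which is one plausible approach and not necessarily the one a student chose; the two sample rows and the `1e-3` tolerance are assumptions as well.

```python
import numpy as np

# Synthetic stand-in for the loaded data (float64 by default).
original_data = np.array([[7.4, 0.70, 0.00, 1.9],
                          [7.8, 0.88, 0.00, 2.6]])

# Assumed optimization: store each value on 4 bytes instead of 8.
optimized_data = original_data.astype(np.float32)

optimized_size = optimized_data.nbytes
abs_diff = np.sum(np.abs(original_data - optimized_data))

print(original_data.nbytes, optimized_size, abs_diff)
```

On the full file, an array of 1600 rows by 12 columns stored as `float32` weighs 1600 * 12 * 4 = 76800 bytes, which matches the limit quoted in the audit.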
--- ##### For question 2:
--- ###### Is the output the following?
#### Exercise 8: Wine ```console
[[ 7.8 0.76 0.04 2.3 0.092 15. 54. 0.997 3.26
0.65 9.8 5. ]
[ 7.3 0.65 0. 1.2 0.065 15. 21. 0.9946 3.39
0.47 10. 7. ]
[ 5.6 0.615 0. 1.6 0.089 16. 59. 0.9943 3.58
0.52 9.9 5. ]]
```
##### The exercise is validated if all questions of the exercise are validated This slicing gives the answer `data[[2,7,12],:]`.
###### Has the text file successfully been loaded in a NumPy array with `genfromtxt('winequality-red.csv', delimiter=';')` and the reduced arrays weights **76800 bytes**? ##### For question 3:
###### Is the output the following? "Determine if there is any wine in the dataset with an alcohol percentage greater than 20%. Return True or False."
```python ###### Is the answer `False`?
array([[ 7.4 , 0.7 , 0. , 1.9 , 0.076 , 11. , 34. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ], ##### For question 4:
[ 7.4 , 0.66 , 0. , 1.8 , 0.075 , 13. , 40. ,
0.9978, 3.51 , 0.56 , 9.4 , 5. ], "Calculate the average alcohol percentage across all wines in the dataset. Exclude `np.nan` values if present."
[ 6.7 , 0.58 , 0.08 , 1.8 , 0.097 , 15. , 65. ,
0.9959, 3.28 , 0.54 , 9.2 , 5. ]])
```
This slicing gives the answer `my_data[[1,6,11],:]`. ###### Is the answer `10.422984`?
###### Is the answer False? There are many ways to get the answer: find the maximum or check values greater than 20. ##### For question 5:
###### Is the answer 10.422983114446529? "Compute various statistical measures (minimum, maximum, 25th percentile, 50th percentile, 75th percentile and the mean for the pH values)."
###### Is the answer the following? ###### Do you have the correct results, as below?
```console ```console
pH stats
25 percentile: 3.21 25 percentile: 3.21
50 percentile: 3.31 50 percentile: 3.31
75 percentile: 3.4 75 percentile: 3.40
mean: 3.3111131957473416 mean: 3.31
min: 2.74 min: 2.74
max: 4.01 max: 4.01
``` ```
> *Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`.* > _Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`._
###### Is the answer ~`5.2`? The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` if the value is smaller than the 20th percentile, then select these rows in the column quality and compute the `mean`. ##### For question 6:
"Find the average quality score of wines with the 20% least sulphate content."
###### Is the answer ~`5.2`?
##### For question 7:
Compute the mean of all variables for wines with the best quality. Also, do the same for wines with the worst quality.
###### Is the output for the best wines the following? ###### Is the output for the best wines the following?
```python ```console
array([ 8.56666667, 0.42333333, 0.39111111, 2.57777778, 0.06844444, [ 8.566666 0.4233333 0.39111114 2.5777776 0.06844445 13.277778
13.27777778, 33.44444444, 0.99521222, 3.26722222, 0.76777778, 33.444443 0.99521226 3.2672222 0.76777774 12.094444 8. ]
12.09444444, 8. ])
``` ```
###### Is the output for the bad wines the following? ###### Is the output for the bad wines the following?
```python ```console
array([ 8.36 , 0.8845 , 0.171 , 2.635 , 0.1225 , 11. , [ 8.359999 0.8845 0.17099999 2.6350002 0.12249999 11.
24.9 , 0.997464, 3.398 , 0.57 , 9.955 , 3. ]) 24.9 0.997464 3.398 0.57000005 9.955 3. ]
``` ```
This can be done in three steps: Get the max, create a boolean mask that indicates rows with max quality, use this mask to subset the rows with the best quality and compute the mean on the axis 0.
--- ---
--- ---
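The boolean-mask pattern behind the Wine questions can be sketched on a tiny synthetic array. The data and the three-column layout (`sulphates`, `alcohol`, `quality`) are hypothetical; the real file has 12 columns, so the column indices would differ.

```python
import numpy as np

# Hypothetical rows: sulphates, alcohol, quality.
data = np.array([[0.40, 9.0, 5.0],
                 [0.50, 9.5, 5.0],
                 [0.60, 10.0, 6.0],
                 [0.70, 11.0, 8.0],
                 [0.80, 12.0, 8.0],
                 [0.45, 9.2, 3.0]])

sulphates, alcohol, quality = data[:, 0], data[:, 1], data[:, 2]

# Any wine above 20% alcohol?
has_strong_wine = bool(np.any(alcohol > 20))

# Mean quality of the wines below the 20th percentile of sulphates.
threshold = np.percentile(sulphates, 20)
low_sulphate_quality = quality[sulphates < threshold].mean()

# Mean of every column for the best and the worst quality wines.
best_mean = data[quality == quality.max()].mean(axis=0)
worst_mean = data[quality == quality.min()].mean(axis=0)

print(has_strong_wine, low_sulphate_quality)
print(best_mean)
print(worst_mean)
```

If the loaded array contains `np.nan` values (for example a header row read by `genfromtxt`), `np.nanmean`, `np.nanmax` and `np.nanpercentile` are the NaN-aware counterparts.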
