The goal of this day is to understand practical usage of **NumPy**. **NumPy** is a commonly used Python data analysis package. By using **NumPy**, you can speed up your workflow, and interface with other packages in the Python ecosystem, like scikit-learn, that use **NumPy** under the hood. **NumPy** was originally developed in the mid 2000s, and arose from an even older package called Numeric. This longevity means that almost every data analysis or machine learning package for Python leverages **NumPy** in some way.
### Virtual Environment
- Python 3.x
- NumPy
- Jupyter or JupyterLab
_Version of NumPy I used to do the exercises: 1.18.1_.
The goal of this exercise is to set up the Python work environment with the required libraries and to learn to launch a `jupyter notebook`. Jupyter notebooks are very convenient as they allow you to write and test code within seconds. However, it is really easy to write unstable and non-reproducible code in notebooks. Keep the notebook and the underlying code clean. Notebooks can be used for most of the exercises of the piscine, as the goal is to experiment a lot. But no worries, you'll be asked to build a more robust structure for all the projects.
- Use the **latest stable version** of Python for your own work. In this exercise, however, you'll install and use a specific Python version for educational purposes.
- Choose a virtual environment that aligns with your familiarity. Common choices among Data Science practitioners are `virtualenv` and `conda`.
- Install the most recent versions of the required libraries to ensure compatibility and access to the latest features.
1. Begin by creating a virtual environment named `ex00` that utilizes Python version `3.8`. Install the required libraries `numpy` and `jupyter`. Save the installed packages to a file named `requirements.txt`, located in the current directory.
The objective of this exercise is to familiarize yourself with incorporating various Python data types into **NumPy** arrays. **NumPy** arrays play a vital role in both **NumPy** and **Pandas**, offering flexibility and optimized functionalities.
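As a minimal sketch of the idea (the values are illustrative and cover only a subset of the types the exercise asks for), unrelated Python types can share one array when it is created with `dtype=object`, so that each element keeps a reference to its original Python object:

```python
import numpy as np

# Illustrative values only; not the exercise's expected answer.
mixed = np.array([42, 3.14, "text", {"key": "value"}], dtype=object)

# Each element keeps its original Python type.
element_types = [type(item) for item in mixed]

print(mixed.dtype)
print(element_types)
```

An object array gives up most of NumPy's speed advantages, since operations fall back to ordinary Python objects, so it is best reserved for cases where mixed types are really needed.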
1. Create a NumPy array that contains: an `integer`, a `float`, a `string`, a `dictionary`, a `list`, a `tuple`, a `set` and a `boolean`. Add the following code at the end of your python file or in a cell of the jupyter notebook:
The goal of this exercise is to learn to generate random data.
In Data Science it is extremely useful to generate random data for many reasons: a lack of real data, the need for a random benchmark, or the need for varied data sets.
NumPy offers many options for generating random data. In statistics, assumptions are made about the distribution the data comes from. All the distributions that can be sampled from are described in the documentation. In this exercise we will focus on two distributions:
- Uniform: For example, if your goal is to generate a random number from 1 to 100 where every number is equally likely, you need the uniform distribution. NumPy provides `randint` and `uniform` to sample from uniform distributions.
- Normal: The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, if you need to generate a data sample that represents **Heights of 14 Year Old Girls**, it can be done using the normal distribution. In that case, we need two parameters: the mean (1.51 m) and the standard deviation (0.0741 m). NumPy provides `randn` to sample from the normal distribution (among others).
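The two distributions above can be sketched as follows. This is a hedged illustration, not the expected solution: the seed and sample sizes are arbitrary choices. Note that `randn` draws from the standard normal distribution (mean 0, standard deviation 1), so the sample has to be rescaled, whereas `np.random.normal` accepts the mean and standard deviation directly.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only to make the draws repeatable

# Discrete uniform: integers from 1 to 100 (the upper bound 101 is exclusive).
ints = np.random.randint(1, 101, size=10)

# Continuous uniform on the half-open interval [1, 100).
floats = np.random.uniform(1, 100, size=10)

# Standard normal draws rescaled to the heights example (mean 1.51, std 0.0741).
heights = 1.51 + 0.0741 * np.random.randn(1000)

# Same distribution, passing the parameters directly.
heights_alt = np.random.normal(loc=1.51, scale=0.0741, size=1000)

print(ints)
print(heights.mean(), heights.std())
```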
Let's consider a 2-dimensional array containing grades from the last two exams. Some students missed the first exam, so their grades are replaced with `NaN`.
1. Using `np.where`, create a third column that takes the grade of the first exam if available; otherwise, it uses the grade from the second exam. Add this column as the third column of the array.
**Using a for loop or if/else statement is not allowed in this exercise.**
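A minimal sketch of the `np.where` pattern, using a small hypothetical grades array rather than the exercise's data: the condition is a boolean array built with `np.isnan`, and the two remaining arguments supply the value to take where the condition is true and where it is false.

```python
import numpy as np

# Hypothetical grades: column 0 is the first exam, column 1 the second.
grades = np.array([[7.0, 1.0],
                   [np.nan, 2.0],
                   [np.nan, 8.0],
                   [9.0, 3.0]])

first, second = grades[:, 0], grades[:, 1]

# Where the first grade is missing, fall back to the second one.
third = np.where(np.isnan(first), second, first)

# Append the result as a new column.
result = np.column_stack((grades, third))
print(result)
```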
1. Load the data using `genfromtxt`, specifying the delimiter as ';', and optimize the numpy array size by reducing the data types. Use `np.float32` and verify that the resulting numpy array weighs **76800 bytes**.
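To see how the data type affects memory use, here is a small sketch in which an in-memory string stands in for the data file (the values and the 3 x 4 shape are made up for illustration). `genfromtxt` defaults to 64-bit floats, and the `nbytes` attribute reports the size of the array's data buffer.

```python
import io
import numpy as np

# Hypothetical stand-in for the data file: 3 rows, 4 columns, ';' separated.
text = "7.4;0.7;0.0;1.9\n7.8;0.88;0.0;2.6\n11.2;0.28;0.56;1.9\n"

# Default dtype is float64 (8 bytes per value).
as_float64 = np.genfromtxt(io.StringIO(text), delimiter=";")

# Requesting float32 halves the footprint (4 bytes per value).
as_float32 = np.genfromtxt(io.StringIO(text), delimiter=";", dtype=np.float32)

print(as_float64.nbytes)
print(as_float32.nbytes)
```

The size is simply the number of elements multiplied by the item size, so the check can also be done by hand from `shape` and `itemsize`.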
> _Note: Using `percentile` or `median` may give different results depending on the duplicate values in the column. If you do not have my results please use `percentile`._
**Tip:** The first step is to get the 20th percentile of the column `sulphates`, then create a boolean array that contains `True` if the value is smaller than the 20th percentile, then select these rows in the `quality` column and compute the `mean`.
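The three steps of this tip can be sketched on a tiny made-up array. The column positions are an assumption for illustration: column 0 plays the role of `sulphates` and column 1 the role of `quality`.

```python
import numpy as np

# Hypothetical data: column 0 stands in for sulphates, column 1 for quality.
data = np.array([[0.60, 5], [0.40, 5], [0.85, 7], [0.45, 6], [0.50, 6],
                 [0.55, 5], [0.65, 6], [0.70, 7], [0.75, 6], [0.80, 8]])

sulphates = data[:, 0]
quality = data[:, 1]

# Step 1: 20th percentile of the sulphates column.
threshold = np.percentile(sulphates, 20)

# Step 2: boolean mask of the rows below that percentile.
low_sulphates = sulphates < threshold

# Step 3: mean quality over the selected rows.
mean_quality = quality[low_sulphates].mean()
print(threshold, mean_quality)
```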
**Tip:** This can be done in three steps: Get the max, create a boolean mask that indicates rows with max quality, use this mask to subset the rows with the best quality and compute the mean on the axis 0.
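The same three steps, sketched on a small hypothetical array in which the last column is assumed to hold the quality score:

```python
import numpy as np

# Hypothetical data: the last column stands in for quality.
wine = np.array([[7.0, 0.5, 5.0],
                 [8.0, 0.6, 8.0],
                 [6.0, 0.4, 8.0],
                 [9.0, 0.7, 6.0]])

# Step 1: highest quality value.
best_quality = wine[:, -1].max()

# Step 2: boolean mask of the rows that reach it.
is_best = wine[:, -1] == best_quality

# Step 3: column-wise mean (axis 0) over the selected rows.
best_mean = wine[is_best].mean(axis=0)
print(best_mean)
```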
A Football tournament is underway in your city involving 10 teams. The tournament director seeks an engaging first round and has delegated the pairing decisions to you.
Leveraging your expertise as a former data scientist, you've developed a predictive model based on teams' current season performance. This model forecasts the score difference between any two teams.
The model generates a 2-dimensional array stored in [model_forecasts.txt](data/model_forecasts.txt). Each (i, j) entry in this matrix signifies the predicted score difference between Team i and Team j.
The objective is to determine the pairs that will result in the most interesting matches.
If a team wins 7-1 the match is obviously less exciting than a match where the winner wins 2-1.
The criterion that corresponds to **the pairs that will give the most interesting matches** is **the pairs that minimize the sum of squared differences**.
**Usage of for loop is not allowed, you may need to use the library [itertools](https://docs.python.org/3.9/library/itertools.html) to create permutations.**
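One possible loop-free approach, sketched on a made-up 4-team forecast matrix rather than the contents of `model_forecasts.txt`: enumerate every ordering of the teams with `itertools.permutations`, read consecutive positions as a match, and score all orderings at once with fancy indexing. The matrix values, and the assumption that the matrix is antisymmetric, are illustrative only.

```python
import itertools
import numpy as np

# Hypothetical forecasts: entry (i, j) is the predicted score difference
# between team i and team j.
forecasts = np.array([[ 0,  3,  1,  4],
                      [-3,  0,  2,  1],
                      [-1, -2,  0,  5],
                      [-4, -1, -5,  0]])

n_teams = forecasts.shape[0]

# Every ordering of the teams, one per row: shape (n_teams!, n_teams).
orderings = np.array(list(itertools.permutations(range(n_teams))))

# Positions (0, 1), (2, 3), ... of an ordering form the matches.
home = orderings[:, 0::2]
away = orderings[:, 1::2]

# Sum of squared predicted differences for each ordering.
costs = (forecasts[home, away] ** 2).sum(axis=1)

# Ordering with the smallest cost, reshaped into one pair per row.
best_pairs = orderings[np.argmin(costs)].reshape(-1, 2)
print(best_pairs, costs.min())
```

Many orderings describe the same set of matches, so this enumeration is redundant by design. For 10 teams it covers 10! = 3,628,800 orderings, which is feasible but memory hungry.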