The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment tool you're most comfortable with. `virtualenv` and `conda` are the most widely used in data science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00`, with a version of Python >= `3.8`, with the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
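Once the environment is created, a quick check like the one below can confirm that the interpreter and the libraries match the requirements. This is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of the exercise.

```python
import sys
from importlib import metadata

# Package names as listed in the exercise (the names used with pip).
REQUIRED = ["pandas", "numpy", "jupyter", "matplotlib", "scikit-learn"]


def check_environment(required=REQUIRED, min_python=(3, 8)):
    """Return a dict mapping each package to its installed version (or None)."""
    if sys.version_info < min_python:
        raise RuntimeError(f"Python >= {min_python[0]}.{min_python[1]} is required")
    versions = {}
    for name in required:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in check_environment().items():
        print(f"{name}: {version or 'MISSING'}")
```

Run it with the interpreter of the `ex00` environment; any line showing `MISSING` points to a library that still needs to be installed.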
---
---
# Exercise 1: K-Fold
1. Using `KFold`, perform a 5-fold cross validation. For each fold, print the train index and test index. The expected output is:
```console
Fold: 1
TRAIN: [2 3 4 5 6 7 8 9] TEST: [0 1]
Fold: 2
TRAIN: [0 1 4 5 6 7 8 9] TEST: [2 3]
Fold: 3
TRAIN: [0 1 2 3 6 7 8 9] TEST: [4 5]
Fold: 4
TRAIN: [0 1 2 3 4 5 8 9] TEST: [6 7]
Fold: 5
TRAIN: [0 1 2 3 4 5 6 7] TEST: [8 9]
```
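To see where these indices come from, here is a dependency-free sketch of the splitting logic for the unshuffled case. It is not scikit-learn's implementation, and `kfold_indices` is a hypothetical helper; the exercise itself should still be solved with `KFold`.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


for fold, (train, test) in enumerate(kfold_indices(10, 5), start=1):
    print(f"Fold: {fold}")
    print(f"TRAIN: {train} TEST: {test}")
```

The indices match the expected output, but Python lists print with commas, whereas the NumPy arrays returned by `KFold` print without them.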
---
---
# Exercise 2: Cross validation (k-fold)
The goal of this exercise is to learn how to use cross validation.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. _The goal is to focus on the cross validation, which is why the code to fit the linear regression is given._
```python
# imports
```
```console
Standard deviation of scores on validation sets:
0.0214983822773466
```
**Note: It may be confusing that the key of the dictionary that returns the results on the validation sets is `test_score`. Sometimes, the validation sets are called test sets. Here, we run the cross validation on `X_train`, which means that the scores are computed on subsets of the initial train set. `X_test` is not used for the cross-validation.**
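The point of the note can be checked with indices alone. The sketch below is an illustration only, not scikit-learn code: the helpers `train_test_indices` and `validation_folds` are hypothetical, and the strided folds are a simplification of what a real splitter does.

```python
import random


def train_test_indices(n_samples, test_fraction, seed=0):
    """Shuffle indices and hold out a fraction as the final test set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def validation_folds(train_indices, n_splits):
    """Split the training indices into n_splits validation folds."""
    return [train_indices[i::n_splits] for i in range(n_splits)]


train_idx, test_idx = train_test_indices(n_samples=100, test_fraction=0.1)
folds = validation_folds(train_idx, n_splits=5)

# Every validation fold is drawn from the training indices only,
# so the held-out test indices never appear in cross-validation.
assert all(set(fold).isdisjoint(test_idx) for fold in folds)
print(len(train_idx), len(test_idx), [len(fold) for fold in folds])
```

Each "test" fold reported by the cross-validation is therefore a validation fold taken from the train set, while the 10% test set stays untouched until the final evaluation.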
The goal of this exercise is to learn to use `GridSearchCV` to run a grid search, predict on the test set, and score on the test set.
Preliminary:
- Import the California Housing data set, split it into a train set and a test set (10%), and fit a linear regression on the data set. _The goal is to focus on the grid search, which is why the code to fit the linear regression is given._
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
```
1. Run `GridSearchCV` with the following settings:

- Use all CPUs and perform 5-fold cross-validation.
- Scoring metric: MSE (mean squared error).
- Model: Random Forest.

Hyperparameters to search:

- `max_depth`: between 1 and 20 (at least 3 values)
- `n_estimators`: between 1 and 100 (at least 3 values)

This may take a few minutes to run.

_Hint_: The name of the metric to pass to the `scoring` parameter is `neg_mean_squared_error`. The smaller the MSE, the better the model; conversely, the greater the R2, the better the model. `GridSearchCV` chooses the best model by selecting the one that maximizes the score on the validation sets, and mathematically, minimizing a function is equivalent to maximizing its opposite. More details:
2. Extract the best fitted estimator, print its parameters, print its score on the validation set, and print `cv_results_`.
3. Compute the score on the test set.
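The reasoning behind `neg_mean_squared_error` can be checked by hand. The sketch below uses made-up targets and predictions from three hypothetical candidates (they are not results from this exercise) and shows that picking the smallest MSE and picking the largest negated MSE select the same candidate.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions, for illustration only.
y_true = [3.0, 5.0, 7.0]
candidates = {
    "model_a": [2.0, 5.0, 9.0],
    "model_b": [3.0, 5.5, 7.0],
    "model_c": [0.0, 0.0, 0.0],
}

mse_scores = {name: mse(y_true, pred) for name, pred in candidates.items()}
neg_mse_scores = {name: -score for name, score in mse_scores.items()}

best_by_min_mse = min(mse_scores, key=mse_scores.get)
best_by_max_neg_mse = max(neg_mse_scores, key=neg_mse_scores.get)

# Both selection rules pick the same candidate.
assert best_by_min_mse == best_by_max_neg_mse
print(best_by_min_mse)
```

This is why a search that always maximizes its score can still be used with an error metric: the error is simply negated first, so the reported scores are negative numbers.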
**WARNING: If the score used in classification is the AUC, there is one rare case where it may raise an error or a warning: a fold contains only one class. In that case, the AUC cannot be computed, by definition.**
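To make the "by definition" part concrete, the AUC can be read as the fraction of (positive, negative) pairs that the scores rank correctly. The sketch below is a plain-Python illustration of that definition, not scikit-learn's implementation; `roc_auc` is a hypothetical helper and the scores are made up.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        # With a single class there are no pairs to compare.
        raise ValueError("AUC is undefined when only one class is present")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))

try:
    roc_auc([1, 1, 1], [0.2, 0.6, 0.9])  # a fold with only one class
except ValueError as error:
    print(error)
```

When a fold has no negative (or no positive) samples, there are no pairs to count, so the ratio has no meaning. Stratified splitting is one common way to reduce the chance of such folds.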
---
---
# Exercise 4: Validation curve and Learning curve
The goal of this exercise is to learn to analyze the model's performance with two tools:
- Validation curve
- Learning curve
```python
X, y = make_classification(n_samples=100000,
```
1. Plot the validation curve, using all CPUs, with 5 folds. The goal is to focus again on `max_depth` between 1 and 20.
You may need to widen the range (for example, between 1 and 50) if you notice that other values of `max_depth` could have returned better results. This may take a few minutes.
I do not expect you to implement the whole plot from scratch; you can build on the code here:
The interpretation is that from `max_depth=10` onward, the train score keeps increasing but the test score (or validation score) reaches a plateau. This means that choosing `max_depth=20` may lead to an overfitted model.
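The same reading can be done numerically once the curve values are available. The sketch below uses made-up scores (not results from this exercise), a hypothetical helper `plateau_start`, and an arbitrary `tolerance`; it finds the first depth after which the validation score no longer improves noticeably, then reports the train/validation gap at the largest depth.

```python
def plateau_start(param_values, validation_scores, tolerance=0.005):
    """Return the first parameter value after which no later
    validation score improves on it by more than `tolerance`."""
    for i, value in enumerate(param_values):
        later = validation_scores[i + 1:]
        if all(score - validation_scores[i] <= tolerance for score in later):
            return value


# Made-up scores for illustration only.
depths = [1, 5, 10, 15, 20]
train_scores = [0.70, 0.85, 0.93, 0.98, 1.00]
validation_scores = [0.69, 0.82, 0.88, 0.881, 0.879]

depth = plateau_start(depths, validation_scores)
gap_at_max = train_scores[-1] - validation_scores[-1]
print(depth, round(gap_at_max, 3))
```

A growing gap between the train score and a flat validation score is the numerical counterpart of the overfitting pattern described above.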
2. Let us assume the grid search returned `clf = RandomForestClassifier(max_depth=12)`. Let's check whether the model underfits, overfits, or fits correctly. Plot the learning curve. These two resources will help you understand how to analyze learning curves and how to plot them: