The goal of this exercise is to set up the Python work environment with the required libraries.
**Note:** For each quest, your first exercise will be to set up the virtual environment with the required libraries.
I recommend using:
- the **latest stable version** of Python.
- the virtual environment you're the most comfortable with. `virtualenv` and `conda` are the most used in Data Science.
- one of the most recent versions of the required libraries.
1. Create a virtual environment named `ex00` with a version of Python >= `3.8` and the following libraries: `pandas`, `numpy`, `jupyter`, `matplotlib` and `scikit-learn`.
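Once the environment is created and activated, a quick sanity check (a minimal sketch; any equivalent check works) is to print the installed versions from within Python:

```python
# Run inside the activated ex00 environment to confirm the setup.
import sys

import matplotlib
import numpy as np
import pandas as pd
import sklearn
import jupyter  # presence check only

print("Python:", sys.version.split()[0])
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("matplotlib:", matplotlib.__version__)
print("scikit-learn:", sklearn.__version__)
```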
---
# Exercise 1: K-Fold
```python
import numpy as np

y = np.array(np.arange(1,11))
```
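A minimal sketch of iterating over the folds, assuming scikit-learn's `KFold` (the feature matrix `X` below is hypothetical, built only to match `y`):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1, 21).reshape(10, -1)  # hypothetical 10x2 feature matrix
y = np.array(np.arange(1, 11))

kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    # each fold holds out 2 of the 10 samples
    print(f"Fold {fold}: TRAIN {train_index} TEST {test_index}")
```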
---
# Exercise 2: Cross validation (k-fold)
The goal of this exercise is to learn how to use cross validation.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the cross validation, that is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
```

Expected output:

```
Mean of scores on validation sets:
...
Standard deviation of scores on validation sets:
0.0214983822773466
```
**Note: It may be confusing that the key of the dictionary that returns the results on the validation sets is `test_score`. Validation sets are sometimes called test sets. Here, the cross validation is run on `X_train`, which means the scores are computed on subsets of the initial train set; `X_test` is not used for the cross validation.**
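A minimal sketch of the expected flow (the 5-fold setup and the `random_state` are assumptions; the printed labels match the expected output above):

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate

X, y = fetch_california_housing(return_X_y=True)
# 10% of the data kept aside as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

# cross validation on the train set only; X_test is never seen
cv_results = cross_validate(LinearRegression(), X_train, y_train, cv=5)
print("Mean of scores on validation sets:")
print(np.mean(cv_results["test_score"]))
print("Standard deviation of scores on validation sets:")
print(np.std(cv_results["test_score"]))
```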
---

# Exercise 3: GridSearchCV
The goal of this exercise is to learn to use `GridSearchCV` to run a grid search, predict on the test set and score on the test set.
Preliminary:
- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the grid search, that is why the code to fit the Linear Regression is given.*
```python
# imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
```
1. Run `GridSearchCV` with the following settings:
- Use all CPUs and 5-fold cross-validation.
- Scoring metric: MSE (Mean Squared Error).
- Model: Random Forest.

Hyperparameters to search:
- `max_depth`: between 1 and 20 (at least 3 values)
- `n_estimators`: between 1 and 100 (at least 3 values)

This may take a few minutes to run. A possible setup is sketched below.

*Hint*: The name of the metric to put in the parameter `scoring` is `neg_mean_squared_error`. The smaller the MSE, the better the model; on the contrary, the greater the R2, the better the model. `GridSearchCV` chooses the best model by selecting the one that maximizes the score on the validation sets, and in mathematics, maximizing a function or minimizing its opposite is equivalent. More details:
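A minimal sketch, assuming `RandomForestRegressor` and hypothetical grids that satisfy the constraints above (`X_train` and `y_train` come from the preliminary split):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# hypothetical grids: at least 3 values within the required ranges
param_grid = {
    "max_depth": [5, 10, 20],
    "n_estimators": [10, 50, 100],
}

gridsearch = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    cv=5,                              # 5 folds
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes, hence the negated MSE
    n_jobs=-1,                         # all CPUs
)
gridsearch.fit(X_train, y_train)
```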
2. Extract the best fitted estimator, print its parameters, its score on the validation set, and display `cv_results_`.
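For instance, continuing the sketch above:

```python
print(gridsearch.best_estimator_)  # best fitted estimator
print(gridsearch.best_params_)     # its parameters
print(gridsearch.best_score_)      # mean score on the validation sets (negated MSE)
print(gridsearch.cv_results_)      # detailed results for every parameter combination
```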
3. Compute the score on the test set.
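One way to do it, following the same sketch:

```python
from sklearn.metrics import mean_squared_error

# predict() uses the best estimator refitted on the whole train set
y_pred = gridsearch.predict(X_test)
print(mean_squared_error(y_test, y_pred))
```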
**WARNING: If the score used in classification is the AUC, there is one rare case where the AUC may return an error or a warning: when a fold contains only one class. In that case, the AUC cannot be computed, by definition.**
---
# Exercise 4: Validation curve and Learning curve
The goal of this exercise is to learn to analyze the model's performance with two tools:
- Validation curve
- Learning curve
```python
# imports
from sklearn.datasets import make_classification

# classification data set used in this exercise
X, y = make_classification(n_samples=100000)
```
1. Plot the validation curve, using all CPUs, with 5 folds. The goal is to focus again on `max_depth` between 1 and 20. You may need to increase the window (for example: between 1 and 50) if you notice that other values of `max_depth` could have returned better results. This may take a few minutes.
I do not expect you to implement the whole plot from scratch; you'd better leverage the code here:
The plot should look like this:
The interpretation is that from `max_depth=10`, the train score keeps increasing while the test score (or validation score) reaches a plateau. It means that choosing `max_depth=20` may lead to an overfitted model.
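A minimal sketch of computing the data behind such a curve, assuming `RandomForestClassifier` and scikit-learn's `validation_curve` (the plotting itself is left to the code linked above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

max_depths = np.arange(1, 21)
train_scores, test_scores = validation_curve(
    RandomForestClassifier(),
    X, y,
    param_name="max_depth",
    param_range=max_depths,
    cv=5,       # 5 folds
    n_jobs=-1,  # all CPUs
)
# average over the folds before plotting train vs validation scores
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```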
2. Let us assume the grid search returned `clf = RandomForestClassifier(max_depth=12)`. Let's check whether the model underfits, overfits, or fits correctly. Plot the learning curve. These two resources will help you a lot to understand how to analyze learning curves and how to plot them:
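A minimal sketch for the learning curve data, assuming scikit-learn's `learning_curve` (the train-size fractions below are an assumption):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

clf = RandomForestClassifier(max_depth=12)
train_sizes, train_scores, test_scores = learning_curve(
    clf,
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # hypothetical fractions of the train set
    cv=5,
    n_jobs=-1,
)
# a persistent gap between train and validation scores suggests overfitting;
# two low, converged plateaus suggest underfitting
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```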