
# Forest Cover Type Prediction

The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features and train a machine learning model on the cartographic data to make it as accurate as possible.

### Preliminary

###### Is the structure of the project as shown below?


The expected structure of the project is:

```
project
│   README.md
│   environment.yml
│
├───data
│   │   train.csv
│   │   test.csv (not available first day)
│   │   covtype.info
│
├───notebook
│   │   EDA.ipynb
│
├───scripts
│   │   preprocessing_feature_engineering.py
│   │   model_selection.py
│   │   predict.py
│
└───results
    │   confusion_matrix_heatmap.png
    │   learning_curve_best_model.png
    │   test_predictions.csv
    │   best_model.pkl

```

###### Does the README file contain a description of the project, explain how to run the code from an empty environment, and give a summary of the implementation of each Python file, especially details on the feature engineering, which is a key step?


###### Does the environment file list all the libraries needed to run the code, along with their versions?


## 1. Preprocessing and feature engineering


## 2. Model selection and prediction

### Data splitting

###### Is the data splitting (cross-validation) structured as follows?

```
DATA
├─── TRAIN FILE (0)
│    ├───── Train (1):
│    │           Fold0:
│    │                  Train
│    │                  Validation
│    │           Fold1:
│    │                  Train
│    │                  Validation
│    │           ...
│    │
│    └───── Test (1)
│
└─── TEST FILE (0) (available last day)

```

##### The train set (0) is divided into a train set (1) and a test set (1). The test set (1) ratio is less than 33%.
##### The cross-validation splits the train set (1) into at least 5 folds. A stratified cross-validation is a plus, but it is not a requirement.
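
As an illustration of the two checks above, here is a minimal sketch of such a split using scikit-learn. The synthetic data, the 20% test ratio and the variable names are assumptions made for the example; the project would load `train.csv` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the train file (0); the real project would read
# data/train.csv and separate the target column instead.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Train file (0) -> train (1) / test (1), with a test ratio below 33%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# At least 5 folds on train (1). Stratification is optional; KFold would
# also satisfy the requirement.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(tr), len(va)) for tr, va in cv.split(X_train, y_train)]

test_ratio = len(X_test) / len(X)
print(f"test ratio: {test_ratio:.2f}, folds: {fold_sizes}")
```

The test set (1) is never passed to the cross-validation object, so it stays unseen until the final evaluation.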

### Gridsearch

##### The grid search covers at least these 5 different models: Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.

There are several valid ways to organise it:
- 5 grid searches, one per model
- 1 grid search over the 5 models
- 1 grid search on a pipeline that contains the preprocessing
- 5 grid searches, each on a pipeline that contains the preprocessing
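
To make the third option concrete, here is a minimal sketch of a single grid search on a pipeline whose final step is swapped between the five models. The synthetic data, the `StandardScaler` preprocessing step and the small hyperparameter grids are assumptions chosen to keep the example fast, not recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the train set (1).
X_train, y_train = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# The "clf" step is a placeholder; each grid entry below replaces it.
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# One dict per model: the estimator itself is treated as a hyperparameter.
param_grid = [
    {"clf": [GradientBoostingClassifier(random_state=0)], "clf__n_estimators": [20, 40]},
    {"clf": [KNeighborsClassifier()], "clf__n_neighbors": [3, 7]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [20, 40]},
    {"clf": [SVC()], "clf__C": [0.1, 1.0]},
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0]},
]

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_model = search.best_estimator_
print(type(best_model.named_steps["clf"]).__name__, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means it is refitted on the training part of every fold, which avoids leaking validation data into the scaler.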

### Training

###### Is the **target removed from the X** matrix?
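
A quick way to check this is to look at how X and y are built. The sketch below uses a tiny made-up DataFrame, and the target column name `Cover_Type` is an assumption; use whatever name the project's `train.csv` actually has.

```python
import pandas as pd

# Tiny made-up frame standing in for train.csv.
df = pd.DataFrame(
    {
        "Elevation": [2596, 2590, 2804, 2785],
        "Slope": [3, 2, 9, 18],
        "Cover_Type": [5, 5, 2, 2],
    }
)

TARGET = "Cover_Type"  # assumed target column name

# X keeps every column except the target; y is the target alone.
X = df.drop(columns=[TARGET])
y = df[TARGET]

# If the target leaked into X, the model would score near 100% for free.
assert TARGET not in X.columns
print(list(X.columns))
```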

### Results

##### Run `predict.py` on the test set (last day) and check that the test accuracy is > **0.65**.

##### The train accuracy score is < **0.98**.
This can be checked on the learning curve. If you are not sure, load the model, load the training set (0), and score the model on the training set (0).
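
The manual check can be sketched as follows. To stay self-contained, the example first trains and pickles a small model into a temporary directory; in an audit you would instead open the project's `results/best_model.pkl` and load `data/train.csv`. The synthetic data and the model settings are assumptions.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the training set (0).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.pkl"

    # Stand-in for the student's training script: fit and pickle a model.
    fitted = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
    fitted.fit(X, y)
    with open(model_path, "wb") as f:
        pickle.dump(fitted, f)

    # The audit step itself: load the pickled model back from disk.
    with open(model_path, "rb") as f:
        loaded = pickle.load(f)

# Score the loaded model on the training set and compare to the threshold.
train_accuracy = accuracy_score(y, loaded.predict(X))
is_below_threshold = train_accuracy < 0.98
print(f"train accuracy: {train_accuracy:.3f}, below 0.98: {is_below_threshold}")
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.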

##### The confusion matrix is represented as a DataFrame. Example:
![alt text][confusion_matrix]

[confusion_matrix]: ../images/w2_weekend_confusion_matrix.png "Confusion matrix "
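
For reference, a confusion matrix can be wrapped in a DataFrame along these lines. The labels and predictions are made up for the example, and the `true_`/`pred_` naming is just one possible convention.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for three classes.
y_true = [1, 2, 2, 3, 3, 3, 1, 2]
y_pred = [1, 2, 3, 3, 3, 2, 1, 2]
labels = [1, 2, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"true_{label}" for label in labels],
    columns=[f"pred_{label}" for label in labels],
)
print(cm_df)
```

Passing `labels` explicitly fixes the row and column order, so the DataFrame headers always match the matrix.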

##### The learning curve for the best model is plotted. Example:

![alt text][logo_learning_curve]

[logo_learning_curve]: ../images/w2_weekend_learning_curve.png "Learning curve "

Note: The green line on the plot shows the accuracy on the validation set, not on the test set (1) or the test set (0).
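
A learning curve of this kind can be produced with scikit-learn's `learning_curve` helper, as in the minimal sketch below. The synthetic data, the model and the plot styling are assumptions. Note that the second score array returned by `learning_curve` holds the scores on the cross-validation folds, which is the validation accuracy mentioned in the note above.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the train set (1).
X_train, y_train = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
model = RandomForestClassifier(n_estimators=20, random_state=0)

# val_scores are computed on the held-out fold of each CV split.
sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

fig, ax = plt.subplots()
ax.plot(sizes, train_scores.mean(axis=1), "o-", color="r", label="Training score")
ax.plot(sizes, val_scores.mean(axis=1), "o-", color="g", label="Cross-validation score")
ax.set_xlabel("Training examples")
ax.set_ylabel("Accuracy")
ax.legend(loc="best")

# Written to memory here; the project would save to
# results/learning_curve_best_model.png instead.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```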

###### Is the trained model saved as a pickle file?