# Forest Cover Type Prediction
The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features, and train a machine learning model on the cartographic data to make it as accurate as possible.
### Preliminary
###### Is the structure of the project as below?
The expected structure of the project is:
```
project
│   README.md
│   environment.yml
│
└───data
│   │   train.csv
│   │   test.csv (not available first day)
│   │   covtype.info
│
└───notebook
│   │   EDA.ipynb
│
└───scripts
│   │   preprocessing_feature_engineering.py
│   │   model_selection.py
│   │   predict.py
│
└───results
    │   confusion_matrix_heatmap.png
    │   learning_curve_best_model.png
    │   test_predictions.csv
    │   best_model.pkl
```
###### Does the README file contain a description of the project, explain how to run the code from an empty environment, and give a summary of the implementation of each Python file, especially details on the feature engineering, which is a key step?
###### Does the environment file contain all the libraries used, with the versions necessary to run the code?
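For reference, a minimal `environment.yml` sketch; the library set and version pins are illustrative assumptions, not the required list:

```yaml
name: forest-cover
channels:
  - conda-forge
dependencies:
  - python=3.10       # assumed interpreter version
  - numpy=1.24
  - pandas=2.0
  - scikit-learn=1.3
  - matplotlib=3.7
  - jupyter
```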
### 1. Preprocessing and feature engineering
## 2. Model selection and prediction
### Data splitting
###### Is the data splitting (cross-validation) structured as follows?
```
DATA
└───TRAIN FILE (0)
│   └───── Train (1):
│   │        Fold0:
│   │            Train
│   │            Validation
│   │        Fold1:
│   │            Train
│   │            Validation
│   │        ...
│   │
│   └───── Test (1)
│
└───TEST FILE (0) (available last day)
```
##### The train set (0) is divided into a train set (1) and a test set (1). The test set (1) ratio is less than 33%.
##### The cross-validation splits the train set (1) into at least 5 folds. If the cross-validation is stratified, that is a plus, but it is not a requirement.
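As an illustration, a minimal splitting sketch with scikit-learn; `Cover_Type` is the target column of the covtype dataset, and the 20% test ratio is an assumption that satisfies the requirement above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold

# Load the train file (0) and separate features from the target.
df = pd.read_csv("data/train.csv")
X = df.drop(columns=["Cover_Type"])
y = df["Cover_Type"]

# Split the train file (0) into train (1) and test (1); 20% < 33%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Cross-validation on the train set (1) with at least 5 folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    print(f"Fold{fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```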
### Gridsearch
##### It contains at least these 5 different models: Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression.
There are many options:

- 5 grid searches on 1 model
- 1 grid search on 5 models
- 1 grid search on a pipeline that contains the preprocessing
- 5 grid searches on a pipeline that contains the preprocessing
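For example, a minimal sketch of one grid search on a pipeline that contains the preprocessing; the pipeline steps and parameter grid are illustrative assumptions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Putting the preprocessing in the pipeline means it is refit on each fold,
# which avoids leaking validation data into the scaler.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "model__n_estimators": [100, 300],
    "model__max_depth": [None, 20],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)  # X_train/y_train from the splitting sketch above
print(search.best_params_, search.best_score_)
```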
### Training
###### Is the **target removed from the X** matrix?
### Results
##### Run predict.py on the test set and check that the test (last day) accuracy is > **0.65**.
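A minimal sketch of what `predict.py` might look like; the paths, and the assumption that `test.csv` contains only the feature columns, are illustrative:

```python
import pickle
import pandas as pd

# Load the saved best model and predict on the test file (0).
with open("results/best_model.pkl", "rb") as f:
    model = pickle.load(f)

X_test_file = pd.read_csv("data/test.csv")  # assumed: feature columns only
predictions = pd.DataFrame({"Cover_Type": model.predict(X_test_file)})
predictions.to_csv("results/test_predictions.csv", index=False)
```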
##### Train accuracy score < **0.98**.
This can be checked on the learning curve. If you are not sure, load the model, load the training set (0), and score the model on the training set (0).
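A minimal sketch of that check, assuming the paths from the expected structure:

```python
import pickle
import pandas as pd

# Load the saved model, reload the training set (0), and score on it.
with open("results/best_model.pkl", "rb") as f:
    model = pickle.load(f)

df = pd.read_csv("data/train.csv")
X, y = df.drop(columns=["Cover_Type"]), df["Cover_Type"]
print("Train (0) accuracy:", model.score(X, y))  # should be < 0.98
```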
##### The confusion matrix is represented as a DataFrame. Example:
![alt text][confusion_matrix]

[confusion_matrix]: ../images/w2_weekend_confusion_matrix.png "Confusion matrix"
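A minimal sketch of building the confusion matrix as a DataFrame; `search`, `X_test` and `y_test` come from the sketches above, and the row/column names are illustrative:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Predictions of the best model on the test set (1).
y_pred = search.predict(X_test)
labels = sorted(y_test.unique())
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=labels),
    index=[f"true_{l}" for l in labels],
    columns=[f"pred_{l}" for l in labels],
)
print(cm)
```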
##### The learning curve for the best model is plotted. Example:
![alt text][logo_learning_curve]

[logo_learning_curve]: ../images/w2_weekend_learning_curve.png "Learning curve"
Note: the green line on the plot shows the accuracy on the validation set, not on the test set (1) or the test set (0).
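A minimal plotting sketch with scikit-learn's `learning_curve`; the training sizes and output path are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

# Train/validation accuracy for increasing training set sizes.
sizes, train_scores, val_scores = learning_curve(
    search.best_estimator_, X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy", n_jobs=-1,
)

plt.plot(sizes, train_scores.mean(axis=1), "o-", color="r", label="Training accuracy")
plt.plot(sizes, val_scores.mean(axis=1), "o-", color="g", label="Validation accuracy")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.savefig("results/learning_curve_best_model.png")
```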
###### Is the trained model saved as a pickle file?
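For example, assuming the best fitted model from the grid search above:

```python
import pickle

# Save the best fitted model to the results folder.
with open("results/best_model.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```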