mirror of https://github.com/01-edu/public.git

Forest Prediction

The goal of this project is to use cartographic variables to classify forest categories. You will have to analyse the data, create features, and train a machine learning model on the cartographic data to make it as accurate as possible.

Data

The input files are train.csv, test.csv and covtype.info:

train.csv
test.csv
covtype.info

The train data set is used to analyse the data and calibrate the models. The goal is to get the accuracy as high as possible on the test set. The test set will only be made available at the end of the last day, to prevent overfitting to the test set.

The data is described in covtype.info.

Structure

The structure of the project is:

project
│   README.md
│   environment.yml
│
└───data
│   │   train.csv
│   │   test.csv (not available first day)
│   │   covtype.info
│
└───notebook
│   │   EDA.ipynb
│
└───scripts
│   │   preprocessing_feature_engineering.py
│   │   model_selection.py
│   │   predict.py
│
└───results
    │   plots
    │   test_predictions.csv
    │   best_model.pkl

1. EDA and feature engineering

Create a Jupyter Notebook to analyse the data sets and perform EDA (Exploratory Data Analysis). This notebook will not be evaluated.
Hint: Examples of interesting features
- Distance to hydrology = sqrt((Horizontal_Distance_To_Hydrology)^2 + (Vertical_Distance_To_Hydrology)^2)
- Horizontal_Distance_To_Fire_Points - Horizontal_Distance_To_Roadways
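
The two hinted features can be computed along these lines. This is only a minimal sketch with pandas and NumPy: the function name `add_features` and the new column names `Distance_To_Hydrology` and `Fire_Minus_Roadways` are assumptions made for illustration, and the sample values are made up.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the two hinted features added."""
    out = df.copy()
    # Euclidean distance to hydrology: sqrt(horizontal^2 + vertical^2)
    out["Distance_To_Hydrology"] = np.hypot(
        out["Horizontal_Distance_To_Hydrology"],
        out["Vertical_Distance_To_Hydrology"],
    )
    # Distance to fire points minus distance to roadways
    out["Fire_Minus_Roadways"] = (
        out["Horizontal_Distance_To_Fire_Points"]
        - out["Horizontal_Distance_To_Roadways"]
    )
    return out


# Tiny made-up sample, only to show the call (not real data)
sample = pd.DataFrame(
    {
        "Horizontal_Distance_To_Hydrology": [3, 0],
        "Vertical_Distance_To_Hydrology": [4, -5],
        "Horizontal_Distance_To_Fire_Points": [100, 50],
        "Horizontal_Distance_To_Roadways": [30, 80],
    }
)
features = add_features(sample)
```

Keeping the transformation in one function makes it easier to apply exactly the same steps to test.csv on the last day.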

2. Model Selection

The model selection approach is a key step because it should return the best model and guarantee that the results are reproducible on the final test set. The goal of this step is to make sure that the results on the test set are not due to test set overfitting. This requires splitting the data set as shown below:

DATA
└───TRAIN FILE (0)
│   └───── Train (1)
│   │           Fold0:
│   │                  Train
│   │                  Validation
│   │           Fold1:
│   │                  Train
│   │                  Validation
...
│   │
│   └───── Test (1)
│
└─── TEST FILE (0) (available last day)
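
One possible way to produce this split with scikit-learn is sketched below. The synthetic arrays stand in for the content of train.csv (0), and the 80/20 ratio, the stratification and the random seeds are assumptions, not requirements of the project.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the features and labels of TRAIN FILE (0)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.repeat([1, 2, 3, 4], 50)  # made-up, balanced class labels

# TRAIN FILE (0) -> Train (1) and Test (1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train (1) -> 5 folds, each with its own Train / Validation part
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = [
    (len(train_idx), len(val_idx))
    for train_idx, val_idx in cv.split(X_train, y_train)
]
```

Test (1) stays untouched during cross-validation, so it gives an estimate of how the selected model may behave on TEST FILE (0).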

Rules:

Split train test
Cross validation: at least 5 folds
Grid search on at least 5 different models:
- Gradient Boosting, KNN, Random Forest, SVM, Logistic Regression. Remember that for some models scaling the data is important, and for others it doesn't matter.
Train accuracy score < 0.98 on train set (0). Write the result in the README.md
Test (last day) accuracy > 0.65 on test set (0). Write the result in the README.md
Display the confusion matrix for the best model in a DataFrame. Specify the index and column names (True label and Predicted label)
Plot the learning curve for the best model
Save the trained model as a pickle file
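
The rules above could be wired together roughly as follows. This is a sketch under several assumptions: the data is synthetic, only two of the five required models are shown (the others would be added to `candidates` in the same way), and the parameter grids are illustrative rather than recommended. The scaler is placed inside the pipeline so that it is fitted on the training part of each fold only.

```python
import pickle
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in; the real script would load data/train.csv instead
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# One (pipeline, grid) pair per candidate model; grids are illustrative only
candidates = {
    "logistic_regression": (
        Pipeline(
            [("scale", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]
        ),
        {"model__C": [0.1, 1.0, 10.0]},
    ),
    "random_forest": (  # tree-based, so no scaling step
        Pipeline([("model", RandomForestClassifier(random_state=0))]),
        {"model__n_estimators": [25, 50], "model__max_depth": [5, None]},
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
searches = {}
for name, (pipeline, grid) in candidates.items():
    search = GridSearchCV(pipeline, grid, cv=cv, scoring="accuracy")
    searches[name] = search.fit(X_train, y_train)

# Keep the candidate with the best mean cross-validated accuracy
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_model = searches[best_name].best_estimator_
train_accuracy = best_model.score(X_train, y_train)
test_accuracy = best_model.score(X_test, y_test)

# Save the refitted best pipeline as a pickle file
results_dir = Path("results")
results_dir.mkdir(exist_ok=True)
with open(results_dir / "best_model.pkl", "wb") as f:
    pickle.dump(best_model, f)
```

Because `GridSearchCV` refits the best parameter combination on all of Train (1) by default, `best_estimator_` can be pickled directly.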

Advice: As the grid search takes time, I suggest preparing and testing the code first. Once you are confident it works, run the grid search overnight and analyse the results.

Hint: The confusion matrix shows the misclassifications class by class. Try to detect whether the model frequently confuses one class with another. Then, do some research on the internet on the two forest cover types, find the differences and create some new features that underline these differences. More generally, training a model is an iterative cycle. More details here
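
A confusion matrix with named axes, and the most frequently confused pair of classes, might be obtained as in the sketch below. The labels are made up for illustration; in the project they would be the true and predicted cover types on Test (1).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels, only to show the layout
y_true = [1, 1, 2, 2, 2, 2, 3]
y_pred = [1, 2, 2, 3, 3, 2, 3]
labels = [1, 2, 3]

# Rows are true labels, columns are predicted labels
cm = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=labels),
    index=pd.Index(labels, name="True label"),
    columns=pd.Index(labels, name="Predicted label"),
)

# Largest off-diagonal cell: the (true, predicted) pair confused most often
off_diag = cm.to_numpy().copy()
np.fill_diagonal(off_diag, 0)
row, col = np.unravel_index(off_diag.argmax(), off_diag.shape)
worst_pair = (labels[row], labels[col])
```

Here `worst_pair` points at the two classes worth researching when looking for new discriminating features.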

3. Predict (last day)

Once you have selected the best model and you are confident it will perform well on new data, you're ready to predict on the test set:

Load the trained model
Predict on the test set and compute the accuracy
Save the predictions in a csv file
Add your score to the README.md
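
The steps above could look like the following sketch. The target column name `Cover_Type`, the `prediction` column in the output file, the `predict` function itself and the feature columns `Elevation` and `Slope` in the stand-in data are all assumptions; check covtype.info for the actual names. A tiny stand-in model and test file are created in a temporary directory so that the example runs on its own.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

TARGET = "Cover_Type"  # assumed target column name; check covtype.info


def predict(model_path: Path, test_path: Path, output_path: Path) -> float:
    """Load the pickled model, predict on the test file, save the predictions."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    test = pd.read_csv(test_path)
    X_test, y_test = test.drop(columns=[TARGET]), test[TARGET]
    predictions = model.predict(X_test)
    pd.DataFrame({"prediction": predictions}).to_csv(output_path, index=False)
    return accuracy_score(y_test, predictions)


# Stand-in model and test file (made-up values) so the sketch is runnable
workdir = Path(tempfile.mkdtemp())
data = pd.DataFrame(
    {
        "Elevation": [2000, 2100, 3000, 3100],
        "Slope": [5, 6, 20, 22],
        TARGET: [1, 1, 2, 2],
    }
)
model = DecisionTreeClassifier(random_state=0).fit(
    data.drop(columns=[TARGET]), data[TARGET]
)
with open(workdir / "best_model.pkl", "wb") as f:
    pickle.dump(model, f)
data.to_csv(workdir / "test.csv", index=False)

accuracy = predict(
    workdir / "best_model.pkl",
    workdir / "test.csv",
    workdir / "test_predictions.csv",
)
```

In the real script, any preprocessing and feature engineering applied to the train set would also have to be applied to the test set before calling the model, unless it is already part of the pickled pipeline.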

Files needed for this project

link