mirror of https://github.com/01-edu/public.git

Credit scoring

Preliminary

project
│   README.md
│   environment.yml
│
├───data
│   │   ...
│
├───results
│   │
│   ├───model (free format)
│   │   │   my_own_model.pkl
│   │   │   model_report.txt
│   │
│   ├───feature_engineering
│   │   │   EDA.ipynb
│   │
│   ├───clients_outputs
│   │   │   client1_correct_train.pdf  (free format)
│   │   │   client2_wrong_train.pdf  (free format)
│   │   │   client_test.pdf   (free format)
│   │
│   └───dashboard (optional)
│   │   │   dashboard.py  (free format)
│   │   │   ...
│
└───scripts (free format)
│   │   train.py
│   │   predict.py
│   │   preprocess.py

Is the structure of the project as above?

Does the readme file introduce the project, summarize how to run the code and show the username?

Does the environment file list all the libraries used, with the versions needed to run the code?

Does `EDA.ipynb` explain the exploratory data analysis in detail?

Machine learning model

Is the model trained only on the training set?
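One common way the test set leaks into training is through preprocessing statistics. The sketch below is only an illustration with a hypothetical toy column (`train_income`, `test_income`): the scaling parameters are learned from the training values and then reused unchanged on the test values.

```python
from statistics import mean, pstdev


def fit_scaler(train_values):
    """Learn centring and scaling parameters from the training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant column
    return mu, sigma


def transform(values, mu, sigma):
    """Apply parameters learned on the training set to any split."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy column, already split into train and test.
train_income = [20.0, 30.0, 40.0, 50.0]
test_income = [35.0, 80.0]

mu, sigma = fit_scaler(train_income)           # the test set is never seen here
train_scaled = transform(train_income, mu, sigma)
test_scaled = transform(test_income, mu, sigma)
```

The same rule applies to any fitted step (imputation, encoding, feature selection): fit on the training set, then only transform the test set.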

Is the AUC on the test set higher than 75%?

Do the model's learning curves show that the model is not overfitting?

Has the training been stopped early enough to avoid overfitting?
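As a rough illustration of what these two questions look for, the sketch below picks a stopping iteration from a validation curve using a patience rule. The scores are made-up numbers, and `best_iteration` and `patience` are hypothetical names, not part of the project.

```python
def best_iteration(val_scores, patience=3):
    """Return the index of the best validation score, stopping once the
    score has not improved for `patience` consecutive iterations."""
    best_idx = 0
    best_score = float("-inf")
    for i, score in enumerate(val_scores):
        if score > best_score:
            best_idx, best_score = i, score
        elif i - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_idx


# Made-up learning curves: training AUC keeps rising, validation AUC peaks.
train_auc = [0.70, 0.75, 0.80, 0.84, 0.88, 0.91, 0.94, 0.96]
val_auc = [0.69, 0.73, 0.76, 0.77, 0.768, 0.765, 0.76, 0.75]

stop_at = best_iteration(val_auc, patience=3)
gap = train_auc[stop_at] - val_auc[stop_at]
print(f"stop at iteration {stop_at}, train/validation gap {gap:.2f}")
```

A widening gap between the two curves after the chosen iteration is the usual sign of overfitting; a model kept at the last iteration here would score higher on the training set and lower on validation.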

Does the text document `model_report.txt` describe the methodology used to train the machine learning model?

Does `predict.py` run without any error and return the following?

    python predict.py

    AUC on test set: 0.76
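For reference, the AUC can be read as the probability that a randomly chosen positive client is scored above a randomly chosen negative one. The sketch below computes it that way on made-up labels and scores; it is not the project's `predict.py`, and a real script would more likely call `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive is scored above a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up test labels and predicted scores.
y_test = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc = roc_auc(y_test, scores)
print(f"AUC on test set: {auc:.2f}")
```

The pairwise loop is quadratic in the number of clients, so it is only suitable for small illustrations.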

This article gives a complete example of a good modelling approach.

Model's interpretability

Feature importance:

Is the importance of every feature used by the model computed and shown in a visualisation?

Is the mapping between the feature importances and the feature names correct? You should be careful here to associate each variable with its own feature importance. Sometimes the preprocessing pipeline removes features, for instance during the feature selection step.
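A minimal sketch of the mapping check, assuming the selection step exposes a boolean mask of kept columns (in scikit-learn, selectors provide one through `get_support()`). The feature names, mask and importances below are hypothetical.

```python
def named_importances(feature_names, kept_mask, importances):
    """Pair each importance with the name of the feature that survived
    selection, sorted from most to least important."""
    kept_names = [name for name, keep in zip(feature_names, kept_mask) if keep]
    if len(kept_names) != len(importances):
        raise ValueError("number of kept features and importances differ")
    return sorted(zip(kept_names, importances), key=lambda p: p[1], reverse=True)


# Hypothetical inputs: "income" was dropped by the selection step.
feature_names = ["age", "income", "loan_amount", "employment_length"]
kept_mask = [True, False, True, True]
importances = [0.2, 0.5, 0.3]  # one value per kept feature, in kept order

ranking = named_importances(feature_names, kept_mask, importances)
for name, value in ranking:
    print(f"{name}: {value}")
```

Pairing the importances with the original column list instead would silently assign the value 0.5 to `income`, which the model never saw.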

Descriptive variables:

These are important to understand, for example, the age of the client. The data may be scaled or otherwise modified in the preprocessing pipeline, but the data visualised here should be "raw". Are the visualisations computed for the 3 clients?

- Visualisations that show at least 10 variables describing the client and their loan(s).
- Visualisations that show the comparison between this client and other clients.
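One simple way to compare a client with the others is a percentile rank on the raw (unscaled) value. The sketch below uses made-up ages; `percentile_rank` is a hypothetical helper, and a plot such as a histogram with the client marked on it would convey the same information.

```python
def percentile_rank(value, population):
    """Percentage of the population with a value less than or equal to `value`."""
    if not population:
        raise ValueError("population must not be empty")
    return 100.0 * sum(v <= value for v in population) / len(population)


# Made-up raw ages for all clients, including the selected client.
ages = [23, 31, 35, 42, 47, 52, 58, 64]
client_age = 47

rank = percentile_rank(client_age, ages)
print(f"client is older than or as old as {rank:.1f}% of clients")
```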

SHAP values on the model are displayed through a summary plot that shows the important features and their impact on the target. This is optional if you have already computed the feature importance.

Are the 3 clients selected as expected? 2 clients from the train set (1 on which the model is correct and 1 on which the model is wrong) and 1 client from the test set.
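The selection rule can be sketched as follows, with hypothetical client identifiers, labels and predicted classes. Taking the first matching client is an arbitrary choice made for the example.

```python
def pick_clients(train_ids, y_train, train_pred, test_ids):
    """Return one correctly predicted and one wrongly predicted training
    client, plus one client from the test set."""
    rows = list(zip(train_ids, y_train, train_pred))
    correct = next((i for i, y, p in rows if y == p), None)
    wrong = next((i for i, y, p in rows if y != p), None)
    if correct is None or wrong is None or not test_ids:
        raise ValueError(
            "need a correct and a wrong training prediction and a test client"
        )
    return {"correct_train": correct, "wrong_train": wrong, "test": test_ids[0]}


# Hypothetical identifiers, true labels and predicted classes.
train_ids = [101, 102, 103, 104]
y_train = [0, 1, 0, 1]
train_pred = [0, 0, 0, 1]
test_ids = [201, 202]

selected = pick_clients(train_ids, y_train, train_pred, test_ids)
print(selected)
```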

SHAP values on predictions are computed for the 3 clients. The force plot shows which variables contribute the most to the score. Does the score output by the force plot correspond to the one output by the model?
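The last check relies on the additivity of SHAP values: the base value plus the sum of a client's SHAP values should equal the model output being explained. The sketch below uses made-up numbers and assumes the explanation is in log-odds space, which is often the case for tree-based classifiers, so a sigmoid is needed before comparing with a predicted probability. Whether that assumption holds depends on the model and explainer actually used.

```python
import math


def sigmoid(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def force_plot_output(base_value, shap_values):
    """Score shown by a force plot: base value plus all contributions."""
    return base_value + sum(shap_values)


def scores_match(base_value, shap_values, model_output, tol=1e-6):
    """True when the explained score equals the model output within `tol`."""
    explained = force_plot_output(base_value, shap_values)
    return math.isclose(explained, model_output, abs_tol=tol)


# Made-up values for one client, all in log-odds space (an assumption).
base_value = -1.2
client_shap = [0.8, -0.3, 0.5, 0.1]
model_margin = -0.1  # hypothetical raw model output for this client

ok = scores_match(base_value, client_shap, model_margin)
probability = sigmoid(force_plot_output(base_value, client_shap))
print(f"scores match: {ok}, probability: {probability:.3f}")
```

If the two scores differ, a frequent cause is comparing a log-odds explanation with a probability, or explaining preprocessed features with a model fitted on different columns.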
