mirror of https://github.com/01-edu/public.git
Credit scoring
Preliminary
project
│   README.md
│   environment.yml
│
└───data
│   │   ...
│
└───results
│   │
│   ├───model (free format)
│   │   │   my_own_model.pkl
│   │   │   model_report.txt
│   │
│   ├───feature_engineering
│   │   │   EDA.ipynb
│   │
│   ├───clients_outputs
│   │   │   client1_correct_train.pdf (free format)
│   │   │   client2_wrong_train.pdf (free format)
│   │   │   client_test.pdf (free format)
│   │
│   └───dashboard (optional)
│   │   │   dashboard.py (free format)
│   │   │   ...
│
└───scripts (free format)
│   │   train.py
│   │   predict.py
│   │   preprocess.py
Is the structure of the project as above?
Does the readme file introduce the project, summarize how to run the code and show the username?
Does the environment file list all the libraries used, with the versions necessary to run the code?
Does the EDA.ipynb notebook explain the exploratory data analysis in detail?
Machine learning model
Is the model trained only on the training set?
Is the AUC on the test set higher than 75%?
Do the model's learning curves show that the model is not overfitting?
Has the training been stopped early enough to avoid overfitting?
Does the text document model_report.txt describe the methodology used to train the machine learning model?
Does predict.py run without any error and return the following?
python predict.py
AUC on test set: 0.76
This article gives a complete example of a good modelling approach.
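As an illustration of the AUC check above, here is a minimal sketch of how the metric could be computed and printed in the expected format. It is not the project's actual predict.py: the function name `auc_score` and the toy labels and scores are assumptions made for this example, and a real script would more likely rely on a library routine such as scikit-learn's `roc_auc_score` applied to the held-out test set.

```python
def auc_score(y_true, y_score):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    y_true holds 0/1 labels and y_score the predicted scores.
    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data standing in for test-set labels and model scores (hypothetical).
    y_test = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(f"AUC on test set: {auc_score(y_test, scores):.2f}")
```

The pairwise form is quadratic in the number of samples, so it is only meant to make the definition explicit. The important point for the audit is that the labels and scores come from data the model never saw during training.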
Model's interpretability
Feature importance:
Is the importance of all features used by the model computed and shown in a visualisation?
Is the mapping between the feature importances and the feature names correct? Be careful to associate each variable with its own importance: the preprocessing pipeline can remove some features, for instance during the feature selection step, so the list of names must be filtered in the same way.
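The mapping problem described above can be sketched as follows. This is an illustration under stated assumptions, not the expected implementation: the helper `map_importances`, the feature names and the importance values are all hypothetical. The boolean mask represents which input columns survived feature selection. With a scikit-learn selector, such a mask is typically obtained from its `get_support()` method.

```python
def map_importances(feature_names, kept_mask, importances):
    """Pair each importance with the name of the feature it belongs to.

    feature_names: names of all columns before feature selection.
    kept_mask:     one boolean per column, True if the column was kept.
    importances:   one value per kept column, in the model's column order.
    """
    if len(feature_names) != len(kept_mask):
        raise ValueError("one mask entry is needed per original feature")
    kept_names = [name for name, kept in zip(feature_names, kept_mask) if kept]
    if len(kept_names) != len(importances):
        raise ValueError("number of kept features and importances differ")
    # Sort from most to least important, ready for a bar chart.
    return sorted(zip(kept_names, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical example: "income" was dropped by the selection step.
names = ["age", "income", "loan_amount", "n_children"]
mask = [True, False, True, True]
values = [0.2, 0.5, 0.3]
ranked = map_importances(names, mask, values)
# ranked == [("loan_amount", 0.5), ("n_children", 0.3), ("age", 0.2)]
```

Zipping the importances directly with the original list of names would silently assign the value 0.5 to "income", which is exactly the mistake this audit question targets.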
Descriptive variables:
These are important to understand, for example, the age of the client. The data may be scaled or modified in the preprocessing pipeline, but the data visualised here should be "raw". Are the visualisations computed for the 3 clients?
- Visualisations that show at least 10 variables describing the client and its loan(s).
- Visualisations that show the comparison between this client and other clients.
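One possible way to prepare the comparison between a client and the other clients is to compute, for each raw variable, the percentage of clients whose value does not exceed the selected client's value. The sketch below assumes this approach. The function name `percentile_rank` and the ages are made up for the example, and the plotting itself is left to the library of your choice.

```python
def percentile_rank(all_values, client_value):
    """Percentage of clients whose raw value is <= the client's value."""
    if not all_values:
        raise ValueError("at least one value is needed for the comparison")
    below_or_equal = sum(1 for v in all_values if v <= client_value)
    return 100.0 * below_or_equal / len(all_values)


# Hypothetical raw (unscaled) ages of all clients, and the selected client's age.
ages = [20, 30, 40, 50]
client_age = 40
rank = percentile_rank(ages, client_age)
# rank == 75.0: the client is at least as old as 75% of the clients.
```

Computing the rank on raw values rather than on the scaled features keeps the figure readable for a non-technical audience, in line with the requirement above.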