### Overview
The goal of this project is to implement a scoring model, based on various sources of data ([check the data documentation](./readme_data.md)), that returns the probability of default. In a nutshell, a credit score is an evaluation of how well the bank's customer can, and is willing to, pay off debt. You are also required to provide an explanation of the score. For example, your model returns that the probability that a client doesn't pay back the loan is very high (90%), and the reason behind it is that variable_xxx, which represents the ability to pay back past loans, is low. This output interpretability will appear in a visualization.
### Role play
Hey there, future credit scoring expert! Ready to dive into the exciting world of predicting loan defaults? You're in for a treat! This project is all about building a nifty model that can help figure out how likely someone is to pay back their loan. Cool, right?
### Learning Objective
Understanding the underlying factors of a credit score is important. Credit scoring is subject to more and more regulation, so transparency is key. More generally, more and more companies prefer transparent models to black-box models.
Historical timeline of machine learning techniques applied to credit scoring:
- [Machine Learning or Econometrics for Credit Scoring: Let’s Get the Best of Both Worlds](https://hal.archives-ouvertes.fr/hal-02507499v3/document)
### Instructions
#### Scoring model
There are 3 expected deliverables associated with the scoring model:
- Do not forget: **Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.**
- The model is validated if the **AUC on the test set is higher than 50%**.
- The labelled test data is not publicly available. However, a Kaggle competition uses the same data. The procedure to evaluate test set submissions is the same as the one used for Project 1.
- Include learning curves (training and validation scores vs. training set size or epochs) to demonstrate that the model is not overfitting; a minimal sketch of how to produce them follows this list.
- Explain the measures taken to prevent overfitting, such as early stopping or regularization techniques.
- Justify your choice of when to stop training based on the learning curves.
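For illustration, here is a minimal sketch of how the AUC check and the learning curves could be produced with scikit-learn. The gradient-boosting model and the random data are placeholders for your own pipeline, not the required approach.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import learning_curve, train_test_split

# Toy stand-ins: replace with your engineered features and the real target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)
X_train, X_val, y_train, y_val = train_test_split(X, y, stratify=y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# AUC is computed on the predicted probability of default, not the hard labels.
print("validation AUC:", roc_auc_score(y_val, model.predict_proba(X_val)[:, 1]))

# Learning curves: training vs. cross-validated AUC as the training set grows.
sizes, train_scores, val_scores = learning_curve(
    model, X_train, y_train, cv=5, scoring="roc_auc",
    train_sizes=np.linspace(0.1, 1.0, 5),
)
plt.plot(sizes, train_scores.mean(axis=1), label="train AUC")
plt.plot(sizes, val_scores.mean(axis=1), label="validation AUC")
plt.xlabel("training set size")
plt.ylabel("AUC")
plt.legend()
plt.show()
```

A large, persistent gap between the two curves suggests overfitting; the point where the validation curve plateaus is a reasonable basis for deciding when to stop training.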
#### Kaggle submission
The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest [this resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations.
- Create a username following this structure: `username_01EDU_location_MM_YYYY`. Submit the profile description and push it to the Git platform on the first day of the week. Do not touch this file anymore.
- A text document `model_report.txt` that describes the methodology used to train the machine learning model:
- Algorithm
- Why shouldn't accuracy be used in this case? (The sketch after this list illustrates the class-imbalance issue.)
- Limits and possible improvements
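A sketch of what the submission file and the accuracy argument could look like. The ids, probabilities, and the `SK_ID_CURR`/`TARGET` column names are assumptions (Home Credit format) to verify against the actual challenge page.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: replace with your model's probabilities and the real test ids.
rng = np.random.default_rng(0)
test_ids = np.arange(100001, 100101)        # hypothetical SK_ID_CURR values
proba = rng.uniform(size=test_ids.size)     # e.g. model.predict_proba(X_test)[:, 1]

# Column names are assumed to follow the Home Credit format; check the
# challenge overview page for the exact submission format.
submission = pd.DataFrame({"SK_ID_CURR": test_ids, "TARGET": proba})
submission.to_csv("submission.csv", index=False)

# Why accuracy misleads here: defaults are rare, so a model that never flags
# a default is highly "accurate" while being useless for the bank.
y = np.array([0] * 92 + [1] * 8)            # illustrative ~8% default rate
print("accuracy of always predicting 'no default':", (y == 0).mean())  # 0.92
```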
#### Model interpretability
This part hasn't been covered during the piscine. Take the time to understand this key concept.
There are different levels of transparency:
Choose the 3 clients of your choice, compute the score, and run the visualizations on their data; a SHAP sketch follows the list below. Take 2 clients from the train set:
- 1 on which the model is correct and the other on which the model is wrong. Try to understand why the model got it wrong for this client.
- Take 1 client from the test set
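One possible way to explain an individual client's score is with the SHAP library, which is referenced in the audit section below. This is a minimal sketch with a toy model; the data, model, and chosen client are placeholders.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-ins: replace with your trained model and engineered features.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)),
                 columns=[f"var_{i}" for i in range(5)])
y = (X["var_0"] + 0.5 * rng.normal(size=500) > 0).astype(int)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# TreeExplainer returns one SHAP value per feature per client; for this
# binary model the values live in log-odds space.
explainer = shap.TreeExplainer(model)
client = X.iloc[[0]]                        # one of the 3 chosen clients
shap_values = explainer.shap_values(client)

# The force plot shows which variables push this client's score up or down.
base_value = float(np.ravel(explainer.expected_value)[0])
shap.force_plot(base_value, shap_values[0], client.iloc[0], matplotlib=True)
```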
#### Bonus
Implement a dashboard (using [Dash](https://dash.plotly.com/)) that takes as input the customer id and that returns the score and the required visualizations.
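A minimal sketch of such a dashboard, assuming Dash 2.x; `get_score` is a hypothetical stand-in for your own scoring function, and the real app would also render the required visualizations.

```python
from dash import Dash, Input, Output, dcc, html

def get_score(customer_id: int) -> float:
    """Hypothetical stand-in for your real scoring function."""
    return 0.42  # placeholder probability of default

app = Dash(__name__)
app.layout = html.Div([
    dcc.Input(id="customer-id", type="number", placeholder="customer id"),
    html.Div(id="score"),
])

@app.callback(Output("score", "children"), Input("customer-id", "value"))
def show_score(customer_id):
    if customer_id is None:
        return "Enter a customer id."
    return f"Probability of default: {get_score(customer_id):.0%}"

if __name__ == "__main__":
    app.run(debug=True)  # Dash >= 2.7; use app.run_server on older versions
```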
### Project repository structure:
```
project
│   ...
│ │ preprocess.py
```
- `README.md` introduces the project, explains how to run the code, and shows the username.
- `requirements.txt` contains all libraries required to run the code.
- `username.txt` contains the username, the last modified date of the file **has to correspond to the first day of the project**.
- `EDA.ipynb` contains the exploratory data analysis. This file should contain all steps of data analysis that contributed or not to improve the score of the model. It has to be commented so that the reviewer can understand the analysis and run it without any problem.
- `scripts` contains the Python file(s) that perform the feature engineering, the model's training, and prediction on the test set. It could also be a single Jupyter Notebook. It has to be commented to help the reviewers understand the approach and run the code without any bugs; a minimal training-script sketch follows this list.
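For illustration, a minimal sketch of what `scripts/train.py` could look like; the data path and the `build_features` helper are hypothetical and must be adapted to your own pipeline.

```python
# scripts/train.py -- a minimal sketch; the data path and the
# `build_features` helper are hypothetical and must be adapted.
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

from preprocess import build_features  # hypothetical helper in scripts/preprocess.py

def main() -> None:
    train_df = pd.read_csv("data/application_train.csv")  # assumed data path
    X, y = build_features(train_df)
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X, y)
    joblib.dump(model, "model.joblib")  # reloaded by the prediction script

if __name__ == "__main__":
    main()
```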
### Tips
Remember, creating a great credit scoring model is like baking a perfect cake - it takes the right ingredients, careful preparation, and a dash of creativity. You've got this!
###### Does the readme file introduce the project, summarize how to run the code and show the username?
###### Does the requirements contain all libraries used and the versions that are necessary to run the code?
###### Does the `EDA.ipynb` explain the exploratory data analysis in detail?
###### Is the model trained only on the training set?
###### Is the AUC on the test set higher than 50%?
###### Do the model's learning curves prove that the model is not overfitting?
### Descriptive variables:
##### These are important to understand, for example, the age of the client. The data could be scaled or modified in the preprocessing pipeline, but the data visualized here should be "raw".
- Visualisations that show at least 10 variables describing the client and their loan(s).
- Visualisations that show the comparison between this client and other clients (a minimal plotting sketch follows below).
###### Are the visualisations computed for the 3 clients?
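A minimal sketch of such a comparison, using toy data and the client's age as an example variable; replace with the real client records and repeat for each descriptive variable.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data: `age` stands in for one of the (at least 10) raw descriptive
# variables; replace with the real client records.
rng = np.random.default_rng(0)
population = pd.DataFrame({"age": rng.normal(43, 11, 1000).clip(21, 70)})
client_age = 35  # raw value for one of the 3 chosen clients

# Histogram of all clients with the chosen client's raw value marked.
plt.hist(population["age"], bins=30, alpha=0.7)
plt.axvline(client_age, color="red", label="chosen client")
plt.xlabel("age")
plt.ylabel("number of clients")
plt.legend()
plt.show()
```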
##### SHAP values on the model are displayed through a summary plot that shows the important features and their impact on the target. This is optional if you have already computed the feature importances.
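A minimal summary-plot sketch, reusing the same toy stand-ins as the force-plot example in the interpretability section above.

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Same toy stand-ins as the force-plot sketch; replace with your own model.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 5)),
                 columns=[f"var_{i}" for i in range(5)])
y = (X["var_0"] + 0.5 * rng.normal(size=500) > 0).astype(int)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The summary plot ranks features by mean |SHAP value| and shows the
# direction of their impact on the predicted default risk.
explainer = shap.TreeExplainer(model)
shap.summary_plot(explainer.shap_values(X), X)
```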
###### Are the 3 clients selected as expected? 2 clients from the train set (1 on which the model is correct and 1 on which the model's wrong) and 1 client from the test set.