#### Financial strategies on the SP500

This document is the correction for project 4. Some steps are detailed in W1D5E4.

```
project
│   README.md
│   environment.yml
│
└───data
│   │   sp500.csv
│
└───results
│   │
│   ├───cross-validation
│   │   │   ml_metrics_train.csv
│   │   │   metric_train.csv
│   │   │   top_10_feature_importance.csv
│   │   │   metric_train.png
│   │
│   ├───selected model
│   │   │   selected_model.pkl
│   │   │   selected_model.txt
│   │   │   ml_signal.csv
│   │
│   └───strategy
│   │   │   strategy.png
│   │   │   results.csv
│   │   │   report.md
│
└───scripts (free format)
│   │   features_engineering.py
│   │   gridsearch.py
│   │   model_selection.py
│   │   create_signal.py
│   │   strategy.py

```

###### Is the structure of the project like above?

###### Does the readme file summarize how to run the code and explain the global approach?

###### Does the environment contain all libraries used and their versions that are necessary to run the code?

###### Do the text files explain the chosen model methodology?

##### **Data processing and feature engineering**

###### Is the data split into a train set and a test set?

###### Is the last day of the train set D and the first day of the test set D+n with n>0? Splitting without considering the time series structure is wrong.
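A minimal sketch of a date-based split, assuming a long-format DataFrame with a `date` column (the column names, the helper `split_by_date` and the toy values are illustrative assumptions, not part of the project statement). Every train row is dated on or before the cutoff and every test row strictly after it:

```python
import pandas as pd

def split_by_date(df: pd.DataFrame, split_date: str, date_col: str = "date"):
    """Return (train, test): train rows are dated <= split_date, test rows > split_date."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(split_date)
    train = df[dates <= cutoff]
    test = df[dates > cutoff]
    return train, test

# Toy data: 2 hypothetical tickers over 6 business days (values are made up).
toy = pd.DataFrame({
    "date": list(pd.bdate_range("2020-01-01", periods=6)) * 2,
    "ticker": ["AAA"] * 6 + ["BBB"] * 6,
    "close": [float(i) for i in range(12)],
})
train, test = split_by_date(toy, "2020-01-06")

# The last train day must come strictly before the first test day.
print(train["date"].max(), test["date"].min())
```

A random row-wise split (for example a shuffled split) would mix future and past days in both sets, which is what this check is meant to catch.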

###### Is there no leakage? Unfortunately, there is no automated way to check whether the dataset is leaked. This step is validated if the features of date D are built as follows:

| Index   |         Features         |           Target |
| ------- | :----------------------: | ---------------: |
| Day D-1 | Features until D-1 23:59 |   return(D, D+1) |
| Day D   |  Features until D 23:59  | return(D+1, D+2) |
| Day D+1 | Features until D+1 23:59 | return(D+2, D+3) |

###### Has the data been grouped by ticker before computing the features?

###### Has the data been grouped by ticker before computing the future returns (the target)?
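As an illustration of the two checks above, here is a hedged sketch that computes one feature and the target per ticker. The column names (`date`, `ticker`, `close`), the feature `return_1d` and the helper name are assumptions made for the example. The target of row D is the return between D+1 and D+2, as in the table above, and grouping by ticker prevents one ticker's prices from leaking into another's rows:

```python
import pandas as pd

def add_feature_and_target(df: pd.DataFrame) -> pd.DataFrame:
    """Add a per-ticker 1-day return feature and the return(D+1, D+2) target."""
    df = df.sort_values(["ticker", "date"]).copy()
    close = df.groupby("ticker")["close"]
    # Feature known at the close of day D: return between D-1 and D.
    df["return_1d"] = df["close"] / close.shift(1) - 1
    # Target for row D: return between D+1 and D+2 (uses only future prices).
    df["target"] = close.shift(-2) / close.shift(-1) - 1
    return df

# Toy data: made-up prices for two hypothetical tickers.
toy = pd.DataFrame({
    "date": list(pd.bdate_range("2020-01-01", periods=4)) * 2,
    "ticker": ["AAA"] * 4 + ["BBB"] * 4,
    "close": [100.0, 110.0, 121.0, 133.1, 50.0, 50.0, 25.0, 50.0],
})
out = add_feature_and_target(toy)
print(out)
```

Without the `groupby`, the first row of `BBB` would get a return computed from the last price of `AAA`, and the last rows of `AAA` would get a target computed from `BBB` prices.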

##### **Machine Learning pipeline**

##### Cross-Validation

###### Does the CV contain at least 10 folds in total?

###### Do all train folds have more than 2 years of history? If you use a time series split, checking that the first fold has more than 2 years of history is enough.

###### Does the last validation set of the train set not overlap with the test set?

###### Is every fold split on dates, so that the same day never appears in both the train set and the validation set of a fold? The split should be done on the dates.

###### Is there a plot showing the cross-validation? As usual, all plots should have named axes and a title. If you chose a Time Series Split, the plot should look like this:
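One possible way to build folds on dates rather than on rows is sketched below. It is an expanding-window split written by hand; the function name `date_folds`, the parameter names and the value of 504 trading days (roughly 2 years at about 252 trading days per year) are assumptions for illustration, not requirements of the project:

```python
import numpy as np
import pandas as pd

def date_folds(dates, n_splits=10, min_train_days=504):
    """Yield (train_dates, val_dates) for an expanding-window split on unique dates."""
    unique_dates = np.sort(pd.to_datetime(pd.Series(dates)).unique())
    if len(unique_dates) < min_train_days + n_splits:
        raise ValueError("not enough dates for the requested folds")
    # Dates after the initial training window are cut into n_splits validation blocks.
    val_blocks = np.array_split(unique_dates[min_train_days:], n_splits)
    for block in val_blocks:
        # Train on every date strictly before the first validation date.
        train_dates = unique_dates[unique_dates < block[0]]
        yield train_dates, block

# Toy calendar: 600 business days, repeated as if there were 2 tickers.
calendar = pd.bdate_range("2015-01-01", periods=600)
folds = list(date_folds(list(calendar) * 2, n_splits=10, min_train_days=504))
for train_dates, val_dates in folds[:2]:
    print(len(train_dates), len(val_dates))
```

Rows are then selected with something like `df[df["date"].isin(train_dates)]`, so all tickers of a given day end up on the same side of the split.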

![Time series split][timeseries]

[timeseries]: ../Time_series_split.png "Time Series split"

##### Model Selection

###### Has the test set been kept out of both model training and model selection?

###### Is the selected model saved in the pkl file and described in a txt file?

##### Selected model

###### Are the ML metrics computed on the train set aggregated (sum or median)?

###### Are the ML metrics saved in a csv file?

###### Are the top 10 important features per fold saved in `top_10_feature_importance.csv`?

###### Does `metric_train.png` show a plot similar to the one below?

_Note that this can also be done on the test set **IF** the test set has not helped to select the pipeline._

![Metric plot][barplot]

[barplot]: ../metric_plot.png "Metric plot"

##### Machine learning signal

##### **The pipeline shouldn't be trained once and then used to predict on all data points!** As explained, the signal has to be generated with the chosen cross-validation: train the model on the train set of the first fold, then predict on its validation set; train the model on the train set of the second fold, then predict on its validation set; and so on. Then concatenate the predictions on the validation sets to build the machine learning signal.
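The fold-by-fold procedure described above can be sketched as follows. This is only an illustration under assumptions: `build_signal` and `MeanSignModel` are hypothetical names, `folds` is assumed to be a list of `(train_dates, val_dates)` pairs, and the toy model stands in for a real pipeline (any object with `fit` and `predict`, such as a scikit-learn pipeline, could be passed instead):

```python
import numpy as np
import pandas as pd

class MeanSignModel:
    """Hypothetical stand-in for a real pipeline: predicts the sign of the mean training target."""
    def fit(self, X, y):
        self.sign_ = 1.0 if y.mean() >= 0 else -1.0
        return self

    def predict(self, X):
        return np.full(len(X), self.sign_)

def build_signal(X, y, dates, folds, make_model):
    """Concatenate out-of-sample predictions, refitting a fresh model on every fold."""
    parts = []
    for train_dates, val_dates in folds:
        train_mask = dates.isin(train_dates)
        val_mask = dates.isin(val_dates)
        model = make_model()  # new model per fold, never reused across folds
        model.fit(X.loc[train_mask], y.loc[train_mask])
        X_val = X.loc[val_mask]
        parts.append(pd.Series(model.predict(X_val), index=X_val.index))
    return pd.concat(parts).sort_index()

# Toy example: 6 days, 2 expanding folds (all values are made up).
days = pd.bdate_range("2020-01-01", periods=6)
dates = pd.Series(days)
X = pd.DataFrame({"feature": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]})
y = pd.Series([0.1, 0.2, -0.5, -0.5, 0.3, 0.3])
folds = [(days[:2], days[2:4]), (days[:4], days[4:])]
signal = build_signal(X, y, dates, folds, MeanSignModel)
print(signal)
```

The resulting signal only covers rows that belong to some validation set, so the first training window has no prediction.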

##### **Strategy backtesting**

##### Convert machine learning signal into a strategy

##### The transformed machine learning signal (long only, long short, binary, ternary, stock picking, proportional to probability or custom) is multiplied by the return between d+1 and d+2. As a reminder, the signal at date d predicts whether the return between d+1 and d+2 is increasing or decreasing. Then, the PnL of date d could be associated with date d, d+1 or d+2. This is arbitrary and should impact the value of the PnL.

##### You invest the same amount of money every day. One exception: if you invest $1 per day per stock, the amount invested every day may change depending on the strategy chosen. If you take into account the different values of capital invested every day in the calculation of the PnL, the step is still validated.

##### Metrics and plot

###### Is the PnL computed as: strategy \* future_return?

###### Does the strategy give the amount invested at time t on asset i?
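A small sketch of the PnL computation checked above, assuming `strategy` and `future_return` are DataFrames indexed by date with one column per ticker (this layout, the function name `daily_pnl` and the numbers are illustrative assumptions). Each cell of `strategy` is the amount invested at time t on asset i:

```python
import pandas as pd

def daily_pnl(strategy: pd.DataFrame, future_return: pd.DataFrame) -> pd.Series:
    """PnL per date: sum over assets of amount invested times future return."""
    return (strategy * future_return).sum(axis=1)

# Toy example with two hypothetical tickers and made-up returns.
dates = pd.to_datetime(["2020-01-01", "2020-01-02"])
strategy = pd.DataFrame({"AAA": [1.0, 1.0], "BBB": [-1.0, 0.0]}, index=dates)
future_return = pd.DataFrame({"AAA": [0.02, -0.01], "BBB": [0.01, 0.03]}, index=dates)

pnl = daily_pnl(strategy, future_return)
cumulative_pnl = pnl.cumsum()
print(cumulative_pnl)
```

Here the first day is long `AAA` and short `BBB`, and the second day is long `AAA` only, so the amount invested differs between the two days.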

###### Does the plot `strategy.png` contain an x axis: date?

###### Does the plot `strategy.png` contain a y axis1: PnL of the strategy at time t?

###### Does the plot `strategy.png` contain a y axis2: PnL of the SP500 at time t?

###### Does the plot `strategy.png` use the same scale for y axis1 and y axis2?

###### Does the plot `strategy.png` contain a vertical line that shows the separation between train set and test set?

##### Report

###### Does the report detail the features used?

###### Does the report detail the pipeline used (imputer, scaler, dimension reduction and model)?

###### Does the report detail the cross-validation used (length of train sets and validation sets and if possible the cross-validation plot)?

###### Does the report detail the strategy chosen (description, PnL plot and the strategy metrics on the train set and test set)?