From 4032f00d3cd7a904b663b1c98eb52ea1820c8cd0 Mon Sep 17 00:00:00 2001
From: nprimo
Date: Thu, 7 Mar 2024 14:20:42 +0000
Subject: [PATCH] feat(sp500-strategies): clarify audit question

---
 subjects/ai/sp500-strategies/audit/README.md | 61 ++++----------------
 1 file changed, 12 insertions(+), 49 deletions(-)

diff --git a/subjects/ai/sp500-strategies/audit/README.md b/subjects/ai/sp500-strategies/audit/README.md
index db6c958d1..bebec3f20 100644
--- a/subjects/ai/sp500-strategies/audit/README.md
+++ b/subjects/ai/sp500-strategies/audit/README.md
@@ -1,45 +1,8 @@
 #### Financial strategies on the SP500
 
-This documents is the correction of the project 4. Some steps are detailed in W1D5E4.
-
-```
-project
-│ README.md
-│ environment.yml
-│
-└───data
-│ │ sp500.csv
-│
-└───results
-│ │
-| |───cross-validation
-│ │ │ ml_metrics_train.csv
-│ │ │ metric_train.csv
-│ │ │ top_10_feature_importance.csv
-│ │ │ metric_train.png
-│ │
-| |───selected model
-│ │ │ selected_model.pkl
-│ │ │ selected_model.txt
-│ │ │ ml_signal.csv
-│ │
-| |───strategy
-| | | strategy.png
-│ │ │ results.csv
-│ │ │ report.md
-|
-|───scripts (free format)
-│ │ features_engineering.py
-│ │ gridsearch.py
-│ │ model_selection.py
-│ │ create_signal.py
-│ │ strategy.py
-
-```
-
-###### Is the structure of the project like above?
-
-###### Does the readme file summarize how to run the code and explain the global approach?
+###### Is the structure of the project like the one presented in the `Project repository structure` in the subject?
+
+###### Does the README file summarize how to run the code and explain the global approach?
 
 ###### Does the environment contain all libraries used and their versions that are necessary to run the code?
 
@@ -47,11 +10,11 @@ project
 
 ##### **Data processing and feature engineering**
 
-###### Is the data splitted in a train set and test set?
+###### Is the data split in a train set and test set?
 
 ###### Is the last day of the train set D and the first day of the test set D+n with n>0? Splitting without considering the time series structure is wrong.
 
-###### Is there no leakage? unfortunately there's no automated way to check if the dataset is leaked. This step is validated if the features of date d are built as follow:
+###### Is there no leakage? Unfortunately, there's no automated way to check if the dataset is leaked. This step is validated if the features of date d are built as follows:
 
 | Index | Features | Target |
 | ------- | :------------------------: | ---------------: |
@@ -71,9 +34,9 @@ project
 
 ###### Do all train folds have more than 2y history? If you use time series split, checking that the first fold has more than 2y history is enough.
 
-###### Does the last validation set of the train set not overlap on the test set?
+###### Is the last validation set of the train data not overlapping with the test data?
 
-###### Do all of the folds not contain data from the same day? The split should be done on the dates.
+###### Are all the data folds split by date? A fold should not contain repeated data from the same date and ticker.
 
 ###### Is There a plot showing your cross-validation? As usual, all plots should have named axis and a title. If you chose a Time Series Split the plot should look like this:
 
@@ -85,13 +48,13 @@ project
 
 ###### Has the test set not been used to train the model and select the model?
 
-###### Is the selected model saved in the pkl file and described in a txt file?
+###### Is the selected model saved in a `pkl` file and described in a `txt` file?
 
 ##### Selected model
 
-###### Are the ml metrics computed on the train set agregated? sum or median.
+###### Are the ML metrics computed on the train set aggregated (sum or median)?
 
-###### Are the ml metrics saved in a csv file?
+###### Are the ML metrics saved in a `csv` file?
 
 ###### Are the top 10 important features per fold saved in `top_10_feature_importance.csv`?
 
@@ -119,7 +82,7 @@ _Note that, this can be done also on the test set **IF** this hasn't helped to s
 
 ###### Is the Pnl computed as: strategy \* futur_return?
 
-###### Does the strategy give the amount invested at time t on asset i?
+###### Does the strategy give the amount invested at time `t` on asset `i`?
 
 ###### Does the plot `strategy.png` contain an x axis: date?
 
@@ -135,7 +98,7 @@ _Note that, this can be done also on the test set **IF** this hasn't helped to s
 
 ###### Does the report detail the features used?
 
-###### Does the report detail the pipeline used (imputer, scaler, dimension reduction and model)?
+###### Does the report detail the pipeline used (`Imputer`, `Scaler`, dimension reduction and model)?
 
 ###### Does the report detail the cross-validation used (length of train sets and validation sets and if possible the cross-validation plot)?