mirror of https://github.com/01-edu/public.git
Functional
Is the structure of the project as below?
project
│   README.md
│   environment.yml
│
├───data
│   │   sp500.csv
│   │   prices.csv
│
├───notebook
│   │   analysis.ipynb
│
├───scripts
│   │   memory_reducer.py
│   │   preprocessing.py
│   │   create_signal.py
│   │   backtester.py
│   │   main.py
│
└───results
    │   plots
    │   results.txt
    │   outliers.txt
Does the README file contain a description of the project, explain how to run the code from an empty environment, summarize the implementation of each Python file, and end with a conclusion that gives the performance of the strategy?
Does the environment file list every library needed to run the code, along with its version?
Does the notebook contain a missing values analysis? Example: number of missing values per variable or per year.
Does the notebook contain an outliers analysis?
Does the notebook contain a histogram of the average price per company for all variables (with the plot saved along with the other images)? This is required only for the prices.csv data.
Does the notebook describe at least 5 outliers (ticker, date, price)? Checking an outlier is simple: search for the historical stock price on Google at the given date and compare. The price may differ slightly. The goal is not to match the historical price found on Google exactly but to detect a huge difference between the price in our data and the real historical one.
Notes:
- For all questions, always check that the values are sorted by date. If they are not, the answers are wrong.
- The plots are validated only if they contain a title.
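As an illustration of the missing values analysis and the sort-by-date check, here is a minimal sketch on toy data. The column names (`date`, `ticker`, `close`, `volume`) are assumptions made for the example and may not match the real prices.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for prices.csv. The column names (date, ticker, close, volume)
# are assumptions for illustration only.
prices = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-02-28", "2019-01-31", "2020-01-31", "2019-02-28"]
        ),
        "ticker": ["BBB", "AAA", "BBB", "AAA"],
        "close": [np.nan, 10.0, 20.0, np.nan],
        "volume": [130.0, 100.0, np.nan, 110.0],
    }
)

# Sort by date first, as the notes require, and record that the check holds.
prices = prices.sort_values("date").reset_index(drop=True)
sorted_by_date = prices["date"].is_monotonic_increasing

# Number of missing values per variable (one count per column).
missing_per_variable = prices.isna().sum()

# Number of missing values per year and per variable.
year = prices["date"].dt.year.rename("year")
missing_per_year = prices.drop(columns="date").isna().groupby(year).sum()

print(missing_per_variable)
print(missing_per_year)
```

The same grouping idea works per ticker by grouping on the ticker column instead of the year.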
Python files
1. memory_reducer.py
Does the prices data set weigh less than 8 MB (megabytes)?
Does the sp500 data set weigh less than 0.15 MB (megabytes)?
Is the data type np.float32 or larger? Smaller data types may alter the precision of the data.
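One possible way to check these points is sketched below, under assumptions: `reduce_memory` and `size_mb` are hypothetical helper names, the toy frame stands in for the real data, and a megabyte is taken as 10^6 bytes. It is not necessarily how the audited memory_reducer.py works.

```python
import numpy as np
import pandas as pd


def reduce_memory(df):
    """Return a copy of df with float64 columns downcast to float32."""
    float_cols = df.select_dtypes(include=["float64"]).columns
    return df.astype({col: np.float32 for col in float_cols})


def size_mb(df):
    """Memory footprint in megabytes (10**6 bytes), object columns included."""
    return df.memory_usage(deep=True).sum() / 1e6


# Toy data; the column names are assumptions for illustration only.
toy = pd.DataFrame(
    {"ticker": ["AAA"] * 1000, "close": np.linspace(1.0, 100.0, 1000)}
)
reduced = reduce_memory(toy)

print(reduced.dtypes)
print(size_mb(toy), size_mb(reduced))
```

Stopping at float32 rather than float16 follows the remark above: smaller types save more memory but can distort prices and returns.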
2. preprocessing.py
Is the data aggregated on a monthly period, keeping only the last element of each month?
Are the outliers filtered out by removing all prices greater than $10,000 or smaller than $0.10?
Is the historical return computed using only current and past values?
Is the future return computed using only current and future values? (Reminder: as the data is resampled monthly, computing the return is straightforward)
Are the outliers in the returns data set to NaN for all returns not in the years 2008 and 2009? A return is an outlier if return > 1 or return < -0.5.
Are the missing values filled using the last value available for the same company? df.fillna(method='ffill') is wrong because the previous value can be the return or price of another company.
Are the missing values that can't be filled using the previous existing value dropped?
Is the number of missing values 0?
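The first checks above (monthly aggregation, price filter, returns, return outliers) can be sketched as follows. This is only one way to do it, on toy data: the column names `date`, `ticker`, `close`, `monthly_past_return` and `monthly_future_return`, and all function names, are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def to_monthly_last(prices):
    """Keep the last observation of each calendar month for each ticker."""
    prices = prices.sort_values(["ticker", "date"])
    month = prices["date"].dt.to_period("M").rename("month")
    return prices.groupby(["ticker", month]).tail(1)


def drop_price_outliers(monthly, col="close"):
    """Remove rows whose price is above 10,000 or below 0.1."""
    extreme = (monthly[col] > 10_000) | (monthly[col] < 0.1)
    return monthly[~extreme]


def add_returns(monthly, col="close"):
    """Add past and future monthly returns, computed per ticker."""
    out = monthly.sort_values(["ticker", "date"]).copy()
    by_ticker = out.groupby("ticker")[col]
    # Past return uses only the current and previous price.
    out["monthly_past_return"] = out[col] / by_ticker.shift(1) - 1
    # Future return uses only the current and next price.
    out["monthly_future_return"] = by_ticker.shift(-1) / out[col] - 1
    return out


def mask_return_outliers(df, cols):
    """Set returns > 1 or < -0.5 to NaN, except in 2008 and 2009."""
    out = df.copy()
    crisis = out["date"].dt.year.isin([2008, 2009])
    for col in cols:
        extreme = (out[col] > 1) | (out[col] < -0.5)
        out.loc[extreme & ~crisis, col] = np.nan
    return out


# Toy daily data for two hypothetical tickers.
daily = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2010-01-15", "2010-01-29", "2010-02-26", "2010-03-31",
             "2010-01-29", "2010-02-26", "2010-03-31"]
        ),
        "ticker": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
        "close": [10.0, 12.0, 30.0, 33.0, 20000.0, 50.0, 55.0],
    }
)

return_cols = ["monthly_past_return", "monthly_future_return"]
monthly = to_monthly_last(daily)
filtered = drop_price_outliers(monthly)
with_returns = add_returns(filtered)
cleaned = mask_return_outliers(with_returns, return_cols)
print(cleaned)
```

Grouping by ticker before shifting is what keeps one company's price from leaking into another company's return.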
Best practice:
Do not fill the last values of the future return: they are missing because the data set ends at a given date, so filling them with the previous value doesn't make sense. It makes more sense to drop those rows, because the backtest focuses on observed data.
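A minimal sketch of per-company filling that follows this best practice is shown below. The function name `fill_and_drop` and the column names are hypothetical. As a simplification, the future return column is not filled at all here, so its trailing missing values are simply dropped.

```python
import numpy as np
import pandas as pd


def fill_and_drop(df, fill_cols, future_col="monthly_future_return"):
    """Forward-fill fill_cols within each ticker, then drop remaining NaN rows."""
    out = df.sort_values(["ticker", "date"]).copy()
    # Grouping by ticker prevents values from one company filling another.
    out[fill_cols] = out.groupby("ticker")[fill_cols].ffill()
    # Rows that could not be filled, or that have no future return, are dropped.
    return out.dropna(subset=fill_cols + [future_col])


# Toy monthly data; names and values are assumptions for illustration only.
monthly = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2010-01-29", "2010-02-26", "2010-03-31", "2010-02-26", "2010-03-31"]
        ),
        "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
        "close": [10.0, np.nan, 12.0, np.nan, 5.0],
        "monthly_past_return": [np.nan, 0.1, np.nan, np.nan, 0.3],
        "monthly_future_return": [0.1, 0.2, np.nan, 0.5, np.nan],
    }
)

result = fill_and_drop(monthly, ["close", "monthly_past_return"])
print(result)
print(result.isna().sum().sum())
```

In this toy data, a plain forward fill over the whole frame would have copied the last AAA price into the first BBB row and kept that row. The grouped version leaves it missing, so it is dropped.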