From c1faa65e9149367b67d764a276d7e140f26350ac Mon Sep 17 00:00:00 2001 From: Oumaima Fisaoui <48260689+Oumaimafisaoui@users.noreply.github.com> Date: Tue, 1 Oct 2024 09:05:26 +0100 Subject: [PATCH] Chore(AI): Fix subjects structure and issue with emotions detector --- subjects/ai/backtesting-sp500/README.md | 148 +++++++++++++----- subjects/ai/backtesting-sp500/audit/README.md | 6 +- subjects/ai/credit-scoring/README.md | 47 ++++-- subjects/ai/credit-scoring/audit/README.md | 12 +- subjects/ai/credit-scoring/readme_data.md | 9 +- subjects/ai/emotions-detector/README.md | 32 ++-- subjects/ai/emotions-detector/audit/README.md | 4 +- subjects/ai/kaggle-titanic/README.md | 81 +++++----- subjects/ai/kaggle-titanic/audit/README.md | 2 +- subjects/ai/nlp-scraper/README.md | 38 +++-- subjects/ai/nlp-scraper/audit/README.md | 2 +- subjects/ai/sp500-strategies/README.md | 29 ++-- subjects/ai/sp500-strategies/audit/README.md | 2 +- 13 files changed, 275 insertions(+), 137 deletions(-) diff --git a/subjects/ai/backtesting-sp500/README.md b/subjects/ai/backtesting-sp500/README.md index 54fca3fb9..b9c59167d 100644 --- a/subjects/ai/backtesting-sp500/README.md +++ b/subjects/ai/backtesting-sp500/README.md @@ -1,10 +1,31 @@ -# Backtesting on the SP500 +## Backtesting-SP500 -## SP500 data preprocessing +### Overview The goal of this project is to perform a Backtest on the SP500 constituents, which represents the 500 largest companies by market capitalization in the United States. -## Data +### Role Play + +You are a quantitative analyst at a prestigious hedge fund. Your manager has tasked you with developing and backtesting a stock-picking strategy using historical data from the S&P 500 index. The goal is to create a strategy that outperforms the market benchmark. You'll need to clean and preprocess messy financial data, develop a signal for stock selection, implement a backtesting framework, and present your findings to the investment committee. + +### Learning Objectives + +By the end of this project, you will be able to: + +1. Optimize data types in large datasets to improve memory efficiency +2. Perform exploratory data analysis on financial time series data +3. Identify and handle outliers and missing values in stock price data +4. Preprocess financial data, including resampling and calculating returns +5. Develop a simple stock selection signal based on historical performance +6. Implement a backtesting framework for evaluating trading strategies +7. Compare the performance of a custom strategy against a market benchmark +8. Visualize financial performance data using appropriate charts and graphs +9. Write modular, reusable code for financial data analysis and strategy testing +10. Interpret and communicate the results of a quantitative trading strategy + +### Instructions + +#### Data The input files are: @@ -24,42 +45,15 @@ _Note: The quality of this data set is not good: some prices are wrong, there ar _Note: The corrections will not fix the data, as a result the results may be abnormal compared to results from cleaned financial data. That's not a problem for this small project !_ -## Problem +#### Problem Once preprocessed this data, it will be used to generate a signal that is, for each asset at each date a metric that indicates if the asset price will increase the next month. At each date (once a month) we will take the 20 highest metrics and invest $1 per company. This strategy is called **stock picking**. It consists in picking stock in an index and try to over perform the index. 
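To make this mechanic concrete, here is a minimal pandas sketch of the monthly top-20 selection (the `date`, `metric` and `signal` column names are illustrative placeholders, not imposed by the subject):

```python
import pandas as pd

def top_20_signal(metrics: pd.DataFrame) -> pd.DataFrame:
    """Flag, for each month, the 20 assets with the highest metric.

    `metrics` is assumed to hold one row per (date, ticker) pair with a
    pre-computed `metric` column at a monthly frequency.
    """
    ranked = metrics.copy()
    # Rank the metric within each month; the best asset gets rank 1.
    ranked["rank"] = ranked.groupby("date")["metric"].rank(
        method="first", ascending=False
    )
    # True for the 20 best-ranked assets of the month, False otherwise.
    ranked["signal"] = ranked["rank"] <= 20
    return ranked.drop(columns="rank")
```

Each `True` in `signal` then corresponds to investing $1 on that asset for the following month.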
Finally, we will compare the performance of our strategy compared to the benchmark: the SP500 It is important to understand that the SP500 components change over time. The reason is simple: Facebook entered the SP500 in 2013 thus meaning that another company had to be removed from the 500 companies. -The structure of the project is: - -```console -project -│ README.md -│ environment.yml -│ -└───data -│ │ sp500.csv -│ | prices.csv -│ -└───notebook -│ │ analysis.ipynb -| -|───scripts -| │ memory_reducer.py -| │ preprocessing.py -| │ create_signal.py -| | backtester.py -│ | main.py -│ -└───results - │ plots - │ results.txt - │ outliers.txt -``` - There are four parts: -## 1. Preliminary +#### 1. Preliminary - Create a function that takes as input one CSV data file. This function should optimize the types to reduce its size and returns a memory optimized DataFrame. - For `float` data the smaller data type used is `np.float32` @@ -71,7 +65,7 @@ There are four parts: 4. Find the min and the max value 5. Determine and apply the smallest datatype that can fit the range of values -## 2. Data wrangling and preprocessing +#### 2. Data wrangling and preprocessing - Create a Jupyter Notebook to analyze the data sets and perform EDA (Exploratory Data Analysis). This notebook should contain at least: @@ -112,7 +106,7 @@ At this stage the DataFrame should look like this: - Resample data on month and keep the last value - Compute historical monthly returns on the adjusted close -## 3. Create signal +#### 3. Create signal At this stage we have a data set with features that we will leverage to get an investment signal. As previously said, we will focus on one single variable to create the signal: **monthly_past_return**. The signal will be the average of monthly returns of the previous year @@ -121,7 +115,7 @@ The naive assumption made here is that if a stock has performed well the last ye - Create a column `average_return_1y` - Create a column named `signal` that contains `True` if `average_return_1y` is among the 20 highest in the month `average_return_1y`. -## 4. Backtester +#### 4. Backtester At this stage we have an investment signal that indicates each month what are the 20 companies we should invest 1$ on (1$ each). In order to check the strategies and performance we will backtest our investment signal. @@ -135,9 +129,9 @@ A data point (x-axis: date, y-axis: cumulated_return) is: the **cumulated return ![alt text][performance] -[performance]: images/w1_weekend_plot_pnl.png 'Cumulative Performance' +[performance]: images/w1_weekend_plot_pnl.png "Cumulative Performance" -## 5. Main +#### 5. Main Here is a sketch of `main.py`. @@ -158,3 +152,83 @@ backtest(prices, sp500) ``` **The command `python main.py` executes the code from data imports to the backtest and save the results.** + +### Project repository structure: + +```console +project +│ README.md +│ requirements.txt +│ +└───data +│ │ sp500.csv +│ | prices.csv +│ +└───notebook +│ │ analysis.ipynb +| +|───scripts +| │ memory_reducer.py +| │ preprocessing.py +| │ create_signal.py +| | backtester.py +│ | main.py +│ +└───results + │ plots + │ results.txt + │ outliers.txt +``` + +### Tips: + +1. Data Quality Management: + + - Be prepared to encounter messy data. Financial datasets often contain errors, outliers, and missing values. + - Develop a systematic approach to identify and handle data quality issues. + +2. Memory Optimization: + + - When working with large datasets, optimize memory usage by selecting appropriate data types for each column. 
+ - Consider using smaller data types like np.float32 for floating-point numbers when precision allows. + +3. Exploratory Data Analysis: + + - Spend time understanding the data through visualization and statistical analysis before diving into strategy development. + - Pay special attention to outliers and their potential impact on your strategy. + +4. Preprocessing Financial Data: + + - When resampling time series data, be mindful of which value to keep (e.g., last value for month-end prices). + - Calculate both historical and future returns to avoid look-ahead bias in your strategy. + +5. Handling Outliers: + + - Develop a method to identify and handle outliers that is specific to each company's historical data. + - Be cautious about removing outliers during periods of high market volatility (e.g., 2008-2009 financial crisis). + +6. Signal Creation: + + - Start with a simple signal (like past 12-month average returns) before exploring more complex strategies. + - Ensure your signal doesn't use future information that wouldn't have been available at the time of decision. + +7. Backtesting: + + - Implement your backtesting logic without using loops for better performance. + - Compare your strategy's performance against a relevant benchmark (in this case, the S&P 500). + +8. Visualization: + + - Create clear, informative visualizations to communicate your strategy's performance. + - Include cumulative return plots to show how your strategy performs over time compared to the benchmark. + +9. Code Structure: + + - Organize your code into modular functions for better readability and reusability. + - Use a main script to orchestrate the entire process from data loading to results visualization. + +10. Results Interpretation: + - Don't just focus on total returns. Consider other metrics like risk-adjusted returns, maximum drawdown, etc. + - Be prepared to explain any anomalies or unexpected results in your strategy's performance. + +Remember, the goal is not just to create a strategy that looks good on paper, but to develop a robust process for analyzing financial data and testing investment ideas. diff --git a/subjects/ai/backtesting-sp500/audit/README.md b/subjects/ai/backtesting-sp500/audit/README.md index fdf200d93..3f09f1389 100644 --- a/subjects/ai/backtesting-sp500/audit/README.md +++ b/subjects/ai/backtesting-sp500/audit/README.md @@ -5,7 +5,7 @@ ``` project │ README.md -│ environment.yml +│ requirements.txt │ └───data │ │ sp500.csv @@ -30,7 +30,7 @@ project ###### Does the readme file contain a description of the project, explain how to run the code from an empty environment, give a summary of the implementation of each python file and contain a conclusion that gives the performance of the strategy? -###### Does the environment contain all libraries used and their versions that are necessary to run the code? +###### Does the requirements contain all libraries used and their versions that are necessary to run the code? ###### Does the notebook contain a missing values analysis? **Example**: number of missing values per variables or per year @@ -107,7 +107,7 @@ Best practice: ![alt text][performance] -[performance]: ../images/w1_weekend_plot_pnl.png 'Cumulative Performance' +[performance]: ../images/w1_weekend_plot_pnl.png "Cumulative Performance" ##### 5. 
main.py diff --git a/subjects/ai/credit-scoring/README.md b/subjects/ai/credit-scoring/README.md index d74008b31..66a7f65c6 100644 --- a/subjects/ai/credit-scoring/README.md +++ b/subjects/ai/credit-scoring/README.md @@ -1,16 +1,24 @@ ## Credit scoring +### Overview + The goal of this project is to implement a scoring model based on various source of data ([check data documentation](./readme_data.md)) that returns the probability of default. In a nutshell, credit scoring represents an evaluation of how well the bank's customer can pay and is willing to pay off debt. It is also required that you provide an explanation of the score. For example, your model returns that the probability that one client doesn't pay back the loan is very high (90%). The reason behind is that variable_xxx which represents the ability to pay back the past loan is low. The output interpretability will appear in a visualization. -The ability to understand the underlying factors of credit scoring is important. Credit scoring is subject to more and more regulation, so transparency is key. And more generally, more and more companies prefer transparency to black box models. +### Role play -### Resources +Hey there, future credit scoring expert! Ready to dive into the exciting world of predicting loan defaults? You're in for a treat! This project is all about building a nifty model that can help figure out how likely someone is to pay back their loan. Cool, right? + +### Learning Objective + +The ability to understand the underlying factors of credit scoring is important. Credit scoring is subject to more and more regulation, so transparency is key. And more generally, more and more companies prefer transparency to black box models. Historical timeline of machine learning techniques applied to credit scoring - [Machine Learning or Econometrics for Credit Scoring: Let’s Get the Best of Both Worlds](https://hal.archives-ouvertes.fr/hal-02507499v3/document) -### Scoring model +### Instructions + +#### Scoring model There are 3 expected deliverables associated with the scoring model: @@ -18,21 +26,28 @@ There are 3 expected deliverables associated with the scoring model: - The trained machine learning model with the features engineering pipeline: - Do not forget: **Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.** - - The model is validated if the **AUC on the test set is higher than 75%**. + - The model is validated if the **AUC on the test set is higher than 50%**. - The labelled test data is not publicly available. However, a Kaggle competition uses the same data. The procedure to evaluate test set submission is the same as the one used for the project 1. + - Here are the [DataSets](https://assets.01-edu.org/ai-branch/project5/home-credit-default-risk.zip). + +- A report on model training and evaluation: + + - Include learning curves (training and validation scores vs. training set size or epochs) to demonstrate that the model is not overfitting. + - Explain the measures taken to prevent overfitting, such as early stopping or regularization techniques. + - Justify your choice of when to stop training based on the learning curves. -### Kaggle submission +#### Kaggle submission The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest [this resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations. 
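As an illustration only, the submission file is typically a two-column CSV with the application id and the predicted probability of default (the `SK_ID_CURR`/`TARGET` column names below are assumed from the Home Credit competition page, so double-check them there):

```python
import pandas as pd

def write_submission(ids, probabilities, path="submission.csv"):
    """Write predicted default probabilities in the expected Kaggle CSV format."""
    submission = pd.DataFrame({"SK_ID_CURR": ids, "TARGET": probabilities})
    submission.to_csv(path, index=False)  # no index column in the uploaded file
    return submission
```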
- Create a username following that structure: username*01EDU* location_MM_YYYY. Submit the description profile and push it on the Git platform the first day of the week. Do not touch this file anymore. -- A text document that describes the methodology used to train the machine learning model: +- A text document `model_report.txt` that describes the methodology used to train the machine learning model : - Algorithm - Why the accuracy shouldn't be used in that case? - Limit and possible improvements -### Model interpretability +#### Model interpretability This part hasn't been covered during the piscine. Take the time to understand this key concept. There are different level of transparency: @@ -55,16 +70,16 @@ Choose the 3 clients of your choice, compute the score, run the visualizations o - 1 on which the model is correct and the other on which the model is wrong. Try to understand why the model got wrong on this client. - Take 1 client from the test set -### Optional +#### Bonus Implement a dashboard (using [Dash](https://dash.plotly.com/)) that takes as input the customer id and that returns the score and the required visualizations. -### Deliverables +### Project repository structure: ``` project │ README.md -│ environment.yml +│ requirements.txt │ └───data │ │ ... @@ -93,17 +108,17 @@ project │ │ preprocess.py ``` -- `README.md` introduces the project and shows the username. -- `environment.yml` contains all libraries required to run the code. +- `README.md` introduces the project, how to run the code, and shows the username. +- `requirements.txt` contains all libraries required to run the code. - `username.txt` contains the username, the last modified date of the file **has to correspond to the first day of the project**. - `EDA.ipynb` contains the exploratory data analysis. This file should contain all steps of data analysis that contributed or not to improve the score of the model. It has to be commented so that the reviewer can understand the analysis and run it without any problem. - `scripts` contains python file(s) that perform(s) the feature engineering, the model's training and prediction on the test set. It could also be one single Jupyter Notebook. It has to be commented to help the reviewers understand the approach and run the code without any bugs. -### Useful resources +### Tips -- [Interpreting machine learning models](https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f) +Remember, creating a great credit scoring model is like baking a perfect cake - it takes the right ingredients, careful preparation, and a dash of creativity. You've got this! -### Files needed for this project +### Resources -[Files](https://assets.01-edu.org/ai-branch/project5/home-credit-default-risk.zip) +- [Interpreting machine learning models](https://towardsdatascience.com/interpretability-in-machine-learning-70c30694a05f) diff --git a/subjects/ai/credit-scoring/audit/README.md b/subjects/ai/credit-scoring/audit/README.md index 0c363cdd8..1cceee536 100644 --- a/subjects/ai/credit-scoring/audit/README.md +++ b/subjects/ai/credit-scoring/audit/README.md @@ -5,7 +5,7 @@ ``` project │ README.md -│ environment.yml +│ requirements.txt │ └───data │ │ ... @@ -38,7 +38,7 @@ project ###### Does the readme file introduce the project, summarize how to run the code and show the username? -###### Does the environment contain all libraries used and the versions that are necessary to run the code? 
+###### Does the requirements contain all libraries used and the versions that are necessary to run the code? ###### Does the `EDA.ipynb` explain in details the exploratory data analysis? @@ -46,7 +46,7 @@ project ###### Is the model trained only the training set? -###### Is the AUC on the test set higher than 75%? +###### Is the AUC on the test set higher than 50%? ###### Does the model learning curves prove that the model is not overfitting? @@ -59,7 +59,7 @@ project ```prompt python predict.py - AUC on test set: 0.76 + AUC on test set: 0.50 ``` @@ -75,11 +75,13 @@ This [article](https://medium.com/thecyphy/home-credit-default-risk-part-2-84b58 ### Descriptive variables: -###### These are important to understand for example the age of the client. If the data could be scaled or modified in the preprocessing pipeline but the data visualised here should be "raw". Are the visualisations computed for the 3 clients? +##### These are important to understand for example the age of the client. If the data could be scaled or modified in the preprocessing pipeline but the data visualized here should be "raw". - Visualisations that show at least 10 variables describing the client and its loan(s). - Visualisations that show the comparison between this client and other clients. +###### Are the visualisations computed for the 3 clients? + ##### SHAP values on the model are displayed through a summary plot that shows the important features and their impact on the target. This is optional if you have already computed the features importance. ###### Are the 3 clients selected as expected? 2 clients from the train set (1 on which the model is correct and 1 on which the model's wrong) and 1 client from the test set. diff --git a/subjects/ai/credit-scoring/readme_data.md b/subjects/ai/credit-scoring/readme_data.md index 12aa9509c..8682b10e3 100644 --- a/subjects/ai/credit-scoring/readme_data.md +++ b/subjects/ai/credit-scoring/readme_data.md @@ -4,7 +4,7 @@ This file describes the available data for the project. ![alt data description](data_description.png "Credit scoring data description") -## application_{train|test}.csv +## application\_{train|test}.csv This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET). Static data for all applications. One row represents one loan in our data sample. @@ -17,24 +17,23 @@ For every loan in our sample, there are as many rows as number of credits the cl ## bureau_balance.csv Monthly balances of previous credits in Credit Bureau. -This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e the table has (#loans in sample * # of relative previous credits * # of months where we have some history observable for the previous credits) rows. +This table has one row for each month of history of every previous credit reported to Credit Bureau – i.e the table has (#loans in sample _ # of relative previous credits _ # of months where we have some history observable for the previous credits) rows. ## POS_CASH_balance.csv Monthly balance snapshots of previous POS (point of sales) and cash loans that the applicant had with Home Credit. -This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credits * # of months in which we have some history observable for the previous credits) rows. 
+This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample _ # of relative previous credits _ # of months in which we have some history observable for the previous credits) rows. ## credit_card_balance.csv Monthly balance snapshots of previous credit cards that the applicant has with Home Credit. -This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample * # of relative previous credit cards * # of months where we have some history observable for the previous credit card) rows. +This table has one row for each month of history of every previous credit in Home Credit (consumer credit and cash loans) related to loans in our sample – i.e. the table has (#loans in sample _ # of relative previous credit cards _ # of months where we have some history observable for the previous credit card) rows. ## previous_application.csv All previous applications for Home Credit loans of clients who have loans in our sample. There is one row for each previous application related to loans in our data sample. - ## installments_payments.csv Repayment history for the previously disbursed credits in Home Credit related to the loans in our sample. diff --git a/subjects/ai/emotions-detector/README.md b/subjects/ai/emotions-detector/README.md index a0bd99839..4d3e31136 100644 --- a/subjects/ai/emotions-detector/README.md +++ b/subjects/ai/emotions-detector/README.md @@ -1,10 +1,18 @@ -## Emotions detection with Deep Learning +## Emotion detector + +### Overview Cameras are everywhere. Videos and images have become one of the most interesting data sets for artificial intelligence. Image processing is a quite broad research area, not just filtering, compression, and enhancement. Besides, we are even interested in the question, “what is in images?”, i.e., content analysis of visual inputs, which is part of the main task of computer vision. +### Role play + +you're going to train a computer to be like a mind reader, but instead of reading thoughts, it's reading emotions! You'll be working with a bunch of pictures of faces, teaching your AI to tell the difference between a big grin and a grumpy frown, or a surprised gasp and a fearful wide-eyed look. + +### Learning Objective + The study of computer vision could make possible such tasks as 3D reconstruction of scenes, motion capturing, and object recognition, which are crucial for even higher-level intelligence such as image and video understanding, and motion understanding. For this project we will focus on two tasks: @@ -18,7 +26,9 @@ With the computing power exponentially increasing the computer vision field has - The history behind this field is fascinating! [Here](https://kapernikov.com/basic-introduction-to-computer-vision/) is a short summary of its history. -### Project goal and suggested timeline +### Instructions + +#### Project goal: The goal of the project is to implement a **system that detects the emotion on a face from a webcam video stream**. To achieve this exciting task you'll have to understand how to: @@ -32,7 +42,7 @@ Then starts the emotion detection in a webcam video stream step that will last u The two steps are detailed below. -### Preliminary: +#### Preliminary: - Take [this course](https://www.coursera.org/learn/convolutional-neural-networks). 
This course is a reference for many reasons and one of them is the creator: **Andrew Ng**. He explains the basics of CNNs but also some more advanced topics as transfer learning, siamese networks etc ... - I suggest to focus on Week 1 and 2 and to spend less time on Week 3 and 4. Don't worry the time scoping of such MOOCs are conservative. You can attend the lessons for free! @@ -41,7 +51,7 @@ The two steps are detailed below. - Start first with a logistic regression to understand how to handle images in Python. And then train your first CNN on this data set. -### Face emotions classification +#### Face emotions classification Emotion detection is one of the most researched topics in the modern-day machine learning arena. The ability to accurately detect and identify an emotion opens up numerous doors for Advanced Human Computer Interaction. The aim of this project is to detect up to seven distinct facial emotions in real time. @@ -57,7 +67,7 @@ Your goal is to implement a program that takes as input a video stream that cont This dataset was provided for this past [Kaggle challenge](https://www.kaggle.com/competitions/challenges-in-representation-learning-facial-expression-recognition-challenge/overview). It is possible to find more information about on the challenge page. Train a CNN on the dataset `train.csv`. Here is an [example of architecture](https://www.quora.com/What-is-the-VGG-neural-network) you can implement. **The CNN has to perform more than 60% on the test set**. You can use the `test_with_emotions.csv` file for this. You will see that the CNNs take a lot of time to train. - You don't want to overfit the neural network. I strongly suggest to use early stopping, callbacks and to monitor the training using the `TensorBoard`. + You don't want to overfit the neural network. I strongly suggest to use early stopping, callbacks and to monitor the training using the `TensorBoard` 'note: Integrating TensorBoard is not optional'. You have to save the trained model in `final_emotion_model.keras` and to explain the chosen architecture in `final_emotion_model_arch.txt`. Use `model.summary())` to print the architecture. It is also expected that you explain the iterations and how you end up choosing your final architecture. Save a screenshot of the `TensorBoard` while the model's training in `tensorboard.png` and save a plot with the learning curves showing the model training and stopping BEFORE the model starts overfitting in `learning_curves.png`. @@ -82,7 +92,7 @@ For that step, I suggest again to use **OpenCV** as much as possible. This link - Optional: **(very cool)** Hack the CNN. Take a picture for which the prediction of your CNN is **Happy**. Now, hack the CNN: using the same image **SLIGHTLY** modified make the CNN predict **Sad**. You can find an example on how to achieve this in [this article](https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196) -### Deliverable +### Project repository structure: ``` project @@ -90,7 +100,7 @@ project │   ├── test.csv │   ├── train.csv │   └── xxx.csv -├── environment.yml +├── requirements.txt ├── README.md ├── results │   ├── model @@ -148,7 +158,11 @@ Preprocessing ... ``` -### Useful resources: +### Tips + +Balance technical prowess with psychological insight: as you fine-tune your CNN and optimize your video processing, remember that understanding the nuances of human facial expressions is key to creating a truly effective emotion detection system. 
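As a minimal sketch of the early stopping, callbacks and `TensorBoard` monitoring asked for in the instructions (the hyper-parameters and log directory below are placeholder choices, not the ones you are expected to end up with), the training step could be wired roughly like this with Keras:

```python
import tensorflow as tf

def train(model: tf.keras.Model, x_train, y_train):
    """Train with early stopping and TensorBoard logging, then save the model."""
    callbacks = [
        # Stop once the validation loss stops improving and roll back to the
        # best weights, so the saved model is the one from before overfitting.
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        ),
        # Log the learning curves so they can be inspected in TensorBoard.
        tf.keras.callbacks.TensorBoard(log_dir="results/model/logs"),
    ]
    history = model.fit(
        x_train,
        y_train,
        validation_split=0.2,
        epochs=100,
        batch_size=64,
        callbacks=callbacks,
    )
    model.save("final_emotion_model.keras")  # file name required by the subject
    return history
```

The returned `history` (or the TensorBoard logs) can then be used to produce `learning_curves.png` and the `tensorboard.png` screenshot listed in the deliverables.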
+ +### Resources - https://machinelearningmastery.com/what-is-computer-vision/ @@ -156,6 +170,4 @@ Preprocessing ... - Hack the CNN https://medium.com/@ageitgey/machine-learning-is-fun-part-8-how-to-intentionally-trick-neural-networks-b55da32b7196 -- http://ice.dlut.edu.cn/valse2018/ppt/WeihongDeng_VALSE2018.pdf - - https://arxiv.org/pdf/1812.06387.pdf diff --git a/subjects/ai/emotions-detector/audit/README.md b/subjects/ai/emotions-detector/audit/README.md index fe2a21df0..3fb9a7867 100644 --- a/subjects/ai/emotions-detector/audit/README.md +++ b/subjects/ai/emotions-detector/audit/README.md @@ -1,4 +1,4 @@ -#### Computer vision +#### Emotion detector ##### Preliminary @@ -14,7 +14,7 @@ ###### Is the model trained only the training set? -###### Is the accuracy on the test set higher than 70%? +###### Is the accuracy on the test set higher than 60%? ###### Do the learning curves prove that the model is not overfitting? diff --git a/subjects/ai/kaggle-titanic/README.md b/subjects/ai/kaggle-titanic/README.md index 0f31e79a7..2d931512c 100644 --- a/subjects/ai/kaggle-titanic/README.md +++ b/subjects/ai/kaggle-titanic/README.md @@ -1,6 +1,6 @@ -# Your first Kaggle: Titanic +## Kaggle Titanic -### Introduction +### Overview The goal of this **1 week** project is to get the highest possible score on a Data Science competition. More precisely you will have to predict who survived the Titanic crash. @@ -8,11 +8,11 @@ The goal of this **1 week** project is to get the highest possible score on a Da [titanic]: titanic.jpg "Titanic" -### Kaggle +#### Kaggle Kaggle is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges. It’s a crowd-sourced platform to attract, nurture, train and challenge data scientists from all around the world to solve data science, machine learning and predictive analytics problems. -### Titanic - Machine Learning from Disaster +#### Titanic - Machine Learning from Disaster One of the first Kaggle competition I did was: Titanic - Machine Learning from Disaster. This is a not-to-be-missed Kaggle competition. @@ -22,17 +22,54 @@ The sinking of the Titanic is one of the most infamous shipwrecks in history. On While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. +### Role play + +Ahoy, data explorer! Ready to set sail on the most thrilling voyage of your data science career? Welcome aboard the Kaggle Titanic challenge! You're about to embark on a journey through time, back to that fateful night in 1912. +Your mission, should you choose to accept it (and let's face it, you're already hooked), is to dive deep into the passenger manifest and uncover the secrets of survival. Who lived? Who perished? And most importantly, can you build a model that predicts it all? + +### Learning Objective + In this challenge, you have to build a predictive model that answers the question: **“what sorts of people were more likely to survive?”** using passenger data (ie name, age, gender, socio-economic class, etc). **You will have to submit your prediction on Kaggle**. -### Preliminary +### Instructions + +#### Preliminary -The way the Kaggle platform works is explained in the challenge overview page. 
If you need more details, I suggest this [resource](https://towardsdatascience.com/getting-started-with-kaggle-f9138b35ae18) that gives detailed explanations. +The way the Kaggle platform works is explained in the challenge overview page. If you need more details, I suggest this [resource](https://www.kaggle.com/code/alexisbcook/getting-started-with-kaggle) that gives detailed explanations. - Create a username following this structure: username*01EDU* location_MM_YYYY. Submit the description profile and push it on GitHub the first day of the week. Do not touch this file anymore. - It is possible to have different personal accounts merged in a team for one single competition. -### Deliverables +#### Scores + +In order to validate the project you will have to score at least **79% accuracy on the Leaderboard**: + +- 78.9% accuracy is the minimum score to validate the project. + +Scores indication: + +- 78.9% difficult - minimum required +- 80% very difficult: smart feature engineering needed +- More than 83%: excellent that corresponds to the top 2% on Kaggle +- More than 85%: cheating + +#### Cheating + +It is impossible to get 100%. Who would have predicted that Rose wouldn't let [Jack on the door](https://www.reddit.com/r/titanic/comments/14i0v5j/for_all_the_newbies_proof_its_not_a_door/?rdt=35268) ? + +All people having 100% of accuracy on the Leaderboard cheated, there's no point to compare with them or to cheat. The Kaggle community estimates that having more than 85% is almost considered as cheated submissions as they are element of luck involved in the surviving. + +**You can't use external data sets than the ones provided in that competition.** + +#### The key points + +- **Feature engineering**: + Put yourself in the shoes of an investigator trying to understand what happened exactly in that boat during the crash. Do not hesitate to watch the movie to try to find as many insights as possible. Without a smart the feature engineering there's no way to validate the project ;-) + +- The leaderboard evaluates on test data for which you don't have the labels. It means that there's no point to over fit the train set. Check the over fitting on the train set by dividing the data and by cross-validating the accuracy. + +### Project repository structure ```console project @@ -60,35 +97,7 @@ project - `main.ipynb` This file (single Jupyter Notebook) should contain all steps of data analysis that contributed or not to improve the accuracy, the feature engineering, the model's training and prediction on the test set. It has to be commented to help the reviewers understand the approach and run the code without any bugs. - **Submit your predictions on the Kaggle's competition platform**. Check your ranking and score in the leaderboard. -### Scores - -In order to validate the project you will have to score at least **79% accuracy on the Leaderboard**: - -- 78.9% accuracy is the minimum score to validate the project. - -Scores indication: - -- 78.9% difficult - minimum required -- 80% very difficult: smart feature engineering needed -- More than 83%: excellent that corresponds to the top 2% on Kaggle -- More than 85%: cheating - -#### Cheating - -It is impossible to get 100%. Who would have predicted that Rose wouldn't let [Jack on the door](https://www.insider.com/jack-and-rose-werent-on-a-door-in-titanic-2019-7) ? - -All people having 100% of accuracy on the Leaderboard cheated, there's no point to compare with them or to cheat. 
The Kaggle community estimates that having more than 85% is almost considered as cheated submissions as they are element of luck involved in the surviving. - -**You can't use external data sets than the ones provided in that competition.** - -### The key points - -- **Feature engineering**: - Put yourself in the shoes of an investigator trying to understand what happened exactly in that boat during the crash. Do not hesitate to watch the movie to try to find as many insights as possible. Without a smart the feature engineering there's no way to validate the project ;-) - -- The leaderboard evaluates on test data for which you don't have the labels. It means that there's no point to over fit the train set. Check the over fitting on the train set by dividing the data and by cross-validating the accuracy. - -### Advice +### Tips Don't try to build the perfect model the first day. Iterate a lot and test your assumptions: diff --git a/subjects/ai/kaggle-titanic/audit/README.md b/subjects/ai/kaggle-titanic/audit/README.md index 78dda58e1..43f81ce05 100644 --- a/subjects/ai/kaggle-titanic/audit/README.md +++ b/subjects/ai/kaggle-titanic/audit/README.md @@ -1,4 +1,4 @@ -#### First Kaggle: Titanic +#### Kaggle Titanic ##### Preliminary diff --git a/subjects/ai/nlp-scraper/README.md b/subjects/ai/nlp-scraper/README.md index b7c1741fc..69545a6bf 100644 --- a/subjects/ai/nlp-scraper/README.md +++ b/subjects/ai/nlp-scraper/README.md @@ -1,4 +1,4 @@ -## NLP-enriched News Intelligence platform +## NLP Scraper The goal of this project is to build an NLP-enriched News Intelligence platform. News analysis is a trending and important topic. The analysts get @@ -10,7 +10,25 @@ The platform connects to a news data source, detects the entities, detects the topic of the article, analyses the sentiment and performs a scandal detection analysis. -### Scraper +### Role Play + +You're a Natural Language Processing (NLP) specialist at a tech startup developing a sentiment analysis tool for social media posts. Your task is to build the preprocessing pipeline and create a bag-of-words representation for tweet analysis. + +### Learning Objectives + +1. Set up an NLP-focused Python environment +2. Implement basic text preprocessing techniques (lowercase, punctuation removal) +3. Perform text tokenization at sentence and word levels +4. Remove stop words from text data +5. Apply stemming to reduce words to their root forms +6. Create a complete text preprocessing pipeline +7. Implement a bag-of-words model using CountVectorizer +8. Analyze word frequency in a corpus of tweets +9. Prepare a labeled dataset for sentiment analysis + +### Instructions + +#### Scraper News data source: @@ -29,7 +47,7 @@ Use data from the last week otherwise the volume may be too high. There should be at least 300 articles stored in your file system or SQL database. -### NLP engine +#### NLP engine In production architectures, the NLP engine delivers a live output based on the news that are delivered in a live stream data by the scraper. However, it @@ -41,7 +59,7 @@ the stored data. Here how the NLP engine should process the news: -#### **1. Entities detection:** +##### **1. Entities detection:** The goal is to detect all the entities in the document (headline and body). The type of entity we focus on is `ORG`. This corresponds to companies and @@ -52,7 +70,7 @@ organizations. This information should be stored. 
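A minimal sketch with spaCy (the `en_core_web_sm` pipeline and the function name are just example choices) could look like this:

```python
import spacy

# Small English pipeline; install it first with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_orgs(text: str) -> list:
    """Return the unique ORG entities found in a headline or article body."""
    doc = nlp(text)
    return sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"})

print(extract_orgs("Apple and Exxon Mobil were mentioned in the same article."))
```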
[Named Entity Recognition with NLTK and SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da) -#### **2. Topic detection:** +##### **2. Topic detection:** The goal is to detect what the article is dealing with: Tech, Sport, Business, Entertainment or Politics. To do so, a labelled dataset is provided: [training @@ -68,7 +86,7 @@ that the model is trained correctly and not overfitted. - Learning constraints: **Score on test: > 95%** -#### **3. Sentiment analysis:** +##### **3. Sentiment analysis:** The goal is to detect the sentiment (positive, negative or neutral) of the news articles. To do so, use a pre-trained sentiment model. I suggest to use: @@ -82,7 +100,7 @@ articles. To do so, use a pre-trained sentiment model. I suggest to use: - [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) -#### **4. Scandal detection** +##### **4. Scandal detection** The goal is to detect environmental disaster for the detected companies. Here is the methodology that should be used: @@ -107,7 +125,7 @@ is the methodology that should be used: - Flag the top 10 articles. -#### 5. **Source analysis (optional)** +##### 5. **Source analysis (optional)** The goal is to show insights about the news' source you scraped. This requires to scrap data on at least 5 days (a week ideally). Save the plots @@ -127,7 +145,7 @@ Here are examples of insights: - Companies mentioned the most - Sentiment per companies -### Deliverables +### Project repository structure: The expected structure of the project is: @@ -212,7 +230,7 @@ python scraper_news.py Environmental scandal detected for ``` -### Notions +### Resources - [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0) - [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) diff --git a/subjects/ai/nlp-scraper/audit/README.md b/subjects/ai/nlp-scraper/audit/README.md index e38aa8144..225dbeba4 100644 --- a/subjects/ai/nlp-scraper/audit/README.md +++ b/subjects/ai/nlp-scraper/audit/README.md @@ -1,4 +1,4 @@ -#### NLP-enriched News Intelligence platform +#### NLP Scraper ##### Preliminary diff --git a/subjects/ai/sp500-strategies/README.md b/subjects/ai/sp500-strategies/README.md index b7d88c1ab..71e0767ee 100644 --- a/subjects/ai/sp500-strategies/README.md +++ b/subjects/ai/sp500-strategies/README.md @@ -1,4 +1,6 @@ -## Financial strategies on the SP500 +## SP500 strategies + +### Overview In this project, you'll apply machine learning to finance. Your goal as a Quant/Data Scientist is to create a financial strategy that uses a signal generated by a machine learning model to outperform the [SP500](https://en.wikipedia.org/wiki/S%26P_500). @@ -6,15 +8,19 @@ The S&P 500 Index is a collection of 500 stocks that represent the overall perfo The S&P 500 started in 1926 with only 90 stocks and has grown to include 500 stocks since 1957. Historically, the average annual return of the S&P 500 has been about 10-11% since 1926, and around 8% since 1957. +### Role play + As a Quantitative Researcher, your challenge is to develop a strategy that can consistently outperform the S&P 500, not just in one year, but over many years. This is a difficult task and is the primary goal of many hedge funds around the world. 
-The project is divided in parts: +### Learning Objective - **Data processing and feature engineering**: Build a dataset: insightful features and the target - **Machine Learning pipeline**: Train machine learning models on the dataset, select the best model and generate the machine learning signal. - **Strategy backtesting**: Generate a strategy from the Machine Learning model output and backtest the strategy. As a reminder, the idea here is to see what would have performed the strategy if you had invested. -### Data processing and features engineering +### Instructions + +#### Data processing and features engineering The file `HistoricalData.csv` contains the open-high-low-close (OHLC) SP500 index data and the other file, `all_stocks_5yr.csv`, contains the open-high-low-close-volume (OHLCV) data on the SP500 constituents. @@ -42,7 +48,7 @@ We assume it is day `D`, and we want to take a position on the next n days. The > Remark: The target used is the return computed on the price and not the price directly. There are statistical reasons for this choice - the price is not stationary. The consequence is that a machine learning model tends to overfit while training on not stationary data. -### Machine learning pipeline +#### Machine learning pipeline - Cross-validation deliverables: - Implements a cross validation with at least 10 folds. The train set has to be bigger than 2 years history. @@ -80,7 +86,7 @@ Once you'll have run the grid search on the cross validation (choose either Bloc - (optional): [Train an RNN/LSTM](https://towardsdatascience.com/predicting-stock-price-with-lstm-13af86a74944). This is a nice way to discover and learn about recurrent neural networks. But keep in mind that there are some new neural network architectures that seem to outperform recurrent neural networks. Here is an [interesting article](https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0) about the topic. -### Strategy backtesting +#### Strategy backtesting - Backtesting module deliverables. The module takes as input a machine learning signal, convert it into a financial strategy. A financial strategy DataFrame gives the amount invested at time `t` on asset `i`. The module returns the following metrics on the train set and the test set. - Profit and Loss (PnL) plot: save it as `strategy.png` @@ -107,7 +113,7 @@ Once you'll have run the grid search on the cross validation (choose either Bloc - PnL plot - strategy metrics on the train set and test set -### Example of strategies: +#### Example of strategies: - Long only: - Binary signal: @@ -172,7 +178,7 @@ Here's an example on how to convert a machine learning signal into a financial s project ├── data │   └── sp500.csv -├── environment.yml +├── requirements.txt ├── README.md ├── results │   ├── cross-validation @@ -199,7 +205,10 @@ project Note: `features_engineering.py` can be used in `gridsearch.py` -### Files for this project +### Tips + +Remember, the goal of this project is not just to beat the S&P 500 in a backtest, but to learn about the process of developing and testing trading strategies using machine learning techniques. 
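To make the signal-to-strategy conversion described above more tangible, here is a rough long-only sketch (equal $1 weights and a generic column layout are illustrative assumptions, not requirements):

```python
import pandas as pd

def signal_to_positions(signal: pd.DataFrame, n_stocks: int = 20) -> pd.DataFrame:
    """Convert a machine learning signal into a long-only strategy DataFrame.

    `signal` is assumed to be indexed by date with one column per asset and to
    contain the model output (e.g. a predicted return). The result gives the
    amount invested at time t on asset i: $1 on each of the `n_stocks` assets
    with the highest signal, 0 elsewhere.
    """
    ranks = signal.rank(axis=1, ascending=False, method="first")
    return (ranks <= n_stocks).astype(float)

def cumulative_pnl(positions: pd.DataFrame, returns: pd.DataFrame) -> pd.Series:
    """Cumulative PnL: sum over assets of the position times the realised return."""
    # Shift positions by one period so that a decision taken at time t is
    # evaluated on the return realised over the following period (no look-ahead).
    per_period = (positions.shift(1) * returns).sum(axis=1)
    return per_period.cumsum()
```

The cumulative series is what you would plot and save as `strategy.png`, next to the same computation applied to the benchmark.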
+ +### Resources -You can find the data required for this project in this : -[link](https://assets.01-edu.org/ai-branch/project4/project04-20221031T173034Z-001.zip) +You can find the data required for this project in this : [link](https://assets.01-edu.org/ai-branch/project4/project04-20221031T173034Z-001.zip) diff --git a/subjects/ai/sp500-strategies/audit/README.md b/subjects/ai/sp500-strategies/audit/README.md index 5c5b361d4..499436301 100644 --- a/subjects/ai/sp500-strategies/audit/README.md +++ b/subjects/ai/sp500-strategies/audit/README.md @@ -1,4 +1,4 @@ -#### Financial strategies on the SP500 +#### SP500 strategies ###### Is the structure of the project like the one presented in the `Project repository structure` in the subject?