CON-2393 clarify output of `nlp_enriched_news.py` script (#2419)
* chore(nlp-scraper): fix small grammar mistakes and improve readability
* feat(nlp-scraper): add link to datasets provided
* feat(nlp-scraper): add clarification about sentiment analysis
* feat(nlp-scraper): define how many articles are expected to be scraped
* chore(nlp-scraper): improve grammar and readability
* chore(nlp-scraper): fix typos
* feat(nlp-scraper): add label to link
* feat(nlp-scraper): remove audit question not related to the project
* refactor(nlp-scraper): refactor question
* chore(nlp-scraper): fix small typos
* feat(nlp-scraper): add information on how to calculate scandal
* feat(nlp-scraper): add details to the deliverable section
* feat(nlp-scraper): add reference to subject in audit
* feat(nlp-scraper): update project structure
- run prettier
* feat(nlp-scraper): complete sentence in subject intro
- make formatting consistent with 01 subject
between the embeddings of the keywords and all sentences that contain an
entity. Explain in the `README.md` the embeddings chosen and why. Similarly
explain the distance or similarity chosen and why.
- Save a metric to unify all the distances calculated per article (see the sketch after this list).
- Flag the top 10 articles.
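As a purely illustrative sketch of these three steps, assuming `sentence-transformers` embeddings and cosine distance (the keyword list and toy data are invented, and your own choices must be justified in the `README.md`):

```python
# Minimal sketch, not a required implementation: it assumes sentence-transformers
# embeddings and cosine distance, choices that must be justified in the README.md.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical keyword list describing environmental scandals.
keywords = ["environmental disaster", "oil spill", "deforestation", "pollution fine"]

model = SentenceTransformer("all-MiniLM-L6-v2")
keyword_emb = model.encode(keywords)

def scandal_distance(entity_sentences):
    """One metric per article: the smallest keyword-to-sentence cosine distance."""
    sentence_emb = model.encode(entity_sentences)
    return float(cosine_distances(keyword_emb, sentence_emb).min())

# Toy data: article id -> sentences that mention a detected entity.
articles = {
    "article-1": ["ACME Corp was fined for dumping waste in the river."],
    "article-2": ["Globex opened a new research lab in Berlin."],
}
distances = {aid: scandal_distance(sents) for aid, sents in articles.items()}

# Flag the 10 articles closest to the scandal keywords.
top_10 = sorted(distances, key=distances.get)[:10]
```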
#### 5. **Source analysis (optional)**
The goal is to show insights about the news sources you scraped.
This requires scraping data over at least 5 days (ideally a week). Save the plots
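As an illustration of the kind of insight expected, a minimal sketch that counts articles per source per day, assuming the daily scrapes end up in `data/date_scrape_data.csv` with `source` and `Date scraped` columns (both names are assumptions):

```python
# Illustrative only: count scraped articles per source per day and save the plot.
import pandas as pd
import matplotlib.pyplot as plt

news = pd.read_csv("data/date_scrape_data.csv", parse_dates=["Date scraped"])

counts = (
    news.groupby([news["Date scraped"].dt.date, "source"])
    .size()
    .unstack(fill_value=0)
)
counts.plot(kind="bar", stacked=True, figsize=(10, 5), title="Articles per source per day")
plt.tight_layout()
plt.savefig("results/articles_per_source.png")
```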
### Deliverables
The expected structure of the project is:

```
.
├── data
│   └── date_scrape_data.csv
├── nlp_enriched_news.py
├── README.md
├── results
│   ├── topic_classifier.pkl
│   ├── enhanced_news.csv
│   └── learning_curves.png
└── scraper_news.py
```
1. Run the scraper until it fetches at least 300 articles
```prompt
python scraper_news.py
```
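A quick way to know when that threshold is reached is to count the rows the scraper has saved; a minimal sketch, assuming one CSV per scraping run under `data/` with one row per article:

```python
# Quick check (not part of the deliverable): count the articles saved so far,
# assuming the scraper writes one CSV per run under data/ with one row per article.
from pathlib import Path
import pandas as pd

total = sum(len(pd.read_csv(path)) for path in Path("data").glob("*.csv"))
print(f"{total} articles scraped so far")
```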
2. Run the NLP engine on these 300 articles. The script `nlp_enriched_news.py`
   should:
- Save a `DataFrame` with the following structure (an illustrative row is sketched below, after the output example):
```
Unique ID (`uuid` or `int`)
URL (`str`)
Date scraped (`date`)
Headline (`str`)
Body (`str`)
Org (`list str`)
Topics (`list str`)
Sentiment (`list float` or `float`)
Scandal_distance (`float`)
Top_10 (`bool`)
```
- Have a similar output while it processes the articles:

```prompt
python nlp_enriched_news.py
Enriching <URL>:
Cleaning document ... (optional)
---------- Detect entities ----------
Detected <X> companies which are <company_1> and <company_2>
---------- Topic detection ----------
Text preprocessing ...
The topic of the article is: <topic>
---------- Sentiment analysis ----------
Text preprocessing ... (optional)
The article <title> has a <sentiment> sentiment
---------- Scandal detection ----------
Computing embeddings and distance ...
Environmental scandal detected for <entity>
```

> I strongly suggest creating a data structure (a dictionary, for example) to save
> all the intermediate results. Then, a boolean argument `cache` can fetch the
> intermediate results when they have already been computed.
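For illustration only, one enriched row with the columns listed in the `DataFrame` structure above could be assembled like this (all values are placeholders):

```python
# Sketch of one enriched row with the columns listed in the subject; every value
# is a placeholder, and the exact column names/types are up to you to finalise.
import uuid
from datetime import date
import pandas as pd

row = {
    "Unique ID": str(uuid.uuid4()),
    "URL": "https://example.com/news/acme-river",   # placeholder URL
    "Date scraped": date.today().isoformat(),
    "Headline": "ACME fined over river pollution",  # placeholder headline
    "Body": "ACME Corp was fined after ...",        # full article text
    "Org": ["ACME Corp"],                           # from the entity detection step
    "Topics": ["environment"],                      # from the topic classifier
    "Sentiment": -0.7,                              # float, or list of floats per sentence
    "Scandal_distance": 0.31,                       # from the scandal detection step
    "Top_10": True,                                 # set after ranking all articles
}

enhanced_news = pd.DataFrame([row])
enhanced_news.to_csv("results/enhanced_news.csv", index=False)
```

And a minimal sketch of the caching idea from the note above (the cache file name and the `compute` callable are assumptions, not requirements):

```python
# Sketch of the caching idea: intermediate results keyed by article URL and
# persisted to disk, so a re-run with cache=True skips articles already enriched.
import json
from pathlib import Path

CACHE_FILE = Path("results/nlp_cache.json")  # hypothetical location

def enrich(article, compute, cache=True):
    """`compute` is your own function returning the enrichment dict for one article."""
    store = json.loads(CACHE_FILE.read_text()) if cache and CACHE_FILE.exists() else {}
    url = article["url"]
    if cache and url in store:
        return store[url]              # reuse the previously computed result
    store[url] = compute(article)
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(store))
    return store[url]
```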
Resources:
###### Does the structure of the project look like the one described in the subject?
###### Does the environment contain all libraries used and their versions that are necessary to run the code?
###### Does the DataFrame contain 300 different rows?
###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
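If helpful, a small sketch to check both points, assuming the `DataFrame` was delivered as `results/enhanced_news.csv` (the column names below are taken from the subject and may differ in the student's implementation):

```python
# Optional helper for the audit: load the delivered CSV and compare the row
# count and columns against the subject's Deliverable section.
import pandas as pd

expected_columns = {
    "Unique ID", "URL", "Date scraped", "Headline", "Body",
    "Org", "Topics", "Sentiment", "Scandal_distance", "Top_10",
}

df = pd.read_csv("results/enhanced_news.csv")
print("rows:", len(df), "(at least 300 expected)")
print("missing columns:", expected_columns - set(df.columns))
```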
##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
###### Can you run `python nlp_enriched_news.py` without any error?
###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.