
CON-2393 clarify output of `nlp_enriched_news.py` script (#2419)

* chore(nlp-scraper): fix small grammar mistakes and improve readability

* feat(nlp-scraper): add link to datasets provided

* feat(nlp-scraper): add clarification about sentiment analysis

* feat(nlp-scraper): define how many articles are expected to be scraped

* chore(nlp-scraper): improve grammar and readability

* chore(nlp-scraper): fix typos

* feat(nlp-scraper): add label to link

* feat(nlp-scraper): remove audit question not related to the project

* refactor(nlp-scraper): refactor question

* chore(nlp-scraper): fix small typos

* feat(nlp-scraper): add information on how to calculate scandal

* feat(nlp-scraper): add details to the deliverable section

* feat(nlp-scraper): add reference to subject in audit

* feat(nlp-scraper): update project structure

- run prettier

* feat(nlp-scraper): complete sentence in subject intro

- make formatting consistent with 01 subject
branch: pull/2437/head
Niccolò Primo committed 9 months ago (via GitHub)
commit c6d8ca334a
Changed files:
  1. subjects/ai/nlp-scraper/README.md (94 changed lines)
  2. subjects/ai/nlp-scraper/audit/README.md (67 changed lines)

subjects/ai/nlp-scraper/README.md (94 changed lines)

@@ -1,4 +1,4 @@
-# NLP-enriched News Intelligence platform
+## NLP-enriched News Intelligence platform
 The goal of this project is to build an NLP-enriched News Intelligence
 platform. News analysis is a trending and important topic. The analysts get
@@ -7,7 +7,8 @@ limitless. Having a platform that helps to detect the relevant information is
 definitely valuable.
 The platform connects to a news data source, detects the entities, detects the
-topic of the article, analyse the sentiment and ...
+topic of the article, analyses the sentiment and performs a scandal detection
+analysis.
 ### Scraper
@@ -40,7 +41,7 @@ the stored data.
 Here how the NLP engine should process the news:
-### **1. Entities detection:**
+#### **1. Entities detection:**
 The goal is to detect all the entities in the document (headline and body). The
 type of entity we focus on is `ORG`. This corresponds to companies and
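
For illustration, here is a minimal sketch of the entity-detection step introduced in the hunk above, assuming spaCy and its `en_core_web_sm` model (the subject points to NLTK/spaCy tutorials but does not prescribe this exact code):

```python
# Minimal sketch: extract ORG entities from an article with spaCy.
# Assumes the model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_orgs(headline: str, body: str) -> list[str]:
    """Return the unique ORG entities found in the headline and body."""
    doc = nlp(f"{headline}\n{body}")
    return sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"})

print(extract_orgs("BP fined over oil spill", "BP and Shell face new lawsuits."))
```
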
@@ -51,7 +52,7 @@ organizations. This information should be stored.
 [Named Entity Recognition with NLTK and
 SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
-### **2. Topic detection:**
+#### **2. Topic detection:**
 The goal is to detect what the article is dealing with: Tech, Sport, Business,
 Entertainment or Politics. To do so, a labelled dataset is provided: [training
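
The topic-detection step asks for a classifier trained on the provided labelled dataset and saved to `results/topic_classifier.pkl`. A hedged sketch, assuming a TF-IDF plus logistic-regression pipeline and placeholder file/column names (`text`, `category`) that must be adapted to the dataset actually provided:

```python
# Sketch of topic detection: train a simple text classifier and persist it.
# The CSV path and column names are assumptions, not part of the subject.
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("data/topic_classification_data.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))  # sanity check against overfitting

with open("results/topic_classifier.pkl", "wb") as f:
    pickle.dump(model, f)
```
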
@@ -71,7 +72,7 @@ that the model is trained correctly and not overfitted.
 [following](https://www.kaggle.com/rmisra/news-category-dataset) which is
 based on 200k news headlines.
-### **3. Sentiment analysis:**
+#### **3. Sentiment analysis:**
 The goal is to detect the sentiment (positive, negative or neutral) of the news
 articles. To do so, use a pre-trained sentiment model. I suggest to use:
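
For the sentiment-analysis step above, a minimal sketch assuming NLTK's pre-trained VADER model as one possible choice; substitute whichever pre-trained model the subject actually suggests:

```python
# Sketch of sentiment analysis with a pre-trained model (VADER, via NLTK).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def article_sentiment(text: str) -> str:
    """Map VADER's compound score to positive / negative / neutral."""
    compound = sia.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(article_sentiment("The company was praised for its quick response."))
```
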
@@ -85,29 +86,32 @@ articles. To do so, use a pre-trained sentiment model. I suggest to use:
 - [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
-### **4. Scandal detection **
+#### **4. Scandal detection**
 The goal is to detect environmental disaster for the detected companies. Here
 is the methodology that should be used:
 - Define keywords that correspond to environmental disaster that may be caused
-by companies: pollution, deforestation etc ... Here is an example of disaster
-we want to detect: https://en.wikipedia.org/wiki/MV_Erika. Pay attention to
-not use ambiguous words that make sense in the context of an environmental
-disaster but also in another context. This would lead to detect a false
-positive natural disaster.
-- Compute the embeddings of the keywords.
-- Compute the distance between the embeddings of the keywords and all sentences
-that contain an entity. Explain in the `README.md` the embeddings chosen and
-why. Similarly explain the distance or similarity chosen and why.
-- Save the distance
+by companies: pollution, deforestation etc ... Here is [an example of
+disaster we want to detect](https://en.wikipedia.org/wiki/MV_Erika). Pay
+attention to not use ambiguous words that make sense in the context of an
+environmental disaster but also in another context. This would lead to detect
+a false positive natural disaster.
+- Compute the [embeddings of the
+keywords](https://en.wikipedia.org/wiki/Word_embedding#Software).
+- Compute the distance ([here some
+examples](https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance))
+between the embeddings of the keywords and all sentences that contain an
+entity. Explain in the `README.md` the embeddings chosen and why. Similarly
+explain the distance or similarity chosen and why.
+- Save a metric to unify all the distances calculated per article.
 - Flag the top 10 articles.
-### 5. **Source analysis (optional)**
+#### 5. **Source analysis (optional)**
 The goal is to show insights about the news' source you scraped.
 This requires to scrap data on at least 5 days (a week ideally). Save the plots
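
The scandal-detection methodology in the bullets above (embed disaster keywords, compare them to every sentence that mentions a detected `ORG`, keep one aggregated distance per article, flag the top 10) could be sketched as follows. spaCy vectors and cosine distance are illustrative choices only; the subject asks you to pick and justify your own embedding and metric:

```python
# Sketch of scandal detection: keyword embeddings vs. entity sentences.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

KEYWORDS = ["pollution", "deforestation", "oil spill", "toxic waste"]
keyword_vecs = np.array([nlp(k).vector for k in KEYWORDS])

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def scandal_distance(body: str) -> float:
    """Smallest keyword-to-sentence distance over sentences containing an ORG."""
    doc = nlp(body)
    distances = []
    for sent in doc.sents:
        if any(ent.label_ == "ORG" for ent in sent.ents):
            distances.extend(cosine_distance(sent.vector, kv) for kv in keyword_vecs)
    return min(distances) if distances else float("inf")

def flag_top_10(articles: list[dict]) -> list[dict]:
    """articles: list of dicts with at least a 'body' key (assumed structure)."""
    for article in articles:
        article["scandal_distance"] = scandal_distance(article["body"])
    ranked = sorted(articles, key=lambda a: a["scandal_distance"])
    for i, article in enumerate(ranked):
        article["top_10"] = i < 10  # flag the 10 articles closest to the keywords
    return ranked
```
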
@@ -129,24 +133,20 @@ Here are examples of insights:
 ### Deliverables
-The structure of the project is:
+The expected structure of the project is:
 ```
 project
-│   README.md
-│   environment.yml
-└───data
-│   │   topic_classification_data.csv
-└───results
-│   │   topic_classifier.pkl
-│   │   learning_curves.png
-│   │   enhanced_news.csv
-|
-|───nlp_engine
+.
+├── data
+│   └── date_scrape_data.csv
+├── nlp_enriched_news.py
+├── README.md
+├── results
+│   ├── topic_classifier.pkl
+│   ├── enhanced_news.csv
+│   └── learning_curves.png
+└── scraper_news.py
 ```
 1. Run the scraper until it fetches at least 300 articles
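
For step 1, a hedged sketch of what `scraper_news.py` could do: fetch article pages and append them to `data/date_scrape_data.csv` until at least 300 articles are stored. The news source, URL list, and HTML selectors are placeholders to adapt to the site you actually scrape:

```python
# Sketch of scraper_news.py: fetch articles and append them to a CSV.
import csv
import os
import time
from datetime import date

import requests
from bs4 import BeautifulSoup

CSV_PATH = "data/date_scrape_data.csv"
ARTICLE_URLS = ["https://example.com/news/article-1"]  # placeholder list

def scrape_article(url: str) -> dict:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "url": url,
        "date_scraped": date.today().isoformat(),
        "headline": soup.find("h1").get_text(strip=True),  # selector is site-specific
        "body": " ".join(p.get_text(strip=True) for p in soup.find_all("p")),
    }

write_header = not os.path.exists(CSV_PATH) or os.path.getsize(CSV_PATH) == 0
with open(CSV_PATH, "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "date_scraped", "headline", "body"])
    if write_header:
        writer.writeheader()
    for url in ARTICLE_URLS:
        writer.writerow(scrape_article(url))
        time.sleep(1)  # be polite to the news site
```
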
@@ -166,19 +166,25 @@ python scraper_news.py
 ```
-2. Run on these 300 articles the NLP engine.
-Save a `DataFrame`:
-Date scraped (date)
-Title (`str`)
-URL (`str`)
-Body (`str`)
-Org (`str`)
-Topics (`list str`)
-Sentiment (`list float1 or `float`)
-Scandal_distance (`float`)
-Top_10 (`bool`)
+2. Run on these 300 articles the NLP engine. The script `nlp_enriched_news.py`
+should:
+- Save a `DataFrame` with the following structure:
+```
+Unique ID (`uuid` or `int`)
+URL (`str`)
+Date scraped (`date`)
+Headline (`str`)
+Body (`str`)
+Org (`list str`)
+Topics (`list str`)
+Sentiment (`list float` or `float`)
+Scandal_distance (`float`)
+Top_10 (`bool`)
+```
+- Have a similar output while it processes the articles
 ```prompt
 python nlp_enriched_news.py
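
For step 2, a sketch of assembling the enriched `DataFrame` with the columns listed above and saving it to `results/enhanced_news.csv`. Here `enrich_article` is only a placeholder for your own NLP pipeline (entities, topic, sentiment, scandal distance); only the column names come from the subject:

```python
# Sketch: build the enriched DataFrame from the scraped articles.
import uuid

import pandas as pd

def enrich_article(article: dict) -> dict:
    """Placeholder: run the NLP engine on one scraped article."""
    return {
        "Unique ID": str(uuid.uuid4()),
        "URL": article["url"],
        "Date scraped": article["date_scraped"],
        "Headline": article["headline"],
        "Body": article["body"],
        "Org": [],                # filled by entity detection
        "Topics": [],             # filled by topic detection
        "Sentiment": 0.0,         # filled by sentiment analysis
        "Scandal_distance": 0.0,  # filled by scandal detection
        "Top_10": False,
    }

articles = pd.read_csv("data/date_scrape_data.csv").to_dict("records")
enriched = pd.DataFrame([enrich_article(a) for a in articles])
enriched.to_csv("results/enhanced_news.csv", index=False)
```
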
@@ -200,8 +206,7 @@ The topic of the article is: <topic>
 ---------- Sentiment analysis ----------
 Text preprocessing ... (optional)
-The title which is <title> is <sentiment>
-The body of the article is <sentiment>
+The article <title> has a <sentiment> sentiment
 ---------- Scandal detection ----------
@@ -210,8 +215,11 @@ Computing embeddings and distance ...
 Environmental scandal detected for <entity>
 ```
-I strongly suggest creating a data structure (dictionary for example) to save all the intermediate result. Then, a boolean argument `cache` fetched the intermediate results when they are already computed.
+> I strongly suggest creating a data structure (dictionary for example) to save
+> all the intermediate results. Then, a boolean argument `cache` fetches the
+> intermediate results when they are already computed.
-Resources:
+### Notions
-- https://www.youtube.com/watch?v=XVv6mJpFOb0
+- [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)
+- [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
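
A sketch of the caching suggestion quoted in the hunk above: keep intermediate results in a dictionary persisted to disk and reuse them when `cache=True`. The cache file name and pickle format are illustrative choices, not requirements:

```python
# Sketch of a dictionary-based cache for intermediate NLP results.
import os
import pickle

CACHE_PATH = "results/intermediate_cache.pkl"  # assumed location

def load_cache() -> dict:
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    return {}

def save_cache(results: dict) -> None:
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(results, f)

def enrich(url: str, cache: bool = True) -> dict:
    """Compute (or fetch from cache) the intermediate results for one article URL."""
    results = load_cache() if cache else {}
    if url in results:
        return results[url]
    results[url] = {"entities": [], "topic": None, "sentiment": None}  # placeholder work
    save_cache(results)
    return results[url]
```
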

subjects/ai/nlp-scraper/audit/README.md (67 changed lines)

@@ -2,25 +2,7 @@
 ##### Preliminary
-```
-project
-│   README.md
-│   environment.yml
-└───data
-│   │   topic_classification_data.csv
-└───results
-│   │   topic_classifier.pkl
-│   │   learning_curves.png
-│   │   enhanced_news.csv
-|
-|───nlp_engine
-```
-###### Does the structure of the project look like the above?
+###### Does the structure of the project look like the one described in the subject?
 ###### Does the environment contain all libraries used and their versions that are necessary to run the code?
@@ -54,20 +36,7 @@ project
 ###### Does the DataFrame contain 300 different rows?
-###### Are the columns of the DataFrame as expected?
-```
-Date scraped (date)
-Title (str)
-URL (str)
-Body (str)
-Org (str)
-Topics (list str)
-Sentiment (list float or float)
-Scandal_distance (float)
-Top_10 (bool)
-```
+###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
 ##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
@@ -75,36 +44,6 @@ Top_10 (bool)
 ###### Can you run `python nlp_enriched_news.py` without any error?
-###### Does the output of the NLP engine correspond to the output below?
-```prompt
-python nlp_enriched_news.py
-Enriching <URL>:
-Cleaning document ... (optional)
----------- Detect entities ----------
-Detected <X> companies which are <company_1> and <company_2>
----------- Topic detection ----------
-Text preprocessing ...
-The topic of the article is: <topic>
----------- Sentiment analysis ----------
-Text preprocessing ... (optional)
-The title which is <title> is <sentiment>
-The body of the article is <sentiment>
----------- Scandal detection ----------
-Computing embeddings and distance ...
-Environmental scandal detected for <entity>
-```
+###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
 ##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
