CON-2393 clarify output of `nlp_enriched_news.py` script (#2419)
* chore(nlp-scraper): fix small grammar mistakes and improve readability
* feat(nlp-scraper): add link to datasets provided
* feat(nlp-scraper): add clarification about sentiment analysis
* feat(nlp-scraper): define how many articles are expected to be scraped
* chore(nlp-scraper): improve grammar and readability
* chore(nlp-scraper): fix typos
* feat(nlp-scraper): add label to link
* feat(nlp-scraper): remove audit question not related to the project
* refactor(nlp-scraper): refactor question
* chore(nlp-scraper): fix small typos
* feat(nlp-scraper): add information on how to calculate scandal
* feat(nlp-scraper): add details to the deliverable section
* feat(nlp-scraper): add reference to subject in audit
* feat(nlp-scraper): update project structure
- run prettier
* feat(nlp-scraper): complete sentence in subject intro
- make formatting consistent with 01 subject
between the embeddings of the keywords and all sentences that contain an
entity. Explain in the `README.md` the embeddings chosen and why. Similarly
explain the distance or similarity chosen and why.
- Save a metric to unify all the distances calculated per article (see the sketch after this list).
- Flag the top 10 articles.
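As a purely illustrative sketch of these three steps, assuming `sentence-transformers` embeddings and cosine distance (the keyword list and toy data are invented, and your own choices must be justified in the `README.md`):

```python
# Minimal sketch, not a required implementation: it assumes sentence-transformers
# embeddings and cosine distance, choices that must be justified in the README.md.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_distances

# Hypothetical keyword list describing environmental scandals.
keywords = ["environmental disaster", "oil spill", "deforestation", "pollution fine"]

model = SentenceTransformer("all-MiniLM-L6-v2")
keyword_emb = model.encode(keywords)

def scandal_distance(entity_sentences):
    """One metric per article: the smallest keyword-to-sentence cosine distance."""
    sentence_emb = model.encode(entity_sentences)
    return float(cosine_distances(keyword_emb, sentence_emb).min())

# Toy data: article id -> sentences that mention a detected entity.
articles = {
    "article-1": ["ACME Corp was fined for dumping waste in the river."],
    "article-2": ["Globex opened a new research lab in Berlin."],
}
distances = {aid: scandal_distance(sents) for aid, sents in articles.items()}

# Flag the 10 articles closest to the scandal keywords.
top_10 = sorted(distances, key=distances.get)[:10]
```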
#### 5. **Source analysis (optional)**
The goal is to show insights about the news sources you scraped.
This requires scraping data over at least 5 days (ideally a week). Save the plots
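As an illustration of the kind of insight expected, a minimal sketch that counts articles per source per day, assuming the daily scrapes end up in `data/date_scrape_data.csv` with `source` and `Date scraped` columns (both names are assumptions):

```python
# Illustrative only: count scraped articles per source per day and save the plot.
import pandas as pd
import matplotlib.pyplot as plt

news = pd.read_csv("data/date_scrape_data.csv", parse_dates=["Date scraped"])

counts = (
    news.groupby([news["Date scraped"].dt.date, "source"])
    .size()
    .unstack(fill_value=0)
)
counts.plot(kind="bar", stacked=True, figsize=(10, 5), title="Articles per source per day")
plt.tight_layout()
plt.savefig("results/articles_per_source.png")
```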
### Deliverables
The expected structure of the project is:

```
.
├── data
│   └── date_scrape_data.csv
├── nlp_enriched_news.py
├── README.md
├── results
│   ├── topic_classifier.pkl
│   ├── enhanced_news.csv
│   └── learning_curves.png
└── scraper_news.py
```
1. Run the scraper until it fetches at least 300 articles
```prompt
python scraper_news.py
```
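A quick way to know when that threshold is reached is to count the rows the scraper has saved; a minimal sketch, assuming one CSV per scraping run under `data/` with one row per article:

```python
# Quick check (not part of the deliverable): count the articles saved so far,
# assuming the scraper writes one CSV per run under data/ with one row per article.
from pathlib import Path
import pandas as pd

total = sum(len(pd.read_csv(path)) for path in Path("data").glob("*.csv"))
print(f"{total} articles scraped so far")
```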
2. Run the NLP engine on these 300 articles. The script `nlp_enriched_news.py`
   should:
- Save a `DataFrame` with the following structure (an illustrative row is sketched below, after the output example):
```
Unique ID (`uuid` or `int`)
URL (`str`)
Date scraped (`date`)
Headline (`str`)
Body (`str`)
Org (`list str`)
Topics (`list str`)
Sentiment (`list float` or `float`)
Scandal_distance (`float`)
Top_10 (`bool`)
```
- Have a similar output while it processes the articles:

```prompt
python nlp_enriched_news.py
Enriching <URL>:
Cleaning document ... (optional)
---------- Detect entities ----------
Detected <X> companies which are <company_1> and <company_2>
---------- Topic detection ----------
Text preprocessing ...
The topic of the article is: <topic>
---------- Sentiment analysis ----------
Text preprocessing ... (optional)
The article <title> has a <sentiment> sentiment
---------- Scandal detection ----------
Computing embeddings and distance ...
Environmental scandal detected for <entity>
```

> I strongly suggest creating a data structure (a dictionary, for example) to save
> all the intermediate results. Then, a boolean argument `cache` can fetch the
> intermediate results when they have already been computed.
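For illustration only, one enriched row with the columns listed in the `DataFrame` structure above could be assembled like this (all values are placeholders):

```python
# Sketch of one enriched row with the columns listed in the subject; every value
# is a placeholder, and the exact column names/types are up to you to finalise.
import uuid
from datetime import date
import pandas as pd

row = {
    "Unique ID": str(uuid.uuid4()),
    "URL": "https://example.com/news/acme-river",   # placeholder URL
    "Date scraped": date.today().isoformat(),
    "Headline": "ACME fined over river pollution",  # placeholder headline
    "Body": "ACME Corp was fined after ...",        # full article text
    "Org": ["ACME Corp"],                           # from the entity detection step
    "Topics": ["environment"],                      # from the topic classifier
    "Sentiment": -0.7,                              # float, or list of floats per sentence
    "Scandal_distance": 0.31,                       # from the scandal detection step
    "Top_10": True,                                 # set after ranking all articles
}

enhanced_news = pd.DataFrame([row])
enhanced_news.to_csv("results/enhanced_news.csv", index=False)
```

And a minimal sketch of the caching idea from the note above (the cache file name and the `compute` callable are assumptions, not requirements):

```python
# Sketch of the caching idea: intermediate results keyed by article URL and
# persisted to disk, so a re-run with cache=True skips articles already enriched.
import json
from pathlib import Path

CACHE_FILE = Path("results/nlp_cache.json")  # hypothetical location

def enrich(article, compute, cache=True):
    """`compute` is your own function returning the enrichment dict for one article."""
    store = json.loads(CACHE_FILE.read_text()) if cache and CACHE_FILE.exists() else {}
    url = article["url"]
    if cache and url in store:
        return store[url]              # reuse the previously computed result
    store[url] = compute(article)
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(store))
    return store[url]
```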
Resources:
###### Does the structure of the project look like the one described in the subject?
###### Does the environment contain all libraries used and their versions that are necessary to run the code?
###### Does the DataFrame contain 300 different rows?
###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
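If helpful, a small sketch to check both points, assuming the `DataFrame` was delivered as `results/enhanced_news.csv` (the column names below are taken from the subject and may differ in the student's implementation):

```python
# Optional helper for the audit: load the delivered CSV and compare the row
# count and columns against the subject's Deliverable section.
import pandas as pd

expected_columns = {
    "Unique ID", "URL", "Date scraped", "Headline", "Body",
    "Org", "Topics", "Sentiment", "Scandal_distance", "Top_10",
}

df = pd.read_csv("results/enhanced_news.csv")
print("rows:", len(df), "(at least 300 expected)")
print("missing columns:", expected_columns - set(df.columns))
```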
##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
###### Can you run `python nlp_enriched_news.py` without any error?
###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.