CON-2393 clarify output of `nlp_enriched_news.py` script (#2419)
* chore(nlp-scraper): fix small grammar mistakes and improve readability
* feat(nlp-scraper): add link to datasets provided
* feat(nlp-scraper): add clarification about sentiment analysis
* feat(nlp-scraper): define how many articles are expected to be scraped
* chore(nlp-scraper): improve grammar and readability
* chore(nlp-scraper): fix typos
* feat(nlp-scraper): add label to link
* feat(nlp-scraper): remove audit question not related to the project
* refactor(nlp-scraper): refactor question
* chore(nlp-scraper): fix small typos
* feat(nlp-scraper): add information on how to calculate scandal
* feat(nlp-scraper): add details to the deliverable section
* feat(nlp-scraper): add reference to subject in audit
* feat(nlp-scraper): update project structure
- run prettier
* feat(nlp-scraper): complete sentence in subject intro
- make formatting consistent with 01 subject
between the embeddings of the keywords and all sentences that contain an
entity. Explain in the `README.md` which embeddings you chose and why. Similarly,
explain which distance or similarity measure you chose and why.
- Save the distance
- Save a metric to unify all the distances calculated per article.
- Flag the top 10 articles.
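The three steps above can be sketched as follows. The embedding model is an assumption: `embed` below is a deterministic hash-based stub standing in for whichever real model you pick (word vectors, a transformer encoder, ...), and the keywords, article texts, and the choice of cosine distance with a minimum as the unifying metric are illustrative, not prescribed:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for a real sentence-embedding model."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)  # unit-norm, so cosine = dot product

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b)

keywords = ["pollution", "deforestation", "toxic waste"]  # example scandal keywords
keyword_vecs = [embed(k) for k in keywords]

def scandal_distance(entity_sentences: list[str]) -> float:
    """Unifying metric: minimum keyword/sentence distance over the article."""
    return min(cosine_distance(embed(s), kv)
               for s in entity_sentences
               for kv in keyword_vecs)

articles = {"art-1": ["Acme dumped toxic waste in the river."],
            "art-2": ["Globex reported record quarterly earnings."]}
scores = {aid: scandal_distance(sents) for aid, sents in articles.items()}
top_10 = sorted(scores, key=scores.get)[:10]  # flag the 10 closest articles
```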
#### 5. **Source analysis (optional)**
The goal is to show insights about the news sources you scraped.
This requires scraping data over at least 5 days (ideally a week). Save the plots
@@ -129,24 +133,20 @@ Here are examples of insights:
### Deliverables
The expected structure of the project is:
```
.
├── data
│ └── date_scrape_data.csv
├── nlp_enriched_news.py
├── README.md
├── results
│ ├── topic_classifier.pkl
│ ├── enhanced_news.csv
│ └── learning_curves.png
└── scraper_news.py
```
1. Run the scraper until it fetches at least 300 articles
@@ -166,52 +166,60 @@ python scraper_news.py
```
2. Run the NLP engine on these 300 articles. The script `nlp_enriched_news.py`
should:
- Save a `DataFrame` with the following structure:
```
Unique ID (`uuid` or `int`)
URL (`str`)
Date scraped (`date`)
Headline (`str`)
Body (`str`)
Org (`list str`)
Topics (`list str`)
Sentiment (`list float` or `float`)
Scandal_distance (`float`)
Top_10 (`bool`)
```
- Have a similar output while it processes the articles:

```prompt
python nlp_enriched_news.py

Enriching <URL>:

Cleaning document ... (optional)

---------- Detect entities ----------

Detected <X> companies which are <company_1> and <company_2>

---------- Topic detection ----------

Text preprocessing ...

The topic of the article is: <topic>

---------- Sentiment analysis ----------

Text preprocessing ... (optional)

The article <title> has a <sentiment> sentiment

---------- Scandal detection ----------

Computing embeddings and distance ...

Environmental scandal detected for <entity>
```
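A `DataFrame` with the expected schema can be sketched with pandas as below. The column names come from the structure above; every value in the row is a made-up placeholder, and whether `Sentiment` holds a list or a single float depends on the model you choose:

```python
import uuid
import pandas as pd

# One illustrative row matching the expected schema; all values are made up.
row = {
    "Unique ID": str(uuid.uuid4()),
    "URL": "https://example.com/article",
    "Date scraped": pd.Timestamp("2024-01-15").date(),
    "Headline": "Example headline",
    "Body": "Example body text ...",
    "Org": ["Acme Corp"],
    "Topics": ["Environment"],
    "Sentiment": [0.8],          # list of floats, or a single float
    "Scandal_distance": 0.42,
    "Top_10": False,
}
df = pd.DataFrame([row])
# df.to_csv("results/enhanced_news.csv", index=False)  # expected location
```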
Resources:
> I strongly suggest creating a data structure (a dictionary, for example) to save
> all the intermediate results. Then, a boolean argument `cache` fetches the
> intermediate results when they are already computed.
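A minimal sketch of that suggestion, assuming an in-memory dictionary keyed by article URL; the function name and the result fields are hypothetical, and the real NLP steps are replaced by placeholders:

```python
_CACHE: dict = {}  # intermediate results, keyed by article URL

def enrich_article(url: str, cache: bool = True) -> dict:
    """Return the enrichment results for one article, reusing the cache."""
    if cache and url in _CACHE:
        return _CACHE[url]  # already computed: skip the NLP steps entirely
    result = {
        "entities": [],           # placeholder for entity detection
        "topic": None,            # placeholder for topic detection
        "sentiment": None,        # placeholder for sentiment analysis
        "scandal_distance": None, # placeholder for scandal detection
    }
    _CACHE[url] = result
    return result
```

Persisting `_CACHE` to disk (e.g. as JSON or a pickle) would let cached results survive between runs.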
###### Does the structure of the project look like the one described in the subject?
###### Does the environment contain all the libraries used, with the versions necessary to run the code?
@@ -54,20 +36,7 @@ project
###### Does the DataFrame contain 300 different rows?
###### Are the columns of the DataFrame as expected?
###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
@@ -75,36 +44,6 @@ Top_10 (bool)
###### Can you run `python nlp_enriched_news.py` without any error?
###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.