#### NLP-enriched News Intelligence platform

##### Preliminary

```
project
│   README.md
│   environment.yml
│
├───data
│   │   topic_classification_data.csv
│
├───results
│   │   topic_classifier.pkl
│   │   learning_curves.png
│   │   enhanced_news.csv
│
└───nlp_engine
```

###### Does the structure of the project look like the above?
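As an optional aid for this check, here is a minimal sketch that lists which of the paths shown in the tree are missing. The function name `missing_paths` and the constant `EXPECTED_PATHS` are illustrative only and are not part of the project; the check assumes it is pointed at the project root.

```python
from pathlib import Path

# Paths taken from the tree above, relative to the project root.
EXPECTED_PATHS = [
    "README.md",
    "environment.yml",
    "data/topic_classification_data.csv",
    "results/topic_classifier.pkl",
    "results/learning_curves.png",
    "results/enhanced_news.csv",
    "nlp_engine",
]


def missing_paths(root):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]
```

Calling `missing_paths(".")` from the project root should return an empty list when the layout matches.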

###### Does the environment contain all libraries used and their versions that are necessary to run the code?

##### Scraper

###### Are there at least 300 news articles stored in the file system or the database?

##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to stop by itself after fetching 3 documents; you can stop it manually.

###### Does it run without any error and store the 3 files as expected?

##### Topic classifier

###### Are the learning curves provided?

###### Do the learning curves show that the topic classifier is trained correctly, without overfitting?

###### Can you run the topic classifier model on the test set without any error?

###### Does the topic classifier score an accuracy higher than 95%?
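One possible way to verify the accuracy threshold is sketched below. It assumes the pickled object exposes a scikit-learn-style `predict` method that accepts a list of raw texts; that is an assumption about the student's model, not something this audit guarantees. The helper names `load_model` and `accuracy` are illustrative.

```python
import pickle


def load_model(path="results/topic_classifier.pkl"):
    """Load the pickled classifier (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def accuracy(model, texts, labels):
    """Fraction of `texts` whose predicted topic equals the true label.

    Assumes `model.predict(texts)` returns one label per input text.
    """
    predictions = model.predict(texts)
    correct = sum(1 for pred, true in zip(predictions, labels) if pred == true)
    return correct / len(labels)
```

With the test texts and labels loaded into two lists (how they are loaded depends on the project), the check is `accuracy(load_model(), texts, labels) > 0.95`.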

##### Scandal detection

###### Does the `README.md` explain the choice of embeddings and distance?

###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?

###### Is the distance or similarity saved in the DataFrame?
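For reference while reviewing these three questions, the sketch below shows one way a distance and a top-10 flag could be computed. It assumes cosine distance between an article embedding and a keyword embedding, both non-zero vectors given as plain lists of floats; the student may legitimately have chosen another embedding or metric. The function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; smaller means closer to the keywords.

    Assumes both vectors are non-zero and of equal length.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def flag_top_n(distances, n=10):
    """Return a list of booleans marking the n smallest distances."""
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    top = set(ranked[:n])
    return [i in top for i in range(len(distances))]
```

If the project stores a similarity rather than a distance, the ranking is reversed: the 10 largest values should be flagged instead.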

##### NLP engine output on 300 articles

###### Does the DataFrame contain 300 different rows?

###### Are the columns of the DataFrame as expected?

```
Date scraped (date)
Title (str)
URL (str)
Body (str)
Org (str)
Topics (list of str)
Sentiment (list of float or float)
Scandal_distance (float)
Top_10 (bool)
```
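The row and column checks can be partly automated. The sketch below reads `results/enhanced_news.csv` with the standard library and reports any problems. It assumes the CSV header uses exactly the column names listed above and treats the `URL` column as the uniqueness key; both are assumptions that may need adapting to the project under review. The function name `check_enhanced_news` is illustrative.

```python
import csv

# Column names as listed above; the exact spelling in the CSV is an assumption.
EXPECTED_COLUMNS = [
    "Date scraped",
    "Title",
    "URL",
    "Body",
    "Org",
    "Topics",
    "Sentiment",
    "Scandal_distance",
    "Top_10",
]


def check_enhanced_news(path="results/enhanced_news.csv", expected_rows=300):
    """Return a list of problems found; an empty list means the checks passed."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []
        rows = list(reader)

    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # Rows are considered different when their URLs differ.
    distinct_urls = {row.get("URL") for row in rows}
    if len(distinct_urls) < expected_rows:
        problems.append(
            f"only {len(distinct_urls)} distinct URLs, expected {expected_rows}"
        )
    return problems
```

This only covers the mechanical part of the check; the column types still need to be inspected by hand.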

##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.

##### NLP engine on 3 articles

###### Can you run `python nlp_enriched_news.py` without any error?

###### Does the output of the NLP engine correspond to the output below?

```prompt
python nlp_enriched_news.py

Enriching <URL>:

Cleaning document ... (optional)

---------- Detect entities ----------

Detected <X> companies which are <company_1> and <company_2>

---------- Topic detection ----------

Text preprocessing ...

The topic of the article is: <topic>

---------- Sentiment analysis ----------

Text preprocessing ... (optional)
The title which is <title> is <sentiment>
The body of the article is <sentiment>

---------- Scandal detection ----------

Computing embeddings and distance ...

Environmental scandal detected for <entity>
```

##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.