#### NLP-enriched News Intelligence platform

##### Preliminary

###### Does the structure of the project look like the one described in the subject?

###### Does the environment contain all libraries used and their versions that are necessary to run the code?

##### Scraper

##### There are at least 300 news articles stored in the file system or the database.
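
> If the articles are stored on the file system, a quick count can support this check. The sketch below is only an aid for the auditor: the directory name `data/articles` is a hypothetical placeholder, and it assumes one file per article.

```python
from pathlib import Path


def count_articles(directory: str, pattern: str = "*") -> int:
    """Count regular files under `directory` (recursively) that match `pattern`."""
    root = Path(directory)
    if not root.is_dir():
        return 0
    return sum(1 for path in root.rglob(pattern) if path.is_file())


if __name__ == "__main__":
    # "data/articles" is a hypothetical location; use the path from the student's project.
    n = count_articles("data/articles")
    print(f"{n} articles found; at least 300 expected: {n >= 300}")
```

> If the student stores the articles in a database instead, count the rows of the relevant table with the database's own client.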

##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to stop by itself after fetching 3 documents; you can stop it manually.

###### Does it run without any error and store the 3 files as expected?

##### Topic classifier

###### Are the learning curves provided?

###### Do the learning curves show that the topic classifier is trained correctly, without overfitting? Ask the student to explain what the term "overfitting" means and how they avoided it.

> Additionally, you can look for external resources. For example, Wikipedia has a good page on "overfitting".
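
> One common symptom of overfitting is a training score that stays well above the validation score. The sketch below illustrates that reading of a learning curve. The scores and the `max_gap` threshold are illustrative assumptions, not values from the subject.

```python
def generalization_gap(train_scores, val_scores):
    """Return the per-step difference between training and validation scores."""
    if len(train_scores) != len(val_scores):
        raise ValueError("both curves must have the same number of points")
    return [t - v for t, v in zip(train_scores, val_scores)]


def looks_overfit(train_scores, val_scores, max_gap=0.05):
    """Flag a run whose final train/validation gap exceeds `max_gap`.

    `max_gap` is an arbitrary illustrative threshold, not a value from the subject.
    """
    gaps = generalization_gap(train_scores, val_scores)
    return gaps[-1] > max_gap


# Illustrative numbers only (accuracy at increasing training-set sizes).
healthy = looks_overfit([0.99, 0.98, 0.97], [0.90, 0.94, 0.96])
overfit = looks_overfit([1.00, 1.00, 1.00], [0.80, 0.82, 0.83])
print(healthy, overfit)
```

> A small final gap does not prove the model is well trained on its own; it is one signal to discuss with the student alongside the plotted curves.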

##### Ask the student to train and store the topic classifier model in a file named `topic_classifier.pkl`.

###### Can you run the topic classifier model on the test set without any error?

###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
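
> The accuracy check can be sketched as follows. It assumes the pickled object exposes a scikit-learn-style `predict` method that accepts the test inputs directly; loading the test set (`X_test`, `y_test`) is left to the auditor, and the helper names are hypothetical.

```python
import pickle


def load_model(path="topic_classifier.pkl"):
    """Load a pickled model. Only unpickle files from a source you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def passes_audit(model, X_test, y_test, threshold=0.95):
    """True when the accuracy is strictly higher than `threshold`."""
    # Assumption: `model.predict` returns one label per test input.
    y_pred = list(model.predict(X_test))
    return accuracy(list(y_test), y_pred) > threshold
```

> Typical usage would be `passes_audit(load_model(), X_test, y_test)`, once the test inputs and labels have been loaded from the datasets provided with the subject.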

##### Scandal detection

###### Does the `README.md` explain the choice of embeddings and distance?

###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?

###### Is the distance or similarity saved in the DataFrame?
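
> To make the questions above concrete, here is a minimal sketch of one possible approach: score each article by cosine similarity to a reference embedding and flag the 10 highest scores. The choice of cosine similarity and the function names are assumptions; the student may legitimately use another embedding or distance.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flag_top_k(scores, k=10):
    """Return one boolean per score, True for the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [i in top for i in range(len(scores))]
```

> With a distance instead of a similarity, lower values indicate a closer match, so the ranking would be ascending rather than descending.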

##### NLP engine output on 300 articles

###### Does the DataFrame contain 300 different rows?

###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
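
> Both checks above can be scripted. The sketch below assumes the DataFrame has first been converted to a list of dicts with `df.to_dict("records")`; the expected column names must be taken from the subject `Deliverable` section and are deliberately not filled in here.

```python
import json


def count_distinct_rows(records):
    """Count distinct rows in a list of dicts, e.g. from df.to_dict("records")."""
    # Serialise each row so that unhashable values such as lists can be compared.
    return len({json.dumps(row, sort_keys=True, default=str) for row in records})


def missing_columns(records, expected_columns):
    """Return the expected column names that are absent from at least one row."""
    return [c for c in expected_columns if any(c not in row for row in records)]
```

> The audit expects `count_distinct_rows(records)` to equal 300 and `missing_columns(records, expected_columns)` to be empty.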

##### Analyse the DataFrame with 300 articles: check the relevance of the topics matched, of the sentiment, of the scandals detected and of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.

##### NLP engine on 3 articles

###### Can you run `python nlp_enriched_news.py` without any error?

###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?

##### Analyse the output: check the relevance of the topic(s) matched, of the sentiment, of the scandal detected (if one is detected in the three articles) and of the company(ies) matched.