diff --git a/subjects/ai/nlp-scraper/README.md b/subjects/ai/nlp-scraper/README.md
index 3be89edd9..9b210ce8f 100644
--- a/subjects/ai/nlp-scraper/README.md
+++ b/subjects/ai/nlp-scraper/README.md
@@ -143,7 +143,6 @@ project
 ├── nlp_enriched_news.py
 ├── README.md
 ├── results
-│   ├── topic_classifier.pkl
 │   ├── enhanced_news.csv
 │   └── learning_curves.png
 └── scraper_news.py
@@ -169,7 +168,8 @@ python scraper_news.py
 2. Run on these 300 articles the NLP engine. The script `nlp_eneriched_news.py`
    should:
 
-   - Save a `DataFrame` with the following struct:
+   - Save a `DataFrame` with the following struct and store the result in a
+     `csv` file, `enhancend_news.csv`:
 
    ```
    Unique ID (`uuid` or `int`)
@@ -215,10 +215,6 @@ python scraper_news.py
    Environmental scandal detected for
    ```
 
-> I strongly suggest creating a data structure (dictionary for example) to save
-> all the intermediate result. Then, a boolean argument `cache` fetched the
-> intermediate results when they are already computed.
-
 ### Notions
 
 - [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)
diff --git a/subjects/ai/nlp-scraper/audit/README.md b/subjects/ai/nlp-scraper/audit/README.md
index 7609587b7..6dd031407 100644
--- a/subjects/ai/nlp-scraper/audit/README.md
+++ b/subjects/ai/nlp-scraper/audit/README.md
@@ -18,11 +18,15 @@
 
 ###### Are the learning curves provided?
 
-###### Do the learning curves prove the topics classifier is trained correctly - without overfitting?
+###### Do the learning curves prove the topics classifier is trained correctly - without overfitting? Ask the student to explain what the term "overfitting" means and how he avoided this phenomenon.
+
+> Additionally, you can look for external resources. For example, Wikipedia has a good page on "overfitting".
+
+##### Ask the student to train and store the topic classifier model in a file named `topic_classifier.pkl`.
 
 ###### Can you run the topic classifier model on the test set without any error?
 
-###### Does the topic classifier score an accuracy higher than 95%?
+###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
 
 ##### Scandal detection
 