
feat(nlp-scraper): improve audit and subject

- add details for question about checking "overfitting"
- remove not so clear suggestion
- move creation of `topic_classifier.pkl` to audit phase
pull/2468/head
nprimo 9 months ago committed by Niccolò Primo
parent
commit
700efcb57b
  1. 8
      subjects/ai/nlp-scraper/README.md
  2. 8
      subjects/ai/nlp-scraper/audit/README.md

8
subjects/ai/nlp-scraper/README.md

@@ -143,7 +143,6 @@ project
 ├── nlp_enriched_news.py
 ├── README.md
 ├── results
-   ├── topic_classifier.pkl
    ├── enhanced_news.csv
    └── learning_curves.png
 └── scraper_news.py
@@ -169,7 +168,8 @@ python scraper_news.py
 2. Run on these 300 articles the NLP engine. The script `nlp_eneriched_news.py`
 should:
-- Save a `DataFrame` with the following struct:
+- Save a `DataFrame` with the following struct and store the result in a
+`csv` file, `enhancend_news.csv`:
 ```
 Unique ID (`uuid` or `int`)
@@ -215,10 +215,6 @@ python scraper_news.py
 Environmental scandal detected for <entity>
 ```
-> I strongly suggest creating a data structure (dictionary for example) to save
-> all the intermediate result. Then, a boolean argument `cache` fetched the
-> intermediate results when they are already computed.
 ### Notions
 - [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)

8
subjects/ai/nlp-scraper/audit/README.md

@@ -18,11 +18,15 @@
 ###### Are the learning curves provided?
-###### Do the learning curves prove the topics classifier is trained correctly - without overfitting?
+###### Do the learning curves prove the topics classifier is trained correctly - without overfitting? Ask the student to explain what the term "overfitting" means and how he avoided this phenomenon.
+> Additionally, you can look for external resources. For example, Wikipedia has a good page on "overfitting".
+##### Ask the student to train and store the topic classifier model in a file named `topic_classifier.pkl`.
 ###### Can you run the topic classifier model on the test set without any error?
-###### Does the topic classifier score an accuracy higher than 95%?
+###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
 ##### Scandal detection
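
Note on the new audit step: the audit now asks the student to train the model, store it as `topic_classifier.pkl`, then run it on the test set and check its accuracy. A minimal sketch of that round trip is below. It is only an illustration, not the project's implementation: the word-count "model", the helper names (`train`, `predict`, `accuracy`, `save_model`, `load_model`) and the tiny dataset are all hypothetical stand-ins, and a real submission would most likely pickle a fitted classifier from an ML library instead.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path


def train(texts, labels):
    """Toy stand-in for real training (assumption): per-topic word counts."""
    model = {}
    for text, label in zip(texts, labels):
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def predict(model, texts):
    """Pick the topic whose training vocabulary overlaps most with each text."""
    predictions = []
    for text in texts:
        words = text.lower().split()
        scores = {
            label: sum(counts[word] for word in words)
            for label, counts in model.items()
        }
        predictions.append(max(scores, key=scores.get))
    return predictions


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def save_model(model, path):
    """Serialize the trained model to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled model (only unpickle files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Made-up data, for illustration only.
train_texts = ["stocks fall as markets close", "team wins the final match"]
train_labels = ["business", "sport"]
test_texts = ["markets rally on stocks news", "the match ended with a late goal"]
test_labels = ["business", "sport"]

# A temporary directory stands in for the project's `results` folder.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "topic_classifier.pkl"
    save_model(train(train_texts, train_labels), model_path)
    reloaded = load_model(model_path)

test_accuracy = accuracy(test_labels, predict(reloaded, test_texts))
print(f"test accuracy: {test_accuracy:.2%}")
```

The point of reloading before scoring is that the audit question is about the stored file: the accuracy should be computed from the model read back from `topic_classifier.pkl`, not from the object still in memory after training.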
