
feat(nlp-scraper): restructure subject and audit to avoid storing big files in solution

pull/2468/head
nprimo 9 months ago committed by Niccolò Primo
parent
commit
d40ec29cf3
  1. 10
      subjects/ai/nlp-scraper/README.md
  2. 30
      subjects/ai/nlp-scraper/audit/README.md

10
subjects/ai/nlp-scraper/README.md

@ -58,9 +58,10 @@ The goal is to detect what the article is dealing with: Tech, Sport, Business,
Entertainment or Politics. To do so, a labelled dataset is provided: [training
data](bbc_news_train.csv) and [test data](bbc_news_test.csv). From this
dataset, build a classifier that learns to detect the right topic in the
article. The trained model should be stored as `topic_classifier.pkl`. Make
sure the model can be used easily (with the preprocessing pipeline built for
instance) because the audit requires the auditor to test the model.
article. Save the training process to a python file because the audit requires
the auditor to test the model.
To proceed with the following instructions, save the model as
`topic_classifier.pkl`.
Save the plot of learning curves (`learning_curves.png`) in `results` to prove
that the model is trained correctly and not overfitted.
@ -139,10 +140,11 @@ The expected structure of the project is:
project
.
├── data
   └── date_scrape_data.csv
   └── ...
├── nlp_enriched_news.py
├── README.md
├── results
   ├── training_model.py
   ├── enhanced_news.csv
   └── learning_curves.png
└── scraper_news.py

30
subjects/ai/nlp-scraper/audit/README.md

@ -8,11 +8,9 @@
##### Scraper
##### There are at least 300 news articles stored in the file system or the database.
##### Run the scraper with `python scraper_news.py` and fetch 300 articles. If needed, stop the program manually when enough data has been retrieved.
##### Run the scraper with `python scraper_news.py` and fetch 3 documents. The scraper is not expected to fetch 3 documents and stop by itself, you can stop it manually.
###### Does it run without any error and store the 3 files as expected?
###### Does it run without any error and store the articles as described in the subject?
##### Topic classifier
@ -28,26 +26,24 @@
###### Does the topic classifier score an accuracy higher than 95% on the given datasets?
##### Scandal detection
###### Does the `README.md` explain the choice of embeddings and distance?
##### NLP engine output on 300 articles
###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?
###### Can you run `python nlp_enriched_news.py` without any error?
###### Is the distance or similarity saved in the DataFrame?
###### Does the DataFrame saved in the `csv` file contain 300 different rows?
##### NLP engine output on 300 articles
###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
###### Does the DataFrame contain 300 different rows?
###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
###### Are the columns of the DataFrame as defined in the subject `Deliverable` section?
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
##### Analyse the DataFrame with 300 articles: relevance of the topics matched, relevance of the sentiment, relevance of the scandal detected and relevance of the companies matched. The algorithms are not 100% accurate, so you should expect a few issues in the results.
###### Is the information presented consistent and accurate?
##### NLP engine on 3 articles
##### Scandal detection
###### Can you run `python nlp_enriched_news.py` without any error?
###### Does the `README.md` explain the choice of embeddings and distance?
###### Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?
###### Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?
##### Analyse the output: relevance of the topic(s) matched, relevance of the sentiment, relevance of the scandal detected (if detected on the three articles) and relevance of the company(ies) matched.
###### Is the distance or similarity saved in the DataFrame?
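The audit question about whether the `csv` file contains 300 different rows can be checked mechanically. The snippet below is a minimal sketch using only the standard library. The helper name `count_distinct_rows` and the sample columns (`url`, `headline`) are hypothetical and are not taken from the subject's `Deliverable` section.

```python
import csv
import io


def count_distinct_rows(csv_file):
    """Return (total_rows, distinct_rows) for an open CSV file, header excluded."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if any
    rows = [tuple(row) for row in reader]
    return len(rows), len(set(rows))


# Tiny in-memory sample with hypothetical columns; the third data row
# duplicates the first, so it should not count as a different row.
sample = io.StringIO(
    "url,headline\n"
    "https://example.com/a,First\n"
    "https://example.com/b,Second\n"
    "https://example.com/a,First\n"
)
total, distinct = count_distinct_rows(sample)
print(total, distinct)
```

For the audit itself, the same function could be called on the file listed in the project structure, for example `with open("results/enhanced_news.csv", newline="") as f: count_distinct_rows(f)`, with both numbers expected to be 300 under this reading of the question.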
