NLP-enriched News Intelligence platform

Preliminary

Does the structure of the project look like the one described in the subject?

Does the environment specify all the libraries, with their versions, needed to run the code?

Scraper

Run the scraper with `python scraper_news.py` and fetch 300 articles. If needed, stop the program manually when enough data has been retrieved.

Does it run without any error and store the articles as described in the subject?

Topic classifier

Are the learning curves provided?

Do the learning curves show that the topic classifier was trained correctly, without overfitting? Ask the student to explain what the term "overfitting" means and how they avoided it.

Additionally, you can look for external resources. For example, Wikipedia has a good page on "overfitting".
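
When reading the learning curves, one rough signal of overfitting is a training score that keeps rising while the validation score stalls or drops. The sketch below is only an illustration of that check: the function names, the `max_gap` threshold of 0.05 and the score lists are assumptions made for the example, not values taken from the subject.

```python
def overfitting_gap(train_scores, val_scores):
    """Return the gap between the final training and validation scores."""
    if len(train_scores) != len(val_scores) or not train_scores:
        raise ValueError("expected two non-empty score lists of equal length")
    return train_scores[-1] - val_scores[-1]


def looks_overfitted(train_scores, val_scores, max_gap=0.05):
    """Flag a run whose final train/validation gap exceeds max_gap.

    max_gap is an assumed, illustrative threshold; pick one that fits the task.
    """
    return overfitting_gap(train_scores, val_scores) > max_gap


# Illustrative numbers only, not results from any real model.
train = [0.80, 0.90, 0.97, 0.99]
val = [0.78, 0.86, 0.88, 0.87]
print(looks_overfitted(train, val))
```

A numeric gap is only a hint. The student's explanation of the curves and of the measures taken (for example regularisation or early stopping) remains the main evidence.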

Ask the student to train and store the topic classifier model in a file named `topic_classifier.pkl`.
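
Storing the model can be as simple as the following sketch, which assumes the classifier is a picklable Python object. The `save_model` and `load_model` helpers and the placeholder dictionary are hypothetical stand-ins for the student's code. Only the file name `topic_classifier.pkl` comes from this audit.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialise a picklable model object to disk."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a model object previously written by save_model."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in object: in the project this would be the fitted classifier.
placeholder_model = {"kind": "placeholder, not a real classifier"}

# A temporary directory keeps the example from touching the project folder.
model_path = os.path.join(tempfile.mkdtemp(), "topic_classifier.pkl")
save_model(placeholder_model, model_path)
restored = load_model(model_path)
print(restored == placeholder_model)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.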

Can you run the topic classifier model on the test set without any error?

Does the topic classifier score an accuracy higher than 95% on the given datasets?
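
The accuracy check above can be reproduced by hand as the share of test samples whose predicted topic equals the true topic. This is a minimal sketch: the helper names and the toy labels are assumptions for illustration, and the comparison is strict because the question asks for an accuracy higher than 95%.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("expected two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def passes_threshold(y_true, y_pred, threshold=0.95):
    """True only if accuracy is strictly above the threshold."""
    return accuracy(y_true, y_pred) > threshold


# Toy labels, invented for the example: 19 of 20 predictions are correct.
y_true = ["a"] * 20
y_pred = ["a"] * 19 + ["b"]
print(accuracy(y_true, y_pred), passes_threshold(y_true, y_pred))
```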

NLP engine output on 300 articles

Can you run `python nlp_enriched_news.py` without any error?

Does the DataFrame saved in the `csv` file contain 300 different rows?

Are the columns of the DataFrame as defined in the subject `Deliverable` section?
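
The row and column checks can be scripted rather than done by eye. The sketch below uses only the standard library's `csv` module (pandas would work equally well). The `check_csv` helper, the sample data and the column names `id` and `headline` are hypothetical. The real column list is the one given in the subject's `Deliverable` section.

```python
import csv
import io


def check_csv(fh, expected_columns, expected_rows=300):
    """Compare a CSV's header and distinct row count with expectations."""
    reader = csv.reader(fh)
    header = next(reader, [])
    rows = [tuple(row) for row in reader]
    distinct = len(set(rows))
    return {
        "columns_ok": header == list(expected_columns),
        "row_count": len(rows),
        "distinct_rows": distinct,
        # Every row must be unique and the total must match the target.
        "rows_ok": len(rows) == distinct == expected_rows,
    }


# Tiny in-memory sample with one duplicated row, for illustration only.
sample = io.StringIO("id,headline\n1,first\n2,second\n2,second\n")
report = check_csv(sample, ["id", "headline"], expected_rows=2)
print(report)
```

For the audit, open the student's CSV file instead of the in-memory sample and keep the default `expected_rows=300`.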

Does the output of the NLP engine correspond to the output defined in the subject `Deliverable` section?

Analyse the output: the relevance of the topic(s) matched, of the sentiment, of the scandal detected (if one is detected on the three articles) and of the company(ies) matched.

Is the information presented consistent and accurate?

Scandal detection

Does the `README.md` explain the choice of embeddings and distance?

Does the DataFrame flag the top 10 articles with the highest likelihood of environmental scandal?

Is the distance or similarity saved in the DataFrame?
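
As a point of comparison for the flagging logic, here is a minimal sketch that assumes cosine similarity between each article embedding and a single keyword embedding. The student may legitimately have chosen another embedding or distance, and the function names and toy vectors below are invented for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_top_k(article_vectors, keyword_vector, k=10):
    """Return one record per article with its similarity and a top-k flag."""
    sims = [cosine_similarity(vec, keyword_vector) for vec in article_vectors]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    top = set(ranked[:k])
    return [{"similarity": s, "flagged": i in top} for i, s in enumerate(sims)]


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
articles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keyword = [1.0, 0.0]
result = flag_top_k(articles, keyword, k=1)
print(result)
```

Keeping the similarity next to the flag, as in the records above, lets the auditor confirm that the flagged articles really are the highest-ranked ones.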