## NLP-enriched News Intelligence platform
The goal of this project is to build an NLP-enriched News Intelligence
platform. News analysis is a trending and important topic: analysts get their
information from the news, and the amount of available information is
practically limitless. A platform that helps detect the relevant information
is therefore valuable.

The platform connects to a news data source, detects the entities, detects the
topic of each article, analyses the sentiment and performs a scandal detection
analysis.
### Scraper
News data source:
- Find a news website that is easy to scrape. The website is not imposed here
  because news websites change their scraping policies frequently.
- Store the following information either in one file per day or in a SQL
  database:
  - unique ID,
  - URL,
  - date,
  - headline,
  - body of the article.

Use data from the last week only; otherwise the volume may be too high.
At least 300 articles should be stored in your file system or SQL database.
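
If you choose the SQL option, the storage layer could look like the sketch below. It uses Python's built-in `sqlite3` module; the table name `articles`, the column names and the helpers `open_store`, `save_article` and `count_articles` are assumptions made for illustration, not requirements of the project.

```python
import sqlite3
import uuid
from datetime import date

# Hypothetical schema covering the five required fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    id       TEXT PRIMARY KEY,      -- unique ID (a UUID string in this sketch)
    url      TEXT UNIQUE NOT NULL,  -- one row per article URL
    date     TEXT NOT NULL,         -- ISO date, e.g. 2024-01-31
    headline TEXT NOT NULL,
    body     TEXT NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_article(conn, url, day, headline, body):
    """Insert one article; return False if this URL is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO articles (id, url, date, headline, body) "
                "VALUES (?, ?, ?, ?, ?)",
                (str(uuid.uuid4()), url, day.isoformat(), headline, body),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def count_articles(conn):
    """Number of stored articles, e.g. to check the 300-article minimum."""
    return conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

The `UNIQUE` constraint on `url` lets the scraper be re-run without storing the same article twice.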
### NLP engine
In production architectures, the NLP engine delivers a live output based on the
news delivered as a live data stream by the scraper. However, this requires
advanced Python skills that are not a prerequisite for the AI branch.
To simplify this step, the scraper and the NLP engine are used independently in
the project: the scraper fetches the news and stores it in the data structure
(either the file system or the SQL database), and then the NLP engine runs on
the stored data.
Here is how the NLP engine should process the news:
#### **1. Entities detection:**
The goal is to detect all the entities in the document (headline and body). The
type of entity we focus on is `ORG`, which corresponds to companies and
organizations. The detected entities should be stored.
- Detect all companies using `SpaCy NER` on the body of the text.
[Named Entity Recognition with NLTK and
SpaCy](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
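
As a rough illustration of this step, the sketch below filters spaCy entities down to `ORG`. The helper names `org_names` and `detect_orgs` are hypothetical, and the pipeline is passed in as a parameter so that the choice of model stays with you. Running the pipeline on headline and body together is an assumption based on the goal stated above.

```python
from typing import Iterable, List


def org_names(ents: Iterable) -> List[str]:
    """Collect distinct ORG entity texts, in order of first appearance.

    `ents` is expected to behave like spaCy's `doc.ents`: each item
    exposes `.text` and `.label_`.
    """
    names: List[str] = []
    for ent in ents:
        name = ent.text.strip()
        if ent.label_ == "ORG" and name and name not in names:
            names.append(name)
    return names


def detect_orgs(nlp, headline: str, body: str) -> List[str]:
    """Run a loaded spaCy pipeline on headline + body and return ORG names."""
    doc = nlp(f"{headline}\n{body}")
    return org_names(doc.ents)
```

A typical call would be `detect_orgs(spacy.load("en_core_web_sm"), headline, body)`, assuming that model has been installed with `python -m spacy download en_core_web_sm`; any other spaCy pipeline with an NER component should work the same way.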
#### **2. Topic detection:**
The goal is to detect what the article is about: Tech, Sport, Business,
Entertainment or Politics. To do so, a labelled dataset is provided: [training
data](bbc_news_train.csv) and [test data](bbc_news_tests.csv). From this
dataset, build a classifier that learns to detect the right topic of an
article. Save the training code in a Python file, because the audit requires
the auditor to test the model.
To proceed with the following instructions, save the model as
`topic_classifier.pkl`.
Save the plot of the learning curves (`learning_curves.png`) in `results` to
show that the model is trained correctly and does not overfit.
- Learning constraints: **Score on test: > 95%**
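
One possible baseline for this step is a TF-IDF representation followed by logistic regression, sketched below with scikit-learn. The choice of model, its parameters and the helper names are assumptions; the source does not prescribe a classifier, and nothing here guarantees the 95% target. Because the column names of the provided CSV files are not given here, the functions take plain lists of texts and labels.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def train_topic_classifier(texts, labels):
    """Fit a TF-IDF + logistic regression pipeline on raw article texts."""
    model = make_pipeline(
        TfidfVectorizer(stop_words="english", sublinear_tf=True),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model


def save_model(model, path="topic_classifier.pkl"):
    """Pickle the whole pipeline so the vectorizer travels with the model."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path="topic_classifier.pkl"):
    """Load a pipeline saved by save_model()."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Pickling the pipeline rather than the classifier alone means the NLP engine can call `load_model().predict([text])` on raw text. For the learning curves, `sklearn.model_selection.learning_curve` returns the train and validation scores per training-set size, which can then be plotted and saved.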
#### **3. Sentiment analysis:**
The goal is to detect the sentiment (positive, negative or neutral) of the news
articles. To do so, use a pre-trained sentiment model. I suggest using
`NLTK`. There are several reasons to use a pre-trained model:
1. As a Data Scientist, you should learn to use a pre-trained model. So many
   trained models are available that sometimes you don't need to train one
   from scratch.
2. Labelled news data for sentiment analysis is very expensive. Companies such
   as [SESAMm](https://www.sesamm.com/) provide this kind of service.
- [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)
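
A minimal sketch of this step, assuming the pre-trained model is NLTK's VADER analyzer (the source only suggests `NLTK`, so this choice is an assumption). The helper names are hypothetical, and the ±0.05 cut-offs on the compound score are the thresholds conventionally used with VADER, which you may want to tune.

```python
from typing import Tuple


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound score in [-1, 1] to positive / negative / neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def article_sentiment(analyzer, text: str) -> Tuple[float, str]:
    """Score a text with any object exposing VADER's polarity_scores()."""
    compound = analyzer.polarity_scores(text)["compound"]
    return compound, label_from_compound(compound)
```

With NLTK installed, the analyzer can be created with `SentimentIntensityAnalyzer()` from `nltk.sentiment.vader`, after running `nltk.download("vader_lexicon")` once.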
#### **4. Scandal detection**
The goal is to detect environmental disasters linked to the detected companies.
Here is the methodology that should be used:
- Define keywords that correspond to environmental disasters that may be caused
  by companies: pollution, deforestation, etc. Here is [an example of
  disaster we want to detect](https://en.wikipedia.org/wiki/MV_Erika). Take
  care not to use ambiguous words that make sense in the context of an
  environmental disaster but also in other contexts, as they would lead to
  false positives.
- Compute the [embeddings of the
keywords](https://en.wikipedia.org/wiki/Word_embedding#Software).
- Compute the distance ([here are some
  examples](https://www.nltk.org/api/nltk.metrics.distance.html#module-nltk.metrics.distance))
  between the embeddings of the keywords and all sentences that contain an
  entity. Explain in the `README.md` which embeddings you chose and why.
  Similarly, explain which distance or similarity you chose and why.
- Compute and save a single metric per article that aggregates all the
  distances calculated for that article.
- Flag the top 10 articles.
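
The distance, aggregation and flagging steps above can be sketched as follows. This is one possible choice rather than the required one: it assumes cosine distance, takes the minimum distance over all keyword and sentence pairs as the per-article metric, and flags the articles with the smallest values. The embeddings are assumed to be already computed and passed in as sequences of floats; all function names are hypothetical.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; a zero vector is treated as unrelated (1.0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 1.0
    return 1.0 - dot / norm


def scandal_distance(keyword_vecs: Sequence[Vector],
                     sentence_vecs: Sequence[Vector]) -> float:
    """Smallest keyword-to-sentence distance within one article.

    Returns 1.0 when the article has no sentence containing an entity.
    """
    distances = [cosine_distance(k, s)
                 for k in keyword_vecs for s in sentence_vecs]
    return min(distances, default=1.0)


def flag_top(distances: Dict[str, float], n: int = 10) -> Dict[str, bool]:
    """Mark the n articles with the smallest scandal distance."""
    closest = set(sorted(distances, key=distances.get)[:n])
    return {article_id: article_id in closest for article_id in distances}
```

Taking the minimum means a single sentence close to a keyword is enough to rank an article highly; a mean over sentences would be less sensitive to one outlier sentence. Whichever aggregation you pick, the `README.md` is the place to justify it.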
#### **5. Source analysis (optional)**
The goal is to show insights about the news source you scraped.
This requires scraping data over at least 5 days (ideally a week). Save the
plots in the `results` folder.
Here are examples of insights:
- Per day:
  - Proportion of topics per day
  - Number of articles
  - Number of companies mentioned
  - Sentiment per day
- Per company:
  - Companies mentioned the most
  - Sentiment per company
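
Two of these insights can be computed with the standard library alone, as in the sketch below. The input shapes (pairs of date and topic, and one list of `ORG` names per article) and the function names are assumptions for illustration; plotting is left out.

```python
from collections import Counter, defaultdict


def topic_proportions_per_day(rows):
    """rows: iterable of (date, topic) pairs -> {date: {topic: share}}."""
    counts = defaultdict(Counter)
    for day, topic in rows:
        counts[day][topic] += 1
    return {
        day: {topic: n / sum(c.values()) for topic, n in c.items()}
        for day, c in counts.items()
    }


def most_mentioned(org_lists, n=5):
    """org_lists: one list of ORG names per article -> top-n (name, count).

    Each company is counted at most once per article.
    """
    counter = Counter(name for orgs in org_lists for name in set(orgs))
    return counter.most_common(n)
```

Counting each company once per article keeps a single long article from dominating the ranking; counting every mention is an equally valid choice if you prefer raw frequency.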
### Deliverables
The expected structure of the project is:
```
project
.
├── data
│   └── ...
├── nlp_enriched_news.py
├── README.md
├── results
│   ├── training_model.py
│   ├── enhanced_news.csv
│   └── learning_curves.png
└── scraper_news.py
```
1. Run the scraper until it fetches at least 300 articles
```
python scraper_news.py
1. scraping <URL>
requesting ...
parsing ...
saved in <path>
2. scraping <URL>
requesting ...
parsing ...
saved in <path>
```
2. Run the NLP engine on these 300 articles. The script `nlp_enriched_news.py`
should:
- Save a `DataFrame` with the following structure and store the result in a
`csv` file, `enhanced_news.csv`:
```
Unique ID (`uuid` or `int`)
URL (`str`)
Date scraped (`date`)
Headline (`str`)
Body (`str`)
Org (`list str`)
Topics (`list str`)
Sentiment (`list float` or `float`)
Scandal_distance (`float`)
Top_10 (`bool`)
```
- Print output similar to the following while it processes the articles:
```prompt
python nlp_enriched_news.py
Enriching <URL>:
Cleaning document ... (optional)
---------- Detect entities ----------
Detected <X> companies which are <company_1> and <company_2>
---------- Topic detection ----------
Text preprocessing ...
The topic of the article is: <topic>
---------- Sentiment analysis ----------
Text preprocessing ... (optional)
The article <title> has a <sentiment> sentiment
---------- Scandal detection ----------
Computing embeddings and distance ...
Environmental scandal detected for <entity>
```
### Notions
- [Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)
- [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis)