From c2e60afc28d065ac7b8ec176dea46bb5ce4d1e18 Mon Sep 17 00:00:00 2001
From: nprimo
Date: Mon, 18 Dec 2023 10:39:50 +0000
Subject: [PATCH] feat(nlp): update exercise 7 subject and audit

---
 subjects/ai/nlp/README.md       | 21 ++++++---
 subjects/ai/nlp/audit/README.md | 78 +++++++++++++++++++++++++--------
 2 files changed, 73 insertions(+), 26 deletions(-)

diff --git a/subjects/ai/nlp/README.md b/subjects/ai/nlp/README.md
index 208f603d4..b6a5d3409 100644
--- a/subjects/ai/nlp/README.md
+++ b/subjects/ai/nlp/README.md
@@ -196,7 +196,7 @@ Steps:

 > Note: Given that a data set is often described as an m x n matrix in which m is the number of rows and n is the number of columns: features. It is strongly recommended to work with m >> n. The value of the ratio depends on the signal existing in the data set and on the model complexity.

-2. Using `from_spmatrix` from Pandas, create a DataFrame with documents in rows and the dictionary in columns.
+2. Using `from_spmatrix` from Pandas, create a DataFrame `count_vecotrized_df` using the output features names as column names. The final results should be similar to the below one.

 | | and | boat | compute |
 | --: | --: | ---: | ------: |
@@ -206,16 +206,23 @@ Steps:

 > Note: The sample 3x3 table mentioned is a small representation of the expected output for demonstration purposes. It's not necessary to drop columns in this context.

-3. Create a DataFrame with labels where:
+3. Show the token counts (obtained with the above-mentioned steps) of the fourth tweet.
+4. Using the word counter, show the 15 most used tokenized words in the datasets' tweets
+
+5. Add to your `count_vecotrized_df` a `label` column considering the following:

 - 1: Positive
 - 0: Neutral
 - -1: Negative

-| | Label |
-| --: | ----: |
-| 0 | -1 |
-| 1 | 0 |
-| 2 | 1 |
+ The final DataFrame should be similar to the below:
+
+
+| | ... | label |
+|---:|-------:|--------:|
+| 0 | ... | 1 |
+| 1 | ... | -1 |
+| 2 | ... | -1 |
+| 3 | ... | -1 |

 _Resources: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)_
diff --git a/subjects/ai/nlp/audit/README.md b/subjects/ai/nlp/audit/README.md
index d6d5dbcfe..4c8efe809 100644
--- a/subjects/ai/nlp/audit/README.md
+++ b/subjects/ai/nlp/audit/README.md
@@ -183,26 +183,66 @@ Remove this from the sentence

 ##### The exercise is validated if all questions of the exercise are validated

-###### For question 1, is the output of the CountVectorizer the following?
+###### For question 1, is the output of the `CountVectorizer` the following?

 ```
 <6588x500 sparse matrix of type ''
- with 79709 stored elements in Compressed Sparse Row format>
+ with 37334 stored elements in Compressed Sparse Row format>
+```
+
+###### For question 2, is the output of `print(count_vecotrized_df.iloc[:3,400:403].to_markdown())` the following?
+
+```python
+ | | someth | son | song |
+ |---:|---------:|------:|-------:|
+ | 0 | 0 | 0 | 0 |
+ | 1 | 0 | 0 | 0 |
+ | 2 | 0 | 0 | 0 |
+```
+
+###### For question 3, is the output matching with the following one?
+
+```python
+cant 1
+deal 1
+end 1
+find 1
+keep 1
+like 1
+may 1
+say 1
+talk 1
+Name: 3, dtype: Sparse[int64, 0]
+```
+
+###### For question 4, is the output matching with the following one?
+
+```python
+tomorrow 1126
+go 733
+day 667
+night 641
+may 533
+tonight 501
+see 439
+time 429
+im 422
+get 398
+today 389
+game 382
+saturday 379
+friday 375
+sunday 368
+dtype: int64
+```
+
+###### For question 5, is the output of `print(count_vectorized_df.iloc[350:354,499:501].to_markdown())` the following?
+
+```python
+| | your | label |
+|----:|-------:|--------:|
+| 350 | 0 | 1 |
+| 351 | 1 | -1 |
+| 352 | 0 | 1 |
+| 353 | 0 | 0 |
 ```
-
-###### For question 2, is the output of `print(df.iloc[:3,400:403].to_markdown())` the following?
-
- | | talk | team | tell |
- |---:|-------:|-------:|-------:|
- | 0 | 0 | 0 | 0 |
- | 1 | 0 | 0 | 0 |
- | 2 | 0 | 0 | 0 |
-
-###### For question 3, is the shape of the wordcount DataFrame `(6588, 501)` and the output of `print(df.iloc[300:304,499:501].to_markdown())` the following?
-
- | | youtube | label |
- |----:|----------:|--------:|
- | 300 | 0 | 0 |
- | 301 | 0 | -1 |
- | 302 | 1 | 0 |
- | 303 | 0 | 1 |