From c2e60afc28d065ac7b8ec176dea46bb5ce4d1e18 Mon Sep 17 00:00:00 2001
From: nprimo <primo.niccolo@gmail.com>
Date: Mon, 18 Dec 2023 10:39:50 +0000
Subject: [PATCH] feat(nlp): update exercise 7 subject and audit

---
 subjects/ai/nlp/README.md       | 21 ++++++---
 subjects/ai/nlp/audit/README.md | 78 +++++++++++++++++++++++++--------
 2 files changed, 73 insertions(+), 26 deletions(-)
diff --git a/subjects/ai/nlp/README.md b/subjects/ai/nlp/README.md
index 208f603d4..b6a5d3409 100644
--- a/subjects/ai/nlp/README.md
+++ b/subjects/ai/nlp/README.md
@@ -196,7 +196,7 @@ Steps:
 
 > Note: Given that a data set is often described as an m x n matrix in which m is the number of rows and n is the number of columns: features. It is strongly recommended to work with m >> n. The value of the ratio depends on the signal existing in the data set and on the model complexity.
 
-2. Using `from_spmatrix` from Pandas, create a DataFrame with documents in rows and the dictionary in columns.
+2. Using `from_spmatrix` from Pandas, create a DataFrame `count_vectorized_df` that uses the output feature names as column names. The final result should be similar to the one below.
 
 |     | and | boat | compute |
 | --: | --: | ---: | ------: |
@@ -206,16 +206,23 @@ Steps:
 
 > Note: The sample 3x3 table mentioned is a small representation of the expected output for demonstration purposes. It's not necessary to drop columns in this context.
 
-3. Create a DataFrame with labels where:
+3. Show the token counts (obtained in the steps above) of the fourth tweet.
 
+4. Using the word counts, show the 15 most frequent tokens in the dataset's tweets.
+
+5. Add a `label` column to your `count_vectorized_df`, encoded as follows:
    - 1: Positive
    - 0: Neutral
    - -1: Negative
 
-|     | Label |
-| --: | ----: |
-|   0 |    -1 |
-|   1 |     0 |
-|   2 |     1 |
+   The final DataFrame should be similar to the one below:
+
+
+|    |   ...  |   label |
+|---:|-------:|--------:|
+|  0 |    ... |       1 |
+|  1 |    ... |      -1 |
+|  2 |    ... |      -1 |
+|  3 |    ... |      -1 |
 
 _Resources: [sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)_
diff --git a/subjects/ai/nlp/audit/README.md b/subjects/ai/nlp/audit/README.md
index d6d5dbcfe..4c8efe809 100644
--- a/subjects/ai/nlp/audit/README.md
+++ b/subjects/ai/nlp/audit/README.md
@@ -183,26 +183,66 @@ Remove this from  the sentence
 
 ##### The exercise is validated if all questions of the exercise are validated
 
-###### For question 1, is the output of the CountVectorizer the following?
+###### For question 1, is the output of the `CountVectorizer` the following?
 
 ```
 <6588x500 sparse matrix of type '<class 'numpy.int64'>'
-	with 79709 stored elements in Compressed Sparse Row format>
+	with 37334 stored elements in Compressed Sparse Row format>
+```
+
+###### For question 2, is the output of `print(count_vectorized_df.iloc[:3,400:403].to_markdown())` the following?
+
+```
+|    |   someth |   son |   song |
+|---:|---------:|------:|-------:|
+|  0 |        0 |     0 |      0 |
+|  1 |        0 |     0 |      0 |
+|  2 |        0 |     0 |      0 |
+```
+
+###### For question 3, does the output match the following?
+
+```
+cant    1
+deal    1
+end     1
+find    1
+keep    1
+like    1
+may     1
+say     1
+talk    1
+Name: 3, dtype: Sparse[int64, 0]
+```
+
+###### For question 4, does the output match the following?
+
+```
+tomorrow    1126
+go           733
+day          667
+night        641
+may          533
+tonight      501
+see          439
+time         429
+im           422
+get          398
+today        389
+game         382
+saturday     379
+friday       375
+sunday       368
+dtype: int64
+```
+
+###### For question 5, is the output of `print(count_vectorized_df.iloc[350:354,499:501].to_markdown())` the following?
+
+```
+|     |   your |   label |
+|----:|-------:|--------:|
+| 350 |      0 |       1 |
+| 351 |      1 |      -1 |
+| 352 |      0 |       1 |
+| 353 |      0 |       0 |
 ```
-
-###### For question 2, is the output of `print(df.iloc[:3,400:403].to_markdown())` the following?
-
-    |    |   talk |   team |   tell |
-    |---:|-------:|-------:|-------:|
-    |  0 |      0 |      0 |      0 |
-    |  1 |      0 |      0 |      0 |
-    |  2 |      0 |      0 |      0 |
-
-###### For question 3, is the shape of the wordcount DataFrame `(6588, 501)` and the output of `print(df.iloc[300:304,499:501].to_markdown())` the following?
-
-    |     |   youtube |   label |
-    |----:|----------:|--------:|
-    | 300 |         0 |       0 |
-    | 301 |         0 |      -1 |
-    | 302 |         1 |       0 |
-    | 303 |         0 |       1 |