feat(string_tokenizer): add subject for the new exercise

1 year ago · 4c09ccd4b7
1 changed files with 91 additions and 0 deletions
--- a/subjects/devops/string_tokenizer/README.md
+++ b/subjects/devops/string_tokenizer/README.md
@ -0,0 +1,91 @@
+## String tokenizer
+
+### Instructions
+
+Tokenization is the process of breaking down a string into smaller pieces, called tokens. In natural language processing, tokenization typically refers to the process of breaking down a sentence into words or breaking down a paragraph into sentences.
+
+Create a file `string_processing.py` which will have a function `tokenize(sentence)` that given a sentence will do the following:
+
+- removes all punctuation marks and special characters
+- separates all words like so: `"it's not 3" => ['it', 's', 'not', '3']`
+- put all the words in lowercase
+- return a list of all the words.
+
+### Usage
+
+Here is a possible `test.py` to test your functions:
+
+```python
+import string_processing
+
+if __name__ == '__main__':
+    my_sentence = "It's not possible, you can't ask for a raise"
+    print(string_processing.tokenize(my_sentence))
+```
+
+```bash
+$ python test.py
+['it', 's', 'not', 'possible', 'you', 'can', 't', 'ask', 'for', 'a', 'raise']
+```
+
+### Hints
+
+The `re` library is a module for working with regular expressions it provides a set of functions for working with regular expressions, including:
+
+- `re.sub()` : Replaces all occurrences of a regular expression pattern in a string with a replacement string.
+
+```python
+text = "This is a test sentence. It has multiple punctuation marks!"
+
+# Replace all exclamation marks with question marks
+new_text = re.sub("!", "?", text)
+
+print(new_text)
+```
+
+and the output:
+
+```console
+This is a test sentence. It has multiple punctuation marks?
+```
+
+The `.lower()` method is used to convert the sentence to lowercase before tokenizing it.
+
+```python
+text = "This Is A TeST Sentence."
+
+lower_text = text.lower()
+
+print(lower_text)
+```
+
+and the output:
+
+```console
+this is a test sentence.
+```
+
+The `.split()` method is used to split the sentence into a list of words.
+
+text = "This is a test sentence."
+
+words = text.split()
+
+print(words)
+
+````
+
+and the output:
+
+```console
+['This', 'is', 'a', 'test', 'sentence.']
+````
+
+### References
+
+- [string methods](https://www.w3schools.com/python/python_ref_string.asp)
+- [replace](https://www.w3schools.com/python/ref_string_replace.asp)
+- [split](https://www.w3schools.com/python/ref_string_split.asp)
+- import "string" module and [get all string punctuations](https://docs.python.org/3/library/string.html#string.punctuation)
+- [Tokenization in text analysis](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
+- [Word segmentation](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation)