## String tokenizer
### Instructions
Tokenization is the process of breaking down a string into smaller pieces, called tokens. In natural language processing, tokenization typically refers to the process of breaking down a sentence into words or breaking down a paragraph into sentences.

Create a file `string_processing.py` with a function `tokenize(sentence)` that, given a sentence, does the following:
- removes all punctuation marks and special characters
- separates all words like so: `"it's not 3" => ['it', 's', 'not', '3']`
- puts all the words in lowercase
- returns a list of all the words
### Usage
Here is a possible `test.py` to test your function:
```python
import string_processing

if __name__ == '__main__':
    my_sentence = "It's not possible, you can't ask for a raise"
    print(string_processing.tokenize(my_sentence))
```
```bash
$ python test.py
['it', 's', 'not', 'possible', 'you', 'can', 't', 'ask', 'for', 'a', 'raise']
```
### Hints
The `re` module provides a set of functions for working with regular expressions, including:

- `re.sub()`: Replaces all occurrences of a regular expression pattern in a string with a replacement string.
```python
import re

text = "This is a test sentence. It has multiple punctuation marks!"

# Replace all exclamation marks with question marks
new_text = re.sub("!", "?", text)

print(new_text)
```
and the output:
```console
This is a test sentence. It has multiple punctuation marks?
```
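The same function can also strip characters instead of swapping them. As a hint only (the `[^\w\s]` pattern and replacing with a space are just one possible approach, not the required solution), turning every punctuation mark or special character into a space keeps `it's` as two separate words once the string is later lowercased and split:

```python
import re

text = "It's not 3"

# Replace anything that is not a word character or whitespace with a space
cleaned = re.sub(r"[^\w\s]", " ", text)

print(cleaned)  # It s not 3
```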
The `.lower()` method is used to convert the sentence to lowercase before tokenizing it.
```python
text = "This Is A TeST Sentence."

lower_text = text.lower()

print(lower_text)
```
and the output:
```console
this is a test sentence.
```
The `.split()` method is used to split the sentence into a list of words.
text = "This is a test sentence."
|
||
|
|
||
|
words = text.split()
|
||
|
|
||
|
print(words)
|
||
|
|
||
|
````
and the output:
```console
['This', 'is', 'a', 'test', 'sentence.']
```
### References
- [string methods](https://www.w3schools.com/python/python_ref_string.asp)
- [replace](https://www.w3schools.com/python/ref_string_replace.asp)
- [split](https://www.w3schools.com/python/ref_string_split.asp)
- import "string" module and [get all string punctuations](https://docs.python.org/3/library/string.html#string.punctuation)
- [Tokenization in text analysis](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
- [Word segmentation](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation)
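
As a small illustration of the `string` module reference above (a sketch only, not part of the required solution), `string.punctuation` gives you every ASCII punctuation character in one string:

```python
import string

# All ASCII punctuation characters, as a single string
print(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```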