## String tokenizer

### Instructions

Tokenization is the process of breaking down a string into smaller pieces, called tokens. In natural language processing, tokenization typically refers to breaking down a sentence into words or a paragraph into sentences.

Create a file `string_processing.py` containing a function `tokenize(sentence)` that, given a sentence, does the following:

- removes all punctuation marks and special characters
- separates all words like so: `"it's not 3" => ['it', 's', 'not', '3']`
- puts all the words in lowercase
- returns a list of all the words

### Usage

Here is a possible `test.py` to test your function:

```python
import string_processing

if __name__ == '__main__':
    my_sentence = "It's not possible, you can't ask for a raise"
    print(string_processing.tokenize(my_sentence))
```

```bash
$ python test.py
['it', 's', 'not', 'possible', 'you', 'can', 't', 'ask', 'for', 'a', 'raise']
```

### Hints

The `re` module provides a set of functions for working with regular expressions, including:

- `re.sub()`: replaces all occurrences of a regular expression pattern in a string with a replacement string.

```python
import re

text = "This is a test sentence. It has multiple punctuation marks!"

# Replace all exclamation marks with question marks
new_text = re.sub("!", "?", text)

print(new_text)
```

and the output:

```console
This is a test sentence. It has multiple punctuation marks?
```
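
Beyond replacing a single character, `re.sub()` also accepts a character class, which is handy for removing every punctuation mark in one pass. The pattern below is one illustrative choice, not the only valid one:

```python
import re

text = "it's not 3"

# Replace every character that is NOT a letter, digit, or whitespace
# with a space, so words stay separated
cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", text)

print(cleaned)  # it s not 3
```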

The `.lower()` method converts the sentence to lowercase before tokenizing it.

```python
text = "This Is A TeST Sentence."

lower_text = text.lower()

print(lower_text)
```

and the output:

```console
this is a test sentence.
```

The `.split()` method splits the sentence into a list of words.

```python
text = "This is a test sentence."

words = text.split()

print(words)
```

and the output:

```console
['This', 'is', 'a', 'test', 'sentence.']
```
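
Putting the three hints together, here is one possible sketch of `tokenize` (a sketch under the assumptions above, not the required implementation — the regular expression pattern is an illustrative choice):

```python
import re

def tokenize(sentence):
    # Replace punctuation and special characters with spaces,
    # lowercase the result, then split on whitespace
    cleaned = re.sub(r"[^a-zA-Z0-9\s]", " ", sentence)
    return cleaned.lower().split()

print(tokenize("It's not possible, you can't ask for a raise"))
# ['it', 's', 'not', 'possible', 'you', 'can', 't', 'ask', 'for', 'a', 'raise']
```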

### References

- [string methods](https://www.w3schools.com/python/python_ref_string.asp)
- [replace](https://www.w3schools.com/python/ref_string_replace.asp)
- [split](https://www.w3schools.com/python/ref_string_split.asp)
- import the `string` module and [get all string punctuation](https://docs.python.org/3/library/string.html#string.punctuation)
- [Tokenization in text analysis](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
- [Word segmentation](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation)