## String tokenizer
### Instructions
Tokenization is the process of breaking down a string into smaller pieces, called tokens. In natural language processing, tokenization typically refers to the process of breaking down a sentence into words or breaking down a paragraph into sentences.

Create a file `string_processing.py` with a function `tokenize(sentence)` that, given a sentence, does the following:
- removes all punctuation marks and special characters
- separates all words like so: `"it's not 3" => ['it', 's', 'not', '3']`
- puts all the words in lowercase
- returns a list of all the words
### Usage
Here is a possible `test.py` to test your function:
```python
import string_processing

if __name__ == '__main__':
    my_sentence = "It's not possible, you can't ask for a raise"
    print(string_processing.tokenize(my_sentence))
```
```bash
$ python test.py
['it', 's', 'not', 'possible', 'you', 'can', 't', 'ask', 'for', 'a', 'raise']
```
### Hints
The `re` module provides a set of functions for working with regular expressions, including:

- `re.sub()`: Replaces all occurrences of a regular expression pattern in a string with a replacement string.
```python
import re

text = "This is a test sentence. It has multiple punctuation marks!"

# Replace all exclamation marks with question marks
new_text = re.sub("!", "?", text)

print(new_text)
```
and the output:
```console
This is a test sentence. It has multiple punctuation marks?
```
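The same function can also strip characters instead of swapping them. As a hint only (the `[^\w\s]` pattern and replacing with a space are just one possible approach, not the required solution), turning every punctuation mark or special character into a space keeps `it's` as two separate words once the string is later lowercased and split:

```python
import re

text = "It's not 3"

# Replace anything that is not a word character or whitespace with a space
cleaned = re.sub(r"[^\w\s]", " ", text)

print(cleaned)  # It s not 3
```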
The `.lower()` method is used to convert the sentence to lowercase before tokenizing it.
```python
text = "This Is A TeST Sentence."

lower_text = text.lower()

print(lower_text)
```
and the output:
```console
this is a test sentence.
```
The `.split()` method is used to split the sentence into a list of words.
text = "This is a test sentence."
|
||
|
|
||
|
words = text.split()
|
||
|
|
||
|
print(words)
|
||
|
|
||
|
````
and the output:
```console
['This', 'is', 'a', 'test', 'sentence.']
```
### References
- [string methods](https://www.w3schools.com/python/python_ref_string.asp)
- [replace](https://www.w3schools.com/python/ref_string_replace.asp)
- [split](https://www.w3schools.com/python/ref_string_split.asp)
- import "string" module and [get all string punctuations](https://docs.python.org/3/library/string.html#string.punctuation)
- [Tokenization in text analysis](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
- [Word segmentation](https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation)
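
As a small illustration of the `string` module reference above (a sketch only, not part of the required solution), `string.punctuation` gives you every ASCII punctuation character in one string:

```python
import string

# All ASCII punctuation characters, as a single string
print(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```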