## String tokenizer

### Instructions
Tokenization is the process of breaking a string down into smaller pieces, called tokens. In natural language processing, it typically refers to splitting a sentence into words or a paragraph into sentences.
Create a file `string_processing.py` containing a function `tokenize(sentence)` that, given a sentence, does the following:

- removes all punctuation marks and special characters
- separates the words like so: `"it's not 3"` => `['it', 's', 'not', '3']`
- puts all the words in lowercase
- returns the list of all the words
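As a rough illustration, here is one possible sketch (not the only valid approach); it assumes that every character other than an ASCII letter or digit counts as a punctuation mark or special character:

```python
import re


def tokenize(sentence):
    # Replace every character that is not a letter or a digit with a space
    # (assumption: anything else counts as punctuation/special character),
    # then lowercase the result and split it on whitespace.
    cleaned = re.sub(r"[^a-zA-Z0-9]+", " ", sentence)
    return cleaned.lower().split()
```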
### Usage

Here is a possible `test.py` to test your function:
```python
import string_processing

if __name__ == '__main__':
    my_sentence = "It's not possible, you can't ask for a raise"
    print(string_processing.tokenize(my_sentence))
```
```console
$ python test.py
['it', 's', 'not', 'possible', 'you', 'can', 't', 'ask', 'for', 'a', 'raise']
```
### Hints
The `re` library is Python's module for working with regular expressions. It provides a set of functions for working with them, including:

`re.sub()`: replaces all occurrences of a regular expression pattern in a string with a replacement string.
text = "This is a test sentence. It has multiple punctuation marks!"
# Replace all exclamation marks with question marks
new_text = re.sub("!", "?", text)
print(new_text)
and the output:

```console
This is a test sentence. It has multiple punctuation marks?
```
The `.lower()` method is used to convert the sentence to lowercase before tokenizing it.
text = "This Is A TeST Sentence."
lower_text = text.lower()
print(lower_text)
and the output:

```console
this is a test sentence.
```
The `.split()` method is used to split the sentence into a list of words.
text = "This is a test sentence."
words = text.split()
print(words)
and the output:

```console
['This', 'is', 'a', 'test', 'sentence.']
```
### References
- string methods
- replace
- split
- import the `string` module and get all punctuation characters (`string.punctuation`)
- Tokenization in text analysis
- Word segmentation
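Regarding the `string` module reference above, here is a minimal sketch of how `string.punctuation` can be used to strip punctuation, as an alternative to `re.sub` (the sample sentence is only an illustration):

```python
import string

text = "It's not possible!"
# Build a translation table that maps every ASCII punctuation
# character in string.punctuation to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
print(text.translate(table).lower().split())
# ['it', 's', 'not', 'possible']
```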