String tokenizer

Instructions

Tokenization is the process of breaking a string down into smaller pieces, called tokens. In natural language processing, it typically means splitting a sentence into words or a paragraph into sentences.
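
To make the two levels of tokenization concrete, here is a small sketch. The sample paragraph and the rule "a sentence ends at a period followed by whitespace" are assumptions chosen for illustration; real text needs more careful handling.

```python
import re

paragraph = "Tokenizers split text. They come in many forms."

# Sentence-level tokens: split on whitespace that follows a period.
sentences = re.split(r"(?<=\.)\s+", paragraph)

# Word-level tokens: split the first sentence on whitespace.
words = sentences[0].split()

print(sentences)
print(words)
```

Note that the word-level tokens still carry punctuation ('text.'), which is why the exercise below asks you to deal with punctuation explicitly.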

Create a file string_processing.py containing a function tokenize(sentence) that, given a sentence, does the following:

  • remove all punctuation marks and special characters, treating each one as a word separator
  • split the sentence into words, like so: "it's not 3" => ['it', 's', 'not', '3']
  • put all the words in lowercase
  • return a list of all the words.

Usage

Here is a possible test.py to test your function:

import string_processing

if __name__ == '__main__':
    my_sentence = "It's not possible, you can't ask for a raise"
    print(string_processing.tokenize(my_sentence))

$ python test.py
['it', 's', 'not', 'possible', 'you', 'can', 't', 'ask', 'for', 'a', 'raise']

Hints

The re module is part of the Python standard library and provides functions for working with regular expressions, including:

  • re.sub(): replaces all occurrences of a regular expression pattern in a string with a replacement string.

import re

text = "This is a test sentence. It has multiple punctuation marks!"

# Replace all exclamation marks with question marks
new_text = re.sub("!", "?", text)

print(new_text)

and the output:

This is a test sentence. It has multiple punctuation marks?
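
re.sub() also accepts character classes, which is handy when many different characters should be replaced at once. The sketch below assumes that "punctuation and special characters" means anything other than an ASCII letter or digit; the sample text is made up for illustration.

```python
import re

text = "Hello, world! It's 3 o'clock."

# [^a-zA-Z0-9] matches any single character that is not an ASCII letter or digit.
# Replacing each match with a space keeps neighbouring words apart.
cleaned = re.sub(r"[^a-zA-Z0-9]", " ", text)

print(cleaned)
```

Replacing with a space rather than an empty string matters here: with an empty replacement, "It's" would collapse into "Its" instead of separating into two words.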

The .lower() method returns a lowercase copy of a string. Use it to convert the sentence to lowercase before splitting it into tokens.

text = "This Is A TeST Sentence."

lower_text = text.lower()

print(lower_text)

and the output:

this is a test sentence.

The .split() method splits a string into a list of words. Called with no arguments, it splits on whitespace, so punctuation stays attached to the neighboring word:

text = "This is a test sentence."

words = text.split()

print(words)

and the output:

['This', 'is', 'a', 'test', 'sentence.']
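
The three hints can be combined into one possible sketch of tokenize. It is not the only valid solution, and it assumes that every character other than an ASCII letter or digit counts as punctuation or a special character.

```python
import re


def tokenize(sentence):
    """Return the lowercase words of `sentence` as a list."""
    # Assumption: anything other than an ASCII letter or digit is treated
    # as punctuation or a special character and becomes a space.
    cleaned = re.sub(r"[^a-zA-Z0-9]", " ", sentence)
    # lower() normalizes case; split() with no arguments splits on runs
    # of whitespace, so repeated spaces produce no empty tokens.
    return cleaned.lower().split()


print(tokenize("It's not possible, you can't ask for a raise"))
```

Under this assumption, accented letters and other non-ASCII characters would also be treated as separators; a different character class would be needed to keep them.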
