This is a demonstration of the various tokenizers provided by NLTK 3.9.1.
Tokenization is a way to split text into tokens. These tokens can be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in its tokenize module; this demo shows how four of them work, with spaCy's tokenizer included for comparison.
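For instance, here is a minimal sketch of sentence and word tokenization with NLTK (the sample text is illustrative, not the demo's own):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# The pre-trained Punkt sentence model must be available, e.g.:
# nltk.download("punkt_tab")

text = "NLTK ships many tokenizers. Each splits text differently."
print(sent_tokenize(text))  # two sentence strings
print(word_tokenize(text))  # word and punctuation tokens
```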
The text is first tokenized into sentences using the PunktSentenceTokenizer. Then each sentence is tokenized into words using three different word tokenizers, as sketched below.
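A rough sketch of that two-stage pipeline follows; the three word tokenizers named here (Treebank, WordPunct, and whitespace) are illustrative assumptions, not necessarily the ones the demo uses:

```python
from nltk.tokenize import (
    PunktSentenceTokenizer,
    TreebankWordTokenizer,
    WordPunctTokenizer,
    WhitespaceTokenizer,
)

text = "I can't believe it. Neither can she."

# Stage 1: sentence tokenization (an untrained Punkt instance
# still applies its default heuristics).
sentences = PunktSentenceTokenizer().tokenize(text)

# Stage 2: run each sentence through each word tokenizer.
for sentence in sentences:
    for tokenizer in (TreebankWordTokenizer(),
                      WordPunctTokenizer(),
                      WhitespaceTokenizer()):
        print(type(tokenizer).__name__, tokenizer.tokenize(sentence))
```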
The spaCy tokenizer does its own sentence and word tokenization, and is included to show how that library tokenizes text before further parsing.
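By contrast, a spaCy sketch handles both stages in a single pipeline (assuming the small English model has been installed):

```python
import spacy

# Assumes the model is installed, e.g.:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("I can't believe it. Neither can she.")
for sent in doc.sents:  # spaCy's own sentence segmentation
    print([token.text for token in sent])
```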
The initial example text provides two sentences that demonstrate how each word tokenizer handles non-ASCII characters and the simple punctuation of contractions.
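To illustrate the kind of differences such text exposes, here is a hypothetical sentence (not the demo's actual text) run through two NLTK word tokenizers:

```python
from nltk.tokenize import word_tokenize, wordpunct_tokenize

sentence = "Isn't the naïve café sign wrong?"
# Treebank-style tokenization keeps contractions as clitics:
print(word_tokenize(sentence))
# ['Is', "n't", 'the', 'naïve', 'café', 'sign', 'wrong', '?']
# WordPunct splits on every punctuation boundary:
print(wordpunct_tokenize(sentence))
# ['Isn', "'", 't', 'the', 'naïve', 'café', 'sign', 'wrong', '?']
```

Both tokenizers pass the non-ASCII words through intact; they differ only in where they cut around the apostrophe.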