Word Tokenization with Python NLTK

This is a demonstration of the various tokenizers provided by NLTK 2.0.4.

Tokenize Text
  • Enter up to 50,000 characters

How Text Tokenization Works

Tokenization splits text into tokens, which may be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in its tokenize module; this demo shows how 5 of them work.
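At its simplest, tokenization can be approximated with Python's built-in str.split, though this keeps punctuation attached to neighboring words — a naive sketch, not how NLTK's tokenizers work:

```python
# Naive tokenization: str.split breaks on whitespace only,
# so punctuation such as the trailing "." stays attached.
text = "Good muffins cost $3.88 in New York."
tokens = text.split()
print(tokens)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.']
```

The dedicated tokenizers below exist precisely to handle the punctuation this naive split gets wrong.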

The text is first tokenized into sentences using the PunktSentenceTokenizer. Then each sentence is tokenized into words using 4 different word tokenizers:

  1. TreebankWordTokenizer
  2. WordPunctTokenizer
  3. PunktWordTokenizer
  4. WhitespaceTokenizer
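The sentence-then-word pipeline can be sketched with the standard library alone. The regexes below are approximations I am assuming for illustration: the word pattern matches the regex WordPunctTokenizer is documented to use, the whitespace split mirrors WhitespaceTokenizer, and the sentence splitter is a crude stand-in for the trained PunktSentenceTokenizer. The real classes live in nltk.tokenize:

```python
import re

def sentences(text):
    # Crude stand-in for PunktSentenceTokenizer: split after
    # sentence-final punctuation followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

def wordpunct(sentence):
    # Same pattern WordPunctTokenizer uses: runs of word
    # characters, or runs of non-space punctuation.
    return re.findall(r"\w+|[^\w\s]+", sentence)

def whitespace(sentence):
    # WhitespaceTokenizer simply splits on whitespace.
    return sentence.split()

text = "NLTK is great. It ships many tokenizers!"
for sent in sentences(text):
    print(wordpunct(sent), whitespace(sent))
# ['NLTK', 'is', 'great', '.'] ['NLTK', 'is', 'great.']
# ['It', 'ships', 'many', 'tokenizers', '!'] ['It', 'ships', 'many', 'tokenizers!']
```

Note that the real Punkt tokenizer is statistically trained and handles abbreviations like "Dr." far better than this regex sketch.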

The tokenizer from the pattern library does its own sentence and word tokenization; it is included to show how that library tokenizes text before further parsing.

The initial example text provides 2 sentences that demonstrate how each word tokenizer handles non-ASCII characters and the simple punctuation of contractions.
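For a contraction such as "can't", the word tokenizers give visibly different results. The sketch below reproduces the behaviors with plain regexes; the "n't" rule is one contraction rule I am assuming from the Treebank tokenizer's conventions, not the full implementation:

```python
import re

sentence = "I can't do that."

# WhitespaceTokenizer-style: the contraction stays intact.
print(sentence.split())
# ['I', "can't", 'do', 'that.']

# WordPunctTokenizer-style: the apostrophe becomes its own token.
print(re.findall(r"\w+|[^\w\s]+", sentence))
# ['I', 'can', "'", 't', 'do', 'that', '.']

# One Treebank-style rule: split off "n't", yielding "ca" + "n't".
split_nt = re.sub(r"(?i)n't\b", " n't", sentence)
print(re.findall(r"\w+'?\w*|[^\w\s]+", split_nt))
# ['I', 'ca', "n't", 'do', 'that', '.']
```

This is why the demo's example text includes contractions: the same input produces three different token streams depending on the tokenizer chosen.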

Natural Language Processing Services

  • Want to download/purchase any of these models?
  • Need a custom model, trained on a public or custom corpus?
  • Want help creating or bootstrapping a custom corpus?

If you answered yes to any of these questions, please fill out this Natural Language Processing Services Survey.

