Word Tokenization with Python NLTK

This is a demonstration of the various tokenizers provided by NLTK 2.0.4.

How Text Tokenization Works

Tokenization splits text into smaller units called tokens. These tokens can be paragraphs, sentences, or individual words. NLTK provides a number of tokenizers in its tokenize module (nltk.tokenize). This demo shows how five of them work.

The text is first tokenized into sentences using the PunktSentenceTokenizer. Then each sentence is tokenized into words using 4 different word tokenizers:

  1. TreebankWordTokenizer
  2. WordPunctTokenizer
  3. PunktWordTokenizer
  4. WhitespaceTokenizer
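
The two-stage pipeline described above can be sketched with NLTK's own tokenizer classes. This is a minimal sketch, assuming a recent NLTK 3.x install rather than the 2.0.4 release used by the demo. The helper name tokenize_text and the sample text are made up for illustration. The Punkt-based word tokenizer is left out because, as far as I know, it shipped with NLTK 2.x and is not available in 3.x. The sentence splitter is built without training data, so it knows no abbreviations and may split differently from the demo.

```python
from nltk.tokenize import (
    PunktSentenceTokenizer,
    TreebankWordTokenizer,
    WhitespaceTokenizer,
    WordPunctTokenizer,
)

# Untrained Punkt splitter: an assumption for this sketch; the demo may use a trained model.
sentence_tokenizer = PunktSentenceTokenizer()

# The word tokenizers from the list above that are assumed to exist in NLTK 3.x.
word_tokenizers = {
    "TreebankWordTokenizer": TreebankWordTokenizer(),
    "WordPunctTokenizer": WordPunctTokenizer(),
    "WhitespaceTokenizer": WhitespaceTokenizer(),
}


def tokenize_text(text):
    """Split text into sentences, then run every word tokenizer on each sentence."""
    results = []
    for sentence in sentence_tokenizer.tokenize(text):
        tokens = {
            name: tokenizer.tokenize(sentence)
            for name, tokenizer in word_tokenizers.items()
        }
        results.append((sentence, tokens))
    return results


if __name__ == "__main__":
    # Hypothetical sample text, not the demo's initial example.
    for sentence, tokens in tokenize_text("Don't stop. It can't wait."):
        print(sentence)
        for name, words in tokens.items():
            print("   ", name, words)
```

Each entry in the result pairs one sentence with a dictionary of token lists, one per tokenizer, which makes the differences easy to compare side by side.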

The initial example text consists of two sentences that demonstrate how each word tokenizer handles non-ASCII characters and the simple punctuation of contractions.
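
To illustrate the kind of difference such text exposes, here is a rough standard-library sketch that approximates two of the tokenizers: whitespace splitting via str.split(), and the `\w+|[^\w\s]+` pattern that the NLTK documentation gives for WordPunctTokenizer. The function names and the sample sentence are invented for illustration; the sample is not the demo's initial text.

```python
import re

# Pattern documented for NLTK's WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def whitespace_tokens(sentence):
    """Approximate WhitespaceTokenizer: split on runs of whitespace only."""
    return sentence.split()


def word_punct_tokens(sentence):
    """Approximate WordPunctTokenizer: separate word characters from punctuation."""
    return WORD_PUNCT.findall(sentence)


if __name__ == "__main__":
    sample = "Zoë can't find the café."  # hypothetical sample sentence
    print(whitespace_tokens(sample))
    print(word_punct_tokens(sample))
```

Whitespace splitting keeps the contraction whole but leaves the final period attached to the last word, while the word/punctuation pattern breaks the contraction at the apostrophe. On Python 3 strings, `\w` is Unicode-aware, so precomposed accented letters stay inside their words.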


