Stemming and Lemmatization with Python NLTK

This is a demonstration of stemming and lemmatization for the 17 languages supported by the NLTK 2.0.4 stem package.

How Stemming and Lemmatization Work

Stemming is a process of removing and replacing word suffixes to arrive at a common root form of the word.
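To make the idea of removing and replacing suffixes concrete, here is a minimal toy sketch in plain Python. The rule table and the `toy_stem` function are invented for illustration only; they are not the Porter or Lancaster rules, and real stemmers apply many more rules and conditions.

```python
# Toy suffix rules, checked in order: (suffix to remove, replacement).
# These rules are hypothetical and far simpler than any real stemmer.
SUFFIX_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word, min_stem_len=3):
    """Apply the first matching suffix rule that leaves a long enough stem."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_len:
            return word[:remaining] + replacement
    return word


for w in ["jumps", "jumping", "jumped", "caresses", "ponies", "sing"]:
    print(w, "->", toy_stem(w))
```

With these rules, "jumps", "jumping" and "jumped" all reduce to the same root "jump", while "sing" is left alone because stripping "ing" would leave too short a stem.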

English Stemmers and Lemmatizers

For stemming English words with NLTK, you can choose between the PorterStemmer and the LancasterStemmer. The Porter stemming algorithm, written in 1979 and published in 1980, is the oldest stemming algorithm supported in NLTK. The Lancaster stemming algorithm is newer, published in 1990, and is generally more aggressive than the Porter algorithm.
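A short sketch comparing the two English stemmers side by side, assuming a reasonably recent NLTK is installed. The word list is arbitrary and chosen only for illustration; neither stemmer needs any extra data download.

```python
from nltk.stem import LancasterStemmer, PorterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Arbitrary sample words; swap in your own to compare behaviour.
words = ["running", "maximum", "presumably", "generously"]

# Map each word to its (Porter stem, Lancaster stem) pair.
comparison = {w: (porter.stem(w), lancaster.stem(w)) for w in words}

for word, (porter_stem, lancaster_stem) in comparison.items():
    print(f"{word:12} porter={porter_stem:12} lancaster={lancaster_stem}")
```

Words such as "maximum" show the difference in aggressiveness: the Lancaster stemmer strips the "-um" ending, while the Porter stemmer has no rule for it and leaves the word unchanged.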

The WordNet Lemmatizer uses the WordNet database to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
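The difference between a stem and a lemma can be sketched as follows. The WordNet lemmatizer needs the WordNet corpus, which is usually fetched with `nltk.download("wordnet")`. The `lemma_or_none` helper is a hypothetical wrapper added here so that the example still runs, returning None, when that corpus is not installed.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer


def lemma_or_none(word, pos="n"):
    """Return the WordNet lemma, or None if the WordNet data is missing."""
    try:
        return WordNetLemmatizer().lemmatize(word, pos=pos)
    except LookupError:
        # Raised by NLTK when the corpus has not been downloaded.
        return None


word = "ponies"
stem = PorterStemmer().stem(word)  # a stem, not necessarily a real word
lemma = lemma_or_none(word)        # a dictionary form, if WordNet is available

print("stem: ", stem)
print("lemma:", lemma)
```

The `pos` argument tells the lemmatizer which part of speech to assume ("n" for noun, "v" for verb, and so on); the same surface form can have different lemmas depending on it.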

Non-English Stemmers

Stemming for Portuguese is available in NLTK with the RSLPStemmer and also with the SnowballStemmer. Arabic stemming is supported with the ISRIStemmer.
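As a rough sketch of non-English stemming, the snippet below uses the Snowball Portuguese stemmer and the ISRI Arabic stemmer, neither of which requires extra data. The sample words are arbitrary choices for illustration.

```python
from nltk.stem.isri import ISRIStemmer
from nltk.stem.snowball import SnowballStemmer

portuguese = SnowballStemmer("portuguese")
arabic = ISRIStemmer()

# Arbitrary sample words for each language.
pt_stems = {w: portuguese.stem(w) for w in ["meninas", "correndo"]}
ar_stems = {w: arabic.stem(w) for w in ["المدرسة", "يكتبون"]}

for word, stem in {**pt_stems, **ar_stems}.items():
    print(word, "->", stem)
```

The RSLPStemmer is used the same way (`RSLPStemmer().stem(word)`), but it loads its rule files from the NLTK data directory, so it typically needs `nltk.download("rslp")` first.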

Snowball Stemmers

Snowball is a small language for writing stemming algorithms. Stemmers based on it were added to NLTK in version 2.0b9 as the SnowballStemmer class. The NLTK Snowball stemmer currently supports the following languages:

  • Danish
  • Dutch
  • English
  • Finnish
  • French
  • German
  • Hungarian
  • Italian
  • Norwegian
  • Porter (the original Porter algorithm for English, not a separate language)
  • Portuguese
  • Romanian
  • Russian
  • Spanish
  • Swedish
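The list above can be checked against the installed NLTK version, which may support more languages than the ones shown here. This sketch assumes only the SnowballStemmer class itself; "klingon" is a deliberately unsupported name used to show the error path.

```python
from nltk.stem.snowball import SnowballStemmer

# Tuple of language names accepted by the constructor in this NLTK version.
supported = SnowballStemmer.languages
print(", ".join(supported))

# "english" is the newer Snowball English stemmer; "porter" is the original.
english_stem = SnowballStemmer("english").stem("generously")
porter_stem = SnowballStemmer("porter").stem("generously")
print(english_stem, porter_stem)

# Unsupported language names are rejected with a ValueError.
try:
    SnowballStemmer("klingon")
    unsupported_rejected = False
except ValueError:
    unsupported_rejected = True
```

The "english" and "porter" entries can give different results for the same word, so it is worth choosing one and applying it consistently.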

Natural Language Stemming API

If you'd like to use this through an API, please see the Stemming API Docs. For higher limits and premium API access, sign up for the Mashape Text-Processing API.

Natural Language Processing Services

  • Want to download/purchase any of these models?
  • Need a custom model, trained on a public or custom corpus?
  • Want help creating or bootstrapping a custom corpus?

If you answered yes to any of these questions, please fill out this Natural Language Processing Services Survey.
