Python Lesson 34: Stemming and Lemmatization (NLP pt. 3)


Hello everybody,

Michael here, and today’s lesson will cover stemming and lemmatization in Python NLP (natural language processing).

Stemming

Now that we’ve covered some basic tokenization concepts (like tokenization itself and filtering out stopwords), we can move on to the next important concepts in NLP: stemming and lemmatization. Stemming is an NLP task that involves reducing words to their roots. For instance, stemming the words “liked” and “likely” would result in “like”.

Now, the NLTK package has several stemmers you can use, but for this lesson (along with all Python NLP lessons on this blog) I will be using NLTK’s PorterStemmer. Let’s see stemming in action:

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
from nltk.stem import PorterStemmer

test = input('Please input a test string: ')
testWords = nltk.word_tokenize(test)  # split the string into tokens
print(testWords)

stemmer = PorterStemmer()
stemmedWords = [stemmer.stem(word) for word in testWords]  # stem each token
print(stemmedWords)

Please input a test string: Byte sized programming classes for eager coding learners
['Byte', 'sized', 'programming', 'classes', 'for', 'eager', 'coding', 'learners']
['byte', 'size', 'program', 'class', 'for', 'eager', 'code', 'learner']

To start your stemming, include the first three lines of code you see above for the imports and downloads. And yes, you’ll need to import PorterStemmer separately.

After including all the necessary downloads and imports, I added code to input a test string, word-tokenize that test string, and print the list of tokens. After tokenizing, I created a PorterStemmer object (aptly named stemmer), used a list comprehension to stem each token in the input string, and printed the stemmed list of tokens.

What do you notice in the stemmed list of tokens (it’s the last line of output, by the way)? First of all, all words are displayed in lowercase, which is nothing remarkable. Secondly, notice how most of the stemmed tokens make perfect sense (e.g. sized → size, programming → program, and so on); that said, when stemming words in Python NLP, you’ll sometimes get some weird outputs.

Now, let’s try another input string and see what kind of results we get:

import nltk
nltk.download('punkt')
from nltk.stem import PorterStemmer

test = input('Please input a test string: ')
testWords = nltk.word_tokenize(test)
print(testWords)

stemmer = PorterStemmer()
stemmedWords = [stemmer.stem(word) for word in testWords]
print(stemmedWords)

Please input a test string: The quick brown fox jumped over the lazy brown dog and jumps over the even lazier brown cat.
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'brown', 'dog', 'and', 'jumps', 'over', 'the', 'even', 'lazier', 'brown', 'cat', '.']
['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'brown', 'dog', 'and', 'jump', 'over', 'the', 'even', 'lazier', 'brown', 'cat', '.']

Just like the previous example, this one tokenizes an input string and stems each element in the tokenized list. However, pay attention to the words “lazy” and “lazier”. Although “lazier” is just the comparative form of “lazy”, “lazy” is stemmed to “lazi” while “lazier” is left as “lazier”.

OK, so if stemming sometimes gives you weird and inconsistent results (like in the example above), there’s a reason for that. See, stemming reduces words to their core meaning. However, unlike lemmatization (which I’ll discuss next), stemming is a lot cruder, so it’s not uncommon to get fragments of words when stemming. Plus, the PorterStemmer tool is based on an algorithm that was developed back in 1979, so yeah, it’s a little dated. There is an improved version of the Porter algorithm, commonly called Porter2 (available in NLTK as the SnowballStemmer), that improves upon the original Porter stemmer we used. Just FYI.
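If you’re curious, here’s a minimal sketch of trying that improved Snowball (Porter2) stemmer side by side with PorterStemmer. The word list is just my own illustration; exact outputs can vary slightly between NLTK versions:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # the "Porter2" English stemmer

# compare the two stemmers on a few words from the earlier examples
for word in ['lazy', 'lazier', 'programming', 'cats']:
    print(word, '->', porter.stem(word), '/', snowball.stem(word))
```

Note that the Snowball stemmer is purely algorithmic, so it needs no extra nltk.download call.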

Lemmatization

Now that we’ve covered the basics of word stemming, let’s move on to word lemmatization. Lemmatization, like stemming, is an NLP tool that is meant to reduce words to their core meaning. However, unlike stemming, lemmatization usually gives you a complete word rather than a fragment of a word (e.g. “lazi” from the previous example).

import nltk
nltk.download('punkt')
nltk.download('wordnet')  # lexical database used by the lemmatizer
from nltk.stem import WordNetLemmatizer

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
lemmatizer = WordNetLemmatizer()
lemmatizedList = [lemmatizer.lemmatize(word) for word in tokens]

print(lemmatizedList)

Please input a test string: The two friends drove their nice blue cars across the Florida coast.
['The', 'two', 'friend', 'drove', 'their', 'nice', 'blue', 'car', 'across', 'the', 'Florida', 'coast', '.']

So, how did I accomplish the lemmatization? After adding in all the necessary downloads and imports (and typing in an input string), I first tokenized my input string. I then created a lemmatizer object (aptly named lemmatizer) using NLTK’s WordNetLemmatizer tool. To lemmatize each token, I used a list comprehension to pass each element of the token list (aptly named tokens) into the lemmatizer. I stored the results of that list comprehension in lemmatizedList and printed that list below the text input.

As you can see from the example above, most of the lemmas are the same as the tokens themselves (e.g. two, across, blue). However, some of the tokens have different lemmas (e.g. cars → car, friends → friend). That’s because, as I mentioned earlier, lemmatization finds the root of a word. In the case of the words cars and friends, the root would be the word’s singular form (car and friend, respectively).

  • Just thought I’d put this out here, but the root word that is generated is called a lemma, and the group of word forms that share a particular lemma is called a lexeme. For instance, the word “try” would be the lemma, while the words “trying”, “tried”, and “tries” would be part of (though not the only words in) that lexeme.

So, from the example above, it looks like lemmatization works much better than stemming when it comes to finding the root of a word. But what if you tried lemmatizing a word that looked very different from its lemma? Let’s see an example of that below (using the lemmatizer object created in the previous example):

lemmatizer.lemmatize("bought")
'bought'

In this example, I’m trying to lemmatize the word “bought” (as in, “I bought a new watch”). However, you can see that the lemma of bought comes back as bought. That can’t be right, can it?

Why do you think that particular output was generated? Simply put, the lemmatizer tool, by default, will assume a word is a noun (even when that clearly isn’t the case).

How can we correct this? Take a look at the example below:

lemmatizer.lemmatize("bought", pos='v')
'buy'

In this example, I added the pos parameter, which specifies a part of speech for a particular word. In this case, I set the value of pos to v, as the word “bought” is a verb. Once I added the pos parameter, I was able to get the correct lemma for the word “bought”: “buy”.

  • When working with lemmatization, you’ll run into this issue quite a bit with adjectives and irregular verbs.

Thanks for reading,

Michael