Python Lesson 35: Parts-of-speech tagging (NLP pt. 4)


Hello everybody,

Michael here, and today’s post will cover parts-of-speech tagging as it relates to Python NLP (this is part 4 in my NLP Python series).

Intro to parts-of-speech tagging

What is parts-of-speech (POS) tagging, exactly? See, Python NLP can do some really cool things, such as getting the roots of words (Python Lesson 34: Stemming and Lemmatization (NLP pt. 3)) and finding commonly used words (stopwords) in 24 different languages (Python Lesson 33: Stopwords (NLP pt.2)). In Python, parts-of-speech tagging is a quite self-explanatory process, as it involves tokenizing a string and identifying each token’s part-of-speech (such as a noun, verb, etc.). Keep in mind that this isn’t going to be a grammar lesson, so I’m not going to teach you how to use POS tagging to improve your grammar or proofread something you wrote.

POS tagging in action

Now that I’ve explained the basics of POS tagging, let’s see it in action! Take a look at the example below:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)

tagged = nltk.pos_tag(tokens)

print(tagged)

Please input a test string: I had a fun time dancing last night.
[('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'), ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'), ('night', 'NN'), ('.', '.')]

Before getting into the fun POS tagging, you’d first need to import the nltk package and download two of the package’s resources: punkt and averaged_perceptron_tagger. punkt supplies the pretrained tokenizer models that word_tokenize relies on, while averaged_perceptron_tagger supplies the pretrained tagging model that enables all the fun POS tagging capabilities.

After including all the necessary imports and downloading all the necessary resources, I then inputted and word-tokenized a test string.

To perform the POS tagging, I passed the list of tokens (aptly called tokens) to NLTK’s pos_tag() function, which tags every token in a single call, and stored the result in a list called tagged. I then printed out the tagged list, which contains the results of our POS tagging. As you can see, tagged is a list of tuples: the first element in each tuple is the token itself, while the second element is that token’s part-of-speech tag. Punctuation is included too, since each punctuation mark counts as its own token and receives a tag of its own.

You likely noticed that the POS tags are all two- or three-character abbreviations. Here’s a table explaining all of the POS tags:

CC (coordinating conjunction): any of the FANBOYS conjunctions (for, and, nor, but, or, yet, so)
CD (cardinal digit): a number (e.g. 0-9)
DT (determiner): a word in front of a noun that specifies quantity or clarifies what the noun refers to (e.g. one car, that child)
EX (existential there): There is a snake in the grass.
FW (foreign word): since I’m using English for this post, any word that isn’t English (e.g. palabra in Spanish)
IN (preposition): on, in, at
JJ (adjective, base form): large, tiny
JJR (comparative adjective): larger, tinier
JJS (superlative adjective): largest, tiniest
LS (list marker): 1), 2)
MD (modal verb): otherwise known as an auxiliary verb (e.g. might happen, must visit)
NN (singular noun): car, tree, cat, etc.
NNS (plural noun): cars, trees, cats, etc.
NNP (singular proper noun): Ford
NNPS (plural proper noun): Americans
PDT (predeterminer): a word or phrase that occurs before a determiner and quantifies a noun phrase (e.g. lots of toys, few students)
POS (possessive ending): Michael’s, Tommy’s
PRP (personal pronoun): pronouns associated with a grammatical person, be it first, second, or third person (e.g. they, he, she)
PRP$ (possessive pronoun): pronouns that indicate possession (e.g. mine, theirs, hers)
RB (adverb): very, extremely
RBR (comparative adverb): earlier, worse
RBS (superlative adverb): best, worst
RP (particle): a word that pairs with a verb but doesn’t fall within the main parts of speech (e.g. the up in give up)
TO (the word ‘to’): to come home
UH (interjection): Yikes! Ummmm.
VB (verb, base form): walk
VBD (verb, past tense): walked
VBG (verb, gerund form): walking
VBN (verb, past participle): walked
VBP (verb, present tense, non-3rd-person singular): walk
VBZ (verb, present tense, 3rd-person singular): walks
WDT (“wh” determiner): which
WP (“wh” pronoun): who, what
WP$ (possessive “wh” pronoun): whose
WRB (“wh” adverb): where, when

As you can see, even though English has only eight parts of speech (verbs, nouns, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections), NLTK’s tagger uses 35 (!) parts-of-speech tags.
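To connect the two sets, here’s a rough sketch that collapses the fine-grained tags back onto the traditional parts of speech, mostly by prefix. The grouping and the function name are my own approximation, not anything built into NLTK:

```python
# Rough mapping from fine-grained POS tags back to the eight
# traditional parts of speech. The grouping below is my own.
def coarse_pos(tag):
    if tag.startswith('NN'):
        return 'noun'
    if tag.startswith('VB') or tag == 'MD':
        return 'verb'
    if tag.startswith('JJ'):
        return 'adjective'
    if tag.startswith('RB') or tag == 'WRB':
        return 'adverb'
    if tag in ('PRP', 'PRP$', 'WP', 'WP$'):
        return 'pronoun'
    if tag in ('IN', 'TO'):
        return 'preposition'
    if tag == 'CC':
        return 'conjunction'
    if tag == 'UH':
        return 'interjection'
    return 'other'  # determiners, particles, punctuation, etc.

print(coarse_pos('VBG'))  # verb
print(coarse_pos('JJR'))  # adjective
```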

  • Even though I’m working with English here, keep in mind that these particular tags come from the Penn Treebank tagset, which was designed for English, and NLTK’s default tagger was trained on English text, so don’t expect reliable results on other languages.

If you take a look at the last line of output in the above example (the line containing the list of tuples), you can see two-element tuples containing the token itself as the first element along with the token’s POS tag as the second element. And yes, punctuation in a sentence counts as a token in its own right; its “tag” is simply the punctuation mark itself, which is why the token-tag pair for the period at the end of the sentence looks like this: ('.', '.').
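Since each entry is just a (token, tag) tuple, you can pick it apart with ordinary Python. A quick sketch using the tagged output from above (hard-coded here so the snippet runs on its own):

```python
# The tagged output from the example above, hard-coded for illustration.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'),
          ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'),
          ('night', 'NN'), ('.', '.')]

# Grab just the nouns: every noun tag (NN, NNS, NNP, NNPS) starts with 'NN'.
nouns = [token for token, tag in tagged if tag.startswith('NN')]
print(nouns)  # ['time', 'night']
```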

Now, what if a sentence contained the same word twice, used as different parts of speech (e.g. the same word used as both a noun and a verb)? Let’s take a look at the example below:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
    
print(tagged)

Please input a test string: She got a call at work telling her to call the project manager.
[('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'), ('at', 'IN'), ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'), ('to', 'TO'), ('call', 'VB'), ('the', 'DT'), ('project', 'NN'), ('manager', 'NN'), ('.', '.')]

Take a close look at the sentence I used in this example: She got a call at work telling her to call the project manager. Notice the repeated word here: call. In this example, call is used as both a noun (She got a call at work) and a verb (to call the project manager). The neat thing here is that NLTK’s POS tagger recognizes that the word call is used as two different parts of speech in that sentence.
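You can verify this programmatically by collecting the tag assigned to each occurrence of call (using the tagged output from above, hard-coded so the snippet stands alone):

```python
# Tagged output from the example above, hard-coded for illustration.
tagged = [('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'),
          ('at', 'IN'), ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'),
          ('to', 'TO'), ('call', 'VB'), ('the', 'DT'), ('project', 'NN'),
          ('manager', 'NN'), ('.', '.')]

# Collect the tag of every occurrence of the word 'call'.
call_tags = [tag for token, tag in tagged if token == 'call']
print(call_tags)  # ['NN', 'VB'] -- tagged as a noun first, then as a verb
```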

However, the POS tagger may not always be so accurate when it comes to recognizing the same word used as a different part of speech. Take a look at this example:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
    
print(tagged)

Please input a test string: My apartment building is bigger than any other apartment in a 5-block vicinity.
[('My', 'PRP$'), ('apartment', 'NN'), ('building', 'NN'), ('is', 'VBZ'), ('bigger', 'JJR'), ('than', 'IN'), ('any', 'DT'), ('other', 'JJ'), ('apartment', 'NN'), ('in', 'IN'), ('a', 'DT'), ('5-block', 'JJ'), ('vicinity', 'NN'), ('.', '.')]

In this example, I’m using the word apartment twice: once as a modifier (My apartment building) and once as a plain noun (any other apartment). NLTK’s POS tagger tags both instances as NN. To be fair, the first apartment is technically a noun used attributively (a noun adjunct) rather than a true adjective, but the tagger gives you no indication that it’s modifying the noun building rather than standing on its own.

  • Hey, what can you say, programs aren’t always perfect. But I’d say NLTK’s POS tagger works quite well for parts-of-speech analysis.

Thanks for reading,

Michael

Python Lesson 34: Stemming and Lemmatization (NLP pt. 3)


Hello everybody,

Michael here, and today’s lesson will cover stemming and lemmatization in Python NLP (natural language processing).

Stemming

Now that we’ve covered some basic tokenization concepts (like tokenization itself and filtering out stopwords), we can move on to the next important concepts in NLP: stemming and lemmatization. Stemming is an NLP task that involves reducing words to their roots; for instance, stemming the words “liked” and “likely” would result in “like”.

Now, the NLTK package has several stemmers you can use, but for this lesson (along with all Python NLP lessons on this blog) I will be using NLTK’s PorterStemmer stemmer. Let’s see stemming in action:

import nltk
nltk.download('punkt')
from nltk.stem import PorterStemmer

test = input('Please input a test string: ')
testWords = nltk.word_tokenize(test)
print(testWords)

stemmer = PorterStemmer()
stemmedWords = [stemmer.stem(word) for word in testWords]
print(stemmedWords)

Please input a test string: Byte sized programming classes for eager coding learners
['Byte', 'sized', 'programming', 'classes', 'for', 'eager', 'coding', 'learners']
['byte', 'size', 'program', 'class', 'for', 'eager', 'code', 'learner']

To start your stemming, include the first three lines of code you see above as the imports and downloads. And yes, you’ll need to import the PorterStemmer separately.

After including all the necessary downloads and imports, I then included code to input a test string, word-tokenize that test string, and print the list of tokens. After performing the word-tokenizing, I then created a PorterStemmer object (aptly named stemmer), performed a list comprehension to stem each token in the input string, and printed the stemmed list of tokens.

What do you notice in the stemmed list of tokens (it’s the last line of output, by the way)? First of all, all the words are displayed in lowercase, which is nothing remarkable. Secondly, notice how most of the stemmed tokens make perfect sense (e.g. sized becomes size, programming becomes program, and so on); that said, sometimes when stemming words in Python NLP, you’ll get some weird outputs.

Now, let’s try another input string and see what kind of results we get:

import nltk
nltk.download('punkt')
from nltk.stem import PorterStemmer

test = input('Please input a test string: ')
testWords = nltk.word_tokenize(test)
print(testWords)

stemmer = PorterStemmer()
stemmedWords = [stemmer.stem(word) for word in testWords]
print(stemmedWords)

Please input a test string: The quick brown fox jumped over the lazy brown dog and jumps over the even lazier brown cat.
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'brown', 'dog', 'and', 'jumps', 'over', 'the', 'even', 'lazier', 'brown', 'cat', '.']
['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'brown', 'dog', 'and', 'jump', 'over', 'the', 'even', 'lazier', 'brown', 'cat', '.']

Just like the previous example, this example tokenizes an input string and stems each element in the tokenized list. However, pay attention to the words “lazy” and “lazier”. Although “lazier” is just the comparative form of “lazy”, “lazy” gets stemmed to “lazi” while “lazier” stays as “lazier”.

OK, so if stemming sometimes gives you weird and inconsistent results (like in the example above), there’s a reason for that. See, stemming reduces words to their core meaning. However, unlike lemmatization (which I’ll discuss next), stemming is a lot cruder, so it’s not uncommon to get fragments of words when stemming. Plus, the PorterStemmer tool is based on an algorithm that dates back to around 1980, so yes, it’s a little dated. Just FYI, there is an improved successor to the Porter algorithm, called Porter2, which is available in NLTK as the English SnowballStemmer.
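If you want to try Porter2 yourself, NLTK exposes it as the English SnowballStemmer; no corpus downloads are needed just to stem a hard-coded list of words. A quick side-by-side sketch:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')  # Snowball's English stemmer is Porter2

# Compare the two algorithms side by side on a few words.
for word in ['running', 'lazy', 'lazier', 'generously']:
    print(word, porter.stem(word), snowball.stem(word))
```

The two algorithms agree on many words, but Porter2 smooths out some of the original algorithm’s rougher edges, so it’s worth comparing the outputs on your own data.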

Lemmatization

Now that we’ve covered the basics of word stemming, let’s move on to word lemmatization. Lemmatization, like stemming, is an NLP tool that is meant to reduce words to their core meaning. However, unlike stemming, lemmatization usually gives you a complete word rather than a fragment of a word (e.g. “lazi” from the previous example).

import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
lemmatizer = WordNetLemmatizer()
lemmatizedList = [lemmatizer.lemmatize(word) for word in tokens]

print(lemmatizedList)

Please input a test string: The two friends drove their nice blue cars across the Florida coast.
['The', 'two', 'friend', 'drove', 'their', 'nice', 'blue', 'car', 'across', 'the', 'Florida', 'coast', '.']

So, how did I accomplish the lemmatization? After adding in all the necessary downloads and imports (and typing in an input string), I first tokenized my input string. I then created a lemmatizer object (aptly named lemmatizer) using NLTK’s WordNetLemmatizer tool. To lemmatize each token in the input string, I used a list comprehension to pass each element of the list of tokens (aptly named tokens) into the lemmatizer. I stored the results of this list comprehension in lemmatizedList and printed that list.

As you can see from the example above, most of the lemmas are the same as the tokens themselves (e.g. two, across, blue). However, some of the tokens have different lemmas (e.g. cars–>car, friends–>friend). That’s because, as I mentioned earlier, lemmatization finds the root of a word. In the case of the words cars and friends, the root is the word’s singular form (car and friend, respectively).

  • Just thought I’d put this out here, but the root word that is generated is called a lemma, and the group of word forms sharing a particular lemma is called a lexeme. For instance, the word “try” would be the lemma, while “trying”, “tried”, and “tries” would be some (but not the only) members of that lexeme.

So, from the example above, it looks like lemmatization works much better than stemming when it comes to finding the root of a word. But what if you tried lemmatizing a word that looks very different from its lemma? Let’s see an example of that below (using the lemmatizer object created in the previous example):

lemmatizer.lemmatize("bought")
'bought'

In this example, I’m trying to lemmatize the word “bought” (as in, I bought a new watch.) However, you can see that in this example, the lemma of bought is bought. That can’t be right, can it?

Why do you think that particular output was generated? Simply put, the lemmatizer tool, by default, will assume a word is a noun (even when that clearly isn’t the case).

How can we correct this? Take a look at the example below:

lemmatizer.lemmatize("bought", pos='v')
'buy'

In this example, I added the pos parameter, which specifies a part of speech for a particular word. In this case, I set the value of pos to v, as the word “bought” is a verb. Once I added the pos parameter, I was able to get the correct lemma for the word “bought”: “buy”.
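This is also where lemmatization ties back to the POS tagging from the previous lesson: you can convert a tagger’s output tags into the single-letter codes that the lemmatize() pos parameter expects. A minimal sketch (the helper name is my own); unmapped tags fall back to 'n', matching the lemmatizer’s noun default:

```python
# Map a POS-tagger tag to a pos letter for lemmatize().
# Helper name is my own; unmapped tags default to 'n' (noun).
def treebank_to_wordnet(tag):
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun (the lemmatizer's default)

print(treebank_to_wordnet('VBD'))  # 'v'
```

With a helper like this, you could tag a sentence with pos_tag() and feed each (token, tag) pair straight into lemmatizer.lemmatize(token, pos=treebank_to_wordnet(tag)).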

  • When working with lemmatization, you’ll run into this issue quite a bit with adjectives and irregular verbs.

Thanks for reading,

Michael