Python Lesson 35: Parts-of-speech tagging (NLP pt. 4)

Hello everybody,

Michael here, and today’s post will cover parts-of-speech tagging as it relates to Python NLP (this is part 4 in my NLP Python series).

Intro to parts-of-speech tagging

What is parts-of-speech (POS) tagging, exactly? See, Python NLP can do some really cool things, such as getting the roots of words (Python Lesson 34: Stemming and Lemmatization (NLP pt. 3)) and finding commonly used words (stopwords) in 24 different languages (Python Lesson 33: Stopwords (NLP pt.2)). POS tagging is a fairly self-explanatory process: you tokenize a string and then label each token with its part-of-speech (such as noun, verb, etc.). Keep in mind that this isn’t going to be a grammar lesson, so I’m not going to teach you how to use POS tagging to improve your grammar or proofread something you wrote.

POS tagging in action

Now that I’ve explained the basics of POS tagging, let’s see it in action! Take a look at the example below:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)

# pos_tag() takes the whole list of tokens and returns a list of (token, tag) tuples
tagged = nltk.pos_tag(tokens)

print(tagged)

Please input a test string: I had a fun time dancing last night.
[('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'), ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'), ('night', 'NN'), ('.', '.')]

Before getting into the fun POS tagging, you first need to import the nltk package and download two of its resources: punkt and averaged_perceptron_tagger. punkt provides the tokenizer models that word_tokenize() relies on, while averaged_perceptron_tagger is the pretrained model that enables all the fun POS tagging capabilities. (Newer NLTK releases may ask for differently named resources; if so, the error message tells you which ones to download.)

After including the necessary import and downloading the necessary resources, I then inputted and word-tokenized a test string.

To perform the POS tagging, I passed the list of tokens (aptly called tokens) to NLTK’s pos_tag() function, which tags every token in a single call, so there is no need to loop through the tokens yourself. I stored the result in a list called tagged and then printed it out. As you can see, tagged contains a list of tuples: the first element in each tuple is the token itself, while the second element is that token’s part-of-speech tag. Punctuation is also included, as punctuation counts as its own token. It isn’t a part of speech, so it gets a punctuation tag instead (for the period, that tag is simply '.').
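Since each item in the tagged list is a plain two-element tuple, you can unpack it directly in a loop. Here’s a minimal sketch of that idea. It hard-codes the output shown above rather than calling NLTK, and the name tokens_by_tag is just one I made up for illustration:

```python
from collections import defaultdict

# Output of nltk.pos_tag() copied from the example above (hard-coded here
# so the sketch runs without downloading any NLTK resources).
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'),
          ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'),
          ('night', 'NN'), ('.', '.')]

# Each item is a (token, tag) tuple, so it unpacks directly in the loop.
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

print(tokens_by_tag['NN'])  # ['time', 'night']
print(tokens_by_tag['JJ'])  # ['fun', 'last']
```

Grouping by tag like this is handy if you only care about, say, the nouns in a sentence.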

You likely noticed that the POS tags are all short abbreviations, two to four characters long. Here’s a table explaining the POS tags:

| Tag | Part-of-speech | Example/Explanation |
|------|------|------|
| CC | coordinating conjunction | Any of the FANBOYS conjunctions (for, and, nor, but, or, yet, so) |
| CD | cardinal number | Numbers, written as digits or as words (e.g. 5, twenty) |
| DT | determiner | A word in front of a noun to specify quantity or to clarify what the noun refers to (e.g. the car, that child) |
| EX | existential there | There is a snake in the grass. |
| FW | foreign word | Since I’m using English for this post, any word that isn’t English (e.g. palabra in Spanish) |
| IN | preposition or subordinating conjunction | on, in, at, because |
| JJ | adjective (base form) | large, tiny |
| JJR | comparative adjective | larger, tinier |
| JJS | superlative adjective | largest, tiniest |
| LS | list marker | 1), 2) |
| MD | modal verb | A type of auxiliary verb (e.g. might happen, must visit) |
| NN | singular noun | car, tree, cat, etc. |
| NNS | plural noun | cars, trees, cats, etc. |
| NNP | singular proper noun | Ford |
| NNPS | plural proper noun | Americans |
| PDT | predeterminer | A word that comes before a determiner in a noun phrase (e.g. all the toys, both the students) |
| POS | possessive ending | The ’s in Michael’s, Tommy’s |
| PRP | personal pronoun | Pronouns associated with a grammatical person, be it first person, second person, or third person (e.g. they, he, she) |
| PRP$ | possessive pronoun | Pronouns that indicate possession (e.g. my, their, her) |
| RB | adverb | very, extremely |
| RBR | comparative adverb | earlier, worse |
| RBS | superlative adverb | best, worst |
| RP | particle | A short word that pairs with a verb to form a phrasal verb (e.g. the up in give up) |
| TO | the word ‘to’ | to come home |
| UH | interjection | Yikes! Ummmm. |
| VB | base form of verb | walk |
| VBD | past tense of verb | walked |
| VBG | gerund or present participle of verb | walking |
| VBN | past participle of verb | walked |
| VBP | present tense of verb (non-3rd-person singular) | walk |
| VBZ | present tense of verb (3rd-person singular) | walks |
| WDT | “wh” determiner | which |
| WP | “wh” pronoun | who, what |
| WP$ | possessive “wh” pronoun | whose |
| WRB | “wh” adverb | where, when |

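As you can see, even though English traditionally has only eight parts of speech (verbs, nouns, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections), the table above lists 35 (!) POS tags. That’s because NLTK’s tagger uses the Penn Treebank tag set, which splits the broad categories into finer ones. Nouns, for instance, get four separate tags depending on whether they are singular or plural and common or proper.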

  • Keep in mind that I’m working with English here. These POS tags were designed for English, so don’t assume they carry over to other languages, which generally need their own tagger and tag set.
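If the fine-grained tags are more detail than you need, one option is to collapse them back into broad categories by looking at how each tag starts. The sketch below is only an illustration: the COARSE_BY_PREFIX mapping and the coarse_pos() helper are names I made up (they aren’t part of NLTK), and the mapping is deliberately incomplete.

```python
from collections import Counter

# Hypothetical mapping from a tag prefix to a broad part of speech.
# This is not part of NLTK and it does not cover every tag.
COARSE_BY_PREFIX = {
    'NN': 'noun',
    'VB': 'verb',
    'MD': 'verb',
    'JJ': 'adjective',
    'RB': 'adverb',
    'PRP': 'pronoun',
    'WP': 'pronoun',
}

def coarse_pos(tag):
    """Return a broad part of speech for a tag, or 'other' if unmapped."""
    for prefix, name in COARSE_BY_PREFIX.items():
        if tag.startswith(prefix):
            return name
    return 'other'

# Tagged output copied from the first example in this post.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'),
          ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'),
          ('night', 'NN'), ('.', '.')]

counts = Counter(coarse_pos(tag) for _, tag in tagged)
print(counts['noun'], counts['verb'], counts['adjective'])  # 2 2 2
```

Anything the mapping doesn’t cover (determiners, punctuation, and so on) lands in the 'other' bucket.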

If you take another look at the last line of output in the first example (the line containing the list of tuples), you’ll notice that the period at the end of the sentence gets its own tuple. Punctuation in a sentence counts as a token itself, but since it isn’t a part of speech, it gets a punctuation tag rather than one of the tags in the table. That’s why the token-tag pair for the period looks like this: ('.', '.').

Now, what if a sentence used the same word twice but as different parts-of-speech (e.g. once as a noun and once as a verb)? Let’s take a look at the example below:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
    
print(tagged)

Please input a test string: She got a call at work telling her to call the project manager.
[('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'), ('at', 'IN'), ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'), ('to', 'TO'), ('call', 'VB'), ('the', 'DT'), ('project', 'NN'), ('manager', 'NN'), ('.', '.')]

Take a close look at the sentence I used in this example: She got a call at work telling her to call the project manager. Notice the repeated word here, call. In this sentence, call is used as both a noun (She got a call at work) and a verb (to call the project manager). The neat thing is that NLTK’s POS tagger recognizes that call is used as two different parts-of-speech, tagging the first one NN and the second one VB.
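If you’d rather not scan the output by eye, you can collect the set of tags each word received and keep the words that got more than one. This is just a sketch that hard-codes the output above instead of calling NLTK; tags_by_word and ambiguous are names I picked for illustration.

```python
from collections import defaultdict

# Tagged output copied from the example above.
tagged = [('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'),
          ('at', 'IN'), ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'),
          ('to', 'TO'), ('call', 'VB'), ('the', 'DT'), ('project', 'NN'),
          ('manager', 'NN'), ('.', '.')]

# Collect every tag seen for each word (lowercased so 'Call' and 'call' match).
tags_by_word = defaultdict(set)
for token, tag in tagged:
    tags_by_word[token.lower()].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {word: tags for word, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)  # only 'call' is left, with the tags NN and VB
```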

However, the POS tagger’s output won’t always match your expectations when the same word plays different roles in a sentence. Take a look at this example:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
    
print(tagged)

Please input a test string: My apartment building is bigger than any other apartment in a 5-block vicinity.
[('My', 'PRP$'), ('apartment', 'NN'), ('building', 'NN'), ('is', 'VBZ'), ('bigger', 'JJR'), ('than', 'IN'), ('any', 'DT'), ('other', 'JJ'), ('apartment', 'NN'), ('in', 'IN'), ('a', 'DT'), ('5-block', 'JJ'), ('vicinity', 'NN'), ('.', '.')]

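In this example, I’m using the word apartment twice: once to describe the noun building (My apartment building) and once as a plain noun (any other apartment). However, NLTK’s POS tagger gives both instances the same NN tag, so the tags alone don’t tell you that the first apartment is acting like an adjective. To be fair to the tagger, a noun that modifies another noun is still usually classed as a noun, so this comes down to how the tag set labels such words rather than a clear-cut mistake.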

  • Hey, what can you say, programs aren’t always perfect. But I’d say NLTK’s POS tagger works quite well for parts-of-speech analysis.
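If you do want to catch nouns that describe other nouns, as apartment does in apartment building, one rough heuristic is to look for two noun tags in a row. The sketch below hard-codes the output from the last example; NOUN_TAGS and noun_modifiers are names I made up, and this simple adjacency check is my own assumption, not something NLTK does for you.

```python
# Tagged output copied from the apartment example above.
tagged = [('My', 'PRP$'), ('apartment', 'NN'), ('building', 'NN'),
          ('is', 'VBZ'), ('bigger', 'JJR'), ('than', 'IN'), ('any', 'DT'),
          ('other', 'JJ'), ('apartment', 'NN'), ('in', 'IN'), ('a', 'DT'),
          ('5-block', 'JJ'), ('vicinity', 'NN'), ('.', '.')]

NOUN_TAGS = {'NN', 'NNS', 'NNP', 'NNPS'}

# Pair each token with the one after it and keep the noun-noun pairs.
noun_modifiers = [
    (first, second)
    for (first, first_tag), (second, second_tag) in zip(tagged, tagged[1:])
    if first_tag in NOUN_TAGS and second_tag in NOUN_TAGS
]
print(noun_modifiers)  # [('apartment', 'building')]
```

Run against the earlier call example, the same check would pick out project manager. It only catches nouns that sit right next to each other, so treat it as a starting point.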

Thanks for reading,

Michael
