Python Lesson 35: Parts-of-speech tagging (NLP pt. 4)

Intro to parts-of-speech tagging

What is parts-of-speech (POS) tagging, exactly? Python NLP can do some really cool things, such as getting the roots of words (Python Lesson 34: Stemming and Lemmatization (NLP pt. 3)) and finding commonly used words (stopwords) in 24 different languages (Python Lesson 33: Stopwords (NLP pt.2)). POS tagging is a fairly self-explanatory process: you tokenize a string and then identify each token’s part of speech (noun, verb, and so on). Keep in mind that this isn’t going to be a grammar lesson, so I’m not going to teach you how to use POS tagging to improve your grammar or proofread something you wrote.

POS tagging in action

Now that I’ve explained the basics of POS tagging, let’s see it in action! Take a look at the example below:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)

# pos_tag() takes the whole list of tokens and returns a list of (token, tag) tuples
tagged = nltk.pos_tag(tokens)

print(tagged)

Please input a test string: I had a fun time dancing last night.
[('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'), ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'), ('night', 'NN'), ('.', '.')]

Before getting into the fun POS tagging, you first need to import the nltk package and download two of its resources: punkt and averaged_perceptron_tagger. punkt is the model behind NLTK’s tokenizer, which word_tokenize() needs in order to split the string into tokens, while averaged_perceptron_tagger is the pretrained model that does the actual POS tagging. (Newer NLTK releases may ask for differently named resources; if you get a LookupError, download whichever resource the error message names.)

After importing nltk and downloading the necessary resources, I read in a test string and word-tokenized it.

To perform the POS tagging, I passed the list of tokens (aptly called tokens) to NLTK’s pos_tag() function. There’s no need to loop through the tokens or create an empty list first, since pos_tag() takes the whole list at once and returns a new list. I then printed out the tagged list, which contains the results of our POS tagging. As you can see, tagged is a list of tuples: the first element in each tuple is the token itself, while the second element is that token’s part-of-speech tag. Punctuation is also included, since punctuation counts as its own token. It doesn’t belong to a traditional part of speech, so it gets a punctuation tag of its own (here, the period is tagged '.').
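
Since the result is just a list of (token, tag) tuples, you can work with it like any other Python list. Here is a minimal sketch that groups the tokens by tag. It reuses the output printed above rather than calling NLTK again, and the name tokens_by_tag is my own choice for illustration:

```python
from collections import defaultdict

# Output copied from the example above, so no NLTK download is needed here.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'),
          ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'),
          ('night', 'NN'), ('.', '.')]

# Unpack each (token, tag) tuple and collect the tokens under their tag.
tokens_by_tag = defaultdict(list)
for token, tag in tagged:
    tokens_by_tag[tag].append(token)

for tag, words in tokens_by_tag.items():
    print(tag, words)
```

Grouping like this makes it easy to pull out, say, all of the nouns in a sentence with tokens_by_tag['NN'].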

You likely noticed that the POS tags are all short abbreviations of two to four characters. Here’s a table explaining the POS tags:

Tag	Part-of-speech	Example/Explanation
CC	coordinating conjunction	Any of the FANBOYS conjunctions (for, and, nor, but, or, yet, so)
CD	cardinal number	Any number, written as digits or as a word (e.g. 5, thirty-five, one)
DT	determiner	A word in front of a noun to clarify what the noun refers to (e.g. the car, that child)
EX	existential there	There is a snake in the grass.
FW	foreign word	Since I’m using English for this post, any word that isn’t English (e.g. palabra in Spanish)
IN	preposition or subordinating conjunction	on, in, at, because
JJ	adjective (base form)	large, tiny
JJR	comparative adjective	larger, tinier
JJS	superlative adjective	largest, tiniest
LS	list marker	1), 2)
MD	modal verb	An auxiliary verb that expresses possibility or necessity (e.g. might happen, must visit)
NN	singular noun	car, tree, cat, etc.
NNS	plural noun	cars, trees, cats, etc.
NNP	singular proper noun	Ford
NNPS	plural proper noun	Americans
PDT	predeterminer	A word that occurs before a determiner (e.g. all the toys, both the students)
POS	possessive ending	Michael’s, Tommy’s
PRP	personal pronoun	Pronouns associated with a grammatical person, be it first, second, or third person (e.g. they, he, she)
PRP$	possessive pronoun	Pronouns that indicate possession (e.g. my, their, her)
RB	adverb	very, extremely
RBR	comparative adverb	earlier, worse
RBS	superlative adverb	best, worst
RP	particle	A short word that combines with a verb to form a phrasal verb (e.g. give up)
TO	the word ‘to’	to come home
UH	interjection	Yikes! Ummmm.
VB	base form of verb	walk
VBD	past tense of verb	walked
VBG	gerund or present participle of verb	walking
VBN	past participle of verb	walked (as in has walked)
VBP	present tense form of verb (non-3rd-person singular)	walk
VBZ	present tense form of verb (3rd-person singular)	walks
WDT	“wh” determiner	which
WP	“wh” pronoun	who, what
WP$	possessive “wh” pronoun	whose
WRB	“wh” adverb	where, when
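
If you don’t want to keep scrolling back to the table, you could store it as a dictionary and look tags up in code. The sketch below is only an illustration: TAG_DESCRIPTIONS and describe() are names I made up, and the dictionary holds just a handful of rows paraphrased from the table.

```python
# A few rows of the table above as a lookup dictionary (a subset, for illustration).
TAG_DESCRIPTIONS = {
    'PRP': 'personal pronoun',
    'VBD': 'past tense of verb',
    'DT': 'determiner',
    'JJ': 'adjective (base form)',
    'NN': 'singular noun',
    'VBG': 'gerund or present participle of verb',
}


def describe(tagged):
    """Return (token, tag, description) triples; tags missing from the lookup get a placeholder."""
    return [(token, tag, TAG_DESCRIPTIONS.get(tag, 'not in lookup'))
            for token, tag in tagged]


# Output copied from the first example in this post.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'),
          ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'),
          ('night', 'NN'), ('.', '.')]

for token, tag, description in describe(tagged):
    print(f'{token}: {tag} ({description})')
```

Using .get() with a default means an unknown tag (such as the '.' tag for the period) doesn’t raise a KeyError.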

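As you can see, even though English is traditionally described as having only eight parts of speech (verbs, nouns, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections), the table above lists 35 (!) parts-of-speech tags. These tags come from the Penn Treebank tagset, which NLTK’s tagger uses for English and which splits the traditional categories into finer ones (for example, there are six different verb tags).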
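
To see how the fine-grained tags relate to the traditional parts of speech, here is a rough sketch that collapses a tag into a coarse category based on how the tag starts. The mapping and the coarse_pos() helper are my own simplification for illustration, not something NLTK provides, and anything that doesn’t fit a traditional category falls back to 'other'.

```python
# Hypothetical mapping from tag prefixes to traditional parts of speech.
# This is a simplification for illustration, not an official NLTK mapping.
PREFIX_TO_COARSE = [
    ('NN', 'noun'),          # NN, NNS, NNP, NNPS
    ('VB', 'verb'),          # VB, VBD, VBG, VBN, VBP, VBZ
    ('MD', 'verb'),          # modal verbs
    ('JJ', 'adjective'),     # JJ, JJR, JJS
    ('RB', 'adverb'),        # RB, RBR, RBS
    ('WRB', 'adverb'),
    ('PRP', 'pronoun'),      # PRP, PRP$
    ('WP', 'pronoun'),       # WP, WP$
    ('IN', 'preposition'),
    ('CC', 'conjunction'),
    ('UH', 'interjection'),
]


def coarse_pos(tag):
    """Return the traditional part of speech for a tag, or 'other' if none fits."""
    for prefix, coarse in PREFIX_TO_COARSE:
        if tag.startswith(prefix):
            return coarse
    return 'other'


# Output copied from the first example in this post.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'),
          ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'),
          ('night', 'NN'), ('.', '.')]

coarse_tagged = [(token, coarse_pos(tag)) for token, tag in tagged]
print(coarse_tagged)
```

Determiners, particles, and punctuation land in 'other' here, which is one reason the finer tagset exists in the first place.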

Even though I’m working with English here, keep in mind that these POS tags were designed for English, so I wouldn’t assume they carry over neatly to other languages.

If you take a look at the last line of output in the example above (the line containing the list of tuples), you can see two-element tuples, each holding the token as the first element and the token’s POS tag as the second. And yes, punctuation in a sentence counts as a token of its own; rather than a regular part-of-speech tag, it gets a punctuation tag. That’s why the token-tag pair for the period at the end of the sentence looks like this: ('.', '.').

Now, what if a sentence uses the same word twice but as different parts of speech (e.g. once as a noun and once as a verb)? Let’s take a look at the example below:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
    
print(tagged)

Please input a test string: She got a call at work telling her to call the project manager.
[('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'), ('at', 'IN'), ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'), ('to', 'TO'), ('call', 'VB'), ('the', 'DT'), ('project', 'NN'), ('manager', 'NN'), ('.', '.')]

Take a close look at the sentence I used in this example: She got a call at work telling her to call the project manager. Notice the repeated word here: call. In this example, call is used as both a noun (She got a call at work) and a verb (to call the project manager). The neat thing here is that NLTK’s POS tagger recognizes that the word call is used as two different parts of speech in that sentence, tagging the first one NN and the second one VB.
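
If you want to spot words like this automatically, one possible approach is to collect the set of tags each word received and keep the words with more than one. This is just a sketch that reuses the output above; the names tags_by_word and multi_tagged are my own.

```python
from collections import defaultdict

# Output copied from the example above.
tagged = [('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'),
          ('at', 'IN'), ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'),
          ('to', 'TO'), ('call', 'VB'), ('the', 'DT'), ('project', 'NN'),
          ('manager', 'NN'), ('.', '.')]

# Collect every tag each word received (lowercased so 'Call' and 'call' match).
tags_by_word = defaultdict(set)
for token, tag in tagged:
    tags_by_word[token.lower()].add(tag)

# Keep only the words that were tagged in more than one way.
multi_tagged = {word: tags for word, tags in tags_by_word.items() if len(tags) > 1}
print(multi_tagged)
```

For this sentence, the only word left in multi_tagged is call, with the tags NN and VB.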

However, the POS tagger may not always be so accurate when it comes to recognizing the same word used as a different part of speech. Take a look at this example:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
    
print(tagged)

Please input a test string: My apartment building is bigger than any other apartment in a 5-block vicinity.
[('My', 'PRP$'), ('apartment', 'NN'), ('building', 'NN'), ('is', 'VBZ'), ('bigger', 'JJR'), ('than', 'IN'), ('any', 'DT'), ('other', 'JJ'), ('apartment', 'NN'), ('in', 'IN'), ('a', 'DT'), ('5-block', 'JJ'), ('vicinity', 'NN'), ('.', '.')]

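In this example, I’m using the word apartment twice: once to modify the noun building (My apartment building) and once as a standalone noun (any other apartment). NLTK’s POS tagger tags both instances as NN, so it doesn’t distinguish the adjective-like use from the standalone one. To be fair, the Penn Treebank convention is to tag a noun that modifies another noun as a noun, so the tag is defensible even though apartment works like an adjective here.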
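
Since the tagger won’t flag this kind of usage for you, one rough workaround is to look for a noun that sits directly before another noun. The sketch below reuses the output above and is only a heuristic of my own (noun_pairs is a made-up name); it will also pick up ordinary compound nouns.

```python
# Output copied from the example above.
tagged = [('My', 'PRP$'), ('apartment', 'NN'), ('building', 'NN'),
          ('is', 'VBZ'), ('bigger', 'JJR'), ('than', 'IN'), ('any', 'DT'),
          ('other', 'JJ'), ('apartment', 'NN'), ('in', 'IN'), ('a', 'DT'),
          ('5-block', 'JJ'), ('vicinity', 'NN'), ('.', '.')]

# Pair each token with the one after it and keep the noun-noun pairs.
# In such a pair, the first noun is likely modifying the second.
noun_pairs = [
    (first, second)
    for (first, first_tag), (second, second_tag) in zip(tagged, tagged[1:])
    if first_tag.startswith('NN') and second_tag.startswith('NN')
]
print(noun_pairs)
```

For this sentence, the only pair found is apartment followed by building, which is exactly the adjective-like use discussed above.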

Hey, what can you say, programs aren’t always perfect. But I’d say NLTK’s POS tagger works quite well for parts-of-speech analysis.

Thanks for reading,

Michael
