Hello everybody,
Michael here, and today’s post will cover parts-of-speech tagging as it relates to Python NLP (this is part 4 in my NLP Python series).
Intro to parts-of-speech tagging
What is parts-of-speech (POS) tagging, exactly? See, Python’s NLP libraries can do some really cool things, such as getting the roots of words (Python Lesson 34: Stemming and Lemmatization (NLP pt. 3)) and finding commonly used words (stopwords) in 24 different languages (Python Lesson 33: Stopwords (NLP pt.2)). Parts-of-speech tagging is fairly self-explanatory: it involves tokenizing a string and identifying each token’s part of speech (noun, verb, and so on). Keep in mind that this isn’t going to be a grammar lesson, so I’m not going to teach you how to use POS tagging to improve your grammar or proofread something you wrote.
POS tagging in action
Now that I’ve explained the basics of POS tagging, let’s see it in action! Take a look at the example below:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
print(tagged)
Please input a test string: I had a fun time dancing last night.
[('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'), ('time', 'NN'), ('dancing', 'VBG'), ('last', 'JJ'), ('night', 'NN'), ('.', '.')]
Before getting into the fun POS tagging, you first need to import the nltk package and download two of its resources: punkt and averaged_perceptron_tagger. punkt is the tokenizer model that nltk.word_tokenize() relies on to split text into tokens, while averaged_perceptron_tagger is the pre-trained model that does the actual POS tagging. Depending on your NLTK version, the resource names may differ (newer releases ask for punkt_tab and averaged_perceptron_tagger_eng); if you hit a LookupError, the error message names the resource to download.
After the import and the downloads, I read in a test string and word-tokenized it.
To perform the POS tagging, I passed the whole list of tokens (aptly called tokens) to NLTK’s pos_tag() function. pos_tag() takes the entire list at once, since it uses the surrounding words as context, so there’s no need to loop over the tokens one by one. It returns a new list, which I stored in tagged and then printed out. As you can see, tagged is a list of tuples: the first element in each tuple is the token itself, while the second element is that token’s part-of-speech tag. Punctuation is also included, since each punctuation mark counts as its own token; rather than a part of speech, it gets a punctuation tag (here '.').
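Because each entry in the tagged list is a plain (token, tag) tuple, ordinary Python is enough to pick out words by part of speech. Here is a minimal sketch that hard-codes the output from the example above (so it runs without NLTK installed); the helper name words_with_tag_prefix is made up for this illustration and is not part of NLTK.

```python
# Output copied from the example above, hard-coded so this sketch runs without NLTK.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'), ('time', 'NN'),
          ('dancing', 'VBG'), ('last', 'JJ'), ('night', 'NN'), ('.', '.')]


def words_with_tag_prefix(tagged_tokens, prefix):
    """Return the tokens whose POS tag starts with prefix (hypothetical helper)."""
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]


# Matching on a prefix catches every tag in a family at once,
# e.g. 'NN' covers NN, NNS, NNP, and NNPS.
nouns = words_with_tag_prefix(tagged, 'NN')
verbs = words_with_tag_prefix(tagged, 'VB')
print(nouns)
print(verbs)
```

Matching on the first two letters works here because related tags share a prefix, which is handy when you care about "any noun" rather than one specific noun tag.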
You likely noticed that the POS tags are all short abbreviations, between two and four characters long. Here’s a table explaining the POS tags:
| Tag | Part-of-speech | Example/Explanation |
| --- | --- | --- |
| CC | coordinating conjunction | Any of the FANBOYS conjunctions (for, and, nor, but, or, yet, so) |
| CD | cardinal number | Numbers written as digits or as words (e.g. 5, twelve) |
| DT | determiner | A word in front of a noun to specify quantity or to clarify what the noun refers to (e.g. every car, that child) |
| EX | existential there | There is a snake in the grass. |
| FW | foreign word | Since I’m using English for this post, any word that isn’t English (e.g. palabra in Spanish) |
| IN | preposition or subordinating conjunction | on, in, at, because |
| JJ | adjective (base form) | large, tiny |
| JJR | comparative adjective | larger, tinier |
| JJS | superlative adjective | largest, tiniest |
| LS | list marker | 1), 2) |
| MD | modal verb | A type of auxiliary verb (e.g. might happen, must visit) |
| NN | singular noun | car, tree, cat, etc. |
| NNS | plural noun | cars, trees, cats, etc. |
| NNP | singular proper noun | Ford |
| NNPS | plural proper noun | Americans |
| PDT | predeterminer | A word that occurs before a determiner to quantify a noun phrase (e.g. all the toys, both the students) |
| POS | possessive ending | Michael’s, Tommy’s |
| PRP | personal pronoun | Pronouns associated with a grammatical person, be it first person, second person, or third person (e.g. they, he, she) |
| PRP$ | possessive pronoun | Pronouns that indicate possession and come before a noun (e.g. my, their, her) |
| RB | adverb | very, extremely |
| RBR | comparative adverb | earlier, worse |
| RBS | superlative adverb | best, worst |
| RP | particle | A short word that combines with a verb to form a phrasal verb (e.g. up in give up) |
| TO | the word ‘to’ | to come home |
| UH | interjection | Yikes! Ummmm. |
| VB | base form of verb | walk |
| VBD | past tense of verb | walked |
| VBG | gerund or present participle of verb | walking |
| VBN | past participle of verb | walked |
| VBP | present tense of verb (not 3rd-person singular) | walk (as in I walk) |
| VBZ | present tense of verb (3rd-person singular) | walks |
| WDT | “wh” determiner | which |
| WP | “wh” pronoun | who, what |
| WP$ | possessive “wh” pronoun | whose |
| WRB | “wh” adverb | where, when |
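As you can see, even though English traditionally has only eight parts of speech (verbs, nouns, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections), the tagset that NLTK’s default tagger uses (the Penn Treebank tagset) has the 35 (!) word-level tags in the table above, plus a handful of tags for symbols and punctuation.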
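- Keep in mind that these POS tags are specific to English: both the tagset and the tagger model downloaded above were built from English text, so other languages need their own taggers and tagsets.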
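To see how the fine-grained tags relate back to the eight traditional parts of speech, here is a rough sketch that collapses each tag into a coarse category by its prefix. The COARSE mapping and the coarse_pos function are my own illustration rather than anything NLTK provides, and the grouping is approximate (for example, IN also covers subordinating conjunctions, and determiners end up under 'other').

```python
from collections import Counter

# Hypothetical prefix-to-category mapping, written for this illustration only.
COARSE = [
    ('NN', 'noun'), ('VB', 'verb'), ('MD', 'verb'), ('JJ', 'adjective'),
    ('RB', 'adverb'), ('WRB', 'adverb'), ('PRP', 'pronoun'), ('WP', 'pronoun'),
    ('IN', 'preposition'), ('CC', 'conjunction'), ('UH', 'interjection'),
]


def coarse_pos(tag):
    """Map a fine-grained tag to a traditional part of speech, or 'other'."""
    for prefix, name in COARSE:
        if tag.startswith(prefix):
            return name
    return 'other'  # determiners, numbers, punctuation, and so on


# Output of the first example, hard-coded so the sketch runs without NLTK.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'), ('time', 'NN'),
          ('dancing', 'VBG'), ('last', 'JJ'), ('night', 'NN'), ('.', '.')]

# Count how many tokens fall into each coarse category.
counts = Counter(coarse_pos(tag) for _, tag in tagged)
print(counts)
```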
If you take a look at the last line of output in the first example (the line containing the list of tuples), you can see two-element tuples containing the token itself as the first element along with the token’s POS tag as the second element. And yes, punctuation in a sentence counts as a token itself; rather than a part of speech, it gets a punctuation tag. That’s why the token-tag pair for the period at the end of the sentence looks like this: ('.', '.').
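If you only want the actual words, one simple approach is to drop any pair whose tag doesn’t start with a letter. This is a sketch built on the assumption that punctuation tags are made of non-letter characters (which holds for the output shown in this post); is_punctuation_tag is a made-up helper name, not an NLTK function.

```python
# Output of the first example, hard-coded so the sketch runs without NLTK.
tagged = [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('fun', 'JJ'), ('time', 'NN'),
          ('dancing', 'VBG'), ('last', 'JJ'), ('night', 'NN'), ('.', '.')]


def is_punctuation_tag(tag):
    """Assumption: word tags start with a letter, punctuation tags do not."""
    return not tag[:1].isalpha()


# Keep only the (token, tag) pairs that carry a word-level tag.
words_only = [(token, tag) for token, tag in tagged if not is_punctuation_tag(tag)]
print(words_only)
```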
Now, what if a sentence uses the same word twice, but as different parts of speech (e.g. once as a noun and once as a verb)? Let’s take a look at the example below:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
print(tagged)
Please input a test string: She got a call at work telling her to call the project manager.
[('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'), ('at', 'IN'), ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'), ('to', 'TO'), ('call', 'VB'), ('the', 'DT'), ('project', 'NN'), ('manager', 'NN'), ('.', '.')]
Take a close look at the sentence I used in this example: She got a call at work telling her to call the project manager. Notice the repeated word here: call. In this example, call is used as both a noun (She got a call at work) and a verb (to call the project manager). The neat thing here is that NLTK’s POS tagger recognizes that the word call is used as two different parts of speech in that sentence.
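If you’d rather have Python point out such words for you, here is a small sketch that groups the tags by word and keeps the words that received more than one tag. It hard-codes the output above so it runs without NLTK; the variable names are just my own choices.

```python
from collections import defaultdict

# Output of the example above, hard-coded so the sketch runs without NLTK.
tagged = [('She', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('call', 'NN'), ('at', 'IN'),
          ('work', 'NN'), ('telling', 'VBG'), ('her', 'PRP'), ('to', 'TO'),
          ('call', 'VB'), ('the', 'DT'), ('project', 'NN'), ('manager', 'NN'),
          ('.', '.')]

# Collect every tag each (lowercased) word received.
tags_by_word = defaultdict(set)
for token, tag in tagged:
    tags_by_word[token.lower()].add(tag)

# Keep only the words that were tagged in more than one way.
ambiguous = {word: sorted(tags) for word, tags in tags_by_word.items() if len(tags) > 1}
print(ambiguous)  # {'call': ['NN', 'VB']}
```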
However, the POS tagger may not always tag a word the way you’d expect when the same word plays different roles in a sentence. Take a look at this example:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
test = input('Please input a test string: ')
tokens = nltk.word_tokenize(test)
tagged = nltk.pos_tag(tokens)
print(tagged)
Please input a test string: My apartment building is bigger than any other apartment in a 5-block vicinity.
[('My', 'PRP$'), ('apartment', 'NN'), ('building', 'NN'), ('is', 'VBZ'), ('bigger', 'JJR'), ('than', 'IN'), ('any', 'DT'), ('other', 'JJ'), ('apartment', 'NN'), ('in', 'IN'), ('a', 'DT'), ('5-block', 'JJ'), ('vicinity', 'NN'), ('.', '.')]
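In this example, I’m using the word apartment twice, first as an adjective-like modifier (My apartment building) and then as a plain noun (any other apartment). However, NLTK’s POS tagger tags both instances as NN rather than marking the first one as an adjective. Strictly speaking, a noun that modifies another noun is still tagged as a noun under the Penn Treebank conventions, so this is arguably the tagger following its rules rather than making a mistake. Still, it means the tags alone won’t tell you that the first apartment is modifying the noun building.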
- Hey, what can you say, programs aren’t always perfect. But I’d say NLTK’s POS tagger works quite well for parts-of-speech analysis.
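If you do want to spot nouns that are acting as modifiers, as in the apartment example above, one rough heuristic is to look for a noun tag that is immediately followed by another noun tag. The sketch below hard-codes that example’s output so it runs without NLTK; noun_modifier_pairs is a made-up helper, and the heuristic is an assumption of mine that will not be right for every sentence.

```python
# Output of the apartment example, hard-coded so the sketch runs without NLTK.
tagged = [('My', 'PRP$'), ('apartment', 'NN'), ('building', 'NN'), ('is', 'VBZ'),
          ('bigger', 'JJR'), ('than', 'IN'), ('any', 'DT'), ('other', 'JJ'),
          ('apartment', 'NN'), ('in', 'IN'), ('a', 'DT'), ('5-block', 'JJ'),
          ('vicinity', 'NN'), ('.', '.')]


def noun_modifier_pairs(tagged_tokens):
    """Return (modifier, head) pairs where one noun directly precedes another."""
    pairs = []
    # zip pairs each token with the token that follows it.
    for (word, tag), (next_word, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag.startswith('NN') and next_tag.startswith('NN'):
            pairs.append((word, next_word))
    return pairs


pairs = noun_modifier_pairs(tagged)
print(pairs)  # [('apartment', 'building')]
```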
Thanks for reading,
Michael