Python Lesson 33: Stopwords (NLP pt.2)

Hello everybody,

Michael here, and today’s post will be on stopwords in Python NLP. This is part 2 of my NLP series.

What are stopwords? Simply put, stopwords are words you want to ignore when processing a tokenized string. Most often, they are very common words like “a”, “the”, and “is” that appear so frequently in English that they add little meaning to a text.

Stopwords can be found for any language, but for this series of NLP lessons, I’ll focus on English words.

Now that I’ve explained the basics of stopwords, let’s see them in action:

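import nltk
# One-time downloads: the tokenizer models and the stopword lists.
# (Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.)
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords

# Split the input string into tokens
test = input('Please type in a string: ')
testWords = nltk.word_tokenize(test)

# Build a set of the English stopwords to check each token against
stopwordsList = set(stopwords.words('english'))

# Keep only the tokens that are not stopwords (compared case-insensitively)
filteredList = []

for t in testWords:
    if t.casefold() not in stopwordsList:
        filteredList.append(t)

print(filteredList)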

Please type in a string: The puppies and the kitties played in their playpen on the hot summer afternoon.
['puppies', 'kitties', 'played', 'playpen', 'hot', 'summer', 'afternoon', '.']

To use NLTK’s stopwords module, you’ll need to run the nltk.download('stopwords') command and import the stopwords module from the nltk.corpus package.

Yes, you’ll still need to download the punkt module, as it will enable easy tokenization, which is important to have when working with stopwords.

To store the list of tokens in the string you input, create a testWords variable that stores the output of the nltk.word_tokenize function. To get NLTK’s English stopwords, use the line of code set(stopwords.words('english')), which creates a set from the list of NLTK’s English stopwords. Recall that sets are like lists, except that they are unordered and contain no duplicate elements.
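To make the set behavior concrete, here is a small sketch using a hand-written list (the words are an assumption for illustration, not NLTK’s actual list). It shows duplicates collapsing and why the filtering loop calls casefold() before checking membership.

```python
# Hand-written sample with a deliberate duplicate; a stand-in for
# stopwords.words('english'), not the real NLTK list.
sample_words = ["the", "a", "is", "the"]
sample_set = set(sample_words)

print(len(sample_words))  # 4 items in the list
print(len(sample_set))    # 3 items in the set; the duplicate "the" is gone

# Membership tests are case-sensitive, so "The" only matches after casefold().
print("The" in sample_set)             # False
print("The".casefold() in sample_set)  # True
```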

To gather the tokens that are not stopwords, first create an empty list (filteredList in this case) to hold the filtered results. Then iterate through the list of tokens (testWords in this case), check whether each token is in the set of stopwords and, if not, add the token to the empty list you created earlier. The casefold() call lowercases each token before the check, so a capitalized word like “The” still matches the lowercase stopword “the”.
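The same filtering loop can also be written as a list comprehension. This is a sketch rather than the code used above: remove_stopwords is a hypothetical helper name, and the small stopword set and token list are written out by hand so the example runs without any NLTK downloads. In practice you would pass in set(stopwords.words('english')) and the output of nltk.word_tokenize.

```python
def remove_stopwords(tokens, stopword_set):
    """Return the tokens whose casefolded form is not in stopword_set."""
    return [t for t in tokens if t.casefold() not in stopword_set]


# Hand-written stand-in for set(stopwords.words('english')).
demo_stopwords = {"the", "and", "in", "their", "on"}

# Tokens written out by hand; nltk.word_tokenize(test) would supply these.
demo_tokens = ["The", "puppies", "and", "the", "kitties", "played", "in",
               "their", "playpen", "on", "the", "hot", "summer", "afternoon", "."]

filtered = remove_stopwords(demo_tokens, demo_stopwords)
print(filtered)
# ['puppies', 'kitties', 'played', 'playpen', 'hot', 'summer', 'afternoon', '.']
```

The comprehension does exactly what the for loop does; which one you use is mostly a matter of taste.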

As you can see in the example above, the input string I used has 15 tokens (the punctuation at the end of the sentence counts as a token). After filtering out the stopwords, the resulting list contains only 8 tokens, as 7 tokens have been filtered out: The, and, the, in, their, on, and the. Notice that even though stopwordsList is a set of UNIQUE stopwords, the loop checks every token in testWords against it, so every instance of a stopword is excluded from the filtered list (after all, there were three instances of the word “the” in the input string, counting the capitalized one).
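If you want to double-check counts like these, one option is to tally the removed tokens with collections.Counter. The token list and the five-word stopword set below are written out by hand (an assumption standing in for the NLTK calls) so the sketch runs on its own.

```python
from collections import Counter

# Hand-written stand-ins for nltk.word_tokenize(test) and
# set(stopwords.words('english')).
tokens = ["The", "puppies", "and", "the", "kitties", "played", "in",
          "their", "playpen", "on", "the", "hot", "summer", "afternoon", "."]
stopword_set = {"the", "and", "in", "their", "on"}

# Split the tokens into the ones that are kept and the ones that are removed.
kept = [t for t in tokens if t.casefold() not in stopword_set]
removed = [t.casefold() for t in tokens if t.casefold() in stopword_set]
removed_counts = Counter(removed)

print(len(tokens), len(kept), len(removed))  # 15 8 7
print(removed_counts["the"])                 # 3, since "The" is casefolded to "the"
```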

Ever want to see all of the words included in NLTK’s English stopwords list? Run the command print(stopwords.words('english')) and you’ll see all the stopwords NLTK uses in English:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

In total, NLTK has 179 stopwords in English, which consist of common English pronouns (I, my, you), commonly used English contractions (don’t, isn’t), conjugations of common English verbs (such as be and have), and surprisingly, contractions that you don’t hear most people use nowadays (be honest, when was the last time you heard someone use a word like shan’t or mightn’t?).

Of course, you can always append words to NLTK’s stopwords list as you see fit, but when working with stopwords in English (or any language), I’d suggest sticking with the default stopwords list.
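If you do decide to customize the list, one way to do it is with ordinary set operations on your own copy, which leaves NLTK’s list untouched. In this sketch the base set is a hand-written stand-in for set(stopwords.words('english')), and the words being added and removed are arbitrary examples.

```python
# Stand-in for set(stopwords.words('english')); the real list is much longer.
base_stopwords = {"the", "and", "in", "not", "on"}

# Add domain-specific words to ignore, and drop "not" so negations survive.
custom_stopwords = (base_stopwords | {"puppies", "kitties"}) - {"not"}

tokens = ["The", "puppies", "played", ",", "not", "the", "kitties"]
filtered = [t for t in tokens if t.casefold() not in custom_stopwords]
print(filtered)  # ['played', ',', 'not']
```

Because the union and difference operators build a new set, base_stopwords itself is left unchanged.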

Now, I know I mentioned that I’ll mostly be working with English throughout my NLP lessons, but let’s explore stopwords in other languages. For this example, I’ll use the same code and a Spanish translation of the input string from the previous example, this time loading the Spanish stopwords with stopwords.words('spanish'):

Please type in a string: Los cachorros y los gatitos jugaban en su corralito en la calurosa tarde de verano.
['cachorros', 'gatitos', 'jugaban', 'corralito', 'calurosa', 'tarde', 'verano', '.']

So, the Spanish-translated version of my previous example has 16 tokens, 8 of which appear on the filtered list. Thus, there were 8 stopwords that were removed from the testWords list.
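One way to make the language swap explicit is to key the stopword sets by language name. This is a sketch under assumptions: STOPWORDS_BY_LANGUAGE and filter_tokens are hypothetical names, and the two sets are small hand-written subsets standing in for set(stopwords.words('english')) and set(stopwords.words('spanish')).

```python
# Hand-written subsets; with NLTK you would build each set from
# stopwords.words(<language>) instead.
STOPWORDS_BY_LANGUAGE = {
    "english": {"the", "and", "in", "their", "on"},
    "spanish": {"los", "y", "en", "su", "la", "de"},
}


def filter_tokens(tokens, language):
    """Drop the tokens that are stopwords in the given language."""
    stopword_set = STOPWORDS_BY_LANGUAGE[language]
    return [t for t in tokens if t.casefold() not in stopword_set]


# Tokens written out by hand; nltk.word_tokenize would supply these.
spanish_tokens = ["Los", "cachorros", "y", "los", "gatitos", "jugaban", "en", "su",
                  "corralito", "en", "la", "calurosa", "tarde", "de", "verano", "."]

kept = filter_tokens(spanish_tokens, "spanish")
removed_count = len(spanish_tokens) - len(kept)
print(kept)
print(removed_count)  # 8 stopwords removed
```

As a side note, nltk.word_tokenize also accepts a language argument (for example, language='spanish'), which may matter for longer texts with several sentences.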

Want to see the Spanish list of stopwords? Run the command print(stopwords.words('spanish')) and take a look:

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni', 'contra', 'otros', 'ese', 'eso', 'ante', 'ellos', 'e', 'esto', 'mí', 'antes', 'algunos', 'qué', 'unos', 'yo', 'otro', 'otras', 'otra', 'él', 'tanto', 'esa', 'estos', 'mucho', 'quienes', 'nada', 'muchos', 'cual', 'poco', 'ella', 'estar', 'estas', 'algunas', 'algo', 'nosotros', 'mi', 'mis', 'tú', 'te', 'ti', 'tu', 'tus', 'ellas', 'nosotras', 'vosotros', 'vosotras', 'os', 'mío', 'mía', 'míos', 'mías', 'tuyo', 'tuya', 'tuyos', 'tuyas', 'suyo', 'suya', 'suyos', 'suyas', 'nuestro', 'nuestra', 'nuestros', 'nuestras', 'vuestro', 'vuestra', 'vuestros', 'vuestras', 'esos', 'esas', 'estoy', 'estás', 'está', 'estamos', 'estáis', 'están', 'esté', 'estés', 'estemos', 'estéis', 'estén', 'estaré', 'estarás', 'estará', 'estaremos', 'estaréis', 'estarán', 'estaría', 'estarías', 'estaríamos', 'estaríais', 'estarían', 'estaba', 'estabas', 'estábamos', 'estabais', 'estaban', 'estuve', 'estuviste', 'estuvo', 'estuvimos', 'estuvisteis', 'estuvieron', 'estuviera', 'estuvieras', 'estuviéramos', 'estuvierais', 'estuvieran', 'estuviese', 'estuvieses', 'estuviésemos', 'estuvieseis', 'estuviesen', 'estando', 'estado', 'estada', 'estados', 'estadas', 'estad', 'he', 'has', 'ha', 'hemos', 'habéis', 'han', 'haya', 'hayas', 'hayamos', 'hayáis', 'hayan', 'habré', 'habrás', 'habrá', 'habremos', 'habréis', 'habrán', 'habría', 'habrías', 'habríamos', 'habríais', 'habrían', 'había', 'habías', 'habíamos', 'habíais', 'habían', 'hube', 'hubiste', 'hubo', 'hubimos', 'hubisteis', 'hubieron', 'hubiera', 'hubieras', 'hubiéramos', 'hubierais', 'hubieran', 'hubiese', 'hubieses', 'hubiésemos', 'hubieseis', 'hubiesen', 'habiendo', 'habido', 'habida', 
'habidos', 'habidas', 'soy', 'eres', 'es', 'somos', 'sois', 'son', 'sea', 'seas', 'seamos', 'seáis', 'sean', 'seré', 'serás', 'será', 'seremos', 'seréis', 'serán', 'sería', 'serías', 'seríamos', 'seríais', 'serían', 'era', 'eras', 'éramos', 'erais', 'eran', 'fui', 'fuiste', 'fue', 'fuimos', 'fuisteis', 'fueron', 'fuera', 'fueras', 'fuéramos', 'fuerais', 'fueran', 'fuese', 'fueses', 'fuésemos', 'fueseis', 'fuesen', 'sintiendo', 'sentido', 'sentida', 'sentidos', 'sentidas', 'siente', 'sentid', 'tengo', 'tienes', 'tiene', 'tenemos', 'tenéis', 'tienen', 'tenga', 'tengas', 'tengamos', 'tengáis', 'tengan', 'tendré', 'tendrás', 'tendrá', 'tendremos', 'tendréis', 'tendrán', 'tendría', 'tendrías', 'tendríamos', 'tendríais', 'tendrían', 'tenía', 'tenías', 'teníamos', 'teníais', 'tenían', 'tuve', 'tuviste', 'tuvo', 'tuvimos', 'tuvisteis', 'tuvieron', 'tuviera', 'tuvieras', 'tuviéramos', 'tuvierais', 'tuvieran', 'tuviese', 'tuvieses', 'tuviésemos', 'tuvieseis', 'tuviesen', 'teniendo', 'tenido', 'tenida', 'tenidos', 'tenidas', 'tened']

The Spanish list is longer than the English one, with 313 stopwords compared to 179. However, both lists have the same types of elements, such as conjugations of commonly used verbs (such as ser and estar), common pronouns and prepositions (yo, tú, para, contra), among other things. What you don’t see much of in the Spanish stopwords list are contractions, and that’s because Spanish has only two standard contractions (al and del, both of which are on this list) while English has plenty.

Now, one cool thing about working with stopwords (and NLP in general) is that you can play around with several foreign languages. Run the command print(stopwords.fileids()) to see all the languages you can play with when working with stopwords:

['arabic', 'azerbaijani', 'bengali', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']

In total, you can use 24 languages when working with stopwords, from common languages like English, Spanish and French to more interesting options like Kazakh and Turkish. Surprisingly, I don’t see an option for Mandarin on here, even though it’s one of the most widely spoken languages in the world.
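If you’re curious how large each language’s list is, a short helper can loop over the language names and count the words. In this sketch, stopword_counts and TinyCorpus are hypothetical names; TinyCorpus is a made-up stand-in that offers the same fileids() and words() methods used on NLTK’s stopwords object, so the example runs without any downloads.

```python
def stopword_counts(corpus):
    """Map each language name to the number of stopwords in its list."""
    return {lang: len(corpus.words(lang)) for lang in corpus.fileids()}


class TinyCorpus:
    """Hypothetical stand-in with made-up data, mimicking fileids() and words()."""

    _data = {
        "english": ["a", "the", "is"],
        "spanish": ["el", "la", "de", "y"],
    }

    def fileids(self):
        return list(self._data)

    def words(self, fileid):
        return list(self._data[fileid])


counts = stopword_counts(TinyCorpus())
print(counts)  # {'english': 3, 'spanish': 4}
```

With the stopwords data downloaded, calling stopword_counts(stopwords) on the object imported from nltk.corpus should give the real sizes, though the exact numbers may differ between NLTK versions.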

Thank you,

Michael
