Hello everybody,
Michael here, and in today’s post, we’re going to explore a Python NLP/machine-learning technique known as the bag-of-words.
What is the bag-of-words?
Good question! The bag-of-words is a simple NLP algorithm that turns text into fixed-length vectors by counting the number of times each word occurs in a text string or document. The information that the bag-of-words algorithm provides is useful for various NLP tasks such as topic modelling (e.g., categorizing a news article based on its content) and sentiment analysis, among other things.
- Do you wonder why this algorithm is called the bag-of-words? The bag-of-words algorithm represents a text string or document as a, well, “bag” of words. All this algorithm does is count how many times a word appears in a text string or document; the string/document’s syntax and semantics aren’t taken into account here. By that I mean if we have a word like free that’s used in a sentence as both a noun and a verb, the bag-of-words algorithm won’t distinguish between the two uses of the word.
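Before we build the full NLTK pipeline below, here’s a minimal standard-library sketch of the idea: counting token occurrences with `collections.Counter` is already a tiny bag-of-words (the toy sentence here is made up for illustration).

```python
from collections import Counter

# A toy document; "free" appears twice, in different grammatical roles,
# but a bag-of-words treats every occurrence as the same token.
doc = "the free sample is free to take"
counts = Counter(doc.split())

print(counts["free"])    # 2 -- only the count matters, not the grammar
print(counts["sample"])  # 1
```

Everything else in this post is essentially this counting step, plus tokenization, cleanup, and a shared vocabulary across documents.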
It’s data preparation time!
Now that you know the gist of the bag-of-words algorithm, let’s implement it in Python!
However, before we get to the fun part (implementing the algorithm), let’s first import the packages and download the two NLTK resources we’ll be using in this lesson:
import pandas
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
Next up, let’s add a list of strings that we will be analyzing:
reviews = ["Wow! You’ll say that over and over again as this mind-blowing, superhero epic unfolds. Wow!",
"The tribute here is heartfelt, but the spirit of the man and the character sometimes get lost in all the bric-a-brac of the Marvel machine... the film lands on a triumphant note of succession, as it must– the gods inside and above the narrative demand it.",
"An exercise in superhero mourning done right.",
"The MCU’s mechanics are too oppressive to allow for true mournful meditation.",
"This soulful sequel teams an emotional tribute to late star Chadwick Boseman with some spectacular visual action. A maturity milestone for the Marvel Cinematic Universe, starring Angela Bassett and Winston Duke.",
"The opening and closing sequences of Wakanda Forever will make your heart ache. But at 2hrs 41mins, this is also one of the longest films in the MCU. And there are long stretches in it which border on boredom. I was weepy but also weary.",
"Coogler pulls off an incredible feat, despite some story stumbles, creating a superhero film that is emotionally affecting, politically and culturally urgent, and that pays loving tribute not just to T’Challa but Chadwick Boseman too.",
"“Wakanda Forever” is the first blockbuster wake, and it’s powered not by vibranium but by its vibrant and fully felt emotions.",
"For all its comic-book violence, over-the-top villainy, and too dark CGI, at its core this is a film about dealing with loss.",
"It’s both a tribute to the late Chadwick Boseman and a problem for the movie that “Black Panther: Wakanda Forever” feels his loss so keenly.",
"Presented the daunting task of bidding farewell to a star tragically taken in his prime in sober but stirring fashion, Coogler has given audiences, and the studio, a solidly and gracefully executed dive into a “Wakanda” for right now."]
In this example, we’re going to analyze 11 randomly selected critic reviews of the most recently released MCU (Marvel Cinematic Universe) film, Black Panther: Wakanda Forever, which, by the way, is one of the MCU’s best entries since Avengers: Endgame.
- Well, now that Ant-Man and the Wasp: Quantumania is out, Black Panther: Wakanda Forever is no longer the most recently released MCU film.
Now that we have the strings that we are going to analyze, let’s start analyzing! The first step in our analysis will be data preparation, which should be the first step in any data analysis you do. Here’s one way to approach the data preparation process:
stopwordsList = set(stopwords.words('english'))
tokensList = []
for r in reviews:
    tokens = nltk.word_tokenize(r)
    tokens = list(filter(lambda word: word not in '.!,’“”:...', tokens))
    tokens = list(filter(lambda word: word.casefold() not in stopwordsList, tokens))
    if 'must–' in tokens:
        tokens.remove('must–')
    tokensList.append(tokens)
    print(tokens)
['Wow', 'say', 'mind-blowing', 'superhero', 'epic', 'unfolds', 'Wow']
['tribute', 'heartfelt', 'spirit', 'man', 'character', 'sometimes', 'get', 'lost', 'bric-a-brac', 'Marvel', 'machine', 'film', 'lands', 'triumphant', 'note', 'succession', 'gods', 'inside', 'narrative', 'demand']
['exercise', 'superhero', 'mourning', 'done', 'right']
['MCU', 'mechanics', 'oppressive', 'allow', 'true', 'mournful', 'meditation']
['soulful', 'sequel', 'teams', 'emotional', 'tribute', 'late', 'star', 'Chadwick', 'Boseman', 'spectacular', 'visual', 'action', 'maturity', 'milestone', 'Marvel', 'Cinematic', 'Universe', 'starring', 'Angela', 'Bassett', 'Winston', 'Duke']
['opening', 'closing', 'sequences', 'Wakanda', 'Forever', 'make', 'heart', 'ache', '2hrs', '41mins', 'also', 'one', 'longest', 'films', 'MCU', 'long', 'stretches', 'border', 'boredom', 'weepy', 'also', 'weary']
['Coogler', 'pulls', 'incredible', 'feat', 'despite', 'story', 'stumbles', 'creating', 'superhero', 'film', 'emotionally', 'affecting', 'politically', 'culturally', 'urgent', 'pays', 'loving', 'tribute', 'Challa', 'Chadwick', 'Boseman']
['Wakanda', 'Forever', 'first', 'blockbuster', 'wake', 'powered', 'vibranium', 'vibrant', 'fully', 'felt', 'emotions']
['comic-book', 'violence', 'over-the-top', 'villainy', 'dark', 'CGI', 'core', 'film', 'dealing', 'loss']
['tribute', 'late', 'Chadwick', 'Boseman', 'problem', 'movie', 'Black', 'Panther', 'Wakanda', 'Forever', 'feels', 'loss', 'keenly']
['Presented', 'daunting', 'task', 'bidding', 'farewell', 'star', 'tragically', 'taken', 'prime', 'sober', 'stirring', 'fashion', 'Coogler', 'given', 'audiences', 'studio', 'solidly', 'gracefully', 'executed', 'dive', 'Wakanda', 'right']
So, how exactly did I preprocess the data? Well, I first created a stopwordsList, which will allow us to filter out all of the English stopwords from the text. I also created a tokensList, to which I append each processed token list; I’ll explain this later in the post.
- When running NLP analyses, you don’t necessarily have to remove the stopwords from the text you’re analyzing; it’s more of a best practice.
After creating my stopwords list, I then ran a for loop through all of the elements in the reviews list and word-tokenized each element using NLTK’s word_tokenize method. I also stored the outputs of the word-tokenization in the tokens variable.
The two following lines are where the data preparation magic really happens, as I utilize a combination of filter and lambda functions to remove both commonly occurring punctuation and stopwords from each tokens list.
- In case you’re wondering why I chose to remove punctuation and stopwords on separate lines of code: I tried combining the two steps into a single line, tokens = list(filter(lambda word: word not in '.!,’“”:...', tokens) | filter(lambda word: word.casefold() not in stopwordsList, tokens)), and it didn’t remove the punctuation or stopwords; filter objects can’t be combined with the | operator.
- Yes, you’ll need to include the list wrapper in your code. Otherwise, the filter function will return lazy filter objects rather than the processed word-tokenized list (tokens).
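You can see the lazy behavior of filter for yourself in a quick standalone sketch (the tiny token list here is made up for illustration):

```python
tokens = ["Wow", "!", "epic"]

# filter() returns a lazy iterator object, not a list
f = filter(lambda w: w not in '.!,’“”:...', tokens)
print(type(f).__name__)  # 'filter'

# Wrapping it in list() materializes the filtered tokens
result = list(f)
print(result)  # ['Wow', 'epic']
```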
After removing the punctuation and stopwords from the list, I noticed that there was a must– token (yes, with a trailing dash that the tokenizer left attached) among the filtered tokens, so I added in a few lines of code to check for this token and remove it. Lastly, I then printed out all of the processed tokens (after the punctuation and stopwords have been removed).
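For completeness, the two filtering steps could also be combined into a single pass with a list comprehension. Here’s a sketch using a tiny stand-in stopword set rather than NLTK’s full English list:

```python
# Stand-in stopword set; the post uses NLTK's full English stopword list
stopwordsList = {"the", "a", "is"}
tokens = ["Wow", "!", "the", "epic", "unfolds", "."]

# One pass that drops punctuation tokens and stopwords together
tokens = [w for w in tokens
          if w not in '.!,’“”:...' and w.casefold() not in stopwordsList]
print(tokens)  # ['Wow', 'epic', 'unfolds']
```

Unlike chaining filter objects, a comprehension lets you combine both conditions with a plain and.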
Now for the fun part…the bag-of-words implementation!
Now that the data has been processed, it’s time for the fun part…implementing the bag-of-words algorithm! The first step in implementing the bag-of-words is to create a vocab list of all the tokens (words) found in each of the reviews (pay attention to the new lines of code):
stopwordsList = set(stopwords.words('english'))
vocab = []
for r in reviews:
    tokens = nltk.word_tokenize(r)
    tokens = list(filter(lambda word: word not in '.!,’“”:...', tokens))
    tokens = list(filter(lambda word: word.casefold() not in stopwordsList, tokens))
    if 'must–' in tokens:
        tokens.remove('must–')
    for t in tokens:
        vocab.append(t)
vocab = list(set(vocab))
print(vocab)
['loss', 'milestone', 'powered', 'Challa', 'bidding', 'mechanics', 'triumphant', 'task', 'violence', 'spectacular', 'CGI', 'feat', 'lands', 'creating', 'fashion', 'allow', 'feels', 'stretches', 'starring', 'villainy', 'gods', 'movie', 'sober', 'Cinematic', 'felt', 'incredible', 'action', 'Chadwick', 'opening', 'affecting', 'get', '41mins', 'border', 'sequel', 'problem', 'Bassett', 'wake', 'note', 'spirit', 'done', 'succession', 'machine', 'Angela', 'loving', 'comic-book', 'dive', 'pulls', 'star', 'stirring', 'man', 'boredom', 'pays', 'first', 'prime', 'ache', 'taken', 'late', 'demand', 'Presented', 'fully', 'exercise', 'one', 'film', 'Panther', 'despite', 'sometimes', 'farewell', 'mind-blowing', 'Winston', 'blockbuster', 'weary', 'character', 'Marvel', 'meditation', 'Black', 'mourning', 'emotions', 'heartfelt', 'Coogler', 'MCU', 'emotionally', 'studio', 'closing', 'superhero', 'lost', 'Universe', '2hrs', 'Wakanda', 'inside', 'keenly', 'long', 'executed', 'also', 'films', 'sequences', 'core', 'vibrant', 'tribute', 'tragically', 'culturally', 'epic', 'Wow', 'audiences', 'urgent', 'emotional', 'soulful', 'over-the-top', 'vibranium', 'visual', 'teams', 'Duke', 'bric-a-brac', 'true', 'maturity', 'gracefully', 'Forever', 'right', 'mournful', 'oppressive', 'make', 'unfolds', 'weepy', 'given', 'Boseman', 'dark', 'story', 'dealing', 'say', 'heart', 'solidly', 'narrative', 'stumbles', 'politically', 'daunting', 'longest']
The new lines of code I added include initializing an empty vocab list, which I fill with the tokens from each string in the reviews list.
I also added another for loop within the main for loop that iterates through each token in the tokens list and appends it to the vocab list. Once all the tokens from every review have been appended, I run the nested list(set(...)) call to turn the vocab list into a set and back into a list before printing it.
- Why do I turn the vocab list into a set? I wanted to remove all duplicate elements from the vocab list but still wanted to keep vocab as a list, so changing the vocab list to a set and then back to a list was the easiest thing to do. Recall that sets in Python are like lists but with no duplicate elements.
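Here’s a small standalone sketch of the list(set(...)) trick, using a made-up mini-vocabulary, plus an order-preserving alternative in case you ever need one:

```python
vocab = ["tribute", "film", "tribute", "loss", "film"]

# set() removes duplicates but scrambles order; list() converts back
deduped = list(set(vocab))
print(len(deduped))  # 3

# Order-preserving alternative: dict keys keep insertion order (Python 3.7+)
deduped_ordered = list(dict.fromkeys(vocab))
print(deduped_ordered)  # ['tribute', 'film', 'loss']
```

For our purposes the scrambled order doesn’t matter, since we sort the vocabulary anyway in the next step.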
It’s vectorization time!
Now that we have a vocab list ready, it’s time for vectorization!
What is vectorization, though? In the context of the bag-of-words algorithm, vectorization takes a common vocabulary list, like the vocab list we just created, and, based on that vocabulary, assigns each word a number indicating how many times it appears in a document/string. Reading those counts off in vocabulary order gives you one fixed-length vector per document.
How would we implement vectorization? First off, and this part is completely optional, let’s sort the vocab list alphabetically:
vocabSorted = sorted(vocab)
print(vocabSorted)
['2hrs', '41mins', 'Angela', 'Bassett', 'Black', 'Boseman', 'CGI', 'Chadwick', 'Challa', 'Cinematic', 'Coogler', 'Duke', 'Forever', 'MCU', 'Marvel', 'Panther', 'Presented', 'Universe', 'Wakanda', 'Winston', 'Wow', 'ache', 'action', 'affecting', 'allow', 'also', 'audiences', 'bidding', 'blockbuster', 'border', 'boredom', 'bric-a-brac', 'character', 'closing', 'comic-book', 'core', 'creating', 'culturally', 'dark', 'daunting', 'dealing', 'demand', 'despite', 'dive', 'done', 'emotional', 'emotionally', 'emotions', 'epic', 'executed', 'exercise', 'farewell', 'fashion', 'feat', 'feels', 'felt', 'film', 'films', 'first', 'fully', 'get', 'given', 'gods', 'gracefully', 'heart', 'heartfelt', 'incredible', 'inside', 'keenly', 'lands', 'late', 'long', 'longest', 'loss', 'lost', 'loving', 'machine', 'make', 'man', 'maturity', 'mechanics', 'meditation', 'milestone', 'mind-blowing', 'mournful', 'mourning', 'movie', 'narrative', 'note', 'one', 'opening', 'oppressive', 'over-the-top', 'pays', 'politically', 'powered', 'prime', 'problem', 'pulls', 'right', 'say', 'sequel', 'sequences', 'sober', 'solidly', 'sometimes', 'soulful', 'spectacular', 'spirit', 'star', 'starring', 'stirring', 'story', 'stretches', 'studio', 'stumbles', 'succession', 'superhero', 'taken', 'task', 'teams', 'tragically', 'tribute', 'triumphant', 'true', 'unfolds', 'urgent', 'vibranium', 'vibrant', 'villainy', 'violence', 'visual', 'wake', 'weary', 'weepy']
In order to sort the vocabulary list alphabetically, I used the sorted() function and passed in the vocab list as the function’s parameter. I also saved the sorted vocabulary list to the vocabSorted variable.
As you can see from the output above, the sorted() function sorts all of the uppercase strings in the list alphabetically before doing the same with the lowercase strings, because uppercase letters have smaller Unicode code points. That’s why the capitalized Wow is listed before the lowercase ache.
- As I just said, it’s not required to sort the vocabulary list, but I just wanted to do it in order to make the vectorization process easier.
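A small sketch of this sorting behavior, using a made-up four-word vocabulary; if you’d rather ignore case when sorting, sorted() accepts a key function:

```python
vocab = ["Wow", "ache", "Wakanda", "border"]

# Default sort: uppercase strings first (smaller Unicode code points)
print(sorted(vocab))                    # ['Wakanda', 'Wow', 'ache', 'border']

# Case-insensitive sort via a key function
print(sorted(vocab, key=str.casefold))  # ['ache', 'border', 'Wakanda', 'Wow']
```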
Now, how would we create the bag-of-words vectors for each string? Take a look at the code below:
wordVectorDict = {}
for t in tokensList:
    for v in vocabSorted:
        if v in t:
            wordVectorDict[v] = t.count(v)
        else:
            wordVectorDict[v] = 0
    print(wordVectorDict)
    print()
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 0, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 0, 'MCU': 0, 'Marvel': 0, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 0, 'Winston': 0, 'Wow': 2, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 1, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 0, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 1, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 0, 'say': 1, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 1, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 0, 'triumphant': 0, 'true': 0, 'unfolds': 1, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 0, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 0, 'MCU': 0, 'Marvel': 1, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 0, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 1, 'character': 1, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 1, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 1, 'films': 0, 'first': 0, 'fully': 0, 'get': 1, 'given': 0, 'gods': 1, 'gracefully': 0, 'heart': 0, 'heartfelt': 1, 'incredible': 0, 'inside': 1, 'keenly': 0, 'lands': 1, 'late': 0, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 1, 'loving': 0, 'machine': 1, 'make': 0, 'man': 1, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 1, 'note': 1, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 0, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 1, 'soulful': 0, 'spectacular': 0, 'spirit': 1, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 1, 'superhero': 0, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 1, 'triumphant': 1, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 0, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 0, 'MCU': 0, 'Marvel': 0, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 0, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 1, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 1, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 0, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 1, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 1, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 1, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 0, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 0, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 0, 'MCU': 1, 'Marvel': 0, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 0, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 1, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 0, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 1, 'meditation': 1, 'milestone': 0, 'mind-blowing': 0, 'mournful': 1, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 1, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 0, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 0, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 0, 'triumphant': 0, 'true': 1, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 1, 'Bassett': 1, 'Black': 0, 'Boseman': 1, 'CGI': 0, 'Chadwick': 1, 'Challa': 0, 'Cinematic': 1, 'Coogler': 0, 'Duke': 1, 'Forever': 0, 'MCU': 0, 'Marvel': 1, 'Panther': 0, 'Presented': 0, 'Universe': 1, 'Wakanda': 0, 'Winston': 1, 'Wow': 0, 'ache': 0, 'action': 1, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 1, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 0, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 1, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 1, 'mechanics': 0, 'meditation': 0, 'milestone': 1, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 0, 'say': 0, 'sequel': 1, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 1, 'spectacular': 1, 'spirit': 0, 'star': 1, 'starring': 1, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 0, 'taken': 0, 'task': 0, 'teams': 1, 'tragically': 0, 'tribute': 1, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 1, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 1, '41mins': 1, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 0, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 1, 'MCU': 1, 'Marvel': 0, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 1, 'Winston': 0, 'Wow': 0, 'ache': 1, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 2, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 1, 'boredom': 1, 'bric-a-brac': 0, 'character': 0, 'closing': 1, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 0, 'films': 1, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 1, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 1, 'longest': 1, 'loss': 0, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 1, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 1, 'opening': 1, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 0, 'say': 0, 'sequel': 0, 'sequences': 1, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 1, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 0, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 0, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 1, 'weepy': 1}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 1, 'CGI': 0, 'Chadwick': 1, 'Challa': 1, 'Cinematic': 0, 'Coogler': 1, 'Duke': 0, 'Forever': 0, 'MCU': 0, 'Marvel': 0, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 0, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 1, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 1, 'culturally': 1, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 1, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 1, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 1, 'feels': 0, 'felt': 0, 'film': 1, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 1, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 0, 'loving': 1, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 1, 'politically': 1, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 1, 'right': 0, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 1, 'stretches': 0, 'studio': 0, 'stumbles': 1, 'succession': 0, 'superhero': 1, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 1, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 1, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 0, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 1, 'MCU': 0, 'Marvel': 0, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 1, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 1, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 1, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 1, 'film': 0, 'films': 0, 'first': 1, 'fully': 1, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 1, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 0, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 0, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 0, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 1, 'vibrant': 1, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 1, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 1, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 0, 'MCU': 0, 'Marvel': 0, 'Panther': 0, 'Presented': 0, 'Universe': 0, 'Wakanda': 0, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 1, 'core': 1, 'creating': 0, 'culturally': 0, 'dark': 1, 'daunting': 0, 'dealing': 1, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 1, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 0, 'longest': 0, 'loss': 1, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 1, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 0, 'pulls': 0, 'right': 0, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 0, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 0, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 1, 'violence': 1, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 1, 'Boseman': 1, 'CGI': 0, 'Chadwick': 1, 'Challa': 0, 'Cinematic': 0, 'Coogler': 0, 'Duke': 0, 'Forever': 1, 'MCU': 0, 'Marvel': 0, 'Panther': 1, 'Presented': 0, 'Universe': 0, 'Wakanda': 1, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 0, 'bidding': 0, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 0, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 0, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 0, 'exercise': 0, 'farewell': 0, 'fashion': 0, 'feat': 0, 'feels': 1, 'felt': 0, 'film': 0, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 0, 'gods': 0, 'gracefully': 0, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 1, 'lands': 0, 'late': 1, 'long': 0, 'longest': 0, 'loss': 1, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 1, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 0, 'problem': 1, 'pulls': 0, 'right': 0, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 0, 'solidly': 0, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 0, 'starring': 0, 'stirring': 0, 'story': 0, 'stretches': 0, 'studio': 0, 'stumbles': 0, 'succession': 0, 'superhero': 0, 'taken': 0, 'task': 0, 'teams': 0, 'tragically': 0, 'tribute': 1, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
{'2hrs': 0, '41mins': 0, 'Angela': 0, 'Bassett': 0, 'Black': 0, 'Boseman': 0, 'CGI': 0, 'Chadwick': 0, 'Challa': 0, 'Cinematic': 0, 'Coogler': 1, 'Duke': 0, 'Forever': 0, 'MCU': 0, 'Marvel': 0, 'Panther': 0, 'Presented': 1, 'Universe': 0, 'Wakanda': 1, 'Winston': 0, 'Wow': 0, 'ache': 0, 'action': 0, 'affecting': 0, 'allow': 0, 'also': 0, 'audiences': 1, 'bidding': 1, 'blockbuster': 0, 'border': 0, 'boredom': 0, 'bric-a-brac': 0, 'character': 0, 'closing': 0, 'comic-book': 0, 'core': 0, 'creating': 0, 'culturally': 0, 'dark': 0, 'daunting': 1, 'dealing': 0, 'demand': 0, 'despite': 0, 'dive': 1, 'done': 0, 'emotional': 0, 'emotionally': 0, 'emotions': 0, 'epic': 0, 'executed': 1, 'exercise': 0, 'farewell': 1, 'fashion': 1, 'feat': 0, 'feels': 0, 'felt': 0, 'film': 0, 'films': 0, 'first': 0, 'fully': 0, 'get': 0, 'given': 1, 'gods': 0, 'gracefully': 1, 'heart': 0, 'heartfelt': 0, 'incredible': 0, 'inside': 0, 'keenly': 0, 'lands': 0, 'late': 0, 'long': 0, 'longest': 0, 'loss': 0, 'lost': 0, 'loving': 0, 'machine': 0, 'make': 0, 'man': 0, 'maturity': 0, 'mechanics': 0, 'meditation': 0, 'milestone': 0, 'mind-blowing': 0, 'mournful': 0, 'mourning': 0, 'movie': 0, 'narrative': 0, 'note': 0, 'one': 0, 'opening': 0, 'oppressive': 0, 'over-the-top': 0, 'pays': 0, 'politically': 0, 'powered': 0, 'prime': 1, 'problem': 0, 'pulls': 0, 'right': 1, 'say': 0, 'sequel': 0, 'sequences': 0, 'sober': 1, 'solidly': 1, 'sometimes': 0, 'soulful': 0, 'spectacular': 0, 'spirit': 0, 'star': 1, 'starring': 0, 'stirring': 1, 'story': 0, 'stretches': 0, 'studio': 1, 'stumbles': 0, 'succession': 0, 'superhero': 0, 'taken': 1, 'task': 1, 'teams': 0, 'tragically': 1, 'tribute': 0, 'triumphant': 0, 'true': 0, 'unfolds': 0, 'urgent': 0, 'vibranium': 0, 'vibrant': 0, 'villainy': 0, 'violence': 0, 'visual': 0, 'wake': 0, 'weary': 0, 'weepy': 0}
In this example, I created a wordVectorDict dictionary, which I'll use to build the word vector for each element in the tokensList.
After creating the wordVectorDict dictionary, I run a for loop over the tokensList and a nested for loop over the vocabSorted list (you can simply use the vocab list if you chose not to sort the vocabulary). In the wordVectorDict dictionary, each element of the vocabSorted list serves as a key, while the count of that element in a processed review string serves as the corresponding value. For instance, in the first review, the word Wow is used twice, so the key-value pair for the word Wow in the first wordVectorDict would be Wow: 2. If an element in the vocabSorted list doesn't appear in a processed review string, the value for that vocabulary key would be 0. For instance, since the word farewell doesn't appear in the first review, its key-value pair would be farewell: 0.
As you can probably guess from my code, I created 11 wordVectorDict dictionaries, one for each element in the tokensList, and printed them all out so you can see what each word vector will eventually look like (more on that later).
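As an aside, the same per-review counting can be written more compactly with Python's built-in collections.Counter, which returns 0 for missing keys so no if/else is needed. Here's a minimal sketch with toy stand-ins for the tokensList and vocabSorted we built earlier (the real ones come from the tokenized reviews):

```python
from collections import Counter

# Toy stand-ins for the tokensList and vocabSorted built earlier in this post
tokensList = [["Wow", "say", "Wow"], ["tribute", "heartfelt"]]
vocabSorted = sorted({token for tokens in tokensList for token in tokens})

for tokens in tokensList:
    counts = Counter(tokens)  # count every token in one pass
    # Counter returns 0 for tokens that don't appear, so no if/else is needed
    wordVectorDict = {v: counts[v] for v in vocabSorted}
    print(wordVectorDict)
```

This produces the same kind of dictionaries as the loop described above, just with less bookkeeping.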
Creating the word vectors
Now that we’ve got an idea as to the token count for each processed review, it’s time to create the word vectors! How would we do so? Take a look at the code below to see one approach:
import numpy as np
wordVectorDict = {}
wordVector = []
for t in tokensList:
    for v in vocabSorted:
        if v in t:
            wordVectorDict[v] = t.count(v)
        else:
            wordVectorDict[v] = 0
    wordVector = np.array(list(wordVectorDict.values()))
    print(wordVector)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0
1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0
0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 1 1 0 1 0 1 0 1 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 1 1
0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0]
[1 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 0 2 0 0 0 1 1 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]
[0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0]
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0]
[0 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 1 0
1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
To create the word vectors, I simply grabbed the values from each wordVectorDict, placed them into a numpy array, and printed each array.
- Yes, you will need to install and import numpy for this example.
As you can see from the output, most of the elements in each numpy array are zeroes and ones, with a handful of twos, indicating that most of the tokens in the vocabSorted list appear in each string at most once.
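You can check that sparsity claim numerically. Here's a small sketch using one hypothetical word vector in place of the real ones printed above:

```python
import numpy as np

# A sample word vector like those printed above (mostly zeroes)
wordVector = np.array([0, 0, 2, 0, 1, 0, 0, 1, 0, 0])

# Fraction of entries that are zero
zeroFraction = np.count_nonzero(wordVector == 0) / wordVector.size
print(f"{zeroFraction:.0%} of the entries are zero")
```

This kind of sparsity is typical of bag-of-words vectors: the vocabulary spans every review, but any single review uses only a small slice of it.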
Presenting our bag-of-words
Now that we’ve created a word vector for each processed element in the reviews list, it’s time to figure out how best to present the data. Take a look at the code below:
import numpy as np
import pandas as pd
wordVectorDict = {}
wordVector = []
wordVectorList = []
for t in tokensList:
    for v in vocabSorted:
        if v in t:
            wordVectorDict[v] = t.count(v)
        else:
            wordVectorDict[v] = 0
    wordVector = np.array(list(wordVectorDict.values()))
    wordVectorList.append(wordVector)
bagOfWords = pd.DataFrame(wordVectorList)
bagOfWords

In this example, I created a pandas DataFrame (appropriately called bagOfWords) from all 11 wordVectors; it shows how many times each token appears in a particular string. I used the wordVectorList variable to gather all 11 wordVector elements into a single list, which made creating the DataFrame much easier.
- Row 0 corresponds to the first element in reviews, whereas row 10 corresponds to the 11th and final element in reviews.
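Because of that row-to-review mapping, you can pull any review's word vector back out of the DataFrame with iloc. A minimal sketch, using a toy DataFrame in place of the real bagOfWords:

```python
import pandas as pd

# Toy stand-in for the bagOfWords DataFrame built above
bagOfWords = pd.DataFrame([[2, 0, 1], [0, 1, 0]])

firstReviewVector = bagOfWords.iloc[0]  # row 0 = first review
print(firstReviewVector.tolist())
```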
So, the DataFrame is looking pretty good, right? There’s just one issue: you can’t tell which token each column represents just by looking at the headers (granted, the columns follow the order of the vocabSorted list, so you could eventually work out which token corresponds to which index, but it would take some time).
How can we fix this issue? There’s just one tiny change in the code above that you’ll need to make. Can you guess what that would be?
import numpy as np
import pandas as pd
wordVectorDict = {}
wordVector = []
wordVectorList = []
for t in tokensList:
    for v in vocabSorted:
        if v in t:
            wordVectorDict[v] = t.count(v)
        else:
            wordVectorDict[v] = 0
    wordVector = np.array(list(wordVectorDict.values()))
    wordVectorList.append(wordVector)
bagOfWords = pd.DataFrame(wordVectorList, columns=vocabSorted)
bagOfWords

The small change I made in the code is to pass columns=vocabSorted to the pd.DataFrame() function. Just like that, the index of each column is replaced with the token itself, making it much easier to tell what all of the ones and zeroes correspond to.
Thanks for reading!
Michael