Python Lesson 41: Word2Vec (NLP pt.7/AI pt.7)

Hello everybody,

Michael here, and in today’s lesson, we’ll discover another AI NLP algorithm: word2vec (recall that a few posts back, in Python Lesson 40: The NLP Bag-Of-Words (NLP pt. 6/AI pt.5), we discussed the bag-of-words algorithm).

What is the word2vec algorithm?

So, how does the word2vec algorithm work? Let’s say we were to process a text document using this algorithm. The word2vec algorithm would analyze each word in the document along with the words that commonly appear near it. For instance, if the word “family” was found in a document and the words “daughter”, “parents”, and “generations” were found near it, the word “family” would be lumped in with these words, as they are all related to each other.

If you were to ask your program for the three most similar words to “family” (according to our hypothetical document), it would say “daughter”, “parents”, and “generations”. How would the word2vec algorithm find the most similar words? By using a mathematical metric called cosine similarity: more on that later.

Preparing the data

So, before we dive into the magic of word2vec, let’s gather the data we’re going to use.

In this example, we’ll utilize the ChatGPT-generated essay below, which contains five paragraphs on a very relevant topic in 2023, returning to the office, covering both sides of the debate.

So, let’s open up our Python IDEs and start coding!

First, let’s open and read this file into our IDE:

with open(r'C:\Users\mof39\OneDrive\Documents\return to office.txt', 'r', encoding='utf-8') as file:
    word2VecFile = file.read()
    print(word2VecFile)

In Favor of Returning to Office:

Many argue that returning to the office is crucial for maintaining the productivity and collaboration necessary for businesses to thrive. Working in the same space allows for easier communication, quicker decision-making, and better team building. It also provides opportunities for social interactions that can improve employee morale and job satisfaction.

Another benefit of returning to the office is the separation of work and home life. With many employees working remotely during the pandemic, the lines between work and personal time have become blurred. This can lead to burnout and other negative consequences. By returning to the office, employees can more easily maintain a work-life balance and avoid the mental exhaustion that comes with always being “on.”

Against Returning to Office:

On the other hand, many people argue that remote work has proven to be effective, and there is no need to return to the office. In fact, studies have shown that remote workers are often more productive than those who work in the office. Additionally, remote work allows for more flexibility in scheduling and reduces the amount of time and money spent commuting.

Another important consideration is the health and safety of employees. With the ongoing threat of COVID-19 and the emergence of new variants, returning to the office could put employees at risk. Some may not feel comfortable being in close proximity to their colleagues or may have concerns about the effectiveness of safety protocols. In these cases, continuing to work remotely may be the best option.

Ultimately, the decision to return to the office or continue remote work will depend on a variety of factors, including the type of work being done, the needs of the business, and the preferences of employees. It is important to consider all perspectives and prioritize the health and safety of everyone involved.
  • You can technically open a .txt file with the pd.read_csv method, but I prefer using the with...open method of reading .txt files, since pd.read_csv outputs the text file in tabular form, which isn’t going to work for this tutorial.

And now let’s create the word2vec model

Now that we’ve read our text file into the system, the next thing we’ll do is create our word2vec model. Here’s the code to do so:

First, let’s gather our list of tokens for the analysis:

import nltk
import gensim

modelData = []
punctuation = [',', '.', '”', '“', ':']

# Sentence-tokenize the file, then word-tokenize each sentence,
# keeping lowercased tokens that aren't in the punctuation list.
for sentence in nltk.sent_tokenize(word2VecFile):
    tokens = []

    for t in nltk.word_tokenize(sentence):
        if t not in punctuation:
            tokens.append(t.lower())

    modelData.append(tokens)
    
print(modelData)

[['in', 'favor', 'of', 'returning', 'to', 'office', 'many', 'argue', 'that', 'returning', 'to', 'the', 'office', 'is', 'crucial', 'for', 'maintaining', 'the', 'productivity', 'and', 'collaboration', 'necessary', 'for', 'businesses', 'to', 'thrive'], ['working', 'in', 'the', 'same', 'space', 'allows', 'for', 'easier', 'communication', 'quicker', 'decision-making', 'and', 'better', 'team', 'building'], ['it', 'also', 'provides', 'opportunities', 'for', 'social', 'interactions', 'that', 'can', 'improve', 'employee', 'morale', 'and', 'job', 'satisfaction'], ['another', 'benefit', 'of', 'returning', 'to', 'the', 'office', 'is', 'the', 'separation', 'of', 'work', 'and', 'home', 'life'], ['with', 'many', 'employees', 'working', 'remotely', 'during', 'the', 'pandemic', 'the', 'lines', 'between', 'work', 'and', 'personal', 'time', 'have', 'become', 'blurred'], ['this', 'can', 'lead', 'to', 'burnout', 'and', 'other', 'negative', 'consequences'], ['by', 'returning', 'to', 'the', 'office', 'employees', 'can', 'more', 'easily', 'maintain', 'a', 'work-life', 'balance', 'and', 'avoid', 'the', 'mental', 'exhaustion', 'that', 'comes', 'with', 'always', 'being', 'on.', 'against', 'returning', 'to', 'office', 'on', 'the', 'other', 'hand', 'many', 'people', 'argue', 'that', 'remote', 'work', 'has', 'proven', 'to', 'be', 'effective', 'and', 'there', 'is', 'no', 'need', 'to', 'return', 'to', 'the', 'office'], ['in', 'fact', 'studies', 'have', 'shown', 'that', 'remote', 'workers', 'are', 'often', 'more', 'productive', 'than', 'those', 'who', 'work', 'in', 'the', 'office'], ['additionally', 'remote', 'work', 'allows', 'for', 'more', 'flexibility', 'in', 'scheduling', 'and', 'reduces', 'the', 'amount', 'of', 'time', 'and', 'money', 'spent', 'commuting'], ['another', 'important', 'consideration', 'is', 'the', 'health', 'and', 'safety', 'of', 'employees'], ['with', 'the', 'ongoing', 'threat', 'of', 'covid-19', 'and', 'the', 'emergence', 'of', 'new', 'variants', 'returning', 'to', 'the', 
'office', 'could', 'put', 'employees', 'at', 'risk'], ['some', 'may', 'not', 'feel', 'comfortable', 'being', 'in', 'close', 'proximity', 'to', 'their', 'colleagues', 'or', 'may', 'have', 'concerns', 'about', 'the', 'effectiveness', 'of', 'safety', 'protocols'], ['in', 'these', 'cases', 'continuing', 'to', 'work', 'remotely', 'may', 'be', 'the', 'best', 'option'], ['ultimately', 'the', 'decision', 'to', 'return', 'to', 'the', 'office', 'or', 'continue', 'remote', 'work', 'will', 'depend', 'on', 'a', 'variety', 'of', 'factors', 'including', 'the', 'type', 'of', 'work', 'being', 'done', 'the', 'needs', 'of', 'the', 'business', 'and', 'the', 'preferences', 'of', 'employees'], ['it', 'is', 'important', 'to', 'consider', 'all', 'perspectives', 'and', 'prioritize', 'the', 'health', 'and', 'safety', 'of', 'everyone', 'involved']]

Before creating our tokens lists to use for our model, I first imported the nltk and gensim packages (as always, if there’s a required package you don’t have, pip install it!).

I then created the modelData list. Next, I sentence-tokenized the file, word-tokenized each sentence, and appended MOST of the tokens to a per-sentence tokens list before appending each tokens list to modelData. I say MOST (not all) of the tokens because I excluded any tokens found in the punctuation list.

Now, you may be wondering if we should remove stopwords here like we’ve done for our previous NLP tutorials. Normally, I’d say yes, but since the text is so short and we’re trying to figure out word connections, I’d say keep the stopwords here to get more accurate word2vec results.
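For reference, if you did want to drop stopwords for a larger corpus, the filter slots right into the tokenization loop. Here’s a minimal sketch using a small hand-rolled stopword list (in practice you’d likely use nltk.corpus.stopwords.words('english') instead):

```python
# Minimal stopword-filtering sketch.
# NOTE: this tiny stopword list is illustrative only; nltk's
# stopwords corpus is far more complete.
STOPWORDS = {'the', 'a', 'an', 'to', 'of', 'and', 'is', 'in', 'for', 'that'}

def filter_stopwords(tokens):
    """Return the tokens with stopwords removed (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

sentence = ['returning', 'to', 'the', 'office', 'is', 'crucial',
            'for', 'maintaining', 'productivity']
print(filter_stopwords(sentence))
# ['returning', 'office', 'crucial', 'maintaining', 'productivity']
```

You’d simply call filter_stopwords(tokens) before appending each tokens list to modelData.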

Now that we’ve got our tokens list (or should I say lists-within-a-list), it’s time to build our model!

Rather, I should say model(s), since there are two approaches we can take to build our word2vec analysis: skip-gram and continuous bag-of-words.

Skip-gram

The first word2vec approach we’ll explore is the skip-gram model.

What exactly is the skip-gram model, though? Using our text file as an example, let’s say we were trying to guess which words commonly appear before and after the word “working”. The skip-gram model would analyze our text file and predict the likely neighbors of “working”, such as “remotely” or “hardly”.

Now, how do we implement this model in Python? Take a look at the code below:

skipGram = gensim.models.Word2Vec(modelData, min_count = 1, vector_size = 2, window = 5, sg = 1)

What does each of these parameters mean? Let me explain:

  • min_count: sets the minimum number of times a word must appear in the document to be factored into the skip-gram analysis; in this example, a word only needs to appear once
  • vector_size: sets the number of dimensions each word vector will contain; in this example, each word vector will contain 2 dimensions
  • window: sets the maximum distance between the current and predicted words in the document; this skip-gram analysis will look at the five words preceding and following any given word when determining cosine similarity (more on that later)
  • sg: if this value is 1, use the skip-gram model; if this value is 0, use the continuous bag-of-words model (more on that later)

Now that we have our skip-gram word2vec model set up, let’s test it on three word pairs:

print("Cosine similarity between 'returning' and 'office' - Skip Gram : ", skipGram.wv.similarity('returning', 'office'))     
print("Cosine similarity between 'remote' and 'covid-19' - Skip Gram : ", skipGram.wv.similarity('remote', 'covid-19'))
print("Cosine similarity between 'reduces' and 'commuting' - Skip Gram : ", skipGram.wv.similarity('reduces', 'commuting'))

Cosine similarity between 'returning' and 'office' - Skip Gram :  0.9542125
Cosine similarity between 'remote' and 'covid-19' - Skip Gram :  -0.8962952
Cosine similarity between 'reduces' and 'commuting' - Skip Gram :  0.9316714

In this example, we are analyzing the cosine similarity between three word pairs: returning & office, remote & covid-19, and reduces & commuting. As you can see from the output, two of the word pairs have positive cosine similarity while the other word pair has negative cosine similarity.

What does positive and negative cosine similarity mean? Well, the closer the cosine similarity is to 1, the more semantically similar the two words are to each other (in the context of the document being analyzed). The closer it is to -1, the more dissimilar the two words are to each other (again, in the context of the document being analyzed).

  • Just a tip, but using a vector_size of 2 is often not ideal, especially because most NLP analyses work with larger documents. I used a vector_size of 2 here because the document we’re working with is rather small.
  • Using a vector_size of 0 or 1 won’t work for either the skip-gram or continuous bag-of-words model; with so few dimensions you’ll only get cosine similarities of 1, -1, or 0, which aren’t useful for your analysis.

Continuous bag-of-words

Now that we’ve explored the skip-gram model, let’s analyze the continuous bag-of-words model.

What is the continuous bag-of-words model? Well, like the skip-gram model, the continuous bag-of-words model is a word2vec guessing game. But while the skip-gram model tries to predict the words that come before and after a given word, the continuous bag-of-words model tries to predict a given word from the surrounding words in its sentence.

Here’s a simple example:

The happy couple decided to close on their first _____ yesterday, eagerly anticipating all the memories they would make there. 

Just from this sentence alone, what do you think the missing word would be? If you guessed house, you’d be right! In this example, the continuous bag-of-words model would look at each word in the sentence and based on the given words, guess the missing word.

Now let’s see how to implement the CBOW (continuous bag-of-words) model in Python.

CBOW = gensim.models.Word2Vec(modelData, min_count = 1, vector_size = 2, window = 5, sg = 0)

Now, here’s the best part about the CBOW model: it’s set up in exactly the same way, with the same parameters, as the skip-gram model. The only difference between the two models is that you’d need to set the sg parameter to 0 to indicate that you’d like to use the CBOW model.

Now let’s test our CBOW word2vec model on the same three word pairs we used for the skip-gram model.

print("Cosine similarity between 'returning' and 'office' - CBOW : ", CBOW.wv.similarity('returning', 'office'))
print("Cosine similarity between 'remote' and 'covid-19' - CBOW : ", CBOW.wv.similarity('remote', 'covid-19'))
print("Cosine similarity between 'reduces' and 'commuting' - CBOW : ", CBOW.wv.similarity('reduces', 'commuting'))

Cosine similarity between 'returning' and 'office' - CBOW :  0.95738184
Cosine similarity between 'remote' and 'covid-19' - CBOW :  -0.9320767
Cosine similarity between 'reduces' and 'commuting' - CBOW :  0.94031644

Switching the word2vec model but leaving all other parameters unchanged, we can see that the cosine similarity scores for the three word pairs differ only slightly from the skip-gram scores for the same pairs.

Cosine similarity explained

So, now that I’ve shown you all what cosine similarity looks like, it’s time to explain it further.

In the word2vec algorithm (for both the skip-gram and CBOW models), each word in the document is treated like a vector. Still unsure about the whole vector thing? Here’s an illustration that might help:

Picture the word vectors for returning and office as two arrows drawn from the same origin. What matters is the angle between those two arrows.

Let’s say you wanted to find the cosine similarity between the words returning and office. If you know even basic trigonometry, you’ll know that the cosine of a small angle is close to 1, while the cosine of a wide angle shrinks toward -1. Cosine similarity is simply the cosine of the angle between the two word vectors, computed as their dot product divided by the product of their lengths.
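In practice, the cosine similarity between two word vectors a and b is their dot product divided by the product of their lengths. Here’s a minimal numpy sketch with made-up 2-dimensional vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0])
b = np.array([2.0, 4.0])     # same direction as a
c = np.array([-1.0, -2.0])   # opposite direction from a
d = np.array([2.0, -1.0])    # perpendicular to a

print(round(cosine_similarity(a, b), 3))  # 1.0
print(round(cosine_similarity(a, c), 3))  # -1.0
print(round(cosine_similarity(a, d), 3))  # 0.0
```

This is exactly what gensim’s wv.similarity computes under the hood, just on the model’s learned vectors instead of hand-made ones.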

In word2vec, cosine similarity scores can range from -1 to 1. Here’s how to interpret those scores:

In simple terms, a word pair with a cosine similarity score of -1 indicates two words that are as dissimilar as possible (in the context of the document being analyzed), while a score near 0 indicates the two words are essentially unrelated. Scores between 0 and 1 indicate increasing similarity, and a score of 1 indicates the two words are identical or very, very similar semantically.

In the example above, the words returning and office have cosine similarity scores of roughly 0.95 (for both the skip-gram and CBOW models), indicating that these two words are quite semantically similar in the context of the document. Interestingly, the words remote and covid-19 have cosine similarity scores lower than -0.85 (for both the skip-gram and CBOW models); personally, I find this interesting, as the word remote seems semantically similar to covid-19 in the context of this document (to me, at least).

  • How is negative cosine similarity possible? Well, think of two vectors separated by an obtuse angle. When the angle between them is greater than 90 degrees, the cosine of that angle is negative. Word vectors pointing in roughly opposite directions behave the same way.

Thanks for reading,

Michael

Michael’s New Spinoff Blog!

Hello readers,

It’s Michael, and I’ve got a super cool blog announcement for you all! I know you all wanted some more neural network/AI content, but please hear me out here!

So as most of you know, I launched this blog on June 13, 2018, which means that I’ve been keeping up this blog for almost five years now. Through the course of writing 140 blog posts, I thought it would be time to expand the Michael’s Programming Bytes brand.

You may be wondering, how do I plan to expand my brand? Well, I figured that today, April 1, 2023, would be the perfect day to announce my very first spinoff blog: Michael’s Poetic Programming Bytes! Yes, I thought I’d combine all the fun of coding with the simple creative joy of poetry.

What can you, the readers, expect from my first foray into creative writing? Well, we’ve got odes to the joy of coding:

Oh, joy of programming, how sweet your song,
With every keystroke, we dance along.
A world of code at our fingertips,
Endless possibilities, endless trips.

With loops and functions, we craft our art,
A symphony of logic, a work of heart.
Each line a brushstroke, each file a canvas,
We paint our dreams with logic and balance.

The bugs may bite, the code may crash,
But we rise above and make a dash.
To debug and fix, to learn and grow,
And create a masterpiece, a thing of flow.

Oh, joy of programming, you bring us delight,
A world of creation, a world of might.
We tinker and play, we solve and code,
And with each project, we find our abode.

So let us embrace this world of code,
And let our imaginations be bestowed.
With the joy of programming, we can create,
A world of wonder, a world of fate.

Haikus for those who enjoy this form of poetry, like this one about COBOL:

Legacy language,
COBOL endures through the years,
Business still needs it.

Looking for a fun way to learn about basic coding concepts? Michael’s Poetic Programming Bytes has content for that, like this Shakespearean sonnet explaining the concept of object-oriented programming:

Oh, wondrous world of code, how strange thy ways,
Wherein doth dwell the art of programming,
And rules of logic shape the programmer's days,
While algorithms bring life to every thing.

And yet, there lies a path less traveled by,
A new approach to programming art sublime,
Where objects are the stars that light the sky,
And classes, their domains, rule space and time.

Behold, the paradigm that we call OOP,
Where objects are defined by their attributes,
And methods grant them power, this we do see,
As they perform actions and execute.

Thus, OOP is a world of boundless creation,
Where objects reign supreme, in this new nation.

Or, if you want something a little more modern, here’s a rap about why you shouldn’t use star imports in Python:

(Intro)
Listen up, y'all, it's time to learn
'Bout Python code and what to discern
We talkin' star imports, don't you know
Why you shouldn't use 'em, I'll tell you so

(Verse 1)
Star imports may seem like a good choice
But let me tell you, it's not worth the noise
You import everything, all at once
But that's a recipe for a coding dunce

'Cause when you import, you want to be clear
What you need, and what you don't want near
But with star imports, it's all or none
And that's not the way to get it done

(Hook)
So don't use star imports, take your time
Import what you need, it's not a crime
Cleaner code, that's what you'll get
And that's the key, you can bet

(Verse 2)
Now let's talk about naming collisions
With star imports, you risk some divisions
Two modules with the same name
Now you got a problem, that's not a game

You gotta qualify the name, make it clear
Which module you want, or it won't adhere
And if you got a lot of imports, it's a mess
So don't use star imports, that's my address

(Hook)
So don't use star imports, take your time
Import what you need, it's not a crime
Cleaner code, that's what you'll get
And that's the key, you can bet

(Bridge)
Now you know the deal, you gotta be wise
Don't take shortcuts, that's not the prize
Clean code is what we strive for
So don't use star imports, that's the score

(Hook)
So don't use star imports, take your time
Import what you need, it's not a crime
Cleaner code, that's what you'll get
And that's the key, you can bet

(Outro)
Yeah, that's right, don't use star imports
Import what you need, that's the report
Clean code, that's the goal
And with Python, you're in control!

I don’t know about you, but I think that’s a good hook!

Feeling a little nostalgic for your favorite old software? We’ve got poems for that too, like this one mourning the loss and remembering the life of Python 2 (which was retired in 2020):

Oh Python 2, we bid thee farewell
With heavy hearts and a mournful knell
For all the years we spent with thee
Were filled with joy and camaraderie

But now the time has come to part
And let Python 2 depart
For though we loved thee to the end
It's time for Python 3 to ascend

Oh how we'll miss thy simple ways
Thy elegance and thy concise phrase
But now we must embrace the new
And bid thee fond adieu

So let us honor thee this day
And remember all the fun we had along the way
We'll miss thee dearly, that much is true
But in our hearts, we'll always remember Python 2.

Michael’s Poetic Programming Bytes will not only allow you to mourn the losses of your favorite programming tools, but also any piece of software that was near and dear to your heart, like this farewell tribute to Club Penguin (which I never got into):

Dear Club Penguin, the time has come
To say goodbye, our hearts are numb
We'll miss the island and the snow
And all the friends we've come to know

For years we've waddled on your shores
Played mini-games and explored
Dressed up in our penguin clothes
And danced in clubs with puffle bows

You gave us endless hours of fun
And taught us lessons, one by one
To always be kind and to share
And help others, to show we care

We'll miss your parties and your quests
And all the joy that you expressed
But as we say our last goodbye
We'll keep your memory alive

So thank you Club Penguin, for it all
For the laughter and the thrill
We'll never forget the love we felt
For you, and always will.

Last but certainly not least, this blog will allow you to submit your own programming-themed poems that I’d love to showcase, such as this one on the joys of HTML written by a first-grader:

HTML, oh HTML,
You make websites so pretty,
With colors and pictures and text,
You make the internet look so witty!

I love how you make things bold,
And add links to click and see,
It's so cool to make a webpage,
With you, it's easy as can be!

Sometimes I forget a tag,
And my page looks kinda funny,
But I know I can always fix it,
And make my site look sunny!

HTML, oh HTML,
You're my favorite thing to code,
I'll keep making webpages with you,
And sharing them all over the globe!

Or this gem from a 60-year-old celebrating the life of dial-up Internet:

Dial-up, oh dial-up,
You were slow but steady,
You brought the world to our fingertips,
And made us feel so heady.

With your hissing and buzzing sounds,
And the beeps that signaled our connection,
We knew we were in for a wait,
But we savored every moment of your affection.

You were our gateway to the internet,
A time when things were simpler,
We couldn't stream or download much,
But you gave us the world on a platter.

I remember waiting for pages to load,
And watching as images took shape,
It was a different time, a slower time,
But one that we embraced.

Now we have fiber and Wi-Fi,
And we can stream without a hitch,
But I'll always remember dial-up,
And the way you made us rich.

Dial-up, oh dial-up,
You may be a relic of the past,
But your memory lives on in our hearts,
A time that will always last.

Wow-that’s a beautifully written tribute to the joys of dial-up internet (hey, I’m just 27 and still recall the dial-up days).

So if you’re ever in the mood to learn basic coding concepts or just pay homage to your favorite programming tools or software (ahem Club Penguin fans), then you’ll definitely enjoy Michael’s Poetic Programming Bytes with a whole library of poems that were certainly not spit out of an AI chatbot.

More details on this expansion of the Michael’s Programming Bytes brand to come, and also,

Will this be a new Michael’s Programming Bytes tradition? Time will tell.