Python Lesson 41: Word2Vec (NLP pt.7/AI pt.7)


Hello everybody,

Michael here, and in today’s lesson, we’ll discover another AI NLP algorithm: word2vec (recall that a few posts back, in Python Lesson 40: The NLP Bag-Of-Words (NLP pt. 6/AI pt.5), we discussed the bag-of-words algorithm).

What is the word2vec algorithm?

So, how does the word2vec algorithm work? Let’s say we were to process a text document using this algorithm. The word2vec algorithm would analyze each word in this document along with all other words commonly found near a certain word. For instance, if the word “family” was found in a document and the words “daughter”, “parents”, and “generations” were found near it, the word “family” would be grouped with these words, as they are all related to each other.

If you were to ask your program for the three most similar words to “family” (according to our hypothetical document), it would say “daughter”, “parents” and “generations”. How would the word2vec algorithm find the most similar words? By using a mathematical metric called cosine similarity; more on that later.

Preparing the data

So, before we dive into the magic of word2vec, let’s gather the data we’re going to use.

In this example, we’ll utilize this ChatGPT-generated essay below that contains five paragraphs on a very relevant topic in the year 2023-returning to the office-including both sides of the debate.

So, let’s open up our Python IDEs and start coding!

First, let’s open and read this file into our IDE:

with open(r'C:\Users\mof39\OneDrive\Documents\return to office.txt', 'r', encoding='utf-8') as file:
    word2VecFile = file.read()
    print(word2VecFile)

In Favor of Returning to Office:

Many argue that returning to the office is crucial for maintaining the productivity and collaboration necessary for businesses to thrive. Working in the same space allows for easier communication, quicker decision-making, and better team building. It also provides opportunities for social interactions that can improve employee morale and job satisfaction.

Another benefit of returning to the office is the separation of work and home life. With many employees working remotely during the pandemic, the lines between work and personal time have become blurred. This can lead to burnout and other negative consequences. By returning to the office, employees can more easily maintain a work-life balance and avoid the mental exhaustion that comes with always being “on.”

Against Returning to Office:

On the other hand, many people argue that remote work has proven to be effective, and there is no need to return to the office. In fact, studies have shown that remote workers are often more productive than those who work in the office. Additionally, remote work allows for more flexibility in scheduling and reduces the amount of time and money spent commuting.

Another important consideration is the health and safety of employees. With the ongoing threat of COVID-19 and the emergence of new variants, returning to the office could put employees at risk. Some may not feel comfortable being in close proximity to their colleagues or may have concerns about the effectiveness of safety protocols. In these cases, continuing to work remotely may be the best option.

Ultimately, the decision to return to the office or continue remote work will depend on a variety of factors, including the type of work being done, the needs of the business, and the preferences of employees. It is important to consider all perspectives and prioritize the health and safety of everyone involved.
  • You can technically open a .txt file with the pd.read_csv method, but I prefer using the with open(...) method of reading .txt files, since pd.read_csv outputs the text file in tabular form, which isn’t going to work for this tutorial.
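If you don’t have a text file handy, one quick way to follow along is to paste the essay into a string directly; the variable name below matches the one used above, and the string shown is just the opening of the essay for illustration:

```python
# Paste the essay text directly instead of reading it from disk;
# the rest of the essay from the post would continue in this string
word2VecFile = (
    "In Favor of Returning to Office:\n\n"
    "Many argue that returning to the office is crucial for maintaining "
    "the productivity and collaboration necessary for businesses to thrive."
)
print(word2VecFile)
```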

And now let’s create the word2vec model

Now that we’ve read our text file into the system, the next thing we’ll do is create our word2vec model. Here’s the code to do so:

First, let’s gather our list of tokens for the analysis:

import nltk
import gensim

modelData = []
punctuation = [',', '.', '”', '“', ':']

for w in nltk.sent_tokenize(word2VecFile):
    tokens = []
    
    for t in nltk.word_tokenize(w):
        if t not in punctuation:
            tokens.append(t.lower())
        
    modelData.append(tokens)
    
print(modelData)

[['in', 'favor', 'of', 'returning', 'to', 'office', 'many', 'argue', 'that', 'returning', 'to', 'the', 'office', 'is', 'crucial', 'for', 'maintaining', 'the', 'productivity', 'and', 'collaboration', 'necessary', 'for', 'businesses', 'to', 'thrive'], ['working', 'in', 'the', 'same', 'space', 'allows', 'for', 'easier', 'communication', 'quicker', 'decision-making', 'and', 'better', 'team', 'building'], ['it', 'also', 'provides', 'opportunities', 'for', 'social', 'interactions', 'that', 'can', 'improve', 'employee', 'morale', 'and', 'job', 'satisfaction'], ['another', 'benefit', 'of', 'returning', 'to', 'the', 'office', 'is', 'the', 'separation', 'of', 'work', 'and', 'home', 'life'], ['with', 'many', 'employees', 'working', 'remotely', 'during', 'the', 'pandemic', 'the', 'lines', 'between', 'work', 'and', 'personal', 'time', 'have', 'become', 'blurred'], ['this', 'can', 'lead', 'to', 'burnout', 'and', 'other', 'negative', 'consequences'], ['by', 'returning', 'to', 'the', 'office', 'employees', 'can', 'more', 'easily', 'maintain', 'a', 'work-life', 'balance', 'and', 'avoid', 'the', 'mental', 'exhaustion', 'that', 'comes', 'with', 'always', 'being', 'on.', 'against', 'returning', 'to', 'office', 'on', 'the', 'other', 'hand', 'many', 'people', 'argue', 'that', 'remote', 'work', 'has', 'proven', 'to', 'be', 'effective', 'and', 'there', 'is', 'no', 'need', 'to', 'return', 'to', 'the', 'office'], ['in', 'fact', 'studies', 'have', 'shown', 'that', 'remote', 'workers', 'are', 'often', 'more', 'productive', 'than', 'those', 'who', 'work', 'in', 'the', 'office'], ['additionally', 'remote', 'work', 'allows', 'for', 'more', 'flexibility', 'in', 'scheduling', 'and', 'reduces', 'the', 'amount', 'of', 'time', 'and', 'money', 'spent', 'commuting'], ['another', 'important', 'consideration', 'is', 'the', 'health', 'and', 'safety', 'of', 'employees'], ['with', 'the', 'ongoing', 'threat', 'of', 'covid-19', 'and', 'the', 'emergence', 'of', 'new', 'variants', 'returning', 'to', 'the', 
'office', 'could', 'put', 'employees', 'at', 'risk'], ['some', 'may', 'not', 'feel', 'comfortable', 'being', 'in', 'close', 'proximity', 'to', 'their', 'colleagues', 'or', 'may', 'have', 'concerns', 'about', 'the', 'effectiveness', 'of', 'safety', 'protocols'], ['in', 'these', 'cases', 'continuing', 'to', 'work', 'remotely', 'may', 'be', 'the', 'best', 'option'], ['ultimately', 'the', 'decision', 'to', 'return', 'to', 'the', 'office', 'or', 'continue', 'remote', 'work', 'will', 'depend', 'on', 'a', 'variety', 'of', 'factors', 'including', 'the', 'type', 'of', 'work', 'being', 'done', 'the', 'needs', 'of', 'the', 'business', 'and', 'the', 'preferences', 'of', 'employees'], ['it', 'is', 'important', 'to', 'consider', 'all', 'perspectives', 'and', 'prioritize', 'the', 'health', 'and', 'safety', 'of', 'everyone', 'involved']]

Before creating our tokens lists to use for our model, I first imported the nltk and gensim packages (as always, if there’s a required package you don’t have, pip install it!).

I then created the modelData list, along with a fresh tokens list for each sentence. I sentence-tokenized the file before word-tokenizing each sentence, appending most of the tokens to the tokens list before appending each tokens list to the modelData list. I say most (not all) of the tokens because I excluded any tokens found in the punctuation list.

Now, you may be wondering if we should remove stopwords here like we’ve done for our previous NLP tutorials. Normally, I’d say yes, but since the text is so short and we’re trying to figure out word connections, I’d say keep the stopwords here to get more accurate word2vec results.
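If you did want to drop stopwords (say, for a longer document), you could filter them inside the same tokenizing loop. Here’s a sketch using a small hand-picked stopword list; nltk also ships a fuller list in nltk.corpus.stopwords if you’ve downloaded it:

```python
# A small, hand-picked stopword list for illustration;
# nltk.corpus.stopwords provides a more complete one
stopwords = {'the', 'a', 'of', 'to', 'in', 'and', 'is', 'that', 'for'}

tokens = ['returning', 'to', 'the', 'office', 'is', 'crucial', 'for', 'businesses']
filtered = [t for t in tokens if t not in stopwords]
print(filtered)  # ['returning', 'office', 'crucial', 'businesses']
```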

Now that we’ve got our tokens list (or should I say lists-within-a-list), it’s time to build our model!

Rather, I should say model(s), since there are two approaches we can take to build our word2vec analysis: skip-grams and continuous bag-of-words.

Skip-gram

The first word2vec approach we’ll explore is the skip-gram model.

What exactly is the skip-gram model, though? Using our text file as an example, let’s say we were trying to guess which words we’d commonly see before and after the word “working”. The skip-gram model would analyze our text file and predict the words most likely to appear before and after “working”, such as “remotely” or “hardly”.

Now, how do we implement this model in Python? Take a look at the code below:

skipGram = gensim.models.Word2Vec(modelData, min_count = 1, vector_size = 2, window = 5, sg = 1)

What does each of these parameters mean? Let me explain:

  • min_count: sets the minimum number of times a word must appear in a document to be factored into the skip-gram analysis; in this example, a word must appear in the document at least once to be factored into the skip-gram analysis
  • vector_size: sets the number of dimensions that each word vector will contain; in this example, each word vector will contain 2 dimensions
  • window: sets the maximum distance between current and predicted words in the document; this skip-gram analysis will look at the five words preceding and following any given word to determine cosine similarity (more on that later)
  • sg: if this value is 1, use the skip-gram analysis; if this value is 0, use the continuous bag-of-words analysis (more on that later)

Now that we have our skip-gram word2vec model set up, let’s test it on three word pairs:

print("Cosine similarity between 'returning' and 'office' - Skip Gram : ", skipGram.wv.similarity('returning', 'office'))     
print("Cosine similarity between 'remote' and 'covid-19' - Skip Gram : ", skipGram.wv.similarity('remote', 'covid-19'))
print("Cosine similarity between 'reduces' and 'commuting' - Skip Gram : ", skipGram.wv.similarity('reduces', 'commuting'))

Cosine similarity between 'returning' and 'office' - Skip Gram :  0.9542125
Cosine similarity between 'remote' and 'covid-19' - Skip Gram :  -0.8962952
Cosine similarity between 'reduces' and 'commuting' - Skip Gram :  0.9316714

In this example, we are analyzing the cosine similarity between three word pairs: returning & office, remote & covid-19, and reduces & commuting. As you can see from the output, two of the word pairs have positive cosine similarity while the other word pair has negative cosine similarity.

What does positive and negative cosine similarity mean? Well, the higher the positive cosine similarity, the more semantically similar the two words are to each other (in the context of the document being analyzed). The lower the negative cosine similarity, the more dissimilar the two words are to each other (again, in the context of the document being analyzed).

  • Just a tip: using a vector_size of 2 is often not ideal, especially because most NLP analyses work with larger documents. I used a vector_size of 2 here because the document we’re working with is rather small.
  • A vector_size of 0 or 1 won’t work for either the skip-gram or continuous bag-of-words model; with one dimension, every cosine similarity comes out as 1, -1, or 0, which isn’t useful for your analysis.
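Under the hood, wv.similarity is just the cosine of the angle between the two word vectors. Here’s a sketch of that computation with NumPy on two made-up 2-dimensional vectors (matching our vector_size of 2); the actual values in skipGram.wv['returning'] and skipGram.wv['office'] will differ from run to run:

```python
import numpy as np

# Two made-up 2-dimensional word vectors, standing in for what
# skipGram.wv['returning'] and skipGram.wv['office'] might hold
a = np.array([0.8, 0.3])
b = np.array([0.6, 0.4])

# Cosine similarity: dot product divided by the product of the magnitudes
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)
```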

Continuous bag-of-words

Now that we’ve explored the skip-gram model, let’s now analyze the continuous bag-of-words model.

What is the continuous bag-of-words model? Well, like the skip-gram model, the continuous bag-of-words model is like a word2vec guessing game. However, while the skip-gram model attempts to predict words that will come before and after a given word, the continuous bag-of-words model will analyze a sentence in a document and try to predict what any given word in a sentence would be based on the surrounding words in the sentence.

Here’s a simple example:

The happy couple decided to close on their first _____ yesterday, eagerly anticipating all the memories they would make there. 

Just from this sentence alone, what do you think the missing word would be? If you guessed house, you’d be right! In this example, the continuous bag-of-words model would look at each word in the sentence and based on the given words, guess the missing word.

Now let’s see how to implement the CBOW (continuous bag-of-words) model in Python.

CBOW = gensim.models.Word2Vec(modelData, min_count = 1, vector_size = 2, window = 5, sg = 0)

Now, here’s the best part about the CBOW model: it’s set up in exactly the same way, with the same parameters, as the skip-gram model. The only difference between the two models is that you’d need to set the sg parameter to 0 to indicate that you’d like to use the CBOW model.

Now let’s test our CBOW word2vec model on the same three word pairs we used for the skip-gram model.

print("Cosine similarity between 'returning' and 'office' - CBOW : ", CBOW.wv.similarity('returning', 'office'))     
print("Cosine similarity between 'remote' and 'covid-19' - CBOW : ", CBOW.wv.similarity('remote', 'covid-19'))
print("Cosine similarity between 'reduces' and 'commuting' - CBOW : ", CBOW.wv.similarity('reduces', 'commuting'))

Cosine similarity between 'returning' and 'office' - CBOW :  0.95738184
Cosine similarity between 'remote' and 'covid-19' - CBOW :  -0.9320767
Cosine similarity between 'reduces' and 'commuting' - CBOW :  0.94031644

Having changed the word2vec model while leaving all other parameters unchanged, we can see that the cosine similarity scores for the three word pairs differ only slightly from the skip-gram scores for the same pairs.

Cosine similarity explained

So, now that I’ve shown you all what cosine similarity looks like, it’s time to explain it further.

In the word2vec algorithm (for both the skip-gram and CBOW models), each word in the document is treated like a vector. Still unsure about the whole vector thing? Here’s an illustration that might help:

Imagine plotting the words returning, office, and working as vectors: arrows pointing out from the same origin. The angle between any two of these arrows tells you how related the words are.

Let’s say you wanted to find the cosine similarity between the words returning and office. If you know even basic trigonometry, you’ll know that the cosine of a small angle is close to 1, while the cosine of a larger angle shrinks toward 0 and eventually goes negative. The cosine of the angle between the two word vectors is the cosine similarity between those two words.

In word2vec, cosine similarity scores can range from -1 to 1. Here’s an illustration on cosine similarity scores:

In simple terms, a word pair with a cosine similarity score of 0 indicates unrelated words, and a score of -1 indicates maximally dissimilar words (in the context of the document being analyzed). A word pair with a cosine similarity score greater than -1 but less than 0 indicates that the two words are somewhat, but not entirely, dissimilar. Finally, a word pair with a cosine similarity score of 1 (or close to it) indicates that the two words are identical or very, very similar semantically.
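You can see the extremes of that scale directly with a quick NumPy check: vectors pointing the same way give 1, perpendicular vectors give 0, and opposite vectors give -1:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

same = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))    # parallel
perp = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))    # perpendicular
opp = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))   # opposite

print(same, perp, opp)  # approximately 1.0, 0.0, -1.0
```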

In the example above, the words returning and office have cosine similarity scores of roughly 0.95 (for both the skip-gram and CBOW models), indicating that these two words are quite semantically similar in the context of the document. Interestingly, the words remote and covid-19 have cosine similarity scores lower than -0.85 (for both the skip-gram and CBOW models). Personally, I find this surprising, as the word remote seems semantically similar to the word covid-19 in the context of this document (to me, at least).

  • How is negative cosine similarity possible? Well, think of an obtuse triangle. In an obtuse triangle, one angle is greater than 90 degrees, and the cosine of any angle greater than 90 degrees is negative. Think of NLP cosine similarity the same way: when the angle between two word vectors exceeds 90 degrees, their cosine similarity goes negative.

Thanks for reading,

Michael
