R Analysis 4: Naive Bayes and Amazon Alexa


Hello everybody,

It’s Michael, and today’s post will be an R analysis (my fourth one overall). I’ll be going back to R lessons now, but for those who want more Java content, don’t worry, it will be back soon.

First of all, let’s upload and understand our data. Here is the file-amazon_alexa.


This dataset contains reviews for various Amazon Alexa products (such as the Echo Sub and Echo Dot) along with information about those reviews (such as star rating). Here’s a variable-by-variable breakdown of the data:

  • rating-the star rating the user gave the product, with 1 being the worst and 5 being the best
  • date-the date the review was posted; all of the reviews are from June or July 2018
  • variation-the exact product that is being reviewed, of which there are 16
  • verified_reviews-what the actual review said
  • feedback-whether the review was positive or negative-0 denotes negative reviews and 1 denotes positive reviews

In total, there are 3,150 reviews along with five variables describing each review (such as star rating, date posted, etc.).

Next, let’s convert feedback into a factor and create a missmap in order to see if we have any missing observations (remember to install the Amelia package):

According to the diagram, there are no missing observations, which is always a good thing.

Now here’s a table using the feedback variable that shows us how many positive (1) and negative (0) reviews are in the dataset:

According to the table, of the 3150 reviews, only 257 are negative, while 2893 are positive. I’m guessing this is because Amazon’s various Alexa products are widely praised (though there are always going to be people who weren’t impressed with the products).
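The steps so far can be sketched in base R. This is only an illustration on a four-row toy data frame: the object name reviews and its contents are assumptions, and colSums(is.na(...)) stands in for the missmap plot by giving the same information as counts.

```r
# Toy stand-in for the real file; only the column names come from the post.
reviews <- data.frame(
  rating = c(5, 1, 4, 5),
  verified_reviews = c("Love it", "Stopped working", "Works great", "Easy to use"),
  feedback = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Treat feedback as a category rather than a number.
reviews$feedback <- factor(reviews$feedback)

# Count missing values per column (a missmap shows the same thing as a plot).
missing_per_column <- colSums(is.na(reviews))
print(missing_per_column)

# Tally negative (0) and positive (1) reviews.
feedback_counts <- table(reviews$feedback)
print(feedback_counts)
```

On the real data, the same table() call is what would produce the 257 and 2893 counts discussed above.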

Let’s create some word-clouds next in order to analyze which words are common amongst positive and negative reviews. We’ll create two word-clouds, one for the positive reviews (feedback == 1) and another for the negative reviews (feedback == 0). Remember to install the wordcloud package:

  • Before creating the word-clouds, remember to subset the data. In this case, create subsets for positive (feedback == 1) and negative (feedback == 0) reviews.
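Here is a rough sketch of the subsetting step, plus the word counts that a word-cloud turns into font sizes. The toy reviews, the object names positive and negative, and the helper word_counts are all assumptions made for illustration; the counting is done in base R rather than with the wordcloud package.

```r
# Toy reviews; the real data frame has 3,150 rows.
reviews <- data.frame(
  verified_reviews = c("Love the Echo, great sound", "Stopped working, never again",
                       "Great and easy to use", "Doesn't work"),
  feedback = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# One subset per class.
positive <- subset(reviews, feedback == 1)
negative <- subset(reviews, feedback == 0)

# Hypothetical helper: count how often each lowercase word occurs.
word_counts <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[words != ""]
  sort(table(words), decreasing = TRUE)
}

positive_counts <- word_counts(positive$verified_reviews)
negative_counts <- word_counts(negative$verified_reviews)
print(head(positive_counts, 3))
```

With the wordcloud package installed, the names and values of a table like positive_counts could be passed as the words and freq arguments of wordcloud() to draw the picture.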

Here are the wordclouds for both the positive and negative reviews:

Some of the words that are common amongst both the positive and negative reviews include echo, Alexa, and work. This is likely because customers usually mention the name of the product in the review (most of which contain either Alexa or echo) in both positive and negative reviews. Customers’ reviews, whether positive or negative, are also likely to use the word work because customers usually mention whether or not their product worked (and if it did, how well it worked).

Some of the words that are common amongst positive reviews include great, love, can, like, and easy, which are all words that are commonly found in positive reviews (e.g. “The Echo is super easy to use. Love it will buy again!”). On the other hand, some of the words that are common amongst negative reviews include doesn't, didn't, stopped, and never, which you are very likely to find in critical reviews (e.g. “The Alexa speaker doesn’t work AT ALL! Never gonna buy again!”). An interesting observation is that the word refurbished is commonly found in negative reviews; this could be because many of the customers who were dissatisfied with their product had bought refurbished Alexa devices.

Next up, let’s clean up the data and prepare a corpus, which is the collection of text documents derived from the actual text of the reviews. Our corpus is then used to create a document term matrix (referred to in the program as dtm), which has one row per review and one column per word found in the reviews (remember to install the tm package):

You may recall from R Lesson 13: Naive Bayes Classification that there are four functions to include when creating the dtm. They are:

  • tolower-sets all words to lowercase
  • removeNumbers-removes any numbers present in the reviews (like years)
  • removePunctuation-removes any punctuation present in the reviews
  • stemming-simplifies analysis by combining similar words (such as verbs in different tenses and pairs of plural/singular nouns). For instance, the words “trying”, “tries”, “tried” would be combined into a single stem.

Remember to set each function to TRUE.
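To make the cleaning options concrete, here is a small base-R sketch of what lowercasing, number removal, and punctuation removal do before the word counts are laid out as a matrix. It is not the tm code from the analysis: the three toy reviews and the helper clean_text are assumptions, and stemming is left out because base R has no stemmer.

```r
reviews <- c("Love my Echo Dot!", "Echo stopped working in 2018.", "love it, LOVE it")

# Hypothetical helper: lowercase, then replace digits and punctuation with spaces.
clean_text <- function(text) {
  text <- tolower(text)
  text <- gsub("[0-9]+", " ", text)
  text <- gsub("[[:punct:]]+", " ", text)
  text
}

# Split each cleaned review into words and collect the overall vocabulary.
tokens <- strsplit(trimws(clean_text(reviews)), "\\s+")
vocabulary <- sort(unique(unlist(tokens)))

# Count every vocabulary word in every review (one column per review for now).
counts <- vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocabulary))),
  integer(length(vocabulary))
)

# Transpose so that rows are reviews and columns are words, like a dtm.
dtm <- t(counts)
dimnames(dtm) <- list(paste0("review", seq_along(reviews)), vocabulary)
print(dtm)
```

In tm, the equivalent switches are passed in the control list of DocumentTermMatrix() instead of being coded by hand.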

Next we’ll create training and testing labels for our dataset. I’ve probably said it before, but I always like to use the 75-25 split when it comes to splitting my data into training and testing sets. This means that I like to use 75% of my dataset to train and build the model before testing it on the other 25% of my dataset. Here’s how to split the data:

Using the 75-25 split, observations 1-2363 will be part of my trainLabels while observations 2364-3150 will be part of my testLabels. I included the feedback variable since you should always include the binary variable when creating your trainLabels and testLabels.
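A minimal sketch of the index arithmetic behind that split. The placeholder feedback vector is an assumption (the real labels come from the file), and ceiling() is chosen here only because it turns 2362.5 into the cut-off of 2363 used above.

```r
n_reviews <- 3150                       # number of rows reported in the post
n_train <- ceiling(0.75 * n_reviews)    # 2362.5 is rounded up to 2363

train_rows <- 1:n_train
test_rows <- (n_train + 1):n_reviews

# Placeholder labels; in the analysis this would be the feedback column.
feedback <- factor(rep(c(1, 0), length.out = n_reviews))
trainLabels <- feedback[train_rows]
testLabels <- feedback[test_rows]

length(trainLabels)  # 2363
length(testLabels)   # 787
```

A sequential split like this one relies on the rows not being sorted by feedback; otherwise the two sets would end up with very different class mixes.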

Now let’s make some tables analyzing the proportions of positive-to-negative reviews for both our trainLabels and testLabels. Remember to use the prop.table function:

According to the tables, the difference between the proportions of positive-to-negative reviews for our trainLabels and testLabels is rather small-1% to be exact.

Now let’s split our dtm into training and testing sets, using the same guidelines that were used to create trainLabels and testLabels:

To further tidy up the model, let’s only include words that appear at least 3 times:
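A minimal sketch of that filter, assuming a tiny hand-made count matrix in place of the real training dtm. The names dtmTrain, freqWords, and dtmTrainFreq are illustrative and may not match the original program.

```r
# Hypothetical training matrix: rows are reviews, columns are word counts.
dtmTrain <- matrix(
  c(1, 0, 1,
    1, 0, 0,
    2, 1, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("great", "refund", "sound"))
)

# Keep only the words whose total count across the training reviews is at least 3.
word_totals <- colSums(dtmTrain)
freqWords <- names(word_totals[word_totals >= 3])
dtmTrainFreq <- dtmTrain[, freqWords, drop = FALSE]
print(freqWords)
```

The same freqWords vector would then be used to pick columns from the testing matrix, so that both sets share one vocabulary. On a real tm matrix, findFreqTerms() can return the list of frequent words.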

Our DTM uses 1s and 0s depending on whether a certain word can be found in the review (1 means the word is present and 0 means the word isn’t present). Since Naive Bayes works with categorical features, 1 and 0 are converted into Yes and No. This conversion is applied to every column (hence MARGIN = 2):
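Here is a small sketch of that conversion on a toy count matrix. The function name convert_count and the matrix dtmCounts are assumptions; the point is only to show what MARGIN = 2 does.

```r
# Toy count matrix: three reviews, two words.
dtmCounts <- matrix(
  c(0, 2,
    1, 0,
    0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("great", "love"))
)

# Map any positive count to "Yes" and zero to "No".
convert_count <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# MARGIN = 2 calls convert_count once per column (MARGIN = 1 would go row by row).
dtmYesNo <- apply(dtmCounts, MARGIN = 2, convert_count)
print(dtmYesNo)
```

The result is a character matrix with the same shape as the input, which is what a categorical Naive Bayes model needs.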

Last but not least, it’s time to create the Naive Bayes classifier (remember to install the package e1071):

Now let’s test out the classifier on two words, great and disappoint:

Now, what does all this mean?

Well, the word great:

  • DOESN’T appear in 93% of negative reviews but DOES appear in the other 7% of negative reviews
  • DOESN’T appear in 77.1% of positive reviews but DOES appear in the other 22.9% of positive reviews

These results are a little surprising since I thought the word great would appear in a majority of positive reviews. Then again, great is more prevalent in positive reviews than in negative reviews, which isn’t surprising, since negative reviews aren’t likely to use the word great.

And the word disappoint:

  • DOESN’T appear in 93.5% of negative reviews but DOES appear in the other 6.5% of negative reviews
  • DOESN’T appear in 98.8% of positive reviews but DOES appear in the other 1.2% of positive reviews

Just as with great, these results are surprising in one way and unsurprising in another. The surprising thing is that disappoint only appears in 6.5% of negative reviews, when I thought (and probably you did too) that disappoint would be found in more negative reviews. Then again, disappoint is more common in negative reviews than in positive reviews, which isn’t surprising.
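Percentages like these are row-wise proportions: within each class, the share of reviews without and with the word. The sketch below reproduces that kind of table in base R on ten made-up reviews, so the numbers will not match the ones above; a fitted e1071 classifier keeps a comparable table for each word in its tables element.

```r
# Hypothetical training data: the class label and whether "great" occurs.
feedback <- factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1))
great <- factor(
  c("No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes"),
  levels = c("No", "Yes")
)

# margin = 1 divides each row by its total, giving P(word absent or present | class).
cond_table <- prop.table(table(feedback, great), margin = 1)
print(round(cond_table, 3))
```

In this toy case the word appears in 25% of negative and 50% of positive reviews, and each row sums to 100%, just like the pairs of percentages in the bullets above.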

Last but not least, let’s create a confusion matrix, which evaluates the performance of this classifier using our testing dataset (which is the variable test). Remember to install the gmodels package:

The confusion matrix contains 787 observations-the number of observations in our testing set. The column total for column 1 represents the number of correctly classified observations-729. The column total for column 0 represents the number of incorrectly classified observations-58. In other words, 729 reviews were correctly classified while 58 reviews were incorrectly classified. The overall accuracy of the classifier is 93% (729 of 787), which is excellent; even so, the misclassified reviews could complicate customer feedback analysis, in which case a more sophisticated model would be needed.
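As a general sketch of how accuracy can be read off a confusion matrix, the snippet below uses made-up labels (not the testing set from this analysis) and base R's table() in place of gmodels. Correct predictions are the diagonal cells, where the actual and predicted classes agree; row and column totals only count how many reviews fall into each actual or predicted class.

```r
# Hypothetical actual and predicted labels for ten test reviews.
actual    <- factor(c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 1, 1, 1, 0, 0, 1, 1), levels = c(0, 1))

# Rows are actual classes, columns are predicted classes.
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Correct predictions sit on the diagonal (actual equals predicted).
n_correct <- sum(diag(confusion))
accuracy <- n_correct / sum(confusion)
print(accuracy)
```

Here 7 of the 10 toy reviews land on the diagonal, so the accuracy is 70%.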

Thanks for reading.

Michael

 

R Lesson 13: Naive Bayes Classification


Hello everybody,

It’s Michael, and today’s lesson will be on Naive Bayes classification in R.

But first, what exactly is Naive Bayes classification? It’s a simple probability mechanism based on Bayes’ Theorem.

OK, so what is Bayes’ Theorem? Well, let me give you a little math and history lesson. The theorem was devised sometime in the 18th century by a guy named Reverend Thomas Bayes. Bayes’ Theorem essentially describes the likelihood that an event will occur, based on prior knowledge of conditions that might be related to the event. For instance, if the likelihood of getting into a car accident was based on how much driving experience someone had, then, with Bayes’ Theorem, we can more accurately assess the likelihood that someone will get into a car accident based on the amount of driving experience they have, as opposed to trying to figure out the chances that someone will get into a car accident without knowledge of the person’s driving experience.

Here’s a mathematical representation of Bayes’ Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

Here’s an explanation of each part in the equation:

  • P(A|B)-The conditional probability that event A will occur given that event B has occurred
  • P(B|A)-The conditional probability that event B will occur given that event A has occurred
  • P(A)-The probability that event A will occur, without taking event B into account
  • P(B)-The probability that event B will occur, without taking event A into account

Here’s an example. Let’s say there’s a 55% chance that the Miami Heat will win their game on 3/30/19 against the NY Knicks. If they win that game, then there is a 30% chance the Miami Heat will win their game on 4/1/19 against the Boston Celtics. However, if the Heat don’t beat the Knicks, then there is only a 25% chance that they will beat the Celtics. (Keep in mind I made up all of these odds)

So let’s say “Heat beat Knicks” is event A, and “Heat beat Celtics” is event B. Here’s a probability breakdown of all possible outcomes:

  • P(A)-55%
  • P(A’)-45%
    • the apostrophe right by the A means NOT, as in the likelihood event A will NOT happen (or in this case, the odds that the Heat will lose to the Knicks)
  • P(B|A)-30%
  • P(B’|A)-70%
    • the odds that the Heat will NOT beat the Celtics if they beat the Knicks
  • P(B|A’)-25%
  • P(B’|A’)-75%
    • the odds that the Heat will ALSO lose to the Celtics if they lose to the Knicks

The first thing we’d calculate is the probability of the Heat beating the Celtics, which would be the sum of the products of the probabilities for the two ways that can happen: “beating the Knicks, then beating the Celtics” and “losing to the Knicks, then beating the Celtics”.

Here’s what I mean by “sum of the products of the probability”:

(.55*.3)+(.45*.25)=27.75%

So there is a 27.75% chance the Heat will beat the Celtics on 4/1/19.

If we were wondering what are the odds that the Heat beat the Knicks if they beat the Celtics, we would use Bayes’ Theorem:

(.3*.55)/(.2775)=59.5%

So assuming the Heat beat the Celtics, then there is a 59.5% chance that they also beat the Knicks. By the way, the .2775 would be P(B), or the probability that event B will occur, which I calculated in the previous problem.
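For anyone who wants to check the arithmetic, here is a short sketch in R. It uses the made-up probabilities listed earlier (P(A) = 0.55, P(B|A) = 0.30, P(B|A’) = 0.25); the variable names are my own. P(B) is computed by weighting each conditional probability by the chance of its condition, and Bayes’ Theorem is then applied.

```r
p_a <- 0.55              # P(A): Heat beat the Knicks
p_b_given_a <- 0.30      # P(B|A): Heat beat the Celtics after beating the Knicks
p_b_given_not_a <- 0.25  # P(B|A'): Heat beat the Celtics after losing to the Knicks

# Law of total probability: P(B) = P(B|A) * P(A) + P(B|A') * P(A').
p_b <- p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B).
p_a_given_b <- p_b_given_a * p_a / p_b

round(p_b, 4)          # 0.2775
round(p_a_given_b, 3)  # 0.595
```

The second result is higher than P(A) itself because, with these odds, a win over the Celtics is a little more likely after a win over the Knicks than after a loss.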

Now, how exactly does Bayes’ Theorem relate to Naive Bayes classification? Naive Bayes is a collection of probability-based classification algorithms based on Bayes’ Theorem; it’s not a single algorithm but several algorithms that share a common principle-that, within a given class, every feature being classified is independent of every other feature.

For instance, let’s say we were trying to classify vegetables based on their features. Also, let’s assume that a vegetable that is green, long, and comes in an arch shape is a piece of celery. A Naive Bayes classifier considers that each of these three aforementioned features will contribute independently to the likelihood that a certain vegetable is a piece of celery, regardless of any commonalities between these features. However, classification features do sometimes correlate with each other. This is a disadvantage of using Naive Bayes classification, as this method makes strong assumptions regarding the independence between features (the strong assumptions regarding independence among features are why this classification method is referred to as “Naive Bayes”).

The whole point of Naive Bayes classification is to allow us to predict a class given a set of features. To use another vegetable example, let’s say we could predict whether a vegetable is a carrot, piece of celery, or corn cob (the class) based on its taste, shape, etc. (the features).

Even though Naive Bayes is a relatively simple algorithm, it can hold its own against more complex classification algorithms, especially on text. One real-world use of this algorithm is spam detection (as in e-mail spam detection).

Ok, now that I’ve got that explanation out of the way, it’s time to do some Naive Bayes in R. The dataset I will be using is Youtube04-Eminem, which gives 453 random comments on Eminem’s “Love The Way You Lie” video (ft. Rihanna). With this dataset, I will show you guys how to do spam detection with Naive Bayes using R by detecting the number of spam and non-spam comments on this video (after all, Naive Bayes works for any type of spam detection, not just email spam).

  • But first, I wanted to acknowledge that I got this dataset from UCI Machine Learning Repository. Just like Kaggle, this website contains several datasets covering a wide variety of topics (sports, business, etc.) that you can use for analytical projects; here, you can find datasets ranging from the late 1980s to the present year (a lot better than the archaic datasets R offers for free). The website is maintained by the University of California-Irvine Center for Machine Learning and Intelligent Systems.

Now let’s get started. But first, as with any R analysis, let’s try to understand our data:

This dataset contains 453 comments and 5 different variables related to the comments. Here’s what each variable means:

  • COMMENT_ID-YouTube’s alphanumeric comment ID for each comment
  • AUTHOR-The YouTube username of the person who wrote the comment; there are only 396 unique usernames because some people commented more than once on this video
  • DATE-The date and time a comment was posted; the T in each value marks where the time of day begins
    • Fewer than half of the comments have a corresponding date
  • CONTENT-The content of a comment; there are duplicate comments
  • CLASS-Whether or not a comment might be spam; 0 denotes non-spam and 1 denotes spam

So how exactly does Naive Bayes factor into all of this? In this case, Naive Bayes will calculate the likelihood that a comment is spam based on the words it contains. The strong assumptions about independence that are associated with Naive Bayes will come into play here, since the algorithm will assume that the probability of a certain word being found in a spam comment (e.g. Instagram) will be independent of the probability of another word being found in a spam comment (e.g. likes).
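To see what the independence assumption buys us, here is a tiny sketch of the calculation for a comment containing both Instagram and likes. Every number in it is made up for illustration (none comes from the dataset), and the object names are my own.

```r
# Made-up numbers for illustration only; none come from the dataset.
prior <- c(spam = 0.5, nonspam = 0.5)

# P(word appears | class) for two words, one row per class.
likelihood <- rbind(
  spam    = c(instagram = 0.20, likes = 0.30),
  nonspam = c(instagram = 0.01, likes = 0.05)
)

# Naive assumption: multiply the per-word likelihoods as if the words were independent.
joint <- prior * apply(likelihood, 1, prod)

# Normalize so the two class scores sum to 1.
posterior <- joint / sum(joint)
print(round(posterior, 3))
```

Because the words are treated as independent, the classifier only ever needs one small table per word, rather than a table for every possible combination of words.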

Next we create a table using the CLASS variable to show the percentage of spam and non-spam comments (denoted by 1 and 0 respectively):

According to the table, 45.9% of our comments are non-spam (0) while 54.1% are spam (1).

Now, let’s create subsets of spam and non-spam comments:

In these subsets, I wanted to include the CONTENT (as in the actual comment) and the value of the CLASS variable (1 for spam and 0 for non-spam).

Now let’s make some wordclouds, which can help us visualize words that frequently occur in spam and non-spam comments (remember to install the wordcloud package). We will make two wordclouds, one for spam comments and the other for non-spam comments. Remember to use the subset variables that I just mentioned (spam and nonspam) when creating the wordcloud.

The basic idea of wordclouds is that the bigger a word appears on the wordcloud, the more common that word is in a certain category (spam or non-spam comments).

As you can see, some of the most common words in the spam category include check, please, channel and subscribe. This is not surprising, as YouTube spammers often post something along the lines of “Hey guys please check out and subscribe to my channel: <link to spam channel>” in the comments section.

In the non-spam category, some of the most common words include Eminem, love, song, and Megan. This is also not surprising, since many people who leave comments on a music video would say how much they love the song and/or the artist (Eminem), often mentioning the artist’s name in the comments; the music video is for a song called “Love The Way You Lie”, which could be another reason why the word “love” is one of the most common amongst non-spam comments. Megan is also another one of the most common words amongst non-spam comments; this is likely because Megan Fox appears in the video.

One interesting observation is that the names of the two singers-Eminem and Rihanna-appear in both the spam and non-spam wordclouds. I think it’s interesting because I didn’t think spammers would mention the names of the artists in the comments section. However, keep in mind that Eminem and Rihanna are a lot less commonly mentioned in spam comments than they are in non-spam comments.

Now I’ll show you how to prepare the data for statistical analysis. But first we must create a corpus, which is a collection of documents from the text in our file; use the CONTENT variable as it contains the comments themselves. Remember to install the tm package:

In case you are wondering what print(fileCorpus) does, it just prints out all the text of the comments; duplicate comments in our dataset are only printed once.

Now we have to make a document term matrix (or dtm) from the corpus; in our dtm, the comments themselves are shown in rows while the words that occur in the comments are shown in the columns:

In order to prepare our dtm, we must first clean our data by making all words lowercase, removing any numbers and punctuation (and presumably emojis) that are in the comments, and stemming all of our words. Stemming removes the suffix from words, which simplifies the analysis since closely related word forms (such as verbs with different tenses and plural nouns) are combined into one. For instance, “driving”, “drives” and “driven” would be converted into “drive”.

Here is the structure of our dtm, though it isn’t too relevant with regards to our analysis:

Now we must split our data into training and testing datasets. There isn’t a single correct training-to-testing split, but I think 75-25 is ideal; this means we should use 75% of our data to build and train our model which we will then test on the other 25% of the dataset.  First we should split the file, then split the dtm:

In this case, observations 1-340 (for both the file and dtm) will be part of the training set while observations 341-453 (for both the file and dtm) will be part of the testing set. And yes, I had to round here to get the stopping point for my training set, since 75% of 453 is 339.75.

As you can see, the spam-to-nonspam proportions in our training set are different from those in our testing set. In our training set, the spam-to-nonspam proportions are 55.6%-44.4%. In our testing set, the spam-to-nonspam proportions are 49.6%-50.4%.
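Proportions like these come from wrapping table() in prop.table(). The sketch below shows the pattern on a made-up vector of 20 labels, so its output will not match the percentages above; the names CLASS, trainLabels, and testLabels follow the post, but the data is an assumption.

```r
# Hypothetical CLASS labels (1 = spam, 0 = non-spam) for a 20-comment file.
CLASS <- c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0)

# First 75% of the rows for training, the rest for testing.
n_train <- ceiling(0.75 * length(CLASS))
trainLabels <- CLASS[1:n_train]
testLabels <- CLASS[(n_train + 1):length(CLASS)]

# prop.table() turns the counts from table() into proportions that sum to 1.
train_props <- prop.table(table(trainLabels))
test_props <- prop.table(table(testLabels))
print(round(train_props, 3))
print(round(test_props, 3))
```

In this toy case the training labels are about two-thirds spam while the testing labels are only one-fifth spam, which is the kind of gap worth checking for before trusting the test results.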

Now we must further clean our data by removing infrequent words from dtmTrain (the training dataset derived from our document term matrix), as they are unlikely to be useful in our analysis. I will only include words that are used at least 3 times:

The document term matrix uses 1s and 0s to indicate whether or not a word appears in a comment; 1 indicates an appearance and 0 indicates no appearance. Since Naive Bayes works with categorical features, these values are converted into Yes and No, and the conversion is applied to every column (hence the MARGIN = 2).

Now let’s create our Naive Bayes classifier. Remember to install the package e1071:

  • Also remember to use training, not testing datasets!

Now let’s see how our classifier works, using the word check (which is commonly found in comments like “Plz check out my channel”) as an example. Using our trainLabels, the output for the word check is displayed:

In this table, the likelihood that the word check will occur in a spam and non-spam comment is displayed. Here’s a breakdown as to what this table means:

  • There is a 100% chance that the word check will NOT be found in a non-spam comment but only a 33.9% chance that check will NOT be found in a spam comment.
  • On the other hand, there is a 0% chance that the word check will be found in a non-spam comment but a 66.1% chance that check will be found in a spam comment.

To evaluate the accuracy of our Naive Bayes classifier, we create a confusion matrix using our testing set. Remember to install the gmodels package:

Our testing set has 113 observations; the number of comments that are correctly and incorrectly classified is shown in the matrix above. The sum of the numbers in 0A-0P and 1A-1P (A means actual and P means predicted) represents the number of correctly classified comments-110 (56+54). On the other hand, the sum of the numbers in 0A-1P and 1A-0P represents the number of incorrectly classified comments-3 (2+1). Since only 3 of our 113 comments were misclassified, our Naive Bayes classifier has fantastic accuracy (97.3%).

Thanks for reading,

Michael