R Analysis 4: Naive Bayes and Amazon Alexa

Hello everybody,

It’s Michael, and today’s post will be an R analysis (my fourth one overall). I’ll be going back to R lessons now, but for those who want more Java content, don’t worry, it will be back soon.

First of all, let’s load and understand our data. Here is the file, amazon_alexa:

Screen Shot 2019-05-04 at 12.46.49 PM

This dataset contains reviews for various Amazon Alexa products (such as the Echo Sub and Echo Dot) along with information about those reviews (such as star rating). Here’s a variable-by-variable breakdown of the data:

  • rating: the star rating the user gave the product, with 1 being the worst and 5 being the best
  • date: the date the review was posted; all of the reviews are from June or July 2018
  • variation: the exact product being reviewed, of which there are 16
  • verified_reviews: the actual text of the review
  • feedback: whether the review was positive or negative; 0 denotes negative reviews and 1 denotes positive reviews

In total, there are 3,150 reviews along with five pieces of information about each review (such as star rating, date posted, etc.).

Next, let’s convert feedback into a factor and create a missmap in order to see if we have any missing observations (remember to install the Amelia package):

Screen Shot 2019-05-04 at 1.07.54 PM

Screen Shot 2019-05-04 at 1.07.39 PM

Screen Shot 2019-05-04 at 1.07.21 PM
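Since the code above appears only as screenshots, here is a sketch of what it likely looks like. The data frame name alexa and the read step (the Kaggle file is tab-separated) are my assumptions:

```r
library(Amelia)

# Read the reviews; adjust sep/quote if your file differs
alexa <- read.csv("amazon_alexa.tsv", sep = "\t",
                  quote = "", stringsAsFactors = FALSE)

# Convert feedback into a factor so R treats it as categorical
alexa$feedback <- as.factor(alexa$feedback)

# Plot a missingness map; a map with no highlighted cells
# means there are no missing observations
missmap(alexa, main = "Missing values - Amazon Alexa reviews")
```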

According to the diagram, there are no missing observations, which is always a good thing.

Now here’s a table using the feedback variable that shows us how many positive (1) and negative (0) reviews are in the dataset:

Screen Shot 2019-05-04 at 1.15.30 PM
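The table itself takes one line (assuming the data frame is called alexa, as in the earlier sketch):

```r
# Tally negative (0) vs. positive (1) reviews
table(alexa$feedback)
```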

According to the table, of the 3,150 reviews, only 257 are negative, while 2,893 are positive. I’m guessing this is because Amazon’s various Alexa products are widely praised (though there are always going to be people who weren’t impressed with the products).

Let’s create some word-clouds next in order to analyze which words are common amongst positive and negative reviews. We’ll create two word-clouds, one for the positive reviews (feedback == 1) and another for the negative reviews (feedback == 0). Remember to install the wordcloud package:

Screen Shot 2019-05-05 at 3.10.52 PM

  • Before creating the word-clouds, remember to subset the data. In this case, create subsets for positive (feedback == 1) and negative (feedback == 0) reviews.
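The subsetting and word-cloud calls might look like this (object names are my assumptions):

```r
library(wordcloud)

# Subset the reviews by sentiment
positive <- subset(alexa, feedback == 1)
negative <- subset(alexa, feedback == 0)

# wordcloud() accepts raw text directly and removes
# common English stop words before plotting
wordcloud(positive$verified_reviews, max.words = 100,
          scale = c(3, 0.5), colors = "darkgreen")
wordcloud(negative$verified_reviews, max.words = 100,
          scale = c(3, 0.5), colors = "darkred")
```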

Here are the word-clouds for both the positive and negative reviews:

Screen Shot 2019-05-05 at 3.10.33 PM

Screen Shot 2019-05-05 at 10.15.08 PM

Screen Shot 2019-05-05 at 3.09.39 PM

Screen Shot 2019-05-05 at 10.12.27 PM

Some of the words that are common amongst both the positive and negative reviews include echo, Alexa, and work. This is likely because customers usually mention the name of the product in the review (most of which contain either Alexa or echo) in both positive and negative reviews. Customers’ reviews, whether positive or negative, are also likely to use the word work because customers usually mention whether or not their product worked (and if it did, how well it worked).

Some of the words that are common amongst positive reviews include great, love, can, like, and easy, which are all words that are commonly found in positive reviews (e.g. “The Echo is super easy to use. Love it, will buy again!”). On the other hand, some of the words that are common amongst negative reviews include doesn’t, didn’t, stopped, and never, which you are very likely to find in critical reviews (e.g. “The Alexa speaker doesn’t work AT ALL! Never gonna buy again!”). An interesting observation is that the word refurbished is commonly found in negative reviews; this could be because refurbished devices are more likely to arrive with problems, leaving their buyers dissatisfied.

Next up, let’s clean up the data and prepare a corpus, which is the collection of text documents derived from the actual text of the reviews. Our corpus is then used to create a document term matrix (referred to in the program as dtm), which has one row per review and one column per word found across the reviews (remember to install the tm package):

Screen Shot 2019-05-06 at 8.28.44 PM

You may recall from R Lesson 13: Naive Bayes Classification that there are four options to include when creating the dtm. They are:

  • tolower: sets all words to lowercase
  • removeNumbers: removes any numbers present in the reviews (like years)
  • removePunctuation: removes any punctuation present in the reviews
  • stemming: simplifies the analysis by combining similar words (such as verbs in different tenses and singular/plural noun pairs). For instance, the words “trying”, “tries”, and “tried” would all be reduced to a common stem.

Remember to set each option to TRUE.
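Putting the corpus and DTM steps together, a sketch using the tm package (object names are my assumptions):

```r
library(tm)

# Build a corpus from the raw review text
corpus <- VCorpus(VectorSource(alexa$verified_reviews))

# Create the document term matrix, cleaning the text on the way
dtm <- DocumentTermMatrix(corpus, control = list(
  tolower = TRUE,           # lowercase every word
  removeNumbers = TRUE,     # drop numbers (like years)
  removePunctuation = TRUE, # drop punctuation
  stemming = TRUE           # reduce similar words to a common stem
))
```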

Next we’ll create training and testing labels for our dataset. I’ve probably said it before, but I always like to use the 75-25 split when it comes to splitting my data into training and testing sets. This means that I like to use 75% of my dataset to train and build the model before testing it on the other 25% of my dataset. Here’s how to split the data:

Screen Shot 2019-05-06 at 8.42.14 PM

Using the 75-25 split, observations 1-2363 will be part of my trainLabels while observations 2364-3150 will be part of my testLabels. I included the feedback variable since you should always include the binary variable when creating your trainLabels and testLabels.
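In code, the label split might look like this (the row cut-offs come from the post; the object names are my assumptions):

```r
# 75-25 split of the class labels
trainLabels <- alexa[1:2363, ]$feedback
testLabels  <- alexa[2364:3150, ]$feedback
```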

Now let’s make some tables analyzing the proportions of positive-to-negative reviews for both our trainLabels and testLabels. Remember to use the prop.table function:

Screen Shot 2019-05-07 at 3.06.20 PM
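The proportion tables can be produced like so:

```r
# Compare the class balance of the training and testing labels
prop.table(table(trainLabels))
prop.table(table(testLabels))
```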

According to the tables, the difference between the proportions of positive-to-negative reviews for our trainLabels and testLabels is rather small, 1% to be exact.

Now let’s split our dtm into training and testing sets, using the same guidelines that were used to create trainLabels and testLabels:

Screen Shot 2019-05-07 at 3.14.44 PM
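A sketch of the DTM split, reusing the same row ranges (object names are my assumptions):

```r
# Split the document term matrix with the same 75-25 row ranges
dtmTrain <- dtm[1:2363, ]
dtmTest  <- dtm[2364:3150, ]
```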

To further tidy up the model, let’s only include words that appear at least 3 times:

Screen Shot 2019-05-07 at 3.20.43 PM
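One common way to do this uses tm’s findFreqTerms (the exact call in the screenshot may differ; object names are my assumptions):

```r
# Keep only words that appear at least 3 times in the training data
freqWords <- findFreqTerms(dtmTrain, 3)
dtmFreqTrain <- dtmTrain[, freqWords]
dtmFreqTest  <- dtmTest[, freqWords]
```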

Our DTM uses 1s and 0s depending on whether a certain word can be found in a review (1 means the word is present, 0 means it isn’t). Since Naive Bayes works with categorical features, the 1s and 0s are converted into Yes and No. This conversion is applied to every column (hence MARGIN = 2):

Screen Shot 2019-05-07 at 3.33.22 PM
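The conversion described above can be sketched as follows (the helper function name is my assumption):

```r
# Turn word counts into categorical Yes/No values
convertCounts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# MARGIN = 2 applies the conversion to every column (word)
train <- apply(dtmFreqTrain, MARGIN = 2, convertCounts)
test  <- apply(dtmFreqTest,  MARGIN = 2, convertCounts)
```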

Last but not least, it’s time to create the Naive Bayes classifier (remember to install the package e1071):

Screen Shot 2019-05-07 at 3.38.59 PM
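A sketch of the training step (object names are my assumptions):

```r
library(e1071)

# Train Naive Bayes on the Yes/No matrix and the training labels
classifier <- naiveBayes(train, trainLabels)
```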

Now let’s test out the classifier on two words, great and disappoint:

Screen Shot 2019-05-07 at 3.39.48 PM
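One way to get those conditional probabilities is to look up the two words in the classifier’s tables (this exact lookup is my assumption; the screenshot may do it differently):

```r
# Each table gives P(word present | feedback class)
classifier$tables[["great"]]
classifier$tables[["disappoint"]]
```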

Now, what does all this mean?

Well, the word great:

  • DOESN’T appear in 93% of negative reviews but DOES appear in the other 7% of negative reviews
  • DOESN’T appear in 77.1% of positive reviews but DOES appear in the other 22.9% of positive reviews

These results are a little surprising since I thought the word great would appear in a majority of positive reviews. Then again, great is more prevalent in positive reviews than in negative reviews, which isn’t surprising, since negative reviews aren’t likely to use the word great.

And the word disappoint:

  • DOESN’T appear in 93.5% of negative reviews but DOES appear in the other 6.5% of negative reviews
  • DOESN’T appear in 98.8% of positive reviews but DOES appear in the other 1.2% of positive reviews

Just as with great, these results both are and aren’t surprising. The surprising part is that disappoint appears in only 6.5% of negative reviews, when I thought (and you probably did too) that it would be found in far more of them. Then again, disappoint is more common in negative reviews than in positive reviews, which isn’t surprising.

Last but not least, let’s create a confusion matrix, which evaluates the performance of this classifier using our testing dataset (which is the variable test). Remember to install the gmodels package:

Screen Shot 2019-05-07 at 3.55.27 PM
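A sketch of the evaluation step (the CrossTable options shown are my assumptions; the screenshot may use different ones):

```r
library(gmodels)

# Predict on the held-out reviews and cross-tabulate
# predictions against the true labels
testPred <- predict(classifier, test)
CrossTable(testPred, testLabels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c("predicted", "actual"))
```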

The confusion matrix contains 787 observations, the number of observations in our testing set. According to the matrix, 729 reviews were correctly classified while 58 reviews were misclassified. The overall accuracy of the classifier is therefore about 93% (729/787), which is excellent; still, the misclassified reviews could complicate customer feedback analysis, in which case a more sophisticated model would be needed.

Thanks for reading.

Michael

 
