Hello everybody,
It’s Michael, and today’s lesson will be on Naive Bayes classification in R.
But first, what exactly is Naive Bayes classification? It’s a simple probability mechanism based on Bayes’ Theorem.
OK, so what is Bayes’ Theorem? Well, let me give you a little math and history lesson. The theorem was devised in the 18th century by a guy named Reverend Thomas Bayes. Bayes’ Theorem essentially describes the likelihood that an event will occur, based on prior knowledge of conditions that might be related to the event. For instance, if the likelihood of getting into a car accident depends on how much driving experience someone has, then, with Bayes’ Theorem, we can more accurately assess the likelihood that someone will get into a car accident based on their amount of driving experience, as opposed to trying to figure out those chances without any knowledge of the person’s driving experience.
Here’s a mathematical representation of Bayes’ Theorem:
P(A|B) = ( P(B|A) * P(A) ) / P(B)
Here’s an explanation of each part in the equation:
- P(A|B): the conditional probability that event A occurs, given that event B has occurred
- P(B|A): the conditional probability that event B occurs, given that event A has occurred
- P(A): the probability that event A occurs, regardless of whether event B occurs
- P(B): the probability that event B occurs, regardless of whether event A occurs
Here’s an example. Let’s say there’s a 55% chance that the Miami Heat will win their game on 3/30/19 against the NY Knicks. If they win that game, then there is a 30% chance the Miami Heat will win their game on 4/1/19 against the Boston Celtics. However, if the Heat don’t beat the Knicks, then there is only a 25% chance that they will beat the Celtics. (Keep in mind I made up all of these odds)
So let’s say “Heat beat Knicks” is event A, and “Heat beat Celtics” is event B. Here’s a probability breakdown of all possible outcomes:
- P(A): 55%
- P(A’): 45%
  - the apostrophe next to the A means NOT, as in the likelihood that event A will NOT happen (or in this case, the odds that the Heat will lose to the Knicks)
- P(B|A): 30%
- P(B’|A): 70%
  - the odds that the Heat will NOT beat the Celtics if they beat the Knicks
- P(B|A’): 25%
- P(B’|A’): 75%
  - the odds that the Heat will lose to BOTH the Knicks and the Celtics
The first thing we’d calculate is the overall probability of the Heat beating the Celtics, P(B). By the law of total probability, this is a sum of products: the probability of beating the Knicks times the probability of then beating the Celtics, plus the probability of losing to the Knicks times the probability of then beating the Celtics.
Here’s what I mean by “sum of products”:
(.55*.30)+(.45*.25)=27.75%
So there is a 27.75% chance the Heat will beat the Celtics on 4/1/19.
If we were wondering what the odds are that the Heat beat the Knicks given that they beat the Celtics, we would use Bayes’ Theorem:
(.30*.55)/(.2775)=59.5%
So, assuming the Heat beat the Celtics, there is a 59.5% chance that they also beat the Knicks. By the way, the .2775 is P(B), the overall probability that event B occurs, which I calculated in the previous step.
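The arithmetic above can be reproduced in a few lines of R (a minimal sketch; the variable names are my own):

```r
# Made-up odds from the example above
p_A <- 0.55             # P(A): Heat beat the Knicks
p_B_given_A <- 0.30     # P(B|A): beat the Celtics, given they beat the Knicks
p_B_given_notA <- 0.25  # P(B|A'): beat the Celtics, given they lost to the Knicks

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
p_B <- p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' Theorem: P(A|B) = P(B|A)P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B

round(p_B, 4)          # 0.2775
round(p_A_given_B, 4)  # 0.5946
```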
Now, how exactly does Bayes’ Theorem relate to Naive Bayes classification? Naive Bayes is a collection of probability-based classification algorithms built on Bayes’ Theorem; it’s not a single algorithm but several algorithms that share a common principle: that every feature being classified is independent of every other feature.
For instance, let’s say we were trying to classify vegetables based on their features. Also, let’s assume that a vegetable that is green, long, and comes in an arch shape is a piece of celery. A Naive Bayes classifier considers that each of these three aforementioned features will contribute independently to the likelihood that a certain vegetable is a piece of celery, regardless of any commonalities between these features. However, classification features do sometimes correlate with each other. This is a disadvantage of using Naive Bayes classification, as this method makes strong assumptions regarding the independence between features (the strong assumptions regarding independence among features are why this classification method is referred to as “Naive Bayes”).
The whole point of Naive Bayes classification is to allow us to predict a class given a set of features. To use another vegetable example, let’s say we could predict whether a vegetable is a carrot, piece of celery, or corn cob (the class) based on its taste, shape, etc. (the features).
Even though Naive Bayes is a relatively simple algorithm, it can hold its own against far more complex machine learning methods. One real-world use of this algorithm is spam detection (as in e-mail spam detection).
Ok, now that I’ve got that explanation out of the way, it’s time to do some Naive Bayes in R. The dataset I will be using is Youtube04-Eminem, which gives 453 random comments on Eminem’s “Love The Way You Lie” video (ft. Rihanna). With this dataset, I will show you guys how to do spam detection with Naive Bayes in R by classifying the comments on this video as spam or non-spam (after all, Naive Bayes works for any type of spam detection, not just email spam).
- But first, I wanted to acknowledge that I got this dataset from the UCI Machine Learning Repository. Just like Kaggle, this website contains several datasets covering a wide variety of topics (sports, business, etc.) that you can use for analytical projects; here, you can find datasets ranging from the late 1980s to the present year (a lot better than the archaic datasets R offers for free). The website is maintained by the University of California, Irvine Center for Machine Learning and Intelligent Systems.
Now let’s get started. As with any R analysis, we should first try to understand our data:
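Loading and inspecting the data might look like this (a sketch; I’m assuming the CSV filename and that it sits in your working directory):

```r
# Load the dataset (filename assumed; adjust the path as needed)
eminem <- read.csv("Youtube04-Eminem.csv", stringsAsFactors = FALSE)

# Get a feel for the data: dimensions, variable types, and a few rows
str(eminem)
head(eminem)
```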
This dataset contains 453 comments and 5 different variables related to the comments. Here’s what each variable means:
- COMMENT_ID: YouTube’s alphanumeric comment ID for each comment
- AUTHOR: The YouTube username of the person who wrote the comment; there are only 396 unique authors because some people commented more than once on this video
- DATE: The date and time a comment was posted; T denotes the comment’s timestamp. Fewer than half of the comments have a corresponding date
- CONTENT: The content of a comment; there are duplicate comments
- CLASS: Whether or not a comment might be spam; 0 denotes non-spam and 1 denotes spam
So how exactly does Naive Bayes factor into all of this? In this case, Naive Bayes will calculate the likelihood that a comment is spam based on the words it contains. The strong assumptions about independence that are associated with Naive Bayes will come into play here, since the algorithm will assume that the probability of a certain word being found in a spam comment (e.g. Instagram) will be independent of the probability of another word being found in a spam comment (e.g. likes).
Next, we create a table using the CLASS variable to show the percentage of spam and non-spam comments (denoted by 1 and 0, respectively):
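A proportion table like this can be produced as follows (a sketch, assuming the data frame is named eminem as in my earlier load step):

```r
# Proportions of non-spam (0) and spam (1) comments
prop.table(table(eminem$CLASS))
```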
According to the table, 45.9% of our comments are non-spam (0) while 54.1% are spam (1).
Now, let’s create subsets of spam and non-spam comments:
In these subsets, I wanted to include the CONTENT (as in the actual comment) and the value of the CLASS variable (1 for spam and 0 for non-spam).
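One way to build those subsets (spam and nonspam are the names I use later; assuming the eminem data frame from earlier):

```r
# Keep only the comment text and its class label for each group
spam    <- subset(eminem, CLASS == 1, select = c(CONTENT, CLASS))
nonspam <- subset(eminem, CLASS == 0, select = c(CONTENT, CLASS))
```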
Now let’s make some wordclouds, which can help us visualize words that frequently occur in spam and non-spam comments (remember to install the wordcloud package). We will make two wordclouds, one for spam comments and the other for non-spam comments. Remember to use the subset variables that I just mentioned (spam and nonspam) when creating the wordcloud.
The basic idea of wordclouds is that the bigger a word appears on the wordcloud, the more common that word is in a certain category (spam or non-spam comments).
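The wordclouds can be generated roughly like this (a sketch; the min.freq value and other arguments are my own choices):

```r
library(wordcloud)

# Wordcloud of spam comments: the bigger the word, the more often it occurs
wordcloud(spam$CONTENT, min.freq = 5, random.order = FALSE)

# Wordcloud of non-spam comments
wordcloud(nonspam$CONTENT, min.freq = 5, random.order = FALSE)
```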
As you can see, some of the most common words in the spam category include check, please, channel and subscribe. This is not surprising, as many YouTube spammers often post something along the lines of “Hey guys please check out and subscribe to my channel: <link to spam channel>” in the comments section.
In the non-spam category, some of the most common words include Eminem, love, song, and Megan. This is also not surprising, since many people who leave comments on a music video would say how much they love the song and/or the artist (Eminem), often mentioning the artist’s name in the comments; the music video is for a song called “Love The Way You Lie”, which could be another reason why the word “love” is one of the most common amongst non-spam comments. Megan is also another one of the most common words amongst non-spam comments; this is likely because Megan Fox appears in the video.
One interesting observation is that the names of the two singers-Eminem and Rihanna-appear in both the spam and non-spam wordclouds. I think it’s interesting because I didn’t think spammers would mention the names of the artists in the comments section. However, keep in mind that Eminem and Rihanna are a lot less commonly mentioned in spam comments than they are in non-spam comments.
Now I’ll show you how to prepare the data for statistical analysis. But first we must create a corpus, which is a collection of documents built from the text in our file; use the CONTENT variable, as it contains the comments themselves. Remember to install the tm package:
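Corpus creation might look like this (a sketch; fileCorpus is the object name referenced below, and I’m assuming the eminem data frame from earlier):

```r
library(tm)

# Build a corpus in which each comment is one document
fileCorpus <- VCorpus(VectorSource(eminem$CONTENT))

# Inspect the corpus
print(fileCorpus)
```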
In case you are wondering what print(fileCorpus) does, it just prints out all the text of the comments; duplicate comments in our dataset are only printed once.
Now we have to make a document term matrix (or dtm) from the corpus; in our dtm, the comments themselves are shown in rows while the words that occur in the comments are shown in the columns:
In order to prepare our dtm, we must first clean our data by making all words lowercase, removing any numbers and punctuation (and presumably emojis) that are in the comments, and stemming all of our words. Stemming removes the suffix from words, which makes analysis easier since words with similar meanings (such as verbs with different tenses and plural nouns) are combined into one. For instance, “driving”, “drives” and “driven” would all be converted into “drive”.
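Here is a sketch of building the cleaned document term matrix (the control options are standard tm settings; the SnowballC package is needed for stemming):

```r
library(tm)
library(SnowballC)

# Build the dtm, cleaning the text along the way:
# lowercase everything, drop numbers and punctuation, and stem each word
dtm <- DocumentTermMatrix(fileCorpus, control = list(
  tolower = TRUE,
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stemming = TRUE
))
```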
Here is the structure of our dtm, though it isn’t too relevant with regards to our analysis:
Now we must split our data into training and testing datasets. There isn’t a single correct training-to-testing split, but I think 75-25 is ideal; this means we should use 75% of our data to build and train our model which we will then test on the other 25% of the dataset. First we should split the file, then split the dtm:
In this case, observations 1-340 (for both the file and dtm) will be part of the training set while observations 341-453 (for both the file and dtm) will be part of the testing set. And yes, I had to round here to get the stopping point for my testing set, since 75% of 453 is 339.75.
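The split can be written like this (a sketch; the names trainFile/testFile, dtmTrain/dtmTest, and trainLabels/testLabels are my own):

```r
# 75/25 split: rows 1-340 for training, rows 341-453 for testing
trainFile <- eminem[1:340, ]
testFile  <- eminem[341:453, ]

dtmTrain <- dtm[1:340, ]
dtmTest  <- dtm[341:453, ]

# Class labels, as factors, for each split
trainLabels <- factor(trainFile$CLASS)
testLabels  <- factor(testFile$CLASS)

# Compare the spam/non-spam proportions in each split
prop.table(table(trainLabels))
prop.table(table(testLabels))
```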
As you can see, the spam-to-nonspam proportions in our training set are different from those in our testing set. In our training set, the spam-to-nonspam proportions are 55.6%-44.4%. In our testing set, the spam-to-nonspam proportions are 49.6%-50.4%.
Now we must further clean our data by removing infrequent words from dtmTrain (the training dataset derived from our document term matrix), as they are unlikely to be useful in our analysis. I will only include words that are used at least 3 times:
We then convert the raw word counts in the document term matrix to 1s and 0s indicating whether a word appears in a comment; 1 indicates an appearance and 0 indicates no appearance. This conversion is applied to every column (hence the MARGIN = 2).
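A sketch of this step (convert_counts is a helper name I made up; the indicators are stored as strings so that e1071’s naiveBayes treats each word column as categorical rather than numeric):

```r
library(tm)

# Keep only words appearing at least 3 times in the training data
freqWords <- findFreqTerms(dtmTrain, lowfreq = 3)
dtmTrain  <- dtmTrain[, freqWords]
dtmTest   <- dtmTest[, freqWords]

# Collapse a raw count to a "1"/"0" presence indicator
convert_counts <- function(x) {
  ifelse(x > 0, "1", "0")
}

# MARGIN = 2 applies the conversion column by column
trainData <- apply(dtmTrain, MARGIN = 2, convert_counts)
testData  <- apply(dtmTest,  MARGIN = 2, convert_counts)
```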
Now let’s create our Naive Bayes classifier. Remember to install the package e1071:
- Also remember to use training, not testing datasets!
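Training might look like this (a sketch; classifier is my own name for the model object):

```r
library(e1071)

# Train the Naive Bayes classifier on the training data only
classifier <- naiveBayes(trainData, trainLabels)
```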
Now let’s see how our classifier works, using the word check (which is commonly found in comments like “Plz check out my channel”) as an example. Using our trainLabels, the output for the word check is displayed:
In this table, the likelihood that the word check will occur in a spam or non-spam comment is displayed. Here’s a breakdown of what this table means:
- There is a 100% chance that the word check will NOT be found in a non-spam comment, but only a 33.9% chance that check will NOT be found in a spam comment.
- On the other hand, there is a 0% chance that the word check will be found in a non-spam comment, but a 66.1% chance that check will be found in a spam comment.
To evaluate the accuracy of our Naive Bayes classifier, we create a confusion matrix using our testing set. Remember to install the gmodels package:
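The confusion matrix can be produced like this (a sketch, using the object names from my earlier snippets):

```r
library(gmodels)

# Predict classes for the test set, then cross-tabulate
# predicted vs. actual labels
testPred <- predict(classifier, testData)

CrossTable(testPred, testLabels,
           prop.chisq = FALSE,
           dnn = c("predicted", "actual"))
```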
Our testing set has 113 observations; the number of comments that are correctly and incorrectly classified is shown in the matrix above. The sum of the numbers in 0A-0P and 1A-1P (A means actual and P means predicted) represents the number of correctly classified comments: 110 (56+54). On the other hand, the sum of the numbers in 0A-1P and 1A-0P represents the number of incorrectly classified comments: 3 (2+1). Since only 3 of our 113 comments were misclassified, our Naive Bayes classifier has fantastic accuracy (97.3%).
Thanks for reading,
Michael