April 2019 - Michael's Programming Bytes

Java Lesson 9: Inheritance, Polymorphism & Subclasses

Hello everybody,

It’s Michael, and today’s Java lesson will be on inheritance, polymorphism & subclasses.

First of all, what is inheritance? In Java, it’s possible for one class to inherit the attributes and methods of another class. Using the keyword extends, any class can inherit methods and attributes from another class. The class that is inheriting the methods and attributes is called the subclass or child class while the class that is providing the methods for the subclass is called the parent class or superclass.

Polymorphism is related to inheritance, but the two aren’t interchangeable. Inheritance lets us use methods and attributes from another class by means of the extends keyword while polymorphism lets us use those inherited methods to perform a single action in different ways.

Let’s say we have a superclass called Sport which has a single method, score. Subclasses of Sport could include Bowling, AmericanFootball, Golf, Basketball, etc., and each subclass would have its own implementation of a scoring system* (for instance, baseball would use runs, golf would use strokes, etc.).

*this is where polymorphism would come in

Here’s some code we can use to approach the problem in the above example (keep in mind I didn’t run any of this on NetBeans; I’m just using this solely as an example):

Let’s start with some code for the superclass:

public class Sport
{
    public void score ()
    {
        System.out.print("Score system: ");
    }
}

And here’s the code for two subclasses-AmericanFootball and Baseball:

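class AmericanFootball extends Sport
{
    // Overrides the score method inherited from Sport
    @Override
    public void score ()
    {
        System.out.print("Score system: Points");
    }
}

class Baseball extends Sport
{
    // Overrides the score method inherited from Sport
    @Override
    public void score ()
    {
        System.out.print("Score system: Runs");
    }
}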

As you can see, the superclass Sport has no main method, just the void method score, which my two subclasses, AmericanFootball and Baseball, each override. One thing you will notice (and this is where polymorphism comes into play) is that the score method produces a different output for each class. For instance, if you were to create an object of the superclass Sport and call the score method, you would get the output Score system: . If you were to create an object of class AmericanFootball and call the score method, you would get the output Score system: Points. If you did the same thing for the Baseball class, you would get the output Score system: Runs.

My point is that with polymorphism, you can use the same method across various classes and tailor the method to each class.
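
Here’s a minimal sketch of how that could look when the classes are put together and actually called. Keep in mind that the SportDemo class and its main method are my own additions for illustration, and I dropped the public keyword from Sport so that everything fits in one file. Each object is stored in a variable of type Sport, yet Java still picks the score method that belongs to the object’s real class.

```java
class Sport
{
    public void score()
    {
        System.out.print("Score system: ");
    }
}

class AmericanFootball extends Sport
{
    @Override
    public void score()
    {
        System.out.print("Score system: Points");
    }
}

class Baseball extends Sport
{
    @Override
    public void score()
    {
        System.out.print("Score system: Runs");
    }
}

// Hypothetical driver class, not part of the original example
class SportDemo
{
    public static void main(String[] args)
    {
        // Every element is declared as a Sport, but holds a different kind of object
        Sport[] sports = {new Sport(), new AmericanFootball(), new Baseball()};

        for (Sport s : sports)
        {
            s.score();              // the version that runs depends on the object's class
            System.out.println();   // move to a new line after each call
        }
    }
}
```

The loop never checks which kind of sport it is dealing with; calling score on each element is enough.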

Thanks for reading,

Michael

Java Lesson 8: Arrays

Hello everybody,

It’s Michael, and today’s Java post will be about arrays. But what exactly do arrays do in Java?

Arrays are basically a list of variables that are referred to by a common name. Here’s an example:

Florida Ohio Georgia Mississippi Texas Montana

This is an array of states, which I will call states.

You can make arrays for variables of any Java type, including:

int
- And any numerical type for that matter (e.g. double, float)
char
boolean
String

However, you cannot have several variable types in an array (so no mixing String elements with int elements). For instance, if you wanted to include 737 Down Over ABQ in your array, the 737 would have to be a String in order to make the array work (assuming Down Over ABQ are all String).

Now let me demonstrate a simple array program:

package javalessons;

public class JavaLessons
{

public static void main(String[] args)
{
String [] stateCaps = {"Helena", "Salem", "Tallahassee", "Atlanta", "St.Paul", "Pierre", "Columbus", "Providence"};
System.out.println(stateCaps[4]);
}
}

And here are two sample outputs (using a different index for each run):

run:
Tallahassee
BUILD SUCCESSFUL (total time: 0 seconds)

run:
St.Paul
BUILD SUCCESSFUL (total time: 0 seconds)

The first thing I did in the program is to create my array declaration-String [] stateCaps = {"Helena", "Salem", "Tallahassee", "Atlanta", "St.Paul", "Pierre", "Columbus", "Providence"}. The point of the array declaration is to create an array and POSSIBLY fill it with elements.

I say POSSIBLY because there are three ways I can go about creating my array and filling it with elements. Here are the possibilities:

Do what I did in the aforementioned program, which is to create the array and the elements in the same line of code (separated by an equals sign)
Create the array and allocate space to the array in the same line, which would mean I would have written String [] stateCaps = new String [8] to create a String array with 8 values. Meanwhile, I would have added elements in 8 separate lines of code, starting with stateCaps[0]="Helena" and continuing on until I fill the array.
Create a for loop. More on that later in this post.

OK, here’s something I want to mention about that second bullet point. Each element’s position in an array is called its index (plural: indices). However, the first index in an array is [0], not [1], since array indices are counted starting from 0, not 1. So the first element is at index [0], the second at index [1], the third at index [2], and so on.

In each of my outputs, I chose a different index, and thus got a different output each time. The first output-Tallahassee-corresponds to index 2-or the third element in the array, which is “Tallahassee”. The second output-St.Paul-corresponds to index 4-or the fifth element in the array, which is “St.Paul”.
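
To make the three approaches from the list above concrete, here’s a small sketch that uses each of them side by side. The class name ArrayCreationDemo and the shortened three-element arrays are my own choices for illustration, not part of the original program.

```java
// Hypothetical class name, used only for this illustration
class ArrayCreationDemo
{
    public static void main(String[] args)
    {
        // Way 1: create the array and list its elements on the same line
        String[] stateCaps = {"Helena", "Salem", "Tallahassee"};

        // Way 2: allocate space first, then assign each index on its own line
        String[] sameCaps = new String[3];
        sameCaps[0] = "Helena";
        sameCaps[1] = "Salem";
        sameCaps[2] = "Tallahassee";

        // Way 3: allocate space, then fill the array with a for loop
        int[] squares = new int[5];
        for (int i = 0; i < squares.length; i++)
        {
            squares[i] = i * i;
        }

        System.out.println(stateCaps[0]);   // index 0 is the first element
        System.out.println(sameCaps[2]);    // index 2 is the third element
        System.out.println(squares[4]);     // index 4 is the fifth (last) element
    }
}
```

Using squares.length in the loop condition means the loop always matches the size of the array, even if I change the 5 later.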

Next, remember how I mentioned that you could use a random number generator in arrays in my previous post? Here’s how you can do that (using a different array):

package javalessons;
import java.util.Random;

public class JavaLessons
{

public static void main(String[] args)
{
String [] cartoons = {"Family Guy", "Spongebob", "Simpsons", "South Park",
"Archer", "Daniel Tiger's Neighborhood", "Arthur", "Looney Tunes",
"Big Mouth", "Rick & Morty"};
Random gen = new Random ();
int index = gen.nextInt(cartoons.length); // the limit is never returned, so this gives 0 through 9

System.out.println(cartoons[index]);
}
}

And here are two sample outputs:

run:
Big Mouth
BUILD SUCCESSFUL (total time: 0 seconds)

run:
Spongebob
BUILD SUCCESSFUL (total time: 2 seconds)

In this program, I first imported java.util.Random, which is what you need to do if you plan on using a random number generator in your program. I then created an array, cartoons, and filled it with 10 elements (10 American cartoons). I also created a Random variable, which gives me a random number generator, and an int variable, which stores the randomly chosen index. The number passed to nextInt is an upper limit that is never returned itself: nextInt(n) returns a number from 0 through n-1. Since there are 10 elements in my array (indices 0 through 9), I pass cartoons.length, which is 10, so every element can be chosen. There is no risk of an error message here, because index 10 is never generated. Had I passed 9 instead, the generator would only return 0 through 8, and the last element (Rick & Morty) would never be printed.

For my output, I asked the program to print out a random element each time I run the program. In my first sample output, index 8 was displayed (corresponding to Big Mouth) while in my second output, index 1 was displayed (corresponding to Spongebob).
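
If you want to see for yourself which indexes nextInt can produce, one option is to tally a large number of draws. This is just a sketch under my own assumptions: the class name BoundCheckDemo, the tally method, and the choice of 10,000 draws are all made up for illustration.

```java
import java.util.Random;

// Hypothetical class name, used only for this illustration
class BoundCheckDemo
{
    // Counts how many times each value comes up when nextInt(bound) is called repeatedly
    static int[] tally(Random gen, int bound, int draws)
    {
        int[] counts = new int[bound];
        for (int i = 0; i < draws; i++)
        {
            // nextInt(bound) only returns 0 through bound - 1, so this index always exists
            counts[gen.nextInt(bound)]++;
        }
        return counts;
    }

    public static void main(String[] args)
    {
        int[] counts = tally(new Random(), 10, 10000);
        for (int index = 0; index < counts.length; index++)
        {
            System.out.println("index " + index + " was drawn " + counts[index] + " times");
        }
    }
}
```

With a limit of 10, indexes 0 through 9 should each come up many times, and the program never tries to use index 10.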

Now let me show you how to fill in an array using a for loop:

package javalessons;

public class JavaLessons
{

public static void main(String[] args)
{
int [] multiples = new int [15];

for (int i = 0; i <= 14; i++)
{
multiples[i]=i*8;
}

System.out.println (multiples[3]);
}
}

And here are two sample outputs (each using a different index):

run:
24
BUILD SUCCESSFUL (total time: 1 second)

run:
48
BUILD SUCCESSFUL (total time: 0 seconds)

This program demonstrates a very simple way to fill in an array using a for loop. I used my for loop to fill in my array with multiples of 8, hence the i*8. I first created my array in the line directly above the for loop, then I created the for loop to fill my array with elements. In my for loop, I started my counter at 0 (remember that array indices start from 0), asked it to stop at 14 (since I will have 15 elements in my array), and asked it to increment by 1 after each iteration.

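I’ll admit that the new int [15] may seem redundant since I already determined that my loop will have 15 iterations (and thus 15 elements). It is still required, though: a Java array has a fixed size, and the space for it has to be allocated before the loop can assign anything to multiples[i].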

My sample outputs printed indexes 3 and 6 (the 4th and 7th elements respectively), which correspond to the numbers 24 and 48. Remember that since I started this loop with 0, the first element will be 0, since 0 times 8 is 0.

Last but not least, I will introduce the concept of two dimensional arrays using a combination of a for loop and a random number generator. Here’s a sample program:

package javalessons;
import java.util.Random;

public class JavaLessons
{

public static void main(String[] args)
{
int [] [] twoDim = new int [6][6];
Random gen = new Random ();
// Fill each element with its own random number from 0 through 34

for (int i = 0; i <= 5; i++)
{
for (int j = 0; j <= 5; j++)
{
twoDim [i][j] = gen.nextInt(35);
}
}

System.out.println (twoDim[2][3]);
}
}

And here are two sample outputs:

run:
24
BUILD SUCCESSFUL (total time: 0 seconds)

run:
1
BUILD SUCCESSFUL (total time: 1 second)

The process of working with two-dimensional arrays (or other multidimensional arrays for that matter) is a little different from the process of working with one dimensional arrays. First of all, you would typically use a nested for loop (a for loop within a for loop) to traverse through (and possibly fill in) your array. Second, you would need two pairs of square brackets [] [] to declare the array as opposed to just one [].

You don’t always need a nested for loop to fill in a two dimensional array. Plus, if you are dealing with non-numerical elements like String, a loop isn’t very practical to use.
Here’s another way to fill in a 2-dimensional array, using String elements:
- Let’s say our array is 3 by 3.
- We can list the elements directly in the declaration:
  - String [][] cities = {{"Fort Collins", "Mentor", "Miami"}, {"Bozeman", "Annapolis", "Pittsburgh"}, {"Boston", "Omaha", "Manhattan"}};
  - Since this array is 3 by 3, we would have 3 groups of 3 elements each.
Here’s a handy rule when it comes to two dimensional arrays:
- For an array with dimensions of [x][y], create x groups of y elements each.
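
The 3 by 3 example above can be dropped into a small program like the sketch below. The class name CityGridDemo and the " | " separator are my own choices for illustration; the array itself is the one from the example.

```java
// Hypothetical class name, used only for this illustration
class CityGridDemo
{
    public static void main(String[] args)
    {
        // 3 groups (rows) of 3 elements (columns) each
        String[][] cities = {
            {"Fort Collins", "Mentor", "Miami"},
            {"Bozeman", "Annapolis", "Pittsburgh"},
            {"Boston", "Omaha", "Manhattan"}
        };

        // cities.length is the number of rows; cities[row].length is the number of columns in that row
        for (int row = 0; row < cities.length; row++)
        {
            for (int col = 0; col < cities[row].length; col++)
            {
                System.out.print(cities[row][col] + " | ");
            }
            System.out.println();
        }

        // Row index first, then column index: second row, third column
        System.out.println(cities[1][2]);
    }
}
```

Even though the array was filled without a loop, a nested for loop is still a handy way to print every element.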

The “indexes start at 0” rule applies here. Here’s the index structure for the 6 by 6 array I created above:

[Screenshot: index grid for the 6 by 6 array]

The first index would be [0][0] while the final index would be [5][5]. If you wanted to select the element on the second row and fourth column, that would correspond to [1][3]. Remember that the row always comes first, followed by the column.

In my sample outputs, I selected indexes [2][3] and [3][0], which printed out 24 and 1, respectively. Keep in mind that even if you use the same indexes, you might get a different output each time due to the use of the random number generator in this program. However, I passed 35 to nextInt, and since that upper limit is never returned itself, the number printed will always be between 0 and 34.

One last thing I want to address is that you can create arrays with more than two dimensions. For instance, if you wanted to make a 4-dimensional array, you would include 4 pairs of square brackets: [][][][]. As to how many elements an array like this can hold, just find the product of all of the numbers in the dimension brackets (which would be to the right of the equals sign). For instance, if you have a 4-dimensional array like String [][][][] names = new String [3][9][2][4], multiply the numbers in the dimension brackets to see how many elements this array will hold (3*9*2*4=216; this array will hold 216 elements).
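
If you’d rather let Java do the multiplication, here’s a quick sketch that reads the size of each dimension from the array itself. The class name DimensionCountDemo is made up for this example.

```java
// Hypothetical class name, used only for this illustration
class DimensionCountDemo
{
    public static void main(String[] args)
    {
        String[][][][] names = new String[3][9][2][4];

        // Multiply the length of each dimension to get the total number of elements
        int total = names.length
                  * names[0].length
                  * names[0][0].length
                  * names[0][0][0].length;

        System.out.println(total);   // 3 * 9 * 2 * 4
    }
}
```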

Thanks for reading,

Michael

Java Lesson 7: Random Number Generator

Hello everybody,

It’s Michael, and I’ve decided to give you guys another series of Java lessons. In this post, I’ll show you how to create your own random number generator, which is a tool that can generate random numbers in Java. Generating random numbers is a common task that is used for a variety of purposes, such as gambling, statistical sampling, and any other scenario where you would need to simulate unpredictable behavior. Random number generators can also be used with arrays, which will be the topic of my next Java post.

Now let’s begin with an example of a simple random number generator:

package javalessons;
import java.util.Random;

public class JavaLessons
{

public static void main(String[] args)
{
Random generator = new Random ();
int random = generator.nextInt(50);
System.out.println(random);
}
}

And here are two sample outputs:

run:
46
BUILD SUCCESSFUL (total time: 0 seconds)

run:
12
BUILD SUCCESSFUL (total time: 1 second)

Before writing the code for your program, remember to import java.util.Random (include this import anytime you plan on using the random number generator).

So what does all of this code mean? First of all, the Random generator = new Random () line creates your random number generator variable. You keep the parentheses empty since you will indicate the boundaries in a separate int variable, which in this case is found in the line int random = generator.nextInt(50). This line indicates that you want a random number from 0 up to, but not including, 50 (the number in the parentheses is the upper limit of your random number generator, and the limit itself is never returned). System.out.println(random) prints out that random number, which will be between 0 and 49.

But what if you wanted to print out several random numbers? We can do just that, with the help of our random number generator and a handy for-loop:

package javalessons;
import java.util.Random;

public class JavaLessons
{

public static void main(String[] args)
{
Random generator = new Random ();

for (int i = 0; i <= 8; i++)
{
int random = generator.nextInt(120);
System.out.print(random + " , ");
}

}
}

And here are two sample outputs:

run:
6 , 17 , 20 , 98 , 82 , 46 , 43 , 5 , 19 , BUILD SUCCESSFUL (total time: 1 second)

run:
65 , 13 , 116 , 16 , 5 , 1 , 6 , 28 , 72 , BUILD SUCCESSFUL (total time: 2 seconds)

I kept the variables the same as the previous example; all I did here was add a for loop (for (int i = 0; i <= 8; i++)) and change the upper limit to 120. Since i runs from 0 through 8, this program will print out 9 random numbers between 0 and 119, each followed by a comma.

So what if you want to use a custom range for your random number generator, as opposed to always starting from 0? Here’s how:

package javalessons;
import java.util.Random;

public class JavaLessons
{

public static void main(String[] args)
{
Random generator = new Random ();

for (int i = 0; i <= 8; i++)
{
int random = 200 + generator.nextInt(600);
System.out.print(random + " , ");
}

}
}

And here are two sample outputs:

run:
555 , 211 , 587 , 707 , 545 , 653 , 594 , 785 , 520 , BUILD SUCCESSFUL (total time: 2 seconds)

run:
415 , 345 , 579 , 462 , 660 , 695 , 593 , 337 , 505 , BUILD SUCCESSFUL (total time: 2 seconds)

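The program is the same as the previous example, save for changing the upper limit to 600 and adding 200 + in front of the generator.nextInt call. Remember how I said I was going to show you how to use a custom range for your random number generator? Well, the 200 + is exactly how you can do that, because putting this (or any number for that matter) in front of generator.nextInt(600) shifts every result up by 200. Keep in mind that this shifts the upper end too: generator.nextInt(600) returns 0 through 599, so 200 + generator.nextInt(600) returns 200 through 799, which is why numbers like 707 and 785 show up in the sample outputs. To get a range of exactly 200 to 600, you would write 200 + generator.nextInt(401).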
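
Since it’s easy to mix up the limit and the range, here’s a sketch of a small helper that takes the lowest and highest values you want and works out the nextInt argument for you. The class name RandomRangeDemo and the method randomInRange are my own inventions for illustration, and the helper assumes min is not larger than max.

```java
import java.util.Random;

// Hypothetical class name, used only for this illustration
class RandomRangeDemo
{
    // Returns a random int from min to max, both included (assumes min <= max)
    static int randomInRange(Random generator, int min, int max)
    {
        // nextInt(n) returns 0 through n - 1, so n has to be the count of allowed values
        return min + generator.nextInt(max - min + 1);
    }

    public static void main(String[] args)
    {
        Random generator = new Random();

        for (int i = 0; i <= 8; i++)
        {
            System.out.print(randomInRange(generator, 200, 600) + " , ");
        }
    }
}
```

Every number this prints should fall between 200 and 600, with both endpoints possible.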

And last but not least, let’s see what happens when you want to input the value for the upper limit:

package javalessons;
import java.util.Random;
import java.util.Scanner;

public class JavaLessons
{

public static void main(String[] args)
{
Random generator = new Random ();
Scanner sc = new Scanner (System.in);
System.out.println("Please choose a number: ");
int custom = sc.nextInt();

for (int i = 0; i <= 8; i++)
{
int random = generator.nextInt(custom);
System.out.print(random + " , ");
}

}
}

And here are two sample outputs (using 35 for the first example and 91 for the second):

run:
Please choose a number:
35
31 , 23 , 4 , 5 , 22 , 9 , 14 , 11 , 10 , BUILD SUCCESSFUL (total time: 4 seconds)

run:
Please choose a number:
91
78 , 45 , 63 , 82 , 89 , 58 , 12 , 8 , 61 , BUILD SUCCESSFUL (total time: 2 seconds)

In this program, I added a Scanner variable (along with importing the Scanner class) and a custom int variable, which allows you to choose the endpoint for your random number generator (the start-point is still 0). As you can see from each sample output, the program will print out 9 random numbers that are all less than the number that was inputted.
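
One thing to watch out for with user input: nextInt throws an IllegalArgumentException if the number passed to it is 0 or negative. Here’s a sketch of how the program could check the input first. The class name CustomBoundDemo, the isValidBound method, and the wording of the messages are my own assumptions for illustration.

```java
import java.util.Random;
import java.util.Scanner;

// Hypothetical class name, used only for this illustration
class CustomBoundDemo
{
    // Random.nextInt(bound) only accepts a bound greater than 0
    static boolean isValidBound(int bound)
    {
        return bound > 0;
    }

    public static void main(String[] args)
    {
        Random generator = new Random();
        Scanner sc = new Scanner(System.in);
        System.out.println("Please choose a number: ");

        // Stop early if the input is missing or is not a whole number
        if (!sc.hasNextInt())
        {
            System.out.println("That is not a whole number.");
            return;
        }

        int custom = sc.nextInt();

        // Stop early if the number cannot be used as an upper limit
        if (!isValidBound(custom))
        {
            System.out.println("The number must be greater than 0.");
            return;
        }

        for (int i = 0; i <= 8; i++)
        {
            System.out.print(generator.nextInt(custom) + " , ");
        }
    }
}
```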

Thanks for reading,

Michael

R Analysis 3: K-Means vs Hierarchical Clustering

Hello everybody,

It’s Michael, and I will be doing an R analysis in this post. More specifically, I will be doing a comparative clustering analysis, which means I’ll take a dataset and perform both k-means and hierarchical clustering analysis with that dataset to compare the results of each method. However, this analysis will be unique, since I will be revisiting one of the earliest datasets I used for this blog, TV shows, which first appeared in R Lesson 4: Logistic Regression Models on July 11, 2018 (exactly nine months ago!). In case you forgot what this dataset was about, it basically gives 85 shows that aired during the 2017-18 TV season and whether or not they were renewed for the 2018-19 TV season along with other aspects of those shows (such as the year they premiered and the network they air on). I’ll admit I chose this dataset because I wanted to analyze one of my old datasets in a different way (remember I performed linear and logistic regression the first time I used this dataset).

So, as always, let’s load the file and get a basic understanding of our data:

As you can see, we have 85 observations of 10 variables. Here’s a detailed breakdown of each variable:

TV.Show-The name of the show
Genre-The genre of the show
Premiere.Year-The year the show premiered; for revivals like Roseanne, I used the original premiere year (1988) as opposed to the revival premiere year (2018)
X..of.seasons..17.18.-How many seasons the show had aired as of the end of the 2017-18 TV season
Network-The network (or streaming service) the show airs on
X2018.19.renewal.-Whether or not the show was renewed for the 2018-19 TV season; 1 denotes renewal and 0 denotes cancellation
Rating-The content rating for the show. Here’s a more detailed breakdown:
- 1 means TV-G
- 2 means TV-PG
- 3 means TV-14
- 4 means TV-MA
- 5 means not applicable
Usual.Day.of.Week-The usual day of the week the show airs its new episodes. Here’s a more detailed breakdown:
- 1 means the show airs on Mondays
- 2 means the show airs on Tuesdays
- 3 means the show airs on Wednesdays
- 4 means the show airs on Thursdays
- 5 means the show airs on Fridays
- 6 means the show airs on Saturdays
- 7 means the show airs on Sundays
- 8 means the show doesn’t have a regular air-day (usually applies to talk shows or shows on streaming services)
Medium-the type of network the show airs on. Here’s a more detailed breakdown:
- 1 means the show airs on either one of the big 4 broadcast networks (ABC, NBC, FOX or CBS) or the CW (which isn’t part of the big 4)
- 2 means the show airs on a cable channel (AMC, Bravo, etc.)
- 3 means the show airs on a streaming service (Hulu, Amazon Prime, etc.)
Episode.Count-the new variable I added for this analysis; this variable shows how many episodes a show has had overall as of the end of the 2017-18 TV season. For certain shows whose seasons cross the 17-18 and 18-19 seasons, I will count how many episodes each show has had as of September 24, 2018 (the beginning of the 2018-19 TV season)

Now that we’ve learned more about our variables, let’s start our analysis. But first, I convert the final four variables into factors, since I think it’ll be more appropriate for the analysis:

Ok, now onto the analysis. I’ll start with k-means:

Here, I created a data subset using our third and tenth columns (Premiere.Year and Episode.Count respectively) and displayed the head (the first six observations) of my cluster.

Now let’s do some k-means clustering:

I created the variable tvCluster to store my k-means model using the name of my data subset (cluster1), the number of clusters I wanted to include (4), and nstart, which tells the model to try 35 random sets of starting points and then keep the one with the lowest within-cluster variation.

I then type in tvCluster to get a better idea of what my cluster looks like. The first thing I see is “K-means clustering with (X) clusters of sizes”, followed by the number of observations in each cluster (which are 17, 64, 1 and 3, respectively). In total, all 85 observations were used since I didn’t have any missing data points.

The next thing that is mentioned is cluster means, which gives the mean for each variable used in the clustering analysis (in this case, Episode.Count and Premiere.Year). Interestingly enough, Cluster 2 has the highest mean Premiere.Year (2015) but the lowest mean Episode.Count (49, rounded to the nearest whole number).

After that, you can see the clustering vector, which shows you which observations belong to which cluster. Even though the position of the observation (e.g. 1st, 23rd) isn’t explicitly mentioned, you can tell which observation you are looking at since the vector starts with the first observation and works its way down to the eighty-fifth observation (and since there is no missing data, all 85 observations are used in this clustering model). For instance, the first three observations all correspond to cluster 1 (the first three shows listed in this dataset are NCIS, Big Bang Theory, and The Simpsons). Likewise, the final three observations all correspond to cluster 2 (the corresponding shows are The Americans, Baskets, and Comic Book Men).

Next you will see the within cluster sum of squares for each cluster, which I will abbreviate as WCSSBC; this is a measurement of the variability of the observations in each cluster. Remember that the smaller this amount is, the more compact the cluster. In this case, 3 of the 4 WCSSBC are above 100,000, while the other WCSSBC is 0 (which I’m guessing is cluster 3, which has only one observation).

Last but not least is between_SS/total_SS=94.5%, which represents the between sum-of-squares and total sum-of-squares ratio, which as you may recall from the k-means lesson is a measure of the goodness-of-fit of the model. 94.5% indicates that there is an excellent goodness-of-fit for this model.
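
To show where a ratio like this comes from, here’s a rough sketch in Java (not the R code used for this analysis) that computes it for a handful of made-up points. Everything here is an assumption for illustration: the class name KMeansStats, both method names, and the sample (premiere year, episode count) pairs, which are not from the TV shows dataset. The sketch treats the between sum-of-squares as the total sum-of-squares minus the combined within-cluster sum-of-squares, which is my understanding of how R reports it.

```java
// Hypothetical class name, used only for this illustration
class KMeansStats
{
    // Sum of squared distances from each point to the mean of the given points
    static double sumOfSquares(double[][] points)
    {
        int dims = points[0].length;
        double[] mean = new double[dims];
        for (double[] p : points)
        {
            for (int d = 0; d < dims; d++)
            {
                mean[d] += p[d] / points.length;
            }
        }

        double ss = 0;
        for (double[] p : points)
        {
            for (int d = 0; d < dims; d++)
            {
                ss += (p[d] - mean[d]) * (p[d] - mean[d]);
            }
        }
        return ss;
    }

    // (total SS - within SS) / total SS, where within SS is summed over all clusters
    static double betweenOverTotal(double[][] all, double[][]... clusters)
    {
        double withinSS = 0;
        for (double[][] cluster : clusters)
        {
            withinSS += sumOfSquares(cluster);
        }
        double totalSS = sumOfSquares(all);
        return (totalSS - withinSS) / totalSS;
    }

    public static void main(String[] args)
    {
        // Made-up (premiere year, episode count) pairs, already split into two clusters
        double[][] clusterA = {{2015, 20}, {2016, 30}, {2017, 10}};
        double[][] clusterB = {{1990, 600}, {2000, 400}};
        double[][] all = {{2015, 20}, {2016, 30}, {2017, 10}, {1990, 600}, {2000, 400}};

        System.out.println("between_SS / total_SS = " + betweenOverTotal(all, clusterA, clusterB));
    }
}
```

The tighter each cluster is around its own mean, the smaller the within sum-of-squares gets, and the closer the ratio moves to 100%.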

Finally, let’s graph our model:

In this graph, the debut year is on the x-axis, while the episode count is on the y-axis. As you can see, the 2 largest clusters (represented by the black and red dots) are fairly close together while the 2 smallest clusters (represented by the blue and green dots) are fairly spread out (granted, the two smallest clusters only have 1 and 3 observations, respectively). An observation about this graph that I wanted to point out is that a show that premiered further back doesn’t always have more episodes than a show that premiered fairly recently (let’s say anytime from 2015 onward). This happens for several reasons, including:

Revived series like American Idol (which took a hiatus in 2017 before its 2018 revival) and Roseanne (which had been dormant for 21 years before its 2018 revival)
Different shows air a different number of episodes per season; for instance, talk shows like Jimmy Kimmel Live have at least 100 episodes per season while shows on the Big 4 networks tend to have between 20-24 episodes per season (think Simpsons, The Big Bang Theory, and Grey’s Anatomy). Cable and streaming shows usually have even fewer episodes per season (between 6-13, like how South Park only does 10 episode seasons)
Some shows just take long breaks (like how Jessica Jones on Netflix didn’t release any new episodes between November 2015 and March 2018)

Now time to do some hierarchical clustering on our data. And yes, I plan to use all the methods covered in the post R Lesson 12: Hierarchical Clustering.

Let’s begin by scaling all numerical variables in our data (don’t include the ones that were converted into factor types):

Now let’s start off with some agglomerative clustering (with both the dendrogram and code):

After setting up the model using Euclidean distance and complete linkage, I then plot my dendrogram, which you can see above. This dendrogram is a lot neater than the ones I created in R Lesson 12, but then again, this dataset only has 85 observations, while the one in that post had nearly 7,200. The names of the shows themselves aren’t mentioned, but each of the numbers displayed corresponds to a certain show. For instance, 74 corresponds to The Voice, since it is the 74th show listed in our dataset. Look at the spreadsheet to figure out which number corresponds to which show.

You may recall that I mentioned two general rules for interpreting dendrograms. They are:

The higher the height of the fusion, the more dissimilar two items are
The wider the branch between two observations, the more dissimilar they are

Those rules certainly apply here, though the highest height is 8 in this case, as opposed to 70. For instance, since the branch between shows 7 and 63 is fairly narrow, these two shows have a lot in common according to the model (even though the two shows in question are Bob’s Burgers and The Walking Dead, the former being a cartoon sitcom and the latter revolving around the zombie apocalypse). On the other hand, the gap between shows 74 and 49 is wider, which means they don’t share much in common (even though the two shows are The Voice and Shark Tank, which both qualify as reality shows, though the former is more competition-oriented than the latter). All in all, I think it’s interesting to see how these clusters were created, since the shows that were grouped closer together seem to have nothing in common.

Now let’s try AGNES (remember that stands for agglomerative nesting):

First of all, remember to install the package cluster. Also, remember that the one important result is the ac, or agglomerative coefficient, which measures the strength of clustering structure. As you can see, our ac is 96.1%, which indicates very strong clustering structure (I personally think any ac of at least 90% is good).

Now let’s compare this ac (which used complete linkage) to the ac we get with other linkage methods (not including centroid). Remember to install the purrr package:

Of the four linkage methods, Ward’s method gives us the highest agglomerative coefficient (98.2%), so that’s what we’ll use for the next part of this analysis.

Using Ward’s method and AGNES, here’s a dendrogram of our data (and the corresponding code):

Aside from having a greater maximum height than our previous dendrogram (the latter had a maximum height of 8 while this diagram has a maximum height of presumably 18), the observations are also placed differently. For instance, unlike in the previous dendrogram, observations 30 and 71 are side by side. But just as with the last dendrogram, shows that have almost nothing in common are oddly grouped together; for instance, the 30th and 71st observations correspond to The Gifted and Transparent; the former is a sci-fi show based off the X-Men universe while the latter is a transgender-oriented drama. The 4th and 17th observations are another good example of this, as the corresponding shows are The Simpsons and Taken; the former is a long-running cartoon sitcom while the latter is based off of an action movie trilogy.

The last method I will use for this analysis is DIANA (stands for divisive analysis). Recall that the main difference between DIANA and AGNES is that DIANA works in a top-down manner (objects start in a single supercluster and are divided into smaller clusters until single-element clusters are created) while AGNES works in a bottom-up manner (objects start in single-element clusters and are merged into progressively larger clusters until a single supercluster is created). Here’s the code and dendrogram for our DIANA analysis:

Remember that the divisive coefficient is pretty much identical to the agglomerative coefficient, since both measure strength of clustering structure and the closer each amount is to 1, the stronger the clustering structure. Also, in both cases, a coefficient of .9 (or 90%) or higher indicates excellent clustering structure. In this case, the dc (divisive coefficient) is 95.9%, which indicates excellent clustering structure.

Just as with the previous two dendrograms, most of the observation pairs still have nothing in common. For instance, the 41st and 44th observations have nothing in common, since the corresponding shows are House of Cards (a political drama) and Brooklyn 99 (a sitcom), respectively. An exception to this would be the 9th and 31st observations, since both of the corresponding shows (Designated Survivor and Bull, respectively) are dramas and both are on the big 4 broadcast networks (though the former airs on ABC while the latter airs on CBS).

Now, let’s assign clusters to the data points. I’ll go with 4 clusters, since that’s how many I used for my k-means analysis (plus I think it’s an ideal amount). I’m going to use the DIANA example I just mentioned:

Now let’s visualize our clusters in a scatterplot (remember to install the factoextra package):

As you can see, cluster 3 has the most observations while cluster 4 has the fewest (only one observation corresponding to Jimmy Kimmel Live). Some of the observations in each cluster have something in common, like the 1st and 2nd observations (NCIS and The Big Bang Theory in cluster 1, both of which air on CBS) and the 42nd and 80th observations in cluster 3 (Watch What Happens Live! and The Chew, both talk shows).

Now, let’s visualize these clusters on a dendrogram (using the same DIANA example):

The first line of code is exactly the same line I used when I first plotted my DIANA dendrogram. The rect.hclust line draws the borders to denote each cluster; remember to set k to the number of clusters you created for your scatterplot (in this case, 4). Granted, the coloring scheme is different from the scatterplot, but you can tell which cluster is which judging by the size of the rectangle (for instance, the rectangle for cluster 4 only contains the 79th observation, even though it is light blue on our dendrogram and purple on our scatterplot). Plus, all the observations are in the same cluster in both the scatterplot and dendrogram.

Thanks for reading.

Michael

R Lesson 13: Naive Bayes Classification

Hello everybody,

It’s Michael, and today’s lesson will be on Naive Bayes classification in R.

But first, what exactly is Naive Bayes classification? It’s a simple probability mechanism based on Bayes’ Theorem.

OK, so what is Bayes’ Theorem? Well, let me give you a little math and history lesson. The theorem was devised sometime in the 18th century by a guy named Reverend Thomas Bayes. Bayes’ Theorem essentially describes the likelihood that an event will occur, based on prior knowledge of conditions that might be related to the event. For instance, if the likelihood of getting into a car accident was based on how much driving experience someone had, then, with Bayes’ Theorem, we can more accurately assess the likelihood that someone will get into a car accident based on the amount of driving experience they have, as opposed to trying to figure out the chances that someone will get into a car accident without knowledge of the person’s driving experience.

Here’s a mathematical representation of Bayes’ Theorem:

P(A|B) = (P(B|A) * P(A)) / P(B)

Here’s an explanation of each part in the equation:

P(A|B)-The conditional probability that event A will occur given that event B has occurred
P(B|A)-The conditional probability that event B will occur given that event A has occurred
P(A)-The probability that event A will occur, regardless of whether event B occurs
P(B)-The probability that event B will occur, regardless of whether event A occurs

Here’s an example. Let’s say there’s a 55% chance that the Miami Heat will win their game on 3/30/19 against the NY Knicks. If they win that game, then there is a 30% chance the Miami Heat will win their game on 4/1/19 against the Boston Celtics. However, if the Heat don’t beat the Knicks, then there is only a 25% chance that they will beat the Celtics. (Keep in mind I made up all of these odds)

So let’s say “Heat beat Knicks” is event A, and “Heat beat Celtics” is event B. Here’s a probability breakdown of all possible outcomes:

P(A)-55%
P(A’)-45%
- the apostrophe right by the A means NOT, as in the likelihood event A will NOT happen (or in this case, the odds that the Heat will lose to the Knicks)
P(B|A)-30%
P(B’|A)-70%
- the odds that the Heat will NOT beat the Celtics if they beat the Knicks
P(B|A’)-25%
P(B’|A’)-75%
- the odds that the Heat will lose to the Celtics if they also lost to the Knicks

The first thing we’d calculate is the probability of the Heat beating the Celtics, which would be the sum of the products of the probabilities for the two ways that can happen: “beating the Knicks, then beating the Celtics” AND “losing to the Knicks, then beating the Celtics”.

Here’s what I mean by “sum of the products of the probability”:

(.55*.3)+(.45*.25)=27.75%

So there is a 27.75% chance the Heat will beat the Celtics on 4/1/19.

If we were wondering what are the odds that the Heat beat the Knicks if they beat the Celtics, we would use Bayes’ Theorem:

(.3*.55)/(.2775)=59.5%

So assuming the Heat beat the Celtics, there is a 59.5% chance that they also beat the Knicks. By the way, the .2775 would be P(B), or the probability that event B will occur, which I calculated in the previous problem.
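
If you’d rather let R do the arithmetic, here’s a minimal sketch of the same two steps. It first gets P(B) by adding up the two ways the Heat can beat the Celtics (P(A)*P(B|A) plus P(A’)*P(B|A’)), then plugs that into Bayes’ Theorem. The variable names are my own, and the probabilities are the made-up ones from the example.

```r
# Made-up probabilities from the Heat example
p_A            <- 0.55  # P(A): Heat beat the Knicks
p_B_given_A    <- 0.30  # P(B|A): Heat beat the Celtics after beating the Knicks
p_B_given_notA <- 0.25  # P(B|A'): Heat beat the Celtics after losing to the Knicks

# P(B): add up both ways the Heat can beat the Celtics
p_B <- p_A * p_B_given_A + (1 - p_A) * p_B_given_notA

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B

print(round(p_B, 4))
print(round(p_A_given_B, 3))
```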

Now, how exactly does Bayes’ Theorem relate to Naive Bayes classification? Naive Bayes is a collection of probability-based classification algorithms based on Bayes’ Theorem; it’s not a single algorithm but several algorithms that share a common principle-that every feature used for classification is assumed to be independent of every other feature once the class is known.

For instance, let’s say we were trying to classify vegetables based on their features. Also, let’s assume that a vegetable that is green, long, and comes in an arch shape is a piece of celery. A Naive Bayes classifier considers that each of these three aforementioned features will contribute independently to the likelihood that a certain vegetable is a piece of celery, regardless of any commonalities between these features. However, classification features do sometimes correlate with each other. This is a disadvantage of using Naive Bayes classification, as this method makes strong assumptions regarding the independence between features (the strong assumptions regarding independence among features are why this classification method is referred to as “Naive Bayes”).

The whole point of Naive Bayes classification is to allow us to predict a class given a set of features. To use another vegetable example, let’s say we could predict whether a vegetable is a carrot, piece of celery, or corn cob (the class) based on its taste, shape, etc. (the features).

Even though Naive Bayes is a relatively simple algorithm, it can hold its own against more complex classification algorithms, especially when working with text. One real-world use of this algorithm is spam detection (as in e-mail spam detection).

Ok, now that I’ve got that explanation out of the way, it’s time to do some Naive Bayes in R. The dataset I will be using is Youtube04-Eminem, which gives 453 random comments on Eminem’s “Love The Way You Lie” video (ft. Rihanna). With this dataset, I will show you guys how to do spam detection with Naive Bayes using R by classifying the comments on this video as spam or non-spam (after all, Naive Bayes works for any type of spam detection, not just email spam).

But first, I wanted to acknowledge that I got this dataset from UCI Machine Learning Repository. Just like Kaggle, this website contains several datasets covering a wide variety of topics (sports, business, etc.) that you can use for analytical projects; here, you can find datasets ranging from the late 1980s to the present year (a lot better than the archaic datasets R offers for free). The website is maintained by the University of California-Irvine Center for Machine Learning and Intelligent Systems.

Now let’s get started. But first, as with any R analysis, let’s try to understand our data:

This dataset contains 453 comments and 5 different variables related to the comments. Here’s what each variable means:

COMMENT_ID-YouTube’s alphanumeric comment ID for each comment
AUTHOR-The YouTube username of the person who wrote the comment; there are only 396 unique usernames because some people commented more than once on this video
DATE-The date and time a comment was posted; the T separates the date from the time
- Fewer than half of the comments have a corresponding date
CONTENT-The content of a comment; there are duplicate comments
CLASS-Whether or not a comment might be spam; 0 denotes non-spam and 1 denotes spam
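
Here’s a rough sketch of how you could poke around a dataset with this layout. I’m using a tiny made-up data frame with the same five columns so the example runs on its own; the rows, usernames and IDs are all hypothetical. With the real file you would presumably start from something like read.csv on the downloaded CSV instead (the exact file name is an assumption on my part).

```r
# Toy stand-in for the real file; every row here is made up for illustration
comments <- data.frame(
  COMMENT_ID = c("id_001", "id_002", "id_003", "id_004"),
  AUTHOR     = c("userA", "userB", "userA", "userC"),
  DATE       = c("2015-05-29T02:30:18", NA, NA, "2015-05-28T21:00:08"),
  CONTENT    = c("Check out my channel", "I love this song",
                 "Please subscribe", "Eminem is the best"),
  CLASS      = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

str(comments)  # variable names, types and the first few values

n_authors <- length(unique(comments$AUTHOR))  # distinct usernames
n_dated   <- sum(!is.na(comments$DATE))       # comments that have a date
print(c(authors = n_authors, dated = n_dated))
```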

So how exactly does Naive Bayes factor into all of this? In this case, Naive Bayes will calculate the likelihood that a comment is spam based on the words it contains. The strong assumptions about independence that are associated with Naive Bayes will come into play here, since the algorithm will assume that the probability of a certain word being found in a spam comment (e.g. Instagram) will be independent of the probability of another word being found in a spam comment (e.g. likes).

Next we create a table using the CLASS variable to show the percentage of spam and non-spam comments (denoted by 1 and 0 respectively):

According to the table, 45.9% of our comments are non-spam (0) while 54.1% are spam (1).
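
For anyone following along without the file, here’s a small sketch of how a table like that can be built with table and prop.table. The CLASS vector below is an assumption: I back-calculated the counts (208 non-spam, 245 spam) from the percentages above and the 453 total, so treat it as a stand-in for the real column.

```r
# CLASS values rebuilt from the quoted percentages (an assumption, not the real column):
# 208 non-spam (0) and 245 spam (1) comments out of 453
CLASS <- rep(c(0, 1), times = c(208, 245))

class_counts <- table(CLASS)                            # raw counts per class
class_pct    <- round(prop.table(class_counts) * 100, 1)  # counts as percentages

print(class_counts)
print(class_pct)  # 45.9 for class 0 and 54.1 for class 1
```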

Now, let’s create subsets of spam and non-spam comments:

In these subsets, I wanted to include the CONTENT (as in the actual comment) and the value of the CLASS variable (1 for spam and 0 for non-spam).
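
Here’s one way subsets like these could be put together using base R’s subset function. The data frame is a made-up stand-in with just the two columns we care about; only the names spam and nonspam come from this lesson.

```r
# Made-up comments standing in for the real dataset
comments <- data.frame(
  CONTENT = c("Check out my channel", "I love this song",
              "Please subscribe", "Eminem is the best"),
  CLASS   = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)

# Keep only the comment text and its label, split by label
spam    <- subset(comments, CLASS == 1, select = c(CONTENT, CLASS))
nonspam <- subset(comments, CLASS == 0, select = c(CONTENT, CLASS))

print(spam)
print(nonspam)
```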

Now let’s make some wordclouds, which can help us visualize words that frequently occur in spam and non-spam comments (remember to install the wordcloud package). We will make two wordclouds, one for spam comments and the other for non-spam comments. Remember to use the subset variables that I just mentioned (spam and nonspam) when creating the wordcloud.

The basic idea of wordclouds is that the bigger a word appears on the wordcloud, the more common that word is in a certain category (spam or non-spam comments).
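
Under the hood, a wordcloud is just a picture of word counts. Here’s a minimal base R sketch that computes those counts for a few made-up spam comments; the comment text is hypothetical, and the tokenizing rule (split on anything that isn’t a letter) is a simplification I picked for the example.

```r
# Made-up spam comments for illustration
spam_text <- c("Check out my channel",
               "Please check my channel and subscribe",
               "Subscribe please")

# Lowercase everything, then split on anything that is not a letter
words <- unlist(strsplit(tolower(spam_text), "[^a-z]+"))
words <- words[words != ""]

# Count how often each word appears, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# With the wordcloud package installed, these counts could be drawn with:
# wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 1)
```

The bigger a count is here, the bigger that word would be drawn in the cloud.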

As you can see, some of the most common words in the spam category include check, please, channel and subscribe. This is not surprising, as many YouTube spammers often post something along the lines of “Hey guys please check out and subscribe to my channel: <link to spam channel>” in the comments section.

In the non-spam category, some of the most common words include Eminem, love, song, and Megan. This is also not surprising, since many people who leave comments on a music video would say how much they love the song and/or the artist (Eminem), often mentioning the artist’s name in the comments. The music video is for a song called “Love The Way You Lie”, which could be another reason why the word “love” is one of the most common amongst non-spam comments. Megan is another one of the most common words amongst non-spam comments; this is likely because Megan Fox appears in the video.

One interesting observation is that the names of the two singers-Eminem and Rihanna-appear in both the spam and non-spam wordclouds. I think it’s interesting because I didn’t think spammers would mention the names of the artists in the comments section. However, keep in mind that Eminem and Rihanna are a lot less commonly mentioned in spam comments than they are in non-spam comments.

Now I’ll show you how to prepare the data for statistical analysis. But first we must create a corpus, which is a collection of text documents built from our file (here, each comment is one document); use the CONTENT variable as it contains the comments themselves. Remember to install the tm package:

In case you are wondering what print(fileCorpus) does, it just prints out all the text of the comments; duplicate comments in our dataset are only printed once.

Now we have to make a document term matrix (or dtm) from the corpus; in our dtm, the comments themselves are shown in rows while the words that occur in the comments are shown in the columns:

In order to prepare our dtm, we must first clean our data by making all words lowercase, removing any numbers and punctuation (and presumably emojis) that are in the comments, and stemming all of our words. Stemming removes the suffix from words, which makes analysis easier since related word forms (such as verbs with different tenses and plural nouns) are combined into one. For instance, “driving” and “drives” would both be converted into “drive”.
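
To make the rows-and-columns idea concrete, here’s a small base R sketch that cleans three made-up comments and builds a document term matrix by hand. This is only an illustration of the structure, not the tm workflow itself: it skips stemming entirely, and the comments and variable names are my own.

```r
# Made-up comments for illustration
comments <- c("Check out my channel!!! 123",
              "I LOVE this song",
              "love love LOVE Eminem")

clean <- tolower(comments)                # make everything lowercase
clean <- gsub("[0-9]+", "", clean)        # remove numbers
clean <- gsub("[[:punct:]]+", "", clean)  # remove punctuation

# Split each comment into words and collect the full vocabulary
tokens <- strsplit(trimws(clean), "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# One row per comment, one column per word; each cell is a word count
dtm <- t(sapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}))
colnames(dtm) <- vocab
rownames(dtm) <- seq_along(comments)
print(dtm)
```

The third row shows why counts matter at this stage: “love” appears three times in that comment, so its cell holds a 3.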

Here is the structure of our dtm, though it isn’t too relevant with regards to our analysis:

Now we must split our data into training and testing datasets. There isn’t a single correct training-to-testing split, but I think 75-25 is ideal; this means we should use 75% of our data to build and train our model which we will then test on the other 25% of the dataset. First we should split the file, then split the dtm:

In this case, observations 1-340 (for both the file and dtm) will be part of the training set while observations 341-453 (for both the file and dtm) will be part of the testing set. And yes, I had to round here to get the stopping point for my training set, since 75% of 453 is 339.75.

As you can see, the spam-to-nonspam proportions in our training set are different from those in our testing set. In our training set, the spam-to-nonspam proportions are 55.6%-44.4%. In our testing set, the spam-to-nonspam proportions are 49.6%-50.4%.
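
Here’s a quick sketch of the index arithmetic behind a sequential 75-25 split like this one. The row counts come from this lesson, but the labels vector is randomly generated just so there is something to compute proportions on, which means its percentages won’t match the ones above.

```r
n      <- 453                # number of comments in the dataset
cutoff <- round(0.75 * n)    # 0.75 * 453 = 339.75, which rounds to 340

train_idx <- 1:cutoff        # observations 1-340
test_idx  <- (cutoff + 1):n  # observations 341-453

# Hypothetical labels, only here so the proportions can be computed
set.seed(1)
labels <- sample(c(0, 1), n, replace = TRUE)

# Class proportions in each piece of the split
train_prop <- prop.table(table(labels[train_idx]))
test_prop  <- prop.table(table(labels[test_idx]))
print(round(train_prop, 3))
print(round(test_prop, 3))
```

Because the split simply follows row order, the two sets can easily end up with slightly different class mixes, which is what happened with our data.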

Now we must further clean our data by removing infrequent words from dtmTrain (the training dataset derived from our document term matrix), as they are unlikely to be useful in our analysis. I will only include words that are used at least 3 times:

The document term matrix uses 1s and 0s to indicate whether or not a word appears in a comment; 1 indicates an appearance and 0 indicates no appearance. This conversion is applied to every column (hence the MARGIN = 2).
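
Here’s a minimal sketch of both steps on a tiny hand-made matrix of word counts. The numbers and the helper name to_binary are hypothetical; only dtmTrain, the at-least-3 cutoff and MARGIN = 2 come from this lesson. I’m using a plain matrix and colSums here, which is an assumption for illustration rather than what the tm package does internally.

```r
# Hypothetical word counts: rows are comments, columns are words
dtmTrain <- matrix(
  c(2, 0, 1,
    1, 0, 1,
    0, 1, 0,
    1, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("check", "song", "out"))
)

# Keep only the words used at least 3 times across all comments
freq_words <- colnames(dtmTrain)[colSums(dtmTrain) >= 3]
dtm_freq   <- dtmTrain[, freq_words, drop = FALSE]

# Turn counts into 1 (appears) or 0 (does not appear), one column at a time
to_binary  <- function(x) ifelse(x > 0, 1, 0)
dtm_binary <- apply(dtm_freq, MARGIN = 2, to_binary)
print(dtm_binary)
```

In this toy example “song” is dropped because it only shows up once, and the 2 in the first row of “check” becomes a 1.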

Now let’s create our Naive Bayes classifier. Remember to install the package e1071:

Also remember to use training, not testing datasets!

Now let’s see how our classifier works, using the word check (which is commonly found in comments like “Plz check out my channel”) as an example. Using our trainLabels, the output for the word check is displayed:

In this table, the likelihood that the word check will occur in a spam and non-spam comment is displayed. Here’s a breakdown as to what this table means:

There is a 100% chance that the word check will NOT be found in a non-spam comment but only a 33.9% chance that check will NOT be found in a spam comment.
On the other hand, there is a 0% chance that the word check will be found in a non-spam comment but a 66.1% chance that check will be found in a spam comment.
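
If you’re curious where numbers like these come from, they are just row-wise proportions: within each class, what share of comments contain the word. Here’s a base R sketch that reproduces them. The counts are an assumption on my part, back-calculated from the percentages in this lesson (151 non-spam and 189 spam training comments, with 125 of the spam comments containing check), so this is a reconstruction rather than the classifier’s actual output.

```r
# Counts rebuilt from the quoted percentages (an assumption):
# 151 non-spam comments, none containing "check";
# 189 spam comments, 125 of which contain "check"
trainLabels <- factor(rep(c(0, 1), times = c(151, 189)))
check <- factor(c(rep("No", 151), rep("Yes", 125), rep("No", 64)),
                levels = c("No", "Yes"))

# Row-wise proportions: P(check = No or Yes | class)
cond_prob <- prop.table(table(trainLabels, check), margin = 1)
print(round(cond_prob, 3))
```

Each row adds up to 1, since within a class a comment either contains the word or it doesn’t.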

To evaluate the accuracy of our Naive Bayes classifier, we create a confusion matrix using our testing set. Remember to install the gmodels package:

Our testing set has 113 observations; the number of comments that are correctly and incorrectly classified is shown in the matrix above. The sum of the numbers in 0A-0P and 1A-1P (A means actual and P means predicted) represents the number of correctly classified comments-110 (56+54). On the other hand, the sum of the numbers in 0A-1P and 1A-0P represents the number of incorrectly classified comments-3 (2+1). Since only 3 of our 113 comments were misclassified, our Naive Bayes classifier has fantastic accuracy (97.3%).
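
Here’s a small sketch of the accuracy calculation using the four counts quoted above, typed into a plain matrix. Which off-diagonal cell holds the 2 and which holds the 1 is my assumption, but it doesn’t change the accuracy either way.

```r
# Counts quoted in the lesson; rows are actual classes, columns are predictions.
# The placement of the 1 and the 2 off the diagonal is an assumption.
confusion <- matrix(
  c(56,  1,
     2, 54),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

correct   <- sum(diag(confusion))     # comments on the diagonal were classified correctly
incorrect <- sum(confusion) - correct
accuracy  <- correct / sum(confusion)

print(confusion)
print(round(accuracy * 100, 1))  # 97.3
```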

Thanks for reading,

Michael