R Analysis 3: K-Means vs Hierarchical Clustering

Hello everybody,

It’s Michael, and I will be doing an R analysis in this post. More specifically, I will be doing a comparative clustering analysis, which means I’ll take a dataset and perform both k-means and hierarchical clustering on it to compare the results of each method. However, this analysis will be unique, since I will be revisiting one of the earliest datasets I used for this blog, TV shows, which first appeared in R Lesson 4: Logistic Regression Models on July 11, 2018 (exactly nine months ago!). In case you forgot what this dataset was about, it basically gives 85 shows that aired during the 2017-18 TV season and whether or not they were renewed for the 2018-19 TV season, along with other aspects of those shows (such as the year they premiered and the network they air on). I’ll admit I chose this dataset because I wanted to analyze one of my old datasets in a different way (remember, I performed linear and logistic regression the first time I used this dataset).

So, as always, let’s load the file and get a basic understanding of our data:

Screen Shot 2019-04-07 at 10.27.47 PM
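The loading step shown in the screenshot probably looks something like this (the actual filename isn’t visible, so “TVShows.csv” is just a placeholder):

```r
# Load the TV-show dataset and inspect its structure
tv <- read.csv("TVShows.csv")  # hypothetical filename
str(tv)  # should report 85 observations of 10 variables
```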

As you can see, we have 85 observations of 10 variables. Here’s a detailed breakdown of each variable:

  • TV.Show-The name of the show
  • Genre-The genre of the show
  • Premiere.Year-The year the show premiered; for revivals like Roseanne, I used the original premiere year (1988) as opposed to the revival premiere year (2018)
  • X..of.seasons..17.18.-How many seasons the show had aired as of the end of the 2017-18 TV season
  • Network-The network (or streaming service) the show airs on
  • X2018.19.renewal.-Whether or not the show was renewed for the 2018-19 TV season; 1 denotes renewal and 0 denotes cancellation
  • Rating-The content rating for the show. Here’s a more detailed breakdown:
    • 1 means TV-G
    • 2 means TV-PG
    • 3 means TV-14
    • 4 means TV-MA
    • 5 means not applicable
  • Usual.Day.of.Week-The usual day of the week the show airs its new episodes. Here’s a more detailed breakdown:
    • 1 means the show airs on Mondays
    • 2 means the show airs on Tuesdays
    • 3 means the show airs on Wednesdays
    • 4 means the show airs on Thursdays
    • 5 means the show airs on Fridays
    • 6 means the show airs on Saturdays
    • 7 means the show airs on Sundays
    • 8 means the show doesn’t have a regular air-day (usually applies to talk shows or shows on streaming services)
  • Medium-the type of network the show airs on. Here’s a more detailed breakdown:
    • 1 means the show airs on either one of the big 4 broadcast networks (ABC, NBC, FOX or CBS) or the CW (which isn’t part of the big 4)
    • 2 means the show airs on a cable channel (AMC, Bravo, etc.)
    • 3 means the show airs on a streaming service (Hulu, Amazon Prime, etc.)
  • Episode Count-the new variable I added for this analysis; this variable shows how many episodes a show has had overall since the end of the 2017-18 TV season. For certain shows whose seasons cross the 17-18 and 18-19 seasons, I will count how many episodes each show has had as of September 24, 2018 (the beginning of the 2018-19 TV season)

Now that we’ve learned more about our variables, let’s start our analysis. But first, I convert the final four variables into factors, since I think it’ll be more appropriate for the analysis:

Screen Shot 2019-04-07 at 10.28.52 PM
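A sketch of that conversion step. I’m assuming the four converted variables are the renewal flag, Rating, Usual.Day.of.Week, and Medium, since Episode.Count needs to stay numeric for the clustering; the screenshot isn’t reproduced here, so the exact column choice is a guess:

```r
# Convert the coded categorical variables to factors
cols <- c("X2018.19.renewal.", "Rating", "Usual.Day.of.Week", "Medium")
tv[cols] <- lapply(tv[cols], as.factor)
str(tv[cols])  # confirm they are now factors
```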

Ok, now onto the analysis. I’ll start with k-means:

Screen Shot 2019-04-08 at 3.18.33 PM

Here, I created a data subset using our third and tenth columns (Premiere.Year and Episode.Count respectively) and displayed the head (the first six observations) of my cluster.
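In code, that subsetting step might look like this:

```r
# Keep only the two numeric clustering variables
cluster1 <- tv[, c(3, 10)]  # Premiere.Year and Episode.Count
head(cluster1)              # first six observations
```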

Now let’s do some k-means clustering:

Screen Shot 2019-04-08 at 3.26.48 PM

I created the variable tvCluster to store my k-means model, using the name of my data subset (cluster1), the number of clusters I wanted (4), and nstart, which tells the model to try 35 random starting assignments and keep the one with the lowest within-cluster variation.
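That call is likely something like the following (the set.seed value is my own addition so the random starts are reproducible; the screenshot may not have used one):

```r
set.seed(123)  # hypothetical seed for reproducible random starts
tvCluster <- kmeans(cluster1, centers = 4, nstart = 35)
tvCluster  # print the full model summary
```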

I then type in tvCluster to get a better idea of what my cluster looks like. The first thing I see is “K-means clustering with (X) clusters of sizes”, before mentioning the number of observations in each cluster (which are 17, 64, 1 and 3, respectively). In total, all 85 observations were used, since I didn’t have any missing data points.

The next thing that is mentioned is cluster means, which gives the mean for each variable used in the clustering analysis (in this case, Episode.Count and Premiere.Year). Interestingly enough, Cluster 2 has the highest mean Premiere.Year (2015) but the lowest mean Episode.Count (49, rounded to the nearest whole number).

After that, you can see the clustering vector, which shows you which observations belong to which cluster. Even though the position of the observation (e.g. 1st, 23rd) isn’t explicitly mentioned, you can tell which observation you are looking at since the vector starts with the first observation and works its way down to the eighty-fifth observation (and since there is no missing data, all 85 observations are used in this clustering model). For instance, the first three observations all correspond to cluster 1 (the first three shows listed in this dataset are NCIS, The Big Bang Theory, and The Simpsons). Likewise, the final three observations all correspond to cluster 2 (the corresponding shows are The Americans, Baskets, and Comic Book Men).

Next you will see the within-cluster sum of squares for each cluster, which I will abbreviate as WCSS; this is a measurement of the variability of the observations within each cluster. Remember that the smaller this amount is, the more compact the cluster. In this case, 3 of the 4 WCSS values are above 100,000, while the other is 0 (which I’m guessing is cluster 3, which has only one observation).

Last but not least is between_SS/total_SS = 94.5%, the ratio of the between-cluster sum of squares to the total sum of squares, which as you may recall from the k-means lesson is a measure of the goodness-of-fit of the model. 94.5% indicates an excellent fit for this model.

Finally, let’s graph our model:

Screen Shot 2019-04-08 at 10.32.45 PMScreen Shot 2019-04-08 at 10.32.27 PM
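A plot like the one above can be produced by coloring each point by its cluster assignment; this is a sketch, not necessarily the exact code in the screenshot:

```r
# Color each show by its assigned k-means cluster
plot(cluster1$Premiere.Year, cluster1$Episode.Count,
     col = tvCluster$cluster,
     xlab = "Premiere year", ylab = "Episode count")
```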

In this graph, the premiere year is on the x-axis, while the episode count is on the y-axis. As you can see, the 2 largest clusters (represented by the black and red dots) are fairly close together, while the 2 smallest clusters (represented by the blue and green dots) are fairly spread out (granted, the two smallest clusters only have 1 and 3 observations, respectively). One thing I wanted to point out about this graph is that a show that premiered further back doesn’t always have more episodes than a show that premiered fairly recently (let’s say anytime from 2015 onward). This happens for several reasons, including:

  • Revived series like American Idol (which took a hiatus in 2017 before its 2018 revival) and Roseanne (which had been dormant for 21 years before its 2018 revival)
  • Different shows air a different number of episodes per season; for instance, talk shows like Jimmy Kimmel Live have at least 100 episodes per season, while shows on the Big 4 networks tend to have between 20-24 episodes per season (think The Simpsons, The Big Bang Theory, and Grey’s Anatomy). Cable and streaming shows usually have even fewer episodes per season (between 6-13, like how South Park only does 10-episode seasons)
  • Some shows just take long breaks (like how Jessica Jones on Netflix didn’t release any new episodes between November 2015 and March 2018)

Now time to do some hierarchical clustering on our data. And yes, I plan to use all the methods covered in the post R Lesson 12: Hierarchical Clustering.

Let’s begin by scaling all numerical variables in our data (don’t include the ones that were converted into factor types):

Screen Shot 2019-04-09 at 3.27.03 PM
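That scaling step might look like the following. I’m assuming the remaining numeric variables are Premiere.Year, the seasons count, and Episode.Count, since the coded variables were converted to factors earlier:

```r
# Scale only the variables that are still numeric
tv.scaled <- scale(tv[, c("Premiere.Year", "X..of.seasons..17.18.", "Episode.Count")])
head(tv.scaled)  # each column now has mean 0 and sd 1
```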

Now let’s start off with some agglomerative clustering (with both the dendrogram and code):

Screen Shot 2019-04-09 at 3.32.27 PMScreen Shot 2019-04-09 at 3.32.08 PM

After setting up the model using Euclidean distance and complete linkage, I then plot my dendrogram, which you can see above. This dendrogram is a lot neater than the ones I created in R Lesson 12, but then again, this dataset only has 85 observations, while the one in that post had nearly 7,200. The names of the shows themselves aren’t mentioned, but each of the numbers displayed correspond to a certain show. For instance, 74 corresponds to The Voice, since it is the 74th show listed in our dataset. Look at the spreadsheet to figure out which number corresponds to which show.
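The setup described above (Euclidean distance, complete linkage) can be sketched as:

```r
# Euclidean distance matrix, then complete-linkage clustering
d <- dist(tv.scaled, method = "euclidean")
hc1 <- hclust(d, method = "complete")
plot(hc1, cex = 0.6, hang = -1)  # draw the dendrogram
```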

You may recall that I mentioned two general rules for interpreting dendrograms. They are:

  • The higher the height of the fusion, the more dissimilar two items are
  • The wider the branch between two observations, the more dissimilar they are

Those rules certainly apply here, although the maximum height is 8 in this case, as opposed to 70 in R Lesson 12. For instance, since the branch between shows 7 and 63 is fairly narrow, these two shows have a lot in common according to the model (even though the two shows in question are Bob’s Burgers and The Walking Dead, the former being a cartoon sitcom and the latter revolving around the zombie apocalypse). On the other hand, the gap between shows 74 and 49 is wider, which means they don’t share much in common (even though the two shows are The Voice and Shark Tank, which both qualify as reality shows, though the former is more competition-oriented than the latter). All in all, I think it’s interesting to see how these clusters were created, since the shows that were grouped closest together often seem to have nothing in common.

Now let’s try AGNES (remember, that stands for agglomerative nesting):

Screen Shot 2019-04-09 at 3.56.41 PM

First of all, remember to install the package cluster. Also, remember that the one important result is the ac, or agglomerative coefficient, which measures the strength of the clustering structure. As you can see, our ac is 96.1%, which indicates a very strong clustering structure (I personally think any ac of at least 90% is good).
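A minimal sketch of the AGNES step, assuming the scaled data frame from earlier:

```r
library(cluster)  # provides agnes()
hc2 <- agnes(tv.scaled, method = "complete")
hc2$ac  # agglomerative coefficient (roughly 0.961 here)
```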

Now let’s compare this ac (which used complete linkage) to the ac we get with other linkage methods (not including centroid). Remember to install the purrr package:

Screen Shot 2019-04-09 at 4.03.23 PM

Of the four linkage methods, Ward’s method gives us the highest agglomerative coefficient (98.2%), so that’s what we’ll use for the next part of this analysis.
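The comparison loop likely resembles the following, using purrr’s map_dbl to compute the ac for each linkage method:

```r
library(cluster)
library(purrr)

# Linkage methods to compare (centroid excluded)
m <- c(average = "average", single = "single",
       complete = "complete", ward = "ward")

# Compute the agglomerative coefficient for each method
ac <- function(x) agnes(tv.scaled, method = x)$ac
map_dbl(m, ac)  # Ward's method should give the highest value
```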

Using Ward’s method and AGNES, here’s a dendrogram of our data (and the corresponding code):

Screen Shot 2019-04-10 at 3.08.50 PMScreen Shot 2019-04-10 at 3.08.32 PM
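A sketch of that step:

```r
# AGNES with Ward's linkage, plotted with pltree()
hc3 <- agnes(tv.scaled, method = "ward")
pltree(hc3, cex = 0.6, hang = -1, main = "Dendrogram of AGNES (Ward)")
```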

Aside from having a greater maximum height than our previous dendrogram (the latter had a maximum height of 8 while this diagram has a maximum height of presumably 18), the observations are also placed differently. For instance, unlike in the previous dendrogram, observations 30 and 71 are side by side. But just as with the last dendrogram, shows that have almost nothing in common are oddly grouped together; for instance, the 30th and 71st observations correspond to The Gifted and Transparent; the former is a sci-fi show based on the X-Men universe while the latter is a transgender-oriented drama. The 4th and 17th observations are another good example of this, as the corresponding shows are The Simpsons and Taken; the former is a long-running cartoon sitcom while the latter is based on an action movie trilogy.

The last method I will use for this analysis is DIANA (which stands for divisive analysis). Recall that the main difference between DIANA and AGNES is that DIANA works in a top-down manner (objects start in a single supercluster and are divided into smaller clusters until single-element clusters are created) while AGNES works in a bottom-up manner (objects start in single-element clusters and are merged into progressively larger clusters until a single supercluster is created). Here’s the code and dendrogram for our DIANA analysis:

Screen Shot 2019-04-10 at 3.24.19 PMScreen Shot 2019-04-10 at 3.24.04 PM
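The DIANA step probably looks like this (note that diana() doesn’t take a linkage method, since it divides rather than merges):

```r
hc4 <- diana(tv.scaled)
hc4$dc  # divisive coefficient (roughly 0.959 here)
pltree(hc4, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
```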

Remember that the divisive coefficient serves the same purpose as the agglomerative coefficient: both measure the strength of clustering structure, and the closer each is to 1, the stronger that structure (with .9, or 90%, and above indicating excellent structure). In this case, the dc (divisive coefficient) is 95.9%, which indicates excellent clustering structure.

Just as with the previous two dendrograms, most of the observation pairs still have nothing in common. For instance, the 41st and 44th observations have nothing in common, since the corresponding shows are House of Cards (a political drama) and Brooklyn 99 (a sitcom), respectively. An exception to this would be the 9th and 31st observations, since both of the corresponding shows-Designated Survivor and Bull respectively-are dramas and both are on the big 4 broadcast networks (though the former airs on ABC while the latter airs on FOX).

Now, let’s assign clusters to the data points. I’ll go with 4 clusters, since that’s how many I used for my k-means analysis (plus I think it’s an ideal number). I’m going to use the DIANA example I just mentioned:

Screen Shot 2019-04-10 at 10.35.15 PM
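A sketch of the cluster assignment using cutree (the diana object is converted to an hclust object first):

```r
# Cut the DIANA dendrogram into 4 clusters
clust <- cutree(as.hclust(hc4), k = 4)
table(clust)  # how many shows landed in each cluster
```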

Now let’s visualize our clusters in a scatterplot (remember to install the factoextra package):

Screen Shot 2019-04-10 at 10.34.58 PMScreen Shot 2019-04-10 at 10.34.40 PM
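The scatterplot step might look like this:

```r
library(factoextra)  # provides fviz_cluster()
fviz_cluster(list(data = tv.scaled, cluster = clust))
```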

As you can see, cluster 3 has the most observations while cluster 4 has the least (only one observation corresponding to Jimmy Kimmel Live). Some of the observations in each cluster have something in common, like the 1st and 2nd observations (NCIS and The Big Bang Theory in cluster 1, both of which air on CBS) and the 42nd and 80th observations in cluster 3 (Watch What Happens Live! and The Chew-both talk shows).

Now, let’s visualize these clusters on a dendrogram (using the same DIANA example):

Screen Shot 2019-04-11 at 3.21.10 PMScreen Shot 2019-04-11 at 3.20.51 PM

The first line of code is exactly the same line I used when I first plotted my DIANA dendrogram. The rect.hclust line draws the borders to denote each cluster; remember to set k to the number of clusters you created for your scatterplot (in this case, 4). Granted, the coloring scheme is different from the scatterplot, but you can tell which cluster is which judging by the size of the rectangle (for instance, the rectangle for cluster 4 only contains the 79th observation, even though it is light blue on our dendrogram and purple on our scatterplot). Plus, all the observations are in the same cluster in both the scatterplot and dendrogram.
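Those two lines probably look like this:

```r
# Redraw the DIANA dendrogram, then outline the 4 clusters
pltree(hc4, cex = 0.6, hang = -1)
rect.hclust(as.hclust(hc4), k = 4, border = 2:5)  # one color per cluster
```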

Thanks for reading.

Michael


R Lesson 12: Hierarchical Clustering

Hey everybody,

It’s Michael, and today’s lesson will be on hierarchical clustering in R. The dataset I will be using is AppleStore, which contains statistics, such as cost and user ratings, for roughly 7,200 apps that were sold on the iTunes Store as of July 2017.

  • I found this dataset on Kaggle, which is an excellent source to find massive datasets on various topics (whether politics, sports, or something else entirely). You can even sign up for Kaggle with your Facebook or Gmail account to access the thousands of free datasets. The best part is that Kaggle’s datasets have recent information, which is much better than R’s built-in sample datasets, whose information is often 40-70 years old. (I wasn’t paid to write this blurb; I just really think Kaggle is an excellent resource for finding data for analytical projects.)

Now, as I had mentioned, today’s R post will be about hierarchical clustering. In hierarchical clustering, clusters are created in such a way that they have a hierarchy. To get a better idea of the concept of hierarchy, think of the US. The US is divided into 50 states. Each state has several counties, which in turn have several cities. Those cities each have their own neighborhoods (like Hialeah, Westchester, and Miami Springs in Miami, Florida). Those neighborhoods have several streets, which in turn have several houses. Get the idea?

So, as we should always do with our analyses, let’s first load the file into R and understand our data:

15mar capture1

In this dataset, there are 7197 observations (referring to apps) of 16 variables (referring to aspects of the apps). Here is what each variable means:

  • id-the iTunes ID for the app
  • track_name-the name of the app
  • size_bytes-how much space the app takes up (in bytes)
  • currency-what currency the app uses for purchases (the only currency in this dataset is USD, or US dollars)
  • price-how much the app costs to buy on the iTunes store (0 means it’s either a free or freemium app)
  • rating_count_tot-the user rating count for all versions of the app
  • rating_count_ver-the user rating count for the current version of the app (whatever version was current in July 2017)
  • user_rating-the average user rating for all versions of the app
  • user_rating_ver-the average user rating for the current version of the app (as of July 2017)
  • ver-the version code for the most recent version of the app
  • cont_rating-the content rating for the app; given as a number followed by a plus sign (so 7+ would mean the app is meant for kids who are at least 7)
  • prime_genre-the category the app falls under
  • sup_devices.num-how many devices the app supports
  • iPadSc_urls.num-how many screenshots of the app are shown for display
  • lang.num-how many languages the app supports
  • vpp_lic-tells us whether the app has VPP Device Based Licensing enabled (this is a special app-licensing service offered by Apple)

Now that I’ve covered that, let’s start clustering. But first, let’s make a missmap to see if we’ve got any missing data and if so, how much:

15Mar capture315Mar capture2

According to the missingness map, there are no missing spots in our data, which is a good thing.
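The missmap step might look like this. I’m assuming the data frame is named apps and that the missmap() function comes from the Amelia package, which is where it usually lives:

```r
library(Amelia)  # provides missmap()
missmap(apps, main = "Missingness map")  # "apps" is a hypothetical name
```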

Now, the last thing we need to do before we start clustering is scale the numeric variables in our data, which is something we should do so that the clustering algorithm doesn’t have to depend on an arbitrary variable unit. Scaling changes the means of numeric variables to 0 and the standard deviations to 1. Here’s how you can scale:

16Mar capture

I created a data frame using all the numeric variables (of which there are 10) and then scaled that data frame. It’s that simple.
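A sketch of that step, selecting the numeric columns programmatically rather than by name:

```r
# Pick out the numeric columns, then scale them
nums <- sapply(apps, is.numeric)
apps.scaled <- scale(apps[, nums])
summary(apps.scaled)  # each column should now have mean 0
```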

  • I know I didn’t scale the data when I covered k-means clustering; I didn’t think it was necessary to do so. But I think it’s necessary to do so for hierarchical clustering.

Now let’s start off with some agglomerative clustering:

16Mar capture316Mar capture2

First, we have to specify what distance method we would like to use for cluster distance measurement (let’s stick with Euclidean, but there are others you can use). Next, we have to create a cluster variable –hc1– using the hclust function and specifying a linkage method (let’s use complete, but there are other options). Finally, we plot the cluster with the parameters specified above, and we see a funny-looking tree-like graph called a dendrogram. I know it’s a little messy, and you can’t see the app names, but that’s what happens when your dataset is 7,197 data points long.

How do we interpret our dendrogram? In our diagram, each leaf (the bottommost part of the diagram) corresponds to one observation (or in this case, one app). As we move up the dendrogram, you can see that similar observations (or similar apps) are combined into branches, which are fused at higher and higher heights. The heights in this dendrogram range from 0 to 70. Finally, two general rules to remember when interpreting dendrograms are that the higher the height of the fusion, the more dissimilar two items are, and the wider the branch between two observations, the less similar they are.

You might also be wondering “What are linkage methods?” They are essentially distance metrics that are used to measure distance between two clusters of observations. Here’s a table of the five major linkage methods:

table

And in case you were wondering what all the variables in the formula mean:

  • X1, X2, …, Xk = Observations from cluster 1
  • Y1, Y2, …, Yl = Observations from cluster 2
  • d (x, y) = Distance between a subject with observation vector x and a subject with observation vector y
  • ||.|| = Euclidean norm

Complete linkage is the most popular method for hierarchical clustering.

Now let’s try another similar method of clustering known as AGNES (an acronym for agglomerative nesting):

Screen Shot 2019-03-19 at 11.29.07 PM

This method is pretty similar to the previous method (which is why I didn’t plot a dendrogram for this example) except that this method will also give you the agglomerative coefficient, which is a measurement of clustering structure (I used the line hc2$ac to find this amount). The closer this amount is to 1, the stronger the clustering structure; since the ac here is very, very close to 1, then this model has very, very strong clustering structure.

  • Remember to install the package “cluster”!

Now let’s compare the agglomerative coefficient using complete linkage with the agglomerative coefficient using the other linkage methods (not including centroid):

Screen Shot 2019-03-22 at 8.57.18 PM

  • Before writing any of this code, remember to install the purrr (yes with 3 r’s) package.

We first create a vector containing the method names, then create a comparison of linkage methods by analyzing which one has the highest agglomerative coefficient (in other words, which method gives us the strongest clustering structure). In this case, Ward’s method gives us the highest agglomerative coefficient (99.7%).

Next we’ll create a dendrogram using AGNES and Ward’s method:

Screen Shot 2019-03-22 at 9.32.37 PMScreen Shot 2019-03-22 at 9.32.17 PM

Next, I’ll demonstrate another method of hierarchical clustering known as DIANA (which is an acronym for divisive analysis). The main difference between this method and AGNES is that DIANA works in a top-down manner (meaning it starts with all objects in a single supercluster and further divides objects into smaller clusters until single-element clusters consisting of each individual observation are created) while AGNES works in a bottom-up manner (meaning it starts with single-element clusters consisting of each observation then groups similar elements into clusters, working its way up until a single supercluster is created). Here’s the code and the dendrogram for our DIANA example:

Screen Shot 2019-03-23 at 12.39.01 PMScreen Shot 2019-03-23 at 12.38.40 PM

As you can see, we have a divisive coefficient instead of an agglomerative coefficient, but each serves the same purpose, which is to measure the amount of clustering structure in our data (and in both cases, the closer this number is to 1, the stronger the clustering structure). In this case, the divisive coefficient is .992, which indicates a very strong clustering structure.

Last but not least, let’s assign clusters to the data points. We’ll use the function cutree, which splits a dendrogram into several groups, or clusters, based on either the desired number of clusters (k) or the height (h). At least one of these criteria must be specified; k will override h if you mention both. Here’s the code (using our DIANA example):

Screen Shot 2019-03-23 at 12.47.22 PM
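A sketch illustrating both criteria (the object name hc5 and the height 40 are my own placeholders, not taken from the screenshot):

```r
# k and h are alternative stopping criteria; k wins if both are given
clusters.k <- cutree(as.hclust(hc5), k = 5)   # ask for exactly 5 clusters
clusters.h <- cutree(as.hclust(hc5), h = 40)  # cut the tree at height 40
```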

We could also visualize our clusters in a scatterplot using the factoextra package (remember to install this!):

Screen Shot 2019-03-23 at 1.11.51 PMScreen Shot 2019-03-23 at 1.11.31 PM

In this graph, the observations are split into 5 clusters; there is a legend to the right of the graph that denotes which cluster corresponds to which shape. There are varying numbers of observations in each cluster; for instance, cluster 1 appears to have the majority of the observations, while cluster 3 only has 2 observations and cluster 5 only has 1 observation. The observations, or apps, are denoted by numbers* rather than names, so the 2 apps in cluster 3 are the 116th and 1480th observations in the dataset, which correspond to Proloquo2Go and LAMP Words for Life, both of which are assistive communication tools for non-verbal disabled people. Likewise, the only app in cluster 5 is the 499th observation in the dataset, which corresponds to Infinity Blade, which was a fighting/role-playing game (it was removed from the App Store on December 10, 2018 due to difficulties in updating the game for newer hardware).

*the numbers represent the observation’s place in the dataset, so 33 would correspond to the 33rd observation in the dataset

  • If you’re wondering what Dim1 and Dim2 have to do with our analysis, they are the two dimensions chosen by R that show the most variation in the data. The amount of variation that each dimension accounts for in the overall dataset is given as a percentage; Dim1 accounts for 21.2% of the variation, while Dim2 accounts for 13.7% of the variation in the data.

You can also visualize clusters inside the dendrogram itself by inserting rectangular borders in the dendrogram. Here’s how (using our DIANA example):

Screen Shot 2019-03-23 at 3.34.12 PMScreen Shot 2019-03-23 at 3.33.54 PM

The first line of code is the same pltree line that I used when I first made the DIANA dendrogram. The second line creates the rectangular borders for our dendrogram. Remember to set k to however many clusters you used in your scatterplot (5 in this case). As for the border argument, it takes a vector of color codes, one for each cluster, so the range you supply should contain k integers; here, 2:6 gives us the five colors we need for our five clusters. The exact starting number doesn’t matter much; 2 is a good choice because color 1 is black, which can be hard to distinguish from the dendrogram itself.

As you can see, the red square takes up the bulk of the diagram, which is similar to the case with our scatterplot (where the red area consists of the majority of our observations).

Thanks for reading,

Michael