R Analysis 10: Linear Regression, K-Means Clustering, & the 2020 NBA Playoffs


Hello everybody,

Michael here, and today’s post will be an R analysis on the 2020 NBA Playoffs.

As many of you know, the 2019-20 NBA Season was suspended on March 11, 2020 due to COVID-19. However, the season resumed on July 30, 2020 inside a “bubble” (essentially an isolation zone without fans present) in Disney World. 22 teams played 8 regular-season “seeding” games in the bubble before playoffs commenced on August 14, 2020. The resumed season concluded on October 11, 2020, when the LA Lakers defeated the Miami Heat in the NBA Finals to win their 17th championship. The data comes from https://www.basketball-reference.com/playoffs/NBA_2020_totals.html (Basketball Reference), which is a great site if you’re looking for basketball statistics and/or want to do some basketball analyses.

Before we start analyzing the data, let’s first load it into R and learn about the data (and here’s the data):

This dataset contains the names of 217 NBA players who participated in the 2020 Playoffs along with 30 other types of player statistics.

Now, I know it’s a long list, but here’s an explanation of all 31 variables in this dataset:

  • Player-The name of the player
  • Pos-The player’s position on the team. For an explanation of the five main basketball positions, please check out this link https://jr.nba.com/basketball-positions/
  • Age-The age of the player as of July 30, 2020 (the date the NBA season resumed)
  • Tm-The team the player played for during the 2019-20 NBA season
  • G-The number of games the player played in during the 2020 NBA playoffs
  • GS-The number of playoff games in which the player was in the starting lineup
  • MP-The number of minutes a player was on the court during the playoffs
  • FG-The number of field goals a player made during the playoffs
    • For those who don’t know, a field goal in basketball is any shot that’s not a free throw.
  • FGA-The number of field goals a player attempted during the playoffs (this includes both successful and unsuccessful field goals)
  • FG.-The percentage of a player’s field goal attempts that were successful
  • X3P-The number of 3-point field goals a player made during the playoffs
    • Field goals can be worth either 2 or 3 points, depending on where the player shoots the field goal from
  • X3PA-The number of 3-point field goals a player attempted during the playoffs (both successful and unsuccessful 3-pointers)
  • X3P.-The percentage of a player’s 3-point field goal attempts that were successful
  • X2P-The number of 2-point field goals a player made during the playoffs
  • X2PA-The number of 2-point field goals a player attempted during the playoffs (both successful and unsuccessful 2-pointers)
  • X2P.-The percentage of a player’s 2-point field goal attempts that were successful
  • eFG.-The percentage of a player’s successful field goal attempts, adjusted for the fact that 3-point field goals are worth more points than 2-point field goals
  • FT-The number of free throws a player made during the playoffs
  • FTA-The number of free throws a player attempted during the playoffs (both successful and unsuccessful attempts)
  • FT.-The percentage of a player’s free throw attempts that were successful during the playoffs
  • ORB-The number of offensive rebounds a player grabbed during the playoffs
    • A player gets a rebound when they retrieve the ball after another player misses a free throw or field goal. The player gets an offensive rebound if their team is currently on offense and a defensive rebound if their team is currently on defense.
  • DRB-The number of defensive rebounds a player grabbed during the playoffs
  • TRB-The sum of a player’s ORB and DRB
  • AST-The number of assists a player recorded during the playoffs
    • A player gets an assist when they pass the ball to a teammate who then scores a field goal.
  • STL-The number of steals a player recorded during the playoffs
    • A player gets a steal when they cause a player on the opposing team to turn the ball over.
  • BLK-The number of blocks a player recorded during the playoffs
    • A player gets a block when they deflect a field goal attempt from a player on the other team.
  • TOV-The number of turnovers a player committed during the playoffs
    • A player commits a turnover when they lose possession of the ball to the opposing team. Just in case you were wondering, the goal for a player is to commit as few turnovers as possible.
  • PF-The number of personal fouls a player committed during the playoffs
    • A player commits a personal foul when they make illegal personal contact with a player on the other team. The goal for players is to commit as few personal fouls as possible, since they will be disqualified from the remainder of the game if they commit too many.
  • PTS-The total number of points a player scored during the playoffs (from field goals and free throws combined)
  • Team-The name of the team the player played for during the 2020 NBA playoffs
    • Team and Tm basically give you the same information, except Team gives you the name of the team (e.g. Mavericks) while Tm gives you the three-letter abbreviation for the team (e.g. in the case of the Mavericks the 3-letter abbreviation is DAL)
  • Position-How far the player’s team advanced in the 2020 NBA playoffs. There are 7 possible values for Position, which include:
    • ECR1-The team made it to the first round of the Eastern Conference playoffs
    • ECSF-The team made it to the Eastern Conference Semifinals
    • ECF-The team made it to the Eastern Conference Finals
    • WCR1-The team made it to the first round of the Western Conference playoffs
    • WCSF-The team made it to the Western Conference Semifinals
    • WCF-The team made it to the Western Conference Finals
    • Finals-The team made it to the 2020 NBA Finals

Whew, that was a lot of information to explain, but I felt it was necessary to explain this in order to better understand the data (and the analyses I will do).

Now, before we start creating analyses, let’s first create a missingness map (missmap) to see if there are any missing values in the data (to create the missmap, remember to install the Amelia package). To create the missmap, use this line of code: missmap(file)

As you can see here, 98% of the values are present, while only 2% are missing.

Let’s take a look at the five columns with missing data. They are:

  • FT.-free throw percentage
  • X3P.-percentage of 3-pointers made during playoffs
  • X2P.-percentage of 2-pointers made during playoffs
  • eFG.-percentage of successful field goal attempts during playoffs (adjusted for the fact that 3-pointers are worth more than 2-pointers)
  • FG.-percentage of successful field goal attempts during playoffs (not adjusted for the score difference between 2- and 3-pointers)

What do all of these columns have in common? They are all percentages, and the reason data would be missing in these columns is that a player has no attempts in that category, which makes the percentage undefined (you can’t divide by zero). For instance, FT. is blank when a player didn’t attempt any free throws during the playoffs (the same logic applies to 3-pointers, 2-pointers, and field goals).

So what will we do with these columns? In this case, I will simply ignore these columns in my analysis because I won’t be focusing on percentages.
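As a quick base-R sanity check (no Amelia needed), you can compute the overall missing fraction and list the affected columns yourself. The tiny data frame below is a synthetic stand-in for the playoff data, not the real dataset:

```r
# Synthetic stand-in: NAs appear where a player attempted no shots of that type
df <- data.frame(Player = c("A", "B", "C", "D"),
                 FT.    = c(0.80, NA, 0.75, 0.90),
                 FG.    = c(0.50, 0.45, NA, 0.55))
round(mean(is.na(df)), 2)             # overall fraction of missing cells
names(which(colSums(is.na(df)) > 0))  # columns containing NAs
```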

Alright, now that I went over the basics of the data, let’s start analyzing the data. First off, I’ll start with a couple of linear regressions.

The first linear regression will examine the relationship between a player’s age and the number of playoff games they played, with Age as the response variable. Here’s the formula for that linear regression:

linearModel1 <- lm(Age~G, data=file)

And here’s a summary of the linear model:

Now, what does all of this mean? (and for a refresher on linear regression models, please read my post _):

  • The residual standard error gives us the typical amount by which the response variable (Age) deviates from the regression line. In this example, a player’s age deviates from the regression line by 4.066 years on average.
  • The Multiple R-Squared and Adjusted R-Squared are measures of how well the model fits the data. The closer the R-squared is to 1, the better the fit of the model.
    • In this case, the Multiple R-Squared is 3.42% and the Adjusted R-Squared is 2.97%, indicating that there is almost no correlation between a player’s age and the number of playoff games they played.
  • The F-Statistic is a measure of the relationship (or lack thereof) between the dependent and independent variables. This metric (and the corresponding p-value) isn’t too important when dealing with simple linear regression models such as this one, but it is important when analyzing multiple linear regression models (i.e. models with multiple predictors).
    • To make things easier, just focus on the F-statistic’s corresponding p-value when analyzing the relationship between the dependent and independent variables. If the p-value is less than 0.05, reject the null hypothesis (that the independent and dependent variables aren’t related). If the p-value is greater than 0.05, fail to reject the null hypothesis.
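All of these summary quantities can be pulled out of the fitted model object directly. Here’s a self-contained sketch using synthetic stand-in data (the real analysis uses the playoff dataset loaded earlier):

```r
set.seed(1)
# Synthetic stand-in: ages and games played for 217 players
df <- data.frame(Age = sample(19:40, 217, replace = TRUE),
                 G   = sample(1:21, 217, replace = TRUE))
m <- lm(Age ~ G, data = df)
s <- summary(m)
s$sigma            # residual standard error
s$r.squared        # Multiple R-squared
s$adj.r.squared    # Adjusted R-squared
# p-value of the overall F-test
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3], lower.tail = FALSE)
```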

OK, so it looks like there isn’t any correlation between a player’s age and the amount of playoff games they played. But what happens when I include Team in the analysis?

Here’s the code for that linear regression model:

linearModel2 <- lm(Age~G+Team, data=file)

However, something to note is that Team is of type chr. To use it as a categorical predictor in a linear regression, we should use this simple line of code to convert Team to type factor:

file$Team <- as.factor(file$Team)
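To see what the conversion does, here’s a small self-contained example (the data frame is a synthetic stand-in for the playoff dataset):

```r
# Synthetic stand-in: a character Team column like the one in the dataset
df <- data.frame(Age  = c(25, 31, 28, 24),
                 Team = c("Lakers", "Heat", "Lakers", "Heat"))
df$Team <- as.factor(df$Team)
is.factor(df$Team)   # TRUE
levels(df$Team)      # "Heat" "Lakers" -- levels are sorted alphabetically
```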

Alright, so let’s see the summary of this linear regression model:

Now, let’s analyze this model further:

  • The residual standard error shows us that a player’s age deviates from the regression line by 3.902 years (down from 4.066 years in the previous model).
  • The Multiple R-Squared and Adjusted R-Squared (17.26% and 10.64% respectively) show us that this model fits better than the previous one, but both metrics are still small, so age is only weakly related to games played and team.
  • The F-statistic’s corresponding p-value is much less than 0.05 (0.001024), so we reject the null hypothesis: taken together, games played and team do explain a statistically significant (if modest) share of the variation in player age.
  • Notice how the Team levels are displayed as separate coefficients. When you include a factor in a linear regression, R creates a dummy variable for each level except the reference level (so you see 15 coefficients for the 16 playoff teams).

Now we will do some k-means clustering analyses. First, let’s create a cluster using the variables FT and FG (free throws and field goals, respectively):

As you can see, I have displayed the head (the first six observations) of the cluster.

  • I used 8 and 18 as the column numbers because FG and FT are the 8th and 18th columns in the dataset, respectively.

Alright, let’s do some k-means clustering:
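The kmeans() call itself appears only as a screenshot in the original post, so here’s a minimal sketch of what it plausibly looks like. The data frame below is a synthetic stand-in (the real data has 217 players, with FG and FT in columns 8 and 18), and the object names cluster1 and NBAPlayoffs match the ones used in the plotting code later on:

```r
set.seed(42)
# Synthetic stand-in for the 217-player playoff totals
file <- data.frame(FG = runif(217, 0, 130), FT = runif(217, 0, 100))
cluster1 <- file[, c("FG", "FT")]    # columns 8 and 18 in the real data
NBAPlayoffs <- kmeans(cluster1, centers = 5, iter.max = 25)
NBAPlayoffs$size   # sizes of the 5 clusters
```

The set.seed() call makes the clustering reproducible, since kmeans() picks its starting centers at random.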

So what does this output tell us? Let’s find out:

  • In this example, I created 5 clusters from cluster1. The 5 clusters have sizes of 104, 36, 10, 64, and 3 respectively.
  • The cluster means show us the means for each variable used in this analysis-FT and FG. Recall that cluster means are calculated from the values in each cluster, so the cluster mean for FG in cluster 1 is calculated from all of the FG values in cluster 1 (same logic applies for the other 4 clusters).
  • The clustering vector shows you which observations belong to which cluster. In this example, the cluster observations are sorted alphabetically by a player’s last name. The first three observations belong to clusters 1, 4, and 3, which correspond to Jaylen Adams, Steven Adams, and Edrice “Bam” Adebayo. Likewise, the last three observations belong to clusters 1, 1, and 2, which correspond to Nigel Williams-Goss, Delon Wright, and Ivica Zubac, respectively.
  • The WCSSBC (within cluster sum of squares by cluster) shows us the variability of observations in a certain cluster. The smaller the WCSSBC, the more compact the cluster. In this example, cluster 1 is the most compact since it has the smallest WCSSBC (2886.26).
  • The between_SS/total_SS ratio shows the overall goodness-of-fit for the model. The higher this ratio is, the better the model fits the data. In this example, the between_SS/total_SS is 91.5%, indicating that this model is an excellent fit for the data.
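Both of these quantities live inside the kmeans() result object, so you can compute the ratio yourself. A small self-contained example on toy data (not the playoff dataset):

```r
set.seed(42)
# Toy two-column data with three well-separated groups
pts <- rbind(matrix(rnorm(60, 0),  ncol = 2),
             matrix(rnorm(60, 5),  ncol = 2),
             matrix(rnorm(60, 10), ncol = 2))
km <- kmeans(pts, centers = 3, iter.max = 25)
km$withinss                        # within-cluster sum of squares, per cluster
round(km$betweenss / km$totss, 3)  # the between_SS / total_SS ratio
```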

Now let’s graph our clusters. (The plot() function used below is from base R, so ggplot2 isn’t actually needed for this particular plot.)

Here’s the code we’ll use to make the first graph:

plot(cluster1, col=NBAPlayoffs$cluster, main="2020 NBA Playoffs Field Goals vs Free Throws", xlab = "Playoff field goals", ylab="Playoff free throws", pch=19)

In this graph, we have 5 clusters grouped by color. Let’s analyze each of these clusters:

  • The black cluster represents players who made 20 or fewer playoff free throws and 20 or fewer playoff field goals. Some notable players that fall into this cluster include Jared Dudley (Lakers), Hassan Whiteside (Trail Blazers), and Bol Bol (Nuggets).
  • The dark blue cluster represents players who made between 0 and 35 playoff free throws and between 15 and 50 playoff field goals. Some notable players that fall into this cluster include Andre Iguodala (Heat), Markieff Morris (Lakers), and Victor Oladipo (Pacers).
  • The red cluster represents players who made between 20 and 50 playoff free throws and between 35 and 90 playoff field goals. Some notable players that fall into this cluster include Giannis Antetokounmpo (Bucks), Jae Crowder (Heat), and Rudy Gobert (Jazz).
  • The other two clusters (which are light green and light blue) represent players who scored at least 45 playoff free throws and at least 100 playoff field goals.
    • In case you’re wondering, the three players who are in the light green cluster are Jimmy Butler (Heat), Anthony Davis (Lakers), and LeBron James (Lakers)-all three of whom participated in the 2020 NBA Finals.

So, what insights can we gather from this data? One insight is certain-a team’s playoff position alone doesn’t determine how many free throws and field goals its players made. For instance, Dion Waiters, JR Smith, and Talen Horton-Tucker are in the lowest FG/FT cluster, but all three of them are on the Lakers, who won the NBA Championship. However, Dion Waiters and Talen Horton-Tucker didn’t play in any Finals games, while JR Smith did play in some Finals games but only averaged 7.5 minutes per game throughout the playoffs.

Likewise, you might think that the two highest FG/FT clusters would only have Heat and Lakers players, given that these two teams made it all the way to the Finals. Two exceptions to this are James Harden (Rockets) and Kawhi Leonard (Clippers), who were both eliminated in the Western Conference semifinals.

Now let’s do another k-means clustering analysis, this time using two different statistical categories-total rebounds and assists (represented by TRB and AST respectively).

First, let’s create the main cluster (let’s use 23 and 24 as the column numbers because TRB and AST are the 23rd and 24th columns of the dataset, respectively):

Now, let’s create the k-means analysis:
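As before, the actual call is shown only as a screenshot in the original post; here’s a plausible sketch with a synthetic stand-in data frame, using the object names (cluster2, NBAPlayoffs2) that the plotting code below relies on:

```r
set.seed(7)
# Synthetic stand-in; in the real data TRB and AST are columns 23 and 24
file <- data.frame(TRB = runif(217, 0, 140), AST = runif(217, 0, 140))
cluster2 <- file[, c("TRB", "AST")]
NBAPlayoffs2 <- kmeans(cluster2, centers = 4, iter.max = 25)
NBAPlayoffs2$size   # sizes of the 4 clusters
```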

What does this output tell us? Let’s find out:

  • In this example, I created 4 clusters from cluster2 with sizes of 65, 18, 128, and 6, respectively.
  • The cluster means display the averages of TRB and AST for each cluster.
  • The clustering vector shows us which observations correspond to which cluster. From this vector, we can see that the first three observations belong to cluster 3.
  • The WCSSBC indicates the compactness of each cluster. Since cluster 4 has the smallest WCSSBC (13214.83), it is the most compact cluster. Cluster 1, on the other hand, has the highest WCSSBC (32543.42), so it is the least compact cluster.
  • The between_SS/total_SS ratio shows us the goodness-of-fit for the model. In this case, the between_SS/total_SS ratio is 82.9%, which indicates that this model is a good fit for the data (though this ratio is lower than the previous model’s ratio of 91.5%).

Alright, now let’s graph this cluster using this line of code (the plot() function is part of base R, so no extra package is needed here):

plot(cluster2, col=NBAPlayoffs2$cluster, main="2020 NBA Playoffs Rebounds vs Assists", xlab = "Playoff rebounds", ylab="Playoff assists", pch=19)

In this graph, we have 4 clusters grouped by color. Let’s see what each of these clusters means:

  • The green cluster represents players who recorded 40 or fewer playoff assists and 30 or fewer playoff rebounds. Notable players who fall into this cluster include Meyers Leonard (Heat), Dion Waiters (Lakers), and Al Horford (76ers).
  • The black cluster represents players who recorded between 0 and 60 playoff assists and between 35 and 100 playoff rebounds. Notable players who fall into this cluster include Kelly Olynyk (Heat), Chris Paul (Thunder), and Kyle Kuzma (Lakers).
  • The red cluster represents players who recorded between 15 and 130 playoff assists and between 45 and 130 playoff rebounds. Notable players who fall into this cluster include Russell Westbrook (Rockets), Jae Crowder (Heat), and Kyle Lowry (Raptors).
  • The blue cluster represents the six players in the top rebounds/assists tier, which include:
    • Jimmy Butler (Heat)
    • Jayson Tatum (Celtics)
    • Nikola Jokic (Nuggets)
    • Bam Adebayo (Heat)
    • Anthony Davis (Lakers)
    • LeBron James (Lakers)
      • Four of these players participated in the NBA Finals while Jayson Tatum and Nikola Jokic were eliminated in the Conference Finals.

So, what insights can we draw from this cluster analysis? First of all, just as in the previous cluster analysis, a team’s playoff position (e.g. Conference Finals) doesn’t determine which cluster a player falls into. For example, the Heat and Lakers made it to the NBA Finals, yet there are Heat and Lakers players in every cluster.

And just as with the previous cluster analysis, the top cluster doesn’t only have Finals players, as Jayson Tatum and Nikola Jokic are included in the top cluster-and both of them were eliminated in the Conference Finals.

Last but not least, let’s create a cluster analyzing two things players try not to get-personal fouls and turnovers. First, let’s create the main cluster (use 27 and 28 as the column numbers since TOV and PF are the 27th and 28th columns in the dataset respectively):

Next, let’s create the k-means analysis:
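Once more, here’s a plausible sketch of the call (synthetic stand-in data; the names cluster3 and NBAPlayoffs3 match the plotting code below):

```r
set.seed(11)
# Synthetic stand-in; in the real data TOV and PF are columns 27 and 28
file <- data.frame(TOV = runif(217, 0, 80), PF = runif(217, 0, 70))
cluster3 <- file[, c("TOV", "PF")]
NBAPlayoffs3 <- kmeans(cluster3, centers = 4, iter.max = 25)
NBAPlayoffs3$size   # sizes of the 4 clusters
```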

So, what does this output tell us? Let’s find out:

  • In this example, I created 4 clusters from cluster3 with sizes of 30, 70, 104, and 13, respectively.
  • The cluster means display the means of TOV and PF for each cluster.
  • The clustering vector shows us which observations belong to which cluster. In this example, the first three observations belong to cluster 3 (just like in the previous example).
  • The WCSSBC shows us how compact each cluster is; in this example, cluster 3 is the most compact since it has the smallest WCSSBC (1976.298).
  • The between_SS/total_SS ratio shows us the goodness-of-fit for the model; in this example, this ratio is 84.9%, which indicates that the model is a great fit for the data.
    • Interestingly, the k-means analysis with the best goodness-of-fit is still the first one, with a between_SS/total_SS ratio of 91.5%. That’s partly to be expected, though: the first model used 5 clusters while this model and the previous one used 4, and adding more clusters generally pushes the between_SS/total_SS ratio higher.

Now, let’s plot our k-means analysis (again using base R’s plot() function):

plot(cluster3, col=NBAPlayoffs3$cluster, main="2020 NBA Playoffs Turnovers vs Personal Fouls", xlab = "Playoff turnovers", ylab="Playoff personal fouls", pch=19)

Just as with the previous graph, we have four clusters grouped by color in this plot (and coincidentally in the same color order as the previous graph). What do each of these clusters mean? Let’s find out:

  • The green cluster represents players who committed 15 or fewer playoff turnovers as well as 12 or fewer playoff personal fouls. Notable players that fall into this cluster include JR Smith (Lakers), Kyle Korver (Bucks), and Damian Lillard (Trail Blazers).
    • In this example, being in the lowest cluster is the best, since the aim of every player is to commit as few turnovers and personal fouls as possible. Plus, if a player commits too many personal fouls in a game (six in the NBA), they foul out and are disqualified for the rest of the game.
    • Also, the more turnovers a player commits, the more opportunities the opposing team gets to score.
      • In reality, the data isn’t as black-and-white as I make it seem. This is because the further a player made it into the playoffs, the more opportunities they would have for committing turnovers and personal fouls.
      • Data is never as black-and-white as it looks. The key to being a good data analyst is to analyze every piece of data in a dataset to see how it’s all interconnected.
  • The red cluster represents players who committed between 7 and 35 playoff turnovers and between 2 and 35 playoff personal fouls. Notable players who fall into this cluster include Carmelo Anthony (Trail Blazers), Kendrick Nunn (Heat), and Rudy Gobert (Jazz).
  • The black cluster represents players who committed between 9 and 35 playoff turnovers and between 33 and 66 playoff personal fouls. Notable players who fall into this cluster include Rajon Rondo (Lakers), Danny Green (Lakers), and Jae Crowder (Heat).
  • The blue cluster represents the 13 players who are in the highest turnover/personal foul tier. Here’s the full list:
    • Goran Dragic (Heat)
    • Paul George (Clippers)
    • Jaylen Brown (Celtics)
    • Tyler Herro (Heat)
    • James Harden (Rockets)
    • Marcus Smart (Celtics)
    • Bam Adebayo (Heat)
    • Jayson Tatum (Celtics)
    • Jamal Murray (Nuggets)
    • Anthony Davis (Lakers)
    • Jimmy Butler (Heat)
    • Nikola Jokic (Nuggets)
    • and LeBron James (Lakers)
      • You might be surprised to see 6 NBA Finalists on this list, but I have a theory as to why that’s the case. See, the farther a player makes it in the playoffs, the more opportunities they have to score (and to commit fouls and turnovers). And if you watched the 2020 NBA Finals, quite a few personal fouls and turnovers were committed by both the Heat and Lakers (especially with Anthony Davis’s foul trouble in Game 3).

Thanks for reading,

Michael

R Analysis 9: ANOVA, K-Means Clustering & COVID-19


Hello everybody,

It’s Michael here, and today’s post is an R analysis about the COVID-19 pandemic using ANOVA & K-means clustering. I haven’t done many analyses on current events, so I thought I’d change things up here a little (especially since COVID-19 is an important moment in modern history).

Here’s the dataset-COVID-19 Worldwide (4-12-20)

First of all, let’s load our dataset into R:

This dataset contains 213 samples and 14 variables. Let’s learn more about each variable:

  • Country-The name of the country whose COVID-19 cases are being analyzed. Keep in mind that not every value in this column is a country-some exceptions include:
    • World-The total number of coronavirus cases worldwide
    • Reunion-This is a French island located in the Indian Ocean
    • Diamond Princess-This is a Carnival-owned cruise ship that was quarantined in Yokohama, Japan
  • Total Cases-The total number of COVID-19 cases in a particular country.
    • Keep in mind that these are the totals as of April 12, 2020. The numbers have (very likely) fluctuated a lot since this writing.
  • New Cases-The number of new COVID-19 cases in a particular country on April 12, 2020
  • Total Deaths-The total number of people in a particular country who have died from COVID-19
  • New Deaths-The number of people in a particular country who died from COVID-19 on April 12, 2020.
  • Total Recovered-The total number of people in a particular country who have recovered from COVID-19.
  • Active Cases-The number of active COVID-19 cases in a particular country as of April 12, 2020.
  • Serious Critical-The number of serious or critical cases in a particular country.
  • Total Cases 1M pop-The total number of cases in a particular country per 1 million residents.
  • Deaths 1M pop-The total number of deaths in a particular country per 1 million residents.
  • Total Tests-The total number of COVID-19 tests a particular country has administered.
  • Tests 1M pop-The total number of tests a particular country has administered per 1 million residents.
  • Continent-The continent a country is located on (if applicable)
  • Lockdown-The measures a particular country has taken to stop the spread of COVID-19, which can include:
    • Curfew-country has enacted a curfew for its citizens, which restricts their movement at certain hours of the day
    • Lockdown-country has enacted a mass lockdown/quarantine (terms are used interchangeably), requiring all citizens to stay at home at all hours of the day with few exceptions (e.g. getting groceries, performing an essential job)
    • No large gatherings-country bans all large gatherings (e.g. concerts, festivals)
      • OK, so almost every country I’ve analyzed bans or strongly discourages large gatherings. This label is for countries that have only banned large gatherings without enacting any other significant restrictions.
    • None-country has implemented no major restrictions on their citizens
    • Partial curfew-country has enacted a curfew only in certain regions or for certain groups of people (e.g. elderly)
    • Partial lockdown-country has enacted a lockdown only in certain regions or for certain groups of people
    • Social distancing-country has enacted social distancing measures, such as closing schools, bars, and restaurants
      • Almost every affected country has enacted some form of social distancing, but this label only applies to those countries who have only enacted social distancing rules without any other major restrictions.
    • Stay-at-home-country has issued a nationwide stay-at-home order, which requires citizens to stay at home except for essential activities (e.g. grocery shopping, working at an essential business) and outdoor exercise-provided social distancing is maintained.
    • Travel restrictions-country has enacted travel restrictions, restricting citizens from leaving and/or foreigners from entering.
      • President Trump’s Europe travel ban doesn’t factor into this label.
    • Various-country has enacted various measures throughout the nation
      • The US is the only country that has this label, as all 50 states have responded differently to the COVID-19 crisis (for instance, 43 states have issued stay-at-home orders, but 7 haven’t). Likewise, each state has their own definition of “essential businesses”.

 

  • Something I want to mention about factor variables that I didn’t mention earlier-levels in factor variables are organized alphabetically. So for the Lockdown variable, Curfew would have a factor level of 1, as it’s the first value alphabetically. Likewise, Various would have a factor level of 11, as it’s the last value alphabetically.
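The alphabetical ordering of factor levels is easy to verify with a few of the Lockdown labels:

```r
# A few of the Lockdown labels from the dataset
lockdown <- factor(c("Various", "Curfew", "Lockdown", "None"))
levels(lockdown)          # "Curfew" "Lockdown" "None" "Various" (alphabetical)
as.integer(lockdown[1])   # "Various" maps to level 4 here, the last alphabetically
```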

Now, let’s see how tidy our dataset is by seeing how many values are missing (remember to install the Amelia package and use the line missmap(file) to obtain the missingness map):

As you can see, 19% of all values in the dataset are missing, while the other 81% of values are present.

  • Yes, I know the dataset is not as tidy as the ones I’ve used in other posts. But I think this is a good thing, as many of the real-life datasets you’ll encounter will be considerably messier than this one.

So, what should I do about the missing values? First of all, since the New Deaths variable has too many missing values, I won’t use it in any of my analyses. As for the rest of the variables with missing values, I will simply replace those missing values with zeroes; in these columns, a missing value most likely means the country reported none in that category, so zero is a reasonable substitute. Here’s how to do that:

file$New.Cases[is.na(file$New.Cases)] <- 0
file$Serious.Critical[is.na(file$Serious.Critical)] <- 0
file$Tests..1M.pop[is.na(file$Tests..1M.pop)] <- 0
file$Total.Tests[is.na(file$Total.Tests)] <- 0
file$Deaths..1M.pop[is.na(file$Deaths..1M.pop)] <- 0
file$Total.Deaths[is.na(file$Total.Deaths)] <- 0
file$Total.Recovered[is.na(file$Total.Recovered)] <- 0

These seven lines of code will replace any missing values within these seven variables with zeroes.
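If you’d rather not write one line per column, the same zero-fill can be done with a loop over the affected column names. The two-column data frame here is just a synthetic stand-in so the example runs on its own:

```r
# Synthetic stand-in with a couple of the affected columns
file <- data.frame(New.Cases    = c(5, NA, 12),
                   Total.Deaths = c(NA, 2, NA))
cols <- c("New.Cases", "Serious.Critical", "Tests..1M.pop", "Total.Tests",
          "Deaths..1M.pop", "Total.Deaths", "Total.Recovered")
for (cn in intersect(cols, names(file))) {
  file[[cn]][is.na(file[[cn]])] <- 0
}
sum(is.na(file))   # 0
```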

Now, for this analysis I will be doing two one-way ANOVA models. Here’s my first one-way ANOVA model:

In this model, I used Total Cases as the dependent variable and Lockdown as the independent variable; I’m analyzing the effect a country’s containment measures have on its total coronavirus cases.
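The aov() call producing this table appears only as a screenshot in the original post; based on the Tukey output shown further down (Fit: aov(formula = file$Total.Cases ~ file$Lockdown)), it plausibly looks like this. The data frame here is a synthetic stand-in:

```r
set.seed(3)
# Synthetic stand-in: case totals with a categorical containment measure
file <- data.frame(Total.Cases = rpois(60, 1000),
                   Lockdown    = factor(rep(c("Curfew", "Lockdown", "None"),
                                            each = 20)))
model1 <- aov(file$Total.Cases ~ file$Lockdown)
summary(model1)   # prints the ANOVA table
```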

As you can see, Lockdown isn’t that significant of a variable, as it only has a single period beside it, which means its p-value is between 0.05 and 0.1-too large to be considered statistically significant. The F-value is also low (1.749), which, along with the high p-value, means that there isn’t enough evidence to reject the null hypothesis (that a country’s containment measures don’t have a significant impact on its coronavirus cases).

Nonetheless, let’s perform a Tukey’s HSD test to see if we can get some interesting insights:

TukeyHSD(model1, conf.level=0.99)
Tukey multiple comparisons of means
99% family-wise confidence level

Fit: aov(formula = file$Total.Cases ~ file$Lockdown)

$`file$Lockdown`
diff lwr upr p adj
Lockdown-Curfew 15932.078 -112699.14 144563.30 0.9999954
No large gatherings-Curfew 22064.565 -477726.47 521855.60 1.0000000
None-Curfew 10803.501 -99222.47 120829.47 0.9999995
Partial curfew-Curfew 53433.565 -446357.47 553224.60 0.9999989
Partial lockdown-Curfew 4838.065 -355855.05 365531.18 1.0000000
Self-quarantine-Curfew -3134.435 -502925.47 496656.60 1.0000000
Social distancing-Curfew 61926.065 -298767.05 422619.18 0.9999023
Stay-at-home-Curfew 20215.565 -479575.47 520006.60 1.0000000
Travel restrictions-Curfew 3225.565 -496565.47 503016.60 1.0000000
Various-Curfew 546553.565 46762.53 1046344.60 0.0027658
No large gatherings-Lockdown 6132.487 -489368.42 501633.40 1.0000000
None-Lockdown -5128.577 -93648.53 83391.37 1.0000000
Partial curfew-Lockdown 37501.487 -457999.42 533002.40 1.0000000
Partial lockdown-Lockdown -11094.013 -365818.68 343630.65 1.0000000
Self-quarantine-Lockdown -19066.513 -514567.42 476434.40 1.0000000
Social distancing-Lockdown 45993.987 -308730.68 400718.65 0.9999929
Stay-at-home-Lockdown 4283.487 -491217.42 499784.40 1.0000000
Travel restrictions-Lockdown -12706.513 -508207.42 482794.40 1.0000000
Various-Lockdown 530621.487 35120.58 1026122.40 0.0038187
None-No large gatherings -11261.064 -502260.94 479738.81 1.0000000
Partial curfew-No large gatherings 31369.000 -660560.36 723298.36 1.0000000
Partial lockdown-No large gatherings -17226.500 -616454.91 582001.91 1.0000000
Self-quarantine-No large gatherings -25199.000 -717128.36 666730.36 1.0000000
Social distancing-No large gatherings 39861.500 -559366.91 639089.91 1.0000000
Stay-at-home-No large gatherings -1849.000 -693778.36 690080.36 1.0000000
Travel restrictions-No large gatherings -18839.000 -710768.36 673090.36 1.0000000
Various-No large gatherings 524489.000 -167440.36 1216418.36 0.1478284
Partial curfew-None 42630.064 -448369.81 533629.94 0.9999999
Partial lockdown-None -5965.436 -354375.13 342444.26 1.0000000
Self-quarantine-None -13937.936 -504937.81 477061.94 1.0000000
Social distancing-None 51122.564 -297287.13 399532.26 0.9999773
Stay-at-home-None 9412.064 -481587.81 500411.94 1.0000000
Travel restrictions-None -7577.936 -498577.81 483421.94 1.0000000
Various-None 535750.064 44750.19 1026749.94 0.0028637
Partial lockdown-Partial curfew -48595.500 -647823.91 550632.91 0.9999999
Self-quarantine-Partial curfew -56568.000 -748497.36 635361.36 0.9999999
Social distancing-Partial curfew 8492.500 -590735.91 607720.91 1.0000000
Stay-at-home-Partial curfew -33218.000 -725147.36 658711.36 1.0000000
Travel restrictions-Partial curfew -50208.000 -742137.36 641721.36 1.0000000
Various-Partial curfew 493120.000 -198809.36 1185049.36 0.2174452
Self-quarantine-Partial lockdown -7972.500 -607200.91 591255.91 1.0000000
Social distancing-Partial lockdown 57088.000 -432179.94 546355.94 0.9999974
Stay-at-home-Partial lockdown 15377.500 -583850.91 614605.91 1.0000000
Travel restrictions-Partial lockdown -1612.500 -600840.91 597615.91 1.0000000
Various-Partial lockdown 541715.500 -57512.91 1140943.91 0.0327333
Social distancing-Self-quarantine 65060.500 -534167.91 664288.91 0.9999987
Stay-at-home-Self-quarantine 23350.000 -668579.36 715279.36 1.0000000
Travel restrictions-Self-quarantine 6360.000 -685569.36 698289.36 1.0000000
Various-Self-quarantine 549688.000 -142241.36 1241617.36 0.1052700
Stay-at-home-Social distancing -41710.500 -640938.91 557517.91 1.0000000
Travel restrictions-Social distancing -58700.500 -657928.91 540527.91 0.9999995
Various-Social distancing 484627.500 -114600.91 1083855.91 0.0914743
Travel restrictions-Stay-at-home -16990.000 -708919.36 674939.36 1.0000000
Various-Stay-at-home 526338.000 -165591.36 1218267.36 0.1443175
Various-Travel restrictions 543328.000 -148601.36 1235257.36 0.1149652

Since Lockdown isn’t statistically significant in this model, none of these pairwise differences are statistically significant. The pairwise difference that comes closest to significance is Various-Partial lockdown, which has an adjusted p-value of 0.033 (rounded to three decimal places).

  • Various and partial lockdown are somewhat similar to each other: the US (the only country with the “Various” label) issued stay-at-home orders in 43 states as part of its containment measures, and those orders have the same intent as partial lockdowns but allow for outdoor exercise.

 

Now, let’s see what happens when we change the dependent variable but keep the same independent variable-Lockdown (keep in mind I’m still doing one-way ANOVA):

Ok, so this time I used Total.Tests as the dependent variable, which means I’m analyzing the effect of a country’s containment measures on its total number of COVID-19 tests. And this time, it looks like Lockdown is a very statistically significant variable, as it has the highest significance code (three asterisks) and a p-value far less than 0.001-along with a correspondingly large F-value of 54.95. In this case, we can reject the null hypothesis and conclude that a country’s containment measures do have an effect on the number of COVID-19 tests it administers.
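For reference, here’s a minimal sketch of that ANOVA call. The real post uses the COVID-19 spreadsheet loaded as file (with columns Total.Tests and Lockdown., as the Tukey output further down confirms); the tiny made-up data frame below is just a stand-in so the snippet runs on its own:

```r
# Sketch of the one-way ANOVA described above; the stand-in data frame
# below takes the place of the real COVID-19 spreadsheet.
file <- data.frame(
  Total.Tests = c(100, 120, 90, 4000, 4500, 5000, 80, 60, 70),
  Lockdown.   = factor(rep(c("Lockdown", "Various", "None"), each = 3))
)
model2 <- aov(file$Total.Tests ~ file$Lockdown.)
summary(model2)                       # F-value, p-value, significance codes
TukeyHSD(model2, conf.level = 0.99)   # pairwise differences, as shown below
```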

Since Lockdown has more statistical significance this time, the Tukey’s HSD test should have plenty of statistically significant pairwise differences. Let’s find out:

> TukeyHSD(model2, conf.level = 0.99)
Tukey multiple comparisons of means
99% family-wise confidence level

Fit: aov(formula = file$Total.Tests ~ file$Lockdown.)

$`file$Lockdown.`
diff lwr upr p adj
Lockdown-Curfew 96151.24 -32955.08 225257.552 0.1662125
No large gatherings-Curfew 73822.83 -427814.18 575459.833 0.9999766
None-Curfew -10792.37 -121224.72 99639.985 0.9999995
Partial curfew-Curfew 348388.83 -153248.18 850025.833 0.2508108
Partial lockdown-Curfew 606940.83 244915.50 968966.152 0.0000001
Self-quarantine-Curfew 18835.83 -482801.18 520472.833 1.0000000
Social distancing-Curfew 644769.83 282744.50 1006795.152 0.0000000
Stay-at-home-Curfew 373840.83 -127796.18 875477.833 0.1654945
Travel restrictions-Curfew 41059.83 -460577.18 542696.833 0.9999999
Various-Curfew 2737994.83 2236357.82 3239631.833 0.0000000
No large gatherings-Lockdown -22328.41 -519659.44 475002.620 1.0000000
None-Lockdown -106943.60 -195790.50 -18096.706 0.0005266
Partial curfew-Lockdown 252237.59 -245093.44 749568.620 0.7128433
Partial lockdown-Lockdown 510789.59 154754.75 866824.425 0.0000104
Self-quarantine-Lockdown -77315.41 -574646.44 420015.620 0.9999610
Social distancing-Lockdown 548618.59 192583.75 904653.425 0.0000014
Stay-at-home-Lockdown 277689.59 -219641.44 775020.620 0.5800281
Travel restrictions-Lockdown -55091.41 -552422.44 442239.620 0.9999984
Various-Lockdown 2641843.59 2144512.56 3139174.620 0.0000000
None-No large gatherings -84615.19 -577428.56 408198.178 0.9999022
Partial curfew-No large gatherings 274566.00 -419918.99 969050.989 0.9230867
Partial lockdown-No large gatherings 533118.00 -68323.64 1134559.643 0.0400545
Self-quarantine-No large gatherings -54987.00 -749471.99 639497.989 0.9999999
Social distancing-No large gatherings 570947.00 -30494.64 1172388.643 0.0190437
Stay-at-home-No large gatherings 300018.00 -394466.99 994502.989 0.8706277
Travel restrictions-No large gatherings -32763.00 -727247.99 661721.989 1.0000000
Various-No large gatherings 2664172.00 1969687.01 3358656.989 0.0000000
Partial curfew-None 359181.19 -133632.18 851994.561 0.1904062
Partial lockdown-None 617733.19 268036.66 967429.727 0.0000000
Self-quarantine-None 29628.19 -463185.18 522441.561 1.0000000
Social distancing-None 655562.19 305865.66 1005258.727 0.0000000
Stay-at-home-None 384633.19 -108180.18 877446.561 0.1202463
Travel restrictions-None 51852.19 -440961.18 544665.561 0.9999991
Various-None 2748787.19 2255973.82 3241600.561 0.0000000
Partial lockdown-Partial curfew 258552.00 -342889.64 859993.643 0.8741210
Self-quarantine-Partial curfew -329553.00 -1024037.99 364931.989 0.7887362
Social distancing-Partial curfew 296381.00 -305060.64 897822.643 0.7474791
Stay-at-home-Partial curfew 25452.00 -669032.99 719936.989 1.0000000
Travel restrictions-Partial curfew -307329.00 -1001813.99 387155.989 0.8523867
Various-Partial curfew 2389606.00 1695121.01 3084090.989 0.0000000
Self-quarantine-Partial lockdown -588105.00 -1189546.64 13336.643 0.0133169
Social distancing-Partial lockdown 37829.00 -453246.04 528904.045 1.0000000
Stay-at-home-Partial lockdown -233100.00 -834541.64 368341.643 0.9320348
Travel restrictions-Partial lockdown -565881.00 -1167322.64 35560.643 0.0211145
Various-Partial lockdown 2131054.00 1529612.36 2732495.643 0.0000000
Social distancing-Self-quarantine 625934.00 24492.36 1227375.643 0.0058011
Stay-at-home-Self-quarantine 355005.00 -339479.99 1049489.989 0.7029526
Travel restrictions-Self-quarantine 22224.00 -672260.99 716708.989 1.0000000
Various-Self-quarantine 2719159.00 2024674.01 3413643.989 0.0000000
Stay-at-home-Social distancing -270929.00 -872370.64 330512.643 0.8377204
Travel restrictions-Social distancing -603710.00 -1205151.64 -2268.357 0.0095176
Various-Social distancing 2093225.00 1491783.36 2694666.643 0.0000000
Travel restrictions-Stay-at-home -332781.00 -1027265.99 361703.989 0.7785425
Various-Stay-at-home 2364154.00 1669669.01 3058638.989 0.0000000
Various-Travel restrictions 2696935.00 2002450.01 3391419.989 0.0000000

In this Tukey’s HSD test, there are several statistically significant pairwise differences (those with an adjusted p-value of 0.01 or less). One thing I’ve noticed is that in most of the statistically significant pairwise differences, the two measures actually have a lot in common. Here are three examples of this:

  • Partial lockdown-curfew (p-value of 0.0000001)-Partial lockdowns and curfews are similar since both are only partial restrictions of citizens’ movements outside of their homes, whether at certain hours of the day (in the case of curfew) or only for certain (usually high-risk) people (in the case of partial lockdown)
  • Social distancing-lockdown (p-value of 0.0000014)-Social distancing merely asks people to drastically limit their contact with others outside of the home and to stay at least 6 feet away from people in public places like parks and grocery stores. A lockdown is a stricter form of social distancing that asks people to stay in their homes 24/7 except for essential activities (e.g. grocery shopping, working in essential jobs); for the most part, you can’t go outside for exercise during a lockdown (though some countries under lockdown, like the UK, do allow outdoor exercise).
  • Social distancing-self quarantine (p-value of 0.0058011)-Self-quarantine is a stricter form of social distancing where someone doesn’t leave their residence (or invite anybody into their residence) for at least two weeks; self-quarantining can also mean staying in a separate room from the other people living in your house for at least two weeks. Self-quarantining is usually only necessary if you’ve been to a COVID-19 hotspot or have been in contact with someone who has COVID-19 within the last two weeks (e.g. coworkers, friends).

 

Alright, now since there are so many variables in this set, I thought I’d do another analysis that I haven’t done in quite some time-a k-means clustering. And yes, this is the first time I’m doing two different analyses in one post.

Anyway, let’s start building our cluster, first using the variables Total Cases and Total Tests:

In this example, I created a data subset using the second and eleventh variables in the dataset (Total Cases and Total Tests). I also displayed the head, or first six observations, of the subset (use tail if you want to see the last six observations).
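Since the screenshot doesn’t carry over here, a sketch of that subsetting step looks like this (the tiny made-up data frame stands in for the real spreadsheet, which has those variables in columns 2 and 11):

```r
# Sketch: pull the Total Cases and Total Tests columns into a subset for
# clustering; the stand-in frame below replaces the real COVID-19 data.
file <- data.frame(Country     = c("A", "B", "C", "D", "E", "F", "G"),
                   Total.Cases = c(10, 50, 200, 35, 80, 120, 60),
                   Total.Tests = c(100, 400, 900, 300, 500, 700, 450))
cluster1 <- file[, c("Total.Cases", "Total.Tests")]
head(cluster1)  # first six observations
tail(cluster1)  # last six observations
```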

Now let’s do some k-means clustering:

So, since it’s been a while since I’ve done a k-means clustering analysis, here’s what each of these output blocks mean:

  • The first line always goes K-means clustering with (X) clusters of sizes _____. This tells you how many clusters were created and the sizes of each cluster.
    • In this example, the clusters have sizes of 3, 9, 1, 33, 166, and 1, which means that cluster 1 has 3 observations, cluster 2 has 9 observations, clusters 3 and 6 have only one observation, cluster 4 has 33 observations, and cluster 5 has 166 observations.
  • The next block of output gives you the cluster means for each variable used in the clustering analysis-Total Cases and Total Tests. Means for each variable are grouped by cluster, so the means for Total Cases and Total Tests in cluster 1 are calculated from all the values for Total Cases and Total Tests in cluster 1 (the same logic applies to the other 5 clusters).
  • After the cluster means, the next block of output contains the clustering vector, which shows you which observations belong to which cluster. Granted, the positions of the observations aren’t explicitly mentioned, but this vector starts with the first observation and ends with the 213th observation, so you can tell which observation you’re looking at. For instance, the first three observations belong to cluster 5 (which correspond to Afghanistan, Albania, and Algeria); the last three observations also happen to belong to cluster 5 (which correspond to Yemen, Zambia, and Zimbabwe).
  • After the clustering vector, the next block contains the WCSSBC (within-cluster sum-of-squares by cluster), which measures the variability of the observations in each cluster. The smaller the WCSSBC is, the more compact the cluster. I’m guessing the two WCSSBCs of 0 belong to clusters 3 and 6, as each of those clusters has only one observation. The WCSSBCs of the other clusters are all over 10 million.
  • In the same block of output as the WCSSBC, you’ll see the between_SS/total_SS amount, which is the ratio of the between sum-of-squares to the total sum-of-squares. In simple terms, this ratio measures the overall goodness-of-fit of the model-the higher it is, the better the model fits the data. Since this ratio is 98.3%, the model is an excellent fit for the data.
  • The available components column isn’t relevant for our analysis, so I won’t go into detail here.
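Putting those pieces together, here’s a sketch of the clustering call itself (the names cluster1 for the subset and covid19 for the fitted model match the plotting code further down; the made-up data and nstart = 35, which I use in my other k-means posts, are stand-in assumptions):

```r
# Sketch of the k-means fit described above: 6 centers to match the six
# cluster sizes reported; stand-in data replaces the real subset.
set.seed(42)  # k-means starts from random centroids, so fix the seed
cluster1 <- data.frame(Total.Cases = rpois(50, 100),
                       Total.Tests = rpois(50, 1000))
covid19 <- kmeans(cluster1, centers = 6, nstart = 35)
covid19$size                       # sizes of the six clusters
covid19$centers                    # cluster means for each variable
covid19$cluster                    # the clustering vector
covid19$betweenss / covid19$totss  # the between_SS / total_SS ratio
```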

 

Now, let’s graph our clusters (remember to install the ggplot2 package):

Here’s the code to create the graph:

plot(cluster1, col = covid19$cluster, main = "COVID-19 tests & cases", xlab = "COVID-19 cases", ylab = "COVID-19 tests", pch = 19)

In this graph, there are six clusters, which are grouped by color. Let’s analyze what each cluster means:

  • The bright green cluster represents the worldwide total of COVID-19 cases on April 12, 2020, which was roughly 1.8 million at the time (but is roughly 2.4 million as of April 19, 2020). Since the worldwide total of tests administered wasn’t factored into this dataset, 0 is used as the worldwide total of tests administered.
  • The pink cluster represents the COVID-19 cases in the US, as well as the number of tests the US has administered. The reason why the pink cluster looks so high is that even though the US had administered roughly 2.7 million tests as of April 12, 2020, the US also was the only country that had over 500,000 cases (the number is now closer to 800,000 as of April 19, 2020).
    • Not counting the worldwide total, the US still has the most COVID-19 cases as of this writing.
  • The black cluster represents the countries that had anywhere between 0 and 200,000 COVID-19 cases as well as anywhere between 1 million and 1.5 million tests administered.
  • The other three clusters (which are red, dark blue, and light blue) represent the countries that had anywhere between 0 and 200,000 COVID-19 cases but only administered between 0 and 750,000 tests.

All in all, the clusters are quite compact, but this graph does show that even if a country has administered plenty of COVID-19 tests, that doesn’t mean it will have fewer cases. For instance, Italy had administered just over a million tests but had approximately 156,000 cases. On the other hand, the US had administered roughly 2.7 million tests but had over 500,000 cases (in other words, the US administered over twice as many tests as Italy but still had over triple the number of cases).

Sure, administering more COVID-19 tests can help keep a country’s case count in check (just look at Germany), but if you only analyzed the number of tests a country has given, you’d be ignoring other factors that affect a country’s COVID-19 case count (such as whether or not the country has imposed a nationwide lockdown, or something as simple as its total population).

Thanks for reading, and remember to stay safe, wash your hands, and practice good social distancing. Together, we can flatten the curve!

Also, thank you to all the grocery store workers, firefighters, policemen, scientists, doctors, and anybody else who is helping us get through these rough times.

Michael

R Analysis 3: K-Means vs Hierarchical Clustering


Hello everybody,

It’s Michael, and I will be doing an R analysis in this post. More specifically, I will be doing a comparative clustering analysis, which means I’ll take a dataset and perform both k-means and hierarchical clustering on it to compare the results of each method. However, this analysis will be unique, since I will be revisiting one of the earliest datasets I used for this blog-TV shows-which first appeared in R Lesson 4: Logistic Regression Models on July 11, 2018 (exactly nine months ago!). In case you forgot what this dataset was about, it gives 85 shows that aired during the 2017-18 TV season and whether or not they were renewed for the 2018-19 TV season, along with other aspects of those shows (such as the year they premiered and the network they air on). I’ll admit I chose this dataset because I wanted to analyze one of my old datasets in a different way (remember, I performed linear and logistic regression the first time I used this dataset).

So, as always, let’s load the file and get a basic understanding of our data:

As you can see, we have 85 observations of 10 variables. Here’s a detailed breakdown of each variable:

  • TV.Show-The name of the show
  • Genre-The genre of the show
  • Premiere.Year-The year the show premiered; for revivals like Roseanne, I used the original premiere year (1988) as opposed to the revival premiere year (2018)
  • X..of.seasons..17.18.-How many seasons the show had aired as of the end of the 2017-18 TV season
  • Network-The network (or streaming service) the show airs on
  • X2018.19.renewal.-Whether or not the show was renewed for the 2018-19 TV season; 1 denotes renewal and 0 denotes cancellation
  • Rating-The content rating for the show. Here’s a more detailed breakdown:
    • 1 means TV-G
    • 2 means TV-PG
    • 3 means TV-14
    • 4 means TV-MA
    • 5 means not applicable
  • Usual.Day.of.Week-The usual day of the week the show airs its new episodes. Here’s a more detailed breakdown:
    • 1 means the show airs on Mondays
    • 2 means the show airs on Tuesdays
    • 3 means the show airs on Wednesdays
    • 4 means the show airs on Thursdays
    • 5 means the show airs on Fridays
    • 6 means the show airs on Saturdays
    • 7 means the show airs on Sundays
    • 8 means the show doesn’t have a regular air-day (usually applies to talk shows or shows on streaming services)
  • Medium-the type of network the show airs on. Here’s a more detailed breakdown:
    • 1 means the show airs on either one of the big 4 broadcast networks (ABC, NBC, FOX or CBS) or the CW (which isn’t part of the big 4)
    • 2 means the show airs on a cable channel (AMC, Bravo, etc.)
    • 3 means the show airs on a streaming service (Hulu, Amazon Prime, etc.)
  • Episode Count-the new variable I added for this analysis; this variable shows how many episodes a show has had overall as of the end of the 2017-18 TV season. For certain shows whose seasons cross the 17-18 and 18-19 seasons, I counted how many episodes each show has had as of September 24, 2018 (the beginning of the 2018-19 TV season)

Now that we’ve learned more about our variables, let’s start our analysis. But first, I convert the final four variables into factors, since I think it’ll be more appropriate for the analysis:
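That conversion isn’t shown here, but a sketch of it looks like this (I’m assuming the coded columns are the renewal flag, Rating, Usual.Day.of.Week, and Medium, with names taken from the variable list above; the tiny stand-in frame keeps the snippet runnable):

```r
# Sketch: convert the coded columns to factors. The stand-in frame below
# replaces the real 85-show spreadsheet; column names are assumptions
# based on the variable list above.
file <- data.frame(X2018.19.renewal. = c(1, 0, 1),
                   Rating            = c(3, 2, 4),
                   Usual.Day.of.Week = c(1, 7, 8),
                   Medium            = c(1, 2, 3))
cols <- c("X2018.19.renewal.", "Rating", "Usual.Day.of.Week", "Medium")
file[cols] <- lapply(file[cols], factor)
str(file)  # the converted columns now show as factors
```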

Ok, now onto the analysis. I’ll start with k-means:

Here, I created a data subset using our third and tenth columns (Premiere.Year and Episode.Count, respectively) and displayed the head (the first six observations) of my subset.

Now let’s do some k-means clustering:

I created the variable tvCluster to store my k-means model, using the name of my data subset (cluster1), the number of clusters I wanted to include (4), and nstart = 35, which tells the model to try 35 random sets of starting centroids and keep the run with the lowest within-cluster variation.
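In code, that call would look roughly like this (the made-up Premiere.Year/Episode.Count values below stand in for the real cluster1 subset so the sketch runs on its own):

```r
# Sketch of the k-means model described above: 4 clusters, 35 random starts.
set.seed(42)  # fix the seed since k-means starts randomly
cluster1 <- data.frame(Premiere.Year = sample(1988:2017, 85, replace = TRUE),
                       Episode.Count = sample(10:600, 85))
tvCluster <- kmeans(cluster1, centers = 4, nstart = 35)
tvCluster$size  # number of shows in each cluster
```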

I then type in tvCluster to get a better idea of what my cluster looks like. The first thing I see is “K-means clustering with (X) clusters of sizes”, followed by the number of observations in each cluster (17, 64, 1, and 3, respectively). In total, all 85 observations were used, since I didn’t have any missing data points.

The next thing that is mentioned is cluster means, which gives the mean for each variable used in the clustering analysis (in this case, Episode.Count and Premiere.Year). Interestingly enough, Cluster 2 has the highest mean Premiere.Year (2015) but the lowest mean Episode.Count (49, rounded to the nearest whole number).

After that, you can see the clustering vector, which shows you which observations belong to which cluster. Even though the position of each observation (e.g. 1st, 23rd) isn’t explicitly mentioned, you can tell which observation you are looking at since the vector starts with the first observation and works its way down to the eighty-fifth (and since there is no missing data, all 85 observations are used in this clustering model). For instance, the first three observations all correspond to cluster 1 (the first three shows listed in this dataset are NCIS, The Big Bang Theory, and The Simpsons). Likewise, the final three observations all correspond to cluster 2 (the corresponding shows are The Americans, Baskets, and Comic Book Men).

Next you will see the within cluster sum of squares for each cluster, which I will abbreviate as WCSSBC; this is a measurement of the variability of the observations in each cluster. Remember that the smaller this amount is, the more compact the cluster. In this case, 3 of the 4 WCSSBC are above 100,000, while the other WCSSBC is 0 (which I’m guessing is cluster 3, which has only one observation).

Last but not least is between_SS/total_SS = 94.5%, the ratio of the between sum-of-squares to the total sum-of-squares, which (as you may recall from the k-means lesson) is a measure of the model’s goodness-of-fit. A value of 94.5% indicates an excellent fit for this model.

Finally, let’s graph our model:

In this graph, the debut year is on the x-axis, while the episode count is on the y-axis. As you can see, the 2 largest clusters (represented by the black and red dots) are fairly close together, while the 2 smallest clusters (represented by the blue and green dots) are fairly spread out (granted, the two smallest clusters only have 1 and 3 observations, respectively). One observation about this graph I wanted to point out: a show that premiered further back doesn’t always have more episodes than a show that premiered fairly recently (say, anytime from 2015 onward). This happens for several reasons, including:

  • Revived series like American Idol (which took a hiatus in 2017 before its 2018 revival) and Roseanne (which had been dormant for 21 years before its 2018 revival)
  • Different shows air a different number of episodes per season; for instance, talk shows like Jimmy Kimmel Live! have at least 100 episodes per season, while shows on the Big 4 networks tend to have between 20-24 episodes per season (think The Simpsons, The Big Bang Theory, and Grey’s Anatomy). Cable and streaming shows usually have even fewer episodes per season (between 6-13; South Park, for example, only does 10-episode seasons)
  • Some shows just take long breaks (like how Jessica Jones on Netflix didn’t release any new episodes between November 2015 and March 2018)

Now time to do some hierarchical clustering on our data. And yes, I plan to use all the methods covered in the post R Lesson 12: Hierarchical Clustering.

Let’s begin by scaling all numerical variables in our data (don’t include the ones that were converted into factor types):

Now let’s start off with some agglomerative clustering (with both the dendrogram and code):
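A sketch of that setup looks like this (the scaled data is stored in a matrix I’m calling df, a name I chose for illustration; the made-up numeric columns stand in for the real TV-show spreadsheet):

```r
# Sketch of the scaling + agglomerative clustering steps: Euclidean
# distance and complete linkage, as described below. Stand-in data
# replaces the real spreadsheet.
set.seed(1)
tv <- data.frame(Premiere.Year = sample(1988:2017, 85, replace = TRUE),
                 Episode.Count = sample(10:600, 85))
df <- scale(tv)                       # scale the numeric columns
d  <- dist(df, method = "euclidean")  # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # complete-linkage clustering
plot(hc, cex = 0.6, hang = -1)        # draw the dendrogram
```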

After setting up the model using Euclidean distance and complete linkage, I then plot my dendrogram, which you can see above. This dendrogram is a lot neater than the ones I created in R Lesson 12, but then again, this dataset only has 85 observations, while the one in that post had nearly 7,200. The names of the shows themselves aren’t mentioned, but each of the numbers displayed correspond to a certain show. For instance, 74 corresponds to The Voice, since it is the 74th show listed in our dataset. Look at the spreadsheet to figure out which number corresponds to which show.

You may recall that I mentioned two general rules for interpreting dendrograms. They are:

  • The higher the height of the fusion, the more dissimilar two items are
  • The wider the branch between two observations, the more dissimilar they are

Those rules certainly apply here, though the highest height is 8 in this case, as opposed to 70. For instance, since the branch between shows 7 and 63 is fairly narrow, these two shows have a lot in common according to the model (even though the two shows in question are Bob’s Burgers and The Walking Dead-the former being a cartoon sitcom and the latter revolving around the zombie apocalypse). On the other hand, the gap between shows 74 and 49 is wider, which means they don’t share much in common (even though the two shows are The Voice and Shark Tank, which both qualify as reality shows, though the former is more competition-oriented than the latter). All in all, I think it’s interesting to see how these clusters were created, since some of the shows that were grouped close together seem to have nothing in common.

Now let’s try AGNES (remember that stands for agglomerative clustering):

First of all, remember to install the package cluster. Also, remember that the one important result is the ac, or agglomerative coefficient, which measures the strength of the clustering structure. As you can see, our ac is 96.1%, which indicates a very strong clustering structure (I personally think any ac of at least 90% is good).

Now let’s compare this ac (which used complete linkage) to the ac we get with other linkage methods (not including centroid). Remember to install the purrr package:

Of the four linkage methods, Ward’s method gives us the highest agglomerative coefficient (98.2%), so that’s what we’ll use for the next part of this analysis.
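That comparison can be sketched like this (the linkage names follow cluster::agnes; df stands for the scaled numeric data, synthesized here so the snippet runs on its own):

```r
# Compare the agglomerative coefficient across four linkage methods
# using agnes() from the cluster package and map_dbl() from purrr.
library(cluster)
library(purrr)

set.seed(1)
df <- scale(matrix(rnorm(170), ncol = 2))  # stand-in for the scaled TV data

m  <- c(average = "average", single = "single",
        complete = "complete", ward = "ward")
ac <- function(x) agnes(df, method = x)$ac
map_dbl(m, ac)  # one agglomerative coefficient per linkage method
```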

Using Ward’s method and AGNES, here’s a dendrogram of our data (and the corresponding code):

Aside from having a greater maximum height than our previous dendrogram (the latter had a maximum height of 8, while this diagram has a maximum height of presumably 18), the observations are also placed differently. For instance, unlike in the previous dendrogram, observations 30 and 71 are side by side. But just as with the last dendrogram, shows that have almost nothing in common are oddly grouped together; for instance, the 30th and 71st observations correspond to The Gifted and Transparent; the former is a sci-fi show based on the X-Men universe while the latter is a transgender-oriented drama. The 4th and 17th observations are another good example of this, as the corresponding shows are The Simpsons and Taken; the former is a long-running cartoon sitcom while the latter is based on an action movie trilogy.

The last method I will use for this analysis is DIANA (which stands for divisive analysis). Recall that the main difference between DIANA and AGNES is that DIANA works in a top-down manner (objects start in a single supercluster and are divided into smaller clusters until single-element clusters are created), while AGNES works in a bottom-up manner (objects start in single-element clusters and are merged into progressively larger clusters until a single supercluster is created). Here’s the code and dendrogram for our DIANA analysis:
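Since the code screenshot doesn’t carry over here, a sketch of the DIANA step might look like this (again with synthetic stand-in data in place of the scaled TV-show spreadsheet):

```r
# DIANA: top-down (divisive) hierarchical clustering via the cluster package
library(cluster)

set.seed(1)
df <- scale(matrix(rnorm(170), ncol = 2))  # stand-in for the scaled TV data

hc_diana <- diana(df)  # fit the divisive clustering
hc_diana$dc            # the divisive coefficient
pltree(hc_diana, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
```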

Remember that the divisive coefficient plays the same role as the agglomerative coefficient: both measure the strength of the clustering structure, and the closer each is to 1, the stronger that structure. In both cases, a coefficient of .9 (or 90%) or higher indicates excellent clustering structure. Here, the dc (divisive coefficient) is 95.9%, which indicates excellent clustering structure.

Just as with the previous two dendrograms, most of the observation pairs still have nothing in common. For instance, the 41st and 44th observations have nothing in common, since the corresponding shows are House of Cards (a political drama) and Brooklyn 99 (a sitcom), respectively. An exception to this would be the 9th and 31st observations, since both of the corresponding shows-Designated Survivor and Bull respectively-are dramas and both are on the big 4 broadcast networks (though the former airs on ABC while the latter airs on FOX).

Now, let’s assign clusters to the data points. I’ll go with 4 clusters, since that’s how many I used for my k-means analysis (plus I think it’s an ideal amount). I’m going to use the DIANA example I just mentioned:
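Assigning the clusters comes down to cutting the tree. A sketch (hc_diana is a name I’m using for the fitted diana() model, built here from stand-in data):

```r
# Sketch: cut the DIANA tree into 4 clusters with cutree()
library(cluster)

set.seed(1)
hc_diana <- diana(scale(matrix(rnorm(170), ncol = 2)))  # stand-in fit
diana_clust <- cutree(as.hclust(hc_diana), k = 4)
table(diana_clust)  # how many observations fall in each cluster
```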

Now let’s visualize our clusters in a scatterplot (remember to install the factoextra package):
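The scatterplot step uses factoextra’s fviz_cluster; a sketch (with the same stand-in data and cluster assignments synthesized for illustration):

```r
# Sketch of the cluster scatterplot via factoextra
library(cluster)
library(factoextra)

set.seed(1)
df <- scale(matrix(rnorm(170), ncol = 2))   # stand-in for the scaled data
cl <- cutree(as.hclust(diana(df)), k = 4)   # 4 cluster assignments
fviz_cluster(list(data = df, cluster = cl)) # color-coded scatterplot
```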

As you can see, cluster 3 has the most observations while cluster 4 has the least (only one observation corresponding to Jimmy Kimmel Live). Some of the observations in each cluster have something in common, like the 1st and 2nd observations (NCIS and The Big Bang Theory in cluster 1, both of which air on CBS) and the 42nd and 80th observations in cluster 3 (Watch What Happens Live! and The Chew-both talk shows).

Now, let’s visualize these clusters on a dendrogram (using the same DIANA example):

The first line of code is exactly the same line I used when I first plotted my DIANA dendrogram. The rect.hclust line draws the borders that denote each cluster; remember to set k to the number of clusters you created for your scatterplot (in this case, 4). Granted, the coloring scheme is different from the scatterplot, but you can tell which cluster is which by the size of the rectangle (for instance, the rectangle for cluster 4 only contains the 79th observation, even though it is light blue on our dendrogram and purple on our scatterplot). Plus, every observation lands in the same cluster in both the scatterplot and the dendrogram.
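Those two lines can be sketched as follows (hc_diana again names the fitted diana() model, rebuilt here from stand-in data):

```r
# Sketch: redraw the DIANA dendrogram, then box each of the 4 clusters
library(cluster)

set.seed(1)
hc_diana <- diana(scale(matrix(rnorm(170), ncol = 2)))  # stand-in fit
pltree(hc_diana, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(as.hclust(hc_diana), k = 4, border = 2:5)   # one border per cluster
```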

Thanks for reading.

Michael

R Lesson 11: Missing Data and Basic K-Means Clustering


Hello everybody,

It’s Michael, and today I will be discussing an unsupervised machine learning technique (recall that from my previous R post?) known as k-means clustering. I will also discuss what to do with missing data observations in a column. I know those may seem like unrelated topics, but the dataset I will be working with utilizes both of those topics.

Here’s the dataset for those who want to work with it-2018 films (1).

But first, as always, let’s load the dataset into R and try to understand our variables:

This dataset contains all 212 wide-release movies for the year 2018. There are seven variables in this dataset:

  • Title-the title of the movie
  • US.Box.Office.Gross-How much money the movie made in the US (this doesn’t factor in worldwide grosses)
    • I got this data from BoxOfficeMojo.com. Great site to collect movie-related data (IMDB works too).
  • Release.Date-The date a movie was released
    • Remember to use the as.Date function seen in the picture to convert your dates from factor type to date type.
  • Genre-The genre of the movie
  • RT.Score-The movie’s critic score on Rotten Tomatoes; 0 represents a score of 0% while 1 represents a score of 100%
  • Audience.Score-The movie’s audience score on Rotten Tomatoes; just as with RT Score, 0 means 0% and 1 means 100%
  • Runtime-The movie’s runtime in minutes; so 100 minutes represents a movie that is 1 hour 40 minutes long.

Now, the datasets I’ve used so far have been perfect and tidy. However, real datasets aren’t like that, as they are often messy and contain missing observations, which is the case with this dataset. So what should we do?

First, let me demonstrate a function called missmap that gives you a visual representation of missing observations (and the locations of those observations):

First, before creating the missmap, you need to install the Amelia package and load it with library(Amelia). Then all you need to do is run missmap(file)-or whatever you named your file-to display the missingness map (that’s why the function is called missmap).
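In code, that’s just the following (the two-column stand-in frame with a couple of NAs mirrors the movie data so the sketch runs on its own):

```r
# install.packages("Amelia")  # one-time install
library(Amelia)

# Tiny stand-in with some missing values, mirroring the movie data
file <- data.frame(RT.Score       = c(0.62, NA, 0.90, 0.40),
                   Audience.Score = c(0.70, 0.55, NA, 0.80))
missmap(file)  # highlights where the missing observations sit
```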

According to this missmap, there aren’t too many missing observations (only 1% as opposed to the 99% of present observations). The few missing observations are located in the RT Score and Audience Score columns, since some of the movies are missing critic or audience scores.

  • Keep in mind that this dataset is only a warmup; I will be working with messier datasets in future posts, so stay tuned.

So how can we tidy up our dataset? Well, here’s one way of approaching it:

na.rm=TRUE

This argument can be passed to functions like mean to tell them to exclude any NA (or missing) data points; setting na.rm to TRUE means “remove the missing values before computing.” (Note that kmeans itself doesn’t take an na.rm argument, so for the clustering step I drop the NA rows from the data with na.omit instead, as you’ll see below.)

I could also replace all the missing values with the means for RT Score and Audience Score (62% and 66% respectively), but since missing values only make up 1% of all my data, I’ll just stick with excluding missing values.
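For the record, the mean-imputation alternative mentioned above would look roughly like this (column names assumed from the variable list; stand-in data keeps the sketch runnable):

```r
# Sketch of mean imputation: replace each NA with that column's mean
file <- data.frame(RT.Score       = c(0.62, NA, 0.90, 0.40),
                   Audience.Score = c(0.70, 0.55, NA, 0.80))
file$RT.Score[is.na(file$RT.Score)] <-
  mean(file$RT.Score, na.rm = TRUE)
file$Audience.Score[is.na(file$Audience.Score)] <-
  mean(file$Audience.Score, na.rm = TRUE)
file  # no NA values remain
```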

Now, it’s time to do some k-means clustering. But first of all, what is k-means clustering?

K-means clustering is an unsupervised machine learning algorithm that tries to group data into k clusters by minimizing the distance between individual observations and their cluster centers. The aim of k-means clustering is to keep all the data points in a cluster as close to the centroid (center data point) as possible. In order to do so, each data point is assigned to the closest centroid using Euclidean distance. Euclidean distance is data science lingo for the straight-line distance between a point in a cluster and the cluster’s centroid.
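To make “Euclidean distance” concrete, here’s a two-line worked example (the point and centroid values are made up for illustration):

```r
# Euclidean (straight-line) distance between a data point and a centroid:
# the square root of the sum of squared coordinate differences
point    <- c(3, 4)
centroid <- c(0, 0)
sqrt(sum((point - centroid)^2))  # 5
```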

So the next step in our analysis would be to select the variables to use in our k-means clustering. We can do so by subsetting the data like this:

I chose two columns to include in my subset (US Box Office Gross and RT Score), which creates a data frame that I will use in my k-means clustering. The head function simply displays the first six rows of my subset (the first six box office grosses listed along with their corresponding RT score).

  • For this analysis, I only included two columns in my subset, which keeps the clusters easy to plot in two dimensions (k-means itself can handle any number of numeric columns)
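Here's a sketch of what that subsetting looks like, using a tiny stand-in data frame with hypothetical column names:

```r
# Tiny stand-in for the movie dataset (hypothetical column names)
movies <- data.frame(
  Title            = c("A", "B", "C"),
  Box_Office_Gross = c(700e6, 120e6, 15e6),
  RT_Score         = c(96, 54, 70)
)

# Keep only the two columns used for clustering
data1 <- movies[, c("Box_Office_Gross", "RT_Score")]
head(data1)   # displays up to the first six rows (here, all three)
```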

Now the next step would be to create our clusters. Here’s how:

I first created the movieCluster variable to store my k-means model using the name of my data subset (data1), the number of clusters I wish to include (5), and nstart = 35 (which tells the model to try 35 random starting configurations and keep the one with the lowest within-cluster variation).

  • I didn’t mention it here, but I included the line data1 <- na.omit (data1) in order to omit any rows with NA in them from my subset (remember RT Score had NA values). The k-means model won’t run if you don’t omit NA values (or replace them with means, but for now let’s stick to omitting the values).

I then type in movieCluster to get a better idea of what my model looks like. The first thing displayed is "K-means clustering with (x) clusters of sizes:", which shows you the size of each cluster. In this case, the clusters in my k-means model have 62, 22, 3, 10, and 102 observations, respectively. In total, 199 observations were used, which means 13 weren't used since their RT Scores were missing.

The next thing you see is the cluster means, which give you the mean of each variable within each cluster; in this case, the means for US Box Office Gross and RT Score are displayed for all 5 clusters.

After that, you will see the clustering vector, which will tell you what cluster each observation belongs to (labelled 1-5). As you can see, there are some missing observations (such as 201 and 211); this is because I omitted all NA rows from my data subset (and in turn, my k-means model).

Next you will see the within-cluster sum of squares for each cluster; this is a measurement of the variability of the observations within each cluster. Generally, the smaller this amount is, the more compact the cluster. As you can see, all of the WCSSBC (my acronym for within-cluster sum of squares by cluster) are quite large; this is likely because most of the values in the US Box Office Gross column are over 500,000, so even modest differences between movies translate into huge squared distances. I'd probably have smaller WCSSBC if the US Box Office Gross values were scaled down, but that's neither here nor there.

The last thing you will see is between_SS / total_SS = 95.4%, which is the ratio of the between-cluster sum of squares to the total sum of squares; this is a measure of the goodness of fit of the model. The closer the ratio is to 100%, the better the clusters separate the data, so 95.4% indicates an excellent fit for this model.
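If you want to recompute that ratio yourself, kmeans() exposes both sums of squares in the fitted object (sketched here on random stand-in data):

```r
set.seed(123)
# Random two-column data standing in for the movie subset
km <- kmeans(matrix(rnorm(100), ncol = 2), centers = 5, nstart = 35)

# Ratio of between-cluster SS to total SS, as a percentage
ratio <- km$betweenss / km$totss
round(100 * ratio, 1)   # this is the value R prints as between_SS / total_SS
```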

Last but not least, let’s graph our model. Here’s how:
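One way to produce a plot like this with base R graphics (again on made-up data standing in for the real subset):

```r
set.seed(123)
data1 <- data.frame(gross = runif(50, 0, 700e6),
                    score = runif(50, 0, 100))
movieCluster <- kmeans(data1, centers = 5, nstart = 35)

# Color each point by its cluster assignment; mark centroids with stars
plot(data1, col = movieCluster$cluster, pch = 19,
     xlab = "Box Office Gross", ylab = "RT Score")
points(movieCluster$centers, col = 1:5, pch = 8, cex = 2)
```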

The movies are clustered by US Box Office Gross (referred to here simply as Box Office Gross); the colors of the dots represent the clusters. Here's what each color represents:

  • Light blue-movies that made between $0 and $25 million
  • Black-movies that made between $25 and $100 million (many limited releases)
  • Red-movies that made between $100 and $200 million
  • Dark blue-movies that made between $200 and $400 million
  • Green-movies that made between $600 and $700 million

As you can see, the green cluster is the outlier among the data, since only 3 movies in our dataset made at least $600 million (Black Panther, Avengers: Infinity War, and Incredibles 2).

Thanks for reading,

Michael