Hello everybody,
Michael here, and today’s post will be an R analysis on the 2020 NBA Playoffs.
As many of you know, the 2019-20 NBA Season was suspended on March 11, 2020 due to COVID-19. However, the season resumed on July 30, 2020 inside a “bubble” (essentially an isolation zone without fans present) in Disney World. 22 teams played 8 regular-season “seeding” games in the bubble before playoffs commenced on August 14, 2020. The resumed season concluded on October 11, 2020, when the LA Lakers defeated the Miami Heat in the NBA Finals to win their 17th championship. The data comes from https://www.basketball-reference.com/playoffs/NBA_2020_totals.html (Basketball Reference), which is a great site if you’re looking for basketball statistics and/or want to do some basketball analyses.
Before we start analyzing the data, let’s first upload it to R and learn about the data (and here’s the data):

This dataset contains the names of 217 NBA players who participated in the 2020 Playoffs along with 30 other types of player statistics.
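If you want to follow along, here’s a minimal sketch of the loading step. The post doesn’t give the file name, so this sketch writes a tiny made-up CSV to a temporary file and reads that instead; swap in the path of your own Basketball Reference export. It also shows why some columns end up with names like X3P and FG. once they are in R.

```r
# Stand-in for the real export: a two-row CSV with Basketball Reference style headers.
# (The real file name is not given in the post, so a temporary file is used here.)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Player,Pos,Age,Tm,G,FG,FGA,FG%,3P,3PA,3P%",
  "Player A,SG,24,AAA,5,20,50,.400,5,15,.333",
  "Player B,C,30,BBB,2,0,0,,0,0,"
), csv_path)

# check.names = TRUE (the default) rewrites headers into syntactic R names.
file <- read.csv(csv_path, stringsAsFactors = FALSE)

str(file)     # column types: chr for names, int/num for statistics
names(file)   # "3P" becomes "X3P", "FG%" becomes "FG."
```

Headers that start with a digit get an X prefix and the % sign becomes a dot, which is where names like X3P. come from.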
Now, I know it’s a long list, but here’s an explanation of all 31 variables in this dataset:
- Player: The name of the player
- Pos: The player’s position on the team. For an explanation of the five main basketball positions, please check out this link: https://jr.nba.com/basketball-positions/
- Age: The age of the player as of July 30, 2020 (the date the NBA season resumed)
- Tm: The team the player played for during the 2019-20 NBA season
- G: The number of games the player participated in during the 2020 NBA playoffs
- GS: The number of playoff games in which the player was in the starting lineup
- MP: The number of minutes a player was on the court during the playoffs
- FG: The number of field goals a player scored during the playoffs
  - For those who don’t know, a field goal in basketball is any shot that’s not a free throw.
- FGA: The number of field goals a player tried to make during the playoffs (this includes both successful and unsuccessful field goals)
- FG.: The percentage of a player’s field goal attempts that were successful
- X3P: The number of 3-point field goals a player scored during the playoffs
  - Field goals can be worth either 2 or 3 points, depending on where the player shoots the field goal from.
- X3PA: The number of 3-point field goals a player attempted during the playoffs (both successful and unsuccessful 3-pointers)
- X3P.: The percentage of a player’s 3-point field goal attempts that were successful
- X2P: The number of 2-point field goals a player scored during the playoffs
- X2PA: The number of 2-point field goals a player attempted during the playoffs (both successful and unsuccessful 2-pointers)
- X2P.: The percentage of a player’s 2-point field goal attempts that were successful
- eFG.: The percentage of a player’s successful field goal attempts, adjusting for the fact that 3-point field goals are worth more than 2-point field goals
- FT: The number of free throws a player scored during the playoffs
- FTA: The number of free throws a player attempted during the playoffs (both successful and unsuccessful attempts)
- FT.: The percentage of a player’s free throw attempts that were successful during the playoffs
- ORB: The number of offensive rebounds a player made during the playoffs
  - A player gets a rebound when they retrieve the ball after another player misses a free throw or field goal. The player gets an offensive rebound if their team is currently on offense and a defensive rebound if their team is currently on defense.
- DRB: The number of defensive rebounds a player made during the playoffs
- TRB: The sum of a player’s ORB and DRB
- AST: The number of assists a player made during the playoffs
  - A player gets an assist when they pass the ball to one of their teammates who then scores a field goal.
- STL: The number of steals a player made during the playoffs
  - A player gets a steal when they cause a player on the opposite team to turn over the ball.
- BLK: The number of blocks a player made during the playoffs
  - A player gets a block when they deflect a field goal attempt from a player on the other team.
- TOV: The number of turnovers a player made during the playoffs
  - A player gets a turnover when they lose possession of the ball to the opposing team. Just in case you were wondering, the goal for a player is to make as few turnovers as possible.
- PF: The number of personal fouls a player made during the playoffs
  - A player gets a personal foul when they make illegal personal contact with a player on the other team. The goal for players is to make as few personal fouls as possible, since they will be disqualified from the remainder of the game if they get too many.
- PTS: The total number of points a player scored during the playoffs, counting both field goals and free throws
- Team: The name of the team the player played for during the 2020 NBA playoffs
  - Team and Tm basically give you the same information, except Team gives you the name of the team (e.g. Mavericks) while Tm gives you the three-letter abbreviation for the team (e.g. in the case of the Mavericks the 3-letter abbreviation is DAL).
- Position: The position the team reached in the 2020 NBA playoffs. There are 7 possible values for position, which include:
  - ECR1: The team made it to the first round of the Eastern Conference playoffs
  - ECSF: The team made it to the Eastern Conference Semifinals
  - ECF: The team made it to the Eastern Conference Finals
  - WCR1: The team made it to the first round of the Western Conference playoffs
  - WCSF: The team made it to the Western Conference Semifinals
  - WCF: The team made it to the Western Conference Finals
  - Finals: The team made it to the 2020 NBA Finals
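Several of these columns can be rebuilt from the others, which is a handy sanity check on the definitions. Here’s a small sketch using made-up numbers (not real playoff data); the eFG. formula used is the standard one, where a made 3-pointer counts as 1.5 field goals.

```r
# Toy rows with made-up numbers (not real playoff data).
toy <- data.frame(
  FG = c(10, 4), FGA = c(20, 10),
  X3P = c(4, 0), X2P = c(6, 4),
  FT = c(5, 2), ORB = c(3, 1), DRB = c(7, 4)
)

# Effective field goal percentage: a made 3-pointer counts as 1.5 field goals.
toy$eFG. <- (toy$FG + 0.5 * toy$X3P) / toy$FGA

# Total rebounds and total points rebuilt from their parts.
toy$TRB <- toy$ORB + toy$DRB
toy$PTS <- 2 * toy$X2P + 3 * toy$X3P + toy$FT

toy
```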
Whew, that was a lot of information to explain, but I felt it was necessary to explain this in order to better understand the data (and the analyses I will do).
Now, before we start creating analyses, let’s first create a missmap to see if there are any missing values in the data (to create the missmap, remember to install and load the Amelia package). To create the missmap, use this line of code: missmap(file)
As you can see here, 98% of the observations are present, while only 2% are missing.
Let’s take a look at the five columns with missing data. They are:
- FT.: free throw percentage
- X3P.: percentage of 3-pointers made during playoffs
- X2P.: percentage of 2-pointers made during playoffs
- eFG.: percentage of successful field goal attempts during playoffs (adjusted for the fact that 3-pointers are worth more than 2-pointers)
- FG.: percentage of successful field goal attempts during playoffs (not adjusted for the score difference between 2- and 3-pointers)
What do all of these columns have in common? They are all percentages, and the reason there would be missing data in these columns is that a player has no attempts in the corresponding category, so there is nothing to divide by. For instance, FT. would be blank because a player didn’t attempt any free throws during the playoffs (the same logic applies to 3-pointers, 2-pointers, and field goals).
So what will we do with these columns? In this case, I will simply ignore these columns in my analysis because I won’t be focusing on percentages.
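If you’d rather check the missing values with numbers instead of a picture, base R can do it without any extra package. This is only a sketch on a made-up three-row data frame; with the real data you would pass file instead of toy.

```r
# Toy data (made-up numbers): percentages are NA where there were zero attempts.
toy <- data.frame(
  FT  = c(3, 0, 0),
  FTA = c(4, 0, 2),
  FT. = c(0.75, NA, 0)
)

# Count of missing values in each column.
na_counts <- colSums(is.na(toy))
na_counts

# Overall share of missing cells, the figure a missmap summarizes.
mean(is.na(toy))

# Every missing FT. should line up with zero attempts.
all(toy$FTA[is.na(toy$FT.)] == 0)
```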
Alright, now that I went over the basics of the data, let’s start analyzing the data. First off, I’ll start with a couple of linear regressions.
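The first linear regression I will do will analyze whether a player’s age is related to how many playoff games they played. Here’s the formula for that linear regression (note that Age is on the left-hand side, so Age is the response variable and G is the predictor):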
linearModel1 <- lm(Age~G, data=file)
And here’s a summary of the linear model:
Now, what does all of this mean? (and for a refresher on linear regression models, please read my post _):
- The residual standard error gives us the typical amount by which the response variable (Age) deviates from the regression line. In this example, a player’s age deviates from the regression line by roughly 4.066 years.
- The Multiple R-Squared and Adjusted R-Squared are measures of how well the model fits the data. The closer the R-squared to 1, the better the fit of the model.
- In this case, the Multiple R-Squared is 3.42% and the Adjusted R-Squared is 2.97%, meaning the model explains only about 3% of the variation in Age. Any linear relationship between a player’s age and the number of playoff games they played is weak.
- The F-Statistic is a measure of the relationship (or lack thereof) between the dependent and independent variables. This metric (and the corresponding p-value) isn’t too important when dealing with simple linear regression models such as this one, but it is important when analyzing multiple linear regression models (i.e. models with multiple variables).
- To make things easier, just focus on the F-statistic’s corresponding p-value when analyzing the relationship between the dependent and independent variables. If the p-value is less than 0.05, reject the null hypothesis (that the independent and dependent variables aren’t related). If the p-value is greater than 0.05, you fail to reject the null hypothesis.
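To see where each of these numbers lives in R, here’s a short sketch on made-up data (random ages and games, not the real playoff file). The column names Age and G match the post; everything else is an assumption for illustration.

```r
set.seed(1)
# Made-up data: 50 players with random ages and games played.
toy <- data.frame(Age = round(rnorm(50, mean = 27, sd = 4)),
                  G   = sample(1:21, 50, replace = TRUE))

fit <- lm(Age ~ G, data = toy)
s <- summary(fit)

s$sigma          # residual standard error, in years of Age
s$r.squared      # Multiple R-squared
s$adj.r.squared  # Adjusted R-squared

# summary() prints the overall p-value but does not store it; rebuild it from the F-statistic.
f <- s$fstatistic
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

# Decision rule at the 5% level.
if (p_value < 0.05) "reject the null hypothesis" else "fail to reject the null hypothesis"
```

In a simple regression with one predictor, this overall p-value is the same as the p-value shown for the predictor’s coefficient.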
OK, so it looks like there isn’t any correlation between a player’s age and the amount of playoff games they played. But what happens when I include Team in the analysis?
Here’s the code for that linear regression model:
linearModel2 <- lm(Age~G+Team, data=file)
However, something to note is that Team is of type chr. R’s lm() function will treat a chr column as a categorical variable on its own, but converting Team to type factor explicitly makes that clear and lets us inspect its levels. To do so, use this simple line of code:
file$Team <- as.factor(file$Team)
Alright, so let’s see the summary of this linear regression model:
Now, let’s analyze this model further:
- The residual standard error shows us that a player’s age deviates from the regression line by 3.902 years (down from 4.066 years in the previous model).
- The Multiple R-Squared and Adjusted R-Squared (17.26% and 10.64% respectively) show us that this model has a better fit than the previous model, but both metrics are still too small to imply a strong relationship between a player’s age, the team he plays for, and how many playoff games he played.
- The F-statistic’s corresponding p-value (0.001024) is much less than 0.05, so we reject the null hypothesis: taken together, G and Team have a statistically significant relationship with Age. Given the low R-squared values, though, that relationship is a weak one.
- Notice how the values for Team are displayed as separate rows. This is because when you include a factor in a linear regression, R creates one coefficient for each value of the factor except the first one, which serves as the baseline (there are 16 possible values in this case for the 16 teams that made it to playoffs, so 15 Team coefficients appear in the summary).
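Here’s a small sketch of how a factor shows up in the coefficients, using made-up data with only four teams so the output stays short. It relies on R’s default treatment contrasts, where the first factor level is the baseline.

```r
set.seed(2)
# Made-up data with four teams (ages and games are random, not real).
toy <- data.frame(
  Age  = round(rnorm(40, mean = 27, sd = 4)),
  G    = sample(1:21, 40, replace = TRUE),
  Team = rep(c("Heat", "Lakers", "Nuggets", "Celtics"), each = 10)
)
toy$Team <- as.factor(toy$Team)

fit <- lm(Age ~ G + Team, data = toy)

levels(toy$Team)   # levels are sorted alphabetically; the first is the baseline
names(coef(fit))   # intercept, G, and one coefficient per non-baseline team
```

Each Team coefficient is read as the difference in expected Age between that team and the baseline team, holding G fixed.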
Now we will do some k-means clustering analyses. First, let’s create a cluster using the variables FT and FG (free throws and field goals, respectively):
As you can see, I have displayed the head (the first six observations) of the cluster.
- I used 8 and 18 as the column numbers because FG and FT are the 8th and 18th columns in the dataset, respectively.
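Counting columns by hand is easy to get wrong, so here’s a quick sketch that looks the positions up by name instead. The vector of names is typed out from the variable list earlier in this post; with the real data you could use names(file) directly.

```r
# Column names as listed in the variable explanation earlier in the post.
cols <- c("Player", "Pos", "Age", "Tm", "G", "GS", "MP", "FG", "FGA", "FG.",
          "X3P", "X3PA", "X3P.", "X2P", "X2PA", "X2P.", "eFG.", "FT", "FTA", "FT.",
          "ORB", "DRB", "TRB", "AST", "STL", "BLK", "TOV", "PF", "PTS",
          "Team", "Position")

# Positions of the two clustering variables.
idx <- match(c("FG", "FT"), cols)
idx

# With the real data frame, selecting by name avoids counting columns by hand:
# cluster1 <- file[, c("FG", "FT")]
```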
Alright, let’s do some k-means clustering:
So what does this output tell us? Let’s find out:
- In this example, I created 5 clusters from cluster1. The 5 clusters have sizes of 104, 36, 10, 64, and 3 respectively.
- The cluster means show us the means for each variable used in this analysis: FT and FG. Recall that cluster means are calculated from the values in each cluster, so the cluster mean for FG in cluster 1 is calculated from all of the FG values in cluster 1 (same logic applies for the other 4 clusters).
- The clustering vector shows you which observations belong to which cluster. In this example, the cluster observations are sorted alphabetically by a player’s last name. The first three observations belong to clusters 1, 4, and 3, which correspond to Jaylen Adams, Steven Adams, and Edrice “Bam” Adebayo. Likewise, the last three observations belong to clusters 1, 1, and 2, which correspond to Nigel Williams-Goss, Delon Wright, and Ivica Zubac, respectively.
- The WCSSBC (within cluster sum of squares by cluster) shows us the variability of observations in a certain cluster. The smaller the WCSSBC, the more compact the cluster. In this example, cluster 1 is the most compact since it has the smallest WCSSBC (2886.26).
- The between_SS/total_SS ratio shows the overall goodness-of-fit for the model. The higher this ratio is, the better the model fits the data. In this example, the between_SS/total_SS is 91.5%, indicating that this model is an excellent fit for the data.
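Each piece of the output described above can also be pulled out of the k-means object directly. Here’s a sketch on made-up FG/FT totals; the object names cluster1 and NBAPlayoffs follow the post, while set.seed() and nstart are my own additions, since k-means starts from random centers and would otherwise give different cluster numbers on each run.

```r
set.seed(42)
# Made-up FG/FT totals for 60 players (not the real playoff data).
cluster1 <- data.frame(FG = rpois(60, lambda = 30), FT = rpois(60, lambda = 15))

# nstart = 25 reruns the algorithm from 25 random starts and keeps the best result.
NBAPlayoffs <- kmeans(cluster1, centers = 5, nstart = 25)

NBAPlayoffs$size                            # number of players in each cluster
NBAPlayoffs$centers                         # cluster means of FG and FT
head(NBAPlayoffs$cluster)                   # cluster assignment per row
NBAPlayoffs$withinss                        # within-cluster sum of squares by cluster
NBAPlayoffs$betweenss / NBAPlayoffs$totss   # the between_SS / total_SS ratio
```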
Now let’s graph our clusters (the plot() function used here is part of base R, so no extra package such as ggplot2 is needed):
Here’s the code we’ll use to make the first graph:
plot(cluster1, col=NBAPlayoffs$cluster, main="2020 NBA Playoffs Field Goals vs Free Throws", xlab = "Playoff field goals", ylab="Playoffs free throws", pch=19)
In this graph, we have 5 clusters grouped by color. Let’s analyze each of these clusters:
- The black cluster represents players who scored 20 or fewer playoff free throws and 20 or fewer playoff field goals. Some notable players that fall into this cluster include Jared Dudley (Lakers), Hassan Whiteside (Trail Blazers), and Bol Bol (Nuggets).
- The dark blue cluster represents players who scored between 0 and 35 playoff free throws and between 15 and 50 playoff field goals. Some notable players that fall into this cluster include Andre Iguodala (Heat), Markieff Morris (Lakers), and Victor Oladipo (Pacers).
- The red cluster represents players who scored between 20 and 50 playoff free throws and between 35 and 90 playoff field goals. Some notable players that fall into this cluster include Giannis Antetokounmpo (Bucks), Jae Crowder (Heat), and Rudy Gobert (Jazz).
- The other two clusters (which are light green and light blue) represent players who scored at least 45 playoff free throws and at least 100 playoff field goals.
- In case you’re wondering, the three players who are in the light green cluster are Jimmy Butler (Heat), Anthony Davis (Lakers), and LeBron James (Lakers)-all three of whom participated in the 2020 NBA Finals.
So, what insights can we gather from this data? One insight is certain: a team’s playoff position alone doesn’t have much impact on the number of free throws and field goals its players scored. For instance, Dion Waiters, JR Smith, and Talen Horton-Tucker are in the lowest FG/FT cluster, but all three of them are on the Lakers, who won the NBA Championship. However, Dion Waiters and Talen Horton-Tucker didn’t play in any Finals games, while JR Smith did play in some Finals games but only averaged 7.5 minutes per game throughout the playoffs.
Likewise, you might think that the two highest FG/FT clusters would only have Heat and Lakers players, given that these two teams made it all the way to the Finals. However, two exceptions to this are James Harden (Rockets) and Kawhi Leonard (Clippers), who were both eliminated in the Western Conference semifinals.
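One way to check an insight like this with numbers is to cross-tabulate playoff position against cluster number. The sketch below uses eight made-up rows; the commented line at the end shows what the call might look like with the objects from this post, assuming the rows of file and the clustering vector are in the same order.

```r
# Made-up cluster labels and playoff positions for eight players.
toy <- data.frame(
  Position = c("Finals", "Finals", "Finals", "WCSF", "WCSF", "ECR1", "ECR1", "ECR1"),
  cluster  = c(1, 3, 5, 5, 1, 1, 1, 4)
)

# Rows are playoff positions, columns are cluster numbers.
tab <- table(toy$Position, toy$cluster)
tab

# With the real objects this would be something like:
# table(file$Position, NBAPlayoffs$cluster)
```

If playoff position determined the cluster, each row of the table would have all of its players in a single column.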
Now let’s do another k-means clustering analysis, this time using two different scoring categories-total rebounds and assists (represented by TRB and AST respectively).
First, let’s create the main cluster (let’s use 23 and 24 as the column numbers because TRB and AST are the 23rd and 24th columns of the dataset, respectively):
Now, let’s create the k-means analysis:
What does this output tell us? Let’s find out:
- In this example, I created 4 clusters from cluster2 with sizes of 65, 18, 128, and 6, respectively.
- The cluster means display the averages of TRB and AST for each cluster.
- The clustering vector shows us which observations correspond to which cluster. From this vector, we can see that the first three observations belong to cluster 3.
- The WCSSBC indicates the compactness of each cluster. Since cluster 4 has the smallest WCSSBC (13214.83), it is the most compact cluster. Cluster 1, on the other hand, has the highest WCSSBC (32543.42), so it is the least compact cluster.
- The between_SS/total_SS ratio shows us the goodness-of-fit for the model. In this case, the between_SS/total_SS ratio is 82.9%, which indicates that this model is a good fit for the data (though this ratio is lower than the previous model’s ratio of 91.5%).
Alright, now let’s graph this cluster. Base R’s plot() function is all we need (no ggplot2 required), so use this line of code:
plot(cluster2, col=NBAPlayoffs2$cluster, main="2020 NBA Playoffs Rebounds vs Assists", xlab = "Playoff rebounds", ylab="Playoff assists", pch=19)
In this graph, we have 4 clusters grouped by color. Let’s see what each of these clusters means:
- The green cluster represents players who recorded 40 or fewer playoff assists and 30 or fewer playoff rebounds. Notable players who fall into this cluster include Meyers Leonard (Heat), Dion Waiters (Lakers), and Al Horford (76ers).
- The black cluster represents players who recorded between 0 and 60 playoff assists and between 35 and 100 playoff rebounds. Notable players who fall into this cluster include Kelly Olynyk (Heat), Chris Paul (Thunder), and Kyle Kuzma (Lakers).
- The red cluster represents players who recorded between 15 and 130 playoff assists and between 45 and 130 playoff rebounds. Notable players who fall into this cluster include Russell Westbrook (Rockets), Jae Crowder (Heat), and Kyle Lowry (Raptors).
- The blue cluster represents the six players in the top rebounds/assists tier, which include:
- Jimmy Butler (Heat)
- Jayson Tatum (Celtics)
- Nikola Jokic (Nuggets)
- Bam Adebayo (Heat)
- Anthony Davis (Lakers)
- LeBron James (Lakers)
- Four of these players participated in the NBA Finals while Jayson Tatum and Nikola Jokic were eliminated in the Conference Finals.
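A list like this can be pulled straight out of the k-means result rather than read off the graph. Here’s a sketch with six made-up players (the names A through F and all numbers are invented); since cluster numbers are assigned arbitrarily, the top tier is found by looking at the cluster centers.

```r
# Made-up rebound and assist totals for six hypothetical players.
toy <- data.frame(
  Player = c("A", "B", "C", "D", "E", "F"),
  TRB = c(5, 8, 6, 150, 160, 7),
  AST = c(3, 4, 2, 90, 100, 5),
  stringsAsFactors = FALSE
)

set.seed(7)
km <- kmeans(toy[, c("TRB", "AST")], centers = 2, nstart = 10)

# Cluster numbers are arbitrary, so find the top tier by its center.
top <- which.max(rowSums(km$centers))
toy$Player[km$cluster == top]
```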
So, what insights can we draw from this cluster analysis? First of all, similar to the previous cluster analysis, a team’s playoff position (e.g. Conference Finals) has no bearing on which player falls in which cluster. For example, the Heat and Lakers made it to the NBA Finals, yet there are Heat and Lakers players in every cluster.
And just as with the previous cluster analysis, the top cluster doesn’t only have Finals players, as Jayson Tatum and Nikola Jokic are included in the top cluster-and both of them were eliminated in the Conference Finals.
Last but not least, let’s create a cluster analyzing two things players try not to get-personal fouls and turnovers. First, let’s create the main cluster (use 27 and 28 as the column numbers since TOV and PF are the 27th and 28th columns in the dataset respectively):
Next, let’s create the k-means analysis:
So, what does this output tell us? Let’s find out:
- In this example, I created 4 clusters from cluster3 with sizes of 30, 70, 104, and 13, respectively.
- The cluster means display the means of TOV and PF for each cluster.
- The clustering vector shows us which observations belong to which cluster. In this example, the first three observations belong to cluster 3 (just like in the previous example).
- The WCSSBC shows us how compact each cluster is; in this example, cluster 3 is the most compact since it has the smallest WCSSBC (1976.298).
- The between_SS/total_SS ratio shows us the goodness-of-fit for the model; in this example, this ratio is 84.9%, which indicates that the model is a great fit for the data.
  - Interestingly, the k-means analysis with the best goodness-of-fit is the first analysis with a between_SS/total_SS ratio of 91.5%. That comparison deserves some caution, though: the first model had 5 clusters while this model and the previous model have 4 clusters, and this ratio tends to rise as more clusters are added.
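When comparing between_SS/total_SS across models, it helps to see how the ratio behaves as the number of clusters changes on the same data. Here’s a sketch on made-up turnover and foul totals (not the real playoff data); the range of k values is just an illustration.

```r
set.seed(3)
# Made-up two-variable data for 80 players.
toy <- data.frame(TOV = rpois(80, lambda = 12), PF = rpois(80, lambda = 20))

# between_SS / total_SS for k = 2 through 6 on the same data.
ratio <- sapply(2:6, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  km$betweenss / km$totss
})
names(ratio) <- paste0("k=", 2:6)
round(ratio, 3)
```

Because adding clusters generally pushes this ratio up, it is most useful for comparing models that use the same number of clusters, or for spotting the point where extra clusters stop helping much.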
Now, let’s plot our k-means analysis (again with base R’s plot() function, so ggplot2 isn’t required):
plot(cluster3, col=NBAPlayoffs3$cluster, main="2020 NBA Playoffs Turnovers vs Personal Fouls", xlab = "Playoff turnovers", ylab="Playoff personal fouls", pch=19)
Just as with the previous graph, we have four clusters grouped by color in this plot (and coincidentally in the same color order as the previous graph). What does each of these clusters mean? Let’s find out:
- The green cluster represents players who committed 15 or fewer playoff turnovers as well as 12 or fewer playoff personal fouls. Notable players that fall into this cluster include JR Smith (Lakers), Kyle Korver (Bucks), and Damian Lillard (Trail Blazers).
- In this example, being in the lowest cluster is the best, since the aim of every player is to commit as few turnovers and personal fouls as possible. Plus, if a player commits too many personal fouls in a game, they’re disqualified from the rest of that game.
- Also, the more turnovers a player commits, the more opportunities the opposing team gets to score.
- In reality, the data isn’t as black-and-white as I make it seem. This is because the further a player made it into the playoffs, the more opportunities they would have for committing turnovers and personal fouls.
- Data is never as black-and-white as it looks. The key to being a good data analyst is to analyze every piece of data in a dataset to see how it’s all interconnected.
- The red cluster represents players who committed between 7 and 35 playoff turnovers and between 2 and 35 playoff personal fouls. Notable players who fall into this cluster include Carmelo Anthony (Trail Blazers), Kendrick Nunn (Heat), and Rudy Gobert (Jazz).
- The black cluster represents players who committed between 9 and 35 playoff turnovers and between 33 and 66 playoff personal fouls. Notable players who fall into this cluster include Rajon Rondo (Lakers), Danny Green (Lakers), and Jae Crowder (Heat).
- The blue cluster represents the 13 players who are in the highest turnover/personal foul tier. Here’s the full list:
- Goran Dragic (Heat)
- Paul George (Clippers)
- Jaylen Brown (Celtics)
- Tyler Herro (Heat)
- James Harden (Rockets)
- Marcus Smart (Celtics)
- Bam Adebayo (Heat)
- Jayson Tatum (Celtics)
- Jamal Murray (Nuggets)
- Anthony Davis (Lakers)
- Jimmy Butler (Heat)
- Nikola Jokic (Nuggets)
- and LeBron James (Lakers)
- You might be surprised to see 6 NBA Finalists on this list, but I have a theory as to why that’s the case. See, the farther a player makes it in the playoffs, the more opportunities they have to score (and commit fouls and turnovers). And if you saw the 2020 NBA Finals, there were quite a few personal fouls and turnovers committed by both the Heat and Lakers (especially with Anthony Davis’s foul trouble in Game 3).
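One way to probe that theory would be to look at per-game rates instead of raw totals, since the G column tells us how many games each player had to pile up turnovers and fouls. This isn’t something I did in the analysis above; it’s only a sketch with two made-up players.

```r
# Made-up totals: player X went deep in the playoffs, player Y went out early.
toy <- data.frame(
  Player = c("X", "Y"),
  G   = c(21, 6),
  TOV = c(63, 24),
  PF  = c(42, 18)
)

# Per-game rates put long and short playoff runs on the same footing.
toy$TOV_per_G <- toy$TOV / toy$G
toy$PF_per_G  <- toy$PF / toy$G
toy
```

In this toy example, player X has far higher totals but lower per-game rates than player Y, which is exactly the kind of pattern raw totals can hide.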
Thanks for reading,
Michael