Python, Linear Regression & the 2024-25 NBA season


Hello everybody!

Michael here, and in today's post, we'll continue where we left off from the previous post, Python, Linear Regression & An NBA Season Opening Day Special Post. As I mentioned there, we'll use the linear regression equation we obtained to see if we can generate predictions for the current 2024-25 NBA season.

Disclaimer

Yes, I know I’m trying to predict the various juicy outcomes of the 2024-25 NBA season, but these predictions are purely meant for educational purposes to display the methodology of the predictions, not for game-day parlays and/or your fantasy NBA team. After all, I am your friendly neighborhood coding blogger, but I am not your friendly neighborhood sportsbook. If you do decide to bet on anything during the NBA season, please bet responsibly :-).

Previously on Michael’s Programming Bytes…

In the previous post, we used data from the last 10 NBA seasons for each of the 30 teams to predict season win totals, which in turn gave us this linear regression equation (coefficients rounded) that I will use to predict team-by-team results and standings for the 2024-25 NBA season:

Wins = -0.47(L) - 1.31(Finish) + 0.4(Age) + 34.13(FG%) - 22.12(3P%) + 50.95

Just to recap, here's what's in this equation:

  • -0.47 × L (team's losses in a given season)
  • -1.31 × Finish (team's conference finish, from 1 to 15, in a given season)
  • 0.4 × Age (average age of team's roster)
  • 34.13 × FG% (percentage of field goals made)
  • -22.12 × 3P% (percentage of 3-pointers made)
  • 50.95 (linear regression model intercept)

Our predictions in the previous post came back with 91% accuracy (a 9% mean absolute percentage error), so I'd say we're set up for some solid predictions here.

And now, for the predictions…

Yes, here comes the fun part, the predictions. For the predictions, I gathered the weighted averages of the five features we used in our model (losses, conference finish, average roster age, % of field goals made and % of 3-pointers made) and placed them into this spreadsheet:

Now, how did I calculate the weighted averages of these five features for each team? Well, I simply assigned different weights for different seasons like so:

  • 2021-22 to 2023-24 seasons: 0.2 weight (higher weight for the three most recent seasons)
  • 2018-19 to 2020-21 seasons: 0.1 weight (they're a little further back, plus I factored in COVID impacts to the 2019-20 and 2020-21 seasons)
  • 2014-15 to 2017-18 seasons: 0.025 weight (smallest weight since these are the furthest in the past, plus many players in the league during this time have since retired)

After assigning these weights, I multiplied each season's value by its weight and summed the results; since the weights add up to 1 (3 × 0.2 + 3 × 0.1 + 4 × 0.025 = 1), that sum is the weighted average.
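As a tiny sketch of that calculation (the loss totals below are made up for illustration, not the real spreadsheet values):

```python
# Hypothetical loss totals for one team, ordered oldest (2014-15)
# to newest (2023-24) -- illustrative numbers only
losses = [50, 45, 42, 43, 38, 40, 36, 33, 31, 30]

# The weights described above: 0.025 for the four oldest seasons,
# 0.1 for the middle three, 0.2 for the three most recent
weights = [0.025] * 4 + [0.1] * 3 + [0.2] * 3

# The weights sum to 1, so the weighted sum IS the weighted average
weighted_avg = sum(w * x for w, x in zip(weights, losses))
print(round(weighted_avg, 2))  # 34.7
```

The same calculation was repeated for each of the five features and each of the 30 teams.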

Here’s the basic Python code I used to calculate projected wins for all 30 NBA teams:



import pandas as pd

NBAAVG = pd.read_csv(r'C:\Users\mof39\OneDrive\Documents\NBA weighted averages.csv')

# Apply the regression equation to every team's row at once --
# pandas broadcasts the arithmetic, so no loop is needed
projected = (-0.47*NBAAVG['L'] - 1.31*NBAAVG['Finish']
             + 0.4*NBAAVG['Age'] + 34.13*NBAAVG['FG%']
             - 22.12*NBAAVG['3P%'] + 50.95)
print(projected)

And here are the projected win totals for each team using this equation:

0     37.911934
1     52.863761
2     40.819851
3     31.742252
4     37.524958
5     40.441851
6     42.540851
7     51.103223
8     24.263654
9     45.852691
10    33.197160
11    38.736829
12    47.364055
13    41.338946
14    41.297291
15    45.202762
16    53.722600
17    39.185063
18    37.443009
19    38.462010
20    39.284500
21    32.296571
22    47.795819
23    45.063567
24    33.312626
25    37.493793
26    32.519145
27    42.072515
28    41.920285
29    31.773436

Granted, you don't actually see the team names in this output, but since the teams are listed alphabetically in the dataset, you can tell which team corresponds to which projected win total. For clarity, though, I'll spell those totals out below:

Atlanta Hawks: 37.911934 wins (38-44)
Boston Celtics: 52.863761 wins (53-29)
Brooklyn Nets: 40.819851 wins (41-41)
Charlotte Hornets: 31.742252 wins (32-50)
Chicago Bulls: 37.524958 wins (38-44)
Cleveland Cavaliers: 40.441851 wins (40-42)
Dallas Mavericks: 42.540851 wins (43-39)
Denver Nuggets: 51.103223 wins (51-31)
Detroit Pistons: 24.263654 wins (24-58)
Golden State Warriors: 45.852691 wins (46-36)
Houston Rockets: 33.197160 wins (33-49)
Indiana Pacers: 38.736829 wins (39-43)
LA Clippers: 47.364055 wins (47-35)
LA Lakers: 41.338946 wins (41-41)
Memphis Grizzlies: 41.297291 wins (41-41)
Miami Heat: 45.202762 wins (45-37)
Milwaukee Bucks: 53.722600 wins (54-28)
Minnesota Timberwolves: 39.185063 wins (39-43)
New Orleans Pelicans: 37.443009 wins (37-45)
New York Knicks: 38.462010 wins (38-44)
Oklahoma City Thunder: 39.284500 wins (39-43)
Orlando Magic: 32.296571 wins (32-50)
Philadelphia 76ers: 47.795819 wins (48-34)
Phoenix Suns: 45.063567 wins (45-37)
Portland Trailblazers: 33.312626 wins (33-49)
Sacramento Kings: 37.493793 wins (37-45)
San Antonio Spurs: 32.519145 wins (33-49)
Toronto Raptors: 42.072515 wins (42-40)
Utah Jazz: 41.920285 wins (42-40)
Washington Wizards: 31.773436 wins (32-50)

As you can see above, I have managed to predict the records for each team for the 2024-25 NBA season. A few things to note about my predictions:

  • Since NBA records only come in whole numbers, I rounded each team's projected win total to the nearest whole number. For instance, the Milwaukee Bucks' projected win total of 53.722600 rounds up to 54 wins (and a 54-28 record).
  • According to my model, all teams' projected win totals fall between 24 and 54 wins. This makes sense, since in a given NBA season the majority of teams' win totals land in that range; in the last NBA season (2023-24), 21 teams fell within the 24-54 win range.
  • Last season, four teams won more than 54 games (Celtics with 64, Thunder and Nuggets with 57, and Timberwolves with 56) while five teams won fewer than 24 (Spurs with 22, Hornets and Trailblazers with 21, Wizards with 15 and Pistons with 14).
  • While I rounded to the nearest whole number to get each projected record, I'll still factor in the full decimal (e.g. 45.202762 for the Heat) when seeding teams: between close teams, the one with the higher decimal gets the higher seed in its conference.
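The rounding step is easy to reproduce; here's a quick helper of my own (a sketch, assuming an 82-game season):

```python
def to_record(projected_wins, games=82):
    """Round a projected win total to the nearest whole number
    and format it as a wins-losses record."""
    wins = round(projected_wins)
    return f"{wins}-{games - wins}"

print(to_record(53.722600))  # Bucks: 54-28
print(to_record(45.202762))  # Heat: 45-37
```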

Michael's Magnificently Way-Too-Early Playoff Picture

Yes, now that we have projected record totals for each of the 30 teams, the next thing we’ll do is predict each team’s seeding.

How will we seed the teams? Well, for one, I’ll rank the teams with the higher projected records higher in their respective conference. For instance, since the Bucks have a higher projected record than the Celtics, I’ll rank the Bucks higher than the Celtics.

However, what if two teams have a really, really close margin between them? For instance, the Minnesota Timberwolves and Oklahoma City Thunder’s projected records of 39.185063 wins and 39.284500 wins respectively are very close to each other. However, since OKC has a slightly higher projected win total, I’ll rank them higher than the Timberwolves.
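In code, this seeding rule is just a descending sort on the full-precision projections; here's a sketch using three of the Western Conference values from above:

```python
# Projected win totals (from the model output above)
projections = {
    'Nuggets': 51.103223,
    'Timberwolves': 39.185063,
    'Thunder': 39.284500,
}

# Higher projected wins -> higher seed; close margins are settled
# by the full decimal, so the Thunder edge out the Timberwolves
seeds = sorted(projections, key=projections.get, reverse=True)
print(seeds)  # ['Nuggets', 'Thunder', 'Timberwolves']
```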

So without further ado, here’s Michael’s Magnificently Way-Too-Early Playoff Picture!

Eastern Conference

Into the playoffs:
1. Milwaukee Bucks
2. Boston Celtics
3. Philadelphia 76ers
4. Miami Heat
5. Toronto Raptors
6. Brooklyn Nets

Into the play-in:
7. Cleveland Cavaliers
8. Indiana Pacers
9. Atlanta Hawks
10. New York Knicks

Out of playoff running:
11. Chicago Bulls
12. Washington Wizards
13. Charlotte Hornets
14. Orlando Magic
15. Detroit Pistons

Western Conference

Into the playoffs:
1. Denver Nuggets
2. LA Clippers
3. Golden State Warriors
4. Phoenix Suns
5. Dallas Mavericks
6. Utah Jazz

Into the play-in:
7. LA Lakers
8. Memphis Grizzlies
9. Oklahoma City Thunder
10. Minnesota Timberwolves

Out of playoff running:
11. Sacramento Kings
12. New Orleans Pelicans
13. San Antonio Spurs
14. Portland Trailblazers
15. Houston Rockets

And now, for some insights

Now that we have our predictions for both teams' projected win totals and projected conference seeding, let's see what insights we can gather into what the 2024-25 NBA season might bring for all 30 teams. Without further ado, here are insights across the NBA that I think will be interesting to watch play out over the course of the season:

Will the Celtics repeat as champs?

For those who don’t know, the Boston Celtics came out on top as the champions of the 2023-24 NBA season, beating the Dallas Mavericks in 5 games in the 2024 NBA Finals.

Question is, can they do it again? There’s a good chance that can happen, even with the projected 2-seed in the Eastern Conference. After all, the Celtics have kept many of their key playmakers from their championship squad such as Al Horford, Derrick White, Jaylen Brown and of course, Jayson Tatum.

Interestingly, we've had SIX different teams win the NBA championship in the last six seasons:

  • 2019-Raptors
  • 2020-Lakers
  • 2021-Bucks
  • 2022-Warriors
  • 2023-Nuggets
  • 2024-Celtics

Could we have a repeat champ for the first time since those seemingly endless Warriors-Cavs finals (remember those)? I’ll reiterate that it’s certainly possible, especially with Tatum in his prime.

Warriors for a deep playoff run?

Yes, I know they’ve had their ups and downs over the last 10 years, but after all, the Golden State Warriors have won 4 championships over the last 10 years, so I have reason to believe they’ll go on another deep playoff run.

Will the loss of Klay Thompson hurt? Yes. Stephen Curry is also on the back nine of his career (he turns 37 in March), but he put up the most points per game of anyone on the Warriors' roster last season (26.4). Curry also had the highest 3-point percentage on the roster last season (40.8%); recall that 3-point percentage was one of the five features I used in the linear regression model. Plus, Draymond Green will be returning to the Warriors this season; he proved to be one of the Warriors' strongest 3-point shooters and rebounders last season (though he too is in his later career, as he turns 35 in March).

Interestingly, this model has the Warriors going 46-36 as the 3-seed in the Western Conference. Funny enough, the Warriors finished 46-36 last season too, but ended up as the 10-seed in the Western Conference and failed to make it past the play-in.

This brings me to my next point…

Will the West be close again?

Last season, the Western Conference was incredibly close when it came to win totals and playoff seeding. After all, the 6-seed in the West last year (Phoenix Suns) still finished with a 49-33 record…and were promptly swept in the Western Conference first round (though that’s neither here nor there).

Another thing that puts the closeness of last year's Western Conference playoff race into perspective: the Warriors finished 46-36 yet only notched the 10-seed, and the Houston Rockets finished with an even 41-41 record but missed the postseason entirely (they got the 11-seed).

Which brings me to my next point…

Will the East be far apart?

While last year’s Western Conference was quite competitive, the Eastern Conference was, well, another story:

Image from Wikipedia: https://en.wikipedia.org/wiki/2023%E2%80%9324_NBA_season.

Yes, the Celtics not only got the 1-seed in the East but also finished FOURTEEN games ahead of the 2-seed New York Knicks (yes, the Knicks finished 50-32 and still got the 2-seed). Two teams that had very up-and-down seasons-the Bulls and Hawks-both finished with under 40 wins yet still qualified for the play-in as the 9- and 10-seeds in the East, respectively.

Miami Heat to the play…offs?

Throughout the last 10 years, the Miami Heat have had a great deal of success, making it to the Finals twice in that span (’20 and ’23) and making it to the playoffs 7 of the last 10 seasons (exceptions being ’15, ’17 and ’19).

However, while they did make the playoffs the last two seasons, they had to get there through the play-in first; both times they emerged with the 8-seed (meaning they had to play two play-in games just to secure a playoff slot).

In this model however, the Miami Heat will earn the 4-seed and make the actual playoffs, not the play-in. What could possibly work to their advantage? Here are a few factors:

  • While their successful field goal percentage was in the bottom half of the league last season, they came in 12th amongst all teams in successful 3-pointer percentage, which should help their case.
  • After losing Jimmy Butler and Terry Rozier before the playoffs last season, both are now (as of this writing) healthy and ready to play.
  • Those 42.3 rebounds (both offensive and defensive) last year look pretty good.
  • Tyler Herro, Bam Adebayo and Jimmy Butler were the Heat's top three scorers in both points per game and field goals last year…those stats certainly matter in big games. Plus, Herro is 24 and Adebayo is 27, so both are still in their primes (though Jimmy Butler, at 35, still plays like he's in his prime, in my opinion).

Will the Heat win an NBA championship or make another Finals appearance? TBD. However, it looks like (according to this model I made) that they will at least make it to the play-offs without needing to go through play-ins first (though their 8-seed to-the-Finals run in 2023 was certainly memorable).

And now, for the bottom of the conference

Most of my insights discussed more successful teams and (potential) deep playoff runs. However, I wanted to offer one more insight concerning the two teams at the (projected) bottom of their conferences-the Pistons in the East and Rockets in the West.

First off: the Detroit Pistons, who, according to my model, are projected to be the 15-seed again (as they were last season). Will they manage to improve this season? My guess is yes, at least in terms of win total (they had only 14 wins last year), but I don't think they'll make a strong playoff run; a 28-game losing streak that dropped them to 2-30 at one point last season certainly didn't help their postseason case. Still, give the Pistons credit for changing their coach (now J.B. Bickerstaff) and GM (now Trajan Langdon) and adding some solid free agents like Tobias Harris (a 48.7% field-goal percentage last season, not too shabby). Again, I doubt they'll make a strong playoff run, but they could very well finish higher than the 15-seed.

As for the Houston Rockets (projected 15-seed in the West), they finished as the 11-seed last year in a competitive Western Conference with an even 41-41 record. Judging from last year's stats, coming in 9th on defense but 20th on offense, they have some work to do to make a deep playoff run. However, with a good mix of young players like Tari Eason and veterans like Fred VanVleet (who was on the championship 2019 Toronto Raptors), the Rockets could make it past the play-in.

Just for fun…Michael’s Play-In Predictions

Now, as an added bonus for my loyal readers, here are my educated-guess, just-for-fun play-in predictions for both the Eastern and Western conferences. Granted, while the model did help predict regular-season seeding in each conference, it didn't predict who would make it past the play-in to grab the 7- and 8-seeds. So without further ado, here are my play-in predictions based on what I saw from these teams last season:

Eastern Conference

Predictions: Pacers 7-seed, Cavaliers 8-seed

Western Conference

Predictions: Lakers 7-seed, Timberwolves 8-seed

Thanks for reading, and I hope you learned something new from this post! Enjoy the NBA season, and I will follow up with a Part 3 post on this topic sometime in April, or at least sometime after the conclusion of the regular season. It will be interesting to see how accurate (or off) my predictions were.

Michael

Python, Linear Regression & An NBA Season Opening Day Special Post


Hello readers,

Michael here, and in today's lesson, we're gonna try something special! For one, we're going back to this blog's statistical roots with a linear regression post; I covered linear regression with R in the way, way back of 2018 (R Lesson 6: Linear Regression) on this blog, so I thought I'd show you how to work the linear regression process in Python. Two, I'm going to try something I don't normally do, which is predict the future; in this case, the future being the results of the just-beginning 2024-25 NBA season. Why try to predict NBA results, you might ask? Well, for one, I wanted to try something new on this blog (hey, gotta keep things fresh six years in), and for two, I enjoy following along with the NBA season. Plus, I enjoyed writing my post on the 2020 NBA playoffs, R Analysis 10: Linear Regression, K-Means Clustering, & the 2020 NBA Playoffs.

Let’s load our data and import our packages!

Before we get started on the analysis, let’s first load our data into our IDE and import all necessary packages:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

You’re likely quite familiar with pandas but for those of you that don’t know, sklearn is an open-source Python library commonly used for machine learning projects (like the linear regression we’re about to do)!

A note about uploading files via Google Colab

Once we import our necessary packages, the next thing we should do is upload the data-frame we’ll be using for this analysis.

This is the file we'll be using; it contains team statistics such as turnovers (team total) and wins for all 30 NBA teams for the last 10 seasons (2014-15 to 2023-24). The data was retrieved from basketball-reference.com, which is a great place to go if you're looking for juicy basketball data to analyze. That site is part of the https://www.sports-reference.com/ family, which covers statistics on various sports from the NBA to the NFL to the other football (soccer, for Americans), among others.

Now, since I used Google Colab for this analysis, I’ll show you how to upload Excel files into Colab (a different process from uploading Excel files into other IDEs):

To import local files into Google Colab, you’ll need to include the lines from google.colab import files and uploaded = files.upload() in the notebook since, for some odd reason, Google Colab won’t let you upload local files directly into your notebook. Once you run these two lines of code, you’ll need to select a file from the browser tool that you want to upload to Colab.

Next (and ideally in a separate cell), you’ll need to add the lines import io and dataframe = pd.read_csv(io.BytesIO(uploaded['dataframe name'])) to the notebook and run the code. This will officially upload your data-frame to your Colab notebook.

  • Yes, I know it’s annoying, but that’s just how Colab works. If you’re not using Colab to follow along with me, feel free to skip this section as a simple pd.read_csv() will do the trick to upload your data-frame onto the IDE.
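If you want to see why the io.BytesIO step works without opening Colab, you can simulate the dict that files.upload() returns (it maps each filename to the file's raw bytes); the CSV contents below are made up for illustration:

```python
import io
import pandas as pd

# Fake the structure files.upload() returns in Colab:
# a dict mapping filename -> the file's raw bytes
uploaded = {'NBA.csv': b'Team,W,L\nCeltics,64,18\nBucks,49,33\n'}

# pd.read_csv accepts any file-like object, so wrapping the bytes
# in io.BytesIO makes them look like an opened file
NBA = pd.read_csv(io.BytesIO(uploaded['NBA.csv']))
print(NBA.shape)  # (2, 3)
```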

Let’s learn about our data-frame!

Now that we’ve uploaded our data-frame into the IDE, let’s learn more about it!

NBA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Season  300 non-null    object 
 1   Team    300 non-null    object 
 2   W       300 non-null    int64  
 3   L       300 non-null    int64  
 4   Finish  300 non-null    int64  
 5   Age     300 non-null    float64
 6   Ht.     300 non-null    object 
 7   Wt.     300 non-null    int64  
 8   G       300 non-null    int64  
 9   MP      300 non-null    int64  
 10  FG      300 non-null    int64  
 11  FGA     300 non-null    int64  
 12  FG%     300 non-null    float64
 13  3P      300 non-null    int64  
 14  3PA     300 non-null    int64  
 15  3P%     300 non-null    float64
 16  2P      300 non-null    int64  
 17  2PA     300 non-null    int64  
 18  2P%     300 non-null    float64
 19  FT      300 non-null    int64  
 20  FTA     300 non-null    int64  
 21  FT%     300 non-null    float64
 22  ORB     300 non-null    int64  
 23  DRB     300 non-null    int64  
 24  TRB     300 non-null    int64  
 25  AST     300 non-null    int64  
 26  STL     300 non-null    int64  
 27  BLK     300 non-null    int64  
 28  TOV     300 non-null    int64  
 29  PF      300 non-null    int64  
 30  PTS     300 non-null    int64  
dtypes: float64(5), int64(23), object(3)
memory usage: 72.8+ KB

Running the NBA.info() command lets us see basic information about all 31 columns in our data-frame (column names, the number of non-null records, and each column's data type).

In case you’re wondering about all the abbreviations, here’s an explanation for each abbreviation:

  • Season-The specific season represented by the data (e.g. 2014-15)
  • Team-The team name
  • W-A team’s wins in a given season
  • L-A team’s losses in a given season
  • Finish-The seed a team finished in during a given season in their conference (e.g. Detroit Pistons finishing 15th seed in the East last season)
  • Age-The average age of a team’s roster as of February 1 of a given season (e.g. February 1, 2024 for the 2023-24 season)
  • Ht.-The average height of the team’s roster in a given season (e.g. 6’6)
  • Wt.-The average weight (in lbs.) of the team’s roster in a given season
  • G-Total amount of games played by the team in a given season
  • MP-Total minutes played as a team in a given season
  • FG-Field goals scored by the team in a given season
  • FGA-Field goal attempts made by the team in a given season
  • FG%-Percent of successful field goals made by team in a given season
  • 3P-3-point field goals scored by the team in a given season
  • 3PA-3-point field goal attempts made by the team in a given season
  • 3P%-Percent of successful 3-point field goals made by the team in a given season
  • 2P-2-point field goals scored by the team in a given season
  • 2PA-2-point field goal attempts made by the team in a given season
  • 2P%-Percent of successful 2-point field goals made by the team in a given season
  • FT-Free throws scored by the team in a given season
  • FTA-Free throw attempts made by the team in a given season
  • FT%-Percent of successful free throw attempts made by the team in a given season
  • ORB-Team’s total offensive rebounds in a given season
  • DRB-Team’s total defensive rebounds in a given season
  • TRB-Team’s total rebounds (both offensive and defensive) in a given season
  • AST-Team’s total assists in a given season
  • STL-Team’s total steals in a given season
  • BLK-Team’s total blocks in a given season
  • TOV-Team’s total turnovers in a given season
  • PF-Team’s total personal fouls in a given season
  • PTS-Team’s total points scored in a given season

Wow, that's a lot of variables! Now that we understand the data we're working with, let's see how we can make a simple linear regression model!

The K-Best Way To Set Up Your Model

Before we start the juicy analysis, let’s first pick the features we will use for the model. In this post, we’ll explore the Select K-Best algorithm, which is an algorithm commonly used in linear regression to help select the best features for a particular model:

X = NBA.drop(['Season', 'Team', 'W', 'Ht.'], axis=1)
y = NBA['W']

from sklearn.feature_selection import SelectKBest, f_regression
features = SelectKBest(score_func=f_regression, k=5)
features.fit(X, y)

selectedFeatures = X.columns[features.get_support()]
print(selectedFeatures)

Index(['L', 'Finish', 'Age', 'FG%', '3P%'], dtype='object')

According to the Select K-Best algorithm, the five best features to use in the linear regression are L, Finish, Age, FG% and 3P%. In other words, a team’s end-of-season seeding, total losses, average roster age, and percentage of successful field goals and 3-pointers are the five most important features to predict a team’s win total.

How did the model arrive at these conclusions? First of all, I set the X and y variables; this is important, as the Select K-Best algorithm needs to know which column is the dependent variable and which columns are candidate independent variables for the model. In this example, the dependent (or y) variable is W (team wins), while the X variable includes all other dataset columns except W, Season, Team, and Ht.: W is the y variable, and the other three are categorical (non-numerical) variables, so they won't work in this analysis.

Next, we import SelectKBest and f_regression from the sklearn.feature_selection module. Why do we need these two? Well, SelectKBest runs the Select K-Best algorithm, while f_regression is the scoring function it uses behind the scenes to rank the candidate features and keep the top k of them (I kept five features for this model).

After setting up the Select K-Best algorithm, we then fit both the X and y variables to the algorithm and then print out our top five selectedFeatures.
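To see the selection mechanics in isolation, here's a toy example on synthetic data (not the NBA frame): y depends strongly on the first column and not at all on the second, so SelectKBest keeps the informative one.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # two candidate features
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)   # only feature 0 matters

selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X, y)
print(selector.get_support())  # the informative column is selected
```

The boolean mask from get_support() is exactly what indexes X.columns in the NBA code above.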

Train, test…split!

Once we have our top five features for the model, it's time for the train, test, splitting of the model! What is train, test, split, you ask? Well, our data will be split into two sets: training data (the data we use to fit the model) and testing data (the data we use to evaluate it). Here's how we can utilize the train, test, split for this model:

X = NBA[['L', 'Finish', 'Age', 'FG%', '3P%']]
y = NBA['W']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

How does the train, test, split work? Using sklearn’s train_test_split method, we pass in four parameters-our independent variables (X), our dependent variable (y), the size of the test data (a decimal between 0 and 1), and the random state (this can be kept at 0, but it doesn’t matter what number you use-42 is another common number). In this model, I will utilize an 80/20 train, test, split, which indicates that 80% of the data will be for training while the other 20% will be used for testing.

Other common train, test, splits are 70/30, 85/15, and 67/33, but I opted for 80/20 because our dataset is only 300 rows long. I would utilize these other train, test, splits for larger datasets.

  • Something worth noting: What we’re doing here is called multiple linear regression since we’re using five X variables to predict a Y variable. Simple linear regression would only use one X variable to predict a Y variable. Just thought I’d throw in this quick factoid!
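As a quick sanity check of those split sizes, here's the same call on a dummy 300-row dataset (stand-in data, not the NBA frame):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-in: 300 rows, like our dataset
X = np.arange(300).reshape(300, 1)
y = np.arange(300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_test))  # 240 60
```

An 80/20 split of 300 rows leaves 240 rows for training and 60 for testing.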

And now, for the model-making

Now that we’ve done all the steps to set up our model, the next thing we’ll need to do is actually create the model!

Here’s how we can get started:

NBAMODEL = LinearRegression()
NBAMODEL.fit(X_train, y_train)

LinearRegression()

In this example, we create a LinearRegression() object (NBAMODEL) and fit it to both the X_train and y_train data.

Predictions, predictions

Once we’ve created our model, next comes the fun part-generating the predictions!

yPredictions = NBAMODEL.predict(X_test)

yPredictions

array([53.20097648, 28.89541793, 52.26551381, 53.22220829, 35.90676716,
       32.15874993, 47.72090936, 48.32896277, 39.4193884 , 40.1548429 ,
       19.62678175, 48.3263792 , 32.13473281, 43.50887634, 43.85260484,
       52.79795145, 27.35822648, 40.23392095, 18.85423981, 61.69624816,
       51.59650403, 23.86311747, 56.18087097, 54.15867678, 49.75211403,
       46.90177259, 31.80109001, 46.82531833, 37.50563942, 32.19863141,
       52.41205133, 25.09011881, 48.94542256, 38.80244997, 24.80146638,
       42.50107728, 43.27320835, 37.45199938, 46.7795962 , 28.11289951,
       57.64388881, 29.35812466, 18.3222965 , 36.26677012, 20.56912227,
       22.15266241, 19.9955299 , 44.84930613, 45.14740453, 23.19471644,
       53.940611  , 26.0780373 , 27.88093669, 61.23347337, 52.99948229,
       34.66653881, 30.04421016, 27.21669768, 48.55215233, 47.11060905])

The yPredictions are obtained by calling the predict method on the model with the X_test data, which in this case consists of 60 of the 300 records (the 20% test split).

Evaluating the model’s accuracy

Once we've created the model and made our predictions on the test data, it's time to evaluate the model's accuracy. Here's how to do so:

from sklearn.metrics import mean_absolute_percentage_error

mean_absolute_percentage_error(y_test,yPredictions)

0.09147159762376074

There are several ways to evaluate the accuracy of a linear regression model. One good method, shown here, is mean_absolute_percentage_error (imported from the sklearn.metrics module). The mean absolute percentage error indicates how far off, on average, the model's predictions are as a percentage of the actual values. In this model, the mean absolute percentage error is 0.09147159762376074, meaning the model's predictions are off by roughly 9%, which also means that overall the model's predictions are roughly 91% accurate. Not too shabby for this model!

  • Interestingly, the two COVID impacted NBA seasons in the dataset (2019-20 and 2020-21) didn’t throw off the model’s accuracy much.
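To show exactly what sklearn is measuring, here's MAPE computed by hand on a tiny made-up example (the actual and predicted values below are illustrative only):

```python
# Made-up actual and predicted win totals, just for illustration
actual = [50, 40, 30]
predicted = [45, 44, 27]

# MAPE: the mean of |error| / actual across all observations
mape = sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)
print(round(mape, 4))  # 0.1 -> predictions off by about 10% on average
```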

Don’t forget about the equation!

Evaluating the model’s accuracy isn’t the only thing you should do when analyzing the model. You should also grab the model’s coefficients and intercept-they will be important in the next post!

NBAMODEL.coef_

array([ -0.4663858 ,  -1.30716212,   0.39700734,  34.1325687 ,
       -22.12258585])
NBAMODEL.intercept_

50.945769772855854

All linear regression models have coefficients and an intercept, which together form the linear regression equation. Since our model had five X variables, there are five coefficients.

Now, what would our equation look like? Here it is in all its messy glory (coefficients rounded to two decimal places):

Wins = -0.47(L) - 1.31(Finish) + 0.4(Age) + 34.13(FG%) - 22.12(3P%) + 50.95

We're going to be using this equation in the next post.
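To sanity-check the coefficients, you can apply them to a single row of features by hand; the feature values below are hypothetical, not taken from the dataset:

```python
import numpy as np

# Coefficients and intercept printed by NBAMODEL above
coef = np.array([-0.4663858, -1.30716212, 0.39700734,
                 34.1325687, -22.12258585])
intercept = 50.945769772855854

# A hypothetical team: [L, Finish, Age, FG%, 3P%]
row = np.array([30, 3, 26.5, 0.48, 0.37])

# Dot product of coefficients and features, plus the intercept
predicted_wins = coef @ row + intercept
print(round(predicted_wins, 2))
```

This is the same arithmetic NBAMODEL.predict() performs for every row of X_test.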

Linear regression plotting

For the visual learners among my readers, I thought it would be nice to include a simple scatterplot to visualize the accuracy of our linear regression model. Here’s how to create that plot:

import matplotlib.pyplot as plt
plt.scatter(y_test, yPredictions, color="red")
plt.xlabel('Actual values', size=15)
plt.ylabel('Predicted values', size=15)
plt.title('Actual vs Predicted values', size=15)
plt.show()

First, I imported the matplotlib.pyplot module. Then, I ran the plt.scatter() method to create a scatterplot, with three parameters: the y_test values, the yPredictions values, and color="red" (which just makes the scatterplot dots red). I then used the plt.xlabel(), plt.ylabel(), and plt.title() methods to give the scatterplot an x-axis label, y-axis label, and title, respectively. Lastly, I used plt.show() to display the scatterplot in all of its red-dotted glory.

As you can see from this plot, the predicted values match the actual values fairly closely, hence the 91% accuracy/9% error.

Thanks for reading, enjoy the upcoming NBA season action, and stay tuned for my next post where I reveal my predicted records and standings for each team, East and West! It will be interesting to see how my predictions pan out over the course of the season-after all, it’s certainly something different I’m trying on this blog!

And yes, perfect timing for this blog to come out on NBA season opening day! Serendipity am I right?

Also, here’s a link to the notebook in GitHub-https://github.com/mfletcher2021/DevopsBasics/blob/master/NBA_24_25_predictions.ipynb.

R Analysis 10: Linear Regression, K-Means Clustering, & the 2020 NBA Playoffs


Hello everybody,

Michael here, and today’s post will be an R analysis on the 2020 NBA Playoffs.

As many of you know, the 2019-20 NBA Season was suspended on March 11, 2020 due to COVID-19. However, the season resumed on July 30, 2020 inside a “bubble” (essentially an isolation zone without fans present) in Disney World. 22 teams played 8 regular-season “seeding” games in the bubble before playoffs commenced on August 14, 2020. The resumed season concluded on October 11, 2020, when the LA Lakers defeated the Miami Heat in the NBA Finals to win their 17th championship. The data comes from https://www.basketball-reference.com/playoffs/NBA_2020_totals.html (Basketball Reference), which is a great site if you’re looking for basketball statistics and/or want to do some basketball analyses.

Before we start analyzing the data, let’s first upload it to R and learn about the data (and here’s the data):

This dataset contains the names of 217 NBA players who participated in the 2020 Playoffs along with 30 other types of player statistics.

Now, I know it’s a long list, but here’s an explanation of all 31 variables in this dataset:

  • Player-The name of the player
  • Pos-The player’s position on the team. For an explanation of the five main basketball positions, please check out this link https://jr.nba.com/basketball-positions/
  • Age-The age of the player as of July 30, 2020 (the date the NBA season resumed)
  • Tm-The team the player played for during the 2019-20 NBA season
  • G-The games the player participated in during the 2020 NBA playoffs
  • GS-The amount of playoff games in which the player was in the starting lineup
  • MP-The amount of minutes a player was on the court during the playoffs
  • FG-The amount of field goals a player scored during the playoffs
    • For those who don’t know, a field goal in basketball is any shot that’s not a free throw.
  • FGA-The amount of field goals a player tried to make during the playoffs (this includes both successful and unsuccessful field goals)
  • FG.-The percentage of a player’s field goal attempts that were successful.
  • X3P-The amount of 3-point field goals a player scored during the playoffs
    • Field goals can be worth either 2 or 3 points, depending on where the player shoots the field goal from
  • X3PA-The amount of 3-point field goals a player attempted during the playoffs (both successful and unsuccessful 3-pointers)
  • X3P.-The percentage of a player’s 3-point field goals that were successful
  • X2P-The amount of 2-point field goals a player scored during the playoffs
  • X2PA-The amount of 2-point field goals a player attempted during the playoffs (both successful and unsuccessful 2-pointers)
  • X2P.-The percentage of a player’s 2-point field goals that were successful
  • eFG.-The percentage of a player’s successful field goal attempts, adjusted for the fact that 3-point field goals are worth more points than 2-point field goals
  • FT-The amount of free throws a player scored during the playoffs
  • FTA-The amount of free throw attempts a player made during the playoffs (both successful and unsuccessful attempts)
  • FT.-The percentage of a player’s successful free throw attempts during the playoffs
  • ORB-The amount of offensive rebounds a player made during the playoffs
    • A player gets a rebound when they retrieve the ball after another player misses a free throw or field goal. The player gets an offensive rebound if their team is currently on offense and a defensive rebound if their team is currently on defense.
  • DRB-The amount of defensive rebounds a player made during the playoffs
  • TRB-The sum of a player’s ORB and DRB
  • AST-The amount of assists a player made during the playoffs
    • A player gets an assist when they pass the ball to one of their teammates who then scores a field goal.
  • STL-The amount of steals a player made during the playoffs
    • A player gets a steal when they cause a player on the opposing team to turn over the ball.
  • BLK-The amount of blocks a player made during the playoffs
    • A player gets a block when they deflect a field goal attempt from a player on the other team
  • TOV-The amount of turnovers a player made during the playoffs
    • A player gets a turnover when they lose possession of the ball to the opposing team. Just in case you were wondering, the goal for a player is to make as few turnovers as possible.
  • PF-The amount of personal fouls a player made during the playoffs
    • A player gets a personal foul when they make illegal personal contact with a player on the other team. The goal for players is to make as few personal fouls as possible, since they will be disqualified from the remainder of the game if they get too many.
  • PTS-The total number of points a player scored during the playoffs (from 2-point field goals, 3-point field goals, and free throws)
  • Team-The name of the team the player played for during the 2020 NBA playoffs
    • Team and Tm basically give you the same information, except Team gives you the name of the team (e.g. Mavericks) while Tm gives you the three-letter abbreviation for the team (e.g. in the case of the Mavericks the 3-letter abbreviation is DAL)
  • Position-The position the team reached in the 2020 NBA playoffs. There are 7 possible values for position which include:
    • ECR1-The team made it to the first round of the Eastern Conference playoffs
    • ECSF-The team made it to the Eastern Conference Semifinals
    • ECF-The team made it to the Eastern Conference Finals
    • WCR1-The team made it to the first round of the Western Conference playoffs
    • WCSF-The team made it to the Western Conference Semifinals
    • WCF-The team made it to the Western Conference Finals
    • Finals-The team made it to the 2020 NBA Finals

Whew, that was a lot of information to explain, but I felt it was necessary to explain this in order to better understand the data (and the analyses I will do).

Now, before we start creating analyses, let’s first create a missmap to see if there are any missing values in the data (to create the missmap, remember to install the Amelia package). To create the missmap, use this line of code-missmap(file):
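
This step can be sketched like so, assuming the Basketball Reference data has been saved locally under the hypothetical filename NBA_2020_playoffs.csv:

```r
# Load the Amelia package and the playoff data, then plot the missingness map
# (the filename here is an assumption-use whatever you saved the data as)
library(Amelia)                      # install.packages("Amelia") if needed
file <- read.csv("NBA_2020_playoffs.csv", stringsAsFactors = FALSE)
missmap(file)
```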

As you can see here, 98% of the observations are present, while only 2% are missing.

Let’s take a look at the five columns with missing data. They are:

  • FT.-free throw percentage
  • X3P.-percentage of 3-pointers made during playoffs
  • X2P.-percentage of 2-pointers made during playoffs
  • eFG.-percentage of successful field goal attempts during playoffs (adjusted for the fact that 3-pointers are worth more than 2-pointers)
  • FG.-percentage of successful field goal attempts during playoffs (not adjusted for the score difference between 2- and 3-pointers)

What do all of these columns have in common? They are all percentages, and the reason data would be missing in these columns is that a player has no attempts in the relevant category-you can’t compute a percentage from zero attempts. For instance, FT. would be blank if a player didn’t attempt any free throws during the playoffs (the same logic applies to 3-pointers, 2-pointers, and field goals).
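
To double-check which columns the NAs live in (and how many there are per column), a one-liner like this works, assuming the data frame is called file as elsewhere in this post:

```r
# Count missing values per column and keep only the columns that have any
naCounts <- colSums(is.na(file))
naCounts[naCounts > 0]
```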

So what will we do with these columns? In this case, I will simply ignore these columns in my analysis because I won’t be focusing on percentages.

Alright, now that I went over the basics of the data, let’s start analyzing the data. First off, I’ll start with a couple of linear regressions.

The first linear regression I will do will look at the relationship between a player’s age and how many playoff games they played, with Age as the response variable. Here’s the formula for that linear regression:

linearModel1 <- lm(Age~G, data=file)

And here’s a summary of the linear model:

Now, what does all of this mean? (and for a refresher on linear regression models, please read my post _):

  • The residual standard error gives us the amount that the response variable (Age) deviates from the regression line. In this example, the player’s age deviates from the regression line by 4.066 years.
  • The Multiple R-Squared and Adjusted R-Squared are measures of how well the model fits the data. The closer the R-squared to 1, the better the fit of the model.
    • In this case, the Multiple R-Squared is 3.42% and the Adjusted R-Squared is 2.97%, indicating that there is almost no correlation between a player’s age and the amount of playoff games they played.
  • The F-Statistic is a measure of the relationship (or lack thereof) between the dependent and independent variables. This metric (and the corresponding p-value) isn’t too important when dealing with simple linear regression models such as this one, but it is important when analyzing multiple linear regression models (i.e. models with multiple variables).
    • To make things easier, just focus on the F-statistic’s corresponding p-value when analyzing the relationship between the dependent and independent variables. If the p-value is less than 0.05, reject the null hypothesis (that the independent and dependent variables aren’t related). If the p-value is greater than 0.05, you fail to reject the null hypothesis.
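
As a quick sketch, here’s how to pull that p-value straight out of a fitted model (using the linearModel1 object from above):

```r
# The F-statistic and its degrees of freedom live in the model summary
s <- summary(linearModel1)
f <- s$fstatistic                          # named vector: value, numdf, dendf
pValue <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
pValue < 0.05                              # TRUE means we reject the null hypothesis
```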

OK, so it looks like there isn’t any correlation between a player’s age and the amount of playoff games they played. But what happens when I include Team in the analysis?

Here’s the code for that linear regression model:

linearModel2 <- lm(Age~G+Team, data=file)

However, something to note is that Team is of type chr. R will convert character predictors to factors behind the scenes, but it’s good practice to do the conversion explicitly before modeling, using this simple line of code:

file$Team <- as.factor(file$Team)

Alright, so let’s see the summary of this linear regression model:

Now, let’s analyze this model further:

  • The residual standard error shows us that a player’s age deviates from the regression line by 3.902 years (down from 4.066 years in the previous model).
  • The Multiple R-Squared and Adjusted R-Squared (17.26% and 10.64% respectively) show us that this model fits better than the previous one, but both metrics are still too small to imply much of a relationship between a player’s age, the team he plays for, and how many playoff games he played.
  • The F-statistic’s corresponding p-value is much less than 0.05 (0.001024), so we reject the null hypothesis: taken together, games played and team do have a statistically significant relationship with a player’s age-though the low R-Squared tells us that relationship is weak.
  • Notice how the values for Team are displayed. This is because when you include a factor in a linear regression, R creates a coefficient for each level of the factor except one baseline level, which is absorbed into the intercept (so you see 15 coefficients for the 16 teams that made it to the playoffs)

Now we will do some k-means clustering analyses. First, let’s create a cluster using the variables FT and FG (free throws and field goals, respectively):

As you can see, I have displayed the head (the first six observations) of the cluster.

  • I used 8 and 18 as the column numbers because FG and FT are the 8th and 18th columns in the dataset, respectively.

Alright, let’s do some k-means clustering:
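
These two steps can be sketched as follows, assuming the data frame is called file and using the NBAPlayoffs object name that appears in the plotting code below:

```r
# Pull out the FG (column 8) and FT (column 18) columns for clustering
cluster1 <- file[, c(8, 18)]
head(cluster1)                       # the first six FG/FT observations

# k-means starts from random centers, so fix the seed for reproducible results
set.seed(1)
NBAPlayoffs <- kmeans(cluster1, centers = 5)
NBAPlayoffs                          # prints sizes, means, clustering vector, and SS
```

Because of the random initialization, the exact cluster sizes and numbering you get may differ from the ones discussed below.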

So what does this output tell us? Let’s find out:

  • In this example, I created 5 clusters from cluster1. The 5 clusters have sizes of 104, 36, 10, 64, and 3 respectively.
  • The cluster means show us the means for each variable used in this analysis-FT and FG. Recall that cluster means are calculated from the values in each cluster, so the cluster mean for FG in cluster 1 is calculated from all of the FG values in cluster 1 (same logic applies for the other 4 clusters).
  • The clustering vector shows you which observations belong to which cluster. In this example, the cluster observations are sorted alphabetically by a player’s last name. The first three observations belong to clusters 1, 4, and 3, which correspond to Jaylen Adams, Steven Adams, and Edrice “Bam” Adebayo. Likewise, the last three observations belong to clusters 1, 1, and 2, which correspond to Nigel Williams-Goss, Delon Wright, and Ivica Zubac, respectively.
  • The WCSSBC (within cluster sum of squares by cluster) shows us the variability of observations in a certain cluster. The smaller the WCSSBC, the more compact the cluster. In this example, cluster 1 is the most compact since it has the smallest WCSSBC (2886.26).
  • The between_SS/total_SS ratio shows the overall goodness-of-fit for the model. The higher this ratio is, the better the model fits the data. In this example, the between_SS/total_SS is 91.5%, indicating that this model is an excellent fit for the data.

Now let’s graph our clusters (note that the plot() function used here is part of base R, so the ggplot2 package isn’t actually needed for it):

Here’s the code we’ll use to make the first graph:

plot(cluster1, col=NBAPlayoffs$cluster, main="2020 NBA Playoffs Field Goals vs Free Throws", xlab = "Playoff field goals", ylab="Playoffs free throws", pch=19)

In this graph, we have 5 clusters grouped by color. Let’s analyze each of these clusters:

  • The black cluster represents players who scored 20 or fewer playoff free throws and 20 or fewer playoff field goals. Some notable players that fall into this cluster include Jared Dudley (Lakers), Hassan Whiteside (Trail Blazers), and Bol Bol (Nuggets).
  • The dark blue cluster represents players who scored between 0 and 35 playoff free throws and between 15 and 50 playoff field goals. Some notable players that fall into this cluster include Andre Iguodala (Heat), Markieff Morris (Lakers), and Victor Oladipo (Pacers).
  • The red cluster represents players who scored between 20 and 50 playoff free throws and between 35 and 90 playoff field goals. Some notable players that fall into this cluster include Giannis Antetokounmpo (Bucks), Jae Crowder (Heat), and Rudy Gobert (Jazz).
  • The other two clusters (which are light green and light blue) represent players who scored at least 45 playoff free throws and at least 100 playoff field goals.
    • In case you’re wondering, the three players who are in the light green cluster are Jimmy Butler (Heat), Anthony Davis (Lakers), and LeBron James (Lakers)-all three of whom participated in the 2020 NBA Finals.

So, what insights can we gather from this data? One insight is certain-a team’s playoff position alone doesn’t have much impact on the number of free throws and field goals its players scored. For instance, Dion Waiters, JR Smith, and Talen Horton-Tucker are in the lowest FG/FT cluster, yet all three of them are on the Lakers, who won the NBA Championship. However, Dion Waiters and Talen Horton-Tucker didn’t play in any Finals games, while JR Smith did play in some Finals games but only averaged 7.5 minutes per game throughout the playoffs.

Likewise, you might think that the two highest FG/FT clusters would only have Heat and Lakers players, given that these two teams made it all the way to the Finals. Two exceptions to this are James Harden (Rockets) and Kawhi Leonard (Clippers), who were both eliminated in the Western Conference semifinals.

Now let’s do another k-means clustering analysis, this time using two different statistical categories-total rebounds and assists (represented by TRB and AST respectively).

First, let’s create the main cluster (let’s use 23 and 24 as the column numbers because TRB and AST are the 23rd and 24th columns of the dataset, respectively):

Now, let’s create the k-means analysis:
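
Here’s the analogous sketch for this pairing, again assuming the data frame file and the NBAPlayoffs2 object name that appears in the plotting code below:

```r
# Pull out TRB (column 23) and AST (column 24), then cluster into 4 groups
cluster2 <- file[, c(23, 24)]
set.seed(1)                          # fix the seed so the clustering is reproducible
NBAPlayoffs2 <- kmeans(cluster2, centers = 4)
NBAPlayoffs2$size                    # the four cluster sizes
```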

What does this output tell us? Let’s find out:

  • In this example, I created 4 clusters from cluster2 with sizes of 65, 18, 128, and 6, respectively.
  • The cluster means display the averages of TRB and AST for each cluster.
  • The clustering vector shows us which observations correspond to which cluster. From this vector, we can see that the first three observations belong to cluster 3.
  • The WCSSBC indicates the compactness of each cluster. Since cluster 4 has the smallest WCSSBC (13214.83), it is the most compact cluster. Cluster 1, on the other hand, has the highest WCSSBC (32543.42), so it is the least compact cluster.
  • The between_SS/total_SS ratio shows us the goodness-of-fit for the model. In this case, the between_SS/total_SS ratio is 82.9%, which indicates that this model is a good fit for the data (though this ratio is lower than the previous model’s ratio of 91.5%).

Alright, now let’s graph this cluster. Use this line of code (again, plot() is base R, so no extra package is required):

plot(cluster2, col=NBAPlayoffs2$cluster, main="2020 NBA Playoffs Rebounds vs Assists", xlab = "Playoff rebounds", ylab="Playoff assists", pch=19)

In this graph, we have 4 clusters grouped by color. Let’s see what each of these clusters mean:

  • The green cluster represents players who recorded 40 or fewer playoff assists and 30 or fewer playoff rebounds. Notable players who fall into this cluster include Meyers Leonard (Heat), Dion Waiters (Lakers), and Al Horford (76ers).
  • The black cluster represents players who recorded between 0 and 60 playoff assists and between 35 and 100 playoff rebounds. Notable players who fall into this cluster include Kelly Olynyk (Heat), Chris Paul (Thunder), and Kyle Kuzma (Lakers).
  • The red cluster represents players who recorded between 15 and 130 playoff assists and between 45 and 130 playoff rebounds. Notable players who fall into this cluster include Russell Westbrook (Rockets), Jae Crowder (Heat), and Kyle Lowry (Raptors).
  • The blue cluster represents the six players in the top rebounds/assists tier, which include:
    • Jimmy Butler (Heat)
    • Jayson Tatum (Celtics)
    • Nikola Jokic (Nuggets)
    • Bam Adebayo (Heat)
    • Anthony Davis (Lakers)
    • LeBron James (Lakers)
      • Four of these players participated in the NBA Finals while Jayson Tatum and Nikola Jokic were eliminated in the Conference Finals.

So, what insights can we draw from this cluster analysis? First of all, similar to the previous cluster analysis, a team’s playoff position (e.g. reaching the Conference Finals) has no bearing on which cluster a player falls into. For example, the Heat and Lakers made it to the NBA Finals, yet there are Heat and Lakers players in every cluster.

And just as with the previous cluster analysis, the top cluster doesn’t only have Finals players, as Jayson Tatum and Nikola Jokic are included in the top cluster-and both of them were eliminated in the Conference Finals.

Last but not least, let’s create a cluster analyzing two things players try not to get-personal fouls and turnovers. First, let’s create the main cluster (use 27 and 28 as the column numbers since TOV and PF are the 27th and 28th columns in the dataset respectively):

Next, let’s create the k-means analysis:

So, what does this output tell us? Let’s find out:

  • In this example, I created 4 clusters from cluster3 with sizes of 30, 70, 104, and 13, respectively.
  • The cluster means display the means of TOV and PF for each cluster.
  • The clustering vector shows us which observations belong to which cluster. In this example, the first three observations belong to cluster 3 (just like in the previous example).
  • The WCSSBC shows us how compact each cluster is; in this example, cluster 3 is the most compact since it has the smallest WCSSBC (1976.298).
  • The between_SS/total_SS ratio shows us the goodness-of-fit for the model; in this example, this ratio is 84.9%, which indicates that the model is a great fit for the data.
    • Interestingly, the k-means analysis with the best goodness-of-fit is the first one, with a between_SS/total_SS ratio of 91.5%. This is actually less surprising than it seems: the first model had 5 clusters while this model and the previous one have 4, and adding more clusters almost always increases this ratio, since each extra cluster can only reduce the within-cluster variance.
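
Comparing the three models’ goodness-of-fit can be sketched in a couple of lines, assuming the NBAPlayoffs, NBAPlayoffs2, and NBAPlayoffs3 object names used in the plotting code throughout this post:

```r
# between_SS / total_SS for each k-means fit; higher means a better fit
ratio <- function(km) km$betweenss / km$totss
sapply(list(FG_FT = NBAPlayoffs, TRB_AST = NBAPlayoffs2, TOV_PF = NBAPlayoffs3),
       ratio)
```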

Now, let’s plot our k-means analysis (once again, the plot() function is base R, so no extra package is needed):

plot(cluster3, col=NBAPlayoffs3$cluster, main="2020 NBA Playoffs Turnovers vs Personal Fouls", xlab = "Playoff turnovers", ylab="Playoff personal fouls", pch=19)

Just as with the previous graph, we have four clusters grouped by color in this plot (and coincidentally in the same color order as the previous graph). What do each of these clusters mean? Let’s find out:

  • The green cluster represents players who committed 15 or fewer playoff turnovers as well as 12 or fewer playoff personal fouls. Notable players that fall into this cluster include JR Smith (Lakers), Kyle Korver (Bucks), and Damian Lillard (Trail Blazers).
    • In this example, being in the lowest cluster is the best, since the aim of every player is to commit as few turnovers and personal fouls as possible. Plus, if a player commits too many personal fouls in a game, they foul out and are disqualified from the rest of it.
    • Also, the more turnovers a player commits, the more opportunities the opposing team gets to score.
      • In reality, the data isn’t as black-and-white as I make it seem. This is because the further a player made it into the playoffs, the more opportunities they would have for committing turnovers and personal fouls.
      • Data is never as black-and-white as it looks. The key to being a good data analyst is to analyze every piece of data in a dataset to see how it’s all interconnected.
  • The red cluster represents players who committed between 7 and 35 playoff turnovers and between 2 and 35 playoff personal fouls. Notable players who fall into this cluster include Carmelo Anthony (Trail Blazers), Kendrick Nunn (Heat), and Rudy Gobert (Jazz).
  • The black cluster represents players who committed between 9 and 35 playoff turnovers and between 33 and 66 playoff personal fouls. Notable players who fall into this cluster include Rajon Rondo (Lakers), Danny Green (Lakers), and Jae Crowder (Heat).
  • The blue cluster represents the 13 players who are in the highest turnover/personal foul tier. Here’s the full list:
    • Goran Dragic (Heat)
    • Paul George (Clippers)
    • Jaylen Brown (Celtics)
    • Tyler Herro (Heat)
    • James Harden (Rockets)
    • Marcus Smart (Celtics)
    • Bam Adebayo (Heat)
    • Jayson Tatum (Celtics)
    • Jamal Murray (Nuggets)
    • Anthony Davis (Lakers)
    • Jimmy Butler (Heat)
    • Nikola Jokic (Nuggets)
    • LeBron James (Lakers)
      • You might be surprised to see 6 NBA Finalists on this list, but I have a theory as to why that’s the case. See, the farther a player makes it in the playoffs, the more opportunities they have to score (and to commit fouls and turnovers). And if you saw the 2020 NBA Finals, there were quite a few personal fouls and turnovers committed by both the Heat and Lakers (especially with Anthony Davis’s foul trouble in Game 3).

Thanks for reading,

Michael

R Analysis 8: Linear Regression & the Top 200 Albums of the 2010s


Hello everybody,

It’s Michael, and welcome to the roaring (20)20s! With that said, it’s fitting that my first post of the 2020s will be about the 2010s-more specifically, Billboard’s Top 200 Albums of the 2010s, which I will analyze using linear regression.

Here’s the dataset-Billboard 200.

As always, let’s open up R, upload the file, and learn more about our dataset:

This dataset shows all of the albums that ended up on Billboard’s Top 200 Decade-End Chart along with information about each album (such as number of tracks).

In total, there are 200 observations of 7 variables:

  • Rank-Where the album ranked on Billboard’s Top 200 Decade-End Chart (anywhere from 1 to 200)
  • Album-The name of the album
  • Artist-The singer/group (or distributor, in the case of some movie soundtracks) who created the album
  • Genre-The main genre of the album (note that albums can fit into several sub-genres, which I will explore in this analysis)
  • Tracks-How many tracks are on the album
  • Metacritic-The album’s Metacritic score
    • For those who don’t know, Metacritic is a movie/TV show/music review site, much like Rotten Tomatoes (except RT doesn’t review music)
  • Release Date-The album’s release date
    • Even though this is a 2010s Decade-End Chart, there are interestingly a handful of albums from the late ’00s. Guess they still held up into the ’10s.

Now, let’s check to see if there’s missing data (remember to install the Amelia package):

  • Also remember to type the command missmap(file) (or whatever you named your data frame) to see the missingness map.

As you can see, 97% of observations are present while 3% are missing. All of the missing observations are in the Metacritic column-this is because not all albums have a Metacritic score (there are plenty of other music review sites such as HipHop DX, but I only went off of Metacritic reviews to maintain consistency in the dataset).

Now, I don’t know if I’ve mentioned this before, but when there are missing values in a column in R, there are three things you can do:

  • Don’t use the column in analysis
  • Fill in missing column values with the mean of the column (meaning the mean you get from all non-NA values in the column)
  • Fill in the missing column values with an arbitrary fixed value

The first option sometimes works, but not for this dataset, as I want to use the Metacritic column in some way in this analysis. The second option might work, but I won’t use it since I feel that imputing the mean Metacritic score for any NAs in the column wouldn’t make much sense (plus this option won’t work with non-numeric columns). In this case, the third option is my best bet; I will fill in any missing values in the Metacritic column using the number 0. You could pick any number-I just chose 0 since doing so gives me an easy way to spot the albums without Metacritic scores. Just keep in mind that the zeroes will drag down the column’s mean, so any averages computed on Metacritic later should exclude them.
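
For reference, here’s what the first two options might look like as sketches (working on copies so the original file data frame is untouched; the code for option 3 follows below):

```r
# Option 1: drop the Metacritic column from the analysis entirely
noMeta <- subset(file, select = -Metacritic)

# Option 2: impute the mean of the non-NA Metacritic values
meanFilled <- file
meanFilled$Metacritic[is.na(meanFilled$Metacritic)] <-
  mean(file$Metacritic, na.rm = TRUE)
```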

Here’s the line of code to make the magic happen:

file$Metacritic[is.na(file$Metacritic)] <- 0

Once we run this line of code, here’s what the Metacritic column looks like now:

All the NAs are filled with zeroes, which, in my opinion, makes the column a lot neater looking.

Now, let’s do some linear regression. Here’s a simple model with one independent and one dependent variable:

> model1 <- lm(file$Metacritic~file$Genre)
> summary(model1)

Call:
lm(formula = file$Metacritic ~ file$Genre)

Residuals:
Min 1Q Median 3Q Max
-65.345 -7.384 2.661 13.667 55.692

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.222e-14 1.546e+01 0.000 1.000000
file$GenreChristmas 2.433e+01 2.187e+01 1.113 0.267277
file$GenreCountry 3.581e+01 1.619e+01 2.211 0.028262 *
file$GenreElectronic 4.700e+01 2.046e+01 2.298 0.022708 *
file$GenreFolk 6.650e+01 2.445e+01 2.720 0.007156 **
file$GenreJazz 3.750e+01 2.445e+01 1.534 0.126801
file$GenreMovie Soundtrack 2.231e+01 1.715e+01 1.300 0.195099
file$GenreMusical 8.500e+01 3.093e+01 2.748 0.006584 **
file$GenreOpera 5.400e+01 3.093e+01 1.746 0.082465 .
file$GenrePop 6.534e+01 1.586e+01 4.121 5.71e-05 ***
file$GenreR&B 4.925e+01 1.729e+01 2.849 0.004889 **
file$GenreRap 6.133e+01 1.591e+01 3.855 0.000160 ***
file$GenreReggae -6.545e-14 3.093e+01 0.000 1.000000
file$GenreRock 6.908e+01 1.729e+01 3.996 9.31e-05 ***
file$GenreSoul 7.550e+01 2.046e+01 3.691 0.000294 ***
file$GenreVarious -1.459e-13 2.445e+01 0.000 1.000000

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 26.78 on 184 degrees of freedom
Multiple R-squared: 0.314, Adjusted R-squared: 0.2581
F-statistic: 5.614 on 15 and 184 DF, p-value: 2.224e-09

In this model, I used Metacritic as the dependent variable (I did say I was going to use it in this analysis) and Genre as the independent variable. I chose these two variables because I wanted to analyze whether certain genres tend to get higher/lower Metacritic scores.

What does all of the output mean? Since my last linear regression post was over a year ago, let me give you guys a refresher:

  • The residual standard error refers to the amount that the dependent variable (Metacritic) deviates from the true regression line. In this case, the RSE is 26.78, meaning the Metacritic score deviates from the true regression line by 27 (rounded to the nearest whole number). Since Metacritic scores only go up to 100, 27 is quite a large RSE.
  • The R-squared is a measure of a model’s goodness-of-fit; the closer it is to 1, the better the fit. The difference between the Multiple R-Squared and the Adjusted R-Squared is that the former doesn’t account for the number of variables in the model while the latter does. In this case, the Multiple R-Squared is 31.4% while the Adjusted R-Squared is 25.81%. This implies that there isn’t much of a correlation between an album’s genre and Metacritic score.
    • It’s not an official rule, but I’d say the Multiple R-Squared should be at 51% for there to be any correlation between a dependent variable and any independent variable(s). The Adjusted R-Squared can be slightly lower than 51%.
    • In the post R Analysis 2: Linear Regression & NFL Attendance, I mentioned the idea that “correlation does not imply causation”, which holds true here. In the context of this model, just because there is a slight correlation between an album’s genre and its Metacritic score, this doesn’t mean that certain album genres will tend to score higher/lower on Metacritic.
    • The F-statistic’s corresponding p-value (2.224e-09) is well below 0.05, so the model as a whole is statistically significant-genre explains some of the variation in Metacritic scores, even if the overall fit is weak. If you want more context on the F-statistic, please check out the link in the previous bullet point for a more in-depth explanation.
  • Notice how the independent variable-Genre-is split up into several different sub-categories. These sub-categories represent all of the album genres listed in this dataset.
    • The asterisks right by the subcategories (after the Pr(>|t|) column) are significance codes, which in this case represent each individual genre’s significance to the album’s Metacritic score. In other words, the significance codes show which genres are likely to have an impact on an album’s Metacritic score. Any genre with two or three asterisks will significantly influence an album’s Metacritic score; such genres include:
      • Folk
      • Musical
      • Pop
      • Rap
      • R&B
      • Rock
      • Soul
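
Pulling those significant genres out of the model programmatically can be sketched like this (using the model1 object fitted above):

```r
# Keep the coefficient names whose p-value is below the 0.01 level
coefs <- summary(model1)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.01]
```

Run against the summary output above, this should recover the rows for the seven genres just listed.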

Now, I want to try another model using Metacritic, but this time I will use two independent variables-Artist and Genre-and see if this improves the accuracy of the model:

model2 <- lm(file$Metacritic ~ file$Genre + file$Artist)

summary(model2)

Call:
lm(formula = file$Metacritic ~ file$Genre + file$Artist)

Residuals:
Min 1Q Median 3Q Max
-41.667 0.000 0.000 0.687 47.333

Coefficients: (6 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.167e+01 2.844e+01 -0.762 0.448674
file$GenreChristmas 2.167e+01 3.399e+01 0.637 0.525909
file$GenreCountry 1.502e+01 3.425e+01 0.438 0.662358
file$GenreElectronic -6.473e+01 3.785e+01 -1.710 0.091599 .
file$GenreFolk 1.977e+01 3.549e+01 0.557 0.579287
file$GenreJazz 5.917e+01 3.134e+01 1.888 0.063118 .
file$GenreMovie Soundtrack 2.167e+01 2.150e+01 1.008 0.316958
file$GenreMusical 1.067e+02 3.399e+01 3.138 0.002477 **
file$GenreOpera 7.567e+01 3.399e+01 2.226 0.029188 *
file$GenrePop 1.727e+01 2.719e+01 0.635 0.527498
file$GenreR&B 3.297e+01 2.963e+01 1.112 0.269683
file$GenreRap 2.167e+01 3.134e+01 0.691 0.491590
file$GenreReggae 2.167e+01 3.399e+01 0.637 0.525909
file$GenreRock 9.467e+01 3.399e+01 2.785 0.006858 **
file$GenreSoul 1.427e+01 3.549e+01 0.402 0.688886
file$GenreVarious 2.167e+01 3.876e+01 0.559 0.577891
file$Artist21 Pilots 7.000e+00 2.633e+01 0.266 0.791120
file$Artist21 Savage 8.100e+01 2.280e+01 3.552 0.000683 ***
file$ArtistA Boogie wit da Hoodie -1.858e-13 2.280e+01 0.000 1.000000
file$ArtistAdele 7.940e+01 3.115e+01 2.549 0.012978 *
file$ArtistAlicia Keys -1.130e+01 3.331e+01 -0.339 0.735394
file$ArtistAriana Grande 8.115e+01 2.666e+01 3.044 0.003271 **
file$ArtistBarbara Streisand 5.940e+01 3.115e+01 1.907 0.060612 .
file$ArtistBeyonce 7.720e+01 2.428e+01 3.180 0.002182 **
file$ArtistBillie Eilish 8.640e+01 3.115e+01 2.773 0.007083 **
file$ArtistBlake Shelton 7.065e+01 3.747e+01 1.886 0.063440 .
file$ArtistBob Marley NA NA NA NA
file$ArtistBrantley Gilbert 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistBruno Mars 7.140e+01 2.719e+01 2.626 0.010585 *
file$ArtistBryson Tiller -1.130e+01 3.331e+01 -0.339 0.735394
file$ArtistCamila Cabello 8.240e+01 3.115e+01 2.645 0.010052 *
file$ArtistCardi B 8.400e+01 2.280e+01 3.684 0.000445 ***
file$ArtistCarrie Underwood 6.865e+01 3.508e+01 1.957 0.054280 .
file$ArtistCast of Hamilton NA NA NA NA
file$ArtistChris Brown -1.130e+01 3.331e+01 -0.339 0.735394
file$ArtistChris Stapleton 9.165e+01 3.747e+01 2.446 0.016922 *
file$ArtistColdplay 6.940e+01 3.115e+01 2.228 0.029075 *
file$ArtistDaBaby 1.931e-13 2.280e+01 0.000 1.000000
file$ArtistDaft Punk 1.734e+02 4.079e+01 4.251 6.37e-05 ***
file$ArtistDisney -8.179e-14 2.150e+01 0.000 1.000000
file$ArtistDJ Khaled 6.100e+01 2.280e+01 2.675 0.009266 **
file$ArtistDrake 7.500e+01 1.493e+01 5.024 3.63e-06 ***
file$ArtistDrake & Future 7.000e+01 2.280e+01 3.070 0.003033 **
file$ArtistDreamWorks 8.081e-14 2.633e+01 0.000 1.000000
file$ArtistDuck Dynasty -7.019e-13 2.633e+01 0.000 1.000000
file$ArtistEd Sheeran 6.890e+01 2.824e+01 2.440 0.017179 *
file$ArtistEminem 6.567e+01 1.700e+01 3.864 0.000244 ***
file$ArtistEric Church 8.615e+01 3.508e+01 2.456 0.016503 *
file$ArtistFall Out Boy -1.000e+00 2.633e+01 -0.038 0.969811
file$ArtistFetty Wap 6.800e+01 2.280e+01 2.982 0.003920 **
file$ArtistFlorida Georgia Line 6.650e+00 3.508e+01 0.190 0.850186
file$Artistfun 6.440e+01 3.115e+01 2.067 0.042367 *
file$ArtistFuture 7.350e+01 1.862e+01 3.948 0.000184 ***
file$ArtistG-Eazy 7.400e+01 2.280e+01 3.245 0.001791 **
file$ArtistGoyte -4.000e+00 2.633e+01 -0.152 0.879682
file$ArtistHozier 8.640e+01 3.861e+01 2.238 0.028365 *
file$ArtistHunter Hayes 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistImagine Dragons -2.300e+01 2.280e+01 -1.009 0.316560
file$ArtistJ Cole 7.033e+01 1.700e+01 4.138 9.49e-05 ***
file$ArtistJason Aldean 3.715e+01 3.382e+01 1.098 0.275734
file$ArtistJay-Z 6.000e+01 2.280e+01 2.631 0.010426 *
file$ArtistJohn Legend 6.070e+01 3.331e+01 1.823 0.072583 .
file$ArtistJuice WRLD 3.050e+01 1.862e+01 1.638 0.105806
file$ArtistJustin Bieber 7.040e+01 2.666e+01 2.641 0.010160 *
file$ArtistJustin Timberlake 7.190e+01 2.824e+01 2.546 0.013054 *
file$ArtistKane Brown 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistKanye West 7.500e+01 2.280e+01 3.289 0.001566 **
file$ArtistKanye West & Jay-Z 7.600e+01 2.280e+01 3.333 0.001367 **
file$ArtistKaty Perry 6.090e+01 2.824e+01 2.157 0.034405 *
file$ArtistKelly Clarkson 7.300e+01 2.633e+01 2.773 0.007099 **
file$ArtistKendrick Lamar 9.300e+01 1.862e+01 4.995 4.06e-06 ***
file$ArtistKesha 1.404e+02 4.079e+01 3.442 0.000972 ***
file$ArtistKevin Gates 8.100e+01 2.280e+01 3.552 0.000683 ***
file$ArtistKhalid 1.770e+01 3.059e+01 0.579 0.564710
file$ArtistLady Antebellum 6.965e+01 3.508e+01 1.986 0.050951 .
file$ArtistLady Gaga 7.780e+01 2.428e+01 3.205 0.002025 **
file$ArtistLana Del Rey 6.640e+01 3.115e+01 2.131 0.036524 *
file$ArtistLil Baby 2.779e-14 2.280e+01 0.000 1.000000
file$ArtistLil Baby & Gunna 7.600e+01 2.280e+01 3.333 0.001367 **
file$ArtistLil Uzi Vert 7.500e+01 2.280e+01 3.289 0.001566 **
file$ArtistLil Wayne 6.633e+01 1.700e+01 3.903 0.000214 ***
file$ArtistLionel Richie 7.840e+01 3.115e+01 2.517 0.014113 *
file$ArtistLittle Big Town 8.165e+01 3.747e+01 2.179 0.032638 *
file$ArtistLizzo 8.400e+01 2.280e+01 3.684 0.000445 ***
file$ArtistLMFAO 1.334e+02 4.079e+01 3.270 0.001659 **
file$ArtistLorde 8.340e+01 3.115e+01 2.677 0.009219 **
file$ArtistLuke Bryan 4.832e+01 3.425e+01 1.411 0.162647
file$ArtistLuke Combs 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistMacklemore & Ryan Lewis 7.400e+01 2.280e+01 3.245 0.001791 **
file$ArtistMaroon 5 6.007e+01 2.719e+01 2.209 0.030413 *
file$ArtistMeek Mill 7.700e+01 2.280e+01 3.377 0.001193 **
file$ArtistMeghan Trainor 6.340e+01 3.115e+01 2.035 0.045580 *
file$ArtistMetalica -8.437e-14 2.633e+01 0.000 1.000000
file$ArtistMichael Buble NA NA NA NA
file$ArtistMigos 7.400e+01 1.862e+01 3.975 0.000167 ***
file$ArtistMiley Cyrus 6.540e+01 3.115e+01 2.099 0.039351 *
file$ArtistMiranda Lambert 9.265e+01 3.747e+01 2.473 0.015804 *
file$ArtistMumford & Sons -7.500e+00 2.280e+01 -0.329 0.743190
file$ArtistNicki Minaj 6.900e+01 1.862e+01 3.706 0.000414 ***
file$ArtistOf Monsters and Men 6.790e+01 3.861e+01 1.759 0.082931 .
file$ArtistOne Direction 6.840e+01 2.666e+01 2.566 0.012402 *
file$ArtistOneRepublic -8.000e+00 2.633e+01 -0.304 0.762141
file$ArtistPanic! At the Disco 7.440e+01 3.115e+01 2.388 0.019597 *
file$ArtistPentatonix 2.167e+01 3.134e+01 0.691 0.491590
file$ArtistPharrell 7.140e+01 3.115e+01 2.292 0.024884 *
file$ArtistPhillip Phillips 6.540e+01 3.115e+01 2.099 0.039351 *
file$ArtistPink 8.140e+01 3.115e+01 2.613 0.010953 *
file$ArtistPost Malone 2.550e+01 1.862e+01 1.370 0.175117
file$ArtistQueen 7.000e+01 2.633e+01 2.659 0.009690 **
file$ArtistRihanna 7.440e+01 2.824e+01 2.635 0.010324 *
file$ArtistRIhanna 6.690e+01 2.824e+01 2.369 0.020542 *
file$ArtistRobin Thicke 4.400e+00 3.115e+01 0.141 0.888085
file$ArtistSade 8.640e+01 3.861e+01 2.238 0.028365 *
file$ArtistSam Hunt 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistSam Smith 7.140e+01 2.824e+01 2.529 0.013672 *
file$ArtistScotty McCreery 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistShawn Mendes 4.090e+01 2.824e+01 1.449 0.151874
file$ArtistSia 7.140e+01 3.115e+01 2.292 0.024884 *
file$ArtistSony Pictures -8.989e-14 2.633e+01 0.000 1.000000
file$ArtistSusan Boyle NA NA NA NA
file$ArtistSZA 7.470e+01 3.331e+01 2.243 0.028026 *
file$ArtistTaylor Swift 7.965e+01 2.666e+01 2.988 0.003854 **
file$ArtistThe Band Perry 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistThe Black Eyed Peas 6.440e+01 3.115e+01 2.067 0.042367 *
file$ArtistThe Black Keys 1.000e+01 2.280e+01 0.439 0.662319
file$ArtistThe Lumineers NA NA NA NA
file$ArtistThe Weeknd 5.920e+01 3.059e+01 1.935 0.056961 .
file$ArtistThomas Rhett 6.650e+00 3.747e+01 0.177 0.859634
file$ArtistTravis Scott 7.450e+01 1.862e+01 4.001 0.000153 ***
file$ArtistUniversal Studios 2.167e+01 2.150e+01 1.008 0.316958
file$ArtistUsher 4.570e+01 3.331e+01 1.372 0.174332
file$ArtistVarious 3.005e-14 2.280e+01 0.000 1.000000
file$ArtistWarner Bros. -1.194e-13 2.633e+01 0.000 1.000000
file$ArtistXXXTentacion NA NA NA NA
file$ArtistZac Brown Band 3.032e+01 3.425e+01 0.885 0.379001

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.62 on 71 degrees of freedom
Multiple R-squared: 0.8721, Adjusted R-squared: 0.6415
F-statistic: 3.782 on 128 and 71 DF, p-value: 3.626e-09

So how does this model differ from the previous model? Let’s analyze:

  • The residual standard error is smaller than that of the previous model by nearly 8 (assuming you round model 1’s RSE to 27 and model 2’s RSE to 19). In the context of Metacritic scores, 19 is a much smaller RSE than 27, so this helps improve the model’s accuracy.
  • This model’s Multiple R-Squared and Adjusted R-Squared are considerably larger than those of the previous model (87.21% and 64.15%, respectively, versus the previous model’s 31.4% and 25.81%). This model’s R-Squared values (both multiple and adjusted) imply that there is a strong correlation between an album’s genre and its Metacritic score once the album’s artist is factored into the analysis.
  • Just as with the previous model, the independent variables are divided into sub-categories which encompass all the possible values for that variable. All of the genres listed in the dataset are present, as are all of the artists listed in the dataset. The significance codes are also present here, but this time they appear for both the genre and artist subcategories. Any genre that has two or three asterisks beside it significantly affects an album’s Metacritic score-the same logic applies for any artist with two or three asterisks beside their name.
    • So, what are the genres that most significantly impact an album’s Metacritic score (this differs from the previous model)?
      • Musical
      • Rock
    • Now, which artists are most likely to have a significant impact on an album’s Metacritic score?
      1. 21 Savage
      2. Ariana Grande
      3. Beyonce
      4. Billie Eilish
      5. Cardi B
      6. Daft Punk
      7. DJ Khaled
      8. Drake
      9. Drake & Future (they did an album together so I listed them both as the artist)
      10. Eminem
      11. Fetty Wap
      12. Future
      13. G-Eazy
      14. J Cole
      15. Kanye West
      16. Kanye West & Jay-Z
      17. Kelly Clarkson
      18. Kendrick Lamar
      19. Kesha
      20. Kevin Gates
      21. Lady Gaga
      22. Lil Baby & Gunna
      23. Lil Uzi Vert
      24. Lil Wayne
      25. Lizzo
      26. LMFAO
      27. Lorde
      28. Macklemore & Ryan Lewis
      29. Meek Mill
      30. Migos
      31. Nicki Minaj
      32. Queen
      33. Taylor Swift
      34. Travis Scott
    • Yes, 34 of the 120 artists listed have a significant impact on an album’s Metacritic score. And 23 of them are rappers (though keep in mind that if two artists made an album/mixtape together, I listed them on the same bullet point).
    • Personally, I think it’s interesting that Queen is on this list. But that could be just because of the 2018 Queen movie Bohemian Rhapsody.
  • Remember how I said to disregard the F-statistic and corresponding p-value when I was analyzing the previous model? Since this linear regression model has two independent variables, the F-statistic and corresponding p-value are important. The F-statistic is a numerical measure of the relationship (or lack thereof) between the dependent variable and the independent variables. However, the F-statistic must be analyzed in conjunction with the p-value in order to get a sense of the independent variables’ relationship with the dependent variable.
    • The concepts of null and alternative hypotheses are important here.
      • The null hypothesis states that the independent variables DON’T have a significant impact on the dependent variable while the alternative hypothesis states the opposite.
    • If the P-value is less than 0.001, you can safely reject the null hypothesis. Since the P-value in this model is much less than 0.001, you can reject the null hypothesis. This means that the combination of an album’s genre and respective artist does impact the album’s Metacritic score.
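Since the post’s own code isn’t shown here, here’s a minimal sketch of how you could pull the F-statistic and its p-value out of any lm summary in R. The mtcars model below is just a stand-in for illustration, not the album data:

```r
# A stand-in model: the album/genre data isn't available here,
# so the built-in mtcars dataset substitutes for illustration.
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)
s <- summary(fit)

# summary() stores the F-statistic as c(value, numdf, dendf)
fstat <- s$fstatistic
p_val <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)

# Reject the null hypothesis when the p-value falls below your threshold
as.numeric(p_val) < 0.001
```

The same `pf` call is how R itself computes the p-value printed on the last line of the summary output.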

So, which model is better? After analyzing each model, I’d say model #2 is much better than model #1-and not just because model #2 has two independent variables (though that certainly helps make the model more accurate). The fact that model #2 has a lower RSE, higher Multiple/Adjusted R-Squared, 36 statistically significant subcategories (2 for genre and 34 for artist), and a P-value low enough to reject the null hypothesis all help make model #2 the better model.

Thanks for reading and here’s to lots of great content in 2020,

Michael

R Lesson 8: Predictions For Linear & Logistic Regression/Multiple Linear Regression

Advertisements

Hello everybody,

It’s Michael, and today’s lesson will be about predictions for both linear and logistic regression models. I will be using the same dataset that I used for R Analysis 2: Linear Regression & NFL Attendance, except I added some variables so I could create both linear and logistic regression models from the data. Here is the modified dataset-NFL attendance 2014-18

Now, as always, let’s first try to understand our variables:

I described most of these variables in R Analysis 2, but here’s what the two new ones mean (I’m referring to the two bottommost variables):

  • Playoffs-whether or not a team made the playoffs. Teams that made playoffs are represented by a 1, while teams that didn’t make playoffs are represented by a 0. Recall that teams who finished 1st-6th in their respective conferences made playoffs, while teams that finished 7th-16th did not.
  • Division-What division a team belongs to, of which there are 8:
    • 1-AFC East (Patriots, Jets, Dolphins, Bills)
    • 2-AFC North (Browns, Steelers, Ravens, Bengals)
    • 3-AFC South (Colts, Jaguars, Texans, Titans)
    • 4-AFC West (Chargers, Broncos, Chiefs, Raiders)
    • 5-NFC East (Cowboys, Eagles, Giants, Redskins)
    • 6-NFC North (Packers, Bears, Vikings, Lions)
    • 7-NFC South (Falcons, Saints, Panthers, Buccaneers)
    • 8-NFC West (Seahawks, 49ers, Cardinals, Rams)

I added these two variables so that I could create logistic regression models from the data. In both cases, I used dummy variables (remember those?).
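As a quick illustration of those dummy variables, here’s a sketch (with made-up mini data, not the actual spreadsheet) of how R expands a categorical column like Division into indicator columns:

```r
# Made-up mini dataset standing in for the NFL spreadsheet
df <- data.frame(Division = c(1, 2, 3, 1), Playoffs = c(1, 0, 1, 0))

# Telling R that Division is categorical, not numeric, so lm/glm
# build one dummy column per division (minus a baseline level)
df$Division <- factor(df$Division, levels = 1:8,
                      labels = c("AFC East", "AFC North", "AFC South",
                                 "AFC West", "NFC East", "NFC North",
                                 "NFC South", "NFC West"))

model.matrix(~ Division, df)   # shows the dummy columns R generates
```

If Division were left numeric, R would treat "NFC West minus AFC East" as a distance of 7, which makes no sense for division labels.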

Another function I think will help you in your analyses is sapply. Here’s how it works:

As you can see, you can do two things with sapply-find out if there are any missing values (as seen in the top function) or find out how many unique values there are for a certain variable (as seen in the bottom function). According to the output, there are no missing values for any variables (in other words, there are no blank spots in any column of the spreadsheet). Also, in the bottom function, you can see how many distinct values correspond to a certain variable (e.g. Conference Standing has 16 distinct values).
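The screenshot isn’t reproduced here, but the two sapply calls described above look roughly like this (with a small stand-in data frame in place of the NFL file, and guessed column names):

```r
# Small stand-in for the NFL attendance data frame (names assumed)
file <- data.frame(Team = c("A", "B", "C"),
                   Win.Total = c(10, 8, 10),
                   Total.Attendance = c(900000, 950000, 1000000))

sapply(file, function(x) sum(is.na(x)))      # missing values per column
sapply(file, function(x) length(unique(x)))  # distinct values per column
```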

Before I get into analysis of the models, I want to introduce two new concepts-training data and testing data:

The difference between training and testing data is that training data are used as guidelines for how a model (whether linear or logistic) should make decisions, while testing data give us an idea of how well the model is performing. When splitting up your data, a good rule of thumb is 80-20, meaning that 80% of the data should be for training while 20% should be for testing (it doesn’t have to be exactly 80-20, but the majority of the data should always go to training and the minority to testing). In this model, observations 1-128 are part of the training dataset while observations 129-160 are part of the testing dataset.
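In R, that split is just row subsetting; here’s a sketch with a stand-in data frame (the real file has the same 160 rows):

```r
# Stand-in data frame with 160 rows, like the NFL dataset
file <- data.frame(id = 1:160, Win.Total = rep(8, 160))

train <- file[1:128, ]    # 80% of the data, for fitting the model
test  <- file[129:160, ]  # 20% of the data, for checking predictions

nrow(train) / nrow(file)  # 0.8
nrow(test) / nrow(file)   # 0.2
```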

I will post four models in total-two using linear regression and two using logistic regression. I will start with the logistic regression:

In this model, I chose Playoffs as the binary dependent variable and Division and Win Total as the independent variables. As you can see, the intercept and Win Total are statistically significant, while Division is not. Also, notice the data = train line, which indicates that the training dataset will be used for this analysis (you should always use the training dataset to create the model).
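A sketch of what that glm call looks like - the column names (Playoffs, Division, Win.Total) are my guesses from the writeup, and the data below is randomly generated rather than the actual NFL file:

```r
# Randomly generated stand-in training data (names are assumptions)
set.seed(1)
train <- data.frame(
  Playoffs  = rbinom(128, 1, 0.4),
  Division  = factor(sample(1:8, 128, replace = TRUE)),
  Win.Total = sample(0:16, 128, replace = TRUE)
)

# family = binomial makes this logistic regression; data = train
# ensures the model is built from the training dataset only
model1 <- glm(Playoffs ~ Division + Win.Total, family = binomial, data = train)
summary(model1)
```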

Now let’s create some predictions using our test dataset:

The fitted.results variable holds the model’s predicted probabilities, while the ifelse function converts each prediction for the observations in our test dataset (observations 129-160) into a 0 or 1. A 1 under an observation number indicates that the model gives that team at least a 50% chance of making the playoffs, while a 0 indicates less than a 50% chance.

If we wanted to figure out exactly how significant each observation is to the model (along with the overall accuracy of the model), here’s how:

The misClasificError is the model’s misclassification rate on the test dataset-the proportion of test observations (derived from fitted.results) that the model predicted incorrectly. The accuracy is calculated by subtracting the misClasificError from 1, which turns out to be 87%, indicating very good accuracy (and indicating that the model’s error rate is 13%).
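Here’s a sketch of those prediction and accuracy steps, reusing the post’s variable names (fitted.results, misClasificError) but with randomly generated stand-in data and guessed column names:

```r
# Randomly generated stand-in train/test data (column names assumed)
set.seed(2)
train <- data.frame(Playoffs = rbinom(128, 1, 0.4),
                    Win.Total = sample(0:16, 128, replace = TRUE))
test  <- data.frame(Playoffs = rbinom(32, 1, 0.4),
                    Win.Total = sample(0:16, 32, replace = TRUE))

model1 <- glm(Playoffs ~ Win.Total, family = binomial, data = train)

# type = "response" returns probabilities; threshold at 0.5 for 0/1 calls
fitted.results <- predict(model1, newdata = test, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

# Proportion of wrong predictions, and accuracy as its complement
misClasificError <- mean(fitted.results != test$Playoffs)
accuracy <- 1 - misClasificError
```

Because fitted.results is converted to 0s and 1s before the comparison, misClasificError is a proportion between 0 and 1, so accuracy can never exceed 100%.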

Finally, let’s plot the model:

We can also predict various what-if scenarios using the model and the predict function. Here’s an example:

Using the AFC South as an example, I calculated the possible odds for a team in that division to make the playoffs based on various possible win totals. Note that predict returns these values on the log-odds scale by default, so positive values mean a better-than-even chance and negative values mean a less-than-even chance. As you can see, an AFC South team with 10 or 14 wins is all but guaranteed to make the playoffs, as the log-odds for both of those win totals are greater than 1. However, AFC South teams with only 2 or 8 wins aren’t likely to go to the playoffs because the log-odds for both of those win totals are negative (though 8 wins fares better than 2).
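A sketch of that what-if call, again with random stand-in data and guessed column names. One thing worth knowing: predict() on a glm returns values on the log-odds (link) scale by default, and plogis() converts them to probabilities:

```r
# Random stand-in training data (names and division coding are assumptions)
set.seed(3)
train <- data.frame(Playoffs  = rbinom(128, 1, 0.4),
                    Division  = factor(sample(1:8, 128, replace = TRUE)),
                    Win.Total = sample(0:16, 128, replace = TRUE))
model1 <- glm(Playoffs ~ Division + Win.Total, family = binomial, data = train)

# Division 3 = AFC South in the post's coding; vary the win total
scenario <- data.frame(Division  = factor(3, levels = 1:8),
                       Win.Total = c(2, 8, 10, 14))

log_odds <- predict(model1, newdata = scenario)  # log-odds (link) scale
probs    <- plogis(log_odds)                     # playoff probabilities
```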

Let’s try another example, this time examining the effects of 9 wins across all 8 divisions (I chose 9 because 9 wins sometimes results in playoff berths, sometimes it doesn’t):

As you can see, 9 wins is most likely to earn a playoff berth for AFC East teams (55.6% chance) and least likely to earn a playoff spot in the NFC West (35.7% chance).

I know it looks like all the lines are squished into one big line, but you can infer that the more wins a team has, the greater its chances are at making the playoffs. The pink line that appears to be the most visible represents the NFC West (Rams, Seahawks, 49ers, Cardinals). Unsurprisingly, the teams likeliest to make the playoffs were the teams with 9 or more wins (except for the 2017 Seahawks, who finished 9-7 and missed the playoffs).

Now let’s create another logistic regression model that is similar to the last one except with the addition of the Total Attendance variable

The summary output looks similar to that of the previous model (I also used the training dataset for this model), except that this time, none of the variables have asterisks beside them, meaning none of them are statistically significant (which happens when the p-value is above 0.1). Nevertheless, I’ll still analyze this model to see if it is better than my first logistic regression model.

Now let’s create some predictions using the test dataset:

Like our previous model, this model also has a nice mix of 0s and 1s, except this model only has 11 1s, while the previous model had 14 1s.

And now let’s find the overall accuracy of the model:

Ok, so I know 379% seems like crazy accuracy for a logistic regression model. Here’s how it was calculated:

R took the sum of these numbers and divided that sum by 32 to find the average of the fitted results, then subtracted that average from 1 to get the accuracy measure. Keep in mind that accuracy is a proportion, so a value above 100% is a sign the calculation went wrong rather than a sign of a great model.

Just as we did with the first model, we can also create what-if scenarios. Here’s an example:

Using the AFC North as an example, I analyzed the effect of win total on a team’s playoff chances while keeping total attendance the same (1,400,000). Unsurprisingly (if total attendance is roughly 1.4 million fans in a given season), teams with a losing record (7-8-1 or lower) are less likely to make the playoffs than teams with a split or winning record (8-8 or higher). Given both record and a total attendance of 1,400,000 fans, the threshold for clinching a playoff berth appears to be 12 or 13 wins (though barring attendance, most AFC North teams fare well with 10, 9, or even 8 wins).

Now here’s another example, this time using the NFC East (and changing both win totals and total attendance):

So given increasing win totals and total attendance, an NFC East team’s playoff chances increase. The playoff threshold here, just as it has been with most of my predictions, is 9 or 10 wins.

Now let’s see what happens when win totals increase but attendance goes down (also using the NFC East):

Ultimately (with regards to the NFC East), it’s not total attendance that matters, but a team’s win totals. As you can see, regardless of total attendance, playoff clinching odds increase with higher win totals (win threshold remains at 9 or 10).

And here’s our model plotted:

Now, I know this graph is just about as easy-to-read as the last graph (not very, but that’s how R works), but just like with the last graph, you can draw some conclusions. Since this graph factors in Total Attendance and Win Total (even though only Total Attendance is displayed), you can tell that even though a team’s fanbase may love coming to their games, if the wins are low, so are the playoff chances.

Now, before we start the linear regression models, let’s compare the logistic regression models to see which is the better of the two by analyzing various criteria:

  • Difference between null & residual deviance
    • Model 1-73.25 with a decrease of two degrees of freedom
    • Model 2-115.82 with a decrease of three degrees of freedom
    • Better model-Model 1
  • AIC
    • Model 1-101.86
    • Model 2-60.483
    • Better model-Model 2 (41.377 difference)
  • Number of Fisher Scoring Iterations
    • Model 1-5
    • Model 2-7
    • Better model-Model 1 (fewer Fisher iterations)
  • Overall Accuracy
    • Model 1-87%
    • Model 2-379%
    • Better model-Model 1 (379% sounds too good to be true)

Overall better model: Model 1

Now here’s the first linear regression model:

This model has Win Total as the dependent variable and Total Attendance and Conference Standing as the independent variables. This will also be my first model created with multiple linear regression, which is basically linear regression with more than one independent variable.

And finally, let’s plot the model:

In cases of multiple linear regression such as this, I had to graph each independent variable separately; graphing Total Attendance and Conference Standing separately allows us to examine the effect each independent variable has on our dependent variable (Win Total). As you can see, Total Attendance increases with an increasing Win Total, while the Conference Standing number rises (meaning the standing worsens) as Win Total decreases. Both graphs make lots of sense, as fans are more tempted to come to a team’s games when the team has a high win total, and conference standings tend to worsen with lower win totals (an interesting exception is the 2014 Carolina Panthers, who finished 4th in the NFC despite a 7-8-1 record).

  • In case you are wondering what the layout function does, it basically allows two graphs to be displayed side by side. I can also alter the function depending on how many independent variables I use; if, for instance, I used 4 independent variables, I could change the layout matrix to 2 rows by 2 columns to display the graphs in a 2-by-2 grid.
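A sketch of those layout() calls (the plot calls themselves are omitted):

```r
# Two panels side by side: one row, two columns
layout(matrix(c(1, 2), nrow = 1, ncol = 2))
# plot(...) calls would now fill panels 1 and 2 in order

# With 4 independent variables, a 2-by-2 grid instead
layout(matrix(1:4, nrow = 2, ncol = 2))
```

The matrix tells layout which panel each subsequent plot lands in, so the same idea scales to any number of independent variables.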

Multiple linear regression equations are quite similar to those of simple linear regression, except for an added variable. In this case, the equation would be:

  • Win Total = 6.366e-6(Total Attendance)-5.756e-1(Conference Standing)+5.917
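To make plugging numbers in easier, the equation above can be wrapped in a small helper function (coefficients copied straight from the equation; the scenario values below are just examples):

```r
# Win Total = 6.366e-6(Total Attendance) - 5.756e-1(Conference Standing) + 5.917
predict_wins <- function(total_attendance, conference_standing) {
  6.366e-6 * total_attendance - 5.756e-1 * conference_standing + 5.917
}

predict_wins(1100000, 7)   # about 8.9 wins - roughly a 9-7 record
predict_wins(750000, 16)   # about 1.5 wins - a bottom-of-the-pack record
```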

Now, using the predict function that I showed you for my logistic regression models won’t be very efficient here, so we can go the old-fashioned way by plugging numbers into the equation. Here’s an example:

Regardless of what conference a team is part of, a total attendance of at least 750,000 fans and a bottom seed in the conference should at least bring the team a 1-15 record. For teams with a total attendance of at least 1.1 million fans who fall just short of the playoffs with a 7th seed, a 9-7 record would be likely. Top of the conference teams with an attendance of at least 1.45 million should net a 14-2 record.

Now, let’s see what happens when conference standing improves, but attendance decreases:

According to my predictions, bottom-seeded teams with a total attendance of at least 1.5 million fans should net at least a 6-10 record. However, as conference standings improve and total attendance decreases, predicted records stagnate at either 9-7 or 8-8.

Now here’s my second linear model:

In this model, I used two different independent variables-Home Attendance and Average Age of Roster-but I still used Win Total as my dependent variable.

The equation goes like this:

  • Win Total = 1.051e-5(Home Attendance)+5.534e-1(Average Age of Roster)-1.229e+1

Now just like I did with both of my logistic regression models and the linear regression model, let’s create some what-if scenarios:

In this scenario, home attendance is increasing along with the average age of roster. Win total also increases with a higher average age of roster. For instance, teams with a home attendance of at least 350,000 fans and an average roster age of 24 (meaning the team is full of rookies and other fairly-fresh faces) should expect at least a 5-11 record. On the other hand, teams with a roster full of veterans (yes, 28.5 is old for an average roster age)  and a home attendance of at least 1.2 million fans should expect a perfect 16-0 season.

Now let’s try a scenario where home attendance decreases but average age of roster increases:

In this scenario, when home attendance decreases but average age of roster increases, a team’s projected win total also goes down. For teams full of fresh faces and breakout stars (average age 24) and a home attendance of at least 1.1 million fans, a 13-3 record seems likely. On the other hand, for teams full of veterans (average age 28.5) and a home attendance of at least 300,000 fans, a 7-9 record appears within reach.

One thing to keep in mind with my linear regression predictions is that I rounded projected win totals to the nearest whole number. So I got the 13-3 record projection from the 12.5526 output.

Now let’s plot the model:

Just as I did with linear1, I graphed the two independent variables separately, not only because it’s the easiest way to graph multiple linear regression but also because we can see each variable’s effect on Win Total. As you can see, Home Attendance and Average Age of Roster both increase with an increasing win total, though the increase in Average Age of Roster is smaller than that of Home Attendance. Each scenario makes sense, as teams are likelier to have a higher win total if they have more supportive fans in attendance (particularly in their 7 or 8 home games per season), and having more recognizable veterans on a team (like the Saints with QB Drew Brees or the Broncos with LB Von Miller) will be better for the team’s overall record than having a team full of newbies (like the Browns with QB Baker Mayfield or the Giants with RB Saquon Barkley).

  • The Home Attendance numbers are displayed in scientific notation, which is how R displays large numbers. 1e+05 is 100,000, 3e+05 is 300,000, and so on.

Now, before I go, let’s compare the two linear models:

  • Residual Standard Error
    • Model 1-1.09 wins
    • Model 2-2.948 wins
    • Better Model-Model 1 (less deviation)
  • R-Squared (Multiple and Adjusted respectively)
    • Model 1-88.72% and 88.58%
    • Model 2-17.49% and 16.44%
    • Better Model-Model 1 (much higher than Model 2)
  • F-statistic & P-Value (since there are 2 degrees of freedom, this is an important metric)
    • Model 1-617.5 on 2 and 157 degrees of freedom; 2.79e-7
    • Model 2-16.64 on 2 and 157 degrees of freedom; 2.79e-7
    • Better Model-Model 1 (both result in the same p-value, but the f-statistic on Model 1 is much larger)
  • Overall better model-Model 1

Thanks for reading,

Michael

R Analysis 2: Linear Regression & NFL Attendance

Advertisements

Hello everybody,

It’s Michael, and today’s post will be an R analysis post using the concept of linear regression. The dataset I will be using is NFL attendance 2014-18, which details NFL attendance for each team from the 2014-2018 NFL seasons along with other factors that might affect attendance (such as average roster age and win count).

First, as we should do for any analysis, we should read the file and understand our variables:

  • Team-The team name corresponding to a row of data; there are 32 NFL teams total
  • Home Attendance-How many fans attend a team’s home games (the NFL’s International games count towards this total)
  • Road Attendance-How many fans attend a team’s road games
    • Keep in mind that teams have 8 home games and 8 away games.
  • Total Attendance-The total number of fans who go see a team’s games in a particular season (attendance for home games + attendance for away games)
  • Win Total-how many wins a team had for a particular season
  • Win.. (meaning win percentage)-the percent of games won by a particular team (keep in mind that ties are counted as half-wins when calculating win percentages)
  • NFL Season-the season corresponding to the attendance totals (e.g. the 2017 NFL season is referred to as simply 2017)
  • Conference Standing-Each team’s seeding in their respective conference (AFC or NFC), which ranges from 1 (best) to 16 (worst). The teams that were seeded 1-6 in their conference made the playoffs that season while teams seeded 7-16 did not; teams seeded 1-4 won their respective divisions while teams seeded 5 and 6 made the playoffs as wildcards.
    • As the 2018 season is still in progress, these standings only reflect who is LIKELY to make the playoffs as of Week 11 of the NFL season. So far, no team has clinched a playoff spot yet.
  • Average Age of Roster-The average age of a team’s players once the final 53-man roster has been set (this is before Week 1 of the NFL regular season)

One thing to note is that I removed the thousands separators for the Home Attendance, Road Attendance, and Total Attendance variables so that they would read as ints and not factors. The file still has the separators though.

Now let’s set up our model (I’m going to be using three models in this post for comparison purposes):

In this model, I used Total Attendance as the dependent variable and Win Total as the independent variable. In other words, I am using this model to determine if there is any relationship between fans’ attendance at a team’s games and a team’s win total.
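The model-fitting code isn’t shown above, but the lm call would look roughly like this - the column names are my guesses, and the data below is simulated rather than the real attendance file:

```r
# Simulated stand-in for the NFL attendance data (names assumed)
set.seed(4)
file <- data.frame(Win.Total = sample(0:16, 160, replace = TRUE))
file$Total.Attendance <- 29000 * file$Win.Total + 774000 +
  rnorm(160, sd = 50000)

# Dependent variable on the left of ~, independent variable on the right
lr1 <- lm(Total.Attendance ~ Win.Total, data = file)
summary(lr1)   # the bottom lines hold the RSE, R-squared, and F-statistic
```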

Remember how in R Lesson 7 I mentioned that you should pay close attention to the three bottom lines in the output? Here’s what they mean for this model:

  • As I mentioned earlier, the residual standard error refers to the amount that the response variable (total attendance) deviates from the true regression line. In this case, the RSE is 1,828,000, meaning the total attendance deviates from the true regression line by 1,828,000 fans.
    • I didn’t mention this in the previous post, but the way to find the percentage error is to divide the RSE by the average of the dependent variable (in this case, Total Attendance). The lower the percentage error, the better.
    • In this case, the percentage error is 185.43% (the mean for Total Attendance is 985,804 fans, rounded to the nearest whole number).
  • The R-Squared is a measure of the goodness-of-fit of a model-the closer to 1, the better the fit. The difference between the Multiple R-Squared and the Adjusted R-Squared is that the former isn’t adjusted for the number of variables in the model while the latter is. In this model, the Multiple R-Squared is 20.87% while the Adjusted R-Squared is 20.37%, indicating a very slight correlation.
    • Remember the idea that “correlation does not imply causation”, which states that even though there may be a strong correlation between the dependent and independent variable, this doesn’t mean the latter causes the former.
    • In the context of this model, even though a team’s total attendance and win total have a very slight correlation, this doesn’t mean that a team’s win total causes higher/lower attendance.
  • The F-statistic measures the relationship (or lack thereof) between independent and dependent variables. As I mentioned in the previous post, for models with only 1 degree of freedom, the F-statistic is basically the independent variable’s t-value squared (6.456²=41.68). The F-statistic (and resulting p-value) isn’t too significant for determining the accuracy of simple linear regression models such as this one, but becomes more significant when dealing with multiple linear regression models.

Now let’s set up the equation for the line (note the coef function I mentioned in the previous post isn’t necessary):

Remember the syntax for the equation is just like the syntax of the slope-intercept equation (y=mx+b) you may remember from algebra class. The equation for the line is (rounded to 2 decimal places):

  • Total Attendance = 29022(Win Total)+773943

Let’s try the equation out using some scenarios:

  • “Perfect” Season (no wins): 29022(0)+773943=expected total attendance of 773,943
  • Split Season (eight wins): 29022(8)+773943=expected total attendance of 1,006,119
  • Actual Perfect Season (sixteen wins): 29022(16)+773943=expected total attendance of 1,238,295
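The three scenarios above can be recomputed in one line each with a small helper function (coefficients taken from the equation):

```r
# Total Attendance = 29022(Win Total) + 773943
expected_attendance <- function(wins) 29022 * wins + 773943

expected_attendance(0)    # 773943
expected_attendance(8)    # 1006119
expected_attendance(16)   # 1238295
```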

And finally, let’s create the graph (and the regression line):

As seen in the graph above, few points touch the line (which explains the low Multiple R-Squared of 20.87%). According to the regression line, total attendance INCREASES with better win totals, which indicates a direct relationship. One possible reason for this is that fans of consistently well-performing teams (like the Patriots and Steelers) are more eager to attend games than are fans of consistently struggling teams (like the Browns and Jaguars). An interesting observation is that the 2015 4-12 Dallas Cowboys had better total attendance than the 2015 15-1 Carolina Panthers did. The 2016 and 2017 Cleveland Browns fared pretty well for attendance-each of those seasons had a total attendance of at least 900,000 fans (the records were 1-15 and 0-16 respectively).

Let’s create another model, once again using Total Attendance as the dependent variable but choosing Conference Standing as the independent variable:

So, is this model better than lr1? Let’s find out:

  • The residual standard error is much smaller than that of the previous model (205,100 fans as opposed to 1,828,000). As a result, the percentage error is much smaller (20.81%), and there is less variation among the observation points around the regression line.
  • The Multiple R-Squared and Adjusted R-Squared (0.4% and -0.2% respectively) are much lower than those of lr1. Thus, there is even less of a correlation between Total Attendance and Conference Standing than there is between Total Attendance and Win Total (for a particular team).
  • Disregard the F-statistic and p-value.

Now let’s set up our equation:

From this information, we get the equation:

  • Total Attendance = -2815(Conference Standing)+1009732

Here are some scenarios using this equation:

  • Top of the conference (1st place): -2815(1)+1009732=expected total attendance of 1,006,917
  • Conference wildcard (5th place): -2815(5)+1009732=expected total attendance of 995,657
  • Bottom of the pack (16th place): -2815(16)+1009732=expected total attendance of 964,692
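As with lr1, a quick R sketch can reproduce these numbers; again, the coefficients below are the rounded values from the lr2 equation, and the vectorized form handles all three standings at once.

```r
# Predicted total attendance from the (rounded) lr2 equation:
# Total Attendance = -2815(Conference Standing) + 1009732
standings <- c(1, 5, 16)  # 1st place, wildcard spot, bottom of the pack
-2815 * standings + 1009732
# 1006917  995657  964692
```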

Finally, let’s make a graph:

As seen in the graph, few points touch the line (fewer than in the line for lr1). The line itself has a negative slope, which implies that total attendance DECREASES with WORSE conference standings (or increases with better conference standings). Yes, I know the numbers under conference standing are increasing, but keep in mind that 1 is the best possible conference finish for a team, while 16 is the worst. One possible reason that total attendance decreases with lower conference standings is that fans are more enticed to come to games for consistently top conference teams and division winners (like the Patriots and Panthers) than for teams that miss the playoffs year after year (like the Jaguars, save for the 2017 squad that made it to the AFC Championship). Interestingly enough, the 2015 4-12 Dallas Cowboys rank second overall in total attendance (despite finishing 16th in their conference), just behind the 2016 13-3 Dallas Cowboys (first in their conference).

Now let’s make one more graph, this time using Average Age of Roster as the independent variable:

Is this model better than lr2? Let’s find out:

  • The residual standard error is the smallest of the three (204,600 fans), and thus the percentage error is also the smallest of the three (20.75%).
  • The Multiple R-Squared and Adjusted R-Squared are smaller than those of lr1 but larger than those of lr2 (0.84% and 0.22% respectively). Thus, Average Age of Roster correlates better with Total Attendance than Conference Standing does; however, Win Total correlates the best with Total Attendance.
  • Once again, disregard the F-Statistic & corresponding p-value.

Now let’s create the equation:

  • Total Attendance = 36556(Average Age of Roster)+33594

Here are some scenarios using this equation:

  • Roster with mostly rookies and 2nd-years (average age of 24): 36556(24)+33594=expected total attendance of 910,938
  • Roster with a mix of newbies and veterans (average age of 26): 36556(26)+33594=expected total attendance of 984,050
  • Roster with mostly veterans (average age of 28): 36556(28)+33594=expected total attendance of 1,057,162
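And one more sketch to verify the lr3 scenarios, again using the rounded coefficients from the equation above:

```r
# Predicted total attendance from the (rounded) lr3 equation:
# Total Attendance = 36556(Average Age of Roster) + 33594
ages <- c(24, 26, 28)  # mostly rookies, mixed roster, mostly veterans
36556 * ages + 33594
# 910938  984050  1057162
```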

And finally, let’s create a graph:

Like the graph for lr2, few points touch the line. As for the line itself, the slope is positive, implying that Total Attendance INCREASES with an INCREASING Average Age of Roster. One possible reason for this is that fans are more interested in coming to games if the team has several veteran stars* (names like Philip Rivers, Tom Brady, Jordy Nelson, Antonio Gates, Rob Gronkowski, Richard Sherman, Julius Peppers, Marshawn Lynch and many more) than if the team is full of rookies and/or unknowns (Myles Garrett, Sam Darnold, Josh Rosen, Leighton Vander Esch, among others). Interestingly enough, the team with the oldest roster (the 2018 Oakland Raiders, with an average age of 27.4) has the second lowest total attendance, just ahead of the 2018 LA Chargers (with an average age of 25.8).

*I’ll use any player who has at least 6 seasons of NFL experience as an example of a “veteran star”.

So, which is the best model to use? I’d say lr1 would be the best model to use, because even though it has the highest RSE (1,828,000), it also has the best correlation between the independent and dependent variables (a Multiple R-Squared of 20.87%). All in all, according to my three analyses, a team’s Win Total has the greatest influence on how many fans go to their games (both home and away) during a particular season.

Thanks for reading, and happy Thanksgiving to you all. Enjoy your feasts (and those who are enjoying those feasts with you),

Michael

R Lesson 7: Graphing Linear Regression & Determining Accuracy of the Model


Hello everybody,

It’s Michael, and today’s post is a continuation of the previous one, as I’ll be covering how to graph linear regression models using the dataset and model created in that post.

Just to recap, this model is measuring whether the age of a person influenced how many times their name was mentioned on cable news reports (period measured is from October 1-December 7, 2017, at the beginning of the #MeToo movement). Now let’s graph the model:

The basic syntax for creating a plot like this is plot(dataset$yvariable ~ dataset$xvariable); the y-axis variable is always listed BEFORE the x-axis variable. The portion of code that reads main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions" is completely optional, as all it does is label the x- and y-axes and display a title for the graph.

The abline(linearModel) line adds a line to the graph based on the equation for the model (value=0.09171(age)-3.39221). However, for this function to work, don’t close the window displaying the graph. The line is immediately displayed on the graph after you write the line of code and hit enter.

  • Use the name of your linear regression model in the parentheses so that the line that is created matches the equation for your model.
  • Remember to always hit enter after writing the plot() line, then go back to the coding window, write the abline() line and hit enter again.
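Putting the two steps together, here’s a minimal, self-contained sketch of the workflow. The data frame below is stand-in data I made up so the example runs on its own; the real post fits linearModel on the MeToo dataset.

```r
# Stand-in data (the actual post uses the MeToo dataset instead)
file <- data.frame(age   = c(40, 45, 50, 55, 60, 65),
                   value = c(1, 2, 1, 3, 2, 4))
linearModel <- lm(value ~ age, data = file)

# Scatterplot first: y-axis variable before the ~, x-axis variable after
plot(file$value ~ file$age,
     main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions")

# Then overlay the regression line while the graph window is still open
abline(linearModel)
```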

So we know how to graph our model, but how do we evaluate the accuracy of the model? Take a look at the summary(linearModel) output below:

Focus on the last three lines of this code block, as they will help determine the accuracy (or goodness-of-fit) of this model. Here’s a better explanation of the output:

  • The residual standard error is a measure of the QUALITY of the fit of the model. Every linear regression model contains an error term (ε), and as a result, there will be some margin of error in our regression line. The residual standard error is the amount that the response variable (value) will deviate from the true regression line. In this case, the actual number of times someone was mentioned on the news deviates from the regression line by about 16.2 mentions.
  • The R-squared is a measure of how well the model ACTUALLY fits the data. The closer R-squared is to 1, the better the fit of the model. The main difference between Multiple R-Squared and Adjusted R-Squared is that the former isn’t adjusted for the number of variables in the model while the latter is. The Adjusted R-Squared will decrease when irrelevant variables are added to the model, which makes it a good metric to use when trying to find relevant variables for your model.
    • As you can see, the Multiple R-Squared and Adjusted R-Squared values are quite low (0.47% and 0.46% respectively), indicating that there isn’t a significant relationship between a person’s age and how many times their name was mentioned in news reports.
    • Keep the idea that “correlation does not imply causation” in mind when analyzing the R-Squared. Even when a high R-Squared shows that the independent and dependent variables correlate strongly, that does not mean the independent variable causes the dependent variable (and likewise, a low R-Squared does not rule out a causal link). In the context of this model, even though someone’s age and the number of times their name was mentioned on news reports don’t appear correlated, this doesn’t mean that someone’s age isn’t a factor in the amount of news coverage they receive.
  • The F-statistic is a measure of the relationship (or lack thereof) between the dependent and independent variables. This value (along with the corresponding p-value) isn’t really significant when dealing with simple linear regression models such as this one but it is an important metric to analyze when dealing with multiple linear regression models (just like simple linear regression except with multiple independent variables).
    • I will cover multiple linear regression models in a future post, so keep your eyes peeled.
    • In cases of simple linear regression, the F-statistic is basically the independent variable’s t-value squared. The summary output displayed above proves my point, as the t-value for age (8.049), once squared, equals the F-statistic (8.049²≈64.78). Keep in mind that the F-statistic=(t-value)² rule only applies to models with 1 degree of freedom, just like the one displayed above.
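That t-squared relationship is easy to verify directly in R:

```r
# For a simple linear regression (1 numerator degree of freedom), F = t^2
t_value <- 8.049
t_value^2  # 64.786..., matching the reported F-statistic of 64.78
```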

Thanks for reading,

Michael

 

R Lesson 6: Linear Regression


Hello everybody,

Michael here, and now that I’ve completed my series of MySQL posts (for now), I thought I’d start posting more R lessons. Today’s lesson will be on linear regression, which, like logistic regression, is another type of regression method (go back to R Lesson 4: Logistic Regression Models for a refresher).

The main difference between the two types of regression models is in the dependent variable. While you may recall that the dependent variable for logistic regression is binary (meaning there are only 2 possible outcomes), the dependent variable for linear regression is continuous (meaning it can take on a whole range of possible values).

The dataset I will be using for this analysis is MeToo, which details the amount of media coverage regarding 45 famous men accused of some form of sexual misconduct in the wake of the MeToo movement.

The first step when working with linear regression (or any type of analysis in R) is to understand the data, which we do by creating a file variable and utilizing the read.csv command. We then use the str(file) command to display the variables in the dataset.

Here’s some more context on each of the variables used:

  • date_start: The date of each news report
  • date_end: The date of each news report along with the timestamp T23:59:59Z (denoting the end of the day); this variable will not be relevant in our analysis.
  • date_resolution: This just mentions “day” alongside each corresponding date; this variable is also irrelevant to our analysis.
  • station: The network that covers each news report, of which there are six listed in this dataset (Bloomberg, CNBC, CNN, FOX Business, FOX News, and MSNBC)
  • value: The number of times each person’s name is mentioned on a particular broadcast on a specific date
    • This only counts times where full names are mentioned (e.g. only mentions of “Matt Lauer”, not just “Matt” or “Lauer”)
  • name: The subject of the news report (e.g. Al Franken, Blake Farenthold)
  • age: The age of the subject of the news report
  • occupation: The profession of the subject of the news report (e.g. Louis CK is a comedian)

Now that we understand our variables, let’s build the model.

In this model, I chose the variables value and age (they are the two most quantitative) and built the model on the full data (hence data = file), with value being the dependent variable (that’s why I listed it first) and age being the independent variable. The display for summary(linearModel) is very similar to what you’ll see if you ask for a summary of a logistic regression model.

  • Remember that linear regression models use lm while logistic regression models use glm!
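For a side-by-side reminder of that lm/glm distinction, here’s a minimal sketch. The data frame and column names below are stand-ins I made up, not from the MeToo dataset.

```r
# Stand-in data: one continuous outcome (value), one binary outcome (flagged)
df <- data.frame(age     = c(40, 45, 50, 55, 60, 65),
                 value   = c(1, 2, 1, 3, 2, 4),
                 flagged = c(0, 0, 1, 0, 1, 1))

# Linear regression: continuous dependent variable -> lm()
linearModel <- lm(value ~ age, data = df)

# Logistic regression: binary dependent variable -> glm() with family = binomial
logisticModel <- glm(flagged ~ age, data = df, family = binomial)
```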

Now, let me introduce a new command, print(), which in this case will print the coefficients used in the equation of this model.

In this model, value is a function of age, so when we create the equation for our model, here’s what we get:

  • value=0.09171(age)-3.39221

If this format looks familiar, note that it has the same setup as the slope-intercept equation y=mx+b (remember this from algebra class?).
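As a quick sanity check, those rounded coefficients plug straight into that slope-intercept form. The helper name below is mine, and 50 is just an arbitrary example age.

```r
# value = 0.09171(age) - 3.39221
mentions <- function(age) 0.09171 * age - 3.39221
mentions(50)  # about 1.19 predicted mentions for a 50-year-old
```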

In the next post, we will learn how to graph this model.

Thanks for reading,

Michael