Hello readers,
Michael here, and in today’s lesson, we’re going to try something special! For one, we’re going back to this blog’s statistical roots with a linear regression post; I covered linear regression with R way back in 2018 (R Lesson 6: Linear Regression), so I thought I’d show you how to work through the linear regression process in Python. Two, I’m going to try something I don’t normally do: predict the future. In this case, the future means the results of the just-beginning 2024-25 NBA season. Why try to predict NBA results, you might ask? Well, for one, I wanted to try something new on this blog (hey, gotta keep things fresh six years in), and for two, I enjoy following along with the NBA season. Plus, I enjoyed writing my post on the 2020 NBA playoffs (R Analysis 10: Linear Regression, K-Means Clustering, & the 2020 NBA Playoffs).
Let’s load our data and import our packages!
Before we get started on the analysis, let’s first load our data into our IDE and import all necessary packages:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
You’re likely quite familiar with pandas, but for those of you who don’t know, sklearn is an open-source Python library commonly used for machine learning projects (like the linear regression we’re about to do)!
A note about uploading files via Google Colab
Once we import our necessary packages, the next thing we should do is upload the data-frame we’ll be using for this analysis.
This is the file we’ll be using; it contains team statistics such as turnovers (team totals) and wins for all 30 NBA teams over the last 10 seasons (2014-15 through 2023-24). The data was retrieved from basketball-reference.com, which is a great place to go if you’re looking for juicy basketball data to analyze. The site is part of the https://www.sports-reference.com/ family, which covers everything from the NBA to the NFL to the other football (soccer, for us Americans), among other sports.
Now, since I used Google Colab for this analysis, I’ll show you how to upload CSV files into Colab (a different process from loading files in other IDEs):

To import local files into Google Colab, you’ll need to include the lines from google.colab import files and uploaded = files.upload() in the notebook since, for some odd reason, Google Colab won’t let you upload local files directly into your notebook. Once you run these two lines of code, you’ll need to select a file from the browser tool that you want to upload to Colab.
Next (and ideally in a separate cell), you’ll need to add the lines import io and dataframe = pd.read_csv(io.BytesIO(uploaded['dataframe name'])) to the notebook and run the code. This will officially load your data into a data-frame in your Colab notebook.
- Yes, I know it’s annoying, but that’s just how Colab works. If you’re not using Colab to follow along with me, feel free to skip this section, as a simple pd.read_csv() will do the trick to load your data-frame into the IDE.
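If you’re curious what those two Colab cells boil down to, here’s a minimal sketch; the uploaded dictionary is faked with a couple of made-up rows so the code runs outside Colab, but the io.BytesIO mechanics are the same:

```python
import io
import pandas as pd

# In Colab, files.upload() returns a dict mapping filename -> raw bytes.
# Here we fake that dict (made-up teams and numbers) so the sketch runs anywhere.
uploaded = {'NBA.csv': b"Team,W,L\nBoston Celtics,64,18\nDetroit Pistons,14,68\n"}

# pd.read_csv can read from any file-like object, so we wrap the raw
# bytes in io.BytesIO to make them look like an open file.
NBA = pd.read_csv(io.BytesIO(uploaded['NBA.csv']))
print(NBA)
```

The same pattern works for any file you upload: just swap in your own filename as the dictionary key.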
Let’s learn about our data-frame!
Now that we’ve uploaded our data-frame into the IDE, let’s learn more about it!
NBA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Season 300 non-null object
1 Team 300 non-null object
2 W 300 non-null int64
3 L 300 non-null int64
4 Finish 300 non-null int64
5 Age 300 non-null float64
6 Ht. 300 non-null object
7 Wt. 300 non-null int64
8 G 300 non-null int64
9 MP 300 non-null int64
10 FG 300 non-null int64
11 FGA 300 non-null int64
12 FG% 300 non-null float64
13 3P 300 non-null int64
14 3PA 300 non-null int64
15 3P% 300 non-null float64
16 2P 300 non-null int64
17 2PA 300 non-null int64
18 2P% 300 non-null float64
19 FT 300 non-null int64
20 FTA 300 non-null int64
21 FT% 300 non-null float64
22 ORB 300 non-null int64
23 DRB 300 non-null int64
24 TRB 300 non-null int64
25 AST 300 non-null int64
26 STL 300 non-null int64
27 BLK 300 non-null int64
28 TOV 300 non-null int64
29 PF 300 non-null int64
30 PTS 300 non-null int64
dtypes: float64(5), int64(23), object(3)
memory usage: 72.8+ KB
Running the NBA.info() command lets us see basic information about all 31 columns in our data-frame (such as column names, the number of non-null records, and each column’s data type).
In case you’re wondering about all the abbreviations, here’s an explanation for each abbreviation:
- Season: The specific season represented by the data (e.g. 2014-15)
- Team: The team name
- W: A team’s wins in a given season
- L: A team’s losses in a given season
- Finish: The seed a team finished with in their conference in a given season (e.g. the Detroit Pistons finishing 15th in the East last season)
- Age: The average age of a team’s roster as of February 1 of a given season (e.g. February 1, 2024 for the 2023-24 season)
- Ht.: The average height of the team’s roster in a given season (e.g. 6’6)
- Wt.: The average weight (in lbs.) of the team’s roster in a given season
- G: Total games played by the team in a given season
- MP: Total minutes played as a team in a given season
- FG: Field goals made by the team in a given season
- FGA: Field goal attempts by the team in a given season
- FG%: Percentage of field goal attempts the team made in a given season
- 3P: 3-point field goals made by the team in a given season
- 3PA: 3-point field goal attempts by the team in a given season
- 3P%: Percentage of 3-point attempts the team made in a given season
- 2P: 2-point field goals made by the team in a given season
- 2PA: 2-point field goal attempts by the team in a given season
- 2P%: Percentage of 2-point attempts the team made in a given season
- FT: Free throws made by the team in a given season
- FTA: Free throw attempts by the team in a given season
- FT%: Percentage of free throw attempts the team made in a given season
- ORB: Team’s total offensive rebounds in a given season
- DRB: Team’s total defensive rebounds in a given season
- TRB: Team’s total rebounds (both offensive and defensive) in a given season
- AST: Team’s total assists in a given season
- STL: Team’s total steals in a given season
- BLK: Team’s total blocks in a given season
- TOV: Team’s total turnovers in a given season
- PF: Team’s total personal fouls in a given season
- PTS: Team’s total points scored in a given season
Wow, that’s a lot of variables! Now that we understand the data we’re working with a bit better, let’s see how we can make a simple linear regression model!
- If you’re not familiar with basketball jargon, the NBA has a great glossary of basic terms on their website: https://www.nba.com/stats/help/glossary
The K-Best Way To Set Up Your Model
Before we start the juicy analysis, let’s first pick the features we’ll use for the model. In this post, we’ll explore Select K-Best, a feature-selection algorithm commonly used with linear regression to help select the best features for a particular model:
X = NBA.drop(['Season', 'Team', 'W', 'Ht.'], axis=1)
y = NBA['W']
from sklearn.feature_selection import SelectKBest, f_regression
features = SelectKBest(score_func=f_regression, k=5)
features.fit(X, y)
selectedFeatures = X.columns[features.get_support()]
print(selectedFeatures)
Index(['L', 'Finish', 'Age', 'FG%', '3P%'], dtype='object')
According to the Select K-Best algorithm, the five best features to use in the linear regression are L, Finish, Age, FG%, and 3P%. In other words, a team’s total losses, end-of-season conference seeding, average roster age, and field goal and 3-point percentages are the five most important features for predicting a team’s win total.
How did the model arrive at these conclusions? First of all, I set the X and y variables. This is important, as the Select K-Best algorithm needs to know which variable is the dependent variable and which are the candidate independent variables for the model. In this example, the dependent (or y) variable is W (for team wins), while the X variable includes all other dataset columns except for W, Team, Season, and Ht., because W is the y variable and the other three are categorical (or non-numerical) variables, so they really won’t work in our analysis.
Next we import SelectKBest and f_regression from the sklearn.feature_selection module. Why do we need these two? Well, SelectKBest runs the Select K-Best algorithm itself, while f_regression is the scoring function working behind the scenes that lets the algorithm rank the features and select the top k for the model (I used five features for this model).
After setting up the Select K-Best algorithm, we then fit both the X and y variables to the algorithm and then print out our top five selectedFeatures.
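If you want to peek at the F-scores behind the ranking, SelectKBest exposes them via its scores_ attribute after fitting. Here’s a small sketch on synthetic data (the column names and values are made up for illustration, not taken from the NBA dataset):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the NBA data: two informative columns, two noise columns.
X = pd.DataFrame({
    'L': rng.normal(size=n),
    'Finish': rng.normal(size=n),
    'Noise1': rng.normal(size=n),
    'Noise2': rng.normal(size=n),
})
y = 3 * X['L'] - 2 * X['Finish'] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

# scores_ holds the F-statistic for each candidate feature;
# higher scores mean a stronger linear relationship with y.
for name, score in zip(X.columns, selector.scores_):
    print(f"{name}: {score:.1f}")

# The two informative columns win out over the noise columns.
print(X.columns[selector.get_support()])
```

Printing the scores like this is a nice way to see not just *which* features were kept, but *how far ahead* of the rest they were.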
Train, test…split!
Once we have our top five features for the model, it’s time for the train, test, splitting of the model! What is train, test, split, you ask? Well, our data will be split into two sets: training data (the data we use to train the model) and testing data (the data we use to test it). Here’s how we can utilize the train, test, split for this model:
X = NBA[['L', 'Finish', 'Age', 'FG%', '3P%']]
y = NBA['W']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
How does the train, test, split work? Using sklearn’s train_test_split method, we pass in four parameters: our independent variables (X), our dependent variable (y), the size of the test data (a decimal between 0 and 1), and the random state, which seeds the shuffle so you get the same split every time you run the code (0 works fine here; 42 is another popular choice). In this model, I will utilize an 80/20 train, test, split, which means that 80% of the data will be used for training while the other 20% will be used for testing.
Other common train, test, splits are 70/30, 85/15, and 67/33, but I opted for 80/20 because our dataset is only 300 rows long. I would utilize these other train, test, splits for larger datasets.
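To sanity-check the proportions, here’s a quick sketch using a stand-in 300-row data-frame (the column names are placeholders, not the real NBA columns). An 80/20 split of 300 rows should leave 240 training rows and 60 test rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A stand-in 300-row frame, the same length as the NBA dataset.
df = pd.DataFrame({'feature': range(300), 'W': range(300)})

X_train, X_test, y_train, y_test = train_test_split(
    df[['feature']], df['W'], test_size=0.2, random_state=0
)

# An 80/20 split of 300 rows: 240 rows for training, 60 for testing.
print(len(X_train), len(X_test))  # 240 60
```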
- Something worth noting: What we’re doing here is called multiple linear regression since we’re using five X variables to predict a Y variable. Simple linear regression would only use one X variable to predict a Y variable. Just thought I’d throw in this quick factoid!
And now, for the model-making
Now that we’ve done all the steps to set up our model, the next thing we’ll need to do is actually create the model!
Here’s how we can get started:
NBAMODEL = LinearRegression()
NBAMODEL.fit(X_train, y_train)
LinearRegression()
In this example, we create a LinearRegression() object (NBAMODEL) and fit it to both the X_train and y_train data.
Predictions, predictions
Once we’ve created our model, next comes the fun part: generating the predictions!
yPredictions = NBAMODEL.predict(X_test)
yPredictions
array([53.20097648, 28.89541793, 52.26551381, 53.22220829, 35.90676716,
32.15874993, 47.72090936, 48.32896277, 39.4193884 , 40.1548429 ,
19.62678175, 48.3263792 , 32.13473281, 43.50887634, 43.85260484,
52.79795145, 27.35822648, 40.23392095, 18.85423981, 61.69624816,
51.59650403, 23.86311747, 56.18087097, 54.15867678, 49.75211403,
46.90177259, 31.80109001, 46.82531833, 37.50563942, 32.19863141,
52.41205133, 25.09011881, 48.94542256, 38.80244997, 24.80146638,
42.50107728, 43.27320835, 37.45199938, 46.7795962 , 28.11289951,
57.64388881, 29.35812466, 18.3222965 , 36.26677012, 20.56912227,
22.15266241, 19.9955299 , 44.84930613, 45.14740453, 23.19471644,
53.940611 , 26.0780373 , 27.88093669, 61.23347337, 52.99948229,
34.66653881, 30.04421016, 27.21669768, 48.55215233, 47.11060905])
The yPredictions are obtained by using the predict method on the model with the X_test data, which in this case consists of 60 of the 300 records.
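If you’d like to line the predictions up against the actual win totals, one handy trick is to drop both into a small comparison data-frame. Here’s a sketch with a toy model and made-up data (the NBA frame isn’t loaded here), but the last two lines work the same way on our y_test and yPredictions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: y is an exact linear function of two features.
rng = np.random.default_rng(0)
X = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
y = 2 * X['x1'] + 3 * X['x2']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
yPredictions = model.predict(X_test)

# Lining up actual and predicted values side by side makes big misses easy to spot.
comparison = pd.DataFrame({'Actual': y_test.values, 'Predicted': yPredictions})
print(comparison.head())
```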
Evaluating the model’s accuracy
Once we’ve created the model and made our predictions on the test data, it’s time to evaluate the model’s accuracy. Here’s how to do so:
from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_test,yPredictions)
0.09147159762376074
There are several ways to evaluate the accuracy of a linear regression model. One good method, shown here, is mean_absolute_percentage_error (imported from sklearn.metrics). The mean absolute percentage error measures, on average, how far off the model’s predictions are from the actual values, as a percentage. In this model, the mean absolute percentage error is 0.09147159762376074, indicating that the model’s predictions are off by roughly 9% on average, or, put another way, roughly 91% accurate. Not too shabby for this model!
- Interestingly, the two COVID impacted NBA seasons in the dataset (2019-20 and 2020-21) didn’t throw off the model’s accuracy much.
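In case you’re wondering what mean_absolute_percentage_error actually computes: it’s the average of |actual - predicted| / |actual| across all test rows. Here’s a quick sketch with made-up numbers verifying that the manual formula matches sklearn’s function:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up actual and predicted win totals for four teams.
y_true = np.array([50.0, 40.0, 25.0, 60.0])
y_pred = np.array([55.0, 38.0, 20.0, 63.0])

# MAPE by hand: the average of |actual - predicted| / |actual|.
manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
sk = mean_absolute_percentage_error(y_true, y_pred)

print(manual, sk)  # both are 0.1, i.e. off by 10% on average
```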
Don’t forget about the equation!
Evaluating the model’s accuracy isn’t the only thing you should do when analyzing the model. You should also grab the model’s coefficients and intercept; they will be important in the next post!
NBAMODEL.coef_
array([ -0.4663858 , -1.30716212, 0.39700734, 34.1325687 ,
-22.12258585])
NBAMODEL.intercept_
50.945769772855854
All linear regression models have coefficients and an intercept, which together form the linear regression equation. Since our model has five X variables, there are five coefficients.
Now, what would our equation look like?
Here it is in all its messy glory, using the coefficients and intercept above (rounded to four decimal places):

Predicted W = 50.9458 - 0.4664(L) - 1.3072(Finish) + 0.3970(Age) + 34.1326(FG%) - 22.1226(3P%)

We’re going to be using this equation in the next post.
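One nice sanity check: plugging values into the equation by hand (the intercept plus each coefficient times its feature value) gives exactly the same answer as calling predict(). Here’s a sketch with synthetic data, since the idea works the same for any LinearRegression model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 50 rows, 5 features, with known coefficients and intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 4.0

model = LinearRegression().fit(X, y)

# The regression equation: y_hat = intercept + sum(coef_i * x_i).
x_new = X[0]
by_hand = model.intercept_ + np.dot(model.coef_, x_new)
via_predict = model.predict(x_new.reshape(1, -1))[0]

print(by_hand, via_predict)  # the two values match
```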
Linear regression plotting
For the visual learners among my readers, I thought it would be nice to include a simple scatterplot to visualize the accuracy of our linear regression model. Here’s how to create that plot:
import matplotlib.pyplot as plt
plt.scatter(y_test, yPredictions, color="red")
plt.xlabel('Actual values', size=15)
plt.ylabel('Predicted values', size=15)
plt.title('Actual vs Predicted values', size=15)
plt.show()
First, I imported the matplotlib.pyplot module. Then, I ran the plt.scatter() method to create a scatterplot. I used three parameters for this method: the y_test values, the yPredictions values, and the color="red" parameter (this just indicated that I wanted red scatterplot dots). I then used the plt.xlabel(), plt.ylabel(), and plt.title() methods to give the scatterplot an x-label title, y-label title, and title, respectively. Lastly, I used the plt.show() method to display the scatterplot in all of its red-dotted glory.
As you can see from this plot, the predicted values match the actual values fairly closely, hence the 91% accuracy/9% error.
Thanks for reading, enjoy the upcoming NBA season action, and stay tuned for my next post, where I reveal my predicted records and standings for each team, East and West! It will be interesting to see how my predictions pan out over the course of the season. After all, it’s certainly something different I’m trying on this blog!
And yes, perfect timing for this blog to come out on NBA season opening day! Serendipity am I right?
Also, here’s a link to the notebook on GitHub: https://github.com/mfletcher2021/DevopsBasics/blob/master/NBA_24_25_predictions.ipynb.