Hello readers,
Michael here, and in today’s lesson, we’re going to try something special! For one, we’re going back to this blog’s statistical roots with a linear regression post; I covered linear regression with R way back in 2018 (R Lesson 6: Linear Regression), so I thought I’d show you how to work through the linear regression process in Python. Two, I’m going to try something I don’t normally do: predict the future. In this case, the future means the results of the just-beginning 2024-25 NBA season. Why try to predict NBA results, you might ask? Well, for one, I wanted to try something new on this blog (hey, gotta keep things fresh six years in), and for two, I enjoy following along with the NBA season. Plus, I enjoyed writing my post on the 2020 NBA playoffs (R Analysis 10: Linear Regression, K-Means Clustering, & the 2020 NBA Playoffs).
Let’s load our data and import our packages!
Before we get started on the analysis, let’s first load our data into our IDE and import all necessary packages:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
You’re likely quite familiar with pandas, but for those of you who don’t know, sklearn is an open-source Python library commonly used for machine learning projects (like the linear regression we’re about to do)!
A note about uploading files via Google Colab
Once we import our necessary packages, the next thing we should do is upload the data-frame we’ll be using for this analysis.
This is the file we’ll be using; it contains team statistics such as turnovers (team totals) and wins for all 30 NBA teams over the last 10 seasons (2014-15 through 2023-24). The data was retrieved from basketball-reference.com, which is a great place to go if you’re looking for juicy basketball data to analyze. The site is part of the https://www.sports-reference.com/ family, which covers everything from the NBA to the NFL to the other football (soccer, for us Americans), among other sports.
Now, since I used Google Colab for this analysis, I’ll show you how to upload CSV files into Colab (a different process from loading files in other IDEs):

To import local files into Google Colab, you’ll need to include the lines from google.colab import files and uploaded = files.upload() in the notebook since, for some odd reason, Google Colab won’t let you upload local files directly into your notebook. Once you run these two lines of code, you’ll need to select a file from the browser tool that you want to upload to Colab.
Next (and ideally in a separate cell), you’ll need to add the lines import io and dataframe = pd.read_csv(io.BytesIO(uploaded['dataframe name'])) to the notebook and run the code. This will officially load your data into a data-frame in your Colab notebook.
- Yes, I know it’s annoying, but that’s just how Colab works. If you’re not using Colab to follow along with me, feel free to skip this section, as a simple pd.read_csv() will do the trick to load your data-frame into the IDE.
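If you’re curious what those two Colab cells boil down to, here’s a minimal sketch; the uploaded dictionary is faked with a couple of made-up rows so the code runs outside Colab, but the io.BytesIO mechanics are the same:

```python
import io
import pandas as pd

# In Colab, files.upload() returns a dict mapping filename -> raw bytes.
# Here we fake that dict (made-up teams and numbers) so the sketch runs anywhere.
uploaded = {'NBA.csv': b"Team,W,L\nBoston Celtics,64,18\nDetroit Pistons,14,68\n"}

# pd.read_csv can read from any file-like object, so we wrap the raw
# bytes in io.BytesIO to make them look like an open file.
NBA = pd.read_csv(io.BytesIO(uploaded['NBA.csv']))
print(NBA)
```

The same pattern works for any file you upload: just swap in your own filename as the dictionary key.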
Let’s learn about our data-frame!
Now that we’ve uploaded our data-frame into the IDE, let’s learn more about it!
NBA.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Season 300 non-null object
1 Team 300 non-null object
2 W 300 non-null int64
3 L 300 non-null int64
4 Finish 300 non-null int64
5 Age 300 non-null float64
6 Ht. 300 non-null object
7 Wt. 300 non-null int64
8 G 300 non-null int64
9 MP 300 non-null int64
10 FG 300 non-null int64
11 FGA 300 non-null int64
12 FG% 300 non-null float64
13 3P 300 non-null int64
14 3PA 300 non-null int64
15 3P% 300 non-null float64
16 2P 300 non-null int64
17 2PA 300 non-null int64
18 2P% 300 non-null float64
19 FT 300 non-null int64
20 FTA 300 non-null int64
21 FT% 300 non-null float64
22 ORB 300 non-null int64
23 DRB 300 non-null int64
24 TRB 300 non-null int64
25 AST 300 non-null int64
26 STL 300 non-null int64
27 BLK 300 non-null int64
28 TOV 300 non-null int64
29 PF 300 non-null int64
30 PTS 300 non-null int64
dtypes: float64(5), int64(23), object(3)
memory usage: 72.8+ KB
Running the NBA.info() command lets us see basic information about all 31 columns in our data-frame (such as column names, the number of non-null records, and each column’s data type).
In case you’re wondering about all the abbreviations, here’s an explanation for each abbreviation:
- Season: The specific season represented by the data (e.g. 2014-15)
- Team: The team name
- W: A team’s wins in a given season
- L: A team’s losses in a given season
- Finish: The seed a team finished with in their conference in a given season (e.g. the Detroit Pistons finishing 15th in the East last season)
- Age: The average age of a team’s roster as of February 1 of a given season (e.g. February 1, 2024 for the 2023-24 season)
- Ht.: The average height of the team’s roster in a given season (e.g. 6’6)
- Wt.: The average weight (in lbs.) of the team’s roster in a given season
- G: Total games played by the team in a given season
- MP: Total minutes played as a team in a given season
- FG: Field goals made by the team in a given season
- FGA: Field goal attempts by the team in a given season
- FG%: Percentage of field goal attempts the team made in a given season
- 3P: 3-point field goals made by the team in a given season
- 3PA: 3-point field goal attempts by the team in a given season
- 3P%: Percentage of 3-point attempts the team made in a given season
- 2P: 2-point field goals made by the team in a given season
- 2PA: 2-point field goal attempts by the team in a given season
- 2P%: Percentage of 2-point attempts the team made in a given season
- FT: Free throws made by the team in a given season
- FTA: Free throw attempts by the team in a given season
- FT%: Percentage of free throw attempts the team made in a given season
- ORB: Team’s total offensive rebounds in a given season
- DRB: Team’s total defensive rebounds in a given season
- TRB: Team’s total rebounds (both offensive and defensive) in a given season
- AST: Team’s total assists in a given season
- STL: Team’s total steals in a given season
- BLK: Team’s total blocks in a given season
- TOV: Team’s total turnovers in a given season
- PF: Team’s total personal fouls in a given season
- PTS: Team’s total points scored in a given season
Wow, that’s a lot of variables! Now that we understand the data we’re working with a bit better, let’s see how we can make a simple linear regression model!
- If you’re not familiar with basketball jargon, the NBA has a great glossary of basic terms on their website: https://www.nba.com/stats/help/glossary
The K-Best Way To Set Up Your Model
Before we start the juicy analysis, let’s first pick the features we’ll use for the model. In this post, we’ll explore Select K-Best, a feature-selection algorithm commonly used with linear regression to help select the best features for a particular model:
X = NBA.drop(['Season', 'Team', 'W', 'Ht.'], axis=1)
y = NBA['W']
from sklearn.feature_selection import SelectKBest, f_regression
features = SelectKBest(score_func=f_regression, k=5)
features.fit(X, y)
selectedFeatures = X.columns[features.get_support()]
print(selectedFeatures)
Index(['L', 'Finish', 'Age', 'FG%', '3P%'], dtype='object')
According to the Select K-Best algorithm, the five best features to use in the linear regression are L, Finish, Age, FG%, and 3P%. In other words, a team’s total losses, end-of-season conference seeding, average roster age, and field goal and 3-point percentages are the five most important features for predicting a team’s win total.
How did the model arrive at these conclusions? First of all, I set the X and y variables. This is important, as the Select K-Best algorithm needs to know which variable is the dependent variable and which are the candidate independent variables for the model. In this example, the dependent (or y) variable is W (for team wins), while the X variable includes all other dataset columns except for W, Team, Season, and Ht., because W is the y variable and the other three are categorical (or non-numerical) variables, so they really won’t work in our analysis.
Next we import SelectKBest and f_regression from the sklearn.feature_selection module. Why do we need these two? Well, SelectKBest runs the Select K-Best algorithm itself, while f_regression is the scoring function working behind the scenes that lets the algorithm rank the features and select the top k for the model (I used five features for this model).
After setting up the Select K-Best algorithm, we then fit both the X and y variables to the algorithm and then print out our top five selectedFeatures.
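If you want to peek at the F-scores behind the ranking, SelectKBest exposes them via its scores_ attribute after fitting. Here’s a small sketch on synthetic data (the column names and values are made up for illustration, not taken from the NBA dataset):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the NBA data: two informative columns, two noise columns.
X = pd.DataFrame({
    'L': rng.normal(size=n),
    'Finish': rng.normal(size=n),
    'Noise1': rng.normal(size=n),
    'Noise2': rng.normal(size=n),
})
y = 3 * X['L'] - 2 * X['Finish'] + rng.normal(scale=0.5, size=n)

selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X, y)

# scores_ holds the F-statistic for each candidate feature;
# higher scores mean a stronger linear relationship with y.
for name, score in zip(X.columns, selector.scores_):
    print(f"{name}: {score:.1f}")

# The two informative columns win out over the noise columns.
print(X.columns[selector.get_support()])
```

Printing the scores like this is a nice way to see not just *which* features were kept, but *how far ahead* of the rest they were.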
Train, test…split!
Once we have our top five features for the model, it’s time for the train, test, splitting of the model! What is train, test, split, you ask? Well, our data will be split into two sets: training data (the data we use to train the model) and testing data (the data we use to test it). Here’s how we can utilize the train, test, split for this model:
X = NBA[['L', 'Finish', 'Age', 'FG%', '3P%']]
y = NBA['W']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
How does the train, test, split work? Using sklearn’s train_test_split method, we pass in four parameters: our independent variables (X), our dependent variable (y), the size of the test data (a decimal between 0 and 1), and the random state, which seeds the shuffle so you get the same split every time you run the code (0 works fine here; 42 is another popular choice). In this model, I will utilize an 80/20 train, test, split, which means that 80% of the data will be used for training while the other 20% will be used for testing.
Other common train, test, splits are 70/30, 85/15, and 67/33, but I opted for 80/20 because our dataset is only 300 rows long. I would utilize these other train, test, splits for larger datasets.
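To sanity-check the proportions, here’s a quick sketch using a stand-in 300-row data-frame (the column names are placeholders, not the real NBA columns). An 80/20 split of 300 rows should leave 240 training rows and 60 test rows:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A stand-in 300-row frame, the same length as the NBA dataset.
df = pd.DataFrame({'feature': range(300), 'W': range(300)})

X_train, X_test, y_train, y_test = train_test_split(
    df[['feature']], df['W'], test_size=0.2, random_state=0
)

# An 80/20 split of 300 rows: 240 rows for training, 60 for testing.
print(len(X_train), len(X_test))  # 240 60
```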
- Something worth noting: What we’re doing here is called multiple linear regression since we’re using five X variables to predict a Y variable. Simple linear regression would only use one X variable to predict a Y variable. Just thought I’d throw in this quick factoid!
And now, for the model-making
Now that we’ve done all the steps to set up our model, the next thing we’ll need to do is actually create the model!
Here’s how we can get started:
NBAMODEL = LinearRegression()
NBAMODEL.fit(X_train, y_train)
LinearRegression()
In this example, we create a LinearRegression() object (NBAMODEL) and fit it to both the X_train and y_train data.
Predictions, predictions
Once we’ve created our model, next comes the fun part: generating the predictions!
yPredictions = NBAMODEL.predict(X_test)
yPredictions
array([53.20097648, 28.89541793, 52.26551381, 53.22220829, 35.90676716,
32.15874993, 47.72090936, 48.32896277, 39.4193884 , 40.1548429 ,
19.62678175, 48.3263792 , 32.13473281, 43.50887634, 43.85260484,
52.79795145, 27.35822648, 40.23392095, 18.85423981, 61.69624816,
51.59650403, 23.86311747, 56.18087097, 54.15867678, 49.75211403,
46.90177259, 31.80109001, 46.82531833, 37.50563942, 32.19863141,
52.41205133, 25.09011881, 48.94542256, 38.80244997, 24.80146638,
42.50107728, 43.27320835, 37.45199938, 46.7795962 , 28.11289951,
57.64388881, 29.35812466, 18.3222965 , 36.26677012, 20.56912227,
22.15266241, 19.9955299 , 44.84930613, 45.14740453, 23.19471644,
53.940611 , 26.0780373 , 27.88093669, 61.23347337, 52.99948229,
34.66653881, 30.04421016, 27.21669768, 48.55215233, 47.11060905])
The yPredictions are obtained by using the predict method on the model with the X_test data, which in this case consists of 60 of the 300 records.
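If you’d like to line the predictions up against the actual win totals, one handy trick is to drop both into a small comparison data-frame. Here’s a sketch with a toy model and made-up data (the NBA frame isn’t loaded here), but the last two lines work the same way on our y_test and yPredictions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: y is an exact linear function of two features.
rng = np.random.default_rng(0)
X = pd.DataFrame({'x1': rng.normal(size=100), 'x2': rng.normal(size=100)})
y = 2 * X['x1'] + 3 * X['x2']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
yPredictions = model.predict(X_test)

# Lining up actual and predicted values side by side makes big misses easy to spot.
comparison = pd.DataFrame({'Actual': y_test.values, 'Predicted': yPredictions})
print(comparison.head())
```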
Evaluating the model’s accuracy
Once we’ve created the model and made our predictions on the test data, it’s time to evaluate the model’s accuracy. Here’s how to do so:
from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_test,yPredictions)
0.09147159762376074
There are several ways to evaluate the accuracy of a linear regression model. One good method, shown here, is mean_absolute_percentage_error (imported from sklearn.metrics). The mean absolute percentage error measures, on average, how far off the model’s predictions are from the actual values, as a percentage. In this model, the mean absolute percentage error is 0.09147159762376074, indicating that the model’s predictions are off by roughly 9% on average, or, put another way, roughly 91% accurate. Not too shabby for this model!
- Interestingly, the two COVID impacted NBA seasons in the dataset (2019-20 and 2020-21) didn’t throw off the model’s accuracy much.
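In case you’re wondering what mean_absolute_percentage_error actually computes: it’s the average of |actual - predicted| / |actual| across all test rows. Here’s a quick sketch with made-up numbers verifying that the manual formula matches sklearn’s function:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Made-up actual and predicted win totals for four teams.
y_true = np.array([50.0, 40.0, 25.0, 60.0])
y_pred = np.array([55.0, 38.0, 20.0, 63.0])

# MAPE by hand: the average of |actual - predicted| / |actual|.
manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))
sk = mean_absolute_percentage_error(y_true, y_pred)

print(manual, sk)  # both are 0.1, i.e. off by 10% on average
```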
Don’t forget about the equation!
Evaluating the model’s accuracy isn’t the only thing you should do when analyzing the model. You should also grab the model’s coefficients and intercept; they will be important in the next post!
NBAMODEL.coef_
array([ -0.4663858 , -1.30716212, 0.39700734, 34.1325687 ,
-22.12258585])
NBAMODEL.intercept_
50.945769772855854
All linear regression models have coefficients and an intercept, which together form the linear regression equation. Since our model has five X variables, there are five coefficients.
Now, what would our equation look like?
Here it is in all its messy glory, using the coefficients and intercept above (rounded to four decimal places):

Predicted W = 50.9458 - 0.4664(L) - 1.3072(Finish) + 0.3970(Age) + 34.1326(FG%) - 22.1226(3P%)

We’re going to be using this equation in the next post.
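One nice sanity check: plugging values into the equation by hand (the intercept plus each coefficient times its feature value) gives exactly the same answer as calling predict(). Here’s a sketch with synthetic data, since the idea works the same for any LinearRegression model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 50 rows, 5 features, with known coefficients and intercept.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 4.0

model = LinearRegression().fit(X, y)

# The regression equation: y_hat = intercept + sum(coef_i * x_i).
x_new = X[0]
by_hand = model.intercept_ + np.dot(model.coef_, x_new)
via_predict = model.predict(x_new.reshape(1, -1))[0]

print(by_hand, via_predict)  # the two values match
```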
Linear regression plotting
For the visual learners among my readers, I thought it would be nice to include a simple scatterplot to visualize the accuracy of our linear regression model. Here’s how to create that plot:
import matplotlib.pyplot as plt
plt.scatter(y_test, yPredictions, color="red")
plt.xlabel('Actual values', size=15)
plt.ylabel('Predicted values', size=15)
plt.title('Actual vs Predicted values', size=15)
plt.show()
First, I imported the matplotlib.pyplot module. Then, I ran the plt.scatter() method to create a scatterplot. I used three parameters for this method: the y_test values, the yPredictions values, and the color="red" parameter (this just indicated that I wanted red scatterplot dots). I then used the plt.xlabel(), plt.ylabel(), and plt.title() methods to give the scatterplot an x-label title, y-label title, and title, respectively. Lastly, I used the plt.show() method to display the scatterplot in all of its red-dotted glory.
As you can see from this plot, the predicted values match the actual values fairly closely, hence the 91% accuracy/9% error.
Thanks for reading, enjoy the upcoming NBA season action, and stay tuned for my next post, where I reveal my predicted records and standings for each team, East and West! It will be interesting to see how my predictions pan out over the course of the season. After all, it’s certainly something different I’m trying on this blog!
And yes, perfect timing for this blog to come out on NBA season opening day! Serendipity am I right?
Also, here’s a link to the notebook on GitHub: https://github.com/mfletcher2021/DevopsBasics/blob/master/NBA_24_25_predictions.ipynb.