machine learning Archives - Michael's Programming Bytes

And Now For Michael’s Programming Bytes 2025-26 NBA Season Predictions


Hello everybody,

Michael here, and in today’s post we will discuss what I think is the fun part of this NBA prediction series: the predictions themselves.

That’s right, now that we have our model, let’s make some predictions for the season!

Now where did we leave off?

Before we get into the juicy NBA season predictions, let’s first revisit where we left off on the previous post Another Crack At Linear Regression NBA Machine Learning Predictions (2025-26 edition):

Towards the end of the previous post, we generated this equation to assist us in generating our linear regression NBA season predictions for this year. To recap what the equation means:

64.8 * (field goal %)
PLUS 113 * (3-point %)
PLUS 15.4 * (2-point %)
MINUS 1.94 * (seeding at end of season)
PLUS 0.011 * (total rebounds)
MINUS 0.00346 * (total assists)
PLUS 0.0215 * (total steals)
PLUS 0.00663 * (total blocks)
MINUS 0.0097 * (total turnovers)
MINUS 60.38 (the intercept)

That’s quite a mouthful, but I’ll show you the Python calculations we’ll be doing in order to generate those juicy predictions!

I’ll admit that even I’m not perfect with my blogs here, as I made a small mistake in the previous post, which showed part of the equation as 215 * (total steals) rather than 0.0215 * (total steals). As it turns out, even experienced coders like me make oversights, so apologies for that!

A little disclaimer here

Before we dive in to our predictions, I want to clarify that these are simply win total/conference seeding predictions based off of a simple linear regression model configured by me. I personally wouldn’t use these predictions for any bets or parlays because first and foremost, I am your friendly neighborhood coding blogger, not your friendly neighborhood sportsbook. You can count on me for juicy, way-too-early predictions, but certainly not for any juicy over/unders.

If you do bet on NBA games this season, please do so responsibly! Thank you!

The way of the weighted averages

You may recall that for my post on last NBA season’s predictions, we used weighted averages to help generate the predictions. Since I personally liked that method, I’ll do so again.

Here’s the file with the weighted averages, which we’ll be using to calculate the predictions:

We’ll use the same methodology as we did last year for calculating the weighted averages, which went like this:

2022-23 to 2024-25 (last 3 seasons): 0.2 weight (higher weight for the three most recent seasons)
2019-20 to 2021-22 (three seasons prior to that): 0.1 weight (less weight for seasons further in the past, plus this timespan includes the two COVID-shortened seasons)
2015-16 to 2018-19 (four seasons further back): 0.025 weight (even less weight for these seasons further in the past)
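As a rough sketch of how those weights could be applied: if each weight is applied per season, the ten weights add up to 1 (3 × 0.2 + 3 × 0.1 + 4 × 0.025), so the weighted average is just each season's stat times its weight, summed up. The `Season` label format, the `weighted_average` helper and the FG% numbers below are all assumptions made up for illustration; they aren't taken from my file.

```python
import pandas as pd

# Per-season weights as described above: 0.2 for the three most recent seasons,
# 0.1 for the three before that, 0.025 for the four oldest.
WEIGHTS = {
    '2024-25': 0.2, '2023-24': 0.2, '2022-23': 0.2,
    '2021-22': 0.1, '2020-21': 0.1, '2019-20': 0.1,
    '2018-19': 0.025, '2017-18': 0.025, '2016-17': 0.025, '2015-16': 0.025,
}

def weighted_average(team_rows, stat):
    """Weighted average of one stat across a single team's seasons (hypothetical helper)."""
    weights = team_rows['Season'].map(WEIGHTS)
    # Dividing by the weight total keeps the result correct even if a season is missing.
    return (team_rows[stat] * weights).sum() / weights.sum()

# Made-up FG% values for one imaginary team (NOT real data).
sample = pd.DataFrame({
    'Season': list(WEIGHTS),
    'FG%': [0.48, 0.47, 0.47, 0.46, 0.46, 0.45, 0.45, 0.45, 0.44, 0.44],
})

total_weight = sum(WEIGHTS.values())
sample_fg = weighted_average(sample, 'FG%')
print(round(total_weight, 6), round(sample_fg, 4))
```

Because the most recent seasons carry 60% of the total weight, the result sits closer to the recent FG% values than a plain average would.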

Now here’s the weighted averages file for all 30 teams:

weighted averages 2025-26 Download

So, without further ado, let’s predict some win totals!

import pandas as pd

NBAAVG = pd.read_csv(r'C:\Users\mof39\OneDrive\Documents\weighted averages 2025-26.csv')

# Apply the equation to every team at once; pandas does the math row by row, so no loop is needed
predicted_wins = (64.8*NBAAVG['FG%'] + 113*NBAAVG['3P%'] + 15.4*NBAAVG['2P%'] - 1.94*NBAAVG['Finish'] + 0.011*NBAAVG['TRB'] - 0.00346*NBAAVG['AST'] + 0.0215*NBAAVG['STL'] + 0.00663*NBAAVG['BLK'] - 0.0097*NBAAVG['TOV'] - 60.38)
print(predicted_wins)

0     40.21900 (Atlanta Hawks)
1     52.98400 (Boston Celtics)
2     37.83554 (Brooklyn Nets)
3     34.54740 (Charlotte Hornets)
4     40.36300 (Chicago Bulls)
5     50.23470 (Cleveland Cavaliers)
6     43.38590 (Dallas Mavericks)
7     50.17750 (Denver Nuggets)
8     33.35420 (Detroit Pistons)
9     45.65995 (Golden State Warriors)
10    41.07520 (Houston Rockets)
11    45.93936 (Indiana Pacers)
12    51.35416 (LA Clippers)
13    45.88495 (LA Lakers)
14    47.79176 (Memphis Grizzlies)
15    41.26986 (Miami Heat)
16    48.55712 (Milwaukee Bucks)
17    47.12266 (Minnesota Timberwolves)
18    40.65833 (New Orleans Pelicans)
19    47.90818 (NY Knicks)
20    58.57943 (Oklahoma City Thunder)
21    41.25042 (Orlando Magic)
22    43.73122 (Philadelphia 76ers)
23    45.39194 (Phoenix Suns)
24    39.99757 (Portland Trail Blazers)
25    43.89877 (Sacramento Kings)
26    39.38897 (San Antonio Spurs)
27    39.13594 (Toronto Raptors)
28    40.82412 (Utah Jazz)
29    36.85122 (Washington Wizards)

Once I read the weighted averages CSV and ran the equation, I got the predicted win totals for all 30 teams, which I will use for my way-too-early East/West seeding chart. Note that since the team names aren’t shown in the output, I took the liberty of manually adding each team name next to its predicted win total so you know your favorite team’s projected win total (according to my model, of course).
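If you'd rather not type the team names in by hand, one option is to index the predictions by team name before printing. Here's a minimal sketch, assuming the CSV has the same `Team` column and stat columns used in the code above. The `predict_wins` helper is just my name for the equation, and the two-row frame is a stand-in with made-up numbers, not real team data.

```python
import pandas as pd

def predict_wins(df):
    """Apply the rounded regression equation from this post to every row."""
    return (64.8*df['FG%'] + 113*df['3P%'] + 15.4*df['2P%'] - 1.94*df['Finish']
            + 0.011*df['TRB'] - 0.00346*df['AST'] + 0.0215*df['STL']
            + 0.00663*df['BLK'] - 0.0097*df['TOV'] - 60.38)

# Stand-in for the weighted averages CSV: two imaginary teams (NOT real data).
demo = pd.DataFrame({
    'Team': ['Team B', 'Team A'],
    'FG%': [0.45, 0.47], '3P%': [0.34, 0.36], '2P%': [0.52, 0.54],
    'Finish': [12, 5], 'TRB': [3500, 3600], 'AST': [2000, 2100],
    'STL': [580, 620], 'BLK': [380, 400], 'TOV': [1200, 1100],
})

# Using Team as the index makes the names show up next to each win total.
labeled = predict_wins(demo.set_index('Team')).sort_values(ascending=False)
print(labeled.round(2))
```

Sorting in descending order also puts the teams in projected-wins order, which is handy for the seeding step.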

One interesting difference between this year’s projected win totals and last year’s is the narrower range of possible win totals in this year’s model. See, the range of possible win totals in last year’s model was 24-54 wins, while the range of possible win totals in this year’s model is just 33-59 wins. Could the narrower possible win total range be due to the different features I used in this year’s model? It’ll be interesting to see how the season plays out.

Another interesting thing to note is that even though there is a narrower range of potential wins in this year’s model, the majority of teams’ win counts last season fell into this range: 20 teams won between 33 and 59 games last season (Knicks, Pacers, Bucks, Pistons, Magic, Hawks, Bulls, Heat, Rockets, Lakers, Nuggets, Clippers, Timberwolves, Warriors, Grizzlies, Kings, Mavericks, Suns, Trail Blazers and Spurs).

How will the win counts look this time around? We’ll see as the season unfolds!

Michael’s Way-Too-Early Conference Seeding:

And now, for the stuff I really wanted to share with you all in this post: Michael’s Way-Too-Early Conference Seeding. Now that we’ve got our projected win totals for each team, it’s time to seed them in their projected spots! But that’s not all I’m going to do!

In addition to the model’s projected seedings, I’ll also give you my own personal seedings for all 30 teams. That’s right: this year, I want to see which set of predictions comes out more accurate, my predictions or my model’s predictions. This will be fun to revisit next July once the season wraps up!

Eastern Conference predictions

To begin, let’s start with the model’s Eastern Conference predictions:

Play-Offs	Play-Ins	Maybe Next Year
1. Boston Celtics	7. Miami Heat	11. Toronto Raptors
2. Cleveland Cavaliers	8. Orlando Magic	12. Brooklyn Nets
3. Milwaukee Bucks	9. Chicago Bulls	13. Washington Wizards
4. New York Knicks	10. Atlanta Hawks	14. Charlotte Hornets
5. Indiana Pacers		15. Detroit Pistons
6. Philadelphia 76ers
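For the curious, the model's seeding above is nothing fancier than sorting the projected win totals within the conference. Here's a small sketch using the East win totals from the output earlier in this post. The `tier` helper and the 1-6 / 7-10 / 11-15 cutoffs are my own shorthand for the three table columns.

```python
# Projected East win totals copied from the model output earlier in this post.
east_wins = {
    'Atlanta Hawks': 40.21900, 'Boston Celtics': 52.98400, 'Brooklyn Nets': 37.83554,
    'Charlotte Hornets': 34.54740, 'Chicago Bulls': 40.36300,
    'Cleveland Cavaliers': 50.23470, 'Detroit Pistons': 33.35420,
    'Indiana Pacers': 45.93936, 'Miami Heat': 41.26986, 'Milwaukee Bucks': 48.55712,
    'New York Knicks': 47.90818, 'Orlando Magic': 41.25042,
    'Philadelphia 76ers': 43.73122, 'Toronto Raptors': 39.13594,
    'Washington Wizards': 36.85122,
}

def tier(seed):
    """Map a conference seed to the column it lands in (hypothetical helper)."""
    if seed <= 6:
        return 'Play-Offs'
    if seed <= 10:
        return 'Play-Ins'
    return 'Maybe Next Year'

# Highest projected win total gets the 1-seed, and so on down the list.
ranked = sorted(east_wins, key=east_wins.get, reverse=True)
seeding = [(seed, team, tier(seed)) for seed, team in enumerate(ranked, start=1)]

for seed, team, label in seeding:
    print(f'{seed}. {team} ({east_wins[team]:.1f} wins): {label}')
```

The same approach works for the West by swapping in the Western Conference win totals.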

And now, let’s see my personal Eastern Conference predictions:

Play-Offs	Play-Ins	Maybe Next Year
1. New York Knicks	7. Orlando Magic	11. Toronto Raptors
2. Cleveland Cavaliers	8. Milwaukee Bucks	12. Philadelphia 76ers
3. Boston Celtics	9. Atlanta Hawks	13. Brooklyn Nets
4. Detroit Pistons	10. Chicago Bulls	14. Charlotte Hornets
5. Miami Heat		15. Washington Wizards
6. Indiana Pacers

Here are some interesting observations about both the model’s predictions and my own personal predictions:

The Eastern Conference teams that made last season’s play-in (Heat, Hawks, Bulls, Magic) are the same ones projected to make another go at play-ins this year. In other words, could we see the same teams stuck in another year of play-ins?
Personally, I think the Hawks, Bulls and Magic will make another trip to the play-in. On the other hand, I think the Heat will eke out a 5 (maybe 6) seed in the East because of some great new acquisitions like small forward Simone Fontecchio and shooting guard Norman Powell.
I honestly don’t know why the model hates the Detroit Pistons, as it placed them at the bottom of the East once more. I ranked them as a possible 4-seed because after their improvement last year (44-38 from a dismal 14-68 in 2023-24), I feel they could be quite the playoff contender-and it was certainly nice to see 2021 1st Overall Pick Cade Cunningham finally develop into a star-quality player. The acquisition of the former Heat small forward Duncan Robinson should be exciting to see.
This might sound like a hot take, but I don’t think the Sixers will even qualify for the play-in, let alone the playoffs, given the plethora of issues they had last season. Not least of these, Paul George and Joel Embiid, two of the biggest Sixers names, weren’t at the top of their game last season when they were healthy (and both of them missed significant time due to injuries).
Unlike my model, I think the Knicks could really take the top spot in the East this season. Despite falling just short of the 2025 NBA Finals, the Knicks showed they can certainly make a deep playoff run with talent such as Jalen Brunson (winner of the Clutch Player of the Year award), OG Anunoby and their acquisition of Karl-Anthony Towns from the Timberwolves during the 2024 offseason.
With two of the biggest names in the East-Jayson Tatum and Tyrese Haliburton-out for most if not all of this season due to Achilles injuries they got during last season’s playoffs, I think the East is wide open. Granted, I still think the Pacers and Celtics have a good chance at making the playoffs this year, but I don’t think either of them is a shoo-in for the top spot in the East, which in my opinion leaves the East playoff race wide open for another team to take the top spot (which as I said earlier, I think it could be the Knicks’ year to do just that). Also, I still think the Celtics could realistically clinch the 3-seed in the East despite the offseason departures of Jrue Holiday, Kristaps Porzingis, Al Horford and Luke Kornet, who were all key players in the Celtics 2024 Championship run.

Western Conference predictions:

First, let’s start with how the model thinks the Western Conference standings will play out this season:

Play-Offs	Play-Ins	Maybe Next Year
1. Oklahoma City Thunder	7. Golden State Warriors	11. Houston Rockets
2. LA Clippers	8. Phoenix Suns	12. Utah Jazz
3. Denver Nuggets	9. Sacramento Kings	13. New Orleans Pelicans
4. Memphis Grizzlies	10. Dallas Mavericks	14. Portland Trail Blazers
5. Minnesota Timberwolves		15. San Antonio Spurs
6. LA Lakers

Just as with the model’s Eastern Conference predictions, I certainly have disagreements with the Western Conference predictions. Here’s how I think the Western Conference standings will play out this season:

Play-Offs	Play-Ins	Maybe Next Year
1. Oklahoma City Thunder	7. Golden State Warriors	11. Dallas Mavericks
2. Houston Rockets	8. LA Clippers	12. Memphis Grizzlies
3. Minnesota Timberwolves	9. Sacramento Kings	13. Utah Jazz
4. Denver Nuggets	10. San Antonio Spurs	14. Portland Trail Blazers
5. Houston Rockets		15. New Orleans Pelicans
6. LA Lakers

As I did with my Eastern Conference predictions, here are some interesting observations between the model’s projected conference standings and my personal projected conference standings:

I’m sure the question on every NBA fan’s mind, mine included, is “Can the Oklahoma City Thunder pull off another championship?” My guess: of all the champions we’ve seen in the 2020s, I think they’ve got the best shot at a repeat title. Why might that be? One big reason is that the Thunder kept their core Big 3 (SGA, Chet Holmgren, and Jalen Williams) along with several other key players from the championship run, such as Isaiah Hartenstein and Lu Dort. Personally, I think NBA teams would be wise not to go into full rebuild mode after winning their first championship, and the Thunder seem to have avoided doing so (they only traded second-year small forward Dillon Jones, who played limited minutes in OKC’s championship run). Even if the Thunder don’t end up repeating as champions, I think, at the very least, the 1-seed in the West could be theirs for the taking once more.
Another interesting Western Conference storyline to watch would be whether Cooper Flagg (the 2025 #1 overall pick) becomes the next Luka Doncic for the Mavericks. After Doncic got traded for Anthony Davis during last year’s midseason trades, it’s safe to say the Mavericks’ season went south. A controversial trade and injuries to many key players-Anthony Davis (after the trade) and Kyrie Irving being the two most notable examples-didn’t help matters. Then again, having such an injury-struck roster to the point where the Mavericks nearly (but thankfully didn’t) have to forfeit games only added to their problems last season after the infamous Doncic-Davis trade. The drafting of 6’9″, 18-year-old forward Cooper Flagg could bring a spark to the struggling Mavericks (and from watching some of his highlights, I think Flagg has potential), but I think Flagg will need at least a year to gel with the Mavericks before they once again become Western Conference contenders.
Just as I was surprised that my model placed the Detroit Pistons at the bottom of the Eastern Conference given their improvements last season, I can say I’m just as surprised that the San Antonio Spurs were placed at the bottom of the Western Conference. Granted, they haven’t made the playoffs since 2019 and just went through a coaching change (Popovich stepped down and Mitch Johnson was named as head coach after serving as interim last season), but they did also improve their record from 22-60 in ’23-’24 to 34-48 last season. The Spurs also have their own solid Big 3 in De’Aaron Fox, Stephon Castle, and of course 2023 #1 overall pick Victor Wembanyama. Even though Wemby’s season was cut short last year due to deep vein thrombosis (a type of blood clot), his improved shooting and double-doubles could certainly help the Spurs once he’s fully recovered.
How might the Golden State Warriors do with their 35-and-over Big 3 (Jimmy Butler is 36, Draymond Green is 35, and Steph Curry is 37)? Given that they earned their playoff spot last season through play-ins, I’ve got a hunch that the Warriors might be seeing the play-ins once more-but will likely get a playoff spot in this manner. Yes, they had quite the herky-jerky trajectory last season, but the midseason acquisition of Jimmy Butler certainly gave them an extra spark down the regular season stretch-Butler’s basketball skills certainly paired well with guys like Steph and Draymond. Upsetting the 2-seeded Houston Rockets in the Western Conference quarterfinals last season certainly helps the Warriors’ momentum heading into this season, but I do wonder how the loss of their championship-winning forward Kevon Looney would affect the Warriors dynamic.
I know I said that I think the Thunder have a great chance to repeat as champions, but I also wonder if the Timberwolves would be a team to look out for in the 2026 postseason. After all, despite losing franchise mainstay Karl-Anthony Towns to the Knicks in the 2024 offseason, the Timberwolves adapted quite well as stars like Anthony Edwards and Naz Reid rose to the challenge by helping the team get to the Western Conference finals for the second year in a row (even though they got knocked out there for the second year in a row too). All in all, of every NBA trade ever made, I think the Karl-Anthony Towns trade, along with the players the Timberwolves got in exchange (Julius Randle and Donte DiVincenzo), was one of the most even for both teams involved, as both the Knicks and Timberwolves made it to their respective conference finals.
Just as with my play-in predictions for the Eastern Conference, at least three of the four projected play-in teams (according to the model) for the Western Conference made the play-ins last season: the Mavericks, Warriors, and Kings. I think the Warriors have the best shot at cracking the actual playoffs while the Mavericks could use another year for Cooper Flagg to develop (plus buy some time to get stars like Kyrie Irving back). It will be interesting to see how the Sacramento Kings fare because even though Domantas Sabonis, Zach LaVine and DeMar DeRozan fared well despite the disappointing finish, the talent around them could use some improvement. Perhaps the addition of Russell Westbrook (who’s in his 18th year in the NBA) could spice up the Kings’ offense, as he certainly showed he still had the athleticism and speed needed for basketball last season with the Denver Nuggets.

And now for something a little scandalous…

Boy oh boy this is certainly going to be the most interesting (or at least the most interestingly-timed) post I’ve written during this blog’s run. Why might that be?

Well, last Thursday (October 23, 2025) news broke that the FBI (US Federal Bureau of Investigation) had arrested 34 people in connection with a pair of scandals that certainly rocked pro basketball: one involving collusion with Italian Mafia families (specifically the Gambino, Bonanno and Genovese crime families) to run a series of rigged poker games, and another involving a scheme to rig sports betting.

Here’s the wildest part though-among the 34 arrested were the current head coach of the Portland Trail Blazers (Chauncey Billups), a current Miami Heat star (Terry Rozier), and a former Cavaliers player (Damon Jones). Billups and Rozier were placed on leave by their respective teams.

Want to know some other juicy, scandalous details? Here are a few takeaways from the indictments:

Chauncey Billups was allegedly used by these Mafia families to lure in victims to the rigged poker games in order to make the poker games appear legitimate.
How the poker games were rigged is possibly the wildest part, with everything that was alleged to have happened sounding like it could’ve come from a James Bond movie. Among the methods used to rig these poker games were X-Ray tables that allowed these Mafia families to see opponents’ hands and rigged shuffling machines that could be used to predict what opponents’ hands would look like.
As for Rozier, the game that led to him being investigated was a March 23, 2023 game, back when Rozier was still with the Charlotte Hornets. Rozier left that game early, citing a “foot injury”. According to the indictment, he had told a longtime friend in advance that he planned to leave early, so that the friend could net over $200,000 betting the “under” on Rozier’s statistics (in other words, betting that Rozier would underperform in the game).
As for Damon Jones, he allegedly sold insider information to his co-conspirators during the 2022-23 season while working for the Lakers. The information consisted of tips on lineup decisions and injury reports for star Lakers players, which the co-conspirators allegedly used to place significant wagers. It was later revealed that one of the players whose injury report was leaked was LeBron James, who hasn’t been implicated in any wrongdoing.

All in all, it will be interesting to see how this scandal plays out, especially to see if anyone else gets busted as part of this massive gambling ring. Here’s an October 23, 2025 release from the US DOJ (Department of Justice) describing the basics of the gambling ring (keep in mind that anyone involved is presumed innocent until proven guilty): https://www.justice.gov/usao-edny/pr/current-and-former-national-basketball-association-players-and-four-other-individuals.

Here’s a snippet of a conference from FBI Director Kash Patel on October 23, 2025 regarding the charges: https://www.youtube.com/shorts/4F4_JMGVJXw.

All I will say is that it will be very very interesting to see not only how the rest of the NBA season plays out but also to see how commissioner Adam Silver will change league gambling policy-especially when it comes to players and coaching staff. Assuming other players and/or coaching staff get busted in the gambling ring (which could happen) the trials will be interesting-mostly because we’ll get to see who will snitch on who to get a sweet plea deal. Maybe there will be some RICO charges in the mix-which given what occurred, isn’t a stretch to think.

Anyway, thanks for reading as always, and enjoy the juicy action of the 2025-26 NBA season! The season is still young, so it’s anyone’s game!

Michael

Another Crack At Linear Regression NBA Machine Learning Predictions (2025-26 edition)


Hi everybody,

Michael here, and in today’s post, I thought I’d try something a little familiar. You may recall that last October, I released a pair of posts (Python, Linear Regression & An NBA Season Opening Day Special Post and Python, Linear Regression & the 2024-25 NBA season) attempting to predict each NBA team’s win total and conference seeding based off of their performance from the previous 10 seasons.

All in all, after seeing how the season played out-I managed to get only 3/30 teams in the correct seeding. So what would I do here?

I’ll give my NBA machine learning predictions another go, also using data from the previous 10 seasons (2015-16 to 2024-25). You may be wondering why I’m trying to predict the outcomes of the upcoming NBA season once more given how off last year’s predictions were. I’m giving the whole “Michael’s NBA crystal ball” thing another go for two reasons: I’m interested in how my predictions change from one season to the next, and I plan to use a slightly different model than I did last year (it’ll still be good old linear regression, however) so I can analyze how different factors might play a role in a team’s record and ultimately their conference seeding.

So, without further ado, let’s jump right in to Michael’s Linear Regression NBA Season Predictions II!

Reading the data

Before we dive in to our juicy predictions, the first thing we need to do is read in the data to the IDE. Here’s the file:

NBA analysis 2025-26 Download

Now let’s import the necessary packages and read in the data!

import pandas as pd
from sklearn.model_selection import train_test_split
# (random_state is a keyword argument of train_test_split, so it doesn't need its own import)
from sklearn.linear_model import LinearRegression

from google.colab import files
uploaded = files.upload()

import io

NBA = pd.read_excel(io.BytesIO(uploaded['NBA analysis 2025-26.xlsx']))

NBA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Season  300 non-null    object 
 1   Team    300 non-null    object 
 2   W       300 non-null    int64  
 3   L       300 non-null    int64  
 4   Finish  300 non-null    int64  
 5   Age     300 non-null    float64
 6   Ht.     300 non-null    object 
 7   Wt.     300 non-null    int64  
 8   G       300 non-null    int64  
 9   MP      300 non-null    int64  
 10  FG      300 non-null    int64  
 11  FGA     300 non-null    int64  
 12  FG%     300 non-null    float64
 13  3P      300 non-null    int64  
 14  3PA     300 non-null    int64  
 15  3P%     300 non-null    float64
 16  2P      300 non-null    int64  
 17  2PA     300 non-null    int64  
 18  2P%     300 non-null    float64
 19  FT      300 non-null    int64  
 20  FTA     300 non-null    int64  
 21  FT%     300 non-null    float64
 22  ORB     300 non-null    int64  
 23  DRB     300 non-null    int64  
 24  TRB     300 non-null    int64  
 25  AST     300 non-null    int64  
 26  STL     300 non-null    int64  
 27  BLK     300 non-null    int64  
 28  TOV     300 non-null    int64  
 29  PF      300 non-null    int64  
 30  PTS     300 non-null    int64  
dtypes: float64(5), int64(23), object(3)
memory usage: 72.8+ KB

As you can see, we’ve still got all 31 features that we had in last year’s dataset. The only difference between this dataset and last year’s is the timeframe covered (this dataset starts with the 2015-16 season and ends with the 2024-25 season).

Just like last year, this year’s edition of the predictions comes from http://basketball-reference.com, where you can search up plenty of juicy statistics from both the NBA and WNBA. Also, just like last year, the only thing I changed in the data from Basketball Reference is the Finish variable, which represents a team’s conference finish (seeding-wise) as opposed to divisional finish (since divisional finishes are largely irrelevant for a team’s playoff standings).
If you want a better explanation of these terms, please feel free to refer to last year’s edition of my predictions post-Python, Linear Regression & An NBA Season Opening Day Special Post.

Now that we’ve read our file into the IDE, let’s create our model!

Creating the model

You may recall that last year, before we created the model, we used the Select-K-Best algorithm to help us pick the optimal model features. For a refresher, here’s what Select-K-Best chose for us:

['L', 'Finish', 'Age', 'FG%', '3P%']

After seeking the five best features for our model from the Select-K-Best algorithm, this is what we got. However, we’re not going to use the Select-K-Best suggestions this year as there are other factors I’d like to analyze when it comes to making upcoming season NBA predictions.

Granted, I’ll keep the Finish, FG%, and 3P% as I feel they provide some value to the model’s predictions, but I’ll also add a few more features of my own choosing:

X = NBA[['FG%', '3P%', '2P%', 'Finish', 'TRB', 'AST', 'STL', 'BLK', 'TOV']]
y = NBA['W']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Along with the features I chose from last year’s model, I’ll also add the following other scoring categories:

2P%-The percentage of a team’s successful 2-pointers in a given season
TRB-A team’s total rebounds in a season
AST-A team’s total assists in a season
STL-A team’s total steals in a season
BLK-A team’s total blocks in a season
TOV-A team’s total turnovers in a season

The y-variable will still be W, as we’re still trying to predict an NBA team’s win total for the upcoming season based off of all our x-variables.

Now, let’s create a linear regression model object and run our predictions through that model object:

NBAMODEL = LinearRegression()
NBAMODEL.fit(X_train, y_train)

yPredictions = NBAMODEL.predict(X_test)

yPredictions

array([43.2515066 , 36.4291265 , 55.14626364, 46.01164579, 24.18679591,
       35.59131124, 35.59836527, 49.98114132, 48.57869061, 50.65733101,
       21.296126  , 49.94020238, 31.98306604, 41.89217714, 45.65373458,
       50.57831266, 32.76923727, 45.6898562 , 20.4393901 , 55.28944034,
       52.79027154, 21.81113366, 50.79142468, 50.95798684, 53.23802534,
       50.00199063, 48.4639119 , 49.1671417 , 51.12760913, 31.20606334,
       45.3090483 , 25.02488097, 43.67955061, 48.47484838, 33.74041157,
       41.7463038 , 36.10796911, 40.5399278 , 35.30656175, 16.92677689,
       49.77947698, 39.2160337 , 22.08871355, 31.83549487, 15.2675987 ,
       18.24486804, 21.71657476, 42.21505537, 22.84745758, 25.56862333,
       43.6212702 , 20.28339646, 44.60289296, 49.20316062, 53.69182149,
       29.48304908, 44.60789347, 42.44466633, 55.93637972, 54.89728291])

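Just as with last year’s model, the predictions are run on the test dataset. Since test_size = 0.2, that’s 60 of the dataset’s 300 total records. Keep in mind that train_test_split shuffles the rows before splitting, so these are 60 randomly selected records (made repeatable by random_state = 0), not the last 60.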
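A quick way to check which rows land in the test set is to look at the index of the test split. Here's a minimal sketch with a stand-in 300-row frame (the real NBA file isn't needed for this, and the `row` column is just a placeholder I made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the 300-row NBA dataframe; only the row positions matter here.
demo = pd.DataFrame({'row': range(300)})

# Same split settings as the model above: 20% test set, fixed random seed.
train, test = train_test_split(demo, test_size=0.2, random_state=0)

test_rows = sorted(test.index)
print(len(test_rows))    # how many rows were held out for testing
print(test_rows[:5])     # lowest-numbered rows that ended up in the test set

# If the split simply took the tail of the data, this would be True.
is_last_60 = test_rows == list(range(240, 300))
print(is_last_60)
```

If you ever do want an unshuffled split (say, to hold out the most recent seasons), train_test_split accepts shuffle=False.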

And now for the equation…

Now that we’ve generated predictions for our test dataset, let’s find out all of the coefficients and the intercept for the equation I will use to make this year’s NBA predictions:

NBAMODEL.coef_

array([ 6.48260593e+01,  1.13945178e+02,  1.54195451e+01, -1.94822281e+00,
        1.10428617e-02, -3.46015457e-03,  2.15326621e-02,  6.63810730e-03,
       -9.70593407e-03])

NBAMODEL.intercept_

np.float64(-60.37720744829896)

Now that we know what our coefficients are, let’s see what this year’s equation looks like:

Although it’s much more of a mouthful than last year’s equation, it follows the same logic in that it uses the features of this year’s model in the order that I listed them:

['FG%', '3P%', '2P%', 'Finish', 'TRB', 'AST', 'STL', 'BLK', 'TOV']

A is FG%, B is 3P%, and so on until you get to I (which represents TOV).

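Since all the coefficients are listed in scientific notation, I converted them to regular decimals and shortened each one to roughly three significant figures for this equation (rounding to two decimal places would have wiped out the smaller coefficients). The intercept is rounded to two decimal places.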
In case you’re wondering, no you can’t add all the coefficients together for this equation as each coefficient plays a part in the overall equation. Just like last year, we’re going to do the weighted-averages thing to generate projected win totals. Keep your eyes peeled for the next post, which covers the juicy predictions.
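To see the equation written out from the fitted values, here's a small sketch that pairs each coefficient printed above with its feature name and rounds it to three significant figures. The `sig` helper is my own illustration, not an sklearn function. Also note that it rounds rather than truncates, so a few terms can come out a hair different from a hand-shortened version of the equation.

```python
from math import floor, log10

features = ['FG%', '3P%', '2P%', 'Finish', 'TRB', 'AST', 'STL', 'BLK', 'TOV']

# Values copied from the NBAMODEL.coef_ and NBAMODEL.intercept_ output above.
coefs = [6.48260593e+01, 1.13945178e+02, 1.54195451e+01, -1.94822281e+00,
         1.10428617e-02, -3.46015457e-03, 2.15326621e-02, 6.63810730e-03,
         -9.70593407e-03]
intercept = -60.37720744829896

def sig(x, digits=3):
    """Round x to the given number of significant figures (hypothetical helper)."""
    if x == 0:
        return 0.0
    return round(x, digits - 1 - floor(log10(abs(x))))

# Build one signed term per feature, e.g. "+64.8 * (FG%)".
terms = [f'{sig(c):+g} * ({name})' for c, name in zip(coefs, features)]
equation = 'W = ' + ' '.join(terms) + f' {sig(intercept, 4):+g}'
print(equation)
```

Building the string straight from the model output avoids copy-and-paste slips when moving the decimal point by hand.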

…and the accuracy test!

So now that we’ve got our 2025-26 NBA predictions model, let’s see how accurate it is:

from sklearn.metrics import mean_absolute_percentage_error

mean_absolute_percentage_error(y_test,yPredictions)

0.09573425883736708

Using the MAPE (mean absolute percentage error) from sklearn like we did in last year’s analysis, we see that the model’s margin of error is roughly 9.57%. I’ll round that up to 10%, which means that despite not choosing the model’s features from a prebuilt algorithm, the overall accuracy of the model is still about 90%.

Now, whether the model’s accuracy and my predictions hold up is something I’ll certainly revisit in 8 months’ time for another end-of-season reflection. After all, last season I only got 3 of the 30 teams in the correct seeding, though I did do better with predicting which teams didn’t make the playoffs.

Recall that to find the accuracy of the model using the MAPE, you subtract (MAPE * 100) from 100. Since MAPE * 100 is about 9.57, which rounds up to 10 as the nearest whole number, 100 - 10 gives us an accuracy of roughly 90%.
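For anyone curious what sklearn's mean_absolute_percentage_error is doing under the hood, here's a sketch of the same calculation by hand. The win totals below are made up for illustration (they aren't from the test set), and `mape` is just my own helper name.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.10 means 10%)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

actual_wins = [50, 40, 25, 60]       # hypothetical actual win totals
predicted_wins = [45, 44, 30, 57]    # hypothetical model predictions

error = mape(actual_wins, predicted_wins)
accuracy_pct = 100 - error * 100     # the "accuracy" shortcut used in this post
print(error, accuracy_pct)

# Same shortcut applied to the error reported for this year's model:
model_accuracy = round(100 - 0.09573425883736708 * 100, 2)
print(model_accuracy)
```

Keep in mind that 100 minus the percentage error is a handy shorthand rather than a formal accuracy metric, since it measures how far off the win totals are on average, not how often the model is exactly right.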

Last but not least, it’s prediction visualization time!

Before we go, the last thing I want to cover is how to visualize this year’s model’s predictions. Just like last year, we’re going to use the PYPLOT module from MATPLOTLIB:

import matplotlib.pyplot as plt

plt.scatter(y_test, yPredictions, color="red")
plt.xlabel('Actual values', size=15)
plt.ylabel('Predicted values', size=15)
plt.title('Actual vs Predicted values', size=15)
plt.show()

As you can see, the plot forms a sort of diagonal-line shape, which reinforces the model’s 90% prediction accuracy rate.

Also, just for comparison’s sake, here’s what my predictions looked like on last year’s model (the one where I used Select-K-Best to choose the model features):

This also looks like a diagonal-line shape, and last year’s model had a 91% accuracy rate.

Here’s the link to the Colab notebook in my GitHub: https://github.com/mfletcher2021/blogcode/blob/main/NBA_25_26_predictions.ipynb

Thanks for reading, and keep an eye out for my 2025-26 season predictions,

Michael

Python, Linear Regression & An NBA Season Opening Day Special Post


Hello readers,

Michael here, and in today’s lesson, we’re gonna try something special! For one, we’re going back to this blog’s statistical roots with a linear regression post; I covered linear regression with R in the way, way back of 2018 (R Lesson 6: Linear Regression) on this blog, so I thought I’d show you how to work the linear regression process in Python. Two, I’m going to try something I don’t normally do, which is predict the future. In this case, the future being the results of the just-beginning 2024-25 NBA season. Why try to predict NBA results you might ask? Well, for one, I wanted to try something new on this blog (hey, gotta keep things fresh six years in), and for two, I enjoy following along with the NBA season. Plus, I enjoyed writing my post on the 2020 NBA playoffs-R Analysis 10: Linear Regression, K-Means Clustering, & the 2020 NBA Playoffs.

Let’s load our data and import our packages!

Before we get started on the analysis, let’s first load our data into our IDE and import all necessary packages:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

You’re likely quite familiar with pandas but for those of you that don’t know, sklearn is an open-source Python library commonly used for machine learning projects (like the linear regression we’re about to do)!

A note about uploading files via Google Colab

Once we import our necessary packages, the next thing we should do is upload the data-frame we’ll be using for this analysis.

This is the file we’ll be using; it contains team statistics such as turnovers (team total) and wins for all 30 NBA teams for the last 10 seasons (2014-15 to 2023-24). The data was retrieved from basketball-reference.com, which is a great place to go if you’re looking for juicy basketball data to analyze. This site comes from https://www.sports-reference.com/, which contains statistics on various sports from NBA to NFL to the other football (soccer for Americans), among other sports.

NBA analysis Download

Now, since I used Google Colab for this analysis, I’ll show you how to upload Excel files into Colab (a different process from uploading Excel files into other IDEs):

To import local files into Google Colab, you’ll need to include the lines from google.colab import files and uploaded = files.upload() in the notebook since, for some odd reason, Google Colab won’t let you upload local files directly into your notebook. Once you run these two lines of code, you’ll need to select a file from the browser tool that you want to upload to Colab.

Next (and ideally in a separate cell), you’ll need to add the lines import io and dataframe = pd.read_csv(io.BytesIO(uploaded['dataframe name'])) to the notebook and run the code, where 'dataframe name' is the name of the file you just uploaded (files.upload() returns a dictionary keyed by file name). This will officially load your data-frame into your Colab notebook.

Yes, I know it’s annoying, but that’s just how Colab works. If you’re not using Colab to follow along with me, feel free to skip this section as a simple pd.read_csv() will do the trick to upload your data-frame onto the IDE.
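For anyone who wants the two Colab cells in one place, here is a minimal sketch of the upload steps described above. The helper name load_uploaded_csv and the file name "NBA analysis.csv" are placeholders I made up for illustration, so swap in the name of your own file. The google.colab import is wrapped in a try/except so the snippet still runs outside of Colab.

```python
import io

import pandas as pd


def load_uploaded_csv(uploaded, file_name):
    """Read one entry of the dict returned by files.upload() into a DataFrame."""
    # files.upload() returns {file name: raw bytes}, so the key is the file name
    return pd.read_csv(io.BytesIO(uploaded[file_name]))


try:
    # This module only exists inside a Google Colab notebook
    from google.colab import files
except ImportError:
    files = None

if files is not None:
    uploaded = files.upload()  # opens the browser file picker
    # "NBA analysis.csv" is a placeholder file name, not necessarily the real one
    NBA = load_uploaded_csv(uploaded, "NBA analysis.csv")
```

Outside of Colab, the if block is skipped and a plain pd.read_csv() call on the local file path does the same job.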

Let’s learn about our data-frame!

Now that we’ve uploaded our data-frame into the IDE, let’s learn more about it!

NBA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300 entries, 0 to 299
Data columns (total 31 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Season  300 non-null    object 
 1   Team    300 non-null    object 
 2   W       300 non-null    int64  
 3   L       300 non-null    int64  
 4   Finish  300 non-null    int64  
 5   Age     300 non-null    float64
 6   Ht.     300 non-null    object 
 7   Wt.     300 non-null    int64  
 8   G       300 non-null    int64  
 9   MP      300 non-null    int64  
 10  FG      300 non-null    int64  
 11  FGA     300 non-null    int64  
 12  FG%     300 non-null    float64
 13  3P      300 non-null    int64  
 14  3PA     300 non-null    int64  
 15  3P%     300 non-null    float64
 16  2P      300 non-null    int64  
 17  2PA     300 non-null    int64  
 18  2P%     300 non-null    float64
 19  FT      300 non-null    int64  
 20  FTA     300 non-null    int64  
 21  FT%     300 non-null    float64
 22  ORB     300 non-null    int64  
 23  DRB     300 non-null    int64  
 24  TRB     300 non-null    int64  
 25  AST     300 non-null    int64  
 26  STL     300 non-null    int64  
 27  BLK     300 non-null    int64  
 28  TOV     300 non-null    int64  
 29  PF      300 non-null    int64  
 30  PTS     300 non-null    int64  
dtypes: float64(5), int64(23), object(3)
memory usage: 72.8+ KB

Running the NBA.info() command will allow us to see basic information about all 31 columns in our data-frame (such as column names, number of records in the dataset, and data type).

In case you’re wondering about all the abbreviations, here’s an explanation for each abbreviation:

Season-The specific season represented by the data (e.g. 2014-15)
Team-The team name
W-A team’s wins in a given season
L-A team’s losses in a given season
Finish-The seed a team finished in during a given season in their conference (e.g. Detroit Pistons finishing 15th seed in the East last season)
Age-The average age of a team’s roster as of February 1 of a given season (e.g. February 1, 2024 for the 2023-24 season)
Ht.-The average height of the team’s roster in a given season (e.g. 6’6)
Wt.-The average weight (in lbs.) of the team’s roster in a given season
G-Total amount of games played by the team in a given season
MP-Total minutes played as a team in a given season
FG-Field goals scored by the team in a given season
FGA-Field goal attempts made by the team in a given season
FG%-Percent of successful field goals made by team in a given season
3P-3-point field goals scored by the team in a given season
3PA-3-point field goal attempts made by the team in a given season
3P%-Percent of successful 3-point field goals made by the team in a given season
2P-2-point field goals scored by the team in a given season
2PA-2-point field goal attempts made by the team in a given season
2P%-Percent of successful 2-point field goals made by the team in a given season
FT-Free throws scored by the team in a given season
FTA-Free throw attempts made by the team in a given season
FT%-Percent of successful free throw attempts made by the team in a given season
ORB-Team’s total offensive rebounds in a given season
DRB-Team’s total defensive rebounds in a given season
TRB-Team’s total rebounds (both offensive and defensive) in a given season
AST-Team’s total assists in a given season
STL-Team’s total steals in a given season
BLK-Team’s total blocks in a given season
TOV-Team’s total turnovers in a given season
PF-Team’s total personal fouls in a given season
PTS-Team’s total points scored in a given season

Wow, that’s a lot of variables! Now that we understand the data we’re working with better, let’s see how we can make a simple linear regression model!

If you’re not familiar with basketball jargon, the NBA has a great glossary of basic terms on their website: https://www.nba.com/stats/help/glossary

The K-Best Way To Set Up Your Model

Before we start the juicy analysis, let’s first pick the features we will use for the model. In this post, we’ll explore the Select K-Best algorithm, which is an algorithm commonly used in linear regression to help select the best features for a particular model:

X = NBA.drop(['Season', 'Team', 'W', 'Ht.'], axis=1)
y = NBA['W']

from sklearn.feature_selection import SelectKBest, f_regression
features = SelectKBest(score_func=f_regression, k=5)
features.fit(X, y)

selectedFeatures = X.columns[features.get_support()]
print(selectedFeatures)

Index(['L', 'Finish', 'Age', 'FG%', '3P%'], dtype='object')

According to the Select K-Best algorithm, the five best features to use in the linear regression are L, Finish, Age, FG% and 3P%. In other words, a team’s end-of-season seeding, total losses, average roster age, and percentage of successful field goals and 3-pointers are the five most important features to predict a team’s win total.

How did the model arrive at these conclusions? First of all, I set the X and y variables-this is important as the Select K-Best algorithm needs to know which variable is the dependent variable and which variables are candidate independent variables for the model. In this example, the dependent (or y) variable is W (for team wins), while the X variable includes all other dataset columns except for W, Team, Season, and Ht. W is excluded because it is the y variable; the other three are excluded because they are categorical (or non-numerical) variables, so they really won’t work in our analysis.

Next we import the SelectKBest class and the f_regression function from the sklearn.feature_selection module. Why do we need these two? Well, SelectKBest will allow us to use the Select K-Best algorithm, while f_regression is the scoring function that the algorithm uses to rank the features and select the best k of them for the model (I used five features for this model).

After setting up the Select K-Best algorithm, we then fit both the X and y variables to the algorithm and then print out our top five selectedFeatures.
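To see what Select K-Best is doing under the hood, here is a small sketch on made-up data (not the NBA dataset). The column names signal_a, signal_b, noise_a, and noise_b are invented for this example: the target is built from the two signal columns only, so those two should earn the highest f_regression scores.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Made-up data: two columns that drive the target and two that are pure noise
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame({
    "signal_a": rng.normal(size=n_rows),
    "signal_b": rng.normal(size=n_rows),
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
target = 3 * demo["signal_a"] - 2 * demo["signal_b"] + rng.normal(scale=0.1, size=n_rows)

# Keep the k=2 columns with the highest F-scores against the target
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(demo, target)

# scores_ holds one F-score per column; get_support() marks the columns kept
scores = pd.Series(selector.scores_, index=demo.columns).sort_values(ascending=False)
selected = list(demo.columns[selector.get_support()])
print(scores)
print(selected)
```

Printing the scores alongside the selected columns is a handy habit, since it shows how big the gap is between the features that made the cut and the ones that didn't.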

Train, test…split!

Once we have our top five features for the model, it’s time for the train, test, splitting of the model! What is train, test, split you ask? Well, the data for our linear regression model will be split into two types of data-training data (the data we use for training the model) and testing data (the data we use to test our model). Here’s how we can utilize the train, test, split for this model:

X = NBA[['L', 'Finish', 'Age', 'FG%', '3P%']]
y = NBA['W']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

How does the train, test, split work? Using sklearn’s train_test_split function, we pass in four parameters-our independent variables (X), our dependent variable (y), the size of the test data (a decimal between 0 and 1), and the random state (this can be kept at 0, but it doesn’t matter what number you use-42 is another common number-as long as you keep it the same so the split is reproducible). In this model, I will utilize an 80/20 train, test, split, which indicates that 80% of the data will be for training while the other 20% will be used for testing.

Other common train, test, splits are 70/30, 85/15, and 67/33, but I opted for 80/20 because our dataset is only 300 rows long. I would utilize these other train, test, splits for larger datasets.
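Here is a quick sketch of what an 80/20 split does to a 300-row dataset. The arrays X_demo and y_demo are just made-up placeholders with the same number of rows as our data-frame. The second call shows that reusing the same random_state gives back the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 300 rows and 2 made-up feature columns
X_demo = np.arange(600).reshape(300, 2)
y_demo = np.arange(300)

# 80/20 split, same settings as the NBA model
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Repeating the call with the same random_state reproduces the same test rows
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

split_sizes = (len(X_tr), len(X_te))
same_split = bool(np.array_equal(X_te, X_te2))
print(split_sizes)
print(same_split)
```

With 300 rows, the split comes out to 240 training rows and 60 testing rows, which is why there are 60 predictions later on.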

Something worth noting: What we’re doing here is called multiple linear regression since we’re using five X variables to predict a Y variable. Simple linear regression would only use one X variable to predict a Y variable. Just thought I’d throw in this quick factoid!

And now, for the model-making

Now that we’ve done all the steps to set up our model, the next thing we’ll need to do is actually create the model!

Here’s how we can get started:

NBAMODEL = LinearRegression()
NBAMODEL.fit(X_train, y_train)

LinearRegression()

In this example, we create a LinearRegression() object (NBAMODEL) and fit it to both the X_train and y_train data.

Predictions, predictions

Once we’ve created our model, next comes the fun part-generating the predictions!

yPredictions = NBAMODEL.predict(X_test)

yPredictions

array([53.20097648, 28.89541793, 52.26551381, 53.22220829, 35.90676716,
       32.15874993, 47.72090936, 48.32896277, 39.4193884 , 40.1548429 ,
       19.62678175, 48.3263792 , 32.13473281, 43.50887634, 43.85260484,
       52.79795145, 27.35822648, 40.23392095, 18.85423981, 61.69624816,
       51.59650403, 23.86311747, 56.18087097, 54.15867678, 49.75211403,
       46.90177259, 31.80109001, 46.82531833, 37.50563942, 32.19863141,
       52.41205133, 25.09011881, 48.94542256, 38.80244997, 24.80146638,
       42.50107728, 43.27320835, 37.45199938, 46.7795962 , 28.11289951,
       57.64388881, 29.35812466, 18.3222965 , 36.26677012, 20.56912227,
       22.15266241, 19.9955299 , 44.84930613, 45.14740453, 23.19471644,
       53.940611  , 26.0780373 , 27.88093669, 61.23347337, 52.99948229,
       34.66653881, 30.04421016, 27.21669768, 48.55215233, 47.11060905])

The yPredictions are obtained by using the predict method on the model’s X_test data, which in this case consists of 60 of the 300 records.
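As a sanity check on the fit-then-predict workflow, here is a sketch using made-up, noise-free data instead of the NBA data-frame. The names X_demo, y_demo, and demo_model are invented for this example. Since the target is an exact linear function of the two features, the model should reproduce the held-out values almost perfectly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: the target is an exact linear function of two features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 5.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit on the training rows only, then predict on the held-out test rows
demo_model = LinearRegression()
demo_model.fit(X_tr, y_tr)
demo_predictions = demo_model.predict(X_te)

print(len(demo_predictions))                 # one prediction per test row
print(np.allclose(demo_predictions, y_te))   # noise-free data, so these should match
```

Real data like ours is noisy, so the predictions won't line up this neatly, but the steps are the same: fit on the training rows, predict on the testing rows.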

Evaluating the model’s accuracy

Once we’ve created the model and made our predictions on the testing data, it’s time to evaluate the model’s accuracy. Here’s how to do so:

from sklearn.metrics import mean_absolute_percentage_error

mean_absolute_percentage_error(y_test,yPredictions)

0.09147159762376074

There are several ways you can evaluate the accuracy of a linear regression model. One good method, as shown here, is the mean_absolute_percentage_error function (imported from the sklearn.metrics module). The mean absolute percentage error measures how far off the model’s predictions are, on average, as a fraction of the actual values. In this model, the mean absolute percentage error is 0.09147159762376074, indicating that the model’s predictions are off by roughly 9% on average-or, put loosely, that the model’s predictions are roughly 91% accurate. Not too shabby for this model!
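If you’re curious what the mean absolute percentage error is doing behind the scenes, here is a hand-rolled sketch. The function name mape and the three win totals are made up for illustration. For actual values that aren’t zero, this should agree with what sklearn’s function returns.

```python
import numpy as np


def mape(actual, predicted):
    """Average of |actual - predicted| / |actual| across all rows."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)))


# Made-up win totals for three teams (not taken from the NBA dataset)
actual_wins = [50, 40, 20]
predicted_wins = [55, 38, 22]

# Per-team errors: 5/50 = 0.10, 2/40 = 0.05, 2/20 = 0.10
error = mape(actual_wins, predicted_wins)
print(round(error, 4))
```

Notice that a 2-win miss counts twice as much for the 20-win team as for the 40-win team, since each miss is measured relative to the actual total.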

Interestingly, the two COVID impacted NBA seasons in the dataset (2019-20 and 2020-21) didn’t throw off the model’s accuracy much.

Don’t forget about the equation!

Evaluating the model’s accuracy isn’t the only thing you should do when analyzing the model. You should also grab the model’s coefficients and intercept-they will be important in the next post!

NBAMODEL.coef_

array([ -0.4663858 ,  -1.30716212,   0.39700734,  34.1325687 ,
       -22.12258585])

NBAMODEL.intercept_

50.945769772855854

All linear regression models will have coefficients and an intercept, which together form the linear regression equation. Since our model had five X variables, there are five coefficients, listed in the same order as the X columns (L, Finish, Age, FG%, 3P%).

Now, what would our equation look like?

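Here is the equation in all its messy glory, with the coefficients rounded:

Predicted wins =
MINUS 0.466 * (total losses)
MINUS 1.307 * (seeding at end of season)
PLUS 0.397 * (average roster age)
PLUS 34.13 * (field goal %)
MINUS 22.12 * (3-point %)
PLUS 50.95 (the intercept)

We’re going to be using this equation in the next post.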
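Here is a sketch of how the coefficients and intercept printed above can be turned into an equation and applied by hand. The helper name predict_wins and the example team’s stats are made up for illustration, and I’m assuming the percentages are stored as decimals (e.g. 0.47 rather than 47).

```python
# Values copied from NBAMODEL.coef_ and NBAMODEL.intercept_, in X column order
feature_names = ["L", "Finish", "Age", "FG%", "3P%"]
coefficients = [-0.4663858, -1.30716212, 0.39700734, 34.1325687, -22.12258585]
intercept = 50.945769772855854


def predict_wins(team_stats):
    """Apply the linear regression equation to one team's stats (a dict)."""
    return intercept + sum(
        coef * team_stats[name] for name, coef in zip(feature_names, coefficients)
    )


# Build a readable version of the equation with rounded coefficients
terms = " ".join(f"{coef:+.3f}*{name}" for name, coef in zip(feature_names, coefficients))
equation = f"W = {terms} {intercept:+.3f}"
print(equation)

# Hypothetical team (made-up numbers, percentages assumed to be decimals)
example_team = {"L": 41, "Finish": 8, "Age": 26.0, "FG%": 0.47, "3P%": 0.36}
predicted = predict_wins(example_team)
print(round(predicted, 1))
```

This gives the same result as calling NBAMODEL.predict() on a one-row data-frame with those values, just without needing the fitted model on hand.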

Linear regression plotting

For the visual learners among my readers, I thought it would be nice to include a simple scatterplot to visualize the accuracy of our linear regression model. Here’s how to create that plot:

import matplotlib.pyplot as plt
plt.scatter(y_test, yPredictions, color="red")
plt.xlabel('Actual values', size=15)
plt.ylabel('Predicted values', size=15)
plt.title('Actual vs Predicted values', size=15)
plt.show()

First, I imported the matplotlib.pyplot module. Then, I ran the plt.scatter() method to create a scatterplot. I used three parameters for this method: the y_test values, the yPredictions values, and the color="red" parameter (this just indicated that I wanted red scatterplot dots). I then used the plt.xlabel(), plt.ylabel(), and plt.title() methods to give the scatterplot an x-label title, y-label title, and title, respectively. Lastly, I used the plt.show() method to display the scatterplot in all of its red-dotted glory.

As you can see from this plot, the predicted values match the actual values fairly closely, hence the 91% accuracy/9% error.

Thanks for reading, enjoy the upcoming NBA season action, and stay tuned for my next post where I reveal my predicted records and standings for each team, East and West! It will be interesting to see how my predictions pan out over the course of the season-after all, it’s certainly something different I’m trying on this blog!

And yes, perfect timing for this blog to come out on NBA season opening day! Serendipity, am I right?

Also, here’s a link to the notebook in GitHub-https://github.com/mfletcher2021/DevopsBasics/blob/master/NBA_24_25_predictions.ipynb.

R Lesson 10: Intro to Machine Learning-Supervised and Unsupervised


Hello everybody,

It’s Michael, and today I’ll be discussing machine learning-supervised and unsupervised-in R. I won’t be doing any coding or analytics here, as I am using this post to give you guys background on the topics I will be discussing in the next few posts, which will contain some interesting analyses.

Anyway, what is machine learning? It’s basically an automated way to create analytical models. See, the whole idea of machine learning is that R (or any programming tool for that matter) is capable of learning from data, finding patterns, and making decisions on its own. A good example of machine learning would be the logistic and linear regression models I covered in earlier R lessons.

Now, there are two types of machine learning-supervised and unsupervised. Supervised machine learning occurs when you have clearly defined input and output variables. Think of the function y=f(x). You know what x-the input-will be and you can figure out a pattern for what y-the output-will be depending on the value of x. That function is similar to the whole idea of supervised machine learning-the analyst provides a template for R (or any other analytical tool) to draw conclusions and create analytical models.

Supervised machine learning can be classified into two main categories-classification and regression-depending on their output variable. A classification problem has a category value as the output, such as “male” or “female”, “puppy” or “kitty”. Regression problems have real values as the output; a person’s age would be a good example of a real value.

Unsupervised learning, on the other hand, has defined input variables, but no clearly defined output variables. Thus, there is no template to create analytical models, which is the whole point of unsupervised machine learning. See, where the ideas of supervised and unsupervised machine learning differ is that in unsupervised machine learning, R can discern patterns, draw conclusions, and create models without a pre-written template; this is not the case for supervised machine learning. However, one thing supervised and unsupervised machine learning have in common is that each methodology has two main categories. In unsupervised machine learning, those categories are clustering and association problems (unlike supervised machine learning, however, these categories don’t depend on your output values). In clustering problems, you are trying to group data based on certain similarities (like demographics for toy sales). In association problems, you are trying to discover trends that describe your data (like girls that tend to buy toy X will usually buy toy Y as well).

Now before I go, here’s a visual example of the concepts I just discussed:

This is a photo of puppies and kittens, which I will use to illustrate the concepts of supervised and unsupervised machine learning.

In supervised machine learning, the machine is trained to identify puppies and kittens based on certain common traits (e.g. dogs have longer noses than cats).

Let’s say I give the machine a new photo and ask it to identify whether the animal shown is a puppy or kitten:

Since the machine knows that puppies have longer noses than kittens, and this animal has a longer nose, the machine will identify this animal as a puppy based on what it has learned from previous data.

Now, let’s say we still wanted the machine to identify puppies and kittens, but we want to do so using unsupervised machine learning. However, unlike supervised machine learning, the machine doesn’t have a clear idea as to the traits that identify a puppy or kitten, so the machine would have to automatically analyze the traits of puppies and kittens in order to find out which traits define puppies and which define kittens.

Let’s use the photo of the puppy above as an example. Using unsupervised machine learning, the machine will automatically look for traits that separate the animals into groups based on any information provided, and based on the machine’s findings, the animal will be placed in either the dog group or the cat group.
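For those who like to see ideas in code, here is a rough sketch of the puppy/kitten example. It’s written in Python with scikit-learn rather than R, and the nose lengths are numbers I made up. The supervised model (a k-nearest-neighbors classifier, my choice for this illustration) is given the labels, while the unsupervised model (k-means clustering) only gets the nose lengths and has to find the two groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Made-up nose lengths in cm: first four are puppies, last four are kittens
nose_lengths = np.array([[6.0], [6.5], [7.0], [7.5], [2.0], [2.5], [3.0], [3.5]])
labels = ["puppy"] * 4 + ["kitten"] * 4

# Supervised: the model is told which animals are puppies and which are kittens
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(nose_lengths, labels)
new_animal = classifier.predict([[6.8]])[0]
print(new_animal)

# Unsupervised: no labels are given, so the model can only form two groups
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = clusterer.fit_predict(nose_lengths)
print(groups)
```

Notice that the clustering output is just group numbers. The machine finds two groups, but it’s still up to us to look at them and decide which one is the puppies and which one is the kittens.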

Thanks for reading,

Michael