Hello everybody,
It’s Michael, and today’s lesson will be about predictions for both linear and logistic regression models. I will be using the same dataset that I used for R Analysis 2: Linear Regression & NFL Attendance, except I added some variables so I could create both linear and logistic regression models from the data. Here is the modified dataset-NFL attendance 2014-18
Now, as always, let’s first try to understand our variables:

I described most of these variables in R Analysis 2, but here are what the two new ones mean (I’m referring to the two bottommost variables):
- Playoffs-whether or not a team made the playoffs. Teams that made the playoffs are represented by a 1, while teams that didn’t make the playoffs are represented by a 0. Recall that teams that finished 1st-6th in their respective conferences made the playoffs, while teams that finished 7th-16th did not.
- Division-what division a team belongs to, of which there are 8:
- 1-AFC East (Patriots, Jets, Dolphins, Bills)
- 2-AFC North (Browns, Steelers, Ravens, Bengals)
- 3-AFC South (Colts, Jaguars, Texans, Titans)
- 4-AFC West (Chargers, Broncos, Chiefs, Raiders)
- 5-NFC East (Cowboys, Eagles, Giants, Redskins)
- 6-NFC North (Packers, Bears, Vikings, Lions)
- 7-NFC South (Falcons, Saints, Panthers, Buccaneers)
- 8-NFC West (Seahawks, 49ers, Cardinals, Rams)
I added these two variables so that I could create logistic regression models from the data. In both cases, I used dummy variables (remember those?).
Another function I think will help you in your analyses is sapply. Here’s how it works:
As you can see, you can do two things with sapply-find out if there are any missing values (as seen in the top function) or find out how many unique values there are for a certain variable (as seen in the bottom function). According to the output, there are no missing values for any variable (in other words, there are no blank spots in any column of the spreadsheet). Also, in the bottom function, you can see how many distinct values correspond to a certain variable (e.g. Conference Standing has 16 distinct values).
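Here’s a minimal sketch of both uses. The data frame name nfl and its columns are stand-ins for illustration, not the real dataset:

```r
# Toy stand-in data frame; the real dataset's name and columns may differ
nfl <- data.frame(
  Team      = c("Patriots", "Jets", "Dolphins", "Bills"),
  Win.Total = c(12, 4, 6, 9),
  Playoffs  = c(1, 0, 0, 0)
)

# Use 1: count missing values in every column (all zeros = no blanks)
sapply(nfl, function(x) sum(is.na(x)))

# Use 2: count distinct values in every column
sapply(nfl, function(x) length(unique(x)))
```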
Before I get into analysis of the models, I want to introduce two new concepts-training data and testing data:
The difference between training and testing data is that training data are used as guidelines for how a model (whether linear or logistic) should make decisions, while testing data give us an idea of how well the model is performing. When splitting up your data, a good rule of thumb is 80-20, meaning that 80% of the data should be for training while 20% should be for testing (it doesn’t have to be exactly 80-20, but the majority of the data should always go to training and the minority to testing). In this model, observations 1-128 are part of the training dataset while observations 129-160 are part of the testing dataset.
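The split described above can be sketched like this, using a stand-in 160-row data frame in place of the real one:

```r
# Stand-in 160-row data frame (the real dataset has 160 team-seasons:
# 32 teams x 5 seasons, 2014-18)
set.seed(42)
nfl <- data.frame(Win.Total = sample(0:16, 160, replace = TRUE))

train <- nfl[1:128, , drop = FALSE]    # first 80% of rows for training
test  <- nfl[129:160, , drop = FALSE]  # last 20% of rows for testing

nrow(train)  # 128
nrow(test)   # 32
```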
I will post four models in total-two using linear regression and two using logistic regression. I will start with the logistic regression:
In this model, I chose Playoffs as the binary dependent variable and Division and Win Total as the independent variables. As you can see, the intercept and Win Total are statistically significant, while Division is not. Also, notice the data = train line, which indicates that the training dataset will be used for this analysis (you should always use the training dataset to create the model).
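A sketch of how such a model can be fit with glm, using simulated stand-in data (the column names Playoffs, Division, and Win.Total are my assumptions about the real ones):

```r
# Simulated stand-in training data: 128 rows, as in the real split
set.seed(1)
train <- data.frame(
  Win.Total = sample(2:14, 128, replace = TRUE),
  Division  = sample(1:8, 128, replace = TRUE)
)
# Simulate the binary outcome so that more wins -> more likely playoffs
train$Playoffs <- as.integer(train$Win.Total + rnorm(128) > 9)

# family = binomial(link = "logit") is what makes this logistic
# regression; data = train ensures only the training rows are used
logit1 <- glm(Playoffs ~ Division + Win.Total,
              family = binomial(link = "logit"), data = train)
summary(logit1)
```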
Now let’s create some predictions using our test dataset:
The fitted.results variable holds the model’s predicted probabilities for each observation in our test dataset (observations 129-160), and the ifelse function converts each probability into a predicted outcome. A 1 under an observation number indicates that the model gives that team at least a 50% chance of making the playoffs, while a 0 indicates a chance below 50%.
If we wanted to figure out exactly how accurate the model’s predictions are overall, here’s how:
The misClasificError variable is the model’s misclassification rate-the fraction of test observations the model predicted incorrectly. The accuracy is calculated by subtracting the misClasificError from 1, which turns out to be 87%, indicating very good accuracy (and indicating that the model’s misclassification rate is 13%).
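Putting the whole prediction-and-accuracy pipeline together as a sketch, again on simulated stand-in data:

```r
# Simulated stand-in dataset, split 80-20 as before
set.seed(1)
nfl <- data.frame(
  Win.Total = sample(2:14, 160, replace = TRUE),
  Division  = sample(1:8, 160, replace = TRUE)
)
nfl$Playoffs <- as.integer(nfl$Win.Total + rnorm(160) > 9)
train <- nfl[1:128, ]
test  <- nfl[129:160, ]

logit1 <- glm(Playoffs ~ Division + Win.Total,
              family = binomial, data = train)

# Predicted probabilities for the 32 test observations,
# then converted to 0/1 predictions at the 50% cutoff
fitted.results <- predict(logit1, newdata = test, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

# Fraction of test observations predicted incorrectly; accuracy is
# 1 minus that fraction
misClasificError <- mean(fitted.results != test$Playoffs)
print(paste("Accuracy", 1 - misClasificError))
```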
Finally, let’s plot the model:
We can also predict various what-if scenarios using the model and the predict function. Here’s an example:
Using the AFC South as an example, I calculated the predicted log-odds of a team in that division making the playoffs based on various possible win totals (predict returns log-odds by default, and a positive value means a better-than-even chance of making the playoffs). As you can see, an AFC South team with 10 or 14 wins is all but guaranteed to make the playoffs, as the log-odds for both of those win totals are comfortably positive. However, AFC South teams with only 2 or 8 wins aren’t likely to go to the playoffs because the log-odds for both of those win totals are negative (though 8 wins fares better than 2).
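A sketch of this kind of what-if prediction on simulated stand-in data. Note that predict on a glm returns log-odds unless you ask for type = "response":

```r
# Simulated stand-in training data and model, as in earlier sketches
set.seed(1)
train <- data.frame(
  Win.Total = sample(2:14, 128, replace = TRUE),
  Division  = sample(1:8, 128, replace = TRUE)
)
train$Playoffs <- as.integer(train$Win.Total + rnorm(128) > 9)
logit1 <- glm(Playoffs ~ Division + Win.Total,
              family = binomial, data = train)

# What-if: an AFC South team (Division 3) with various win totals
newdata <- data.frame(Division = 3, Win.Total = c(2, 8, 10, 14))

predict(logit1, newdata)                     # log-odds (predict's default)
predict(logit1, newdata, type = "response")  # probabilities between 0 and 1
```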
Let’s try another example, this time examining the effects of 9 wins across all 8 divisions (I chose 9 because 9 wins sometimes results in playoff berths, sometimes it doesn’t):
As you can see, 9 wins is most likely to earn a playoff berth for AFC East teams (55.6% chance) and least likely to earn a playoff spot in the NFC West (35.7% chance).
I know it looks like all the lines are squished into one big line, but you can infer that the more wins a team has, the greater its chances are of making the playoffs. The pink line that appears to be the most visible represents the NFC West (Rams, Seahawks, 49ers, Cardinals). Unsurprisingly, the teams likeliest to make the playoffs were the teams with 9 or more wins (except for the 2017 Seahawks, who finished 9-7 and missed the playoffs).
Now let’s create another logistic regression model that is similar to the last one, except with the addition of the Total Attendance variable.
The summary output looks similar to that of the previous model (I also use the training dataset for this model), except that this time, none of the variables have asterisks right by them, meaning none of them are statistically significant (which happens when the p-value is above 0.1). Nevertheless, I’ll still analyze this model to see if it is better than my first logistic regression model.
Now let’s create some predictions using the test dataset:
Like our previous model, this model also has a nice mix of 0s and 1s, except this model only has 11 1s, while the previous model had 14 1s.
And now let’s find the overall accuracy of the model:
Ok, so I know 379% seems like crazy accuracy for a logistic regression model-and it is: accuracy is a proportion of correct predictions, so it can never legitimately exceed 100%. Here’s how that number most likely came about:
R took the sum of the raw predictions (which are on the log-odds scale, not 0s and 1s), divided that sum by 32 to find their average, and then subtracted that (negative) average from 1. Had the predictions first been converted to 0/1 outcomes and compared against the actual test results, the accuracy would have landed between 0% and 100%.
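To make the failure mode concrete, here’s a sketch on simulated stand-in data contrasting the suspect calculation with the correct one (my reconstruction of the bug is an educated guess):

```r
# Simulated stand-in dataset and model, split 80-20 as before
set.seed(1)
nfl <- data.frame(
  Win.Total = sample(2:14, 160, replace = TRUE),
  Division  = sample(1:8, 160, replace = TRUE)
)
nfl$Playoffs <- as.integer(nfl$Win.Total + rnorm(160) > 9)
train <- nfl[1:128, ]
test  <- nfl[129:160, ]
logit2 <- glm(Playoffs ~ Division + Win.Total,
              family = binomial, data = train)

# Suspect calculation: raw predictions are log-odds, which can be
# negative, so "1 - their average" can blow far past 100%
raw <- predict(logit2, newdata = test)
bad.accuracy <- 1 - mean(raw)

# Correct calculation: convert to 0/1 first, then compare with the truth
pred <- ifelse(predict(logit2, newdata = test, type = "response") > 0.5, 1, 0)
good.accuracy <- 1 - mean(pred != test$Playoffs)  # always between 0 and 1
```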
Just as we did with the first model, we can also create what-if scenarios. Here’s an example:
Using the AFC North as an example, I analyzed the effect of win total on a team’s playoff chances while keeping total attendance the same (1,400,000). Unsurprisingly (if total attendance is roughly 1.4 million fans in a given season), teams with a losing record (7-8-1 or lower) are less likely to make the playoffs than teams with a split or winning record (8-8 or higher). Given both record and a total attendance of 1,400,000 fans, the threshold for clinching a playoff berth appears to be 12 or 13 wins (though barring attendance, most AFC North teams fare well with 10, 9, or even 8 wins).
Now here’s another example, this time using the NFC East (and changing both win totals and total attendance):
So given increasing win totals and total attendance, an NFC East team’s playoff chances increase. The playoff threshold here, just as it has been with most of my predictions, is 9 or 10 wins.
Now let’s see what happens when win totals increase but attendance goes down (also using the NFC East):
Ultimately (with regards to the NFC East), it’s not total attendance that matters, but a team’s win totals. As you can see, regardless of total attendance, playoff clinching odds increase with higher win totals (win threshold remains at 9 or 10).
And here’s our model plotted:
Now, I know this graph is just about as easy to read as the last graph (not very, but that’s how R works), but just like with the last graph, you can draw some conclusions. Since this graph factors in Total Attendance and Win Total (even though only Total Attendance is displayed), you can tell that even though a team’s fanbase may love coming to their games, if the wins are low, so are the playoff chances.
Now, before we start the linear regression models, let’s compare the logistic regression models to see which is the better of the two by analyzing various criteria:
- Difference between null & residual deviance
- Model 1-73.25 with a decrease of two degrees of freedom
- Model 2-115.82 with a decrease of three degrees of freedom
- Better model-Model 1
- AIC
- Model 1-101.86
- Model 2-60.483
- Better model-Model 2 (41.377 difference)
- Number of Fisher Scoring Iterations
- Model 1-5
- Model 2-7
- Better model-Model 1 (fewer Fisher iterations)
- Overall Accuracy
- Model 1-87%
- Model 2-379%
- Better model-Model 1 (379% sounds too good to be true)
Overall better model: Model 1
Now here’s the first linear regression model:
This model has Win Total as the dependent variable and Total Attendance and Conference Standing as the independent variables. This will also be my first model created with multiple linear regression, which is basically linear regression with more than one independent variable.
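A sketch of fitting such a model with lm on simulated stand-in data (the column names Win.Total, Total.Attendance, and Conference.Standing are my assumptions):

```r
# Simulated stand-in data with a built-in linear relationship
set.seed(2)
nfl <- data.frame(
  Total.Attendance    = runif(160, 7.5e5, 1.45e6),
  Conference.Standing = sample(1:16, 160, replace = TRUE)
)
nfl$Win.Total <- round(6.4e-6 * nfl$Total.Attendance -
                       0.58 * nfl$Conference.Standing + 6 + rnorm(160))

# Multiple linear regression: one dependent variable, two predictors
# joined with + on the right-hand side of the formula
linear1 <- lm(Win.Total ~ Total.Attendance + Conference.Standing, data = nfl)
summary(linear1)
```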
And finally, let’s plot the model:
In cases of multiple linear regression such as this, I had to graph each independent variable separately; graphing Total Attendance and Conference Standing separately allows us to examine the effect each independent variable has on our dependent variable (Win Total). As you can see, Win Total increases with increasing Total Attendance and decreases as the Conference Standing number gets larger (remember, a larger standing number means a worse finish). Both graphs make lots of sense, as fans are more tempted to come to a team’s games when the team has a high win total, and teams with lower win totals tend to finish further down the conference standings (an interesting exception is the 2014 Carolina Panthers, who finished 4th in the NFC despite a 7-8-1 record).
- In case you are wondering what the layout function does, it basically allows two graphs to be displayed side by side. I can also alter the function depending on how many independent variables I use; if, for instance, I used 4 independent variables, I could change the matrix dimensions to 2,2 to display the graphs in a 2-by-2 grid.
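For reference, a minimal sketch of layout in action:

```r
# layout takes a matrix that says which plot goes in which cell
layout(matrix(c(1, 2), nrow = 1, ncol = 2))  # 1 row, 2 columns: side by side
plot(1:10, main = "First independent variable")
plot(10:1, main = "Second independent variable")

# With 4 independent variables, a 2-by-2 grid would work instead:
# layout(matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2))
```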
Multiple linear regression equations are quite similar to those of simple linear regression, except for an added variable. In this case, the equation would be:
- Win Total = 6.366e-6(Total Attendance)-5.756e-1(Conference Standing)+5.917
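We can wrap the fitted equation in a small helper function (win.total is a hypothetical name, and the coefficients are copied straight from the equation above) so the plugging-in is less error-prone:

```r
# The fitted equation as an R function; win.total is a hypothetical
# helper name, with coefficients copied from the equation above
win.total <- function(total.attendance, conference.standing) {
  6.366e-6 * total.attendance - 5.756e-1 * conference.standing + 5.917
}

win.total(750000, 16)  # about 1.5 wins: the 1-15 scenario
win.total(1.1e6, 7)    # about 8.9 wins: rounds to the 9-7 scenario
```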
Now, using the predict function that I showed you for my logistic regression models won’t be very efficient here, so we can go the old-fashioned way by plugging numbers into the equation. Here’s an example:
Regardless of what conference a team is part of, a total attendance of at least 750,000 fans and a bottom seed in the conference should at least bring the team a 1-15 record. For teams with a total attendance of at least 1.1 million fans who fall just short of the playoffs with a 7th seed, a 9-7 record would be likely. Top of the conference teams with an attendance of at least 1.45 million should net a 14-2 record.
Now, let’s see what happens when conference standing improves, but attendance decreases:
According to my predictions, bottom-seeded teams with a total attendance of at least 1.5 million fans should net at least a 6-10 record. However, as conference standings improve and total attendance decreases, predicted records stagnate at either 9-7 or 8-8.
Now here’s my second linear model:
In this model, I used two different independent variables-Home Attendance and Average Age of Roster-but I still used Win Total as my dependent variable.
The equation goes like this:
- Win Total = 1.051e-5(Home Attendance)+5.534e-1(Average Age of Roster)-1.229e+1
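As before, the equation can be wrapped in a small helper function (win.total2 is a hypothetical name, with the coefficients copied from the equation above):

```r
# The second fitted equation as an R function; win.total2 is a
# hypothetical helper name
win.total2 <- function(home.attendance, avg.roster.age) {
  1.051e-5 * home.attendance + 5.534e-1 * avg.roster.age - 1.229e1
}

win.total2(1.1e6, 24)   # 12.5526, which rounds to a 13-3 record
win.total2(350000, 24)  # about 4.7: the 5-11 scenario
```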
Now just like I did with both of my logistic regression models and the linear regression model, let’s create some what-if scenarios:
In this scenario, home attendance is increasing along with the average age of roster. Win total also increases with a higher average age of roster. For instance, teams with a home attendance of at least 350,000 fans and an average roster age of 24 (meaning the team is full of rookies and other fairly fresh faces) should expect at least a 5-11 record. On the other hand, teams with a roster full of veterans (yes, 28.5 is old for an average roster age) and a home attendance of at least 1.2 million fans should expect a perfect 16-0 season.
Now let’s try a scenario where home attendance decreases but average age of roster increases:
In this scenario, when home attendance decreases but average age of roster increases, a team’s projected win total also goes down. For teams full of fresh faces and breakout stars (average age 24) and a home attendance of at least 1.1 million fans, a 13-3 record seems likely. On the other hand, for teams full of veterans (average age 28.5) and a home attendance of at least 300,000 fans, a 7-9 record appears in reach.
One thing to keep in mind with my linear regression predictions is that I rounded projected win totals to the nearest whole number. So I got the 13-3 record projection from the 12.5526 output.
Now let’s plot the model:
Just as I did with linear1, I graphed the two independent variables separately, not only because it’s the easiest way to graph multiple linear regression but also because we can see each variable’s effect on Win Total. As you can see, Home Attendance and Average Age of Roster both increase with an increasing win total, though the effect of Average Age of Roster is smaller than that of Home Attendance. Each relationship makes sense, as teams are likelier to have a higher win total if they have more supportive fans in attendance (particularly in their 7 or 8 home games per season), and having more recognizable veterans on a team (like the Saints with QB Drew Brees or the Broncos with LB Von Miller) will be better for the team’s overall record than having a team full of newbies (like the Browns with QB Baker Mayfield or the Giants with RB Saquon Barkley).
- The Home Attendance numbers are displayed in scientific notation, which is how R displays large numbers. 1e+05 is 100,000, 3e+05 is 300,000, and so on.
Now, before I go, let’s compare the two linear models:
- Residual Standard Error
- Model 1-1.09 wins
- Model 2-2.948 wins
- Better Model-Model 1 (less deviation)
- R-Squared (Multiple and Adjusted respectively)
- Model 1-88.72% and 88.58%
- Model 2-17.49% and 16.44%
- Better Model-Model 1 (much higher than Model 2)
- F-statistic & P-Value (since there are 2 degrees of freedom, this is an important metric)
- Model 1-617.5 on 2 and 157 degrees of freedom; 2.79e-7
- Model 2-16.64 on 2 and 157 degrees of freedom; 2.79e-7
- Better Model-Model 1 (both result in the same p-value, but the F-statistic on Model 1 is much larger)
- Overall better model-Model 1
Thanks for reading,
Michael