R Analysis 2: Linear Regression & NFL Attendance


Hello everybody,

It’s Michael, and today’s post will be an R analysis post using the concept of linear regression. The dataset I will be using is NFL attendance 2014-18, which details NFL attendance for each team across the 2014-2018 NFL seasons, along with other factors that might affect attendance (such as average roster age and win count).

First, as we should do for any analysis, we should read the file and understand our variables:

[Screenshot: str() output listing the dataset’s variables]
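Since the code only appears in the screenshot, here is a minimal sketch of the read-in step in R. The two stand-in rows written below are made up just to keep the snippet self-contained, and the column names are my assumptions; with the real dataset you would simply call read.csv() on the downloaded file:

```r
# Stand-in CSV with made-up rows; with the real data, point read.csv()
# at the downloaded file instead
tmp <- tempfile(fileext = ".csv")
writeLines(c("Team,Home.Attendance,Road.Attendance,Total.Attendance,Win.Total",
             "Patriots,540000,530000,1070000,13",
             "Browns,450000,460000,910000,1"), tmp)
file <- read.csv(tmp)
str(file)  # one line per variable, showing its name and type (chr/int/num)
```

Because the thousands separators were stripped, the attendance columns come through as ints rather than factors, as noted below.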

  • Team - The team name corresponding to a row of data; there are 32 NFL teams in total
  • Home Attendance - How many fans attended a team’s home games (the NFL’s International games count towards this total)
  • Road Attendance - How many fans attended a team’s road games
    • Keep in mind that each team plays 8 home games and 8 road games.
  • Total Attendance - The total number of fans who went to see a team’s games in a particular season (attendance for home games + attendance for road games)
  • Win Total - How many wins a team had in a particular season
  • Win.. (meaning win percentage) - The percentage of games won by a particular team (keep in mind that ties count as half-wins when calculating win percentages)
  • NFL Season - The season corresponding to the attendance totals (e.g. the 2017 NFL season is referred to as simply 2017)
  • Conference Standing - Each team’s seeding in its respective conference (AFC or NFC), ranging from 1 (best) to 16 (worst). The teams seeded 1-6 in their conference made the playoffs that season, while teams seeded 7-16 did not; teams seeded 1-4 won their respective divisions, while teams seeded 5 and 6 made the playoffs as wildcards.
    • As the 2018 season is still in progress, these standings only reflect who is LIKELY to make the playoffs as of Week 11 of the NFL season. So far, no team has clinched a playoff spot.
  • Average Age of Roster - The average age of a team’s players once the final 53-man roster has been set (this happens before Week 1 of the NFL regular season)

One thing to note is that I removed the thousands separators from the Home Attendance, Road Attendance, and Total Attendance columns so that R would read them as ints and not factors. The original file still contains the separators, though.

Now let’s set up our model (I’m going to be using three models in this post for comparison purposes):
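The model setup is shown as a screenshot in the original post, so here is a hedged sketch of what it looks like; the column names (Total.Attendance, Win.Total) and the miniature stand-in data frame are my assumptions, not the post’s original code:

```r
# Miniature stand-in for the real data frame read in earlier
file <- data.frame(Total.Attendance = c(900000, 1000000, 1100000, 1200000),
                   Win.Total        = c(2, 7, 10, 14))
lr1 <- lm(Total.Attendance ~ Win.Total, data = file)  # attendance as a function of wins
summary(lr1)  # the bottom three lines hold the RSE, R-Squared, and F-statistic
```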

In this model, I used Total Attendance as the dependent variable and Win Total as the independent variable. In other words, I am using this model to determine whether there is any relationship between fans’ attendance at a team’s games and the team’s win total.

Remember how in R Lesson 7 I mentioned that you should pay close attention to the three bottom lines in the output? Here’s what they mean for this model:

  • As I mentioned earlier, the residual standard error refers to the amount by which the response variable (Total Attendance) deviates from the true regression line. In this case, the RSE is 1,828,000, meaning the total attendance deviates from the true regression line by 1,828,000 fans.
    • I didn’t mention this in the previous post, but the way to find the percentage error is to divide the RSE by the mean of the dependent variable (in this case, Total Attendance). The lower the percentage error, the better.
    • In this case, the percentage error is 185.43% (the mean for Total Attendance is 985,804 fans, rounded to the nearest whole number).
  • The R-Squared is a measure of the goodness-of-fit of a model: the closer to 1, the better the fit. The difference between the Multiple R-Squared and the Adjusted R-Squared is that the former isn’t dependent on the number of variables in the model while the latter is. In this model, the Multiple R-Squared is 20.87% while the Adjusted R-Squared is 20.37%, indicating a weak correlation.
    • Remember the idea that “correlation does not imply causation”: even if there is a strong correlation between the dependent and independent variables, this doesn’t mean the latter causes the former.
    • In the context of this model, even though a team’s total attendance and win total are weakly correlated, this doesn’t mean that a team’s win total causes higher or lower attendance.
  • The F-statistic measures the relationship (or lack thereof) between the independent and dependent variables. As I mentioned in the previous post, for models with only 1 degree of freedom, the F-statistic is basically the independent variable’s t-value squared (6.456² ≈ 41.68). The F-statistic (and resulting p-value) aren’t too significant for determining the accuracy of simple linear regression models such as this one, but they matter more when dealing with multiple linear regression models.

Now let’s set up the equation for the line (note the coef function I mentioned in the previous post isn’t necessary):

Remember that the equation takes the same form as the slope-intercept equation (y=mx+b) you may remember from algebra class. The equation for the line is (with coefficients rounded to the nearest whole number):

  • Total Attendance = 29022(Win Total)+773943

Let’s try the equation out using some scenarios:

  • “Perfect” Season (no wins): 29022(0)+773943=expected total attendance of 773,943
  • Split Season (eight wins): 29022(8)+773943=expected total attendance of 1,006,119
  • Actual Perfect Season (sixteen wins): 29022(16)+773943=expected total attendance of 1,238,295
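The three scenarios above can be checked with a one-line helper built from the rounded coefficients:

```r
# Rounded coefficients from the lr1 equation: 29022 * wins + 773943
expected_attendance <- function(wins) 29022 * wins + 773943
expected_attendance(c(0, 8, 16))
# 773943 1006119 1238295
```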

And finally, let’s create the graph (and the regression line):
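A sketch of the plotting step (the data frame and model are recreated in miniature here so the snippet stands alone; the real column names may differ):

```r
# Stand-in data frame and model, mirroring the earlier setup
file <- data.frame(Total.Attendance = c(900000, 1000000, 1100000, 1200000),
                   Win.Total        = c(2, 7, 10, 14))
lr1 <- lm(Total.Attendance ~ Win.Total, data = file)
plot(file$Total.Attendance ~ file$Win.Total,   # y-axis variable listed before x-axis
     main = "Total Attendance vs. Win Total",
     xlab = "Win Total", ylab = "Total Attendance")
abline(lr1)  # overlays the fitted regression line on the scatterplot
```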

As seen in the graph above, few points touch the line (which explains the low Multiple R-Squared of 20.87%). According to the regression line, total attendance INCREASES with better win totals, which indicates a direct relationship. One possible reason for this is that fans of consistently well-performing teams (like the Patriots and Steelers) are more eager to attend games than fans of consistently struggling teams (like the Browns and Jaguars). An interesting observation is that the 2015 4-12 Dallas Cowboys had better total attendance than the 2015 15-1 Carolina Panthers. The 2016 and 2017 Cleveland Browns also fared pretty well for attendance: each of those seasons had a total attendance of at least 900,000 fans (with records of 1-15 and 0-16, respectively).

Let’s create another model, once again using Total Attendance as the dependent variable but choosing Conference Standing as the independent variable:
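As before, a hedged sketch of the fit; Conference.Standing is an assumed column name and the stand-in data frame is made up:

```r
# Miniature stand-in data; with the real data frame, only the lm() line is needed
file <- data.frame(Total.Attendance    = c(1100000, 1050000, 980000, 930000),
                   Conference.Standing = c(1, 5, 10, 16))
lr2 <- lm(Total.Attendance ~ Conference.Standing, data = file)
summary(lr2)
```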

So, is this model better than lr1? Let’s find out:

  • The residual standard error is much smaller than that of the previous model (205,100 fans as opposed to 1,828,000). As a result, the percentage error is much smaller (20.81%), and there is less variation among the observation points around the regression line.
  • The Multiple R-Squared and Adjusted R-Squared (0.4% and -0.2%, respectively) are much lower than the R-Squared values for lr1. Thus, there is even less of a correlation between Total Attendance and Conference Standing than there is between Total Attendance and Win Total (for a particular team).
  • As noted earlier, the F-statistic and p-value aren’t very informative for a simple linear regression model, so we can disregard them.

Now let’s set up our equation:

From this information, we get the equation:

  • Total Attendance = -2815(Conference Standing)+1009732

Here are some scenarios using this equation:

  • Top of the conference (1st place): -2815(1)+1009732=expected total attendance of 1,006,917
  • Conference wildcard (5th place): -2815(5)+1009732=expected total attendance of 995,657
  • Bottom of the pack (16th place): -2815(16)+1009732=expected total attendance of 964,692

Finally, let’s make a graph:

As seen in the graph, few points touch the line (fewer than in the graph for lr1). The line itself has a negative slope, which implies that total attendance DECREASES with WORSE conference standings (or increases with better conference standings). Yes, I know the numbers under Conference Standing are increasing, but keep in mind that 1 is the best possible conference finish for a team, while 16 is the worst. One possible reason that total attendance decreases with lower conference standings is that fans are more enticed to come to games for consistently top conference teams and division winners (like the Patriots and Panthers) than for teams that miss the playoffs year after year (like the Jaguars, save for the 2017 squad that made it to the AFC Championship). Interestingly enough, the 2015 4-12 Dallas Cowboys rank second overall in total attendance (despite finishing 16th in their conference), just behind the 2016 13-3 Dallas Cowboys (who finished first in their conference).

Now let’s build one more model, this time using Average Age of Roster as the independent variable:
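And a sketch of the third fit; Average.Age.of.Roster is an assumed column name, with made-up stand-in data:

```r
# Miniature stand-in data; with the real data frame, only the lm() line is needed
file <- data.frame(Total.Attendance      = c(910000, 960000, 1010000, 1090000),
                   Average.Age.of.Roster = c(24.2, 25.1, 26.0, 27.3))
lr3 <- lm(Total.Attendance ~ Average.Age.of.Roster, data = file)
summary(lr3)
```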

Is this model better than lr2? Let’s find out:

  • The residual standard error is the smallest of the three (204,600 fans), and thus the percentage error is also the smallest of the three (20.75%).
  • The Multiple R-Squared and Adjusted R-Squared (0.84% and 0.22%, respectively) are smaller than those of lr1 but larger than those of lr2. Thus, Average Age of Roster correlates better with Total Attendance than Conference Standing does, though Win Total correlates best with Total Attendance.
  • Once again, we can disregard the F-statistic and its corresponding p-value.

Now let’s create the equation:

  • Total Attendance = 36556(Average Age of Roster)+33594

Here are some scenarios using this equation:

  • Roster with mostly rookies and 2nd-years (an average age of 24)=36556(24)+33594=expected total attendance of 910,938
  • Roster with a mix of newbies and veterans (an average age of 26)=36556(26)+33594=expected total attendance of 984,050
  • Roster with mostly veterans (an average age of 28)=36556(28)+33594=expected total attendance of 1,057,162

And finally, let’s create a graph:

Like the graph for lr2, few points touch the line. As for the line itself, the slope is positive, implying that Total Attendance INCREASES as Average Age of Roster increases. One possible reason for this is that fans are more interested in coming to games if the team has several veteran stars* (names like Philip Rivers, Tom Brady, Jordy Nelson, Antonio Gates, Rob Gronkowski, Richard Sherman, Julius Peppers, Marshawn Lynch, and many more) than if the team is full of rookies and/or unknowns (Myles Garrett, Sam Darnold, Josh Rosen, Leighton Vander Esch, among others). Interestingly enough, the team with the oldest roster (the 2018 Oakland Raiders, with an average age of 27.4) has the second-lowest total attendance, just ahead of the 2018 LA Chargers (with an average age of 25.8).

*I’ll use any player who has at least 6 seasons of NFL experience as an example of a “veteran star”.

So, which is the best model to use? I’d say lr1 would be the best model to use, because even though it has the highest RSE (1,828,000), it also has the best correlation between the independent and dependent variables (a Multiple R-Squared of 20.87%). All in all, according to my three analyses, a team’s Win Total has the greatest influence on how many fans go to their games (both home and away) during a particular season.

Thanks for reading, and happy Thanksgiving to you all. Enjoy your feasts (and those who are enjoying those feasts with you),

Michael

R Lesson 7: Graphing Linear Regression & Determining Accuracy of the Model


Hello everybody,

It’s Michael, and today’s post will be a continuation of the previous post because I will be covering how to graph linear regression models using the dataset and model created in the previous post.

Just to recap, this model is measuring whether the age of a person influenced how many times their name was mentioned on cable news reports (period measured is from October 1-December 7, 2017, at the beginning of the #MeToo movement). Now let’s graph the model:

The basic syntax for creating a plot like this is plot(dataset$yVariable ~ dataset$xVariable); the y-axis variable is always listed BEFORE the x-axis variable. The portion of code that reads main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions" is completely optional, as all it does is label the x- and y-axes and display a title for the graph.

The abline(linearModel) line adds a line to the graph based on the equation for the model (value=0.09171(age)-3.39221). However, for this function to work, don’t close the window displaying the graph. The line is immediately displayed on the graph after you write the line of code and hit enter.

  • Use the name of your linear regression model in the parentheses so that the line that is created matches the equation for your model.
  • Remember to always hit enter after writing the plot() line, then go back to the coding window, write the abline() line and hit enter again.
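Putting the two steps together, here is a self-contained sketch; the four-row data frame is a made-up stand-in for the full MeToo file:

```r
# Stand-in data frame and model, mirroring the lesson's setup
file <- data.frame(value = c(1, 5, 12, 30), age = c(40, 48, 55, 62))
linearModel <- lm(value ~ age, data = file)
plot(file$value ~ file$age,
     main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions")
abline(linearModel)  # adds the regression line to the plot that is still open
```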

So we know how to graph our model, but how do we evaluate the accuracy of the model? Take a look at the summary(linearModel) output below:

Focus on the last three lines of this code block, as they will help determine the accuracy (or goodness-of-fit) of this model. Here’s a better explanation of the output:

  • The residual standard error is a measure of the QUALITY of the fit of the model. Every linear regression model contains an error term (E), and as a result, there will be some margin of error in our regression line. The residual standard error refers to the amount by which the response variable (value) deviates from the true regression line. In this case, the actual number of times someone was mentioned on the news deviates from the true regression line by 16.2 mentions.
  • The R-squared is a measure of how well the model ACTUALLY fits the data. The closer the R-squared is to 1, the better the fit of the model. The main difference between Multiple R-Squared and Adjusted R-Squared is that the former isn’t dependent on the number of variables in the model while the latter is. The Adjusted R-Squared will decrease when irrelevant variables are added to the model, which makes it a good metric to use when trying to find relevant variables for your model.
    • As you can see, the Multiple R-Squared and Adjusted R-Squared values are quite low (0.47% and 0.46%, respectively), indicating that there isn’t a significant relationship between a person’s age and how many times their name was mentioned in news reports.
    • Keep the idea that “correlation does not imply causation” in mind when analyzing the R-Squared: even if a high R-Squared shows that the independent and dependent variables correlate strongly, that does not mean the independent variable causes the dependent variable (and likewise, a low R-Squared doesn’t rule out causation). In the context of this model, even though someone’s age and the number of times their name was mentioned on news reports don’t appear correlated, this doesn’t mean that someone’s age isn’t a factor in the amount of news coverage they receive.
  • The F-statistic is a measure of the relationship (or lack thereof) between the dependent and independent variables. This value (along with the corresponding p-value) isn’t really significant when dealing with simple linear regression models such as this one but it is an important metric to analyze when dealing with multiple linear regression models (just like simple linear regression except with multiple independent variables).
    • I will cover multiple linear regression models in a future post, so keep your eyes peeled.
    • In cases of simple linear regression, the F-statistic is basically the independent variable’s t-value squared. The summary output displayed above illustrates this, as the squared t-value for age (8.049² ≈ 64.79) matches the F-statistic (64.78). Keep in mind that the F-statistic = (t-value)² rule only applies to models with 1 degree of freedom, just like the one displayed above.
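The arithmetic behind that check, using the values quoted from the summary output:

```r
t_value <- 8.049   # t-value for age from the summary output
t_value^2          # about 64.79, matching the reported F-statistic of 64.78
```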

Thanks for reading,

Michael


R Lesson 6: Linear Regression


Hello everybody,

Michael here, and now that I’ve completed my series of MySQL posts (for now), I thought I’d start posting more R lessons. Today’s lesson will be on linear regression, which, like logistic regression, is another type of regression method (go back to R Lesson 4: Logistic Regression Models for a refresher).

The main difference between the two types of regression models is in the dependent variable. While you may recall that the dependent variable for logistic regression is binary (meaning there are only 2 possible outcomes), the dependent variable for linear regression is continuous (meaning it can take on a whole range of numeric values).

The dataset I will be using for this analysis is MeToo, which details the amount of media coverage regarding 45 famous men accused of some form of sexual misconduct in the wake of the MeToo movement.

The first step when working with linear regression (or any type of analysis in R) is to understand the data, which we do by creating a file variable and utilizing the read.csv command. We then use the str(file) command to display the variables in the dataset.
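A minimal sketch of that step; the filename and the two stand-in rows below are made up for illustration:

```r
# Stand-in CSV with made-up rows; with the real data, point read.csv()
# at the downloaded MeToo file instead
tmp <- tempfile(fileext = ".csv")
writeLines(c("date_start,station,value,name,age,occupation",
             "2017-10-05,CNN,18,Al Franken,66,politician",
             "2017-11-29,MSNBC,25,Matt Lauer,59,anchor"), tmp)
file <- read.csv(tmp)
str(file)  # displays each variable and its type
```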

Here’s some more context on each of the variables used:

  • date_start - The date of each news report
  • date_end - This just mentions the date of each news report along with the timestamp T23:59:59Z (denoting the end of the day); this variable will not be relevant to our analysis
  • date_resolution - This just mentions “day” alongside each corresponding date; this variable is also irrelevant to our analysis
  • station - The network that aired each news report, of which there are six in this dataset (Bloomberg, CNBC, CNN, FOX Business, FOX News, and MSNBC)
  • value - The number of times each person’s name is mentioned on a particular broadcast on a specific date
    • This only counts mentions of full names (e.g. only mentions of “Matt Lauer”, not just “Matt” or “Lauer”)
  • name - The subject of the news report (e.g. Al Franken, Blake Farenthold)
  • age - The age of the subject of the news report
  • occupation - The profession of the subject of the news report (e.g. Louis CK is a comedian)

Now that we understand our variables, let’s build the model.

In this model, I chose the variables value and age (they are the two most quantitative) and built the model on the full data (hence data = file), with value being the dependent variable (that’s why I listed it first) and age being the independent variable. The display for summary(linearModel) is very similar to what you’ll see if you ask for a summary of a logistic regression model.

  • Remember that linear regression models use lm while logistic regression models use glm!

Now, let me introduce a new command-print-which in this case will print the coefficients used in the equation of this model.
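A sketch of the model plus the print step; the miniature data frame below stands in for the full MeToo file, so the numbers it produces are illustrative only:

```r
# Stand-in data; with the real file, the lm() call is the same
file <- data.frame(value = c(2, 6, 11, 28), age = c(41, 49, 56, 63))
linearModel <- lm(value ~ age, data = file)  # lm, not glm: this is linear regression
print(linearModel)  # prints the intercept and the coefficient on age
```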

In this model, value is a function of age, so when we create the equation for our model, here’s what we get:

  • value=0.09171(age)+(-3.39221)

If this format looks familiar, note that it’s the same setup as the slope-intercept equation y=mx+b (remember this from algebra class?).

In the next post, we will learn how to graph this model.

Thanks for reading,

Michael