R Analysis 2: Linear Regression & NFL Attendance


Hello everybody,

It’s Michael, and today’s post will be an R analysis post using the concept of linear regression. The dataset I will be using is NFL attendance 2014-18, which details NFL attendance for each team across the 2014-2018 NFL seasons, along with other factors that might affect attendance (such as average roster age and win count).

First, as we should do for any analysis, we should read the file and understand our variables:

[Screenshot: str() output listing the dataset’s variables]
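Since the code only appears in the screenshot, here is a minimal sketch of the read-in step in R. The two stand-in rows written below are made up just to keep the snippet self-contained, and the column names are my assumptions; with the real dataset you would simply call read.csv() on the downloaded file:

```r
# Stand-in CSV with made-up rows; with the real data, point read.csv()
# at the downloaded file instead
tmp <- tempfile(fileext = ".csv")
writeLines(c("Team,Home.Attendance,Road.Attendance,Total.Attendance,Win.Total",
             "Patriots,540000,530000,1070000,13",
             "Browns,450000,460000,910000,1"), tmp)
file <- read.csv(tmp)
str(file)  # one line per variable, showing its name and type (chr/int/num)
```

Because the thousands separators were stripped, the attendance columns come through as ints rather than factors, as noted below.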

  • Team - The team name corresponding to a row of data; there are 32 NFL teams in total
  • Home Attendance - How many fans attended a team’s home games (the NFL’s International games count towards this total)
  • Road Attendance - How many fans attended a team’s road games
    • Keep in mind that each team plays 8 home games and 8 road games.
  • Total Attendance - The total number of fans who went to see a team’s games in a particular season (attendance for home games + attendance for road games)
  • Win Total - How many wins a team had in a particular season
  • Win.. (meaning win percentage) - The percentage of games won by a particular team (keep in mind that ties count as half-wins when calculating win percentages)
  • NFL Season - The season corresponding to the attendance totals (e.g. the 2017 NFL season is referred to as simply 2017)
  • Conference Standing - Each team’s seeding in its respective conference (AFC or NFC), ranging from 1 (best) to 16 (worst). The teams seeded 1-6 in their conference made the playoffs that season, while teams seeded 7-16 did not; teams seeded 1-4 won their respective divisions, while teams seeded 5 and 6 made the playoffs as wildcards.
    • As the 2018 season is still in progress, these standings only reflect who is LIKELY to make the playoffs as of Week 11 of the NFL season. So far, no team has clinched a playoff spot.
  • Average Age of Roster - The average age of a team’s players once the final 53-man roster has been set (this happens before Week 1 of the NFL regular season)

One thing to note is that I removed the thousands separators from the Home Attendance, Road Attendance, and Total Attendance columns so that R would read them as ints and not factors. The original file still contains the separators, though.

Now let’s set up our model (I’m going to be using three models in this post for comparison purposes):
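The model setup is shown as a screenshot in the original post, so here is a hedged sketch of what it looks like; the column names (Total.Attendance, Win.Total) and the miniature stand-in data frame are my assumptions, not the post’s original code:

```r
# Miniature stand-in for the real data frame read in earlier
file <- data.frame(Total.Attendance = c(900000, 1000000, 1100000, 1200000),
                   Win.Total        = c(2, 7, 10, 14))
lr1 <- lm(Total.Attendance ~ Win.Total, data = file)  # attendance as a function of wins
summary(lr1)  # the bottom three lines hold the RSE, R-Squared, and F-statistic
```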

In this model, I used Total Attendance as the dependent variable and Win Total as the independent variable. In other words, I am using this model to determine whether there is any relationship between fans’ attendance at a team’s games and the team’s win total.

Remember how in R Lesson 7 I mentioned that you should pay close attention to the three bottom lines in the output? Here’s what they mean for this model:

  • As I mentioned earlier, the residual standard error refers to the amount by which the response variable (Total Attendance) deviates from the true regression line. In this case, the RSE is 1,828,000, meaning the total attendance deviates from the true regression line by 1,828,000 fans.
    • I didn’t mention this in the previous post, but the way to find the percentage error is to divide the RSE by the mean of the dependent variable (in this case, Total Attendance). The lower the percentage error, the better.
    • In this case, the percentage error is 185.43% (the mean for Total Attendance is 985,804 fans, rounded to the nearest whole number).
  • The R-Squared is a measure of the goodness-of-fit of a model: the closer to 1, the better the fit. The difference between the Multiple R-Squared and the Adjusted R-Squared is that the former isn’t dependent on the number of variables in the model while the latter is. In this model, the Multiple R-Squared is 20.87% while the Adjusted R-Squared is 20.37%, indicating a weak correlation.
    • Remember the idea that “correlation does not imply causation”: even if there is a strong correlation between the dependent and independent variables, this doesn’t mean the latter causes the former.
    • In the context of this model, even though a team’s total attendance and win total are weakly correlated, this doesn’t mean that a team’s win total causes higher or lower attendance.
  • The F-statistic measures the relationship (or lack thereof) between the independent and dependent variables. As I mentioned in the previous post, for models with only 1 degree of freedom, the F-statistic is basically the independent variable’s t-value squared (6.456² ≈ 41.68). The F-statistic (and resulting p-value) aren’t too significant for determining the accuracy of simple linear regression models such as this one, but they matter more when dealing with multiple linear regression models.

Now let’s set up the equation for the line (note the coef function I mentioned in the previous post isn’t necessary):

Remember that the equation takes the same form as the slope-intercept equation (y=mx+b) you may remember from algebra class. The equation for the line is (with coefficients rounded to the nearest whole number):

  • Total Attendance = 29022(Win Total)+773943

Let’s try the equation out using some scenarios:

  • “Perfect” Season (no wins): 29022(0)+773943=expected total attendance of 773,943
  • Split Season (eight wins): 29022(8)+773943=expected total attendance of 1,006,119
  • Actual Perfect Season (sixteen wins): 29022(16)+773943=expected total attendance of 1,238,295
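The three scenarios above can be checked with a one-line helper built from the rounded coefficients:

```r
# Rounded coefficients from the lr1 equation: 29022 * wins + 773943
expected_attendance <- function(wins) 29022 * wins + 773943
expected_attendance(c(0, 8, 16))
# 773943 1006119 1238295
```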

And finally, let’s create the graph (and the regression line):
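A sketch of the plotting step (the data frame and model are recreated in miniature here so the snippet stands alone; the real column names may differ):

```r
# Stand-in data frame and model, mirroring the earlier setup
file <- data.frame(Total.Attendance = c(900000, 1000000, 1100000, 1200000),
                   Win.Total        = c(2, 7, 10, 14))
lr1 <- lm(Total.Attendance ~ Win.Total, data = file)
plot(file$Total.Attendance ~ file$Win.Total,   # y-axis variable listed before x-axis
     main = "Total Attendance vs. Win Total",
     xlab = "Win Total", ylab = "Total Attendance")
abline(lr1)  # overlays the fitted regression line on the scatterplot
```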

As seen in the graph above, few points touch the line (which explains the low Multiple R-Squared of 20.87%). According to the regression line, total attendance INCREASES with better win totals, which indicates a direct relationship. One possible reason for this is that fans of consistently well-performing teams (like the Patriots and Steelers) are more eager to attend games than fans of consistently struggling teams (like the Browns and Jaguars). An interesting observation is that the 2015 4-12 Dallas Cowboys had better total attendance than the 2015 15-1 Carolina Panthers. The 2016 and 2017 Cleveland Browns also fared pretty well for attendance: each of those seasons had a total attendance of at least 900,000 fans (with records of 1-15 and 0-16, respectively).

Let’s create another model, once again using Total Attendance as the dependent variable but choosing Conference Standing as the independent variable:
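As before, a hedged sketch of the fit; Conference.Standing is an assumed column name and the stand-in data frame is made up:

```r
# Miniature stand-in data; with the real data frame, only the lm() line is needed
file <- data.frame(Total.Attendance    = c(1100000, 1050000, 980000, 930000),
                   Conference.Standing = c(1, 5, 10, 16))
lr2 <- lm(Total.Attendance ~ Conference.Standing, data = file)
summary(lr2)
```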

So, is this model better than lr1? Let’s find out:

  • The residual standard error is much smaller than that of the previous model (205,100 fans as opposed to 1,828,000). As a result, the percentage error is much smaller (20.81%), and there is less variation among the observation points around the regression line.
  • The Multiple R-Squared and Adjusted R-Squared (0.4% and -0.2%, respectively) are much lower than the R-Squared values for lr1. Thus, there is even less of a correlation between Total Attendance and Conference Standing than there is between Total Attendance and Win Total (for a particular team).
  • As noted earlier, the F-statistic and p-value aren’t very informative for a simple linear regression model, so we can disregard them.

Now let’s set up our equation:

From this information, we get the equation:

  • Total Attendance = -2815(Conference Standing)+1009732

Here are some scenarios using this equation:

  • Top of the conference (1st place): -2815(1)+1009732=expected total attendance of 1,006,917
  • Conference wildcard (5th place): -2815(5)+1009732=expected total attendance of 995,657
  • Bottom of the pack (16th place): -2815(16)+1009732=expected total attendance of 964,692

Finally, let’s make a graph:

As seen in the graph, few points touch the line (fewer than in the graph for lr1). The line itself has a negative slope, which implies that total attendance DECREASES with WORSE conference standings (or increases with better conference standings). Yes, I know the numbers under Conference Standing are increasing, but keep in mind that 1 is the best possible conference finish for a team, while 16 is the worst. One possible reason that total attendance decreases with lower conference standings is that fans are more enticed to come to games for consistently top conference teams and division winners (like the Patriots and Panthers) than for teams that miss the playoffs year after year (like the Jaguars, save for the 2017 squad that made it to the AFC Championship). Interestingly enough, the 2015 4-12 Dallas Cowboys rank second overall in total attendance (despite finishing 16th in their conference), just behind the 2016 13-3 Dallas Cowboys (who finished first in their conference).

Now let’s build one more model, this time using Average Age of Roster as the independent variable:
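And a sketch of the third fit; Average.Age.of.Roster is an assumed column name, with made-up stand-in data:

```r
# Miniature stand-in data; with the real data frame, only the lm() line is needed
file <- data.frame(Total.Attendance      = c(910000, 960000, 1010000, 1090000),
                   Average.Age.of.Roster = c(24.2, 25.1, 26.0, 27.3))
lr3 <- lm(Total.Attendance ~ Average.Age.of.Roster, data = file)
summary(lr3)
```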

Is this model better than lr2? Let’s find out:

  • The residual standard error is the smallest of the three (204,600 fans), and thus the percentage error is also the smallest of the three (20.75%).
  • The Multiple R-Squared and Adjusted R-Squared (0.84% and 0.22%, respectively) are smaller than those of lr1 but larger than those of lr2. Thus, Average Age of Roster correlates better with Total Attendance than Conference Standing does, though Win Total correlates best with Total Attendance.
  • Once again, we can disregard the F-statistic and its corresponding p-value.

Now let’s create the equation:

  • Total Attendance = 36556(Average Age of Roster)+33594

Here are some scenarios using this equation:

  • Roster with mostly rookies and 2nd-years (an average age of 24)=36556(24)+33594=expected total attendance of 910,938
  • Roster with a mix of newbies and veterans (an average age of 26)=36556(26)+33594=expected total attendance of 984,050
  • Roster with mostly veterans (an average age of 28)=36556(28)+33594=expected total attendance of 1,057,162

And finally, let’s create a graph:

Like the graph for lr2, few points touch the line. As for the line itself, the slope is positive, implying that Total Attendance INCREASES as Average Age of Roster increases. One possible reason for this is that fans are more interested in coming to games if the team has several veteran stars* (names like Philip Rivers, Tom Brady, Jordy Nelson, Antonio Gates, Rob Gronkowski, Richard Sherman, Julius Peppers, Marshawn Lynch, and many more) than if the team is full of rookies and/or unknowns (Myles Garrett, Sam Darnold, Josh Rosen, Leighton Vander Esch, among others). Interestingly enough, the team with the oldest roster (the 2018 Oakland Raiders, with an average age of 27.4) has the second-lowest total attendance, just ahead of the 2018 LA Chargers (with an average age of 25.8).

*I’ll use any player who has at least 6 seasons of NFL experience as an example of a “veteran star”.

So, which is the best model to use? I’d say lr1 would be the best model to use, because even though it has the highest RSE (1,828,000), it also has the best correlation between the independent and dependent variables (a Multiple R-Squared of 20.87%). All in all, according to my three analyses, a team’s Win Total has the greatest influence on how many fans go to their games (both home and away) during a particular season.

Thanks for reading, and happy Thanksgiving to you all. Enjoy your feasts (and those who are enjoying those feasts with you),

Michael

R Lesson 7: Graphing Linear Regression & Determining Accuracy of the Model


Hello everybody,

It’s Michael, and today’s post will be a continuation of the previous post because I will be covering how to graph linear regression models using the dataset and model created in the previous post.

Just to recap, this model is measuring whether the age of a person influenced how many times their name was mentioned on cable news reports (period measured is from October 1-December 7, 2017, at the beginning of the #MeToo movement). Now let’s graph the model:

The basic syntax for creating a plot like this is plot(dataset$yVariable ~ dataset$xVariable); the y-axis variable is always listed BEFORE the x-axis variable. The portion of code that reads main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions" is completely optional, as all it does is label the x- and y-axes and display a title for the graph.

The abline(linearModel) line adds a line to the graph based on the equation for the model (value=0.09171(age)-3.39221). However, for this function to work, don’t close the window displaying the graph. The line is immediately displayed on the graph after you write the line of code and hit enter.

  • Use the name of your linear regression model in the parentheses so that the line that is created matches the equation for your model.
  • Remember to always hit enter after writing the plot() line, then go back to the coding window, write the abline() line and hit enter again.
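Putting the two steps together, here is a self-contained sketch; the four-row data frame is a made-up stand-in for the full MeToo file:

```r
# Stand-in data frame and model, mirroring the lesson's setup
file <- data.frame(value = c(1, 5, 12, 30), age = c(40, 48, 55, 62))
linearModel <- lm(value ~ age, data = file)
plot(file$value ~ file$age,
     main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions")
abline(linearModel)  # adds the regression line to the plot that is still open
```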

So we know how to graph our model, but how do we evaluate the accuracy of the model? Take a look at the summary(linearModel) output below:

Focus on the last three lines of this code block, as they will help determine the accuracy (or goodness-of-fit) of this model. Here’s a better explanation of the output:

  • The residual standard error is a measure of the QUALITY of the fit of the model. Every linear regression model contains an error term (E), and as a result, there will be some margin of error in our regression line. The residual standard error refers to the amount by which the response variable (value) deviates from the true regression line. In this case, the actual number of times someone was mentioned on the news deviates from the true regression line by 16.2 mentions.
  • The R-squared is a measure of how well the model ACTUALLY fits the data. The closer the R-squared is to 1, the better the fit of the model. The main difference between Multiple R-Squared and Adjusted R-Squared is that the former isn’t dependent on the number of variables in the model while the latter is. The Adjusted R-Squared will decrease when irrelevant variables are added to the model, which makes it a good metric to use when trying to find relevant variables for your model.
    • As you can see, the Multiple R-Squared and Adjusted R-Squared values are quite low (0.47% and 0.46%, respectively), indicating that there isn’t a significant relationship between a person’s age and how many times their name was mentioned in news reports.
    • Keep the idea that “correlation does not imply causation” in mind when analyzing the R-Squared: even if a high R-Squared shows that the independent and dependent variables correlate strongly, that does not mean the independent variable causes the dependent variable (and likewise, a low R-Squared doesn’t rule out causation). In the context of this model, even though someone’s age and the number of times their name was mentioned on news reports don’t appear correlated, this doesn’t mean that someone’s age isn’t a factor in the amount of news coverage they receive.
  • The F-statistic is a measure of the relationship (or lack thereof) between the dependent and independent variables. This value (along with the corresponding p-value) isn’t really significant when dealing with simple linear regression models such as this one but it is an important metric to analyze when dealing with multiple linear regression models (just like simple linear regression except with multiple independent variables).
    • I will cover multiple linear regression models in a future post, so keep your eyes peeled.
    • In cases of simple linear regression, the F-statistic is basically the independent variable’s t-value squared. The summary output displayed above illustrates this, as the squared t-value for age (8.049² ≈ 64.79) matches the F-statistic (64.78). Keep in mind that the F-statistic = (t-value)² rule only applies to models with 1 degree of freedom, just like the one displayed above.
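The arithmetic behind that check, using the values quoted from the summary output:

```r
t_value <- 8.049   # t-value for age from the summary output
t_value^2          # about 64.79, matching the reported F-statistic of 64.78
```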

Thanks for reading,

Michael


R Lesson 6: Linear Regression


Hello everybody,

Michael here, and now that I’ve completed my series of MySQL posts (for now), I thought I’d start posting more R lessons. Today’s lesson will be on linear regression, which, like logistic regression, is another type of regression method (go back to R Lesson 4: Logistic Regression Models for a refresher).

The main difference between the two types of regression models is in the dependent variable. While you may recall that the dependent variable for logistic regression is binary (meaning there are only 2 possible outcomes), the dependent variable for linear regression is continuous (meaning it can take on a whole range of numeric values).

The dataset I will be using for this analysis is MeToo, which details the amount of media coverage regarding 45 famous men accused of some form of sexual misconduct in the wake of the MeToo movement.

The first step when working with linear regression (or any type of analysis in R) is to understand the data, which we do by creating a file variable and utilizing the read.csv command. We then use the str(file) command to display the variables in the dataset.
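A minimal sketch of that step; the filename and the two stand-in rows below are made up for illustration:

```r
# Stand-in CSV with made-up rows; with the real data, point read.csv()
# at the downloaded MeToo file instead
tmp <- tempfile(fileext = ".csv")
writeLines(c("date_start,station,value,name,age,occupation",
             "2017-10-05,CNN,18,Al Franken,66,politician",
             "2017-11-29,MSNBC,25,Matt Lauer,59,anchor"), tmp)
file <- read.csv(tmp)
str(file)  # displays each variable and its type
```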

Here’s some more context on each of the variables used:

  • date_start - The date of each news report
  • date_end - This just mentions the date of each news report along with the timestamp T23:59:59Z (denoting the end of the day); this variable will not be relevant to our analysis
  • date_resolution - This just mentions “day” alongside each corresponding date; this variable is also irrelevant to our analysis
  • station - The network that aired each news report, of which there are six in this dataset (Bloomberg, CNBC, CNN, FOX Business, FOX News, and MSNBC)
  • value - The number of times each person’s name is mentioned on a particular broadcast on a specific date
    • This only counts mentions of full names (e.g. only mentions of “Matt Lauer”, not just “Matt” or “Lauer”)
  • name - The subject of the news report (e.g. Al Franken, Blake Farenthold)
  • age - The age of the subject of the news report
  • occupation - The profession of the subject of the news report (e.g. Louis CK is a comedian)

Now that we understand our variables, let’s build the model.

In this model, I chose the variables value and age (they are the two most quantitative) and built the model on the full data (hence data = file), with value being the dependent variable (that’s why I listed it first) and age being the independent variable. The display for summary(linearModel) is very similar to what you’ll see if you ask for a summary of a logistic regression model.

  • Remember that linear regression models use lm while logistic regression models use glm!

Now, let me introduce a new command-print-which in this case will print the coefficients used in the equation of this model.
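A sketch of the model plus the print step; the miniature data frame below stands in for the full MeToo file, so the numbers it produces are illustrative only:

```r
# Stand-in data; with the real file, the lm() call is the same
file <- data.frame(value = c(2, 6, 11, 28), age = c(41, 49, 56, 63))
linearModel <- lm(value ~ age, data = file)  # lm, not glm: this is linear regression
print(linearModel)  # prints the intercept and the coefficient on age
```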

In this model, value is a function of age, so when we create the equation for our model, here’s what we get:

  • value=0.09171(age)+(-3.39221)

If this format looks familiar, note that it’s the same setup as the slope-intercept equation y=mx+b (remember this from algebra class?).

In the next post, we will learn how to graph this model.

Thanks for reading,

Michael