R Lesson 7: Graphing Linear Regression & Determining Accuracy of the Model

Hello everybody,

It’s Michael, and today’s post continues the previous one: I will be covering how to graph linear regression models using the dataset and model created there.

Just to recap, this model is measuring whether the age of a person influenced how many times their name was mentioned on cable news reports (period measured is from October 1-December 7, 2017, at the beginning of the #MeToo movement). Now let’s graph the model:

[Screenshots: the plot() and abline() code, and the resulting scatterplot with the regression line]

The basic syntax for creating a plot like this is plot(dataset name$y-axis~dataset name$x-axis); the name of the y-axis variable is always listed BEFORE the name of the x-axis variable. The portion of code that reads main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions" is completely optional; all it does is give the graph a title and label the x- and y-axes. Note that the title and labels are text, so they need to be in quotation marks.

The abline(linearModel) line adds a line to the graph based on the equation for the model (value=0.09171(age)-3.39221). Because abline() draws on the plot that is currently open, don’t close the window displaying the graph before you run it. The line appears on the graph as soon as you write the line of code and hit enter.

  • Use the name of your linear regression model in the parentheses so that the line that is created matches the equation for your model.
  • Remember to always hit enter after writing the plot() line, then go back to the coding window, write the abline() line and hit enter again.
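
Put together, the two steps might look like the sketch below. The original dataset isn't reproduced in this post, so the example uses a small simulated data frame; the name newsData and its simulated values are assumptions for illustration only, while linearModel, age, and value follow the names used in this post.

```r
# Hypothetical stand-in for the dataset from the previous post (simulated values)
set.seed(42)
newsData <- data.frame(age = sample(25:90, 200, replace = TRUE))
newsData$value <- pmax(0, round(0.09 * newsData$age - 3.4 + rnorm(200, sd = 16)))

# Fit the simple linear regression: value (y) explained by age (x)
linearModel <- lm(value ~ age, data = newsData)

# Scatterplot: the y variable goes BEFORE the ~, the x variable after it
plot(newsData$value ~ newsData$age,
     main = "#MeToo coverage", xlab = "Age", ylab = "Number of Mentions")

# Add the fitted regression line to the plot that is still open
abline(linearModel)
```

Because the data are simulated, the fitted line will not match the equation above exactly; the point is only the order of the calls.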

So we know how to graph our model, but how do we evaluate the accuracy of the model? Take a look at the summary(linearModel) output below:

[Screenshot: summary(linearModel) output]

Focus on the last three lines of this output, as they will help determine the accuracy (or goodness-of-fit) of this model. Here’s a closer look at each one:

  • The residual standard error is a measure of the QUALITY of the fit of the model. Every linear regression model contains an error term (E), so the observed data points will not fall exactly on the regression line. The residual standard error is the typical amount by which the response variable (value) deviates from the regression line. In this case, the actual number of times someone was mentioned on the news deviates from the regression line by about 16.2 mentions on average.
  • The R-squared is a measure of how well the model ACTUALLY fits the data: it is the proportion of the variation in the dependent variable that the model explains. The closer R-squared is to 1, the better the fit of the model. The main difference between Multiple R-Squared and Adjusted R-Squared is that the former never decreases when variables are added to the model, whether or not they are relevant, while the latter is adjusted for the number of variables. The Adjusted R-Squared tends to decrease when variables irrelevant to the model are added, which makes it a good metric to use when trying to find relevant variables for your model.
    • As you can see, the Multiple R-Squared and Adjusted R-Squared values are quite low (0.47% and 0.46% respectively), indicating that age explains less than half a percent of the variation in how many times a person’s name was mentioned in news reports. In other words, age on its own is a very weak predictor of news coverage.
    • Keep the idea that “correlation does not imply causation” in mind when analyzing the R-Squared. Even when a high R-Squared shows that the independent variable and dependent variable are strongly correlated, this does not indicate that the independent variable causes the dependent variable. A similar caution applies to models with a low R-Squared: in the context of this model, even though someone’s age and the number of times their name was mentioned on news reports don’t appear correlated, this doesn’t mean that someone’s age isn’t a factor in the amount of news coverage they receive.
  • The F-statistic is a measure of the relationship (or lack thereof) between the dependent and independent variables. This value (along with the corresponding p-value) adds little when dealing with simple linear regression models such as this one, because it carries the same information as the t-test for the single independent variable, but it is an important metric to analyze when dealing with multiple linear regression models (just like simple linear regression except with multiple independent variables).
    • I will cover multiple linear regression models in a future post, so keep your eyes peeled.
    • In cases of simple linear regression, the F-statistic is the independent variable’s t-value squared. The summary output displayed above bears this out: the t-value for age (8.049), squared, equals the F-statistic (64.78) up to rounding. Keep in mind that the F-statistic=(t-value)² rule only applies to models with a single independent variable (1 numerator degree of freedom), just like the one displayed above.
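
If you want to check these numbers yourself, the pieces of the summary can be pulled out by name. The sketch below again uses simulated data (the newsData data frame is hypothetical, not the dataset from the previous post), so its values will not match the 16.2 or 64.78 shown above. It only illustrates where each figure lives in the summary object, how the Adjusted R-Squared is derived from the Multiple R-Squared, and that F equals t² when there is one independent variable.

```r
# Hypothetical stand-in for the dataset from the previous post (simulated values)
set.seed(42)
newsData <- data.frame(age = sample(25:90, 200, replace = TRUE))
newsData$value <- 0.09 * newsData$age - 3.4 + rnorm(200, sd = 16)
linearModel <- lm(value ~ age, data = newsData)

modelSummary <- summary(linearModel)

rse   <- modelSummary$sigma                  # residual standard error
r2    <- modelSummary$r.squared              # Multiple R-squared
adjR2 <- modelSummary$adj.r.squared          # Adjusted R-squared
fStat <- unname(modelSummary$fstatistic["value"])               # F-statistic
tAge  <- unname(modelSummary$coefficients["age", "t value"])    # t-value for age

# Adjusted R-squared recomputed by hand:
# n observations, p independent variables
n <- nrow(newsData)
p <- 1
adjByHand <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(rse = rse, r2 = r2, adjR2 = adjR2))
print(all.equal(adjR2, adjByHand))   # hand calculation vs. summary()
print(all.equal(fStat, tAge^2))      # F-statistic vs. squared t-value
```

Both comparisons should come back TRUE for any simple linear regression, whatever data you fit it to.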

Thanks for reading,

Michael

 
