R Lesson 6: Linear Regression

Hello everybody,

Michael here, and now that I’ve completed my series of MySQL posts (for now), I thought I’d start posting more R lessons. Today’s lesson will be on linear regression, which, like logistic regression, is a type of regression method (go back to R Lesson 4: Logistic Regression Models for a refresher).

The main difference between the two types of regression models is in the dependent variable. You may recall that the dependent variable for logistic regression is binary (meaning there are only 2 possible outcomes); the dependent variable for linear regression is continuous (meaning it can take any numeric value within a range).

The dataset I will be using for this analysis is MeToo, which details the amount of media coverage regarding 45 famous men accused of some form of sexual misconduct in the wake of the MeToo movement.

The first step when working with linear regression (or any type of analysis in R) is to understand the data, which we do by reading the CSV into a variable named file with the read.csv command. We then use the str(file) command to display the variables in the dataset.
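Here is a minimal sketch of that loading step. I can't reproduce the real MeToo file here, so the rows below are invented toy data (the names, dates, and counts are all hypothetical), written to a temporary CSV just so the snippet runs on its own. With the real dataset you would simply point read.csv at wherever you saved it.

```r
# Toy rows invented for illustration; they are NOT from the real MeToo dataset
toy <- data.frame(
  date_start = c("2017-11-01", "2017-11-02", "2017-11-03"),
  station    = c("CNN", "MSNBC", "FOX News"),
  value      = c(4, 0, 7),
  name       = c("Person A", "Person B", "Person C"),
  age        = c(52, 61, 45),
  occupation = c("comedian", "politician", "journalist")
)

# Write the toy rows to a temporary CSV so there is something to read back
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# The two commands from the lesson: load the CSV, then inspect its structure
file <- read.csv(path)
str(file)
```

str lists each column with its type and its first few values, which is a quick way to spot which variables are numeric.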

Here’s some more context on each of the variables used:

  • date_start: The date of each news report
  • date_end: This just repeats the date of each news report along with the timestamp T23:59:59Z (denoting the end of the day); this variable will not be relevant in our analysis.
  • date_resolution: This just says “day” alongside each corresponding date; this variable is also irrelevant to our analysis.
  • station: The network that covers each news report, of which there are six in this dataset (Bloomberg, CNBC, CNN, FOX Business, FOX News, and MSNBC)
  • value: The number of times each person’s name is mentioned on a particular broadcast on a specific date
    • This only counts times where full names are mentioned (e.g. only mentions of “Matt Lauer”, not just “Matt” or “Lauer”)
  • name: The subject of the news report (e.g. Al Franken, Blake Farenthold)
  • age: The age of the subject of the news report
  • occupation: The profession of the subject of the news report (e.g. Louis CK is a comedian)

Now that we understand our variables, let’s build the model.

In this model, I chose the variables value and age (the two most clearly quantitative variables) and built the model on the full data (hence data = file). value is the dependent variable, which is why I listed it first, and age is the independent variable. The display for summary(linearModel) is very similar to what you’ll see if you ask for a summary of a logistic regression model.

  • Remember that linear regression models use lm while logistic regression models use glm!
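For reference, here is roughly what that model-building step looks like. The names linearModel and file come from this lesson; the formula value ~ age is my best reconstruction of "value listed first", and the five rows of data are invented purely so the snippet runs by itself.

```r
# Toy data invented for illustration (not the MeToo dataset)
file <- data.frame(
  age   = c(30, 40, 50, 60, 70),
  value = c(1, 2, 2, 4, 5)
)

# Dependent variable goes on the left of ~, independent variable on the right
linearModel <- lm(value ~ age, data = file)

# Coefficients, standard errors, p-values, R-squared, and so on
summary(linearModel)
```

Swapping lm for glm (plus a family argument such as binomial) is what turns this into the logistic setup from Lesson 4.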

Now, let me introduce a new command, print, which in this case will print the coefficients used in the equation of this model.
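Two plausible ways to do this are sketched below; I'm assuming the model object is called linearModel as above, and the data is again an invented toy set so the code is self-contained. Printing the model shows the call and the coefficients, while coef pulls out just the coefficients as a named vector.

```r
# Toy data invented for illustration (not the MeToo dataset)
file <- data.frame(
  age   = c(30, 40, 50, 60, 70),
  value = c(1, 2, 2, 4, 5)
)
linearModel <- lm(value ~ age, data = file)

# Option 1: print the model itself (shows the call and the coefficients)
print(linearModel)

# Option 2: print only the coefficients as a named numeric vector
coefficients <- coef(linearModel)
print(coefficients)
```

The entry named (Intercept) is the constant term, and the entry named age is the multiplier on age.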

In this model, value is a function of age, so when we create the equation for our model, here’s what we get:

  • value = 0.09171(age) + (-3.39221)

If this format looks familiar, note that it is the same setup as the slope-intercept equation y=mx+b (remember this from algebra class?). Here the slope m is 0.09171 and the intercept b is -3.39221.
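To make the equation concrete, here is a small sketch that plugs an age into it. The two coefficients are the ones from the model above; the helper name predict_value and the example age of 50 are just my own picks for illustration.

```r
# Coefficients taken from the model's equation
slope     <- 0.09171
intercept <- -3.39221

# y = mx + b, with age as x and predicted mentions (value) as y
predict_value <- function(age) {
  slope * age + intercept
}

predict_value(50)  # 0.09171 * 50 - 3.39221 = 1.19329
```

If you have the fitted model object on hand, predict(linearModel, newdata = data.frame(age = 50)) does the same arithmetic for you.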

In the next post, we will learn how to graph this model.

Thanks for reading,

Michael

 
