Logistic Regression Archives - Michael's Programming Bytes

R Analysis 1: Logistic Regression & The 2017-18 TV Season


Hello everybody,

Yes, I know you all wanted to learn about MySQL queries, but I am still preparing the database (don’t worry, it’s coming; it’s just taking a while to prepare). And since I did mention I’ll be doing analyses on this blog, that is what I will be doing in this post. It’s basically an expansion of the TV show data set from R Lesson 4: Logistic Regression Models & R Lesson 5: Graphing Logistic Regression Models, with 3 new variables.

So, as we should always do, let’s load the file into R and get an understanding of our variables, with str(file).
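Here is a minimal sketch of that loading step. The real spreadsheet isn’t reproduced here, so I write a tiny made-up CSV to a temporary file first; the header text ("# of seasons (17-18)", "2018-19 renewal?") is my guess at the original column headers, chosen because it produces the variable names used in this post.

```r
# Hypothetical stand-in for the real file: a tiny CSV written to a temp file.
# The header text is an assumption, not the actual spreadsheet.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "TV Show,# of seasons (17-18),2018-19 renewal?",
  "Show A,3,1",
  "Show B,12,0"
), csv_path)

# read.csv() makes headers into valid R names: invalid characters become
# dots, and an "X" is prepended when a name doesn't start with a letter or dot.
file <- read.csv(csv_path)

str(file)     # structure: one line per variable with its type
names(file)   # the cleaned-up column names
```

That renaming is why the season count column shows up as X..of.seasons..17.18. instead of something more readable.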


Let me explain the new variables. By the way, the numbers you see for the new variables are dummy variables (remember those?). I thought numeric codes would be a better way to categorize them.

Rating-a TV show’s parental rating (no, not how good it is)
- 1-TV-G
- 2-TV-PG
- 3-TV-14
- 4-TV-MA
- 5-Not applicable
Usual day of week-the day of the week a show usually airs its new episodes
- 1-Monday
- 2-Tuesday
- 3-Wednesday
- 4-Thursday
- 5-Friday
- 6-Saturday
- 7-Sunday
- 8-Not applicable (either the show airs on a streaming service or airs 5 days a week like a talk show or doesn’t have a consistent airtime)
Medium-what network the show airs on
- 1-Network TV (CBS, ABC, NBC, FOX or the CW)
- 2-Cable TV (Comedy Central, Bravo, HBO, etc.)
- 3-Streaming TV (Amazon, Hulu, etc.)

I decided to do three logistic regression models (one for each of the new variables). The renewed/cancelled variable (known as X2018.19.renewal.) is still the binary dependent variable, and the other independent variable I used in all three models is season count (known as X..of.seasons..17.18.).

First, remember to install (and use the library function for) the ggplot2 package. This will come in handy for the graphing portion.

Here’s my first logistic regression model, with my binary dependent variable and two independent variables (season count and rating). If you’re wondering what the output means, check out R Lesson 4: Logistic Regression Models for a more detailed explanation.

Here are two functions you need to help set up the graph. The top function helps set up the grid and designates which categorical variable you want to use in your graph. The bottom function predicts the probability of renewal for each season count within a certain category. In this case, it would be the rating category (the ones with TV-G, TV-PG, etc.)
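A rough sketch of those steps is below. I’m assuming the two functions are expand.grid() and predict(), that the rating column is called Rating and is treated as a factor, and I’m using simulated data rather than the real TV show set, so the numbers mean nothing.

```r
set.seed(1)

# Simulated stand-in data (NOT the real TV show data set)
n <- 200
shows <- data.frame(
  X..of.seasons..17.18. = sample(1:30, n, replace = TRUE),
  Rating = factor(sample(1:5, n, replace = TRUE))   # column name is an assumption
)
# Made-up relationship, used only to generate 0/1 renewal outcomes
eta <- -1 + 0.08 * shows$X..of.seasons..17.18. + 0.4 * as.integer(shows$Rating)
shows$X2018.19.renewal. <- rbinom(n, 1, plogis(eta))

# The model: renewal explained by season count and rating
model <- glm(X2018.19.renewal. ~ X..of.seasons..17.18. + Rating,
             data = shows, family = binomial)

# "Top" function: a grid with every combination of season count and rating
grid <- expand.grid(
  X..of.seasons..17.18. = 1:30,
  Rating = levels(shows$Rating)
)

# "Bottom" function: predicted probability of renewal for each grid row
grid$prob <- predict(model, newdata = grid, type = "response")
head(grid)
```

The grid has one row per season count per rating, which is what lets ggplot draw one smooth line per category later.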

Here’s the ggplot function. geom_line() creates a line for each level of your categorical variable; here, there are 5 lines for the 5 categories.
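If you want to try the plotting step on its own, here is a small sketch. The curves come from made-up coefficients passed through plogis() rather than from the fitted model, and the column names (seasons, rating, prob) are just for illustration.

```r
library(ggplot2)

# Hypothetical coefficients, used only to draw example curves
grid <- expand.grid(seasons = 1:45, rating = factor(1:5))
grid$prob <- plogis(-2 + 0.08 * grid$seasons + 0.7 * as.integer(grid$rating))

# colour = rating tells geom_line() to draw one line per rating level
p <- ggplot(grid, aes(x = seasons, y = prob, colour = rating)) +
  geom_line() +
  labs(x = "Season count",
       y = "Predicted probability of renewal",
       colour = "Rating")
p
```

Because rating is a factor, ggplot gives each level its own color and legend entry; a numeric rating would get a single continuous color scale instead.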

Here’s the graph. As you see, there are five lines, one for each of the ratings. What are some inferences that can be made?

The TV-G shows (category 1) usually have the lowest chance of renewal. In this model, a TV-G show would need to have run for approximately 22 seasons to have at least a 50% chance of renewal. (Granted, the only TV-G show in this data set is Fixer Upper, which was not renewed)
The TV-PG shows have a slightly better chance at renewal, as renewal odds for these shows are at least 25%. To reach at least a 50% chance of renewal, a TV-PG show (like The Simpsons) would only need to have run for approximately 17 seasons, not 22.
The TV-14 shows have a minimum 50% chance of renewal, regardless of how many seasons they have run. They would need to have run for at least 25 seasons to attain a minimum 75% chance of renewal, however (SNL would be the only applicable example here, as it was renewed and has run for 43 seasons).
The TV-MA shows have a minimum 76% (approximately) chance of renewal no matter how many seasons they have aired. Shows like South Park, Archer, Real Time, Big Mouth and Orange is the New Black are all TV-MA, and all of them were renewed.
The unrated shows had the best chances at renewal, as they had a minimum 92% (approximately) chance at renewal. (Granted, Watch What Happens Live! is the only unrated show on this list)
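The season counts above were read off the graph, but the same kind of number can be worked out from a model’s coefficients: the probability is 50% exactly when the log odds are 0. Here is a sketch with made-up coefficients (they are not the fitted values from this model, and seasons_for is a helper I invented for this example).

```r
# Hypothetical coefficients (NOT the fitted values from this post)
b0 <- -2.2         # intercept for one rating category
b_seasons <- 0.1   # change in log odds per extra season

# p = 0.5 exactly when the log odds are 0, so solve b0 + b_seasons * s = 0
seasons_at_half <- -b0 / b_seasons
seasons_at_half
plogis(b0 + b_seasons * seasons_at_half)   # check: probability at that point

# Any other target probability: the log odds must equal qlogis(target)
seasons_for <- function(target, intercept, slope) {
  (qlogis(target) - intercept) / slope
}
seasons_for(0.75, b0, b_seasons)
```

For a different rating category you would add that category’s estimate to the intercept before solving.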

Next, we repeat the process used to create the plot for the first model for these next two models.

What are some inferences that can be made? (I know this graph is hard to read, but we can still make observations from it.)

The orange line (representing Tuesday shows) is the lowest on the graph, so this means Tuesday shows usually had the lowest chances of renewal. This makes sense, as Tuesday shows like LA to Vegas, The Mick, and Roseanne were all cancelled.
On the other end, the pink line (representing shows that either aired on streaming services, did not have a consistent time slot, or aired every day like talk shows) is the highest on the graph, so this means shows without a regular time slot had the best chances at renewal (such as Atypical, Jimmy Kimmel Live!, and House of Cards).

What inferences can we make from this graph?

The network shows (from the 5 major broadcast networks CBS, ABC, NBC, FOX and the CW) had the lowest chances at renewal. At least 11 seasons would be needed for a minimum 50% chance of renewal.
- Some shows would include The Simpsons (29 seasons), Family Guy (16 seasons), The Big Bang Theory (11 seasons), and NCIS (15 seasons), all of which were renewed.
The cable shows (from channels such as Comedy Central, HBO, and Bravo) have a minimum 58% (approximately) chance of renewal, but at least 15 seasons would be needed for a minimum 70% chance of renewal.
- Some shows would include South Park (21 seasons) and Real Time (16 seasons), both of which were renewed.
The streaming shows (from services such as Netflix, Hulu, or CBS All Access) had the best odds for renewal (approximately 76% minimum chance at renewal). At least 30 seasons would be needed for a 90% chance at renewal.
- This doesn’t mean much in practice yet, as streaming shows have only been around since the early 2010s, so none of them is anywhere near 30 seasons.

Thanks for reading, and I’ll be sure to have the MySQL database ready so you can start learning about querying.

Michael

R Lesson 5: Graphing Logistic Regression Models


Hello everybody,

It’s Michael, and today I’ll be discussing graphing with logistic regression. This will serve as a continuation of R Lesson 4: Logistic Regression Models (I’ll be using the dataset and the models from that post).

Let’s start by graphing the second model from R Lesson 4. That’s the one that includes season count and premiere year (I feel this would be more appropriate to graph as it is the more quantitative of the two models).

Here’s the formula for the model if you’re interested (as well as the output):

Now let’s plot the model (but first, let’s remember to install the ggplot2 package).

Next we have to figure out the probabilities that each show will be renewed (or not).
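A minimal sketch of that step is below, assuming predict() with type = "response" is what produces the probabilities. The data are simulated (85 made-up shows), and Premiere.Year is my assumption for how R names the premiere year column.

```r
set.seed(2)

# Simulated stand-in data (NOT the real TV show data set)
n <- 85
shows <- data.frame(
  Premiere.Year = sample(1975:2017, n, replace = TRUE),   # assumed column name
  X..of.seasons..17.18. = sample(1:30, n, replace = TRUE)
)
shows$X2018.19.renewal. <- rbinom(
  n, 1, plogis(-0.5 + 0.1 * shows$X..of.seasons..17.18.)
)

# Model with season count and premiere year
model2 <- glm(X2018.19.renewal. ~ X..of.seasons..17.18. + Premiere.Year,
              data = shows, family = binomial)

# type = "response" gives probabilities; the default ("link") gives log odds
shows$prob <- predict(model2, type = "response")

# Shows with the highest predicted chance of renewal first
head(shows[order(-shows$prob), ])
```

Leaving out newdata makes predict() score the same rows the model was fitted on, so each show gets its own probability.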

And finally, let’s plot the model.

What are some conclusions we can draw from the model?

The shows with fewer than 25 seasons that premiered between 1975 and the early 90s (such as Roseanne, which had 10 seasons and premiered in 1988) had a predicted chance of renewal close to 0.
For shows with fewer than 25 seasons, the more recently the show premiered, the more likely it was to be renewed (as shown by the progressively brighter colors).
For the few outlier shows with more than 25 seasons (regardless of when they premiered), the predicted chance of renewal was close to 100%.
- The two notable examples would be The Simpsons (at 29 seasons) and SNL (at 43 seasons)

Thanks for reading,

Michael

R Lesson 4: Logistic Regression Models


Hello everybody,

It’s Michael, and today’s post will be the first to cover data modeling in R. The model I will be discussing is the logistic regression model. For those that don’t know, logistic regression models explore the relationship between a binary* dependent variable and one or more independent variables.

*refers to a variable with only 2 possible values, like yes/no, wrong/right, healthy/sick etc.

The data set I will be using is TV shows, which gives a list of 85 random TV shows of various genres that were airing during the 2017-18 TV season and whether or not each show was renewed for the 2018-19 TV season. So, like any good data scientist, let’s first load the file and read (as well as understand) the data.

The variables include

TV Show-the name of the TV show
Genre-the genre of the TV show
Premiere Year-the year the TV show premiered (for reboots like Roseanne, I included the premiere date of the original, not the revival)
X..of.seasons..17.18. (I’ll refer to it as season count)-how many seasons the show had aired at the conclusion of the 2017-18 TV season (in the case of revived shows like American Idol, I counted both the original run and revival, which added up to 16 seasons)
Network-the network the show was airing on at the end of the 2017-18 TV season
X2018.19.renewal. (my binary variable)-Whether or not the show was renewed for the 2018-19 TV season
- You’ll notice I used 0 and 1 for this variable; this is because it is a good idea to use dummy variables (the 0 and 1) for your binary dependent variable to help quantify qualitative data.
  - The qualitative data in this case being whether a show was renewed for the 2018-19 TV season (shown by 1) or not (shown by 0)

Now that we know the variables in our data set, let’s figure out what we want to analyze.

Let’s analyze the factors (e.g. network, genre) that affected a certain TV show’s renewal or cancellation (the binary variable represented by 0/1).

So here’s the code to build the model, using the binary dependent variable and two of the independent variables (I’ll use genre and premiere year)
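In case the screenshot doesn’t load, here is a sketch of what building such a model looks like with glm(). The data are simulated with only three made-up genres, and the column names Genre and Premiere.Year are assumptions, so don’t expect the output to match the numbers discussed in this post.

```r
set.seed(3)

# Simulated stand-in data (NOT the real TV show data set)
n <- 85
shows <- data.frame(
  Genre = factor(sample(c("Comedy", "Drama", "Reality"), n, replace = TRUE)),
  Premiere.Year = sample(1975:2017, n, replace = TRUE),
  X2018.19.renewal. = rbinom(n, 1, 0.7)
)

# family = binomial is what makes this a logistic regression
model1 <- glm(X2018.19.renewal. ~ Genre + Premiere.Year,
              data = shows, family = binomial)

summary(model1)          # call, coefficients, deviances, AIC, iterations
coef(summary(model1))    # just the coefficient table
```

The coefficient table has four columns: Estimate, Std. Error, z value, and Pr(>|z|). Genre gets one row per level except the first, which serves as the baseline.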

What does all of this output mean?

The call just reprints the model we created.
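The estimate represents the change in log odds (or logarithm of the odds) of the dependent variable when a given independent variable increases by 1 (with the other variables held constant).
- Log odds function: log(p/(1-p)), where p is the probability of renewal
- For instance, if the premiere year increases by 1 (let’s say from 2009 to 2010), the log odds that the show was renewed for the 18-19 TV season decrease by 0.06763 (the premiere year estimate is -0.06763). In terms of odds, that is a multiplication by exp(-0.06763), or about 0.935, so the odds decrease by roughly 6.5%.
Standard error measures how precisely each estimate is pinned down, that is, how much the estimate would be expected to vary from sample to sample. In the case of premiere year, the standard error is small, so the estimate is fairly precise. In the case of genre, however, the standard errors are mostly large, so those estimates are imprecise (then again, genre isn’t numerical, and each genre level gets its own estimate).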
Z-value is the ratio of the estimate to the standard error
P-value (denoted by Pr(>|z|)) helps you determine the significance of your results by giving you a number between 0 and 1
- P-values measure how strong the evidence is against your null hypothesis (the default claim that there is no effect); they don’t prove or disprove anything
  - Here, the null hypothesis is that a show’s genre and premiere year don’t affect its chances of renewal.
  - Your alternative hypothesis would be the opposite of your null hypothesis; that is, genre and premiere year do affect a show’s chances of renewal
- Small p-values (those <=0.05) indicate strong evidence against the null hypothesis, so in these cases, you can reject the null hypothesis. For p-values larger than 0.05, you fail to reject the null hypothesis (which is not the same as proving it true).
  - Since all the p-values are well above 0.05, you fail to reject the null hypothesis; this model doesn’t show a statistically significant effect of genre or premiere year on renewal
Null deviance shows how well our dependent variable (whether or not a show got renewed) is predicted by a model that includes only the intercept
Residual deviance shows how well our dependent variable (whether or not a show got renewed) is predicted by a model that includes the intercept as well as any independent variables
- As you can see here, the residual deviance is 89.496 on 71 degrees of freedom, a decrease of 20.876 from null deviance (as well as a decrease of 13 degrees of freedom).
AIC (or Akaike Information Criterion) is a way to gauge the quality of your model through comparison of related models; the point of the AIC is to prevent you from using irrelevant independent variables.
- The AIC itself is meaningless unless we have another model to compare it to, which I will include in this post.
The number of Fisher scoring iterations shows how many iterations the fitting algorithm needed to converge on the maximum likelihood estimates, 17 in this case. This number isn’t too significant.
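To see how an estimate on the log odds scale turns into a change in odds, here is a quick calculation using the premiere year estimate quoted above (-0.06763). The variable names are just mine for this example.

```r
estimate <- -0.06763   # premiere year estimate from the output above

# exp() of a log odds estimate gives the multiplicative change in the odds
odds_ratio <- exp(estimate)
pct_change <- (odds_ratio - 1) * 100

round(odds_ratio, 4)   # about 0.9346
round(pct_change, 2)   # about -6.54, i.e. odds drop by roughly 6.5% per year
```

For small estimates the percentage change is close to the estimate times 100, but the two drift apart as the estimate gets larger, so it’s safer to always use exp().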

Now let’s create another model, this time including season count in place of genre.

How does this compare to the previous model?

There is a smaller difference between null & residual deviance (12.753 and 2 degrees of freedom, as opposed to 20.876 and 13 degrees of freedom)
The AIC is 13.88 smaller than that of the previous model, which indicates a better quality of the model
The number of Fisher scoring iterations is also lower than the previous model (5 as opposed to 17), which means the fitting algorithm took fewer iterations to converge on the maximum likelihood estimates
The estimate for premiere year also increased
- This time, if premiere year increases by 1, the log odds that a show was renewed for the 2018-19 TV season increase by about 0.1481 (roughly a 16% increase in the odds), rather than decrease.
- If season count increases by 1 (say from 4 to 5 seasons), then the log odds a show was renewed increase by about 0.3145 (roughly a 37% increase in the odds)
The asterisk by season count just indicates which significance range its p-value (denoted by Pr(>|z|)) falls in
- The p-value of season count is >0.01 but <0.05 (which makes perfect sense, as Pr(>|z|) is 0.027)
Let’s create two null hypotheses: premiere year doesn’t affect a show’s chances of renewal, and season count doesn’t affect a show’s chances of renewal (we are treating these as separate hypotheses).
- The p-value for premiere year is greater than 0.05, so we fail to reject this null hypothesis.
  - In other words, this model doesn’t show a significant effect of premiere year on a show’s chances for renewal.
- The p-value for season count is less than 0.05, so reject this null hypothesis.
  - In other words, season count did have a significant effect on a show’s chances for renewal.
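As a sanity check on the AIC comparison, the 13.88 difference can be rebuilt from the deviance numbers quoted in this post. This sketch assumes both models were fitted to the same 85 shows (so they share a null deviance) and relies on the fact that, for a 0/1 dependent variable, AIC equals the residual deviance plus 2 times the number of estimated coefficients.

```r
# Numbers quoted in this post
resid_dev1 <- 89.496   # residual deviance, genre + premiere year model
dev_drop1  <- 20.876   # its drop from the null deviance
df_used1   <- 13       # degrees of freedom it used
dev_drop2  <- 12.753   # drop for the season count + premiere year model
df_used2   <- 2

null_dev   <- resid_dev1 + dev_drop1   # shared null deviance
resid_dev2 <- null_dev - dev_drop2     # residual deviance of the second model

# Coefficients = degrees of freedom used + 1 for the intercept
aic1 <- resid_dev1 + 2 * (df_used1 + 1)
aic2 <- resid_dev2 + 2 * (df_used2 + 1)

aic1 - aic2   # 13.877, which rounds to the 13.88 mentioned above
```

So the second model fits a little worse in raw deviance terms, but it uses 11 fewer coefficients, and that is what the AIC rewards.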

That’s all for now. Thanks for reading.

Michael
