R Lesson 9: Time Series Data


Hello everybody,

It’s Michael, and today’s post will be on time series analysis, which is the analysis of time-dependent data over a certain period (such as the weather in Miami, Florida over the course of 2018, the Cleveland Browns’ records over the last decade, or the price of bitcoin over the last two years, just to give some examples).

In the post, I will be utilizing search data from Google Trends to analyze how often certain famous people are searched. For those that don’t know, Google Trends is a fascinating tool that allows you to see how often something (whether a food, person, event, animal, etc.) is searched on Google over a certain timeframe (all the way back to January 1, 2004). Google Trends also has several fascinating analyses, like The Year in Search (which details the most popular worldwide Google searches in a given year).

Here’s the spreadsheet-Google Trends-I swapped out the “<1” values for 0s so that R would read everything as an integer and not a factor.

Anyway, I’ll be making several graphs, analyzing two people at a time in each graph that have something in common.

Now, let’s load our file and try to understand the data:


This data basically consists of 52 dates (shown by the Week variable) and the search popularity in the US for 22 people over the last year (from the week of December 17, 2017 to the week of December 9, 2018). The numbers 0-100 are a relative metric for how often a certain person’s name was searched in a given week: 0 means there either wasn’t enough data or the name wasn’t searched at all, while 100 marks the week of peak search interest for that name over the timeframe (Trends numbers are scaled to the peak, not raw search counts). All the dates listed are Sundays (12/17/17, 12/24/17, etc.), meaning that in this case, a week runs from Sunday to Saturday.

Now, before we start graphing, we need to be sure the strings in the Week variable are converted to dates for the purpose of the graph, which is what this line does (more specifically, the strings are parsed using the month/day/year format, exactly the way the dates are listed in the spreadsheet).
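As a quick sketch of that conversion (using a few hypothetical Week strings rather than the full spreadsheet):

```r
# A few example Week strings in the month/day/year format used in the spreadsheet
week_strings <- c("12/17/17", "12/24/17", "12/31/17")

# as.Date() parses them into real Date objects; "%m/%d/%y" matches
# month/day/two-digit-year
weeks <- as.Date(week_strings, format = "%m/%d/%y")
weeks  # Date objects that ggplot2 can place on a continuous time axis
```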

Now time to graph (remember to install the ggplot2 package). I will be looking at two people (who have something in common) at a time and doing a comparative analysis.
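Roughly, each of those comparison graphs can be sketched like this (the column names and randomly generated data here are hypothetical stand-ins for the real spreadsheet):

```r
library(ggplot2)

# Toy stand-in for the Trends data: 52 weekly search metrics for two people
set.seed(42)
trends <- data.frame(
  Week    = as.Date("2017-12-17") + 7 * (0:51),
  Person1 = sample(0:100, 52, replace = TRUE),
  Person2 = sample(0:10, 52, replace = TRUE)
)

# One colored line per person, with the Week dates on the x-axis
p <- ggplot(trends, aes(x = Week)) +
  geom_line(aes(y = Person1, color = "Person 1")) +
  geom_line(aes(y = Person2, color = "Person 2")) +
  labs(y = "Search metric", color = NULL)
print(p)
```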

I’ll start by analyzing Jared Fogle and Bill Cosby-two celebrities who had very public falls from grace and are both currently incarcerated.

As you can see, Bill Cosby was more popular than Jared Fogle in American Google Searches. This is likely because Fogle has been incarcerated for his crimes since November 2015, while Cosby was re-tried, convicted and ultimately sent to prison in the span of five months (April-September 2018). Cosby had plenty of legal drama this year, which could explain the greater fluctuation in his graph (compared to Fogle’s). Cosby also has two major peaks in his search history graph for the weeks of April 22 and September 23-the weeks he was convicted and sent to prison, respectively.

  • The peaks aren’t the only things you should be analyzing. Check the numbers on the y-axis to get an idea of the maximum search metric. For instance, Jared Fogle’s highest search metric is 1, while Bill Cosby’s is 100. This indicates that far more Americans searched for Cosby’s name than Fogle’s (Cosby was also the more newsworthy of the two this past year).

Now let’s analyze US search history trends for Kyle Kulinski and Ana Kasparian-two famous left-wing commentators. Kulinski hosts the Secular Talk YouTube channel while Kasparian is a member of progressive YouTube news channel The Young Turks.

As you can see, even though Kasparian’s graph has more fluctuation than Kulinski’s, searches for Kulinski’s name were more popular than searches for Kasparian because the search history metric for Kulinski goes above 75 two times, while the metric for Kasparian doesn’t exceed 25. The discrepancy between Kulinski’s search history metric and Kasparian’s could be because more people subscribe to Secular Talk than The Young Turks (I’m just theorizing here).

Now let’s analyze the search metric history for Mike Shinoda and Chester Bennington-two members of Linkin Park.

As you can see, Chester Bennington’s highest metric is 100, while Mike Shinoda’s highest metric is 44. I’m guessing the reason Bennington’s metric is higher is that many people still enjoy listening to Linkin Park’s music-and hear his voice-after his death. Also worth noting is that Bennington’s metric peaked on the week of July 15, which was around the one-year anniversary of his death on 7-20-17. Shinoda’s search history metric peaked on the week of June 17, which was when his solo album Post Traumatic was released in its entirety (and which he created after Bennington’s death).

Now time to compare the American search metric history for JaMarcus Russell and Ryan Leaf-two of the biggest NFL busts of all time (and both quarterbacks).

As you can see, Leaf’s graph has more fluctuation than Russell’s, but Russell’s graph peaks at 100 while Leaf’s only peaks at 26. Then again, the search history metric average for Russell is 5.7 and for Leaf is only 5.3, meaning neither individual’s name widely pops up in US Google Searches. However, the one thing that can explain Russell’s peak of 100 on the week of November 4 could be this article with an interesting story about Russell-https://bleacherreport.com/articles/2804453-david-diehl-raiders-gave-jamarcus-russell-blank-tapes-to-see-if-qb-watched-film.

Now let’s compare the search history metrics of Dwyane Wade and Hassan Whiteside, two current Miami Heat players.

As you can see, Wade’s graph peaks higher than Whiteside’s (100 to Whiteside’s 20). This is likely because Wade had a more eventful year than Whiteside, as he returned to the Heat (week of February 4), announced his retirement (week of September 16), welcomed another baby (week of November 4), and played in his 1000th career game (week of December 9).

Now time to analyze the search history metrics for Samuel J Comroe and Shin Lim-two contestants on AGT Season 13. Samuel J Comroe was a stand-up comedian who finished in 4th place, while Shin Lim was a close-up magician who finished as the season’s winner.

As you can see, Shin Lim’s peak is much higher than Samuel J Comroe’s (100 to 8, respectively). Neither contestant has much fluctuation in their graphs, but both peak on the week of September 16 (this was the week of the AGT Finals, which both Comroe and Lim competed in and finished in the Top 5).

Now let’s analyze the search history metrics for Tom Brady and Nick Foles-the two starting quarterbacks for Super Bowl LII.

As you can see, neither QB’s graph fluctuates much. Both graphs hit their peaks on the weeks of January 21 (AFC/NFC Championships) and February 4 (Super Bowl LII). Interestingly enough, Brady’s graph has the higher peak (100 to Foles’s 54), even though Foles and the Eagles won the Super Bowl. I guess this means that Brady is still the more popular of the two QBs (after all, Foles was a backup after the Eagles lost their main QB Carson Wentz).

Now time to analyze Alexandria Ocasio-Cortez and Rick Scott, two politicians who got elected to Congress during the 2018 midterm elections. Ocasio-Cortez (D-NY) got elected to the House and Scott (R-FL) got elected to the Senate.

Both Scott’s and Ocasio-Cortez’s graphs have relatively high peaks (100 for Scott and 61 for Ocasio-Cortez) since both had quite eventful elections. Ocasio-Cortez’s graph peaks on the weeks of June 24 and November 4, which were the week of her stunning primary upset against 10-term Democrat Joe Crowley and the week of her eventual election to the House, respectively. Scott’s graph also peaks on the week of November 4, which was the week he got elected to the Senate (this was right before the tense recount between him and incumbent Bill Nelson, after which Scott was confirmed the winner). One reason I think Scott’s graph has the higher peak is that his name is the more recognized of the two; after all, Scott was governor of Florida when he got elected to the Senate, while Ocasio-Cortez was a relatively unknown bartender when she won the primary and, eventually, the House seat.

Now time to analyze Meghan Markle and Kate Middleton, two women who had very public (and televised) royal weddings (Markle’s being this year while Middleton’s was in 2011). The women’s husbands also happen to be brothers-Prince William (Middleton’s husband) and Prince Harry (Markle’s husband).

Markle’s graph has a much higher peak than Middleton’s (100 to Middleton’s 17), most likely because her royal wedding was this year, while Middleton’s was in 2011. Unsurprisingly, Markle’s graph peaks on the week of May 13, which was the week of her royal wedding. Some other reasons why Markle’s graph peaks higher than Middleton’s could be that Markle is one of the few Americans to marry into British royalty (Wallis Simpson, who married the former King Edward VIII in 1937, is another notable example), she’s also one of the first biracial royal fiancées, she’s older than Prince Harry (most royal grooms are older than the brides), and she was quite famous in the US, having had an extensive acting career on shows like Suits.

The next analysis will be comparing Fred Guttenberg and Andrew Pollack, two Parkland parent-activists who lost their daughters in the Stoneman Douglas shooting.

Both individuals have high peaks (Pollack at 100, Guttenberg at 61) likely because both parents have appeared on several media outlets (CNN, Fox News, etc.) plenty of times since the shooting. One reason I think Pollack’s graph peaks higher than Guttenberg’s is because unlike many of the Parkland students and parents, he isn’t campaigning for tighter gun laws. A photo of Pollack in a Trump 2020 shirt also got considerable attention during the few days after the shooting-this could also explain the higher peak.

Finally, let’s analyze Mikaela Shiffrin and Maia Shibutani, two female participants in this year’s Winter Olympics in PyeongChang. Shiffrin is an alpine skier specializing in slalom skiing while Shibutani is a figure skater who competes with her older brother Alex.

Both graphs are pretty stagnant, save for a single peak (Shiffrin’s occurring on the week of February 11 and Shibutani’s occurring on the week of February 18, both during the 2018 Winter Olympics). Shiffrin’s peak is much higher though (100 compared to Shibutani’s 7), likely because Shiffrin won gold and silver while Shibutani won only bronze.

Now, before I go, remember that just because a graph fluctuates a lot doesn’t mean the search history metric is always going to be very high. R adjusts the scales on the graphs based on the highest number in a column.

Thanks for reading and happy holidays,

Michael

R Lesson 8: Predictions For Linear & Logistic Regression/Multiple Linear Regression


Hello everybody,

It’s Michael, and today’s lesson will be about predictions for both linear and logistic regression models. I will be using the same dataset that I used for R Analysis 2: Linear Regression & NFL Attendance, except I added some variables so I could create both linear and logistic regression models from the data. Here is the modified dataset-NFL attendance 2014-18

Now, as always, let’s first try to understand our variables:

I described most of these variables in R Analysis 2, but here are what the two new ones mean (I’m referring to the two bottommost variables):

  • Playoffs-whether or not a team made the playoffs. Teams that made the playoffs are represented by a 1, while teams that didn’t are represented by a 0. Recall that teams that finished 1st-6th in their respective conferences made the playoffs, while teams that finished 7th-16th did not.
  • Division-What division a team belongs to, of which there are 8:
    • 1-AFC East (Patriots, Jets, Dolphins, Bills)
    • 2-AFC North (Browns, Steelers, Ravens, Bengals)
    • 3-AFC South (Colts, Jaguars, Texans, Titans)
    • 4-AFC West (Chargers, Broncos, Chiefs, Raiders)
    • 5-NFC East (Cowboys, Eagles, Giants, Redskins)
    • 6-NFC North (Packers, Bears, Vikings, Lions)
    • 7-NFC South (Falcons, Saints, Panthers, Buccaneers)
    • 8-NFC West (Seahawks, 49ers, Cardinals, Rams)

I added these two variables so that I could create logistic regression models from the data. In both cases, I used dummy variables (remember those?).

Another function I think will help you in your analyses is sapply. Here’s how it works:

As you can see, you can do two things with sapply-find out if there are any missing values (as seen in the top function) or find out how many unique values there are for a certain variable (as seen in the bottom function). According to the output, there are no missing values for any variables (in other words, there are no blank spots in any column of the spreadsheet). Also, in the bottom function, you can see how many distinct values correspond to a certain variable (e.g. Conference Standing has 16 distinct values).
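Here’s a small self-contained sketch of both uses of sapply (with a toy data frame in place of the real spreadsheet):

```r
# Toy data frame standing in for the NFL attendance spreadsheet
df <- data.frame(
  Team = c("Browns", "Steelers", "Ravens", "Bengals"),
  Wins = c(0, 13, 9, 7)
)

# 1) Count the missing values in every column at once
na_counts <- sapply(df, function(x) sum(is.na(x)))
na_counts   # all zeros here: no blank spots in any column

# 2) Count how many distinct values each column has
distincts <- sapply(df, function(x) length(unique(x)))
distincts
```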

Before I get into analysis of the models, I want to introduce two new concepts-training data and testing data:

The difference between training and testing data is that the model (whether linear or logistic) is fit on the training data, while the testing data is held out to give us an idea of how well the fitted model performs on observations it hasn’t seen. When splitting up your data, a good rule of thumb is 80-20, meaning that 80% of the data should be for training while 20% should be for testing (it doesn’t have to be exactly 80-20, but the majority of the data should always go to training and the minority to testing). In this model, observations 1-128 are part of the training dataset while observations 129-160 are part of the testing dataset.
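With 160 observations, the 80-20 rule works out to 128 training rows and 32 testing rows; the split itself is just row indexing (toy columns here stand in for the real dataset):

```r
# 160 toy observations standing in for the NFL dataset
nfl <- data.frame(
  Win.Total = sample(0:16, 160, replace = TRUE),
  Playoffs  = rbinom(160, 1, 0.4)
)

# 80-20 split: rows 1-128 for training, rows 129-160 for testing
train <- nfl[1:128, ]
test  <- nfl[129:160, ]
```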

I will post four models in total-two using linear regression and two using logistic regression. I will start with the logistic regression:

In this model, I chose Playoffs as the binary dependent variable and Division and Win Total as the independent variables. As you can see, the intercept and Win Total are statistically significant, while Division is not. Also, notice the data = train argument, which indicates that the training dataset will be used for this analysis (you should always fit the model on the training dataset).
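A model like this is typically fit with glm() and family = binomial; here’s a minimal sketch with toy data and hypothetical column names (the real spreadsheet’s names may differ):

```r
# Toy training data with the same roles as the real columns
set.seed(1)
train <- data.frame(
  Division  = sample(1:8, 128, replace = TRUE),
  Win.Total = sample(0:16, 128, replace = TRUE)
)
# Make playoff chances actually depend on wins so the model has signal
train$Playoffs <- rbinom(128, 1, plogis(train$Win.Total - 8))

# Logistic regression: binary Playoffs on Division and Win.Total,
# fit on the training dataset only (data = train)
model <- glm(Playoffs ~ Division + Win.Total,
             family = binomial(link = "logit"), data = train)
summary(model)
```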

Now let’s create some predictions using our test dataset:

The fitted.results variable holds the model’s predicted probabilities for each observation in our test dataset (observations 129-160), and the ifelse function converts each probability into a predicted class: a 1 under an observation number means the model gives that team at least a 50% chance of making the playoffs, while a 0 means the predicted chance is below 50%.

If we wanted to figure out how often those predictions match the actual outcomes (in other words, the overall accuracy of the model), here’s how:

The misClasificError variable is the model’s misclassification rate on the test dataset-the share of fitted.results that don’t match the actual outcomes. The accuracy is calculated by subtracting the misClasificError from 1, which turns out to be 87%, indicating very good accuracy (and a margin of error of 13%).
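Putting the prediction and accuracy steps together in one self-contained sketch (toy data and hypothetical column names again; note that an accuracy computed this way always lands between 0 and 1):

```r
set.seed(1)
nfl <- data.frame(
  Division  = sample(1:8, 160, replace = TRUE),
  Win.Total = sample(0:16, 160, replace = TRUE)
)
# Playoff odds depend on wins so the model has something to learn
nfl$Playoffs <- rbinom(160, 1, plogis(nfl$Win.Total - 8))

train <- nfl[1:128, ]
test  <- nfl[129:160, ]
model <- glm(Playoffs ~ Division + Win.Total, family = binomial, data = train)

# Predicted probabilities on the held-out rows, cut at 0.5 into classes 0/1
fitted.results <- ifelse(predict(model, newdata = test, type = "response") > 0.5, 1, 0)

# Share of wrong predictions, and accuracy as its complement
misClasificError <- mean(fitted.results != test$Playoffs)
accuracy <- 1 - misClasificError
accuracy
```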

Finally, let’s plot the model:

We can also predict various what-if scenarios using the model and the predict function. Here’s an example:

Using the AFC South as an example, I calculated the possible odds for a team in that division to make the playoffs based on various possible win totals. As you can see, an AFC South team with 10 or 14 wins is all but guaranteed to make the playoffs, as the predicted values for both of those win totals are greater than 1. However, AFC South teams with only 2 or 8 wins aren’t likely to make the playoffs, because the predicted values for both of those win totals are negative (though 8 wins fares better than 2). Note that these predictions are on the log-odds scale rather than the probability scale-that’s the only way negative values can appear-and a positive log-odds value means a better-than-even chance.
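The what-if call itself looks roughly like this (self-contained toy model, hypothetical column names). By default, predict() on a glm returns log-odds, which is how negative values can appear; type = "response" converts them to probabilities:

```r
set.seed(1)
train <- data.frame(
  Division  = sample(1:8, 128, replace = TRUE),
  Win.Total = sample(0:16, 128, replace = TRUE)
)
train$Playoffs <- rbinom(128, 1, plogis(train$Win.Total - 8))
model <- glm(Playoffs ~ Division + Win.Total, family = binomial, data = train)

# What-if: an AFC South team (Division 3) at several win totals
scenarios <- data.frame(Division = 3, Win.Total = c(2, 8, 10, 14))
predict(model, newdata = scenarios)                              # log-odds scale
probs <- predict(model, newdata = scenarios, type = "response")  # probabilities
probs
```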

Let’s try another example, this time examining the effects of 9 wins across all 8 divisions (I chose 9 because 9 wins sometimes results in playoff berths, sometimes it doesn’t):

As you can see, 9 wins will most likely earn a playoff berth for AFC East teams (55.6% chance) and is least likely to earn a playoff spot in the NFC West (35.7% chance).

I know it looks like all the lines are squished into one big line, but you can infer that the more wins a team has, the greater its chances are of making the playoffs. The pink line that appears to be the most visible represents the NFC West (Rams, Seahawks, 49ers, Cardinals). Unsurprisingly, the teams likeliest to make the playoffs were the teams with 9 or more wins (except for the 2017 Seahawks, who finished 9-7 and missed the playoffs).

Now let’s create another logistic regression model that is similar to the last one except with the addition of the Total Attendance variable

The summary output looks similar to that of the previous model (I also used the training dataset for this model), except that this time, none of the variables have asterisks next to them, meaning none of them are statistically significant (which happens when the p-value is above 0.1). Nevertheless, I’ll still analyze this model to see if it is better than my first logistic regression model.

Now let’s create some predictions using the test dataset:

Like our previous model, this model also has a nice mix of 0s and 1s, except this model only has 11 1s, while the previous model had 14 1s.

And now let’s find the overall accuracy of the model:

Ok, so I know 379% seems like crazy accuracy for a logistic regression model; in fact, accuracy can’t exceed 100%, so something must be off. Here’s how it was calculated:

R took the sum of these numbers, divided that sum by 32 to find the average of the fitted results, and then subtracted that average from 1 to get the accuracy measure. Since a real accuracy always lands between 0% and 100%, a figure of 379% means the average was taken over the wrong quantity (most likely the raw predictions rather than the 0/1 misclassification indicator), so this number shouldn’t be trusted.

Just as we did with the first model, we can also create what-if scenarios. Here’s an example:

Using the AFC North as an example, I analyzed the effect of win total on a team’s playoff chances while keeping total attendance the same (1,400,000). Unsurprisingly (if total attendance is roughly 1.4 million fans in a given season), teams with a losing record (7-8-1 or lower) are less likely to make the playoffs than teams with a split or winning record (8-8 or higher). Given both record and a total attendance of 1,400,000 fans, the threshold for clinching a playoff berth appears to be 12 or 13 wins (though barring attendance, most AFC North teams fare well with 10, 9, or even 8 wins).

Now here’s another example, this time using the NFC East (and changing both win totals and total attendance):

So given increasing win totals and total attendance, an NFC East team’s playoff chances increase. The playoff threshold here, just as it has been with most of my predictions, is 9 or 10 wins.

Now let’s see what happens when win totals increase but attendance goes down (also using the NFC East):

Ultimately (with regards to the NFC East), it’s not total attendance that matters, but a team’s win totals. As you can see, regardless of total attendance, playoff clinching odds increase with higher win totals (win threshold remains at 9 or 10).

And here’s our model plotted:

Now, I know this graph is just about as easy-to-read as the last graph (not very, but that’s how R works), but just like with the last graph, you can draw some conclusions. Since this graph factors in Total Attendance and Win Total (even though only Total Attendance is displayed), you can tell that even though a team’s fanbase may love coming to their games, if the wins are low, so are the playoff chances.

Now, before we start the linear regression models, let’s compare the logistic regression models to see which is the better of the two by analyzing various criteria:

  • Difference between null & residual deviance
    • Model 1-73.25 with a decrease of two degrees of freedom
    • Model 2-115.82 with a decrease of three degrees of freedom
    • Better model-Model 1
  • AIC
    • Model 1-101.86
    • Model 2-60.483
    • Better model-Model 2 (41.377 difference)
  • Number of Fisher Scoring Iterations
    • Model 1-5
    • Model 2-7
    • Better model-Model 1 (fewer Fisher iterations)
  • Overall Accuracy
    • Model 1-87%
    • Model 2-379%
    • Better model-Model 1 (379% sounds too good to be true)

Overall better model: Model 1

Now here’s the first linear regression model:

This model has Win Total as the dependent variable and Total Attendance and Conference Standing as the independent variables. This will also be my first model created with multiple linear regression, which is basically linear regression with more than one independent variable.
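The fitting call for a multiple linear regression just adds terms with +; here’s a sketch with toy data and hypothetical column names:

```r
set.seed(1)
train <- data.frame(
  Win.Total           = sample(0:16, 128, replace = TRUE),
  Total.Attendance    = runif(128, 7.5e5, 1.5e6),
  Conference.Standing = sample(1:16, 128, replace = TRUE)
)

# Multiple linear regression: one dependent variable, two independent variables
linear1 <- lm(Win.Total ~ Total.Attendance + Conference.Standing, data = train)
summary(linear1)
```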

And finally, let’s plot the model:

In cases of multiple linear regression such as this, I had to graph each independent variable separately; graphing Total Attendance and Conference Standing separately allows us to examine the effect each independent variable has on our dependent variable (Win Total). As you can see, Total Attendance increases with an increasing Win Total, while Conference Standing decreases (i.e., improves, since 1st is the best standing) as Win Total increases. Both graphs make lots of sense, as fans are more tempted to come to a team’s games when the team has a high win total, and conference standings tend to worsen with lower win totals (an interesting exception is the 2014 Carolina Panthers, who finished 4th in the NFC despite a 7-8-1 record).

  • In case you are wondering what the layout function does, it basically allows two graphs to be displayed side by side. I can also alter the function depending on how many independent variables I use; if, for instance, I used 4 independent variables, I could change the matrix dimensions from 1 by 2 to 2 by 2 to display the graphs in a 2 by 2 grid.

Multiple linear regression equations are quite similar to those of simple linear regression, except for an added variable. In this case, the equation would be:

  • Win Total = 6.366e-6(Total Attendance)-5.756e-1(Conference Standing)+5.917
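The equation above can also be wrapped in a small helper function so the plugging-in is less error-prone (coefficients copied straight from the equation):

```r
# Win Total as a function of attendance and standing, per the fitted equation
predict_wins <- function(total_attendance, conference_standing) {
  6.366e-6 * total_attendance - 5.756e-1 * conference_standing + 5.917
}

predict_wins(750000, 16)    # bottom seed, ~750k fans: about 1.5 wins
predict_wins(1100000, 7)    # 7th seed, ~1.1M fans: about 8.9 wins
predict_wins(1450000, 1)    # top seed, ~1.45M fans: about 14.6 wins
```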

Now, using the predict function that I showed you for my logistic regression models won’t be very efficient here, so we can go the old-fashioned way by plugging numbers into the equation. Here’s an example:

Regardless of what conference a team is part of, a total attendance of at least 750,000 fans and a bottom seed in the conference should at least bring the team a 1-15 record. For teams with a total attendance of at least 1.1 million fans who fall just short of the playoffs with a 7th seed, a 9-7 record would be likely. Top of the conference teams with an attendance of at least 1.45 million should net a 14-2 record.

Now, let’s see what happens when conference standing improves, but attendance decreases:

According to my predictions, bottom-seeded teams with a total attendance of at least 1.5 million fans should net at least a 6-10 record. However, as conference standings improve and total attendance decreases, predicted records stagnate at either 9-7 or 8-8.

Now here’s my second linear model:

In this model, I used two different independent variables-Home Attendance and Average Age of Roster-but I still used Win Total as my dependent variable.

The equation goes like this:

  • Win Total = 1.051e-5(Home Attendance)+5.534e-1(Average Age of Roster)-1.229e+1

Now just like I did with both of my logistic regression models and the linear regression model, let’s create some what-if scenarios:

In this scenario, home attendance is increasing along with the average age of the roster. Win total also increases with a higher average roster age. For instance, teams with a home attendance of at least 350,000 fans and an average roster age of 24 (meaning the team is full of rookies and other fairly fresh faces) should expect at least a 5-11 record. On the other hand, teams with a roster full of veterans (yes, 28.5 is old for an average roster age) and a home attendance of at least 1.2 million fans should expect a perfect 16-0 season.

Now let’s try a scenario where home attendance decreases but average age of roster increases:

In this scenario, when home attendance decreases but average age of roster increases, a team’s projected win total also goes down. For teams full of fresh faces and breakout stars (average age 24) and a home attendance of at least 1.1 million fans, a 13-3 record seems likely. On the other hand, for teams full of veterans (average age 28.5) and a home attendance of at least 300,000 fans, a 7-9 record appears within reach.

One thing to keep in mind with my linear regression predictions is that I rounded projected win totals to the nearest whole number. So I got the 13-3 record projection from the 12.5526 output.
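That rounding can be checked directly against the second model’s equation (coefficients copied from the equation above):

```r
# Win Total per the second model's fitted equation
predict_wins2 <- function(home_attendance, avg_age) {
  1.051e-5 * home_attendance + 5.534e-1 * avg_age - 1.229e1
}

# The 13-3 projection: ~1.1M home fans, average roster age 24
predict_wins2(1100000, 24)   # 12.5526, which rounds to 13 wins
```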

Now let’s plot the model:

Just as I did with linear1, I graphed the two independent variables separately, not only because it’s the easiest way to graph multiple linear regression but also because we can see each variable’s effect on Win Total. As you can see, Home Attendance and Average Age of Roster both increase with an increasing win total, though the increase in Average Age of Roster is smaller than that of Home Attendance. Each scenario makes sense, as teams are likelier to have a higher win total if they have more supportive fans in attendance (particularly in their 7 or 8 home games per season), and having more recognizable veterans on a team (like the Saints with QB Drew Brees or the Broncos with LB Von Miller) will be better for the team’s overall record than having a team full of newbies (like the Browns with QB Baker Mayfield or the Giants with RB Saquon Barkley).

  • The Home Attendance numbers are displayed in scientific notation, which is how R displays large numbers. 1e+05 is 100,000, 3e+05 is 300,000, and so on.

Now, before I go, let’s compare the two linear models:

  • Residual Standard Error
    • Model 1-1.09 wins
    • Model 2-2.948 wins
    • Better Model-Model 1 (less deviation)
  • R-Squared (Multiple and Adjusted respectively)
    • Model 1-88.72% and 88.58%
    • Model 2-17.49% and 16.44%
    • Better Model-Model 1 (much higher than Model 2)
  • F-statistic & P-Value (since there are 2 degrees of freedom, this is an important metric)
    • Model 1-617.5 on 2 and 157 degrees of freedom; 2.79e-7
    • Model 2-16.64 on 2 and 157 degrees of freedom; 2.79e-7
    • Better Model-Model 1 (both result in the same p-value, but the F-statistic on Model 1 is much larger)
  • Overall better model-Model 1

Thanks for reading,

Michael