MySQL Lesson 1: Building an ER Diagram

Hello everybody,

This is Michael, and as I mentioned in the last post, I will start building the database that I will be using for this series of posts.

The database will store information about 54 albums (3 for each year from 2000 to 2018) such as track listings, artists, featured artists on certain tracks, genre, release year, duration (of album and individual track), etc.

In the previous post, I did mention that MySQL is meant for query-based analysis. However, before beginning to do queries, we must create our database. As the title explains, this post will focus on the creation of an ER (entity-relationship) diagram. An ER diagram is a graphical representation of items in a database (in this case, albums, track listings, artists, etc.) and how they are related to each other (like how albums can have several track listings).

So, without further hesitation, here is the ER diagram for the database I will be using.

[Image: ER diagram for the album database]

Now you may be confused by all of the arrows and tables in the diagram. Here’s an explanation.

  • This ER diagram represents the relationships among albums, singers, the songs on each album, featured artists, and any awards the album won or was nominated for.
  • How do all of these tables relate to each other? Here’s how.
    • Each album must have several tracks, while each track belongs to one and only one album (the three-pronged "crow's foot" symbol means "many," while the symbol with two vertical bars means "one").
    • Each track can have several, one, or no featured artists (that's why you see a circle, which stands for "zero"), but each featured artist must belong to one and only one track.
    • Each singer can appear more than once (if they have several albums in the database) or just once, but each album must correspond to one and only one singer.
    • Each album can be nominated for one or several awards (e.g., the Grammys, MTV VMAs), but each award must correspond to only one album.
      • You'll notice this is the only dotted line in the diagram. That's because the relationship between album and awards is non-identifying, meaning an award can be identified by its idAwards field alone, without needing the album's key.
      • The rest of the relationships are identifying relationships, where each child table depends on its parent table for identification.
        • For example, you can’t identify a featured artist without knowing what track they appear on. Likewise, you can’t identify a song without knowing what album it is a part of. Nor can you identify an album without knowing which singer/group created it.

Now what about the attributes in each table (those are the things with diamonds right by them)? Here’s what each of them means.

  • The album table contains the attributes
    • Name-the name of the album
    • Duration-how long the album is (given in hours:minutes:seconds)
    • Release Year-the year the album came out
    • Album number-how many albums the artist had made up to that point; in other words, is this the artist’s 1st album? 4th? 5th?
    • Genre-the genre of the album
  • The singer table contains the attributes
    • Singer Name-the singer’s (or group’s) name
    • Age-the singer’s age as of August 1, 2018 (if they are still living)
    • Birthplace-the singer’s birthplace (or where the group was formed)
    • Date of Death-the date the artist died
      • You’ll notice the attributes age and date of death have white diamonds beside them; this is because each of them can be null (have no value). For instance, the date of death field can stay blank for living artists. The other attributes, which have blue diamonds beside them, must have some sort of value (in other words, they can’t be null).
  • The tracks table contains the attributes
    • Track Name-the song’s name
    • Track Duration-the length of the song (given in hours:minutes:seconds)
  • The featured artist table contains the attribute
    • Featured artist name-the name of any artist who appears on a particular track
  • The awards table contains the attributes
    • Ceremony-the ceremony at which the album was nominated for or won an award (Grammys, Billboard Music Awards, etc.)
    • Ceremony Year-the year of the ceremony at which the album was nominated for or won an award
    • Won/Nominated-whether an album won or was nominated for a particular award

But wait, what are those keys right by some of the attributes? Those are called primary keys: a primary key is a column (or set of columns) whose values uniquely identify each row in the table. Primary key values are usually auto-incrementing integers (starting with 1, then 2, then 3, and so on), which guarantees uniqueness. For example, in the album table, 1 would be the primary key value for the first album in the database, then 2, then 3, all the way to 54.

Take this part of the diagram:

[Image: the singer and album tables from the ER diagram]

idSinger and idAlbum are both primary keys in their respective tables. But wait, why does Singer_idSinger1 appear in the album table? That is because Singer_idSinger1 is a foreign key: a column (or set of columns) in one table that refers to the primary key of another table-in this case, the primary key of the singer table. Foreign keys serve as the link between the referencing table (album) and the referenced table (singer).
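
I won't actually create any tables until later in this series, but just to make the idea concrete, here is a rough sketch of how the singer and album tables from the diagram could eventually be written in SQL. The column names follow the diagram; the data types are just my guesses.

    CREATE TABLE singer (
      idSinger      INT AUTO_INCREMENT PRIMARY KEY,   -- primary key, auto-incrementing
      singer_name   VARCHAR(100) NOT NULL,
      birthplace    VARCHAR(100) NOT NULL,
      age           INT NULL,                         -- white diamond: allowed to be null
      date_of_death DATE NULL                         -- stays null for living artists
    );

    CREATE TABLE album (
      idAlbum          INT NOT NULL,
      Singer_idSinger1 INT NOT NULL,                  -- foreign key pointing at the singer table
      name             VARCHAR(150) NOT NULL,
      duration         TIME NOT NULL,                 -- hours:minutes:seconds
      release_year     YEAR NOT NULL,
      album_number     INT NOT NULL,
      genre            VARCHAR(50) NOT NULL,
      PRIMARY KEY (idAlbum, Singer_idSinger1),        -- identifying relationship: the singer's key is part of the album's key
      FOREIGN KEY (Singer_idSinger1) REFERENCES singer (idSinger)
    );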

If you want to know how primary keys and foreign keys differ from each other, here’s a handy table (Source-https://www.essentialsql.com/what-is-the-difference-between-a-primary-key-and-a-foreign-key/)

[Table: Comparison of Primary to Foreign Key Attributes]

That’s all for now. Thanks for reading,

Michael

What is MySQL?

Hello everybody,

This is Michael, and as I mentioned in the welcome post, I will include other programming languages on this blog. Don’t worry, I’ll still post plenty of R lessons and analyses, but I thought it was time to include other programming languages. The next one I will introduce is MySQL, which is an open-source relational database management system (open-source means it’s free to use and its source code is publicly available).

  • Relational databases are designed to keep track of the relationships among the items they store. Let’s say you wanted to make a database of NFL teams that includes team name, quarterback, running back, center, safety, linebacker, wide receiver, and any other football positions I missed here, along with each team’s season record. Team name would be related to each of the positions I just mentioned, as every team has someone at each position. Team name would also be related to season record, as each team has a win-loss (or sometimes tie) record each year.

To clarify, MySQL and SQL are two different things: MySQL is database management software, whereas SQL (which stands for Structured Query Language) is the programming language used to define, query, and manage relational databases.

Another thing I wanted to point out is that MySQL and R-although they are both great analytical tools-serve two different purposes. Personally, I would use R to analyze data from a statistical standpoint (as seen in my logistic regression posts), while I would use MySQL for query-based analysis. Each tool has its pros and cons: R is better for analysis and visualization of data, but its syntax is more complicated than MySQL’s (that’s just my opinion). Likewise, MySQL is great for query-based analysis, which is more difficult to do in R, but it isn’t the best for performing advanced analyses or creating data visualizations. MySQL is also restricted to relational databases, while R is not.

For this series of posts, I’ll build a database (I’ll be using the same database throughout this series of MySQL posts) using MySQL Workbench, which I’d recommend for anyone wanting to make their own MySQL databases. If you want to install it, here’s a handy link-http://www.ccs.neu.edu/home/kathleen/classes/cs3200/MySQLWorkbenchMAC10.pdf

Don’t worry everybody, I’ll actually start building the database with my next MySQL post. This post was just meant to explain the basics of MySQL.

Thank you for reading,

Michael

 

R Lesson 5: Graphing Logistic Regression Models

Hello everybody,

It’s Michael, and today I’ll be discussing graphing with logistic regression. This will serve as a continuation of R Lesson 4: Logistic Regression Models (I’ll be using the dataset and the models from that post).

Let’s start by graphing the second model from R Lesson 4. That’s the one that includes season count and premiere year (I feel this would be more appropriate to graph as it is the more quantitative of the two models).

Here’s the formula for the model if you’re interested (as well as the output):

[Image: model formula and summary output]
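
In case that screenshot is hard to read, here is roughly what the model looks like in code. I'm assuming the spreadsheet is saved as "TV shows.csv" and that the premiere year column reads in as Premiere.Year; the other column names are the ones R produced in Lesson 4.

    # a rough sketch of the second model from Lesson 4: renewal ~ premiere year + season count
    shows <- read.csv("TV shows.csv")
    model2 <- glm(X2018.19.renewal. ~ Premiere.Year + X..of.seasons..17.18.,
                  data = shows, family = binomial)
    summary(model2)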

Now let’s plot the model (but first, let’s remember to install the ggplot2 package).

[Image: code to install and load ggplot2]
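
If you've never used ggplot2 before, the commands look something like this:

    install.packages("ggplot2")   # only needs to be run once
    library(ggplot2)              # loads the package for the current session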

Next we have to figure out the probabilities that each show will be renewed (or not).

[Image: code to compute each show's predicted renewal probability]
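
One way to do this is with predict(); setting type = "response" returns each show's fitted probability of renewal, which we can keep as a new column.

    shows$renewal_prob <- predict(model2, type = "response")   # predicted probability of renewal for each show
    head(shows$renewal_prob)                                   # peek at the first few probabilities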

And finally, let’s plot the model.

[Image: code to plot the model]
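
In case the screenshot is hard to read, here is a sketch of one way to build this kind of plot with ggplot2 (the exact styling in the screenshot may differ): premiere year and season count go on the axes, and each point is colored by its predicted probability of renewal.

    ggplot(shows, aes(x = Premiere.Year, y = X..of.seasons..17.18., color = renewal_prob)) +
      geom_point(size = 3) +
      labs(x = "Premiere year", y = "Seasons through 2017-18", color = "P(renewed)")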

[Image: the resulting plot, with premiere year and season count on the axes and renewal probability shown by color]

What are some conclusions we can draw from the model?

  • Shows with fewer than 25 seasons that premiered between 1975 and the early 90s (such as Roseanne, which had 10 seasons and premiered in 1988) had essentially no predicted chance of renewal.
  • For shows with fewer than 25 seasons, the more recently the show premiered, the more likely it was to be renewed (as shown by the progressively brighter colors).
  • The few outlier shows with more than 25 seasons had a predicted renewal probability of essentially 100%, regardless of when they premiered.
    • The two notable examples are The Simpsons (at 29 seasons) and SNL (at 43 seasons).

Thanks for reading,

Michael

R Lesson 4: Logistic Regression Models

Hello everybody,

It’s Michael, and today’s post will be the first to cover data modeling in R. The model I will be discussing is the logistic regression model. For those that don’t know, logistic regression models explore the relationship between a binary* dependent variable and one or more independent variables.

*a binary variable is one with only 2 possible values, like yes/no, wrong/right, healthy/sick, etc.

The data set I will be using-TV shows-gives a list of 85 random TV shows of various genres that were airing during the 2017-18 TV season, along with whether or not each show was renewed for the 2018-19 TV season. So, like any good data scientist, let’s first load the file and read (as well as understand) the data.

[Image: code to load and preview the TV shows dataset]
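
Here is roughly what that step looks like, assuming the spreadsheet is saved as "TV shows.csv" in your working directory:

    shows <- read.csv("TV shows.csv")   # read the data into a data frame
    head(shows)                         # look at the first few rows
    str(shows)                          # check how R read in each variable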

The variables include

  • TV Show-the name of the TV show
  • Genre-the genre of the TV show
  • Premiere Year-the year the TV show premiered (for reboots like Roseanne, I included the premiere date of the original, not the revival)
  • X..of.seasons..17.18. (I’ll refer to it as season count)-how many seasons the show had aired at the conclusion of the 2017-18 TV season (in the case of revived shows like American Idol, I counted both the original run and revival, which added up to 16 seasons)
  • Network-the network the show was airing on at the end of the 2017-18 TV season
  • X2018.19.renewal. (my binary variable)-Whether or not the show was renewed for the 2018-19 TV season
    • You’ll notice I used 0 and 1 for this variable; this is because it is a good idea to use dummy variables (the 0 and 1) for your binary dependent variable to help quantify qualitative data.
      • The qualitative data in this case being whether a show was renewed for the 2018-19 TV season (shown by 1) or not (shown by 0)

 

Now that we know the variables in our data set, let’s figure out what we want to analyze.

  • Let’s analyze the factors (e.g., network, genre) that affected a certain TV show’s renewal or cancellation (the binary variable represented by 0/1)

So here’s the code to build the model, using the binary dependent variable and two of the independent variables (I’ll use genre and premiere year)

[Image: code to build the logistic regression model]
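
In case the screenshot is hard to read, the model looks something like this. I'm assuming the premiere year column reads in as Premiere.Year (run names(shows) to check); the other names come from the variable list above.

    model1 <- glm(X2018.19.renewal. ~ Genre + Premiere.Year,
                  data = shows, family = binomial)   # family = binomial is what makes this logistic regression
    summary(model1)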

[Image: the model's summary output]

What does all of this output mean?

  • The call just reprints the model we created.
  • The estimate represents the change in the log odds (the logarithm of the odds) of the dependent variable when a certain independent variable increases by 1.
    • Log odds function: log(p/(1-p))
    • For instance, if the premiere year increases by 1 (let’s say from 2009 to 2010), the log odds that the show was renewed for the 18-19 TV season decrease by 0.06763 (the premiere year estimate), which works out to roughly a 6.5% decrease in the odds of renewal.
  • Standard error represents how much a coefficient estimate is expected to vary from sample to sample; the smaller the standard error, the more precise the estimate. The premiere year estimate has a small standard error, while the genre estimates have much larger ones (then again, genre isn’t numerical, so each genre gets its own coefficient).
  • Z-value is the ratio of the estimate to the standard error
  • P-value (denoted by Pr(>|z|)) helps you determine the significance of your results by giving you a number between 0 and 1
    • P-values are used to test a null hypothesis (a default claim about your data, usually that a given variable has no effect)
      • Here, the null hypothesis for each variable is that it has no effect on a show’s chances of renewal.
      • The alternative hypothesis is the opposite: that genre (or premiere year) does affect a show’s chances of renewal.
    • Small p-values (those <=0.05) indicate strong evidence against the null hypothesis, so in those cases you can reject the null hypothesis and conclude the variable matters. For p-values larger than 0.05, you fail to reject the null hypothesis (there isn’t enough evidence that the variable matters).
      • Since all the p-values here are well above 0.05, we can’t conclude that genre or premiere year had a real effect on a show’s renewal.
  • Null deviance shows how well our dependent variable (whether or not a show got renewed) is predicted by a model that includes only the intercept
  • Residual deviance shows how well our dependent variable (whether or not a show got renewed) is predicted by a model that includes the intercept as well as any independent variables
    • As you can see here, the residual deviance is 89.496 on 71 degrees of freedom, a decrease of 20.876 from null deviance (as well as a decrease of 13 degrees of freedom).
  • AIC (or Akaike Information Criterion) is a way to gauge the quality of your model through comparison of related models; the point of the AIC is to prevent you from using irrelevant independent variables.
    • The AIC itself is meaningless unless we have another model to compare it to, which I will include in this post.
  • The number of Fisher scoring iterations shows how many iterations it took the fitting algorithm to converge on the maximum likelihood estimates, 17 in this case. This number isn’t too significant.

Now let’s create another model, this time including season count in place of genre.

[Image: summary output for the second model]
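
Again, here is roughly what the code looks like (same column-name assumptions as before):

    model2 <- glm(X2018.19.renewal. ~ Premiere.Year + X..of.seasons..17.18.,
                  data = shows, family = binomial)
    summary(model2)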

How does this compare to the previous model?

  • There is a smaller difference between null & residual deviance (12.753 and 2 degrees of freedom, as opposed to 20.876 and 13 degrees of freedom)
  • The AIC is 13.88 smaller than that of the previous model, which indicates a better quality of the model
  • The number of Fisher scoring iterations is also lower than in the previous model (5 as opposed to 17), which means the fitting algorithm took fewer iterations to converge
  • The estimate for premiere year also increased
    • This time, if premiere year increases by 1, the odds that a show was renewed for the 2018-19 TV season increase rather than decrease (the premiere year estimate is now positive, at roughly +0.148 on the log odds scale).
    • If season count increases by 1 (say from 4 to 5 seasons), the odds a show was renewed increase as well (season count’s estimate is roughly +0.314 on the log odds scale).
  • The asterisk by season count just indicates the range its p-value falls in (the p-value column is labeled Pr(>|z|))
    • The p-value of season count is >0.01 but <0.05 (which makes sense, as its Pr(>|z|) value is 0.027)
  • Let’s test two null hypotheses-that premiere year has no effect on a show’s chances of renewal, and that season count has no effect (we are treating these as separate hypotheses).
    • Premiere year’s p-value is greater than 0.05, so we fail to reject its null hypothesis.
      • In other words, there isn’t strong evidence that premiere year affected a show’s chances for renewal.
    • Season count’s p-value is less than 0.05, so we reject its null hypothesis.
      • In other words, season count did affect a show’s chances for renewal.

That’s all for now. Thanks for reading.

Michael

 

R Lesson 3: Basic graphing with R

Hello everybody,

This is Michael, and today’s post will be on basic graphing with R. I’ll be using a different dataset for this post-murder_2015_final-which gives the number of murders in 83 US cities in 2014 and 2015, along with the change from 2014 to 2015 (I felt this one was more quantitative than the dataset I used in my last two posts).
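
First, the data needs to be loaded into R. I'm assuming the CSV has been saved locally as murder_2015_final.csv (the link to the file is at the end of this post):

    file <- read.csv("murder_2015_final.csv")   # read the murders data into a data frame called file
    head(file)                                  # look at the first few rows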

So let’s begin with a scatter plot.

[Image: code for the scatter plot]

  • If you can’t read this, here’s the code
    • plot(file$X2015_murders, file$change, pch=20, col="red", main="2014-2015 murder rate changes", xlab="2015 murders", ylab="Change from 2014 homicide rate")

[Image: scatter plot of 2015 murders vs. change from 2014]

As you can see, there are two outliers in the upper right-hand corner of the plot. If you want to find out which cities those are, here’s how you would add labels to each of the points.

[Image: code to add city labels to the points]

  • Remember not to close the window with the graph when typing this command!
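
In case the screenshot is hard to read, here is one way to do it with text(), which draws on top of the plot that is already open (that's why the window has to stay open). I'm assuming the city names are in a column called city:

    text(file$X2015_murders, file$change, labels = file$city, pos = 3, cex = 0.6)   # pos = 3 puts each label above its point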

[Image: the labeled scatter plot]

From this graph, we can see that the two outliers (or cities with the largest 2014-to-2015 rise in murder rates) are Chicago and Baltimore.

Let’s try a bar graph now. Here’s the command to make a basic bar chart.

[Image: code for the bar chart]
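
In case the screenshot is hard to read, here is a sketch of one way to get a chart like this: count how many cities rose, stayed flat, or dropped, then hand those counts to barplot(). The original command may have been set up differently.

    change_direction <- factor(sign(file$change), levels = c(-1, 0, 1),
                               labels = c("Drop", "No change", "Rise"))   # classify each city by the sign of its change
    barplot(table(change_direction), ylab = "Number of cities",
            main = "Change in murders, 2014 to 2015")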

[Image: the resulting bar chart]

As you can see, 53 of the cities had a year-to-year rise in murder rates, 4 had no change in murder rates, and 26 had a year-to-year drop in murder rates (if you’re wondering what those cities are, check the spreadsheet attached to this post).

Let’s make another graph-the box plot. Here is the command

[Image: code for the box plot]
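
In case the screenshot is hard to read, the command looks something like this. I'm assuming the 2015 murder counts are plotted against the state column, with col = "yellow" giving the boxes described below:

    boxplot(X2015_murders ~ state, data = file,   # one box per state
            col = "yellow", las = 2,              # las = 2 turns the state labels sideways
            ylab = "2015 murders")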

[Image: box plot of murders by state]

Some things to know when reading a box plot

  • The bold dashes represent the median number of murders for a certain state (or the only value if a state appears just once)
  • The yellow boxes denote the middle 50% of values for a certain state (the interquartile range)
  • The dashed lines (whiskers) above and below each box extend to the highest and lowest values that fall outside the box but are not extreme enough to count as outliers
    • If there aren’t any dashed lines, then the yellow box covers all of the values for that state
  • Any circles you see are outliers corresponding to a particular state.

 

One more thing: if you’re wondering where I got this data from, here’s the link-https://github.com/fivethirtyeight/data/blob/master/murder_2016/murder_2015_final.csv. The data comes from FiveThirtyEight, which writes interesting data-driven articles, such as The LeBron James Decision-Making Machine. FiveThirtyEight then posts the code and data used in these articles on GitHub so anyone can perform their own statistical analyses (it’s a good place to look for free datasets for your own data analysis projects, and much more interesting than the datasets that come with R, some of which are 40+ years old).

Thank you,

Michael