R Lesson 11: Missing Data and Basic K-Means Clustering


Hello everybody,

It’s Michael, and today I will be discussing an unsupervised machine learning technique (recall that from my previous R post?) known as k-means clustering. I will also discuss what to do with missing data observations in a column. I know those may seem like unrelated topics, but the dataset I will be working with utilizes both of those topics.

Here’s the dataset for those who want to work with it: 2018 films (1).

But first, as always, let’s load the dataset into R and try to understand our variables:

[Screenshot: loading the dataset into R and viewing its structure]

This dataset contains all 212 wide-release movies for the year 2018. There are seven variables in this dataset:

  • Title: the title of the movie
  • US.Box.Office.Gross: how much money the movie made in the US (this doesn’t factor in worldwide grosses)
    • I got this data from BoxOfficeMojo.com. Great site to collect movie-related data (IMDB works too).
  • Release.Date: the date a movie was released
    • Remember to use the as.Date function seen in the picture to convert your dates from factor type to date type.
  • Genre: the genre of the movie
  • RT.Score: the movie’s critic score on Rotten Tomatoes; 0 represents a score of 0% while 1 represents a score of 100%
  • Audience.Score: the movie’s audience score on Rotten Tomatoes; just as with RT.Score, 0 means 0% and 1 means 100%
  • Runtime: the movie’s runtime in minutes; so 100 minutes represents a movie that is 1 hour and 40 minutes long
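The loading and date-conversion steps above might look like this. The filename and date format here are assumptions; adjust them to match your saved CSV:

```r
# Load the 2018 films dataset; the filename is an assumption --
# change it to whatever you named the downloaded CSV.
film <- read.csv("2018 films (1).csv")

# Inspect the seven variables and their types
str(film)

# read.csv imports dates as factors, so convert Release.Date to
# Date type; the format string assumes dates written like 2/16/2018
film$Release.Date <- as.Date(film$Release.Date, format = "%m/%d/%Y")
```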

Now, the datasets I’ve used so far have been perfect and tidy. However, real datasets aren’t like that, as they are often messy and contain missing observations, which is the case with this dataset. So what should we do?

First, let me demonstrate a function called missmap that gives you a visual representation of missing observations (and the locations of those observations):

First, before creating the missmap, you need to install the Amelia package and load it with library(Amelia). Then all you need to do is write missmap(file) (or whatever you named your file) to display the missingness map (that’s why the function is called missmap).
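In code, those steps look something like this (assuming your data frame is named film):

```r
# Install Amelia once, then load it each session
install.packages("Amelia")   # only needed the first time
library(Amelia)

# Draw the missingness map; "film" is whatever you named your data frame
missmap(film, main = "Missing Observations in 2018 Films")
```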

According to this missmap, there aren’t too many missing observations (only 1% as opposed to the 99% of present observations). The few missing observations are located in the RT Score and Audience Score columns, since some of the movies are missing critic or audience scores.

  • Keep in mind that this dataset is only a warmup; I will be working with messier datasets in future posts, so stay tuned.

So how can we tidy up our dataset? Well, here’s one way of approaching it:

na.rm=TRUE

This line of code would be included in my k-means workflow to tell R to exclude any NA (or missing) data points. The name na.rm stands for “NA remove,” so setting it to TRUE drops the missing values before the calculation runs.

I could also replace all the missing values with the means for RT Score and Audience Score (62% and 66% respectively), but since missing values only make up 1% of all my data, I’ll just stick with excluding missing values.
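For completeness, the mean-replacement alternative is a one-liner per column. This is a sketch assuming the data frame is named film; the means it computes should land near the 62% and 66% mentioned above:

```r
# Alternative to dropping rows: replace each missing score with the
# mean of the observed values in that column
film$RT.Score[is.na(film$RT.Score)] <-
  mean(film$RT.Score, na.rm = TRUE)
film$Audience.Score[is.na(film$Audience.Score)] <-
  mean(film$Audience.Score, na.rm = TRUE)
```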

Now, it’s time to do some k-means clustering. But first of all, what is k-means clustering?

K-means clustering is an unsupervised machine learning algorithm that tries to group data into k clusters by minimizing the distance between individual observations and their cluster centers. The aim of k-means clustering is to keep all the data points in a cluster as close to the centroid (the cluster’s center point) as possible. In order to do so, each data point is assigned to the closest centroid using Euclidean distance. Euclidean distance is data science lingo for the straight-line distance between a point in a cluster and the cluster’s centroid.
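Euclidean distance is easy to compute by hand in R. Here’s a tiny made-up example with two points:

```r
# Euclidean (straight-line) distance: the square root of the
# sum of squared differences in each dimension
point    <- c(2, 3)
centroid <- c(5, 7)
sqrt(sum((point - centroid)^2))   # a 3-4-5 right triangle, so this is 5
```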

So the next step in our analysis would be to select the variables to use in our k-means clustering. We can do so by subsetting the data like this:

I chose two columns to include in my subset (US Box Office Gross and RT Score), which creates a data frame that I will use in my k-means clustering. The head function simply displays the first six rows of my subset (the first six box office grosses listed along with their corresponding RT scores).

  • For this k-means example, I’m only including two columns in my subset; k-means can handle more variables, but two keeps the clusters easy to plot.
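The subsetting step might look like this, assuming the full data frame is named film:

```r
# Keep only the two columns used for clustering
data1 <- film[, c("US.Box.Office.Gross", "RT.Score")]

# Peek at the first six rows of the subset
head(data1)
```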

Now the next step would be to create our clusters. Here’s how:

I first created the movieCluster variable to store my k-means model using the name of my data subset (data1), the number of clusters I wish to include (5), and nstart, which tells the model to try 35 random sets of starting centers and then keep the one with the lowest within-cluster variation.

  • I didn’t mention it here, but I included the line data1 <- na.omit (data1) in order to omit any rows with NA in them from my subset (remember RT Score had NA values). The k-means model won’t run if you don’t omit NA values (or replace them with means, but for now let’s stick to omitting the values).
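Putting those steps together, here’s a sketch of the model-building code (the set.seed call is my addition, so the 35 random starts are reproducible):

```r
# Drop rows with missing RT scores -- kmeans() errors on NA values
data1 <- na.omit(data1)

# Optional: fix the random seed so results are reproducible
set.seed(42)

# 5 clusters; nstart = 35 tries 35 random starting configurations
# and keeps the one with the lowest total within-cluster variation
movieCluster <- kmeans(data1, centers = 5, nstart = 35)
movieCluster
```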

I then type in movieCluster to get a better idea of what my clusters look like. The first thing that is displayed is “K-means clustering with (x) clusters of sizes:”, which shows you the sizes of each cluster. In this case, the clusters in my k-means model have 62, 22, 3, 10, and 102 observations, respectively. In total, 199 observations were used, which means 13 weren’t used since their RT Scores were missing.

The next thing you see is cluster means, which gives you the mean of each variable within each cluster; in this case, the means for US Box Office Gross and RT Score are displayed for all 5 clusters.

After that, you will see the clustering vector, which will tell you what cluster each observation belongs to (labelled 1-5). As you can see, there are some missing observations (such as 201 and 211); this is because I omitted all NA rows from my data subset (and in turn, my k-means model).

Next you will see the within cluster sum of squares for each cluster; this is a measurement of the variability of the observations in each cluster. Usually, the smaller this amount is, the more compact the cluster. As you can see, all of the WCSSBC (my acronym for within cluster sum of squares by cluster) are quite large; this is likely because most of the values in the US Box Office Gross column are over 500,000. I’d have smaller WCSSBC if the US Box Office Gross values were smaller, but that’s neither here nor there.

The last thing you will see is between_SS / total_SS = 95.4%, which is the ratio of the between-cluster sum of squares to the total sum of squares; this is a measure of the goodness of fit of the model. 95.4% indicates an excellent fit for this model.
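That ratio can also be recomputed directly from the components the kmeans model object stores:

```r
# Same number that prints as between_SS / total_SS
movieCluster$betweenss / movieCluster$totss
```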

Last but not least, let’s graph our model. Here’s how:
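One way to sketch the plot with base R graphics, assuming data1 and movieCluster from above (the axis labels and centroid markers are my own choices):

```r
# Scatterplot of the two clustering variables, coloring each
# movie by its assigned cluster (a vector of labels 1-5)
plot(data1,
     col  = movieCluster$cluster,
     xlab = "Box Office Gross",
     ylab = "RT Score",
     main = "2018 Films Clustered by Gross and Critic Score")

# Mark the five centroids with large X's
points(movieCluster$centers, pch = 4, cex = 2)
```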

The movies are clustered by US Box Office Gross (referred here as simply Box Office Gross); the colors of the dots represent each cluster. Here’s what each color represents:

  • Light blue: movies that made between $0 and $25 million
  • Black: movies that made between $25 and $100 million (many limited releases)
  • Red: movies that made between $100 and $200 million
  • Dark blue: movies that made between $200 and $400 million
  • Green: movies that made between $600 and $700 million

As you can see, the green cluster is the outlier among the data, since only 3 movies in our dataset made at least $600 million (Black Panther, Avengers: Infinity War, and Incredibles 2).

Thanks for reading,

Michael
