R Lesson 14: Random Forests

Hello everybody,

It’s Michael, and today’s post will be on random forests in R.

You’re probably wondering what a random forest does. So was I, since I’ve never actually worked through any random forest problems before (the other R lessons covered concepts from my undergrad business analytics coursework, so I was already familiar with them).

Random forests are a form of supervised learning (see R Lesson 10: Intro to Machine Learning-Supervised and Unsupervised for an explanation of supervised and unsupervised learning). But to better explain the concept of random forests, let me provide an easy-to-understand example.

Let’s say you wanted to visit the new ice cream shop in town, but aren’t sure if it’s any good. Would you go in and visit the place anyway, regardless of any doubts you have? It’s unlikely, since you’d probably want to ask your buddies and/or relatives about their thoughts on the new place first. Ideally you would ask multiple people for their opinions, since the opinion of one person is usually biased by his/her preferences. By asking several people for their opinions, we reduce the bias in our decision-making process as much as possible. After all, one person might not like the ice cream shop in question because they weren’t satisfied with the service, or they don’t like the shop’s options, or a myriad of other reasons.

In data analytics jargon, the paragraph above is a great example of a technique called ensembling, a type of supervised learning in which several models are trained on the same training set and their individual outputs are combined by a set rule to determine the final output.

Now, what does all of that mean? First of all, when I mentioned that several models are trained on a training set, this means that either the same model with different parameters or different models can utilize the training set.

When I mentioned that their [referring to the “several models”] individual outputs are combined by a set rule to determine the final output, this means that the final output is derived by using a certain rule to combine all of the individual outputs. There are plenty of rules that can be used, but the two most common are averaging (for numerical output) and voting (for non-numerical output). In the case of numerical output, we can take the average of all the outputs and use that average as the final output. In the case of categorical (or non-numerical) output, we can use a majority vote (the output that occurs the most times) as the final output.
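These two combination rules can be sketched in a few lines of R (the model outputs below are made-up numbers, purely for illustration):

```r
# Hypothetical outputs from three models in an ensemble

# Numerical output: combine by taking the average
numeric_preds <- c(10.2, 11.0, 10.6)
final_numeric <- mean(numeric_preds)

# Categorical output: combine by majority vote
class_preds <- c("Drug", "Violent", "Drug")
vote_counts <- table(class_preds)
final_class <- names(vote_counts)[which.max(vote_counts)]

final_numeric   # 10.6
final_class     # "Drug"
```

Whatever rule you pick, the idea is the same as polling your friends about the ice cream shop: many imperfect opinions, one combined answer.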

Random forests are an example of an ensembling algorithm. Random forests work by creating several decision trees (which I’ll explain in a moment) and combining their outputs to derive a final output. Decision trees are easy-to-understand models; however, an individual tree doesn’t have much predictive power, which is why decision trees are often referred to as weak learners.

Here’s a basic example of a decision tree:

[Figure: a simple decision tree that answers the question “Is a person fit?”]

This tree isn’t filled with analytical jargon, but the idea is the same. This tree answers a question (“Is a person fit?”) just as any decision tree does. It starts with a basic question at the top: “Is the person younger than 30?” Depending on the answer (Yes/No), another question is posed: either “Eats a lot of pizzas?” or “Exercises in the morning?”. And depending on the answer to either of those (both are Yes/No questions), you get the answer to the question the tree is trying to answer. For instance, if the answer to “Eats a lot of pizzas?” is “Yes”, then the answer to “Is a person fit?” is “No”.
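To make the tree’s logic concrete, here it is translated into a small R function (my own sketch, not code from the lesson; I’m assuming the pizza question sits on the “younger than 30” branch):

```r
# The "Is a person fit?" tree written as nested if/else statements
is_fit <- function(age, eats_lots_of_pizza, exercises_in_morning) {
  if (age < 30) {
    # Left branch: younger than 30 -> ask about pizza
    if (eats_lots_of_pizza) "No" else "Yes"
  } else {
    # Right branch: 30 or older -> ask about morning exercise
    if (exercises_in_morning) "Yes" else "No"
  }
}

is_fit(25, eats_lots_of_pizza = TRUE,  exercises_in_morning = FALSE)  # "No"
is_fit(45, eats_lots_of_pizza = FALSE, exercises_in_morning = TRUE)   # "Yes"
```

A real decision tree learns those questions and thresholds from the data instead of having them hand-written, but the resulting structure is exactly this kind of nested Yes/No logic.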

This tree is a very simple example of decision trees, but the basic idea is the same with R decision trees (which I will show you more of in the next post).

Now let’s get into the dataset, Iowa Recidivism, which gives data on people serving a prison term in the state of Iowa between 2010 and 2015, with recidivism follow-up between 2013 and 2018. In case you’re wondering, I got this dataset from Kaggle.

Now, let’s start our analysis by first understanding our data:
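The code itself isn’t shown inline here, but loading and inspecting the data would look something like this (the CSV file name is a placeholder; use whatever you named your Kaggle download):

```r
# Read the Iowa recidivism data; the file name is a placeholder
recidivism <- read.csv("Iowa_Recidivism.csv", stringsAsFactors = TRUE)

# str() lists every variable with its type and a preview of its values
str(recidivism)

# dim() confirms the size: 26,020 rows (observations) and 12 columns (variables)
dim(recidivism)
```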

Here, we have 26,020 observations of 12 variables, which is a lot, but let’s break it down variable by variable:

  • Fiscal.Year.Released-the year the person was released from prison/jail
  • Recidivism.Reporting.Year-the year recidivism follow-up was conducted, which is 3 years after the person was released
  • Race..Ethnicity-the race/ethnicity of the person who was released
  • Age.At.Release-the age a person was when he/she was released. Exact ages aren’t given; instead, age brackets are mentioned, of which there are five. They include:
    • Under 25
    • 25-34
    • 35-44
    • 45-54
    • 55 and over
  • Convicting.Offense.Classification-the level of seriousness of the crime, which often dictates the length of the sentence. The levels include:
    • [Class] A Felony-Up to life in prison
    • [Class] B Felony-25 to 50 years in prison
    • [Class] C Felony-Up to 10 years in prison
    • [Class] D Felony-Up to 5 years in prison
    • Aggravated Misdemeanor-Up to 2 years in prison
    • Serious Misdemeanor-Up to 1 year in prison
    • Simple Misdemeanor-Up to 30 days in prison
  • Convicting.Offense.Type-the general category of the crime the person was convicted of, which can include:
    • Drug-drug crimes (anything from narcotics trafficking to possession of less than an ounce of marijuana)
    • Violent-any violent crime (murder, sexual assault, etc.)
    • Property-any crimes against property (like robbery/burglary)
    • Public order-usually victimless and non-violent crimes that go against societal norms (prostitution, public drunkenness, etc.)
    • Other-any other crimes that don’t fit the four categories mentioned above (like any white collar crimes)
  • Convicting.Offense.Subtype-the more specific category of the crime the person was convicted of, which can include “animal” (for animal cruelty crimes), “trafficking” (which I’m guessing refers to human trafficking), among others.
  • Main.Supervising.District-the jurisdiction supervising the offender during the 3 year period (Iowa has 11 judicial districts)
  • Release.Type-why the person was released, which can include parole, among other things
  • Release.type..Paroled.to.detainder.united-this is pretty much the same as Release.Type; I won’t be using this variable in my analysis
  • Part.of.Target.Population-whether or not a prisoner is a parolee
  • Recidivism...Return.to.Prison.numeric-whether or not a person returned to prison within 3 years of release; 1 means “no” (they didn’t return) and 0 means “yes” (they did return)

Now let’s see if we’ve got any missing observations (remember to install the Amelia package):
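My missing-value check isn’t shown inline, but with the Amelia package it would look roughly like this (assuming the data frame is named recidivism):

```r
# install.packages("Amelia")  # run once if the package isn't installed
library(Amelia)

# missmap() draws a map of the dataset, coloring any missing cells
missmap(recidivism, main = "Missingness Map: Iowa Recidivism Data")

# A quick numeric double check: count the NAs in each column
colSums(is.na(recidivism))
```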

Looks like there’s nothing missing in our dataset, which is always a good sign (though not all datasets will be perfectly tidy).

Next let’s split our data into training and validation sets. If you’re wondering what a validation set is, it’s essentially the dataset you would use to fine-tune the parameters of a model to prevent overfitting. You can make a testing set for this model if you want, but I’ll just go with training and validation sets.

Here’s the training/validation split, following the 75-25 rule I used for splitting training and testing sets:
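Here’s a sketch of that split. To keep the example self-contained I build a tiny stand-in data frame; in the actual analysis, recidivism is the full Kaggle dataset, so sampling 75% of its rows gives the 19,651 training observations mentioned below:

```r
# Stand-in data so this sketch runs on its own; in the real analysis,
# recidivism is the full Kaggle dataset read in earlier
recidivism <- data.frame(id = 1:200)

set.seed(123)  # my addition, so the random split is reproducible

# sample() picks 75% of the row indices at random
train <- sample(1:nrow(recidivism), 0.75 * nrow(recidivism))

trainSet      <- recidivism[train, , drop = FALSE]   # the sampled 75%
validationSet <- recidivism[-train, , drop = FALSE]  # the remaining 25%

nrow(trainSet)        # 150
nrow(validationSet)   # 50
```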

  • This will be a different approach to splitting the data than what you’ve seen in previous R posts. The line of code above splits the data into training (trainSet) and validation (validationSet) sets. It tells R to utilize 75% of the data (19,651 observations) for the training set. The interesting part is the sample function, which tells R to utilize ANY 19,651 observations SELECTED AT RANDOM, as opposed to simply using observations 1-19651 (nrow just counts the number of rows in the dataset).
  • In the lines of code seen below, train selects the 75% of the observations chosen with the sample function, while -train selects the other 25% of observations that are NOT part of the training set. By the way, the minus sign means NOT.

Now here’s our random forest model, using default parameters and the Convicting.Offense.Type variable as the response (remember to install the randomForest package):
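The model call isn’t shown inline, but with the randomForest package it would be along these lines (assuming trainSet from the split above; the formula uses Convicting.Offense.Type as the response and every other variable as a predictor):

```r
# install.packages("randomForest")  # run once if the package isn't installed
library(randomForest)

# Default settings: 500 trees, and mtry defaults to the square root
# of the number of predictors for classification
model1 <- randomForest(Convicting.Offense.Type ~ ., data = trainSet)

# Printing the model shows ntree, mtry, the OOB error estimate,
# and the confusion matrix discussed below
print(model1)
```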

In case you were wondering, the default number of decision trees is 500, and the default number of variables tried at each split is the square root of the number of predictors (3 in this case). The OOB (out-of-bag) estimate of error rate, which is computed on the observations each tree never saw during training, is quite low here at 0.22%.

The confusion matrix you see details how many observations of each Convicting.Offense.Type class were correctly classified and how many were misclassified. The last column, class.error, tells you the share of observations in each class that were misclassified. Here’s a breakdown:

  • Drug-No misclassification
  • Other-3.5% misclassification
  • Property-0.02% misclassification
  • Public Order-0.04% misclassification
  • Violent-0.1% misclassification

So all the crimes that were classified as Drug from trainSet were classified correctly, while Other had the most misclassification (and even 3.5% is a relatively small misclassification rate).

  • One important thing to note is that random forests can be used for both classification and regression analyses. If I wanted to do a regression analysis, I would’ve put the binary variable Recidivism...Return.to.Prison.numeric to the left of the ~ sign and the other variables to the right of it.

Now let’s change the model by changing ntree and mtry, which refer to the number of decision trees in the model and the number of variables sampled at each split, respectively:
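The exact values I used aren’t recorded in the text, so the ntree and mtry numbers below are placeholders; the point is simply that ntree goes down from 500 and mtry goes up from 3. I’ve also added importance = TRUE so the per-class importance breakdown is available later:

```r
# Fewer trees, more variables tried at each split
# (300 and 6 are placeholder values)
model2 <- randomForest(Convicting.Offense.Type ~ ., data = trainSet,
                       ntree = 300, mtry = 6,
                       importance = TRUE)  # keep per-class importance measures
print(model2)
```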

By reducing ntree and increasing mtry, the OOB estimate of error rate decreases (albeit by only 0.02%), and the number of classes with a misclassification error decreases as well. The previous model had only one class with no misclassification error (Drug), while this model has three (Drug, Property, and Public Order). The misclassification errors for the other two classes, Other and Violent, moved in opposite directions: Other dropped from 3.5% in the previous model to 2.2% in this one, while Violent rose from 0.1% to 0.44%.

Now let’s do some predictions, first on our training set and then on our validation set (using the second model):
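A sketch of the prediction step (the names predTrain and predValid are my own, and model2, trainSet, and validationSet are assumed from the steps above):

```r
# Predict the offense type for every observation in each set
predTrain <- predict(model2, newdata = trainSet)
predValid <- predict(model2, newdata = validationSet)
```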

And then let’s check the accuracy of each prediction:
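Assuming prediction vectors named predTrain and predValid from the previous step, the accuracy check could look like this:

```r
# Confusion table: predicted offense type vs. the actual offense type
table(predTrain, trainSet$Convicting.Offense.Type)

# Accuracy: the share of predictions that match the actual class
mean(predTrain == trainSet$Convicting.Offense.Type)

# Same checks for the validation set
table(predValid, validationSet$Convicting.Offense.Type)
mean(predValid == validationSet$Convicting.Offense.Type)
```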

For both the training set and the validation set, the classifications are very accurate. trainSet has a 99.9% accuracy rate (only 10 of the 19,651 observations were misclassified), while validationSet has a 99.8% accuracy rate (only 13 of its 6,369 observations were misclassified).

Now let’s analyze the importance of each variable using the `importance` function:
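The call itself is a one-liner on model2 (this assumes the forest was grown with importance = TRUE, which is what produces the per-class columns and MeanDecreaseAccuracy):

```r
# Per-variable importance: one column per offense class,
# plus MeanDecreaseAccuracy and MeanDecreaseGini
importance(model2)

# The same information, plotted
varImpPlot(model2)
```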

So what exactly does this complex matrix mean? It shows how important each variable was when classifying a crime as one of the five offense types listed in Convicting.Offense.Type. The two important metrics to focus on in this matrix are MeanDecreaseAccuracy and MeanDecreaseGini.

A variable’s MeanDecreaseAccuracy is a measure of how much the accuracy of the model would decrease if that variable were excluded. The higher the number, the more accuracy the model would lose without that variable. For instance, if I were to exclude the Recidivism...Return.to.Prison.numeric variable, the accuracy of the model wouldn’t be impacted much, since it has the lowest MeanDecreaseAccuracy at 3.42. Another reason excluding this variable wouldn’t impact the model much is that it’s a binary variable, which is better suited to a regression problem than to a classification problem like this one. On the other hand, excluding the variable Convicting.Offense.Subtype would have a significant impact on the model, since it has the largest MeanDecreaseAccuracy at 13,358.53.

A variable’s MeanDecreaseGini is a measure of how much Gini impurity decreases when that variable is chosen to split a node. What exactly is Gini impurity? Well, take a random point in the dataset. Using the class distribution of trainSet, we would label that point Drug 5982/19651 of the time and Property 5490/19651 of the time. The probability that we misclassify the point when labeling it this way is the Gini impurity. With that said, MeanDecreaseGini measures how much the misclassification likelihood decreases when a certain variable is included. The higher the number, the less likely the model is to misclassify if that variable is included. For instance, Convicting.Offense.Subtype has the highest MeanDecreaseGini, so including it in the model greatly reduces the misclassification likelihood. On the other hand, Part.of.Target.Population has the lowest MeanDecreaseGini, so including it doesn’t change the misclassification likelihood much. This could be because it is a binary variable, which is better suited to regression problems than to classification problems.
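As a small worked example of Gini impurity (my own illustration, separate from the lesson’s code):

```r
# Gini impurity: the probability of misclassifying a random point
# if we label it according to the node's class distribution
gini <- function(counts) {
  p <- counts / sum(counts)   # class proportions
  1 - sum(p^2)
}

# A pure node (only one class present) has zero impurity
gini(c(Drug = 100, Property = 0))   # 0

# A 50/50 node is maximally impure for two classes
gini(c(Drug = 50, Property = 50))   # 0.5
```

Plugging the counts of all five offense types in trainSet (e.g., 5,982 Drug and 5,490 Property out of 19,651) into gini() would give the impurity of the root node before any splits.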

But what about all of the numbers in the first five columns? What do they mean? The first five columns show the importance of each variable when classifying a crime as each of the five possible types of crimes listed in Convicting.Offense.Type, which is the variable I used to build both random forest models. The higher the number, the more important that variable is for identifying that particular class. For instance, the highest number (rounded) in the Fiscal.Year.Released row is 11.16, which corresponds to the Violent column. This implies that many Iowa prisoners convicted of violent crimes were likely released around the same time. Another example is the highest number (rounded) in the Main.Supervising.District row, 14.45, which corresponds to the Other column. This implies that Iowa prisoners convicted of crimes that don’t fall under the Drug, Violent, Public Order, or Property umbrellas were, for the most part, supervised in the same Iowa jurisdiction.

  • If you wish to see the raw, unscaled importance values, simply write scale = FALSE after model2 in the parentheses. Remember to separate the two arguments with a comma.

Thanks for reading,

Michael
