R Analysis 5: Random Forests, Decision Trees and Merit Badges


Hello everybody,

It’s Michael, and today’s post will be an R analysis of Boy Scout merit badges earned in 2018, utilizing random forests and decision trees. I thought this would be a fun analysis to do since I am an Eagle Scout (Class of ’14). Also, since I haven’t done so yet, I will show you a regression random forest and a regression decision tree in this analysis.

Here’s the dataset-2018 merit badges.

As always, let’s load our file into R and try to understand our data:
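If you’re following along, the loading step looks something like this (the file name is just my guess-use whatever you named your download):

```r
# Load the merit badge data and inspect its structure
file <- read.csv("2018_merit_badges.csv")  # adjust the path/name to your copy
str(file)  # should show 137 obs. of 7 variables
```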


So as you can see, we’ve got 137 observations of 7 variables. Here’s a variable-by-variable breakdown:

  • Rank-These badges are ranked based on how many Scouts earned them; the ranks go from 1 (most earned badge) to 137 (least earned badge)
    • The original dataset I found on ScoutingMagazine.org had 139 observations, but since the bottom 2 had discontinued or rebranded merit badges (Computers and Cinematography), I removed those two observations from the dataset
  • Merit.Badge-The name of the merit badge
  • X2018.earned-How many Scouts earned that particular badge
  • Year.Introduced-The year the badge was introduced into the BSA
  • Requirements.Revision-The year of the most recent requirements revision for the badge
  • Change.from.17-The percentage change of Scouts who earned a particular merit badge from 2017 to 2018
    • I mentioned that this variable is a percentage change; however, I had to list these values as decimals so R would read them as num and not factor.
  • Eagle.Required-Whether or not a badge is required for the Eagle Scout rank; the two values are “Yes” and “No”
    • Just so you know, there are 21 merit badges required for the Eagle Scout rank.

Now let’s set up our model (remember to install the randomForest package). Since the model is fairly small-with only 137 observations-there’s no need to split the data into training and validation sets. So, here’s the model along with the corresponding summary:
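In code form, the model looks something like this (assuming your data frame is named file, as it is elsewhere in this post; the set.seed call is optional and just makes the out-of-bag estimates reproducible):

```r
library(randomForest)

set.seed(123)  # optional: reproducible out-of-bag estimates

# Regression random forest: X2018.earned predicted from Year.Introduced and Rank
model1 <- randomForest(X2018.earned ~ Year.Introduced + Rank, data = file)
model1  # prints the mean of squared residuals and % var explained
```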

In this model, I used X2018.earned as my dependent variable and Year.Introduced and Rank as my independent variables. In other words, I am attempting to figure out whether the number of Scouts that earned a particular merit badge has to do with when the badge was introduced into the BSA and/or the badge’s rank of popularity. Since my dependent variable is non-binary, I will be performing a regression rather than a classification.

Now, you will see two unfamiliar terms at the bottom of the output. Here’s an explanation of each term:

  • The mean of squared residuals is the average of the squares of the residuals.
    • In regression analysis (linear & logistic), residuals are the differences between the observed value and the predicted value of the dependent variable-X2018.earned.
  • The % var explained is a measure of how well the out-of-bag predictions explain the variance of the training set; a low % var explained could be due to either true random behavior or a lack of fit. Our % var explained-97.07%-is high, which indicates the model fits the data well.
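If you want to verify the mean of squared residuals by hand, it’s just the average squared gap between the observed and predicted values:

```r
# residual = observed value minus predicted value of the dependent variable
res <- file$X2018.earned - predict(model1)
mean(res^2)  # should match the "Mean of squared residuals" in the output
```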

Now, you might recall that when dealing with classification random forests, we create several classification matrices. However, when dealing with regression random forests, we create a graph of the random forest using the handy ggplot2 package (which you must remember to install):
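Here’s a sketch of that plotting code (the exact axis choices are my assumption-any predicted-vs-actual scatter conveys the same idea):

```r
library(ggplot2)

# Store the predictions alongside the actual values
file$predicted <- predict(model1)

# Scatter the predicted values against the actual values of X2018.earned
ggplot(file, aes(x = X2018.earned, y = predicted)) +
  geom_point() +
  labs(x = "Actual Scouts earning the badge", y = "Predicted")
```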

And here’s the actual graph:

OK, so what’s the deal with the actual and predicted columns? Both of these columns demonstrate the concept of residuals. Remember that earlier in this post, I mentioned that residuals are the differences between the actual and predicted values of the dependent variable-X2018.earned.

This graph shows all of the predicted and actual values of X2018.earned. In case you were wondering, here are all of the predicted values of X2018.earned in order of their appearance in the dataset. Since I sorted all the badges (and the other information in turn) alphabetically, the first number-2073.128-corresponds to the American Business merit badge, while each subsequent number corresponds to the next merit badge alphabetically. The last number-5492.116-corresponds to the alphabetically last merit badge-Woodworking:

  • In case you’re wondering how I got this output of numbers, I wrote the line file$predicted <- predict(model1) and then typed in file$predicted to get this output.

But that’s not all. See, we’re then going to take the predicted values and place them onto two separate graphs-one with Rank and the other with Year.Introduced. The predicted values will be the y-axis in both graphs, while the independent variables will each serve as the x-axis. The whole point of making these graphs is to see how well the predicted values correlate with each of the independent variables.

First, here’s the code and the graph for Year.Introduced:
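A minimal version of that code might be:

```r
# Predicted values (y-axis) against the year each badge was introduced (x-axis)
ggplot(file, aes(x = Year.Introduced, y = predicted)) +
  geom_point()
```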

From this graph, you can see there is a small negative correlation between the predicted values and Year.Introduced. This implies that, even with all of the new merit badges that have been introduced in the last decade (e.g. Robotics, Game Design, etc.), many Scouts still went for the traditional merit badges (e.g. Archery, First Aid, Cycling) that have been around since the BSA’s founding in 1910. Most of the traditional badges, such as Swimming and Cooking, are Eagle-required badges, which could explain why plenty of Scouts earned them in 2018.

  • Granted, many of the traditional badges had been around since 1910, but their Year.Introduced is 1911, since that’s the year the first Scout handbook came out.

Want to figure out the exact correlation between Year.Introduced and the predicted values? Here’s how:
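One quick way is the cor function:

```r
# Correlation between the predicted values and Year.Introduced
cor(file$predicted, file$Year.Introduced)  # -0.1470237 in my run
```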

The number -0.1470237 indicates that there is some negative correlation between the predicted values and Year.Introduced. This means that the more recently a badge was introduced, the fewer Scouts earned it-SOMETIMES. There are exceptions to this rule. For instance, twice as many Scouts earned the Chess badge-which was introduced in 2011-as the Aviation badge-which was introduced 59 years earlier. And here’s a more interesting example-four times as many Scouts earned the Exploration badge (which was just introduced in 2017) as the Stamp Collecting badge (which has been around since 1932).

Now here’s the code and the graph for Rank:
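The code mirrors the Year.Introduced graph, just with Rank on the x-axis:

```r
# Predicted values (y-axis) against each badge's popularity rank (x-axis)
ggplot(file, aes(x = Rank, y = predicted)) +
  geom_point()
```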

This graph shows that Rank has a much stronger correlation with the predicted values than Year.Introduced does. This makes perfect sense; after all, the higher a certain badge’s popularity ranking is, the more Scouts earned that badge in 2018, and vice versa.

Want to know the exact correlation? Check this out:
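It’s the same cor call as before, pointed at Rank:

```r
cor(file$predicted, file$Rank)  # -0.8173855 in my run
```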

Just as with Year.Introduced, there is also a negative correlation. However, the number -0.8173855 implies a much stronger inverse correlation (stronger than the correlation with Year.Introduced).

Now let’s make some regression decision trees (remember to install the rpart package):
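Here’s a sketch of the tree-building step, using the same formula as the random forest (the object name tree matches the printcp(tree) calls later in this post):

```r
library(rpart)

# Regression decision tree: method = "anova" makes this a regression
# (rather than classification) tree
tree <- rpart(X2018.earned ~ Year.Introduced + Rank, data = file,
              method = "anova")
printcp(tree)  # prints the pruning (complexity parameter) table
```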

The formula used to create the tree is the same one you would use for classification decision trees, except use anova for the method.

  • ANOVA stands for analysis of variance, which is a type of regression that I will likely cover in a future R post.

You’ll recognize the pruning table at the bottom of the printcp(tree) output. Remember that the optimal place to prune the tree would be at the lowest level where rel error - xstd < xerror.

Now let’s plot the pre-pruned tree (remember to keep the window with the tree open when you are writing the text line):
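The plotting step looks something like this (the uniform and margin settings are just my preference for readability):

```r
plot(tree, uniform = TRUE, margin = 0.1)  # draw the tree
text(tree)  # add the labels -- keep the plot window open for this line
```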

In this tree, only Rank was used as a splitting variable, even though Year.Introduced may well have been considered during the tree’s construction. By the way, I’ll round up on all of the ranks.

Now, here’s a breakdown of the tree:

  • The top node contains all 137 observations and splits on whether a badge is ranked 22 or lower.
  • The top node splits into two branches:
    • The left branch contains badges ranked 55 or lower and 116 observations.
    • The right branch contains badges ranked 12 or lower and 21 observations.
    • The branch with 116 observations (the one that splits at rank 55) is further split into two branches:
    • The left branch contains badges ranked 95 or lower and 83 observations.
      • This branch further splits into two terminal nodes, each containing the number of Scouts and the observations in the group.
        • The left node contains 2,587 Scouts and 43 merit badges.
        • The right node contains 6,563 Scouts and 40 merit badges.
    • The right branch contains badges ranked 31 or lower and 33 observations.
      • The right branch also further splits into two terminal nodes, each containing the number of Scouts and observations in the group.
        • The left node contains 1,287 Scouts and 24 merit badges.
        • The right node contains 2,083 Scouts and 9 merit badges.
  • The branch containing badges ranked 12 or lower (and with 21 observations) splits into two terminal nodes.
    • The left node contains 3,424 Scouts and 9 merit badges.
    • The right node contains 5,364 Scouts and 12 merit badges.

So here’s a summary of all the groups in the tree, going from left to right:

  • Group 1 (the leftmost) contains 2,587 Scouts and 43 merit badges.
  • Group 2 contains 6,563 Scouts and 40 merit badges.
  • Group 3 contains 1,287 Scouts and 24 merit badges.
  • Group 4 contains 2,083 Scouts and 9 merit badges.
  • Group 5 contains 3,424 Scouts and 9 merit badges.
  • Group 6 (the rightmost) contains 5,364 Scouts and 12 merit badges.

Now let’s see if our tree needs pruning:
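Here’s what that checking-and-pruning step looks like in code (the CP value of 0.061173 comes from the printcp(tree) table):

```r
plotcp(tree)  # visual summary of the printcp(tree) output

# Prune at the CP value for the optimal node
pfit <- prune(tree, cp = 0.061173)
plot(pfit, uniform = TRUE, margin = 0.1)
text(pfit)
```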

And here’s the pruned tree:

So as it turns out, this tree did need to be pruned. As you can see, there are only three terminal nodes now, as opposed to the six that were in the pre-pruned tree. The top node is still the same-all 137 merit badges, splitting at rank 22. The top node’s splits are also the same-116 badges (splitting at rank 55) on the left, 21 badges on the right.

However, that’s where the similarities end. Unlike the pre-pruned tree, the right split from the top branch is a terminal node; this node contains 4,532 Scouts and 21 merit badges and doesn’t split any further. The left split, on the other hand, splits into two branches (just as it does in the pre-pruned tree), but unlike the pre-pruned tree, those branches are terminal nodes. Those two terminal nodes contain 83 badges/4,503 Scouts on the left split and 33 badges/1,504 Scouts on the right split.

So here’s a breakdown of each group in this tree:

  • Group 1 (left)-4,532 Scouts and 21 merit badges
  • Group 2 (center)-4,503 Scouts and 83 merit badges
  • Group 3 (right)-1,504 Scouts and 33 merit badges

How did I know where to prune the tree? Here’s how:

When I typed in plotcp(tree), I got this handy little diagram, which is a visual summary of the output I got with printcp(tree). The area I circled, which corresponds to Node 3, is the optimal place to prune the tree, since it’s the one closest above the dashed line (which I’ll call the pruning line). You might think Node 4 would have been a better place to prune the tree, but since it touches the pruning line, it would not have been a good place to prune.

So after determining the best place to prune the tree, I look up the CP for Node 3-0.061173-and use that for the pfit <- prune(tree, cp=0.061173) line (this is the line that prunes the tree-the CP tells it where to prune).

Thanks for reading,

Michael

R Lesson 14: Random Forests


Hello everybody,

It’s Michael, and today’s post will be on random forests in R.

You’re probably wondering what a random forest does. So was I, since I’d never actually done any random forest problems before (the other R lessons covered concepts from my undergrad business analytics coursework, so I was familiar with them).

Random forests are a form of supervised learning (look at R Lesson 10: Intro to Machine Learning-Supervised and Unsupervised for an explanation on supervised and unsupervised learning). But to better explain the concept of random forests, let me provide an easy to understand example.

Let’s say you wanted to visit the new ice cream shop in town, but aren’t sure if it’s any good. Would you go in and visit the place anyway, regardless of any doubts you have? It’s unlikely, since you’d probably want to ask all of your buddies and/or relatives about their thoughts on the new ice cream place. You would ideally ask multiple people for their opinions, since the opinion of one person is usually biased based on his/her preferences. Therefore, by asking several people for their opinions, we can eliminate as much bias as possible from our decision-making process. After all, one person might not like the ice cream shop in question because they weren’t satisfied with the service, or they don’t like the shop’s options, or a myriad of other reasons.

In data analytics jargon, the above paragraph provides a great example of a technique called ensembling, which is a type of supervised learning technique where several models are trained on a training set with their individual outputs combined by a set rule to determine the final output.

Now, what does all of that mean? First of all, when I mentioned that several models are trained on a training set, this means that either the same model with different parameters or different models can utilize the training set.

When I mentioned that their [referring to the “several models”] individual outputs are combined by a set rule to determine the final output, this means that the final output is derived by using a certain rule to combine all the other outputs. There are plenty of rules that can be utilized, but the two most common are averages (for numerical output) and votes (for non-numerical output). In the case of numerical output, we can take the average of all the outputs and use that average as the final output. In the case of categorical (or non-numerical) output, we can use the majority vote (the output that occurs the most times) as the final output.
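Here’s a tiny toy example of both combining rules (the numbers and labels are made up purely for illustration):

```r
# Five models' outputs for the same observation:

# Numerical output: combine by averaging
numeric_outputs <- c(4.0, 4.5, 5.0, 4.5, 4.0)
mean(numeric_outputs)  # 4.4 becomes the final output

# Categorical output: combine by majority vote
votes <- c("good", "good", "bad", "good", "bad")
names(which.max(table(votes)))  # "good" wins, 3 votes to 2
```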

Random forests are an example of an ensembling algorithm. Random forests work by creating several decision trees (which I’ll explain a little more below) and combining their output to derive a final output. Decision trees are easy-to-understand models; however, they don’t have much predictive power on their own, which is why they have been referred to as weak learners.

Here’s a basic example of a decision tree:

This tree isn’t filled with analytical jargon, but the idea is the same. See, this tree answers a question-“Is a person fit?”-just as decision trees do. It then starts with a basic question on the top branch-“Is the person younger than 30?” Depending on the answer to that question (Yes/No), another question is then posed, which can be either “Eats a lot of pizzas?” or “Exercises in the morning?”. And depending on the answer to either question (both are Yes/No questions), you get the answer to the question the tree is trying to answer-“Is a person fit?”. For instance, if the answer to “Eats a lot of pizzas?” is “Yes”, then the answer to the question “Is a person fit?” would be “No”.

This tree is a very simple example of decision trees, but the basic idea is the same with R decision trees (which I will show you more of in the next post).

Now let’s get into the dataset-Iowa Recidivism-which gives data on people serving a prison term in the state of Iowa between 2010 and 2015, with recidivism follow–up between 2013 and 2018. If you are wondering, I got this dataset from Kaggle.

Now, let’s start our analysis by first understanding our data:
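Loading the data looks something like this (the file name is my guess-use whatever your Kaggle download is called):

```r
file <- read.csv("Iowa_Recidivism.csv")  # adjust to your file's name/path
str(file)  # should show 26020 obs. of 12 variables
```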

Here, we have 26,020 observations of 12 variables, which is a lot, but let’s break it down variable-by-variable:

  • Fiscal.Year.Released-the year the person was released from prison/jail
  • Recidivism.Reporting.Year-the year recidivism follow-up was conducted, which is 3 years after the person was released
  • Race..Ethnicity-the race/ethnicity of the person who was released
  • Age.At.Release-the age a person was when he/she was released. Exact ages aren’t given; instead, age brackets are mentioned, of which there are five. They include:
    • Under 25
    • 25-34
    • 35-44
    • 45-54
    • 55 and over
  •  Convicting.Offense.Classification-The level of seriousness of the crime, which often dictates the length of the sentence. They include:
    • [Class] A Felony-Up to life in prison
    • [Class] B Felony-25 to 50 years in prison
    • [Class] C Felony-Up to 10 years in prison
    • [Class] D Felony-Up to 5 years in prison
    • Aggravated Misdemeanor-Up to 2 years in prison
    • Serious Misdemeanor-Up to 1 year in prison
    • Simple Misdemeanor-Up to 30 days in prison
  • Convicting.Offense.Type-the general category of the crime the person was convicted of, which can include:
    • Drug-drug crimes (anything from narcotics trafficking to possession of less than an ounce of marijuana)
    • Violent-any violent crime (murder, sexual assault, etc.)
    • Property-any crimes against property (like robbery/burglary)
    • Public order-usually victimless and non-violent crimes that go against societal norms (prostitution, public drunkenness, etc.)
    • Other-any other crimes that don’t fit the four categories mentioned above (like any white collar crimes)
  • Convicting.Offense.Subtype-the more specific category of the crime the person was convicted of, which can include “animal” (for animal cruelty crimes), “trafficking” (which I’m guessing refers to human trafficking), among others.
  • Main.Supervising.District-the jurisdiction supervising the offender during the 3 year period (Iowa has 11 judicial districts)
  • Release.Type-why the person was released, which can include parole, among other things
  • Release.type..Paroled.to.detainder.united-this is pretty much the same as Release.Type; I won’t be using this variable in my analysis
  • Part.of.Target.Population-whether or not a prisoner is a parolee
  • Recidivism...Return.to.Prison.numeric-Whether or not a person returned to prison within 3 years of release; 1 means “no”, they didn’t return and 0 means “yes”, they did return

Now let’s see if we’ve got any missing observations (remember to install the Amelia package):
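The check is a one-liner with Amelia’s missmap function:

```r
library(Amelia)

# Draws a map of the dataset: missing cells show up in a contrasting color
missmap(file, main = "Missing values in the Iowa recidivism data")
```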

Looks like there’s nothing missing in our dataset, which is always a good sign (though not all datasets will be perfectly tidy).

Next let’s split our data into training and validation sets. If you’re wondering what a validation set is, it’s essentially the dataset you would use to fine tune the parameters of a model to prevent overfitting. You can make a testing set for this model if you want, but I’ll just go with training and validation sets.

Here’s the training/validation split, following the 75-25 rule I used for splitting training and testing sets:

  • This will be a different approach to splitting the data than what you’ve seen in previous R posts. The above line of code is used as a guideline for splitting the data into training (trainSet) and validation (validationSet) sets. It tells R to utilize 75% of the data (19651 observations) for the training set. However, the interesting thing is the sample function, which tells R to utilize ANY 19651 observations SELECTED AT RANDOM, as opposed to simply using observations 1-19651 (nrow just counts the total number of rows in the dataset).
  • On the lines of code seen below, train selects the 75% of the observations chosen with the sample function while -train selects the other 25% of observations that are NOT part of the training set. By the way, the minus sign means NOT.
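Putting those bullets together, the split looks something like this (the set.seed call is optional and just makes the random split reproducible):

```r
set.seed(123)

# Randomly choose 75% of the row numbers...
train <- sample(nrow(file), 0.75 * nrow(file))

# ...then index the data frame with them (and with their complement)
trainSet      <- file[train, ]   # the randomly chosen 75%
validationSet <- file[-train, ]  # everything NOT in train
```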

Now here’s our random forest model, using default parameters and the Convicting.Offense.Type variable (remember to install the randomForest package):
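Here’s a sketch of the model (I’m assuming every column serves as a predictor except the unused Release.type..Paroled.to.detainder.united variable):

```r
library(randomForest)

# Classification forest with default parameters; the dependent variable
# is Convicting.Offense.Type
model1 <- randomForest(Convicting.Offense.Type ~ . - Release.type..Paroled.to.detainder.united,
                       data = trainSet)
model1  # prints the OOB error estimate and the confusion matrix
```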

In case you were wondering, the default number of decision trees is 500, and the default number of variables tried at each split is the square root of the number of predictors (rounded down), which works out to 3 in this case. The OOB estimate of error rate isn’t a default-it’s an output of the model-and here it’s quite low at 0.22%.

The confusion matrix you see details how many observations from the Convicting.Offense.Type variable are correctly classified and how many are misclassified. A margin of error-class.error-in this matrix tells you the percentage of observations that were misclassified. Here’s a breakdown:

  • Drug-No misclassification
  • Other-3.5% misclassification
  • Property-0.02% misclassification
  • Public Order-0.04% misclassification
  • Violent-0.1% misclassification

So all the crimes that were classified as Drug from trainSet were classified correctly, while Other had the most misclassification (and even 3.5% is a relatively small misclassification rate).

  • One important thing to note is that random forests can be used for both classification and regression analyses. If I wanted to do a regression analysis, I would’ve put my binary variable-Recidivism...Return.to.Prison.numeric-to the left of the ~ sign and other variables to the right of the ~ sign.

Now let’s change the model by changing ntree and mtry, which refer to the number of decision trees in the model and the number of variables sampled at each split, respectively:
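For example (these exact values are my own picks for illustration-the only requirement here is that ntree goes down from 500 and mtry goes up from 3):

```r
model2 <- randomForest(Convicting.Offense.Type ~ . - Release.type..Paroled.to.detainder.united,
                       data = trainSet, ntree = 300, mtry = 6)
model2
```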

By reducing ntree and increasing mtry, the OOB estimate of error rate decreases (albeit by only 0.02%), and more classes now have zero misclassification error. The previous model had only one class with no misclassification error-Drug-while this model has three-Drug, Property, and Public Order. The errors for the other two classes moved in opposite directions: Other dropped from 3.5% in the previous model to 2.2% in this one, while Violent rose from 0.1% to 0.44%.

Now let’s do some predictions, first on our training set and then on our validation set (using the second model):
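The predictions themselves are two calls to predict (the object names predTrain and predValid are my own):

```r
predTrain <- predict(model2, trainSet)       # predictions on the training set
predValid <- predict(model2, validationSet)  # predictions on the validation set
```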

And then let’s check the accuracy of each prediction:
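One simple way to check accuracy, assuming the predictions are stored in objects named predTrain and predValid:

```r
# Proportion of correctly classified crimes in each set
mean(predTrain == trainSet$Convicting.Offense.Type)
mean(predValid == validationSet$Convicting.Offense.Type)

# Or a full confusion table for the validation set:
table(predValid, validationSet$Convicting.Offense.Type)
```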

For both the training set and the validation set, the classifications are very accurate. trainSet has a 99.9% accuracy rate (only 10 of its 19,651 observations were misclassified), while `validationSet` has a 99.8% accuracy rate (only 13 of its 6,369 observations were misclassified).

Now let’s analyze the importance of each variable using the `importance` function:
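The call itself is short:

```r
importance(model2)  # one row per predictor; one column per class, plus
                    # MeanDecreaseAccuracy and MeanDecreaseGini
```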

So what exactly does this complex matrix mean? It shows the importance of each of the 12 variables when classifying a crime as one of the five offense types in Convicting.Offense.Type. The two important metrics to focus on in this matrix are MeanDecreaseAccuracy and MeanDecreaseGini.

A variable’s MeanDecreaseAccuracy is a metric of how much the accuracy of the model would decrease if that variable were excluded. The higher the number, the more accuracy the model would lose without that variable. For instance, if I were to exclude the Recidivism...Return.to.Prison.numeric variable, the accuracy of the model wouldn’t be impacted much, since it has the lowest MeanDecreaseAccuracy-3.42. Another reason the exclusion of this variable wouldn’t impact the model much is that it is a binary variable, which would be better suited to a regression problem as opposed to a classification problem (which this model is). On the other hand, excluding the variable Convicting.Offense.Subtype would have a significant impact on the model, since it has the largest MeanDecreaseAccuracy at 13,358.53.

A variable’s MeanDecreaseGini is a metric of how much Gini impurity will decrease when a certain variable is chosen to split a node. What exactly is Gini impurity? Well, let’s take a random point in the dataset. Using the class distribution of trainSet, we would classify that point as Drug 5982/19651 of the time and as Property 5490/19651 of the time. The probability that we misclassify the data point based on that class distribution is the Gini impurity. With that said, the MeanDecreaseGini is a measure of how much the misclassification likelihood will decrease if we include a certain variable. The higher this number is, the less likely the model is to make misclassifications if that variable is included. For instance, Convicting.Offense.Subtype has the highest MeanDecreaseGini, so misclassification likelihood will greatly decrease if we include this variable in our model. On the other hand, Part.of.Target.Population has the lowest MeanDecreaseGini, so misclassification likelihood won’t be significantly impacted if this variable is included. This could be because this variable is a binary variable, which is better suited for regression problems rather than classification problems.
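Here’s the Gini arithmetic worked out with the two class proportions above (I’ve lumped the remaining three classes together just to keep the example self-contained):

```r
# Class proportions in trainSet: Drug, Property, and everything else
p <- c(5982, 5490, 19651 - 5982 - 5490) / 19651

# Gini impurity: the chance of misclassifying a random point drawn
# from (and labeled according to) this distribution
gini <- 1 - sum(p^2)
gini  # about 0.656
```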

But what about all of the numbers in the first five columns? What do they mean? The first five columns, along with all the numbers, show the importance of each of the 12 variables when classifying a crime as one of the five possible types of crimes listed in Convicting.Offense.Type, which is the variable I used to build both random forest models. The higher the number, the more important that variable is when classifying a crime as one of the five possible crimes. For instance, the highest number (rounded) in the Fiscal.Year.Released row is 11.16, which corresponds to the Violent column. This implies that many Iowa prisoners convicted of violent crimes were likely released around the same time. Another example would be the highest number (rounded) in the Main.Supervising.District row-14.45-which corresponds to the `Other` column. This implies that Iowa prisoners who were convicted of crimes that don’t fall under the `Drug`, `Violent`, `Public Order`, or `Property` umbrellas would be, for the most part, located in the same Iowa jurisdiction.

  • If you wish to see the raw (unscaled) numbers in this matrix, simply write scale = FALSE after model2 in the parentheses. Remember to separate these two arguments with a comma.

Thanks for reading,

Michael