R Analysis 5: Random Forests, Decision Trees and Merit Badges

Advertisements

Hello everybody,

Its Michael, and today’s post will be an R analysis about Boy Scout merit badges earned in 2018 utilizing random forests and decision trees. I thought this would be a fun analysis to do since I am an Eagle Scout (Class of ’14).  Also, since I haven’t done so yet, I will show you a regression random forest and a regression decision tree in this analysis.

Here’s the dataset-2018 merit badges.

As always, let’s load our file into R and try to understand our data:

Screen Shot 2019-06-15 at 2.46.40 PM

So as you can see, we’ve got 137 observations of 7 variables. Here’s a variable-by-variable breakdown:

  • Rank-These badges are ranked based on how many Scouts earned them; the ranks go from 1 (most earned badge) to 137 (least earned badge)
    • The original dataset I found on ScoutingMagazine.org had 139 observations, but since the bottom 2 had discontinued or rebranded merit badges (Computers and Cinematography), I removed those two observations from the dataset
  • Merit.Badge-The name of the merit badge
  • X2018.earned-How many Scouts earned that particular badge
  • Year.Introduced-The year the badge was introduced into the BSA
  • Requirements.Revision-The year of the most recent requirements revision for the badge
  • Change.from.17-The percentage change of Scouts who earned a particular merit badge from 2017 to 2018
    • I mentioned that this variable was a percentage change, however I had to list these values as decimals so R would read them as num and not factor.
  • Eagle.Required-Whether or not a badge is required for the Eagle Scout rank; the two values are “Yes” and “No”
    • Just so you know, there are 21 merit badges required for the Eagle Scout rank.

Now let’s set up our model (remember to install the randomForest package). Since the model is fairly small-with only 137 observations-there’s no need to split the data into training and validation sets. So, here’s the model along with the corresponding summary:

In this model, I used X2018.earned as my dependent variable and Year.Introduced and Rank as my independent variables. In other words, I am attempting to figure out whether the amount of Scouts that earned a particular merit badge has to do with when the badge was introduced into the BSA and/or the badge’s rank of popularity. Since my dependent variable is non-binary, I will be performing a linear regression.

Now, you will see two unfamiliar terms at the bottom of the output. Here’s an explanation of each term:

  • The mean of squared residuals is the average of the squares of the residuals.
    • In regression analysis (linear & logistic), residuals are the differences between the observed value and the predicted value of the dependent variable-X2018.earned.
  • The % var explained is a measure  of how well out-of-bag predictions explain the variance of the training set; a low % var explained could be due to either true random behavior or a lack of fit. Our % var explained-97.07%-is high, which indicates plenty of fit in the model.

Now, you might recall that when dealing with classification random forests, we create several classification matrices. However, when dealing with regression random forests, we create a graph of the random forest using the handy ggplot2 package (which you must remember to install):

And here’s the actual graph:

OK, so what’s the deal with the actual and predicted columns? Both of these columns demonstrate the concepts of residuals. Remember that earlier in this post, I mentioned that residuals are the differences between the actual and predicted values of the dependent variable-X2018.earned.

This graph shows all of the predicted and actualvalues of X2018.earned. In case you were wondering, here are all of the predicted values of X2018.earned in order of their appearance on the dataset. Since I sorted all the badges (and other information in turn) alphabetically, the first number-2073.128-corresponds to the American Business merit badge, while each subsequent number corresponds to the next merit badge alphabetically. The last number-5492.116-corresponds to the alphabetically last merit badge-Woodworking:

  • In case you’re wondering how I got this output of numbers, I wrote the line file$predicted <- predict(model1) and then typed in file$predicted to get this output.

But that’s not all. See, we’re then going to take the predicted values and place them onto two separate graphs-one with Rank and the other with Year.Introduced. The predicted values will be the y-axis in both graphs, while the dependent variables will each serve as the x-axes. The whole point of making these graphs is to see how well the predicted values correlate with each of the independent variables.

First, here’s the code and the graph for Year.Introduced:

From this graph, you can see there is a small negative correlation between the predicted values and Year.Introduced. This implies that, even with all of the new merit badges that have been introduced in the last decade (e.g. Robotics, Game Design, etc.), many Scouts still went for the traditional merit badges (e.g. Archery, First Aid, Cycling) that have been around since the BSA’s founding in 1910. Most of the traditional badges, such as Swimming and Cooking, are Eagle-required badges, which could explain why plenty of Scouts earned them in 2018.

  • Granted, many of the traditional badges had been around since 1910, but their Year.Introduced is 1911, since that’s the year the first Scout handbook came out.

Want to figure out the exact correlation between Year.Introduced and the predicted values? Here’s how:

The number -0.1470237 indicates that there is some negative correlation between the predicted values and Year.Introduced. This means that the more recently a badge was introduced, the fewer scouts earned it-SOMETIMES. There are exceptions to this rule. For instance, twice as many Scouts earned the Chess badge-which was introduced in 2011-than the Aviation badge-which was introduced 59 years earlier. And here’s a more interesting example-four times as many Scouts earned the Exploration badge (which was just introduced in 2017) than the Stamp Collecting badge (which has been around since 1932).

Now here’s the code and the graph for Rank:

This graph shows that Rank has a much stronger correlation than Year.Introducedwith the predicted values. This makes perfect sense; after all, the higher a certain badge’s popularity ranking is, the more Scouts earned that badge in 2018, and vice versa.

Want to know the exact correlation? Check this out:

Just as with Year.Introduced, there is also a negative correlation. However, the number -0.8173855 implies a much stronger inverse correlation (stronger than the correlation with Year.Introduced).

Now let’s make some regression decision trees (remember to install the rpart package):

The formula used to create the tree is the same one you would use for classification decision trees, except use anova for the method.

  • ANOVA stands for analysis of variance, which is a type of regression that I will likely cover in a future R post.

You’ll recognize the pruning table at the bottom of the printcp(tree) output. Remember that the optimal place to prune the tree would be at the lowest level where  rel error-xstd < xerror.

Now let’s plot the pre-pruned tree (remember to keep the window with the tree open when you are writing the text line):

In this tree, only Rank was used, even though Year.Introduced was likely factored into the tree’s construction (though I’m not certain that it was). By the way, I’ll round up on all of the ranks.

Now, here’s a breakdown of the tree:

  • The top row contains all 137 observations and any badge ranked 22 or lower.
  • The top row splits into two branches:
    • The left branch contains badges ranked 55 or lower and 116 observations.
    • The right branch contains badges ranked 12 or lower and 21 observations.
  • The branch with the badges ranked 55 or higher is further split into two branches (this branch has 116 observations):
    • The left branch contains badges ranked 95 or lower and 83 observations.
      • This branch further splits into two terminal nodes, each containing the number of Scouts and the observations in the group.
        • The left node contains 2,587 Scouts and 43 merit badges.
        • The right node contains 6,563 Scouts and 40 merit badges.
    • The right branch contains badges ranked 31 or lower and 33 observations.
      • The right branch also further splits into two terminal nodes, each containing the number of Scouts and observations in the group.
        • The left node contains 1,287 Scouts and 24 merit badges.
        • The right node contains 2,083 scouts and 9 merit badges.
  • The branch containing badges ranked 12 or lower (and with 21 observations) splits into two terminal nodes.
    • The left node contains 3,424 Scouts and 9 merit badges.
    • The right node contains 5,364 Scouts and 12 merit badges.

So here’s a summary of all the groups in the tree, going from left to right:

  • Group 1 (the leftmost) contains 2,587 Scouts and 43 merit badges.
  • Group 2 contains 6,563 Scouts and 40 merit badges.
  • Group 3 contains 1,287 Scouts and 24 merit badges.
  • Group 4 contains 2,083 Scouts and 9 merit badges.
  • Group 5 contains 3,424 Scouts and 9 merit badges.
  • Group 6 (the rightmost) contains 5,364 Scouts and 12 merit badges.

Now let’s see if our tree needs pruning:

And here’s the pruned tree:

So as it turns out, this tree did need to be pruned. As you can see, there are only 3 terminal nodes now as opposed to the six that were in the pre-pruned tree. The top branch is still the same-all of the merit badges and ranking 22 or lower. The top branch’s splits are also the same-116 badges and ranking 55 or lower on the left, 21 badges on the right.

However, that’s where the similarities end. Unlike the pre-pruned tree, the right split from the top branch is a terminal node; this node contains 4,532 Scouts and 21 merit badges and doesn’t split any further. The left split, on the other hand, splits into two branches (just as it does in the pre-pruned tree), but unlike the pre-pruned tree, those branches are terminal nodes. Those two terminal nodes contain 83 badges/4,503 Scouts on the left split and 33 badges/1,504 Scouts on the right split.

So here’s a breakdown of each group in this tree:

  • Group 1 (left)-4,532 Scouts and 21 merit badges
  • Group 2 (center)-4,503 Scouts and 83 merit badges
  • Group 3 (right)-1,504 Scouts and 33 merit badges

How did I know where to prune the tree? Here’s how:

When I typed in plotcp(tree), I got this handy little diagram, which is a visual summary of the output I got with printcp(tree). The area I circled, which corresponds to Node 3, is the optimal place to prune the tree, since it’s the one closest above the dashed line (which I’ll call the pruning line). You might think Node 4 would have been a better place to prune the tree, but since it touches the pruning line, that would not have a good place to prune the tree.

So after determining the best place to prune the tree, I look up the CP for Node 3-0.061173-and use that for the pfit <- prune(tree, cp=0.061173) line (this is the line that prunes the tree-the CP tells it where to prune).

Thanks for reading,

Michael

R Lesson 15: Decision Trees

Advertisements

Hello everybody,

It’s Michael, and today’s lesson will be about how to create decision trees in R. I briefly explained the concept of decision trees in my previous post. This post serves as the second part of a two-part lesson (with R Lesson 14 being the first part); I will be using the same dataset.

Now I know you’ve learned the basic concept of a decision tree from my previous post, so I’ll start off this post by going further in depth with more decision tree terminology.

Here’s an example of the decision tree (it’s not the one you saw in my previous post):

The uppermost question-“Income range of applicant?”-is the root node of the tree, which represents the entire population/sample. Splitting the tree involves dividing a node (ovals represent the nodes here) into sub-nodes (represented by the other questions in this tree); the opposite of splitting is pruning, which is when sub-nodes are removed. When sub-nodes split into further sub-nodes, they are referred to as decision nodes (but you can use the terms sub-node and decision node interchangeably). Also, when dealing with sub-nodes, the main node is called the parent node while the sub-node is called the child node. Nodes can be both a parent and child node; for instance, the “Years in present job?” node is the child node of the “Income range of applicant?” node but it is the parent node of the “Makes credit card payments?” node. Nodes that don’t split are called terminal nodes, or leaves (the ovals with loan and no loan are terminal nodes in this tree).

There are two main types of decision trees-regression and classification (which are also the two types of random forest models). Regression trees are more appropriate for QUANTITATIVE problems (such as the percentage likelihood that something will or wont happen) while classification trees are more appropriate for QUALITATIVE problems (such as Yes/No questions). Since the models we made in the previous post are classification problems, then we will learn how to build classification trees.

  • An easy way to remember how decision trees work is to think of them as questions that need more questions to be answered.

To create our decision tree, let’s start by installing a package called rpart.  Then let’s use the printcp function to display a summary of the tree:

To create the classification tree, use the `Convicting.Offense.Type` variable (and only this variable) along with `trainSet` as your data. Since we are creating a classification tree, use `class` as the method; if we were creating a regression tree, use `anova`for the method.

Now what do all the unfamiliar terms in the output actually mean? The first term, root node error, sounds like a misnomer since it refers to percentage of correctly sorted records at the first root or split. The higher this number is, the better. Here, the root node error is 69.3%, which is OK, but we should try to figure out how to increase this number to at least 75%.

The other part of the output is the pruning table that you can see at the bottom of the output (that six-by-five matrix is referred to as a pruning table). The numbers 1-5 refer to each of the five values of Convicting.Offense.TypeDrug, Property, Violent, Public Order, and Other. The CP refers to the complexity parameter, which is a metric that is used to control the size of the tree and select the optimal tree size. It decreases with each successive level (the numbers 1-5 are referred to as levels), and tree building stops when the CP reaches 0. nsplit refers to the amount of splits at that level of the tree; in this case, the number of splits equals the level minus 1 (e.g. the 1st level has no splits, while the 2nd level has 1 split). However, that won’t always be the case, since the 2nd level of a tree might have more than 1 split sometimes.

The `relerror` for each level is 1 minus the root mean squared error (or the mean squared error, which is the square of the RMSE); this is the error for predictions of the data that were used to estimate the model-`trainSet` in this case. The lower the `relerror`, the better.

Next is the xerror, which is the cross-validated error rate for each level. Cross-validation is the process of testing the results of the model on an independent, unseen dataset in order to avoid overfitting or underfitting the model, as well as to analyze how well the data will fit with an independent data set. The xerror is used to measure the error generated from the cross-validation process; the lower the `xerror` the better. In this case, all of the `xerror` values are equal to all of the `relerror` values, but that won’t always be the case.

Last is the xtsd, which represents the standard deviation, which is a measure of how much observations in a certain level are spread out from one another. The smaller the number, the less spread out the observations are from each other.

The graph below is simply a visual representation of the matrix I just described:

So, what do we do next? We try to find the best place to prune the tree. If you don’t know, pruning means to remove nodes from the tree that aren’t significant to the model. It’s similar to the concept of pruning your tree, since you would be removing unnecessary branches from the tree.

And how do we find the best place to prune the tree? A rule of thumb is to find the lowest level where rel error-xstd < xerror. But, since this is an unusual case where all of the rel error equals all of the xerror, I’ll let R decide where to prune the tree (if doing so is necessary)

But before I do that, let me plot a pre-pruned decision tree:

The first thing you would do is to use the `plot` function (along with any parameters) to plot the tree then use the `text` function (along with any parameters) to place the text on the tree. Remember-KEEP the window with the decision tree open when calling your `text` function.

  • The `margin` parameter is completely optional; it’s just a good idea since the default tree tends to cut off some of the text. A margin of 0.25 or 0.3 is ideal.
  • Even though I didn’t use Convicting.Offense.Subtype in the model, it is included in the tree since the tree couldn’t be created with only one variable (plus `Convicting.Offense.Subtype` and `Convicting.Offense.Type` are closely related to one another).

Now, you might be wondering about all the letters on the tree right by Convicting.Offense.Subtype. Here’s the thing-the actual values of this variable are too long to show on the tree, and there are 26 possible values, so it wouldn’t have been practical to create 26 nodes (one for each value). Since there are 26 possible values, each value is represented by a letter of the alphabet. Each of the values are represented by a letter of the alphabet; the values and their corresponding letters are listed alphabetically. Here’s the list of values along with the letters they correspond to:

  • A-alcohol (I’m guessing public intoxication is a part of this)
  • B-animals (namely animal cruelty offenses)
  • C-arson
  • D-assault
  • E-burglary
  • F-drug possession
  • G-flight/escape (probably referring to prison escapes but could also mean offenders who flee town before trial/sentencing)
  • H-forgery/fraud
  • I-kidnap
  • J-murder/manslaughter (not sure why these are grouped together, since murder would suggest malicious intent while manslaughter is mostly negligence)
  • K-other criminal (meaning some other criminal offense that won’t fit into the other 25 types)
  • L-other drug (any other drug offense that isn’t possession, so sale of drug paraphernalia would count towards this)
  • M-other public order (so something like indecent exposure)
  • N-other violent (other violent crimes that won’t fall into any of the other mentioned subtypes)
  • O-OWI (operating while intoxicated, the legalese term for a DUI)
  • P-prostitution/pimping
  • Q-robbery
  • R-sex (possibly sexual assault, maybe human trafficking)
  • S-sex offender registry/residency (sex crimes against kids)
  • T-special sentence revocation (usually for those who violate probation/parole)
  • U-stolen property
  • V-theft (I guess this means both petty theft and grand theft)
  • W-traffic (any other traffic offense that is not a DUI/OWI)
  • X-trafficking (I think this means drug trafficking)
  • Y-vandalism (graffiti, destruction of property, etc.)
  • Z-weapons (probably illegal possession of weapons, i.e. convicted felon possessing a gun)

So, knowing what each of the letters refer to, how do we read this tree? First, let’s look at the uppermost node. It starts with offense sub-types F, L, and X (which are all drug-related offenses) and has the type of offense-Drug-on the bottom of the Convicting.Offense.Subtype line. The string of numbers you see after the word Drug5982/974/5490/2706/4363-refer to how many observations belong in each level. Since this string of numbers is on the root node, all 19,515 observations in trainSet are considered. As we move down the tree, you’ll start to see more 0s in each section and as a result, you’ll figure out exactly how many observations are in each level.

So, the root node splits into two branches. The left branch has the word Drug and the number string 5982/0/0/0/0 on it. Since this branch doesn’t split any further, we can conclude that 5,982 of the 19,515 offenses are drug crimes. The right branch contains offense sub-types C, E, H, U, V, and Y, which are all property-related offenses, along with the number string 0/974/5490/2706/4363, which refers to all 13,533 of the offenses that aren’t drug-related.

The Property branch (the right branch from the Drug branch split) also splits in two. The left branch contains the word Property and the number string 0/0/5490/0/0. Since this branch doesn’t split any further, we can conclude that 5,490 of the 19,515 offenses are property crimes.  The right branch contains offense sub-types A, B, G, K, M, O, P, S, T, W, and Z, which are all violent crimes, and the number string 0/974/0/2706/4363, which refers to the 8,043 crimes that aren’t drug or property related.

The Violent branch (the right branch from the Property branch split) also splits in two, just like the previous two branches. The right branch has the word Violent and the number string 0/37/0/0/4363, which indicates that 4,400 (4363+37) of the offenses are violent crimes. The left branch has the word Public Order and the offense sub-types B, K, and T, which are public order offenses, along with the number string 0/937/0/2706/0, which refers to the 3,643 observations that haven’t yet been classified. If you’re wondering how 937 was derived, the Violent branch had the number string 0/974/0/2706/4363. When that branch split in two, 37 of the 974 observations went into the right split while the other 937 went into the left split.

The Public Order branch (with the number string 0/937/0/2706/0) also splits into two. The left branch has the word Other and the number string 0/937/0/0/0, meaning that 937 of the offenses are miscellaneous crimes that don’t fit into the other 4 categories and the right branch has the number string 0/0/0/2706/0, meaning that 2,706 of the offenses are public order crimes (i.e. prostitution).

At this point, all observations have been classified. Here’s a breakdown of the results:

  • 5,982 Drug crimes
  • 5,490 Property crimes
  • 937 Other crimes
  • 2,706 Public Order crimes
  • 4,400 Violent crimes

Now, we will let R do the magic of pruning the tree (if R finds it necessary to do so). Here’s the code that you need to prune the tree (remember to not close the window with the tree when you are writing the text line):

And here’s the tree. Turns out, R didn’t think it was necessary to prune the tree, so we got the exact same tree.

How come R didn’t prune the tree. Well, let’s analyze this graph from earlier:

Notice the dashed horizontal line. See how it meets the curve at the 5th level. Well, that line gives us an idea of where the tree should be pruned. Since it’s at the 5th level, the tree won’t be pruned, as all 5 levels are significant. This could also be because all of the rel error values are equal to all of the xerror values.

Thanks for reading,

Michael