R Analysis 5: Random Forests, Decision Trees and Merit Badges


Hello everybody,

It's Michael, and today's post is an R analysis of Boy Scout merit badges earned in 2018, using random forests and decision trees. I thought this would be a fun analysis to do since I am an Eagle Scout (Class of '14). Also, since I haven't done so yet, I will show you a regression random forest and a regression decision tree in this analysis.

Here's the dataset: 2018 merit badges.

As always, let’s load our file into R and try to understand our data:
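The screenshot of this step didn't survive, so here's a minimal sketch of the loading step. The filename is my assumption (I don't know what the original file was called), and I build a tiny three-row stand-in CSV with made-up numbers so the snippet runs on its own; the real file has 137 rows with these same 7 columns:

```r
# Build a tiny stand-in CSV so this sketch is self-contained
# (values are hypothetical; the real file has 137 rows, same 7 columns)
writeLines(c(
  "Rank,Merit.Badge,X2018.earned,Year.Introduced,Requirements.Revision,Change.from.17,Eagle.Required",
  "1,First Aid,65000,1911,2018,0.011,Yes",
  "2,Swimming,60000,1911,2014,-0.02,Yes",
  "3,Camping,58000,1911,2016,0.005,Yes"
), "merit_badges_2018.csv")

# Load the dataset and inspect its structure
file <- read.csv("merit_badges_2018.csv")
str(file)   # the real dataset shows 137 obs. of 7 variables
```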


So as you can see, we’ve got 137 observations of 7 variables. Here’s a variable-by-variable breakdown:

  • Rank: These badges are ranked by how many Scouts earned them; the ranks go from 1 (most-earned badge) to 137 (least-earned badge).
    • The original dataset I found on ScoutingMagazine.org had 139 observations, but since the bottom two were discontinued or rebranded merit badges (Computers and Cinematography), I removed those two observations from the dataset.
  • Merit.Badge: The name of the merit badge.
  • X2018.earned: How many Scouts earned that particular badge.
  • Year.Introduced: The year the badge was introduced into the BSA.
  • Requirements.Revision: The year of the most recent revision to the badge's requirements.
  • Change.from.17: The percentage change in the number of Scouts who earned a particular badge from 2017 to 2018.
    • I mentioned that this variable is a percentage change; however, I had to enter these values as decimals so R would read them as num and not factor.
  • Eagle.Required: Whether or not a badge is required for the Eagle Scout rank; the two values are "Yes" and "No".
    • Just so you know, there are 21 merit badges required for the Eagle Scout rank.

Now let's set up our model (remember to install the randomForest package). Since the dataset is fairly small (only 137 observations), there's no need to split the data into training and validation sets. So, here's the model along with the corresponding summary:
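The model screenshot is missing from this copy, so here's a sketch of roughly what the call looks like. I'm generating a stand-in data frame (hypothetical values) so the snippet runs on its own; with the real data, `file` would come from read.csv instead:

```r
library(randomForest)  # install.packages("randomForest") if needed

# Stand-in data frame (hypothetical values); the real `file` comes from read.csv
set.seed(42)
file <- data.frame(
  Rank            = 1:137,
  Year.Introduced = sample(1911:2017, 137, replace = TRUE),
  X2018.earned    = sort(round(runif(137, 500, 66000)), decreasing = TRUE)
)

# Regression random forest: predict badge counts from year introduced and rank
model1 <- randomForest(X2018.earned ~ Year.Introduced + Rank, data = file)
print(model1)  # reports mean of squared residuals and % var explained
```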

In this model, I used X2018.earned as my dependent variable and Year.Introduced and Rank as my independent variables. In other words, I am attempting to figure out whether the number of Scouts who earned a particular merit badge has to do with when the badge was introduced into the BSA and/or the badge's popularity rank. Since my dependent variable is numeric rather than binary, I will be performing a regression rather than a classification.

Now, you will see two unfamiliar terms at the bottom of the output. Here’s an explanation of each term:

  • The mean of squared residuals is the average of the squares of the residuals.
    • In regression analysis (linear & logistic), residuals are the differences between the observed value and the predicted value of the dependent variable, X2018.earned.
  • The % var explained is a measure of how well out-of-bag predictions explain the variance of the training set; a low % var explained could be due to either true random behavior or a lack of fit. Our % var explained (97.07%) is high, which indicates a good fit.

Now, you might recall that when dealing with classification random forests, we create several classification matrices. However, when dealing with regression random forests, we create a graph of the random forest using the handy ggplot2 package (which you must remember to install):
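The plotting code was in a screenshot, so here's a sketch of one way to produce that actual-vs-predicted graph with ggplot2. The stand-in columns are hypothetical; in the post, the predictions come from the fitted forest via `file$predicted <- predict(model1)`:

```r
library(ggplot2)  # install.packages("ggplot2") if needed

# Stand-in values (hypothetical); in the post these come from the fitted forest:
#   file$predicted <- predict(model1)
set.seed(1)
actual    <- sort(round(runif(137, 500, 66000)), decreasing = TRUE)
predicted <- actual + rnorm(137, sd = 1500)
file <- data.frame(actual, predicted)

# Scatterplot of predicted vs. actual values of X2018.earned
p <- ggplot(file, aes(x = actual, y = predicted)) +
  geom_point() +
  labs(x = "Actual X2018.earned", y = "Predicted X2018.earned")
print(p)
```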

And here’s the actual graph:

OK, so what's the deal with the actual and predicted columns? Both of these columns demonstrate the concept of residuals. Remember that earlier in this post, I mentioned that residuals are the differences between the actual and predicted values of the dependent variable, X2018.earned.

This graph shows all of the predicted and actual values of X2018.earned. In case you were wondering, here are all of the predicted values of X2018.earned in order of their appearance in the dataset. Since I sorted all the badges (and the rest of the data in turn) alphabetically, the first number, 2073.128, corresponds to the American Business merit badge, while each subsequent number corresponds to the next merit badge alphabetically. The last number, 5492.116, corresponds to the alphabetically last merit badge, Woodworking:

  • In case you’re wondering how I got this output of numbers, I wrote the line file$predicted <- predict(model1) and then typed in file$predicted to get this output.

But that's not all. See, we're then going to take the predicted values and place them onto two separate graphs: one with Rank and the other with Year.Introduced. The predicted values will be on the y-axis in both graphs, while the independent variables will each serve as the x-axis. The whole point of making these graphs is to see how well the predicted values correlate with each of the independent variables.

First, here’s the code and the graph for Year.Introduced:
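Since the code screenshot is missing, here's a sketch of what this plot looks like in ggplot2. The stand-in columns are hypothetical; with the real data, Year.Introduced comes from the dataset and predicted from the forest:

```r
library(ggplot2)

# Stand-in columns (hypothetical); the real ones come from the dataset and model
set.seed(1)
file <- data.frame(
  Year.Introduced = sample(1911:2017, 137, replace = TRUE),
  predicted       = round(runif(137, 500, 66000))
)

# Predicted values (y-axis) against year of introduction (x-axis)
p <- ggplot(file, aes(x = Year.Introduced, y = predicted)) +
  geom_point() +
  labs(y = "Predicted X2018.earned")
print(p)
```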

From this graph, you can see there is a small negative correlation between the predicted values and Year.Introduced. This implies that, even with all of the new merit badges introduced in the last decade (e.g. Robotics and Game Design), many Scouts still went for the traditional merit badges (e.g. Archery, First Aid, Cycling) that have been around since the BSA's founding in 1910. Most of the traditional badges, such as Swimming and Cooking, are Eagle-required, which could explain why plenty of Scouts earned them in 2018.

  • Granted, many of the traditional badges had been around since 1910, but their Year.Introduced is 1911, since that’s the year the first Scout handbook came out.

Want to figure out the exact correlation between Year.Introduced and the predicted values? Here’s how:
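The screenshot here showed a correlation calculation; in base R that's just cor(). The stand-in columns below are hypothetical, so the number it prints won't match the post's -0.1470237:

```r
# Correlation between the model's predictions and Year.Introduced
# (stand-in values here; the post's real data gives -0.1470237)
set.seed(1)
file <- data.frame(
  Year.Introduced = sample(1911:2017, 137, replace = TRUE),
  predicted       = round(runif(137, 500, 66000))
)
cor(file$predicted, file$Year.Introduced)
```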

The number -0.1470237 indicates that there is some negative correlation between the predicted values and Year.Introduced. This means that the more recently a badge was introduced, the fewer Scouts earned it (sometimes). There are exceptions to this rule. For instance, twice as many Scouts earned the Chess badge, which was introduced in 2011, as earned the Aviation badge, which was introduced 59 years earlier. And here's a more interesting example: four times as many Scouts earned the Exploration badge (introduced just in 2017) as earned the Stamp Collecting badge (which has been around since 1932).

Now here’s the code and the graph for Rank:
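Again the code was a screenshot; a sketch of the Rank version of the same plot, using hypothetical stand-in data (Rank 1 = most-earned badge):

```r
library(ggplot2)

# Stand-in data (hypothetical); Rank 1 = most-earned badge
set.seed(1)
file <- data.frame(
  Rank      = 1:137,
  predicted = sort(round(runif(137, 500, 66000)), decreasing = TRUE)
)

# Predicted values against popularity rank
p <- ggplot(file, aes(x = Rank, y = predicted)) +
  geom_point() +
  labs(y = "Predicted X2018.earned")
print(p)
```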

This graph shows that Rank has a much stronger correlation with the predicted values than Year.Introduced does. This makes perfect sense; after all, the better (numerically lower) a badge's popularity rank, the more Scouts earned that badge in 2018, and vice versa.

Want to know the exact correlation? Check this out:
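Same idea as before: one cor() call. Stand-in data again, so the printed number won't match the post's -0.8173855:

```r
# Correlation between predictions and Rank
# (stand-in values; the post's real data gives -0.8173855)
set.seed(1)
file <- data.frame(
  Rank      = 1:137,
  predicted = sort(round(runif(137, 500, 66000)), decreasing = TRUE)
)
cor(file$predicted, file$Rank)
```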

Just as with Year.Introduced, there is a negative correlation. However, the number -0.8173855 implies a much stronger inverse relationship than the one with Year.Introduced.

Now let’s make some regression decision trees (remember to install the rpart package):
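The tree-building screenshot is missing; here's a sketch of the rpart call, again on hypothetical stand-in data (with the real data, `file` comes from read.csv):

```r
library(rpart)  # install.packages("rpart") if needed

# Stand-in data (hypothetical); the real `file` comes from the dataset
set.seed(42)
file <- data.frame(
  Rank            = 1:137,
  Year.Introduced = sample(1911:2017, 137, replace = TRUE),
  X2018.earned    = sort(round(runif(137, 500, 66000)), decreasing = TRUE)
)

# Regression decision tree: method = "anova" is what makes it a regression tree
tree <- rpart(X2018.earned ~ Year.Introduced + Rank, data = file, method = "anova")

# Complexity/pruning table, including CP, rel error, xerror, and xstd
printcp(tree)
```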

The formula used to create the tree is the same one you would use for classification decision trees, except use anova for the method.

  • ANOVA stands for analysis of variance, which is a type of regression that I will likely cover in a future R post.

You'll recognize the pruning table at the bottom of the printcp(tree) output. Remember that the optimal place to prune the tree is at the lowest level where rel error - xstd < xerror.

Now let’s plot the pre-pruned tree (remember to keep the window with the tree open when you are writing the text line):
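The plotting screenshot is gone; a sketch of the base-rpart plotting commands, refitting on hypothetical stand-in data so it runs on its own:

```r
library(rpart)

# Stand-in fit on hypothetical data (the real tree uses the actual dataset)
set.seed(42)
file <- data.frame(
  Rank         = 1:137,
  X2018.earned = sort(round(runif(137, 500, 66000)), decreasing = TRUE)
)
tree <- rpart(X2018.earned ~ Rank, data = file, method = "anova")

# Draw the tree, then overlay the labels; use.n = TRUE shows the badge count per node
plot(tree, uniform = TRUE, margin = 0.1)
text(tree, use.n = TRUE)
```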

In this tree, only Rank ended up being used for splits; Year.Introduced was considered during the tree's construction, but it never produced a split. By the way, I'll round all of the rank cutoffs.

Now, here’s a breakdown of the tree:

  • The top node contains all 137 observations; its split asks whether a badge is ranked 22 or lower.
  • The top node splits into two branches:
    • The left branch contains 116 observations and splits on badges ranked 55 or lower.
    • The right branch contains 21 observations and splits on badges ranked 12 or lower.
  • The left branch (the one with 116 observations) is further split into two branches:
    • The left branch contains 83 observations and splits on badges ranked 95 or lower.
      • This branch further splits into two terminal nodes; each terminal node shows the average number of Scouts who earned a badge in that group, along with the number of badges in the group.
        • The left node averages 2,587 Scouts across 43 merit badges.
        • The right node averages 6,563 Scouts across 40 merit badges.
    • The right branch contains 33 observations and splits on badges ranked 31 or lower.
      • This branch also splits into two terminal nodes.
        • The left node averages 1,287 Scouts across 24 merit badges.
        • The right node averages 2,083 Scouts across 9 merit badges.
  • The branch containing the badges ranked 12 or lower (the one with 21 observations) splits into two terminal nodes.
    • The left node averages 3,424 Scouts across 9 merit badges.
    • The right node averages 5,364 Scouts across 12 merit badges.

So here’s a summary of all the groups in the tree, going from left to right:

  • Group 1 (the leftmost) averages 2,587 Scouts across 43 merit badges.
  • Group 2 averages 6,563 Scouts across 40 merit badges.
  • Group 3 averages 1,287 Scouts across 24 merit badges.
  • Group 4 averages 2,083 Scouts across 9 merit badges.
  • Group 5 averages 3,424 Scouts across 9 merit badges.
  • Group 6 (the rightmost) averages 5,364 Scouts across 12 merit badges.

Now let’s see if our tree needs pruning:

And here’s the pruned tree:

So as it turns out, this tree did need to be pruned. As you can see, there are only three terminal nodes now, as opposed to the six in the pre-pruned tree. The top node is still the same: all of the merit badges, splitting on ranking 22 or lower. The top node's splits are also the same: 116 badges (splitting on ranking 55 or lower) on the left, 21 badges on the right.

However, that's where the similarities end. Unlike the pre-pruned tree, the right split from the top node is a terminal node; this node averages 4,532 Scouts across 21 merit badges and doesn't split any further. The left split, on the other hand, splits into two branches (just as it does in the pre-pruned tree), but unlike the pre-pruned tree, those branches are terminal nodes. Those two terminal nodes average 4,503 Scouts across 83 badges on the left and 1,504 Scouts across 33 badges on the right.

So here’s a breakdown of each group in this tree:

  • Group 1 (left): an average of 4,532 Scouts across 21 merit badges
  • Group 2 (center): an average of 4,503 Scouts across 83 merit badges
  • Group 3 (right): an average of 1,504 Scouts across 33 merit badges

How did I know where to prune the tree? Here’s how:

When I typed in plotcp(tree), I got this handy little diagram, which is a visual summary of the output I got with printcp(tree). The area I circled, which corresponds to Node 3, is the optimal place to prune the tree, since it's the one closest above the dashed line (which I'll call the pruning line). You might think Node 4 would have been a better place to prune the tree, but since it touches the pruning line, it would not have been a good place to prune.

So after determining the best place to prune the tree, I look up the CP for Node 3 (0.061173) and use it in the line pfit <- prune(tree, cp=0.061173); this is the line that prunes the tree, and the CP value tells it where to prune.
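Putting those steps together, here's a sketch of the pruning sequence, refit on hypothetical stand-in data so it runs on its own. The cp value 0.061173 is the one from the post's printcp output; with different data you'd read your own CP table instead:

```r
library(rpart)

# Stand-in fit on hypothetical data (the real tree uses the actual dataset)
set.seed(42)
file <- data.frame(
  Rank         = 1:137,
  X2018.earned = sort(round(runif(137, 500, 66000)), decreasing = TRUE)
)
tree <- rpart(X2018.earned ~ Rank, data = file, method = "anova")

plotcp(tree)                         # visual version of the printcp(tree) table
pfit <- prune(tree, cp = 0.061173)   # prune at the chosen complexity parameter
plot(pfit, uniform = TRUE, margin = 0.1)
text(pfit, use.n = TRUE)
```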

Thanks for reading,

Michael