Hello everybody,
It’s Michael, and today’s lesson will be about how to create decision trees in R. I briefly explained the concept of decision trees in my previous post. This post serves as the second part of a two-part lesson (with R Lesson 14 being the first part); I will be using the same dataset.
Now I know you’ve learned the basic concept of a decision tree from my previous post, so I’ll start off this post by going further in depth with more decision tree terminology.
Here’s an example of a decision tree (it’s not the one you saw in my previous post):
The uppermost question, “Income range of applicant?”, is the root node of the tree, which represents the entire population/sample. Splitting the tree involves dividing a node (ovals represent the nodes here) into sub-nodes (represented by the other questions in this tree); the opposite of splitting is pruning, which is when sub-nodes are removed. When sub-nodes split into further sub-nodes, they are referred to as decision nodes (though you can use the terms sub-node and decision node interchangeably). Also, when dealing with sub-nodes, the main node is called the parent node while the sub-node is called the child node. A node can be both a parent and a child; for instance, the “Years in present job?” node is the child of the “Income range of applicant?” node but the parent of the “Makes credit card payments?” node. Nodes that don’t split are called terminal nodes, or leaves (the ovals with “loan” and “no loan” are the terminal nodes in this tree).
There are two main types of decision trees: regression and classification (which are also the two types of random forest models). Regression trees are more appropriate for QUANTITATIVE problems, where the outcome is a number (such as the probability that something will or won’t happen), while classification trees are more appropriate for QUALITATIVE problems, where the outcome is a category (such as Yes/No questions). Since the models we built in the previous post solve classification problems, we will learn how to build classification trees.
- An easy way to remember how decision trees work is to think of them as questions that need more questions to be answered.
To create our decision tree, let’s start by installing a package called rpart. Then we’ll fit the tree and use the printcp function to display a summary of it.
To create the classification tree, use `Convicting.Offense.Type` as the outcome variable (and only this variable) along with `trainSet` as your data. Since we are creating a classification tree, use `class` as the method; if we were creating a regression tree, we would use `anova` as the method.
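The original post showed this step as a screenshot, so here is a sketch of what the code might look like. It assumes `trainSet` is the training data frame from R Lesson 14, and the `~ .` formula (letting rpart choose from all remaining columns as predictors) is my assumption:

```r
# install.packages("rpart")   # run once if rpart isn't installed yet
library(rpart)

# Fit a classification tree: Convicting.Offense.Type is the outcome,
# and rpart may split on any of the remaining columns of trainSet.
# method = "class" gives a classification tree; method = "anova"
# would give a regression tree instead.
fit <- rpart(Convicting.Offense.Type ~ ., data = trainSet, method = "class")

# Display the summary, including the pruning (CP) table
printcp(fit)
```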
Now what do all the unfamiliar terms in the output actually mean? The first term, root node error, is a bit confusingly named: it refers to the percentage of records that would be misclassified at the root, before any splits are made, i.e. if we simply predicted the most common class for every record. Here, the root node error is 69.3%, which means the majority class covers only about 30.7% of the records; every error in the table below is expressed relative to this baseline, so the further the tree can drive its error below it, the better.
The other part of the output is the pruning table that you can see at the bottom of the output (that matrix is referred to as a pruning table, or CP table). The row numbers 1–5 are often called levels; each row corresponds to a successively larger candidate tree. (It’s a coincidence that Convicting.Offense.Type also has five values—Drug, Property, Violent, Public Order, and Other—the rows do not correspond to them.) The CP column refers to the complexity parameter, a metric that is used to control the size of the tree and select the optimal tree size. It decreases with each successive level, and tree building stops when CP falls below a threshold (0.01 by default). nsplit refers to the number of splits at that level of the tree; in this case, the number of splits equals the level minus 1 (e.g. the 1st level has no splits, while the 2nd level has 1 split). However, that won’t always be the case, since nsplit can sometimes jump by more than 1 between rows.
The `rel error` for each level is the error of that tree on the data used to fit the model (`trainSet` in this case), expressed as a fraction of the root node error, so the first row is always 1. The lower the `rel error`, the better.
Next is the `xerror`, which is the cross-validated error rate for each level. Cross-validation tests the model on data that were held out of fitting, which helps detect overfitting or underfitting and gives an idea of how well the model will generalize to an independent data set. The `xerror` measures the error generated by the cross-validation process; the lower the `xerror`, the better. In this case, all of the `xerror` values are equal to the `rel error` values, but that won’t always be the case.
Last is the `xstd`, which is the standard error of the cross-validated error estimate: it tells you how much the `xerror` for that level might vary from one cross-validation run to another. The smaller the number, the more precise the `xerror` estimate.
The graph below is simply a visual representation of the matrix I just described:
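In rpart that graph comes from the `plotcp` function; assuming the `fit` object built earlier, reproducing it would look like:

```r
library(rpart)

# Plot the cross-validated error (xerror) against tree size / CP.
# The dashed horizontal line sits one standard error above the
# minimum xerror, which is the usual pruning guide.
plotcp(fit)   # fit is the rpart model fitted earlier
```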
So, what do we do next? We try to find the best place to prune the tree. If you don’t know, pruning means removing nodes that aren’t significant to the model; it’s like pruning a real tree, where you cut away the unnecessary branches.
And how do we find the best place to prune the tree? A common rule of thumb is the one-standard-error rule: pick the smallest tree whose xerror is within one xstd of the minimum xerror. But, since this is an unusual case where all of the rel error values equal all of the xerror values, I’ll let R decide where to prune the tree (if doing so is necessary).
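As a sketch (again assuming the fitted `fit` object from before), the one-standard-error rule can be applied to the CP table like this:

```r
cptab <- fit$cptable   # the pruning table as a numeric matrix

best   <- which.min(cptab[, "xerror"])                  # row with lowest xerror
cutoff <- cptab[best, "xerror"] + cptab[best, "xstd"]   # one-SE threshold

# Smallest tree (first row, fewest splits) whose xerror is
# within one standard error of the minimum
pick    <- which(cptab[, "xerror"] <= cutoff)[1]
cp.best <- cptab[pick, "CP"]
cp.best   # this is the value you would pass to prune()
```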
But before I do that, let me plot a pre-pruned decision tree:
The first thing you do is use the `plot` function (along with any parameters) to plot the tree, then use the `text` function (along with any parameters) to place the text on the tree. Remember: KEEP the window with the decision tree open when calling your `text` function.
- The `margin` parameter is completely optional; it’s just a good idea since the default tree tends to cut off some of the text. A margin of 0.25 or 0.3 is ideal.
- Even though I didn’t use `Convicting.Offense.Subtype` in the model, it is included in the tree since the tree couldn’t be created with only one variable (plus `Convicting.Offense.Subtype` and `Convicting.Offense.Type` are closely related to one another).
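Here’s a sketch of that plotting step; the parameter values are my assumptions, and note that by default `text.rpart` abbreviates long factor levels to single letters, which is where the letters discussed below come from:

```r
# Draw the tree with extra whitespace so labels aren't cut off
plot(fit, margin = 0.25)

# Add the labels; run this while the plot window is still open.
# By default, factor levels are abbreviated to single letters.
text(fit, cex = 0.8)
```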
Now, you might be wondering about all the letters on the tree right by Convicting.Offense.Subtype. Here’s the thing: the actual values of this variable are too long to show on the tree, and there are 26 possible values, so it wouldn’t have been practical to create 26 nodes (one for each value). Instead, each value is represented by a letter of the alphabet; the values and their corresponding letters are listed alphabetically. Here’s the list of values along with the letters they correspond to:
- A-alcohol (I’m guessing public intoxication is a part of this)
- B-animals (namely animal cruelty offenses)
- C-arson
- D-assault
- E-burglary
- F-drug possession
- G-flight/escape (probably referring to prison escapes but could also mean offenders who flee town before trial/sentencing)
- H-forgery/fraud
- I-kidnap
- J-murder/manslaughter (not sure why these are grouped together, since murder would suggest malicious intent while manslaughter is mostly negligence)
- K-other criminal (meaning some other criminal offense that won’t fit into the other 25 types)
- L-other drug (any other drug offense that isn’t possession, so sale of drug paraphernalia would count towards this)
- M-other public order (so something like indecent exposure)
- N-other violent (other violent crimes that won’t fall into any of the other mentioned subtypes)
- O-OWI (operating while intoxicated, the legalese term for a DUI)
- P-prostitution/pimping
- Q-robbery
- R-sex (possibly sexual assault, maybe human trafficking)
- S-sex offender registry/residency (sex crimes against kids)
- T-special sentence revocation (usually for those who violate probation/parole)
- U-stolen property
- V-theft (I guess this means both petty theft and grand theft)
- W-traffic (any other traffic offense that is not a DUI/OWI)
- X-trafficking (I think this means drug trafficking)
- Y-vandalism (graffiti, destruction of property, etc.)
- Z-weapons (probably illegal possession of weapons, i.e. convicted felon possessing a gun)
So, knowing what each of the letters refers to, how do we read this tree? First, let’s look at the uppermost node. It starts with offense sub-types F, L, and X (which are all drug-related offenses) and has the type of offense, Drug, at the bottom of the Convicting.Offense.Subtype line. The string of numbers you see after the word Drug, 5982/974/5490/2706/4363, shows how many observations fall into each class, listed alphabetically: Drug/Other/Property/Public Order/Violent. Since this string of numbers is on the root node, all 19,515 observations in trainSet are counted. As we move down the tree, you’ll start to see more 0s in each slot, and as a result, you’ll figure out exactly how many observations end up in each class.
So, the root node splits into two branches. The left branch has the word Drug and the number string 5982/0/0/0/0 on it. Since this branch doesn’t split any further, we can conclude that 5,982 of the 19,515 offenses are drug crimes. The right branch contains offense sub-types C, E, H, U, V, and Y, which are all property-related offenses, along with the number string 0/974/5490/2706/4363, which refers to all 13,533 of the offenses that aren’t drug-related.
The Property branch (the right branch from the Drug branch split) also splits in two. The left branch contains the word Property and the number string 0/0/5490/0/0. Since this branch doesn’t split any further, we can conclude that 5,490 of the 19,515 offenses are property crimes. The right branch contains offense sub-types A, B, G, K, M, O, P, S, T, W, and Z (the remaining public order and miscellaneous sub-types), along with the number string 0/974/0/2706/4363, which covers the 8,043 crimes that aren’t drug or property related.
The Violent branch (the right branch from the Property branch split) also splits in two, just like the previous two branches. The right branch has the word Violent and the number string 0/37/0/0/4363, which indicates that 4,400 (4,363 + 37) of the offenses land in the Violent leaf. The left branch has the word Public Order and offense sub-types B, K, and T, along with the number string 0/937/0/2706/0, which covers the 3,643 observations that haven’t yet been classified. If you’re wondering where 937 came from: the Violent branch had the number string 0/974/0/2706/4363, and when that branch split in two, 37 of the 974 Other observations went into the right split while the other 937 went into the left split.
The Public Order branch (with the number string 0/937/0/2706/0) also splits into two. The left branch has the word Other and the number string 0/937/0/0/0, meaning that 937 of the offenses are miscellaneous crimes that don’t fit into the other 4 categories, and the right branch has the number string 0/0/0/2706/0, meaning that 2,706 of the offenses are public order crimes (e.g. prostitution).
At this point, all observations have been classified. Here’s a breakdown of the results:
- 5,982 Drug crimes
- 5,490 Property crimes
- 937 Other crimes
- 2,706 Public Order crimes
- 4,400 Violent crimes
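One way to sanity-check these leaf counts (assuming the `trainSet` data frame from earlier) is to compare them against the true class counts:

```r
# True class counts in the training data
table(trainSet$Convicting.Offense.Type)

# The leaf counts above differ slightly from the true counts:
# the Violent leaf holds 4,400 observations because 37 Other
# offenses were routed into it, while the true Violent count
# is 4,363 (and the true Other count is 974, not 937).
```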
Now, we will let R do the magic of pruning the tree (if R finds it necessary to do so). Here’s the code that you need to prune the tree (remember not to close the window with the tree while you write the text line):
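The original post showed this as a screenshot; a sketch of the pruning code (assuming the `fit` object from before) might look like:

```r
# Prune at the CP value with the lowest cross-validated error
cp.best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = cp.best)

# Plot the pruned tree the same way as before
plot(pfit, margin = 0.25)
text(pfit, cex = 0.8)   # keep the plot window open for this line
```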
And here’s the tree. It turns out R didn’t think it was necessary to prune the tree, so we got the exact same tree.
How come R didn’t prune the tree? Well, let’s analyze this graph from earlier:
Notice the dashed horizontal line (it sits one standard error above the minimum cross-validated error). See how the curve only meets it at the 5th level? That line gives us an idea of where the tree should be pruned. Since it isn’t reached until the 5th level, the tree won’t be pruned, as all 5 levels are significant. This could also be a consequence of all the rel error values equaling the xerror values.
Thanks for reading,
Michael