R Lesson 15: Decision Trees


Hello everybody,

It’s Michael, and today’s lesson will be about how to create decision trees in R. I briefly explained the concept of decision trees in my previous post. This post serves as the second part of a two-part lesson (with R Lesson 14 being the first part); I will be using the same dataset.

Now I know you’ve learned the basic concept of a decision tree from my previous post, so I’ll start off this post by going further in depth with more decision tree terminology.

Here’s an example of the decision tree (it’s not the one you saw in my previous post):

The uppermost question-“Income range of applicant?”-is the root node of the tree, which represents the entire population/sample. Splitting the tree involves dividing a node (ovals represent the nodes here) into sub-nodes (represented by the other questions in this tree); the opposite of splitting is pruning, which is when sub-nodes are removed. When sub-nodes split into further sub-nodes, they are referred to as decision nodes (but you can use the terms sub-node and decision node interchangeably). Also, when dealing with sub-nodes, the main node is called the parent node while the sub-node is called the child node. Nodes can be both a parent and child node; for instance, the “Years in present job?” node is the child node of the “Income range of applicant?” node but it is the parent node of the “Makes credit card payments?” node. Nodes that don’t split are called terminal nodes, or leaves (the ovals with loan and no loan are terminal nodes in this tree).

There are two main types of decision trees-regression and classification (which are also the two types of random forest models). Regression trees are more appropriate for QUANTITATIVE problems (such as the percentage likelihood that something will or won't happen) while classification trees are more appropriate for QUALITATIVE problems (such as Yes/No questions). Since the models we made in the previous post address classification problems, we will learn how to build classification trees.

  • An easy way to remember how decision trees work is to think of them as questions that need more questions to be answered.

To create our decision tree, let's start by installing a package called rpart. Then, after building the tree, let's use the printcp function to display a summary of it:

To create the classification tree, use the `Convicting.Offense.Type` variable (and only this variable) as the response, along with `trainSet` as your data. Since we are creating a classification tree, use `class` as the method; if we were creating a regression tree, we would use `anova` as the method.
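Since the code screenshot isn't shown here, here's a minimal sketch of what that step might look like, assuming the training set from the previous lesson is named trainSet (the object name tree is my own):

```r
# install.packages("rpart")  # if not already installed
library(rpart)

# Classification tree predicting offense type from the other variables;
# method = "class" because this is a classification problem
tree <- rpart(Convicting.Offense.Type ~ ., data = trainSet, method = "class")

# Display the summary, including the pruning (cp) table
printcp(tree)
```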

Now what do all the unfamiliar terms in the output actually mean? The first term, root node error, is the misclassification rate at the root, before any splits are made: the proportion of records that would be misclassified if every record were simply assigned the most common class. The rel error and xerror values in the table below it are expressed relative to this number. Here, the root node error is 69.3%, which means the largest class covers only about 30.7% of the records-so the splits have plenty of room to improve on the trivial model.

The other part of the output is the pruning table that you can see at the bottom of the output (that six-by-five matrix is referred to as a pruning table, or cp table). Each numbered row-1 through 5, referred to here as levels-corresponds to a candidate subtree of increasing size; it's a coincidence that there are five rows and also five values of Convicting.Offense.Type (Drug, Property, Violent, Public Order, and Other). The CP refers to the complexity parameter, which is a metric that is used to control the size of the tree and select the optimal tree size. It decreases with each successive level, and tree building stops once a further split can no longer improve the fit by at least the minimum CP threshold (0.01 by default). nsplit refers to the number of splits at that level of the tree; in this case, the number of splits equals the level minus 1 (e.g. the 1st level has no splits, while the 2nd level has 1 split). However, that won't always be the case, since a level can sometimes add more than one split at a time.

The rel error for each level is the error for predictions on the data that were used to estimate the model-`trainSet` in this case-expressed relative to the root node error. The lower the rel error, the better.

Next is the xerror, which is the cross-validated error rate for each level. Cross-validation tests the model on data it wasn't fitted to, which helps detect overfitting or underfitting and indicates how well the model will generalize to an independent dataset. The xerror measures the error generated from the cross-validation process; the lower the xerror, the better. In this case, all of the xerror values are equal to all of the rel error values, but that won't always be the case.

Last is the xstd, which is the standard error of the cross-validated error (xerror) at each level. The smaller this number, the less the xerror estimate varies from one cross-validation run to another.

The graph below is simply a visual representation of the matrix I just described:
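If you're following along, this graph can be produced with rpart's plotcp function (tree here is an assumed name for your fitted rpart object):

```r
# Plots xerror against cp/tree size, with a dashed 1-SE reference line
plotcp(tree)
```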

So, what do we do next? We try to find the best place to prune the tree. If you don't know, pruning means removing nodes that aren't significant to the model. It's just like pruning an actual tree, where you remove the unnecessary branches.

And how do we find the best place to prune the tree? A common rule of thumb is the one-standard-error rule: pick the smallest tree whose xerror is within one xstd of the minimum xerror. But, since this is an unusual case where all of the rel error values equal all of the xerror values, I'll let R decide where to prune the tree (if doing so is necessary).

But before I do that, let me plot a pre-pruned decision tree:

First, use the `plot` function (along with any parameters) to plot the tree, then use the `text` function (along with any parameters) to place the text on the tree. Remember-KEEP the window with the decision tree open when calling your `text` function.

  • The `margin` parameter is completely optional; it’s just a good idea since the default tree tends to cut off some of the text. A margin of 0.25 or 0.3 is ideal.
  • Even though I didn’t use Convicting.Offense.Subtype in the model, it is included in the tree since the tree couldn’t be created with only one variable (plus `Convicting.Offense.Subtype` and `Convicting.Offense.Type` are closely related to one another).
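A sketch of the plotting step, assuming the fitted rpart object is called tree:

```r
plot(tree, margin = 0.25)  # margin keeps the labels from being cut off
text(tree)                 # adds labels; keep the plot window open first
```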

Now, you might be wondering about all the letters on the tree right by Convicting.Offense.Subtype. Here's the thing-the actual values of this variable are too long to show on the tree, and there are 26 possible values, so it wouldn't have been practical to create 26 nodes (one for each value). Instead, each value is represented by a letter of the alphabet; the values and their corresponding letters are listed alphabetically. Here's the list of values along with the letters they correspond to:

  • A-alcohol (I’m guessing public intoxication is a part of this)
  • B-animals (namely animal cruelty offenses)
  • C-arson
  • D-assault
  • E-burglary
  • F-drug possession
  • G-flight/escape (probably referring to prison escapes but could also mean offenders who flee town before trial/sentencing)
  • H-forgery/fraud
  • I-kidnap
  • J-murder/manslaughter (not sure why these are grouped together, since murder would suggest malicious intent while manslaughter is mostly negligence)
  • K-other criminal (meaning some other criminal offense that won’t fit into the other 25 types)
  • L-other drug (any other drug offense that isn’t possession, so sale of drug paraphernalia would count towards this)
  • M-other public order (so something like indecent exposure)
  • N-other violent (other violent crimes that won’t fall into any of the other mentioned subtypes)
  • O-OWI (operating while intoxicated, the legalese term for a DUI)
  • P-prostitution/pimping
  • Q-robbery
  • R-sex (possibly sexual assault, maybe human trafficking)
  • S-sex offender registry/residency (sex crimes against kids)
  • T-special sentence revocation (usually for those who violate probation/parole)
  • U-stolen property
  • V-theft (I guess this means both petty theft and grand theft)
  • W-traffic (any other traffic offense that is not a DUI/OWI)
  • X-trafficking (I think this means drug trafficking)
  • Y-vandalism (graffiti, destruction of property, etc.)
  • Z-weapons (probably illegal possession of weapons, i.e. convicted felon possessing a gun)

So, knowing what each of the letters refers to, how do we read this tree? First, let's look at the uppermost node. It starts with offense sub-types F, L, and X (which are all drug-related offenses) and has the type of offense-Drug-on the bottom of the Convicting.Offense.Subtype line. The string of numbers you see after the word Drug-5982/974/5490/2706/4363-refers to how many observations belong in each level. Since this string of numbers is on the root node, all 19,515 observations in trainSet are considered. As we move down the tree, you'll start to see more 0s in each section and, as a result, you'll figure out exactly how many observations are in each level.

So, the root node splits into two branches. The left branch has the word Drug and the number string 5982/0/0/0/0 on it. Since this branch doesn’t split any further, we can conclude that 5,982 of the 19,515 offenses are drug crimes. The right branch contains offense sub-types C, E, H, U, V, and Y, which are all property-related offenses, along with the number string 0/974/5490/2706/4363, which refers to all 13,533 of the offenses that aren’t drug-related.

The Property branch (the right branch from the Drug branch split) also splits in two. The left branch contains the word Property and the number string 0/0/5490/0/0. Since this branch doesn't split any further, we can conclude that 5,490 of the 19,515 offenses are property crimes. The right branch contains offense sub-types A, B, G, K, M, O, P, S, T, W, and Z, along with the number string 0/974/0/2706/4363, which refers to the 8,043 crimes that aren't drug or property related.

The Violent branch (the right branch from the Property branch split) also splits in two, just like the previous two branches. The right branch has the word Violent and the number string 0/37/0/0/4363, which indicates that 4,400 (4363+37) of the offenses are violent crimes. The left branch has the word Public Order and the offense sub-types B, K, and T, which are public order offenses, along with the number string 0/937/0/2706/0, which refers to the 3,643 observations that haven’t yet been classified. If you’re wondering how 937 was derived, the Violent branch had the number string 0/974/0/2706/4363. When that branch split in two, 37 of the 974 observations went into the right split while the other 937 went into the left split.

The Public Order branch (with the number string 0/937/0/2706/0) also splits into two. The left branch has the word Other and the number string 0/937/0/0/0, meaning that 937 of the offenses are miscellaneous crimes that don’t fit into the other 4 categories and the right branch has the number string 0/0/0/2706/0, meaning that 2,706 of the offenses are public order crimes (i.e. prostitution).

At this point, all observations have been classified. Here’s a breakdown of the results:

  • 5,982 Drug crimes
  • 5,490 Property crimes
  • 937 Other crimes
  • 2,706 Public Order crimes
  • 4,400 Violent crimes

Now, we will let R do the magic of pruning the tree (if R finds it necessary to do so). Here’s the code that you need to prune the tree (remember to not close the window with the tree when you are writing the text line):
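Here's one way that pruning code might look, assuming the fitted rpart object is named tree (object names are my own): we ask R for the cp value with the lowest cross-validated error and prune to it.

```r
# Pick the cp with the lowest cross-validated error from the cp table
bestcp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]

pruned <- prune(tree, cp = bestcp)
plot(pruned, margin = 0.25)
text(pruned)  # keep the plot window open when calling this
```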

And here’s the tree. Turns out, R didn’t think it was necessary to prune the tree, so we got the exact same tree.

How come R didn't prune the tree? Well, let's analyze this graph from earlier:

Notice the dashed horizontal line, and see how it meets the curve at the 5th level? That line gives us an idea of where the tree should be pruned. Since it meets the curve at the 5th level, the tree won't be pruned, as all 5 levels are significant. This could also be because all of the rel error values are equal to all of the xerror values.

Thanks for reading,

Michael

R Lesson 14: Random Forests


Hello everybody,

It’s Michael, and today’s post will be on random forests in R.

You're probably wondering what a random forest does. So was I before writing this, since I'd never actually done any random forest problems (the other R lessons covered concepts from my undergrad business analytics coursework, so I was familiar with them).

Random forests are a form of supervised learning (look at R Lesson 10: Intro to Machine Learning-Supervised and Unsupervised for an explanation on supervised and unsupervised learning). But to better explain the concept of random forests, let me provide an easy to understand example.

Let's say you wanted to visit the new ice cream shop in town, but aren't sure if it's any good. Would you go in and visit the place anyway, regardless of any doubts you have? It's unlikely, since you'd probably want to ask all of your buddies and/or relatives about their thoughts on the new ice cream place first. You would ideally ask multiple people for their opinions, since the opinion of one person is usually biased by his/her preferences. By asking several people for their opinions, we eliminate as much bias from our decision-making process as possible. After all, one person might not like the ice cream shop in question because they weren't satisfied with the service, or they don't like the shop's options, or for a myriad of other reasons.

In data analytics jargon, the above paragraph provides a great example of a technique called ensembling, which is a type of supervised learning technique where several models are trained on a training set with their individual outputs combined by a set rule to determine the final output.

Now, what does all of that mean? First of all, when I mentioned that several models are trained on a training set, this means that either the same model with different parameters or different models can utilize the training set.

When I mentioned that their [referring to the "several models"] individual outputs are combined by a set rule to determine the final output, this means that the final output is derived by using a certain rule to combine all the individual outputs. There are plenty of rules that can be utilized, but the two most common are averaging (for numerical output) and voting (for non-numerical output). In the case of numerical output, we can take the average of all the outputs and use that average as the final output. In the case of categorical (or non-numerical) output, we can use a majority vote (the output that occurs the most times) as the final output.

Random forests are an example of an ensembling algorithm. They work by creating several decision trees (which I'll explain a little more) and combining their output to derive a final output. Decision trees are easy-to-understand models; however, an individual tree doesn't have much predictive power, which is why decision trees have been referred to as weak learners.

Here’s a basic example of a decision tree:

This tree isn’t filled with analytical jargon, but the idea is the same. See, this tree answers a question-“Is a person fit?”-just as decision trees do. It then starts with a basic question on the top branch-“Is the person younger than 30?” Depending on the answer to that question (Yes/No), another question is then posed, which can be either “Eats a lot of pizzas?” or “Exercises in the morning?”. And depending on the answer to either question (both are Yes/No questions), you get the answer to the question the tree is trying to answer-“Is a person fit?”. For instance, if the answer to “Eats a lot of pizzas?” is “Yes”, then the answer to the question “Is a person fit?” would be “No”.

This tree is a very simple example of decision trees, but the basic idea is the same with R decision trees (which I will show you more of in the next post).

Now let's get into the dataset-Iowa Recidivism-which gives data on people serving a prison term in the state of Iowa between 2010 and 2015, with recidivism follow-up between 2013 and 2018. If you are wondering, I got this dataset from Kaggle.

Now, let’s start our analysis by first understanding our data:

Here, we have 26,020 observations of 12 variables, which is a lot, so let's break it down variable by variable:

  • Fiscal.Year.Released-the year the person was released from prison/jail
  • Recidivism.Reporting.Year-the year recidivism follow-up was conducted, which is 3 years after the person was released
  • Race..Ethnicity-the race/ethnicity of the person who was released
  • Age.At.Release-the age a person was when he/she was released. Exact ages aren’t given; instead, age brackets are mentioned, of which there are five. They include:
    • Under 25
    • 25-34
    • 35-44
    • 45-54
    • 55 and over
  •  Convicting.Offense.Classification-The level of seriousness of the crime, which often dictates the length of the sentence. They include:
    • [Class] A Felony-Up to life in prison
    • [Class] B Felony-25 to 50 years in prison
    • [Class] C Felony-Up to 10 years in prison
    • [Class] D Felony-Up to 5 years in prison
    • Aggravated Misdemeanor-Up to 2 years in prison
    • Serious Misdemeanor-Up to 1 year in prison
    • Simple Misdemeanor-Up to 30 days in prison
  • Convicting.Offense.Type-the general category of the crime the person was convicted of, which can include:
    • Drug-drug crimes (anything from narcotics trafficking to possession of less than an ounce of marijuana)
    • Violent-any violent crime (murder, sexual assault, etc.)
    • Property-any crimes against property (like robbery/burglary)
    • Public order-usually victimless and non-violent crimes that go against societal norms (prostitution, public drunkenness, etc.)
    • Other-any other crimes that don’t fit the four categories mentioned above (like any white collar crimes)
  • Convicting.Offense.Subtype-the more specific category of the crime the person was convicted of, which can include “animal” (for animal cruelty crimes), “trafficking” (which I’m guessing refers to human trafficking), among others.
  • Main.Supervising.District-the jurisdiction supervising the offender during the 3 year period (Iowa has 11 judicial districts)
  • Release.Type-why the person was released, which can include parole, among other things
  • Release.type..Paroled.to.detainder.united-this is pretty much the same as Release.Type; I won’t be using this variable in my analysis
  • Part.of.Target.Population-whether or not a prisoner is a parolee
  • Recidivism...Return.to.Prison.numeric-Whether or not a person returned to prison within 3 years of release; 1 means “no”, they didn’t return and 0 means “yes”, they did return

Now let’s see if we’ve got any missing observations (remember to install the Amelia package):
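That check might look like this, assuming the dataset has been loaded into a data frame called recidivism (the name is my own):

```r
# install.packages("Amelia")  # if not already installed
library(Amelia)

# Draws a "missingness map" highlighting any missing values
missmap(recidivism, main = "Missing values in the recidivism data")
```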

Looks like there’s nothing missing in our dataset, which is always a good sign (though not all datasets will be perfectly tidy).

Next let’s split our data into training and validation sets. If you’re wondering what a validation set is, it’s essentially the dataset you would use to fine tune the parameters of a model to prevent overfitting. You can make a testing set for this model if you want, but I’ll just go with training and validation sets.

Here’s the training/validation split, following the 75-25 rule I used for splitting training and testing sets:

  • This will be a different approach to splitting the data than what you’ve seen in previous R posts. The above line of code is used as a guideline to splitting the data into training (trainSet) and validation (validationSet) sets. It tells R to utilize 75% of the data (19651 observations) for the training set. However, the interesting thing is the nrow function, which tells R to utilize ANY 19651 observations SELECTED AT RANDOM, as opposed to simply using observations 1-19651.
  • On the lines of code seen below, train selects the 75% of the observations chosen with the sample function while -train selects the other 25% of observations that are NOT part of the training set. By the way, the minus sign means NOT.
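A sketch of that split, again assuming the data frame is named recidivism (set.seed is optional; it just makes the random sample reproducible):

```r
set.seed(1)  # assumption: any seed works; it only fixes the random draw

# Randomly choose 75% of the row indices for the training set
train <- sample(nrow(recidivism), 0.75 * nrow(recidivism))

trainSet      <- recidivism[train, ]   # the sampled 75%
validationSet <- recidivism[-train, ]  # everything NOT in train
```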

Now here's our random forest model, using default parameters and the Convicting.Offense.Type variable (remember to install the randomForest package):
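The model itself might be built like this; for simplicity this sketch uses all the other columns as predictors, and object names other than trainSet are assumptions:

```r
# install.packages("randomForest")  # if not already installed
library(randomForest)

model1 <- randomForest(Convicting.Offense.Type ~ ., data = trainSet)
model1  # prints ntree, mtry, the OOB error estimate, and the confusion matrix
```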

In case you were wondering, the default number of decision trees is 500, and for a classification forest the default number of variables tried at each split is the square root of the number of predictors (3 in this case). The OOB (out-of-bag) estimate of error rate isn't a default but a built-in accuracy check: each tree is evaluated on the observations it didn't see during training, and here that error rate is quite low at 0.22%.

The confusion matrix you see details how many observations from the Convicting.Offense.Type variable are correctly classified and how many are misclassified. A margin of error-class.error-in this matrix tells you the percentage of observations that were misclassified. Here’s a breakdown:

  • Drug-No misclassification
  • Other-3.5% misclassification
  • Property-0.02% misclassification
  • Public Order-0.04% misclassification
  • Violent-0.1% misclassification

So all the crimes that were classified as Drug from trainSet were classified correctly, while Other had the most misclassification (and even 3.5% is a relatively small misclassification rate).

  • One important thing to note is that random forests can be used for both classification and regression analyses. If I wanted to do a regression analysis, I would’ve put my binary variable-Recidivism...Return.to.Prison.numeric-to the left of the ~ sign and other variables to the right of the ~ sign.

Now let's change the model by changing ntree and mtry, which refer to the number of decision trees in the model and the number of variables sampled at each split, respectively:
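Since the exact values aren't shown here, this is a hypothetical version of that call, with ntree reduced and mtry increased from their defaults:

```r
# ntree = 300 and mtry = 4 are illustrative values, not the exact ones used
model2 <- randomForest(Convicting.Offense.Type ~ ., data = trainSet,
                       ntree = 300, mtry = 4)
model2
```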

By reducing ntree and increasing mtry, the OOB estimate of error rate decreases (albeit by only 0.02%), as does the number of classes with a misclassification error. By that I mean that the previous model had only one class with no misclassification error-Drug-while this model has 3 classes with no misclassification error-Drug, Property, and Public Order. The misclassification error for the other two classes-Other and Violent-moved in opposite directions. Other had a 3.5% error in the previous model and a 2.2% error in this model, so its error decreased. On the other hand, Violent had a 0.1% error in the previous model but a 0.44% error in this model, so its error increased.

Now let’s do some predictions, first on our training set and then on our validation set (using the second model):

And then let’s check the accuracy of each prediction:
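Those two steps might be sketched like so (the prediction object names are my own):

```r
predTrain <- predict(model2, trainSet)
predValid <- predict(model2, validationSet)

# Proportion of correctly classified observations in each set
mean(predTrain == trainSet$Convicting.Offense.Type)
mean(predValid == validationSet$Convicting.Offense.Type)
```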

For both the training set and the validation set, the classifications are very accurate. trainSet has a 99.9% accuracy rate (only 10 of its 19,651 observations were misclassified) while `validationSet` has a 99.8% accuracy rate (only 13 of its observations were misclassified).

Now let’s analyze the importance of each variable using the `importance` function:
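The call itself is short:

```r
importance(model2)   # per-class importances plus MeanDecreaseAccuracy/Gini
varImpPlot(model2)   # optional: plots the two summary measures
```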

So what exactly does this complex matrix mean? It shows the importance of each of the 12 variables when classifying a crime as one of the five offense types derived from the variable Convicting.Offense.Type. The two important metrics to focus on in this matrix are MeanDecreaseAccuracy and MeanDecreaseGini.

A variable's MeanDecreaseAccuracy is a metric of how much the accuracy of the model would decrease if that variable were excluded. The higher the number, the more accuracy the model loses without that variable. For instance, if I were to exclude the Recidivism...Return.to.Prison.numeric variable, the accuracy of the model wouldn't be impacted much, since it has the lowest MeanDecreaseAccuracy-3.42. Another reason the exclusion of this variable wouldn't impact the model much is that it is a binary variable, which would be better suited to a regression problem as opposed to a classification problem like this one. On the other hand, excluding the variable Convicting.Offense.Subtype would have a significant impact on the model, since it has the largest MeanDecreaseAccuracy at 13,358.53.

A variable's MeanDecreaseGini is a metric of how much Gini impurity decreases, on average, when that variable is chosen to split a node. What exactly is Gini impurity? Pick a random point in the dataset and label it at random according to the class distribution of trainSet-as Drug 5982/19651 of the time, as Property 5490/19651 of the time, and so on. The probability that we mislabel the point this way is the Gini impurity. With that said, the MeanDecreaseGini measures how much a given variable reduces that misclassification likelihood across the splits it's used in. The higher this number, the more the variable helps the model avoid misclassifications. For instance, Convicting.Offense.Subtype has the highest MeanDecreaseGini, so misclassification likelihood greatly decreases when this variable is included in the model. On the other hand, Part.of.Target.Population has the lowest MeanDecreaseGini, so misclassification likelihood is barely affected by this variable. This could be because it is a binary variable, which is better suited for regression problems rather than classification problems.

But what about all of the numbers in the first five columns? What do they mean? The first five columns, along with all the numbers, show the importance of each of the 12 variables when classifying a crime as one of the five possible types of crimes listed in Convicting.Offense.Type, which is the variable I used to build both random forest models. The higher the number, the more important that variable is when classifying a crime as one of the five possible crimes. For instance, the highest number (rounded) in the Fiscal.Year.Released row is 11.16, which corresponds to the Violent column. This implies that many Iowa prisoners convicted of violent crimes were likely released around the same time. Another example would be the highest number (rounded) in the Main.Supervising.District row-14.45-which corresponds to the `Other` column. This implies that Iowa prisoners who were convicted of crimes that don’t fall under the `Drug`, `Violent`, `Public Order`, or `Property` umbrellas would be, for the most part, located in the same Iowa jurisdiction.

  • If you wish to see the unscaled numbers in this matrix, simply write scale = FALSE after model2 in the parentheses. Remember to separate these two arguments with a comma.

Thanks for reading,

Michael

R Analysis 4: Naive Bayes and Amazon Alexa


Hello everybody,

It’s Michael, and today’s post will be an R analysis (my fourth one overall). I’ll be going back to R lessons now, but for those who want more Java content, don’t worry, it will be back soon.

First of all, let’s upload and understand our data. Here is the file-amazon_alexa.

This dataset contains reviews for various Amazon Alexa products (such as the Echo Sub and Echo Dot) along with information about those reviews (such as star rating). Here’s a variable-by-variable breakdown of the data:

  • rating-the star rating the user gave the product, with 1 being the worst and 5 being the best
  • date-the date the review was posted; all of the reviews are from June or July 2018
  • variation-the exact product that is being reviewed, of which there are 16
  • verified_reviews-what the actual review said
  • feedback-whether the review was positive or negative-0 denotes negative reviews and 1 denotes positive reviews

In total, there are 3,150 reviews along with five aspects of information about the reviews (such as star rating, date posted, etc.)

Next, let's convert feedback into a factor and create a missmap in order to see if we have any missing observations (remember to install the Amelia package):
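A sketch of those two steps, assuming the CSV has been read into a data frame called alexa (the name is my own):

```r
library(Amelia)

alexa$feedback <- factor(alexa$feedback)  # convert 0/1 feedback to a factor
missmap(alexa, main = "Missing values in the Alexa reviews")
```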

According to the diagram, there are no missing observations, which is always a good thing.

Now here’s a table using the feedback variable that shows us how many positive (1) and negative (0) reviews are in the dataset:

According to the table, of the 3150 reviews, only 257 are negative, while 2893 are positive. I’m guessing this is because Amazon’s various Alexa products are widely praised (though there are always going to be people who weren’t impressed with the products).

Let’s create some word-clouds next in order to analyze which words are common amongst positive and negative reviews. We’ll create two word-clouds, one for the positive reviews (feedback == 1) and another for the negative reviews (feedback == 0). Remember to install the wordcloud package:

  • Before creating the word-clouds, remember to subset the data. In this case, create subsets for positive (feedback == 1) and negative (feedback == 0) reviews.
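A sketch of the subsetting and word-cloud calls (object names and the max.words/min.freq settings are my own choices):

```r
# install.packages("wordcloud")  # if not already installed
library(wordcloud)

positive <- subset(alexa, feedback == 1)
negative <- subset(alexa, feedback == 0)

# One cloud per subset; the limits below are illustrative
wordcloud(positive$verified_reviews, max.words = 100, min.freq = 5)
wordcloud(negative$verified_reviews, max.words = 100, min.freq = 5)
```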

Here are the wordclouds for both the positive and negative reviews:

Some of the words that are common amongst both the positive and negative reviews include echo, Alexa, and work. This is likely because customers usually mention the name of the product in the review (most of which contain either Alexa or echo) in both positive and negative reviews. Customers’ reviews, whether positive or negative, are also likely to use the word work because customers usually mention whether or not their product worked (and if it did, how well it worked).

Some of the words that are common amongst positive reviews include great, love, can, like, and easy, which are all words that are commonly found in positive reviews (e.g. "The Echo is super easy to use. Love it will buy again!"). On the other hand, some of the words that are common amongst negative reviews include doesn't, didn't, stopped, and never, which you are very likely to find in critical reviews (e.g. "The Alexa speaker doesn't work AT ALL! Never gonna buy again!"). An interesting observation is that the word refurbished is commonly found in negative reviews; this could be because refurbished Alexa devices are more prone to problems, leaving the customers who bought them dissatisfied.

Next up, let's clean up the data and prepare a corpus, which is the collection of text documents derived from the actual text of the reviews. Our corpus is then used to create a document term matrix (referred to in the program as dtm), which has one row per review and one column per word appearing in the reviews (remember to install the tm package):

You may recall from R Lesson 13: Naive Bayes Classification that there are four options to include in the control list when creating the dtm. They are:

  • tolower-sets all words to lowercase
  • removeNumbers-removes any numbers present in the reviews (like years)
  • removePunctuation-removes any punctuation present in the reviews
  • stemming-simplifies analysis by combining similar words (such as verbs in different tenses and pairs of plural/singular nouns). For instance, the words “trying”, “tries”, and “tried” would all be combined into “try”.

Remember to set each option to TRUE.
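Putting those four options together, the corpus and DTM steps might look like this (the data frame name alexa is an assumption):

```r
# install.packages("tm")  # if not already installed
library(tm)

corpus <- VCorpus(VectorSource(alexa$verified_reviews))

dtm <- DocumentTermMatrix(corpus, control = list(
  tolower           = TRUE,  # lowercase everything
  removeNumbers     = TRUE,
  removePunctuation = TRUE,
  stemming          = TRUE   # requires the SnowballC package
))
```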

Next we’ll create training and testing labels for our dataset. I’ve probably said it before, but I always like to use the 75-25 split when it comes to splitting my data into training and testing sets. This means that I like to use 75% of my dataset to train and build the model before testing it on the other 25% of my dataset. Here’s how to split the data:

Using the 75-25 split, observations 1-2363 will be part of my trainLabels while observations 2364-3150 will be part of my testLabels. I included the feedback variable since you should always include the binary variable when creating your trainLabels and testLabels.
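The labels themselves can be pulled out by row index, using the split described above:

```r
trainLabels <- alexa[1:2363, ]$feedback
testLabels  <- alexa[2364:3150, ]$feedback
```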

Now let’s make some tables analyzing the proportions of positive-to-negative reviews for both our trainLabels and testLabels. Remember to use the prop.table function:

According to the tables, the difference between the proportions of positive-to-negative reviews for our trainLabels and testLabels is rather small-1% to be exact.

Now let’s split our dtm into training and testing sets, using the same guidelines that were used to create trainLabels and testLabels:
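A sketch of that split (the dtmTrain and dtmTest names are my assumptions):

```r
# Same row ranges as the label split
dtmTrain <- dtm[1:2363, ]
dtmTest  <- dtm[2364:3150, ]
```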

To further tidy up the model, let’s only include words that appear at least 3 times:
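One way to do this with tm’s findFreqTerms (variable names here are assumptions; the training and testing matrices are assumed to be named dtmTrain and dtmTest):

```r
# Keep only words appearing at least 3 times in the training documents
freqWords    <- findFreqTerms(dtmTrain, 3)
dtmFreqTrain <- dtmTrain[ , freqWords]
dtmFreqTest  <- dtmTest[ , freqWords]
```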

Our DTM stores word counts: 0 means the word isn’t present in the review, and a positive count means it is. Since Naive Bayes works with categorical features, the counts are converted into Yes and No. This conversion is applied to every column (hence MARGIN = 2):
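A sketch of that conversion, assuming the frequent-word matrices from the previous step are named dtmFreqTrain and dtmFreqTest (hypothetical names):

```r
# Convert word counts to a categorical Yes/No presence indicator
convertCounts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

# MARGIN = 2 applies the function column by column (one column per word)
train <- apply(dtmFreqTrain, MARGIN = 2, convertCounts)
test  <- apply(dtmFreqTest,  MARGIN = 2, convertCounts)
```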

Last but not least, it’s time to create the Naive Bayes classifier (remember to install the package e1071):
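A sketch of the classifier call, assuming the converted training matrix and labels built above (classifier is my assumed name for the model object):

```r
library(e1071)
classifier <- naiveBayes(train, trainLabels)
```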

Now let’s test out the classifier on two words, great and disappoint:
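With e1071, the per-word conditional probabilities live in the model’s tables element, so the check can look like this (sketch, assuming the model object is named classifier):

```r
# Conditional probability tables for individual words
classifier$tables$great
classifier$tables$disappoint
```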

Now, what does all this mean?

Well, the word great:

  • DOESN’T appear in 93% of negative reviews but DOES appear in the other 7% of negative reviews
  • DOESN’T appear in 77.1% of positive reviews but DOES appear in the other 22.9% of positive reviews

These results are a little surprising since I thought the word great would appear in a majority of positive reviews. Then again, great is more prevalent in positive reviews than in negative reviews, which isn’t surprising, since negative reviews aren’t likely to use the word great.

And the word disappoint:

  • DOESN’T appear in 93.5% of negative reviews but DOES appear in the other 6.5% of negative reviews
  • DOESN’T appear in 98.8% of positive reviews but DOES appear in the other 1.2% of positive reviews

Just as with great, these results are and aren’t surprising. The surprising thing is that disappoint only appears in 6.5% of negative reviews, when I thought (and probably you did too) that disappoint would be found in more negative reviews. Then again, disappoint is more common in negative reviews than in positive reviews, which isn’t surprising.

Last but not least, let’s create a confusion matrix, which evaluates the performance of this classifier using our testing dataset (which is the variable test). Remember to install the gmodels package:
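A sketch of the evaluation: predict on the test matrix, then cross-tabulate the predictions against the true labels (testPred is my assumed name):

```r
library(gmodels)

testPred <- predict(classifier, test)

CrossTable(testPred, testLabels,
           prop.chisq = FALSE, prop.t = FALSE,
           dnn = c("predicted", "actual"))
```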

The confusion matrix contains all 787 observations in our testing set. Reading off the matrix, 729 reviews were correctly classified while 58 reviews were misclassified, giving an overall accuracy of about 93% (729/787). That’s excellent, but the misclassified reviews could still complicate customer feedback analysis, in which case a more sophisticated model would be needed.

Thanks for reading.

Michael

 

Java Program Demo 2: Arrays, Random Numbers, Inheritance & Polymorphism

Advertisements

Hello everybody,

It’s Michael, and today’s post will be my second Java program demo. I will focus on the topics of the last 3 posts (random number generators, arrays, inheritance, and polymorphism); however, I may also utilize concepts from the first six Java lessons (such as for loops and if statements), since Java concepts build on each other.

In this demo, I will create three programs that demonstrate the concepts from Java Lessons 7-9 (with some concepts from Lessons 1-6 as well).

Here is my first program:

package programdemo2;
import java.util.Random;

public class ProgramDemo2
{
    public static void main(String[] args)
    {
        String[] maleNames = {"Alex", "Brian", "Chester", "David", "Eric", "Felipe", "Gabriel",
                              "Henry", "Ike", "Jacob", "Kyle", "Lenny", "Michael", "Nicholas",
                              "Oliver", "Phillip", "Quentin", "Ricky", "Steven", "Todd",
                              "Ulises", "Victor", "William", "Xavier", "Yancy", "Zach"};

        Random gen = new Random();
        // nextInt's bound is exclusive, so 26 yields indexes 0 through 25
        int index = gen.nextInt(26);

        System.out.println(maleNames[index]);
    }
}

This program demonstrates a simple one-dimensional array along with a random number generator. The array shown above is a String array containing male names for all 26 letters of the alphabet (and yes, Yancy is an actual male name). The bound passed to nextInt is exclusive, so it is set to 26 in order to generate indexes 0 through 25, covering all 26 elements (remember the “indexes start at 0” rule). The program will print out any one of the 26 names at random. Let’s check out two sample outputs:

run:
Jacob
BUILD SUCCESSFUL (total time: 2 seconds)

run:
William
BUILD SUCCESSFUL (total time: 0 seconds)

Our two sample outputs-Jacob and William-correspond to indexes 9 and 22, respectively (or the 10th and 23rd elements in the array).

Now let’s demonstrate a two-dimensional array:

package programdemo2;
import java.util.Scanner;

public class ProgramDemo2
{
    public static void main(String[] args)
    {
        Scanner s = new Scanner(System.in);
        System.out.println("Pick a number: ");
        int num = s.nextInt();

        int[][] multiples = new int[5][5];

        for (int i = 0; i <= 4; i++)
        {
            for (int j = 0; j <= 4; j++)
            {
                multiples[i][j] = num * (i + 1);
            }
        }

        // change the indexes here to print a different element
        System.out.println(multiples[4][4]);
    }
}

Using a combination of a Scanner, a nested for loop, and a two-dimensional array, this program builds a five-by-five array based on the number you input into the Scanner. Each element is calculated with the formula num*(i+1), which depends only on the row index i, so every element in a given row holds the same value, while moving down any column shows the five different multiples of num. Remember that i increments by one after each iteration of the outer loop: when i is 0, the inputted number is multiplied by 1 (since 0+1 = 1), and on the final iteration (i = 4), the inputted number is multiplied by 5, since 4+1 = 5.

Let’s create a sample array, using 4 different outputs (and using 6 as the Scanner number each time):

run:
Pick a number:
6
30
BUILD SUCCESSFUL (total time: 5 seconds) index[4][2]

run:
Pick a number:
6
24
BUILD SUCCESSFUL (total time: 5 seconds) index[3][3]

run:
Pick a number:
6
24
BUILD SUCCESSFUL (total time: 2 seconds)index[3][0]

run:
Pick a number:
6
12
BUILD SUCCESSFUL (total time: 6 seconds)index[1][2]

In each of these four outputs, I use a different index but the same Scanner number-6. The indexes to which these outputs correspond are mentioned to the right of the BUILD SUCCESSFUL line.

Here’s a visualization of the array with the outputs filled in. With the information given in the program and outputs, could you fill in the missing elements?

My last program demo will involve polymorphism and inheritance. Here is my main class (the one from which I will run the program):

import java.util.Scanner;

public class ProgramDemo
{
    public static void main(String[] args)
    {
        Scanner s = new Scanner(System.in);
        System.out.println("Pick a number: ");
        double num = s.nextDouble();

        String[] time = {"hours", "days", "weeks", "months"};

        System.out.println("Pick an index from 0-3:");
        int index = s.nextInt();

        if (index == 0)
        {
            Days d = new Days();
            d.measure(num);
        }
        else if (index == 1)
        {
            Weeks w = new Weeks();
            w.measure(num);
        }
        else if (index == 2)
        {
            Months m = new Months();
            m.measure(num);
        }
        else if (index == 3)
        {
            Years y = new Years();
            y.measure(num);
        }
        else
        {
            System.out.println("Pick another number");
        }
    }
}

Now here’s my superclass (which is NOT the same as my main class):

public class TimeMeasurements extends ProgramDemo
{
    // takes the same double parameter as the subclass versions,
    // so the subclasses genuinely override this method
    public void measure(double num)
    {
        num = 0;
        System.out.println(num);
    }
}

And here are my four subclasses:

public class Days extends TimeMeasurements
{
    @Override
    public void measure(double num)
    {
        System.out.println("Number of days is: " + num / 24);
    }
}

public class Weeks extends TimeMeasurements
{
    @Override
    public void measure(double num)
    {
        System.out.println("Number of weeks is: " + num / 7);
    }
}

public class Months extends TimeMeasurements
{
    @Override
    public void measure(double num)
    {
        System.out.println("Number of months is: " + num / 4);
    }
}

public class Years extends TimeMeasurements
{
    @Override
    public void measure(double num)
    {
        System.out.println("Number of years is: " + num / 12);
    }
}

Before I get into sample outputs, let me explain the structure of my program.

The first line of the main method creates a Scanner object, which I use to read the double variable num. After the line asking the user for input, there is a String array called time, which contains four elements-hours, days, weeks, and months. The user is then asked to choose a number from 0-3 (corresponding to each of the possible indexes), and based on the input, one of the four if/else-if branches will execute. OK, there’s also an else statement, but that will only execute if the integer you enter at the Pick an index from 0-3: line is not between 0 and 3 (entering a decimal there wouldn’t reach the else at all; nextInt would throw an InputMismatchException).

The superclass TimeMeasurements extends the main class ProgramDemo; this means that TimeMeasurements inherits the accessible fields and methods of the ProgramDemo class. This extension ultimately doesn’t mean much, since ProgramDemo contains only the static main method, and local variables like num are never inherited; I simply reuse the name num across the classes for consistency.

Here is where polymorphism comes in. In TimeMeasurements, there is a method called measure which converts a number from one unit of time to another. This method is of type void-meaning it doesn’t return a value-and it takes a parameter, double num, with the same name and type as the num in my main class. Each of my four subclasses-Days, Weeks, Months, and Years-overrides measure, so the same method call produces a different result depending on the object’s class. Days divides num by 24 since I’m converting from hours to days; Weeks divides num by 7, Months divides num by 4, and Years divides num by 12. The polymorphism lies in the measure method itself: the calculation differs from class to class (and the TimeMeasurements version simply sets num to 0).

  • My subclasses extend the superclass TimeMeasurements, which in turn extends the main class ProgramDemo. So, in turn, my 4 subclasses extend ProgramDemo.
    • This is why I chose to use num as the parameter for all 4 subclasses and TimeMeasurements; reusing num lets each subclass’s measure method apply its own formula to the same input.
  • I used double for num because chances are high that the program’s calculations will involve decimals.
  • A method’s return type and parameter types don’t have to match. Case in point-the measure method; it returns void but takes a parameter of type double.
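To see the dispatch in isolation, here is a standalone sketch with simplified stand-ins for the demo’s classes (only two subclasses, and no Scanner): the same TimeMeasurements reference runs a different measure body depending on the object’s actual class.

```java
// Standalone sketch: simplified stand-ins for the demo's classes,
// showing one superclass reference dispatching to each override.
class TimeMeasurements {
    public void measure(double num) {
        System.out.println(0.0); // base version just prints 0
    }
}

class Days extends TimeMeasurements {
    @Override
    public void measure(double num) {
        System.out.println("Number of days is: " + num / 24);
    }
}

class Weeks extends TimeMeasurements {
    @Override
    public void measure(double num) {
        System.out.println("Number of weeks is: " + num / 7);
    }
}

public class PolymorphismSketch {
    public static void main(String[] args) {
        // Both elements share the superclass type, but each call below
        // runs the subclass's own version of measure.
        TimeMeasurements[] converters = { new Days(), new Weeks() };
        for (TimeMeasurements t : converters) {
            t.measure(168.0); // 168 hours -> days; 168 days -> weeks
        }
    }
}
```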

There are four if/else-if branches and an else statement in this program, which will do the following based on the Pick an index input:

  • If 0 is chosen, an object of class Days will be created and its measure method executed. The output will show you how many days are in X amount of hours (X being your num input).
  • If 1 is chosen, an object of class Weeks will be created and its measure method executed. The output will show you how many weeks are in X days.
  • If 2 is chosen, an object of class Months will be created and its measure method executed. The output will show you how many months are in X weeks.
  • If 3 is chosen, an object of class Years will be created and its measure method executed. The output will show you how many years are in X months.
  • If anything other than 0, 1, 2, or 3 is chosen, the else statement will execute, which will simply display the line Pick another number.

Now let’s show five sample outputs (to account for the 5 possible conditions). For the outputs, I’ll go in order of conditional statements (in other words, I’ll start with 0):

run:
Pick a number:
65
Pick an index from 0-3:
0
Number of days is: 2.7083333333333335
BUILD SUCCESSFUL (total time: 49 seconds)

In this output, I chose index 0 and the number 65-referring to 65 hours. The numerical output was 2.71 (rounded to 2 decimal places), meaning that there are approximately 2.71 days in 65 hours.

Next I’ll choose index 1:

run:
Pick a number:
112
Pick an index from 0-3:
1
Number of weeks is: 16.0
BUILD SUCCESSFUL (total time: 13 seconds)

I chose 112 as my num input-referring to 112 days. I got 16 as the output, which means that 112 days equals 16 weeks.

Now I’ll use index 2:

run:
Pick a number:
88
Pick an index from 0-3:
2
Number of months is: 22.0
BUILD SUCCESSFUL (total time: 23 seconds)

I chose 88 as my num input-referring to 88 weeks. I got 22 as the output, which means that 88 weeks equals 22 months.

Now time to try index 3:

run:
Pick a number:
105
Pick an index from 0-3:
3
Number of years is: 8.75
BUILD SUCCESSFUL (total time: 10 seconds)

I chose 105 as my num input-referring to 105 months. Dividing by 12 gives 8.75, meaning that 105 months equals 8.75 years.

And last but not least, let’s enter a number other than 0, 1, 2, or 3:

run:
Pick a number:
35
Pick an index from 0-3:
5
Pick another number
BUILD SUCCESSFUL (total time: 8 seconds)

I picked 35 as my number and 5 as my index and, since my chosen index wasn’t 0, 1, 2, or 3, I got the message Pick another number as my output.

Thanks for reading,

Michael