R Lesson 2: Basic summarization of R Data

Hello everybody,

This is Michael, and today’s post will be about basic summarization of R Data. I thought this would be an appropriate place to continue from R Lesson 1: Basic R commands (I’ll be using the dataset from that post).

Let’s start off simple by using the summary() command to display a summary of the age field.

As you can see, the output shows the minimum age for any congressperson (25), the 1st quartile (45.4), the median age (53), the mean age (53.31), the 3rd quartile (60.55), and the maximum age (98.1). But what does this all mean?

  • The minimum is the lowest age amongst the congresspeople: 25 years
  • The 1st quartile is the age that a quarter of the congresspeople fall at or below (45.4)
    • In other words, the youngest 25% of congresspeople were between 25 and 45.4 years old (at the start of their terms)
  • The median is the middle value of all the ages amongst the congresspeople: 53 years in this case
    • Half of the congresspeople were 53 or younger, and half were 53 or older (at the start of their terms)
    • The middle 50% of congresspeople were between 45.4 and 60.55 years old, which is the range from the 1st quartile to the 3rd quartile
  • The mean is the average of all the ages (53.31)
  • The 3rd quartile is the age that three quarters of the congresspeople fall at or below (60.55)
    • In other words, the oldest 25% of congresspeople were between 60.55 and 98.1 years old (at the start of their terms)
  • The maximum is the highest age amongst the congresspeople: 98.1 years
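
If you’d like to check these definitions yourself, here is a minimal sketch on a small made-up vector. The ages below are hypothetical and are not taken from congress-terms.csv; the variable names are my own as well.

```r
# Hypothetical ages, NOT taken from congress-terms.csv
ages <- c(25, 38, 45, 50, 53, 57, 60, 66, 98)

# summary() reports the minimum, 1st quartile, median, mean,
# 3rd quartile, and maximum in one call
age_summary <- summary(ages)
print(age_summary)

# quantile() returns the quartile cut points on their own
age_quartiles <- quantile(ages, probs = c(0.25, 0.50, 0.75))
print(age_quartiles)

# median() and mean() give the two "center" values separately
age_median <- median(ages)
age_mean <- mean(ages)
```

In this toy vector the mean (about 54.7) sits above the median (53) because the single age of 98 pulls the average up, while the median is unaffected by it.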

However, if you use summary() on a non-numeric field, such as state, the output shows the count for each value (in this case, how many times each state appears in the dataset).
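
One thing to watch for: summary() only gives per-value counts when the field is stored as a factor. Since R 4.0.0, read.csv() keeps text columns as plain character vectors by default, so you may need to wrap the field in factor() first. Here is a small sketch; the state values are made up for illustration.

```r
# Hypothetical state values, NOT taken from congress-terms.csv
states <- c("TN", "CA", "TN", "NY", "CA", "TN")

# On a character vector, summary() only reports length, class, and mode
print(summary(states))

# Converting to a factor makes summary() count each distinct value
state_counts <- summary(factor(states))
print(state_counts)
```

With the real data, the equivalent call would be summary(factor(dataFile$state)), assuming the data frame is named dataFile as in Lesson 1.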

Another summary command I will discuss is table(), which shows all values of a variable along with each value’s frequency (how many times that value appears in the dataset).

Below is a table displaying the number of congresspeople who are representatives, as well as those who are senators.
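
As a rough sketch of the call, here is table() on a made-up chamber vector. The values are hypothetical; with the real data you would pass dataFile$chamber instead.

```r
# Hypothetical chamber values, NOT taken from congress-terms.csv
chamber <- c("house", "house", "senate", "house", "senate")

# One-way table: how many times each value appears
chamber_counts <- table(chamber)
print(chamber_counts)
```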

Now here’s what the table would look like if we add another variable (I’ll use congress).

Like the last table, this table shows the number of representatives and senators, except divided up by congress (80th to 113th specifically).
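
Passing two variables to table() produces this kind of cross-tabulation. Below is a minimal sketch on a made-up data frame; the name terms and all of its values are hypothetical stand-ins for the real dataset.

```r
# Hypothetical rows, NOT taken from congress-terms.csv
terms <- data.frame(
  chamber  = c("house", "house", "senate", "house", "senate", "house"),
  congress = c(80, 80, 80, 81, 81, 81)
)

# Two-way table: rows are chambers, columns are congresses
chamber_by_congress <- table(terms$chamber, terms$congress)
print(chamber_by_congress)
```

Each cell counts the rows that share that chamber and that congress, so the cells always add up to the number of rows in the data.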

Thanks for reading,

Michael

 

 

R Lesson 1: Basic R commands

Hello everyone,

It’s Michael, and I thought a perfect first post (aside from my welcome post) would be an intro to the wonderful, statistical and completely free software known as R. The dataset I will use will be congress-terms.csv, which I have attached to this post.

To start, we will first load the file into R. If you are wondering how to do that, here’s the command:

  • dataFile <- read.csv("/Users/michaelorozco-fletcher/Downloads/congress-terms.csv")

You may choose a different variable name. Your file path will be different too. To know what your file path is, open up Excel, then click File > Properties. This window will pop up.

The location field would be your file path (along with a slash and the file name, congress-terms.csv in this case, after “Downloads”).
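
If you want to try read.csv() before downloading anything, here is a self-contained sketch. It writes a tiny made-up CSV to a temporary file and reads it back; the three columns and their values are hypothetical, and only the read.csv() call mirrors the command above.

```r
# Write a tiny made-up CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "congress,chamber,age",
  "80,house,45.2",
  "80,senate,60.1",
  "81,house,38.7"
), csv_path)

# Same kind of call as above, pointed at the temporary file
dataFile <- read.csv(csv_path)
print(dataFile)

# Alternative for interactive sessions: pick the file in a dialog
# instead of typing the path (left commented out so the script can
# run without a user present)
# dataFile <- read.csv(file.choose())
```

Note that R needs plain straight quotes around the path; curly quotes copied from a web page will cause an error.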

All right, now that I explained how to read a CSV file into R, here are some basic R commands.

str(dataFile) displays a summary of all the data fields in the file, which is important for understanding the data you are working with. According to the str() output, there are 18635 observations of 13 variables, which include:

  • congress: which term of Congress a particular congressperson served in (anywhere from the 80th, lasting from 1947 to 1949, to the 113th, lasting from 2013 to 2015)
  • chamber: whether a particular congressperson is a part of the House or Senate
  • bioguide: each congressperson’s ID number within the Biographical Directory of the United States Congress
  • firstname, middlename, lastname: these are self-explanatory
  • suffix: a “Jr.” or “III” or something like that at the end of a particular congressperson’s name
  • birthday: again, self-explanatory
  • state: what state the congressperson serves
  • party: a congressperson’s party affiliation, whether D for Democrat, R for Republican, I for independent, among others
  • incumbent: whether a congressperson was in office at the beginning of a particular term (such as the 110th Congress) or came into office after another congressperson left
  • termstart: when a term of Congress began
  • age: how old a congressperson was when a term began
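
To get a feel for what str() prints, here is a sketch on a tiny made-up data frame. The name terms, the three columns, and their values are hypothetical stand-ins; the real file has 13 variables.

```r
# Hypothetical data frame, NOT taken from congress-terms.csv
terms <- data.frame(
  congress = c(80L, 80L, 81L),
  chamber  = c("house", "senate", "house"),
  age      = c(45.2, 60.1, 38.7)
)

# str() lists the number of observations and variables,
# then each column with its type and first few values
str(terms)

# dim() returns just the row and column counts
terms_dim <- dim(terms)
print(terms_dim)
```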

 

Now let’s check out some other basic commands. I used the age field because it is a numeric field, which these commands require.

Above you will find the mean, sd (standard deviation, the square root of the variance), var (variance, the standard deviation squared), max, and min for the age field. Some inferences we can make include:

  • The ages are fairly spread out: the standard deviation (sd) is 10.67 years
  • The 114.03 variance (var) tells the same story, since it measures how far the ages are spread around the 53.31 mean, in squared years
  • The oldest congressperson was almost 100 when his term began (J. Strom Thurmond, 1902-2003)
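
Here is a minimal sketch of those five commands on a made-up vector, including a check that sd really is the square root of var. The ages are hypothetical and not from the dataset; with the real data you would pass dataFile$age.

```r
# Hypothetical ages, NOT taken from congress-terms.csv
ages <- c(30, 40, 50, 60, 70)

age_mean <- mean(ages)  # average of the values
age_sd   <- sd(ages)    # sample standard deviation
age_var  <- var(ages)   # sample variance
age_max  <- max(ages)   # largest value
age_min  <- min(ages)   # smallest value

print(c(mean = age_mean, sd = age_sd, var = age_var,
        max = age_max, min = age_min))

# The standard deviation is the square root of the variance
print(all.equal(age_sd, sqrt(age_var)))
```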

These are just a few of the basic commands. For more commands check out https://www.calvin.edu/~scofield/courses/m143/materials/RcmdsFromClass.pdf

Here’s the spreadsheet: congress-terms

Thank you,

Michael

Welcome

Hello readers,

My name is Michael, and this is my data science blog. Here you will find plenty of information about data science, ranging from how-to posts to analyses with actual datasets. I’ll mostly focus on MySQL, R, and Excel (with Java lessons from time to time) for now, though I may eventually add other analytical tools like Python.

Thank you,

Michael