A data frame, commonly known as a table, is one of the main objects in R that you will often use when working with data.
What You'll Learn
> To access a data frame’s rows and columns, and to look up the structure of a data frame using the str() command.
If you haven’t installed R and RStudio already, you can watch the "Getting started with Python and R for Data Science" video to get started.
For the dataset used in this exercise, download from here.
A data frame is a two-dimensional structure or, to put it simply, a table. Data arranged in rows and columns is easy to work with in R, and a data frame lets us store a mix of numeric variables, categorical variables, and character strings. The data frame is one of the main objects you’ll be working with in R. So, let’s get ourselves familiar with it.
In our video on reading and writing data, you read in the income data set as a data frame. You can see a mix of data types and variable names here. A data frame makes it easy to subset data in R, as you would’ve seen in our operations video. You can use conditions to extract something you might be interested in. So, for example, I might be interested in extracting the jobs whose average income is, you know, 90,000 or above.
To do this, I’ll simply write “income” and, within my income data set, I use the dollar sign ($) to refer to the variable that I’m interested in, which is “average.income”, and I would like it to be greater than or equal to 90K. Now, in our data frame income, we can refer to specific rows and/or columns inside the square brackets here. So, here we are specifically referring to rows that meet this condition. We want to extract all the rows where the average income value in the row is greater than or equal to 90,000. The comma that follows this means that we can also extract at the column level. So, within income, we can specify the rows and we can specify the columns. Let’s just say I’m interested in the third row of the third column.
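Here is a minimal sketch of the condition-based row extraction described above. The small “income” data frame below is a made-up stand-in for the real data set: only the “average.income” column name comes from the video, and the other columns and all values are assumptions for illustration.

```r
# Hypothetical stand-in for the income data set (values are invented).
income <- data.frame(
  job.title = c("Analyst", "Engineer", "Manager", "Clerk"),
  industry = c("Finance", "Tech", "Retail", "Retail"),
  average.income = c(85000, 120000, 90000, 40000)
)

# The condition goes before the comma, so it selects rows.
# Nothing after the comma means "keep every column".
high.income <- income[income$average.income >= 90000, ]
print(high.income)
```

With this toy data, two rows meet the condition (the 120,000 and 90,000 rows), and all three columns come back with them.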
So, what I’m saying here is I would like the income value that sits at Row 3 of the “average.income” column which is the third column and if we run this, you can see it has extracted the relevant value. The same goes for meeting a condition. So, we just specified the rows we want before the comma and, if we want to specify any columns, we do this after the comma.
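As a rough sketch of the row-and-column lookup just described, the code below uses a made-up “income” data frame. Only the “average.income” name and its position as the third column come from the video; everything else is an assumption.

```r
# Hypothetical stand-in for the income data set (values are invented).
income <- data.frame(
  job.title = c("Analyst", "Engineer", "Manager", "Clerk"),
  industry = c("Finance", "Tech", "Retail", "Retail"),
  average.income = c(85000, 120000, 90000, 40000)
)

# Row 3, column 3: rows before the comma, columns after it.
value.by.position <- income[3, 3]

# The same cell, naming the column instead of counting to it.
value.by.name <- income[3, "average.income"]

# A condition before the comma and a column after it work together too.
high.values <- income[income$average.income >= 90000, "average.income"]

print(value.by.position)
print(high.values)
```

Referring to the column by name tends to be safer than using its position, because the code keeps working if columns are later added or reordered.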
We can also extract a range of rows and columns in our data set. For example, I might want rows one to three and I only want the values from columns one to two of income. Now, we can see the rows one to three showing and only columns one to two of those rows. In a data frame, we can easily add or remove columns too. So, for example, to add a column, we simply type “income”, use this dollar sign ($) to add a variable. We just call it “new.column” and I’m just going to add a bunch of “NA” missing values to that just for the quick demonstration.
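The range extraction and the new column can be sketched like this. The “income” data frame here is a made-up stand-in, so the column names other than “average.income” and all of the values are assumptions.

```r
# Hypothetical stand-in for the income data set (values are invented).
income <- data.frame(
  job.title = c("Analyst", "Engineer", "Manager", "Clerk"),
  industry = c("Finance", "Tech", "Retail", "Retail"),
  average.income = c(85000, 120000, 90000, 40000)
)

# Rows one to three, columns one to two.
first.rows <- income[1:3, 1:2]
print(first.rows)

# Add a column with the dollar sign; the single NA is recycled to every row.
income$new.column <- NA
print(income)
```

The new column is appended at the end, so in this toy example it becomes the fourth column.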
Let’s have a look at this. Okay, cool. Now, to remove a column we follow a similar kind of command here. So, “income” and I want to get rid of the fourth column. So, I’m going to put -4 here, which is the last column in our data set. Let’s have a look. Okay, great.
Now, another useful command in R is str(), short for structure. If we look at the structure of “income”, for example, we can see which variables are numerical, character, or categorical. We can see how many categories, or factor levels, each categorical variable has, how many rows of data we have to work with, and the like.
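A small sketch of removing a column with a negative index follows. It assumes a made-up “income” data frame whose fourth and last column is the “new.column” added earlier; the other column names and the values are invented.

```r
# Hypothetical stand-in for the income data set, including the extra column.
income <- data.frame(
  job.title = c("Analyst", "Engineer", "Manager", "Clerk"),
  industry = c("Finance", "Tech", "Retail", "Retail"),
  average.income = c(85000, 120000, 90000, 40000)
)
income$new.column <- NA

# Keep a copy so the alternative approach below starts from the same data.
income.copy <- income

# A negative index after the comma drops that column; rows are left alone.
income <- income[, -4]
print(names(income))

# Alternative: assigning NULL removes a column by name instead of position.
income.copy$new.column <- NULL
print(names(income.copy))
```

Both approaches leave the same three columns. Removing by name avoids dropping the wrong column if the column order ever changes.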
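To try the structure command on something small, here is a sketch with a made-up “income” data frame. The columns and values are assumptions, and “stringsAsFactors = TRUE” is set only so that the text columns show up as factors with levels, as they do in the video.

```r
# Hypothetical stand-in for the income data set (values are invented).
# stringsAsFactors = TRUE turns the text columns into factors.
income <- data.frame(
  job.title = c("Analyst", "Engineer", "Manager", "Clerk"),
  industry = c("Finance", "Tech", "Retail", "Retail"),
  average.income = c(85000, 120000, 90000, 40000),
  stringsAsFactors = TRUE
)

# Print the number of rows and variables, plus each column's type.
str(income)

# The factor levels that str() summarises can also be listed directly.
print(levels(income$industry))
```

The first line of the output reports the number of observations and variables, and each following line shows one column with its type and first few values.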
So, now that you’re familiar with data frames, we’ll move on to vectors in the next video.
Rebecca Merrett - Rebecca holds a bachelor’s degree of information and media from the University of Technology Sydney and a post graduate diploma in mathematics and statistics from the University of Southern Queensland. She has a background in technical writing for games dev and has written for tech publications.
© Copyright – Data Science Dojo