We will cover ways to do feature engineering both with dplyr (“mutate” and “transmute”) and base R (“ifelse”) including how to create new columns based on the existing features in the original data set.
What You'll Learn
> To impute missing values as well as create new values based on existing columns
> Use dplyr for data manipulation tasks
To get setup with dplyr, watch our Setup & Data Preparation tutorial and our accompanying blog post here.
Hello everyone today we’re going to talk about how to do feature engineering with dplyr, including how to create new columns based on the existing features in the original data set, we’re also going to cover how to combine different data sets into a new entity using different methods so let’s get started.
We’re going to continue using the wine ratings data set from Kaggle. If you don’t know what I’m talking about please refer to previous videos in this series. Make sure that you already have the wine data set as we produced it last time if you’re not sure we’re going to post this data set on our Github page so you could always download it from there. So I just opened up my RStudio and I have the data loaded on my machine as you can see over here, it’s the original data set after dropping two columns. So now it has nine variables. We can do a quick view of what this dataset that looks like and make sure that you do have the exact same set as I do so I believe we drop the description column and everything else should remain the same it still has the same number of rows as as the original dataset from Kaggle. So let’s get started, first I’m going to load dplyr. You should already have dplyr installed on your machine. So I can, I already do, so I can just call the library.
The first thing we’re going to introduce is how to create a new column based on existing columns using the function called mutate. so, I’ll quickly demonstrate how that works. If I do the dataset’s name, and then use a pipe operator I can do mutate here I’m gonna create a new variable let’s call it PPratio because I want to find out the points to price ratio if I’m interested in the value in a wine so in other words if the points variable is high and price is low, then this PPratio is going to be higher. So I just ran that and you can see I’m gonna scroll up here. We just created a new column called PPratio and these numbers are the division of points by price, one thing to bear in mind is that the mutate function is a non-mutating function, ironically, meaning we just did this line of code but if we examine our original data set named wine, it’s not changed even though we used a mutate function on it. So if we want to make sure, if we do want to append this new column to our original data set we have to overwrite it by assigning this line of code to the original object name. So if I run this line of code now it’s going to overwrite the original data set here so after I hit enter you should see that this dataset called “wine” is going to have 10 instead of 9 variables and let’s do that now.
Another function that’s similar to mutate is called transmute, and the only difference here is that when you do transmute instead of mutate, it’s only going to produce the new variable that we’re, that we’re producing here so we won’t, instead of all these other columns: regions, winery names, countries, it’s only going to contain this new variable called PPratio. I’m going to run that here... Oh it’s pretty big. But as you can see it’s only going to, instead of a data frame that has all, that retains all the original columns from our original data frame, transmute only produces a new column and if we want to. Again transmute is also a non-mutating function. So you can see over here. Just to confirm that our wine object is not changed even though we just called this function transmute on it. So if you do want to produce a new object let’s create a new object called wine1 and then run this function on it. Checking over here this new variable, new object called wine1 is, has the same number of rows as our original data frame but it only contains one variable which is the PPratio that we just produced. So now that you understand how transmute and mutate work.
Now let’s go ahead and find out how many observations there are for each country using our old friend group-by aggregation. So if I’m just curious and I want to find out which countries produce more or fewer wines in this set, I can do something like group_by country and then summarize. Inside the summarize function I can create a new column as well I can call it whatever but let’s just call it “total” and I’m gonna do the “n” function here which is going to count the number of observations by each country since that’s a variable we’re grouping by here Alright so now we’re running into a problem because there are five observations that have the the country variable missing. Missing values are a common problem when we deal with any kind of real-world data set but don’t worry too much here because we’re going to learn how to impute the missingness here so if you want to find out what these five observations are that don’t have a country specified we can subset this data set by using this syntax so here I’m saying that, out of the wine data set I want to subset all the rows that have the country variable missing or all the rows that have the country variable set to nothing. So there’s nothing in these quotes and then I put nothing after the comma because I’m saying I want all the rows that satisfied this condition and all the columns. So I’m putting nothing after the comma. So after I run that I can see that not surprisingly I’m getting five observations since the previous line of code told me that there are five missing observations that don’t have country specified. So here we have the these observations listed and one reason I decided to subset these rows while retaining all the columns is that usually we can try to glean some information from the other columns so even though for these observations the country column is missing, they do have most of the other columns filled out so it’s always a good practice to try to impute the missing values based on the existing values so in this case we can see that, let’s start with designation.
So if I don’t know what this word “Askitikos” means remember to always be resourceful and just Google around or ask a friend who might be knowledgeable in this area and just, you know, be resourceful and try to look for information online and come up with with an answer. So I just copied this word and if I go to Google I’m gonna search. Alright, so I guess, some results here. If I just click on some of them looks like this is a Greek wine or it might be a region in Greece, from what I can gather. So here it’s confirming what I was suspecting so looks like this designation implies that the country should be Greece. Going down the line “Shah” might be a little more difficult so we’ll deal with that later and if we want to find out where Piedra Feliz is referring to, again just google it there we go so looks like this is a region or a place in Chile so now we know that these last three observations probably have Chile as their country so “Shah” may be a little more ambiguous but if we go over here we see that the specific wineries’ names listed. I don’t even know how to pronounce that but I’ll just copy and paste into Google and it’s telling me that this is a winery in Turkey.
So even though we started out with five missing values for country we have just come up with a way to impute all of them. And the syntax for filling out these missing values in country, one way to do that is using the function called “ifelse.” This is not a dplyr function; it’s just a base R function but it’s very useful as well for feature engineering so I’ll quickly show you how that works. I’m gonna go over here to copy and paste so to explain what this line of code is saying “ifelse” takes usually three arguments inside the parentheses, so the first one is a condition. So here we’re saying that in this wine data set if the designation column is this word, if that condition is true the second argument is saying that we want this country column, so in front of the equal sign here, to be Greece and remember to put the word in quotes. So here’s a condition, here is what you do if that condition is true, and the third argument is what you do you if the condition is not satisfied. So we’re saying that in this data set wine if the designation column contains this word we’re going to fill the country column with Greece but otherwise just set it as the country column. So in other words leave it unchanged. I’m gonna run that I see that nothing’s showing up but after we run all three lines of code we can check to see that everything worked and we’re gonna do the same thing for Chile. I just press the up arrow on my keyboard to copy the last line of code I ran, just to save some time and then I can just change the country and region names, so here I’m saying if an observation’s designation is set to Piedra Feliz I want the country name to be Chile and otherwise the country name would be the the country name, so don’t change it and the third one because we arrived at the country name by winery instead of designation I’m going to change designation to winery and just copy and paste the winery’s name. Notice that we probably could have found out Greece and Chile based on the winery names as well because these tend to be pretty specific but designation worked for us so I’m gonna leave that. Okay so we just imputed all five missing values for country and I want to confirm that everything worked I’m gonna just keep pressing the up arrow button on my keyboard and then do another group_by summarize and just to make sure that everything worked. So here we go, we see that there are no missing values any more.
One thing to note here is that even though we just imputed the missing values for country and that was a mutating function meaning our original dataset called wine has been changed after we filled in the blanks for the country variable. The actual group-by aggregation functions are non-mutating. So we did something like this. Every time we just do a group_by and summarize series it’s not going to impact the original data set we started out with. So here if we want to create a new data frame that contains this new column called total and we’re going to arrange that from top to, from the highest to the lowest values. So if you remember from our previous videos or if you already know this function “arrange” we’re passing descending into “arrange” in order to sort from the highest to lowest values we’re going to name this new data frame. Just call it “newwine” I forgot to actually pass in a variable within “arrange” so we’re saying arranging by total from the highest to the lowest. Come over here to make sure that this new data frame has been created and it has 48 observations and only two columns if we take a look at what we just created we can do View.
So here you can see that group-by aggregation produces new objects that only have whatever variable we grouped by and the new variable we create within summarize function so even though the original wine data set had ten variables after the group-by aggregation we only have two now; so all the other variables that we didn’t group by or summarize, like winery names, regions, prices, points, they just all went away so that’s something to bear in mind. And because of the volume of this data set we’re going to just create a subset of it Let’s call it subset1 and we’re just gonna select the first six rows using the head function and we’re gonna create a second subset and this time we’re gonna choose the first six rows of the new wine data frame that we just created over here and you can see that subset1 has six observations of ten variables because we chose the first six rows of the original wine data frame that has ten variables and whereas subset2 also has six observations, but only two variables because again this new wine data frame only has two columns so we’re going to start talking about how to join data sets now using four different methods to join these two subsets. The first one is called full_join. Remember that these join functions are also non-mutating so we have to assign them to a new variable if we want to, if you want an object to actually contain the product of the function otherwise if we just use the function, it’s gonna go away. And it’s hard to call it later on. So the first function is called “full_join” and I’m just gonna call it, create a new variable called “full.” I’m gonna pass in the data set names, subset1 and subset2 so here we see that we just create a new data frame. If we want to examine what it looks like we’ll just call it by its name called “full” here you can see that we have nine observations but some of these columns are missing and that’s because we do a full join, RStudio is going to include all the observations in both subset1 and subset2 even though these subsets have some distinct or non-overlapping values. So to demonstrate what I mean here let’s just take a look at what subset1 and subset2 actually contain.
So you see that subset1 has only has three unique values for country, whereas subset2 has not surprisingly six because we grouped by country. So when we do a full join the result is going to include all these rows and remember that the columns have to have the exact same name so in subset1 it’s called country and that’s the exact same name in subset2 as well and even though subset2 has 3 countries: Italy, Chile, and Argentina that don’t appear in subset1, when we do a full join the resulting data frame as we can see here is going to include all these countries. And not only that it’s also going to include all the observations from subset1 so you see that even though subset1 and subset2 both only have six observations when we did a full join we get nine observations and that’s because in addition to subset1 that only has three unique countries we’re adding Italy, Chile, and Argentina on top of subset1 and because subset2 only has two columns and it doesn’t have all these other features: designation, point, price, etc, but it does have this new column called total for these three new rows that don’t exist in subset1 it’s just going to show n/a for all these columns that don’t appear in subset2 because we don’t have values for Italy, Chile, or Argentina. So you see that all these columns are NAs except the very last one called total.
The second type of join is called inner join, I’m gonna create a new variable called inner and assign that to the inner join function so exactly like the syntax for for full join. I’m just gonna pass in the data set names. So here you can see that whereas the full object has nine observations, the inner object only had six. And let’s just take a look at what they are. So here you see that we don’t have any NA values and notice that we also don’t have Italy, Chile, or Argentina any more like we did in full. And that’s because whereas full join will combine everything in these two datasets regardless of missing values inner join is only going to join the datasets by common values. So because US, Spain, and France are the only overlapping countries that appear in both subset1 and subset2, they’re the only ones that will be retained after an inner join. So pretty much we just lost the values for Italy Chile and Argentina. The final two join functions are called left_join and right_join. And similarly I can do something like this and create a new variable called left and the syntax is exactly the same. You simply pass the dataframes’ names into it. I’m also gonna do right_ join. So left and right joins simply means, left_join will mean retain everything from subset1 and right_join will mean retain everything from subset2. So come over here we can see that left_join has six observations whereas right has nine observations and that’s because we created, subset1 is in a way a subset of subset2, if that makes sense. Because subset2 has six countries where a subset1 only has three. So when we do a left_join retaining only unique values from subset1, we get fewer observations than if we do a right_join. That’s because when we do a right_join we’re retaining every unique value from subset2 and our subset2 has more unique country values than subset1. That’s how we end up getting more values from doing a right_join than a left_join. And the result of the right_join should be the same as a full_join. If we want to confirm that we can just call it and see that we have all six countries with NA values for Italy, Chile, and Argentina and if we want to look at the left variable it looks the same as the inner_join product. So it doesn’t have the three values that only appear in subset2 and remember that we decided to subset from the two different data frames wine and newwine, or wine1, rather. It’s because we have over 150,000 observations and doing joins would take too much time and computing power so in order to illustrate how the join functions work I selected these subsets just to make it more time efficient for the purposes of this tutorial. That’s it for today’s tutorial.
Thank you for watching the dplyr series. If you’ve liked what you’ve seen so far please check our other videos on YouTube as well as follow our blog page. If you have any questions be sure to ask in the comments below. Thanks for watching!
Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.
© Copyright – Data Science Dojo