Subset, Reshape, and Summarize Data






     

    Course Description

    In this video, we’re going to subset, group, and summarize data with some very useful functions in dplyr and show basic visualization techniques with ggplot2.

    What You'll Learn

 > Use the arrange, group_by, and summarize functions

 > Use ggplot2 for plotting



     

    To get set up with dplyr, watch our Setup & Data Preparation Tutorial.

    Be sure to also check out our accompanying blog post here


     

    Hi everyone, this is Ningxi with Data Science Dojo. Today we’re going to introduce some very useful functions, including group_by, summarize, and arrange, to start manipulating your data. We’re also going to walk you through some basic visualization techniques with ggplot2; just a few simple lines of code can create beautiful plots.

    One of my favorite tools in dplyr, and I think this applies not just to dplyr but to data analysis in general, is the group-by aggregation combo. It groups the dataset by a certain feature and then performs some aggregation function on each subgroup, which comes in really handy if we want summary statistics on the different subgroups for a quick exam of how the groups differ in our dataset. Group-by aggregation is one of the most powerful tools you’re going to use.

    So, the first function we’re going to demonstrate with that is “arrange.” Also note the dplyr syntax called the “pipe operator.” It’s very handy when it comes to passing multiple functions sequentially, since it takes whatever object comes immediately before it as the object the next function is going to use. So I’m going to do the group-by aggregation, which in dplyr syntax is “group_by,” its own function, followed by “summarize.” Here we’re going to group by “country” because we want summary statistics on the different countries, followed by another pipe operator. Whatever function comes after this operator takes the object created by the previous function as its argument, and so on, which is very useful.
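The pipe operator described above can be sketched in a couple of lines. This is a generic illustration with made-up numbers, not the exact code from the video:

```r
library(dplyr)  # provides the %>% pipe operator

# Without the pipe, calls nest inside-out:
round(mean(c(2, 4, 9)))           # the mean is 5, which rounds to 5

# With the pipe, each result feeds into the next function, left to right:
c(2, 4, 9) %>% mean() %>% round()
```

Both lines compute the same thing; the piped version reads in the order the operations actually happen, which is why long dplyr chains stay readable.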

    We’ll do “group_by” and then “summarize.” We want to count how many wines each country produced in this dataset, so when I do this, I’m creating a new variable. I could have named this variable anything, but I’m naming it “count” because that makes sense. Inside the “summarize” function I’ll call another function named “n,” which does not take any argument in the parentheses. It counts all the observations grouped by country and stores the result in the new variable called “count.”
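As code, the group_by/summarize pattern looks like this. The `wine` data frame here is a tiny made-up stand-in, since the real dataset is loaded in the earlier setup video:

```r
library(dplyr)

# Tiny stand-in for the wine reviews data used in the video
wine <- data.frame(
  country = c("US", "US", "France", "Italy"),
  points  = c(88, 92, 90, 86),
  price   = c(25, 60, 45, 20)
)

counts <- wine %>%
  group_by(country) %>%
  summarize(count = n())   # n() counts the rows in each group
counts
```

The result has one row per country and two columns: the grouping column and the new count.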

    So right now we’re creating a new data frame that has only two columns: the column we grouped by, “country,” and the column we just created, “count,” inside the “summarize” function. We want to know who the top producers in this dataset are, so we’ll do something like “arrange.” That sorts all the values, and because “arrange” defaults to ascending order, it starts from the lowest value and goes to the highest. We want something like “desc,” standing for “descending,” by “count,” since we want the counts from the largest value to the smallest. When we do that, we get, like I said, the two-column data frame with only “country” and “count.” We see the top ten producers listed, with 30-some countries not shown; there are 49 countries in total in this set.

    Because of the magnitude of this dataset, we want to focus our attention on, let’s say, the top ten producers. So I’m going to create a new variable called “selected_countries” that does just that. Again, I’m starting with our original dataset, “wine,” grouping by “country,” and summarizing by “count.” Note that the “n” function is what actually does the counting; “count” is just a name I chose to give the result. So far this just repeats my previous line of code. Here I’m introducing a new function called “top_n,” and inside it I can pass any number I wish. I could do 12 or 15, but I’m selecting the top ten producers, so I’m passing 10. Finally, a “select” function lets us choose which columns we want. Note that it only works for subsetting columns, not rows; we’ll see how to subset rows shortly.

    Ok, so here I actually forgot to pass “count” inside the “arrange” function, and otherwise it does not know which variable I want to sort by.
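The full pipeline described above can be sketched like this, again on stand-in data (the real set has 49 countries, so `top_n(10, ...)` genuinely trims it there):

```r
library(dplyr)

# Stand-in data; the real set has 49 countries
wine <- data.frame(
  country = c("US", "US", "US", "France", "France", "Italy", "Spain"),
  points  = c(88, 92, 85, 90, 91, 86, 84)
)

selected_countries <- wine %>%
  group_by(country) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%   # sort the largest producers first
  top_n(10, count) %>%       # keep the ten largest counts (all rows here)
  select(country)            # select() subsets columns, never rows
```

Note that `arrange(desc(count))` names the sorting variable explicitly; without it, arrange has nothing to sort by.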
    If it were sorting by the variable “country,” that wouldn’t make any sense, so remember to pass “count” into the “arrange” function; now it’s selecting by count. If we examine this new object we just created, it is actually a dataframe. If we check out its structure, here we go, and that’s not what we want, because as you’ll see later, we want this object to be a vector so we can reference it directly. The way to do that is to overwrite the object using the “as.character” function. Because the object is still a dataframe, I use the dollar-sign syntax to reference which column; even though it only has one column, I still need to explicitly say that I am selecting the “country” column. Now if I check this out again, it’s a character vector.

    Now suppose we want to create yet another data frame containing only data on these top ten producers, and we want to find out how their wines were rated in this set; that rating, as you can see here, is stored in the “points” column. So we’re going to subset again. Let’s create a new object and call it “select_points.” Starting from our original data frame, we use a function called “filter,” which is used for filtering rows. So as opposed to “select,” which works with columns, “filter” works with rows. We’re saying that we only want the rows whose value in the “country” column appears in the “selected_countries” vector, the top ten producers list that we came up with a moment ago. Then we select only certain columns out of this dataframe, just the “country” and “points” columns, and do another “arrange” by country. If we check this object out, we’ll see that it is a new dataframe containing only the top ten countries we selected and their points.
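The vector conversion and the row filter can be sketched as follows; the small `wine` and `selected_countries` objects are stand-ins for the ones built earlier in the video:

```r
library(dplyr)

wine <- data.frame(
  country = c("US", "US", "France", "Italy", "Moldova"),
  points  = c(88, 92, 90, 86, 89),
  price   = c(25, 60, 45, 20, 12)
)

# Pretend this one-column data frame came out of the top_n() step
selected_countries <- data.frame(country = c("US", "France", "Italy"))

# Overwrite it as a character vector so we can reference it directly
selected_countries <- as.character(selected_countries$country)

select_points <- wine %>%
  filter(country %in% selected_countries) %>%  # filter() subsets rows
  select(country, points) %>%                  # select() subsets columns
  arrange(country)
```

`%in%` returns TRUE for each row whose country appears in the vector, which is exactly the membership test the transcript describes.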

     I’m skipping most of the values because there are too many rows. Now suppose we want a quick examination of how the variable “points” moves with price; in other words, we want to know if a more expensive bottle of wine really is better, or at least judged better by Wine Enthusiast, where this dataset was obtained. I think that’s a question most of us encounter when we try to pick a good bottle of wine at a store or at a bar, because oftentimes we don’t know if more expensive really does mean better. So let’s do a quick exam using ggplot. I’ll do a scatterplot, passing the geom_point function after the ggplot function, and also add a smooth line just to make it visually easier to see the trajectory of the correlation. It’s taking a second here; as you can see the system’s still running. Going over to “Plots,” we can see that overall there is a positive correlation between points and price, even though it’s not a huge jump, and a few outliers here. The most expensive wines, located towards the top of the graph, do seem to be highly rated as well, except this one right here. So rest assured: if you’re paying more for a bottle of wine, chances are better than average that you’re actually getting higher quality than average. That’s interesting to know. And if we want to delve into the country-specific details of how wines were rated, we can do a more specific plot, or in this case a series of plots, a series of boxplots, and see what we get.
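A sketch of that points-versus-price scatterplot; the `wine` data here is simulated with a mild positive trend, standing in for the real reviews:

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in: 100 wines with a weak positive price-points trend
wine <- data.frame(price = runif(100, 5, 200))
wine$points <- 80 + 10 * wine$price / 200 + rnorm(100, sd = 2)

p <- ggplot(wine, aes(x = price, y = points)) +
  geom_point() +   # one dot per wine
  geom_smooth()    # fitted trend line with a confidence band
p
```

`geom_smooth()` defaults to a loess fit for datasets of this size, which is what produces the gently curving trend line seen in the video.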

    Here I’m saying that I want the variable on the y-axis to be points, pretty straightforward. On the x-axis I want country. However, this variable is going to be reordered by the “points” column, and we’re saying that we want the median of points to be the reordering metric when we’re placing these country values on the x-axis. Moving on to calling the boxplot geom, we’re gonna fill these boxplots by country. Remember to use “fill” instead of “color,” because “color” only colors the outline of the boxplots; it does not fill the entire box. Finally, we’re adding some labels to our plot so it looks pretty; don’t forget to use quotes. We can also add a title, let’s say “Distribution of Top Ten Wine Producing Countries.” Here I’m just gonna make sure that the title appears right in the center, so I’m gonna do something like “element_text,” since we’re gonna move the textual element, and pass in the “hjust” argument, which stands for horizontal justification. I’m setting it to 0.5, since this is treated as a value between 0 and 1, so 0.5 means it’s horizontally centered, essentially. Let’s see what this gives us.

    Here we have a series of boxplots of our top ten producing countries, ranked by their median point. We can see that by the median metric, France is the highest rated country in terms of wine quality, even though it does not vary a lot from the runner-up, Austria. One caveat when we’re using this dataset is that all the observations are of wines that received a rating of 80 or more, since Wine Enthusiast did not even include wines that received a rating below 80.
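The boxplot call described above can be sketched like this, with a small invented data frame in place of the filtered top-ten set:

```r
library(ggplot2)

# Stand-in: three countries with several ratings each
wine <- data.frame(
  country = rep(c("France", "Austria", "US"), each = 5),
  points  = c(91, 92, 90, 93, 94,  90, 91, 92, 90, 91,  85, 95, 88, 87, 92)
)

p <- ggplot(wine, aes(x = reorder(country, points, median), y = points)) +
  geom_boxplot(aes(fill = country)) +  # fill colors the whole box, not just the outline
  xlab("Country") + ylab("Points") +
  ggtitle("Distribution of Top Ten Wine Producing Countries") +
  theme(plot.title = element_text(hjust = 0.5))  # 0.5 = horizontally centered
p
```

`reorder(country, points, median)` is what places the countries on the x-axis in order of their median rating rather than alphabetically.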

    So the distribution may be a little tighter, since all these wines were essentially pre-filtered before they were even included in the dataset. Here we can see that we have some countries with a wider spread, like the US and Spain, and some with tighter distributions, like New Zealand, Italy, and Austria, and some outliers across the board, really. Now suppose we want to find out whether any countries produced high quality wines but are not mass producers, meaning they’re not in the top ten producing countries like we found here, but nonetheless produced wines that were rated high. We can do something like the following. This time around we’re gonna filter for countries that do not appear on the “selected_countries” list, so we’ll use this syntax. The exclamation point, like in many other programming languages, means exclude; we’re selecting countries that are not in this list. Continuing the piping sequence, we’re gonna do another group-by aggregation, grouping by country again and then summarizing. This time around we’re gonna create a new variable called “median,” computed by applying the “median” function, don’t be confused here, to the “points” variable.
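The exclusion filter plus median aggregation looks like this as code; the data and the two-country `selected_countries` vector are stand-ins for the real objects:

```r
library(dplyr)

wine <- data.frame(
  country = c("US", "US", "France", "England", "India", "Slovenia"),
  points  = c(88, 92, 90, 94, 93, 91)
)
selected_countries <- c("US", "France")  # stand-in for the top ten list

hidden_gems <- wine %>%
  filter(!(country %in% selected_countries)) %>%  # ! excludes the top producers
  group_by(country) %>%
  summarize(median = median(points)) %>%          # median rating per country
  arrange(desc(median))                           # best rated first
hidden_gems
```

The `!` in front of the `%in%` test flips the membership check, keeping only countries outside the top-ten list.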

    I’m gonna do another “arrange,” again in descending order, since we want the top rated wines to come up first. So here you go. After filtering for countries that do not appear on the top ten list, and arranging by their median point, we see that the top high quality wine producing countries are England, India (who would’ve thought, right?), Germany, Slovenia, Canada, and Morocco, which is an interesting one as well. So on, so forth. Keep an open mind when you go wine shopping next time.

    If we wanna find out which countries not only produce a large quantity of wine but also offer high quality products, we can use the “intersect” function and find the intersecting rows of two dataframes. It’s also a handy tool to use with dplyr. Here I’ll quickly create another variable called “top,” and this time I’m not filtering by any rows; I’m just gonna do a quick group-by aggregation again, group by and summarize, same thing as before, using the median point as the metric. If you check that out, it’s actually gonna be a dataframe, so we’re gonna transform it into a vector. I’ll overwrite it, selecting only the “country” column. So there you are: these are the countries most highly rated in terms of their wine quality, including both the top producers and otherwise. By now you should be able to properly select columns and filter rows, as well as use ggplot with dplyr to create seamless visualizations.
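Building the top-rated vector and intersecting it with the top producers can be sketched like this, on stand-in data:

```r
library(dplyr)

wine <- data.frame(
  country = c("US", "US", "France", "England", "Italy"),
  points  = c(88, 92, 94, 95, 85)
)
selected_countries <- c("US", "France", "Italy")  # stand-in top producers

top <- wine %>%
  group_by(country) %>%
  summarize(median = median(points)) %>%
  arrange(desc(median)) %>%
  top_n(10, median) %>%             # top ten by median rating
  select(country)
top <- as.character(top$country)    # data frame column -> character vector

# Countries that are both big producers and highly rated
intersect(selected_countries, top)
```

On character vectors, `intersect` simply returns the values present in both lists; Part 3 of the series covers using it on whole data frames.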

    In Part 3 of this Introduction to dplyr, we’re going to close out by introducing the “intersect” and “setdiff” functions to show you how to find overlapping and distinct values from two different data sources. We’ll see you soon.

    Next, let’s find the most represented grape varietals in this dataset. I know I used “count” as my new count variable before, so just to switch it up, I’m gonna use the word “number” now, doing the exact same thing: another pipe operator and an “arrange” function. Again I want descending order, since I want the largest value to appear on top, and then I’m selecting the top, say, ten grapes. Oh, looks like there’s a typo. There we go. Let’s check out this variable. It’s again another dataframe, and I want it as a vector, so let’s do something like this to overwrite the object, and now it’s a character vector. These are the top ten most represented grape varietals in this dataset. If we want to find out which varietals tend to be rated higher than others, we can again, with the help of ggplot, visualize this pretty easily.
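The varietal counting step mirrors the country counting from before, just with a different name for the count; a sketch on stand-in data:

```r
library(dplyr)

# Stand-in: a handful of reviews with a variety column
wine <- data.frame(
  variety = c("Pinot Noir", "Pinot Noir", "Chardonnay", "Syrah",
              "Chardonnay", "Pinot Noir")
)

top_wine <- wine %>%
  group_by(variety) %>%
  summarize(number = n()) %>%   # same counting as before, renamed "number"
  arrange(desc(number)) %>%
  top_n(10, number) %>%
  select(variety)
top_wine <- as.character(top_wine$variety)  # vector, not a data frame
top_wine
```

As with countries, converting the one-column data frame to a character vector lets us pass it straight into a later `%in%` filter.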

    We’re gonna do another “filter” from dplyr, using the same syntax, and filter for rows whose value of variety appears in the “top_wine” list we just created, then do another “group_by” and “summarize.” We’ll call the result “median” again, but I could name it anything I like. You see that ggplot works seamlessly with the pipe operator syntax in dplyr, so here I’m passing it directly after all the other functions, and again each function takes the object immediately preceding it as its argument; it’s like a waterfall structure, so to speak. Here I want to order my x-axis, which is gonna be “variety,” but we want to make sure the order is dictated by the “median” variable; don’t be confused here. I’ll do a geom_col function, which draws columns whose heights are the values we supply, here the median points. I’m gonna fill it by variety, and I’m also adding some labels and a title, let’s say “Variety” and “Median Point,” a straightforward label. Since there are some grape names that are longer, for example sauvignon blanc or Bordeaux-style red blend, they’re not all going to fit on the x-axis, so I’m gonna abbreviate the x-axis labels. Because it is a discrete variable, I’m spelling that out here as “scale_x_discrete,” pretty straightforward, and the labels are gonna be abbreviated. Let’s see what this gives us. Not a huge dispersion here, but then again, one big caveat when working with this set is that all the wines were already pre-selected and rated above 80. We do see that Bordeaux-style red blend tends to be the highest rated grape, followed by pinot noir and syrah. So on, so forth.

    Finally, if you’re a savvy consumer, not just a curious oenophile, you might want to know where the best values are.
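The piped-straight-into-ggplot column chart can be sketched like this, again on invented data standing in for the real reviews:

```r
library(dplyr)
library(ggplot2)

wine <- data.frame(
  variety = rep(c("Bordeaux-style Red Blend", "Pinot Noir", "Syrah"), each = 3),
  points  = c(92, 93, 91,  90, 91, 92,  88, 89, 90)
)
top_wine <- unique(wine$variety)  # stand-in for the varietal list built earlier

p <- wine %>%
  filter(variety %in% top_wine) %>%
  group_by(variety) %>%
  summarize(median = median(points)) %>%
  ggplot(aes(x = reorder(variety, median), y = median, fill = variety)) +
  geom_col() +                           # column height = the median value
  xlab("Variety") + ylab("Median Point") +
  scale_x_discrete(labels = abbreviate)  # shorten long grape names on the axis
p
```

Passing the function `abbreviate` to `labels` tells ggplot2 to run every axis label through it, which is how long names like "Bordeaux-style Red Blend" get compressed to fit.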
    So in other words, if you want to find bottles of wine that are of as high quality as possible but don’t cost too much, we can do another “intersect” and find the rows that appear towards the top of the ranking list but towards the lower end of the price list. Let’s do that now. Say we want to find the top 15% of all wines in terms of ranking; you should know by heart how to do this now, making sure it’s in descending order according to the “points” variable. Then we filter. Since we’re selecting the top 15%, the probability in the quantile function is gonna be 0.85: we’re saying we want the observations with a value higher than 85% of the whole set, in other words, the top 15%. There we have it. Now we’re gonna find the cheapest 15%, this time arranging by price instead of points. When we do an “intersect,” we want to make sure that our two dataframes have the same numbers of rows, and without looking up how many rows there are in the “top15p” object I just created, I’m gonna say that for the “cheap15percent” object I want to select however many rows there are in my other dataframe. So instead of the “top_n” syntax I’m now using the “head” function, which achieves a similar goal, but instead of explicitly passing a number, I’m saying I want however many rows there are in the other dataset. There we have it. Finally, we want to know where all the good value wines are, so let’s do an “intersect” function and pass the two new dataframes that I just created. There we go. And if we want to examine this object, let’s just call it. There you go: looks like we have 16 bottles, from Portugal, France, the US, Italy, and Austria, that are among the top 15% in terms of rating and the bottom 15% in terms of price.
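The whole good-value workflow can be sketched as follows; the simulated `wine` data stands in for the real reviews, so the actual rows returned will differ:

```r
library(dplyr)

set.seed(42)
# Simulated stand-in for the wine reviews
wine <- data.frame(
  country = sample(c("Portugal", "France", "US"), 100, replace = TRUE),
  points  = sample(80:100, 100, replace = TRUE),
  price   = sample(5:300, 100, replace = TRUE)
)

# Top 15% by rating: keep points above the 85th percentile
top15p <- wine %>%
  arrange(desc(points)) %>%
  filter(points > quantile(points, probs = 0.85))

# Cheapest wines, matched to the same number of rows
cheap15p <- wine %>%
  arrange(price) %>%      # cheapest first this time
  head(nrow(top15p))      # head() instead of top_n(): take the first n rows

# Rows that appear in both data frames: highly rated AND cheap
good_value <- intersect(top15p, cheap15p)
```

Because both data frames have identical columns, dplyr's `intersect` returns exactly the rows common to both, which is the good-value list.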

    So, get more bang for your buck, so to speak; next time you go shopping you know where to look for good wine values. I might actually put these on my personal shopping list before I go to the store. That’s it for our tutorial today. There are many, many other things dplyr can do, but I think the most important takeaway from today’s video is that group-by aggregation, the “group_by” and “summarize” functions used as a combo, is a great way to slice and dice a dataset, and it really provides a lot of insight into whatever data you have. A lot of people may not realize it, but this simple syntax, a tool almost anyone can use, is very powerful. You might learn other, more advanced machine learning techniques, like neural networks or support vector machines, but always remember that a simple group-by aggregation can oftentimes achieve a lot more than you would otherwise realize. It really is a foundational skill to have and hone continuously.

    Well, if you enjoyed today’s video, check out our other videos on YouTube and follow us on social media; we’re on Facebook, YouTube, LinkedIn, and Meetup. Be sure to check out our blog at blog.datasciencedojo.com as well; we routinely post interesting articles and tutorials such as this one there. Thank you for your time, and see you next time.


     

    Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.