Intersect & Setdiff – Combining Datasets

    Course Description

    We’re going to talk about how to find overlapping and non-overlapping values from two different data sources, using dplyr to slice and dice a dataset of wine ratings from Kaggle.

    What You'll Learn

     To use group-by aggregation, i.e. the group_by and summarize functions as a combo, to slice and dice a dataset

     To get set up with dplyr, watch our Setup & Data Preparation tutorial and our accompanying blog post here.

    Hello everyone, today we’re going to talk about how to find overlapping and non-overlapping values from two different data sources using dplyr. This builds on the same wine ratings dataset that we’ve used in previous videos.
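
    Since the walkthrough picks up where the previous videos left off, here is a minimal setup sketch. The CSV file name is an assumption; point it at whatever copy of the Kaggle wine ratings file you downloaded earlier.

```r
# Minimal setup sketch: load dplyr/ggplot2 and the wine ratings data.
# The file name below is hypothetical; use your local copy.
library(dplyr)
library(ggplot2)

wine <- read.csv("winemag-data-130k-v2.csv", stringsAsFactors = FALSE)
```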

    If we want to find out which rows appear in both the selected-countries list and the global top list, we can, as I mentioned, use the intersect function and pass the two vectors in. Let’s check out who the winners are: these ten countries are in both the top list and the selected-countries list. If we want to refine the top list further, let’s select only the top ten contenders. I’ll overwrite the top object and keep only the first ten entries. If we examine the top object again, here are the ones that remain. Let’s run the intersect function again and check out this variable. Oh, sorry, I should have used top instead of both: we see that only three countries appear on both the top 10 producing list and the top 10 global best list. So next time you go shopping, pay attention to French, Austrian, and Australian wines. As a complement to intersect, we can find out which observations are not in both lists, in other words, which entries appear in only one of the lists. For that we use setdiff, which searches for values that don’t belong to both lists. Here we could probably have guessed that it would be the remaining seven countries, but I typed out the syntax so you can see clearly what I mean: these seven countries are not in the top 10 producing countries list, but they produce high-quality wines nonetheless.
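
    For reference, here is a rough sketch of those steps. The top vector is rebuilt with the group-by pattern from the earlier videos; the selected vector is a hypothetical stand-in, since its exact contents come from a previous video.

```r
# Rebuild the top producing countries as a character vector,
# keeping only the first ten entries
top <- wine %>%
  group_by(country) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  head(10) %>%
  pull(country)

# Hypothetical stand-in for the selected-countries list
selected <- c("France", "Austria", "Australia", "Portugal", "US",
              "Italy", "Germany", "Spain", "Chile", "Argentina")

intersect(selected, top)  # countries that appear on both lists
setdiff(selected, top)    # entries that appear on only one list
```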

    After dissecting this data frame from a country perspective, we can now turn our focus to the grapes. In other words, if we want to find out which grape varietals are the most represented in this dataset, we can do something like the following. Let’s create a new object, top_wine, starting all the way from the original wine dataset, and do another group-by aggregation. This time we group by variety instead of country and then summarize as usual. I used count as the name of my count variable before, so just to switch it up I’m going to use the word number now. Then comes another pipe operator and the arrange function; again I set it to descending order, since I want the largest value to appear on top, and then I select the top, say, ten grapes. Oh, looks like there is a typo. There we go. Let’s check out this variable: it’s again a data frame, and I want it as a vector, so let’s do something like this to overwrite the object. Now it’s a character vector, and these are the ten most represented grape varietals in this dataset.

    If we want to find out which varietals tend to be rated higher than others, we can again visualize this pretty easily with the help of ggplot. We do another filter from dplyr, with the same syntax, keeping only the rows whose variety values are in the top_wine list we just created, and then another group_by summarize. I’m going to call the new variable median, though I could name it anything I like. Notice that ggplot works seamlessly with the pipe operator syntax in dplyr, so here I pass it directly after all the other functions; each function takes the output of the one immediately preceding it as its argument, a waterfall structure so to speak. I want my x-axis to be variety, but I want the order to be dictated by the median variable (the median of the points, don’t be confused by the name). Then comes a geom_col function; unlike geom_bar, which counts how many observations appear in each category, geom_col draws the bar heights directly from the median values. I fill by variety and also add some labels and a title, say variety and median points, a straightforward label. Some grape names are long, for example Sauvignon Blanc or Bordeaux-style red blend, and they won’t all fit on the x-axis, so I abbreviate the x-axis labels. Because variety is a discrete variable, I spell that out with scale_x_discrete and set the labels to be abbreviated. Let’s see what this gives us. There isn’t a huge dispersion here, but then again, one big caveat with this dataset is that all the wines were pre-selected and rated above 80. Still, we see that Bordeaux-style red blend tends to be the highest-rated grape, followed by Pinot Noir and Syrah, and so on. Finally, if you’re a savvy consumer and not just a curious oenophile, you might want to know where the best values are.
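
    Here is a sketch of both steps, assuming the variety and points columns of the Kaggle dataset. pull() collapses the one-column data frame into the character vector that the video builds in a separate overwrite step.

```r
# Ten most represented grape varietals, as a character vector
top_wine <- wine %>%
  group_by(variety) %>%
  summarize(number = n()) %>%
  arrange(desc(number)) %>%
  top_n(10, number) %>%
  pull(variety)

# Median rating per top varietal, piped straight into ggplot
wine %>%
  filter(variety %in% top_wine) %>%
  group_by(variety) %>%
  summarize(median = median(points)) %>%
  ggplot(aes(x = reorder(variety, -median), y = median, fill = variety)) +
  geom_col() +                            # bar heights come from the median values
  labs(title = "Median rating by varietal", x = "variety", y = "median points") +
  scale_x_discrete(labels = abbreviate)   # shorten long grape names on the x-axis
```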
In other words, if we want to find bottles of wine that are of as high quality as possible but don’t cost that much, we can do another intersect and find the rows that appear towards the top of the ranking list but towards the lower end of the price list. Let’s do that now. Say we want the top 15% of all wines in terms of rating; you should know by heart how to do this by now. I make sure everything is in descending order according to the points variable, and then I filter. Since we’re selecting the top 15%, the probability in the quantile function is 0.85: we’re saying we want the observations with a value higher than 85% of the whole set, in other words, the top 15%. There we have it. Now we find the cheapest 15%, this time arranging by price instead of points. When we do an intersect, we want to make sure our two data frames have the same number of rows, so without looking up how many rows there are in the top15p object I just created, I say that for the cheap-15-percent object I want to select however many rows there are in the other data frame. Instead of the top_n syntax, I’m now using the head function, which achieves a similar goal: rather than passing a number explicitly, I ask for however many rows there are in the other dataset. There we have it. Finally, we want to know where all the good-value wines are, so let’s do an intersect and pass in the two new data frames I just created. There we go. To examine this object, I just call it. So there you go: it looks like we have 16 bottles, from Portugal, France, the US, Italy, and Austria, that are among the top 15% in terms of rating and the bottom 15% in terms of price. You get more bang for your buck, so to speak. Next time you go shopping, you know where to look for good-value wine. I might actually put these on my personal shopping list before I go to the store. That’s it for our tutorial today.
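
    A sketch of the good-value search, assuming numeric points and price columns. The Kaggle price column contains missing values, so this sketch drops those rows first so that both data frames are built from the same base.

```r
# Top 15% of wines by rating: keep rows whose points exceed
# the 85th percentile of the whole set
top15p <- wine %>%
  filter(!is.na(price)) %>%
  arrange(desc(points)) %>%
  filter(points > quantile(points, 0.85, na.rm = TRUE))

# Cheapest slice of the same size: head() takes however many
# rows top15p has, instead of a hard-coded top_n count
cheap15p <- wine %>%
  filter(!is.na(price)) %>%
  arrange(price) %>%
  head(nrow(top15p))

# Rows present in both data frames are the good-value bottles
good_value <- intersect(top15p, cheap15p)
good_value
```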

    There are many, many other things dplyr can do, but I think the most important takeaway from today’s video is that group-by aggregation, that is, the group_by and summarize functions used as a combo, is a great way to slice and dice a dataset, and it provides a lot of insight into whatever data you have. In the next video we’re going to talk about how to do feature engineering using dplyr, including how to create new columns based on existing features in a dataset. We’ll also learn different ways to join datasets, including full join, inner join, left join, and right join. Thank you for watching today’s video, and see you next time.

    Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.