Joining Datasets

r_subheading-Course Description-r_end In this Data Mining with Azure Machine Learning Studio tutorial, we will show the process of joining data sets by a key and walk through the entire process with appropriate examples. This powerful technique can help in a lot of analyses and is extremely useful in data science. r_break r_break r_subheading-What You'll Learn-r_end • Joining data sets in Azure ML. r_break • Improving the quality of the ML model. r_break • Types of joins in data science.


Hey. Welcome back to Data Mining with Azure Machine Learning Studio, brought to you by Data Science Dojo. r_break r_break Today, we’re going to learn how to join two data sets together by a key. If you’ve ever used SQL before, it’s the same thing. If you’ve ever used Excel before, it’s a more powerful version of a VLOOKUP. So joining allows us to have multiple tables share relevant information with each other, or it can be used to cross-reference additional information about each transaction. Now, joining usually expands the number of columns or features that we have access to, which may give the machine learning model access to more relevant features to learn from. r_break r_break So we’re going to continue where we left off yesterday, so make sure you watched the previous video, because it’s going to be a prerequisite for the current video. r_break r_break So last video, we took the airport codes, and then we basically cloned the data and then we renamed– on one side, we prefixed everything with the column name, and on the right side, we went ahead and prefixed everything with destinations. So now when we do a join, we’ll know that are we talking about a destination state or an origin state? So what we got to do is, first of all, we got to bring in our main transaction data. r_break r_break So our main transaction data, as you remember, will be the flight delay data. So if I open up the sample data files, scroll down, I will go ahead and find the flight on time performance raw. I’ll drag that in, OK? So we have to join on a key. So if we look at either of these two data sets, so if we visualize, right-click and visualize on the output node, we can see that airport ID will be the unique identifier used to cross-reference. r_break r_break And if we go into flight on time performance and visualize, you’ll see that there are two columns that are using these airport ID keys. So an origin airport ID and a destination airport ID. So when we combine these two data sets to this transaction table, we will end up with at least three additional columns per join. So this is going to add one, two, three, and this is also going to add one, two, three. r_break r_break So this will hopefully improve the quality that our machine learning model can learn from later. So our machine learning model can learn, oh, man, this flight is coming from Alaska. It’s always going to be late if it’s coming from Alaska, for example. It can learn those details like that. So to do that, go to the very top left corner of your toolbar and then basically expand it, and then, in the search box, type in the word join. r_break r_break You will have a module called join data. So I’m going to drag that in here. And I like to put it so it’s in the middle of the two data sets that I’m going to try to join together. So notice that the main transaction table, so the table that I want the transactions to be based off of, is going to be the table on the left. Azure ML is left-join based, and I’ll explain what that means in a little bit. But the idea is you want the main transaction table on the left side, and you want the cross-reference table to be on the right side. r_break r_break And notice that the join module has two input nodes, a left and right. So I’m going to drag this over here, and I’ll drag this over here. So notice these two data sets are now funneling into this join data module. So if I click on this join data module, you’ll see that there’s a value required. So that means that I have to expand this. I have to select it, make sure it’s highlighted in blue, and then I have to expand the properties table for it. r_break r_break All right, so now it’s asking us what key do we want to join on on the left, and what key do we want to cross-reference with that join on the right? So I will go ahead and launch the Column Selector and say I want to join the origin airport ID. You see that? I’ve selected the origin airport ID on the left table, the left table being the main transaction table. And on the right side, the right side, this is the origin airport. So notice that I’m going to try to join these IDs together. So I’m going to launch the Column Selector and notice that airport ID, I’m going to drag that in. So I want the origin airport ID to be cross-referenced with the origin airport ID. r_break r_break Now, notice that we can match the case. In this case, I think they’re the same because they’re numbers, but I’m just going to uncheck this anyways. It’s usually a habit of mine. And then the next thing I’m going to do is see this? See this that says keep right key columns in joined table? Now, if you need to keep the airport ID of this data set, this is what you would do, you would keep it checked. But since they’re going to be redundant information for me, I’m going to go ahead and uncheck this box, which means I’m only going to keep the origin airport ID on the left side, and I’m going to drop the origin airport ID on the right table. r_break r_break And now, this part’s going to be a little bit confusing. What join type do you want? So join type is referring to, basically, when you do the join, what happens when there is a match or when there’s not a match? r_break r_break So when you do an inner join, let’s look at the inner join first. When you join two tables together, there’s a table and a right table. When you do an inner join, you only want to see rows that have a match on both sides, right? The same primary key shows up on the left table as the right table, and then you want to exclude anything that doesn’t have a pair. That’s an inner join. We don’t want that right now, because remember, our main transaction table is on the left. We care to keep every transaction table. We’re only hoping to add additional information to our table. r_break r_break So that’s where we want to do what’s called a left outer join. So the left outer join is going to keep every key on the left table, regardless of whether it has a match to the right table or not. r_break r_break And notice that everything in the right table that does not have a match is going to be dropped. And then the opposite is true for right outer join, and then for a full outer join, you keep everything. Whether there is a match or not a match, you keep everything. So what we’re going to do is we’re going to do a left outer join to ensure that all transactions are kept in the left table. So we’re going to go ahead and select left outer join here. And then what we’re going to do is join data module, we’re going to add some documentation. r_break r_break So I’m going to say look up origin airport location. So this is going to bring in airport name, airport state, airport city. So I’m going to run this module now, so it’s going to do that join for us. OK and it’s done. r_break r_break So if we go ahead and visualize the output of this brand-new table that is now an amalgamation of these two tables combined, we can then visualize. So if the join happened properly, we shouldn’t have dropped any rows. So we have 504,000 rows after the join, and before the join, if I visualize, I should also have 504,000 rows, which is true. r_break r_break So the join so far has not corrupted any data. And if we go back into the data set, we can go ahead and see if we can find– so we can see that the destination airport ID is right there. That was one of our keys that we were joining. And notice at the end, it’s added that the airport ID again, origin city, origin state, and origin airport. So what I did was I forgot to uncheck this box right here, so that’s why the airport ID came in from this table, too, even though it’s redundant because we already have a destination airport ID. I’m just going to go back to this join and uncheck this box. But I also know that this did wonderfully. This join did what it was supposed to do. r_break r_break So now I can go ahead and join this table to the second table. So I can go ahead and select this join. I can either right-click and say a copy and then right-click and say paste, or I can just go to my keyboard and say Control + C while selecting this and then Control + V and then it will copy the join data module so that I can connect my new table. So remember, the primary transaction table is going to be this table. It’s going to be the left table. r_break r_break And then we want to now join the destination airport to the airport ID. So I am going to drag this join module until it’s in the center of these two data sets, just so I’m crystal on what’s happening here. All right, so notice that it’s on the left table. It brought in our original value, which was origin airport ID, which is wrong. r_break r_break I want to launch this Column Selector again, and instead of saying origin airport ID, I want to say destination airport ID. I want to check this box. So I’m joining destination airport ID with the airport ID inside of this table on the right side, which is where we renamed the columns to be prefixed with destination. And I want to keep everything the same, so I want it to also be a left outer join. And we’re not going to keep the right key, and we’re not going to match casing. All right, so I’m going to run that. r_break r_break So I’ve done and I’ve successfully– once this is running, once this is finished running, I would have successfully added six columns to this data set, and these six columns will be probably very invaluable to us when we go and build our machine learning models. r_break r_break And that’s how you join data, two data sets, together in Azure ML. So let’s look at it real quickly, and then I think that’s it for time. So if we go here, yup. So destination, destination, destination, origin, origin, origin, and then we’re left with 504,000 rows still, which is good. No data got lost during the join. r_break r_break So join us next time, when I will show you how to summarize, show you some summary statistics from your data, and how to drop columns. Hey, if you liked that video you want to see more videos like this in the future, go ahead and like and subscribe, and I will look forward to seeing you at our boot camp.

Phuc H Duong - Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.