Renaming Columns and Replicating Data

r_subheading-Course Description-r_end Renaming columns is an essential task in data science. There are several cases where renaming columns and replicating data comes in handy. The flight delay data is used in this tutorial to explain these basic concepts with real-life examples to highlight the usefulness of these activities. r_break r_break r_subheading-What You'll Learn-r_end • To renaming columns and replicating data. r_break • Importance of these practices in data science. r_break • Steps involved in the process.


Hey, and welcome back to data mining with Azure Machine Learning Studio, brought to you by Data Science Dojo. Today we’re going to learn how to rename columns. r_break r_break Renaming columns is a very mundane, but necessary, task in data science. These are some of the scenarios that might make it necessary to rename columns. r_break r_break So let’s say you’ve read in a data set, and there’s no headers. Now you have to go in and manually provide it, right? Another scenario– someone gave you a bad column name. Now you need to rename it so that it has some more clarity among you and your team, so you can actually look at a column and instantly know what it is for. r_break r_break Another scenario is a column has transformed into something different completely. Let’s say you had a table of customer names and that you’ve normalized all ages. Now that column is completely different. The column should actually say normalized ages, for clarity. So we’re just going to begin where we left off last time. r_break r_break So last time we went ahead and did some exploration. Through data exploration, we looked at these column names. And then what we found out was if we visualize the column names again, we can see that there is origin airport ID and destination airport ID. So, it just so happens that there is another table inside of Azure ML. r_break r_break So if you look at sample data sets, and yes, if you drag in a data set called airport codes data set, if you drag that in, OK, and visualize. So what is in here? So these are the airport IDs. So the idea is if we can cross-reference these IDs, the airport IDs from the actual flight table to this table, we can actually then get what state did the flight depart from? What city did the flight depart from? And also what is the name of that airport, which actually might be more clear than airport ID. OK? r_break r_break So we’re going to do a join on this table later. But before we do that though, something is going to cause that to be a little bit weird, because we actually need to do this join twice, right? Because there is an airport ID for destination and there is an airport ID for origin. All right? r_break r_break So we’re going have to make that join twice. But if we join twice on the same thing, the column names are going to be the same. OK? So we’re going to have two states. We’re going to have two cities. And that’s going to make everything very, very confusing. r_break r_break So what we’re going to do is actually we’re going to go ahead and clone this data set twice and rename the columns both times on this. So that way, we can do a join on the left and we know this is the origin state. And then when we join on the right, now we can say, OK, this is the destination state. OK, so to rename columns, what we’re going to do is we’re going to use a module called the METADATA EDITOR. So if you drag in the METADATA EDITOR, OK, and if you expand the toolbar on the right-hand side, you can see that we can launch what’s called a COLUMN SELECTOR. r_break r_break So these are the columns that we want to make the changes to. And notice at the very bottom, there is a section called NEW COLUMN NAMES. The NEW COLUMN NAMES accepts basically a list of new column names to be overwritten. OK? And it’s in the same order. All right. r_break r_break So the idea is if I connect the airport codes to this edit metadata workflow right here. So notice that the output of this module is going to the input of this module. So the data is flowing out of the data module, and it’s flowing into the metadata module. OK? r_break r_break And now, because this is connected, if I launch this column selector, it’s going to see the columns already there for me, which is really nice. OK? And then I can go ahead and select which of the columns I want to rename. So I want to rename city, state, and name. OK? So that’s what I’m going to do. And then I’m going say check, OK? r_break r_break So notice it’s going to give me an array right here, or a list, of the column names, OK? And if I paste that into here and I just say city name one, city name– or state two, and name three, that’s what it’s going to rename the columns to in that order, right? So the first index matches with the first index, the second index matches the second index, and so forth. r_break r_break But what I’m going to do is I’m actually going to rename it to be something more clear. OK? So notice I’m going to bring this into a text editor and I’m going to rename the beginning of each of the column names. So the first one I’m going to say is origin, OK? So I will know that there is an origin city, origin state, and origin name. OK? That’s what I’m going to rename this one to. So if I click back into here and paste the new column names back in, and if I hit the RUN button once again, it’s going to go ahead and go in and change all the column names for me. And noticed that I’ve ignored, OK, I’ve ignored airport ID. OK? r_break r_break So we’ll show you in the next episode why we’re going to ignore airport ID. r_break r_break So notice that if I look at this data set now, OK, origin city has been renamed. City has been named to origin city, same with state, and same with name. OK? And it might be more clear if instead of saying origin name, I would say origin airport. I think that’s much more clear. OK? r_break r_break So I want to run this again. And notice that our other data set, the flight delay data set, has been removed from our workspace. That’s because it didn’t have any execution modules associated with that. OK. r_break r_break So if we visualize the output of the edit metadata now, we can see that now it’s origin airport, origin state, origin city, OK? But notice that didn’t change the original data set. If I go back to the original data set output module and right click and visualize, the data set is still the same. Notice that the transformation only happens once it goes through this edit metadata editor. And notice the output of the edit metadata is the changed names. OK? r_break r_break And also, you can document what this thing is doing. So if you double click on a module, notice it opens up a text box for me. Now I can type in a quick statement about what this is going to do. So I can say, RENAME AIRPORTS AS ORIGIN. OK? I think that’s much more clear as to what this is doing. r_break r_break And, OK, that’s only one side of the story, right? I need a secondary data set where the columns actually say destination, right? So I’m going to connect this over here. And I forget what it’s actually called. So I’m going to drag in the flight data again and actually I have the list of column names right over here. OK. So it’s called DEST. OK? So I will keep this naming convention so I can be more consistent about it. r_break r_break So inside of this, I’ll go back to my original notes. This works in every note editor, note editor, like Notepad++, Sublime, or even Word. And also in Exel. OK? So you can do a FIND AND REPLACE. So I can find anything that starts with ORIGIN_ and rename it to DEST_. OK? So I can hit and do a REPLACE ALL. This is a universal function in most text editors. I’m going to copy that now. And notice that because all I did was copy and paste this module, it automatically brought in original column names that I had before. r_break r_break So all I have to do now is say destination city, destination state, and destination name, and if I run this, I now have a workflow where the data starts off as one and then it branches into two different things. So actually, these are two separate data sets right now, which is really, really cool, right? r_break r_break So this is one data set over here, where everything is called origin, and there’s another data set over here, where everything’s called DEST, OK? So now we are very clear that when we do these joins in next episode that this is a destination airport, not an origin airport. r_break r_break OK, let’s also do a little bit of housekeeping. Down here– actually, over here, I renamed it origin airport. Yet over here, it’s called destination name. So I’m going to renamed this real quick so they match up. So it said it would say DESTINATION NAME, I’ll say DESTINATION ORIGIN, OK? And then the comment over here, instead of saying ORIGIN, I’ll just say DEST, OK? So now we know that this site is renaming differently than this side, OK? So I’m going to hit this RUN button again to finalize the results. r_break r_break And then we’re going to do some safekeeping now, or some housekeeping now. So before we couldn’t save the experiment because we didn’t have an execution module. So an execution module is any module that performs an action. You can’t just save an experiment with just a data object in it. It has to have some kind of execution module. So I get to name the experiment here. So I get to rename this PREDICTING FLIGHT DELAYS and I can hit the SAVE button now. OK? r_break r_break And if I hit this EXPERIMENT button again, this will list all of my experiments. And notice that it’s actually called PREDICTING FLIGHT DELAYS now. I can go one step further than that and also add this to the project that I created in the previous video. So if I click on this PROJECT button on this side over here. So if you can’t see this window it’s because this right toolbar is minimized. So what you want to do is you’re want to not to select any modules. So basically click in a blank spot and then click PROJECTS, right? And then you’ll want to add that to an existing project. r_break r_break So in the last video, I made a project called PREDICTING FLIGHT DELAYS, and I’m going to check that, and say yep, yep. So this experiment and any data that I build, and any model that I train, will be associated with this project. OK? And so later on, when you have lots of different basically versions of this experiment, they will all be collected together, so you’ll know what it’s doing. It’s not that big of a deal now, because we only have one problem that we’re working on. But the idea is as the number of projects you work on increases, things get really confusing. r_break r_break All right and that concludes it for today’s video. Join us next time where I will show you how to join two data sets together. Hey, if you liked that video and you want to see more videos like this in the future, go ahead and like and subscribe. And I will look forward to seeing you at our boot camp.

Phuc H Duong - Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.