Dropping & Selecting Columns
We will discuss how to drop or remove columns or features from a dataset. Dropping columns not only helps the model learn better but also speeds it up, making it more efficient. The steps involved in the process are walked through in detail with appropriate examples.
What You'll Learn
> Introduction to dropping and removing columns
> The need for dropping columns
> The process of making the model efficient
Hey, welcome back to Data Mining with Azure Machine Learning Studio, brought to you by Data Science Dojo. Today, we’re going to go and learn how to drop or remove columns or features from our dataset.
When you have features that may lead the model astray or features that the model doesn't know how to work with, like text or image data, dropping them becomes necessary to guide the model's learning.
Now dropping columns will also speed up your operations, especially within Azure ML, since every time you run a module, it goes ahead and caches the resulting data set for that module on a separate computer. So it makes our payloads lighter and it speeds up our workflow by shedding those columns that we don't need. But you want to make sure that the columns we're dropping do indeed add no value in their current form, because we want the model to learn from as much data as it can have access to.
And let's go ahead and get started. So two videos ago, I did some data exploration and then we identified columns that would not add value to our machine learning model in their current form. Now if you want to hear my rationale on why we're dropping these particular columns in this iteration, go ahead and watch that video. It's called data exploration, and we did it about two videos ago. And this is the list that I ended up with.
So notice I have a list of columns I want to drop like quarter, or month, day of the week, et cetera, et cetera. Not that these columns aren’t useful, they’re just not useful right now in their current form. OK, so I’m going to show you three different methods, three different ways, to drop columns from Azure ML.
There are three different ways that do the same thing, but they approach it differently. And there might be times where one is more optimal than the others. All right, so let's go into that. So you want to search in your toolbox for a module called Select Columns. So I will drag in the Select Columns in Dataset module, and I will connect it directly after my joins. So after I've joined all of my data sets together in the last video, I will then take the output of that join.
So this table that now has six extra columns on it will now be thrown into the Select Columns module. So this Select Columns module will let me decide, if I launch this column selector– so there’s three ways.
So the first way is this window pops up right here. So this window will only pop up if there is a green checkbox in the previous module. If there is not a green checkbox in the previous module, what you want to do is you want to hover over this– you want to select the module that is the dependency. You want to hover over Run and then hit the Run Selected. So it’s going to go ahead and run everything up until this module. So it needs to have a green checkbox for you to see this particular window where you have Available Columns and Selected Columns.
So this is the first method, which is you have this column selector, OK? So what this column selector does is you select the column names that you want to keep and throw them into the Selected Columns side.
So you can either do– you can throw the columns you want to keep onto the right side– for example, one at a time– or what you can do is you can select all. Say you want to start with every column and then start by dropping a particular column.
So I’m going to start doing that real quick. So in this case, Year, Quarter, Month, and I think Day of the Month is something that I want to leave behind. So these are the columns that are being left behind right now. Because I have Airport Name in the last join for both the origin and the destination, I no longer need both the origin and the destination airport ID. So I can drop those now.
So let’s go back to this too. So it also says I should drop CRS departure time and CRS arrival time. So let’s go ahead and do that real quick. So CRS arrival time– I’m going to hold down the Shift button so I can select multiple things. CRS arrival time and CRS departure time– so I’m holding down the control button. That’s how I’m selecting multiple things at the same time.
And then I can go ahead and tell it that I want to leave these columns behind. And then I think that's it. Now remember we have four response classes. And we only want to keep one response class. That's the Arrival Delay 15, so we want this to be an easy classification machine learning problem. So I'm going to go ahead and go and find Arrival Delay. I'll leave that behind. I'm also going to leave behind the Cancelled and the Diverted columns. I want to leave those behind too. So these are the columns I'm going to be left with going forward. I'm going to leave behind 11 columns. These are columns that I'm dropping, and I'm going to bring forward 13 columns.
I'm going to go ahead and hit this Run button now. So notice that once I've hit the Run button, there is a list of columns that I'm going to keep inside of the Launch Column Selector. So the output of this Select Columns in Dataset module will be another data set. It will be cached within this module, and it should have, if we did it right, far fewer columns than it had before. So it should have 13 columns, if we look at it this way. Here we are. So that's one method, which is we had a window where we selected which columns we wanted to keep and which columns we wanted to leave behind.
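For anyone who wants to see the same idea outside the Studio canvas, here is a rough sketch of the "pick the columns to keep" approach. It assumes pandas, which is not part of the video, and the small table and column names below are made up for illustration; they are not the actual flight data set.

```python
import pandas as pd

# Hypothetical stand-in for the joined flight table (illustrative names only).
flights = pd.DataFrame({
    "Year": [2013, 2013],
    "Quarter": [1, 1],
    "Month": [1, 2],
    "Carrier": ["AA", "DL"],
    "DepDelay": [5.0, -2.0],
    "ArrDel15": [0, 1],
})

# Name the columns to keep; everything not listed is left behind.
keep = ["Carrier", "DepDelay", "ArrDel15"]
selected = flights[keep]

# The result is a new, narrower table; the original is unchanged.
print(list(selected.columns))
```

Listing the columns to keep works well when the table is small, which is roughly the situation in the window above with Available Columns and Selected Columns.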
Let me show you another method. So I'm going to copy this Select Columns in Dataset module. And then I'm going to drag it over here. I'm also going to show you that it's just a parallel workflow. It does the same thing. So it's up to you which one you want to keep, I'm just showing you an alternate method. So let's say you had thousands of columns. That could be a problem sometimes. You don't want to specify individual columns that you want to keep one at a time. Maybe there's a thousand columns, and you only want to drop four of them.
So here’s a way to do that. So if you launch the Column Selector, you can filter columns by name– that’s what we did last time– or we can do it by rule. So on the left side, there is by rules or by names. So with rules, there’s two modes. I can begin with no columns selected and then I can add individual columns to this list. For example, notice that I can X these out or I can add them in. The secondary method I can do is– I want to say Begin with All Columns, Begin with All Columns. And then, instead of saying include, I would say exclude.
All right, so I want to begin with all of the columns, so 13 plus 11. And these are the particular columns that I'm about to list to be excluded. So I'm going to exclude by name: Year, Quarter, Month, Day of the Month. I think we're going to drop also... well, you get the idea. So that's the secondary method of doing it. So I'm going to delete this for less confusion. And then I'm going to document what this is doing.
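The "begin with all columns, then exclude" rule can be sketched in code too. Again this assumes pandas, which the video does not use, and the table and column names are hypothetical. The point is only that you list the few columns to drop instead of the many to keep.

```python
import pandas as pd

# Hypothetical stand-in for the joined flight table (illustrative names only).
flights = pd.DataFrame({
    "Year": [2013, 2013],
    "Quarter": [1, 1],
    "Month": [1, 2],
    "DayofMonth": [14, 3],
    "Carrier": ["AA", "DL"],
    "ArrDel15": [0, 1],
})

# Begin with all columns, then exclude the ones that add no value right now.
to_drop = ["Year", "Quarter", "Month", "DayofMonth"]
remaining = flights.drop(columns=to_drop)

# Same result as naming every column to keep, just shorter to write
# when only a handful of columns need to go.
keep = [c for c in flights.columns if c not in to_drop]
same = remaining.equals(flights[keep])
print(list(remaining.columns), same)
```

With a thousand columns and four to drop, the exclude list stays four items long, which is the reason to prefer this mode in that situation.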
So in the Select Columns, I’m going to say this is dropping columns. And then I’m going to expand it. And that’s how you drop columns in Azure ML.
So join us next time where I’ll show you how to clean missing values from our data set and also how to get summary statistics out of our dataset. Hey, if you liked that video and you want to see more videos like this in the future, go ahead and like and subscribe. And I will look forward to seeing you at our boot camp.
Phuc H Duong - Phuc holds a Bachelor's degree in Business with a focus on Information Systems and Accounting from the University of Washington.
© Copyright – Data Science Dojo