Splitting Data & Categorical Casting
Machine learning models behave differently based on how the categories are typed. We will discuss why categorical data needs to be treated differently. The tutorial also introduces the concept of splitting the data into two partitions, a training set and a test set.
What You'll Learn
> Cast data according to the appropriate data types
> Sample the data set in partitions
> Make the model robust and responsive to future/unseen data
Hey. Welcome back to Data Mining with Azure Machine Learning Studio, brought to you by Data Science Dojo.
So last time what we did was we cleaned all of our data. We made it nice and pristine for a machine learning model, so now it won’t scream at us about any missing values. And today what we’re going to do is make sure, before we feed this into the machine learning model, that all of our features are cast to their proper data types.
So the machine learning model will behave differently based upon how the categories are typed. And then the next thing we’ve got to do is split our data into two partitions, a test set and a training set. So the first thing we’re getting to is categorical data. So why does categorical data need to be treated differently? So numerics you can leave as numerics. Just make sure they’re listed as numerics.
But let’s look at this data right here, where it’s flight ID and state. So clearly state in this case is going to be a category. Arizona, Washington, Arizona, Texas. But the whole backbone of machine learning is based upon math, algorithms, and things like that. And you can’t do math on this category. You can’t divide, for example, Arizona by Washington. You can’t add Washington to Arizona and get something else out. So the next problem is that you can’t do distance calculations on it.
Distance calculation is the core principle by which a machine learning algorithm determines that something is similar or not similar to something else. So what normally has to be done for a machine learning model, or for any computer, to understand data in the form of categories is you have to create a separate column for each category. This is called one-hot encoding. It is also called binarization.
So here is an example. So notice that every category gets its own column, and then we have a one where it’s present in that row. So notice that there is a column called is Arizona, and because flight ID one was Arizona, there’s a one here. So notice that it’s going to spawn four columns. So the number of categories you have is how many columns you’re going to end up with, and these columns are going to be mutually exclusive of one another.
So notice that if you’re Washington, you can’t be Texas. And if you’re Texas, you can’t be California, for example. So this will be the same thing for male and female, and the same thing for time zones. This will be the same thing for zip codes. Anything that’s a category. So this is really prevalent in other data mining platforms such as Excel and things like that. But Azure ML actually has a data type called categorical, which will do this binarization transformation for you, without you having to think about it.
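To make the one-column-per-category idea concrete, here is a minimal sketch in plain Python. The rows, the `is_` column prefix, and the `one_hot` helper are all made up for illustration; this is not how Azure ML implements its categorical type internally.

```python
# Hypothetical rows mirroring the flight ID / state example; values are illustrative.
flights = [
    {"flight_id": 1, "state": "Arizona"},
    {"flight_id": 2, "state": "Washington"},
    {"flight_id": 3, "state": "Arizona"},
    {"flight_id": 4, "state": "Texas"},
    {"flight_id": 5, "state": "California"},
]

def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {key: value for key, value in row.items() if key != column}
        # Exactly one indicator is 1 per row, so the columns are mutually exclusive.
        for category in categories:
            new_row[f"is_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

encoded_flights = one_hot(flights, "state")
for row in encoded_flights:
    print(row)
```

Four distinct states produce four indicator columns, and each row has a single one among them.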
So what we have to do is, we have to go into Azure ML and cast all of our categories into categorical data types, so that our computer treats them properly. So let’s go into our Azure ML workspace, and we’ll continue where we left off last time. So if you look under clean missing data here, that’s the last thing that we did, you should have also had the summarized data from last time as well. If not, go ahead and drag it in.
So what I’m going to do is I’m going to right click on this, and I’m going to visualize the summarized data. And summarized data has what’s called the unique value count column. So it basically tells you how many distinct values are in each column. And the way you can tell that something should be a category is to look at the ratio of the unique value count to the total row count.
So you’ll notice that there’s only seven possible values in day of the week, out of almost 500,000 rows. That tells you that hey, this is probably a category. Because there are so few unique values relative to the count. And if we look at all of this, probably everything should be a category. For most of them you can tell because the ratio is low, but there are other ones where it’s kind of higher, like origin_city.
But that just comes from domain knowledge that we established earlier. We know that a flight is either in a given city or it isn’t. So these cities should be distinct buckets.
Right, so the flights can be, basically, put into different buckets here. So there’s 268 different cities that you can land in. Next thing is departure delay and arrival delay. Notice that there’s only two unique values here, so these are binary features. So we should definitely convert them into categories. And specifically, it is very important that we cast our response class into the correct data type. So notice our response class is arrival delay, whether or not you’ll be late by 15 minutes. So if we left this the way it is right now, it is a numeric.
And how you can tell that is, if you visualize the data right now and click on the column itself, so if you mouse over and click on arrival delay, you’ll see it is a numeric feature.
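The ratio check described above can be sketched in a few lines of plain Python. The column names and values here are hypothetical stand-ins, not the actual flight data.

```python
def unique_ratio(values):
    """Fraction of distinct values among all values; a low ratio hints at a category."""
    return len(set(values)) / len(values)

# Hypothetical columns, much smaller than the roughly 500,000-row data set.
day_of_week = [1, 2, 3, 4, 5, 6, 7] * 1000    # 7 distinct values in 7,000 rows
dep_delay_minutes = list(range(-20, 6980))     # 7,000 rows, every value distinct

day_ratio = unique_ratio(day_of_week)
delay_ratio = unique_ratio(dep_delay_minutes)
print(f"day_of_week ratio: {day_ratio:.4f}")
print(f"dep_delay_minutes ratio: {delay_ratio:.4f}")
```

The ratio is only a hint. A column like origin city has a higher ratio and is still a category, and that call comes from domain knowledge rather than from the number.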
So in regard to supervised learning, there’s two types of supervised learning. There is regression, which is where you’re trying to predict a number, and there is classification, where you’re trying to predict a category. So in this case, if you ran this through a machine learning model right now, it would try to do regression, and you’ll get weird numbers out, like the arrival delay will be two. The arrival delay might be negative one, because it’s trying to extrapolate along a line. And that wouldn’t make sense, because it can only be zero or one.
So it is very important, that for a classification problem, that the response class is converted into a categorical data type in Azure ML.
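Here is a small sketch of why a numeric 0/1 label causes trouble. It fits an ordinary least-squares line to made-up data in plain Python; the numbers are invented, and this is a stand-in for whatever regression algorithm a tool would pick, not Azure ML's own behavior.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: departure delay in minutes vs. a 0/1 "arrived late" label.
dep_delay = [0, 5, 10, 20, 30, 40]
arrived_late = [0, 0, 0, 1, 1, 1]

slope, intercept = fit_line(dep_delay, arrived_late)
prediction_long_delay = slope * 120 + intercept   # far above 1 for a 120-minute delay
prediction_early = slope * -30 + intercept        # below 0 for a 30-minute early departure
print(prediction_long_delay, prediction_early)
```

Both predictions fall outside the only two values the label can take, which is the kind of weird output the regression path gives you.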
And the next thing is, if you look at this entire data set, basically every feature should be a category. The only feature that should not be a category is departure delay in minutes. So that one actually is on a numeric spectrum.
So let me teach you how to do that real quick. So you can change things into the proper data types by searching for the metadata editor. So we used this earlier to rename our columns right above here, if you remember this from one of the earlier videos. But this can be used to edit the data about the data.
So metadata is data about data. So we’re going to edit the data around the data, like data types and things like that. So connect that to your current workflow: connect the output of Clean Missing Data to the input of the Edit Metadata module.
Now we can launch the column selector and select which columns we want to be transformed. And remember, our transform in this case is, we’re going to convert everything to categorical. So since we only have one thing that isn’t a category, which is this guy right here, departure delay, what I’m actually going to do is a Control A, which selects everything. You can also hold down Shift after clicking the first one, and then click the last one. So Shift will go ahead and select the rest of them as well. So you can do a Control A, or you can do a Shift selection.
And then you want to say, I want all the columns to be– or all the features to be part of the transformation. And then you can say go ahead, OK, I want everything except departure delay. So notice that departure delay is not going to be affected in this transformation, but every other column will be. So I’m gonna hit Check and say yes, these are the columns I want to be transformed. And then the transformation itself I will select here and say, make categorical. And this will go ahead and cast all the columns into a categorical data type. So remember earlier when I showed you that table.
When it comes time to build a machine learning model, it’s actually going to extrapolate and expand out the table as we see. But to you as the user, you’ll still see it as one column. That’s really useful, because let’s say you had a column with, for example, city, you’d have a column for every city. That’s very inconvenient, because now your data set is spanned by a whole bunch of columns that’s basically representing one feature. So this is a really nice data type to work with, because the entirety of the feature is represented in one column.
So for example, if you look at origin, it is now called a categorical feature. And when it comes time for machine learning, it’s going to do that transformation for us, but to us, while we’re working with it, as humans, we only see one column, which is really, really nice for understanding.
So now that everything is properly cast into place, the data set is actually ready for a machine learning model. Before we move on, let us zoom out and see where we are in the data mining framework, to understand why we’re doing some of the things that we’re doing.
So first thing is, in the past couple of videos, we’ve explored and we’ve understood our data, to try to develop a better understanding of data, so we can process and clean our data better and better. And we’re at a situation where our data is model ready. So it’s ready to be fed into a machine learning model.
So this is where we are right now. And this is where we’re going to be. This is where we’re going to go.
So the next thing we’re going to do is, we’re going to select an algorithm by which we’re going to use. And the next thing is we’re going to go ahead and build a model.
And the most important thing that we’re going to do, actually, is we’re going to evaluate whether or not the model that we built is any good or not. But that’s a little bit trickier than you would think, because that is, if you built a model, how do you tell if the model is good or not? Well, ideally, what you would do is, if the model can predict future values correctly, well then it’s a good model.
But the problem is, that’s its job, right? Its job is to predict the future. So if you’re going to evaluate on the future data, at that point the model has failed its job. Because it’s past its useful shelf life. So if the model is predicting after the future happens, I think that’s a bit useless. So what we have to do in the lab is, we have to synthetically create future world data. And we’ll teach you some methodologies by which to do that.
So the first methodology, and this is one of many, by the way, that I’m going to teach you is the train/test split. So the idea is, we start with 100% of our data. So this is where we have 499,000 rows or something like that.
The next thing we need to do is build two partitions, a training set and a test set. In this case, we’re going to use a ratio where 70% of the data will randomly go into the training set, and 30% of the data will randomly go into the test set. And for those of you who know sampling, this is sampling without replacement. So we’re going to go ahead and put each row into one of two bags here.
So the idea with the test set is, we’re going to take this data set and hide it away. We’re going to pretend that it’s future world data. And this is really important, because it has the labels of the actuals, the ground truth. The actual labels.
So the idea is, we’re going to take our model and build it using the 70% training set. And at the end of the day, it’s not going to see that 30%. So to the model, that test set is new world data. The model has never been exposed to this data set. And the assumption is, if the model that was built is generalizable and found the ground truth in the underlying data, then if it can predict well on the test set, which it has never seen before, it should do about as well on future data it has never seen before.
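A 70/30 random split without replacement can be sketched with the Python standard library alone. The `train_test_split` helper, the seed, and the integer rows below are assumptions for illustration, not the Split Data module itself.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows, then cut once: each row lands in exactly one bag."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(1000))  # hypothetical stand-in for the flight rows
train, test = train_test_split(rows)
print(len(train), len(test))
```

Because the rows are shuffled once and then cut, no row can appear in both partitions, which is what sampling without replacement means here.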
So that’s what we’re going to use. So basically 70% of this data set is going to be part of the training set. I like to think of the 70% training set as being sacrificed to produce this model. And it’s going to learn from the past what resulted in the current labels being the way they are. And then the idea is, once the model has been built, we would run it through and have it predict on the test set. And because it predicts on the test set, now we have another column called predictions. And in our case, it’s going to predict whether or not the flight will be late. It just so happens that in the test set, which is from the past, we know if the flight was late or not.
So we have, basically, we can build a comparison between predicted versus actual. We can go in one at a time, line item, and say, are you right? Is this row right? Was this flight correctly predicted upon? Yes or no? We can go ahead and do that. And if we aggregate all of the rights, and we aggregate all the wrongs, eventually we can get some pretty good measures of performance out of this model. So this is a high level road map of where we’re going to go.
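The row-by-row comparison of predicted versus actual boils down to something like this sketch. The labels are invented, and accuracy is just one of the aggregate measures you could compute.

```python
def accuracy(actual, predicted):
    """Share of rows where the prediction matches the known label."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

# Hypothetical test-set labels: 1 = late, 0 = not late.
actual    = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
predicted = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0]

test_accuracy = accuracy(actual, predicted)
print(test_accuracy)  # 8 of the 10 rows match
```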
So what we’re going to do today is, we’re not going to build any models. We’re going to actually set up the training set and the test set in Azure ML. So if you will go back into Azure ML with me, we’ll pick up where we left off. So I’m going to go ahead and add some documentation to this Edit Metadata here before we move on. So I’m going to say, this is casting to categorical data.
And then the next thing is, I’m going to build this 70/30 split partition. So in this case, if you type in the word split, there is a split data module. So go ahead and drag this into the Azure ML workspace, and connect the output of the Edit Metadata. So the clean data that’s model ready, it’s going to flow into the split data. We’re going to split it by rows. And we’re going to say, so notice this percentage here? It says, fraction of rows in the first output data set, the first output data set being this guy. So the remaining part of the data will go out here. So if you put, for example, 0.7 here, 70% of the data will go out here. 30% of the data will go out here. And yes, you want the split to be randomized.
So randomization is very important in machine learning. It helps keep the split from being biased by the order of the rows. And then there is the idea of stratified splits. So what a stratified split does is keep the ratios the same in both the test set and the training set. So if you look at arrival delay, in this case there’s 86% not late and 14% late. If you want to keep the ratios the same, basically 86/14 on both sides, you would stratify it.
Now for the most part, you only want to stratify your response labels. You do not usually care about stratifying the rest of your predictors, unless there is something that you really care about that is a rare class. So for example, say 99% of one of your predictor features is one really common value, and the other value is not common, let’s say less than 1%.
Then through sheer randomization, you can actually end up with a split that doesn’t have one of the categories at all. So if you want to prevent that, you would stratify that feature. But for the most part, we only care about stratifying what’s called the response class here. So I want to keep this ratio the same, 86/14. So I’m going to go ahead and in the split module, I want to say Stratify True.
Now if you have your categories in your response class being basically really close to each other, let’s say 50-50, or 60-40, or something like that, I would go ahead and just not stratify. But in this case, it’s getting to the point where, through just sheer randomization alone, I can severely under sample the thing that I actually care about, which is whether or not the flight is late or not. Remember, the one label, being late, is only 14% of the data.
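One common way to get a stratified split is to split each label group on its own and then combine the pieces. The sketch below does that in plain Python with made-up rows that follow the 86/14 ratio. It is an assumption about the general technique, not a description of how the Split Data module works internally.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_of, train_fraction=0.7, seed=42):
    """Split each label group separately so both parts keep the label ratio."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(groups):
        members = groups[label]
        rng.shuffle(members)
        cut = round(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    # Mix the groups back together so rows are not ordered by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Hypothetical data: 860 on-time rows (label 0) and 140 late rows (label 1).
rows = [(i, 0) for i in range(860)] + [(i, 1) for i in range(860, 1000)]
train, test = stratified_split(rows, label_of=lambda row: row[1])

late_share_train = sum(label for _, label in train) / len(train)
late_share_test = sum(label for _, label in test) / len(test)
print(len(train), len(test), late_share_train, late_share_test)
```

With a plain random split the late share in each partition would drift around 14%; here it is pinned by construction.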
So I’m going to launch this column selector and say, I want you to split, but I want you to also stratify on arrival delay. And I’m going to go ahead and hit this Run button right here. So what this is going to do is split 70% of my data over here, and 30% of my data over here. So 70/30 tends to be the industry standard, but there is no one right percentage anyway.
So the idea is data beats algorithm. So your training set should always have the most data, because the model will learn better if it has more data. So there’s that.
But then there’s also the other side of it, which is the test set. Why can’t you just give everything to the training set? Well, then you’d have nothing left to evaluate with. So we have to keep something. And the thing is, later, we’re going to compute what are called aggregate measures of evaluation. Things like accuracy, precision, and recall. We have to have enough representation, enough observations, to basically trust those numbers.
So for example, if you had 500,000 rows in your training set, but only 10 rows in your test set, now are you going to trust the accuracy measure of 10 values? Probably not, because each value that’s right or wrong is an extra plus or minus 10% from that measure. So that measure is going to be very unstable. So I tend to want it to be enough so that I trust those numbers coming out. OK, and just to double check, let’s see if this did what we wanted it to do.
So if you click on the Edit Metadata from before, so this is the data before the split. So notice that we start off with 499,000 rows.
So the idea is, after the split, we should have 70% of the data in the first output node, so we have about 349,000 rows. So let’s go ahead and take out our calculator and divide. So 349,776 divided by the original row count, and I think I can just paste the original value here, and that’s 70%. So that’s correct. So that’s the first thing we need to validate. The next thing we should validate is whether or not it did the stratification correctly. So if I click on arrival delay, I should have the same ratio. 14 and 86. So it kept the response class labels in the same ratio as before.
So let’s also go ahead and look at our second result data set, which is 30% of our data. So this should be the remaining rows of data. The next thing to check is, did it stratify this correctly as well? Yes. 14 and 86 as well. So that shows that yes, it did what we wanted it to do.
And we’ve just about run out of time, and that concludes how to cast your data in Azure ML and how to set up a train and test split inside of Azure ML.
Now if you like what we just saw, remember to hit that like button. It will help support us in creating future content for free. And remember to subscribe for future content, and to share this video to spread the glorious word of data science. And before we build this model, and before I go, I have a question for you. What kind of surprising things do you think we’ll find out about the aviation industry, or flights in general, once we build this model? Go ahead and leave your hypothesis in the comments. My name is Phuc Duong, and I’ll see you next time. Happy modeling.
Phuc H Duong - Phuc holds a Bachelor’s degree in Business with a focus on Information Systems and Accounting from the University of Washington.
© Copyright – Data Science Dojo