Summary Statistics & Cleaning Missing Data
Summary statistics help us understand the overall behavior of our features. In Azure ML, we can easily access mean, median, mode, and other useful statistics to get a better idea of what is happening in the data set. Summary statistics also give us a lot of valuable information that can be used in various analyses.
What You'll Learn
> Importance of summary statistics in data science
> Using summary statistics in data analysis
> Custom substitution for missing values in Azure ML
Hello internet, and welcome back to Data Mining with Azure Machine Learning Studio, brought to you by Data Science Dojo. Today we will explore our dataset further by looking at some summary statistics inside of Azure ML, and then clean missing values from our dataset using the Clean Missing Data module.
So let’s go to Azure ML. Last time, we used the Select Columns in Dataset module to pick the columns that we wanted to use for machine learning: which columns do we want to learn from, and which columns do we want to leave behind? It’s not that the leftover columns aren’t useful; they’re just not useful in their current form. We’ll talk about how to transform some of those columns later.
So if we look at these columns, you should have 13 columns, just like me. If you click on any column, you’ll see a panel of summary statistics on the side, and one of the values shown is the missing value count for that column. If you click on, for example, departure time block, it is completely filled in; every value is there. Same with this one right here. But if you have dozens, hundreds, or even thousands of columns, it would be a huge pain to click on every column one at a time just to view the missing value counts.
Now luckily, inside of Azure ML there’s a module that crunches all of that for you. Go to the left side and search for something called Summarize Data. Drag the Summarize Data module in here, connect it, and go ahead and run it.
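If you were doing the same dipstick test in code, a tiny pandas sketch gets you the equivalent of Summarize Data. The column names and values below are hypothetical stand-ins for the flight dataset, not the real one:

```python
import pandas as pd

# Hypothetical miniature stand-in for the flight-delay dataset.
df = pd.DataFrame({
    "DayOfWeek": [1, 2, 3, 4, 5],
    "DepDelay":  [-5.0, 0.0, None, 34.0, 3.0],
    "ArrDelay":  [-2.0, None, None, 40.0, 1.0],
})

# One table summarizing every column at once -- counts, min/max,
# mean, quartiles -- much like the Summarize Data module's output.
summary = df.describe().T
summary["missing"] = df.isna().sum()
print(summary[["count", "missing", "min", "max", "mean"]])
```

One pass over the whole frame replaces clicking through every column by hand.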
Notice that I’m connecting the Summarize Data module off to the side. I am making a conscious note that this is not part of the workflow; this is a separate dipstick test, a separate quality check, if you will. So if I right-click and visualize this, I can see a summary of all of my columns: the column names, and then the observations.
These are the counts. Then I have a count of unique values; notice day of the week has seven, as it should. Next is the missing values count, which is probably what we care about most for this video: how many missing values are there? In departure delay and departure delay 15, there are 3,697 missing values each. And that should be expected, because departure delay 15 is derived from departure delay: if the flight left 15 or more minutes late, departure delay 15 becomes 1 instead of 0. So wherever departure delay is missing, departure delay 15 is missing too.
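That derivation is easy to sketch. Assuming hypothetical delay values (the real dataset’s numbers differ), the derived flag inherits the parent column’s missingness:

```python
import pandas as pd

# Hypothetical departure delays in minutes (negative = left early).
dep_delay = pd.Series([-5.0, 3.0, 15.0, 47.0, None])

# DepDel15 is derived from DepDelay: 1 if the flight left 15 or more
# minutes late, 0 otherwise -- so wherever DepDelay is missing,
# DepDel15 is missing too.
dep_del15 = (dep_delay >= 15).astype(float).where(dep_delay.notna())
```

That is why the two columns show the exact same missing value count.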
And then we also look at what type each column is. Those two columns with missing values are features, so they’re predictors. But arrival delay has 417 missing values as well, and that is a bigger problem, because arrival delay 15 is our response class; this is what we’re going to try to predict.
So remember, we’re going to use the past to predict the future, and to learn from the past we need to know the outcome each past observation produced. In this case, we don’t know the outcome, so we can’t learn from those rows. They will have to be dropped in a little bit, but while we’re in here, let’s look at some more summary statistics.
So we get the min and the max: the range of the dataset, its extremes. What is the hottest day? What is the coldest day? Departure delay is a numeric value, and its minimum shows that one flight left 56 minutes early, 56 minutes ahead of its advertised departure. And if you look at the max, there was a flight in here that was 1,401 minutes late to even depart.
So remember, departure delay measures the initial take-off: was it early or late? This take-off was 1,401 minutes delayed, which is bad. Divide that by 60 and they were basically waiting for the flight for almost 24 hours, about 23-point-something hours. Now look at day of the week: the min and max of day of the week don’t really make sense. What is 1? Is it Sunday? Is it Monday?
So the thing is, day of the week is a cyclical pattern. One to seven means the beginning of the week to the end of the week, but this number doesn’t capture the fact that it wraps around and starts over again and again. So as a number it’s kind of meaningless. That tells us that we should probably treat this not as a number but as a category. We’re going to do that in the next video.
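As a preview of that idea, here’s a minimal pandas sketch (with made-up codes) of casting a numeric day-of-week column to a categorical type so a model stops treating the codes as magnitudes:

```python
import pandas as pd

# Hypothetical day-of-week codes: numerically 7 looks "far" from 1,
# but the days are adjacent in the weekly cycle.
days = pd.Series([1, 2, 7, 7, 3])

# Casting to a categorical type stops a model from reading the codes
# as numbers that can be meaningfully averaged or ordered.
days_cat = days.astype("category")
print(days_cat.dtype)  # category
```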
OK, next thing: if we look at arrival delay 15 and departure delay 15, the values only go from 0 to 1, and there are only two unique values. That tells you it’s a binary feature; it can only be one thing or the other. So let’s take a look at some other things here: some measures of central tendency.
First is the mean, the average. The average day of the week doesn’t really make sense: is it Wednesday, Thursday? What would that even mean? It doesn’t mean anything. But departure delay is a number, so its mean does make sense: on average, flights left their original airport about 4.8 minutes late. Keep in mind, though, that the mean gets pulled around by extreme outliers.
Looking at that maximum of 1,401 minutes, I would say it’s an extreme outlier, because it sits far more than two, even three, standard deviations away from the mean. The standard deviation measures how far values typically deviate from the mean, and this value is well past that, so the mean in this case is pulled up by that extreme number. Next, look at the following row, departure delay 15, whose mean is 0.128. The odd part is that this column is only ever 0 or 1. So what this mean is measuring is, out of all 500,000 flights, the propensity of this column to be 1.
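A quick sketch of that standard-deviation test, using a small made-up sample with one extreme value standing in for the 1,401-minute flight:

```python
import pandas as pd

# Hypothetical sample of delays with one extreme value,
# like the 1,401-minute flight.
delays = pd.Series([d % 15 for d in range(20)] + [1401], dtype=float)

# Flag anything more than 3 standard deviations from the mean.
z_scores = (delays - delays.mean()) / delays.std()
outliers = delays[z_scores.abs() > 3]
print(outliers.tolist())  # [1401.0]
```

Only the extreme value gets flagged; every ordinary delay sits well inside three standard deviations.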
So that number is very close to zero and very far from one. Based on this data and this sample, you can say that roughly 87% of the time flights left on time, and only about 13% of flights departed late. I would say that’s still pretty bad; if I were late to work 13% of the time, I would probably get fired.
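That “mean of a binary column equals the proportion of 1s” idea is worth seeing once. The flags below are hypothetical, not the real flight data:

```python
import pandas as pd

# Hypothetical 0/1 flags like DepDel15: 1 means the flight left 15+ minutes late.
dep_del15 = pd.Series([0, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# The mean of a binary column is simply the fraction of 1s --
# here, the share of flights that departed late.
late_rate = dep_del15.mean()
print(late_rate)  # 0.2
```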
The next one is arrival delay 15, which measures whether you actually arrived late. Its mean is 0.136, and what’s really interesting is how close that is to 0.128. What does that mean? To me it means that if you’re already late, chances are you’re going to stay late; there’s very little chance you’ll make up the time. If you take off late, you can’t just fly faster. You’re going to stay late, and the delay has a rippling effect.
Because those two numbers are so close, there don’t seem to be many factors once you’re in the air that make you even more late. Leaving early or on time almost assures that you will arrive on time, so it looks like very few delays are introduced once the plane actually gets in the air. That’s what this comparison is telling me.
Next are the first and third quartiles. Imagine taking a numeric column like departure delay and sorting it: the first quartile is the value at the 25th percentile of the sorted data, and the third quartile is the value at the 75th percentile.
Between those two values sits 50% of the data. So you can say that 50% of all flights departed between five minutes early and three minutes late, a spread of only eight minutes. To be that tight with your first and third quartiles is very good. I think that’s amazing, because these planes are flying across different time zones, to different states and airports, with different carriers, different management, and hundreds of different people along the way, and somehow half of all flights fall inside an eight-minute window. I think that is awesome.
And then there’s the median, the middle value: 50% of the data is above it and 50% below. So notice that 50% of the time the plane departs two minutes early or better, and 50% of the time it departs later than that.
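Here’s how those three cut points come out of a sorted column, on a small made-up sample of delays (pandas interpolates between neighboring values by default, much like most stats tools):

```python
import pandas as pd

# Hypothetical departure delays in minutes, in no particular order.
delays = pd.Series([45, -5, 1, -2, 20, -8, 3, 0, 6, -3], dtype=float)

q1 = delays.quantile(0.25)      # first quartile: 25th percentile
median = delays.quantile(0.50)  # the middle value
q3 = delays.quantile(0.75)      # third quartile: 75th percentile
# Half of all values fall between q1 and q3.
print(q1, median, q3)  # -2.75 0.5 5.25
```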
OK, so let’s go back to the missing values: that’s something we have to take care of. Exit out of the visualization, and let’s start thinking about a cleaning strategy. Go to the top left-hand corner, type in “clean missing data,” and drag that module in.
So this is the module we’re going to use to clean missing values; go ahead and connect your data to it. Notice that this one is an actual transformation, which is why I’m making it part of the workflow, while Summarize Data sits off to the side. It’s just a style thing: I like to keep my experiments nice and neat, because I work with people and I want them to understand what I’m doing too. Next, select the columns you want to clean. In this case, I have a response class here, so I clicked “by name” and selected arrival delay as the column I’m going to clean.
So remember, arrival delay had those missing values, and it’s tied to the response class. Here’s a very quick rule of thumb: if the response class is missing, we have to drop the row. There’s nothing else we can do.
The idea is that we can’t learn from the past if we don’t know what the outcome of the past was. Going back even further: machines can’t learn from missing values. Missing values have to be cleaned up; we can’t ignore them, because you can’t do math on missing values, and machine learning is all about mathematical operations. Missing values are therefore roadblocks for machine learning, and they have to be taken care of before the model can learn.
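The drop-the-row rule for a missing response is a one-liner in pandas. The frame and column names below are hypothetical, standing in for the flight data:

```python
import pandas as pd

# Hypothetical frame where ArrDel15 is the response class we predict.
df = pd.DataFrame({
    "DepDelay": [3.0, 40.0, None, 7.0],
    "ArrDel15": [0.0, 1.0, None, None],
})

# A row with a missing response can't be learned from: drop it.
cleaned = df.dropna(subset=["ArrDel15"])
print(len(cleaned))  # 2
```

Only rows where the response is present survive; features with holes would be handled separately.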
Now, the next question is: these packages are smart, so can’t they just clean missing values for me? Yes, they can, and they will. Most packages in R and Python, and even Azure ML, won’t complain if you feed them missing data. They’ll just take care of it; you don’t even need to worry about it, and they often won’t even tell you that they cleaned the missing values.
However, as the data science practitioner, you always want to retain control over cleaning missing values, because it becomes a tuning parameter for you. And remember, these are experiments; this is the scientific method, where you try to isolate individual causes of performance. If you built two different models and one performed better, was it because of the algorithm, or because of the way the data was cleaned? You can’t isolate that effect if the package is doing the cleaning for you.
And the next part is that how well you clean missing data will affect the predictive model, because the model is going to learn from whatever values you filled in. So cleaning missing values comes back to the idea of data quality: the completeness of the data improves the quality of your model. Think of data as the raw material for your machine learning model; if you feed the model higher-quality raw materials, you will forge a higher-quality model.
The problem is that missing data is a natural occurrence in data science. Many things happen in the real world that prevent data collection from being a perfect system. Data could have failed to be recorded, or failed to be given, or it was never recorded because it doesn’t make sense and isn’t applicable to the situation. In short: failure to give, or failure to record.
Some examples: look at survey data, which is probably the messiest form of data there is. Say age is missing. Why is age missing? Maybe the surveyor never got time to ask that question, maybe he simply didn’t ask it, or maybe a law against discrimination prevented asking about age.
What about the user’s side? Maybe the user didn’t want to tell you their age, or they don’t know their age, or they had to leave early.
So there could be a multitude of reasons why that value is missing. This happens all the time online: when you make an account on basically any website now, they just ask you for a username or email and a password. But the account still has a bunch of other data columns ready for you to fill out, like your first name, last name, and address. They don’t ask for those right away, because they want you to have an account and start using their platform immediately. So the rest of those fields stay blank. This happens all the time on social media too.
For example, if you send out a tweet without a hashtag, the hashtag column is now blank. Maybe you queried data and you don’t have access to a certain row, so to you it looks blank; it depends on whether you have access to certain parts of the data or not. With manufacturing equipment or IoT data, the monitoring equipment could have been down: each column in that dataset could represent a different measurement device, a different sensor, and when a sensor goes down it just submits nulls the whole time.
Or let’s say you were recording something one day, maybe the heights of students, and one student was absent that day; that’s a failure to record. And then maybe you do one thing, but not another.
Say you go to Amazon and rate something but don’t review it, or you go to Netflix and rate a movie without ever having watched it, so watch time is blank. Another case is that the situation is simply not applicable: some condition isn’t met, but the column still exists in the dataset. For example, the spouse column will probably be blank unless you have a spouse, unless you’re married. In that case, if you’re cleaning missing values there, you want to fill in whatever makes sense: instead of leaving it blank, you might just say “single.”
The next example: maybe the refund date or refund value column is only populated if someone returns a purchase. In that case, if it’s blank you might want to fill it in with zero, assuming they never returned anything for that transaction. And maybe graduation date is blank because the student dropped out and never got to graduation. So there are many reasons in the real world why something could be missing. Let’s go back to Azure ML and finish up with the Clean Missing Data module.
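Those domain-driven substitutions (blank spouse means single, blank refund means nothing returned) look like this in a small pandas sketch with made-up records:

```python
import pandas as pd

# Hypothetical records where a blank has a domain meaning.
df = pd.DataFrame({
    "spouse_name":  ["Alex", None, None, "Sam"],
    "refund_value": [None, 12.5, None, None],
})

# Custom substitution: a blank spouse means unmarried,
# and a blank refund means nothing was returned.
df["spouse_name"] = df["spouse_name"].fillna("single")
df["refund_value"] = df["refund_value"].fillna(0.0)
```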
So you notice there are two outputs here: one is the cleaning transformation, and one is the cleaned dataset. For the most part, you don’t need to care about the right-hand output. That one matters later, when you want to deploy this pipeline: if you clean missing values with a derived number like the median, and the median of a column was 40, who’s to say the median of future data will also be 40? So if you want to build this as a pipeline and preserve the derived values, you save it as a transformation. But for now you don’t need to care about that side.
The left output is where the data actually gets cleaned. So if you visualize it, we’ve gone ahead and cleaned the missing values of arrival delay. If I mouse over the column, there should be zero missing values; everything got dropped.
So notice I had about 500,000 rows earlier, and I dropped almost 5,000 of them, so I’ve lost a little bit of data, roughly 1%. That’s not bad, and I couldn’t have learned from those rows anyway, because the response class was empty. If you look at these two columns, they no longer have any missing values either. And just to be sure, I can take the Summarize Data module, Ctrl+C, Ctrl+V to copy and paste it, and connect it over here, just to double-check: do I have any missing values left in my data? I know I don’t, but it’s good to verify.

So that’s what to do if your response class is empty. What if it was a feature that was missing instead? Remember, these are very general rules of thumb to start off with. Features are predictors, and the answer depends on the data type: is the feature categorical, or is it numeric? If it’s numeric, you have access to the measures of central tendency.
So you have the mean and the median, and in both cases you have the mode; for categories, the mode makes a lot more sense. But these are what are called global cleaning functions, and they’re actually pretty bad. In production, you generally don’t want to clean with the mean, median, or mode, because it puts a blanket operation on the whole column. For the most part, what you actually want to do is impute missing values: build a small predictive model that predicts the missing values, so that the cleaned data can then feed the actual machine learning model.
But again, you have to crawl before you can walk, and walk before you can run, so let’s put imputation on the back burner for now and just clean with the mode. If you have a dataset and you want to get going with machine learning, you can start off with the mean, median, or mode just so the model doesn’t scream at you about missing values. Also remember you can drop the row, or drop the column. For the most part, though, those options are sub-par, because remember: data is the new oil.
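That quick-start global cleaning, just to get a model running, can be sketched in two lines of pandas. The columns here are hypothetical, not the flight dataset’s:

```python
import pandas as pd

# Hypothetical features with a few holes.
df = pd.DataFrame({
    "delay":   [3.0, None, 40.0, 5.0],     # numeric feature
    "carrier": ["AA", "DL", None, "AA"],   # categorical feature
})

# Quick-start global cleaning: median for numeric columns,
# mode (most frequent value) for categorical ones.
df["delay"] = df["delay"].fillna(df["delay"].median())
df["carrier"] = df["carrier"].fillna(df["carrier"].mode()[0])
```

Fine for a first baseline; for production you would move to imputation.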
The idea is that data feeds algorithms: the more data you feed the algorithm, the better. If you drop whole rows, you lose a lot of data the model could have learned from. Maybe one cell in a row is missing, but the rest of the row is fine; dropping it punishes the other columns in that row for the bad behavior of a single cell. So for the most part it’s better just to clean the missing values. And you don’t want to drop a column unless you’ve decided it’s useless, say it’s 99% missing and there’s no signal in the remaining 1%.
Those are the rare situations where you’d want to drop a column; almost never would you drop it otherwise, missing values or not, unless it’s irrelevant. All right, that covers the basics of cleaning missing data in Azure ML, but let me show you a few more things really quickly.
Remember that the cleaning-mode dropdown had some other options. I talked about the mean, median, and mode; the mean and median will break unless the column is actually numeric. Then there’s also this MICE option. MICE (Multivariate Imputation by Chained Equations) is all about building a series of small regression models to predict the missing values. If you select it and click More Help, it links you to a paper from Microsoft describing how MICE works. You can also do a custom substitution.
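The core MICE idea, regressing an incomplete column on complete ones and predicting the holes, can be shown in miniature with numpy. This is a deliberately tiny sketch with made-up numbers, not the actual MICE algorithm, which iterates this step across many columns:

```python
import numpy as np

# Hypothetical columns: x is complete, y has one missing entry.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, np.nan, 8.1, 9.9])  # roughly y = 2x

# MICE-style idea in miniature: fit a regression of the incomplete
# column on a complete one using only the observed rows,
# then predict the missing cell.
obs = ~np.isnan(y)
A = np.column_stack([x[obs], np.ones(obs.sum())])
slope, intercept = np.linalg.lstsq(A, y[obs], rcond=None)[0]
y_imputed = np.where(np.isnan(y), slope * x + intercept, y)
```

The filled-in value comes from the relationship in the observed data rather than a single blanket statistic.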
Custom substitution is very cool when you want to replace missing values with whatever you think they should be, for example 0. It’s also good for categories: if a category is missing, you can use custom substitution to create a separate “missing” category. Rather than filling in the mode, this lets you isolate the effect that the missing values themselves have on the model and on the response class, without tainting the mode’s category with all the filled-in missing values.
Finally, you can remove the row, remove the column, or use a probabilistic PCA calculation to fill in missing values. And that concludes how to clean missing data in Azure ML. Next time, I’ll show you how to cast columns into categories, so the computer knows how to treat them correctly, and how to split your data into multiple partitions as we begin to build machine learning models on this dataset.
Now that this dataset is clean, it’s ready to go: ready to be fed into a machine learning model.
And if you liked this video and want to see more like it in the future, remember to hit that Like button and share it with your friends, and spread the good word of data science. It helps a lot. Before I go, I have a question for you: what do you use as your machine learning platform? Let me know in the comments. My name is Phuc Duong, and I’ll see you next time.
Phuc H Duong - Phuc holds a Bachelor's degree in Business with a focus on Information Systems and Accounting from the University of Washington.
© Copyright – Data Science Dojo