Data Exploration

r_subheading-Course Description-r_end We'll discuss an end-to-end data science project in Azure Machine Learning Studio. We’ll choose the flight delay data, and use it to predict whether or not a flight will be late on arrival based upon the flight’s circumstances. r_break We will also perform preliminary exploration of the dataset using Azure Machine Learning’s dataset module. r_break r_break r_subheading-What You'll Learn-r_end • Introduction to projects. r_break • Exploring a data set using Azure ML. r_break • Building a data mining strategy. r_break r_break You can get a free trial of Azure r_link-here- -r_end, and this is the r_link-link- to the Azure Portal.


Hello, and welcome back to Data Mining with Azure Machine Learning Studio, brought to you by Data Science Dojo. r_break r_break All right. So today we’re going to give you an introduction to projects inside of Azure ML. So basically, how do you create a project to bind a bunch of assets together? And then we’re going to explore a data set using Azure ML. And then we’re going to build ourselves a data mining strategy, an action plan for how we’re going to approach this data set. r_break r_break OK, so the first thing you want to do is navigate to your Azure ML workspace. And then we’re going to begin the process of creating an end-to-end project where you will learn Azure ML basically through trial by fire. So we will take a data set, build a predictive model from it, and then deploy that model. r_break r_break OK, so to begin, go to Projects. So we’re going to start a brand new project, and this project will contain all of the experiments that we’re going to use. This is just to be organized. This step is completely optional. It’s just that later on your workspace might have a bunch of different experiments in it, and you might get confused about which experiment belongs to which project. I’m going to create a new project, and I’m going to call this project Predicting Flight Delays. The name has a lot to do with what we’re about to do, so just name it right now. So Predicting Flight Delays. r_break r_break And then the description, I’ll just give it the same thing. OK, so this is going to basically create a project folder for me. So if I go into Projects, you’re going to notice that there is a project called Predicting Flight Delays. And it’s going to ask me to add assets. I don’t have assets yet, but we’re about to go and make some assets. r_break r_break The first thing you’re going to do is go to New, and create a brand new experiment. So go to New, and then Blank Experiment.
And the data set that we’re going to be working with is under Samples. And there is a data set down here called Flight On-Time Performance Raw. OK, so go ahead and drag that in. r_break r_break So this data set, if you go ahead and visualize it, is data from 2011. And the question is basically, can we use this data? Can we use the past to predict the future? The past being that each row in this data set refers to a flight, and each column is an attribute of that flight. And towards the end of it is the column we’re trying to predict: is the arrival delayed by more than 15 minutes, yes or no? So if there’s a 1 here, it means that it was delayed more than 15 minutes. And if it wasn’t, then it’s 0 here. And this column is actually derived, and it’s called ArrDel15. So arrival delay of 15. r_break r_break Let me zoom in here, so you guys can read it a little bit better. r_break r_break And this column is, I think, based off this column right here, which is by how many minutes the flight was on time, delayed, or early. All right, so if it was negative 6, it means the flight was six minutes early. This flight was 12 minutes early, and so forth. So if a number in here is greater than 15, that means it was more than 15 minutes late. So this would trigger a 0 or a 1. So this makes this problem really cool, because we can treat it as either a regression problem or a classification problem. r_break r_break So a classification problem being predicting whether it was late or not. Or, as regression, we can predict by how many minutes it will be late or early. So we’re going to choose classification. It’s a much simpler problem to tackle. But go ahead and do regression if you know how. r_break r_break And then there are also these other two columns: whether the flight was canceled or diverted.
So we’re going to ignore these columns, but in production, you would actually build predictive models to predict whether a flight is going to be canceled or diverted at the same time. There are many ways you can approach it. r_break r_break If you build a regression model, and it predicts more than 15 minutes late, or past a certain threshold, then business logic would kick in and say, if it’s more than 60 minutes late, or whatever, flag it as canceled or diverted. r_break r_break So we’re going to ignore these two columns. The column we’re going to focus on is ArrDel15, meaning is the flight delayed by more than 15 minutes, yes or no? We don’t care if the flight’s early. We don’t care if the flight’s five minutes late. We only care if the flight is more than 15 minutes late. So let’s go ahead and explore this data set. So if you look at the data set, it’s got 504,000 rows, and there are 18 columns. So this is a pretty sizable, medium-ish data set to work with. If we look at the year, everything is one value, so this column doesn’t seem useful right now. r_break r_break So as I’m doing this, I’ll write it down. And I really recommend that you do this for every data set you will ever work on. Build an attack strategy during your data exploration phase, which is, what are you going to do with each column? What are some notes that you will take away? r_break r_break So there’s a column here called year. And basically, I’m going to drop this column, because everything in it is 2011. So it looks like whoever gave me this data did a query inside of that database and only took out the 2011 flight data. And I can tell that because the number of unique values is 1. And because the number of unique values is 1, well, that tells you everything is locked into one value. If you hover over this histogram over here of 2011, it says the count is 100%. So I suspect the same thing is true here with quarter, and we can see that quarter is also 100% in the fourth quarter.
So in quarter, I will also go ahead and drop this column. r_break r_break As far as month is concerned, let’s take a look at month. Month looks like it is the same story. So whoever queried this data pulled it all from the same month, in the same year, and in the same quarter. r_break r_break So we’re going to build a predictive model that’s only going to be good for October. And flights are very seasonal. So I would imagine that you would build maybe different models to accommodate different seasons, too. r_break r_break And then there’s day of the month. Day of the month being what day is it? Is it October 1 through 31? So the problem with day of the month is that if I’m going to use this feature to build a predictive model, it’s not going to be very useful, because if I’m going to predict the future, what happens on October 6 in the future might not mean anything, because October 6 might be on a different day. It might be a Tuesday instead of a Thursday, or it might land on a different holiday or something like that. So this feature is too granular to this particular entry. It is good historical information, because I can use this feature to basically determine whether it is a holiday or not, or what day of the week it is. But it looks like someone already did that for us over here, which is day of the week, where there are seven unique values. So that tells me this is Sunday through Saturday. r_break r_break So day of the month, for now, let’s drop it. I’m not saying it’s not useful. I’m saying, in its current form, we’re not going to be able to do much with it. It’s too unique to each row. So we don’t want the model learning, OK, if it’s October 6, then every time in the future, it’s going to do this. No, that’s not how it works, because the way the calendar works, the date is going to keep shifting days of the week and things like that. So day of the month I’m going to go ahead and drop.
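Outside of Studio's visualize pane, the two checks described above (spotting columns with a single unique value, and seeing how a calendar date shifts weekday from year to year) can be sketched in plain Python. This is only a minimal illustration: the column names and the toy rows are assumptions made up for the example, not values read from the real data set.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def constant_columns(rows):
    """Return the names of columns that hold one unique value across all rows."""
    uniques = {}
    for row in rows:
        for name, value in row.items():
            uniques.setdefault(name, set()).add(value)
    return sorted(name for name, values in uniques.items() if len(values) == 1)

def weekday_name(year, month, day):
    """Look up the calendar weekday for a date (date.weekday(): Monday is 0)."""
    return WEEKDAYS[date(year, month, day).weekday()]

# Toy rows invented for illustration; column names are assumed.
rows = [
    {"Year": 2011, "Quarter": 4, "Month": 10, "DayofMonth": 6},
    {"Year": 2011, "Quarter": 4, "Month": 10, "DayofMonth": 7},
    {"Year": 2011, "Quarter": 4, "Month": 10, "DayofMonth": 9},
]

print(constant_columns(rows))        # single-valued columns, candidates to drop
print(weekday_name(2011, 10, 6))     # weekday of October 6 in the data's year
print(weekday_name(2012, 10, 6))     # same date a year later lands elsewhere
```

A column that shows up in `constant_columns` carries no information a model can learn from, which is the reasoning behind dropping year and quarter. The two `weekday_name` calls show why a raw day-of-month value does not generalize across years.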
r_break r_break But also note, it might be useful to find out holidays; we can derive holidays from it. All right. And then there’s day of the week. So day of the week is all about Sunday through Saturday. But we don’t know what lines up with what. So what does 4 mean? Is 4 a Wednesday, or is 4 a Thursday? It depends where day of the week lines up. So we’d have to go do some domain research and basically look up October 6. What day of the week was October 6 back in 2011? And then we can figure out what this day is. So we’ll do that later. r_break r_break So day of week. And we want to do that, because if we want a feature later that says, is weekend, is not weekend, that would become very useful for us. All right, so day of the week is going to be useful. And right now it is cast as a numeric column. I would say it is not numeric, it is actually categorical. r_break r_break Remember, categorical is distinct bins or buckets of things that a value could have been. Numeric assumes that there is some kind of progression between 1 and 7. And yes, you’re progressing through time, but the jump between 7 and 1 doesn’t make sense there, because it’s a cyclical loop. And this data set currently doesn’t encompass that. So we have to cast this into a category, because it’s numeric right now. We don’t want the model to treat it as a number. r_break r_break All right. So the carrier: it looks like these are carrier codes. So a quick search: I just double-click on WN, for example, and I type in carrier here. WN stands for Southwest. So the Southwest code is WN. So we can look up these codes later, but we have to ask ourselves a question. When we consider features to be used, will this help predict whether or not a flight will be late or on time? So yes, I would say that the carrier will probably be a very important feature in determining whether or not a flight is going to be on time.
So I’m going to go ahead and copy carrier down. r_break r_break And carrier has to be a category. Right now it is a string feature. And we don’t want it to be a string. We want it to be treated as a category, as a discrete value to be used in the predictive model. r_break r_break All right. The next thing is the origin airport ID. What airport did they depart from? And then the destination: what airport did they arrive at? Now what is weird about this is there are 279 unique airport origins, and there are 280 unique airport destinations, which means the destinations have one more airport than the origins. So that might be weird later. So these codes will be very important, because remember, maybe there is some inherent lateness or inefficiency associated with certain airports. For example, I can just imagine that if you had anything to do with Chicago’s O’Hare Airport, then you would probably be late. Or JFK Airport, or one of those really busy hub airports. So yes, let’s include this. r_break r_break But also, because they’re codes right now, it’s treating them as a numeric feature. No, no, we should not do that, because there’s no rhyme or reason between them. They’re like postcodes, zip codes. So they should be treated not as numbers, but as codes. r_break r_break So origin airport ID, we’re going to go ahead and cast that to category. So I’m just building up an attack strategy right now. r_break r_break So destination airport ID, same thing. We will cast it into a category, as well. r_break r_break And then CRS departure time, and there should also be an arrival time. So what time did they leave from the airport? So it’s listed in, I’m assuming, 24-hour time, so 0 to 2400. So for example, this flight left at 2:35 PM. And again, right now it’s numeric, and because this is a cyclical feature, where basically after 24 it resets back to 0, it doesn’t make sense to keep this as a numeric feature. It should be something else.
But if we cast it into a category, that’s going to cause too many unique values, in my opinion. So there’s going to be 1,100 unique values here. r_break r_break So the idea here is we would bucket these time stamps, so we would have fewer categories. Now, if we see over here, there’s the departure time block. r_break r_break So it looks like whoever did this took all the time stamps and put them into 19 bins, so almost even bins is what it looks like here. So basically if the flight departure time was between 2 and 3, you would be in this bin. So we’re not going to use CRS departure time, and we’re not going to use CRS arrival time. It’s not that they’re not useful. In their current forms, they’re not useful. But you notice that we have the time block here. So the time block is what we’re going to use. And that will encode basically the same information, but at a coarser granularity than the raw time stamps. We want some generality with our machine learning model. r_break r_break All right. So on another note, if I take the CRS departure time and subtract it from the CRS arrival time, I can build a brand new feature that is basically the number of minutes that the flight took. But the problem with that is I think these time stamps are in the local time zones of the airports. So we’d have to convert all of these into the same time zone, and then we can do that subtraction. But that feature might take more labor than it’s worth right now. So we won’t consider that for the time being. But just note that it’s something you can do. So CRS departure time, we’re going to drop it. The departure time block, we’re going to go ahead and keep. It’s a string right now. We need to cast it later into a category. r_break r_break And then departure delay: this is going to be a very important feature, I think, because if you already start off late or if you start off really early, I think that is a very strong indicator of whether or not you’re going to arrive late.
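The two time-related ideas above, bucketing an HHMM time stamp into a coarse block and computing a flight's scheduled duration once both times are in the same time zone, can be sketched roughly as follows. Everything here is an assumption for illustration: the block label format, the single early-morning bin (chosen so that hourly bins from 06:00 onward add up to 19), the fixed UTC offsets, and the example times other than the 2:35 PM departure.

```python
from datetime import datetime, timedelta, timezone

def time_block(hhmm: int) -> str:
    """Map an HHMM integer such as 1435 to a coarse time-block label."""
    hour = (hhmm // 100) % 24            # a value of 2400 wraps to hour 0
    if hour < 6:
        return "0001-0559"               # assumed single early-morning bin
    return f"{hour:02d}00-{hour:02d}59"  # one bin per hour from 06:00 on

def scheduled_minutes(dep_hhmm, dep_utc_offset, arr_hhmm, arr_utc_offset):
    """Scheduled duration in minutes, given local HHMM times and UTC offsets in hours."""
    day = datetime(2011, 10, 6)          # arbitrary date; only clock times matter

    def to_utc(hhmm, offset_hours):
        local = day.replace(hour=(hhmm // 100) % 24, minute=hhmm % 100,
                            tzinfo=timezone(timedelta(hours=offset_hours)))
        return local.astimezone(timezone.utc)

    dep = to_utc(dep_hhmm, dep_utc_offset)
    arr = to_utc(arr_hhmm, arr_utc_offset)
    if arr < dep:                        # arrival falls on the next calendar day
        arr += timedelta(days=1)
    return int((arr - dep).total_seconds() // 60)

print(time_block(1435))                          # block for a 2:35 PM departure
print(scheduled_minutes(1435, -7, 1950, -4))     # hypothetical westbound-to-eastbound flight
```

Subtracting the raw clock values in the hypothetical example (14:35 to 19:50) would suggest 315 minutes, while the time-zone-aware version gives the shorter real duration. A real implementation would need each airport's actual time zone and daylight saving rules, which is the extra labor the transcript refers to.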
r_break r_break So in this one, departure delay, notice it’s numeric. That’s fine. We want to keep it numeric, because it’s in minutes. And notice the negative numbers, so this flight left early. And then there’s departure delay more than 15 minutes, and again, this is derived. So in this one, the flight took off 17 minutes behind schedule. r_break r_break So that’s why it was cast as 1 here, because it’s greater than 15. So I think this is going to be a very good indicator of whether or not a flight is going to be delayed. So if it’s 1, the flight was already delayed before it left. So that means that for the flight to not be 15 minutes late, it actually has to make up time. So it has to jump through an extra hoop here. r_break r_break And also, if you see here, there are only two unique values, 0 or 1. That tells you that it needs to be a category, because it’s a binary feature right now. So that’s what we’re going to have to do to it. So cast into category. We don’t want it to be a number. r_break r_break CRS arrival time: I think we discussed this, that we will do the same thing here, which is, for now, we will drop it. You can build some crazy features from this, just so you know, if you like time series analysis. r_break r_break Arrival time block: we’re going to do the same thing that we did with the departure time block, which is we need to cast it into a category, because I believe it’s a string right now. And then the next thing is these four columns right here, which are basically what we’re trying to predict. So these four could be our response classes. And the rest of them will be our predictors. Now, we could build four different predictive models, but once you know how to build one predictive model, you’ll know how to do it for all of them.
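The derived 0/1 columns described above can be thought of as a simple threshold on the delay in minutes. The sketch below follows the transcript's wording, "greater than 15"; whether a delay of exactly 15 minutes counts as delayed in the real data set is not settled here, so treat the boundary as an assumption. The function names and category labels are made up for the example.

```python
def delay_flag(delay_minutes: float, threshold: float = 15) -> int:
    """Return 1 when the delay exceeds the threshold, else 0 (boundary is assumed)."""
    return 1 if delay_minutes > threshold else 0

def as_category(flag: int) -> str:
    """Cast the binary flag to a label so it is not treated as a number."""
    return "delayed" if flag == 1 else "on_time"

# Delays in minutes; negative means early. -6, -12 and 17 echo the transcript,
# 15 is added only to show the boundary case.
delays = [-6, -12, 17, 15]

flags = [delay_flag(d) for d in delays]
labels = [as_category(f) for f in flags]

print(flags)    # classification target: late or not
print(labels)   # same target, cast to a category
print(delays)   # regression target: the minutes themselves
```

The same minutes column serves as the regression target, while the thresholded flag is the classification target, which is why the problem can be framed either way.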
r_break r_break So what we’re going to do here is we’re only going to care about ArrDel15. And we’re going to build the model to predict that. So arrival delay, it could be a response class, but we’re going to drop it. Same thing with canceled, and same thing with diverted. So these three columns, they’re all response classes. They’re all things we would want to predict. But for now, we’re only going to predict arrival delay by 15 minutes. r_break r_break So this is going to be our response class. And this also needs to be cast into a category. So that is basically our attack strategy: what are we going to do as far as data manipulation, transformation, and all of that good stuff? r_break r_break All right, so join me next time. We will go ahead and start making some of these changes to the data set. Hey, if you liked that video, and you want to see more videos like this in the future, go ahead and like and subscribe. And I will look forward to seeing you at our boot camp. r_break r_break
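The attack strategy built up over this video can be written down as a small column plan. The sketch below is one way to record it in plain Python, not how Studio itself stores anything. The column names are assumed for illustration and should be checked against the actual data set, and dropping month is inferred from the transcript treating it like year and quarter.

```python
# One entry per column: "drop", "category", "numeric", or "label".
# Column names are assumptions; verify them against the real data set.
PLAN = {
    "Year": "drop",                 # single unique value
    "Quarter": "drop",              # single unique value
    "Month": "drop",                # single unique value (inferred)
    "DayofMonth": "drop",           # too granular; could derive holidays later
    "DayOfWeek": "category",        # cyclical codes, not a true number
    "Carrier": "category",
    "OriginAirportID": "category",  # codes, like zip codes
    "DestAirportID": "category",
    "CRSDepTime": "drop",           # replaced by the departure time block
    "DepTimeBlk": "category",
    "DepDelay": "numeric",          # minutes, keep as a number
    "DepDel15": "category",         # binary flag
    "CRSArrTime": "drop",           # replaced by the arrival time block
    "ArrTimeBlk": "category",
    "ArrDelay": "drop",             # alternative response, not used here
    "ArrDel15": "label",            # the response class, cast to category
    "Cancelled": "drop",            # alternative response, not used here
    "Diverted": "drop",             # alternative response, not used here
}

def apply_plan(row, plan=PLAN):
    """Drop or cast each field of one row according to the plan."""
    out = {}
    for name, value in row.items():
        action = plan[name]              # a KeyError flags an unplanned column
        if action == "drop":
            continue
        out[name] = float(value) if action == "numeric" else str(value)
    return out

# Toy row invented for illustration.
row = {"Year": 2011, "DayOfWeek": 4, "Carrier": "WN",
       "DepDelay": 17, "DepDel15": 1, "ArrDel15": 0, "Cancelled": 0}
print(apply_plan(row))
```

Keeping the plan as data rather than scattered notes makes it easy to review column by column before any transformation is applied.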

Phuc H Duong - Phuc holds a bachelor's degree in Business with a focus on Information Systems and Accounting from the University of Washington.