Building a Machine Learning Model
All the knowledge you've acquired so far will be put to use in order to create a responsive and efficient model in Azure ML. This tutorial serves as a step-by-step guide for creating ML models.
What You'll Learn
> Building your first machine learning model in Azure ML
> Steps of the process with detailed instructions
> Testing the accuracy of the model
Hey, welcome back to data mining with Azure Machine Learning Studio brought to you by Data Science Dojo.
So today I’m excited. Today is the day where we get to build our model. So after all that work that we’ve done, we finally get to take our data set, feed it into our machine learning model, and have the model iteratively learn by itself from historical data what kinds of circumstances came together to produce the results, right? So basically, whether or not a flight is going to be delayed. What are the factors that contribute to a flight being delayed? We get to answer those questions soon.
So let’s go back to the data mining framework and remind ourselves where we are in it. In the last video we set up a train/test partition, so we have model-ready data. That 70% of the data will be the model-ready data. Today what we’re going to do is select an algorithm to train on. The next thing we’re going to do is build ourselves a model. And then we’ll leave evaluation for next time.
So to iterate further on where we are in the methodology of the train/test split, where we are right now is right here. We’re going to build our model, and that’s what’s going on today. For the most part, we’re going to ignore that test set today. It’s not until predictions that we care about the test set. So let’s go ahead and go back into our Azure Machine Learning workspace.
Last time, what we went ahead and did was split the data. So 70% went on this side, and 30% went on that side. The 30% we’re going to basically ignore for a while. We’re going to pretend that it’s tomorrow’s sales data, tomorrow’s flight data, for example.
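As a rough sketch, the 70/30 split described here can be written in plain Python. The function name `split_train_test`, the fixed seed, and the toy `flights` rows are assumptions made for illustration; this is not how the Azure ML module is implemented.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of `rows` and cut it into (train, test) lists."""
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for the flight data (hypothetical rows).
flights = [{"flight_id": i} for i in range(10)]
train, test = split_train_test(flights)
print(len(train), len(test))
```

The test portion is set aside and not touched again until it is time to make predictions.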
OK, so the next thing we’ve got to do is train a machine learning model. In Azure Machine Learning, you have a module called Train Model. Just go to your top left bar and type in the word training. If you don’t see this bar, go ahead and expand it out. So you want to type in train model. And then just drag in this Train Model module. And then what you want to do is hover over the input nodes. The input nodes will tell you what the module wants. The one on the left side means it wants an untrained model. An untrained model means an algorithm, right.
The model is the result of training. The model is the applied form of the algorithm. The algorithm is just a blank set of instructions on how to build that model.
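To make the algorithm-versus-model distinction concrete, here is a minimal sketch with a deliberately trivial learner that always predicts the most common label. The class names `MajorityClassAlgorithm` and `MajorityClassModel` are invented for this illustration and have nothing to do with the decision forest used later.

```python
from collections import Counter


class MajorityClassModel:
    """The trained result: it now carries something learned from data."""

    def __init__(self, label):
        self.label = label

    def predict(self, _flight):
        return self.label


class MajorityClassAlgorithm:
    """The 'untrained model': instructions only, no knowledge of any data yet."""

    def train(self, labels):
        most_common_label = Counter(labels).most_common(1)[0][0]
        return MajorityClassModel(most_common_label)


algorithm = MajorityClassAlgorithm()          # blank set of instructions
model = algorithm.train([0, 0, 1, 0, 1])      # historical labels (made up)
print(model.predict({"carrier": "XX"}))
```

The algorithm object can be reused on any data set, while each call to `train` produces a model tied to the data it saw.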
The next thing is, it wants a data set. So it wants to learn from the past here. So notice that we want to give it the training set, the 70% of the data. Notice that if I mouse over this data set, it will connect here. But also notice that if I mouse over here, it will accept that one as well. So it just wants a data set. It doesn’t matter which one. But you know and I know we need to give it the training set. So go on and connect that. And it’s still not happy. It wants an algorithm. So now we need to select an algorithm. Go into AzureML, and you should actually just close all of the extra features for now.
And look at just the tab that says Machine Learning. This is where all the algorithms inside of AzureML are kept. If you open this out, there is a thing called Initialize Model. Go ahead and expand that. Now we get into these four families of machine learning models. Once you identify what your machine learning problem is, you will find out what type of machine learning algorithm you need. So there are four types of machine learning algorithms inside of AzureML.
So the first thing you need to figure out is whether this is a supervised learning data set, meaning do you have labels. Labels being what it is that you want to know from the past. In this case I want to know if a flight was delayed in the past. Do I have that in my data set? Do I already have, from the past, whether or not this flight was delayed, yes or no? If it’s yes, then it’s supervised learning. I have labels for the past. I know the answers in the past. I know the stock price in the past. So that’s supervised learning.
The next thing you have to figure out is what data type the label is. Even when it’s supervised learning, there are two types of supervised learning algorithms. There are classification-type algorithms. And then there are regression-type algorithms.
If your feature, if the response class is a label, if it is a category, it is a classification task. You are trying to predict, is this pixel red, blue, or green. In this case, we’re not going to predict how many minutes it will be late by. We’re going to predict whether or not it will be late at all, past 15 minutes. So that tells us it’s classification. Now it would have been regression if I wanted to predict how many minutes it would be late.
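The difference between the two targets can be sketched in a few lines of Python. The delay values below are made up, the function name `to_delay_label` is hypothetical, and treating a delay of exactly 15 minutes as late is an assumption.

```python
DELAY_THRESHOLD_MINUTES = 15  # threshold mentioned in the tutorial


def to_delay_label(arrival_delay_minutes, threshold=DELAY_THRESHOLD_MINUTES):
    """Return 1 for 'late' and 0 for 'not late'."""
    # Assumption: a delay of exactly `threshold` minutes counts as late.
    return 1 if arrival_delay_minutes >= threshold else 0


# Regression target: a number of minutes (made-up values).
arrival_delays = [-3, 0, 14, 42, 120]

# Classification target: a yes/no category derived from those minutes.
labels = [to_delay_label(minutes) for minutes in arrival_delays]
print(labels)
```

Predicting the values in `arrival_delays` would be a regression task, while predicting the values in `labels` is the classification task this tutorial works on.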
So there was a column at the beginning that we dropped called, I think, Arrival Delay, and that was in minutes. So if we want to predict that later, that would be a regression problem. So now that we know what type of algorithm we need, we go in and expand the classification task. And then the next thing it wants you to know is how many classes there are in the response class. So notice that we have two classes, you’re late or you’re not late, zero or one. That is a two-class type of algorithm.
So basically we are stuck with these types of algorithms right here. Now if you have more than two classes, if you were late, kind of late, super late, if you had that kind of tiering in your data set, then you would have a multiclass classification problem. But in this case, we know we have a two-class classification problem.
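The questions asked over the last few paragraphs can be summarized as a small decision function. This is only a sketch of the reasoning; the function name and the returned strings are invented here and are not Azure ML identifiers.

```python
def pick_algorithm_family(has_labels, label_is_category, n_classes=None):
    """Walk through the questions from the tutorial, in order."""
    if not has_labels:
        # No known answers from the past, so this is not supervised learning.
        return "not supervised learning"
    if not label_is_category:
        # A numeric target, such as minutes of delay.
        return "regression"
    if n_classes == 2:
        # Late or not late, zero or one.
        return "two-class classification"
    # Late, kind of late, super late, and so on.
    return "multiclass classification"


print(pick_algorithm_family(True, True, n_classes=2))
```

For the flight data, the label exists, it is a category, and it has two values, so the answer is two-class classification.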
And now this is the cool part, we get to basically go shopping for a machine learning model. This is kind of nice. This is also the curse of machine learning, because you don’t need to know what these things are. You can drag them in and they’ll work. But that’s not how a good practitioner does things. They should probably understand a little bit before they start doing something with it.
So first thing, we’re not going to really get into the differences between these algorithms. If you want to know the differences, I would join the Data Science Dojo five-day Data Engineering and Data Science Boot Camp. We will teach you about most of these algorithms. But for now, I know based upon my experience as a data scientist that this data contains a huge amount of categorical data.
So if you visualize this data set, most of it is categories. If it is a situation where most of your data set is categorical, we need to select what’s called a nonparametric algorithm. If we have lots of categories, decision trees are really, really good at telling categories apart from one another. If we had lots of numeric data, that would have been a different issue. But we have lots of categories. Basically, there are three families of decision tree algorithms inside of Azure Machine Learning Studio. And the simplest one is the decision forest.
So we’re going to go ahead and drag this in first. And if you want to know what these algorithms are, we might make a video about it in the future. But definitely take our Boot Camp, we will teach you everything you need to know about these algorithms. But for now, just go ahead and slide in the decision forest.
So the next thing you need to do is hook up the decision forest. So select the decision forest. And then this window on the side will pop up. If it doesn’t pop up, go ahead and expand it. And what you need to do is connect this to here. So notice that this could have taken in basically any other two-class algorithm. I could have just put in another forest here. I can connect a decision jungle here. But that’s just showing you as an example. So I’m going to connect this forest, and then inside of this forest, there are what’s called tuning parameters. We’ll go over these in a little bit.
But for now, notice that on the Train Model module there’s a red mark next to it. It’s angry at you. It wants something from you. Every time you see this red mark, just click on it. There should be some kind of launch button on the right side that will tell you what to do. So this time it says Value Required. So it’s kind of cryptic. But what it actually means is it wants to know what you are trying to predict. Is this a state predictor? Because you could actually take your data set and predict on any column: what type of carrier it is, what the departure time is, what the departure place is.
So you can go to Predict and predict any of these columns. In this case, launch the column selector. We know that we want to predict arrival delay 15, yes or no. So we’re going to go ahead and say arrival delay 15 will be the response class. So now our Train Model module knows what to do. It’s going to cast the rest of the columns as predictors, or features, to be used in regard to the response class, the response class being arrival delay 15.
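Picking the response class amounts to pulling one column out as the label and treating everything else as features. A minimal sketch, in which the column names and the values in `flight` are made up for illustration:

```python
def split_features_and_label(row, label_column):
    """Separate the column to predict from the columns used as predictors."""
    features = {name: value for name, value in row.items() if name != label_column}
    return features, row[label_column]


# One hypothetical row; the real data set has its own column names.
flight = {
    "Carrier": "XX",
    "DepartureHour": 17,
    "OriginAirport": "AAA",
    "ArrivalDelay15": 1,
}

features, label = split_features_and_label(flight, "ArrivalDelay15")
print(sorted(features), label)
```

Passing a different `label_column` would turn the same data into a different prediction problem, which is why the module insists on being told which column to use.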
So the next thing you want to do is look at your algorithm module. The algorithm module for me, in this case, is a two-class decision forest. Once you click on it, you will notice that there is a toolbar that pops up on the right-hand side. This toolbar will let us tune how the algorithm builds the model.
These are knobs and levers. So you will see that the number of decision trees right now is eight. So I want to build this tree. And I want us to look at this tree and explore this tree, so we can kind of get the mechanics of how these trees work. So we’re going to build actually a very, very bad model. And bad because we’re going to build it to be a very simplistic model, so that humans can understand it.
So what we’re about to do here should not be used in production. I’m doing this for educational purposes. The first thing I want to do is reduce the number of trees down to one. I want to zoom in on only one tree right now. By the way, never deploy a single tree in production in the real world. You will regret it. Trees have a habit of overfitting. That’s why you want to use lots of different trees. The next thing is maximum depth of decision trees, which tells you how deep the tree can actually grow, so in this case 32. That’s going to be a huge tree. I might not even be able to look at it, even if I had a big screen monitor.
So if I want to look at the tree, I will change this to like five or six. I’m going to change it to five. And then the next thing is number of random splits is left at 128. Leave that alone for now. We’ll tune these parameters in a different video.
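The settings discussed so far can be written down as a small configuration sketch. The key names are paraphrased from the transcript and are not an Azure ML API; the values are the defaults and changes mentioned above.

```python
# Values the tutorial reports seeing before any changes.
DEFAULT_SETTINGS = {
    "number_of_decision_trees": 8,
    "maximum_depth": 32,
    "number_of_random_splits": 128,
}

# Deliberately simplistic settings, for looking at a single small tree only.
TUTORIAL_SETTINGS = dict(
    DEFAULT_SETTINGS,
    number_of_decision_trees=1,   # zoom in on one tree
    maximum_depth=5,              # shallow enough to read on screen
)

# Show (default, tutorial) pairs for every setting that was changed.
changed = {
    name: (DEFAULT_SETTINGS[name], TUTORIAL_SETTINGS[name])
    for name in DEFAULT_SETTINGS
    if DEFAULT_SETTINGS[name] != TUTORIAL_SETTINGS[name]
}
print(changed)
```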
The next thing you look at is the minimum number of samples per leaf node. So basically this is the minimum number of observations I must have in a node after a split, if I want to split on it. The idea is I don’t want to split and then all of a sudden have one observation in a single node all by itself. That is basically the definition of overfitting. So let’s turn this number up a little bit. I want to make this number 34. So 34 is roughly 0.01% of the training set, which is 349,000 rows right now. Once you’ve set all that, go ahead and hit the Run button. And this will go ahead and build us a decision tree based upon 70% of the data. So remember we’re ignoring the 30% for now.
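The minimum-samples rule can be sketched as a simple check on a candidate split. This assumes the rule applies to both sides of the split; the function name and the example counts are hypothetical.

```python
def split_is_allowed(left_count, right_count, min_samples_per_leaf=34):
    """A split is kept only if neither side ends up with too few observations."""
    return (left_count >= min_samples_per_leaf
            and right_count >= min_samples_per_leaf)


# A split that isolates a handful of flights is rejected...
print(split_is_allowed(348_990, 10))

# ...while a split that leaves plenty of flights on both sides is kept.
print(split_is_allowed(200_000, 149_000))
```

Raising the minimum makes the tree less willing to carve out tiny groups, which is one way of holding back overfitting.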
And now what we’re going to do is build a single decision tree, max depth of five. And we have to have enough representation in order for a tree to split on that decision. And again, I want to state that this is a really dumb and simplistic model. Don’t actually use this in production.
Now this is so we can actually do what we’re about to do now, which is right click and visualize the model. So the output of a Train Model module is actually the model itself. So notice that this guy right here, you can think of it as an algorithm. You can think of it as a blank set of blueprints to build a model, to build a tree. And the output of this is the model itself. So in this case the tree has been built based upon historical data.
So if I visualize this, I can then get a graphic of basically the tree that I built. So this tree notice that it’s got one, two, three, four, five depth. It’s got five depth, because remember, I set that at five depth. Now remember, earlier the default was 32. Can you imagine how basically hairy that gets as it goes down.
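To put a number on how hairy a deeper tree can get, here is a back-of-the-envelope sketch. It assumes every split is a two-way split and that depth counts the levels of splits, so it gives an upper bound rather than the size of any tree actually built.

```python
def max_leaves(depth):
    """Upper bound on leaf nodes for a tree of two-way splits `depth` levels deep."""
    return 2 ** depth


print(max_leaves(5))    # 32
print(max_leaves(32))   # 4294967296
```

A real tree usually stops well short of the bound, because the minimum samples per leaf setting blocks many splits.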
And the next thing I want to look at is how I interpret this tree. So with the decision tree, what you want to do is think of it as, OK, I want to take in a new data set, a new observation, a new flight. Basically, I could print this out and I could read word for word what it’s going to do. So if I look at this, the first thing is the first question that this decision tree is going to ask me, as if this was a brand new flight route. Let’s say I’m building a prediction for a brand new flight. Is this flight going to be delayed or not based upon what I’m about to tell it?
So the first thing the model is going to ask me is, was the departure time between 1700 and 1759? In this case, did your flight leave between the hours of 5:00 PM and 6:00 PM? And if you say yes, you’re one, which means you are greater than zero. So you go over here. If you’re less than or equal to zero, you go on the left side. So let’s just say, no, we did not leave between 5:00 and 6:00 PM.
The next question it would then ask you is wherever that node leads you. So we’ve gone to the left side.
So next thing it’ll ask you, OK, was your flight already delayed by 15 minutes before you even left the original airport? And if you are one, you go over here. If you’re zero, you go over here. So let’s just say our flight was on time at the very beginning of the origin airport.
The next thing it’s going to ask you is, hey, was that airport Phoenix, Phoenix Sky Harbor? If you click on this, it will say Phoenix Sky Harbor. So far, we’re in a situation where, let’s just say, no, we did not come from Phoenix Sky Harbor. It would ask you the next question. We’re not going to Phoenix Sky Harbor. Sorry, the destination is where we’re going to. The next thing is origin city. Are you going to San Fran, yes or no? And now it gives a decision, right. So notice that it’s zero or one down here. So notice that if we are going to San Fran, we will be on time. If we’re not going to San Fran, we will be late. And that is basically how you interpret it.
So for this, let’s assume that this is a brand new data set coming in. So if you did not leave between 5:00 and 6:00 and we went ahead and said if the plane was not late on departure and it wasn’t from Sky Harbor and we weren’t going to San Francisco. We’re going to go ahead and be late. And that’s how you interpret that tree. Now remember this is a very simplistic tree. And also you never want to use a single tree in production. But that was just me showing you so you can see what the tree is doing, what the model is doing to your brand new data.
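The path just walked through can be read as a chain of yes/no questions. The sketch below encodes only that one path, with hypothetical field names; branches the walkthrough did not visit return `None`, and the transcript is ambiguous about whether the Phoenix question concerns the origin or the destination.

```python
def walk_tutorial_tree(flight):
    """Follow the single path described in the walkthrough.

    Returns 1 for 'late', 0 for 'on time', or None for branches
    the walkthrough never explored.
    """
    if flight["departed_5pm_to_6pm"]:
        return None  # right-hand branch, not described
    if flight["departure_delayed_15"]:
        return None  # not described
    if flight["phoenix_sky_harbor"]:
        return None  # not described
    # Leaf decision: going to San Francisco means on time, otherwise late.
    return 0 if flight["destination_san_francisco"] else 1


new_flight = {
    "departed_5pm_to_6pm": False,
    "departure_delayed_15": False,
    "phoenix_sky_harbor": False,
    "destination_san_francisco": False,
}
print(walk_tutorial_tree(new_flight))
```

A tree retrained on a different split of the data may ask different questions, so this should be read as an illustration of how to interpret a tree, not as the model itself.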
We have just about run out of time. And if you like what you just saw, remember to hit that like button. This will help support us in creating future content for free. Remember to subscribe for future content and share this video to spread the glorious word of data science. And I have a question for you before we leave. Now how well do you think this model is going to do on the test set? Now I have some opinions. But I want to hear from you.
I have another question for you. Was your tree different from my tree? And if you were paying attention in the previous video, you’ll know what the result of that is. And why do you think that tree may or may not be different? Go ahead and leave your responses in the comments. My name is Phuc Duong and I’ll see you next time. Happy modeling.
Phuc H Duong - Phuc holds a Bachelor’s degree in Business with a focus on Information Systems and Accounting from the University of Washington.
© Copyright – Data Science Dojo