Hello, ladies and gentlemen of the internet. My name is Phuc Duong and I’m here to show you how to do the Kaggle competition in Azure Machine Learning Studio.
I’ve gone ahead and uploaded the blueprints to a solution that I’ve come up with in Azure Machine Learning Studio, and I have posted it on the Cortana Intelligence Gallery, where you can clone and replicate this experiment. Now, I’m going to post the link to this experiment in the description, or you can go ahead and pause the video, look at this notepad with the URL, and type that in yourself. I really recommend clicking on the description.
So, now that we’re done with that, there is a quick description on this page. You can read it if you want, or you can listen to my beautiful voice as I walk you through how to do this. If I click on the Open in Studio button, it’s going to go ahead and try to open it in your Azure Machine Learning workspace. The first thing it’s going to ask you is which workspace you want to clone it to. You choose the workspace you want, choose which region you want it to clone to, and say yes. So, it’s going to go ahead and bring in my datasets, my modules, and then we’re going to go ahead and run the experiment. I’m going to walk you through what it’s doing. Now, remember that this is a solution. This is definitely nowhere near the optimal solution, but it is definitely a good solution to get running, and it works as a template for most machine learning models, not just the Titanic machine learning model.
Once it’s gone ahead and loaded, you should see something that looks like this. It kind of looks like an octopus. So, the first thing you do is to hit run. Right now, this is just like a blank set of instructions, or a set of blueprints. Once you hit run, though, it’s going to go through and execute and actually do the transformations on the datasets. So, at the very top, we begin with our two datasets. The Titanic dataset will be our training set. So, 891 rows where we know from the past whether or not people lived or died when they stepped on the Titanic. And then our job is to build a predictive model based upon these demographic conditions. How many children they had, whether or not they’re male or female, et cetera. And then the test set is our actual task. This is what we’re going to need to predict on, because we don’t have the Survived column for it. So, based upon feeding brand new data to the model that we’ll build, can the model predict on this dataset? Can we predict whether or not these people will live or die? And that’s what the Kaggle competition is going to go ahead and grade you on. So let’s ignore the test set workflow for a minute. I’m going to basically mouse over here so you guys don’t see it, so you guys will ignore it. And I’m going to go through this workflow.
So, we’re going to drop four columns right now - passenger ID, name, ticket, and cabin - because they’re not useful in their current forms right now. We can go back later and improve this model by going into the name, for example, and extracting titles, or the cabin letter, or something else. And then we’re going to go through and we’re going to cast columns into categories. So, the categories are basically Survived, Pclass, Sex, and Embarked. We want the dataset to treat these - especially Survived - as categories, because remember, by default, it treats 0 and 1 as numeric. If it goes in as numeric, it then becomes a regression problem, not a classification problem. And then we have two cleaning modules, so in this case, we’re going to go ahead and do a quick and dirty clean.
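As a side note, here’s roughly what this drop-and-cast step looks like outside of Azure ML, sketched in pandas on a tiny made-up frame that just borrows the Kaggle column names:

```python
import pandas as pd

# Tiny made-up frame using the Kaggle Titanic column names (not real data).
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 2],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# Drop the four columns that aren't useful in their current forms.
train = train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

# Cast the label and the low-cardinality columns to categories, so that
# Survived is treated as a class (classification), not a number (regression).
for col in ["Survived", "Pclass", "Sex", "Embarked"]:
    train[col] = train[col].astype("category")
```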
We’re going to clean the entire dataset with basically the median. In this case, I think the median translates to 28 for the 177 rows that are missing an age. If I visualize before the cleaning, we can see that there are 177 missing values of age, but then after we clean all the numerics, we should be left with no missing values of age. But notice that at around 28, our histogram gets pulled way up. So our distribution stays roughly the same shape; it just gets tighter around the median.
Remember, this is a very subpar cleaning function. You definitely want to eventually build a machine learning model to predict these missing values. The next thing we want to do is the same thing, but for categories. And for categories, we’re going to clean with the mode. So if we looked at Embarked before the cleaning, there should be two missing values of Embarked, for example, but after this, if we visualize it, there should be no missing values of Embarked. And there should be no missing values for any of the remaining columns. So now what I want to do is show you the methodologies here.
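For reference, the quick-and-dirty clean - median for numerics, mode for categories - can be sketched in pandas like this (made-up rows, assuming the same column names):

```python
import pandas as pd

# Made-up rows with missing Age and Embarked values, mimicking the dataset.
df = pd.DataFrame({
    "Age": [22.0, None, 28.0, 35.0, None],
    "Fare": [7.25, 71.28, 8.05, 53.10, 8.46],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Numeric columns get the median; everything else gets the mode.
for col in df.select_dtypes(include="number"):
    df[col] = df[col].fillna(df[col].median())
for col in df.select_dtypes(exclude="number"):
    df[col] = df[col].fillna(df[col].mode()[0])
```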
So there are really three methodologies here, but I want to show you the first one. The first side is the train-test split. And notice that all three methodologies derive from the same algorithm. So this Two-Class Decision Forest is feeding into each and every one. On the left side, we basically do a train-test split: we hold out 30% of the data, and we build the model using only the other 70% of the data.
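If you wanted the same split-and-train step outside the Studio, a rough scikit-learn sketch would look like this - stand-in data of the training set’s shape, with a random forest standing in for the Two-Class Decision Forest module:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: 891 rows with a binary label, like the training set's shape.
X, y = make_classification(n_samples=891, n_features=6, random_state=0)

# Hold out 30% as the pretend "new world" data; train on the other 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# A random forest stands in for the Two-Class Decision Forest module.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)  # one prediction per held-out row
```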
Notice that 70% of the data is going in, and it’s going to go and sacrifice itself to build this forest, build this model. And once the model has been built, then we bring in the 30% dataset that we basically hid away. We hid away this dataset to pretend that it’s new world data. So if you visualize the Score Model module, you’re going to see that it’s gone ahead and taken every row, basically thrown it at the model, and the model has given a prediction for each and every single row, line by line. So 445 predictions here. And it looks like - oh, this is a mistake on my part. I said this is a 70/30 split here; it’s actually a 50/50 split. So ignore that; it’s a mistake. You can change that if you want.
So, it currently doesn’t match the documentation. All right, so if you go ahead and visualize this, we can see that based upon the cleaning function and all this stuff, the model predicted a 50% chance that this person would live, and because it rounds up from 50%, the model is going to go ahead and say this person will live. In actuality, this person died. So the model was wrong in this case. The model was not in agreement with reality in this situation. And if you go down here, the model thought this person would have a 23% chance of living; it rounds down from 50%, so, zero. This person is predicted to have died. Ooh, OK, so this model is wrong again.
So, notice that we already found two wrong answers. But you can also do a comparison between the two. So if you click on the predicted result, and then compare that to Survived, which is the ground truth, we’ll have predicted versus actual, and we get a really cool confusion matrix. 103 times the model was right on one class, 254 times the model was right on the other; the rest of these are wrong. We could go through and add those up by hand, but there’s actually an evaluation module that will calculate accuracy, precision, and recall for us. So if you visualize the Evaluate module and scroll down, it says that this model has an 80% accuracy, which is actually not bad.
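The Score Model plus Evaluate step boils down to a confusion matrix and an accuracy number. With a tiny made-up predicted-versus-actual example (not the video’s 445 rows), the scikit-learn equivalent is:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Tiny made-up predicted-vs-actual example, just to show the mechanics.
actual    = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
predicted = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]

cm = confusion_matrix(actual, predicted)  # rows = actual, cols = predicted
acc = accuracy_score(actual, predicted)   # (right answers) / (total rows)
```

The two diagonal cells of `cm` are the times the model was right; everything off the diagonal is wrong, and accuracy is the diagonal divided by the total.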
The Titanic competition wants you to maximize this particular metric, accuracy, and hopefully we won’t over-fit too much by trying to maximize it. The idea is, we will build a new feature and go ahead and run this again. So we’ll add a new feature, we’ll clean differently, we’ll tune the parameters of the algorithm, we’ll do tweaks to make the model better. And every time, we’re going to go back and see: does that tweaking improve the accuracy of the model or not? So, that’s the first way. That’s the order of development.
Now, it could be that through sheer randomization alone, we could end up with a series of splits that make us look good. So that’s where cross-validation comes in. Once we get an accuracy that we’re happy with, the idea is that we can then cross-validate to see whether or not we should trust that accuracy measure. So if we look at this number, 80%, is that 80% on a good day, or is that 80% on a bad day? It could just be sheer randomization. Some of the rows in the test set could have just been easy to classify, and the harder-to-classify rows were not in that test set. So to avoid that, what we’re going to do is build 10 models in this cross-validation module.
So, first of all, I’m not going to trust any number that comes out of this evaluation module. I’m going to be skeptical and I’m going to do a cross-validation check on that number. So this model claims that it will get me an 80% accurate model. This cross-validation is basically like a separate dipstick test. Notice it’s reading from the same algorithm, the same dataset, the same everything because I want to test all of the current conditions.
So, if I visualize this cross-validation module, it’s going to go ahead and build me 10 models. It’s going to take my dataset, chop it up into 10 folds, train a model on a different set of folds every time, and test on a different fold every time, until it has seen my entire dataset. And we can go ahead and see the accuracy; this is a table that summarizes how the model did on each and every fold of the cross-validation. So notice that when it got to fold 7, it basically trained on every fold except 7, and then tested on fold 7. And now we’re in a situation where the model only got 66% accuracy on that fold. I think that’s very bad. So notice - that 80% seems to have been on a good day. Then it’s 82, 89; I think that’s pretty good. But you notice that the model is jumping around. It is not a stable model.
So, this is a model that could eventually betray us. It could fool us into thinking that it’s a good model, but we deploy it and then we start losing a lot of money based upon these predictions, because we’re wrong a lot. So you notice that on average, the model will get 80% when we retrain it. And the standard deviation from the mean is about 6% here, so that is a very high standard deviation model. If you take two times the standard deviation, it’s about plus or minus 12. So the model will be anywhere between 68% and 92% accuracy. So remember, we don’t want to care about or get attached to any individual model.
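The arithmetic here is just the mean and standard deviation of the per-fold accuracies. With ten made-up fold scores centered near 80%, it looks like this:

```python
from statistics import mean, stdev

# Ten made-up per-fold accuracies, centered near the video's 80%.
fold_acc = [0.82, 0.89, 0.66, 0.80, 0.85, 0.78, 0.74, 0.86, 0.81, 0.79]

avg = mean(fold_acc)
sd = stdev(fold_acc)  # sample standard deviation across the folds

# A rough two-standard-deviation band around the mean accuracy.
low, high = avg - 2 * sd, avg + 2 * sd
```

With a standard deviation around 6 points, the band runs from roughly the high 60s to the low 90s, which is exactly the instability the video is warning about.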
We want to - we’re basically evaluating a process based upon our methodologies. Will this methodology produce a good model every time? So the idea is, what cross-validation cares about is not the individual sharpness of a single knife or blade; it cares that the factory goes ahead and produces a sharp blade every time. I don’t care about the individual sharpness of a blade; I care about the overall sharpness of all the blades that come out of this factory. And the same thing is true of this machine. So that tells me, first of all, that this is not a good model. It’s a very unstable model.
So, I would go back to the drawing board. I would engineer more features, I would go ahead and build better cleaning functions, maybe use different parameters for the algorithm, maybe use a different algorithm altogether. But the idea is, once I’m happy with the standard deviation in the cross-validation, once it’s low enough, I can go ahead and decide that I want to deploy this model. Deployed means I want to use it on my production data. In this world, the Titanic Kaggle competition, the production data is the Kaggle test set, and that’s the other 418 rows that they don’t give you Survived on.
So, once I’m happy with my process and the model, I’m going to go ahead and retrain the model on 100% of the data. And notice that once my cross-validation has validated my process - my process being the way I treat my data, the way I clean my data, the way I’ve engineered my features, and also the algorithm I chose and the parameters of the algorithm - once all of that is settled, and I have a spreadsheet somewhere keeping track of all this, then I retrain the model on 100% of the data. And notice that because we’ve trained on 100% of the data, we have no holdout set, and because we have no holdout set, we can’t evaluate to tell how well this model did. But we have some guarantees, because we know that in the past, it did this well on this particular set of parameters, this algorithm, and this dataset.
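In scikit-learn terms, this retrain-on-everything step is just fitting on the full labeled set and scoring the unlabeled production rows. The data below is a stand-in, and a random forest again stands in for the decision forest module:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in labeled data (891 rows) and unlabeled "production" rows (418),
# shaped like the Kaggle train and test sets.
X, y = make_classification(n_samples=891, n_features=6, random_state=0)
X_prod, _ = make_classification(n_samples=418, n_features=6, random_state=1)

# Retrain on 100% of the labeled data: no holdout, no evaluation here.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Score the production set: probabilities rounded at 0.5 become 0/1 labels.
proba = model.predict_proba(X_prod)[:, 1]
labels = (proba >= 0.5).astype(int)
```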
The assumption is that if I feed it more data and keep everything else the same, it will perform better; it will learn better from that dataset. So, this is my production model; notice I’m feeding it 100% of the data, with no evaluation, because the evaluation was done over here. So if I had two weeks to put a predictive model into production, this is basically day 13 out of the 14 days. And notice that if you go back to the test set, I cleaned the test set exactly the same way. Notice that I basically took this, right-clicked, copied, and pasted it over here. So I cleaned it exactly the same way.
The only difference is I actually don’t drop passenger ID. I let passenger ID go forward through here. So if I visualize this, I actually keep the passenger ID, because remember, Kaggle wants passenger ID. But why doesn’t it error when it gets down here, in the prediction, since we never built a model that uses passenger ID? Well, the idea is that Azure ML is going to look for the column names that it was trained on. If it sees extra column names, it’s going to go ahead and ignore them; it’s just going to pass them through. So that’s why passenger ID is fine over here, but it’s not fine over here. The next thing is, we don’t have Survived, so I’m not even going to try to keep Survived. And also this documentation is wrong: it should not say dropping passenger ID, it should just say dropping name, ticket, and cabin. And notice that we’re going to cast everything except Survived into a category.
So, notice that Survived is still listed here, even though the test set has no Survived column - that was left in because I copied and pasted this from over here. So notice that I’m casting these things into categories, so it knows that everything has to stay the same. If it was a category when you trained, it has to be a category when you predict on it, or in this case, score on it. And then I clean it the same way. Now, this is also really subpar cleaning, because I think the median over here for age is 29, and the median over here is actually 28. If you were really serious about keeping everything the same, you would do a custom substitution over here and just say “replace with 28”. So that’s probably what you should have done here - just a hard replace with 28. Same thing over here.
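That hard replace - reusing the training set’s median of 28 on the test set instead of letting the test set compute its own median of 29 - is a one-liner in pandas (the test rows below are made up):

```python
import pandas as pd

# Made-up test-set slice with missing ages; the training median was 28.
test = pd.DataFrame({"Age": [34.5, None, 62.0, None]})

# Hard replace: reuse the TRAINING set's median, not the test set's own,
# so the test set is cleaned exactly the same way as the training set.
TRAIN_AGE_MEDIAN = 28.0
test["Age"] = test["Age"].fillna(TRAIN_AGE_MEDIAN)
```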
I think the mode is also the same on both sides, so we’re not really affected there. But now what it’s going to do is take this dataset and this model that’s now trained on 100% of the data, and once that model’s been built, it’s going to run through and build predictions. It’s going to derive predictions from that production model, and it’s going to give me a prediction for each and every single row. So go ahead and visualize it when it’s done.
So, you can see that starting from passenger 892 and going onward, these are the predictions that we’re going to get, and the scored labels are derived from the scored probabilities: it rounds them based on whether they’re higher or lower than 50%. So notice that this person is rounded up to one, this person is rounded down to zero, and so on and so forth. And remember, this is what Kaggle wants from you. Kaggle just wants a straight answer: will this person live or will this person die?
Now what we need to do is formulate our dataset so we can upload it to the Kaggle website. So we’ve gone ahead and we take the scored labels, and we take the passenger ID - that’s all Kaggle wants, so that’s what we’re going to do here. We’re going to use Select Columns in Dataset to pick passenger ID and scored labels, and we’re going to get basically only two columns out of that. The next thing we’re going to do is rename the scored labels column to Survived, because Kaggle is looking for two columns with particular names. So in this case, it was called Scored Labels, and what we’re going to do is rename that column.
So, I said I want Scored Labels to be renamed to Survived, and then I want to convert the whole thing to CSV. So if I right-click and download, it’ll go ahead and download the dataset to my computer. The next thing I want to do is upload this to the Kaggle competition. So once you have downloaded that file, go to “make a submission”, and this is what you should be looking at. Notice I have a Titanic Kaggle competition file that I’ve downloaded from Azure ML, from that Convert to CSV module. I’m just going to go ahead and drag that in and submit my dataset. And then I will say submit. It’s going to go ahead and score my submission, and based upon this model, I got a 77% accuracy on the test set. And basically, that’s how you do it.
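The select-rename-CSV tail end of the workflow maps to a few lines of pandas. The scored frame below is made up, but the two-column shape is exactly what Kaggle expects:

```python
import pandas as pd

# Made-up scored output: PassengerId starts at 892 like the Kaggle test set.
scored = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Scored Labels": [0, 1, 0],
    "Scored Probabilities": [0.23, 0.81, 0.44],
})

# Keep only the two columns Kaggle wants, rename the label to "Survived",
# and write the CSV without the index column.
submission = scored[["PassengerId", "Scored Labels"]].rename(
    columns={"Scored Labels": "Survived"})
submission.to_csv("submission.csv", index=False)
```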
So, to raise your rank from this competition, you basically go back to the drawing board. You change the cleaning functions, you engineer different features, you bring in different data, more data, and you do what you need to do to improve the model. Now if you’re in the public Kaggle competition, bringing in more data might be bad, because it’s against the terms and conditions. But if you’re doing this with the Data Science Dojo cohort, go ahead and, by any means necessary, make the model better.
So, let’s talk about this workflow real quick. Notice that I’ve set everything up in the same workspace. I will engineer features differently, I will clean it differently, I will change these parameters, and every time I’ll go ahead and train-test split and check: if I clean it with the mean, did that improve the accuracy? If I clean it with the median, would that increase the accuracy? And I’ll keep doing that until I get an accuracy that’s good. Once the accuracy is good enough, then I’ll go ahead and check the stability of that accuracy using the cross-validate module. And if the cross-validation brings me back an unstable model, I’ll go back and I’ll find different features, different cleaning functions, different machine learning algorithms, different parameters, until my model is good and stable. And once my model is stable enough, then I go ahead and train on 100% of my data, and just right-click over here and say “download”. And notice that this does everything in one workflow.
So notice I can train, then check the evaluation and see if I’m happy, and once I’m fully happy with the whole thing, I can just right-click, say download here, and then submit that to my Kaggle competition. And now we’re about out of time.
Thanks for joining me today as we went through an end-to-end solution for the Titanic Kaggle competition in Azure Machine Learning Studio. If you like what you just saw, remember to like this video. It’ll help me produce more videos like this in the future for free, and remember to subscribe to get the latest tutorials. And if you know someone who’s getting into data mining, why don’t you give them this video and help spread the good word of data science to them. All right, now if you do use this experiment, let me know how you did in the comments below. And if you thought of a great way to improve the model - let’s say you used a different cleaning function, a different feature, a different algorithm, a different parameter for the algorithm, or a different methodology altogether - remember that data science is only powerful when it’s collaborative.
So, go ahead and share your ideas and methodologies and help each other out in the comments below. For me in particular, I found that this dataset is best suited for the two-class decision jungle. So my name is Phuc Duong with Data Science Dojo and happy modeling. I’ll see you guys later.