Hello! All right, my name is Phuc Duong. I’m the senior data engineer at Data Science Dojo, and I’m here to walk you through day two’s homework. I hope you’ve been enjoying the bootcamp so far. OK, so quickly just go to portal.datasciencedojo.com. All the homework is laid out there for you. If you prefer video, this video walks you through the homework. But just so you know, the homework is elaborated on in both sections.
So, there are two parts to the homework. The first part of the homework is to apply what you learned today. So basically, take the Titanic dataset and apply a predictive model to it. Go ahead and use rpart. Or, if you’ve gotten to random forest, go ahead and use a random forest model. That’s the first part of the homework. The second part of the homework is to actually enter a Kaggle competition. Both parts of the homework are elaborated here. I’m just going to show you how to do that Kaggle competition real quick.
OK, so this really is your data science capstone project for this course. Basically, you’ll be working from Tuesday all the way to Friday to perfect your model, and by the end of Friday you’ll be ranked among your peers, your peers being basically everyone at the bootcamp. There are prizes on the line, and I’ll talk more about the prizes later. But for now, this whole page talks you through how to create a Kaggle account, how to submit, and how to do all that good stuff. I will also talk you through that here. OK, so what you want to do is Google Kaggle Titanic, OK?
So, notice that we’ve actually had you in a Kaggle competition since day one. That Titanic dataset actually comes from this Kaggle competition. And what is Kaggle? Well, Kaggle is a crowdsourced way of doing data science. Real companies like Home Depot, Liberty Mutual, Allstate, and Netflix come together and post real datasets. Each of these real datasets comes with a data mining problem, and as you work on these problems you’re ranked among your peers on what are called leaderboards, OK? So the Titanic competition is basically the introductory Kaggle competition, the homework that we’ll do together. And then, if you go to Data here, there are a bunch of datasets associated with this Kaggle competition. You’ll notice there is a train.csv, and let me tell you what that is really quick.
So, you notice that, throughout this Kaggle competition, you’ve been given this dataset with 891 rows, right? This is the training set. This is the set that you’ve been working with, although some of you should have been kind of suspicious if you’ve been paying attention to history. The Titanic actually carried about 2,000 people, yet we only have 891 passengers. I wonder where the rest of the passengers went? Well, it turns out Kaggle is withholding other passengers in this test set. So your homework, your capstone, is to build a predictive model on this training set and apply it to this test set. I’m going to go ahead and download this test set so we can see what is inside of it, OK?
So, if you open up this test set, you will notice that the passenger ID starts at 892. These are the remaining passengers that were on the Titanic. But you’ll also notice that we have one fewer column: Survived is now missing. That is your job, OK? You’re supposed to predict whether each of these people survived or died. Notice that that is all the Kaggle competition is. Kaggle wants your answers. They want to know whether individual passengers lived or died. For example, passenger 897: did they live or die, based upon the demographic attributes that are going to be read in by your predictive model? So I’m going to show you real quick how to submit to Kaggle, just as an introduction.
So, for tonight’s homework, you don’t need to hook up a predictive model and submit it to Kaggle; you just need to make a submission. And I’m going to show you how to submit. Kaggle wants two things from you. It wants the passenger ID, and it wants Survived: did the person corresponding to that passenger ID live or die? So Kaggle just wants two columns from you. The other columns that are here are irrelevant, so we’re going to delete them. Kaggle wants a column called PassengerId; notice that the P is capitalized and the I is capitalized, and it’s one word. And it also wants a column called Survived. Notice that it’s past tense, and there’s a capital S. Kaggle will check for that. And we’re going to build a very simple model, a model where everyone dies.
So, you notice that, if everyone dies, this is basically not even a predictive model. We’re just going to say, if you step on the boat you will die. But if you remember from day one, when we did exploration and looked at the class distribution of survived versus died, we noticed that there was about a 62% chance of death just by stepping on the boat. So by saying everyone died, we would actually expect to do better than a coin flip, better than 50%. So I’m going to go ahead and say everyone dies here, and I’m going to save that as a CSV. I’m going to save this as my own model, everyone dies.csv.
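The everyone-dies file described above can also be produced with a few lines of code instead of by hand in a spreadsheet. This is only a minimal sketch in Python using the standard csv module. The function names are made up for illustration, the three passenger IDs are placeholders (in practice they would be read from test.csv), and it assumes that 0 is the value that means "did not survive".

```python
import csv
import io


def everyone_dies_rows(passenger_ids):
    """Pair each passenger ID with a prediction of 0 (assumed to mean 'died')."""
    return [(pid, 0) for pid in passenger_ids]


def write_submission(rows, out):
    """Write the two-column file: the header row, then one row per passenger."""
    writer = csv.writer(out, lineterminator="\n")
    # Column names must match exactly, including capitalization.
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(rows)


# Placeholder IDs for illustration only; real IDs come from test.csv.
demo_ids = [892, 893, 894]

# Write to an in-memory buffer so the sketch runs without touching disk;
# passing an open file handle instead would produce the actual CSV file.
buffer = io.StringIO()
write_submission(everyone_dies_rows(demo_ids), buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

Swapping everyone_dies_rows for the output of a real model is the only change needed later in the week; the file layout stays the same.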
All right, and what you need to do is go to Kaggle and upload this file. So go ahead and make a submission. There’s a Make Submission button here. Click on Make Submission, and then we’ll upload a submission in here. So everyone dies.csv, and we’ll go ahead and submit that. All right, notice that we don’t even give Kaggle our predictive model. We just give Kaggle the answers. That means we can build a predictive model in Python, Azure ML, it doesn’t matter. The competition is agnostic to the tools you use. They only care about your answers. And notice that we are just submitting predictions to Kaggle. Kaggle is going to score this, and Kaggle is going to be able to give you an accuracy out of it. That’s because they actually hold the true labels.
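To make the scoring idea concrete, here is a rough sketch in Python of how submitted predictions can be compared with held-back true labels: count the four predicted-versus-actual combinations (a confusion matrix), then take the fraction that were correct. The function names and the five toy labels are invented for illustration, and this is not a description of how Kaggle implements its scoring internally.

```python
def confusion_counts(predicted, actual):
    """Count each (predicted, actual) combination, assuming labels are 0 or 1."""
    counts = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for p, a in zip(predicted, actual):
        if p == 1 and a == 1:
            counts["true_pos"] += 1
        elif p == 1 and a == 0:
            counts["false_pos"] += 1
        elif p == 0 and a == 0:
            counts["true_neg"] += 1
        else:  # predicted 0, actual 1
            counts["false_neg"] += 1
    return counts


def accuracy(counts):
    """Accuracy = correct predictions / all predictions."""
    correct = counts["true_pos"] + counts["true_neg"]
    return correct / sum(counts.values())


# Toy labels for illustration only: 3 of 5 passengers died (label 0).
toy_actual = [0, 0, 0, 1, 1]
# The "everyone dies" model predicts 0 for every passenger.
toy_predicted = [0] * len(toy_actual)

counts = confusion_counts(toy_predicted, toy_actual)
print(counts, accuracy(counts))
```

With an everyone-dies model, the accuracy is simply the share of passengers who actually died, which is why that baseline lands near the class distribution seen during exploration.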
Kaggle actually knows whether each person lived or died. And if you remember from evaluation, if you compare predicted versus actual, you get a confusion matrix. So, you submitted predictions. Kaggle has the actual values. From that, Kaggle builds a confusion matrix. From the confusion matrix, you get accuracy. And notice that it spits out an accuracy for me: it says my submission got 62% accuracy, and I rank 5,517 in the world. All right, so for the capstone, we’re going to enter all of you into a Kaggle competition within the class, OK? To enter yourself into this competition, copy the name that appears on the Kaggle leaderboard. Notice that I’m Phuc H Duong, so I’ll copy my username as Phuc H Duong. Then I’ll go back to that Kaggle submission homework and paste it into this form down here.
So, this form down here will enter your Kaggle username into our internal leaderboard. On Friday, after lunch, we’re going to end the Kaggle competition, and the first-place winner, basically the person who ranks the highest, will get a prize. The prize will be an advanced statistical R book, and it’s a very good book. If you want to do some of the more advanced data mining processes in R, that’s in there. And notice that we can only teach you so much. That book actually contains a lot of the other stuff that we couldn’t teach you. For example, there is more than one way to cross-validate, right? We taught you just k-fold cross-validation, but there’s also leave-one-out cross-validation, right? So there are four other ways to cross-validate that we were not able to cover in class, and that book covers them. And then the second- and third-place winners will get an O’Reilly book called Doing Data Science.
I also really enjoy that book. I was raised by O’Reilly, and hopefully you will be as well. OK, and yes, I know you can buy these books, and I know you can just kind of pass this off, but this is really important. You want to do this Kaggle competition and be able to ask the instructor questions while you’re still in class, right? Because there are actually a lot of small steps along the way that might trip you up when you go back to work and try to work on your own Kaggle competition or your own datasets, OK? But more importantly, your honor is on the line. You have to defend your honor. And you will get big bragging rights from all of this, OK? All right, now, happy modeling.