Creating a Titanic Model
The Kaggle competition requires you to create a model from the Titanic data set and submit it. We will show you how to do the Kaggle competition in RStudio.
This series gets you up-to-speed with our Data Science Bootcamp.
What You'll Learn
> Splitting the dataset into train and test sets
> Cleaning the dataset
> Handling missing values
> Categorical casting of columns
> Building a machine learning model
> Improving the accuracy of the model
> Making a Kaggle submission
Hello. My name is Phuc Duong, and I’m here to show you how to do the Kaggle competition in R. OK, so the first thing you want to do is go to Google and type in “Titanic Kaggle.” That takes you directly to the Titanic competition on Kaggle. Now we want to grab the data files that we’re going to be working with.
So, we have the train set; that’s going to be our supervised, labeled data. And then we have the test set; that is our blind holdout data set, where we don’t know if the passengers lived or died. I’m going to save the train set to a working folder called Kaggle, so all of my submissions and project files are going to be in this folder. I’m also going to save the test set there. So, that’s the first thing I’m going to do.
I have two data files. The training set is the one I’m going to use to build a predictive model. And then the test set is basically the one I’m going to score. All right, so let’s do this really quickly.
I have R open - this is RStudio - and I’m going to do the rest of this in RStudio. The first thing I want to do is set a working directory. I can call setwd() directly, or, since I’m in RStudio, I can go to Session, then Set Working Directory, Choose Directory, and select my Kaggle folder. You’ll notice that RStudio automatically typed the setwd() call for me. I’m going to be typing this as a script, basically. So, if you remember, we can just save things, then paste and execute them as needed. Hold on, I need to make a new R script. There we go, all right.
The next thing is, now that I have a working directory - now that it shows I’m in the folder I want to work with - I can use read.csv to read in these data sets. I have two data files, and I’m going to read them in separately. On one hand, I have the training data - I’ll call that titanic.train - and I’ll do a read.csv of it, where the file will equal, in this case, train.csv.
I also want to do something special here: I want to make sure that what’s called stringsAsFactors is set to FALSE. By default, read.csv is going to read in a data file, build a data frame out of it, and convert all the strings into categories. We don’t want that, because we want to do some manipulation first, and manipulating factors is a pain. I’m also going to use a different methodology, where I combine these two data files together and clean them together. If you’ve ever worked in R before, you know very quickly that to combine two data files, the factor levels basically have to line up perfectly. Just to remove that as a barrier, we’ll keep them as strings.
Once we combine the two data files together, we can cast them as factors when we need to. And then, just as a style thing, I like to set header = TRUE. By default, read.csv has header = TRUE, but it’s a good habit I’ve gotten into, because I sometimes use read.table and sometimes read.csv, and then I get confused about whether the first line was read in as a header or not, and my models break. So, it read in 891 rows. I just want to quickly check the tail of this file, so tail of titanic.train. I’m checking the tail of the file because if there was a read-in error earlier in the file, it will ripple through and corrupt everything after it; that’s why I do a tail. If the last line looks OK, then the rest of the file is probably OK. For now, it seems to have read in correctly.
I’m going to go ahead and do the same for the other one, but instead of titanic.train, I’ll call it titanic.test, and I’ll do a read.csv of test.csv. We’ll execute that. I’m executing the line from the script using Ctrl+Enter in RStudio; notice that the minute I press Ctrl+Enter, it copies the code and executes it in the console. Also note that I have 418 observations in the test set. This is the thing we’re trying to predict. So if I do an str of titanic.test, we should find that Survived is missing: between PassengerId and Pclass, there should have been a Survived column, but there isn’t. That’s what we intended.
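Here is a minimal sketch of those read-in steps (the working-directory path is an example; point it at wherever you saved the Kaggle files):

```r
# Point R at the folder where the Kaggle files were saved
# (example path -- substitute your own Kaggle folder)
setwd("~/Kaggle")

# Read both files; keep strings as plain characters so the two
# data frames can be combined later without factor-level mismatches
titanic.train <- read.csv(file = "train.csv",
                          stringsAsFactors = FALSE,
                          header = TRUE)
titanic.test <- read.csv(file = "test.csv",
                         stringsAsFactors = FALSE,
                         header = TRUE)

# Sanity check: if the last rows look right, the read-in
# probably succeeded all the way through the file
tail(titanic.train)
str(titanic.test)   # note: no Survived column here
```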
Now I’m going to combine these two files together, because I want to clean them together. There’s nothing stopping you from cleaning them separately; you just have to run each function twice. So, if you clean with the median on the training set, you have to find out what that median is and then hard-code the replacement on the test set. If the median was 29 on the training set, you’d have to save that 29 and insert it into the test set by hand. If I combine them instead, I can get the global median, which might be different from the median of either set alone.
Let’s use median() to quickly figure out the median of, for example, Age. I have to set na.rm = TRUE, which tells it to calculate the median in the absence of missing values - I think there are 177 missing values. So, it calculated the median of the non-missing values: 28. That is the median of the training set.
Now, the median of the test set might be different. Let’s try that real quick - and yes, it is different. Notice that the medians of the two data sets differ. When I combine them, there might be a third, different median as well: what’s known as the global median. So, now I need to combine these two data files together.
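Those two median checks look like this; na.rm = TRUE tells median() to ignore NAs (on the actual Kaggle files, the training-set value comes out to 28):

```r
# Median age of each file separately, ignoring missing values.
# Without na.rm = TRUE, median() returns NA when NAs are present.
median(titanic.train$Age, na.rm = TRUE)   # training-set median
median(titanic.test$Age, na.rm = TRUE)    # test-set median (slightly different)
```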
When I combine these two data files together and later want to split them apart, I’ll need a way to tell whether a row is part of the training set or part of the test set. Now, one could very easily just check whether PassengerId is greater than 891 or not. But you want to get into best practices, in case there is no PassengerId in the data set you’re working with. Normally, I like to create my own column and fill it with TRUE or FALSE: TRUE if the row is part of the training set, FALSE if it’s part of the test set. In this case, I’ll take titanic.train and call a brand-new column that doesn’t exist yet. R is going to go ahead and create that column for us, and we’re going to fill the entire column with TRUE.
So, we’ve marked 891 rows here as TRUE. If I check the tail of titanic.train$isTrainSet, you’ll notice that everything is TRUE, and that’s what we want. We’ll do the same thing for the test set, but we’ll flip it and say FALSE. Later on, we’ll just do a quick check for TRUE or FALSE, and split it that way.
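The flag columns described above can be sketched as:

```r
# Flag each row with its origin so the combined file
# can be split apart again later
titanic.train$isTrainSet <- TRUE
titanic.test$isTrainSet <- FALSE

tail(titanic.train$isTrainSet)   # all TRUE
tail(titanic.test$isTrainSet)    # all FALSE
```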
Now that I have my label set on both the training set and the test set, I can go ahead and combine them. However, when I combine them, the columns actually have to line up. If I do an ncol of titanic.train, I’ll find there are 13 columns. And if I do an ncol of titanic.test, I’ll notice there are 12 columns. Already there is one column missing, and that’s Survived. So, first, I have to make sure that a Survived column exists in the test set. And secondly, I have to make sure that the column names - the headers - line up as well, meaning they’re spelled exactly the same. If I do names of titanic.train and names of titanic.test, we can go ahead and compare the two.
Check the spelling - basically, capital-S Sex, capital-P Parch, et cetera. It just so happens that they are the same. You can check if you want to, but I know they’re the same. So, no, we don’t have to line up the headers in this case, but we do have to add a column called Survived to the test set, which doesn’t have one. To add a Survived column, we’re going to take titanic.test and call Survived on it.
Notice that there is no column called Survived; we’re calling a column that doesn’t exist right now. But R is smart enough that if you call a column that doesn’t exist and fill it with something - in this case, NA, meaning not available - it creates it. In the test set, we’re going to build an entire column and fill it with NA: basically 418 rows of NA for Survived. We’ll go ahead and execute that. Now, ncol of titanic.test gives 13, and names of titanic.test shows that there is now a column called Survived.
The next thing is, we’re going to combine these two data sets into one, now that the headers and columns line up. I’m going to call the new data set titanic.full, and I’m going to do what’s called an rbind - a simple row bind. If you’re more used to SQL, this is called a union. Basically, we take the first data set, titanic.train, and stack the second one, titanic.test, underneath it - a vertical join on these two data sets. So, titanic.full has 1,309 rows. Just to check the math: 891 plus 418 is 1,309, so that’s fine; no rows got skipped. I also want to check the tail of this file just to be sure everything came out fine.
Did you notice that in the tail, isTrainSet is all FALSE? That’s correct, because the data is still ordered, with the test set at the bottom. Just to double-check real quickly: if I do a table of isTrainSet, I should have 891 TRUEs and 418 FALSEs. If I do table of titanic.full$isTrainSet, I find exactly that - 891 TRUE and 418 FALSE. This is exactly what I want, so later on I can split them apart.
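Putting the last few steps together, the combine looks like this:

```r
# The test set has no Survived column, so create one filled with NA;
# both data frames then have identical columns before binding
titanic.test$Survived <- NA

# Stack the two data frames vertically (a SQL-style union)
titanic.full <- rbind(titanic.train, titanic.test)

nrow(titanic.full)               # 891 + 418 = 1309
table(titanic.full$isTrainSet)   # 418 FALSE, 891 TRUE
```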
The next thing is, there are some missing values in this dataset. If I quickly do a table of titanic.full of Embarked, I can see that there is a category of basically an empty string. There’s C, Q, S, and an empty string, and there’re two of them. I’ll go ahead and clean the missing values of this real quick.
Now, what I’m going to show you is not, I would say, the optimal cleaning method. It is a cleaning method, but mainly we need to clean the data so we can build a model, because the model is not going to like null values. What I’ll do is quickly build a filter: titanic.full$Embarked equal-equal to double quotes, the double quotes being an empty string. If I run this, I get a series of TRUEs and FALSEs, and somewhere in there should be two TRUEs. So, that is my filter. I want to query just the Embarked column of titanic.full: the rows where Embarked equals the empty string, with only Embarked coming back. I should get two values back, and they should be empty - let’s just run it real quick.
Once I’ve selected these two values, I’m going to replace them with something. If I do table of titanic.full$Embarked again, let’s figure out what the mode is and just replace the blanks with the mode - that’s the quick and easy way of doing it. In this case, it’s S, so I’m going to replace them with S. Let’s see if that did exactly what we wanted: if I do a table again, the blanks should be gone now. You notice that those two blanks have been added to S: S had 914 before, and now it has 916.
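As a sketch, the Embarked cleanup is:

```r
# Two rows have an empty string for Embarked; inspect the categories
table(titanic.full$Embarked)   # shows "", C, Q, S

# Replace the blanks with the mode (most common port), which is S
titanic.full[titanic.full$Embarked == "", "Embarked"] <- "S"

table(titanic.full$Embarked)   # the two blanks now count toward S
```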
The next thing is, I think Age had missing values. If I do an is.na of titanic.full$Age, I should get a pretty sizable amount. If I do a quick table of that, I get a count of TRUEs versus FALSEs: 1,046 FALSE and 263 TRUE, so there are a lot of missing values. The training set alone has 177, so the test set brought another 86 missing values with it. That tells me that how we clean Age is going to be extremely important. But I’ll leave that as your homework for day four, which is tomorrow. For now, let’s just replace the missing values with the median - the quick and easy way of doing it.
So, this right here represents a query of TRUE/FALSE values, TRUE where the value is NA - that is already our filter. Next, we only want to query the Age column, of titanic.full, and replace everything with the median. Before I run this query, I’m going to define what the median is. If I just did a quick median of titanic.full$Age right now, it would break, because there are missing values. I have to add na.rm = TRUE, telling it to calculate the median in the absence of missing values. Notice that it’s also 28 - luckily, the global median is the same as the training-set median. I’m going to assign that to age.median.
Now, I could have very simply done this query of the missing values of Age - I get those 263 rows back where Age is missing - and said: replace everything with 28. I didn’t have to go through this longer calculation in code. But if we’re writing a script that runs as an automated process - say this were sales data - the median tomorrow might be different from the median today. I want to build my script in an extendable manner; that’s why I use a variable calculated from the data itself. If I run this line, it calculates the median and inserts it into the missing values of Age. If I run the is.na query again - TRUEs where there is a missing value - and do a table of it, there should be no TRUEs. Basically, all the missing values have been filled in. That’s that one.
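The Age imputation just described can be sketched as:

```r
# Impute missing ages with the global median, computed from the data
# rather than hard-coded as 28, so the script stays reusable
age.median <- median(titanic.full$Age, na.rm = TRUE)
titanic.full[is.na(titanic.full$Age), "Age"] <- age.median

table(is.na(titanic.full$Age))   # should be all FALSE now
```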
There is another column with missing values that needs the same treatment. I think it was Fare. If I run titanic.full$Fare, that gives me the fares, and is.na turns them into TRUE and FALSE, TRUE where the value is NA. If I wrap that entirely in table, I get the counts of TRUE and FALSE. Notice that there is one missing value of Fare. We’ll fill it in really quickly with the median again. For day four, your homework is to build a predictive model that uses regression to predict the missing value of Fare, but for now, let’s just do it with the median.
So we’ll grab the line we wrote up above where we calculated the median, but instead of Age, we’ll take the median of Fare and call it fare.median. The median fare is $14.45. I’ll also apply the same replacement strategy as before: copy and paste the code from above, change Age to Fare, and use fare.median. That will replace the missing Fare. I’m going to push up a couple of times to rerun the query counting missing values of Fare, and notice that that one TRUE is gone - it has been replaced. Now, we are ready to build a predictive model. But before we can even do that, we need to split our data back out into the train and test sets.
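The Fare replacement mirrors the Age step:

```r
# Same replacement strategy: one missing Fare, filled with the median
fare.median <- median(titanic.full$Fare, na.rm = TRUE)
titanic.full[is.na(titanic.full$Fare), "Fare"] <- fare.median

table(is.na(titanic.full$Fare))   # no TRUEs left
```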
If you remember, we can do a query now: titanic.full$isTrainSet equal-equal to TRUE. This query finds the training set - the 891 rows. If I subset titanic.full with it, I can assign the result back into titanic.train. With any luck, this gives me back the 891 rows I wanted, but this time they have been cleaned. I’m going to add that to the script up here. Then I’ll do the same thing for the test set: the same query, except FALSE. Now, for the programmers out there, I also could have flipped the TRUEs to FALSEs by adding an exclamation mark here - the NOT operator. But if you don’t understand what that means, ignore it; just write equal-equal to FALSE. And instead of train, I’ll say test, and assign it into the test set. So, I’ve gone ahead and run that.
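As a sketch, the split back out looks like this:

```r
# Split the cleaned data back into its original halves
# using the flag column we created before combining
titanic.train <- titanic.full[titanic.full$isTrainSet == TRUE, ]
titanic.test <- titanic.full[titanic.full$isTrainSet == FALSE, ]

nrow(titanic.train)   # 891
nrow(titanic.test)    # 418
```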
Actually, before doing that, we should have cast everything we needed into categories. We forgot to do that - well, I forgot to do that. There were a couple of things; let me go back to titanic.full. Because it’s in a script, I can rerun it later, so it’s fine. I’m going to insert some lines where I do the categorical casting, and add some comments: this one says, “split data set back out into train and test,” and this one, “clean missing values of fare.” And before the split, we’re going to do the categorical casting.
Now, we’re going to do categorical casting for every column except Survived, because if we cast Survived now, there are actually three distinct values in it. Let me just show you: titanic.full$Survived has 0s and 1s, and then a bunch of NAs at the very end, from the test set.
If we did the categorical casting now, we would lose the binary classification we had before - we would actually have three classes: NA, 0, and 1. So what we need to do is cast everything else except Survived. If I quickly do an str of titanic.full, we can see the columns we have at our disposal. We’re going to use as.factor no matter what, so I’m going to copy that, because we’ll need it again: as.factor of titanic.full$Pclass - I think Pclass needs to be a category. And, by the way, you should really convert Pclass to an ordinal category, using ordered() - but I’ll let you figure out how to do that; you have to pass it the order.
Next, we’ll also cast Sex into a factor. And going down further, Embarked should definitely be a factor. Also keep in mind, a case could be made for SibSp (siblings/spouses) and Parch (parents/children) to be ordinal categories; I’ll let you experiment with that to see if it improves the performance of the model. And also, we can’t just cast them into factors - we have to actually assign them back into the data itself.
I’ve got to take the same column and assign it back into itself: after the factor has been cast, I load it back in. I’m going to run these three lines to do my categorical casting, then str of titanic.full to see the structure. Notice that Embarked is now a factor with only three levels; before the cleaning, there would have been four, the fourth being the empty-string one. And notice that Sex and Pclass are factors too. Awesome. Now we can run those two split lines again and split the data back up.
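Those three casting lines can be sketched as:

```r
# Cast the categorical columns and assign each back in place.
# Survived is deliberately left alone until after the split.
titanic.full$Pclass <- as.factor(titanic.full$Pclass)
titanic.full$Sex <- as.factor(titanic.full$Sex)
titanic.full$Embarked <- as.factor(titanic.full$Embarked)

str(titanic.full)   # Embarked now has 3 levels, not 4
```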
I’m going to rerun the split, and it will keep the factorization. If I do an str of titanic.train now, we’ll notice that the factors have been retained, with the same levels and the same types. That’s nice; that’s what we wanted. Now we’re in a position to build a predictive model - but first, you’ll remember, we have to cast Survived back into a category. So we take titanic.train$Survived and cast it into a factor with as.factor. Notice I’m doing this after I’ve split my dataset apart, because I don’t want that NA to be inside the Survived category.
So, titanic.train$Survived here - that casts Survived into a category, which tells R that, yes, this is a binary classification problem. It’s not going to try regression, and it’s not going to try multiclass classification. Now I’m going to show you something new.
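The cast just described, done only on the training half:

```r
# Cast Survived only AFTER the split, so the NA placeholders
# in the test set don't become a third factor level
titanic.train$Survived <- as.factor(titanic.train$Survived)

levels(titanic.train$Survived)   # "0" "1" -- a binary target
```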
Up until this point in the bootcamp, you’ve probably just dropped the columns you didn’t need before building your predictive model. Now I’m going to show you how to mark, for R, which columns are predictors and which ones to ignore when you feed data into a model. This is useful because we need to keep PassengerId in the test set - we can’t drop it, because Kaggle needs it in the submission. So I’m going to show you how to define, in R, exactly what to use.
In this case, we’re actually going to build a formula. If you remember, with randomForest, the usual formula was just Survived against everything else - we told it to use everything except Survived to predict Survived, and that required us to drop everything we didn’t need. That’s not the form we want. So now I will explicitly call out which columns I want to build the predictive model from.
If I do an str of titanic.train, I can figure out what my predictors need to be. Notice, I’m not going to use PassengerId, but Pclass is one of the ones I want. Basically, we’re going to work with a string, and we want to tell it to predict Survived. Actually, I’m going to copy and paste the column names - that’s much safer, because if I mess up a column name, I’ll have to debug it. So: Pclass, plus Sex, plus Age - the plus sign is required; it tells R to use that column - plus SibSp, plus Parch, plus Fare, plus Embarked. We won’t use Cabin for now. So there it is. That is our survived.equation.
We’ll go ahead and say that this will be our equation before we feed it into the model. But we also need to cast it as a formula, because R expects this in a data type called a formula. So as.formula will convert the Survived equation, and we’ll assign it to a new variable called survived.formula. That builds us a set of relationships: predict Survived given these columns. Now I can do install.packages of randomForest to install the random forest package, and then library of randomForest. Awesome - now I can actually call randomForest.
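The equation-and-formula step can be sketched as:

```r
# Spell out the predictors explicitly instead of dropping columns;
# PassengerId and Cabin are deliberately left out
survived.equation <- "Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked"

# R models expect a formula object, not a string
survived.formula <- as.formula(survived.equation)
```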
So: randomForest, where the formula is my survived.formula - that defines the relationships, predict Survived given Pclass, Sex, et cetera. The data it will train on is titanic.train. Notice I’m skipping the 70/30 split here, and I’m also skipping cross-validation. You should definitely be doing those on your data sets; I’m simply showing you how to build a predictive model and submit it to Kaggle. Then I’ll set ntree equal to 500 - I think that’s the default - and mtry equal to 3: the square root of 7 predictors is about 2.6, so we round up to 3. And for node size - basically the minimum samples per node - I like it to be at least 1% of the size of my training set. That’s 891 rows, so it needs to see around 8 or 9 samples before it considers a split.
We’ll go ahead and build this, and we’ll call it titanic.model. There it is - it built me a predictive model. Now I need to apply it. We can also specify features here, so I’ll define which features are being used, the same way as before - except there’s no Survived this time, so I’ll remove it. It’s very important to define this, because we want to keep PassengerId. In this case, we’ll call it features.equation.
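A sketch of the model call described above (the nodesize value here is the "1% of the training set" heuristic from the transcript, not a randomForest default):

```r
# install.packages("randomForest")   # one-time install
library(randomForest)

titanic.model <- randomForest(formula = survived.formula,
                              data = titanic.train,
                              ntree = 500,      # number of trees
                              mtry = 3,         # ceiling(sqrt(7 predictors))
                              nodesize = 0.01 * nrow(titanic.train))
```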
Now we can use the predict function. I can run a prediction from my model - titanic.model, the titanic random forest model. The new data, the data I’m going to predict on, will be the titanic.test set. This is going to score each row, and then - hold on just a second, there’s something I’m missing. Oh right, I need to assign it. These are just predictions; I’m going to call the result Survived, because that’s going to be the name of the column. We’re going to do a cbind later, and it takes the variable name as the column name. So, I’m cheating here.
Technically, it should be called titanic.predictions, but I’m skipping that step for now. “Survived not found” - hold on just a second. Oh, that’s because I selected only part of the line; if I run the entire line, it works. If I type Survived now, I get a bunch of 0s and 1s for whether these people lived or died. It went through my random forest and scored them, which is really, really cool. This is what Kaggle wants from us - these 0s and 1s.
The next part is that I need to build a data frame and write it out as a CSV. The CSV only needs two columns: PassengerId and Survived. We’ll do that real quick. Let’s isolate PassengerId - titanic.test$PassengerId - and throw it into a variable called PassengerId. Notice I give it the same name, because later, when we cbind these two things together, it takes the variable name as the column name, so I won’t have to rename anything. And Kaggle wants the names in a very particular way.
It wants capital-P Passenger, and then Id with a capital I - that’s why I’m naming it that way. Now I want to convert that into a data frame. So as.data.frame will give me the initial data frame I’m going to submit, with the PassengerId vector thrown into it. I’ll call it output.df - the output data frame. Right now it’s a data frame with only one column in it.
Now, if you remember, if we call a column on output.df that doesn’t exist, such as Survived, we can assign the Survived vector - basically all the 0/1 predictions - into it as a second column. If we do a tail of output.df, we can see PassengerId and Survived next to each other, and this is what Kaggle wants.
I’m going to do a write.csv of output.df - write this data frame out to a file, where file equals the name you choose. I’m going to call it kaggle_submission.csv, and I get to use just the bare file name because I already set my working directory at the very beginning, remember? It will write into that same directory. And there’s something unintuitive here: we have to set row.names equal to FALSE, because if we don’t, it’s going to write the row-number column into the file.
That column right there - see 413, 414, 415 - would be written into the CSV by default, and we don’t want that. So if I run this line and check my folder, there’s a brand new file. If I open it up, we can see it came out correctly - beautiful. Now it’s time to submit this to Kaggle. I’m going to go to Make a Submission and upload that file.
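Pulling the scoring and write-out steps together into one sketch:

```r
# Score the test set; the variable name Survived becomes
# the column name when it's added to the data frame
Survived <- predict(titanic.model, newdata = titanic.test)

# Build the two-column submission: PassengerId + Survived
PassengerId <- titanic.test$PassengerId
output.df <- as.data.frame(PassengerId)
output.df$Survived <- Survived

# row.names = FALSE keeps the row-number column out of the CSV,
# which is the format Kaggle expects
write.csv(output.df, file = "kaggle_submission.csv", row.names = FALSE)
```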
Kaggle submission - before, I think I submitted a model where everyone dies, and I got 62%. Let’s see how well I do today. Keep in mind, this model is probably not great, because I cleaned everything with the overall median. Remember, the medians differ across genders and across Pclasses. Maybe, once you learn regression on day four, build a regression model - a predictive model - to predict the missing values of Age, the missing values of Fare, et cetera.
Look at that! That’s awesome. My model boosted me up to 77% accuracy. If you remember, my rank was 5,000-something; I jumped up about 1,000 places with the help of this random forest. And notice, I haven’t done parameter tuning, cross-validation, or the 70/30 split, which you should. So, this is definitely not the best model I could have built by any means, but that is your homework.
That concludes our quick data science session here. If you go to our GitHub repository, I’m going to post the solution to this - basically, the script I’ve been working on here with you - as an R file, so you can follow along even if you don’t want to follow along with the video. Go to github.com/datasciencedojo/bootcamp, under homework solutions, where I’ve posted the Kaggle Titanic example .R file. Also keep in mind that I’ll show you how to do this in Azure ML tomorrow - so if you like Azure ML, that’s coming. Just remember, the Kaggle competition ends at 1:00 p.m., after lunch, on Friday. All right, happy modeling.
Phuc H Duong - Phuc holds a Bachelor’s degree in Business with a focus on Information Systems and Accounting from the University of Washington.
© Copyright – Data Science Dojo