Creating a Titanic Model (Cont.)
This tutorial continues the Kaggle competition for the Titanic dataset in RStudio, with more advanced cleaning functions for your model.
What You'll Learn
> Building a predictive model to clean the missing values of our data
> Learning different methods to find outliers in a dataset and how to filter the outliers
> Preparing the titanic dataset for building a predictive model
This video assumes you have watched Creating a Titanic Model in R Part 1.
Download RStudio from here.
The data set used in this tutorial can be accessed here.
Hello. This is Phuc Duong again. This is part two of how to do a Kaggle competition in R.
In the last video, I showed you how to do a very simple model in R and submit it to Kaggle but we also used some very subpar cleaning functions. This video is going to depend on basically the codebase from that previous video. If you have not watched that video yet, you’re going to be very confused. Click here on the screen to watch the previous video and then come back. I’m going to wait a little bit so you guys can click it. All right. If you’re still here, then that means that you are ready for part two of this video.
I have saved what I’ve written in the last video as a script, so I’m just going to rerun all of it. Well, I don’t need to rewrite any of it, just run all of it, okay? And maybe I shouldn’t have run install.packages again, but that’s fine.
All right, so we’re going to go back up to where we cleaned the data. Where did we clean the data? We cleaned the data somewhere up here. So notice that we cleaned age with the median. We also cleaned fare with the median. That is really bad. It’s very sub-optimal because if you actually do bucketing and segmentation, you’ll find that the fares of different P-classes are different. The median fares of different P-classes are different, and the fares of different genders are different. You can even stack that. Maybe the fare of females in the third class is higher than the median fare of females in the first class. So you just start stacking a lot of things and then the median will start to change dramatically.
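To see why one blanket median is a weak guess, here is a minimal sketch of the bucketing idea. The `toy` data frame and its values are made up for illustration, and the column names `Pclass`, `Sex`, and `Fare` assume the capitalization of the Kaggle CSV.

```r
# Toy stand-in for titanic.full (made-up values, for illustration only)
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 3, 3, 3),
  Sex    = c("female", "male", "female", "male", "female", "male", "female", "male"),
  Fare   = c(80, 52, 120, 13, 26, 7.25, 8.05, 7.9)
)

# One global median hides the differences between segments
global.median <- median(toy$Fare, na.rm = TRUE)

# Median fare per class, then per class-and-sex combination ("stacking")
median.by.class     <- aggregate(Fare ~ Pclass, data = toy, FUN = median)
median.by.class.sex <- aggregate(Fare ~ Pclass + Sex, data = toy, FUN = median)

print(global.median)
print(median.by.class)
print(median.by.class.sex)
```

On this toy data the per-class medians sit far away from the global one, which is the effect the segmentation argument is about.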
Let us build a predictive model to actually clean the missing values of our data. Let’s not just make blanket assumptions on the data set by filling in with the median. Let’s take a much more educated guess, and then we’ll feed that into the actual predictive model. Notice that we’re building a predictive model to clean missing data so that we can actually get a more accurate predictive model on all of it. Over here, I’m going to go ahead and comment out this line. This is what we don’t want to do, these two lines. I no longer want to clean with the median. Now, this is an example for the fare but you can apply this for age or anything else that is numeric. Last time I already showed you how to build a classification model with a random forest. You would apply the same concept.
In this case, I want to build a regression model to predict fare. Now, this is also kind of a waste, because, actually, if you look at it there’s only - if I run this line, which is basically going to query all the missing values of fare, and there is actually only going to be - Oh. In this case, I ran the script through and it went ahead and it cleaned the missing values for me. I don’t want to do that. I want to run the script and have it stop right here, right before it cleans fare. So, I’m going to go ahead and, see this brush up here? It’s going to clear all of my objects so I can restart again fresh. I’m going to select everything up until that point, and I’m going to run it. Notice that it cleaned embarked and it cleaned age with the median. I still have to clean those missing values or else the model’s not going to like that I have missing values. And I also have to do this categorical casting here, which I will do, actually, now. All right.
Now, what do we want to do? All right, let’s build a predictive model to predict the fare, and because our response is going to be numeric it will be a regression model. And if you look here - if I run this statement which basically finds me all the missing values of fare, there’s only one. So, this is going to be a very wasteful predictive model, in the sense that we’re going to build a predictive model just to predict one row.
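The "find me all the missing values of fare" statement can be sketched like this. The `toy` data frame is a made-up stand-in for `titanic.full`, and the `Fare` capitalization assumes the Kaggle CSV column name.

```r
# Toy stand-in for titanic.full with a single missing fare
toy <- data.frame(
  PassengerId = 1:5,
  Fare        = c(7.25, 71.28, NA, 8.05, 53.1)
)

# Logical vector: TRUE where the fare is missing
missing.fare <- is.na(toy$Fare)

# How many rows are missing, and which ones
n.missing    <- sum(missing.fare)
missing.rows <- toy[missing.fare, ]

print(n.missing)
print(missing.rows)
```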
Now I’m only showing you that because your homework should actually be: how do I build a predictive model to predict age? Age has 200 and something missing values. That would be much more useful there. So I’ll let you guys do that for homework but the idea is, you can take this code base and convert it into an age predictor very easily. You just switch out the names of the columns. So let’s start. All right.
If we look at Titanic fare, we can build a linear regression model. So I can easily just call this lm function but, here’s a big but, there are two types of linear regression models. There is an online gradient descent variant, which we’re not going to cover in this video. But there’s also another version which is an ordinary least squares linear regression model. That’s a very simple one that I tend to like to go with. So, that one is very susceptible to outliers. Before we do this linear model, we have to filter the outliers.
If we simply do a box plot of titanic.full$fare. All right, look at that. So, anything beyond this upper whisker is going to be considered an outlier to this model. So with that in mind, we want to filter these guys out. We want to build a linear regression model just based on these guys. So these outliers, we don’t want them because if we built a model to predict on them, this guy, for example, would completely throw off our model and might pull up our regression line, and it might seem like everyone is artificially richer than they actually are or paying more for a fare when they actually are not.
So, how do I get this cutoff? What is the value of this whisker? So if I move over, I can kind of guess it and say that, okay, if you paid more than, let’s say, $77 for a fare - and this is me guessing, I’m eyeballing, right? - we would go ahead and filter it out. But as it turns out, R actually stores it. So if I just do boxplot.stats, I can actually figure that out. So basically, titanic.full$fare again.
Notice that this brings me back out all the stats that I can ever want. Now, there is one stat that I want, which is this guy right here, 65. See that? This tells me, basically, the first whisker, the first quartile, the median, the third quartile, and the last whisker. Anyone who paid more than $65 for a fare would go ahead and be, in this case, an outlier. And we would filter those guys out.
Now, I can very quickly and very easily just build a filter for that. I could do titanic.full$fare is less than or equal to 65. Now that builds me my filter right away. And I filtered out the outliers. But that is not how we build scripts. Because if this was sales data, that whisker might be moving. The upper bound might be moving on us. Tomorrow’s sales data could change things. So let’s have the script actually derive what that is.
If I just type in that same command, boxplot.stats, notice that if I hit enter here, it brings me... Actually, see this dollar? That means I can reference these things. So, let’s see what happens. If I want something in stats, I would call the dollar sign of stats here. So, a dollar of stats. Notice I get this back and this is a vector. I can index that, and I can get the fifth element back. And I get 65 back. So this gives me my upper bound. I’ll say upper.whisker is equal to the fifth element of stats. See that? All right, cool. So that number is equal to 65.
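Deriving the upper whisker instead of hard-coding it can be sketched as follows. The `fares` vector is made up so the snippet runs on its own; on the real data you would pass the fare column of `titanic.full` instead.

```r
# Made-up fares with two obvious high outliers
fares <- c(0, 5, 7, 8, 10, 13, 15, 26, 30, 40, 200, 500)

# boxplot.stats() returns a list; $stats holds five numbers:
# lower whisker, lower hinge, median, upper hinge, upper whisker
stats <- boxplot.stats(fares)$stats

# The fifth element is the upper whisker
upper.whisker <- stats[5]

# $out lists the points that fall outside the whiskers
outliers <- boxplot.stats(fares)$out

print(stats)
print(upper.whisker)
print(outliers)
```

Because the whisker is computed from the data, the same script keeps working if the data changes tomorrow.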
Now I can go ahead and build my filter. So I can do outlier.filter is equal to titanic.full$fare that is less than upper.whisker. And notice that I’m only cleaning the upper bound whisker. There is also a bottom whisker. But notice that it’s also 0, so that’s fine. We don’t have any outliers that go below the minimum there. We’ll go ahead and do that filter. This gives me a series of true falses.
Oh, I have to run this code first. This upper whisker should be 65. There we go. And now, we will go ahead and do a filter here. If I run this, this should be a series of true falses. In this case, someone paid more for a fare. I’m going to put that into, basically, a filter. Then I run that. I have a filter.
The next thing is I’m going to go ahead and do the actual filtration of the data now. So, titanic.full of outlier.filter. And notice that we only want the rows that are basically not an outlier. That’s what this means. If I run this, this will give me all the rows that aren’t outliers. The next thing is how we can go ahead and build our model. As I said, we’ll go ahead and do an lm here. An lm, where the formula - we have not defined a formula yet, actually. We haven’t told it how to predict yet. In this case, fare.equation. So, what do we want to do? Let’s do an str real quick, an str of titanic.full.
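The filter and the row subsetting described above might look like this in code. The `toy` data frame is a made-up stand-in for `titanic.full`; the strict `<` comparison follows the walkthrough, which means rows sitting exactly on the whisker are dropped too (`<=` would keep them).

```r
# Toy stand-in for titanic.full (made-up fares)
toy <- data.frame(
  PassengerId = 1:12,
  Fare        = c(0, 5, 7, 8, 10, 13, 15, 26, 30, 40, 200, 500)
)

# Derive the upper whisker from the data
upper.whisker <- boxplot.stats(toy$Fare)$stats[5]

# Logical vector: TRUE for rows strictly below the upper whisker
outlier.filter <- toy$Fare < upper.whisker

# Keep only the rows that pass the filter
non.outliers <- toy[outlier.filter, ]

print(upper.whisker)
print(non.outliers)
```

One thing to keep in mind: if the fare column still contains an NA, the comparison returns NA for that row and logical indexing gives back an all-NA row. `lm()` drops incomplete rows by default, so that is usually harmless for fitting, but it is worth knowing when you inspect the subset.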
How do we want to build this predictor? What is the relationship? So we’re going to build a model, not to predict survived, but we’re going to build a model to predict fare. So, build me a model to predict fare. And then this tilde means "given". It’s like y equals a bunch of stuff. We want it to use everything else, okay? Notice that we’re going to use Pclass. We’re going to use gender here, so Pclass plus sex plus age plus sibling/spouse plus parent/child, okay? And then plus embarked here. All right, that will be our equation. So, fare.equation will be inserted there. That’s telling it to build me a predictive model based upon these other predictors. And notice I’m not using survived, right? Because our future data will not have this column. We can’t rely on it as a predictor. And then the data will be the data in the absence of the outliers.
Earlier, here and here, I went ahead and did that filter. I said titanic.full, where I only want to see non-outliers. That’s what’s contained in this outlier filter, okay? So I want to go ahead and run the equation line. And I want to run this lm line. But I also want to stuff this bottom line into a variable. So fare.model, okay? All right, so I’m going to run this, and it’s going to build me a predictive model using a linear, ordinary least squares, model.
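Putting the equation, the outlier filter, and lm together, a minimal sketch could look like this. The `toy` data is randomly generated and only mimics the shape of `titanic.full`; the column names assume the Kaggle CSV capitalization, and the way `Fare` is simulated is purely an illustration.

```r
set.seed(42)
n <- 60

# Randomly generated stand-in for titanic.full (not real passenger data)
toy <- data.frame(
  Pclass   = factor(sample(1:3, n, replace = TRUE)),
  Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age      = round(runif(n, 1, 70)),
  SibSp    = sample(0:3, n, replace = TRUE),
  Parch    = sample(0:2, n, replace = TRUE),
  Embarked = factor(sample(c("C", "Q", "S"), n, replace = TRUE))
)
# Simulated fare: cheaper in higher class numbers, plus noise
toy$Fare <- 60 - 15 * as.integer(toy$Pclass) + 2 * toy$SibSp + rnorm(n, sd = 3)

# Outlier filter derived from the data
upper.whisker  <- boxplot.stats(toy$Fare)$stats[5]
outlier.filter <- toy$Fare < upper.whisker
training.rows  <- toy[outlier.filter, ]

# Fare given the other predictors; Survived is deliberately left out
fare.equation <- "Fare ~ Pclass + Sex + Age + SibSp + Parch + Embarked"
fare.model    <- lm(formula = as.formula(fare.equation), data = training.rows)

print(coef(fare.model))
```

`lm()` fits by ordinary least squares, which is why the outliers are filtered out of the training rows before fitting.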
The next thing is I want to apply this. I want to fill in the missing values that are missing using a predictive model, using the rest of the data sets on that row that has the missing value of fare to fill in the value missing a fare. We’re going to go ahead and do, in this case, a prediction. So, the problem now is where is our model? What is our model? Our model will be fare.model. And what is our new data? New data will be any row that has missing values of fare. And then the next thing is we have to define what our features are. And that gets tricky because we have other things in here, like passenger ID. We have survived. We have embarked or not embarked. We have a name. We need to tell it not to use those things. This is where it gets a little tricky.
We have to now query our data to basically isolate the things that we only want. So titanic.full, we’re going to do a quick query. How do I find if something is missing in fare? Remember, is.na of titanic.full$fare. This will find all the missing values and give me back a vector of true falses. That will be our query. This will be - notice that I can query like this.
All right, the next thing is what columns do I want. Notice that not every column is needed. So, I only want to query specific columns. In this case, I want to query the columns that were included here. See that? So I’m going to go ahead and do this. I’m going to copy that. And actually, I’m going to do some text processing in Excel. Now you can do this manually if you want but I know that a vector is going to require a comma right here and quotes in these.
I’m going to do a find replacement. I want to find all plus signs with space before and a space after and replace it with a quote before, comma, space, and then quote. Notice if I replace this, this will build me my vector initialization command. I want Pclass, sex, age, sibling, spouse, parch. So, I’m going to go ahead and close that. And now I want to tell it I only want it to query those columns. I’ll paste that in here. I want to query only the rows that have missing values of fare. And I only want to see Pclass, sex, age, sibling, spouse, parch, and embarked. All right, now we can go ahead and fill this in. Because, basically, this is going to return me the rows that are pertinent to me. So fare.row. This will give me a series of rows. But let’s just run this first, okay?
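As an alternative to the find and replace step, the column names can also be pulled out of the equation string in R itself. This is a swapped-in technique, not what the video does; the variable names `predictor.names` and `manual.names` are made up for this sketch.

```r
fare.equation <- "Fare ~ Pclass + Sex + Age + SibSp + Parch + Embarked"

# all.vars() lists every variable in the formula; the first one is the response
predictor.names <- all.vars(as.formula(fare.equation))[-1]

# The same result by plain text processing: drop everything up to "~",
# then split on " + " (the find and replace idea, done in code)
right.hand.side <- sub("^.*~ *", "", fare.equation)
manual.names    <- strsplit(right.hand.side, " + ", fixed = TRUE)[[1]]

print(predictor.names)
print(manual.names)
```

Either vector can then be used as the column selector when querying the rows with a missing fare.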
Notice that this brings me back the row that has the missing value, which is awesome. So, this row - notice I didn’t query fare, that’s the job of the model. Now we’re going to go ahead and predict on this. So, predict that. It’s kind of a waste because it’s only going to run a prediction on one row, this 1,044th row. So, I’m going to go ahead and run this prediction. I’m going to store that into a variable too. But actually, let’s run it before we store it. Oops, I have not run that filter yet.
I’m going to go ahead and run that filter. So, fare.row, and predict is going to go ahead and predict on it. Notice that it’s going to predict for that person, passenger 1,044, that he or she might have paid $8.25 for a fare. Awesome. So, that prediction actually needs to be thrown back into the data set as a replacement. Notice that we just called a prediction and printed it to the console. We didn’t store it anywhere. So, we’ve got to store it; just call this fare.prediction. And then we’re going to go ahead and replace it.
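The query-then-predict step can be sketched on made-up data like this. The `toy` data frame, the reduced predictor set, and the simulated fares are assumptions chosen to keep the snippet short and runnable; only the pattern (select the rows with a missing fare, keep just the predictor columns, call predict) mirrors the walkthrough.

```r
set.seed(1)
n <- 40

# Randomly generated stand-in for titanic.full (not real passenger data)
toy <- data.frame(
  Pclass = factor(sample(1:3, n, replace = TRUE)),
  Age    = round(runif(n, 1, 70)),
  SibSp  = sample(0:3, n, replace = TRUE)
)
toy$Fare <- 60 - 15 * as.integer(toy$Pclass) + rnorm(n, sd = 2)
toy$Fare[7] <- NA   # pretend one fare is missing

# lm() drops the row with the missing response by default
fare.model <- lm(Fare ~ Pclass + Age + SibSp, data = toy)

# Rows with a missing fare, restricted to the predictor columns only
fare.row <- toy[is.na(toy$Fare), c("Pclass", "Age", "SibSp")]

# Store the prediction instead of only printing it
fare.prediction <- predict(fare.model, newdata = fare.row)

print(fare.row)
print(fare.prediction)
```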
Earlier, we showed you that is.na of titanic.full$fare. Now we need to query which rows have missing values of fare. This brings us back to a true false vector. This is going to be our query. And we’re going to go ahead and tell it we only want to query the fare column. So, we wrap all of that in these brackets. And we’re going to query that from the titanic.full data set. And this should bring me back that one row and the specific na. And I want to replace that na with this prediction thing here.
If this query had brought back like 10 rows - and the fare prediction should then have 10 values - it will, basically, just replace them in order. That’s how we’ll go in and fill in all these predictions. All right, and that’s all there is to it. We have gone ahead and filled in the missing value for that, so that NA is now gone. Before it was an NA, now it should be 8.25 or something like that. But let’s find out.
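The in-order replacement can be checked on a small made-up example. The `toy` data frame and the three stand-in predictions are invented for illustration; in the walkthrough the values would come from predict.

```r
# Toy stand-in for titanic.full with three missing fares
toy <- data.frame(
  PassengerId = 1:6,
  Fare        = c(7.25, NA, 71.28, NA, 8.05, NA)
)

# Stand-in for the model output: one value per missing row, in row order
fare.prediction <- c(10, 20, 30)

# Query the rows with a missing fare, select the Fare column, and overwrite
toy[is.na(toy$Fare), "Fare"] <- fare.prediction

print(toy)
```

The first prediction lands in the first missing row, the second in the second, and so on, so the predictions must be in the same row order as the query.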
If I go into my titanic.full and view the 1,044th row, I should see that the fare is $8.25, okay? And now I can go ahead and run the rest of the model. Go ahead and run the rest of the model, and treat everything else the same. Do the categorical casting back again. Go ahead and split out back into train and test. Cast survived as a category in the training set. Tell it that we want to predict survived given all this other stuff. Use the random forest package. Do a library of randomForest. Build the random forest. And now, we can go ahead and predict on survived, right? And now, hopefully, our model is making a better-educated guess than it was when we were just filling in the median for fare before.
Now, go back and do that for all of the other columns. I think age had a bunch of missing values. So, go forth and build yourself a model to predict age.
All right. Happy modeling.
Phuc H Duong - Phuc holds a Bachelors degree in Business with a focus on Information Systems and Accounting from the University of Washington.
© Copyright – Data Science Dojo