Blog entry by Dave Langer

Feature Engineering and Data Wrangling in R

Feature engineering and data wrangling are key skills for a data scientist. Learn how to accelerate your R coding to deliver more, and better, features.

Earlier this month I had the privilege of traveling to Amsterdam to teach a great group of folks data science. As is so often the case, I felt I learned as much from the students as they learned from me. For example, one of the students asked for some R programming assistance in the area of data wrangling and feature engineering. The scenario in question really intrigued me. I knew how I could solve the problem using traditional non-functional programming techniques (e.g., using for loops), but I was looking for something more elegant.

In the hotel that evening I fired up RStudio and started noodling on the problem using my current go-to solution for data wrangling and feature engineering in R – the mighty dplyr package. I had so much fun working through the scenario. Here’s some example code from the video showing dplyr in action.

#======================================================================
# Load the packages used below: dplyr for the verbs, stringr for str_extract
library(dplyr)
library(stringr)

# Add the new feature for the Title of each passenger
train <- train %>%
  mutate(Title = str_extract(Name, "[a-zA-Z]+\\."))
table(train$Title)

# Condense titles down to small subset
titles.lookup <- data.frame(Title = c("Mr.", "Capt.", "Col.", "Don.", "Dr.",
                                      "Jonkheer.", "Major.", "Rev.", "Sir.",
                                      "Mrs.", "Dona.", "Lady.", "Mme.", "Countess.",
                                      "Miss.", "Mlle.", "Ms."),
                            New.Title = c(rep("Mr.", 9),
                                          rep("Mrs.", 5),
                                          rep("Miss.", 3)),
                            stringsAsFactors = FALSE)

# Replace Titles using lookup table
train <- train %>%
  left_join(titles.lookup, by = "Title")

train <- train %>%
  mutate(Title = New.Title)
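
If you want to try the lookup-table pattern without the Titanic file on hand, here is a minimal sketch on a few hand-typed rows in the same format as the Name column. The names `toy` and `toy.lookup` are made up for illustration, and the `coalesce()` fallback is my own assumption rather than part of the code above: `left_join` returns `NA` for any title that is missing from the lookup table (for example "Master."), so the sketch keeps the extracted title in that case.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for the training data (a handful of rows, not the real file)
toy <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley",
           "Heikkinen, Miss. Laina",
           "Palsson, Master. Gosta Leonard",
           "Sagesser, Mlle. Emma"),
  stringsAsFactors = FALSE
)

# Small lookup table in the same shape as titles.lookup
toy.lookup <- data.frame(
  Title     = c("Mr.", "Mrs.", "Miss.", "Mlle."),
  New.Title = c("Mr.", "Mrs.", "Miss.", "Miss."),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  # Pull out the first run of letters that ends in a period
  mutate(Title = str_extract(Name, "[a-zA-Z]+\\.")) %>%
  left_join(toy.lookup, by = "Title") %>%
  # Titles missing from the lookup come back as NA after the join,
  # so fall back to the extracted title instead of losing it
  mutate(Title = coalesce(New.Title, Title)) %>%
  select(-New.Title)

table(toy$Title)
```

Whether unmatched titles should be kept, mapped to a catch-all, or left as `NA` depends on the model you are building, so treat the fallback as one option among several.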

Now compare the above elegant (if I do say so myself ;-)) code with the following code from my series:


# Expand upon the relationship between `Survived` and `Pclass` by adding the new `Title` variable to the
# data set and then explore a potential 3-dimensional relationship.

# Create a utility function to help with title extraction
extractTitle <- function(name) {
  name <- as.character(name)

  if (length(grep("Miss.", name)) > 0) {
    return ("Miss.")
  } else if (length(grep("Master.", name)) > 0) {
    return ("Master.")
  } else if (length(grep("Mrs.", name)) > 0) {
    return ("Mrs.")
  } else if (length(grep("Mr.", name)) > 0) {
    return ("Mr.")
  } else {
    return ("Other")
  }
}

# Apply the function one row at a time
titles <- NULL
for (i in 1:nrow(data.combined)) {
  titles <- c(titles, extractTitle(data.combined[i,"name"]))
}
data.combined$title <- as.factor(titles)

# Re-map titles to be more exact
titles[titles %in% c("Dona.", "the")] <- "Lady."
titles[titles %in% c("Ms.", "Mlle.")] <- "Miss."
titles[titles == "Mme."] <- "Mrs."
titles[titles %in% c("Jonkheer.", "Don.")] <- "Sir."
titles[titles %in% c("Col.", "Capt.", "Major.")] <- "Officer"

# Make title a factor
data.combined$new.title <- as.factor(titles)

# Collapse titles based on visual analysis
indexes <- which(data.combined$new.title == "Lady.")
data.combined$new.title[indexes] <- "Mrs."

indexes <- which(data.combined$new.title == "Dr." |
data.combined$new.title == "Rev." |
data.combined$new.title == "Sir." |
data.combined$new.title == "Officer")
data.combined$new.title[indexes] <- "Mr."
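
For comparison, the loop around `extractTitle` could be sketched as a single vectorized dplyr step. This is only an illustration under a few assumptions: `toy.combined` is a made-up stand-in for `data.combined`, and I use `fixed()` so the period is matched literally, which differs slightly from the `grep` calls above, where "." matches any character.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for data.combined with a lowercase `name` column
toy.combined <- data.frame(
  name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley",
           "Heikkinen, Miss. Laina",
           "Palsson, Master. Gosta Leonard",
           "Uruchurtu, Don. Manuel E"),
  stringsAsFactors = FALSE
)

toy.combined <- toy.combined %>%
  mutate(title = case_when(
    # Branch order mirrors extractTitle(); the first matching branch wins
    str_detect(name, fixed("Miss."))   ~ "Miss.",
    str_detect(name, fixed("Master.")) ~ "Master.",
    str_detect(name, fixed("Mrs."))    ~ "Mrs.",
    str_detect(name, fixed("Mr."))     ~ "Mr.",
    TRUE                               ~ "Other"
  )) %>%
  # Convert to a factor, as the loop version does
  mutate(title = as.factor(title))

table(toy.combined$title)
```

Because `case_when` works on the whole column at once, there is no need to grow the `titles` vector one element at a time inside a loop.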


In our Bootcamp we spend a lot of time emphasizing that in the bulk of scenarios a Data Scientist is best served by focusing their time on Data Wrangling and (most importantly) Feature Engineering. So often quality Feature Engineering trumps everything else – algorithm selection, hyperparameter tuning, blending, etc. My work on this video series is aligned to our teachings on the importance of Feature Engineering. Hopefully folks get as much out of my new series as I am getting out of making it.

Enjoy and happy data sleuthing!

Here is a cheat sheet: