Setup & Data Preparation






     

    Course Description

    In this tutorial, we’re going to demonstrate how to get R, RStudio, and the wine ratings dataset from Kaggle, how to install and load dplyr and ggplot2, and how to properly load accented foreign characters into RStudio.

    What You'll Learn

     > How to download RStudio and R

     > How to get our dataset from Kaggle

     > How to install dplyr and ggplot2 packages



     

    RStudio can be downloaded from here

    R Programming Language can be downloaded from here

    The Wine Reviews Kaggle dataset can be accessed here

    Our accompanying blog post can be accessed here

    dplyr Package can be installed using the command: install.packages("dplyr")

    ggplot2 Package can be installed using the command: install.packages("ggplot2")
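
    As a minimal sketch, the two install commands can be combined with the loading step in one script. The requireNamespace guard and the explicit CRAN mirror URL are our own assumptions, added so the script can be re-run without reinstalling; the tutorial itself simply calls install.packages directly. The first run needs an internet connection.

```r
# Install each package only if it is not already available.
# The guard and the repos argument are assumptions added for convenience.
for (pkg in c("dplyr", "ggplot2")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}

# Installing does not load a package; library() attaches it to the session.
library(dplyr)
library(ggplot2)

# Check that both packages are now on the search path.
loaded <- c("package:dplyr", "package:ggplot2") %in% search()
print(loaded)
```

    Note that install.packages takes the package name in quotes, while library accepts it without quotes.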


     

    Hello everyone, this is Ningxi with Data Science Dojo. Today we’re going to start introducing you to a very powerful tool in R called dplyr that’s widely used for data manipulation and analysis. It’s going to make your data munging process that much more efficient and easier.

    We’re going to start by talking about how to set up dplyr, and we’ll also cover the data preparation process. As an overview of this series, we’re going to talk about why we use dplyr and what it does. We’ll also introduce you to some basic dplyr functions, including “arrange,” “group_by,” “summarize,” “select,” “filter,” “intersect,” and “setdiff.”

    Through the series you’re going to learn how to arrange data, how to do group-by aggregation, how to subset columns and rows, and how to find overlapping and non-overlapping values from two different data sources. By the end of this dplyr series, you should be able to use the functions we introduce to perform basic data manipulation tasks. You should also be able to start thinking, at a high level, about data analysis as a process that you can divide and conquer by subgroup. So instead of looking at one massive dataset as one standalone entity, after learning group-by aggregation you should start thinking about how to divide the data you have into subparts and dissect it that way. It’s only going to make your job easier.
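
    To make the divide-and-conquer idea concrete, here is a minimal sketch of a group-by aggregation. It assumes dplyr is already installed, and the data frame and its column names are invented for illustration rather than taken from the Kaggle dataset.

```r
library(dplyr)

# Invented toy data; these are not rows from the wine ratings dataset.
ratings <- data.frame(
  country = c("France", "France", "Spain", "Spain", "Spain"),
  points  = c(90, 94, 88, 85, 91),
  stringsAsFactors = FALSE
)

# Split the rows by country, then reduce each group to one summary row.
by_country <- ratings %>%
  group_by(country) %>%
  summarize(avg_points = mean(points), n_wines = n()) %>%
  arrange(desc(avg_points))

print(by_country)
```

    Each country becomes its own subpart of the data, and the summary is computed once per subpart instead of once for the whole table.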

    We’re going to demonstrate how to use dplyr while working with a real-world dataset from Kaggle on wine ratings. The goal of this dplyr series is to get beginners up to speed quickly and help you guys select the segments that you find most useful, so you don’t have to watch every single video in the series. You can just pick and choose whatever segment is most relevant to the task you have at hand.

    We don’t have hard prerequisites for this series, but you should have some familiarity with basic R syntax and be able to code very simple commands in R, since learning how to use packages such as dplyr and ggplot2 builds upon a foundation of understanding how the R programming language works. If you are not familiar with it, please check out our YouTube page. We have a whole series on introduction to R in our channel, so please watch that series before watching the dplyr series here.

    In this video, Part 1 of the series, we’re going to demonstrate how to get R, RStudio, and the wine ratings dataset from Kaggle. We’ll also walk you through how to install and load dplyr and ggplot2, and show you how to properly load accented foreign characters into RStudio.

    A little bit about myself: I started my data science journey after a career in finance because I wanted to learn more data-driven techniques. I create content for Data Science Dojo and teach part of our 5-day in-person bootcamps that we host around the world. These are tailored toward working professionals and are meant to get you up to speed so you can apply data science techniques in your daily work immediately upon graduation. So I encourage you to check that out. I enjoy using data to uncover interesting and fun things in life.

    So as you can see, we’re going to talk about wine ratings today. I hope you guys get a lot out of it. OK, so now that you know what to expect, let’s get right into it.

    First we’re going to show you how to download RStudio and R if you don’t have them on your computer. Just go to Google and type in “RStudio,” and the first link that comes up should be the place where you can get it. I already typed this in my search bar. Come over here, click Download RStudio, choose the first option, and click Download. Make sure that you also download R before RStudio, because RStudio is just an IDE that sits on top of the underlying R language, so you need both. Come over here to download R, choose your operating system, and go through the prompts.

    Once you have downloaded R from CRAN, the Comprehensive R Archive Network that hosts the R programming language, come back to the RStudio page, choose your operating system, and download the IDE. For instance, for Mac you click here, for Windows here, and so on. I already have it on my computer so I’m not going to click through it, but you pretty much just hit “return” or “enter” repeatedly until you have it on your computer.

    Once you have downloaded both R and RStudio, we’re going to get our dataset from Kaggle. Just go back to Google and type in “wine ratings Kaggle,” and that should be the first link that comes up. Clicking this link will take you to the Kaggle page where this dataset resides. If you’re interested, you can scroll down and read about the background of this set and why this user decided to provide it here. The page also shows you the different features that are in the set and some related links. But we’re going to go to the Data tab, choose the second option from the left-hand side toolbar, and click Download. Save and unzip the file. Because I’m going to load this into RStudio later and I don’t want to have to type in the whole file name, I’m gonna rename this CSV file and just call it “wine,” making it easier to type later on.

    So once you have both RStudio and the dataset downloaded, open up RStudio. Make sure to go into the directory where the dataset is saved. For me that’s my desktop, so I’m gonna go to my desktop. And this is very important: make sure to go over here, click “More,” and then “Set As Working Directory.” This makes sure that your current working directory is set to where the dataset is saved, so when we read the CSV into RStudio, the system will know where to find it. Otherwise it won’t be able to find the file.

    Once I’ve done that, I’m gonna create a new object. Let’s just call it “wine” for simplicity. I’m gonna set this object to the content of the CSV, and because I renamed the dataset previously, I can just type “wine.csv” instead of that whole long name we started out with. Also make sure to set the parameter “stringsAsFactors” to FALSE, because otherwise R will, by default in versions before 4.0, treat all the text columns as factors, which we don’t want. There are a lot of columns that contain text, such as country names, tasting notes, and winery names, and there’s no need to treat them as factors. Setting “stringsAsFactors = FALSE” makes sure that all the text is loaded as characters instead.

    There’s one more thing to handle because of the nature of this set. Go back to the Kaggle page. If you quickly browse the first 100 rows, you’ll see that a lot of these wines are from European countries, and French, Spanish, Italian, and other languages have accents on certain letters. If we don’t do anything and just read the dataset as it is, it’s going to mess up all the words that have accents. So we need to pass another parameter, “encoding,” and set it to “UTF-8.” This makes sure that all the characters are loaded correctly. I’m going to run that in my console. As you can see here, the dataset is being loaded. And now it’s done.

    If you want to take a quick look at what this dataset looks like, you can use “View.” Make sure the V is in uppercase. Do a quick scan of the dataset we just loaded and everything looks good. All the accents are imported correctly.
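
    The load step can be sketched as follows. We assume the on-screen call was read.csv, which the transcript does not name explicitly. To keep the sketch runnable without the Kaggle download, it first writes a tiny made-up stand-in file to a temporary folder; the two rows are invented for illustration and are not real data from the set. With the real file, only the setwd and read.csv calls are needed.

```r
# Hypothetical stand-in for the renamed Kaggle file, written as UTF-8 bytes
# to a temporary folder so this sketch runs anywhere.
demo_dir <- tempdir()
demo_lines <- c(
  "country,winery,points",
  "France,Ch\u00e2teau Exemple,90",
  "Spain,Bodega Se\u00f1or\u00edo,88"
)
writeLines(enc2utf8(demo_lines), file.path(demo_dir, "wine.csv"), useBytes = TRUE)

# Equivalent of "More" > "Set As Working Directory" in the RStudio Files pane.
old_wd <- setwd(demo_dir)

# Text stays as character (not factor) and is marked as UTF-8.
wine <- read.csv("wine.csv", stringsAsFactors = FALSE, encoding = "UTF-8")

setwd(old_wd)  # restore the previous working directory

str(wine)
# View(wine)   # opens the spreadsheet-style viewer in RStudio; capital V
```

    One design note, offered tentatively: in read.csv the encoding argument only marks the strings as UTF-8 without converting them, whereas fileEncoding re-encodes the file while reading. If accents still look wrong on your system, trying fileEncoding = "UTF-8" may be worth a look.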

    This probably goes without saying, but this is a dplyr video, so we’re going to need to install dplyr into RStudio. We’re also going to use ggplot2 to do some basic visualizations. So let’s call “install.packages” and type in dplyr in quotes. I’m also gonna install ggplot2. Once these two packages are installed, we also need to load them explicitly. Use the library function, and this time around you don’t need quotes because we already have these packages. If you go over here, you can see that these are available in RStudio but they are not loaded; there’s no check mark here. So we’re gonna load dplyr by using the library function. I’m gonna do the same thing for ggplot2, and now we’re good to go.

    We’ll also see that we have this weird extra column named “X,” so we’re going to get rid of that momentarily. We also have a column named “description” that seems to be sommeliers’ textual statements about how these wines taste. Because this tutorial is not going to focus on natural language processing, we will soon drop this column as well. This is a fairly large dataset. You can see over here that it has over 150,000 observations of 11 variables, and we’re not going to use all the columns, so let’s clean the set now. I will overwrite our original data frame by subsetting only the columns we want, using square brackets. Leaving the row position empty means we want all the rows, and the minus sign means we don’t want these columns: the first column and the third, which was “description,” I believe. Now if we do View again, you can see that the weird “X” column and the “description” column have been removed.
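
    The column-dropping step can be sketched on a small stand-in data frame. Only the “X” and “description” column names come from the video; the other columns and all the values below are assumptions made for illustration.

```r
# Made-up stand-in with the layout described in the video: an index column
# "X" in position 1 and "description" in position 3. Other names are assumed.
wine <- data.frame(
  X = 0:1,
  country = c("US", "Spain"),
  description = c("Ripe and bold.", "Bright citrus notes."),
  points = c(96L, 95L),
  stringsAsFactors = FALSE
)

# Empty row index keeps every row; -c(1, 3) drops the 1st and 3rd columns.
wine <- wine[, -c(1, 3)]

print(names(wine))
```

    Dropping by position is short but depends on the column order staying the same. Dropping by name, for example wine[, !(names(wine) %in% c("X", "description"))], is a little longer but does not break if the columns are reordered.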

    I hope that by watching today’s video you’re able to get up and running with dplyr. In Part 2 of this series we’re going to cover how to select and filter rows as well as perform some basic visualizations with ggplot2. You will see that ggplot2 and dplyr often work seamlessly together to create neat data analysis and visualization results all in one. So thank you for watching, and stay tuned for Part 2 of this series.


     

    Data Science Dojo Instructor - Data Science Dojo is a paradigm shift in data science learning. We enable all professionals (and students) to extract actionable insights from data.