Hi, welcome to this Data Science Dojo beginner tutorial on getting started with Python and R for data science.
In this beginner tutorial, we’ll take you through some common Python and R packages and libraries used for machine learning and data analysis, and walk through a simple linear regression model. We’ll also help you set up Python and R on your Windows, Mac, or Linux machine, run your code locally, and push your code to a GitHub repository.
So let’s get started with installing Python and R. To install Python on a Windows machine, we first need to check whether our machine is 64-bit or 32-bit, as this determines the appropriate Python installer. To do this, search for “about your PC” and you’ll see whether your machine is 64-bit or 32-bit. In my case, it’s 64-bit.
Next, in your web browser, go to “python.org/downloads/windows” and scroll down to the version of Python you wish to download. In my case, I’ll choose the latest version of the 64-bit executable installer. You can go with the default installation, or you can do a custom installation to include optional features such as “pip”, or to specify an install path directly under C: so it’s easier to locate your Python program later on. Then just click “Install”.
Once Python has installed on your computer, you’ll need to add Python to your path to be able to run Python scripts from any directory or folder. Download Git for Windows to set your path and run the Python command.
The commands in this program are basically the same as when using the terminal on Mac or Linux. Alternatively, on Windows, you can use the default command prompt by searching “CMD”. You can also set your local path by searching “environment variables” and setting your path there.
Here’s an example of a Python script saved in my “Documents/project 1” folder. Using a text editor of my choice, such as Notepad++, I wrote my Python code and saved my file as a “.py” file. Then I open my terminal, which is in “C:\Program Files\Git\git-cmd”. I navigate to “Documents/project 1” and set my local Python path. We’ll set this up permanently using a “.bashrc” file containing the path to my Python program directly under C:. Now I simply type “py” followed by the name of the file and its extension. If using Python 2.7, just type “python” followed by the name of the file and its extension. If we hit enter to run this, it produces the output of my code, which is the heights predicted using a linear regression model.
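If you’re not sure whether the path is set correctly, a small script can report which interpreter is running and where the “py”, “python”, and “python3” commands are found. This is only a sketch using the standard library; the function name is made up for illustration.

```python
import shutil
import sys


def describe_python_setup():
    """Return the running interpreter's version and where PATH finds Python commands."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    # shutil.which returns None when a command is not found on PATH.
    launchers = {name: shutil.which(name) for name in ("py", "python", "python3")}
    return {"version": version, "executable": sys.executable, "launchers": launchers}


info = describe_python_setup()
print("Running Python", info["version"], "from", info["executable"])
for name, location in info["launchers"].items():
    print(name, "->", location or "not found on PATH")
```

Any command listed as “not found on PATH” is one the terminal won’t recognise until the path is set.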
The final part of this Python Windows setup is installing pip, so you can easily install Python packages and libraries. Pip might not have come with your installation if you didn’t customize it, or it might not be included in an older version of Python. To get pip, go to “bootstrap.pypa.io/get-pip.py” in your web browser, right click to save the file in your Python program folder, and then run the command “python get-pip.py”. In my case, my Python program is under C:.
Moving on to installing R for Windows, simply go to “cran.r-project.org/bin/windows/base” in your browser and select the 32-bit or 64-bit version. Once it has downloaded, press “OK” and click “Next” through all the steps. Once R has installed on your computer, you can simply open the program from your desktop and start typing R commands or code.
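To confirm that pip is available after the pip step above, one option is to ask the current interpreter to run pip as a module and report its version. This is a minimal sketch, assuming Python 3.7 or later; the function name is hypothetical.

```python
import subprocess
import sys


def pip_version():
    """Return pip's version line for this interpreter, or None if pip is missing."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "--version"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


version_line = pip_version()
print(version_line or "pip is not installed for this interpreter")
```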
I recommend you download RStudio, as it makes the process of editing and debugging your code easier. Otherwise, you’re welcome to use the R command line. To save your R code, click on “File”, then “Save History”, and this will save your code so you can run it later. If you wish to set your path or working directory, simply type “setwd” followed by the path to where you would like to store your R files locally. You might need to use double backslashes on Windows, as R treats a single backslash in a string as an escape character rather than as a path separator.
Now, let’s install Python on a Mac. Open the Mac terminal from “Finder”, “Applications”, “Utilities”. First we’re going to install the Xcode command line tools, as this will help with the installation. So type “xcode-select --install”, then click “Install” and “Agree”.
Now we’re going to use Homebrew to install Python. So type “/usr/bin/ruby”, and we’re going to use curl with the URL to Homebrew on GitHub. Press return, and enter your password if need be. Next, add the path: we’ll create a “.bashrc” file to permanently add the path. If you get an error message stating “cannot write to path”, try the “sudo chown” command accompanying this video. All commands can be copied and pasted, as they accompany this video.
Next we’ll install Python, so just type “brew install python”, or “brew install python3” if you’re using Python 3. We’ll also add this to our path, so we’ll create another RC file. Now, to check whether pip is installed as part of your Python program, simply type “which pip” and it’ll show you the location where pip is installed. If you want to check the version, just type “pip -V” and it’ll show you which version of pip you’ve installed. As mentioned, pip is useful for easily installing Python packages and libraries.
Moving on to R: to install this on a Mac after installing Homebrew, simply type “brew tap homebrew/science” and then type “brew install r”. To open the R command line, simply type “r” and press enter.
Now let’s install Python and R on Linux. I’m using Ubuntu. Later versions of Ubuntu might already have Python installed, but I’ll take you through the process anyway. So open your terminal and type “sudo apt-get install python3.6”, or “python2.7” if you’re using that version. Now we’re going to type “sudo apt-get install python-setuptools”. Lastly, install pip, to easily install Python libraries and packages, by typing “sudo easy_install pip”. To install R on Linux, simply type “sudo apt-get -y install r-base”. Now type uppercase “R” and press enter to open the R command line.
Now that we’ve got the setup and installation part of this tutorial out of the way, we can move on to more fun stuff.
Let’s have a quick play with some data to get you familiar with some key data analysis and linear regression concepts, as well as basic scripting. I’m going to go through an example of a simple linear regression in Python and R, using simulated data on people’s height in centimeters and their weight in kilograms. The model is based on a formula, which can be produced using Python and R functions, that gives a predicted outcome or estimated y-value for a given x-value, using a certain constant and slope.
Here is what’s called the “regression line”. I like to think of it as a line of predicted values along the x-axis. For a given x-value, the line predicts the y-value to fall about here in height. The actual values are slightly above and below the line, but the model is generalized enough to take into account where most cases would probably fall. The formula gives a constant value here, which we add to a given x-value multiplied by a given coefficient or slope. The constant means that when x is at 0, y is at this value, and the slope means that for every one-unit increase in x, y increases by this number of units. So we can use this formula to plug in any new x-value of a person’s weight to predict their height or y-value. Of course, there are many other factors, not only weight, that could influence a person’s height, so we’re just looking at a very simple model to get started with.
To implement linear regression in Python, we first need to install a few commonly used packages.
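Before installing anything, the formula described above can be written out in a few lines of plain Python. The constant and slope below are hypothetical values picked only to make the arithmetic easy to follow; they are not the values from the tutorial’s model.

```python
def predict_height(weight_kg, intercept, slope):
    """Predicted height (cm) = constant + slope * weight (kg)."""
    return intercept + slope * weight_kg


# Hypothetical constant and slope, chosen only for illustration.
INTERCEPT_CM = 100.0
SLOPE_CM_PER_KG = 1.0

print(predict_height(0, INTERCEPT_CM, SLOPE_CM_PER_KG))   # at x = 0, y equals the constant
print(predict_height(70, INTERCEPT_CM, SLOPE_CM_PER_KG))
print(predict_height(71, INTERCEPT_CM, SLOPE_CM_PER_KG))  # one more kg adds one slope's worth of cm
```

With these made-up numbers, going from 70 kg to 71 kg raises the predicted height by exactly the slope, which is what “for every one-unit increase in x” means.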
We’ll open our terminal and install “sklearn” (the scikit-learn package) for modeling. If using Python 2.7, just type “python -m pip install” followed by the package name. Now we’re going to pip install pandas for data importing. We’ll also install matplotlib for plotting. The last package we need to install is “scipy”. Next, go to your text editor and save a new Python file in “Documents/project 1” or a folder of your choice. I’ll just call my file “LM model” and save it as a Python file. Also, don’t forget to “cd” into this folder in the terminal so you can run your script later.
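A quick way to check that the installs above worked is to ask Python whether each package can be found, without actually importing it. This is a sketch using the standard library; the function name is made up for illustration.

```python
import importlib.util

# Import names of the packages installed in this step.
REQUIRED_MODULES = ("sklearn", "pandas", "matplotlib", "scipy")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All packages are importable.")
```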
Now we’re going to import these packages at the beginning of the script, so at the top of the file we’ll type “from sklearn import linear_model”. That’s our linear regression tool. We’re also going to import “DataFrame” from pandas, import pandas as “pd”, and import matplotlib’s plotting module as “plt”.
Now we need to read in our data, which you can download as part of this tutorial and save in your current folder. We’ll use the pandas “read_table” function for this. We’ll put our data in a variable called input data, and we’ll give the “read_table” function the data file name and extension in our folder. We tell it the file is comma separated, as it’s a CSV file, that we have headers and they start at line 0, and we’ll give our X and Y headers specific names. This automatically infers the data types for each column too.
Before applying a linear regression model, let’s plot the data using matplotlib’s plot function to see whether the data naturally follows a linear pattern and a normal distribution, as linear regression is not appropriate or useful for datasets that don’t follow this assumption. We’ll use a scatter plot, and we’re just plotting weight versus height, so weight is on our x-axis and height is on our y-axis. We’ll need to show this graph so it can render on our screen. Now save and run the script. As we can see, the data is linear and follows a normal distribution, making linear regression appropriate to use on these data.
Now we’ll define our X predictor variable, weight, and our Y outcome variable, height. We’ll use “pd” for pandas and its “DataFrame” function, with weight as our predictor and height as our outcome variable.
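The steps above might look roughly like the sketch below. The data here is a small made-up stand-in held in memory, since the transcript doesn’t give the real file name or values, and the column names “weight” and “height” are assumptions. The plot is saved to a file so the sketch also runs on a machine without a display; in the video the graph is shown on screen instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # draw to a file so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the tutorial's CSV file; the real file name
# and values are not given in the transcript.
csv_text = io.StringIO(
    "x,y\n"
    "55,160\n"
    "62,166\n"
    "70,172\n"
    "78,179\n"
    "85,185\n"
)

# Comma separated, headers on line 0, columns given specific names.
input_data = pd.read_table(csv_text, sep=",", header=0, names=["weight", "height"])
print(input_data.dtypes)  # data types are inferred for each column

# Scatter plot with weight on the x-axis and height on the y-axis.
plt.scatter(input_data["weight"], input_data["height"])
plt.xlabel("Weight (kg)")
plt.ylabel("Height (cm)")
plt.savefig("weight_vs_height.png")  # in an interactive session, plt.show() renders it on screen

# Predictor and outcome as separate data frames.
X = pd.DataFrame(input_data["weight"])
Y = pd.DataFrame(input_data["height"])
```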
Now we’ll fit a model to the data using the fit function, and use this to predict height for a given weight. So we’re using a linear regression model, and we’ll fit the model to the data. We can now compare the first, say, six predicted values from the predict function with the actual height values to see if they’re on par. First we’re going to get all the predicted values, using our predictor variable to predict the outcome. We’ll print some subheadings to differentiate the list of predicted values from the actual ones, then have a look at predictions 0 to 6 and compare them with actual values 0 to 6. All right, we’ll save and run the script.
A quick eyeball of the first few predictions against the actual values shows the model was not far off the mark, which is good. However, to properly assess a model, we can use measures such as R squared, which is the percentage of explained variance. So we’ll go back to our script and use the score function to get the R squared, and we want to print this, obviously.
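Here is a sketch of the fit, predict, and score steps using scikit-learn. The data is simulated inside the script with made-up coefficients and noise, so the numbers it prints won’t match the ones in the video.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Simulated weights (kg) and heights (cm); the coefficients and noise level
# are assumptions for illustration, not the tutorial's data.
rng = np.random.default_rng(0)
weight = rng.uniform(50, 100, size=100)
height = 100 + 0.9 * weight + rng.normal(0, 2, size=100)

X = pd.DataFrame({"weight": weight})  # predictor
Y = pd.DataFrame({"height": height})  # outcome

# Fit a linear regression model to the data.
model = linear_model.LinearRegression()
model.fit(X, Y)

# Get all predictions, then compare the first six with the actual values.
predictions = model.predict(X)
print("Predicted:")
print(predictions[0:6])
print("Actual:")
print(Y[0:6])

# R squared: the share of variance in height explained by the model.
print("R squared:", model.score(X, Y))
```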
Now we’re just going to comment out the above lines, as we no longer want to view these. We’ll save and run our script again. As we can see, a high R squared shows the model explained most or nearly all of the variance, which is good. However, relying solely on R squared is probably not good enough when assessing and measuring our model’s predictions. Sometimes it can be misleading to look at the R squared alone, but the course will go through other measures you can use.
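To make the R squared less of a black box, here is a sketch that computes it by hand, alongside the mean absolute error as one example of an error measure you could look at as well. The choice of mean absolute error is mine, not something named in the video, and the numbers are hypothetical.

```python
def r_squared(actual, predicted):
    """Share of the variance in the actual values that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the units of the outcome."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical heights (cm) and predictions, for illustration only.
actual = [160.0, 170.0, 180.0, 190.0]
predicted = [162.0, 168.0, 182.0, 188.0]

print("R squared:", r_squared(actual, predicted))
print("Mean absolute error (cm):", mean_absolute_error(actual, predicted))
```

The error measure is in centimeters, so it tells you how far off a typical prediction is, which the R squared on its own does not.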
To perform the same analysis in R, we’ll first install a commonly used R package, ggplot2, which is used for effectively visualizing and analyzing data. I’ll select a CRAN mirror that’s close to me. We need to load ggplot2 whenever we want to use it. We’ll read in our data using the “read.table” function and put our data in a variable. We give “read.table” our file in our current working directory, tell it the file is comma separated and that we do have headers, and we’ll just use the default header names x and y. This automatically infers data types too. We’ll also attach our data frame so we can refer to column headers or variable names without having to refer to the name of our data each time, which is more convenient.
Now we’ll plot the data to see its normal distribution, but we can also use ggplot2 to plot the regression line, or the line of best fit. So we’ll plot our x and y, which are weight and height, and in the smooth function we’ll specify a linear model. As we could see before, the actual heights are close to the predictions of the line.
Implementing a simple linear regression in R is quite easy using the “lm” function. Now, to see the first few predictions of height, we’ll use the predict function. We first need to get all of the predictions, and we’re just going to print the first few to have a quick look, so the first 0 to 6, and we’ll compare these with our actual values. As seen before, for the first few cases, the predictions are pretty close. To print the R squared, or percentage of explained variance, for assessing the model, we’ll use summary. As seen before, it explains nearly all the variance, but it’s a good idea to also look at errors or other measures for this.
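If you’re curious what a line of best fit actually involves, the constant and slope for a single predictor can be worked out by hand with the ordinary least squares formulas. This is a plain Python sketch of that general technique on made-up data, not a description of how R’s functions are implemented internally.

```python
def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = sum of cross-deviations / sum of squared deviations in x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical weights (kg) and heights (cm) that lie exactly on a line.
weights = [50.0, 60.0, 70.0, 80.0]
heights = [150.0, 160.0, 170.0, 180.0]

intercept, slope = fit_line(weights, heights)
print(intercept, slope)
```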
Finally, now that we’re finished, we’ll detach our data.
In the last part of this tutorial, we’ll push our code to a GitHub repository so you can share your code publicly, or store it privately if you wish. You can create a GitHub account for free. You can also follow Data Science Dojo to clone or access a copy of the code provided as part of the course material. Once you have created an account, add a new repository, without initializing it, via the GitHub website.
The instructions to push your code to GitHub are on the website, but I’ll take you through the process anyway. First, open your terminal and “cd” into your current project directory. You’ll need to configure your user name and user email. We’ll initialize our project directory as our Git repository. Then we’ll add all files in our project folder. We’re not pushing it live yet; it’s just selecting the files. Commit your files to track the first submission, with a message, should you wish to publish updates later on.
So I’m just gonna say “first go at implementing simple linear regression”. As you can see, all the files in the “project 1” folder are there. Now we’re going to give the URL of our main repository, so go to the main page of your GitHub repo, copy the URL, and paste it into the terminal when adding a remote repo. Finally, we’re going to push our code to the repo’s master branch on GitHub. Now, if you have a look at your GitHub repo, you can see all your files are there. All the work we have done in this tutorial is here. Alternatively, after initializing your GitHub repo via the site, you can simply drag and drop your project folder onto the main page of your repo.
Now that you’ve gone through the basics, you should feel ready to dive into the course and gain a deeper and wider understanding of data science.
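For reference, the Git steps walked through above can be listed in order with a short Python script. This is only a sketch: the repository URL, name, and email are placeholders, and the “--global” flag, the remote name “origin”, and the “-u” option are assumptions on my part, since the transcript doesn’t spell out the exact commands. The script only prints the commands; the hypothetical “run_all” helper would actually run them inside your project folder.

```python
import subprocess


def git_push_commands(remote_url, message, name, email):
    """Build the Git commands from the walkthrough, in order, as argument lists."""
    return [
        ["git", "config", "--global", "user.email", email],  # --global is an assumption
        ["git", "config", "--global", "user.name", name],
        ["git", "init"],                                      # make the folder a Git repository
        ["git", "add", "."],                                  # select all files, nothing is live yet
        ["git", "commit", "-m", message],                     # record the selected files with a message
        ["git", "remote", "add", "origin", remote_url],       # "origin" is an assumed remote name
        ["git", "push", "-u", "origin", "master"],            # push to the master branch
    ]


def run_all(commands, cwd):
    """Run each command inside the project folder, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


# Hypothetical placeholders; substitute your own details and repository URL.
commands = git_push_commands(
    remote_url="https://github.com/your-username/your-repo.git",
    message="first go at implementing simple linear regression",
    name="Your Name",
    email="you@example.com",
)
for command in commands:
    print(" ".join(command))  # display only; quote the message when typing it in a terminal
```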
You know how to set up Python and R on your machine, how to do basic scripting for reading and visualizing data, how to apply a model and assess it, and now you can share your hacks and projects on GitHub. The data used in this tutorial, the coded examples, the commands, the URLs to programs, and so on all accompany this video.
My name is Rebecca Merrett. Feel free to reach out to me by commenting on this video. I’m more than happy to help you get ready before you start your course. Thanks for watching, and happy analyzing.