Writing Your Script in Rvest
In this tutorial we will look at the tools that R has to perform simple web scraping tasks using the rvest package.
What You'll Learn
> Write standard web scraping commands in R
> Filter timely data based on time difference
> Analyze or summarize key information in the text
> Send an email alert of the results of your analysis
Packages used in this tutorial are
The R full script for this video tutorial can be accessed here
To see an example of web scraping timely political news events and commentary from Reddit, check out Data Science Dojo’s blog tutorial on KDnuggets: https://www.kdnuggets.com/2018/12/automated-web-scraping-r.html
Hi there, welcome to this Data Science Dojo video tutorial on automating the tasks of web scraping for analyzing text.
In this video tutorial we’ll be web scraping for text analysis periodically. There are many applications that require text analysis on recent or timely content, or capturing changing events and commentary or analyzing trends in real-time and so on.
As fun as it is to do an academic exercise or web scraping for one-off analysis, it’s not useful when wanting to use timely or frequently updated data. So I’ll take you through the process of writing standard web scraping commands in R, filtering timely data based on time diffs, analyzing and summarizing key information in the text and sending an email alert of the results of your analysis.
Then in Part 2 I’ll show you how to automate running your script every hour, or at any other interval, so it can run in the background of your computer and free you to work on more interesting tasks. So let’s imagine you would like to tap into news sources to analyze the events happening in Bitcoin, or in any other area that changes by the hour. These events could be analyzed to summarize the key changes or movements, read the overall sentiment of recent discussions, capture important events and so on. So you need a collection of recent Bitcoin events or news scraped every hour so you can analyze them. We’re going to use MarketWatch’s Bitcoin articles as an example data source, but it could be any other news website or any other topic area.
If you check out our blog tutorial linked below this video, you’ll see an example of using Reddit to scrape political news and events every hour. Just keep in mind that with a repository such as Reddit, it’s easy to filter pages that were published within a time frame, as they are usually marked as published X hours ago or X minutes ago. When you’re dealing with dates in different time zones, it’s not that simple, so we’ll also tackle this problem in this video tutorial.
So let’s start with a quick demonstration of scraping the main head and body text of a single web page, just to get familiar with the basic commands. We’re going to use the rvest library for this, so if you uncomment this line and run it, you’ll install the library. This also applies to all the other libraries that we’re going to use in the tutorial. And don’t forget to load it into R.
So we’re just going to scrape a single web page. If we look at our example page here, we’re going to use this one as an example, so copy that URL and we’re going to call it our MarketWatch webpage. I’m going to use the read_html function and we’re going to feed it our data source.
Now that we’ve read in our data source, we want to scrape the title of the webpage, so how do we get the title? Basically, look at the source code and search for the title text, “Bitcoin jumps”. As you can see, the title lies within the title tags. What this means is the program is going to search for the title tags and then grab everything in between them. So let’s go ahead and write this. We’ll refer to our source, we want to look for the HTML node called “title”, and we want to grab the text. Okay, great. When we run this command it should output the title of the webpage, so let’s go ahead and do this. And as you can see, it successfully grabbed the title.
Now what we want to do is grab the body. The body text usually lies within the paragraph tags of a web page, so we’re going to write a similar command here on our MarketWatch webpage. We want to get all nodes, or all instances, of the paragraph tag, and we want to grab the text. When we run this command we should get all the text that lies within each paragraph tag, so let’s go ahead and run this, and I’ll show you up here. It looks like we grabbed all the body text that was lying within these paragraph tags. Now that we’ve had a quick play, let’s get right into it.
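The title and paragraph commands described above can be sketched as follows. This is a minimal illustration, not the video's exact script: the HTML is a made-up inline string so the example runs without a network connection, and the variable names are assumptions.

```r
library(rvest)

# Made-up stand-in for the article page; in the tutorial this would be
# read_html() called on the article's URL.
sample_html <- '<html><head><title>Bitcoin jumps</title></head>
<body><p>First paragraph.</p><p>Second paragraph.</p></body></html>'
sample_page <- read_html(sample_html)

# Single node: the <title> tag, then everything between its tags
page_title <- html_text(html_node(sample_page, "title"))

# All nodes: every <p> tag, giving one string per paragraph
page_paragraphs <- html_text(html_nodes(sample_page, "p"))

print(page_title)
print(page_paragraphs)
```

Note the difference between html_node, which returns the first match, and html_nodes, which returns every match.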
So we’re going to read in our source, which in this case is basically a search results page of everything recently published on Bitcoin, so let’s go ahead and run this. Now we want to get the URLs on this webpage, the URLs of our articles, and they lie within this specific div tag here. We also want to get the href attribute: we don’t want to get the text per se, we want to get the href attribute for our URLs. So let’s go ahead and run this, and let’s check our URLs. Okay, great. As you can see, there are 15 URLs, or 15 news articles, that we’re interested in.
Now we want to get the published date times of these news articles. Depending on the time of day, some of them are made invisible and some of them are not, so let’s go ahead and check this. Okay, it looks like they’re all visible now, so what we’re going to do is modify this code a little bit. We’re just going to get rid of invisible and rerun this. Okay, great. We have all our 15 date times. If some of them were made invisible, what we would do is run two commands and join the invisible and visible date times together.
Now what we need to do is treat these date times accordingly: we need to clean them up a bit, convert them into standard date-time formats, and take time differences. A good package for all these kinds of tasks is called lubridate. It’s designed for this kind of thing, so we’re going to install and load it into R. Okay, great. As you can see here, lubridate finds it difficult to interpret a.m. and p.m. with periods in them, so what we need to do is remove the periods to make it easy for lubridate to understand a.m. or p.m. We’re just going to remove them by replacing them with an empty string. So let’s go ahead and run this command, and now it’s ready to pass into the parse_date_time function. This is going to create a standard date-time format, which just makes it easier to work with later.
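As a rough sketch of the URL and date-time steps above, the example below pulls href attributes and raw date strings out of a small inline page, then strips the periods from a.m. and p.m. The markup, the class names searchresult and deemphasized, and the date format are all hypothetical stand-ins; the real page's tags have to be read off its source code.

```r
library(rvest)

# Hypothetical search-results markup; the real page's tags and class
# names will differ.
results_html <- '<html><body>
  <div class="searchresult"><a href="https://example.com/story/bitcoin-jumps">Bitcoin jumps</a></div>
  <div class="deemphasized"><span>10:15 p.m. 12/03/2018</span></div>
  <div class="searchresult"><a href="https://example.com/story/bitcoin-falls">Bitcoin falls</a></div>
  <div class="deemphasized"><span>6:30 a.m. 12/03/2018</span></div>
</body></html>'
results_page <- read_html(results_html)

# Grab the href attribute of each link rather than its text
urls <- html_attr(html_nodes(results_page, "div.searchresult a"), "href")

# Grab the published date-time strings as text
datetime_raw <- html_text(html_nodes(results_page, "div.deemphasized span"))

# Replace every literal period with an empty string: "p.m." becomes "pm"
datetime_clean <- gsub(".", "", datetime_raw, fixed = TRUE)

print(urls)
print(datetime_clean)
```

Using fixed = TRUE matters here, because an unescaped "." in a regular expression would match every character.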
So let’s go ahead and do this and we’ll have a look at it. Okay, great, now all our date times are in standard date-time formats. Before we go further, let’s have a look at our example article here. You can see that all the articles are published in Eastern Standard Time, but what if we’re not in Eastern Standard Time? What we’re going to do is take time differences between the date time of the article and our current time, and it’s going to be difficult to take those differences if we’re working in different time zones. So we’re going to take these date times, first declare them as Eastern Time, and then ask for them to be converted into our local time, which in my case is US Pacific Time. Let’s go ahead and run these and have a look at the converted date times. Okay, great, everything’s in Pacific Time.
Now we need to create a data frame. We have our date times here and we have the webpage URLs that we grabbed before, so we’re just going to stick them into a data frame with one column called webpage and another column called datetime. Let’s go ahead and run this, and we should have 15 rows with two columns. Let’s check this is the case. And it is.
Now we’re going to create another column in our data frame called diffhours. We’re basically going to take our current time, or our system time, and compare it with the date time that the article was published, and we want to get the differences in hours. You can get it in minutes or another unit of measure, but we’re just interested in hours, so let’s go ahead and run this and have a look at our differences. Okay, great. Now, it’s not clear whether these are in a proper double data type, so let’s just make sure they’re treated as doubles here. Let’s go ahead and run this command and have a look at it. Okay, great, they’re treated as doubles now, which is what we need.
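The parsing, time zone conversion and hour differences described above might look roughly like this. It is a sketch under assumptions: the date strings and URLs are invented, the "IMp mdY" order matches only this invented format, and a fixed reference_time stands in for Sys.time() so the numbers are reproducible.

```r
library(lubridate)

# Invented inputs, already cleaned of the periods in a.m./p.m.
datetime_clean <- c("10:15 pm 12/03/2018", "6:30 am 12/03/2018")
urls <- c("https://example.com/story/bitcoin-jumps",
          "https://example.com/story/bitcoin-falls")

# Parse as Eastern Time (hour, minute, am/pm, then month/day/year)
published_eastern <- parse_date_time(datetime_clean, orders = "IMp mdY",
                                     tz = "America/New_York")

# Same instants, displayed in US Pacific Time
published_local <- with_tz(published_eastern, "America/Los_Angeles")

articles <- data.frame(webpage = urls, datetime = published_local,
                       stringsAsFactors = FALSE)

# Assumption: a fixed "current time"; a live script would use Sys.time()
reference_time <- ymd_hms("2018-12-03 20:15:00", tz = "America/Los_Angeles")

# Difference in hours, stored as plain doubles rather than difftime objects
articles$diffhours <- as.double(difftime(reference_time, articles$datetime,
                                         units = "hours"))
print(articles)
```

with_tz changes only how the time is displayed, not the moment it refers to, which is why the differences come out the same whichever zone they are computed in.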
Now that we’ve got these values, we’re going to create a column called diffhours and stick them in it. We’re going to add it to our data frame, so let’s go ahead and run this command. Now that we’ve got our data frame of webpage URLs, the date times and the differences in hours, we’re going to use these differences to subset the rows down to everything that was published, say, one hour ago or two hours ago, however long you would like. In my case I’m just interested in everything that happened within the last seven hours. So what we’re going to do is look at everything with a time difference of less than seven hours and filter these rows down to the ones we’re interested in. Let’s go ahead and run this and have a look at it.
Okay, so we’ve got one article that was published within the last seven hours. What we’re going to do now is take this new data frame with the filtered webpages and read each page. We’re going to grab the title of each webpage, grab all the paragraph tags and collapse them into a single body, and place them into their respective title and body vectors, so let’s go ahead and run this. Now we’re going to add all the titles into the title column of our data frame, and we’re going to do the same with body, so let’s go ahead and run these. Okay, great. Now if we have a look at the column names of our data set, we should see a more complete data set.
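A minimal sketch of the filter-and-loop step above follows. To keep it runnable offline, the "pages" are inline HTML strings held in a hypothetical fake_pages vector keyed by URL; a real script would call read_html on each URL instead. The seven-hour cutoff comes from the video, the data values do not.

```r
library(rvest)

# Invented data frame standing in for the one built earlier
articles <- data.frame(
  webpage = c("https://example.com/a", "https://example.com/b"),
  diffhours = c(1, 16.75),
  stringsAsFactors = FALSE
)

# Keep only the rows published less than seven hours ago
recent <- articles[articles$diffhours < 7, ]

# Hypothetical stand-in for the live pages, keyed by URL
fake_pages <- c(
  "https://example.com/a" = "<html><head><title>Bitcoin jumps</title></head><body><p>First.</p><p>Second.</p></body></html>",
  "https://example.com/b" = "<html><head><title>Old news</title></head><body><p>Stale.</p></body></html>"
)

titles <- character(0)
bodies <- character(0)
for (i in seq_len(nrow(recent))) {
  # Live version: page <- read_html(recent$webpage[i])
  page <- read_html(fake_pages[[recent$webpage[i]]])
  titles[i] <- html_text(html_node(page, "title"))
  # Collapse all paragraphs into a single body string
  bodies[i] <- paste(html_text(html_nodes(page, "p")), collapse = " ")
}

recent$title <- titles
recent$body <- bodies
print(names(recent))
```

Looping with seq_len(nrow(recent)) means the loop simply does nothing when no article falls inside the time window.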
Okay, great, so we’ve got the webpage URL, the date time, the differences in hours, the title and the body. What we’re going to do now is inspect the body text a little bit more. This is what we’re going to analyze or summarize later, so if we have any major issues here, we want to know about them now rather than when we’re ready to analyze it. Let’s have a look at this, and we’ll just look at the first case; well, we’ve only got one case anyway. Okay, as you can see here, there are a few problems with the text. We’ve got these random new lines and carriage returns, and we’ve got random whitespace happening here, so it’s going to be difficult to analyze the text while it has these problems.
What we’re going to do is just clean out the major junk. We’re not going to go too far into cleaning in the sense of normalizing the text by lowercasing it, removing stop words and so on; we’re just going to get rid of all the obvious junk. A good package for this is called stringr, so let’s install and load it into R. We’re going to use this function here to get rid of all the major junk in the text, so those new lines, carriage returns, random whitespace and so on, and we’re going to apply it to the body. Let’s go ahead and run this and have a look at our text to make sure it’s clean. Okay, great. We did a pretty good job of cleaning it up and got rid of all those major problems. It’s going to make it easier for us to summarize later.
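One way to do this kind of light cleanup with stringr is sketched below. The video does not name the function it uses, so str_squish here is an assumption rather than the tutorial's own choice, and the sample text is invented.

```r
library(stringr)

# Invented body text with new lines, carriage returns, tabs and extra spaces
body_raw <- "  Bitcoin prices\r\n rose on   Monday.\n\n\tAnalysts were surprised.  "

# str_squish trims both ends and collapses every run of whitespace
# (spaces, tabs, new lines, carriage returns) into a single space
body_clean <- str_squish(body_raw)

print(body_clean)
```

str_squish is vectorized, so the same call works on a whole body column at once.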
In this part of the tutorial we’re actually ready to summarize the text. We’re going to use the LSAfun library, which uses a simple ranking algorithm to summarize the text. There are more sophisticated ways to summarize text and extract information, and other types of analysis can be done on the body or the title text. Data Science Dojo’s bootcamp covers text analytics and how to write programs to make sense of text, if you want to take this further, but in the meantime we’re just going to use a simple summarizer.
What we’re going to do is loop through each body text and grab the top three sentences with the most relevant information, meaning the most keywords and the most information-rich sentences, and then we’re going to stick each result into its respective summary vector. So let’s go ahead and run this. First of all, we need to install and load the library. Now let’s have a look at our summary here, or summaries if we had more than one case. Okay, great, so now we have a much more condensed version of the text. When I email this to myself later, I don’t have to email the entire body text of the article. I just want to get a quick snapshot of the key events in this article and not have to read the entire thing. What we’re going to do now is add it to our data frame. I’m going to put it under a column called summary, so let’s go ahead and do this.
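To illustrate the general idea of ranking sentences and keeping the top few, here is a hand-rolled base R stand-in. It is not LSAfun's algorithm: it simply scores each sentence by how frequent its longer words are across the whole text. The function name summarize_text and the sample text are made up for this sketch.

```r
# Score sentences by summed word frequency and keep the k best,
# returned in their original order. A simple stand-in, not LSAfun.
summarize_text <- function(text, k = 3) {
  # Split after sentence-ending punctuation followed by whitespace
  sentences <- trimws(unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)))
  sentences <- sentences[nzchar(sentences)]

  # Lowercase words longer than three letters act as crude keywords
  tokenize <- function(s) {
    words <- tolower(unlist(strsplit(s, "[^A-Za-z]+")))
    words[nchar(words) > 3]
  }

  word_freq <- table(tokenize(text))
  scores <- vapply(sentences, function(s) {
    sum(word_freq[unique(tokenize(s))])
  }, numeric(1), USE.NAMES = FALSE)

  keep <- order(scores, decreasing = TRUE)[seq_len(min(k, length(sentences)))]
  paste(sentences[sort(keep)], collapse = " ")
}

body_text <- paste(
  "Bitcoin rose sharply today.",
  "The weather was mild.",
  "Bitcoin traders expect Bitcoin volatility.",
  "Lunch was fine.",
  "Analysts say Bitcoin demand rose sharply."
)
summary_text <- summarize_text(body_text, k = 3)
print(summary_text)
```

In a loop over several articles, each result would be stored at the matching position of a summary vector and then added to the data frame as the summary column.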
Now we’re basically ready to email it to ourselves, so we’re going to use this library here, and we’ll install and load it into R. This is going to allow us to email text to ourselves. Now, I could simply take the data frame and email that to myself, but I’m going to go a bit beyond that. I’m only really interested in the title and the summary from this data frame, so what I’m going to do is create a vector that prints things in a certain order. I want it to print title 1 followed by summary 1, then title 2 followed by summary 2, and so on. So let’s run this code here so we can have things print in order, and let’s check that it does by having a look at this. The body of my email is going to have the title followed by the summary, and if there were more than one case it would be title 2 followed by summary 2, and so on.
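The title-then-summary ordering described above can be produced in base R without a loop. This is one possible approach with invented values, not necessarily how the video's script does it.

```r
# Invented titles and summaries for two articles
titles <- c("Title 1", "Title 2")
summaries <- c("Summary 1", "Summary 2")

# rbind stacks titles over summaries; c() then reads the matrix column by
# column, giving title 1, summary 1, title 2, summary 2
email_lines <- c(rbind(titles, summaries))

# Join into one string with blank lines between entries for the email body
email_body <- paste(email_lines, collapse = "\n\n")
cat(email_body)
```

This relies on R storing matrices in column-major order, so each column (one article) is emitted as a title immediately followed by its summary.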
So now we’re ready to set up the parameters. These are the parameters that we’re going to input into this function here to send our email. We’ve got the from email, the to email, the subject line and the body text, which is going to be our titles and summaries, and if you’re using Gmail you want to specify Google’s SMTP server. You could also send it to more than one email address, so you can create a vector or a list of emails here, but for the purposes of demonstration I’m just going to send it to myself. So let’s run these. When you see this here, you know that it was sent successfully. Here’s one that I sent to myself not too long ago: you’ve got the title of the article followed by the summary, and this gives me a nifty little snapshot of everything that happened in the Bitcoin world in the last few hours or so.
Thanks for watching. In the next part of this video tutorial I’ll show you how to run your script hourly, so you can automate this process even further.
If you found this video useful, give us a like. You can also check out other videos at Data Science Dojo tutorials.
Rebecca Merrett - Rebecca holds a bachelor’s degree of information and media from the University of Technology Sydney and a post graduate diploma in mathematics and statistics from the University of Southern Queensland. She has a background in technical writing for games dev and has written for tech publications.
© Copyright – Data Science Dojo