Hello, ladies and gentlemen of the Internet. My name is Phuc Duong, Senior Data Engineer for Data Science Dojo. And I’m here to teach you how to web scrape with Python.
So in front of you is actually a website that we can web scrape. What you're looking at is the storefront of a website called Steam. So Steam sells video games. And the cool thing about Steam is that they do flash sales every day. So the user has to come back every day and study this page: what is a good deal? What is not a good deal? And it's a lot of information. This is how they've gamified shopping online.
Now there's a website that actually scrapes Steam's front page in real time, shows you the best deals, and ranks them. So a lot of people ask me, how do I get all of my data? And in the absence of APIs, web scraping is actually a very important tool for data scientists and data engineers to know, because the entire internet becomes your database. I can scrape any storefront, Nordstrom, Macy's, study the sales, web scrape reviews. I can web scrape baseball stats and baseball players in real time. Wikipedia is also a good place to web scrape. For example, you can see that this frame over here for this Harry Potter character, Ron Weasley, is very standardized. I could write a web scrape script, loop over every single Harry Potter character very quickly, and create a data set. All right.
Today we're going to learn how to do that. So today I'm on Windows. If you're on Linux you can install Python normally, but if you're on Windows, I highly recommend installing Anaconda instead. So if you go to Google and just type in Anaconda, it should come up under continuum.io, and you just download it based upon your operating system. OK. The next thing I'll be using is a text editor called Sublime Text. So you can just go ahead and go to Google, type in Sublime Text, and then install that. I like using Sublime Text 3. All right. So go ahead and install these. One warning: if you're using Anaconda, the installer is actually a pretty big file, something like 500 megabytes, OK? So be warned of that.
So what I'm going to do is go ahead and open up my command line. And for those of you who don't know, if you go to any folder, hold down the Shift button, right click, and say Open Command Window Here, this opens up the command line for you. And this is where you can work with Python. So if you type in python right here, and you've installed either Python or Anaconda, this will show up. So notice that I'm using Python 3.5 with Anaconda. And if I just do a very quick two plus two, it should equal 4. That's how I know I'm inside of my console. All right, next thing: now that I know that, if I hold down Ctrl and hit C, Ctrl+C, the same keys as a copy on Windows, it will exit this console, and I get back to the regular Windows command line.
So what I'm going to do now is go ahead and install a package called Beautiful Soup. That's the package that we're going to use to web scrape, actually. It's a very powerful package, and I encourage those of you who want to go beyond this introduction to go ahead and learn it. So all you've got to do is a pip install bs4. OK, bs4 stands for Beautiful Soup 4. So here we are: Beautiful Soup has been installed. And how do I know it's been installed? Well, if I type in python, and I type in import bs4, it should just not error. OK. Awesome. So that's how I know the package is installed and ready to go.
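If you want to double-check the install outside of the interactive console, a quick sketch like this works (the version number printed will depend on what pip installed):

```python
# If Beautiful Soup 4 installed correctly, this import succeeds
# without raising ImportError.
import bs4

print(bs4.__version__)  # prints whatever version pip installed
```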
Next thing I want is a web client. So Beautiful Soup is a good way to parse HTML text, that's all it is. It's a good way to traverse HTML text within Python. Now I actually need a web client to grab something from the Internet. And the way you do that in Python is with a package called urllib. Inside of urllib there is a module called request, and inside of that module is a function called urlopen. OK? I know it's a lot to take in, but settle down, we're going to do it step by step. I'm going to do a really quick all-in-one import. So I can do from urllib.request. So I'm calling a package called urllib. If you're on Python 2, this is a different package; it's called urllib2. And I'm calling a module within that.
So notice, I'm importing only what I need. I don't need all of urllib, I just need the request module. And out of that I'm going to import urlopen, the one basic function that I need. And it's going to import all the basic dependencies as well. And I'm going to give it a name, because I don't want to type out urlopen every time; I'm going to call it uReq for short. That's how I tend to do things. And I can also modularize the import of Beautiful Soup as well. So I can do from bs4 import, and this is important, capital B Beautiful, capital S Soup: BeautifulSoup. And then I'm going to alias it as soup, so I don't have to type out BeautifulSoup every time I want to use this package. Now, this is me working in the console, this is me playing around. If you want to, you can actually start typing it into a script. So in this case, I have Sublime open, and I'm going to do a Ctrl+Shift+P to open up the command palette, and then I'm going to say Set Syntax: Python. OK. Beautiful. So now I can do the same commands in here. So if I just select this in the command line and hit the Enter button, that will copy it, so I can paste it into my script here. OK? So there you have it, the first two lines of the script. So now I'm ready to go.
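Written out, those two import lines look like this:

```python
# urlopen fetches a page over the network; BeautifulSoup parses the
# HTML it returns. On Python 2 the first import would come from urllib2.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
```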
Beautiful Soup is going to parse the HTML text, and urllib is actually going to grab the page itself. But what do we want to web scrape? Well, I like graphics cards. I'm going to web scrape graphics cards off newegg.com. Some of you might know it; it's basically Amazon, but for hardware and electronics. So I'm going to type in, for example, graphics cards. So these are a bunch of graphics cards that have shown up in my search results. And it would be nice to tabularize this and turn it into a data set. And notice that if a new graphics card is introduced tomorrow, or if ratings change tomorrow, or prices change tomorrow, I run the script again and it updates whatever I load the data into: a database, a CSV file, an Excel file, it doesn't matter. So in this case, I'm going to grab this URL. OK. That's all I'm going to do. So basically I'm going to copy this URL and paste it into my script. So in this case, I can do my_url is equal to, and that is the URL I want to use. And in this case, I will actually run it in my console.
So when I'm web scraping, I like to prototype in the command line as well, so I know that the script is going to work. And then once I know that it works, I'll go ahead and paste it back into Sublime. OK, so this is my URL. I've gone ahead and created a variable and placed the URL string into it. Now this is going to be good. So now I will actually open up my web client. In this case, I would do uReq. So notice I'm calling urlopen by the shorthand name I gave it earlier. I wrote from urllib.request import urlopen as uReq, so I'm actually calling the function called urlopen, inside of a module called request, inside of a package called urllib.
So the next thing is, I'm going to throw my URL into this thing. What this is going to do is open up a connection, grab the web page, and basically just download it. So it's a client. So I'm going to write uClient = uReq(my_url). It's going to take a while depending on your Internet connection, because it's actually downloading the web page. There, OK, it's done. So now I can do a read, uClient.read(). But if I do read, it's going to dump everything out of it right away, and I can't reuse it. So before it gets dumped, I want to store it into a variable. Since this is the raw HTML, I'm just going to call it page_html: page_html = uClient.read(). I could go ahead and show it to you, but depending on how big the HTML file is, it could actually crash the console, so I'll show it to you once it's inside of Beautiful Soup. Bear with me here. And with any web client, since this is an open Internet connection, I want to close it when I'm done with it. So uClient.close() is what I'm going to do. And knowing that all of these lines of code have worked so far, I can just go ahead and copy them into my script. So my_url is that. And uClient is, and let me just add some documentation: opening up the connection, grabbing the page. OK. Then this line offloads the content into a variable, and this one closes the client. The next thing I need to do is parse the HTML, because right now the HTML is a big jumble of text. So I need to call the soup function that I aliased earlier. Remember I wrote from bs4 import BeautifulSoup as soup, so if I call soup as a function, it calls the BeautifulSoup function within the bs4 package. So in this case, I will do soup of my page_html.
And then if I do a comma here, I have to tell it how to parse the text, because it could be an XML file; in this case, I will tell it to parse it with "html.parser". And I need to store the result into a variable or else it's going to get lost. So in this case, I'll call it page_soup. I know it's kind of weird that they call it a soup, but it's the standard notation.
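Put together, the download-and-parse steps look like this. To keep the sketch runnable without touching the network, I'm using a tiny data: URL as a stand-in; you'd swap in the real Newegg search URL:

```python
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

# Stand-in URL so this sketch runs offline; replace it with the real
# Newegg search URL when scraping for real.
my_url = "data:text/html,<html><body><h1>Video%20Cards</h1></body></html>"

uClient = uReq(my_url)      # opening up the connection, grabbing the page
page_html = uClient.read()  # read() drains the stream, so store it first
uClient.close()             # close the connection when done

page_soup = soup(page_html, "html.parser")  # does my HTML parsing
print(page_soup.h1.text)    # -> Video Cards
```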
Now, when you say soup, people understand that this is the data type: it's derived from the Beautiful Soup package. All right. So this does my HTML parsing. OK. So now, if I go to the page_soup and I just try to look at the h1 tag, page_soup.h1, I should see the header of the page. So this does say video cards and video devices, so I should see that somewhere. So notice that it grabbed this header right here. And just for good measure, let's see what else is in there. So page_soup dot, maybe there's a p tag in there I can look at. So "newegg.com, a great place to buy computers." I think that might be at the very bottom. Actually, no, it might be something that's hidden; it might just be a tagline. All right. But I am on this page. So now what we need to do is traverse the HTML. Basically, I'm going to convert every graphics card that I see into a line item in a CSV file. To do that, now that I have a Beautiful Soup data type, I can actually traverse the DOM elements of this HTML page. So let me show you how to do that real quickly. If I inspect an element of this page, I can find the body tag, for example. So if I do page_soup.body, I can keep going deeper with dots; notice that this body tag goes even further, into an a tag or a span tag. So if I type in body dot span, I should find this span tag. See that? Span, class noCSS, skip to. That's awesome.
So the next thing I'm going to do, let me just make this HTML a little bit bigger so you guys can see it even further. All right. I'm in Chrome, but you can also use Firefox's Firebug to inspect the HTML elements of a page. So I'm going to select the name of this graphics card right here, and inspect that element. It jumps me directly to this a tag. And I want to grab the entire container that the graphics card is in, because I know that container holds other goodies, such as the original price, its sale price, its make, its review type, and the card image itself. So I go outward. Since HTML is a nested kind of tagging language, I can go out until I find whatever it is that contains all of this. So notice that this div right here, with the class item-container, houses all of the items inside of it. So basically I need to set up a loop.
I'm going to set my syntax to HTML. OK, it's in HTML now, but that's not pretty. I want to use an external service called JS Beautifier. It does all the spacing where there needs to be spacing. With JS Beautifier, you basically paste in ugly code and it turns it pretty. See that? Everything is now nicely spaced and delimited. Here we are. Now let's read what's actually in this thing. So if I open this up now, I know it's going to be a little bit hard to read. What kind of things do we want out of this thing? If we go through, we can see that there are some pretty useful things. We can see that the items have ratings. It has a product name; we want to grab the product name for sure. Let's see, there is its brand; I can grab its brand. So notice that they gave the image the name of the brand, which is useful. The image itself says EVGA, but that's an image; I can grab the image, I just can't parse what it says unless I use image recognition. But notice that the title attribute encodes what brand it is for us. So that's very convenient. This is something that we want to grab. And I also want to be sure I grab things that are present in every item; if not, I'm going to run into corner-case if-else statements. So notice that this guy right here is special: he doesn't have any egg reviews. So if I wrote something to parse reviews, I'd need to write an if-else statement, or I'd have to do a try-catch around an index-out-of-range error. OK. And notice that it doesn't even have this number, which I think is the number of reviews. So I'll let you guys go ahead and handle the scraping of that; I'm going to scrape things that are present in all of them. I'm going to scrape the names: all of them seem to have the name of the brand or the name of the product.
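As a sketch of that loop setup, here's findAll pulling the item containers out of a small hand-written HTML sample (the sample markup is my own stand-in; the live Newegg page is much bigger):

```python
from bs4 import BeautifulSoup as soup

# Hand-written stand-in for the Newegg markup: each product sits in
# a div with class "item-container".
sample_html = """
<div class="item-container"><div class="item-info">card one</div></div>
<div class="item-container"><div class="item-info">card two</div></div>
"""

page_soup = soup(sample_html, "html.parser")

# findAll returns a list with one entry per matching tag.
containers = page_soup.findAll("div", {"class": "item-container"})
print(len(containers))  # -> 2
```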
And then I’m going to go ahead and scrape the product itself. And not all of them have a price. You see that? I have to add it to the cart to see the price. And let’s see what else is good. And they all seem to have shipping.
So I'm going to grab shipping to see how much they all cost. Once you learn how to scrape one, it's really the same for all of them. Now, if you want to loop through all of them, you have to do those if-else statements to catch the loose cases that aren't there. So notice that if I take containers[0] right now, I'm going to throw containers[0] into a variable called container. Later I'm going to do a for loop that says for container in containers. So right now I'm prototyping the loop body before I build the loop; I want to make sure it works once before I even build the loop. So this container holds a single graphics card. I'll call it container instead of contain. So, container dot what? Let's see what is in here. Notice that container.a will bring me this thing back. So if I do container.a, it brings me back exactly what I thought it would: the item image. The item image is not that useful to us. Let's see if there's anything we can redeem in here. The title, we might be able to redeem the title, but it seems we can also grab that down here, which I think might be the more efficient way to grab it. So let's get it from there instead, because that's what the customer sees; that's what you will see when you go and visit the page. So instead of doing dot a, we will do dot div; we'll jump from this a directly into this div. So I'll go ahead and push up and say container.div. That jumps me into this div right here and everything inside of it. OK. Boom. So if I go into that container.div, I'll just assume this is the right one. I know web scraping HTML tends to be hard because it hurts your eyes, unless you know how to read HTML very well, but it's something to get used to. So I know that I'm in this div, and I want to go into another div called item-branding. So div dot div.
And inside of that div there is, I think, an a tag. This a tag actually contains some things that we want, which is this guy right here, the make of this graphics card: dot div dot a. And there we have it. So here's the href of the link. What I'm grabbing is this guy right here, this EVGA thing. Notice when I hover, it's a clickable link; that link is this guy right here. But what I really want is the title of this link. So what do I want? I want to add dot img to grab this image tag now. So notice I'm just using these handles, referencing it as if it were a JSON file. And now I'm inside of the image. So the image is here, and I need to grab this title, which is an attribute inside of the image tag.
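Here's that traversal as a runnable sketch, using a hypothetical single-product container modeled on the markup described above (the real page nests more tags than this):

```python
from bs4 import BeautifulSoup as soup

# Hypothetical item-container modeled on the structure described above.
sample = """
<div class="item-container">
  <div class="item-info">
    <div class="item-branding">
      <a href="#"><img title="EVGA" alt="EVGA"></a>
    </div>
  </div>
</div>
"""
container = soup(sample, "html.parser").div  # the item-container div

# Walk div -> div -> a -> img, then read the title attribute.
brand = container.div.div.a.img["title"]
print(brand)  # -> EVGA
```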
So how do you grab an attribute? Well, you reference it as if it were a dictionary key. So I would say img["title"], and that gives me "EVGA". So now that I have prototyped it, I can go ahead and add that to my script. I can copy this right here and paste it into my script. Inside of my script, this is where I can write that loop now: for container in containers. It's going to loop through, and container.div.div.a.img["title"] is going to become the brand, the make. So that's the first thing I grabbed: who makes this graphics card? That's the first thing it's going to do. So what else do I want to grab while I'm inside of this thing? Let's grab two more things, just to have a really good file, because a CSV file with one column seems a tiny bit pointless. All right, the next thing I want to do is grab the name of this graphics card, which is right here. Notice that it's embedded within this a tag, and this a tag is embedded within this div tag, and this div tag is embedded within this div tag. In theory, if we do container.div.div.a, it seems like it actually brings out the item brand instead. So the item brand is this a tag, which is not what we wanted; we wanted this other a tag. So notice that it's having trouble finding this particular a tag. What I want to do instead is a find_all, and find just the direct class that I want. In this case: find me all the a tags that have the class item-title. So I can do container.find_all, I want to see the a tag, comma, and then I pass it an object that says look for all classes equal to item-title.
So this will give me a data structure back that has everything it found. Hopefully it's only one thing, so that we don't have to loop over it. So in this case, I'll throw the result into a variable called title_container. If I look at title_container, I should have what I'm looking for. Beautiful.
So the name of the graphics card is somewhere in this thing. I'm going to take this and throw it into my script so I can run it later. So going back: the title container, notice this isn't the actual title yet; I still have to extract the title out of it. In my title container, notice that it's inside of brackets, which means it's inside of an array, or in this case a list, since we're in Python. So I index zero to grab the first object. And inside of that first object, I want to grab, nope, it's not inside of the i tag, it's actually the text inside of the a tag. So if I do dot text, this should get me what I want. Yes. So title_container[0].text gives me exactly what I want. So I'm going to place that in there, and I want to call this the title, the product name. So product_name = title_container[0].text. So that is that. I've got the brand, the make of the graphics card, and the name of the graphics card. And now we can go ahead and grab shipping, because shipping seems like something else they all have.
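The same extraction as a sketch, again on stand-in markup (on the real page the a.item-title sits deeper inside the container):

```python
from bs4 import BeautifulSoup as soup

# Stand-in markup for a single product container.
sample = """
<div class="item-container">
  <a class="item-title" href="#">EVGA GeForce GTX Graphics Card</a>
</div>
"""
container = soup(sample, "html.parser").div

# find_all returns a list of matches; take the first and read its text.
title_container = container.find_all("a", {"class": "item-title"})
product_name = title_container[0].text
print(product_name)  # -> EVGA GeForce GTX Graphics Card
```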
So what we're going to do is figure out where this shipping tag is inside of all of it. How much does it cost for shipping? Because I think some of them charge differently for shipping. Yes, this one is $4.99 shipping. So in this case, I need to find all li tags, li stands for list item, with the class price-ship. So I'm going to copy this class, and I want to do container.find_all of li, comma, with class equal to price-ship. And this will give me, hopefully, a shipping container: shipping_container. Hopefully there's only one tag in this thing that has shipping in it. And I need to close that function. So my shipping_container, if I just look at it, you'll see that it gives me back an array of things that qualify. In this case, only one thing came back. So I can do the same thing I did earlier, where I reference the first element, and then it's in the text again, right? So I can do dot text again. And this brings it back, but it looks like there's a lot of open space: notice there's a return and then a new line, a return and then a new line. So I want to clean it up a little bit, because I just want the text. So in this case I'll say strip. Strip removes whitespace before and after, new lines, all that good stuff. So it just says Free Shipping now. So I can go ahead and grab this and throw it into my script as well. So now I've grabbed three things. In this case, I also need the find_all that I did earlier, so if I push up a few times, I can find it. The shipping container itself will be placed in here, and then I close the find_all function, and there we go. So now there are the three things that I want: the product name, the brand, and the shipping container, which will actually be shipping. OK. So cool.
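Sketched out, with the whitespace the tutorial runs into baked into the stand-in markup:

```python
from bs4 import BeautifulSoup as soup

# Stand-in markup; the newlines around the text mimic the real page.
sample = """
<div class="item-container">
  <li class="price-ship">
    Free Shipping
  </li>
</div>
"""
container = soup(sample, "html.parser").div

shipping_container = container.find_all("li", {"class": "price-ship"})
# .text keeps the surrounding newlines, so strip() cleans them off.
shipping = shipping_container[0].text.strip()
print(shipping)  # -> Free Shipping
```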
So now this is ready to be looped through. But before that, I want to print it out. And I want to show you why Sublime is my favorite editor: it does multi-line editing. So in this case, I'm going to go ahead and enter three blank lines. I'm going to copy my three variables, copy, copy, copy, paste them in here, and make it nice and formatted. So I will print all of these things out to the console, just so I can see. In this case I will copy this as well, so I can go ahead and say quote, and then paste that, so I can see what it is when it actually prints out. And then I do a plus for string concatenation. It's going to print each of these three things out for me: the brand, the product name, and the shipping. And basically, before I throw this into a CSV file, I want to make sure that this loop works. So I want to save this web scrape script; I'll call it my_first_web_scrape.py. OK. So if I open this folder, there should be a file here. I right click and open up another console. Notice I have a console from before, but that one is running Python, so I want to open up this new one. And notice that I'm inside of the file path that contains this script. So what I need to do is run python and tell it, OK, now that I'm in Python, execute this script: python my_first_web_scrape.py. Hit Enter. And then, hopefully, look at that. It went through, it did that loop, and it grabbed every graphics card for me.
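Assembled, the prototype loop looks something like this. The two sample products below are hypothetical stand-ins so the sketch runs on its own; in the real script, containers comes from the downloaded page:

```python
from bs4 import BeautifulSoup as soup

# Two hypothetical products in the stand-in markup used above.
sample = """
<div class="item-container">
  <div class="item-info"><div class="item-branding">
    <a href="#"><img title="EVGA"></a>
  </div></div>
  <a class="item-title" href="#">EVGA GeForce card</a>
  <li class="price-ship"> Free Shipping </li>
</div>
<div class="item-container">
  <div class="item-info"><div class="item-branding">
    <a href="#"><img title="MSI"></a>
  </div></div>
  <a class="item-title" href="#">MSI GeForce card</a>
  <li class="price-ship"> $4.99 Shipping </li>
</div>
"""
page_soup = soup(sample, "html.parser")
containers = page_soup.findAll("div", {"class": "item-container"})

rows = []
for container in containers:
    brand = container.div.div.a.img["title"]
    product_name = container.find_all("a", {"class": "item-title"})[0].text
    shipping = container.find_all("li", {"class": "price-ship"})[0].text.strip()
    rows.append((brand, product_name, shipping))
    print("brand: " + brand)
    print("product_name: " + product_name)
    print("shipping: " + shipping)
```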
So all I have to do now is throw this into a CSV file, and I can then open it in Excel. So let's go ahead and do that real quick and finish up our code. And I don't really need to prototype this, because I know that the script works now. To open up a file, you just do the simple open. And then, in this case, I need a file name. So filename = "products.csv". So I want to open up that file name, and I need to specify a mode, in this case "w" for write. So I want to open up a new file and write to it. This would be called f; the normal convention for a file writer is f. And I want to write the headers to this thing. So, in this case, f.write, and now I need some headers, since a CSV file usually has headers. In this case, headers will be "brand", and let's call the next one "product_name", because if you load this into a SQL database later, name is a keyword in SQL. So product_name, and then I'll call this one "shipping". OK. And then I also need to add a newline, because CSV rows are delimited by newlines. So I'm going to tell it to write the first line as a header. And then the next thing is, every time it loops through, I want it to write to the file. So instead of printing to the console, which I'll actually let it keep doing, I'm going to do f.write. So f.write is going to write these three things: brand, product name, shipping. I paste that in there, and that pastes all three of them for me. But what I need to do is concatenate them together, and I need to concatenate them with a comma in the middle. So, comma. And let me just double check something real quick, to see if my strings are clean. And no, they are not. Notice that the product names have commas inside of them. What that's going to do is create extra columns inside of my CSV file. So before I write the product names out, I actually need to do a string replace.
So I need to call a replace function: every time you see a comma, replace it with something else. I like to use a pipe, but you can delimit it with anything you want. This is programming; you can do whatever you want as long as it doesn't error. In this case, I'll go ahead and do that. And also, don't forget this: each row needs to be delimited by a newline. So every time it loops through, it's going to grab and parse all of the data points, and then write them to the file as a line. And once it's done looping, I have to close the file. Because if you don't close the file, you can't open it; only one thing can open the file at a time.
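The file-writing half, sketched with hypothetical rows standing in for the scraped values:

```python
# Hypothetical scraped values; in the real script these come from the loop.
rows = [
    ("EVGA", "EVGA GeForce GTX, 8GB", "Free Shipping"),
    ("MSI", "MSI GeForce GTX", "$4.99 Shipping"),
]

filename = "products.csv"
f = open(filename, "w")

# Headers first; "name" alone would clash with a SQL keyword later.
headers = "brand,product_name,shipping\n"
f.write(headers)

for brand, product_name, shipping in rows:
    # Commas inside product names would create extra columns, so
    # swap them for a pipe before writing the line.
    f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")

f.close()  # close it, or Excel can't open the file while the script holds it
```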
All right. So I will run the script again. Notice if I just push up, it runs the script. But you have to save the script first; I'm going to do Ctrl+S to quickly save it. When I run it... syntax error! I forgot the plus before the newline, so I need to add a plus to tell it to concatenate that. So I run python my_first_web_scrape again. It went through. So after running that script, it has scraped everything and printed everything to the console. But more importantly, it wrote everything to the file; I told it to write everything to the CSV file. So if I open it up right now, you can see that it has gone ahead and scraped the entire page and thrown every product, every data point, as a row into this CSV file. So you can go ahead and scrape the other details, like whether or not there's a sale price, or what the image tag might be. And then there are multiple pages. If you go to Amazon, for example, there are probably multiple pages of products. So you can start looping through. Usually up here in the URL, there's a page= something. So you can just do a loop and say, in this case, do page two instead of page one. And that concludes today's lesson on how to web scrape with Python.
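As for looping over pages, a sketch like this works on many storefronts. The base URL and the page query parameter here are placeholders, since every site names them differently:

```python
# Placeholder base URL; real sites use their own query-string parameter
# for the page number.
base_url = "https://www.example.com/search?q=graphics+cards&page="

# Build one URL per results page; each would be fetched and parsed
# exactly like the single page above.
urls = [base_url + str(page) for page in range(1, 4)]
for url in urls:
    print(url)
```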
I hope you guys learned a lot and had fun doing it. Now I want to really know from you guys, did you guys enjoy this kind of video? Do you guys want more coding videos? More data science videos? And if there’s a better way to code something, also let me know. I’m always happy to hear from you guys. What do you guys enjoy? I want to make this content for you guys. All right. Now I’ll see you guys later, and happy coding.