
Blog entry by Syed Saad Peerzada

Hands-on Ethical Web Scraping using Python

Data Science Dojo has launched its Jupyter Hub for Ethical Web Scraping using Python offering on the Azure Marketplace. It comes with pre-installed web scraping libraries and a pre-cloned GitHub repository of the well-known book Web Scraping with Python, 2nd Edition, helping learners take their first steps into the field of web scraping.

What is Web Scraping?

Web scraping is the act of extracting content and data from a website. Much of the vast amount of data available on the internet is not openly available for download. As a result, ethical web scraping is the most effective technique to collect this data. There is also debate about the legality of web scraping, since content may be stolen or a website may crash as a result of scraping.

Ethical web scraping is the act of harvesting data legally by following ethical rules about web scraping. When these rules are followed, they help maintain trust between the website owner and the web scraper.

Web Scraping using Python

In Python, a learner can accomplish large tasks with a small piece of code. Since web scraping is used to save time, a short Python script can save a great deal of it. Python is also simple and easy to understand, and it provides an extensive set of libraries for web scraping and for any further manipulation of the extracted data.
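
To give a sense of how little code a basic scrape requires, here is a minimal sketch using the Requests and Beautifulsoup4 libraries that come pre-installed with this offer. The URL is a placeholder; replace it with a page you are permitted to scrape.

```python
# Minimal sketch: fetch a page and extract its title and links.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder; use a page you are allowed to scrape
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

# Print the page title and every hyperlink found on the page
print(soup.title.string if soup.title else "No title found")
for link in soup.find_all("a", href=True):
    print(link["href"])
```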

PRO TIP: Join our 5-day instructor-led Python for Data Science training to enhance your web
scraping skills.

Challenges for Individuals

Individuals who are new to web scraping and
wish to flourish in their field usually lack the necessary computing and
learning resources to obtain hands-on expertise. Also, they may face
compatibility issues when installing libraries.

What We Provide

With just a single click, Jupyter Hub for Ethical Web Scraping using Python comes with pre-installed web scraping Python libraries, giving the learner an effortless coding environment in the Azure cloud and reducing the burden of installation. Moreover, this offer provides the learner with the repository of the famous web scraping book, which contains chapter-wise notebooks that serve as a learning resource for gaining hands-on experience with web scraping. Through this offer, a learner can collect data from various sources legally by following the best practices for ethical web scraping mentioned later in this blog. Once the data is collected, it can be analyzed further to get valuable insights into almost anything, while all the heavy computations are performed on Microsoft Azure, saving the user the trouble of running them on a local machine.

Listed below are the pre-installed web scraping Python libraries and the book repository provided by this offer:

Python Libraries:

  •  Pandas
  •  NumPy
  •  Scikit-learn
  •  Beautifulsoup4
  •  lxml
  •  MechanicalSoup
  •  Requests
  •  Scrapy
  •  Selenium
  •  urllib3

Repository:

  •  GitHub repository of the book Web Scraping with Python, 2nd Edition, by Ryan Mitchell.

Best Practices for Ethical Web Scraping

Globally, there is a debate about whether web scraping is an ethical practice. The argument against it is that when a website is queried repeatedly by the same user (in this case, a bot), too many requests land on the server simultaneously, and all of the server's resources may be consumed in generating responses, preventing it from responding to other legitimate users. When the server denies responses to further users in this way, the result is commonly known as a Denial of Service (DoS) attack.

Below are the best practices for ethical web scraping; following them allows a web scraper to work ethically.

1.   Check for robots.txt

The robots.txt file, which implements the Robots Exclusion Standard, tells web scrapers whether a website may be crawled and, if so, how it should be indexed. A legitimate web scraper is expected to respect the instructions in this file and not act beyond what the website owner allows.
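
As a rough sketch of this check, Python's standard library provides urllib.robotparser, which reads a site's robots.txt and reports whether a given user agent may fetch a URL. The site URL and bot name below are placeholders.

```python
# Sketch: consult robots.txt before crawling, using only the standard library.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")  # placeholder site
robots.read()

url = "https://example.com/some/page"   # placeholder page to check
user_agent = "MyEthicalScraperBot"      # hypothetical bot name

if robots.can_fetch(user_agent, url):
    print("robots.txt allows fetching:", url)
else:
    print("robots.txt disallows fetching:", url)
```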

2.   Check for Website APIs

An ethical web scraper is expected to first look for a public API for the website in question instead of scraping it. Many website owners provide public API access that can be used by anyone looking to benefit from the information available on the website. Providing a public API works in the best interest of both the ethical scraper and the website owner, since it avoids web scraping altogether.
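
As a sketch of this preference, the snippet below queries a hypothetical public API endpoint (the api.example.com URL and its parameters are invented for illustration) and works with the structured JSON it returns instead of parsing HTML.

```python
# Sketch: prefer a documented public API over scraping HTML pages.
import requests

API_URL = "https://api.example.com/v1/articles"  # hypothetical endpoint

response = requests.get(API_URL, params={"page": 1, "per_page": 20}, timeout=10)
response.raise_for_status()

# Assuming the API returns a JSON list of article records
for article in response.json():
    print(article)
```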

3.   Avoid Repeated Requests

Vigorous scraping can occasionally cause functionality issues, resulting in a poor user experience for human visitors. As a result, it is always advised to scrape during off-peak hours. An ethical web scraper is expected to delay recurrent requests to avoid causing a DoS attack.
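
One simple way to space out requests is to pause between them. The sketch below uses a fixed delay with placeholder URLs; the delay value is an assumption and should be tuned to the website's own guidance, such as a Crawl-delay directive in robots.txt.

```python
# Sketch: throttle requests so the server is never overloaded.
import time
import requests

urls = [
    "https://example.com/page/1",  # placeholder URLs
    "https://example.com/page/2",
    "https://example.com/page/3",
]

DELAY_SECONDS = 5  # assumed pause; adjust to the site's guidance

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(DELAY_SECONDS)  # be polite: wait before the next request
```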

4.   Provide Your Identity

It is always a good idea to take responsibility for one's actions. An ethical web scraper never hides his or her identity and provides it in the user-agent string. Not only does this make the scraper's intentions clear, but it also gives the website owner a means of contact for any questions or concerns.
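
A common way to do this is to send a descriptive user-agent string with every request. In the sketch below, the bot name and contact address are hypothetical placeholders.

```python
# Sketch: identify the scraper through a descriptive User-Agent header.
import requests

headers = {
    # Hypothetical bot name and contact address
    "User-Agent": "MyEthicalScraperBot/1.0 (contact: scraper-admin@example.com)"
}

response = requests.get("https://example.com", headers=headers, timeout=10)
print(response.status_code)
```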

5.   Avoid Fake Ownership

Content obtained through a web scraper should always be respected and never passed off as the scraper's own work. This can be regarded as highly unethical as well as illegal, since the website owner may file a copyright claim. It also damages the reputation of genuine web scrapers and hurts the trust of website owners.

6.   Ask for Permission

Since the information on a website belongs to its owner, one should never presume it to be free and should politely ask to use it for one's own purposes. An ethical web scraper always seeks permission from the website owner to avoid any future problems. The website owner should be given the choice of whether or not to allow the data to be scraped.

7.   Give Due Credit

As a token of thanks and encouragement to the website owner, the web scraper should give due credit wherever possible. This can be done in many ways, such as providing a link to the original website in any blog, article, or social media post, thereby generating traffic for the original website.


Conclusion

Ethical web scraping is a two-way street: the website owner should be mindful of the global availability of the data, and the scraper should not harm the website in any way and should first seek permission from the website owner. If a web scraper abides by the above-mentioned practices, i.e., works ethically, the website owner may not only allow the scraping of his or her website but also provide helpful means to the scraper in the form of metadata or a public API.

At Data Science Dojo, we deliver data science education, consulting, and technical services to increase the power of data. We are therefore adding a free Jupyter Notebook environment dedicated specifically to Ethical Web Scraping using Python. Install the Jupyter Hub offer now from the Azure Marketplace by Data Science Dojo, your ideal companion in your journey to learn data science!

Try Now!