Web Scraping: Introduction, Applications and Best Practices
Web scraping typically extracts large amounts of data from websites for a variety of uses such as price monitoring, enriching machine learning models, financial data aggregation, monitoring consumer sentiment, news tracking, etc. Browsers show data from a website. However, manually copying data from multiple sources into a central place can be very tedious and time-consuming. Web scraping tools essentially automate this manual process.
This article intends to get you up to speed on the basics of web scraping. We’ll cover basic processes, best practices, dos and don’ts, and identify use cases where web scraping may be illegal and have adverse effects.
Basics of Web Scraping
“Web scraping,” also called crawling or spidering, is the automated gathering of data from an online source, usually a website. While scraping is a great way to get massive amounts of data in relatively short timeframes, it does add stress to the server where the source is hosted.
This is primarily why many websites disallow or ban scraping altogether. However, as long as it does not disrupt the primary function of the online source, it is fairly acceptable.
Despite its legal challenges, web scraping remains popular even in 2019. The prominence of and need for analytics have risen multifold. This, in turn, means various learning models and analytics engines need more raw data. Web scraping remains a popular way to collect information. With the rise of programming languages such as Python, web scraping has made significant leaps.
Typical applications of web scraping
Social media sentiment analysis
Social media posts have a very short shelf life; however, when looked at collectively, they show valuable trends. While most social media platforms have APIs that let third-party tools access their data, this may not always be sufficient. In such cases, scraping these websites gives access to real-time information such as trending sentiments, phrases, topics, etc.
Many eCommerce sellers often have their products listed on multiple marketplaces. With scraping, they can monitor the pricing on multiple platforms and make a sale on the marketplace where the profit is higher.
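As a rough sketch of the comparison described above, the snippet below picks the marketplace with the highest profit per unit from already-scraped prices. The marketplace names, prices, commission rates and unit cost are all made-up assumptions for illustration; a real seller would plug in their own figures.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    marketplace: str  # hypothetical marketplace name
    price: float      # scraped selling price
    fee_rate: float   # assumed marketplace commission, as a fraction of the price


def profit(listing: Listing, unit_cost: float) -> float:
    """Profit per unit after the marketplace commission and the seller's cost."""
    return listing.price * (1 - listing.fee_rate) - unit_cost


def best_marketplace(listings, unit_cost):
    """Return the listing that yields the highest profit per unit."""
    return max(listings, key=lambda item: profit(item, unit_cost))


# Illustrative data only; none of these values come from a real marketplace.
listings = [
    Listing("market_a", 100.0, 0.15),
    Listing("market_b", 95.0, 0.05),
    Listing("market_c", 98.0, 0.10),
]
best = best_marketplace(listings, unit_cost=60.0)
print(best.marketplace, round(profit(best, 60.0), 2))
```

Note that the highest listed price does not win here: once the assumed commission is taken into account, the lower-priced listing is the more profitable one.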
Real estate investors often want to know about promising neighborhoods they can invest in. While there are multiple ways to get this data, web scraping travel marketplaces and hospitality brokerage websites offers valuable information. This includes information such as the highest-rated areas, amenities that typical buyers look for, locations that may be upcoming as attractive renting options, etc.
Machine learning models need raw data to evolve and improve. Web scraping tools can scrape a large number of data points, text and images in a relatively short time. Machine learning is fueling today’s technological marvels such as driverless cars, space flight, image and speech recognition. However, these models need data to improve their accuracy and reliability.
A good web scraping project follows these practices. These ensure that you get the data you are looking for while being non-disruptive to the data sources.
Identify the goal
Any web scraping project begins with a need. A goal detailing the expected outcomes is the most basic requirement for a scraping task. The following questions need to be asked while identifying the need for a web scraping project:
What kind of information do we expect to seek?
What will be the outcome of this scraping activity?
Where is this information typically published?
Who are the end-users who will consume this data?
Where will the extracted data be stored? For example, in the cloud, in on-premise storage, in an external database, etc.
How should this data be presented to its end-users? For example, as a CSV/Excel/JSON file, as an SQL database, etc.
How often are the source websites refreshed with new data? In other words, what is the typical shelf-life of the data that is being collected and how often does it have to be refreshed?
Post the scraping activity, what are the types of reports you would want to generate?
Since web scraping is mostly automated, tool selection is very important. The following points need to be kept in mind when finalizing tool selection:
Fitment with the needs of the project
Supported operating systems and platforms
Free/open-source or paid tool
Support for scripting languages
Support for built-in data storage
Availability of documentation
Designing the scraping schema
Let’s assume that our scraping job collects data from job sites about open positions listed by various organizations. The data source would also dictate the schema attributes. The schema for this job would look something like this:
URL used to apply for the position
Remuneration data if it is available
Any special skills listed
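One possible way to pin such a schema down in code is a small data class. The sketch below covers only the attributes listed above; the class name, field names and the example URL are illustrative assumptions, and a real schema would follow whatever the data source actually exposes.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class JobPosting:
    """Hypothetical record type for one scraped job listing."""
    apply_url: str                      # URL used to apply for the position
    remuneration: Optional[str] = None  # raw text; None when the site does not publish it
    special_skills: List[str] = field(default_factory=list)

    def is_valid(self) -> bool:
        """Basic sanity check: the application link must be an HTTP(S) URL."""
        return self.apply_url.startswith(("http://", "https://"))


posting = JobPosting(
    apply_url="https://example.com/jobs/123/apply",
    special_skills=["Python", "SQL"],
)
record = asdict(posting)  # plain dict, ready for CSV, JSON or database export
```

Keeping remuneration optional reflects the "if it is available" wording of the schema: missing values are stored explicitly rather than guessed.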
Test runs and larger jobs
This is a no-brainer: a test run will help you identify any roadblocks or potential issues before running a larger job. While there is no guarantee that there will be no surprises later on, results from the test run are a good indicator of what to expect going forward.
Parse the HTML
Retrieve the desired item as per your scraping schema
Identify URLs pointing to subsequent pages
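The three steps above can be sketched with Python's standard html.parser module. The HTML snippet, the "apply" class name, the rel="next" link and the example.com URLs are all assumptions made up for this illustration; a real page will need its own selectors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up page fragment standing in for a downloaded job listing page.
SAMPLE_HTML = """
<html><body>
  <div class="job"><a class="apply" href="/jobs/1/apply">Data Engineer</a></div>
  <div class="job"><a class="apply" href="/jobs/2/apply">QA Analyst</a></div>
  <a rel="next" href="/jobs?page=2">Next</a>
</body></html>
"""


class JobListParser(HTMLParser):
    """Collects (title, apply URL) pairs and links to subsequent pages."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []        # scraped (title, absolute apply URL) tuples
        self.next_pages = []   # absolute URLs of follow-up pages
        self._current_href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        if attrs.get("rel") == "next":
            self.next_pages.append(urljoin(self.base_url, href))
        elif attrs.get("class") == "apply":
            self._current_href = urljoin(self.base_url, href)
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an "apply" link.
        if self._current_href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            self.items.append(("".join(self._text).strip(), self._current_href))
            self._current_href = None


page_parser = JobListParser("https://example.com/jobs")
page_parser.feed(SAMPLE_HTML)
print(page_parser.items)
print(page_parser.next_pages)
```

Running this against a single saved page is a cheap test run: if the item list or the next-page list comes back empty, the selectors need fixing before a larger job is attempted.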
Once we are happy with the test run, we can generalize the scope and move ahead with a larger scrape. Here we need to understand how a human would retrieve data from each page. Using regular expressions, we can accurately match and retrieve the correct data. Subsequently, we also need to capture the correct XPath expressions and replace them with hardcoded values if necessary. You may also need support from an external library.
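To illustrate the regular-expression step, here is a minimal sketch that pulls a salary figure or range out of free text. The pattern, the supported currency symbols and the sample strings are assumptions chosen for the example; real listings vary widely and will need their own patterns.

```python
import re

# Matches "$60,000" or a range such as "$60,000 - $75,000".
SALARY_RE = re.compile(
    r"(?P<currency>[$€£])\s?(?P<low>\d[\d,]*)"
    r"(?:\s?-\s?(?P=currency)?\s?(?P<high>\d[\d,]*))?"
)


def parse_salary(text):
    """Return (currency, low, high) as found in the text, or None if absent."""
    match = SALARY_RE.search(text)
    if match is None:
        return None
    low = int(match.group("low").replace(",", ""))
    high_raw = match.group("high")
    # A single figure is reported as a range whose ends are equal.
    high = int(high_raw.replace(",", "")) if high_raw else low
    return match.group("currency"), low, high


print(parse_salary("Salary: $60,000 - $75,000 per year"))
print(parse_salary("Competitive salary"))
```

Returning None for text without a figure keeps missing remuneration data explicit instead of silently storing a wrong number.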
Often you may need external libraries that act as inputs on the source. For example, you may need to enter the Country, State and Zipcode to identify the correct values that you need.
Here are a few additional points to check:
Scheduling for the created scrape
Third-party integration support (e.g., for Git, TFS, Bitbucket)
Scrape templates for similar websites
Depending on the tool, end-users can access the data from web scraping in several formats:
SQL Server database
Script (A script provides data from almost any data source)
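As a rough illustration of delivering the same scraped records in more than one format, the sketch below writes them as CSV, JSON and a database table using only the Python standard library. SQLite is used here purely as a stand-in for a database server such as SQL Server, and the table name, column names and sample rows are assumptions.

```python
import csv
import io
import json
import sqlite3

# Made-up records standing in for scraped output.
rows = [
    {"apply_url": "https://example.com/jobs/1/apply", "remuneration": "$60,000"},
    {"apply_url": "https://example.com/jobs/2/apply", "remuneration": ""},
]


def to_csv(records):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["apply_url", "remuneration"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_sqlite(records, connection):
    """Insert or update the records in a 'jobs' table keyed by URL."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS jobs (apply_url TEXT PRIMARY KEY, remuneration TEXT)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO jobs VALUES (:apply_url, :remuneration)", records
    )
    connection.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
to_sqlite(rows, conn)
print(to_csv(rows))
```

Keying the table on the URL means a repeated scrape updates existing rows instead of duplicating them.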
Improving the performance and reliability of your scrape
Tools and scripts often follow a few best practices while web scraping large amounts of data.
In many cases, the scraping job may have to collect extremely large amounts of data. This may take too much time and encounter timeouts and endless loops. Hence tool identification and understanding its capabilities is very important. Here are a few best practices to help you better tune your scraping models for performance and reliability.
If possible, avoid downloading images while web scraping. If you absolutely need images, store them on a local drive and update the database with the appropriate path.
Enable the following options in your scraping tool or script – ‘Ignore cache’, ‘Ignore certificate errors’, and ‘Ignore to run ActiveX and flash’.
Call a terminate process after every scrape session is complete
Avoid the use of multiple web browsers for each scrape
Handle memory leaks
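One way to make sure a terminate step runs after every scrape session, even when the scrape fails halfway, is to wrap the session in a context manager. The sketch below uses a stand-in FakeBrowser class; it is hypothetical and only mimics the open, fetch and quit lifecycle that a real browser driver or HTTP client would have.

```python
from contextlib import contextmanager


class FakeBrowser:
    """Hypothetical stand-in for a real browser driver or HTTP client."""

    def __init__(self):
        self.closed = False

    def fetch(self, url):
        if self.closed:
            raise RuntimeError("browser already closed")
        if "broken" in url:
            raise IOError(f"could not load {url}")
        return f"<html>{url}</html>"

    def quit(self):
        # A real driver would release processes, sockets and memory here.
        self.closed = True


@contextmanager
def scrape_session(browser_factory=FakeBrowser):
    """Open one browser for the session and always terminate it afterwards."""
    browser = browser_factory()
    try:
        yield browser
    finally:
        browser.quit()  # runs even if the scrape raises


with scrape_session() as browser:
    page = browser.fetch("https://example.com/jobs")
```

Reusing a single browser inside the session and closing it in the finally block covers two of the points above: it avoids spawning multiple browsers per scrape and prevents leftover processes from accumulating memory.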
Things to stay away from
There are a few no-no’s when setting up and executing a web scraping project.
Avoid sites with too many broken links
Stay away from sites that have too many missing values in their data fields
Sites that require a CAPTCHA authentication to show data
Some websites have an endless loop of pagination. Here the scraping tool would start again from the beginning once it runs out of pages.
Web scraping iframe-based websites
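If a site with looping pagination cannot be avoided, a scraper can at least guard against it by remembering which pages it has already visited. This is a minimal sketch: the page paths and the NEXT lookup table are made up, and get_next_url stands for whatever logic finds the next-page link on a real site.

```python
def crawl_pages(start_url, get_next_url, max_pages=1000):
    """Follow next-page links until they end, repeat, or hit max_pages."""
    seen = set()
    order = []
    url = start_url
    while url is not None and url not in seen and len(order) < max_pages:
        seen.add(url)
        order.append(url)
        url = get_next_url(url)
    return order


# Hypothetical site whose last page links back to the first one.
NEXT = {"/p/1": "/p/2", "/p/2": "/p/3", "/p/3": "/p/1"}
visited = crawl_pages("/p/1", NEXT.get)
print(visited)
```

The max_pages cap is a second safety net for sites that generate endless unique URLs, where the visited set alone would never trigger.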
Once a certain connection threshold is reached, some websites may prevent users from scraping them further. While you can use proxies and different user headers to complete the scraping, it is important to understand the reason why these measures are in place. If a website has taken steps to prevent web scraping, these should be respected and the site left alone. Forcibly web scraping such sites is illegal.
Web scraping has been around since the early days of the internet. While it can provide you the data you need, certain care, caution and restraint should be exercised. A properly planned and executed web scraping project can yield valuable data – one that will be useful for the end-user.
About Price & Web Scraping Tools | Imperva
What is web scraping
Web scraping is the process of using bots to extract content and data from a website.
Unlike screen scraping, which only copies pixels displayed onscreen, web scraping extracts underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.
Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include:
Search engine bots crawling a site, analyzing its content and then ranking it.
Price comparison sites deploying bots to auto-fetch prices and product descriptions for allied seller websites.
Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).
Web scraping is also used for illegal purposes, including the undercutting of prices and the theft of copyrighted content. An online entity targeted by a scraper can suffer severe financial losses, especially if it’s a business strongly relying on competitive pricing models or deals in content distribution.
Scraper tools and bots
Web scraping tools are software (i.e., bots) programmed to sift through databases and extract information. A variety of bot types are used, many being fully customizable to:
Recognize unique HTML site structures
Extract and transform content
Store scraped data
Extract data from APIs
Since all scraping bots have the same purpose—to access site data—it can be difficult to distinguish between legitimate and malicious bots.
That said, several key differences help distinguish between the two.
Legitimate bots are identified with the organization for which they scrape. For example, Googlebot identifies itself in its HTTP header as belonging to Google. Malicious bots, conversely, impersonate legitimate traffic by creating a false HTTP user agent.
Legitimate bots abide by a site’s robots.txt file, which lists those pages a bot is permitted to access and those it cannot. Malicious scrapers, on the other hand, crawl the website regardless of what the site operator has allowed.
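As a rough illustration of the check a well-behaved bot performs, Python's standard urllib.robotparser module can evaluate robots.txt rules before a page is requested. The rules, the "example-scraper" user agent and the example.com URLs below are made up for this sketch; a real bot would download the target site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules standing in for a site's robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-scraper"):
    """Return True if the rules permit this user agent to fetch the URL."""
    return robots.can_fetch(user_agent, url)


print(allowed("https://example.com/jobs"))
print(allowed("https://example.com/private/data"))
```

A scraper that calls a check like this before every request, and skips URLs for which it returns False, stays within what the site operator has allowed.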
Resources needed to run web scraper bots are substantial—so much so that legitimate scraping bot operators heavily invest in servers to process the vast amount of data being extracted.
A perpetrator, lacking such a budget, often resorts to using a botnet—geographically dispersed computers, infected with the same malware and controlled from a central location. Individual botnet computer owners are unaware of their participation. The combined power of the infected systems enables large scale scraping of many different websites by the perpetrator.
Malicious web scraping examples
Web scraping is considered malicious when data is extracted without the permission of website owners. The two most common use cases are price scraping and content theft.
In price scraping, a perpetrator typically uses a botnet from which to launch scraper bots to inspect competing business databases. The goal is to access pricing information, undercut rivals and boost sales.
Attacks frequently occur in industries where products are easily comparable and price plays a major role in purchasing decisions. Victims of price scraping can include travel agencies, ticket sellers and online electronics vendors.
For example, smartphone e-traders, who sell similar products for relatively consistent prices, are frequent targets. To remain competitive, they’re motivated to offer the best prices possible, since customers usually go for the lowest cost offering. To gain an edge, a vendor can use a bot to continuously scrape his competitors’ websites and instantly update his own prices accordingly.
For perpetrators, a successful price scraping can result in their offers being prominently featured on comparison websites—used by customers for both research and purchasing. Meanwhile, scraped sites often experience customer and revenue losses.
Content scraping comprises large-scale content theft from a given site. Typical targets include online product catalogs and websites relying on digital content to drive business. For these enterprises, a content scraping attack can be devastating.
For example, online local business directories invest significant amounts of time, money and energy constructing their database content. Scraping can result in it all being released into the wild, used in spamming campaigns or resold to competitors. Any of these events are likely to impact a business’ bottom line and its daily operations.
The following is excerpted from a complaint, filed by Craigslist, detailing its experience with content scraping. It reinforces how damaging the practice can be:
“[The content scraping service] would, on a daily basis, send an army of digital robots to craigslist to copy and download the full text of millions of craigslist user ads. [The service] then indiscriminately made those misappropriated listings available, through its so-called ‘data feed’, to any company that wanted to use them, for any purpose. Some such ‘customers’ paid as much as $20,000 per month for that content…”
According to the claim, scraped data was used for spam and email fraud, among other activities:
“[The defendants] then harvest craigslist users’ contact information from that database, and initiate many thousands of electronic mail messages per day to the addresses harvested from craigslist servers…. [The messages] contain misleading subject lines and content in the body of the spam messages, designed to trick craigslist users into switching from using craigslist’s services to using [the defenders’] service…”
Web scraping protection
The increased sophistication in malicious scraper bots has rendered some common security measures ineffective. For example, headless browser bots can masquerade as humans as they fly under the radar of most mitigation solutions.
To counter advances made by malicious bot operators, Imperva uses granular traffic analysis. It ensures that all traffic coming to your site, human and bot alike, is completely legitimate.
The process involves the cross verification of factors, including:
HTML fingerprint – The filtering process starts with a granular inspection of HTML headers. These can provide clues as to whether a visitor is a human or bot, and malicious or safe. Header signatures are compared against a constantly updated database of over 10 million known variants.
IP reputation – We collect IP data from all attacks against our clients. Visits from IP addresses having a history of being used in assaults are treated with suspicion and are more likely to be scrutinized further.
Behavior analysis – Tracking the ways visitors interact with a website can reveal abnormal behavioral patterns, such as a suspiciously aggressive rate of requests and illogical browsing patterns. This helps identify bots that pose as human visitors.
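To make the idea of a "suspiciously aggressive rate of requests" concrete, here is a minimal sliding-window counter that flags a client once it exceeds a request limit. This is only an illustrative sketch, not a description of how Imperva or any other product works; the class name, thresholds and IP address are assumptions.

```python
from collections import defaultdict, deque


class RateMonitor:
    """Flags clients that send more than max_requests within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client IP -> recent request timestamps

    def record(self, client_ip, timestamp):
        """Record one request; return True if the client now exceeds the limit."""
        hits = self._hits[client_ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and hits[0] <= timestamp - self.window:
            hits.popleft()
        return len(hits) > self.max_requests


monitor = RateMonitor(max_requests=3, window_seconds=10)
# Four requests in quick succession, then one after a long pause.
flags = [monitor.record("203.0.113.7", t) for t in (0, 1, 2, 3, 20)]
print(flags)
```

In this run only the fourth request is flagged; after the pause the window is empty again, so the client is treated as normal traffic.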
Is Web Scraping Legal? – WebHarvy
Web scraping is the technique of automatically extracting data from websites using software or scripts. Our software, WebHarvy, can be used to easily extract data from any website without any coding/scripting knowledge.
Is it legal to scrape data from websites using software? The answer to this question is not a simple yes or no.
The real question is how you plan to use the data you have extracted from a website, whether manually or with software. The data displayed by most websites is for public consumption, and it is totally legal to copy this information to a file on your computer. What you need to be careful about is how you plan to use this data. If the data is downloaded for your personal use and analysis, then it is absolutely ethical. But if you plan to present it as your own on your website, without attributing the original owner and in a way that is completely against the owner's interest, then it is unethical and illegal.
Also, since web scrapers can read and extract data from web pages far more quickly than humans, care should be taken that the web scraping process does not affect the performance or bandwidth of the web server in any way. If it does, most web servers will automatically block your IP, preventing further access to their pages.
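One simple way to keep a scraper from overloading a server is to enforce a minimum pause between requests. The sketch below is a minimal throttle under that assumption; the class name and the interval value are illustrative, and the clock and sleep parameters exist only so the behavior can be tested without actually waiting.

```python
import time


class Throttle:
    """Ensures at least min_interval seconds pass between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the minimum interval since the last request has elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In use, a scraper would create one instance, for example Throttle(2.0), and call its wait() method immediately before each request, so the first request goes out at once and later ones are spaced at least two seconds apart.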
Update: US federal court rules that web scraping does not violate hacking laws
Scrape Data Anonymously
WebHarvy is an easy-to-use visual web scraper which lets you scrape data anonymously from websites, thereby protecting your privacy. Proxy servers or VPNs can be easily used along with WebHarvy so that you are not connected directly to the web server during data extraction. Also, to minimize the load on web servers, and to avoid detection, there are options to automatically insert pauses & emulate a human user during the web scraping process.