Simple Web Crawler

How to build a simple web crawler | by Low Wei Hong

Step by step guides to scrape the Best Global University RankingThree years ago, I was working as a student assistant in the Institutional Statistics Unit at NTU Singapore. I was required to obtain the Best Global University Ranking by manually copying from the website and pasting into an excel sheet. I was frustrated, as my eyes were tired, from looking at the computer screen continuously for long hours. Hence, I was thinking about whether there is a better way to do that time, I googled for automation and I found the answer for it — web scraping. Since then, I managed to create 100+ web crawlers and here is my first-ever web scraper that I would like to eviously, what I did was to use requests plus BeautifulSoup to finish the task. However, after three years when I look back to the same website, I found out that there is a way to get the JSON data instead which works way you are thinking of automating your boring and repetitive tasks, please promise me you’ll read till the end. You will learn how to create a web crawler so that you can focus on more value-added this article, I would like to share how I build a simple crawler to scrape universities’ rankings from first thing to do when you want to scrape the website is to inspect the web element. Why do we need to do that? This is actually to find whether there exists a more robust way to get the data or to obtain cleaner data. For the former, I did not go into so deep to dig out the API this time. However, I do find a way to extract cleaner data so that I can reduce the data cleansing you do not know how to inspect the web element, you just need to navigate to any position of the webpage, right-click, click on inspect, then click on the Network tab. After that, refresh your page and you should see a list of network activities appear one by one. Let’s us look at the specified activity that I have been selecting using my cursor in the screenshot above(i. e. the “search? region=africa&… “) that, please refer to the purple box in the screenshot above, highlighting the URL that the browser will send the request to in order to get the data to be presented to, we can imitate the browser behavior by sending the request to that URL and get the data we need right? But before that, why do I choose to call the request URL instead of the original website URL? Let’s click on the Preview tab, you will notice that all the information we need, including university ranking, address, countries, etc.. are all in the results field which is highlighted in the blue is the reason why we scrape through the URL. The data return by the URL is in a very nice format — JSON above screenshot shows a comparison of the code between today and 3 years before. 3 years before when I was a newbie in web scraping, I just use requests, BeautifulSoup, tons of XPATH and heavy data cleaning processes to get the data I need. However, if you compare the code that I have written today, I just need to use x to get the data and no data cleaning your information, x is built on top of requests, but it supports additional functions like provides async APIs and with x, you can send an HTTP/2 request. For a more complete comparison, you may refer to this quest URL: now, let’s pay attention to the link we are going to use as shown above. You will notice that you can change the values for region and subject as they are parameters to the URL. (For more information on URL parameters, here is a good read) However, do note that the values for them are limited to the region and subjects provided by the instance, you can change region=africa to region=asia or subjects=agricultural-sciences to subjects=chemistry. If you are interested to know what are the supported regions and subjects, you can visit my repo to check knowing how to query this URL to obtain the data you need, the left-over part is how many pages you need to query for a particular combination of region and, let’s take this URL as an example, copy and paste the URL into your browser and press enter, then use command+f to search for the keyword “last_page”, and you will find a similar screenshot as below. *Do note that I have installed a chrome extension that could help me to prettify the plain data into JSON format. This is why you can see that the data shown in my browser is nicely ngratulation, you manage to find the last_page variable as indicated above. Now, the only remaining process is how to go to the next page and get the data if the last_page is larger than is how I figure out the way to navigate to page 2. Take this link as an, click on the page number 2, and then view on the right panel. Pay attention to the purple box, you will notice there is an addition of page=2 in the Request URL. This means that you just need to append &page={page_number} to the original request URL in order to navigate through different, you have the whole idea of how to create a web scraper to obtain the data from the you would like to have a look at the full Python code, feel free to visit sourceThank you so much for reading until the is what I want you to get after reading this that there are many different ways to scrape the data from a website, for instance getting the link to obtain the data in JSON some time to inspect the website, if you manage to find the API to retrieve data, this can save you a lot of reason that I am comparing my code from 3 years before and the code I have written today is to give you an idea of how you can improve your web crawling and coding skills through continuous hard, the result will definitely come. — Low Wei HongIf you have any questions or ideas to ask or add on, feel free to comment below! Low Wei Hong is a Data Scientist at Shopee. His experiences involved more on crawling websites, creating data pipeline and also implementing machine learning models on solving business provides crawling services that can provide you with the accurate and cleaned data which you need. You can visit this website to view his portfolio and also to contact him for crawling can connect with him on LinkedIn and Medium.
How to build a web crawler? -

How to build a web crawler? –

At the era of big data, web scraping is a life saver. To save even more time, you can couple ScrapingBot to a web crawling bot.
What is a web crawler?
A crawler, or spider, is an internet bot indexing and visiting every URLs it encounters. Its goal is to visit a website from end to end, know what is on every webpage and be able to find the location of any information. The most known web crawlers are the search engine ones, the GoogleBot for example. When a website is online, those crawlers will visit it and read its content to display it in the relevant search result pages.
How does a web crawler work?
Starting from the root URL or a set of entries, the crawler will fetch the webpages and find other URLs to visit, called seeds, in this page. All the seeds found on this page will be added on its list of URLs to be visited. This list is called the horizon. The crawler organises the links in two threads: ones to visit, and already visited ones. It will keep visiting the links until the horizon is empty.
Because the list of seeds can be very long, the crawler has to organise those following several criterias, and prioritise which ones to visit first and revisit. To know which pages are more important to crawl, the bot will consider how many links go to this URL, how often it is visited by regular users.
What is the difference between a web scraper and a web crawler?
Crawling, by definition, always implies the web. A crawler’s purpose is to follow links to reach numerous pages and analyze their meta data and content.
Scraping is possible out of the web. For example you can retrieve some information from a database. Scraping is pulling data from the web or a database.
Why do you need a web crawler?
With web scraping, you gain a huge amount of time, by automatically retrieving the information you need instead of looking for it and copying it manually. However, you still need to scrape page after page. Web crawling allows you to collect, organize and visit all the pages present on the root page, with the possibility to exclude some links. The root page can be a search result or category.
For example, you can pick a product category or a search result page from amazon as an entry, and crawl it to scrape all the product details, and limit it to the first 10 pages with the suggested products as well.
How to build a web crawler?
The first thing you need to do is threads:
Visited URLsURLs to be visited (queue)
To avoid crawling the same page over and over, the URL needs to automatically move to the visited URLs thread once you’ve finished crawling it. In each webpage, you will find new URLs. Most of them will be added to the queue, but some of them might not add any value for your purpose. Hence why you also need to set rules for URLs you’re not interested in.
Deduplication is a critical part of web crawling. On some websites, and particularly on e-commerce ones, a single webpage can have multiple URLs. As you want to scrape this page only once, the best way to do so is to look for the canonical tag in the code. All the pages with the same content will have this common canonical URL, and this is the only link you will have to crawl and scrape.
Here’s an example of a canonical tag in HTML: previousDepth) {
previousDepth =;
(`——- CRAWLING ON DEPTH LEVEL ${previousDepth} ——–`);}
return nextLink;}
function peekInQueue() {
return linksQueue[0];}
//Adds links we’ve visited to the seenList
function addToSeen(linkObj) {
seenLinks[] = linkObj;}
//Returns whether the link has been seen.
function linkInSeenListExists(linkObj) {
return seenLinks[] == null? false: true;}
What is a web crawler? | How web spiders work | Cloudflare

What is a web crawler? | How web spiders work | Cloudflare

What is a web crawler bot?
A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it’s needed. They’re called “web crawlers” because crawling is the technical term for automatically accessing a website and obtaining data via a software program.
These bots are almost always operated by search engines. By applying a search algorithm to the data collected by web crawlers, search engines can provide relevant links in response to user search queries, generating the list of webpages that show up after a user types a search into Google or Bing (or another search engine).
A web crawler bot is like someone who goes through all the books in a disorganized library and puts together a card catalog so that anyone who visits the library can quickly and easily find the information they need. To help categorize and sort the library’s books by topic, the organizer will read the title, summary, and some of the internal text of each book to figure out what it’s about.
However, unlike a library, the Internet is not composed of physical piles of books, and that makes it hard to tell if all the necessary information has been indexed properly, or if vast quantities of it are being overlooked. To try to find all the relevant information the Internet has to offer, a web crawler bot will start with a certain set of known webpages and then follow hyperlinks from those pages to other pages, follow hyperlinks from those other pages to additional pages, and so on.
It is unknown how much of the publicly available Internet is actually crawled by search engine bots. Some sources estimate that only 40-70% of the Internet is indexed for search – and that’s billions of webpages.
What is search indexing?
Search indexing is like creating a library card catalog for the Internet so that a search engine knows where on the Internet to retrieve information when a person searches for it. It can also be compared to the index in the back of a book, which lists all the places in the book where a certain topic or phrase is mentioned.
Indexing focuses mostly on the text that appears on the page, and on the metadata* about the page that users don’t see. When most search engines index a page, they add all the words on the page to the index – except for words like “a, ” “an, ” and “the” in Google’s case. When users search for those words, the search engine goes through its index of all the pages where those words appear and selects the most relevant ones.
*In the context of search indexing, metadata is data that tells search engines what a webpage is about. Often the meta title and meta description are what will appear on search engine results pages, as opposed to content from the webpage that’s visible to users.
How do web crawlers work?
The Internet is constantly changing and expanding. Because it is not possible to know how many total webpages there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next.
Given the vast number of webpages on the Internet that could be indexed for search, this process could go on almost indefinitely. However, a web crawler will follow certain policies that make it more selective about which pages to crawl, in what order to crawl them, and how often they should crawl them again to check for content updates.
The relative importance of each webpage: Most web crawlers don’t crawl the entire publicly available Internet and aren’t intended to; instead they decide which pages to crawl first based on the number of other pages that link to that page, the amount of visitors that page gets, and other factors that signify the page’s likelihood of containing important information.
The idea is that a webpage that is cited by a lot of other webpages and gets a lot of visitors is likely to contain high-quality, authoritative information, so it’s especially important that a search engine has it indexed – just as a library might make sure to keep plenty of copies of a book that gets checked out by lots of people.
Revisiting webpages: Content on the Web is continually being updated, removed, or moved to new locations. Web crawlers will periodically need to revisit pages to make sure the latest version of the content is indexed.
requirements: Web crawlers also decide which pages to crawl based on the protocol (also known as the robots exclusion protocol). Before crawling a webpage, they will check the file hosted by that page’s web server. A file is a text file that specifies the rules for any bots accessing the hosted website or application. These rules define which pages the bots can crawl, and which links they can follow. As an example, check out the file.
All these factors are weighted differently within the proprietary algorithms that each search engine builds into their spider bots. Web crawlers from different search engines will behave slightly differently, although the end goal is the same: to download and index content from webpages.
Why are web crawlers called ‘spiders’?
The Internet, or at least the part that most users access, is also known as the World Wide Web – in fact that’s where the “www” part of most website URLs comes from. It was only natural to call search engine bots “spiders, ” because they crawl all over the Web, just as real spiders crawl on spiderwebs.
Should web crawler bots always be allowed to access web properties?
That’s up to the web property, and it depends on a number of factors. Web crawlers require server resources in order to index content – they make requests that the server needs to respond to, just like a user visiting a website or other bots accessing a website. Depending on the amount of content on each page or the number of pages on the site, it could be in the website operator’s best interests not to allow search indexing too often, since too much indexing could overtax the server, drive up bandwidth costs, or both.
Also, developers or companies may not want some webpages to be discoverable unless a user already has been given a link to the page (without putting the page behind a paywall or a login). One example of such a case for enterprises is when they create a dedicated landing page for a marketing campaign, but they don’t want anyone not targeted by the campaign to access the page. In this way they can tailor the messaging or precisely measure the page’s performance. In such cases the enterprise can add a “no index” tag to the landing page, and it won’t show up in search engine results. They can also add a “disallow” tag in the page or in the file, and search engine spiders won’t crawl it at all.
Website owners may not want web crawler bots to crawl part or all of their sites for a variety of other reasons as well. For instance, a website that offers users the ability to search within the site may want to block the search results pages, as these are not useful for most users. Other auto-generated pages that are only helpful for one user or a few specific users should also be blocked.
What is the difference between web crawling and web scraping?
Web scraping, data scraping, or content scraping is when a bot downloads the content on a website without permission, often with the intention of using that content for a malicious purpose.
Web scraping is usually much more targeted than web crawling. Web scrapers may be after specific pages or specific websites only, while web crawlers will keep following links and crawling pages continuously.
Also, web scraper bots may disregard the strain they put on web servers, while web crawlers, especially those from major search engines, will obey the file and limit their requests so as not to overtax the web server.
How do web crawlers affect SEO?
SEO stands for search engine optimization, and it is the discipline of readying content for search indexing so that a website shows up higher in search engine results.
If spider bots don’t crawl a website, then it can’t be indexed, and it won’t show up in search results. For this reason, if a website owner wants to get organic traffic from search results, it is very important that they don’t block web crawler bots.
What web crawler bots are active on the Internet?
The bots from the major search engines are called:
Google: Googlebot (actually two crawlers, Googlebot Desktop and Googlebot Mobile, for desktop and mobile searches)
Bing: Bingbot
Yandex (Russian search engine): Yandex Bot
Baidu (Chinese search engine): Baidu Spider
There are also many less common web crawler bots, some of which aren’t associated with any search engine.
Why is it important for bot management to take web crawling into account?
Bad bots can cause a lot of damage, from poor user experiences to server crashes to data theft. However, in blocking bad bots, it’s important to still allow good bots, such as web crawlers, to access web properties. Cloudflare Bot Management allows good bots to keep accessing websites while still mitigating malicious bot traffic. The product maintains an automatically updated allowlist of good bots, like web crawlers, to ensure they aren’t blocked. Smaller organizations can gain a similar level of visibility and control over their bot traffic with Super Bot Fight Mode, available on Cloudflare Pro and Business plans.

Frequently Asked Questions about simple web crawler

How do I make a simple web crawler?

Here are the basic steps to build a crawler:Step 1: Add one or several URLs to be visited.Step 2: Pop a link from the URLs to be visited and add it to the Visited URLs thread.Step 3: Fetch the page’s content and scrape the data you’re interested in with the ScrapingBot API.More items…•Jun 17, 2020

What is simple web crawler?

A web crawler, or spider, is a type of bot that is typically operated by search engines like Google and Bing. Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results.

How do you crawl a website in Python?

Building a Web Crawler using Pythona name for identifying the spider or the crawler, “Wikipedia” in the above example.a start_urls variable containing a list of URLs to begin crawling from. … a parse() method which will be used to process the webpage to extract the relevant and necessary content.Aug 12, 2020

Leave a Reply

Your email address will not be published. Required fields are marked *