What is a web crawler? | How web spiders work | Cloudflare
What is a web crawler bot?
A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it’s needed. They’re called “web crawlers” because crawling is the technical term for automatically accessing a website and obtaining data via a software program.
These bots are almost always operated by search engines. By applying a search algorithm to the data collected by web crawlers, search engines can provide relevant links in response to user search queries, generating the list of webpages that show up after a user types a search into Google or Bing (or another search engine).
A web crawler bot is like someone who goes through all the books in a disorganized library and puts together a card catalog so that anyone who visits the library can quickly and easily find the information they need. To help categorize and sort the library’s books by topic, the organizer will read the title, summary, and some of the internal text of each book to figure out what it’s about.
However, unlike a library, the Internet is not composed of physical piles of books, and that makes it hard to tell if all the necessary information has been indexed properly, or if vast quantities of it are being overlooked. To try to find all the relevant information the Internet has to offer, a web crawler bot will start with a certain set of known webpages and then follow hyperlinks from those pages to other pages, follow hyperlinks from those other pages to additional pages, and so on.
It is unknown how much of the publicly available Internet is actually crawled by search engine bots. Some sources estimate that only 40-70% of the Internet is indexed for search – and that’s billions of webpages.
What is search indexing?
Search indexing is like creating a library card catalog for the Internet so that a search engine knows where on the Internet to retrieve information when a person searches for it. It can also be compared to the index in the back of a book, which lists all the places in the book where a certain topic or phrase is mentioned.
Indexing focuses mostly on the text that appears on the page, and on the metadata* about the page that users don’t see. When most search engines index a page, they add all the words on the page to the index – except for words like “a,” “an,” and “the” in Google’s case. When users search for those words, the search engine goes through its index of all the pages where those words appear and selects the most relevant ones.
*In the context of search indexing, metadata is data that tells search engines what a webpage is about. Often the meta title and meta description are what will appear on search engine results pages, as opposed to content from the webpage that’s visible to users.
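The indexing idea described above can be sketched as a tiny inverted index. This is only an illustration: the stop-word list, the sample pages, and the function names `build_index` and `search` are assumptions made for the example, not how any particular search engine works.

```python
import re
from collections import defaultdict

# Assumed stop-word list, for illustration only
STOP_WORDS = {"a", "an", "the"}


def tokenize(text):
    """Lowercase the text and keep every word that is not a stop word."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(pages):
    """Map each indexed word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every non-stop word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical pages standing in for crawled content
pages = {
    "https://example.com/spiders": "A spider crawls the web",
    "https://example.com/library": "The library keeps a card catalog",
}
index = build_index(pages)
print(search(index, "the spider"))
```

A real index would also store word positions and metadata and would rank the matches by relevance; this sketch only returns the set of matching pages.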
How do web crawlers work?
The Internet is constantly changing and expanding. Because it is not possible to know how many total webpages there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next.
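A minimal sketch of the seed-and-follow loop described above. To keep it runnable without network access, a hypothetical in-memory link graph (`LINKS`) stands in for real page downloads; the `crawl` function and its `max_pages` limit are likewise assumptions for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real HTTP fetches and HTML parsing
LINKS = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # pages waiting to be crawled
    seen = set(seeds)         # pages already queued, so none is crawled twice
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


order = crawl(["https://example.com/"], lambda url: LINKS.get(url, []))
print(order)
```

The `max_pages` cap reflects the point that the process could otherwise continue almost indefinitely.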
Given the vast number of webpages on the Internet that could be indexed for search, this process could go on almost indefinitely. However, a web crawler will follow certain policies that make it more selective about which pages to crawl, in what order to crawl them, and how often they should crawl them again to check for content updates.
The relative importance of each webpage: Most web crawlers don’t crawl the entire publicly available Internet and aren’t intended to; instead they decide which pages to crawl first based on the number of other pages that link to that page, the number of visitors that page gets, and other factors that signify the page’s likelihood of containing important information.
The idea is that a webpage that is cited by a lot of other webpages and gets a lot of visitors is likely to contain high-quality, authoritative information, so it’s especially important that a search engine has it indexed – just as a library might make sure to keep plenty of copies of a book that gets checked out by lots of people.
Revisiting webpages: Content on the Web is continually being updated, removed, or moved to new locations. Web crawlers will periodically need to revisit pages to make sure the latest version of the content is indexed.
Robots.txt requirements: Web crawlers also decide which pages to crawl based on the robots.txt protocol (also known as the robots exclusion protocol). Before crawling a webpage, they will check the robots.txt file hosted by that page’s web server. A robots.txt file is a text file that specifies the rules for any bots accessing the hosted website or application. These rules define which pages the bots can crawl, and which links they can follow.
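As a rough illustration of the robots exclusion check, the sketch below uses Python’s standard `urllib.robotparser` module. The rules in `ROBOTS_TXT`, the bot name `ExampleBot`, and the URLs are made up for the example; a real crawler would download the file from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler fetches this from the server
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="ExampleBot"):
    """Return True if the parsed rules permit this user agent to crawl the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/blog/post"))
print(allowed("https://example.com/search?q=spiders"))
```

Note that the file is advisory: it only has an effect on bots that choose to check it.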
All these factors are weighted differently within the proprietary algorithms that each search engine builds into their spider bots. Web crawlers from different search engines will behave slightly differently, although the end goal is the same: to download and index content from webpages.
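Since the actual weighting is proprietary, the following is only a toy sketch of how several signals could be combined into a crawl order. The weights, the signal names (`inbound_links`, `visits`, `days_since_crawl`), and the sample pages are all hypothetical.

```python
import heapq

# Hypothetical weights; real search engines do not publish theirs
WEIGHTS = {"inbound_links": 1.0, "visits": 0.001, "days_since_crawl": 0.5}


def priority(page):
    """Combine the signals into one score; a higher score is crawled sooner."""
    return sum(weight * page[name] for name, weight in WEIGHTS.items())


def crawl_order(pages):
    """Return page URLs from highest to lowest priority using a heap."""
    heap = [(-priority(p), p["url"]) for p in pages]  # negate: heapq is a min-heap
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


pages = [
    {"url": "https://example.com/blog", "inbound_links": 15, "visits": 2000, "days_since_crawl": 30},
    {"url": "https://example.com/news", "inbound_links": 120, "visits": 50000, "days_since_crawl": 1},
    {"url": "https://example.com/old", "inbound_links": 0, "visits": 10, "days_since_crawl": 90},
]
print(crawl_order(pages))
```

With these made-up weights the heavily linked news page comes first, and the long-unvisited page outranks the blog because of the revisit signal.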
Why are web crawlers called ‘spiders’?
The Internet, or at least the part that most users access, is also known as the World Wide Web – in fact that’s where the “www” part of most website URLs comes from. It was only natural to call search engine bots “spiders,” because they crawl all over the Web, just as real spiders crawl on spiderwebs.
Should web crawler bots always be allowed to access web properties?
That’s up to the web property, and it depends on a number of factors. Web crawlers require server resources in order to index content – they make requests that the server needs to respond to, just like a user visiting a website or other bots accessing a website. Depending on the amount of content on each page or the number of pages on the site, it could be in the website operator’s best interests not to allow search indexing too often, since too much indexing could overtax the server, drive up bandwidth costs, or both.
Also, developers or companies may not want some webpages to be discoverable unless a user already has been given a link to the page (without putting the page behind a paywall or a login). One example of such a case for enterprises is when they create a dedicated landing page for a marketing campaign, but they don’t want anyone not targeted by the campaign to access the page. In this way they can tailor the messaging or precisely measure the page’s performance. In such cases the enterprise can add a “noindex” tag to the landing page, and it won’t show up in search engine results. They can also add a “disallow” rule for the page in the robots.txt file, and search engine spiders won’t crawl it at all.
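For illustration, here is a small sketch of how a crawler might detect a “noindex” directive in a page’s robots meta tag, using Python’s standard `html.parser`. The class name `RobotsMetaParser`, the helper `is_indexable`, and the sample HTML are assumptions for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def is_indexable(html):
    """Return False when the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


# Hypothetical campaign landing page
landing_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(is_indexable(landing_page))
print(is_indexable("<html><head><title>Home</title></head></html>"))
```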
Website owners may not want web crawler bots to crawl part or all of their sites for a variety of other reasons as well. For instance, a website that offers users the ability to search within the site may want to block the search results pages, as these are not useful for most users. Other auto-generated pages that are only helpful for one user or a few specific users should also be blocked.
What is the difference between web crawling and web scraping?
Web scraping, data scraping, or content scraping is when a bot downloads the content on a website without permission, often with the intention of using that content for a malicious purpose.
Web scraping is usually much more targeted than web crawling. Web scrapers may be after specific pages or specific websites only, while web crawlers will keep following links and crawling pages continuously.
Also, web scraper bots may disregard the strain they put on web servers, while web crawlers, especially those from major search engines, will obey the robots.txt file and limit their requests so as not to overtax the web server.
How do web crawlers affect SEO?
SEO stands for search engine optimization, and it is the discipline of readying content for search indexing so that a website shows up higher in search engine results.
If spider bots don’t crawl a website, then it can’t be indexed, and it won’t show up in search results. For this reason, if a website owner wants to get organic traffic from search results, it is very important that they don’t block web crawler bots.
What web crawler bots are active on the Internet?
The bots from the major search engines are called:
Google: Googlebot (actually two crawlers, Googlebot Desktop and Googlebot Mobile, for desktop and mobile searches)
Yandex (Russian search engine): Yandex Bot
Baidu (Chinese search engine): Baidu Spider
There are also many less common web crawler bots, some of which aren’t associated with any search engine.
Why is it important for bot management to take web crawling into account?
Bad bots can cause a lot of damage, from poor user experiences to server crashes to data theft. However, in blocking bad bots, it’s important to still allow good bots, such as web crawlers, to access web properties. Cloudflare Bot Management allows good bots to keep accessing websites while still mitigating malicious bot traffic. The product maintains an automatically updated allowlist of good bots, like web crawlers, to ensure they aren’t blocked. Smaller organizations can gain a similar level of visibility and control over their bot traffic with Super Bot Fight Mode, available on Cloudflare Pro and Business plans.
Definition of web crawler – Merriam-Webster
or Web crawler
Definition of web crawler: a computer program that automatically and systematically searches web pages for certain keywords
Each search engine has its own proprietary computation (called an “algorithm”) that ranks websites for each keyword or combination of keywords. These algorithms use Web crawlers … that collect data from your website to determine where it ranks with respect to search terms. — Julie Brinton
First Known Use of web crawler
1994, in the meaning defined above
“Web crawler.” Dictionary, Merriam-Webster. Accessed 10 Oct. 2021.
A Hybrid Approach to Detect Malicious Web Crawlers | Hillstone Networks
What is a web crawler?
A web crawler (also called a web spider or web robot) is typically a script or computer program that browses the targeted website in an orderly and automated manner. It is an important method for collecting information on the Internet and is a critical component of search engine technology. Most popular search engines, such as Google and Baidu, use underlying web crawlers (Googlebot and Baiduspider, respectively) to get the latest data on the Internet.
All web crawlers take up Internet bandwidth, but not all web crawlers are benign. A well-behaved web crawler usually identifies itself and balances its crawling frequency and the content it requests, and thus its bandwidth consumption. On the other hand, an ill-behaved or malicious web crawler can consume large amounts of bandwidth and cause disruptions, especially to companies that rely on web traffic or content for their business.
For companies that rely on their website and online content to conduct business, a web crawler that is created by a hacker or unauthorized user and run on bots can be used to steal data and information, and possibly to stage DDoS attacks against targeted websites.
How to effectively detect malicious web crawlers has become a critical topic in today’s cyber threat defense sector.
Web Crawler Characteristics
Since malicious or ill-behaved web crawlers are primarily scripted programs that run on bot machines, they typically show the following behavior, with some variants:
A high HTTP request rate, with requests typically made in parallel.
A large number of URL visits, in terms of both the total number of URLs and the number of directories.
More requests for specific file types than for others; for example, more requests for HTML files.
Scarce use of the HTTP POST method, since the main purpose is to download information from the website rather than to upload it.
Potentially more use of the HTTP HEAD method (compared with normal browsing), since a crawler often needs to determine the type of a file before it tries to crawl it.
Potentially higher numbers of small files among the HTTP GET responses. This is because, very often, a crawler needs to maximize the results of its crawling within a minimal amount of time, so it skips large files and goes for smaller ones.
Where some of the URLs being crawled require further authentication, the crawler’s HTTP requests are directed to the authentication pages, resulting in HTTP 3XX or 4XX response codes.
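The characteristics above can be turned into simple per-IP measurements. The sketch below is one possible way to compute them from request records; the `Request` fields, the feature names, and the sample traffic are assumptions for illustration and are not taken from any vendor’s product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    ip: str
    method: str
    path: str
    status: int


def features(requests, window_seconds):
    """Compute crawler-related traffic features for each client IP."""
    by_ip = defaultdict(list)
    for r in requests:
        by_ip[r.ip].append(r)
    result = {}
    for ip, reqs in by_ip.items():
        n = len(reqs)
        result[ip] = {
            "rate": n / window_seconds,                       # requests per second
            "distinct_urls": len({r.path for r in reqs}),
            "distinct_dirs": len({r.path.rsplit("/", 1)[0] or "/" for r in reqs}),
            "post_share": sum(r.method == "POST" for r in reqs) / n,
            "head_share": sum(r.method == "HEAD" for r in reqs) / n,
            "redirect_or_error_share": sum(300 <= r.status < 500 for r in reqs) / n,
        }
    return result


# Hypothetical traffic: the first IP behaves like a crawler, the second like a user
sample = [
    Request("192.0.2.10", "HEAD", "/docs/a.html", 200),
    Request("192.0.2.10", "GET", "/docs/a.html", 200),
    Request("192.0.2.10", "GET", "/blog/b.html", 302),
    Request("192.0.2.10", "GET", "/admin/c.html", 401),
    Request("192.0.2.20", "GET", "/index.html", 200),
    Request("192.0.2.20", "POST", "/login", 200),
]
stats = features(sample, window_seconds=2)
print(stats["192.0.2.10"])
```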
Common Web Crawler Detection Methods
Commonly used methods, such as properly configuring the robots.txt file on the server and whitelisting user agents, can detect and block some low-level malicious crawlers. Advanced and sophisticated web crawlers are still difficult to detect because they can hide behind legitimate ones. Additionally, IT departments can invest time and resources to collect and analyze network traffic logging reports to surface hidden traces of web crawlers.
Take, for example, the snapshot below of actual logging data from a content hosting company. IT staff can identify the IP addresses with the most visits by sorting the log data; after filtering out those on the whitelists, the remaining most active, suspicious IP addresses can be examined further, and action can be taken if they are determined not to belong to known, benign lists.
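The manual log analysis described above can be sketched in a few lines. This assumes, purely for illustration, that each log line begins with the client IP address; the helper name `top_unlisted_ips`, the whitelist, and the sample log lines are hypothetical.

```python
from collections import Counter


def top_unlisted_ips(log_lines, whitelist, n=3):
    """Count requests per IP, drop whitelisted IPs, and return the busiest ones."""
    # Assumption: the client IP is the first whitespace-separated field of each line
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    for ip in whitelist:
        counts.pop(ip, None)
    return counts.most_common(n)


# Hypothetical access log entries
log_lines = [
    "192.0.2.10 GET /docs/a.html 200",
    "192.0.2.10 GET /docs/b.html 200",
    "192.0.2.10 GET /docs/c.html 200",
    "198.51.100.7 GET /index.html 200",
    "198.51.100.7 GET /about.html 200",
    "203.0.113.5 GET /index.html 200",
    "203.0.113.5 GET /blog.html 200",
    "203.0.113.5 GET /news.html 200",
    "203.0.113.5 GET /faq.html 200",
]
whitelist = {"203.0.113.5"}  # e.g. a known, benign search engine crawler
print(top_unlisted_ips(log_lines, whitelist))
```

A ranking like this only surfaces the busiest addresses, which is why a slow crawler that spreads its requests over a long period would not appear near the top.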
Hillstone’s Hybrid Approach to Detecting Suspicious Web Crawlers
Using logging data analysis to identify suspicious or malicious web crawlers, however effective, is a labor-intensive, ongoing effort that often consumes a lot of time and resources for IT departments.
Detection methods that are based solely on statistics from logging data can often generate false positive alerts; for example, they can’t distinguish a DoS attack from a crawler. This approach can also be ineffective in detecting slow-moving web crawlers: a vast amount of log data is collected at any given time and can only be stored for a limited period, so the traces of slow-moving crawlers are usually lost as time passes.
Hillstone Networks has adopted a hybrid approach that uses not only statistical logging data analysis but more importantly, focuses on behavioral modeling to detect suspicious web crawlers. This has proven to be effective in detecting sophisticated, malign crawlers as well as slow crawlers that are prone to losing trace.
In this hybrid approach, a set of pre-defined L3-L7 behavioral features monitor and collect data at the data plane, which is then fed into several behavioral models using machine learning algorithms that learn and profile these behavioral features periodically.
In tandem, network- and application-level traffic logging data collected over specific periods of time is also processed, sorted, filtered and analyzed.
Built on the predictive results of the behavioral modeling and statistical analysis from the logging data, a set of correlation rules are defined to correlate the corresponding results from different detecting modules. They are used to identify those IP addresses that are “abnormal” compared with the IP addresses that have normal web accessing and browsing behavior. The final result is a classified threat event that is saved into the threat event database.
The solution also offers a user interface for network and IT staff with clear and accurate visibility of suspicious web crawler activity along with corresponding IP addresses and other forensic data so that they can take proper action to mitigate these actions.
The following are two examples that illustrate suspicious web crawler activity and detection using behavioral modeling and analysis:
In the above example, you can note the following:
On the left graph, the red dots represent the abnormality of HTTP requests with 3XX return code. You will notice that some IP addresses have 65% of 3XX return codes; other IP addresses have 100% of 3XX return codes.
On the right graph, the red dots indicate abnormal URL width (i.e. the number of directories visited) within a learning cycle. Some IP addresses have a significantly higher number of URL directories visited than others in one learning cycle.
Hillstone’s behavioral model feature analyzes these abnormal IP addresses (those depicted by the red dots) and correlates the IP addresses that fit both of these behavioral abnormality rules. This makes it easy to identify potentially suspicious IPs that might be conducting malicious web crawling. In this case, the IP address 219.149.214.103 is a suspicious candidate.
Since using behavioral modeling has helped narrow down the IP addresses that have significant abnormal behaviors, it makes it very easy for network administrators to take proper and effective action.
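The correlation step in this example can be illustrated with a deliberately simple stand-in. The sketch below flags outliers with a median-based threshold and then intersects the two sets; it is not Hillstone’s proprietary behavioral model, and the function name `outliers`, the threshold factor `k`, and all sample values are assumptions.

```python
from statistics import median


def outliers(values, k=3.0):
    """Flag IPs whose value exceeds median + k * MAD (a simple stand-in rule)."""
    med = median(values.values())
    mad = median(abs(v - med) for v in values.values())  # median absolute deviation
    threshold = med + k * max(mad, 1e-9)
    return {ip for ip, v in values.items() if v > threshold}


# Hypothetical per-IP measurements for one learning cycle
redirect_share = {
    "192.0.2.1": 0.02, "192.0.2.2": 0.05, "192.0.2.3": 0.03,
    "192.0.2.4": 0.65, "192.0.2.5": 1.00,
}
dirs_visited = {
    "192.0.2.1": 3, "192.0.2.2": 5, "192.0.2.3": 4,
    "192.0.2.4": 4, "192.0.2.5": 120,
}

# Correlation rule: an IP is suspicious only if it is abnormal on both features
suspicious = outliers(redirect_share) & outliers(dirs_visited)
print(suspicious)
```

Requiring both rules to fire narrows the list: here one address has an unusual share of 3XX responses but a normal directory count, so it is not reported.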
Another example is shown below:
In this example, we note the following:
On the right graph, the red dots indicate abnormal URL width (i.e. the number of directories visited) within a learning cycle. Some IP addresses have a significantly higher number of URL directory visits than others in one learning cycle.
On the left graph, the red dots indicate the IP addresses that have an abnormal (higher) number of HTML file requests compared with the other monitored IP addresses.
The Hillstone behavioral model feature then performs an analysis of the abnormal IP addresses (those depicted by the red dots) and correlates the IP addresses that fit both of these behavioral abnormality rules. This makes it easy to identify the potentially suspicious IPs that might be conducting malicious web crawling. In this case, the IP address 202.112.90.159 is such a suspicious candidate.
Manual, static analysis of logging data (based on the most visited IP addresses) can be labor-intensive and incur higher cost and more overhead; more importantly, it is often ineffective because it can miss slow crawlers whose IP addresses appear less often in the logging data. Hillstone’s hybrid solution uses a proprietary self-learning behavioral modeling mechanism that is more effective in detecting these slow crawlers. It also provides statistical analysis to automatically detect sophisticated and suspicious web crawlers, as well as rich and actionable forensic evidence for the administrator.
Frequently Asked Questions about what is crawler
What is the use of crawler?
A web crawler, or spider, is a type of bot that is typically operated by search engines like Google and Bing. Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results.
What do you mean by web crawler?
A computer program that automatically and systematically searches web pages for certain keywords. Each search engine has its own proprietary computation (called an “algorithm”) that ranks websites for each keyword or combination of keywords.
What is a crawler in cyber security?
A web crawler (also called a web spider or web robot) is typically a script or computer program that browses the targeted website in an orderly and automated manner. It is an important method for collecting information on the Internet and is a critical component of search engine technology. (Apr 7, 2016)