How Do Web Crawlers Work? – 417 Marketing
Last Updated: December 9, 2019
If you asked everyone you know to list their topmost fears, spiders would likely sit comfortably in the top five (after public speaking and death, naturally*). Creepy, crawly, and quick, even small spiders can make a grown man jump. But when it comes to the internet, spiders do more than spin webs. Search engines use spiders (also known as web crawlers) to explore the web, not to spin their own. If you have a website, web crawlers have creeped onto it at some point, but perhaps surprisingly, this is something for which you should be thankful. Without them, no one could find your website on a search engine.
Turns out, spiders aren’t so bad after all! But how do web crawlers work?
What Is a Web Crawler?
Although you might imagine web crawlers as little robots that live and work on the internet, in reality they’re simply part of a computer program written and used by search engines to update their web content or to index the web content of other websites.
A web crawler copies webpages so that they can be processed later by the search engine, which indexes the downloaded pages. This allows users of the search engine to find webpages quickly. The web crawler also validates links and HTML code, and sometimes it extracts other information from the website.
Web crawlers are known by a variety of different names including spiders, ants, bots, automatic indexers, web cutters, and (in the case of Google’s web crawler) Googlebot. If you want your website to rank highly on Google, you need to ensure that web crawlers can always reach and read your content.
How Do Web Crawlers Work?
Discovering URLs: How does a search engine discover webpages to crawl? First, the search engine may have already crawled the webpage in the past. Second, the search engine may discover a webpage by following a link from a page it has already crawled. Third, a website owner may ask for the search engine to crawl a URL by submitting a sitemap (a file that provides information about the pages on a site). Creating a clear sitemap and crafting an easily navigable website are good ways to encourage search engines to crawl your website.
Exploring a List of Seeds: Next, the search engine gives its web crawlers a list of web addresses to check out. These URLs are known as seeds. The web crawler visits each URL on the list, identifies all of the links on each page, and adds them to the list of URLs to visit. Using sitemaps and databases of links discovered during previous crawls, web crawlers decide which URLs to visit next. In this way, web crawlers explore the internet via links.
Adding to the Index: As web crawlers visit the seeds on their lists, they locate and render the content and add it to the index. The index is where the search engine stores all of its knowledge of the internet. It’s over 100, 000, 000 gigabytes in size! To create a full picture of the internet (which is critical for optimal search results pages), web crawlers must index every nook and cranny of the internet. In addition to text, web crawlers catalog images, videos, and other files.
Updating the Index: Web crawlers note key signals, such as the content, keywords, and the freshness of the content, to try to understand what a page is about. According to Google, “The software pays special attention to new sites, changes to existing sites, and dead links. ” When it locates these items, it updates the search index to ensure it’s up to date.
Crawling Frequency: Web crawlers are crawling the internet 24/7, but how often are individual pages crawled? According to Google, “Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site. ” The program takes the perceived importance of your website and the number of changes you’ve made since the last crawl into consideration. It also looks at your website’s crawl demand, or the level of interest Google and its searchers have in your website. If your website is popular, it’s likely that Googlebot will crawl it frequently to ensure your viewers can find your latest content through Google.
Blocking Web Crawlers: If you choose, you can block web crawlers from indexing your website. For example, using a file (discussed in more detail below) with certain rules is like holding a sign up to web crawlers saying, “Do not enter! ” Or if your HTTP header contains a status code relaying that the page doesn’t exist, web crawlers won’t crawl it. In some cases, a webmaster might inadvertantly block web crawlers from indexing a page, which is why it’s important to periodically check your website’s crawlability.
Using Protocols: Webmasters can use protocol to communicate with web crawlers, which always check a page’s file before crawling the page. A variety of rules can be included in the file. For example, you can define which pages a bot can crawl, specify which links a bot can follow, or opt out of crawling altogether using Google provides the same customization tools to all webmasters, and doesn’t allow any bribing or grant any special privileges.
Web crawlers have an exhausting job when you consider how many webpages exist and how many more are being created, updated, or deleted everyday. To make the process more efficient, search engines create crawling policies and techniques.
Web Crawling Policies and Techniques
To Restrict a Request: If a crawler only wants to find certain media types, it can make a HEAD request to ensure that all of the found resources will be the needed type.
To Avoid Duplicate Downloads: Web crawlers sometimes modify and standardize URLs so that they can avoid crawling the same resource multiple times.
To Download All Resources: If a crawler needs to download all of the resources from a given website, a path-ascending crawler can be used. It attempts to crawl every path in every URL on the list.
To Download Only Similar Webpages: Focused web crawlers are only interested in downloading webpages that are similar to each other. For example, academic crawlers only search for and download academic papers (they use filters to find PDF, postscript, and Word files and then use algorithms to determine if the pages are academic or not).
To Keep the Index Up to Speed: Things move fast on the Internet. By the time a web crawler is finished with a long crawl, the pages it downloaded might have been updated or deleted. To keep content up to date, crawlers use equations to determine websites’ freshness and age.
In addition, Google uses several different web crawlers to accomplish a variety of different jobs. For example, there’s Googlebot (desktop), Googlebot (mobile), Googlebot Video, Googlebot Images, and Googlebot News.
Reviewing the Crawling of Your Website
If you want to see how often Googlebot visits your website, open Google Search Console and head to the “Crawl” section. You can confirm that Googlebot visits your site, see how often it visits, verify how it sees your site, and even get a list of crawl errors to fix. If you wish, you may ask Googlebot to recrawl your website through Google Search Console as well. And if your load speed is suffering or you’ve noticed a sudden surge in errors, you may be able to fix these issues by altering your crawl rate limit in Google Search Console.
So… How Do Web Crawlers Work?
To put it simply, web crawlers explore the web and index the content they find so that the information can be retrieved by a search engine when needed. Most search engines run many crawling programs simultaneously on multiple servers. Due to the vast number of webpages on the internet, the crawling process could go on almost indefinitely, which is why web crawlers follow certain policies to be more selective about the pages they crawl.
Keep in mind that we only know the general answer to the question “How do web crawlers work? ” Google won’t reveal all the secrets behind its algorithms, as this could encourage spammers and allow other search engines to steal Google’s secrets.
See? Spiders aren’t so scary after all. A little secretive, perhaps, but perfectly harmless!
If you’re hoping to build a beautiful, effective website that ranks highly on Google, contact 417 Marketing for help. Our team of knowledgeable, creative, and passionate professionals specializes in SEO, web design and maintenance, and Google Ads, and we have successfully completed over 700 websites since our inception in 2010. Click here to contact us and learn more about what we can do for your company.
What is a web crawler? | How web spiders work | Cloudflare
What is a web crawler bot?
A web crawler, spider, or search engine bot downloads and indexes content from all over the Internet. The goal of such a bot is to learn what (almost) every webpage on the web is about, so that the information can be retrieved when it’s needed. They’re called “web crawlers” because crawling is the technical term for automatically accessing a website and obtaining data via a software program.
These bots are almost always operated by search engines. By applying a search algorithm to the data collected by web crawlers, search engines can provide relevant links in response to user search queries, generating the list of webpages that show up after a user types a search into Google or Bing (or another search engine).
A web crawler bot is like someone who goes through all the books in a disorganized library and puts together a card catalog so that anyone who visits the library can quickly and easily find the information they need. To help categorize and sort the library’s books by topic, the organizer will read the title, summary, and some of the internal text of each book to figure out what it’s about.
However, unlike a library, the Internet is not composed of physical piles of books, and that makes it hard to tell if all the necessary information has been indexed properly, or if vast quantities of it are being overlooked. To try to find all the relevant information the Internet has to offer, a web crawler bot will start with a certain set of known webpages and then follow hyperlinks from those pages to other pages, follow hyperlinks from those other pages to additional pages, and so on.
It is unknown how much of the publicly available Internet is actually crawled by search engine bots. Some sources estimate that only 40-70% of the Internet is indexed for search – and that’s billions of webpages.
What is search indexing?
Search indexing is like creating a library card catalog for the Internet so that a search engine knows where on the Internet to retrieve information when a person searches for it. It can also be compared to the index in the back of a book, which lists all the places in the book where a certain topic or phrase is mentioned.
Indexing focuses mostly on the text that appears on the page, and on the metadata* about the page that users don’t see. When most search engines index a page, they add all the words on the page to the index – except for words like “a, ” “an, ” and “the” in Google’s case. When users search for those words, the search engine goes through its index of all the pages where those words appear and selects the most relevant ones.
*In the context of search indexing, metadata is data that tells search engines what a webpage is about. Often the meta title and meta description are what will appear on search engine results pages, as opposed to content from the webpage that’s visible to users.
How do web crawlers work?
The Internet is constantly changing and expanding. Because it is not possible to know how many total webpages there are on the Internet, web crawler bots start from a seed, or a list of known URLs. They crawl the webpages at those URLs first. As they crawl those webpages, they will find hyperlinks to other URLs, and they add those to the list of pages to crawl next.
Given the vast number of webpages on the Internet that could be indexed for search, this process could go on almost indefinitely. However, a web crawler will follow certain policies that make it more selective about which pages to crawl, in what order to crawl them, and how often they should crawl them again to check for content updates.
The relative importance of each webpage: Most web crawlers don’t crawl the entire publicly available Internet and aren’t intended to; instead they decide which pages to crawl first based on the number of other pages that link to that page, the amount of visitors that page gets, and other factors that signify the page’s likelihood of containing important information.
The idea is that a webpage that is cited by a lot of other webpages and gets a lot of visitors is likely to contain high-quality, authoritative information, so it’s especially important that a search engine has it indexed – just as a library might make sure to keep plenty of copies of a book that gets checked out by lots of people.
Revisiting webpages: Content on the Web is continually being updated, removed, or moved to new locations. Web crawlers will periodically need to revisit pages to make sure the latest version of the content is indexed.
requirements: Web crawlers also decide which pages to crawl based on the protocol (also known as the robots exclusion protocol). Before crawling a webpage, they will check the file hosted by that page’s web server. A file is a text file that specifies the rules for any bots accessing the hosted website or application. These rules define which pages the bots can crawl, and which links they can follow. As an example, check out the file.
All these factors are weighted differently within the proprietary algorithms that each search engine builds into their spider bots. Web crawlers from different search engines will behave slightly differently, although the end goal is the same: to download and index content from webpages.
Why are web crawlers called ‘spiders’?
The Internet, or at least the part that most users access, is also known as the World Wide Web – in fact that’s where the “www” part of most website URLs comes from. It was only natural to call search engine bots “spiders, ” because they crawl all over the Web, just as real spiders crawl on spiderwebs.
Should web crawler bots always be allowed to access web properties?
That’s up to the web property, and it depends on a number of factors. Web crawlers require server resources in order to index content – they make requests that the server needs to respond to, just like a user visiting a website or other bots accessing a website. Depending on the amount of content on each page or the number of pages on the site, it could be in the website operator’s best interests not to allow search indexing too often, since too much indexing could overtax the server, drive up bandwidth costs, or both.
Also, developers or companies may not want some webpages to be discoverable unless a user already has been given a link to the page (without putting the page behind a paywall or a login). One example of such a case for enterprises is when they create a dedicated landing page for a marketing campaign, but they don’t want anyone not targeted by the campaign to access the page. In this way they can tailor the messaging or precisely measure the page’s performance. In such cases the enterprise can add a “no index” tag to the landing page, and it won’t show up in search engine results. They can also add a “disallow” tag in the page or in the file, and search engine spiders won’t crawl it at all.
Website owners may not want web crawler bots to crawl part or all of their sites for a variety of other reasons as well. For instance, a website that offers users the ability to search within the site may want to block the search results pages, as these are not useful for most users. Other auto-generated pages that are only helpful for one user or a few specific users should also be blocked.
What is the difference between web crawling and web scraping?
Web scraping, data scraping, or content scraping is when a bot downloads the content on a website without permission, often with the intention of using that content for a malicious purpose.
Web scraping is usually much more targeted than web crawling. Web scrapers may be after specific pages or specific websites only, while web crawlers will keep following links and crawling pages continuously.
Also, web scraper bots may disregard the strain they put on web servers, while web crawlers, especially those from major search engines, will obey the file and limit their requests so as not to overtax the web server.
How do web crawlers affect SEO?
SEO stands for search engine optimization, and it is the discipline of readying content for search indexing so that a website shows up higher in search engine results.
If spider bots don’t crawl a website, then it can’t be indexed, and it won’t show up in search results. For this reason, if a website owner wants to get organic traffic from search results, it is very important that they don’t block web crawler bots.
What web crawler bots are active on the Internet?
The bots from the major search engines are called:
Google: Googlebot (actually two crawlers, Googlebot Desktop and Googlebot Mobile, for desktop and mobile searches)
Yandex (Russian search engine): Yandex Bot
Baidu (Chinese search engine): Baidu Spider
There are also many less common web crawler bots, some of which aren’t associated with any search engine.
Why is it important for bot management to take web crawling into account?
Bad bots can cause a lot of damage, from poor user experiences to server crashes to data theft. However, in blocking bad bots, it’s important to still allow good bots, such as web crawlers, to access web properties. Cloudflare Bot Management allows good bots to keep accessing websites while still mitigating malicious bot traffic. The product maintains an automatically updated allowlist of good bots, like web crawlers, to ensure they aren’t blocked. Smaller organizations can gain a similar level of visibility and control over their bot traffic with Super Bot Fight Mode, available on Cloudflare Pro and Business plans.
How Google’s Site Crawlers Index Your Site – Google Search
Чтобы пользователи могли быстро найти нужные сведения, наши роботы собирают информацию на сотнях миллиардов страниц и упорядочивают ее в поисковом индексе.
Основы Google Поиска
При очередном сканировании наряду со списком веб-адресов, полученных во время предыдущего сканирования, используются файлы Sitemap, которые предоставляются владельцами сайтов. По мере посещения сайтов робот переходит по указанным на них ссылкам на другие страницы. Особое внимание он уделяет новым и измененным сайтам, а также неработающим ссылкам. Он самостоятельно определяет, какие сайты сканировать, как часто это нужно делать и какое количество страниц следует выбрать на каждом из них.
При помощи Search Console владельцы сайтов могут указывать, как именно следует сканировать их ресурсы, в частности предоставлять подробные инструкции по обработке страниц, запрашивать их повторное сканирование, а также запрещать сканирование, используя файл Google не увеличивает частоту сканирования отдельных ресурсов за плату. Чтобы результаты поиска были максимально полезными для пользователей, все владельцы сайтов получают одни и те же инструменты.
Поиск информации с помощью сканирования
Интернет похож на библиотеку, которая содержит миллиарды изданий и постоянно пополняется, но не располагает централизованной системой учета книг. Чтобы находить общедоступные страницы, мы используем специальное программное обеспечение, называемое поисковыми роботами. Роботы анализируют страницы и переходят по ссылкам на них – как обычные пользователи. После этого они отправляют сведения о ресурсах на серверы Google.
Систематизация информации с помощью индексирования
Во время сканирования наши системы обрабатывают материалы страниц так же, как это делают браузеры, и регистрируют данные по ключевым словам и новизне контента, а затем создают на их основе поисковый индекс.
Индекс Google Поиска содержит сотни миллиардов страниц. Его объем значительно превышает 100 миллионов гигабайт. Он похож на указатель в конце книги, в котором есть отдельная запись для каждого слова на всех проиндексированных страницах. Во время индексирования данные о странице добавляются в записи по всем словам, которые на ней есть.
Построение Сети Знаний — более современный способ определить интересы пользователей по сравнению с сопоставлением ключевых слов. Для этого мы упорядочиваем не только данные по страницам, но и другие типы информации. В настоящее время Google Поиск позволяет найти нужный фрагмент текста в миллионах книг из крупнейших библиотек, узнать расписание общественного транспорта, а также изучить данные общедоступных источников, таких как сайт Всемирного банка.
Frequently Asked Questions about how do web crawlers work
What is a Web crawler and how do they work?
A web crawler, or spider, is a type of bot that is typically operated by search engines like Google and Bing. Their purpose is to index the content of websites all across the Internet so that those websites can appear in search engine results.
How do web crawlers find websites?
Finding information by crawling We use software known as web crawlers to discover publicly available webpages. Crawlers look at webpages and follow links on those pages, much like you would if you were browsing content on the web. They go from link to link and bring data about those webpages back to Google’s servers.
How does Google crawler work?
Other pages are discovered when Google follows a link from a known page to a new page. Still other pages are discovered when a website owner submits a list of pages (a sitemap) for Google to crawl. … Once Google discovers a page URL, it visits, or crawls, the page to find out what’s on it.