Python Spider Tutorial


Scrapy Tutorial — Scrapy 2.5.1 documentation

In this tutorial, we’ll assume that Scrapy is already installed on your system.
If that’s not the case, see Installation guide.
We are going to scrape a website
that lists quotes from famous authors.
This tutorial will walk you through these tasks:
Creating a new Scrapy project
Writing a spider to crawl a site and extract data
Exporting the scraped data using the command line
Changing spider to recursively follow links
Using spider arguments
Scrapy is written in Python. If you’re new to the language you might want to
start by getting an idea of what the language is like, to get the most out of
Scrapy.
If you’re already familiar with other languages, and want to learn Python quickly, the Python Tutorial is a good resource.
If you’re new to programming and want to start with Python, the following books
may be useful to you:
Automate the Boring Stuff With Python
How To Think Like a Computer Scientist
Learn Python 3 The Hard Way
You can also take a look at this list of Python resources for non-programmers,
as well as the suggested resources in the learnpython-subreddit.
Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter a
directory where you’d like to store your code and run:
scrapy startproject tutorial
This will create a tutorial directory with the following contents:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project’s Python module, you’ll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you’ll later put your spiders
            __init__.py
Our first Spider
Spiders are classes that you define and that Scrapy uses to scrape information
from a website (or a group of websites). They must subclass
Spider and define the initial requests to make,
optionally how to follow links in the pages, and how to parse the downloaded
page content to extract data.
This is the code for our first Spider. Save it in a Python file
under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
As you can see, our Spider subclasses scrapy.Spider
and defines some attributes and methods:
name: identifies the Spider. It must be
unique within a project, that is, you can’t set the same name for different
Spiders.
start_requests(): must return an iterable of
Requests (you can return a list of requests or write a generator function)
which the Spider will begin to crawl from. Subsequent requests will be
generated successively from these initial requests.
parse(): a method that will be called to handle
the response downloaded for each of the requests made. The response parameter
is an instance of TextResponse that holds
the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting
the scraped data as dicts and also finding new URLs to
follow and creating new requests (Request) from them.
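The file-naming step in parse() relies only on string handling, so it can be tried outside Scrapy. The following is a minimal sketch, assuming page URLs shaped like .../page/N/; the helper name page_filename, the example URL and the .html suffix are illustrative assumptions rather than part of the tutorial's spider.

```python
def page_filename(url: str) -> str:
    """Build an output filename from a URL shaped like .../page/<n>/."""
    # With a trailing slash, split("/") ends in an empty string,
    # so the page number is the second-to-last element.
    page = url.split("/")[-2]
    # The ".html" suffix is an assumption made for this sketch.
    return f"quotes-{page}.html"


if __name__ == "__main__":
    # Hypothetical example URL following the assumed shape.
    print(page_filename("http://example.com/page/1/"))
```

Note that the [-2] index depends on the trailing slash: without it, the page number would be the last element instead.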
How to run our spider
To put our spider to work, go to the project’s top level directory and run:
scrapy crawl quotes
This command runs the spider named quotes that we’ve just added, which
will send some requests to the site. You will get an output
similar to this:
... (omitted for brevity)
2016-12-16 21:24:05 [] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [] DEBUG: Crawled (404) (referer: None)
2016-12-16 21:24:05 [] DEBUG: Crawled (200) (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file
2016-12-16 21:24:05 [] INFO: Closing spider (finished)
...
Now, check the files in the current directory. You should notice that two new
files have been created, one per URL, with the content
of the respective pages, as our parse method instructs.
Note
If you are wondering why we haven’t parsed the HTML yet, hold
on, we will cover that soon.
What just happened under the hood?
Scrapy schedules the scrapy.Request objects
returned by the start_requests method of the Spider. Upon receiving a
response for each one, it instantiates Response objects
and calls the callback method associated with the request (in this case, the
parse method) passing the response as argument.
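The schedule, download, callback cycle described above can be pictured with a toy loop. This is only a stdlib sketch of the idea, not Scrapy's engine: the names FAKE_PAGES, run and parse, and the tuple used as a "request", are all assumptions made for illustration.

```python
from collections import deque

# Stand-in for the network: maps a page identifier to the body a server would return.
FAKE_PAGES = {
    "page-1": b"<html>first</html>",
    "page-2": b"<html>second</html>",
}


def run(start_requests):
    """Toy engine loop: take a scheduled request, 'download' it, call its callback."""
    queue = deque(start_requests)
    results = []
    while queue:
        url, callback = queue.popleft()
        body = FAKE_PAGES[url]               # pretend download
        results.append(callback(url, body))  # the response is handed to the callback
    return results


def parse(url, body):
    """Callback: receives the 'response' for one request."""
    return f"{url}: {len(body)} bytes"


if __name__ == "__main__":
    for line in run([("page-1", parse), ("page-2", parse)]):
        print(line)
```

The real engine downloads concurrently and wraps the body in a Response object, but the pairing of each request with a callback is the same.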
A shortcut to the start_requests method
Instead of implementing a start_requests() method
that generates scrapy.Request objects from URLs,
you can just define a start_urls class attribute
with a list of URLs. This list will then be used by the default implementation
of start_requests() to create the initial requests
for your spider:
start_urls = [...]  # the same list of URLs as before
The parse() method will be called to handle each
of the requests for those URLs, even though we haven’t explicitly told Scrapy
to do so. This happens because parse() is Scrapy’s
default callback method, which is called for requests without an explicitly
assigned callback.
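The relationship between start_urls, the default start_requests() and the default parse() callback can be sketched with a small stand-in class. This is an assumption-laden illustration, not Scrapy's code: MiniSpider, QuotesLike and the (url, callback) tuples are hypothetical names and shapes.

```python
class MiniSpider:
    """Toy base class mirroring the idea of a default start_requests()."""

    start_urls = []

    def start_requests(self):
        # No callback is named by the subclass; parse() is used by default.
        for url in self.start_urls:
            yield (url, self.parse)

    def parse(self, response):
        raise NotImplementedError


class QuotesLike(MiniSpider):
    # Placeholder identifiers, not real URLs.
    start_urls = ["page-1", "page-2"]

    def parse(self, response):
        return f"parsed {response}"


if __name__ == "__main__":
    spider = QuotesLike()
    for url, callback in spider.start_requests():
        print(callback(url))
```

The subclass only declares data (start_urls) and behaviour (parse); the base class supplies the wiring between them.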
Storing the scraped data
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -O quotes.json
That will generate a quotes.json file containing all scraped items,
serialized in JSON.
The -O command-line switch overwrites any existing file; use -o instead
to append new content to any existing file. However, appending to a JSON file
makes the file contents invalid JSON. When appending to a file, consider
using a different serialization format, such as JSON Lines:
scrapy crawl quotes -o quotes.jl
The JSON Lines format is useful because it’s stream-like: you can easily
append new records to it. It doesn’t have the same problem as JSON when you run
the command twice. Also, as each record is on a separate line, you can process big files
without having to fit everything in memory. Tools like JQ can help with
that at the command line.
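The difference between appending to a JSON file and to a JSON Lines file can be reproduced with the standard library alone. The sketch below simulates two runs writing to the same text; the helper names and the sample items are assumptions for illustration, and it approximates rather than reproduces what Scrapy's exporters write.

```python
import json

# Hypothetical scraped items from one run.
RUN = [{"author": "A", "text": "first"}, {"author": "B", "text": "second"}]


def append_json(existing: str, items: list) -> str:
    """Append a whole JSON array to whatever text is already there."""
    return existing + json.dumps(items)


def append_jsonl(existing: str, items: list) -> str:
    """Append one JSON object per line."""
    return existing + "".join(json.dumps(item) + "\n" for item in items)


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    twice_json = append_json(append_json("", RUN), RUN)
    twice_jsonl = append_jsonl(append_jsonl("", RUN), RUN)
    # Two arrays back to back are not a single valid JSON document.
    print(is_valid_json(twice_json))
    # Every JSON Lines record still parses on its own.
    print(all(is_valid_json(line) for line in twice_jsonl.splitlines()))
```

Because each JSON Lines record is independent, a reader can also process the file line by line instead of loading it whole.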
In small projects (like the one in this tutorial), that should be enough.
However, if you want to perform more complex things with the scraped items, you
can write an Item Pipeline. A placeholder file
for Item Pipelines has been set up for you when the project is created, in
tutorial/pipelines.py. However, you don’t need to implement any item
pipelines if you just want to store the scraped items.
Following links
Let’s say, instead of just scraping the stuff from the first two pages
of the site, you want quotes from all the pages in the website.
Now that you know how to extract data from pages, let’s see how to follow links
from them.
The first thing to do is extract the link to the page we want to follow. Examining
our page, we can see there is a link to the next page in its markup.


We can try extracting it in the shell:
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
This gets the anchor element, but we want the attribute href. For that,
Scrapy supports a CSS extension that lets you select the attribute contents,
like this:
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
There is also an attrib property available
(see Selecting element attributes for more):
>>> response.css('li.next a').attrib['href']
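Let’s now see our spider modified to recursively follow the link to the next
page, extracting data from it:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)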
Now, after extracting the data, the parse() method looks for the link to
the next page, builds a full absolute URL using the
urljoin() method (since the links can be
relative) and yields a new request to the next page, registering itself as
callback to handle the data extraction for the next page and to keep the
crawling going through all the pages.
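To see what resolving a relative link against the current page URL means, the standard library's urllib.parse.urljoin can be tried on its own; it follows the usual URL resolution rules that urljoin() on a response relies on. The base URL below is a hypothetical example, not taken from the tutorial.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed.
BASE = "http://example.com/page/1/"


def resolve(link: str) -> str:
    """Resolve a link found in the page against the page's own URL."""
    return urljoin(BASE, link)


if __name__ == "__main__":
    for link in ("/page/2/", "page/2/", "http://other.example.org/x"):
        print(link, "->", resolve(link))
```

A link starting with a slash replaces the whole path, a link without one is resolved relative to the current directory, and an absolute URL is returned unchanged.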
What you see here is Scrapy’s mechanism of following links: when you yield
a Request in a callback method, Scrapy will schedule that request to be sent
and register a callback method to be executed when that request finishes.
Using this, you can build complex crawlers that follow links according to rules
you define, and extract different kinds of data depending on the page it’s
visiting.
In our example, it creates a sort of loop, following all the links to the next page
until it doesn’t find one – handy for crawling blogs, forums and other sites with
pagination.
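The loop created by yielding a request for the next page from inside the callback can be simulated without any network access. In this sketch the site is a dictionary, a "request" is a tagged tuple, and crawl is a toy scheduler; all of these names and shapes are assumptions for illustration only.

```python
# Hypothetical three-page site: each page lists quotes and an optional next link.
FAKE_SITE = {
    "/page/1/": {"quotes": ["a", "b"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["c"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["d"], "next": None},
}


def parse(path):
    """Callback: yield scraped items, then a follow-up request if there is one."""
    page = FAKE_SITE[path]
    for text in page["quotes"]:
        yield {"text": text}
    if page["next"] is not None:
        yield ("follow", page["next"])


def crawl(start):
    """Toy scheduler: items are collected, follow-up requests are queued."""
    pending = [start]
    items = []
    while pending:
        path = pending.pop(0)
        for result in parse(path):
            if isinstance(result, tuple):
                pending.append(result[1])
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    print([item["text"] for item in crawl("/page/1/")])
```

The crawl stops on its own once a page has no next link, which is the same termination condition the spider relies on.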
A shortcut for creating Requests
As a shortcut for creating Request objects you can use
response.follow:
        'author': quote.css('span small::text').get(),
        # ...
        yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly – no
need to call urljoin. Note that response.follow just returns a Request
instance; you still have to yield this Request.
You can also pass a selector to response.follow instead of a string;
this selector should extract necessary attributes:
for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href
attribute automatically. So the code can be shortened further:
for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)
To create multiple requests from an iterable, you can use
response.follow_all instead:
anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)
or, shortening it further:
yield from response.follow_all(css='ul.pager a', callback=self.parse)
More examples and patterns
Here is another spider that illustrates callbacks and following links,
this time for scraping author information:
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider will start from the main page, it will follow all the links to the
authors pages calling the parse_author callback for each of them, and also
the pagination links with the parse callback as we saw before.
Here we’re passing callbacks to
response.follow_all as positional
arguments to make the code shorter; it also works for
Request.
The parse_author callback defines a helper function to extract and clean up the
data from a CSS query and yields the Python dict with the author data.
Another interesting thing this spider demonstrates is that, even if there are
many quotes from the same author, we don’t need to worry about visiting the
same author page multiple times. By default, Scrapy filters out duplicated
requests to URLs already visited, avoiding the problem of hitting servers too
much because of a programming mistake. This can be configured by the setting
DUPEFILTER_CLASS.
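The idea behind duplicate filtering can be sketched as a set of fingerprints of requests already seen. This is a simplified stand-in, not Scrapy's implementation: the class name SeenFilter is hypothetical, and the fingerprint here is just a hash of the URL string, whereas Scrapy's request fingerprint also takes other parts of the request, such as the method and body, into account.

```python
import hashlib


class SeenFilter:
    """Toy duplicate filter: remembers a fingerprint for every URL it is shown."""

    def __init__(self):
        self._fingerprints = set()

    def request_seen(self, url: str) -> bool:
        """Return True if this URL was already seen, otherwise record it."""
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self._fingerprints:
            return True
        self._fingerprints.add(fingerprint)
        return False


if __name__ == "__main__":
    dupefilter = SeenFilter()
    # Hypothetical author links collected from several quotes.
    links = ["/author/A", "/author/B", "/author/A"]
    to_visit = [link for link in links if not dupefilter.request_seen(link)]
    print(to_visit)
```

With such a filter in front of the scheduler, an author page linked from many quotes is requested only once.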
Hopefully by now you have a good understanding of how to use the mechanism
of following links and callbacks with Scrapy.
As yet another example spider that leverages the mechanism of following links,
check out the CrawlSpider class for a generic
spider that implements a small rules engine that you can use to write your
crawlers on top of it.
Also, a common pattern is to build an item with data from more than one page,
using a trick to pass additional data to the callbacks.
Using spider arguments
You can provide command line arguments to your spiders by using the -a
option when running them:
scrapy crawl quotes -O quotes-humor.json -a tag=humor
These arguments are passed to the Spider’s __init__ method and become
spider attributes by default.
In this example, the value provided for the tag argument will be available
via self.tag. You can use this to make your spider fetch only quotes
with a specific tag, building the URL based on the argument:
    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
If you pass the tag=humor argument to this spider, you’ll notice that it
will only visit URLs from the humor tag.
You can learn more about handling spider arguments in the spider arguments section of the Scrapy documentation.
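How a -a argument ends up as an attribute that getattr() can read may be easier to see in a stripped-down class. The sketch below is a hypothetical stand-in (MiniSpider, start_url and the example base URL are assumptions), showing only the pattern of copying keyword arguments onto the instance and reading them back with a default.

```python
class MiniSpider:
    """Toy spider: keyword arguments become instance attributes."""

    def __init__(self, **kwargs):
        # Each name=value pair is stored as an attribute of the instance.
        self.__dict__.update(kwargs)

    def start_url(self, base: str = "http://example.com/") -> str:
        """Build the first URL, narrowing to a tag page when a tag was given."""
        tag = getattr(self, "tag", None)  # None when no tag argument was passed
        url = base
        if tag is not None:
            url = url + "tag/" + tag
        return url


if __name__ == "__main__":
    print(MiniSpider().start_url())
    print(MiniSpider(tag="humor").start_url())
```

Using getattr() with a default keeps the spider usable both with and without the argument.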
Next steps
This tutorial covered only the basics of Scrapy, but there are many other
features not mentioned here. Check the What else? section in the
Scrapy at a glance chapter for a quick overview of the most important ones.
You can continue from the section Basic concepts to know more about the
command-line tool, spiders, selectors and other things the tutorial hasn’t covered like
modeling the scraped data. If you prefer to play with an example project, check
the Examples section.
Python Scrapy tutorial for beginners - 01 - Creating your first ...


Learn how to fetch the data of any website with Python and the Scrapy Framework in just minutes. In the first lesson of ‘Python scrapy tutorial for beginners’, we will scrape the data from a book store, extracting all the information and storing it in a file.
In this post you will learn:
How to prepare your environment and install everything
How to create a Scrapy project and spider
How to fetch the data from the HTML
How to manipulate the data and extract the data you want
How to store the data in a file
Preparing your environment and installing everything
Before anything, we need to prepare our environment and install everything.
In Python, we create virtual environments to give each project a separate environment with its own dependencies. For example, Project1 has Python 3.4 and Scrapy 1.2, and Project2 has Python 3.7.3. As we keep separate environments, one for each project, we will never have a conflict caused by different versions of packages.
You can use Conda, virtualenv or Pipenv to create a virtual environment. In this course, I will use pipenv. You only need to install it with pip install pipenv and to create a new virtual environment with pipenv shell.
Once you are set, install Scrapy with pip install scrapy. That’s all you need.
Time to create the project and your spider.
Creating a project and a spider – And what they are
Before anything, we need to create a Scrapy project. In your current folder, enter:
scrapy startproject books
This will create a project named ‘books’. Inside you’ll find a few files. I’ll explain them in a more detailed post but here’s a brief explanation:
books/
    scrapy.cfg <-- Configuration file (DO NOT TOUCH!)
    books/
        __init__.py <-- Empty file that marks this as a Python folder
        items.py <-- Model of the item to scrape
        middlewares.py <-- Scrapy processing hooks (DO NOT TOUCH)
        pipelines.py <-- What to do with the scraped item
        settings.py <-- Project settings file
        spiders/ <-- Directory of our spiders (empty by now)
After creating a project, navigate to the project created (cd books) and once inside the folder, create a spider by passing it the name and the root URL without ‘www’:
scrapy genspider spider books.toscrape.com
Now we have our spider inside the spiders folder! You will have something like this:
# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass
First, we import scrapy. Then, a class is created inheriting ‘Spider’ from Scrapy. That class has 3 variables and a method. The variables are the spider’s name, the allowed_domains and the start_urls. Pretty self-explanatory. The name is what we will use in a second to run the spider, allowed_domains limits the scope of the scraping process (it can’t follow any URL outside the domains specified here) and start_urls are the starting points of the Scrapy spider. In this case, just one.
The parse method is called internally when we start the Scrapy spider. Right now it has only ‘pass’: it does nothing. Let’s solve that.
How to fetch data from the HTML
We are going to query the HTML and to do so we need Xpath, a query language. Don’t worry, even if it seems weird at first, it is easy to learn as all you need are a few functions.
Parse method
But first, let’s see what we have in the ‘parse’ method. It is called automatically when the Scrapy spider starts. As arguments, we have self (the instance of the class) and a response. The response is what the server returns when we request a page. In this class, we are requesting the start URL and in response we have an object with all the HTML, a status code and more.
Replace ‘pass’ with ‘print(response.status)’ and run the spider:
scrapy crawl spider
Among a lot of output, we see that we have crawled the start URL, got a 200 HTTP status (success) and then the spider stopped.
Besides ‘status’, the response has a lot of methods. The one we are going to use right now is ‘xpath’.
Our first steps with Xpath
Open the starting URL, and right-click -> inspect any book. A side menu will open with the HTML structure of the website (if not, make sure you have selected the ‘Elements’ tab). You’ll have something like this:
We can see that each ‘article’ tag contains all the information we want.
The plan is to grab all articles, then, one by one, get all the information from each book.
First, let’s see how we select all articles.
If we click on the HTML in the side menu and press Control + F, the search menu opens:
At the bottom-right, you can read “Find by string, selector or Xpath”. Scrapy uses Xpath, so let’s use it.
To start a query with Xpath, write ‘//’ then what you want to find. We want to grab all the articles, so type ‘//article’. We want to be more accurate, so let’s grab all the articles with the attribute ‘class = product_pod’. To specify an attribute, type it between brackets, like this: ‘//article[@class=”product_pod”]’.
You can see now that we have selected 20 elements: The 20 initial books.
Seems like we got it! Let’s copy that Xpath instruction and use it to select the articles in our spider. Then, we store all the books.
all_books = response.xpath('//article[@class="product_pod"]')
Once we have all the books, we want to look inside each book for the information we want. Let’s start with the title. Go to your URL and search where the full title is located. Right-click any title and then select ‘Inspect’.
Inside the h3 tag, there is an ‘a’ tag with the book title as ‘title’ attribute. Let’s loop over the books and extract it.
for book in all_books:
    title = book.xpath('.//h3/a/@title').extract_first()
We get all the books, and for each one of them, we search for the ‘h3’ tag, then the ‘a’ tag, and we select the @title attribute. We want that text, so we use ‘extract_first’ (we can also use ‘extract’ to extract all of them).
As we are scraping not the whole HTML but a small subset (the one in ‘book’), we need to put a dot at the start of the Xpath expression. Remember: ‘//’ for the whole HTML response, ‘.//’ for a subset of that HTML we already extracted.
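The difference between searching the whole document and searching inside one already-selected element can be tried with the standard library's xml.etree.ElementTree, which supports a limited subset of Xpath. This is a sketch under assumptions: the markup below is a made-up, well-formed fragment, and because ElementTree cannot select attributes with ‘/@title’, the attribute is read with .get() instead; Scrapy's own selectors are not used here.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup imitating a page with two books and one other block.
PAGE = """
<html><body>
  <article class="product_pod">
    <h3><a href="a.html" title="First Book">First...</a></h3>
  </article>
  <article class="product_pod">
    <h3><a href="b.html" title="Second Book">Second...</a></h3>
  </article>
  <article class="other">
    <h3><a href="c.html" title="Not a book">x</a></h3>
  </article>
</body></html>
""".strip()

root = ET.fromstring(PAGE)

# Search the whole document for the articles we care about.
books = root.findall(".//article[@class='product_pod']")

# Search inside each selected article only: the leading dot keeps the
# query relative to 'book' instead of the whole document.
titles = [book.find(".//h3/a").get("title") for book in books]

if __name__ == "__main__":
    print(len(root.findall(".//a")), "links in the whole page")
    print(titles)
```

Only the two titles inside ‘product_pod’ articles are collected, even though the page contains a third link.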
We have the title, now let’s get the price. Right-click the price and inspect it.
The text we want is inside a ‘p’ tag with the ‘price_color’ class inside a ‘div’ tag. Add this after the title:
    price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
We go to any ‘div’ with a ‘p’ child that has a ‘price_color’ class, then we use the ‘text()’ function to get the text. And then, we extract_first() our selection.
Let’s see what we have. Print both the price and the title and run the spider.
    print(title)
    print(price)
Everything is working as planned. Let’s take the image URL too. Right-click the image, inspect it:
We don’t have a full URL here but a partial one.
The ‘src’ attribute has the relative URL, not the whole URL. The site’s base URL is missing. Well, we just need to add it. Add this at the bottom of your method.
    image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()
    print(image_url)
We get the ‘img’ tag with the class ‘thumbnail’, we get the relative URL with ‘src’, then we prepend the first (and only) start URL. Again, let’s print the result. Run the spider again.
Looking good! Open any of the URLs and you’ll see the cover’s thumbnail.
Now let’s extract the URL so we can buy any book if we are interested.
The book URL is stored in the href of both the title and the thumbnail. Either will do.
    book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
    print(book_url)
Run the spider again:
Click on any URL and you’ll go to that book’s page.
Now we are selecting all the fields we want, but we are not doing anything with them, right? We need to ‘yield’ (or ‘return’) them. For each book, we are going to return its title, price, image and book URL.
Remove all the prints and yield the items as a dictionary:
    yield {
        'title': title,
        'price': price,
        'Image URL': image_url,
        'Book URL': book_url,
    }
Run the spider and look at the terminal:
Saving the data into a file
While it looks cool in the terminal, it is of little use there. Why don’t we store it in a file we can use later?
When we run our spider we have optional arguments. One of them is the name of the file you want to store. Run this.
scrapy crawl spider -o
Wait until it’s done… a new file has appeared! Double-click it to open it.
All the information we saw on the terminal is now stored in a file. Isn’t that cool? We can do the same with other file formats.
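To get a feel for what storing the yielded dictionaries in different formats involves, the same list of items can be serialized with the standard library. This is a rough sketch of the idea and not Scrapy's feed export code; the sample items and the helper names to_json and to_csv are assumptions.

```python
import csv
import io
import json

# Hypothetical items shaped like the dictionaries the spider yields.
ITEMS = [
    {"title": "First Book", "price": "51.77"},
    {"title": "Second Book", "price": "53.74"},
]


def to_json(items: list) -> str:
    """Serialize all items as one JSON array."""
    return json.dumps(items, indent=2)


def to_csv(items: list) -> str:
    """Serialize items as CSV, using the first item's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_json(ITEMS))
    print(to_csv(ITEMS))
```

Because every item is a flat dictionary with the same keys, it maps naturally to both a JSON array and CSV rows.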
Conclusion
I know the first time is tricky, but you have learnt the basics of Scrapy. You know how to:
Create a Scrapy spider to navigate a URL
Understand how a Scrapy project is structured
Use Xpath to extract the data
Store the data in files
I suggest you keep training. Look for a URL you want to scrape and try extracting a few fields as you did in the Beautiful Soup tutorial. The trick of Scrapy is learning how Xpath works.
But… do you remember that each book has its own URL?
Inside each item we scraped, there’s more information we can take. And we’ll do it in the second lesson of this series.
A Minimalist End-to-End Scrapy Tutorial (Part I) | by Harry Wang


Systematic Web Scraping for Beginners
Web scraping is an important skill for data scientists. I have developed a number of ad hoc web scraping projects using Python, BeautifulSoup, and Scrapy in the past few years and read a few books and tons of online tutorials along the way. However, I have not found a simple beginner-level tutorial that is end-to-end in the sense that it covers all basic steps and concepts in a typical Scrapy web scraping project (therefore Minimalist in the title). That’s why I am writing this, and I hope the code repo can serve as a template to help jumpstart your web scraping projects.
Many people ask: should I use BeautifulSoup or Scrapy? They are different things: BeautifulSoup is a library for parsing HTML and XML, and Scrapy is a web scraping framework. You can use BeautifulSoup instead of Scrapy’s built-in selectors if you want, but comparing BeautifulSoup to Scrapy is like comparing the Mac keyboard to the iMac or, a better metaphor as stated in the official documentation, “like comparing jinja2 to Django”, if you know what they are :). In short, you should learn Scrapy if you want to do serious and systematic web scraping.
In this tutorial series, I am going to cover the following steps:
(This tutorial) Start a Scrapy project from scratch and develop a simple spider. One important thing is the use of Scrapy Shell for analyzing pages and debugging, which is one of the main reasons you should use Scrapy over BeautifulSoup.
(Part II) Introduce Item and ItemLoader and explain why you want to use them (although they make your code seem more complicated at first).
(Part III) Store the data to the database using ORM (SQLAlchemy) via Pipelines and show how to set up the most common One-to-Many and Many-to-Many relationships.
(Part IV) Deploy the project to Scrapinghub (you have to pay for services such as scheduled crawling jobs) or set up your own servers completely free of charge by using the great open source project ScrapydWeb and Heroku.
(Part V) I created a separate repo (Scrapy + Selenium) to show how to crawl dynamic web pages (such as a page that loads additional content via scrolling) and how to use proxy networks (ProxyMesh) to avoid getting blocked.
Some prerequisites:
Basic knowledge of Python (Python 3 for this tutorial), virtual environments, Homebrew, etc. See my other article for how to set up the environment: How to Setup Mac for Python Development
Basic knowledge of Git and GitHub. I recommend the Pro Git book.
Basic knowledge of databases and ORM, e.g., Introduction to Structured Query Language (SQL)
Let’s get started! First, create a new folder, set up a Python 3 virtual environment inside the folder, and install Scrapy. To make this step easy, I created a starter repo, which you can fork and clone (see the Python 3 virtual environment documentation if needed):
$ git clone <starter repo URL>
$ cd scrapy-tutorial-starter
$ python3.6 -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Your folder should look like the following and I assume we always work in the virtual environment. Note that we only have one package in requirements.txt so far.
Run scrapy startproject tutorial to create an empty Scrapy project and your folder looks like:
Two identical “tutorial” folders were created. We don’t need the first level “tutorial” folder: delete it and move the second level “tutorial” folder with its contents one level up. I know this is confusing but that’s all you have to do with the folder structure. Now, your folder should look like:
Don’t worry about the auto-generated files so far, we will come back to those files later. This tutorial is based on the official Scrapy tutorial. Therefore, the website we are going to crawl is the same quotes site, which is quite simple: there are pages of quotes with authors and tags:
When you click the author, it goes to the author detail page with name, birthday, and bio.
Now, create a new file in the “spiders” folder with the following content:
You just created a spider named “quotes”, which sends a request to the site and gets the response from the server. However, the spider does not do anything so far when parsing the response and simply outputs a string to the console. Let’s run this spider with scrapy crawl quotes; you should see output like:
Next, let’s analyze the response, i.e., the HTML page of the site, using Scrapy Shell by running:
$ scrapy shell
2019-08-21 20:10:40 [] INFO: Spider opened
2019-08-21 20:10:41 [] DEBUG: Crawled (404) (referer: None)
2019-08-21 20:10:41 [] DEBUG: Crawled (200) (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 >
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>
You can select elements using either Xpath selectors or CSS selectors, and Chrome DevTools is often used to analyze the page (we won’t cover the selector details, please read the documentation to learn how to use them):
For example, you can test the selector and see the results in Scrapy Shell. Assume we want to get the quote block shown above:
You can either use Xpath, response.xpath("//div[@class='quote']").get() (.get() shows the first selected element, use .getall() to show all), or CSS, response.css("div .quote").get(). The quote text, author, and tags are what we want to get from this quote block:
>>> response.xpath("//div[@class='quote']").get()
We can proceed in the shell to get the data as follows:
get all quote blocks into “quotes”
use the first quote in “quotes”: quotes[0]
try the css selectors
>>> quotes = response.xpath("//div[@class='quote']")
>>> quotes[0].css(".text::text").getall()
['“The world as we have created it is a process of our thinking. ”']
>>> quotes[0].css(".author::text").getall()
['Albert Einstein']
>>> quotes[0].css(".tag::text").getall()
['change', 'deep-thoughts', 'thinking', 'world']
It seems that the selectors shown above get what we need. Note that I am mixing Xpath and CSS selectors for demonstration purposes here; there is no need to use both in this tutorial.
Now, let’s revise the spider file and use the keyword yield to output the selected data to the console (note that each page has many quotes and we use a loop to go over all of them):
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        self.logger.info('hello this is my first spider')
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get(),
                'tags': quote.css('.tag::text').getall(),
            }
Run the spider again with scrapy crawl quotes and you can see the extracted data in the log:
You can save the data in a JSON file by running: scrapy crawl quotes -o quotes.json
So far, we get all quote information from the first page, and our next task is to crawl all pages. You should notice a “Next” button at the bottom of the front page for page navigation. The logic is: click the Next button to go to the next page, get the quotes, and click Next again until the last page, which has no Next button.
Via Chrome DevTools, we can get the URL of the next page:
Let’s test it out in Scrapy Shell by running scrapy shell again:
$ scrapy shell
...
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
Now we can write the following code for the spider to go over all pages to get all quotes:
next_page = response.urljoin(next_page) gets the full URL and yield scrapy.Request(next_page, callback=self.parse) sends a new request to get the next page and uses a callback function to call the same parse function to get the quotes from the new page.
Shortcuts can be used to further simplify the code above: see this section. Essentially, response.follow supports relative URLs (no need to call urljoin) and automatically uses the href attribute for <a> elements. So, the code can be shortened further:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
Now, run the spider again with scrapy crawl quotes; you should see that quotes from all 10 pages have been extracted. Hang in there, we are almost done with this first part. The next task is to crawl the individual author’s page.
As shown above, when we process each quote, we can go to the individual author’s page by following the highlighted link. Let’s use Scrapy Shell to get the link:
$ scrapy shell
...
>>> response.css('.author + a::attr(href)').get()
'/author/Albert-Einstein'
So, during the loop of extracting each quote, we issue another request to go to the corresponding author’s page and create another parse_author function to extract the author’s name, birthday, born location and bio and output them to the console. The updated spider looks like the following:
Run the spider again with scrapy crawl quotes and double-check that everything you need to extract is output to the console correctly. Note that Scrapy is based on Twisted, a popular event-driven networking framework for Python, and thus is asynchronous. This means that the individual author page may not be processed in sync with the corresponding quote, e.g., the order of the author page results may not match the quote order on the page. We will discuss how to link the quote with its corresponding author page in a later part.
Congratulations, you have finished Part I of this tutorial. Learn more about Item and ItemLoader in Part II.
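The point about asynchronous processing, that results may arrive in a different order than the requests were issued, can be demonstrated with a small simulation. Note the assumptions: this sketch uses asyncio from the standard library as an analogue, whereas Scrapy itself is built on Twisted, and the page names and delays are made up.

```python
import asyncio


async def fetch(name: str, delay: float, finished: list) -> None:
    """Pretend to download a page that takes 'delay' seconds to respond."""
    await asyncio.sleep(delay)
    finished.append(name)  # record the order in which responses arrive


async def main() -> list:
    finished = []
    # Requests are issued in the order author-1, author-2, author-3 ...
    await asyncio.gather(
        fetch("author-1", 0.05, finished),
        fetch("author-2", 0.01, finished),
        fetch("author-3", 0.03, finished),
    )
    # ... but they complete in order of response time.
    return finished


def completion_order() -> list:
    return asyncio.run(main())


if __name__ == "__main__":
    print(completion_order())
```

This is why an item assembled from several pages needs its pieces to be linked explicitly rather than matched up by arrival order.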
