Python Scrapy Selenium

Scrapy middleware to handle javascript pages using selenium

Installation
$ pip install scrapy-selenium
You should use python >= 3.6.
You will also need one of the Selenium compatible browsers.
Configuration
Add the browser to use, the path to the driver executable, and the arguments to pass to the executable to the scrapy settings:
from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
Optionally, set the path to the browser executable:
SELENIUM_BROWSER_EXECUTABLE_PATH = which('firefox')
In order to use a remote Selenium driver, specify SELENIUM_COMMAND_EXECUTOR instead of SELENIUM_DRIVER_EXECUTABLE_PATH:
SELENIUM_COMMAND_EXECUTOR = 'http://localhost:4444/wd/hub'
Add the SeleniumMiddleware to the downloader middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}
Usage
Use SeleniumRequest instead of the scrapy built-in Request, like below:
from scrapy_selenium import SeleniumRequest

yield SeleniumRequest(url=url, callback=self.parse_result)
The request will be handled by selenium, and the response will have an additional meta key, named driver, containing the selenium driver that processed the request.
def parse_result(self, response):
    print(response.request.meta['driver'].title)
For more information about the available driver methods and attributes, refer to the selenium python documentation.
The selector response attribute works as usual (but contains the html processed by the selenium driver).
print(response.selector.xpath('//title/@text'))
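Putting the pieces together, here is a minimal sketch of a complete spider using the middleware; the target url (quotes.toscrape.com's javascript page) and the css selector are example assumptions, not part of the scrapy-selenium docs:

import scrapy
from scrapy_selenium import SeleniumRequest


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Example javascript-rendered page; swap in your own target url.
        yield SeleniumRequest(url='http://quotes.toscrape.com/js/', callback=self.parse_result)

    def parse_result(self, response):
        # The response html was rendered by the selenium driver, so the
        # javascript-generated quotes are visible to normal selectors.
        for quote in response.css('div.quote span.text::text').getall():
            yield {'quote': quote}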
Additional arguments
SeleniumRequest accepts 4 additional arguments:
wait_time / wait_until
When used, selenium will perform an Explicit wait before returning the response to the spider.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse_result,
    wait_time=10,
    wait_until=EC.element_to_be_clickable((By.ID, 'someid')),
)
screenshot
When used, selenium will take a screenshot of the page, and the binary data of the captured screenshot will be added to the response meta:
yield SeleniumRequest(url=url, callback=self.parse_result, screenshot=True)

def parse_result(self, response):
    with open('image.png', 'wb') as image_file:
        image_file.write(response.meta['screenshot'])
script
When used, selenium will execute custom JavaScript code.
yield SeleniumRequest(url=url, callback=self.parse_result, script='window.scrollTo(0, document.body.scrollHeight);')
selenium with scrapy for dynamic page – Stack Overflow

If the url doesn't change between the two pages, then you should add dont_filter=True to your Request(), or scrapy will treat this url as a duplicate after processing the first page.
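For example (a one-line sketch; parse_page2 is a hypothetical callback name):

yield scrapy.Request(url, callback=self.parse_page2, dont_filter=True)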
If you need to render pages with javascript you should use scrapy-splash; you can also check this scrapy middleware, which can handle javascript pages using selenium, or you can do that by launching any headless browser.
But a more effective and faster solution is to inspect your browser and see what requests are made while submitting a form or triggering a certain event. Try to simulate the same requests your browser sends. If you can replicate the request(s) correctly, you will get the data you need.
Here is an example:
import json

from scrapy import Spider, Request


class ScrollScraper(Spider):
    name = "scrollingscraper"
    quote_url = "http://quotes.toscrape.com/api/quotes?page="
    start_urls = [quote_url + "1"]

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get('quotes', []):
            # QuoteItem is the scrapy Item defined elsewhere in the project
            quote_item = QuoteItem()
            quote_item["author"] = item.get('author', {}).get('name')
            quote_item['quote'] = item.get('text')
            quote_item['tags'] = item.get('tags')
            yield quote_item

        if data['has_next']:
            next_page = data['page'] + 1
            yield Request(self.quote_url + str(next_page))
When the pagination url is the same for every page and uses a POST request, you can use FormRequest() instead of Request(); both are the same, but FormRequest adds a new argument (formdata=) to the constructor.
Here is another spider example from this post:
import json

from scrapy import Spider, FormRequest
from scrapy.selector import Selector


class SpiderClass(Spider):
    # spider name and all
    name = 'ajax'
    page_incr = 1
    start_urls = ['']    # url elided in the original post
    pagination_url = ''  # url elided in the original post

    def parse(self, response):
        sel = Selector(response)
        if self.page_incr > 1:
            json_data = json.loads(response.text)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return
Web Scraping Framework Review: Scrapy VS Selenium

Introduction:
This is the #11 post of my Scrapy Tutorial Series. In this Scrapy tutorial, I will talk about the features of Scrapy and Selenium, compare them, and help you decide which one is better for your projects.
Talk About Selenium
Selenium is a framework designed to automate tests for web applications. It provides a way for developers to write tests in a number of popular programming languages such as C#, Java, Python, Ruby, etc. The tests written by developers can run against most web browsers such as Chrome, IE and Firefox.
As you can see, you can write a Python script to control the web browser to do some work automatically. For example, you can make the browser visit craigslist, click a target element or navigate to the target page, and get the html source code of the page.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.python.org")
assert "Python" in driver.title
elem = driver.find_element_by_name("q")
elem.send_keys("selenium")
elem.send_keys(Keys.RETURN)
assert "Google" in driver.title
driver.close()
From the code above, you can see the API is very beginner-friendly; you can easily write code with Selenium. That is why it is so popular in the developer community. Even though Selenium is mainly used to automate tests for web applications, it can also be used to develop web spiders, and many people have done this before.
Talk About Scrapy
Scrapy is a web crawling framework in which developers write code to create spiders, which define how a certain site (or a group of sites) will be scraped. The biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy is implemented using non-blocking (aka asynchronous) code for concurrency, which makes spider performance very good.
For those who have no idea what asynchronous means, here is a simple explanation.
When you do something synchronously, you wait for it to finish before moving on to another task. When you do something asynchronously, you can move on to another task before it finishes.
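A tiny Python sketch of the difference (asyncio is used here only for illustration; Scrapy itself builds on Twisted):

import asyncio


async def fetch(name, delay):
    # Stand-in for a network request that takes `delay` seconds.
    await asyncio.sleep(delay)
    print(f"{name} done")


async def main():
    # Asynchronous: both "requests" run concurrently, so this takes
    # ~2 seconds instead of the ~3 seconds a synchronous version
    # (one after the other) would need.
    await asyncio.gather(fetch("page1", 2), fetch("page2", 1))


asyncio.run(main())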
Scrapy has built-in support for extracting data from HTML sources using XPath expressions and CSS expressions.
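For example, a quick sketch with Scrapy's Selector (the html snippet is made up):

from scrapy.selector import Selector

html = '<html><body><h1 class="title">Hello</h1></body></html>'
sel = Selector(text=html)

print(sel.xpath('//h1/text()').get())   # XPath expression -> 'Hello'
print(sel.css('h1.title::text').get())  # CSS expression   -> 'Hello'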
Which One Should You Choose?
The two Python web scraping frameworks are created to do different jobs. Selenium is only used to automate web browser interaction, while Scrapy is used to download HTML, process data and save it.
When you compare Selenium vs Scrapy to figure out what is best for your project, you should consider the following issues.
Javascript
You should use a tool such as the Dev Tools in Chrome to help you figure out how the data is displayed on the dynamic page of the target site. If the data is included in the html source code, both frameworks work fine and you can choose whichever you like. But in some cases the data only shows up after many ajax/pjax requests, and that workflow makes it hard to use Scrapy to extract the data. If you are faced with this situation, I recommend you use Selenium instead.
Data Size
Before coding, you need to estimate the size of the extracted data and the number of urls that need to be visited. Scrapy only visits the urls you tell it to, but Selenium will control the browser to visit every js file, css file and img file needed to render the page; that is why Selenium is much slower than Scrapy when crawling.
If the data size is big, Scrapy is the better option because it can save you a lot of time and time is a valuable thing.
Extensibility
The architecture of Scrapy is well designed; you can easily develop custom middleware or pipelines to add custom functionality. Your Scrapy project can be both robust and flexible. After you develop several Scrapy projects, you will benefit from the architecture and like its design, because it is easy to migrate from one existing Scrapy spider project to another. You can check this article to see how to quickly save the scraped data into a database by using a Scrapy pipeline, without modifying the spider code: Scrapy Tutorial #9: How To Use Scrapy Item.
So if your project is small, the logic is not very complex and you want the job done quickly, you can use Selenium to keep your project simple. If your project needs more customization such as proxies or data pipelines, then Scrapy might be your choice here.
Ecosystem
Very few people have talked about this when comparing web scraping tools. Think about why people like to use WordPress to build a CMS instead of other frameworks; the key is the ecosystem. So many themes and plugins help people quickly build a CMS which meets their requirements.
Scrapy has so many related projects and plugins on open source websites such as GitHub, and many discussions on StackOverflow can help you fix potential issues. For example, if you want to use a proxy with your spider project, you can check the project scrapy-proxies, which helps you send HTTP requests using a random proxy from a list. All you need to do is change some settings.
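A sketch of those settings, following the scrapy-proxies README (the proxy list path and retry values are examples):

# settings.py
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Example path; the file holds one proxy per line
PROXY_LIST = '/path/to/proxy/list.txt'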
Best Practice
If you like Scrapy and you also want it to understand JavaScript, there are also some options for you.
You can create a new instance of a Selenium webdriver in the parse method of a Scrapy spider, do some work, extract the data, and then close it after all the work is done. You should remember to close it, or it might cause problems such as memory leaks.
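A minimal sketch of that pattern (the target url and selectors are hypothetical):

import scrapy
from selenium import webdriver


class HybridSpider(scrapy.Spider):
    name = 'hybrid'
    start_urls = ['http://quotes.toscrape.com/js/']  # hypothetical javascript page

    def parse(self, response):
        driver = webdriver.Firefox()
        try:
            driver.get(response.url)
            # Hand the rendered html back to a Scrapy selector.
            sel = scrapy.Selector(text=driver.page_source)
            for quote in sel.css('div.quote span.text::text').getall():
                yield {'quote': quote}
        finally:
            # Always close the driver, or it can leak memory and processes.
            driver.quit()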
Scrapy has an official project (I really like its ecosystem) called scrapy-splash that provides Scrapy and JavaScript integration.
If you are a Selenium fan and want your spider to run quietly, you can try PhantomJS, a headless browser. I like to develop spiders using Selenium with ChromeDriver because it is easy to debug; when I am done, the spider runs with PhantomJS in the terminal.
Conclusion
So which one is the better web scraping framework? There is no solid answer; it depends heavily on the actual situation. Below is a quick reference table.
Framework | Selenium | Scrapy
Javascript Support | Supports javascript very well | Time consuming to inspect and develop the spider to simulate ajax/pjax requests
Data Size | Good option for small data sets | Works fine on big data sets
Extensibility | Not very easy to extend the project | Easy to develop custom middleware or pipelines to add custom functions; easy to maintain
Ecosystem | Not many related projects or plugins | Many related projects and plugins on open source websites such as GitHub, and many discussions on StackOverflow can help you fix potential issues
In short, if the job is a very simple project, then Selenium can be your choice. If you want a more powerful and flexible web crawler, or you indeed have some experience in programming, then Scrapy is definitely the winner here. What is more, if you want your Scrapy spider to understand javascript, just try the methods mentioned above.
If you are also interested in BeautifulSoup, a great web scraping framework in the Python world, you can take a look at Scrapy VS Beautiful Soup.
Resources
Scrapy Doc
selenium-python

Frequently Asked Questions about python scrapy selenium

Can Selenium be used with Scrapy?

Combining Selenium with Scrapy is a simple process. All that needs to be done is let Selenium render the webpage, and once it is done, pass the webpage's source to create a Scrapy Selector object. From here on, Scrapy can crawl the page with ease and effectively extract a large amount of data.
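In code, that hand-off is just a couple of lines (a sketch; driver is an already-running Selenium webdriver that has loaded the page):

from scrapy.selector import Selector

sel = Selector(text=driver.page_source)
# From here, use sel.css(...) / sel.xpath(...) as with any Scrapy response.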

Which is better, Selenium or Scrapy?

In short, if the job is a very simple project, then Selenium can be your choice. If you want a more powerful and flexible web crawler, or you indeed have some experience in programming, then Scrapy is definitely the winner here.

How do you use Scrapy in Python?

While working with Scrapy, one needs to create a scrapy project. In Scrapy, always try to create one spider which helps to fetch data; to create one, move to the spider folder and create one python file there, for example gfgfetch.py.
