Python Scrapy Proxy

How To Set Up A Custom Proxy In Scrapy? – Zyte

When scraping the web at a reasonable scale, you can come across a series of problems and challenges. You may want to access a website from a specific country or region, or you may want to work around anti-bot solutions. Whatever the case, to overcome these obstacles you need to use and manage proxies. In this article, I’m going to cover how to set up a custom proxy inside your Scrapy spider in an easy and straightforward way. We’re also going to discuss the best ways to solve your current and future proxy issues. You will learn how to do it yourself, but you can also just use Zyte Proxy Manager (formerly Crawlera) to take care of your proxies.
Why do you need smart proxies for your web scraping projects?
If you are extracting data from the web at scale, you’ve probably already figured out the answer: IP banning. The website you are targeting might not like that you are extracting data, even though what you are doing is totally ethical and legal. When your scraper is banned, it can really hurt your business because the incoming data flow that you were so used to is suddenly missing. Also, sometimes websites display different information based on country or region. To solve these problems, we use proxies to make successful requests and access the public data we need.
Setting up proxies in Scrapy
Setting up a proxy inside Scrapy is easy. There are two ways to do it: passing proxy info as a request meta parameter, or implementing a custom proxy middleware.
Option 1: Via request parameters
Normally when you send a request in Scrapy you just pass the URL you are targeting and maybe a callback function. If you want to use a specific proxy for that URL you can pass it as a meta parameter, like this:
def start_requests(self):
    for url in self.start_urls:
        yield Request(
            url=url,
            callback=self.parse,
            headers={"User-Agent": "My UserAgent"},
            # placeholder proxy URL; replace with your own
            meta={"proxy": "http://your.proxy.address:8080"},
        )
The way it works is that inside Scrapy, there’s a middleware called HttpProxyMiddleware which takes the proxy meta parameter from the request object and sets it as the proxy for that request. The middleware is enabled by default, so there is no need to set it up.
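For context, here is how that fragment might sit inside a complete spider. This is a minimal sketch: the spider name, target URL, and proxy address are all placeholders, not values from the original article.

import scrapy

class ProxyExampleSpider(scrapy.Spider):
    name = "proxy_example"  # hypothetical spider name
    start_urls = ["https://example.com"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                headers={"User-Agent": "My UserAgent"},
                # placeholder proxy; replace with your own proxy URL
                meta={"proxy": "http://your.proxy.address:8080"},
            )

    def parse(self, response):
        # The request above was routed through the proxy set in meta
        self.logger.info("Fetched %s", response.url)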
Option 2: Create custom middleware
Another way to utilize proxies while scraping is to actually create your own middleware. This way the solution is more modular and isolated. Essentially, what we need to do is the same thing as when passing the proxy as a meta parameter:
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # placeholder proxy URL and credentials; replace with your own
        request.meta["proxy"] = "http://your.proxy.address:8080"
        request.headers["Proxy-Authorization"] = basic_auth_header("<username>", "<password>")
In the code above, we define the proxy URL and the necessary authentication info. Make sure that you also enable this middleware in the settings and put it before the HttpProxyMiddleware:
DOWNLOADER_MIDDLEWARES = {
    # adjust 'myproject.middlewares' to your project's module path
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
How to verify if your proxy is working?
To verify that you are indeed scraping through your proxy, you can scrape a test site that echoes back your IP address and location. If it shows the proxy’s address and not your computer’s actual IP, the proxy is working correctly.
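If you want a quick self-check, here is a minimal sketch. It assumes httpbin.org/ip as the echo endpoint (any service that returns the caller’s IP works), and the proxy address is a placeholder:

import scrapy

class IpCheckSpider(scrapy.Spider):
    name = "ip_check"  # hypothetical spider name

    def start_requests(self):
        # httpbin.org/ip echoes back the IP the request came from
        yield scrapy.Request(
            url="https://httpbin.org/ip",
            # placeholder proxy; replace with your own
            meta={"proxy": "http://your.proxy.address:8080"},
        )

    def parse(self, response):
        # Should print the proxy's IP, not your machine's
        self.logger.info(response.text)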
Rotating proxies
Now that you know how to set up Scrapy to use a proxy, you might think that you are done. Your IP banning problems are solved forever. Unfortunately not! What if the proxy we just set up gets banned as well? What if you need multiple proxies for multiple pages? Don’t worry, there is a solution: IP rotation, and it is key to successful scraping projects.
When you rotate a pool of IP addresses, what you’re doing is essentially randomly picking one address to make the request in your scraper. If it succeeds (i.e. it returns the proper HTML page), we can extract data and we’re happy. If it fails for some reason (IP ban, timeout error, etc.), we can’t extract the data, so we need to pick another IP address from the pool and try again. Obviously, this can be a nightmare to manage manually, so we recommend using an automated solution for this.
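To make that pick-and-retry logic concrete, here is a deliberately naive sketch in plain Python, using the requests library and hypothetical proxy URLs. The Scrapy middleware discussed below does this, and more, for you:

import random

import requests

PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

def fetch_with_rotation(url, max_attempts=3):
    """Pick a random proxy; if the request fails, retry with another one."""
    candidates = list(PROXY_POOL)
    for _ in range(max_attempts):
        if not candidates:
            break
        proxy = random.choice(candidates)
        try:
            resp = requests.get(
                url, proxies={"http": proxy, "https": proxy}, timeout=10
            )
            if resp.status_code == 200:
                return resp.text  # success: proper page returned
        except requests.RequestException:
            pass  # timeout, connection refused, etc.
        candidates.remove(proxy)  # drop the failing proxy for this URL
    raise RuntimeError(f"all proxy attempts failed for {url}")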
IP rotation in Scrapy
If you want to implement IP rotation for your Scrapy spider, you can install the scrapy-rotating-proxies middleware, which was created for exactly this purpose.
It takes care of the rotation itself, adjusts crawling speed, and makes sure you only use proxies that are actually alive.
After installing and enabling the middleware, you just have to add the proxies you want to use as a list to settings.py:
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
Also, you can customize things like the ban detection method, the number of page retries with different proxies, and so on.
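As a rough illustration, those knobs are plain Scrapy settings; the values and the ban-policy path below are hypothetical:

# settings.py (illustrative values only)
ROTATING_PROXY_PAGE_RETRY_TIMES = 10  # try up to 10 different proxies per page
ROTATING_PROXY_BACKOFF_BASE = 300  # base delay (seconds) before re-checking a dead proxy
ROTATING_PROXY_BAN_POLICY = "myproject.policy.MyBanPolicy"  # custom ban detection, hypothetical path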
Conclusion
So now you know how to set up a proxy in your Scrapy project and how to manage simple IP rotation. But if you are scaling up your scraping projects, you will quickly find yourself drowning in proxy-related issues: you will lose data quality and ultimately waste a lot of time and resources dealing with proxy problems.
For example, you will find yourself dealing with unreliable proxies and poor success rates, and really just getting bogged down in the minor technical details that stop you from focusing on what really matters: making use of the data. How can we end this never-ending struggle? By using an already available solution that handles all of these headaches for you.
This is exactly why we created Zyte Proxy Manager (formerly Crawlera). Zyte Proxy Manager enables you to reliably crawl at scale, managing thousands of proxies internally, so you don’t have to. You never need to worry about rotating or swapping proxies again. Here’s how you can use Zyte Proxy Manager with Scrapy.
If you’re tired of troubleshooting proxy issues and would like to give Zyte Proxy Manager a try, then sign up today. It has a 14-day FREE trial!
Useful links
Scrapy proxy rotating middleware
Discussion on Github about Socks5 proxies and scrapy
Scrapy and proxies - Stack Overflow

Scrapy and proxies – Stack Overflow

How do you utilize proxy support with the python web-scraping framework Scrapy?
asked Jan 17 ’11 at 6:17
From the Scrapy FAQ,
Does Scrapy work with HTTP proxies?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware.
The easiest way to use a proxy is to set the environment variable http_proxy. How this is done depends on your shell.
C:\>set http_proxy=http://proxy:port
csh% setenv http_proxy http://proxy:port
sh$ export http_proxy=http://proxy:port
If you want to proxy HTTPS traffic as well, set the https_proxy environment variable in the same way.
answered Jan 17 ’11 at 6:29 by ephemient
Single Proxy
Enable HttpProxyMiddleware in your settings.py, like this:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1
}
Then pass the proxy to the request via request.meta:
request = Request(url="http://example.com")
request.meta['proxy'] = "host:port"
yield request
You can also choose a proxy address randomly if you have an address pool, like this:
Multiple Proxies
import random

from scrapy import Request, Spider

class MySpider(Spider):
    name = "my_spider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.proxy_pool = ['proxy_address1', 'proxy_address2']  # ... proxy_addressN

    def parse(self, response):
        # ... parsing code ...
        if something:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url)
        if self.proxy_pool:
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req
answered Dec 16 ’13 at 10:25 by Amom
1. Create a new file called middlewares.py, save it in your Scrapy project, and add the following code to it.
import base64

class ProxyMiddleware(object):
    # overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"

        # Set up basic authentication for the proxy
        # (base64.encodestring from the original answer is gone in Python 3)
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
2. Open your project’s configuration file (./project_name/settings.py) and add the following code:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
Now your requests should be routed through this proxy. Simple, isn’t it?
answered Apr 18 ’15 at 10:46
For a proxy that requires authentication, that would be:
export http_proxy=http://user:password@proxy:port
answered Jan 18 ’13 at 14:58
As I’ve had trouble setting the environment variable in /etc/environment, here is what I’ve put in my spider (Python):
import os
os.environ["http_proxy"] = "http://localhost:12345"
answered Nov 18 ’15 at 7:58
In Windows I put together a couple of previous answers and it worked. I simply did:
C:\>set http_proxy=http://username:password@proxy:port
and then I launched my program:
C:\…\RightFolder>scrapy crawl dmoz
where “dmoz” is the spider name (I’m writing it because it’s the one you find in the tutorial on the internet, and if you’re here you have probably started from the tutorial).
answered Oct 27 ’15 at 13:20
I would recommend using a middleware such as scrapy-proxies. You can rotate proxies, filter out bad proxies, or use a single proxy for all your requests. Also, using a middleware will save you the trouble of setting up a proxy on every run.
This is directly from the GitHub README.
Install the scrapy-proxies library
pip install scrapy_proxies
In your settings.py, add the following settings
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Proxy list containing entries like
# host1:port
# username:password@host2:port
# host3:port
#…
PROXY_LIST = '/path/to/proxy/list.txt'
# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
# If proxy mode is 2 uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
Here you can change the retry times and set a single or rotating proxy.
Then add your proxies to the file referenced by PROXY_LIST, like this:
host1:port
username:password@host2:port
host3:port
After this, all requests for that project will be sent through the proxy. The proxy is rotated randomly for every request; it does not affect concurrency.
Note: if you do not want to use a proxy, you can simply comment out the scrapy_proxies middleware line:
# 'scrapy_proxies.RandomProxy': 100,
Happy crawling!!!
answered Aug 13 ’19 at 10:26 by Amit
Here is what I do
Method 1:
Create a downloader middleware like this:
class ProxiesDownloaderMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://user:pass@host:port'
and enable it in settings.py:
DOWNLOADER_MIDDLEWARES = {
    # adjust the module path to your project
    'myproject.middlewares.ProxiesDownloaderMiddleware': 600,
}
That is it – now the proxy will be applied to every request.
Method 2:
Just enable HttpProxyMiddleware in settings.py and then do this for each request:
yield Request(url=…, meta={'proxy': 'http://user:pass@host:port'})
answered Mar 9 at 7:00 by Umair Ayub
scrapy-rotating-proxies – PyPI

Project description
scrapy-rotating-proxies
This package provides a Scrapy middleware to use rotating proxies,
check that they are alive and adjust crawling speed.
License is MIT.
Installation
pip install scrapy-rotating-proxies
Usage
Add a ROTATING_PROXY_LIST option with a list of proxies to settings.py:
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
    # ...
]
As an alternative, you can specify a ROTATING_PROXY_LIST_PATH option
with a path to a file with proxies, one per line:
ROTATING_PROXY_LIST_PATH = '/my/path/proxies.txt'
ROTATING_PROXY_LIST_PATH takes precedence over ROTATING_PROXY_LIST
if both options are present.
Then add rotating_proxies middlewares to your DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    # ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    # ...
}
After this all requests will be proxied using one of the proxies from
the ROTATING_PROXY_LIST / ROTATING_PROXY_LIST_PATH.
Requests with "proxy" set in their meta are not handled by
scrapy-rotating-proxies. To disable proxying for a request, set
request.meta['proxy'] = None; to set a proxy explicitly, use
request.meta['proxy'] = "<my proxy address>".
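For example, inside a spider callback (the URLs and the proxy address below are placeholders):

# Bypass rotation entirely for this request: no proxy is used
yield scrapy.Request("https://example.com/no-proxy", meta={"proxy": None})

# Pin this request to one specific proxy instead of the rotating pool
yield scrapy.Request(
    "https://example.com/pinned",
    meta={"proxy": "http://your.proxy.address:8080"},
)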
Concurrency
By default, all default Scrapy concurrency options (DOWNLOAD_DELAY,
AUTOTHROTTLE_…, CONCURRENT_REQUESTS_PER_DOMAIN, etc.) become
per-proxy for proxied requests when RotatingProxyMiddleware is enabled.
For example, if you set CONCURRENT_REQUESTS_PER_DOMAIN=2 then
spider will be making at most 2 concurrent connections to each proxy,
regardless of request url domain.
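In settings.py terms, that example looks like this (a sketch; the values are illustrative):

# settings.py
CONCURRENT_REQUESTS = 16
# With RotatingProxyMiddleware enabled this limit applies per proxy,
# so each proxy gets at most 2 concurrent connections
CONCURRENT_REQUESTS_PER_DOMAIN = 2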
Customization
scrapy-rotating-proxies keeps track of working and non-working proxies,
and re-checks non-working from time to time.
Detection of a non-working proxy is site-specific.
By default, scrapy-rotating-proxies uses a simple heuristic:
if the response status code is not 200, the response body is empty, or if
there was an exception, then the proxy is considered dead.
You can override the ban detection method by passing a path to
a custom BanDetectionPolicy in the ROTATING_PROXY_BAN_POLICY option, e.g.:
# settings.py
ROTATING_PROXY_BAN_POLICY = 'myproject.policy.MyBanPolicy'
The policy must be a class with response_is_ban
and exception_is_ban methods. These methods can return True
(ban detected), False (not a ban) or None (unknown). It can be convenient
to subclass and modify the default BanDetectionPolicy:
# myproject/policy.py
from rotating_proxies.policy import BanDetectionPolicy

class MyPolicy(BanDetectionPolicy):
    def response_is_ban(self, request, response):
        # use default rules, but also consider HTTP 200 responses
        # a ban if there is a 'captcha' word in the response body.
        ban = super(MyPolicy, self).response_is_ban(request, response)
        ban = ban or b'captcha' in response.body
        return ban

    def exception_is_ban(self, request, exception):
        # override the method completely: don't take exceptions into account
        return None
Instead of creating a policy you can also implement response_is_ban
and exception_is_ban methods as spider methods, for example:
class MySpider(scrapy.Spider):
    # ...

    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None
It is important to get these rules right, because the action for a failed
request and for a bad proxy should be different: if it is the proxy to blame,
it makes sense to retry the request with a different proxy.
Non-working proxies could become alive again after some time.
scrapy-rotating-proxies uses a randomized exponential backoff for these
checks – first check happens soon, if it still fails then next check is
delayed further, etc. Use ROTATING_PROXY_BACKOFF_BASE to adjust the
initial delay (by default it is random, from 0 to 5 minutes). The randomized
exponential backoff is capped by ROTATING_PROXY_BACKOFF_CAP.
Settings
ROTATING_PROXY_LIST – a list of proxies to choose from;
ROTATING_PROXY_LIST_PATH – path to a file with a list of proxies;
ROTATING_PROXY_LOGSTATS_INTERVAL – stats logging interval in seconds,
30 by default;
ROTATING_PROXY_CLOSE_SPIDER – when True, the spider is stopped if
there are no alive proxies. If False (default), then when there are no
alive proxies all dead proxies are re-checked.
ROTATING_PROXY_PAGE_RETRY_TIMES – the number of times to retry
downloading a page using a different proxy. After this many retries
the failure is considered a page failure, not a proxy failure.
Think of it this way: every improperly detected ban costs you
ROTATING_PROXY_PAGE_RETRY_TIMES alive proxies. Default: 5.
It is possible to change this option per-request using the
max_proxies_to_try meta key – for example, you can use a higher
value for certain pages if you’re sure they should work (see the
example after this list).
ROTATING_PROXY_BACKOFF_BASE – base backoff time, in seconds.
Default is 300 (i.e. 5 min).
ROTATING_PROXY_BACKOFF_CAP – backoff time cap, in seconds.
Default is 3600 (i.e. 60 min).
ROTATING_PROXY_BAN_POLICY – path to a ban detection policy.
Default is 'rotating_proxies.policy.BanDetectionPolicy'.
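For instance, a per-request max_proxies_to_try override might look like this (the URL is a placeholder):

# Allow more proxy retries for a page you are confident should work
yield scrapy.Request(
    "https://example.com/important-page",
    meta={"max_proxies_to_try": 10},
)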
FAQ
Q: Where to get proxy lists? How to write and maintain ban rules?
A: It is up to you to find proxies and maintain proper ban rules
for web sites; scrapy-rotating-proxies doesn’t have anything built-in.
There are commercial proxy services like Crawlera which can
integrate with Scrapy (see the scrapy-crawlera plugin)
and take care of all these details.
CHANGES
0.6.2 (2019-05-25)
mean_backoff_time stats are always returned as float, to make
saving stats in databases easier.
0.6.1 (2019-04-03)
Fixed incorrect "proxies/good" stats values.
0.6 (2018-12-28)
Proxy information is added to scrapy stats:
proxies/unchecked
proxies/reanimated
proxies/dead
proxies/good
proxies/mean_backoff
0.5 (2017-10-09)
ROTATING_PROXY_LIST_PATH option allows to pass a file name
with a proxy list.
0.4 (2017-06-06)
ROTATING_PROXY_BACKOFF_CAP option allows to change max backoff time
from the default 1 hour.
0.3.2 (2017-06-05)
Fixed proxy authentication issue.
0.3.1 (2017-03-20)
Fixed OverflowError during backoff computation.
0.3 (2017-03-14)
Redirects with empty bodies are no longer considered bans
(thanks Diga Widyaprana).
ROTATING_PROXY_BAN_POLICY option allows to customize ban detection
for all spiders.
0.2.3 (2017-03-03)
max_proxies_to_try request.meta key allows to override the
ROTATING_PROXY_PAGE_RETRY_TIMES option per-request.
0.2.2 (2017-03-01)
Updated default ban detection rules: scrapy.exceptions.IgnoreRequest
is not a ban.
0.2.1 (2017-02-08)
Changed ROTATING_PROXY_PAGE_RETRY_TIMES default value – it is now 5.
0.2 (2017-02-07)
Improved default ban detection rules;
log ban stats.
0.1 (2017-02-01)
Initial release
