Scrapy HTTPS Proxy

How to use http and https proxy together in scrapy? – Stack Overflow

I am new in scrapy. I found the following code for using an HTTP proxy, but I want to use an HTTP and an HTTPS proxy together, because when I crawl there are both http and https links. How do I use an HTTPS proxy as well as an HTTP proxy?
import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        # like here: request.meta['proxy'] = "https://YOUR_PROXY_IP:PORT"
        proxy_user_pass = "USERNAME:PASSWORD"
        # set up basic authentication for the proxy
        encoded_user_pass = base64.b64encode(proxy_user_pass.encode()).decode()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
asked Jul 9 ’15 at 9:40
You could use the standard environment variables in combination with the HttpProxyMiddleware:
This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.
Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
You can also set the meta key proxy per-request, to a value like http://some_proxy_server:port.
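For illustration, setting those variables before the crawl starts could look like the sketch below. The proxy addresses are placeholders, and they must be in place before Scrapy starts because HttpProxyMiddleware reads them only once, at start-up.

import os

# Placeholder proxy endpoints; substitute your own.
os.environ["http_proxy"] = "http://proxy.example.com:3128"
os.environ["https_proxy"] = "http://proxy.example.com:3129"
os.environ["no_proxy"] = "localhost,127.0.0.1"

# Equivalent shell form, set before running `scrapy crawl`:
#   export http_proxy="http://proxy.example.com:3128"
#   export https_proxy="http://proxy.example.com:3129"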
answered Jul 9 ’15 at 10:55
GHajba
How To Set Up A Custom Proxy In Scrapy? – Zyte

When scraping the web at a reasonable scale, you can come across a series of problems and challenges. You may want to access a website from a specific country/region. Or maybe you want to work around anti-bot solutions. Whatever the case, to overcome these obstacles you need to use and manage proxies. In this article, I’m going to cover how to set up a custom proxy inside your Scrapy spider in an easy and straightforward way. We’ll also discuss the best ways to solve your current and future proxy issues. You will learn how to do it yourself, but you can also just use Zyte Proxy Manager (formerly Crawlera) to take care of your proxies.
Why do you need smart proxies for your web scraping projects?
If you are extracting data from the web at scale, you’ve probably already figured out the answer. IP banning. The website you are targeting might not like that you are extracting data even though what you are doing is totally ethical and legal. When your scraper is banned, it can really hurt your business because the incoming data flow that you were so used to is suddenly missing. Also, sometimes websites have different information displayed based on country or region. To solve these problems we use proxies for successful requests to access the public data we need.
Setting up proxies in Scrapy
Setting up a proxy inside Scrapy is easy. There are two easy ways to use proxies with Scrapy – passing proxy info as a request parameter or implementing a custom proxy middleware.
Option 1: Via request parameters
Normally when you send a request in Scrapy you just pass the URL you are targeting and maybe a callback function. If you want to use a specific proxy for that URL you can pass it as a meta parameter, like this:
from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        yield Request(url=url, callback=self.parse,
                      headers={"User-Agent": "My UserAgent"},
                      meta={"proxy": "http://proxy.example.com:8011"})  # placeholder proxy
The way it works is that inside Scrapy, there’s a middleware called HttpProxyMiddleware which takes the proxy meta parameter from the request object and sets it up correctly as the used proxy. The middleware is enabled by default so there is no need to set it up.
Option 2: Create custom middleware
Another way to utilize proxies while scraping is to actually create your own middleware. This way the solution is more modular and isolated. Essentially, what we need to do is the same thing as when passing the proxy as a meta parameter:
from w3lib.http import basic_auth_header

class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://proxy.example.com:8011"  # placeholder proxy
        request.headers["Proxy-Authorization"] = basic_auth_header(
            "proxy_username", "proxy_password")  # placeholder credentials
In the code above, we define the proxy URL and the necessary authentication info. Make sure that you also enable this middleware in the settings and put it before the HttpProxyMiddleware:
DOWNLOADER_MIDDLEWARES = {
    # adjust the first path to wherever CustomProxyMiddleware lives in your project
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
How can you verify that your proxy is working?
To verify that you are indeed scraping through your proxy, you can scrape a test site that tells you your IP address and location (like this one). If it shows the proxy address and not your computer’s actual IP, the proxy is working correctly.
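If you want to script that check, a throwaway spider along these lines can do it. This is only a sketch: httpbin.org/ip stands in for the IP-echo site, and the proxy URL is a placeholder.

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    # Asks an IP-echo service which address it sees for our request.
    name = "proxy_check"

    def start_requests(self):
        yield scrapy.Request(
            "https://httpbin.org/ip",
            meta={"proxy": "http://proxy.example.com:8011"},  # placeholder proxy
            callback=self.parse,
        )

    def parse(self, response):
        # Should log the proxy's IP, not your machine's.
        self.logger.info("IP seen by the target site: %s", response.text)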
Rotating proxies
Now that you know how to set up Scrapy to use a proxy, you might think that you are done. Your IP banning problems are solved forever. Unfortunately not! What if the proxy we just set up gets banned as well? What if you need multiple proxies for multiple pages? Don’t worry, there is a solution called IP rotation, and it is key to successful scraping projects.
When you rotate a pool of IP addresses, what you’re doing is essentially randomly picking one address to make the request in your scraper. If it succeeds, i.e. it returns the proper HTML page, we can extract data and we’re happy. If it fails for some reason (IP ban, timeout error, etc.), we can’t extract the data, so we need to pick another IP address from the pool and try again. Obviously, this can be a nightmare to manage manually, so we recommend using an automated solution for this.
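To make that retry loop concrete, here is a rough, framework-agnostic sketch in plain Python using the requests library. The proxy addresses are placeholders and the logic is illustrative only; the Scrapy-specific solution follows in the next section.

import random
import requests

PROXY_POOL = [
    "http://proxy1.example.com:8000",  # placeholder proxies
    "http://proxy2.example.com:8000",
]

def fetch_with_rotation(url, max_attempts=3):
    proxies_left = list(PROXY_POOL)
    for _ in range(max_attempts):
        if not proxies_left:
            break
        proxy = random.choice(proxies_left)
        try:
            resp = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            if resp.ok:
                return resp.text  # proper page returned, extract data from it
        except requests.RequestException:
            pass  # ban, timeout, connection error, ...
        proxies_left.remove(proxy)  # drop the failing proxy and try another
    return None  # every attempt failed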
IP rotation in Scrapy
If you want to implement IP rotation for your Scrapy spider you can install the scrapy-rotating-proxies middleware which has been created just for this.
It will take care of the rotating itself, adjusting crawling speed, and making sure that we’re using proxies that are actually alive.
After installing and enabling the middleware, you just have to add the proxies you want to use as a list to the ROTATING_PROXY_LIST setting in settings.py:
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8031',  # placeholder proxy addresses
    'proxy2.example.com:8032',
    # ...
]
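The proxy list alone is not enough; the middleware components also have to be enabled in DOWNLOADER_MIDDLEWARES. The module paths and order values below follow the scrapy-rotating-proxies README, so double-check them against the version you install:

DOWNLOADER_MIDDLEWARES = {
    # ... your existing middlewares ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}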
Also, you can customize things like ban detection methods, page retries with different proxies, etc…
Conclusion
So now you know how to set up a proxy in your Scrapy project and how to manage simple IP rotation. But if you are scaling up your scraping projects, you will quickly find yourself drowning in proxy-related issues. You will lose data quality and ultimately waste a lot of time and resources dealing with proxy problems.
For example, you will find yourself dealing with unreliable proxies, getting a poor success rate and poor data quality, and generally getting bogged down in minor technical details that stop you from focusing on what really matters: making use of the data. How can you end this never-ending struggle? By using an already available solution that handles all of these headaches for you.
This is exactly why we created Zyte Proxy Manager (formerly Crawlera). Zyte Proxy Manager enables you to reliably crawl at scale, managing thousands of proxies internally, so you don’t have to. You never need to worry about rotating or swapping proxies again. Here’s how you can use Zyte Proxy Manager with Scrapy.
If you’re tired of troubleshooting proxy issues and would like to give Zyte Proxy Manager a try, then sign up today. It has a 14-day FREE trial!
Useful links
Scrapy proxy rotating middleware
Discussion on Github about Socks5 proxies and scrapy
Source code for scrapy.downloadermiddlewares.httpproxy

Source code for scrapy.downloadermiddlewares.httpproxy

import base64
from urllib.parse import unquote, urlunparse
from urllib.request import getproxies, proxy_bypass, _parse_proxy

from scrapy.exceptions import NotConfigured
from scrapy.utils.httpobj import urlparse_cached
from scrapy.utils.python import to_bytes


class HttpProxyMiddleware:

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type_, url in getproxies().items():
            try:
                self.proxies[type_] = self._get_proxy(url, type_)
            # some values such as '/var/run/docker.sock' can't be parsed
            # by _parse_proxy and as such should be skipped
            except ValueError:
                continue

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('HTTPPROXY_ENABLED'):
            raise NotConfigured
        auth_encoding = crawler.settings.get('HTTPPROXY_AUTH_ENCODING')
        return cls(auth_encoding)

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            f'{unquote(username)}:{unquote(password)}',
            encoding=self.auth_encoding)
        return base64.b64encode(user_pass)

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
