Scraping Python Selenium

Web Scraping using Selenium and Python – ScrapingBee


Updated:
08 July, 2021
9 min read
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.
In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems.
Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).
Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!
Selenium is useful when you have to perform an action on a website such as:
Clicking on buttons
Filling forms
Scrolling
Taking a screenshot
It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application and you haven't found an easy way to directly call the underlying APIs. In this case, Selenium might be what you need.
Installation
We will use Chrome in our example, so make sure you have it installed on your local machine:
Chrome download page
Chrome driver binary
selenium package
To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
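pip install selenium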
Quickstart
Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
from selenium import webdriver

DRIVER_PATH = '/path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.example.com')  # placeholder URL for this example
This will launch Chrome in headful mode (like regular Chrome, but controlled by your Python code).
You should see a message stating that the browser is controlled by automated software.
To run Chrome in headless mode (without any graphical user interface), as you typically would on a server, see the following example:
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.example.com")  # placeholder URL for this example
print(driver.page_source)
driver.quit()
The page_source property will return the full page HTML code.
Here are two other interesting WebDriver properties:
driver.title gets the page's title
driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
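A quick sketch of both (the URL is just a placeholder):
driver.get('https://www.example.com')  # placeholder page for illustration
print(driver.title)        # the page's title
print(driver.current_url)  # the final URL, after any redirects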
Locating Elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
Tag name
Class name
IDs
XPath
CSS selectors
We recently published an article explaining XPath. Don’t hesitate to take a look if you aren’t familiar with XPath.
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need.
A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C or on macOS Cmd + Shift + C instead of having to right click + inspect each time:
find_element
There are many ways to locate an element in selenium.
Let’s say that we want to locate the h1 tag in this HTML:

<html>
    <head>
        ... some stuff
    </head>
    <body>
        <h1 class="someclass" id="greatID">Super title</h1>
    </body>
</html>
h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')
All these methods also have find_elements (note the plural) to return a list of elements.
For example, to get all anchors on a page, use the following:
all_links = driver.find_elements_by_tag_name('a')
Some elements aren’t easily accessible with an ID or a simple class, and that’s when you need an XPath expression. You also might have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements, here are the most useful:
Accessing the text of the element with element.text
Clicking on the element with element.click()
Accessing an attribute with element.get_attribute('class')
Sending text to an input with element.send_keys('mypassword')
There are some other interesting methods like is_displayed(). This returns True if an element is visible to the user.
This can help you avoid honeypots (like hidden inputs that only a bot would fill).
Honeypots are mechanisms used by website owners to detect bots. For example, if an HTML input has the attribute type=hidden like this:
<input type="hidden" id="custId" name="custId" value="">
This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random values, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.
That’s a classic honeypot.
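A minimal sketch of how is_displayed() helps here, assuming a generic form (the value sent is illustrative):
for field in driver.find_elements_by_tag_name('input'):
    if field.is_displayed():           # skip hidden honeypot inputs
        field.send_keys('some value')  # only fill what a real user could see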
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
Go to the login page using driver.get()
Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
Follow the same process with the password input
Click on the login button using element.click()
Should be easy right? Let’s see the code:
driver.get("https://news.ycombinator.com/login")
login = driver.find_element_by_xpath("//input").send_keys(USERNAME)
password = driver.find_element_by_xpath("//input[@type='password']").send_keys(PASSWORD)
submit = driver.find_element_by_xpath("//input[@value='login']").click()
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
Check for an error message (like “Wrong password”)
Check for one element on the page that is only displayed once logged in.
So, we’re going to check for the logout button. The logout button has the ID “logout” (easy)!
We can't just check if the element is None, because all of the find_element_by_* methods raise an exception if the element is not found in the DOM.
So we have to use a try/except block and catch the NoSuchElementException exception:
# don't forget: from selenium.common.exceptions import NoSuchElementException
try:
    logout_button = driver.find_element_by_id("logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')
We could easily take a screenshot using:
driver.save_screenshot('screenshot.png')
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly.
Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
In our Hacker News case it’s simple and we don’t have to worry about these issues.
If you need to make screenshots at scale, feel free to try our new Screenshot API here.
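If you prefer to handle it yourself with Selenium, a minimal sketch (the window size and file name are only illustrative):
import time

driver.set_window_size(1920, 1200)   # make sure the viewport is the size you expect
time.sleep(2)                        # crude wait; a WebDriverWait is usually better (see the next section)
driver.save_screenshot('hn_homepage.png')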
Waiting for an element to be present
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular and React for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
Use time.sleep(ARBITRARY_TIME) before taking the screenshot.
Use a WebDriverWait object.
If you use time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting too long or not long enough.
Also the website can load slowly on your local wifi internet connection, but will be 10 times faster on your cloud server.
With the WebDriverWait method you will wait the exact amount of time necessary for your element/data to be loaded.
# requires WebDriverWait, expected_conditions as EC, and By (see the imports in the sketch below)
try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "mySuperId")))
finally:
    driver.quit()
This will wait five seconds for an element located by the ID “mySuperId” to be loaded.
There are many other interesting expected conditions like:
element_to_be_clickable
text_to_be_present_in_element
You can find more information about this in the Selenium documentation.
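For instance, to wait until the logout button from the earlier example is actually clickable (a small sketch; the imports are the ones the waiting code above relies on):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

logout_button = WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.ID, "logout")))
logout_button.click()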
Executing Javascript
Sometimes, you may need to execute some Javascript on the page. For example, let’s say you want to take a screenshot of some information, but you first need to scroll a bit to see it.
You can easily do this with Selenium:
javaScript = "window.scrollBy(0, 1000);"
driver.execute_script(javaScript)
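Note that execute_script() can also return a value to your Python code, which is handy when the page has already computed something you need. A quick sketch:
scroll_height = driver.execute_script("return document.body.scrollHeight;")
print(scroll_height)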
Using a proxy with Selenium Wire
Unfortunately, Selenium's proxy handling is quite basic. For example, it can't handle proxies with authentication out of the box.
To solve this issue, you need to use Selenium Wire.
This package extends Selenium’s bindings and gives you access to all the underlying requests made by the browser.
If you need to use Selenium with a proxy with authentication this is the package you need.
pip install selenium-wire
This code snippet shows you how to quickly use your headless browser behind a proxy.
# Install the Python selenium-wire library:
# pip install selenium-wire
from seleniumwire import webdriver
proxy_username = "USER_NAME"
proxy_password = "PASSWORD"
proxy_url = "YOUR_PROXY_HOST"  # placeholder proxy host
proxy_port = 8886

options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}",
        "https": f"https://{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}",
        "verify_ssl": False,
    },
}

URL = "https://httpbin.org/headers"  # example target URL

driver = webdriver.Chrome(
    executable_path="YOUR-CHROME-EXECUTABLE-PATH",
    seleniumwire_options=options,
)
driver.get(URL)
Blocking images and JavaScript
With Selenium, by using the correct Chrome options, you can block some requests from being made.
This can be useful if you need to speed up your scrapers or reduce your bandwidth usage.
To do this, you need to launch Chrome with the below options:
chrome_options = webdriver.ChromeOptions()

### This blocks images and javascript requests
chrome_prefs = {
    "profile.default_content_setting_values": {
        "images": 2,
        "javascript": 2,
    }
}
chrome_options.experimental_options["prefs"] = chrome_prefs
###

driver = webdriver.Chrome(
    executable_path=DRIVER_PATH,
    chrome_options=chrome_options,
)
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don’t hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks, like filling forms or checking information behind a login form where the website doesn't have an API, it might be a good idea to automate them with Selenium. Just don't forget this xkcd:
Web Scraping Using Python Selenium | Toptal

Web scraping has been used to extract data from websites almost from the time the World Wide Web was born. In the early days, scraping was mainly done on static pages – those with known elements, tags, and data.
More recently, however, advanced technologies in web development have made the task a bit more difficult. In this article, we’ll explore how we might go about scraping data in the case that new technology and other factors prevent standard scraping.
Traditional Data Scraping
As most websites produce pages meant for human readability rather than automated reading, web scraping mainly consisted of programmatically digesting a web page’s mark-up data (think right-click, View Source), then detecting static patterns in that data that would allow the program to “read” various pieces of information and save it to a file or a database.
If report data were to be found, often the data would be accessible by passing either form variables or parameters with the URL.
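For instance, a hypothetical report URL might pass its parameters directly in the query string (the domain and parameter names here are only illustrative):
https://www.example.com/report?custId=1234&fromMonth=01&fromYear=2020&toMonth=06&toYear=2020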
Python has become one of the most popular web scraping languages due in part to the various web libraries that have been created for it. One popular library, Beautiful Soup, is designed to pull data out of HTML and XML files by allowing searching, navigating, and modifying tags (i.e., the parse tree).
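As a small sketch of that traditional workflow, fetching a page and walking its links with Beautiful Soup (the URL is a placeholder):
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.example.com").text
soup = BeautifulSoup(html, "html.parser")  # build the parse tree
for link in soup.find_all("a"):            # search it for anchor tags
    print(link.get("href"))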
Browser-based Scraping
Recently, I had a scraping project that seemed pretty straightforward and I was fully prepared to use traditional scraping to handle it. But as I got further into it, I found obstacles that could not be overcome with traditional methods.
Three main issues prevented me from my standard scraping methods:
Certificate. There was a certificate required to be installed to access the portion of the website where the data was. When accessing the initial page, a prompt appeared asking me to select the proper certificate of those installed on my computer, and click OK.
Iframes. The site used iframes, which messed up my normal scraping. Yes, I could try to find all iframe URLs, then build a sitemap, but that seemed like it could get unwieldy.
JavaScript. The data was accessed after filling in a form with parameters (e.g., customer ID, date range, etc.). Normally, I would bypass the form and simply pass the form variables (via URL or as hidden form variables) to the result page and see the results. But in this case, the form contained JavaScript, which didn't allow me to access the form variables in a normal fashion.
So, I decided to abandon my traditional methods and look at a possible tool for browser-based scraping. This would work differently than normal – instead of going directly to a page, downloading the parse tree, and pulling out data elements, I would instead “act like a human” and use a browser to get to the page I needed, then scrape the data – thus, bypassing the need to deal with the barriers mentioned.
Selenium
In general, Selenium is well-known as an open-source testing framework for web applications – enabling QA specialists to perform automated tests, execute playbacks, and implement remote control functionality (allowing many browser instances for load testing and multiple browser types). In my case, this seemed like it could be useful.
My go-to language for web scraping is Python, as it has well-integrated libraries that can generally handle all of the functionality required. And sure enough, a Selenium library exists for Python. This would allow me to instantiate a “browser” – Chrome, Firefox, IE, etc. – then pretend I was using the browser myself to gain access to the data I was looking for. And if I didn’t want the browser to actually appear, I could create the browser in “headless” mode, making it invisible to any user.
Project Setup
To start experimenting, I needed to set up my project and get everything I needed. I used a Windows 10 machine and made sure I had a relatively updated Python version (it was v3.7.3). I created a blank Python script, then loaded the libraries I thought might be required, using PIP (package installer for Python) if I didn't already have the library loaded. These are the main libraries I started with:
Requests (for making HTTP requests)
URLLib3 (URL handling)
Beautiful Soup (in case Selenium couldn’t handle everything)
Selenium (for browser-based navigation)
I also added some calling parameters to the script (using the argparse library) so that I could play around with various datasets, calling the script from the command line with different options. Those included Customer ID, from-month/year, and to-month/year.
Problem 1 – The Certificate
The first choice I needed to make was which browser I was going to tell Selenium to use. As I generally use Chrome, and it’s built on the open-source Chromium project (also used by Edge, Opera, and Amazon Silk browsers), I figured I would try that first.
I was able to start up Chrome in the script by adding the library components I needed, then issuing a couple of simple commands:
# Load selenium components
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Establish chrome driver and go to report site URL
url = "https://www.example.com/report"  # placeholder for the report site URL
driver = webdriver.Chrome()
driver.get(url)
Since I didn’t launch the browser in headless mode, the browser actually appeared and I could see what it was doing. It immediately asked me to select a certificate (which I had installed earlier).
The first problem to tackle was the certificate. How to select the proper one and accept it in order to get into the website? In my first test of the script, I got this prompt:
This wasn’t good. I did not want to manually click the OK button each time I ran my script.
As it turns out, I was able to find a workaround for this – without programming. While I had hoped that Chrome had the ability to pass a certificate name on startup, that feature did not exist. However, Chrome does have the ability to autoselect a certificate if a certain entry exists in your Windows registry. You can set it to select the first certificate it sees, or else be more specific. Since I only had one certificate loaded, I used the generic format.
Thus, with that set, when I told Selenium to launch Chrome and a certificate prompt came up, Chrome would “AutoSelect” the certificate and continue on.
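For reference, here is a sketch of how that registry entry could be created from Python. It assumes Chrome's AutoSelectCertificateForUrls policy with the permissive match-anything pattern; both the value name and the pattern/filter JSON are assumptions here, and writing to HKLM requires administrator rights.
import winreg

# Assumed policy key and value format; adjust the pattern/filter for your certificate
key_path = r"SOFTWARE\Policies\Google\Chrome\AutoSelectCertificateForUrls"
policy = '{"pattern":"*","filter":{}}'

key = winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE, key_path)
winreg.SetValueEx(key, "1", 0, winreg.REG_SZ, policy)
winreg.CloseKey(key)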
Problem 2 – Iframes
Okay, so now I was in the site and a form appeared, prompting me to type in the customer ID and the date range of the report.
By examining the form in developer tools (F12), I noticed that the form was presented within an iframe. So, before I could start filling in the form, I needed to “switch” to the proper iframe where the form existed. To do this, I invoked Selenium’s switch-to feature, like so:
# Switch to iframe where form is
frame_ref = driver.find_elements_by_tag_name("iframe")[0]
iframe = driver.switch_to.frame(frame_ref)
Good, so now in the right frame, I was able to determine the components, populate the customer ID field, and select the date drop-downs:
# Find the Customer ID field and populate it
element = driver.find_element_by_name("custId")
element.send_keys(custId)  # send a test id
# Find and select the date drop-downs
select = Select(driver.find_element_by_name("fromMonth"))
select.select_by_visible_text(from_month)
select = Select(driver.find_element_by_name("fromYear"))
select.select_by_visible_text(from_year)
select = Select(driver.find_element_by_name("toMonth"))
select.select_by_visible_text(to_month)
select = Select(driver.find_element_by_name("toYear"))
select.select_by_visible_text(to_year)
Problem 3 – JavaScript
The only thing left on the form was to “click” the Find button, so it would begin the search. This was a little tricky as the Find button seemed to be controlled by JavaScript and wasn’t a normal “Submit” type button. Inspecting it in developer tools, I found the button image and was able to get the XPath of it, by right-clicking.
Then, armed with this information, I found the element on the page, then clicked it.
# Find the ‘Find’ button, then click it
driver.find_element_by_xpath("/html/body/table/tbody/tr[2]/td[1]/table[3]/tbody/tr[2]/td[2]/input").click()
And voilà, the form was submitted and the data appeared! Now, I could just scrape all of the data on the result page and save it as required. Or could I?
Getting the Data
First, I had to handle the case where the search found nothing. That was pretty straightforward. It would display a message on the search form without leaving it, something like "No records found." I simply searched for that string and stopped right there if I found it.
But if results did come, the data was presented in divs with a plus sign (+) to open a transaction and show all of its detail. An opened transaction showed a minus sign (-) which when clicked would close the div. Clicking a plus sign would call a URL to open its div and close any open one.
Thus, it was necessary to find any plus signs on the page, gather the URL next to each one, then loop through each to get all data for every transaction.
# Loop through transactions and count
links = driver.find_elements_by_tag_name('a')
link_urls = [link.get_attribute('href') for link in links]
thisCount = 0
isFirst = 1
for url in link_urls:
    if url.find("/retail/transaction/results/") >= 0:  # URL to link to transactions
        if isFirst == 1:  # already expanded +
            isFirst = 0
        else:
            driver.get(url)  # collapsed +, so expand
        # Find closest element to URL element with correct class to get tran type
        tran_type = driver.find_element_by_xpath("//*[contains(@href, '/retail/transaction/results/')]/following::td[@class='txt_75b_lmnw_T1R10B1']").text
        # Get transaction status
        status = driver.find_element_by_class_name('txt_70b_lmnw_t1r10b1').text
        # Add to count if transaction found
        if (tran_type in ['Move In', 'Move Out', 'Switch']) and (status == "Complete"):
            thisCount += 1
In the above code, the fields I retrieved were the transaction type and the status, then added to a count to determine how many transactions fit the rules that were specified. However, I could have retrieved other fields within the transaction detail, like date and time, subtype, etc.
For this project, the count was returned back to a calling application. However, it and other scraped data could have been stored in a flat file or a database as well.
Additional Possible Roadblocks and Solutions
Numerous other obstacles might be presented while scraping modern websites with your own browser instance, but most can be resolved. Here are a few:
Trying to find something before it appears
While browsing yourself, how often do you find that you are waiting for a page to come up, sometimes for many seconds? Well, the same can occur while navigating programmatically. You look for a class or other element – and it’s not there!
Luckily, Selenium has the ability to wait until it sees a certain element, and can timeout if the element doesn’t appear, like so:
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "theFirstLabel")))
Getting through a Captcha
Some sites employ Captcha or similar to prevent unwanted robots (which they might consider you). This can put a damper on web scraping and slow it way down.
For simple prompts (like "what's 2 + 3?"), these can generally be read and figured out easily. However, for more advanced barriers, there are libraries that can help try to crack it. Some examples are 2Captcha, Death by Captcha, and Bypass Captcha.
Website structural changes
Websites are meant to change, and they often do. That's why when writing a scraping script, it's best to keep this in mind. You'll want to think about which methods you'll use to find the data, and which not to use. Consider partial matching techniques, rather than trying to match a whole phrase. For example, a website might change a message from "No records found" to "No records located", but if your match is on "No records," you should be okay. Also, consider whether to match on XPath, ID, name, link text, tag or class name, or CSS selector, and which is least likely to change.
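A tiny sketch of that idea, reusing the earlier "no records" case:
# Match on a stable prefix rather than the exact wording
if "No records" in driver.page_source:
    print("Search returned nothing")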
Summary: Python and Selenium
This was a brief demonstration to show that almost any website can be scraped, no matter what technologies are used and what complexities are involved. Basically, if you can browse the site yourself, it generally can be scraped.
Now, as a caveat, it does not mean that every website should be scraped. Some have legitimate restrictions in place, and there have been numerous court cases deciding the legality of scraping certain sites. On the other hand, some sites welcome and encourage data to be retrieved from their website and in some cases provide an API to make things easier.
Either way, it’s best to check with the terms and conditions before starting any project. But if you do go ahead, be assured that you can get the job done.
Recommended Resources for Complex Web Scraping:
Advanced Python Web Scraping: Best Practices & Workarounds
Scalable do-it-yourself scraping: How to build and run scrapers on a large scale
Web Scraping Using Selenium Python - Analytics Vidhya

Introduction
Machine learning is fueling today's technological marvels such as driverless cars, space flight, and image and speech recognition. However, a data science professional would need a large volume of data to build a robust and reliable machine learning model for such business problems.
Data mining or gathering data is a very primitive step in the data science life cycle. As per business requirements, one may have to gather data from sources like SAP servers, logs, databases, APIs, online repositories, or the web.
Tools for web scraping like Selenium can scrape a large volume of data such as text and images in a relatively short time.
Table of Contents
What is Web Scraping
Why Web Scraping
How Web Scraping is useful
What is Selenium
Setup & tools
Implementation of Image Web Scraping using Selenium Python
Headless Chrome browser
Putting it all together
End Notes
What is Web Scraping?
Web scraping, also called "crawling" or "spidering," is the technique of gathering data automatically from an online source, usually from a website. While web scraping is an easy way to get a large volume of data in a relatively short time frame, it adds stress to the server where the source is hosted.
This is also one of the main reasons why many websites don't allow scraping everything on their website. However, as long as it does not disrupt the primary function of the online source, it is fairly acceptable.
Why Web Scraping?
There's a large volume of data lying on the web that people can utilize to serve their business needs. So, one needs some tool or technique to gather this information from the web. And that's where the concept of web scraping comes into play.
How is Web Scraping useful?
Web scraping can help us extract an enormous amount of data about customers, products, people, stock markets, etc.
One can utilize the data collected from websites such as e-commerce portals, job portals, and social media channels to understand customers' buying patterns, employee attrition behavior, customer sentiment, and so on.
The most popular libraries and frameworks used in Python for web scraping are Beautiful Soup, Scrapy, and Selenium.
In this article, we'll talk about web scraping using Selenium in Python. And as the cherry on top, we'll see how we can gather images from the web that you can use to build training data for your deep learning project.
What is Selenium?
Selenium is an open-source web-based automation tool. Selenium is primarily used for testing in the industry, but it can also be used for web scraping. We'll use the Chrome browser, but you can try it on any browser; it's almost the same.
Now let us see how to use selenium for Web Scraping.
Setup & tools
Installation:
Install selenium using pip
pip install selenium
Install selenium using conda
conda install -c conda-forge selenium
Download Chrome Driver:
To download the web driver, you can choose either of the below methods:
You can directly download the Chrome driver from the below link.
Or, you can download it directly using the below line of code: driver = webdriver.Chrome(ChromeDriverManager().install())
You can find the complete documentation on Selenium here. The documentation is very much self-explanatory, so make sure to read it to leverage Selenium with Python.
The following methods will help us find elements on a web page (these methods return a list):
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
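For instance, a quick sketch of how these list-returning methods behave once a page is loaded in the driver:
links = driver.find_elements_by_tag_name("a")  # returns a list of WebElements
print(len(links), "links found")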
Now let's write some Python code to scrape images from the web.
Implementation of Image Web Scraping using Selenium Python
Step 1: Import libraries
import os
import selenium
from selenium import webdriver
import time
from PIL import Image
import io
import requests
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException
Step 2: Install Driver
#Install Driver
driver = webdriver.Chrome(ChromeDriverManager().install())
Step 3: Specify search URL
#Specify Search URL
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
driver.get(search_url.format(q='Car'))
I've used this specific URL so you don't get in trouble for using licensed or copyrighted images. Otherwise, you can use another search URL as well.
Then we're searching for Car in our search URL. Paste the link into the driver.get("Your Link Here") function and run the cell. This will open a new browser window for that link.
Step 4: Scroll to the end of the page
#Scroll to the end of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # sleep_between_interactions
This line of code would help us reach the end of the page. And then we're giving a sleep time of 5 seconds so we don't run into a problem where we're trying to read elements from a page that is not yet loaded.
Step 5: Locate the images to be scraped from the page
#Locate the images to be scraped from the current page
imgResults = driver.find_elements_by_xpath("//img[contains(@class, 'Q4LuWd')]")
totalResults=len(imgResults)
Now we’ll fetch all the image links present on that particular page. We will create a “list” to store those links. So, to do that go to the browser window, right-click on the page, and select ‘inspect element’ or enable the dev tools using Ctrl+Shift+I.
Now identify any attribute such as class, id, etc. that is common across all these images.
In our case, class="Q4LuWd" is common across all these images.
Step 6: Extract the corresponding link of each image
As we can see, the images shown on the page are still thumbnails, not the original images. So to download each image, we need to click each thumbnail and extract the relevant information corresponding to that image.
#Click on each Image to extract its corresponding link to download
img_urls = set()
for i in range(0, len(imgResults)):
    img = imgResults[i]
    try:
        img.click()
        time.sleep(2)
        actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
        for actual_image in actual_images:
            if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                img_urls.add(actual_image.get_attribute('src'))
    except (ElementClickInterceptedException, ElementNotInteractableException) as err:
        print(err)
So, in the above snippet of code, we’re performing the following tasks-
Iterate through each thumbnail and then click it.
Make our browser sleep for 2 seconds (:P).
Find the unique HTML tag corresponding to that image to locate it on the page.
We still get more than one result for a particular image, but all we're interested in is the link for that image to download.
So, we iterate through each result for that image, extract its 'src' attribute, and then check whether "https" is present in the 'src' or not, since a web link typically starts with 'https'.
Step 7: Download & save each image in the destination directory
os.chdir('C:/Qurantine/Blog/WebScrapping/Dataset1')
baseDir = os.getcwd()

for i, url in enumerate(img_urls):
    file_name = f"{i}.jpg"
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        continue
    try:
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(baseDir, file_name)
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")
Now, finally, you have extracted the images for your project.
Note: Once you have written proper code, the browser itself is not important; you can collect data without a visible browser window, which is called a headless browser. To do so, replace the previous driver setup with the following code.
#Headless chrome browser
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
In this case, the browser runs without a visible window, which is very helpful when deploying a solution in production.
Let's put all this code in a function to make it more organized, and implement the same idea to download 100 images for each category (e.g. Cars, Horses).
And this time we’d write our code using the idea of headless chrome.
Putting it all together:
Step 1: Import all required libraries
os.chdir('C:/Qurantine/Blog/WebScrapping')
Step 2: Install Chrome Driver
#Install driver
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
In this step, we’re installing a Chrome driver and using a headless browser for web scraping.
Step 3: Specify search URL
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
I’ve used this specific URL to scrape copyright-free images.
Step 4: Write a function to take the cursor to the end of the page
def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)  # sleep_between_interactions
This snippet of code will scroll down the page.
Step 5: Write a function to get the URL of each image
#no license issues
def getImageUrls(name, totalImgs, driver):
    driver.get(search_url.format(q=name))
    img_urls = set()
    img_count = 0
    results_start = 0
    while img_count < totalImgs:
        scroll_to_end(driver)
        # Click each new thumbnail and collect the full-size image URL it reveals
        thumbnail_results = driver.find_elements_by_xpath("//img[contains(@class, 'Q4LuWd')]")
        for img in thumbnail_results[results_start:]:
            img.click()
            time.sleep(2)
            actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                    img_urls.add(actual_image.get_attribute('src'))
            img_count = len(img_urls)
        if img_count >= totalImgs:
            print(f"Found: {img_count} image links")
            break
        else:
            print("Found:", img_count, "looking for more image links...")
            load_more_button = driver.find_element_by_css_selector(".mye4qd")
            driver.execute_script("document.querySelector('.mye4qd').click();")
            results_start = len(thumbnail_results)
    return img_urls
This function would return a list of URLs for each category (e.g. Cars, horses, etc.).
Step 6: Write a function to download each image
def downloadImages(folder_path, file_name, url):
    # Fetch the image bytes and save them as a JPEG (same logic as Step 7 above)
    image_content = requests.get(url).content
    image = Image.open(io.BytesIO(image_content)).convert('RGB')
    image.save(os.path.join(folder_path, file_name), "JPEG", quality=85)
This snippet of code will download the image from each URL.
Step 7: Write a function to save each image in the destination directory
def saveInDestFolder(searchNames, destDir, totalImgs, driver):
    for name in list(searchNames):
        path = os.path.join(destDir, name)
        if not os.path.isdir(path):
            os.makedirs(path)
        print('Current Path', path)
        totalLinks = getImageUrls(name, totalImgs, driver)
        print('totalLinks', totalLinks)
        if totalLinks is None:
            print('images not found for:', name)
            continue
        for i, link in enumerate(totalLinks):
            file_name = f"{i}.jpg"
            downloadImages(path, file_name, link)

searchNames = ['Car', 'horses']
destDir = './Dataset2/'
totalImgs = 5
saveInDestFolder(searchNames, destDir, totalImgs, driver)
This snippet of code will save each image in the destination directory.
I've tried my best to explain web scraping using Selenium with Python as simply as possible. Please feel free to comment with your queries; I'll be more than happy to answer them.
You can clone my GitHub repository to download the whole code and data.
About the Author
Praveen Kumar Anwla
I've been working as a Data Scientist with product-based and Big 4 audit firms for almost 5 years now. I have been working on various NLP, machine learning and cutting-edge deep learning frameworks to solve business problems. Please feel free to check out my personal blog, where I cover topics from machine learning and AI to chatbots and visualization tools (Tableau, QlikView, etc.) and various cloud platforms like Azure, IBM and AWS cloud.

Frequently Asked Questions about scraping python selenium

How do I use Selenium to scrape in Python?

Implementation of Image Web Scraping using Selenium Python: Step 1: Import libraries. Step 2: Install driver. Step 3: Specify search URL. Step 4: Scroll to the end of the page. Step 5: Locate the images to be scraped from the page. Step 6: Extract the corresponding link of each image.

Is Selenium using for web scraping?

Selenium is an open-source web-based automation tool. Python language and other languages are used with Selenium for testing as well as web scraping.

Is Python scraping legal?

Web scraping and crawling aren’t illegal by themselves. After all, you could scrape or crawl your own website, without a hitch. … Big companies use web scrapers for their own gain but also don’t want others to use bots against them.
