Web Scraping using Selenium and Python – ScrapingBee
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.
In the last tutorial we learned how to leverage the Scrapy framework to solve common web scraping problems.
Today we are going to take a look at Selenium (with Python ❤️) in a step-by-step tutorial.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.
The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.
At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser, end-to-end testing (acceptance tests).
It is still used for testing today, but it has also become a general browser automation platform. And of course, it is used for web scraping!
Selenium is useful when you have to perform an action on a website such as:
Clicking on buttons
Filling forms
Scrolling
Taking a screenshot
It is also useful for executing Javascript code. Let's say that you want to scrape a Single Page Application and you haven't found an easy way to call the underlying APIs directly. In this case, Selenium might be what you need.
Installation
We will use Chrome in our example, so make sure you have it installed on your local machine. You will need:
Chrome download page
Chrome driver binary
selenium package
To install the Selenium package, as always, I recommend that you create a virtual environment (for example using virtualenv) and then:
pip install selenium
Quickstart
Once you have downloaded both Chrome and Chromedriver and installed the Selenium package, you should be ready to start the browser:
from selenium import webdriver

DRIVER_PATH = '/path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.example.com')  # placeholder URL
This will launch Chrome in headful mode (like regular Chrome, which is controlled by your Python code).
You should see a message stating that the browser is controlled by automated software.
To run Chrome in headless mode (without any graphical user interface), for example on a server, see the following example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.example.com")  # placeholder URL
print(driver.page_source)
driver.quit()
The driver.page_source property will return the full page HTML code.
Here are two other interesting WebDriver properties:
driver.title gets the page's title
driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
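For example, a minimal sketch using these properties (the URL is a placeholder):

driver.get('https://www.example.com')  # placeholder URL
print(driver.title)        # the page's title
print(driver.current_url)  # the final URL, after any redirects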
Locating Elements
Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract data and save it for further analysis (web scraping).
There are many methods available in the Selenium API to select elements on the page. You can use:
Tag name
Class name
IDs
XPath
CSS selectors
We recently published an article explaining XPath. Don’t hesitate to take a look if you aren’t familiar with XPath.
As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need.
A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C or on macOS Cmd + Shift + C instead of having to right click + inspect each time:
find_element
There are many ways to locate an element in Selenium.
Let’s say that we want to locate the h1 tag in this HTML:
<h1 class="someclass" id="greatID">Super title</h1>
h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')
All these methods also have find_elements (note the plural) to return a list of elements.
For example, to get all anchors on a page, use the following:
all_links = driver.find_elements_by_tag_name('a')
Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You might also have multiple elements with the same class (the ID is supposed to be unique).
XPath is my favorite way of locating elements on a web page. It's a powerful way to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
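For instance, a short sketch combining both styles (the "post" class name is a hypothetical example):

# Absolute: the first h1 anywhere in the document
title = driver.find_element_by_xpath('//h1')
# Relative: an anchor somewhere inside a div with a hypothetical "post" class
link = driver.find_element_by_xpath("//div[@class='post']//a")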
WebElement
A WebElement is a Selenium object representing an HTML element.
There are many actions that you can perform on those HTML elements, here are the most useful:
Accessing the text of the element with the text property
Clicking on the element with click()
Accessing an attribute with get_attribute('class')
Sending text to an input with send_keys('mypassword')
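Put together, a minimal sketch (the ID and selectors here are hypothetical):

search_input = driver.find_element_by_id('search')  # hypothetical ID
print(search_input.get_attribute('placeholder'))    # read an attribute
search_input.send_keys('selenium')                  # type into the input
button = driver.find_element_by_xpath("//button[@type='submit']")
print(button.text)                                  # the element's visible text
button.click()                                      # click it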
There are some other interesting methods like is_displayed(), which returns True if an element is visible to the user.
This can be useful to avoid honeypots (like filling hidden inputs).
Honeypots are mechanisms used by website owners to detect bots. For example, an HTML input with the attribute type=hidden, like this (an illustrative snippet):
<input type="hidden" name="some-field" value="">
This input value is supposed to be blank. If a bot is visiting a page and fills all of the inputs on a form with random value, it will also fill the hidden input. A legitimate user would never fill the hidden input value, because it is not rendered by the browser.
That’s a classic honeypot.
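A minimal sketch of how is_displayed() helps here: fill only the inputs a real user could actually see (the value sent is arbitrary):

for input_element in driver.find_elements_by_tag_name('input'):
    if input_element.is_displayed():        # skips hidden honeypot inputs
        input_element.send_keys('some value')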
Full example
Here is a full example using Selenium API methods we just covered.
We are going to log into Hacker News:
In our example, authenticating to Hacker News is not really useful on its own. However, you could imagine creating a bot to automatically post a link to your latest blog post.
In order to authenticate we need to:
Go to the login page using driver.get()
Select the username input using driver.find_element_by_* and then send_keys() to send text to the input
Follow the same process with the password input
Click on the login button using click()
Should be easy right? Let’s see the code:
driver.get("https://news.ycombinator.com/login")
login = driver.find_element_by_xpath("//input").send_keys(USERNAME)
password = driver.find_element_by_xpath("//input[@type='password']").send_keys(PASSWORD)
submit = driver.find_element_by_xpath("//input[@value='login']").click()
Easy, right? Now there is one important thing that is missing here. How do we know if we are logged in?
We could try a couple of things:
Check for an error message (like “Wrong password”)
Check for one element on the page that is only displayed once logged in.
So, we’re going to check for the logout button. The logout button has the ID “logout” (easy)!
We can't just check if the element is None, because all of the find_element_by_* methods raise an exception if the element is not found in the DOM.
So we have to use a try/except block and catch the NoSuchElementException exception:
# don't forget: from selenium.common.exceptions import NoSuchElementException
try:
    logout_button = driver.find_element_by_id("logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')
We could easily take a screenshot using:
driver.save_screenshot('screenshot.png')  # placeholder filename
Note that a lot of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly.
Then, you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
In our Hacker News case it’s simple and we don’t have to worry about these issues.
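Still, a minimal sketch of the safe pattern: set the window size explicitly before saving the screenshot (the size and filename are arbitrary):

driver.set_window_size(1920, 1200)         # fix the viewport first
driver.save_screenshot('hacker_news.png')  # arbitrary filename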
If you need to make screenshots at scale, feel free to try our new Screenshot API here.
Waiting for an element to be present
Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their front-end. These front-end frameworks are complicated to deal with because they fire a lot of AJAX calls.
If we had to worry about an asynchronous HTTP call (or many) to an API, there are two ways to solve this:
Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
Use a WebDriverWait object.
If you use time.sleep() you will probably use an arbitrary value. The problem is, you're either waiting too long or not long enough.
Also, the website can load slowly on your local WiFi connection, but will be ten times faster on your cloud server.
With the WebDriverWait method you will wait the exact amount of time necessary for your element/data to be loaded.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    driver.quit()
This will wait five seconds for an element located by the ID “mySuperId” to be loaded.
There are many other interesting expected conditions like:
element_to_be_clickable
text_to_be_present_in_element
You can find more information about this in the Selenium documentation.
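For example, a minimal sketch waiting for a hypothetical button to become clickable (the ID is an assumption):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

button = WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.ID, 'mySuperButton'))  # hypothetical ID
)
button.click()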
Executing Javascript
Sometimes, you may need to execute some Javascript on the page. For example, let’s say you want to take a screenshot of some information, but you first need to scroll a bit to see it.
You can easily do this with Selenium:
javaScript = "window.scrollBy(0, 1000);"
driver.execute_script(javaScript)
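execute_script can also return a value to Python, which is handy for scrolling all the way down the page; a short sketch:

# Ask the browser for the full page height, then scroll to the bottom
page_height = driver.execute_script('return document.body.scrollHeight;')
driver.execute_script(f'window.scrollTo(0, {page_height});')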
Using a proxy with Selenium Wire
Unfortunately, Selenium's proxy handling is quite basic. For example, it can't handle proxies that require authentication out of the box.
To solve this issue, you need to use Selenium Wire.
This package extends Selenium’s bindings and gives you access to all the underlying requests made by the browser.
If you need to use Selenium with a proxy that requires authentication, this is the package you need.
pip install selenium-wire
This code snippet shows you how to quickly use your headless browser behind a proxy.
# Install the Python selenium-wire library:
# pip install selenium-wire
from seleniumwire import webdriver

proxy_username = "USER_NAME"  # placeholder credentials
proxy_password = "PASSWORD"
proxy_url = "…"               # your proxy host
proxy_port = 8886

options = {
    "proxy": {
        "http": f"http://{proxy_username}:{proxy_password}@{proxy_url}:{proxy_port}",
        "verify_ssl": False,
    },
}

URL = "…"  # the page you want to scrape

driver = webdriver.Chrome(
    executable_path="YOUR-CHROME-EXECUTABLE-PATH",
    seleniumwire_options=options,
)
driver.get(URL)
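Because Selenium Wire records every request the browser makes, you can also inspect them after the page load. A small sketch using Selenium Wire's requests attribute:

# Inspect the requests captured by Selenium Wire
for request in driver.requests:
    if request.response:
        print(request.url, request.response.status_code)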
Blocking images and JavaScript
With Selenium, by using the correct Chrome options, you can block some requests from being made.
This can be useful if you need to speed up your scrapers or reduce your bandwidth usage.
To do this, you need to launch Chrome with the below options:
chrome_options = webdriver.ChromeOptions()

### This blocks images and javascript requests
chrome_prefs = {
    "profile.default_content_setting_values": {
        "images": 2,
        "javascript": 2,
    }
}
chrome_options.experimental_options["prefs"] = chrome_prefs
###

driver = webdriver.Chrome(
    executable_path=DRIVER_PATH,
    chrome_options=chrome_options,
)
Conclusion
I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about how to scrape the web with Python don’t hesitate to take a look at our general Python web scraping guide.
Selenium is often necessary to extract data from websites using lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard. This is one of the things we solve with ScrapingBee, our web scraping API.
Selenium is also an excellent tool to automate almost anything on the web.
If you perform repetitive tasks, like filling forms or checking information behind a login form where the website doesn't have an API, it might be a good idea to automate them with Selenium. Just don't forget this xkcd:
Web Scraping Using Selenium — Python | by Atindra Bandi
How to navigate through multiple pages of a website and scrape large amounts of data using Selenium in Python.

Shhh! Be cautious, web scraping could be troublesome!!! Before we delve into the topic of this article, let us first understand what web scraping is and how it is useful.

1. What is web scraping? Web scraping is a technique for extracting information from the internet automatically using software that simulates human web surfing.

2. How is web scraping useful? Web scraping helps us extract large volumes of data about customers, products, people, stock markets, etc. It is usually difficult to get this kind of information on a large scale using traditional data collection methods. We can utilize the data collected from a website, such as an e-commerce portal or social media channels, to understand customer behaviors and sentiments, buying patterns, and brand attribute associations, which are critical insights for any business.

Let's now get our hands dirty!! Since we have defined our purpose of scraping, let us delve into the nitty-gritty of how to actually do all the fun stuff! Before that, below are some housekeeping instructions regarding the installation of packages.

a. Python version: We will be using Python 3.0; however, feel free to use Python 2.0 by making slight adjustments. We will be using a Jupyter notebook, so you don't need any command line knowledge.

b. Selenium package: You can install the selenium package using the following command:

!pip install selenium

c. Chrome driver: Please install the latest version of chromedriver. Please note you need Google Chrome installed on your machine to work through this tutorial.

The first and foremost thing while scraping a website is to understand its structure. We will be scraping a car forum. This website aids people in their car buying decisions. People can post their reviews about different cars in the discussion forums (very similar to how one posts reviews on Amazon). We will be scraping the discussion about entry level luxury car brands.

We will scrape ~5000 comments from different users across multiple pages. We will scrape the user id, date of comment and comments, and export them into a csv file for any further analysis.

Let's begin writing our scraper! We will first import the important packages in our notebook:

#Importing packages
from selenium import webdriver
import pandas as pd

Let's now create a new instance of Google Chrome. This will help our program open an URL in Chrome.

driver = webdriver.Chrome('Path in your computer where you have installed chromedriver')

Let's now access Google Chrome and open our website. By the way, Chrome knows that you are accessing it through an automated software!

driver.get('…')  # the forum URL

So, what does our web page look like? We will inspect 3 items (user id, date and comment) on our web page and understand how we can extract them.

1. User id: Inspecting the user id, we can see that the highlighted text represents the XML code for the user id. The XML path (XPath) for the user id is shown below. There is an interesting thing to note here: the XML path contains a comment id, which uniquely denotes each comment on the website. This will be very helpful as we try to recursively scrape multiple comments.

//*[@id="Comment_5561090"]/div/div[2]/div[1]/span[1]/a[2]

If we look at the XPath, we will observe that it contains the user id 'dino001'.

So how do we extract the values inside an XPath? Selenium has a function called "find_elements_by_xpath". We will pass our XPath into this function and get a selenium element.
Once we have the element, we can extract the text inside our XPath using the 'text' function. In our case, the text is basically the user id ('dino001').

userid_element = driver.find_elements_by_xpath('//*[@id="Comment_5561090"]/div/div[2]/div[1]/span[1]/a[2]')[0]
userid = userid_element.text

2. Comment date: Similar to the user id, we will now inspect the date when the comment was posted. Let's also see the XPath for the comment date. Again, note the unique comment id in the XPath.

//*[@id="Comment_5561090"]/div/div[2]/div[2]/span[1]/a/time

So, how do we extract the date from the above XPath? We will again use the function "find_elements_by_xpath" to get the selenium element. Now, if we carefully observe the highlighted text in the picture, we will see that the date is stored inside the 'title' attribute. We can access the values inside attributes using the function 'get_attribute'. We will pass the attribute name to this function to get the value inside it.

user_date = driver.find_elements_by_xpath('//*[@id="Comment_5561090"]/div/div[2]/div[2]/span[1]/a/time')[0]
date = user_date.get_attribute('title')

3. Comments: Lastly, let's explore how to extract the comments of each user. Below is the XPath for the user comment:

//*[@id="Comment_5561090"]/div/div[3]/div/div[1]

Once again, we have the comment id in our XPath. Similar to the user id, we will extract the comment from the above XPath.

user_message = driver.find_elements_by_xpath('//*[@id="Comment_5561090"]/div/div[3]/div/div[1]')[0]
comment = user_message.text

We just learnt how to scrape different elements from a web page. Now, how do we recursively extract these items for 5000 users? As discussed above, we will use the comment ids, which are unique for each comment, to extract different users' data. If we see the XPath for the entire comment block, we will see that it has a comment id associated with it.

//*[@id="Comment_5561090"]

The following code snippet will help us extract all the comment ids on a particular web page. We will again use the function 'find_elements_by_xpath' on the above XPath and extract the ids from the 'id' attribute.

ids = driver.find_elements_by_xpath("//*[contains(@id, 'Comment_')]")
comment_ids = []
for i in ids:
    comment_ids.append(i.get_attribute('id'))

The above code gives us a list of all the comment ids from a particular web page.

How do we bring all this together? Now we will bring all the things we have seen so far into one big piece of code, which will recursively help us extract 5000 comments. We can extract the user id, date and comment for each user on a particular web page by looping through all the comment ids we found in the previous step (a sketch of this loop is shown just after this section).

Lastly, the url has page numbers, starting from 702. So, we can recursively go to previous pages by simply changing the page numbers in the url to extract more comments, until we get the desired number of comments.

This process will take some time depending on the computational power of your computer. So, chill, have a coffee, talk to your friends and family and let Selenium do its job!

Summary: We learnt how to scrape a website using Selenium in Python and get large amounts of data. You can carry out multiple unstructured data analytics and find interesting trends, sentiments, etc. using this data. Let me know if this was helpful. Enjoy scraping, BUT BE CAREFUL!
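The page-level scraper itself appears in the original article only as a screenshot. As a hedged reconstruction (reusing the XPaths above; the variable names and loop structure are assumptions, not the author's exact code), the loop could look like this:

comments = []
for comment_id in comment_ids:
    user_xpath = f'//*[@id="{comment_id}"]/div/div[2]/div[1]/span[1]/a[2]'
    date_xpath = f'//*[@id="{comment_id}"]/div/div[2]/div[2]/span[1]/a/time'
    text_xpath = f'//*[@id="{comment_id}"]/div/div[3]/div/div[1]'
    userid = driver.find_elements_by_xpath(user_xpath)[0].text
    date = driver.find_elements_by_xpath(date_xpath)[0].get_attribute('title')
    comment = driver.find_elements_by_xpath(text_xpath)[0].text
    comments.append([userid, date, comment])

comments_df = pd.DataFrame(comments, columns=['user_id', 'date', 'comment'])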
If you liked reading this, I would recommend reading another article about scraping Reddit data using Reddit API and Google BigQuery written by a fellow classmate (Akhilesh Narapareddy) at the University of Texas, Austin.
Beginners Guide to Web Scraping Using Selenium in Python!
Introduction
By 2025, the world’s data will grow to 175 Zettabytes – IDC
The overall amount of data is growing, and so is unstructured data. It is estimated that about 80% of all data is unstructured. Unstructured data is data that doesn't fit into any data model. It is as diverse as can be: an image, audio, text, and much more. Industries make an effort to leverage this unstructured data, as it can contain a vast amount of information. This information can be used for extensive analysis and effective decision making.
Selenium is a powerful browser automation tool. It supports various browsers like Firefox, Chrome, Internet Explorer, Edge and Safari. WebDriver is the heart of Selenium in Python. It can be used to automate testing and to perform operations on webpage elements, with methods like close, back, get_cookie, get_screenshot_as_png and get_window_size, to name a few. Some common use-cases of using Selenium for web scraping are automating a login, submitting form elements, handling alert prompts, adding/deleting cookies, and much more. It can handle exceptions as well. For more details on Selenium, refer to the official documentation.
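For instance, a minimal sketch exercising a few of those WebDriver operations (the chromedriver path and URL are placeholders):

from selenium import webdriver

driver = webdriver.Chrome(executable_path='/path/to/chromedriver')  # placeholder path
driver.get('https://www.example.com')                               # placeholder URL
print(driver.get_window_size())    # e.g. {'width': 1050, 'height': 708}
print(driver.get_cookies())        # list of cookie dicts for the current domain
png_bytes = driver.get_screenshot_as_png()  # screenshot as raw PNG bytes
driver.back()                      # navigate back in the browser history
driver.close()                     # close the current window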
Let’s deep dive into the world of selenium right away!
Installation
Assuming that Python is installed in the system, we can install the below library using pip/conda
pip install selenium
OR
conda install selenium
We will be using the Google Chrome driver. We can download it from the official chromedriver downloads page.
Implementation
1. Import packages
We need selenium webdriver, time and pandas Python packages
from selenium import webdriver
import time
import pandas as pd
2. Declare Variables
We need to define variables to make it easier for later use. We will use actual paths. The below paths are shown only as a reference
FILE_PATH_FOLDER = '…'  # folder where the output CSV will be saved
search_query = '…'      # URL of the job search results page to scrape
driver = webdriver.Chrome(executable_path='C:/…/chromedriver_win32/')
job_details = []
3. Hit the required URL to get the necessary information
We need to get the specific web element tags to extract the correct information. You can obtain them by right-clicking on the page and clicking on Inspect. We can click on the arrow in the top left corner, or press Ctrl+Shift+C, to inspect a particular element and get the necessary HTML tag. A good or professional HTML site contains a unique identifier for almost all the tags associated with the information. We will leverage this property to scrape the web page.
driver.get(search_query)
time.sleep(5)
job_list = driver.find_elements_by_xpath("//div[@data-tn-component='organicJob']")
4. Get job info from the job list
We aim to fetch the job title, company, location, summary, and publish date. We will iterate over the job list elements and extract the required information using the find_elements_by_xpath method of the Selenium web driver. Once the iteration is over, we will quit the driver to close the browser.
for each_job in job_list:
    # Getting job info
    job_title = each_job.find_elements_by_xpath(".//h2[@class='title']/a")[0]
    job_company = each_job.find_elements_by_xpath(".//span[@class='company']")[0]
    job_location = each_job.find_elements_by_xpath(".//span[@class='location accessible-contrast-color-location']")[0]
    job_summary = each_job.find_elements_by_xpath(".//div[@class='summary']")[0]
    job_publish_date = each_job.find_elements_by_xpath(".//span[@class='date ']")[0]
    # Saving job info
    job_info = [job_title.text, job_company.text, job_location.text, job_summary.text, job_publish_date.text]
    # Saving into job_details
    job_details.append(job_info)
driver.quit()
5. Save the data in a CSV file
We will add proper columns to the dataframe and use the to_csv attribute of the dataframe to save it as CSV
job_details_df = pd.DataFrame(job_details)
job_details_df.columns = ['title', 'company', 'location', 'summary', 'publish_date']
job_details_df.to_csv('job_details.csv', index=False)  # placeholder filename
Output
The CSV file will be saved in the folder given by the FILE_PATH_FOLDER variable.
Conclusion
So, this is one of the ways by which we can scrape data. There are numerous other packages/libraries for web scraping besides Selenium, and umpteen methods by which we can achieve the desired objective. I hope this article helped you explore something new. Do share your thoughts and the ways in which it helped you. I am open to suggestions for improvement as well.
Frequently Asked Questions about selenium python web scraping tutorial
How do I use Python web scraping in selenium?
Implementation of image web scraping using Selenium Python:
Step 1: Import libraries.
Step 2: Install the driver.
Step 3: Specify the search URL.
Step 4: Scroll to the end of the page.
Step 5: Locate the images to be scraped from the page.
Step 6: Extract the corresponding link of each image.
Is Selenium using for web scraping?
Selenium is an open-source web-based automation tool. Python and other languages are used with Selenium for testing as well as for web scraping.
Can Python be used for web scraping?
Python is the most popular programming language for web scraping because it can handle almost all processes related to data extraction. However, there are other languages that can be used by developers for web scraping, such as Ruby, C++ and PHP.