How To Scrape a JavaScript Website With Python


Web-scraping JavaScript page with Python – Stack Overflow

I personally prefer using scrapy and selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites, which almost all contain JavaScript in one form or another. Here's an example:
Use the scrapy startproject command to create your scraper and write your spider; the skeleton can be as simple as this:
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['']

    def start_requests(self):
        yield scrapy.Request(self.start_urls[0])

    def parse(self, response):
        # do stuff with results, scrape items etc.
        # for now we're just checking everything worked
        print(response.body)
The real magic happens in middlewares.py. Overwrite two methods in the downloader middleware, __init__ and process_request, in the following way:
# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver


class SampleProjectDownloaderMiddleware(object):

    def __init__(self):
        SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
        SELENIUM_URL = f'{SELENIUM_LOCATION}:4444/wd/hub'
        chrome_options = webdriver.ChromeOptions()
        # chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
                                       desired_capabilities=chrome_options.to_capabilities())

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # sleep a bit so the page has time to load
        # or monitor items on page to continue as soon as page ready
        sleep(4)
        # if you need to manipulate the page content like clicking and scrolling, you do it here
        # e.g. self.driver.find_element_by_css_selector('...').click()
        # you only need the now properly and completely rendered html from your page to get results
        body = deepcopy(self.driver.page_source)
        # copy the current url in case of redirects
        url = deepcopy(self.driver.current_url)
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)
Don't forget to enable this middleware by uncommenting the following lines in your project's settings.py file (adjust the module path to match your project name):
DOWNLOADER_MIDDLEWARES = {
    'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,
}
Next, dockerization. Create your Dockerfile from a lightweight image (I'm using Python Alpine here), copy your project directory to it, and install the requirements:
# Use an official Python runtime as a parent image
FROM python:3.6-alpine

# install some packages necessary to scrapy and then curl because it's handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev

WORKDIR /my_scraper
ADD requirements.txt /my_scraper/
RUN pip install -r requirements.txt
ADD . /scrapers
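The requirements file referenced in the Dockerfile isn't reproduced in the answer. Assuming it simply lists the two tools used in this setup, a minimal sketch might be:

```text
scrapy
selenium
```

In practice you would pin versions here so the image builds reproducibly.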
And finally bring it all together in docker-compose.yml:
version: '2'

services:
  selenium:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"
    shm_size: 1G

  my_scraper:
    build: .
    depends_on:
      - "selenium"
    environment:
      - SELENIUM_LOCATION=samplecrawler_selenium_1
    volumes:
      - .:/my_scraper
    # use this command to keep the container running
    command: tail -f /dev/null
Run docker-compose up -d. If you're doing this for the first time, it will take a while to fetch the latest selenium/standalone-chrome image and to build your scraper image as well.
Once it’s done, you can check that your containers are running with docker ps and also check that the name of the selenium container matches that of the environment variable that we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1).
Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh, the command for me was docker exec -ti samplecrawler_my_scraper_1 sh, cd into the right directory and run your scraper with scrapy crawl my_spider.
The entire thing is on my GitHub page and you can get it from there.
Scraping data from a JavaScript webpage with Python


This post will walk through how to use the requests_html package to scrape options data from a JavaScript-rendered webpage. requests_html serves as an alternative to Selenium and PhantomJS, and provides a clear syntax similar to the awesome requests package. The code we’ll walk through is packaged into functions in the options module in the yahoo_fin package, but this article will show how to write the code from scratch using requests_html so that you can use the same idea to scrape other JavaScript-rendered webpages.
Note:
requests_html requires Python 3.6+. If you don't have requests_html installed, you can download it using pip:
pip install requests_html
Motivation
Let’s say we want to scrape options data for a particular stock. As an example, let’s look at Netflix (since it’s well known). If we go to the below site, we can see the option chain information for the earliest upcoming options expiration date for Netflix:
On this webpage there's a drop-down box allowing us to view data by other expiration dates. What if we want to get all the possible choices, i.e. all the possible expiration dates?
We can try using requests with BeautifulSoup, but that won’t work quite the way we want. To demonstrate, let’s try doing that to see what happens.
from bs4 import BeautifulSoup
import requests

# the Netflix options page on Yahoo Finance
resp = requests.get("https://finance.yahoo.com/quote/NFLX/options?p=NFLX")
html = resp.content
soup = BeautifulSoup(html)
option_tags = soup.find_all("option")
Running the above code shows us that option_tags is an empty list. This is because there are no option tags found in the HTML we scraped from the webpage above. However, if we look at the source via a web browser, we can see that there are, indeed, option tags:
Why the disconnect? The reason we see option tags when looking at the source code in a browser is that the browser is executing JavaScript code that renders that HTML, i.e. it modifies the HTML of the page dynamically to allow a user to select one of the possible expiration dates. This means that if we just scrape the HTML, the JavaScript won't be executed, and thus we won't see the tags containing the expiration dates. This brings us to requests_html.
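To see the disconnect concretely without any network access, here is a minimal, self-contained sketch using only the standard library. The two markup strings are invented for illustration: the first stands in for what a server sends before JavaScript runs (an empty drop-down container), the second for what the browser's DOM looks like after JavaScript populates it.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect every occurrence of a given tag name from an HTML string."""
    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.found.append(attrs)

def count_tags(html, tag_name):
    collector = TagCollector(tag_name)
    collector.feed(html)
    return len(collector.found)

# Hypothetical raw HTML, before any JavaScript has run.
raw_html = '<select id="expiration-dates"></select>'

# Hypothetical rendered HTML, after JavaScript fills in the drop-down.
rendered_html = (
    '<select id="expiration-dates">'
    '<option value="1573776000">November 15, 2019</option>'
    '<option value="1574380800">November 22, 2019</option>'
    '</select>'
)

print(count_tags(raw_html, "option"))       # 0 option tags in the raw HTML
print(count_tags(rendered_html, "option"))  # 2 option tags after rendering
```

This is exactly the situation with requests + BeautifulSoup above: they only ever see the first string.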
Using requests_html to render JavaScript
Now, let’s use requests_html to run the JavaScript code in order to render the HTML we’re looking for.
# import HTMLSession from requests_html
from requests_html import HTMLSession

# create an HTML Session object
session = HTMLSession()

# Use the object above to connect to the needed webpage
resp = session.get("https://finance.yahoo.com/quote/NFLX/options?p=NFLX")

# Run JavaScript code on webpage
resp.html.render()
Similar to the requests package, we can use a session object to get the webpage we need. This gets stored in a response variable, resp. If you print out resp you should see a Response 200 message, which means the connection to the webpage was successful (otherwise you'll get a different message).
resp.html gives us an object that allows us to print out, search through, and perform several functions on the webpage's HTML. To simulate running the JavaScript code, we call the render method on that object. Note how we don't need to set a variable equal to the rendered result: running resp.html.render() stores the updated HTML as an attribute on resp.html. Specifically, we can access the rendered HTML as resp.html.html.
So resp.html.html now contains the HTML we need, including the option tags. From here, we can parse out the expiration dates from these tags using the find method.
option_tags = resp.html.find("option")
dates = [tag.text for tag in option_tags]
Similarly, if we wanted to search for other HTML tags we could just input whatever those are into the find method, e.g. anchor (a), paragraph (p), header tags (h1, h2, h3, etc.) and so on.
Alternatively, we could also use BeautifulSoup on the rendered HTML (see below). However, the awesome point here is that we can create the connection to this webpage, render its JavaScript, and parse out the resultant HTML all in one package!
soup = BeautifulSoup(resp.html.html, "lxml")
Lastly, we could scrape this particular webpage directly with yahoo_fin, which provides functions that wrap around requests_html specifically for Yahoo Finance’s website.
from yahoo_fin.options import get_expiration_dates
dates = get_expiration_dates(“nflx”)
Scraping options data for each expiration date
Once we have the expiration dates, we could proceed with scraping the data associated with each date. In this particular case, the pattern of the URL for each expiration date’s data requires the date be converted to Unix timestamp format. This can be done using the pandas package.
import pandas as pd

timestamps = [pd.Timestamp(date).timestamp() for date in dates]
sites = ["https://finance.yahoo.com/quote/NFLX/options?date=" +
         str(int(elt)) for elt in timestamps]
info = [pd.read_html(site) for site in sites]

calls_data = dict(zip(dates, [df[0] for df in info]))
puts_data = dict(zip(dates, [df[1] for df in info]))
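The date-to-Unix-timestamp conversion that pandas handles above can also be sketched with only the standard library. This assumes the date is parsed as midnight UTC, which is the convention the expiration-date URLs use; the "YYYY-MM-DD" input format here is an illustrative assumption, not necessarily the format Yahoo's drop-down text uses.

```python
from datetime import datetime, timezone

def to_unix_timestamp(date_string):
    """Convert a 'YYYY-MM-DD' date string to a Unix timestamp at midnight UTC."""
    dt = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())

# November 15, 2019 at 00:00 UTC
print(to_unix_timestamp("2019-11-15"))  # 1573776000
```

The integer result is what gets appended to each site URL, just like str(int(elt)) in the pandas version.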
Similarly, we could scrape this data using yahoo_fin. In this case, we just input the ticker symbol, NFLX, and the associated expiration date into either get_calls or get_puts to obtain the calls and puts data, respectively.
Note: here we don’t need to convert each date to a Unix timestamp as these functions will figure that out automatically from the input dates.
from yahoo_fin.options import get_calls, get_puts

calls_data = [get_calls("nflx", date) for date in dates]
puts_data = [get_puts("nflx", date) for date in dates]
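Note that these list comprehensions produce plain lists, while the earlier pd.read_html version produced dictionaries keyed by expiration date. To mirror that structure you could zip the dates back in; here is a minimal sketch with invented stand-in values in place of the real DataFrames:

```python
# Hypothetical stand-ins for the DataFrames returned by get_calls / get_puts.
dates = ["November 15, 2019", "November 22, 2019"]
calls_data = ["calls_df_1", "calls_df_2"]

# Pair each expiration date with its corresponding data, as in the earlier example.
calls_by_date = dict(zip(dates, calls_data))
print(calls_by_date["November 15, 2019"])  # calls_df_1
```

The same pairing works for puts_data, giving you per-date lookup for both option chains.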
That’s it for this post! To learn more about requests-html, check out my web scraping course on Udemy here!
To see the official documentation for requests_html, click here.
Data Science Skills: Web scraping javascript using python


There are different ways of scraping web pages using python. In my previous article, I gave an introduction to web scraping using the libraries requests and BeautifulSoup. However, many web pages are dynamic and use JavaScript to load their content. These websites often require a different approach to gather the data.
In this tutorial, I will present several different ways of gathering the content of a webpage that contains JavaScript. The techniques used will be the following:

  • Using selenium with the Firefox web driver
  • Using a headless browser with PhantomJS
  • Making an API call using a REST client or the python requests library

TL;DR: you can find the complete code for the examples of scraping JavaScript web pages covered in this tutorial.
Update November 7th 2019: Please note, the html structure of the webpage being scraped may be updated over time. This article initially reflected the structure at the time of publication in November 2018; it has now been updated to run with the current webpage, but in the future this may again change.
To start the tutorial, I first needed to find a website to scrape. Before proceeding with your web scraper, it is important to always check the Terms & Conditions and the Privacy Policy of the website you plan to scrape to ensure that you are not breaking any of their terms of use. When trying to find a suitable website to demonstrate, many of the examples I first looked at explicitly stated that web crawlers were prohibited.
It wasn't until reading an article about sugar content in yogurt, and wondering where I could find the latest nutritional information, that another train of thought suggested where I could find a suitable website: online retailers often have dynamic web pages that load content using JavaScript. So the aim of this tutorial is to scrape the nutritional information of yogurts from the web page of an online supermarket.
Setup. We will be using some new python libraries to access the content of the web pages and also to handle the data. These libraries will need to be installed using your usual python package manager, pip. If you don't already have beautifulsoup, you will need to install that here too:

pip install selenium
pip install pandas

To use selenium as a web driver, there are a few additional requirements:

Firefox
I will be using Firefox as the browser for my web driver, so you will either need to install Firefox to follow this tutorial, or alternatively you can use Chromium with chromedriver.

geckodriver
To use the web driver we need to install a web browser engine, geckodriver. You will need to download geckodriver for your OS, extract the file, and set the executable path. You can do this in several ways:
(i) move geckodriver to a directory of your choice and define this as the executable path in your python code (see later example);
(ii) move geckodriver to a directory which is already set as a directory where executable files are located; this is known as your environmental variable path. You can find out which directories are in your $PATH as follows:

Windows
Go to: Control Panel > Environmental Variables > System Variables > Path

Mac OSX / Linux
In your terminal use the command: echo $PATH

(iii) add the geckodriver location to your PATH environment variables:

Windows
Go to: Control Panel > Environmental Variables > System Variables > Path > Edit
Add the directory containing geckodriver to this list and save.

Mac OSX / Linux
Add a line to your .bash_profile (Mac OSX) or .bash_rc (Linux):

# add geckodriver to your PATH
export PATH="$PATH:/path/to/your/directory"

Restart your terminal and use the command from (ii) to check that your new path has been added.

PhantomJS
Similar to the steps for geckodriver, we also need to download PhantomJS. Once downloaded, unzip the file and move it to a directory of choice, or add it to your path executable, following the same instructions as above.

REST client
In the final part of this blog, we will make a request to an API using a REST client. I will be using Insomnia, but feel free to use whichever client you prefer! Following the standard steps outlined in my introductory tutorial on web scraping, I have inspected the webpage and want to extract the repeated HTML element:

As a first step, you might try using BeautifulSoup to extract this information with the following script:

# import libraries
import urllib.request
from bs4 import BeautifulSoup

# specify the url
urlpage = ''
print(urlpage)

# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(urlpage)

# parse the html using beautiful soup and store in variable 'soup'
soup = BeautifulSoup(page, 'html.parser')

# find product items
# at time of publication, Nov 2018:
# results = soup.find_all('div', attrs={'class': 'listing category_templates clearfix productListing'})
# updated Nov 2019:
results = soup.find_all('div', attrs={'class': 'co-product'})
print('Number of results', len(results))

Unexpectedly, when running the python script, the number of results returned is 0 even though I see many results on the web page!

Number of results 0

When further inspecting the page, there are many dynamic features on the web page which suggest that javascript is used to present these results. By right-clicking and selecting View Page Source there are many