How to Scrape Multiple Pages of a Website Using Python?
Web scraping is a method of extracting useful data from a website using computer programs, without having to do it manually. This data can then be exported and categorically organized for various purposes. Some common places where web scraping finds its use are market research and analysis websites, price comparison tools, search engines, and data collection for AI/ML projects.

Let's dive in and scrape a website. In this article, we are going to take the GeeksforGeeks website and extract the titles of all the articles available on the homepage using a Python script. If you notice, there are thousands of articles on the website, and to extract all of them we will have to scrape through all pages so that we don't miss out on any!

GeeksforGeeks Homepage

Scraping multiple pages of a website using Python

Now, there may arise various instances where you want to get data from multiple pages of the same website, or from multiple different URLs as well, and manually writing code for each webpage is a time-consuming and tedious task. Plus, it defies all basic principles of automation. Duh! To solve this exact problem, we will see two main techniques that will help us extract data from multiple webpages:

- The same website
- Different website URLs

Approach:

The approach of the program will be fairly simple, and it will be easier to understand in point format:

1. Import all the necessary libraries.
2. Set up our URL strings for making a connection using the requests library.
3. Parse the available data from the target page using the BeautifulSoup library's parser.
4. From the target page, identify and extract the classes and tags that contain the information valuable to us.
5. Prototype it for one page using a loop, and then apply it to all the pages.

Example 1: Looping through the page numbers

Page numbers at the bottom of the GeeksforGeeks website

Most websites have pages labeled from 1 to N. This makes it really simple for us to loop through these pages and extract data from them, as these pages have similar structures.
For example, notice the last section of the URL – page/4/. Here, we can see the page details at the end of the URL. Using this information we can easily create a for loop iterating over as many pages as we want (by putting page/(i)/ in the URL string and iterating "i" till N) and scrape all the useful data from them. The following code will give you more clarity over how to scrape data by using a for loop in Python.

import requests
from bs4 import BeautifulSoup as bs

req = requests.get(URL)
soup = bs(req.text, 'html.parser')
titles = soup.find_all('div', attrs={'class': 'head'})
print(titles[4].text)

Output:

Now, using the above code, we can get the titles of all the articles by just sandwiching those lines with a loop.

import requests
from bs4 import BeautifulSoup as bs

for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})
    for i in range(4, 19):
        if page > 1:
            print(f"{(i - 3) + page * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:

Note: The above code will fetch the first 10 pages from the website and scrape all the 150 titles of the articles that fall under those pages.

Example 2: Looping through a list of different URLs

The above technique is absolutely wonderful, but what if you need to scrape different pages, and you don't know their page numbers? You'll need to scrape those different URLs one by one, and manually code a script for every such webpage. Instead, you could just make a list of these URLs and loop through them. By simply iterating over the items in the list, i.e. the URLs, we will be able to extract the titles of those pages without having to write code for each page.
Here's an example of how you can do it in Python:

import requests
from bs4 import BeautifulSoup as bs

for url in range(0, 2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})
    for i in range(4, 19):
        if url + 1 > 1:
            print(f"{(i - 3) + url * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:

How to avoid getting your IP address banned?

Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with multiple requests within a very short amount of time will most likely result in getting your IP address blacklisted. To avoid this, we can simply carry out our crawling in short random bursts of time. In other words, we add pauses, or little breaks, between crawling periods. This helps us look like actual humans, since websites can easily identify a crawler by the speed it possesses compared to a human trying to visit the website. It also helps avoid unnecessary traffic and overloading of the website's servers. Win-win!

Now, how do we control the crawling rate? It's simple: by using two functions, randint() and sleep(), from the Python modules random and time respectively.

from random import randint
from time import sleep

print(randint(1, 10))

The randint() function will choose a random integer between the given lower and upper limits, in this case 1 and 10 respectively, for every iteration of the loop. Using the randint() function in combination with the sleep() function will help in adding short and random breaks to the crawling rate of the program. The sleep() function will basically cease the execution of the program for the given number of seconds. Here, the number of seconds will randomly be fed into the sleep function by using the randint() function. Use the code given below:

from time import sleep
from random import randint

for i in range(0, 3):
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')

Output:
5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds

To get a clear idea of this function in action, refer to the code given below:

import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})
    for i in range(4, 19):
        if page > 1:
            print(f"{(i - 3) + page * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)
    sleep(randint(2, 10))

Output: the program pauses its execution between requests and then resumes.
How to Scrape Multiple Pages of a Website Using a Python …
Extracting data and ensuring data quality

This is the second article of my web scraping guide. In the first article, I showed you how you can find, extract, and clean the data from one single web page. In this article, you'll learn how to scrape multiple web pages (a list that's 20 pages and 1,000 movies total) with a Python web scraper.

In the previous article, we scraped and cleaned the data of the title, year of release, imdb_ratings, metascore, length of movie, number of votes, and the us_gross earnings of all movies on the first page of IMDb's Top 1,000 list. This was the code we used:

And our results looked like this:

I'll be guiding you through these steps:

- You'll request the unique URLs for every page on this IMDb list.
- You'll iterate through each page using a for loop, and you'll scrape each movie one by one.
- You'll control the loop's rate to avoid flooding the server with requests.
- You'll extract, clean, and download this final dataset.
- You'll use basic data-quality best practices.

Here are the additional tools we'll use in our scraper:

- The sleep() function from Python's time module will control the loop's rate by pausing the execution of the loop for a specified amount of time.
- The randint() function from Python's random module will vary the amount of waiting time between requests, within your specified interval.

As mentioned in the first article, I recommend following along in an online environment if you don't already have an IDE. I'll also be writing out this guide as if we were starting fresh, minus all the first guide's explanations, so you aren't required to copy and paste the first article's code. You can compare the first article's code with this article's final code to see how it all worked; you'll notice a few slight differences. Alternatively, you can go straight to the code.

Now, let's begin!
Import tools

Let's import our previous tools and our new tools: time and random.

Initialize your storage

Like previously, we're going to continue to use our empty lists as storage for all the data we scrape.

English movie titles

After we initialize our storage, we should have the code that makes sure we get English-translated titles from all the movies we scrape.

Analyzing our URL

Let's go to the URL of the page we're scraping. Next, let's click on the next page and see what page 2's URL looks like:

And then page 3's URL:

What do we notice about the URL from page 2 to page 3? We notice &start=51 is added into the URL when we go to page 2, and the number 51 turns into 101 on page 3. This makes sense, because there are 50 movies on each page: page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.

Why is this important? This information will help us tell our loop how to go to the next page to scrape.

A refresher on for loops

Just like the loop we used to loop through each movie on the first page, we'll use a for loop to iterate through each page on the list. To refresh, this is how a for loop works:
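The &start pattern described above can be sketched as a quick standalone snippet. Note that the exact IMDb search URL here is an assumption for illustration; the real query string of the Top 1,000 list may differ:

```python
# Each results page shows 50 movies, so &start takes the values 1, 51, 101, ...
starts = list(range(1, 1001, 50))

# Hypothetical IMDb search URL; the real query string may look different.
urls = [f"https://www.imdb.com/search/title/?groups=top_1000&start={s}" for s in starts]

print(len(urls))   # 20 pages of 50 movies each
print(starts[:3])  # the first three start values
```

With these 20 URLs in hand, the for loop simply visits them one after another.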
Beautiful Soup Tutorial 2. – How to Scrape Multiple Web Pages
Scraping one web page is fun, but scraping more web pages is more fun. In this tutorial you’ll learn how to do just that; along the way you’ll also make good use of your collected data by doing some visualizations and analyses. While in the previous article you learned to crawl, now it’s time for you to stand up and learn to walk.
How to inspect URLs for web scraping
If you recall, in the previous part of this tutorial series we scraped only the first bestsellers page of Book Depository. Truth is, there are actually 34 pages of bestseller books that we can scrape:
Image source: Book Depository
Question: how do we scrape all 34 pages?
Answer: by first inspecting what’s happening in the URL when we switch pages.
This is the first page’s URL:
By going to the second page, you’ll notice that the URL changes to this:
The only difference is that ?page=2 has been appended to the base URL. Now let's check out what happens if we visit the third page: ?page=2 turned into ?page=3; can you see where I'm going with this?
It seems that by changing the number after page=, we can go to whichever page we want to. Let's try this out real quick by replacing 3 with 28:
See? It works like a charm.
But wait… what about the first page? It had no ?page=number in it!
Lucky for us, the base URL and the URL with ?page=1 are the same page with the same book results, so it seems that we've found a reliable solution that we can use to navigate between web pages by changing the URL.
Shortly I'll show you how you can bring this knowledge over to web scraping, but first a quick explanation for the curious minds out there as to what the heck this ?page=number thing is. The ? part of a URL signifies the start of the so-called query string. Anything that comes after the ? is the query string itself, which contains key-value pairs. In our case page is the key, and the number we assign to it is its value. By assigning a certain number to page, we are able to request the bestsellers page corresponding to that number. Easy-peasy. Now, let's put this knowledge to good use.
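To make the query-string idea concrete, here's a minimal sketch using Python's standard urllib; the domain is a placeholder, not the real bookstore URL:

```python
from urllib.parse import urlencode, urlparse, parse_qs

base = "https://example.com/bestsellers"  # placeholder domain for illustration
url = f"{base}?{urlencode({'page': 3})}"
print(url)  # the ?page=3 query string is appended to the base URL

# Parsing the URL back recovers the key-value pair from the query string
params = parse_qs(urlparse(url).query)
print(params)
```

urlencode builds the key-value part after the ?, and parse_qs reads it back, which is exactly what the server does when it decides which bestsellers page to return.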
Scraping multiple web pages with a while loop
To complete this tutorial, we’ll need to use the same libraries from the previous article, so don’t forget to import them:
from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
(Remember: %matplotlib inline is necessary for the later data visualizations to appear if you write your code in Jupyter Notebook.)
What we’ll do in this article will be very similar to what we’ve already accomplished so far, but with more data: we’ll analyze not 30, but 1020 books.
For this reason we’ll reuse (with some small modifications) the code we’ve already written to get the titles, formats, publication years and prices of the bestseller books. To scrape multiple pages, we’ll use a while loop and the page parameters in the URLs. Keep in mind that the bestsellers list is updated daily, so don’t freak out if you don’t get the same data that are shown in this tutorial.
For starters, it’s always a good idea to build your code up step by step, so if you run into an error, you’ll immediately know which part of your code needs some rethinking. As a first step we may want to check if we can get the first 5 bestsellers URLs:
page = 1
while page != 6:
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    print(url)
    page = page + 1
As the output attests, we’ve succeeded in our endeavour:
Here’s the breakdown of the code:
- We create the variable page that initially holds 1 as its value (because we want to start from the first bestsellers page).
- while page != 6: makes sure that our code stops running when page gets the value 6 (which would mean the sixth bestsellers page); because we're only interested in the first 5 pages, we won't be bothering with the sixth page.
- The variable url holds the bestsellers page's URL at every iteration in string format; we use an f-string that lets {page} receive the current value of page, so at the first iteration we have page=1, at the second iteration page=2, and so on up to the last URL.
- print(url) prints the current URL, so we can check if we get the results we intended to get.
- Then we increase the value of page by one at the end of every iteration.
Collecting all bestseller books’ titles
Let’s modify our while loop just a little bit so we can loop through all 34 bestsellers pages, and get every bestseller’s title:
titles = []
page = 1
while page != 35:
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    response = requests.get(url)
    html = response.content
    soup = bs(html, "lxml")
    for h3 in soup.find_all("h3", class_="title"):
        titles.append(h3.get_text(strip=True))
    page = page + 1
As you’ve noticed, this code is not so different from the first while loop:
- With while page != 35 we get all bestsellers pages, not just the first 5.
- response = requests.get(url), html = response.content, and soup = bs(html, "lxml") are parts that you're already familiar with: requesting pages, then creating a soup object from which we can extract the HTML content we need.
- We loop through all h3 elements with the class of title (for h3 in soup.find_all("h3", class_="title"):) to get the book titles.
- We add each book title to the titles list that we created before the while loop (titles.append(h3.get_text(strip=True)); strip=True removes whitespace).
If we check the length of titles, we get 1020 as the output, which is correct, because 30 books on a page and 34 pages (30*34) gives us 1020 books:
Let’s also print out the first 5 items of titles, just to check if we really managed to save the books’ titles:
I believe we’ve got what we wanted, so let’s move on.
Getting the formats of the books
Remember how we got the books’ formats in the previous tutorial? Let me paste the code here:
formats = soup.select("div.book-item p.format")
formats_series = pd.Series([f.get_text() for f in formats])
formats_series.value_counts()
We can reuse the same code in a while loop for all 34 pages (note that I’ve renamed formats to formats_on_page):
formats_all = []
page = 1
while page != 35:
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    response = requests.get(url)
    soup = bs(response.content, "lxml")
    formats_on_page = soup.select("div.book-item p.format")
    for product_format in formats_on_page:
        formats_all.append(product_format.get_text())
    page = page + 1

formats_series = pd.Series(formats_all)
formats_series.value_counts()
Running the above code will result in this output:
The logic is completely the same as in the case of book titles:
- We need a list (formats_all) where we can store the books' formats (paperback, hardback, etc.).
- In a while loop we request and create a BeautifulSoup representation of every page.
- At every iteration we find every HTML element that holds a book's format (formats_on_page = soup.select("div.book-item p.format")).
- Then we loop through every book format that we've found in the previous step (for product_format in formats_on_page:), adding their text content (for instance Paperback) to formats_all (formats_all.append(product_format.get_text())).
- Finally, with the help of good old pandas, we convert formats_all into a pandas series (formats_series = pd.Series(formats_all)), so we can count the number of occurrences of every book format (formats_series.value_counts()).
As you can see in the above screenshot, most bestseller books are paperback (761), which – I think – is not that surprising, but good to know nonetheless.
You may wonder, though, exactly what percentage of bestsellers are our 761 paperbacks?
normalize=True to the rescue!
You see, by adding normalize=True to .value_counts(), instead of exact numbers we get the relative frequencies of the unique values in formats_series. So the 761 paperback books constitute around 75% of all bestseller books – nice!
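As a quick illustration of normalize=True, using the counts mentioned above (761 paperbacks out of 1020 books; for simplicity the remaining 259 are all assumed to be hardback here):

```python
import pandas as pd

# 761 paperbacks from the article; the rest lumped together as hardback
formats_series = pd.Series(["Paperback"] * 761 + ["Hardback"] * 259)

print(formats_series.value_counts())                # absolute counts
print(formats_series.value_counts(normalize=True))  # relative frequencies
```

The second call returns fractions that sum to 1, which is where the "around 75% paperback" figure comes from.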
Following the same steps we can easily create a while loop for the publication years and prices as well.
But I won’t paste the code here, just so you can find the solution out for yourself (you know, practice makes perfect ).
(Hint: use a while loop and read the previous article’s “Getting the book formats” section to find the solution. Alternatively, later in this article the “Saving all scraped data into data-frames” section may also be of great help. )
However, I will show you what else we can do with some more data…
Visualizing bestseller books by publication year
Once you've created years_series and applied .value_counts() on it (in the previous section I showed you how you can do it through the example of formats_series), you'll have a pandas series object where the index column contains the publication years, and the corresponding values show the number of bestseller books published in that year (the screenshot doesn't contain the whole series):
years_series.value_counts() can be easily converted into a pandas dataframe object:

years_df = years_series.value_counts().to_frame().reset_index()
years_df.rename(columns={"index": "Year", 0: "Published books"}, inplace=True)
Your dataframe will appear like this:
In the above code, .to_frame() converts the series object into a dataframe; then .reset_index() creates a new index column (beginning from 0), so that the original index column (with the publication years) can become a normal column in the dataframe next to the counts column:
Then the rename() method takes care of renaming "index" and "0" to "Year" and "Published books", respectively.
Of course, a dataframe looks better than a series, but a bar chart looks even better than a dataframe:
As you can see, most bestseller books have been published this year (surprise, surprise), but there are also some gems from the 1990s.
Here’s the code with which you can reproduce a similar chart:
plt.figure(figsize=(14, 10))
plt.title("Number of bestseller books by publication year", fontsize=20)
plt.yticks(np.arange(0, 275, step=25))
plt.xticks(rotation=70)
plt.ylabel("Number of bestseller books", fontsize=16)
plt.xlabel("Publication year", fontsize=16)
plt.bar(years_df["Year"], years_df["Published books"], width=0.5, color="#EF6F6C", edgecolor="black")
plt.grid(color='#59656F', linestyle='--', linewidth=1, axis='y', alpha=0.7)
I won’t give you a deeper explanation regarding which line does what, but I do recommend that you check out Keith Galli’s and codebasics’ video on bar charts (and of course, the original matplotlib documentation).
Saving all scraped data into dataframes
In the introduction to web scraping article we created a histogram out of books’ prices; we won’t do that again based on all prices, because I’m sure that by now you can figure it out by yourself.
What I have in store for you this time is something more advanced.
What if we collected the title, the format, the publication year and the price data with one big while loop? Because in all honesty, there’s absolutely no need to scrape these data separately if we can do it in one go.
Doing so we can not only answer more interesting questions (What books are bestsellers today from the 1990s? ), but we can also make easier comparisons (for instance differences in pricing between paperback and hardback books).
First, let me show you the one big while loop that collects every piece of data we need, then I’ll explain it in detail how it works, and after that we’ll do some more analysis.
So here’s the code:
bestseller_books = []
page = 1

while page != 35:
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    response = requests.get(url)
    soup = bs(response.content, "lxml")
    for book in soup.find_all("div", class_="book-item"):
        bestseller_book = {}
        bestseller_book["title"] = book.find("h3", class_="title").get_text(strip=True)
        bestseller_book["format"] = book.find("p", class_="format").get_text()
        try:
            bestseller_book["year"] = book.find("p", class_="published").get_text()[-4:]
        except AttributeError:
            bestseller_book["year"] = ""
        try:
            price = book.find("p", class_="price")
            original_price = price.find("span", class_="rrp")
        except AttributeError:
            bestseller_book["price"] = ""
        else:
            if original_price:
                current_price = str(original_price.previousSibling).strip()
                current_price = float(current_price.split("€")[0].replace(",", "."))
            else:
                current_price = float(price.get_text(strip=True).split("€")[0].replace(",", "."))
            bestseller_book["price"] = current_price
        bestseller_books.append(bestseller_book)
    page = page + 1
Let me explain how the code works:
- The whole code is just one big while loop that loops through all bestseller pages.
- for book in soup.find_all("div", class_="book-item") finds every book on a given page; then each book's title, format, publication year, and price is saved into a bestseller_book dictionary one by one. Once a bestseller_book is fully created, it's added to the bestseller_books list.
- bestseller_book["title"] = book.find("h3", class_="title").get_text(strip=True) collects the book's title and saves it into bestseller_book.
- bestseller_book["format"] = book.find("p", class_="format").get_text() gets us the book's format and saves it into bestseller_book.
- bestseller_book["year"] = book.find("p", class_="published").get_text()[-4:] finds the publication year; if there's no publication information for a book, we receive an AttributeError: 'NoneType' object has no attribute 'get_text' error, so instead of letting this error stop our code from running, we assign the "" value to bestseller_book["year"] (this is handled by the try-except block).
- With price = book.find("p", class_="price") we search for the book's price; after this line of code we carry out the same steps we did in the previous article, then add each book's selling price to bestseller_book["price"]. If no price exists for a book, we add "" (this is done by the try-except-else block).
- Finally we have our bestseller_book, and add it to our bestseller_books list (bestseller_books.append(bestseller_book)).
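The price-cleaning step (take the part before the € sign and swap the decimal comma for a dot) can be tried on its own; the sample price strings below are invented to match the format described:

```python
def parse_price(price_text):
    # "12,34 €" -> 12.34 : keep everything before the € sign,
    # replace the decimal comma with a dot, then convert to float
    return float(price_text.split("€")[0].replace(",", ".").strip())

print(parse_price("12,34 €"))
print(parse_price("9,13 €"))
```

This is the same split/replace/float chain used inside the else branch of the loop above.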
Let’s print out the first three books of bestseller_books to quickly check what we’ve just accomplished:
See? We have all the information we need for every book in one place!
Why is it useful?
Because we can create a pandas dataframe out of it:
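For instance, a couple of hand-made records in the same shape the loop produces turn into a dataframe like this (the book data here is invented purely for illustration):

```python
import pandas as pd

# Invented records mimicking the dictionaries built by the scraping loop
bestseller_books = [
    {"title": "Sample Book A", "format": "Paperback", "year": "1998", "price": 12.4},
    {"title": "Sample Book B", "format": "Hardback", "year": "2020", "price": 55.0},
]

books_df = pd.DataFrame(bestseller_books)
print(books_df)
```

Each dictionary becomes a row, and the dictionary keys become the column names.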
And then we can easily answer questions like what books are bestsellers from – let’s say – 1998:
Or which books cost more than 50 €:
books_with_prices_df = books_df[books_df["price"] != ""] keeps only books that have available price information; then books_with_prices_df[books_with_prices_df["price"] > 50].head() shows the first five books above 50 €.
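A toy version of that filtering, with invented titles and prices, looks like this:

```python
import pandas as pd

books_df = pd.DataFrame({
    "title": ["Book A", "Book B", "Book C"],
    "price": [12.4, "", 55.0],  # "" marks a missing price, as in the scraper
})

# Keep only rows with an actual price, then filter for books above 50 EUR
books_with_prices_df = books_df[books_df["price"] != ""]
expensive_df = books_with_prices_df[books_with_prices_df["price"] > 50]
print(expensive_df)
```

The first mask drops the row with the empty-string price, so the numeric comparison in the second step only ever sees real numbers.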
If you need a refresher on how pandas works, please read this tutorial series on the topic.
Comparing paperback and hardback books with boxplots
I’ve got one last cool visualization for you, and again, we’ll use the data from books_with_prices_df:
First, we’ll create two new dataframes out of books_with_prices_df: one that contains only paperback books (paperback_df), and another one for hardback books (hardback_df):
paperback_df = books_with_prices_df[books_with_prices_df["format"] == "Paperback"]
hardback_df = books_with_prices_df[books_with_prices_df["format"] == "Hardback"]
paperback_df = paperback_df.astype({"price": float})
hardback_df = hardback_df.astype({"price": float})
(We convert the string values of the price column into float type with .astype({"price": float}).)
Let’s do a. describe() on both dataframes to summarize our data:
You can see that we have 758 (count) paperback books and 192 (count) hardback books. You may also notice that:
- You can expect to buy a bestseller paperback book for an average price (mean) of 14.4 €, but for a hardback book you'd have to pay 22.14 €.
- The cheapest paperback book is 6.7 €, while its hardback counterpart sells for 9.13 €.
- Interestingly, the most expensive (max) paperback book (147 €) costs more than the most expensive hardback book (105.3 €).
We can also visualize these data with boxplots (note: I’ve added the texts (like Q3 (75%) manually next to the boxplots to make the boxplots more understandable):
Boxplots are five-number summaries of datasets that show the minimum, the maximum, the median, the first quartile, and the third quartile of a dataset. At a quick glance we can establish, for example, that paperback books have a lower median than hardback books, and that the cheapest paperback book is cheaper than the cheapest hardback book. Basically we can find the same information that we got with the .describe() method above.
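The five numbers a boxplot shows can also be computed directly; the prices below are illustrative values only, loosely echoing the statistics above:

```python
import numpy as np

# Illustrative prices, not the real scraped data
prices = np.array([6.7, 9.9, 12.4, 14.4, 18.0, 22.1, 147.0])

q1, median, q3 = np.percentile(prices, [25, 50, 75])
print(prices.min(), q1, median, q3, prices.max())
```

These are exactly the minimum, quartiles, and maximum that the box, whiskers, and median line of a boxplot represent.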
If you want to learn more about boxplots, watch this video and read this article.
Anyway, you can create the above boxplots with a few lines of code:
plt.figure(figsize=(8, 5))
plt.yticks(np.arange(0, 40, step=5))
plt.grid()
plt.boxplot([paperback_df["price"], hardback_df["price"]], labels=["Paperback", "Hardback"], showfliers=False)
(showfliers=False removes the outliers from the data; that's why the data on the boxplots differ from what .describe() shows us.)
If you’d like to read more about easy ways to summarize datasets, read this article on statistical averages, and this one on statistical variability.
Conclusion
Huh… we’ve covered a lot of ground. But it was worth it! After finishing this article (and coding along, of course) now you have enough knowledge to scrape multiple pages, and collect some basic data.
Feel free to leave a comment if you have a question or just would like to chat about web scraping. And don’t forget to subscribe to Tomi Mester’s newsletter, so you’ll be notified when the next article comes out in this web scraping series (we’ll be doing more advanced stuff, pinky promise).
Until then, keep practicing.
If you want to learn more about how to become a data scientist, take Tomi Mester's 50-minute video course: How to Become a Data Scientist. (It's free!) Also check out the 6-week online course: The Junior Data Scientist's First Month video course.
Cheers, Tamas Ujhelyi