How To Scrape All Pages From A Website

How to Scrape Multiple Pages of a Website Using Python?

Web scraping is a method of extracting useful data from a website using computer programs, without having to do it manually. This data can then be exported and categorically organized for various purposes. Some common places where web scraping finds its use are market research and analysis websites, price-comparison tools, search engines, and data collection for AI/ML projects. Let's dive in and scrape a website. In this article, we are going to take the GeeksforGeeks website and extract the titles of all the articles available on the homepage using a Python script. If you notice, there are thousands of articles on the website, and to extract all of them we will have to scrape through all pages so that we don't miss out on any!

Scraping multiple pages of a website using Python

Now, there may arise various instances where you may want to get data from multiple pages of the same website, or from multiple different URLs as well, and manually writing code for each webpage is a time-consuming and tedious task. Plus, it defies all basic principles of automation. To solve this exact problem, we will see two main techniques that will help us extract data from multiple webpages: the same website, and different website URLs.

Approach:

The approach of the program will be fairly simple, and it will be easier to understand in point format:

1. We'll import all the necessary modules.
2. Set up our URL strings for making a connection using the requests library.
3. Parse the available data from the target page using the BeautifulSoup library's parser.
4. From the target page, identify and extract the classes and tags which contain the information that is valuable to us.
5. Prototype it for one page using a loop, and then apply it to all the pages.

Example 1: Looping through the page numbers

Most websites have pages labeled from 1 to N. This makes it really simple for us to loop through these pages and extract data from them, as these pages have similar structures.
For example, notice the last section of the URL: page/4/. Here we can see the page number at the end of the URL. Using this information we can easily create a for loop iterating over as many pages as we want (by putting page/(i)/ in the URL string and iterating i up to N) and scrape all the useful data from them. The following code will give you more clarity on how to scrape data by using a for loop in Python:

import requests
from bs4 import BeautifulSoup as bs

# Base page of the target site (the exact URL string was lost in extraction;
# the GeeksforGeeks page URL pattern is assumed here).
URL = 'https://www.geeksforgeeks.org/page/1/'

req = requests.get(URL)
soup = bs(req.text, 'html.parser')

titles = soup.find_all('div', attrs={'class': 'head'})
print(titles[4].text)

Output:

Output for the above code

Now, using the above code, we can get the titles of all the articles by just sandwiching those lines with a loop:

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 11):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})
    for i in range(4, 19):
        if page > 1:
            print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:

Output for the above code

Note: The above code will fetch the first 10 pages from the website and scrape all 150 of the article titles that fall under those pages.

Example 2: Looping through a list of different URLs

The above technique is absolutely wonderful, but what if you need to scrape different pages, and you don't know their page numbers? You would need to scrape those different URLs one by one, and manually code a script for every such webpage. Instead, you could just make a list of these URLs and loop through them. By simply iterating over the items in the list, i.e. the URLs, we will be able to extract the titles of those pages without having to write code for each page.
Here's an example of how you can do it in Python:

import requests
from bs4 import BeautifulSoup as bs

# List of the different pages to scrape (placeholder values; fill in
# whichever URLs you need).
URL = ['https://www.geeksforgeeks.org/page/1/',
       'https://www.geeksforgeeks.org/page/2/']

for url in range(0, 2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})
    for i in range(4, 19):
        if url + 1 > 1:
            print(f"{(i - 3) + url * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)

Output:

Output for the above code

How to avoid getting your IP address banned?

Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with multiple requests within a very short amount of time will most likely result in getting your IP address blacklisted. To avoid this, we can simply carry out our crawling in short random bursts of time. In other words, we add pauses, or little breaks, between crawling periods, which help us look like actual humans, since websites can easily identify a crawler by the speed it possesses compared to a human trying to visit the website. This also helps avoid unnecessary traffic and overloading of the website's servers. Win-win!

Now, how do we control the crawling rate? It's simple: by using two functions, randint() and sleep(), from the Python modules random and time respectively.

from random import randint
from time import sleep

print(randint(1, 10))

The randint() function will choose a random integer between the given lower and upper limits, in this case 1 and 10 respectively, for every iteration of the loop. Using the randint() function in combination with the sleep() function will help in adding short and random breaks in the crawling rate of the program. The sleep() function will basically cease the execution of the program for the given number of seconds. Here, the number of seconds will randomly be fed into the sleep function by using the randint() function. Use the code given below:

from random import randint
from time import sleep

for i in range(0, 3):
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')

Output:

5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds

To get a clear idea of this function in action, refer to the code given below:

import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 11):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})
    for i in range(4, 19):
        if page > 1:
            print(f"{(i - 3) + (page - 1) * 15}" + titles[i].text)
        else:
            print(f"{i - 3}" + titles[i].text)
    sleep(randint(2, 10))

Output:

The program pauses its execution between pages and then resumes.
How to Scrape Multiple Pages of a Website Using a Python ...

Extracting data and ensuring data quality

This is the second article of my web scraping guide. In the first article, I showed you how you can find, extract, and clean the data from one single web page. In this article, you'll learn how to scrape multiple web pages (a list that's 20 pages and 1,000 movies total) with a Python web scraper.

In the previous article, we scraped and cleaned the data of the title, year of release, imdb_ratings, metascore, length of movie, number of votes, and the us_gross earnings of all movies on the first page of IMDb's Top 1,000.

I'll be guiding you through these steps:

- You'll request the unique URLs for every page on this IMDb list.
- You'll iterate through each page using a for loop, and you'll scrape each movie one by one.
- You'll control the loop's rate to avoid flooding the server with requests.
- You'll extract, clean, and download this final data.
- You'll use basic data-quality best practices.

Here are the additional tools we'll use in our scraper:

- The sleep() function from Python's time module will control the loop's rate by pausing the execution of the loop for a specified amount of time.
- The randint() function from Python's random module will vary the amount of waiting time between requests, within your specified interval.

As mentioned in the first article, I recommend following along in an online IDE environment if you don't already have an IDE. I'll also be writing out this guide as if we were starting fresh, minus all the first guide's explanations, so you aren't required to copy and paste the first article's code. You can compare the first article's code with this article's final code to see how it all worked; you'll notice a few slight differences. Alternatively, you can go straight to the code. Now, let's begin!
Import tools

Let's import our previous tools and our new tools: time and random.

Initialize your storage

Like previously, we're going to continue to use our empty lists as storage for all the data we scrape.

English movie titles

After we initialize our storage, we should have our code that makes sure we get English-translated titles from all the movies we scrape.

Analyzing our URL

Let's go to the URL of the page we're scraping. Then let's click on the next page and see what page 2's URL looks like, and then page 3's URL. What do we notice about the URL from page 2 to page 3? We notice &start=51 is added to the URL when we go to page 2, and the number 51 turns into the number 101 on page 3. This makes sense because there are 50 movies on each page: page 1 is 1-50, page 2 is 51-100, page 3 is 101-150, and so on.

Why is this important? This information will help us tell our loop how to go to the next page to scrape.

A refresher on for loops

Just like the loop we used to loop through each movie on the first page, we'll use a for loop to iterate through each page on the list. To refresh, this is how a for loop works:

for <var> in <iterable>:
    <statement(s)>

<iterable> is a collection of objects, e.g. a list or tuple. The <statement(s)> are executed once for each item in <iterable>. The loop variable <var> takes on the value of the next element in <iterable> each time through the loop.

As I mentioned earlier, each page's URL follows a certain logic as the web pages change. To make the URL requests, we'd have to vary the value of the page parameter, like this:

pages = np.arange(1, 1001, 50)

Breaking down the URL parameters:

- pages is the variable we create to store our page-parameter array for our loop to iterate through.
- np.arange(1, 1001, 50) is a function in the NumPy Python library, and it takes four arguments, but we're only using the first three, which are start, stop, and step. step is the number that defines the spacing between each value. So: start at 1, stop at 1001, and step by 50.
- Start at 1: this will be our first page's value.
- Stop at 1001: why stop at 1001? The number in the stop parameter is the number that defines the end of the array, but it isn't included in the array.
The last page for movies would be at the URL number of 951. This page has movies 951-1000. If we used 951 as the stop, it wouldn't include this page in our scraper, so we have to go one step further to make sure we get the last one.

- Step at 50: we want the URL number to change by 50 each time the loop comes around; this parameter tells it to do so.

Now we need to create another for loop that'll loop our scraper through the pages array we created above, which covers each different URL we need. We can do this simply like this:

for page in pages:

Breaking this loop down:

- page is the variable that'll iterate through our pages array.
- pages is the array we created: np.arange(1, 1001, 50).

Inside this new loop is where we'll request our new URLs, add our html_soup (helps us parse the HTML files), and add our movie_div (stores each div container we're scraping).

Breaking page down:

- page is the variable we're using which stores each of our new values.
- requests.get() is the method we use to grab the contents of each URL.
- The base URL string is the part of the URL that stays the same when we change each page.
- + str(page) tells the request to add each iteration of page (the variable we're using to change the page number of the URL) into the URL request. It also makes sure it's a string we're using, not an integer or float, because it's a URL link we're building.
- + "&ref_=adv_nxt" is added to the end of every URL because this part also does not change when we go to the next page.
- headers=headers tells our scraper to bring us English-translated content from the URLs we're requesting.

Breaking soup down:

- soup is the variable we create to assign the BeautifulSoup object to.
- BeautifulSoup(page.text, "html.parser") grabs the text contents of page and uses the HTML parser; this allows Python to read the components of the page rather than treating it as one long string.

Breaking movie_div down:

- movie_div is the variable we use to store all of the div containers with a class of lister-item mode-advanced.
- The find_all() method extracts all the div containers that have a class attribute of lister-item mode-advanced from what we've stored in our variable soup.

Controlling the crawl rate is beneficial for the scraper and for the website we're scraping. If we avoid hammering the server with a lot of requests all at once, then we're much less likely to get our IP address banned, and we also avoid disrupting the activity of the website we scrape by allowing the server to respond to other user requests as well.

We'll be adding this code to our new for loop:

sleep(randint(2, 10))

Breaking crawl rate down:

- The sleep() function will control the loop's rate by pausing the execution of the loop for a specified amount of time.
- The randint(2, 10) function will vary the amount of waiting time between requests, for a number between 2 and 10 seconds. You can change these parameters to any that you like.

Please note that this will delay the time it takes to grab all the data we need from every page, so be patient. There are 20 pages with a max of 10 seconds per loop, so it'd take a max of about 3.5 minutes to get all of the data with this method. It's very important to practice good scraping and to scrape responsibly!
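Putting the breakdowns above together, the page loop might look like the sketch below. The exact IMDb base URL and the headers dict were stripped from the original post, so the strings here are assumptions reconstructed from the surrounding description, not the article's original gist:

```python
import numpy as np
import requests
from bs4 import BeautifulSoup
from random import randint
from time import sleep

# Assumed base URL for IMDb's Top 1000 list (the article's exact string
# was lost in extraction).
BASE_URL = "https://www.imdb.com/search/title/?groups=top_1000&start="
headers = {"Accept-Language": "en-US, en;q=0.5"}  # ask for English titles

pages = np.arange(1, 1001, 50)  # 1, 51, 101, ..., 951

def page_url(start):
    """Build the URL for the page whose first movie is number `start`."""
    return BASE_URL + str(start) + "&ref_=adv_nxt"

def scrape_all_pages():
    for page in pages:
        response = requests.get(page_url(page), headers=headers)
        html_soup = BeautifulSoup(response.text, "html.parser")
        # each container holds one movie's data
        movie_div = html_soup.find_all("div", class_="lister-item mode-advanced")
        # extraction of title, year, metascore, etc. goes here
        sleep(randint(2, 10))  # pause 2-10 seconds between pages
```

Calling scrape_all_pages() would then fetch all 20 pages with a random pause between requests.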
Our code should now look like this: we can add our scraping for loop code into our new for loop.

I'd like to point out a slight error I made in the previous article, a mistake regarding the cleaning of the metascore data. I received a DM from an awesome dev who was running through my article and coding along, but with a different IMDb URL than the one I used to teach in the article.

In the extracting-metascore-data code, we wrote an extraction that says: if there is Metascore data there, grab it, but if the data is missing, then put a dash there instead. And in the cleaning of the metascore data, we wrote code that turns this pandas object into an integer data type. That worked for the URL I scraped because I didn't have any missing Metascore data, e.g. no dashes in place of missing values.

What I failed to notice is that if someone scraped a different IMDb page than I did, they'd possibly have missing metascore data there, and once we scrape multiple pages in this guide, we'll have missing metascore data as well.

What does this mean? It means when we do get those dashes in place of missing data, we can't use (int) to convert the entire metascore column into an integer like I previously did; this would produce an error. We'd need to turn our metascore data into a float data type (decimal) instead. So instead of the old metascore cleaning code, we'll use this:

movies['metascore'] = movies['metascore'].str.extract('(\d+)')
movies['metascore'] = pd.to_numeric(movies['metascore'], errors='coerce')

Breaking down the new cleaning of the Metascore data:

Top-cleaning code:
- movies['metascore'] is our Metascore data in our movies DataFrame; we'll be assigning our new cleaned-up data to it.
- .str.extract('(\d+)') says to extract all the digits in the string.

Bottom-conversion code:
- movies['metascore'] is stripped of the elements we don't need, and now we assign the conversion result to it to finish up.
- pd.to_numeric is a method we use to change this column to a float.
The reason we use this is because we have a lot of dashes in this column, and we can't just convert it to a float using (float); this would catch an error. errors='coerce' will transform the nonnumeric values, our dashes, into not-a-number (NaN) values, because we have dashes in place of the data that's missing.

Let's add our DataFrame and cleaning code to our new scraper, which will go below our loops. If you have any questions regarding how this code works, go to the first article to see what each line does.

We have all the elements of our scraper ready; now it's time to save all the data we're about to scrape into our CSV file. Here is the code you can add to the bottom of your program to save your data to a CSV file:

movies.to_csv('movies.csv')

In case you need a refresher: if you're in an online IDE, you can create an empty CSV file by hovering near "Files" and clicking the "Add file" option. Name it, save it with a .csv extension, and add the to_csv line to the end of your program. If we run and save our program, we should get a file with a list of movies and all the data from 0-999.

Here, I'll discuss some basic data-quality tricks you can use when cleaning your data. You don't need to apply any of this to our final data. Usually, a dataset with a lot of missing data isn't a good dataset at all. Below are ways we can look up, manipulate, and change our data, for future reference.

Missing data

One of the most common problems in a dataset is missing data. In our case, the data simply wasn't available. There are a couple of ways to check and deal with missing data:

- Check where we're missing data and how much is missing.
- Add in a default value for the missing data.
- Delete the rows that have missing data.
- Delete the columns that have a high incidence of missing data.

We'll go through each of these in turn.

Checking missing data:

We can easily check for missing data (e.g. with pandas' isnull()). The output shows us where the data is missing and how much data is missing.
We have 165 missing values in metascore and 161 missing in us_grossMillions: a total of 326 missing data points in our dataset.

Adding a default value for missing data:

If you wanted to change your NaN values to something else specific, you can do so with fillna(). For this example, I want the words "None Given" in place of metascore NaN values and empty quotes (nothing) in place of us_grossMillions NaN values. If you print those columns, you can see our NaN values have been changed as specified.

Beware: our metascore column was an int, and our us_grossMillions column was a float prior to this change, and you can see how they're both objects now because of the change. Be careful when changing your data, and always check to see what your data types are when making any changes.

Deleting rows with missing data:

Sometimes the best route to take when having a lot of missing data is to just remove the rows altogether.

Deleting columns with missing data:

Sometimes when we have too many missing values in a column, it's best to get rid of the whole column. axis=1 is the parameter we use; it means to operate on columns, not rows (axis=0 means rows). We could've used this parameter in our delete-rows section, but the default is already 0, so I didn't. how='any' means that if any NA values are present, that column is dropped.

There you have it! We've successfully extracted data on the top 1,000 best movies of all time on IMDb, which included multiple pages, and saved it into a CSV file. I hope you enjoyed building a Python scraper. If you followed along, let me know how it went. Happy coding!
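Since the article's code snippets for these four operations were lost in extraction, here is a sketch of all of them on a toy DataFrame (the column names follow the article; the sample values are made up):

```python
import numpy as np
import pandas as pd

# Stand-in for the scraped movies DataFrame.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "metascore": [87.0, np.nan, 64.0],
    "us_grossMillions": [100.1, np.nan, np.nan],
})

# 1. Check where data is missing and how much:
print(movies.isnull().sum())

# 2. Fill in default values for the missing data:
filled = movies.fillna({"metascore": "None Given", "us_grossMillions": ""})

# 3. Delete the rows that have missing data:
no_missing_rows = movies.dropna()

# 4. Delete columns with missing data (axis=1 operates on columns;
#    how="any" drops a column if any NA value is present):
no_missing_cols = movies.dropna(axis=1, how="any")
```

Note how, as the article warns, filling a numeric column with a string turns its dtype into object.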
Beautiful Soup Tutorial 2. – How to Scrape Multiple Web Pages

Scraping one web page is fun, but scraping more web pages is more fun. In this tutorial you’ll learn how to do just that; along the way you’ll also make good use of your collected data by doing some visualizations and analyses. While in the previous article you learned to crawl, now it’s time for you to stand up and learn to walk.
How to inspect URLs for web scraping
If you recall, in the previous part of this tutorial series we scraped only the first bestsellers page of Book Depository. Truth is, there are actually 34 pages of bestseller books that we can scrape:
Image source: Book Depository
Question: how do we scrape all 34 pages?
Answer: by first inspecting what’s happening in the URL when we switch pages.
This is the first page’s URL:
By going to the second page, you’ll notice that the URL changes to this:
The only difference is that ?page=2 has been appended to the base URL. Now let's check out what happens if we visit the third page: ?page=2 turned into ?page=3; can you see where I'm going with this?
It seems that by changing the number after page=, we can go to whichever page we want to. Let's try this out real quick by replacing 3 with 28:
See? It works like a charm.
But wait… what about the first page? It had no ?page=number in it!
Lucky for us, the page-less URL and the ?page=1 URL are the same page with the same book results, so it seems that we've found a reliable solution that we can use to navigate between web pages by changing the URL.
Shortly I'll show you how you can bring this knowledge over to web scraping, but first a quick explanation to the curious minds out there as to what the heck this ?page=number thing is.

The ? part of a URL signifies the start of the so-called query string. Anything that comes after the ? is the query string itself, which contains key-value pairs. In our case page is the key, and the number we assign to it is its value. By assigning a certain number to page, we are able to request the bestsellers page corresponding to that number. Easy-peasy. Now, let's put this knowledge to good use.
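The query-string mechanics described above can be verified with Python's standard urllib.parse module (the Book Depository URL here is an assumption for illustration):

```python
from urllib.parse import urlparse, parse_qs, urlencode

url = "https://www.bookdepository.com/bestsellers?page=3"

parsed = urlparse(url)
print(parsed.query)            # the query string itself: "page=3"
print(parse_qs(parsed.query))  # its key-value pairs: {'page': ['3']}

# Building the URL for any page is just swapping the value of the key:
base = "https://www.bookdepository.com/bestsellers"
print(base + "?" + urlencode({"page": 28}))
```

This is exactly what the f-string in the while loop below does by hand.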
Scraping multiple web pages with a while loop
To complete this tutorial, we’ll need to use the same libraries from the previous article, so don’t forget to import them:
from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
(Remember: %matplotlib inline is necessary for the later data visualizations to appear if you write your code in Jupyter Notebook.)
What we’ll do in this article will be very similar to what we’ve already accomplished so far, but with more data: we’ll analyze not 30, but 1020 books.
For this reason we’ll reuse (with some small modifications) the code we’ve already written to get the titles, formats, publication years and prices of the bestseller books. To scrape multiple pages, we’ll use a while loop and the page parameters in the URLs. Keep in mind that the bestsellers list is updated daily, so don’t freak out if you don’t get the same data that are shown in this tutorial.
For starters, it’s always a good idea to build your code up step by step, so if you run into an error, you’ll immediately know which part of your code needs some rethinking. As a first step we may want to check if we can get the first 5 bestsellers URLs:
page = 1

while page != 6:
    # The bestsellers URL with the page query parameter; the exact link was
    # stripped from the article, so the Book Depository URL is assumed here.
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    print(url)
    page = page + 1
As the output attests, we’ve succeeded in our endeavour:
Here’s the breakdown of the code:
we create the variable page that initially holds 1 as its value (because we want to start from the first bestsellers page); while page != 6: makes sure that our code stops running when page gets the value 6 (which would mean the sixth bestsellers page); because we're only interested in the first 5 pages, we won't be bothering with the sixth page; the variable url will hold the bestsellers page's URL at every iteration in a string format; we use f-strings so that {page} receives the current value of page (at the first iteration we have page=1, at the second page=2, and at the last page=5); print(url) prints the current URL, so we can check if we get the results we intended to get; then we increase the value of page by one at the end of every iteration.
Do you like the article so far? If so, you'll love this 6-week data science course on Data36: The Junior Data Scientist's First Month. It's a 6-week simulation of being a junior data scientist at a true-to-life startup. Go check it out!
Collecting all bestseller books’ titles
Let’s modify our while loop just a little bit so we can loop through all 34 bestsellers pages, and get every bestseller’s title:
titles = []
page = 1

while page != 35:
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    response = requests.get(url)
    html = response.content
    soup = bs(html, "lxml")
    for h3 in soup.find_all("h3", class_="title"):
        titles.append(h3.get_text(strip=True))
    page = page + 1
As you’ve noticed, this code is not so different from the first while loop:
with while page != 35 we get all bestsellers pages, not just the first 5; response = requests.get(url), html = response.content, and soup = bs(html, "lxml") are parts that you're already familiar with (requesting pages, then creating a soup object from which we can extract the HTML content we need); we loop through all h3 elements with the class of title (for h3 in soup.find_all("h3", class_="title"):) to get the book titles; and we add each book title (titles.append(h3.get_text(strip=True)); strip=True removes whitespace) to the titles list that we created before the while loop.
If we check the length of titles, we get 1020 as the output, which is correct, because 30 books on a page and 34 pages (30*34) gives us 1020 books:
Let’s also print out the first 5 items of titles, just to check if we really managed to save the books’ titles:
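If you're coding along, the check might look like this (a stand-in list is used here so the snippet runs on its own; your real titles will differ, since the bestsellers list changes daily):

```python
# `titles` is the list filled by the while loop above; placeholder values here.
titles = [f"Book {i}" for i in range(1, 1021)]

print(len(titles))   # 1020: 30 books per page times 34 pages
print(titles[:5])    # the first 5 titles
```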
I believe we’ve got what we wanted, so let’s move on.
Getting the formats of the books
Remember how we got the books’ formats in the previous tutorial? Let me paste the code here:
formats = soup.find_all("p", class_="format")
formats_series = pd.Series([f.get_text() for f in formats])
formats_series.value_counts()
We can reuse the same code in a while loop for all 34 pages (note that I’ve renamed formats to formats_on_page):
formats_all = []
page = 1

while page != 35:
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    response = requests.get(url)
    html = response.content
    soup = bs(html, "lxml")
    formats_on_page = soup.find_all("p", class_="format")
    for product_format in formats_on_page:
        formats_all.append(product_format.get_text())
    page = page + 1

formats_series = pd.Series(formats_all)
formats_series.value_counts()
Running the above code will result in this output:
The logic is completely the same as in the case of book titles:
we need a list (formats_all) where we can store the books' formats (paperback, hardback, etc.); in a while loop we request and create a BeautifulSoup representation of every page; at every iteration we find every HTML element that holds a book's format (formats_on_page = soup.find_all("p", class_="format")); then we loop through (for product_format in formats_on_page:) every book format that we've found in the previous step, just to add their text content (for instance Paperback) to formats_all (formats_all.append(product_format.get_text())); and finally, with the help of good old pandas, we convert formats_all into a pandas series (formats_series = pd.Series(formats_all)), so we can count the number of occurrences of every book format (formats_series.value_counts()).
As you can see in the above screenshot, most bestseller books are paperback (761), which – I think – is not that surprising, but good to know nonetheless.
You may wonder, though, exactly what percentage of bestsellers are our 761 paperbacks?
normalize=True to the rescue!
You see, by adding normalize=True to .value_counts(), instead of exact numbers we get the relative frequencies of the unique values in formats_series. So the 761 paperback books constitute around 75% of all bestseller books – nice!
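Here's what that looks like on a small stand-in series (made-up values, just to show the mechanics):

```python
import pandas as pd

formats_series = pd.Series(["Paperback"] * 3 + ["Hardback"])

# Absolute counts: Paperback 3, Hardback 1.
print(formats_series.value_counts())

# Relative frequencies: Paperback 0.75, Hardback 0.25.
print(formats_series.value_counts(normalize=True))
```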
Following the same steps we can easily create a while loop for the publication years and prices as well.
But I won’t paste the code here, just so you can find the solution out for yourself (you know, practice makes perfect ).
(Hint: use a while loop and read the previous article’s “Getting the book formats” section to find the solution. Alternatively, later in this article the “Saving all scraped data into data-frames” section may also be of great help. )
However, I will show you what else we can do with some more data…
Visualizing bestseller books by publication year
Once you've created years_series and applied .value_counts() on it (in the previous section I showed you how you can do it through the example of formats_series), you'll have a pandas series object where the index column contains the publication years, and the corresponding values show the number of bestseller books published in that year (the screenshot doesn't contain the whole series):
years_series.value_counts() can be easily converted into a pandas dataframe object:

years_df = years_series.value_counts().to_frame().reset_index()
years_df.rename(columns={"index": "Year", 0: "Published books"}, inplace=True)
Your dataframe will appear like this:
In the above code, .to_frame() converts the series object into a dataframe, then .reset_index() creates a new index column (beginning from 0), so that the original index column (with the publication years) can be created as a normal column in the dataframe next to the books column.

Then the rename() method takes care of renaming "index" and "0" to "Year" and "Published books", respectively.
Of course, a dataframe looks better than a series, but a bar chart looks even better than a dataframe:
As you can see, most bestseller books have been published this year (surprise, surprise ), but there’s also some gems from the 1990s.
Here’s the code with which you can reproduce a similar chart:
plt.figure(figsize=(14, 10))
plt.title("Number of bestseller books by publication year", fontsize=20)
plt.yticks(np.arange(0, 275, step=25))
plt.xticks(rotation=70)
plt.ylabel("Number of bestseller books", fontsize=16)
plt.xlabel("Publication year", fontsize=16)
plt.bar(years_df["Year"], years_df["Published books"], width=0.5, color="#EF6F6C", edgecolor="black")
plt.grid(color="#59656F", linestyle="--", linewidth=1, axis="y", alpha=0.7)
I won’t give you a deeper explanation regarding which line does what, but I do recommend that you check out Keith Galli’s and codebasics’ video on bar charts (and of course, the original matplotlib documentation).
Saving all scraped data into dataframes
In the introduction to web scraping article we created a histogram out of books’ prices; we won’t do that again based on all prices, because I’m sure that by now you can figure it out by yourself.
What I have in store for you this time is something more advanced.
What if we collected the title, the format, the publication year and the price data with one big while loop? Because in all honesty, there’s absolutely no need to scrape these data separately if we can do it in one go.
Doing so we can not only answer more interesting questions (What books are bestsellers today from the 1990s? ), but we can also make easier comparisons (for instance differences in pricing between paperback and hardback books).
First, let me show you the one big while loop that collects every piece of data we need, then I’ll explain it in detail how it works, and after that we’ll do some more analysis.
So here’s the code:
bestseller_books = []
page = 1

while page != 35:
    url = f"https://www.bookdepository.com/bestsellers?page={page}"
    response = requests.get(url)
    html = response.content
    soup = bs(html, "lxml")
    for book in soup.find_all("div", class_="book-item"):
        bestseller_book = {}
        bestseller_book["title"] = book.find("h3", class_="title").get_text(strip=True)
        bestseller_book["format"] = book.find("p", class_="format").get_text()
        try:
            bestseller_book["year"] = book.find("p", class_="published").get_text()[-4:]
        except AttributeError:
            bestseller_book["year"] = ""
        try:
            price = book.find("p", class_="price")
            original_price = price.find("span", class_="rrp")
            if original_price:
                current_price = str(original_price.previousSibling).strip()
                current_price = float(current_price.split("€")[0].replace(",", "."))
            else:
                current_price = float(price.get_text(strip=True).split("€")[0].replace(",", "."))
        except AttributeError:
            bestseller_book["price"] = ""
        else:
            bestseller_book["price"] = current_price
        bestseller_books.append(bestseller_book)
    page = page + 1
Let me explain how the code works:
the whole code is just one big while loop that loops through all bestseller pages; for book in soup.find_all("div", class_="book-item") finds every book on a given page; then each book's title, format, publication year and price is saved into a bestseller_book dictionary one by one, and once a bestseller_book is fully created, it's added to the bestseller_books list; bestseller_book["title"] = book.find("h3", class_="title").get_text(strip=True) collects the books' titles and saves them into bestseller_book; bestseller_book["format"] = book.find("p", class_="format").get_text() gets us the books' formats and saves them into bestseller_book; bestseller_book["year"] = book.find("p", class_="published").get_text()[-4:] finds the publication years and saves them into bestseller_book; if there's no publication information for a book, we receive an AttributeError: 'NoneType' object has no attribute 'get_text' error, so instead of letting this error stop our code from running, we assign the "" value to bestseller_book["year"] (this is handled by the try-except block); with price = book.find("p", class_="price") we search for the books' prices, and after this line of code we carry out the same steps we did in the previous article, then add each book's selling price to bestseller_book["price"]; if no price exists for a book we add "" (this is done by the try-except-else block); finally we have our bestseller_book, and add it to our bestseller_books list (bestseller_books.append(bestseller_book)).
Let’s print out the first three books of bestseller_books to quickly check what we’ve just accomplished:
See? We have all the information we need for every book in one place!
Why is it useful?
Because we can create a pandas dataframe out of it:

books_df = pd.DataFrame(bestseller_books)
And then we can easily answer questions like what books are bestsellers from – let’s say – 1998:
Or which books cost more than 50 €:
books_with_prices_df = books_df[books_df["price"] != ""] keeps only the books that have available price information; then books_with_prices_df[books_with_prices_df["price"] > 50].head() shows the first five books above 50 €.
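Both filters described above can be sketched like this (books_df is assumed to come from pd.DataFrame(bestseller_books); stand-in rows are used here so the snippet runs on its own):

```python
import pandas as pd

# Stand-in for the DataFrame built from bestseller_books.
books_df = pd.DataFrame({
    "title": ["Old Gem", "New Hit", "Pricey Tome"],
    "year": ["1998", "2021", "2020"],
    "price": [12.5, "", 59.9],
})

# Bestsellers from 1998:
from_1998 = books_df[books_df["year"] == "1998"]

# Books costing more than 50 EUR; first drop rows without price info:
books_with_prices_df = books_df[books_df["price"] != ""]
over_50 = books_with_prices_df[books_with_prices_df["price"] > 50]
print(over_50.head())
```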
If you need a refresher on how pandas works, please read this tutorial series on the topic.
Comparing paperback and hardback books with boxplots
I’ve got one last cool visualization for you, and again, we’ll use the data from books_with_prices_df:
First, we’ll create two new dataframes out of books_with_prices_df: one that contains only paperback books (paperback_df), and another one for hardback books (hardback_df):
paperback_df = books_with_prices_df[books_with_prices_df["format"] == "Paperback"]
hardback_df = books_with_prices_df[books_with_prices_df["format"] == "Hardback"]
paperback_df = paperback_df.astype({"price": float})
hardback_df = hardback_df.astype({"price": float})

(We convert the string values of the price column into float type with .astype({"price": float}).)
Let's do a .describe() on both dataframes to summarize our data:
You can see that we have 758 (count) paperback books and 192 (count) hardback books. You may also notice that:
you can expect to buy a bestseller paperback book for an average price (mean) of 14.4 €, but for a hardback book you'd have to pay 22.14 €; the cheapest paperback book is 6.7 €, while its hardback counterpart sells for 9.13 €; and interestingly, the most expensive (max) paperback book (147 €) costs more than the most expensive hardback book (105.3 €).
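The .describe() comparison can be sketched like so (with a handful of made-up prices, since the real data changes daily):

```python
import pandas as pd

# Stand-in price columns; in the article these are
# paperback_df["price"] and hardback_df["price"].
paperback_prices = pd.Series([6.7, 12.0, 14.4, 20.0, 147.0])
hardback_prices = pd.Series([9.13, 18.0, 22.14, 30.0, 105.3])

# .describe() returns count, mean, std, min, the quartiles and max in one call:
print(paperback_prices.describe())
print(hardback_prices.describe())
```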
We can also visualize these data with boxplots (note: I've added the texts, like Q3 (75%), manually next to the boxplots to make them more understandable):
Boxplots are five-number summaries of datasets that show the minimum, the maximum, the median, the first quartile and the third quartile of a dataset. At a quick glance we can establish, for example, that paperback books have a lower median than hardback books, and that the cheapest paperback book is cheaper than the cheapest hardback book. Basically, we can find the same information that we got with the .describe() method above.
If you want to learn more about boxplots, watch this video and read this article.
Anyway, you can create the above boxplots with a few lines of code:
plt.figure(figsize=(8, 5))
plt.yticks(np.arange(0, 40, step=5))
plt.grid()
plt.boxplot([paperback_df["price"], hardback_df["price"]], labels=["Paperback", "Hardback"], showfliers=False)

(showfliers=False removes the outliers from the data; that's the reason why the data on the boxplots differ from what .describe() shows us.)
If you’d like to read more about easy ways to summarize datasets, read this article on statistical averages, and this one on statistical variability.
Huh… we’ve covered a lot of ground. But it was worth it! After finishing this article (and coding along, of course) now you have enough knowledge to scrape multiple pages, and collect some basic data.
Feel free to leave a comment if you have a question or just would like to chat about web scraping. And don’t forget to subscribe to Tomi Mester’s newsletter, so you’ll be notified when the next article comes out in this web scraping series (we’ll be doing more advanced stuff, pinky promise).
Until then, keep practicing.
If you want to learn more about how to become a data scientist, take Tomi Mester's 50-minute video course: How to Become a Data Scientist. (It's free!) Also check out the 6-week online course: The Junior Data Scientist's First Month video course.
Cheers, Tamas Ujhelyi
