A Beginner’s Guide to learn web scraping with python! – Edureka
Last updated on Sep 24, 2021 641. 9K Views Tech Enthusiast in Blockchain, Hadoop, Python, Cyber-Security, Ethical Hacking. Interested in anything… Tech Enthusiast in Blockchain, Hadoop, Python, Cyber-Security, Ethical Hacking. Interested in anything and everything about Computers. 1 / 2 Blog from Web Scraping Web Scraping with PythonImagine you have to pull a large amount of data from websites and you want to do it as quickly as possible. How would you do it without manually going to each website and getting the data? Well, “Web Scraping” is the answer. Web Scraping just makes this job easier and faster. In this article on Web Scraping with Python, you will learn about web scraping in brief and see how to extract data from a website with a demonstration. I will be covering the following topics: Why is Web Scraping Used? What Is Web Scraping? Is Web Scraping Legal? Why is Python Good For Web Scraping? How Do You Scrape Data From A Website? Libraries used for Web Scraping Web Scraping Example: Scraping Flipkart Website Why is Web Scraping Used? Web scraping is used to collect large information from websites. But why does someone have to collect such large data from websites? To know about this, let’s look at the applications of web scraping: Price Comparison: Services such as ParseHub use web scraping to collect data from online shopping websites and use it to compare the prices of products. Email address gathering: Many companies that use email as a medium for marketing, use web scraping to collect email ID and then send bulk emails. Social Media Scraping: Web scraping is used to collect data from Social Media websites such as Twitter to find out what’s trending. Research and Development: Web scraping is used to collect a large set of data (Statistics, General Information, Temperature, etc. ) from websites, which are analyzed and used to carry out Surveys or for R&D. Job listings: Details regarding job openings, interviews are collected from different websites and then listed in one place so that it is easily accessible to the is Web Scraping? Web scraping is an automated method used to extract large amounts of data from websites. The data on the websites are unstructured. Web scraping helps collect these unstructured data and store it in a structured form. There are different ways to scrape websites such as online Services, APIs or writing your own code. In this article, we’ll see how to implement web scraping with python. Is Web Scraping Legal? Talking about whether web scraping is legal or not, some websites allow web scraping and some don’t. To know whether a website allows web scraping or not, you can look at the website’s “” file. You can find this file by appending “/” to the URL that you want to scrape. For this example, I am scraping Flipkart website. So, to see the “” file, the URL is in-depth Knowledge of Python along with its Diverse Applications Why is Python Good for Web Scraping? Here is the list of features of Python which makes it more suitable for web scraping. Ease of Use: Python is simple to code. You do not have to add semi-colons “;” or curly-braces “{}” anywhere. This makes it less messy and easy to use. Large Collection of Libraries: Python has a huge collection of libraries such as Numpy, Matlplotlib, Pandas etc., which provides methods and services for various purposes. Hence, it is suitable for web scraping and for further manipulation of extracted data. Dynamically typed: In Python, you don’t have to define datatypes for variables, you can directly use the variables wherever required. This saves time and makes your job faster. Easily Understandable Syntax: Python syntax is easily understandable mainly because reading a Python code is very similar to reading a statement in English. It is expressive and easily readable, and the indentation used in Python also helps the user to differentiate between different scope/blocks in the code. Small code, large task: Web scraping is used to save time. But what’s the use if you spend more time writing the code? Well, you don’t have to. In Python, you can write small codes to do large tasks. Hence, you save time even while writing the code. Community: What if you get stuck while writing the code? You don’t have to worry. Python community has one of the biggest and most active communities, where you can seek help Do You Scrape Data From A Website? When you run the code for web scraping, a request is sent to the URL that you have mentioned. As a response to the request, the server sends the data and allows you to read the HTML or XML page. The code then, parses the HTML or XML page, finds the data and extracts it. To extract data using web scraping with python, you need to follow these basic steps: Find the URL that you want to scrape Inspecting the Page Find the data you want to extract Write the code Run the code and extract the data Store the data in the required format Now let us see how to extract data from the Flipkart website using Python, Deep Learning, NLP, Artificial Intelligence, Machine Learning with these AI and ML courses a PG Diploma certification program by NIT braries used for Web Scraping As we know, Python is has various applications and there are different libraries for different purposes. In our further demonstration, we will be using the following libraries: Selenium: Selenium is a web testing library. It is used to automate browser activities. BeautifulSoup: Beautiful Soup is a Python package for parsing HTML and XML documents. It creates parse trees that is helpful to extract the data easily. Pandas: Pandas is a library used for data manipulation and analysis. It is used to extract the data and store it in the desired format. Subscribe to our YouTube channel to get new updates..! Web Scraping Example: Scraping Flipkart WebsitePre-requisites: Python 2. x or Python 3. x with Selenium, BeautifulSoup, pandas libraries installed Google-chrome browser Ubuntu Operating SystemLet’s get started! Step 1: Find the URL that you want to scrapeFor this example, we are going scrape Flipkart website to extract the Price, Name, and Rating of Laptops. The URL for this page is 2: Inspecting the PageThe data is usually nested in tags. So, we inspect the page to see, under which tag the data we want to scrape is nested. To inspect the page, just right click on the element and click on “Inspect” you click on the “Inspect” tab, you will see a “Browser Inspector Box” 3: Find the data you want to extractLet’s extract the Price, Name, and Rating which is in the “div” tag respectively. Learn Python in 42 hours! Step 4: Write the codeFirst, let’s create a Python file. To do this, open the terminal in Ubuntu and type gedit
from BeautifulSoup import BeautifulSoup
import pandas as pdTo configure webdriver to use Chrome browser, we have to set the path to chromedriverdriver = (“/usr/lib/chromium-browser/chromedriver”)Refer the below code to open the URL: products=[] #List to store name of the product
prices=[] #List to store price of the product
ratings=[] #List to store rating of the product
(“)
Now that we have written the code to open the URL, it’s time to extract the data from the website. As mentioned earlier, the data we want to extract is nested in
soup = BeautifulSoup(content)
for a in ndAll(‘a’, href=True, attrs={‘class’:’_31qSD5′}):
(‘div’, attrs={‘class’:’_3wU53n’})
(‘div’, attrs={‘class’:’_1vC4OE _2rQ-NK’})
(‘div’, attrs={‘class’:’hGSR34 _2beYZw’})
()
Step 5: Run the code and extract the dataTo run the code, use the below command: python 6: Store the data in a required formatAfter extracting the data, you might want to store it in a format. This format varies depending on your requirement. For this example, we will store the extracted data in a CSV (Comma Separated Value) format. To do this, I will add the following lines to my code:df = Frame({‘Product Name’:products, ‘Price’:prices, ‘Rating’:ratings})
_csv(”, index=False, encoding=’utf-8′)Now, I’ll run the whole code again. A file name “” is created and this file contains the extracted data. I hope you guys enjoyed this article on “Web Scraping with Python”. I hope this blog was informative and has added value to your knowledge. Now go ahead and try Web Scraping. Experiment with different modules and applications of Python. If you wish to know about Web Scraping With Python on Windows platform, then the below video will help you understand how to do Scraping With Python | Python Tutorial | Web Scraping Tutorial | EdurekaThis Edureka live session on “WebScraping using Python” will help you understand the fundamentals of scraping along with a demo to scrape some details from a question regarding “web scraping with Python”? You can ask it on edureka! Forum and we will get back to you at the earliest or you can join our Python Training in Hobart get in-depth knowledge on Python Programming language along with its various applications, you can enroll here for live online Python training with 24/7 support and lifetime access.

A Practical Introduction to Web Scraping in Python
Web scraping is the process of collecting and parsing raw data from the Web, and the Python community has come up with some pretty powerful web scraping tools.
The Internet hosts perhaps the greatest source of information—and misinformation—on the planet. Many disciplines, such as data science, business intelligence, and investigative reporting, can benefit enormously from collecting and analyzing data from websites.
In this tutorial, you’ll learn how to:
Parse website data using string methods and regular expressions
Parse website data using an HTML parser
Interact with forms and other website components
Scrape and Parse Text From Websites
Collecting data from websites using an automated process is known as web scraping. Some websites explicitly forbid users from scraping their data with automated tools like the ones you’ll create in this tutorial. Websites do this for two possible reasons:
The site has a good reason to protect its data. For instance, Google Maps doesn’t let you request too many results too quickly.
Making many repeated requests to a website’s server may use up bandwidth, slowing down the website for other users and potentially overloading the server such that the website stops responding entirely.
Let’s start by grabbing all the HTML code from a single web page. You’ll use a page on Real Python that’s been set up for use with this tutorial.
Your First Web Scraper
One useful package for web scraping that you can find in Python’s standard library is urllib, which contains tools for working with URLs. In particular, the quest module contains a function called urlopen() that can be used to open a URL within a program.
In IDLE’s interactive window, type the following to import urlopen():
>>>>>> from quest import urlopen
The web page that we’ll open is at the following URL:
>>>>>> url = ”
To open the web page, pass url to urlopen():
>>>>>> page = urlopen(url)
urlopen() returns an HTTPResponse object:
>>>>>> page
< object at 0x105fef820>
To extract the HTML from the page, first use the HTTPResponse object’s () method, which returns a sequence of bytes. Then use () to decode the bytes to a string using UTF-8:
>>>>>> html_bytes = ()
>>> html = (“utf-8”)
Now you can print the HTML to see the contents of the web page:
>>>>>> print(html)
Name: Aphrodite
Favorite animal: Dove
Favorite color: Red
Hometown: Mount Olympus
Once you have the HTML as text, you can extract information from it in a couple of different ways.
A Primer on Regular Expressions
Regular expressions—or regexes for short—are patterns that can be used to search for text within a string. Python supports regular expressions through the standard library’s re module.
To work with regular expressions, the first thing you need to do is import the re module:
Regular expressions use special characters called metacharacters to denote different patterns. For instance, the asterisk character (*) stands for zero or more of whatever comes just before the asterisk.
In the following example, you use findall() to find any text within a string that matches a given regular expression:
>>>>>> ndall(“ab*c”, “ac”)
[‘ac’]
The first argument of ndall() is the regular expression that you want to match, and the second argument is the string to test. In the above example, you search for the pattern “ab*c” in the string “ac”.
The regular expression “ab*c” matches any part of the string that begins with an “a”, ends with a “c”, and has zero or more instances of “b” between the two. ndall() returns a list of all matches. The string “ac” matches this pattern, so it’s returned in the list.
Here’s the same pattern applied to different strings:
>>>>>> ndall(“ab*c”, “abcd”)
[‘abc’]
>>> ndall(“ab*c”, “acc”)
>>> ndall(“ab*c”, “abcac”)
[‘abc’, ‘ac’]
>>> ndall(“ab*c”, “abdc”)
[]
Notice that if no match is found, then findall() returns an empty list.
Pattern matching is case sensitive. If you want to match this pattern regardless of the case, then you can pass a third argument with the value re. IGNORECASE:
>>>>>> ndall(“ab*c”, “ABC”)
>>> ndall(“ab*c”, “ABC”, re. IGNORECASE)
[‘ABC’]
You can use a period (. ) to stand for any single character in a regular expression. For instance, you could find all the strings that contain the letters “a” and “c” separated by a single character as follows:
>>>>>> ndall(“a. c”, “abc”)
>>> ndall(“a. c”, “abbc”)
>>> ndall(“a. c”, “ac”)
>>> ndall(“a. c”, “acc”)
[‘acc’]
The pattern. * inside a regular expression stands for any character repeated any number of times. For instance, “a. *c” can be used to find every substring that starts with “a” and ends with “c”, regardless of which letter—or letters—are in between:
>>>>>> ndall(“a. *c”, “abc”)
>>> ndall(“a. *c”, “abbc”)
[‘abbc’]
>>> ndall(“a. *c”, “ac”)
>>> ndall(“a. *c”, “acc”)
Often, you use () to search for a particular pattern inside a string. This function is somewhat more complicated than ndall() because it returns an object called a MatchObject that stores different groups of data. This is because there might be matches inside other matches, and () returns every possible result.
The details of the MatchObject are irrelevant here. For now, just know that calling () on a MatchObject will return the first and most inclusive result, which in most cases is just what you want:
>>>>>> match_results = (“ab*c”, “ABC”, re. IGNORECASE)
>>> ()
‘ABC’
There’s one more function in the re module that’s useful for parsing out text. (), which is short for substitute, allows you to replace text in a string that matches a regular expression with new text. It behaves sort of like the. replace() string method.
The arguments passed to () are the regular expression, followed by the replacement text, followed by the string. Here’s an example:
>>>>>> string = “Everything is
>>> string = (“<. *>“, “ELEPHANTS”, string)
>>> string
‘Everything is ELEPHANTS. ‘
Perhaps that wasn’t quite what you expected to happen.
() uses the regular expression “<. *>” to find and replace everything between the first < and last >, which spans from the beginning of
Alternatively, you can use the non-greedy matching pattern *?, which works the same way as * except that it matches the shortest possible string of text:
>>> string = (“<. *? >“, “ELEPHANTS”, string)
“Everything is ELEPHANTS if it’s in ELEPHANTS. ”
This time, () finds two matches,
Check Your Understanding
Expand the block below to check your understanding.
Write a program that grabs the full HTML from the following URL:
Then use () to display the text following “Name:” and “Favorite Color:” (not including any leading spaces or trailing HTML tags that might appear on the same line).
You can expand the block below to see a solution.
First, import the urlopen function from the quest module:
from quest import urlopen
Then open the URL and use the () method of the HTTPResponse object returned by urlopen() to read the page’s HTML:
url = ”
html_page = urlopen(url)
html_text = ()(“utf-8”)
() returns a byte string, so you use () to decode the bytes using the UTF-8 encoding.
Now that you have the HTML source of the web page as a string assigned to the html_text variable, you can extract Dionysus’s name and favorite color from his profile. The structure of the HTML for Dionysus’s profile is the same as Aphrodite’s profile that you saw earlier.
You can get the name by finding the string “Name:” in the text and extracting everything that comes after the first occurence of the string and before the next HTML tag. That is, you need to extract everything after the colon (:) and before the first angle bracket (<). You can use the same technique to extract the favorite color. The following for loop extracts this text for both the name and favorite color: for string in ["Name: ", "Favorite Color:"]: string_start_idx = (string) text_start_idx = string_start_idx + len(string) next_html_tag_offset = html_text[text_start_idx:]("<") text_end_idx = text_start_idx + next_html_tag_offset raw_text = html_text[text_start_idx: text_end_idx] clean_text = (" \r\n\t") print(clean_text) It looks like there’s a lot going on in this forloop, but it’s just a little bit of arithmetic to calculate the right indices for extracting the desired text. Let’s break it down: You use () to find the starting index of the string, either "Name:" or "Favorite Color:", and then assign the index to string_start_idx. Since the text to extract starts just after the colon in "Name:" or "Favorite Color:", you get the index of the the character immediately after the colon by adding the length of the string to start_string_idx and assign the result to text_start_idx. You calculate the ending index of the text to extract by determining the index of the first angle bracket (<) relative to text_start_idx and assign this value to next_html_tag_offset. Then you add that value to text_start_idx and assign the result to text_end_idx. You extract the text by slicing html_text from text_start_idx to text_end_idx and assign this string to raw_text. You remove any whitespace from the beginning and end of raw_text using () and assign the result to clean_text. At the end of the loop, you use print() to display the extracted text. The final output looks like this: This solution is one of many that solves this problem, so if you got the same output with a different solution, then you did great! When you’re ready, you can move on to the next section. Use an HTML Parser for Web Scraping in Python Although regular expressions are great for pattern matching in general, sometimes it’s easier to use an HTML parser that’s explicitly designed for parsing out HTML pages. There are many Python tools written for this purpose, but the Beautiful Soup library is a good one to start with. Install Beautiful Soup To install Beautiful Soup, you can run the following in your terminal: $ python3 -m pip install beautifulsoup4 Run pip show to see the details of the package you just installed: $ python3 -m pip show beautifulsoup4 Name: beautifulsoup4 Version: 4. 9. 1 Summary: Screen-scraping library Home-page: Author: Leonard Richardson Author-email: License: MIT Location: c:\realpython\venv\lib\site-packages Requires: Required-by: In particular, notice that the latest version at the time of writing was 4. 1. Create a BeautifulSoup Object Type the following program into a new editor window: from bs4 import BeautifulSoup page = urlopen(url) html = ()("utf-8") soup = BeautifulSoup(html, "") This program does three things: Opens the URL using urlopen() from the quest module Reads the HTML from the page as a string and assigns it to the html variable Creates a BeautifulSoup object and assigns it to the soup variable The BeautifulSoup object assigned to soup is created with two arguments. The first argument is the HTML to be parsed, and the second argument, the string "", tells the object which parser to use behind the scenes. "" represents Python’s built-in HTML parser. Use a BeautifulSoup Object Save and run the above program. When it’s finished running, you can use the soup variable in the interactive window to parse the content of html in various ways. For example, BeautifulSoup objects have a. get_text() method that can be used to extract all the text from the document and automatically remove any HTML tags. Type the following code into IDLE’s interactive window: >>>>>> print(t_text())
Profile: Dionysus
Name: Dionysus
Favorite animal: Leopard
Favorite Color: Wine
There are a lot of blank lines in this output. These are the result of newline characters in the HTML document’s text. You can remove them with the string. replace() method if you need to.
Often, you need to get only specific text from an HTML document. Using Beautiful Soup first to extract the text and then using the () string method is sometimes easier than working with regular expressions.
However, sometimes the HTML tags themselves are the elements that point out the data you want to retrieve. For instance, perhaps you want to retrieve the URLs for all the images on the page. These links are contained in the src attribute of
In this case, you can use find_all() to return a list of all instances of that particular tag:
>>>>>> nd_all(“img”)
[
This returns a list of all
Let’s explore this a little by first unpacking the Tag objects from the list:
>>>>>> image1, image2 = nd_all(“img”)
Each Tag object has a property that returns a string containing the HTML tag type:
You can access the HTML attributes of the Tag object by putting their name between square brackets, just as if the attributes were keys in a dictionary.
For example, the
To get the source of the images in the Dionysus profile page, you access the src attribute using the dictionary notation mentioned above:
>>>>>> image1[“src”]
‘/static/’
>>> image2[“src”]
Certain tags in HTML documents can be accessed by properties of the Tag object. For example, to get the
>>>>>>
If you look at the source of the Dionysus profile by navigating to the profile page, right-clicking on the page, and selecting View page source, then you’ll notice that the
Beautiful Soup automatically cleans up the tags for you by removing the extra space in the opening tag and the extraneous forward slash (/) in the closing tag.
You can also retrieve just the string between the title tags with the property of the Tag object:
‘Profile: Dionysus’
One of the more useful features of Beautiful Soup is the ability to search for specific kinds of tags whose attributes match certain values. For example, if you want to find all the
>>>>>> nd_all(“img”, src=”/static/”)
[
This example is somewhat arbitrary, and the usefulness of this technique may not be apparent from the example. If you spend some time browsing various websites and viewing their page sources, then you’ll notice that many websites have extremely complicated HTML structures.
When scraping data from websites with Python, you’re often interested in particular parts of the page. By spending some time looking through the HTML document, you can identify tags with unique attributes that you can use to extract the data you need.
Then, instead of relying on complicated regular expressions or using () to search through the document, you can directly access the particular tag you’re interested in and extract the data you need.
In some cases, you may find that Beautiful Soup doesn’t offer the functionality you need. The lxml library is somewhat trickier to get started with but offers far more flexibility than Beautiful Soup for parsing HTML documents. You may want to check it out once you’re comfortable using Beautiful Soup.
BeautifulSoup is great for scraping data from a website’s HTML, but it doesn’t provide any way to work with HTML forms. For example, if you need to search a website for some query and then scrape the results, then BeautifulSoup alone won’t get you very far.
Write a program that grabs the full HTML from the page at the URL Using Beautiful Soup, print out a list of all the links on the page by looking for HTML tags with the name a and retrieving the value taken on by the href attribute of each tag.
The final output should look like this:
You can expand the block below to see a solution:
First, import the urlopen function from the quest module and the BeautifulSoup class from the bs4 package:
Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:
base_url = ”
You can build a full URL by concatenating base_url with a relative URL.
Now open the /profiles page with urlopen() and use () to get the HTML source:
html_page = urlopen(base_url + “/profiles”)
With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:
soup = BeautifulSoup(html_text, “”)
nd_all(“a”) returns a list of all links in the HTML source. You can loop over this list to print out all the links on the webpage:
for link in nd_all(“a”):
link_url = base_url + link[“href”]
print(link_url)
The relative URL for each link can be accessed through the “href” subscript. Concatenate this value with base_url to create the full link_url.
Interact With HTML Forms
The urllib module you’ve been working with so far in this tutorial is well suited for requesting the contents of a web page. Sometimes, though, you need to interact with a web page to obtain the content you need. For example, you might need to submit a form or click a button to display hidden content.
The Python standard library doesn’t provide a built-in means for working with web pages interactively, but many third-party packages are available from PyPI. Among these, MechanicalSoup is a popular and relatively straightforward package to use.
In essence, MechanicalSoup installs what’s known as a headless browser, which is a web browser with no graphical user interface. This browser is controlled programmatically via a Python program.
Install MechanicalSoup
You can install MechanicalSoup with pip in your terminal:
$ python3 -m pip install MechanicalSoup
You can now view some details about the package with pip show:
$ python3 -m pip show mechanicalsoup
Name: MechanicalSoup
Version: 0. 12. 0
Summary: A Python library for automating interaction with websites
Home-page: Author: UNKNOWN
Author-email: UNKNOWN
Requires: requests, beautifulsoup4, six, lxml
In particular, notice that the latest version at the time of writing was 0. 0. You’ll need to close and restart your IDLE session for MechanicalSoup to load and be recognized after it’s been installed.
Create a Browser Object
Type the following into IDLE’s interactive window:
>>>>>> import mechanicalsoup
>>> browser = owser()
Browser objects represent the headless web browser. You can use them to request a page from the Internet by passing a URL to their () method:
>>> page = (url)
page is a Response object that stores the response from requesting the URL from the browser:
The number 200 represents the status code returned by the request. A status code of 200 means that the request was successful. An unsuccessful request might show a status code of 404 if the URL doesn’t exist or 500 if there’s a server error when making the request.
MechanicalSoup uses Beautiful Soup to parse the HTML from the request. page has a attribute that represents a BeautifulSoup object:
>>>>>> type()
You can view the HTML by inspecting the attribute:
Please log in to access Mount Olympus:
Notice this page has a
]
As you can see from the output above, html tags sometimes come with attributes such as class, src, etc. These attributes provide additional information about html elements. You can use a for loop and the get(‘”href”) method to extract and print out only hyperlinks.
all_links = nd_all(“a”)
for link in all_links:
print((“href”))
/results/2017GPTR
#individual
#team
mailto:[email protected]
#tabs-1
None
To print out table rows only, pass the ‘tr’ argument in nd_all().
# Print the first 10 rows for sanity check
rows = nd_all(‘tr’)
print(rows[:10])
[
,
,
,
,
]
The goal of this tutorial is to take a table from a webpage and convert it into a dataframe for easier manipulation using Python. To get there, you should get all table rows in list form first and then convert that list into a dataframe. Below is a for loop that iterates through table rows and prints out the cells of the rows.
for row in rows:
row_td = nd_all(‘td’)
print(row_td)
type(row_td)
[
,
,
,
,
,
,
]
sultSet
The output above shows that each row is printed with html tags embedded in each row. This is not what you want. You can use remove the html tags using Beautiful Soup or regular expressions.
The easiest way to remove html tags is to use Beautiful Soup, and it takes just one line of code to do this. Pass the string of interest into BeautifulSoup() and use the get_text() method to extract the text without html tags.
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, “lxml”). get_text()
print(cleantext)
[14TH, INTEL TEAM M, 04:43:23, 00:58:59 – DANIELLE CASILLAS, 01:02:06 – RAMYA MERUVA, 01:17:06 – PALLAVI J SHINDE, 01:25:11 – NALINI MURARI]
Using regular expressions is highly discouraged since it requires several lines of code and one can easily make mistakes. It requires importing the re (for regular expressions) module. The code below shows how to build a regular expression that finds all the characters inside the < td > html tags and replace them with an empty string for each table row.
First, you compile a regular expression by passing a string to match to mpile(). The dot, star, and question mark (. *? ) will match an opening angle bracket followed by anything and followed by a closing angle bracket. It matches text in a non-greedy fashion, that is, it matches the shortest possible string. If you omit the question mark, it will match all the text between the first opening angle bracket and the last closing angle bracket. After compiling a regular expression, you can use the () method to find all the substrings where the regular expression matches and replace them with an empty string. The full code below generates an empty list, extract text in between html tags for each row, and append it to the assigned list.
import re
list_rows = []
cells = nd_all(‘td’)
str_cells = str(cells)
clean = mpile(‘<. *? >‘)
clean2 = ((clean, ”, str_cells))
(clean2)
print(clean2)
type(clean2)
str
The next step is to convert the list into a dataframe and get a quick view of the first 10 rows using Pandas.
df = Frame(list_rows)
(10)
0
[Finishers:, 577]
1
[Male:, 414]
2
[Female:, 163]
3
[]
4
[1, 814, JARED WILSON, M, TIGARD, OR, 00:36:21…
5
[2, 573, NATHAN A SUSTERSIC, M, PORTLAND, OR,…
6
[3, 687, FRANCISCO MAYA, M, PORTLAND, OR, 00:3…
7
[4, 623, PAUL MORROW, M, BEAVERTON, OR, 00:38:…
8
[5, 569, DEREK G OSBORNE, M, HILLSBORO, OR, 00…
9
[6, 642, JONATHON TRAN, M, PORTLAND, OR, 00:39…
The dataframe is not in the format we want. To clean it up, you should split the “0” column into multiple columns at the comma position. This is accomplished by using the () method.
df1 = df[0](‘, ‘, expand=True)
This looks much better, but there is still work to do. The dataframe has unwanted square brackets surrounding each row. You can use the strip() method to remove the opening square bracket on column “0. ”
df1[0] = df1[0](‘[‘)
The table is missing table headers. You can use the find_all() method to get the table headers.
col_labels = nd_all(‘th’)
Similar to table rows, you can use Beautiful Soup to extract text in between html tags for table headers.
all_header = []
col_str = str(col_labels)
cleantext2 = BeautifulSoup(col_str, “lxml”). get_text()
(cleantext2)
print(all_header)
[‘[Place, Bib, Name, Gender, City, State, Chip Time, Chip Pace, Gender Place, Age Group, Age Group Place, Time to Start, Gun Time, Team]’]
You can then convert the list of headers into a pandas dataframe.
df2 = Frame(all_header)
()
[Place, Bib, Name, Gender, City, State, Chip T…
Similarly, you can split column “0” into multiple columns at the comma position for all rows.
df3 = df2[0](‘, ‘, expand=True)
The two dataframes can be concatenated into one using the concat() method as illustrated below.
frames = [df3, df1]
df4 = (frames)
Below shows how to assign the first row to be the table header.
df5 = ([0])
At this point, the table is almost properly formatted. For analysis, you can start by getting an overview of the data as shown below.
Int64Index: 597 entries, 0 to 595
Data columns (total 14 columns):
[Place 597 non-null object
Bib 596 non-null object
Name 593 non-null object
Gender 593 non-null object
City 593 non-null object
State 593 non-null object
Chip Time 593 non-null object
Chip Pace 578 non-null object
Gender Place 578 non-null object
Age Group 578 non-null object
Age Group Place 578 non-null object
Time to Start 578 non-null object
Gun Time 578 non-null object
Team] 578 non-null object
dtypes: object(14)
memory usage: 70. 0+ KB
(597, 14)
The table has 597 rows and 14 columns. You can drop all rows with any missing values.
df6 = (axis=0, how=’any’)
Also, notice how the table header is replicated as the first row in df5. It can be dropped using the following line of code.
df7 = ([0])
You can perform more data cleaning by renaming the ‘[Place’ and ‘ Team]’ columns. Python is very picky about space. Make sure you include space after the quotation mark in ‘ Team]’.
(columns={‘[Place’: ‘Place’}, inplace=True)
(columns={‘ Team]’: ‘Team’}, inplace=True)
The final data cleaning step involves removing the closing bracket for cells in the “Team” column.
df7[‘Team’] = df7[‘Team’](‘]’)
It took a while to get here, but at this point, the dataframe is in the desired format. Now you can move on to the exciting part and start plotting the data and computing interesting statistics.
The first question to answer is, what was the average finish time (in minutes) for the runners? You need to convert the column “Chip Time” into just minutes. One way to do this is to convert the column to a list first for manipulation.
time_list = df7[‘ Chip Time’]()
# You can use a for loop to convert ‘Chip Time’ to minutes
time_mins = []
for i in time_list:
h, m, s = (‘:’)
math = (int(h) * 3600 + int(m) * 60 + int(s))/60
(math)
#print(time_mins)
The next step is to convert the list back into a dataframe and make a new column (“Runner_mins”) for runner chip times expressed in just minutes.
df7[‘Runner_mins’] = time_mins
The code below shows how to calculate statistics for numeric columns only in the dataframe.
scribe(include=[])
Runner_mins
count
577. 000000
mean
60. 035933
std
11. 970623
min
36. 350000
25%
51. 000000
50%
59. 016667
75%
67. 266667
max
101. 300000
Interestingly, the average chip time for all runners was ~60 mins. The fastest 10K runner finished in 36. 35 mins, and the slowest runner finished in 101. 30 minutes.
A boxplot is another useful tool to visualize summary statistics (maximum, minimum, medium, first quartile, third quartile, including outliers). Below are data summary statistics for the runners shown in a boxplot. For data visualization, it is convenient to first import parameters from the pylab module that comes with matplotlib and set the same size for all figures to avoid doing it for each figure.
from pylab import rcParams
rcParams[‘gsize’] = 15, 5
xplot(column=’Runner_mins’)
(True, axis=’y’)
(‘Chip Time’)
([1], [‘Runners’])
([< at 0x570dd106d8>],
)
The second question to answer is: Did the runners’ finish times follow a normal distribution?
Below is a distribution plot of runners’ chip times plotted using the seaborn library. The distribution looks almost normal.
x = df7[‘Runner_mins’]
ax = sns. distplot(x, hist=True, kde=True, rug=False, color=’m’, bins=25, hist_kws={‘edgecolor’:’black’})
The third question deals with whether there were any performance differences between males and females of various age groups. Below is a distribution plot of chip times for males and females.
f_fuko = [df7[‘ Gender’]==’ F’][‘Runner_mins’]
m_fuko = [df7[‘ Gender’]==’ M’][‘Runner_mins’]
sns. distplot(f_fuko, hist=True, kde=True, rug=False, hist_kws={‘edgecolor’:’black’}, label=’Female’)
sns. distplot(m_fuko, hist=False, kde=True, rug=False, hist_kws={‘edgecolor’:’black’}, label=’Male’)
< at 0x570e301fd0>
The distribution indicates that females were slower than males on average. You can use the groupby() method to compute summary statistics for males and females separately as shown below.
g_stats = oupby(” Gender”, as_index=True). describe()
print(g_stats)
Runner_mins \
count mean std min 25% 50%
Gender
F 163. 0 66. 119223 12. 184440 43. 766667 58. 758333 64. 616667
M 414. 0 57. 640821 11. 011857 36. 350000 49. 395833 55. 791667
75% max
F 72. 058333 101. 300000
M 64. 804167 98. 516667
The average chip time for all females and males was ~66 mins and ~58 mins, respectively. Below is a side-by-side boxplot comparison of male and female finish times.
xplot(column=’Runner_mins’, by=’ Gender’)
ptitle(“”)
C:\Users\smasango\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\core\ FutureWarning: reshape is deprecated and will raise in a subsequent release. Please use (… ) instead
return getattr(obj, method)(*args, **kwds)
Text(0. 5, 0. 98, ”)
In this tutorial, you performed web scraping using Python. You used the Beautiful Soup library to parse html data and convert it into a form that can be used for analysis. You performed cleaning of the data in Python and created useful plots (box plots, bar plots, and distribution plots) to reveal interesting trends using Python’s matplotlib and seaborn libraries. After this tutorial, you should be able to use Python to easily scrape data from the web, apply cleaning techniques and extract useful insights from the data.
If you would like to learn more about Python, take DataCamp’s free Intro to Python for Data Science course.
Frequently Asked Questions about python web data extraction
Can Python extract data from website?
Let’s say you find data from the web, and there is no direct way to download it, web scraping using Python is a skill you can use to extract the data into a useful form that can be imported.Jul 26, 2018
How do I extract all links from a website in Python?
Import module. Make requests instance and pass into URL. Pass the requests into a Beautifulsoup() function. Use ‘a’ tag to find them all tag (‘a href ‘)Jan 7, 2021
How do you scrape a website?
How do we do web scraping?Inspect the website HTML that you want to crawl.Access URL of the website using code and download all the HTML contents on the page.Format the downloaded content into a readable format.Extract out useful information and save it into a structured format.More items…•Jul 15, 2020