Web Scraping with PHP – ScrapingBee
22 September, 2020
Jérôme is an experienced PHP developer who is very active in the open-source community. If you use PHP and Firebase, you should check out his SDK on GitHub (1.4k stars).
You might have seen one of our other tutorials on how to scrape websites, for example with Ruby, JavaScript or Python, and wondered: what about the most widely used server-side programming language for websites, which, at the same time, is the most dreaded? Wonder no more: today it's time for PHP!
Believe it or not, PHP and web scraping have much in common: just like PHP, web scraping can be used in a quick and dirty way, more elaborately, or enhanced with the help of additional tools and services.
In this article, we’ll look at some ways to scrape the web with PHP. Please keep in mind that there is no general “the best way” – each approach has its use-case depending on what you need, how you like to do things, and what you want to achieve.
As an example, we will try to get a list of people who share the same birthday. If you want to code along, please ensure that you have installed a current version of PHP and Composer.
Create a new directory and in it, run:
$ composer init --require="php >=7.4" --no-interaction
$ composer update
We’re ready!
1. HTTP Requests
When it comes to browsing the web, the most commonly used communication protocol is HTTP, the Hypertext Transfer Protocol. It specifies how participants on the World Wide Web can communicate with each other. There are servers hosting resources and clients requesting resources from them.
Your browser is such a client. When we enable the developer console, select the "Network" tab, and open a website, we can see the full request sent to the server, as well as the full response:
Network tab of your browser developer console
That's quite a lot of request and response headers, but in its most basic form, a request looks like this:
GET / HTTP/1.1
Host: <hostname>
Let's try to recreate what the browser just did for us!
fsockopen()
We usually don’t see this lower-deck communication, but just for the sake of it, let’s create this request with the most basic tool PHP has to offer: fsockopen():
<?php
# fsockopen.php

// Placeholder host (an assumption for this example); replace it with the
// site you want to request
$host = 'example.com';

// In HTTP, lines have to be terminated with "\r\n" because of
// backward compatibility reasons
$request = "GET / HTTP/1.1\r\n";
$request .= "Host: {$host}\r\n";
// Ask the server to close the connection after responding, so that the
// read loop below ends promptly
$request .= "Connection: close\r\n";
$request .= "\r\n"; // We need to add a last new line after the last header

// We open a connection to the host on port 80
$connection = fsockopen($host, 80);
// The information stream can flow, and we can write and read from it
fwrite($connection, $request);
// As long as the server returns something to us...
while (!feof($connection)) {
    // ... print what the server sent us
    echo fgets($connection);
}
// Finally, close the connection
fclose($connection);
And indeed, if you put this code snippet into a file and run it with php, you will see the same HTML that you get when you open the same page in your browser.
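One thing the raw socket hands us is the whole response as text: status line, headers, and body in one stream. As a minimal sketch of what "separating the headers from the body" means, here is a hypothetical helper, `parseRawResponse()` (the name and the sample response are made up for this illustration). It assumes a simple, non-chunked HTTP/1.x response and is not a complete HTTP parser.

```php
<?php
// Hypothetical helper (not part of the script above): splits a raw
// HTTP/1.x response into status code, headers, and body.
function parseRawResponse(string $raw): array
{
    // Headers and body are separated by the first empty line ("\r\n\r\n")
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $body = $parts[1] ?? '';

    // The first line is the status line, e.g. "HTTP/1.1 200 OK"
    $statusLine = array_shift($headerLines);
    $statusParts = explode(' ', $statusLine, 3);
    $statusCode = (int) ($statusParts[1] ?? 0);

    $headers = [];
    foreach ($headerLines as $line) {
        // Each header line has the form "Name: value"; skip anything else
        if (strpos($line, ':') === false) {
            continue;
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so we normalize them
        $headers[strtolower(trim($name))] = trim($value);
    }

    return ['status' => $statusCode, 'headers' => $headers, 'body' => $body];
}

// Made-up sample response, so the example runs without a network connection
$sample = "HTTP/1.1 200 OK\r\n"
    . "Content-Type: text/html; charset=UTF-8\r\n"
    . "Content-Length: 14\r\n"
    . "\r\n"
    . "<h1>Hello</h1>";

$parsed = parseRawResponse($sample);
echo $parsed['status'] . PHP_EOL;
echo $parsed['headers']['content-type'] . PHP_EOL;
echo $parsed['body'] . PHP_EOL;
```

Real responses can be chunked, compressed, or redirected, which is exactly the kind of detail we would rather leave to a proper HTTP client.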
Next step: performing an HTTP request with Assembler… just kidding! But in all seriousness: fsockopen() is usually not used to make HTTP requests with PHP; I just wanted to show you that it's feasible, using the easiest possible example. While it is possible to make all HTTP (and non-HTTP) interactions work with it, it's not fun and requires a lot of boilerplate code that we shouldn't need to write. Performing HTTP requests is a solved problem, and in PHP (and many other languages) it's solved by…
cURL
Enter cURL (a client for URLs)! Let's jump right into the code; it's quite straightforward:
// Placeholder URL (an assumption for this example); replace it with the
// page you want to request
$url = 'http://example.com/';
// Initialize a connection with cURL (ch = cURL handle, or "channel")
$ch = curl_init();
// Set the URL
curl_setopt($ch, CURLOPT_URL, $url);
// Set the HTTP method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
// Return the response instead of printing it out
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Send the request and store the result in $response
$response = curl_exec($ch);
echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;
// Close cURL resource to free up system resources
curl_close($ch);
Now, this does look a lot more controlled than our previous example, doesn't it? No need to create a connection to a specific port of a particular server, to manually separate the headers from the actual response, or to close a connection. To follow a website redirect, all we need is a curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);, and there are many more options available to accommodate further needs.
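To show how a few of these options fit together, here is a small sketch of a reusable wrapper. The function name `fetchUrl()`, the redirect limit, and the timeout defaults are assumptions chosen for this illustration, not something cURL prescribes.

```php
<?php
// Hypothetical helper: name, redirect limit, and timeouts are assumptions
function fetchUrl(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects...
        CURLOPT_MAXREDIRS      => 5,    // ...but not forever
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);

    $body = curl_exec($ch);

    if ($body === false) {
        // curl_exec() returns false on transport errors (DNS, timeout, ...)
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // The handle is released automatically when $ch goes out of scope
    return ['status' => $status, 'body' => $body];
}
```

A call such as `fetchUrl($url)` then returns the status code and the body in one array, and a try/catch around it is enough to handle unreachable hosts.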
Great! Now let's get to actual scraping!
2. Strings, regular expressions, and Wikipedia
Let's look at Wikipedia as our first data provider. Each day of the year has its own page for historical events, including birthdays! When we open, for example, the page for December 10th (which happens to be my birthday), we can inspect the HTML in the developer console and see how the “Births” section is structured:
Wikipedia's HTML structure
This looks nice and organized! We can see that:
There's an <h2> header element containing Births (only one element on the whole page should™ have an ID named "Births").
The header is immediately followed by an unordered list (<ul>).
Each list item (<li>…</li>) contains a year, a dash, a name, a comma, and a teaser of what the given person is known for.
This is something we can work with, let’s go!
// Assumed URL of Wikipedia's page for December 10th
$html = file_get_contents('https://en.wikipedia.org/wiki/December_10');
echo $html;
Wait, what? Surprise! file_get_contents() is probably the easiest way to perform uncomplicated GET requests. It's not really meant for this use case, but PHP is a language that allows many things you shouldn't do. (But it's fine for this example and for one-off scripts when you know what you're requesting.)
Script output
Have you read all the HTML that the script has printed out? I hope not, because it's a lot! The important thing is that we know where we should start looking: we're only interested in the part starting with id="Births" and ending at the closing </ul> of the list right after that:
$start = stripos($html, 'id="Births"');
$end = stripos($html, '</ul>', $start);
$length = $end - $start;
$htmlSection = substr($html, $start, $length);
echo $htmlSection;
We’re getting closer!
Cleaner results
This is not valid HTML anymore, but at least we can see what we’re working with! Let’s use a regular expression to load all list items into an array so that we can handle each item one by one:
preg_match_all('@<li>(.+)</li>@', $htmlSection, $matches);
$listItems = $matches[1];
foreach ($listItems as $item) {
    echo "{$item}\n\n";
}
Cleaner results (bis)
For the years and names… We can see from the output that the first number is the birth year. It's followed by an HTML entity, &#8211; (a dash). Finally, the name is located within the following <a> element. Let's grab 'em all, and we're done.
echo "Who was born on December 10th\n";
echo "=============================\n\n";
foreach ($listItems as $item) {
    preg_match('@(\d+)@', $item, $yearMatch);
    $year = (int) $yearMatch[0];
    preg_match('@;\s<a\b[^>]*>(.*?)</a>@i', $item, $nameMatch);
    $name = $nameMatch[1];
    echo "{$name} was born in {$year}\n";
}
Final results
I don't know about you, but I feel a bit dirty now. We achieved our goal, but instead of elegantly navigating the HTML DOM tree, we destroyed it beyond recognition and ripped out pieces of information with commands that are not easy to understand. And worst of all, this script will show an error with items where the year is not wrapped in a link (I didn't show you because the screenshot looks nicer without it).
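For what it's worth, the string-and-regex approach can at least be made to fail quietly. The following sketch wraps the steps above into a hypothetical `extractBirths()` function that checks every lookup before using its result. The HTML fragment is made up to imitate the structure described above, so the example runs without a network connection; it is not Wikipedia's actual markup.

```php
<?php
// Made-up HTML fragment imitating the structure described above
$html = <<<HTML
<h2><span id="Births">Births</span></h2>
<ul>
<li><a href="/wiki/1815">1815</a> &#8211; <a href="/wiki/Ada_Lovelace">Ada Lovelace</a>, English mathematician</li>
<li>1830 &#8211; <a href="/wiki/Emily_Dickinson">Emily Dickinson</a>, American poet</li>
<li>An entry without a year or a link</li>
</ul>
HTML;

// Hypothetical helper: returns a list of ['year' => int, 'name' => string]
function extractBirths(string $html): array
{
    $start = stripos($html, 'id="Births"');
    if ($start === false) {
        return []; // section not found
    }
    $end = stripos($html, '</ul>', $start);
    if ($end === false) {
        return []; // the list is never closed
    }
    $section = substr($html, $start, $end - $start);

    preg_match_all('@<li>(.+?)</li>@s', $section, $matches);

    $people = [];
    foreach ($matches[1] as $item) {
        // The year opens the item and may or may not be wrapped in a link
        if (!preg_match('@^\s*(?:<a\b[^>]*>)?(\d+)@', $item, $yearMatch)) {
            continue; // skip instead of reading a missing array index
        }
        // The name is the first link after the dash entity
        if (!preg_match('@;\s*<a\b[^>]*>(.*?)</a>@i', $item, $nameMatch)) {
            continue;
        }
        $people[] = ['year' => (int) $yearMatch[1], 'name' => $nameMatch[1]];
    }

    return $people;
}

foreach (extractBirths($html) as $person) {
    echo "{$person['name']} was born in {$person['year']}\n";
}
```

This removes the errors, but not the underlying problem: we are still pattern-matching on text instead of working with the document's structure.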
We can do better! When? Now!
3. Guzzle, XML, XPath, and IMDb
Guzzle is a popular HTTP client for PHP that makes it easy and enjoyable to send HTTP requests. It provides you with an intuitive API, extensive error handling, and even the possibility of extending its functionality with middleware. This makes Guzzle a powerful tool that you don't want to miss. You can install Guzzle from your terminal with composer require guzzlehttp/guzzle.
Let's cut to the chase and have a look at the HTML of the IMDb birthday page (Wikipedia's URLs were definitely nicer):
IMDB HTML structure
We can see straight away that we'll need a better tool than string functions and regular expressions here. Instead of a list with list items, we see nested <div> elements.
We’ll try to find a solution for the year-situation later, but for now, let’s at least get the names of our jubilees with XPath, a query language to select nodes from a DOM Document.
In our new script, we’ll first fetch the page with Guzzle, convert the returned HTML string into a DOMDocument object and initialize an XPath parser with it:
require 'vendor/autoload.php';
// Placeholder: set this to the URL of the IMDb page you want to scrape
$url = '';
$client = new \GuzzleHttp\Client();
$response = $client->get($url);
$htmlString = (string) $response->getBody();
// HTML is often wonky, this suppresses a lot of warnings
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);
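Suppressing warnings doesn't mean we have to throw them away. As a small sketch, the collected parser messages can be inspected with libxml_get_errors() and cleared afterwards. The HTML string below is a made-up, deliberately sloppy fragment, and which messages appear (if any) depends on the installed libxml2 version.

```php
<?php
// Made-up, deliberately sloppy HTML (unclosed <p>, HTML5 <section> element)
$htmlString = '<section><p>Born on this day: <a href="/name/1">Some Name</a></section>';

// Collect parser problems internally instead of emitting PHP warnings
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);

// Print whatever the parser complained about
foreach (libxml_get_errors() as $error) {
    echo 'Line ' . $error->line . ': ' . trim($error->message) . PHP_EOL;
}

// Empty the internal buffer so it doesn't grow while parsing many pages
libxml_clear_errors();

// The document is still usable despite the sloppy markup
echo $doc->getElementsByTagName('a')->length . PHP_EOL;
```

Logging these messages during development can help explain why an element we expect to find is missing from the parsed document.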
Let’s have a closer look at the HTML in the window above:
The list is contained in a <div class="lister-list"> element.
Each direct child of this container is a <div>.
Finally, the name can be found within an <a> within an <h3> within a <div> with a lister-item-content class.
If we look closer, we can make it even simpler and skip the child divs and class names: there is only one <h3> in a list item, so let's target that directly:
$links = $xpath->evaluate('//div[@class="lister-list"][1]//h3/a');
foreach ($links as $link) {
    echo $link->textContent . PHP_EOL;
}
//div[@class="lister-list"][1] returns the first ([1]) div with an attribute named class that has the exact value lister-list
within that div, from all <h3> elements (//h3) return all anchors (<a>)
We then iterate through the result and print the text content of the anchor elements
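To try the expression without hitting a live website, here is a self-contained sketch that runs it against an inline HTML string. The markup, class names on the child divs, and person names are placeholders made up for this illustration. It also shows a detail worth knowing: @class="…" compares the whole attribute value, so an element with several classes needs a different test.

```php
<?php
// Made-up HTML imitating the structure described above (placeholder names)
$htmlString = <<<HTML
<div class="lister-list">
  <div class="lister-item mode-detail">
    <div class="lister-item-content"><h3><a href="/name/1">First Person</a></h3></div>
  </div>
  <div class="lister-item mode-detail">
    <div class="lister-item-content"><h3><a href="/name/2">Second Person</a></h3></div>
  </div>
</div>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);

// Same expression as above: anchors in <h3> elements of the first lister-list
$names = [];
foreach ($xpath->query('//div[@class="lister-list"][1]//h3/a') as $link) {
    $names[] = trim($link->textContent);
}
print_r($names);

// @class="..." compares the complete attribute value, so this finds nothing,
// because the attribute is "lister-item mode-detail"
$exact = $xpath->query('//div[@class="lister-item"]')->length;

// This common idiom matches one class among several
$token = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " lister-item ")]'
)->length;

echo "exact: {$exact}, token: {$token}" . PHP_EOL;
```

The exact comparison works for lister-list only because that div carries a single class; if the site adds a second class to it, the expression silently stops matching.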
I hope I explained it well enough for this use case, but in any case, our article "Practical XPath for Web Scraping" here on this blog explains XPath far better and goes much deeper than I ever could, so definitely check it out (but finish reading this one first!).
4. Goutte and IMDb
Guzzle is one HTTP client, but many others are equally excellent – it just happens to be one of the most mature and most downloaded. PHP has a vast, active community; whatever you need, there’s a good chance someone else has written a library or framework for it, and web scraping is no exception.
Goutte is an HTTP client made for web scraping. It was created by Fabien Potencier, the creator of the Symfony Framework, and combines several Symfony components to make web scraping very comfortable:
The BrowserKit component simulates the behavior of a web browser that you can use programmatically
Think of the DomCrawler component as DOMDocument and XPath on steroids – except that steroids are bad, and the DomCrawler is good!
The CssSelector component translates CSS queries into XPath queries.
The Symfony HTTP Client is a relatively new component (it was released in 2019) – being developed and maintained by the Symfony team, it has gained in popularity very quickly.
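To get a feel for what "translating CSS queries into XPath queries" means, here is a toy translator written in plain PHP. It is an illustration only, not Symfony's implementation: the function name `toyCssToXPath()` is made up, and it supports just tag names, .class selectors, and the descendant combinator (a space).

```php
<?php
// Toy translator (an illustration, not Symfony's implementation). Supports
// only tag names, ".class", and the descendant combinator (a space).
function toyCssToXPath(string $css): string
{
    $xpath = '';
    foreach (preg_split('/\s+/', trim($css)) as $part) {
        // Split "div.foo.bar" into the tag name and its classes
        $classes = explode('.', $part);
        $tag = array_shift($classes);
        $xpath .= '//' . ($tag === '' ? '*' : $tag);
        foreach ($classes as $class) {
            // Match one class token inside a space-separated class attribute
            $xpath .= '[contains(concat(" ", normalize-space(@class), " "), " '
                . $class . ' ")]';
        }
    }

    return $xpath;
}

echo toyCssToXPath('.lister-list h3 a') . PHP_EOL;

// Check the generated expression against a small, made-up document
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(
    '<div class="lister-list"><h3><a href="#">A</a></h3></div>'
    . '<h3><a href="#">B</a></h3>'
);
$xpath = new DOMXPath($doc);

// Only the anchor inside the lister-list div should match
echo $xpath->query(toyCssToXPath('.lister-list h3 a'))->length . PHP_EOL;
```

The real component handles far more of the CSS selector syntax; in code you would use its CssSelectorConverter class and toXPath() method rather than anything hand-rolled like this.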
Let’s install Goutte with composer require fabpot/goutte and recreate the previous XPath with it:
$client = new \Goutte\Client();
// Placeholder: set this to the same IMDb URL as in the previous script
$url = '';
$crawler = $client->request('GET', $url);
$links = $crawler->evaluate('//div[@class="lister-list"][1]//h3/a');
This alone is already pretty good: we saved the step where we had to explicitly disable XML warnings and didn't need to instantiate an XPath object ourselves. Now, let's replace the XPath expression with a CSS query (thanks to the CssSelector component integrated into Goutte):
$crawler->filter('.lister-list h3 a')->each(function ($node) {
    echo $node->text() . PHP_EOL;
});
I like where this is going; our script is looking more and more like a conversation that even a non-programmer can understand, not just code. However, now is the time to find out if you're coding along or not: does this script return results when running it? Because for me, it didn't at first. I spent an hour debugging why and finally discovered a solution:
composer require masterminds/html5
As it turns out, the reason why Goutte (more precisely: the DomCrawler) doesn't report XML warnings is that it just throws away the parts it cannot parse. The additional library helps with HTML5 specifically, and after installing it, the script runs as expected.
We will talk more about this later, but for now, let’s remember that we’re still missing the birth years of our jubilees. This is where a web scraping library like Goutte really shines: we can click on links! And indeed: if we click one of the names in the birthday list to go to a person’s profile, we can see a “Born: ” line, and in the HTML a