Web Scraping with PHP – ScrapingBee


22 September, 2020
14 min read
Jérôme is an experienced PHP developer who is very active in the open-source community. If you use PHP and Firebase, you should check out his SDK on GitHub (1.4k stars).
You might have seen one of our other tutorials on how to scrape websites, for example with Ruby, JavaScript or Python, and wondered: what about the most widely used server-side programming language for websites, which, at the same time, is the most dreaded? Wonder no more – today it's time for PHP!
Believe it or not, PHP and web scraping have much in common: just like PHP, web scraping can be done in a quick and dirty way, more elaborately, or enhanced with the help of additional tools and services.
In this article, we’ll look at some ways to scrape the web with PHP. Please keep in mind that there is no general “the best way” – each approach has its use-case depending on what you need, how you like to do things, and what you want to achieve.
As an example, we will try to get a list of people who share the same birthday, the kind of list you can see on Wikipedia's page for any given date. If you want to code along, please ensure that you have installed a current version of PHP and Composer.
Create a new directory and in it, run:
$ composer init --require="php >=7.4" --no-interaction
$ composer update
We’re ready!
1. HTTP Requests
When it comes to browsing the web, the most commonly used communication protocol is HTTP, the Hypertext Transport Protocol. It specifies how participants on the World Wide Web can communicate with each other. There are servers hosting resources and clients requesting resources from them.
Your browser is such a client – when we enable the developer console, select the "Network" tab, and open a website, we can see the full request sent to the server, as well as the full response:
Network tab of your browser developer console
That's quite a few request and response headers, but in its most basic form, a request looks like this:

GET / HTTP/1.1
Host: example.com

Let's try to recreate what the browser just did for us!
fsockopen()
We usually don’t see this lower-deck communication, but just for the sake of it, let’s create this request with the most basic tool PHP has to offer: fsockopen():
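A minimal sketch of such a raw request could look like this (example.com stands in for whichever host you are actually talking to):

```php
<?php

// Open a raw TCP connection to the web server (port 80, plain HTTP)
$fp = fsockopen('www.example.com', 80, $errno, $errstr, 30);
if (!$fp) {
    die("Connection failed: {$errstr} ({$errno})\n");
}

// Write the HTTP request line by line – note the trailing blank line
// that tells the server the headers are complete
fwrite($fp, "GET / HTTP/1.1\r\n");
fwrite($fp, "Host: www.example.com\r\n");
fwrite($fp, "Connection: close\r\n");
fwrite($fp, "\r\n");

// Read the raw response (status line, headers, and body)
// until the server closes the connection
while (!feof($fp)) {
    echo fgets($fp, 1024);
}
fclose($fp);
```

Running this prints the complete, unparsed HTTP response, headers included, which is exactly the "lower-deck" traffic your browser normally hides from you.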
Looking at the target page's HTML, we can note a few things:

- There is a header element containing "Births" (only one element on the whole page should™ have an ID named "Births").
- The header is immediately followed by an unordered list (<ul>).
- Each list item (<li>) contains a year, a dash, a name, a comma, and a teaser of what the given person is known for.

This is something we can work with, let's go!
$html = file_get_contents('');
echo $html;
Wait, what? Surprise! file_get_contents() is probably the easiest way to perform uncomplicated GET requests – it's not really meant for this use case, but PHP is a language that allows many things you shouldn't do. (It's fine for this example and for one-off scripts when you know what you're requesting.)
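If you do reach for file_get_contents() in a one-off script, a stream context at least lets you set a timeout and a User-Agent header; the URL below is just a stand-in for whichever page you are scraping:

```php
<?php

// Build a stream context with standard PHP HTTP context options
$context = stream_context_create([
    'http' => [
        'method'  => 'GET',
        'timeout' => 10, // give up after 10 seconds
        'header'  => "User-Agent: MyScraper/1.0\r\n",
    ],
]);

// The third argument applies the context to the request
$html = file_get_contents('https://www.example.com', false, $context);

if ($html === false) {
    die("Request failed\n");
}

// Print only the beginning, the full page is a lot of output
echo substr($html, 0, 200) . "\n";
```

This keeps the one-liner convenience while avoiding the two most common pitfalls: hanging forever on a slow server and sending no identifying User-Agent at all.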
    Script output
Have you read all the HTML that the script has printed out? I hope not, because it's a lot! The important thing is that we know where we should start looking: we're only interested in the part starting with id="Births" and ending after the closing </ul> of the list right after that:

$start = stripos($html, 'id="Births"');
$end = stripos($html, '</ul>', $offset = $start);
$length = $end - $start;
$htmlSection = substr($html, $start, $length);
echo $htmlSection;

We're getting closer!
Cleaner results
This is not valid HTML anymore, but at least we can see what we're working with! Let's use a regular expression to load all list items into an array so that we can handle each item one by one:

preg_match_all('@<li>(.+)</li>@', $htmlSection, $matches);
$listItems = $matches[1];

foreach ($listItems as $item) {
    echo "{$item}\n\n";
}
    Cleaner results (bis)
For the years and names: we can see from the output that the first number is the birth year. It's followed by an HTML entity for a dash. Finally, the name is located within the following link element. Let's grab 'em all, and we're done.
echo "Who was born on December 10th\n";
echo "=============================\n\n";

foreach ($listItems as $item) {
    preg_match('@(\d+)@', $item, $yearMatch);
    $year = (int) $yearMatch[0];

    preg_match('@<a[^>]*>(.*?)</a>@i', $item, $nameMatch);
    $name = $nameMatch[1];

    echo "{$name} was born in {$year}\n";
}
    Final results
I don't know about you, but I feel a bit dirty now. We achieved our goal, but instead of elegantly navigating the HTML DOM tree, we destroyed it beyond recognition and ripped out pieces of information with commands that are not easy to understand. And worst of all, this script will throw an error for items where the year is not wrapped in a link (I didn't show you because the screenshot looks nicer without it).
    We can do better! When? Now!
    3. Guzzle, XML, XPath, and IMDb
Guzzle is a popular HTTP client for PHP that makes it easy and enjoyable to send HTTP requests. It provides you with an intuitive API, extensive error handling, and even the possibility of extending its functionality with middleware. This makes Guzzle a powerful tool that you don't want to miss. You can install Guzzle from your terminal with composer require guzzlehttp/guzzle.
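As a quick sketch of what the API looks like (the timeout and User-Agent values here are arbitrary example choices, and the URL is a placeholder):

```php
<?php

require 'vendor/autoload.php';

// A minimal Guzzle client with a few common options set explicitly
$client = new \GuzzleHttp\Client([
    'timeout' => 10.0,                              // fail fast on slow servers
    'headers' => ['User-Agent' => 'MyScraper/1.0'], // identify ourselves
]);

$response = $client->get('https://www.example.com');

// Inspect the parts of the response we care about
echo $response->getStatusCode() . "\n";               // e.g. 200
echo $response->getHeaderLine('Content-Type') . "\n"; // e.g. text/html; charset=UTF-8
echo strlen((string) $response->getBody()) . " bytes of HTML\n";
```

On failure (connection errors, 4xx/5xx responses), Guzzle throws exceptions from the GuzzleHttp\Exception namespace, which is what makes its error handling so much nicer than checking return values by hand.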
Let's cut to the chase and have a look at the HTML of IMDb's list of people born on December 10th (Wikipedia's URLs were definitely nicer):
    IMDB HTML structure
We can see straight away that we'll need a better tool than string functions and regular expressions here. Instead of a list with list items, we see nested <div>s. There's no id="…" that we can use to jump to the relevant content. But worst of all: the birth year is either buried in the biography excerpt or not visible at all!
    We’ll try to find a solution for the year-situation later, but for now, let’s at least get the names of our jubilees with XPath, a query language to select nodes from a DOM Document.
    In our new script, we’ll first fetch the page with Guzzle, convert the returned HTML string into a DOMDocument object and initialize an XPath parser with it:
require 'vendor/autoload.php';

$client = new \GuzzleHttp\Client();
$response = $client->get('');
$htmlString = (string) $response->getBody();

// HTML is often wonky, this suppresses a lot of warnings
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);
Let's have a closer look at the HTML in the window above:

- The list is contained in a <div class="lister-list"> element.
- Each direct child of this container is a <div> with a lister-item mode-detail class attribute.
- Finally, the name can be found within an <a> within an <h3> within a <div> with a lister-item-content class.

If we look closer, we can make it even simpler and skip the child divs and class names: there is only one <h3> in a list item, so let's target that directly:
$links = $xpath->evaluate('//div[@class="lister-list"][1]//h3/a');

foreach ($links as $link) {
    echo $link->textContent . PHP_EOL;
}

- //div[@class="lister-list"][1] returns the first ([1]) div with an attribute named class that has the exact value lister-list.
- Within that div, from all <h3> elements (//h3), return the anchors (a).
- We then iterate through the result and print the text content of the anchor elements.
I hope I explained it well enough for this use case, but in any case, our article "Practical XPath for Web Scraping" here on this blog explains XPath far better and goes much deeper than I ever could, so definitely check it out (but finish reading this one first!).
    We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Feel free to check the documentation here.
4. Goutte and IMDb
    Guzzle is one HTTP client, but many others are equally excellent – it just happens to be one of the most mature and most downloaded. PHP has a vast, active community; whatever you need, there’s a good chance someone else has written a library or framework for it, and web scraping is no exception.
    Goutte is an HTTP client made for web scraping. It was created by Fabien Potencier, the creator of the Symfony Framework, and combines several Symfony components to make web scraping very comfortable:
    The BrowserKit component simulates the behavior of a web browser that you can use programmatically
    Think of the DomCrawler component as DOMDocument and XPath on steroids – except that steroids are bad, and the DomCrawler is good!
    The CssSelector component translates CSS queries into XPath queries.
    The Symfony HTTP Client is a relatively new component (it was released in 2019) – being developed and maintained by the Symfony team, it has gained in popularity very quickly.
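If you are curious what the CssSelector component actually does, you can call its converter directly (this sketch assumes the symfony/css-selector package is installed, which Goutte pulls in):

```php
<?php

require 'vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();

// Translate a CSS query into the XPath expression
// that the DomCrawler evaluates internally
echo $converter->toXPath('div.lister-list h3 a') . "\n";
// prints an XPath expression matching the same nodes as the CSS selector
```

This is all that happens behind the scenes when you use CSS selectors with Goutte: the CSS query is compiled to XPath once, and the XPath engine does the actual work.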
Let's install Goutte with composer require fabpot/goutte and recreate the previous XPath query with it:

$client = new \Goutte\Client();
$crawler = $client->request('GET', '');
$links = $crawler->evaluate('//div[@class="lister-list"][1]//h3/a');

This alone is already pretty good – we saved the step where we had to explicitly disable XML warnings and didn't need to instantiate an XPath object ourselves. Now, let's replace the XPath expression with a CSS query (thanks to the CssSelector component integrated into Goutte):

$crawler->filter('.lister-list h3 a')->each(function ($node) {
    echo $node->text() . PHP_EOL;
});
I like where this is going; our script looks more and more like a conversation that even a non-programmer can understand, not just code. However, now is the time to find out whether you're coding along or not: does this script return results when you run it? Because for me, it didn't at first – I spent an hour debugging why and finally discovered a solution:
    composer require masterminds/html5
    As it turns out, the reason why Goutte (more precisely: the DOMCrawler) doesn’t report XML warnings is that it just throws away the parts it cannot parse. The additional library helps with HTML5 specifically, and after installing it, the script runs as expected.
We will talk more about this later, but for now, let's remember that we're still missing the birth years of our jubilees. This is where a web scraping library like Goutte really shines: we can click on links! And indeed: if we click one of the names in the birthday list to go to a person's profile, we can see a "Born:" line, with a corresponding element in the HTML containing the birth date.
From here on, the examples scrape the Books to Scrape demo website.

1. The first thing we want is the book title. It is inside the <a>, which is in turn inside the <h3>, which is inside the <article>, which is inside the <li>, which is finally inside the <ol>.
2. To initialize Guzzle, XML, and XPath, add the following code to the file:

$client = new \GuzzleHttp\Client();
$response = $client->get('');
$htmlString = (string) $response->getBody();

// add this line to suppress any warnings
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlString);
$xpath = new DOMXPath($doc);

The above code snippet will load the web page into a string. We then parse the string and assign the parser to the $xpath variable. The next thing you want is to target the text content inside the <h3> tag. Add the following code to the file:

$titles = $xpath->evaluate('//ol[@class="row"]//li//article//h3/a');
$extractedTitles = [];

foreach ($titles as $title) {
    $extractedTitles[] = $title->textContent;
    echo $title->textContent . PHP_EOL;
}

In the code snippet above, //ol[@class="row"] gets the whole list. Each item in the list has an <a> tag that we are targeting to extract the book's actual title. We only have one <h3> tag containing the <a>, which makes it easy to target directly. We use the foreach loop to extract the text contents and echo them to the terminal. At this step you may choose to do something with your extracted data: assign the data to an array variable, write it to a file, or store it in a database. You can execute the file using PHP on the terminal by running php followed by the name of your file. This should display a list of book titles.

That went well, but what if we wanted to also get the price of the book? The price happens to be inside a <p> tag, inside a <div> tag. As you can see, there is more than one <p> tag and more than one <div> tag. To find the right target, we will use CSS class selectors which, lucky for us, are unique for each tag. Here is the code snippet to also get the price tag and concatenate it to the title string:

$titles = $xpath->evaluate('//ol[@class="row"]//li//article//h3/a');
$prices = $xpath->evaluate('//ol[@class="row"]//li//article//div[@class="product_price"]//p[@class="price_color"]');

foreach ($titles as $key => $title) {
    echo $title->textContent . ' @ ' . $prices[$key]->textContent . PHP_EOL;
}

If you execute the code on your terminal, each title should now be followed by its price. Of course, this is a basic web scraper, and you can certainly make it better. Let's move to the next section.

Web Scraping in PHP with Goutte

Goutte is another excellent HTTP client for PHP that's specifically made for web scraping. It was developed by the creator of the Symfony Framework and provides a nice API to scrape data from the HTML/XML responses of websites. Below are some of the components it includes to make web crawling straightforward:

- The BrowserKit component simulates the behavior of a web browser.
- The CssSelector component translates CSS queries into XPath queries.
- The DomCrawler component brings the power of DOMDocument and XPath.
- The Symfony HTTP Client is a fairly new component from the Symfony team.

Install Goutte via Composer by executing the following command on your terminal:

composer require fabpot/goutte

Once you have installed the Goutte package, create a new PHP file for our code. In this section we'll repeat what we did with the Guzzle library above: we will scrape book titles from the Books to Scrape website using Goutte. Then we'll see how you can add the prices into an array variable and use the variable within the code. Add the following code inside the file:

$client = new \Goutte\Client();
$response = $client->request('GET', '');
$titles = $response->evaluate('//ol[@class="row"]//li//article//h3/a');
$prices = $response->evaluate('//ol[@class="row"]//li//article//div[@class="product_price"]//p[@class="price_color"]');

// we can store the prices into an array
$priceArray = [];
foreach ($prices as $price) {
    $priceArray[] = $price->textContent;
}

// we extract the titles and display them together with the prices
foreach ($titles as $key => $title) {
    echo $title->textContent . ' @ ' . $priceArray[$key] . PHP_EOL;
}

Execute the code by running php followed by the name of your file in the terminal. This is one way of web scraping with Goutte.

Let's discuss another method using the CSS Selector component that comes with Goutte. The CSS selector is more straightforward than the XPath shown in the previous method. Create another PHP file and add the following code to it:

$client = new \Goutte\Client();
$response = $client->request('GET', '');

// get the prices into an array
$prices = [];
$response->filter('.row li article .product_price .price_color')->each(function ($node) use (&$prices) {
    $prices[] = $node->text();
});

// echo titles and prices
$priceIndex = 0;
$response->filter('.row li article h3 a')->each(function ($node) use ($prices, &$priceIndex) {
    echo $node->text() . ' @ ' . $prices[$priceIndex] . PHP_EOL;
    $priceIndex++;
});

As you can see, using the CSS Selector component results in cleaner and more readable code. You may have noticed that we used the & operator. This ensures that we take a reference to the variable into the each loop, and not just its value. If &$prices is modified within the loop, the actual value outside the loop is also modified. You can read more on assignment by reference in the official PHP docs.

Execute the file in your terminal; you should see an output similar to the one in the previous screenshots.

Our web scraper with PHP and Goutte is going well so far. Let's go a little deeper and see if we can click on a link and navigate to a different page. On our demo website, Books to Scrape, if you click on the title of a book, a page will load showing details of the book. We want to see if we can click on a link from the books list, navigate to the book details page, and extract the description. Inspect the page to see what we will be targeting: our target flow goes from the .content element, into #content_inner, then into the <article> tag, which only appears once, and finally to the <p> tag. We have several <p> tags – the one with the description is the fourth inside the <article> parent. Since arrays start at 0, we will be getting the node at index 3.

Now that we know what we are targeting, let's write the code. First, add the following Composer package to help with HTML5 parsing:

composer require masterminds/html5

Next, modify the file as follows:

$client = new \Goutte\Client();
$response = $client->request('GET', '');

// get the prices into an array
$prices = [];
$response->filter('.row li article .product_price .price_color')
    ->each(function ($node) use (&$prices) {
        $prices[] = $node->text();
    });

// echo title, price, and description
$priceIndex = 0;
$response->filter('.row li article h3 a')
    ->each(function ($node) use ($prices, &$priceIndex, $client) {
        $title = $node->text();
        $price = $prices[$priceIndex];

        // getting the description
        $description = $client->click($node->link())
            ->filter('.content #content_inner article p')->eq(3)->text();

        // display the result
        echo "{$title} @ {$price}: {$description}\n\n";
        $priceIndex++;
    });

If you execute the file in your terminal, you should see a title, price, and description displayed. Using the Goutte CSS Selector component and the option to click through pages, you can easily crawl an entire website with several pages and extract as much data as you need.

Web Scraping in PHP with Simple HTML DOM

Simple HTML DOM is another minimalistic PHP web scraping library that you can use to crawl a website. Just like in the previous examples, we will be scraping the Books to Scrape website.

Before you can install the package, modify your composer.json file and add the following lines just below the require:{} block to avoid getting a versioning error:

"minimum-stability": "dev",
"prefer-stable": true

Now, you can install the library with the following command:

composer require simplehtmldom/simplehtmldom

Once the library is installed, create a new PHP file. We have already discussed the layout of the web page we are scraping in the previous sections, so we will go straight to the code. Add the following code to the file:

$htmlDomParser = new \simplehtmldom\HtmlWeb();
$response = $htmlDomParser->load('');

// echo the title
echo $response->find('title', 0)->plaintext . PHP_EOL . PHP_EOL;

// get the prices into an array
$prices = [];
foreach ($response->find('.row li article .product_price .price_color') as $price) {
    $prices[] = $price->plaintext;
}

foreach ($response->find('.row li article h3 a') as $key => $title) {
    echo "{$title->plaintext} @ {$prices[$key]} \n";
}

If you execute the code in your terminal, it should display the results. You can find more methods to crawl a web page with the Simple HTML DOM library in the official API docs.

Web Scraping in PHP with a Headless Browser (Symfony Panther)

A headless browser is a browser without a graphical user interface. Headless browsers allow you to use your terminal to load a web page in an environment similar to a web browser, and to write code that controls the browsing, just as we have done in the previous steps. So why is this necessary?

In modern web development, most developers use JavaScript web frameworks. These frameworks generate the HTML code inside the browser. In other cases, AJAX dynamically loads the content. In the previous examples, we used a static HTML page, so the output was consistent. In dynamic cases, where JavaScript and AJAX generate the HTML, the output of the DOM tree may differ greatly, which would cause our scrapers to fail. Headless browsers come into the picture to handle such issues on modern websites.

The Symfony Panther PHP library works well with headless browsers. You can use the library to scrape websites and run tests using real browsers. In addition, it provides the same methods as the Goutte library, so you can use it in place of Goutte. Unlike the previous web scraping libraries we've discussed in this tutorial, Panther can do the following:

- Execute JavaScript code on web pages
- Support remote browser testing
- Support asynchronous loading of elements by waiting for other elements to load before executing a line of code
- Support all implementations of Chrome and Firefox
- Take screenshots
- Run your custom JS code or XPath queries within the context of the loaded page

We have already been doing a lot of scraping, so let's try something different.
We will be loading an HTML page and taking a screenshot of the page.

Install Symfony Panther with the following command:

composer require symfony/panther

Create a new PHP file and add the following code:

$client = \Symfony\Component\Panther\Client::createChromeClient();
$client->get('');

// take a screenshot and save it in the current directory
$client->takeScreenshot($saveAs = '');

// let's display some book titles
$client->getCrawler()->filter('.row li article h3 a')
    ->each(function ($node) {
        echo $node->text() . PHP_EOL;
    });

For this code to run on your system, you must install the drivers for Chrome or Firefox, depending on which client you used in your code. Fortunately, Composer can do this for you automatically. Execute the following command in your terminal to install and detect the drivers:

composer require --dev dbrekelmans/bdi && vendor/bin/bdi detect drivers

Now you can execute the PHP file in your terminal. It will take a screenshot of the webpage, store it in the current directory, and then display a list of titles from the website.

Conclusion

In this tutorial, we discussed the various PHP open-source libraries you can use to scrape a website. If you followed along, you should have been able to create a basic scraper to crawl a page or two. While this was an introductory article, we covered most methods you can use with these libraries. You may choose to build on this knowledge and create complex web scrapers that can crawl thousands of pages. The code for this tutorial is available from this GitHub repository. Feel free to get in touch if you have any questions.

You can check out some other articles on web scraping with Node.js and web scraping with Python if you're interested.
Learn to code for free. freeCodeCamp's open source curriculum has helped more than 40,000 people get jobs as developers.
Web scraping in PHP – Stack Overflow
      This question is fairly old but still ranks very highly on Google Search results for web scraping tools in PHP. Web scraping in PHP has advanced considerably in the intervening years since the question was asked. I actively maintain the Ultimate Web Scraper Toolkit, which hasn’t been mentioned yet but predates many of the other tools listed here except for Simple HTML DOM.
      The toolkit includes TagFilter, which I actually prefer over other parsing options because it uses a state engine to process HTML with a continuous streaming tokenizer for precise data extraction.
To answer the original question of, "Is there any simple way to do this without any external libraries/classes?" The answer is no. HTML is rather complex and there's nothing built into PHP that's particularly suitable for the task. You really need a reusable library to parse generic HTML correctly and consistently. Plus you'll find plenty of uses for such a library.
      Also, a really good web scraper toolkit will have three major, highly-polished components/capabilities:
Data retrieval. This is making an HTTP(S) request to a server and pulling down data. A good web scraping library will also allow for large binary data blobs to be written directly to disk as they come down off the network instead of loading the whole thing into RAM. The ability to do dynamic form extraction and submission is also very handy. A really good library will let you fine-tune every aspect of each request to each server as well as look at the raw data it sent and received on the wire. Some web servers are extremely picky about input, so being able to accurately replicate a browser is handy.
Data extraction. This is finding pieces of content inside retrieved HTML and pulling it out, usually to store it into a database for future lookups. A good web scraping library will also be able to correctly parse any semi-valid HTML thrown at it, including Microsoft Word HTML and output where odd things show up, like a single HTML tag that spans several lines. The ability to easily extract all the data from poorly designed, complex, classless tags like HTML table elements that some overpaid government employees made is also very nice to have (i.e. the extraction tool has more than just a DOM or CSS3-style selection engine available). Also, in your case, the ability to early-terminate both the data retrieval and data extraction after reading in 50KB or as soon as you find what you are looking for is a plus, which could be useful if someone submits a URL to a 500MB file.
Data manipulation. This is the inverse of #2. A really good library will be able to modify the input HTML document several times without negatively impacting performance. When would you want to do this? Sanitizing user-submitted HTML, transforming content for a newsletter or other email, downloading content for offline viewing, or preparing content for transport to another service that's finicky about input (e.g. sending to Apple News or Amazon Alexa). The ability to create a custom HTML-style template language is also a nice bonus.
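To give a flavor of the manipulation side in plain PHP, here is a toy sketch that strips <script> elements from a fragment with DOMDocument (a real sanitizer has to handle much more, such as event-handler attributes, inline styles, and URLs):

```php
<?php

// Toy example only: remove <script> elements from an HTML fragment
$html = '<p>Hello</p><script>alert("x")</script><p>World</p>';

// Wrap the fragment so we can serialize just its children afterwards
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML('<div id="wrapper">' . $html . '</div>');

// Copy the live NodeList into an array first –
// removing nodes while iterating a live list skips elements
foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
    $script->parentNode->removeChild($script);
}

// Serialize only the wrapper's children back to an HTML string
$wrapper = $doc->getElementById('wrapper');
$clean = '';
foreach ($wrapper->childNodes as $child) {
    $clean .= $doc->saveHTML($child);
}

echo $clean . "\n"; // the fragment, minus the script element
```

Note the detail in the middle: DOM node lists are live, so mutating the tree while iterating them directly is a classic source of half-sanitized output.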
Obviously, Ultimate Web Scraper Toolkit does all of the above, and more:
      I also like my toolkit because it comes with a WebSocket client class, which makes scraping WebSocket content easier. I’ve had to do that a couple of times.
It was also relatively simple to turn the clients on their heads and make WebServer and WebSocketServer classes. You know you've got a good library when you can turn the client into a server. Then I went and made PHP App Server with those classes. I think it's becoming a monster!

Frequently Asked Questions about how to do web scraping in PHP

      Can We Do Web Scraping using PHP?

Web scraping lets you collect data from web pages across the internet. It's also called web crawling or web data extraction. PHP is a widely used back-end scripting language for creating dynamic websites and web applications. And you can implement a web scraper using plain PHP code. (Jun 22, 2021)

      Is Web scraping harmful?

Further, data scraping can open the door to spear phishing attacks; hackers can learn the names of superiors, ongoing projects, trusted third parties, etc. Essentially, everything a hacker could need to craft their message to make it plausible and provoke the correct (rash and ill-informed) response in their victims. (Aug 24, 2020)

      Which language is best for web scraping?

Python is mostly known as the best web scraper language. It's more like an all-rounder and can handle most of the web crawling related processes smoothly. Beautiful Soup is one of the most widely used frameworks based on Python that makes scraping using this language such an easy route to take. (Aug 9, 2017)
