PHP Web Scraping Library

8 Awesome PHP Web Scraping Libraries and Tools – DZone

Web scraping is something developers encounter on a daily basis.
Each scraping task has its own goal, such as collecting product or stock prices.
Web scraping is popular in backend development, and developers keep producing quality parsers and scrapers.
In this post, we will explore some libraries that let you scrape websites and store the data in a form that suits your immediate needs.
In PHP, you can do scraping with some of these libraries:
Goutte
Simple HTML DOM
htmlSQL
cURL
Requests
HTTPful
Buzz
Guzzle
1. Goutte
Description:
The Goutte library offers strong support for scraping content with PHP.
Based on the Symfony framework, Goutte is a web scraping as well as web crawling library.
Goutte is useful because it provides APIs to crawl websites and scrape data from the HTML/XML responses.
Goutte is licensed under the MIT license.
Features:
It works well with big projects.
It is OOP based.
Its parsing speed is moderate.
Requirements:
Goutte depends on PHP 5.5+ and Guzzle 6+.
2. Simple HTML DOM
Description:
Written for PHP 5+, Simple HTML DOM is an HTML DOM parser that lets you access and work with HTML easily.
With it, you can find the tags on an HTML page with selectors pretty much like jQuery.
You can scrape content from HTML in a single line.
It is not as fast as some of the other libraries.
Simple HTML DOM is licensed under the MIT license.
Features:
It supports invalid HTML.
Requirements:
Requires PHP 5+.
3. htmlSQL
Description:
htmlSQL is an experimental PHP library that lets you access HTML values with a SQL-like syntax.
This means you don't need to write complex functions or regular expressions to scrape specific values.
If you like SQL, you will probably enjoy this experimental library.
It is handy for miscellaneous tasks and for parsing a web page quickly.
Although it stopped receiving updates and support in 2006, htmlSQL remains a reliable library for parsing and scraping.
htmlSQL is licensed under the BSD license.
Features:
It provides relatively fast parsing, but it has limited functionality.
Requirements:
Any flavor of PHP 4+ should do.
Snoopy PHP class, version 1.2.3 (optional; required for web transfers).
4. cURL
Description:
cURL is one of the most popular tools for extracting data from web pages, and PHP exposes it through a built-in component.
There is no need to include third-party files or classes, as it is a standard PHP library.
Requirements:
To use PHP's cURL functions, you need to install the libcurl package. PHP requires libcurl version 7.10.5 or later.
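As a rough illustration of the built-in cURL functions described above, the sketch below wraps a GET request in a small helper. The function name `fetchPage`, the default timeout, and the user-agent string are assumptions made for this example; they do not come from the article or from any library.

```php
<?php
declare(strict_types=1);

/**
 * Fetch a URL with PHP's built-in cURL functions and return the body.
 * fetchPage is a hypothetical helper name chosen for this sketch.
 */
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $handle = curl_init($url);
    if ($handle === false) {
        throw new RuntimeException('Could not initialise cURL');
    }

    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'example-scraper/0.1', // assumed value
    ]);

    $body = curl_exec($handle);
    if ($body === false) {
        // curl_error() describes the last failure on this handle
        throw new RuntimeException('cURL request failed: ' . curl_error($handle));
    }

    return $body;
}
```

Calling `fetchPage()` with a page address would return that page's HTML as a string, which can then be handed to a parser such as DOMDocument.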
5. Requests
Description:
Requests is an HTTP library written in PHP.
It is loosely based on the API of the excellent Requests Python library.
Requests enables you to send HEAD, GET, POST, PUT, DELETE, and PATCH HTTP requests.
With the help of Requests, you can add headers, form data, multipart files, and parameters with simple arrays, and access the response data in the same way.
Requests is ISC Licensed.
Features:
International Domains and URLs.
Browser-style SSL Verification.
Basic/Digest Authentication.
Automatic Decompression.
Connection Timeouts.
Requirements:
Requires PHP version 5.2+.
6. HTTPful
Description:
HTTPful is a straightforward PHP library that is both chainable and readable. Its aim is to make HTTP readable.
It is useful because it lets the developer focus on interacting with APIs rather than sifting through curl_setopt pages. It is also a great PHP REST client.
HTTPful is licensed under the MIT license.
Features:
Readable HTTP Method Support (GET, PUT, POST, DELETE, HEAD, PATCH, and OPTIONS).
Custom Headers.
Automatic “Smart” Parsing.
Automatic Payload Serialization.
Basic Auth.
Client Side Certificate Auth.
Request “Templates.”
Requirements:
Requires PHP version 5.3+.
7. Buzz
Description:
Buzz is a lightweight library for issuing HTTP requests.
Moreover, Buzz is designed to be simple and it carries the characteristics of a web browser.
Buzz is licensed under the MIT license.
Features:
Simple API.
High performance.
Requirements:
Requires PHP version 7.1.
8. Guzzle
Description:
Guzzle is a PHP HTTP client that makes it easy to send HTTP requests. It is also easy to integrate with web services.
Features:
It has a simple interface for building query strings, sending POST requests, streaming large uploads and downloads, using HTTP cookies, uploading JSON data, and more.
It can send both synchronous and asynchronous requests through the same interface.
It uses PSR-7 interfaces for requests, responses, and streams. This enables you to use other PSR-7 compatible libraries with Guzzle.
It can abstract away the underlying HTTP transport, enabling you to write environment- and transport-agnostic code; i.e., no hard dependency on cURL, PHP streams, sockets, or non-blocking event loops.
Its middleware system enables you to augment and compose client behavior.
Requirements:
Requires PHP version 5.3.3+.
Conclusion
As you can see, there are many web scraping tools at your disposal, and the right one depends on your scraping needs.
However, a basic understanding of these PHP libraries can help you navigate through the maze of many libraries that exist and arrive at something useful.
I hope that you liked reading this post. Feel free to share your feedback and comments!
Web Scraping with PHP – How to Crawl Web Pages Using ...

Web scraping lets you collect data from web pages across the internet. It's also called web crawling or web data extraction.
PHP is a widely used back-end scripting language for creating dynamic websites and web applications, and you can implement a web scraper using plain PHP code. But since we do not want to reinvent the wheel, we can leverage some readily available open-source PHP web scraping libraries to help us collect our data.
In this tutorial, we will discuss the various tools and services you can use with PHP to scrape a web page. The tools we will discuss are Guzzle, Goutte, Simple HTML DOM, and the headless browser library Symfony Panther.
Before you scrape a website, you should carefully read its Terms of Service to make sure the owners are OK with being scraped. Scraping data, even if it's publicly accessible, can potentially overload a website's servers. (Who knows, if you ask politely, they may even give you an API key so you don't have to scrape.)

How to Set Up the Project
Before we begin, if you would like to follow along and try out the code, here are some prerequisites for your development environment:
Ensure you have installed the latest version of PHP.
Set up Composer, which we will use to install the various PHP dependencies for the web scraping libraries.
An editor of your choice.
Once you are done with all that, create a project directory and navigate into the directory:

    mkdir php_scraper
    cd php_scraper

Run the following two commands in your terminal to initialize the composer.json file:

    composer init --require="php >=7.4" --no-interaction
    composer update

Let's get started.

Web Scraping with PHP using Guzzle, XML, and XPath
Guzzle is a PHP HTTP client that lets you send HTTP requests quickly and easily. It has a simple interface for building query strings. XML is a markup language that encodes documents so they're human-readable and machine-readable. And XPath is a query language that navigates and selects XML nodes. Let's see how we can use these three tools together to scrape a website.
Start by installing Guzzle via Composer by executing the following command in your terminal:

    composer require guzzlehttp/guzzle

Once you've installed Guzzle, create a new PHP file to which we will add the code.
For this demonstration, we will be scraping the Books to Scrape website. You should be able to follow the same steps to scrape any website of your choice.
We want to extract the titles of the books and display them on the terminal. The first step in scraping a website is understanding its HTML layout. In this case, you can view the HTML layout of the page by right-clicking on the page, just above the first product in the list, and inspecting the page source.
You can see that the list is contained inside the <ol class="row"> element. The next direct child is the <li> element. What we want is the book title. It is inside the <a>, which is in turn inside the <h3>, which is inside the <article>, which is finally inside the <li> element.
To initialize Guzzle, XML and XPath, add the following code to the file:

    <?php
    require 'vendor/autoload.php';
    $httpClient = new \GuzzleHttp\Client();
    // put the address of the Books to Scrape site between the quotes
    $response = $httpClient->get('');
    $htmlString = (string) $response->getBody();
    //add this line to suppress any warnings
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($htmlString);
    $xpath = new DOMXPath($doc);

The above code snippet loads the web page into a string. We then parse the string with DOMDocument and wrap the result in a DOMXPath object, which we assign to the $xpath variable.
The next thing you want is to target the text content inside the <a> tag. Add the following code to the file:

    $titles = $xpath->evaluate('//ol[@class="row"]//li//article//h3/a');
    $extractedTitles = [];
    foreach ($titles as $title) {
        $extractedTitles[] = $title->textContent.PHP_EOL;
        echo $title->textContent.PHP_EOL;
    }

In the code snippet above, //ol[@class="row"] gets the whole list. Each item in the list has an <a> tag that we are targeting to extract the book's actual title. We only have one <h3> tag containing the <a>, which makes it easier to target it directly.
We use the foreach loop to extract the text contents and echo them to the terminal. At this step you may choose to do something with your extracted data, maybe assign the data to an array variable, write to a file, or store it in a database.
You can execute the file by running php followed by the name you gave the file. This should display the list of book titles.
That went well. Now, what if we wanted to also get the price of the book? The price happens to be inside a <p> tag, inside a <div> tag. As you can see, there is more than one <div> tag and more than one <p> tag. To find the right target, we will use the CSS class names which, lucky for us, are unique for each tag. Here is the code snippet to also get the price and concatenate it to the title string:

    $titles = $xpath->evaluate('//ol[@class="row"]//li//article//h3/a');
    $prices = $xpath->evaluate('//ol[@class="row"]//li//article//div[@class="product_price"]//p[@class="price_color"]');
    foreach ($titles as $key => $title) {
        echo $title->textContent . ' @ ' . $prices[$key]->textContent.PHP_EOL;
    }

If you execute the code on your terminal, you should see each title followed by its price. Your whole file now consists of the initialization code followed by the two queries and this loop.
Of course, this is a basic web scraper, and you can certainly make it better. Let's move to the next library.

Web Scraping in PHP with Goutte
Goutte is another excellent HTTP client for PHP that's specifically made for web scraping. It was developed by the creator of the Symfony Framework and provides a nice API to scrape data from the HTML/XML responses of websites. Below are some of the components it includes to make web crawling straightforward:
BrowserKit component to simulate the behavior of a web browser.
CssSelector component for translating CSS queries into XPath queries.
DomCrawler component, which brings the power of DOMDocument and XPath.
Symfony HTTP Client, a fairly new component from the Symfony team.
Install Goutte via Composer by executing the following command on your terminal:

    composer require fabpot/goutte

Once you have installed the Goutte package, create a new PHP file for our code.
In this section we'll repeat what we did with the Guzzle library in the first section. We will scrape book titles from the Books to Scrape website using Goutte. Then we'll see how you can add the prices into an array variable and use the variable within the code. Add the following code inside the file:

    <?php
    require 'vendor/autoload.php';
    $httpClient = new \Goutte\Client();
    // put the address of the Books to Scrape site between the quotes
    $response = $httpClient->request('GET', '');
    $titles = $response->evaluate('//ol[@class="row"]//li//article//h3/a');
    $prices = $response->evaluate('//ol[@class="row"]//li//article//div[@class="product_price"]//p[@class="price_color"]');
    // we can store the prices into an array
    $priceArray = [];
    foreach ($prices as $key => $price) {
        $priceArray[] = $price->textContent;
    }
    // we extract the titles and display to the terminal together with the prices
    foreach ($titles as $key => $title) {
        echo $title->textContent . ' @ ' . $priceArray[$key] . PHP_EOL;
    }

Execute the code by running php followed by the file name in the terminal. The output lists each title with its price.
This is one way of web scraping with Goutte. Let's discuss another method using the CSS Selector component that comes with Goutte. The CSS selector is more straightforward than the XPath shown in the previous method.
Create another PHP file and add the following code to it:

    <?php
    require 'vendor/autoload.php';
    $httpClient = new \Goutte\Client();
    // put the address of the Books to Scrape site between the quotes
    $response = $httpClient->request('GET', '');
    // get prices into an array
    $prices = [];
    $response->filter('ol.row li article div.product_price p.price_color')->each(function ($node) use (&$prices) {
        $prices[] = $node->text();
    });
    // echo titles and prices
    $priceIndex = 0;
    $response->filter('ol.row li article h3 a')->each(function ($node) use ($prices, &$priceIndex) {
        echo $node->text() . ' @ ' . $prices[$priceIndex] . PHP_EOL;
        $priceIndex++;
    });

As you can see, using the CSS Selector component results in cleaner and more readable code. You may have noticed that we used the & operator. This ensures that we take a reference to the variable into the "each" loop, and not just a copy of its value. If $prices is modified within the loop, the actual value outside the loop is also modified. You can read more on assignment by reference in the official PHP docs.
Execute the file in your terminal. You should see output similar to that of the previous example.
Our web scraper with PHP and Goutte is going well so far. Let's go a little deeper and see if we can click on a link and navigate to a different page.
On our demo website, Books to Scrape, if you click on the title of a book, a page loads showing details of the book. We want to see if we can click on a link from the books list, navigate to the book details page, and extract the description.
Inspect the details page to see what we will be targeting. Our target flow will be from the element with the class content, then the element with the id content_inner, then the <article> tag, which only appears once, and finally the <p> tag. We have several <p> tags; the one with the description is the fourth. Since arrays start at 0, we will be getting the node at index 3.
Now that we know what we are targeting, let's write the code. First, add the following Composer package to help with HTML5 parsing:

    composer require masterminds/html5

Next, modify the file as follows:

    <?php
    require 'vendor/autoload.php';
    $httpClient = new \Goutte\Client();
    // put the address of the Books to Scrape site between the quotes
    $response = $httpClient->request('GET', '');
    // get prices into an array
    $prices = [];
    $response->filter('ol.row li article div.product_price p.price_color')
        ->each(function ($node) use (&$prices) {
            $prices[] = $node->text();
        });
    // echo title, price, and description
    $priceIndex = 0;
    $response->filter('ol.row li article h3 a')
        ->each(function ($node) use ($prices, &$priceIndex, $httpClient) {
            $title = $node->text();
            $price = $prices[$priceIndex];
            //getting the description
            $description = $httpClient->click($node->link())
                ->filter('.content #content_inner article p')->eq(3)->text();
            // display the result
            echo "{$title} @ {$price}: {$description}\n\n";
            $priceIndex++;
        });

If you execute the file in your terminal, you should see a title, price, and description displayed for each book.
Using the Goutte CSS Selector component and the option to click on a page, you can easily crawl an entire website with several pages and extract as much data as you need.

Web Scraping in PHP with Simple HTML DOM
Simple HTML DOM is another minimalistic PHP web scraping library that you can use to crawl a website. Let's discuss how you can use this library to scrape a website. Just like in the previous examples, we will be scraping the Books to Scrape website.
Before you can install the package, modify your composer.json file and add the following lines just below the require:{} block to avoid getting a versioning error:

    "minimum-stability": "dev",
    "prefer-stable": true

Now, you can install the library with the following command:

    composer require simplehtmldom/simplehtmldom

Once the library is installed, create a new PHP file.
We have already discussed the layout of the web page we are scraping in the previous sections. So, we will just go straight to the code. Add the following code to the file:

    <?php
    require 'vendor/autoload.php';
    use simplehtmldom\HtmlWeb;
    $client = new HtmlWeb();
    // put the address of the Books to Scrape site between the quotes
    $response = $client->load('');
    // echo the title
    echo $response->find('title', 0)->plaintext . PHP_EOL . PHP_EOL;
    // get the prices into an array
    $prices = [];
    foreach ($response->find('ol.row li article div.product_price p.price_color') as $price) {
        $prices[] = $price->plaintext;
    }
    // echo titles and prices
    foreach ($response->find('ol.row li article h3 a') as $key => $title) {
        echo "{$title->plaintext} @ {$prices[$key]} \n";
    }

If you execute the code in your terminal, it should display the page title followed by each book title and its price.
You can find more methods to crawl a web page using the Simple HTML DOM library in the official API docs.

Web Scraping in PHP with a Headless Browser (Symfony Panther)
A headless browser is a browser without a graphical user interface. Headless browsers allow you to use your terminal to load a web page in an environment similar to a web browser. This allows you to write code to control the browsing, as we have just done in the previous steps. So why is this necessary?
In modern web development, most developers use JavaScript web frameworks. These frameworks generate the HTML code inside the browser. In other cases, AJAX dynamically loads the content. In the previous examples, we used a static HTML page, so the output was consistent. In dynamic cases, where JavaScript and AJAX generate the HTML, the resulting DOM tree may differ greatly from the HTML the server sends, and our scrapers would fail. Headless browsers come into the picture to handle such issues on modern websites.
The Symfony Panther PHP library works well with headless browsers. You can use the library to scrape websites and run tests using real browsers. In addition, it provides the same methods as the Goutte library, so you can use it instead of Goutte. Unlike the previous web scraping libraries we've discussed in this tutorial, Panther can do the following:
Execute JavaScript code on web pages.
Support remote browser testing.
Support asynchronous loading of elements by waiting for other elements to load before executing a line of code.
Support all implementations of Chrome and Firefox.
Take screenshots.
Run your custom JS code or XPath queries within the context of the loaded page.
We have already been doing a lot of scraping, so let's try something different.
We will be loading an HTML page and taking a screenshot of the page.
Install Symfony Panther with the following command:

    composer require symfony/panther

Create a new PHP file and add the following code to it:

    <?php
    require 'vendor/autoload.php';
    $httpClient = \Symfony\Component\Panther\Client::createChromeClient();
    // put the address of the Books to Scrape site between the quotes
    $response = $httpClient->get('');
    // take a screenshot; put the file name to save it under between the quotes
    $response->takeScreenshot($saveAs = '');
    // let's display some book titles
    $response->getCrawler()->filter('ol.row li article h3 a')
        ->each(function ($node) {
            echo $node->text() . PHP_EOL;
        });

For this code to run on your system, you must install the drivers for Chrome or Firefox, depending on which client you used in your code. Fortunately, Composer can automatically do this for you. Execute the following command in your terminal to install and detect the drivers:

    composer require --dev dbrekelmans/bdi && vendor/bin/bdi detect drivers

Now you can execute the PHP file in your terminal. It will take a screenshot of the web page and store it in the current directory. It will then display a list of titles from the website.

Conclusion
In this tutorial, we discussed the various PHP open source libraries you can use to scrape a website. If you followed along with the tutorial, you should have been able to create a basic scraper to crawl a page or two. While this was an introductory article, we covered most methods you can use with the libraries. You may choose to build on this knowledge and create complex web scrapers that can crawl thousands of pages.
The code for this tutorial is available on GitHub. Feel free to get in touch if you have any questions. You can check out some other articles on web scraping with Nodejs and web scraping with Python if you're interested.
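The XPath technique from the Guzzle section can be tried without any network access. The sketch below runs the same title and price queries against a small inline HTML string. The markup, book names, and prices are made-up sample data that only imitates the structure described in the tutorial, and `extractBooks` is a hypothetical helper name.

```php
<?php
declare(strict_types=1);

/**
 * Extract title/price pairs from markup shaped like the book list
 * described in the tutorial. extractBooks is a hypothetical helper.
 *
 * @return array<int, array{title: string, price: string}>
 */
function extractBooks(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $titles = $xpath->query('//ol[@class="row"]//li//article//h3/a');
    $prices = $xpath->query(
        '//ol[@class="row"]//li//article//div[@class="product_price"]//p[@class="price_color"]'
    );

    $books = [];
    foreach ($titles as $index => $title) {
        // item() returns null when there is no price at this position
        $price = $prices->item($index);
        $books[] = [
            'title' => trim($title->textContent),
            'price' => $price !== null ? trim($price->textContent) : '',
        ];
    }
    return $books;
}

// Made-up sample markup (not copied from any real site).
$sampleHtml = <<<'HTML'
<html><body>
<ol class="row">
  <li><article>
    <h3><a href="book-1.html">Sample Book One</a></h3>
    <div class="product_price"><p class="price_color">10.00</p></div>
  </article></li>
  <li><article>
    <h3><a href="book-2.html">Sample Book Two</a></h3>
    <div class="product_price"><p class="price_color">12.50</p></div>
  </article></li>
</ol>
</body></html>
HTML;

foreach (extractBooks($sampleHtml) as $book) {
    echo $book['title'], ' @ ', $book['price'], PHP_EOL;
}
```

Pairing titles and prices by position mirrors the tutorial's approach. It assumes every list item has exactly one title and one price, which is worth checking before relying on it for a real page.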
    PHP Scraper - An opinionated web-scraping library for PHP

    by Peter Thaleikis
    Web scraping using PHP can be done more easily. This is an opinionated wrapper around some great PHP libraries to make accessing the web easier. The examples tell the story much better. Have a look!
    The Idea
    Accessing websites and collecting basic information from the web is too complex. This wrapper around Goutte makes it easier. It saves you from XPath and co., giving you direct access to everything you need. Web scraping with PHP re-imagined.
    Examples
    Here are some examples of what the web scraping library can do at this point:
    Scrape Meta Information
    Most other information can be accessed directly, either as a string or an array.
    Scrape Content, such as Images
    Some information is optionally returned as an array with details. For this example, a simple list of images is also available using $web->images. This should make your web scraping easier.
    More example code can be found in the sidebar or the tests.
    Installation
    As usual, installation is done via Composer. This automatically ensures the package is loaded, and you can start to scrape the web. You can now use any of the noted examples.
    Contributing
    If you would like to contribute, please check the guidelines before getting started.
    Tests
    The code is roughly covered with end-to-end tests. For this, simple web pages are hosted, loaded, and parsed using PHPUnit. These tests are also suitable as examples; see tests/. That said, there are probably edge cases that aren't working and may cause trouble. If you find one, please raise a bug on GitHub.

    Frequently Asked Questions about php web scraping library

    Can PHP be used for web scraping?

    Web scraping lets you collect data from web pages across the internet. It's also called web crawling or web data extraction. PHP is a widely used back-end scripting language for creating dynamic websites and web applications. And you can implement a web scraper using plain PHP code. (Jun 22, 2021)
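To make the "plain PHP" claim concrete, here is a minimal sketch that uses only the DOM extension bundled with PHP to list the links in an HTML string. The function name `listLinks` and the sample markup are assumptions for illustration only.

```php
<?php
declare(strict_types=1);

/**
 * Return the href and link text of every <a> element in an HTML string.
 * listLinks is a hypothetical helper name used only in this sketch.
 *
 * @return array<int, array{href: string, text: string}>
 */
function listLinks(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = [
            'href' => $anchor->getAttribute('href'),
            'text' => trim($anchor->textContent),
        ];
    }
    return $links;
}

// Made-up sample markup for the demonstration.
$sample = '<html><body><a href="/one">First</a> <a href="/two"> Second </a></body></html>';

foreach (listLinks($sample) as $link) {
    echo $link['text'], ' -> ', $link['href'], PHP_EOL;
}
```

In a real scraper the HTML string would come from an HTTP request instead of a literal, but the parsing step stays the same.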

    Which library is used for Web scraping?

    BeautifulSoup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. (Apr 24, 2020)

    Why is web scraping bad?

    Site scraping can be a powerful tool. In the right hands, it automates the gathering and dissemination of information. In the wrong hands, it can lead to theft of intellectual property or an unfair competitive edge. (Apr 18, 2016)
