Building a Web Scraper in an Azure Function – Applied …
Web development is arguably the most popular area of software development right now. Software developers can make snappy, eye-catching websites, and build robust APIs. I’ve recently developed a specific interest in a less discussed facet of web development: web scraping.
Web scraping is the process of programmatically analyzing a website’s Document Object Model (DOM) to extract specific data of interest. It is also a powerful tool for automating tasks such as filling out a form, submitting data, and so on, although whether these are possible depends on whether the site allows scraping. One thing to keep in mind is that some websites rely on cookies or session state, so some automation tasks may need to work within the site’s cookie/session handling. It should go without saying, but please be a good Samaritan when web scraping, since it can negatively impact site performance.
Getting Started
Let’s get started with building a web scraper in an Azure Function! For this example, I am using an HTTP Trigger Azure Function written in C#. However, you can have your Azure Function utilize a completely different trigger type, and your web scraper can be written in other languages if preferred.
Here is a list of Azure resources that were created for this demo:
Before we start writing code, we need to take care of a few more things first.
Let’s first select a website to scrape data from. I feel that the CDC’s COVID-19 site is an excellent option for this demo. Next, we need to pick out what data to fetch from the website. I plan to fetch the total number of USA cases, new USA cases, and the date that the data was last updated.
Now that we have that out of the way, we need to bring in the dependencies for this solution. Luckily, there is only one dependency we need to install. The NuGet package is called HtmlAgilityPack. Once that package has been installed into our solution, we can then start coding.
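If you manage packages from the command line, it can be added with the .NET CLI from the project directory:

dotnet add package HtmlAgilityPack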
Coding the Web Scraper
Since the web scraper component will be pulling in multiple sets of data, it is good to capture them inside a custom resource model. Here is a snapshot of the resource model that will be used for the web scraper.
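A minimal sketch of such a model, with the class and property names assumed from the three values described above:

public class CovidStats
{
    public string TotalCases { get; set; }   // total number of USA cases
    public string NewCases { get; set; }     // new USA cases
    public string LastUpdated { get; set; }  // date the data was last updated
}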
Now it’s time to start coding the web scraper class. This class will utilize a few components from the HtmlAgilityPack package that was brought into the project earlier.
The web scraper class has a couple of class-level fields, one public method, and a few private methods. The method “GetCovidStats” performs a few simple tasks to get our data from the website. The first step is setting up an HTML document object that will be used to load HTML and parse the actual HTML document we get back from the site. Then, there is an HTTP call out to the website we want to hit.
Right after that, we ensure the call out to the website results in a success status code. If not, an exception is thrown with a few details of the failing network call.
We then load the HTML we received back from the network call into our HTML document object. Next come several calls to a helper method that extracts the data we are looking for. You might be wondering what those long strings in the method calls are: they are the full XPaths for each targeted HTML element. You can obtain them by opening your browser’s dev tools, selecting the HTML element, right-clicking it in the dev tools, and choosing “Copy full XPath”.
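Putting those steps together, a minimal sketch of what the scraper class could look like, assuming the resource model above (the class name, URL, and XPath strings are placeholders rather than the article’s exact code):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class CovidDataScraper
{
    // Placeholder – the article's exact CDC page URL isn't reproduced here
    private const string Url = "https://www.cdc.gov/...";
    private static readonly HttpClient _httpClient = new HttpClient();

    public async Task<CovidStats> GetCovidStats()
    {
        var htmlDocument = new HtmlDocument();

        // Call out to the website and make sure we got a success status code back
        var response = await _httpClient.GetAsync(Url);
        if (!response.IsSuccessStatusCode)
        {
            throw new Exception($"Request to {Url} failed with status code {(int)response.StatusCode}.");
        }

        // Load the returned HTML into the document object and pull out each value by its full XPath
        var html = await response.Content.ReadAsStringAsync();
        htmlDocument.LoadHtml(html);

        return new CovidStats
        {
            TotalCases = ExtractInnerText(htmlDocument, "/html/body/..."),   // full XPaths copied from dev tools
            NewCases = ExtractInnerText(htmlDocument, "/html/body/..."),
            LastUpdated = ExtractInnerText(htmlDocument, "/html/body/...")
        };
    }

    private static string ExtractInnerText(HtmlDocument document, string xpath)
    {
        // SelectSingleNode returns null if the XPath doesn't match anything on the page
        return document.DocumentNode.SelectSingleNode(xpath)?.InnerText?.Trim();
    }
}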
Next, we need to set up the endpoint class for our Azure Function. Luckily for us, the out-of-the-box template sets up a few things automatically. In the endpoint class, we simply call our web scraper class and return its results to the client calling the Azure Function.
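A minimal sketch of such an endpoint, assuming the scraper class sketched above (the function and class names are illustrative):

using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;

public static class GetCovidStatsFunction
{
    [FunctionName("GetCovidStats")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("Scraping the COVID-19 stats page.");

        // Call the web scraper and hand its results straight back to the caller as JSON
        var stats = await new CovidDataScraper().GetCovidStats();
        return new OkObjectResult(stats);
    }
}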
Now comes time to test out the Azure Function! I used Postman for this, and these are the results.
Closing Thoughts
Overall, web scraping can be a powerful tool at your disposal if sites do not offer APIs for you to consume. It allows you to swiftly grab essential data off of a site and even automate specific tasks in the browser. With great power comes great responsibility, so please use these tools with care!
Web scraping data with Azure Data Factory – SQLShack
This article will show you how to web scrape data using Azure Data Factory and store the data on one of the Azure data repositories.
Introduction
Typically, when data sources are being considered, we tend to think of sources like relational databases, NoSQL databases, file-based data sources, data warehouses or data lakes. One huge, unstructured or semi-structured source of data is the web pages that are publicly or privately accessible on the web. These pages contain various elements like text, images, media, etc. But from a data perspective, one of the most valuable elements in web pages is the web table, which can be mapped directly to data objects in data repositories or even stored in the form of files.
This technique of reading or extracting data from web pages is popularly known as web scraping.
The technique of web scraping is not new. Most popular programming languages and frameworks like R, Python, Java, etc. provide libraries that can web scrape data directly, convert and parse it into JSON format, and process it as desired. With the advent of cloud computing platforms, Extract Transform Load (ETL) services are available in a Platform-as-a-Service model, which allows building data pipelines that execute on managed infrastructure. For large-scale web scraping, for example scraping a source like Wikipedia at scheduled intervals across hundreds to thousands of pages, one would want to execute such tasks on a data pipeline running on managed infrastructure in the cloud, rather than running custom code on a single virtual machine. Azure Data Factory is Azure’s ETL offering that supports building data pipelines.
It also supports building data pipelines that can web scrape data.
Creating web scraping data pipelines with Azure Data Factory
The first thing we will need to web scrape data is the actual data itself. This data should be in the form of some tables on a web page hosted on a publicly accessible website. One of the easiest sources of such pages is Wikipedia. We will be using a page from Wikipedia as our data source, but one can use any publicly accessible webpage that contains a table as a data source for this exercise. Shown below is the webpage we will scrape with the data pipeline we build in Azure Data Factory. This page can be accessed from here.
This page contains multiple tables. If we scroll down this page, we will find another table as shown below.
Typically, pages may contain multiple tables and we may want to scrape one or more of them, hence we have selected a
page with multiple tables.
Now that we have the data source identified, it is time to start working on our data pipeline creation. It is
assumed that one has the required access to the Azure Data Factory service on the Azure platform. Navigate to the
Azure portal, open the Data Factories service and it will open the dashboard page that lists all the Azure Data
factory instances you may have. If it’s the first time you are working with Azure Data Factory, you may not have any
instance created. Click on the Add button to create a new Azure Data Factory instance. Provide
basic details of your subscription, resource group and name of the data factory instance and click on the Create
button, which will result in the creation of a new data factory instance as shown below.
Click on the instance name and it will open the dashboard of the specific instance. You should be able to see a link
in the middle of the screen titled “Author and Monitor”. Click on this link and it will open the Azure Data Factory
portal, which is the development and administrative console that facilitates the creation of data pipelines. This is
the location where we will create and host our data pipeline that will web scrape data from the webpage that we have
identified earlier.
We intend to copy data from the web table on the web page, so we can start with the Copy Data feature. Click on the link titled “Copy data” as shown above and it will directly start the creation of a new data pipeline that walks through the steps required to copy data from source to destination. In the first step, we need to provide the name of the task and optionally a description for the same. Then we need to provide the scheduling requirements to execute this task. For now, we can continue with the “Run once now” option. This can be modified later as well. Once done, click on the Next button to proceed to the next step.
In this step, we need to register the data source from which we intend to copy data. Click on Add new connection
button which will open a list of supported data sources as shown below. For our use case, we need to select
Web Table as the source as shown below. Once selected, click on the Continue
button.
In this step, we need to provide the name of the linked service being created, i.e., the web table data source, and
optionally a description for it. We need an integration runtime to be created to connect to this data source. Expand
the integration runtime drop-down and click on the New option in the drop-down.
It will pop up the below options to create a new integration runtime. Depending on the availability and suitability
options, one can select any option to create the runtime. One of the most straightforward approaches is the
Self-Hosted option with Express settings which will download and install an executable on your local machine and
create a runtime. Choose the desired option here and create a new runtime.
Once the runtime is created, you will find the same listed in the dropdown as shown below. Now we can provide the
URL of the Wikipedia page that we have identified and click on the Test Connection button. If everything is in
place, this connection will be successful as shown below. One can also configure authentication type if one is
reading from a private webpage that requires authentication, unlike Wikipedia which is a publicly accessible website
with anonymous authentication.
Now that we have registered the data source, we need to configure the dataset from this data source that we intend
to extract. In this step, we can specify the path to the web table or specify an index, whichever is convenient. An
index value of 0 would result in the extraction of the table shown in the preview pane. If you compare it with the
web page, it’s the first table on the page.
Change the value of the index to 1 and it will show the next table on the web page. In this way, even if we do not
know the path to the table in the web page, we can just specify the index and extract the desired web table from the
web page. We will continue with the first table with index 0 and click on the Next button.
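Under the hood, the dataset that this wizard produces is a JSON definition of type WebTable, whose typeProperties carry the zero-based index (and optionally a path). A rough sketch of what that definition looks like, with placeholder dataset and linked service names:

{
    "name": "WikipediaWebTable",
    "properties": {
        "type": "WebTable",
        "linkedServiceName": {
            "referenceName": "WebLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "index": 0
        }
    }
}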
Now that we have selected the data source, we need to select the destination where the extracted data will be
loaded. Depending on the requirement, we can register any supported data destinations. In this case, using the steps
shown earlier, Azure Blob Storage has been added as the destination as we will be storing the data in the form of a
file.
In the next step, provide the details of the location where the file will be stored as well as the details of the
file like file name, concurrent connections, and block size.
In the last step, we need to provide the format of data that we need to save in the file. We intend to save the file
in a comma-separated format, so we have chosen the settings shown below.
Once done, move to the next step where you can review the details, save the data pipeline, and execute the same.
Once the execution completes, the output file would be generated at the location that we selected earlier. Navigate
to the storage account location where the file has been created, download the same, and open it. The output will
look as shown below.
In this way, we can web scrape data using Azure Data Factory and populate it on data repositories on the Azure
platform.
Conclusion
In this article, we learned the concept of web scraping. We learned about the Web Table data source in Azure Data
Factory and learned to build a data pipeline that can web scrape data and store the same on Azure data repositories.
Create a Website Scraper for Azure Functions – Jason N …
There are many reasons you may need a website scraper. One of the biggest reasons I use website scrapers is to prevent me from visiting a site to look for something on a regular basis and losing the time spent on that site. For instance, when COVID-19 first hit, I visited the stats page on the Pennsylvania Department of Health each day. Another instance may be to watch for a sale item during Amazon’s Prime Day.
Getting Started
To get started, we’ll want to create an Azure Function. We can do that a few different ways:
Use an extension for Visual Studio Code
Use the Azure extension for Visual Studio
Use the command line
Use the Azure Portal
At this point, use the method that you feel most comfortable with. I tend to use the command line or the Azure extension for Visual Studio Code as they tend to leave the codebase very clean. I’m making this function with C# so I can use some 3rd party libraries.
In my case, I’ve called my HttpTrigger function ScrapeSite.
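For reference, the command-line route uses the Azure Functions Core Tools; roughly, it looks like this (the project name is just an example):

func init ScraperDemo --dotnet
cd ScraperDemo
func new --name ScrapeSite --template "HTTP trigger"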
Modifying the Function
Once the function is created, it should look like this:
public static class ScrapeSite
{
    [FunctionName("ScrapeSite")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
        ILogger log)
    {
        log.LogInformation("C# HTTP trigger function processed a request.");

        string name = req.Query["name"];

        string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
        dynamic data = JsonConvert.DeserializeObject(requestBody);
        name = name ?? data?.name;

        string responseMessage = string.IsNullOrEmpty(name)
            ? "This HTTP triggered function executed successfully. Pass a name in the query string or in the request body for a personalized response."
            : $"Hello, {name}. This HTTP triggered function executed successfully.";

        return new OkObjectResult(responseMessage);
    }
}
We’ll bring in the NuGet package for HtmlAgilityPack so we can grab the appropriate area of our page. To do this, we’ll use a command line, navigate to our project and run:
dotnet add package HtmlAgilityPack
In my case, I’m going to connect to Walmart and look at several Xbox products. I’ll be querying the buttons on the page to look at the InnerHtml of the button and ensure that it does not read “Get in-stock alert”. If it does, that means that the product is out of stock.
Our first step is to connect to the URL and read the page content. Before that, I’ll create a sealed class that can be used to deliver the properties back to the function:
public sealed class ProductAvailability
{
    public bool IsAvailable { get; set; }
    public string Url { get; set; }
}
In this case, I’ll be returning a boolean value as well as the URL that I’m attempting to scrape from. This will allow me to redirect the user to that location when necessary.
While it is not illegal to screen scrape websites, you should make sure that you have the appropriate permission before scraping the site. In addition, if you scrape too often, the site may deem you as a bot and may block your IP address.
Next, I’m going to add a static class called Scraper. This will handle the majority of the scraping process. The class will take advantage of the HtmlWeb.LoadFromWebAsync() method in the HtmlAgilityPack package. The reason for this is that a bare HttpClient() call lacks the headers most sites expect, so if we used it instead, most websites would flag us as a bot.
After we connect to the URL, we’ll use a selector to grab all buttons and then use a LINQ query to count how many buttons contain the text “Get in-stock alert”. We’ll update the ProductAvailability object and return it back.
public static class Scraper
{
    public static async Task<ProductAvailability> GetProductAvailability(string uri)
    {
        var productAvailability = new ProductAvailability() { Url = uri };

        // Load the page through HtmlAgilityPack rather than a bare HttpClient
        var web = new HtmlWeb();
        var html = await web.LoadFromWebAsync(uri);
        var htmlString = html.DocumentNode;

        // Grab every button on the page and count how many contain the out-of-stock text
        var buttons = htmlString.SelectNodes("//button").ToList();
        var outOfStock = buttons.Count(c => c.InnerHtml.Contains("Get in-stock alert"));

        productAvailability.IsAvailable = outOfStock == 0;

        return productAvailability;
    }
}
Finally, we’ll update our function to call the GetProductAvailability method multiple times:
        var products = new List<ProductAvailability>();

        // The two Walmart product URLs are omitted here, as in the original post
        products.Add(await Scraper.GetProductAvailability("…"));
        products.Add(await Scraper.GetProductAvailability("…"));

        return new OkObjectResult(products);
    }
}
Results
Now, we can run our function from within Visual Studio Code. To do this, hit the F5 key. This will require that you have the Azure Functions Core Tools installed. If you do not, you’ll be prompted to install it. After it’s installed and you press F5, you’ll be prompted to visit your local URL for your function. If successful, you should see the following results (as of this post) for the above two products:
[{"isAvailable":false,"url":"…"},{"isAvailable":true,"url":"…"}]
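As an aside, if you prefer the terminal to F5, the same Core Tools can start the local Functions host directly from the project folder:

func start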
Conclusion
In this post we created a new Azure Function, built the function using VS Code, and connected to Walmart to obtain product information. If you’re interested in reviewing the finished product, be sure to check out the repository below:
Browse Repository