Node Js Get Website Content

Get external website content using node js – Stack Overflow

On my website I use Node.js for the backend and HTML for the front end. I need to get the metadata (specifically the keywords) of external websites.
Is there any package for getting the metadata in Node.js?
For example, I have 100 website URLs in an array like this:
var arrayName = ["", "", "", "", "", ""]
I need to get every website's metadata, in particular the keywords.
Is there any Node.js package for this?
I found some code on Google:
var http = require('http');

var options = {
  host: '', // host elided in the original question
  port: 80,
  path: '/'
};

http.get(options, function (res) {
  console.log('Got response: ' + res.statusCode);
}).on('error', function (e) {
  console.log('Got error: ' + e.message);
});
Are there any other options?
Expected output:
Array1 = ["keyword1", "keyword2", "keyword3"];
Array2 = ["keyword1", "keyword2", "keyword3"];
Array3 = ["keyword1", "keyword2", "keyword3"];
Array1, Array2, and Array3 correspond to Site1, Site2, and Site3, and so on.
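A minimal sketch of one way to approach the extraction step, assuming the keywords live in a standard `<meta name="keywords">` tag. `extractKeywords` is an illustrative helper, not a package API; a real crawler would fetch each URL with an HTTP client and, ideally, use a proper HTML parser rather than a regex:

```javascript
// Sketch: pull the "keywords" meta tag out of an HTML string.
// Dependency-free for illustration only; regexes are brittle
// against real-world HTML, so prefer a parser such as cheerio.
function extractKeywords(html) {
  const match = html.match(
    /<meta[^>]+name=["']keywords["'][^>]+content=["']([^"']*)["']/i
  );
  if (!match) return [];
  return match[1].split(',').map((k) => k.trim()).filter(Boolean);
}

// For the array of 100 URLs, you would issue one HTTP GET per site
// (https.get, axios, etc.) and run extractKeywords on each body.
const sampleHtml =
  '<head><meta name="keywords" content="keyword1, keyword2, keyword3"></head>';
console.log(extractKeywords(sampleHtml));
// [ 'keyword1', 'keyword2', 'keyword3' ]
```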
The Ultimate Guide to Web Scraping with Node.js

So what's web scraping anyway? It involves automating away the laborious task of collecting information from websites.

There are a lot of use cases for web scraping: you might want to collect prices from various e-commerce sites for a price comparison site. Or perhaps you need flight times and hotel/AirBNB listings for a travel site. Maybe you want to collect emails from various directories for sales leads, or use data from the internet to train machine learning/AI models. Or you could even be wanting to build a search engine like Google!

Getting started with web scraping is easy, and the process can be broken down into two main parts: acquiring the data using an HTML request library or a headless browser, and parsing the data to get the exact information you want.

This guide will walk you through the process with the popular request-promise module, CheerioJS, and Puppeteer. Working through the examples in this guide, you will learn all the tips and tricks you need to become a pro at gathering any data you need with Node.js! We will be gathering a list of all the names and birthdays of U.S. presidents from Wikipedia and the titles of all the posts on the front page of Reddit.

First things first: let's install the libraries we'll be using in this guide (Puppeteer will take a while to install as it needs to download Chromium as well).

Making your first request
Next, let's open a new text file and write a quick function to get the HTML of the Wikipedia "List of Presidents" page.

Using Chrome DevTools

Cool, we got the raw HTML from the web page! But now we need to make sense of this giant blob of text. To do that, we'll need to use Chrome DevTools to allow us to easily search through the HTML of a web page.

Using Chrome DevTools is easy: simply open Google Chrome, and right click on the element you would like to scrape (in this case I am right clicking on George Washington, because we want to get links to all of the individual presidents' Wikipedia pages). Now, simply click inspect, and Chrome will bring up its DevTools pane, allowing you to easily inspect the page's source.

Parsing HTML with Cheerio.js

Awesome, Chrome DevTools is now showing us the exact pattern we should be looking for in the code (a "big" tag with a hyperlink inside of it). Let's use Cheerio.js to parse the HTML we received earlier to return a list of links to the individual Wikipedia pages of U.S. presidents.

We check to make sure there are exactly 45 elements returned (the number of U.S. presidents), meaning there aren't any extra hidden "big" tags elsewhere on the page. Now, we can go through and grab a list of links to all 45 presidential Wikipedia pages by getting them from the "attribs" section of each element.

Now we have a list of all 45 presidential Wikipedia pages. Let's create a new file, which will contain a function to take a presidential Wikipedia page and return the president's name and birthday. First things first, let's get the raw HTML from George Washington's Wikipedia page.

Let's once again use Chrome DevTools to find the syntax of the code we want to parse, so that we can extract the name and birthday with Cheerio.js. We see that the name is in a class called "firstHeading" and the birthday is in a class called "bday". Let's modify our code to use Cheerio.js to extract these two classes.

Putting it all together

Perfect!
Now let's wrap this up into a function and export it from this module. Then let's return to our original file and require the new module. We'll then apply it to the list of wikiUrls we gathered earlier. Voilà! A list of the names and birthdays of all 45 U.S. presidents. Using just the request-promise module and Cheerio.js should allow you to scrape the vast majority of sites on the internet.

Rendering JavaScript Pages

Recently, however, many sites have begun using JavaScript to generate dynamic content on their websites. This causes a problem for request-promise and other similar HTTP request libraries (such as axios and fetch), because they only get the response from the initial request, but they cannot execute the JavaScript the way a web browser can. So, to scrape sites that require JavaScript execution, we need another solution. In our next example, we will get the titles for all of the posts on the front page of Reddit. Let's see what happens when we try to use request-promise as we did in the previous example.

Here's what the output looks like: Hmmm…not quite what we want. That's because getting the actual content requires you to run the JavaScript on the page! With Puppeteer, that's no problem.

Puppeteer is an extremely popular new module brought to you by the Google Chrome team that allows you to control a headless browser. This is perfect for programmatically scraping pages that require JavaScript execution. Let's get the HTML from the front page of Reddit using Puppeteer instead of request-promise. The page is filled with the correct content! Now we can use Chrome DevTools like we did in the previous example. It looks like Reddit is putting the titles inside "h2" tags. Let's use Cheerio.js to extract the h2 tags from the page.
Additional ResourcesAnd there’s the list! At this point you should feel comfortable writing your first web scraper to gather data from any website. Here are a few additional resources that you may find helpful during your web scraping journey:List of web scraping proxy servicesList of handy web scraping toolsList of web scraping tipsComparison of web scraping proxiesCheerio DocumentationPuppeteer Documentation
Web Scraping with Javascript and NodeJS - ScrapingBee

Javascript has become one of the most popular and widely used languages due to the massive improvements it has seen and the introduction of the runtime known as NodeJS. Whether it’s a web or mobile application, Javascript now has the right tools. This article will explain how the vibrant ecosystem of NodeJS allows you to efficiently scrape the web to meet most of your requirements.
Prerequisites
This post is primarily aimed at developers who have some level of experience with Javascript. However, if you have a firm understanding of Web Scraping but have no experience with Javascript, this post could still prove useful.
Below are the recommended prerequisites for this article:
✅ Experience with Javascript
✅ Experience using DevTools to extract selectors of elements
✅ Some experience with ES6 Javascript (Optional)
⭐ Make sure to check out the resources at the end of this article to learn more!
Outcomes
After reading this post, you will be able to:
Have a functional understanding of NodeJS
Use multiple HTTP clients to assist in the web scraping process
Use multiple modern and battle-tested libraries to scrape the web
Understanding NodeJS: A brief introduction
Javascript is a simple and modern language that was initially created to add dynamic behavior to websites inside the browser. When a website is loaded, Javascript is run by the browser’s Javascript Engine and converted into a bunch of code that the computer can understand.
For Javascript to interact with your browser, the browser provides a Runtime Environment (document, window, etc.).
This means that Javascript is not the kind of programming language that can interact with or manipulate the computer or its resources directly. Servers, on the other hand, are capable of directly interacting with the computer and its resources, which allows them to read files or store records in a database.
When introducing NodeJS, the crux of the idea was to make Javascript capable of running not only client-side but also server-side. To make this possible, Ryan Dahl, a skilled developer took Google Chrome’s v8 Javascript Engine and embedded it with a C++ program named Node.
So, NodeJS is a runtime environment that allows an application written in Javascript to be run on a server as well.
As opposed to how most languages, including C and C++, deal with concurrency, which is by employing multiple threads, NodeJS makes use of a single main thread and performs tasks in a non-blocking manner with the help of the Event Loop.
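That single-threaded, non-blocking model can be seen in a few lines: the synchronous statements finish first, and the callback handed to setTimeout runs afterwards via the event loop, even with a 0 ms delay:

```javascript
// Order of execution under the event loop.
const order = [];
order.push('start');
setTimeout(() => {
  // Queued callbacks run only after the synchronous code completes.
  order.push('timer callback');
  console.log(order); // [ 'start', 'end', 'timer callback' ]
}, 0);
order.push('end');
```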
Setting up a simple web server is fairly straightforward, as shown below:
const http = require('http');
const PORT = 3000;

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('Hello World');
});

server.listen(PORT, () => {
  console.log(`Server running at PORT:${PORT}/`);
});
If you have NodeJS installed, you can run the above code by typing node <filename> (without the < and >) in your terminal, opening up your browser, and navigating to localhost:3000; you will see some text saying "Hello World". NodeJS is ideal for applications that are I/O intensive.
HTTP clients: querying the web
HTTP clients are tools capable of sending a request to a server and then receiving a response from it. Almost every tool that will be discussed in this article uses an HTTP client under the hood to query the server of the website that you will attempt to scrape.
Request
Request is one of the most widely used HTTP clients in the Javascript ecosystem. However, currently, the author of the Request library has officially declared that it is deprecated. This does not mean it is unusable. Quite a lot of libraries still use it, and it is every bit worth using.
It is fairly simple to make an HTTP request with Request:
const request = require('request')

// URL elided in the original; substitute the page you want to fetch
request('https://www.example.com', function (error, response, body) {
  console.log('error:', error)
  console.log('body:', body)
})
You can find the Request library at GitHub, and installing it is as simple as running npm install request.
You can also find the deprecation notice and what this means here. If you don’t feel safe about the fact that this library is deprecated, there are other options down below!
Axios
Axios is a promise-based HTTP client that runs both in the browser and NodeJS. If you use TypeScript, then Axios has you covered with built-in types.
Making an HTTP request with Axios is straight-forward. It ships with promise support by default as opposed to utilizing callbacks in Request:
const axios = require('axios')

// URL elided in the original; substitute the page you want to fetch
axios
  .get('https://www.example.com')
  .then((response) => {
    console.log(response)
  })
  .catch((error) => {
    console.error(error)
  })
If you fancy the async/await syntax sugar for the promise API, you can do that too. But since top level await is still at stage 3, we will have to make use of an async function instead:
const axios = require('axios')

async function getForum() {
  try {
    // URL elided in the original; substitute the forum you want to fetch
    const response = await axios.get('https://www.example.com')
    console.log(response)
  } catch (error) {
    console.error(error)
  }
}
All you have to do is call getForum! You can find the Axios library at GitHub and installing Axios is as simple as npm install axios.
SuperAgent
Much like Axios, SuperAgent is another robust HTTP client that has support for promises and the async/await syntax sugar. It has a fairly straightforward API like Axios, but SuperAgent has more dependencies and is less popular.
Regardless, making an HTTP request with Superagent using promises, async/await, or callbacks looks like this:
const superagent = require("superagent")

// URL elided in the original; substitute the forum you want to fetch
const forumURL = "https://www.example.com"

// callbacks
superagent
  .get(forumURL)
  .end((error, response) => {
    console.log(response)
  })

// promises
superagent
  .get(forumURL)
  .then((response) => {
    console.log(response)
  })
  .catch((error) => {
    console.error(error)
  })

// promises with async/await
async function getData() {
  const response = await superagent.get(forumURL)
  console.log(response)
}
You can find the SuperAgent library at GitHub and installing Superagent is as simple as npm install superagent.
For the upcoming few web scraping tools, Axios will be used as the HTTP client.
Note that there are other great HTTP clients for web scraping, like node-fetch!
Regular expressions: the hard way
The simplest way to get started with web scraping without any dependencies is to use a bunch of regular expressions on the HTML string that you fetch using an HTTP client. But there is a big tradeoff: regular expressions aren't as flexible, and both professionals and amateurs struggle with writing them correctly.
For complex web scraping, regular expressions can also get out of hand. With that said, let's give it a go. Say there's a label with some username in it, and we want the username. This is similar to what you'd have to do if you relied on regular expressions:

const htmlString = '<label>Username: John Doe</label>'
const result = htmlString.match(/<label>Username: (.+)<\/label>/)

console.log(result[1])
// John Doe

Cheerio: Core jQuery for the server

Cheerio is an efficient and light library that allows you to use the rich and powerful API of jQuery on the server side. Here is a small example:

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')

$('h2.title').text('Hello there!')
$('h2').addClass('welcome')

$.html()
// <h2 class="title welcome">Hello there!</h2>
As you can see, using Cheerio is similar to how you’d use jQuery.
However, it does not work the same way that a web browser works, which means it does not:
Render any of the parsed or manipulated DOM elements
Apply CSS or load any external resource
Execute Javascript
So, if the website or web application that you are trying to crawl is Javascript-heavy (for example a Single Page Application), Cheerio is not your best bet. You might have to rely on other options mentioned later in this article.
To demonstrate the power of Cheerio, we will attempt to crawl the r/programming forum on Reddit and get a list of post names.
First, install Cheerio and axios by running the following command:
npm install cheerio axios.
Then create a new file and copy/paste the following code:
const axios = require('axios');
const cheerio = require('cheerio');

const getPostTitles = async () => {
  try {
    const { data } = await axios.get('https://old.reddit.com/r/programming/');
    const $ = cheerio.load(data);
    const postTitles = [];

    // selector partially elided in the original; 'p.title' is an
    // assumption matching old Reddit's post markup
    $('div > p.title > a').each((_idx, el) => {
      const postTitle = $(el).text();
      postTitles.push(postTitle);
    });

    return postTitles;
  } catch (error) {
    throw error;
  }
};

getPostTitles()
  .then((postTitles) => console.log(postTitles));
getPostTitles() is an asynchronous function that will crawl Reddit's old r/programming forum. First, the HTML of the website is obtained using a simple HTTP GET request with the axios HTTP client library. Then the HTML data is fed into Cheerio using the cheerio.load() function.
With the help of the browser DevTools, you can obtain the selector that is capable of targeting all of the post cards. If you've used jQuery, a selector like $('div > p.title > a') is probably familiar. This will get all the posts. Since you only want the title of each post individually, you have to loop through each post. This is done with the help of the each() function.
To extract the text out of each title, you must fetch the DOM element with the help of Cheerio (el refers to the current element). Then, calling text() on each element will give you the text.
Now, you can pop open a terminal and run the file with node. You'll then see an array of about 25 or 26 different post titles (it'll be quite long). While this is a simple use case, it demonstrates the simple nature of the API provided by Cheerio.
If your use case requires the execution of Javascript and loading of external sources, the following few options will be helpful.
JSDOM: the DOM for Node
JSDOM is a pure Javascript implementation of the Document Object Model to be used in NodeJS. As mentioned previously, the DOM is not available to Node, so JSDOM is the closest you can get. It more or less emulates the browser.
Once a DOM is created, it is possible to interact with the web application or website you want to crawl programmatically, so something like clicking on a button is possible. If you are familiar with manipulating the DOM, using JSDOM will be straightforward.
const { JSDOM } = require('jsdom')

const { document } = new JSDOM(
  '<h2 class="title">Hello world</h2>'
).window

// selector elided in the original; '.title' matches the snippet above
const heading = document.querySelector('.title')
heading.textContent = 'Hello there!'
heading.classList.add('welcome')

heading.innerHTML
// Hello there!
As you can see, JSDOM creates a DOM. Then you can manipulate this DOM with the same methods and properties you would use while manipulating the browser DOM.
To demonstrate how you could use JSDOM to interact with a website, we will get the first post of the Reddit r/programming forum and upvote it. Then, we will verify if the post has been upvoted.
Start by running the following command to install JSDOM and Axios:
npm install jsdom axios
Then, make a file and copy/paste the following code:
const { JSDOM } = require("jsdom")
const axios = require("axios")

const upvoteFirstPost = async () => {
  try {
    const { data } = await axios.get("https://old.reddit.com/r/programming/");
    const dom = new JSDOM(data, {
      runScripts: "dangerously",
      resources: "usable"
    });
    const { document } = dom.window;

    // selector elided in the original; targeting old Reddit's upvote
    // arrow is an assumption
    const firstPost = document.querySelector("div > div.midcol > div.arrow");
    firstPost.click();

    const isUpvoted = firstPost.classList.contains("upmod");
    const msg = isUpvoted
      ? "Post has been upvoted successfully!"
      : "The post has not been upvoted!";

    return msg;
  } catch (error) {
    throw error;
  }
};

upvoteFirstPost().then(msg => console.log(msg));
upvoteFirstPost() is an asynchronous function that will obtain the first post in r/programming and upvote it. To do this, axios sends an HTTP GET request to fetch the HTML of the URL specified. Then a new DOM is created by feeding the HTML that was fetched earlier.
The JSDOM constructor accepts the HTML as the first argument and the options as the second. The two options that have been added perform the following functions:
runScripts: When set to "dangerously", it allows the execution of event handlers and any Javascript code in the document. If you are not sure about the credibility of the scripts your application will run, it is best to set runScripts to "outside-only", which attaches all of the Javascript-specification-provided globals to the window object while preventing any script inside the document from being executed.
resources: When set to "usable", it allows the loading of any external script declared using the <script> tag.