nodejs-web-scraper

nodejs-web-scraper is a simple tool for scraping/crawling server-side rendered pages.
It supports features like recursive scraping (pages that "open" other pages), file download and handling, automatic retries of failed requests, concurrency limitation, pagination, request delay, etc. Tested on Node 10–16 (Windows 7, Linux Mint).
The API uses Cheerio selectors; see the Cheerio documentation for selector reference.
For any questions or suggestions, please open a GitHub issue.
Installation
```bash
$ npm install nodejs-web-scraper
```
Table of contents

- Basic examples
  - Collect articles from a news site
  - Get data of every page as a dictionary
  - Download images
  - Use multiple selectors
- Advanced
  - Pagination
  - Get an entire HTML file
  - Downloading a file that is not an image
  - getElementContent and getPageResponse hooks
  - Add additional conditions
  - Scraping an auth protected site
- API
- Pagination explained
- Error Handling
- Automatic Logs
- Concurrency
- License
- Disclaimer
Basic examples

Collect articles from a news site

Let's say we want to get every article (from every category) from a news site. We want each item to contain the title, story, and image link (or links).
```javascript
const { Scraper, Root, DownloadContent, OpenLinks, CollectContent } = require('nodejs-web-scraper');
const fs = require('fs');

(async () => {

    const config = {
        baseSiteUrl: `https://www.some-news-site.com/`, // Placeholder URL for this example.
        startUrl: `https://www.some-news-site.com/`,
        filePath: './images/',
        concurrency: 10, // Maximum concurrent jobs. More than 10 is not recommended. Default is 3.
        maxRetries: 3, // The scraper will try to repeat a failed request a few times (excluding 404). Default is 5.
        logPath: './logs/' // Highly recommended: creates a friendly JSON for each operation object, with all the relevant data.
    };

    const scraper = new Scraper(config); // Create a new Scraper instance, and pass the config to it.

    // Now we create the "operations" we need:

    const root = new Root(); // The root object fetches the startUrl, and starts the process.

    // Any valid Cheerio selector can be passed.
    const category = new OpenLinks('.category', { name: 'category' }); // Opens each category page.
    const article = new OpenLinks('article a', { name: 'article' }); // Opens each article page.
    const image = new DownloadContent('img', { name: 'image' }); // Downloads images.
    const title = new CollectContent('h1', { name: 'title' }); // "Collects" the text from each h1 element.
    const story = new CollectContent('.content', { name: 'story' }); // "Collects" the article body.

    // Then we create a scraping "tree":
    root.addOperation(category);
    category.addOperation(article);
    article.addOperation(image);
    article.addOperation(title);
    article.addOperation(story);

    await scraper.scrape(root);

    const articles = article.getData(); // Will return an array of all article objects (from all categories),
    // each containing its "children" (titles, stories and the downloaded image URLs).

    // If you just want the stories, do the same with the "story" variable:
    const stories = story.getData();

    fs.writeFile('./articles.json', JSON.stringify(articles), () => {}); // Produces a formatted JSON containing all article pages and their selected data.
    fs.writeFile('./stories.json', JSON.stringify(stories), () => {});
})();
```
This basically means: "Go to the startUrl; open every category; then open every article in each category page; then collect the title, story and image href, and download all images on that page."

An alternative, perhaps more friendly way to collect the data from a page is to use the "getPageObject" hook.
```javascript
const { Scraper, Root, OpenLinks, CollectContent, DownloadContent } = require('nodejs-web-scraper');
const fs = require('fs');

(async () => {
    const pages = []; // All ad pages.

    // pageObject will be formatted as {title, phone, images}, because these are the names we chose for the scraping operations below.
    // Note that each key is an array, because there might be multiple elements fitting the querySelector.
    // This hook is called after every page finished scraping.
    // It will also get an address argument.
    const getPageObject = (pageObject, address) => {
        pages.push(pageObject);
    };

    const config = {
        baseSiteUrl: `https://www.some-job-site.com/`, // Placeholder URL for this example.
        startUrl: `https://www.some-job-site.com/jobs/`,
        filePath: './images/',
        logPath: './logs/'
    };
    const scraper = new Scraper(config);

    const root = new Root({ pagination: { queryString: 'page_num', begin: 1, end: 10 } }); // Open pages 1-10. You need to supply the querystring that the site uses (more details in the API docs).
    const jobAds = new OpenLinks('h2 a', { name: 'Ad page', getPageObject }); // Opens every job ad, and calls the getPageObject, passing the formatted dictionary.
    const phones = new CollectContent('.details-desc', { name: 'phone' }); // Important to choose a name, for the getPageObject to produce the expected results.
    const titles = new CollectContent('h1', { name: 'title' });
    const images = new DownloadContent('img', { name: 'images' });

    root.addOperation(jobAds);
    jobAds.addOperation(titles);
    jobAds.addOperation(phones);
    jobAds.addOperation(images);

    await scraper.scrape(root);

    fs.writeFile('./pages.json', JSON.stringify(pages), () => {}); // Produces a formatted JSON with all job ads.
})();
```
Let's describe again in words what's going on here: "Go to the startUrl; then paginate the root page, from 1 to 10; then, on each pagination page, open every job ad; then collect the title, phone and images of each ad."
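Since the operation names chosen above are "title", "phone" and "images", each entry pushed into the pages array should look roughly like the sketch below. All values are made up for illustration; each key is an array because several elements may match the selector on a single page.

```javascript
// Hypothetical shape of one pageObject entry, assuming the operation names used above.
const examplePageObject = {
    title: ['Senior Node.js Developer'],                 // from the 'h1' CollectContent operation
    phone: ['+421 000 000 000'],                         // from the '.details-desc' CollectContent operation
    images: ['https://www.some-job-site.com/logo.png']   // addresses gathered by the DownloadContent operation
};
```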
Download all images from a page
A simple task: download all images on a page (including base64 images).
```javascript
const { Scraper, Root, DownloadContent } = require('nodejs-web-scraper');

(async () => {
    const config = {
        baseSiteUrl: `https://www.some-site.com/`, // Important to provide the base url, which is the same as the starting url in this example. (Placeholder URL.)
        startUrl: `https://www.some-site.com/`,
        filePath: './images/',
        cloneFiles: true // Will create a new image file with an appended name, if the name already exists. Default is false.
    };
    const scraper = new Scraper(config);

    const root = new Root(); // Root corresponds to the startUrl. This object starts the entire process.
    const images = new DownloadContent('img'); // Create an operation that downloads all image tags in a given page (any Cheerio selector can be passed).

    root.addOperation(images); // We want to download the images from the root page, so we pass the "images" operation to the root.

    await scraper.scrape(root); // Pass the Root to scraper.scrape() and you're done.
})();
```
When done, you will have an "images" folder with all the downloaded files.

Use multiple selectors

If you need to select elements from different possible classes (an "or" operator), just pass comma-separated classes. This is part of the jQuery selector specification (which Cheerio implements), and has nothing to do with the scraper.
```javascript
const { Scraper, Root, CollectContent } = require('nodejs-web-scraper');

(async () => {
    const config = {
        baseSiteUrl: `https://www.some-site.com/`, // Placeholder URL for this example.
        startUrl: `https://www.some-site.com/some-page`
    };
    const scraper = new Scraper(config);

    function getElementContent(element) {
        // Do something...
    }

    const root = new Root();
    const title = new CollectContent('.first_class, .second_class', { getElementContent }); // Any of these will fit.

    root.addOperation(title);
    await scraper.scrape(root);
})();
```
Advanced Examples
Pagination

Get every job ad from a job-offering site. Each job object will contain a title, a phone and image hrefs. Since the site is paginated, use the pagination feature (this fragment reuses the getPageObject hook from the previous example):

```javascript
const root = new Root({ pagination: { queryString: 'page_num', begin: 1, end: 10 } }); // Open pages 1-10.
// YOU NEED TO SUPPLY THE QUERYSTRING that the site uses (more details in the API docs). "page_num" is just the string used on this example site.

const jobAds = new OpenLinks('h2 a', { name: 'Ad page', getPageObject }); // Opens every job ad, and calls the getPageObject, passing the formatted object.
const images = new DownloadContent('img', { name: 'images' });

root.addOperation(jobAds);
jobAds.addOperation(images);
```
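To make the queryString requirement more concrete, here is a small standalone sketch of the addresses such a pagination config is expected to produce, assuming the placeholder startUrl from the earlier example. The scraper builds these for you; this is only an illustration.

```javascript
// Illustration only: the pagination pages implied by { queryString: 'page_num', begin: 1, end: 10 }.
const startUrl = 'https://www.some-job-site.com/jobs/'; // placeholder from the example above
const paginationPages = [];
for (let page = 1; page <= 10; page++) {
    paginationPages.push(`${startUrl}?page_num=${page}`);
}
console.log(paginationPages); // [ '.../jobs/?page_num=1', ..., '.../jobs/?page_num=10' ]
```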
Get an entire HTML file

```javascript
const sanitize = require('sanitize-filename'); // Using this npm module to sanitize file names.
const fs = require('fs');
const { Scraper, Root, OpenLinks } = require('nodejs-web-scraper');

(async () => {
    const config = {
        baseSiteUrl: `https://www.some-job-site.com/`, // Placeholder URL for this example.
        startUrl: `https://www.some-job-site.com/jobs/`,
        removeStyleAndScriptTags: false // Telling the scraper NOT to remove style and script tags, because I want them in my HTML files, for this example.
    };

    let directoryExists;

    const getPageHtml = (html, pageAddress) => { // Saving the HTML file, using the page address as a name.
        if (!directoryExists) {
            fs.mkdirSync('./html');
            directoryExists = true;
        }
        const name = sanitize(pageAddress);
        fs.writeFile(`./html/${name}`, html, () => {});
    };

    const scraper = new Scraper(config);
    const root = new Root({ pagination: { queryString: 'page_num', begin: 1, end: 100 } });
    const jobAds = new OpenLinks('h2 a', { getPageHtml }); // Opens every job ad, and calls a hook after every page is done.

    root.addOperation(jobAds);
    await scraper.scrape(root);
})();
```
Description: "Go to the startUrl; paginate 100 pages from the root; open every job ad; save every job ad page as an HTML file."
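The sanitize-filename step above matters because page addresses contain characters that are not allowed in file names (such as `/`, `:` and `?`). A quick sketch of what a saved file name might look like; the address is a made-up example:

```javascript
const sanitize = require('sanitize-filename');

const pageAddress = 'https://www.some-job-site.com/jobs/12345?ref=list'; // hypothetical job ad address
const name = sanitize(pageAddress); // illegal characters are stripped, e.g. 'httpswww.some-job-site.comjobs12345ref=list'
console.log(`./html/${name}`); // the path the getPageHtml hook writes to
```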
Downloading a file that is not an image

```javascript
const config = {
    // ...same base config as in the previous examples (baseSiteUrl, startUrl, etc.),
    filePath: './videos/'
};

const video = new DownloadContent('a.video', { contentType: 'file' }); // The "contentType" makes it clear for the scraper that this is NOT an image (therefore the "href" is used instead of "src"). The selector is just an example.
const description = new CollectContent('h1');

root.addOperation(video);
root.addOperation(description);

await scraper.scrape(root);

console.log(description.getData()); // You can call the "getData" method on every operation object, giving you the aggregated data collected by it.
```

Description: "Go to the startUrl; download every video; collect each h1; at the end, get the entire data from the "description" object."
getElementContent and getPageResponse hooks

```javascript
const getPageResponse = async (response) => {
    // Do something with the page response (the HTML content). No need to return anything.
};

const myDivs = [];

const getElementContent = (content, pageAddress) => {
    myDivs.push(content);
    console.log(`myDiv content from page ${pageAddress} is ${content}...`);
};

const articles = new OpenLinks('article a');
const posts = new OpenLinks('.post a', { getPageResponse }); // Is called after the HTML of a link was fetched, but before the children have been scraped. Is passed the response object of the page.
const myDiv = new CollectContent('.myDiv', { getElementContent }); // Will be called after every "myDiv" element is collected.

root.addOperation(articles);
articles.addOperation(myDiv);
root.addOperation(posts);
posts.addOperation(myDiv);
```
Description: "Go to the startUrl; open every article link; collect each .myDiv; call getElementContent()."
"Also, from the root page, open every post; before scraping the children (the myDiv object), call getPageResponse(); collect each .myDiv."
Add additional conditions

In some cases, using the Cheerio selectors isn't enough to properly filter the DOM nodes. This is where the "condition" hook comes in. Both OpenLinks and DownloadContent can register a function with this hook, allowing you to decide whether a given DOM node should be scraped, by returning true or false.
```javascript
/**
 * Will be called for each node collected by Cheerio, in the given operation (OpenLinks or DownloadContent).
 */
const condition = (cheerioNode) => {
    // Note that cheerioNode contains other useful methods, like html(), hasClass(), parent(), attr() and more.
    const text = cheerioNode.text().trim(); // Get the innerText of the tag.
    if (text === 'some text i am looking for') { // Even though many links might fit the querySelector, only those that have this innerText
        // will be "opened".
        return true;
    }
};

// Let's assume this page has many links with the same CSS class, but not all are what we need.
const linksToOpen = new OpenLinks('.some-css-class-that-is-just-not-enough', { condition });

root.addOperation(linksToOpen);
```
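Because cheerioNode also exposes attr(), the same hook can filter on an attribute instead of the text. A minimal sketch, assuming we only want links whose href contains a made-up "/jobs/" segment; the pattern is illustrative, not part of the library.

```javascript
// Fragment: assumes the same require/root setup as the example above.
// Alternative condition: open only links whose href matches a pattern.
const hrefCondition = (cheerioNode) => {
    const href = cheerioNode.attr('href'); // attr() is one of the methods mentioned above
    return Boolean(href && href.includes('/jobs/')); // '/jobs/' is just an example pattern
};

const adLinks = new OpenLinks('.some-css-class-that-is-just-not-enough', { condition: hrefCondition });
root.addOperation(adLinks);
```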
Scraping an auth protected site

Please refer to this guide.
API

class Scraper(config)

The main nodejs-web-scraper object. Starts the entire scraping process via scrape(Root). Holds the configuration and global state.

These are the available options for the scraper, with their default values:
const config = {
baseSiteUrl: '', // Mandatory. If your site sits in a subfolder, provide the path WITHOUT it.
startUrl: '', // Mandatory. The page from which the process begins.
logPath: null, // Highly recommended. Will create a log for each scraping operation (object).
cloneFiles: true, // If an image with the same name exists, a new file with a number appended to it is created. Otherwise it's overwritten.
removeStyleAndScriptTags: true, // Removes any style and script tags found on the page.