Haskell Web Scraping

scalpel: A high level web scraping library for Haskell. – Hackage

Scalpel is a web scraping library inspired by libraries like
Parsec
and Perl’s Web::Scraper.
Scalpel builds on top of TagSoup
to provide a declarative and monadic interface.
There are two general mechanisms provided by this library that are used to build
web scrapers: Selectors and Scrapers.
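As a quick taste of the interface before diving into each mechanism, here is a minimal sketch that pulls an attribute out of an inline HTML fragment with scrapeStringLike (the fragment and binding names are made up for illustration):

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

main :: IO ()
main = print (scrapeStringLike fragment scraper)   -- prints Just "https://example.com"
  where
    fragment = "<div><a href='https://example.com'>link</a></div>" :: String
    scraper  = attr "href" "a"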
Selectors
Selectors describe a location within an HTML DOM tree. The simplest selector
that can be written is a simple string value. For example, the selector
"div" matches every single div node in a DOM. Selectors can be combined
using tag combinators. The // operator defines nested relationships within a
DOM tree. For example, the selector "div" // "a" matches all anchor tags
nested arbitrarily deep within a div tag.
In addition to describing the nested relationships between tags, selectors can
also include predicates on the attributes of a tag. The @: operator creates a
selector that matches a tag based on its name and on conditions on the
tag's attributes. An attribute predicate is just a function that takes an
attribute and returns a boolean indicating whether the attribute meets some criteria.
There are several attribute operators that can be used to generate common
predicates. The @= operator creates a predicate that matches the name and
value of an attribute exactly. For example, the selector "div" @: ["id" @= "article"] matches div tags where the id attribute is equal to "article".
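Written out as Haskell values, the selectors described above look like the following (a small sketch; the binding names are illustrative, and OverloadedStrings is assumed, as discussed in the Tips & Tricks section below):

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- Every div node in the document.
divSelector :: Selector
divSelector = "div"

-- Anchor tags nested arbitrarily deep within a div.
anchorsInDivs :: Selector
anchorsInDivs = "div" // "a"

-- Divs whose id attribute is exactly "article".
articleDiv :: Selector
articleDiv = "div" @: ["id" @= "article"]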
Scrapers
Scrapers are values that are parameterized over a selector and produce a value
from an HTML DOM tree. The Scraper type takes two type parameters. The first
is the string like type that is used to store the text values within a DOM tree.
Any string like type supported by ringLike is valid. The second type
is the type of value that the scraper produces.
There are several scraper primitives that take selectors and extract content
from the DOM. Each primitive defined by this library comes in two variants:
singular and plural. The singular variants extract the first instance matching
the given selector, while the plural variants match every instance.
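For instance, the text primitive extracts the text of the first tag matched by a selector, while its plural counterpart texts extracts the text of every matching tag. A small sketch using scrapeStringLike on an inline fragment (the fragment is made up):

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

firstDivText :: Maybe String
firstDivText = scrapeStringLike "<div>one</div><div>two</div>" (text "div")
-- Just "one"

allDivTexts :: Maybe [String]
allDivTexts = scrapeStringLike "<div>one</div><div>two</div>" (texts "div")
-- Just ["one","two"]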
Example
Complete examples can be found in the
examples folder in the
scalpel git repository.
The following is an example that demonstrates most of the features provided by
this library. Suppose you have the following hypothetical HTML located at a
hypothetical URL such as http://example.com/article.html, and you would like
to extract a list of all of the comments.

<html>
  <body>
    <div class='comments'>
      <div class='comment container'>
        <span class='comment author'>Sally</span>
        <div class='comment text'>Woo hoo!</div>
      </div>
      <div class='comment container'>
        <span class='comment author'>Bill</span>
        <img class='comment image' src='http://example.com/cat.gif' />
      </div>
      <div class='comment container'>
        <span class='comment author'>Susan</span>
        <div class='comment text'>WTF!?!</div>
      </div>
    </div>
  </body>
</html>
The following snippet defines a function, allComments, that will download
the web page and extract all of the comments into a list:
type Author = String
data Comment
= TextComment Author String
| ImageComment Author URL
deriving (Show, Eq)
allComments :: IO (Maybe [Comment])
allComments = scrapeURL "http://example.com/article.html" comments
  where
    comments :: Scraper String [Comment]
    comments = chroots ("div" @: [hasClass "container"]) comment

    comment :: Scraper String Comment
    comment = textComment <|> imageComment

    textComment :: Scraper String Comment
    textComment = do
        author      <- text $ "span" @: [hasClass "author"]
        commentText <- text $ "div"  @: [hasClass "text"]
        return $ TextComment author commentText

    imageComment :: Scraper String Comment
    imageComment = do
        author   <- text       $ "span" @: [hasClass "author"]
        imageURL <- attr "src" $ "img"  @: [hasClass "image"]
        return $ ImageComment author imageURL

Tips & Tricks
The primitives provided by scalpel are intentionally minimalistic, with the assumption being that users will be able to build up complex functionality by combining them with functions that work on existing type classes (Monad, Applicative, Alternative, etc.). This section gives examples of common tricks for building up more complex behavior from the simple primitives provided by this library.

OverloadedStrings
Selector, TagName and AttributeName are all IsString instances, and thus it is convenient to use scalpel with OverloadedStrings enabled. If not using OverloadedStrings, all tag names must be wrapped with tagSelector.

Matching Wildcards
Scalpel has 3 different wildcard values, each corresponding to a distinct use case.

anySelector is used to match all tags:
textOfAllTags = texts anySelector

AnyTag is used when matching all tags with some attribute constraint. For example, to match all tags with the attribute class equal to "button":
textOfTagsWithClassButton = texts $ AnyTag @: [hasClass "button"]

AnyAttribute is used when matching tags with some arbitrary attribute equal to a particular value. For example, to match all tags with some attribute equal to "button":
textOfTagsWithAnAttributeWhoseValueIsButton = texts $ AnyTag @: [AnyAttribute @= "button"]

Complex Predicates
It is possible to run into scenarios where the name and attributes of a tag are not sufficient to isolate interesting tags and properties of child tags need to be considered. In these cases the guard function of the Alternative type class can be combined with chroot and anySelector to implement predicates of arbitrary complexity.

Building off the above example, consider a use case where we would like to find the HTML contents of a comment that mentions the word "cat". The strategy will be the following:
Isolate the comment div using chroot.
Within the context of that div, retrieve the textual contents with text anySelector. This works because the first tag within the current context is the div tag selected by chroot, and the anySelector selector will match the first tag within the current context.
Enforce the predicate that "cat" appears in the text of the comment using guard. If the predicate fails, scalpel will backtrack and continue the search for divs until one is found that matches the predicate.
Return the desired HTML content of the comment div.

catComment :: Scraper String String
catComment =
    -- 1. First narrow the current context to the div containing the comment's
    --    textual content.
    chroot ("div" @: [hasClass "comment", hasClass "text"]) $ do
        -- 2. anySelector can be used to access the root tag of the current context.
        contents <- text anySelector
        -- 3. Skip comment divs that do not contain "cat".
        guard ("cat" `isInfixOf` contents)
        -- 4. Generate the desired value.
        html anySelector

For the full source of this example, see complex-predicates in the examples directory.

Generalized Repetition
The pluralized versions of the primitive scrapers (texts, attrs, htmls) allow the user to extract content from all of the tags matching a given selector. For more complex scraping tasks it will at times be desirable to be able to extract multiple values from the same tag.
Like the previous example, the trick here is to use a combination of the chroots function and the anySelector selector. Consider an extension to the original example where image comments may contain some alt text and the desire is to return a tuple of the alt text and the URLs of the images. The strategy is the following:
Isolate each img tag using chroots.
Within the context of each img tag, use the anySelector selector to extract the alt and src attributes from the current tag.
Create and return a tuple of the extracted attributes.

altTextAndImages :: Scraper String [(String, URL)]
altTextAndImages =
    -- 1. First narrow the current context to each img tag.
    chroots "img" $ do
        -- 2. Use anySelector to access all the relevant content from the
        --    currently selected img tag.
        altText <- attr "alt" anySelector
        srcUrl  <- attr "src" anySelector
        -- 3. Combine the retrieved content into the desired final result.
        return (altText, srcUrl)

For the full source of this example, see generalized-repetition in the examples directory.

Operating with other monads inside the Scraper
ScraperT is a monad transformer scraper: it allows lifting m a operations inside a ScraperT str m a with functions like:
-- Particularizes to `m a -> ScraperT str m a`
lift :: (MonadTrans t, Monad m) => m a -> t m a

-- Particularizes to things like `IO a -> ScraperT str IO a`
liftIO :: MonadIO m => IO a -> m a
Example: Perform HTTP requests on page images as you scrape:
Isolate images using chroots.
Within the context of each img tag, obtain the src attribute containing the location of the file.
Perform an IO operation to request metadata headers from the source.
Use the data to build and return more complex data.
-- Holds original link and data if it could be fetched
data Image = Image String (Maybe Metadata)
    deriving Show

-- Holds mime type and file size
data Metadata = Metadata String Int
    deriving Show

-- Scrape the page for images: get their metadata
scrapeImages :: URL -> ScraperT String IO [Image]
scrapeImages topUrl = chroots "img" $ do
    source <- attr "src" "img"
    guard . not . null $ source
    -- getImageMeta is called via liftIO because ScraperT transforms over IO
    liftM (Image source) $ liftIO (getImageMeta topUrl source)

For the full source of this example, see downloading-data in the examples directory. For more documentation on monad transformers, see the hackage page for the transformers package.

scalpel-core
The scalpel package depends on http-client and http-client-tls to provide networking support. For projects with an existing HTTP client these dependencies may be unnecessary. For these scenarios users can instead depend on scalpel-core, which does not provide networking support and has minimal dependencies.

Troubleshooting
My Scraping Target Doesn't Return The Markup I Expected
Some websites return different markup depending on the user agent sent along with the request. In some cases, this even means returning no markup at all in an effort to prevent scraping. To work around this, you can add your own user agent string.

#!/usr/local/bin/stack
-- stack runghc --resolver lts-6.24 --install-ghc --package scalpel-0.6.0
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedStrings #-}

import Text.HTML.Scalpel
import qualified Network.HTTP.Client as HTTP
import qualified Network.HTTP.Client.TLS as HTTP
import qualified Network.HTTP.Types.Header as HTTP

-- Create a new manager settings based on the default TLS manager that updates
-- the request headers to include a custom user agent.
managerSettings :: HTTP.ManagerSettings
managerSettings = HTTP.tlsManagerSettings {
    HTTP.managerModifyRequest = \req -> do
        req' <- HTTP.managerModifyRequest HTTP.tlsManagerSettings req
        return $ req' {
            HTTP.requestHeaders = (HTTP.hUserAgent, "My Custom UA")
                                : HTTP.requestHeaders req'
        }
    }

main = do
    manager <- Just <$> HTTP.newManager managerSettings
    html <- scrapeURLWithConfig (def { manager }) url $ htmls anySelector
    maybe printError printHtml html
  where
    url = "https://www.example.com"
    printError = putStrLn "Failed"
    printHtml = mapM_ putStrLn

A list of user agent strings can be found here.

Comparing the same web scraper in Haskell, Python, Go

So this project started with a need – or, not really a need, but an annoyance I realized would be a good opportunity to strengthen my Haskell, even if the solution probably wasn’t worth it in the end.
There’s a blog I follow (Fake Nous) that uses WordPress, meaning its comment section mechanics and account system are as convoluted and nightmarish as Haskell’s package management. In particular I wanted to see if I could do away with relying on kludgy WordPress notifications that only seem to work occasionally and write a web scraper that’d fetch the page, find the recent comments element and see if a new comment had been posted.
I’ve done the brunt of the job now – I wrote a Haskell script that outputs the “Name on Post” string of the most recent comment. And I thought it’d be interesting to compare the Haskell solution to Python and Go solutions.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE ViewPatterns #-}
import Network.HTTP.Req
import qualified Text.HTML.DOM as DOM
import qualified Text.XML.Cursor as Cursor
import qualified as Selector
import qualified as Types
import qualified as XML
import Data.Text (Text, unpack)
main = do
  resp <- runReq defaultHttpConfig $ req GET ("…") NoReqBody lbsResponse mempty
  let dom = Cursor.fromDocument $ DOM.parseLBS $ responseBody resp
      recentComments = XMLNode $ $ head $ "#recentcomments" $ dom
      newest = head $ deChildren recentComments
  putStrLn $ getCommentText newest

getCommentText commentElem =
  let children = deChildren commentElem
  in foldl (++) "" $ unwrap <$> children
unwrap:: -> String
unwrap (deContent (ntentText s)) = unpack s
unwrap e = unwrap $ head $ deChildren e
My Haskell clocks in at 25 lines, although if you remove unused language extensions, it comes down to 21 (the other four are in there just because they're "go to" extensions for me). So 21 is a fairer count. If you don't count imports as lines of code, it can be 13.
Writing this was actually not terribly difficult; of the 5 or so hours I probably put into it in the end, 90% of that time was spent struggling with package management (the worst aspect of Haskell). In the end I finally resorted to Stack even though this is a single-file script that should be able to compile with just ghc.
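For reference, one way to run a single-file script with Stack is its script interpreter, which lets a .hs file declare its own resolver and packages in a header comment, similar to the scalpel troubleshooting example quoted earlier on this page. A minimal sketch; the resolver and package below are placeholders, not the ones the author used:

#!/usr/bin/env stack
-- stack script --resolver lts-18.28 --package scalpel
{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

main :: IO ()
main = print =<< scrapeURL "https://example.com/" (texts "a")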
I’m proud of my work though, and thought it reflected fairly well on a language to do this so concisely. My enthusiasm dropped a bit when I wrote a Python solution:
import requests
from bs4 import BeautifulSoup
file = requests.get("…").text
dom = BeautifulSoup(file, features="…")
recentcomments = dom.find(id='recentcomments')
print("".join(list(recentcomments.children)[0].strings))
6 lines to Haskell’s 21, or 4 to 13. Damn. I’m becoming more and more convinced nothing will ever displace my love for Python.
Course you can attribute some of Haskell’s relative size to having an inferior library, but still.
Here’s a Go solution:
package main
import (
“fmt”
	"net/http"

	"github.com/andybalholm/cascadia"
	"golang.org/x/net/html"
)

func main() {
	var resp, err = http.Get("…")
	must(err)
	defer resp.Body.Close()
	tree, err := html.Parse(resp.Body)
	sel, err := cascadia.Compile("#recentcomments > *:first-child")
	// It will only match one element.
	for _, elem := range sel.MatchAll(tree) {
		var name = elem.FirstChild
		var on = name.NextSibling
		fmt.Printf("%s%s%s\n", unwrap(name), unwrap(on), unwrap(on.NextSibling))
	}
}

func unwrap(node *html.Node) string {
	if node.Type == html.TextNode {
		return node.Data
	}
	return unwrap(node.FirstChild)
}

func must(err error) {
	if err != nil {
		panic(err)
	}
}
32 lines, including imports. So at least Haskell came in shorter than Go. I’m proud of you, Has- oh nevermind, that’s not a very high bar to clear.
It would be reasonable to object that the Python solution is so brief because it doesn’t need a main function, but in real Python applications you generally still want that. But even if I modify it:
def main():
    return "".join(list(recentcomments.children)[0].strings)
if __name__ == '__main__': main()
It only clocks in at 8 lines, including imports.
An alternate version of the Go solution that doesn’t hardcode the number of nodes (since the Python and Haskell ones don’t):
		fmt.Printf("%s\n", textOfNode(elem))
	}
}
func textOfNode(node *html.Node) string {
	var total string
	var elem = node.FirstChild
	for elem != nil {
		total += unwrap(elem)
		elem = elem.NextSibling
	}
	return total
}
Though it ends up being 39 lines.
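For what it's worth, the scalpel API documented earlier on this page can express the same query fairly compactly. A rough sketch, assuming the WordPress recent-comments widget is a ul with id "recentcomments" whose first li holds the newest "Name on Post" string (the URL is a placeholder, not the blog's actual address):

{-# LANGUAGE OverloadedStrings #-}
import Text.HTML.Scalpel

-- The singular text primitive returns the first matching li,
-- i.e. the newest comment in the widget.
newestComment :: IO (Maybe String)
newestComment =
    scrapeURL "https://example.com/" $
        chroot ("ul" @: ["id" @= "recentcomments"]) (text "li")

main :: IO ()
main = newestComment >>= maybe (putStrLn "no comment found") putStrLn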
Maybe Python’s lead would decrease if I implemented the second half, having the scripts save the last comment they found in a file, read it on startup, and update if it’s different and notify me somehow (email could be an interesting test). I doubt it, but if people like this post I’ll finish them.
Edit: I finished them.
Current state of web scraping using Haskell – Reddit

Hello all, I would like to know what is the current state of web scraping using Haskell… which libraries are best suited for scraping while also maintaining sessions? Thanks in advance.
level 1 · 4y · edited 4y
I have done a lot of web scraping with Haskell recently. I've used scalpel and find it to be very convenient for standard web pages. I haven't gotten into more complex scraping involving form POSTs, but that would be easy to add. Full-blown JavaScript-aware scraping is something I have not entertained yet and I'm sure is much more heavy-duty. I recently released a rather crude scraping "engine" which helps you scrape thousands of pages using your own set of rules. For anonymity, you can fire up a bunch of tor proxies and tell the engine to run all its web requests through them (concurrently). It also supports things like caching, throttling, and User-Agent spoofing.

level 2
I'm using scalpel-core and it works great for simple/regular HTML. I prefer using wreq to perform the requests. I'm very interested in your package for the sake of anonymity. Care to expand on how it works? I looked at your code, but my limited knowledge of Tor and being somewhat new to Haskell make it hard for me to wrap my head around.

level 2
Thanks for sharing, I will look at it.

level 2
I've also been using wreq for this.

level 2 · 4y · edited 4y
Wreq doesn't see much development lately and there are issues that are not addressed, most importantly connection sharing in a multithreaded environment. As a shameless plug, I have written req; the readme also compares the library with existing solutions and has a usage example.

level 2
Yes, I have read your post… but wanted to get the current status!!

level 2
Would wreq also be suitable for use in a wrapper library for a web based () API?

level 1
Try hs-scrape, which internally uses wreq and xml-conduit. Here's an example of logging into PayPal and displaying your balance with hs-scrape:

import Control.Applicative
import Data.Maybe (fromMaybe, listToMaybe)
import Control.Monad ((>=>))
import qualified Data.Text as T
import Data.Text.IO (putStrLn)
import Prelude hiding (putStrLn)
import Text.XML.Cursor (attributeIs, content, element, ($//), (&/))
-- At the bottom of this file you'll find a repl session[0] to help understand the getPaypalBalance function.
-- Additionally there is a more verbose version of the getPaypalBalance function that makes the composition
-- and order of operations more explicit.
getPaypalBalance cursor = fromMaybe (error "Failed to get balance") $ listToMaybe $
    cursor $//
    -- Create an 'Axis' that matches an element named "div" that has an attribute
    -- named "class" with the attribute value "balanceNumeral".
    -- This axis will apply to the descendants of cursor.
    element "div" >=> attributeIs "class" "balanceNumeral" &/
    -- The Axis following &/ below matches the results of the previous Axis.
    -- In other words, the following Axis will match all descendants inside of
    -- the elements matched by the previous Axis.
    element "span" >=> attributeIs "class" "h2" &/
    -- The content Axis is applied to the results of the previous Axis.
    -- In other words, it gets the content out.
    content
level 2
Thanks for sharing, it looks nice…!!

level 1
I have tried Scalpel, and it is a decent parser (although it lacks documentation on regex use, e.g. for matching hrefs that link to a json). However, it's not a web scraper. It lacks the ability to interact with e.g. loading delays, js, etc. I was going to try webdriver for that, but eventually I switched languages, so no feedback on that.

level 1
I wrote this about two years ago. It's pretty simple, but one maybe interesting aspect is that I wound up using hxt-css instead of TagSoup, because that was the easiest way to support loading standard CSS selector strings at runtime. Scalpel didn't exist yet, though… I need to take a look at that!

level 1
I use tagsoup and built the crawling infrastructure around that. If you are crawling many sites and only keeping small portions, it's really important to use the copy function of ByteString/Text/… to prevent massive amounts of memory from being retained.

level 1
I successfully use webdriver to click through dozens of post forms and links.

level 2
How was your experience… have you used hs-scrape… I am having a problem installing the selenium driver (a version mismatch).

level 1
Do most people use taggy-lens to get at the data in HTML, or what?

level 1
Now, ideally making a scraper would be a few hours of work, including finding CSS selectors. A description of the task:

scrape = withWebDriver … $ do
    get "/…"
    elts <- cssSelect "a.docLink"
    forall elts $ \elt -> do
        click elt
        -- we enter new page
        subElts <- cssSelect "a.textLink"
        forall subElts $ \subElt -> do
            contentElt <- cssSelect ".content"
            liftIO $ writeFile (uuidFrom (show subElt ++ show elt)) $ htmlText contentElt

I have seen a lot of people willing to talk about it, but few willing to offer a solution. Even one I hired has just reposted the question on Reddit, instead of writing the code :-)

level 2
Seems nice... you should have written it yourself in one hour :-).
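One commenter above mentions pairing wreq for the HTTP requests with scalpel-core for the parsing. A minimal sketch of what that combination might look like (the URL and the "h2" selector are assumptions made up for illustration; the lazy ByteString body is unpacked to String for simplicity):

{-# LANGUAGE OverloadedStrings #-}
import Control.Lens ((^.))
import qualified Data.ByteString.Lazy.Char8 as LBS
import Network.Wreq (get, responseBody)
import Text.HTML.Scalpel.Core

-- Fetch a page with wreq, then run a scalpel-core scraper over the body.
headings :: String -> IO (Maybe [String])
headings url = do
    resp <- get url
    let body = LBS.unpack (resp ^. responseBody)
    return $ scrapeStringLike body (texts "h2")

main :: IO ()
main = print =<< headings "https://www.example.com"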
