Scrape Flight Prices

How we scrape 300k prices per day from Google Flights

Brisk Voyage finds cheap, last-minute weekend trips for our members. The basic idea is that we continuously check a bunch of flight and hotel prices, and when we find a trip that's a low-priced outlier, we send an email with booking instructions.

To find flight and hotel prices, we scrape Google Flights and Google Hotels. Hotels are relatively simple: to find the cheapest hotel at 100 destinations over 5 different dates each, we have to scrape 500 Google Hotels pages. Scraping flight prices is a larger challenge. To find the cheapest round-trip flight for 100 destinations over 5 different dates each, that's 500 pages. Flights, however, don't just have a destination airport — they also have an origin airport. Those 500 flights have to be checked from each origin airport. We have to check those flights to Aspen from every airport around NYC, every airport around Los Angeles, and every airport in between. We don't have this extra origin airport "dimension" when scraping hotels. This means there are several orders of magnitude more flight prices than hotel prices relevant to our members.

The goal is to find the cheapest flight between each origin/destination pair for a given set of dates. To do this, we fetch around 300,000 flight prices from 25,000 Google Flights pages every day. This isn't an astronomical number, but it's large enough that we (at least, as a bootstrapped company) have to care about cost efficiency. Over the past year or so, we've iterated on our scraping methodology to arrive at a fairly robust and flexible solution. Below, I'll describe each tool we use in our scraping stack, roughly ordered by the flow of data.

Amazon Simple Queue Service (SQS)

We use SQS to serve a queue of URLs to crawl. Google Flights URLs embed three-letter IATA airport codes (like JFK and LGA) along with parameters such as ;c:USD;e:1;sd:1;. If you open one of these pages, notice how a single query can include multiple destination airports. This is one key to efficient Google Flights scraping: since Google Flights allows multiple trips to be searched at once, we can fetch the cheapest prices for multiple round-trips on one page. Google sometimes won't show flights for all trips queried. To make sure we have all the data we need, we detect when an origin/destination isn't displayed in the result, then re-queue that trip to be searched on its own. This ensures we collect the price of the cheapest flight available for each trip.

A single SQS queue stores all Google Flights URLs that need to be crawled. When the crawler runs, it picks a message off the queue. Order is not important, so a standard queue (not a FIFO queue) is used.
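As a rough illustration of this queue-consumption step (the queue name and message format below are hypothetical, not necessarily what Brisk Voyage uses), pulling a URL off the queue and re-queueing a missing trip with boto3 could look something like this:

    import json
    import boto3

    sqs = boto3.resource("sqs")
    # Hypothetical queue name; a standard (non-FIFO) queue, since order doesn't matter.
    queue = sqs.get_queue_by_name(QueueName="google-flights-urls")

    def next_crawl_message():
        """Pull one Google Flights URL (plus the trips it covers) off the queue."""
        messages = queue.receive_messages(MaxNumberOfMessages=1, WaitTimeSeconds=10)
        if not messages:
            return None
        # Delete the message only after the page has been crawled successfully.
        return messages[0]

    def requeue_single_trip(origin, destination, depart, ret):
        """Re-queue an origin/destination that Google didn't display, to be searched on its own."""
        queue.send_message(
            MessageBody=json.dumps(
                {"origin": origin, "destination": destination, "depart": depart, "return": ret}
            )
        )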
AWS Lambda (using Chalice)

Lambda is where the crawler actually runs. We use Chalice, an excellent Lambda microframework for Python, to deploy functions to Lambda. Although Serverless is the most popular Lambda framework, it is written in NodeJS, which is a turn-off for us given that we're most familiar with Python and want to keep our stack uniform. We've been very happy with Chalice — it's as simple to use as Flask, and it allows the entire Brisk Voyage backend to be Python on Lambda.

The crawler consists of two Lambda functions.

The primary Lambda function ingests messages from the SQS queue, crawls Google Flights, and stores the output. When this function runs, it launches a Chrome browser on the Lambda instance and crawls the page. We define it as a pure Lambda function with Chalice, as it will be invoked separately:

    @app.lambda_function()
    def crawl(event, context):
        ...

The second Lambda function triggers multiple instances of the first function. This runs as many crawlers as we need in parallel. It is defined to run at two minutes past the hour during UTC hours 15-22 every day, and it starts 50 instances of the primary crawl function, for 50 parallel crawlers:

    @app.schedule("cron(2 15,16,17,18,19,20,21,22 ? * * *)")
    def start_crawlers(event):
        print("Starting crawlers.")
        n_crawlers = 50
        client = boto3.client("lambda")
        for n in range(n_crawlers):
            response = client.invoke(
                FunctionName="collector-dev-crawl",
                InvocationType="Event",
                Payload='{"crawler_id": ' + str(n) + "}",
            )
        print("Started crawlers.")

An alternative would be to use the SQS queue as an event source for the crawl function, so that when the queue populates, the crawlers automatically scale up. We originally used this approach. There is one big drawback, however: the maximum number of messages that can be ingested by one invocation (the batch size) is 10, meaning the function has to be freshly invoked for every group of 10 messages. This not only causes compute inefficiency, but drastically increases bandwidth, as the browser cache is destroyed every time the function restarts. There are ways around this, but in our experience they lead to lots of extra complexity.

A note on Lambda costs: Lambda costs $0.00001667/GB-second, while many EC2 instances cost one-sixth of that. We currently pay around $50/month for Lambda, so switching would substantially reduce these costs. Lambda, however, has two big benefits. First, it scales up and down instantly with zero effort on our part, so we are never paying for an idling server. Second, it's what the rest of our stack is built on, and less technology means less cognitive overhead. If the number of pages we crawl ramps up, it will make sense to reconsider EC2 or a similar compute service. At this point, I think an extra $40 per month ($50 on Lambda vs ~$10 on EC2) is worth the simplicity for us.

Pyppeteer

Pyppeteer is a Python library for interacting with Puppeteer, a headless Chrome API. Since Google Flights requires JavaScript to load prices, it is necessary to actually render the full page. Each of the 50 crawl functions launches its own copy of a headless Chrome browser, which is controlled with Pyppeteer.

Running headless Chrome on Lambda was a challenge. We had to pre-package the necessary libraries that are not pre-installed on Amazon Linux, the OS that Lambda runs on top of. These libraries are added to the vendor directory inside our Chalice project, which tells Chalice to install them on each Lambda crawl instance.

As the 50 crawl functions start up over ~5 seconds, Chrome instances are launched within each function. This is a powerful system; it can scale into thousands of concurrent Chrome instances by changing one line of code.

The crawl function reads a URL from the SQS queue, then Pyppeteer tells Chrome to navigate to that page behind a rotating residential proxy from PacketStream. The residential proxy is necessary to prevent Google from blocking the IP addresses Lambda makes requests from.

Once the page loads and renders, the rendered HTML can be extracted, and the flight prices, airlines, and times can be read from the page. There are 10-15 flight results per page that we want to extract (we mostly care about the cheapest, but the others can be useful too). This is currently done by manually traversing the page structure, which is brittle: the crawler can break if Google changes one element on the page. In the future, I'd like to use something that's less reliant on the page structure.
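A minimal sketch of that launch-and-render step with Pyppeteer might look like the following; the proxy endpoint is a placeholder, and the real crawler's launch flags, waits, and extraction logic certainly differ:

    import asyncio
    from pyppeteer import launch

    PROXY = "http://proxy.example.com:31112"  # placeholder for the rotating residential proxy

    async def fetch_rendered_html(url):
        # Route headless Chrome through the proxy so requests don't come from Lambda IPs.
        browser = await launch(
            headless=True,
            args=["--no-sandbox", "--single-process", f"--proxy-server={PROXY}"],
        )
        try:
            page = await browser.newPage()
            # Wait for the network to go quiet so the JavaScript-rendered prices are present.
            await page.goto(url, waitUntil="networkidle2", timeout=60_000)
            return await page.content()
        finally:
            await browser.close()

    # Example usage with a URL pulled from the queue:
    # html = asyncio.get_event_loop().run_until_complete(fetch_rendered_html(url))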
Once the prices are extracted, we delete the SQS message and re-queue any origin/destinations that weren't displayed within the flight results. We then move on to the next page — a new URL is pulled from the queue, and the process repeats.

After the first page crawled in each crawl instance, pages require much less bandwidth to load (~100 KB instead of ~3 MB) thanks to Chrome's caching. This means we want to keep the instance alive as long as possible and crawl as many trips as we can in order to preserve the cache. Because it's advantageous to persist functions to retain the cache, crawl's timeout is 15 minutes, which is the maximum AWS currently allows.

DynamoDB

After the data is extracted from the page, we have to store it. We chose DynamoDB because it has on-demand scaling, which was important for us since we were uncertain about what kinds of loads we would need. It's also cheap, and 25 GB comes free under AWS's Free Tier.

DynamoDB has taken some work to get right. Normally, tables can only have one primary index with one sort key. Adding secondary indices is possible, but is either limited or requires additional provisioning, which increases costs. Due to this limitation on indices, DynamoDB works best when the usage is fully thought out beforehand. It took us a couple of tries to get the table design right. In retrospect, DynamoDB is a little inflexible for the kind of product we're building. Now that Aurora Serverless offers PostgreSQL, we should probably switch to that at some point.

Regardless, we store all of our flight data in a single table. The index has a primary key of the destination airport's IATA code, and a secondary range key which is a KSUID. KSUIDs are great for DynamoDB because they are unique, have an embedded timestamp, and are lexicographically sortable by that timestamp. This allows the range key to both serve as a unique identifier and support queries such as "give me the cheapest trips to BED that we've crawled in the past 30 minutes":

    from boto3.dynamodb.conditions import Key, Attr

    response = table.query(
        KeyConditionExpression=Key("destination").eq(iata_code)
        & Key("id").gt(earliest_id),
        FilterExpression=Attr("cheapest_entry").eq(1),
    )
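For completeness, writing a crawled price under this schema might look roughly like the following; the table name, attribute names, and the KSUID stand-in are illustrative assumptions rather than the actual Brisk Voyage code:

    import time
    import uuid
    from decimal import Decimal

    import boto3

    table = boto3.resource("dynamodb").Table("flight-prices")  # hypothetical table name

    def new_sortable_id():
        # Stand-in for a real KSUID: unique and lexicographically sortable by timestamp.
        return f"{int(time.time()):012d}-{uuid.uuid4().hex}"

    def store_flight(destination, origin, price_usd, is_cheapest):
        table.put_item(
            Item={
                "destination": destination,           # partition key: destination IATA code
                "id": new_sortable_id(),               # range key: time-sortable unique id
                "origin": origin,
                "price_usd": Decimal(str(price_usd)),  # boto3 requires Decimal for numbers
                "cheapest_entry": 1 if is_cheapest else 0,
            }
        )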
Monitoring and testing

We use Dashbird for monitoring the crawler and everything else that runs on Lambda. Good monitoring is a requirement for scraping applications, because page structure changes are a constant danger. At any time (even multiple times per day, as we've seen with Google Flights recently... sigh) the page structure can change, which breaks the crawler. We need to be alerted when this occurs. We have two separate mechanisms to track this:

1. A Dashbird alert that emails us when there is a crawl failure.
2. A GitHub Action that runs every 3 hours, performing a test crawl and verifying that the results make sense. Since the crawler isn't running 24/7, this alerts us when Google Flights changes its page structure outside of operating hours. This way, the crawler can be fixed before it starts up for the day.

The combination of SQS, Lambda, Chalice, Pyppeteer, DynamoDB, Dashbird, and GitHub Actions works well for us. It is a low-overhead solution that is entirely Python, requires no instance provisioning, and keeps costs relatively low.

While we're satisfied with this at our current scale of roughly 300k prices and 25k pages per day, there is some room for improvement as our data needs grow. First, we should move to EC2 servers that are automatically provisioned when needed. As we crawl more, the delta between Lambda and EC2 costs will increase to the point where it makes sense to run on EC2. EC2 will cost roughly a fifth of Lambda in terms of compute, but will require more overhead and won't reduce bandwidth costs. Once Lambda costs are a concern, we'll get to work on this. Second, we would move from DynamoDB to Aurora Serverless, which allows for more flexible data usage. This will be useful as our library of prices and the appetite for alternative data uses grow.

If you found this interesting, you might like Brisk Voyage — our free newsletter sends you a cheap weekend trip every few days from airports near you. If you like the service, we've also got a Premium version which will send you more trips and has a few more features.

There are, for sure, more ways our crawling system can be improved. If you've got any ideas, please let us know!
How we scrape 300k prices per day from Google Flights | Hacker News

> This isn't an astronomical number, but it's large enough that we (at least, as a bootstrapped company) have to care about cost efficiency.

...by externalizing the costs to a third party.

In general, I'm really surprised that they published this article. It's like they described exactly the data that somebody working on preventing scraping would need to block this traffic, in a totally unnecessary level of detail (e.g. telling exactly which ASN this traffic would be arriving from, describing the very specific timing of their traffic spikes, the kind of multi-city searches that probably see almost no organic traffic). I just don't get it. It's like they're intentionally trying to get blocked so that they can write a follow-up "how Google blocked our bootstrapped business" blog post.
Or they just don’t understand that what they are doing is illegal. I’m always surprised by the level of ignorance, but I’ve seen more than one startup burn because the founders didn’t understand which taxes were due and, thus, failed to account for them in their pricing.
Scraping public data is not illegal in the US.

> I'm always surprised by the level of ignorance

Such as the ignorance displayed in your comment?
> what they are doing is illegal

It's not illegal. Google can sue them and bury them in court fees and potentially win a civil suit, but it sure as hell isn't illegal.
I'm pretty sure this is legal wrt the CFAA because they are not circumventing any access control. However, 300k requests per day surely is enough that this could be considered a sort of denial of service attack and violate fair use. If Google wanted to, they would scale down their servers for a day, wait for this traffic peak to hit, document how it made the service unavailable to others, and now you have a valid offense to sue for.
I am surprised people think this could affect the servers? If they scrape 300k prices from 25k pages a day, and the crawl runs every hour from 15-22 UTC, then that means the 25k is spread across roughly 3.5k pages per crawl. Even if the crawl is aggressive and completes within a minute, that is still only about 58 QPS.
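A quick back-of-the-envelope version of that math (assuming one crawl per hour across the 15-22 UTC window and, pessimistically, that each crawl finishes within a single minute):

    pages_per_day = 25_000
    crawls_per_day = 8                                  # hourly runs, 15:00-22:00 UTC
    pages_per_crawl = pages_per_day / crawls_per_day    # ~3,100 pages per crawl
    worst_case_qps = pages_per_crawl / 60               # whole crawl squeezed into one minute
    # The ~3.5k pages and ~58 QPS quoted above correspond to ~7 effective crawls per day.
    print(f"{pages_per_crawl:.0f} pages per crawl, ~{worst_case_qps:.0f} requests/second at worst")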
They are using residential proxies to evade rate-limiting by google. I don’t know if it’s enough to trigger CFAA, but it shows at a minimum that they know that what they are doing is not what google considers fair use.
> what they are doing is illegal

Illegal means it violates the criminal code.

> now you have a valid offense to sue for

That's civil, not criminal. It's pedantic, but words matter.
> now you have a valid offense to sue for

That's also what I said above when I said "Google can sue them". It's not illegal, which is my entire point. You can sue anyone, for any reason, at any time, and even win. That doesn't make something illegal. If Google sends them a C&D, expressly forbids them from doing this activity, implements technical measures to prevent them from doing so, and they continue doing so, then they may start approaching the area of illegal (Craigslist v. 3Taps would agree, hiQ v. LinkedIn would disagree).
> denial of service attack

300k requests per day is a little over 3 per second. That's not much.
Unethical to build a business scraping data from a company that makes money scraping data?
I agree with you in principle, but having worked on both sides of this, I think there’s very little chance they get blocked at their current traffic levels. I do think that if they ever get traction they’ll have a lot of problems – there’s a reason GDS access to flight availability is slow, expensive, and difficult to implement well. Scraping definitely won’t scale.
> E.g. telling exactly which ASN this traffic would be arriving from

The article mentions that they are using rotating residential proxies.
This is almost certainly guaranteed to be either monetization of botnets, or inadvertently installed adware.
You do know that "illegal" means "against the law", right? And when you say something is illegal, then you need to produce the law it violates; there's no such proof from a company simply trying to prevent you from doing it.
> The crawl function reads a URL from the SQS queue, then Pyppeteer tells Chrome to navigate to that page behind a rotating residential proxy. The residential proxy is necessary to prevent Google from blocking the IP Lambda makes requests from.

I am very interested in what a "rotating residential proxy" is. Are they routing requests through random people's internet connections? Are these people willing participants? Where do they come from?
Check out Luminati for example. They have a huge network of true residential IPs to exit traffic from, and you have to pay a hefty premium per GB of traffic to do so ($12.50 per GB for rotating residential IPs, but it requires a minimum $500 commitment per month). The reason they can offer this is because they're exiting traffic through the users of the free Hola VPN Chrome extension.
How awful. "80M+ Monthly devices hosting Luminati's SDK" and "100% Peers chose to opt-in to Luminati's network". There is a 0% chance that 80M+ people are agreeing to "I am OK with Luminati selling access to my home internet connection to any party able to pay", which would be an honest description of their business model. More likely, Luminati is paying unscrupulous app developers to include this SDK in their apps, and some put legalese into 10,000-word install-time agreements that no one reads.
I think you can make a reasonable argument that Hola VPN is largely exploiting users who don’t actually understand and consent to having their IP address and connection used as a proxy.
To those lamenting that they're scraping... Google is the biggest scraper of them all. Facebook, Amazon, Google, Microsoft: all the big boys scrape voraciously, yet try their best to block themselves from being scraped. Scraping is vital for the functionality of the internet. The narrative that scraping is evil is what big companies want you to believe. If you block small scrapers from your site but permit giants like Googlebot and Bing, all you're doing is locking in a monopoly that's bad for everyone.
Google has the (often implicit) permission of the website owner to scrape. OTOH, Google Flights explicitly disallows scraping results.
No, Google's scraping is opt-out only, and the opt-out is something they offer voluntarily. Google does not need anyone's permission to scrape publicly accessible data, and they are not required to follow any opt-out requests.
It's ironic writing an article like that while their ToS states:

> As a user of the Site, you agree not to:
> 1. Systematically retrieve data or other content from the Site to create or compile, directly or indirectly, a collection, compilation, database, or directory without written permission from us.
The irony is even deeper when you look at the other side, which is Google, who made most of their money off scraping data from people in different ways. It's data scraping/middlemen all the way down... I wonder if Google indexes their scrape results to throw some loops in the mix.
Google also curates and republishes data from a lot of sites, including news sites and informational sites, which significantly reduces traffic to those sites. There's a lot of data Google scraped without necessarily being given explicit permission, outside their page crawler. They chose "beg for forgiveness" over "ask for permission" in many cases. My point being that there's irony in every direction, the proverbial "pot calling the kettle black." Lots of irony in both directions.
Just because it's in the ToS doesn't mean it's enforceable. That line is not enforceable in the US.
It's strange they write about this so openly. Aren't they wary that someone at Google Flights will read it and try blocking them (e.g. by scrambling the page's code)?
You're supposed to break rules and laws in the early days; it's part of the startup playbook. Bragging about it publicly as they're doing may appear newish, but I'm sure some other startup did that 15 years ago.
Interesting. A scraper scraping a scraper. I don’t get what the value add is over clients just searching Google Flights directly. Not trying to be mean, just trying to understand.
Google Flights isn't a scraper; it's an evolution of ITA Matrix from what I remember, directly connected to that GDS. They aren't piggybacking on someone else's work. That is what this guy could have done too, instead of behaving like pond scum. It's not like it's particularly complicated to get programmatic access to a GDS API; that's what they're there for.
Pond scum? He’s scraping some data from a company that got rich scraping data and that probably will tell him to stop doing it. I mean if he’s pond scum, what level of scum are those guys with upshot sites? What level of scum is Mark Zuckerberg? Pond scum is generally supposed to be pretty scummy, I can think of thousands of people more scummy than someone scraping from google.
It’s expensive to get access to a GDS API and, from what I’ve heard, the data they provide is quite difficult to work with. There’s a reason Google bought ITA for $700m, right? If this project ever grows, it could make sense to pull from a GDS.
> It's expensive to get access to a GDS API and, from what I've heard, the data they provide is quite difficult to work with.

Well, it's expensive to provide live answers for flight search queries across hundreds of airlines and thousands of airports... Some of the old booking interfaces are ugly, but for simple searching most of them provide relatively sane REST/JSON.

I don't understand your attitude: steal it until you make it?
That's the attitude of just about every successful company in history. Once large enough, some of them (e.g. YouTube) even force industry-wide changes to accommodate all the theft that made them big.

Meanwhile, on the topic of attitudes: referring to a startup as "pond scum" simply because they scrape an extremely expensive data set, especially regarding an industry with a long and controversial history of strategies designed to avoid price transparency... hmm.
Well, Google Flights is probably the best publicly available data on flight prices.

> Brisk Voyage finds cheap, last-minute weekend trips for our members. The basic idea is that we continuously check a bunch of flight and hotel prices, and when we find a trip that's a low-priced outlier, we send an email with booking instructions.

Ok, this could actually be interesting. (At least in the short while...)
Google Flights isn't really the best way of getting cheap flights. They pepper the results, especially if they think you're scraping (which they probably do). ITA Matrix is more accurate. Using a GDS is even more accurate, but that costs money.
The way I read it, they scrape 25k pages per day? I wonder if that could already put them on Google's radar. If so, Google would probably send a cease and desist letter and this startup would simply give up. I wonder if Google would also demand their legal expenses? Probably a couple thousand dollars? I know nobody would go to court against Google, but what would happen if this did go to court? Which laws would Google cite to deem this illegal?
All the (AWS) technologies used are totally unnecessary: SQS/DynamoDB/Lambda. I can buy a laptop at Walmart for $500 and do all the scraping on Starbucks wifi.
Right, it seems like they overbuilt this hacky solution. You are scraping; eventually you just need to subscribe to the data. Why invest that much effort into a temporary solution?
Lambda is needed to get rotating IPs and scale while avoiding browser fingerprinting. SQS takes the results of those scrapes and puts them into a database, DynamoDB. It’s a straightforward web scraping pipeline.
Lambda isn’t enough. You’ll get blocked in a heartbeat. You still need a proxy service.
Of course it's unnecessary. The point is that you can do it in the cloud, instead of on a laptop at Starbucks...
You state that you care about costs but you end up using some of the most expensive cloud offerings out there?
I’m torn about their account. It’s true that you could easily scrape 25k pages per day on a small VPS that costs less than the $50 Lambda costs they mentioned. And in order to scrape from that VPS you wouldn’t have to engineer this much with getting Chrome to run in Lambda, batching URLs, and you wouldn’t worry about Lambda timeouts because you could run the whole scrape in one session more or less. So you could say that the engineering effort they spent building this was a waste of money. On the other hand, if they ever do need to scale up for whatever reason (information spread across more pages, or they need to scrape more services, or need multiple attempts per URL), all they have to do is push a button, at which point the upfront engineering effort will have paid off. Either way, their current Lambda costs are definitely eclipsed by the costs of paying for the residential proxy IPs. My two cents.
How to try to trick the internet for the best deals on flights - Mashable

It pays to be cheap, secretive, flexible, and quick.
Never pay retail if you don’t have to.
Traveling isn’t just about the destination. Carry On is our series devoted to how we get away in the digital age, from the choices we make to the experiences we share.
Searching online for flight fares can feel like a game with constantly changing rules. Don't buy tickets on the weekend. Buy them on Tuesdays at this specific time. Wait until three days before you need to fly for a sudden price drop. Never search for the same flight more than three times. Log out of every browser and social network before even looking up flights. Most of the questionable logic behind these pieces of advice doesn't hold up. And even the one that seems like a clear winner in the age of ubiquitous internet surveillance is less helpful than it seems: when tracking down cheap flights online, it's best to turn off all tracking.

Go private, and quickly

Chris Rodgers, CEO of search-engine optimization agency Colorado SEO Pros, suggests using private browsing, but it might not actually do much more than give you peace of mind. On the Chrome browser that means opening an "incognito" window. Your smartphone browser, like Safari on iOS, has private browsing as well. "It is a good idea to use private browsing... when searching for flights to prevent as much tracking as possible," Rodgers said in a recent email conversation. "This could help prevent specific rate hikes that could be tied to an individual user's new and return sessions." He advises quick decision-making instead of repeatedly returning to check on pricing over a long stretch of time, which gives more opportunity for tracking. But Rodgers admitted that pricing is more likely set by factors beyond your personal browsing history and settings, so this doesn't work every time. He suggests researching flights as usual, then leaving the website and returning on a different device in private browsing mode to actually buy the tickets. "However if your fare is in high demand it may go up regardless," he warned.
Kayak, the flight and hotel aggregator, bursts any illusions that switching devices or going incognito can help you save a few bucks. In fact, Kayak's North America regional director, Steve Sintra, says logging into airline accounts or services like Kayak, Priceline, and Hipmunk might actually help you game the system. He said that while signed into your personal account you are more likely to see better fares and deals. But mostly it comes down to timing and flexibility. Sintra said in a recent phone call, "When you find a good fare, book it, don't wait."
If price is your only concern, aggregators can piece together an itinerary for you that may have the worst layovers and mixed-and-matched airlines, but — hey — it'll be cheaper. The more flexible you are on timing and other factors, the more likely you'll find a deal early. You might save a few bucks by waiting, but it's not usually worth it, Sintra said, given the extra stress, the chance of losing a seat, and typically paying way more. "The longer you wait the greater the likelihood the prices will go up," Sintra said.

Based on historical data and flight pricing trends, Kayak has found that you'll see the best prices four months in advance for holiday travel. The two weeks before a flight doesn't suddenly mean a price drop, he added. Instead, prices hold steady at what are usually the highest-cost fares. Prices fluctuate, so it can seem smart to wait, and then suddenly you're kicking yourself for missing a low price.

As for the sales: myths about buying airline tickets on Tuesdays mostly stem from sale schedules. While most major airlines open up sales on Tuesdays, airline sales happen all the time. So yeah, you'll be able to book a flight with a potential discount on a Tuesday, but those prices will be available for however long the sale is running. You can join airline email lists to get notified about flight sales directly in your inbox.

To fly cheap, plan your trip around cheaper travel days. If you travel on the actual holiday (Thanksgiving or Christmas Day instead of Dec. 22, for example), prices are typically lower. Same with travel after a holiday. If you can fly on a random Monday or Tuesday, you'll see cheaper fares compared to a Friday or Sunday evening flight. On Southwest's website, the low-fare calendar shows the lowest prices each day of the month. You can see the patterns for a November trip from San Francisco to Phoenix; note the Nov. 11 Veterans Day holiday and how pricing is higher around then.
Tricks and hacks aside, if you see a good flight that works for you, snatch it.

Frequently Asked Questions about scraping flight prices

Is it legal to scrape Google Flights?

Scraping public data is not illegal in the US. Google can sue them and bury them in court fees and potentially win a civil suit, but it sure as hell isn't illegal. (Jun 13, 2020)

How do you trick flight prices?

Wait until three days before you need to fly for a sudden price drop. Never search for the same flight more than three times. Log out of every browser and social network before even looking up flights. Most of the questionable logic behind these pieces of advice doesn't hold up. (Nov 7, 2019)

Is crawling illegal?

Web data scraping and crawling aren't illegal by themselves, but it is important to be ethical while doing it. Don't tread onto other people's sites without being considerate. Respect the rules of their site. (Nov 17, 2017)
