Scraping Tumblr

lhecker/tumblr-scraper – GitHub

This project was created as a black-box, from-scratch reimplementation of Liru/tumblr-downloader, recreating and improving upon its features.
Features
Downloads all photos and videos of a blog, including those inlined into posts
Automatically stops scraping a blog where it left off the last time
Allows filtering out reblogs
Uses Tumblr’s v2 API, which is more robust and significantly faster
Simulates Tumblr’s private API to even scrape private blogs if needed
All downloads are parallelized
TODOs
Documentation (up until now this strictly has been a private project)
Crawling of >5000 posts per day will lead to rate limiting
Continuing a previously failed crawl/scrape is not supported
Setting the before field in the config allows you to scrape backwards starting at a date in the past.
That way you can manually and iteratively scrape a huge blog in “sane” chunks (e.g. first everything before 2014, then 2015, 2016, …).
Support for youtube-dl would be nice
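The chunked approach mentioned in the TODOs can be illustrated with a small Python sketch. This is not code from tumblr-scraper: the helper name `year_boundaries` is hypothetical, and it assumes the cutoff is expressed as a Unix timestamp, which the README does not state.

```python
from datetime import datetime, timezone


def year_boundaries(first_year, last_year):
    """Return (label, cutoff_timestamp) pairs, oldest chunk first.

    Each cutoff is midnight UTC on 1 January of the year *after* the
    labelled year, so scraping "before" that cutoff ends with that year.
    """
    cutoffs = []
    for year in range(first_year, last_year + 1):
        cutoff = datetime(year + 1, 1, 1, tzinfo=timezone.utc)
        cutoffs.append((str(year), int(cutoff.timestamp())))
    return cutoffs


if __name__ == "__main__":
    # Assumption: one scrape run per printed cutoff, oldest first.
    for label, before in year_boundaries(2013, 2016):
        print(f"chunk ending {label}: before = {before}")
```

Running the runs oldest-first matches the order suggested above, so each later run only has to cover the posts added since the previous cutoff.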

How to scrape/download all tumblr images with a particular tag

I am trying to download many (1000s) of images from Tumblr with a particular tag (e.g. #art). I am trying to figure out the fastest and easiest way to do this. I have considered both scrapy and puppeteer as options, and I read a little bit about the Tumblr API, but I’m not sure how to use the API to locally download the images I want.
Currently, puppeteer seems like the best way, but I’m not sure how to deal with the fact that Tumblr uses lazy loading (e.g. what is the code for getting all the images, scrolling down, waiting for the images to load, and getting these too?)
Would appreciate any tips!
asked Dec 13 ’20 at 11:00
I recommend you use the Tumblr API, so here’s some instructions on how to go about that.
Read up on the What You Need section of the documentation
Read up on the Get Posts With Tag section
Consider using a library like PyTumblr
import pytumblr

list_of_all_posts = []

# Authenticate with your API key (the OAuth consumer key)
client = pytumblr.TumblrRestClient('YOUR KEY HERE')

def get_art_posts(**params):
    # returns the 20 most recent posts in the tag, including their HTML
    # use params (shown in the Tumblr documentation) to change the timestamp or limit of the posts,
    # i.e. to only get posts before a certain time
    posts = client.tagged('art', **params)
    return posts

print(get_art_posts())
I’m pretty rusty with the Tumblr API, not gonna lie. But the documentation is kept well up to date. Once you have the HTML of the post, the link to the images will be in there. There’s plenty of libraries out there like Beautiful Soup that can extract the images from the HTML by their CSS selectors. Hope this helped!
answered Dec 15 ’20 at 23:26
harada
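The answer above suggests pulling image links out of a post's HTML with a library such as Beautiful Soup. As a rough sketch of the same idea, the example below swaps in Python's built-in `html.parser` so that it runs without extra dependencies; the names `ImgSrcCollector` and `extract_image_urls` are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def extract_image_urls(html):
    """Return the image URLs found in an HTML fragment, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    sample = '<p>Hello</p><figure><img src="https://example.com/a.jpg" alt=""/></figure>'
    print(extract_image_urls(sample))
```

With Beautiful Soup the equivalent would be a CSS selector on `img` tags; the parser-based version is just easier to run as-is.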
My solution is below. Since I couldn’t use offset, I used the timestamps of each post as an offset instead. Since I was trying to specifically get the links of images in the posts, I did a little processing of the output as well. I then used a simple Python script to download every image from my list of links. I have included a website and an additional Stack Overflow post which I found helpful.
import pytumblr

def get_all_posts(client, blog):
    offset = None
    for i in range(48):
        # alternative for a single blog:
        # response = client.posts(blog, limit=20, offset=offset, reblog_info=True, notes_info=True)
        response = client.tagged('YOUR TAG HERE', limit=20, before=offset)
        if not response:
            # no more posts to page through
            break
        for post in response:
            if 'photos' not in post:
                # text post: look for the first inline image in the HTML body
                if 'body' in post:
                    body = post['body'].split('<')
                    body = [b for b in body if 'img src=' in b]
                    if body:
                        body = body[0].split('"')
                        print(body[1])
                        yield body[1]
            else:
                print(post['photos'][0]['original_size']['url'])
                yield post['photos'][0]['original_size']['url']
        # move to the next offset: the timestamp of the oldest post in this page
        offset = response[-1]['timestamp']
        print(offset)

client = pytumblr.TumblrRestClient('USE YOUR API KEY HERE')
blog = 'staff'

# use our function
with open('{}'.format(blog), 'w') as out_file:
    for post in get_all_posts(client, blog):
        print(post, file=out_file)

Links:
Print more than 20 posts from Tumblr API
Also thank you very much to Harada, whose advice helped a lot!
answered Dec 16 ’20 at 0:45
gollyzoom
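The answer mentions a simple Python script that downloads every image from the list of links, but does not show it. A minimal sketch of what such a script might look like, using only the standard library, is given below. The names `download_all` and `filename_for` and the `fetch` parameter are hypothetical and not taken from the answer.

```python
import os
import urllib.request
from urllib.parse import urlparse


def filename_for(url, index):
    """Derive a local file name from a URL, falling back to a counter."""
    name = os.path.basename(urlparse(url).path)
    return name or f"image_{index}"


def download_all(urls, dest_dir, fetch=None):
    """Save each URL into dest_dir and return the paths written.

    `fetch` maps a URL to bytes. It defaults to a plain urllib request and
    exists mainly so the function can be tried without network access.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read()

    os.makedirs(dest_dir, exist_ok=True)
    written = []
    for index, url in enumerate(urls):
        url = url.strip()
        if not url or url == "None":
            continue  # skip blank lines and literal "None" entries
        # Note: two URLs with the same file name overwrite each other.
        path = os.path.join(dest_dir, filename_for(url, index))
        with open(path, "wb") as out_file:
            out_file.write(fetch(url))
        written.append(path)
    return written
```

Assuming the links were written one per line to a file (called `links.txt` here purely for illustration), `download_all(open('links.txt'), 'images')` would save them into an `images` directory.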

Tumblr – Reaper | Social Media scraping tool

To download data from Tumblr, you should make use of the Tumblr API.
Most public data on Tumblr is available over the API.
To see a list of all possible endpoints on the API, visit the reference:
The reference will also explain what information you can get out of the Blog and Tag endpoints.
Access token
To scrape data from the Tumblr API, you will need to create an app.
Start by signing in to Tumblr, navigating to and creating an app
Once the app is created, scroll to the top of the page and copy the OAuth consumer key into Reaper.
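For context, the consumer key is the value the Tumblr API v2 accepts as `api_key` on key-authenticated endpoints such as tagged posts. The sketch below is independent of Reaper and only shows how a request URL might be assembled; the function names `tagged_url` and `fetch_tagged` are hypothetical, and the response layout (a top-level `response` list) is an assumption to check against the API reference.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://api.tumblr.com/v2"


def tagged_url(tag, api_key, before=None, limit=20):
    """Build the request URL for the tagged-posts endpoint."""
    params = {"tag": tag, "api_key": api_key, "limit": limit}
    if before is not None:
        # Unix timestamp: only posts older than this are returned.
        params["before"] = before
    return f"{API_ROOT}/tagged?{urlencode(params)}"


def fetch_tagged(tag, api_key, before=None, limit=20):
    """Request one page of posts; needs network access and a valid key."""
    url = tagged_url(tag, api_key, before, limit)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    # Assumption: the posts sit under the top-level "response" key.
    return payload["response"]


if __name__ == "__main__":
    print(tagged_url("art", "YOUR_CONSUMER_KEY"))
```

Only `tagged_url` runs offline; `fetch_tagged` is left uncalled here because it needs a real key.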
