Scraping video info from Youtube | Octoparse
The latest version of this tutorial is available here. Check it out now!
In this tutorial, we will show you how to scrape video information from YouTube. A ready-to-use YouTube template is also included in our latest version; you can check it out here: Task Templates.
If you would like to build a Youtube scraper from scratch:
you may be interested in this article about Youtube Channel Crawler
you might want to use the URL in this tutorial:
Here are the main steps in this tutorial: [Download task file here]
1. “Go To Web Page” – to open the targeted web page
2. Create a “Loop Item” – to loop through the search keywords
3. Deal with infinite scrolling
4. Create a “Loop Item” – to loop through and extract each item
5. Extract data – to select the data you need to scrape
6. Run extraction – to run your task and get data
1) “Go To Web Page” – to open the targeted web page
Click “+Task” to start a new task with Advanced Mode
Paste the URL into the “Input URL” box
Click “Save URL” to move on
2) Create a “Loop Item” – to loop through the search keywords
We can customize our “text list” to create a loop search action. Octoparse will automatically enter every keyword in the list into the search box, one line at a time.
Drop a “Loop Item” action into the workflow designer
Go to loop mode and select “Text list”
Click “a” to enter the keyword list, with one keyword per line. Here we’ll enter “Big Data” and “Machine Learning”
Click “OK” and “OK” when you finish entering. Then you can see your keywords in the “Loop Item”
Click on the search box on the page in the built-in browser and select “Enter text” on “Action Tips”
When you click on the input field in the built-in browser, Octoparse detects that you have selected a search box, and the “Enter text” action automatically appears on “Action Tips”.
Input the first keyword “Big Data” on “Action Tips”
Click “OK”, then the “Enter Text” action will be generated in the workflow
Drag the “Enter Text” action into the “Loop Item”. Click on the “Enter Text” action
Go to “Loop Text” and select “Use the text in loop item to fill in the text box” and click “OK” to save.
Click the search button on the web page and select “Click button” on “Action Tips”. You will notice that the “Click Item” action is added to the workflow.
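For readers who prefer to see the idea in code, the keyword loop above can be sketched in Python. This is not part of Octoparse; it is a minimal illustration that builds one search URL per line of a text list. The results URL pattern and the function name `build_search_urls` are assumptions made for this sketch.

```python
from urllib.parse import urlencode

# Assumed URL pattern for a YouTube search results page.
SEARCH_URL = "https://www.youtube.com/results"


def build_search_urls(keywords):
    """Return one search URL per non-empty keyword, in list order."""
    urls = []
    for keyword in keywords:
        keyword = keyword.strip()
        if not keyword:
            continue  # skip blank lines in the text list
        urls.append(SEARCH_URL + "?" + urlencode({"search_query": keyword}))
    return urls


if __name__ == "__main__":
    # The same two keywords used in the tutorial.
    for url in build_search_urls(["Big Data", "Machine Learning"]):
        print(url)
```

Each keyword is handled independently, which mirrors how the “Loop Item” feeds one line at a time into the search box.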
3) Dealing with infinite scrolling
In this case, pagination is not available for loading the search results, so we need to scroll down to the bottom of the page repeatedly to load all the content.
Check “Scroll down to bottom of the page when finished loading” under “Advanced Options”
Set the “Scroll times” and “Interval” you need
Select “Scroll down to bottom of the page” as “Scroll way”
Click “OK” button to save the result
Make sure that you enter a value for “Scroll times”; otherwise Octoparse won’t perform the scroll-down action. We suggest setting a relatively high value for “Scroll times” if you need more data.
Most social media websites use scroll-down-to-refresh to show more data. Click here to learn more: Dealing with infinite scrolling.
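The “Scroll times” and “Interval” settings map naturally onto a small loop. The sketch below is a hedged illustration in Python, not Octoparse's implementation: it assumes a browser-automation object that exposes an `execute_script` method (as Selenium's WebDriver does), and the function name and the early-stop rule are choices made for this example.

```python
import time

GET_HEIGHT = "return document.documentElement.scrollHeight"
SCROLL_DOWN = "window.scrollTo(0, document.documentElement.scrollHeight);"


def scroll_to_bottom(driver, scroll_times=10, interval=2.0, sleep=time.sleep):
    """Scroll to the bottom of the page up to `scroll_times` times.

    Waits `interval` seconds after each scroll and stops early once the
    page height no longer grows. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script(GET_HEIGHT)
    performed = 0
    for _ in range(scroll_times):
        driver.execute_script(SCROLL_DOWN)
        performed += 1
        sleep(interval)  # give the page time to load more items
        new_height = driver.execute_script(GET_HEIGHT)
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return performed
```

A higher `scroll_times` loads more results, at the cost of a longer run, which is the same trade-off described above.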
4) Create a “Loop Item” – to loop through and extract each item
When you create a list of items to scrape from a website, the list may sometimes include several “Ads” items. To exclude the promotional video in this case, we can start building the “Loop Item” from the second row of results on this page.
Select the second block in the built-in browser
Make sure the whole block of the first video item is highlighted in blue when you hover your mouse over it. Only then will the whole item block be highlighted in green after clicking, covering all the other information, such as the video title, channel name, total
Click the third and fourth whole video item, until Octoparse identifies all other videos.
Octoparse will automatically recognize the other blocks and highlight them in green. (If not, keep clicking on the next one till all of them are selected)
Click “Extract text of the selected element” on the “Action Tips” panel.
Normally we can just click “Select all sub-elements” on the “Action Tips” panel, but under certain circumstances (as in this case), Octoparse only recognizes the sub-elements in the second block and fails to do so in the other blocks. Thus, we’ll create a loop first, and then manually select the data to extract from each block in the next step.
5) Extract data – to select data you need to scrape
Click the data you need in the item block that is highlighted in red.
Click “Extract text of the selected element” and rename the “Field name” column if necessary.
Rename the fields by selecting from the pre-defined list or inputting on your own
Click “OK” to save the result.
6) Run extraction – to run your task and get data
Click “start extraction”
Select “local extraction” to run the task on your computer
Below is the output sample:
Selenium Python | Web Scraping Youtube – Analytics Vidhya
This article was submitted as part of Analytics Vidhya’s Internship Challenge.
I’m an avid YouTube user. The sheer amount of content I can watch on a single platform is staggering. In fact, a lot of my data science learning has happened through YouTube videos!
So, I was browsing YouTube a few weeks ago searching for a certain category to watch. That’s when my data scientist thought process kicked in. Given my love for web scraping and machine learning, could I extract data about YouTube videos and build a model to classify them into their respective categories?
I was intrigued! This sounded like the perfect opportunity to combine my existing Python and data science knowledge with my curiosity to learn something new. And Analytics Vidhya’s internship challenge offered me the chance to pen down my learning in article form.
Web scraping is a skill I feel every data science enthusiast should know. It is immensely helpful when we’re looking for data for our project or want to analyze specific data present only on a website. Keep in mind though, web scraping should not cross ethical and legal boundaries.
In this article, we’ll learn how to use web scraping to extract YouTube video data using Selenium and Python. We will then use the NLTK library to clean the data and then build a model to classify these videos based on specific categories.
You can also check out the below tutorials on web scraping using different libraries:
Beginner’s guide to Web Scraping in Python (using BeautifulSoup)
Web Scraping in Python using Scrapy (with multiple examples)
Beginner’s Guide on Web Scraping in R (using rvest)
Note: BeautifulSoup is another library for web scraping. You can learn about this using our free course- Introduction to Web Scraping using Python.
Table of Contents
Overview of Selenium
Prerequisites for our Web Scraping Project
Setting up the Python Environment
Scraping Data from YouTube
Cleaning the Scraped Data using the NLTK Library
Building our Model to Classify YouTube Videos
Analyzing the Results
Selenium is a popular tool for automating browsers. It’s primarily used for testing in the industry but is also very handy for web scraping. You must have come across Selenium if you’ve worked in the IT field.
We can easily program a Python script to automate a web browser using Selenium. It gives us the freedom we need to efficiently extract the data and store it in our preferred format for future use.
Selenium requires a driver to interface with our chosen browser. Chrome, for example, requires ChromeDriver, which needs to be installed before we start scraping. The Selenium web driver speaks directly to the browser using the browser’s own engine to control it. This makes it incredibly fast.
There are a few things we must know before jumping into web scraping:
Basic knowledge of HTML and CSS is a must. We need this to understand the structure of a webpage we’re about to scrape
Python is required to clean the data, explore it, and build models
Knowledge of some basic libraries like Pandas and NumPy would be the cherry on the cake
Time to power up your favorite Python IDE (that’s Jupyter notebooks for me)! Let’s get our hands dirty and start coding.
Step 1: Install Python binding:
#Open terminal and type-
$ pip install selenium
Step 2: Download Chrome WebDriver:
Select the compatible driver for your Chrome version
To check the Chrome version you are using, click on the three vertical dots on the top right corner
Then go to Help -> About Google Chrome
Step 3: Move the driver file to a PATH:
Go to the Downloads directory, unzip the file, and move it to a directory on your PATH, such as /usr/local/bin.
$ cd Downloads
$ mv chromedriver /usr/local/bin/
We’re all set to begin web scraping now.
In this article, we’ll be scraping the video ID, video title, and video description of a particular category from YouTube. The categories we’ll be scraping are:
Art & Dance
So let’s begin!
First, let’s import some libraries:
Before we do anything else, open YouTube in your browser. Type in the category you want to search videos for and set the filter to “videos”. This will display only the videos related to your search. Copy the URL after doing this.
Next, we need to set up the driver to fetch the content of the URL from YouTube:
Paste the link into the (“Your Link Here”) function and run the cell. This will open a new browser window for that link. We will do all the following tasks in this browser window.
Fetch all the video links present on that particular page. We will create a “list” to store those links
Now, go to the browser window, right-click on the page, and select ‘inspect element’
Search for the anchor tag with id="video-title" and then right-click on it -> Copy -> XPath. The XPath should look something like: //*[@id="video-title"]
With me so far? Now, write the below code to start fetching the links from the page and run the cell. This should fetch all the links present on the web page and store them in a list.
Note: Traverse all the way down to load all the videos on that page.
The above code will fetch the “href” attribute of the anchor tag we searched for.
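A minimal sketch of that link-collection step is shown here. It assumes a Selenium-style `driver` whose `find_elements(by, value)` method returns elements with a `get_attribute` method; the string `"xpath"` is the locator strategy that Selenium's `By.XPATH` constant stands for. The function name and the de-duplication of repeated links are additions made for this illustration.

```python
# XPath copied from the browser's inspector, as described above.
VIDEO_TITLE_XPATH = '//*[@id="video-title"]'


def collect_video_links(driver, xpath=VIDEO_TITLE_XPATH):
    """Return the unique, non-empty href values of the matched anchors."""
    links = []
    seen = set()
    # "xpath" is the locator strategy string behind Selenium's By.XPATH.
    for element in driver.find_elements("xpath", xpath):
        href = element.get_attribute("href")
        if href and href not in seen:
            seen.add(href)
            links.append(href)  # keep page order
    return links
```

Anchors without an `href` are skipped, so the resulting list can be passed straight to the loop that opens each video.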
Now, we need to create a dataframe with 4 columns – “link”, “title”, “description”, and “category”. We will store the details of videos for different categories in these columns:
We are all set to scrape the video details from YouTube. Here’s the Python code to do it:
Let’s break down this code block to understand what we just did:
“wait” will ignore instances of NotFoundException that are encountered (thrown) by default in the ‘until’ condition. It will immediately propagate all others
driver: The WebDriver instance to pass to the expected conditions
timeOutInSeconds: The timeout in seconds when an expectation is called
v_category stores the video category name we searched for earlier
The “for” loop is applied on the list of links we created above
(x) traverses through all the links one-by-one and opens them in the browser to fetch the details
v_id stores the stripped video ID from the link
v_title stores the video title fetched by using the CSS path
Similarly, v_description stores the video description by using the CSS path
During each iteration, our code saves the extracted data inside the dataframe we created earlier.
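One detail worth illustrating is how a video ID can be pulled out of a link. The sketch below parses the URL's query string with the standard library instead of stripping characters; it is an alternative offered for illustration, not necessarily how the original code computes `v_id`, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def video_id_from_link(link):
    """Return the value of the `v` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(link).query)
    values = query.get("v")
    return values[0] if values else None


if __name__ == "__main__":
    # Example link with an extra parameter that should be ignored.
    print(video_id_from_link("https://www.youtube.com/watch?v=abc123XYZ_-&list=PL1"))
```

Parsing the query string keeps the ID intact even when other parameters follow it in the URL.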
We have to follow the aforementioned steps for the remaining five categories. We should have six different dataframes once we are done with this. Now, it’s time to merge them together into a single dataframe:
Voila! We have our final dataframe containing all the desired details of a video from all the categories mentioned above.
In this section, we’ll use the popular NLTK library to clean the data present in the “title” and “description” columns. NLP enthusiasts will love this section!
Before we start cleaning the data, we need to store all the columns separately so that we can perform different operations quickly and easily:
Import the required libraries first:
Now, create a list in which we can store our cleaned data. We will store this data in a dataframe later. Write the following code to create a list and do some data cleaning on the “title” column from df_title:
Did you see what we did here? We removed all the punctuation from the titles and only kept the English root words. After all these iterations, we are ready with our list full of data.
We need to follow the same steps to clean the “description” column from df_description:
Note: The range is selected as per the rows in our dataset.
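To make the cleaning idea concrete, here is a simplified, standard-library-only sketch. It does not use NLTK: the tiny hard-coded stopword set is a stand-in for a real stopword list, and no stemming or lemmatization is performed, so treat it as an illustration of the steps rather than the article's actual pipeline.

```python
import re

# Tiny illustrative stopword set; a real project would use a fuller list.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "is", "with"}


def clean_text(text):
    """Drop non-letter characters, lowercase, and remove stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(word for word in words if word not in STOPWORDS)


def clean_column(values):
    """Apply clean_text to every entry of a column and return a new list."""
    return [clean_text(value) for value in values]
```

The same `clean_column` helper can be applied to both the titles and the descriptions, producing one cleaned list per column.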
Now, convert these lists into dataframes:
Next, we need to label encode the categories. The “LabelEncoder()” function encodes labels with a value between 0 and n_classes – 1, where n_classes is the number of distinct labels.
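As a rough illustration of what label encoding does, the following standard-library sketch maps each distinct label to an integer based on its sorted position. The function name is hypothetical, and this is only a stand-in for a library encoder.

```python
def label_encode(labels):
    """Map each distinct label to an integer in 0 .. n_classes - 1.

    Classes are numbered in sorted order. Returns (codes, classes).
    """
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    codes = [index[label] for label in labels]
    return codes, classes


if __name__ == "__main__":
    codes, classes = label_encode(["travel", "food", "travel", "art"])
    print(codes, classes)
```

Keeping the `classes` list makes it possible to translate predicted integers back into category names later.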
Here, we have applied label encoding on df_category and stored the result into dfcategory. We can store our cleaned and encoded data in a new dataframe:
We’re not quite all the way done with our cleaning and transformation part.
We should create a bag-of-words so that our model can understand the keywords from that bag to classify videos accordingly. Here’s the code to create a bag-of-words:
Note: Here, we created 1500 features from data stored in the lists – corpus and corpus1. “X” stores all the features and “y” stores our encoded data.
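The bag-of-words idea with a feature cap can be sketched with the standard library alone. This is a simplified stand-in, not the vectorizer used in the article: it keeps the `max_features` most frequent words across the corpus (ties broken alphabetically, an arbitrary choice for this example) and counts their occurrences per document.

```python
from collections import Counter


def bag_of_words(corpus, max_features=1500):
    """Return (vocabulary, matrix) of word counts per document."""
    totals = Counter()
    for document in corpus:
        totals.update(document.split())
    # Most frequent words first; alphabetical order breaks ties.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    vocabulary = sorted(word for word, _ in ranked[:max_features])
    column = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for document in corpus:
        row = [0] * len(vocabulary)
        for word in document.split():
            if word in column:
                row[column[word]] += 1
        matrix.append(row)
    return vocabulary, matrix
```

Each row of `matrix` plays the role of one row of “X”, with one column per retained word.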
We are all set for the most anticipated part of a data scientist’s role – model building!
Before we build our model, we need to divide the data into training set and test set:
Training set: A subset of the data to train our model
Test set: Contains the remaining data to test the trained model
Make sure that your test set meets the following two conditions:
Large enough to yield statistically meaningful results
Representative of the dataset as a whole. In other words, don’t pick a test set with different characteristics than the training set
We can use the following code to split the data:
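A minimal, standard-library sketch of such a split is given below. The function name, the 20% default test size, and the fixed seed are assumptions chosen for illustration; a library helper would typically be used in practice.

```python
import random


def split_train_test(X, y, test_size=0.2, seed=0):
    """Shuffle the row indices and split X and y into train and test parts."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = max(1, round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```

Shuffling before splitting helps keep the test set representative of the dataset as a whole, which is the second condition listed above.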
Time to train the model! We will use the random forest algorithm here. So let’s go ahead and train the model using the RandomForestClassifier() function:
n_estimators: The number of trees in the forest
criterion: The function to measure the quality of a split. Supported criteria are “gini” for Gini impurity and “entropy” for information gain
Note: These parameters are tree-specific.
We can now check the performance of our model on the test set:
We get an impressive 96.05% accuracy. Our entire process went pretty smoothly! But we’re not done yet – we need to analyze our results as well to fully understand what we achieved.
Let’s check the classification report:
The result will give the following attributes:
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Precision = TP / (TP + FP)
Recall is the ratio of correctly predicted positive observations to all the observations in the actual class. Recall = TP / (TP + FN)
F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. F1 Score = 2 * (Recall * Precision) / (Recall + Precision)
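The three formulas can be checked with a few lines of Python. The counts used in the example are made up purely for illustration; they are not results from this project.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (recall * precision) / (recall + precision)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
    precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these made-up counts, precision is 8 / 10 and recall is 8 / 12, and F1 sits between the two, closer to the lower value.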
We can check our results by creating a confusion matrix as well:
The confusion matrix will be a 6×6 matrix since we have six classes in our dataset.
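To show how such a matrix is assembled, here is a small standard-library sketch. It assumes integer class labels from 0 to n_classes – 1 and places true classes on the rows and predicted classes on the columns; the three-class data in the example is invented for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Build an n_classes x n_classes matrix: rows are true, columns predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the sum of the diagonal divided by the sum of all cells."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Invented labels for a three-class example.
    m = confusion_matrix([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 2, 0], n_classes=3)
    for row in m:
        print(row)
    print(accuracy_from_matrix(m))
```

Off-diagonal cells show which categories are being confused with each other, which the single accuracy figure cannot reveal.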
I’ve always wanted to combine my interest in scraping and extracting data with NLP and machine learning. So I loved immersing myself in this project and penning down my approach.
In this article, we just witnessed Selenium’s potential as a web scraping tool. Congratulations on successfully scraping and creating a dataset to classify videos!
I look forward to hearing your thoughts and feedback on this article.
How To Scrape & Export Video Information from YouTube
You’re ready to audit and optimize your (or your clients’) YouTube channel or you want to see what a competitor is doing with their YouTube videos. But you run into the problem that YouTube makes it kind of tough to scrape and export video information.
You can’t really crawl YouTube like you can a website – it’s too large and there’s no way to control your crawl. I ran into this problem a few weeks ago while trying to map out videos and optimize titles & tags in bulk.
Here’s how to scrape & export video information from YouTube without buying sketchy blackhat scraper software.
1. Get Your Setup Ready
You’re going to need:
Scraper for Chrome (free) to scrape the video URLs.
Google Sheets (free) to organize the data.
Screaming Frog (free up to 500 URLs) to crawl your videos.
2. Load up all your YouTube videos
You’re going to just keep loading more videos until you can’t load anymore.
3. Scrape the YouTube videos
Right-click on any video link, then click Scrape Similar.
Double-check that it has scraped all the videos, then Export to Google Docs.
4. Clean up your URLs and save to text file
In your Google Doc, add onto all the URLs
Then (if necessary) strip any unneeded parameters from the videos (such as &list)
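If you would rather script this cleanup than do it in a spreadsheet, a minimal Python sketch follows. It assumes the scraped values are relative paths such as `/watch?v=...` and that `https://www.youtube.com` is the prefix to add; both are assumptions for this example, as is the function name.

```python
from urllib.parse import parse_qs, urlencode, urljoin, urlparse

# Assumed prefix for relative video paths.
BASE_URL = "https://www.youtube.com"


def clean_video_url(raw):
    """Make the URL absolute and keep only the `v` parameter.

    Returns None when the URL carries no video ID.
    """
    absolute = urljoin(BASE_URL, raw.strip())
    parts = urlparse(absolute)
    video_id = parse_qs(parts.query).get("v")
    if not video_id:
        return None
    query = urlencode({"v": video_id[0]})
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"


if __name__ == "__main__":
    print(clean_video_url("/watch?v=abc123&list=PLxyz&index=2"))
```

The output can be written one URL per line to a text file, which is the format needed in the next step.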
5. Get your list into Screaming Frog
Copy and paste your list of URLs into TextEdit (Mac) or Notepad (Windows) and save it as a text file
Now open Screaming Frog and set it to List Mode, then crawl.
6. Get it back into a spreadsheet
Export your Screaming Frog crawl and move the data to either Microsoft Excel or back to Google Sheets.
7. Use the data!
You will primarily be looking at the Title and Meta Keywords columns. Many SEOs know that YouTube uses your video tags as a relevance & quality signal.
However, many SEOs do not know that YouTube stores the video Tag information in the otherwise useless Meta Keywords field.
Aside – just so there’s no confusion, Google does not use the meta keywords tag to gauge relevance of websites. YouTube does use Tags to judge relevance of videos hosted on YouTube. The tags are stored in the meta keywords field.
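For a quick check outside Screaming Frog, the tags can be read from a page's HTML with Python's standard library. The sketch below assumes you already have the page source as a string and that the tags sit in a comma-separated `<meta name="keywords">` element, as described above; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class MetaKeywordsParser(HTMLParser):
    """Collect the comma-separated values of <meta name="keywords">."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            for keyword in content.split(","):
                if keyword.strip():
                    self.keywords.append(keyword.strip())


def extract_meta_keywords(html):
    """Return the list of keywords found in the given HTML source."""
    parser = MetaKeywordsParser()
    parser.feed(html)
    return parser.keywords
```

Running `extract_meta_keywords` over each saved page gives one tag list per video, ready to paste into a spreadsheet column.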
You can also use the data to judge meta descriptions & word count at scale to spot optimization opportunities.
Screaming Frog also has a Custom Extraction function. If you can find the Xpath of any HTML element on a YouTube page, then you can scrape it.
For example, if you want a point-in-time snapshot of YouTube views for your list of videos, just grab the XPath of the element via right-click, Inspect Element, Copy XPath.
And then drop that Xpath in the Custom Extraction section of Screaming Frog.
Note – you may have to fiddle with the User Agent to keep from being blocked. Either way – it’s a useful tool for quick analysis.
Next Steps & Additional Thoughts
Go grab Scraper for Chrome and Screaming Frog and start pulling your YouTube information into a spreadsheet! Once you have your data in order, be sure to take advantage of YouTube’s bulk editor to quickly edit tags & meta information.
Don’t forget that the Meta Keywords field is on every YouTube video, so definitely spy on and leverage your competitor’s YouTube video information.
Read more about analyzing your own YouTube Analytics here.
I’m Nate Shivar – a marketing educator, consultant, and formerly Senior SEO Specialist at a marketing agency in Atlanta, GA. I try to help people who run their own them a little better. I like to geek out on Marketing, SEO, Analytics, and Better Websites. Read more About me.