How to Scrape YouTube Videos: A Comprehensive Guide
Justas Vitaitis
YouTube is the largest public video repository in the world, and it can be an absolute treasure trove of information. Millions of minutes of content are uploaded every day, so even a single day’s worth of uploads could fuel years of analysis.
Unfortunately, that volume of data is impossible to analyze manually. Web scraping is often used to extract data from the website instead. Building a relatively basic data extraction tool for personal use is quite easy, but if you want to work at scale, a dedicated YouTube scraper is the way to go.
What Is Web Scraping?
Web scraping is the process of using bots and automated scripts to extract data from websites. Any type of information can be acquired, ranging from video data to text-based information. In the case of YouTube data, it will be a mix of many things, such as text, video, and images.
Most web scraping is performed either by programming a custom script (for example, in Python) or by using dedicated scrapers. A YouTube scraper, for instance, is optimized for extracting data from that website only. It may work on others, but usually will not.
Since most websites aren’t too keen on letting bots run amok, they often ban scrapers, which are easy to spot by the large volume of connection requests they send. As such, web scraping also involves proxies, which let users change their IP address as often as needed. Proxies render IP bans, the most popular method of preventing data extraction, largely ineffective.
So, web scraping involves several moving parts – automated scripts, data extraction, and proxies. If you want to scrape YouTube data, you’ll likely need residential proxies since these are much harder to detect than any other type.
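As a quick illustration, here’s a minimal sketch of routing a headless Chrome session through a proxy with Selenium. The proxy address below is a placeholder for whatever endpoint your provider gives you; note that Chrome’s --proxy-server flag doesn’t handle username/password authentication, so authenticated proxies usually require a browser extension or a tool like selenium-wire.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
# proxy.example.com:8080 is a placeholder; use your provider's endpoint
options.add_argument("--proxy-server=http://proxy.example.com:8080")

driver = webdriver.Chrome(options=options)
driver.get("https://www.youtube.com")
driver.quit()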
Why Scrape YouTube Data?
There are plenty of use cases for YouTube data, depending on what type you intend to collect. Video data such as titles, descriptions, and views can provide information about emerging trends. If you scrape YouTube comments, on the other hand, they can provide insights into how users interact with videos and what is deemed to be good content.
You can also extract secondary signals when comparing video data with each other. For example, there may be a wide range of videos on the same topic – looking at likes and the amount of comments and comparing them to subscribers and views can give you a better outlook on what’s considered good content.
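To make that concrete, here’s a small sketch of the kind of comparison you might run once the numbers are collected. The figures are made up purely for illustration:
# Hypothetical metadata for two videos on the same topic
videos = [
    {"title": "Video A", "views": 120_000, "likes": 8_400, "comments": 610},
    {"title": "Video B", "views": 95_000, "likes": 2_100, "comments": 90},
]

# Engagement relative to views hints at which video resonates more
for v in videos:
    like_rate = v["likes"] / v["views"]
    comment_rate = v["comments"] / v["views"]
    print(f"{v['title']}: {like_rate:.2%} like rate, {comment_rate:.2%} comment rate")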
Legal and Ethical Considerations
Like all websites, YouTube has specific rules regarding automated data collection, so it’s important to approach it with caution.
YouTube only allows web scraping with explicit permission, though the platform does permit limited scraping for certain non-commercial purposes, such as academic research. You can also use the platform’s official API, which offers 10,000 quota units per day for free, with additional requests available at a price.
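For reference, here’s a minimal sketch of querying the official API with the “requests” library. YOUR_API_KEY is a placeholder for a key created in the Google Cloud console, and keep in mind that a single search request consumes 100 of those daily units:
import requests

# YOUR_API_KEY is a placeholder; create a key in the Google Cloud console
response = requests.get(
    "https://www.googleapis.com/youtube/v3/search",
    params={
        "part": "snippet",
        "q": "iproyal",
        "type": "video",
        "maxResults": 5,
        "key": "YOUR_API_KEY",
    },
)

for item in response.json().get("items", []):
    print(item["snippet"]["title"])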
Regardless of your approach, make sure you respect the robots.txt file, which provides information on which parts of the site are accessible to bots. Additionally, rate limiting is always a good idea, as you don’t want to burden the servers with too many requests.
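Both practices are easy to automate with the standard library alone. The sketch below checks YouTube’s robots.txt before fetching and enforces a fixed delay between requests; the two-second delay is an arbitrary example value:
import time
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://www.youtube.com/robots.txt")
parser.read()

urls = ["https://www.youtube.com/results?search_query=iproyal"]
for url in urls:
    if parser.can_fetch("*", url):
        print(f"Allowed to fetch: {url}")
        # ... fetch the page here ...
        time.sleep(2)  # rate limit: arbitrary example delay
    else:
        print(f"Disallowed by robots.txt: {url}")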
Most importantly, collect only the data you intend to use. This approach will help streamline your project and minimize any potential complications.
Tools for Scraping YouTube Videos
There are plenty of out-of-the-box YouTube scraper solutions available if you’re not into building a data extraction tool all by yourself. A YouTube scraper, however, will cost some money, usually scaling quickly with the amount of information you want to extract. Building your own YouTube data extraction tool is completely free, but you’ll need time to maintain it.
Octoparse
Octoparse is a scraping tool that focuses on being user-friendly. A major feature is the drag-and-drop interface, which makes coding less important and the process of data extraction less complicated.
ParseHub
Similar to Octoparse, ParseHub is a visual web scraping tool that’s relatively easy to use. It can also easily handle AJAX and JavaScript-heavy websites – something that’s always a challenge for scraping.
Scrapy
Unlike the visual tools above, Scrapy is an open-source Python framework that’s heavily geared towards large-scale scraping projects, so many of its features are tailored to those tasks. It offers powerful data collection features and plenty of customization options.
Selenium
If you want to build your own YouTube scraper, Selenium could be the starting point. It’s a popular browser automation framework with first-class Python bindings, perfect for stepping through many URLs and collecting data from each page.
Yt-dlp
Yt-dlp is a command-line program and Python library for downloading YouTube videos and extracting their metadata. It’s highly useful if you’re planning to build your own YouTube data extraction tool.
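As a taste of the command-line side, the one-liner below prints a video’s title and view count without downloading anything (VIDEO_URL is a placeholder for any public video URL):
yt-dlp --skip-download --print "%(title)s | %(view_count)s views" VIDEO_URL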
Step-by-Step Guide to Scraping YouTube Data
One of the most popular programming languages for web scraping is Python. It’s simple, efficient, and has a lot of libraries and community support for data extraction. We’ll be using Python in our examples.
Start your IDE and open up the Terminal to install all of the necessary libraries:
pip install selenium yt-dlp
Scraping Basic Video Information
We’ll start by scraping YouTube video information rather than downloading the videos themselves. Most of the time, the valuable information is in the comments, titles, or descriptions.
from yt_dlp import YoutubeDL
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time


def get_video_urls(search_query):
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    url = f'https://www.youtube.com/results?search_query={search_query}'
    driver.get(url)
    # Wait for the page to load
    time.sleep(2)  # Adjust the sleep time as needed
    video_urls = []
    video_elements = driver.find_elements(By.XPATH, '//a[@id="video-title"]')
    for video in video_elements:
        video_url = video.get_attribute('href')
        if video_url:  # Ensure the URL is not None
            video_urls.append(video_url)
    driver.quit()
    return video_urls


def get_video_info(video_url):
    opts = {
        'quiet': True,
        'skip_download': True
    }
    with YoutubeDL(opts) as yt:
        info = yt.extract_info(video_url, download=False)
    video_title = info.get("title", "")
    view_count = info.get("view_count", "")
    description = info.get("description", "")
    uploader = info.get("uploader", "")
    return {
        'URL': video_url,
        'Title': video_title,
        'Views': view_count,
        'Description': description,
        'Uploader': uploader
    }


# Example usage
search_query = 'iproyal'
video_urls = get_video_urls(search_query)

# Ensure we have some URLs
if video_urls:
    video_info_list = [get_video_info(url) for url in video_urls]
    for info in video_info_list:
        print(
            f"Title: {info['Title']}, URL: {info['URL']}, Views: {info['Views']}, "
            f"Uploader: {info['Uploader']}, Description: {info['Description']}"
        )
else:
    print("No video URLs found.")
There are two ways to go about getting the data. You can use the “requests” library together with the JSON module to dig the data out of a script tag embedded in the page. Alternatively, you can use Selenium to automate a browser, which makes searching through the rendered HTML a lot easier (albeit slower to collect).
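If you’d rather try the “requests” route, the rough sketch below pulls out the JSON blob (ytInitialData) that YouTube embeds in its search results page. This structure is undocumented and can change at any time, so treat the regular expression as a starting point rather than a guarantee:
import json
import re

import requests

response = requests.get(
    "https://www.youtube.com/results",
    params={"search_query": "iproyal"},
    headers={"User-Agent": "Mozilla/5.0"},  # plain requests are often blocked
)

# YouTube embeds search results as a JSON object in a script tag
match = re.search(r"var ytInitialData = (\{.*?\});</script>", response.text, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    print(list(data.keys()))  # explore the structure from here
else:
    print("ytInitialData not found - the page layout may have changed.")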
In this guide, we’re using Selenium. We start a headless browser instance, which visits a URL built from the “search_query” variable at the bottom of the code. We then use Selenium with an XPath expression to find all video URLs and store them in a list.
After that, we use “yt-dlp” to go through the list (skipping downloads and running in quiet mode) and extract each video’s metadata. We output the view count, title, URL, uploader, and description.
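The dictionary returned by extract_info holds far more than the fields we print. Here’s a short sketch of a few other commonly available ones; availability varies per video, so always read them with .get() and a default (VIDEO_URL is a placeholder):
from yt_dlp import YoutubeDL

with YoutubeDL({"quiet": True, "skip_download": True}) as yt:
    info = yt.extract_info("VIDEO_URL", download=False)

print(info.get("like_count"))   # may be None if likes are hidden
print(info.get("duration"))    # length in seconds
print(info.get("upload_date"))  # string in YYYYMMDD format
print(info.get("tags", []))    # list of video tags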
Scraping Comments
Comments can be some of the most valuable data from YouTube. They are also, however, much more complicated to scrape. Instead of updating our previous code, we’ll be building a smaller YouTube scraper that only works with comments.
If you want everything in one place, you can easily combine the code and provide user interaction if necessary.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import time


def get_video_urls(search_query):
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    url = f'https://www.youtube.com/results?search_query={search_query}'
    driver.get(url)
    # Wait for the page to load
    time.sleep(2)  # Adjust the sleep time as needed
    video_urls = []
    video_elements = driver.find_elements(By.XPATH, '//a[@id="video-title"]')
    for video in video_elements:
        video_url = video.get_attribute('href')
        if video_url:  # Ensure the URL is not None
            video_urls.append(video_url)
    driver.quit()
    return video_urls


def get_video_comments(video_url):
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get(video_url)
    # Wait for the page to load and comments to appear
    time.sleep(5)  # Adjust the sleep time as needed
    # Scroll down to load more comments
    last_height = driver.execute_script("return document.documentElement.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(3)  # Adjust the sleep time as needed
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    # Extract comments
    comments = []
    comment_elements = driver.find_elements(By.XPATH, '//*[@id="content-text"]')
    for comment in comment_elements:
        comments.append(comment.text)
    driver.quit()
    return comments


# Example usage
if __name__ == "__main__":
    search_query = 'iproyal'
    video_urls = get_video_urls(search_query)
    # Ensure we have some URLs
    if video_urls:
        for url in video_urls:
            comments = get_video_comments(url)
            print(f"Comments for {url}:")
            for comment in comments:
                print(comment)
    else:
        print("No video URLs found.")
Many of the functions are quite similar. We add one more that visits the video URL, scrolls down to load more comments, and then collects each comment with an XPath query.
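One caveat: on popular videos, the scroll loop can run for a long time before the page stops growing. If a sample of comments is enough, a capped variant like this sketch keeps runtimes predictable (max_scrolls and pause are arbitrary example values):
import time


def scroll_comments(driver, max_scrolls=10, pause=3):
    # Scroll at most max_scrolls times, stopping early if the page stops growing
    last_height = driver.execute_script("return document.documentElement.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(pause)
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height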
Data Storage and Analysis
Printing out data from YouTube isn’t very useful on its own, as you won’t be able to analyze the information later. Use the “pandas” library to export the data to a CSV file instead. Scraping YouTube without a dataframe library is a good exercise, but not much more.
pip install pandas
We’ll use the same approach to scrape YouTube as above: extracting comments, but this time exporting them to a CSV file.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time


def get_video_urls(search_query):
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    url = f'https://www.youtube.com/results?search_query={search_query}'
    driver.get(url)
    # Wait for the page to load
    time.sleep(2)  # Adjust the sleep time as needed
    video_urls = []
    video_elements = driver.find_elements(By.XPATH, '//a[@id="video-title"]')
    for video in video_elements:
        video_url = video.get_attribute('href')
        if video_url:  # Ensure the URL is not None
            video_urls.append(video_url)
    driver.quit()
    return video_urls


def get_video_comments(video_url):
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    driver.get(video_url)
    # Wait for the page to load and comments to appear
    time.sleep(5)  # Adjust the sleep time as needed
    # Scroll down to load more comments
    last_height = driver.execute_script("return document.documentElement.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
        time.sleep(3)  # Adjust the sleep time as needed
        new_height = driver.execute_script("return document.documentElement.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height
    # Extract comments
    comments = []
    comment_elements = driver.find_elements(By.XPATH, '//*[@id="content-text"]')
    for comment in comment_elements:
        comments.append(comment.text)
    driver.quit()
    return comments


def save_comments_to_csv(comments_data, filename):
    # Convert the list of dictionaries into a dataframe and write it out
    df = pd.DataFrame(comments_data)
    df.to_csv(filename, index=False)


# Example usage
if __name__ == "__main__":
    search_query = 'iproyal'
    video_urls = get_video_urls(search_query)
    comments_data = []
    # Ensure we have some URLs
    if video_urls:
        for url in video_urls:
            comments = get_video_comments(url)
            for comment in comments:
                comments_data.append({"video_url": url, "comment": comment})
        # Save to CSV
        save_comments_to_csv(comments_data, 'youtube_comments.csv')
        print("Comments have been saved to youtube_comments.csv")
    else:
        print("No video URLs found.")
Our scraper will run for some time and output a CSV file. We only need “pandas” for the export itself, so there aren’t many changes to the code.
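Once the CSV file exists, analysis is straightforward. As a simple example, this sketch loads the exported file with pandas and summarizes the comment count and average comment length per video:
import pandas as pd

df = pd.read_csv("youtube_comments.csv")

# Comments per video and average comment length, busiest videos first
summary = df.groupby("video_url").agg(
    comments=("comment", "count"),
    avg_length=("comment", lambda s: s.str.len().mean()),
)
print(summary.sort_values("comments", ascending=False))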
Conclusion
If you’re willing to pay, ready-made YouTube scrapers are available from various providers. If you’d rather customize everything and pay nothing, building your own scraper is the way to go. There are plenty of ways to extract data from YouTube, so you don’t have to limit yourself to Selenium and yt-dlp, but the combination is simple and effective.
Author
Justas Vitaitis
Senior Software Engineer
Justas is a Senior Software Engineer with over a decade of proven expertise. He currently holds a crucial role in IPRoyal’s development team, regularly demonstrating his profound expertise in the Go programming language, contributing significantly to the company’s technological evolution. Justas is pivotal in maintaining our proxy network, serving as the authority on all aspects of proxies. Beyond coding, Justas is a passionate travel enthusiast and automotive aficionado, seamlessly blending his tech finesse with a passion for exploration.