How to Scrape Yelp Data: A Step-by-Step Tutorial


Vilius Dumcius
Yelp is a well-known review platform that gained popularity almost two decades ago. Yelp reviews remain one of the most reliable and popular sources of information on various restaurants, bars, and other establishments.
Due to its prominence, Yelp data has become highly sought after, serving as a foundation for some businesses. Yelp data can be used to supplement various business strategies, monitor competitors, and much more.
Extracting such information manually, however, can be tiring and time-consuming. A Yelp scraper can completely automate the process and store the data in an easy-to-read and simple format.
Understanding Yelp Data
Yelp data can be separated into several categories:
1. Business data
There’s plenty of business data available on the platform, as companies list addresses, phone numbers, names, websites, operating hours, and an associated category. As such, Yelp can double as a good business data catalog.
2. Reviews
The feature Yelp is known for – there are millions of lines of user feedback available on the platform for anyone to look through. While it’s often not as useful as numerical Yelp data, scraping reviews can still be beneficial in some circumstances.
3. Aggregated ratings
Simple numerical data that shows the number of ratings and the average rating of establishments. Easy to collect and potentially powerful.
4. Metadata
You can find lots of other Yelp data that could be helpful, such as amenities provided (or not provided), number of photos, etc.
These categories cover most of what Yelp data offers. While the applications for such information are somewhat niche, there’s enormous potential due to the sheer amount of data available, such as business details.
On the other hand, the volume of data means you’ll be hard-pressed to find a good way to extract it manually. The most reasonable approach is to either build or get a Yelp scraper.
Is Scraping Yelp Legal?
The legality of web scraping is a somewhat complicated question. Generally, publicly available data that does not require a login, is not personal information, and is not protected by copyright is fine to scrape.
While Yelp data is mostly publicly available, a lot of it could be protected by copyright (such as images or reviews) or be personally identifiable information (such as first and last names). Due to how web scraping works, it’s always best to consult a legal professional first.
There are safer web scraping zones on Yelp, however, such as search results, as these only include a business name, a single photo, the number of reviews, a single quote, and an overall review score.
Additionally, you should always follow the directions provided by the Terms of Service, the robots.txt file, and general web scraping best practices. While it’s unlikely that you’d incur legal penalties, your IP address might get blocked.
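If you want to automate that check, Python’s built-in urllib.robotparser can read the robots.txt file and tell you whether a given path is allowed. Below is a minimal sketch; the user agent and search URL are placeholders you’d replace with your own:

from urllib import robotparser

# Load Yelp's robots.txt once and reuse the parser for every URL you plan to visit.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.yelp.ie/robots.txt")
rp.read()

url = "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin"
if rp.can_fetch("*", url):
    print("Allowed by robots.txt")
else:
    print("Disallowed by robots.txt - skip this URL")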
Tools and Libraries for Scraping Yelp
Building a Yelp scraper will rely on third-party libraries or tools if you want to make the process efficient instead of building everything from the ground up. Python is commonly used for web scraping because the language is easy to learn and has wide community support.
Python also has plenty of libraries that support web scraping, making it much easier to build a Yelp scraper on your own.
Some of the popular libraries and tools include:
1. Requests lets you send various HTTP requests to a website (in this case, Yelp) easily and effectively.
2. BeautifulSoup makes parsing HTML files a lot easier and quicker, allowing you to search through the data for valuable information.
3. Selenium (or any other browser automation library) is a good choice as a fallback when Requests fails. It’s a slower way to access websites, but usually less ban-prone, and it can execute JavaScript to load elements that Requests can’t.
4. Proxies, or residential proxies to be more precise, are necessary to scrape Yelp data at scale. Getting banned is almost unavoidable, so you’ll need a pool of IP addresses (a short example of plugging a proxy into Requests follows this list).
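As a quick illustration, here’s a minimal sketch of how a proxy plugs into the Requests library used later in this tutorial. The proxy address and credentials are placeholders; substitute the endpoint your provider gives you:

import requests

# Hypothetical proxy endpoint - replace with your provider's details.
proxies = {
    "http": "http://username:password@proxy.example.com:8080",
    "https": "http://username:password@proxy.example.com:8080",
}

response = requests.get(
    "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin",
    proxies=proxies,
    timeout=30,
)
print(response.status_code)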
You may also want some library to export data to a handier format than the IDE output. Pandas is a good choice for Python as it’ll allow you to export your web scraping data into a CSV file.
Step-by-Step Guide to Scraping Yelp Data
To build our Yelp scraper, we’ll be using Python and some of the aforementioned libraries. You’ll need an IDE (such as PyCharm or Visual Studio Code) and to install the libraries. So, start a project, open up the Terminal, and type in:
pip install requests beautifulsoup4 selenium pandas
It should take some time for all the libraries to be downloaded and installed. Once that’s done, we can begin coding our Yelp data scraper.
We’ll start by importing the libraries and naming our main function:
import requests
from bs4 import BeautifulSoup
def main():
We’ll also need two more functions at the least – one to scrape search results (we’ll be focusing on those) and one to parse them. Let’s start with the collection:
import requests
from bs4 import BeautifulSoup

def main():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
    url = "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin"
    get_data(url, headers)

def get_data(urls: str, http_headers: dict) -> str:
    response = requests.get(urls, headers=http_headers)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch data. Status code: {response.status_code}")
        return ""

if __name__ == "__main__":
    main()
Our collection function, called “get_data”, accepts two arguments: “urls” (a string) and “http_headers” (a dictionary). Note that in the current implementation, only a single URL is accepted, but the function can be extended to accept lists as well.
In our main function, we assign the headers we want with a dictionary and provide a single URL, then call the “get_data” function. It’ll use the information to send a GET request to the URL and return the response text.
We also check the HTTP status code to see whether the request was successful. If it wasn’t, we’ll fall back to Selenium and try to visit the page with a browser. Let’s leave that function out of scope for now.
There are two ways to implement our next step – parsing. It can be done as a function within a function (although that would not follow Python programming standards) or as a separate function that’s called within the main one.
Let’s go with the second option:
def parse_data(html_file: str) -> None:
    soup = BeautifulSoup(html_file, "html.parser")
    listings = soup.find_all("h3", class_="css-1x1e1r2")

    if not listings:
        print("No listings found. Check the class name or page structure.")
        return

    for listing in listings:
        business_name = listing.get_text()
        print(business_name)
Since we’ll just be printing names into our output, we won’t be returning anything. If you want to export to a CSV file, you’ll need to create a list (at the very least) and transform that with Pandas.
One of the most critical parts is visiting the Yelp search results page manually to check whether our “find_all” function correctly targets the data we want. Since business names are stored in H3 elements with an associated class, we can simply copy those values into the function.
“find_all” also returns a ResultSet (similar to a list), which we can iterate through with a “for” loop to print out the business names.
Checking whether the “listings” object is empty is also a good approach. HTML structures change frequently, so this gives you a way to catch errors that would otherwise slip through silently.
We could technically run the script now, but you’d likely receive a 403 (Forbidden) status code. We’ll need to enhance the script with Selenium as our fallback:
import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

def get_data_with_selenium(url: str) -> str:
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--enable-javascript")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("start-maximized")
    options.add_argument("enable-automation")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("user-agent=Mozilla/5.0")

    service = Service()
    driver = webdriver.Chrome(service=service, options=options)

    try:
        driver.get(url)
        time.sleep(5)
        page_source = driver.page_source
        return page_source
    except Exception as e:
        print(f"Selenium error: {e}")
        return ""
    finally:
        driver.quit()
A few things are going on here. First, we start with a list of options that fine-tune Selenium. The key ones are headless mode, which hides the browser window, and JavaScript rendering, which is important for bypassing some anti-bot methods.
Note that for debugging purposes, it’s a good idea to disable headless mode so you can see what happens on the page.
We then create a Service object, which is used to start our Chrome WebDriver. The driver visits the page and attempts to download the page source unless an error occurs. Either way, the driver is then shut down.
Since we’re using a fallback, we’ll need to update our “main” function as well:
def main():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
    url = "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin"

    data = get_data(url, headers)
    if not data:
        print("Falling back to Selenium...")
        data = get_data_with_selenium(url)

    if data:
        parse_data(data)
    else:
        print("Failed to fetch data from Yelp.")
Now, all of this may still not be enough. Yelp is highly protective of its search results, so you may run into CAPTCHAs. Clearing these is quite difficult, and your best bet is to switch IP addresses often with a proxy or use a third-party service to solve them. Alternatively, you could increase the sleep time and disable headless mode to solve CAPTCHAs yourself.
Finally, you could use additional libraries like “Selenium Stealth” to reduce the likelihood of getting blocked.
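Both options are easy to bolt onto the fallback function. As a rough sketch (not part of the tutorial’s final script), a proxy can be passed to Chrome with the --proxy-server argument, and selenium-stealth can be applied to the driver. The proxy address below is a placeholder, and the stealth parameters are commonly used example values rather than required settings:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth  # pip install selenium-stealth

options = Options()
options.add_argument("--headless")
# Hypothetical proxy endpoint - replace with an address from your proxy pool.
options.add_argument("--proxy-server=http://proxy.example.com:8080")

driver = webdriver.Chrome(options=options)

# Mask common automation fingerprints before visiting the page.
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin")
print(len(driver.page_source))
driver.quit()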
Here’s how your result should look so far:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import time

def main():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
    url = "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin"

    data = get_data(url, headers)
    if not data:
        print("Falling back to Selenium...")
        data = get_data_with_selenium(url)

    if data:
        parse_data(data)
    else:
        print("Failed to fetch data from Yelp.")

def get_data(urls: str, http_headers: dict) -> str:
    response = requests.get(urls, headers=http_headers)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch data. Status code: {response.status_code}")
        return ""

def parse_data(html_file: str) -> None:
    soup = BeautifulSoup(html_file, "html.parser")
    listings = soup.find_all("a", class_="y-css-1x1e1r2")

    if not listings:
        print("No listings found. Check the class name or page structure.")
        return

    for listing in listings:
        business_name = listing.get_text()
        print(business_name)

def get_data_with_selenium(url: str) -> str:
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--enable-javascript")
    options.add_argument("--no-sandbox")
    options.add_argument("start-maximized")
    options.add_argument("enable-automation")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36")

    service = Service()
    driver = webdriver.Chrome(service=service, options=options)

    try:
        driver.get(url)
        time.sleep(15)
        page_source = driver.page_source
        return page_source
    except Exception as e:
        print(f"Selenium error: {e}")
        return ""
    finally:
        driver.quit()

if __name__ == "__main__":
    main()
Let’s make a final upgrade – export data to a CSV file. We’ll need to import Pandas and modify our “parse_data” function first:
def parse_data(html_file: str) -> list:
    """Parse the HTML content to extract business data."""
    soup = BeautifulSoup(html_file, "html.parser")
    listings = soup.find_all("a", class_="y-css-1x1e1r2")
    business_data = []

    if not listings:
        print("No listings found. Check the class name or page structure.")
        return business_data

    for listing in listings:
        business_name = listing.get_text(strip=True)
        business_url = f"https://www.yelp.ie{listing['href']}"
        business_data.append({"Business Name": business_name, "URL": business_url})

    return business_data
Instead of simply printing, we now create an empty list to which we append the business name and URL.
After modifying the parsing function, we’ll need to upgrade our main with the imports and a new function that saves data to CSV:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time

def main():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
    url = "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin"

    data = get_data(url, headers)
    if not data:
        print("Falling back to Selenium...")
        data = get_data_with_selenium(url)

    if data:
        business_data = parse_data(data)
        if business_data:
            save_to_csv(business_data, "yelp_dublin_coffee.csv")
        else:
            print("No business data found to export.")
    else:
        print("Failed to fetch data from Yelp.")

[...]

def save_to_csv(data: list, filename: str) -> None:
    """Save the extracted business data to a CSV file."""
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data exported to {filename}")
We now call the “save_to_csv” function if we get any data from Yelp. That data is sent to the saving function that creates a Pandas DataFrame and exports it to a CSV. Nothing is returned because all we need is a file.
Here’s how your final result should look:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd
import time

def main():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                             "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"}
    url = "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin"

    data = get_data(url, headers)
    if not data:
        print("Falling back to Selenium...")
        data = get_data_with_selenium(url)

    if data:
        business_data = parse_data(data)
        if business_data:
            save_to_csv(business_data, "yelp_dublin_coffee.csv")
        else:
            print("No business data found to export.")
    else:
        print("Failed to fetch data from Yelp.")

def get_data(urls: str, http_headers: dict) -> str:
    """Fetch HTML content using the Requests library."""
    response = requests.get(urls, headers=http_headers)
    if response.status_code == 200:
        return response.text
    else:
        print(f"Failed to fetch data. Status code: {response.status_code}")
        return ""

def parse_data(html_file: str) -> list:
    """Parse the HTML content to extract business data."""
    soup = BeautifulSoup(html_file, "html.parser")
    listings = soup.find_all("a", class_="y-css-1x1e1r2")
    business_data = []

    if not listings:
        print("No listings found. Check the class name or page structure.")
        return business_data

    for listing in listings:
        business_name = listing.get_text(strip=True)
        business_url = f"https://www.yelp.ie{listing['href']}"
        business_data.append({"Business Name": business_name, "URL": business_url})

    return business_data

def get_data_with_selenium(url: str) -> str:
    """Fetch HTML content using Selenium."""
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--enable-javascript")
    options.add_argument("--no-sandbox")
    options.add_argument("start-maximized")
    options.add_argument("enable-automation")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/132.0.0.0 Safari/537.36")

    service = Service()
    driver = webdriver.Chrome(service=service, options=options)

    try:
        driver.get(url)
        time.sleep(15)
        page_source = driver.page_source
        return page_source
    except Exception as e:
        print(f"Selenium error: {e}")
        return ""
    finally:
        driver.quit()

def save_to_csv(data: list, filename: str) -> None:
    """Save the extracted business data to a CSV file."""
    df = pd.DataFrame(data)
    df.to_csv(filename, index=False)
    print(f"Data exported to {filename}")

if __name__ == "__main__":
    main()
Yelp search results are a tough nut to crack. Sometimes you’ll get the data easily, but sometimes there’ll be a lot of legwork involved in scraping.
Best Practices for Scraping Yelp Data
There are a few approaches you should take – rate limiting, proxies, and various fallback mechanisms. Rate limiting is the easiest: if you’re scraping a lot of URLs, simply place delays between visits, as in the sketch below.
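Here’s a minimal sketch of that idea, reusing the “get_data” and “parse_data” functions from the tutorial; the URLs and delay range are just examples:

import random
import time

urls = [
    "https://www.yelp.ie/search?find_desc=coffee&find_loc=Dublin",
    "https://www.yelp.ie/search?find_desc=pizza&find_loc=Dublin",
]

for url in urls:
    data = get_data(url, headers)  # headers defined as in the main() function above
    if data:
        parse_data(data)
    # Wait a randomized 5-15 seconds before the next request.
    time.sleep(random.uniform(5, 15))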
Proxies will help you avoid IP bans and CAPTCHAs, both of which can be extremely painful when scraping large volumes of data.
Finally, experiment with browser automation libraries to find out which one minimizes detection rates.

Author
Vilius Dumcius
Product Owner
With six years of programming experience, Vilius specializes in full-stack web development with PHP (Laravel), MySQL, Docker, Vue.js, and Typescript. Managing a skilled team at IPRoyal for years, he excels in overseeing diverse web projects and custom solutions. Vilius plays a critical role in managing proxy-related tasks for the company, serving as the lead programmer involved in every aspect of the business. Outside of his professional duties, Vilius channels his passion for personal and professional growth, balancing his tech expertise with a commitment to continuous improvement.