Google Sheets Web Scraping Guide 2024
Eugenijus Denisov
Web scraping usually involves using programming languages like Python to create scripts, buying proxies, and creating complicated logic to evade bans.
You can avoid doing most of that and perform web scraping at a smaller scale by using Google Sheets. Google Sheets web scraping has plenty of benefits: no programming is required, it's free, it's accessible from anywhere, and most people are at least somewhat familiar with Google Sheets.
So, Google Sheets web scraping is great if you need a smaller set of data and don’t want to worry about programming.
How to Scrape Data From a Website Into Google Sheets
There are quite a few methods you can use to perform Google Sheets web scraping. We’ll go over a few basic options that use built-in functions in Google Sheets and an advanced method that involves creating custom scripts.
Importdata Function
One of the simplest built-in functions is “importdata”:
IMPORTDATA(url)
As an example:
=IMPORTDATA("http://example.com/data.csv")
“Importdata” only accepts a URL that directly points to a CSV (Comma-Separated Values) or TSV (Tab-Separated Values) file.
Unfortunately, it's not all that useful for general web scraping, since few websites store the information you need in a handy CSV file. If you do find such a website, however, Google Sheets web scraping will be a breeze with “importdata”.
Importfeed Function
A significantly less popular option is “importfeed”. It pulls data from an RSS or Atom feed at a given URL straight into Google Sheets:
IMPORTFEED(url, [query], [headers], [num_items])
Here’s a simple rundown of all the arguments:
- url : The URL of the feed.
- query : Optional. A string that indicates which tag to fetch from the feed (like “title”). There's a set of dedicated query names, which you can find in the documentation.
- headers : Optional. Accepts TRUE or FALSE. If true, it will retrieve the headers and put them into the first row.
- num_items : Optional. The number of items to import.
=IMPORTFEED("http://example.com/feed", "items title", TRUE, 10)
In this example, Google Sheets would import the titles of 10 items from the feed and include a header row.
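The other dedicated query names work the same way. For instance, assuming the feed at that URL defines a title of its own, the query “feed title” fetches the feed's title rather than the titles of its items:
=IMPORTFEED("http://example.com/feed", "feed title")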
Importxml Function
The “importxml” function is one of the most powerful functions for Google Sheets web scraping. However, it’s slightly more complicated as you’ll need to do some code inspection.
IMPORTXML(url, xpath_query)
URL is the link to any website or HTML document. XPath query is the path to the desired data in the website.
HTML documents have a tree structure, and XPath is a query language for navigating that structure to reach specific parts. Instead of looking at HTML as a flat file, we can think of its structure as a set of nested folders. For example, say we want to select all paragraphs in the HTML file's body; the path through those “folders” would look like:
"/html/body/p"
Additionally, you can use expressions like “//*” to select every element of a given type anywhere in the document. For example, “//h1” would select all elements with the “h1” tag, regardless of where they sit in the tree.
Note that “thinking of an HTML file as a folder” is an analogy, and it's not perfect. It does, however, make it easier to reason about how to find the things you want to find.
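To make the analogy concrete, here's a minimal example. Assuming a page at example.com whose paragraphs sit directly inside the body element, this formula would pull all of those paragraphs into the sheet:
=IMPORTXML("http://example.com", "/html/body/p")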
If you’re looking for more details about all of the XPath queries and functions, W3Schools has a great tutorial.
Modern browsers, however, have a handy feature that lets you get the XPath for any element you can see on your screen.
All you have to do is right-click the element and select “Inspect”. A small panel will pop up either on the right-hand side of the screen or at the bottom, with your element automatically selected.
Right-click the highlighted element, hover over “Copy”, and select “Copy Full XPath”. Then paste the result into the formula's query argument.
Here’s how it might look:
=IMPORTXML("http://example.com", "//h1")
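If you paste a full XPath copied from the inspector instead of writing your own, the formula might look something like this (the exact path below is hypothetical – it depends entirely on the page's structure):
=IMPORTXML("http://example.com", "/html/body/div/h1")
You can also target attributes rather than text. For instance, the query “//a/@href” would return the destination URL of every link on the page.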
Creating Custom Scripts
Google Sheets (and the entirety of G Suite) has a powerful addition that few people take advantage of – Apps Script. You can code an entire custom script that scrapes web data directly into Google Sheets.
You can open up custom scripting options in Google Sheets by clicking on “Extensions” and then “Apps Script”. A new window will then open.
While there are tons of services and functions you can use, “UrlFetchApp” is the most useful one for web scraping. We’ll be using it to scrape web data with Google Sheets.
Here’s a basic script that lets you collect web data:
function scrapeWebsite() {
  var url = 'https://iproyal.com'; // The URL of the website to scrape
  var response = UrlFetchApp.fetch(url); // Fetch the website
  var content = response.getContentText(); // Get the content as text

  // Parse the content for specific data
  var myData = parseData(content);

  // Write data to the sheet
  writeToSheet(myData);
}

function parseData(html) {
  // Regex to extract content between <title> tags
  var dataPattern = /<title>(.*?)<\/title>/gi; // Case-insensitive match
  var matches = [];
  var match;
  while ((match = dataPattern.exec(html)) !== null) {
    matches.push(match[1]);
  }
  return matches;
}

function writeToSheet(data) {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  // Clear existing data to prevent appending repeatedly on every script run
  sheet.clear();
  // Assuming data is an array of titles, we write each to a new row
  for (var i = 0; i < data.length; i++) {
    sheet.appendRow([data[i]]);
  }
}
We start with a function that defines our URL. We then call the fetch service on that URL, which essentially sends a GET request, and store the entire content of the response as text.
Next, we call a helper function, parseData(). Since the raw HTML will be messy and we won’t need everything, we define a regex pattern that finds all text between title tags, then loop through the matches and collect them into a list.
Finally, we call another function, writeToSheet(), to put all that data into our spreadsheet. Most of it is self-explanatory; note, however, that we use a “for” loop so that each title lands in its own row instead of all of the data going into a single row.
The first time you run the script, you’ll be asked to review permissions, so make sure you grant them. Once the script completes, you should have all the data in the Google Sheets tab from which you opened the Apps Script editor.
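If you’d rather have the scraper run on a schedule than trigger it by hand, one option is a time-driven trigger. The sketch below is an example rather than part of the walkthrough above – running createTrigger() once from the editor makes scrapeWebsite() fire every six hours, with the interval chosen arbitrarily:

function createTrigger() {
  // Time-driven trigger: runs scrapeWebsite() every 6 hours (example interval)
  ScriptApp.newTrigger('scrapeWebsite')
    .timeBased()
    .everyHours(6)
    .create();
}

You can review or delete triggers later from the “Triggers” panel in the Apps Script editor.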
Pros and Cons of Google Sheets Web Scraping
Google Sheets web scraping provides you with a powerful two-in-one package. Google Sheets is great for analyzing medium-sized datasets thanks to its vast pool of built-in functions.
Additionally, Google Sheets is part of G Suite, which means it integrates easily with similar products. If you need to build a complicated data pipeline out of Google Sheets and other Google products, it’s going to be a lot easier than stitching together third-party tools.
On the other hand, web scraping using Google Sheets is quite limited. Google restricts how many calls you can make at once so that scripts don’t degrade the performance of other servers, which means you’re always rate-limited.
Additionally, even with custom scripts, you’re quite limited in output formats: everything ends up in a spreadsheet. While spreadsheet formats are powerful and useful for data analysis, that’s still a built-in constraint.
Finally, customization is quite limited. You have a system that relies entirely on Google Sheets and Apps Script. If you want to build something outside of the G Suite environment, that won’t be possible.
Author
Eugenijus Denisov
Senior Software Engineer
With over a decade of experience under his belt, Eugenijus has worked on a wide range of projects - from LMS (learning management system) to large-scale custom solutions for businesses and the medical sector. Proficient in PHP, Vue.js, Docker, MySQL, and TypeScript, Eugenijus is dedicated to writing high-quality code while fostering a collaborative team environment and optimizing work processes. Outside of work, you’ll find him running marathons and cycling challenging routes to recharge mentally and build self-confidence.