Google Sheets has some kick-ass functions that can do a lot of heavy lifting.
As a part of my online work, I often need to go to some websites and extract specific data points or sections from them.
For example, I sometimes have to copy and paste a table from a web page into Google Sheets or fetch the article title or product name from a web page into Google Sheets.
Since I spend a lot of my time in Google Sheets anyway, I thought of figuring out if I could scrape the data from the websites and extract the data into the cells in Google Sheets.
After a few hours of scanning content on various forums and YouTube, I was able to create a basic scrapper in Google Sheets (for my needs).
In this article, I want to show you how you can use simple Google Sheets formulas to scrape content from the web.
Before I show you some examples of how to scrape data in Google Sheets, let’s first have a look at some formulas that make it possible
Important Google Sheets Formulas for Scrapping
Below web scrapping formulas that we would be covering in this tutorial:
IMPORT XML formula
With the IMPORTXML formula, you can fetch the data from many structured data types such as XML, HTML, CSV/TSV, and RSS/ATOM XML feeds.
Don’t worry if you feel a bit lost here.
You’ll get a much better idea of how this works when we go to some examples and I show you how to use IMPORTXML for scrapping webpage titles or specific sections such as date or author name.
IMPORTXML formula has the below syntax:
- ‘url’ is the URL of the web page from which you want to scrape the data
- ‘xpath_query’ is the identifier that tells the formula what to scrape. For example, if you want to scrape the title of an article, you will use the query that tells the formula what part of the webpage code refers to the title.
With IMPORTHTML, you can fetch either a table or a list from a webpage.
In most cases, I use this when there is a table on a webpage that I want to fetch. In case there are multiple tables, you can specify which table to scrape.
IMPORTHTML formula has the below syntax:
IMPORTHTML(url, query, index)
- ‘url’ is the URL of the web page from which you want to scrape the data
- ‘query’ can be a “list” or a “table”, based on what you want to extract
- index is the number that will tell Google Sheets which table to list to fetch
While IMPORTXML and IMPORTHTML are meant to be used with regular URLs, the IMPROTFEED function is specifically designed to deal with RSS or Atom feeds.
A use-case of this could be when you want to fetch the top 10 article titles using the RSS feed of your favorite blog or news website.
IMPORTFEED formula has the below syntax:
IMPORTFEED(url, [query], [headers], [num_items])
- url – this is the URL of the ATOM or RSS feed of the blog/website
- [query] – this is an optional argument where you tell the formula the data point you want to fetch (such as the title of the post or the date of the post). If you don’t specify anything, it will fetch everything
- [headers] – this is an optional argument where you can make this TRUE if you want an additional row that shows the header
- [num_items] – this is also an optional argument where you specify how many items you want to fetch. For example, if you’re fetching the latest posts from a website using the RSS feed, then you can specify 5 here and the five latest posts would be displayed.
Now that we have covered all the Google Sheets formulas you need to know to scrape the websites, let’s dive in and have a look at some of the web scraping examples.
Scraping Tables From Websites into Google Sheets
Let’s say you want to scrape the table from Wikipedia about the top-grossing movies of all time.
Here is the URL of the page that has this table – https://en.wikipedia.org/wiki/List_of_highest-grossing_films
Below is the formula that would allow you to get this table into Google Sheets
The above formula goes to the specified URL, and from the HTML code of the web-page, it identifies the first table and fetches the entire table in Google Sheets.
Quite impressive to be honest.
Note that Google sheet returns an array of values, and for this formula to work you need to have the entire range of cells free and available for the IMPORTHTML formula to fill that area.
If you already have something in one of the cells (a cell that would otherwise have been used by the result of the ImportHTML formula) you would see an error.
Now, this is great.
But what if I do not want the entire table.
What if I only want the names of the top-grossing movies.
You can do that as well.
Since the formula returns an array of values, you can use this as an input within the INDEX function and only fetch the column that you need.
Below is the formula that would only give you the names of the top-grossing movies from the table.
The above INDEX formula only fetches the third column, which has the movie names.
Just the way I have used the formula to scrap the table, you can also use it to fetch a list as well. Just change the second argument from “table” to “list”.
Note: This formula would refresh when you open the Google Sheets document again later. This way, in case there is any change in the list, it would automatically be updated. If you just need the table and don’t want it to update, copy the data, and paste it as values.
Scrapping an Article Title
Here is a random article I picked from Forbes.
Now I want to use the formulas to scrape the headline of this article.
This could be useful when you have a list of 50 or 100 URLs and you want to quickly fetch the titles.
I have entered this URL in cell A2 (although you can also use the URL directly in the formula, I will use the cell reference that has the URL instead).
Now before I use the formula, I need two things for it to work – the exact URL (or the cell reference that contains the URL), and an identifier within the HTML of that page that tells me that this is the title.
Let’s see how you can find this HTML identifier.
Open the webpage and when it’s fully loaded, hover the cursor over the headline, right-click and then click on Inspect.
This opens the inspect element window and you will see that the following HTML element gets highlighted
<h1 class=”fs-headline speakable-headline font-base font-size”>Google’s Machine Learning Is Making You More Effective In 2020</h1>
From this, we need an identifier that will tell us that it’s the title and that identifier is class=”fs-headline speakable-headline font-base font-size“
Now, we can use this in our formula and get the title of the URL.
Below is the formula that will scrape the title of the given URL:
=IMPORTXML(A2,"//*[@class='fs-headline speakable-headline font-base font-size']")
Similarly, if you want to get the name of the author of this article, you can use the same steps above to find out the HTML element that is a unique identifier for the author name and use the below formula:
Scrape the Recent Articles from a Feed
If you have the feed URL of a website, you can also use the IMPORTFEED formula to fetch all the recently published articles from the feed.
For example, below is an example of the feed from Tech Crunch
I can now use this feed to get the latest articles from TechCrunch.
Below is the formula that will fetch all the data from the feed:
Since TechCrunch has the top 20 articles in the feed, you will get information such as the title of the article, author, URL, date, summary, etc.
If you only want the titles of the recent articles, you can use the below formula:
and if you want the URLs, you can use the below formula:
You can use this to create your own feed list of the top blogs/sites you want to keep a track of. You can have the feeds of these sites in separate columns and use the formula to always have a list of the latest articles from the website.
This can also be a great technique to keep a track of what your competitors are posting.
Scraping Data From Amazon
Now I want to show you something amazing.
Using Google Sheets formulas, you can scrape data from Amazon such as the product title, the rating, the total number of reviews, etc.
Here is the URL of a product that I will be using – https://www.amazon.com/Microsoft-Ergonomic-Keyboard-Business-5KV-00001/dp/B00CYX26BC/
It’s a URL to an ergonomic keyboard.
Now let’s see the formulas you can use to scrape the data from Amazon using Google Sheets.
Since the identifier used on Amazon is ‘productTitle’, you can use the URL of any product page and this will scrape the product title.
Below is the formula that will give you the price of the product:
And the below formula will give you the number of customer ratings:
The concept used in these formulas is the same that I have already covered before.
You can try and scrape more data from Amazon or any website you like. Since there is no one size that fits all, you may have to do a little bit of trial and error before you can get the formula that works.
Limitations of Google Sheets as Scraper
As I mentioned earlier, Google Sheets is not built for scraping websites. Sure it has some awesome formulas that you can use for scraping, but there are a lot of limitations you need to know about.
For example, if you’re trying to scrape a couple of tables or a couple of HTML elements, it should be fine.
But if you ask Google Sheets to do the scraping for hundreds of data points at once, you may have to wait for a long time (minutes or even hours), or worse – see an error message because Google Sheets is not able to do the scraping.
In my experience, as soon as I try and scrape more than 50 data points in one go, I face issues. Sometimes, I enter the formula and nothing happens – it shows a blank.
I’m not sure exactly what’s going on in the backend, but I feel there is a daily limit to these formulas. Google Sheets allows you to scrape 50 or 100 data points, but once you go above that limit, it stops you from doing it for some time (maybe a few hours or a day)
So this is how you can use Google Sheets as a basic web scraper and fetch the data from websites.
I hope you found this tutorial useful!
Other Google Sheets tutorials you may also like: