Our Google Docs spreadsheet to bulk check numbers of Google results has proven to be one of our most popular free tools. So, we're going to share more of our custom Google spreadsheets to allow you to perform additional functions.
Today, we'll show you how to scrape the title and URL of the top 100 Google results for any search query. Results will look like this:
In a hurry? Grab the spreadsheet from the link below:
Remember to use File >> Make a copy to get your own version to work with.
Grabbing the top 100 results
Google allows URL parameters that modify various aspects of results returned, including one to provide 100 results. This parameter is "num", with a maximum value of 100. So, our desired URL is as below:
Note that this will not give exactly the same results as retrieving ten results page by page, because Google is more likely to "Group" multiple results together with smaller results pages. So, our top 100 will contain more repetition of results from the same sites.
However, Google often block queries from Google Docs, so minimising the number of requests we make is key to avoid frustrations like errors and no results. For this reason, it's ideal to make a single request to retrieve the entire top 100, even if this affects data quality slightly.
Xpath to scrape Google results
Google mark-up the clickable result links via an H3 heading with a class of "r". So, we can retrieve the URL of the result with a simple query:
However, it would be ideal to grab the clickable text (the result title) as well which can be retrieved with the following query:
Ideally, we want to grab both items without the need to perform an extra query. Fortunately, XPath allows for multiple queries to be joined together, like so:
/h3[@class='r']//a/@href | //h3[@class='r']
This allows us to retrieve both items with a single request:
=IMPORTXML("https://www.google.com/search?q=[QUERY]num=100&","//h3[@class='r']//a/@href | //h3[@class='r']")
Because we're making two queries at once, Google will put the data into two separate rows. This makes the results display a little awkward, because we'd ideally see the title and URL in adjacent columns. This is fixed via hiding our actual results and then using the OFFSET function to reference the correct cells:
=OFFSET(B4, +C2, 0)
Google uses tracking URLs in results (url=?...&sa=) which we do not want to appear in our spreadsheet. So, we use REGEXREPLACE to tidy up the display:
=REGEXREPLACE(RESULTURL, "/url\?q=(.*)&sa.*", "$1")
Finally, to make the spreadsheet a little easier for users, we moved the actual Google domain queried into a separate 'configuration' worksheet, and also allowed for additional custom parameters in case people want to restrict results to particular countries or similar.
So, there we have it - scrape the top 100 Google results using Google Docs. Get your spreadsheet here; we hope you find it useful, and let us know any questions or problems in the comments below.