Our Google Docs spreadsheet to bulk check numbers of Google results has proven to be one of our most popular free tools. So, we're going to share more of our custom Google spreadsheets to allow you to perform additional functions.
Today, we'll show you how to scrape the title and URL of the top 100 Google results for any search query. Results will look like this:
In a hurry? Grab the spreadsheet from the link below:
https://docs.google.com/spreadsheets/d/1QmlougcU8vk4fQUTmVmgMOmplKrjWq1XtgJDgMO5i_M/edit?usp=sharing
Remember to use File >> Make a copy to get your own version to work with.
Grabbing the top 100 results
Google allows URL parameters that modify various aspects of results returned, including one to provide 100 results. This parameter is "num", with a maximum value of 100. So, our desired URL is as below:
https://www.google.co.uk/search?q=[QUERY]&num=100
Note that this will not give exactly the same results as retrieving ten results page by page, because Google is more likely to "Group" multiple results together with smaller results pages. So, our top 100 will contain more repetition of results from the same sites.
However, Google often block queries from Google Docs, so minimising the number of requests we make is key to avoid frustrations like errors and no results. For this reason, it's ideal to make a single request to retrieve the entire top 100, even if this affects data quality slightly.
Xpath to scrape Google results
Next, we need to use the importxml function to grab the right section of a Google results page. This function returns the code from a remote URL matching an XPath query.
Google mark-up the clickable result links via an H3 heading with a class of "r". So, we can retrieve the URL of the result with a simple query:
//h3[@class='r']//a/@href
However, it would be ideal to grab the clickable text (the result title) as well which can be retrieved with the following query:
//h3[@class='r']
Ideally, we want to grab both items without the need to perform an extra query. Fortunately, XPath allows for multiple queries to be joined together, like so:
/h3[@class='r']//a/@href | //h3[@class='r']
This allows us to retrieve both items with a single request:
=IMPORTXML("https://www.google.com/search?q=[QUERY]num=100&","//h3[@class='r']//a/@href | //h3[@class='r']")
Tidying up
Because we're making two queries at once, Google will put the data into two separate rows. This makes the results display a little awkward, because we'd ideally see the title and URL in adjacent columns. This is fixed via hiding our actual results and then using the OFFSET function to reference the correct cells:
=OFFSET(B4, +C2, 0)
Google uses tracking URLs in results (url=?...&sa=) which we do not want to appear in our spreadsheet. So, we use REGEXREPLACE to tidy up the display:
=REGEXREPLACE(RESULTURL, "/url\?q=(.*)&sa.*", "$1")
Finally, to make the spreadsheet a little easier for users, we moved the actual Google domain queried into a separate 'configuration' worksheet, and also allowed for additional custom parameters in case people want to restrict results to particular countries or similar.
So, there we have it - scrape the top 100 Google results using Google Docs. Get your spreadsheet here; we hope you find it useful, and let us know any questions or problems in the comments below.
3 Comments - Scrape top 100 search results with Google Docs
Mark Webster (not verified) March 06, 2017
ReplyGreat tool. Started using this last week. Is there some kind of limit on how many queries can be used per day from a paid google apps account? Today, it seemed to stop working, despite changing the Google domain.
Andy Langton March 06, 2017
ReplyHi Mark,
Unfortunately, Google periodically block "themselves" from accessing search results. What happens is the IP of Google Sheets gets sent a CAPTCHA/I'm sorry page from Google. Of course, there's no way to complete the CAPTCHA to access the results, so no results will be returned.
While the spreadsheet method is ideal for a very quick analysis, unfortunately, it has reliability issues which can't be fixed within Google sheets. The only way around it while still using Google sheets would be to use XPATH on a proxy/rotating proxies. By this point, the tech investment means you may as well use something custom rather than a spreadsheet.
Anonymous (not verified) April 27, 2017
Replywhat am i supposed to do to get this work? I made a copy of the sheet, and now what?
Leave a comment