It is important to keep track of the URLs that Google and other search engines have in their indexes, so that removed pages can be correctly redirected to appropriate, relevant content.
Implementing 301 redirects is a must for every website: it preserves the value of external links and social shares pointing at removed content, reduces 404 errors and prevents the loss of visitors. This is why the methods detailed in this post are needed to identify, monitor and maintain the pages indexed by search engines.
The simplest way to identify the pages and URLs that a search engine such as Google or Bing has indexed for your domain is to use the following search operator.
site:yourdomain.com
Please replace ‘yourdomain.com’ with the domain that you wish to return results for.
You can also prefix a subdomain to restrict the results to that subdomain. The example below would return results for the specified subdomain only.
site:subdomain.yourdomain.com
This search operator queries the search engine’s index and returns all of the specified domain’s URLs in the search results.
Now that we know which URLs are indexed for your chosen domain, it is useful to export this data into a document so that it can be worked with.
The simplest method is to use XPath syntax to pull the data into a Google Drive (formerly Google Docs) spreadsheet, which can be done by following the steps below.
First, sign in to Google Drive and create a new spreadsheet.
Second, in the first cell of the spreadsheet (A1), enter the following formula, replacing ‘yourdomain.com’ with your domain name.
=importXml("https://www.google.com/search?q=site:yourdomain.com&num=100&start=1", "//h3/a/@href")
If the above formula returns an error such as #N/A or a message that Google could not retrieve the URL, adjust it to use an http:// lookup rather than an https:// lookup, as below.
=importXml("http://www.google.com/search?q=site:yourdomain.com&num=100&start=1", "//h3/a/@href")
This will return the first 100 results in Google’s index for your specified domain. If you have more than 100 URLs indexed, enter the amended formula in the first cell of row 101 (A101), replacing ‘start=1’ with ‘start=101’.
=importXml("https://www.google.com/search?q=site:yourdomain.com&num=100&start=101", "//h3/a/@href")
As before, if this formula returns an error, adjust it to use an http:// lookup rather than an https:// lookup:
=importXml("http://www.google.com/search?q=site:yourdomain.com&num=100&start=101", "//h3/a/@href")
Repeat this step as many times as required to pull out all of your site’s indexed URLs, incrementing the ‘start’ value by 100 each time. Please be aware that there is a limit of 1,000 URLs that can be imported.
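If you would rather script these pulls than manage the formulas by hand, the same approach can be reproduced in Python. The sketch below is an assumption-laden alternative, not part of the original walkthrough: it requires the third-party requests and lxml libraries, runs the same //h3/a/@href XPath over each page of results, and Google may throttle or block automated queries, so the spreadsheet method is generally the more reliable route.

import requests
from lxml import html

DOMAIN = "yourdomain.com"  # replace with your domain

def fetch_indexed_urls(domain, max_results=1000):
    # Mirror the spreadsheet formulas: 100 results per page,
    # start=1, start=101, start=201 and so on, up to the 1,000 cap.
    urls = []
    for start in range(1, max_results + 1, 100):
        resp = requests.get(
            "https://www.google.com/search",
            params={"q": "site:" + domain, "num": 100, "start": start},
            headers={"User-Agent": "Mozilla/5.0"},  # bare scripted requests are often rejected
            timeout=10,
        )
        resp.raise_for_status()
        # The same XPath used by importXml: result links inside <h3> headings.
        hrefs = html.fromstring(resp.text).xpath("//h3/a/@href")
        if not hrefs:
            break  # no further results pages
        urls.extend(hrefs)
    return urls

if __name__ == "__main__":
    for url in fetch_indexed_urls(DOMAIN):
        print(url)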
As you may have noticed, the returned URLs have various query strings appended to them, so the final step is to clean these up and leave a tidy list of pages. To do this, paste the following formula into the first cell of the second column (B1).
=mid(A1,search("?q=",A1)+3,search("&sa=",A1)-(search("?q=",A1)+3))
This formula extracts the text between ‘?q=’ and ‘&sa=’ in Google’s redirect link, leaving the full URL of the indexed page; for example, ‘/url?q=https://yourdomain.com/page&sa=U’ becomes ‘https://yourdomain.com/page’. To apply the formula to all of the other rows in the spreadsheet, click the small blue square in the corner of the highlighted cell (B1) and drag it down to the last row that contains data.
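The same cleanup can also be done outside the spreadsheet. Below is a minimal sketch, assuming only Python’s standard urllib.parse module, that pulls the destination URL out of Google’s /url?q=…&sa=… redirect links.

from urllib.parse import urlparse, parse_qs

def clean_result_link(href):
    # Return the q parameter (the destination URL) from a Google
    # redirect link, or the href unchanged if no q parameter exists.
    query = parse_qs(urlparse(href).query)
    return query.get("q", [href])[0]

print(clean_result_link("/url?q=https://yourdomain.com/page&sa=U&ei=abc"))
# -> https://yourdomain.com/page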
You should now have a spreadsheet containing all of your chosen domain’s URLs that Google has in its index.
With this data in a spreadsheet, it is much easier to test URLs in a browser and discover broken links that can then be pointed at relevant content with 301 redirects.
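Testing every URL by hand quickly becomes tedious on a larger site. Below is a minimal sketch, assuming the requests library and that the cleaned URLs have been exported from the spreadsheet to a file named indexed_urls.txt (a hypothetical filename, one URL per line), which reports each URL’s HTTP status and flags 404s as redirect candidates.

import requests

with open("indexed_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    try:
        # HEAD keeps the check lightweight; follow redirects so an
        # already-redirected URL reports its final status code.
        status = requests.head(url, allow_redirects=True, timeout=10).status_code
    except requests.RequestException as exc:
        print(f"{url} -> request failed ({exc})")
        continue
    flag = "  <- candidate for a 301 redirect" if status == 404 else ""
    print(f"{url} -> {status}{flag}")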