When SharePoint is used for public-facing websites, there are many files and locations that should not be crawled by search engines. Most search engines respect the rules defined in a special file called robots.txt, which identifies the areas of a site that should not be crawled.
Search engines expect to find the robots.txt file at the root of the site, e.g. https://blog.eardley.org.uk/robots.txt
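As a quick check that the file is reachable from the site root, it can be requested directly. The sketch below uses Python's standard library only; the site address is a placeholder and should be replaced with the public-facing SharePoint URL.

from urllib.request import urlopen

# Placeholder address; replace with the public-facing SharePoint site root.
site_root = "https://www.example.com"

with urlopen(site_root + "/robots.txt") as response:
    # A 200 status means crawlers can retrieve the file anonymously.
    print(response.status)
    print(response.read().decode("utf-8"))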
When a robots.txt file is defined for SharePoint, there are several locations that should be excluded because they require authentication to be accessed. An example robots.txt file for SharePoint is as follows:
User-Agent: *
Disallow: /_Layouts/
Disallow: /SiteAssets/
Disallow: /Lists/
Disallow: /_catalogs/
Disallow: /WorkflowTasks/
The content of the file is made up of two types of directive:
- User-Agent – identifies the crawlers (browsers/engines) that the following rules apply to; * means all crawlers
- Disallow – each Disallow line defines a location that search engines should not crawl or index (see the example below the list)
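As a rough illustration of how a crawler interprets these rules, the sketch below feeds the example file into Python's standard urllib.robotparser module and checks a couple of paths; the paths tested are purely illustrative.

from urllib import robotparser

# The example robots.txt rules from above.
rules = [
    "User-Agent: *",
    "Disallow: /_Layouts/",
    "Disallow: /SiteAssets/",
    "Disallow: /Lists/",
    "Disallow: /_catalogs/",
    "Disallow: /WorkflowTasks/",
]

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Paths under a Disallow entry are blocked for all crawlers ("*").
print(parser.can_fetch("*", "/_Layouts/settings.aspx"))  # False
print(parser.can_fetch("*", "/Pages/default.aspx"))       # True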
Further information regarding robots.txt can be found at http://www.robotstxt.org/orig.html