Let The Bots Do Their Work
A clear plan should guide all of the actions taken to bring visitors to a website. In practice, some of those actions take place in the clear, while others operate more in the background.
The next two posts address behind-the-scenes steps that improve website indexing and the probability that site visitors can quickly find relevant content. In this post we explain using a so-called robots.txt file to give search engines directives about indexing a website. In the following post, we discuss using search engine readable site maps to further assist visitors in finding relevant content.
Even if search referrals are not the main source of site traffic, a small investment in understanding how to use a robots.txt file provides long-term payoffs in effective site indexing and an enhanced visitor experience. Google Analytics and similar web analytics services can identify the historical proportion of a site’s traffic referred by search engines and pinpoint the most relevant search engines.
For university and college websites, particularly those looking to attract overseas students, four search engines or indexing crawlers are likely to be relevant: Google, Bing, Baidu and Yandex.
Directives placed in a robots.txt file give Google, Bing and other search engines detailed instructions about what to index on a site and what not to index. In other words, search engines can be directed to index relevant content and ignore ‘less relevant content’.
Let’s parse ‘less relevant content’. Newsletters and calendars from 1999 are less relevant to most site visitors than this year’s versions. So are files used to operate the site or access your content management system, and some dynamically generated website pages.
Why not direct search engines to the good stuff and ignore the less relevant? The content is still available, as are all the links, so visitors can still access it. The content is just less likely to clutter up search results and search engines don’t waste time indexing content of low potential value to site visitors.
The mechanism for directing search engines is to place a set of instructions in a robots.txt file stored in a website’s top-level or root directory.
The balance of this guide explains how robots.txt works, provides clarification of some common misconceptions about robots.txt and describes what we find, in the wild, at university and college websites.
robots.txt or no robots.txt
Without a robots.txt file, indexing crawlers will visit every page and follow every link on a site and use the underlying indexing algorithms to determine what results to present in search results. That approach is not necessarily a bad thing, at all. Why? Because crawlers do two things:
- They recursively follow website URLs (links) and the content at those links that is accessible by a browser. If no robots.txt file is present every link is accessed.
What’s the downside of not having a robots.txt file? Three things.
First, there are many directories or folders in a website containing files that are not relevant to browser access and there is no compelling reason to index these files. There is also material that becomes less relevant over time, but may remain on a site for regulatory or other reasons: course and academic calendars, class schedules and the like. Site visitors are better served by being directed to the current material than navigating through current and old.
The second reason is that the robots.txt file can be used to tell search engines where to find the relevant XML-format sitemap or sitemaps.
Finally, a robots.txt file can be used to block crawlers that you don’t wish to access your site. However, as complying with robots.txt directives is voluntary, malicious ‘bots’ will likely ignore any directives.
In preparing this guide we reviewed the main or gateway domains of about 200 (n=206) university and college websites belonging to Canadian higher education institutions to understand current practice. We’ll discuss our findings a little later, but about 19% of sites (18.9% or 39/206) do not use a robots.txt file. And there is no harm done.
Controlling Where Search Engine Bots Crawl
If you want to control search engine indexing you can do so through robots.txt file directives. Crawlers interrogate the robots.txt file to determine any constraints on their activity. Directives on each line or record within the file provide instructions to the crawler.
Google, Bing, Baidu and Yandex recognise four field elements with the following structure:
|user-agent||:||[value]||#||optional comment||user-agent = crawler accessing a site|
|allow||:||[path]||#||optional comment||allow directive permits access|
|disallow||:||[path]||#||optional comment||disallow directive prohibits access|
|sitemap||:||[URL]||#||optional comment||Directs crawler to find an XML file at the specified URL. This can be on another server if needed.|
The fields can be organised into groups, sorted by user agent for as many individual user-agents (e.g. Googlebot, Bingbot, Baiduspider, YandexBot etc.) as needed.
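To make the record structure concrete, here is a minimal sketch that splits each line into field, value and optional comment, and then collects allow and disallow rules under the user-agent they belong to. The function names parse_record and group_by_user_agent and the sample file are our own illustration, not part of any crawler; the sketch assumes that a user-agent line appearing after a rule starts a new group.

```python
from typing import Dict, List, Tuple

def parse_record(line: str) -> Tuple[str, str, str]:
    """Split one record into (field, value, comment); empty strings if absent."""
    body, _, comment = line.partition("#")
    field, sep, value = body.partition(":")
    if not sep:  # blank line or comment-only line
        return "", "", comment.strip()
    return field.strip().lower(), value.strip(), comment.strip()

def group_by_user_agent(text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Collect (directive, path) pairs under each user-agent value."""
    groups: Dict[str, List[Tuple[str, str]]] = {}
    current: List[str] = []  # user-agents of the group being read
    in_rules = False         # True once the group has at least one rule
    for line in text.splitlines():
        field, value, _ = parse_record(line)
        if field == "user-agent":
            if in_rules:     # assumption: a new user-agent after rules starts a new group
                current = []
                in_rules = False
            current.append(value)
            groups.setdefault(value, [])
        elif field in ("allow", "disallow") and current:
            in_rules = True
            for agent in current:
                groups[agent].append((field, value))
    return groups

# Hypothetical file used only for illustration
SAMPLE = """\
User-agent: Googlebot # one specific crawler
Disallow: /drafts/

User-agent: *
Disallow: /archives/ # old stuff
Allow: /archives/current/
"""

for agent, rules in group_by_user_agent(SAMPLE).items():
    print(agent, rules)
```

Sitemap records are deliberately left out of the grouping here, since they apply to the file as a whole rather than to one user-agent.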
There is no limit to the number of directives or records that the robots.txt file can contain, but Google ignores any robots.txt content after the first 500KB: roughly equivalent to 9,250 or more records. Yandex imposes a smaller file limit of 32KB and assumes if a file is larger than the limit, everything is allowed. In our survey, we found no robots.txt file larger than 7KB.
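As a quick sanity check, a file's size can be compared against the limits quoted above. This is a sketch only: the constants simply restate the figures in this guide (we assume 1KB = 1,024 bytes), the function name size_report is our own, and the limits should be re-checked against each search engine's current documentation.

```python
# Limits as quoted in this guide; both are assumptions to re-verify before relying on them.
GOOGLE_LIMIT_BYTES = 500 * 1024
YANDEX_LIMIT_BYTES = 32 * 1024

def size_report(robots_text: str) -> dict:
    """Report the encoded size of a robots.txt body against the quoted limits."""
    size = len(robots_text.encode("utf-8"))
    return {
        "bytes": size,
        "within_google_limit": size <= GOOGLE_LIMIT_BYTES,
        "within_yandex_limit": size <= YANDEX_LIMIT_BYTES,
    }

sample = "User-agent: *\nDisallow: /archives/\n"
print(size_report(sample))
```

At roughly 55 bytes per record (500KB divided by 9,250 records), a typical file of a few dozen directives is far below either limit.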
[value] is either the name of a specific crawler, e.g. Googlebot or Bingbot, or the wildcard ‘*’ to denote all crawlers. Most higher education robots.txt files permit all crawlers to have site access.
[path] is a location relative to the robots.txt file. As a result, / indicates the top-most or root directory or folder. Directories or files located lower down in the hierarchy can be specified by their position relative to the top-most folder.
Be attentive to spelling as [path] can be case sensitive, depending upon a server and its configuration. Further, if a server is 'case-sensitive' and content assumes it is not, this will result in broken links (404 errors) and robots.txt directives may not have their intended effect.
[URL] is a complete URL, rather than a relative location, that tells the crawler where to find any sitemaps. In principle, the sitemaps could be located at a different domain; in practice XML-format sitemaps are usually placed in the root directory.
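For illustration, Python's standard library can read sitemap records out of a robots.txt file. The sketch below uses urllib.robotparser (its site_maps() method needs Python 3.8 or later); the file content and the sitemap address are hypothetical, and the helper name sitemap_urls is our own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied inline so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /archives/
Sitemap: http://www.exampleu.ca/sitemap.xml
"""

def sitemap_urls(robots_text: str) -> list:
    """Return the sitemap URLs listed in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.site_maps() or []  # site_maps() returns None when nothing is listed

print(sitemap_urls(ROBOTS_TXT))
```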
Putting this all together a ‘complete’ robots.txt might look like this:
# This file lists local URLs that well-behaved robots should ignore
User-agent: * # applies to all crawlers
Disallow: /registrar/archives # old stuff
Disallow: /art/culture/ # old stuff
Disallow: /education/coursework/ # old stuff
Disallow: /events/day.php # search engines only need one calendar view, so hide the rest
The #'s delineate comments to be ignored by a crawler, but inserted for readability. Crawlers ignore blank lines, but these also improve readability.
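One way to check what a file like this does is to feed it to a parser and ask about individual URLs. The sketch below uses Python's standard urllib.robotparser. It assumes the directives sit under a User-agent: * line, and the host www.exampleu.ca, the crawler name ExampleBot and the test paths are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The sample file, with a User-agent line assumed at the head of the group.
ROBOTS_TXT = """\
# This file lists local URLs that well-behaved robots should ignore
User-agent: *
Disallow: /registrar/archives # old stuff
Disallow: /art/culture/ # old stuff
Disallow: /education/coursework/ # old stuff
Disallow: /events/day.php # search engines only need one calendar view
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # comments and blank lines are skipped

def allowed(path: str) -> bool:
    """Ask whether a hypothetical crawler may fetch a path on a hypothetical host."""
    return parser.can_fetch("ExampleBot", "http://www.exampleu.ca" + path)

for path in ("/registrar/archives/2001.html", "/events/day.php",
             "/events/month.php", "/admissions/"):
    print(path, "allowed" if allowed(path) else "blocked")
```

Because the parsing happens locally, a draft robots.txt can be tested this way before it is published.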
robots.txt files can contain multiple directives. Attempts to include some directories for crawling while excluding others can create conflicting instructions. To resolve this, crawlers process directives based on precedence. The principle is that the most specific rule takes precedence and other directives are ignored. For example:
disallow: / #disallow all indexing of the site
allow: /physics #allow indexing of the directory physics and all of its sub-directories and their contents.
A crawler encounters the directory http://www.exampleu.ca/physics. As the allow rule is more specific than the disallow rule, it takes precedence and the directory will be indexed.
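The precedence principle can be sketched in a few lines. This is an illustration under stated assumptions, not any crawler's actual code: we measure specificity by the length of the matching path prefix, let allow win a tie, and ignore wildcard patterns. The function name is_allowed and the rule tuples are our own.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("disallow", "/")

def is_allowed(rules: List[Rule], url_path: str) -> bool:
    """Apply the most specific (longest) matching rule; no match means allowed."""
    best_len = -1
    allowed = True
    for directive, path in rules:
        if not path:  # an empty path matches nothing
            continue
        if url_path.startswith(path):
            longer = len(path) > best_len
            tie_for_allow = len(path) == best_len and directive == "allow"
            if longer or tie_for_allow:
                best_len = len(path)
                allowed = directive == "allow"
    return allowed

rules = [("disallow", "/"), ("allow", "/physics")]
print(is_allowed(rules, "/physics/courses.html"))
print(is_allowed(rules, "/chemistry/"))
```

With these two rules, anything under /physics matches both directives, and the longer allow path decides the outcome; every other path matches only the disallow rule.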
robots.txt File Location
In order to direct the crawler in the intended manner a robots.txt file must be located in the top-most or root directory for a specific host, protocol and port number. To explain:
Crawlers see http://example.edu/ and http://cs.example.edu/ as two different hosts or domains. Placing a robots.txt file at http://example.edu/robots.txt will have no effect on the http://cs.example.edu/ domain. If you don’t want to direct how http://cs.example.edu/ is crawled, no harm done. If you do want to direct activities, you need to place a separate (but possibly identical) robots.txt file at http://cs.example.edu/robots.txt.
Crawlers view http://example.ac.uk/, https://example.ac.uk and ftp://example.ac.uk as three different protocols (which they are). If those protocols use the standard ports (80, 443 and 21, respectively), *and* the resulting host and content are one and the same, then only one robots.txt file is required. If, however, a non-standard port is used, then the robots.txt file accessed this way would only apply to that service and thus, the others would need a separate applicable robots.txt file placed in each of the root directories.
The issue of locating the robots.txt file is particularly important for higher education websites. Typical university or college websites are structured as federations of sub-sites: some of these sit on distinct sub-domains, others within sub-directories. In the former case, a separate robots.txt file is needed for each sub-domain; in the latter case, a robots.txt file in the root directory is the only way to direct the desired crawling behaviour.
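The location rule can be expressed as a small helper: given any page URL, the applicable robots.txt lives at the root of the same protocol, host and port. A minimal sketch using Python's standard urllib.parse; the function name robots_url is our own and the example addresses are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt address that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and the host[:port]; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

for url in ("http://example.edu/admissions/apply.html",
            "http://cs.example.edu/courses/101.html",
            "https://example.ac.uk:8443/library/search?q=maps"):
    print(url, "->", robots_url(url))
```

A sub-directory such as /admissions/ maps back to the root file, while a sub-domain or a non-standard port maps to a different file altogether.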
Processing Issues for robots.txt
Crawlers attempt to fetch the robots.txt file from its expected location or establish that a valid file does not exist. And crawlers pay attention to the response codes received from the attempt and may modify their behaviour. We have summarised the potential responses in the table.
|Response||2XX Success||3XX Redirection||4XX Not found||5XX Server error|
|Comment||Specific processing depends on the robots.txt content||If redirection results in a 2XX, processing is as described in the 2XX column; otherwise processing is as described in the 4XX column||Assumes there is no robots.txt, so all files will be crawled||Assumes a temporary error, during which no files will be crawled|
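The table can be restated as a simple decision function. This is a sketch of the behaviour summarised above, not the logic of any particular crawler; the function name crawl_policy and the returned labels are our own.

```python
def crawl_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt request to the crawler behaviour in the table."""
    if 200 <= status_code < 300:
        return "apply the directives in the file"
    if 300 <= status_code < 400:
        return "follow the redirect, then treat the result as 2XX or 4XX"
    if 400 <= status_code < 500:
        return "assume no robots.txt: crawl everything"
    if 500 <= status_code < 600:
        return "assume a temporary error: crawl nothing for now"
    raise ValueError(f"unexpected HTTP status code: {status_code}")

for code in (200, 301, 404, 503):
    print(code, "->", crawl_policy(code))
```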
What Do We Find in Practice?
We surveyed just over 200 (n=206) Canadian university and college websites and examined the robots.txt files located in the top-level directory of the gateway domain.
Thirty-nine (39) sites (18.9%) did not have a robots.txt file. As we stated up front, this simply means that these sites are crawled in their entirety, save for any pages that have page-specific meta tags specifying that the page is not to be indexed or followed. In a subsequent blog post we will examine current higher education sitemap practices to see if there is a correlation between the absence of a robots.txt file and the absence or presence of an up-to-date XML sitemap.
The balance of 167 sites (81.1%) can be divided into three different robots.txt formulations as follows:
Formulation 1 – the robots.txt file directives are structured in one of two main ways:
user-agent: *
allow: /
or
user-agent: *
disallow:
The two approaches are functionally equivalent to each other and to having no robots.txt file at all. These configurations occur about 5% of the time.
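The equivalence is easy to confirm with a parser. The sketch below feeds both alternatives, plus an empty file standing in for "no robots.txt", to Python's standard urllib.robotparser and checks that every test path is allowed. The helper name allows_everything, the crawler name ExampleBot, the host and the paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "user-agent: *\nallow: /\n"
DISALLOW_NOTHING = "user-agent: *\ndisallow:\n"
EMPTY_FILE = ""  # stands in for a site with no robots.txt directives

TEST_PATHS = ("/", "/admissions/", "/registrar/archives/1999.html")

def allows_everything(robots_text: str) -> bool:
    """True if a hypothetical crawler may fetch every test path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return all(
        parser.can_fetch("ExampleBot", "http://www.exampleu.ca" + path)
        for path in TEST_PATHS
    )

for name, text in (("allow: /", ALLOW_ALL),
                   ("disallow:", DISALLOW_NOTHING),
                   ("empty file", EMPTY_FILE)):
    print(name, "->", allows_everything(text))
```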
Formulation 2 – the robots.txt file directives are structured in one of two alternative ways:
user-agent: *
disallow:
disallow: [path]
or
user-agent: *
allow: /
disallow: [path]
The two approaches are functionally equivalent to each other. The same result (disallowing nothing, or allowing everything, and then disallowing a specific location) could be achieved by including only a disallow: [path] directive. These configurations occur 32% of the time.
Formulation 3 – the robots.txt file directives are structured as
user-agent: *
disallow: [path]
This configuration occurs 63% of the time and, in our view, is the configuration least prone to confusion. About ten percent of sites also include a directive to indicate the specific location at which a sitemap or sitemaps can be found and all of these use Formulation 3 for their robots.txt structure.
It is perfectly OK to not have a robots.txt file: this approach simply results in all directories on a website being indexed. On the other hand, it is very straightforward to construct a robots.txt file that carefully segregates relevant content for indexing and places less relevant content in directories that will be ignored. Moreover, a robots.txt file can also specify the location for a sitemap or sitemaps that can further improve indexing efficiency and thus the ability of site visitors to find the good stuff.