Fixing “Resources Blocked By robots.txt” Failures
Our earlier posts addressed preparing your website for mobile friendliness. We reviewed the background to mobile friendliness and the Google mobile-friendly test tool, and set out solutions to four of the five problem categories that the tool highlights.
We now turn our attention to the robots.txt-blocking problem, which is easy to understand, but potentially very hard to solve.
Resources Blocked By robots.txt
Robots.txt is an optional file placed on a website to “help” the site owner control what search engines index. The file simply specifies a set of files or folders (directories) that should not be indexed. We say “should” because search engines may choose to ignore the robots.txt instructions as they index.
The advantage of the robots.txt file is that, if respected, a site owner can avoid engines indexing files that only exist to make the site work, as opposed to files that contain content. This situation is particularly relevant for sites using Content Management Systems or other systems that generate content dynamically. These sites use tens, hundreds or even thousands of individual script files that contain no content, but assemble content and apply formatting and layout when a page is accessed.
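For illustration only, a robots.txt for such a site might look something like this (the directory names are hypothetical examples, not a recommendation):

    User-agent: *
    Disallow: /scripts/
    Disallow: /includes/
    Disallow: /admin/

Each Disallow line asks crawlers to stay out of the named folder; anything not listed remains open to indexing.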
The main disadvantages of robots.txt are that its behaviour is not defined by an independent standards authority, compliance is voluntary, the control options within the file are limited, and the file can reveal details of a site’s internal workings to those with malicious intent.
The solution appears simple: alter robots.txt to allow access to these files. Therein lies the issue: “Allow” is not a universally agreed directive for robots.txt, even though Google does accept and act on it. Common practice is to list only the folders that should not be indexed, rather than individual files. An alternative would be to list every file that should not be indexed in robots.txt instead. Now, however, you are telling those with malicious intent exactly what is on your site, as well as creating a huge robots.txt file that will be difficult to maintain. By the way, Google’s own robots.txt is currently about 8 KB, and outside researchers have used it in the past to look for “secret” Google projects that had not been announced to the public.
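As a sketch of that first approach, and assuming a hypothetical /includes/ folder holding stylesheets and scripts, Google-specific Allow rules with wildcards (which Google recognises, although other crawlers may not) could unblock just the files the mobile-friendly test needs:

    User-agent: Googlebot
    Allow: /includes/*.css
    Allow: /includes/*.js
    Disallow: /includes/

Google resolves conflicts by preferring the most specific matching rule, so the Allow lines win for .css and .js files while the rest of the folder stays blocked; crawlers that ignore Allow or wildcards simply fall back to the Disallow.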
A further option is to relocate the blocked files (typically the stylesheets, scripts and images that search engines need to render your pages) into a single folder that robots.txt does not block. Despite the complexity of relocation, it is the solution we recommend, as it results in a smaller, more manageable robots.txt file than the other solutions and should allow you to manage caching more easily, since cache-suitable files are located in one place. The topic of caching, however, is beyond the scope of this post.
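As an illustration of the relocation approach, if the presentation files search engines need were gathered under a single, hypothetical /assets/ folder, robots.txt could stay short and unambiguous:

    User-agent: *
    Disallow: /scripts/
    Disallow: /admin/
    # /assets/ is deliberately not listed, so the CSS, JavaScript and images it holds stay crawlable

Nothing about the assets folder needs to appear in the file at all, which is exactly what keeps it small and easy to maintain.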
To read more about using robots.txt for higher education websites, see our post: Let HigherEd Website Visitors Find The Good Stuff: Use robots.txt
Don’t have accurate and current information on all the websites you own? Not able to monitor and check each website’s content quality and risk status? Let’s talk about how we can help.