Making Sure Your Audience Can Find Content: XML Sitemaps
In our last post we explained how robots.txt files provide information to Google, Bing and other search engines about which content to index and which to ignore. We also pointed out that robots.txt files can include directives telling search engines where to find the related sitemap.xml file.
In this post we review the role of XML sitemaps: how they tell Google and Bing about the relative importance and update frequency of site content, and how they help search engines understand harder-to-parse content such as video or audio files.
We reviewed just over 200 Canadian university and college websites and found that more than 60% do not have an XML sitemap in the top-level directory. These sites are missing an opportunity to help search engines index key content (and so improve SEO) and, in turn, to improve the experience of their visitors.
For an even deeper dive into sitemaps than this post provides, take the red pill over to Dynomapper’s blog maintained by Garenne Bigby. It is sitemap central. For extra points you could also read XML Sitemaps: Guidelines on Their Use or XML Sitemaps 101.
What are XML Sitemaps?
Search engine indexing works by crawling all the links on a website and either parsing the HTML-formatted content on each page or rendering the JavaScript-generated content. In either situation, crawlers may encounter errors or simply fail to find all the pages on a site.
Rather than leave completion to chance, XML sitemaps provide search engine crawlers (Googlebot/Bingbot) with structured guidance about site content, its relative importance and its update frequency. Search engine providers also offer corresponding facilities to submit and test XML sitemaps to ensure that the instructions work as intended.
The most frequently encountered type of sitemap is an XML file, formatted according to the sitemaps.org protocol and stored in a website’s root (top-most) directory. A sitemap.xml file is not required, in the same way that a robots.txt file is not compulsory: having one is simply helpful, and no penalty is incurred either way.
Search engines will successfully index relatively small websites (say, fewer than 50 pages) with well-structured and largely static content. Larger websites (more than 300 pages) with regularly updated content and a mix of text, images and video are a different matter, and many college and university department websites fall into this ‘larger’ category. For these sites, search engine crawling, and search engine optimization, will be more effective when supported by an XML sitemap.
In fact, for complex content mixes, sites can use different types of sitemaps to ensure that each type of content is indexed and available to be found through search. Google supports extensions to the basic XML sitemap schema for:
- Video content, to allow Google’s indexing software to better understand the nature of the content, including data about titles, descriptions and duration (a short example follows this list);
- Image content on a site, both images accessible on HTML pages and images that may only be reached through JavaScript;
- News content that should appear in Google News, which can be identified in a Google News sitemap focused on articles and crawled more frequently.
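As a sketch of the video extension, a video entry adds Google’s video namespace to a standard sitemap URL. The URLs, titles and duration below are hypothetical, and Google’s documentation lists further optional tags:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>https://www.exampleu.ca/convocation-2018/</loc>
    <video:video>
      <video:thumbnail_loc>https://www.exampleu.ca/img/convocation.jpg</video:thumbnail_loc>
      <video:title>Convocation 2018 highlights</video:title>
      <video:description>Highlights from the 2018 convocation ceremony.</video:description>
      <video:content_loc>https://www.exampleu.ca/media/convocation-2018.mp4</video:content_loc>
      <!-- duration in seconds -->
      <video:duration>600</video:duration>
    </video:video>
  </url>
</urlset>
```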
To be clear, sitemaps only facilitate search engines in finding content. A robots.txt file or the robots meta tag can be used to exclude specific pages or sections of a site from indexing.
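For illustration, a robots.txt file for a hypothetical exampleu.ca site might combine an exclusion rule with a sitemap directive (the domain and paths are invented for this sketch):

```text
# Hypothetical robots.txt for https://www.exampleu.ca
User-agent: *
Disallow: /intranet/

# Point crawlers at the XML sitemap
Sitemap: https://www.exampleu.ca/sitemap.xml
```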
How do XML Sitemaps work?
Sitemaps provide Google, Bing, Baidu and Yandex with structured data about the URLs included in the sitemap. While search engines support a number of different sitemap formats, XML is the most frequently encountered file format and we will limit our discussion to XML sitemap files.
The file structure is set out in the sitemaps.org XML schema for the Sitemap Protocol. Three of the protocol’s optional tags are worth knowing about from a website maintenance perspective (an illustrative entry follows this list):
- lastmod: this tag records the last modified date for the content referenced by a URL. There is some discussion about whether Googlebot uses this optional tag, but search engines can use the date to understand how current the content is.
- changefreq: this tag indicates how frequently the content at a URL changes, ranging from always to never. Our research suggests that the weight Google places on this setting when scheduling crawls is limited, but there is no penalty for setting the value appropriately.
- priority: this tag indicates the relative importance assigned to the content at a URL, ranging from 0.0 to 1.0, with 0.5 as the default. As an optional tag, it is reasonable to assume that search engines may use this setting to understand which content a site owner views as most important. Setting the home page to 1.0 and other key pages to, say, 0.8 at least alerts Google to the relative importance of pages on a site.
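To make the three tags concrete, here is a minimal, hypothetical sitemap entry (the URLs, dates and values are invented; sitemaps.org documents the full protocol):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Home page: highest priority, frequently updated -->
  <url>
    <loc>https://www.exampleu.ca/</loc>
    <lastmod>2018-09-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <!-- Admissions page: important, updated less often -->
  <url>
    <loc>https://www.exampleu.ca/admissions/</loc>
    <lastmod>2018-08-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```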
An XML sitemap needs to contain all of a site’s URLs, with two caveats.
First, the sitemap can only include URLs that crawlers can actually fetch. If fetching a page generates anything other than a so-called 200 response from the server, it should not be included. As discussed below, XML sitemap generation software checks for these types of errors.
Second, when sites have duplicate pages or pages with duplicate content the URLs in the sitemap file must refer to the canonical page. This tactic avoids unnecessary crawling of pages that will likely be ignored in search results.
The second caveat raises the issue of using the rel='canonical' tag on web pages to ensure that search engines understand which is the definitive version of a page. A typical use of this tag is to let Bing or Google know which of https://www.example.ac.uk or https://example.ac.uk is the definitive site for crawling purposes. This tag is often missing from web page head sections, despite its importance for search engines.
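As a minimal sketch, assuming the www host is treated as definitive, the tag sits in the page’s head element (the domain below is the placeholder used above):

```html
<head>
  <!-- Declare the www version as the canonical URL for this page -->
  <link rel="canonical" href="https://www.example.ac.uk/" />
</head>
```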
In addition to including the full set of relevant URLs, the URLs in a sitemap should follow the protocol of the sitemap location. In other words, if the sitemap is located at https://www.exampleu.ca/sitemap.xml the URLs in the sitemap should use the same https protocol. Higher education organisations make extensive use of sub-domains, for example https://cs.exampleu.edu as distinct from https://exampleu.edu, and any sitemaps must reflect the relevant (in this case, cs.exampleu.edu) sub-domain.
Search engines read the XML file, understand the structure of the site’s content and the importance the web team has assigned to it, and act accordingly.
The sitemaps.org specification limits a single sitemap to 50,000 URLs, but multiple smaller sitemaps can be linked via a sitemap index file to exhaustively map a mega-site.
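A sitemap index file uses the same protocol and simply lists the child sitemaps; a hypothetical sketch (the file names are invented) might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each child sitemap may contain up to 50,000 URLs -->
  <sitemap>
    <loc>https://www.exampleu.ca/sitemap-pages.xml</loc>
    <lastmod>2018-09-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.exampleu.ca/sitemap-news.xml</loc>
    <lastmod>2018-09-01</lastmod>
  </sitemap>
</sitemapindex>
```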
Google recommends updating XML sitemaps daily, particularly if content changes frequently, which is usual for university and college websites.
Creating Sitemaps
Sitemaps are dynamic files that should promptly reflect changes in site content or relative URL importance. So while a sitemap can be generated manually, the task is better accomplished programmatically. To this end, there are free and paid third-party applications that can generate and verify different types of sitemap and automatically submit these to the relevant search engine providers. A good starting point is provided in this article: 10 Awesome Visual, Proven Sitemap Generator Tools. Similar XML sitemap generation tools exist for Drupal, WordPress and Joomla installations.
Google, Bing, Baidu and Yandex all provide Webmaster consoles through which sitemaps can be submitted and verified. Baidu operates slightly differently. First, it doesn’t provide an English-language interface. Second, Baidu invites sites to submit sitemaps rather than processing voluntary submissions.
Submitting Sitemaps to Search Engines
There are three places that a sitemap needs to ‘appear’. First, XML sitemap generation software will place the file(s) in the root directory of the relevant domain. Second, a sitemap directive (or directives) should be added to the robots.txt file. Again, most third-party sitemap applications provide a facility for editing a robots.txt file. Finally, XML sitemaps should be submitted to the search engines most relevant to a site’s traffic.
Sitemap submission ensures that search engines know to use the applicable maps and that the maps are tested for errors.
Sitemaps can be submitted to Google using the Search Console (the re-named Google Webmaster Tools). Under the dashboard options is the crawl option, which in turn accesses the sitemap submission option.
Bing offers a similar option to submit and verify sitemaps using its Webmaster Dashboard under the Configure My Site menu option.
Yandex allows sitemap submission via its Yandex Webmaster facility [using the page translation facility as needed]. Once signed in, select the Index Settings menu and the sitemap files option to register a sitemap.
What are Current Practices for University and College Websites?
To understand current practices on live websites we examined just over 200 websites operated by Canadian universities and colleges.
First, we checked these sites to see which institutions have implemented rel='canonical' meta tags in the HTML head element of their home pages. Given that XML sitemaps need canonical URLs, this test can be viewed as a proxy for these institutions’ recognition of the potential indexing issues of duplicate content.
Second, we checked the main site domain for XML sitemaps: simply, was a sitemap present or not?
Here’s what we found.
Can You See The Real Me? Canonical Tags and Sitemaps
For the rel='canonical' tag, 62 out of 206 sites – 30% – had placed this instruction in the head portion of the website home page.
Checking the root directories of the same 206 sites, we found 78 sitemaps: 38% of sites operate with an XML sitemap. Additional sites may have sitemaps located elsewhere, with the location passed to the relevant search engines via the associated Webmaster tools console. It is somewhat surprising to find that over 60% of university and college websites do not use XML sitemaps to aid visitor content discovery.
We also cross-checked our previous data on robots.txt file use, where we had found that 18 of 206 sites used the robots.txt file to signal a sitemap.xml location. This further exercise revealed that four of the 18 sites no longer have a map at the designated location and should either implement a map, update the robots.txt file, or both.
Conclusion
Large scale website maintenance is both tricky and time consuming, and behind-the-scenes tasks such as correctly structuring robots.txt files or regularly updating XML sitemaps can get overlooked. That is why we are building a suite of services to support university and college website configuration and content maintenance.
Sitemap generation add-ins or plug-ins are available for university and college content management systems. We highly recommend using one of these applications to create and regularly update sitemaps to enhance site indexing and thus the ability of site visitors to locate content of interest.