Article Updated March 2020
How Do You Actually Find All the Websites Your University Owns?
In July 2019 we asked “Do You Know How Many Websites You Own?” By and large the response was no. That’s not a good answer, because it means you don’t understand your web estate, the risks it poses or whether it can meet your end users’ needs.
A site discovery or web estate audit is an excellent solution to these problems as it establishes a web estate’s complexity and produces a website inventory. Effective audits or site discovery exercises are a blend of computer science, heuristics and practical experience.
In this post we outline our general approach to this challenge.
Step One: What is a University Website?
Typical discovery exercises involve finding higher and post-secondary education institutions’ public-facing websites. Intranets and non-public sites are part of institutional digital estates, but usually operate behind firewalls and are less of a risk exposure.
Public-facing university websites fall into two categories. The first comprises sites uniquely identified by a permutation of an institution’s principal domain name: for example, *.university.ac.uk, university.*.au.edu or universidad.es/microsite.
The second comprises sites using unaffiliated domain names: for example, https://citizenlab.ca/ (University of Toronto) or http://chss-elearning.info/ (University of Edinburgh). These are sometimes called “rogue” sites.
A site discovery exercise or web estate audit tries to find all sites in both categories, recognising members of the latter group will be trickier to track down.
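As a rough illustration of the first category, the sketch below tests a URL's host name against wildcard patterns such as those listed above. The function name `matches_naming_scheme` and the pattern list are assumptions made for this example, not part of our tooling, and path-based microsites such as universidad.es/microsite would need a separate check on the URL path.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def matches_naming_scheme(url: str, host_patterns: list[str]) -> bool:
    """Return True if the URL's host matches any wildcard host pattern.

    `host_patterns` is a hypothetical, caller-supplied list such as
    ["*.university.ac.uk", "university.*.au.edu"].
    """
    # urlsplit only fills in hostname when the URL includes a scheme.
    host = (urlsplit(url).hostname or "").lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in host_patterns)
```

A URL that matches no pattern is not necessarily outside the estate. It is simply a candidate for the second, harder-to-find category.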
Step Two: Where to Start Looking for University Websites?
Site discovery can start with a single initial URL, most likely the main website address, and radiate out. This works well for universities with structured site and sub-site naming. It is less effective for universities with unstructured naming or institutions that are a product of mergers.
In complex situations it is more efficient to use many seed URLs, as these provide multiple discovery starting points and address estates in which cross-page linking is rare. The URLs can come from site directories, HTML site maps, previous discovery exercises, internal audit spreadsheets, the keeper of domain names or XML sitemaps.
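To show how seed URLs might be pulled from an XML sitemap, here is a minimal sketch using Python's standard library. It assumes the sitemap follows the sitemaps.org 0.9 schema and has already been downloaded as text. The function name `seeds_from_sitemap` is illustrative only.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace used by sitemaps that follow the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def seeds_from_sitemap(xml_text: str) -> list[str]:
    """Reduce every <loc> entry to scheme + host and return each site once."""
    root = ET.fromstring(xml_text)
    seeds: list[str] = []
    seen: set[str] = set()
    for loc in root.iter(SITEMAP_NS + "loc"):
        parts = urlsplit((loc.text or "").strip())
        if not parts.hostname:
            continue  # skip empty or malformed entries
        site = f"{parts.scheme}://{parts.hostname.lower()}/"
        if site not in seen:
            seen.add(site)
            seeds.append(site)
    return seeds
```

Reducing page URLs to one entry per host keeps the seed list short. Seeds from the other sources listed above can be merged into the same list.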
Step Three: How to Find All of a Higher Education Institution’s Websites
The following video shows our approach in a simplified exercise with the discovery process limited to 300 websites. The results in the video stop at exactly 300 sites because of that limit, but in many cases link analysis on the final pages will uncover more sites.
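A capped discovery run of this kind can be sketched as a breadth-first traversal. This is an illustration under stated assumptions, not necessarily the process shown in the video: `discover` is a name chosen for the example, and `get_links` is a hypothetical caller-supplied function that returns the sites linked from a given site (in practice it would fetch and parse pages).

```python
from collections import deque
from typing import Callable, Iterable


def discover(seeds: Iterable[str],
             get_links: Callable[[str], Iterable[str]],
             max_sites: int = 300) -> tuple[list[str], set[str]]:
    """Breadth-first discovery that accepts at most `max_sites` sites.

    Returns the accepted sites in discovery order, plus the sites that were
    seen only after the limit had been reached.
    """
    seen: set[str] = set()
    queue: deque[str] = deque()
    overflow: set[str] = set()

    def offer(site: str) -> None:
        if site in seen:
            return
        if len(seen) >= max_sites:
            overflow.add(site)  # found, but over the limit
            return
        seen.add(site)
        queue.append(site)

    for seed in seeds:
        offer(seed)

    found: list[str] = []
    while queue:
        site = queue.popleft()
        found.append(site)
        for linked in get_links(site):
            offer(linked)
    return found, overflow
```

The second return value holds the sites that link analysis turned up after the limit was hit, which is why a capped run can report exactly 300 sites while more remain to be found.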
Three link types are encountered during scanning: internal links, which are variants of an institution’s main domain name; external links to readily identified third-party sites, such as other universities, social media networks and government agencies; and harder-to-resolve external links to less readily identified third-party sites. The last group needs further inspection to see whether its sites are part of the web estate. We use a URL reference database (with higher education related URLs) to resolve third-party links and speed up “rogue” site identification.
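The three-way split can be sketched as follows. This is a simplified illustration: `classify_link` is a made-up name, and the `known_third_parties` set is only a stand-in for a URL reference database, whose real structure is not described here.

```python
from urllib.parse import urlsplit


def classify_link(url: str, principal_domain: str,
                  known_third_parties: set[str]) -> str:
    """Label a link 'internal', 'external-known' or 'external-unresolved'."""
    host = (urlsplit(url).hostname or "").lower()

    def under(domain: str) -> bool:
        # True for the domain itself and for any of its subdomains.
        domain = domain.lower()
        return host == domain or host.endswith("." + domain)

    if under(principal_domain):
        return "internal"
    if any(under(domain) for domain in known_third_parties):
        return "external-known"
    # Needs manual or further automated inspection: possibly a "rogue" site.
    return "external-unresolved"
```

Only the links labelled external-unresolved need the further inspection described above, which is where a larger reference set saves the most time.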
Understanding and Visualising a Discovery Exercise's Results
We’ve taken the data from the video discovery exercise and plotted it in the force directed graph shown in Figure 2: use your cursor to pull the nodes and re-organise the diagram. Hovering a cursor over a node shows the corresponding website URL.
The larger purple node represents the URL used to seed the discovery process: www.nihon-u.ac.jp. The orange-coloured nodes correspond to the seed node’s child sites. The green nodes represent websites directly linked to each child website.
From the diagram we can see where dense clusters of sites occur and understand how many steps from the homepage a visitor may need to take to arrive at a site of interest. In this case the maximum number of steps is five. In Figure 2 we’ve coloured one pair of sites (nodes) red to pick out a “rogue” site and its link. As we noted above, our URL reference database helps isolate sites not following an institution’s standard naming scheme.
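The number of steps from the seed to each site can be computed from the link data with a breadth-first search. The sketch below assumes the discovery results are available as a dictionary mapping each site to the sites it links to. That layout and the name `steps_from_seed` are assumptions for the example.

```python
from collections import deque


def steps_from_seed(links: dict[str, list[str]], seed: str) -> dict[str, int]:
    """Return the minimum number of link steps from the seed to each reachable site."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        site = queue.popleft()
        for child in links.get(site, []):
            if child not in depth:  # first visit is the shortest route
                depth[child] = depth[site] + 1
                queue.append(child)
    return depth
```

Taking `max(steps_from_seed(links, seed).values())` gives the maximum number of steps for the estate, the figure quoted above for the example graph.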
How to Obtain Enduring Value from a Web Estate Audit
Web site discovery exercises produce comprehensive website lists, but for risk mitigation and business continuity it makes sense to associate an owner with each site: an internal tracking and tracing process that may take months.
Basic listings can become more powerful risk and performance management tools when combined with content composition, technology, social media, accessibility, security and privacy data - all of which can be captured automatically.
In common with all audits, web site discovery exercises produce point-in-time views – new sites will come online, others will go offline. Data about each site in an institution’s web estate needs to be refreshed regularly to have enduring value as part of a digital governance framework.
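One simple way to see how an estate has changed between two audits is to compare the site lists. This sketch assumes each inventory is a plain list of site URLs. The function name `compare_inventories` is illustrative.

```python
def compare_inventories(previous: list[str], current: list[str]) -> dict[str, list[str]]:
    """Compare two site lists and report new, gone and unchanged sites."""
    prev, curr = set(previous), set(current)
    return {
        "new": sorted(curr - prev),        # came online since the last audit
        "gone": sorted(prev - curr),       # no longer found
        "unchanged": sorted(prev & curr),  # present in both audits
    }
```

Sites reported as gone may only be temporarily unreachable, so they are worth rechecking before being removed from the inventory.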
We are always happy to explain or demonstrate our process.