How Do You Actually Find All the Websites Your University Owns?
We recently asked, Do You Know How Many Websites You Own? It turns out that the most common answer is no. But the statement in that post about web estate audits being ‘a tricky mix of science and art’ prompted the question: how so?
To explain why inventorying higher education websites involves a blend of computer science and heuristics, we’re going to describe our solution to this challenge. Recall that the strategic objective of addressing this digital governance challenge is to ensure that sites within a university’s web estate advance overall marketing and communications goals.
What is a University Website?
Before you can find something, you need to know exactly what you are looking for. For this exercise, that means defining the university websites you want to inventory.
Higher and post-secondary education institutions typically have public-facing sites and a set of non-public sites/intranets. The latter are part of an institution’s overall digital estate, but we find them using a separate technique (based on this post’s general approach). Let’s leave these web properties behind their firewall and focus on the bigger task of finding public-facing sites.
A university’s public-facing sites can also be divided into two groups. Group one comprises sites uniquely identified by variants of an institution’s principal domain name: for example, *.university.ac.uk, university.*.au.edu or university.ca/microsite.
Group two comprises (‘rogue’) sites uniquely identified by ostensibly unaffiliated domain names. For example, https://citizenlab.ca/ (University of Toronto) or http://chss-elearning.info/ (University of Edinburgh).
A website discovery exercise’s primary web governance-related goal is to identify all public-facing sites. Finding group-one sites is relatively efficient. Identifying websites in the second group requires brute force.
Where to Start Looking for University Websites?
Now we’ve defined what we are looking for, we need to know where to start looking.
Almost every university, college or other higher education institution has a website list: the URLs at which some of its websites can be found. The URLs could come from a site directory, an HTML site map, a previous discovery exercise, an internal audit spreadsheet, records kept by whoever assigns domain names or even an XML sitemap.
One way or another, there is a list to seed the hunt for websites. Typically, available lists are not quite detailed enough to set up searches, because they force a search to radiate out through a set of web properties rather than run concurrently.
To accelerate site searches we can populate a seed list with as many other potential website URLs as we can identify across an institution’s network. This is the first science bit. Beefing up a seed list means scanning (and capturing responses from) devices across an institution’s assigned set of IP addresses. For greater detail we could also ‘dump’ out all of an institution’s Domain Name System (DNS) records and add these to our seed list.
Combining the IP address scans, DNS records and the initial website list yields a set of addresses to seek out all of an institution’s websites.
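As a rough illustration of the combining step, here is a minimal Python sketch. It assumes the three inputs are already available as plain lists of strings; the function names (normalise, build_seed_list) and the example hostnames are hypothetical, and the sketch only enumerates the candidate addresses in each IP range rather than scanning them.

```python
import ipaddress
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def normalise(candidate: str) -> Optional[str]:
    """Reduce a URL or bare hostname to a lowercase host, or None if empty."""
    candidate = candidate.strip()
    if not candidate:
        return None
    # urlsplit only recognises the host when a '//' separator is present.
    if "//" not in candidate:
        candidate = "//" + candidate
    host = urlsplit(candidate).hostname
    # DNS exports often carry a trailing dot; drop it so duplicates collapse.
    return host.rstrip(".") if host else None


def build_seed_list(
    website_list: Iterable[str],
    dns_names: Iterable[str],
    ip_ranges: Iterable[str],
) -> List[str]:
    """Merge an existing site list, DNS names and IP ranges into one seed list."""
    seeds = set()
    for item in list(website_list) + list(dns_names):
        host = normalise(item)
        if host:
            seeds.add(host)
    for cidr in ip_ranges:
        # Assumption: ranges arrive in CIDR notation, e.g. "192.0.2.0/30".
        network = ipaddress.ip_network(cidr)
        seeds.update(str(address) for address in network.hosts())
    return sorted(seeds)
```

Calling build_seed_list with a handful of URLs, a DNS export and one or two CIDR ranges returns a sorted, de-duplicated list of hosts and addresses to probe.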
How to Find All of a Higher Education Institution’s Websites?
Now we know where to look, we actually have to search the haystacks.
To hasten the process, scans can be run in parallel – servers tuned for website crawling can be spun up (and down) easily and cost-effectively at Amazon, Google, Microsoft (and others). Scanning hundreds of thousands of university webpages has also revealed a further shortcut (art of the process): not every link on every page needs to be scanned. Reviewing and testing XML sitemap content along with checking page metadata has established some ‘rules of thumb’ that can limit or eliminate unnecessary website scanning.
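To give a feel for running probes concurrently, the sketch below uses a thread pool on a single machine. This is a stand-in for the cloud servers mentioned above, not a description of our tooling; the probe and scan functions, the worker count and the timeout are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.error import HTTPError
from urllib.request import urlopen


def probe(host, timeout=5.0):
    """Return (host, HTTP status), or (host, None) when nothing usable answers."""
    try:
        with urlopen(f"http://{host}/", timeout=timeout) as response:
            return host, response.status
    except HTTPError as error:
        # The server answered, just not with a success code (e.g. 403 or 404).
        return host, error.code
    except (OSError, HTTPException, ValueError):
        # Connection refused, timeout, DNS failure or a malformed address.
        return host, None


def scan(hosts, probe_fn=probe, workers=8):
    """Probe every host concurrently and map each host to its result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(probe_fn, hosts))
```

Passing a different probe_fn makes it easy to swap in a richer check (or a fake one for testing) without touching the concurrency logic.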
Scanning finds three types of links or URLs. First, internal links: URLs that are variants of the main domain name, which are our highest-priority items. Second, external links to well-known sites, for example URLs belonging to other universities, Facebook, Twitter, LinkedIn etc. A further art of the discovery process is maintaining a database of the external links typically encountered across higher education web estates, as there’s little value in constantly rebuilding this data. Finally, there are external links that need further review to see if they represent institutionally affiliated websites.
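The three-way split can be sketched in a few lines of Python. The KNOWN_EXTERNAL set here is a tiny illustrative stand-in for the database of common external links, and the classify function and its labels are hypothetical. The sketch only treats the principal domain and its subdomains as internal, so other variants of the domain name would need extra rules.

```python
from urllib.parse import urlsplit

# Illustrative only: a real list of well-known external sites would be far longer.
KNOWN_EXTERNAL = {"facebook.com", "twitter.com", "linkedin.com"}


def matches(host, domain):
    """True when host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify(url, principal_domain, known_external=KNOWN_EXTERNAL):
    """Label a link as 'internal', 'known-external' or 'review'."""
    host = urlsplit(url).hostname or ""
    if matches(host, principal_domain):
        return "internal"
    if any(matches(host, domain) for domain in known_external):
        return "known-external"
    # Anything else might be an affiliated site on an unrelated domain.
    return "review"
```

Matching on a leading dot, rather than a bare suffix, stops a lookalike host such as notuniversity.example from being counted as internal.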
[Figure: a typical discovery project]
How to Produce a Definitive List of a University’s Websites?
While we’re searching the haystacks, we need to winnow the wheat from the chaff. Although we describe winnowing as a discrete process, search and analysis are iterative and continue until we run out of links to process.
Each scanning round identifies websites not on our original seed list, allowing discovered sites to be recorded as part of the web estate or eliminated, as appropriate. Structured queries and regular expressions can isolate unique links, eliminate duplicates and extract data for further review.
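As one example of the regular-expression side, the sketch below pulls hostnames out of raw crawl output and counts how often each appears. The pattern is deliberately simple and the unique_hosts name is hypothetical; a production pattern would need to handle ports, internationalised names and other edge cases.

```python
import re
from collections import Counter

# Captures the host part of http and https URLs; deliberately simple.
URL_PATTERN = re.compile(r"https?://([A-Za-z0-9.-]+)", re.IGNORECASE)


def unique_hosts(raw_text):
    """Return a Counter of lowercase hostnames found in the text."""
    hosts = (match.lower().rstrip(".") for match in URL_PATTERN.findall(raw_text))
    return Counter(hosts)
```

The counts give a crude relevance signal: a host that turns up repeatedly across an institution’s pages is more likely to deserve a closer look than one that appears once.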
In theory, link analysis is an entirely scientific process. But, without a few shortcuts, it can be slow going. Using the URL patterns found in analysing hundreds of thousands of higher education webpages accelerates extricating relevant websites from long lists of URLs: a bit more art.
And, for the data that needs further review, we’ve found reports and screen-captured images quickly establish relevance. Sometimes a university technically ‘owns’ a website, but no further action may be required other than recording its existence: only human intervention can identify these exceptions.
How to Obtain Enduring Value from a Web Estate Audit?
A web estate audit’s output is a database (or spreadsheet) with accurate point-in-time information about each website an institution owns. As we pointed out in Do You Know How Many Websites You Own? the largely internal process of tracking down site owners can take months. So, the exercise isn’t complete until each site has an associated owner.
The data about each website in an institution’s web estate needs to be kept current to have value as part of a digital governance framework. For example, ongoing HTTPS adoption, software upgrade/patching or cookie elimination programmes can be readily monitored using individual site data, but only if it is current.
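To make the point about currency concrete, here is a minimal sketch using SQLite from Python’s standard library. The table layout, column names and the 90-day threshold are all assumptions for illustration; an actual audit database would hold much more per site.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical minimal schema: one row per site in the web estate.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sites (
    host TEXT PRIMARY KEY,
    owner TEXT,
    uses_https INTEGER NOT NULL DEFAULT 0,
    last_checked TEXT NOT NULL  -- ISO 8601 date, e.g. 2024-05-20
)
"""


def needs_attention(conn, today, max_age_days=90):
    """List hosts with no recorded owner or with data older than the cut-off."""
    cutoff = (today - timedelta(days=max_age_days)).isoformat()
    rows = conn.execute(
        "SELECT host FROM sites "
        "WHERE owner IS NULL OR last_checked < ? "
        "ORDER BY host",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

Storing dates as ISO 8601 text keeps the comparison simple, because those strings sort in the same order as the dates they represent.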
At the same time, despite digital governance guidelines, new websites will regularly come into existence at many institutions, while other sites will start the long process of content decay. Either way, it’s prudent to periodically re-run mini web estate audits, capturing changes and updating the database (or spreadsheet). Happy Auditing!