Why you need to find every website in your digital estate

[Image: points of light on a black background representing a dense network]

Is your digital estate a hot mess?

Major corporations and higher education institutions have invested heavily in constructing sophisticated digital ecosystems often comprising hundreds or thousands of websites, social media accounts and other online points of presence.

These digital estates represent significant asset portfolios, built to tell their organizations’ stories and meet their audiences’ information and online task needs. But they can also be liabilities when they display no-longer-functional, dated, off-brand, inaccurate or inaccessible content that undermines the confidence and organizational reputations communications and marketing teams have worked hard to establish.

Moreover, organizations rarely have the data needed to identify their digital estates’ assets and liabilities, let alone to eliminate the latter and enhance the former.

What’s needed is a process to explore and map digital estates and produce a ‘digital balance sheet’: an accurate inventory that uncovers potential liabilities while collecting the data needed to measure and improve asset effectiveness. A process that measures what’s really going on, so investments in digital marketing, communications, recruitment and the like are founded on reliable data.

In this article we focus on exploring and mapping, which turns out to be much trickier than it appears at first sight. For more about measuring and improving online effectiveness, read our report: https://www.eqafy.com/global-eq-report.html

How digital estates are created

Higher education institutions and corporations rely on a web infrastructure that has evolved substantially over 25 years. Hardware, software and data storage costs have fallen dramatically, while the use of websites and digital services has risen just as sharply. At the same time, executive interest and investment in the processes and structures to develop and manage the resulting complexity have fluctuated markedly.

Business units, departments, campaign teams and similar groups have taken advantage of technology democratisation to proliferate their online presences. This evolution isn’t inherently bad, but it does have a price.

As teams move to new objectives, projects or campaigns the work they’ve completed lives on. Over time content and complexity bloat. Supporting systems fall out of favour and stop evolving to meet new needs. That’s why we encounter universities with thousands of independent websites, running on tens of different content management systems, hosted by dozens of suppliers. Even corporations end up in much the same situation, albeit on a smaller scale.



Those complex digital estates represent tens of thousands of hours of content creation, curation and software development. They are significant organizational assets, but visitors navigating through dense digital estates can have very poor and frustrating experiences, making those estates seem more like liabilities.

Documentation could help. But be honest, how much relevant documentation do you encounter?

In the end, knowing all about an organization’s online presences means doing a digital estate survey or site discovery exercise.

How to understand your digital estate

The term survey suggests measuring angles, elevations and distances to produce a map. A better analogy is taking stock of items in warehouses, with the added complication of not knowing how many warehouses there are.

[Image: shipping containers stacked at a port]

Surveys use software crawlers or bots to follow links and visit pages, as determined by a set of initial parameters, to identify and record relevant websites. A site discovery process needs to find and record sites belonging to an organization and exclude those belonging to external entities.

Survey software also needs to cover enough ground to confidently identify a high percentage of an organization’s online presences, but not spend days, weeks or months on diminishing-marginal-returns visits to every link on every page. So, the underlying 'search' algorithms need to be efficient.
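
To make the mechanics concrete, here is a minimal breadth-first crawl sketch using only the Python standard library. The seed URL, the allowed domain and the depth limit are illustrative assumptions, not a description of any particular survey tool.

    # Minimal breadth-first site-discovery sketch (illustrative only).
    # Assumed parameters: a seed homepage, one organization domain and a depth limit.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    SEED = "https://www.exampleinc.com/"      # assumed starting point
    ORG_DOMAIN = "exampleinc.com"             # assumed organization domain
    MAX_DEPTH = 3                             # how deep to follow links before stopping

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def in_scope(url):
        host = (urlparse(url).hostname or "").lower()
        return host == ORG_DOMAIN or host.endswith("." + ORG_DOMAIN)

    def crawl(seed):
        found, queue = set(), deque([(seed, 0)])
        while queue:
            url, depth = queue.popleft()
            if url in found or depth > MAX_DEPTH or not in_scope(url):
                continue
            found.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                       # a real survey would log this and move on
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                queue.append((urljoin(url, link), depth + 1))
        return found

    # crawl(SEED) returns the set of in-scope URLs reachable within MAX_DEPTH links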

And, while the software is picking its way through an estate, it would be useful to gather other value-added information. For example, why not automatically check domain registration combinations, perhaps picking up that Kazakhstan (.kz) website you didn’t know you’d registered? So, the software also needs to be ‘intelligent’.
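
As a rough illustration of that kind of ‘intelligence’, the sketch below checks whether a brand name resolves under a handful of other top-level domains. The brand label and TLD shortlist are assumptions, and a DNS answer is only a proxy for registration; a WHOIS lookup (sketched later) gives a firmer view.

    # Rough check for a brand registered under other country-code TLDs (e.g. .kz).
    # The brand label and TLD shortlist are assumptions for illustration.
    import socket

    BRAND = "exampleinc"
    TLDS = ["com", "net", "org", "kz", "ru", "cn", "fr", "de"]

    def resolves(hostname):
        try:
            socket.getaddrinfo(hostname, 443)
            return True
        except socket.gaierror:
            return False

    for tld in TLDS:
        for host in (f"{BRAND}.{tld}", f"www.{BRAND}.{tld}"):
            if resolves(host):
                print(f"{host} resolves - worth a closer look")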

Network diagrams are useful for illustrating site discovery results, because visualizing the number of sites and connections can aid in planning better user experiences and journeys. In the end, though, a digital estate survey is less like map making and more like taking inventory and recording how each location was found.

But the process presents three challenges: comprehensiveness, efficiency and minimizing false positives.

What to consider before starting a site discovery exercise

Let’s see how a site discovery exercise works in more detail.

What’s a website?

Conceptually, a digital estate survey is designed to uncover every ‘website’ an organization owns or for which it has content responsibility. So, we need a website definition.

[Image: paper cut-out question marks]

For these purposes we might include:

  • sites identified by their own domain names, whether or not they are under the institution’s direct control,
  • content the organization is responsible for that is hosted on third-party platforms or via cloud services, and
  • related online presences, such as social media accounts.

If surveys, based on the above parameters, ‘find’ too many sites, post-survey inspections can eliminate false positives.

What’s in scope?

Our website definition means we are mainly looking for sites identified by domain names (for example, https://www.construction.exampleco.fr) under an institution’s direct control and a minority (a real-world example: https://jacobsoninstitute.org/) that are not. In practice, the latter can be much harder to uncover.

We also need to set boundaries. We don’t want to explore the internet. We just want to examine relevant sites in a digital estate. So, we need to be able to recognise our organization’s links versus links to other organizations.

But we don’t need to visit every page. From testing hundreds of complex digital estates and hundreds of thousands of pages, we see that a relatively small proportion of pages contain ‘high information’ links, and once crawling moves much beyond those pages the incremental information value is low. The trick is knowing how deep to set initial page scanning or crawling.

[Image: man reading an iPad screen]

Having decided to focus on internal page links and set how deep in each website we will look before moving on, we still need additional survey parameters. We need to describe the permissible variations of domain names we consider relevant: for example, are exampleinc.com and exinc.com both in scope? Should we also include exampleinc.net and exampleinc.org?

We’ll also need approximate string matching to snag organizational name variations: for example, University of Example, U of E, UofE, UoE and Example University may all be in use, even if only one is the official name.
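
A minimal sketch of both checks, using only the Python standard library; the domain variants, name variants and similarity threshold below are assumptions for illustration:

    # Illustrative inclusion tests for domain-name and organization-name variations.
    from difflib import SequenceMatcher
    from urllib.parse import urlparse

    DOMAIN_VARIANTS = {"exampleinc.com", "exinc.com", "exampleinc.net", "exampleinc.org"}
    NAME_VARIANTS = ["university of example", "u of e", "uofe", "uoe", "example university"]

    def domain_in_scope(url):
        host = (urlparse(url).hostname or "").lower()
        return any(host == d or host.endswith("." + d) for d in DOMAIN_VARIANTS)

    def name_matches(text, threshold=0.8):
        """Approximate matching to catch misspellings and informal name variants."""
        text = text.lower().strip()
        return any(SequenceMatcher(None, text, name).ratio() >= threshold
                   for name in NAME_VARIANTS)

    # domain_in_scope("https://alumni.exampleinc.org/events")  -> True
    # name_matches("Universty of Example")                     -> True (catches the typo)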

However, while the inclusion/exclusion parameters we’ve described above do reduce ‘noise’, it turns out they aren’t sufficient for comprehensive, efficient surveys with low false positive rates. As a result, we’ve taken data captured during testing to populate a URL reference database linking websites and their ‘owners’, which significantly reduces ambiguities. There are undoubtedly other refinements available to make identification more efficient.

What needs handling when site discovery exercises are running

Staying within bounds

With the site discovery scope well bounded we can start looking for websites and other online presences, such as social media accounts.

An organization’s homepage is a convenient jumping-off point. First, because homepages are rich with links to other important locations (pages/websites) within an organization’s digital estate. Second, because it is as good a starting point as anywhere.

Site discovery is a marathon, not a sprint. Server responses, response speeds and how the discovery software processes pages all constrain how quickly links and pages can be crawled. If a site discovery exercise also includes comprehensive data collection along the way, overall processing times will be longer.

Processing will also be slower when scanning or crawling uses a headless browser to load JavaScript-dependent content. This turns out to be an important consideration, because JavaScript is at the heart of much of the current web content experience. We want the discovery process to be as close as possible to a human visitor’s experience, so that subsequent manual inspections yield the same results.
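
To illustrate why rendering matters, the sketch below compares the links visible in raw HTML with those visible after JavaScript has run, using the Playwright library as one of several headless-browser options. The URL, the library choice and the settings are assumptions, not a statement about any specific survey tool.

    # Compare link counts before and after JavaScript rendering (illustrative).
    import re
    from urllib.request import urlopen
    from playwright.sync_api import sync_playwright

    URL = "https://www.exampleinc.com/"            # assumed page to test

    def count_links(html):
        return len(re.findall(r"<a\s[^>]*href=", html, re.IGNORECASE))

    raw_html = urlopen(URL, timeout=15).read().decode("utf-8", "replace")

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(URL, wait_until="networkidle")   # let scripts fetch and render content
        rendered_html = page.content()
        browser.close()

    print(f"links in raw HTML: {count_links(raw_html)}, "
          f"after rendering: {count_links(rendered_html)}")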

[Image: people exploring a maze]

We stated earlier that we want surveys to be efficient. This entails keeping very careful track of search paths to avoid duplicating effort and doubling back as discovery proceeds. We’d also like to ignore links that no longer work, avoid ‘suspect’ sites, handle programming and content errors and navigate and remember our way through multiple page redirections. The latter condition is very frequently encountered, because as estates evolve, content gets repeatedly relocated, while still needing to be reached from links pointing to its original location.
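
A sketch of the bookkeeping involved, assuming a simple normalisation scheme and standard-library redirect handling; the rules shown are illustrative, not exhaustive:

    # Normalise URLs before recording them, and remember where redirect chains end,
    # so the same destination isn't re-crawled under several historical addresses.
    from urllib.parse import urlparse, urlunparse
    from urllib.request import urlopen

    def normalise(url):
        """Lower-case the host, drop fragments and default ports, trim trailing slashes."""
        parts = urlparse(url)
        host = (parts.hostname or "").lower()
        if parts.port and parts.port not in (80, 443):
            host = f"{host}:{parts.port}"
        path = parts.path.rstrip("/") or "/"
        return urlunparse((parts.scheme.lower(), host, path, "", parts.query, ""))

    visited = set()        # normalised URLs already processed
    redirects = {}         # original URL -> final URL after redirections (or None)

    def should_visit(url):
        key = normalise(url)
        if key in visited:
            return False
        visited.add(key)
        return True

    def record_final_location(url):
        try:
            with urlopen(url, timeout=10) as response:
                redirects[url] = response.geturl()   # urllib follows redirects for us
        except Exception:
            redirects[url] = None                    # broken or suspect link: log and skip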

Handling out-of-bounds conditions

We also need rules to cope when things go wrong. What do we want to do with websites that fall outside our initial parameters? Perhaps they are part of a digital estate, but we can’t be sure. Or perhaps we’ve encountered cybersquatting: the site looks legitimate, but it isn’t, and a legal department may wish to investigate.

How should we handle content or services hosted on third-party websites (for example, blogs on medium.com, articles on LinkedIn.com or images on flickr.com) or via cloud service providers (trouble ticket reporting or email services)? And, what about organizations that have merged, split or spun out some of their operations?

Perhaps identify these as we go and highlight the results for post-survey content analysis?

Handling issues with links and content parsing

Even highly efficient crawlers can take hundreds of hours to comprehensively explore large digital estates. They need to be able to handle the many different types of malformed HTML, JavaScript or XML code they encounter so surveys don’t stretch from weeks into months.

While there are standards, rules and guidelines for producing well-formed HTML and JavaScript, and conventions for how servers respond to queries, a portion of links, sites and their content always presents problems.

[Image: a page of HTML code]

A site discovery process must have strategies to handle exceptions: large pages that take minutes to load on fast connections or never load on mobiles; typographical errors in URLs; unexpected character sets; and a multitude of oddities and legacy coding embedded in pages across the world wide web.

All of these conditions can trip up the unwary, force site discoveries to end early or require manual intervention. Just as importantly, it’s worth implementing robust logging and error reporting to properly support issue resolution.
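
A defensive fetch-and-parse sketch along those lines, using the standard library’s deliberately tolerant HTML parser; the size cap, timeout and log format are assumptions:

    # Time-box requests, tolerate malformed markup and log every failure for triage.
    import logging
    from html.parser import HTMLParser
    from urllib.request import urlopen

    logging.basicConfig(filename="discovery.log", level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    MAX_BYTES = 5_000_000                  # skip pathologically large pages

    class TolerantLinkParser(HTMLParser):
        """html.parser is forgiving of broken markup; collect hrefs, ignore the rest."""
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def fetch_links(url):
        try:
            with urlopen(url, timeout=15) as response:
                body = response.read(MAX_BYTES).decode("utf-8", errors="replace")
        except Exception as exc:           # DNS failures, timeouts, TLS problems...
            logging.warning("fetch failed for %s: %s", url, exc)
            return []
        parser = TolerantLinkParser()
        try:
            parser.feed(body)
        except Exception as exc:           # grossly malformed documents
            logging.warning("parse failed for %s: %s", url, exc)
        return parser.links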

Building the digital estate inventory

The pace of website data acquisition generally follows a Pareto pattern, more so if the original search parameters are well formed. In other words, a high proportion of a digital estate's websites will be found early on, with continued scanning eventually uncovering the laggards. And the laggards are often the ‘rogue sites’ of university digital estates or the past campaign, brand and temporary sites of the commercial world.

One useful tactic in capturing digital estate data is checking WHOIS facilities to:

  • attempt to confirm website ownership data, and
  • for commercial organizations, test whether their domain name has been registered in other jurisdictions.

The latter test can uncover unethical behaviour or highlight legitimate websites long gone from institutional memory. Automating WHOIS enquiries is complicated by policies that redact data to protect registrant privacy and by a lack of data exchange standardization: unhelpfully, WHOIS server responses are free-form text.
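
For illustration, a raw WHOIS lookup can be done over port 43 (RFC 3912), bootstrapping from IANA’s referral server; because responses are free-form text, interpretation is left to a human or to per-registry parsing rules, and the servers used here are assumptions:

    # Minimal raw WHOIS query (RFC 3912). Responses are free-form text.
    import socket

    def whois_query(server, query):
        with socket.create_connection((server, 43), timeout=10) as sock:
            sock.sendall(query.encode() + b"\r\n")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks).decode("utf-8", errors="replace")

    def whois(domain):
        # Ask IANA which registry serves this TLD, then query that registry.
        referral = whois_query("whois.iana.org", domain)
        server = next((line.split(":", 1)[1].strip()
                       for line in referral.splitlines()
                       if line.lower().startswith("refer:")), None)
        return whois_query(server, domain) if server else referral

    # print(whois("exampleinc.kz"))   # inspect registrant fields, if not redacted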

As discovery results are collected, the following will also arise:

  • Both www and non-www versions of a site exist and load in browsers. The software, or manual intervention (aided by software), will need to determine whether they are the same site or different sites. Perhaps we need to advise the site owner about canonical URL definitions?
  • Both http and https versions of a site exist and don’t necessarily redirect as might be expected. We’ve seen http “site versions” generate errors, while the https version responded normally. Again, we’d expect discovery software to aid in diagnosing this type of issue.
  • The crawler or bot finds links to websites on one or more webpages, but there are no corresponding DNS entries. So, nothing happens. Is this a ‘real’ website, or is it just time to update the links and the DNS records?
  • The crawler or bot finds site links, there are DNS entries and a page loads. But all that is displayed is “hello world” or “test site”, or test, staging or beta content is presented, all of it visible to the public. We’d like these cases highlighted so we can respond by taking material offline or relocating it behind a firewall.

A reasonable general principle is to aid post-site discovery analysis by recording all these issues and flagging them by category.
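
A sketch of how those variant checks might be automated for each discovered domain; the placeholder indicator strings and the URL combinations tested are assumptions:

    # Check DNS, http/https and www/non-www variants, and flag placeholder content.
    import socket
    from urllib.request import urlopen

    PLACEHOLDER_HINTS = ("hello world", "test site", "coming soon")   # assumed markers

    def check_site(domain):
        findings = {"domain": domain}
        try:
            socket.getaddrinfo(domain, 443)
            findings["dns"] = "resolves"
        except socket.gaierror:
            findings["dns"] = "no DNS entry"     # links exist but nothing answers
            return findings
        for url in (f"https://{domain}/", f"http://{domain}/",
                    f"https://www.{domain}/", f"http://www.{domain}/"):
            try:
                with urlopen(url, timeout=10) as response:
                    body = response.read(200_000).decode("utf-8", "replace").lower()
                    findings[url] = ("placeholder content?"
                                     if any(h in body for h in PLACEHOLDER_HINTS)
                                     else f"OK ({response.status})")
            except Exception as exc:
                findings[url] = f"error: {exc}"
        return findings

    # check_site("exampleinc.com") -> a per-variant record for post-survey review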

[Image: a person looking through a microscope]

Post site discovery analysis

Eventually, crawlers run out of links to process and surveys “complete” within the search algorithm and parameter constraints. Even a sophisticated crawler will not have found every site, but it is likely that all material sites will have been discovered and logged.

The results can be handed over for human inspection, analysis and troubleshooting. One way of viewing the results is as a ‘provisional inventory’ that needs checking for reasonableness. For a commercial organization the discovered list may have a couple of thousand website entries. For higher education institutions, the number of putative sites may be between two and five times the number typically identified in corporate digital estates.

The post-survey analysis comprises four main tasks:

  • Eliminating duplicate entries. Multiple page redirections can make it difficult to avoid recording what turn out to be duplicate entries. One of the first tasks is to identify remaining duplicates and understand why they may exist (a minimal approach is sketched after this list).
  • Eliminating false positives. The provisional inventory will likely contain sites for which a survey’s sponsor does not have content responsibility: a legacy social network account has slipped through the detection algorithm; a university has a separate fundraising foundation; or a business unit has been sold, but widespread links to it remain across the network of websites. All the false positives should be confirmed and removed.
  • Finding website owners. Most organizations would like to know who is responsible for each site’s content, whether a named individual or an organizational department, and to have a central record of the associated contact details. Tracking down content owners takes time. Data recorded during surveys, DNS records and WHOIS responses can all provide clues and are useful elements in finding website owners.
  • Identifying inactive sites for elimination. Data collected during a survey can help identify when a site’s content was last updated. It may become apparent that a site hasn’t been updated in years, the content owner has left the organization or the website is no longer needed.
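
For the duplicate-elimination task referenced above, a minimal approach is to group provisional inventory entries by the normalised URL each one finally redirects to; the normalisation rules and helper names are illustrative:

    # Group provisional inventory entries that end up at the same destination.
    from collections import defaultdict
    from urllib.parse import urlparse, urlunparse
    from urllib.request import urlopen

    def final_destination(url):
        """Follow redirects and return a normalised form of where we end up."""
        try:
            with urlopen(url, timeout=10) as response:
                final = response.geturl()
        except Exception:
            return None
        parts = urlparse(final)
        return urlunparse(("https", (parts.hostname or "").lower(),
                           parts.path.rstrip("/") or "/", "", "", ""))

    def group_duplicates(provisional_inventory):
        groups = defaultdict(list)
        for url in provisional_inventory:
            groups[final_destination(url)].append(url)
        return {dest: urls for dest, urls in groups.items() if dest and len(urls) > 1}

    # group_duplicates(["http://news.exampleinc.com", "https://news.exampleinc.com/"])
    # -> both recorded entries grouped under one destination, for manual confirmation

Candidate groups still need human confirmation, since http and https versions can occasionally turn out to be genuinely different sites.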

While not strictly necessary for tracking down websites within a digital estate, technology and content-related data can be collected during a survey and later be used to gain a better understanding of a digital estate’s effectiveness.

For example, as sites are surveyed their accessibility can be assessed, providing an overview of how well content meets accessibility mandates. Geo-location of site hosting can identify where sensitive data is being stored, helping determine whether jurisdictions are appropriate and whether too many or too few external hosting providers are being used.

And, technical data collected during a survey can spot security and privacy concerns along with obsolete or unpatched software, protocols and content management systems.
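
As a small example of the kind of technical data involved, response headers captured during a survey can hint at the software in use; headers are often suppressed or misleading, so treat the results as prompts for review rather than conclusions (the header list below is an assumption):

    # Passive technology hints from response headers recorded during a survey.
    from urllib.request import urlopen

    INTERESTING_HEADERS = ("Server", "X-Powered-By", "X-Generator",
                           "Strict-Transport-Security")

    def header_snapshot(url):
        with urlopen(url, timeout=10) as response:
            return {name: response.headers.get(name) for name in INTERESTING_HEADERS}

    snapshot = header_snapshot("https://www.exampleinc.com/")   # assumed site
    if not snapshot.get("Strict-Transport-Security"):
        print("HSTS not set - flag for a security review")
    for name in ("Server", "X-Powered-By", "X-Generator"):
        if snapshot.get(name):
            print(f"{name}: {snapshot[name]} - check the version against known advisories")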

Why do you need to find every website in your digital estate?

A digital estate survey addresses the assets and liabilities challenges discussed earlier. But it is just a step towards a bigger goal. Once you know what you’ve got, you should measure whether it is effective.

By finding all their websites organizations can focus on making their online presences do a better job of meeting their audiences’ needs and telling their stories. They can get more out of their assets while minimizing their liabilities. That’s where the real leverage lies.

 


Blog photo image: unsplash.com/pexels.com