The Perpetual Web
5-8% of the Web Disappears Every Year

When I started looking into online data decay in January, I was surprised to learn there was no reliable source of analysis on the subject. So I researched it myself. These findings seem an appropriate subject for our first (non-TechCrunch50) blog post on The Perpetual Web.

Estimating the amount of information that disappears due to data decay is difficult because there’s no single definition of the problem. This is also what makes it an often undervalued issue. For example, blog posts move down the page as they are replaced by new content, but they are generally still archived for the future. Archiving of news, however, has taken a big step backward on the internet because a story’s prominence (is there a photo, is it breaking news?) changes constantly, and these changes are almost always lost. This kind of information was naturally saved when both the story and its context were delivered on a printed page once a day, but on the internet it’s often lost forever.

To achieve a simple, conservative estimate of data decay, I limited this research to the crudest manifestation of the problem: content that simply disappears. I looked at 12,000 bookmarks on Delicious across a random set of users, writing some software to check each of those bookmarks to see whether its URL was still active (a 200-level HTTP status code) or broken (a non-200-level status code).
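
A check of this kind can be sketched in a few lines of Python. This is only an illustration, not the software used for the study: the function names (`is_active`, `fetch_status`, `broken_fraction`), the use of HEAD requests, the 10-second timeout, and the decision to count unreachable hosts as broken are all assumptions.

```python
import http.client
import urllib.error
import urllib.request


def is_active(status):
    """Treat a link as active only if its final status is 200-level."""
    return status is not None and 200 <= status < 300


def fetch_status(url, timeout=10):
    """Return the final HTTP status for url, or None if no response arrived.

    Assumption: a HEAD request is enough to tell whether the page exists.
    urllib follows redirects, so the status is that of the final target.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions carrying the code.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, or malformed URL.
        return None


def broken_fraction(urls, fetch=fetch_status):
    """Fraction of urls whose status is not 200-level (0.0 for no urls)."""
    urls = list(urls)
    if not urls:
        return 0.0
    broken = sum(1 for url in urls if not is_active(fetch(url)))
    return broken / len(urls)
```

Passing the fetcher in as a parameter keeps the counting logic testable without touching the network. Some servers reject HEAD requests, so a real run would probably need to fall back to GET before declaring a link broken.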

The Delicious.com data set is a particularly high-value one because it relies upon Delicious’ filters and users to keep out spam and SEO junk. It naturally contains only bookmarks that someone cared enough about to want to share and reference in the future.



The green line shows broken links as a percentage of the total over time. One of the reasons we publish decay as a range (5-8%) instead of as a fixed number is that the rate of decay changes over time (see the black trend line). Unlike us mere humans, the longer content survives, the more long-lasting it becomes. Basically, if a bookmark lasts one year, it’s more likely to make it to two, and so on.
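
For readers who want to build a similar curve from their own bookmarks, here is a rough sketch of the grouping step. It assumes each bookmark has been reduced to a pair of (date bookmarked, whether the link is broken) and that age is measured in whole 365-day years; the function name `broken_percent_by_age` and that data layout are hypothetical, not taken from the original analysis.

```python
from collections import defaultdict
from datetime import date


def broken_percent_by_age(bookmarks, today):
    """Percentage of broken links per age bucket, keyed by age in whole years.

    bookmarks: iterable of (bookmarked_on: date, is_broken: bool) pairs.
    today: the date on which the links were checked.
    """
    totals = defaultdict(int)
    broken = defaultdict(int)
    for bookmarked_on, is_broken in bookmarks:
        # Assumption: a "year" is 365 days, counted from the bookmark date.
        age_years = (today - bookmarked_on).days // 365
        totals[age_years] += 1
        if is_broken:
            broken[age_years] += 1
    return {age: 100.0 * broken[age] / totals[age] for age in sorted(totals)}
```

Calling `broken_percent_by_age(pairs, date.today())` returns one percentage per age bucket, which can then be plotted against age. Buckets with very few bookmarks will be noisy, so it helps to look at the bucket sizes alongside the percentages.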

This test is far from perfect, but once we gain more data from Perpetually.com we’ll be able to repeat this analysis less conservatively. The “gotchas” I’ve identified are:

  • We don’t know the date when content was first published, so we substitute the date on which it was first bookmarked. This skews the results toward a shorter lifespan.
  • We’re only taking into account data decay of the type “live” to “lost”. Thus, any site that changes the design or content of the page you bookmarked is still considered active. This is probably the biggest gap between our data and a perfect dataset.
  • The web is young and changing fast. Decay that results in broken bookmarks is improving, while changes to pages are increasing. Thus, ten years ago pages disappeared quite quickly compared with today, when the technology powering the web is relatively stable.
  • Delicious is young and the data is not constant. The number of links in our sample of 12,000 implies a certain adoption of Delicious over the past few years. Going back three years, the sample size can be quite small. This too will improve over time.