My Old LibTech Blog (2013-2016)

Keep or toss?

Author: John Durno
Date: 2014-08-13

[Image: A tower of used books]

Keeping bits around is harder than it might initially appear. In fact, no one really knows for sure how to preserve digital content for the long term, since we haven't had the chance to accumulate much practical experience. Analogue storage does provide some clues, but replicating analogue approaches in a digital environment can be tricky.

It's generally acknowledged that things tend to survive in the analogue world when they have the following characteristics:
a. Lots of copies (so at least some will survive things like fires and being dropped into bathtubs)
b. Geographically distributed (so some will survive natural disasters like earthquakes, and unnatural disasters, like wars)
c. Owned by multiple organizations and/or individuals (because organizations and individuals aren't always as durable as we'd like them to be)
d. On durable, tamper-evident media (so they don't self-destruct, and we can tell if they've been altered in any significant way)

Analogue media like printed books tended to exhibit these characteristics quite well. Typically, if you wanted to read a printed book, you (or your local library) had to own a copy of it, so there would be lots of copies around. And these individual copies had some value, so when you were done you'd either keep it, sell it, or donate it to someone else. Some copies (maybe lots) would get damaged or thrown away, but for most things enough copies would remain, distributed widely enough to ensure their contents did not forever vanish from the earth.

However, digital information doesn't by itself work that way. Economies of scale push toward centralized storage (since every copy of something costs money to keep, and there only needs to be one copy available on the web for everyone to access it). And while it's true that a copy gets made every time a digital document is accessed, those copies tend to have little residual value and are typically not retained in any meaningful way after they're read. And of course digital media are pretty much the opposite of durable and tamper-evident. Media fail and go obsolete with great regularity, and bits are remarkably mutable.

So if we want to replicate the preservation advantages of analogue media in a digital world, it won't just happen by itself, as a normal consequence of events. We'll have to allocate money and time specifically for the purpose of making it happen. Which pretty much ensures that it won't happen for large classes of stuff.

Which is not to say there aren't some people trying to do exactly what I'm describing here. In fact, pretty much this whole post is a paraphrase of the rationale for the LOCKSS project, a distributed storage network for quite a few scholarly journals, though by no means all of them.

But it's scary how much content is at risk. Take, for one example, the Internet Archive, an immense trove of incalculable historic and cultural value. As far as I know, the IA is hosted and backed up in San Francisco (an earthquake zone), with one partial mirror at the Biblioteca Alexandrina (in Egypt, not perhaps the most politically stable location in the world).

This isn't enough. For a resource this valuable, there should be copies all over the world. The problem, however, is money, or more accurately perhaps it's getting the folks with access to the kinds of resources required to recognize and respond to the need. It costs approximately $1M to store one petabyte for a year, and the IA has well over 10 PB of data. Given the amount of money we collectively spend on far less worthy projects, you'd think that globally we could allocate $100M/year to create 10 copies of the IA. But so far at least this hasn't happened.
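The arithmetic behind that $100M figure can be sketched in a few lines. This is just a back-of-envelope illustration using the round numbers cited above (roughly $1M per petabyte per year, and 10 PB as a conservative size for the IA); the variable names are mine, not anything from the IA's actual budgeting.

```python
# Back-of-envelope cost of mirroring the Internet Archive,
# using the approximate figures cited in the post.
COST_PER_PB_YEAR = 1_000_000   # USD to store 1 PB for a year (rough figure)
ARCHIVE_SIZE_PB = 10           # conservative lower bound for the IA's holdings
NUM_COPIES = 10                # target number of geographically distributed mirrors

annual_cost = COST_PER_PB_YEAR * ARCHIVE_SIZE_PB * NUM_COPIES
print(f"${annual_cost:,} per year")  # prints: $100,000,000 per year
```

Of course, real costs would include bandwidth, staffing, and hardware refresh cycles on top of raw storage, so this is a floor, not an estimate.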

But hey, why worry ... surely the ever-decreasing costs of storage will solve the problem eventually, right? Just wait long enough, and we'll all have 10PB smartphones. Well, maybe. But there's lots of evidence to suggest that the costs of storage aren't decreasing anywhere near as fast as they used to, and barring some new disruptive technology (DNA storage, anyone?) we'll likely be stuck with relatively high storage costs for the foreseeable future. Not that I'm saying there won't be disruptive innovations in the storage space, only that it would be foolish to build our entire digital preservation strategy around what is by definition an unpredictable event.