[Cialug] long term storage

Thomas Kula kula at tproa.net
Wed Jul 15 12:24:43 CDT 2009


One of the key points first raised to me by a co-worker is that there are
at least three different things that tend to get grouped under the term
"backups":

 - Regular backups: e.g. "I screwed up this file two days ago and I'd like
   to get back what it looked like three days ago" or "Oops, I
   accidentally deleted this file and I need it back"
 - Disaster recovery: e.g. "my computer was hit by a meteorite and I need
   everything back on it" or "my entire data center was taken out by a 
   tornado"
 - Archiving: e.g. "this data isn't actively used but needs to be kept
   around because it has some sort of value"

With archiving, I think there are also two distinct problems that need to
be solved:

 - bitrot: "random cosmic particles jiggled bits on this disk and now they
   are meaningless." This can be mitigated with strategies like having two
   copies of everything and a reliable hash to determine which of the copies
   is still the correct one, and a plan that tests that on a regular
   basis (I'm a firm believer now in the idea that you consider no data
   backed up unless it is backed up to two different locations)

 - bitzheimers[1]: "I have these bits which I've verified are the exact bits I
   put into storage N years ago, I just have no idea what these bits actually
   mean." This really is solved by including enough metadata with the data
   to decipher the data, e.g. "while PostScript may not be a used file format
   in 50 years, I'm reasonably sure that with a PostScript file and a copy of
   the PostScript language specification, and enough gumption in one form or
   another, I can turn that file back into a picture of my cat or whatever"
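
As a rough illustration of the "two copies and a reliable hash" idea for
bitrot, here is a minimal Python sketch. It assumes a SHA-256 digest was
recorded when the data was archived; the function names and the choice of
SHA-256 are assumptions made for this example, not anything prescribed above.

```python
import hashlib


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_copies(copy_a, copy_b, recorded_digest):
    """Compare two copies against the digest recorded at archive time.

    Returns the list of copies that still match. An empty list means
    neither copy can be trusted; a single entry tells you which copy to
    use when repairing the other one.
    """
    return [p for p in (copy_a, copy_b) if sha256_of(p) == recorded_digest]
```

Running something like this on a schedule is the "plan that tests that" part:
the hash only helps if somebody actually checks it before both copies go bad.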

The common thread in both is that archives have to be kept active, in some
form or another: something needs to thumb through them every so often to make
sure they are still readable and not gibberish. Preventing bitrot is probably
a relatively solved problem, given enough resources, but preventing bitzheimers
on anything other than a trivial scale is probably still in way-out-there
territory. 
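
This does not solve bitzheimers, but the "include enough metadata with the
data" step can at least be started mechanically. The Python sketch below
writes a small JSON file next to each archived file, recording its hash, its
format, and where a copy of the format specification is kept. The
".meta.json" naming and the field names are hypothetical, chosen only for
this example.

```python
import hashlib
import json
from pathlib import Path


def write_sidecar(data_path, format_name, spec_reference):
    """Write <file>.meta.json describing how to interpret the archived file.

    spec_reference is free text: ideally the location of an archived copy
    of the format specification, stored alongside the data.
    """
    data_path = Path(data_path)
    record = {
        "file": data_path.name,
        # Reads the whole file into memory; fine for a sketch, not for
        # very large archives.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "format": format_name,
        "specification": spec_reference,
    }
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

For example, write_sidecar("cat.ps", "PostScript", "specs/postscript.pdf")
would leave a cat.ps.meta.json beside the file. The hard part, making sure
that description still means something to a reader decades later, is exactly
what the code cannot do for you.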

The key in all of these cases is that there is a wide spectrum of solutions
available, with, of course, a wide range of costs that have to be balanced
against the value you think that data has, which is another interesting
problem in itself. All kinds of early culturally relevant movies and
television shows are permanently lost because at some point someone said
"hey, let's clean all this crufty celluloid and 2-inch quad out of the
vault so we free up some space."

[1]: I deeply wish I had invented this term, but at least one other person has
used it before me.

-- 
Thomas L. Kula | kula at tproa.net | http://kula.tproa.net/
Mathom House in Midtown, The People's Republic of Ames

