Dedupe & Compression for Long-Term Archival?


It was great to be invited to speak at the Library of Congress’s annual conference, Storage Architectures for Digital Preservation, where the core focus is long-term archival… I mean reallllly long-term preservation in some cases. The attendees represented a well-selected cross section of public and private organizations with different perspectives, requirements, solutions, and missions. All that led to lively and fun conversation, such as: What is the best media for long-term storage? Ideas to use laser-etched granite slabs and stainless steel film stored next to nuclear waste dumps were met with nods and more discussion.

But the general understanding of data loss mechanisms seems to be limited to anecdotal information, and participants can’t say for sure whether component failure is a bigger culprit than, say, human error or media failure. David Rosenthal from Stanford rightly emphasized that no one has developed a realistic failure model for preservation archival that takes into account the complete picture, including the fact that failures are correlated (for example, more human error happens when the system needs maintenance). As a result, the few vendor product characterizations out there grossly overstate how reliable the solutions are in practice.

One consistent thread in the conference was the request for beyond-enterprise-class reliability features. Yet many in the room (DOD users aside) have no money to buy enterprise-class storage and have little more than white box PCs with RAID5 SATA disks stacked in a closet. Some are even storing archival collections on home-made object-storage and WAN replication solutions. That is probably not a good idea without a solid DR strategy (fortunately, traditional backup is not a requirement when data is completely fixed). Meanwhile, well-funded institutions like LoC and Stanford maintain sophisticated storage strategies, striking a balance between the goals of availability and preservation.

Dell was there to provide some guidance and vision on how dedupe and compression should be considered a legitimate element in preservation archival, especially given that most archives have budget issues that limit what they can save in the first place. It’s not a surprise that long-term preservationists have a natural anxiety about anything proprietary that renders the bits into something that can’t be natively read by VI. There’s a subjective risk coefficient associated with “How can I make sure the files are readable in 10 years when the product is obsolete?” or even “How can I be sure that Dell will even be around in 80 years?”

The bottom line is now clear: data reduction is going to be a standard storage feature, available across entire families of products. Although there are proprietary mechanisms in all of these data-reduction solutions, the threat of the “100-year problem” can be mitigated through self-describing compression formats and software-based readers that can be stored alongside the data and flexibly deployed for future readback. In terms of compression, it’s worth noting that users crossed that bridge a long time ago when they archived JPEG images and MPEG-1 video. No one should assume that 2015’s version of VLC can play that video, and preservationists are acutely aware of these format longevity issues.
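To make the “self-describing” idea a little more concrete, here is a minimal sketch in Python. It is not how any Dell or Ocarina product stores data; the container layout, the `example-archive-v1` label, and the `pack`/`unpack` names are all made up for illustration. The only assumption is that the archive uses an openly documented compressor (zlib here) and writes a plain-text header saying so, along with a checksum of the original bytes.

```python
import hashlib
import json
import zlib


def pack(payload: bytes) -> bytes:
    """Wrap payload as: one JSON header line, a newline, then the compressed body."""
    header = {
        "format": "example-archive-v1",  # hypothetical container name
        "compression": "zlib (RFC 1950 / DEFLATE RFC 1951)",
        "original_size": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    # json.dumps escapes newlines inside strings, so the header is a single line.
    header_line = json.dumps(header, sort_keys=True).encode("utf-8")
    return header_line + b"\n" + zlib.compress(payload, 9)


def unpack(container: bytes) -> bytes:
    """Read the header, decompress the body, and verify size and checksum."""
    header_line, body = container.split(b"\n", 1)
    header = json.loads(header_line.decode("utf-8"))
    payload = zlib.decompress(body)
    if len(payload) != header["original_size"]:
        raise ValueError("size mismatch after decompression")
    if hashlib.sha256(payload).hexdigest() != header["sha256"]:
        raise ValueError("checksum mismatch after decompression")
    return payload


if __name__ == "__main__":
    original = b"archival record " * 1000
    container = pack(original)
    assert unpack(container) == original
    print(len(original), "bytes stored in", len(container), "bytes")
```

The point of the header is that someone opening the file decades from now can read the first line in any text viewer, learn which published algorithm was used, and confirm after decompression that the bits match what was originally written.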

The Library’s Lead Storage Engineer Thomas Youkel emphasized that a big part of preservation archival is planning for format migration (e.g. re-encoding old video to H.264), media migration (moving it from 3/4″ tape to LTO-4), and platform migration (replacing an old Fibre Channel SAN with object storage) to take advantage of enterprise support and improvements in performance/capacity/density. With each migration (roughly at 5-year intervals, and sometimes taking multiple years to accomplish), the archive owner has the opportunity to add technologies with promising value or remove any technologies that didn’t deliver. So in the end, the 50- or 100-year preservation concern may not be relevant in the context of IT-oriented decisions. As long as you can trust Dell to be here in 5 years (and I’ll bet my boss’s paycheck on that), then the storage platform is taken care of.

Check out the Library’s initiatives at www.digitalpreservation.gov


About Mike Davis

Mike manages all marketing efforts at Ocarina.
