The fight between NetApp and EMC over Data Domain has put deduplication into the spotlight as never before. Yet, dedupe for backup is now a mature market, and in many ways it represents the past. The next step, in my view, is that dedupe must evolve for object stores and fixed content archives. Ultimately, a dedupe product is going to have to work across the entire spectrum of data storage - hot primary data to archive to backup and beyond.
So while dedupe for backups is where it began, next up was dedupe for primary storage. As we at Ocarina quickly discovered, online storage is quite a bit different from backups. None of the successful backup dedupe vendors - Data Domain, Diligent, Quantum and others - has been able to make a mark in dedupe for primary storage. Rather, NetApp - with its dedupe in the WAFL file system - and Ocarina have emerged as the two vendors that really have a data reduction and dedupe strategy that works for primary storage.
There’s a third, rapidly-growing segment of storage that is explicitly archive storage. While the lines between nearline, archive, and backup can sometimes be blurry, true archives tend to have certain characteristics. They are mostly object stores, they have object rather than file system interfaces (get/put/post rather than read/write), they have WORM and persistence guarantees, and they have features that allow you to manage compliance with various regulations. Examples of true archive storage are HDS’ HCAP, EMC’s Centera (and Atmos, for that matter), Caringo, and several of the cloud storage offerings like Amazon S3 or Iron Mountain Digital.
Archive storage is in some ways a prime candidate for deduplication and compression. By its nature, things are put in for long-term storage, the archives grow and grow over time – so keeping costs down and making room for more data is important – and the access patterns tend to be WORN (not WORM) - that is, “write once, read (almost) never.”
But object stores in general, and archives in particular, present some interesting issues for dedupe technology, and I think that just taking backup dedupe or primary dedupe and applying it against object stores is not going to work. First of all, archives often offer guarantees of immutability. That means, if I put something in an archive, I am guaranteed that it will remain unchanged. What does that really mean? If an archive is going to last for 100 years, or even 30 years, you can be assured that the hardware it is deployed on will change 5 or more times during the life of the object archived.
So if I move an object from Vendor A’s archive, with a block size of 4K, to a new archive from Vendor B with a block size of 8K, has that object changed? Or is how I store it separate from the guarantee that the contents of the object have not changed? It’s a key question, because it is at the heart of whether it is okay to compress and/or dedupe objects in an archive. If I put an object in an object store and take a 512-bit checksum of that object’s contents, then I compress it, is that OK as long as when I decompress it, the 512-bit checksum tells me that the object is bit-for-bit identical to when it was put in?
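To make the question concrete, here is a minimal sketch of that check in Python. It assumes SHA-512 as the 512-bit checksum and zlib as the compressor; the function names `put` and `get` are hypothetical and do not correspond to any particular archive product's API.

```python
import hashlib
import zlib
from typing import Tuple


def put(content: bytes) -> Tuple[str, bytes]:
    """Checksum the original bytes at ingest, then compress for storage."""
    digest = hashlib.sha512(content).hexdigest()  # 512-bit checksum (assumed SHA-512)
    return digest, zlib.compress(content)


def get(digest: str, stored: bytes) -> bytes:
    """Decompress, then verify against the checksum recorded at ingest."""
    content = zlib.decompress(stored)
    if hashlib.sha512(content).hexdigest() != digest:
        raise ValueError("content does not match ingest checksum")
    return content


memo = b"Quarterly results memo. " * 200
digest, stored = put(memo)
restored = get(digest, stored)
```

The stored bytes differ from what the user checked in, but what comes back is verified bit-for-bit against the checksum taken at ingest. Whether that satisfies an immutability guarantee is exactly the open question.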
What about dedupe? In dedupe, the whole idea is that I store redundant information only once. In an archive, a user may be checking in a memo from the CEO that has to be kept for Sarbox compliance – but it turns out that a graphic in the memo was already stored from a PowerPoint. So it could be deduped. That would save space. To ensure that I can always get back the CEO’s memo, though, any dedupe solution must keep reference counts on how many objects are referencing a given object.
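A rough sketch of that bookkeeping, in Python, might look like the following. Everything here is illustrative: the `DedupeStore` class is hypothetical, objects arrive already split into parts by the caller, and the reference counts live in a table separate from the stored bytes, which is one possible design rather than how any particular product does it.

```python
import hashlib
from typing import Dict, List


class DedupeStore:
    """Toy content-addressed store with per-chunk reference counts."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}        # chunk key -> bytes, stored once
        self.refcounts: Dict[str, int] = {}       # chunk key -> number of references
        self.objects: Dict[str, List[str]] = {}   # object name -> ordered chunk keys

    def put(self, name: str, parts: List[bytes]) -> None:
        if name in self.objects:
            raise KeyError(name + " already exists (write once)")
        keys = []
        for part in parts:
            key = hashlib.sha256(part).hexdigest()
            if key not in self.chunks:
                self.chunks[key] = part           # first copy: actually store it
            self.refcounts[key] = self.refcounts.get(key, 0) + 1
            keys.append(key)
        self.objects[name] = keys

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[key] for key in self.objects[name])

    def delete(self, name: str) -> None:
        for key in self.objects.pop(name):
            self.refcounts[key] -= 1
            if self.refcounts[key] == 0:          # last reference gone: reclaim space
                del self.refcounts[key]
                del self.chunks[key]


store = DedupeStore()
graphic = b"<chart bytes>"
store.put("slides.ppt", [b"slide text", graphic])
store.put("ceo-memo.doc", [b"memo text", graphic])
```

The graphic is stored once and referenced twice. Deleting the PowerPoint later drops the count to one, so the CEO's memo can still be reassembled in full.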
A reference count is a dynamic value – it can and must change every time a reference to an object is either added or removed. But if an object is immutable, am I allowed to change a field on it, such as a metadata value for ref count? Is it okay to mix dedupe domains across the Sarbox archive and the email archive or the medical images archive? What does XAM mean for dedupe?
Some of these are not necessarily technical problems; they are legal ones. But the architecture of how you do dedupe for object stores, how dedupe works in a get/put environment with no standard file system, and how portability of objects across many years of hardware refresh would still work correctly if the object store has been compressed and deduped all point to some fundamentally different things that are going to have to be done for a vendor to have a true “Dedupe for Archive” or “Dedupe for Object Store” solution. It’s the third leg of the stool, along with dedupe for primary and dedupe for backups. The holy grail, which we’ll post about soon, is end-to-end dedupe – how you keep file data in its most space-optimal form throughout its lifecycle. But you can’t get to the holy grail until you have viable, appropriate solutions for each of the main types and tiers of storage.






Interesting stuff, Carter. Dedupe is still a young pup. The archiving issues are especially challenging. Will the technology industry be able to drive this through the legal and healthcare establishments? I suspect many lawyers and doctors would say “Nope, you have to save every last byte every time.”
Curtis Preston has a longer set of thoughts on the general issues here:
http://www.backupcentral.com/content/view/177/47/