In the wake of the high-profile Data Domain acquisition battle, data deduplication is the technology du jour. This has led to wide-ranging discussions on the topic, including the question of where deduplication is being applied.
A recent Wikibon article, “Pitfalls of compressing online storage,” notes that Data Domain represents only one sector of the market: dedupe for backups. But what of dedupe and compression for online storage? (Though technically, of course, even backups could be said to qualify as “online storage” since they are still accessible.)
The authors, analysts Dave Vellante and David Floyer, come to conclusions similar to those many storage analysts have already reached on this topic: namely, that dedupe for online storage is a different animal, and that the vendors who service the backup world aren’t equipped to handle it. In fact, a whole new market has emerged in which a different set of players is providing the data reduction solutions for primary/nearline storage. This, they note, has added a whole new layer of complexity to storage, with drawbacks of its own. All true from this somewhat limited viewpoint.
InfoStor’s Dave Simpson summarized the Wikibon article in a recent post, boiling online storage optimization down to three market sectors:
–“Data deduplication light” approaches such as those used by NetApp and EMC
–Host-managed data reduction (e.g., Ocarina Networks)
–In-line data compression (e.g., Storwize)
And so, as is typical of an emerging technology area, dedupe is currently viewed as a set of point solutions for each tier of storage. Hot storage? Try compression with Storwize. Online, nearline, archive? Try Ocarina. Backup? Try Data Domain. This is where the situation stands at the moment, and for the most part the analysts are correct to summarize it this way. However, we believe that this is just the beginning of a larger trend, to which I referred in an earlier post, “The Dedupe (R)evolution.” What will really change the face of dedupe over the next couple of years is the concept of end-to-end optimization.
That is, instead of shrinking a file on a filer, then expanding it to copy it to an archive, where it will be shrunk by a different solution, and then expanding it again to back it up, where it will be deduped by yet another solution, why not optimize a file early in its lifecycle and then manage it as an optimized object as you move it, back it up, and archive it? That saves not only storage space, but also network and SAN bandwidth, along with the CPU cycles, power, and cooling spent repeatedly expanding and re-deduping files.
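The bandwidth side of that argument can be sketched with back-of-the-envelope arithmetic. The model below is illustrative only: the function names, the 100 GB data set, and the 4:1 reduction ratio are assumptions chosen for the example, not measurements from any product.

```python
GB = 10**9


def bytes_moved_point_solutions(logical_bytes: int, hops: int) -> int:
    """Each hop expands the data first, so the full logical size is moved."""
    return logical_bytes * hops


def bytes_moved_end_to_end(logical_bytes: int, hops: int, reduction_ratio: float) -> int:
    """Data stays optimized across hops, so only the reduced size is moved."""
    return int(logical_bytes / reduction_ratio) * hops


logical = 100 * GB   # assumed size of the data set before optimization
hops = 2             # filer -> archive, then archive -> backup
ratio = 4.0          # assumed 4:1 reduction, purely for illustration

point = bytes_moved_point_solutions(logical, hops)
e2e = bytes_moved_end_to_end(logical, hops, ratio)
print(point // GB, e2e // GB)  # prints "200 50"
```

Under these assumed numbers, keeping the file optimized moves a quarter of the bytes, before counting the CPU cycles saved by not expanding and re-reducing at each tier.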
For now, the differences as sketched out by Simpson and Vellante are reasonably accurate from our point of view. However, some of the advantages and disadvantages may be overstated. One example is the performance difference between in-band (in-line) and out-of-band (post-process) approaches. Think of it this way. In general, whether for backup or primary storage, an in-line approach is going to support faster throughput than a post-process approach, while a post-process approach, because it has time to do more intelligent things, should always get better data reduction. The question is, how much of a given customer’s data needs the (sometimes slight) performance advantage of an in-line solution? If performance is good enough with post-processing, most customers would prefer to get the maximum space savings.
Most industry data suggests that over 90% of online file data, even data on high-performance primary tier filers, is accessed only infrequently starting three days after it is created.
At the same time, the vast majority of a customer’s terabytes are in files that are well over three days old. So if you have a way to choose which files you optimize, you can use post-processing and still get performance as good as or better than an in-line process. You would simply do nothing to files as they are being created (so no performance penalty at all) and for their first n days of existence. Then, when the time is right, you optimize the file for maximum space savings.
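As a minimal sketch of that selection step, the snippet below walks a directory tree and yields only files older than n days. The function name is hypothetical, and using modification time as a stand-in for "days of existence" is an assumption; a real product may track access or creation time instead.

```python
import time
from pathlib import Path
from typing import Iterator, Optional

SECONDS_PER_DAY = 86_400


def files_ready_for_optimization(
    root: Path, min_age_days: float, now: Optional[float] = None
) -> Iterator[Path]:
    """Yield files under root whose modification time is at least min_age_days old."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * SECONDS_PER_DAY
    for path in root.rglob("*"):
        # Newer files are skipped entirely, so they pay no performance penalty.
        if path.is_file() and path.stat().st_mtime <= cutoff:
            yield path
```

A scheduled job could call `files_ready_for_optimization(Path("/some/share"), min_age_days=3)` and hand each result to whatever post-process optimizer is in use; the path here is a placeholder.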
A key to achieving the best balance of space savings and performance is the ability to optimize data by policy, and to tune the level of optimization by file type, owner, modified time, and a wide set of performance and tuning choices. The conventional wisdom is “Storwize for live databases, and Ocarina for files,” and in general we think that is more or less true. But we also think that for most customers, in 9 of 10 situations, the biggest dollar savings will come from policy-based optimization of files. As it happens, Ocarina is the vendor with this capability, which in the end translates to notable differences in how much money each solution saves customers.
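To illustrate what "optimize by policy" could look like, here is a small first-match rule table keyed on file type, owner, and age. Everything in it is hypothetical: the class names, the level labels ("none", "fast", "max"), the file suffixes, and the owner name are assumptions for the sketch and do not describe any vendor's actual policy engine.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class FileInfo:
    name: str
    owner: str
    age_days: float  # days since last modification


@dataclass(frozen=True)
class Rule:
    level: str                                 # hypothetical labels: "none", "fast", "max"
    suffixes: Optional[Tuple[str, ...]] = None  # None means any file type
    owner: Optional[str] = None                 # None means any owner
    min_age_days: float = 0.0

    def matches(self, f: FileInfo) -> bool:
        if self.suffixes is not None and not f.name.lower().endswith(self.suffixes):
            return False
        if self.owner is not None and f.owner != self.owner:
            return False
        return f.age_days >= self.min_age_days


def choose_level(f: FileInfo, rules: Sequence[Rule], default: str = "none") -> str:
    """Return the level of the first matching rule, or the default if none match."""
    for rule in rules:
        if rule.matches(f):
            return rule.level
    return default


# Example policy (all values are illustrative assumptions).
POLICY = [
    Rule(level="none", suffixes=(".mdf", ".ldf")),                  # leave live database files alone
    Rule(level="none", owner="dba"),                                # skip everything owned by this user
    Rule(level="max", suffixes=(".jpg", ".tif"), min_age_days=3),   # aged images: maximum savings
    Rule(level="fast", min_age_days=3),                             # any other aged file
]
```

Because rules are evaluated in order, exclusions go first and the broad catch-all goes last; files younger than three days fall through to the default and are left untouched.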
Overall, it’s good to see these topics being wrestled over among analysts and journalists, and we hope it continues.