Dedupe for Primary – Another Shade of Blue


There’s been a great deal of talk about the dedupe offerings of major vendors, and some recent posts have created even more FUD, such as, in our view, this recent example from NetApp’s Alex McDonald. As someone whose company, Ocarina, is a specialist in dedupe for primary storage, I’d like to add my viewpoint to the conversation.

Data reduction is all about finding the most efficient representation for a set of information, and anything that supports that goal is fair game. I think that Shannon would agree with that. What this means is that dedupe is a type of compression, rather than something different from compression. This is an important distinction to understand.

Like traditional compressors, dedupe is a way of taking a set of data — in most cases, a large set of files — and looking for redundant information in that set. If you can find redundant information and replace it with a more efficient representation, then that’s compression as Shannon would think of it. Dedupe is different only in that the typical deployment is across a whole set of files, rather than on a single file or transmission at a time.

In fact, within both dedupe and compression, there are hierarchies of how you approach reducing data size, with the general theme being that the more granular you can be about finding redundant information, the more effective you can be at representing the original full data set with the minimal amount of data.

For many years, most compression work focused on compression for transmission (to get around narrow bandwidth network connections). This meant that a lot of it centered around realtime compression and compression with a very small context.

Now that the focus is moving towards saving money on storage, compression research is starting to look at multiple ways to get optimal results when looking at very large contexts, such as all the files on a filer or in an enterprise. As an example, a hierarchy of dedupe might be whole file single instancing, then whole block dedupe, then variable “sliding window” dedupe, and then content-aware object dedupe. Each one gets successively more granular about how to find duplicate information within and across a set of files. This does not mean that one is better or worse than another. If you can find a whole duplicate file, that’s great, and very effective.
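To make the first two rungs of that hierarchy concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not any vendor's implementation: the 4096-byte block size, the SHA-256 fingerprints, and the helper names (make_block, whole_file_stored_bytes, fixed_block_stored_bytes) are all made up for this example.

```python
import hashlib

BLOCK = 4096  # assumed block size for this sketch


def make_block(k):
    """Build a deterministic, distinct 4096-byte block (illustrative data only)."""
    return hashlib.sha256(str(k).encode()).digest() * (BLOCK // 32)


def whole_file_stored_bytes(files):
    """Whole-file single instancing: keep one copy per distinct file."""
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())


def fixed_block_stored_bytes(files, block_size=BLOCK):
    """Fixed-block dedupe: keep one copy per distinct aligned block."""
    unique = {}
    for f in files:
        for i in range(0, len(f), block_size):
            block = f[i:i + block_size]
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


blocks = [make_block(k) for k in range(8)]
file_a = b"".join(blocks)
# file_b differs from file_a in exactly one of its eight blocks.
file_b = b"".join(blocks[:3] + [make_block(100)] + blocks[4:])

print(whole_file_stored_bytes([file_a, file_a]))   # 32768: exact duplicate found
print(whole_file_stored_bytes([file_a, file_b]))   # 65536: files differ, nothing saved
print(fixed_block_stored_bytes([file_a, file_b]))  # 36864: seven shared blocks stored once
```

Whole-file single instancing is ideal for the exact duplicate and finds nothing once a single block changes, while the block-level pass still recovers the seven shared blocks. Neither handles a one-byte insertion at the front of a file, which shifts every block boundary; that is the case the sliding-window approaches are aimed at.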

The same kind of thing applies to compression algorithms. There are generic algorithms, like Lempel-Ziv, found at the heart of almost every standard compressor, which look for patterns in a stream of 0s and 1s. These compressors operate on a limited context, as they look for patterns within a fixed window of data. If redundant information occurs outside that window, a compressor might not get good compression results.
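The window effect is easy to see with Python's standard zlib module, whose DEFLATE format looks back at most 32 KB. The sketch below is only an illustration: the data is pseudo-random filler, and the sizes (20,000 and 100,000 bytes) and helper names are assumptions chosen to put the repeated copy inside and outside that window.

```python
import random
import zlib


def pseudo_random_bytes(n, seed):
    """Deterministic filler that a generic compressor cannot shrink on its own."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


def ratio(data):
    """Original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, 9))


# Same payload twice, with the second copy starting 20,000 bytes after the first:
# the repeat falls inside DEFLATE's 32 KB window.
near = pseudo_random_bytes(20_000, seed=1) * 2

# Same idea, but the second copy starts 100,000 bytes later:
# the repeat falls outside the window.
far = pseudo_random_bytes(100_000, seed=2) * 2

ratio_near = ratio(near)
ratio_far = ratio(far)
print(f"repeat inside window:  {ratio_near:.2f}:1")
print(f"repeat outside window: {ratio_far:.2f}:1")
```

Both inputs are exactly 50% redundant, yet only the first should come out at roughly 2:1; the second should stay close to 1:1 because the compressor never sees the earlier copy.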

Furthermore, much of the data in today’s files has already been compressed by a generic compressor, which means that no obvious “in the window” patterns should be left for easy pickings. Common file types like Microsoft Office, PDF, JPEG and almost all rich media, healthcare, life sciences, and energy files are already compressed. Generic compressors, therefore, are unlikely to find, say, a corporate logo embedded in a PowerPoint, a PDF, and a Word document, even though that logo is completely redundant information. Consequently, we are starting to see more advanced compressors that either know the specific context for a given file type (e.g., how to find redundant chrominance and luminance information in a JPEG) or that are re-compressors, which know how to decompress an already-compressed file and then recompress it with a better-fit compressor for that file’s specific data type.
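A toy version of the embedded-logo problem, again in Python with zlib standing in for whatever compression a real file format uses. Everything here is an assumption for illustration: the "logo" is a made-up byte pattern, the two "documents" are simple strings with the logo appended, and real Office or PDF containers are far more involved.

```python
import zlib

# Hypothetical stand-ins: one shared logo embedded in two different documents.
logo = bytes(range(256)) * 16
deck = b"Quarterly sales deck. " * 50 + logo
memo = b"Engineering design memo. " * 50 + logo

# Each document is stored already compressed, as many file formats do.
packed_deck = zlib.compress(deck)
packed_memo = zlib.compress(memo)


def both_contain_logo(a, b):
    """True if the raw logo bytes are visible in both byte strings."""
    return logo in a and logo in b


# Looking at the stored (compressed) bytes, the shared logo is invisible.
visible_when_packed = both_contain_logo(packed_deck, packed_memo)

# A content-aware tool decompresses first, which exposes the shared object.
visible_when_unpacked = both_contain_logo(
    zlib.decompress(packed_deck), zlib.decompress(packed_memo)
)

print(visible_when_packed)    # False
print(visible_when_unpacked)  # True
```

Once the containers are unpacked, the logo can be stored once and the remainder recompressed with whatever compressor fits that data type best.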

These new techniques are the natural result of the focus of compression research moving from “compression for transmission” to “compression for storage.” I believe we’re only at the front edge of what’s possible here. There is no one right answer. Furthermore, without apples-to-apples data sets, it’s impossible to gauge the commercial offerings out there. Everyone has outrageous claims about their product’s effectiveness: 20:1, 50:1, 100:1. You can construct a data set that will get any result you want in the dedupe/compression world. If you have whole file single-instancing, you can make a directory with 100 identical files and get 100:1 results. If you have fixed block dedupe in your filer, you can make a bunch of virtual machines that are nearly identical copies of Windows and claim fantastic results.
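The arithmetic behind a made-to-order ratio is simple enough to show in a few lines of Python. This is a sketch with invented data: the 1 KB "files" and the single_instance_ratio helper are assumptions for the example, not a model of any product.

```python
import hashlib


def single_instance_ratio(files):
    """Raw bytes divided by bytes kept after whole-file single instancing."""
    raw = sum(len(f) for f in files)
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return raw / sum(unique.values())


# A directory built to flatter the product: 100 copies of one 1 KB file.
identical = [b"same payload".ljust(1024, b".")] * 100

# A directory of 100 different 1 KB files.
distinct = [f"report {i}".encode().ljust(1024, b".") for i in range(100)]

print(single_instance_ratio(identical))  # 100.0
print(single_instance_ratio(distinct))   # 1.0
```

The same code reports 100:1 or 1:1 depending entirely on the data it is fed.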

These results don’t necessarily translate to broader data sets or customer use cases. In the compression research world, there are standard test data sets and benchmarks, such as the Calgary Corpus and the Hutter Prize, that researchers can try their algorithms on to see how they do against all comers. There is no “corporate corpus” with a good mix of the kinds of files and data sets found in an enterprise IT shop that would provide that level playing field for the primary storage solutions emerging now. Because that does not exist, customers should focus neither on vendor claims nor on arguments about whose approach is best. They should test relevant solutions on their own mix of real world data and see who does the best for their situation.

About Carter George

Carter runs storage strategy for Dell

4 Responses to “Dedupe for Primary – Another Shade of Blue”

  1. Alex McDonald March 10, 2009 at 6:47 am #

    Thanks for the polite and thoughtful blog.

    I want to focus in on the real world aspects of deduplication vs compression, rather than on the theoretical. The argument about the differences seems to boil down to one of definition and practicality, as opposed to theoretic considerations. FUD it’s not; there are significant differences between the two.

    To include deduplication as compression is, at least to me, a little like the argument over jam or marmalade. If you as an American ask for marmalade to spread on your toast, in the UK at least, you’ll get a bitter concoction made with oranges. Here, we’d think of American marmalade as jam; it’s sweet.

    IMHO it’s the same for compression and deduplication. It’s not all jam.

    Secondly, the practicality of deduplication vs compression.

    Compression has a major overhead in use; files must be decompressed, either on disk or on the fly in memory, before they can be used. This is true of both reads and writes.

    In-memory decompression requires short blocks of compression, but very high compression ratios are only achieved by global or whole file techniques. So the better the compression, in general the more CPU time required compressing and decompressing. With global file compression, it means decompressing the whole file, even to read it.

    Undeduplication (how I hate that word!) has no decompression overhead (apart from indirection, which compression and a great many file systems also require) on read. On write, inflating the file may be required. But NetApp avoids having to reinflate and make a copy or replica for writes, because the same mechanism (block indirection) is used whether files are deduplicated or not. NetApp block updates do not write in place, but rather create a new block, then point to it.

    Incidentally, deduplicating machine images is very productive, for more than just space reasons. Once the data has been block deduplicated by 70% or even more, cache lookaside saves on the slowest part of the IO equation, doing the read from disk. This is another reason why deduplication is not compression.

    Hence my insistence on keeping the marmalade separate from the jam. Until you can really provide compression without the downsides of actually using the data and taking the performance hits, claiming that compression=deduplication and that Ocarina do it on primary storage seems a stretch.

    I am, btw, in favour of NetApp introducing some form of compression technology. For certain classes of data, notably the high percentage of files that live in home directories that haven’t been touched in months, it’s invaluable. The alternative of moving the data somewhere else that is cheaper we’ve tried (ILM/HSM etc). In practice, it’s been found problematic and wanting, and compression is an excellent and more manageable alternative.

    Thanks for the opportunity to comment.

    Alex

  2. Jered Floyd March 10, 2009 at 1:47 pm #

    Carter,

    This is spot on; in many ways dedupe is like applying compression technology at a grander scale. As you point out, traditional compression techniques like LZ search for repeated tokens within the same file, or within the (relatively small) active window. Deduplication is like an LZ-style compressor that has an infinite dictionary and can tokenize across many files.

    Alex is correct when he says that reading deduplicated data is generally more efficient computationally than decompression, but that’s really an artifact of chunk-based deduplication. Vendors that do “delta” or “difference based” deduplication have a much bigger computational task. For LZ-type compressors, token lookup isn’t really that different than block indirection.

    In our Permabit Enterprise Archive we differentiate between deduplication and compression, despite the similarities, because there is a difference in the level of granularity. Deduplication is always “on” to the extent that’s a meaningful concept: deduplication of identical chunks (variable in size up to 256 KB) is inherent in the system architecture. Beyond that, however, we offer optional compression that does conventional token-based space-savings within the individual chunks. Keeping track of the minuscule “chunks” that would be the tokenized output of a traditional compression algorithm would have too much overhead, so this two-level approach provides significant savings at very high performance.

    As for the problem of existing compressed file types, this is definitely a challenge to deduplication. This sort of thing is also addressed by including compression within the storage, so that the data can be written to a storage system uncompressed, or perhaps uncompressed, chunked, deduplicated and recompressed by the storage system.

    The big problem with the latter case is that most compressed types are defined by the decompression routine while leaving compression open to vendor improvements. This means that the data represented by a file processed in such a way would be identical when read, but the compressed file itself may be modified, a property that could run afoul of data protection regulations.

    My understanding is that you do something like this latter case in your product; how do you address this data correctness issue?

    Regards,
    Jered Floyd
    CTO, Permabit
