Compression and Dedupe like Oil & Water?


Ocarina customer Imagination Technologies got some good press the other day in a SearchStorage article reviewing solutions for primary storage optimization. Imagination, based northwest of London, provides key pieces of semiconductor IP that are in just about every smartphone out there, and they used Ocarina to double their effective storage capacity in the same datacenter footprint. [Note to Steve Jobs: I see you used them for iPhone graphics; can you also use their Flash acceleration? Please?] The underlying storage there is Network Appliance, and our co-processor is happily reaching in via NFS to shrink all the archived project data kept for reference and restore.

As at most every other company whose product is digital, access patterns follow a 90/10 rule: there's a hot set, and there's the rest, and the rest is the stuff you should target for data reduction. For someone who's budget constrained or datacenter constrained (and I think that covers about everyone except maybe the NRO), data reduction brings real benefits to online storage, whether that comes from NetApp, Storwize, or Ocarina. Of course if it comes to a shrink-off, we'll take any challenge from any taker with any dataset!

Dedupe and Compression like oil and water?
Brian raises the contentious issue of whether dedupe is better or compression is better, and whether you should do both. The truth is that dedupe'd data is harder to compress, and compressed data is harder to dedupe. If you apply a compression-only workflow to a dataset, let's say you get 50% savings. Now run the same dataset through a dedupe-only workflow and you'll get maybe 20% (remember, this is primary storage, not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might save another 35% of the original size, for a total of 55%. So compression of deduped data is less effective than compression of the raw dataset, but the combination (in this example) has eked out a five-point advantage over the compression-only workflow. Of course it all depends on the data, and there's a high burden on the software to be really smart about how and when it chooses different algorithms.
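To make the arithmetic in that example explicit, here is a minimal Python sketch. The 50%, 20%, and 55% figures are the illustrative numbers from the paragraph above, not measurements, and the function names are made up for this example. It treats savings as fractions of the original dataset size and works out what compression rate on the already-deduped data the 55% total implies (it comes out to 43.75%, below the 50% achieved on raw data).

```python
def combined_savings(dedupe_savings: float, compress_on_remainder: float) -> float:
    """Total fraction saved when compression runs on what dedupe leaves behind."""
    remaining = (1.0 - dedupe_savings) * (1.0 - compress_on_remainder)
    return 1.0 - remaining


def implied_compression_on_remainder(dedupe_savings: float, total_savings: float) -> float:
    """Compression rate on the deduped data that would yield the given total."""
    return 1.0 - (1.0 - total_savings) / (1.0 - dedupe_savings)


# Illustrative figures from the post (assumptions, not benchmark results).
COMPRESSION_ONLY = 0.50
DEDUPE_ONLY = 0.20
COMBINED = 0.55

implied = implied_compression_on_remainder(DEDUPE_ONLY, COMBINED)
advantage_points = (COMBINED - COMPRESSION_ONLY) * 100

print(f"Compression achieved on deduped data: {implied:.2%}")
print(f"Combined total: {combined_savings(DEDUPE_ONLY, implied):.2%}")
print(f"Advantage over compression-only: {advantage_points:.0f} points")
```

The same helper shows why the result is so data dependent: change either input and the combined figure moves with it.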

It takes some skill and forethought to do both well, and Ocarina's algorithm selection logic is tuned well enough (thanks partly to the use of a neural network) that the combination of dedupe plus compression will deliver, say, 80% savings (or 5x effective capacity) when processing an enterprise-profile dataset (where there's some redundancy in the data). For some vertical applications, however, the benefit of adding deduplication is so slight that it'll actually be disabled. A life-sciences dataset, for example, has relatively little data redundancy.

Steve from Storwize is right that in many situations compression can do much better than dedupe (for primary storage apps), and that's certainly true when considering most of the prevailing dedupe algorithms such as SIS (full file), static block (NetApp A-SIS), or dynamic block. The key difference for Ocarina, and the way to make dedupe pay off in primary storage, is, as Brian explains in the article, that we “pull files apart and deduplicate their constituent elements.” In other words, we find reduction where no one else can. Imagine, for example, how cool it is when ECOsystem delivers dedupe benefits even when files have already been “single-instanced”.
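To illustrate why dedupe granularity matters, here is a rough Python sketch comparing whole-file single-instancing with fixed-block dedupe on two files that share a common header. It is only a toy under stated assumptions: the function names, the 4 KB block size, and the sample data are invented for this example, and it does not represent how Ocarina's ECOsystem, NetApp, or any other product actually works.

```python
import hashlib


def whole_file_dedupe_size(files: list[bytes]) -> int:
    """Bytes stored if only byte-identical files are single-instanced."""
    unique = {hashlib.sha256(data).digest(): len(data) for data in files}
    return sum(unique.values())


def fixed_block_dedupe_size(files: list[bytes], block_size: int = 4096) -> int:
    """Bytes stored if identical fixed-size blocks are kept only once."""
    unique = {}
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


# Hypothetical sample data: an 8 KB shared header made of non-repeating
# bytes, followed by a different 4 KB tail in each file.
header = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest() for i in range(256))
files = [header + b"A" * 4096, header + b"B" * 4096]

raw_size = sum(len(f) for f in files)
file_level = whole_file_dedupe_size(files)
block_level = fixed_block_dedupe_size(files)

print(f"raw: {raw_size}  whole-file: {file_level}  fixed-block: {block_level}")
```

Whole-file dedupe finds nothing here because the two files differ, while block-level dedupe stores the shared header only once. Finer-grained or content-aware approaches push the same idea further, at the cost of more metadata and processing.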


About Mike Davis

Mike manages strategy and planning for Dell's FluidFS NAS, backup, and data reduction solutions in Dell's Enterprise Solutions Group.

5 Responses to “Compression and Dedupe like Oil & Water?”

  1. The Storage Alchemist April 1, 2010 at 7:01 am #

    Mike, I tried to ask some questions on Twitter yesterday, but perhaps the timing didn't work out for a response. I was curious: you mention that by compressing deduplicated data you can get an additional 5% advantage over straight compression. My question is ‘at what cost’? Sure, there is a $ cost, but when doing deduplication and compression on primary storage the questions I have are:
    1) What is the system overhead required to do this, and can I do this in real time?
    2) What format is the data in when it is compressed and deduplicated? In other words, can the user access this data on the fly, or do you need to perform a rehydration process on it in order to read it?

    Seems to me, the time it takes to do all of the ‘work’ required to do both could slow the system down such that it can't happen in real time, and I couldn't use the data without some additional process (rehydration) to get at the info.

    I’d like to hear your thoughts.

    Best,
    Steve

  2. The Storage Alchemist April 1, 2010 at 7:02 am #

    Oh, one more question – Deduplication and Compression – Oil & Water? or Milk and Cookies?

  3. josephmartins June 2, 2010 at 8:56 am #

    A bit late for me to comment, but I will anyway.

    I don’t know why, but there is a misconception among practitioners about the technologies we all refer to as compression and deduplication. They aren’t oil and water. They’re more like two different flavors of water.

    A few of my past comments about the topic can be found here:
    http://wikibon.org/wiki/v/Pitfalls_of_compressing_online_storage

    I wrote, “De-dupe (of the type the storage industry now markets) IS a form of compression. Traditional “dedupe” (aka compression) occurs within a file (e.g. gif, jpg, zip, tar, etc) where the data resides along with its dictionary. In contrast, storage-level de-dupe is executed across files and repositories using an external “dictionary” managed separately from the files. That is to say, compression is little more than ultra-granular deduplication occurring within a file.”

    Obviously, it is always possible to further compress data that has been “deduplicated” by looking for redundancy at a more granular level than that which was used to deduplicate the original data. It is no different, in practice, than turning up the level of compression or deduplication from low to high such that comparisons are made between progressively smaller strings.

    So, Mike, figures such as 50% compression, 20% deduplication and a combined 55% (dedupe and compression) are absolutely meaningless without additional context. The outcome of each test would depend on the base (and relative) levels of compression/deduplication. Crank up dedupe (using smaller or variable length strings) and the benefit from further compression will drop. Dial back the dedupe and the benefits from further compression will jump. A useful (if imperfect) analogy: It’s a bit like compressing a JPG in Photoshop using the “low” setting, then compressing the output file again using the “high” setting….albeit lossy.
