Every dedupe ratio can be converted to a percentage of data reduction and vice versa. The dedupe ratio measures against what’s left after you’ve deduped. The percentage measures against the size of the data before you dedupe. Both are valid measures. It’s also true that some solutions do a better job shrinking your data than others. Dedupe solutions that do a better job should be ranked higher when you are comparing solutions. That said, comparing the claims made on vendor websites is not a very good way to find out who can actually shrink your data better.
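To make the conversion concrete, here is a minimal sketch in Python. The function names are my own for illustration, and it assumes a ratio written as N:1 means original size divided by the size left after dedupe.

```python
def ratio_to_percent(ratio: float) -> float:
    """Percent of the ORIGINAL data eliminated, given a ratio expressed as N:1."""
    if ratio < 1.0:
        raise ValueError("a dedupe ratio of N:1 needs N >= 1")
    return 100.0 * (ratio - 1.0) / ratio


def percent_to_ratio(percent: float) -> float:
    """Ratio N (as in N:1), given the percent of the original data eliminated."""
    if not 0.0 <= percent < 100.0:
        raise ValueError("percent reduction must be in [0, 100)")
    return 100.0 / (100.0 - percent)


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        print(f"{n}:1 is a {ratio_to_percent(n):.1f}% reduction")
```

Run this and 10:1 comes out as a 90% reduction while 20:1 comes out as 95%. Doubling the ratio only removes another five points of the original data, which is why the ratio form can look more impressive than the percentage.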
Why? Vendors lie.
“Lies, damned lies, and statistics” is a phrase describing the persuasive power of numbers <http://en.wikipedia.org/wiki/Number>, particularly the use of statistics <http://en.wikipedia.org/wiki/Statistics> to bolster weak arguments <http://en.wikipedia.org/wiki/Argument>, and the tendency of people to disparage statistics that do not support their positions.
The term was popularized in the United States <http://en.wikipedia.org/wiki/United_States> by Mark Twain <http://en.wikipedia.org/wiki/Mark_Twain> (among others), who attributed it to the 19th Century British <http://en.wikipedia.org/wiki/Great_Britain> Prime Minister Benjamin Disraeli <http://en.wikipedia.org/wiki/Benjamin_Disraeli> (1804-1881): “There are three kinds of lies: lies, damned lies, and statistics.”
I tend to agree with Dipesh’s original point that dedupe ratios are a little more misleading than percentages in measuring the effectiveness of data reduction, but either way what matters is “which vendor can really shrink my data better?” I agree with Howard and Curtis, who point out that it does matter when one product can do a better job of actual data reduction. And the more data you have, and just as importantly, the more you want to keep, the more it matters.
Lost somewhat in this discussion is whether the dedupe ratios claimed by vendors are actually true. The claims are not actually lies; they’re worse: they are statistics based on a set of assumptions favorable to each vendor. There’s no industry standard for reporting these numbers, nor is there a standard public data set that everyone can run their dedupe and compression engines against to get even-steven numbers. These ratio claims, whether they are 10:1 or 20:1 or whatever, are based on a set of assumptions that you have to go trolling in the fine print to find. For example, on the first full backup you run through almost any dedupe engine, you are not going to see anything remotely like 10:1 or 20:1. You’ll be lucky if you get 2:1 – most vendors will deliver less than that. If you are talking about dedupe for primary storage (NAS, a block storage array, or a nearline archive), most vendors will never achieve 10:1, let alone higher ratios. Before we argue about whether 20:1 really is twice as good as 10:1, maybe we should figure out if any vendor can even get you 5:1 in the real world.
The high ratios you see in all these claims are based on the assumption that you are backing up the same data over and over. A vendor makes an assumption about a change rate in the data (e.g., 10% of the data gets changed each day), and they assume you back up every day. They’ll be able to make a higher claim if they assume that you do a full backup every day, because then there will be more dupes to throw away. These claims are not based on how much they can shrink your first backup. They are based on how much they’ll have saved after you have done many backups. Look in the fine print to see how many backups you have to run to get the claimed results, and see whether those were fulls or a mix of fulls and incrementals. If you do a full backup every day for 100 days, most dedupe solutions will get a really good result. However, if your situation is not just like the assumptions they made to get to their marketing claim, your results are going to be different from that ratio on their website. The results you actually get when you buy the product are what’s important, aren’t they? And those real results may have very little to do with the claimed results in a marketing brochure.
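To see how much the backup schedule drives the headline number, here is a rough back-of-the-envelope model in Python. It is a sketch under generous assumptions that are mine, not any vendor’s: the dedupe engine is ideal, the first backup contains no duplicates, every changed byte is brand new, and compression is ignored. Sizes are in units of one full backup, and the function name and parameters are hypothetical.

```python
def cumulative_ratio(days: int, change_rate: float, daily_fulls: bool) -> float:
    """Cumulative dedupe ratio (data sent to backup / data actually stored).

    Toy model: ideal dedupe, first backup fully unique, and each later day
    changes `change_rate` of the data with entirely new content.
    """
    logical = 0.0  # everything the backup application sent
    stored = 0.0   # unique data the dedupe store had to keep
    for day in range(days):
        if day == 0:
            logical += 1.0  # first backup is always a full
            stored += 1.0   # nothing to dedupe against yet
        else:
            logical += 1.0 if daily_fulls else change_rate
            stored += change_rate  # only the changed data is new
    return logical / stored


if __name__ == "__main__":
    for fulls in (True, False):
        label = "daily fulls" if fulls else "full + incrementals"
        for days in (1, 30, 100):
            r = cumulative_ratio(days, 0.10, fulls)
            print(f"{label:>20}, {days:3d} days, 10% change: {r:.2f}:1")
```

In this model, 100 daily fulls at a 10% change rate land a little over 9:1, the very first backup is 1:1, and a full followed by incrementals stays at 1:1 because every byte sent is new. The product is identical in all three cases; only the assumptions differ.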
Your mileage will vary. Not “may vary”. Will vary. Some data sets dedupe well (virtual machine images, databases, repetitive full backups). Some do not (Office documents, volumes full of rich media like photos, video, and graphics, Zip files, etc.). Full backups dedupe better than incrementals. If you have encrypted your data, then dedupe won’t accomplish much at all.
Compression may also play a role. Often, the data that does not dedupe well can be compressed well. Almost every dedupe vendor also does some kind of compression. To get the best data reduction result (that’s what we’re talking about here, right?) usually requires some combination of dedupe and compression. Some dedupe solutions just have generic compressors, like LZ or zip, while others have more sophisticated compressors that can deal with complex data types.
Does that matter? Only if you have a lot of the kind of data where compression is key to getting good results. If you have 1,000 identical VMDK clones, then dedupe will get smokin’ results without any compression. If you have 1,000 terabytes of unique photos and compressed Microsoft Office docs, dedupe will get next to 0% data reduction. For most customers, the answer is going to be somewhere in the middle.
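Here is a toy illustration in Python of why the data type matters so much. It is only a sketch: it uses fixed 4 KiB chunks, SHA-256 fingerprints, and zlib as a stand-in for a generic compressor, which is far simpler than what real products do, and the three sample data sets are synthetic stand-ins I made up.

```python
import hashlib
import os
import zlib


def reduce_data(data: bytes, chunk_size: int = 4096) -> tuple[int, int]:
    """Return (bytes after dedupe only, bytes after dedupe plus zlib).

    Toy fixed-size chunking: keep one copy of each distinct chunk,
    then compress each kept chunk on its own.
    """
    unique = {}
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        unique.setdefault(hashlib.sha256(chunk).digest(), chunk)
    deduped = sum(len(c) for c in unique.values())
    compressed = sum(len(zlib.compress(c)) for c in unique.values())
    return deduped, compressed


if __name__ == "__main__":
    image = os.urandom(64 * 1024)
    samples = {
        # 100 identical copies of one 64 KiB "image"
        "identical clones": image * 100,
        # random bytes stand in for already-compressed or encrypted data
        "unique random data": os.urandom(1024 * 1024),
        # distinct but highly repetitive text records
        "unique text records": b"".join(
            b"customer record %08d status=ok\n" % i for i in range(20000)
        ),
    }
    for name, data in samples.items():
        deduped, compressed = reduce_data(data)
        print(f"{name}: {len(data)} -> {deduped} deduped -> {compressed} with zlib")
```

The clones collapse to a single copy through dedupe alone, the random data does not shrink either way, and the text records get nothing from dedupe but a lot from compression. Real data sets usually sit somewhere between these extremes.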
Right now, I don’t see any shortcut to picking the most likely 2 to 4 vendors and having them actually run their wares against a good test data set that truly represents the kind of data you have. Take one full backup, or one typical volume’s worth of your own data, have them sign an NDA, and have them run it through their product and see how well they can shrink it. Rank the vendors’ ability to shrink your data based on the results they get on your own real data, not on the inflated claim numbers you see on the front page of their website.
