Dedupe Misconceptions


As most in the industry are aware, dedupe has become a standard offering from every major vendor. Dedupe for primary has become the technology of the moment, and for good reason – the rising tide of unstructured data is forcing data centers worldwide to rethink capacity planning, tiering, and storage efficiency. But there are still a few lone voices out there who are clinging to the notion that dedupe is unnecessary.

Take, for example, this recent post from Compellent’s Bruce Kornfeld, “Is dedupe the only answer?” Kornfeld is responding to a recent SearchStorage article, “Is Data Duplication Right For Your Primary Storage?”

Dedupe and compression can both be applied directly to primary data, and the savings there can be comparable to what’s seen in backup. On backup data, vendors claim 20x data reduction, and on primary data we think that most customers will see about 5x.

So, you say, “That means that you get four times more space savings on backup, right?” Wrong! A 20x reduction means a savings of 95% against the size of the original data set, and 5x means a savings of 80%. That is a difference of only 15 percentage points – and an 80% space reduction is a huge win for the primary storage user. Of course, vendors who do not have a dedupe solution are likely to tell you that you don’t need it anyway. There are some valid concerns about dedupe for primary, but there are also some misperceptions, and there’s no reason to let misinformation be propagated.
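To make the arithmetic concrete, here is a minimal sketch. The function name `savings_fraction` is made up for this post; the only thing it assumes is that an N-to-1 reduction ratio leaves 1/N of the original data on disk.

```python
def savings_fraction(reduction_ratio: float) -> float:
    """Fraction of the original capacity saved at an N-to-1 reduction ratio."""
    if reduction_ratio < 1:
        raise ValueError("reduction ratio must be at least 1")
    # An N-to-1 reduction keeps 1/N of the data, so the rest is saved.
    return 1.0 - 1.0 / reduction_ratio


for ratio in (20, 5):
    print(f"{ratio}x reduction -> {savings_fraction(ratio):.0%} saved")
```

The curve flattens quickly: going from 5x to 20x quadruples the ratio but adds only 15 points of savings.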

The biggest difference between dedupe for backup and dedupe for primary is that in backup, you dedupe all of the data. There’s no reason not to. In primary data, you might not want to dedupe everything – there are some data sets it does not make sense for. That’s not a knock on dedupe for primary. It just means you should choose which data sets make sense to dedupe.

The first common misperception about dedupe for primary data is that performance will be worse. But this is really not the case. When primary data has been deduped (but not compressed), an application asks the storage for a block, and that block is retrieved. There is one lookup to map the logical block request to the physical one – but those kinds of lookups are already being done in every storage array that has any kind of storage virtualization, such as thin provisioning. The response time on a block read for deduped data is hardly different from that for un-deduped data, and this is true for all primary dedupe solutions – including both NetApp and Ocarina. There’s no more overhead to retrieving a deduped block than there would be in any other block read I/O on any intelligent array – and Compellent, being a leader in arrays with lots of smarts, is well aware of this. The fact that another file may also be sharing that block has zero impact on the time it takes to read it.
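A toy model of that read path might look like the sketch below. It is an illustration only, not any vendor's implementation; `ToyVolume`, `block_map`, and `physical` are hypothetical names, and real arrays keep these structures on disk and in cache rather than in Python dictionaries.

```python
class ToyVolume:
    """Toy model: a logical-to-physical block map in front of a block store."""

    def __init__(self):
        self.block_map = {}  # logical block address -> physical block id
        self.physical = {}   # physical block id -> block contents

    def read(self, lba: int) -> bytes:
        # One map lookup, then one fetch. The steps are identical whether
        # the physical block has one logical owner or a thousand.
        return self.physical[self.block_map[lba]]


vol = ToyVolume()
vol.physical = {0: b"shared block", 1: b"private block"}
vol.block_map = {10: 0, 11: 0, 12: 1}  # logical blocks 10 and 11 share physical block 0

print(vol.read(10), vol.read(11), vol.read(12))
```

Nothing in `read` knows or cares how many logical blocks point at the same physical block, which is the whole point.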

It’s true that for sets of blocks that are changing all the time, you won’t get as much benefit from dedupe. That’s not because the performance will be bad. It is because when you change a block, it’s no longer a dupe. Therefore it has to be stored again as a new block. If you read a deduped block, modify it, and write it back out, it would have been a write in an un-deduped case anyway, so performance, again, is even-steven between deduped and non-deduped volumes. Everyone doing dedupe for primary – NetApp and Ocarina – does the deduplication as a post-process, so there’s no impact at all to write performance. No one is trying to dedupe that block as it is being written.
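Here is a rough sketch of the post-process idea, under some loud assumptions: the class `PostProcessVolume` and its methods are invented for illustration, duplicates are found with a SHA-256 digest (a real product might also compare bytes), and the background pass is something you call by hand. It is not NetApp's or Ocarina's actual design.

```python
import hashlib


class PostProcessVolume:
    """Toy volume: plain writes, with dedupe done later in a separate pass."""

    def __init__(self):
        self.block_map = {}  # logical block address -> physical block id
        self.physical = {}   # physical block id -> block contents
        self._next_id = 0

    def write(self, lba: int, data: bytes) -> None:
        # Write path: store the block as-is. No hashing, no duplicate lookup.
        pid = self._next_id
        self._next_id += 1
        self.physical[pid] = data
        self.block_map[lba] = pid

    def read(self, lba: int) -> bytes:
        return self.physical[self.block_map[lba]]

    def dedupe_pass(self) -> None:
        # Background pass: point identical blocks at one physical copy.
        seen = {}  # content digest -> physical id that is kept
        for lba in sorted(self.block_map):
            pid = self.block_map[lba]
            digest = hashlib.sha256(self.physical[pid]).digest()
            self.block_map[lba] = seen.setdefault(digest, pid)
        # Free physical blocks that nothing references any more.
        live = set(self.block_map.values())
        for pid in list(self.physical):
            if pid not in live:
                del self.physical[pid]


vol = PostProcessVolume()
vol.write(0, b"A" * 4096)
vol.write(1, b"A" * 4096)
vol.write(2, b"B" * 4096)
before = len(vol.physical)  # three physical blocks, nothing deduped yet
vol.dedupe_pass()
after = len(vol.physical)   # two: the identical blocks now share one copy

vol.write(1, b"C" * 4096)   # a changed block is just a new write
print(before, after, len(vol.physical))
```

Note that overwriting logical block 1 costs exactly one ordinary write, and logical block 0 still reads back its original contents; the changed block simply stops being a dupe.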

What is different, though, is that in a high rate-of-change application like a transactional database, you won’t see as much space savings with dedupe. That’s because if most of your blocks are either new or have just been changed, they won’t be dupes. Here’s misperception number 2: while there are some applications in primary storage where dedupe does not apply (the hot tablespaces in Oracle or SQL Server, for example), what you’ll find is that most data is a good candidate for dedupe on primary and nearline storage. In fact, much more data is stored in files that are good candidates for dedupe than not. All of the typical file/print files are great candidates for dedupe, but the misperception is that applications like Exchange and virtual machines shouldn’t be deduped. As it turns out, both are great candidates for dedupe (and compression, for that matter). Let’s take a look at VMs.

In a virtual machine environment, a storage array may be storing thousands of VMDKs, the VMware files that each store a given virtual machine. Inside each VMDK file is a complete virtual machine image, including the operating system, application files and user data. If you have 1,000 VMDKs that each hold a virtual Windows machine, you’ll have tens of thousands of “files” inside each VMDK file, including a copy of Microsoft Windows, the application you are running in the virtual machine, and often the data for that application as well. How much of the Windows operating system do you suppose is duplicated across the 1,000 VMDKs in this example? Well, almost all of it. What’s more, the thousands of files that make up Windows do not change – are not changeable, in fact, unless you do an OS upgrade.

Large parts of each VMDK file are duplicated in the others, and they stay the same, day after day. Perfect candidates for dedupe. Sure, the user data in a VMDK may change, but any competent dedupe solution is not deduplicating whole files – the dedupe solution is deduplicating something at sub-file granularity: blocks, objects, chunks, etc. NetApp dedupes 4K WAFL file system blocks. Ocarina dedupes sub-file objects. The point is, regardless of which approach you take, if most of a VMDK file stays the same, and some part changes, dedupe will work great. The parts of the VMDK file that are changing won’t be deduped, and the vast majority of the file – the OS and application binaries – will be deduped. The space savings on your storage is great, and the performance impact minimal.
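The effect is easy to see with a small simulation. Everything in this sketch is made up for illustration: the "images" are synthetic byte strings rather than real VMDKs, the 100-block OS portion and 5-block user portion are arbitrary, and fixed 4K blocks are just one possible granularity (chosen here to echo the 4K figure above, not to reproduce any product's algorithm).

```python
import hashlib

BLOCK_SIZE = 4096


def blocks(image: bytes):
    """Split an image into fixed-size blocks."""
    for offset in range(0, len(image), BLOCK_SIZE):
        yield image[offset:offset + BLOCK_SIZE]


def dedupe_stats(images):
    """Return (logical block count, unique block count) across all images."""
    total = 0
    unique = set()
    for image in images:
        for block in blocks(image):
            total += 1
            unique.add(hashlib.sha256(block).digest())
    return total, len(unique)


# Stand-in for the OS and application binaries: 100 distinct blocks,
# identical in every image.
os_part = b"".join(bytes([i]) * BLOCK_SIZE for i in range(100))


def user_part(vm: int) -> bytes:
    # Stand-in for per-VM user data: 5 blocks that differ from VM to VM.
    return b"".join(
        hashlib.sha256(f"vm{vm}-block{b}".encode()).digest() * (BLOCK_SIZE // 32)
        for b in range(5)
    )


images = [os_part + user_part(vm) for vm in range(10)]
total, unique = dedupe_stats(images)
print(f"{total} logical blocks, {unique} unique, "
      f"{1 - unique / total:.0%} saved")
```

With ten images, the shared OS blocks are stored once and only the per-VM blocks add up, so the savings grow as more similar VMs land on the same storage.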

In important ways, dedupe for primary storage is the perfect complement to thin provisioning. In thin provisioning, a storage solution virtualizes (i.e., lies about) the amount of disk space unused. With dedupe, the same storage solution can virtualize (i.e., lie about) how much space is used. The two together provide the maximum storage efficiency.


About Carter George

Carter runs storage strategy for Dell
