It’s very interesting to write for a blog that is focused on a specific topic–in our case, dedupe for primary–and then suddenly see the whole world wake up to the reality of it all at once. There has been quite the pile-on in the storage blogosphere of late.
So, what has been said so far? First, we had Chuck Hollis on his blog talking about primary dedupe and data I/O density. He makes some great points, but he is seeing the problem in a certain way: in essence, he’s thinking of how data reduction may impact the performance of primary storage. However, in some cases dedupe can improve performance, because it allows much higher cache hit rates on heavily used shared data blocks (virtual machines are the perfect example). Another fact is that a lot of the data on expensive primary tiers today does not need to be there. It started there, but it’s grown cold. If you don’t want to create another tier and move files, dedupe gives you a way to create a cheaper logical tier on the storage you already have.
In that case, some trade-off in performance is perfectly acceptable. Ocarina’s solution for deduping primary storage gives you the choice of deduping in-place (creating a logical tier 2) or doing dedupe-and-migrate as a single atomic operation, shrinking colder data and moving it off of Tier 1 storage in one step. In fact, we just announced that Ocarina is now part of the EMC Velocity Technology and ISV Program, giving EMC’s Celerra a major edge over NetApp both for in-place dedupe of primary storage and for dedupe-and-tier.
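To make the cache hit rate point concrete, here is a minimal sketch, not a model of any vendor's cache. It assumes a simple LRU block cache and a made-up workload in which ten virtual machines each read the same 100 OS-image blocks. Keying the cache by (VM, block) stands in for storage without dedupe; keying it by the shared block stands in for deduped storage. All names and numbers are hypothetical.

```python
from collections import OrderedDict


def hit_rate(keys, capacity):
    """Replay a sequence of cache keys through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in keys:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total


# Hypothetical workload: 10 VMs boot one after another from identical images.
VMS, SHARED_BLOCKS, CAPACITY = 10, 100, 200
reads = [(vm, blk) for vm in range(VMS) for blk in range(SHARED_BLOCKS)]

# Without dedupe every VM has its own copy, so each (vm, block) is a distinct entry.
without_dedupe = hit_rate(reads, CAPACITY)

# With dedupe the identical blocks collapse to one stored block, so the key is shared.
with_dedupe = hit_rate([blk for _, blk in reads], CAPACITY)

print(f"hit rate without dedupe: {without_dedupe:.0%}")
print(f"hit rate with dedupe:    {with_dedupe:.0%}")
```

In this toy workload the cache never sees a repeat without dedupe, while with dedupe only the first VM's reads miss. Real workloads are messier, but the direction of the effect is the same.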
A string of comments on Chuck’s blog included some heated exchanges between Chuck and two bloggers from arch-rival NetApp, Vaughn Stewart and Kostadis Roussos.
In response to the post, Hu Yoshida at HDS put in his view, which is that he essentially agrees with Chuck on this question. His main point is that dedupe for primary isn’t a panacea. True enough, but as Hu himself has noted in an earlier post, there’s a great advantage to integrating it when you’re already taking advantage of other tiering, storage virtualization, and provisioning options.
Then finally, EMC Avamar’s Steve Kenniston covered a great deal of ground, and in fact ended up highlighting two key points that are complementary parts of Ocarina’s strategy. First, we want to get as many deeply-embedded design wins with NAS and file system vendors as possible - meaning that a common “language of dedupe” would be spoken across multiple vendors. Second, we’re developing an end-to-end dedupe strategy, where a file that is deduped early in its lifecycle can be kept in its most compressed form as it moves throughout storage workflows.
Once deduped, data should never have to be rehydrated unless it is being accessed by users and applications. For all the rest of the classic storage workflows - backup, replication, data distribution, archive - there’s no reason for data to have to be rehydrated as it moves across tiers, platforms, and vendors.
Examples would be supporting replication of optimized (deduped and compressed) volumes, allowing deduped volumes to be backed up without rehydration (regardless of what the backup target is), and seamless integration with NDMP so that NDMP backups and restores can work transparently with deduped files, without even knowing that they are deduped. The first wave of dedupe products were not only vendor-specific (NetApp Dedupe) but also tier-specific (dedupe for backup, dedupe for primary, etc.). While there are cases where a customer’s need for data reduction is urgent enough to deploy those point solutions, the real win is when dedupe is common and compelling across vendors and tiers.
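As a rough illustration of what replicating a deduped volume without rehydration means, here is a generic content-addressed sketch. It is not Ocarina's format or any vendor's protocol: the fixed 4-byte chunks, the SHA-256 fingerprints, and every function name are assumptions made for the example. A file is stored as a "recipe" of chunk fingerprints, and replication ships the recipe plus only the chunks the target does not already hold.

```python
import hashlib


def dedupe(data, store, chunk_size=4):
    """Split data into fixed-size chunks, keep each unique chunk once, return the recipe."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        piece = data[i:i + chunk_size]
        fingerprint = hashlib.sha256(piece).hexdigest()
        store.setdefault(fingerprint, piece)  # store a chunk only the first time it is seen
        recipe.append(fingerprint)
    return recipe


def replicate(recipe, source, target):
    """Copy only the chunks the target lacks; the file is never reassembled in transit."""
    sent = 0
    for fingerprint in set(recipe):
        if fingerprint not in target:
            target[fingerprint] = source[fingerprint]
            sent += 1
    return sent


def rehydrate(recipe, store):
    """Rebuild the original bytes, which is only needed when a user or application reads."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


primary, replica = {}, {}
data = b"AAAABBBBAAAACCCCAAAA"          # five chunks, three of them identical
recipe = dedupe(data, primary)
chunks_sent = replicate(recipe, primary, replica)
chunks_resent = replicate(recipe, primary, replica)

print(len(recipe), "chunks in the file,", chunks_sent, "sent,", chunks_resent, "re-sent")
```

The replica can serve the file by calling `rehydrate(recipe, replica)`, but nothing in the copy path ever had to expand the data back to full size.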
Now, some people might say, “I already bought a dedupe appliance for my backup target, do I really need dedupe anywhere else?” But the fact is, if you dedupe upstream from your backup appliance, you not only save money on primary storage, you still get benefit from your backup appliance. Backups are repetitive - you back up a volume every day, either full or incremental. So even if you have already deduped your primary volume, by backing it up every day, you are creating more duplicates in the backup target. The dedupe appliance will find those and take them out. If the primary volume has already been deduped, though, your backup data set will be smaller, and the work that the backup appliance has to do will be faster. The benefits are cumulative - if you get 5:1 on your primary data, and then back that up every day for a month, you may end up with 100:1 savings in your backups instead of today’s 20:1.
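The arithmetic behind that 100:1 figure is simple enough to sketch. The assumption, and it is only an assumption, is that the backup appliance achieves the same ratio on already-reduced data as it does on raw data, so the two ratios multiply. The 10 TB volume and 30 daily full backups are made-up numbers for illustration.

```python
def combined_ratio(primary_ratio, backup_ratio):
    """Overall reduction when both stages apply, assuming the two ratios simply multiply."""
    return primary_ratio * backup_ratio


def stored_tb(logical_tb, ratio):
    """Physical capacity needed to hold logical_tb of data at a given reduction ratio."""
    return logical_tb / ratio


# Hypothetical month: a 10 TB volume backed up in full every day for 30 days.
logical_tb = 10 * 30

backup_only = stored_tb(logical_tb, 20)                        # today's 20:1 at the target
end_to_end = stored_tb(logical_tb, combined_ratio(5, 20))      # 5:1 upstream, then 20:1

print(f"backup dedupe only: {backup_only} TB stored")
print(f"primary + backup:   {end_to_end} TB stored")
```

Under these assumptions the month of backups drops from 15 TB to 3 TB on the backup target. In practice the two ratios will not multiply perfectly, which is why the claim above says "may."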
Interestingly, EMC has all the pieces here. And actually we can show how this works in an HDS environment just as well, which we may do in a later post. If you run Ocarina to dedupe your primary file store on Celerra, Ocarina can do the following:
* Optimize some primary files, identified by file type, right where they are on the fastest primary tier. This may allow those files to see better cache hit rates.
* Optimize other files and move them to another volume on the same Celerra or another Celerra, perhaps a volume with SATA instead of Fibre drives. Because Ocarina uses EMC FileMover stubs, this means that we can create a much larger global namespace on Celerra than a simple Celerra volume would support.
* Optimize all files in policy and post them to EMC Celerra for archive in an optimized form (deduped and compressed).
* Optimize whole volumes, and then back up those volumes to EMC Data Domain, where additional dedupe will take place as backup after backup creates more dupes in the backup target.
All of this can be done - on EMC, HDS, and other vendors - as true “end to end” dedupe, where data only gets rehydrated where it’s needed for a live application or user I/O request.



