Martin Glassborow on his Storagebod Blog has written a controversial piece that raises questions about the two hottest technologies in storage at the moment, dedupe and thin provisioning. In his post, entitled “Living on a Prayer,” he suggests that both of these technologies could be the road to a storage nightmare, in which “you could be many times over-subscribed with de-duped storage.” He gives the example of someone turning on encryption and all the dupes reappearing at once, suddenly requiring all kinds of storage capacity that wasn’t needed until then.
He also sounds the alarm on migration, saying, “migrating deduped primary storage between arrays … is going to need a lot of planning. Deduping primary storage may well be one of the ultimate vendor lock-ins if we are not careful.”
Here are some of my responses to this thought-provoking post, which will no doubt be getting a lot of attention.
On oversubscription:
I agree with Martin that there is a real risk here. When a bulk operation could cause massive rehydration, it’s essential that you have the proper warning and planning tools. There is also an economic component: essentially, you’re weighing whether to pay for disk now or later.
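To put that exposure in rough numbers, here is a minimal sketch of the kind of check a planning tool might run. The function name and the figures are hypothetical, chosen only for illustration, and it assumes the worst case in which every deduped block is rehydrated at once.

```python
def rehydration_shortfall(physical_used_tb, dedupe_ratio, raw_capacity_tb):
    """Return how many TB short the array would be if all data rehydrated.

    Assumes the worst case: every duplicate reappears at full size.
    """
    logical_tb = physical_used_tb * dedupe_ratio  # size of the data once rehydrated
    return max(0.0, logical_tb - raw_capacity_tb)


# Hypothetical array: 20 TB physically used, 8:1 dedupe ratio, 100 TB raw capacity.
shortfall = rehydration_shortfall(20, 8.0, 100)
print(f"Worst-case shortfall: {shortfall} TB")
```

With these made-up figures the array holds 160 TB of logical data on 100 TB of disk, so a full rehydration would leave it 60 TB short. That gap is the disk you are choosing to pay for later rather than now.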
A good dedupe solution will allow you to control the degree of over-subscription. While this does not matter so much for backup dedupe, it does matter for online storage. For example, you should be able to tell it to make a new copy of the data every time the reference count on a duplicate hits 10 (or whatever number you choose). That way, while you limit your space savings to 10:1, you also limit your exposure to an application-level decision that would cause all the duplicates to be rehydrated and returned to primary storage.
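The reference-count cap described above can be sketched as a toy block store. This is an illustration only, not how any particular product implements it: the class name, the `max_refs` parameter, and the use of SHA-256 fingerprints are all assumptions made for the example.

```python
import hashlib


class CappedDedupeStore:
    """Toy content-addressed store that caps references per physical copy."""

    def __init__(self, max_refs=10):
        self.max_refs = max_refs
        # fingerprint -> list of reference counts, one entry per physical copy
        self.copies = {}

    def write(self, block: bytes) -> None:
        digest = hashlib.sha256(block).hexdigest()
        counts = self.copies.setdefault(digest, [])
        if counts and counts[-1] < self.max_refs:
            counts[-1] += 1   # share the newest physical copy
        else:
            counts.append(1)  # first write, or cap reached: store a new copy

    def logical_blocks(self) -> int:
        return sum(sum(c) for c in self.copies.values())

    def physical_blocks(self) -> int:
        return sum(len(c) for c in self.copies.values())


store = CappedDedupeStore(max_refs=10)
for _ in range(25):
    store.write(b"same block")
store.write(b"unique block")
print(store.logical_blocks(), store.physical_blocks())  # 26 logical, 4 physical
```

Here 25 identical writes end up as three physical copies (10 + 10 + 5 references) instead of one, so the savings on that block can never exceed 10:1, and neither can the rehydration exposure.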
Encryption is a good example: if it is done at the application or file level, most dedupe solutions will not be able to find duplicates at all. Increasingly, we’re seeing encryption move to the drive level, and in that case it will be transparent to primary dedupe. But that’s not to say there aren’t other cases where you could end up oversubscribed.
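To see why encryption above the storage layer hides duplicates, consider this small sketch. It assumes a dedupe engine that matches blocks by SHA-256 fingerprint, and it uses a toy XOR keystream as a stand-in for real encryption (it is not secure and is here only to keep the example self-contained). The function names are hypothetical.

```python
import hashlib
import os


def toy_encrypt(block: bytes, key: bytes, nonce: bytes) -> bytes:
    """Toy stream cipher: XOR with a SHA-256-derived keystream. NOT secure."""
    stream = b""
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(block, stream))


def unique_fingerprints(blocks) -> int:
    """Count distinct blocks the way a fingerprint-based dedupe engine would."""
    return len({hashlib.sha256(b).digest() for b in blocks})


# Five identical copies of the same file contents.
plain = [b"quarterly report" * 256] * 5
key = os.urandom(32)
# Each copy is encrypted with its own random nonce, as file-level tools typically do.
encrypted = [toy_encrypt(b, key, os.urandom(16)) for b in plain]

print(unique_fingerprints(plain))      # one unique block: dedupes 5:1
print(unique_fingerprints(encrypted))  # five unique blocks: no dedupe possible
```

The plaintext copies collapse to a single fingerprint, while the encrypted copies all look different to the dedupe engine. Drive-level encryption avoids this because it happens below the point where duplicates are detected.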
The lesson here is clear: Your online or primary storage dedupe tool must be able to give you the tools to manage that risk.
On migrating deduped data:
The topic of end-to-end deduplication is the natural next step in the maturation of the deduplication market. Today, you have many vendors, each of whom has built dedupe into their filer as a feature. Every time you move data, you have to rehydrate it. This is often the case even when you are moving deduped data from one filer to another from the same vendor! NetApp dedupe will rehydrate every file any time you move it off the filer, whether for SnapMirror, for an NDMP backup, or for anything else. There are really two things that the IT user wants to see. First, you want to be able to move optimized data in its most efficient form (deduped, compressed) not only across filers, but across vendors and storage tiers.
For example, why dedupe data on the filer, then rehydrate it, back it up to a VTL target, and then dedupe it again? Why not dedupe it once, and move the already-optimized data to the backup target, to the DR site, to the tier 2 filer? In the backup case, you’ll still get more dedupe benefit from your dedupe appliance. The repetitive nature of backups means that when you back up the same file over and over, even if it was already deduped on the filer, it will still benefit from being deduped again with each backup. But you ought to have less data to move to the backup appliance, and you ought not to have to burn up a bunch of filer CPU cycles rehydrating files that are just headed off to backup.
Second, and ideally, you want dedupe and compression that is not a lock-in feature of a vendor, but a vendor-neutral data reduction solution that the IT shop can deploy across multiple filers (primary, nearline, etc.), archive, and backup. And so the lesson again is to take a close look at the dedupe product and be sure that you’re not headed for vendor lock-in.
We look forward to seeing what others are saying about this provocative post.