Deduplication has been a hot topic in storage for several years now. Most of the focus has been on dedupe appliances sold in the backup market by companies like Data Domain and Diligent (now IBM ProtecTIER). Dozens of articles have explained the basic concepts and compared the implementations of various vendors.
Dedupe is also becoming more prominent in primary storage, with NAS market leader NetApp including a basic dedupe feature on every node. EMC followed with whole-file dedupe on its Celerra family of products. Other vendors are now introducing dedupe as a feature on block storage arrays, in the cloud, and on nearline and archival storage.
In effect, what we’re seeing is the emergence and transformation of deduplication (and some other related data reduction techniques) as a new storage fundamental. Rather than being a standalone solution that customers pay a premium for, dedupe is becoming something that will be a standard feature on most mid-market and enterprise storage products by 2012.
This blossoming of dedupe is happening faster than with other value-add storage features that have followed similar paths. Data reduction in general, and dedupe in particular, is a technology whose time seems to have come. There are two drivers. One is exponential storage growth, which is straining IT capital budgets and outpacing the ability of cheap disk drives to keep up. The other is that economic pressure and tightened budgets have made the traditionally conservative enterprise storage buyer willing to look at new technology that can store data more efficiently and at less cost.
It’s also a technology that works because the data driving storage growth is unstructured data: data created by people. Database business applications are not driving storage growth – people are. Office documents, email, photos, and videos are driving the information revolution, and it is in the nature of humans to create lots of copies, variants, and versions of information. In other words, humans (as opposed to database applications) are a lot more likely to create a bunch of duplicate, inefficiently stored data. Dedupe finds those duplicates and removes the unnecessary copies, so that on the actual storage every duplicate points to one shared copy. The benefits are compelling – it’s not just saving disk space, but ultimately it’s also saving power, cooling, rack space, and all the other things you didn’t have to buy when you avoided buying another rack of disks.
That said, not all dedupe is created equal. It comes in multiple flavors – fixed block or variable block, block aligned or sliding window, in-band or post-process, and so forth. There are those who argue fervently that one approach is the only true way to dedupe, as though it were some sort of medieval religious dispute. What matters is that the dedupe method chosen matches the use case.
Dedupe and Compression: Friends or Foes?
Dedupe is not always the best approach to data reduction either. Some data sets – like collections of virtual machines and repetitive backups of the same volumes – lend themselves very well to deduplication. Other data sets, such as corporate file shares and primary storage, respond better to compression. The goal of the IT user should be to store and move data in the most efficient way. Lost in the hype around dedupe is the fact that quite often compression – which is seeing a renaissance in research after years of relative inactivity – is better at shrinking data than dedupe is. The two technologies are not mutually exclusive. It is possible to apply both dedupe and compression to the same data, but finding the optimal balance is easier said than done.
For dedupe to work well, data is chunked up into small blocks, which are then compared to see if any are the same. Duplicate blocks can be discarded, saving space. The smaller the block used, the more likely it is to find dupes. So dedupe gets its best data reduction with smaller chunks, while compression gets better results with larger chunks. Compression works by looking at patterns and then making predictions. If you can predict the next thing, you can compress it. To find patterns and predict data well, compression likes to have more context, and that means that bigger chunks work better. For any given data set, there’s an optimal balance, but there is no one right answer that works best for every data set.
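To make the chunk-size tradeoff concrete, here is a minimal sketch in Python. It assumes the simplest possible scheme, fixed-size chunks fingerprinted with SHA-256 and compressed one chunk at a time with zlib. The function names are made up for illustration, and no vendor's product necessarily works this way.

```python
import hashlib
import zlib


def dedupe_ratio(data: bytes, chunk_size: int) -> float:
    """Original size divided by the total size of the unique fixed-size chunks."""
    unique = {}  # fingerprint -> length of the chunk it identifies
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # Chunks with the same fingerprint are stored only once.
        unique.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return len(data) / sum(unique.values())


def compression_ratio(data: bytes, chunk_size: int) -> float:
    """Original size divided by the total size when each chunk is compressed alone."""
    compressed = sum(
        len(zlib.compress(data[start:start + chunk_size]))
        for start in range(0, len(data), chunk_size)
    )
    return len(data) / compressed
```

Running both functions over the same data at several chunk sizes shows the tension described above. For data made of four 4 KB blocks laid out as A, B, A, A, chunking at 4 KB finds the repeats, while chunking at 8 KB sees two different chunks and saves nothing. Repetitive text, on the other hand, compresses far better in 4 KB chunks than in 64-byte ones, because each chunk gives the compressor more context to work with.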
The Dedupe Transformation
At the Gartner Data Center Conference held in December 2009, the audiences of three different sessions were polled. In the Gartner report, “Data Deduplication will be even bigger in 2010*,” following the event, analyst Dave Russell commented on the poll results. “If the 56% of those with some plans for deduplication in 2010 are combined with the 14% that are using deduplication for only a portion of their backups, and if, as in years past, 2% to 4% of those with no current plans to deploy the technology do implement it, then it’s conceivable that 72% to 74% of the audience will adopt deduplication by year-end 2010.”
A similar poll of storage buyers by SearchStorage found very similar trends. Almost everyone either plans to deploy dedupe for backup or is evaluating it. Likewise, while only 17% have deployed it for primary storage, a staggering 60% are planning to either buy or evaluate dedupe for primary in the next year. This means that CIOs and IT Directors are expecting dedupe to have the kind of impact on IT operations that virtual machines have had over recent years.
All the major players in storage will need to decide on a dedupe strategy, bringing out standalone products in dedupe-centric markets like backup, and adding dedupe as a feature set to existing primary and nearline products in the other storage tiers. Some of the large vendors may look to startups for the technology they need, either through OEM deals or acquisitions, while others will develop in house. Either way, within two years at most dedupe will have become a storage fundamental, and the landscape of vendor offerings will have changed. Today, the playing field consists of niche offerings by the big vendors and a set of innovative startups looking to break out. Within two years, some of those startups will have gotten design wins with major vendors, some will follow in Data Domain’s footsteps and get acquired, and some will be left holding the short end of the stick.
Point Solution or Coherent Strategy?
Finally, there is one key mystery left in the dedupe marketplace, which is headed towards a situation where every storage product will have deduplication built in. At current course and speed, all of those dedupe implementations will be inconsistent and incompatible with one another, even inside the product line of a single vendor. Will that continue to be the case as dedupe becomes a standard feature?
If you look at today’s two market leaders in dedupe – NetApp for primary and Data Domain for backup – you’ll see a painful scenario that we expect to see played out over and over again over the next few years. Take a volume on NetApp filled with 16TB of data. NetApp dedupe might shrink that data to 8TB, a great space savings. But when it comes time to back that data up, the NetApp rehydrates (expands) the 8TB back to the full original 16TB to send it to the backup target. Let’s say the backup target is a Data Domain. Now the network needs to carry that whole 16TB, and the NetApp storage controller has to use a lot of CPU to put the data back together, possibly slowing down other applications trying to use the NetApp. You have to buy a Data Domain model big enough to ingest that 16TB in your available backup window. When the 16TB of data gets to the Data Domain, it will be deduped again, using different algorithms, getting back down to 8TB or less.
In this process, a huge amount of CPU, network bandwidth, and time has been consumed to expand data and then shrink it again. For a market focused on getting to better storage efficiency, this is blatantly wasteful. This is the case today with almost any dedupe-for-primary solution backing up to a dedupe-for-backup target. What would be more useful to customers is consistent and compatible dedupe, allowing data that has already been deduped to be moved in its reduced form to other storage products that support a compatible implementation.
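The cost of rehydration can be sketched in a few lines of Python. This is a toy model, not a description of how NetApp, Data Domain, or any other product moves data. It assumes both sides agree on fixed 4 KB chunks and SHA-256 fingerprints, it ignores the small cost of exchanging the fingerprints themselves, and all the names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size that source and target agree on


def chunk_index(data: bytes) -> dict:
    """Map each chunk's SHA-256 fingerprint to its bytes (unique chunks only)."""
    return {
        hashlib.sha256(data[i:i + CHUNK_SIZE]).digest(): data[i:i + CHUNK_SIZE]
        for i in range(0, len(data), CHUNK_SIZE)
    }


def rehydrated_transfer_bytes(data: bytes) -> int:
    """Bytes on the wire when the source expands everything before sending."""
    return len(data)


def dedupe_aware_transfer_bytes(data: bytes, target_has: set) -> int:
    """Bytes on the wire when only chunks the target lacks are sent."""
    return sum(
        len(chunk)
        for fingerprint, chunk in chunk_index(data).items()
        if fingerprint not in target_has
    )
```

With a volume where half the chunks are duplicates, the rehydrated transfer carries every byte while the dedupe-aware transfer carries half of them, mirroring the 16TB versus 8TB example above. If the target already holds some of the chunks from an earlier backup, the dedupe-aware transfer shrinks further.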
A good deal of the I/O workload in a given shop is driven by a handful of common storage management workflows – backup, replication, migration, and tiering – and all of those workflows would be more efficient if they could be done using data that had been compressed and deduplicated. To truly deliver on the promise of storage efficiency, we’ll look to see some vendors deliver consistent and compatible dedupe and compression that works across products, supporting dedupe-aware versions of those key workflows transparently.
