Get Ready for Dedupe 2.0


Data deduplication has become a hot topic, especially in light of EMC’s recent high-profile acquisition of Data Domain. This week, analyst George Crump of Storage Switzerland made some predictions as to where this technology is heading. His post, The Foundation of DeDupe’s Next Era, asserts that it will take many different approaches, likely from a number of vendors, to best reduce the multiple types of data found in primary storage. I agree with much of what he says, but here are some further thoughts on the topic.

First, a general observation. In every major new market there is an early winner, and that early winner is typically leapfrogged by a 2.0 approach that solves the problems of the first wave. There are a number of examples of this. Browsers, for starters: Netscape made the market, only to be wiped out by Internet Explorer. In file serving, Auspex created the market, but NetApp blew them away. The list goes on.

With that in mind, there are four elements that I believe will define the winning architecture in Dedupe 2.0:

1. Global dedupe: Deduplication will find duplicates across multiple nodes and multiple storage pools. No matter where a data stream comes in to the solution, if it has a dupe, it will be found.

2. Post-Process: The second wave of dedupe will be a post-process architecture. Data Domain tells us as much when they devote so much of the marketing for their latest product (the 800 series) to why in-band is the right answer. They’re the market leader and they have a smoking fast new product, so why are they so worried about post-processing that they make it the focus of their release messaging? Who are they worried about? Not the vendors they’ve already beaten. No, they’re worried because they know the 2.0 generation will be done this way. They are already positioning for the new competitors they know they’ll see in the future; they’re being defensive, because they understand their own limitations better than anyone else.

There are several reasons dedupe will move to a post-process architecture, but the main one is better results in data reduction. Dedupe 2.0 won’t be just dedupe; it will be dedupe plus content-aware compression. This means two- and three-dimensional compressors need to see the context of data, not just the small window of data passing through memory in an in-band appliance. Done right, there’s no reason why post-processing can’t be just as fast as in-band, and data reduction will be dramatically better.

3. Scale-out Processing: In Dedupe 2.0 you will be able to scale out throughput by adding more nodes to your dedupe cluster to process incoming streams. The Dedupe 2.0 cluster will look like one single target to backup (or other) sources. It will have a load-balanced global namespace, but behind that you could have one cheap server or 32 big fast ones. You’ll be able to start small and grow big, without changing anything on the backup software or writer side. Data streams can get load-balanced to any node, and because of global dedupe, any node can dedupe in real time against data arriving at any other node. Instead of having to pick which model has the right throughput for you, start with one node, and if you grow from needing half a terabyte an hour to 5 terabytes an hour of throughput, you add a few more nodes.

4. Scale-out Capacity: As the line between backups (with short retention windows) and archives (potentially long retention periods) continues to blur, the dedupe 2.0 store needs to scale out to massive amounts of storage. That should be independent of processing capacity. For example, the shop that does not back up that much every day should not have to buy some top-of-the-line model just so that they can get enough storage to keep their backups online for 7 years.

Just like processing and throughput capability, capacity should scale independently. You also should be able to add as much storage as you want – inside a dedupe 2.0 cluster node, on a SAN, or network-attached – independently of whether you bought the small cheap dedupe node or the big fast one or a cluster of many of them.
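
To make requirements 1 and 3 a little more concrete, here is a minimal Python sketch of how a cluster might find duplicates no matter which node receives a stream. Everything in it is an assumption for illustration: the DedupeCluster class, the fixed 4 KB chunk size, SHA-256 fingerprints and the modulo routing rule are hypothetical and not taken from any vendor's product. The one idea it shows is that if a chunk is routed by its fingerprint, the same chunk always reaches the same owner, so dedupe is global by construction.

```python
import hashlib

# Hypothetical fixed chunk size; real products often use variable-size chunking.
CHUNK_SIZE = 4096


class DedupeCluster:
    """Toy model in which each node owns a slice of the fingerprint space."""

    def __init__(self, node_count):
        # One fingerprint -> chunk store per node (illustrative only).
        self.nodes = [dict() for _ in range(node_count)]

    def _owner(self, fingerprint):
        # Route by fingerprint, so an identical chunk always lands on the
        # same node regardless of which node received the stream.
        return int.from_bytes(fingerprint[:8], "big") % len(self.nodes)

    def ingest(self, data):
        """Store a stream and return (new_chunks, duplicate_chunks)."""
        new = dup = 0
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).digest()
            store = self.nodes[self._owner(fp)]
            if fp in store:
                dup += 1  # only a reference would need to be recorded
            else:
                store[fp] = chunk
                new += 1
        return new, dup


cluster = DedupeCluster(node_count=4)
backup = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))  # 8 distinct chunks
first = cluster.ingest(backup)   # every chunk is new
second = cluster.ingest(backup)  # the same data arriving a second time
print(first, second)
```

The second ingest stores nothing new, because every chunk is found at its owner. Note that plain modulo routing reassigns most fingerprints when a node is added, so a real scale-out design would need something like consistent hashing or a rebalancing step; the sketch leaves that out.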

Some vendor will deliver a dedupe 2.0 cluster solution that meets these four must-have requirements. Who knows? That might be Data Domain, the winner of the first wave. But it might be someone else, too.
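
The point in requirement 2 about compressors needing to see context can be illustrated with a small experiment. This is only a sketch under stated assumptions: zlib is a generic compressor standing in for the content-aware ones described above, and the 1 KB window and the synthetic data are made up for the demonstration. It compares compressing a stream in small isolated windows against compressing it with the whole stream in view.

```python
import hashlib
import zlib

# Deterministic, hard-to-compress 1 KiB block built from SHA-256 output.
block = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
stream = block * 16  # the same block recurs 16 times in the stream

# Narrow view: each 1 KiB window is compressed alone, so repeats go unseen.
windowed = sum(len(zlib.compress(stream[i:i + 1024]))
               for i in range(0, len(stream), 1024))

# Wide view: one pass over the whole stream can reference earlier blocks.
whole = len(zlib.compress(stream))

print(windowed, whole)
```

The whole-stream pass comes out far smaller, because the redundancy only becomes visible once the compressor can look beyond a single window. It says nothing about how any particular in-band appliance actually buffers data; it only shows why a wider view of the data can improve reduction.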

The question of what to do with already-deduped input streams is a separate but interesting topic. For the most part, customers voted with their wallets against doing source dedupe for backups. After all, EMC bought Data Domain even though it already had source-based dedupe technology in Avamar.

More and more, file servers and even database servers are going to be doing dedupe of the primary and nearline file systems, not for backup, but for storage efficiency in primary storage. That means that data streams going to the backup solution with dedupe are going to be already deduped in some way.

All of which raises even more questions. What’s the right way to deal with that? Is the answer something that needs to be done on the source side or the backup side? Those will have to wait for a later post. Meanwhile, I invite your comments.

About Carter George

Carter runs storage strategy for Dell

9 Responses to “Get Ready for Dedupe 2.0”

  1. Mike Ivanov August 27, 2009 at 10:43 am #

    Carter:

    Excellent post and I agree with many of your thoughts. This is exactly why we launched a new website http://www.dedupe2.com. The purpose of this site is to have discussions and collaboration on how to evolve dedupe beyond the basic backup scenario. Please join in the discussion!

    Regards,
    Mike Ivanov
    VP Marketing
    Permabit

  2. Jay Livens August 27, 2009 at 12:02 pm #

    I blogged about this post over at aboutrestore.com

  3. W. Curtis Preston August 27, 2009 at 6:24 pm #

    I couldn’t agree more with pretty much everything you said. Your thoughts on post-process and its relationship to next gen compression are interesting. I hadn’t thought of that before.

  4. YC August 28, 2009 at 12:06 am #

    How about NEC HYDRAstor (http://www.necam.com/HYDRAstor/HS8-2000.cfm)? It seems to have
    * Global Dedupe
    * Scale-out Processing
    * Scale-out Capacity

    It only lacks your company’s “post-processing content-aware (re)compression” technology, which, in your definition, is essential to the so-called Dedupe 2.0.

  5. Sunshine Mugrabi August 28, 2009 at 1:53 pm #

    Great to see all these comments. Keep them coming, folks. We’re interested in your responses to these posts.

  6. Fabrice Helliker September 1, 2009 at 3:22 am #

    Well, have to say that I want to agree with everything you said, as we could use it as a plug for our product’s offering, which provides source de-dupe as well as post-process de-dupe and the ability to de-dupe across multiple storage modes (backup / archiving / CDP), in addition to other data reduction techniques.

    See: http://www.cofio.com/Deduplication/

    BUT…and here is the bit I can’t get my head around. I see no benefit of post-dedupe over in-line de-dupe. In fact, if you can keep up with the incoming data, I think in-line de-dupe is great.

    There are always other post-process activities that can take place. Indexing for one. But it’s not always good to bundle all that activity into one process.

    Will 3 dimensional (whatever that means) compression be better? I doubt it will be that much better, but undoubtedly the company that delivers it will use it as their “winning” gadget the same way Data Domain uses in-line de-dupe.

    Our product differentiates the different data streams a file has and applies the de-dupe process per data stream, thereby ensuring file security information doesn’t adversely affect de-dupe efficiency when we de-dupe files across machines (hence they all have data security attributes). Does this make us de-dupe 2.0? In addition to de-dupe, we utilise replication-like technology ensuring that only new data is brought in, so if your database adds 100k of records, only that is stored and we don’t even NEED to post-process the whole database to de-dupe, as we didn’t duplicate in the first place. Is that de-dupe 2.0?

    I hope the de-dupe 2.0 term won’t catch on. I’m still recovering from Web 2.0 and Storage 3.0. If it does, I know where to find you George ;)

    Fab

  7. Gideon Senderov September 2, 2009 at 10:21 am #

    George,

    This is a very good article highlighting many of the deficiencies that must be addressed with first generation deduplication solutions. I agree with most of the concepts outlined in the article and the comments, but I would have to disagree with your statement regarding Dedupe 2.0 needing to be post-process. The inherent requirement is that the dedupe process needs to be able to linearly scale capacity and performance independently and ensure the same QoS in terms of performance and deduplication effectiveness regardless of how large the overall solution gets. This is in fact exactly what you are describing in this article as well as George Crump’s article which you referenced at the top.

    Fact is that most solutions out there opted to use the post-process approach already today (“Dedupe 1.0”) since it is easier to implement versus effectively processing the data and deduplicating it inline while maintaining consistent performance as the system scales to accommodate data growth. A few things to note regarding the post-process approach which are not mentioned in the article are that:

    1. Post-process requires more capacity to be purchased and allocated up-front to write all the un-deduplicated data before it is even deduplicated and compressed to optimize capacity utilization, thereby increasing capacity requirements.
    2. Inline deduplication, when done effectively, actually speeds up write performance, since any chunk of data that is identified inline as a duplicate only requires pointer/metadata information to be written to the media rather than the entire payload.
    3. Post-process still requires processing time and in essence needs “idle time” to allow it to optimize capacity, which is not always available.

    I fully agree with you and the comments regarding the limitations of the scale-up approach used by Data Domain and others, but that is exactly the reason for the need for a scalable architecture rather than making the case for post-process versus inline. NEC’s HYDRAstor (mentioned in YC’s comments) already addresses virtually all the requirements you are describing in your article. In fact, the HYDRAstor grid storage platform was specifically architected to address the inherent scalability limitations you are describing with first generation scale-up solutions. HYDRAstor already provides today the following things that you are describing as required for “Dedupe 2.0”:

    1. Global Dedupe – HYDRAstor deduplicates all data globally across the entire grid regardless of how large the system gets, while maintaining linear performance scalability as well as efficient deduplication effectiveness with a single shared repository across all nodes. Note that HYDRAstor does this while processing all data and deduplicating and compressing it inline as it comes into the grid.
    2. Content Awareness – In addition to deduplicating all data globally inline across a shared dedupe repository, HYDRAstor also offers “Application-aware deduplication” (format awareness), which leverages knowledge of the data format to provide greater deduplication efficiency for specific applications’ data. This can be intermixed with other general data within the single shared deduplication repository to provide cross-application deduplication.
    3. Scale-out Processing – HYDRAstor provides independent linear scalability of performance and capacity via a scale-out architecture that leverages independent nodes. Throughput can be increased by adding “Accelerator Nodes”, while capacity (and associated inline processing requirements) can be increased by adding “Storage Nodes”. Data is deduplicated inline globally with no performance degradation by leveraging HYDRAstor distributed hash table approach that overcomes the physical scalability limitations of first generation scale-up solutions such as Data Domain.
    4. Scale-out Capacity – HYDRAstor provides independent capacity scalability to massive amounts of data (over 20PB of data in a single shared dedupe repository), while maintaining consistent performance without having to add additional “controllers” or Accelerator Nodes. It does so by including additional processing power and memory with each Storage Node, which integrates into the distributed global deduplication engine to ensure consistent performance as capacity grows. This overcomes the limitations of first generation scale-up approaches that are limited by the physical memory and processing limitations within a single controller, thus requiring the addition of unnecessary controllers which also introduce separate local deduplication silos.

    Note that another attribute that is not explicitly mentioned in your article but is mentioned in George Crump’s article that you referenced is the need to support integration of new technologies into the existing infrastructure. In essence, the platform needs to support “in-place technology refresh”, enabling non-disruptive addition and intermixing nodes spanning multiple generations as the system grows. This eliminates the need to migrate massive amounts of data across platforms over time, as well as maximizing the investment protection and utility of current infrastructure. This “self-evolving” attribute and automatic load balancing and data redistribution are also already supported by HYDRAstor today.

    I would be happy to discuss these “Dedupe 2.0” ideas/attributes further or provide more information regarding the HYDRAstor platform which already provides a “Dedupe 2.0” type solution today.

    Thanks,

    Gideon Senderov
    Director, Product Management
    NEC, Advanced Storage Products

  8. Enterprise Features June 25, 2010 at 2:26 pm #

    I’m really excited about the Global Dedupe feature. I’ve been hearing a lot about it.

Trackbacks/Pingbacks

  1. About Restore » Blog Archive » Deduplication 2.0 - August 27, 2009

    [...] folks over at the Online Storage Optimization blog recently wrote a post entitled Get Ready for Dedupe 2.0 where they outline their vision for the future of deduplication.
