
Online Storage Optimization

Exploring Next Generation Storage Solutions

Fast and Effective Dedupe

Posted by Ocarina On March - 3 - 2010

I’ve noticed a few blog posts recently about speed of deduplication in the modern data center. I agree that speed is an important factor, but keep in mind that not all dedupe is created equal. That is to say, fast is good, but only if you are also effective. One of the tricky things has been that the easiest data to compress is also usually the most carefully performance-tuned. A great example of this is a database: databases are composed of simple alphanumeric fields and sparse tables, all of which is easy to reduce in size.

However, a company’s core transactional database is the most conservative asset in the data center. Introducing compression would save space, for sure, but you could only use very fast, simple compressors there. At the same time, customers will be hesitant to deploy a new layer of processing in their most sensitive application.

So, where is most data growth? In fact, it’s being driven by unstructured data – Office documents, rich media, email with attachments, PDFs, Flash videos, and so forth. This complex data does not lend itself to fast simple compressors. But perhaps we should back up for a moment and think about how customers have been behaving all along.

Throughout the history of storage, there have always been tradeoffs available between fast expensive storage, and slower but cheaper alternatives. This is not a bad thing. It gives users alternatives based on their priorities and budgets. Back in the old mainframe days, these choices were between very expensive mainframe memory and “offline” storage like drums, cards, and tapes. Today the technology is all much bigger, faster, cheaper and sexier. But really, the tradeoffs are the same.

Data reduction technology adds another layer of choice above and beyond the traditional hardware choices. Now in addition to choosing whether you want fast, expensive solid state disk (SSD) or slower but very cost-effective SATA, you can also choose whether you want to compress and/or deduplicate the data that is stored on those disks.

Just like physical disks, compression and dedupe come in a range of speeds and capabilities. There are simple and very fast compressors that are essentially invisible in terms of their impact on storage performance. There are more complex compressors that get better results, but which may take longer, either to compress or to decompress the data. Deduplication, done well, should always be pretty fast, and streaming dedupe rates of well over 300MB/sec are now available from many vendors (including Data Domain and Ocarina).

The emergence of tools to automatically tier data to its appropriate place helps make the use of all of these technologies more feasible. That applies as much to solid state disks as it does to dedupe and compression. When data tiering can be made invisible to end-users and applications, then implementing multiple physical and logical tiers of storage becomes practical. Good examples would include EMC’s new FAST tools, Compellent’s “Fluid Data Storage”, and HDS’s Data Migrator. When users or administrators have to move data by hand to get it to a compressed tier or a solid state disk, then the operational costs offset the capital savings.

You might want to be wary when someone’s biggest claim to fame is fast dedupe. Just as the old mainframe admin had to decide whether something was important enough to live in RAM, or could be stored on cheaper tapes instead, today’s IT shops have to decide where it is most important to try to get data reduction, and what tool will get the most bang for the buck for that kind of data. You need the whole story, and then you can decide based on your own priorities.

The Year in Images

Posted by Sunshine On December - 30 - 2009

This past year, we at Online Storage Op gathered all manner of images to illustrate our posts. So as a way of looking back at 2009, here are some of the ones we liked the best–and the stories that went with them:

Holodeck fun:

In February, Robin Harris at StorageMojo wrote about a potential breakthrough in storage technology that could change the landscape forever: quantum holographic storage. Online Storage Op was on the scene. It also gave us a chance to upload a pic of a Geordi La Forge doll. Admit it… this is one cool toy.

Squeezing into your Genes:

This blog’s parent Ocarina had quite a year–inking partnerships with a number of major storage vendors and becoming a noted player in the hot dedupe space. It was also the year that genomics labs woke up to the need for better data reduction to deal with the coming onslaught of genetic data. In short, compression can be a matter of life and death. We reported on it here, and our readers got to relive their 10th grade biology class by looking at images like the one above.

Racing for Dedupe

As many pundits are now opining, dedupe really was one of the biggest stories of 2009, not least because of the high profile battle for Data Domain between storage titans EMC and NetApp. In the end, EMC nabbed the dedupe specialist for an eye-popping $2.1 billion.

Booth Babe Mania:

We know our readers are sophisticated types who come here only to absorb information and opinion, and to better themselves for the benefit of all humankind. But for some odd reason we saw a major traffic spike the day we ran our post on the great Booth Babe Controversy. When we asked, everyone quickly told us, “I read the articles.” Mmmhmm!

VMworld a hit

And speaking of images that make storage folks drool, one of the most mesmerizing sights of the year was at VMworld, held in August in San Francisco. Participants descended the escalator to be greeted by a gleaming rack of servers and storage–which we later learned was the result of a plan drawn on a napkin by the VMware GETO team. In any case, this year’s VMworld was a major event–and as we rightly noted, it foretold more economic activity in storage and virtualization.

Industry puts aside differences to try to save a life

This is one of the saddest stories of 2009, and one that demonstrates an activist and caring streak in the storage community. When word got out in May 2009 that EMC employee Nick Glasgow was in need of a bone marrow transplant, folks within the storage industry put aside competitive differences and pulled together to find him a match. Sadly, Nick passed away in October. The degree to which he inspired others will not be forgotten.

And, finally…

We never did have an egg and spoon race, but…
In November, Ocarina participated in the first ever Gestalt IT Tech Field Day, which brought independent bloggers from around the world to Silicon Valley for two days of tech deep dives. Our “bring out your data” challenge started tongues wagging well before the event began. Participants brought us their toughest data sets, and aside from those who used archaic encryption software to stump our algorithms, the results were impressive–an average of about 30% reduction on these tougher-than-tough data sets. Plus, the whole event was just a ton of fun. And it didn’t even require that we slog around the mud clapping coconut shells together.

Dedupe - The Big News in 2009

Posted by Sunshine On December - 7 - 2009

It’s been a tough year — a worldwide recession, a sluggish housing market, rising unemployment … and on top of all that, the tarnished image of one of sports’ most squeaky clean players. Well, actually, there have been some bright spots. As DCIG blogger and storage analyst Jerome Wendt notes while looking back at the past year, “Deduplication is the Big Success Story of 2009.”

Wendt writes: “Deduplication is arguably one of the most notable trends of 2009 as it has been widely adopted by users after bursting onto the scene just a few years ago and has grown to be included in both software and hardware products.”

Wendt focuses on dedupe for backups, where there has been much publicized activity over the past year. The big storage story of 2009 was of course the battle between storage titans EMC and NetApp over backup dedupe specialist Data Domain. He cites an industry survey from SearchDataBackup that indicates that 41% of enterprises either are using or are seriously considering dedupe to control data growth and costs. He also notes that despite the predicted demise of Quantum, that dedupe company remains strong.

Dedupe for backups is one part of the cost reduction puzzle. Another part is to reduce data at the source, in primary storage. This is of course the specialty of this blog’s parent Ocarina, which implements a unique combination of content-aware dedupe and compression to achieve startling results. It focuses on the very types of unstructured data that are driving storage growth today–emails, images, documents, and so on. The company has been partnering with almost every leading storage provider, including HP, EMC, HDS, BlueArc, and Isilon. Another leader in this space is NetApp, which has a strong dedupe for primary offering that has also garnered a great deal of attention.

Here’s the thing: the economy might be slowing down, but data growth continues apace. This is one reason that the storage industry has been thriving this year. But rather than standing still, what this spells is a concerted effort to keep that data under control. As Wendt notes, another of the year’s big trends is cloud storage, which offers companies more flexibility for storing some percentage of their data. I would also add that virtualization has taken a huge leap forward, not only in terms of the technology itself, but also in terms of adoption over the past year. Yet another way to attack the problem.

So if 2009 was all about dedupe for backups, I’m going to guess that 2010 will be very much about data reduction at all points on the data life cycle. What do you predict?

Image: Gizmodo

OEM or Not, Here We Come

Posted by Ocarina On October - 28 - 2009

In today’s UK Register, Chris Mellor talked with Brian Biles of Data Domain about its plans for global dedupe. In it, Brian says that Ocarina is not “synergistic” with Data Domain. Writes Chris: “Data Domain set out to solve a data protection problem whereas Ocarina set out to solve a media management problem.” He then quotes Brian, “‘I think it [Ocarina] is in a different market that’s not that synergistic. It’s a different choice from how to optimise data protection.’”

Chris’s final comment? Even if Ocarina offered an OEM deal, Data Domain wouldn’t be “enthusiastic.” Well, that remains to be seen, and actually, it isn’t the important question. Ocarina agrees that, for now, the right place for its functionality is not in the backup tier where Data Domain lives. There is no reason to believe that Data Domain’s acquisition by EMC in any way diminishes the strength of the technology partnership that already exists between Ocarina and EMC.

Ocarina is the Rolls Royce solution for online data reduction, and in that sense, we compete with NetApp Dedupe, not Data Domain. The reality is that right now, as a member of the EMC Celerra Velocity program, Ocarina has been a point of synergy for them, and we don’t see that ending any time soon. The synergy is that if you do online dedupe right on your NAS platform, including EMC’s Celerra, then it plays right in to the strengths of Data Domain when it comes time to back up.

In the Data Domain product, you have a product that’s optimized for the backup world – fast sequential throughput in support of backup windows driven by standard backup applications. In the NetApp case, you have an OK implementation of simple block dedupe, designed to give some data reduction results without sacrificing too much performance in support of random I/O by end users.

There is no right or wrong answer here – both products take the correct approach for the problem that they solve. What’s misleading is the positioning of Ocarina as a solution for media accounts. While Ocarina does have many successful installs in rich media accounts, our core dedupe engine is intended to give multiple storage vendors the same kind of fast, embedded dedupe solution that NetApp has for all online file types. Just to clear up any misconceptions, Ocarina has a diverse - and fast growing - customer base, with existing customers in publishing, semiconductor, bio-informatics, energy, film-making, eDiscovery, and Web 2.0.

Because Ocarina’s solution combines dedupe with content-aware compression, Ocarina can address a much broader set of data types and customers than any dedupe-only product, including NetApp. With Ocarina, you can use policies to configure Ocarina for simple dedupe only, giving Ocarina storage partners like BlueArc, EMC, HDS, HP and Isilon equivalent data reduction and primary storage performance as NetApp dedupe.

Alternatively, you can set the policies to be more aggressive, to use all the content-aware compressors, and get much much better data reduction than NetApp while still supporting reasonably fast random I/O for end-users. Since dedupe in general does not get good results on already-compressed files – especially images, video, Zip and other compressed data – having content-aware compressors allows Ocarina to address all those files in addition to providing great dedupe performance for corporate and enterprise file types. Finally, Ocarina works across multiple types of storage, so a customer can have a single dedupe “language” across all their NAS and primary storage vendors.

Ocarina is, therefore, a better technology than NetApp dedupe that also has the advantage of being vendor agnostic. At the same time, it’s complementary to Data Domain. That synergy comes from a fundamental difference in how a customer backs up data that has been deduplicated by Ocarina versus data that has been deduplicated by NetApp. With NetApp, when you go to back up a deduped volume, NetApp will rehydrate that volume, expanding the data back to its original full size. With Ocarina, we have a dedupe-aware implementation of NDMP – the backup protocol standard – that allows us to keep data in its deduplicated and compressed state as it is backed up, while still allowing single file restores.

This actually raises an interesting question: Do you still need Data Domain in that case? After all, you’re backing up already deduped data?

Well, yes, actually. Backups are repetitive. So even if you perfectly dedupe the live online volume, if you back it up every day, that process is going to create more dupes in the backup target. Data Domain will find those and eliminate them. The data reduction is additive. The combination of Ocarina for live volumes and Data Domain as a backup target has a big advantage for backups, because it shrinks the backup window. Because the first pass of dedupe has already been done on the filer, there is less data that has to move from storage to backup. If you have 100TB on a set of NAS filers, and Ocarina shrinks that to 40TB, then you’ve reduced the amount of data that needs to be sent across a network to the Data Domain by 60%, making your backup window smaller and faster. Data Domain, in turn, will shrink that data further with every subsequent backup.
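To make the arithmetic concrete, here is a small sketch in Python. The 100TB and 40TB figures come from the example above; the 5TB-per-hour link speed is purely an assumption chosen for illustration, as are the function names.

```python
def reduction_percent(original_tb, reduced_tb):
    """Percentage of data that no longer has to cross the network."""
    return 100 * (original_tb - reduced_tb) / original_tb


def transfer_hours(size_tb, throughput_tb_per_hour):
    """Time to move a volume at a steady throughput."""
    return size_tb / throughput_tb_per_hour


# Hypothetical sustained rate between the filers and the backup target.
ASSUMED_THROUGHPUT_TB_PER_HOUR = 5

original_tb = 100  # data on the NAS filers before optimization
reduced_tb = 40    # the same data after dedupe and compression

saved = reduction_percent(original_tb, reduced_tb)
before = transfer_hours(original_tb, ASSUMED_THROUGHPUT_TB_PER_HOUR)
after = transfer_hours(reduced_tb, ASSUMED_THROUGHPUT_TB_PER_HOUR)

print(f"Data sent to backup shrinks by {saved:.0f}%")
print(f"Transfer time drops from {before:.0f} to {after:.0f} hours")
```

At the assumed rate, the full backup transfer goes from 20 hours to 8. Your own numbers will depend on your network and on how well your data reduces.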

Get Ready for Dedupe 2.0

Posted by Ocarina On August - 27 - 2009

Data deduplication has become a very hot topic these days, especially in light of EMC’s recent and very high profile acquisition of Data Domain. This week, analyst George Crump of Storage Switzerland made some predictions as to where this technology is heading. His post, The Foundation of DeDupe’s Next Era, asserts that it will require many different approaches–likely from a number of vendors–in order to best reduce the multiple types of data found in primary storage. I agree with much of what he says, but here are some further thoughts on the topic.

First, a general observation. In every new major market, there is always an early winner, and then that early winner is typically leap-frogged by a 2.0 approach that solves the problems of the first wave. There are a number of examples of this. Browsers, for starters. Netscape made the market, only to be wiped out by Internet Explorer. In the file serving market, Auspex created the market, but NetApp blew them away. The list goes on.

With that in mind, there are four elements that I believe will define the winning architecture in Dedupe 2.0:

1. Global dedupe: Deduplication will find duplicates across multiple nodes and multiple storage pools. No matter where a data stream comes in to the solution, if it has a dupe, it will be found.

2. Post-Process: The second wave of dedupe will be a post-process architecture. Data Domain tells us as much when they focus so much of the marketing for their latest product (the 800 series) on why in-band is the right answer. They’re the market leader, they have a smoking fast new product – why are they so worried about post-processing that they make it the focus of their release messaging? Who are they worried about? Not the vendors they’ve already beaten. No, they’re worried because they know the 2.0 generation will be done this way. They are already positioning now for the new competitors they know they’ll see in the future; they’re being defensive, because they understand their own limitations better than anyone else.

There are several reasons dedupe will move to a post-process architecture, but the main one is better results in data reduction. Dedupe 2.0 won’t be just dedupe – it will be dedupe plus content-aware compression. This means two- and three-dimensional compressors need to see the context of data, not just the small window of data passing through memory in an in-band appliance. Done right, there’s no reason why post-processing can’t be just as fast as in-band, and data reduction will be dramatically better.

3. Scale-out Processing: In Dedupe 2.0 you will be able to scale out throughput by adding more nodes to your dedupe cluster to process in-coming streams. The Dedupe 2.0 cluster will look like one single target to backup (or other) sources. It will have a load-balanced global namespace, but behind that you could have one cheap server or 32 big fast ones. You’ll be able to start small and grow big, without changing anything on the backup software or writer side. Data streams can get load-balanced to any node, and because of global dedupe, any node can dedupe in real time with data coming to any other node. Instead of having to pick which model has the right throughput for you, start with one node, and if you grow from needing half a Terabyte an hour to 5 Terabytes an hour throughput, you add a few more nodes.

4. Scale-out Capacity: As the line between backups (with short retention windows) and archives (potentially long retention periods) continues to blur, the dedupe 2.0 store wants to scale out to massive amounts of storage. That should be independent of processing capacity. For example, the shop that does not back up that much every day should not have to buy some top of the line model just so that they can get enough storage to keep their backups online for 7 years.

Just like processing and throughput capability, capacity should scale independently. You also should be able to add as much storage as you want – inside a dedupe 2.0 cluster node, on a SAN, or network-attached – independently of whether you bought the small cheap dedupe node or the big fast one or a cluster of many of them.
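For readers who like to see an idea in code, here is a minimal sketch of the global dedupe and scale-out processing elements. It is an illustration under stated assumptions, not a description of any vendor’s product: the class names, the fixed 4KB chunk size, and the use of SHA-256 fingerprints are all hypothetical choices. The point is only that when every node consults one shared fingerprint index, a duplicate is found no matter which node the stream lands on.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen for simplicity


class GlobalChunkStore:
    """One fingerprint index shared by every node in the cluster."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def put(self, chunk):
        """Store a chunk; return True only if it was not already present."""
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint in self.chunks:
            return False
        self.chunks[fingerprint] = chunk
        return True


class DedupeNode:
    """A processing node; add more of these to scale out throughput."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def ingest(self, data):
        """Split a stream into chunks; return how many were new to the cluster."""
        new_chunks = 0
        for offset in range(0, len(data), CHUNK_SIZE):
            if self.store.put(data[offset:offset + CHUNK_SIZE]):
                new_chunks += 1
        return new_chunks


store = GlobalChunkStore()
node_a = DedupeNode("node-a", store)
node_b = DedupeNode("node-b", store)

# Eight distinct chunks, sent first to one node and then to the other.
stream = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
first = node_a.ingest(stream)
second = node_b.ingest(stream)
print(f"{node_a.name} stored {first} new chunks, {node_b.name} stored {second}")
```

The second node stores nothing new, because the first node already registered every chunk in the shared index. A real cluster would have to distribute that index and decouple it from storage capacity, which is exactly where the engineering gets hard.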

Some vendor will deliver a dedupe 2.0 cluster solution that meets these four must-have requirements. Who knows? That might be Data Domain, the winner of the first wave. But it might be someone else, too.

The question of what to do with already-deduped input streams is a separate but interesting topic. For the most part, customers voted with their wallets against doing source dedupe for backups. After all, EMC bought Data Domain even though it already had source-based dedupe technology in Avamar.

More and more, file servers and even database servers are going to be doing dedupe of the primary and nearline file systems, not for backup, but for storage efficiency in primary storage. That means that data streams going to the backup solution with dedupe are going to be already deduped in some way.

All of which raises even more questions–which will have to wait for a later post. What’s the right way to deal with that? Is the answer something that needs to be done on the source side or the backup side? Meanwhile, I invite your comments.

The Dedupe Race

Posted by Sunshine On August - 14 - 2009

The storage press has sniffed out a good story recently. Today, Beth Pariseau has a piece up on her Storage Soup blog that homes in on the drama surrounding the technology du jour–deduplication.

The post, “HP to EMC/Data Domain: Bring it On” has a headline that’s reminiscent of the sort of fighting words we heard from our former president.

Pariseau writes: “Admittedly late to the data deduplication game, Hewlett-Packard Co. is brewing new dedupe offerings to compete with the market’s new 800-pound gorilla — EMC/Data Domain. … HP partners with Sepaton for high-end VTLs and Ocarina for primary storage data reduction, but also develops deduplication software for its entry-level disk backup devices.”

Earlier this week, Chris Mellor at The Register covered the HP-Ocarina partnership news, also talking about it in terms of the rising competition for a complete dedupe solution. His article “HP Makes Ocarina Music” has a subhead that speaks volumes: “Ocarina close to clean sweep of file vendors.”

Mellor writes: “Ocarina has similar partnerships with BlueArc, EMC and Isilon. It looks almost inevitable that every other filer supplier must be looking at the Ocarina product and thinking a reseller deal might be a good idea. Otherwise, it could lose sales to the competition when a lot of image-type data is being stored.”

It will be interesting to see how this story unfolds. We vendor bloggers are already chattering about the recent partnership announcement, such as this post on the HP Storageworks “Around The Storage Block” blog. The post, by Pete Brey, WW Extreme Storage Business Development Manager, homes in on two recent HP announcements. First, its recent acquisition of IBRIX, and second its partnership with Ocarina.

Brey writes: “Now multi-petabyte systems are great when you have zillions of files that need to be stored but so is a multi-petabyte system that is optimized so that in the same space tens of zillions can be contained. This is where Ocarina’s ECOsystem software adds its value to our NAS products. The ECOsystem software transforms your storage with its content-aware storage optimization that compresses data up to 10:1 with added features such as deduplication, ECOsnap snapshots, and its own global name space capability. The unique thing about our reseller partnership is that HP can run the ECOsystem software right on our NAS nodes, further optimizing your infrastructure. Now there aren’t too many storage vendors out there who can talk about that now, are there?”

Bragging rights, indeed.

Clearly, it’s too soon to say exactly how each player in this space will benefit and/or lose out. As Mellor’s piece obliquely refers to, this isn’t about Ocarina setting itself in opposition to any vendors–in fact, it has a partnership with EMC. Rather, it shows how each provides a piece of the puzzle. In the big picture, there needs to be a shift in thinking towards something more along the lines of end-to-end dedupe–something that our lead blogger Carter George talked about at length in his popular post, “The Dedupe (R)evolution.” But in the short-run it’s certainly good to see how each vendor is distinguishing itself, and working hard to provide the most efficient, cost-effective storage options to its customers.

What Was That Again?

Posted by Sunshine On August - 3 - 2009

Sometimes someone at Online Storage Optimization writes a post that is so controversial, so intense and thought-provoking, that it seems the entire world is lining up to read it. Later, we find out that while most of the world had a chance to peruse it, there was some small part of the global community–a system administrator here, a CIO there–who, tragically, did not get a chance.

The moment was gone. The post consigned to the archives, where it has been left to molder and sink into obscurity. Such is the pace of the news business. But today we decided that we could defy the laws of physics, and actually turn back the clock. Force time to stop, make a U-turn and come barreling back at us like some kind of souped up toy PowerWheels car on snow.

So here you have it–your guide to some of our most thought-provoking and interesting posts since this blog’s humble beginnings in April 2008. A trip down memory lane for some, and a whole new road to travel for others.

Introducing Storage Optimization Blog - This was the very first post that ever appeared on this blog. While hardly likely to win the prize for the cleverest headline, it actually said a lot more than just “hello world.” It talks about the growth of unstructured data, and even mentions a pioneering company that would later be fodder for mispronunciations on CNBC, Data Domain.

Less is More - Part 2 - Again, this is from the very early days of this blog, before we discovered that it might help to write a headline that made sense. But putting that aside, this is another example of a post that is as relevant now as it was way, way back in the olden days of Spring 2008. It even addresses an aspect of the massive scalability question that Chris Mellor just raised in an opinion piece in The Register. In short, this post is worth a read now more than ever before.

Capacity Optimized Storage - The Emergence of the O Tier - This is one of those posts that I still look back and marvel at. In it, Carter George (our chief author) coins a phrase that has since made it into many discussions of storage tiers and tiering. That’s right, if you hear about the “O” tier, and you’re talking about a capacity optimized tier of storage, then you have Carter to thank for this handy term. And by the way, if you’re thinking you’d like to get some advice about how to create cost-efficient storage tiering, it’s not too late to sign up for a webinar that will be taking place this Wednesday, August 5 at 9 a.m. PDT (Noon EDT).

Storing Bush’s Brain - This post shows that one day, we did learn about writing headlines.

Entrepreneurship Meets American Idol - This guest post by Ocarina Networks CEO Murli Thirumale offers insights on what makes for a great start-up concept. And, what doesn’t. (Hint: falling in love with your technology may not be the smartest route.)

Well, that’s it for now. Hope you were able to sit back, relax, and enjoy the ride down this winding road known as Online Storage Optimization.

Dedupe Grows Up

Posted by Sunshine On July - 29 - 2009

George Crump has a piece in Byte and Switch today that poses an important question: “Can we get to a single point of deduplication?” This is a question that we have taken up in one form or another in some of our recent posts, such as this one and this one.

In the article, Crump asks the question in another way: “… can you have all your data tiers; primary, archive and backup deduplicated by a single engine?”

In light of the recent focus on deduplication, this in my view is a question that really does need to be raised. How long will the industry continue to silo out these different tiers for its deduplication solutions? And how much sense does it make to rehydrate data every time you move it, in order to once again deduplicate it? Not a lot.

Crump writes: “The current deduplication vendors could work on building out their solutions to either scale up into primary storage performance (see Data Domain’s DD880) or they could move their existing data duplication technology into other markets; see the increased speed of Ocarina Networks and Permabit as well as their move into cloud storage.”

At the same time, as we’ve pointed out here, online storage is quite a bit different than backups and so far at least, none of the successful backup dedupe vendors - Data Domain, Diligent, Quantum, etc. - have been able to break into it. Rather, it is NetApp and Ocarina who have been the trailblazers.

Crump makes another key point:

“NetApp and Ocarina could continue to enhance and improve the re-hydration speed of their technologies to make read performance a non-issue, making primary storage a viable platform. Ocarina can already maintain the deduplicated format as they move through tiers, so landing on backup or archive disk would simply be another move for them.”

This is an interesting observation, and one that is often missed in reporting on both of these solutions. We look forward to seeing more debate and discussion on this issue, which was well kicked off with this piece.

Who’s Afraid of the Big, Bad, Dedupe?

Posted by Ocarina On July - 28 - 2009

Martin Glassborow on his Storagebod Blog has written a controversial piece that raises questions about the two hottest technologies in storage at the moment, dedupe and thin provisioning. In his post, entitled “Living on a Prayer,” he suggests that both of these technologies could be the road to a storage nightmare, in which, “you could be many times over-subscribed with de-duped storage.” He gives the example of someone turning on encryption and all the dupes reappear at once, suddenly requiring all kinds of storage capacity that wasn’t needed until then.

He also sounds the alarm on migration, saying, “migrating deduped primary storage between arrays  … is going to need a lot of planning. Deduping primary storage may well be one of the ultimate vendor lock-ins if we are not careful.”

Here are some of my responses to this thought-provoking post, which will no doubt be getting a lot of attention.

On oversubscription:

I agree with Martin that there is a real risk here. When a bulk operation could cause massive rehydration, it’s essential that you have the proper warning and planning tools. There is also an economic component to this–essentially, you’re weighing paying for disk now or later.

A good dedupe solution will allow you to control the degree of over-subscription. While this does not matter so much for backup dedupe, it does matter for online. So you should be able to say, make a new copy of data every time the reference count on a duplicate hits 10 (or whatever number you choose). That way, while you limit your space savings to 10:1, you also limit your exposure to some application level decision that would cause all the duplicates to be rehydrated and returned to primary storage.
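Here is a rough sketch of what a reference-count cap could look like. This is not how any particular product implements it; the class name, the SHA-256 fingerprint, and the bookkeeping are assumptions made for illustration. Each physical copy of a block serves at most `max_refs` logical references, and the next duplicate write starts a fresh copy.

```python
import hashlib
from collections import defaultdict


class CappedDedupeStore:
    """Toy block store that limits how many references share one physical copy."""

    def __init__(self, max_refs=10):
        self.max_refs = max_refs
        # fingerprint -> list of reference counts, one entry per physical copy
        self.copies = defaultdict(list)

    def write(self, block):
        fingerprint = hashlib.sha256(block).hexdigest()
        counts = self.copies[fingerprint]
        if counts and counts[-1] < self.max_refs:
            counts[-1] += 1   # share the newest copy, it still has room
        else:
            counts.append(1)  # cap reached (or first write): new physical copy

    def physical_copies(self):
        return sum(len(counts) for counts in self.copies.values())

    def logical_blocks(self):
        return sum(sum(counts) for counts in self.copies.values())


store = CappedDedupeStore(max_refs=10)
for _ in range(25):
    store.write(b"the same block, written 25 times")

print(store.logical_blocks(), "logical blocks held in",
      store.physical_copies(), "physical copies")
```

With a cap of 10, the 25 identical writes end up in 3 physical copies rather than 1. You give up some savings, but a bulk rehydration can never expand a single stored block by more than a factor of 10.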

Encryption is a good example - encryption will cause most dedupe solutions to not be able to find duplicates at all if the encryption is done at the application or file level. Increasingly, we’re seeing encryption moving to the drive level, and in that case, it will be transparent to primary dedupe, but that’s not to say that there aren’t other cases where oversubscription could happen.
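A quick way to see why file-level encryption hides duplicates is the sketch below. The `toy_encrypt` function is a deliberately simple stand-in built from SHA-256 and XOR so the example needs nothing beyond the standard library; it is not a secure cipher and should never be used as one. The keys and file contents are likewise made up. What it shows is that two identical files fingerprint the same in the clear, but differently once each is encrypted under its own key.

```python
import hashlib


def fingerprint(data):
    """The kind of content hash a dedupe engine might use to spot duplicates."""
    return hashlib.sha256(data).hexdigest()


def toy_encrypt(data, key):
    """Illustrative XOR stream cipher. NOT secure; for demonstration only."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        # Derive a 32-byte pad from the key and the current offset.
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


document = b"Quarterly report, identical in two users' home directories. " * 20

# In the clear, the two copies are obvious duplicates.
clear_match = fingerprint(document) == fingerprint(bytes(document))

# Encrypted per user at the file level, the stored bytes no longer match.
copy_a = toy_encrypt(document, b"hypothetical-key-user-a")
copy_b = toy_encrypt(document, b"hypothetical-key-user-b")
encrypted_match = fingerprint(copy_a) == fingerprint(copy_b)

print("duplicates found in the clear:", clear_match)
print("duplicates found after encryption:", encrypted_match)
```

Drive-level encryption avoids the problem because the dedupe engine sits above it and still sees the clear data.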

The lesson here is clear: Your online or primary storage dedupe tool must be able to give you the tools to manage that risk.

On Migrating Deduped Data

The topic of end-to-end deduplication is the natural next step in the maturation of the deduplication market. Today, you have many vendors, each of whom has built dedupe into its filer as a feature. Every time you move data, you have to rehydrate it. This is often the case even when you are moving deduped data from one filer to another from the same vendor! NetApp dedupe will rehydrate every file any time you move it off the filer - for SnapMirror, for an NDMP backup, etc. There are really two things that the IT user wants to see. First, you want to be able to move optimized data in its most efficient form (deduped, compressed) not only across filers, but across vendors and storage tiers.

For example, why dedupe data on the filer, then rehydrate it, back it up to a VTL target, and then dedupe it again? Why not dedupe it once, and move the already-optimized data to the backup target, to the DR site, to the tier 2 filer? In the backup case, you’ll still get more dedupe benefit from your dedupe appliance. The repetitive nature of backups means that when you back up the same file over and over, even if it was already deduped on the filer, it will still benefit from being deduped again with each backup. But you ought to have less data to move to the backup appliance, and you ought not to have to burn up a bunch of filer CPU cycles rehydrating files that are just headed off to backup.

Second, you ideally want dedupe and compression that is not a vendor’s lock-in feature but a vendor-neutral data reduction solution that the IT shop can deploy across multiple filers (primary, nearline, etc.), archive, and backup. And so the lesson again is to take a close look at the dedupe product and be sure that you’re not headed for vendor lock-in.

We look forward to seeing what others are saying about this provocative post.

EMC Dedupe - Beyond Data Domain

Posted by Mike Davis On July - 27 - 2009

With all the talk about the Data Domain acquisition, there has been less attention paid to EMC’s native de-dupe features in Celerra, not to mention its other related partnerships, such as with Ocarina for optimization of vertical applications. Last week I had the privilege of attending a webinar, “Surviving the Data Explosion through Data Reduction,” with John Hayden, CTO of NAS Engineering at EMC, where I got a fuller picture of Celerra’s latest optimization features.

John provided us with insights on how the new Celerra NAS product integrates data optimization. And while he never mentioned Data Domain directly, an astute observer could see how well EMC is integrating prior acquisitions into its architecture, and draw conclusions from that.

First, he provided us with a couple of interesting factoids from the Digital Universe research EMC sponsored for IDC:

  • In 2009 there is positive growth in digital content, but IT spending for servers & storage is down 6%
  • Over the next 4 years, data will grow 5x, but IT budgets will only grow 1.2x
  • The administrative and overhead cost of storage is 4-7x the CapEx

This was all a prelude to John discussing the new data optimization features for their Celerra NAS product. It’s great to see the NAS vendors recognizing the value of data optimization as a central part of the NAS stack. Drilling a little deeper, EMC basically pulled together file-level deduplication (single instance storage or SIS) from the Avamar acquisition, and LZ77 data-generic compression from their RecoverPoint acquisition. SIS + LZ77 are a good price-performance combination for generic office files and text docs, but they don’t make much of a dent where we see the real capacity and scalability challenges: vertical applications such as life sciences, oil & gas, and media. In fact, the use of generic compression is becoming impotent against the latest MS Office docs that use ZIP as a container. If you change a single text character in an office doc, the entire file changes.
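The point about ZIP containers can be checked with a rough Python sketch. Everything here is an assumption made for illustration: the `make_doc` helper, the member name `word/document.xml`, and the randomly generated body text stand in for a real office document, and `zlib` stands in for a generic LZ77-style compressor.

```python
import hashlib
import io
import random
import zipfile
import zlib


def make_doc(body):
    """Build a tiny ZIP container holding one deflated XML-like member."""
    buf = io.BytesIO()
    # Fixed timestamp so two containers differ only because their bodies differ.
    member = zipfile.ZipInfo("word/document.xml", date_time=(2009, 7, 27, 0, 0, 0))
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(member, body.encode("utf-8"), compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


# Synthetic document text: 8000 words drawn from a 500-word made-up vocabulary.
rng = random.Random(0)
letters = "abcdefghijklmnopqrstuvwxyz"
vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 10)))
         for _ in range(500)]
body = " ".join(rng.choice(vocab) for _ in range(8000))
edited = "X" + body[1:]  # change a single character

doc_a = make_doc(body)
doc_b = make_doc(edited)

# File-level SIS compares whole-file fingerprints, so one edit makes a new file.
same_file = hashlib.sha256(doc_a).digest() == hashlib.sha256(doc_b).digest()

# The container is already deflated, so a second generic pass has little to find.
container_ratio = len(doc_a) / len(body.encode("utf-8"))
recompressed_ratio = len(zlib.compress(doc_a, 9)) / len(doc_a)

print(same_file, round(container_ratio, 2), round(recompressed_ratio, 2))
```

The first ratio shows what the ZIP container already saved over the raw text; the second shows how little a further generic pass recovers from the container itself.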

So there’s a reason that Ocarina has a solid partnership with EMC, with an optimization solution that’s complementary to Celerra’s. When it comes to customers with serious capacity issues and data growth (we’re talking about gene sequencing, post-houses, and so on), there is little to gain from deduplication, and little to gain from generic compression. Not only does the optimization solution need to more intelligently unwind and understand the file structure, but it needs to make better decisions about what algorithms get applied to specific file sub-objects. This is where Ocarina comes in. Like the native Celerra de-dupe solution, the Ocarina ECOsystem integrates with the FileMover API for a tightly knit, policy-based optimization solution that works even on media and ZIP files that are already compressed.

We look forward to our collaborations with EMC, and will be very interested to watch how they continue to integrate dedupe and compression across their offerings.