Online Storage Optimization

Exploring Next Generation Storage Solutions

Dedupealooza

Posted by Sunshine On February - 19 - 2010

So much talk about dedupe these days it’s hard to keep up. The industry is waking up to the reality that dedupe is one of the best ways to reduce data, thus saving on power, cooling, space and other crippling storage costs.

Some of the more thought provoking posts of late:

DCIG - How SSDs can be leveraged to Deliver Inline Deduplication for Primary Storage
Jerome Wendt responds to a comment from someone about Hifn’s Bitwackr inline dedupe. I don’t necessarily agree with Jerome’s take on this. In general, inline solutions are extremely limited, as the original commenter pointed out. But the post provides interesting food for thought.

Storagebod - Where is OnTap 8 with a bit of a rant!
Martin Glassborow isn’t talking specifically about NetApp dedupe here, but the delay in shipping OnTap 8 is of interest to anyone who is concerned about data reduction products. As he puts it, the elephant in the room is that A-SIS dedupe as it now stands has limited scalability.

Recovery Monkey - More FUD busting: Deduplication - is variable-block better than fixed-block, and should you care?
This post, by Dimitris Krekoukias, argues that the major distinction some vendors make between variable and fixed block deduplication is a way of distracting customers from the real issues. The post served to defend NetApp against its detractors and competitors who say fixed block dedupe is limiting. The comments field is in some ways the most interesting part, with EMC heavy Chuck Hollis raising questions about the author’s connections with NetApp. Also, our own Mike Davis weighed in, and the numbers he cited were so notable that further commenters questioned how this could be lossless compression. At this point, we’re used to it–the industry at large has become accustomed to less than spectacular results. More on all of this in a later post.

And here’s another interesting trend. The word “dedupe” is starting to creep into the lingo in a more general way. Among storage tweeps there is a greater tendency to throw “dedupe” into their conversations about everything from their record collections to what they eat. It reminds me a little bit of the “hepcat” slang I used to hear when hanging around jazz musicians. If something was ordinary, they’d call it “B Flat,” since that’s the most common key in jazz. For example, “Oh, I just had a B Flat lunch today of a burger and fries.”

The often Twit-witty Greg Schulz recently tweeted: “ I can have dvr record on disk NBC tape delay (thats probably on disk) then dedupe da commercials.” Good plan, Greg.

This post by Steve Gillmor at TechCrunch also uses the term–in a way that I’ve never heard before, anyway. In this case, he’s referring to the fact that there is duplication of content across what are now becoming overlapping social networks–FriendFeed, Twitter, and the new Google Buzz.

OK, that’s all for now. Keep on deduping, friends!

Storage Trends - Customer is King

Posted by Sunshine On February - 4 - 2010

Last week’s BD Event was more than just a deal making event. It was a chance to learn about new product releases and trends in the storage industry. The big picture: gone are the days when end users had to accept whatever the storage industry handed down to them. Today’s small-to-medium-sized storage operations are all about designing systems in response to customer needs. Whether that’s developing end-to-end dedupe, refining and improving processes for data recovery management, delivering automated marketing tools, improving data migration, or creating storage that is more energy efficient, the push is towards designing systems with real world customer needs in mind.

The BD Event organizers’ deep connections within the storage arena meant that the two-day conference in Palo Alto drew a who’s who of industry folks. I was particularly pleased to see the number of analysts and consultants on site, including Jerome Wendt, George Crump, Deni Connor, Dave Vellante, Stephen Foskett and Tony Asaro (who unveiled his new project, Voices of IT). I also spoke at length with storage writer Howard Marks, who has a new project called DeepStorage.net that looks very promising for companies seeking solid research that they can use as outbound marketing.

Pleasingly enough, this blog’s parent Ocarina was very much the talk of the conference after kicking off the first day’s emerging vendor showcase. Carter George, VP Products gave away the fact that end-to-end dedupe is becoming a part of the overall strategy for the company. This information set tongues wagging. As DCIG’s Jerome Wendt later blogged: “Ocarina Networks is another company that is adapting to new demands from its customers. Originally it started out doing post-process deduplication of large image files (JPGs, MPEGs, etc.) that had been dormant for 30 days or more - great stuff! But now its customers and even OEMs (Ocarina did not say who) are coming to it and asking for it to do end-to-end data deduplication from primary disk to backup disk without ever reconstructing it. After all, once the data it deduplicated on primary storage, why reconstruct it to then deduplicate it again when it is backed up?”

A good question, and one that was hotly debated and discussed among those in attendance. As Jerome notes, this is a perfect example of the customer responsiveness trend. It’s also an acknowledgment of something that’s been obvious to end users for some time–data reduction shouldn’t have to be isolated within each storage sector. In this day and age you really shouldn’t have to buy separate products to dedupe within primary, nearline, and backup. It’s like having to buy a separate dishwasher for your pre-rinse, wash, and dry cycles.

Other standouts at the event included Bocada, which has updated its DR management software by introducing a new product, Prism. I plan to have the CEO Nancy Hurley on my podcast, and so will learn more about how this update is serving its existing and new customers. I confess that I went to her presentation mainly because I wanted her on my show, but I quickly realized that there was something here of note. That is, the company is addressing a real gap in how well these processes are managed and improved, a key consideration with a crucial component like data protection. She gave a brief overview of the user interface, and on its face it seemed intuitive and flexible.

TechValidate also served as a great example of a company that has evolved based on customer needs. As CEO and founder Brad O’Neill explained during his emerging vendor presentation, originally the company was formed to serve companies that were having trouble getting customer references. These all-important testimonials are sometimes difficult to get–as many industries are gun shy about trumpeting their connections with too many IT and storage vendors. However, O’Neill soon recognized a larger need among its customers for usable marketing materials that could be generated from the information they were gathering. Now, the company has a wide range of customers across numerous industries that are using it as a way to serve up marketing publications.

One final highlight of the event–I got to speak with the NetApp blogger known as “Dr. Dedupe,” Larry Freeman. Larry is best known for running around in a lab coat and stethoscope asking people if they know anything about dedupe. The videos of these shenanigans are posted on his blog and on NetApp TV on YouTube. I suppose in a sense he and I are competitors. Turns out, he’s been writing a book, “Evolution of the Storage Brain” and posting it as he writes it, chapter by chapter, on his blog. This means that readers have a chance to comment on it and shape it as it goes along. Check it out!

The Year in Images

Posted by Sunshine On December - 30 - 2009

This past year, we at Online Storage Op gathered all manner of images to illustrate our posts. So as a way of looking back at 2009, here are some of the ones we liked the best–and the stories that went with them:

Holodeck fun:

In February, Robin Harris at StorageMojo wrote about a potential breakthrough in storage technology that could change the landscape forever: quantum holographic storage. Online Storage Op was on the scene. It also gave us a chance to upload a pic of a Geordi La Forge doll. Admit it… this is one cool toy.

Squeezing into your Genes:

This blog’s parent Ocarina had quite a year–inking partnerships with a number of major storage vendors and becoming a noted player in the hot dedupe space. It was also the year that genomics labs woke up to the need for better data reduction to deal with the coming onslaught of genetic data. In short, compression can be a matter of life and death. We reported on it here, and our readers got to relive their 10th grade biology class by looking at images like the one above.

Racing for Dedupe

As many pundits are now opining, dedupe really was one of the biggest stories of 2009, not least because of the high profile battle for Data Domain between storage titans EMC and NetApp. In the end, EMC nabbed the dedupe specialist for an eye-popping $2.1 billion.

Booth Babe Mania:

We know our readers are sophisticated types who come here only to absorb information and opinion, and to better themselves for the benefit of all humankind. But for some odd reason we saw a major traffic spike the day we ran our post on the great Booth Babe Controversy. When we asked, everyone quickly told us, “I read the articles.” Mmmhmm!

VMworld a hit

And speaking of images that make storage folks drool, one of the most mesmerizing sights of the year was at VMworld, held in August in San Francisco. Participants descended the escalator to be greeted by a gleaming rack of servers and storage–which we later learned was the result of a plan drawn on a napkin by the VMware GETO team. In any case, this year’s VMworld was a major event–and as we rightly noted, it foretold more economic activity in storage and virtualization.

Industry puts aside differences to try to save a life

This is one of the saddest stories of 2009, and one that demonstrates an activist and caring streak in the storage community. When word got out in May 2009 that EMC employee Nick Glasgow was in need of a bone marrow transplant, folks within the storage industry put aside competitive differences and pulled together to find him a match. Sadly, Nick passed away in October. The degree to which he inspired others will not be forgotten.

And, finally…

We never did have an egg and spoon race, but…
In November, Ocarina participated in the first ever Gestalt IT Tech Field Day, which brought independent bloggers from around the world to Silicon Valley for two days of tech deep dives. Our “bring out your data” challenge started tongues wagging well before the event began. Participants brought us their toughest data sets, and aside from those who used archaic encryption software to stump our algorithms, the results were impressive–an average of about 30% reduction on these tougher-than-tough data sets. Plus, the whole event was just a ton of fun. And it didn’t even require that we slog around in the mud clapping coconut shells together.

Dedupe - The Big News in 2009

Posted by Sunshine On December - 7 - 2009

It’s been a tough year — a worldwide recession, a sluggish housing market, rising unemployment … and on top of all that, the tarnished image of one of sports’ most squeaky clean players. Well, actually, there have been some bright spots. As DCIG blogger and storage analyst Jerome Wendt notes while looking back at the past year, “Deduplication is the Big Success Story of 2009.”

Wendt writes: “Deduplication is arguably one of the most notable trends of 2009 as it has been widely adopted by users after bursting onto the scene just a few years ago and has grown to be included in both software and hardware products.”

Wendt focuses on dedupe for backups, where there has been much publicized activity over the past year. The big storage story of 2009 was of course the battle between storage titans EMC and NetApp over backup dedupe specialist Data Domain. He cites an industry survey from SearchDataBackup that indicates that 41% of enterprises either are using or are seriously considering dedupe to control data growth and costs. He also notes that despite the predicted demise of Quantum, that dedupe company remains strong.

Dedupe for backups is one part of the cost reduction puzzle. Another part is to reduce data at the source, in primary storage. This is of course the specialty of this blog’s parent Ocarina, which implements a unique combination of content-aware dedupe and compression to achieve startling results. It focuses on the very types of unstructured data that are driving storage growth today–emails, images, documents, and so on. The company has been partnering with almost every leading storage provider, including HP, EMC, HDS, BlueArc, and Isilon. Another leader in this space is NetApp, which has a strong dedupe-for-primary offering that has also garnered a great deal of attention.

Here’s the thing: the economy might be slowing down, but data growth continues apace. This is one reason that the storage industry has been thriving this year. Rather than standing still, the industry is making a concerted effort to keep that data under control. As Wendt notes, another of the year’s big trends is cloud storage, which offers companies more flexibility for storing some percentage of their data. I would also add that virtualization has taken a huge leap forward, not only in terms of the technology itself, but also in terms of adoption over the past year. Yet another way to attack the problem.

So if 2009 was all about dedupe for backups, I’m going to guess that 2010 will be very much about data reduction at all points on the data life cycle. What do you predict?

Image: Gizmodo

Going Native CIFS

Posted by Ocarina On November - 2 - 2009

A recent comment on this blog got me thinking, and this post is the result. The commenter, who identified him or herself only as “Sto Rage” asked: “When can we expect native CIFS support on the Ocarina platforms? The current implementation is outright clunky. So until you have a working CIFS implementation, I don’t think you can compete with NetApp. You may get better compression results, but it works only for NFS data.”

It’s a good point to raise–although I disagree with the “clunky” characterization. But as to the CIFS issue, I wish the answer was as simple as “it’s in the next release,” but this is actually one of the more complex and interesting topics in storage. So hold on to your hats, I’m going to go through Ocarina and CIFS in some detail.

Here’s the short answer: We give you native CIFS support on EMC, BlueArc, HDS, and HP.
Several more NAS vendors will be putting “Ocarina Inside” soon. We give you native CIFS support if you can use our Native Format Optimization. For those customers who use our appliance as a CIFS proxy, we provide good but not perfect CIFS support today, with a roadmap of continual improvement, including the possibility of a native CIFS stack inside the appliance in the first half of next year.

Here’s the longer and more detailed answer.

Ocarina can be deployed in one of three ways:

“Ocarina Inside”: Ocarina is embedded inside or alongside a NAS vendor’s solution.
Ocarina Appliance: A split-band appliance
Ocarina Native Format Optimization (NFO): files are optimized in their native format

Each one of these deployment options has different implications for the CIFS client.

In the “Ocarina Inside” case, the NAS vendor handles all the protocol stacks, and the client gets the full, rich native CIFS implementation of each vendor. Ocarina only dedupes or compresses the data stream. We are not involved in the protocol traffic at all. Examples of “Ocarina Inside” are EMC Celerra, HP Enterprise NAS, BlueArc, and HDS HNAS. Additional “Ocarina Inside” partners will be announced soon. This is the best form of integration, because it makes deduplication and compression completely transparent to users and applications, and lets each storage vendor deliver their full value-add, including in the CIFS protocol stack.

In the Ocarina Appliance case, Ocarina’s optimization happens out of the customer data path, but in order to expand files to their original state upon user access, the Ocarina appliance intercepts read requests in-band. If an I/O (over CIFS or NFS) is to an Ocarina-optimized file, we step in, rehydrate the file, and pass it on to the user. This involves being a proxy for NFS and CIFS (and other protocols including WebDAV and HTTP). It’s fairly easy to be a proxy for NFS and HTTP, but CIFS is more challenging. Ocarina has done a lot of infrastructure work to ensure that we preserve all of the Windows file attributes necessary for good CIFS integration – ACLs, Extended Attributes, Alternate Data Streams, Windows share modes and oplocks, etc.
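For readers who like to see the idea in code, here is a minimal Python sketch of that read path. It is an illustration only, not the appliance’s actual logic: the `OPTIMIZED_MAGIC` marker, both function names, and the use of zlib as a stand-in compressor are all assumptions made for the example.

```python
import zlib

# Hypothetical marker for an optimized file; not Ocarina's real on-disk format.
OPTIMIZED_MAGIC = b"OPT1"


def optimize(data: bytes) -> bytes:
    """Out-of-band step: shrink a file and tag it as optimized (zlib as a stand-in)."""
    return OPTIMIZED_MAGIC + zlib.compress(data)


def proxy_read(stored: bytes) -> bytes:
    """In-band step: rehydrate optimized files, pass everything else through untouched."""
    if stored.startswith(OPTIMIZED_MAGIC):
        return zlib.decompress(stored[len(OPTIMIZED_MAGIC):])
    return stored


original = b"quarterly report " * 1000
on_disk = optimize(original)

print(len(original), "bytes before,", len(on_disk), "bytes on disk")
print("round trip intact:", proxy_read(on_disk) == original)
print("untouched file passes through:", proxy_read(b"never optimized") == b"never optimized")
```

The point of the split is that the expensive work (shrinking) happens on the appliance’s own schedule, while the in-band work is limited to a quick check and, only when needed, a rehydration.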

However, we have not written our own CIFS protocol stack, so our Windows semantic completeness is only as good as the protocol implementation that we sit on. On the appliance, today, that is Samba. Samba has improved a great deal over the last few years, but it is still not a “native” implementation of CIFS. While many storage vendors use variants of Samba for their CIFS stack, it is admittedly not as rich as, say, CIFS on Windows (the only true native CIFS) or CIFS on NetApp.

Ocarina has multiple customers who have implemented Ocarina using both NFS and CIFS on our appliance, and while there may be corner cases where it’s just not as good as the richest CIFS implementations, it’s not “outright clunky” either. There is room for improvement, though, and this is an area of primary focus for our next set of releases. It’s probably a topic for an entirely separate post, but there is a lot going on in the CIFS world these days, and we see some pretty exciting opportunities emerging in this space.

The third case is “Native Format Optimization.” This is a special use of Ocarina where we take certain rich media file types – photos, images and video – and compress them in a special way. We compress them so that the output is a new, smaller file in the same native format it started out in. We’ll take a JPEG photo, compress it, and produce as output another perfectly formed JPEG photo, just smaller. The same is true, for example, for Flash videos. Now in this case, there is no need for a decompressor or for Ocarina to be in the read path or on the protocol at all. We can read files from your NetApp, shrink them, write them back on to your NetApp and Ocarina need not be involved at all when users or applications go to access those files.

In fact, we have a major Fortune 100 company that uses our technology on a large farm of NetApp filers in just this way. In this case, users access the files over all the native protocols that the NetApp supports, including NFS, “native” CIFS, and dual protocol support (NFS and CIFS at the same time). NFO only applies to certain file types, and so it is not the right fit for every data set. However, it is worth pointing out that one of the complaints you see about other deduplication and compression solutions for primary storage is that you save space at the cost of slowing performance down. With NFO, since there is NO decompression, just a smaller file in its original native format, performance is actually, and always, better. There are simply fewer bytes to read off disk, fewer bytes to move over a network and no extra hop or decompression step to go through. It’s a fantastic solution for customers with lots of image, photo, or video data, and it works with all native CIFS implementations.

So there you have it. CIFS support in more detail than you probably ever dreamed or imagined. We look forward to your further comments.

OEM or Not, Here We Come

Posted by Ocarina On October - 28 - 2009

In today’s UK Register, Chris Mellor talked with Brian Biles of Data Domain about its plans for global dedupe. In it, Brian says that Ocarina is not “synergistic” with Data Domain. Writes Chris: “Data Domain set out to solve a data protection problem whereas Ocarina set out to solve a media management problem.” He then quotes Brian, “‘I think it [Ocarina] is in a different market that’s not that synergistic. It’s a different choice from how to optimise data protection.’”

Chris’s final comment? Even if Ocarina offered an OEM deal, Data Domain wouldn’t be “enthusiastic.” Well, that remains to be seen, and actually, it isn’t the important question. Ocarina agrees that, for now, the right place for its functionality is not in the backup tier where Data Domain lives. There is no reason to believe that Data Domain’s acquisition by EMC in any way diminishes the strength of the technology partnership that already exists between Ocarina and EMC.

Ocarina is the Rolls Royce solution for online data reduction, and in that sense, we compete with NetApp Dedupe, not Data Domain. The reality is that right now, as a member of the EMC Celerra Velocity program, Ocarina has been a point of synergy for them, and we don’t see that ending any time soon. The synergy is that if you do online dedupe right on your NAS platform, including EMC’s Celerra, then it plays right into the strengths of Data Domain when it comes time to back up.

In the Data Domain product, you have a product that’s optimized for the backup world – fast sequential throughput in support of backup windows driven by standard backup applications. In the NetApp case, you have an OK implementation of simple block dedupe, designed to give some data reduction results without sacrificing too much performance in support of random I/O by end users.

There is no right or wrong answer here – both products take the correct approach for the problem that they solve. What’s misleading is the positioning of Ocarina as a solution for media accounts. While Ocarina does have many successful installs in rich media accounts, our core dedupe engine is intended to give multiple storage vendors the same kind of fast, embedded dedupe solution that NetApp has for all online file types. Just to clear up any misconceptions, Ocarina has a diverse - and fast growing - customer base, with existing customers in publishing, semiconductor, bio-informatics, energy, film-making, eDiscovery, and Web 2.0.

Because Ocarina’s solution combines dedupe with content-aware compression, Ocarina can address a much broader set of data types and customers than any dedupe-only product, including NetApp. With Ocarina, you can use policies to configure Ocarina for simple dedupe only, giving Ocarina storage partners like BlueArc, EMC, HDS, HP and Isilon the same data reduction and primary storage performance as NetApp dedupe.

Alternatively, you can set the policies to be more aggressive, to use all the content-aware compressors, and get much, much better data reduction than NetApp while still supporting reasonably fast random I/O for end-users. Since dedupe in general does not get good results on already-compressed files – especially images, video, Zip and other compressed data – having content-aware compressors allows Ocarina to address all those files in addition to providing great dedupe performance for corporate and enterprise file types. Finally, Ocarina works across multiple types of storage, so a customer can have a single dedupe “language” across all their NAS and primary storage vendors.

Ocarina is, therefore, a better technology than NetApp dedupe that also has the advantage of being vendor agnostic. At the same time, it’s complementary to Data Domain. That synergy comes from a fundamental difference in how a customer backs up data that has been deduplicated by Ocarina versus data that has been deduplicated by NetApp. With NetApp, when you go to back up a deduped volume, NetApp will rehydrate that volume, expanding the data back to its original full size. With Ocarina, we have a dedupe-aware implementation of NDMP – the backup protocol standard – that allows us to keep data in its deduplicated and compressed state as it is backed up, while still allowing single file restores.

This actually raises an interesting question: Do you still need Data Domain in that case? After all, you’re backing up already deduped data?

Well, yes, actually. Backups are repetitive. So even if you perfectly dedupe the live online volume, if you back it up every day, that process is going to create more dupes in the backup target. Data Domain will find those and eliminate them. The data reduction is additive. The combination of Ocarina for live volumes and Data Domain as a backup target has a big advantage for backups, because it shrinks the backup window. Because the first pass of dedupe has already been done on the filer, there is less data that has to move from storage to backup. If you have 100TB on a set of NAS filers, and Ocarina shrinks that to 40TB, then you’ve reduced the amount of data that needs to be sent across a network to the Data Domain by 60%, making your backup window smaller and faster. Data Domain, in turn, will shrink that data further with every subsequent backup.
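The arithmetic is simple enough to put in a few lines of Python. This is a back-of-the-envelope sketch, not a sizing tool: the function names are ours, and the 2% daily change rate and the 30-day window are purely assumed numbers chosen to show how a deduplicating target keeps shrinking repeated backups.

```python
def transfer_avoided(primary_tb, optimized_tb):
    """Fraction of backup traffic avoided when data leaves the filer already reduced."""
    if not 0 < optimized_tb <= primary_tb:
        raise ValueError("optimized size must be positive and no larger than the original")
    return 1 - optimized_tb / primary_tb


def target_growth(optimized_tb, daily_change, days):
    """Return (TB sent, TB kept) for nightly full backups to a deduplicating target."""
    sent = optimized_tb * days
    # Day one stores everything; each later day adds only the changed fraction.
    kept = optimized_tb + optimized_tb * daily_change * (days - 1)
    return sent, kept


# The example from the text: 100TB on the filers, shrunk to 40TB.
print(f"{transfer_avoided(100, 40):.0%} less data crosses the network")

# Assumed for illustration: 2% of the data changes per day, 30 nightly full backups.
sent, kept = target_growth(40, daily_change=0.02, days=30)
print(f"30 nightly fulls: {sent:.0f} TB sent, about {kept:.1f} TB kept on the target")
```

The first function is the backup-window benefit; the second is why the backup target still earns its keep even when the source is already deduped.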

Dedupe Misconceptions

Posted by Ocarina On October - 20 - 2009

As most in the industry are aware, dedupe has become a standard offering from every major vendor. Dedupe for primary has become the technology of the moment, and for good reason–the rising tide of unstructured data is forcing data centers worldwide to rethink capacity planning, tiering, and storage efficiency. But there are still a few lone voices out there who are clinging to the notion that dedupe is unnecessary.

Take for example this recent post from Compellent’s Bruce Kornfeld, “Is dedupe the only answer?” Kornfeld is responding to a recent SearchStorage article, “Is Data Duplication Right For Your Primary Storage?”

Dedupe and compression can both be applied directly to primary data, and the savings there can be comparable to what’s seen in backup. On backup data, vendors claim 20x data reduction, and on primary data we think that most customers will see about 5x.

So, you say, “That means that you get four times more space savings on backup, right?” Wrong! Actually, 20x means a savings of 95% against the size of the original data set, and 5x means a savings of 80%. That’s a difference of only 15 percentage points, and an 80% space reduction is a huge win for the primary storage user. Of course, vendors who do not have a dedupe solution are likely to tell you you don’t need it anyway. There are some valid concerns about dedupe for primary, but there are also some misperceptions, and there’s no reason to let misinformation be propagated.
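If the jump from “20x” to “95%” isn’t obvious, the rule is that an N-to-1 reduction leaves 1/N of the data behind, so the savings is 1 - 1/N. Here is a quick Python sketch of that conversion (the function name is ours, just for illustration):

```python
def space_savings(reduction_ratio):
    """Fraction of the original data set eliminated at an N:1 reduction ratio."""
    if reduction_ratio < 1:
        raise ValueError("reduction ratio must be at least 1")
    return 1.0 - 1.0 / reduction_ratio


for ratio in (2, 5, 10, 20):
    print(f"{ratio:>2}x reduction -> {space_savings(ratio):.0%} space saved")
```

Notice how quickly the curve flattens: going from 5x to 20x quadruples the ratio but adds only 15 points of savings, which is why ratios make backup dedupe sound further ahead of primary dedupe than it really is.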

The biggest difference between dedupe for backup and dedupe for primary is that in backup, you dedupe all of the data. There’s no reason not to. In primary data, you might not want to dedupe everything - there are some data sets it does not make sense for. That’s not a knock on dedupe for primary. It just means you should choose which data sets make sense to dedupe.

The first common misperception about dedupe for primary data is that performance will be worse. But this is really not the case. When primary data has been deduped (but not compressed), an application asks the storage for a block, and that block is retrieved. There is one lookup to map the logical block request to the physical one - but those kinds of lookups are already being done in every storage array that has any kind of storage virtualization, such as thin provisioning. The response time on a block read for deduped data is hardly different than for un-deduped data, and this is true for all primary dedupe solutions - including both NetApp and Ocarina. There’s no more overhead to retrieving a deduped block than there would be in any other block read I/O on any intelligent array–and Compellent, being a leader in arrays with lots of smarts, is well aware of this. The fact that another file may also be sharing that block has zero impact on the time it takes to read it.
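Here is a toy Python model of that read path. It is a sketch under obvious assumptions: real arrays keep these maps in far more elaborate structures, and the file names, block contents, and dictionary layout are invented for the example.

```python
# Physical block store: physical block id -> contents.
physical_blocks = {
    0: b"shared OS binary block",
    1: b"quarterly report draft",
}

# Logical-to-physical map: (file name, logical block index) -> physical block id.
# Two files point at physical block 0, which is all "deduped" means on a read.
block_map = {
    ("vm1.vmdk", 0): 0,
    ("vm2.vmdk", 0): 0,
    ("report.doc", 0): 1,
}


def read_block(file_name, index):
    """Read one logical block: one map lookup, then one block fetch."""
    physical_id = block_map[(file_name, index)]
    return physical_blocks[physical_id]


print(read_block("vm1.vmdk", 0) == read_block("vm2.vmdk", 0))
print(read_block("report.doc", 0))
```

The read of the shared block and the read of the unshared block go through exactly the same two steps. Nothing in the lookup cares how many other files reference the block.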

It’s true that for sets of blocks that are changing all the time, you won’t get as much benefit from dedupe. That’s not because the performance will be bad. It is because when you change a block, it’s no longer a dupe. Therefore it has to be stored again as a new block. If you read a deduped block, modify it, and write it back out, it would have been a write in an un-deduped case anyway, so performance, again, is even-steven between deduped and non-deduped volumes. Everyone doing dedupe for primary - NetApp and Ocarina - does the deduplication as a post-process, so there’s no impact at all to write performance. No one is trying to dedupe that block as it is being written.
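To make the post-process idea concrete, here is a small Python sketch. It is not how NetApp or Ocarina implement it; the `Volume` class, its method names, and the choice of SHA-256 to spot identical blocks are all assumptions for illustration. The thing to notice is that `write` does no hashing and no dedupe lookup at all, and that changing a shared block simply lands as a new block.

```python
import hashlib


class Volume:
    """Toy volume whose dedupe runs as a separate pass, never on the write path."""

    def __init__(self):
        self.blocks = {}      # physical id -> block contents
        self.block_map = {}   # (file name, logical index) -> physical id
        self._next_id = 0

    def write(self, name, index, data):
        """Write path: always store a fresh block, exactly as an un-deduped volume would."""
        physical_id = self._next_id
        self._next_id += 1
        self.blocks[physical_id] = data
        self.block_map[(name, index)] = physical_id

    def read(self, name, index):
        return self.blocks[self.block_map[(name, index)]]

    def dedupe_pass(self):
        """Post-process: point identical blocks at one copy, then free unreferenced ones."""
        canonical = {}  # content digest -> physical id we keep
        for key, physical_id in list(self.block_map.items()):
            digest = hashlib.sha256(self.blocks[physical_id]).digest()
            self.block_map[key] = canonical.setdefault(digest, physical_id)
        live = set(self.block_map.values())
        self.blocks = {pid: data for pid, data in self.blocks.items() if pid in live}


vol = Volume()
vol.write("a.doc", 0, b"same contents")
vol.write("b.doc", 0, b"same contents")
print("before dedupe pass:", len(vol.blocks), "blocks")

vol.dedupe_pass()
print("after dedupe pass:", len(vol.blocks), "blocks")

# Modify one file's block: it is no longer a dupe, so it is stored again as a new block.
vol.write("a.doc", 0, b"edited contents")
print("after edit:", len(vol.blocks), "blocks;", "b.doc still reads", vol.read("b.doc", 0))
```

A real system would also reclaim overwritten blocks right away and verify matches byte for byte; the sketch leaves cleanup to the next pass to keep the write path obviously untouched.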

What is different, though, is that in a high rate-of-change application like a transactional database, you won’t see as much space savings with dedupe. That’s because if most of your blocks are either new or have just been changed, they won’t be dupes. Here’s misperception number 2: while there are some applications in primary storage where dedupe does not apply (the hot tablespaces in Oracle or SQL Server, for example), what you’ll find is that most data is a good candidate for dedupe on primary and nearline storage. In fact, much more data is stored in files that are good candidates for dedupe than not. All of the typical file/print files are great candidates for dedupe, but the misperception is that applications like Exchange and virtual machines shouldn’t be deduped. As it turns out, both are great candidates for dedupe (and compression, for that matter). Let’s take a look at VMs.

In a virtual machine environment, a storage array may be storing thousands of VMDKs, the VMware files that each store a given virtual machine. Inside each VMDK file is a complete virtual machine image, including the operating system, application files and user data. If you have 1,000 VMDKs that each hold a virtual Windows machine, you’ll have tens of thousands of “files” inside each VMDK file, including a copy of Microsoft Windows, the application you are running in the virtual machine, and often the data for that application as well. How much of the Windows operating system do you suppose is duplicated across the 1,000 VMDKs in this example? Well, almost all of it. What’s more, the thousands of files that make up Windows do not change - are not changeable, in fact, unless you do an OS upgrade.

Large parts of each VMDK file duplicate parts of the others, and they stay the same, day after day. Perfect candidates for dedupe. Sure, the user data in a VMDK may change, but any competent dedupe solution is not deduplicating whole files - the dedupe solution is deduplicating something at sub-file granularity: blocks, objects, chunks, etc. NetApp dedupes 4K WAFL file system blocks. Ocarina dedupes sub-file objects. The point is, regardless of which approach you take, if most of a VMDK file stays the same, and some part changes, dedupe will work great. The parts of the VMDK file that are changing won’t be deduped, and the vast majority of the file - the OS and application binaries - will be deduped. The space savings on your storage is great, and the performance impact minimal.
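A toy experiment in Python shows the effect. To be clear about the assumptions: the “VM images” below are synthetic byte strings, not real VMDKs, the 200/10 split between OS blocks and user blocks is invented, and hashing fixed 4K blocks with SHA-256 is just one simple way to find duplicates, not a description of any vendor’s engine.

```python
import hashlib

BLOCK_SIZE = 4096


def split_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedupe_stats(images):
    """Return (total blocks, unique blocks) across all images."""
    seen = set()
    total = 0
    for image in images:
        for block in split_blocks(image):
            total += 1
            seen.add(hashlib.sha256(block).digest())
    return total, len(seen)


# Synthetic "OS": 200 distinct blocks that are identical in every image.
os_part = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(200))


def user_part(vm_number):
    """Synthetic user data: 10 blocks unique to one VM."""
    return b"".join(
        f"vm{vm_number}-blk{j}".encode().ljust(BLOCK_SIZE, b"\0") for j in range(10)
    )


images = [os_part + user_part(n) for n in range(10)]
total, unique = dedupe_stats(images)
print(f"{total} blocks written, {unique} unique, {1 - unique / total:.0%} saved")
```

Only the per-VM user blocks survive as unique; every copy of the shared OS portion collapses to a single set of blocks. Add more VMs and the savings percentage keeps climbing, because each new image brings the same OS blocks and only a little new user data.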

In important ways, dedupe for primary storage is the perfect complement to thin provisioning. In thin provisioning, a storage solution virtualizes (i.e., lies about) the amount of disk space unused. With dedupe, the same storage solution can virtualize (i.e., lie about) how much space is used. The two together provide the maximum storage efficiency.

End to End Dedupe

Posted by Goutham Rao On October - 14 - 2009

Ed Note: We hope you enjoy this guest post from Goutham Rao, CTO, Ocarina Networks, a panelist at SNW this week on the topic of “Primary Storage: The New Frontier for Data Deduplication.” This offers a more detailed and nuanced look at the topics discussed on the panel.

If you’re like many in the storage industry, you think of deduplication mainly as disk optimization. However, in today’s modern data center, dedupe and storage optimization should be thought of as applying across the entire storage workflow, rather than in one particular storage component.

Why?
Because we are no longer in an era in which storage is merely about spinning disks. It is about data, which can be “at rest” and “in motion,” moving from primary storage to nearline, or to backup, or replicated to different sites. Dedupe, then, must apply to all storage workflows. This is more true than ever as massive growth of unstructured data is becoming the rule rather than the exception.

As a result, IT Administrators are saddled with more challenges than ever before.
They must manage activities such as migration, replication and backup, all of which can lead to problems as an organization’s data footprint grows.

If you think about it, storage administrators largely deal with three tasks:

·         Data Storage – Maintain data on various filers and spinning disks. Deal with volumes of various sizes. Perform all the routine maintenance associated with spinning disks, like upgrades and refreshes, replacing lost drives, filer upgrades, snapshot maintenance, quota management, storage provisioning and growth management.

·         Data Movement – Manage replication of storage tiers from one location to another, either for protection or high availability. Migrate data from one location, like branch offices, to another location, like a primary data center.

·         Data Protection – Backup of various file servers and dealing with VTL, media servers, libraries, tapes, selective file restores (DAR), tape refreshes.

As you can see, as data grows for a customer, their problems grow in these three dimensions. So if you are going to talk about “Storage Optimization” and your solution doesn’t scale or address the above three areas, you aren’t really providing a solution at all, but rather just a band-aid.

Tying the Storage Optimization Workflow Together

Based on the above observations, a good storage optimization solution should be cognizant of the lifecycle of various files in the storage system. When a file enters a primary file system, it is likely to move around and finally get backed up or deleted. The storage optimization solution should optimize data such that the optimization effect lasts through this entire workflow and lifecycle. It should optimize the files while they are at rest on the storage disks, and the same optimized format should also be communicable to other storage endpoints as these files move through the storage workflow. Finally, the same optimized format should be the one that can get backed up directly and also lend itself to restoration and recovery operations such as “Selective File Restore, DAR.”

Since the unit of communication between various storage tiers and lifecycle waypoints seems to be “FILES,” it seems logical that this optimized data format would be implemented as files ON-TOP of a file system, instead of directly modifying a file system’s block device data structures. The latter is not communicable across storage waypoints.

Dedupe/Optimization for Online

In order to optimize data for online storage (be it primary or nearline usage), the optimization solution needs to be aware of the life of the data beyond that particular tier. It needs to optimize data in such a way that the optimized data format is conducive for both movement (such as replication and migration) as well as backup (and restore). This has huge implications in how the optimized data is represented. Inherently, dedupe and optimization introduce a relationship between files that did not exist before.

As different files have different movement and backup policies, the optimized representation of these unrelated files needs to be amenable to independent lifecycles. Implementing dedupe as part of the file system’s data structures itself is counter to the notion of “Global Storage Optimization.” We call this the “Data Store Problem.” This is about how the dedupe solution “represents” or stores the various optimized data blocks associated with various unrelated files.

What needs to happen?
First, the data store representation must be smart enough that it can play well with the storage workflow. Otherwise, no matter how good the dedupe/optimization solution, it will always have a localized and limited effect. Second, online storage is quite different from backup storage, which means that the dedupe algorithms and techniques must also vary. For instance, in backup workflows, if a backup target sees 52 weekly backups, it is easy to imagine how the solution can get in excess of 25X dedupe savings.  Each week’s full backup file (which is in a particular backup software format) is likely to vary less than 5% from the previous week.
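
As a back-of-the-envelope check on the backup figures above, here is a small Python sketch. It assumes a deliberately simple model that is not from the original discussion: the first full backup is stored whole, and each later backup adds only the fraction that changed since the previous week.

```python
def backup_dedupe_ratio(num_backups, change_rate):
    """Logical-to-stored ratio when each full backup differs from the
    previous one by `change_rate` (a fraction of one full backup)."""
    logical = num_backups                         # in units of one full backup
    stored = 1 + (num_backups - 1) * change_rate  # first full + weekly deltas
    return logical / stored


for rate in (0.05, 0.02, 0.01):
    print(f"{rate:.0%} weekly change -> {backup_dedupe_ratio(52, rate):.1f}x")
```

Under this model, 52 weekly fulls at exactly 5% change come out near 15X, and the ratio passes 25X once the weekly change drops to roughly 2%. Real products differ in chunking and metadata overhead, so treat the numbers as illustrative only.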

But when it comes to online storage, you don’t have such obvious duplicate objects and files. The duplication does exist, but it is hard to find. The dupes are embedded within various rich files. In fact in today’s application environment, most files are in a rich encoding format, utilizing a compression and encoding scheme such as ZLIB, GZIP, PKZIP, BZIP, and many other single-file-optimization schemes. So even though there are redundancies across files, they are hard to find without digging deep for them.  You need to understand the application file format, delayer the format and find the duplicate objects.
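
The following Python sketch illustrates why the duplicates are hard to see until the format is delayered. Everything in it is hypothetical: the toy “rich file” container (objects joined by a separator and compressed as one zlib stream), the helper names and the generated content are assumptions for illustration, not any real application format.

```python
import hashlib
import random
import zlib

SEP = b"\n--OBJ--\n"  # hypothetical object separator inside the toy container
WORDS = ["disk", "block", "file", "chunk", "image", "backup", "tier", "node"]


def make_object(seed, lines=400):
    """Generate compressible, text-like content that differs per seed."""
    rng = random.Random(seed)
    return "\n".join(
        " ".join(rng.choice(WORDS) for _ in range(8)) for _ in range(lines)
    ).encode()


def pack(objects):
    """Toy 'rich file': objects joined and compressed as a single zlib stream."""
    return zlib.compress(SEP.join(objects))


def delayer(rich_file):
    """Undo the container encoding and return the embedded objects."""
    return zlib.decompress(rich_file).split(SEP)


def hashes(pieces):
    return {hashlib.sha256(piece).hexdigest() for piece in pieces}


def fixed_chunks(data, size=4096):
    return [data[i:i + size] for i in range(0, len(data), size)]


shared = make_object(seed=1)  # stands in for an embedded image or table
file_a = pack([make_object(2), shared, make_object(3)])
file_b = pack([make_object(4), make_object(5), shared])

common_raw = hashes(fixed_chunks(file_a)) & hashes(fixed_chunks(file_b))
common_objects = hashes(delayer(file_a)) & hashes(delayer(file_b))
print("duplicate 4K chunks in the encoded files:", len(common_raw))
print("duplicate objects after delayering:", len(common_objects))
```

Both files contain the same embedded object, but because each file is compressed as a whole, the encoded bytes have no block in common. Only after decoding does the shared object show up as a duplicate.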

Next, dedupe alone is insufficient for online storage, given the nature and workflow of online storage. Unlike the backup workflow, where most backup software purposely introduces duplication from one weekly backup to another, online data has no such redundant workflow.

Online data is different from other data objects, and so online storage optimization must rely on modern compression techniques. There are algorithms today that can further optimize data better than 25-year-old algorithms such as Lempel-Ziv. Since most of today’s data is already optimized, the solution must first decompress the files and then apply application and file specific compression techniques in conjunction with dedupe.

Standard block-level dedupe approaches will not work well. The solution must identify duplicates at the appropriate boundaries. Dedupe and compression have competing goals in a way. Dedupe likes small chunk sizes: the smaller the chunk, the more likely you are to find a duplicate chunk. However, small chunks are very compression-unfriendly. Compression likes large chunks where it can obtain a good amount of context. It’s better to compress 32K worth of data than 8 separate 4K chunks. So the question is, what is a good block size? This is where “Object boundary recognition” comes into play. An online dedupe solution will find the best possible object boundaries, such that each object is large enough to be properly optimized, yet no smaller chunk of that object appears as a duplicate in any other file.
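
The 32K-versus-4K point is easy to try for yourself. The Python sketch below uses zlib purely as a stand-in compressor (an assumption, since no specific algorithm is named here), and the sample data is synthetic text generated for the purpose.

```python
import random
import zlib

WORDS = ["storage", "dedupe", "chunk", "backup", "filer", "volume", "restore", "tier"]


def sample_data(size=32 * 1024, seed=0):
    """Build compressible, text-like sample data of exactly `size` bytes."""
    rng = random.Random(seed)
    text = " ".join(rng.choice(WORDS) for _ in range(size))  # more than enough words
    return text.encode()[:size]


def compressed_size_whole(data):
    """Compress the data as one block."""
    return len(zlib.compress(data))


def compressed_size_chunked(data, chunk=4096):
    """Compress each chunk independently and add up the results."""
    return sum(len(zlib.compress(data[i:i + chunk])) for i in range(0, len(data), chunk))


data = sample_data()
print("one 32K block:  ", compressed_size_whole(data), "bytes")
print("eight 4K chunks:", compressed_size_chunked(data), "bytes")
```

Each independently compressed chunk pays its own header and has to relearn the data’s statistics from scratch, so the eight small chunks add up to more than the single large block. The exact gap depends on the data and the compressor.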

Finally, an online dedupe solution must be aware of online storage workflows, which include random-read, modify, update and delete operations. Backup dedupe solutions only have to deal with streaming writes and streaming reads.  In online storage, you have IO access patterns that involve random read/writes, backward reads, overwrites, truncates, locking, concurrent access and so on.

A related topic is reducing the penalty of optimization. Online storage has much different performance metrics compared to backup solutions. In a backup optimization product, the focus is pretty much on how much sustained throughput of ingest can the backup VTL device handle? The measurements are in terms of “MBPS.” The metrics are, how many MBPS can a single stream upload handle?

But when it comes to online storage, the focus is not on how fast can you optimize data, but rather how fast can you rehydrate the data? It is about low latency access to any random part of a compressed file. If you compress a file from beginning to end, and you get a random access request to the middle of the file, you have to rehydrate that file from the beginning in order to service that random IO request.

This will make the latency too high to be practical, and that will prevent a dedupe solution from being adopted in online storage. So a good online dedupe solution will optimize data such that random read/write patterns suffer very low latency. It will also format and optimize the data so that rehydration of entire files uses as much CPU power as is available on the rehydration platform, and so that performance is asymmetric (more time spent on optimization, much less on rehydration).
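
One common way to get low-latency random reads from compressed data is to compress in independent chunks, so a read only rehydrates the chunks it touches. The Python sketch below shows that general technique under stated assumptions: the 64K chunk size, the function names and the use of zlib are all hypothetical choices for illustration, not a description of any particular product.

```python
import zlib

CHUNK = 64 * 1024  # hypothetical chunk size


def optimize(data, chunk=CHUNK):
    """Compress each chunk independently at the slowest, tightest zlib level.

    Spending more effort here and little on decompression mirrors the
    asymmetry described in the text.
    """
    return [zlib.compress(data[i:i + chunk], 9) for i in range(0, len(data), chunk)]


def read_range(chunks, offset, length, chunk=CHUNK):
    """Rehydrate only the chunks that overlap [offset, offset + length)."""
    if length <= 0:
        return b""
    first = offset // chunk
    last = (offset + length - 1) // chunk
    plain = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    start = offset - first * chunk
    return plain[start:start + length]


original = bytes(i % 251 for i in range(300_000))
chunks = optimize(original)
print(read_range(chunks, 150_000, 16) == original[150_000:150_016])
```

A read in the middle of the file decompresses one or two chunks rather than everything from the beginning. The trade-off is the one noted earlier: smaller chunks cut read latency but compress less well.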

Dedupe/Optimization for Data Movement

The whole goal behind online-dedupe is to represent parts of various files as a singularity. It brings in relationships between files that did not exist before. This optimality is fine while data resides on that storage endpoint.  But what if one of those files needs to be moved or replicated to another storage endpoint?  Must it move in its rehydrated (unoptimized) form?

When data moves between storage endpoints and tiers, the files may not move in the way they were optimized or along with exactly those files they were optimized with. For instance, if files A, B and C were optimized and deduped with respect to each other, but files A, B, E and F need to move to another endpoint, does this mean that these files need to be rehydrated? What if the target endpoint already has some chunks of data from files A, B, E and F due to some prior unrelated operation?

A good end-to-end optimization solution will recognize data movement operations such as migration and tiering and create an optimized package for data movement such that the package is self-redundant and also does not contain information that the target already knows about. For example, consider the use case where an enterprise wishes to back up file servers daily from various branch offices to a central location. This may involve multiple endpoint storage servers communicating with a single file server located at a data center. The dedupe solution must not only optimize at the endpoint locations but also optimize the daily backup workflows to the central office. The dedupe solution must be globally aware of duplicates that the other endpoints may have already communicated to the central data center endpoint.
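
Here is a minimal Python sketch of what such a movement package could look like. It assumes that source and target can compare chunk hashes before any data is sent. The 4K chunking, the function names and the in-memory “target store” are hypothetical, chosen only to illustrate the idea of not shipping what the target already knows about.

```python
import hashlib

CHUNK = 4096


def digest_of(block):
    return hashlib.sha256(block).hexdigest()


def build_package(files, target_known):
    """Build per-file recipes plus a payload of chunks the target lacks.

    files: name -> bytes; target_known: set of chunk hashes already on the target.
    Each needed chunk appears in the payload once, however many files use it.
    """
    recipes, payload = {}, {}
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), CHUNK):
            block = data[i:i + CHUNK]
            digest = digest_of(block)
            recipe.append(digest)
            if digest not in target_known:
                payload.setdefault(digest, block)
        recipes[name] = recipe
    return recipes, payload


def apply_package(target_store, recipes, payload):
    """Merge the payload into the target's chunk store and rehydrate the files."""
    target_store.update(payload)
    return {name: b"".join(target_store[d] for d in recipe)
            for name, recipe in recipes.items()}


common = b"C" * CHUNK
files = {
    "A": common + b"a" * CHUNK,
    "B": common + b"b" * CHUNK,
    "E": common + b"e" * CHUNK,
    "F": b"a" * CHUNK + b"f" * CHUNK,
}
target_store = {digest_of(common): common}  # left over from a prior, unrelated transfer

recipes, payload = build_package(files, set(target_store))
restored = apply_package(target_store, recipes, payload)
print(f"chunks referenced: {sum(len(r) for r in recipes.values())}, chunks sent: {len(payload)}")
```

In this toy run the four files reference eight chunks, but only four travel over the wire: the chunk the target already held is skipped, and the chunk shared by A and F is sent once.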

Dedupe/Optimization for Backup and Data Protection

Lastly, the online dedupe solution must be aware of the backup workflows. Deduped data needs to be backed up in an optimized form. Rehydrating data just so it can be backed up is counterproductive. The solution must submit the data to the backup target in such a way that a single-file or selective-file restore (Direct Access Restore) may be performed at any arbitrary location.  Today’s solutions solve this by rehydrating all the optimized data. As data moves from one stage to another, such as from a disk backup target to tape, the data is rehydrated and unoptimized before movement to the backup target.

Even if the IT organization uses traditional VTL workflows with media backup servers in their backup practices, the backup file dumps must be optimized file dumps and not rehydrated file dumps. Such file dumps must be locally optimized in such a way that direct access restore (selective file and directory restores) can still be performed without requiring access to any other older backup dump.

A part of optimizing backup workflows is actually to move away from VTL workflows in the first place. A good dedupe/optimization solution for backup will allow for end user direct file restores. This spares administrators from having to restore files or selective files out of what could potentially be petabytes worth of backup data. Backup is the final resting place for files. The workflow should allow for versions of files to enter the backup target and for end users to directly restore any file they want without IT involvement.

SNW in Full Swing

Posted by Sunshine On October - 13 - 2009

Plenty going on in Phoenix this week, as SNW is in its second day. Already, the reports are coming in, making this blogger feel sad, bereft, and out of it for not being there in person–stuck here as I am in the midst of some kind of gale on the Bay Area coast.

However, I’m happy to report that this blog’s parent Ocarina has a presence there–both CEO Murli Thirumale and CTO Goutham Rao are at the event, and today at 2 p.m. Rao will be on a panel: Primary Storage: The New Frontier for Data Deduplication. We invite you to attend if you want to learn more about this hot topic in storage.

The panel has a great lineup:

Moderator: Arun Taneja, Taneja Group

aruntaneja

Val Bercovici, NetApp

valb00

Jered Floyd, Permabit (who also wins the “most interesting facial hair” award)

jeredfloyd

Goutham Rao, Ocarina - no pic available

Peter Smails, Storwize

petersmails

Where’s The Growth? Storage!

Posted by Sunshine On September - 27 - 2009

Friday’s CNBC segment is worth a watch if you’re wondering where the greatest growth will be over the next decade in the technology sector.

“Everything that we are using now–videos on YouTube. Everything we’re doing on our phone. On and on. It’s all about storage and putting stuff on the net,” opines Roger Nusbaum of Your Source Financial, who adds, “Cloud computing comes into play here.”