Online Storage Optimization

Exploring Next Generation Storage Solutions

End to End Dedupe

Posted by Goutham Rao On October - 14 - 2009

Ed Note: We hope you enjoy this guest post from Goutham Rao, CTO, Ocarina Networks, a panelist at SNW this week on the topic of “Primary Storage: The New Frontier for Data Deduplication.” This offers a more detailed and nuanced look at the topics discussed on the panel.

If you’re like many in the storage industry, you think of deduplication mainly as disk optimization. In the modern data center, however, dedupe and storage optimization should be thought of as applying across the entire storage workflow, rather than to one particular storage component.

Why?
Because we are no longer in an era in which storage is merely about spinning disks. It is about data, which can be “at rest” or “in motion”: moving from primary storage to nearline, or to backup, or replicated to different sites. Dedupe, then, must apply to all storage workflows. This is more true than ever as massive growth of unstructured data becomes the rule rather than the exception.

As a result, IT Administrators are saddled with more challenges than ever before.
They must manage activities such as migration, replication and backup, all of which can lead to problems as an organization’s data footprint grows.

If you think about it, storage administrators largely deal with three tasks:

· Data Storage – Maintain data on various filers and spinning disks. Deal with volumes of various sizes. Perform all the routine maintenance associated with spinning disks, like upgrades and refreshes, replacing lost drives, filer upgrades, snapshot maintenance, quota management, storage provisioning and growth management.

· Data Movement – Manage replication of storage tiers from one location to another, either for protection or high availability. Migrate data from one location, like branch offices, to another location, like a primary data center.

· Data Protection – Back up various file servers and deal with VTLs, media servers, libraries, tapes, selective file restores (DAR) and tape refreshes.

As you can see, as a customer’s data grows, their problems grow in these three dimensions. So if you are going to talk about “Storage Optimization” and your solution doesn’t scale across or address all three areas, you aren’t really providing a solution at all, but rather just a band-aid.

Tying the Storage Optimization Workflow Together

Based on the above observations, a good storage optimization solution should be cognizant of the lifecycle of the various files in the storage system. When a file enters a primary file system, it is likely to move around and finally get backed up or deleted. The storage optimization solution should optimize data such that the optimization effect lasts through this entire workflow and lifecycle. It should optimize the files while they are at rest on the storage disks, and the same optimized format should be communicable to other storage endpoints as these files move through the storage workflow. Finally, the same optimized format should be the one that gets backed up directly, and it should lend itself to restoration and recovery operations such as “Selective File Restore, DAR.”

Since the unit of communication between the various storage tiers and lifecycle waypoints seems to be “FILES,” it seems logical that this optimized data format would be implemented as files ON TOP of a file system, instead of by directly modifying a file system’s block-device data structures. The latter is not communicable across storage waypoints.

Dedupe/Optimization for Online

In order to optimize data for online storage (be it primary or nearline usage), the optimization solution needs to be aware of the life of the data beyond that particular tier. It needs to optimize data in such a way that the optimized data format is conducive to both movement (such as replication and migration) and backup (and restore). This has huge implications for how the optimized data is represented. Inherently, dedupe and optimization introduce a relationship between files that did not exist before.

As different files have different movement and backup policies, the optimized representation of these unrelated files needs to be amenable to independent lifecycles. Implementing dedupe as part of the file system’s data structures itself is counter to the notion of “Global Storage Optimization.” We call this the “Data Store Problem.” This is about how the dedupe solution “represents” or stores the various optimized data blocks associated with various unrelated files.

What needs to happen?
First, the data store representation must be smart enough that it can play well with the storage workflow. Otherwise, no matter how good the dedupe/optimization solution, it will always have a localized and limited effect. Second, online storage is quite different from backup storage, which means that the dedupe algorithms and techniques must also vary. For instance, in backup workflows, if a backup target sees 52 weekly backups, it is easy to imagine how the solution can get in excess of 25X dedupe savings.  Each week’s full backup file (which is in a particular backup software format) is likely to vary less than 5% from the previous week.
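
To put rough numbers on the backup case, here is a back-of-the-envelope sketch. It assumes a deliberately simple model that is not from any product: the first full backup is stored in its entirety, and each later weekly full adds only the fraction of data that changed since the previous week. The function name `dedupe_ratio` is made up for this illustration.

```python
def dedupe_ratio(weeks: int, weekly_change: float) -> float:
    """Logical data ingested divided by data actually stored.

    Assumed model: week 1 is stored in full; every later weekly full
    backup contributes only `weekly_change` (a fraction of one full
    backup) of new, unique data.
    """
    logical = float(weeks)                       # 52 full backups ingested
    stored = 1.0 + (weeks - 1) * weekly_change   # one full copy plus weekly deltas
    return logical / stored


for change in (0.05, 0.02, 0.01):
    print(f"{change:.0%} weekly change -> {dedupe_ratio(52, change):.1f}X")
```

Under this model a 5% weekly change works out to roughly 14.6X over 52 weeks, and the ratio passes 25X once the weekly change drops to about 2%. Real backup streams will differ, but the shape of the math is why backup targets see such large numbers.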

But when it comes to online storage, you don’t have such obvious duplicate objects and files. The duplication does exist, but it is hard to find. The dupes are embedded within various rich files. In fact in today’s application environment, most files are in a rich encoding format, utilizing a compression and encoding scheme such as ZLIB, GZIP, PKZIP, BZIP, and many other single-file-optimization schemes. So even though there are redundancies across files, they are hard to find without digging deep for them.  You need to understand the application file format, delayer the format and find the duplicate objects.
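
A small experiment makes the point about hidden duplicates. This is only a sketch: it uses plain zlib as a stand-in for a “rich” file encoding, a fixed 256-byte chunk size, and made-up file contents. Two files embed the same object, and we count matching chunks before and after delayering (decompressing).

```python
import hashlib
import zlib

CHUNK = 256  # illustrative fixed chunk size, in bytes


def chunk_hashes(data: bytes, size: int = CHUNK) -> set:
    """Return the set of SHA-256 digests of fixed-size chunks of `data`."""
    return {
        hashlib.sha256(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    }


def pad_to_chunk(data: bytes, size: int = CHUNK) -> bytes:
    """Pad with spaces so whatever follows starts on a chunk boundary."""
    return data + b" " * (-len(data) % size)


# The same embedded object appears in two otherwise unrelated files.
shared_object = "".join(f"record {i}: value {i * i}\n" for i in range(2000)).encode()
doc_a = pad_to_chunk(b"Quarterly report, branch A\n" * 40) + shared_object
doc_b = pad_to_chunk(b"Engineering notes, site B\n" * 75) + shared_object

# What lands on disk: each file compressed on its own.
stored_a = zlib.compress(doc_a)
stored_b = zlib.compress(doc_b)

# Chunk-level dedupe on the stored (compressed) bytes.
compressed_common = len(chunk_hashes(stored_a) & chunk_hashes(stored_b))

# Chunk-level dedupe after delayering the files.
raw_common = len(
    chunk_hashes(zlib.decompress(stored_a)) & chunk_hashes(zlib.decompress(stored_b))
)

print("matching chunks in compressed form:", compressed_common)
print("matching chunks after delayering:  ", raw_common)
```

The shared object is padded onto a chunk boundary here to keep the sketch short. Real files are not so obliging, which is one reason fixed-size block dedupe struggles with online data.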

Next, dedupe alone is insufficient for online storage, given the nature and workflow of online storage. Unlike the backup workflow, where most backup software purposely introduces duplication from one weekly backup to the next, online data has no such redundant workflow.

Online data is different from other data objects, and so online storage optimization must rely on modern compression techniques. There are algorithms today that can optimize data better than decades-old general-purpose algorithms such as Lempel-Ziv. Since most of today’s data is already compressed, the solution must first decompress the files and then apply application-specific and file-specific compression techniques in conjunction with dedupe.

Standard block-level dedupe approaches will not work well. The solution must identify duplicates at the appropriate boundaries. Dedupe and compression have competing goals in a way. Dedupe likes small chunk sizes: the smaller the chunk, the more likely you are to find a duplicate chunk. However, small chunks are very compression-unfriendly. Compression likes large chunks where it can obtain a good amount of context. It’s better to compress 32K worth of data as one unit than as 8 separate 4K chunks. So the question is, what is a good block size? This is where “object boundary recognition” comes into play. An online dedupe solution will find the best possible object boundaries, such that each object is large enough to be compressed well, yet no smaller piece of that object appears as a duplicate in any other file.
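
The 32K-versus-4K point is easy to check for yourself. The sketch below uses zlib and synthetic text built from a small, made-up vocabulary (both are assumptions for illustration), and compares compressing 32K in one pass with compressing the same bytes as eight independent 4K chunks.

```python
import random
import zlib

# Deterministic, text-like sample data built from a small vocabulary.
rng = random.Random(0)
VOCAB = ["storage", "dedupe", "chunk", "backup", "filer",
         "volume", "replica", "tier", "snapshot", "restore"]
data = " ".join(rng.choice(VOCAB) for _ in range(10_000)).encode()[:32 * 1024]

# One 32K unit: the compressor sees all of the context at once.
whole = len(zlib.compress(data))

# Eight independent 4K chunks: each one starts with no context.
pieces = sum(
    len(zlib.compress(data[i:i + 4096])) for i in range(0, len(data), 4096)
)

print(f"compressed as one 32K unit:   {whole} bytes")
print(f"compressed as eight 4K chunks: {pieces} bytes")
```

Each small chunk pays its own header cost and cannot refer back to repeats that live in a neighboring chunk, so the eight-chunk total comes out larger. That is the tension between dedupe-friendly and compression-friendly chunk sizes.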

Finally, an online dedupe solution must be aware of online storage workflows, which include random-read, modify, update and delete operations. Backup dedupe solutions only have to deal with streaming writes and streaming reads.  In online storage, you have IO access patterns that involve random read/writes, backward reads, overwrites, truncates, locking, concurrent access and so on.

A related topic is reducing the penalty of optimization. Online storage has very different performance metrics from backup solutions. In a backup optimization product, the focus is pretty much on how much sustained ingest throughput the backup VTL device can handle. The measurements are in terms of “MBPS”: how many MBPS can a single upload stream handle?

But when it comes to online storage, the focus is not on how fast can you optimize data, but rather how fast can you rehydrate the data? It is about low latency access to any random part of a compressed file. If you compress a file from beginning to end, and you get a random access request to the middle of the file, you have to rehydrate that file from the beginning in order to service that random IO request.

That makes latency too high to be practical, and it is the kind of penalty that will keep a dedupe solution from being adopted for online storage. So a good online dedupe solution will optimize data such that random read/write patterns see very low latency. It will also format the data so that rehydrating entire files can use as much CPU power as the rehydration platform has available, and so that performance is asymmetric: optimization may take more time, but rehydration takes much less.
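
One common way to avoid rehydrating from the beginning is to compress a file as independent blocks, so that a random read touches only the blocks it overlaps. The sketch below shows that general idea with zlib. The block size and the function names `pack` and `read_range` are assumptions for this illustration, not a description of any particular product’s on-disk format.

```python
import zlib

BLOCK = 64 * 1024  # illustrative block size: 64 KiB of original data per block


def pack(data: bytes, block: int = BLOCK) -> list:
    """Compress each fixed-size block independently of the others."""
    return [zlib.compress(data[i:i + block]) for i in range(0, len(data), block)]


def read_range(blocks: list, offset: int, length: int, block: int = BLOCK):
    """Return (bytes, blocks_touched) for a random read.

    Only the compressed blocks that overlap [offset, offset + length)
    are rehydrated; everything before them is skipped.
    """
    if length <= 0:
        return b"", 0
    first = offset // block
    last = (offset + length - 1) // block
    pieces = [zlib.decompress(b) for b in blocks[first:last + 1]]
    start = offset - first * block
    return b"".join(pieces)[start:start + length], len(pieces)


# A 1 MiB sample file, then a 100-byte read from the middle of it.
original = bytes(range(256)) * 4096
packed = pack(original)
payload, touched = read_range(packed, offset=700_000, length=100)
print(f"rehydrated {touched} of {len(packed)} blocks for a 100-byte read")
```

The trade-off is the one described earlier: smaller blocks mean lower read latency but less context for the compressor, so the block size has to be chosen with both goals in mind.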

Dedupe/Optimization for Data Movement

The whole goal behind online-dedupe is to represent parts of various files as a singularity. It brings in relationships between files that did not exist before. This optimality is fine while data resides on that storage endpoint.  But what if one of those files needs to be moved or replicated to another storage endpoint?  Must it move in its rehydrated (unoptimized) form?

When data moves between storage endpoints and tiers, the files may not move in the way they were optimized or along with exactly those files they were optimized with. For instance, if files A, B and C were optimized and deduped with respect to each other, but files A, B, E and F need to move to another endpoint, does this mean that these files need to be rehydrated? What if the target endpoint already has some chunks of data from files A, B, E and F due to some prior unrelated operation?

A good end-to-end optimization solution will recognize data movement operations such as migration and tiering, and will create an optimized package for data movement such that the package is self-redundant and also does not contain information that the target already knows about. For example, consider the use case where an enterprise wishes to back up file servers daily from various branch offices to a central location. This may involve multiple endpoint storage servers communicating with a single file server located at a data center. The dedupe solution must not only optimize at the endpoint locations but also optimize the daily backup workflows to the central office. The dedupe solution must be globally aware of duplicates that the other endpoints may have already communicated to the central data center endpoint.
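
To make the “optimized package” idea concrete, here is a minimal sketch. It assumes files are already split into chunks identified by SHA-256 digest, and that the source can learn which digests the target already holds. The names `build_package` and `apply_package` and the package layout are hypothetical.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def build_package(files: dict, target_has: set) -> dict:
    """Build a transfer package for the files being moved.

    `files` maps a file name to its ordered list of chunks. The package
    carries a recipe (ordered digests) for every file, plus each chunk
    the target does not already know about, stored exactly once.
    """
    recipes, payload = {}, {}
    for name, chunks in files.items():
        recipe = []
        for chunk in chunks:
            h = digest(chunk)
            recipe.append(h)
            if h not in target_has and h not in payload:
                payload[h] = chunk
        recipes[name] = recipe
    return {"recipes": recipes, "chunks": payload}


def apply_package(package: dict, target_store: dict) -> dict:
    """Merge new chunks into the target's store and rebuild each file."""
    target_store.update(package["chunks"])
    return {
        name: b"".join(target_store[h] for h in recipe)
        for name, recipe in package["recipes"].items()
    }


# Files A and B share a chunk; the target already holds one chunk of B.
x, y, z = b"x" * 1000, b"y" * 1000, b"z" * 1000
moving = {"A": [x, y], "B": [y, z]}
target_store = {digest(z): z}

package = build_package(moving, set(target_store))
restored = apply_package(package, target_store)
print(f"chunks referenced: 4, chunks sent: {len(package['chunks'])}")
```

Nothing is rehydrated along the way: the files travel as recipes plus the chunks the target is missing, and the same approach extends to many branch offices feeding one central store.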

Dedupe/Optimization for Backup and Data Protection

Lastly, the online dedupe solution must be aware of the backup workflows. Deduped data needs to be backed up in an optimized form. Rehydrating data just so it can be backed up is counterproductive. The solution must submit the data to the backup target in such a way that a single-file or selective-file restore (Direct Access Restore) can be performed at any arbitrary location. Today’s solutions handle this by rehydrating all the optimized data. As data moves from one stage to another, such as from a disk backup target to tape, the data is rehydrated and unoptimized before it moves to the backup target.

Even if the IT organization uses traditional VTL workflows with media backup servers in their backup practices, the backup file dumps must be optimized file dumps and not rehydrated file dumps. Such file dumps must be locally optimized in such a way that direct access restore (selective file and directory restores) can still be performed without requiring access to any other older backup dump.

Part of optimizing backup workflows is actually to move away from VTL workflows in the first place. A good dedupe/optimization solution for backup will allow end users to restore files directly. This spares administrators from having to restore files or selective files out of what could be petabytes’ worth of backup data. Backup is the final resting place for files. The workflow should allow versions of files to enter the backup target, and allow end users to directly restore any file they want without IT involvement.

Dedupe Grows Up

Posted by Sunshine On July - 29 - 2009

George Crump has a piece in Byte and Switch today that poses an important question: “Can we get to a single point of deduplication?” This is a question that we have taken up in one form or another in some of our recent posts, such as this one and this one.

In the article, Crump asks the question in another way: “… can you have all your data tiers; primary, archive and backup deduplicated by a single engine?”

In light of the recent focus on deduplication, this in my view is a question that really does need to be raised. For how long will the industry continue to silo out these different tiers for its deduplication solutions? And how much sense does it make to rehydrate data every time you move it, in order to once again deduplicate it? Not a lot.

Crump writes: “The current deduplication vendors could work on building out their solutions to either scale up into primary storage performance (see Data Domain’s DD880) or they could move their existing data duplication technology into other markets; see the increased speed of Ocarina Networks and Permabit as well as their move into cloud storage.”

At the same time, as we’ve pointed out here, online storage is quite a bit different from backup, and so far at least, none of the successful backup dedupe vendors (Data Domain, Diligent, Quantum, etc.) has been able to break into it. Rather, it is NetApp and Ocarina who have been the trailblazers.

Crump makes another key point:

“NetApp and Ocarina could continue to enhance and improve the re-hydration speed of their technologies to make read performance a non-issue, making primary storage a viable platform. Ocarina can already maintain the deduplicated format as they move through tiers, so landing on backup or archive disk would simply be another move for them.”

This is an interesting observation, and one that is often missed in reporting on both of these solutions. We look forward to seeing more debate and discussion on this issue, which was well kicked off with this piece.

Doing More With Less

Posted by Sunshine On July - 23 - 2009

If you’re trying to figure out how to do more with less when it comes to your storage, I’d strongly suggest you participate in an upcoming Webinar, “How to Use Storage Tiering to Create Cost Efficient Storage of your Online Data.” It will take place on August 5 at 9 a.m. PDT (12 p.m. EDT).

The time to register for this event is now, and visiting the above link will walk you through the steps to do so.

Sponsored by Ocarina and BlueArc, the webcast will delve into the practicalities involved in achieving storage efficiency. The focus will be on use of intelligent storage tiering and capacity optimization technologies to reduce data footprint and effectively manage data center storage resources.

Featured panelists will be Noemi Greyzdorf, Research Manager, Storage Software at IDC, Victoria Kepnik, Sr Product Manager at BlueArc, and Eric Scollard, VP of Sales at Ocarina Networks.

As we have discussed here in the past, storage tiering can be one important way to reduce disk costs. As Carter George put it in a recent post: “To keep up, you have to cut the flab out of your storage. This, too, calls for a two-pronged approach. … This means doing a better job of tiering, and keeping files only as long as you really need them … The second part is the “exercise” element of keeping your storage slim and trim. That is, run a storage efficiency tool - may we suggest Ocarina as one example - that will efficiently trim the fat out of your data. That kind of combination means that you really can tighten your belt on your storage budget.”

And while storage tiering and capacity optimization are frequently discussed in storage publications, this is the first time I’ve seen this particular group of storage experts come together and take a serious and significant look at the details of how this can best be achieved for your enterprise. We look forward to your participation.

Deduplication - A Hot Topic

Posted by Sunshine On June - 11 - 2009

In the wake of the news that EMC and NetApp are bidding close to $2 billion (yes, you heard that right, billion) for backup deduplication specialist Data Domain, a whole lot of people are starting to ask, “just what is this deduplication stuff anyway?” On this very blog, readers are clicking on the “Deduplication” tag as never before. A sign, we think, that people are seeking information about this all-important topic.

The battle for control of Data Domain is no coincidence or random event. It’s based on the immense and growing value in deduplication technology, which has the ability to reduce overall storage costs (CapEx and OpEx) like few other advancements. Simply put, reducing the space it takes to store files is perhaps the best way to cut costs as data that must be stored continues to skyrocket. Ultimately, as deduplication makes its way into the primary storage arena–where this blog’s parent Ocarina Networks is a pioneering player–it is becoming a game changer.

As Murli Thirumale, CEO of Ocarina Networks and an occasional contributor to this blog put it in a recent post:

“As we’re now recognizing, those in the storage industry who dismissed Data Domain’s backup storage reduction as ‘just a feature’ were proven wrong. It was a product AND a company.  Our belief is that online storage optimization is both a feature and a product. We also believe the market for this capability is very horizontal and very large and can likely support 1-2 public companies in the future.”

Carter George, lead writer on this blog and VP Products at Ocarina (and a storage industry veteran), also had the following to say:

“The Data Domain technology gives NetApp an immediate leadership position in Dedupe for Backup, but the DDUP technology will not translate easily in to dedupe for primary, where NetApp Dedupe already exists in the filer.

At $25 a share, NetApp paid a 40% premium on yesterday’s closing DDUP price of about $18/share. This is a great comparable for not only the excitement around data reduction, but also the difficulty of doing it right. In short, if it was easy to copy, then NetApp would not have paid the premium.”

Whatever happens to Data Domain–whether it becomes a part of NetApp, EMC, some other acquirer, or remains on its own–this story is a huge validation of this particular business area.

Got Ocarina?

Posted by Carter George On May - 26 - 2009

With so much talk about dedupe lately, it’s hard to know which aspect of it to discuss first. We’ll begin by addressing a question we occasionally hear, most recently in a comment on this very blog. The question can be summed up as: why would someone pay extra for a solution such as Ocarina, rather than simply go with the dedupe that is free from NetApp (or another vendor)? This is not unlike the old saw “Why buy the cow when you can get the milk for free?” Well, the really quick answer in this case is that you are likely going to get a LOT more milk than you would otherwise be able to.

In any case, this question was one of many indications that we need to write a post or two to help clear up some points that may be confusing to those who aren’t deeply involved in this particular corner of the storage industry. So we thought it worthwhile to get into some detail with our response.

NetApp is the market leader in NAS, and they were pioneers in dedupe for online storage. If we at Ocarina are going to be successful, we have to explain why our technology is different and better than what they have.

NetApp is a model for both good technology and execution - and so is their recent acquisition, Data Domain. We have great respect and admiration for both companies, but we do have a better mousetrap when it comes to dedupe for online storage. It’s our job to tell the world about that. We’d love to partner with NetApp. In fact, we have NetApp customers who have purchased our product and are running it in production, having evaluated us against NetApp Dedupe. However, NetApp clearly feels they own “dedupe for primary” for their filers, and their purchase of Data Domain makes it clear that they intend to make storage efficiency and dedupe a major focus of the company going forward.

We are going to have to compete with their offerings. That’s fine - competition is what drives innovation and forward progress, and what makes the technology industry so much fun to work in! That said, then, if a company like Ocarina is going to be successful, we clearly have to do two things:  1) have a better mousetrap — if we’re not better than the thing you get for free, why would anyone buy our product? and 2) have a solution that works for both NetApp customers and non-NetApp shops.

NetApp Dedupe is pretty good for some things, but it has limitations. One obvious one is that it doesn’t help the EMC, HP, Isilon, Dell, IBM, BlueArc, or HDS NAS shop. It goes without saying that every NetApp customer will try the free dedupe first, and would only come to Ocarina after realizing that they need better results. If the NetApp dedupe is “good enough,” we do not expect customers to bring in Ocarina. However, there are several for whom it is not.

Keep in mind that the NAS market is huge, and growing faster than any other storage market segment. There are many other NAS vendors besides NetApp, and all the customers of all those storage vendors - from Windows file/print servers, to big players like EMC and HDS, to aggressive technology-leading NAS players like BlueArc and Isilon - are interested in the benefits of dedupe and storage efficiency for online storage. Ocarina wants those customers to know that that technology is available for all those platforms. You don’t have to buy a NetApp filer to get dedupe for online. And we want them to know that if they do go with Ocarina, they are not just paying for something on their storage that’s no better than what they’d get for free from NetApp. They are getting something better.

Now to the Storage Switzerland report that started so much of this discussion. It was about one thing: how well Ocarina object dedupe and content-aware compression shrinks data compared to NetApp dedupe. We think we did pretty well.

But there are other issues to consider. Two that come up with every customer are 1) how fast can you dedupe a data set, and 2) how fast can users and applications access data after it has been optimized? We’ll cover both of those topics in future benchmarks and reports, as they’re both very important.

I’d like to point out that there are also two other big considerations for customers who want to reduce their storage footprint using dedupe or compression technology. These are things that are less measurable in a lab report, but just as important.

One of those is the ability for a customer to buy one “dedupe for online” solution and have it work across storage from multiple vendors that they might have in house. Sure, some customers have only one file serving platform, but many have multiple. They have multiple tiers, or they have old stuff and new stuff, or they have a standard of Vendor A but got a lot of Vendor B stuff when they acquired and integrated another company.

With Ocarina, a customer can choose a single “dedupe for online” solution that would work across all those vendors’ storage - including NetApp - with a single interface, a unified management console, and even the ability to dedupe across platforms.

The second product design differentiator, and the reason people will pay for us instead of deploying something that’s free, is the granularity of optimization. With NetApp Dedupe, you either dedupe a volume or not. It only works on NetApp, and your choice for a given NetApp volume is dedupe or don’t dedupe. With Ocarina, you have any number of dials to choose from based on your specific needs. You can choose to optimize sets of files within a volume by fine-grained policy. Examples might be: don’t dedupe database files at all; dedupe Office and PDF files that are two days old; dedupe and compress all media, video and photo files that have not been modified for 10 days. Really, any mix of file type, age, or metadata characteristic can be used to create policies that determine not only whether a file gets optimized, but how aggressively: object dedupe only; object and subfile dedupe; object and subfile dedupe plus lightweight compression; or all of the above plus sophisticated content-aware compression. You can match your level of dedupe and compression to the SLAs, business value, and characteristics of your files. I don’t know how you benchmark that, but it’s pretty significant.
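
For readers who think in code, here is a rough sketch of what policies like the examples above could look like. To be clear, this is purely illustrative and is not Ocarina’s configuration syntax or API: the `Level` and `Policy` types, the file extensions, the level assigned to each rule, and the default for unmatched files are all assumptions.

```python
import os
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Optimization levels, from least to most aggressive."""
    NONE = 0
    OBJECT_DEDUPE = 1
    OBJECT_AND_SUBFILE_DEDUPE = 2
    DEDUPE_PLUS_LIGHT_COMPRESSION = 3
    DEDUPE_PLUS_CONTENT_AWARE_COMPRESSION = 4


@dataclass(frozen=True)
class Policy:
    extensions: frozenset   # file types the rule applies to
    min_age_days: int       # days since last modification
    level: Level            # how aggressively to optimize


# Hypothetical rules mirroring the examples in the text; first match wins.
POLICIES = [
    Policy(frozenset({".mdf", ".dbf"}), 0, Level.NONE),
    Policy(frozenset({".doc", ".xls", ".ppt", ".pdf"}), 2,
           Level.OBJECT_AND_SUBFILE_DEDUPE),
    Policy(frozenset({".jpg", ".mov", ".mp4"}), 10,
           Level.DEDUPE_PLUS_CONTENT_AWARE_COMPRESSION),
]


def choose_level(filename: str, days_since_modified: int) -> Level:
    """Return the level of the first matching policy, or NONE if none match."""
    ext = os.path.splitext(filename)[1].lower()
    for policy in POLICIES:
        if ext in policy.extensions and days_since_modified >= policy.min_age_days:
            return policy.level
    return Level.NONE  # assumption: unmatched files are left untouched


for name, age in [("sales.mdf", 100), ("report.pdf", 1),
                  ("report.pdf", 3), ("clip.mov", 12)]:
    print(f"{name} (modified {age} days ago) -> {choose_level(name, age).name}")
```

The point is simply that the decision is made per file, from type and age, rather than once per volume.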

With all this in mind, we plan a second post to delve into yet more issues related to the Ocarina-NetApp lab report, so please stay tuned.

Dedupe Ramp-Up

Posted by Sunshine On May - 6 - 2009

Big announcement today from Ocarina about our newest product release, Ocarina 3.0. At this point, there’s no question that dedupe for primary is one of the hottest topics in storage. Our recent partnership announcements with major vendors were well covered in the storage press, and if anything the excitement around this technology is growing.

Update as of 5/8/09: Chris Mellor has a piece out in The Register (UK) on the new Ocarina release. Even better, he references our earlier post on dedupe.

In today’s release, Ocarina details the changes that this release makes to its groundbreaking dedupe and compression solution for online data. Here’s one key part to take note of:

“New to Ocarina ECOsystem 3.0 is object deduplication, the ability to manage optimized data end-to-end and customize the workflow by deciding how and where in the life-cycle to dedupe and compress data. Object deduplication, unique to the Ocarina ECOsystem, can identify duplicate information within and across file types, tiers and vendors, resulting in much higher reduction rates.”

As Carter explained in an earlier post, this is one tangible way that Ocarina gets better results than the more standard, block-level dedupe everyone else is doing. What this means is that it can intelligently discern natural objects in a file, deduping them as it finds them. If you stop and think about it, this is clearly a far more surgical approach, and the results bear this out. A head-to-head comparison should be published soon, and I’ll be sure to let you know when it is.
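
As a toy illustration of what “natural objects in a file” can mean, consider container formats. Office 2007 documents, for instance, are ZIP containers, so the same embedded image can sit inside two different documents. The sketch below is not Ocarina’s algorithm; it only shows the general idea, using Python’s standard zipfile module and made-up member names and contents.

```python
import hashlib
import io
import zipfile


def make_container(members: dict) -> bytes:
    """Build an in-memory ZIP container from {member name: content}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()


def object_hashes(container: bytes) -> dict:
    """Hash each member's decompressed content, i.e. each natural object."""
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        return {
            name: hashlib.sha256(zf.read(name)).hexdigest()
            for name in zf.namelist()
        }


# The same "logo" is embedded in two different documents, under different names.
logo = bytes(range(256)) * 64
deck = make_container({"slides/slide1.xml": b"<slide>Q3 results</slide>",
                       "media/logo.png": logo})
memo = make_container({"word/document.xml": b"<doc>Travel policy</doc>",
                       "media/image7.png": logo})

shared = set(object_hashes(deck).values()) & set(object_hashes(memo).values())
print("containers are byte-identical:", deck == memo)
print("duplicate objects found across them:", len(shared))
```

The two containers differ as files, yet the embedded object matches exactly once you look at the object level, regardless of what it is named or where it sits in the container.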

In any case, this all bodes well for the storage industry. As we approach the petabyte era, the survival of the world’s data centers depends on finding new and innovative ways to reduce the rising tide of data.

Our Prediction for the Hottest Storage Category of 2009

Posted by Mike Davis On January - 19 - 2009

And the winner is… dedupe for online

When it comes to storage, our market research and experience with customers have led us to the following prediction: dedupe for online storage will emerge as the hottest category of the year in 2009.

The current economic climate, coupled with the pace of advancement in cloud storage, has created a perfect storm in which the need for cheap online storage is growing exponentially.

This category, which has also been referred to as “dedupe for primary,” is a hot one with several entrants, one of which is my company, Ocarina Networks.

Some industry observers have implied that this category is being overplayed, and that dedupe for primary won’t be as hot in the coming year as others have predicted. This is no doubt due to a misunderstanding of what is meant by “primary” storage, and where the bulk of the data growth is occurring. To clarify, we’re not talking here about dedupe for transactional databases or backups. The vast increases we’ve seen in storage demand are all in files and in nearline, not in performance-oriented primary storage.

With this in mind, here are the three key areas to consider when thinking about a dedupe solution for online:

1) How much can the product shrink an online data set with a wide mix of the typical kinds of files driving storage growth?
2) How fast can users access files that have been compressed and deduplicated?
3) How easy is it to integrate this new technology into an existing file serving environment?

I’m glad to say that Ocarina excels on all three fronts. Any product can deduplicate virtual machine images. The real question is which ones can also get good results on Exchange, Office 2007, PDF, and the wide range of image-rich data found in Web 2.0, energy, life sciences, medicine, and engineering. That’s where the rubber hits the road for our customers, and so most likely you’re going to be facing the same issues for your nearline data.

Of course, only time will tell whether this prediction is correct, but I’m betting the farm on it myself.