
Online Storage Optimization

Exploring Next Generation Storage Solutions

Happy New Year

Posted by Sunshine On December - 29 - 2009

Tis the week for the “out of office” email messages. But the storage blogo-tweet-osphere waits for no man. Here are a few posts that caught my eye this week.

Bas Raayman sees CPU power hitting the wall: The RAM per CPU wall

Rick Vanover says 2010 could be the year for 10GigE - Will 2010 see 10 Gigabit Ethernet go mainstream?

It being the end of a year–and a decade–predictions abounded. We’re pleased to note that when it came to summarizing the top storage stories of 2009, deduplication for primary storage, the specialty of this blog’s parent Ocarina, made the big lists:

Infostor: The top 5 storage technologies of 2009 (and 2010?)

“Storage optimization (or data reduction) technologies such as data deduplication and compression can significantly reduce capacity requirements and costs … Consider data reduction for primary storage.”

SearchStorage - Beth Pariseau: Top 10 enterprise data storage news stories of 2009

“10. Data deduplication branches out. As deduplication settled into a comfortable role in backup, data-reduction technology started working its way into other parts of the data storage infrastructure, including primary as well as nearline and archived data … Ocarina and Isilon Clustered NAS help visual effects studio archive images, cut costs.”

For sheer inventiveness, blogger Stephen Foskett wins the prize with his 2009 predictions post, in which he turns the clock back and takes advantage of 20-20 hindsight: My 2009 IT Industry Predictions.

Meanwhile, social media and tech watcher Louis Gray takes himself to task and looks at all of his 2009 predictions to see how well he fared: My 2009 Tech Predictions: Mixed, But Nailed Real-Time.

OK that’s all for now. Here’s wishing all of you a happy, healthy, green and techy new decade.

Bring Out Your Data - The Deets

Posted by Sunshine On November - 4 - 2009

Lots of speculation this past week in the storage tweet-o-blog-osphere around our “Bring Out Your Data” Challenge for Tech Field Day. We can’t wait to see what these smart and savvy participants bring us, and we’re confident about the results. There will be prizes awarded for those who stymie us and those who get the greatest reduction. This morning, we sent out a brief email giving a few more details about it. In the spirit of transparency, here is what we sent to the attendees:

Dear Tech Field Day attendee,

Ocarina Networks has issued a challenge to you for Tech Field Day: bring out your data. In brief, we’re asking you to arrive on November 13 at our offices with a thumb drive containing your toughest data set. We will compress and dedupe that data for you right in front of your eyes. This will be a chance for you to see the Ocarina ECOsystem in action so that you can assess data reduction and performance for yourself in real time.

Here are a few guidelines.

1. Try to keep it under 2 GB. This is to ensure that as many participants as possible have an opportunity to shrink their data during the four-hour time period you will be at the Ocarina offices.

2. If you would like to see both deduplication and compression, we recommend that you bring data that includes duplicates. In other words, one 2GB file is not going to be deduplicatable, but several different files that have shared objects will show much more interesting results. If you’re only interested in seeing our compression capabilities, then this isn’t necessary, but please keep in mind that the results you get in that case won’t reflect the deduplication feature.

3. Give us a mix of files from your local hard drive.

4. Label your stick. Put your name somewhere on the physical thumb drive. Also, give the directory your own first and last name.

A final note: we will return your flash drive to you at the end of the day, but please don’t bring us a sole copy of an important piece of data, as we may return it to you with the data in a compressed format.

Thanks for your participation in Tech Field Day, and we look forward to meeting you next week!

Best wishes,

The Ocarina Social Media Team

Carter George, Mike Davis, Sunshine Mugrabi, and Helen Miller-Montana

Dedupe Misconceptions

Posted by Ocarina On October - 20 - 2009

As most in the industry are aware, dedupe has become a standard offering from every major vendor. Dedupe for primary has become the technology of the moment, and for good reason–the rising tide of unstructured data is forcing data centers worldwide to rethink capacity planning, tiering, and storage efficiency. But there are still a few lone voices out there who are clinging to the notion that dedupe is unnecessary.

Take for example this recent post from Compellent’s Bruce Kornfeld, “Is dedupe the only answer?” Kornfeld is responding to a recent SearchStorage article, “Is Data Duplication Right For Your Primary Storage?”

Dedupe and compression can both be applied directly to primary data, and the savings there can be comparable to what’s seen in backup. On backup data, vendors claim 20x data reduction, and on primary data we think that most customers will see about 5x.

So, you say, “That means that you get four times more space savings on backup, right?” Wrong! A 20x reduction means a savings of 95% against the size of the original data set, while 5x means a savings of 80%. That is a difference of only 15 percentage points - and an 80% space reduction is a huge win for the primary storage user. Of course, vendors who do not have a dedupe solution are likely to tell you that you don’t need it anyway. There are some valid concerns about dedupe for primary, but there are also some misperceptions, and there’s no reason to let misinformation be propagated.
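To make the arithmetic concrete, here is a minimal sketch of how an N:1 reduction ratio converts into a percentage saved. The function name `savings_fraction` is ours, purely for illustration.

```python
def savings_fraction(reduction_ratio: float) -> float:
    """Fraction of the original capacity saved at a given N:1 reduction ratio."""
    if reduction_ratio < 1:
        raise ValueError("a ratio below 1x would mean the data grew")
    # At N:1, the data occupies 1/N of its original size.
    return 1.0 - 1.0 / reduction_ratio


for ratio in (20, 5):
    print(f"{ratio}x reduction -> {savings_fraction(ratio):.0%} saved")
# 20x reduction -> 95% saved
# 5x reduction -> 80% saved
```

The curve flattens quickly: going from 5x to 20x quadruples the ratio but adds only 15 points of savings.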

The biggest difference between dedupe for backup and dedupe for primary is that in backup, you dedupe all of the data. There’s no reason not to. In primary data, you might not want to dedupe everything - there are some data sets it does not make sense for. That’s not a knock on dedupe for primary. It just means you should choose which data sets make sense to dedupe.

The first common misperception about dedupe for primary data is that performance will be worse. But this is really not the case. When primary data has been deduped (but not compressed), an application asks the storage for a block, and that block is retrieved. There is one lookup to map the logical block request to the physical one - but those kinds of lookups are already being done in every storage array that has any kind of storage virtualization, such as thin provisioning. The response time on a block read for deduped data is hardly different than for un-deduped data, and this is true for all primary dedupe solutions - including both NetApp and Ocarina. There’s no more overhead to retrieving a deduped block than there would be in any other block read I/O on any intelligent array –and Compellent, being a leader in arrays with lots of smarts, is well aware of this. The fact that another file may also be sharing that block has zero impact on the time it takes to read it.

It’s true that for sets of blocks that are changing all the time, you won’t get as much benefit from dedupe. That’s not because the performance will be bad. It is because when you change a block, it’s no longer a dupe. Therefore it has to be stored again as a new block. If you read a deduped block, modify it, and write it back out, it would have been a write in an un-deduped case anyway, so performance, again, is even-steven between deduped and non-deduped volumes. Everyone doing dedupe for primary - NetApp and Ocarina - does the deduplication as a post-process, so there’s no impact at all to write performance. No one is trying to dedupe that block as it is being written.
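The read and write behavior described above can be sketched with a toy block map. This is an illustrative model only, not how NetApp or Ocarina actually implement it; the class `Volume` and its methods are hypothetical names.

```python
import hashlib


class Volume:
    """Toy volume: logical block numbers map to physical blocks (illustrative only)."""

    def __init__(self):
        self.block_map = {}   # logical block number -> physical block id
        self.physical = {}    # physical block id -> block bytes
        self._next_id = 0

    def write(self, lbn: int, data: bytes) -> None:
        # Writes always land in a fresh physical block: no dedupe on the write path.
        self.physical[self._next_id] = data
        self.block_map[lbn] = self._next_id
        self._next_id += 1

    def read(self, lbn: int) -> bytes:
        # One map lookup, whether or not other logical blocks share this one.
        return self.physical[self.block_map[lbn]]

    def dedupe(self) -> None:
        # Post-process pass: point duplicates at a single copy, then free the rest.
        first_seen = {}
        for lbn, pid in list(self.block_map.items()):
            fp = hashlib.sha256(self.physical[pid]).digest()
            self.block_map[lbn] = first_seen.setdefault(fp, pid)
        live = set(self.block_map.values())
        self.physical = {pid: d for pid, d in self.physical.items() if pid in live}


vol = Volume()
vol.write(0, b"A" * 4096)
vol.write(1, b"A" * 4096)
vol.write(2, b"B" * 4096)
vol.dedupe()                   # blocks 0 and 1 now share one physical copy
vol.write(1, b"C" * 4096)      # changing block 1 simply stores a new block
print(len(vol.physical), vol.read(0)[:1], vol.read(1)[:1])
```

Note that `read` does the same single lookup before and after `dedupe`, and that the overwrite of block 1 never touches the copy that block 0 still points to.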

What is different, though, is that in a high rate-of-change application like a transactional database, you won’t see as much space savings with dedupe. That’s because if most of your blocks are either new or have just been changed, they won’t be dupes. Here’s misperception number 2: while there are some applications in primary storage where dedupe does not apply (the hot tablespaces in Oracle or SQL Server, for example), what you’ll find is that most data is a good candidate for dedupe on primary and nearline storage. In fact, much more data is stored in files that are good candidates for dedupe than not. All of the typical file/print files are great candidates for dedupe, but the misperception is that applications like Exchange and virtual machines shouldn’t be deduped. As it turns out, both are great candidates for dedupe (and compression, for that matter). Let’s take a look at VMs.

In a virtual machine environment, a storage array may be storing thousands of VMDKs, the VMware files that store a given virtual machine. Inside each VMDK file is a complete virtual machine image, including the operating system, application files and user data. If you have 1,000 VMDKs that each hold a virtual Windows machine, you’ll have tens of thousands of “files” inside each VMDK file, including a copy of Microsoft Windows, the application you are running in the virtual machine, and often the data for that application as well. How much of the Windows operating system do you suppose is duplicated across the 1,000 VMDKs in this example? Well, almost all of it. What’s more, the thousands of files that make up Windows do not change - are not changeable, in fact, unless you do an OS upgrade.

Large parts of the VMDK file are duplicate with others, and they stay the same, day after day. Perfect candidates for dedupe. Sure, the user data in a VMDK may change, but any competent dedupe solution is not deduplicating whole files - the dedupe solution is deduplicating something at sub-file granularity: blocks, objects, chunks, etc. NetApp dedupes 4K WAFL file system blocks. Ocarina dedupes sub-file objects. The point is, regardless of which approach you take, if most of a VMDK file stays the same, and some part changes, dedupe will work great. The parts of the VMDK file that are changing won’t be deduped, and the vast majority of the file - the OS and application binaries - will be deduped. The space savings on your storage is great, and the performance impact minimal.
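As a rough illustration of the VMDK case, the sketch below fingerprints fixed 4K blocks across a set of made-up images that share an identical "OS" region and differ only in a small "user data" region. The image layout, block counts, and names are all invented for the example; real VMDKs and real dedupe engines are far more involved.

```python
import hashlib

BLOCK = 4096  # fixed block size, chosen for illustration


def count_blocks(images):
    """Return (blocks referenced, unique blocks stored) across all images."""
    seen, total = set(), 0
    for image in images:
        for off in range(0, len(image), BLOCK):
            seen.add(hashlib.sha256(image[off:off + BLOCK]).digest())
            total += 1
    return total, len(seen)


# Hypothetical layout: 100 shared "OS" blocks plus 10 unique blocks per VM.
os_part = b"".join(bytes([i]) * BLOCK for i in range(100))
images = [
    os_part + b"".join(f"vm{v}-blk{b}".encode().ljust(BLOCK, b".") for b in range(10))
    for v in range(20)
]

total, unique = count_blocks(images)
print(f"{total} blocks referenced, {unique} stored, {total / unique:.1f}:1")
# 2200 blocks referenced, 300 stored, 7.3:1
```

The changing user-data blocks are stored once each, as they must be, while the shared OS blocks are stored once for all 20 images.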

In important ways, dedupe for primary storage is the perfect complement to thin provisioning. In thin provisioning, a storage solution virtualizes (i.e., lies about) the amount of disk space unused. With dedupe, the same storage solution can virtualize (i.e., lie about) how much space is used. The two together provide the maximum storage efficiency.

End to End Dedupe

Posted by Goutham Rao On October - 14 - 2009

Ed Note: We hope you enjoy this guest post from Goutham Rao, CTO, Ocarina Networks, a panelist at SNW this week on the topic of “Primary Storage: The New Frontier for Data Deduplication.” This offers a more detailed and nuanced look at the topics discussed on the panel.

If you’re like many in the storage industry, you think of deduplication mainly as disk optimization. However, in today’s modern data center, dedupe and storage optimization should be thought of as applying across the entire storage workflow, rather than in one particular storage component.

Why?
Because we are no longer in an era in which storage is merely about spinning disks. It is about data, which can be “at rest” and “in motion”: moving from primary storage to nearline, or to backup, or replicated to different sites. Dedupe, then, must apply to all storage workflows. This is more true than ever as massive growth of unstructured data is becoming the rule rather than the exception.

As a result, IT Administrators are saddled with more challenges than ever before.
They must manage activities such as migration, replication and backup, all of which can lead to problems as an organization’s data footprint grows.

If you think about it, storage administrators largely deal with three tasks:

· Data Storage – Maintain data on various filers and spinning disks. Deal with volumes of various sizes. Perform all the routine maintenance associated with spinning disks, like upgrades and refreshes, replacing lost drives, filer upgrades, snapshot maintenance, quota management, storage provisioning and growth management.

· Data Movement – Manage replication of storage tiers from one location to another, either for protection or high availability. Migrate data from one location, like branch offices, to another location, like a primary data center.

· Data Protection – Backup of various file servers and dealing with VTL, media servers, libraries, tapes, selective file restores (DAR), tape refreshes.

As you can see, as data grows for a customer, their problems grow in these three dimensions. So if you are going to talk about “Storage Optimization,” if your solution doesn’t scale or address the above three areas, you aren’t really providing a solution at all, but rather just a band aid.

Tying the Storage Optimization Workflow Together

Based on the above observations, a good storage optimization solution should be cognizant of the lifecycle of various files in the storage system. When a file enters a primary file system, it is likely to move around and finally get backed up or deleted. The storage optimization solution should optimize data such that the optimization effect lasts through this entire workflow and lifecycle. It should optimize the files while they are at rest on the storage disks, and also the same optimized format should be communicable to other storage end points as these files move through the storage workflow. Finally, the same optimized format should be the one that can get backed up directly and also lend itself to restoration and recovery operations such as “Selective File Restore, DAR.”

Since the unit of communication between various storage tiers and lifecycle waypoints seems to be “FILES,” it seems logical that this optimized data format would be implemented as files ON-TOP of a file system, instead of directly modifying a file system’s block device data structures. The latter is not communicable across storage waypoints.

Dedupe/Optimization for Online

In order to optimize data for online storage (be it primary or nearline usage), the optimization solution needs to be aware of the life of the data beyond that particular tier. It needs to optimize data in such a way that the optimized data format is conducive for both movement (such as replication and migration) as well as backup (and restore). This has huge implications in how the optimized data is represented. Inherently, dedupe and optimization introduce a relationship between files that did not exist before.

As different files have different movement and backup policies, the optimized representation of these unrelated files needs to be amenable to independent lifecycles. Implementing dedupe as part of the file system’s data structures itself is counter to the notion of “Global Storage Optimization.” We call this the “Data Store Problem.” This is about how the dedupe solution “represents” or stores the various optimized data blocks associated with various unrelated files.

What needs to happen?
First, the data store representation must be smart enough that it can play well with the storage workflow. Otherwise, no matter how good the dedupe/optimization solution, it will always have a localized and limited effect. Second, online storage is quite different from backup storage, which means that the dedupe algorithms and techniques must also vary. For instance, in backup workflows, if a backup target sees 52 weekly backups, it is easy to imagine how the solution can get in excess of 25X dedupe savings.  Each week’s full backup file (which is in a particular backup software format) is likely to vary less than 5% from the previous week.
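A back-of-the-envelope model shows how repeated full backups drive the ratio. This is a simplification under loud assumptions: every weekly full is the same size, the changed portion is entirely new data, the unchanged portion dedupes perfectly, and compression is ignored. The function name is hypothetical.

```python
def backup_dedupe_ratio(weeks: int, weekly_change: float) -> float:
    """Logical-to-stored ratio for `weeks` full backups under the simple model above."""
    logical = weeks                             # in units of one full backup
    stored = 1 + (weeks - 1) * weekly_change    # first full, plus only the changes
    return logical / stored


for change in (0.05, 0.02, 0.01):
    print(f"{change:.0%} weekly change -> {backup_dedupe_ratio(52, change):.1f}x")
# 5% weekly change -> 14.6x
# 2% weekly change -> 25.7x
# 1% weekly change -> 34.4x
```

Under this model the ratio is very sensitive to the change rate: dedupe alone passes 25x at roughly 2% weekly change, and compressing the unique data would raise every figure further.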

But when it comes to online storage, you don’t have such obvious duplicate objects and files. The duplication does exist, but it is hard to find. The dupes are embedded within various rich files. In fact in today’s application environment, most files are in a rich encoding format, utilizing a compression and encoding scheme such as ZLIB, GZIP, PKZIP, BZIP, and many other single-file-optimization schemes. So even though there are redundancies across files, they are hard to find without digging deep for them.  You need to understand the application file format, delayer the format and find the duplicate objects.
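The effect of rich encoding on duplicate detection can be seen in a small experiment. The sketch below builds two made-up "files" that embed the same 64-block payload behind different headers, then compares 4K block fingerprints before and after each file is compressed as a whole with zlib. The file layout is invented for the example, and fixed-block fingerprinting stands in for whatever a real product does.

```python
import hashlib
import zlib

BLOCK = 4096


def block_fingerprints(data: bytes) -> set:
    """Fingerprints of the fixed-size blocks that make up `data`."""
    return {hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)}


# A shared payload of compressible, non-repeating text, exactly 64 blocks long.
lines = b"".join(
    b"object-%06d:%s\n" % (i, hashlib.sha256(str(i).encode()).hexdigest().encode())
    for i in range(4000)
)
shared = lines[:64 * BLOCK]

# Two hypothetical files embedding the same payload behind different headers.
file_a = (b"header-A " * 1000)[:BLOCK] + shared
file_b = (b"another-header-B " * 1000)[:2 * BLOCK] + shared

raw_common = len(block_fingerprints(file_a) & block_fingerprints(file_b))
packed_common = len(block_fingerprints(zlib.compress(file_a))
                    & block_fingerprints(zlib.compress(file_b)))
print("blocks in common, raw:", raw_common)
print("blocks in common, compressed:", packed_common)
```

The raw files share all 64 payload blocks. Once each file is compressed, the shared content is encoded differently in each stream, so block-level comparison typically finds nothing in common. That is why the format has to be decoded before the duplicates become visible.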

Next, dedupe alone is insufficient for online storage just given the nature and workflow of online storage. Unlike the backup workflow, where most backup software purposely introduces duplication from one weekly backup to another, online data has no such redundant workflow.

Online data is different from other data objects, and so online storage optimization must rely on modern compression techniques. There are algorithms today that can further optimize data better than 25-year-old algorithms such as Lempel-Ziv. Since most of today’s data is already optimized, the solution must first decompress the files and then apply application and file specific compression techniques in conjunction with dedupe.

Standard block level dedupe approaches will not work well. The solution must identify duplicates at the appropriate boundaries. Dedupe and compression have competing goals in a way. Dedupe likes small chunk sizes–the smaller the chunk, the more likely you are to find a duplicate chunk. However, small chunks are very compression-unfriendly. Compression likes large chunks where it can obtain a good amount of context. It’s better to compress 32K worth of data at once than to compress 8 separate 4K chunks. So the question is, what is a good block size? This is where “Object boundary recognition” comes into play. An online dedupe solution will find the best possible object boundaries such that each object is large enough to be properly optimized, yet no smaller chunk of that object is likely to appear as a duplicate in any other file.
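The chunk-size trade-off on the compression side is easy to check. The sketch below compresses the same 32K of synthetic, text-like data once as a whole and once as eight independent 4K chunks, using zlib as a stand-in compressor. The word list and seed are arbitrary.

```python
import random
import zlib

rng = random.Random(42)  # fixed seed so the run is repeatable
words = [b"storage", b"dedupe", b"primary", b"backup",
         b"filer", b"volume", b"snapshot", b"tier"]

# 32K of synthetic text-like data built from a small vocabulary.
data = b" ".join(rng.choice(words) for _ in range(8000))[:32 * 1024]

whole = len(zlib.compress(data))
pieces = sum(len(zlib.compress(data[i:i + 4096]))
             for i in range(0, len(data), 4096))

print(f"one 32K chunk:   {whole} bytes")
print(f"eight 4K chunks: {pieces} bytes")
```

Each small chunk pays its own header overhead and starts with no history to match against, so the eight-chunk total comes out larger. Exact sizes depend on the data and the compressor.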

Finally, an online dedupe solution must be aware of online storage workflows, which include random-read, modify, update and delete operations. Backup dedupe solutions only have to deal with streaming writes and streaming reads.  In online storage, you have IO access patterns that involve random read/writes, backward reads, overwrites, truncates, locking, concurrent access and so on.

A related topic is reducing the penalty of optimization. Online storage has much different performance metrics compared to backup solutions. In a backup optimization product, the focus is largely on how much sustained ingest throughput the backup VTL device can handle. The measurements are in terms of “MBPS”: how many MBPS can a single upload stream handle?

But when it comes to online storage, the focus is not on how fast can you optimize data, but rather how fast can you rehydrate the data? It is about low latency access to any random part of a compressed file. If you compress a file from beginning to end, and you get a random access request to the middle of the file, you have to rehydrate that file from the beginning in order to service that random IO request.

This will make the latency too high to be practical. These things will prevent a dedupe solution from being adopted in online storage. So a good online dedupe solution will optimize data such that random read/write patterns suffer very low latency. It will also format and optimize the data in such a way that rehydration of entire files uses as much CPU power as is available on the rehydration platform, and it will perform asymmetrically (taking more time for optimization but much less time for rehydration).
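One common way to keep random reads cheap is to compress a file as independent chunks, so that a read touches only the chunks that cover the requested range. The sketch below illustrates that idea with zlib. It is a generic technique, not a description of any vendor's format, and the chunk size and function names are assumptions.

```python
import zlib

CHUNK = 32 * 1024  # uncompressed bytes per independently compressed chunk (assumed)


def pack(data: bytes) -> list:
    """Compress `data` as a list of independent chunks."""
    return [zlib.compress(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]


def read_at(chunks: list, offset: int, length: int) -> bytes:
    """Read `length` bytes at `offset`, decompressing only the chunks needed."""
    out = bytearray()
    end = offset + length
    idx = offset // CHUNK
    while offset < end and idx < len(chunks):
        raw = zlib.decompress(chunks[idx])
        start = offset - idx * CHUNK
        take = raw[start:start + (end - offset)]
        if not take:
            break  # offset is past the end of the data
        out += take
        offset += len(take)
        idx += 1
    return bytes(out)


data = bytes(range(256)) * 1024            # 256K of sample data, 8 chunks
chunks = pack(data)
piece = read_at(chunks, 100_000, 50_000)   # spans two chunks, touches only those
print(piece == data[100_000:150_000])
```

The cost is the one discussed earlier: smaller independent chunks compress less well, so the chunk size is a trade between read latency and space savings.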

Dedupe/Optimization for Data Movement

The whole goal behind online-dedupe is to represent parts of various files as a singularity. It brings in relationships between files that did not exist before. This optimality is fine while data resides on that storage endpoint.  But what if one of those files needs to be moved or replicated to another storage endpoint?  Must it move in its rehydrated (unoptimized) form?

When data moves between storage endpoints and tiers, the files may not move in the way they were optimized or along with exactly those files they were optimized with. For instance, if files A, B and C were optimized and deduped with respect to each other, but files A, B, E and F need to move to another endpoint, does this mean that these files need to be rehydrated? What if the target endpoint already has some chunks of data from files A, B, E and F due to some prior unrelated operation?

A good end-to-end optimization solution will recognize data movement operations such as migration and tiering and create an optimized package for data movement such that the package is self-redundant and also does not contain information that the target already knows about. For example, consider the use case where an enterprise wishes to back up file servers daily from various branch offices to a central location. This may involve multiple endpoint storage servers communicating with a single file server located at a data center. The dedupe solution must not only optimize at the endpoint locations but also optimize the daily backup workflows to the central office. The dedupe solution must be globally aware of duplicates that the other endpoints may have already communicated to the central data center endpoint.
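The idea of not sending what the target already knows about can be sketched as a fingerprint exchange: the source offers the list of chunk fingerprints that make up a file, the target answers with the ones it lacks, and only those chunks travel. The `Endpoint` class and `replicate` function below are hypothetical and ignore real-world concerns such as transport, integrity checks, and garbage collection.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class Endpoint:
    """Toy storage endpoint holding deduped chunks and per-file recipes."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes, stored once
        self.files = {}    # file name -> ordered list of fingerprints

    def store(self, name, pieces):
        recipe = []
        for piece in pieces:
            fp = fingerprint(piece)
            self.chunks.setdefault(fp, piece)
            recipe.append(fp)
        self.files[name] = recipe

    def missing(self, fps):
        return [fp for fp in fps if fp not in self.chunks]

    def read(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])


def replicate(src, dst, name):
    """Copy one file from src to dst; return the number of chunk bytes sent."""
    recipe = src.files[name]
    payload = {fp: src.chunks[fp] for fp in set(dst.missing(recipe))}
    dst.chunks.update(payload)
    dst.files[name] = list(recipe)
    return sum(len(c) for c in payload.values())


os_img, app, a_data, e_data = b"O" * 4096, b"P" * 4096, b"a" * 512, b"e" * 512
branch, central = Endpoint(), Endpoint()
branch.store("A", [os_img, app, a_data])
central.store("E", [os_img, e_data])     # from some earlier, unrelated operation
sent = replicate(branch, central, "A")
print(sent, central.read("A") == branch.read("A"))
```

File A arrives intact at the central endpoint even though the shared `os_img` chunk was never sent, because the target already held it from file E.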

Dedupe/Optimization for Backup and Data Protection

Lastly, the online dedupe solution must be aware of the backup workflows. Deduped data needs to be backed up in an optimized form. Rehydrating data just so it can be backed up is counterproductive. It must submit the data to the backup target in such a way that a single-file or selective-file restore (Direct Access Restore) may be performed at any arbitrary location. Today’s solutions solve this by rehydrating all the optimized data. As data moves from one stage to another, such as on a disk backup target to tape, the data is rehydrated and unoptimized before movement to the backup target.

Even if the IT organization uses traditional VTL workflows with media backup servers in their backup practices, the backup file dumps must be optimized file dumps and not rehydrated file dumps. Such file dumps must be locally optimized in such a way that direct access restore (selective file and directory restores) can still be performed without requiring access to any other older backup dump.

A part of optimizing backup workflows is actually to move away from VTL workflows in the first place. A good dedupe/optimization solution for backup will allow for end user direct file restores. This will allow for administrators to not have to deal with restoring files or selective files from what could potentially involve petabytes worth of backup data. Backup is the final resting place for files. The workflow should allow for versions of files to enter the backup target and for end users to directly restore any file they want without IT involvement.

Dedupe for Primary–Everyone’s Talkin’

Posted by Ocarina On October - 7 - 2009

It’s very interesting to write for a blog that is focused on a specific topic–in our case, dedupe for primary–and then suddenly see the whole world wake up to the reality of it all at once. There has been quite the pile-on in the storage blogosphere of late.

So, what has been said so far? First, we had Chuck Hollis on his blog talking about primary dedupe and data I/O density. He makes some great points, but he is seeing the problem in a certain way–in essence, he’s thinking about how data reduction may impact the performance of primary storage. However, in some cases, dedupe can improve performance, where it allows much higher cache hit rates on highly used shared data blocks (virtual machines are the perfect example). Another fact is that a lot of storage on expensive primary tiers today does not need to be there. It started there, but it’s grown cold. If you don’t want to create another tier and move files, dedupe gives you a way to create a cheaper logical tier on the storage you already have.

In that case, some trade-off in performance is perfectly acceptable. Ocarina’s solution for deduping primary storage  gives you the choice of deduping in-place (creating a logical tier 2) or doing dedupe-and-migrate as a single atomic operation, shrinking colder data and moving it off of Tier 1 storage in one step. In fact we just announced that Ocarina is now part of the EMC Velocity Technology and ISV Program, giving EMC’s Celerra a major edge over NetApp for both in-place dedupe on Celerra for primary, and for dedupe-and-tier.

A string of comments on Chuck’s blog included some heated exchanges between Chuck and arch rival NetApp’s bloggers Vaughn Stewart and Kostadis Roussos.

In response to the post, Hu Yoshida at HDS put in his view, which is that he essentially agrees with Chuck on this question. His main point is that dedupe for primary isn’t a panacea. True enough, but as Hu himself has noted in an earlier post, there’s a great advantage to integrating it when you’re already taking advantage of these other tiering, storage virtualization, and provisioning options.

Then finally, EMC Avamar’s Steve Kenniston covered a great deal of ground, and in fact ended up highlighting two key points that are complementary parts of Ocarina’s strategy. First, we want to get as many deeply-embedded design wins with NAS and file system vendors as possible - meaning that a common “language of dedupe” would be spoken across multiple vendors. Second, we’re developing an end-to-end dedupe strategy, where a file that is deduped early in its lifecycle can be kept in its most compressed form as it moves throughout storage workflows.

Once deduped, data should never have to be rehydrated unless it is being accessed by users and applications. For all the rest of the classic storage workflows - backup, replication, data distribution, archive - there’s no reason for data to have to be rehydrated as it moves across tiers, platforms, and vendors.

Examples would be supporting replication of optimized (deduped and compressed) volumes, allowing deduped volumes to be backed up without rehydration (regardless of what the backup target is), and seamless integration with NDMP so that NDMP backups and restores can work transparently with deduped files, without even knowing that they are deduped. The first wave of dedupe products were not only vendor-specific (NetApp Dedupe) but also tier-specific (dedupe for backup, dedupe for primary, etc). While there are cases where a customer’s need for data reduction is urgent enough to deploy those point solutions, the real win is when dedupe is common and compelling across vendors and tiers.

Now, some people might say, “I already bought a dedupe appliance for my backup target, do I really need dedupe anywhere else?” But the fact is, if you dedupe upstream from your backup appliance, you not only save money on primary storage,  you still get benefit from your backup appliance. Backups are repetitive - you back up a volume every day, either full or incremental. So even if you have already deduped your primary volume, by backing it up every day, you are creating more duplicates in the backup target. The dedupe appliance will find those and take them out. If the primary volume has already been deduped, though, your backup data set will be smaller, and the work that the backup appliance has to do will be faster. The benefits are cumulative - if you get 5:1 on your primary data, and then back that up every day for a month, you may end up with 100:1 savings in your backups instead of today’s 20:1.
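The cumulative claim is simple multiplication, which a few lines make explicit. The sketch assumes, as the paragraph above does, that the backup appliance still achieves its usual ratio on data that was already reduced upstream. The volume size and day count are made up for the example.

```python
def stored_tb(logical_tb: float, *ratios: float) -> float:
    """Capacity left after applying each N:1 reduction ratio in turn."""
    for ratio in ratios:
        logical_tb /= ratio
    return logical_tb


logical = 10 * 30  # hypothetical: a 10 TB volume backed up in full daily for 30 days

backup_only = stored_tb(logical, 20)             # appliance dedupe alone
primary_then_backup = stored_tb(logical, 5, 20)  # 5:1 upstream, then the appliance

print(backup_only, logical / backup_only)                  # 15.0 20.0
print(primary_then_backup, logical / primary_then_backup)  # 3.0 100.0
```

In practice the two stages may overlap in the redundancy they find, so the combined figure should be read as an upper bound for this simple model.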

Interestingly, EMC has all the pieces here. And actually we can show how this works in an HDS environment just as well, which we may do in a later post. If you run Ocarina to dedupe your primary file store on Celerra, Ocarina can do the following:

* Optimize some primary files, identified by file type, right where they are on the fastest primary tier.   This may allow those files to see better cache hit rates.
* Optimize other files and move them to another volume on the same Celerra or another Celerra, perhaps a volume with SATA instead of Fibre drives.  Because Ocarina uses EMC FileMover stubs, this means that we can create a much larger global namespace on Celerra than a simple Celerra volume would support.
* Optimize all files in policy and post them to EMC Celerra for archive in an optimized form (deduped and compressed).
* Optimize whole volumes, and then back up those volumes to EMC Data Domain, where additional dedupe will take place as backup after backup creates more dupes in the backup target.

All of this can be done - on EMC, HDS, and other vendors - as true “end to end” dedupe, where data only gets rehydrated where it’s needed for a live application or user I/O request.

The Dedupe Race

Posted by Sunshine On August - 14 - 2009


The storage press has sniffed out a good story recently. Today, Beth Pariseau has a piece up on her Storage Soup blog that homes in on the drama surrounding the technology du jour–deduplication.

The post, “HP to EMC/Data Domain: Bring it On” has a headline that’s reminiscent of the sort of fighting words we heard from our former president.

Pariseau writes: “Admittedly late to the data deduplication game, Hewlett-Packard Co. is brewing new dedupe offerings to compete with the market’s new 800-pound gorilla — EMC/Data Domain. … HP partners with Sepaton for high-end VTLs and Ocarina for primary storage data reduction, but also develops deduplication software for its entry-level disk backup devices.”

Earlier this week, Chris Mellor at The Register covered the HP-Ocarina partnership news, also talking about it in terms of the rising competition for a complete dedupe solution. His article “HP Makes Ocarina Music” has a subhead that speaks volumes: “Ocarina close to clean sweep of file vendors.”

Mellor writes: “Ocarina has similar partnerships with BlueArc, EMC and Isilon. It looks almost inevitable that every other filer supplier must be looking at the Ocarina product and thinking a reseller deal might be a good idea. Otherwise, it could lose sales to the competition when a lot of image-type data is being stored.”

It will be interesting to see how this story unfolds. We vendor bloggers are already chattering about the recent partnership announcement, such as this post on the HP Storageworks “Around The Storage Block” blog. The post, by Pete Brey, WW Extreme Storage Business Development Manager, homes in on two recent HP announcements. First, its recent acquisition of IBRIX, and second its partnership with Ocarina.

Brey writes: “Now multi-petabyte systems are great when you have zillions of files that need to be stored but so is a multi-petabyte system that is optimized so that in the same space tens of zillions can be contained. This is where Ocarina’s ECOsystem software adds its value to our NAS products. The ECOsystem software transforms your storage with its content-aware storage optimization that compresses data up to 10:1 with added features such as deduplication, ECOsnap snapshots, and its own global name space capability. The unique thing about our reseller partnership is that HP can run the ECOsystem software right on our NAS nodes, further optimizing your infrastructure. Now there aren’t too many storage vendors out there who can talk about that now, are there?”

Bragging rights, indeed.

Clearly, it’s too soon to say exactly how each player in this space will benefit and/or lose out. As Mellor’s piece obliquely suggests, this isn’t about Ocarina setting itself in opposition to any vendors–in fact, it has a partnership with EMC. Rather, it shows how each provides a piece of the puzzle. In the big picture, there needs to be a shift in thinking towards something more along the lines of end-to-end dedupe–something that our lead blogger Carter George talked about at length in his popular post, “The Dedupe (R)evolution.” But in the short run it’s certainly good to see how each vendor is distinguishing itself, and working hard to provide the most efficient, cost-effective storage options to its customers.

Dedupe Grows Up

Posted by Sunshine On July - 29 - 2009

George Crump has a piece in Byte and Switch today that poses an important question: “Can we get to a single point of deduplication?” This is a question that we have taken up in one form or another in some of our recent posts, such as this one and this one.

In the article, Crump asks the question in another way: “… can you have all your data tiers; primary, archive and backup deduplicated by a single engine?”

In light of the recent focus on deduplication, this in my view is a question that really does need to be raised. For how long will the industry continue to silo out these different tiers for its deduplication solutions? And how much sense does it make to rehydrate data every time you move it, only to deduplicate it once again? Not a lot.

Crump writes: “The current deduplication vendors could work on building out their solutions to either scale up into primary storage performance (see Data Domain’s DD880) or they could move their existing data duplication technology into other markets; see the increased speed of Ocarina Networks and Permabit as well as their move into cloud storage.”

At the same time, as we’ve pointed out here, online storage is quite a bit different from backups, and so far at least, none of the successful backup dedupe vendors (Data Domain, Diligent, Quantum, etc.) have been able to break into it. Rather, it is NetApp and Ocarina who have been the trailblazers.

Crump makes another key point:

“NetApp and Ocarina could continue to enhance and improve the re-hydration speed of their technologies to make read performance a non-issue, making primary storage a viable platform. Ocarina can already maintain the deduplicated format as they move through tiers, so landing on backup or archive disk would simply be another move for them.”

This is an interesting observation, and one that is often missed in reporting on both of these solutions. We look forward to seeing more debate and discussion on this issue, which this piece has kicked off nicely.

EMC Dedupe - Beyond Data Domain

Posted by Mike Davis On July - 27 - 2009

With all the talk about the Data Domain acquisition, there has been less attention paid to EMC’s native de-dupe features in Celerra, not to mention its other related partnerships, such as with Ocarina for optimization of vertical applications. Last week I had the privilege of attending a webinar, “Surviving the Data Explosion through Data Reduction” with John Hayden, CTO of NAS Engineering at EMC, where I got a fuller picture of Celerra’s latest optimization features.

John provided us with insights on how the new Celerra NAS product integrates data optimization. And while he never mentioned Data Domain directly, an astute observer could see how well EMC is integrating prior acquisitions into its architecture, and draw conclusions from that.

First, he provided us with a couple of interesting factoids from the Digital Universe research EMC sponsored for IDC:

  • In 2009 there is positive growth in digital content, but IT spending for servers & storage is down 6%
  • Over the next 4 years, data will grow 5x, but IT budgets will only grow 1.2x
  • The administrative and overhead cost of storage is 4-7x the CapEx
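To put the second factoid in perspective, here is a quick back-of-the-envelope calculation. It assumes steady compounding over the four years, which is our assumption; the webinar only gave the four-year totals.

```python
def annual_rate(total_growth: float, years: int) -> float:
    """Compound annual growth rate implied by a total growth multiple.

    Assumes steady compounding, which is an illustrative simplification.
    """
    return total_growth ** (1 / years) - 1


data_rate = annual_rate(5.0, 4)    # data grows 5x over 4 years
budget_rate = annual_rate(1.2, 4)  # budgets grow 1.2x over 4 years
gap = 5.0 / 1.2                    # how much more data each budget dollar must cover

print(f"data growth:   {data_rate:.1%} per year")
print(f"budget growth: {budget_rate:.1%} per year")
print(f"gap after 4 years: {gap:.2f}x")
```

In other words, data grows at roughly 50% a year while budgets grow at under 5%, so after four years each budget dollar has to cover a bit more than four times as much data.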

This was all a prelude to John discussing the new data optimization features for their Celerra NAS product. It’s great to see the NAS vendors recognizing the value of data optimization as a central part of the NAS stack. Drilling a little deeper, EMC basically pulled together file-level deduplication (single instance storage or SIS) from the Avamar acquisition, and LZ77 data-generic compression from their RecoverPoint acquisition. SIS + LZ77 are a good price-performance combination for generic office files and text docs, but they don’t make much of a dent where we see the real capacity and scalability challenges: vertical applications such as life sciences, oil & gas, and media. In fact, generic compression is becoming ineffective against the latest MS Office docs, which use ZIP as a container and so arrive already compressed. And if you change a single text character in an office doc, the entire file changes, so the edited copy no longer matches the original for file-level deduplication.
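The point about already-compressed files is easy to see for yourself. Here is a minimal sketch using Python’s standard zlib module (DEFLATE, which is LZ77-based). It is a toy illustration, not Celerra’s implementation, and the sample text and vocabulary are made up for the demo.

```python
import random
import zlib

# Hypothetical sample: pseudo-random words standing in for document text.
rng = random.Random(0)
vocab = ["storage", "data", "file", "backup", "archive", "tier", "budget",
         "growth", "capacity", "dedupe", "compress", "vendor", "filer",
         "snapshot", "policy", "media"]
text = " ".join(rng.choice(vocab) for _ in range(20000)).encode("utf-8")

first_pass = zlib.compress(text, 9)         # generic compression on raw text
second_pass = zlib.compress(first_pass, 9)  # same algorithm on compressed bytes

print(f"raw text:    {len(text)} bytes")
print(f"first pass:  {len(first_pass)} bytes")
print(f"second pass: {len(second_pass)} bytes")
```

The first pass shrinks the text substantially; the second pass has almost nothing left to find, because the output of a compressor looks close to random. A file that arrives inside a ZIP container is in the same position as the input to that second pass.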

So there’s a reason that Ocarina has a solid partnership with EMC, with an optimization solution that’s complementary to Celerra’s. When it comes to customers with serious capacity issues and data growth (we’re talking about gene sequencing, post-houses, and so on) there is little to gain from deduplication, and little to gain from generic compression. Not only does the optimization solution need to more intelligently unwind and understand the file structure, but it needs to make better decisions about what algorithms get applied to specific file sub-objects. This is where Ocarina comes in. Like the native Celerra de-dupe solution, the Ocarina ECOsystem integrates with the FileMover API for a tightly knit, policy-based optimization solution that works even on media and ZIP files that are already compressed.

We look forward to our collaborations with EMC, and will be very interested to watch how they continue to integrate dedupe and compression across their offerings.

Storage News and Views July 21

Posted by Sunshine On July - 21 - 2009

What is it about mid-summer that turns everyone a little bit mad? EMC is now the proud parent of a baby DDUP. NetApp is $57 million richer. And here in the Bay Area, the weather veers from freezing to sweltering, depending on how close one is to the coast.

Here are a few headlines that caught my eye this morning:

SearchStorage: Cornell University, Shopzilla deploy primary storage data reduction to consolidate storage, keep up with data growth - Beth Pariseau reports on the way that Ocarina and Storwize are reducing data for their customers. Thanks to Ocarina, Cornell University has slashed its storage costs, and can now consider better economies of scale by consolidating storage for other departments.

StorageIO Blog: Summer Weddings: EMC+Datadomain and HP+IBRIX - Greg Schulz on the two storage mergers of the season. Some interesting thoughts on the HP-IBRIX merger that I haven’t seen anywhere else. And what is this Mass./Calif. love affair all about?

Chuck’s Blog: Data Domain: The Cone of Silence is Lifted - EMC’s Chuck Hollis drops some hints about how Data Domain will be integrated into its existing storage offerings. The comments are also worth reading. Chuck does a great job fielding the fevered speculation that’s going around.

Dedupe for Primary - Recent Coverage

Posted by Sunshine On June - 22 - 2009

As we keep noting on this blog, data reduction is becoming the topic du jour as storage budgets are squeezed and deduplication becomes more and more viable and effective. Dave Simpson, Editor-in-Chief of Infostor, came out with a very thorough article today on primary storage optimization. It’s a practical guide for customers who may be struggling to understand the differences between key vendors’ offerings in this new and exciting data reduction arena. The vendors covered are NetApp, EMC, Ocarina (this blog’s parent), Storwize, Hifn, and greenBytes.

According to Simpson’s article, performance is a key issue to consider when assessing primary storage optimization products. He also quotes Eric Burgener, formerly an analyst with Taneja Group (now with InMage), who notes that the much-touted differences in reduction rates are often overplayed.

“… a handful of vendors are addressing the performance requirements associated with data compression and de-duplication on primary storage, and … users should understand that there’s not a huge difference between, say, an 8:1 data-reduction ratio and a 20:1 ratio.”

An interesting point, and one that is often overlooked in the race to show results. As we have reported on this blog in the past, the real comparisons should be about the percentage of capacity reclaimed, not the ratios, which can be misleading. So for example, the Ocarina ECOsystem reclaimed twice as much space as NetApp dedupe on a typical home shares file mix, with 54% reduction vs. NetApp’s 27%. These are real numbers that can give you a sense of the amount of storage space you’re likely to reclaim when deploying one of these solutions.
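For anyone who wants to check the arithmetic behind ratios versus percentages, here is a short sketch. The two helper functions are our own illustrative names; the only inputs are the figures quoted above.

```python
def percent_saved(ratio: float) -> float:
    """Percent of capacity reclaimed for a reduction ratio of ratio:1."""
    return (1 - 1 / ratio) * 100


def ratio_from_percent(percent: float) -> float:
    """Reduction ratio (as N in N:1) for a given percent of capacity reclaimed."""
    return 1 / (1 - percent / 100)


for r in (8, 20):
    print(f"{r}:1 ratio -> {percent_saved(r):.1f}% of capacity reclaimed")

for p in (27, 54):
    print(f"{p}% reclaimed -> {ratio_from_percent(p):.2f}:1 ratio")
```

An 8:1 ratio reclaims 87.5% of the space and a 20:1 ratio reclaims 95%, so the jump from 8:1 to 20:1 buys only 7.5 more percentage points. At the other end of the scale, 27% and 54% correspond to roughly 1.37:1 and 2.17:1, ratios that look close together even though one reclaims twice as much space as the other.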

And by the way, Eric Burgener had a really nice post back in February when he was still at Taneja Group. Called Pulling One Out of the Hat, it gives great advice and details about how to make the best use of your primary storage budget in these times. Definitely worth a read.

Happy Monday everyone!