
Online Storage Optimization

Exploring Next Generation Storage Solutions

All good things…

Posted by Sunshine On March - 31 - 2010

Today is a bit of a sad day for me, as this will be my last post on Online Storage Optimization. For those who are just joining us, I’ve been the regular “newsy” blogger on this site. It’s never been a traditional setup. I’m not an employee of this blog’s parent, Ocarina Networks. Rather, I’m an independent social media consultant who started out as a PR rep for the company. I moved into this role when we relaunched the blog as more of a publication in early 2009. It’s been a wonderfully fruitful arrangement that allowed me all the freedom and breadth to think, talk and learn to my heart’s content.

As much as I’ve loved working on this project, the reality of my life as a consultant has meant that I am being pulled in many directions. I’m now in the process of launching a new business, currently in stealth but soon to be revealed. (And I hope I can count on your support when it does.) This new set of responsibilities makes it impossible for me to continue to follow the daily ups and downs, trials and tribulations and fascinating personalities of the storage industry as an active blogger. I will of course be watching from afar. More than that, I’ll never forget the warm welcome I received in the storage blog-o-tweet-osphere.

With that said, I feel compelled to thank certain specific people who have made the experience of being part of the storage industry particularly enjoyable and enlightening. First, Carter George, VP Products at Ocarina and the lead blogger on this site. Carter took me under his wing from the get-go, sharing his vast wealth of knowledge as a leader in the industry. He took every single one of my questions seriously, no matter how stupid, and answered them in ways that expanded my understanding of this complex and technical subject area. For those who don’t know him personally, Carter is also one of the nicest and most approachable people you could meet. He encouraged me to stretch myself, and the result is that, lo and behold, I became a reasonably well-known and recognized “storage blogger.” Not something I would’ve dreamed of in a million years.

Second, Stephen Foskett, publisher of Gestalt IT. A couple of months after I started working on this blog, I got a DM on Twitter from him that read, “I hope Ocarina appreciates what you’re doing for them.” It couldn’t have been better timed. I was buried in working on a white paper about Ocarina’s newest release that I honestly didn’t believe I would ever have the technical know-how to finish. At the same time, I was struggling to come up with topics for the blog. I also worried, continually, that I had pissed someone or other off by what I said on this blog or on Twitter. To get this message from someone as well-respected as Stephen gave me a much-needed sense that somehow or other, I was doing okay.

Third, Murli Thirumale, CEO of Ocarina. Murli doesn’t have the personality you associate with your typical Silicon Valley CEO–he’s about as far from the image we all know of the crazed egomaniac as you can get. He is a thoughtful, respectful, and yet endlessly upbeat person who has built a successful company based on a real need. We had a relaxed working relationship, and I always appreciated his occasional contributions to this blog, which offered a “big picture” understanding of what he intended when he started the company. It was an honor to work for someone like him.

Fourth, Marc Farley, storage rapper and 3Par’s social media whiz. Marc was one of the first people to respond to a tweet of mine and chat with me. We talked about whether it’s possible to remove the light bulb in your fridge, I seem to recall. Marc and I ended up creating a video together that became something of a viral hit within storage circles. He is one funny, cool guy and a true storage industry veteran who nevertheless has stayed ahead of the curve.

Fifth, Storagebod, also known as Martin Glassborow. Aside from being a great source of interesting blog posts that always kept me on my toes and wanting more, Martin is a fantastic Twitter conversationalist. He seems to have read every book on the planet. He also knows a great deal about a whole lot of other subjects, from music to health to wine. A true renaissance man and therefore someone I could always count on for a laugh or a chat–often when I most needed it.

Sixth, Greg Knieriemen. Greg has been a great guy to know, and has given me lots to think about through the exciting and active storage community he created, StorageMonkeys. Last fall, he had me on as a guest on his podcast, Infosmack, where we talked booth babes and other hot topics. He’s also a very funny guy–and a sense of humor is everything in this intense business.

Seventh, George Crump. George, of Storage Switzerland, was kind enough to give me all kinds of advice about how to run a successful blog when we first relaunched. I give myself credit for listening to him, and the results were notable. We have had a great run here, and one that I am sure will continue as I pass the baton to Mike Davis and any others who jump on the bandwagon known as Online Storage Optimization. I hope you’ll keep reading. I know I will.

Compression and Dedupe like Oil & Water?

Posted by Mike Davis On March - 31 - 2010

Ocarina customer Imagination Technology got some good press the other day in a SearchStorage article reviewing solutions for primary storage optimization. Imagination, based northwest of London, provides key pieces of semiconductor IP that are in just about every smart phone out there, and they used Ocarina to double their storage in the same datacenter footprint. [Note to Steve Jobs: I see you used them for iPhone graphics, can you also use their Flash acceleration? Please?] The underlying storage there is Network Appliance, and our co-processor is happily reaching in via NFS to shrink all the archived project data for reference and restore.

Like most every other company whose product is digital, access patterns follow a 90/10 rule. There’s a hot set, and there’s the rest, and the rest is the stuff you should target for data reduction. For someone who’s budget constrained or datacenter constrained (and I think that covers about everyone except maybe the NRO), data reduction brings real benefits to online storage, whether that comes from NetApp, Storwize, or Ocarina. Of course if it comes to a shrink-off, we’ll take any challenge from any taker with any dataset!

Dedupe and Compression like oil and water?
Brian raises the contentious issue of whether dedupe is better or compression is better, and whether you should do both. The truth is dedupe’d data is less easy to compress, and compressed data is less easy to dedupe. If you apply a compression-only workflow to a dataset, let’s say you get 50%. Now run the same data set through a dedupe-only workflow and you’ll get maybe 20% (remember this is primary storage, not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might get an additional 35% of the original size, for a total of 55%. So compression of deduped data is less effective than on the raw data set, but the combination (for this example) has eked out a 5-point advantage over the compression-only workflow. Of course it all depends on the data, and there’s a high burden on the software to be really smart about how and when it chooses different algorithms.
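The arithmetic in that example can be checked with a short sketch. The percentages are the illustrative figures from the paragraph above, not measurements, and the function name is made up for this illustration. The "additional 35%" is read here as 35 percentage points of the original data set, which is the reading that makes the total come to 55%.

```python
def combined_savings(dedupe_savings, extra_from_compression):
    """Total savings when compression is applied to already-deduped data.

    Both arguments are fractions of the ORIGINAL data set size.
    """
    return dedupe_savings + extra_from_compression


compression_only = 0.50      # illustrative figure from the post
dedupe_only = 0.20           # illustrative figure from the post
extra_after_dedupe = 0.35    # assumed to be points of the original size

total = combined_savings(dedupe_only, extra_after_dedupe)

# How well compression did on what dedupe left behind (80% of the original).
remaining_after_dedupe = 1.0 - dedupe_only
compression_rate_on_deduped = extra_after_dedupe / remaining_after_dedupe

print(f"dedupe + compression: {total:.0%}")
print(f"compression-only:     {compression_only:.0%}")
print(f"compression rate on deduped data: {compression_rate_on_deduped:.2%}")
```

Under this reading, compression removes a bit under 44% of the deduped data, compared with 50% of the raw data, which is the "less effective, but still a net win" point made above.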

It takes some skill and forethought to do both well, and Ocarina’s algorithm selection logic is well tuned enough (thanks partly to the use of a neural network) that the combination of dedupe plus compression will deliver say 80% savings (or 5x effective capacity) when processing an enterprise-profile data set (where there’s some redundancy in the data). For some vertical applications, however, the benefit of adding deduplication is so slight that it’ll actually be disabled. A life-sciences dataset, for example, has relatively little data redundancy.

Steve from Storwize is right that in many situations compression can do much better than dedupe (for primary storage apps), and that’s certainly true when considering most of the prevailing dedupe algorithms such as SIS (full file), static block (NetApp A-SIS), or dynamic block. The key difference for Ocarina, and the way to make dedupe pay off in primary storage, is, as Brian explains in the article, that we “pull files apart and deduplicate their constituent elements.” In other words, we find reduction where no one else can. Imagine, for example, how cool it is when ECOsystem delivers dedupe benefits even when files have already been “single-instanced.”

Happy New Year

Posted by Sunshine On December - 29 - 2009

Tis the week for the “out of office” email messages. But the storage blogo-tweet-osphere waits for no man. Here are a few posts that caught my eye this week.

Bas Raayman sees CPU power hitting the wall: The RAM per CPU wall

Rick Vanover says 2010 could be the year for 10GigE - Will 2010 see 10 Gigabit Ethernet go mainstream?

It being the end of a year–and a decade–predictions abounded. We’re pleased to note that when it came to summarizing the top storage stories of 2009, deduplication for primary storage, the specialty of this blog’s parent Ocarina, made the big lists:

Infostor: The top 5 storage technologies of 2009 (and 2010?)

“Storage optimization (or data reduction) technologies such as data deduplication and compression can significantly reduce capacity requirements and costs … Consider data reduction for primary storage.”

SearchStorage - Beth Pariseau: Top 10 enterprise data storage news stories of 2009

“10. Data deduplication branches out. As deduplication settled into a comfortable role in backup, data-reduction technology started working its way into other parts of the data storage infrastructure, including primary as well as nearline and archived data … Ocarina and Isilon Clustered NAS help visual effects studio archive images, cut costs.”

For sheer inventiveness, blogger Stephen Foskett wins the prize with his 2009 predictions post, in which he turns the clock back and takes advantage of 20-20 hindsight: My 2009 IT Industry Predictions.

Meanwhile, social media and tech watcher Louis Gray takes himself to task and looks at all of his 2009 predictions to see how well he fared: My 2009 Tech Predictions: Mixed, But Nailed Real-Time.

OK that’s all for now. Here’s wishing all of you a happy, healthy, green and techy new decade.

Bring Out Your Data - The Deets

Posted by Sunshine On November - 4 - 2009

Lots of speculation this past week in the storage tweet-o-blog-osphere around our “Bring Out Your Data” Challenge for Tech Field Day. We can’t wait to see what these smart and savvy participants bring us, and we’re confident about the results. There will be prizes awarded for those who stymie us and those who get the greatest reduction. This morning, we sent out a brief email giving a few more details about it. In the spirit of transparency, here is what we sent to the attendees:

Dear Tech Field Day attendee,

Ocarina Networks has issued a challenge to you for Tech Field Day: bring out your data. In brief, we’re asking you to arrive on November 13 at our offices with a thumb drive containing your toughest data set. We will compress and dedupe that data for you right in front of your eyes. This will be a chance for you to see the Ocarina ECOsystem in action so that you can assess data reduction and performance for yourself in real time.

Here are a few guidelines.

1. Try to keep it under 2 GB. This is to ensure that as many participants as possible have an opportunity to shrink their data during the four-hour time period you will be at the Ocarina offices.

2. If you would like to see both deduplication and compression, we recommend that you bring data that includes duplicates. In other words, one 2GB file is not going to be deduplicatable, but several different files that have shared objects will show much more interesting results. If you’re only interested in seeing our compression capabilities, then this isn’t necessary, but please keep in mind that the results you get in that case won’t reflect the deduplication feature.

3. Give us a mix of files from your local hard drive.

4. Label your stick. Put your name somewhere on the physical thumb drive. Also, give the directory your own first and last name.

A final note: we will return your flash drive to you at the end of the day, but please don’t bring us a sole copy of an important piece of data, as we may return it to you with the data in a compressed format.

Thanks for your participation in Tech Field Day, and we look forward to meeting you next week!

Best wishes,

The Ocarina Social Media Team

Carter George, Mike Davis, Sunshine Mugrabi, and Helen Miller-Montana

Dedupe Misconceptions

Posted by Carter George On October - 20 - 2009

As most in the industry are aware, dedupe has become a standard offering from every major vendor. Dedupe for primary has become the technology of the moment, and for good reason: the rising tide of unstructured data is forcing data centers worldwide to rethink capacity planning, tiering, and storage efficiency. But there are still a few lone voices out there who are clinging to the notion that dedupe is unnecessary.

Take for example this recent post from Compellent’s Bruce Kornfeld, “Is dedupe the only answer?” Kornfeld is responding to a recent SearchStorage article, “Is Data Duplication Right For Your Primary Storage?”

Dedupe and compression can both be applied directly to primary data, and the savings there can be comparable to what’s seen in backup. On backup data, vendors claim 20x data reduction, and on primary data we think that most customers will see about 5x.

So, you say, “That means that you get four times more space savings on backup, right?” Wrong! Actually, 20x means a savings of 95% against the size of the original data set, and 5x means a savings of 80%. That’s a difference of only 15 percentage points - and an 80% space reduction is a huge win for the primary storage user. Of course, vendors who do not have a dedupe solution are likely to tell you that you don’t need it anyway. There are some valid concerns about dedupe for primary, but there are also some misperceptions, and there’s no reason to let misinformation be propagated.
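The conversion behind those numbers is simple: an N-times reduction leaves 1/N of the data, so the savings are 1 - 1/N. A minimal sketch (the function name is just for illustration):

```python
def savings_from_ratio(ratio):
    """Convert a reduction ratio such as 20 (for 20x) into fractional savings."""
    if ratio < 1:
        raise ValueError("a reduction ratio must be at least 1")
    return 1.0 - 1.0 / ratio


for ratio in (2, 5, 20):
    print(f"{ratio}x -> {savings_from_ratio(ratio):.0%} saved")
```

Going from 5x to 20x quadruples the ratio but only moves the savings from 80% to 95%, because most of the space has already been recovered by the time you reach 5x.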

The biggest difference between dedupe for backup and dedupe for primary is that in backup, you dedupe all of the data. There’s no reason not to. In primary data, you might not want to dedupe everything - there are some data sets it does not make sense for. That’s not a knock on dedupe for primary. It just means you should choose which data sets make sense to dedupe.

The first common misperception about dedupe for primary data is that performance will be worse. But this is really not the case. When primary data has been deduped (but not compressed), an application asks the storage for a block, and that block is retrieved. There is one lookup to map the logical block request to the physical one - but those kinds of lookups are already being done in every storage array that has any kind of storage virtualization, such as thin provisioning. The response time on a block read for deduped data is hardly different from that for un-deduped data, and this is true for all primary dedupe solutions - including both NetApp and Ocarina. There’s no more overhead to retrieving a deduped block than there would be in any other block read I/O on any intelligent array, and Compellent, being a leader in arrays with lots of smarts, is well aware of this. The fact that another file may also be sharing that block has zero impact on the time it takes to read it.

It’s true that for sets of blocks that are changing all the time, you won’t get as much benefit from dedupe. That’s not because the performance will be bad. It is because when you change a block, it’s no longer a dupe. Therefore it has to be stored again as a new block. If you read a deduped block, modify it, and write it back out, it would have been a write in an un-deduped case anyway, so performance, again, is even-steven between deduped and non-deduped volumes. Everyone doing dedupe for primary - NetApp and Ocarina - does the deduplication as a post-process, so there’s no impact at all to write performance. No one is trying to dedupe that block as it is being written.

What is different, though, is that in a high rate-of-change application like a transactional database, you won’t see as much space savings with dedupe. That’s because if most of your blocks are either new or have just been changed, they won’t be dupes. Here’s misperception number 2: while there are some applications in primary storage where dedupe does not apply (the hot tablespaces in Oracle or SQL Server, for example), what you’ll find is that most data is a good candidate for dedupe on primary and nearline storage. In fact, much more data is stored in files that are good candidates for dedupe than not. All of the typical file/print files are great candidates for dedupe, but the misperception is that applications like Exchange and virtual machines shouldn’t be deduped. As it turns out, both are great candidates for dedupe (and compression, for that matter). Let’s take a look at VMs.

In a virtual machine environment, a storage array may be storing thousands of VMDKs, the VMware files that store a given virtual machine. Inside each VMDK file is a complete virtual machine image, including the operating system, application files and user data. If you have 1,000 VMDKs that each hold a virtual Windows machine, you’ll have tens of thousands of “files” inside each VMDK file, including a copy of Microsoft Windows, the application you are running in the virtual machine, and often the data for that application as well. How much of the Windows operating system do you suppose is duplicated across the 1,000 VMDKs in this example? Well, almost all of it. What’s more, the thousands of files that make up Windows do not change - are not changeable, in fact, unless you do an OS upgrade.

Large parts of the VMDK file are duplicate with others, and they stay the same, day after day. Perfect candidates for dedupe. Sure, the user data in a VMDK may change, but any competent dedupe solution is not deduplicating whole files - the dedupe solution is deduplicating something at sub-file granularity: blocks, objects, chunks, etc. NetApp dedupes 4K WAFL file system blocks. Ocarina dedupes sub-file objects. The point is, regardless of which approach you take, if most of a VMDK file stays the same, and some part changes, dedupe will work great. The parts of the VMDK file that are changing won’t be deduped, and the vast majority of the file - the OS and application binaries - will be deduped. The space savings on your storage is great, and the performance impact minimal.
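To make the VMDK argument concrete, here is a toy sketch of sub-file dedupe using fixed 4K blocks and SHA-256 fingerprints. It is only an illustration under stated assumptions: the "images" are synthetic byte strings, the block size and all names are chosen for the example, and it is not how NetApp or Ocarina actually implement dedupe.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def split_blocks(data, block_size=BLOCK_SIZE):
    """Yield fixed-size blocks of a byte string."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def dedupe_images(images):
    """Store each unique block once; return (store, per-image block lists)."""
    store = {}      # block hash -> block bytes
    recipes = []    # for each image, the ordered list of block hashes
    for image in images:
        recipe = []
        for block in split_blocks(image):
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes


# Toy "VMDKs": a shared 400 KiB OS region plus 8 KiB of per-image user data.
os_region = b"".join(i.to_bytes(4, "big") * 1024 for i in range(100))
images = [os_region + bytes([n]) * 8192 for n in range(1, 11)]

store, recipes = dedupe_images(images)
logical = sum(len(image) for image in images)
physical = sum(len(block) for block in store.values())
print(f"logical {logical} bytes, physical {physical} bytes, "
      f"ratio {logical / physical:.1f}x")
```

The shared OS region is stored once no matter how many images reference it, while each image's changing user data costs only its own blocks. Any image can be rebuilt by joining the stored blocks listed in its recipe.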

In important ways, dedupe for primary storage is the perfect complement to thin provisioning. In thin provisioning, a storage solution virtualizes (i.e., lies about) the amount of disk space unused. With dedupe, the same storage solution can virtualize (i.e., lie about) how much space is used. The two together provide the maximum storage efficiency.

End to End Dedupe

Posted by Goutham Rao On October - 14 - 2009

Ed Note: We hope you enjoy this guest post from Goutham Rao, CTO, Ocarina Networks, a panelist at SNW this week on the topic of “Primary Storage: The New Frontier for Data Deduplication.” This offers a more detailed and nuanced look at the topics discussed on the panel.

If you’re like many in the storage industry, you think of deduplication mainly as disk optimization. However, in today’s modern data center, dedupe and storage optimization should be thought of as applying across the entire storage workflow, rather than in one particular storage component.

Why?
Because we are no longer in an era in which storage is merely about spinning disks. It is about data, which can be “at rest” and “in motion”: moving from primary storage to nearline, or to backup, or replicated to different sites. Dedupe, then, must apply to all storage workflows. This is more true than ever as massive growth of unstructured data is becoming the rule rather than the exception.

As a result, IT Administrators are saddled with more challenges than ever before.
They must manage activities such as migration, replication and backup, all of which can lead to problems as an organization’s data footprint grows.

If you think about it, storage administrators largely deal with three tasks:

· Data Storage – Maintain data on various filers and spinning disks. Deal with volumes of various sizes. Perform all the routine maintenance associated with spinning disks, like upgrades and refreshes, replacing lost drives, filer upgrades, snapshot maintenance, quota management, storage provisioning and growth management.

· Data Movement – Manage replication of storage tiers from one location to another, either for protection or high availability. Migrate data from one location, like branch offices, to another location, like a primary data center.

· Data Protection – Back up various file servers and deal with VTL, media servers, libraries, tapes, selective file restores (DAR) and tape refreshes.

As you can see, as data grows for a customer, their problems grow in these three dimensions. So if you are going to talk about “Storage Optimization” and your solution doesn’t scale or address the above three areas, you aren’t really providing a solution at all, but rather just a band-aid.

Tying the Storage Optimization Workflow Together

Based on the above observations, a good storage optimization solution should be cognizant of the lifecycle of various files in the storage system. When a file enters a primary file system, it is likely to move around and finally get backed up or deleted. The storage optimization solution should optimize data such that the optimization effect lasts through this entire workflow and lifecycle. It should optimize the files while they are at rest on the storage disks, and the same optimized format should also be communicable to other storage endpoints as these files move through the storage workflow. Finally, the same optimized format should be the one that can get backed up directly and also lend itself to restoration and recovery operations such as “Selective File Restore, DAR.”

Since the unit of communication between various storage tiers and lifecycle waypoints seems to be “FILES,” it seems logical that this optimized data format would be implemented as files ON-TOP of a file system, instead of directly modifying a file system’s block device data structures. The latter is not communicable across storage waypoints.

Dedupe/Optimization for Online

In order to optimize data for online storage (be it primary or nearline usage), the optimization solution needs to be aware of the life of the data beyond that particular tier. It needs to optimize data in such a way that the optimized data format is conducive for both movement (such as replication and migration) as well as backup (and restore). This has huge implications in how the optimized data is represented. Inherently, dedupe and optimization introduce a relationship between files that did not exist before.

As different files have different movement and backup policies, the optimized representation of these unrelated files needs to be amenable to independent lifecycles. Implementing dedupe as part of the file system’s data structures itself is counter to the notion of “Global Storage Optimization.” We call this the “Data Store Problem.” This is about how the dedupe solution “represents” or stores the various optimized data blocks associated with various unrelated files.

What needs to happen?
First, the data store representation must be smart enough that it can play well with the storage workflow. Otherwise, no matter how good the dedupe/optimization solution, it will always have a localized and limited effect. Second, online storage is quite different from backup storage, which means that the dedupe algorithms and techniques must also vary. For instance, in backup workflows, if a backup target sees 52 weekly backups, it is easy to imagine how the solution can get in excess of 25X dedupe savings.  Each week’s full backup file (which is in a particular backup software format) is likely to vary less than 5% from the previous week.
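The backup figure can be sanity-checked with a deliberately simple model. The sketch below assumes every weekly full backup is the same size and that only a fixed fraction of it is new each week; both the model and the function name are assumptions made for illustration, and compression is ignored.

```python
def backup_dedupe_ratio(weeks, weekly_change):
    """Dedupe ratio for `weeks` full backups under a simple model.

    Assumes every full backup has the same size and that each week only a
    fraction `weekly_change` of it is new data; everything else is a dupe.
    """
    logical = weeks                              # in units of one full backup
    physical = 1 + (weeks - 1) * weekly_change   # first full + weekly deltas
    return logical / physical


for change in (0.05, 0.02, 0.01):
    ratio = backup_dedupe_ratio(52, change)
    print(f"{change:.0%} weekly change -> {ratio:.1f}x")
```

Under these assumptions, dedupe alone passes 25x once the weekly change is around 2% or lower, which sits inside the “less than 5%” range quoted above. Compressing the unique data would raise the ratio further.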

But when it comes to online storage, you don’t have such obvious duplicate objects and files. The duplication does exist, but it is hard to find. The dupes are embedded within various rich files. In fact in today’s application environment, most files are in a rich encoding format, utilizing a compression and encoding scheme such as ZLIB, GZIP, PKZIP, BZIP, and many other single-file-optimization schemes. So even though there are redundancies across files, they are hard to find without digging deep for them.  You need to understand the application file format, delayer the format and find the duplicate objects.
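A small sketch shows why already-encoded files hide their duplicates. Here the same payload is stored twice using Python's zlib module at two different compression levels (a stand-in for two applications with different settings; the payload is made up). Fingerprinting the stored bytes finds nothing in common, while fingerprinting after decompression finds an exact match.

```python
import hashlib
import zlib

payload = b"quarterly report body text " * 2000

# The same content stored by two applications with different zlib settings.
file_a = zlib.compress(payload, 1)
file_b = zlib.compress(payload, 9)


def fingerprint(data):
    """Return a SHA-256 fingerprint of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Comparing the stored (encoded) bytes finds no duplicate.
same_as_stored = fingerprint(file_a) == fingerprint(file_b)

# Delayering (decompressing) first exposes the duplicate content.
same_after_delayering = (
    fingerprint(zlib.decompress(file_a)) == fingerprint(zlib.decompress(file_b))
)

print("duplicate visible in stored form:  ", same_as_stored)
print("duplicate visible after delayering:", same_after_delayering)
```

Real rich file formats embed many such streams alongside their own metadata, so a real solution also has to parse the container format. This sketch covers only the decompression step.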

Next, dedupe alone is insufficient for online storage, given the nature and workflow of online storage. Unlike the backup workflow, where most backup software purposely introduces duplication from one weekly backup to another, online data has no such redundant workflow.

Online data is different from other data objects, and so online storage optimization must rely on modern compression techniques. There are algorithms today that can further optimize data better than 25-year-old algorithms such as Lempel-Ziv. Since most of today’s data is already optimized, the solution must first decompress the files and then apply application and file specific compression techniques in conjunction with dedupe.

Standard block-level dedupe approaches will not work well. The solution must identify duplicates at the appropriate boundaries. Dedupe and compression have competing goals in a way. Dedupe likes small chunk sizes: the smaller the chunk, the more likely you are to find a duplicate chunk. However, small chunks are very compression-unfriendly. Compression likes large chunks where it can obtain a good amount of context. It’s better to compress 32K worth of data than 8 separate 4K chunks. So the question is, what is a good block size? This is where “Object boundary recognition” comes into play. An online dedupe solution will find the best possible object boundaries, such that each object is large enough to be compressed well, yet not so large that a smaller chunk inside it turns up as a duplicate in some other file.
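The chunk-size trade-off is easy to observe with a general-purpose compressor. The sketch below uses Python's zlib on 32K of synthetic, text-like data (a seeded random word stream, chosen purely for illustration) and compares compressing it as one chunk against compressing eight independent 4K chunks.

```python
import random
import zlib

# Build 32 KB of text-like data from a small vocabulary (synthetic, seeded).
rng = random.Random(42)
vocabulary = [b"storage", b"dedupe", b"compression", b"chunk", b"object",
              b"file", b"block", b"online", b"backup", b"primary"]
words = []
size = 0
while size < 32 * 1024:
    word = rng.choice(vocabulary) + b" "
    words.append(word)
    size += len(word)
data = b"".join(words)[:32 * 1024]

# Compress once with full context, then as eight independent 4K chunks.
whole = len(zlib.compress(data))
pieces = sum(len(zlib.compress(data[i:i + 4096]))
             for i in range(0, len(data), 4096))

print(f"one 32K chunk:   {whole} bytes")
print(f"eight 4K chunks: {pieces} bytes")
```

Each small chunk pays its own stream overhead and has to relearn the data's statistics from scratch, so the eight pieces add up to more than the single 32K result. The exact gap depends heavily on the data and the compressor.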

Finally, an online dedupe solution must be aware of online storage workflows, which include random-read, modify, update and delete operations. Backup dedupe solutions only have to deal with streaming writes and streaming reads.  In online storage, you have IO access patterns that involve random read/writes, backward reads, overwrites, truncates, locking, concurrent access and so on.

A related topic is reducing the penalty of optimization. Online storage has very different performance metrics from backup solutions. In a backup optimization product, the focus is pretty much on how much sustained ingest throughput the backup VTL device can handle. The measurements are in terms of “MBPS.” The metric is how many MBPS a single upload stream can handle.

But when it comes to online storage, the focus is not on how fast you can optimize data, but rather on how fast you can rehydrate it. It is about low-latency access to any random part of a compressed file. If you compress a file from beginning to end, and you get a random access request to the middle of the file, you have to rehydrate that file from the beginning in order to service that random IO request.

This will make the latency too large in practice, and that will prevent a dedupe solution from being adopted in online storage. So a good online dedupe solution will optimize data such that random read/write patterns suffer very low latency. It will also format the data so that rehydration of entire files can use as much CPU power as is available on the rehydration platform, and it will perform asymmetrically (taking more time for optimization but much less time for rehydration).
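One common way to avoid rehydrating from the beginning is to compress a file as independent chunks, so a random read only touches the chunks it overlaps. The following is a minimal sketch of that general idea using zlib. The chunk size and function names are assumptions for illustration, and nothing here is meant to describe Ocarina's actual on-disk format.

```python
import zlib

CHUNK = 64 * 1024  # assumed chunk size for this sketch


def compress_chunked(data, chunk_size=CHUNK):
    """Compress each fixed-size chunk independently so any one can be read alone."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_range(chunks, offset, length, chunk_size=CHUNK):
    """Return `length` bytes at `offset`, rehydrating only the chunks touched."""
    if length <= 0:
        return b"", 0
    first = offset // chunk_size
    last = (offset + length - 1) // chunk_size
    touched = chunks[first:last + 1]
    plain = b"".join(zlib.decompress(c) for c in touched)
    start = offset - first * chunk_size
    return plain[start:start + length], len(touched)


data = bytes(range(256)) * 4096            # 1 MiB of sample data
chunks = compress_chunked(data)

piece, rehydrated = read_range(chunks, offset=700_000, length=100)
print(f"read {len(piece)} bytes by rehydrating {rehydrated} of {len(chunks)} chunks")
```

The cost is the one described earlier: smaller independent chunks compress less well than one long stream, so the chunk size becomes a tuning knob between read latency and space savings.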

Dedupe/Optimization for Data Movement

The whole goal behind online-dedupe is to represent parts of various files as a singularity. It brings in relationships between files that did not exist before. This optimality is fine while data resides on that storage endpoint.  But what if one of those files needs to be moved or replicated to another storage endpoint?  Must it move in its rehydrated (unoptimized) form?

When data moves between storage endpoints and tiers, the files may not move in the way they were optimized or along with exactly those files they were optimized with. For instance, if files A, B and C were optimized and deduped with respect to each other, but files A, B, E and F need to move to another endpoint, does this mean that these files need to be rehydrated? What if the target endpoint already has some chunks of data from files A, B, E and F due to some prior unrelated operation?

A good end-to-end optimization solution will recognize data movement operations such as migration and tiering and create an optimized package for data movement such that the package is self-redundant and also does not contain information that the target already knows about. For example, consider the use case where an enterprise wishes to back up file servers daily from various branch offices to a central location. This may involve multiple endpoint storage servers communicating with a single file server located at a data center. The dedupe solution must not only optimize at the endpoint locations but also optimize the daily backup workflows to the central office. The dedupe solution must be globally aware of duplicates that the other endpoints may have already communicated to the central data center endpoint.
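The idea of a movement package that skips what the target already knows can be sketched as follows. This is a hypothetical illustration: the fixed 4K chunking, the package layout and all function names are assumptions, not a description of any vendor's protocol.

```python
import hashlib


def chunk_ids(data, chunk_size=4096):
    """Split data into fixed-size chunks, returning (id, chunk) pairs."""
    return [(hashlib.sha256(data[i:i + chunk_size]).hexdigest(),
             data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def build_package(files, target_known_ids):
    """Build a movement package for `files` (name -> bytes).

    The package carries a recipe for every file but only those chunks the
    target has not already reported holding, each at most once.
    """
    recipes, new_chunks = {}, {}
    for name, data in files.items():
        recipe = []
        for cid, chunk in chunk_ids(data):
            recipe.append(cid)
            if cid not in target_known_ids:
                new_chunks.setdefault(cid, chunk)
        recipes[name] = recipe
    return {"recipes": recipes, "chunks": new_chunks}


def apply_package(package, target_store):
    """Add the shipped chunks to the target and rebuild each file there."""
    target_store.update(package["chunks"])
    return {name: b"".join(target_store[cid] for cid in recipe)
            for name, recipe in package["recipes"].items()}


shared = b"S" * 8192
target_store = dict(chunk_ids(shared))          # target already holds these
files = {"A": shared + b"a" * 4096, "B": shared + b"b" * 4096}

package = build_package(files, set(target_store))
restored = apply_package(package, target_store)
print(f"chunks shipped: {len(package['chunks'])}")
```

In the branch-office case, the central site's set of known chunk IDs would reflect what every branch has already sent, so a chunk uploaded by one office never has to be shipped again by another.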

Dedupe/Optimization for Backup and Data Protection

Lastly, the online dedupe solution must be aware of the backup workflows. Deduped data needs to be backed up in an optimized form. Rehydrating data just so it can be backed up is counterproductive. The solution must submit the data to the backup target in such a way that a single-file or selective-file restore (Direct Access Restore) may be performed at any arbitrary location. Today’s solutions solve this by rehydrating all the optimized data. As data moves from one stage to another, such as from a disk backup target to tape, the data is rehydrated and unoptimized before movement to the backup target.

Even if the IT organization uses traditional VTL workflows with media backup servers in their backup practices, the backup file dumps must be optimized file dumps and not rehydrated file dumps. Such file dumps must be locally optimized in such a way that direct access restore (selective file and directory restores) can still be performed without requiring access to any other older backup dump.

A part of optimizing backup workflows is actually to move away from VTL workflows in the first place. A good dedupe/optimization solution for backup will allow for end user direct file restores. This will allow for administrators to not have to deal with restoring files or selective files from what could potentially involve petabytes worth of backup data. Backup is the final resting place for files. The workflow should allow for versions of files to enter the backup target and for end users to directly restore any file they want without IT involvement.

Dedupe for Primary–Everyone’s Talkin’

Posted by Carter George On October - 7 - 2009

It’s very interesting to write for a blog that is focused on a specific topic–in our case, dedupe for primary–and then suddenly see the whole world wake up to the reality of it all at once. There has been quite the pile-on in the storage blogosphere of late.

So, what has been said so far? First, we had Chuck Hollis on his blog talking about primary dedupe and data I/O density. He makes some great points, but he is seeing the problem in a certain way–in essence, he’s thinking of how data reduction may impact the performance of primary storage. However, in some cases dedupe can improve performance, because it allows much higher cache hit rates on highly used shared data blocks (virtual machines are the perfect example). Another fact is that a lot of storage on expensive primary tiers today does not need to be there. It started there, but it’s grown cold. If you don’t want to create another tier and move files, dedupe gives you a way to create a cheaper logical tier on the storage you already have.

In that case, some trade-off in performance is perfectly acceptable. Ocarina’s solution for deduping primary storage gives you the choice of deduping in-place (creating a logical tier 2) or doing dedupe-and-migrate as a single atomic operation, shrinking colder data and moving it off of Tier 1 storage in one step. In fact, we just announced that Ocarina is now part of the EMC Velocity Technology and ISV Program, giving EMC’s Celerra a major edge over NetApp for both in-place dedupe on Celerra for primary, and for dedupe-and-tier.

A string of comments on Chuck’s blog included some heated exchanges between Chuck and arch rival NetApp’s bloggers Vaughn Stewart and Kostadis Roussos.

In response to the post, Hu Yoshida at HDS put in his view, which is that he essentially agrees with Chuck on this question. His main point is that dedupe for primary isn’t a panacea. True enough, but as Hu himself has noted in an earlier post, there’s a great advantage to integrating it when you’re already taking advantage of these other tiering, storage virtualization, and provisioning options.

Then finally, EMC Avamar’s Steve Kenniston covered a great deal of ground, and in fact ended up highlighting two key points that are complementary parts of Ocarina’s strategy. First, we want to get as many deeply-embedded design wins with NAS and file system vendors as possible - meaning that a common “language of dedupe” would be spoken across multiple vendors. Second, we’re developing an end-to-end dedupe strategy, where a file that is deduped early in its lifecycle can be kept in its most compressed form as it moves throughout storage workflows.

Once deduped, data should never have to be rehydrated unless it is being accessed by users and applications. For all the rest of the classic storage workflows - backup, replication, data distribution, archive - there’s no reason for data to have to be rehydrated as it moves across tiers, platforms, and vendors.

Examples would be supporting replication of optimized (deduped and compressed) volumes, allowing deduped volumes to be backed up without rehydration (regardless of what the backup target is), and seamless integration with NDMP so that NDMP backups and restores can work transparently with deduped files, without even knowing that they are deduped. The first wave of dedupe products were not only vendor-specific (NetApp Dedupe) but also tier-specific (dedupe for backup, dedupe for primary, etc). While there are cases where a customer’s need for data reduction is urgent enough to deploy those point solutions, the real win is when dedupe is common and compelling across vendors and tiers.

Now, some people might say, “I already bought a dedupe appliance for my backup target, do I really need dedupe anywhere else?” But the fact is, if you dedupe upstream from your backup appliance, you not only save money on primary storage, you still get benefit from your backup appliance. Backups are repetitive - you back up a volume every day, either full or incremental. So even if you have already deduped your primary volume, by backing it up every day, you are creating more duplicates in the backup target. The dedupe appliance will find those and take them out. If the primary volume has already been deduped, though, your backup data set will be smaller, and the work that the backup appliance has to do will be faster. The benefits are cumulative - if you get 5:1 on your primary data, and then back that up every day for a month, you may end up with 100:1 savings in your backups instead of today’s 20:1.
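The arithmetic behind that last claim can be sketched in a few lines of Python. The simplifying assumption here (mine, not a measured result) is that the two stages are independent, so their ratios simply multiply. In practice, upstream dedupe may remove some redundancy the backup appliance would otherwise have found, which is why the figure is a "may". The 300 TB logical size is a made-up example.

```python
def combined_ratio(primary_ratio: float, backup_ratio: float) -> float:
    """Overall reduction when backup dedupe runs on already-reduced primary data.

    Assumes the two stages are independent, so their ratios multiply.
    """
    return primary_ratio * backup_ratio


def stored_tb(logical_tb: float, ratio: float) -> float:
    """Physical capacity needed for a given logical size at a given reduction ratio."""
    return logical_tb / ratio


logical = 300.0  # hypothetical: 30 daily fulls of a 10 TB volume

print(stored_tb(logical, 20))                     # backup dedupe only -> 15.0
print(stored_tb(logical, combined_ratio(5, 20)))  # primary dedupe upstream too -> 3.0
```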

Interestingly, EMC has all the pieces here. And actually we can show how this works in an HDS environment just as well, which we may do in a later post. If you run Ocarina to dedupe your primary file store on Celerra, Ocarina can do the following:

* Optimize some primary files, identified by file type, right where they are on the fastest primary tier.   This may allow those files to see better cache hit rates.
* Optimize other files and move them to another volume on the same Celerra or another Celerra, perhaps a volume with SATA instead of Fibre drives.  Because Ocarina uses EMC FileMover stubs, this means that we can create a much larger global namespace on Celerra than a simple Celerra volume would support.
* Optimize all files in policy and post them to EMC Celerra for archive in an optimized form (deduped and compressed).
* Optimize whole volumes, and then back up those volumes to EMC Data Domain, where additional dedupe will take place as backup after backup creates more dupes in the backup target.

All of this can be done - on EMC, HDS, and other vendors - as true “end to end” dedupe, where data only gets rehydrated where it’s needed for a live application or user I/O request.

The Dedupe Race

Posted by Sunshine On August - 14 - 2009

The storage press has sniffed out a good story recently. Today, Beth Pariseau has a piece up on her Storage Soup blog that homes in on the drama surrounding the technology du jour–deduplication.

The post, “HP to EMC/Data Domain: Bring it On,” has a headline that’s reminiscent of the sort of fighting words we heard from our former president.

Pariseau writes: “Admittedly late to the data deduplication game, Hewlett-Packard Co. is brewing new dedupe offerings to compete with the market’s new 800-pound gorilla — EMC/Data Domain. … HP partners with Sepaton for high-end VTLs and Ocarina for primary storage data reduction, but also develops deduplication software for its entry-level disk backup devices.”

Earlier this week, Chris Mellor at The Register covered the HP-Ocarina partnership news, also talking about it in terms of the rising competition for a complete dedupe solution. His article “HP Makes Ocarina Music” has a subhead that speaks volumes: “Ocarina close to clean sweep of file vendors.”

Mellor writes: “Ocarina has similar partnerships with BlueArc, EMC and Isilon. It looks almost inevitable that every other filer supplier must be looking at the Ocarina product and thinking a reseller deal might be a good idea. Otherwise, it could lose sales to the competition when a lot of image-type data is being stored.”

It will be interesting to see how this story unfolds. We vendor bloggers are already chattering about the recent partnership announcement, such as this post on the HP Storageworks “Around The Storage Block” blog. The post, by Pete Brey, WW Extreme Storage Business Development Manager, homes in on two recent HP announcements: first, its acquisition of IBRIX, and second, its partnership with Ocarina.

Brey writes: “Now multi-petabyte systems are great when you have zillions of files that need to be stored but so is a multi-petabyte system that is optimized so that in the same space tens of zillions can be contained. This is where Ocarina’s ECOsystem software adds its value to our NAS products. The ECOsystem software transforms your storage with its content-aware storage optimization that compresses data up to 10:1 with added features such as deduplication, ECOsnap snapshots, and its own global name space capability. The unique thing about our reseller partnership is that HP can run the ECOsystem software right on our NAS nodes, further optimizing your infrastructure. Now there aren’t too many storage vendors out there who can talk about that now, are there?”

Bragging rights, indeed.

Clearly, it’s too soon to say exactly how each player in this space will benefit and/or lose out. As Mellor’s piece obliquely suggests, this isn’t about Ocarina setting itself in opposition to any vendors–in fact, it has a partnership with EMC. Rather, it shows how each provides a piece of the puzzle. In the big picture, there needs to be a shift in thinking towards something more along the lines of end-to-end dedupe–something that our lead blogger Carter George talked about at length in his popular post, “The Dedupe (R)evolution.” But in the short run, it’s certainly good to see how each vendor is distinguishing itself, and working hard to provide the most efficient, cost-effective storage options to its customers.

Dedupe Grows Up

Posted by Sunshine On July - 29 - 2009

George Crump has a piece in Byte and Switch today that poses an important question: “Can we get to a single point of deduplication?” This is a question that we have taken up in one form or another in some of our recent posts, such as this one and this one.

In the article, Crump asks the question in another way: “… can you have all your data tiers; primary, archive and backup deduplicated by a single engine?”

In light of the recent focus on deduplication, this in my view is a question that really does need to be raised. For how long will the industry continue to silo out these different tiers for its deduplication solutions? And how much sense does it make to rehydrate data every time you move it, in order to once again deduplicate it? Not a lot.

Crump writes: “The current deduplication vendors could work on building out their solutions to either scale up into primary storage performance (see Data Domain’s DD880) or they could move their existing data duplication technology into other markets; see the increased speed of Ocarina Networks and Permabit as well as their move into cloud storage.”

At the same time, as we’ve pointed out here, online storage is quite a bit different from backups, and so far at least, none of the successful backup dedupe vendors (Data Domain, Diligent, Quantum, etc.) has been able to break into it. Rather, it is NetApp and Ocarina who have been the trailblazers.

Crump makes another key point:

“NetApp and Ocarina could continue to enhance and improve the re-hydration speed of their technologies to make read performance a non-issue, making primary storage a viable platform. Ocarina can already maintain the deduplicated format as they move through tiers, so landing on backup or archive disk would simply be another move for them.”

This is an interesting observation, and one that is often missed in reporting on both of these solutions. We look forward to seeing more debate and discussion on this issue, which was well kicked off with this piece.

EMC Dedupe - Beyond Data Domain

Posted by Mike Davis On July - 27 - 2009

With all the talk about the Data Domain acquisition, there has been less attention paid to EMC’s native de-dupe features in Celerra, not to mention its other related partnerships, such as with Ocarina for optimization of vertical applications. Last week I had the privilege of attending a webinar, “Surviving the Data Explosion through Data Reduction,” with John Hayden, CTO of NAS Engineering at EMC, where I got a fuller picture of Celerra’s latest optimization features.

John provided us with insights on how the new Celerra NAS product integrates data optimization. And while he never mentioned Data Domain directly, an astute observer could see how well EMC is integrating prior acquisitions into its architecture, and draw conclusions from that.

First, he provided us with a couple of interesting factoids from the Digital Universe research EMC sponsored for IDC:

  • In 2009 there is positive growth in digital content, but IT spending for servers & storage is down 6%
  • Over the next 4 years, data will grow 5x, but IT budgets will only grow 1.2x
  • The administrative and overhead cost of storage is 4-7x the CapEx

This was all a prelude to John discussing the new data optimization features for their Celerra NAS product. It’s great to see the NAS vendors recognizing the value of data optimization as a central part of the NAS stack. Drilling a little deeper, EMC basically pulled together file-level deduplication (single instance storage or SIS) from the Avamar acquisition, and LZ77 data-generic compression from their RecoverPoint acquisition. SIS + LZ77 are a good price-performance combination for generic office files and text docs, but they don’t make much of a dent where we see the real capacity and scalability challenges: vertical applications such as life sciences, oil & gas, and media. In fact, generic compression is becoming ineffective against the latest MS Office docs, which use ZIP as a container. If you change a single text character in an Office doc, the entire file changes, so file-level SIS no longer sees a duplicate.
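Here is a small Python experiment that illustrates both points. It is only a sketch under stated assumptions: the "document" is made-up text packed into a one-member ZIP (a stand-in for a ZIP-based Office file), SHA-256 stands in for whatever fingerprint a SIS implementation uses, and `zlib` (DEFLATE, an LZ77-family codec) stands in for generic compression. `make_container` is a hypothetical helper, not part of any product.

```python
import hashlib
import io
import random
import zipfile
import zlib


def make_container(body: bytes) -> bytes:
    """Pack body into an in-memory ZIP, standing in for a ZIP-based Office file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Fixed timestamp so two containers differ only because of their content.
        info = zipfile.ZipInfo("word/document.xml", date_time=(2009, 7, 27, 0, 0, 0))
        zf.writestr(info, body, compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


# Made-up document text: 20,000 words drawn from a small vocabulary.
rng = random.Random(0)
words = ["storage", "dedupe", "primary", "backup", "tier", "file", "volume", "data"]
text = " ".join(rng.choice(words) for _ in range(20000)).encode()
edited = b"X" + text[1:]  # change a single character

original_zip = make_container(text)
edited_zip = make_container(edited)

# File-level SIS compares whole-file fingerprints, so the edited copy looks like new data.
same_file = hashlib.sha256(original_zip).digest() == hashlib.sha256(edited_zip).digest()

# A second pass of generic compression over the container has little left to find.
recompressed = zlib.compress(original_zip, 9)

print("SIS sees a duplicate:", same_file)
print("raw text bytes:      ", len(text))
print("ZIP container bytes: ", len(original_zip))
print("recompressed bytes:  ", len(recompressed))
```

The container is much smaller than the raw text, but compressing it again yields roughly the same size, and the one-character edit is enough to defeat whole-file matching.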

So there’s a reason that Ocarina has a solid partnership with EMC, with an optimization solution that’s complementary to Celerra’s. When it comes to customers with serious capacity issues and data growth - we’re talking about gene sequencing, post-houses, and so on - there is little to gain from deduplication, and little to gain from generic compression. Not only does the optimization solution need to more intelligently unwind and understand the file structure, but it needs to make better decisions about what algorithms get applied to specific file sub-objects. This is where Ocarina comes in. Like the native Celerra de-dupe solution, the Ocarina ECOsystem integrates with the FileMover API for a tightly knit, policy-based optimization solution that works even on media and ZIP files that are already compressed.
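As a rough illustration of what "unwinding" a container and choosing an algorithm per sub-object can look like, here is a toy Python sketch. To be clear, this is not Ocarina's implementation or its algorithms. The policy in `choose_codec`, the use of LZMA for text-like members, and the function names are all assumptions made up for the example, and whether re-encoding actually wins depends entirely on the data.

```python
import io
import lzma
import zipfile


def choose_codec(name: str) -> str:
    """Hypothetical policy: pick a codec from the member's file type."""
    if name.lower().endswith((".xml", ".txt", ".rels")):
        return "lzma"   # text-like members: try a stronger general-purpose codec
    return "store"      # e.g. JPEG/PNG members: already compressed, leave as-is


def optimize_container(zip_bytes: bytes) -> dict:
    """Unwind a ZIP container and re-encode each member according to the policy."""
    optimized = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            raw = zf.read(name)  # undo the container's own DEFLATE pass
            codec = choose_codec(name)
            payload = lzma.compress(raw) if codec == "lzma" else raw
            optimized[name] = (codec, payload)
    return optimized


def restore_member(entry) -> bytes:
    """Reverse the per-member encoding, returning the original member bytes."""
    codec, payload = entry
    return lzma.decompress(payload) if codec == "lzma" else payload
```

The point of the design is that the decision is made per sub-object, after looking inside the container, rather than once for the file as an opaque blob. Every step is lossless, so each member can be restored byte for byte.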

We look forward to our collaborations with EMC, and will be very interested to watch how they continue to integrate dedupe and compression across their offerings.