At Ocarina, we’re having a great deal of success these days with partnerships, and the storage press and beyond are taking notice. Part of what has happened is that, now that NetApp has made dedupe table stakes, we have become the dance partner that many vendors are turning to, because we stand out as having the best data reduction technology for online data.
Before we go on, we should acknowledge a recent FUD-spreading post by NetApp’s Dr. Dedupe in which he tries to question whether Ocarina is even dedupe technology. Let’s clarify: Ocarina offers a solution that includes both content-aware compression and a next-generation form of dedupe called object dedupe. What makes him think that Ocarina is just “resizing photos” is frankly a little beyond us, but in any case, that’s not in any way, shape or form what Ocarina does. However, we are pleased to see that NetApp believes the Ocarina technology deserves attention and is pointing their binoculars at us!
W. Curtis Preston makes a good suggestion in the comments field of Dr. Dedupe’s post. The best way to handle this is obviously to run some tests to see which solution offers better results. As it happens, we have already commissioned just such a study, by George Crump at Storage Switzerland. His results will be published soon, but I can reveal that our results are excellent compared to block dedupe in the filer.
Fundamentally, the value proposition of dedupe technology is that it increases storage capacity, resulting in lower CapEx and OpEx. The difference between the more standard, block-level dedupe such as NetApp’s and our technology is that Ocarina can intelligently extract and analyze the natural semantic objects inside virtually any file.
For example, in a PowerPoint file, a slide is a natural object and so are graphics that appear on a slide. Rather than hashing 4k file system blocks, we look for natural objects like the slide or the graphic, and we hash and dedupe those. This is one reason that we are able to achieve such startling results. For more information on this, please see my earlier post in response to another NetApp blogger, Alex McDonald, who also seemed caught up in semantics around the difference between dedupe and compression. We are planning to release updated white papers in May that reflect all of our latest capabilities. But the bottom line always comes down to one question: how much data can you reduce?
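To make the difference between hashing 4k blocks and hashing natural objects concrete, here is a minimal sketch. It is not Ocarina's implementation: the function names, the use of SHA-256, and the idea that each file arrives already split into its objects are all assumptions made for illustration (real formats such as PowerPoint need format-specific extraction, which is not shown). Two hypothetical decks embed the same graphic at different byte offsets, so fixed blocks never line up, while the object hashes match.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # the 4k file system block size discussed above


def unique_bytes_by_block(files):
    """Fixed-block dedupe: hash each 4k block of the file's raw bytes."""
    seen = {}
    for objects in files:
        data = b"".join(objects)  # a block deduper sees only a byte stream
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            seen.setdefault(hashlib.sha256(block).digest(), len(block))
    return sum(seen.values())


def unique_bytes_by_object(files):
    """Object dedupe (illustrative): hash each extracted object as a unit."""
    seen = {}
    for objects in files:
        for obj in objects:
            seen.setdefault(hashlib.sha256(obj).digest(), len(obj))
    return sum(seen.values())


# Hypothetical data: one shared graphic, preceded by titles of different
# lengths, so the graphic starts at a different offset in each file.
rng = random.Random(0)
graphic = bytes(rng.getrandbits(8) for _ in range(20000))
deck_a = [b"title slide A", graphic]
deck_b = [b"a longer title slide for deck B", graphic]
files = [deck_a, deck_b]

total = sum(len(obj) for f in files for obj in f)
block_unique = unique_bytes_by_block(files)
object_unique = unique_bytes_by_object(files)

print("total bytes:", total)
print("stored after block dedupe:", block_unique)
print("stored after object dedupe:", object_unique)
```

In this toy case the shared graphic is stored once under object hashing but twice under fixed-block hashing, because an 18-byte shift changes every block boundary. Real-world savings depend entirely on the data set.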
In general, when we are applied to a data set where our content-aware algorithms recognize most of the file types and objects, we’ll get better results than any other approach. Where the data set is something we do not have specific algorithms for, we’ll treat each file as an opaque object, and our results will trend down to about the same as you’d expect from block dedupe. So worst case, we’re the same; best case, we’re as much as 50 times better.
A big advantage of a content-aware approach is that you can set policies that define what gets optimized and when. Block dedupe typically processes all of a volume or none. Since block dedupe has no awareness of content, it has no way to decide whether a given block should be deduped, compressed, or left alone. In a content-aware solution, you can say dedupe files like this, compress files like that, dedupe and compress files older than “x,” and leave these other kinds of files alone, because they are very performance sensitive. We see that kind of file and object level policy control being essential to broad adoption.
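The kind of file-level policy described above might be expressed along these lines. This is only a sketch under stated assumptions: the `Rule` and `FileInfo` types, the action names, the file suffixes, and the 90-day threshold are all hypothetical and do not reflect Ocarina's actual policy syntax. Rules are evaluated in order and the first match wins.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    path: str
    age_days: int


@dataclass
class Rule:
    action: str            # "skip", "dedupe", "compress", or "dedupe+compress"
    suffixes: tuple = ()   # empty tuple means "any file type"
    min_age_days: int = 0  # 0 means "any age"

    def matches(self, f: FileInfo) -> bool:
        suffix_ok = not self.suffixes or f.path.lower().endswith(self.suffixes)
        return suffix_ok and f.age_days >= self.min_age_days


# Hypothetical policy; order matters because the first matching rule wins.
POLICY = [
    Rule("skip", suffixes=(".vmdk", ".db")),       # performance-sensitive files
    Rule("dedupe+compress", min_age_days=90),      # files older than "x"
    Rule("dedupe", suffixes=(".pptx", ".docx")),   # dedupe files like this
    Rule("compress", suffixes=(".log", ".txt")),   # compress files like that
]


def choose_action(f: FileInfo, policy=POLICY, default="skip") -> str:
    """Return the action of the first rule that matches, else the default."""
    for rule in policy:
        if rule.matches(f):
            return rule.action
    return default


for f in [FileInfo("vm/disk.vmdk", 400), FileInfo("deck.pptx", 10),
          FileInfo("deck.pptx", 120), FileInfo("app.log", 5)]:
    print(f.path, f.age_days, "->", choose_action(f))
```

A block-level deduper has no equivalent hook, since it sees only blocks and not the file type or age that these rules key on.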
This is one key reason many of the largest storage vendors are partnering with Ocarina and including its data reduction technology in their overall offerings. We have recently announced partnerships with EMC, HDS, HP, BlueArc, and Isilon, as well as cloud storage provider Nirvanix. This, of course, is good news for customers and the industry in general.
Our success will depend on how seamlessly we integrate with and support their storage hardware. So far, this is going well, and in fact some of the partnerships we’ve already announced look as if they may escalate to deeper integrations and levels of partnership.
If we execute well on those, then the data reduction for file storage category may eventually become “NetApp dedupe” versus “everyone-else-with-Ocarina.” That is our goal.
Data reduction for online storage is a hot topic for a couple of reasons. It’s been validated in other parts of the data center. We all know dedupe has become the norm for backups, with disk-based backup targets with dedupe built in rapidly replacing tape. Compression and simple dedupe (called dictionary compression) have also been widely adopted (as WAFS solutions) in the network. The next natural frontier is online file data. Just as Data Domain focused their technology on backups and Riverbed did the same for WAFS, we have optimized our data reduction technology for online data. Each of these use cases has a different design optimum, and if you started out building a solution for online file data, you’d make different design decisions than you would for backups and WAN optimization. In our case, that meant building a whole new kind of dedupe to be able to get the best results on the kinds of files that are driving storage growth.
In my view, we’re really just in the early stages of seeing data reduction make its way to online storage. Over the course of the next year or two, we expect to see dedupe for online data become as widely understood and deployed as dedupe for backups. During that time, we’ll see lots of debate over different approaches, lots of education about how things work, and a market that will gravitate towards the solutions that deliver both the best data reduction results and the best performance for end-users and applications.

This looks like an interesting approach. Do you have any demos or videos readily available online, such as the deduplication explanations that NetApp posts (e.g. http://www.youtube.com/watch?v=uqNxgqasV3o )?
I look forward to seeing the results of your performance comparison to NetApp deduplication. I hope that you plan to go beyond simply showing that your savings are higher for a given dataset. The interesting talking points will be the numbers around your deduplication/compression throughput, the number of virtual servers and virtual desktops you can run simultaneously, and of course the performance benefits in a disaster recovery scenario.
It is always very interesting to see how the new, smaller companies attack the same problems. Your ability to execute fast and change direction with market demand puts continuous pressure on the industry to develop the best technology possible.
Regards,
JB
Thanks for the comment, Jacob. We do have some whiteboard videos, and will be doing some updated versions in the near future. Here are a few I’d suggest:
http://www.youtube.com/watch?v=X11fvNOuUOQ
http://www.youtube.com/watch?v=J2fbX2jpKgc
Also, please visit http://www.ocarinanetworks.com and click on “Resources” to find white papers, webcasts, and podcasts.