
Online Storage Optimization

Exploring Next Generation Storage Solutions

The best is yet to come

Posted by Sunshine On July - 20 - 2010

For almost exactly a year, I had the privilege of being a blogger for this site, Online Storage Optimization. It was one of the most fulfilling collaborations of my professional life.

When I heard the news yesterday that Dell would be acquiring Ocarina Networks, my first reaction was pride. I couldn’t help but give myself an inner “high five” for having recognized what a great company Ocarina is and would become. My second was to congratulate Dell on their smart choice. I was lucky enough to have worked with the people at Ocarina, and now a whole new group of folks will have this opportunity.

The people at Ocarina Networks are not run-of-the-mill. They are not even above average. They are extraordinary. Anyone who wants to understand what makes a technology company excel would do well to study them. As I wrote in my final, farewell post, Murli Thirumale, CEO of Ocarina, is about as far from the image of the typical start-up CEO as you can get. He is soft spoken, thoughtful, and a good listener.

He built his company around a well thought out business philosophy. He first identifies a problem or need, and then designs a product in response to it. In this case, the problem he identified was the proliferation of data worldwide. Rather than go for the obvious, he brought together a team to research the problem and solve it in a new way. This approach is rare. Most of the time, you have start-ups that try to launch based on a technology they themselves developed–in other words, their own pet project. It takes discipline to go about things the way that Murli does.

Murli also showed himself to be a leader in the true sense. Rather than feed his own ego, he chose to surround himself with giants of the storage industry, and he allowed them to get the job done the way they saw fit. Carter George, VP Products, is one of those people. I am particularly indebted to Carter. He was truly a mentor to me. Despite the intense demands of his role at Ocarina, he always had time to answer my questions. He never treated me as anyone other than an equal. I also got to know Goutham Rao, the visionary genius who is the company’s CTO. In addition, I met Dave Withers, who is the man behind the company’s multiple partnerships with top storage companies. For the last few months of my tenure, I worked closely to ramp up the social media program with Mike Davis, director of marketing. Dell is extremely fortunate. They are acquiring technology, but to me the real gift is that they get to work with these remarkable people.

As the headline says, this to me seems like the beginning for Ocarina. It is a bold plan to take the vision of end-to-end dedupe and make it real. This is the next step for the storage industry–and a crucial one if storage costs are to be kept in line, and infrastructure is to keep up with the demands of the real world. We are living in a time when data growth is spiraling upward at rates that no one could’ve imagined even a decade ago. This is the time for a cohesive, meaningful response to this reality. I couldn’t imagine a better outcome for Ocarina, Dell, and the industry as a whole. I am honored to have been a small part of it.

Passing along what we learned from OEMs

Posted by Mike Davis On July - 18 - 2010

It’s an exciting time at Ocarina, because we’re right in the middle of a wave of OEM efforts to bring data-reduction to market as a standard feature across a wide variety of storage implementations. ECOsystem for OEMs is an Ocarina offering of software libraries and APIs that allows storage OEMs and ISVs to embed data-reduction into their products. At Ocarina, we firmly believe that within a couple years, not only will we see dedupe as a new “standard feature” in all major block and file storage products, but it will be increasingly found in host applications as well.

Here is a series of requirements that we’ve consistently heard from OEMs in our discussions, along with how Ocarina addresses each of them in the ECOsystem for OEMs.

1) The solution shrinks data well.
Ok, so this is obvious. Dedupe is supposed to shrink data, and that improves system utilization and reduces costs. Doing this well is about algorithms, and Ocarina has been on top in the algorithm game since we started shipping our primary storage optimization products almost 2 years ago. We’ve shown in some deployments that we can deliver 80% savings when traditional block dedupe delivers no more than 30%. We believe in fact that we’re the only storage technology vendor to actually employ algorithm PhDs whose sole job is to invent better algorithms…which is something that has proven really useful in specialized markets where a few unique content types dominate the terabytes.

We are the only provider today that deploys dedupe and compression concurrently. Our dedupe algorithm is a content-aware, variable-block, sliding-window approach. ‘Content-aware’ is an overused term for sure, but here it reflects for example that we’ll recognize monolithic data structures in a stream like a JPEG blob, and we know that slicing that JPEG into 8KB chunks for dedupe delivers absolutely no benefit, and thus is a complete waste of time, CPU, and memory. We’ll treat that JPEG (and other data types like it) as a contiguous chunk, which makes our dedupe namespace extremely fast and efficient.
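To make the variable-block, sliding-window idea a little more concrete, here is a minimal Python sketch. It is not Ocarina’s algorithm: the rolling sum, the window size, the boundary mask, the size limits, and the name `chunk_stream` are all assumptions picked for illustration. The point it shows is that a chunk boundary is declared wherever the last few bytes match a pattern, so boundaries follow the content rather than fixed 8KB offsets. Content-aware handling of whole objects such as a JPEG blob is not shown.

```python
WINDOW = 16        # bytes in the sliding window (assumed value)
MASK = 0x3F        # boundary when the low 6 bits of the rolling sum are zero (assumed)
MIN_CHUNK = 32     # hypothetical lower bound on chunk size
MAX_CHUNK = 1024   # hypothetical upper bound on chunk size


def chunk_stream(data: bytes) -> list:
    """Split data into variable-size chunks at content-defined boundaries."""
    chunks = []
    start = 0      # offset where the current chunk begins
    rolling = 0    # sum of the last WINDOW bytes of the current chunk
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= WINDOW:
            rolling -= data[i - WINDOW]  # slide the window forward by one byte
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (rolling & MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Because the boundary test depends only on the bytes inside the window, inserting data near the front of a stream tends to shift only nearby boundaries, which is why this family of techniques usually finds more duplicates than fixed-size slicing. A production chunker would use a stronger rolling hash than a plain byte sum.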

2) The solution minimizes time to market.
Most OEM vendors are under considerable time pressure to bring dedupe to market, either as a competitive response, or because their customers are demanding it. But the OEM’s dev and test engineering resources are always limited. With that in mind we made a point of developing a full featured library. That means not only do we slice the data and do hash lookups (as with the Permabit product), but we also do the dedupe, compression, on-disk data management, metrics and reporting, throttling mechanisms, optimized data movement, and more. By delivering a full-featured suite of capabilities, the OEM can rapidly bring embedded data reduction to market without for example having to redesign their file system or block map systems to complete the dedupe workflow.

The other attribute that accelerates time to market is simplicity. Despite being full featured, the ECOsystem for OEMs is accessed via a lightweight object-like API that OEM developers have told us is extremely simple to work with.

3) The solution has the flexibility to support specific use cases.
The requirements and functional expectations for implementing data reduction differ from device to device, and any embedded solution needs to have the adaptability to serve different applications. For example content-aware algorithms may not be a meaningful solution in a block array where data structures are completely opaque. Or CPU and memory constraints on a given device may require the use of a lighter weight dedupe workflow, and that shouldn’t force major architectural rework of the solution.

The ECOsystem for OEMs has been designed to support 6 embedded use cases: Servers, block arrays, NAS, object stores, cloud storage, and backup targets. Some of the differences between these tiers manifest themselves as implementation best practices, but there are also clear functional decision-points that allow an OEM to implement the right solution for the job. Importantly though, all of these tiers are compatible in key respects, allowing cross-platform manageability, and giving rise to end-to-end features such as optimized data movement.

4) The solution has high performance, while working within resource constraints.
Performance overhead is an often discussed problem associated with dedupe solutions. We’ve learned to solve these problems through 2 years of empirical experience, and believe we have the fastest, lowest-overhead dedupe solution. Moreover, ECOsystem for OEMs obeys hard constraints of the host platform in terms of CPU and memory usage.

There are a couple main points where dedupe can impose performance penalties:
A) During write, the chunking and lookup process takes time. Like other solutions, Ocarina does this in memory to reduce that penalty. You do have to be careful to understand for a given chunk size how much unique data that 1GB of dictionary can address, and given a constrained memory, how far can the solution go before forcing on-disk (=slowww) lookups. ECOsystem for OEMs utilizes the industry’s most memory efficient lookup design to make the most of existing resources.
B) During read, the reads from disk for any de-duplicated volume are more random than typical disk IO. Ocarina is also able to mitigate this impact through content-aware read-ahead caching that anticipates the next chunks that will be read from disk.
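The memory question in point A is simple arithmetic, and a small Python sketch makes it easy to try different numbers. The function name and the 32-byte entry size are assumptions for illustration only; real lookup structures differ, and nothing here describes the actual ECOsystem design.

```python
GIB = 1024 ** 3
KIB = 1024


def addressable_unique_bytes(dictionary_bytes: int, chunk_bytes: int, entry_bytes: int) -> int:
    """Unique data an in-memory dictionary can track before lookups spill to disk."""
    entries = dictionary_bytes // entry_bytes  # one entry per unique chunk
    return entries * chunk_bytes


# Assumed: 32 bytes per dictionary entry (hash plus location), 8 KiB chunks.
capacity = addressable_unique_bytes(1 * GIB, 8 * KIB, 32)
print(capacity // GIB, "GiB of unique data per GiB of dictionary")
```

Under those assumed numbers, halving the chunk size halves the amount of unique data the same memory can address, which is the trade-off to understand before lookups start going to disk.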

5) The solution supports next-generation features that customers want and competitors don’t have.
Ocarina has spent a lot of time in the market talking to end-customers about what they want in a dedupe solution. In addition to things like “shrink well”, a couple things keep coming out. One is to keep data in shrunken form as it moves around in workflow operations such as replication, backup, tiering, etc. We call this end-to-end optimization (or E2EO) and it expands the value of dedupe by improving backup processes, reducing LAN bandwidth, and reducing the CPU overhead that occurs when data is repeatedly rehydrated and re-deduped for no reason. For those OEMs who carry a broad catalog of server, storage, and backup solutions, there are huge benefits in being able to deliver this end-to-end value to customers who have adopted that OEM across their entire IT infrastructure.

The other key feature is the ability to support structured (eg database) and semi-structured (eg VM) applications. The files in these applications are almost always >95% static and <5% active. But because they are active, traditional dedupe solutions can create a tremendous IO overhead as dedupe operations battle against a steady stream of changes. Instead of getting in the way, Ocarina has invented a way to dedupe the inactive portions of these files, while imposing no performance overhead on active IO. Like our dedupe chunking process, this is one of several content-aware features that Ocarina delivers in the ECOsystem for OEMs.

These advanced dedupe features — which bring data reduction benefits to new applications and workflows — allow storage OEMs and ISVs to deliver more value to customers, to do it faster, and differentiate their product over competitive offerings that have first-generation dedupe capabilities.

Dedupe Or Compression? Both! Optimization or Performance? Both!

Posted by Carter George On June - 8 - 2010

Interest in and discussion of data deduplication and primary storage compression seems to be at an all-time high right now. In the last few weeks, we have seen new entrants into the market, including Permabit, and a broad overview from Wikibon, focused on storage optimization. At Ocarina, we see industry discussions such as these as proof positive our business is on the right track. As the market innovators and pioneers in this space, we believe in end to end storage optimization, aimed to enable customers to best use their existing equipment and protect their core data.

Some of the discussion seen in articles around the Web (including Storwize) has focused on the speed of compression and deduplication – with concerns around how this impacts primary storage performance. These are completely valid issues, as one should not have to sacrifice performance in exchange for features. From our company launch, we have focused on achieving the best of both worlds – with primary data storage optimization, and high performance, both with our own dedicated devices and those from our OEM partners.

For customers who want choice, Ocarina has you covered. Ocarina has both fast in-band deduplication and advanced compression options. You can either run fast in-band deduplication, with sub-millisecond latency, or you can choose deep content-aware compression, which takes longer, of course, but also gets results that simple deduplication can’t hit. Or… you can do both! Stop the presses!

Ocarina’s deduplication is fast – you can get deduplication results immediately on every file that passes through our systems, and then come back to do a post-processing run that gets advanced results later, at an off-peak time, should you choose. Also, the post-processing engine is driven by policies which you set, letting you compress only files that meet criteria you choose – for example, size, age or type.

Thanks to Ocarina winning a number of high profile deals at large customers where deduplication alone is not enough, Ocarina has become associated with heavy compression – but we do dedupe as well, and we’re quite good at it! If you look at results we recently got on a corporate data set, we were able to shrink that data set by 92% overall, with 60% coming from deduplication and 32% more coming from files that were compressed in the post-process.

At Ocarina, we believe deduplication will soon become an embedded system feature, and a commodity. It is possible to do in-band deduplication, with very little latency, and minimal CPU resource demands. Dedupe will become a storage fundamental, and the pricepoint for customers to gain dedupe will trend towards zero (where NetApp is today with their A-SIS offering).

Companies like Permabit will have to win quite a few OEM deals at these kinds of end user prices, minus an OEM discount, to be profitable – but advanced features, including dedupe-aware data movement and advanced compression, will be value-add features customers will pay more for. Unlike dedupe, those features won’t provide big value for every customer, but they will apply as important benefits providing value to a significant percentage, and unlike dedupe, advanced content-aware compression is not going to be a commodity given away for free in every system.

Ocarina is in a good position with our technology, and both customers and OEMs should evaluate not just the technology of a dedupe provider, but their ability to financially survive as well. Once a company has reduced your data, and locked it away in their format, the last thing you want is for that company to go out of business, and compress your chances of getting it back. A company that only has dedupe on the table is going to be priced out of the market by 2012, even if it is successful today selling dedupe.

When something sells to end users for free, you can’t make up the profit margin by selling in volume. In Ocarina’s case, we can be extremely aggressive on price for dedupe, because we bring more to the table and have something else to sell, and every OEM deal we win creates a platform ready for future upgrades. As Wikibon wrote yesterday, “Ocarina provides the highest levels of compression by using the optimum compression techniques.” We are the best in the world at this stuff and we will continue to stay ahead of the competition. We also have a well-rounded business that won’t see us get deduped. So if you are looking for an advanced solution that does much more than dedupe, Ocarina is the answer.

Protecting Compressed Data and Reducing Costs

Posted by Carter George On May - 25 - 2010

As @storagebod (Martin Glassborow) noted today in a blog post, the issue of protecting deduplicated data is an important one for any business.

Deduplication and other data reduction technologies offer the opportunity for increased data protection at a reduced cost.

When you hit a threshold of 50% or greater overall data reduction, it lowers the cost of performing a full mirror of data. For example, if you have 100 terabytes of data, mirroring today would require 200 terabytes of disk space. But if the data were reduced by 75%, you could store the same data in 25 terabytes, and full mirroring would require only 50 terabytes of disk - a significant savings. In other words, it is possible to fully protect your storage, including full mirroring, with less disk than it takes to store the unprotected data today.

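Any amount of space savings lowers the cost of mirroring, but once the savings reach 50% (a data reduction ratio of 2:1 or better), the reduced data plus its mirror fit in the space the unreduced, unmirrored data occupies today. At that point the net storage cost of full mirroring is effectively zero.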
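The mirroring arithmetic is easy to check for your own numbers. The helper below is a small Python sketch; the function name and parameters are illustrative and not part of any product.

```python
def mirrored_footprint_tb(logical_tb: float, reduction: float, copies: int = 2) -> float:
    """Disk needed to hold `copies` full copies of data after data reduction.

    `reduction` is the fraction of space saved, e.g. 0.75 for 75%.
    """
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in the range [0, 1)")
    return logical_tb * (1 - reduction) * copies


print(mirrored_footprint_tb(100, 0.0))   # unreduced data, mirrored
print(mirrored_footprint_tb(100, 0.75))  # 75% reduction, mirrored
print(mirrored_footprint_tb(100, 0.5))   # break-even point: same as 100 TB unprotected
```

At exactly 50% reduction the mirrored footprint equals the original unprotected footprint, and anything better than that comes in under it.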

There are a number of options that your deduplication provider should be pursuing to complement your data protection strategy. For example, a vendor should allow you to do two key things with your dedupe configuration:

  1. Allow a minimum level of duplicate blocks to accumulate prior to starting deduplication. For example, you could allow data to be written twice, and then deduplicate all subsequent occurrences. So long as the dedupe solution is aware of both original occurrences, you have a form of mirroring without needing to do full mirroring at the disk level. Call this duplicate mirroring.
  2. Set a threshold for a maximum number of duplicates. To cap your exposure to the loss of a physical disk, you should be able to ask that once ‘n’ instances of a block have been found, the solution starts over with a freshly stored copy. You could set that number at 8, 32, 128 or whatever frequency makes business sense to you, so that the potential loss of the sector on which the block is stored would only affect a certain number of files.
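Here is one way the two settings above could interact, as a rough Python sketch. The class name `PolicyDedupeStore` and the parameters `min_copies` and `max_instances` are hypothetical, and the model only counts physical writes rather than storing data. It is an illustration of the policy, not a description of how any vendor implements it.

```python
import hashlib
from collections import defaultdict


class PolicyDedupeStore:
    """Counts physical writes under two hypothetical dedupe policies."""

    def __init__(self, min_copies: int = 2, max_instances: int = 8):
        if not 1 <= min_copies <= max_instances:
            raise ValueError("need 1 <= min_copies <= max_instances")
        self.min_copies = min_copies        # copies written before dedupe starts
        self.max_instances = max_instances  # instances allowed before starting over
        self.physical_writes = 0
        self._seen = defaultdict(int)       # block hash -> instances in current generation

    def write(self, block: bytes) -> bool:
        """Return True if the block is physically written, False if deduplicated."""
        key = hashlib.sha256(block).hexdigest()
        seen = self._seen[key]
        if seen >= self.max_instances:
            seen = 0  # threshold reached: start over with fresh physical copies
        stored = seen < self.min_copies
        if stored:
            self.physical_writes += 1
        self._seen[key] = seen + 1
        return stored


store = PolicyDedupeStore(min_copies=2, max_instances=4)
results = [store.write(b"same block") for _ in range(10)]
print(results, store.physical_writes)
```

With these settings, the first two writes of a block land on disk, the next two are deduplicated, and the fifth starts a new generation, so no physical copy ever backs more than four instances.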

These two examples are not necessarily a complete answer in and of themselves, but they do provide guidance for tools you can work with as part of an overall data protection strategy. As deduplication becomes a storage fundamental, and is in place across multiple tiers of storage and on multiple products in your data center, understanding the impact of dedupe on your data protection strategy will be key and your dedupe vendor should be providing you the right tools to manage that.

One More Time: In-Band Versus Post-Process

Posted by Carter George On April - 23 - 2010

My Mom always used to tell me, “You have to be able to distinguish between Need and Want”.
I need a car. I want an Aston Martin.

In Steve Kenniston’s post Storage’s 2010 Hottest Technology he says that customers who “require real-time random access compression….in front of their active solution” need something very fast.
Well, yes, they do.


But having a requirement to go fast has nothing to do with in-band versus post-process. And saying that customers who require something in front of their active solution need an in-band solution is like saying that customers who require an in-band solution require an in-band solution.


Who requires something “in front of their active storage”?


Some customers may want that, some might prefer it, but who needs it? In many cases, you could also use a solution that is next to your active solution, or inside it.   As long as it is fast enough and does a good job of data reduction, I’ll let you decide which one is a car and which one is an Aston Martin.

Let’s just run through some of the comparison points between in-band and post-process methods for data reduction and see what the trade-offs are.

As for why we’re talking through this old saw again, Storwize is an in-band appliance that sits on the wire in front of your active NAS storage. Ocarina sells an optimization appliance that does post-process compression and dedupe, but we also sell software-only solutions (mostly embedded inside storage vendors’ products) that work in-band inside the active storage system.

What Exactly Is Happening In-Band or Post-Process? The only thing this discussion is about is when data gets shrunk.  Every solution does real-time random access for users in-band.    When users or applications do I/O to their already-compressed data, it is always handled in-band, real-time, and transparent.  I think that’s true of every solution mentioned in the CORE table.    When people talk about in-band versus post-process, they are talking only about when the data gets compressed, not about when it gets decompressed.

Speed.     In-band versus post-process has absolutely nothing to do with the speed of a solution.    Time-to-compress can be just as fast post-process as it is in-band.    To demonstrate this, Ocarina turned off 111 of its 112 compressors, and also turned off our two-stage dedupe engine, and ran just our fastest compressor on a set of data.    We got 3508MB/sec throughput.     I don’t think anyone would buy Ocarina and then use only our simplest fastest compressor, but if they did, they’d find it goes as fast as any in-band solution on the market.      There is an implication regarding speed and in-band, though, and that is that in-band solutions can only run data reduction algorithms that go fast, because they are sitting there in the path of every I/O.  If the in-band solution is slow, then all your I/O is slow.   So yes they are fast, but this also limits the range of what they can do.  If the type of data you have requires a slower compressor to get good results – photos and video would be good examples – then you just can’t use an in-band solution.

How Much Disk I/O? In-Band solutions have a significant advantage in the area of disk I/O….or at least they do when they are getting good compression.     Because in-band solutions compress data before it gets written to disk, if the data has been shrunk by 50%, then there is 50% less data to actually write out to disk.  If you are I/O bound, and if a compressor is getting good results on your data stream, then this can help performance a lot.  You are simply doing a lot less I/O.  When dedupe vendors say that they ingest 1TB/hour, they don’t usually mean that they actually write 1TB of data per hour out to physical disk.  What they mean is, if you assume 20:1 dedupe, to store 1TB of logical data requires 50 Gigabytes of data to be written to disk.  The other 950GB are thrown away, because they were duplicates of data already stored on disk.   You have to be careful with these dedupe throughput claims – they assume you are getting the expected dedupe results.

If you sent 1 Terabyte of totally new and unique data, with no duplicates to data seen before, and the dedupe appliance actually had to write out 1 TB of physical data to disk, most of them would be quite a bit slower than their claimed performance. But for both dedupe and compression, if you get good data reduction, then you can improve I/O performance by being in-band.   Just as is the case with dedupe, though, if you are not getting a good compression result, then you won’t see this performance improvement.
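To see how much a throughput claim leans on the assumed dedupe ratio, you can run the numbers yourself. This Python sketch uses illustrative names and simply splits logical data into what reaches disk and what is discarded as duplicate.

```python
def physical_and_discarded_gb(logical_gb: float, dedupe_ratio: float):
    """Split ingested logical data into (written to disk, discarded as duplicate)."""
    if dedupe_ratio < 1:
        raise ValueError("dedupe_ratio must be at least 1 (1 means no duplicates)")
    written = logical_gb / dedupe_ratio
    return written, logical_gb - written


print(physical_and_discarded_gb(1000, 20))  # the 20:1 case from the text
print(physical_and_discarded_gb(1000, 1))   # totally unique data: everything hits disk
```

The second case is the one to ask a vendor about: with no duplicates, every gigabyte ingested is a gigabyte written.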

All or Nothing. In-band solutions compress everything that goes by.   They have to.  Because of the speed constraints, they don’t typically have time to analyze data, figure out what it is, who it belongs to, or whether it falls in policy for some action or another.     Post-process solutions let the administrator decide which data to shrink, when to shrink it, how to shrink it, and even where to put the data.  This is probably the single biggest advantage of the post-process approach.    You may decide to dedupe and compress all files that have not been accessed for a month.   You might decide to dedupe all files that are virtual machine files, such as VMDK’s.  You may decide to apply fast compression only to files that are between 2 days old and a month old.  You may decide not to compress hot data, any file that has been read or modified in the last 24 hours, at all.

You may decide to run compression and dedupe jobs only at off-peak hours. You may decide to have the solution read a file from a Fibre Channel tier of drives, dedupe and compress the data, and write it out to a less expensive SATA tier.  This fine grain control, which allows active data management as part of the data reduction solution, makes post-process a clear winner for most unstructured file data.    For transactional databases, since the data is always hot, in-band may be the better approach.
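The kinds of policies described above can be pictured as a small decision function. The Python sketch below is only an illustration: the function name, the action labels, the rule ordering, and the use of days since last access as the single clock are all assumptions, not the policy language of any product.

```python
def choose_action(path: str, days_since_access: float) -> str:
    """Pick a data reduction action for a file under a hypothetical policy set."""
    if days_since_access < 1:
        return "skip"                 # hot data: touched in the last 24 hours
    if days_since_access >= 30:
        return "dedupe+compress"      # untouched for a month: full treatment
    if path.lower().endswith(".vmdk"):
        return "dedupe"               # virtual machine files: dedupe only
    if days_since_access >= 2:
        return "fast-compress"        # between 2 days and a month old
    return "skip"                     # 1 to 2 days old: leave alone for now


for name, age in [("report.doc", 0.5), ("report.doc", 10), ("vm01.vmdk", 10), ("scan.tif", 45)]:
    print(name, age, choose_action(name, age))
```

An in-band device has no time to evaluate rules like these on every write, which is the practical meaning of all or nothing.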

What About Existing Data? Let’s say you have an EMC Celerra filer with 100TB of file data on it.  You buy a fancy new in-band compression solution.    New data that is being written gets compressed as it comes in. But what about that 100TB of data that’s already there?    Well, in-band solutions may give you a tool to read that data, compress it, and put it back….making it a post-process solution!   The fact is, most customers already have a lot of data.  Usually, customers who are out looking for a data reduction solution have lots and lots of data.    If you have a Petabyte of data sitting on the floor, and you can dedupe and compress that data down to 100 Terabytes, you just created 900 Terabytes of free disk.  That’s tremendous savings, and you can only get it by post-processing the data that is already there.

Data Integrity. This is sort of the skeleton in the closet for in-band solutions.     All data reduction vendors go to great lengths to ensure data integrity.  We all do checksums, or even bit-by-bit comparisons.   But when you do in-band compression, that means that you have never had a full original copy of your data on disk.     If there’s a bug in a compression algorithm (and let me be clear – I’m not accusing any vendor of any such thing) there is no original uncompressed data to go back to.

In a post-process solution, data gets written out to disk in its full original form. You decide when to compress it.  The conservative shop will say, compress everything after it is at least 24 hours old.  Why?  Because then you will have at least one backup of the original data taken by your backup solution.   Now you can go ahead and shrink the data, saving disk space on your primary storage.   But if there ever were a data corruption bug – and really, as a group, all of the vendors listed in the CORE score box go to great lengths to guarantee there isn’t, but if – then you’d have a full original copy to go back to.   The CIO of a major German bank once told me that for this reason, the bank would never implement an in-band solution.  He said, logically, I understand the protections that are in place and that they should be sufficient, but psychologically and from a risk perspective, I can not bet the data of the bank and its customers on those assurances.

Dedupe is Always a Post-Process.  No, Really! All dedupe, even solutions that say they are in-band, is a form of post-process.     One of the things that strikes me about the CORE article and the discussion that follows is that the line of reasoning is, “dedupe is the most important thing to happen in storage since forever, so you should go out and buy compression”.   Well, dedupe is a form of compression, and it’s one that compares new in-coming data to data that has already been stored.  To do this, you hold in-coming data somewhere while you compute a hash, look up the hash in an index, and find out whether the data is a duplicate of something you’ve already stored.    Solutions, like Data Domain, that claim to be in-band, hold that in-coming data in a very small holding tank (an NVRAM card) and process it quickly.  The solutions that are called post-process write the data out to disk and then figure out whether it’s a duplicate or not.  If it is, they get rid of it.   In both cases the data is being stored somewhere until the dedupe process runs – it’s just a question of where and for how long.    And dedupe is really important – for online data as much as for backup.
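The hold, hash, look up, and decide sequence described above looks roughly like this in Python. It is a minimal sketch: the class name `DedupeIndex` is made up, SHA-256 is an assumed choice of hash, and a plain dictionary stands in for both the index and the disk.

```python
import hashlib


class DedupeIndex:
    """Toy dedupe store: an in-memory hash index over stored blocks."""

    def __init__(self):
        self._blocks = {}  # digest -> block contents (stand-in for disk)

    def ingest(self, block: bytes) -> str:
        """Hold the incoming block, hash it, and store it only if it is new."""
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self._blocks:   # index lookup
            self._blocks[digest] = block  # unique: keep it
        return digest                     # duplicates just reference the existing copy

    def read(self, digest: str) -> bytes:
        return self._blocks[digest]

    @property
    def stored_bytes(self) -> int:
        return sum(len(b) for b in self._blocks.values())


index = DedupeIndex()
first = index.ingest(b"hello")
second = index.ingest(b"hello")   # duplicate of data already stored
third = index.ingest(b"world")
print(first == second, index.stored_bytes)
```

Whether the incoming block waits in NVRAM or on disk while `ingest` runs is the only real difference between the approaches labeled in-band and post-process.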

So Which One is Better? Neither.  There is no one right answer.    Different use-cases call for different solutions.  If you have an OLTP database running on a NAS share, in-band compression is probably the best bet.  If you have lots of unstructured file data, I’d say post-process (including dedupe) is going to be better most of the time.   But there are other use-cases, and there are exceptions even to those generalizations.  What I will say is this.   Most data, such as the IDC data on storage growth, shows Tier 1 Primary and database data as having very flat growth year-to-year, while the same data shows unstructured file data – Office docs, email, Sharepoint, photos, video, internet “stuff” - as growing at almost out-of-control rates.


Both in-band compression and full compression and dedupe solutions have their place, but the most important thing for data reduction is to attack the place where data growth is causing the most pain. For the most part, you’ll find that that growth is in unstructured file data, that it is not in the most performance-sensitive and finely-tuned Primary tier, and that if you got 75% savings on just the files in your data center that have not been modified for at least three months, you’d be flabbergasted at the savings.

Next:     Do you have to expand data to back it up?

Words of Wisdom: Let’s get Organized

Posted by Carter George On April - 22 - 2010

Let’s Organize This Discussion A Bit!

I think there are at least three interesting topics you bring up here that are worth exploring further. Please see the recent post by David Vellante on Dedupe Rates Matter…Just Not as Much as You Think

I’ll respond with a separate post on each of the following topics, so that the community can track and participate in the threads that they find interesting and relevant, and ignore the ones they don’t.
The topics are:

1. Is the CORE Formula Flawed?
2. In-Band versus Post-Process: Is it even worth arguing about?
3. Do you have to expand data to back it up?

Post 1. Is the CORE Formula Flawed?

I’m going to argue here that the CORE formula is flawed, but I want to say right off that the CORE formula has already added value to the community.
It’s a great idea to try to measure the value and effectiveness of different data reduction solutions, and by putting a first shot at it out there, there’s already a vigorous discussion going on here that is going to increase awareness of the issues involved, and will probably lead to a new and improved version of the formula. I think we want to recognize the value that Wikibon has brought to both customers and vendors by starting this whole brouhaha.
That said, yes, the CORE formula is deeply flawed.

Weighting the Factors to Reflect Customer Priorities

From a purely mathematical perspective, the formula is most fundamentally flawed because it builds in a bunch of value judgments as Constants that should really be Variables. The weighting of the values in the different columns may have different levels of importance to different customers, and they should therefore be able to assign that weight themselves, for their environment.
Instead, the weighting is hard-coded. The formula tells the customer that “time to compress” is more important than actual compression results.

That will certainly be true for some customers. It’s not true for others.

What I’d recommend to fix this is simple. Let a customer fill in a value to rate the importance of each column to them. Weight each column from 1 to 5, or 1 to 10. Then multiply the normalized score for each column by the customer’s assessment of its importance to them. That way, if “time to compress” is the most important thing to a customer, they can rate it 10. Great. If they rate that 5, and rate compression results 9, then the end score will better reflect their needs and priorities.
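Here is a small Python sketch of that customer-weighted scoring. To be clear, this is not the CORE formula, and the two products and their normalized scores (on an assumed 0 to 10 scale) are invented purely to show how the ranking can flip with the weights.

```python
def weighted_score(normalized: dict, weights: dict) -> int:
    """Sum of each column's normalized score times the customer's weight for it."""
    return sum(normalized[column] * weights[column] for column in normalized)


# Hypothetical normalized scores, not measurements of any real product.
product_x = {"time_to_compress": 9, "compression_results": 4}
product_y = {"time_to_compress": 5, "compression_results": 8}

speed_first = {"time_to_compress": 10, "compression_results": 5}
savings_first = {"time_to_compress": 5, "compression_results": 9}

for label, weights in [("speed first", speed_first), ("savings first", savings_first)]:
    print(label, weighted_score(product_x, weights), weighted_score(product_y, weights))
```

With the speed-first weights the hypothetical product X comes out ahead, and with the savings-first weights product Y does, which is exactly why the weights belong to the customer and not to the formula.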

Include More Things That Matter

Second, the table simply does not represent all of the things that matter. Curtis Preston and others have pointed out that “time to decompress” or response time, might be more important to some customers than time to compress. One measures how fast a customer realizes disk space savings. The other measures the response time users and applications will see when they access compressed data. Those are different things. As the poster from EMC noted, there are different kinds of performance, and different online storage use-cases may put more weight on one versus another. Some applications need streaming write throughput, others need sequential read performance, while others need the ability to seek to the middle of a file and modify a few bytes with low latency.

Admittedly, it would be hard to measure all these things without getting overly complicated, but having at least “time to compress” and “response time” called out separately would be useful to most shops.

There are other less measurable, but potentially important, intangibles. Perhaps you could add a column with a list of features or product characteristics and let the customer assign a value from 0 to 10 for each of those things. If they are not important, fine, give them a 0. But if they are important, that lets the customer express their priorities in the score.

Here are just some things that I think at least some customers would think are worth evaluating as part of a data reduction solution:
• All or Nothing: Does the solution compress everything, or can I choose what to compress based on policy?
• Does the solution do compression?
• Does the solution do dedupe?
• Can I back up data in its most-compressed form, or do I have to expand it to back it up?
• Is the solution considered certified, validated, or supported by my storage vendor?
• Does the solution have an HA or fault tolerance capability?
• Does the solution have the ability to scale up by adding multiple nodes to work on a single volume or namespace?

I’m sure there are others, and maybe the CORE formula, which is supposed to be a measure of Effectiveness rather than overall product merit, would consider these factors out of scope. And that’s fair. But if the CORE metric is to be used as a score for a product, then it should be clearer about the key features and topics that it is not covering, but that customers might want to ask about.

Garbage In, Garbage Out

I think the CORE formula has the right idea with the columns on cost and how well a product can shrink data. However, those columns are meaningless unless the data in them is accurate. I think that data needs to come from some sort of vendor-neutral benchmarking site or analyst. If you put in numbers from web site claims, customers are going to have high expectations. The fact is, results are going to vary by what kind of data the customer has. You can’t tell me that any solution is going to get the exact same shrinking results on, say, a volume full of VMware VMDK files, a volume full of medical images, a volume full of Microsoft Office files, and a volume holding an Oracle OLTP database.


Cost is even trickier. Dedupe on ZFS and on NetApp is free: it is a feature of the file system, and you don’t have to pay extra to get it. Of course, you have to be using NetApp storage to get NetApp dedupe, and that comes with a certain cost premium, but how do you compare that with solutions like Storwize and Ocarina that are separate standalone solutions with a specific price tag?

The first problem can be fixed if the community gets together and hosts 3, 5, or 10 different public data sets somewhere. Each vendor can download the data, run their wares, and report back the results. There’s no protecting against people who lie in this case, but vendors that do that will soon be caught out by customers. Then a customer could go to that neutral site, pick which sample data set best reflects their own data (OLTP database, medical images, home shares, consolidated virtual machines, whatever) and put that value in the CORE formula column for results. This is a case where the formula itself is not flawed: the formula is fine, but the numbers going in have to be representative of the data a customer actually has. If we work together, we can help put good data in.

The second problem vexes me. I don’t really know how to correctly address the cost problem. My suggestion? Take it out. If the CORE metric gives you a good value for the Effectiveness of different solutions, then customers can assess cost in ways that make sense to them, including vendor discounts, what storage they have, and so forth. No customer is going to ignore cost when making a purchase decision, so I think we can count on buyers to figure out how they want to factor costs into their decision-making process.

Summary

So there you have it.
I think the CORE formula is a great first-pass attempt at doing something valuable for the community. It is deeply flawed, and I’m not the only one who thinks so. But it can be improved, and I would love to see that happen. These are my ideas, and I see others have contributed good insights as well. I’m keen to see where this goes from here. My next post will address the tired old topic of in-band versus post-process.

Deduplication: From Point Solution to Data Center Strategy

Posted by Carter George On April - 19 - 2010

Deduplication has been a hot topic in storage for several years now. Most of the focus has been on dedupe appliances sold in the backup market, by companies like Data Domain and Diligent (now IBM ProtecTIER). There have been dozens of articles written explaining the basic concepts, and comparing the implementations by various vendors.

Dedupe is also becoming more prominent in primary storage, with NAS market leader NetApp including a basic dedupe feature on every node.  EMC followed with whole file dedupe on its Celerra family of products.  Other vendors are now introducing dedupe as a feature on block storage arrays, in the cloud, and on nearline and archival storage.

In effect, what we’re seeing is the transformation of deduplication (and some other related data reduction techniques) into a new storage fundamental. Rather than being a standalone solution that customers pay a premium for, dedupe is becoming something that will be a standard feature on most mid-market and enterprise storage products by 2012.

This blossoming of dedupe is happening faster than with other value-add storage features that have followed similar paths. Data reduction in general, and dedupe in particular, represents a technology whose time seems to have come. There are two drivers. One is exponential storage growth, which is both straining IT capital expenditures and outpacing the ability of cheap disk drives to keep up. The other is that economic pressure and tightened budgets have made the traditionally conservative enterprise storage buyer willing to look at new technology that can store data more efficiently and at less cost.

It’s also a technology that works because the data driving storage growth is unstructured data, data created by people. Database business applications are not driving storage growth; people are. It’s office documents, email, photos, and videos that are driving the information revolution, and it is in the nature of humans to create lots of copies, variants, and versions of information. In other words, humans (as opposed to database applications) are a lot more likely to create a bunch of duplicate, inefficiently stored data. Dedupe finds all of that and takes the unnecessary copies out of the picture: on the actual storage, all those duplicate copies point to one shared place. The benefits are compelling. It’s not just saving disk space; ultimately it’s also saving power, cooling, rack space, and all the other things you didn’t have to buy when you avoided buying another rack of disks.

That said, not all dedupe is created equal. It comes in multiple flavors – fixed block or variable, block aligned or sliding window, in-band and post-process and so forth.   There are those who argue fervently that one approach is the only true way to dedupe, as though it were some sort of medieval religious dispute. What matters is that the dedupe method chosen matches the use case for which it is being used.

Dedupe and Compression:  Friends or Foes?

Dedupe is not always the best approach to data reduction either. Some data sets – like collections of virtual machines and repetitive backups of the same volumes – lend themselves very well to deduplication.  Other data sets, such as corporate file shares and primary storage, respond better to compression.  The goal of the IT user should be to store and move data in the most efficient way.   Lost in the hype around dedupe is the fact that quite often compression – which is seeing a renaissance in research after years of relative inactivity – is better at shrinking data than dedupe is.     The two technologies are not mutually exclusive.    It is possible to apply both dedupe and compression to the same data, but finding the optimal balance is easier said than done.

For dedupe to work well, data is chunked up into small blocks, which are then compared to see if any are the same. Duplicate blocks can be discarded, saving space. The smaller the block used, the more likely it is to find dupes. Where dedupe gets the best data reduction with smaller chunks, compression gets better results with larger chunks. Compression works by looking at patterns and then making predictions. If you can predict the next thing, you can compress it. To find patterns and predict data better, compression likes to have more context, and that means that bigger chunks work better. For any given data set, there’s an optimal balance, but there is no one right answer that works best for every data set.
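The chunk-size tension described above can be seen in a toy sketch. It assumes fixed-size blocks fingerprinted with SHA-256 for dedupe, and uses zlib as a stand-in for a generic compressor applied to each chunk independently; real products use their own chunking and compression schemes, so treat this only as an illustration of the trade-off.

```python
import hashlib
import zlib


def deduped_size(data: bytes, block_size: int) -> int:
    """Bytes left after fixed-block dedupe: each distinct block is stored once."""
    seen = set()
    stored = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(block)
    return stored


def compressed_size(data: bytes, chunk_size: int) -> int:
    """Total bytes after compressing each chunk independently with zlib."""
    return sum(
        len(zlib.compress(data[offset:offset + chunk_size]))
        for offset in range(0, len(data), chunk_size)
    )


# Made-up sample data: a "volume" with one repeated 4 KB block, and some text.
block_a, block_b, block_c = b"a" * 4096, b"b" * 4096, b"c" * 4096
volume = block_a + block_b + block_a + block_c
text = b"the quick brown fox jumps over the lazy dog. " * 200

print("dedupe, 4 KB blocks:", deduped_size(volume, 4096))
print("dedupe, 8 KB blocks:", deduped_size(volume, 8192))
print("compress, 64-byte chunks:", compressed_size(text, 64))
print("compress, one big chunk:", compressed_size(text, len(text)))
```

On this sample, 4 KB blocks let dedupe spot the repeated block, while 8 KB blocks straddle it and find no duplicates at all. The compressor behaves the other way around: it does far better on one large chunk than on many small ones, because it has more context to work with.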

The Dedupe Transformation

At the Gartner Data Center Conference held in December 2009, the audiences of three different sessions were polled. In the Gartner report, “Data Deduplication will be even bigger in 2010*,” following the event, analyst Dave Russell commented on the poll results. “If the 56% of those with some plans for deduplication in 2010 are combined with the 14% that are using deduplication for only a portion of their backups, and if, as in years past, 2% to 4% of those with no current plans to deploy the technology do implement it, then it’s conceivable that 72% to 74% of the audience will adopt deduplication by year-end 2010.”

A similar poll by SearchStorage of storage buyers found very similar trends. Almost everyone either plans to deploy dedupe for backup or is evaluating it. Likewise, while only 17% have deployed it for primary storage, a staggering 60% are planning to either buy or evaluate dedupe for primary in the next year. This means that CIOs and IT Directors are expecting dedupe to have the kind of impact on IT operations that virtual machines have had over recent years.

All the major players in storage will need to decide on a dedupe strategy, bringing out standalone products in dedupe-centric markets like backup, and adding dedupe as a feature set to existing primary and nearline products in the other storage tiers. Some of the large vendors may look to startups to acquire the technology they need, either through OEM deals or acquisitions, while others will develop in house. In either case, within two years at most dedupe will have become a storage fundamental, and the landscape of vendor offerings will have changed. Today, the playing field consists of niche offerings by the big vendors and a set of innovative startups looking to break out. Within two years, some of those startups will have gotten design wins with major vendors, some will follow in Data Domain’s footsteps and get acquired, and some will be left holding the short end of the stick.

Point Solution or Coherent Strategy?

Finally, there is one key mystery left in the dedupe marketplace, which is headed towards a situation where every storage product will have deduplication built in. At current course and speed, all of those dedupe implementations will be inconsistent and incompatible with one another, even inside the product line of a single vendor. Will that continue to be the case as dedupe becomes a standard feature?

If you look at today’s two market leaders in dedupe, NetApp for primary and Data Domain for backup, you’ll see a painful scenario that we expect to see played out over and over again over the next few years. Take a volume on NetApp filled with 16TB of data. NetApp dedupe might shrink that data to 8TB, a great space savings. But when it comes time to back that data up, the NetApp rehydrates (expands) the 8TB back to the full original 16TB to send it to the backup target. Let’s say the backup target is a Data Domain. Now the network needs to carry that whole 16TB, and the NetApp storage controller has to use a lot of CPU to put the data all back together, possibly slowing down other applications trying to use the NetApp. You have to buy a Data Domain model big enough to handle that 16TB ingestion in your available backup window. When the 16TB of data gets to the Data Domain, it will be deduped again, using different algorithms, getting back down to 8TB or less.

In this process, a huge amount of CPU, network bandwidth, and time has been consumed to expand data and then shrink it again. For a market focused on getting to better storage efficiency, this is blatantly wasteful. This is the case today with almost any dedupe-for-primary solution backing up to a dedupe-for-backup target. What would be more useful to customers would be a consistent and compatible dedupe, allowing data that has already been deduped to be moved in its compressed format to other storage products that support a compatible implementation.
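The cost of rehydrating in the 16TB example above can be put into rough numbers. The sketch below compares the ingest rate a backup target would need with and without rehydration. The 8-hour backup window is an assumption chosen purely for illustration; only the 16TB and 8TB figures come from the example.

```python
def required_ingest_tb_per_hour(data_tb: float, window_hours: float) -> float:
    """Throughput the backup target must sustain to finish inside the window."""
    if window_hours <= 0:
        raise ValueError("backup window must be positive")
    return data_tb / window_hours


logical_tb = 16.0    # size of the data once rehydrated (from the example)
reduced_tb = 8.0     # size after dedupe on the primary (from the example)
window_hours = 8.0   # hypothetical backup window, not from the article

rehydrated_rate = required_ingest_tb_per_hour(logical_tb, window_hours)
reduced_rate = required_ingest_tb_per_hour(reduced_tb, window_hours)

print(f"rehydrated: {rehydrated_rate:.1f} TB/hour over the network")
print(f"reduced:    {reduced_rate:.1f} TB/hour over the network")
```

Under that assumed window, moving the data in its reduced form would halve both the network traffic and the ingest rate the backup target has to be sized for.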

A good deal of the I/O workload in a given shop is driven by a handful of common storage management workflows – backup, replication, migration, and tiering – and all of those workflows would be more efficient if they could be done using data that had been compressed and deduplicated. To truly deliver on the promise of storage efficiency, we’ll look to see some vendors deliver consistent and compatible dedupe and compression that works across products, supporting dedupe-aware versions of those key workflows transparently.

Our Thoughts on Performance and Recent Dedupe/Compression Comments

Posted by Carter George On April - 16 - 2010

In the recent InfoStor article on primary storage optimization (http://www.infostor.com/index/articles/display/2460996926/articles/infostor/storage-management/2010/april-2010/consider-compression.html), Ocarina was mentioned along with some other vendors who have offerings that provide either dedupe or compression for primary storage. Ocarina is characterized as being a post-process solution. This is a theme that we’ve seen in several product review pieces, and it’s worth clarifying. Ocarina does offer post-process optimization, but the product can also be configured to do inband optimization. The common wisdom seems to be that post-process gets the best data reduction, but is slower than inband and therefore can only be used for cold data. Ocarina is happy to be recognized for having the best data reduction, but we’re not post-process only, nor are we willing to concede that we are for cold data only.

User access to optimized data is always inband and real time. This whole inband versus post-process discussion only applies to the question of when you shrink the data.

Ocarina’s ECOsystem is a configurable multi-stage data reduction pipeline, and it can be set up to run post-process, inband, or both. Ocarina’s family of storage optimization appliances comes pre-configured to do post-process dedupe and compression, but where Ocarina’s ECOsystem has been embedded as software inside our storage partners’ products, it has in some cases been configured to run inband.

The ECOsystem pipeline has four elements, all optional:    object dedupe, block dedupe, regular compression, and content-aware compression. To run inband, an element has to be fast enough to keep up with the storage system’s I/O speed. Just as you want compression to be data invisible (it’s invisible when a user gets their file back bit-for-bit the way it was originally without ever knowing it was compressed), you want it to be performance invisible too. Adding dedupe and compression should not affect the perceived performance of a storage system. Some of Ocarina’s data reduction elements are fast enough to run inband, and can be configured that way. Some elements, especially advanced content-aware compressors, are slow enough that in most cases you would want to run them as background post-processes. With post-process, you also have the option of using policies to decide which data to compress, and when. With most inband solutions, you have to compress everything, all the time.

In the latest issue of Storage Magazine, Curtis Preston, editor in TechTarget’s Storage Media Group and an independent backup expert, in covering data reduction vendors said, “Ocarina takes a very different approach to data reduction than many other vendors. Where most vendors apply compression and deduplication without any knowledge of the data, Ocarina has hundreds of different compression and deduplication algorithms that it uses depending on the specific type of data.”


Because we have such a rich toolbox, we can do different things with it.  If you used Ocarina to build a backup solution, you might use our block dedupe and fast regular compression only. If you used us for a deep archive, you might use object dedupe and advanced content-aware compression as post-process only. For primary NAS, you might do fast regular compression inband, and dedupe as a post process, and so forth.


To make this point a bit clearer, we’ll publish some performance results in the next week or two showing how we attack different data sets inband, post-process, and with both. We’ll make the data sets public, so if other vendors want to try their wares, they can download the data and see how they do, apples to apples.

To directly address the issue of whether Ocarina can be fast enough to be used for “true” primary storage: if you run our fast regular compressor only, inband, we have been benchmarked at over 3,000 MB/sec. Most of our customers will elect to do more than just simple regular compression, so that’s not a number we’d claim for the real world. In the real world, people want to get better data reduction than you get with just regular compression.


At the end of the day, going fast is important, but only if you actually do something useful. If a solution goes really fast, but can’t actually shrink your data, then we don’t think that’s very interesting.    The right solution will get the best possible data reduction while still meeting performance requirements. Different use cases have different performance requirements, and what you want is something that can be configured to hit the sweet spot for performance while still getting smokin’ dedupe and compression results.

Ocarina Adds Video Optimization to Its List of 900 File Types

Posted by Matthew Harvey On April - 7 - 2010

Here is our official announcement on our ability to reduce more than 900 file types.

Ocarina Adds Video Optimization to Its List of 900 File Types Supported by Content-Aware Storage Solution

Company’s Native File Optimization delivers best possible file compression while preserving video quality, retaining native file format

SAN JOSE, Calif., Apr 07, 2010 — Ocarina Networks, a leading provider of content-aware storage optimization solutions, today announced that it has expanded the number of file types it can reduce to more than 900, to include support for popular Flash video for Internet distribution, and MPEG2 for broadcast workflows. Michael Davis, senior director of marketing at Ocarina, will be speaking about the company’s video capabilities at the 2010 NAB show in partners’ BlueArc and Isilon booths in the Las Vegas Convention Center April 12-15.

Ocarina has developed the industry’s most-advanced platform for data optimization, providing reliability, scalability, and a complete array of advanced data-reduction algorithms in a fully integrated solution. Ocarina’s ECOsystem(R) includes over 100 algorithms that have proven effective for more than 900 file types, including ones that had been considered previously uncompressible. Ocarina’s fully lossless optimization technology has already gained widespread acceptance as an online archival solution in film production studios including Starz, Rainmaker Entertainment, and ZOIC. Now Ocarina has enhanced its Native Format Optimization (NFO) workflow to support Internet video workflows. The NFO workflow introduces non-visible compression to image and video files, allowing customers to capture data-reduction benefits in storage, bandwidth, and web-site responsiveness. Ocarina’s addition of Adobe Flash video formats to the NFO workflow ensures the best possible video compression of various file types (FLV, SWF, F4V) and data formats (Spark, VP6, h.264) while preserving original image quality. Ocarina has also added lossless compression support for MPEG2 video in broadcast video archives.

Ocarina’s NFO workflow delivers the best possible video compression while preserving image quality as well as the native encoding format. By reducing file size across large video repositories, media companies are able to keep files online longer, reduce storage capital and operating costs, reduce the cost of distribution including CDN and ISP bandwidth fees, reduce the page-load times for video-intensive web sites, and improve audience penetration into marginal broadband markets.

“Ocarina NFO is the only enterprise-class data reduction product that is effective on video content. This is content that doesn’t present redundancies for a dedupe algorithm, and generic compression algorithms such as LZW really add no benefit,” said Davis. “What’s different about our NFO compression is that we’re really getting into the video encoding to align it with what the human eye can perceive. Our post processing algorithms will seek out opportunities for spatial optimization, inter-frame optimization, better motion compensation, improved bit-rate control, quality normalization and hot-spot detection. Our video-aware optimization allows us to deliver up to 40% or more savings where traditional de-duplication technologies deliver no benefit.”

Ocarina’s new video NFO workflow appeals to media companies concerned with bandwidth costs including social networks, user-generated content sites, video advertisers, news outlets, and other ad-supported video sites.

Ocarina’s ECOsystem is deployed as an appliance co-processor that works with existing storage systems, and uses a policy-based interface to align data-reduction with application workflows. ECOsystem enhances tiered-storage architectures by multiplying available storage in secondary storage tiers.

With more than 1,500 exhibiting companies and 800,000 square feet of exhibit space, The NAB Show is the world’s largest electronic media show covering filmed entertainment and the development, management and delivery of content across all mediums. More than 85,000 audio, video and film content professionals are expected to attend the 500 conference and training sessions available. Additional details about the NAB Show are available at www.nabshow.com.

About Ocarina

Ocarina is a leader in online storage optimization solutions. Organizations of all sizes use Ocarina’s content-aware optimization technology to reduce their storage footprint and achieve a ten-fold capacity increase on their current storage systems. Based in San Jose, Calif., Ocarina is privately-held and financed by leading investors JAFCO Ventures, Kleiner Perkins Caufield & Byers and Highland Capital Partners. For more information, visit www.ocarinanetworks.com

Ocarina Weighs In: Dedupe Ratios Do Matter

Posted by Carter George On April - 6 - 2010

Every dedupe ratio can be converted to a percentage of data reduction and vice versa.    The dedupe ratio measures against what’s left after you’ve deduped. The percentage measures against the size of the data before you dedupe.   Both are valid measures.      It’s also true that some solutions do a better job shrinking your data than others.   Dedupe solutions that do a better job should be ranked higher when you are comparing solutions.     That said, comparing the claims made on vendor websites is not a very good way to find out who can actually shrink your data better.
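The conversion between the two measures is simple arithmetic, sketched here with illustrative function names. An N:1 ratio means 1/N of the original data is left, so the percentage reduction is 100 × (N − 1) / N.

```python
def ratio_to_percent(ratio: float) -> float:
    """Convert an N:1 dedupe ratio to the percent of original data removed."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1 (1:1 means no reduction)")
    return 100.0 * (ratio - 1.0) / ratio


def percent_to_ratio(percent: float) -> float:
    """Convert a percent reduction to the equivalent N:1 ratio."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 100.0 / (100.0 - percent)


for ratio in (2, 5, 10, 20):
    print(f"{ratio}:1 -> {ratio_to_percent(ratio):.0f}% reduction")
```

Note how the ratio exaggerates differences at the high end: doubling the ratio from 10:1 to 20:1 only moves the reduction from 90% to 95% of the original data.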
Why?  Vendors lie.

“Lies, damned lies, and statistics” is a phrase describing the persuasive power of numbers, particularly the use of statistics to bolster weak arguments, and the tendency of people to disparage statistics that do not support their positions.

The term was popularized in the United States by Mark Twain (among others), who attributed it to the 19th Century British Prime Minister Benjamin Disraeli (1804-1881): “There are three kinds of lies: lies, damned lies, and statistics.”

I tend to agree with Dipesh’s original point that dedupe ratios are a little more misleading than percentages in measuring the effectiveness of data reduction, but either way what matters is “which vendor can really shrink my data better?” I agree with Howard and Curtis, who point out that it does matter when one product can do a better job of actual data reduction. And the more data you have, and just as importantly, the more you want to keep, the more it matters.

Lost somewhat in this discussion is whether the dedupe ratios claimed by vendors are actually true. The claims are not actually lies, they’re worse – they are statistics based on a set of assumptions favorable to each vendor.  There’s no industry standard for reporting these numbers, nor is there a standard public data set that everyone can run their dedupe and compression engines against to get even-steven numbers.     These ratio claims, whether they are 10:1 or 20:1 or whatever,  are based on a set of assumptions that you have to go trolling in the fine print to find.    For example, on the first full backup you run through almost any dedupe engine, you are not going to see anything remotely like 10:1 or 20:1.  You’ll be lucky if you get 2:1 – most vendors will deliver less than that.   If you are talking about dedupe for primary storage (NAS, a block storage array, a nearline archive) most vendors will never achieve 10:1, let alone higher ratios.    Before we argue about whether 20:1 really is twice as good as 10:1, maybe we should figure out if any vendor can even get you 5:1 in the real world.

The high ratios you see in all these claims are based on the assumption that you are backing up the same data over and over. A vendor makes an assumption about a change rate in the data (e.g., 10% of the data gets changed each day), and they assume you back up every day. They’ll be able to make a higher claim if they assume that you do a full backup every day, because then there will be more dupes to throw away. These claims are not based on how much they can shrink your first backup. They are based on how much they’ll have saved after you have done many backups. Look in the fine print to see how many backups you have to run to get the claimed results, and see whether those were all fulls or a mix of fulls and incrementals. If you do a full backup every day for 100 days, most dedupe solutions will get a really good result. However, if your situation is not just like the assumptions they made to get to their marketing claim, your results are going to be different from that ratio on their website. The results you actually get when you buy the product are what’s important, aren’t they? And those real results may have very little to do with the claimed results in a marketing brochure.
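To see how much those assumptions drive the headline number, here is a deliberately simplified model. It assumes a volume of constant size, that each day’s changed data is entirely new, that an incremental sends only the changed data, and that the engine gets the same first-pass reduction on all unique data. None of this describes any particular vendor’s product; the 10% change rate and 2:1 first-pass figure echo the numbers used in this post.

```python
def cumulative_ratio(days: int, change_rate: float,
                     first_pass_ratio: float, fulls_every_day: bool) -> float:
    """Logical data backed up divided by data stored, for a volume of size 1."""
    if days < 1:
        raise ValueError("need at least one backup")
    # Unique data seen so far: the first full copy plus each day's changes.
    unique = 1.0 + (days - 1) * change_rate
    stored = unique / first_pass_ratio
    # Daily fulls resend the whole volume; incrementals resend only changes.
    logical = float(days) if fulls_every_day else unique
    return logical / stored


first_backup = cumulative_ratio(1, 0.10, 2.0, fulls_every_day=True)
hundred_fulls = cumulative_ratio(100, 0.10, 2.0, fulls_every_day=True)
incrementals = cumulative_ratio(100, 0.10, 2.0, fulls_every_day=False)

print(f"first backup only:        {first_backup:.1f}:1")
print(f"100 daily fulls:          {hundred_fulls:.1f}:1")
print(f"1 full + 99 incrementals: {incrementals:.1f}:1")
```

In this model the same engine looks like a 2:1 product or a roughly 18:1 product depending only on the backup schedule assumed, which is why the fine print matters.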

Your mileage will vary.    Not “may vary”.  Will vary.   Some data sets dedupe well (virtual machine images, databases, repetitive full backups).  Some do not (Office documents, volumes full of rich media like photos, video, and graphics, Zip files, etc).      Full backups dedupe better than incrementals.   If you have encrypted your data, then dedupe won’t get much result at all.

Compression may also play a role.  Often, the data that does not dedupe well can be compressed well. Almost every dedupe vendor also does some kind of compression.  To get the best data reduction result (that’s what we’re talking about here, right?) usually requires some combination of dedupe and compression.     Some dedupe solutions just have generic compressors, like LZ or zip, while others have more sophisticated compressors that can deal with complex data types.

Does that matter? Only if you have a lot of the kind of data where compression is key to getting good results. If you have 1,000 identical VMDK clones, then dedupe will get smokin’ results without any compression. If you have 1,000 Terabytes of unique photos and compressed Microsoft Office docs, dedupe will get next to 0% data reduction. For most customers, the answer is going to be somewhere in the middle.

Right now, I don’t see any shortcut to picking the most likely 2 to 4 vendors and having them actually run their wares against a good test data set that truly represents the kind of data you have. Take one full backup, or one typical volume’s worth of your own data, have them sign an NDA, and have them run it through their product and see how well they can shrink it. Rank the vendors’ ability to shrink your data based on the results they get on your own real data, not on the inflated claim numbers you see on the front page of their website.

Carter K. George – VP, Products