Online Storage Optimization

Exploring Next Generation Storage Solutions

Passing along what we learned from OEMs

Posted by Mike Davis On July - 18 - 2010

It’s an exciting time at Ocarina, because we’re right in the middle of a wave of OEM efforts to bring data-reduction to market as a standard feature across a wide variety of storage implementations. ECOsystem for OEMs is an Ocarina offering of software libraries and APIs that allows storage OEMs and ISVs to embed data-reduction into their products. At Ocarina, we firmly believe that within a couple years, not only will we see dedupe as a new “standard feature” in all major block and file storage products, but it will be increasingly found in host applications as well.

Here are a series of requirements that we’ve consistently heard from OEMs in our discussions, and here’s how Ocarina addresses those in the ECOsystem for OEMs.

1) The solution shrinks data well.
OK, so this is obvious. Dedupe is supposed to shrink data, and that improves system utilization and reduces costs. Doing this well is about algorithms, and Ocarina has been on top in the algorithm game since we started shipping our primary storage optimization products almost 2 years ago. We’ve shown in some deployments that we can deliver 80% savings when traditional block dedupe delivers no more than 30%. We believe in fact that we’re the only storage technology vendor to actually employ algorithm PhDs whose sole job is to invent better algorithms…which is something that has proven really useful in specialized markets where a few unique content types dominate the terabytes.

We are the only provider today that deploys dedupe and compression concurrently. Our dedupe algorithm is a content-aware, variable-block, sliding-window approach. ‘Content-aware’ is an overused term for sure, but here it reflects for example that we’ll recognize monolithic data structures in a stream like a JPEG blob, and we know that slicing that JPEG into 8KB chunks for dedupe delivers absolutely no benefit, and thus is a complete waste of time, CPU, and memory. We’ll treat that JPEG (and other data types like it) as a contiguous chunk, which makes our dedupe namespace extremely fast and efficient.
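To make the idea concrete, here is a minimal sketch of variable-block, sliding-window chunking with a content-aware shortcut. It is not Ocarina's algorithm: the rolling sum, the window and mask values, and the rule of keeping anything that starts with the JPEG start-of-image marker as a single chunk are all assumptions chosen for illustration.

```python
JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG start-of-image marker


def chunk_boundaries(data, window=16, mask=0x3F, min_size=64, max_size=1024):
    """Return a list of (start, end) chunk boundaries covering data.

    Hypothetical parameters: a boundary is cut where the sum of the last
    `window` bytes has all the bits in `mask` set, so cut points follow
    the content rather than fixed offsets.
    """
    # Content-aware shortcut (assumed rule): keep a JPEG blob as one chunk,
    # because slicing already-compressed data rarely exposes duplicates.
    if data.startswith(JPEG_MAGIC):
        return [(0, len(data))]

    bounds = []
    start = 0
    rolling = 0
    for i in range(len(data)):
        # Maintain the sum of the most recent `window` bytes.
        rolling += data[i]
        if i >= window:
            rolling -= data[i - window]
        size = i + 1 - start
        at_cut_point = (rolling & mask) == mask
        if (size >= min_size and at_cut_point) or size >= max_size:
            bounds.append((start, i + 1))
            start = i + 1
    if start < len(data):
        bounds.append((start, len(data)))  # trailing partial chunk
    return bounds
```

Because cut points depend on the bytes inside the window rather than on absolute offsets, inserting data near the start of a stream tends to shift only nearby boundaries, which is the usual argument for variable-block over fixed-block dedupe.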

2) The solution minimizes time to market.
Most OEM vendors are under considerable time pressure to bring dedupe to market, either as a competitive response, or because their customers are demanding it. But the OEM’s dev and test engineering resources are always limited. With that in mind we made a point of developing a full featured library. That means not only do we slice the data and do hash lookups (as with the Permabit product), but we also do the dedupe, compression, on-disk data management, metrics and reporting, throttling mechanisms, optimized data movement, and more. By delivering a full-featured suite of capabilities, the OEM can rapidly bring embedded data reduction to market without for example having to redesign their file system or block map systems to complete the dedupe workflow.

The other attribute that accelerates time to market is simplicity. Despite being full featured, the ECOsystem for OEMs is accessed via a lightweight object-like API that OEM developers have told us is extremely simple to work with.

3) The solution has the flexibility to support specific use cases.
The requirements and functional expectations for implementing data reduction differ from device to device, and any embedded solution needs to have the adaptability to serve different applications. For example content-aware algorithms may not be a meaningful solution in a block array where data structures are completely opaque. Or CPU and memory constraints on a given device may require the use of a lighter weight dedupe workflow, and that shouldn’t force major architectural rework of the solution.

The ECOsystem for OEMs has been designed to support 6 embedded use cases: Servers, block arrays, NAS, object stores, cloud storage, and backup targets. Some of the differences between these tiers manifest themselves as implementation best practices, but there are also clear functional decision-points that allow an OEM to implement the right solution for the job. Importantly though, all of these tiers are compatible in key respects, allowing cross-platform manageability, and giving rise to end-to-end features such as optimized data movement.

4) The solution has high performance, while working within resource constraints.
Performance overhead is an often discussed problem associated with dedupe solutions. We’ve learned to solve these problems through 2 years of empirical experience, and believe we have the fastest, lowest-overhead dedupe solution. Moreover, ECOsystem for OEMs obeys hard constraints of the host platform in terms of CPU and memory usage.

There are a couple main points where dedupe can impose performance penalties:
A) During write, the chunking and lookup process takes time. Like other solutions, Ocarina does this in memory to reduce that penalty. You do have to be careful to understand, for a given chunk size, how much unique data each 1GB of dictionary can address, and, given constrained memory, how far the solution can go before forcing on-disk (=slowww) lookups. ECOsystem for OEMs utilizes the industry’s most memory efficient lookup design to make the most of existing resources.
B) During read, the reads from disk for any de-duplicated volume are more random than typical disk IO. Ocarina is also able to mitigate this impact through content-aware read-ahead caching that anticipates the next chunks that will be read from disk.
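The dictionary sizing question in point A is simple arithmetic, sketched below. The 32-byte index entry is an assumed figure for illustration only, not a number from Ocarina or any other vendor.

```python
def addressable_unique_bytes(dictionary_bytes, chunk_bytes, entry_bytes):
    """Unique data an in-memory dictionary can index before spilling to disk.

    entry_bytes is the assumed size of one index entry (hash plus location).
    """
    entries = dictionary_bytes // entry_bytes
    return entries * chunk_bytes


GIB = 2 ** 30
KIB = 2 ** 10

# 1 GiB of dictionary, 8 KiB chunks, 32-byte entries (assumed)
small_chunks = addressable_unique_bytes(1 * GIB, 8 * KIB, 32)
# Same dictionary with 64 KiB chunks
large_chunks = addressable_unique_bytes(1 * GIB, 64 * KIB, 32)

print(small_chunks // GIB, "GiB addressable with 8 KiB chunks")
print(large_chunks // GIB, "GiB addressable with 64 KiB chunks")
```

Under these assumptions, larger chunks stretch the same memory across more data, at the price of finding fewer small duplicates, which is the trade-off to check against the memory a given device can spare.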

5) The solution supports next-generation features that customers want and competitors don’t have.
Ocarina has spent a lot of time in the market talking to end-customers about what they want in a dedupe solution. In addition to things like “shrink well”, a couple things keep coming out. One is to keep data in shrunken form as it moves around in workflow operations such as replication, backup, tiering, etc. We call this end-to-end optimization (or E2EO) and it expands the value of dedupe by improving backup processes, reducing LAN bandwidth, and reducing the CPU overhead that occurs when data is repeatedly rehydrated and re-deduped for no reason. For those OEMs who carry a broad catalog of server, storage, and backup solutions, there are huge benefits in being able to deliver this end-to-end value to customers who have adopted that OEM across their entire IT infrastructure.

The other key feature is the ability to support structured (eg database) and semi-structured (eg VM) applications. The files in these applications are almost always >95% static and <5% active. But because they are active, traditional dedupe solutions can create a tremendous IO overhead as dedupe operations battle against a steady stream of changes. Instead of getting in the way, Ocarina has invented a way to dedupe the inactive portions of these files, while imposing no performance overhead on active IO. Like our dedupe chunking process, this is one of several content-aware features that Ocarina delivers in the ECOsystem for OEMs.

These advanced dedupe features — which bring data reduction benefits to new applications and workflows — allow storage OEMs and ISVs to deliver more value to customers, to do it faster, and differentiate their product over competitive offerings that have first-generation dedupe capabilities.

Dedupe Or Compression? Both! Optimization or Performance? Both!

Posted by Carter George On June - 8 - 2010

Interest in and discussion of data deduplication and primary storage compression seems to be at an all-time high right now. In the last few weeks, we have seen new entrants into the market, including Permabit, and a broad overview from Wikibon, focused on storage optimization. At Ocarina, we see industry discussions such as these as proof positive our business is on the right track. As the market innovators and pioneers in this space, we believe in end to end storage optimization, aimed to enable customers to best use their existing equipment and protect their core data.

Some of the discussion seen in articles around the Web (including Storwize) has focused on the speed of compression and deduplication – with concerns around how this impacts primary storage performance. Completely valid issues, as one should not have to substitute performance in exchange for features. From our company launch, we have focused on achieving the best of both worlds – with primary data storage optimization, and high performance, both with our own dedicated devices and those from our OEM partners.

For customers who want choice, Ocarina has you covered. Ocarina has both fast in-band deduplication and advanced compression options. You can either run fast in-band deduplication, with sub-millisecond latency, or you can choose deep content-aware compression, which takes longer, of course, but also gets results that simple deduplication can’t hit. Or… you can do both! Stop the presses!

Ocarina’s deduplication is fast – you can get deduplication results immediately on every file that passes through our systems, and then come back to do a post-processing run that gets advanced results later, at an off-peak time, should you choose. Also, the post-processing engine is driven by policies which you set, letting you compress only files that meet criteria you choose – for example, size, age or type.

Thanks to Ocarina winning a number of high profile deals at large customers where deduplication alone is not enough, Ocarina has become associated with heavy compression – but we do dedupe as well, and we’re quite good at it! If you look at results we recently got on a corporate data set, we were able to shrink that data set by 92% overall, with 60% coming from deduplication and 32% more coming from files that were compressed in the post-process.

At Ocarina, we believe deduplication will soon become an embedded system feature, and a commodity. It is possible to do in-band deduplication, with very little latency, and minimal CPU resource demands. Dedupe will become a storage fundamental, and the pricepoint for customers to gain dedupe will trend towards zero (where NetApp is today with their A-SIS offering).

Companies like Permabit will have to win quite a few OEM deals at these kinds of end user prices, minus an OEM discount, to be profitable – but advanced features, including dedupe-aware data movement and advanced compression, will be value-add features customers will pay more for. Unlike dedupe, those features won’t provide big value for every customer, but they will apply as important benefits providing value to a significant percentage, and unlike dedupe, advanced content-aware compression is not going to be a commodity given away for free in every system.

Ocarina is in a good position with our technology, and both customers and OEMs should evaluate not just the technology of a dedupe provider, but their ability to financially survive as well. Once a company has reduced your data, and locked it away in their format, the last thing you want is for that company to go out of business, and compress your chances of getting it back. A company that only has dedupe on the table is going to be priced out of the market by 2012, even if it is successful today selling dedupe.

When something sells to end users for free, you can’t make up the profit margin by selling in volume. In Ocarina’s case, we can be extremely aggressive on price for dedupe, because we bring more to the table and have something else to sell, and every OEM deal we win creates a platform ready for future upgrades. As Wikibon wrote yesterday, “Ocarina provides the highest levels of compression by using the optimum compression techniques.” We are the best in the world at this stuff and we will continue to stay ahead of the competition. We also have a well-rounded business that won’t see us get deduped. So if you are looking for an advanced solution that does much more than dedupe, Ocarina is the answer.

Protecting Compressed Data and Reducing Costs

Posted by Carter George On May - 25 - 2010

As @storagebod (Martin Glassborow) noted today in a blog post, the issue of protecting deduplicated data is an important one for any business.

Deduplication and other data reduction technologies offer the opportunity for increased data protection at a reduced cost.

When you hit a threshold of 50% or greater overall data reduction, it lowers the cost of performing a full mirror of data. For example, if you have 100 terabytes of data, mirroring today would require 200 terabytes of disk space. But if the data were reduced by 75%, you could store the same data in 25 terabytes, and full mirroring would require only 50 terabytes of disk - a significant savings. In other words, it is possible to fully protect your storage, including full mirroring, with less disk than it takes to store the unprotected data today.

Any amount of space savings lowers the cost of mirroring, but once savings reach 50% (a dedupe ratio of 2:1 or better), the mirrored copy fits in the disk space the unprotected data occupies today, so the net storage cost of full mirroring is zero.
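The mirroring arithmetic can be checked with a few lines of code. The function name and parameters are hypothetical; the figures are the ones used in the example above.

```python
def mirrored_footprint_tb(raw_tb, reduction, copies=2):
    """Disk needed to keep `copies` full copies of data after reduction.

    reduction is the fraction of space saved, e.g. 0.75 for 75%.
    """
    return raw_tb * (1.0 - reduction) * copies


print(mirrored_footprint_tb(100, 0.0))   # no reduction: 200.0 TB
print(mirrored_footprint_tb(100, 0.75))  # 75% reduction: 50.0 TB
print(mirrored_footprint_tb(100, 0.5))   # break-even: 100.0 TB
```

At exactly 50% savings the mirrored footprint equals the original 100 terabytes, which is where the mirror stops costing extra disk.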

There are a number of options that your deduplication provider should be pursuing to complement your data protection strategy. For example, a vendor should allow you to do two key things with your dedupe configuration:

  1. Allow a minimum level of duplicate blocks to accumulate prior to starting deduplication. For example, you could allow data to be written twice, and then deduplicate all subsequent occurrences. So long as the dedupe solution is aware of both original occurrences, you have a form of mirroring without needing to do full mirroring at the disk level. Call this duplicate mirroring.
  2. Set a threshold for a maximum number of duplicates. To set a maximum level of exposure on the loss of physical disk, you should be able to ask the solution to start over with a new stored copy once it has found ‘n’ instances of the block. You could set that number at 8, 32, 128 or whatever frequency makes business sense to you, so that the potential loss of the sector on which the block is stored would only affect a limited number of files.
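A rough sketch of how those two knobs might behave together is below. The class and its parameters (`min_copies`, `max_refs`) are hypothetical names invented for illustration, and the policy of pointing each new duplicate at the least-referenced copy is an assumption, not a description of any vendor's implementation.

```python
import hashlib


class ThresholdDedupeStore:
    """Toy block store with a minimum copy count and a per-copy reference cap."""

    def __init__(self, min_copies=2, max_refs=8):
        self.min_copies = min_copies  # physical copies kept before deduping
        self.max_refs = max_refs      # most references allowed on one copy
        self.copies = {}              # digest -> reference counts, one per copy

    def write(self, block):
        """Return True if a physical copy was written, False if deduplicated."""
        digest = hashlib.sha256(block).hexdigest()
        refs = self.copies.setdefault(digest, [])
        if len(refs) < self.min_copies:
            refs.append(1)            # duplicate mirroring: keep this copy too
            return True
        # Point the new duplicate at the least-referenced existing copy.
        idx = min(range(len(refs)), key=refs.__getitem__)
        if refs[idx] < self.max_refs:
            refs[idx] += 1
            return False
        refs.append(1)                # every copy is at the cap: start over
        return True

    def physical_copies(self, block):
        digest = hashlib.sha256(block).hexdigest()
        return len(self.copies.get(digest, []))


store = ThresholdDedupeStore(min_copies=2, max_refs=3)
results = [store.write(b"same block") for _ in range(7)]
print(results, store.physical_copies(b"same block"))
```

With these settings the first two writes land on disk, the next four are deduplicated, and the seventh triggers a third physical copy, so no single lost sector can affect more than three references.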

These two examples are not necessarily a complete answer in and of themselves, but they do provide guidance for tools you can work with as part of an overall data protection strategy. As deduplication becomes a storage fundamental, and is in place across multiple tiers of storage and on multiple products in your data center, understanding the impact of dedupe on your data protection strategy will be key and your dedupe vendor should be providing you the right tools to manage that.

In his comment in the on-going discussion of the CORE formula (Dedupe Rates Matter…Just Not as Much as You Think), Steve Kenniston had the following to say about the relationship between shrinking data on primary storage and backups:

Example, if I use Ocarina deduplication, but have already purchased Data Domain, don’t I need to re-hydrate the Ocarina deduplicated, primary storage data before I use Data Domain?  They say you do.  That means I don’t really save on my primary storage if I need the space to re-hydrate before I back it up and that also means processing time on the array.  Storwize, with random access compression doesn’t require decompression.


There are several interesting issues brought up here. They only relate to the CORE formula in that the formula does not account for them.

Today, if you have the most common dedupe for primary storage (NetApp dedupe) and the most common dedupe for backup (EMC’s Data Domain product), it works like this.


You start with a volume of, say, 16TB. NetApp Dedupe will shrink that to maybe 8TB. Then you go to backup. The backup server sends an NDMP request to the NetApp asking for a data stream to be sent to the backup target, the Data Domain. NetApp then rehydrates (expands) the 8TB back to 16TB and sends that stream to the Data Domain. The Data Domain will then dedupe that data back down to probably 4TB (since Data Domain dedupe is more sophisticated).

This is wasteful and has several negative consequences. It uses a bunch of CPU and I/O on the NetApp to rehydrate the data, which means performance might be slower for users while this is going on. You have to use the full network bandwidth to move the whole 16TB to the Data Domain. And you have to buy a Data Domain model big enough to handle 16TB of backup data instead of 8.


One thing that does not happen, which Steve seems to think is the case, is that you need 16TB of disk space on the NetApp to put all that expanded data in before it goes off to the backup.

The situation is ugly, but it is not that ugly.

Now, how would this take place if you used Ocarina dedupe and compression instead of NetApp? Let’s look at the following examples, and since NetApp has their own dedupe, let’s use a NAS filer that has nice integration with Ocarina, the BlueArc:

Scenario 1:   Compression-only, BlueArc to Data Domain
Scenario 2:   Dedupe and Compression, BlueArc to Data Domain, Full Backup
Scenario 3:   Dedupe and Compression, BlueArc to Data Domain, Incrementals

To continue in the vein of the NetApp example, let’s say you started with 16TB. Ocarina will have shrunk that to maybe 4TB in the compression-only case, and maybe 2TB in the compression and dedupe case – because we shrink better than anyone. In the first case, Ocarina replaces each file with a compressed version of the file in the same volume. When you go to back up, you back up through a mount point that exports the volume without going through the decompression layer.

So this works just like it would with Storwize – the compressed versions of files go to the Data Domain. When you back up day after day, you will create duplicates, and the Data Domain will find and eliminate those.

It should be noted, though, that Data Domain results will be slightly worse with either Storwize or Ocarina compression – because compressing data makes it harder to find duplicates.

In the second scenario, we have not only compressed the data, but deduped it too. This makes backups more complicated, because the pieces and parts of a file may be spread around in many other files.  Doing a full volume backup is pretty straightforward though.  You first take a snapshot of the volume, and then back it up.  It is important to take a snapshot, because you need to make sure all the pieces of files are consistent at a point in time.   You back up the 2TB to the Data Domain.   Now, does this mean you don’t need a Data Domain, because dedupe was already done at the source?

No, not at all. The first time you back up a volume, Data Domain won’t shrink it much, if at all, because there shouldn’t be any duplicates in the data set. Ocarina dedupe is at least as good as Data Domain’s. However, backup is not something you do once. It’s something you do every day. So when you back up the next day, and the day after that, and so forth, the Data Domain will find plenty of duplicates, as you back up the same files over and over. Over the course of a month or so, the Data Domain will be getting its 20:1 dedupe ratio even though the source volume was perfectly deduped already!

The third scenario is the most complicated. Doing incremental backups from a NAS usually means backup software and NDMP.    The backup server, called a DMA, figures out which files have changed since the last backup, and then sends a request to the NDMP Data Server.      Normally, NDMP is a service provided by the NAS head, but when you have used Ocarina to dedupe a volume, Ocarina will provide the NDMP Data Server.  The NDMP data request will come to the Ocarina NDMP service and ask for the 1,237 files that have changed since yesterday.       Now, you could just rehydrate those files and send them to the backup target.  However, Ocarina has a dedupe-aware NDMP.  This will figure out which chunks are needed to rehydrate the 1,237 files requested by the backup DMA, and will create – on the fly, using no disk space – a self-contained NDMP data stream that is deduped within itself.

You might see some partial rehydration, because the backup stream needs to have every block in it necessary to recover the files being backed up. But there will be no duplicate blocks in the data stream that goes to the backup target.     What’s more, all those blocks or chunks will remain compressed.  So what shows up at the Data Domain is a file-level incremental backup that is both deduped and compressed.  This allows file level restores by the backup software DMA (like NetBackup or Commvault).
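The incremental case can be pictured as building a self-contained stream in which each needed chunk appears exactly once. The sketch below uses plain dictionaries and made-up chunk names; it illustrates the idea only and does not reflect the NDMP wire format or Ocarina's actual data structures.

```python
def build_incremental_stream(changed_files, chunk_store):
    """Build a self-contained, internally deduped backup stream.

    changed_files: dict of file name -> ordered list of chunk hashes.
    chunk_store:   dict of chunk hash -> stored (still compressed) bytes.
    Returns (recipes, chunks): recipes keep each file's chunk order, and
    chunks holds every chunk the files need, once each.
    """
    recipes = {}
    chunks = {}
    for name, hashes in changed_files.items():
        recipes[name] = list(hashes)
        for h in hashes:
            if h not in chunks:
                chunks[h] = chunk_store[h]  # copied as-is, never rehydrated
    return recipes, chunks


# Hypothetical volume: four stored chunks, two files changed since yesterday.
store = {"a": b"A", "b": b"B", "c": b"C", "d": b"D"}
changed = {"f1": ["a", "b"], "f2": ["b", "c", "a"]}
recipes, chunks = build_incremental_stream(changed, store)

# A file-level restore only needs the stream itself.
restored_f2 = b"".join(chunks[h] for h in recipes["f2"])
print(sorted(chunks), restored_f2)
```

Chunk "d" stays out of the stream because no changed file references it, while "a" and "b" travel once even though both files use them.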

OK, now let’s take one more example, because this is the way you’d do data reduction as an enterprise strategy, rather than as a point solution for one filer. We call it end-to-end dedupe.

Scenario 4:   Dedupe and Compression, BlueArc to BlueArc

In this case, we’re going to backup from one Ocarina-enabled BlueArc to another. The backup software will see the second BlueArc as an NFS backup target, just as it does a Data Domain.     In either the full or the incremental case, the NDMP service will call the Ocarina NDMP Data Server on the Primary BlueArc.  But when the backup starts, Ocarina will query the target on a known port to ask if it is Ocarina-aware.  If the answer is yes, then instead of sending the data, Ocarina’s dedupe-aware NDMP will send just the hashes.

The target-side BlueArc (acting in place of the Data Domain) will examine those hashes and determine if it has any of that data already. It uses a negative acknowledgement protocol to tell the source BlueArc which chunks it needs to execute the backup.  On the first backup, this will be all of the chunks or objects.  But on subsequent backups, the dedupe in both places will be synched up, and only net new data is moved.
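The hash exchange described here can be sketched in a few lines. The function names and the use of in-memory dictionaries for both sides are assumptions made for illustration; a real implementation would run over the network between the two filers.

```python
def missing_hashes(offered, target_index):
    """Target side: reply with the offered hashes it does not hold yet.

    This plays the role of the negative acknowledgement: the target names
    only what it lacks instead of confirming everything it has.
    """
    seen = set()
    missing = []
    for h in offered:
        if h not in target_index and h not in seen:
            seen.add(h)
            missing.append(h)
    return missing


def backup(source_chunks, target_index):
    """Source side: offer hashes, then send only the requested chunks.

    Returns the number of chunks actually moved.
    """
    needed = missing_hashes(list(source_chunks), target_index)
    for h in needed:
        target_index[h] = source_chunks[h]  # chunk moves in compressed form
    return len(needed)


source = {"h1": b"x", "h2": b"y"}
target = {}
first = backup(source, target)    # first backup: everything moves
source["h3"] = b"z"               # one net new chunk appears on the source
second = backup(source, target)   # only the new chunk moves
third = backup(source, target)    # nothing changed, nothing moves
print(first, second, third)
```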

Now, in this case, you have true end-to-end dedupe. You dedupe and compress the primary storage. When it comes time to back up, you do not need any disk to store rehydrated data. Rather, you engage in an intelligent conversation with the backup target and move only the blocks, chunks, or objects needed to complete the backup, keeping every chunk in its compressed form. You can do this from any Ocarina-enabled source to any Ocarina-enabled target.
So it applies to more than just backups. This kind of optimized end-to-end solution applies to replication, tiering (primary to nearline, for example), migration, archiving (primary to object store), and backup.

Some significant percentage of the I/O’s in a data center are done not for user I/O or application I/O, but are done in support of storage management workflows. Ocarina can be deployed as an enterprise storage optimization solution – not a point solution for one NAS filer, not even a solution for one data center.  Ocarina can be deployed across multiple tiers of storage – NAS, block, DAS, archive, object, cloud and backup – and then all storage management workflows will operate using dedupe-aware compressed data.   The benefits here are not just saved disk space (and power and cooling and so forth), but network bandwidth, time, backup window reduction, and more.

If you are a storage vendor reading this, you should know that Ocarina now has a complete storage integration SDK. This is a set of APIs and documentation for integrating Ocarina – in-band or post-process – inside different types of storage. There is a framework for file system-based products, a framework for block products (DAS and storage arrays) and a framework for cloud and object stores (get/post/put model). If you want consistent, compatible, and integrated data reduction across your whole storage product line, give us a jingle.

One More Time: In-Band Versus Post-Process

Posted by Carter George On April - 23 - 2010

My Mom always used to tell me, “You have to be able to distinguish between Need and Want”.
I need a car. I want an Aston Martin.

In Steve Kenniston’s post Storage’s 2010 Hottest Technology he says that customers who “require real-time random access compression….in front of their active solution” need something very fast.
Well, yes, they do.


But having a requirement to go fast has nothing to do with in-band versus post-process. And saying that customers who require something in front of their active solution need an in-band solution is like saying that customers who require an in-band solution require an in-band solution.


Who requires something “in front of their active storage”?


Some customers may want that, some might prefer it, but who needs it? In many cases, you could also use a solution that is next to your active solution, or inside it.   As long as it is fast enough and does a good job of data reduction, I’ll let you decide which one is a car and which one is an Aston Martin.

Let’s just run through some of the comparison points between in-band and post-process methods for data reduction and see what the trade-offs are.

As for why we’re talking through this old saw again, Storwize is an in-band appliance that sits on the wire in front of your active NAS storage. Ocarina sells an optimization appliance that does post-process compression and dedupe, but we also sell software-only solutions (mostly embedded inside storage vendors’ products) that work in-band inside the active storage system.

What Exactly Is Happening In-Band or Post-Process? The only thing this discussion is about is when data gets shrunk.  Every solution does real-time random access for users in-band.    When users or applications do I/O to their already-compressed data, it is always handled in-band, real-time, and transparent.  I think that’s true of every solution mentioned in the CORE table.    When people talk about in-band versus post-process, they are talking only about when the data gets compressed, not about when it gets decompressed.

Speed.     In-band versus post-process has absolutely nothing to do with the speed of a solution.    Time-to-compress can be just as fast post-process as it is in-band.    To demonstrate this, Ocarina turned off 111 of its 112 compressors, and also turned off our two-stage dedupe engine, and ran just our fastest compressor on a set of data.    We got 3508MB/sec throughput.     I don’t think anyone would buy Ocarina and then use only our simplest fastest compressor, but if they did, they’d find it goes as fast as any in-band solution on the market.      There is an implication regarding speed and in-band, though, and that is that in-band solutions can only run data reduction algorithms that go fast, because they are sitting there in the path of every I/O.  If the in-band solution is slow, then all your I/O is slow.   So yes they are fast, but this also limits the range of what they can do.  If the type of data you have requires a slower compressor to get good results – photos and video would be good examples – then you just can’t use an in-band solution.

How Much Disk I/O? In-Band solutions have a significant advantage in the area of disk I/O….or at least they do when they are getting good compression.     Because in-band solutions compress data before it gets written to disk, if the data has been shrunk by 50%, then there is 50% less data to actually write out to disk.  If you are I/O bound, and if a compressor is getting good results on your data stream, then this can help performance a lot.  You are simply doing a lot less I/O.  When dedupe vendors say that they ingest 1TB/hour, they don’t usually mean that they actually write 1TB of data per hour out to physical disk.  What they mean is, if you assume 20:1 dedupe, to store 1TB of logical data requires 50 Gigabytes of data to be written to disk.  The other 950GB are thrown away, because they were duplicates of data already stored on disk.   You have to be careful with these dedupe throughput claims – they assume you are getting the expected dedupe results.

If you sent 1 Terabyte of totally new and unique data, with no duplicates to data seen before, and the dedupe appliance actually had to write out 1 TB of physical data to disk, most of them would be quite a bit slower than their claimed performance. But for both dedupe and compression, if you get good data reduction, then you can improve I/O performance by being in-band.   Just as is the case with dedupe, though, if you are not getting a good compression result, then you won’t see this performance improvement.
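The dependence of throughput claims on the dedupe ratio is easy to see in numbers. The helper names below are hypothetical, and the 50GB/hour physical write rate is simply the figure implied by the 1TB/hour at 20:1 example.

```python
def physical_write_gb(logical_gb, dedupe_ratio):
    """Data that actually reaches disk for a given logical ingest."""
    return logical_gb / dedupe_ratio


def logical_ingest_rate(physical_gb_per_hour, dedupe_ratio):
    """Logical ingest rate a box can claim if disk writes are the limit."""
    return physical_gb_per_hour * dedupe_ratio


written = physical_write_gb(1000, 20)   # 1TB logical at 20:1
discarded = 1000 - written              # duplicates never written
print(written, discarded)               # 50.0 950.0

print(logical_ingest_rate(50, 20))      # 1000 GB/hour at 20:1
print(logical_ingest_rate(50, 1))       # 50 GB/hour on all-unique data
```

This assumes disk writes are the only bottleneck, which is a simplification, but it shows why the same appliance looks twenty times slower on data with no duplicates.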

All or Nothing. In-band solutions compress everything that goes by.   They have to.  Because of the speed constraints, they don’t typically have time to analyze data, figure out what it is, who it belongs to, or whether it falls in policy for some action or another.     Post-process solutions let the administrator decide which data to shrink, when to shrink it, how to shrink it, and even where to put the data.  This is probably the single biggest advantage of the post-process approach.    You may decide to dedupe and compress all files that have not been accessed for a month.   You might decide to dedupe all files that are virtual machine files, such as VMDK’s.  You may decide to apply fast compression only to files that are between 2 days old and a month old.  You may decide not to compress hot data, any file that has been read or modified in the last 24 hours, at all.

You may decide to run compression and dedupe jobs only at off-peak hours. You may decide to have the solution read a file from a Fibre Channel tier of drives, dedupe and compress the data, and write it out to a less expensive SATA tier.  This fine grain control, which allows active data management as part of the data reduction solution, makes post-process a clear winner for most unstructured file data.    For transactional databases, since the data is always hot, in-band may be the better approach.
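As an illustration of that kind of policy control, the sketch below encodes the example rules from the last two paragraphs as one function. The `FileInfo` fields, the action names, and the order in which the rules are checked are assumptions; a real policy engine would let the administrator define all of these.

```python
from dataclasses import dataclass


@dataclass
class FileInfo:
    name: str
    age_days: float    # days since the file was created
    idle_days: float   # days since it was last read or modified


def choose_action(f):
    """Pick a post-process action for one file (rule order is assumed)."""
    if f.idle_days < 1:
        return "skip"                 # hot data: touched in the last 24 hours
    if f.name.lower().endswith(".vmdk"):
        return "dedupe"               # virtual machine files: dedupe only
    if f.idle_days >= 30:
        return "dedupe+compress"      # not accessed for a month
    if 2 <= f.age_days <= 30:
        return "fast-compress"        # between 2 days and a month old
    return "skip"


files = [
    FileInfo("report.doc", 100, 0.5),
    FileInfo("server.VMDK", 10, 5),
    FileInfo("archive.doc", 400, 45),
    FileInfo("draft.doc", 10, 3),
]
actions = [choose_action(f) for f in files]
print(actions)
```

Scheduling (off-peak hours) and tier placement (Fibre Channel to SATA) would be further policy outputs; they are left out to keep the sketch short.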

What About Existing Data? Let’s say you have an EMC Celerra filer with 100TB of file data on it.  You buy a fancy new in-band compression solution.    New data that is being written gets compressed as it comes in. But what about that 100TB of data that’s already there?    Well, in-band solutions may give you a tool to read that data, compress it, and put it back….making it a post-process solution!   The fact is, most customers already have a lot of data.  Usually, customers who are out looking for a data reduction solution have lots and lots of data.    If you have a Petabyte of data sitting on the floor, and you can dedupe and compress that data down to 100 Terabytes, you just created 900 Terabytes of free disk.  That’s tremendous savings, and you can only get it by post-processing the data that is already there.

Data Integrity. This is sort of the skeleton in the closet for in-band solutions.     All data reduction vendors go to great lengths to ensure data integrity.  We all do checksums, or even bit-by-bit comparisons.   But when you do in-band compression, that means that you have never had a full original copy of your data on disk.     If there’s a bug in a compression algorithm (and let me be clear – I’m not accusing any vendor of any such thing) there is no original uncompressed data to go back to.

In a post-process solution, data gets written out to disk in its full original form. You decide when to compress it.  The conservative shop will say, compress everything after it is at least 24 hours old.  Why?  Because then you will have at least one backup of the original data taken by your backup solution.   Now you can go ahead and shrink the data, saving disk space on your primary storage.   But if there ever were a data corruption bug – and really, as a group, all of the vendors listed in the CORE score box go to great lengths to guarantee there isn’t, but if – then you’d have a full original copy to go back to.   The CIO of a major German bank once told me that for this reason, the bank would never implement an in-band solution.  He said, logically, I understand the protections that are in place and that they should be sufficient, but psychologically and from a risk perspective, I can not bet the data of the bank and its customers on those assurances.

Dedupe is Always a Post-Process.  No, Really! All dedupe, even solutions that say they are in-band, is a form of post-process.     One of the things that strikes me about the CORE article and the discussion that follows is that the line of reasoning is, “dedupe is the most important thing to happen in storage since forever, so you should go out and buy compression”.   Well, dedupe is a form of compression, and it’s one that compares new in-coming data to data that has already been stored.  To do this, you hold in-coming data somewhere while you compute a hash, look up the hash in an index, and find out whether the data is a duplicate of something you’ve already stored.    Solutions, like Data Domain, that claim to be in-band, hold that in-coming data in a very small holding tank (an NVRAM card) and process it quickly.  The solutions that are called post-process write the data out to disk and then figure out whether it’s a duplicate or not.  If it is, they get rid of it.   In both cases the data is being stored somewhere until the dedupe process runs – it’s just a question of where and for how long.    And dedupe is really important – for online data as much as for backup.
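To make the "hold it, hash it, look it up" sequence concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's implementation: the `DedupeStore` class, its in-memory dictionary index, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class DedupeStore:
    """Toy block store that keeps one copy of each unique block (hypothetical)."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes; this dict plays the role of the hash index
        self.refs = []    # logical write order, recorded as a list of digests

    def write(self, block: bytes) -> bool:
        """Hold the incoming block, hash it, look it up; return True if it was a duplicate."""
        digest = hashlib.sha256(block).hexdigest()
        duplicate = digest in self.blocks
        if not duplicate:
            self.blocks[digest] = block  # first time seen: store the data once
        self.refs.append(digest)         # either way, remember a pointer to it
        return duplicate

    def read_all(self) -> bytes:
        """Rebuild the original stream by following the pointers."""
        return b"".join(self.blocks[d] for d in self.refs)


store = DedupeStore()
results = [store.write(b) for b in (b"alpha", b"beta", b"alpha")]
print(results, len(store.blocks))
```

Whether `write` runs against a small NVRAM holding tank or against data already on disk, the steps are the same. Only where the block waits, and for how long, changes.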

So Which One is Better? Neither. There is no one right answer. Different use-cases call for different solutions. If you have an OLTP database running on a NAS share, in-band compression is probably the best bet. If you have lots of unstructured file data, I’d say post-process (including dedupe) is going to be better most of the time. But there are other use-cases, and there are exceptions even to those generalizations. What I will say is this. Most research, such as the IDC data on storage growth, shows Tier 1 Primary and database data growing very little year-to-year, while the same research shows unstructured file data – Office docs, email, Sharepoint, photos, video, internet “stuff” – growing at almost out-of-control rates.


Both in-band compression and full compression and dedupe solutions have their place, but the most important thing for data reduction is to attack the place where data growth is causing the most pain. For the most part, you’ll find that that growth is in unstructured file data, that it is not in the most performance-sensitive and finely-tuned Primary tier, and that if you got 75% savings on just the files in your data center that have not been modified for at least three months, you’d be flabbergasted at the savings.

Next:     Do you have to expand data to back it up?

Deduplication: From Point Solution to Data Center Strategy

Posted by Carter George On April - 19 - 2010

Deduplication has been a hot topic in storage for several years now. Most of the focus has been on dedupe appliances sold in the backup market, by companies like Data Domain and Diligent (now IBM ProtecTIER). There have been dozens of articles written explaining the basic concepts, and comparing the implementations by various vendors.

Dedupe is also becoming more prominent in primary storage, with NAS market leader NetApp including a basic dedupe feature on every node.  EMC followed with whole file dedupe on its Celerra family of products.  Other vendors are now introducing dedupe as a feature on block storage arrays, in the cloud, and on nearline and archival storage.

In effect, what we’re seeing is the emergence and transformation of deduplication (and some other related data reduction techniques) as a new storage fundamental.     Rather than being a standalone solution that customers pay a premium for, dedupe is becoming something that will be a standard feature on most mid-market and enterprise storage products by 2012.

This blossoming of dedupe is happening faster than with other value-add storage features that have followed similar paths. Data reduction in general, and dedupe in particular, represents a technology whose time seems to have come. There are two drivers. One is exponential storage growth, which is both straining IT capital expenditures and outpacing the ability of cheap disk drives to keep up. The other is that economic pressure and tightened budgets have made the traditionally conservative enterprise storage buyer willing to look at new technology that can store data more efficiently and at less cost. It’s also a technology that works because the data driving storage growth is unstructured data, data created by people. Database business applications are not driving storage growth – people are. It’s office documents, email, photos, and videos that are driving the information revolution, and it is in the nature of humans to create lots of copies, variants, and versions of information – in other words, humans (as opposed to database applications) are a lot more likely to create a bunch of duplicate, inefficiently stored data. Dedupe goes and finds all that, and takes the unnecessary copies out of the picture, and on the actual storage, all those duplicate copies point to one shared place. The benefits are compelling – it’s not just saving disk space, but ultimately it’s also saving power, cooling, rack space, and all the other things you didn’t have to buy when you avoided buying another rack of disks.

That said, not all dedupe is created equal. It comes in multiple flavors – fixed block or variable, block aligned or sliding window, in-band and post-process and so forth.   There are those who argue fervently that one approach is the only true way to dedupe, as though it were some sort of medieval religious dispute. What matters is that the dedupe method chosen matches the use case for which it is being used.
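One way to see why the fixed-versus-variable distinction matters is to insert a single byte at the front of a file and count how many chunks still match. The Python sketch below is a toy, not any vendor's algorithm: the function names, the 64-byte block size, the 16-byte window and the cut mask are all illustrative assumptions. For clarity it recomputes a CRC over each window, where a production system would use a true rolling hash.

```python
import random
import zlib


def fixed_chunks(data, size=64):
    """Block-aligned chunking: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0x3F):
    """Sliding-window chunking: cut wherever a checksum of the last `window` bytes matches a pattern."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut point
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))  # synthetic "file"
shifted = b"X" + data                                   # same file with one byte inserted up front

fixed_shared = len(set(fixed_chunks(data)) & set(fixed_chunks(shifted)))
cdc_shared = len(set(content_defined_chunks(data)) & set(content_defined_chunks(shifted)))
print("fixed-block chunks still matching:", fixed_shared)
print("content-defined chunks still matching:", cdc_shared)
```

Because the content-defined cut points depend only on nearby bytes, they move along with the data after the insertion, so most chunks still match. The block-aligned chunks are all offset by one byte and stop matching.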

Dedupe and Compression:  Friends or Foes?

Dedupe is not always the best approach to data reduction either. Some data sets – like collections of virtual machines and repetitive backups of the same volumes – lend themselves very well to deduplication.  Other data sets, such as corporate file shares and primary storage, respond better to compression.  The goal of the IT user should be to store and move data in the most efficient way.   Lost in the hype around dedupe is the fact that quite often compression – which is seeing a renaissance in research after years of relative inactivity – is better at shrinking data than dedupe is.     The two technologies are not mutually exclusive.    It is possible to apply both dedupe and compression to the same data, but finding the optimal balance is easier said than done.

For dedupe to work well, data is chunked up into small blocks, which are then compared to see if any are the same. Duplicate blocks can be discarded, saving space. The smaller the block used, the more likely it is to find dupes. Where dedupe gets its best data reduction with smaller chunks, compression gets better results with larger chunks. Compression works by looking at patterns and then making predictions. If you can predict the next thing, you can compress it. To find patterns and predict data better, compression likes to have more context, and that means that bigger chunks work better. For any given data set, there’s an optimal balance, but there is no one right answer that works best for every data set.
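The opposing pull of chunk size can be seen in a small experiment. The Python sketch below uses synthetic data and hypothetical helper names (`compressed_total`, `unique_block_bytes`), with zlib standing in for "regular" compression. It is an assumption-laden illustration of the trade-off, not a benchmark.

```python
import random
import zlib


def compressed_total(data, chunk_size):
    """Total bytes after compressing each chunk independently."""
    return sum(len(zlib.compress(data[i:i + chunk_size]))
               for i in range(0, len(data), chunk_size))


def unique_block_bytes(data, block_size):
    """Bytes left after keeping one copy of each distinct fixed-size block."""
    blocks = {data[i:i + block_size] for i in range(0, len(data), block_size)}
    return sum(len(b) for b in blocks)


rng = random.Random(0)

# Compression input: synthetic word-like text built from a small vocabulary.
vocab = ["storage", "dedupe", "compression", "block", "file", "backup", "archive", "data"]
text = " ".join(rng.choice(vocab) for _ in range(5000)).encode()

# Dedupe input: a random "file" followed by a copy with one byte changed every 8 KB.
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = bytearray(original)
for offset in range(0, len(edited), 8192):
    edited[offset] ^= 0xFF
two_versions = original + bytes(edited)

for size in (512, 8192):
    print(size, compressed_total(text, size), unique_block_bytes(two_versions, size))
```

With 512-byte blocks, only the blocks that contain a changed byte fail to dedupe. With 8 KB blocks, every block of the edited copy contains a change, so nothing dedupes. The compression totals move the other way, because each larger chunk gives the compressor more context to work with.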

The Dedupe Transformation

At the Gartner Data Center Conference held in December 2009, the audiences of three different sessions were polled. In the Gartner report, “Data Deduplication will be even bigger in 2010*,” following the event, analyst Dave Russell commented on the poll results. “If the 56% of those with some plans for deduplication in 2010 are combined with the 14% that are using deduplication for only a portion of their backups, and if, as in years past, 2% to 4% of those with no current plans to deploy the technology do implement it, then it’s conceivable that 72% to 74% of the audience will adopt deduplication by year-end 2010.”

A similar poll by SearchStorage of storage buyers found very similar trends. Almost everyone either plans to deploy dedupe for backup or is evaluating it.  Likewise, while only 17% have deployed it for primary storage, a staggering 60% are planning to either buy or evaluate dedupe for primary in the next year.  This means that CIO’s and IT Directors are expecting dedupe to have the kind of impact on IT operations that virtual machines have had over recent years.

All the major players in storage will need to decide on a dedupe strategy, bringing out standalone products in dedupe-centric markets like backup, and adding dedupe as a feature set to existing primary and nearline products in the other storage tiers. Some of the large vendors may look to startups to acquire the technology they need, either through OEM deals or acquisitions, while others will develop in house. In either case, within two years at most dedupe will have become a storage fundamental, and the landscape of vendor offerings will have changed. Today, the playing field consists of niche offerings by the big vendors and a set of innovative startups looking to break out. Within two years, some of those startups will have gotten design wins with major vendors, some will follow in Data Domain’s footsteps and get acquired, and some will be left holding the short end of the stick.

Point Solution or Coherent Strategy?

Finally, there is one key mystery left in the dedupe marketplace, which is headed towards a situation where every storage product will have deduplication built in. At current course and speed, all of those dedupe implementations will be inconsistent and incompatible with one another, even inside the product line of a single vendor. Will that continue to be the case as dedupe becomes a standard feature? If you look at today’s two market leaders in dedupe – NetApp for primary and Data Domain for backup – you’ll see a painful scenario that we expect to see played out over and over again over the next few years. Take a volume on NetApp filled with 16 Terabytes of data. NetApp dedupe might shrink that data to 8TB, a great space savings. But when it comes time to back that data up, the NetApp rehydrates (expands) the 8TB back to the full original 16TB to send it to the backup target. Let’s say the backup target is a Data Domain. Now the network needs to carry that whole 16TB, and the NetApp storage controller has to use a lot of CPU to put the data all back together, possibly slowing down other applications trying to use the NetApp. You have to buy a Data Domain model big enough to handle that 16TB ingestion in your available backup window. When the 16TB of data gets to the Data Domain, it will be deduped again, using different algorithms, getting back down to 8TB or less. In this process, a huge amount of CPU, network bandwidth, and time have been consumed to expand data and then shrink it again. For a market focused on getting to better storage efficiency, this is blatantly wasteful. This is the case today with almost any dedupe-for-primary solution backing up to a dedupe-for-backup target. What would be more useful to customers would be a consistent and compatible dedupe, allowing data that has already been deduped to be moved in its compressed format to other storage products that support a compatible implementation.
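To put rough numbers on the 16TB example, here is a back-of-the-envelope Python sketch. The `backup_transfer` function is hypothetical, and the 1 GB/sec link speed is an assumption picked only to make the arithmetic easy. It ignores the CPU cost of rehydration and of re-deduping on the target.

```python
TB = 10**12  # decimal terabyte, in bytes


def backup_transfer(logical_bytes, stored_bytes, link_bytes_per_sec, dedupe_aware):
    """Return (bytes sent over the network, hours on the wire) for one backup."""
    # A dedupe-aware transfer ships the reduced data; otherwise it is rehydrated first.
    wire_bytes = stored_bytes if dedupe_aware else logical_bytes
    hours = wire_bytes / link_bytes_per_sec / 3600
    return wire_bytes, hours


LINK = 10**9  # assumed 1 GB/sec of sustained throughput

rehydrated = backup_transfer(16 * TB, 8 * TB, LINK, dedupe_aware=False)
compatible = backup_transfer(16 * TB, 8 * TB, LINK, dedupe_aware=True)
print("rehydrated: %.1f hours, compatible: %.1f hours" % (rehydrated[1], compatible[1]))
```

Under these assumptions the rehydrated backup moves twice the bytes and needs twice the backup window, and the gap widens as the dedupe ratio on primary improves.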

A good deal of the I/O workload in a given shop is driven by a handful of common storage management workflows – backup, replication, migration, and tiering – and all of those workflows would be more efficient if they could be done using data that had been compressed and deduplicated. To truly deliver on the promise of storage efficiency, we’ll look to see some vendors deliver consistent and compatible dedupe and compression that works across products, supporting dedupe-aware versions of those key workflows transparently.

Our Thoughts on Performance and Recent Dedupe/Compression Comments

Posted by Carter George On April - 16 - 2010

In the recent InfoStor article on primary storage optimization http://www.infostor.com/index/articles/display/2460996926/articles/infostor/storage-management/2010/april-2010/consider-compression.html , Ocarina was mentioned along with some other vendors who have offerings that provide either dedupe or compression for primary storage. Ocarina is characterized as being a post-process solution.    This is a theme that we’ve seen in several product review pieces, and it’s worth clarifying.   Ocarina does offer post-process optimization, but the product can also be configured to do inband optimization. The common wisdom seems to be that post-process gets the best data reduction, but is slower than inband and therefore can only be used for cold data.   Ocarina is happy to be recognized for having the best data reduction, but we’re not post-process only, nor are we willing to concede that we are for cold data only.

User access to optimized data is always inband and real time. This whole inband versus post-process discussion only applies to the question of when you shrink the data.

Ocarina’s ECOsystem is a configurable multi-stage data reduction pipeline and it can be set up to run post-process, inband, or both. Ocarina’s family of storage optimization appliances comes pre-configured to do post-process dedupe and compression, but where Ocarina’s ECOsystem has been embedded as software inside our storage partners’ products, it has in some cases been configured inband.

The ECOsystem pipeline has four elements, all optional:    object dedupe, block dedupe, regular compression, and content-aware compression. To run inband, an element has to be fast enough to keep up with the storage system’s I/O speed. Just as you want compression to be data invisible (it’s invisible when a user gets their file back bit-for-bit the way it was originally without ever knowing it was compressed), you want it to be performance invisible too. Adding dedupe and compression should not affect the perceived performance of a storage system. Some of Ocarina’s data reduction elements are fast enough to run inband, and can be configured that way. Some elements, especially advanced content-aware compressors, are slow enough that in most cases you would want to run them as background post-processes. With post-process, you also have the option of using policies to decide which data to compress, and when. With most inband solutions, you have to compress everything, all the time.

In the latest issue of Storage Magazine, Curtis Preston, editor in TechTarget’s Storage Media Group and an independent backup expert, in covering data reduction vendors said, “Ocarina takes a very different approach to data reduction than many other vendors. Where most vendors apply compression and deduplication without any knowledge of the data, Ocarina has hundreds of different compression and deduplication algorithms that it uses depending on the specific type of data.”


Because we have such a rich toolbox, we can do different things with it.  If you used Ocarina to build a backup solution, you might use our block dedupe and fast regular compression only. If you used us for a deep archive, you might use object dedupe and advanced content-aware compression as post-process only. For primary NAS, you might do fast regular compression inband, and dedupe as a post process, and so forth.
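As a rough mental model of that kind of configuration, the Python sketch below treats each of the four pipeline elements as off, inband, or post-process, and splits a configuration into write-path work and background work. This is not the ECOsystem API: the names `Mode`, `ELEMENTS` and `plan` are invented for illustration, and the primary NAS example assumes block dedupe where the text just says "dedupe".

```python
from enum import Enum


class Mode(Enum):
    OFF = "off"
    INBAND = "inband"
    POST_PROCESS = "post-process"


# The four pipeline elements named earlier in this post, in the order listed there.
ELEMENTS = ("object_dedupe", "block_dedupe",
            "regular_compression", "content_aware_compression")


def plan(config):
    """Split the enabled elements into (inband, post-process) lists."""
    unknown = set(config) - set(ELEMENTS)
    if unknown:
        raise ValueError("unknown elements: %s" % sorted(unknown))
    inband = [e for e in ELEMENTS if config.get(e, Mode.OFF) is Mode.INBAND]
    post = [e for e in ELEMENTS if config.get(e, Mode.OFF) is Mode.POST_PROCESS]
    return inband, post


# Illustrative configurations based on the examples above; elements left out are off.
primary_nas = {"regular_compression": Mode.INBAND,
               "block_dedupe": Mode.POST_PROCESS}
deep_archive = {"object_dedupe": Mode.POST_PROCESS,
                "content_aware_compression": Mode.POST_PROCESS}

print(plan(primary_nas))
print(plan(deep_archive))
```

Keeping the mode per element, rather than per product, is what lets one toolbox cover backup, archive, and primary use cases.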


To make this point a bit clearer, we’ll publish some performance results in the next week or two showing how we attack different data sets inband, post-process, and with both. We’ll make the data sets public, so if other vendors want to try their wares, they can download the data and see how they do, apples to apples. To directly address the issue of whether Ocarina can be fast enough to be used for “true” primary storage, if you run our fast regular compressor only, inband, we have been benchmarked at over 3,000 MB/sec. Most of our customers will elect to do more than just simple regular compression, so that’s not a number we’d claim for the real world – in the real world, people want to get better data reduction than you get with just regular compression.


At the end of the day, going fast is important, but only if you actually do something useful. If a solution goes really fast, but can’t actually shrink your data, then we don’t think that’s very interesting.    The right solution will get the best possible data reduction while still meeting performance requirements. Different use cases have different performance requirements, and what you want is something that can be configured to hit the sweet spot for performance while still getting smokin’ dedupe and compression results.

Ocarina Adds Video Optimization to Its List of 900 File Types

Posted by Matthew Harvey On April - 7 - 2010

Here is our official announcement on our ability to reduce more than 900 file types.

Ocarina Adds Video Optimization to Its List of 900 File Types Supported by Content-Aware Storage Solution

Company’s Native File Optimization delivers best possible file compression while preserving video quality, retaining native file format

SAN JOSE, Calif., Apr 07, 2010 — Ocarina Networks, a leading provider of content-aware storage optimization solutions, today announced that it has expanded the number of file types it can reduce to more than 900, to include support for popular Flash video for Internet distribution, and MPEG2 for broadcast workflows. Michael Davis, senior director of marketing at Ocarina, will be speaking about the company’s video capabilities at the 2010 NAB show in partners’ BlueArc and Isilon booths in the Las Vegas Convention Center April 12-15.

Ocarina has developed the industry’s most-advanced platform for data optimization, providing reliability, scalability, and a complete array of advanced data-reduction algorithms in a fully integrated solution. Ocarina’s ECOsystem(R) includes over 100 algorithms that have proven effective for more than 900 file types, including ones that had been considered previously uncompressible. Ocarina’s fully lossless optimization technology has already gained widespread acceptance as an online archival solution in film production studios including Starz, Rainmaker Entertainment, and ZOIC. Now Ocarina has enhanced its Native Format Optimization (NFO) workflow to support Internet video workflows. The NFO workflow introduces non-visible compression to image and video files, allowing customers to capture data-reduction benefits in storage, bandwidth, and web-site responsiveness. Ocarina’s addition of Adobe Flash video formats to the NFO workflow ensures the best possible video compression of various file types (FLV, SWF, F4V) and data formats (Spark, VP6, h.264) while preserving original image quality. Ocarina has also added lossless compression support for MPEG2 video in broadcast video archives.

Ocarina’s NFO workflow delivers the best possible video compression while preserving image quality as well as the native encoding format. By reducing file size across large video repositories, media companies are able to keep files online longer, reduce storage capital and operating costs, reduce the cost of distribution including CDN and ISP bandwidth fees, reduce the page-load times for video-intensive web sites, and improve audience penetration into marginal broadband markets.

“Ocarina NFO is the only enterprise-class data reduction product that is effective on video content. This is content that doesn’t present redundancies for a dedupe algorithm, and generic compression algorithms such as LZW really add no benefit,” said Davis. “What’s different about our NFO compression is that we’re really getting into the video encoding to align it with what the human eye can perceive. Our post processing algorithms will seek out opportunities for spatial optimization, inter-frame optimization, better motion compensation, improved bit-rate control, quality normalization and hot-spot detection. Our video-aware optimization allows us to deliver up to 40% or more savings where traditional de-duplication technologies deliver no benefit.”

Ocarina’s new video NFO workflow appeals to media companies concerned with bandwidth costs including social networks, user-generated content sites, video advertisers, news outlets, and other ad-supported video sites.

Ocarina’s ECOsystem is deployed as an appliance co-processor that works with existing storage systems, and uses a policy-based interface to align data-reduction with application workflows. ECOsystem enhances tiered-storage architectures by multiplying available storage in secondary storage tiers.

With more than 1,500 exhibiting companies and 800,000 square feet of exhibit space, The NAB Show is the world’s largest electronic media show covering filmed entertainment and the development, management and delivery of content across all mediums. More than 85,000 audio, video and film content professionals are expected to attend the 500 conference and training sessions available. Additional details about the NAB Show are available at www.nabshow.com.

About Ocarina

Ocarina is a leader in online storage optimization solutions. Organizations of all sizes use Ocarina’s content-aware optimization technology to reduce their storage footprint and achieve a ten-fold capacity increase on their current storage systems. Based in San Jose, Calif., Ocarina is privately-held and financed by leading investors JAFCO Ventures, Kleiner Perkins Caufield & Byers and Highland Capital Partners. For more information, visit www.ocarinanetworks.com

Compression and Dedupe like Oil & Water?

Posted by Mike Davis On March - 31 - 2010

Ocarina customer Imagination Technology got some good press the other day in a SearchStorage article reviewing solutions for primary storage optimization. Imagination, based northwest of London, provides key pieces of semiconductor IP that are in just about every smart phone out there, and they used Ocarina to double their storage in the same datacenter footprint. [note to Steve Jobs...I see you used them for iPhone graphics, can you also use their Flash acceleration? please?]. The underlying storage there is Network Appliance, and our co-processor is happily reaching in via NFS to shrink all the archived project data for reference and restore.

As at most every other company whose product is digital, access patterns follow a 90/10 rule: there’s a hot set, and there’s the rest, and the rest is the stuff you should target for data reduction. For someone who’s budget constrained or datacenter constrained (and I think that covers about everyone except maybe the NRO), data reduction brings real benefits to online storage, whether that comes from NetApp, Storwize, or Ocarina. Of course if it comes to a shrink-off, we’ll take any challenge from any taker with any dataset!

Dedupe and Compression like oil and water?
Brian raises the contentious issue of whether dedupe is better or compression is better, and whether you should do both. The truth is dedupe’d data is less easy to compress, and compressed data is less easy to dedupe. If you apply a compression-only workflow to a dataset, let’s say you get 50%. Now run the same data set through a dedupe-only workflow and you’ll get maybe 20% (remember this is primary storage, not backup data). Now take those little chunks and pointers from the dedupe workflow and compress them; you might get an additional 35% for a total of 55%. So compression of deduped data is less effective than on the raw data-set, but the combination (for this example) has eked out a 5% advantage over the compression-only workflow. Of course it all depends on the data, and there’s a high burden on the software to be really smart about how and when it chooses different algorithms.
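Spelled out in Python, the arithmetic in that example looks like this. The percentages are the illustrative ones from the paragraph above, not measurements, and the variable names are ours.

```python
# All figures are percentages of the ORIGINAL data set size.
compress_only = 50           # savings from a compression-only workflow
dedupe_only = 20             # savings from a dedupe-only workflow
extra_from_compression = 35  # additional savings from compressing the deduped output

combined = dedupe_only + extra_from_compression  # total savings when doing both
remaining_after_dedupe = 100 - dedupe_only       # what the compressor actually gets to see

# How well compression did on the deduped data, as a fraction of its own input.
ratio_on_deduped = extra_from_compression / remaining_after_dedupe

advantage = combined - compress_only             # points gained over compression alone

print(combined, ratio_on_deduped, advantage)  # 55 0.4375 5
```

So the compressor shrank its input by about 44% after dedupe, versus 50% on the raw data. That is the "less effective" part, even though the combined result still comes out 5 points ahead.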

It takes some skill and forethought to do both well, and Ocarina’s algorithm selection logic is well tuned enough (thanks partly to the use of a neural network) that the combination of dedupe plus compression will deliver say 80% savings (or 5x effective capacity) when processing an enterprise-profile data set (where there’s some redundancy in the data). For some vertical applications however, the benefit of adding deduplication is so slight, it’ll actually be disabled. A life-sciences dataset for example has relatively little data redundancy.

Steve from Storwize is right that in many situations compression can do much better than dedupe (for primary storage apps), and that’s certainly true when considering most of the prevailing dedupe algorithms such as SIS (full file), static block (NetApp A-SIS), or dynamic block. The key difference for Ocarina, and the way to make dedupe pay off in primary storage, is, as Brian explains in the article, that we “pull files apart and deduplicate their constituent elements.” In other words we find reduction where no one else can. Imagine for example how cool it is when ECOsystem delivers dedupe benefits even when files have already been “single-instanced”.

I dream of data reduction

Posted by Sunshine On March - 29 - 2010


Data is growing at a dizzying rate. We need only look at our home computers to get a sense of how easy it is to fill our hard drives to overflowing with all manner of flotsam and jetsam. From family photos to LOLcats to videos of our kids, we’re finding it difficult if not impossible to keep down the rising tide of files.

There is a cost to this, as many if not most enterprises are now recognizing. Recently, InfoWorld launched a special section, Data Explosion, that guides companies through the myriad problems that arise from having too much data to handle. With headlines like “The big data addiction,” the new section promises to address the issue with step-by-step guides, white papers, and other instructional pieces.

InfoWorld blogger Matt Prigge delves into the topic in a post today, “The high cost of lazy storage.” He says that users need to take responsibility for keeping their data under control. Despite this admonishment, he admits that he himself is an “excellent example of the problem.” He saves all of his email, because he never knows what he might need later. Sound familiar? If someone whose blog is called “Information Overload” can’t get control of his personal data, it’s hard to imagine how anyone else can.

Prigge writes, “The bigger that data gets, the more effort required to put the genie back in the bottle.” He pushes the metaphor even further (and more gruesomely) by suggesting that at some point it’s easier to kill the genie and throw away the bottle. Now, that does strike us here at Online Sto Op as rather extreme. Why not simply put the genie back into that nice, compact bottle where she was living perfectly happily for so many years?

As we all know from 70s TV, those bottles were well-upholstered and downright comfortable living spaces for many a genie. And while it’s true that some genies (or Jeannies) would get so angry they’d stomp their feet when they were magically sent back there, they eventually settled back onto the purple pillows, kicked off their metallic platform heels, dug their toes into the shag carpeting and relaxed. Same goes for data reduction. A combination of approaches seems the most sensible answer. Data needs to be managed. There is something that is known as 100% compression–it’s called “deletion.” But short of that, there are ways to reduce data by as much as 90%. There are solutions for reducing the types of files that are driving the fastest storage growth, such as JPEGs, documents, videos, graphics, and other large files. An intelligent, content-aware approach that includes both deduplication and compression is what this blog’s parent Ocarina provides.