
Online Storage Optimization

Exploring Next Generation Storage Solutions

Archive for the ‘Uncategorized’ Category

Dell Day 2

Posted by Mike Davis On August - 2 - 2010

It’s a great experience going through a positive transfer of ownership in a company, whether it’s an IPO or a strategic acquisition by a larger company. From an Engineering perspective, the day-to-day tasks are mostly unchanged, but there’s clearly something different in the air. Maybe it’s the fog hanging in the air, from all the attorneys’ forms…sign this and sign that… Maybe it’s just the anticipation of change, which is almost always good, but there’s no denying that there’s a ton of uncertainty in the air.

Will I have to change how I write code?
Is there going to be a lot of bureaucracy in our new big company?
Are they going to take away our Monday morning bagel run? Our soda? Mixed nuts???
Will any of our product roadmap be pruned? Targeted for acceleration?
Will we be the hot new division at Dell or just another group of smart guys added to the team?
Will they still let us play Cricket matches in the lobby?
Will Ocarina’s CEO get to fly on a private jet now? Can we ride along???

We are already deep into discussions of integration plans, but only time and reassuring communications will help settle the uncertainty. Fortunately the “unknown” is balanced by the thrill of developing something of true value, and by knowing that Dell took us off the table because we are a keystone for their vision of the future datacenter.

As far as rating our acquirer, Dell gets an A+. Everyone is really happy with the way we’ve been treated so far. Dell has been incredibly generous with their executives’ time, spending quality time with the Ocarina team helping us understand everything from culture, to cubicles, to Cokes, to product tactics. Even Michael (we’re told using the last name is optional) made a point of sending us a personal welcome message, and even showed up in our office. Even though the acquisition has just closed, many of us have started developing personal relationships with our Dell counterparts, and it all looks good from San Jose.

Michael and Murli


For now, business goes on, customers will get supported, and dev work on our existing embedded projects continues at breakneck pace. Attorney activity is on the downswing, and now it’s time to get to the rewarding work of integrating the teams, the product strategies, and making sure our customers are happier than ever.

The best is yet to come

Posted by Sunshine On July - 20 - 2010

For almost exactly a year, I had the privilege of being a blogger for this site, Online Storage Optimization. It was one of the most fulfilling collaborations of my professional life.

When I heard the news yesterday that Dell would be acquiring Ocarina Networks, my first reaction was pride. I couldn’t help but give myself an inner “high five” for having recognized what a great company Ocarina is and would become. My second was to congratulate Dell on their smart choice. I was lucky enough to have worked with the people at Ocarina, and now a whole new group of folks will have this opportunity.

The people at Ocarina Networks are not run-of-the-mill. They are not even above average. They are extraordinary. Anyone who wants to understand what makes a technology company excel would do well to study them. As I wrote in my final, farewell post, Murli Thirumale, CEO of Ocarina is about as far from the image of the typical start-up CEO as you can get. He is soft spoken, thoughtful, and a good listener.

He built his company around a well thought out business philosophy. He first identifies a problem or need, and then designs a product in response to it. In this case, the problem he identified was the proliferation of data worldwide. Rather than go for the obvious, he brought together a team to research the problem and solve it in a new way. This approach is rare. Most of the time, you have start-ups that try to launch based on a technology they themselves developed–in other words, their own pet project. It takes discipline to go about things the way that Murli does.

Murli also showed himself to be a leader in the true sense. Rather than feed his own ego, he chose to surround himself with giants of the storage industry, and he allowed them to get the job done the way they saw fit. Carter George, VP Products, is one of those people. I am particularly indebted to Carter. He was truly a mentor to me. Despite the intense demands of his role at Ocarina, he always had time to answer my questions. He never treated me as anyone other than an equal. I also got to know Goutham Rao, the visionary genius who is the company’s CTO. In addition, I met Dave Withers, who is the man behind the company’s multiple partnerships with top storage companies. For the last few months of my tenure, I worked closely to ramp up the social media program with Mike Davis, director of marketing. Dell is extremely fortunate. They are acquiring technology, but to me the real gift is that they get to work with these remarkable people.

As the headline says, this to me seems like the beginning for Ocarina. It is a bold plan to take the vision of end-to-end dedupe and make it real. This is the next step for the storage industry–and a crucial one if storage costs are to be kept in line, and infrastructure is to keep up with the demands of the real world. We are living in a time when data growth is spiraling upward at rates that no one could’ve imagined even a decade ago. This is the time for a cohesive, meaningful response to this reality. I couldn’t imagine a better outcome for Ocarina, Dell, and the industry as a whole. I am honored to have been a small part of it.

Accelerating our Mission

Posted by Murli Thirumale On July - 19 - 2010

By now you may have heard the news that Ocarina will become part of Dell. Many of you who have been following Ocarina over the last few years may wonder why a company with exciting technology, huge customer success and rapidly growing revenues would sell itself at this time.  The answer is simple:  it accelerates our mission and is great for our people.  Our mission is compelling:  We will be the industry leader in primary data reduction by the year 2012.  Dell takes our current trajectory and straps a big rocket engine to it.  Their success with the EqualLogic product, huge customer footprint, leadership in servers (where we can embed our technology) and the fact that they are excited about our OEM path already ensures that we will now be supercharged on our path to our mission.  Dell’s tagline for the Enterprise business is “The Efficient Enterprise”.   We could not have come up with a better line for our dedupe solutions.  Business alignment? Check!

For our people, it is a great outcome.  Good people build great technology.  Great people build products that get used over and over again by a large number of customers.  Getting our world-beating dedupe solutions into the hands of a large number of customers rapidly is what we are all about.

We have been working with Dell for over a year. The people we have met, all the way from engineers up to Michael Dell, are intellectually curious, direct, driven to succeed, talented and fun to work with. In fact, they are a lot like us at Ocarina! Culture alignment? You bet, baby! Finally, even with its success, Dell’s storage business is relatively new, and their willingness to fund it (they bought us, didn’t they?), long-term expansion plans and focus on talent acquisition ensure a great opportunity for the talented team we bring across all functions to Dell.

Primary dedupe leadership by the year 2012.  Now coming to you at rocketship speed powered by Ocarina AND Dell.  Watch out storage industry!

Passing along what we learned from OEMs

Posted by Mike Davis On July - 18 - 2010

It’s an exciting time at Ocarina, because we’re right in the middle of a wave of OEM efforts to bring data-reduction to market as a standard feature across a wide variety of storage implementations. ECOsystem for OEMs is an Ocarina offering of software libraries and APIs that allows storage OEMs and ISVs to embed data-reduction into their products. At Ocarina, we firmly believe that within a couple years, not only will we see dedupe as a new “standard feature” in all major block and file storage products, but it will be increasingly found in host applications as well.

Here are a series of requirements that we’ve consistently heard from OEMs in our discussions, and here’s how Ocarina addresses those in the ECOsystem for OEMs.

1) The solution shrinks data well.
Ok, so this is obvious. Dedupe is supposed to shrink data, and that improves system utilization and reduces costs. Doing this well is about algorithms, and Ocarina has been on top in the algorithm game since we started shipping our primary storage optimization products almost 2 years ago. We’ve shown in some deployments that we can deliver 80% savings when traditional block dedupe delivers no more than 30%. We believe in fact that we’re the only storage technology vendor to actually employ algorithm PhD’s whose sole job is to invent better algorithms…which is something that has proven really useful in specialized markets where a few unique content types dominate the terabytes.

We are the only provider today that deploys dedupe and compression concurrently. Our dedupe algorithm is a content-aware, variable-block, sliding-window approach. ‘Content-aware’ is an overused term for sure, but here it reflects for example that we’ll recognize monolithic data structures in a stream like a JPEG blob, and we know that slicing that JPEG into 8KB chunks for dedupe delivers absolutely no benefit, and thus is a complete waste of time, CPU, and memory. We’ll treat that JPEG (and other data types like it) as a contiguous chunk, which makes our dedupe namespace extremely fast and efficient.
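To make the chunking idea concrete, here is a minimal Python sketch. It is not Ocarina's algorithm; it only illustrates the general technique under some loud assumptions: boundaries are chosen by a simple rolling sum over a sliding window, the window size and boundary mask are arbitrary illustrative values, and a "JPEG" is recognized naively by its start and end markers (real JPEG parsing is more involved). All function names are hypothetical.

```python
import hashlib

WINDOW = 16                  # sliding-window size in bytes (illustrative value)
MASK = 0x3F                  # cut when (rolling sum & MASK) == 0 (illustrative value)
JPEG_SOI = b"\xff\xd8\xff"   # JPEG start-of-image marker prefix
JPEG_EOI = b"\xff\xd9"       # JPEG end-of-image marker


def variable_chunks(data: bytes):
    """Cut data at content-defined boundaries using a rolling sum over a sliding window."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= WINDOW:
            rolling -= data[i - WINDOW]      # slide the window forward by one byte
        if i - start + 1 >= WINDOW and (rolling & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial chunk
    return chunks


def content_aware_chunks(data: bytes):
    """Keep each JPEG blob as one contiguous chunk; chunk everything else variably."""
    chunks, pos = [], 0
    while pos < len(data):
        soi = data.find(JPEG_SOI, pos)
        eoi = data.find(JPEG_EOI, soi + len(JPEG_SOI)) if soi != -1 else -1
        if soi == -1 or eoi == -1:
            chunks.extend(variable_chunks(data[pos:]))
            break
        chunks.extend(variable_chunks(data[pos:soi]))
        chunks.append(data[soi:eoi + len(JPEG_EOI)])   # whole JPEG, never sliced
        pos = eoi + len(JPEG_EOI)
    return chunks


def dedupe(chunks):
    """Store each unique chunk once; a file becomes a recipe of chunk digests."""
    store, recipe = {}, []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe
```

With this sketch, a stream that contains the same image twice ends up with a single stored copy of the blob and two recipe entries pointing at it, and no CPU is spent slicing the image into small pieces that would never match anything.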

2) The solution minimizes time to market.
Most OEM vendors are under considerable time pressure to bring dedupe to market, either as a competitive response, or because their customers are demanding it. But the OEM’s dev and test engineering resources are always limited. With that in mind we made a point of developing a full featured library. That means not only do we slice the data and do hash lookups (as with the Permabit product), but we also do the dedupe, compression, on-disk data management, metrics and reporting, throttling mechanisms, optimized data movement, and more. By delivering a full-featured suite of capabilities, the OEM can rapidly bring embedded data reduction to market without for example having to redesign their file system or block map systems to complete the dedupe workflow.

The other attribute that accelerates time to market is simplicity. Despite being full featured, the ECOsystem for OEMs is accessed via a lightweight object-like API that OEM developers have told us is extremely simple to work with.

3) The solution has the flexibility to support specific use cases.
The requirements and functional expectations for implementing data reduction differ from device to device, and any embedded solution needs to have the adaptability to serve different applications. For example content-aware algorithms may not be a meaningful solution in a block array where data structures are completely opaque. Or CPU and memory constraints on a given device may require the use of a lighter weight dedupe workflow, and that shouldn’t force major architectural rework of the solution.

The ECOsystem for OEMs has been designed to support 6 embedded use cases: Servers, block arrays, NAS, object stores, cloud storage, and backup targets. Some of the differences between these tiers manifest themselves as implementation best practices, but there are also clear functional decision-points that allow an OEM to implement the right solution for the job. Importantly though, all of these tiers are compatible in key respects, allowing cross-platform manageability, and giving rise to end-to-end features such as optimized data movement.

4) The solution has high performance, while working within resource constraints.
Performance overhead is an often discussed problem associated with dedupe solutions. We’ve learned to solve these problems through 2 years of empirical experience, and believe we have the fastest, lowest-overhead dedupe solution. Moreover, ECOsystem for OEMs obeys hard constraints of the host platform in terms of CPU and memory usage.

There are a couple main points where dedupe can impose performance penalties:
A) During write, the chunking and lookup process takes time. Like other solutions, Ocarina does this in memory to reduce that penalty. You do have to be careful to understand for a given chunk size how much unique data that 1GB of dictionary can address, and given a constrained memory, how far can the solution go before forcing on-disk (=slowww) lookups. ECOsystem for OEMs utilizes the industry’s most memory efficient lookup design to make the most of existing resources.
B) During read, the reads from disk for any de-duplicated volume are more random than typical disk IO. Ocarina is also able to mitigate this impact through content-aware read-ahead caching that anticipates the next chunks that will be read from disk.
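The dictionary question in point A is simple arithmetic, sketched below in Python. The 32-byte entry size is purely an assumption for illustration (a hash fragment plus a location pointer); real products differ, and this says nothing about the ECOsystem's actual lookup design.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def addressable_unique_bytes(dictionary_bytes: int, chunk_size: int, entry_bytes: int) -> int:
    """Unique data an in-memory dictionary can index before lookups must go to disk."""
    entries = dictionary_bytes // entry_bytes   # one dictionary entry per unique chunk
    return entries * chunk_size


# entry_bytes=32 is a hypothetical figure, not a vendor specification.
for chunk_kib in (4, 8, 64):
    covered = addressable_unique_bytes(GIB, chunk_kib * 1024, entry_bytes=32)
    print(f"{chunk_kib} KiB chunks: {covered / TIB:.3f} TiB of unique data per GiB of dictionary")
```

Under these assumptions, larger chunks stretch a fixed amount of memory over more unique data, at the cost of finding fewer duplicates. That trade-off is why entry size matters so much on memory-constrained devices.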

5) The solution supports next-generation features that customers want and competitors don’t have.
Ocarina has spent a lot of time in the market talking to end-customers about what they want in a dedupe solution. In addition to things like “shrink well”, a couple things keep coming out. One is to keep data in shrunken form as it moves around in workflow operations such as replication, backup, tiering, etc. We call this end-to-end optimization (or E2EO) and it expands the value of dedupe by improving backup processes, reducing LAN bandwidth, and reducing the CPU overhead that occurs when data is repeatedly rehydrated and re-deduped for no reason. For those OEMs who carry a broad catalog of server, storage, and backup solutions, there are huge benefits in being able to deliver this end-to-end value to customers who have adopted that OEM across their entire IT infrastructure.

The other key feature is the ability to support structured (eg database) and semi-structured (eg VM) applications. The files in these applications are almost always >95% static and <5% active. But because they are active, traditional dedupe solutions can create a tremendous IO overhead as dedupe operations battle against a steady stream of changes. Instead of getting in the way, Ocarina has invented a way to dedupe the inactive portions of these files, while imposing no performance overhead on active IO. Like our dedupe chunking process, this is one of several content-aware features that Ocarina delivers in the ECOsystem for OEMs.

These advanced dedupe features — which bring data reduction benefits to new applications and workflows — allow storage OEMs and ISVs to deliver more value to customers, to do it faster, and differentiate their product over competitive offerings that have first-generation dedupe capabilities.

Expanding Role Of Data Deduplication

Posted by Matthew Harvey On May - 17 - 2010

I came across this article on InformationWeek “Expanding Role of Data Deduplication,” and I thought I would share it. InformationWeek created a survey of 437 technology professionals and picked their brains regarding the technology of dedupe and how they might be using it within their infrastructures. The report is available free for a limited time here: http://dedupe.informationweek.com/

This report gives a solid overview of dedupe and how it relates to the overall storage market. It’s very detailed, and I would encourage anyone interested in dedupe to download it and give it a read. Cheers!

To Pluto and Beyond!

Posted by Carter George On May - 7 - 2010

Thanks to EMC’s sponsorship, the effort started by the UC Berkeley School of Information in 2003 to quantify the global trends in creation of digital data continues under IDC’s able supervision. The latest report, “The Digital Universe Decade,” is hot off the press, and breaks the bank on crazy metaphors for how much data is produced and consumed. They have the usual “moon-and-back” metaphor, but here’s my favorite: 707 trillion copies of the 2000+ page U.S. Healthcare bill, stretched from Earth to Pluto and back 16 times. Nothing like a little dig at Congress while we’re at it!

The most interesting trends from our perspective were:

1) The growth in unstructured data continues to outpace structured data,

2) Thanks largely to widespread media-enabled cell phone use, 70% of data will be user-generated

3) This one really jumps out: 35% more digital information is created today than the capacity exists to store it. This number will jump to over 60% over the next several years.

So what the IDC numbers mean is that platter density can’t keep pace with the market. Although Jon Toigo (http://www.drunkendata.com/?p=2872) disputes that by arguing platter densities have kept ahead of IDC’s expectations, I expect the truth to be somewhere in the middle. Either way, finding more disk and datacenter space to store the deluge is going to present more challenges for sysadmins, who are already up to their necks in data.

Good news for vendors who supply storage hardware. No surprise it’s EMC who sponsors this study!

Administrative issues aside, it raises a question of IT sustainability. The IT budget in a typical Fortune 1000 enterprise needs to hold steady as a percentage of overall capex and opex. If IT must keep growing as a percentage of overall expenses, it has to consider itself non-scalable. That is exactly the scenario the report describes: storage budgets will grow as a percentage of opex and capex, and that can’t continue indefinitely.

So the market has two choices:

1) Throw away data.

2) Greatly improve utilization, through technologies like thin provisioning and data reduction.

History has already shown that throwing away is not going to happen because the IT staff managing it doesn’t own the data, and they have neither the tools nor the buy-in to begin expiring the data of their users. So primary storage optimization is getting a ton of attention. We anticipate data reduction features rapidly moving from “should have” to a “must have” in storage solutions globally, and enterprises globally are being encouraged to start evaluating reduction technology if they want to keep storage budgets in check. In fact we’re starting to see that sentiment in F1000 polling by TheInfoPro (www.theinfopro.com): Their two latest survey waves are showing Online Data Reduction as ranked 2 of 21 in their Storage Networking Technologies Heat Index.

In a perfect world hardware would be free and wouldn’t generate heat, users would diligently delete files, the IT staff would triple and we’d all win the lottery, which would allow you to safely store the 1.2 Zettabytes of data that’s crossing your LAN in 2010! But the boss says we shouldn’t count on those things, so we’re working hard on end-to-end solutions for a more intelligent way to deal with the data explosion.


In his comment in the on-going discussion of the CORE formula (Dedupe Rates Matter…Just Not as Much as You Think), Steve Kenniston had the following to say about the relationship between shrinking data on primary storage and backups:

Example, if I use Ocarina deduplication, but have already purchased Data Domain, don’t I need to re-hydrate the Ocarina deduplicated, primary storage data before I use Data Domain?  They say you do.  That means I don’t really save on my primary storage if I need the space to re-hydrate before I back it up and that also means processing time on the array.  Storwize, with random access compression doesn’t require decompression.


There are several interesting issues brought up here. They only relate to the CORE formula in that the formula does not account for them.

Today, if you have the most common dedupe for primary storage (NetApp dedupe) and the most common dedupe for backup (EMC’s Data Domain product), it works like this.


You start with a volume of, say, 16TB. NetApp Dedupe will shrink that to maybe 8TB. Then you go to backup. The backup server sends an NDMP request to the NetApp asking for a data stream to be sent to the backup target, the Data Domain. NetApp then rehydrates (expands) the 8TB back to 16TB and sends that stream to the Data Domain. The Data Domain will then dedupe that data back down to probably 4TB (since Data Domain dedupe is more sophisticated).

This is wasteful and has several negative consequences. It uses a bunch of CPU and I/O on the NetApp to rehydrate the data, which means performance might be slower for users while this is going on.   You have to use the full network bandwidth to move the whole 16TB to the Data Domain.  And you have to buy a Data Domain model big enough to handle 16TB of backup data instead of 8.
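The cost of rehydrating can be put in numbers with a few lines of Python. This is only a back-of-the-envelope sketch using the 16TB and 2:1 figures from the example; the function name is made up for illustration, and real backups add protocol overhead this ignores.

```python
def backup_transfer(logical_tb: float, primary_ratio: float, rehydrate: bool) -> dict:
    """Return TB held on primary storage and TB sent over the network for one full backup."""
    stored_tb = logical_tb / primary_ratio
    # A rehydrated stream is sent at full logical size; otherwise it stays shrunk.
    network_tb = float(logical_tb) if rehydrate else stored_tb
    return {"stored_tb": stored_tb, "network_tb": network_tb}


# Figures from the example: 16TB logical, shrunk 2:1 on primary storage.
with_rehydration = backup_transfer(16, 2, rehydrate=True)
without_rehydration = backup_transfer(16, 2, rehydrate=False)
extra_tb_on_wire = with_rehydration["network_tb"] - without_rehydration["network_tb"]
print(f"Rehydration puts {extra_tb_on_wire} extra TB on the network per full backup")
```

The primary-side savings are identical in both cases; the difference is entirely in what crosses the wire and what the backup target has to ingest.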


One thing that does not happen, which Steve seems to think is the case, is that you need 16TB of disk space on the NetApp to put all that expanded data in before it goes off to the backup.

The situation is ugly, but it is not that ugly.

Now, how would this take place if you used Ocarina dedupe and compression instead of NetApp. Let’s look at the following examples, and since NetApp has their own dedupe, let’s use a NAS filer that has nice integration with Ocarina, the BlueArc:

Scenario 1: Compression-only, BlueArc to Data Domain
Scenario 2: Dedupe and Compression, BlueArc to Data Domain, Full Backup
Scenario 3: Dedupe and Compression, BlueArc to Data Domain, Incrementals

To continue in the vein of the NetApp example, let’s say you started with 16TB. Ocarina will have shrunk that to maybe 4TB in the compression-only case, and maybe 2TB in the compression and dedupe case – because we shrink better than anyone.  In the first case, Ocarina replaces each file with a compressed version of the file in the same volume.  When you go to back up, you backup through a mount point that exports the volume without going through the decompression layer.

So this works just like it would with Storwize: the compressed versions of files go to the Data Domain. When you back up day after day, you will create duplicates, and the Data Domain will find and eliminate those.

It should be noted, though, that Data Domain results will be slightly worse with either Storwize or Ocarina compression, because compressing data makes it harder to find duplicates.

In the second scenario, we have not only compressed the data, but deduped it too. This makes backups more complicated, because the pieces and parts of a file may be spread around in many other files.  Doing a full volume backup is pretty straightforward though.  You first take a snapshot of the volume, and then back it up.  It is important to take a snapshot, because you need to make sure all the pieces of files are consistent at a point in time.   You back up the 2TB to the Data Domain.   Now, does this mean you don’t need a Data Domain, because dedupe was already done at the source?

No, not at all. The first time you backup a volume, Data Domain won’t shrink it much, if at all, because there shouldn’t be any duplicates in the data set. Ocarina dedupe is at least as good as Data Domain’s. However, backup is not something you do once. It’s something you do every day. So when you back up the next day, and the day after that, and so forth, the Data Domain will find plenty of duplicates, as you backup the same files over and over. Over the course of a month or so, the Data Domain will be getting its 20:1 dedupe ratio even though the source volume was perfectly deduped already!

The third scenario is the most complicated. Doing incremental backups from a NAS usually means backup software and NDMP.    The backup server, called a DMA, figures out which files have changed since the last backup, and then sends a request to the NDMP Data Server.      Normally, NDMP is a service provided by the NAS head, but when you have used Ocarina to dedupe a volume, Ocarina will provide the NDMP Data Server.  The NDMP data request will come to the Ocarina NDMP service and ask for the 1,237 files that have changed since yesterday.       Now, you could just rehydrate those files and send them to the backup target.  However, Ocarina has a dedupe-aware NDMP.  This will figure out which chunks are needed to rehydrate the 1,237 files requested by the backup DMA, and will create – on the fly, using no disk space – a self-contained NDMP data stream that is deduped within itself.

You might see some partial rehydration, because the backup stream needs to have every block in it necessary to recover the files being backed up. But there will be no duplicate blocks in the data stream that goes to the backup target.     What’s more, all those blocks or chunks will remain compressed.  So what shows up at the Data Domain is a file-level incremental backup that is both deduped and compressed.  This allows file level restores by the backup software DMA (like NetBackup or Commvault).
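Here is a rough Python sketch of what a self-contained, deduped incremental stream could look like. It is a toy model, not Ocarina's NDMP implementation: the fixed-size chunking, the dictionary layout, and every function name are assumptions made for illustration. The point is only that the stream carries each needed chunk once, still compressed, plus the per-file recipes needed for file-level restore.

```python
import hashlib
import zlib


def ingest(files: dict, chunk_size: int = 4):
    """Toy primary store: compressed unique chunks plus a digest recipe per file."""
    store, recipes = {}, {}
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), chunk_size):
            piece = data[i:i + chunk_size]
            digest = hashlib.sha256(piece).hexdigest()
            store.setdefault(digest, zlib.compress(piece))   # kept compressed at rest
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def build_incremental_stream(store: dict, recipes: dict, changed_files: list) -> dict:
    """Collect only the chunks the changed files need, each exactly once."""
    stream_chunks, stream_recipes = {}, {}
    for name in changed_files:
        stream_recipes[name] = recipes[name]
        for digest in recipes[name]:
            stream_chunks.setdefault(digest, store[digest])  # stays compressed in flight
    return {"chunks": stream_chunks, "recipes": stream_recipes}


def restore_file(stream: dict, name: str) -> bytes:
    """File-level restore using nothing but the stream itself."""
    return b"".join(zlib.decompress(stream["chunks"][d]) for d in stream["recipes"][name])
```

Because the stream contains every chunk its own files reference, the backup software can restore any file in it without going back to the primary store, which is the "partial rehydration" trade-off described above.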

OK, now let’s take one more example, because this is the way you’d do data reduction as an enterprise strategy, rather than as a point solution for one filer. We call it end-to-end dedupe.

Scenario 4:   Dedupe and Compression, BlueArc to BlueArc

In this case, we’re going to backup from one Ocarina-enabled BlueArc to another. The backup software will see the second BlueArc as an NFS backup target, just as it does a Data Domain.     In either the full or the incremental case, the NDMP service will call the Ocarina NDMP Data Server on the Primary BlueArc.  But when the backup starts, Ocarina will query the target on a known port to ask if it is Ocarina-aware.  If the answer is yes, then instead of sending the data, Ocarina’s dedupe-aware NDMP will send just the hashes.

The target-side BlueArc (acting in place of the Data Domain) will examine those hashes and determine if it has any of that data already. It uses a negative acknowledgement protocol to tell the source BlueArc which chunks it needs to execute the backup.  On the first backup, this will be all of the chunks or objects.  But on subsequent backups, the dedupe in both places will be synched up, and only net new data is moved.
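The hash exchange can be sketched in a few lines of Python. This is a simplified model, not the actual wire protocol: the class and function names are hypothetical, and the port query, compression, and error handling are all left out. The source offers digests, the target answers with the ones it lacks (the negative acknowledgement), and only those chunks move.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


class BackupTarget:
    """Toy Ocarina-aware target that remembers every chunk it has received."""

    def __init__(self):
        self.chunks = {}

    def missing(self, offered: list) -> list:
        # Negative acknowledgement: name only the digests we do NOT already hold.
        return [d for d in offered if d not in self.chunks]

    def receive(self, chunks: dict) -> None:
        self.chunks.update(chunks)


def run_backup(source_chunks: dict, target: BackupTarget) -> int:
    """Send hashes first, then only the chunks the target asked for. Returns chunks moved."""
    offered = list(source_chunks)
    needed = target.missing(offered)
    target.receive({d: source_chunks[d] for d in needed})
    return len(needed)


target = BackupTarget()
day1 = {digest(c): c for c in (b"alpha", b"beta", b"gamma")}
moved_first = run_backup(day1, target)      # first backup: everything is new
day2 = dict(day1)
day2[digest(b"delta")] = b"delta"
moved_second = run_backup(day2, target)     # later backup: only net new data moves
```

In this run the first backup moves all three chunks and the second moves only the one new chunk, which is the "synched up" behavior described above.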

Now, in this case, you have true end-to-end dedupe. You dedupe and compress the primary storage. When it comes time to backup, you do not need any disk to store rehydrated data. Rather, you engage in an intelligent conversation with the backup target and move only the blocks, chunks, or objects needed to complete the backup, keeping every chunk in its compressed form. You can do this from any Ocarina-enabled source to any Ocarina-enabled target.

So it applies to more than just backups. This kind of optimized end-to-end solution applies to replication, tiering (primary to nearline, for example), migration, archiving (primary to object store), and backup.

Some significant percentage of the I/O’s in a data center are done not for user I/O or application I/O, but are done in support of storage management workflows. Ocarina can be deployed as an enterprise storage optimization solution – not a point solution for one NAS filer, not even a solution for one data center.  Ocarina can be deployed across multiple tiers of storage – NAS, block, DAS, archive, object, cloud and backup – and then all storage management workflows will operate using dedupe-aware compressed data.   The benefits here are not just saved disk space (and power and cooling and so forth), but network bandwidth, time, backup window reduction, and more.

If you are a storage vendor reading this, you should know that Ocarina now has a complete storage integration SDK. This is a set of APIs and documentation for integrating Ocarina, in-band or post-process, inside different types of storage. There is a framework for file system-based products, a framework for block products (DAS and storage arrays) and a framework for cloud and object stores (get/post/put model). If you want consistent, compatible, and integrated data reduction across your whole storage product line, give us a jingle.

One More Time: In-Band Versus Post-Process

Posted by Carter George On April - 23 - 2010

My Mom always used to tell me, “You have to be able to distinguish between Need and Want”.
I need a car.  I  want an Aston Martin.

In Steve Kenniston’s post Storage’s 2010 Hottest Technology he says that customers who “require real-time random access compression….in front of their active solution” need something very fast.
Well, yes, they do.


But having a requirement to go fast has nothing to do with in-band versus post-process. And saying that customers who require something in front of their active solution need an in-band solution is like saying that customers who require an in-band solution require an in-band solution.


Who requires something “in front of their active storage”?


Some customers may want that, some might prefer it, but who needs it? In many cases, you could also use a solution that is next to your active solution, or inside it.   As long as it is fast enough and does a good job of data reduction, I’ll let you decide which one is a car and which one is an Aston Martin.

Let’s just run through some of the comparison points between in-band and post-process methods for data reduction and see what the trade-offs are.

As for why we’re talking through this old saw again, Storwize is an in-band appliance that sits on the wire in front of your active NAS storage. Ocarina sells an optimization appliance that does post-process compression and dedupe, but we also sell software-only solutions (mostly embedded inside storage vendors’ products) that work in-band inside the active storage system.

What Exactly Is Happening In-Band or Post-Process? The only thing this discussion is about is when data gets shrunk.  Every solution does real-time random access for users in-band.    When users or applications do I/O to their already-compressed data, it is always handled in-band, real-time, and transparent.  I think that’s true of every solution mentioned in the CORE table.    When people talk about in-band versus post-process, they are talking only about when the data gets compressed, not about when it gets decompressed.

Speed.     In-band versus post-process has absolutely nothing to do with the speed of a solution.    Time-to-compress can be just as fast post-process as it is in-band.    To demonstrate this, Ocarina turned off 111 of its 112 compressors, and also turned off our two-stage dedupe engine, and ran just our fastest compressor on a set of data.    We got 3508MB/sec throughput.     I don’t think anyone would buy Ocarina and then use only our simplest fastest compressor, but if they did, they’d find it goes as fast as any in-band solution on the market.      There is an implication regarding speed and in-band, though, and that is that in-band solutions can only run data reduction algorithms that go fast, because they are sitting there in the path of every I/O.  If the in-band solution is slow, then all your I/O is slow.   So yes they are fast, but this also limits the range of what they can do.  If the type of data you have requires a slower compressor to get good results – photos and video would be good examples – then you just can’t use an in-band solution.

How Much Disk I/O? In-Band solutions have a significant advantage in the area of disk I/O….or at least they do when they are getting good compression.     Because in-band solutions compress data before it gets written to disk, if the data has been shrunk by 50%, then there is 50% less data to actually write out to disk.  If you are I/O bound, and if a compressor is getting good results on your data stream, then this can help performance a lot.  You are simply doing a lot less I/O.  When dedupe vendors say that they ingest 1TB/hour, they don’t usually mean that they actually write 1TB of data per hour out to physical disk.  What they mean is, if you assume 20:1 dedupe, to store 1TB of logical data requires 50 Gigabytes of data to be written to disk.  The other 950GB are thrown away, because they were duplicates of data already stored on disk.   You have to be careful with these dedupe throughput claims – they assume you are getting the expected dedupe results.

If you sent 1 Terabyte of totally new and unique data, with no duplicates to data seen before, and the dedupe appliance actually had to write out 1 TB of physical data to disk, most of them would be quite a bit slower than their claimed performance. But for both dedupe and compression, if you get good data reduction, then you can improve I/O performance by being in-band.   Just as is the case with dedupe, though, if you are not getting a good compression result, then you won’t see this performance improvement.
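The arithmetic behind those throughput claims can be sketched in a few lines of Python. This is only an illustration: the function names and the 50 GB/hour physical write rate are assumptions chosen so the numbers line up with the 1TB/hour at 20:1 example above. They are not figures from any vendor, and a real appliance is rarely limited by disk writes alone.

```python
def physical_gb_written(logical_gb: float, reduction_ratio: float) -> float:
    """GB that actually reach disk after data reduction (20.0 means 20:1)."""
    if reduction_ratio < 1:
        raise ValueError("reduction_ratio must be >= 1")
    return logical_gb / reduction_ratio


def logical_ingest_gb_per_hour(disk_write_gb_per_hour: float,
                               reduction_ratio: float) -> float:
    """Logical ingest rate, assuming the appliance is bound by disk writes."""
    return disk_write_gb_per_hour * reduction_ratio


# Hypothetical appliance whose disks sustain 50 GB/hour of physical writes.
DISK_RATE = 50.0

written = physical_gb_written(1000.0, 20.0)             # 1 TB logical at 20:1
claimed = logical_ingest_gb_per_hour(DISK_RATE, 20.0)   # the advertised case
unique = logical_ingest_gb_per_hour(DISK_RATE, 1.0)     # all-new data, 1:1

print(f"written to disk: {written:.0f} GB")
print(f"ingest at 20:1:  {claimed:.0f} GB/hour")
print(f"ingest at 1:1:   {unique:.0f} GB/hour")
```

Under these assumptions the same hardware that "ingests 1TB/hour" at 20:1 only manages a twentieth of that when every byte is new.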

All or Nothing. In-band solutions compress everything that goes by.   They have to.  Because of the speed constraints, they don’t typically have time to analyze data, figure out what it is, who it belongs to, or whether it falls in policy for some action or another.     Post-process solutions let the administrator decide which data to shrink, when to shrink it, how to shrink it, and even where to put the data.  This is probably the single biggest advantage of the post-process approach.    You may decide to dedupe and compress all files that have not been accessed for a month.   You might decide to dedupe all files that are virtual machine files, such as VMDK’s.  You may decide to apply fast compression only to files that are between 2 days old and a month old.  You may decide not to compress hot data, any file that has been read or modified in the last 24 hours, at all.

You may decide to run compression and dedupe jobs only at off-peak hours. You may decide to have the solution read a file from a Fibre Channel tier of drives, dedupe and compress the data, and write it out to a less expensive SATA tier.  This fine grain control, which allows active data management as part of the data reduction solution, makes post-process a clear winner for most unstructured file data.    For transactional databases, since the data is always hot, in-band may be the better approach.
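To make the policy idea concrete, here is a minimal Python sketch of how such rules might be expressed. It is not Ocarina's policy engine or anyone else's: the `FileInfo` type, the action names, and the decision to fold the separate "you may decide" examples above into one rule set keyed on a single idle-time value are all assumptions made for illustration.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds in a day


@dataclass
class FileInfo:
    path: str
    idle_seconds: float  # time since the file was last read or modified


def choose_action(f: FileInfo) -> str:
    """Return a hypothetical action name for one file."""
    idle_days = f.idle_seconds / DAY
    if idle_days < 1:
        return "skip"                # hot data: leave it alone
    if f.path.lower().endswith(".vmdk"):
        return "dedupe"              # virtual machine images
    if idle_days > 30:
        return "dedupe+compress"     # untouched for a month
    if idle_days >= 2:
        return "fast-compress"       # between 2 days and a month old
    return "skip"                    # 1 to 2 days old: no rule applies


for f in (FileInfo("report.doc", 0.5 * DAY),
          FileInfo("vm01.vmdk", 10 * DAY),
          FileInfo("report.doc", 10 * DAY),
          FileInfo("report.doc", 45 * DAY)):
    print(f.path, round(f.idle_seconds / DAY, 1), choose_action(f))
```

An in-band device has no time to ask these questions on every I/O; a post-process job can walk the file system and ask them at leisure.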

What About Existing Data? Let’s say you have an EMC Celerra filer with 100TB of file data on it.  You buy a fancy new in-band compression solution.    New data that is being written gets compressed as it comes in. But what about that 100TB of data that’s already there?    Well, in-band solutions may give you a tool to read that data, compress it, and put it back….making it a post-process solution!   The fact is, most customers already have a lot of data.  Usually, customers who are out looking for a data reduction solution have lots and lots of data.    If you have a Petabyte of data sitting on the floor, and you can dedupe and compress that data down to 100 Terabytes, you just created 900 Terabytes of free disk.  That’s tremendous savings, and you can only get it by post-processing the data that is already there.

Data Integrity. This is sort of the skeleton in the closet for in-band solutions.     All data reduction vendors go to great lengths to ensure data integrity.  We all do checksums, or even bit-by-bit comparisons.   But when you do in-band compression, that means that you have never had a full original copy of your data on disk.     If there’s a bug in a compression algorithm (and let me be clear – I’m not accusing any vendor of any such thing) there is no original uncompressed data to go back to.

In a post-process solution, data gets written out to disk in its full original form. You decide when to compress it. The conservative shop will say, compress everything after it is at least 24 hours old. Why? Because then you will have at least one backup of the original data taken by your backup solution. Now you can go ahead and shrink the data, saving disk space on your primary storage. But if there ever were a data corruption bug – and really, as a group, all of the vendors listed in the CORE score box go to great lengths to guarantee there isn’t, but if – then you’d have a full original copy to go back to. The CIO of a major German bank once told me that for this reason, the bank would never implement an in-band solution. He said, logically, I understand the protections that are in place and that they should be sufficient, but psychologically and from a risk perspective, I cannot bet the data of the bank and its customers on those assurances.

Dedupe is Always a Post-Process.  No, Really! All dedupe, even solutions that say they are in-band, is a form of post-process.     One of the things that strikes me about the CORE article and the discussion that follows is that the line of reasoning is, “dedupe is the most important thing to happen in storage since forever, so you should go out and buy compression”.   Well, dedupe is a form of compression, and it’s one that compares new in-coming data to data that has already been stored.  To do this, you hold in-coming data somewhere while you compute a hash, look up the hash in an index, and find out whether the data is a duplicate of something you’ve already stored.    Solutions, like Data Domain, that claim to be in-band, hold that in-coming data in a very small holding tank (an NVRAM card) and process it quickly.  The solutions that are called post-process write the data out to disk and then figure out whether it’s a duplicate or not.  If it is, they get rid of it.   In both cases the data is being stored somewhere until the dedupe process runs – it’s just a question of where and for how long.    And dedupe is really important – for online data as much as for backup.
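The hash-and-look-up step described above can be sketched as a toy Python class. To be clear about what is assumed: fixed-size blocks, SHA-256 as the hash, and an in-memory dictionary standing in for both the index and the disk are choices made for the example, and the class and method names are hypothetical. No vendor's actual implementation is being shown.

```python
import hashlib


class DedupeStore:
    """Toy fixed-block dedupe store: an index from block hash to stored block."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}        # hash -> block bytes (stands in for disk)
        self.logical_bytes = 0  # bytes the clients think they wrote

    def write(self, data: bytes) -> list:
        """Store data and return its 'recipe', the ordered list of block hashes."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]   # held while we hash it
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:         # not seen before: keep it
                self.blocks[digest] = block
            recipe.append(digest)                 # new or duplicate, reference it
        self.logical_bytes += len(data)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the original data from its recipe."""
        return b"".join(self.blocks[d] for d in recipe)

    def physical_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())


store = DedupeStore(block_size=4)
r1 = store.write(b"AAAABBBBAAAA")
r2 = store.write(b"BBBBCCCC")
print("logical:", store.logical_bytes, "physical:", store.physical_bytes())
```

Whether `block` lives in NVRAM for microseconds or on disk for a day before the lookup runs is exactly the in-band versus post-process distinction; the lookup itself is the same.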

So Which One is Better? Neither. There is no one right answer. Different use-cases call for different solutions. If you have an OLTP database running on a NAS share, in-band compression is probably the best bet. If you have lots of unstructured file data, I’d say post-process (including dedupe) is going to be better most of the time. But there are other use-cases, and there are exceptions even to those generalizations. What I will say is this. Most research, such as the IDC data on storage growth, shows Tier 1 Primary and database data as having very flat growth year-to-year, while the same research shows unstructured file data – Office docs, email, SharePoint, photos, video, internet “stuff” – as growing at almost out-of-control rates.


Both in-band compression and full compression and dedupe solutions have their place, but the most important thing for data reduction is to attack the place where data growth is causing the most pain. For the most part, you’ll find that that growth is in unstructured file data, that it is not in the most performance-sensitive and finely-tuned Primary tier, and that if you got 75% savings on just the files in your data center that have not been modified for at least three months, you’d be flabbergasted at the savings.

Next:     Do you have to expand data to back it up?

Words of Wisdom: Let’s get Organized

Posted by Carter George On April - 22 - 2010

Let’s Organize This Discussion A Bit!

I think there are at least three interesting topics you bring up here that are worth exploring further. Please see the recent post by David Vellante on Dedupe Rates Matter…Just Not as Much as You Think

I’ll respond with a separate post on each of the following topics, so that the community can track and participate in the threads that they find interesting and relevant, and ignore the ones they don’t.
The topics are:

1. Is the CORE Formula Flawed?
2. In-Band versus Post-Process: Is it even worth arguing about?
3. Do you have to expand data to back it up?

Post 1. Is the CORE Formula Flawed?

I’m going to argue here that the CORE formula is flawed, but I want to say right off that the CORE formula has already added value to the community.
It’s a great idea to try to measure the value and effectiveness of different data reduction solutions, and by putting a first shot at it out there, there’s already a vigorous discussion going on here that is going to increase awareness of the issues involved, and will probably lead to a new and improved version of the formula. I think we want to recognize the value that Wikibon has brought to both customers and vendors by starting this whole brouhaha.
That said, yes, the CORE formula is deeply flawed.

Weighting the Factors to Reflect Customer Priorities

From a purely mathematical perspective, the formula is most fundamentally flawed because it builds in a bunch of value judgments as Constants that should really be Variables. The weighting of the values in the different columns may have different levels of importance to different customers, and they should therefore be able to assign that weight themselves, for their environment.
Instead, the weighting is hard-coded. The formula tells the customer that “time to compress” is more important than actual compression results.

That will certainly be true for some customers. It’s not true for others.

What I’d recommend to fix this is simple. Let a customer fill in a value to rate the importance of each column to them. Weight each column from 1 to 5, or 1 to 10. Then multiply the normalized score for each column by the customer’s assessment of its importance to them. That way, if “time to compress” is the most important thing to a customer, they can rate it 10. Great. If they rate that 5, and rate compression results 9, then the end score will better reflect their needs and priorities.
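Here is a rough Python sketch of that suggestion. It does not reproduce the actual CORE formula; it only shows the weighting step. The column names, the two made-up products, and their normalized scores (0.0 to 1.0) are assumptions invented for the example.

```python
def weighted_score(normalized: dict, weights: dict) -> float:
    """Sum of each column's normalized score times the customer's weight."""
    missing = set(normalized) - set(weights)
    if missing:
        raise KeyError(f"no weight given for: {sorted(missing)}")
    return sum(normalized[col] * weights[col] for col in normalized)


# Hypothetical normalized scores for two made-up products.
product_a = {"time_to_compress": 0.9, "compression_results": 0.4}
product_b = {"time_to_compress": 0.5, "compression_results": 0.8}

# Two customers with different priorities, each weighting on a 1 to 10 scale.
speed_first = {"time_to_compress": 10, "compression_results": 3}
savings_first = {"time_to_compress": 5, "compression_results": 9}

for name, weights in (("speed first", speed_first),
                      ("savings first", savings_first)):
    a = weighted_score(product_a, weights)
    b = weighted_score(product_b, weights)
    winner = "A" if a > b else "B"
    print(f"{name}: A={a:.1f} B={b:.1f} -> product {winner}")
```

With these made-up numbers the ranking flips depending on who is doing the weighting, which is the whole point: a hard-coded weight makes that decision on the customer's behalf.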

Include More Things That Matter

Second, the table simply does not represent all of the things that matter. Curtis Preston and others have pointed out that “time to decompress” or response time, might be more important to some customers than time to compress. One measures how fast a customer realizes disk space savings. The other measures the response time users and applications will see when they access compressed data. Those are different things. As the poster from EMC noted, there are different kinds of performance, and different online storage use-cases may put more weight on one versus another. Some applications need streaming write throughput, others need sequential read performance, while others need the ability to seek to the middle of a file and modify a few bytes with low latency.

Admittedly, it would be hard to measure all these things without getting overly complicated, but having at least “time to compress” and “response time” called out separately would be useful to most shops.

There are other less measurable, but potentially important, intangibles. Perhaps you could add a column with a list of features or product characteristics and let the customer assign a value from 0 to 10 for each of those things. If they are not important, fine, give them a 0. But if they are important, that lets the customer express their priorities in the score.

Here are just some things that I think at least some customers would think are worth evaluating as part of a data reduction solution:
• All or Nothing: Does the solution compress everything, or can I choose what to compress based on policy?
• Does the solution do compression?
• Does the solution do dedupe?
• Can I back up data in its most-compressed form, or do I have to expand it to back it up?
• Is the solution considered certified, validated, or supported by my storage vendor?
• Does the solution have an HA or fault tolerance capability?
• Does the solution have the ability to scale up by adding multiple nodes to work on a single volume or namespace?

I’m sure there are others, and maybe the CORE formula, which is supposed to be a measure of Effectiveness rather than overall product merit, would consider these factors out of scope. And that’s fair. But if the CORE metric is to be used as a score for a product, then it should be clearer about the key features and topics that it is not covering, but that customers might want to ask about.

Garbage In, Garbage Out

I think the CORE formula has the right idea with the columns on cost and how well a product can shrink data. However, those columns are meaningless unless the data in them is accurate. I think that data needs to come from some sort of vendor-neutral benchmarking site or analyst. If you put in numbers from web site claims, customers are going to have high expectations. The fact is, results are going to vary by what kind of data the customer has. You can’t tell me that any solution is going to get the exact same shrinking results on, say, a volume full of VMware VMDK files, a volume full of medical images, a volume full of Microsoft Office files, and a volume holding an Oracle OLTP database.


Cost is even trickier. Dedupe for ZFS and NetApp is free – it is a feature of the file system, and you don’t have to pay extra to get it. Of course, you have to be using NetApp storage to get NetApp dedupe, and that comes with a certain cost premium, but how do you compare that with solutions like Storwize and Ocarina that are separate standalone solutions with a specific price tag?

The first problem can be fixed, if the community gets together and hosts 3, 5, or 10 different public data sets somewhere. Each vendor can download the data, run their wares, and report back the results. There’s no protecting against people who lie in this case, but vendors that do that will soon be caught out by customers. Then a customer could go to that neutral site, pick which sample data set best reflects their own data (OLTP database, medical images, home shares, consolidated virtual machines, whatever) and put that value in the CORE formula column for results. This is a case where the formula is not flawed – the formula is fine, but the data has to be valid for the data a customer has. If we work together, we can help put good data in.

The second problem vexes me. I don’t really know how to correctly address the cost problem. My suggestion? Take it out.    If the CORE metric gives you a good value for the Effectiveness of different solutions, then customers can assess cost in ways that make sense to them – including vendor discounts, what storage they have, and so forth.    No customer is going to ignore cost when making a purchase decision – so I think we can count on buyers to figure out how they want to factor costs in to their decision-making process.

Summary

So there you have it.
I think the CORE formula is a great first pass attempt at doing something valuable to the community. It is deeply flawed – and I’m not the only one who thinks so.  But it can be improved, and I would love to see that happen.     These are my ideas, and I see others have contributed good insights as well.     I’m keen to see where this goes from here.   My next post will address the tired old topic of in-band versus post-process.

Deduplication: From Point Solution to Data Center Strategy

Posted by Carter George On April - 19 - 2010

Deduplication has been a hot topic in storage for several years now. Most of the focus has been on dedupe appliances sold in the backup market, by companies like Data Domain and Diligent (now IBM ProtecTIER). There have been dozens of articles written explaining the basic concepts, and comparing the implementations by various vendors.

Dedupe is also becoming more prominent in primary storage, with NAS market leader NetApp including a basic dedupe feature on every node.  EMC followed with whole file dedupe on its Celerra family of products.  Other vendors are now introducing dedupe as a feature on block storage arrays, in the cloud, and on nearline and archival storage.

In effect, what we’re seeing is the emergence and transformation of deduplication (and some other related data reduction techniques) as a new storage fundamental.     Rather than being a standalone solution that customers pay a premium for, dedupe is becoming something that will be a standard feature on most mid-market and enterprise storage products by 2012.

This blossoming of dedupe is happening faster than with other value-add storage features that have followed similar paths. Data reduction in general, and dedupe in particular, represents a technology whose time seems to have come. There are two drivers. One is exponential storage growth, which is both straining IT capital expenditures and outpacing the ability of cheap disk drives to keep up. The other is that economic pressure and tightened budgets have made the traditionally conservative enterprise storage buyer willing to look at new technology that can store data more efficiently and at less cost. It’s also a technology that works because the data driving storage growth is unstructured data, data created by people. Database business applications are not driving storage growth – people are. It’s office documents, email, photos, and videos that are driving the information revolution, and it is in the nature of humans to create lots of copies, variants, and versions of information – in other words, humans (as opposed to database applications) are a lot more likely to create a bunch of duplicate, inefficiently stored data. Dedupe goes and finds all that, and takes the unnecessary copies out of the picture, and on the actual storage, all those duplicate copies point to one shared place. The benefits are compelling – it’s not just saving disk space, but ultimately it’s also saving power, cooling, rack space, and all the other things you didn’t have to buy when you avoided buying another rack of disks.

That said, not all dedupe is created equal. It comes in multiple flavors – fixed block or variable, block aligned or sliding window, in-band and post-process and so forth.   There are those who argue fervently that one approach is the only true way to dedupe, as though it were some sort of medieval religious dispute. What matters is that the dedupe method chosen matches the use case for which it is being used.

Dedupe and Compression:  Friends or Foes?

Dedupe is not always the best approach to data reduction either. Some data sets – like collections of virtual machines and repetitive backups of the same volumes – lend themselves very well to deduplication.  Other data sets, such as corporate file shares and primary storage, respond better to compression.  The goal of the IT user should be to store and move data in the most efficient way.   Lost in the hype around dedupe is the fact that quite often compression – which is seeing a renaissance in research after years of relative inactivity – is better at shrinking data than dedupe is.     The two technologies are not mutually exclusive.    It is possible to apply both dedupe and compression to the same data, but finding the optimal balance is easier said than done.

For dedupe to work well, data is chunked up into small blocks, which are then compared to see if any are the same. Duplicate blocks can be discarded, saving space. The smaller the block used, the more likely it is to find dupes. Whereas dedupe gets the best data reduction with smaller chunks, compression gets better results with larger chunks. Compression works by looking at patterns and then making predictions. If you can predict the next thing, you can compress it. To find patterns and predict data better, compression likes to have more context, and that means that bigger chunks work better. For any given data set, there’s an optimal balance, but there is no one right answer that works best for every data set.
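The opposing pull of chunk size on dedupe and on compression can be seen in a small Python experiment. Treat it as a sketch under loud assumptions: the data is synthetic (256 pieces drawn at random from a pool of 16 random 256-byte pieces), dedupe is naive fixed-block hashing, and compression is plain zlib applied to each chunk on its own. Real data and real products will behave differently, and the helper names are hypothetical.

```python
import hashlib
import random
import zlib


def chunks(data: bytes, size: int) -> list:
    """Split data into consecutive fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedupe_size(data: bytes, size: int) -> int:
    """Bytes kept if identical fixed-size chunks are stored only once."""
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks(data, size)}
    return sum(unique.values())


def compress_size(data: bytes, size: int) -> int:
    """Bytes kept if every chunk is compressed independently with zlib."""
    return sum(len(zlib.compress(c)) for c in chunks(data, size))


# Synthetic data set: 64 KB assembled from a pool of 16 random pieces.
rng = random.Random(0)
pool = [bytes(rng.randrange(256) for _ in range(256)) for _ in range(16)]
data = b"".join(rng.choice(pool) for _ in range(256))

print("chunk  dedupe  compress")
for size in (256, 4096):
    print(size, dedupe_size(data, size), compress_size(data, size))
```

On this data the small chunk size is where dedupe finds its duplicates, while zlib has nothing to work with inside a single 256-byte chunk and only starts to find repeats once a chunk is large enough to contain them.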

The Dedupe Transformation

At the Gartner Data Center Conference held in December 2009, the audiences of three different sessions were polled. In the Gartner report, “Data Deduplication will be even bigger in 2010*,” following the event, analyst Dave Russell commented on the poll results. “If the 56% of those with some plans for deduplication in 2010 are combined with the 14% that are using deduplication for only a portion of their backups, and if, as in years past, 2% to 4% of those with no current plans to deploy the technology do implement it, then it’s conceivable that 72% to 74% of the audience will adopt deduplication by year-end 2010.”

A similar poll by SearchStorage of storage buyers found very similar trends. Almost everyone either plans to deploy dedupe for backup or is evaluating it. Likewise, while only 17% have deployed it for primary storage, a staggering 60% are planning to either buy or evaluate dedupe for primary in the next year. This means that CIOs and IT Directors are expecting dedupe to have the kind of impact on IT operations that virtual machines have had over recent years.

All the major players in storage will need to decide on a dedupe strategy, bringing out standalone products in dedupe-centric markets like backup, and adding dedupe as a feature set to existing primary and nearline products in the other storage tiers. Some of the large vendors may look to startups to acquire the technology they need, either through OEM deals or acquisitions, while others will develop in house. In either case, within two years at most, dedupe will have become a storage fundamental, and the landscape of vendor offerings will have changed. Today, the playing field consists of niche offerings by the big vendors and a set of innovative startups looking to break out. Within two years, some of those startups will have gotten design wins with major vendors, some will follow in Data Domain’s footsteps and get acquired, and some will be left holding the short end of the stick.

Point Solution or Coherent Strategy?

Finally, there is one key mystery left in the dedupe marketplace, which is headed towards a situation where every storage product will have deduplication built in. At current course and speed, all of those dedupe implementations will be inconsistent and incompatible with one another, even inside the product line of a single vendor. Will that continue to be the case as dedupe becomes a standard feature? If you look at today’s two market leaders in dedupe – NetApp for primary and Data Domain for backup – you’ll see a painful scenario that we expect to see played out over and over again over the next few years. Take a volume on NetApp filled with 16 Terabytes of data. NetApp dedupe might shrink that data to 8TB, a great space savings. But when it comes time to back that data up, the NetApp rehydrates (expands) the 8TB back to the full original 16TB to send it to the backup target. Let’s say the backup target is a Data Domain. Now the network needs to carry that whole 16TB, and the NetApp storage controller has to use a lot of CPU to put the data all back together, possibly slowing down other applications trying to use the NetApp. You have to buy a Data Domain model big enough to handle that 16TB ingestion in your available backup window. When the 16TB of data gets to the Data Domain, it will be deduped again, using different algorithms, getting back down to 8TB or less. In this process, a huge amount of CPU, network bandwidth, and time have been consumed to expand data and then shrink it again. For a market focused on getting to better storage efficiency, this is blatantly wasteful. This is the case today with almost any dedupe-for-primary solution backing up to a dedupe-for-backup target. What would be more useful to customers would be a consistent and compatible dedupe, allowing data that has already been deduped to be moved in its compressed format to other storage products that support a compatible implementation.
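The cost of that round trip is simple arithmetic, sketched below in Python using the 16TB and 8TB figures from the scenario above. The function name and the `rehydrate` flag are hypothetical, and the "compatible format" case assumes a capability that, as described, products do not offer today.

```python
def backup_transfer_tb(logical_tb: float, primary_ratio: float,
                       rehydrate: bool) -> float:
    """TB that cross the network for one full backup of a deduped volume."""
    if rehydrate:
        return logical_tb                 # data is expanded before sending
    return logical_tb / primary_ratio     # data travels in its reduced form


LOGICAL_TB = 16.0
PRIMARY_RATIO = 2.0  # 16TB stored as 8TB on the primary filer

today = backup_transfer_tb(LOGICAL_TB, PRIMARY_RATIO, rehydrate=True)
compatible = backup_transfer_tb(LOGICAL_TB, PRIMARY_RATIO, rehydrate=False)

print(f"sent with rehydration:    {today:.0f} TB")
print(f"sent in compatible form:  {compatible:.0f} TB")
print(f"avoidable network volume: {today - compatible:.0f} TB")
```

The sketch counts network volume only; the CPU spent expanding the data on one side and shrinking it again on the other comes on top of that.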

A good deal of the I/O workload in a given shop is driven by a handful of common storage management workflows – backup, replication, migration, and tiering – and all of those workflows would be more efficient if they could be done using data that had been compressed and deduplicated. To truly deliver on the promise of storage efficiency, we’ll look to see some vendors deliver consistent and compatible dedupe and compression that works across products, supporting dedupe-aware versions of those key workflows transparently.