One More Time: In-Band Versus Post-Process


My Mom always used to tell me, “You have to be able to distinguish between Need and Want”.
I need a car. I want an Aston Martin.

In his post Storage’s 2010 Hottest Technology, Steve Kenniston says that customers who “require real-time random access compression … in front of their active solution” need something very fast.
Well, yes, they do.


But having a requirement to go fast has nothing to do with in-band versus post-process. And saying that customers who require something in front of their active solution need an in-band solution is like saying that customers who require an in-band solution require an in-band solution.


Who requires something “in front of their active storage”?


Some customers may want that, some might prefer it, but who needs it? In many cases, you could also use a solution that is next to your active solution, or inside it.   As long as it is fast enough and does a good job of data reduction, I’ll let you decide which one is a car and which one is an Aston Martin.

Let’s just run through some of the comparison points between in-band and post-process methods for data reduction and see what the trade-offs are.

As for why we’re talking through this old saw again, Storwize is an in-band appliance that sits on the wire in front of your active NAS storage. Ocarina sells an optimization appliance that does post-process compression and dedupe, but we also sell software-only solutions (mostly embedded inside storage vendors’ products) that work in-band inside the active storage system.

What Exactly Is Happening In-Band or Post-Process? The only thing this discussion is about is when data gets shrunk. Every solution does real-time random access for users in-band. When users or applications do I/O to their already-compressed data, it is always handled in-band, in real time, and transparently. I think that’s true of every solution mentioned in the CORE table. When people talk about in-band versus post-process, they are talking only about when the data gets compressed, not about when it gets decompressed.

Speed. In-band versus post-process has absolutely nothing to do with the speed of a solution. Time-to-compress can be just as fast post-process as it is in-band. To demonstrate this, Ocarina turned off 111 of its 112 compressors, also turned off our two-stage dedupe engine, and ran just our fastest compressor on a set of data. We got 3508 MB/sec throughput. I don’t think anyone would buy Ocarina and then use only our simplest, fastest compressor, but if they did, they’d find it goes as fast as any in-band solution on the market. There is one implication regarding speed and in-band, though: in-band solutions can only run data reduction algorithms that go fast, because they sit in the path of every I/O. If the in-band solution is slow, then all your I/O is slow. So yes, they are fast, but this also limits the range of what they can do. If the type of data you have requires a slower compressor to get good results (photos and video would be good examples), then you just can’t use an in-band solution.
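To get a feel for that speed-versus-ratio trade-off on your own data, a rough sketch like the one below can help. It uses two general-purpose compressors from the Python standard library as stand-ins for a "fast" and a "slow" algorithm; they are assumptions chosen for illustration and have nothing to do with the compressors any vendor mentioned here actually ships. The numbers it prints depend entirely on the machine and the sample data.

```python
import lzma
import time
import zlib


def measure(name, compress, data):
    """Return (name, compression ratio, throughput in MB/s) for one compressor."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(packed)
    mb_per_sec = len(data) / 1e6 / max(elapsed, 1e-9)
    return name, ratio, mb_per_sec


def compare(data):
    """Run a fast and a slow general-purpose compressor over the same bytes."""
    candidates = [
        ("zlib level 1 (fast)", lambda d: zlib.compress(d, 1)),
        ("lzma preset 9 (slow)", lambda d: lzma.compress(d, preset=9)),
    ]
    return [measure(name, fn, data) for name, fn in candidates]


if __name__ == "__main__":
    # Hypothetical, highly repetitive sample; real file data will behave differently.
    sample = b"customer record 0001, status=active, balance=100.00\n" * 20_000
    for name, ratio, speed in compare(sample):
        print(f"{name}: {ratio:.1f}:1 at {speed:.0f} MB/s")
```

An in-band product has to pick from the fast end of a list like this; a post-process job can afford to try the slow end when the data rewards it.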

How Much Disk I/O? In-band solutions have a significant advantage in the area of disk I/O, or at least they do when they are getting good compression. Because in-band solutions compress data before it gets written to disk, if the data has been shrunk by 50%, then there is 50% less data to actually write out to disk. If you are I/O bound, and if a compressor is getting good results on your data stream, then this can help performance a lot. You are simply doing a lot less I/O. When dedupe vendors say that they ingest 1TB/hour, they don’t usually mean that they actually write 1TB of data per hour out to physical disk. What they mean is that, if you assume 20:1 dedupe, storing 1TB of logical data requires only 50 Gigabytes of data to be written to disk. The other 950GB are thrown away, because they were duplicates of data already stored on disk. You have to be careful with these dedupe throughput claims: they assume you are getting the expected dedupe results.

If you sent 1 Terabyte of totally new and unique data, with no duplicates to data seen before, and the dedupe appliance actually had to write out 1 TB of physical data to disk, most of them would be quite a bit slower than their claimed performance. But for both dedupe and compression, if you get good data reduction, then you can improve I/O performance by being in-band.   Just as is the case with dedupe, though, if you are not getting a good compression result, then you won’t see this performance improvement.
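The arithmetic behind those throughput claims is simple enough to sketch. The function name below is made up for illustration, and the model ignores index lookups, metadata, and everything else a real appliance does; it only shows how much physical data lands on disk for a given logical ingest and reduction ratio.

```python
def physical_gb_written(logical_gb, reduction_ratio):
    """Physical GB written to disk for a logical ingest at an N:1 reduction ratio."""
    if reduction_ratio < 1:
        raise ValueError("reduction ratio must be at least 1 (1 means no reduction)")
    return logical_gb / reduction_ratio


if __name__ == "__main__":
    # 1TB logical at the assumed 20:1 dedupe ratio: 50 GB actually hits disk.
    print(physical_gb_written(1000, 20))   # 50.0
    # The same 1TB of totally unique data: every byte has to be written.
    print(physical_gb_written(1000, 1))    # 1000.0
    # 50% compression, as in the in-band example: half the writes.
    print(physical_gb_written(1000, 2))    # 500.0
```

The gap between the first two cases is why a claimed ingest rate only holds when the data reduces as expected.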

All or Nothing. In-band solutions compress everything that goes by. They have to. Because of the speed constraints, they don’t typically have time to analyze data, figure out what it is, who it belongs to, or whether it falls in policy for some action or another. Post-process solutions let the administrator decide which data to shrink, when to shrink it, how to shrink it, and even where to put the data. This is probably the single biggest advantage of the post-process approach. You may decide to dedupe and compress all files that have not been accessed for a month. You might decide to dedupe all files that are virtual machine files, such as VMDKs. You may decide to apply fast compression only to files that are between 2 days old and a month old. You may decide not to compress hot data (any file that has been read or modified in the last 24 hours) at all.

You may decide to run compression and dedupe jobs only at off-peak hours. You may decide to have the solution read a file from a Fibre Channel tier of drives, dedupe and compress the data, and write it out to a less expensive SATA tier. This fine-grained control, which allows active data management as part of the data reduction solution, makes post-process a clear winner for most unstructured file data. For transactional databases, since the data is always hot, in-band may be the better approach.
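Policies like the ones above boil down to a handful of rules evaluated per file. The sketch below is purely illustrative: the `FileInfo` type, the action names, the rule ordering, and the choice to measure "age" as time since the last read or write are all assumptions, not how any particular product expresses policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileInfo:
    path: str
    last_access: datetime
    last_modified: datetime


def choose_action(f, now):
    """Map a file to a data-reduction action using the example rules from the text."""
    # Assumption: a file's "age" is the time since it was last read or modified.
    idle = now - max(f.last_access, f.last_modified)
    if idle < timedelta(hours=24):
        return "skip"                # hot data: leave it alone
    if f.path.lower().endswith(".vmdk"):
        return "dedupe"              # virtual machine images
    if idle >= timedelta(days=30):
        return "dedupe+compress"     # cold data: full reduction
    if idle >= timedelta(days=2):
        return "fast-compress"       # warm data: cheap compression only
    return "skip"                    # between 1 and 2 days old: wait


if __name__ == "__main__":
    now = datetime(2010, 6, 30, 12, 0)
    report = FileInfo("/share/q1-report.doc", now - timedelta(days=45), now - timedelta(days=60))
    print(choose_action(report, now))   # dedupe+compress
```

A scheduler that only runs such a pass at off-peak hours, or that writes the result to a cheaper tier, would sit on top of a function like this.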

What About Existing Data? Let’s say you have an EMC Celerra filer with 100TB of file data on it. You buy a fancy new in-band compression solution. New data that is being written gets compressed as it comes in. But what about that 100TB of data that’s already there? Well, in-band solutions may give you a tool to read that data, compress it, and put it back … making it a post-process solution! The fact is, most customers already have a lot of data. Usually, customers who are out looking for a data reduction solution have lots and lots of data. If you have a Petabyte of data sitting on the floor, and you can dedupe and compress that data down to 100 Terabytes, you just created 900 Terabytes of free disk. That’s tremendous savings, and you can only get it by post-processing the data that is already there.

Data Integrity. This is sort of the skeleton in the closet for in-band solutions.     All data reduction vendors go to great lengths to ensure data integrity.  We all do checksums, or even bit-by-bit comparisons.   But when you do in-band compression, that means that you have never had a full original copy of your data on disk.     If there’s a bug in a compression algorithm (and let me be clear – I’m not accusing any vendor of any such thing) there is no original uncompressed data to go back to.

In a post-process solution, data gets written out to disk in its full original form. You decide when to compress it. The conservative shop will say, compress everything after it is at least 24 hours old. Why? Because then you will have at least one backup of the original data taken by your backup solution. Now you can go ahead and shrink the data, saving disk space on your primary storage. But if there ever were a data corruption bug – and really, as a group, all of the vendors listed in the CORE score box go to great lengths to guarantee there isn’t, but if – then you’d have a full original copy to go back to. The CIO of a major German bank once told me that for this reason, the bank would never implement an in-band solution. He said, logically, I understand the protections that are in place and that they should be sufficient, but psychologically and from a risk perspective, I cannot bet the data of the bank and its customers on those assurances.

Dedupe is Always a Post-Process. No, Really! All dedupe, even solutions that say they are in-band, is a form of post-process. One of the things that strikes me about the CORE article and the discussion that follows is that the line of reasoning is, “dedupe is the most important thing to happen in storage since forever, so you should go out and buy compression”. Well, dedupe is a form of compression, and it’s one that compares new incoming data to data that has already been stored. To do this, you hold incoming data somewhere while you compute a hash, look up the hash in an index, and find out whether the data is a duplicate of something you’ve already stored. Solutions that claim to be in-band, like Data Domain, hold that incoming data in a very small holding tank (an NVRAM card) and process it quickly. The solutions that are called post-process write the data out to disk and then figure out whether it’s a duplicate or not. If it is, they get rid of it. In both cases the data is being stored somewhere until the dedupe process runs – it’s just a question of where and for how long. And dedupe is really important – for online data as much as for backup.
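The hold, hash, look-up loop described above can be sketched in a few lines. This is a toy model under loud assumptions: fixed-size chunks, SHA-256 as the hash, and a Python dictionary standing in for both the index and the disk. The `DedupeStore` class is hypothetical and does not reflect how Data Domain, Ocarina, or anyone else implements dedupe.

```python
import hashlib


class DedupeStore:
    """Toy chunk store: keeps one copy of each unique chunk, keyed by its hash."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.index = {}            # hash -> chunk bytes (stands in for blocks on disk)
        self.logical_bytes = 0     # bytes the clients think they wrote
        self.physical_bytes = 0    # bytes actually kept

    def write(self, data):
        """Store data and return the list of chunk hashes needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]      # held in memory while we hash it
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.index:             # new data: keep it
                self.index[digest] = chunk
                self.physical_bytes += len(chunk)
            self.logical_bytes += len(chunk)         # duplicate or not, it counts logically
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original bytes from a list of chunk hashes."""
        return b"".join(self.index[d] for d in recipe)


if __name__ == "__main__":
    store = DedupeStore(chunk_size=4)
    recipe = store.write(b"AAAABBBBAAAA")
    print(store.logical_bytes, store.physical_bytes)   # 12 8
    print(store.read(recipe))                          # b'AAAABBBBAAAA'
```

Whether the `chunk` variable lives in NVRAM for a few milliseconds or on disk for a day is the only thing that separates the "in-band" and "post-process" labels in this picture.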

So Which One is Better? Neither. There is no one right answer. Different use-cases call for different solutions. If you have an OLTP database running on a NAS share, in-band compression is probably the best bet. If you have lots of unstructured file data, I’d say post-process (including dedupe) is going to be better most of the time. But there are other use-cases, and there are exceptions even to those generalizations. What I will say is this. Most industry data, such as the IDC data on storage growth, shows Tier 1 Primary and database data as having very flat growth year-to-year, while the same data shows unstructured file data – Office docs, email, SharePoint, photos, video, internet “stuff” – as growing at almost out-of-control rates.


Both in-band compression and full compression and dedupe solutions have their place, but the most important thing for data reduction is to attack the place where data growth is causing the most pain. For the most part, you’ll find that that growth is in unstructured file data, that it is not in the most performance-sensitive and finely-tuned Primary tier, and that if you got 75% savings on just the files in your data center that have not been modified for at least three months, you’d be flabbergasted at the savings.

Next:     Do you have to expand data to back it up?


About Carter George

Carter runs storage strategy for Dell
