Storage Industry Lags Behind Advances in Compression


There’s a lot of talk about compression these days, but how much do we know about it? For one thing, compression as an area of mathematical research has evolved much faster than most people realize. The thing is, most compressors used in computer products, including dedupe appliances, rely on generic algorithms rather than taking advantage of these advances.

Most storage products use Lempel-Ziv (LZ) or one of its derivatives, and try to use that single compressor to compress everything. These algorithms have been around for decades, and for the most part have not evolved much in the last ten years other than in performance. This is too bad, because compression has advanced in exciting ways. LZ and its cousins work well on the kinds of data that were around 10 or 20 years ago – plain text, plain numbers, or combinations of those things. They do not work so well on a lot of modern data – images, video, Office documents, PDFs, already-compressed files like Zip, encrypted data, etc. What’s important to understand is that all the most notable advances in compression that apply to storage have taken place not in generic compression algorithms, but in file type specific ones. File type specific compressors can, in fact, deal with all those modern data types.

Compression is all about pattern recognition and prediction. You look for patterns in a file and if you can find those patterns you try to predict their occurrence. If you can predict a pattern, you can compress it. So understanding the kinds of patterns that might show up in a file – video, a Zip file, music, and a PowerPoint are all very different – is the key to building a compressor for that file type.
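To make the link between prediction and compression concrete, here is a minimal sketch in Python. It is an illustration only, not any product's algorithm: the function name `ideal_bits` and the simple counting model are assumptions made for this example. It estimates how many bits an ideal entropy coder would need when each byte is predicted from the bytes just before it; a byte predicted with probability p costs about -log2(p) bits, so better prediction means fewer bits.

```python
import math
from collections import Counter, defaultdict


def ideal_bits(data: bytes, order: int) -> float:
    """Estimate the bits an ideal entropy coder would spend on `data`
    when each byte is predicted from the previous `order` bytes.

    The model learns as it goes: it counts which byte followed each
    context so far and turns those counts into a probability.
    """
    counts = defaultdict(Counter)  # context -> counts of the bytes that followed it
    total_bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]
        seen = counts[context]
        # Add-one smoothing over all 256 byte values, so a byte never
        # seen in this context still gets a non-zero probability.
        p = (seen[symbol] + 1) / (sum(seen.values()) + 256)
        total_bits += -math.log2(p)
        seen[symbol] += 1
    return total_bits


sample = b"the quick brown fox " * 200
no_context = ideal_bits(sample, order=0)   # predicts from byte frequencies only
two_bytes = ideal_bits(sample, order=2)    # predicts from the previous two bytes
print(f"stored as-is: {8 * len(sample)} bits")
print(f"order-0 model: {no_context:.0f} bits")
print(f"order-2 model: {two_bytes:.0f} bits")
```

The model that sees more of the pattern spends fewer bits on the same data. A file type aware compressor applies the same idea with a model built around the structure of one particular format.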

What’s especially relevant is that the most important thing in data compression today is recompression. Almost all of the file formats that are driving data growth, and taking up the most space on backups, are already compressed. Think of a file type that’s eating up space, and it’s likely to be an already-compressed format: JPEG, video, Office, PDF, MP3, medical images … all compressed already.

A generic compressor won’t get any results at all on an already-compressed file. That’s because the first compression obscures the patterns that a compressor would look for. That’s why if you try to compress, say, a Zip file, if anything you’re likely to make it bigger. Recompression means first decompressing the file and then recompressing it with a better compressor. To do that, you have to recognize what kind of file it is, what kind of compression has been applied, and how to decompress it. By first decompressing it, you are able to see and process the patterns that make better prediction and compression possible.
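Both effects can be sketched with Python's standard library, using zlib (DEFLATE, an LZ-family compressor) as the generic compressor and lzma as the stand-in for "a better compressor". This is a rough illustration under stated assumptions, not how any storage product works: the helper names (`make_sample`, `recompress`, `restore`, `compare`) are hypothetical, and the sample data is synthetic, built so that its repeats lie beyond DEFLATE's 32 KB window but within reach of LZMA's larger dictionary.

```python
import lzma
import random
import zlib


def make_sample(seed: int = 0) -> bytes:
    """Synthetic 'document': one block of pseudo-text stored three times.

    Each block is far larger than 32 KB, so the repeated copies sit
    outside DEFLATE's window but inside LZMA's dictionary.
    """
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    vocab = ["".join(rng.choice(letters) for _ in range(rng.randint(3, 9)))
             for _ in range(2000)]
    block = " ".join(rng.choice(vocab) for _ in range(30000)).encode("ascii")
    return block * 3


def recompress(deflated: bytes) -> bytes:
    """Undo the existing zlib layer, then apply a stronger compressor."""
    raw = zlib.decompress(deflated)   # step 1: expose the original patterns
    return lzma.compress(raw)         # step 2: compress them with a better model


def restore(recompressed: bytes) -> bytes:
    """Recover the uncompressed content from the recompressed form."""
    return lzma.decompress(recompressed)


def compare(raw: bytes) -> dict:
    """Sizes in bytes for each way of storing the same content."""
    once = zlib.compress(raw, 9)
    return {
        "raw": len(raw),
        "deflate": len(once),
        "deflate_twice": len(zlib.compress(once, 9)),  # generic on top of generic
        "recompressed": len(recompress(once)),         # decompress, then recompress
    }


if __name__ == "__main__":
    for name, size in compare(make_sample()).items():
        print(f"{name:>14}: {size} bytes")
```

One thing this sketch leaves out: it restores only the uncompressed content. To hand back a byte-identical copy of the original compressed file, a real recompressor would also have to record enough about the original encoding to reproduce it exactly.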

Almost every market has a set of well-defined file types that make up the bulk of its unstructured data. In medical imaging, it’s DICOM (which in turn contains JPEG 2000, JPEG-LS, and TIFF). In seismic, it’s SEG-Y. In satellite imaging, it’s NTF, MrSID, GeoTIFF and a few others. In the average business, it’s Office, PDF, photos and video.
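Because the set of file types is small and well defined, recognizing them is tractable. One common approach, assumed here for illustration rather than taken from any particular product, is to check the leading "magic" bytes of a file. The table and the `sniff` function below are a hypothetical sketch covering only a few everyday formats; anything unrecognized falls back to a generic compressor.

```python
# Leading byte signatures for a few common, already-compressed formats.
MAGIC = [
    (b"%PDF-", "pdf"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),          # also Office 2007+ files, which are Zip containers
    (b"\x1f\x8b", "gzip"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]


def sniff(head: bytes) -> str:
    """Guess the file type from the first bytes of a file.

    Returns "generic" when no signature matches, meaning the caller
    should fall back to an ordinary general-purpose compressor.
    """
    for signature, name in MAGIC:
        if head.startswith(signature):
            return name
    return "generic"


print(sniff(b"%PDF-1.4\n"))          # a PDF header
print(sniff(b"plain ASCII text"))    # no known signature
```

A real product would need to go further than the first few bytes, since container formats such as DICOM or Office documents hold other compressed formats inside them.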

In specific industries, you see very advanced compression implemented in the application layer, not in storage. Video is a great example – the whole concept of the video codec is all about compression. Whole companies exist specifically to do better video compression (On2 is a good example), but this compression is done primarily for transmission, and implemented as part of the video application workflow, not as a storage technology.

In a world that had all plain ASCII text data, generic compressors would be great. But that’s not the world we live in. For compression to have any meaningful impact on today’s data sets, you have to have file type aware recompression.

It’s a shame that most storage products today have not implemented the most exciting advances in modern compression mathematics. My company Ocarina is quite frankly one of the few exceptions. The compressors found in tape drives or in dedupe appliances represent the best of the evolution of the generic compressor. The thing to look for going forward is the emergence in storage products of the next generation set of file type aware compressors, which is where all the action has been over the last ten years.

About Carter George

Carter runs storage strategy for Dell

3 Responses to “Storage Industry Lags Behind Advances in Compression”

  1. David Vellante January 13, 2010 at 2:10 pm #

    Great post Carter. It IS a shame that most storage products haven’t implemented some type of compression. While nothing comes for free this innovation is long overdue. Storage salespeople are afraid of compression because they think they’ll sell less storage.

    I believe they’re wrong because storage is an elastic market. Drop the price and you’ll sell more.

    Not sure why you make this statement.

    “For compression to have any meaningful impact on today’s data sets, you have to have file type aware recompression.”

    Seems to me there are other use cases where compression could have a meaningful impact.

  2. Carter George January 13, 2010 at 3:24 pm #

    Good point, and thanks Dave for the clarification there. I didn’t mean to say that generic compression has no benefit for customers today, as long as they have a lot of text files for example. I did mean though that generic compression has little effect on the modern pre-compressed data types (e.g. Office 2007 docs) that own an increasing share of our customers’ storage today.

    The other data set where generic compression is valid is on most databases, which really still are essentially sparse alphanumeric data for the most part. The thing is, databases are also the least likely place for customers to implement compression, because this is the one area where they are least likely to trade off any performance for space savings. Companies like Storewize have very fast generic compression, and that would be a good fit for databases, but it’s not clear to me that customers are embracing compression for that, the most conservative area in the data center.

  3. David Vellante January 14, 2010 at 10:15 am #

    Another question and a comment Carter if I may…

    I agree with your comments on a production database but what % of an organization’s database storage would you consider the ‘family jewels’ vs. copies of the database for things like decision support/data warehousing, snapshots, and other copies/clones for recovery purposes?

    If I can compress those supporting copies down 50-80%…why not?
