
Online Storage Optimization

Exploring Next Generation Storage Solutions

Databases - Compression Targets?

Posted by Carter George on January 16, 2010

The headline of this post poses a question raised in a recent comments discussion on this blog between Dave Vellante of Wikibon and me. Dave wanted to know whether there are use cases in which generic compression might still be useful. As I wrote in my post, most of the storage industry still relies on generic, or LZ, compression. This is a shame, because it’s severely limited compared to the possibilities inherent in more advanced, file-type-specific compression algorithms such as those we use at Ocarina. My main point was that these algorithms can be applied to the bulk of the files one finds in the modern data center: MS Office, Zip, PDF, video, images, and so on.

However, Dave was interested in hearing whether there are use cases in which generic compression could be commercially viable. My response was that data sets made up entirely of text files, and databases, are the two examples in which it really doesn’t matter what type of compression you use. The generic type will work fine because essentially all you have to do is reduce text and/or alphanumeric data. But, I added, databases aren’t likely to be a compression target because there is too much of a performance trade-off. They are also unlikely to be a good commercial target, as databases are the most conservative part of the data center. Dave pressed his case. He wanted to know if perhaps there are times when compressing a database would make sense.

He wrote: “I agree with your comments on a production database but what % of an organization’s database storage would you consider the ‘family jewels’ vs. copies of the database for things like decision support/data warehousing, snapshots, and other copies/clones for recovery purposes? If I can compress those supporting copies down 50-80%…why not?”

My answer: it varies by organization, but sometimes a large percentage of database data is in star schema data warehouses. Those databases, unlike transactional databases, tend to support frequent whole-table scans. That is, instead of fast, small writes (transactions) into the middle of a table, they see very large reads of everything in a table. Databases tend to be very compressible, and if you can compress them and still support the I/O rates you need for performance, by all means do so!

Transactional database performance tends to be measured in TPS (transactions per second), and TPS in turn is largely bounded by the speed at which the database can do direct I/O writes of transaction logs to stable storage. Putting compression or dedupe in that path is risky. I’m not saying it can’t be done, but people will want to be quite sure it doesn’t mess up years of performance tuning. With data warehouses, you may have hundreds of terabytes of data in simple so-called star schema databases, and the kinds of queries run against these databases tend to go through and read every row in every table.

Consequently, performance is bound by the ability of disk systems to sustain sequential reads of very large data sets. In this case, as long as decompression can keep up with the rate of physical disk reads, I see no reason not to compress or dedupe those databases. As I mentioned earlier, data in databases is largely alphanumeric. That means both compression and decompression of that kind of data can be very fast, and it lends itself to coprocessors like HiFN, for example. If your architecture provides a place to insert something like that, or if you have enough free CPU cycles on your database servers, I think data warehouses can be good candidates for both compression and dedupe.
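To make the "decompression has to keep up with the disk" condition concrete, here is a rough sketch of how one might check it. It uses Python's zlib (DEFLATE, a generic LZ-style compressor) as a stand-in for whatever compressor you would actually deploy. The synthetic rows, the function names, and the disk rates in the usage note are all assumptions for illustration, not measurements from a real warehouse.

```python
import time
import zlib


def make_rows(n):
    """Build synthetic, CSV-like alphanumeric rows (hypothetical warehouse data)."""
    lines = [
        f"{i},CUST{i % 1000:05d},2010-01-{(i % 28) + 1:02d},{(i * 37) % 10000}.00,SHIPPED"
        for i in range(n)
    ]
    return "\n".join(lines).encode("ascii")


def measure_decompression(raw, repeats=5):
    """Return (compression ratio, decompression rate in MB/s of uncompressed output)."""
    packed = zlib.compress(raw, 6)
    start = time.perf_counter()
    for _ in range(repeats):
        out = zlib.decompress(packed)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    assert out == raw  # the round trip must be lossless
    ratio = len(raw) / len(packed)
    rate_mb_s = (len(raw) * repeats) / elapsed / 1e6
    return ratio, rate_mb_s


def effective_scan_rate(disk_mb_s, ratio, decompress_mb_s):
    """Uncompressed MB/s a table scan sees when the data is stored compressed.

    The disk delivers compressed bytes at disk_mb_s, which expand to
    disk_mb_s * ratio; the decompressor caps the result at decompress_mb_s.
    """
    return min(disk_mb_s * ratio, decompress_mb_s)


if __name__ == "__main__":
    ratio, rate = measure_decompression(make_rows(200_000))
    disk = 200.0  # assumed sustained sequential read rate in MB/s
    print(f"ratio {ratio:.1f}x, decompression {rate:.0f} MB/s")
    print(f"effective scan rate {effective_scan_rate(disk, ratio, rate):.0f} MB/s "
          f"vs {disk:.0f} MB/s uncompressed")
```

Under this simple model, a compressed scan is no slower than an uncompressed one whenever the decompressor's output rate is at least the raw disk rate, and it is faster whenever the ratio is above 1. Real systems add effects this ignores, such as contention for CPU with the query engine itself.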

With all that said, the future of compression is in reducing unstructured data. Why? Because this is where the greatest data growth is occurring. In order to address this problem, we’ll have to start looking at far more advanced algorithms than those that did the trick in the past.


2 Responses to “Databases - Compression Targets?”

  1. Hi Carter…your blogs got me thinking about data warehouse use cases and I reached out to some people at Oracle to see what they could add. They turned me on to some interesting examples of columnar compression (Exadata). Here’s a decent overview:
    http://www.oracle.com/technology/products/bi/db/exadata/pdf/ehcc_twp.pdf

    I’ve also looked into Vertica which looks to be a nice solution and I’m sure there are others.

    Thanks for the ideas.

  2. Kirk Bradley says:

    Take a look at Storwize. They can and do compress Oracle databases with NO application or storage system changes (subject to the condition that the database be NAS accessible). And they do it with very good performance. As pointed out above, what “good” needs to be is shop and DB/application dependent, but still…
