Great Minds Think Together


This week, the New York Times Bits blog reported that a winner has been announced in the $1 million Netflix challenge. The prize went to the team that could come up with a better “predictor” of customer behavior than the company’s current in-house software, Cinematch.

It’s more than a gimmick. Having a reliable and effective tool that will tell whether a person is likely to prefer “Sleepless in Seattle” over “Napoleon Dynamite” on a given Friday night represents the holy grail for the mail-order movie rental company, which must continually come up with ways to satisfy its customers’ need for entertainment.

Many have been anxiously awaiting the results. What’s most notable about this prize is that it has turned out to be a great example of the power of teamwork in innovation–to a level far beyond what had been previously imagined.

In fact, as one team approached the finish line, it became a race to aggregate other teams in an attempt to develop the winning algorithm.

The Times quotes Chris Volinsky, a scientist at AT&T Research and a leader of the winning “Bellkor” team, saying that the blending of different statistical and machine-learning techniques “only works well if you combine models that approach the problem differently … That’s why collaboration has been so effective, because different people approach problems differently.”
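To make Volinsky's point concrete, here is a minimal sketch of blending. The ratings and both "models" are made up for illustration, and a plain average stands in for whatever weighting the real teams used. It only shows that two predictors that err in different directions can beat either one alone.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

# Hypothetical ratings and predictions, invented purely for illustration.
actual = [4.0, 2.0, 5.0, 3.0, 1.0]
model_a = [4.8, 1.4, 5.7, 2.5, 1.6]  # tends to miss in one direction...
model_b = [3.4, 2.7, 4.5, 3.8, 0.3]  # ...where this one misses the other way

# Simplest possible blend: average the two predictions item by item.
blend = [(a + b) / 2 for a, b in zip(model_a, model_b)]

print(f"model A RMSE: {rmse(model_a, actual):.3f}")
print(f"model B RMSE: {rmse(model_b, actual):.3f}")
print(f"blend   RMSE: {rmse(blend, actual):.3f}")
```

If the two models made the same mistakes, averaging them would gain nothing, which is exactly why combining different approaches matters.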

Reading this, I couldn’t help but notice the parallels between this and what we are accomplishing at Ocarina. In fact, the intensely collaborative approach is what makes Ocarina such an exciting place to work. We bring together great minds in order to yield algorithms for data compression. In addition, compression is as thorny a problem to solve as consumer behavior, and is similar in ways that might surprise you.

You can work on having one good generic compressor (and there are many good free generic compressors out there), but the best results come from a set of more specific algorithms and an infrastructure for tying them together. What’s more, compression, like the Netflix challenge, is all about prediction. If you can predict the next value in a data set, you can compress it. So the approach that produced the best results for predicting consumer rental behavior is a good signpost for how to get the best prediction for shrinking a data set. This doesn’t mean using the same algorithms, but the same concept: multiple algorithms, a content-mixing and adaptation approach, and tying things together to get a whole that is greater than the sum of its parts.
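Here is a small sketch of the "prediction is compression" idea, using Python's standard zlib module. The readings are invented, and the predictor is the simplest one I can think of (guess that each value equals the previous one). It is an illustration of the principle, not a description of any production algorithm.

```python
import struct
import zlib

def pack_ints(values):
    """Serialize integers as 4-byte little-endian signed values."""
    return struct.pack(f"<{len(values)}i", *values)

def residuals(values):
    """Predict each value as 'same as the previous one'; keep only the error."""
    previous = 0
    errors = []
    for value in values:
        errors.append(value - previous)
        previous = value
    return errors

def rebuild(errors):
    """Invert residuals(): add each error back onto the running prediction."""
    previous = 0
    values = []
    for error in errors:
        previous += error
        values.append(previous)
    return values

# Made-up, slowly rising readings (a stand-in for real numeric data).
readings = [1000 + 3 * i + (7 * i) % 5 for i in range(5000)]

raw_size = len(zlib.compress(pack_ints(readings)))
predicted_size = len(zlib.compress(pack_ints(residuals(readings))))

assert rebuild(residuals(readings)) == readings  # the prediction step is lossless
print(f"compressed raw values:        {raw_size} bytes")
print(f"compressed prediction errors: {predicted_size} bytes")
```

The better the guess, the smaller and more regular the errors, and the less there is left to store.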

Let’s take a look at a very well-known and high-quality family of generic compressors, LZ (and its many variants and implementations, which include LZO, libz, gzip, etc.). The term “LZ” comes from two pioneers in compression algorithms for computers, Abraham Lempel and Jacob Ziv. LZ is based on their work, and has evolved quite a lot in the 30 years since it was first developed. This family of compressors will try its best on whatever data pattern you throw at it. If the data is uncompressed, LZ variants will do quite well. Plain text, numeric data, and other application data that has open patterns will be well-compressed by LZ.

As powerful and useful as it is, LZ has its limitations. Notably, it is stymied by file types that have already been compressed by the application (or by some other tool), as well as files that are compound documents with multiple different data types inside, and media files. Alas, more and more of today’s files are either office documents (MS Office, PDF) or media files (photos, video, special effects) that are compressed in some simple way by the application that created them.

Compression, as I said, is all about prediction. A compressor builds a dictionary or context and looks for patterns; if a pattern repeats or can be predicted, it can be replaced with a shorter symbol. The longer the pattern you can find, the bigger the win from replacing it with a symbol. The more patterns you find, the better the results you get.
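For readers who want to see a dictionary being built, here is a toy version of LZ78, one of the original Lempel-Ziv schemes. It is a teaching sketch only: real LZ implementations add bit packing, entropy coding, and bounded dictionaries, none of which appear here.

```python
def lz78_encode(text):
    """Encode text as (dictionary_index, next_char) pairs."""
    dictionary = {"": 0}          # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch          # keep extending a pattern we have seen before
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                    # flush a trailing phrase that is already known
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs

def lz78_decode(pairs):
    """Rebuild the text by replaying the same dictionary growth."""
    phrases = [""]
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        pieces.append(phrase)
    return "".join(pieces)

sample = "abababababab"
encoded = lz78_encode(sample)
assert lz78_decode(encoded) == sample
print(len(sample), "characters became", len(encoded), "pairs")
```

Each pair points back at a phrase seen earlier, so as the repeated patterns get longer, one pair stands in for more and more characters.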

The trouble is that once one compressor has gone through a file, it has taken out at least the low-hanging fruit. If it’s a good compressor, it may have done a very good job of finding patterns and replacing them with efficient codes. Video, which has been an area of intense compression research for years, is a good example of this. An H.264-encoded video file will already have had the easily compressible material taken out. Photos, graphics, animation, and other media-related files are similar: the output of applications from Photoshop to Maya (and Inferno, Flint, and Flame), and common compressed sound formats (MP3, FLAC), will usually already have been compressed. This will often defeat even a good generic compressor like LZ. The patterns it looks for will have been obscured by the compressor that processed the data first.
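You can watch this happen in a few lines. The sketch below uses Python's standard zlib module (an LZ-based compressor) on some made-up text, then feeds the compressed output back through the same compressor. The word list and sizes are arbitrary choices of mine.

```python
import random
import zlib

# Build repeatable pseudo-text from a small, made-up vocabulary.
rng = random.Random(0)
words = ["movie", "rental", "queue", "rating", "comedy", "drama", "friday", "night"]
text = " ".join(rng.choice(words) for _ in range(20000)).encode("ascii")

once = zlib.compress(text)    # first pass finds the repeated words
twice = zlib.compress(once)   # second pass sees output with few patterns left

print(f"original:         {len(text)} bytes")
print(f"compressed once:  {len(once)} bytes")
print(f"compressed twice: {len(twice)} bytes")
```

The first pass shrinks the text substantially. The second pass has almost nothing to work with, and its own framing overhead can even make the result slightly larger.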

You can see this easily by running an LZ compressor against a directory with a wide variety of file types. Put some text files, some spreadsheets, some photos, and some other media files in a directory and see the percentage compression you get. There will typically be a wide range. This does not mean LZ is not a good compressor; it means that it is seeing a mix of files it can compress and files it can’t.
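If you want to try that experiment yourself, here is one way to do it, assuming Python and its standard zlib module as the LZ compressor. The function name survey is my own, and the sketch reads each file fully into memory, so point it at a modest test directory.

```python
import os
import zlib

def survey(directory):
    """Return {relative_path: compressed_size / original_size} for each file."""
    ratios = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as handle:
                data = handle.read()
            if data:  # skip empty files to avoid dividing by zero
                relative = os.path.relpath(path, directory)
                ratios[relative] = len(zlib.compress(data)) / len(data)
    return ratios

def report(directory):
    """Print one line per file, most compressible first."""
    for path, ratio in sorted(survey(directory).items(), key=lambda item: item[1]):
        print(f"{ratio:7.1%} of original size  {path}")
```

Calling report on a mixed directory should show text and spreadsheets exports near the top with small percentages, and JPEGs or video files near the bottom at close to (or slightly over) 100%.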

So how do you get better results? Not by building a better generic compressor, that’s for sure! Rather, by getting specific. This means building a family of compressors, each of which is aware of specific file types and their data patterns. This is the approach we have taken, and the results are so remarkable that our customers, whether in the Hollywood post-production industry, life sciences, Internet media, or some other industry, are often stunned by the difference these algorithms make when it comes to reducing their data.
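To give a flavor of what "a family of compressors plus something to tie them together" can look like, here is a deliberately simple sketch. It is not our product's design: it just tries several general-purpose codecs from Python's standard library (zlib, bz2, lzma), keeps whichever output is smallest, and records the choice in a one-byte tag that I made up for this example. A real content-aware system would choose based on file type and structure rather than brute force.

```python
import bz2
import lzma
import zlib

# Hypothetical one-byte tags mapped to (compress, decompress) pairs.
CODECS = {
    b"S": (lambda data: data, lambda data: data),  # stored as-is
    b"Z": (zlib.compress, zlib.decompress),
    b"B": (bz2.compress, bz2.decompress),
    b"X": (lzma.compress, lzma.decompress),
}

def pack(data):
    """Try every codec and keep the smallest result, prefixed with its tag."""
    candidates = ((tag, encode(data)) for tag, (encode, _decode) in CODECS.items())
    tag, payload = min(candidates, key=lambda item: len(item[1]))
    return tag + payload

def unpack(blob):
    """Read the tag, then hand the payload to the matching decompressor."""
    tag, payload = blob[:1], blob[1:]
    _encode, decode = CODECS[tag]
    return decode(payload)

titles = b"Sleepless in Seattle\nNapoleon Dynamite\n" * 200
packed = pack(titles)
assert unpack(packed) == titles
print(f"{len(titles)} bytes packed to {len(packed)} bytes using codec {packed[:1]!r}")
```

Because "store it unchanged" is one of the candidates, the framework never does worse than the input plus one byte, even on data that none of the compressors can shrink.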

All of which is to say that the Netflix prize researchers’ discovery is much the same as ours: the winning strategy is not going to be one better compressor, but a cooperating framework of compressors that work together. Teamwork, not just among people but among interlocking technologies, is the difference between good and great — between generic and extraordinary.


About Carter George

Carter runs storage strategy for Dell
