Reprise–Compressing Already Compressed Files


Time to fire up the wayback machine and take a look at blog posts gone by. This one, written by Carter George last May, addresses the question of whether it’s possible to further compress already compressed files. With unstructured data loads skyrocketing, this question is at least as relevant now as it was a year ago.

Here is what the original post, “Can You Compress An Already Compressed File? Parts I and II” had to say about this still very timely topic:

We can all recognize how much data we generate. And just as we keep telling ourselves we’ll clean out the garage “one of these days,” most of us rarely bother to clean out our email or photo sharing accounts.

As a result, enterprise and internet data centers have to buy hundreds of thousands of petabytes of disk every year to handle all the data in those files. It all has to be stored somewhere.

One way to reduce the amount of storage growth is to compress files. Compression techniques have been around forever, and are built into many operating systems (like Windows) and storage platforms (such as file servers).

Here’s the problem: most modern file formats, the formats driving all this storage growth, are already compressed.
· The most common format for photos is JPEG – that’s a compressed image format.
· The most common format for most documents at work is Microsoft Office, and as of Office 2007, documents saved in the default formats are compressed as they are saved.
· Music (MP3) and video (MPEG-2 and MPEG-4) are highly compressed.

The mathematics of compression mean that once you compress a file and reduce its size, you can’t expect to compress it again and get even more size reduction. Compression works by looking for patterns in the data and replacing any it finds with more efficient codes. So if you’ve compressed something once, the compressed file shouldn’t have any patterns left for a second pass to find.
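As a rough illustration of that point, here is a minimal sketch that uses Python's standard zlib module as a stand-in for a generic compressor. The sample text, the vocabulary, and the helper names `make_sample_text` and `compress_twice` are made up for this example; they are not from any particular product.

```python
import random
import zlib

# Hypothetical sample data: words drawn from a small vocabulary, so the
# text is full of repeated patterns, much like ordinary prose or logs.
VOCAB = ["storage", "file", "photo", "email", "data", "disk", "video", "music"]


def make_sample_text(word_count=20000, seed=1):
    """Build a pattern-rich byte string from the vocabulary above."""
    rng = random.Random(seed)
    words = (rng.choice(VOCAB) for _ in range(word_count))
    return " ".join(words).encode("ascii")


def compress_twice(data):
    """Return (original, first-pass, second-pass) sizes in bytes."""
    once = zlib.compress(data)
    twice = zlib.compress(once)  # second pass over already compressed bytes
    return len(data), len(once), len(twice)


if __name__ == "__main__":
    original, once, twice = compress_twice(make_sample_text())
    print(f"original:    {original} bytes")
    print(f"first pass:  {once} bytes")
    print(f"second pass: {twice} bytes")
```

The first pass should shrink the text substantially, while the second pass should leave the size roughly unchanged, because the first pass has already squeezed out the patterns.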

Of course, some compression algorithms are better than others, and you might see some small benefits by trying to compress something that has already been compressed with a lesser tool, but for the most part, you’re not going to see a big win by doing that. In fact, in a lot of cases, trying to compress an already compressed file will make it bigger!
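The "makes it bigger" case can be sketched the same way. In this example, random bytes stand in for a well-compressed file (an assumption for illustration: like good compressor output, random bytes have no patterns to exploit), and `recompress_overhead` is a hypothetical helper name.

```python
import os
import zlib


def recompress_overhead(payload):
    """Return how many bytes zlib adds (positive) or saves (negative)."""
    return len(zlib.compress(payload, 9)) - len(payload)


if __name__ == "__main__":
    # Random bytes stand in for an already compressed file.
    payload = os.urandom(100_000)
    change = recompress_overhead(payload)
    print(f"size change after compressing: {change:+d} bytes")
```

The growth is small, since it comes from the header and block bookkeeping the compressor has to add, but the output still ends up larger than the input.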

Conventional wisdom holds that once files are compressed with commonly used technologies, shrinking them further, and reducing the expensive resources they consume, is nearly impossible. So, what can be done about this?

End of part I

….

Part II

On the cutting edge, there are some new innovations in file-aware optimization that allow companies to reduce their storage footprint and get more from the storage they already have. The key to this is understanding specific file types, their formats, and how the applications that created those files use and save data. Most existing compression tools are generic. To get better results than you can get with a generic compressor, you need to go to file-type-aware compressors.

There’s another problem. Let’s say you just created a way better tool for compressing photographs than JPEG. That doesn’t mean your tool can compress already-compressed JPEGs; it means that if you were given the same original photo in the first place, you could do a better job. So the first step in moving towards compressing already-compressed files is what we call Extraction – you have to extract the original full information from the file. In most cases, that’s going to involve de-compressing the file first, getting back to the uncompressed original, and then applying your better tools.
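The extract-then-recompress idea can be sketched with two standard-library compressors. This is only an analogy under stated assumptions: zlib at its fastest level plays the part of the older, weaker format, lzma plays the "better tool", and the sample text and function names are invented for the example. A real JPEG would need an image codec rather than zlib.

```python
import lzma
import random
import zlib

# Hypothetical sample data, rich in repeated patterns.
VOCAB = ["storage", "file", "photo", "email", "data", "disk", "video", "music"]


def make_sample_text(word_count=20000, seed=1):
    """Build a pattern-rich byte string from the vocabulary above."""
    rng = random.Random(seed)
    words = (rng.choice(VOCAB) for _ in range(word_count))
    return " ".join(words).encode("ascii")


def recompress_directly(compressed):
    """Apply the stronger compressor on top of already compressed bytes."""
    return lzma.compress(compressed)


def extract_then_recompress(compressed):
    """Extraction: undo the weaker compression, then apply the stronger one."""
    original = zlib.decompress(compressed)
    return lzma.compress(original)


if __name__ == "__main__":
    weak = zlib.compress(make_sample_text(), 1)  # level 1 = fastest, weakest
    print(f"weakly compressed file:  {len(weak)} bytes")
    print(f"stronger tool on top:    {len(recompress_directly(weak))} bytes")
    print(f"extract, then recompress: {len(extract_then_recompress(weak))} bytes")
```

Layering the stronger tool on top of the compressed file gains little or nothing, while going back to the original data first lets the stronger tool see the patterns again.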

Extraction may seem simple enough – just reverse whatever was done to a file in the first place. But it’s not always quite that easy. Many files are compound documents, with multiple sections or objects of different data types. A PowerPoint presentation, for example, may have text sections, graphics sections, some photos pasted in, etc. The same is true for PDFs, email folders with attachments, and a lot of the other file types that are driving storage growth. So to really extract all the original information from these files, you may need to not only be able to decompress files, but to look inside them, understand how they are structured, break them apart into their separate pieces, and then do different things to each different piece.
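Office 2007 and later files such as .pptx and .docx are ZIP packages, so the "break it apart" step can be sketched with Python's zipfile module. The member names, the extension-to-kind table, and both function names below are made up for illustration; real PDFs and email folders use other container formats and would need their own parsers.

```python
import io
import os
import zipfile

# Hypothetical mapping from file extension to the kind of data inside.
KIND_BY_EXTENSION = {
    ".xml": "text",
    ".txt": "text",
    ".jpg": "photo",
    ".jpeg": "photo",
    ".png": "graphic",
}


def split_compound_document(container_bytes):
    """Return {member name: (kind, uncompressed bytes)} for a ZIP container."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            extension = os.path.splitext(info.filename)[1].lower()
            kind = KIND_BY_EXTENSION.get(extension, "other")
            # read() decompresses the member, giving back its original bytes.
            parts[info.filename] = (kind, archive.read(info))
    return parts


def build_sample_container():
    """Build a small in-memory ZIP whose made-up members mimic a presentation."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("slides/slide1.xml", "<slide>Quarterly results</slide>")
        archive.writestr("media/image1.jpeg", b"\xff\xd8 placeholder photo \xff\xd9")
        archive.writestr("media/chart1.png", b"\x89PNG placeholder graphic")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, (kind, data) in split_compound_document(build_sample_container()).items():
        print(f"{name}: {kind}, {len(data)} bytes")
```

Once each piece is labeled by kind, a file-aware optimizer could hand the text parts to one compressor and the photos to another instead of treating the whole file as one opaque blob.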

The two things to take away from this discussion are: 1) you won’t get much benefit from applying generic compression to already-compressed file types, which are the file types driving most of your storage growth, and 2) it is possible to compress already-compressed files, but to do so you first have to extract all the original information from them, which may involve decoding and unraveling complex compound documents and then decompressing all the different parts. Getting to that point is only the start; it is where online data reduction for today’s file types really begins.


About Carter George

Carter runs storage strategy for Dell

2 Responses to “Reprise–Compressing Already Compressed Files”

  1. Anil kumar gupta April 14, 2009 at 9:07 am #

    Hi,
    I have been thinking about compression techniques for the past 5 years, along lines that do not match any of the existing algorithms. I have the complete logic for compressing a compressed file any number of times, down to a certain block size, meaning that even 1 GB can be reduced to 8 KB, and 10 KB will also be reduced to 8 KB only. The challenge here is the decompression. This idea is still in the planning stage, and I would need a lot of time to write it as a program. I would like to know whether anybody has attempted such a technique, even if they failed to achieve it completely. Your inputs are much appreciated.

    Regards,

