
Online Storage Optimization

Exploring Next Generation Storage Solutions

Archive for the ‘File Systems’ Category

A Q&A with Michael Callahan, HP

Posted by Sunshine On May - 22 - 2009


We’ve been hearing about high capacity storage systems, such as HP’s Extreme Data Storage 9100 System (ExDS9100), for the past year or so. There’s clearly huge potential for using these types of systems to manage many terabytes of data. We decided to sit down with Michael Callahan, chief technologist for network-attached storage in the HP StorageWorks Unified Storage Division, and get his views on the trends towards larger capacity storage, deduplication and other responses to the rising tide of data. Below is our interview.

Sunshine: What kinds of trends are you seeing in storage, from where you stand?

Michael Callahan: We feel like we’re very well aligned with two of the big trends that are underway. The first trend is that storage systems are consolidating. Many enterprises are realizing they’re creating lots of individual silos, and data is spreading across many filers. While it’s natural for this to happen, it’s very inefficient. You wake up one morning and realize you’re dealing with immense and costly complexity, as well as poor utilization. That’s why there’s been a huge amount of interest in consolidation as an IT project, really across all industries. It’s just now that people are getting focused on consolidation as regards storage in some of these environments.

The second trend, of course, is around data reduction and other similar efficiencies.

Sunshine: Why is dedupe such a hot topic these days, in your view?

Michael Callahan: Well, obviously it has to do with economics. Much of the last decade has been focused on cost control, being able to do more with less. The more geeky answer is that on the technology side, machines are far faster than they were even five years ago. Over the years, the compute power in environments has risen significantly as compared to storage speeds. So it becomes much more appealing to spend some CPU cycles to reduce data before putting it into storage.

Sunshine: Wow. I’ve never heard it put in quite this way.

Michael Callahan: The way I look at it is that ten years ago, someone might’ve proposed something along the lines of Ocarina, but it would’ve been harder to justify. Today in systems like our ExDS9100 (the Extreme Data Storage System, and the space where we partner with Ocarina) we use industry standard components. That is, HP blade technology. So we’re able to leverage an engine that is building incredibly powerful, frugal blade systems with a lot of compute power, right in the storage system. Customers should be able to deploy that compute power into the storage tier very effectively in order to make it do interesting things such as running Ocarina. And because the ExDS9100 is based on Linux, we can run the Ocarina software right within the storage system itself.

Sunshine: What do you see as the advantage of your storage system?

Michael Callahan: We feel we have huge advantages in being able to build systems that leverage the fact that HP has the most successful, widely-used industry standard blade infrastructure.

People are asking, what can we do to be efficient in the way we spend our dollars for storage, and utilize our data center space? Actually, this goes beyond cost. It’s not just the money, but also the space for storage. There is literally no place to put the storage even if you can pay for it. So in light of that, there’s this incredible push for some set of tools that will allow customers to optimize their use of storage.

One thing that we at HP do is to build a system in the ExDS that is in itself very dense, power efficient, and simple to manage. That’s a good starting point. But then it’s very compelling to be able to go beyond that and say furthermore, we’re able to do some very advanced things with Ocarina around compressing the types of data that tend to show up on these huge data sets.

We think we have a better integration with Ocarina than the other systems out there. For the customers who will be using Ocarina with our solution, there’s no box that says Ocarina. The same blade that’s in the storage system would have Ocarina running as software within it.

In our design, the expectation is that you’ll have lots of disk storage, and then you’ll want some number of blades. The architecture allows the choice about how much storage and how many blades to be made completely independently of each other, and to be revised without having to do any complicated repartitioning or migration of data.

Sunshine: That’s the PolyServe aspect of the architecture?

Michael Callahan: Yes, and the relevance to Ocarina is that Ocarina is a capability that consumes CPU cycles to process your data, and you might well need to have some flexibility in the amount of CPU power in a system. Suppose you have a system where you’re going to load up some huge data set–100s of terabytes of data–but then access it at a relatively low rate.

In our ExDS system, you can put 16 servers into one rack unit; each of those blades has 2 CPUs, and each CPU has 4 cores. You can actually have as many as 128 cores in one compact box. If you’re using Ocarina, you might choose to have some relatively large number of blades, because you’re compressing data and there’s significant computation involved in that. So during the ingest process, you can configure the system in such a way as to optimize computation. But then once it’s all filled up, there’s no more need for lots of CPU to support the (Ocarina) Optimizer. So, at that point, you can take those blades out of the system and just run the Ocarina Reader on many fewer blades, proportional to the rate at which the data is being accessed.
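As a quick check of the arithmetic in that answer, here is a small Python sketch. The figures of 16 blades, 2 CPUs per blade and 4 cores per CPU come from the interview. The reduced read-only blade count is a made-up number for illustration only, not an HP or Ocarina sizing recommendation.

```python
# Figures quoted in the interview.
BLADES_PER_UNIT = 16
CPUS_PER_BLADE = 2
CORES_PER_CPU = 4

# Hypothetical: blades kept once ingest is finished and only reads remain.
READ_ONLY_BLADES = 4


def total_cores(blades):
    """Return the number of CPU cores provided by the given number of blades."""
    return blades * CPUS_PER_BLADE * CORES_PER_CPU


ingest_cores = total_cores(BLADES_PER_UNIT)
read_cores = total_cores(READ_ONLY_BLADES)

print(f"cores while ingesting and compressing: {ingest_cores}")
print(f"cores after scaling down for reads:    {read_cores}")
```

The point of the sketch is only that blade count, and therefore core count, is a dial that can be turned independently of how much disk sits behind it.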

In our approach there’s no partitioning of the data–every blade can access every part of that data set completely equally. You’re not required to go through some horrific rebalancing process to accommodate the number of blades. So it’s a really natural fit.

Michael Callahan is Chief Technologist for network-attached storage in the HP StorageWorks Unified Storage Division. Previously, he was Chief Technology Officer at PolyServe, a software company that delivered scalable, highly-available shared data clustering solutions, from its founding until it was acquired by HP in April 2007. Before that he led advanced development at Ask Jeeves and did mathematics research at the Mathematical Sciences Research Institute in Berkeley. He has a BA from Harvard University and was a Rhodes Scholar and Junior Research Fellow in Mathematics at Oxford University.

When You Only Have a Hammer…

Posted by Carter George On May - 7 - 2009

Nice to see Dave Simpson from InfoStor getting interested in the subject of dedupe for primary. Today, he had a new post on the topic, along with a plug for a webcast he’s doing on the topic with Noemi Greyzdorf of IDC. In the post, Dave makes the point that people are starting to muddy the waters when it comes to terms such as “dedupe for primary”–something that NetApp and others are popularizing at the moment. He notes that quite often, the term covers a lot more than just dedupe, and can include compression, single instancing, and other capacity optimization methods.

Dave makes some great points. For example, there are a lot of possible trade-offs you can make between performance and space savings when you are looking at data reduction technologies. My company Ocarina gives you the choice of superfast lightweight compression, subfile dedupe (fast), object dedupe (medium fast, but better results) and content-aware compression (slow, but great results).

You can turn any of these things on or off, and you can pick the optimizations you want by policy not only by volume, but right down to the individual file or file type.

For example, you could say, “I want wire speed lightweight compression only for my MS Word docs, but use object dedupe and content-aware compression on any PDFs you find in this homeshare volume; don’t do anything to this database volume, and do everything you know how to do to shrink this archive volume.”
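To make the idea of per-volume and per-file-type policy concrete, here is a minimal Python sketch of how such rules could be written down and looked up. It is not Ocarina's policy engine or syntax. The `Rule` class, the technique names and the first-match-wins ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Rule:
    """One hypothetical policy rule: which techniques apply to which files."""
    volume: str           # glob matched against the volume name
    pattern: str          # glob matched against the lower-cased file name
    optimizations: tuple  # technique names to apply (illustrative labels)


# Hypothetical rules mirroring the quoted policy. The first matching rule wins.
RULES = [
    Rule("database", "*", ()),  # leave the database volume alone
    Rule("archive", "*", ("lightweight", "subfile_dedupe",
                          "object_dedupe", "content_aware")),
    Rule("homeshare", "*.pdf", ("object_dedupe", "content_aware")),
    Rule("*", "*.doc", ("lightweight",)),
    Rule("*", "*.docx", ("lightweight",)),
]


def optimizations_for(volume, filename):
    """Return the techniques the rules select for one file (empty if none)."""
    for rule in RULES:
        if fnmatchcase(volume, rule.volume) and fnmatchcase(filename.lower(), rule.pattern):
            return rule.optimizations
    return ()


if __name__ == "__main__":
    for vol, name in [("homeshare", "Report.PDF"), ("homeshare", "memo.doc"),
                      ("database", "orders.doc"), ("archive", "scan.tif")]:
        print(vol, name, optimizations_for(vol, name))
```

Ordering matters in this sketch: the database rule sits first so that a Word document stored on the database volume is still left untouched.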

There’s no one right answer. The hotter your data, the more likely it is to be true primary data with lots of users reading it and writing it in real time. That means you have to be that much more careful about what data reduction algorithms you apply to it. The colder the data, the more aggressive you can be. Some data reduction vendors can only deploy in one tier of storage because they only have one kind of tool. If all you have is wire speed compression, you can do hot database data, but you won’t be able to shrink most kinds of data at all. If you have a heavyweight dedupe and compression solution, you may be stuck in archive or backup, because you’re not fast enough for primary. If all you have is a hammer, then all the world looks like a nail.

At Ocarina, we want to give you the toolbox, and the ability through policies to intelligently decide which tools to use for each file, volume or data set. This means you can match your data reduction strategy to a file’s performance requirements and get the best fit for each part of your storage - primary, nearline, archive, etc. Multiple tools, multiple options, and a far more customized solution. At the end of the day, having a really really great hammer is only marginally useful if what you need to do is cut something in half … like your storage budget.

Dedupe for Primary Hot Topic at SNW

Posted by Sunshine On April - 6 - 2009

SNW is kicking off with a bang today, and primary storage optimization is the topic du jour. Ocarina Networks, the leader in content-aware compression and dedupe for online storage, is partnering with cloud storage provider Nirvanix. The combination means cost savings and improved throughput to customers looking to leverage content-aware compression and object deduplication as part of an overall cloud storage solution. Beth Pariseau reported the news on SearchStorage this morning.

Meanwhile, another dedupe for primary player, Storwize, is launching a new series of appliances that it says will be a 35% improvement over its previous products, reports Pariseau. Storwize is different from Ocarina in that, while it is a primary storage optimization appliance, it is not content-aware and does not compress already compressed files. Rather, it is an inline device that provides generic compression only.

On a related note, Hifn, which is soon to be merged with Exar Corp., will also be releasing an inline storage optimization product.

For all the latest news and SNW gossip, feel free to follow #SNW on Twitter. We’ll be on the ground with regular updates, as will many other bloggers in the storage space.

Reprise–Compressing Already Compressed Files

Posted by Carter George On March - 5 - 2009

Time to fire up the wayback machine and take a look at blog posts gone by. This one, written by Carter George last May, addresses the question of whether it’s possible to further compress already compressed files. With unstructured data loads skyrocketing, this question is as relevant now as it was a year ago, if not more so.

Here is what the original post, “Can You Compress An Already Compressed File? Parts I and II” had to say about this still very timely topic:

We can all recognize the amount of data we generate. And just like we keep telling ourselves we’ll clean out the garage “one of these days,” most of us rarely bother to clean out our email or photo sharing accounts.

As a result, enterprise and internet data centers have to buy hundreds of thousands of petabytes of disk every year to handle all the data in those files. It all has to be stored somewhere.

One way to reduce the amount of storage growth is to compress files. Compression techniques have been around forever, and are built into many operating systems (like Windows) and storage platforms (such as file servers).

Here’s the problem: most modern file formats, the formats driving all this storage growth, are already compressed.
· The most common format for photos is JPEG – that’s a compressed image format.
· The most common format for most documents at work is Microsoft Office, and in Office 2007, all Office documents are compressed as they are saved.
· Music (mp3) and video (MPEG-2 and MPEG-4) are highly compressed.

The mathematics of compression are that once you compress a file, and reduce its size, you can’t expect to be able to compress it again and get even more size reduction. The way compression works is that it looks for patterns in the data, and if it finds patterns it replaces them with more efficient codes. So if you’ve compressed something once, the compressed file shouldn’t have any patterns in it.

Of course, some compression algorithms are better than others, and you might see some small benefits by trying to compress something that has already been compressed with a lesser tool, but for the most part, you’re not going to see a big win by doing that. In fact, in a lot of cases, trying to compress an already compressed file will make it bigger!
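You can see this effect for yourself with a few lines of Python and the standard `zlib` module, a generic compressor. The sketch below builds some repetitive, English-like filler text (a made-up stand-in for a real document), compresses it once, then compresses the result again. Exact byte counts depend on the zlib build, so none are quoted here.

```python
import random
import zlib


def make_sample_text(seed=0, n_words=40000):
    """Build deterministic, English-like filler text as a stand-in document."""
    rng = random.Random(seed)
    words = ["storage", "data", "file", "system", "blade", "dedupe",
             "compress", "archive", "primary", "volume", "the", "of",
             "and", "to", "in", "is"]
    return " ".join(rng.choice(words) for _ in range(n_words)).encode("ascii")


def compress_twice(data):
    """Return (original size, size after one pass, size after a second pass)."""
    once = zlib.compress(data, 9)
    twice = zlib.compress(once, 9)
    return len(data), len(once), len(twice)


if __name__ == "__main__":
    raw, once, twice = compress_twice(make_sample_text())
    print(f"original:    {raw} bytes")
    print(f"first pass:  {once} bytes")
    print(f"second pass: {twice} bytes")
```

The first pass should shrink the text a great deal. The second pass has almost no patterns left to work with, so the output typically comes out about the same size or a few bytes larger, because the compressor still has to add its own header and bookkeeping.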

Conventional wisdom dictates that once files are compressed via commonly used technologies, the ability to further limit their size and consumption of expensive resources is nearly impossible. So, what can be done about this?

End of part I

….

Part II

On the cutting edge, there are some new innovations in file-aware optimization that allow companies to reduce their storage footprint and get more from the storage they already have. The key to this is understanding specific file types, their formats, and how the applications that created those files use and save data. Most existing compression tools are generic. To get better results than you can get with a generic compressor, you need to go to file-type-aware compressors.

There’s another problem. Let’s say you just created a way better tool for compressing photographs than JPEG. That doesn’t mean your tool can compress already-compressed JPEGs, it means that if you were given the same original photo in the first place, you could do a better job. So the first step in moving towards compressing already-compressed files is what we call Extraction – you have to extract the original full information from the file. In most cases, that’s going to involve de-compressing the file first, getting back to the uncompressed original, and then applying your better tools.

Extraction may seem simple enough – just reverse whatever was done to a file in the first place. But it’s not always quite that easy. Many files are compound documents, with multiple sections or objects of different data types. A PowerPoint presentation, for example, may have text sections, graphics sections, some photos pasted in, etc. The same is true for PDFs, email folders with attachments, and a lot of the other file types that are driving storage growth. So to really extract all the original information from these files, you may need to not only be able to decompress files, but to look inside them, understand how they are structured, break them apart into their separate pieces, and then do different things to each different piece.
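Here is a rough Python sketch of that first extraction step for one family of compound documents. Office 2007 files are ZIP containers, so the standard `zipfile` module can open them, decompress each part and sort the parts by type. This is an illustration under simplifying assumptions, not Ocarina's extractor. The demo container is built in memory, its member names are only loosely modeled on a PowerPoint file, and the "image" is placeholder bytes rather than a real JPEG.

```python
import io
import zipfile

TEXT_EXTS = {"xml", "rels", "txt"}
IMAGE_EXTS = {"jpg", "jpeg", "png", "gif"}


def classify(name):
    """Guess a part's data type from its file extension (a crude heuristic)."""
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    if ext in TEXT_EXTS:
        return "text"
    if ext in IMAGE_EXTS:
        return "image"
    return "other"


def extract_parts(container_bytes):
    """Return {member name: uncompressed bytes} for a ZIP-based compound file."""
    parts = {}
    with zipfile.ZipFile(io.BytesIO(container_bytes)) as zf:
        for info in zf.infolist():
            parts[info.filename] = zf.read(info)
    return parts


def make_demo_container():
    """Build a tiny in-memory ZIP with hypothetical, pptx-like member names."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("ppt/slides/slide1.xml", "<slide>" + "hello " * 500 + "</slide>")
        zf.writestr("ppt/media/image1.jpeg", bytes(range(256)) * 8)  # placeholder bytes
    return buf.getvalue()


if __name__ == "__main__":
    for name, data in extract_parts(make_demo_container()).items():
        print(f"{name}: {classify(name)}, {len(data)} bytes uncompressed")
```

Once the parts are separated like this, each one can be handed to a compressor suited to its type, which is where the different treatment of each piece comes in.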

The two things to take away from this discussion are: 1) you won’t get much benefit from applying generic compression to already-compressed file types, which are the file types that are driving most of your storage growth, and 2) it is possible to compress already-compressed files, but to do so, you first have to extract all the original information from them, which may involve decoding and unraveling complex compound documents and then decompressing all the different parts. Once you’ve gotten to that point, you’re only at the starting point for online data reduction on today’s file types.

Ocarina Raises $20 Million

Posted by Sunshine On February - 25 - 2009

Today, Ocarina Networks announced that it closed a $20 million Series B funding round. This is obviously great news for the company, as well as a strong validation of what Ocarina set out to accomplish. The funding round was led by Jafco Ventures, with significant participation from Series A investors Kleiner Perkins Caufield & Byers and Highland Capital Partners.

The Mercury News did a very nice piece on the funding today. As reporter Scott Harris put it:

“The incredible shrinking economy may not be creating much in the way of jobs, profits or consumer confidence, but it’s doing a bang-up job of producing data. The Information Age is nothing if not a volcanic profusion of digitized documents, photographs and video — not to mention the data emerging from the genomics industry and other deeply scientific pursuits.”

Ocarina has a solution for these rising storage demands. As VentureBeat wrote this morning, 90 percent compression is an attractive proposition these days, not only to manage costs, but also, as the article states, “for companies concerned with greening their business models, cutting down on storage requirements can help shrink energy footprints.”

This new round of venture funding, raised at a time when the economy is in dire straits, demonstrates the immense need that Ocarina fulfills in the marketplace.

Can You Compress Already Compressed Files? Part II

Posted by Carter George On May - 6 - 2008

In my last post I discussed the fact that most files that are used are already compressed. And up to now, there were no algorithms to further compress them. Yet, it’s obvious that there needs to be a new solution.

On the cutting edge, there are some new innovations in file-aware optimization that allow companies to reduce their storage footprint and get more from the storage they already have. The key to this is understanding specific file types, their formats, and how the applications that created those files use and save data. Most existing compression tools are generic. To get better results than you can get with a generic compressor, you need to go to file-type-aware compressors.

There’s another problem. Let’s say you just created a way better tool for compressing photographs than JPEG. That doesn’t mean your tool can compress already-compressed JPEGs, it means that if you were given the same original photo in the first place, you could do a better job. So the first step in moving towards compressing already-compressed files is what we call Extraction – you have to extract the original full information from the file. In most cases, that’s going to involve de-compressing the file first, getting back to the uncompressed original, and then applying your better tools.

Extraction may seem simple enough – just reverse whatever was done to a file in the first place. But it’s not always quite that easy. Many files are compound documents, with multiple sections or objects of different data types. A PowerPoint presentation, for example, may have text sections, graphics sections, some photos pasted in, etc. The same is true for PDFs, email folders with attachments, and a lot of the other file types that are driving storage growth. So to really extract all the original information from these files, you may need to not only be able to decompress files, but to look inside them, understand how they are structured, break them apart into their separate pieces, and then do different things to each different piece.

The two things to take away from this discussion are: 1) you won’t get much benefit from applying generic compression to already-compressed file types, which are the file types that are driving most of your storage growth, and 2) it is possible to compress already-compressed files, but to do so, you first have to extract all the original information from them, which may involve decoding and unraveling complex compound documents and then decompressing all the different parts. Once you’ve gotten to that point, you’re only at the starting point for online data reduction on today’s file types.

Can you compress an already compressed file? Part I

Posted by Carter George On May - 1 - 2008

We can all recognize the amount of data we generate. And just like we keep telling ourselves we’ll clean out the garage “one of these days,” most of us rarely bother to clean out our email or photo sharing accounts.

As a result, enterprise and internet data centers have to buy hundreds of thousands of petabytes of disk every year to handle all the data in those files. It all has to be stored somewhere.

One way to reduce the amount of storage growth is to compress files. Compression techniques have been around forever, and are built into many operating systems (like Windows) and storage platforms (such as file servers).

Here’s the problem: most modern file formats, the formats driving all this storage growth, are already compressed.
· The most common format for photos is JPEG – that’s a compressed image format.
· The most common format for most documents at work is Microsoft Office, and in Office 2007, all Office documents are compressed as they are saved.
· Music (mp3) and video (MPEG-2 and MPEG-4) are highly compressed.

The mathematics of compression are that once you compress a file, and reduce its size, you can’t expect to be able to compress it again and get even more size reduction. The way compression works is that it looks for patterns in the data, and if it finds patterns it replaces them with more efficient codes. So if you’ve compressed something once, the compressed file shouldn’t have any patterns in it.

Of course, some compression algorithms are better than others, and you might see some small benefits by trying to compress something that has already been compressed with a lesser tool, but for the most part, you’re not going to see a big win by doing that. In fact, in a lot of cases, trying to compress an already compressed file will make it bigger!

Conventional wisdom dictates that once files are compressed via commonly used technologies, the ability to further limit their size and consumption of expensive resources is nearly impossible. So, what can be done about this?

Greening storage

Posted by Carter George On May - 1 - 2008

The New York Times Bits blog has a post on the need to green Internet and other data centers, “Data Centers are Becoming Big Polluters.” Citing a study by McKinsey & Company, Bits’ Steve Lohr states that data centers are “projected to surpass the airline industry as a greenhouse gas polluter by 2020.”

He goes on to sum up the report, which “also lists 10 ‘game-changing improvements’ intended to double data center efficiency, ranging from using virtualization software to integrated control of cooling units.”

Many of us are aware that server virtualization is the path to increasing server utilization. But servers are only half of the data center picture. The other half is storage. The solution for that? Storage optimization.

Just as server virtualization lets you turn 10 physical servers into 10 virtual servers and then consolidate them onto one physical machine, storage optimization lets you store 10 times more files on a given disk than you can today. The heat, cooling, rackspace, and power benefits are obvious.

Update: Ben Worthen at the Wall Street Journal is also discussing this on the Business Technology Blog. His post, “Can the Tech Guy Afford to Care about Pollution?” also talks about how the problem will only get worse in the future. Worthen’s take: “Given that most of the tech departments we talk to are looking to cut costs, they’re not likely to invest in new technology that will cut emissions, unless it cuts short-term costs at the same time.”

Less is More–Part 2

Posted by Carter George On April - 25 - 2008

As we all know, the internet is where there is huge storage growth, multi-petabyte scale, and a need to stay very close to the commodity price point on storage costs. There are two common threads across all of the “less is more” file systems that have been popping up to handle all this growth. 

First, they are all designed in a way that you can build very scalable, very large pools of storage using generic white box servers stuffed with cheap disks. Second, they mostly support only the most primitive operations: create a new file, read that file, delete a file. While I’m generalizing, and this is not exactly true for all of these new file systems, many just skip things that are considered standard in traditional file systems: locking, POSIX semantics, authentication, ACLs, concurrency control, metadata or the ability to list and search for files.

The overhead of all those traditional file system operations is too much for massive internet-scale operations where the primary purpose of a file system is for a user to upload something, for millions of people to look at it over and over, and maybe someday, sometime, someone will delete something.

These file systems are in contrast to advanced file system developments from places like NetApp’s latest OnTap and WAFL releases, HP’s PolyServe cluster file system, or the transaction-enabled NTFS from Microsoft that you can find in Server 2008. 

The line in the sand is, there are file systems that are designed to be used by people, and file systems that are designed to be used by specific applications only.    

The commercial file systems grew up serving the needs of business users and business applications. They are designed to host a wide variety of applications, including production databases, to let users peruse and manage their files, and to let storage administrators keep up with growth, availability, and corporate compliance requirements.

As a consequence, more and more value-add features are being put into the file system to support these use cases. The “less is more” crowd, on the other hand, wants a very cost-effective but massively scalable pool of storage to make available to their web applications. A global namespace (so it looks like one giant pool of storage), and low, low cost per terabyte are the drivers of these file systems.

Users don’t list their directories in these file systems. In fact, users never see these file systems. Users see web applications, and the web applications use databases to keep track of what files are where in the massive storage pool, and who is allowed to see them. In that sense, in the “less is more” file system world, a lot of the value-add and management functionality of the file system is moving up into the application layer, especially in the largest content-rich web sites.
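To illustrate that division of labor, here is a toy Python sketch. It is not the design of GoogleFS, S3 or any other real system. The `BlobStore` and `PhotoApp` classes are invented for this example, with an in-memory dictionary standing in for the storage pool and an in-memory SQLite database standing in for the application's catalog.

```python
import sqlite3
import uuid


class BlobStore:
    """Deliberately minimal store: create, read, delete. No listing, locks or ACLs."""

    def __init__(self):
        self._blobs = {}

    def create(self, data):
        blob_id = uuid.uuid4().hex
        self._blobs[blob_id] = bytes(data)
        return blob_id

    def read(self, blob_id):
        return self._blobs[blob_id]

    def delete(self, blob_id):
        self._blobs.pop(blob_id, None)


class PhotoApp:
    """Hypothetical web application that tracks names, owners and access itself."""

    def __init__(self, store):
        self.store = store
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE files (owner TEXT, name TEXT, blob_id TEXT, "
            "public INTEGER, PRIMARY KEY (owner, name))")

    def upload(self, owner, name, data, public=False):
        blob_id = self.store.create(data)
        self.db.execute("INSERT INTO files VALUES (?, ?, ?, ?)",
                        (owner, name, blob_id, int(public)))
        return blob_id

    def fetch(self, viewer, owner, name):
        row = self.db.execute(
            "SELECT blob_id, public FROM files WHERE owner = ? AND name = ?",
            (owner, name)).fetchone()
        if row is None:
            raise KeyError(name)
        blob_id, public = row
        if viewer != owner and not public:
            raise PermissionError(f"{viewer} may not read {owner}/{name}")
        return self.store.read(blob_id)

    def list_files(self, owner):
        rows = self.db.execute(
            "SELECT name FROM files WHERE owner = ? ORDER BY name", (owner,))
        return [r[0] for r in rows]


if __name__ == "__main__":
    app = PhotoApp(BlobStore())
    app.upload("alice", "beach.jpg", b"...", public=True)
    app.upload("alice", "taxes.pdf", b"...")
    print(app.list_files("alice"))
    print(app.fetch("bob", "alice", "beach.jpg"))
```

Notice that the store itself cannot list anything or check who is asking. Listing and access control exist only because the application keeps its own table, which is the shift into the application layer described above.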

From my point of view, the feature-rich commercial file systems will continue to evolve to meet the needs of corporate customers, including scaling to meet their growth needs. The “less is more” file systems will continue to push out traditional file systems in the highest growth web properties and other customers whose data growth is at that many-petabyte scale. Finally, the two things are not entirely incompatible – most of the new web tier file systems actually have a bunch of single node file systems buried in them on each storage node somewhere at the bottom building block level of their architecture.

But it’s time that these two file system approaches evolve and develop some kind of relationship–because for now, neither is perfectly suited to the problem at hand. There’s no reason why those building blocks couldn’t have richer functionality, such as transparent clustering and failover, that comes from commercial file systems, and still give you the massive scale and cheap $/petabyte of a global namespace and commodity building blocks.

The internet has often been the cauldron in which new technologies are forged that then eventually move into the corporate data center. We saw this in the server world, where low cost Linux servers displaced Sun and other Unix systems early on, and eventually that movement to cheaper, standard servers pushed Big Unix out of the corporate data center too.

The cost difference between a corporation’s EMC DMX storage array and a storage pool of white boxes with disk is even greater than the cost difference between Unix machines and standard Linux boxes. People are more hesitant to change storage platforms than server platforms (for good reason), but that huge cost difference and the rate at which storage is growing are going to cause the shift to happen sooner or later.

My prediction (and hope) is that someone will figure out a way to marry the “less is more” simple file system layers with richer underlying commercial file systems. This is what’s needed.

Less is More

Posted by Carter George On April - 24 - 2008

Less is more … or is it? Part One

I recently returned from Storage Networking World in Orlando. As everyone knows, the conference is mainly a place for storage vendors to meet each other, tout their wares, and nose around in their competitors’ booths pretending to be potential customers. There are some good sessions, however, and one of the best was IDC analyst Noemi Greyzdorf’s presentation on the future of file systems.

Her smart and interesting talk was on the evolution of clustered, distributed, and grid file systems. As I listened, it occurred to me that I’m seeing a big split in the file system world, especially at the high end, where really large amounts of data are stored.

One of Noemi’s key points is that more and more functionality is being packed into file systems. As she puts it, file systems are the natural place for value-add knowledge about storage to be kept. That’s certainly true, and there are a number of advanced file systems that are becoming richer and richer in terms of integrated features.

At the same time, there is definitely a “less is more” crowd emerging, where many of the most basic features of file systems are being left out in some of the newest large-scale file systems around. This group includes file systems like GoogleFS, Hadoop, Mogile, Amazon’s S3 simple storage service, and the in-house developments at a couple of other very large online web 2.0 shops.

Are these two trends in file systems headed on a collision course? I don’t think so. But what I do see is that neither of these solutions is nailing the growing problem posed by the exploding amount of internet data that needs to be managed and stored. In other words, there are issues with both of these approaches. In my next entry, I will discuss what those issues are, and how we might solve them.