
Online Storage Optimization

Exploring Next Generation Storage Solutions

Passing along what we learned from OEMs

Posted by Mike Davis On July - 18 - 2010

It’s an exciting time at Ocarina, because we’re right in the middle of a wave of OEM efforts to bring data-reduction to market as a standard feature across a wide variety of storage implementations. ECOsystem for OEMs is an Ocarina offering of software libraries and APIs that allows storage OEMs and ISVs to embed data-reduction into their products. At Ocarina, we firmly believe that within a couple years, not only will we see dedupe as a new “standard feature” in all major block and file storage products, but it will be increasingly found in host applications as well.

Here is a series of requirements that we’ve consistently heard from OEMs in our discussions, along with how Ocarina addresses each in the ECOsystem for OEMs.

1) The solution shrinks data well.
OK, so this is obvious. Dedupe is supposed to shrink data, and that improves system utilization and reduces costs. Doing this well is about algorithms, and Ocarina has been on top in the algorithm game since we started shipping our primary storage optimization products almost 2 years ago. We’ve shown in some deployments that we can deliver 80% savings when traditional block dedupe delivers no more than 30%. We believe, in fact, that we’re the only storage technology vendor to actually employ algorithm PhDs whose sole job is to invent better algorithms…which is something that has proven really useful in specialized markets where a few unique content types dominate the terabytes.

We are the only provider today that deploys dedupe and compression concurrently. Our dedupe algorithm is a content-aware, variable-block, sliding-window approach. ‘Content-aware’ is an overused term for sure, but here it means, for example, that we recognize a monolithic data structure in a stream, such as a JPEG blob, and we know that slicing that JPEG into 8KB chunks for dedupe delivers absolutely no benefit and is thus a complete waste of time, CPU, and memory. We treat that JPEG (and other data types like it) as a contiguous chunk, which makes our dedupe namespace extremely fast and efficient.
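
To make the idea concrete, here is a minimal Python sketch of content-aware, variable-block chunking. It is an illustration under stated assumptions, not Ocarina's implementation: the gear-style rolling hash, the chunk-size parameters, the SHA-256 fingerprints, and the naive JPEG detection (scanning for start and end markers) are all choices made for this example.

```python
import hashlib
import random

# Hypothetical parameters chosen for illustration only.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte
JPEG_SOI = b"\xff\xd8\xff"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"      # JPEG end-of-image marker


def variable_chunks(data, mask=0x3FF, min_size=256, max_size=8192):
    """Cut data where a rolling hash of the most recent bytes hits a pattern."""
    start, h = 0, 0
    for i, byte in enumerate(data):
        # Shifting pushes old bytes out of the masked low bits, so the
        # boundary test depends only on a short sliding window of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            yield data[start:i + 1]
            start, h = i + 1, 0
    if start < len(data):
        yield data[start:]


def content_aware_chunks(data):
    """Emit each JPEG blob as one chunk; chunk everything else by content."""
    pos = 0
    while pos < len(data):
        soi = data.find(JPEG_SOI, pos)
        eoi = data.find(JPEG_EOI, soi + len(JPEG_SOI)) if soi != -1 else -1
        if eoi == -1:  # no complete JPEG left in the stream
            yield from variable_chunks(data[pos:])
            return
        yield from variable_chunks(data[pos:soi])
        yield data[soi:eoi + len(JPEG_EOI)]  # the whole blob, never sliced
        pos = eoi + len(JPEG_EOI)


def dedupe(chunks):
    """Store each unique chunk once; the recipe lists fingerprints in order."""
    store, recipe = {}, []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe
```

Joining `store[d]` for each fingerprint `d` in `recipe` reproduces the original stream. A real implementation would parse the file format rather than scan for markers, since an embedded thumbnail can carry its own end marker.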

2) The solution minimizes time to market.
Most OEM vendors are under considerable time pressure to bring dedupe to market, either as a competitive response, or because their customers are demanding it. But the OEM’s dev and test engineering resources are always limited. With that in mind we made a point of developing a full-featured library. That means not only do we slice the data and do hash lookups (as with the Permabit product), but we also do the dedupe, compression, on-disk data management, metrics and reporting, throttling mechanisms, optimized data movement, and more. By delivering a full-featured suite of capabilities, the OEM can rapidly bring embedded data reduction to market without, for example, having to redesign their file system or block map systems to complete the dedupe workflow.

The other attribute that accelerates time to market is simplicity. Despite being full-featured, the ECOsystem for OEMs is accessed via a lightweight, object-like API that OEM developers have told us is extremely simple to work with.

3) The solution has the flexibility to support specific use cases.
The requirements and functional expectations for implementing data reduction differ from device to device, and any embedded solution needs to have the adaptability to serve different applications. For example content-aware algorithms may not be a meaningful solution in a block array where data structures are completely opaque. Or CPU and memory constraints on a given device may require the use of a lighter weight dedupe workflow, and that shouldn’t force major architectural rework of the solution.

The ECOsystem for OEMs has been designed to support 6 embedded use cases: Servers, block arrays, NAS, object stores, cloud storage, and backup targets. Some of the differences between these tiers manifest themselves as implementation best practices, but there are also clear functional decision-points that allow an OEM to implement the right solution for the job. Importantly though, all of these tiers are compatible in key respects, allowing cross-platform manageability, and giving rise to end-to-end features such as optimized data movement.

4) The solution has high performance, while working within resource constraints.
Performance overhead is an often discussed problem associated with dedupe solutions. We’ve learned to solve these problems through 2 years of empirical experience, and believe we have the fastest, lowest-overhead dedupe solution. Moreover, ECOsystem for OEMs obeys hard constraints of the host platform in terms of CPU and memory usage.

There are a couple main points where dedupe can impose performance penalties:
A) During write, the chunking and lookup process takes time. Like other solutions, Ocarina does this in memory to reduce that penalty. You do have to be careful to understand for a given chunk size how much unique data that 1GB of dictionary can address, and given a constrained memory, how far can the solution go before forcing on-disk (=slowww) lookups. ECOsystem for OEMs utilizes the industry’s most memory efficient lookup design to make the most of existing resources.
B) During read, the reads from disk for any de-duplicated volume are more random than typical disk IO. Ocarina is also able to mitigate this impact through content-aware read-ahead caching that anticipates the next chunks that will be read from disk.
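
The memory question in (A) comes down to simple arithmetic. The sketch below assumes a hypothetical in-memory index entry of 32 bytes (a fingerprint plus a disk location); the real figure depends on the lookup design and is not stated here.

```python
def addressable_bytes(dictionary_bytes, entry_bytes, avg_chunk_bytes):
    """Unique data an in-memory dictionary can index before spilling to disk."""
    entries = dictionary_bytes // entry_bytes
    return entries * avg_chunk_bytes


GIB = 2**30
ENTRY_BYTES = 32  # assumed size of one index entry; not a vendor figure

for chunk_kib in (4, 8, 64):
    covered = addressable_bytes(1 * GIB, ENTRY_BYTES, chunk_kib * 1024)
    print(f"{chunk_kib:>2} KiB chunks: 1 GiB of dictionary covers {covered // GIB} GiB")
```

Under these assumptions, 1 GiB of dictionary indexes 256 GiB of unique data at 8 KiB chunks. Larger chunks stretch the same memory further, but they typically find fewer duplicates, so the chunk size is a tradeoff rather than a free parameter.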

5) The solution supports next-generation features that customers want and competitors don’t have.
Ocarina has spent a lot of time in the market talking to end-customers about what they want in a dedupe solution. In addition to things like “shrink well”, a couple things keep coming out. One is to keep data in shrunken form as it moves around in workflow operations such as replication, backup, tiering, etc. We call this end-to-end optimization (or E2EO) and it expands the value of dedupe by improving backup processes, reducing LAN bandwidth, and reducing the CPU overhead that occurs when data is repeatedly rehydrated and re-deduped for no reason. For those OEMs who carry a broad catalog of server, storage, and backup solutions, there are huge benefits in being able to deliver this end-to-end value to customers who have adopted that OEM across their entire IT infrastructure.

The other key feature is the ability to support structured (e.g., database) and semi-structured (e.g., VM) applications. The files in these applications are almost always >95% static and <5% active. But because they are active, traditional dedupe solutions can create a tremendous IO overhead as dedupe operations battle against a steady stream of changes. Instead of getting in the way, Ocarina has invented a way to dedupe the inactive portions of these files, while imposing no performance overhead on active IO. Like our dedupe chunking process, this is one of several content-aware features that Ocarina delivers in the ECOsystem for OEMs.

These advanced dedupe features — which bring data reduction benefits to new applications and workflows — allow storage OEMs and ISVs to deliver more value to customers, to do it faster, and differentiate their product over competitive offerings that have first-generation dedupe capabilities.

I dream of data reduction

Posted by Sunshine On March - 29 - 2010

[Image: jeannie]

Data is growing at a dizzying rate. We need only look at our home computers to get a sense of how easy it is to fill our hard drives to overflowing with all manner of flotsam and jetsam. From family photos to LOLcats to videos of our kids, we’re finding it difficult if not impossible to keep down the rising tide of files.

There is a cost to this, as many if not most enterprises are now recognizing. Recently, InfoWorld launched a special section, Data Explosion, that guides companies through the myriad problems that arise from having too much data to handle. With headlines like “The big data addiction,” the new section promises to address the issue with step-by-step guides, white papers, and other instructional pieces.

InfoWorld blogger Matt Prigge delves into the topic in a post today, “The high cost of lazy storage.” He says that users need to take responsibility for keeping their data under control. Despite this admonishment, he admits that he himself is an “excellent example of the problem.” He saves all of his email, because he never knows what he might need later. Sound familiar? If someone whose blog is called “Information Overload” can’t get control of his personal data, it’s hard to imagine how anyone else can.

Prigge writes, “The bigger that data gets, the more effort required to put the genie back in the bottle.” He pushes the metaphor even further (and more gruesomely) by suggesting that at some point it’s easier to kill the genie and throw away the bottle. Now, that does strike us here at Online Sto Op as rather extreme. Why not simply put the genie back into that nice, compact bottle where she was living perfectly happily for so many years?

As we all know from 70s TV, those bottles were well-upholstered and downright comfortable living spaces for many a genie. And while it’s true that some genies (or Jeannies) would get so angry they’d stomp their feet when they were magically sent back there, they eventually settled back onto the purple pillows, kicked off their metallic platform heels, dug their toes into the shag carpeting and relaxed. Same goes for data reduction. A combination of approaches seems the most sensible answer. Data needs to be managed. There is something that is known as 100% compression–it’s called “deletion.” But short of that, there are ways to reduce data by as much as 90%. There are solutions for reducing the types of files that are driving the fastest storage growth, such as JPEGs, documents, videos, graphics, and other large files. An intelligent, content-aware approach that includes both deduplication and compression is what this blog’s parent Ocarina provides.

Storage News and Views - March 17

Posted by Sunshine On March - 17 - 2010

[Image: saint_patricks_day_cheer]

Across the storage blog-o-tweet-osphere today folks are donning green scarves, putting four leaf clovers in their lapels, and generally proclaiming the luck of the Irish. Yes, it’s a good life in storageland. And there’s plenty of news to amuse and bemuse.

EMC made a big splash this week with a presentation to analysts by President and COO Pat Gelsinger that outlined a new vision for virtual storage. You can listen to the whole thing here. Chris Mellor at the Register called the plan a sign that EMC has “lost their marbles,” but others think it represents the future of storage.

Here is some of the commentary from both within and outside EMC:

EMC:

Chuck’s Blog - This changes everything

Blog Stu - Virtual Storage, not just another V-word

Commentary:

**New addition thanks to @sfoskett** Burton Group: EMC’s Global Storage Vision

Greg’s StorageIO blog - Virtual Storage and Social Media: What did EMC not Announce?

Chris Mellor, The Register - Gelsinger stuns analysts and colleagues with storage pool plan

Stacey Higginbotham, GigaOM (yes, GigaOM! Welcome to sto-land, Stacey) - EMC’s Crazy Plan to Create a Worldwide Data Cloud

In other news… there’s a really sweet video on the Hitachi Data Systems site that talks about its partnership with this blog’s parent Ocarina Networks, and how this will benefit customers, reducing their data at rest by 10:1. Ocarina CEO Murli Thirumale makes a pixelated, jazz music backed appearance.

Here’s the video in its entirety, or go to the HDS blog site and watch a higher quality version:

Meanwhile, it’s not all sword crossing in the land of the storers.

As we already know, our kind can rally for a good cause. This past week, arch rivals NetApp and EMC raised money for kids with cancer by shaving their locks for St. Baldrick’s. NetApp led the charge, and EMC responded in kind.

Virtual Geek Chad Sakac sums it up here: A little EMC/NetApp Fun - to help cure cancer…

A heartwarming effort.

That’s all for now. Remember, it’s not what you store, it’s how you store it.

Make the right call

Posted by Sunshine On March - 10 - 2010

Four out of five college students agree, this is not the way to deal with data growth. How about this instead?

[Image: stuffed-phonebooth]


Fast and Effective Dedupe

Posted by Carter George On March - 3 - 2010

I’ve noticed a few blog posts recently about speed of deduplication in the modern data center. I agree that speed is an important factor, but keep in mind that not all dedupe is created equal. That is to say, fast is good, but only if you are also effective. One of the tricky things has been that the easiest data to compress is also usually the most carefully performance tuned. A great example of this is a database: databases consist largely of simple alphanumeric fields and sparse tables, all of which are easy to reduce in size.

However, a company’s core transactional database is the most conservative asset in the data center. Introducing compression would save space, for sure, but you could only use very fast, simple compressors there. At the same time, customers will be hesitant to deploy a new layer of processing in their most sensitive application.

So, where is most data growth? In fact, it’s being driven by unstructured data – Office documents, rich media, email with attachments, PDFs, Flash videos, and so forth. This complex data does not lend itself to fast simple compressors. But perhaps we should back up for a moment and think about how customers have been behaving all along.

Throughout the history of storage, there have always been tradeoffs available between fast expensive storage, and slower but cheaper alternatives. This is not a bad thing. It gives users alternatives based on their priorities and budgets. Back in the old mainframe days, these choices were between very expensive mainframe memory and “offline” storage like drums, cards, and tapes. Today the technology is all much bigger, faster, cheaper and sexier. But really, the tradeoffs are the same.

Data reduction technology adds another layer of choice above and beyond the traditional hardware choices. Now in addition to choosing whether you want fast, expensive solid state disk (SSD) or slower but very cost-effective SATA, you can also choose whether you want to compress and/or deduplicate the data that is stored on those disks.

Just like physical disks, compression and dedupe come in a range of speeds and capabilities. There are simple and very fast compressors that are essentially invisible in terms of their impact on storage performance. There are more complex compressors that get better results, but which may take longer, either to compress or to decompress the data. Deduplication, done well, should always be pretty fast, and streaming dedupe rates of well over 300MB/sec are now available from many vendors (including Data Domain and Ocarina).
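
The speed-versus-results tradeoff is easy to see with the compressors in Python's standard library. The sketch below is only an illustration: the synthetic log-like sample and the choice of `zlib` and `lzma` are assumptions for this example, and the numbers will vary with the data and the machine.

```python
import lzma
import random
import time
import zlib


def sample_records(seed=0, distinct=2000, total=20000):
    """Build log-like text with both short-range and long-range repetition."""
    rng = random.Random(seed)
    lines = [
        f"user={rng.randrange(10**6)} op={rng.choice(['read', 'write'])} "
        f"bytes={rng.randrange(10**7)}\n"
        for _ in range(distinct)
    ]
    return "".join(rng.choice(lines) for _ in range(total)).encode()


def profile(data):
    """Return {name: (compressed_size, seconds)} for a few stdlib codecs."""
    codecs = {
        "zlib level 1": lambda d: zlib.compress(d, 1),  # fast, simple
        "zlib level 9": lambda d: zlib.compress(d, 9),  # same format, more effort
        "lzma": lzma.compress,                          # heavier, larger dictionary
    }
    results = {}
    for name, compress in codecs.items():
        started = time.perf_counter()
        size = len(compress(data))
        results[name] = (size, time.perf_counter() - started)
    return results


if __name__ == "__main__":
    data = sample_records()
    for name, (size, seconds) in profile(data).items():
        print(f"{name:12}  {size / len(data):6.1%} of original  {seconds * 1000:7.1f} ms")
```

Running the same comparison on a sample of your own data is a quick way to see which point on the curve suits a given tier.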

The emergence of tools to automatically tier data to its appropriate place helps make the use of all of these technologies more feasible. That applies as much to solid state disks as it does to dedupe and compression. When data tiering can be made invisible to end-users and applications, then implementing multiple physical and logical tiers of storage becomes practical. Good examples would include EMC’s new FAST tools, Compellent’s “Fluid Data Storage”, and HDS’s Data Migrator. When users or administrators have to move data by hand to get it to a compressed tier or a solid state disk, then the operational costs offset the capital savings.

You might want to be wary when someone’s biggest claim to fame is fast dedupe. Just as the old mainframe admin had to decide whether something was important enough to live in RAM, or could be stored on cheaper tapes instead, today’s IT shops have to decide where it is most important to try to get data reduction, and what tool will get the most bang for the buck for that kind of data. You need the whole story, and then you can decide based on your own priorities.

Dedupealooza

Posted by Sunshine On February - 19 - 2010

So much talk about dedupe these days it’s hard to keep up. The industry is waking up to the reality that dedupe is one of the best ways to reduce data, thus saving on power, cooling, space and other crippling storage costs.

Some of the more thought provoking posts of late:

DCIG - How SSDs can be leveraged to Deliver Inline Deduplication for Primary Storage
Jerome Wendt responds to a comment from someone about Hifn’s Bitwackr inline dedupe. I don’t necessarily agree with Jerome’s take on this. In general, inline solutions are extremely limited, as the original commenter pointed out. But the post provides interesting food for thought.

Storagebod - Where is OnTap 8 with a bit of a rant!
Martin Glassborow isn’t talking specifically about NetApp dedupe here, but the delay on shipping OnTap 8 is of interest to anyone who is concerned about data reduction products. As he puts it, the elephant in the room is that A-SIS dedupe as it now stands has limited scalability.

Recovery Monkey - More FUD busting: Deduplication - is variable-block better than fixed-block, and should you care?
This post, by Dimitris Krekoukias, argues that the major distinction some vendors make between variable-block and fixed-block deduplication is a way of distracting customers from the real issues. The post served to defend NetApp against its detractors and competitors who say fixed block dedupe is limiting. The comments field is in some ways the most interesting part, with EMC heavy Chuck Hollis raising questions about the author’s connections with NetApp. Also, our own Mike Davis weighed in, and the numbers he cited were so notable that further commenters questioned how this could be lossless compression. At this point, we’re used to it–the industry at large has become accustomed to less than spectacular results. More on all of this in a later post.

And here’s another interesting trend. The word “dedupe” is starting to creep into the lingo in a more general way. Among storage tweeps there is a greater tendency to throw “dedupe” into their conversations about everything from their record collections to what they eat. It reminds me a little bit of the “hepcat” slang I used to hear when hanging around jazz musicians. If something was ordinary, they’d call it “B Flat,” since that’s the most common key in jazz. For example, “Oh, I just had a B Flat lunch today of a burger and fries.”

The often Twit-witty Greg Schulz recently tweeted: “I can have dvr record on disk NBC tape delay (thats probably on disk) then dedupe da commercials.” Good plan, Greg.

This post by Steve Gillmor at TechCrunch also uses the term–in a way that I’ve never heard before. In this case, he’s referring to the fact that there is duplication of content across what are now becoming overlapping social networks–FriendFeed, Twitter, and the new Google Buzz.

OK that’s all for now. Keep on deduping friends!

News from the Holodeck

Posted by Sunshine On February - 16 - 2010

[Image: what_happens_in_the_holodeck]

As regular readers of this blog know, we’re obsessed with out-there tech. Anything that smacks of Star Trekkian futurism gets our blood pumping. This week, Deep Storage’s Howard Marks reports on something we’ve been watching for some time: holographic storage.

The news is sad. The company that was developing it, InPhase, is out of business. Their web site is still up, but according to the article, the company, a Bell Labs spin-off, was shuttered in early February and the Colorado Dept. of Revenue is now seizing its assets. As Marks points out, for now, technologies like deduplication make it hard to justify spending $10K on a holographic drive.

Despite this terrible setback I for one don’t want to believe this idea will die out entirely. It promises a new generation in storage at a time when data growth is spiraling out of control, threatening to overtake data centers worldwide. And who says we can’t add compression and deduplication on top of that? Howard and I both predict that sooner or later someone else will follow the holographic storage clarion call. As he so succinctly put it: “It’s just so cool.”

Image from: Geek Stuff

Storage Industry Lags Behind Advances in Compression

Posted by Carter George On January - 13 - 2010

There’s a lot of talk about compression these days, but how much do we know about it? Well, for one thing, compression as a research area for mathematics has evolved much faster than most people realize. The thing is, most compressors used in computer products, including dedupe appliances, use generic algorithms rather than making use of these advances.

Most storage products use Lempel-Ziv (LZ) or derivatives, and try to use that single compressor to compress everything. These algorithms have been around forever, and for the most part, have not evolved much in the last ten years other than in the area of performance. This is too bad, because compression has advanced in exciting ways. LZ and its cousins work well on the kinds of data that were around 10 or 20 years ago - plain text, plain numbers, or combinations of those things. They do not work so well on a lot of modern data - images, video, Office documents, PDFs, already-compressed files like Zip, encrypted data, etc. What’s important to understand is that all the most notable advances in compression that apply to storage have taken place not in generic compression algorithms, but in file type specific ones. File type specific compressors can, in fact, deal with all those modern data types.

Compression is all about pattern recognition and prediction. You look for patterns in a file and if you can find those patterns you try to predict their occurrence. If you can predict a pattern, you can compress it. So understanding the kinds of patterns that might show up in a file - video, a Zip file, music, and a PowerPoint are all very different - is the key to building a compressor for that file type.

What’s especially relevant is that the most important thing in compression of data today is recompression. Almost all of the file formats that are driving data growth, and taking up the most space on backups, are already compressed. Think of a file type that’s eating up space, and it’s likely to be an already-compressed format: JPEG, video, Office, PDF, mp3, medical images … all compressed already.

A generic compressor won’t get any results at all on an already-compressed file. That’s because the first compression obscures the patterns that a compressor would look for. That’s why if you try to compress, say, a Zip file, if anything you’re likely to make it bigger. Recompression means first decompressing the file and then recompressing it with a better compressor. To do that, you have to recognize what kind of file it is, what kind of compression has been applied, and how to decompress it. By first decompressing it, you are able to see and process the patterns that make better prediction and compression possible.
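
The recompression steps described above can be sketched in a few lines of Python. This is a toy under stated assumptions: a `zlib` stream stands in for an "already-compressed file," `lzma` stands in for the "better compressor," and the function names are made up for this example. Real recompressors are specific to each file format and must also be able to rebuild the original file byte for byte, which this sketch does not attempt.

```python
import lzma
import random
import zlib


def looks_like_zlib(blob):
    """Recognize the file type: check the two-byte zlib header (RFC 1950)."""
    return (
        len(blob) >= 2
        and (blob[0] & 0x0F) == 8                 # compression method: deflate
        and ((blob[0] << 8) | blob[1]) % 31 == 0  # header check bits
    )


def recompress(blob):
    """Return (codec, payload); keep the input when nothing is gained."""
    if looks_like_zlib(blob):
        try:
            raw = zlib.decompress(blob)   # step 1: undo the first compressor
        except zlib.error:
            return "as-is", blob          # header matched by coincidence
        candidate = lzma.compress(raw)    # step 2: apply a stronger one
        if len(candidate) < len(blob):
            return "lzma", candidate
    return "as-is", blob


def sample_text(seed=0, distinct=2000, total=20000):
    """Synthetic log-like data, used only to have something to compress."""
    rng = random.Random(seed)
    lines = [
        f"user={rng.randrange(10**6)} action={rng.choice(['read', 'write', 'delete'])} "
        f"bytes={rng.randrange(10**7)}\n"
        for _ in range(distinct)
    ]
    return "".join(rng.choice(lines) for _ in range(total)).encode()


if __name__ == "__main__":
    raw = sample_text()
    already = zlib.compress(raw, 1)     # the "already-compressed file"
    twice = zlib.compress(already, 9)   # generic compressor on compressed input
    codec, payload = recompress(already)
    print("original:            ", len(raw))
    print("already compressed:  ", len(already))
    print("compressed again:    ", len(twice))
    print(f"recompressed ({codec}):", len(payload))
```

The second generic pass typically gains little or nothing, while decompressing first exposes the patterns again and lets the stronger compressor do its work.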

Almost every market has a set of well-defined file types that make up the bulk of its unstructured data. In medical imaging, it’s DICOM (which in turn contains JPEG 2000, JPEG-LS, and TIFF). In seismic, it’s SEG-Y. In satellite imaging, it’s NTF, MrSID, GeoTIFF and a few others. In the average business, it’s Office, PDF, photos and video.

In specific industries, you see very advanced compression implemented in the application layer, not in storage. Video is a great example - the whole concept of the video codec is all about compression. Whole companies exist specifically to do better video compression (On2 is a good example), but this compression is done primarily for transmission, and implemented as part of the video application workflow, not as a storage technology.

In a world that had all plain ASCII text data, generic compressors would be great. But that’s not the world we live in. For compression to have any meaningful impact on today’s data sets, you have to have file type aware recompression.

It’s a shame that most storage products today have not implemented the most exciting advances in modern compression mathematics. My company Ocarina is quite frankly one of the few exceptions. The compressors found in tape drives or in dedupe appliances represent the best of the evolution of the generic compressor. The thing to look for going forward is the emergence in storage products of the next generation set of file type aware compressors, which is where all the action has been over the last ten years.

Happy New Year

Posted by Sunshine On December - 29 - 2009

Tis the week for the “out of office” email messages. But the storage blogo-tweet-osphere waits for no man. Here are a few posts that caught my eye this week.

Bas Raayman sees CPU power hitting the wall: The RAM per CPU wall

Rick Vanover says 2010 could be the year for 10GigE - Will 2010 see 10 Gigabit Ethernet go mainstream?

It being the end of a year–and a decade–predictions abounded. We’re pleased to note that when it came to summarizing the top storage stories of 2009, deduplication for primary storage, the specialty of this blog’s parent Ocarina, made the big lists:

Infostor: The top 5 storage technologies of 2009 (and 2010?)

“Storage optimization (or data reduction) technologies such as data deduplication and compression can significantly reduce capacity requirements and costs … Consider data reduction for primary storage.”

SearchStorage - Beth Pariseau: Top 10 enterprise data storage news stories of 2009

“10. Data deduplication branches out. As deduplication settled into a comfortable role in backup, data-reduction technology started working its way into other parts of the data storage infrastructure, including primary as well as nearline and archived data … Ocarina and Isilon Clustered NAS help visual effects studio archive images, cut costs.”

For sheer inventiveness, blogger Stephen Foskett wins the prize with his 2009 predictions post, in which he turns the clock back and takes advantage of 20-20 hindsight: My 2009 IT Industry Predictions.

Meanwhile, social media and tech watcher Louis Gray takes himself to task and looks at all of his 2009 predictions to see how well he fared: My 2009 Tech Predictions: Mixed, But Nailed Real-Time.

OK that’s all for now. Here’s wishing all of you a happy, healthy, green and techy new decade.

Dedupe - The Big News in 2009

Posted by Sunshine On December - 7 - 2009

[Image: niketigerswoosh]

It’s been a tough year — a worldwide recession, a sluggish housing market, rising unemployment … and on top of all that, the tarnished image of one of sports’ most squeaky clean players. Well, actually, there have been some bright spots. As DCIG blogger and storage analyst Jerome Wendt notes while looking back at the past year, “Deduplication is the Big Success Story of 2009.”

Wendt writes: “Deduplication is arguably one of the most notable trends of 2009 as it has been widely adopted by users after bursting onto the scene just a few years ago and has grown to be included in both software and hardware products.”

Wendt focuses on dedupe for backups, where there has been much publicized activity over the past year. The big storage story of 2009 was of course the battle between storage titans EMC and NetApp over backup dedupe specialist Data Domain. He cites an industry survey from SearchDataBackup that indicates that 41% of enterprises either are using or are seriously considering dedupe to control data growth and costs. He also notes that despite the predicted demise of Quantum, that dedupe company remains strong.

Dedupe for backups is one part of the cost reduction puzzle. Another part is to reduce data at the source, in primary storage. This is of course the specialty of this blog’s parent Ocarina, which implements a unique combination of content-aware dedupe and compression to achieve startling results. It focuses on the very types of unstructured data that are driving storage growth today–emails, images, documents, and so on. The company has been partnering with almost every leading storage provider, including HP, EMC, HDS, BlueArc, and Isilon. Another leader in this space is NetApp, which has a strong dedupe for primary offering that has also garnered a great deal of attention.

Here’s the thing: the economy might be slowing down, but data growth continues apace. This is one reason that the storage industry has been thriving this year. But rather than standing still, what this spells is a concerted effort to keep that data under control. As Wendt notes, another of the year’s big trends is cloud storage, which offers companies more flexibility for storing some percentage of their data. I would also add that virtualization has taken a huge leap forward, not only in terms of the technology itself, but also in terms of adoption over the past year. Yet another way to attack the problem.

So if 2009 was all about dedupe for backups, I’m going to guess that 2010 will be very much about data reduction at all points on the data life cycle. What do you predict?

Image: Gizmodo