Four out of five college students agree, this is not the way to deal with data growth. How about this instead?

I’ve noticed a few blog posts recently about speed of deduplication in the modern data center. I agree that speed is an important factor, but keep in mind that not all dedupe is created equal. That is to say, fast is good, but only if you are also effective. One of the tricky things has been that the easiest data to compress is also usually the most carefully performance tuned. A great example of this is a database. This is because databases are comprised of simple alphanumeric fields and sparse tables. All of that is easy to reduce in size.
However, a company’s core transactional database is the most conservative asset in the data center. Introducing compression would save space, for sure, but you could only use very fast, simple compressors there. At the same time, customers will be hesitant to deploy a new layer of processing in their most sensitive application.
So, where is most data growth? In fact, it’s being driven by unstructured data – Office documents, rich media, email with attachments, PDFs, Flash videos, and so forth. This complex data does not lend itself to fast simple compressors. But perhaps we should back up for a moment and think about how customers have been behaving all along.
Throughout the history of storage, there have always been tradeoffs available between fast expensive storage, and slower but cheaper alternatives. This is not a bad thing. It gives users alternatives based on their priorities and budgets. Back in the old mainframe days, these choices were between very expensive mainframe memory and “offline” storage like drums, cards, and tapes. Today the technology is all much bigger, faster, cheaper and sexier. But really, the tradeoffs are the same.
Data reduction technology adds another layer of choice above and beyond the traditional hardware choices. Now in addition to choosing whether you want fast, expensive solid state disk (SSD) or slower but very cost-effective SATA, you can also choose whether you want to compress and/or deduplicate the data that is stored on those disks.
Just like physical disks, compression and dedupe come in a range of speeds and capabilities. There are simple and very fast compressors that are essentially invisible in terms of their impact on storage performance. There are more complex compressors that get better results, but which may take longer, either to compress or to decompress the data. Deduplication, done well, should always be pretty fast, and streaming dedupe rates of well over 300 MB/sec are now available from many vendors (including Data Domain and Ocarina).
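To make the speed-versus-results tradeoff concrete, here is a minimal sketch using only general-purpose compressors from the Python standard library. These are stand-ins chosen for illustration, not the compressors any vendor mentioned here actually ships, and the sample data and the function name `compare_compressors` are made up for the example.

```python
import lzma
import time
import zlib


def compare_compressors(data):
    """Return {name: (compressed_size, seconds)} for a few stdlib compressors."""
    candidates = {
        # Fast, simple setting: low CPU cost, usually the weakest ratio.
        "zlib level 1": (lambda d: zlib.compress(d, 1), zlib.decompress),
        # Same algorithm, working harder for a better ratio.
        "zlib level 9": (lambda d: zlib.compress(d, 9), zlib.decompress),
        # A more complex compressor that typically trades time for space.
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in candidates.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        # Verify the data survives a round trip before reporting a size.
        if decompress(packed) != data:
            raise ValueError(name + " failed to round-trip")
        results[name] = (len(packed), elapsed)
    return results


# Hypothetical, highly repetitive "database-like" sample data.
sample = b"id=42;status=ACTIVE;region=us-west;" * 5000
for name, (size, seconds) in compare_compressors(sample).items():
    print(f"{name}: {len(sample)} -> {size} bytes in {seconds:.4f}s")
```

Exact sizes and timings depend on the data and the machine, so run it against a sample of your own files rather than trusting any single number.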
The emergence of tools to automatically tier data to its appropriate place helps make the use of all of these technologies more feasible. That applies as much to solid state disks as it does to dedupe and compression. When data tiering can be made invisible to end-users and applications, then implementing multiple physical and logical tiers of storage becomes practical. Good examples would include EMC’s new FAST tools, Compellent’s “Fluid Data Storage”, and HDS’s Data Migrator. When users or administrators have to move data by hand to get it to a compressed tier or a solid state disk, then the operational costs offset the capital savings.
You might want to be wary when someone’s biggest claim to fame is fast dedupe. Just as the old mainframe admin had to decide whether something was important enough to live in RAM, or could be stored on cheaper tapes instead, today’s IT shops have to decide where it is most important to try to get data reduction, and what tool will get the most bang for the buck for that kind of data. You need the whole story, and then you can decide based on your own priorities.
This Saturday I’m participating in an event that aims to bridge the gender gap in computer science and engineering. It’s the first annual Dare2BDigital, a conference for young women in the 7th-10th grades that exposes them to the new and exciting career options that now exist in computer science and engineering.
Why such a young group? Studies suggest this is the time when we begin the decision-making process about our career path. These young women are beginning to make pictures in their minds about how they’ll be spending their days when they enter the workforce. They might well be gifted in math or logic. But computer science still suffers from an image problem. Most people–girls in particular–see it as the realm of geeky guys who make endless Star Trek references, drink too much soda and have questionable grooming habits.
What many don’t know is how far this field has come in the last decade. If you’re creatively inclined, now is one of the best times to enter the vast computing field and start poking around for an interest area. For example, one of the first workshops at Dare2BDigital to fill up was one taught by Pixar technical directors on “Computers, Art, and Animation — How opposing specialties come together to create feature films.” What a treat for a middle- or high school-aged girl to be able to dip her toes into the exciting field of computer animation. Other popular choices were programming with Python, making a Facebook game (with folks from Zynga), my workshop on being a tech reporter, and others. For the full list of workshops to share with your daughter, go to the workshops page.
The event is sponsored by SAP, along with many other top names in technology, including HP, Microsoft, Cisco, IBM, Symantec, and others. What do you think? Is this the way to bring more women into the fold? What else can be done to open up the world of computing to more potentially qualified and creative people?
Full disclosure: I personally am receiving a small stipend from the event presenters for my consulting work on this conference. This blog’s parent Ocarina Networks is in no way involved, other than to be supportive of the concept.
With all the talk about the data inconsistencies around climate change theory, one issue that I’d hate to see lost in the shuffle is the actual environment. That is, while I personally have been skeptical for some time about the alarmist tone many scientists took regarding global warming, it would be a shame if there was such a backlash that people forget about the much more crucial, larger issue at stake. That is, we need to look at all the ways –on macro- and micro-scales–that we can reduce the overall pollution we generate through our daily habits.
One of the persistent myths about the Internet is that it is clean and green. We overestimate the value of going “paperless” while lowballing the effect on the environment of data centers. One need only look at an online pub like Data Center Knowledge to see that one of the most talked about issues in data centers today is how to reduce rack space, cooling and other energy costs associated with storage. (Another great resource is Greg Schulz’s StorageI/O blog.) This is particularly true of the data being generated through our new Web 2.0 sharing habits. Jon Toigo can laugh about the exploding digital universe all he likes, but it’s still the case that data growth is going like gangbusters in this socially networked era. Recession or no recession, there is a growing demand for ways to make storage more efficient.
Large players in this space are all too aware of the environmental and financial costs of such rapid data growth. Every time you share a photo or video, you’re contributing to it. And who among us doesn’t do this nowadays? In response, companies are experimenting with all kinds of techniques, including new building designs making use of outside air, reducing overall rack space usage with data reduction such as is offered by this blog’s parent Ocarina, cloud adoption, and so on and so forth. Companies like Google, Yahoo and Facebook are also creating next generation storage architectures that are more efficient for handling the realities of today’s internet. In short, let’s be sure, as we discuss the fallout from the latest global warming debate, that we don’t start acting too lax about the effect of our actions on the planet.
This blog often talks about data deduplication, but today, some news about a different type of dedupe came across our desk. Consulting giant Gartner has acquired Burton Group for $56 million. This is just one example of the larger consolidation trend that’s been taking over the consulting industry. In fact, this is the second such acquisition by Gartner–a month ago it picked up AMR for a similar price tag.
For many, it is unsettling to see these acquisitions by the consulting industry’s 800-lb. gorilla. As F5’s Don MacVittie so succinctly put it on Twitter, “The Incredible Shrinking Analyst Market.”
But this acquisition has its positives, says James Governor’s Monkchips blog. His boutique analyst firm, Redmonk, might actually benefit, he says. Why? Because companies will need a source for a “second or third opinion” once they’ve gotten the Gartner view.
Says James: “Enterprises use Gartner for a reason – for the experience and knowledge of its people – and with Burton comes heavy collaboration, directory, networking and SOA experience.”
Clearly, Gartner is waking up to the growth–and increased complexity–of the modern data center. It is by necessity responding to the needs of IT decision-makers within organizations who must make calls about everything from virtualization to cloud computing to security all along the line. As this post on the Burton Group blog explains:
“Gartner focuses on the executive level - the strategic question of “what to do.” Burton Group focuses on the IT leaders and implementers - the question of “how to do.” So both products compliment each other and focus on different audiences within the IT organization.”
What’s notable is the extent to which specialization is becoming the name of the analyst game. What do you, the IT decision-maker, need to know? Gartner’s bet is essentially this: if it delves deeper into the weeds–offering more than just business strategy and its (increasingly devalued) stamp of approval–it is more likely to retain its customers over the long haul. Do you agree? If not, why not?
Bas Raayman, an SAP consultant, has a post on StorageMonkeys that questions the emphasis on de-dupe and thin provisioning, when evidence is mounting that storage at many enterprises is largely underutilized. This got me thinking.
Bas is asking a good question: Why bother shrinking your data when you’re only at 50% capacity? Now the premise is debatable, as certain verticals such as life sciences, media and entertainment, social networks, etc. are as prolific as ever and driving a lot of the 2009 revenue of our NAS partners Isilon, HDS, BlueArc, HP, and EMC.
However, for customers where utilization is low, it’s legitimate to ask the question “why mess with de-dupe?” The answer lies in understanding the resources saved beyond just storage capacity.
Here are some obvious opportunities:
1. Higher performance data migration, backup and replication (thus improving reliability and reducing bandwidth costs);
2. Consolidation into fewer filers (thus reducing management cost and OpEx);
3. For Internet businesses, reduce distribution and CDN costs by delivering smaller files.
Of course end-to-end optimization is a pipe-dream for most vendors where de-dupe is a low-level embedded block-based solution. These point solutions lead to the problem of “de-dupe headache” (yes, you heard it here first!) where islands of disparate de-dupe solutions work in isolation, and any data-movement requires re-hydration and re-shrinking into the next solution. This is something that Curtis Preston takes up in a recent article in SearchStorage.
Here’s where I think an out-of-band content-aware tunable optimization solution like the Ocarina ECOSystem has bridged the gap. It delivers end-to-end fully optimized workflows--not just storage, but bandwidth, power, and reliability. This is really the end-game and where we’ve set our sights.
One example–Ocarina has been shipping the capability to optimize files while leaving them in their native format. In other words, no reconstitution mechanism is necessary. This is particularly useful for photo sites where JPG images are optimized using our visually lossless algorithms, but still retained as JPG files. So the photo site can distribute shrunken images without quality impact. We may not deliver the 8x reduction of a bunch of office docs, but 30-50% optimization delivers massive benefits in bandwidth cost in addition to storage savings.
So even for storage consumers with low utilization, it’s becoming more clear that end-to-end data optimization will lead to improved costs across the entire workflow and infrastructure. In short, as the headline says, it’s not just about storage anymore.
Data deduplication has become a very hot topic these days, especially in light of EMC’s recent and very high profile acquisition of Data Domain. This week, analyst George Crump of Storage Switzerland made some predictions as to where this technology is heading. His post, The Foundation of DeDupe’s Next Era, asserts that it will require many different approaches–likely from a number of vendors–in order to best reduce the multiple types of data found in primary storage. I agree with much of what he says, but here are some further thoughts on the topic.
First, a general observation. In every new major market, there is always an early winner, and then that early winner is typically leap-frogged by a 2.0 approach that solves the problems of the first wave. There are a number of examples of this. Browsers, for starters. Netscape made the market, only to be wiped out by Internet Explorer. In the file serving market, Auspex created the market, but NetApp blew them away. The list goes on.
With that in mind, there are four elements that I believe will define the winning architecture in Dedupe 2.0:
1. Global dedupe: Deduplication will find duplicates across multiple nodes and multiple storage pools. No matter where a data stream comes in to the solution, if it has a dupe, it will be found.
2. Post-Process: The second wave of dedupe will be a post-process architecture. Data Domain tells us as much when they devote so much of the marketing for their latest product (the 800 series) to why in-band is the right answer. They’re the market leader, they have a smoking fast new product – why are they so worried about post-processing that they make it the focus of their release messaging? Who are they worried about? Not the vendors they’ve already beaten. No, they’re worried because they know the 2.0 generation will be done this way. They are already positioning now for the new competitors they know they’ll see in the future; they’re being defensive, because they understand their own limitations better than anyone else.
There are several reasons dedupe will move to a post-process architecture, but the main one is better results in data reduction. Dedupe 2.0 won’t be just dedupe – it will be dedupe plus content-aware compression. This means two- and three-dimensional compressors need to see the context of data, not just the small window of data passing through memory in an in-band appliance. Done right, there’s no reason why post-processing can’t be just as fast as in-band, and data reduction will be dramatically better.
3. Scale-out Processing: In Dedupe 2.0 you will be able to scale out throughput by adding more nodes to your dedupe cluster to process in-coming streams. The Dedupe 2.0 cluster will look like one single target to backup (or other) sources. It will have a load-balanced global namespace, but behind that you could have one cheap server or 32 big fast ones. You’ll be able to start small and grow big, without changing anything on the backup software or writer side. Data streams can get load-balanced to any node, and because of global dedupe, any node can dedupe in real time with data coming to any other node. Instead of having to pick which model has the right throughput for you, start with one node, and if you grow from needing half a Terabyte an hour to 5 Terabytes an hour throughput, you add a few more nodes.
4. Scale-out Capacity: As the line between backups (with short retention windows) and archives (potentially long retention periods) continues to blur, the dedupe 2.0 store wants to scale out to massive amounts of storage. That should be independent of processing capacity. For example, the shop that does not back up that much every day should not have to buy some top of the line model just so that they can get enough storage to keep their backups online for 7 years.
Just like processing and throughput capability, capacity should scale independently. You also should be able to add as much storage as you want – inside a dedupe 2.0 cluster node, on a SAN, or network-attached – independently of whether you bought the small cheap dedupe node or the big fast one or a cluster of many of them.
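As a rough illustration of the first and third requirements (global dedupe and scale-out processing), here is a toy sketch in which several ingest nodes share one chunk index, so a duplicate is found no matter which node receives the stream. Everything in it is an assumption made for the example: the class names, the fixed 4 KB chunking, and the in-memory dictionary standing in for a distributed index. It is not how any shipping product works.

```python
import hashlib


class GlobalChunkStore:
    """Toy shared index: chunk digest -> chunk bytes (an in-memory stand-in)."""

    def __init__(self):
        self.chunks = {}

    def put(self, chunk):
        """Store a chunk once; return (digest, True if it was new)."""
        digest = hashlib.sha256(chunk).hexdigest()
        is_new = digest not in self.chunks
        if is_new:
            self.chunks[digest] = chunk
        return digest, is_new


class DedupeNode:
    """One ingest node. Add more nodes to add throughput; all share the store."""

    def __init__(self, name, store, chunk_size=4096):
        self.name = name
        self.store = store
        self.chunk_size = chunk_size
        self.new_chunks = 0

    def ingest(self, stream):
        """Split a stream into fixed-size chunks and return its recipe of digests."""
        recipe = []
        for offset in range(0, len(stream), self.chunk_size):
            digest, is_new = self.store.put(stream[offset:offset + self.chunk_size])
            recipe.append(digest)
            if is_new:
                self.new_chunks += 1
        return recipe


def restore(store, recipe):
    """Rebuild the original stream from its recipe."""
    return b"".join(store.chunks[digest] for digest in recipe)


store = GlobalChunkStore()
node_a = DedupeNode("node-a", store)
node_b = DedupeNode("node-b", store)

# Four distinct 4 KB chunks, sent to two different nodes.
backup = b"".join(bytes([i]) * 4096 for i in range(4))
recipe_a = node_a.ingest(backup)
recipe_b = node_b.ingest(backup)

print(node_a.new_chunks, node_b.new_chunks, len(store.chunks))
```

The second node stores nothing new because the first node already put every chunk in the shared index. Storage capacity (the store) and processing capacity (the nodes) are separate objects here, which mirrors the independent scaling described in the fourth requirement.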
Some vendor will deliver a dedupe 2.0 cluster solution that meets these four must-have requirements. Who knows? That might be Data Domain, the winner of the first wave. But it might be someone else, too.
The question of what to do with already-deduped input streams is a separate but interesting topic. For the most part, customers voted with their wallets against doing source dedupe for backups. After all, EMC bought Data Domain even though it already had source-based dedupe technology Avamar.
More and more, file servers and even database servers are going to be doing dedupe of the primary and nearline file systems, not for backup, but for storage efficiency in primary storage. That means that data streams going to the backup solution with dedupe are going to be already deduped in some way.
All of which raises even more questions–which will have to wait for a later post. What’s the right way to deal with that? Is the answer something that needs to be done on the source side or the backup side? Meanwhile, I invite your comments.
There’s been a lot of discussion about tiered storage lately. Most notably, Stephen Foskett has written a series of posts on the topic on his Nirvanix blog, Enterprise Storage Strategies. In his latest post, he essentially argues that tiered storage hasn’t turned out to be cost effective and that cloud storage could be the best option for the lower tier.
We certainly agree with him that unstructured data has become unmanageable due to the proliferation of rich media and other large files. We also agree that tiered storage hasn’t lived up to its promise to a large extent. However, let’s not be too quick to throw out the baby with the bathwater. As Hu Yoshida has discussed in a recent post, tiering has come a long way in light of new technologies, particularly virtualization. In our view, by combining virtual tiering at the block level (as described in Hu’s post) with virtual tiering at the file level you can get the best of both worlds.
Tiered storage used to be about moving data from one physical storage place to another. The premise there was that some storage was fast and expensive, and other storage was slower but cheaper, and that you could save a lot of money by moving data to the appropriate place.
This was a good idea in theory, but as it turned out there were a number of unforeseen problems. First, the tools for moving files were themselves sometimes expensive. There goes your cost savings. On top of that, they were sometimes good at moving the files but not at getting them back. And further, in situations where the fast tier and the cheap tier were not from the same vendor, it often proved difficult to make finding files that had been moved transparent to users and applications. As you can guess, these types of problems often made the whole thing more trouble than it was worth.
The fact remains, though, that most files are stored on storage that has more performance, and costs more, than is necessary for that file. Most storage admins know that 80% of their files could be stored on a cheaper tier, if it wasn’t a hassle or too expensive to do so.
One solution with immense potential is to have virtual tiers within a single filer or namespace. Virtual tiers are levels of dedupe and compression applied to a file, making it cheaper to store because it’s taking up less space. In a virtual tier, the file does not have to move anywhere – it can stay right where it is, but you reduce the cost of storing it by shrinking it. With dedupe and compression, there are lots of choices for trading off performance versus space savings.
Sun’s file system ZFS allows this, and cloud storage like Nirvanix can do this too — having the advantage of using the latest technology, and that the technology behind the cloud interface is invisible to the user of the cloud. Either way, let’s look at how you can implement virtual tiers while keeping files in the same place that they were created in.
Let’s say Tier 1 is for your fast hot files - they live on your Tier 1 filer, uncompressed. In that case, you might have a Virtual Tier 2 be all the files that have not been modified in 7 days, and Tier 2 would be that same filer, same volume, but with a policy that those files that meet the Tier 2 definition are deduped. No compression, just dedupe. In that case, read back times will be quite fast. Maybe not exactly as fast as reading the original un-deduped file, but almost.
A Virtual Tier 3 might be “files that have not been modified in 30 days” and the tier might be defined as dedupe plus light compression. Read back will be a tad slower, but space savings greater than dedupe alone. Finally, you might have a virtual Tier 4 – dedupe and maximum compression. This might fire more complex compressors that take longer to compress (and decompress) a file, but will get excellent space savings. Read back performance for tier 4 might be quite a bit slower, but the space savings might be 90% or more reduction in the file sizes in that tier.
Here’s the kicker: All of this can be done without moving a file off the filer it started on. Users and applications can still find the file right where it always was. If they access the file, the optimization solution will transparently “rehydrate” the file.
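The tier definitions above boil down to a small policy table keyed on how long a file has gone unmodified. Here is a minimal sketch of that idea. The 7-day and 30-day thresholds come from the examples above; the 365-day threshold for Tier 4, the names `VirtualTier`, `assign_tier` and `idle_days_for`, and the compression labels are assumptions for illustration, not any product’s actual policy engine.

```python
import os
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualTier:
    name: str
    min_idle_days: int  # file must be unmodified for at least this many days
    dedupe: bool
    compression: str    # "none", "light", or "max"


# Ordered from coldest to hottest so the first match wins.
TIERS = [
    VirtualTier("tier4", 365, True, "max"),   # 365 days is an assumed threshold
    VirtualTier("tier3", 30, True, "light"),
    VirtualTier("tier2", 7, True, "none"),
    VirtualTier("tier1", 0, False, "none"),   # hot files stay uncompressed
]


def assign_tier(idle_days):
    """Pick the coldest tier whose idle threshold the file meets."""
    for tier in TIERS:
        if idle_days >= tier.min_idle_days:
            return tier
    raise ValueError("idle_days must be non-negative")


def idle_days_for(path):
    """Days since the file at `path` was last modified."""
    return (time.time() - os.path.getmtime(path)) / 86400


for days in (0, 7, 29, 30, 400):
    tier = assign_tier(days)
    print(days, tier.name, tier.dedupe, tier.compression)
```

Note that the policy only decides how a file is stored. The file’s path never changes, which is the whole point of tiering in place.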
There are different solutions that can do some or all of these things today. NetApp’s dedupe can only dedupe all of the files in a volume or none, so it can’t be used today to create logical or virtual tiers within a volume. But other solutions, like the Ocarina ECOsystem, are policy-based and can be used to create multiple different logical (or virtual) tiers within a single filer or volume, with multiple dedupe settings (including Ocarina’s patented Object Dedupe) and multiple levels of compression, with choices of over 100 compressors for different file types.
Ocarina has been tightly integrated with certain types of storage – including cloud solutions like Nirvanix – and the most transparent virtual tiers would be with the combination of Ocarina and one of the filer choices that have tight integration with Ocarina: BlueArc, EMC, HDS HNAS, HP, Isilon and Nirvanix (in alphabetical order – no vendor preferences implied!).
Of course, virtual tiers can be combined with real physical tiers, so that you can combine the level of storage optimization (dedupe, compression) with storage of different physical characteristics (expensive filers, cheap filers, cloud storage) to provide an environment that is not just a simple two-tiered model but a policy-driven environment of possibly a dozen or more logical tiers, with files being tiered-in-place or migrated-and-optimized automatically based on policy with little or no storage admin involvement.
As you can see, there is vast potential in this new approach to tiering. Even better, it can be achieved in such a way that storage admins’ jobs become easier, rather than harder. Like a lot of things, storage tiering has always been a good idea, but sometimes the technology has to catch up with the idea for implementation to become a good idea. Given the growth of storage, and the improvements in physical and virtual tiering, I think doing a better job of tiering must rank close to the top of the list for many customers.
A few of the Ocarina crew recently returned from Siggraph2009, the 36th International Conference and Exhibition on Computer Graphics and Interactive Techniques. Held in New Orleans the week of August 3-7, the event drew participants from around the world. We were newbies to the event, and so decided to get some perspective from an industry veteran who has been attending Siggraph every year for the past decade and a half.
Michael Zachary Huber is an animator and educator who has worked with the top studios, including director James Cameron, Digital Domain, and Electronic Arts (EA). He’s an assistant professor at Cogswell Polytechnical College, an animation and engineering school in Sunnyvale. Over the years he’s witnessed peaks and valleys when it comes to Hollywood’s love affair with visual effects.
“In the 1990s they were a novelty,” he said. “It was similar to the Internet, which came into its own later in that same decade. People were just on fire!”
The early 1990s were truly the heyday of Siggraph, he recalled. It wasn’t unusual to see stars like Danny DeVito and Arnold Schwarzenegger in attendance, and there was tremendous “buzz” in Hollywood. Yet some studios were burned by spending millions of dollars for lavish effects for movies that flopped.
He compares the new animation-driven effects such as CGI to the craze around 3-D movies in the 1950s. That type of special effect didn’t make movies better, just more novel.
“In some ways visual effects are the same thing. They are a tool, not an end unto themselves. The movie still has to be good for audiences to respond,” said Huber.
Nowadays directors and studios are getting smarter about where visual effects need to be used and where they don’t, he said.
Siggraph itself has been something of a barometer as to how the animation and effects side of the industry is faring. And this year, the attendance was far lower than in recent years, perhaps by as much as 25-50% in his estimation.
Huber’s interpretation–it’s not that the recession has meant that the industry is in real trouble, simply that this is a year in which studios are more cautious, but are still very much investing in the coming year. Said Huber, “I firmly believe there’s going to be a nice rebound.”
An exciting and fun part of the conference, he said, is the Computer Animation Festival, where participants from around the world screen their latest work. Check out this extremely cool vid showing some of the work:
Huber himself has a short animation film, a co-production with Cogswell that he plans to show at next year’s event. Called “The Offering,” it’s a story that draws from Hindu legend yet includes elements from everything from Marvel Comics to Bollywood. It was made at the school, with students playing a large role in its creation. (The poster for the movie is shown at the top of this page.)
So, what about the geeky side of Siggraph?
“The conference definitely gets a steady stream of techno fans–people who are interested in the technology and want to come for the white papers,” he reported, adding that the need for efficient storage is something that many animation houses are recognizing.
“Looking at the first Transformers movie, which was made four or five years ago, Industrial Light and Magic (ILM), which did most of the effects, used about 30 TB of space for all of the files they needed during the production process. Compare that with the latest Transformers movie, which took up closer to 150-250 TB of space,” said Huber.
Even his own short film took up three terabytes of space to make, he said. So even small institutions that have limited budgets could benefit from some kind of compression or optimization technology.
Indeed, he said, the computer animation industry is perhaps the art form most closely tied to technology. This, he said, is another reason that this year’s Siggraph conference was less well attended–the tech industry is smarting from the effects of the recession.
Yet it’s also the pace of technological innovation that drives visual effects/animation studios to continually improve what they’re able to achieve. Effects such as 3-D animation are getting more complex, he said, especially now that studios have moved from 8-bit to 16- or 32-bit technology. One exciting new innovation is OpenEXR, a high dynamic-range (HDR) image file format developed by ILM. This allows studios to ingrain details in visual effects like never before, according to Huber.
“You’re not going to get the true richness unless you get the file formats like that,” he said.
However, he added that such new file formats are demanding and require a lot of space. This is no doubt why so many post-production and animation studios are looking for ways to optimize files in order to save disk space. As it happens, Ocarina stands alone in that it has algorithms that are designed specifically for over 100 file types–OpenEXR included. As many studios are already discovering, Ocarina is their ticket to space savings of as much as 80%.
Overall, Huber sees a bright future for his industry.
“We see films today that seem amazingly more complex and rich. In ten years, all of that will be topped by what is coming.”
Well, I for one can’t wait for those coming attractions.
George Crump has a piece in Byte and Switch today that poses an important question: “Can we get to a single point of deduplication?” This is a question that we have taken up in one form or another in some of our recent posts, such as this one and this one.
In the article, Crump asks the question in another way: “… can you have all your data tiers; primary, archive and backup deduplicated by a single engine?”
In light of the recent focus on deduplication, this in my view is a question that really does need to be raised. For how long will the industry continue to silo out these different tiers for its deduplication solutions? And how much sense does it make to rehydrate data every time you move it, in order to once again deduplicate it? Not a lot.
Crump writes: “The current deduplication vendors could work on building out their solutions to either scale up into primary storage performance (see Data Domain’s DD880) or they could move their existing data duplication technology into other markets; see the increased speed of Ocarina Networks and Permabit as well as their move into cloud storage.”
At the same time, as we’ve pointed out here, online storage is quite a bit different than backups, and so far at least, none of the successful backup dedupe vendors (Data Domain, Diligent, Quantum, etc.) have been able to break into it. Rather, it is NetApp and Ocarina who have been the trailblazers.
Crump makes another key point:
“NetApp and Ocarina could continue to enhance and improve the re-hydration speed of their technologies to make read performance a non-issue, making primary storage a viable platform. Ocarina can already maintain the deduplicated format as they move through tiers, so landing on backup or archive disk would simply be another move for them.”
This is an interesting observation, and one that is often missed in reporting on both of these solutions. We look forward to seeing more debate and discussion on this issue, which was well kicked off with this piece.