Let’s Organize This Discussion A Bit!
I think there are at least three interesting topics you bring up here that are worth exploring further. For context, please see the recent post by David Vellante, “Dedupe Rates Matter…Just Not as Much as You Think.”
I’ll respond with a separate post on each of the following topics, so that the community can track and participate in the threads that they find interesting and relevant, and ignore the ones they don’t.
The topics are:
1. Is the CORE Formula Flawed?
2. In-Band versus Post-Process: Is it even worth arguing about?
3. Do you have to expand data to back it up?
Post 1. Is the CORE Formula Flawed?
I’m going to argue here that the CORE formula is flawed, but I want to say right off that the CORE formula has already added value to the community.
It’s a great idea to try to measure the value and effectiveness of different data reduction solutions. By putting a first shot at it out there, Wikibon has already sparked a vigorous discussion here that is going to increase awareness of the issues involved, and will probably lead to a new and improved version of the formula. I think we want to recognize the value that Wikibon has brought to both customers and vendors by starting this whole brouhaha.
That said, yes, the CORE formula is deeply flawed.
Weighting the Factors to Reflect Customer Priorities
From a purely mathematical perspective, the formula is most fundamentally flawed because it builds in a bunch of value judgments as Constants that should really be Variables. The values in the different columns will matter more to some customers than to others, so customers should be able to assign the weights themselves, for their own environment.
Instead, the weighting is hard-coded. The formula tells the customer that “time to compress” is more important than actual compression results.
That will certainly be true for some customers. It’s not true for others.
What I’d recommend to fix this is simple. Let a customer fill in a value to rate the importance of each column to them. Weight each column from 1 to 5, or 1 to 10. Then multiply the normalized score for each column by the customer’s assessment of its importance to them. That way, if “time to compress” is the most important thing to a customer, they can rate it 10. Great. If they rate that 5, and rate compression results 9, then the end score will better reflect their needs and priorities.
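To make the weighting idea concrete, here is a minimal Python sketch. The column names, the 0-to-1 normalized scores, and the step of dividing by the total weight (so the result stays on the same 0-to-1 scale) are my own assumptions for illustration; they are not part of the CORE formula as Wikibon published it.

```python
def weighted_score(normalized, weights):
    """Combine normalized column scores using customer-assigned weights.

    normalized: column name -> score already scaled to the range 0..1
    weights:    column name -> importance the customer assigns (e.g. 1..10)

    Dividing by the total weight is an assumption made here so the result
    stays in 0..1; the column names used below are hypothetical.
    """
    missing = set(normalized) - set(weights)
    if missing:
        raise ValueError(f"no weight given for: {sorted(missing)}")
    total_weight = sum(weights[col] for col in normalized)
    if total_weight <= 0:
        raise ValueError("at least one column needs a positive weight")
    weighted_sum = sum(normalized[col] * weights[col] for col in normalized)
    return weighted_sum / total_weight


# Hypothetical product: fast to compress, mediocre compression results.
product = {"time_to_compress": 0.9, "compression_results": 0.5}

# Customer A cares most about time to compress.
customer_a = {"time_to_compress": 10, "compression_results": 1}
# Customer B rates time to compress 5 and compression results 9.
customer_b = {"time_to_compress": 5, "compression_results": 9}

print(round(weighted_score(product, customer_a), 3))
print(round(weighted_score(product, customer_b), 3))
```

With these made-up inputs, the same product scores higher for customer A than for customer B, which is the whole point: the product did not change, the customer's priorities did.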
Include More Things That Matter
Second, the table simply does not represent all of the things that matter. Curtis Preston and others have pointed out that “time to decompress”, or response time, might be more important to some customers than time to compress. One measures how fast a customer realizes disk space savings. The other measures the response time users and applications will see when they access compressed data. Those are different things. As the poster from EMC noted, there are different kinds of performance, and different online storage use-cases may put more weight on one versus another. Some applications need streaming write throughput, others need sequential read performance, while others need the ability to seek to the middle of a file and modify a few bytes with low latency.
Admittedly, it would be hard to measure all these things without getting overly complicated, but having at least “time to compress” and “response time” called out separately would be useful to most shops.
There are other less measurable, but potentially important, intangibles. Perhaps you could add a column with a list of features or product characteristics and let the customer assign a value from 0 to 10 for each of those things. If they are not important, fine, give them a 0. But if they are important, that lets the customer express their priorities in the score.
Here are just some things that I think at least some customers would think are worth evaluating as part of a data reduction solution:
• All or Nothing: Does the solution compress everything, or can I choose what to compress based on policy?
• Does the solution do compression?
• Does the solution do dedupe?
• Can I back up data in its most-compressed form, or do I have to expand it to back it up?
• Is the solution considered certified, validated, or supported by my storage vendor?
• Does the solution have an HA or fault tolerance capability?
• Does the solution have the ability to scale up by adding multiple nodes to work on a single volume or namespace?
I’m sure there are others, and maybe the CORE formula, which is supposed to be a measure of Effectiveness rather than overall product merit, would consider these factors out of scope. And that’s fair. But if the CORE metric is to be used as a score for a product, then it should be clearer about the key features and topics that it is not covering, but that customers might want to ask about.
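If the feature list above were folded into a score, one simple way to do it might look like the sketch below. The feature names, the 0-to-10 ratings, and the idea of reporting the covered share of total importance are all hypothetical choices of mine, not anything defined by the CORE formula.

```python
def feature_score(importance, product_features):
    """Share of the customer's total feature importance that a product covers.

    importance:       feature name -> customer rating, 0 (don't care) to 10
    product_features: set of feature names the product actually offers
    """
    for name, rating in importance.items():
        if not 0 <= rating <= 10:
            raise ValueError(f"rating for {name!r} must be between 0 and 10")
    total = sum(importance.values())
    if total == 0:
        return 0.0  # the customer rated nothing as important
    covered = sum(
        rating for name, rating in importance.items() if name in product_features
    )
    return covered / total


# Hypothetical ratings for the checklist items discussed above.
customer_importance = {
    "policy_based_selection": 8,
    "compression": 10,
    "dedupe": 6,
    "backup_in_reduced_form": 9,
    "vendor_certified": 0,
    "high_availability": 7,
    "multi_node_scale": 0,
}

# Hypothetical product that offers only three of the features.
hypothetical_product = {"compression", "dedupe", "high_availability"}

print(round(feature_score(customer_importance, hypothetical_product), 3))
```

Features rated 0 drop out of the score entirely, so a product is never penalized for lacking something the customer does not care about.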
Garbage In, Garbage Out
I think the CORE formula has the right idea with the columns on cost and how well a product can shrink data. However, those columns are meaningless unless the data in them is accurate. I think that data needs to come from some sort of vendor-neutral benchmarking site or analyst. If you put in numbers from web site claims, customers are going to have high expectations. The fact is, results are going to vary by what kind of data the customer has. You can’t tell me that any solution is going to get the exact same shrinking results on, say, a volume full of VMware VMDK files, a volume full of medical images, a volume full of Microsoft Office files, and a volume holding an Oracle OLTP database.
Cost is even trickier. Dedupe on ZFS and on NetApp is free – it is a feature of the file system, and you don’t have to pay extra to get it. Of course, you have to be using NetApp storage to get NetApp dedupe, and that comes with a certain cost premium, but how do you compare that with solutions like Storwize and Ocarina that are separate standalone solutions with a specific price tag?
The first problem can be fixed if the community gets together and hosts 3, 5, or 10 different public data sets somewhere. Each vendor can download the data, run their wares, and report back the results. There’s no protecting against people who lie in this case, but vendors that do that will soon be caught out by customers. Then a customer could go to that neutral site, pick which sample data set best reflects their own data (OLTP database, medical images, home shares, consolidated virtual machines, whatever) and put that value in the CORE formula column for results. This is a case where the formula is not flawed – the formula is fine, but the numbers have to be valid for the kind of data a customer has. If we work together, we can help put good data in.
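As a rough sketch of how results from such a neutral site might be recorded and looked up, consider the following. The record layout, the vendor and data set names, and the byte counts are all placeholders I made up for illustration; no real benchmark results are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    """One vendor's reported result on one public data set (hypothetical layout)."""
    vendor: str
    dataset: str
    original_bytes: int
    reduced_bytes: int

    @property
    def space_saved(self) -> float:
        """Fraction of the original capacity saved, e.g. 0.6 means 60% smaller."""
        return 1 - self.reduced_bytes / self.original_bytes


def result_for(results, vendor, dataset):
    """Return the result a vendor reported for the data set closest to your data."""
    for result in results:
        if result.vendor == vendor and result.dataset == dataset:
            return result
    raise KeyError(f"no result from {vendor!r} on {dataset!r}")


# Placeholder entries only; the numbers are invented to show the mechanics.
published = [
    BenchmarkResult("vendor_a", "oltp_database", 1000, 400),
    BenchmarkResult("vendor_a", "medical_images", 1000, 900),
    BenchmarkResult("vendor_b", "oltp_database", 1000, 500),
]

# A customer whose data looks like an OLTP database picks that data set.
chosen = result_for(published, "vendor_a", "oltp_database")
print(f"{chosen.space_saved:.0%} space saved")
```

The value the customer would carry into the results column is the one for the data set that resembles their own environment, rather than a single headline number per vendor.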
The second problem vexes me. I don’t really know how to correctly address the cost problem. My suggestion? Take it out. If the CORE metric gives you a good value for the Effectiveness of different solutions, then customers can assess cost in ways that make sense to them – including vendor discounts, what storage they have, and so forth. No customer is going to ignore cost when making a purchase decision – so I think we can count on buyers to figure out how they want to factor costs into their decision-making process.
Summary
So there you have it. I think the CORE formula is a great first-pass attempt at doing something valuable for the community. It is deeply flawed – and I’m not the only one who thinks so. But it can be improved, and I would love to see that happen. These are my ideas, and I see others have contributed good insights as well. I’m keen to see where this goes from here. My next post will address the tired old topic of in-band versus post-process.
