There is so much buzz in the market these days about data deduplication. If you are like me, you receive numerous emails every day that contain a reference to data dedupe. These references typically provide you with a short description of the value of data dedupe and how the specific vendor has been able to tame the explosion of data. After reading the short description you can download a white paper, register for a webinar, watch a video-cast, listen to a podcast, or get something else that promises to enlighten you about the benefits of data deduplication. Now you may think I am being harsh with these words, implying empty promises from a technology that is so innovative, elegant, and complex all at the same time. So you can rightly ask me, what is my answer to your exploding data issue? I am grateful that you have asked and I will provide you with an answer, one you may not expect from me.
Why is data deduplication receiving so much press these days? This innovative technology provides you with the promise of reducing the amount of data you store on disk by identifying your duplicate data and ultimately storing only one copy. Further, this innovative technology has invented elaborate processes so that the one saved copy can be located by all the applications that had previously saved the data. Additionally, some data dedupe processes examine files – some of which could be compliance type files – to find repetitive bit patterns to further improve the dedupe ratio. A very very slick and elegant technology and one that has tremendous potential when leveraged within the whole of storage tiers and the storage hierarchy.
Now if your organization is typical, you are likely seeing your data grow at around a 45% compound annual growth rate (CAGR). That is enough to strike fear into the heart of most storage administrators because they are asking the question, “How will I be able to store all this data on primary disk, particularly in today’s economic environment when I have little or no budget for new hardware acquisitions and every request undergoes intense scrutiny?” Let’s consider for a moment business as usual, with no data deduplication solutions available. To solve this data explosion it will take money to buy the additional storage, likely more people to manage the additional storage, space in the data center to install the additional storage, and electricity to power and cool all the additional storage. All of these components have a large price tag. If you look at the latest trends in the industry, you notice that the price tag of the storage management and operations people can be up to 3x the hardware price, the electrical costs are rapidly approaching and in some places exceeding the hardware price, and many organizations are running out of floor space, another limited resource. So when you add it all up you may have to take into account a minimum of 5x the storage hardware price tag to manage, maintain, upgrade, power, cool, and install all of this additional storage.
A great example of most of these costs, excluding personnel costs, is shown in the Clipper Notes paper, referenced in my 2009 May 9 blog, where the cost of a quarterly backup disk solution over 5 years, for data growing at 50% CAGR from 50 TB to 253 TB, is estimated to be approximately $14.7 million US dollars. This also takes into account the typical three-year useful life of disk, so the expected newer technology is included in the analysis.
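The growth figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not anything taken from the Clipper Notes paper: the function name `project_capacity` is hypothetical, and it assumes the 50 TB baseline is the size in year 1 with the 50% growth compounding once per year after that.

```python
def project_capacity(start_tb, cagr, years):
    """Return the capacity in TB for each year, compounding once per year.

    Assumption: start_tb is the size in year 1, so year N holds
    start_tb * (1 + cagr) ** (N - 1).
    """
    return [start_tb * (1 + cagr) ** year for year in range(years)]


# Figures from the text: 50 TB baseline, 50% CAGR, 5-year horizon.
capacities = project_capacity(start_tb=50, cagr=0.50, years=5)

for year, tb in enumerate(capacities, start=1):
    print(f"Year {year}: {tb:.1f} TB")
```

Under that assumption the year-5 figure lands at roughly the 253 TB quoted from the paper.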
That is before data dedupe comes into the picture. So is that the secret sauce? Assume (and this can be a job-limiting assumption) that you are able to get a 20:1 data dedupe benefit. The Clipper Notes paper made that assumption and estimated that a data dedupe environment would reduce the cost of the overall quarterly backup disk solution to approximately $3 million US dollars. That is an eye-popping reduction in price, so why am I so suspicious about the value of data dedupe?
To answer that, you need to understand your data well. For example, if you have an application that backs up your data incrementally “forever”, the likely dedupe ratio will be very small. It will not approach a 20:1 ratio, and it is likely that the data dedupe ratio will be less than the data compression ratio. Now what if you do full data backups every day? First of all, I would ask why you are doing full backups every day. But if that were the case, your data dedupe ratio would likely be very large. For example, if you change only 1% of your data daily and you are adding 0.1% new data daily (for the 45% CAGR), you would see a dedupe ratio in the range of almost 100:1. Fantastic. But again, I would really question why you would do a full backup every day. Let’s compromise and use the assumptions from the Clipper Notes paper: a full backup weekly and incrementals daily, where 5% of your data changes. Under these assumptions approximately 92% of your data does not change during the week. That would provide you with about a 12:1 to 13:1 dedupe ratio, smaller than the Clipper Notes paper assumption of 20:1 but still respectable. Full speed ahead with data dedupe, right? Well, let’s be careful. Your mileage will vary based on your specific data environment. Know your data and understand what data is likely a good candidate for data dedupe, because if your backups are essentially incremental forever, you may get a dedupe ratio less than the typical data compression ratio.
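To see where ratios like these come from, here is a rough back-of-the-envelope sketch in Python. It is a deliberately simplified steady-state model of my own, not the method used in the Clipper Notes paper: it assumes each backup cycle logically writes one full copy of the data and that the dedupe system keeps only the fraction that is new or changed. The function names `annual_growth` and `dedupe_ratio` are hypothetical, and the model ignores compression, metadata overhead, and retention policy.

```python
def annual_growth(daily_growth, days=365):
    """Compound a daily growth rate into an annual growth rate."""
    return (1 + daily_growth) ** days - 1


def dedupe_ratio(unique_fraction):
    """Simplified steady-state dedupe ratio.

    Assumption: every backup cycle logically writes one full copy,
    of which only `unique_fraction` is new or changed data that the
    dedupe system actually has to store.
    """
    if not 0 < unique_fraction <= 1:
        raise ValueError("unique_fraction must be in (0, 1]")
    return 1 / unique_fraction


# Daily full backups: 1% changed plus 0.1% new data each day.
daily_full = dedupe_ratio(0.01 + 0.001)

# Weekly full backups where about 92% of the data is unchanged.
weekly_full = dedupe_ratio(1 - 0.92)

print(f"0.1% daily growth is about {annual_growth(0.001):.0%} per year")
print(f"Daily fulls:  about {daily_full:.0f}:1")
print(f"Weekly fulls: about {weekly_full:.1f}:1")
```

The point of the sketch is the sensitivity: the ratio is driven almost entirely by how much of each backup is genuinely new, which is why you need to know your own change rate before trusting a vendor's headline number.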
Also, you will want to check the assumptions against your own environment, as the Clipper Notes paper itself cautions. For example, verify the cost of electricity and floor space, be sure to include the cost of additional personnel to manage the data, adjust for vendor discounts on the hardware, use utilization levels for disk and tape media representative of your environment, and leverage the cost advantages of newer hardware in the later years of the 5 year analysis. Further, based on the publication “Panorama Storage” by Fred Moore, President of Horison Information Strategies, “By 2010, it is expected that a tool rich non-mainframe storage administrator should be able to effectively manage approximately 28 terabytes of storage….” So these personnel costs should definitely be taken into account. One last point: the baseline amount of data that is kept long term is a very critical assumption. If your environment is larger than the 50 TB environment used as the initial baseline for the Clipper Notes paper, the costs of the disk environment will increase almost linearly, whereas this is not the situation for tape environments. All of these factors can have a major impact on the final results.
OK, so I have highlighted an example where a 5 year quarterly backup pure disk solution can have radically lower costs, maybe as much as 5x lower, when you incorporate data dedupe within your environment. I have also warned you about making the underlying assumptions realistic for your environment. So you may now ask me why I am less than 100% enthusiastic about data dedupe.
Based on my background, I believe it is critical that you understand the benefits of each storage tier within the storage hierarchy. The storage hierarchy includes tape. So if costs are critical to you these days due to your stagnant or shrinking budgets, then consider an alternative for the quarterly backup example. Tape! Referencing the Clipper Notes paper again, the costs of the 5 year quarterly backup scenario on tape are estimated to be less than $650,000 US dollars. That is approximately 23x less than the disk environment without data dedupe and almost 5x less than the data dedupe estimate. So leveraging your tape automation environment in this example can make you a prudent fiscal conservative within your company during these lean times.
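The multiples quoted here follow directly from the three cost estimates cited from the Clipper Notes paper. As a quick sanity check, here is a small Python sketch; the dictionary labels are my own shorthand, and because the tape figure is quoted as "less than" $650,000, the resulting multiples should be read as lower bounds.

```python
# 5-year quarterly backup cost estimates quoted in the text (US dollars).
costs_usd = {
    "disk without dedupe": 14_700_000,
    "disk with 20:1 dedupe": 3_000_000,
    "tape": 650_000,  # quoted as "less than" this, so treat as an upper bound
}

tape_cost = costs_usd["tape"]

# How many times more expensive each option is than tape.
ratios = {name: cost / tape_cost for name, cost in costs_usd.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the cost of tape")
```

Swap in the numbers from your own environment before drawing conclusions; the comparison is only as good as the assumptions behind each estimate.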
Going back to data dedupe for a moment. There is definitely a place for this technology but go into it with your eyes open. For example, when I asked about data deduplication among my peers, I received responses that I must share with you to give you some additional insight when evaluating data deduplication solutions that are in the market:
- Understand the problem that you need to solve. From a simple TCO perspective, remember that disk environments, including a data dedupe environment, will likely need a ‘tech refresh’ every 3 years, tape drives are good for about 5 years, and tape automation for 7 to 10 years. Some companies have leveraged tape drives and libraries for much longer, and the media life is up to 30 years.
- How easy will it be to migrate deduped data from one system to another when the disk needs to be refreshed? Will the data need to be reconstructed into its native format and then deduped again?
- For bulk data backup and restore, tape is faster than disk, including data dedupe environments.
- What happens if there is a logical corruption in the dedupe repository? Keep a copy on tape.
- Does the deduped file conform to legislative requirements? That is, in a court of law will you be able to prove the data has not changed? Keep a verifiable copy on tape.
- As your data dedupe environment grows beyond a single dedupe system for a specific type of data (e.g., your email system or your database system), will the additional system have a global view of all the data being deduped? If not, you may end up having to manage independent islands of dedupe systems for load balancing or to keep your dedupe ratio high. So understand the difference between a data dedupe system that has a global view of the data and one that has a local view.
- Does the data dedupe system allow you to easily accommodate data replication and sending a copy to tape?
- Does the data dedupe vendor have maintenance and service personnel who are quickly accessible and knowledgeable about the solution when an issue arises?
- How quickly can the data that has been deduped be restored? If it is a few files, you likely will not be concerned about the time to reconstitute the data. But what if the entire disk farm needs to be restored? Will you be able to restore it fast enough to recover your business? It takes time to dedupe your data during your backup and time to reconstruct your data for recovery. Be sure to understand the time it takes in your environment compared to your native backup procedures. You do not want any questions or surprises about being able to back up your data within your backup window and being able to recover your data to meet your service levels.
What does all of this mean? Data dedupe is a huge step forward to help reduce the amount of data stored on disk. It is a part of the entire storage hierarchy. It is an innovative and elegant solution, but it does not approach the cost and value of tape. As W. Curtis Preston, Executive Editor at TechTarget and world-renowned independent backup expert, stated in his “Let’s Talk About Deduplication” video-cast, “I am not talking about deduping everything to one copy and then leaving it there. That would be, you know, stupid, alright, although I have had people do that. … So your choices are to replicate it, to copy to tape, or replicate it and then copy to tape, or if you like, copy to tape and replicate it, right, just depends on what you want to do.”
Leverage the various parts of the storage hierarchy that will serve you best. Each part has its specific purpose that has served the industry well for at least the past 45 years. Even with the new additions to the hierarchy, see them as additions, not replacements.