Data Capacity Optimization: Considering the Business Value

Lately, the industry press has placed a great deal of emphasis on data capacity optimization, also known as data efficiency. While many articles simply discuss the technology itself, as an IT solution provider, we look at the offering from a business perspective. Doing so helps determine whether you are getting what you need from the technology or whether you are better off focusing on another area of your architecture.

Data Capacity Optimization: Questions to Consider

When asked what you gain from data efficiency technology, the easy response is disk space. While that sounds great, it is important to also consider the cost. For example, is the cost in the form of additional overhead or potential deployment delay? To determine this, we must first understand the fundamentals of data capacity optimization technology.

Storage Deduplication

From a storage perspective, deduplication (a.k.a. dedupe) typically arrives earlier in a vendor's product lifecycle than compression. Most vendors can deliver dedupe while still performing R&D on the more difficult task of storage block compression.

Storage dedupe can be implemented in a couple of ways:

  • Target Dedupe – backups are pushed over the network to an appliance, which performs the dedupe.
  • Source Dedupe – dedupe is performed at the data source, typically by the backup software and its components.

On primary storage, dedupe is typically enabled on the array on a volume-by-volume basis. This makes sense: not all data is created equal, and you do not want to waste precious controller resources. Also consider what the process or procedure looks like to turn dedupe off on a volume. For example, think about:

  • How long will it take to expand all of your data?
  • What is the performance impact on your array?
  • Can you complete it during off hours?

Before Deciding on Deduplication

Deduplication can be performed on both flash and spinning disk, either inline (as data is ingested) or as a post process (after the data has been successfully written to disk, typically after business hours). Several factors determine dedupe success, and first you must understand your enterprise data flows and data types. Here is a short but important list to consider before making a deduplication decision:

  1. What is your data change rate?
  2. What is your data retention period?
  3. Fixed block or variable block? Keep in mind the typical block size is 8 KB to 32 KB.
  4. What is the effort to calculate the fingerprint (FPT)?
  5. Are hardware accelerators (such as FPGA chips or specialized chipsets) available to offload the main CPU and produce faster results?
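To make the fingerprinting question above concrete, here is a minimal sketch of fixed-block dedupe: data is split into fixed-size blocks, each block's fingerprint is computed with a hash (SHA-256 is assumed here as a stand-in; real arrays use their own fingerprint schemes), and only one copy per unique fingerprint is stored.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # assumed fixed block size of 8 KiB (low end of the 8-32 KB range)

def dedupe_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Split data into fixed-size blocks and keep one copy per unique fingerprint."""
    store = {}  # fingerprint -> block contents (the dedupe store)
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        fpt = hashlib.sha256(block).hexdigest()  # the fingerprint (FPT)
        store.setdefault(fpt, block)             # only the first copy is stored
    return store

# Usage: four identical 8 KiB blocks reduce to a single stored block (4:1)
data = b"A" * BLOCK_SIZE * 4
print(len(data) // BLOCK_SIZE, "logical blocks,", len(dedupe_blocks(data)), "stored")
```

The cost of computing that fingerprint on every write is exactly the overhead the question asks you to weigh, and why vendors offload it to dedicated hardware.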

Do not overlook the design factors either, including:

  • Where in your architecture does dedupe belong: on primary data, on secondary data, or both? Just as importantly, where does it not make sense?
  • How is replication handled? For example, does the data you are replicating require rehydration?
  • Which applications are deduplication friendly and which are not? Why pursue savings where the data lacks the repeating patterns deduplication exploits?
  • Don’t encrypt your data and expect better than a 1:1 ratio; the correct order of operations is to deduplicate first and then encrypt.
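The encryption point above can be illustrated with a toy example (the "encryption" here is a hypothetical stand-in, not a real cipher): three identical plaintext blocks share one fingerprint, but after encryption with a random per-block IV, every block looks unique and nothing dedupes.

```python
import hashlib
import os

def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def toy_encrypt(block: bytes) -> bytes:
    # Stand-in for real encryption: a random 16-byte IV XORed over the data
    # makes identical plaintext blocks look unique on disk.
    iv = os.urandom(16)
    return iv + bytes(b ^ iv[i % 16] for i, b in enumerate(block))

plain = [b"same block" * 100] * 3                         # three identical blocks
plain_fpts = {fingerprint(b) for b in plain}              # 1 unique fingerprint -> 3:1
cipher_fpts = {fingerprint(toy_encrypt(b)) for b in plain}  # 3 unique -> 1:1, no savings
print(len(plain_fpts), len(cipher_fpts))
```

This is why dedupe must run before encryption in the data path: once the data is randomized, the pattern matching has nothing to find.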

Dedupe Ratio and Data Capacity Optimization

  Dedupe Ratio    Space Saved
  1:1             0%
  2:1             50%
  3:1             67%
  4:1             75%
  5:1             80%
  6:1             83%
  7:1             86%
  8:1             88%
  9:1             89%
  10:1            90%
  50:1            98%
  100:1           99.0%
  500:1           99.8%

Average dedupe ratios in practice fall between roughly 3:1 and 12:1. To refine the estimate for your environment, run a vendor-specific assessment tool against a sample of your data set.
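The table above follows a simple formula: a ratio of R:1 saves (1 - 1/R) of your capacity. A quick sketch:

```python
def space_saved(ratio: float) -> float:
    """Percent of capacity saved at a given dedupe ratio (ratio:1)."""
    return (1 - 1 / ratio) * 100

# Reproduce a few rows of the table
for r in (2, 4, 10, 50):
    print(f"{r}:1 -> {space_saved(r):.0f}% saved")
```

Note the diminishing returns: going from 2:1 to 4:1 saves an extra 25% of capacity, while going from 50:1 to 100:1 saves only 1% more.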

Some Data Capacity Optimization Dedupe Gotchas:

  • CPU and Memory intensive (even more so than compression)
  • Video files – already compressed, so they receive little to no value from this operation
  • Exchange and Lotus Notes – better served by Single Instance Storage (SIS)
  • Encryption – encrypted data appears random and will not dedupe
  • Not built for mainframe data

Lastly, databases are generally not good candidates for this offering, as they serve as the enterprise's system of truth. If you are going to replicate data, be aware that it may have to be rehydrated prior to replication and then deduped again at the target.

Dedupe is widespread in corporate enterprises because of its disk space savings, but a critical design eye must be applied to ensure it is worth the deployment effort. Consider whether it would be less costly simply to deploy your storage array without this technology enabled, and weigh the design and testing time frames against purchasing enough initial raw capacity. If you are considering data capacity optimization, or if you have questions, contact us online, or at 888-861-8884.