Plan carefully & reap substantial storage efficiencies
Data deduplication, the process of locating
and deleting duplicate files or blocks of data (often to reduce storage
needs), has been available for a number of years. However, only recently has
it started to achieve widespread adoption, largely due to the spiraling amount
of data users create and replicate each year (1.8 zettabytes, or 1.8 trillion gigabytes, in 2011, per IDC (www.idc.com), and doubling every two years).
"The storage market has been growing
in the double or triple digits," says Robert Amatruda, research director,
data protection and recovery, IDC. "The primary reason is the inclusion
of data deduplication or other optimization features as part of the storage
package."
If you are not yet using data deduplication,
we expect you will eventually. "This isn't a feature that separates the
men from the boys, but it is something that everyone is going to use eventually,"
says David Hill, principal at Mesabi Group (www.mesabigroup.com), an analyst
firm focusing on storage management.
How it works
Data deduplication is a data compression technique implemented either by add-on software or by a storage device's built-in file system. (We'll refer to both as "the system.") However, a variety of methods and variations affect performance and space savings and give businesses some hard decisions to make.
For instance, deduplication can be file-based or block-based. The file-based method (also called single-instance storage) identifies duplicate files, deletes them, and replaces each with a reference to the single remaining original. When someone restores any amount of data, from a single file to an entire backup, the system sees the reference and replaces it with a copy of the single archived original.
For example, assume a company's email
server contains 30 instances of the same 5MB image because the firm sent the
image to numerous employees. Although most of the recipients might need local
access to the file, the company doesn't need 30 instances in its email backup.
During data analysis, the system would identify the first instance of that
file, but each time it found a duplicate of the 5MB file, it would create a
reference pointing to the original rather than backing up that instance,
thereby reducing the total amount of disk space required for the backup. The system would maintain that information, and if someone ever needed to restore that file (because he deleted the locally stored original, for example), the system would see the reference, locate the archived original, make a copy, and provide it as the restored file.
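To make the mechanism concrete, here is a minimal sketch in Python (our own illustration, not any particular product's implementation; the function names and in-memory dictionaries are stand-ins for a real backup catalog) of file-level, single-instance deduplication:

    import hashlib

    def file_digest(path):
        # Hash a file's contents so identical files map to the same key.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def backup_with_file_dedupe(paths):
        # store: digest -> file contents, kept exactly once
        # catalog: original path -> digest (the reference pointing at the stored copy)
        store, catalog = {}, {}
        for path in paths:
            digest = file_digest(path)
            if digest not in store:              # first instance of this content
                with open(path, "rb") as f:
                    store[digest] = f.read()     # back up the single original
            catalog[path] = digest               # duplicates become references only
        return store, catalog

    def restore_file(path, store, catalog):
        # Follow the reference back to the archived original and hand back a copy.
        return store[catalog[path]]

Applied to the email example, store would hold the 5MB image exactly once, while catalog would hold 30 lightweight references pointing back to it.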
Block-based deduplication works in a similar fashion, but it operates on blocks of data, identified as large, repeating patterns of bytes, rather than on individual files. The system scans the data pool and segments
it, assigning a unique tag to each block. When the system encounters a block
matching one already tagged, it excludes the match from the backup and replaces
it with a reference that points to the single match.
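A comparable sketch for block-based deduplication, again purely illustrative, with a fixed block size and SHA-256 standing in for whatever fingerprinting a given product actually uses:

    import hashlib

    BLOCK_SIZE = 4096   # illustrative fixed block size; real systems vary

    def dedupe_blocks(data):
        # store: tag -> unique block, kept once
        # recipe: ordered list of tags needed to rebuild the original stream
        store, recipe = {}, []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            tag = hashlib.sha256(block).hexdigest()   # the block's unique tag
            if tag not in store:
                store[tag] = block                    # new pattern: include it in the backup
            recipe.append(tag)                        # repeat pattern: reference only
        return store, recipe

    def rebuild(store, recipe):
        # Reassemble the original data by following each reference to its block.
        return b"".join(store[tag] for tag in recipe)

Real systems differ in how they segment data (fixed-size vs. variable-size blocks) and in how they fingerprint it, but the bookkeeping generally follows this pattern.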
Both file and block deduplication can take
place on the fly (in-process) prior to the creation of a backup file or image
on a storage drive. Alternatively, a system can dedupe data after creating the backup file (post-process). The process can also take place on the source drive (the data is deduped before being backed up) or on the target (backup) drive.
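The distinction is easiest to see in a simplified sketch (again our own illustration, reusing the same chunk-and-tag idea) of when the duplicate removal happens relative to the write:

    import hashlib

    def chunk_tags(data, block_size=4096):
        # Yield (tag, block) pairs for a byte stream.
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            yield hashlib.sha256(block).hexdigest(), block

    def inline_backup(source_data):
        # In-process: duplicates are dropped as the backup is written,
        # so only unique blocks ever land on the target drive.
        target_store, recipe = {}, []
        for tag, block in chunk_tags(source_data):
            target_store.setdefault(tag, block)
            recipe.append(tag)
        return target_store, recipe

    def post_process_backup(source_data):
        # Post-process: the full-size backup lands on the target first...
        raw_backup = list(chunk_tags(source_data))
        # ...and a later pass scans it and collapses the duplicate blocks.
        target_store, recipe = {}, []
        for tag, block in raw_backup:
            target_store.setdefault(tag, block)
            recipe.append(tag)
        return target_store, recipe

Under these assumptions, the post-process approach needs enough room on the target to land the full-size backup before duplicates are removed, while the in-process approach never writes them in the first place; performed on the source drive, it also shrinks what travels over the network.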
Why you’ll want it
For years, two specific backup methods
(incremental and differential, which do not replicate unchanged data) have
helped keep backup file sizes and times to a minimum. However, while these
methods may try to avoid data redundancy, they don't eliminate it. In other
words, if an identical file is stored 30 times and no instance has changed,
most backup solutions won't store those 30 files anew, but they won't delete
the extras, either. That's where data deduplication comes in.
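A hypothetical example (illustrative file records only) makes the limitation plain: incremental and differential backups skip unchanged data, but they leave existing duplicates alone:

    # Each record: path -> (contents, last-modified time); values are illustrative.
    files = {
        "report.docx":     (b"quarterly report...", 1700),
        "image_copy1.jpg": (b"same 5MB image",       900),
        "image_copy2.jpg": (b"same 5MB image",       900),  # identical, unchanged duplicate
    }

    def incremental_backup(files, last_backup_time):
        # Copy only files modified since the most recent backup of any kind.
        return {p: data for p, (data, mtime) in files.items() if mtime > last_backup_time}

    def differential_backup(files, last_full_backup_time):
        # Copy only files modified since the last full backup.
        return {p: data for p, (data, mtime) in files.items() if mtime > last_full_backup_time}

    # Neither method re-copies the unchanged duplicates this time around, but the
    # identical copies already captured in earlier backups stay put; only
    # deduplication collapses them to a single stored instance.
    print(incremental_backup(files, last_backup_time=1000))   # only report.docx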
With data deduplication in the mix (especially when combined with data compression and other storage reduction techniques), duplicates are eliminated no matter where they occur. The percentage
of savings is difficult to project due to the variances in how firms store
data, says Valdis Filks, research director, storage technologies and
strategies, for IT research firm Gartner (www.gartner.com).
"In a worst-case scenario where there
[are very few] common data or files, but where the same data is stored over
again, users can get more than 50% reduction in storage requirements." In
instances such as our example of a file stored 30 times, the reduction can be
far greater. Data deduplication also reduces the size of network transmissions
and is valuable in virtual environments because IT departments can store a
single instance of each system file on a common storage device.
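A quick back-of-the-envelope check of the 30-copy example above, ignoring the negligible size of the references themselves, shows why:

    copies, size_mb = 30, 5                      # the article's email example
    before = copies * size_mb                    # 150MB stored without deduplication
    after = size_mb                              # one archived original; references are tiny
    savings = 1 - after / before
    print(f"{before}MB -> {after}MB, {savings:.1%} smaller")   # 150MB -> 5MB, 96.7% smaller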