Data Deduplication (Part 1)

12/17/2012 2:56:04 PM

Plan carefully & reap substantial storage efficiencies

Data deduplication, the process of locating and deleting duplicate files or blocks of data (often to reduce storage needs), has been available for a number of years. However, only recently has it started to achieve widespread adoption, largely due to the spiraling amount of data users create and replicate each year: 1.8 zettabytes (1.8 trillion gigabytes) in 2011, per IDC (www.idc.com), and doubling every two years.

"The storage market has been growing in the double or triple digits," says Robert Amatruda, research director, data protection and recovery, IDC. "The primary reason is the inclusion of data deduplication or other optimization features as part of the storage package."

If you are not yet using data deduplication, we expect you will eventually. "This isn't a feature that separates the men from the boys, but it is something that everyone is going to use eventually," says David Hill, principal at Mesabi Group (www.mesabigroup.com), an analyst firm focusing on storage management.


How it works

Data deduplication is a data compression technique performed either by add-on software or by a storage device's built-in file system. (We'll refer to both as "the system.") However, a variety of methods and derivations affect performance and space allocations and give businesses some hard decisions to make.

For instance, deduplication can be file based or block based. The file-based method (also called single-instance storage) identifies duplicate data files and deletes them, replacing each one with a reference to the single remaining original. When someone restores files (anything from a single file to an entire backup), the system sees the tag and replaces it with a copy of the single archived original.

For example, assume a company's email server contains 30 instances of the same 5MB image because the firm sent the image to numerous employees. Although most of the recipients might need local access to the file, the company doesn't need 30 instances in its email backup. During data analysis, the system would identify the first instance of that file, but each time it found a duplicate of the 5MB file, it would create a reference pointing to the original rather than backing up that instance, thereby reducing the total amount of disk space required for the backup. The system would maintain that information, and if someone ever needed to restore that file (after deleting the locally stored original, for example), the system would see the tag, locate the original it referenced, make a copy, and provide it as the restored file.
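The email example above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: it assumes the "tag" is a SHA-256 digest of the file contents, and the names dedupe_files, restore_file, store, and catalog are hypothetical. A small stand-in payload replaces the 5MB image so the sketch runs quickly.

```python
import hashlib

def dedupe_files(files):
    """files maps a file name to its bytes. Returns (store, catalog)."""
    store = {}    # tag (SHA-256 hex digest) -> the single stored copy
    catalog = {}  # file name -> tag that references the stored copy
    for name, data in files.items():
        tag = hashlib.sha256(data).hexdigest()  # assumed tagging scheme
        store.setdefault(tag, data)             # keep only the first instance
        catalog[name] = tag                     # duplicates become references
    return store, catalog

def restore_file(name, store, catalog):
    """Follow the reference and hand back a copy of the stored original."""
    return bytes(store[catalog[name]])

image = b"\x89PNG" + bytes(1024)  # stand-in for the 5MB image
mailboxes = {f"user{i:02d}/logo.png": image for i in range(30)}
mailboxes["user00/notes.txt"] = b"meeting at 10"

store, catalog = dedupe_files(mailboxes)
print(len(catalog), "files cataloged,", len(store), "unique copies stored")
```

Every one of the 30 mailbox entries can still be restored, because each catalog entry points at the one stored copy of the image.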

Block-based deduplication works in a similar fashion but with blocks of data identified as large, repeating patterns of bytes rather than individual files. The system scans the data pool and segments it, assigning a unique tag to each block. When the system encounters a block matching one already tagged, it excludes the match from the backup and replaces it with a reference that points to the single match.
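A rough sketch of the block-based approach follows. It assumes fixed-size 4,096-byte blocks and SHA-256 digests as tags; both are illustrative choices, and real systems may segment data differently (for example, with variable-size blocks). The names BLOCK_SIZE, dedupe_blocks, and rebuild are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration

def dedupe_blocks(data, block_store):
    """Segment data into blocks; store each unique block once.

    Returns the ordered list of tags needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        tag = hashlib.sha256(block).hexdigest()  # assumed tagging scheme
        if tag not in block_store:
            block_store[tag] = block  # first time this block is seen
        recipe.append(tag)            # a match becomes a reference only
    return recipe

def rebuild(recipe, block_store):
    """Reassemble the original bytes by following the references."""
    return b"".join(block_store[tag] for tag in recipe)

block_store = {}
header = b"A" * BLOCK_SIZE
body = b"B" * BLOCK_SIZE
report_v1 = header + body + b"first draft"
report_v2 = header + body + b"second draft"

recipe_v1 = dedupe_blocks(report_v1, block_store)
recipe_v2 = dedupe_blocks(report_v2, block_store)
print(len(recipe_v1) + len(recipe_v2), "blocks referenced,",
      len(block_store), "blocks stored")
```

The two files differ, so file-based deduplication would keep both in full; at the block level, the shared header and body blocks are stored only once.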

Both file and block deduplication can take place on the fly (in-process) prior to the creation of a backup file or image on a storage drive. Alternatively, a system can dedupe data after creating the backup file (post-process). The process can also take place on the source drive (data is deduped before being backed up) or on the target (backup) drive.
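The difference between in-process and post-process timing can be illustrated with a toy model. This sketch assumes SHA-256 tags and counts blocks rather than bytes; the function names and the "peak" measure are hypothetical and only meant to show when duplicates reach the backup target, not how a real product schedules the work.

```python
import hashlib

def tag(block):
    return hashlib.sha256(block).hexdigest()  # assumed tagging scheme

def backup_in_process(blocks):
    """Dedupe each block before it is written to the backup target."""
    target = {}
    peak = 0
    for block in blocks:
        target.setdefault(tag(block), block)  # duplicates never land
        peak = max(peak, len(target))
    return target, peak

def backup_post_process(blocks):
    """Write the full backup first, then dedupe the finished copy."""
    landing = list(blocks)  # every block, duplicates included, lands first
    peak = len(landing)
    target = {}
    for block in landing:
        target.setdefault(tag(block), block)
    return target, peak

blocks = [b"os", b"app", b"os", b"os", b"data", b"app"]
inline_target, inline_peak = backup_in_process(blocks)
post_target, post_peak = backup_post_process(blocks)
print("in-process peak blocks on target:", inline_peak)
print("post-process peak blocks on target:", post_peak)
```

Both approaches end with the same deduplicated contents; in this model the post-process path temporarily needs room for the full, undeduplicated backup.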

Why you’ll want it

For years, two specific backup methods (incremental and differential, which do not replicate unchanged data) have helped keep backup file sizes and times to a minimum. However, while these methods may try to avoid data redundancy, they don't eliminate it. In other words, if an identical file is stored 30 times and no instance has changed, most backup solutions won't store those 30 files anew, but they won't delete the extras, either. That's where data deduplication comes in.


With data deduplication in the mix, especially when combined with data compression and other storage reduction techniques, duplicates are eliminated no matter where they occur. The percentage of savings is difficult to project due to the variances in how firms store data, says Valdis Filks, research director, storage technologies and strategies, for IT research firm Gartner (www.gartner.com).

"In a worst-case scenario where there [are very few] common data or files, but where the same data is stored over again, users can get more than 50% reduction in storage requirements." In instances such as our example of a file stored 30 times, the reduction can be far greater. Data deduplication also reduces the size of network transmissions and is valuable in virtual environments because IT departments can store a single instance of each system file on a common storage device.

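As a back-of-the-envelope check on the 30-copy example, the arithmetic can be written out as follows. The function name and the reference_mb parameter are hypothetical, and the sketch assumes each reference takes negligible space by default.

```python
def reduction_percent(copies, file_mb, reference_mb=0.0):
    """Percentage of space saved when only one copy of a file is kept.

    reference_mb is the assumed size of each pointer left in place of
    a duplicate; it defaults to zero for simplicity.
    """
    before = copies * file_mb
    after = file_mb + (copies - 1) * reference_mb
    return 100.0 * (1 - after / before)

# 30 copies of a 5MB image: 150MB before, 5MB after
print(round(reduction_percent(30, 5), 1))
```

Under those assumptions, 150MB shrinks to 5MB, a reduction of roughly 96.7%, well above the 50% worst-case figure quoted above.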