What is Deduplication?

In data storage, deduplication is a technique that eliminates redundant data. It identifies the redundant data and then either replaces the data with a pointer to the original block of data or removes it entirely (depending on the type of deduplication being done).

The advantage of data deduplication is that it reduces storage needs. Because only a single copy of each unique piece of data is actually stored, overall capacity requirements drop. For example, if three computers each hold an identical 100 GB backup file, deduplication can store those 300 GB of backups in roughly 100 GB, because the second and third copies are replaced by references to the first.

By identifying and removing duplicate copies of repeating data segments, deduplication frees up storage capacity and reduces costs. Depending on how redundant the data is, it can reduce storage requirements by as much as 95 percent, which makes it one of the most effective technologies for reducing storage capacity needs.
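The arithmetic behind these figures is easy to check. The short sketch below is only an illustration: the function name `dedup_savings` and the byte counts are assumptions made for the example, not part of any product. It computes the reduction ratio and the fraction of space saved from the amount of data written and the amount actually stored.

```python
def dedup_savings(logical_bytes, stored_bytes):
    """Return (reduction ratio, fraction of space saved).

    logical_bytes: size of the data as the clients see it (before deduplication)
    stored_bytes:  size of the unique data actually kept on disk
    """
    if logical_bytes <= 0 or stored_bytes <= 0:
        raise ValueError("sizes must be positive")
    ratio = logical_bytes / stored_bytes
    saved = 1 - stored_bytes / logical_bytes
    return ratio, saved


GB = 10**9

# Three identical 100 GB backups stored as a single 100 GB copy.
ratio, saved = dedup_savings(3 * 100 * GB, 100 * GB)
print(f"{ratio:.0f}:1 ratio, {saved:.0%} saved")  # prints "3:1 ratio, 67% saved"

# A 95 percent saving corresponds to a 20:1 reduction ratio.
ratio_95, saved_95 = dedup_savings(20 * GB, 1 * GB)
print(f"{ratio_95:.0f}:1 ratio, {saved_95:.0%} saved")  # prints "20:1 ratio, 95% saved"
```

Note that the three-computer example saves about 67 percent; reaching 95 percent requires roughly twenty logical copies of each stored block.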

Benefits of Deduplication

In addition to reducing storage costs, eliminating duplicate data can improve system performance by increasing available space on primary storage devices. This is because some systems are configured to migrate less frequently accessed files from primary storage to secondary storage when primary storage reaches certain levels of usage. If there is less overall data stored in these systems, there will be more free space for actively used files, which may reduce the need for migration.

Many companies use deduplication in backup software, storage systems and appliances to reduce the amount of storage consumed by backups, archives and replicas.

How does Data Deduplication work?

Deduplication technology works by eliminating redundant data. In a backup environment, this means removing duplicate copies of data chunks, or segments of files. The deduplication process finds the unique data blocks in the backup stream, stores them and then replaces the duplicates with pointers to the original blocks.
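The process described above can be sketched in a few lines. This is a minimal illustration, not how any particular backup product works: it assumes fixed-size chunks, a tiny `CHUNK_SIZE` chosen so the output is easy to read, SHA-256 as the block identifier, and an in-memory dictionary standing in for the block store. The names `dedup_store` and `restore` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: unrealistically small, for readability only


def dedup_store(stream, store):
    """Split a backup stream into chunks and store each unique chunk once.

    Returns a "recipe": the ordered list of chunk keys (the pointers)
    needed to rebuild the original stream.
    """
    recipe = []
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        key = hashlib.sha256(chunk).hexdigest()
        # Only the first occurrence of a chunk is written to the store;
        # later occurrences just add another pointer to the recipe.
        store.setdefault(key, chunk)
        recipe.append(key)
    return recipe


def restore(recipe, store):
    """Rebuild the original stream by following the pointers in order."""
    return b"".join(store[key] for key in recipe)


store = {}
backup = b"AAAABBBBAAAAAAAACCCC"
recipe = dedup_store(backup, store)

assert restore(recipe, store) == backup
print(len(recipe), "chunks referenced,", len(store), "chunks stored")
# prints "5 chunks referenced, 3 chunks stored"
```

Real systems often use variable-size chunking and persistent indexes; the fixed-size split here is simply the shortest way to show unique blocks being stored once and duplicates becoming pointers.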

Identifying the "fingerprint" of data is the first step of deduplication. The fingerprint is the key distinguisher of a data entity: a compact value, typically a hash computed from the entity's contents or key fields, that no other distinct entity is expected to share. Once fingerprints are computed and stored, a lookup can run against the warehouse or table (depending on what you're working on), comparing each data entity against the fingerprints already seen. Any entity whose fingerprint matches an existing one can be removed, or at least flagged for further inspection. If no earlier entity has the same fingerprint, that piece of data is labeled the "original", establishing it as non-duplicated.
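For table-style data, the fingerprint lookup might look like the sketch below. Everything specific in it is an assumption made for illustration: the record layout, the choice of `name` and `email` as the identifying fields, the normalization (trimming whitespace and lowercasing), and the function names `fingerprint` and `split_duplicates`.

```python
import hashlib


def fingerprint(record, keys=("name", "email")):
    """Hash the normalized identifying fields of a record."""
    normalized = "|".join(str(record[k]).strip().lower() for k in keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def split_duplicates(records):
    """Label the first record with each fingerprint as the original.

    Later records with a fingerprint already seen are returned separately
    so they can be removed or inspected.
    """
    seen = set()
    originals, duplicates = [], []
    for record in records:
        fp = fingerprint(record)
        if fp in seen:
            duplicates.append(record)
        else:
            seen.add(fp)
            originals.append(record)
    return originals, duplicates


rows = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},
    {"id": 2, "name": "ada lovelace ", "email": "ADA@example.com"},
    {"id": 3, "name": "Alan Turing", "email": "alan@example.com"},
]
originals, duplicates = split_duplicates(rows)
print([r["id"] for r in originals], [r["id"] for r in duplicates])
# prints "[1, 3] [2]"
```

Which fields go into the fingerprint, and how aggressively they are normalized, decides what counts as a duplicate, so flagged records are usually worth reviewing before anything is deleted.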

Inline vs. Post-processing Deduplication

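Deduplication can run at two points in time: inline or post-processing. Inline deduplication is performed while data is being written to disk, so duplicate blocks never reach storage. This minimizes the capacity used, but the fingerprinting and lookup work sits in the write path, which requires more processing power and can slow down writes. Post-processing deduplication occurs after the data has been written to disk. It adds little overhead to the write itself, at the expense of temporarily needing enough capacity to hold the data in its full, non-deduplicated form until the deduplication pass runs.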
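The difference between the two timings can be shown with a toy model. The sketch below is an assumption-laden illustration rather than a description of any real storage system: the class names `InlineTarget` and `PostProcessTarget`, the in-memory dictionary and list, and the use of SHA-256 are all hypothetical choices. It shows where the hashing work happens and how many chunks are held before and after the deduplication pass.

```python
import hashlib


def chunk_key(chunk):
    return hashlib.sha256(chunk).hexdigest()


class InlineTarget:
    """Deduplicates in the write path: hash first, keep only unseen chunks."""

    def __init__(self):
        self.store = {}

    def write(self, chunk):
        self.store.setdefault(chunk_key(chunk), chunk)

    def chunks_on_disk(self):
        return len(self.store)


class PostProcessTarget:
    """Writes raw chunks to a staging area and deduplicates them later."""

    def __init__(self):
        self.staging = []
        self.store = {}

    def write(self, chunk):
        self.staging.append(chunk)  # no hashing during the write

    def dedup_pass(self):
        for chunk in self.staging:
            self.store.setdefault(chunk_key(chunk), chunk)
        self.staging.clear()

    def chunks_on_disk(self):
        return len(self.staging) + len(self.store)


chunks = [b"A", b"B", b"A", b"A"]

inline = InlineTarget()
post = PostProcessTarget()
for c in chunks:
    inline.write(c)
    post.write(c)

print(inline.chunks_on_disk(), post.chunks_on_disk())  # prints "2 4"
post.dedup_pass()
print(post.chunks_on_disk())  # prints "2"
```

Both targets end up holding the same two unique chunks; they differ in when the hashing cost is paid and in how much space is occupied in the meantime.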

Data deduplication is available on many types of storage products, including dedicated appliances and software that can be installed on commodity hardware.
