Deduplication best practices
Deduplication is a complex process that depends on many factors.
The most important factors that influence deduplication speed are:
- The speed of access to the deduplication database
- The RAM capacity of the storage node
- The number of deduplicating locations created on the storage node
To increase deduplication performance, follow the recommendations below.
Place the deduplication database and deduplicating location on separate physical devices
The deduplication database stores the hash values of all items stored in the location—except for those that cannot be deduplicated, such as encrypted files.
To increase the speed of access to a deduplication database, the database and the location must be placed on separate physical devices.
It is best to allocate dedicated devices for the location and the database. If this is not possible, at least do not place the location or the database on the same disk as the operating system. The operating system performs a large number of disk read/write operations, which significantly slows down deduplication.
Selecting a disk for a deduplication database
- The database must reside on a fixed drive. Do not place the deduplication database on external removable drives.
- To minimize access time to the database, store it on a directly attached drive rather than on a mounted network volume. Network latency may significantly reduce deduplication performance.
- The disk space required for a deduplication database can be estimated by using the following formula:
S = U * 90 / 65536 + 10
Here,
S is the required disk space, in GB
U is the planned amount of unique data in the deduplication data store, in GB
For example, if the planned amount of unique data in the deduplication data store is U = 5 TB (5000 GB), the deduplication database will require at least the following amount of free space:
S = 5000 * 90 / 65536 + 10 = 17 GB
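For planning purposes, the formula above can be wrapped in a short helper, as in the minimal sketch below. The function name and the sample value are illustrative only.

```python
def dedup_db_size_gb(unique_data_gb: float) -> float:
    """Estimate the disk space (GB) needed for the deduplication database,
    using the formula from this section: S = U * 90 / 65536 + 10."""
    return unique_data_gb * 90 / 65536 + 10

# Example from the text: 5 TB (5000 GB) of planned unique data
print(round(dedup_db_size_gb(5000)))  # -> 17
```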
Selecting a disk for a deduplicating location
For the purpose of data loss prevention, we recommend using RAID 10, 5, or 6. RAID 0 is not recommended because it is not fault tolerant. RAID 1 is not recommended because of its relatively low speed. There is no preference between local disks and a SAN; both work well.
40 to 160 MB of RAM per 1 TB of unique data
When this limit is reached, deduplication stops, but backup and recovery continue to work. If you add more RAM to the storage node, deduplication resumes after the next backup. In general, the more RAM you have, the larger the volume of unique data you can store.
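As a rough planning aid, the RAM range implied by this guideline can be calculated as in the sketch below. It assumes the 40-160 MB per 1 TB figure stated above; the function name and sample value are illustrative only.

```python
def storage_node_ram_mb(unique_data_tb: float) -> tuple[float, float]:
    """Rough RAM range (MB) needed on the storage node for deduplication,
    assuming 40 to 160 MB of RAM per 1 TB of unique data."""
    return unique_data_tb * 40, unique_data_tb * 160

low, high = storage_node_ram_mb(50)  # e.g. 50 TB of planned unique data
print(f"{low:.0f}-{high:.0f} MB")    # -> 2000-8000 MB
```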
Only one deduplicating location on each storage node
It is highly recommended that you create only one deduplicating location on a storage node. Otherwise, the available RAM may be divided among the locations in proportion to their number.
Absence of applications competing for resources
The machine with the storage node should not run applications that require a lot of system resources, such as Database Management Systems (DBMS) or Enterprise Resource Planning (ERP) systems.
Multi-core processor with at least 2.5 GHz clock rate
We recommend that you use a processor with at least four cores and a clock rate of at least 2.5 GHz.
Sufficient free space in the location
Deduplication at target requires as much free space as the backed-up data occupies immediately after being saved to the location. Without compression or deduplication at source, this value equals the size of the original data backed up during the given backup operation.
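A quick pre-flight check of this requirement could look like the sketch below. It assumes no compression or deduplication at source; the path and function name are hypothetical.

```python
import shutil

def has_room_for_backup(location_path: str, source_data_bytes: int) -> bool:
    """Check that the deduplicating location has at least as much free space
    as the data being backed up (no compression or deduplication at source)."""
    free_bytes = shutil.disk_usage(location_path).free
    return free_bytes >= source_data_bytes

# Example: verify there is room for a 500 GB backup (path is hypothetical)
print(has_room_for_backup("/mnt/dedup_location", 500 * 1024**3))
```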
High-speed LAN
A 1-Gbit LAN is recommended. It allows the software to perform 5-6 parallel backups with deduplication without a considerable decrease in speed.
Back up a typical machine before backing up several machines with similar contents
When backing up several machines with similar contents, it is recommended that you back up one machine first and wait until indexing of the backed-up data finishes. After that, the other machines are backed up faster thanks to efficient deduplication: because the first machine's backup has been indexed, most of the data is already in the deduplication data store.
Back up different machines at different times
If you back up a large number of machines, spread out the backup operations over time. To do this, create several protection plans with various schedules.