
PBS Chunking: 100GB VM in 25,000 Pieces

You back up a 100GB VM tonight. Tomorrow you back it up again and the snapshot finishes in 90 seconds, transferring around 800MB to the server. That is not magic, and it is not a clever incremental flag inside the guest. It is chunking, specifically content-defined chunking with a rolling hash, and once you see how it works the rest of Proxmox Backup Server starts to make sense.

Key Takeaways
  • PBS uses content-defined chunking (CDC), not fixed-block chunking.
  • Split points are decided by a Buzhash rolling hash over a sliding window, not by byte offset.
  • Target chunk size is ~4MB on average, with min and max bounds to handle outliers.
  • Each chunk is identified by the SHA-256 of its contents, so identical bytes always produce the same chunk.
  • Inserting a byte at the start of a file shifts boundaries only locally, not for the whole disk.
  • Encrypting before chunking destroys dedup. PBS does the opposite, which is why dedup survives encryption.

The Backup Chunking Problem

A running VM has a 100GB virtual disk. Last night's backup uploaded the full image. Tonight, somewhere between 50MB and 5GB of bytes changed, depending on workload. You need to figure out which pieces actually moved and ship only those.

The naive answer is "diff the files." That works on text. It does not work on a block device with about 100 billion potential byte positions and no useful diff semantics.

The real answer needs three things. First, a way to break the disk into pieces. Second, a way to recognise pieces that already exist on the server so you skip them. Third, the recognition step has to survive small edits without invalidating the entire disk.

That third requirement is the hard one. The first part of this series covered how PBS talks to the client over HTTP/2; chunking is the next layer up, and it is where the dedup story begins.

Approach 1: Fixed-Block Chunking

The obvious approach is to slice the disk into equal-sized blocks. Pick a size, say 4MB. Block 0 is bytes 0 through 4MB. Block 1 is bytes 4MB through 8MB. Hash each block. Compare hashes to last night. Upload the ones that changed.

This works when writes are aligned to the block size. Databases that overwrite 4KB pages in place produce predictable change sets. ZFS scrubs do not move blocks around. For these workloads, fixed-block chunking is fine.

It falls apart the moment something inserts or removes bytes inside a file. Imagine a 1GB log file. Someone prepends a 17-byte header. Every subsequent byte in the file has shifted forward by 17. Block 0 now starts with the header and has lost its last 17 bytes. Block 1 starts with those 17 bytes and then holds almost all of yesterday's block 1, sitting 17 bytes later than before. The bytes are nearly all the same, but the block hash covers different content, so it no longer matches. The same goes for block 2, and for every block after it.

No block matches yesterday's, so dedup for this file collapses to nothing, even though more than 99.99% of the data is identical.
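
The collapse is easy to reproduce. The sketch below is a toy, not anything PBS ships: it uses a 4 KiB block size and random bytes as a stand-in for the log file (both assumptions chosen so it runs instantly), hashes fixed-offset blocks before and after a 17-byte prepend, and counts how many of today's blocks the server would already know.

```python
import hashlib
import os

BLOCK = 4096  # assumption: tiny block size so the demo runs instantly

def fixed_blocks(data, size=BLOCK):
    """Split at fixed offsets and return the SHA-256 hex digest of each block."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]

yesterday = os.urandom(256 * 1024)   # stand-in for the log file
today = b"#" * 17 + yesterday        # same content, shifted forward by 17 bytes

known = set(fixed_blocks(yesterday))
today_blocks = fixed_blocks(today)
reused = sum(1 for digest in today_blocks if digest in known)
print(f"{reused} of {len(today_blocks)} blocks reusable")
```

Every block after the insertion holds bytes the server has already seen, just at an offset that no longer lines up with a block edge.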

This is why some S3 backup tools have poor dedup ratios

A lot of VM-backup-to-object-storage products still use fixed-block chunking under the hood. That choice is fine for snapshot-style consistent block writes, but it explains why their reported dedup ratios are often a fraction of what Proxmox Backup Server achieves on the same workload.

Approach 2: Content-Defined Chunking

Content-defined chunking (CDC) flips the question. Instead of asking "where does block N start," it asks "given the bytes I am looking at right now, should I cut a chunk boundary here?"

The mechanism looks like this. A small window slides through the data one byte at a time. At each position, a rolling hash is computed over the window's current contents. When the hash hits a chosen marker pattern, that position becomes a chunk boundary. Repeat until the file ends.

The marker is something cheap to check, usually "the low N bits of the hash are all zero." Picking N controls the average distance between cuts, which controls the average chunk size. The hash is content-driven, so cuts happen at content-driven positions.
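
To see why N sets the average, assume the hash bits behave like uniform random bits. Each position then has a 1 in 2^N chance of being a cut, so cuts land 2^N bytes apart on average. The simulation below checks that with random 32-bit values in place of a real rolling hash; `mean_gap` is a name made up for this sketch.

```python
import random

def mean_gap(mask_bits, positions, seed=0):
    """Measure the average spacing of cut candidates for uniformly random hash values."""
    rng = random.Random(seed)
    mask = (1 << mask_bits) - 1
    cuts = sum(1 for _ in range(positions) if (rng.getrandbits(32) & mask) == 0)
    return positions / cuts

for bits in (8, 10, 12):
    print(f"N={bits}: expected gap {1 << bits}, simulated {mean_gap(bits, 400_000):.0f}")

# The same rule at backup scale: 22 zero bits gives a candidate every 4 MiB on average.
print((1 << 22) == 4 * 1024 * 1024)
```

A minimum chunk size pushes the real average a little above 2^N, because candidates that arrive too early are skipped.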

Now reconsider the 17-byte prepend. The window is small, so it is back inside yesterday's content almost immediately after the new bytes. From that point on, the window holds exactly the bytes it held yesterday, the hash produces the same values, and the cut decisions match. At most the chunk or two around the edit differ before the boundaries resync. Everything after that point produces identical chunks.

That is the property that makes incremental backups of a 100GB disk feasible.
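
Here is the same 17-byte prepend run through a toy content-defined chunker. The parameters are assumptions picked to keep the demo small (32-byte window, a cut roughly every 1 KiB, 256-byte minimum, 8 KiB maximum), and none of this is the PBS code. It only shows the resync behaviour.

```python
import hashlib
import random

WINDOW = 32
MASK = (1 << 10) - 1              # cut candidate roughly every 1 KiB
MIN_SIZE, MAX_SIZE = 256, 8192
_rng = random.Random(42)
TABLE = [_rng.getrandbits(32) for _ in range(256)]

def rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def cdc_digests(data):
    """Cut where the rolling hash of the last WINDOW bytes matches MASK; return chunk digests."""
    digests, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = rotl(h, 1) ^ TABLE[byte]
        if i >= WINDOW:
            h ^= rotl(TABLE[data[i - WINDOW]], WINDOW)   # drop the byte that left the window
        size = i + 1 - start
        if size >= MAX_SIZE or (size >= MIN_SIZE and (h & MASK) == 0):
            digests.append(hashlib.sha256(data[start:i + 1]).hexdigest())
            start = i + 1
    if start < len(data):
        digests.append(hashlib.sha256(data[start:]).hexdigest())
    return digests

yesterday = bytes(_rng.getrandbits(8) for _ in range(128 * 1024))
today = b"#" * 17 + yesterday

known = set(cdc_digests(yesterday))
today_digests = cdc_digests(today)
reused = sum(1 for digest in today_digests if digest in known)
print(f"{reused} of {len(today_digests)} chunks reusable")
```

Only the chunk or two at the front, where the new bytes landed, should be new. The rest hash to digests the server already holds.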

The Buzhash Rolling Hash

Proxmox Backup Server uses Buzhash for the rolling hash step. There are alternatives (Rabin fingerprinting, FastCDC, Gear hashing, others), and each has trade-offs. Buzhash is fast to compute, easy to make truly rolling (constant-time add-byte and remove-byte operations), and has a distribution predictable enough to land on the target chunk size.

The mechanics are simple enough to fit in a paragraph. Each possible byte value gets assigned a random 32-bit constant ahead of time, stored in a lookup table. The current hash is the XOR of rotated copies of the constants for each byte in the window, where the newest byte is not rotated at all and each older byte is rotated one bit further. To slide forward one position, you rotate the hash left by one bit, XOR in the new byte's constant, and XOR out the oldest byte's constant rotated by the window length to account for its age. No expensive modular arithmetic, no big-integer math, no cryptographic operations.

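python
import random

WINDOW_SIZE = 64
TARGET_MASK = (1 << 22) - 1   # low 22 bits zero: a cut candidate every ~4 MiB on average
MIN_CHUNK = 512 * 1024        # 512 KiB
MAX_CHUNK = 16 * 1024 * 1024  # 16 MiB

# One fixed pseudo-random 32-bit constant per byte value (seeded so runs are repeatable).
_rng = random.Random(0)
TABLE = [_rng.getrandbits(32) for _ in range(256)]

def rotate_left(value, count):
    """Rotate a 32-bit integer left by count bits."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF

def chunk_stream(data, table=TABLE):
    h = 0
    chunk_start = 0

    for i, byte in enumerate(data):
        # Roll the hash: age every byte in the window by one bit, add the new byte,
        # and once the window is full remove the byte that just left it.
        h = rotate_left(h, 1) ^ table[byte]
        if i >= WINDOW_SIZE:
            h ^= rotate_left(table[data[i - WINDOW_SIZE]], WINDOW_SIZE)

        size = i + 1 - chunk_start   # length of the chunk if we cut after this byte
        if size < MIN_CHUNK:
            continue
        if size >= MAX_CHUNK or (h & TARGET_MASK) == 0:
            yield data[chunk_start:i + 1]
            chunk_start = i + 1

    # Emit whatever is left after the last cut.
    if chunk_start < len(data):
        yield data[chunk_start:]
Simplified content-defined chunking with Buzhash (illustrative Python)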

That snippet is not the real PBS implementation. It is the idea, compressed to fit on a screen. The production version is written in Rust, handles edge cases, and is heavily optimised for cache-friendly access patterns. But the loop body is conceptually the same.
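
If you want to convince yourself that the constant-time update really equals a full recomputation, the check below does it by brute force. It assumes a 32-bit hash and a 64-byte window as in the sketch above. The table contents and helper names are made up for the test.

```python
import random

WINDOW = 64
rng = random.Random(7)
TABLE = [rng.getrandbits(32) for _ in range(256)]

def rotl32(x, n):
    """Rotate a 32-bit value left by n bits."""
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def hash_window(window):
    """Buzhash from scratch: the newest byte is unrotated, each older byte is rotated one more bit."""
    h = 0
    for byte in window:
        h = rotl32(h, 1) ^ TABLE[byte]
    return h

def roll(h, outgoing, incoming):
    """Slide one byte: age everything by one bit, drop the oldest byte, add the newest."""
    return rotl32(h, 1) ^ rotl32(TABLE[outgoing], WINDOW) ^ TABLE[incoming]

data = bytes(rng.getrandbits(8) for _ in range(4096))
h = hash_window(data[:WINDOW])
mismatches = 0
for i in range(WINDOW, len(data)):
    h = roll(h, data[i - WINDOW], data[i])
    if h != hash_window(data[i - WINDOW + 1:i + 1]):
        mismatches += 1
print("mismatches:", mismatches)
```

The rolled value depends only on the bytes currently in the window, which is exactly why cut decisions line up again after an insertion.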

The rolling hash does not need to be cryptographically strong

Buzhash collisions are fine here. The rolling hash only decides where to cut. Once a chunk has been cut, Proxmox Backup Server hashes it again with SHA-256 to produce the chunk's actual identity. That second hash is the one that needs to be collision-resistant, and SHA-256 is.

Why ~4MB Average?

The chunk size target is a tunable parameter in any CDC system, and the choice has real consequences. PBS lands on roughly 4MB average for good reasons.

Chunk size trade-offs

| Property | 256 KB | 4 MB (PBS default) | 16 MB |
| --- | --- | --- | --- |
| Dedup quality on small edits | Excellent | Very good | Mediocre |
| Metadata overhead | Heavy | Moderate | Light |
| Network round-trips per backup | Many | Manageable | Few |
| Restore granularity | Fine-grained | Good | Coarse |
| Best fit for | Mostly-text repos, source trees | VMs, containers, mixed workloads | Cold archives, large media files |

Smaller chunks deduplicate better in theory because a small edit invalidates a smaller region. They also produce far more metadata, multiply chunk-existence checks on the wire, and turn the chunk store into a directory tree with millions of tiny files. Larger chunks waste bandwidth when edits land in the middle of a 16MB region and force the whole thing to be re-uploaded.

Around 4MB is where the curve flattens for typical VM workloads. The min and max bounds (512 KiB and 16 MiB in the sketch above; PBS itself uses roughly 1MB and 16MB) handle the pathological cases: a hash that never hits the cut condition would otherwise produce a single giant chunk; one that hits constantly would produce thousands of tiny ones.

SHA-256 as Chunk Identity

After a chunk is cut, the bytes go through SHA-256. The resulting 32-byte hash is the chunk's identity, full stop. On disk in the chunk store, the chunk lives at a path derived from its hash, sharded by the first byte or two to keep directory sizes sane.

Two chunks with identical contents always produce identical hashes. That is the property that makes cross-VM dedup work. If your web cluster runs ten VMs from the same Debian template, the chunks that contain /usr/bin/python3 are physically stored exactly once. The next post in this series goes into how the deduplication and chunk store layer handles reference counting and physical storage on disk.
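
A content-addressed store is small enough to sketch in a few lines. This one lives in memory, and the `.chunks/<first four hex characters>/<digest>` path shape is an assumption made for the example, as are the class and method names. Treat it as the idea, not the PBS on-disk format.

```python
import hashlib

class ChunkStore:
    """Toy in-memory content-addressed store keyed by a path derived from SHA-256."""

    def __init__(self):
        self.chunks = {}
        self.writes_skipped = 0

    @staticmethod
    def path_for(digest_hex):
        # Assumption: shard on the first four hex characters (two bytes) of the digest.
        return f".chunks/{digest_hex[:4]}/{digest_hex}"

    def insert(self, data):
        """Store the chunk unless an identical one is already present; return its ID."""
        digest = hashlib.sha256(data).hexdigest()
        path = self.path_for(digest)
        if path in self.chunks:
            self.writes_skipped += 1     # identical bytes are already stored
        else:
            self.chunks[path] = data
        return digest

store = ChunkStore()
template_chunk = b"\x7fELF" + bytes(4096)   # stand-in for a chunk of a shared binary
ids = [store.insert(template_chunk) for _ in range(10)]   # ten VMs, same bytes

print(len(store.chunks), "chunk stored,", store.writes_skipped, "writes skipped")
```

Ten inserts, one stored object: the store never needs to know which VM a chunk came from.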

When client-side encryption is enabled, the chunk identity is derived from an HMAC-SHA-256 keyed by the per-datastore encryption key over the plaintext instead of a plain SHA-256, so dedup still works across snapshots that share a keyring. The detail belongs in the encryption deep dive, not here. For now, the takeaway is that the chunk ID is always content-derived.

A Real Example: 100GB Linux VM Across Two Backups

Concrete numbers make this less abstract. Picture a 100GB Linux VM running a standard application stack.

text
Day 1 (first backup):
  Disk size:                   100 GB
  Average chunk size:          ~4 MB
  Approx. chunks produced:     25,000
  Chunks already on server:    0 (new datastore)
  Chunks uploaded:             25,000
  Bytes on wire:               ~100 GB (compressed slightly)

Day 2 (incremental, after 5 GB of writes):
  Disk size:                   100 GB (unchanged)
  Writes since Day 1:          5 GB scattered
                               (logs, apt updates, app working set)
  Chunks fully invalidated:    ~1,250
  Boundary-shifted chunks:     ~10-30 (edges of changed regions)
  Chunks uploaded:             ~1,275
  Chunks reused server-side:   ~23,725
  Bytes on wire:               ~5-6 GB
Chunk math for a typical VM workload

Day 1 is the boring one. Everything is new, everything gets uploaded, the wire moves close to the raw disk size minus whatever compression saves on top.

Day 2 is where CDC earns its keep. The 5GB of changes are scattered across log files, package updates, and application working data, and each changed region invalidates the chunks that overlap it. Past the end of each edit, the rolling hash is back in sync as soon as the window holds only unchanged bytes, and the chunk boundaries realign at the next cut point. Each contiguous edit region therefore costs one or two extra boundary-shifted chunks at its edges, which adds up to the ~10-30 extra chunks in the table. The rest of the disk is identical to yesterday and the server already has those chunks. The client never sends them.

The wire ends up moving roughly the volume of changed data plus a small overhead for the chunk-existence checks. That is the 800MB-on-an-otherwise-quiet-day pattern, scaled up to a workload with real activity.
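
The table's numbers are plain arithmetic, reproduced here so you can plug in your own disk size and change rate. It uses decimal units and takes 25 boundary-shifted chunks as an assumed value from the ~10-30 range. It also assumes the 5GB of writes fill whole chunks, which is the friendly case.

```python
GB = 1000 ** 3
MB = 1000 ** 2

disk_bytes = 100 * GB
avg_chunk = 4 * MB
changed_bytes = 5 * GB
boundary_chunks = 25   # assumption: a value inside the ~10-30 estimate

total_chunks = disk_bytes // avg_chunk
invalidated = changed_bytes // avg_chunk
uploaded = invalidated + boundary_chunks
reused = total_chunks - uploaded
wire_gb = uploaded * avg_chunk / GB

print(f"total={total_chunks} invalidated={invalidated} "
      f"uploaded={uploaded} reused={reused} wire={wire_gb:.1f} GB")
```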

Random-write databases break the pattern

A 50GB PostgreSQL instance under heavy OLTP load can rewrite blocks anywhere on the device between snapshots. Even though the total write volume might be only 2GB, those writes are spread across many chunks, and a chunk must be re-uploaded in full if even one page inside it changed. The upload size therefore tracks the number of chunks touched rather than the number of bytes written. Dedup still helps across snapshots of the same database; it just helps less than it does on an OS disk.
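
A back-of-the-envelope model shows how quickly scattered writes spread out. It assumes 8 KiB pages landing uniformly at random and treats chunks as fixed 4 MiB slots. Both are simplifications, since real workloads have hot spots and real chunk sizes vary, so read the result as a worst case.

```python
def expected_touched_chunks(disk_bytes, chunk_bytes, write_bytes, page_bytes):
    """Expected number of chunks hit when page-sized writes land uniformly at random."""
    chunks = disk_bytes // chunk_bytes
    pages_written = write_bytes // page_bytes
    p_untouched = (1 - 1 / chunks) ** pages_written
    return chunks, chunks * (1 - p_untouched)

GiB, MiB, KiB = 1024 ** 3, 1024 ** 2, 1024

# 2 GiB of random page writes on a 50 GiB disk.
chunks, touched = expected_touched_chunks(50 * GiB, 4 * MiB, 2 * GiB, 8 * KiB)
print(f"2 GiB written: {touched:.0f} of {chunks} chunks touched")

# Even 20 MiB of random page writes dirties a large slice of the disk.
_, touched_small = expected_touched_chunks(50 * GiB, 4 * MiB, 20 * MiB, 8 * KiB)
print(f"20 MiB written: {touched_small:.0f} chunks, "
      f"about {touched_small * 4 * MiB / GiB:.1f} GiB to upload")
```

Under this model the upload is governed by how widely the writes are scattered, not by how many bytes they add up to.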

When Chunking Wins, When It Does Not

Be honest about the wins and losses. Content-defined chunking is fantastic when an edit leaves the bytes around it unchanged. It struggles when a small logical change rewrites most of the bytes on disk.

The wins:

  • OS images, application binaries, source trees. Most files change rarely; when they do change, the changes are localised. Dedup ratios of 20:1 or higher are common across snapshots and across VMs that share a base image.
  • Log files and append-mostly data. New writes go to the end; old data does not move. The first few hundred chunks of a log file are identical day after day.
  • Document trees, media libraries, software repositories. Files are added, occasionally edited, rarely shuffled. Cross-snapshot dedup is excellent.

The losses:

  • Pre-encrypted volumes. A LUKS-encrypted disk looks like noise: it does not compress, and the same file inside two different VMs encrypts to different bytes, so cross-VM dedup is gone. Snapshots of the same disk can still share unchanged sectors while the key and layout stay the same, but anything that re-encrypts the volume leaves nothing in common at the chunk level. If dedup matters, prefer letting PBS handle encryption at the chunk layer, where dedup is preserved, over guest-level full-disk encryption like LUKS.
  • Pre-compressed archives. A 10GB compressed tarball that gets one file added is recompressed, and the compressed stream diverges from the point of the change onward. Every byte after the change is different. Dedup ratios drop to near zero.
  • Random-write databases. Already covered above. The dedup story is "less bad than fixed-block chunking," not "great."

The rule of thumb: if you can predict that long runs of yesterday's bytes will appear unchanged in tomorrow's snapshot, even at shifted positions, CDC will find the overlap. If yesterday and tomorrow have nothing in common at the byte level, chunking can only do so much.

The Encryption Ordering Question

This belongs in part 4 of the series, but it is worth flagging here because it explains a choice that confuses people coming from other backup products.

If you encrypt the data before chunking, with a fresh key or nonce for each run, every snapshot looks like uniform random noise. The Buzhash rolling hash will still produce chunks, but no two snapshots will share any. Dedup ratio: 1.0. That is what happens with naive encrypt-then-chunk schemes.

Proxmox Backup Server encrypts at the chunk level, after the content has been chunked, using an HMAC over the plaintext to derive a deterministic chunk ID. The server stores ciphertext and never sees the plaintext, but identical plaintext from the same keyring still produces identical chunk IDs. Dedup survives. Privacy survives. That ordering decision is one of the more important design choices in the protocol, and part 4 of this series digs into how encryption preserves dedup.
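
The property is easy to demonstrate with Python's standard `hmac` module. This is a sketch of the idea as described here (a keyed digest over the plaintext). The exact construction PBS uses may differ, and the keys and the `chunk_id` helper are invented for the demo.

```python
import hashlib
import hmac

def chunk_id(key, plaintext):
    """Keyed, deterministic chunk ID computed over the plaintext (HMAC-SHA-256)."""
    return hmac.new(key, plaintext, hashlib.sha256).hexdigest()

key_a = bytes.fromhex("00" * 32)   # hypothetical keys, for the demo only
key_b = bytes.fromhex("11" * 32)
chunk = b"identical plaintext chunk"

same_key_same_id = chunk_id(key_a, chunk) == chunk_id(key_a, chunk)
other_key_other_id = chunk_id(key_a, chunk) != chunk_id(key_b, chunk)
unkeyed_differs = chunk_id(key_a, chunk) != hashlib.sha256(chunk).hexdigest()

print(same_key_same_id, other_key_other_id, unkeyed_differs)
```

Same key and same bytes give the same ID, so the server can deduplicate without ever seeing plaintext. A different key gives unrelated IDs, so tenants cannot probe each other's data by hash.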

Wrapping Up

Chunks are the atom of Proxmox Backup Server. Dedup, encryption, sync replication, pruning and garbage collection, restore, verification: everything operates on top of the single decision to cut chunk boundaries based on content rather than position. Once you internalise that one mechanism, the rest of the system stops looking like magic and starts looking like a sensible pipeline of small steps. The next post picks up at the chunk store itself and the index files that turn a pile of chunks back into a coherent VM disk.

Frequently Asked Questions

Can I change the chunk size?

Not directly. The chunk size target is a property of the on-wire protocol and the chunk store layout. Changing it would break dedup across existing snapshots. The bounds (roughly 1MB min, 16MB max) exist to handle outliers, not as a knob to tune.

Can PBS chunk anything other than VM disks?

Yes. The proxmox-backup-client tool can back up filesystem trees, archives, and other byte streams, all through the same CDC pipeline. File-level backups deduplicate against each other and against VM disk backups when the underlying byte content matches.

Why Buzhash rather than FastCDC or Gear hashing?

Buzhash is simple, fast on modern CPUs, and its distribution properties are well understood for the target chunk size range. FastCDC and Gear hashing have some advantages in specific benchmarks, but the gain in practice is small compared to the total cost of SHA-256 and disk I/O. Changing the rolling hash now would also break dedup with every existing chunk store.

How are sparse disks handled?

The proxmox-backup-client reads sparse regions as zero bytes. Long runs of zeros produce highly deduplicable chunks (often a single chunk repeated many times), so sparse VM disks back up far faster than their nominal size suggests.

What happens if two different chunks share a SHA-256 hash?

A SHA-256 collision has not been demonstrated and is computationally infeasible with current hardware. If one ever occurred, PBS would treat the two chunks as identical and only store one. Collisions are negligibly likely in practice, so PBS treats SHA-256 as a content identifier and does not engineer for collision handling.
Bennet Gallein

remote-backups.com operator

Infrastructure enthusiast and founder of remote-backups.com. I build and operate reliable backup infrastructure powered by Proxmox Backup Server, so you can focus on what matters most: your data staying safe.