Inside the PBS Chunk Store: How Dedup Works

Proxmox Backup Server does not keep a database of backups. It keeps a heap of content-addressed chunks and a stack of small index files pointing into them. Once you see the on-disk layout this way, deduplication, pruning, and garbage collection stop being mysterious, and the dedup ratios near 10x that are achievable on shared infrastructure start to make sense.

Key Takeaways
  • The chunk store is a content-addressed heap; every chunk filename is the SHA-256 of its contents.
  • An index file (.fidx or .didx) is an ordered list of chunk hashes plus offsets. Nothing more.
  • A backup snapshot is a tiny manifest pointing at index files. The bulk data lives in the shared .chunks pile.
  • Deduplication is server-side, but the client asks first and skips uploading chunks the server already has.
  • Pruning a snapshot deletes index files. It does not free disk space. Only garbage collection does.
  • Typical dedup ratios: 1.2-2x for a single VM, 3-5x for a small homelab, 8-12x for an MSP with similar OS stacks.

This is post 3 in the How Proxmox Backup Server Works Under the Hood series. Part 2 explained how PBS chunks the input stream; this one covers what happens to those chunks after the client decides to upload them.

The Chunk Store on Disk

Every datastore has a .chunks directory at its root. Inside it sit exactly 65,536 subdirectories, named 0000 through ffff. Each holds actual chunk files, and every chunk file is named after the full SHA-256 hash of its contents. The first four hex characters of the hash decide which subdirectory the chunk lives in.

Sharding exists for one reason: filesystem performance. A serious datastore holds tens of millions of chunks. Spread across 65,536 buckets, even a 100-million-chunk store averages only about 1,500 entries per directory. Here is what a real listing looks like:

bash
$ tree -L 2 /mnt/datastore/main/.chunks | head -20
/mnt/datastore/main/.chunks
├── 0000
│   ├── 0000a1b2c3d4e5f6...
│   └── 0000ff11ee22dd33...
├── 0001
│   ├── 00012a4b6c8de0f1...
│   └── 0001abcd1234ef56...
├── 0002
│   └── 0002777888999aaa...
...
└── ffff
    └── ffff0a1b2c3d4e5f...

$ ls /mnt/datastore/main/.chunks/0000 | wc -l
1847
Datastore on-disk layout

Each chunk file is independent. No central index, no locking, no transaction log. The filename is the integrity check: if the contents do not hash back to the name, the chunk is corrupt. That property is what makes the chunk store crash-resilient and easy to verify out of band.
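The naming rule and the "filename is the integrity check" property can be modelled on a throwaway directory. This is a minimal sketch, assuming chunks are stored as raw, uncompressed bytes; `chunk_path` and `verify_chunk` are hypothetical helpers, not PBS commands. A real datastore compresses chunks on disk, so do not expect `sha256sum` on a real chunk file to match its name; use a verify job there.

```bash
#!/usr/bin/env bash
# Toy model of a content-addressed chunk store. Assumption: chunks are stored
# as raw, uncompressed bytes, which is NOT how a real PBS datastore stores them.
set -euo pipefail

# Map a hex digest to its path: the first four hex characters select the
# shard directory, and the full digest is the filename.
chunk_path() {
    local store=$1 digest=$2
    printf '%s/.chunks/%s/%s\n' "$store" "${digest:0:4}" "$digest"
}

# Succeed only if the file's contents hash back to its own filename.
verify_chunk() {
    local file=$1 actual
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    [ "$actual" = "$(basename "$file")" ]
}

store=$(mktemp -d)      # hypothetical throwaway datastore root
payload=$(mktemp)
printf 'hello chunk' > "$payload"

digest=$(sha256sum "$payload" | cut -d' ' -f1)
target=$(chunk_path "$store" "$digest")
mkdir -p "$(dirname "$target")"
cp "$payload" "$target"

verify_chunk "$target" && echo "chunk ok: $target"

# Simulate bit rot: the name no longer matches the contents.
printf 'x' >> "$target"
verify_chunk "$target" || echo "chunk corrupt: $target"
```

Because the check needs nothing but the file itself, it can run from any host that can read the directory, which is the "out of band" property described above.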

Index Files: .fidx and .didx

Chunks alone are not useful. You need a recipe that lists them in the right order to reconstruct a disk image or an archive. That recipe is the index file. PBS uses two index formats:

  • .fidx (fixed index). Used for block devices and VM disk images. Every chunk is the same size, so the index is an array of chunk hashes.
  • .didx (dynamic index). Used for file-level archives where chunks vary in size (the content-defined chunker decides the boundaries). The index stores (offset, chunk_hash) pairs.

Both formats have a small header followed by the array. They are compact: at one 32-byte SHA-256 hash per 4 MB chunk, a 1 TB VM disk needs only about 8 MB of index. The index is the recipe, the chunks are the ingredients. The recipe is cheap to copy and inspect. The ingredients are heavy and stay in the shared pile.
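To see why index files stay small, here is a rough size estimate for a fixed index. It is a sketch under stated assumptions: 4 MiB fixed-size chunks, one 32-byte SHA-256 digest per chunk, and the header ignored. `fidx_bytes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Rough .fidx size estimate. Assumptions: 4 MiB fixed chunks, one 32-byte
# SHA-256 digest per chunk, header size ignored.
set -euo pipefail

fidx_bytes() {
    local disk_bytes=$1
    local chunk_size=$((4 * 1024 * 1024))
    local digest_size=32
    # Round up: a partially filled last chunk still needs an index entry.
    local chunks=$(( (disk_bytes + chunk_size - 1) / chunk_size ))
    echo $(( chunks * digest_size ))
}

one_tib=$((1024 ** 4))
echo "1 TiB disk -> $(fidx_bytes "$one_tib") bytes of digests"
echo "4 TiB disk -> $(fidx_bytes $((4 * one_tib))) bytes of digests"
```

Under these assumptions the index grows by 8 MiB per TiB of disk, roughly five orders of magnitude smaller than the data it describes.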

What a Snapshot Actually Is

A snapshot directory on disk looks like this:

bash
$ ls -lh /mnt/datastore/main/vm/100/2026-05-19T02:00:00Z/
-rw-r--r-- 1 backup backup  812 May 19 02:00 index.json.blob
-rw-r--r-- 1 backup backup  48K May 19 02:01 drive-scsi0.img.fidx
-rw-r--r-- 1 backup backup  12K May 19 02:01 drive-scsi1.img.fidx
-rw-r--r-- 1 backup backup  2.1K May 19 02:01 qemu-server.conf.blob
-rw-r--r-- 1 backup backup  640 May 19 02:01 client.log.blob
A real snapshot directory

That is the entire snapshot. A handful of small files. The index.json.blob is the manifest: it lists the index files in this snapshot, their sizes, their hashes, and any encryption metadata. The .fidx files are the per-disk recipes. The .blob files are small inline payloads like the VM config and the client log. The actual data, every byte of every chunk referenced by those .fidx files, lives in the shared .chunks heap. That is why deleting a snapshot is cheap and moving one between datastores is not.

Snapshot mobility is a chunk copy

Because all snapshots share the .chunks pile, moving a snapshot to a different datastore means copying every chunk it references into the destination's .chunks directory. It is never a fast metadata move. Sync jobs, restore-to-different-datastore, and cross-host workflows all pay that cost.

How Deduplication Actually Happens

Here is what happens when a client runs a backup against a PBS datastore:

  1. The client reads the input stream (a block device, a tar archive, whatever) and chunks it.
  2. For each chunk, the client computes its SHA-256 hash.
  3. The client opens an authenticated HTTP/2 session and asks the server, per chunk: "do you already have a chunk with this hash?"
  4. The server checks the chunk store. If the chunk file exists, it answers yes and the client skips the upload entirely.
  5. If the chunk is missing, the client uploads it. The server writes to a temp file inside .chunks/<prefix>/ and atomically renames to the final hash name.
  6. Once all chunks for a disk are accounted for, the client registers the .fidx or .didx file with the full ordered list of hashes.

Bytes on the wire are only the chunks the server did not already have, across every client sharing the datastore. Two VMs on the same Debian release share almost all of their base-image chunks. Thirty MSP clients running similar workloads share the kernel, libc, systemd binaries, and most package metadata. Every chunk gets uploaded once; every subsequent client with the same chunk pays nothing.
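The "ask first, upload only if missing" flow can be sketched locally. This is a toy stand-in, assuming no network, no compression, and a plain text file as the index; `put_chunk` is a hypothetical helper, not part of any PBS tool. It shows the existence check, the temp-file-plus-rename write, and the index being appended either way.

```bash
#!/usr/bin/env bash
# Toy, local stand-in for the dedup flow. Assumptions: no network, no
# compression, and an "index" is a text file with one digest per line.
set -euo pipefail

store=$(mktemp -d)      # hypothetical datastore root
uploaded=0              # chunks actually written
skipped=0               # chunks the store already had

put_chunk() {
    local src=$1 index=$2 digest dir tmp
    digest=$(sha256sum "$src" | cut -d' ' -f1)
    dir="$store/.chunks/${digest:0:4}"
    if [ -e "$dir/$digest" ]; then
        skipped=$((skipped + 1))            # the store already has this chunk
    else
        mkdir -p "$dir"
        tmp=$(mktemp "$dir/tmp.XXXXXX")     # temp file in the same directory
        cp "$src" "$tmp"
        mv "$tmp" "$dir/$digest"            # rename is atomic within one filesystem
        uploaded=$((uploaded + 1))
    fi
    echo "$digest" >> "$index"              # the index records the chunk either way
}

work=$(mktemp -d)
printf 'base image block' > "$work/a"
printf 'vm-specific block' > "$work/b"

# "VM 100" backs up two chunks; "VM 101" shares one of them.
put_chunk "$work/a" "$work/vm100.idx"
put_chunk "$work/b" "$work/vm100.idx"
put_chunk "$work/a" "$work/vm101.idx"

echo "uploaded=$uploaded skipped=$skipped"
```

The second client's backup adds a line to its own index but writes nothing to the chunk pile, which is the whole dedup mechanism in miniature.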

Why Server-Side and Not Client-Side?

Why does the client ask per chunk instead of keeping a local cache of "what the server has"?

Client-side dedup means every client needs a synchronized view of the server's chunk inventory before uploading. That cache has to be kept fresh across many hosts, handle concurrent uploaders, invalidate correctly when garbage collection runs, and must not lie. Stale-cache bugs in client-side dedup systems are notorious. They cause silent data loss when a client thinks the server has a chunk it does not.

PBS picked the simpler model: the server is the source of truth, the client asks per chunk, and the protocol makes that asking cheap. HTTP/2 multiplexes thousands of small chunk-existence requests over one TCP connection, so requests overlap instead of each waiting out a full network round trip, and the amortized cost per chunk becomes negligible in practice. The protocol internals post covers the H2 layer in more detail.

Dedup Ratios in the Wild

Real-world dedup ratios depend heavily on what is in the datastore. Here is the rough shape of what we see:

Expected Dedup Ratios by Deployment

| Deployment | Typical Ratio | Main Driver |
| --- | --- | --- |
| Single VM, one OS | 1.2x to 2x | Internal redundancy only |
| Small homelab (5-10 VMs) | 3x to 5x | Shared base images |
| MSP, 30 clients, similar stacks | 8x to 12x | Shared OS and runtime chunks |
| Multi-tenant, mixed OSes | 4x to 8x | Partial OS overlap |
| Pre-encrypted data inside the VM | ~1x | Entropy defeats dedup |

Shared base content drives the numbers. Same Debian, same nginx, same Java runtimes, same node_modules. Once any client uploads those chunks, every other client gets them at zero cost. If a tenant encrypts data inside the guest before PBS sees it, the chunks look random and dedup collapses to roughly 1x.
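A dedup ratio is just referenced data divided by stored data. The sketch below estimates it for a toy case, assuming every chunk is the same size and an index is a text file with one chunk name per line; `dedup_ratio` and the chunk names are made up for illustration.

```bash
#!/usr/bin/env bash
# Toy dedup-ratio estimate. Assumptions: every chunk is the same size, and an
# "index" is a text file with one chunk name per line.
set -euo pipefail

# Ratio = chunk references across all indexes / distinct chunks stored.
dedup_ratio() {
    local refs unique
    refs=$(cat "$@" | wc -l)
    unique=$(cat "$@" | sort -u | wc -l)
    awk -v r="$refs" -v u="$unique" 'BEGIN { printf "%.2f\n", r / u }'
}

work=$(mktemp -d)
# Three hypothetical VMs sharing an OS base (os1..os4) plus one private chunk each.
printf '%s\n' os1 os2 os3 os4 app-a > "$work/vm1.idx"
printf '%s\n' os1 os2 os3 os4 app-b > "$work/vm2.idx"
printf '%s\n' os1 os2 os3 os4 app-c > "$work/vm3.idx"

dedup_ratio "$work"/vm*.idx    # 15 references over 7 distinct chunks
```

Adding a fourth VM with the same base raises the ratio again, while a VM whose chunks are all unique pulls it back toward 1x, matching the table above.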

Garbage Collection: The Mark and Sweep

This is the part operators get wrong. Pruning a snapshot does not free disk space.

Pruning removes index files, nothing more. A chunk the pruned snapshot referenced might still be referenced by ten other snapshots, so the chunk has to stay. Finding chunks that nothing references anymore is the job of garbage collection, a separate two-phase process:

  • Mark phase. Walk every index file in the datastore. For every referenced chunk, touch the chunk file on disk. PBS uses the filesystem's atime as the mark bit. PBS forces an atime update during mark, so relatime is fine; noatime will break GC.
  • Sweep phase. Walk the entire .chunks directory. Delete any chunk whose atime is older than the GC grace period. The default grace is 24 hours plus a 5-minute safety margin to cover clock skew and in-flight uploads whose index has not yet been written.

GC must run on a schedule

PBS does not enable garbage collection by default on new datastores. The operator must configure a schedule. If you prune aggressively but never set a GC schedule, your apparent free space will not move and you will think dedup is broken. Check the datastore "Garbage Collection" tab in the UI, or schedule proxmox-backup-manager garbage-collection start <store> via cron or a systemd timer.
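The mark and sweep phases can be modelled on a throwaway directory. This is a sketch of the idea, not PBS's implementation. It assumes GNU `touch` and `find`, a flat chunk directory without shards, and indexes that are text files with one chunk name per line. The chunk names and the `mark` and `sweep` helpers are hypothetical.

```bash
#!/usr/bin/env bash
# Toy mark-and-sweep. Assumptions: GNU touch and find, chunks stored flat
# (no shard directories), indexes are text files with one chunk name per line.
set -euo pipefail

store=$(mktemp -d)
mkdir "$store/chunks" "$store/snapshots"

# Three chunks, all last accessed two days ago.
for c in aaa bbb ccc; do
    echo "data-$c" > "$store/chunks/$c"
    touch -a -d '2 days ago' "$store/chunks/$c"
done

# One surviving snapshot references aaa and bbb; ccc is orphaned.
printf '%s\n' aaa bbb > "$store/snapshots/vm100.idx"

# Mark: refresh the atime of every chunk that any index still references.
mark() {
    local idx name
    for idx in "$store"/snapshots/*.idx; do
        while read -r name; do
            touch -a -c "$store/chunks/$name"   # -c: never create a missing chunk
        done < "$idx"
    done
}

# Sweep: delete chunks whose atime is older than 24 h + 5 min (1445 minutes).
sweep() {
    find "$store/chunks" -type f -amin +1445 -delete
}

mark
sweep
ls "$store/chunks"
```

Running `sweep` without `mark` would delete all three chunks in this model, which is why the two phases always run together and why a chunk touched recently is left alone.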

bash
# Run GC manually
proxmox-backup-manager garbage-collection start main

# Show GC status and last run
proxmox-backup-manager garbage-collection status main

# Configure GC schedule in the datastore config
proxmox-backup-manager datastore update main \
    --gc-schedule "daily"
Garbage collection from the CLI

Why Deletion Is Cheap But Reclamation Is Slow

The operational consequence is one that operators have to internalise:

  • Deleting a snapshot is instant. A few unlink calls on index files.
  • Reclaiming disk space is not. GC walks every index in the datastore, then every chunk file on disk. On a multi-terabyte chunk store, that takes hours.
  • Capacity planning has to assume GC runs on its schedule, not on demand. If retention says "keep 30 days" and GC runs weekly, peak disk usage reflects somewhere between 30 and 37 days of data.

The PBS pruning and garbage collection post covers the scheduling tradeoffs, and the capacity planning guide for multi-client setups walks through sizing the buffer properly.

What the Filesystem Sees

The chunk store is a specific workload from the filesystem's perspective:

  • Many small-to-medium files, typically a few hundred KB to a few MB on disk after Zstandard compression of a 4 MB logical chunk.
  • High inode count. Millions of files is normal on a serious datastore.
  • Sequential writes during backup (chunks land in chunker order).
  • Random reads during restore (chunks are read in index order, not write order).

That mix favors filesystems that handle small-to-medium files efficiently and tune well for high inode density. ZFS with recordsize=1M is a common choice because large records suit chunk files that run from a few hundred KB to a few MB, keep fragmentation low, and give you scrub and snapshot tooling on top. The storage backend comparison post covers the tradeoffs against ext4 and LVM-thin.

Wrapping Up

The chunk store is simple to describe and powerful in aggregate. A flat directory of SHA-256-named files, sharded for filesystem health. Thin index files that point into the pile. A tiny manifest per snapshot. Garbage collection as a separate background job because cheap deletion plus delayed reclamation is the only way the math works when many snapshots share chunks.

The next post in the series covers encryption: how PBS keeps deduplication working even when the server cannot see the plaintext. That question trips up most people coming from traditional backup products, and the answer is more elegant than you might expect.

Frequently Asked Questions

How often should garbage collection run?

A weekly schedule works for most deployments. Daily GC adds disk I/O overhead without much benefit unless you prune aggressively. Run it manually after a large prune to reclaim space sooner. Avoid scheduling GC during backup or sync windows; the grace period keeps in-flight chunks safe, but heavy GC I/O can slow active jobs.

Can I move a chunk store to another filesystem while PBS is running?

Not safely. The chunk store relies on consistent atime semantics for garbage collection, and copying tens of millions of files between filesystems takes hours during which new backups would land in only one of the two locations. Stop the datastore, rsync or zfs send the chunk store and index files, then point PBS at the new mount point.

What happens when a chunk is corrupted on disk?

The filename is the SHA-256 of the contents, so corruption is detectable. PBS verify jobs read each chunk and check the hash. A corrupt chunk fails verification, and the affected snapshots are flagged. The chunk itself stays on disk until you remove the bad snapshots and let GC clean it up, because removing a referenced chunk would corrupt other snapshots.

Why does PBS use the filesystem instead of a database to track chunks?

A database would be a single point of failure and a synchronization bottleneck. The filesystem-as-index approach means every chunk is independent, writes are atomic via rename, and crash recovery is automatic. The tradeoff is that filesystem performance matters; that is why the chunk store is sharded into 65,536 directories.

Does deduplication work across namespaces?

Yes. Namespaces are a logical separation for access control and organization. The chunk store is shared across all namespaces in a datastore, so identical chunks from different namespaces are stored once. If you need strict storage isolation between tenants, use separate datastores, not separate namespaces.

Bennet Gallein

remote-backups.com operator

Infrastructure enthusiast and founder of remote-backups.com. I build and operate reliable backup infrastructure powered by Proxmox Backup Server, so you can focus on what matters most: your data staying safe.