
PBS Troubleshooting: Fix Common Issues

PBS doesn't fail often. When it does, you need to diagnose and fix it fast, before your next backup window opens or your client notices. This guide covers the most common Proxmox Backup Server failures in order of frequency: connection problems, job failures, performance degradation, garbage collection confusion, and sync issues.

Key Takeaways
  • Port 8007 is the PBS HTTPS interface — check firewall, service status, and fingerprint before anything else
  • Username realm matters: user@pbs for PBS users, user@pam for Linux system users
  • Pruning removes snapshot manifests but does not free disk space — only garbage collection does that
  • GC duration scales with chunk count, not raw storage size; large datastores taking hours is normal
  • Verify failures don't mean a backup is gone — they flag specific corrupted chunks; re-run the affected backup
  • For MSPs: read the PBS task log directly, not just the PVE job summary — the detail is in PBS

Connection Issues

Connection failures account for a large portion of PBS support tickets. Most come down to four things: firewall, service state, addressing, or TLS fingerprint.

Connection Refused or Timeout

PBS serves its web interface and API on port 8007/TCP over HTTPS. If connections time out or get refused, work through this checklist:

Firewall: Check that port 8007 is open on the PBS host. On Debian-based systems:

bash
# Check current iptables rules
iptables -L INPUT -n --line-numbers | grep 8007

# Or with ufw
ufw status | grep 8007

# Open the port if missing
ufw allow 8007/tcp
Check and open port 8007

Service state: Two systemd units handle PBS connections. Both must be running:

bash
systemctl status proxmox-backup.service
systemctl status proxmox-backup-proxy.service
Check PBS service status

The proxmox-backup-proxy service handles HTTPS on port 8007. If it's stopped, all remote connections fail regardless of firewall state.

Addressing: Verify the IP or hostname is correct on the client side. DNS resolution failures present identically to refused connections from the client's perspective.
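A quick way to separate DNS, TCP, and TLS failures is to test each layer in order. A minimal sketch, run from the client side (the hostname is a placeholder; adjust for your environment):

```shell
# Hypothetical helper: run from the PVE node or another client machine.
# Usage: check_pbs_reachable backup.example.com
check_pbs_reachable() {
  local host="$1"

  # 1. Name resolution (skip this step if you connect by IP)
  getent hosts "$host" || echo "DNS: '$host' does not resolve"

  # 2. TCP reachability on the PBS port
  if timeout 3 bash -c "exec 3<>/dev/tcp/$host/8007" 2>/dev/null; then
    echo "TCP: port 8007 reachable"
  else
    echo "TCP: port 8007 refused or filtered"
  fi

  # 3. TLS handshake (-k because a self-signed certificate is expected)
  curl -sk --connect-timeout 3 "https://$host:8007" >/dev/null \
    && echo "TLS: handshake OK" \
    || echo "TLS: handshake failed"
}
```

If step 2 fails while the PBS host is up, look at the firewall and service state; if only step 3 fails, suspect the certificate.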

TLS Fingerprint Errors

Proxmox Backup Server uses a self-signed certificate by default. PVE records the certificate's SHA-256 fingerprint when you add PBS as a storage backend. Reinstalling PBS or renewing the certificate generates a new fingerprint, and PVE then refuses the connection with a fingerprint mismatch error.

Find the current fingerprint:

bash
proxmox-backup-manager cert info | grep Fingerprint
Get current PBS TLS fingerprint

Alternatively, open the PBS web UI and navigate to Dashboard, then Certificate Information. Copy the fingerprint from there.

Update the stored fingerprint in PVE: go to Datacenter > Storage, edit your PBS storage entry, and paste the new fingerprint into the Fingerprint field.

Fingerprint changes break all connected PVE nodes

If multiple PVE clusters or standalone nodes use the same PBS instance, you must update the fingerprint in each one's storage configuration. Within a single cluster, /etc/pve/storage.cfg is shared, so one edit covers every node in that cluster, but separate clusters and standalone nodes each need their own update. One missed environment causes intermittent failures that look unrelated to the cert change.
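One way to script the update, sketched below: read the fingerprint from PBS, then push it into each PVE environment's storage entry. The host names and storage ID are placeholders, and it assumes `pvesm set` accepts the PBS storage's fingerprint option, as it does for PBS-type storages in current PVE releases.

```shell
# Sketch: refresh a PBS fingerprint on several standalone PVE nodes.
# Usage: update_pbs_fingerprint <pbs-host> <storage-id> <pve-node>...
update_pbs_fingerprint() {
  local pbs_host="$1" storage_id="$2" fp node
  shift 2
  # Pull the current fingerprint straight from the PBS host
  fp="$(ssh "root@$pbs_host" proxmox-backup-manager cert info \
        | awk -F': ' '/Fingerprint/ {print $2}')"
  [ -n "$fp" ] || { echo "could not read fingerprint from $pbs_host"; return 1; }
  echo "New fingerprint: $fp"
  for node in "$@"; do
    # One node per cluster is enough; standalone nodes each need this
    ssh "root@$node" pvesm set "$storage_id" --fingerprint "$fp"
  done
}
```

Example: `update_pbs_fingerprint pbs01 pbs-storage pve1 pve2 pve3`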

Authentication Failures

PBS uses realm-qualified usernames. The realm suffix is not optional:

  • user@pbs (PBS internal users): accounts created in the PBS user database
  • user@pam (Linux PAM): system users authenticated by the OS
  • user@realm!tokenname (API tokens): token-based auth for automation

Using admin instead of admin@pam or admin@pbs produces a 401 error. The token format for API tokens is user@realm!tokenname, not user@realm:tokenname.

Beyond usernames, check ACL permissions. A user needs at least DatastoreReader on the target datastore for read operations and DatastoreBackup to write backups. Missing ACLs produce permission denied errors, not auth failures, but the two are easy to confuse.
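For example, granting backup-write rights to a PBS-realm user on a single datastore. A sketch to run on the PBS host; the datastore and user names are placeholders:

```shell
# Sketch: grant DatastoreBackup on one datastore to a PBS user.
# Usage: grant_backup_acl <datastore> <auth-id>
grant_backup_acl() {
  local store="$1" user="$2"
  # DatastoreBackup allows writing backups; DatastoreReader covers reads/restores
  proxmox-backup-manager acl update "/datastore/$store" DatastoreBackup --auth-id "$user"
  # Confirm the assignment
  proxmox-backup-manager acl list
}
```

Example: `grant_backup_acl client-a 'backup-svc@pbs'`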

Backup Job Failures

Datastore Full

The most common backup failure. PBS rejects new backups when a datastore hits capacity. Before expanding storage, verify that maintenance has been running:

bash
# Check current datastore usage
proxmox-backup-manager datastore list

# List prune jobs and their last run time
proxmox-backup-manager prune-job list

# Check GC status and last run for a datastore
proxmox-backup-manager garbage-collection status <store>
Check datastore usage and maintenance jobs

If pruning has been running but the datastore is still full, garbage collection may not have run since the last prune. Pruning removes snapshot manifests. It does not free disk space. Only garbage collection reclaims space by deleting orphaned chunks. See our pruning and garbage collection guide for the full explanation of how these two operations relate.

Run GC manually to reclaim space immediately:

bash
proxmox-backup-manager garbage-collection start <datastore-name>
Run GC on a datastore manually

Snapshot Failures

QEMU guest agent not responding: If qemu-guest-agent is enabled in the VM config but not installed or running inside the guest, the backup may stall waiting for a quiesced snapshot. Fix: either install and start the agent inside the VM, or disable the agent option in the VM's hardware configuration if you don't need quiescing.
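From the PVE node you can confirm whether the agent actually answers before the next backup window. A sketch; the VM ID is a placeholder:

```shell
# Sketch: check guest agent health for a VM (run on the PVE node).
# Usage: check_guest_agent <vmid>
check_guest_agent() {
  local vmid="$1"
  # Is the agent option enabled in the VM config?
  qm config "$vmid" | grep '^agent'
  # Does the agent inside the guest respond? Fails if not installed/running.
  qm agent "$vmid" ping && echo "agent responding" || echo "agent not responding"
}
```

If the ping fails, either install and start the agent inside the guest (`systemctl enable --now qemu-guest-agent` on most distros) or disable the option with `qm set <vmid> --agent 0`.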

Storage backend errors: If the VM's underlying storage has I/O errors, backup reads fail. Check dmesg and storage logs on the PVE node to confirm the disk, ZFS pool, or NFS mount is healthy before blaming PBS.

Timeout During Backup

Large changed blocks over a slow or unstable network produce timeouts. The PBS client retries internally, but a sustained degraded connection can outlast the retry window.

Diagnose the network path first:

bash
# On PBS server (or any host near it): start iperf3 server
iperf3 -s

# On PVE node: test throughput
iperf3 -c <pbs-host-ip> -t 30
Test throughput between PVE node and PBS

If throughput is well below your expected baseline, the problem is the network path, not PBS configuration. For WAN-connected offsite targets, schedule backups during off-peak hours and consider seeding large initial datasets locally. Our initial seed loading guide covers transfer strategies for large first-time syncs.

Common Backup Failures at a Glance

  • "datastore full": GC hasn't run since the last prune. Fix: run proxmox-backup-manager garbage-collection start <store>
  • Backup stalls, then fails: guest agent timeout. Fix: install qemu-guest-agent or disable the agent in the VM config
  • Job fails after partial upload: network timeout or instability. Fix: test with iperf3, schedule during off-peak hours, reduce concurrent jobs
  • "permission denied" on backup write: ACL missing DatastoreBackup. Fix: add the ACL via the PBS UI or proxmox-backup-manager acl update
  • "no space left on device": datastore at 100% capacity. Fix: run GC, then tighten the retention policy if still full

Performance Issues

Slow Backups

Narrow down to network, disk, or CPU before changing any configuration.

Network: Run iperf3 between the PVE node and PBS host as shown above. Compare against your expected baseline.

Disk I/O: Run iostat -x 2 on both the PVE node and PBS host during an active backup. Watch %util for the storage devices involved. A device near 100% utilization is saturated.
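The disk check can be wrapped so you sample one device repeatedly during a live backup. A sketch; it assumes the sysstat package is installed and the device name is a placeholder:

```shell
# Sketch: sample extended disk stats during a backup window (needs sysstat).
# Usage: watch_disk_saturation <device> [samples]
watch_disk_saturation() {
  local dev="$1" samples="${2:-10}"
  # -d: device report, -x: extended stats.
  # Watch %util (near 100 = saturated) and await (latency per request).
  iostat -dx "$dev" 2 "$samples"
}
```

Example: `watch_disk_saturation sda 15` samples the device every 2 seconds, 15 times.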

CPU: Run htop on the PBS host during a backup. High CPU from proxmox-backup-proxy points to compression and checksum overhead. PBS compresses chunks with zstd on the client side before upload, and there is no per-datastore server switch to turn it off; if compressing a chunk doesn't shrink it, PBS stores it uncompressed. For large datasets that are already compressed (VM images full of compressed files, database backups using their own compression), the zstd pass yields little additional saving, so check the numbers before tuning anything.

Check the compression ratio first

PBS shows deduplication and compression statistics per datastore. If your compression ratio is below 1.05x (less than 5% savings), compression is contributing little; zstd bails out of incompressible data quickly, so the CPU cost on such data is usually smaller than it looks. If the ratio is above 1.4x, the overhead is clearly paying for itself.

Too Many Concurrent Jobs

PBS does not hard-limit concurrent backup jobs. If 30 VMs all back up simultaneously, PBS serves all of them at once, and CPU and disk I/O suffer on both ends. Stagger your PVE backup job schedules by 5 to 10 minutes. For MSPs managing multiple client environments from a shared PBS target, this is one of the most impactful configuration changes you can make.

We cover multi-client scheduling in depth in our PBS backup scheduling guide.
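PVE and PBS schedules use systemd-style calendar events, so you can sanity-check a staggered plan before committing it. A sketch, assuming a systemd host with `systemd-analyze` available; the times are placeholders:

```shell
# Sketch: preview when staggered backup windows would next fire.
preview_stagger() {
  local t
  # One job start every 10 minutes beginning at 01:00
  for t in "01:00" "01:10" "01:20" "01:30"; do
    systemd-analyze calendar "$t" | grep -E 'Normalized|Next elapse'
  done
}
```

If `Next elapse` shows two jobs landing in the same minute, the stagger isn't doing its job.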

Performance Diagnostics

  • Slow backup throughput: iperf3 -c <pbs-host>; compare against your line-rate baseline
  • Disk saturation: iostat -x 2 during a backup; look for %util above 90% on the storage device
  • CPU bottleneck: htop on the PBS host; look for high proxmox-backup-proxy CPU
  • Too many concurrent jobs: PBS web UI under Active Tasks; more than 4-6 simultaneous backup jobs is a warning sign
  • GC overlap with backups: proxmox-backup-manager garbage-collection status <store>; confirm the GC schedule doesn't overlap the backup window

Garbage Collection Problems

Space Not Freed After Pruning

This is the most common source of confusion in Proxmox Backup Server. The workflow is:

  1. Prune removes snapshot manifests based on your retention policy
  2. GC identifies chunks not referenced by any remaining snapshot and deletes them

If you prune 50 old snapshots but don't run GC, the disk usage does not change. Chunks from those snapshots may be shared with other snapshots through deduplication. GC is the only operation that actually frees disk space, and it only frees chunks that are no longer referenced by any snapshot.

Run GC after every significant pruning operation and check the output:

bash
# Run GC (replace <store> with your datastore name)
proxmox-backup-manager garbage-collection start <store>

# Check recent GC task logs for bytes freed
proxmox-backup-manager task list --limit 10 | grep -i garbage
Run GC and check freed space

GC Takes Too Long

GC duration scales with the number of unique chunks in the datastore, not with total storage size. A 10TB datastore with high deduplication runs faster than a 2TB datastore with many unique chunks. On HDD-backed datastores, the mark phase involves random reads across the entire chunk store, which is slow by nature.

Running GC weekly instead of daily keeps each run smaller. If GC regularly takes more than 4-6 hours, check available memory on the PBS host. The mark phase loads chunk manifests into memory. Tight memory causes swap usage, which extends GC significantly.

Never run GC while backup or sync jobs are active

PBS includes a 24-hour safety window that protects newly written chunks, but the safest approach is to schedule GC in a dedicated maintenance window with no concurrent backup or sync activity. An overlapping GC sweep can delete chunks that an in-progress backup hasn't finished indexing yet.

Verification Failures

PBS verify jobs read each snapshot's chunks and validate their checksums. A failed verify means one or more chunks have mismatched checksums, which indicates data corruption on disk.

A verify failure does not mean the entire backup is unrecoverable. PBS identifies which snapshots are affected. Re-run the backup for those VMs to create new, clean snapshots. Then prune the corrupted snapshots.
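After re-running the affected backups, a manual verify confirms the datastore is clean again. A sketch to run on the PBS host, assuming the `verify` subcommand available in current PBS releases; the datastore name is a placeholder:

```shell
# Sketch: re-verify a datastore and list recent verify tasks.
# Usage: reverify_store <datastore>
reverify_store() {
  local store="$1"
  # Reads every snapshot's chunks and re-checks their checksums
  proxmox-backup-manager verify "$store"
  # Recent verify tasks with their status
  proxmox-backup-manager task list --limit 10 | grep -i verif
}
```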

Prevention is straightforward: run verify jobs on a regular schedule and alert on failures. Catching corruption early, before you need the backup, is the entire point. Our PBS verify jobs guide covers scheduling strategies and interpreting verify output.

Alert on every verify failure

A silent verify failure on your offsite datastore that you discover at restore time is a worst-case scenario. Configure email or webhook notifications for failed verify tasks. It takes 10 minutes to set up and saves hours of recovery work.

Sync Job Issues

Sync job failures generally fall into three categories: remote connection problems, namespace configuration, and encryption mismatch.

Remote connection failures follow the same pattern as PVE-to-PBS connection issues: wrong hostname, port 8007 not reachable, fingerprint mismatch, or authentication failure. Work through the connection checklist from the first section.

Namespace mismatch: If you use PBS namespaces for multi-tenant isolation, sync jobs must specify matching source and target namespaces. A sync job anchored at the root namespace won't pull data from sub-namespaces unless its --max-depth setting allows recursion into them. Check the --ns and --remote-ns parameters, plus --max-depth, when a sync job succeeds but transfers no data.
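To inspect and widen the namespace scope of an existing sync job, something like the following sketch can help. The job ID and depth are placeholders, and it assumes `sync-job update` accepts the max-depth option as in current PBS releases:

```shell
# Sketch: inspect and widen a sync job's namespace scope (run on the pulling PBS).
# Usage: fix_sync_depth <job-id>
fix_sync_depth() {
  local job="$1"
  # Show configured sync jobs, including their ns / remote-ns settings
  proxmox-backup-manager sync-job list
  # Recurse two namespace levels below the configured anchor
  proxmox-backup-manager sync-job update "$job" --max-depth 2
}
```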

Encrypted sync: Sync jobs transfer raw chunks without decrypting them. If the source datastore contains a mix of encrypted and unencrypted backups, use --encrypted-only to ensure only encrypted data reaches the remote target. Omitting this flag on a mixed datastore silently transfers unencrypted chunks to your offsite server.

bash
proxmox-backup-manager sync-job create offsite-sync \
  --store local-datastore \
  --remote offsite-pbs \
  --remote-store remote-datastore \
  --encrypted-only true \
  --schedule "02:00"
Create sync job with encrypted-only flag

MSP Quick Checklist

When a client reports a backup problem, work through these steps before touching any configuration:

  1. Is PBS running? systemctl status proxmox-backup.service proxmox-backup-proxy.service
  2. Is the datastore full? proxmox-backup-manager datastore list
  3. Did the job fail or just warn? Check the PBS task log directly, not just the PVE job summary. PVE sometimes shows "OK with warnings" for jobs that actually wrote corrupted data.
  4. When did GC last run? proxmox-backup-manager garbage-collection status <store>
  5. What does the error message actually say? PBS error messages are specific. Read the full task log.
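The checklist above condenses into a single triage function you can keep on each PBS host. A sketch; the datastore name is a placeholder:

```shell
# Sketch: first-response triage, run on the PBS host before changing any config.
# Usage: pbs_triage <datastore>
pbs_triage() {
  local store="$1"
  # 1. Is PBS running?
  systemctl --no-pager status proxmox-backup.service proxmox-backup-proxy.service
  # 2. Is the datastore full?
  proxmox-backup-manager datastore list
  # 3. When did GC last run?
  proxmox-backup-manager garbage-collection status "$store"
  # 4. Read the full PBS task log, not just the PVE summary
  proxmox-backup-manager task list --limit 20
}
```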

For MSPs managing multiple PBS environments, catching problems before clients notice them requires centralized monitoring. Prometheus metrics from each PBS instance let you alert on "datastore approaching full" before it causes job failures. Our PBS monitoring guide covers the full Prometheus and Grafana setup for multi-environment visibility.

Quick Reference

Quick Reference Commands

  • Service status: systemctl status proxmox-backup-proxy.service (whether the HTTPS proxy is running)
  • Datastore usage: proxmox-backup-manager datastore list (used and total space per datastore)
  • TLS fingerprint: proxmox-backup-manager cert info (current certificate fingerprint)
  • Active tasks: proxmox-backup-manager task list --limit 20 (recent job history with status)
  • Prune jobs: proxmox-backup-manager prune-job list (configured prune schedules)
  • GC status: proxmox-backup-manager garbage-collection status <store> (last GC run for a datastore)
  • Run GC now: proxmox-backup-manager garbage-collection start <store> (trigger GC immediately)
  • ACL permissions: proxmox-backup-manager acl list (current permission assignments)

Wrapping Up

Most PBS problems have a short list of root causes: network or firewall blocking port 8007, fingerprint mismatch after a cert change, the pruning/GC confusion where space doesn't free up as expected, or too many concurrent jobs saturating CPU and disk. Work through the connection stack first, then look at job logs for specific error messages. PBS error output is specific enough that the fix is usually obvious once you're reading the right log.

For MSPs, the key is detecting issues before they cause job failures. Centralized monitoring with alerts on datastore capacity, failed verify jobs, and failed sync jobs covers the scenarios that matter most at scale.

Need the offsite PBS leg handled for you?

remote-backups.com provides encrypted PBS targets in EU datacenters. Managed GC scheduling, monitored sync jobs, and isolated credentials included.

View Plans

Bennet Gallein

remote-backups.com operator

Infrastructure enthusiast and founder of remote-backups.com. I build and operate reliable backup infrastructure powered by Proxmox Backup Server, so you can focus on what matters most: your data staying safe.