PBS High Availability: Dual-Node Sync & Failover

PBS protects your VMs. A single Proxmox Backup Server, though, is a single point of failure for your entire backup tier. This post walks through a resilient two-node design: a primary plus a secondary, sync jobs keeping them in step, and a defined failover path for when the primary goes down. No clustering required.

Key Takeaways
  • PBS does not support active-active clustering; HA here means sync plus a documented manual failover path
  • Pull-mode sync, where the secondary pulls from the primary, keeps the primary's inbound attack surface unchanged
  • Failover is a storage.cfg change on each PVE host, not a cluster operation; new backups land on the secondary within minutes
  • Namespaces let you mirror selectively, so a small secondary can hold only what matters most
  • Failing back is the dangerous step: reverse sync first, then switch clients, never the other way around
  • remote-backups.com works as a ready-made secondary node with no hardware to procure or rack

Why PBS HA Matters (and What "HA" Means Here)

Proxmox Backup Server has no Corosync ring, no DRBD volume, no shared-storage active-active mode. One PBS daemon owns its datastore exclusively. If you come from Veeam or a storage product with native replication, this is the first thing to internalise: HA on PBS is built from sync jobs and operational runbooks, not from a cluster manager.

What that buys you in practice:

  • Primary takes all backups during normal operation
  • Secondary receives synced copies on a schedule
  • Failover is manual or scripted, measured in minutes
  • Failback is deliberate, with a reverse sync step before clients switch back

That is enough to cover the failure scenarios that actually happen: a dead PSU, a corrupted ZFS pool, an OS upgrade gone sideways, a site-wide power event. It does not give you sub-second transparent failover, and it does not pretend to.

How PBS HA compares to other approaches

| Capability       | PBS Dual-Node       | Veeam Scale-Out Repo | Ceph-backed storage |
|------------------|---------------------|----------------------|---------------------|
| Active-active    | No                  |                      |                     |
| Sync mechanism   | Pull/push sync jobs | Native replication   | Ceph replication    |
| Failover model   | Manual or scripted  | Automated            | Transparent         |
| Setup complexity | Low                 | High                 | Very high           |
| License cost     | €0                  | Per-socket licensing | €0 (hardware-heavy) |

The PBS model trades automation for simplicity. Two boxes, sync jobs, a documented failover. There is nothing to debug at 3am except the network and the daemon you already know.

Architecture Design

Two patterns cover most deployments. Pick based on what you are protecting against.

Pattern A: Local Primary + Offsite Secondary

Primary lives at the main site. Secondary lives in a colo, a VPS, or a managed PBS endpoint at remote-backups.com. The sync runs over the internet. The connection itself uses TLS, the snapshots stay encrypted at rest if client-side encryption is enabled on the PVE side, and the link is ideally tunneled through WireGuard or a site-to-site VPN.

This is the 3-2-1 strategy made operational. The offsite copy survives fire, flood, theft, and ransomware that walks across your LAN.

Pattern B: Two On-Premises Nodes

Primary in the server room, secondary in a different rack or building. Sync runs over the LAN, so it is fast and cheap. This protects against single-server hardware failure but does nothing for site-level events.

Pattern A vs Pattern B

| Criteria                          | Local + Offsite                   | Two On-Prem Nodes                |
|-----------------------------------|-----------------------------------|----------------------------------|
| Protects against hardware failure | Yes                               | Yes                              |
| Protects against site failure     | Yes                               | No                               |
| Network bandwidth need            | Moderate (dedup helps a lot)      | Low (LAN sync)                   |
| Failover complexity               | Reconnect clients to new endpoint | Update PBS hostname only         |
| Ongoing cost                      | Storage subscription or colo fee  | Second server, power, rack space |

The honest answer for most operators: do both. Pattern B for fast restores from local hardware failure, Pattern A for everything else. PBS supports chained sync, so the secondary can itself be a sync source for a third tier.

Configuring PBS Sync Jobs (Pull Mode)

Pull mode means the secondary reaches into the primary and copies chunks. The primary never opens a connection to the secondary, needs no awareness that it exists, and holds no credentials or API access pointing outward.

This matters for two reasons. First, the primary's network attack surface stays unchanged. Second, if the primary is ever compromised, the attacker cannot push poisoned chunks to the secondary because the primary holds no credentials for it.

Do not use an admin token for sync

The sync user needs read access on a single datastore, nothing more. A scoped DatastoreReader token cannot delete, prune, or modify anything on the primary. If the secondary is ever compromised, the blast radius stops at "attacker can read snapshots they could already pull."

Step by step.

1. Create the sync user and a scoped token on the primary.

bash
# Create the sync user
proxmox-backup-manager user create sync-user@pbs --password '<long-random>'

# Create an API token under that user
proxmox-backup-manager user generate-token sync-user@pbs sync-token

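# Scope read-only ACLs to the source datastore.
# A token's effective privileges are capped by its owning user's,
# so grant the role to both the user and the token.
proxmox-backup-manager acl update /datastore/main DatastoreReader \
    --auth-id 'sync-user@pbs'
proxmox-backup-manager acl update /datastore/main DatastoreReader \
    --auth-id 'sync-user@pbs!sync-token'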
On the primary PBS

The generate-token command prints the token secret once. Copy it immediately; PBS does not store it in retrievable form.

2. On the secondary, add the primary as a Remote.

In the web UI: Configuration → Remotes → Add. Fill in:

  • ID: a friendly name, for example primary-hub
  • Host: the primary's hostname or IP, reachable from the secondary
  • Auth ID: sync-user@pbs!sync-token
  • Password: the token secret from step 1
  • Fingerprint: the SHA-256 of the primary's TLS cert (the UI will offer to fetch it)

Or, by CLI:

bash
proxmox-backup-manager remote create primary-hub \
    --host primary-pbs.example.com \
    --auth-id 'sync-user@pbs!sync-token' \
    --password '<token-secret>' \
    --fingerprint '<sha256-fingerprint>'
On the secondary PBS

3. Create the sync job.

bash
proxmox-backup-manager sync-job create primary-mirror \
    --remote primary-hub \
    --remote-store main \
    --store mirror \
    --schedule 'hourly' \
    --remove-vanished true
Sync job: pull from primary's main store into secondary's mirror store

--remove-vanished true keeps the secondary's content aligned with the primary, including pruning. If you want the secondary to retain snapshots the primary has already deleted (an extra retention tier), set this to false and run a separate prune policy on the secondary.
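The job above mirrors the whole datastore. Since sync jobs are namespace-aware, a smaller secondary can pull a single namespace instead. The following is a minimal sketch, not a definitive recipe: the function name `create_ns_sync`, the job ID, and the namespace `critical` are hypothetical, and it assumes a PBS release whose `sync-job create` accepts `--remote-ns` and `--ns` (check `proxmox-backup-manager sync-job create --help` on your version). Setting `PBM=echo` prints the command instead of running it.

```bash
# Set PBM=echo for a dry run that only prints the command.
PBM=${PBM:-proxmox-backup-manager}

# usage: create_ns_sync JOB_ID REMOTE REMOTE_STORE LOCAL_STORE NAMESPACE
create_ns_sync() {
    # Pull only NAMESPACE from the remote store and keep the same
    # namespace name on the local (secondary) datastore.
    "$PBM" sync-job create "$1" \
        --remote "$2" \
        --remote-store "$3" \
        --store "$4" \
        --remote-ns "$5" \
        --ns "$5" \
        --schedule 'hourly'
}
```

A dry run such as `PBM=echo; create_ns_sync critical-mirror primary-hub main mirror critical` shows the exact command before anything is created.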

4. Run the sync once manually before trusting the schedule.

bash
proxmox-backup-manager sync-job run primary-mirror
proxmox-backup-manager task list --type sync
Trigger and watch

The first run is the slow one. Every chunk transfers. After that, PBS deduplication means only new and changed chunks move, and incremental syncs typically finish in single-digit minutes for healthy environments.

Sync Scheduling and Bandwidth

Daily sync is enough for most environments. Critical workloads (PCI estates, healthcare, finance) usually want a 4-hour or hourly schedule. The cost of a more aggressive schedule is mostly in network IO; the storage cost on the secondary is identical because dedup is content-addressed.

If the link between primary and secondary is shared with production traffic, rate-limit the sync job. From the UI: Sync Jobs → Edit → Rate Limit (MB/s). From the CLI:

bash
# The rate is a byte size; without a unit it is read as plain bytes per second
proxmox-backup-manager sync-job update primary-mirror \
    --rate-in 100MB
Cap a sync at 100 MB/s
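To see what a rate cap means for the initial full sync, where every chunk transfers, a back-of-the-envelope estimate is useful. The helper below is only a sketch: `estimate_sync_minutes` is a hypothetical name, it assumes decimal units (1 GB = 1000 MB) and a link that sustains the capped rate, and it ignores protocol overhead, so treat the result as a ballpark.

```bash
# Rough duration of a full sync at a given rate cap, in whole minutes.
# usage: estimate_sync_minutes SIZE_GB RATE_MB_PER_S
estimate_sync_minutes() {
    size_gb=$1
    rate_mb_s=$2
    # GB -> MB (decimal), divided by MB/s gives seconds, then /60 for minutes.
    echo $(( size_gb * 1000 / rate_mb_s / 60 ))
}

# A 2 TB datastore capped at 100 MB/s
estimate_sync_minutes 2000 100   # prints 333
```

At 333 minutes, roughly five and a half hours, the first sync of a 2 TB datastore fits in a night; later runs move only new and changed chunks.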

Watch the sync logs the same way you watch backup logs. From the UI: Datastore → {name} → Sync Jobs → Show Log. From the CLI: proxmox-backup-manager task list --type sync. If a sync starts failing silently, the secondary drifts, and you find out during failover, which is exactly the wrong moment. Hook these tasks into the same monitoring you use for backup jobs.
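One way to wire sync drift into monitoring is a freshness check on the last successful sync. The sketch below assumes your monitoring already knows the completion time of the last good sync as a Unix timestamp; how that value is obtained (task log parsing, a metrics exporter) is left open, and `sync_is_fresh` and `LAST_OK` are hypothetical names.

```bash
# Succeeds (exit 0) if the last good sync is recent enough, fails otherwise.
# usage: sync_is_fresh LAST_OK_EPOCH MAX_AGE_SECONDS [NOW_EPOCH]
sync_is_fresh() {
    last_ok=$1
    max_age=$2
    now=${3:-$(date +%s)}
    [ $(( now - last_ok )) -le "$max_age" ]
}

# Example: an hourly job should never be more than two hours behind.
# LAST_OK is assumed to be supplied by your monitoring; it defaults to
# "now" here only so the snippet runs on its own.
LAST_OK=${LAST_OK:-$(date +%s)}
if sync_is_fresh "$LAST_OK" 7200; then
    echo "sync fresh"
else
    echo "sync STALE: secondary is drifting" >&2
fi
```

Allowing twice the schedule interval tolerates one slow run without hiding a job that has stopped altogether.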

Verify after sync, not just transfer

A sync job copies chunks. It does not validate them on the secondary unless you run verification there. Schedule a weekly verify job on the secondary datastore so failover-day surprises stay theoretical.
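The weekly verification can be set up from the CLI as well as the UI. The wrapper below is a sketch under stated assumptions: `create_weekly_verify` and the job ID it derives are hypothetical, and the datastore name is whatever your secondary uses (`mirror` in this post). Setting `PBM=echo` turns it into a dry run that only prints the command.

```bash
# Set PBM=echo for a dry run that only prints the command.
PBM=${PBM:-proxmox-backup-manager}

# usage: create_weekly_verify DATASTORE
create_weekly_verify() {
    store=$1
    # Creates a verification job named "<store>-weekly-verify".
    "$PBM" verify-job create "${store}-weekly-verify" \
        --store "$store" \
        --schedule 'weekly'
}
```

Run `create_weekly_verify mirror` on the secondary, then confirm the job appears alongside the datastore's other scheduled jobs.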

Failover Procedure

When the primary goes down, the goal is to keep new backups landing somewhere safe while the primary recovers. Restores from the secondary's existing snapshots already work, and that is the larger of the two concerns.

Step 1. Confirm the secondary is healthy and recently synced.

bash
# Last sync run
proxmox-backup-manager task list --type sync --limit 5

# Datastore health
proxmox-backup-manager datastore show mirror
On the secondary

Step 2. Update PBS storage config on each PVE host.

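The PBS storage entry in /etc/pve/storage.cfg points at the primary's address. Change it to point at the secondary. On a PVE cluster, /etc/pve is shared across nodes, so one edit covers every member; standalone hosts each need their own change.

conf
pbs: backup-primary
    datastore main
    server primary-pbs.example.com
    username pve-backup@pbs
    fingerprint aa:bb:cc:...
/etc/pve/storage.cfg before
conf
pbs: backup-primary
    datastore mirror
    server secondary-pbs.example.com
    username pve-backup@pbs
    fingerprint dd:ee:ff:...
/etc/pve/storage.cfg after

The fingerprint and datastore name change too. The account in username (pve-backup@pbs is a placeholder) must exist on the secondary with write access to the mirror datastore, for example through the DatastoreBackup role; the read-only sync user from the previous section cannot take backups. Backup groups created by a sync job belong to the job's owner, so the backup account may also need ownership of those groups, or a namespace of its own, before it can add snapshots to them. Or do it in the UI: Datacenter → Storage → Edit → Server.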
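For more than a couple of hosts, the edit can be scripted. The function below is a minimal sketch, not an official tool: `switch_pbs_target` is a hypothetical name, it assumes the stanza layout shown above, and it writes the rewritten config to stdout so you can diff it before replacing anything.

```bash
# Rewrites server, datastore and fingerprint inside one "pbs:" stanza.
# usage: switch_pbs_target FILE STORAGE_ID NEW_SERVER NEW_DATASTORE NEW_FINGERPRINT
switch_pbs_target() {
    awk -v id="$2" -v server="$3" -v store="$4" -v fp="$5" '
        # A stanza header starts in column 1, e.g. "pbs: backup-primary".
        /^[^ \t#]/ { in_target = ($1 == "pbs:" && $2 == id) }
        # Inside the target stanza, swap the three connection fields.
        # Replaced lines are written with a four-space indent.
        in_target && $1 == "server"      { print "    server " server; next }
        in_target && $1 == "datastore"   { print "    datastore " store; next }
        in_target && $1 == "fingerprint" { print "    fingerprint " fp; next }
        # Everything else passes through untouched.
        { print }
    ' "$1"
}
```

A cautious workflow is to write the output to a scratch file, run `diff` against /etc/pve/storage.cfg, and copy it into place only once the diff shows exactly the three expected lines.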

Step 3. Trigger a test backup.

Pick a small VM or container. Run a one-shot backup. Confirm it lands on the secondary and verifies. This is the moment you find out whether your firewall, DNS, or VPN config remembered to allow the secondary's address.

Failover checklist

| Step                            | Action                                                 | Owner            | Time estimate       |
|---------------------------------|--------------------------------------------------------|------------------|---------------------|
| 1. Verify secondary             | Check sync job status and datastore health             | On-call engineer | 2 min               |
| 2. Update storage.cfg           | Change server address on every PVE host                | On-call engineer | 5 min for ≤10 hosts |
| 3. Test backup                  | One-shot backup of a small VM                          | On-call engineer | 5 min               |
| 4. Notify clients/team          | Email or Slack: failover complete, primary in recovery | Comms or on-call | 5 min               |
| 5. Open primary recovery ticket | Track the work; failback depends on it                 | Ops lead         | Async               |

Using a managed secondary

If your secondary is a managed PBS endpoint at remote-backups.com, the server address and fingerprint are in your dashboard under Connection Details. Failover follows the same procedure; you just do not have to keep the secondary's hardware alive as well.

Failing Back

This is the step that causes the most data loss in the wild, and it is almost always avoidable.

The instinct after the primary recovers is to switch clients back immediately. Don't. While the primary was down, every backup landed on the secondary. Those snapshots do not exist on the primary. If you point clients back at the primary now, you have an asymmetric dataset and no clear story for which copy is authoritative.

The correct path:

  1. Once the primary is back online, first pause the original primary-to-secondary sync job on the secondary (clearing its schedule is one way). With --remove-vanished true, a run against a primary that lacks the failover-era snapshots can delete them from the secondary. Then configure a reverse sync job on the primary that pulls from the secondary into the primary's datastore. Same pull-mode pattern, just inverted: now the primary holds a token scoped to the secondary's mirror store.
  2. Let the reverse sync catch up. Watch task logs until the primary holds every snapshot the secondary has.
  3. Verify the primary datastore. A verification run, for example proxmox-backup-manager verify main, confirms the chunks are intact after the bulk transfer.
  4. Switch PVE hosts back to the primary in storage.cfg.
  5. Disable the reverse sync job. Re-enable the original primary-to-secondary sync.
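Step 2 is easier to trust with an explicit comparison than by reading task logs alone. The sketch below assumes you have exported one snapshot identifier per line from each server into two text files; the export method and the name `missing_on_primary` are assumptions, not part of PBS.

```bash
# Prints snapshots listed for the secondary that are absent on the primary.
# usage: missing_on_primary SECONDARY_LIST_FILE PRIMARY_LIST_FILE
missing_on_primary() {
    sec_sorted=$(mktemp)
    pri_sorted=$(mktemp)
    # comm needs sorted input; -u also drops duplicate lines.
    sort -u "$1" > "$sec_sorted"
    sort -u "$2" > "$pri_sorted"
    # -23 suppresses lines unique to the second file and lines common
    # to both, leaving only what the secondary has and the primary lacks.
    comm -23 "$sec_sorted" "$pri_sorted"
    rm -f "$sec_sorted" "$pri_sorted"
}
```

Empty output means the primary holds everything the secondary does, which is the condition for moving on to step 3.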

The reverse sync is throwaway. It exists for the recovery window and gets retired the moment failback completes. Document the steps in your runbook and rehearse them once a year, the same way you rehearse restore drills and other operational procedures.

Never delete primary data before failback completes

If the primary recovered with its old data intact, do not wipe and rebuild it before reverse sync runs. Even if the data looks stale, it might contain snapshots the secondary missed (a sync job that failed an hour before the outage, for example). Treat the primary's old data as authoritative until the secondary has been compared against it.

Wrapping Up

PBS HA is manual by design, and that is the feature, not a limitation. A scoped sync job and a runbooked failover procedure cover the failure modes that actually occur: dead hardware, corrupted pools, bad upgrades, site-level events. The cost of operating it is low because there is no cluster state to debug, only two daemons and a network in between.

If you are looking for an offsite secondary without buying a second server, remote-backups.com ships exactly this: a managed Proxmox Backup Server endpoint, isolated namespaces per client, and the connection details ready to drop into your sync job today.

Need a ready-made secondary PBS node?

remote-backups.com gives you an offsite Proxmox Backup Server endpoint in EU datacenters. Drop the connection details into a sync job and you have a working dual-node setup in under an hour.

Frequently Asked Questions

Does PBS support active-active clustering?

No. The PBS daemon owns its datastore exclusively. There is no shared-storage active-active mode and no Corosync integration. HA in PBS is built from sync jobs and a documented manual failover.

How long does a failover take?

For a typical environment with under 10 PVE hosts, the storage.cfg change plus a verification backup completes in 10 to 15 minutes. The constraint is operator response time, not PBS itself.

Should the sync run in pull mode or push mode?

Pull mode in nearly every case. The primary's inbound surface stays unchanged, and a compromised primary cannot push poisoned chunks to the secondary because it has no credentials for it.

Do I need a VPN if backups already use client-side encryption?

VPN protects bytes in flight. Client-side encryption protects bytes at rest on every PBS node. Use both. The encryption key only needs to live on PVE clients, not on either PBS server.

What happens on the secondary when the primary prunes snapshots?

With --remove-vanished true, the secondary mirrors the primary's prune policy. If you want the secondary to retain snapshots the primary has aged out, set --remove-vanished false and run a separate, longer-retention prune schedule on the secondary.

Can the secondary mirror only part of the primary?

Yes. Sync jobs are namespace-aware. Configure the sync to pull only from a specific namespace on the source, so the secondary holds only what matters and stays cheaper to operate.

Bennet Gallein

remote-backups.com operator

Infrastructure enthusiast and founder of remote-backups.com. I build and operate reliable backup infrastructure powered by Proxmox Backup Server, so you can focus on what matters most: your data staying safe.