PBS High Availability: Dual-Node Sync & Failover

April 21, 2026
11 min read

PBS protects your VMs. A single Proxmox Backup Server, though, is a single point of failure for your entire backup tier. This post walks through a resilient two-node design: a primary plus a secondary, sync jobs keeping them in step, and a defined failover path for when the primary goes down. No clustering required.

Key Takeaways

PBS does not support active-active clustering; HA here means sync plus a documented manual failover path
Pull-mode sync from secondary to primary keeps the primary's inbound attack surface unchanged
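Failover is a storage.cfg edit on the PVE side (once per PVE cluster, because /etc/pve is replicated across its nodes, or once per standalone host), not a PBS cluster operation; new backups land on the secondary within minutes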
Namespaces let you mirror selectively, so a small secondary can hold only what matters most
Failing back is the dangerous step: reverse sync first, then switch clients, never the other way around
remote-backups.com works as a ready-made secondary node with no hardware to procure or rack

Why PBS HA Matters (and What "HA" Means Here)

Proxmox Backup Server has no Corosync ring, no DRBD volume, no shared-storage active-active mode. One PBS daemon owns its datastore exclusively. If you come from Veeam or a storage product with native replication, this is the first thing to internalise: HA on PBS is built from sync jobs and operational runbooks, not from a cluster manager.

What that buys you in practice:

Primary takes all backups during normal operation
Secondary receives synced copies on a schedule
Failover is manual or scripted, measured in minutes
Failback is deliberate, with a reverse sync step before clients switch back

That is enough to cover the failure scenarios that actually happen: a dead PSU, a corrupted ZFS pool, an OS upgrade gone sideways, a site-wide power event. It does not give you sub-second transparent failover, and it does not pretend to.

How PBS HA compares to other approaches

Capability	PBS Dual-Node	Veeam Scale-Out Repo	Ceph-backed storage
Active-active
Sync mechanism	Pull/push sync jobs	Native replication	Ceph replication
Failover model	Manual or scripted	Automated	Transparent
Setup complexity	Low	High	Very high
License cost	€0	Per-socket licensing	€0 (hardware-heavy)

The PBS model trades automation for simplicity. Two boxes, sync jobs, a documented failover. There is nothing to debug at 3am except the network and the daemon you already know.

Architecture Design

Two patterns cover most deployments. Pick based on what you are protecting against.

Pattern A: Local Primary + Offsite Secondary

Primary lives at the main site. Secondary lives in a colo, a VPS, or a managed PBS endpoint at remote-backups.com. The sync runs over the internet. PBS transports it over TLS, client-side encryption keeps the snapshot contents unreadable on both servers, and ideally the link is also tunneled through WireGuard or a site-to-site VPN.

This is the 3-2-1 strategy made operational. The offsite copy survives fire, flood, theft, and ransomware that walks across your LAN.

Pattern B: Two On-Premises Nodes

Primary in the server room, secondary in a different rack or building. Sync runs over the LAN, so it is fast and cheap. This protects against single-server hardware failure but does nothing for site-level events.

Pattern A vs Pattern B

Criteria	Local + Offsite	Two On-Prem Nodes
Protects against hardware failure	Yes	Yes
Protects against site failure	Yes	No
Network bandwidth need	Moderate (dedup helps a lot)	Low (LAN sync)
Failover complexity	Reconnect clients to new endpoint	Update PBS hostname only
Ongoing cost	Storage subscription or colo fee	Second server, power, rack space

The honest answer for most operators: do both. Pattern B for fast restores from local hardware failure, Pattern A for everything else. PBS supports chained sync, so the secondary can itself be a sync source for a third tier.

Configuring PBS Sync Jobs (Pull Mode)

Pull mode means the secondary reaches into the primary and copies chunks. The primary never opens a connection to the secondary, has no awareness it exists, and holds no credentials or API access pointing outward.

This matters for two reasons. First, the primary's network attack surface stays unchanged. Second, if the primary is ever compromised, the attacker cannot push poisoned chunks to the secondary because the primary holds no credentials for it.

Do not use an admin token for sync

The sync user needs read access on a single datastore, nothing more. A scoped DatastoreReader token cannot delete, prune, or modify anything on the primary. If the secondary is ever compromised, the blast radius stops at "attacker can read snapshots they could already pull."

Step by step.

1. Create the sync user and a scoped token on the primary.

bash

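# Create the sync user
proxmox-backup-manager user create sync-user@pbs --password '<long-random>'

# Create an API token under that user
proxmox-backup-manager user generate-token sync-user@pbs sync-token

# Grant read-only access on the source datastore to the user...
proxmox-backup-manager acl update /datastore/main DatastoreReader \
    --auth-id 'sync-user@pbs'

# ...and to the token. A token's effective permissions are the
# intersection of its own ACLs and its user's, so both entries are needed.
proxmox-backup-manager acl update /datastore/main DatastoreReader \
    --auth-id 'sync-user@pbs!sync-token'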

On the primary PBS

The generate-token command prints the token secret once. Copy it immediately; PBS does not store it in retrievable form.
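Because the secret is shown only once, it helps to capture it somewhere safe right away. A minimal sketch, assuming a POSIX shell; `save_secret` and the file path are hypothetical names for illustration, not part of PBS:

```shell
# Hypothetical helper (not a PBS command): read a secret from stdin and
# store it in a file that only the owning user can read.
save_secret() {
    secret_file=$1
    # umask 077 in a subshell so a newly created file gets mode 0600
    ( umask 077; cat > "$secret_file" ) || return 1
    # tighten permissions in case the file already existed
    chmod 600 "$secret_file"
}

# Usage (the path is an assumption):
#   save_secret /root/pbs-sync-token.secret   # paste the secret, then Ctrl-D
```

Reading from stdin keeps the secret out of the shell history and the process list. A password manager or secrets vault does the same job if you already run one.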

2. On the secondary, add the primary as a Remote.

In the web UI: Configuration → Remotes → Add. Fill in:

ID: a friendly name, for example primary-hub
Host: the primary's hostname or IP, reachable from the secondary
Auth ID: sync-user@pbs!sync-token
Password: the token secret from step 1
Fingerprint: the SHA-256 of the primary's TLS cert (the UI will offer to fetch it)

Or, by CLI:

bash

proxmox-backup-manager remote create primary-hub \
    --host primary-pbs.example.com \
    --auth-id 'sync-user@pbs!sync-token' \
    --password '<token-secret>' \
    --fingerprint '<sha256-fingerprint>'

On the secondary PBS
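If you would rather check the fingerprint yourself than accept what the UI fetches, one option is to read it off the primary's certificate with openssl. This is a sketch, assuming openssl is installed and the primary listens on the default PBS port 8007; `pbs_fingerprint` and `normalize_fp` are hypothetical helper names:

```shell
# Convert openssl's "sha256 Fingerprint=AA:BB:..." line into the bare
# lowercase colon-separated form.
normalize_fp() {
    sed 's/^.*=//' | tr 'A-F' 'a-f'
}

# Fetch the TLS certificate from a PBS host and print its SHA-256 fingerprint.
pbs_fingerprint() {
    host=$1
    openssl s_client -connect "$host:8007" </dev/null 2>/dev/null \
        | openssl x509 -noout -fingerprint -sha256 \
        | normalize_fp
}

# Usage, from the secondary:
#   pbs_fingerprint primary-pbs.example.com
```

Compare the result against the fingerprint shown on the primary's own dashboard before trusting it. A fingerprint fetched over the same network path you are trying to authenticate proves nothing on its own.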

3. Create the sync job.

bash

proxmox-backup-manager sync-job create primary-mirror \
    --remote primary-hub \
    --remote-store main \
    --store mirror \
    --schedule 'hourly' \
    --remove-vanished true

Sync job: pull from primary's main store into secondary's mirror store

--remove-vanished true keeps the secondary's content aligned with the primary, including pruning. If you want the secondary to retain snapshots the primary has already deleted (an extra retention tier), set this to false and run a separate prune policy on the secondary.
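The extra-retention variant could be set up roughly as follows. This is a sketch under assumptions: it expects a PBS release that supports prune jobs, and the job ID `mirror-retention` and the keep values are placeholders. The `run` wrapper only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on the secondary PBS to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

extend_secondary_retention() {
    # Stop mirroring deletions from the primary
    run proxmox-backup-manager sync-job update primary-mirror \
        --remove-vanished false
    # Prune locally instead; job ID and keep values are placeholders
    run proxmox-backup-manager prune-job create mirror-retention \
        --store mirror --schedule daily \
        --keep-daily 30 --keep-weekly 12 --keep-monthly 12
}

extend_secondary_retention
```

Size the secondary for the longer retention before switching. With deletions no longer mirrored, its datastore grows until the local prune job and garbage collection catch up.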

4. Run the sync once manually before trusting the schedule.

bash

proxmox-backup-manager sync-job run primary-mirror
# Finished tasks are only listed with --all
proxmox-backup-manager task list --all --limit 20 | grep -i sync

Trigger and watch

The first run is the slow one. Every chunk transfers. After that, PBS deduplication means only new and changed chunks move, and incremental syncs typically finish in single-digit minutes for healthy environments.
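To get a feel for how slow, divide the primary datastore's on-disk size by the usable link speed. The helper below is a back-of-the-envelope sketch, not a PBS tool; `seed_minutes` is a hypothetical name, and the result ignores protocol overhead and competing traffic:

```shell
# Rough estimate for the first full sync: bits to transfer / link speed.
seed_minutes() {
    size_gib=$1      # on-disk size of the source datastore, in GiB
    link_mbit=$2     # usable link speed, in Mbit/s
    bits=$(( size_gib * 1024 * 1024 * 1024 * 8 ))
    echo $(( bits / (link_mbit * 1000000) / 60 ))
}

seed_minutes 2048 100    # 2 TiB over a 100 Mbit/s uplink, prints 2932
```

2932 minutes is roughly two days, which is why large initial seeds over a slow uplink are worth planning around rather than discovering on the day.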

Sync Scheduling and Bandwidth

Daily sync is enough for most environments. Critical workloads (PCI estates, healthcare, finance) usually want a 4-hour or hourly schedule. The cost of a more aggressive schedule is mostly in network IO; the storage cost on the secondary is identical because dedup is content-addressed.

If the link between primary and secondary is shared with production traffic, rate-limit the sync job. From the UI: Sync Jobs → Edit → Rate Limit (MB/s). From the CLI:

bash

proxmox-backup-manager sync-job update primary-mirror \
    --rate-in 100MiB

Cap a sync at 100 MiB/s (a bare number without a unit is read as bytes per second)

Watch the sync logs the same way you watch backup logs. From the UI: Datastore → {name} → Sync Jobs → Show Log. From the CLI: proxmox-backup-manager task list --all, filtered for sync tasks (without --all, only running tasks are listed). If a sync starts failing silently, the secondary drifts, and you find out during failover, which is exactly the wrong moment. Hook these tasks into the same monitoring you use for backup jobs.
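A monitoring hook can be as small as a staleness check on the last successful sync. The sketch below is illustrative: `sync_is_fresh` and `check_sync_age` are hypothetical names, and how you obtain the timestamp is an assumption noted in the comments.

```shell
# Succeeds if the last successful sync ended no more than max_age seconds ago.
sync_is_fresh() {
    last_end=$1   # epoch seconds when the last OK sync finished
    now=$2        # current epoch seconds, e.g. $(date +%s)
    max_age=$3    # tolerated staleness in seconds
    [ $(( now - last_end )) -le "$max_age" ]
}

# Print a one-line status and return non-zero when the sync is stale.
check_sync_age() {
    age_min=$(( ($2 - $1) / 60 ))
    if sync_is_fresh "$1" "$2" "$3"; then
        echo "OK: last sync ${age_min} min ago"
    else
        echo "STALE: last sync ${age_min} min ago"
        return 1
    fi
}

# Assumption: last_end could come from the JSON task list, for example with jq
# (field names worker_type, status, endtime are assumed, verify on your version):
#   proxmox-backup-manager task list --all --limit 200 --output-format json \
#     | jq '[.[] | select(.worker_type == "syncjob" and .status == "OK") | .endtime] | max'

check_sync_age 1700000000 1700003600 7200   # prints "OK: last sync 60 min ago"
```

The non-zero exit status on a stale sync is what lets cron wrappers or a monitoring agent raise the alert.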

Verify after sync, not just transfer

A sync job copies chunks. It does not validate them on the secondary unless you run verification there. Schedule a weekly verify job on the secondary datastore so failover-day surprises stay theoretical.
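A weekly verify job on the secondary might look like the following sketch. The job ID `mirror-verify`, the Saturday 03:00 slot, and the 30-day re-verify window are placeholders; the `run` wrapper prints the command unless DRY_RUN=0 is set.

```shell
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on the secondary PBS to actually run them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

create_weekly_verify() {
    # Skip snapshots verified within the last 30 days, re-check older ones
    run proxmox-backup-manager verify-job create mirror-verify \
        --store mirror \
        --schedule 'sat 03:00' \
        --ignore-verified true \
        --outdated-after 30
}

create_weekly_verify
```

Verification reads every chunk it checks, so pick a slot that does not collide with the sync window on the same disks.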

Failover Procedure

When the primary goes down, the goal is to keep new backups landing somewhere safe while the primary recovers. Restores from the secondary's existing snapshots already work, and that is the larger of the two concerns.

Step 1. Confirm the secondary is healthy and recently synced.

bash

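# Recent sync runs (finished tasks need --all)
proxmox-backup-manager task list --all --limit 50 | grep -i sync

# Datastore configuration (confirms the mirror store exists where you expect)
proxmox-backup-manager datastore show mirror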

On the secondary

Step 2. Update PBS storage config on each PVE host.

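The PBS storage entry in /etc/pve/storage.cfg points at the primary's address. Change it to point at the secondary. /etc/pve is replicated across a PVE cluster, so one edit covers every node in that cluster; standalone hosts each need their own edit.

conf

pbs: backup-primary
    datastore main
    server primary-pbs.example.com
    username backup-user@pbs
    fingerprint aa:bb:cc:...

/etc/pve/storage.cfg before

conf

pbs: backup-primary
    datastore mirror
    server secondary-pbs.example.com
    username backup-user@pbs
    fingerprint dd:ee:ff:...

/etc/pve/storage.cfg after

The fingerprint and datastore name change too. backup-user@pbs is a placeholder for whichever account your PVE hosts back up with; it has to exist on the secondary with DatastoreBackup rights on mirror, because the read-only sync token cannot write backups. If the credentials differ between the two servers, also update the secret PVE stores in /etc/pve/priv/storage/backup-primary.pw. The UI edit dialog (Datacenter → Storage → Edit) typically shows Server and Datastore as read-only, so make this change in the file.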
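For more than a handful of standalone hosts, the edit can be scripted. The function below is a sketch, not an official tool: `repoint_pbs` is a hypothetical name, and it writes the rewritten config to stdout so you can review it before replacing anything.

```shell
# Rewrite server, datastore and fingerprint of one "pbs:" stanza.
# usage: repoint_pbs FILE STORAGE_ID NEW_SERVER NEW_DATASTORE NEW_FINGERPRINT
repoint_pbs() {
    cfg=$1
    awk -v id="$2" -v server="$3" -v store="$4" -v fp="$5" '
        # A line starting in column 0 opens a new stanza: "<type>: <id>"
        /^[^ \t#]/ { in_target = ($1 == "pbs:" && $2 == id) }
        in_target && $1 == "server"      { print "\tserver " server; next }
        in_target && $1 == "datastore"   { print "\tdatastore " store; next }
        in_target && $1 == "fingerprint" { print "\tfingerprint " fp; next }
        { print }
    ' "$cfg"
}

# Usage (review the diff before copying the result into place):
#   repoint_pbs /etc/pve/storage.cfg backup-primary \
#       secondary-pbs.example.com mirror 'dd:ee:ff:...' > /root/storage.cfg.new
#   diff -u /etc/pve/storage.cfg /root/storage.cfg.new
```

Other stanzas pass through untouched, and keeping the old file around gives you the exact inverse for failback.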

Step 3. Trigger a test backup.

Pick a small VM or container. Run a one-shot backup. Confirm it lands on the secondary and verifies. This is the moment you find out whether your firewall, DNS, or VPN config remembered to allow the secondary's address.
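Before starting the test backup, it can save a few minutes to confirm that PVE reports the repointed storage as active. A small sketch; `storage_is_active` is a hypothetical helper that reads `pvesm status` output on stdin, and VMID 100 in the usage comment is a placeholder:

```shell
# Succeeds if the given storage ID is listed with status "active"
# in `pvesm status` output read from stdin.
storage_is_active() {
    awk -v id="$1" '$1 == id && $3 == "active" { found = 1 } END { exit !found }'
}

# Usage on a PVE host:
#   pvesm status --storage backup-primary | storage_is_active backup-primary \
#       && vzdump 100 --storage backup-primary --mode snapshot
```

An inactive storage at this point usually means a wrong fingerprint, wrong credentials, or a blocked port 8007, all of which are faster to fix before a backup job times out.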

Failover checklist

Step	Action	Owner	Time estimate
1. Verify secondary	Check sync job status and datastore health	On-call engineer	2 min
2. Update storage.cfg	Change server address on every PVE host	On-call engineer	5 min for ≤10 hosts
3. Test backup	One-shot backup of a small VM	On-call engineer	5 min
4. Notify clients/team	Email or Slack: failover complete, primary in recovery	Comms or on-call	5 min
5. Open primary recovery ticket	Track the work; failback depends on it	Ops lead	Async

Using a managed secondary

If your secondary is a managed PBS endpoint at remote-backups.com, the server address and fingerprint are in your dashboard under Connection Details. Failover follows the same procedure; you just do not have to keep the secondary's hardware alive yourself.

Failing Back

This is the step that causes the most data loss in the wild, and it is almost always avoidable.

The instinct after the primary recovers is to switch clients back immediately. Don't. While the primary was down, every backup landed on the secondary. Those snapshots do not exist on the primary. If you point clients back at the primary now, you have an asymmetric dataset and no clear story for which copy is authoritative.

The correct path:

Once the primary is back online, configure a reverse sync job on the primary that pulls from the secondary into the primary's datastore. Same pull-mode pattern, just inverted: now the primary holds a token scoped to the secondary's mirror store.
Let the reverse sync catch up. Watch task logs until the primary holds every snapshot the secondary has.
Verify the primary datastore. A proxmox-backup-manager verify-job run will confirm the chunks are intact after the bulk transfer.
Switch PVE hosts back to the primary in storage.cfg.
Disable the reverse sync job. Re-enable the original primary-to-secondary sync.
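"Holds every snapshot" is easier to trust when it is checked rather than eyeballed. The sketch below compares two plain-text snapshot lists, one snapshot per line. Producing those lists is left open here (for example from `proxmox-backup-client snapshot list` output on each side), and `missing_snapshots` is a hypothetical helper name.

```shell
# Print snapshots that appear in SOURCE_LIST but not in TARGET_LIST.
# Both files hold one snapshot identifier per line, in any order.
missing_snapshots() {
    src_sorted=$(mktemp)
    dst_sorted=$(mktemp)
    sort -u "$1" > "$src_sorted"
    sort -u "$2" > "$dst_sorted"
    # comm -23: lines unique to the first file
    comm -23 "$src_sorted" "$dst_sorted"
    rm -f "$src_sorted" "$dst_sorted"
}

# Usage during failback (file names are placeholders):
#   missing_snapshots secondary-snapshots.txt primary-snapshots.txt
# Empty output means the primary has caught up.
```

Running it in the other direction shows snapshots the primary holds that the secondary never received, which is the comparison to make before discarding any old primary data.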

The reverse sync is throwaway. It exists for the recovery window and gets retired the moment failback completes. Document the steps in your runbook and rehearse them once a year, the same way you rehearse restore drills and other operational procedures.

Never delete primary data before failback completes

If the primary recovered with its old data intact, do not wipe and rebuild it before reverse sync runs. Even if the data looks stale, it might contain snapshots the secondary missed (a sync job that failed an hour before the outage, for example). Treat the primary's old data as authoritative until the secondary has been compared against it.

Wrapping Up

PBS HA is manual by design, and that is the feature, not a limitation. A scoped sync job and a runbooked failover procedure cover the failure modes that actually occur: dead hardware, corrupted pools, bad upgrades, site-level events. The cost of operating it is low because there is no cluster state to debug, only two daemons and a network in between.

If you are looking for an offsite secondary without buying a second server, remote-backups.com ships exactly this: a managed Proxmox Backup Server endpoint, isolated namespaces per client, and the connection details ready to drop into your sync job today.

Need a ready-made secondary PBS node?

remote-backups.com gives you an offsite Proxmox Backup Server endpoint in EU datacenters. Drop the connection details into a sync job and you have a working dual-node setup in under an hour.


Can PBS run active-active across two nodes?

No. The PBS daemon owns its datastore exclusively. There is no shared-storage active-active mode and no Corosync integration. HA in PBS is built from sync jobs and a documented manual failover.

How long does failover actually take?

For a typical environment with under 10 PVE hosts, the storage.cfg change plus a verification backup completes in 10 to 15 minutes. The constraint is operator response time, not PBS itself.

Should sync run in pull mode or push mode?

Pull mode in nearly every case. The primary's inbound surface stays unchanged, and a compromised primary cannot push poisoned chunks to the secondary because it has no credentials for it.

Do I need PBS client-side encryption if the sync runs over a VPN?

VPN protects bytes in flight. Client-side encryption protects bytes at rest on every PBS node. Use both. The encryption key only needs to live on PVE clients, not on either PBS server.

What happens to retention if the secondary stays in --remove-vanished mode?

The secondary mirrors the primary's prune policy. If you want the secondary to retain snapshots the primary has aged out, set --remove-vanished false and run a separate, longer-retention prune schedule on the secondary.

Can I use namespaces to mirror only critical workloads?

Yes. Sync jobs are namespace-aware. Configure the sync to pull only from a specific namespace on the source, so the secondary holds only what matters and stays cheaper to operate.

Bennet Gallein
