PBS protects your VMs. A single Proxmox Backup Server, though, is a single point of failure for your entire backup tier. This post walks through a resilient two-node design with primary plus secondary, sync jobs keeping them in step, and a defined failover path for when the primary goes down. No clustering required.
Key Takeaways
- PBS does not support active-active clustering; HA here means sync plus a documented manual failover path
- Pull-mode sync, where the secondary pulls from the primary, keeps the primary's inbound attack surface unchanged
- Failover is a storage.cfg change on each PVE host, not a cluster operation; new backups land on the secondary within minutes
- Namespaces let you mirror selectively, so a small secondary can hold only what matters most
- Failing back is the dangerous step: reverse sync first, then switch clients, never the other way around
- remote-backups.com works as a ready-made secondary node with no hardware to procure or rack
Why PBS HA Matters (and What "HA" Means Here)
Proxmox Backup Server has no Corosync ring, no DRBD volume, no shared-storage active-active mode. One PBS daemon owns its datastore exclusively. If you come from Veeam or a storage product with native replication, this is the first thing to internalise: HA on PBS is built from sync jobs and operational runbooks, not from a cluster manager.
What that buys you in practice:
- Primary takes all backups during normal operation
- Secondary receives synced copies on a schedule
- Failover is manual or scripted, measured in minutes
- Failback is deliberate, with a reverse sync step before clients switch back
That is enough to cover the failure scenarios that actually happen: a dead PSU, a corrupted ZFS pool, an OS upgrade gone sideways, a site-wide power event. It does not give you sub-second transparent failover, and it does not pretend to.
How PBS HA compares to other approaches
| Capability | PBS Dual-Node | Veeam Scale-Out Repo | Ceph-backed storage |
|---|---|---|---|
| Active-active | No | | |
| Sync mechanism | Pull/push sync jobs | Native replication | Ceph replication |
| Failover model | Manual or scripted | Automated | Transparent |
| Setup complexity | Low | High | Very high |
| License cost | €0 | Per-socket licensing | €0 (hardware-heavy) |
The PBS model trades automation for simplicity. Two boxes, sync jobs, a documented failover. There is nothing to debug at 3am except the network and the daemon you already know.
Architecture Design
Two patterns cover most deployments. Pick based on what you are protecting against.
Pattern A: Local Primary + Offsite Secondary
Primary lives at the main site. Secondary lives in a colo, a VPS, or a managed PBS endpoint at remote-backups.com. The sync runs over the internet, protected by TLS in transit and by PBS client-side encryption where you use it, and ideally tunneled through WireGuard or a site-to-site VPN.
This is the 3-2-1 strategy made operational. The offsite copy survives fire, flood, theft, and ransomware that walks across your LAN.
Pattern B: Two On-Premises Nodes
Primary in the server room, secondary in a different rack or building. Sync runs over the LAN, so it is fast and cheap. This protects against single-server hardware failure but does nothing for site-level events.
Pattern A vs Pattern B
| Criteria | Local + Offsite | Two On-Prem Nodes |
|---|---|---|
| Protects against hardware failure | Yes | Yes |
| Protects against site failure | Yes | No |
| Network bandwidth need | Moderate (dedup helps a lot) | Low (LAN sync) |
| Failover complexity | Reconnect clients to new endpoint | Update PBS hostname only |
| Ongoing cost | Storage subscription or colo fee | Second server, power, rack space |
The honest answer for most operators: do both. Pattern B for fast restores from local hardware failure, Pattern A for everything else. PBS supports chained sync, so the secondary can itself be a sync source for a third tier.
Configuring PBS Sync Jobs (Pull Mode)
Pull mode means the secondary reaches into the primary and copies chunks. The primary needs no inbound connection to the secondary, no awareness it exists, and no API access pointing outward.
This matters for two reasons. First, the primary's network attack surface stays unchanged. Second, if the primary is ever compromised, the attacker cannot push poisoned chunks to the secondary because the primary holds no credentials for it.
Do not use an admin token for sync
The sync user needs read access on a single datastore, nothing more. A scoped DatastoreReader token cannot delete, prune, or modify anything on the primary. If the secondary is ever compromised, the blast radius stops at "attacker can read snapshots they could already pull."
Step by step.
1. Create the sync user and a scoped token on the primary.
# Create the sync user
proxmox-backup-manager user create sync-user@pbs --password '<long-random>'
# Create an API token under that user
proxmox-backup-manager user generate-token sync-user@pbs sync-token
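# A token's effective permissions are the intersection of its own ACLs and its user's,
# so grant the read-only role to the user first
proxmox-backup-manager acl update /datastore/main \
--auth-id 'sync-user@pbs' \
--role DatastoreReader
# Scope the same read-only ACL to the token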
proxmox-backup-manager acl update /datastore/main \
--auth-id 'sync-user@pbs!sync-token' \
--role DatastoreReader

The generate-token command prints the token secret once. Copy it immediately; PBS does not store it in retrievable form.
2. On the secondary, add the primary as a Remote.
In the web UI: Configuration → Remotes → Add. Fill in:
- ID: a friendly name, for example primary-hub
- Host: the primary's hostname or IP, reachable from the secondary
- Auth ID: sync-user@pbs!sync-token
- Password: the token secret from step 1
- Fingerprint: the SHA-256 of the primary's TLS cert (the UI will offer to fetch it)
Or, by CLI:
proxmox-backup-manager remote create primary-hub \
--host primary-pbs.example.com \
--auth-id 'sync-user@pbs!sync-token' \
--password '<token-secret>' \
--fingerprint '<sha256-fingerprint>'

3. Create the sync job.
proxmox-backup-manager sync-job create primary-mirror \
--remote primary-hub \
--remote-store main \
--store mirror \
--schedule 'hourly' \
--remove-vanished true

--remove-vanished true keeps the secondary's content aligned with the primary, including pruning. If you want the secondary to retain snapshots the primary has already deleted (an extra retention tier), set this to false and run a separate prune policy on the secondary.
4. Run the sync once manually before trusting the schedule.
proxmox-backup-manager sync-job run primary-mirror
proxmox-backup-manager task list --type sync

The first run is the slow one. Every chunk transfers. After that, PBS deduplication means only new and changed chunks move, and incremental syncs typically finish in single-digit minutes for healthy environments.
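For planning the first run, a back-of-the-envelope estimate is usually enough. The sketch below is a rough lower bound only: it assumes the whole datastore crosses the link at a steady rate and ignores protocol overhead and chunks already present on the secondary. The helper name estimate_sync_seconds and the example figures are illustrative, not part of PBS.

```shell
#!/bin/sh
# estimate_sync_seconds SIZE_GIB RATE_MIB_S
# Hypothetical helper (not a PBS command): rough duration of a full
# initial sync, assuming a steady transfer rate and no overhead.
estimate_sync_seconds() {
    size_gib=$1      # data held on the primary datastore, in GiB
    rate_mib_s=$2    # sustained throughput between the nodes, in MiB/s
    # Convert GiB to MiB, then divide, rounding up to a whole second.
    echo $(( (size_gib * 1024 + rate_mib_s - 1) / rate_mib_s ))
}

# Example: 2 TiB over a link that sustains 100 MiB/s
# estimate_sync_seconds 2048 100    # prints 20972 (about 5.8 hours)
```

If the result does not fit inside a maintenance window, that is a hint to seed the secondary over the LAN before moving it offsite, or to start the first sync well ahead of the date you want the mirror to be trustworthy.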
Sync Scheduling and Bandwidth
Daily sync is enough for most environments. Critical workloads (PCI estates, healthcare, finance) usually want a 4-hour or hourly schedule. The cost of a more aggressive schedule is mostly in network IO; the storage cost on the secondary is identical because dedup is content-addressed.
If the link between primary and secondary is shared with production traffic, rate-limit the sync job. From the UI: Sync Jobs → Edit → Rate Limit (MB/s). From the CLI:
proxmox-backup-manager sync-job update primary-mirror \
--rate-in 100MiB

Watch the sync logs the same way you watch backup logs. From the UI: Datastore → {name} → Sync Jobs → Show Log. From the CLI: proxmox-backup-manager task list --type sync. If a sync starts failing silently, the secondary drifts, and you find out during failover, which is exactly the wrong moment. Hook these tasks into the same monitoring you use for backup jobs.
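One way to catch silent drift is a freshness check that your monitoring runs on a timer. The sketch below only covers the comparison logic; how you obtain the end time of the last successful sync (task list output, log scrape, a timestamp file written by a wrapper) is left open. The function name sync_is_fresh and the thresholds are assumptions for illustration.

```shell
#!/bin/sh
# sync_is_fresh LAST_SYNC_END MAX_AGE [NOW]
# LAST_SYNC_END and NOW are Unix epoch seconds, MAX_AGE is in seconds.
# Returns 0 when the last successful sync ended at most MAX_AGE seconds ago.
# A timestamp in the future is treated as a failure (likely clock skew).
# Hypothetical helper, not a PBS command.
sync_is_fresh() {
    last_end=$1
    max_age=$2
    now=${3:-$(date +%s)}
    age=$(( now - last_end ))
    [ "$age" -ge 0 ] && [ "$age" -le "$max_age" ]
}

# Example for an hourly job, tolerating one missed run:
# sync_is_fresh "$last_end" 7200 || echo "CRITICAL: secondary is drifting"
```

Setting the threshold to roughly twice the sync interval avoids alerting on a single slow run while still flagging a job that has stopped.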
Verify after sync, not just transfer
A sync job copies chunks. It does not validate them on the secondary unless you run verification there. Schedule a weekly verify job on the secondary datastore so failover-day surprises stay theoretical.
Failover Procedure
When the primary goes down, the goal is to keep new backups landing somewhere safe while the primary recovers. Restores from the secondary's existing snapshots already work, and that is the larger of the two concerns.
Step 1. Confirm the secondary is healthy and recently synced.
# Last sync run
proxmox-backup-manager task list --type sync --limit 5
# Datastore health
proxmox-backup-manager datastore show mirror

Step 2. Update PBS storage config on each PVE host.
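The PBS storage entry in /etc/pve/storage.cfg points at the primary's address. Change it to point at the secondary. In a PVE cluster, /etc/pve is shared across nodes, so one edit covers every member; standalone hosts each need the change.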
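Before:

pbs: backup-primary
    datastore main
    server primary-pbs.example.com
    username sync-user@pbs
    fingerprint aa:bb:cc:...

After:

pbs: backup-primary
    datastore mirror
    server secondary-pbs.example.com
    username sync-user@pbs
    fingerprint dd:ee:ff:...

The fingerprint and datastore name change too. The account in username must also exist on the secondary with backup rights on mirror; a read-only DatastoreReader identity can restore but cannot write new backups. Or do it in the UI: Datacenter → Storage → Edit → Server.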
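If you would rather script the edit than make it by hand under pressure, the change can be sketched as a small filter that rewrites only the target stanza. This is an illustration under assumptions: the function name retarget_pbs_storage is hypothetical, the stanza layout is taken from the example above, and the output should be reviewed before it replaces the live file.

```shell
#!/bin/sh
# retarget_pbs_storage FILE STORAGE_ID SERVER DATASTORE FINGERPRINT
# Prints FILE with the server, datastore, and fingerprint lines of the
# "pbs: STORAGE_ID" stanza replaced; every other line passes through as-is.
# Hypothetical helper; it never modifies FILE itself.
retarget_pbs_storage() {
    awk -v id="$2" -v server="$3" -v store="$4" -v fp="$5" '
        # A non-indented, non-comment line ("type: id") starts a new stanza.
        /^[^ \t#]/ { in_target = ($1 == "pbs:" && $2 == id) }
        in_target && $1 == "server"      { print "\tserver " server; next }
        in_target && $1 == "datastore"   { print "\tdatastore " store; next }
        in_target && $1 == "fingerprint" { print "\tfingerprint " fp; next }
        { print }
    ' "$1"
}

# Example: write the result to a scratch file and diff before installing.
# retarget_pbs_storage /etc/pve/storage.cfg backup-primary \
#     secondary-pbs.example.com mirror 'dd:ee:ff:...' > /tmp/storage.cfg.new
# diff -u /etc/pve/storage.cfg /tmp/storage.cfg.new
```

Printing to standard output rather than editing in place keeps the step reviewable, and the same call with the primary's values covers the switch back during failback.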
Step 3. Trigger a test backup.
Pick a small VM or container. Run a one-shot backup. Confirm it lands on the secondary and verifies. This is the moment you find out whether your firewall, DNS, or VPN config remembered to allow the secondary's address.
Failover checklist
| Step | Action | Owner | Time estimate |
|---|---|---|---|
| 1. Verify secondary | Check sync job status and datastore health | On-call engineer | 2 min |
| 2. Update storage.cfg | Change server address on every PVE host | On-call engineer | 5 min for ≤10 hosts |
| 3. Test backup | One-shot backup of a small VM | On-call engineer | 5 min |
| 4. Notify clients/team | Email or Slack: failover complete, primary in recovery | Comms or on-call | 5 min |
| 5. Open primary recovery ticket | Track the work; failback depends on it | Ops lead | Async |
Using a managed secondary
If your secondary is a managed PBS endpoint at remote-backups.com, the server address and fingerprint are in your dashboard under Connection Details. Failover is the same procedure, you just do not also have to keep the secondary's hardware alive.
Failing Back
This is the step that causes the most data loss in the wild, and it is almost always avoidable.
The instinct after the primary recovers is to switch clients back immediately. Don't. While the primary was down, every backup landed on the secondary. Those snapshots do not exist on the primary. If you point clients back at the primary now, you have an asymmetric dataset and no clear story for which copy is authoritative.
The correct path:
- Once the primary is back online, configure a reverse sync job on the primary that pulls from the secondary into the primary's datastore. Same pull-mode pattern, just inverted: now the primary holds a token scoped to the secondary's mirror store.
- Let the reverse sync catch up. Watch task logs until the primary holds every snapshot the secondary has.
- Verify the primary datastore. A proxmox-backup-manager verify-job run will confirm the chunks are intact after the bulk transfer.
- Switch PVE hosts back to the primary in storage.cfg.
- Disable the reverse sync job. Re-enable the original primary-to-secondary sync.
The reverse sync is throwaway. It exists for the recovery window and gets retired the moment failback completes. Document the steps in your runbook and rehearse them once a year, the same way you rehearse restore drills and other operational procedures.
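"Every snapshot the secondary has" is easier to confirm with a diff than by eye. The sketch below assumes you have exported one snapshot identifier per line from each side into two text files; the export method, the identifier format in the comment, and the function name missing_on_primary are assumptions for illustration.

```shell
#!/bin/sh
# missing_on_primary SECONDARY_LIST PRIMARY_LIST
# Each file holds one snapshot identifier per line, for example
# "vm/100/2024-05-01T02:00:00Z". Prints identifiers that appear in
# SECONDARY_LIST but not in PRIMARY_LIST, and returns 0 only when
# nothing is missing. Hypothetical helper, not a PBS command.
missing_on_primary() {
    # -F fixed strings, -x whole-line match, -v invert: keep the lines of
    # SECONDARY_LIST that match no line of PRIMARY_LIST.
    missing=$(grep -Fxv -f "$2" "$1" || true)
    if [ -n "$missing" ]; then
        printf '%s\n' "$missing"
        return 1
    fi
    return 0
}

# Example gate before switching clients back:
# missing_on_primary secondary.txt primary.txt || echo "reverse sync not finished"
```

Calling it with the arguments swapped also surfaces snapshots that exist only on the primary, which is worth knowing before anything is cleaned up.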
Never delete primary data before failback completes
If the primary recovered with its old data intact, do not wipe and rebuild it before reverse sync runs. Even if the data looks stale, it might contain snapshots the secondary missed (a sync job that failed an hour before the outage, for example). Treat the primary's old data as authoritative until the secondary has been compared against it.
Wrapping Up
PBS HA is manual by design, and that is the feature, not a limitation. A scoped sync job and a runbooked failover procedure cover the failure modes that actually occur: dead hardware, corrupted pools, bad upgrades, site-level events. The cost of operating it is low because there is no cluster state to debug, only two daemons and a network in between.
If you are looking for an offsite secondary without buying a second server, remote-backups.com ships exactly this: a managed Proxmox Backup Server endpoint, isolated namespaces per client, and the connection details ready to drop into your sync job today.
Need a ready-made secondary PBS node?
remote-backups.com gives you an offsite Proxmox Backup Server endpoint in EU datacenters. Drop the connection details into a sync job and you have a working dual-node setup in under an hour.


