Multi-Site PBS Architecture for MSPs

You started with one PBS instance and a handful of clients. Now you have thirty. The flat single-instance model that worked at five clients is groaning under the weight of cross-client noise, retention conflicts, and a permissions matrix nobody fully understands anymore. What follows is the architecture pattern MSPs settle on once a shared box stops scaling: hub-and-spoke, with deliberate namespace and sync design.

Key Takeaways
  • One Proxmox Backup Server per MSP plus optional per-site spokes scales further than one shared instance ever will.
  • Pull mode at the hub keeps client networks closed and credentials out of client hands.
  • Top-level namespace per client is the cleanest tenancy boundary inside a shared datastore.
  • Hub retention should always outlive spoke retention, not mirror it.
  • Fleet monitoring belongs in Prometheus and Grafana, not the PBS task log.
  • Air-gapped or compliance clients usually need a dedicated topology, not a shared one.

The Problem with One PBS for Everything

Run one PBS box for every client and the seams show fast. A failing verify job from one client buries notifications from twenty others. A backup window blowout on a noisy client starves the others competing for the same disk IO. Permission drift makes it impossible to prove tenant isolation during audits. And restore tests slow to a crawl because the GC and verify load never lets up.

Namespaces help with logical separation, but they are a tenancy boundary, not a locality boundary. Your client in Frankfurt still has to push backups over WAN to your London Proxmox Backup Server. That's fine until WAN flaps or the client's DR plan calls for sub-hour local restore.

The model that scales is hub-and-spoke. One central PBS as the system of record for offsite. Optional smaller PBS instances at client sites for local restore performance. Pull from the hub so client networks stay closed.

Hub-and-Spoke: The MSP Reference Architecture

The hub is your single point of offsite truth. It lives in your datacenter or on a managed offsite endpoint like remote-backups.com. Every client backup eventually lands here. Your retention rules, your verify schedule, your monitoring, all centralised.

Spokes are optional. A spoke is a smaller Proxmox Backup Server running at a client site, sized for that client's working set with maybe a week of retention. Its job is local restore speed. When a client loses a VM at 14:00, they do not want to wait for a 200 GB pull over a 100 Mbit/s uplink, roughly four and a half hours at line rate, before they can start a restore.
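To put a number on that wait, here is a rough back-of-envelope sketch. The function name transfer_hours is made up for illustration, and it assumes decimal units, a link running at full line rate, and no protocol overhead, so real restores will be slower.

```bash
# Estimate how many hours a pull of SIZE_GB takes over a LINK_MBIT link.
# Assumptions: 1 GB = 8000 Mbit, full line rate, no protocol overhead.
transfer_hours() {
    local size_gb=$1 link_mbit=$2
    awk -v gb="$size_gb" -v mbit="$link_mbit" \
        'BEGIN { printf "%.1f\n", gb * 8000 / mbit / 3600 }'
}

transfer_hours 200 100    # the 200 GB over 100 Mbit/s case from the text
```

Running the same estimate against a client's real working set and uplink is a quick way to decide whether a local spoke is worth the overhead.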

Sync direction matters. Pull mode at the hub is the right default for MSPs. The hub initiates the connection, holds the credentials, and the client site never needs an outbound API token to your infrastructure. If a client environment is compromised, the attacker cannot reach back into your backup pool with stolen credentials.

When to skip the spoke

A spoke PBS is not always worth it. Single-host PVE clients with under 1TB of working set and a stable WAN link can back up directly to the hub with no local Proxmox Backup Server in the middle. The spoke adds operational overhead. Only deploy one when local restore SLA, WAN reliability, or compliance scope demands it.
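The rule of thumb above can be written down as a small check. This is only a sketch: the function name needs_spoke and its yes/no arguments are hypothetical, and the 1 TB threshold is the one quoted in the paragraph, not a hard limit.

```bash
# Return 0 (deploy a spoke) or 1 (back up directly to the hub).
# Arguments: PVE host count, working set in GB, WAN stable (yes/no),
# strict local-restore SLA or compliance scope (yes/no).
needs_spoke() {
    local hosts=$1 working_set_gb=$2 wan_stable=$3 strict_scope=$4
    if [ "$hosts" -gt 1 ] || [ "$working_set_gb" -ge 1000 ]; then
        return 0
    fi
    if [ "$wan_stable" != "yes" ] || [ "$strict_scope" = "yes" ]; then
        return 0
    fi
    return 1
}

if needs_spoke 1 600 yes no; then
    echo "spoke"
else
    echo "direct to hub"
fi
```

A single host with 600 GB, a stable link, and no special scope lands on "direct to hub"; flip any one input and the spoke becomes the answer.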

Architecture Decision Matrix

| Scenario | Recommended Topology | PBS Nodes Needed | Notes |
|---|---|---|---|
| Single PVE host client | PVE backs up directly to the hub | 1 (hub) | Simplest pattern. No local spoke. |
| Multi-host client | Spoke at site, hub pulls from spoke | 2 (spoke + hub) | Spoke aggregates local PVE backups before offsite. |
| Compliance client (HIPAA, ISO 27001) | Dedicated spoke, dedicated hub namespace | 2+ (isolated) | Separate ACLs, separate verify, audit log retention. |
| Client with poor WAN | Spoke required, hub pulls with rate limiting | 2 (spoke + hub) | Long initial seed, daily incrementals over WAN. |
| Air-gapped client | Local spoke only, manual offsite via removable media | 1 (spoke) | No hub sync. Tape or rotating disk for offsite copy. |

Namespace Design at the Hub

One top-level namespace per client. Always. Never collapse two clients into the same namespace, even tiny ones, because the moment you do you have broken the cleanest mental model for billing, ACLs, and offboarding.

Below the client namespace, sub-namespaces split by site or workload type. The layout should be obvious to anyone reading the datastore for the first time:

plaintext
main/
├── client-acme/
│   ├── site-hq/
│   └── site-dr/
├── client-globex/
│   ├── windows-vms/
│   └── linux-vms/
└── client-initech/
    └── default/
Hub namespace layout

This shape gives you per-client ACLs at the top level and per-site or per-workload scoping below. When a client wants read-only access to their own backups, you scope their token to their namespace and nothing else leaks.
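The scoping rule is a prefix match on whole namespace components. The sketch below only mimics that idea in shell to show why client-acme does not accidentally cover client-acme2; in_client_scope is a hypothetical helper, not how PBS evaluates ACLs internally.

```bash
# Succeeds when namespace NS is CLIENT_NS itself or nested below it.
# Appending a slash before matching keeps client-acme from matching
# client-acme2.
in_client_scope() {
    local client_ns=$1 ns=$2
    case "$ns/" in
        "$client_ns"/*) return 0 ;;
        *) return 1 ;;
    esac
}

for ns in client-acme client-acme/site-hq client-acme2 client-globex/linux-vms; do
    if in_client_scope client-acme "$ns"; then
        echo "visible: $ns"
    else
        echo "hidden:  $ns"
    fi
done
```

Only the first two namespaces come back as visible, which is the behaviour you want to confirm for every client token before handing it over.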

bash
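# Create the client tree. Namespace commands live in
# proxmox-backup-client; the repository string assumes you run this
# on the hub itself as root@pam.
export PBS_REPOSITORY='root@pam@localhost:main'
proxmox-backup-client namespace create client-acme
proxmox-backup-client namespace create client-acme/site-hq
proxmox-backup-client namespace create client-acme/site-dr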

# Read-only user for the client
proxmox-backup-manager user create acme-readonly@pbs \
    --password 'change-me'

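# Scope ACL to the client's namespace only
# (positional arguments are path and role; the user goes in --auth-id)
proxmox-backup-manager acl update \
    /datastore/main/client-acme \
    DatastoreAudit \
    --auth-id acme-readonly@pbs \
    --propagate true

# Generate an API token for automation
proxmox-backup-manager user generate-token \
    acme-readonly@pbs readonly

# A token has no permissions of its own until it gets an ACL entry
proxmox-backup-manager acl update \
    /datastore/main/client-acme \
    DatastoreAudit \
    --auth-id 'acme-readonly@pbs!readonly' \
    --propagate true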
Create namespaces and a scoped read-only token

DatastoreAudit lets the client see snapshot lists and metadata without pulling chunk data or deleting anything. If you want them to be able to restore, use DatastoreReader instead. Never give a client DatastoreAdmin on the hub.

Sync Scheduling and Bandwidth Control

The default mistake: every client syncs at 02:00. Your hub disks run flat out for two hours, your WAN saturates, and the verify jobs queue up behind the sync.

Stagger. Spread sync windows across the night so no two large clients overlap. Smaller clients can share a window. Use the schedule field on each pull job:

bash
proxmox-backup-manager sync-job create acme-pull \
    --store main --ns client-acme \
    --remote acme-spoke --remote-store local \
    --schedule "02:00" --rate-in 80MB

proxmox-backup-manager sync-job create globex-pull \
    --store main --ns client-globex \
    --remote globex-spoke --remote-store local \
    --schedule "02:30" --rate-in 80MB

proxmox-backup-manager sync-job create initech-pull \
    --store main --ns client-initech \
    --remote initech-spoke --remote-store local \
    --schedule "03:00" --rate-in 40MB
Staggered pull jobs at the hub
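With more than a handful of clients, it is easier to generate the windows than to pick them by hand. A minimal sketch, assuming a fixed gap between jobs; stagger_schedule is a made-up helper and the client names are the ones used above.

```bash
# Print "client HH:MM" pairs, spacing start times STEP_MIN minutes apart.
# START is HH:MM; times wrap around midnight.
stagger_schedule() {
    local start=$1 step_min=$2
    shift 2
    # 10# forces base 10 so "08" and "09" are not parsed as octal
    local minutes=$(( 10#${start%%:*} * 60 + 10#${start##*:} ))
    local client
    for client in "$@"; do
        printf '%s %02d:%02d\n' "$client" \
            $(( minutes / 60 % 24 )) $(( minutes % 60 ))
        minutes=$(( minutes + step_min ))
    done
}

stagger_schedule 02:00 30 acme globex initech
```

The output reproduces the 02:00, 02:30, and 03:00 windows above, and each time can be fed into the schedule field of the matching pull job. Large clients may need a wider gap than small ones, which this fixed-step version does not model.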

rate-in is a byte rate, so 80MB means roughly 640 Mbit/s. Keep it below what the link can sustain, especially for clients who share an uplink with their production traffic. Better to take six hours and not page anyone than two hours and trigger a network alert at the client.
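One way to pick the number is to start from the client's uplink and decide what share backups may use. The helper below is a hypothetical sketch that assumes the limit is expressed in decimal megabytes per second and rounds down.

```bash
# Convert a link speed in Mbit/s and a share in percent into a byte
# rate in MB/s, rounded down, formatted like the rate-in values above.
rate_in_for_link() {
    local link_mbit=$1 share_pct=$2
    echo "$(( link_mbit * share_pct / 100 / 8 ))MB"
}

rate_in_for_link 1000 60    # 1 Gbit/s uplink, 60 percent for backups
```

For a client on a shared 100 Mbit/s line, the same calculation at 50 percent gives a single-digit megabyte rate, which is a useful reality check before copying a limit from a better-connected client.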

Retention is the tricky part. The temptation is to set hub retention identical to spoke retention. Resist it.

Hub retention should outlive spoke retention

Spokes hold short, fast-restore history. Hubs hold long, compliance-grade history. Mirror them 1:1 and you have defeated the whole point of running two layers. A common pattern: 7 days at the spoke, 90+ days at the hub.

Retention Policy Examples by Client Tier

| Client Tier | Spoke Retention | Hub Retention | Rationale |
|---|---|---|---|
| Basic | None (no spoke) | 14 daily, 4 weekly | Cost-sensitive. Restore SLA measured in hours. |
| Standard | 7 daily | 30 daily, 8 weekly, 6 monthly | Local fast restore, offsite for DR and compliance scope. |
| Compliance | 14 daily | 90 daily, 12 weekly, 24 monthly, 7 yearly | Audit-driven retention. Year-end snapshots locked. |
| Air-gapped | 30 daily, 8 weekly | N/A (no sync) | Offsite copy via removable media, not network sync. |
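A quick sanity check catches tiers where the hub has been set to mirror the spoke. This sketch compares only the daily counts; the Standard and Compliance numbers come from the table, the "mirrored" row is a deliberately bad hypothetical, and hub_outlives_spoke is a made-up helper.

```bash
# Succeeds when the hub keeps more daily snapshots than the spoke.
hub_outlives_spoke() {
    local spoke_daily=$1 hub_daily=$2
    [ "$hub_daily" -gt "$spoke_daily" ]
}

# Each row is tier:spoke-daily:hub-daily
for row in standard:7:30 compliance:14:90 mirrored:7:7; do
    IFS=: read -r tier spoke hub <<< "$row"
    if hub_outlives_spoke "$spoke" "$hub"; then
        echo "$tier: ok"
    else
        echo "$tier: hub retention does not outlive spoke"
    fi
done
```

The first two tiers pass and the mirrored one is flagged. A fuller check would also compare weekly, monthly, and yearly counts.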

Monitoring a Multi-Site Fleet

PBS has a task log. It is adequate for one client. At thirty clients it is noise, and you will miss things.

Get Prometheus and Grafana in place before client count five, not client count fifty. Run a Prometheus exporter against the Proxmox Backup Server API on every hub and every spoke, so datastore usage, GC stats, sync job state, and verify job state show up on a /metrics endpoint you can scrape. Build per-client dashboards keyed on the namespace label.

What to alert on:

  • Backup absence: no successful backup in a namespace for over 25 hours (daily-tier clients).
  • Verify job failure on any namespace, ever. Page on this.
  • Sync lag greater than 24 hours between spoke and hub.
  • Datastore usage above 85 percent on hub or spoke.
  • GC job has not completed in 7 days.

Do not alert on individual snapshot completion. Alert on the absence of backup activity. That is the signal that actually matters at scale, because successful backups are normal and silence is the anomaly.
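The thresholds in the list are simple comparisons. In production they would live in Prometheus alert rules; the shell sketch below only illustrates the logic for two of them, with hypothetical function names and timestamps expressed as Unix epoch seconds.

```bash
# Succeeds (alert) when the newest backup is older than MAX_HOURS.
backup_stale() {
    local last_epoch=$1 now_epoch=$2 max_hours=${3:-25}
    [ $(( now_epoch - last_epoch )) -gt $(( max_hours * 3600 )) ]
}

# Succeeds (alert) when datastore usage exceeds LIMIT_PCT.
datastore_full() {
    local used_bytes=$1 total_bytes=$2 limit_pct=${3:-85}
    [ $(( used_bytes * 100 / total_bytes )) -gt "$limit_pct" ]
}

now=$(date +%s)
if backup_stale $(( now - 26 * 3600 )) "$now"; then
    echo "ALERT: no successful backup in over 25 hours"
fi
if datastore_full 900 1000; then
    echo "ALERT: datastore usage above 85 percent"
fi
```

Both sample calls trigger an alert. The same staleness check with a 24-hour limit covers the sync-lag rule, applied to the timestamp of the last successful pull.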

Sync jobs drop their state into the task log and the exporter exposes it. If a sync lags overnight, you want to know by 09:00, not 09:00 the next day when the client opens a ticket.

Wrapping Up

The MSP architecture pattern is boringly consistent once you have built it twice: hub-and-spoke, pull mode, namespace per client, retention tiered by SLA, fleet monitoring in Prometheus. Build for thirty clients on day one even if you have five. The migration from a flat Proxmox Backup Server to a hub-and-spoke layout is painful, and the longer you wait the more snapshot history you have to move.

The hub is the part you do not want to operate yourself. Hardware, replacement disks, capacity planning, offsite seismic separation, EU data residency. Either own a second datacenter or rent the hub from someone who already does.

Need a hub PBS without owning the hardware?

MSPs using remote-backups.com as the hub get a managed Proxmox Backup Server endpoint in EU datacenters. Fixed per-TB pricing, isolated credentials per client, no per-VM fees.


Clients can be given either view-only or restore access to their own backups. Give the client a token scoped to their hub namespace with DatastoreAudit (read metadata) or DatastoreReader (read and restore). They never see other clients' namespaces because the ACL stops at their top-level namespace.

Hub and spoke PBS versions should be within one minor release of each other. Sync jobs use the PBS API and remain wire-compatible across recent minor versions, but mixing very old and very new releases occasionally exposes bugs in chunk handling. Patch the hub first, then the spokes.

If a spoke goes offline, the corresponding pull job fails and logs an error. Existing snapshots at the hub remain untouched. Once the spoke is back, the next scheduled pull picks up the missing snapshots. Set a sync-lag alert so you find out within hours, not days.

A single hub can serve spokes in distant regions, but bandwidth and latency matter for the initial seed. For clients spread across continents, run one hub per region and replicate hub-to-hub for DR rather than forcing every spoke to sync across continents.

A flat namespaced PBS gives you logical isolation but no locality. Hub-and-spoke adds a local restore tier and decouples retention windows. The two patterns combine: the hub itself is a multi-tenant PBS with namespaces, and the spokes feed it.
Bennet Gallein

remote-backups.com operator

Infrastructure enthusiast and founder of remote-backups.com. I build and operate reliable backup infrastructure powered by Proxmox Backup Server, so you can focus on what matters most: your data staying safe.