Your backups ran last night. Probably. You think. The PBS web UI says everything looks fine, but you're managing 30 client datastores across three nodes and nobody has time to click through each one every morning. The moment you stop checking is the moment something breaks silently. This post shows you how to build a monitoring stack that watches your Proxmox Backup Server infrastructure around the clock.
Key Takeaways
- PBS has no native Prometheus exporter — you need a custom or community exporter to pull metrics
- The PBS API at /api2/json/admin/datastore/{store}/status exposes storage usage, dedup factor, and snapshot counts
- Track five critical metrics: storage usage, dedup ratio, backup freshness, GC duration, and verification errors
- A single Prometheus instance can monitor multiple PBS nodes using labels for node, datastore, and client
- Alerting on missed backups (>48h) and storage thresholds (>85%) catches problems before they become outages
Why Monitor PBS?
Backups fail silently. A full disk, a crashed daemon, a network blip during a sync job. PBS logs these events, but logs don't wake you up at 3 AM. And they don't give you trend lines showing that your storage will be full in 12 days.
If you run backup infrastructure for clients, whether as an MSP or a hosting provider, per-client visibility is non-negotiable. You need to answer "when did this client's last backup complete?" without logging into the PBS UI. You need to spot a degrading dedup ratio before storage costs spike. And you need alerting that routes to Slack, PagerDuty, or wherever your on-call team lives.
The shift from reactive to proactive monitoring is what separates "we have backups" from "we have a backup service." Prometheus and Grafana give you the tools to make that shift.
PBS Metrics Overview
Proxmox Backup Server exposes a JSON API on port 8007. There is no native Prometheus endpoint. You need an exporter that queries the API and translates the responses into Prometheus metrics.
The key API endpoint is /api2/json/admin/datastore/{store}/status. It returns:
PBS API Status Fields
| Field | What It Contains |
|---|---|
| used | Bytes consumed on disk |
| total | Total datastore capacity |
| avail | Bytes remaining |
| deduplication-factor | Current dedup ratio (float) |
| gc-status | Last GC run time, duration, removed chunks |
Authentication uses the Authorization: PBSAPIToken=user@realm!token:secret header. Create a dedicated read-only user with DatastoreAudit permissions for monitoring. Never reuse backup or admin tokens for metrics collection.
Audit-Only Token
Create a separate PBS user and token with DatastoreAudit role for your exporter. This grants read access to datastore status and task logs without any backup or delete permissions.
Beyond datastore status, the task log API (/api2/json/nodes/localhost/tasks) gives you backup job outcomes, verification results, and garbage collection runs. Parsing task logs is how you detect failed backup jobs and verification errors.
Setting Up the Prometheus Exporter
Two practical approaches exist: a community Go exporter or a custom Python script using the Prometheus textfile collector.
Option 1: Community Exporter
Several community-built PBS exporters exist on GitHub. They run as standalone HTTP servers that Prometheus scrapes directly. Install one, point it at your PBS API, and configure a scrape target.
Option 2: Custom Python Script
For more control, write a script that queries the PBS API and writes metrics to a file that the Prometheus node_exporter textfile collector picks up. This avoids running another HTTP listener.
```python
#!/usr/bin/env python3
"""PBS metrics exporter for Prometheus textfile collector."""
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

PBS_HOST = "https://localhost:8007"
PBS_TOKEN = "monitor@pbs!metrics:YOUR-TOKEN-SECRET"
DATASTORES = ["client-a", "client-b", "client-c"]
OUTPUT = "/var/lib/node_exporter/pbs_metrics.prom"

headers = {"Authorization": f"PBSAPIToken={PBS_TOKEN}"}
lines = []
for ds in DATASTORES:
    resp = requests.get(
        f"{PBS_HOST}/api2/json/admin/datastore/{ds}/status",
        headers=headers, verify=False, timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    lines.append(f'pbs_datastore_used_bytes{{datastore="{ds}"}} {data["used"]}')
    lines.append(f'pbs_datastore_total_bytes{{datastore="{ds}"}} {data["total"]}')
    lines.append(f'pbs_datastore_avail_bytes{{datastore="{ds}"}} {data["avail"]}')
    dedup = data.get("deduplication-factor", 1.0)
    lines.append(f'pbs_datastore_dedup_factor{{datastore="{ds}"}} {dedup}')

# In production, write to a temp file and rename it so the
# textfile collector never reads a half-written file.
with open(OUTPUT, "w") as f:
    f.write("\n".join(lines) + "\n")
```

Extend for Task Logs
Add a second function that queries /api2/json/nodes/localhost/tasks, filtered to backup and verification task types. Extract the last successful timestamp per datastore to produce pbs_backup_last_success_timestamp and pbs_verification_errors metrics, the series the alerting rules later in this post rely on.
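Here is a hedged sketch of that extension. It assumes each task-list entry carries worker_type, worker_id (prefixed with the datastore name), endtime, and status fields; verify those names against your PBS version's actual /nodes/localhost/tasks response before relying on them.

```python
def task_metrics(tasks):
    """Derive per-datastore freshness metrics from PBS task-list entries.

    Field names (worker_type, worker_id, endtime, status) are assumptions
    based on the PBS task-list API -- confirm them on your installation.
    """
    last_ok = {}        # datastore -> most recent successful backup endtime
    verify_errors = {}  # datastore -> count of failed verification tasks
    for t in tasks:
        ds = (t.get("worker_id") or "").split(":")[0]
        if not ds or "endtime" not in t:
            continue  # still running, or a malformed entry
        ok = t.get("status") == "OK"
        if t.get("worker_type") == "backup" and ok:
            last_ok[ds] = max(last_ok.get(ds, 0), t["endtime"])
        elif t.get("worker_type") == "verificationjob" and not ok:
            verify_errors[ds] = verify_errors.get(ds, 0) + 1
    out = []
    for ds, ts in last_ok.items():
        out.append(f'pbs_backup_last_success_timestamp{{datastore="{ds}"}} {ts}')
    for ds, n in verify_errors.items():
        out.append(f'pbs_verification_errors{{datastore="{ds}"}} {n}')
    return out
```

Append the returned lines to the same list the main loop builds before writing the .prom file.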
Systemd Timer
Run the script every 5 minutes with a systemd timer instead of cron. Systemd timers handle overlapping runs and provide journal logging.
pbs-metrics.service:

```ini
[Unit]
Description=Collect PBS metrics for Prometheus

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /opt/pbs-metrics/pbs_metrics.py
User=node_exporter
```

pbs-metrics.timer:

```ini
[Unit]
Description=Run PBS metrics collection every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
```

Enable both with:

```shell
systemctl daemon-reload
systemctl enable --now pbs-metrics.timer
```

Prometheus Scrape Config
If you're using the textfile collector via node_exporter, no extra scrape config is needed. The metrics appear alongside your node metrics. If you're running a standalone exporter, add a scrape job:
```yaml
scrape_configs:
  - job_name: "pbs"
    scrape_interval: 5m
    static_configs:
      - targets: ["pbs-node-01:9101"]
        labels:
          node: "pbs-node-01"
      - targets: ["pbs-node-02:9101"]
        labels:
          node: "pbs-node-02"
```

Key Metrics to Track
Five metrics cover the critical health signals for any Proxmox Backup Server deployment.
- pbs_datastore_used_bytes — disk consumption trend. Alert at > 85% of total.
- pbs_datastore_dedup_factor — dedup efficiency. Alert at < 1.5.
- pbs_backup_last_success_timestamp — backup freshness per datastore. Alert at > 48 hours.
- pbs_gc_duration_seconds — GC runtime. Alert at > 4 hours.
- pbs_verification_errors — chunk integrity failures. Alert at > 0.
Storage usage is the obvious one. But trending matters more than the current number. A datastore at 60% that grows 2% per day is a bigger problem than one sitting steady at 80%.
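To turn that growth trend into an alert, one option is PromQL's predict_linear, which extrapolates a gauge forward from recent samples. A sketch of an expression that fires when a datastore is projected to fill within two weeks, based on the last seven days of growth:

```
predict_linear(pbs_datastore_used_bytes[7d], 14 * 86400)
  > pbs_datastore_total_bytes
```

Tune the lookback and horizon to your backup cadence; a 7-day window smooths over weekend gaps in daily backup schedules.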
Dedup factor dropping below 1.5 suggests something changed in the backup sources. New full backups without prior data, or a datastore with only one backup group, will show low dedup. It's not always a problem, but it's worth investigating.
Backup freshness is the metric your clients care about most. If the last successful backup is more than 48 hours old, something is broken. This is the single most important alert to configure.
GC duration creeping up indicates growing datastore size or fragmentation. Normal GC on a healthy datastore completes in minutes to an hour. Multi-hour GC runs deserve attention.
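The exporter script above does not yet emit pbs_gc_duration_seconds; one way to produce it is from finished GC entries in the task list, since task entries carry starttime and endtime epochs. A sketch, assuming GC tasks appear with worker_type "garbage_collection" (check the exact type string on your PBS version):

```python
def gc_duration_metrics(tasks):
    """Emit pbs_gc_duration_seconds from the most recent finished GC task.

    The worker_type string and the use of worker_id as the datastore name
    are assumptions -- verify against a real task-list response.
    """
    latest = {}  # datastore -> (endtime, duration)
    for t in tasks:
        if t.get("worker_type") != "garbage_collection":
            continue
        if "starttime" not in t or "endtime" not in t:
            continue  # GC still running
        ds = t.get("worker_id") or "unknown"
        dur = t["endtime"] - t["starttime"]
        # Keep only the most recently finished run per datastore
        if ds not in latest or t["endtime"] > latest[ds][0]:
            latest[ds] = (t["endtime"], dur)
    return [
        f'pbs_gc_duration_seconds{{datastore="{ds}"}} {dur}'
        for ds, (_, dur) in latest.items()
    ]
```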
Verification errors should always be zero. Any non-zero value means chunks on disk don't match their expected SHA-256 checksums. This is a data integrity issue that needs immediate investigation. See the restore testing guide for how verification fits into a broader DR strategy.
Building the Grafana Dashboard
A good MSP dashboard answers three questions at a glance: Is anything broken? Is anything about to break? How much storage is each client using?
Panel Layout
Organize your dashboard into four rows:
Row 1: Alert overview. A stat panel showing total active alerts. Green when zero, red otherwise. This is what you look at first.
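For the stat panel, one query option is Prometheus's built-in ALERTS series, which exposes every firing alert as a sample (the `or on() vector(0)` keeps the panel at zero instead of "no data" when nothing is firing):

```
count(ALERTS{alertstate="firing"}) or on() vector(0)
```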
Row 2: Storage by datastore. A horizontal bar chart showing used vs. total bytes per datastore, sorted by usage percentage descending. Color-code bars that exceed 80%.
```json
{
  "targets": [{
    "expr": "pbs_datastore_used_bytes / pbs_datastore_total_bytes * 100",
    "legendFormat": "{{datastore}}"
  }],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          { "color": "green", "value": 0 },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 85 }
        ]
      },
      "unit": "percent"
    }
  }
}
```

Row 3: Backup freshness heatmap. A table panel listing each datastore with the time since last successful backup. Use value mappings: green for <24h, yellow for 24-48h, red for >48h.
```
time() - pbs_backup_last_success_timestamp
```

Row 4: Trends. Time series panels for dedup factor and storage growth over 30 days. These are the panels you review weekly, not daily.
Template Variables
Use Grafana template variables for node and datastore so you can filter the dashboard by PBS node or drill into a specific client. This keeps one dashboard working across your entire fleet.
```
label_values(pbs_datastore_used_bytes, datastore)
```

Alerting Rules
Prometheus alerting rules catch problems before they reach clients. Here are the four rules every PBS deployment should have.
```yaml
groups:
  - name: pbs
    rules:
      - alert: PBSBackupStale
        expr: (time() - pbs_backup_last_success_timestamp) > 172800
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "No backup for {{ $labels.datastore }} in 48h"
          description: "Datastore {{ $labels.datastore }} on {{ $labels.node }} has not had a successful backup in over 48 hours."
      - alert: PBSStorageHigh
        expr: (pbs_datastore_used_bytes / pbs_datastore_total_bytes) > 0.85
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.datastore }} storage above 85%"
      - alert: PBSVerificationFailure
        expr: pbs_verification_errors > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Verification errors on {{ $labels.datastore }}"
          description: "Chunk integrity check failures detected. Investigate immediately."
      - alert: PBSGCTooLong
        expr: pbs_gc_duration_seconds > 14400
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GC on {{ $labels.datastore }} exceeded 4 hours"
```

Alertmanager Routing
Route critical alerts (stale backups, verification failures) to PagerDuty or your on-call rotation. Route warnings (storage, GC) to a Slack channel for next-business-day review.
```yaml
route:
  receiver: "slack-warnings"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
    - match:
        severity: warning
      receiver: "slack-warnings"

receivers:
  - name: "pagerduty-oncall"
    pagerduty_configs:
      - service_key: "<your-pd-integration-key>"
  - name: "slack-warnings"
    slack_configs:
      - api_url: "<your-slack-webhook>"
        channel: "#backup-alerts"
        title: '{{ .CommonAnnotations.summary }}'
```

For teams already using the remote-backups.com notification system, Prometheus alerting adds a second layer of visibility. The built-in notifications catch individual job failures. Prometheus catches systemic issues: storage trends, cross-node patterns, and metric anomalies that no single job alert would surface.
Scaling for Multiple PBS Nodes
A single Prometheus instance handles monitoring 3-5 PBS nodes without issue. The key is consistent labeling.
Every metric should carry three labels: node (which PBS server), datastore (which client or dataset), and optionally client (if you want MSP-level grouping). These labels power the Grafana template variables and let you slice dashboards by any dimension.
```python
# Add node label to all metrics
NODE = "pbs-node-01"
lines.append(
    f'pbs_datastore_used_bytes{{node="{NODE}",datastore="{ds}"}} {data["used"]}'
)
```

For larger deployments (10+ nodes across regions), consider Prometheus federation or a remote-write backend like Thanos or VictoriaMetrics. But don't over-engineer this early. Start with one Prometheus, add complexity when the data volume demands it.
If you're replicating backups across locations with PBS sync jobs, monitor both the source and target nodes. A sync job that silently stops replicating is just as dangerous as a missed backup.
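Sync freshness can be derived from the same task list as the backup metrics. A sketch, assuming sync tasks appear with worker_type "syncjob" and that worker_id identifies the sync job (both names should be confirmed against your PBS version):

```python
def sync_metrics(tasks, node):
    """Emit last-success timestamps for PBS sync jobs on one node.

    The "syncjob" worker_type string is an assumption -- confirm the
    exact value in your task-list responses before alerting on this.
    """
    last_ok = {}  # sync job id -> most recent successful endtime
    for t in tasks:
        if t.get("worker_type") != "syncjob":
            continue
        if t.get("status") == "OK" and "endtime" in t:
            job = t.get("worker_id") or "unknown"
            last_ok[job] = max(last_ok.get(job, 0), t["endtime"])
    return [
        f'pbs_sync_last_success_timestamp{{node="{node}",job="{job}"}} {ts}'
        for job, ts in last_ok.items()
    ]
```

Pair it with a staleness alert mirroring PBSBackupStale so a silently stopped sync job pages someone just like a missed backup would.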
Wrapping Up
Proxmox Backup Server gives you the API surface to build proper observability. A Python script, a systemd timer, Prometheus, and Grafana turn that API into dashboards and alerts that catch problems before your clients do. Start with the five metrics listed here, get alerting working, and expand from there.
Need managed PBS with built-in monitoring?
remote-backups.com includes dashboards, alerting, and 24/7 monitoring for your offsite Proxmox Backup Server targets.
View Plans


