
PBS Monitoring with Prometheus and Grafana

Your backups ran last night. Probably. You think. The PBS web UI says everything looks fine, but you're managing 30 client datastores across three nodes and nobody has time to click through each one every morning. The moment you stop checking is the moment something breaks silently. This post shows you how to build a monitoring stack that watches your Proxmox Backup Server infrastructure around the clock.

Key Takeaways
  • PBS has no native Prometheus exporter — you need a custom or community exporter to pull metrics
  • The PBS API at /api2/json/admin/datastore/{store}/status exposes storage usage, dedup factor, and snapshot counts
  • Track five critical metrics: storage usage, dedup ratio, backup freshness, GC duration, and verification errors
  • A single Prometheus instance can monitor multiple PBS nodes using labels for node, datastore, and client
  • Alerting on missed backups (>48h) and storage thresholds (>85%) catches problems before they become outages

Why Monitor PBS?

Backups fail silently. A full disk, a crashed daemon, a network blip during a sync job. PBS logs these events, but logs don't wake you up at 3 AM. And they don't give you trend lines showing that your storage will be full in 12 days.

If you run backup infrastructure for clients, whether as an MSP or a hosting provider, per-client visibility is non-negotiable. You need to answer "when did this client's last backup complete?" without logging into the PBS UI. You need to spot a degrading dedup ratio before storage costs spike. And you need alerting that routes to Slack, PagerDuty, or wherever your on-call team lives.

The shift from reactive to proactive monitoring is what separates "we have backups" from "we have a backup service." Prometheus and Grafana give you the tools to make that shift.

PBS Metrics Overview

Proxmox Backup Server exposes a JSON API on port 8007. There is no native Prometheus endpoint. You need an exporter that queries the API and translates the responses into Prometheus metrics.

The key API endpoint is /api2/json/admin/datastore/{store}/status. It returns:

PBS API Status Fields

  • used — Bytes consumed on disk
  • total — Total datastore capacity in bytes
  • avail — Bytes remaining
  • deduplication-factor — Current dedup ratio (float)
  • gc-status — Last GC run time, duration, and removed chunks

Authentication uses the Authorization: PBSAPIToken=user@realm!token:secret header. Create a dedicated read-only user with DatastoreAudit permissions for monitoring. Never reuse backup or admin tokens for metrics collection.
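With a token in hand, the status endpoint can be queried directly; the user, token, and datastore names below are placeholders, and `-k` skips certificate verification for the default self-signed cert:

```shell
# Query datastore status for "client-a" with a PBS API token
curl -sk \
  -H 'Authorization: PBSAPIToken=monitor@pbs!metrics:YOUR-TOKEN-SECRET' \
  'https://localhost:8007/api2/json/admin/datastore/client-a/status'
```
Query the status endpoint manually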

Audit-Only Token

Create a separate PBS user and token with DatastoreAudit role for your exporter. This grants read access to datastore status and task logs without any backup or delete permissions.
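One way to create that user and token on the PBS host itself, using the `monitor@pbs!metrics` naming this post assumes throughout:

```shell
# Dedicated monitoring user plus an API token for it
proxmox-backup-manager user create monitor@pbs
proxmox-backup-manager user generate-token monitor@pbs metrics

# Grant the token read-only audit access to every datastore
proxmox-backup-manager acl update /datastore DatastoreAudit \
  --auth-id 'monitor@pbs!metrics'
```
Create the audit user and token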

Beyond datastore status, the task log API (/api2/json/nodes/localhost/tasks) gives you backup job outcomes, verification results, and garbage collection runs. Parsing task logs is how you detect failed backup jobs and verification errors.

Setting Up the Prometheus Exporter

Two practical approaches exist: a community Go exporter or a custom Python script using the Prometheus textfile collector.

Option 1: Community Exporter

Several community-built PBS exporters exist on GitHub. They run as standalone HTTP servers that Prometheus scrapes directly. Install one, point it at your PBS API, and configure a scrape target.

Option 2: Custom Python Script

For more control, write a script that queries the PBS API and writes metrics to a file that the Prometheus node_exporter textfile collector picks up. This avoids running another HTTP listener.
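For that to work, node_exporter must be started with the textfile collector pointed at the directory the script writes to. A minimal sketch, assuming node_exporter is run directly:

```shell
# Directory the exporter script writes its .prom file into
mkdir -p /var/lib/node_exporter

# Start node_exporter with the textfile collector enabled
node_exporter --collector.textfile.directory=/var/lib/node_exporter
```
Enable the textfile collector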

python
#!/usr/bin/env python3
"""PBS metrics exporter for Prometheus textfile collector."""
import os
import requests
import urllib3

# PBS ships with a self-signed certificate by default; suppress the
# warning rather than failing. Pin a CA bundle in production.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

PBS_HOST = "https://localhost:8007"
PBS_TOKEN = "monitor@pbs!metrics:YOUR-TOKEN-SECRET"
DATASTORES = ["client-a", "client-b", "client-c"]
OUTPUT = "/var/lib/node_exporter/pbs_metrics.prom"

headers = {"Authorization": f"PBSAPIToken={PBS_TOKEN}"}

lines = []
for ds in DATASTORES:
    resp = requests.get(
        f"{PBS_HOST}/api2/json/admin/datastore/{ds}/status",
        headers=headers, verify=False, timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    lines.append(f'pbs_datastore_used_bytes{{datastore="{ds}"}} {data["used"]}')
    lines.append(f'pbs_datastore_total_bytes{{datastore="{ds}"}} {data["total"]}')
    lines.append(f'pbs_datastore_avail_bytes{{datastore="{ds}"}} {data["avail"]}')
    dedup = data.get("deduplication-factor", 1.0)
    lines.append(f'pbs_datastore_dedup_factor{{datastore="{ds}"}} {dedup}')

# Write atomically so node_exporter never scrapes a half-written file
tmp = OUTPUT + ".tmp"
with open(tmp, "w") as f:
    f.write("\n".join(lines) + "\n")
os.replace(tmp, OUTPUT)
pbs_metrics.py
Extend for Task Logs

Add a second function that queries /api2/json/nodes/localhost/tasks filtered by the backup and verificationjob worker types. Extract the last successful timestamp per datastore to create the pbs_backup_last_success_timestamp and pbs_verification_errors metrics.
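A sketch of that function's core logic, assuming task entries shaped like the PBS task API returns them (`worker_type`, `worker_id`, `status`, `endtime`), with backup workers carrying the datastore name as the first colon-separated component of `worker_id`:

```python
def backup_freshness_metrics(tasks, datastores):
    """Turn PBS task entries into last-success timestamp metrics.

    `tasks` is the decoded "data" array from
    /api2/json/nodes/localhost/tasks.
    """
    last_success = {}
    for task in tasks:
        # Only finished, successful backup workers count toward freshness
        if task.get("worker_type") != "backup" or task.get("status") != "OK":
            continue
        store = (task.get("worker_id") or "").split(":", 1)[0]
        endtime = task.get("endtime", 0)
        if store in datastores and endtime > last_success.get(store, 0):
            last_success[store] = endtime
    return [
        f'pbs_backup_last_success_timestamp{{datastore="{ds}"}} {ts}'
        for ds, ts in sorted(last_success.items())
    ]
```

Append its output to `lines` before writing the file; the same pattern with `worker_type` set to `verificationjob` and failed statuses counted yields the verification error metric.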

Systemd Timer

Run the script every 5 minutes with a systemd timer instead of cron. Systemd timers handle overlapping runs and provide journal logging.

ini
[Unit]
Description=Collect PBS metrics for Prometheus

[Service]
Type=oneshot
ExecStart=/usr/bin/python3 /opt/pbs-metrics/pbs_metrics.py
User=node_exporter
/etc/systemd/system/pbs-metrics.service
ini
[Unit]
Description=Run PBS metrics collection every 5 minutes

[Timer]
OnCalendar=*:0/5
Persistent=true

[Install]
WantedBy=timers.target
/etc/systemd/system/pbs-metrics.timer
bash
systemctl daemon-reload
systemctl enable --now pbs-metrics.timer
Enable the timer

Prometheus Scrape Config

If you're using the textfile collector via node_exporter, no extra scrape config is needed. The metrics appear alongside your node metrics. If you're running a standalone exporter, add a scrape job:

yaml
scrape_configs:
  - job_name: "pbs"
    scrape_interval: 5m
    static_configs:
      - targets: ["pbs-node-01:9101"]
        labels:
          node: "pbs-node-01"
      - targets: ["pbs-node-02:9101"]
        labels:
          node: "pbs-node-02"
prometheus.yml

Key Metrics to Track

Five metrics cover the critical health signals for any Proxmox Backup Server deployment.

  • pbs_datastore_used_bytes — Disk consumption trend. Alert at > 85% of total.
  • pbs_datastore_dedup_factor — Dedup efficiency dropping. Alert at < 1.5.
  • pbs_backup_last_success_timestamp — Backup freshness per datastore. Alert at > 48 hours.
  • pbs_gc_duration_seconds — GC taking too long. Alert at > 4 hours.
  • pbs_verification_errors — Chunk integrity failures. Alert at > 0.

Storage usage is the obvious one. But trending matters more than the current number. A datastore at 60% that grows 2% per day is a bigger problem than one sitting steady at 80%.
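PromQL can surface exactly that kind of trend. A sketch that flags any datastore projected to exceed its capacity within two weeks, extrapolating from the last week of growth:

```promql
predict_linear(pbs_datastore_used_bytes[7d], 14 * 86400)
  > pbs_datastore_total_bytes
```
Projected-full-within-14-days query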

Dedup factor dropping below 1.5 suggests something changed in the backup sources. New full backups without prior data, or a datastore with only one backup group, will show low dedup. It's not always a problem, but it's worth investigating.

Backup freshness is the metric your clients care about most. If the last successful backup is more than 48 hours old, something is broken. This is the single most important alert to configure.

GC duration creeping up indicates growing datastore size or fragmentation. Normal GC on a healthy datastore completes in minutes to an hour. Multi-hour GC runs deserve attention.

Verification errors should always be zero. Any non-zero value means chunks on disk don't match their expected SHA-256 checksums. This is a data integrity issue that needs immediate investigation. See the restore testing guide for how verification fits into a broader DR strategy.

Building the Grafana Dashboard

A good MSP dashboard answers three questions at a glance: Is anything broken? Is anything about to break? How much storage is each client using?

Panel Layout

Organize your dashboard into four rows:

Row 1: Alert overview. A stat panel showing total active alerts. Green when zero, red otherwise. This is what you look at first.
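Prometheus's built-in ALERTS series works for that stat panel; the `or vector(0)` keeps the stat at zero instead of blank when nothing is firing:

```promql
count(ALERTS{alertstate="firing"}) or vector(0)
```
Active alert count query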

Row 2: Storage by datastore. A horizontal bar chart showing used vs. total bytes per datastore, sorted by usage percentage descending. Color-code bars that exceed 80%.

json
{
  "targets": [{
    "expr": "pbs_datastore_used_bytes / pbs_datastore_total_bytes * 100",
    "legendFormat": "{{datastore}}"
  }],
  "fieldConfig": {
    "defaults": {
      "thresholds": {
        "steps": [
          { "color": "green", "value": 0 },
          { "color": "yellow", "value": 70 },
          { "color": "red", "value": 85 }
        ]
      },
      "unit": "percent"
    }
  }
}
Storage bar chart panel query

Row 3: Backup freshness heatmap. A table panel listing each datastore with the time since last successful backup. Use value mappings: green for <24h, yellow for 24-48h, red for >48h.

promql
time() - pbs_backup_last_success_timestamp
Backup freshness query

Row 4: Trends. Time series panels for dedup factor and storage growth over 30 days. These are the panels you review weekly, not daily.

Template Variables

Use Grafana template variables for node and datastore so you can filter the dashboard by PBS node or drill into a specific client. This keeps one dashboard working across your entire fleet.

promql
label_values(pbs_datastore_used_bytes, datastore)
Template variable query for datastores

Alerting Rules

Prometheus alerting rules catch problems before they reach clients. Here are the four rules every PBS deployment should have.

yaml
groups:
  - name: pbs
    rules:
      - alert: PBSBackupStale
        expr: (time() - pbs_backup_last_success_timestamp) > 172800
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "No backup for {{ $labels.datastore }} in 48h"
          description: "Datastore {{ $labels.datastore }} on {{ $labels.node }} has not had a successful backup in over 48 hours."

      - alert: PBSStorageHigh
        expr: (pbs_datastore_used_bytes / pbs_datastore_total_bytes) > 0.85
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.datastore }} storage above 85%"

      - alert: PBSVerificationFailure
        expr: pbs_verification_errors > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Verification errors on {{ $labels.datastore }}"
          description: "Chunk integrity check failures detected. Investigate immediately."

      - alert: PBSGCTooLong
        expr: pbs_gc_duration_seconds > 14400
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GC on {{ $labels.datastore }} exceeded 4 hours"
pbs_alerts.yml

Alertmanager Routing

Route critical alerts (stale backups, verification failures) to PagerDuty or your on-call rotation. Route warnings (storage, GC) to a Slack channel for next-business-day review.

yaml
route:
  receiver: "slack-warnings"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-oncall"
    - match:
        severity: warning
      receiver: "slack-warnings"

receivers:
  - name: "pagerduty-oncall"
    pagerduty_configs:
      - service_key: "<your-pd-integration-key>"
  - name: "slack-warnings"
    slack_configs:
      - api_url: "<your-slack-webhook>"
        channel: "#backup-alerts"
        title: '{{ .CommonAnnotations.summary }}'
alertmanager.yml (partial)

For teams already using the remote-backups.com notification system, Prometheus alerting adds a second layer of visibility. The built-in notifications catch individual job failures. Prometheus catches systemic issues: storage trends, cross-node patterns, and metric anomalies that no single job alert would surface.

Scaling for Multiple PBS Nodes

A single Prometheus instance handles monitoring 3-5 PBS nodes without issue. The key is consistent labeling.

Every metric should carry three labels: node (which PBS server), datastore (which client or dataset), and optionally client (if you want MSP-level grouping). These labels power the Grafana template variables and let you slice dashboards by any dimension.

python
# Add node label to all metrics
NODE = "pbs-node-01"

lines.append(
    f'pbs_datastore_used_bytes{{node="{NODE}",datastore="{ds}"}} {data["used"]}'
)
Multi-node metric labeling

For larger deployments (10+ nodes across regions), consider Prometheus federation or a remote-write backend like Thanos or VictoriaMetrics. But don't over-engineer this early. Start with one Prometheus, add complexity when the data volume demands it.

If you're replicating backups across locations with PBS sync jobs, monitor both the source and target nodes. A sync job that silently stops replicating is just as dangerous as a missed backup.
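If your exporter also records sync outcomes from the task log (the pbs_sync_last_success_timestamp metric below is an assumption — you would emit it yourself, following the same task-log pattern as for backups), a staleness rule mirrors the backup one:

```yaml
      - alert: PBSSyncStale
        expr: (time() - pbs_sync_last_success_timestamp) > 172800
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Sync for {{ $labels.datastore }} on {{ $labels.node }} is stale"
```
Sync staleness rule (append to pbs_alerts.yml)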

Wrapping Up

Proxmox Backup Server gives you the API surface to build proper observability. A Python script, a systemd timer, Prometheus, and Grafana turn that API into dashboards and alerts that catch problems before your clients do. Start with the five metrics listed here, get alerting working, and expand from there.

Need managed PBS with built-in monitoring?

remote-backups.com includes dashboards, alerting, and 24/7 monitoring for your offsite Proxmox Backup Server targets.

View Plans

Frequently Asked Questions

Does PBS have a native Prometheus exporter?

No. PBS exposes a JSON API on port 8007 but does not ship with a Prometheus-compatible /metrics endpoint. You need an external exporter — either a community project or a custom script using the textfile collector.

How often should I scrape PBS metrics?

Five minutes is a good default. PBS metrics like storage usage and dedup ratios change slowly. Scraping more frequently adds load without providing meaningful additional granularity.

Can one dashboard monitor multiple PBS nodes?

Yes. Use consistent labels (node, datastore) across all your exporters and create Grafana template variables to filter by node. One dashboard handles your entire fleet.

What permissions does the monitoring token need?

Create a user with the DatastoreAudit role scoped to each datastore you want to monitor. This grants read-only access to status and task information without backup or administrative permissions.
Bennet Gallein

remote-backups.com operator

Infrastructure enthusiast and founder of remote-backups.com. I build and operate reliable backup infrastructure powered by Proxmox Backup Server, so you can focus on what matters most: your data staying safe.