
PBS WAN Tuning: Saturate Your Uplink

You pay for a 1 Gbit symmetric line. Your offsite backup runs at 300 Mbit/s. Speedtest says the link is fine, the provider says the link is fine, and the backup still crawls. This is almost never a broken pipe. It is a stack that has not been told to use the pipe, and it is fixable without calling anyone.

Key Takeaways
  • A 1 Gbit line that backs up at 300 Mbit/s is rarely a buffer problem at European latencies. The shortfall is usually per-flow: congestion control reacting to loss, source disk read, CPU on hashing, or round-trip time on the chunk path.
  • Bandwidth-delay product sets the minimum socket buffer one stream needs. Modern TCP autotuning covers 1 Gbit at EU RTTs. It falls short at 10 Gbit, where the default send buffer caps a single stream below 2 Gbit/s.
  • Proxmox Backup Server runs one TCP connection and gets its concurrency from HTTP/2 chunk multiplexing. You cannot fix throughput by adding streams the way `iperf3 -P` does.
  • Raise tcp_rmem/tcp_wmem max above your BDP on both ends. BBR helps on lossy or bufferbloated WAN paths and does nothing on clean fiber. Measure before you keep it.
  • A single `iperf3` stream understates a high-BDP link. Run single-stream and `-P 8`; the gap is your per-flow penalty, not the link's ceiling.
  • When RTT is the wall, move the target closer and replicate server-side rather than pushing the same bytes across your uplink twice.

Before changing anything, get a number you can argue with. Open the TCP throughput calculator, enter your real upload bandwidth and the round-trip time to your backup target, and read off three things: the bandwidth-delay product, the maximum throughput of a single TCP stream at the kernel's default window, and how many parallel connections it would take to saturate the link. Those three numbers tell you whether you have a buffer problem, a latency problem, or neither. Most people skip this step and tune blind.

Why TCP Doesn't Fill Your Pipe By Default

A single TCP stream can only have so much data in flight before it has to stop and wait for acknowledgements. That ceiling is the receive window divided by the round-trip time: throughput = window_size / RTT. The amount of data that should be in flight to keep the pipe full is the bandwidth-delay product: BDP = bandwidth × RTT.

If your socket buffer is smaller than the BDP, the stream is window-limited: it drains its window, idles waiting for ACKs, and never touches line rate no matter how fat the link. This is the entire reason a fast link can feel slow. Work three cases for European latencies and the picture sharpens.

Bandwidth-delay product vs the default send buffer

| Path | Round-trip time | Bandwidth-delay product | Single-stream ceiling at default 4 MiB send buffer | Buffer-limited on upload? |
| --- | --- | --- | --- | --- |
| 1 Gbit/s, DE ↔ DE | 10 ms | ~1.25 MB | ~3.3 Gbit/s | No, fills the line |
| 1 Gbit/s, DE ↔ EU edge | 30 ms | ~3.75 MB | ~1.1 Gbit/s | Barely, still fills the line |
| 10 Gbit/s, DE ↔ EU | 20 ms | ~25 MB | ~1.7 Gbit/s | Yes, ~17% of the link |

4 MiB is the typical modern-kernel net.ipv4.tcp_wmem max; the receive side defaults to ~6 MiB. Autotuning grows a stream up to those ceilings. Upload throughput is bound by the smaller of the two.

Read the table the way an operator should. On a 1 Gbit line at 10 to 30 ms, which covers most DACH-to-EU backup paths, the kernel's default buffers already cover the BDP. A single well-behaved stream fills that line. So if your 1 Gbit backup sits at 300 Mbit/s, the socket buffer is not your villain. Look at congestion control reacting to packet loss, the source disk you are reading from, the CPU doing the chunk hashing, or the round trips on the chunk-existence path before you blame TCP buffers.

The buffer ceiling only becomes the binding limit when the BDP outgrows the default window. That happens at 10 Gbit, where a 20 ms path needs 25 MB in flight and the default 4 MiB send buffer caps one stream near 1.7 Gbit/s. It also happens on intercontinental RTTs, or any time a middlebox, VPN, or application pins the window below the BDP. Those are the cases where the sysctl section below earns its keep. At 1 Gbit and EU latencies, it is cheap insurance, not the fix.
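
If you want the table's arithmetic for your own path, a minimal sketch follows. The function names are made up for this example; the inputs are link speed in Mbit/s, RTT in ms, and a socket buffer size in bytes. It treats the whole buffer as usable window, which is the same simplification the table makes.

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of any tool.

# Bandwidth-delay product in bytes.
# mbit * 1e6 bit/s * (rtt_ms / 1000) s / 8 = mbit * rtt_ms * 125
bdp_bytes() {
    awk -v mbit="$1" -v rtt_ms="$2" \
        'BEGIN { printf "%.0f\n", mbit * rtt_ms * 125 }'
}

# Single-stream ceiling in Mbit/s for a given window: window / RTT.
stream_ceiling_mbit() {
    awk -v buf="$1" -v rtt_ms="$2" \
        'BEGIN { printf "%.0f\n", buf * 8 / (rtt_ms * 1000) }'
}

# Parallel streams needed to cover the BDP at that window (rounded up).
streams_needed() {
    awk -v mbit="$1" -v rtt_ms="$2" -v buf="$3" 'BEGIN {
        bdp = mbit * rtt_ms * 125
        n = int(bdp / buf)
        if (n * buf < bdp) n++
        if (n < 1) n = 1
        print n
    }'
}

# The 10 Gbit/s, 20 ms row with a 4 MiB send buffer:
bdp_bytes 10000 20               # 25000000 bytes
stream_ceiling_mbit 4194304 20   # 1678 Mbit/s
streams_needed 10000 20 4194304  # 6 streams
```

For PBS the third number is diagnostic only: it tells you how far one stream falls short, not something you can configure.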

Where The Throughput Actually Goes

To tune Proxmox Backup Server you have to know how it puts bytes on the wire, because it does not behave like scp or a single rsync stream.

A backup walks the source, splits it into content-defined chunks averaging around 4 MB, and hashes each one. We covered that mechanism in PBS chunking explained. At session start the server hands the client a known-chunks list, the client diffs its hashes against it, and only the chunks the server is missing ever cross the link. On a steady-state incremental that is a small fraction of the disk. The wire format and transport are in the PBS wire protocol, and the part that matters for tuning is this: the client opens one TCP connection, negotiates HTTP/2 over TLS, and runs many chunk uploads as independent HTTP/2 streams multiplexed onto that single connection.

One TCP connection, many HTTP/2 chunk streams

This single-connection design has two consequences that decide your whole tuning strategy.

First, concurrency is internal. PBS gets parallelism from multiplexed chunk streams, not from opening eight TCP sockets. So the classic "open more parallel connections to beat latency" trick is not yours to pull. The protocol already decided how many chunks are in flight, and it is a small number by default. You cannot bolt streams onto it from the outside.

Second, that single connection lives and dies by one congestion window and one set of socket buffers. Everything the rest of this post tunes (buffers, congestion control, MTU, offloads) is in service of making that one connection behave on your specific path. There is no --threads 16 waiting to rescue a misconfigured link.

HTTP/2 helps with round trips, not raw streams

Multiplexing means the chunk-existence checks and uploads do not serialize behind one another the way they would on HTTP/1.1. That hides a lot of per-request latency. It does not change the fact that the bytes ride one TCP flow, so the BDP math from the previous section still applies to that flow.

Tuning The Kernel

The job here is to make sure one TCP stream can keep enough data in flight to fill your link, and that the congestion controller does not give up that window for the wrong reason. Four areas matter.

net.core.rmem_max and net.core.wmem_max are the ceilings for buffers an application requests explicitly with setsockopt. On a stock Debian they sit around 208 KiB.

net.ipv4.tcp_rmem and net.ipv4.tcp_wmem are the three-value (min, default, max) bounds for TCP's autotuning. The third value is the real lever: it caps how large a window autotuning will grow for a busy stream. Defaults are roughly 6 MiB receive and 4 MiB send.

net.ipv4.tcp_congestion_control selects the algorithm. The default is CUBIC. BBR is the alternative worth testing on a WAN.

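net.core.default_qdisc sets the queueing discipline. fq provides the packet pacing that BBR needs to behave. The setting only applies to qdiscs created after it is set, so interfaces that are already up keep their current qdisc until a reboot or an explicit tc qdisc replace.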

Put them in one file so they survive reboots:

ini
# Ceiling for buffers an app requests explicitly (128 MiB)
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728

# TCP autotuning bounds: min default max (bytes)
# The max must comfortably exceed your bandwidth-delay product.
net.ipv4.tcp_rmem = 4096 131072 134217728
net.ipv4.tcp_wmem = 4096 16384 134217728

# Window scaling is required for windows above 64 KiB.
# It is on by default; assert it so nothing downstream turns it off.
net.ipv4.tcp_window_scaling = 1

# Pacing qdisc. BBR relies on it to pace correctly.
net.core.default_qdisc = fq

# CUBIC is the default. BBR helps on lossy or bufferbloated WAN paths.
net.ipv4.tcp_congestion_control = bbr
/etc/sysctl.d/99-pbs-tuning.conf

Apply it with sysctl --system and confirm the controller actually loaded, because a kernel without the tcp_bbr module keeps CUBIC and the error sysctl prints is easy to miss:

bash
sysctl net.ipv4.tcp_congestion_control
sysctl net.ipv4.tcp_available_congestion_control   # bbr must appear here
sysctl net.core.default_qdisc
Verify the settings took

The 128 MiB ceiling looks generous because it is. You only need the buffer to exceed your BDP, which is 25 MB at 10 Gbit and 20 ms. The headroom costs nothing in practice: autotuning allocates what a flow uses, not the maximum, so a 1 Gbit stream on a tuned box still only holds a few MB. Set the ceiling once and stop thinking about it.
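
To check a host against its BDP rather than eyeballing it, something like the following works as a rough sketch. The function names are invented for this example, and the 2x margin is an assumption standing in for "comfortably exceed", not a kernel rule.

```shell
#!/bin/sh
# Sketch only: helper names and the 2x margin are assumptions.

# Print the third value (autotuning max) of a tcp_rmem/tcp_wmem triple,
# e.g. "4096 131072 6291456" -> 6291456.
autotune_max() {
    echo "$1" | awk '{ print $3 }'
}

# Succeed when the autotuning max is at least twice the BDP (in bytes).
buffer_covers_bdp() {
    max=$(autotune_max "$1")
    [ "$max" -ge $(( $2 * 2 )) ]
}

# On a live host, with a 25 MB BDP:
#   buffer_covers_bdp "$(sysctl -n net.ipv4.tcp_wmem)" 25000000 \
#       && echo "send buffer ok" || echo "send buffer too small"
```

Run the check for tcp_wmem on the client and tcp_rmem on the target, since each side limits a different direction.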

Tune both ends or waste half of it

The receive buffer matters on whichever side is receiving the bulk data, which is your backup target during a backup. The send buffer matters on the client. Apply the snippet on the Proxmox Backup Server target and on the PVE node or client that pushes to it. Tuning one side leaves the other as the bottleneck.

BBR or CUBIC, honestly

CUBIC reads packet loss as the signal to back off. On a clean, low-loss path it grows its window until it fills the pipe and stays there. That describes most DE-to-DE fiber, where CUBIC is already doing the right thing and BBR will not beat it in any way you can measure.

BBR ignores loss as a primary signal and instead models the path's bottleneck bandwidth and minimum RTT, then paces to that estimate. It wins in two situations: paths with non-congestive loss, where a flaky long-haul segment makes CUBIC keep shrinking its window for losses that are not congestion, and bufferbloated paths, where CUBIC fills a deep buffer and drives latency up while BBR holds the queue short.

The honest caveats: BBR (v1) can be unfair to CUBIC flows sharing the same bottleneck, and it can overshoot a shallow-buffered hop and induce loss for everyone else on it. So do not deploy BBR as a reflex. Try it on a path where CUBIC underperforms despite buffers larger than the BDP, measure both, and keep whichever wins on your link. On a short clean path, leave CUBIC alone.

Reach for BBR

  • Non-congestive loss

    A flaky long-haul segment makes CUBIC keep shrinking its window for losses that are not congestion.

  • Bufferbloated paths

    CUBIC fills the deep buffer and drives latency up. BBR paces to the estimated bandwidth and holds the queue short.

  • High-BDP links with sporadic loss

    Where loss-based backoff repeatedly caps a single stream well below line rate.

Stay on CUBIC

  • Clean, low-loss fiber

    CUBIC already fills the pipe. On a short DE-to-DE path BBR gives nothing you can measure.

  • Shared shallow-buffer hop

    BBRv1 can overshoot and induce loss for everyone else on the bottleneck.

  • Fairness matters

    BBRv1 can starve CUBIC flows competing for the same link.
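
"Measure both" can be done per test run, without touching the system default: on Linux, iperf3 accepts -C to pick the congestion controller for that run, provided the module is loaded. The sketch below only encodes the decision. pick_cc is a made-up helper, and the 5% margin is an assumption meant to keep run-to-run noise from flipping the result.

```shell
#!/bin/sh
# Sketch only: pick_cc and its default margin are assumptions.
#
# Measure first, single stream, same path, same time of day:
#   iperf3 -c <target-ip> -t 30 -C cubic
#   iperf3 -c <target-ip> -t 30 -C bbr

# Arguments: CUBIC rate, BBR rate (both Mbit/s), optional margin in percent.
# BBR is kept only if it beats CUBIC by more than the margin.
pick_cc() {
    awk -v cubic="$1" -v bbr="$2" -v margin_pct="${3:-5}" 'BEGIN {
        if (bbr > cubic * (1 + margin_pct / 100)) print "bbr"
        else print "cubic"
    }'
}

pick_cc 310 890   # lossy long-haul path: prints bbr
pick_cc 935 940   # clean fiber, difference is noise: prints cubic
```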

MTU And Path MTU

Larger packets mean fewer packets for the same throughput, which means less per-packet overhead and fewer interrupts for the CPU to service. That is why jumbo frames help at multi-gigabit speeds.

On a LAN you control end to end, set MTU 9000 on the backup interface and on every switch and NIC in the path. The catch is "every": one hop at 1500 in the middle either fragments or drops your 9000-byte frames, and a silent MTU mismatch is worse than no jumbo frames at all. Test it before trusting it:

bash
# 8972 payload + 28 bytes header = 9000. If this passes, 9000 is clean.
ping -M do -s 8972 <target-ip>

# Standard path: 1472 + 28 = 1500
ping -M do -s 1472 <target-ip>
Probe end-to-end MTU with the don't-fragment bit

On the WAN you almost never get jumbo frames, because you do not own the path. Carrier equipment, PPPoE (1492), VLAN stacking, and tunnels all shave the usable MTU below 1500. Assume 1500 at most and verify; do not assume 9000 and hope. If your offsite link runs over WireGuard, the tunnel takes 60 bytes on IPv4 and 80 on IPv6, so the inner MTU lands around 1420; our notes on the WireGuard offsite tunnel cover that setup.
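
The MTU arithmetic is easy to get wrong by a few bytes, so here is a small sketch of it. The helper names are invented for this example. The overheads are the ones quoted above, plus the standard 20-byte IPv4 and 20-byte TCP headers without options.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative.

# Inner MTU of a WireGuard tunnel. $1 = outer MTU, $2 = IP version of the
# outer transport (4 or 6). Overhead: 60 bytes on IPv4, 80 on IPv6.
wg_inner_mtu() {
    case "$2" in
        4) echo $(( $1 - 60 )) ;;
        6) echo $(( $1 - 80 )) ;;
        *) echo "usage: wg_inner_mtu <outer-mtu> <4|6>" >&2; return 1 ;;
    esac
}

# Largest `ping -s` payload that fits an MTU over IPv4:
# MTU - 20 (IP header) - 8 (ICMP header).
ping_payload() { echo $(( $1 - 28 )); }

# TCP MSS for an MTU over IPv4: MTU - 20 (IP) - 20 (TCP), no options.
tcp_mss_v4() { echo $(( $1 - 40 )); }

wg_inner_mtu 1500 6   # 1420
ping_payload 1420     # 1392
tcp_mss_v4 1492       # 1452 on a PPPoE path
```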

The failure that eats hours is the path-MTU blackhole. Path MTU discovery depends on routers returning ICMP "fragmentation needed" (or ICMPv6 "packet too big"). When a firewall along the way drops that ICMP, the sender never learns to shrink its packets. Small packets pass; large ones with the don't-fragment bit set vanish silently. The signature in PBS is unmistakable once you have seen it: the TLS handshake completes, the backup session opens, small control calls succeed, and then the transfer hangs the instant large data segments start flowing. It looks like the backup froze at chunk upload. It is the network dropping your big packets without telling anyone.

tracepath finds it without root and reports the discovered pmtu and the hop where it changes:

Diagnosing a PMTU blackhole
$ tracepath -n 203.0.113.10
 1?: [LOCALHOST]      pmtu 1500
 1:  10.0.0.1         0.412ms
 2:  198.51.100.1     4.880ms pmtu 1492
 3:  no reply
 4:  no reply
 5:  203.0.113.10     31.2ms reached
     Resume: pmtu 1492
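
When the ICMP replies are filtered and tracepath cannot report a pmtu, you can find the working size by probing with the don't-fragment bit directly. The following is a sketch under assumptions: probe and find_max_payload are invented names, the ping flags are the iputils ones on Linux, and the search assumes small payloads pass and that every payload up to the limit passes.

```shell
#!/bin/sh
# Sketch only: function names are illustrative; assumes iputils ping (Linux).

# One don't-fragment ICMP probe. $1 = payload size, $2 = target.
probe() {
    ping -c 1 -W 1 -M do -s "$1" "$2" >/dev/null 2>&1
}

# Binary search for the largest payload that passes.
# $1 = target, $2 = lower bound assumed to pass (default 0),
# $3 = upper bound to try (default 1472, i.e. MTU 1500).
find_max_payload() {
    target=$1
    lo=${2:-0}
    hi=${3:-1472}
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if probe "$mid" "$target"; then
            lo=$mid
        else
            hi=$(( mid - 1 ))
        fi
    done
    echo "$lo"
}

# Usage: the path MTU over IPv4 is the result plus 28 header bytes.
#   payload=$(find_max_payload 203.0.113.10)
#   echo "path MTU: $(( payload + 28 ))"
```

A lost probe reads as "too big", so repeat the search if the result looks implausibly low on a lossy path.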

When you find one, the clean fix is MSS clamping on the gateway so TCP negotiates a segment size that fits the real path, instead of relying on ICMP that gets filtered:

bash
iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
Clamp MSS to the path MTU on the gateway

If the blackhole is on a tunnel you own, set the tunnel MTU explicitly to the working value instead and let TCP discover MSS from that.

NIC Offloads

GRO, GSO, and TSO let the system push fewer, larger units through the network stack and have the NIC or the lowest layer slice them into MTU-sized packets. Generic receive offload coalesces incoming segments; generic and TCP segmentation offload defer outbound slicing. The payoff is fewer trips through the stack per byte and lower CPU per gigabit. On modern hardware this is the difference between hitting 10 Gbit with a quiet CPU and pinning a core to do the same work by hand. Leave them on.

You can see and toggle them with ethtool:

bash
ethtool -k eth0 | grep -E 'generic-receive|generic-segmentation|tcp-segmentation'

# Only if you have a measured reason to:
ethtool -K eth0 gro off gso off tso off
Inspect and toggle offloads

The cases where offloads hurt are real but narrow, so do not turn them off as a ritual. On a CPU-starved or older small box a buggy offload engine can corrupt checksums or stall under load, and disabling the offload restores correctness at a CPU cost. If you are doing precise rate shaping with tc, GSO and TSO hand the qdisc super-packets up to 64 KiB, which wrecks the accuracy of the shaper, so shaping setups often disable segmentation offload deliberately. And a few virtio-plus-bridge combinations have historically shown GRO-related latency quirks. Outside those situations, defaults win. If you suspect an offload bug, the test is simple: run the backup with offloads off, compare, and let the number decide.

What You Can And Cannot Tune On The PBS Side

This is where expectations need managing. Proxmox Backup Server gives you fewer client-side throughput knobs than you would like, by design.

What exists is rate limiting, and it only ever slows things down to protect other traffic. vzdump takes --bwlimit in KiB/s, and sync jobs take a rate limit (rate-in for pull jobs), both covered in PBS performance tuning. Reach for these when a large initial run is starving production, not when you want to go faster.
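
Because --bwlimit is in KiB/s and links are quoted in Mbit/s, the conversion is where caps end up off by a factor of eight. A minimal sketch, with mbit_to_kib as a made-up helper name:

```shell
#!/bin/sh
# Sketch only: mbit_to_kib is an illustrative helper.

# Convert a cap in Mbit/s to KiB/s: Mbit/s * 1e6 / 8 bytes / 1024.
mbit_to_kib() {
    awk -v mbit="$1" 'BEGIN { printf "%.0f\n", mbit * 1e6 / 8 / 1024 }'
}

mbit_to_kib 500   # 61035 KiB/s, half of a 1 Gbit uplink

# Usage:
#   vzdump <vmid> --bwlimit "$(mbit_to_kib 500)"
```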

What does not exist is a parallelism dial. There is no supported thread-count or stream-count flag on proxmox-backup-client that you should build a strategy around. The client uses a small number of in-flight chunk uploads multiplexed over the one HTTP/2 connection, and as the protocol write-up notes, pushing parallelism higher rarely helps and can hurt a link that is already saturated. Proxmox deliberately does not hand you a thread count. So the honest answer to "how do I make PBS use more of my link" is not a PBS setting at all. It is: remove the per-flow limits the kernel and path impose (the buffer, congestion, MTU, and offload work above), shorten the RTT so the existing concurrency goes further, and feed the client fast enough source disk and CPU that chunking and hashing are not the bottleneck.

The client-side lever is RTT, not threads

On a high-latency path the biggest win is cutting round-trip time, which you do not change with a flag. The protocol already minimizes round trips with the known-chunks list. Everything left is the network underneath it and where the target sits.

Measuring Properly

You cannot tune what you measure wrong, and the most common measurement mistake is trusting a single iperf3 stream.

One TCP stream is governed by its own congestion window, its socket buffers, and any loss on the path. On a high-BDP or slightly lossy link that single flow can sit at 300 Mbit/s on a genuinely healthy 1 Gbit line, and you will walk away blaming the link. Run both a single stream and a parallel test:

bash
# On the target
iperf3 -s

# On the client: single stream for 30s
iperf3 -c <target-ip> -t 30

# Eight parallel streams reveal the link's real ceiling
iperf3 -c <target-ip> -t 30 -P 8

# Reverse direction (download to the client); test it too, uplinks are asymmetric
iperf3 -c <target-ip> -t 30 -P 8 -R
iperf3, single-stream and parallel, both directions

Now interpret the two numbers correctly for PBS, because this is the subtle part. The -P 8 aggregate is the link's true capacity. The single-stream number is the closer predictor of what Proxmox Backup Server will reach, because PBS rides one TCP connection. If single-stream is 300 Mbit/s and -P 8 is 940 Mbit/s, your link is fine and your problem is per-flow: fix the buffers, the congestion control, or the loss, and that single connection (and therefore PBS) climbs. If single-stream and -P 8 are both 300, the link itself is the ceiling and no kernel tuning will save you.
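
That reading of the two numbers fits in a few lines. This is a sketch only: classify_bottleneck is an invented name, and the 80% threshold is an assumption you should adjust to how noisy your measurements are.

```shell
#!/bin/sh
# Sketch only: function name and default threshold are assumptions.

# $1 = single-stream rate, $2 = parallel aggregate (both Mbit/s),
# $3 = optional threshold in percent (default 80).
# Prints "per-flow" when one stream reaches less than the threshold share
# of the aggregate, "link" when one stream already matches it.
classify_bottleneck() {
    awk -v single="$1" -v parallel="$2" -v ratio_pct="${3:-80}" 'BEGIN {
        if (parallel <= 0) { print "invalid"; exit 1 }
        if (single * 100 < parallel * ratio_pct) print "per-flow"
        else print "link"
    }'
}

classify_bottleneck 300 940   # per-flow: tune buffers, congestion control, loss
classify_bottleneck 300 300   # link: no kernel tuning will help
```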

Then read the backup's own throughput, and read it from the right job. PBS reports per-job rate in the task log, visible in the GUI or via proxmox-backup-manager task log <UPID>. The trap: an incremental backup reports an enormous "speed" because deduplication means almost nothing crossed the wire, so that figure is a dedup artifact, not a measurement of your WAN. To see real link throughput, look at an initial or full backup, or divide bytes written by duration for a low-dedup dataset.

Sometimes the tuning is done and the link really is the wall. Three honest cases, and what to do about each.

  • ~110 MB/s: real 1 Gbit upload ceiling. What a saturated 1 Gbit line actually moves after protocol and framing overhead.
  • 2+ days: 20 TB initial seed over 1 Gbit. Continuous transfer at that ceiling, before anything else touches the link.
  • Once: times geo-replication uses your uplink. Server-side replication carries each byte once, not once per downstream copy.

The first backup of a large dataset is bandwidth, full stop. A 20 TB initial backup over a perfectly saturated 1 Gbit upload at ~110 MB/s real throughput is more than two days of continuous transfer, and that is before anything else touches the link. For multi-terabyte onboarding, physical seed loading beats the wire: ship a disk, ingest it locally, then let incrementals run over the WAN. Initial seed loading for large datasets walks through that.
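
To price the wire for your own dataset, the estimate is one division. A sketch with a made-up helper name, using decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and assuming the link stays saturated for the whole run:

```shell
#!/bin/sh
# Sketch only: seed_hours is an illustrative helper.

# $1 = dataset size in TB, $2 = sustained throughput in MB/s.
# Prints the transfer time in hours.
seed_hours() {
    awk -v tb="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", tb * 1e12 / (rate * 1e6) / 3600 }'
}

seed_hours 20 110   # 50.5 hours, a little over two days
```

Feed it the single-stream rate you actually measured rather than the line rate, and the case for shipping a disk usually gets stronger.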

RTT is a multiplier on every round trip and it sets the BDP, so a target 80 ms away is working against you on a workload built from many small chunk negotiations. The fix is geographic: put the backup target near the data. That is what our edge locations are for, and the edge locations feature page lists where they sit. Lower RTT shrinks the BDP, eases the buffer pressure, and lets the protocol's fixed concurrency cover more ground.

Round-trip time is the multiplier. Inside the EU the path to a Frankfurt target is single-digit to low-double-digit milliseconds; reach across the planet and RTT dominates every chunk negotiation.

Once the first copy is offsite, a second copy in another region should not cost your uplink a second time. Server-to-server geo-replication moves data between our regions across our backbone, so your WAN carries each byte exactly once no matter how many copies live downstream.

Let the offsite leg sit closer to your data

remote-backups.com runs Proxmox Backup Server targets in EU edge locations with server-side geo-replication, so your uplink carries each byte once and the round trips stay short.


The Checklist

  • Run the TCP throughput calculator with your real upload bandwidth and RTT. Note the BDP and the single-stream ceiling before you touch anything.
  • Baseline with iperf3 single-stream and -P 8, in both directions. If single-stream sits far below -P 8, your problem is per-flow, not the link.
  • Drop the sysctl snippet on both the client and the Proxmox Backup Server target. Set tcp_rmem/tcp_wmem max comfortably above your BDP, then re-test.
  • Confirm the congestion controller actually loaded with sysctl net.ipv4.tcp_congestion_control. Keep BBR only if it measurably beats CUBIC on your path.
  • Use jumbo frames only on a LAN you fully control, and verify with ping -M do. On the WAN, assume 1500 and confirm nothing clamps below it.
  • If small transfers work but large ones hang, suspect a PMTU blackhole. Probe with tracepath and ping -M do, then clamp MSS at the gateway.
  • Leave NIC offloads on unless you are shaping with tc or you measured a bug on a small box.
  • Measure PBS throughput from a full or low-dedup job. Incrementals report inflated speed because dedup keeps the bytes off the wire.
  • For multi-terabyte first backups, price the wire against physical seed loading before you start.
  • When RTT is the wall, move the target closer and replicate server-side so your uplink carries the data once.
Bennet Gallein

remote-backups.com operator

Infrastructure enthusiast and founder of remote-backups.com. I build and operate reliable backup infrastructure powered by Proxmox Backup Server, so you can focus on what matters most: your data staying safe.