Tailscale exit node behind gluetun: an MTU black hole after a silent gluetun update

I run a self-hosted Mullvad exit node: a tailscale container sharing a gluetun container’s network namespace, so any device on my tailnet
can route out through Mullvad just by selecting it as an exit node. It ran perfectly for weeks — and then, without any warning, every page started taking 15 to 20 seconds to load.

The connection wasn’t actually down. Small requests returned instantly while large ones hung, which points to a specific class of networking problem rather than a dead link.

tailscale ping mullvad-de
pong from mullvad-de … in 43ms # direct
43 ms and a direct connection — not relayed.

The link to the exit node was healthy. Throughput told a different story:

curl -o /dev/null -w '%{speed_download}\n' \
https://speed.cloudflare.com/__down?bytes=10000000
12261 # ~12 KB/s, then timeout


Perfect latency next to collapsed bulk throughput is the signature of an MTU black hole: small packets slip through, oversized ones get
silently dropped, and TCP keeps backing off until everything crawls.
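A common way to confirm this kind of black hole is to send pings with the "don't fragment" bit set and see which sizes get a reply. The following is a minimal sketch rather than what I actually ran: it assumes a Linux client with iputils ping (the `-M do` flag), and both function names are made up for illustration.

```shell
# Sketch: find the largest don't-fragment ping that crosses a path.
# Assumes Linux iputils ping; function names are hypothetical.

# A ping payload sits inside an IPv4 header (20 bytes) and an ICMP header (8 bytes).
icmp_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Try each candidate MTU in the order given; print the first one that gets a reply.
probe_path_mtu() {
  local target=$1; shift
  local mtu
  for mtu in "$@"; do
    if ping -M do -c 1 -W 2 -s "$(icmp_payload_for_mtu "$mtu")" "$target" >/dev/null 2>&1; then
      echo "$mtu"
      return 0
    fi
  done
  return 1
}
```

Called as `probe_path_mtu <target> 1280 1240 1200`, with candidates listed largest first, it prints the largest size that made it through, or nothing if none did. The target is a placeholder for whichever host sits on the far side of the suspect hop.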

Gluetun’s logs showed the following:

[MTU discovery] reverting tun0 MTU to 1320 (PMTUD failed)

docker exec gluetun ip -o link show tun0
tun0: mtu 1320


I hadn’t changed the config, but gluetun was pinned to :latest and a nightly update had quietly pulled a new build. That build added active MTU discovery, which probed the path, failed, and clamped the tunnel down to 1320.

Because the exit node reaches my tailnet through the Mullvad tunnel, my desktop’s Tailscale WireGuard packets are themselves wrapped
inside Mullvad’s tunnel. A full Tailscale packet is about 1360 bytes once you add WireGuard’s overhead, and that whole thing has to fit
inside the carrier. With tun0 clamped to 1320, it no longer did.

Experimentally shrinking the MTU to 1280 did not help: throughput sat at about 12 KB/s and requests timed out. The carrier tunnel has to be at least as large as the packet it wraps, so the fix was to go up, not down.
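The size check can be written out as a few lines of shell arithmetic. This is only a sketch using the figures from this post: it assumes Tailscale's interface MTU is 1280 and counts 80 bytes of WireGuard overhead, which together give the roughly 1360-byte packet mentioned earlier. The function name is made up for illustration.

```shell
# Sketch of the nesting check, using the numbers from this post.
WG_OVERHEAD=80       # assumed worst case: 40 (IPv6) + 8 (UDP) + 32 (WireGuard)
TAILSCALE_MTU=1280   # assumed Tailscale interface MTU; 1280 + 80 = 1360

# Report whether an inner packet of MTU $1, once wrapped, fits a carrier of MTU $2.
fits_in_carrier() {
  outer=$(( $1 + WG_OVERHEAD ))
  if [ "$outer" -le "$2" ]; then
    echo "fits"
  else
    echo "too big by $(( outer - $2 ))"
  fi
}

fits_in_carrier "$TAILSCALE_MTU" 1320   # too big by 40
fits_in_carrier "$TAILSCALE_MTU" 1420   # fits
```

Lowering the carrier to 1280 only widens the gap, which is why shrinking made nothing better.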

The right value turned out to be 1420 — which is simply WireGuard’s documented default (1500 minus 80 bytes of overhead), and exactly what
the older gluetun had been using before the update. Setting it explicitly both restores that value and disables the discovery probe that caused the trouble. While I was there, I pinned the image so a future :latest can’t silently change this again:

environment:
  - WIREGUARD_MTU=1420
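For reference, the 80 bytes behind that 1420 break down as follows. This is a sketch of the arithmetic, assuming a 1500-byte underlying link and WireGuard's data-packet layout; the function name is hypothetical.

```shell
# WireGuard data-packet overhead:
# 4 (type + reserved) + 4 (receiver index) + 8 (counter) + 16 (auth tag) = 32 bytes
WG_HEADER=32
UDP_HEADER=8

# Print the tunnel MTU for a link MTU ($1) and outer IP version ($2: 4 or 6).
wg_mtu() {
  case $2 in
    4) ip_header=20 ;;
    6) ip_header=40 ;;
    *) echo "unknown IP version: $2" >&2; return 1 ;;
  esac
  echo $(( $1 - ip_header - UDP_HEADER - WG_HEADER ))
}

wg_mtu 1500 6   # 1420, sized for the larger IPv6 header
wg_mtu 1500 4   # 1440, if the outer transport is IPv4 only
```

The 1420 default budgets for the IPv6 case, so it stays safe whichever IP version the tunnel endpoint ends up using.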

Page loads dropped from 15–20 seconds to well under one — cloudflare.com from ~18s to 0.85s, google.com to 0.36s, and the 10 MB download that used to time out now completes.

