NAT Traversal¶
WireKube implements a multi-stage NAT traversal strategy inspired by Tailscale's approach. The core idea: establish relay connectivity immediately, probe for direct paths in parallel, and transparently upgrade when a better path is found.
NAT Types¶
| NAT Type | Mapping Behavior | WireGuard P2P | WireKube Strategy |
|---|---|---|---|
| Full Cone | Endpoint-Independent | Direct | STUN discovery |
| Restricted Cone | Endpoint-Independent | Direct (with keepalive) | STUN discovery |
| Port Restricted Cone | Endpoint-Independent | Usually works | STUN discovery |
| Symmetric (EDM) | Endpoint-Dependent | Fails | Relay fallback |
Why Symmetric NAT Breaks WireGuard¶
```mermaid
flowchart TB
    subgraph Node["Node (private: 10.0.0.5:51820)"]
        WG[WireGuard]
    end
    subgraph NAT["NAT (Symmetric)"]
        N[src port changes per destination]
    end
    subgraph STUN["STUN Servers"]
        A[STUN Server A<br/>sees: 1.2.3.4:50001]
        B[STUN Server B<br/>sees: 1.2.3.4:50002]
    end
    WG --> NAT
    NAT --> A
    NAT --> B
```
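In Symmetric NAT, the NAT gateway assigns a different external port for each
destination. STUN server A sees 1.2.3.4:50001, but that mapping is valid only
for traffic exchanged with server A. The node's packets to a peer leave from
yet another port, and a peer that sends to 1.2.3.4:50001 hits a mapping that
was never created for it, so the packet never arrives.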
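The difference between the two mapping behaviors can be illustrated with a toy model. This is a minimal sketch, not WireKube code: `SimulatedNAT` and `observed_endpoints` are hypothetical names, and the model covers port mapping only (no filtering rules or mapping timeouts).

```python
from itertools import count


class SimulatedNAT:
    """Toy NAT that models only how external ports are assigned."""

    def __init__(self, public_ip, symmetric, first_port=50001):
        self.public_ip = public_ip
        self.symmetric = symmetric
        self._ports = count(first_port)
        self._mappings = {}

    def map_outbound(self, src, dst):
        """Return the public (ip, port) that dst sees for a packet from src."""
        # Symmetric NAT keys the mapping on (source, destination);
        # cone NAT keys it on the source alone.
        key = (src, dst) if self.symmetric else (src,)
        if key not in self._mappings:
            self._mappings[key] = next(self._ports)
        return (self.public_ip, self._mappings[key])


def observed_endpoints(nat, node, destinations):
    """Public endpoint seen by each destination, in order."""
    return [nat.map_outbound(node, dst) for dst in destinations]
```

With a node at 10.0.0.5:51820 sending to two STUN servers and then a peer, the symmetric model yields ports 50001, 50002, and 50003, while the cone model yields 50001 for all three. The endpoint learned from STUN is only reusable in the cone case.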
Cloud Provider NAT Behavior¶
All major cloud NAT gateways use Symmetric NAT:
| Provider | NAT Product | NAT Type |
|---|---|---|
| AWS | NAT Gateway | Symmetric |
| GCP | Cloud NAT | Symmetric |
| Azure | Azure NAT Gateway | Symmetric |
| OCI | NAT Gateway | Symmetric |
Most home/ISP routers use Cone NAT (STUN-based P2P works).
When is relay needed?
Relay is needed only when both peers are behind Symmetric NAT. Cone ↔
Symmetric pairs achieve direct P2P: the Symmetric side initiates a
handshake to the Cone peer's stable STUN endpoint; the Cone NAT accepts
the packet (Endpoint-Independent Filtering), and WireGuard responds to
the actual source address. Each node publishes its natType in its
WireKubePeer status, so peers can determine the optimal transport path.
Traversal Strategy¶
```mermaid
graph TD
    A[Agent starts] --> B[Endpoint Discovery]
    B --> C{Manual annotation?}
    C -->|Yes| D[Use annotated endpoint]
    C -->|No| E[STUN binding to 2+ servers]
    E --> F{Same mapped port<br/>from all servers?}
    F -->|Yes: Cone NAT| G[Use STUN public IP:port]
    F -->|No: Symmetric NAT| H[Flag isSymmetricNAT=true<br/>Use STUN public IP with listen port]
    G --> I[Register WireKubePeer]
    D --> I
    H --> I
    I --> J[Configure WireGuard peers]
    J --> K{Symmetric NAT<br/>and peer also<br/>Symmetric?}
    K -->|Both Symmetric| L[Relay immediately]
    K -->|No / Cone peer| M{Handshake within<br/>timeout?}
    M -->|Yes| N[Direct P2P]
    M -->|No| L
    L --> O[Relay mode]
    O --> P[Periodic direct probe]
    P -->|Success| N
    P -->|Fail| O
```
Stage 1: Endpoint Discovery¶
The agent runs through the discovery chain on startup:
- Manual annotation (wirekube.io/endpoint): highest priority, no network calls
- STUN: binding request to 2+ configured STUN servers. If mapped ports differ between servers, the node is classified as Symmetric NAT.
- AWS IMDSv2: EC2 metadata service for Elastic IP lookup
- UPnP / NAT-PMP: request port mapping from gateway router
- Node InternalIP: last-resort fallback
For Symmetric NAT nodes, the agent uses the STUN-discovered public IP combined with the configured WireGuard listen port as its registered endpoint. The port won't match the actual NAT mapping, so inbound connection attempts to this endpoint fail and fall back to relay. The entry still gives peers a valid public IP, and direct connectivity remains possible when the Symmetric node itself initiates the handshake toward a Cone or public peer.
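The STUN classification and endpoint choice described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `classify_nat` and `registered_endpoint` are hypothetical names, the input is the list of mapped addresses already returned by the STUN servers, and treating fewer than two results as inconclusive is an assumption.

```python
from typing import List, Optional, Tuple

Endpoint = Tuple[str, int]


def classify_nat(mapped: List[Endpoint]) -> str:
    """Classify from the mapped addresses reported by 2+ STUN servers.

    Returns "cone", "symmetric", or "" when there are too few results.
    """
    if len(mapped) < 2:
        return ""  # assumption: one sample cannot reveal a port change
    ports = {port for _, port in mapped}
    return "cone" if len(ports) == 1 else "symmetric"


def registered_endpoint(mapped: List[Endpoint], listen_port: int) -> Optional[Endpoint]:
    """Pick the endpoint to publish, following the rules in the text."""
    if not mapped:
        return None  # caller moves on to the next discovery method
    public_ip = mapped[0][0]
    if classify_nat(mapped) == "symmetric":
        # The mapped port is destination-specific, so publish the
        # public IP with the configured WireGuard listen port instead.
        return (public_ip, listen_port)
    return mapped[0]
```

For mapped results 1.2.3.4:50001 and 1.2.3.4:50002 with listen port 51820, the sketch classifies the node as symmetric and publishes 1.2.3.4:51820.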
Stage 2: Direct P2P or Relay¶
After endpoint discovery:
- Cone NAT / Public IP: Agent configures WireGuard with the peer's discovered endpoint and waits for a handshake.
- Symmetric NAT → Cone/Public peer: Agent tries direct. The Symmetric side initiates a handshake to the Cone peer's stable endpoint. Cone NAT accepts the incoming packet, WireGuard responds to the actual source address, and a bidirectional tunnel is established.
- Symmetric NAT → Symmetric NAT peer: Relay is activated immediately, because both sides change ports per destination and direct P2P is impossible without a birthday attack. The peer's natType field in its WireKubePeer status is used to make this decision.
- Handshake timeout: If any peer's handshake doesn't complete within handshakeTimeoutSeconds (default 30s), relay is activated for that peer.
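The Stage 2 decision can be sketched as two small functions. This is a sketch under assumptions, not the agent's actual code: the function names, the "pending" state, and the use of plain timestamps (seconds) for the handshake check are all hypothetical.

```python
HANDSHAKE_TIMEOUT_SECONDS = 30  # default handshakeTimeoutSeconds per the text


def initial_transport(local_nat: str, peer_nat: str) -> str:
    """Relay immediately only when both sides report Symmetric NAT."""
    if local_nat == "symmetric" and peer_nat == "symmetric":
        return "relay"
    return "direct"


def transport_after_wait(configured_at: float, last_handshake: float,
                         now: float,
                         timeout: float = HANDSHAKE_TIMEOUT_SECONDS) -> str:
    """Evaluate a peer that started on the direct path.

    last_handshake is 0.0 when WireGuard has never completed one.
    """
    if last_handshake >= configured_at:
        return "direct"   # handshake completed after we configured the peer
    if now - configured_at >= timeout:
        return "relay"    # timed out: activate relay for this peer
    return "pending"      # keep waiting
```

A Symmetric node paired with a Cone peer starts as "direct" and only moves to "relay" if no handshake is observed within the timeout.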
ICE-like Negotiation¶
WireKube implements an ICE-like (Interactive Connectivity Establishment) negotiation protocol to optimize peer connectivity. Unlike full ICE/STUN/TURN, it leverages WireGuard's built-in handshake as the connectivity check and Kubernetes CRDs as the signaling channel.
Candidate Types¶
Each agent gathers connectivity candidates and publishes them in its WireKubePeer status:
| Type | Description | Priority |
|---|---|---|
| host | Node's internal/LAN IP + WG listen port | 100 |
| srflx | STUN-discovered public endpoint | 200 (cone) / 50 (symmetric) |
| relay | Relay server available | 10 |
| prflx | WireGuard-observed endpoint (learned during handshake) | n/a |
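A possible in-memory shape for these candidates is sketched below. The `Candidate` class and both helper functions are hypothetical and do not reflect the actual CRD schema. The sketch also assumes that a higher number means a more preferred candidate (as in ICE) and that any NAT type other than cone gets the lower srflx priority.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    type: str                # "host", "srflx", "relay", or "prflx"
    address: str             # "ip:port"; left empty for relay in this sketch
    priority: Optional[int]  # None where the table lists no priority


def gather_candidates(internal_ip: str, listen_port: int, stun_endpoint: str,
                      nat_type: str, relay_available: bool) -> List[Candidate]:
    """Build the candidate list using the priorities from the table."""
    candidates = [Candidate("host", f"{internal_ip}:{listen_port}", 100)]
    if stun_endpoint:
        srflx_priority = 200 if nat_type == "cone" else 50
        candidates.append(Candidate("srflx", stun_endpoint, srflx_priority))
    if relay_available:
        candidates.append(Candidate("relay", "", 10))
    return candidates


def by_preference(candidates: List[Candidate]) -> List[Candidate]:
    """Highest priority first; candidates without a priority sort last."""
    return sorted(candidates,
                  key=lambda c: (c.priority is None, -(c.priority or 0)))
```

Under these assumptions a cone node prefers srflx over host, while a symmetric node prefers host over srflx, with relay last in both cases.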
NAT Type Matrix¶
The agent evaluates the NAT type of both sides to select the optimal strategy:
| Local NAT | Peer NAT | Strategy |
|---|---|---|
| Cone | Cone | Direct probe via STUN endpoints (high success rate) |
| Cone | Symmetric | Probe with cone's stable endpoint; symmetric peer's keepalive opens pinhole |
| Symmetric | Cone | Probe peer's stable endpoint; our NAT creates mapping for response |
| Symmetric | Symmetric | Birthday attack (if enabled) or relay fallback |
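The matrix above maps naturally onto a small lookup function. In this sketch the function name and the returned strategy labels are hypothetical, and treating an empty or unknown NAT type like cone is an assumption.

```python
def select_strategy(local_nat: str, peer_nat: str,
                    birthday_enabled: bool = False) -> str:
    """Pick a traversal strategy from the NAT types of both sides."""
    local_sym = local_nat == "symmetric"
    peer_sym = peer_nat == "symmetric"
    if local_sym and peer_sym:
        # No stable endpoint on either side.
        return "birthday-attack" if birthday_enabled else "relay"
    if local_sym:
        # We initiate; our NAT creates the mapping for the response.
        return "probe-peer-stable-endpoint"
    if peer_sym:
        # Our endpoint is stable; the peer's keepalive opens the pinhole.
        return "await-peer-on-stable-endpoint"
    return "probe-stun-endpoints"
```

Only the Symmetric/Symmetric row depends on the birthday attack setting. Every other combination has at least one stable endpoint to aim at.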
Same-NAT Detection¶
When two peers share the same public IP (behind the same NAT gateway), STUN
endpoints are unreliable — the NAT can only forward a given external port to one
internal host. WireKube detects this by comparing STUN-discovered IPs and
switches to the peer's host candidate (internal LAN IP) for direct
communication. If the host candidate probe fails (e.g., peers are in different
VPCs sharing a NAT gateway), it falls back to relay.
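The same-NAT check and its fallback can be sketched as below. The function names and returned labels are hypothetical. Comparing addresses through Python's `ipaddress` module is a choice made for this sketch, so that differently formatted but equal addresses still match.

```python
import ipaddress
from typing import Optional, Tuple


def behind_same_nat(local_public_ip: str, peer_public_ip: str) -> bool:
    """True when both STUN-discovered public IPs are the same address."""
    if not local_public_ip or not peer_public_ip:
        return False
    return (ipaddress.ip_address(local_public_ip)
            == ipaddress.ip_address(peer_public_ip))


def choose_path(local_public_ip: str, peer_public_ip: str,
                peer_host: str, peer_srflx: str,
                host_probe_ok: bool) -> Tuple[str, Optional[str]]:
    """Return (action, endpoint) for a peer.

    host_probe_ok is the outcome of probing the peer's host candidate and
    is only consulted when both peers share a public IP.
    """
    if not behind_same_nat(local_public_ip, peer_public_ip):
        return ("probe-srflx", peer_srflx)
    if host_probe_ok:
        return ("direct-host", peer_host)
    # Same public IP but no LAN reachability, e.g. different VPCs.
    return ("relay", None)
```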
Birthday Attack (Symmetric ↔ Symmetric)¶
For two Symmetric NAT peers, no direct path exists through normal means. The birthday attack opens many UDP sockets simultaneously (256 by default), sending probes to predicted port ranges on the peer's public IP. With enough probes, the probability of finding a matching port pair is approximately 1 − e^(−n²/(2k)), where n is the number of probes and k is the size of the candidate port range.
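To get a feel for the numbers, the approximation can be evaluated directly. The sketch below assumes k = 65536 (the full 16-bit port space), which the text does not specify. Both function names are hypothetical.

```python
import math


def birthday_success_probability(n: int, k: int) -> float:
    """Approximate probability 1 - e^(-n^2 / (2k)) from the text."""
    return 1.0 - math.exp(-(n * n) / (2.0 * k))


def probes_for_probability(p: float, k: int) -> int:
    """Smallest n for which the approximation reaches probability p."""
    # Solve 1 - e^(-n^2 / (2k)) >= p for n.
    return math.ceil(math.sqrt(-2.0 * k * math.log(1.0 - p)))
```

With the default of 256 probes and k = 65536, the exponent is 256² / (2 · 65536) = 0.5, giving a success probability of about 0.39. Reaching 0.5 under the same assumption takes 302 probes.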
Birthday attack considerations
Some NAT gateways may interpret the burst of UDP probes as a port-scanning attack and temporarily block the source. Birthday attack is disabled by default and can be enabled via:
- Cluster-wide: WireKubeMesh.spec.natTraversal.birthdayAttack: enabled
- Per-peer override: annotation wirekube.io/birthday-attack: enabled on the WireKubePeer
Priority: peer annotation > mesh global > default (disabled).
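The precedence rule can be sketched as a small resolver. The function name is hypothetical, and treating any value other than "enabled" as disabled is an assumption, since the text only documents the "enabled" value.

```python
from typing import Mapping, Optional

BIRTHDAY_ANNOTATION = "wirekube.io/birthday-attack"


def birthday_attack_enabled(peer_annotations: Mapping[str, str],
                            mesh_setting: Optional[str]) -> bool:
    """Resolve the setting: peer annotation > mesh global > disabled."""
    value = peer_annotations.get(BIRTHDAY_ANNOTATION)
    if value is None:
        value = mesh_setting  # fall back to the cluster-wide setting
    # Assumption: anything other than "enabled" counts as disabled.
    return value == "enabled"
```

Under this reading, a per-peer annotation with a non-"enabled" value overrides a mesh that has the feature enabled.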
Endpoint Reflection¶
When a direct connection succeeds, the WireGuard kernel module learns the peer's actual NAT-mapped endpoint (which may differ from the STUN-discovered one). The agent detects this change and patches the WireKubePeer CRD so other nodes also learn the correct endpoint. To prevent flapping with Symmetric NAT ports, endpoint updates for same-IP-different-port cases are only applied when ICE confirms the connection is stable.
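The anti-flapping rule can be expressed as a single predicate. This is a sketch with a hypothetical function name. It assumes endpoints are "ip:port" strings and that a change of IP is always published, which the text implies but does not state outright.

```python
def should_patch_endpoint(published: str, observed: str,
                          ice_stable: bool) -> bool:
    """Decide whether to publish the endpoint WireGuard observed."""
    if observed == published:
        return False  # nothing changed
    # rpartition splits on the last ":", so "[::1]:51820" also works.
    published_ip = published.rpartition(":")[0]
    observed_ip = observed.rpartition(":")[0]
    if published_ip == observed_ip:
        # Same IP, different port: only trust it once ICE reports
        # the connection as stable.
        return ice_stable
    return True
```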
Stage 3: Relay Fallback¶
When relay is activated for a peer:
- Agent connects to the relay server (or relay pool) via TCP
- Registers its WireGuard public key with the relay
- Creates a local UDP proxy (127.0.0.1:random → 127.0.0.1:<wg-port>)
- Sets the peer's WireGuard endpoint to the proxy's local address
- All subsequent WireGuard traffic for this peer routes through the relay
The relay connection auto-reconnects with exponential backoff (1s–30s) if the TCP connection drops. Existing UDP proxies are preserved across reconnections.
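A reconnect schedule matching the 1s to 30s range might look like the sketch below. The doubling factor is an assumption, since the text states only the bounds, and the generator name is hypothetical. Jitter is omitted for clarity.

```python
from typing import Iterator


def backoff_delays(initial: float = 1.0, maximum: float = 30.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield reconnect delays in seconds, capped at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)
```

Taking the first seven values yields 1, 2, 4, 8, 16, 30, 30 seconds. A caller would create a fresh generator after each successful connection so the next outage starts again at one second.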
Stage 4: Direct Path Recovery¶
Every directRetryIntervalSeconds (default 120s), the agent probes relayed
peers to check if direct connectivity has become available:
- Temporarily set the peer's WireGuard endpoint back to the direct address
- Wait for the next sync cycle to check WireGuard stats
- If a successful handshake is detected on the non-proxy endpoint → upgrade to direct
- If no handshake → cancel probe, resume relay, wait for next retry interval
Skipping futile probes
The agent skips direct probes for peers whose WireKubePeer.Status.NATType
is symmetric when the local node is also Symmetric NAT. This prevents
wasting cycles probing paths that cannot succeed (both sides use
endpoint-dependent mapping).
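The probe schedule and its outcome check can be sketched as two functions. All names here are hypothetical, timestamps are plain seconds, and comparing the handshake endpoint against the proxy address is one reading of "handshake detected on the non-proxy endpoint".

```python
DIRECT_RETRY_INTERVAL_SECONDS = 120  # default directRetryIntervalSeconds


def should_probe(local_nat: str, peer_nat: str, last_probe_at: float,
                 now: float,
                 interval: float = DIRECT_RETRY_INTERVAL_SECONDS) -> bool:
    """Decide whether a relayed peer is due for a direct probe."""
    if local_nat == "symmetric" and peer_nat == "symmetric":
        return False  # futile: neither side has a stable mapping
    return now - last_probe_at >= interval


def probe_outcome(current_endpoint: str, proxy_endpoint: str,
                  last_handshake: float, probe_started: float) -> str:
    """Evaluate on the sync cycle after switching to the direct address."""
    fresh_handshake = last_handshake > probe_started
    if fresh_handshake and current_endpoint != proxy_endpoint:
        return "direct"  # upgrade: handshake arrived on the direct path
    return "relay"       # cancel the probe and resume relay
```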
Transport Modes and NAT Type Reporting¶
Each agent publishes three transport-related fields in its WireKubePeer status:
natType: The node's detected NAT mapping behavior (cone, symmetric,
or empty if detection was inconclusive). Other agents use this to decide whether
direct P2P is possible.
peerTransports: A per-peer map recording the transport mode to each
remote peer (e.g., {"node-worker1": "direct", "node-worker7": "relay"}).
This gives full visibility into which paths use relay.
transportMode: Aggregate derived from peerTransports:
| Mode | Meaning |
|---|---|
| direct | All peers connected via direct P2P |
| relay | All peers via relay |
| mixed | Some peers direct, some relayed |
Both natType and transportMode appear as kubectl print columns (NAT, Mode)
for quick inspection: kubectl get wirekubepeers.
Each agent only updates its own node's status. This prevents conflicting updates from multiple agents and eliminates status flapping.
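The aggregation from peerTransports to transportMode can be sketched in a few lines. The function name is hypothetical, and returning an empty string when there are no peers is an assumption, since the table does not cover that case.

```python
from typing import Mapping


def transport_mode(peer_transports: Mapping[str, str]) -> str:
    """Derive the aggregate mode from the per-peer transport map."""
    modes = set(peer_transports.values())
    if not modes:
        return ""  # assumption: no peers means no aggregate mode
    if modes == {"direct"}:
        return "direct"
    if modes == {"relay"}:
        return "relay"
    return "mixed"
```

For the example map {"node-worker1": "direct", "node-worker7": "relay"}, the result is "mixed".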
Relay Protocol¶
See Relay System for the full relay protocol specification.
Performance¶
| Scenario | Typical Latency | Notes |
|---|---|---|
| Direct P2P (same VPC) | 0.5 – 2 ms | WireGuard overhead only |
| Relay (same region) | 1.5 – 3 ms | Added TCP hop through relay |
| Relay (cross-region) | 40 – 60 ms | Dominated by geographic distance |
The relay adds minimal latency within the same region because it only introduces one additional TCP hop (agent ↔ relay ↔ agent).