# HydraGuard Runbook

WireGuard mesh connecting venues, air units, and cloud infrastructure to the Hydra platform.

## Infrastructure

| Resource | Value |
|----------|-------|
| **Hub server** | `141.227.136.12` (OVHcloud b3-8, Brussels EU-WEST-LZ-BRU-A) |
| **Hub DNS** | `hydraguard.experiencenet.com` |
| **Hub WG address** | `10.10.0.1/24` |
| **Hub public key** | `VGA6ETZB2XFVRRb5KmcFvQ+Ybfh9KKfcWuXfP1IuvQE=` |
| **WireGuard port** | `51820/udp` |
| **Mesh file** | `/root/.hydraguard/mesh.yaml` |
| **Hub private key** | `/etc/wireguard/hub.key` |
| **WG config (generated)** | `/etc/wireguard/wg0.conf` |
| **API server** | `http://hydraguard.experiencenet.com:8081` |
| **API config** | `/root/.hydraguard/api.yaml` |
| **Requests store** | `/root/.hydraguard/requests.yaml` |
| **Audit log** | `/root/.hydraguard/audit.log` |
| **Service (API)** | `systemctl status hydraguard` |
| **Service (WireGuard)** | `systemctl status wg-quick@wg0` |
| **Logs** | `journalctl -u hydraguard -f` |
| **SSH** | `ubuntu@141.227.136.12` |
| **Shared with** | hydraneckwebrtc controller + worker on the same instance |

> **API warning**: Keep the HydraGuard API service **stopped** (`systemctl stop hydraguard`) when not actively enrolling new peers. The `auto_apply: true` setting causes `/api/v1/air/provision` to regenerate peer keypairs on every call, breaking existing connections. See [issue #64](https://issues.experiencenet.com/issues/64).

### Previous hub (pending decommission)

| Resource | Value |
|----------|-------|
| **Server** | `89.167.57.232` (Hetzner cx23, Falkenstein) |
| **Status** | WireGuard stopped, server still running |
| **Action** | Decommission after all peers validated on Brussels |

## Current Mesh

| Peer | Type | WG Address | LAN | Guard | Notes |
|------|------|------------|-----|-------|-------|
| AD6 | venue | 10.10.1.1/32 | 10.0.0.0/24 | omada | Overijse |
| air-001 | air | 10.10.100.1/32 | -- | -- | |
| air-tvl-one | air | 10.10.100.2/32 | -- | -- | |
| air-cederiks24 | air | 10.10.100.3/32 | -- | -- | |
| air-hydraneckwebrtc | air | 10.10.100.4/32 | -- | -- | Old Hetzner neckwebrtc (retired) |
| air-hydra-0000 | air | 10.10.100.5/32 | -- | -- | |
| air-sneaky-squid-86 | air | 10.10.100.6/32 | -- | -- | bxl1 body |
| air-boom-pickle-38 | air | 10.10.100.7/32 | -- | -- | bxl1 body |
| air-wobbly-llama-92 | air | 10.10.100.8/32 | -- | -- | bxl1 body |

## Address Scheme

| Type | WG tunnel range | LAN range | Capacity |
|------|----------------|-----------|----------|
| Hub | 10.10.0.1/24 | -- | 1 |
| Venues | 10.10.1-49.1/32 | 10.0.X.0/24 (auto or custom) | 49 |
| Neck Air | 10.10.50-99.1/32 | 10.0.X.0/24 | 50 |
| Hydra Air | 10.10.100.1-254/32 | -- (no LAN) | 254 |
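
The ranges above can be checked mechanically. The following is a minimal sketch, assuming the table is the complete scheme; `peer_type_for_addr` is a hypothetical helper name and not part of the `hydraguard` CLI.

```bash
# Classify a WireGuard tunnel address according to the address scheme table.
# Hypothetical helper, written for illustration only.
peer_type_for_addr() {
  addr=${1%/*}                          # drop an optional prefix length such as /32
  case $addr in
    10.10.*.*) rest=${addr#10.10.} ;;
    *) echo "unknown"; return 1 ;;
  esac
  c=${rest%%.*}                         # third octet
  d=${rest#*.}                          # fourth octet
  case $c in ''|*[!0-9]*) echo "unknown"; return 1 ;; esac
  case $d in ''|*[!0-9]*) echo "unknown"; return 1 ;; esac
  if   [ "$c" -eq 0 ] && [ "$d" -eq 1 ]; then echo "hub"
  elif [ "$d" -eq 1 ] && [ "$c" -ge 1 ]  && [ "$c" -le 49 ]; then echo "venue"
  elif [ "$d" -eq 1 ] && [ "$c" -ge 50 ] && [ "$c" -le 99 ]; then echo "neck-air"
  elif [ "$c" -eq 100 ] && [ "$d" -ge 1 ] && [ "$d" -le 254 ]; then echo "hydra-air"
  else echo "unknown"; return 1
  fi
}
```

For example, `peer_type_for_addr 10.10.100.6/32` prints `hydra-air`, and an address outside the scheme prints `unknown` with a non-zero exit status.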

## SSH Access

```bash
ssh ubuntu@141.227.136.12
```

## Health Check

```bash
curl -s http://hydraguard.experiencenet.com:8081/api/v1/health
```

This endpoint only responds while the API service is running. Per the API warning above, the service is normally stopped, so a failed health check does not by itself indicate a fault.

---

## Operations

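All commands run on the hub server unless stated otherwise. The mesh file and WireGuard config live under `/root` and `/etc/wireguard`, so run them as root (for example with `sudo`).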

### Check status

```bash
hydraguard status
```

Example output:
```
PEER               TYPE             ADDRESS         HANDSHAKE       TRANSFER
AD6                venue/omada      10.10.1.1       12s ago         4.86 KiB / 3.78 KiB
air-hydraneckwebrtc air              10.10.100.4     53s ago         4.41 KiB / 2.48 KiB
air-001            air              10.10.100.1     -- (offline)    0 / 0
```

- **Handshake "X ago"** = peer is online and connected
- **"-- (offline)"** = no recent handshake, peer is unreachable
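
To list only the offline peers, the status output can be filtered. This is a sketch that assumes the column layout shown in the example output above; `offline_peers` is a hypothetical helper, not a `hydraguard` subcommand.

```bash
# Print the names of peers reported as offline.
# Reads `hydraguard status` output on stdin and skips the header line.
offline_peers() {
  awk 'NR > 1 && /\(offline\)/ { print $1 }'
}
```

Usage on the hub: `hydraguard status | offline_peers`.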

### Raw WireGuard status

```bash
wg show wg0
```

Shows endpoints, allowed IPs, transfer bytes, and last handshake per peer.
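
For scripting, `wg show wg0 latest-handshakes` prints one public key and Unix timestamp per peer (0 means no handshake yet). The sketch below flags stale peers from that output. The 180-second default is an assumption chosen for illustration, and `stale_peers` is a hypothetical helper.

```bash
# List peers whose last handshake is older than a threshold.
# stdin: output of `wg show wg0 latest-handshakes`
# $1: maximum age in seconds (default 180, an assumed value)
# $2: current Unix time (defaults to now; can be overridden for testing)
stale_peers() {
  max_age=${1:-180}
  now=${2:-$(date +%s)}
  awk -v max="$max_age" -v now="$now" '
    $2 == 0        { print $1, "never"; next }       # no handshake recorded
    now - $2 > max { print $1, (now - $2) "s" }      # handshake too old
  '
}
```

Usage on the hub: `wg show wg0 latest-handshakes | stale_peers`. The output identifies peers by public key, so map keys back to peer names via `mesh.yaml`.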

### View logs

```bash
journalctl -u hydraguard -f              # Follow live
journalctl -u hydraguard -n 100 --no-pager  # Last 100 lines
```

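### Restart

```bash
systemctl restart hydraguard    # Restarts the API service only
```

The WireGuard interface runs as a separate unit (`wg-quick@wg0`) and is not affected. If you are not enrolling peers, stop the API again afterwards (see the API warning above).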

### Update

```bash
hydraguard check-update    # Check if a new version is available
hydraguard update          # Download and install the latest version
```

Never deploy binaries by hand. Always use the release pipeline (tag + push to trigger CI; see Releasing).

---

## Adding Peers

Every `add` command:
1. Generates a WireGuard keypair
2. Stores the public key in `mesh.yaml`
3. Prints the private key to stdout (save it; it is shown only once)
4. Auto-assigns the next available address

**After adding any peer, always run `hydraguard apply`.**
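
Step 4 (address auto-assignment) can be pictured as follows for the Hydra Air range. This is an illustration only: it picks the lowest unused host number, which may not match HydraGuard's actual allocator, and `next_air_addr` is a hypothetical name.

```bash
# Print the lowest unused Hydra Air address in 10.10.100.1-254.
# stdin: addresses already in use, one per line, with or without a /32 suffix.
next_air_addr() {
  awk -F'[./]' '
    $1 == 10 && $2 == 10 && $3 == 100 { used[$4 + 0] = 1 }
    END {
      for (h = 1; h <= 254; h++)
        if (!(h in used)) { print "10.10.100." h "/32"; exit 0 }
      print "no free address in 10.10.100.1-254" > "/dev/stderr"
      exit 1
    }'
}
```

Fed the eight Hydra Air addresses from the Current Mesh table, it prints `10.10.100.9/32`.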

### Add a Venue

```bash
hydraguard venue add <name> --location <city> --guard <omada|citymesh|linuxvm|gateway> [--lan <cidr>]
hydraguard apply
```

Guard types:

| Guard type | Use case | Notes |
|-----------|----------|-------|
| `omada` | TP-Link Omada ER605/ER7212 | Configured via Omada SDN Controller API |
| `citymesh` | Citymesh Guard (Mikrotik) | Bare WireGuard config |
| `linuxvm` | Linux VM gateway (Azure/GCP/AWS) | Adds PostUp for IP forwarding and masquerade |
| `gateway` | On-prem LAN gateway (behind FortiGate) | iptables `-I FORWARD 1` (priority), masquerade, MSS clamping |

### Add a Hydra Air Unit

Standalone render nodes with WireGuard running directly on Windows.

```bash
hydraguard air add <id>
hydraguard apply
hydraguard air config <id>    # Get Windows .conf
```

### Add a Neck Air Unit

Mobile venue-in-a-box setups with a Mikrotik router.

```bash
hydraguard neckair add <id>
hydraguard apply
hydraguard neckair config <id>    # Get Mikrotik .conf
```

### Get a peer config

```bash
hydraguard venue config <name>
hydraguard air config <id>
hydraguard neckair config <id>
```

### Removing peers

```bash
hydraguard venue remove <name> && hydraguard apply
hydraguard air remove <id> && hydraguard apply
hydraguard neckair remove <id> && hydraguard apply
```

The peer is instantly unreachable after `apply`.

---

## Applying Changes

```bash
hydraguard apply
```

This regenerates `/etc/wireguard/wg0.conf` and runs `wg syncconf` to hot-reload. Existing connections are not disrupted. If wg0 is not up, it runs `wg-quick up wg0` instead.
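
The reload decision described above can be sketched by hand as follows. This is not HydraGuard's code; `reload_wg` and the `DRY_RUN` variable are hypothetical, and the interface check assumes a Linux host with `/sys/class/net`. By default the helper only prints what it would run.

```bash
# reload_wg [iface]: hypothetical helper, not part of the hydraguard CLI.
# Prints the command it would run; set DRY_RUN=0 to execute it (as root).
reload_wg() {
  iface=${1:-wg0}
  if [ -d "/sys/class/net/$iface" ]; then
    # Interface exists: hot-reload peers without tearing down sessions.
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "wg syncconf $iface <stripped $iface config>"
    else
      stripped=$(mktemp) || return 1
      # wg does not understand wg-quick-only keys, so strip them first.
      wg-quick strip "$iface" > "$stripped" && wg syncconf "$iface" "$stripped"
      rc=$?
      rm -f "$stripped"
      return "$rc"
    fi
  else
    # Interface missing: bring it up from /etc/wireguard/<iface>.conf.
    if [ "${DRY_RUN:-1}" = 1 ]; then
      echo "wg-quick up $iface"
    else
      wg-quick up "$iface"
    fi
  fi
}
```

`wg syncconf` only changes peers that differ from the running state, which is why existing sessions survive. In normal operation, prefer `hydraguard apply`, which also regenerates the config from `mesh.yaml`.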

### Full restart (when syncconf is not enough)

```bash
wg-quick down wg0
wg-quick up wg0
```

After a full restart, peers behind NAT need up to 25 seconds to re-establish their handshake (PersistentKeepalive interval).

---

## Self-Registration API

Peers can register themselves via the HTTP API instead of requiring SSH access.

### Workflow

1. Client generates a WireGuard keypair locally
2. Client submits public key via `POST /api/v1/register` (requires API bearer token)
3. Request appears as "pending" in `requests.yaml`
4. Admin reviews and approves via CLI
5. Client polls for approval, then fetches its WireGuard config
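
Steps 1 and 2 might look like this on the client. Treat it as a sketch under stated assumptions: the JSON field names `name` and `public_key`, the `API_TOKEN` environment variable, and both function names are hypothetical, since the request schema is not documented here. Only the endpoint path and the bearer-token requirement come from the workflow above.

```bash
# Hypothetical client-side sketch of steps 1 and 2.
# Assumptions: API_TOKEN is set in the environment, and the request body
# uses the JSON fields "name" and "public_key" (the real schema may differ).
API_BASE='http://hydraguard.experiencenet.com:8081'

register_payload() {
  # $1: peer name, $2: WireGuard public key
  printf '{"name":"%s","public_key":"%s"}\n' "$1" "$2"
}

register_peer() {
  name=$1
  # Step 1: generate the keypair locally; the private key never leaves the client.
  ( umask 077 && wg genkey > "$name.key" ) || return 1
  pubkey=$(wg pubkey < "$name.key") || return 1
  # Step 2: submit only the public key, authenticated with the bearer token.
  curl -fsS -X POST "$API_BASE/api/v1/register" \
    -H "Authorization: Bearer $API_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "$(register_payload "$name" "$pubkey")"
}
```

Generating the key on the client means the hub only ever sees the public half, unlike the CLI `add` commands, which print a private key on the hub.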

### Managing requests

```bash
hydraguard requests list              # Show pending
hydraguard requests list --all        # Show all (including approved/denied)
hydraguard requests approve <id>      # Approve, adds peer to mesh
hydraguard requests deny <id>
hydraguard requests delete <id>
```

When `--auto-apply` is enabled, the hub config is automatically updated after approval.

---

## Backup

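The most important file is `mesh.yaml`. Back it up:

```bash
# On the hub: keep a local copy next to the original
sudo cp /root/.hydraguard/mesh.yaml /root/.hydraguard/mesh.yaml.bak
# From your workstation: pull a dated copy (assumes the ubuntu user can sudo, since the file is under /root)
ssh ubuntu@141.227.136.12 'sudo cat /root/.hydraguard/mesh.yaml' > ./mesh-backup-$(date +%Y%m%d).yaml
```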

The private key at `/etc/wireguard/hub.key` should also be backed up securely. If lost, you need to regenerate it and update all peer configs with the new public key.

hydrabackup also backs up `mesh.yaml` to hydramirror automatically.
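
For manual backups, a small wrapper can timestamp the copy and confirm it matches the source. This is a sketch; `backup_mesh` is a hypothetical helper and the file naming is an assumption.

```bash
# backup_mesh <source> <dest-dir>: copy the mesh file to a dated name and verify it.
# Prints the path of the backup on success.
backup_mesh() {
  src=$1
  dest="$2/mesh-backup-$(date +%Y%m%d-%H%M%S).yaml"
  cp -p "$src" "$dest" || return 1
  # Compare byte for byte so a truncated copy is caught immediately.
  cmp -s "$src" "$dest" || { echo "backup differs from source: $dest" >&2; return 1; }
  echo "$dest"
}
```

Usage on the hub, as root: `backup_mesh /root/.hydraguard/mesh.yaml /root/.hydraguard`.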

---

## Troubleshooting

### Peer shows "offline" / no handshake

1. **Check firewall on hub:** `ufw status` -- port 51820/udp must be open
2. **Check peer's internet:** Can the peer reach the internet?
3. **Verify keys match:** The peer's config must have the hub's public key, and the hub's mesh.yaml must have the peer's public key
4. **Check PersistentKeepalive:** Must be 25 in peer configs (HydraGuard sets this automatically)
5. **Check endpoint:** Peer config should have `Endpoint = hydraguard.experiencenet.com:51820`
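
Checks 3 to 5 can be run against a peer's config file in one pass. The sketch below assumes a wg-quick style `.conf` (as used on Windows and Linux peers; a Mikrotik export looks different), and `check_peer_conf` is a hypothetical helper. It only verifies the hub side of the key exchange, not that `mesh.yaml` holds the peer's public key.

```bash
# check_peer_conf <file>: hypothetical helper that prints one line per problem found.
HUB_PUBKEY='VGA6ETZB2XFVRRb5KmcFvQ+Ybfh9KKfcWuXfP1IuvQE='

check_peer_conf() {
  conf=$1
  rc=0
  # Fixed-string match: base64 keys contain regex metacharacters such as "+".
  grep -Fq "$HUB_PUBKEY" "$conf" \
    || { echo "hub public key missing or different"; rc=1; }
  grep -Eq '^[[:space:]]*Endpoint[[:space:]]*=[[:space:]]*hydraguard\.experiencenet\.com:51820[[:space:]]*$' "$conf" \
    || { echo "Endpoint is not hydraguard.experiencenet.com:51820"; rc=1; }
  grep -Eq '^[[:space:]]*PersistentKeepalive[[:space:]]*=[[:space:]]*25[[:space:]]*$' "$conf" \
    || { echo "PersistentKeepalive is not 25"; rc=1; }
  return "$rc"
}
```

A clean config produces no output and exit status 0.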

### Handshake works but no data flows

This happens when the WireGuard tunnel negotiates successfully but actual traffic (pings, connections) does not pass through. Common causes:

1. **UFW blocking FORWARD chain on hub.** The `wg-quick` PostUp rule must insert (not append) the FORWARD rule before UFW's default DROP:
   ```bash
   # Check current FORWARD chain
   iptables -L FORWARD -n | head -5
   # If the wg0 ACCEPT rule is after ufw-reject-forward, fix it:
   iptables -I FORWARD 1 -i wg0 -o wg0 -j ACCEPT
   ```
   The generated wg0.conf uses `iptables -I FORWARD 1` to avoid this. If you see `-A FORWARD` in the conf, update it.

2. **Peer behind NAT took too long to re-handshake.** After a hub `wg-quick down/up`, peers behind NAT must re-initiate. Wait 25 seconds for the keepalive. Check:
   ```bash
   wg show wg0 | grep -A5 "endpoint"
   ```
   If the peer has an endpoint but no "latest handshake" line (`wg show` omits the line until a handshake completes), the peer hasn't sent a keepalive yet.

3. **Routing table missing.** After `wg-quick down/up`, verify routes exist:
   ```bash
   ip route show dev wg0
   ```
   Should show routes for each peer's AllowedIPs. `hydraguard apply` automatically syncs kernel routes after `wg syncconf`, but if routes are still missing, run:
   ```bash
   wg-quick down wg0 && wg-quick up wg0
   ```
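
The rule-order check from cause 1 can be scripted against `iptables -S FORWARD`, which prints the chain's rules in evaluation order. This is a sketch; `forward_rule_ok` is a hypothetical helper, and it only compares the first wg0 ACCEPT rule against the jump to `ufw-reject-forward`.

```bash
# Verify that the wg0 ACCEPT rule precedes ufw-reject-forward.
# stdin: output of `iptables -S FORWARD`
forward_rule_ok() {
  awk '
    /^-A FORWARD/ { n++ }                                   # count rules in order
    /^-A FORWARD/ && /-i wg0/ && /-j ACCEPT/ && !accept { accept = n }
    /^-A FORWARD/ && /-j ufw-reject-forward/ && !reject { reject = n }
    END {
      if (!accept) { print "no wg0 ACCEPT rule"; exit 1 }
      if (reject && accept > reject) {
        print "wg0 ACCEPT rule is after ufw-reject-forward"; exit 1
      }
      print "ok"
    }'
}
```

Usage on the hub: `iptables -S FORWARD | forward_rule_ok`.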

### Can't reach a venue's LAN devices

Test connectivity step by step:

```bash
ping 10.10.X.1    # 1. VPN box tunnel address (WG layer)
ping 10.0.X.1     # 2. VPN box LAN gateway (routing through VPN box)
ping 10.0.X.100   # 3. A device on the LAN
```

If step 1 works but step 2 or 3 fails:
- The VPN box (ER605/Mikrotik) is probably not forwarding traffic between WG and LAN
- Check firewall rules on the VPN box
- On Omada: check via the Omada SDN Controller (see omada-venue.md)
- Verify the LAN CIDR in mesh.yaml matches the actual venue LAN (common mistake: `10.0.1.0/24` vs `10.0.0.0/24`); a wrong CIDR does not affect the tunnel address, only traffic to the LAN

If step 1 fails:
- Check `hydraguard status` for a recent handshake
- If there is none, work through the "no handshake" checklist above
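
The three-step test can be wrapped in a helper that stops at the first failing layer. This sketch takes the tunnel address and the LAN CIDR separately, because the third octets do not always match (AD6 uses `10.10.1.1` with LAN `10.0.0.0/24`). It assumes the gateway is `.1` and a test device is `.100`, as in the example above; both function names are hypothetical.

```bash
# probe_targets <tunnel-addr> <lan-cidr>: print the three addresses to test, in order.
probe_targets() {
  tunnel=${1%/*}            # e.g. 10.10.1.1
  lan=${2%/*}               # e.g. 10.0.0.0
  prefix=${lan%.*}          # e.g. 10.0.0 (assumes a /24 LAN)
  printf '%s\n' "$tunnel" "$prefix.1" "$prefix.100"
}

# probe_venue <tunnel-addr> <lan-cidr>: ping each target, stop at the first failure.
probe_venue() {
  step=0
  for target in $(probe_targets "$1" "$2"); do
    step=$((step + 1))
    if ping -c 2 -W 2 "$target" >/dev/null 2>&1; then
      echo "step $step ok: $target"
    else
      echo "step $step FAILED: $target"
      return 1
    fi
  done
}
```

Usage on the hub: `probe_venue 10.10.1.1/32 10.0.0.0/24`.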

### Bodies (Windows render nodes) unreachable via ping but online

Windows Firewall blocks ICMP by default. The bodies may be online and functional even if ping fails. Verify by:
- Checking the Omada controller's client list (shows MAC, connection status)
- Checking hydraneck's scan results
- Trying to connect to a known service port on the body

### Inter-peer traffic not forwarding (e.g., hydraneckwebrtc cannot reach venue LAN)

Traffic between two WG peers (e.g., hydraneckwebrtc at 10.10.100.4 reaching AD6 LAN at 10.0.0.0/24) must be forwarded by the hub. Check:

1. **IP forwarding enabled:** `cat /proc/sys/net/ipv4/ip_forward` (must be `1`)
2. **iptables FORWARD rule:** `iptables -L FORWARD -n | head -3` -- the `ACCEPT` rule for wg0 must be before any DROP/REJECT
3. **Both peers connected:** Both the source peer and the destination venue must have active handshakes

### WireGuard interface won't come up

```bash
ip link show wg0           # Check if interface exists
wg-quick strip wg0         # Check config syntax
journalctl -u wg-quick@wg0 # Check logs
```

### DNS not resolving

```bash
dig +short hydraguard.experiencenet.com @8.8.8.8
```

If the name does not resolve to `141.227.136.12`, check the A record in Hetzner DNS (zone 788422).

### mesh.yaml out of sync with wg0.conf

```bash
hydraguard apply    # Regenerates wg0.conf from mesh.yaml and syncs
```

### Full reset

```bash
wg-quick down wg0
rm /etc/wireguard/wg0.conf
hydraguard apply
```

---

## Releasing

```bash
git tag v1.1.0
git push origin v1.1.0
```

This triggers CI, which builds binaries for linux and darwin on both amd64 and arm64 and publishes them as a GitHub Release. The hub picks up new versions via `hydraguard update`.
