# HydraNeckWebRTC Runbook

## Overview

HydraNeckWebRTC is a centralized WebRTC relay service. It runs moonlight-web-stream processes on worker machines and proxies WebRTC streams to browsers. A controller routes session requests to the least-loaded worker.

## Architecture

Two hydracluster roles map to systemd services:

- **`hydraneckwebrtc-controller`** (singleton): Routes session requests, tracks worker health. Service: `hydraneckwebrtc-controller.service`, runs `hydraneckwebrtc controller`.
- **`hydraneckwebrtc`** (scalable): Manages local sessions, runs coturn for TURN relay. Service: `hydraneckwebrtc.service`, runs `hydraneckwebrtc worker`.

Both roles use the same binary (`/usr/local/bin/hydraneckwebrtc`). Hydranode handles downloading the binary and creating the systemd service for both roles.

## Automated Setup (via hydracluster)

Worker nodes are fully automated through hydracluster recipes:

1. Enroll a new Linux server in hydracluster
2. Assign the `hydraneckwebrtc` role (and optionally `hydraneckwebrtc-controller` for the singleton)
3. Trigger provisioning. The recipe will:
   - Ensure the WireGuard tunnel is active (enables `wg-quick@<iface>` for any config in `/etc/wireguard/`)
   - Install coturn via apt
   - Generate a random TURN credential (persisted at `/root/.hydraneckwebrtc-turn-credential`)
   - Write `/etc/turnserver.conf` with the node's IP and credential
   - Start coturn
   - Write `/root/.hydraneckwebrtc/config.yaml` with controller URL, token, and ice_servers
4. Hydranode downloads the binary and creates the systemd service
5. The worker starts and registers with the controller via heartbeat
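
The credential step above writes the TURN password once and persists it, presumably so that re-running provisioning reuses the same value. A minimal sketch of that generate-once pattern follows. The function name `ensure_turn_credential` and the use of `/dev/urandom` with `od` are assumptions for illustration, not the recipe's actual implementation.

```bash
# Hypothetical helper: create a random credential on first run, reuse it afterwards.
# Argument: path of the file that persists the credential.
ensure_turn_credential() {
  cred_file="$1"
  if [ ! -s "$cred_file" ]; then
    # 16 random bytes rendered as 32 hex characters; umask keeps the file owner-only.
    ( umask 077; head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n' > "$cred_file" )
  fi
  cat "$cred_file"
}
```

Calling `ensure_turn_credential /root/.hydraneckwebrtc-turn-credential` twice prints the same value both times, because the second call finds a non-empty file and skips generation.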

### hydracluster config

Add to `~/.hydracluster/config.yaml`:

```yaml
hydraneckwebrtc:
  controller_url: "https://hydraneckwebrtc.experiencenet.com"
  controller_token: "controller-admin-token"
  admin_token: "worker-admin-token"
```

## Manual Installation

### Prerequisites

- Linux AMD64 or ARM64
- moonlight-web-stream installed at `/opt/moonlight-web-stream/web-server`
- WireGuard tunnel active (`wg-quick@wg0`) — required to reach body nodes on WG IPs
- Network access to Sunshine hosts on port 47990

### Install binary

```bash
# -f fails on HTTP errors instead of saving the error page; -L follows redirects
curl -fL -o /usr/local/bin/hydraneckwebrtc \
  https://releases.experiencenet.com/hydraneckwebrtc/production/latest/hydraneckwebrtc-linux-amd64
chmod +x /usr/local/bin/hydraneckwebrtc
```
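
The command above hardcodes the amd64 artifact, while the prerequisites also allow ARM64. The sketch below picks the suffix from `uname -m`. It assumes the arm64 artifact is published under the same path as `hydraneckwebrtc-linux-arm64`, which the Releasing section suggests but does not state outright; `release_arch` and `install_hydraneckwebrtc` are hypothetical names.

```bash
# Map a machine name (default: `uname -m`) to the release artifact suffix.
release_arch() {
  machine="${1:-$(uname -m)}"
  case "$machine" in
    x86_64|amd64)  echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported architecture: $machine" >&2; return 1 ;;
  esac
}

# Download the matching binary; the arm64 URL is an assumption (see above).
install_hydraneckwebrtc() {
  arch=$(release_arch) || return 1
  curl -fL -o /usr/local/bin/hydraneckwebrtc \
    "https://releases.experiencenet.com/hydraneckwebrtc/production/latest/hydraneckwebrtc-linux-${arch}" \
    && chmod +x /usr/local/bin/hydraneckwebrtc
}
```

Run `install_hydraneckwebrtc` as root; it stops before downloading if the architecture is not one of the two supported ones.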

### Install systemd services

```bash
# Worker
cp scripts/hydraneckwebrtc.service /etc/systemd/system/
# Controller (if running on this machine)
cp scripts/hydraneckwebrtc-controller.service /etc/systemd/system/
systemctl daemon-reload
```

## Configuration

Config file: `~/.hydraneckwebrtc/config.yaml`

### Controller config

```yaml
mode: controller
server:
  domain: hydraneckwebrtc.experiencenet.com
  admin_token: <secure-token>
workers:
  heartbeat_timeout: 60s
```

### Worker config

```yaml
mode: worker
server:
  listen: ":47990"
  admin_token: <secure-token>
controller:
  url: https://hydraneckwebrtc.experiencenet.com
  token: <controller-admin-token>
sessions:
  max: 15
  port_range_start: 8080
  return_url: https://hydraheadwebstream.experiencenet.com  # optional; this URL is the default
sunshine:
  username: sunshine
  password: sunshine
ice_servers:
  - urls: ["stun:stun.l.google.com:19302"]
  - urls: ["turn:<server-public-ip>:3478"]
    username: "hydraturn"
    credential: "<turn-password>"
nps:
  url: https://hydranps.experiencenet.com  # optional, session records sent at session end
  token: <nps-admin-token>
```

### Single-machine deployment (controller + worker on same server)

When both run on the same machine, the controller proxies `/session/` paths to the worker.
Stream URLs use the controller's domain, so browsers connect to the controller which forwards to the worker internally.

Worker config for colocated setup (listens on a local port, no domain needed):

```yaml
server:
  listen: ":8090"
  admin_token: <same-as-controller>
controller:
  url: https://hydraneckwebrtc.experiencenet.com
  token: <same-as-controller>
sessions:
  max: 15
  port_range_start: 8080
sunshine:
  username: sunshine
  password: sunshine
```

### Dev mode

Use `server.listen` instead of `server.domain` for plain HTTP:

```yaml
server:
  listen: ":8080"
  admin_token: dev-token
```

## Common Operations

### Start the services

```bash
# Controller
systemctl enable --now hydraneckwebrtc-controller

# Worker
systemctl enable --now hydraneckwebrtc
```

### Check health

```bash
# Controller
curl -sf https://hydraneckwebrtc.experiencenet.com/api/v1/health | jq .

# Worker (shows WireGuard status, active sessions, capacity)
curl -sf http://localhost:47990/api/v1/health | jq .
```

Worker health response includes:
- `status`: `"ok"` or `"degraded"` (degraded when WireGuard is down)
- `extra.process_id`: worker process PID (identifies which instance is responding)
- `extra.uptime`: how long the worker has been running
- `extra.wireguard`: `true`/`false` — whether a WireGuard interface is detected
- `extra.active_sessions` / `extra.max_sessions`: current load
- `extra.sessions`: list of active sessions, each with:
  - `id`, `body_ip`, `status`: identity and state
  - `process_alive`: whether the moonlight-web-stream process is still running
  - `data_directory_ok`: whether the session's data directory exists on disk
  - `active_connections`: number of open WebSocket tunnels
  - `age`: how long since the session was created

Quick readiness check (single command):
```bash
curl -sf https://hydraneckwebrtc.experiencenet.com/api/v1/health | jq '{
  status: .status,
  wireguard: .extra.wireguard,
  sessions: "\(.extra.active_sessions)/\(.extra.max_sessions)",
  process_id: .extra.process_id,
  uptime: .extra.uptime
}'
```
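
For scripting (for example a deploy gate or a cron check), the worker health fields can be reduced to a pass/fail exit code. The sketch below is one way to do that. The readiness rule (status `ok`, WireGuard up, at least one free session slot) and the name `worker_ready` are assumptions, not behavior defined by the service.

```bash
# Hypothetical readiness gate: succeeds only when the worker can take a new session.
# Arguments: <status> <wireguard> <active_sessions> <max_sessions>
worker_ready() {
  [ "$1" = "ok" ] && [ "$2" = "true" ] && [ "$3" -lt "$4" ]
}

# Example wiring (run on the worker; requires curl and jq):
#   fields=$(curl -sf http://localhost:47990/api/v1/health |
#     jq -r '[.status, .extra.wireguard, .extra.active_sessions, .extra.max_sessions] | @tsv')
#   worker_ready $fields && echo "ready" || echo "not ready"
```

`$fields` is left unquoted on purpose so the tab-separated values split into four arguments.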

### List workers (from controller)

```bash
curl -H "Authorization: Bearer <token>" \
  https://hydraneckwebrtc.experiencenet.com/api/v1/workers
```

### Create a session

```bash
curl -X POST -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{"body_ip":"10.0.1.50","sunshine_user":"sunshine","sunshine_pass":"sunshine"}' \
  https://hydraneckwebrtc.experiencenet.com/api/v1/sessions
```

Optional fields: `experience` (name), `exe_path` (Windows path on the body), `district`. When `exe_path` is set, the worker registers the app in Sunshine and launches it via hydrabody before starting the stream; both steps are non-fatal, so a failure in either does not abort session creation. When `experience` is set, it is stored for experience rating.

Returns `session_id`, `stream_url`, and `worker_id`.
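
Because `exe_path` is a Windows path, its backslashes must be doubled inside the JSON body, which is easy to get wrong when typing the `-d` argument by hand. The sketch below builds the payload with escaping. `json_escape` and `session_payload` are hypothetical helpers; they handle backslashes and double quotes only, not control characters.

```bash
# Escape backslashes and double quotes so a value can sit inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the request body. Arguments: <body_ip> <sunshine_user> <sunshine_pass> [exe_path]
session_payload() {
  body="\"body_ip\":\"$(json_escape "$1")\""
  body="$body,\"sunshine_user\":\"$(json_escape "$2")\""
  body="$body,\"sunshine_pass\":\"$(json_escape "$3")\""
  if [ -n "${4:-}" ]; then
    body="$body,\"exe_path\":\"$(json_escape "$4")\""
  fi
  printf '{%s}\n' "$body"
}
```

Pass the result to the request shown above with `-d "$(session_payload 10.0.1.50 sunshine sunshine 'C:\Games\Demo.exe')"`; the path here is a made-up example. Where jq is installed, `jq -n --arg` is a more robust way to build the same body.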

### Stream-ended overlay

When a browser connects to `/session/{id}/...`, the worker reverse proxy injects a script into HTML responses. This script detects WebRTC disconnects and shows a "Stream ended" overlay with a "Start new stream" link back to `sessions.return_url` (defaults to `https://hydraheadwebstream.experiencenet.com`).

The injected script also intercepts the moonlight-web-stream exit button, calling `POST /session/{id}/quit` (unauthenticated) to clean up the session server-side before redirecting.

The overlay includes a "Report a problem" link that opens `https://issues.experiencenet.com/report` in a new tab with session context prefilled as query parameters: `project`, `category`, `title`, `session_id`, `experience`, `district`, `body`, `head`, `body_ip`, `disconnect_count`, `duration_ms`, `end_reason`, and `ice_candidate_type`.

The overlay includes a star rating widget. When the user rates, `POST /session/{id}/rating` is sent to the worker, which forwards it to the experience library.

### Browser mic relay

The injected script adds a mic toggle button (bottom-right of the stream page). When enabled:

1. Browser captures mic via `getUserMedia({audio: true})`
2. Creates a send-only WebRTC PeerConnection for the audio track
3. Signals via `POST /session/{id}/mic/offer` (unauthenticated, browser-facing, proxied through controller)
4. Worker creates a pion/webrtc PeerConnection (receive-only), forwards raw RTP packets via UDP to `body_ip:47995` over WireGuard
5. On the body, hydravoice receives RTP and renders to VB-Cable via ffmpeg
6. UE reads from the "CABLE Output" virtual mic

Mic is automatically stopped when the stream ends.

### List sessions

```bash
# All sessions across workers (via controller)
curl -H "Authorization: Bearer <token>" \
  https://hydraneckwebrtc.experiencenet.com/api/v1/sessions
```

### Delete a session

```bash
curl -X DELETE -H "Authorization: Bearer <token>" \
  https://hydraneckwebrtc.experiencenet.com/api/v1/sessions/<session-id>
```

On deletion (or cleanup), if the session had an experience launched via `exe_path`, the worker calls `POST /api/v1/stop` on the body to kill the experience processes.

### Update the binary

```bash
hydraneckwebrtc update
# Alternatively, wait for auto-update, which runs every 6 hours
```

### Check version

```bash
hydraneckwebrtc version
hydraneckwebrtc check-update
```

## Scaling

1. Start with one machine running both controller + worker (two systemd services)
2. Add worker machines: assign the `hydraneckwebrtc` role in hydracluster and trigger provisioning
3. The recipe installs coturn and writes the config with the controller URL and token; the worker then auto-registers
4. Controller picks least-loaded worker for each new session
5. Browsers connect directly to workers for WebRTC (no load balancer needed for media)
6. Workers that stop heartbeating are marked unhealthy after `heartbeat_timeout`
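
The "least-loaded" choice in step 4 can be illustrated with a small selection function. This runbook does not say how the controller measures load, so the sketch assumes the ratio of active to maximum sessions. The input layout is also hypothetical and is not the `/api/v1/workers` response format.

```bash
# Input on stdin: one healthy worker per line as "<worker_id> <active_sessions> <max_sessions>".
# Prints the id with the lowest active/max ratio among workers that still have a free slot.
# Exits non-zero when every worker is full; ties go to the first worker listed.
pick_least_loaded() {
  awk '$2 < $3 {
         load = $2 / $3
         if (best == "" || load < best_load) { best = $1; best_load = load }
       }
       END { if (best == "") exit 1; print best }'
}
```

For example, `printf '%s\n' 'w1 10 15' 'w2 3 15' 'w3 15 15' | pick_least_loaded` skips the full worker `w3` and selects `w2`.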

## TURN Server (coturn)

All WebRTC media is relayed through TURN (`iceTransportPolicy: 'relay'` is forced on the browser side). Direct host/srflx candidates are disabled because NAT binding timeouts cause stream drops. coturn is automatically installed and configured by the hydraneckwebrtc recipe on each worker node.

### Manual setup (if not using recipe)

```bash
apt install coturn
```

Config (`/etc/turnserver.conf`):

```
listening-port=3478
external-ip=<server-public-ip>
realm=hydraneckwebrtc.experiencenet.com
server-name=hydraneckwebrtc.experiencenet.com
lt-cred-mech
user=hydraturn:<password>
min-port=49152
max-port=65535
no-multicast-peers
no-cli
log-file=/var/log/turnserver.log
simple-log
allowed-peer-ip=10.10.0.0-10.10.255.255
```
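
When setting up several nodes by hand, the config above can be rendered from the two values that differ per node. The sketch below reproduces the same settings; `render_turnserver_conf` is a hypothetical helper, not part of the recipe.

```bash
# Hypothetical template renderer. Arguments: <server-public-ip> <turn-password>
render_turnserver_conf() {
  cat <<EOF
listening-port=3478
external-ip=$1
realm=hydraneckwebrtc.experiencenet.com
server-name=hydraneckwebrtc.experiencenet.com
lt-cred-mech
user=hydraturn:$2
min-port=49152
max-port=65535
no-multicast-peers
no-cli
log-file=/var/log/turnserver.log
simple-log
allowed-peer-ip=10.10.0.0-10.10.255.255
EOF
}
```

Redirect the output to `/etc/turnserver.conf` and restart coturn. The password must match the `credential` in the worker's `ice_servers` entry.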

### Firewall

Ports required:
- UDP+TCP 3478 (TURN signaling)
- UDP+TCP 49152-65535 (relay media — both protocols required, browsers may use TCP TURN relay for audio)
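
As an illustration, the ports above translate into the following rules on a host that uses `ufw`. The choice of `ufw` is an assumption, since this runbook does not name a firewall tool. The function only prints the commands so they can be reviewed before being applied.

```bash
# Print (not execute) ufw commands for the TURN signaling port and relay range.
turn_firewall_rules() {
  for proto in udp tcp; do
    echo "ufw allow 3478/$proto"
    echo "ufw allow 49152:65535/$proto"
  done
}
```

After reviewing the output, apply it with `turn_firewall_rules | sudo sh`.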

### Verify

```bash
# Check coturn is running
systemctl status coturn

# Check listening on both UDP and TCP 3478
ss -tulnp | grep 3478

# In browser console during a stream, look for "typ relay" ICE candidates
```

## Troubleshooting

### Worker not registering with controller

- Check that `controller.url` in the worker config points at the controller and that `controller.token` matches the controller's `server.admin_token`
- Verify network connectivity between worker and controller
- Check worker logs: `journalctl -u hydraneckwebrtc -f`
- Heartbeat is sent every 30s; controller marks unhealthy after `heartbeat_timeout` (default 60s)
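
With a 30s heartbeat interval and a 60s timeout, two consecutive missed heartbeats are enough to push a worker past the timeout. The check can be written as simple arithmetic, as sketched below. The name `heartbeat_stale` and the strict greater-than comparison are assumptions about how the controller evaluates the timeout.

```bash
# Succeeds when the last heartbeat is older than the timeout.
# Arguments: <last_heartbeat_epoch> <now_epoch> [timeout_seconds, default 60]
heartbeat_stale() {
  [ $(( $2 - $1 )) -gt "${3:-60}" ]
}
```

For example, `heartbeat_stale "$last_seen" "$(date +%s)"` succeeds once more than 60 seconds have passed since `$last_seen`, a placeholder for a timestamp taken from logs or the workers list.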

### Session creation fails

- "no healthy workers available": No workers registered or all are unhealthy/full
- "max sessions reached": Worker at capacity, add more workers or increase `sessions.max`
- "no ports available": All ports in the range are in use
- "moonlight-web-stream not installed": Install at `/opt/moonlight-web-stream/web-server`

### Pairing fails

- Check health endpoint first: `curl -s .../api/v1/health | jq .extra.wireguard` — if `false`, WireGuard is down
- "body X unreachable on port 47990": body is not reachable. Check WireGuard (`wg show`) and Sunshine on the body
- "pair failed: ... (crypto handshake rejected by Sunshine)": version mismatch between moonlight-web-stream and Sunshine
- Verify Sunshine is running on the body and reachable on port 47990
- Check sunshine credentials in worker config
- Pairing has a 30s timeout; if Sunshine is slow, it may time out

### WireGuard down on worker

Symptoms: all sessions to WireGuard-routed bodies fail with "body unreachable". Health endpoint shows `"wireguard": false` and `"status": "degraded"`.

```bash
# Check WireGuard status
wg show
ip link show type wireguard

# Start WireGuard (config must exist in /etc/wireguard/)
systemctl enable --now wg-quick@wg0

# Verify body reachable
ping -c 2 -W 3 10.10.100.6
curl -sk -u sunshine:sunshine https://10.10.100.6:47990/api/currentClient
```

The hydraneckwebrtc recipe auto-enables WireGuard on provision. If the server was provisioned before this behavior was added, or WireGuard was stopped manually, re-enable it by hand. The worker logs `WARNING: no WireGuard interface detected` on startup if no WireGuard interface is found.

### Session process dies

- The process health monitor checks every 30s; dead sessions are marked "error"
- Process liveness is checked via `kill -0` (signal 0) on Unix
- Data directory integrity is also checked every 30s; if the directory is deleted externally while the process is running, the session is marked "error" with an `ALERT` log
- Each moonlight-web-stream process logs to `web-server.log` in its session data dir
- Check logs: `find /tmp/hydraneckwebrtc-sessions -name web-server.log -exec cat {} \;`
- Cleanup goroutine removes error/stopped sessions after idle timeout (5 min)
- Check for orphaned session dirs in `/tmp/hydraneckwebrtc-sessions/`; they are cleaned on startup
- Use the health endpoint to check per-session `process_alive` and `data_directory_ok` without SSH
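
The two per-session checks described above (signal 0 for liveness, directory existence for data integrity) can be reproduced from a shell when debugging on the worker. This is a sketch of the technique, not the worker's own code; `process_alive` and `session_healthy` are hypothetical names.

```bash
# Succeeds when a process with the given PID exists and can be signalled.
# Signal 0 performs the existence and permission check without delivering a signal.
process_alive() {
  kill -0 "$1" 2>/dev/null
}

# Succeeds when the process is alive and its data directory still exists.
# Arguments: <pid> <session_data_dir>
session_healthy() {
  process_alive "$1" && [ -d "$2" ]
}
```

Note that `kill -0` also fails for a live process owned by another user when the caller lacks permission to signal it, so run these checks as the user that owns the session processes (or as root).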

### Duplicate instance prevention

- On startup, the worker performs a port pre-flight check before cleaning orphaned session dirs
- If the worker HTTP port is already in use, the worker exits immediately with: `"cannot bind — is another hydraneckwebrtc already running?"`
- This prevents a rogue second instance (e.g. old systemd unit) from deleting active session directories
- If you see this error, check for stale systemd units: `systemctl list-units | grep hydraneck`

### Orphaned processes after crash

- On restart, orphaned session directories in `/tmp/hydraneckwebrtc-sessions/` are cleaned up (only after the port pre-flight confirms no other instance is running)
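- Orphaned moonlight-web-stream processes may need manual cleanup: `pkill -f web-server`. The pattern matches any process whose command line contains `web-server`, so review the matches with `pgrep -af web-server` first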

### Port conflicts

- Default session port range: `sessions.port_range_start` (8080) to 8080 + `sessions.max`
- WebRTC ports: 40000+ (20 ports per session)
- Ensure these ranges do not conflict with other services
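
To see which ports a given configuration may occupy, the ranges above can be computed from the config values. The sketch assumes that WebRTC ports are allocated as consecutive blocks of 20 starting at 40000, which the list above implies but does not spell out. Both function names are hypothetical.

```bash
# Session HTTP ports, following the "start to start + max_sessions" rule above.
# Arguments: <port_range_start> <max_sessions>
session_port_range() {
  echo "$1-$(( $1 + $2 ))"
}

# WebRTC ports, assuming 20 consecutive ports per session starting at 40000.
# Argument: <max_sessions>
webrtc_port_range() {
  echo "40000-$(( 40000 + 20 * $1 - 1 ))"
}
```

With `sessions.max: 15` and `port_range_start: 8080` these print `8080-8095` and `40000-40299`. Compare them against current listeners with `ss -tuln`.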

## Logs

```bash
# systemd journal
journalctl -u hydraneckwebrtc -f          # worker
journalctl -u hydraneckwebrtc-controller -f  # controller

# Key log prefixes
# [controller] - controller routing decisions
# [worker] - worker session management
# [session <id>] - per-session lifecycle
# [pairing] - Sunshine pairing flow
# [heartbeat] - worker-to-controller heartbeat
# [cleanup] - expired session cleanup
# [proxy] - reverse proxy errors
# [mic] - browser mic WebRTC connections
# [rating] - experience ratings
```

## Releasing

1. Tag: `git tag v<X.Y.Z> && git push origin v<X.Y.Z>`
2. GitHub Actions builds linux-amd64 and linux-arm64 binaries
3. Uploads to GitHub Releases and releases.experiencenet.com
4. Running instances auto-update on next 6-hour check cycle
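
Since pushing the tag starts the release build, it can help to validate the tag format before pushing. The sketch below checks for the `v<X.Y.Z>` shape used in step 1. `valid_release_tag` is a hypothetical helper, and the rule that each component is a plain decimal number is an assumption.

```bash
# Succeeds when the argument looks like v<X.Y.Z> with numeric components.
valid_release_tag() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}
```

For example, with the intended tag in a `tag` variable: `valid_release_tag "$tag" && git tag "$tag" && git push origin "$tag"`.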
