Operations Runbook
Day-2 Operations Scope
This runbook covers platform operations for the Tornado VPN server stack:
- service health checks
- routine maintenance
- incident response
- session and network recovery
Health Check Matrix
| Layer | Check | Expected |
|---|---|---|
| Supervisor | `systemctl status tornado` | active/running |
| Admin API | `curl http://127.0.0.1:8000/health` | JSON status healthy |
| Client API | `curl http://127.0.0.1:4605/health` | JSON status ok |
| Redis | `redis-cli ping` | PONG |
| PostgreSQL | `pg_isready -h 127.0.0.1 -p 5432` | accepting connections |
| WireGuard | `wg show` | wg0/wg1 interfaces active |
| Tor Manager | admin endpoint `/relay/health` | healthy response |
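The matrix can be walked in a single pass from a shell. The following is a minimal sketch, not part of the Tornado tooling: the helper names `run_check` and `check_stack` are illustrative, `systemctl is-active --quiet` is swapped in for `systemctl status` because it is easier to script, and only exit codes are checked, not response bodies. The Tor Manager row is left out because the runbook does not give the full URL of the admin endpoint.

```bash
# Run one check: print PASS or FAIL with the layer name and
# return non-zero when the command fails.
run_check() {
    local layer="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS  %s\n' "$layer"
    else
        printf 'FAIL  %s\n' "$layer"
        return 1
    fi
}

# Walk the matrix top to bottom; return non-zero if any layer failed.
# The wg commands usually need root.
check_stack() {
    local failed=0
    run_check "Supervisor"    systemctl is-active --quiet tornado || failed=1
    run_check "Admin API"     curl -sSf http://127.0.0.1:8000/health || failed=1
    run_check "Client API"    curl -sSf http://127.0.0.1:4605/health || failed=1
    run_check "Redis"         redis-cli ping || failed=1
    run_check "PostgreSQL"    pg_isready -h 127.0.0.1 -p 5432 || failed=1
    run_check "WireGuard wg0" wg show wg0 || failed=1
    run_check "WireGuard wg1" wg show wg1 || failed=1
    return "$failed"
}
```

Calling `check_stack` prints one line per layer, so the first `FAIL` line points at the layer to investigate.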
Service Recovery Order
When full stack recovery is needed, use this order:
- PostgreSQL and Redis
- `tornado.service` (starts master and microservices)
- NGINX
- API health verification
- tunnel interface verification (`wg show`)
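The recovery order can be expressed as a short function that stops at the first unit that fails to start. This is a sketch under stated assumptions: only `tornado` is a unit name taken from this runbook; `postgresql`, `redis-server`, and `nginx` are guesses that vary by distribution, and `recover_stack` is a hypothetical helper name.

```bash
# Ordered start list. Adjust the assumed unit names to match the host.
RECOVERY_UNITS="postgresql redis-server tornado nginx"

# Start each unit in order and stop at the first one that fails,
# then verify API health and the tunnel interfaces.
recover_stack() {
    local unit
    for unit in $RECOVERY_UNITS; do
        printf 'starting %s\n' "$unit"
        if ! systemctl start "$unit"; then
            printf 'failed to start %s, stopping recovery\n' "$unit" >&2
            return 1
        fi
    done
    # The APIs may need a few seconds after start before these succeed.
    curl -sSf http://127.0.0.1:8000/health >/dev/null &&
    curl -sSf http://127.0.0.1:4605/health >/dev/null &&
    wg show
}
```

Run it as root (or prefix the privileged commands with `sudo`). Stopping on the first failure keeps later services from starting against a missing dependency.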
Incident: Clients Cannot Establish VPN
- Verify client API health (`:4605`).
- Check auth service and wg manager status through admin APIs.
- Verify IPAM pool availability in Redis.
- Check WireGuard interface state (`wg0`, `wg1`) and recent logs.
- Confirm JWT key files exist and are readable.
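The last step, confirming the JWT key files, can be scripted. A minimal sketch, assuming the keys are `*.pem` files in a single directory; both the file pattern and the helper name `check_key_files` are assumptions, since the runbook does not state where the keys live.

```bash
# Succeed only if at least one DIR/*.pem file exists and every match
# is readable by the current user and non-empty.
check_key_files() {
    local dir="$1" f found=0 bad=0
    for f in "$dir"/*.pem; do
        [ -e "$f" ] || continue   # an unmatched glob stays literal; skip it
        found=1
        if [ ! -r "$f" ] || [ ! -s "$f" ]; then
            printf 'unreadable or empty: %s\n' "$f" >&2
            bad=1
        fi
    done
    [ "$found" -eq 1 ] && [ "$bad" -eq 0 ]
}
```

Readability depends on who runs the check, so run it as the same user the service runs as. A check run as root reports every file as readable.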
Incident: Frequent Session Drops
- Check Redis keyspace events are enabled (`notify-keyspace-events Ex`).
- Verify heartbeat TTL and hard TTL expectations from session config.
- Inspect session service logs for `heartbeat_lost` and `hard_cleanup` frequency.
- Validate client heartbeat interval behavior relative to returned `heartbeat_ttl`.
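The keyspace-events check can be automated against the live Redis configuration. A minimal sketch with hypothetical helper names: Redis may report the flags in a different order than they were set (for example `xE` for `Ex`), and the `A` alias also covers expired events, so the test looks for the individual flags rather than an exact string.

```bash
# True if the flag string enables expired-key events on the keyevent
# channel: it needs E plus either x or A (A is an alias that includes x).
flags_enable_expiry_events() {
    case "$1" in
        *E*) ;;
        *) return 1 ;;
    esac
    case "$1" in
        *x*|*A*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the live setting. When piped, CONFIG GET prints the parameter
# name on the first line and its value on the second.
redis_expiry_events_enabled() {
    local flags
    flags=$(redis-cli config get notify-keyspace-events | sed -n '2p')
    flags_enable_expiry_events "$flags"
}
```

If `redis_expiry_events_enabled` fails, expiry notifications are not being published, which would hide heartbeat expirations from any service that relies on them.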
Incident: Token Validation Failures After Rotation
- Check key rotator status and recent rotation logs.
- Verify overlap keys exist during expected cutover windows.
- Confirm pid files used for SIGHUP are valid and current.
- Force controlled reload or service restart if required.
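The pid file check and the controlled reload can be combined so that SIGHUP is sent only to a process that is actually running. A minimal sketch; the helper names are hypothetical and the pid file path is passed in because the runbook does not specify it.

```bash
# True if PIDFILE holds a numeric PID that belongs to a running process.
# kill -0 delivers no signal; it only tests whether the process can be
# signalled, so run this with enough privilege to signal the service.
pid_file_is_live() {
    local pidfile="$1" pid
    [ -r "$pidfile" ] || return 1
    pid=$(tr -d '[:space:]' < "$pidfile")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    kill -0 "$pid" 2>/dev/null
}

# Send SIGHUP only when the pid file is valid; otherwise report it.
reload_from_pid_file() {
    local pidfile="$1"
    if pid_file_is_live "$pidfile"; then
        kill -HUP "$(tr -d '[:space:]' < "$pidfile")"
    else
        printf 'stale or missing pid file: %s\n' "$pidfile" >&2
        return 1
    fi
}
```

A live PID does not prove it belongs to the expected service, because PIDs are reused. If in doubt, compare the process name before reloading, or fall back to a full service restart.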
Operational Flow
```mermaid
flowchart TD
A[Alert Triggered] --> B[Classify Impact]
B --> C[Check Core Dependencies\nPostgres + Redis + tornado.service]
C --> D[Check API health\n:8000 and :4605]
D --> E[Check domain-specific layer\nWG/Tor/Auth/Session/Logs]
E --> F[Apply targeted remediation]
F --> G[Validate recovery]
G --> H[Document root cause + action items]
```
Log and Metrics Operations
- The log microservice provides query, count, aggregate, histogram, and export actions.
- Admin API exposes live metrics endpoints and websocket streams.
- Export artifacts are written to the configured `LOG_EXPORT_DIR`.
Maintenance Tasks
- Rotate secrets and review key rotator interval policy.
- Validate database retention and log retention expectations.
- Check system packages and security updates.
- Verify backup and restore drills for PostgreSQL data.
Controlled Restart Procedure
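```bash
# Restart the supervisor unit, which brings up master and microservices.
sudo systemctl restart tornado
# Give the services a moment to bind their ports.
sleep 5
# -f makes curl exit non-zero on an HTTP error status.
curl -sSf http://127.0.0.1:8000/health
curl -sSf http://127.0.0.1:4605/health
# Confirm the tunnel interfaces are back.
wg show
```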
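A fixed `sleep 5` can be too short on a slow start and wasteful on a fast one. As an alternative, a polling helper can wait until each health endpoint answers. This is a sketch, not part of the existing procedure; `wait_for_health` and its 30-second default are assumptions.

```bash
# Poll URL once per second until curl -f succeeds (no HTTP error status)
# or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_for_health() {
    local url="$1" timeout="${2:-30}" waited=0
    until curl -sSf -o /dev/null "$url" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            printf 'timed out waiting for %s\n' "$url" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

With this helper, the restart becomes `sudo systemctl restart tornado && wait_for_health http://127.0.0.1:8000/health && wait_for_health http://127.0.0.1:4605/health && wg show`, which stops at the first endpoint that never recovers.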
Emergency Containment
For severe incidents:
- Disable external access at firewall/load balancer edge.
- Keep internal services up for forensic data extraction.
- Export logs and preserve system journal.
- Rotate admin and JWT secrets before restoring ingress.
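Preserving the system journal is easy to get wrong under pressure, so it can help to script it ahead of time. A minimal sketch, assuming the unit name `tornado` from this runbook and GNU coreutils for `sha256sum`; the helper name, output file naming, and the default time window are illustrative choices.

```bash
# Copy the tornado unit's journal into OUTDIR as JSON lines and write a
# SHA-256 checksum next to it so later truncation or edits are detectable.
# SINCE accepts any journalctl time expression, such as "yesterday" or "-2h".
preserve_journal() {
    local outdir="$1" since="${2:-yesterday}" stamp out
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    out="$outdir/tornado-journal-$stamp.json"
    mkdir -p "$outdir" || return 1
    journalctl -u tornado --since "$since" --no-pager -o json > "$out" || return 1
    sha256sum "$out" > "$out.sha256"
    printf '%s\n' "$out"
}
```

The function prints the path of the exported file. Writing to a directory on separate storage keeps the evidence intact if the host is later rebuilt.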
Post-Incident Checklist
- Incident timeline with UTC timestamps
- Direct cause and contributing factors
- Short-term fix and long-term prevention
- Backlog tickets for resilience gaps