GPU Node Unresponsive — Live Incident A GPU compute node has stopped reporting metrics. You're on-call. Triage from alert to resolution.
ADVANCED 4 steps hardware linux ops-processes
Phantom Disk Full — Production Alert Root filesystem shows 100% full but du disagrees. Services are failing. Find the ghost.
INTERMEDIATE 4 steps linux ops-processes
NIC Flapping — Intermittent Connectivity A compute node keeps losing network connectivity for 5-10 seconds at a time. Training jobs are failing with NCCL timeouts.
ADVANCED 4 steps networking hardware linux
Permission Disaster — Locked Out of Production Someone ran a recursive chmod and now critical services are failing. SSH still works but sudo doesn't.
INTERMEDIATE 4 steps linux ops-processes
GPU Thermal Cascade — Cooling Failure Multiple GPUs are hitting thermal limits. Training jobs are crashing across a rack. Could be a cooling issue.
EXPERT 4 steps hardware power-cooling ops-processes