Step-by-step troubleshooting — 10 scenarios across 4 difficulty levels
Alert: Root filesystem at 98% on web-prod-03. The application can no longer write its logs.
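A first pass is to confirm where the space actually went before deleting anything. Below is a minimal sketch for a Linux host; the search root (/var) and the top-10 cutoff are assumptions, not values from the alert.

```python
#!/usr/bin/env python3
"""Report overall usage of / and the largest directories under an assumed search root."""
import os
import shutil

SEARCH_ROOT = "/var"   # assumed starting point; logs and spools often live here
TOP_N = 10

def dir_size(path: str) -> int:
    """Sum the apparent size of all regular files below *path*, skipping unreadable entries."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path, onerror=lambda err: None):
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or permission denied; skip it
    return total

usage = shutil.disk_usage("/")
print(f"/ used: {usage.used / usage.total:.1%} ({usage.free // 2**20} MiB free)")

sizes = []
for entry in os.scandir(SEARCH_ROOT):
    if entry.is_dir(follow_symlinks=False):
        sizes.append((dir_size(entry.path), entry.path))

for size, path in sorted(sizes, reverse=True)[:TOP_N]:
    print(f"{size / 2**30:8.2f} GiB  {path}")
```

If the numbers from the filesystem walk do not add up to what df reports, suspect deleted-but-open files held by a running process rather than anything visible on disk.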
Users report the application is sluggish. Monitoring shows sustained 95%+ CPU on app-node-07.
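Before restarting anything, it helps to know whether one process, a pool of workers, or the kernel itself is burning the CPU. The sketch below samples /proc twice on a Linux host; the 5-second window and top-10 cutoff are arbitrary choices.

```python
#!/usr/bin/env python3
"""Sample per-process CPU ticks twice and report the heaviest consumers."""
import os
import time

INTERVAL = 5.0  # seconds between samples (assumed)

def cpu_ticks() -> dict[int, tuple[str, int]]:
    """Return {pid: (command, utime + stime in clock ticks)} for all live processes."""
    result = {}
    for pid in (p for p in os.listdir("/proc") if p.isdigit()):
        try:
            with open(f"/proc/{pid}/stat") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited between listing and reading
        comm = raw[raw.index("(") + 1 : raw.rindex(")")]
        rest = raw.rsplit(")", 1)[1].split()
        result[int(pid)] = (comm, int(rest[11]) + int(rest[12]))  # utime + stime
    return result

before = cpu_ticks()
time.sleep(INTERVAL)
after = cpu_ticks()

hertz = os.sysconf("SC_CLK_TCK")
deltas = [
    ((after[pid][1] - ticks) / hertz / INTERVAL * 100, pid, comm)
    for pid, (comm, ticks) in before.items()
    if pid in after
]
for pct, pid, comm in sorted(deltas, reverse=True)[:10]:
    print(f"{pct:6.1f}%  pid {pid:>7}  {comm}")
```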
Monitoring alert: nvidia-smi reports ERR! for GPU 2 on gpu-rack12-node07. Training job paused.
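When nvidia-smi shows ERR! for one card, the useful evidence is the detailed per-GPU query plus any Xid events in the kernel log. A minimal collection sketch, assuming nvidia-smi is on PATH and the kernel log is readable on this host:

```python
#!/usr/bin/env python3
"""Collect nvidia-smi detail and kernel Xid/NVRM messages for the faulted GPU."""
import subprocess

GPU_INDEX = "2"  # taken from the alert

def run(cmd: list[str]) -> str:
    """Run a command and return its output, or an error note if it cannot run."""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<< could not run {' '.join(cmd)}: {exc} >>"

# Full per-GPU status (ECC counts, retired pages, clocks) for the bad card.
print(run(["nvidia-smi", "-q", "-i", GPU_INDEX]))

# Xid events are the usual smoking gun, e.g. Xid 79 (GPU fell off the bus)
# or Xid 48 (double-bit ECC error).
for line in run(["dmesg", "--ctime"]).splitlines():
    if "Xid" in line or "NVRM" in line:
        print(line)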
After a server reboot, the monitoring agent (node-exporter) fails to start; systemctl shows the unit in the 'failed' state.
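A service that ran fine before the reboot and fails afterward usually comes down to ordering, a missing mount, a changed config, or a port conflict. The sketch below only gathers that evidence; the unit name node_exporter is an assumption about how the agent is packaged on this host.

```python
#!/usr/bin/env python3
"""Gather systemd and journal evidence for a unit that fails after reboot."""
import subprocess

UNIT = "node_exporter"  # assumed unit name; adjust to the local packaging

def run(cmd: list[str]) -> str:
    out = subprocess.run(cmd, capture_output=True, text=True)
    return out.stdout + out.stderr

# 1. Why systemd thinks the unit failed (exit code, signal, dependency).
print(run(["systemctl", "status", UNIT, "--no-pager", "--full"]))

# 2. The unit's own log messages from this boot only.
print(run(["journalctl", "-u", UNIT, "-b", "--no-pager"]))

# 3. Is the unit even enabled to start at boot?
print(run(["systemctl", "is-enabled", UNIT]))

# 4. Is something else already bound to the exporter's listen port?
print(run(["ss", "-ltnp"]))
```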
Intermittent 5-8% packet loss reported between gpu-rack08-node03 and the storage cluster. Training throughput degraded.
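Loss in the 5-8% range is worth quantifying from the node itself and correlating with local interface error counters, which separate a cabling or NIC fault from congestion further out in the fabric. A minimal sketch; the target name, probe count, and interval are placeholders for real inventory values.

```python
#!/usr/bin/env python3
"""Measure loss toward the storage cluster and dump local NIC error counters."""
import pathlib
import re
import subprocess

TARGETS = ["storage-vip.example.internal"]  # hypothetical name, replace with real hosts
COUNT = 200
INTERVAL = "0.2"  # seconds between probes; the non-root minimum on most distros

for target in TARGETS:
    proc = subprocess.run(
        ["ping", "-q", "-c", str(COUNT), "-i", INTERVAL, target],
        capture_output=True, text=True,
    )
    match = re.search(r"([\d.]+)% packet loss", proc.stdout)
    loss = match.group(1) if match else "unknown"
    print(f"{target}: {loss}% loss over {COUNT} probes")

# Climbing rx_errors/rx_dropped point at layer 1/2 (cable, optic, NIC)
# rather than at a congested fabric.
for nic in sorted(pathlib.Path("/sys/class/net").iterdir()):
    stats = nic / "statistics"
    if stats.is_dir():
        rx_err = int((stats / "rx_errors").read_text())
        rx_drop = int((stats / "rx_dropped").read_text())
        tx_err = int((stats / "tx_errors").read_text())
        print(f"{nic.name}: rx_errors={rx_err} rx_dropped={rx_drop} tx_errors={tx_err}")
```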
3 AM alert: the critical process 'ml-trainer' was killed by the OOM killer on gpu-node-22. The training job lost 4 hours of work since its last checkpoint.
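The kernel log records what triggered the kill and how much memory each process held at that moment, which is the starting point before adjusting limits or adding swap. A minimal extraction sketch, assuming journald is capturing kernel messages on this node:

```python
#!/usr/bin/env python3
"""Pull OOM-killer evidence from the kernel log and show current memory headroom."""
import subprocess

log = subprocess.run(
    ["journalctl", "-k", "-b", "--no-pager"],
    capture_output=True, text=True,
).stdout

# The kernel prints a block around each kill: the trigger, a per-process RSS
# table, and the final "Killed process ..." line naming the victim.
interesting = ("oom-killer", "Out of memory", "Killed process", "oom_reaper")
for line in log.splitlines():
    if any(marker in line for marker in interesting):
        print(line)

# Current headroom, to judge whether the node is still close to the edge.
with open("/proc/meminfo") as fh:
    for line in fh:
        if line.startswith(("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")):
            print(line.rstrip())
```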
Applications on db-node-05 are failing with 'Read-only file system' errors. Database writes are blocked.
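A filesystem usually goes read-only because the kernel remounted it after an I/O or journal error, so read that evidence before any remount or fsck. The sketch below only reads /proc and the kernel log (the latter may require root), so it is safe to run while writes are blocked.

```python
#!/usr/bin/env python3
"""Identify which mounts are read-only and surface the kernel's reason."""
import subprocess

# 1. Which filesystems are currently mounted read-only?
with open("/proc/mounts") as fh:
    for line in fh:
        device, mountpoint, fstype, options, *_ = line.split()
        if "ro" in options.split(","):
            print(f"read-only: {device} on {mountpoint} ({fstype}, {options})")

# 2. The kernel usually says why it remounted: I/O errors, a journal abort,
#    or an on-disk inconsistency.
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
markers = ("I/O error", "Remounting filesystem read-only",
           "EXT4-fs error", "journal commit I/O error", "XFS")
for line in log.splitlines():
    if any(m in line for m in markers):
        print(line)
```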
Monitoring shows correctable ECC memory errors on compute-node-15. No crashes yet, but the error rate is climbing.
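Correctable errors are harmless individually, but a climbing rate on one DIMM is the usual precursor to an uncorrectable error, so the goal is to attribute the counts to a physical module. A minimal sketch reading the kernel's EDAC counters, assuming the platform's edac driver is loaded so the sysfs tree is populated:

```python
#!/usr/bin/env python3
"""Report correctable/uncorrectable ECC counts per memory controller and per DIMM."""
import pathlib

EDAC = pathlib.Path("/sys/devices/system/edac/mc")

def read_int(path: pathlib.Path) -> int:
    return int(path.read_text()) if path.exists() else 0

if not EDAC.exists():
    raise SystemExit("EDAC sysfs not present; load the platform's edac module first")

for mc in sorted(EDAC.glob("mc*")):
    print(f"{mc.name}: CE={read_int(mc / 'ce_count')} UE={read_int(mc / 'ue_count')}")
    # Per-DIMM counters let you name the physical module to swap.
    for dimm in sorted(mc.glob("dimm*")):
        label_file = dimm / "dimm_label"
        label = label_file.read_text().strip() if label_file.exists() else dimm.name
        ce = read_int(dimm / "dimm_ce_count")
        ue = read_int(dimm / "dimm_ue_count")
        if ce or ue:
            print(f"  {label}: CE={ce} UE={ue}")
```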
Researchers report 'Stale NFS file handle' errors when accessing shared datasets on ml-node-03.
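'Stale NFS file handle' means the client holds a handle for an object the server no longer recognizes, typically after the export was re-created or the server-side filesystem changed underneath it. The sketch below probes each NFS mount from a child process so a hung mount cannot wedge the check; the 5-second timeout is arbitrary.

```python
#!/usr/bin/env python3
"""Probe every NFS mount for staleness without risking a hang in this script."""
import subprocess
import sys

def nfs_mounts() -> list[str]:
    mounts = []
    with open("/proc/mounts") as fh:
        for line in fh:
            _dev, mountpoint, fstype, *_ = line.split()
            if fstype.startswith("nfs"):
                mounts.append(mountpoint)
    return mounts

for mountpoint in nfs_mounts():
    # Stat in a child process: a dead NFS server can block a stat() indefinitely.
    try:
        probe = subprocess.run(
            [sys.executable, "-c", f"import os; os.stat({mountpoint!r})"],
            capture_output=True, text=True, timeout=5,
        )
    except subprocess.TimeoutExpired:
        print(f"{mountpoint}: HUNG (no reply within 5s; server or network issue)")
        continue
    if probe.returncode == 0:
        print(f"{mountpoint}: ok")
    elif "Stale" in probe.stderr:
        print(f"{mountpoint}: STALE handle; export likely changed or was re-created on the server")
    else:
        print(f"{mountpoint}: error: {probe.stderr.strip()}")
```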
8 nodes in Rack 22 simultaneously lost network connectivity. BMCs are also unreachable.
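Nodes and their BMCs disappearing together points away from the hosts and toward something the whole rack shares: the top-of-rack switch, its uplinks, or power. A quick reachability sweep run from a machine outside the rack confirms the blast radius; the hostname patterns below are purely illustrative.

```python
#!/usr/bin/env python3
"""Ping the rack's nodes, BMCs, and ToR switches in parallel and summarize reachability."""
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical naming scheme for illustration only; substitute real inventory.
NODES    = [f"rack22-node{n:02d}" for n in range(1, 9)]
BMCS     = [f"rack22-node{n:02d}-bmc" for n in range(1, 9)]
SWITCHES = ["rack22-tor1", "rack22-tor2"]

def alive(host: str) -> tuple[str, bool]:
    """One ping with a short timeout: is the host reachable at all?"""
    proc = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        capture_output=True, text=True,
    )
    return host, proc.returncode == 0

with ThreadPoolExecutor(max_workers=32) as pool:
    results = dict(pool.map(alive, NODES + BMCS + SWITCHES))

for group, hosts in (("nodes", NODES), ("BMCs", BMCS), ("ToR switches", SWITCHES)):
    up = sum(results[h] for h in hosts)
    print(f"{group}: {up}/{len(hosts)} reachable")
# Nodes and BMCs dark together, with the ToR also unreachable, points at the
# rack's power or uplink rather than at the hosts themselves.
```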