AWS SageMaker RCF Multi-Server Anomaly Detection
Achievement Log
Overview
Enterprise-grade anomaly detection system for a telecom client monitoring 500+ servers using Amazon SageMaker's Random Cut Forest algorithm. Detects abnormal CPU, memory, and network I/O patterns via an event-driven Lambda inference pipeline with automated retraining. Replaced static threshold alerting, reducing mean time to detection from 1 hour to 4.3 seconds.
Implementation & Architecture
Historical Data Ingestion & Baseline Training Pipeline
Downloaded 6 months of CPU/memory/network data for 500+ servers via parallel CloudWatch calls, applied rolling z-score normalization, serialized to RecordIO-protobuf, and launched a SageMaker training job. Model Monitor baseline was established post-training, and the endpoint was deployed with auto-scaling and a 10% data capture config.
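The rolling z-score step can be sketched roughly as below. This is a minimal illustration, not the project's code: the function name `rolling_zscore`, the 60-sample window, and the choice to include the current sample in the trailing window are all assumptions made for the example.

```python
from collections import deque
from math import sqrt


def rolling_zscore(values, window=60, eps=1e-9):
    """Yield (x - mean) / std for each x, using a trailing window that includes x.

    `window` and `eps` are illustrative defaults, not values from the project.
    """
    buf = deque(maxlen=window)  # oldest samples fall out automatically
    for x in values:
        buf.append(x)
        n = len(buf)
        mean = sum(buf) / n
        # Population variance over the current window.
        var = sum((v - mean) ** 2 for v in buf) / n
        std = sqrt(var)
        # A flat window has no spread, so report 0 rather than dividing by zero.
        yield 0.0 if std < eps else (x - mean) / std


if __name__ == "__main__":
    cpu = [10.0, 10.0, 10.0, 40.0]
    print([round(z, 3) for z in rolling_zscore(cpu, window=4)])
```

Normalizing each server against its own recent history is what lets a single model compare servers with very different baseline load.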
Real-Time Detection & Alerting Architecture
A Lambda loop running every 30 seconds pulled metric batches from DynamoDB, invoked the SageMaker endpoint with 50-server CSV payloads, and applied a two-tier threshold to the returned scores (P1 > 3.0, P2 1.5–3.0). A dynamic moving-average threshold adjusted for fleet-wide load cycles, reducing false positives by 40%. A dead-letter queue (DLQ) with a reprocessor Lambda handled endpoint downtime.
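One way the two-tier classification could combine with a moving-average adjustment is sketched below. The P1 and P2 cut-offs come from the text; everything else is an assumption for illustration, including the class name `DynamicThreshold`, the 20-batch window, and the idea of subtracting the moving average of recent fleet-wide mean scores before applying the cut-offs. The source does not say how the dynamic threshold was actually computed.

```python
from collections import deque

P1_CUTOFF = 3.0  # from the text: P1 when the score exceeds 3.0
P2_CUTOFF = 1.5  # from the text: P2 between 1.5 and 3.0


class DynamicThreshold:
    """Hypothetical helper: shifts scores by a moving average of fleet-wide means."""

    def __init__(self, window=20):
        # `window` (number of recent batches to average) is an assumed value.
        self._fleet_means = deque(maxlen=window)

    def update(self, batch_scores):
        """Record the mean anomaly score of one scored batch."""
        self._fleet_means.append(sum(batch_scores) / len(batch_scores))

    @property
    def offset(self):
        """Moving average of recent fleet means, or 0.0 before any update."""
        if not self._fleet_means:
            return 0.0
        return sum(self._fleet_means) / len(self._fleet_means)

    def classify(self, score):
        """Return 'P1', 'P2', or None after removing the fleet-wide offset."""
        adjusted = score - self.offset
        if adjusted > P1_CUTOFF:
            return "P1"
        if adjusted >= P2_CUTOFF:
            return "P2"
        return None


if __name__ == "__main__":
    dt = DynamicThreshold(window=2)
    print(dt.classify(3.5))   # no history yet, so the raw score is used
    dt.update([1.0, 1.0])     # the whole fleet is running hotter
    print(dt.classify(3.5))   # same raw score, lower severity after adjustment
```

The design intent in this sketch is that a fleet-wide rise in scores (for example, a daily peak) raises the bar for every server, so only servers that stand out from the fleet are alerted.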
Model Drift Detection & Automated Retraining Loop
A custom Lambda computed the 7-day rolling mean of anomaly scores and triggered a SageMaker Pipeline retraining run when drift exceeded 1.5σ. Blue/green endpoint promotion routed 10% of traffic to the new variant for 24 hours before full cutover. Retraining events were published to SNS for ops team visibility.
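The drift check might look something like the sketch below. It assumes that "drift" means the distance between the rolling mean of daily anomaly-score means and a baseline mean, measured in baseline standard deviations; the source does not define it, so treat that reading, the function name `should_retrain`, and all parameter names as hypothetical. The `window` and `required_breaches` parameters are exposed because the Engineering Challenges section describes later moving to a 14-day window and 3 consecutive breaches.

```python
from statistics import mean


def should_retrain(daily_means, baseline_mean, baseline_std,
                   window=7, sigma=1.5, required_breaches=1):
    """Return True when the most recent rolling means all drift past the limit.

    daily_means: one mean anomaly score per day, oldest first (assumed layout).
    A breach is |rolling_mean - baseline_mean| > sigma * baseline_std.
    """
    needed = window + required_breaches - 1
    if len(daily_means) < needed or baseline_std <= 0:
        return False  # not enough history, or no usable baseline spread
    limit = sigma * baseline_std
    # Check the latest window, then the one ending a day earlier, and so on.
    for k in range(required_breaches):
        end = len(daily_means) - k
        rolling = mean(daily_means[end - window:end])
        if abs(rolling - baseline_mean) <= limit:
            return False
    return True


if __name__ == "__main__":
    history = [1.5] * 9
    print(should_retrain(history, baseline_mean=1.0, baseline_std=0.2))
    print(should_retrain(history, 1.0, 0.2, window=7, required_breaches=3))
```

In a Lambda, a True result would be the point at which the retraining pipeline is started and the SNS notification is published.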
Technical Skills
- Amazon SageMaker
- AWS Lambda
- Python
- Anomaly Detection
- CloudWatch
- MLflow
- Bash Scripting
Engineering Challenges
- →Metric Variance During Peak Traffic — Bimodal traffic distribution caused 30+ false positives/day with a fixed threshold. Fixed by rolling z-score normalization before inference and switching to a dynamic moving-average threshold.
- →CloudWatch API Rate Throttling — 500-server parallel ingestion hit the 400 TPS limit within 30 seconds. Fixed with token-bucket rate limiting (350 calls/sec), exponential backoff with jitter, and a TPS limit increase via AWS Support.
- →SageMaker Endpoint Cold Start During DLQ Reprocessing — DLQ reprocessor invoked the endpoint during a blue/green update window, losing events after 3 retries. Fixed by adding an endpoint health check before each retry attempt.
- →Model Drift Trigger Instability — Short drift window triggered 6 unnecessary retraining jobs in 72 hours during a maintenance window. Fixed by extending the drift evaluation window to 14 days and requiring 3 consecutive threshold breaches.
- →RecordIO Serialization Memory Spike — Building a 500-server payload in Lambda RAM consumed 480 MB, near the limit. Fixed by batching into 50-server groups with a generator-based CSV serializer, reducing peak memory to 180 MB.
- →DynamoDB Buffer Write Conflicts — Concurrent Lambda writes to the same item caused silent metric loss. Fixed by switching to an append-only design with composite keys and GSI queries for the rolling window.
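The CloudWatch throttling fix listed above pairs a token bucket with exponential backoff and jitter. A minimal sketch of both pieces follows. The 350 calls/sec rate is from the text; the class and function names, the bucket capacity, the backoff base and cap, and the use of "full jitter" (a uniform delay between zero and the exponential ceiling) are assumptions for illustration.

```python
import random
import time


class TokenBucket:
    """Allow up to `rate` calls per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._tokens = capacity      # start with a full bucket
        self._clock = clock          # injectable so the bucket can be tested
        self._last = clock()

    def try_acquire(self):
        """Take one token if available; return False when the caller must wait."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


if __name__ == "__main__":
    bucket = TokenBucket(rate=350, capacity=350)
    for attempt in range(5):
        if bucket.try_acquire():
            print("call allowed")  # the CloudWatch request would go here
            break
        time.sleep(backoff_delay(attempt))
```

Keeping the bucket rate below the service limit leaves headroom for other callers in the account, while the jittered backoff spreads out retries from many parallel workers that are throttled at the same moment.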
Project Outcomes
- ✓Detected 15+ critical server anomalies before user-visible impact across a 500-server telecom fleet.
- ✓Reduced MTTD from 1 hour to 4.3 seconds, cutting detection latency by more than 99.8%.
- ✓Achieved 95% detection precision on an 87-incident labeled holdout, versus 2.7% for the legacy system.
- ✓Automated the full ML lifecycle (ingestion → training → deployment → drift → retraining) with zero human intervention for routine maintenance.
- ✓Reduced monthly alert noise from 847 to 51 — a 94% reduction in alert fatigue.
- ✓Came in 42% below budget on total monthly AWS cost after Spot migration and endpoint auto-scaling.