AWS SageMaker RCF Multi-Server Anomaly Detection

Achievement Log

2025-01 Week 1: Audited 500+ server CloudWatch coverage, resolved 17 missing-metric agents, validated >99.5% data completeness. Built parallel Python ingestion with rate limiting and z-score normalization. 2025-01 Week 2–3: Ran hyperparameter grid search (F1=0.87), trained on full 6-month corpus, deployed endpoint with data capture and Model Monitor baseline. Deployed 30-second Lambda inference loop with DynamoDB buffer and SNS/PagerDuty integration. 5-day parallel run calibrated thresholds. 2025-02 Week 1: Identified month-end false positive spikes — implemented dynamic EMA threshold (false positives: 847 → 51/month). Added DLQ resilience layer and endpoint health-check guard. 2025-02 Week 2: Implemented automated retraining pipeline with blue/green promotion. Migrated Batch to Spot (55% cost reduction), added auto-scaling (30% cost reduction). Delivered client handover with documented runbook. First autonomous retraining cycle completed without human intervention.

Overview

Enterprise-grade anomaly detection system for a telecom client monitoring 500+ servers using Amazon SageMaker's Random Cut Forest algorithm. Detects abnormal CPU, memory, and network I/O patterns via an event-driven Lambda inference pipeline with automated retraining. Replaced static threshold alerting, reducing mean time to detection from 1 hour to 4.3 seconds.

Core Technologies

Amazon SageMaker (RCF)Python & Boto3AWS Batch & LambdaAmazon CloudWatchMLflow & Experiment TrackingBash Scripting & CLI Automation

Implementation & Architecture

Historical Data Ingestion & Baseline Training Pipeline

Downloaded 6 months of CPU/memory/network data for 500+ servers via parallel CloudWatch calls, applied rolling z-score normalization, serialized to RecordIO-protobuf, and launched a SageMaker training job. Model Monitor baseline was established post-training, and the endpoint was deployed with auto-scaling and a 10% data capture config.

Real-Time Detection & Alerting Architecture

30-second Lambda loop pulled metric batches from DynamoDB, invoked the SageMaker endpoint with 50-server CSV payloads, and applied a two-tier threshold (P1 > 3.0, P2 1.5–3.0). A dynamic moving-average threshold adjusted for fleet-wide load cycles, reducing false positives by 40%. DLQ with a reprocessor Lambda handled endpoint downtime.

Model Drift Detection & Automated Retraining Loop

Custom Lambda computed 7-day rolling anomaly score mean and triggered a SageMaker Pipeline retraining when drift exceeded 1.5σ. Blue/green endpoint promotion routed 10% traffic to the new variant for 24 hours before full cutover. Retraining events published to SNS for ops team visibility.

Project Outcomes

✓
Detected 15+ critical server anomalies before user-visible impact across a 500-server telecom fleet.
✓
Reduced MTTD from 1 hour to 4.3 seconds — a 99% reduction in detection latency.
✓
Achieved 95% detection precision on a 87-incident labeled holdout vs 2.7% for the legacy system.
✓
Automated the full ML lifecycle (ingestion → training → deployment → drift → retraining) with zero human intervention for routine maintenance.
✓
Reduced monthly alert noise from 847 to 51 — a 94% reduction in alert fatigue.
✓
42% below budget on total monthly AWS cost after Spot migration and endpoint auto-scaling.