Private Architecture
February 2025 · 2 weeks

AWS SageMaker RCF Multi-Server Anomaly Detection

Lead Cloud/AI Engineer · Engineering Dossier

Achievement Log

  • 2025-01 Week 1: Audited 500+ server CloudWatch coverage, resolved 17 missing-metric agents, validated >99.5% data completeness. Built parallel Python ingestion with rate limiting and z-score normalization.
  • 2025-01 Week 2–3: Ran hyperparameter grid search (F1=0.87), trained on full 6-month corpus, deployed endpoint with data capture and Model Monitor baseline. Deployed 30-second Lambda inference loop with DynamoDB buffer and SNS/PagerDuty integration. 5-day parallel run calibrated thresholds.
  • 2025-02 Week 1: Identified month-end false positive spikes; implemented dynamic EMA threshold (false positives: 847 → 51/month). Added DLQ resilience layer and endpoint health-check guard.
  • 2025-02 Week 2: Implemented automated retraining pipeline with blue/green promotion. Migrated Batch to Spot (55% cost reduction), added auto-scaling (30% cost reduction). Delivered client handover with documented runbook. First autonomous retraining cycle completed without human intervention.

Overview

Enterprise-grade anomaly detection system for a telecom client monitoring 500+ servers using Amazon SageMaker's Random Cut Forest algorithm. Detects abnormal CPU, memory, and network I/O patterns via an event-driven Lambda inference pipeline with automated retraining. Replaced static threshold alerting, reducing mean time to detection from 1 hour to 4.3 seconds.

Core Technologies

  • Amazon SageMaker (RCF)
  • Python & Boto3
  • AWS Batch & Lambda
  • Amazon CloudWatch
  • MLflow & Experiment Tracking
  • Bash Scripting & CLI Automation

Implementation & Architecture

Historical Data Ingestion & Baseline Training Pipeline

Downloaded 6 months of CPU/memory/network data for 500+ servers via parallel CloudWatch calls, applied rolling z-score normalization, serialized to RecordIO-protobuf, and launched a SageMaker training job. Model Monitor baseline was established post-training, and the endpoint was deployed with auto-scaling and a 10% data capture config.
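The rolling z-score step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `rolling_zscore`, the window length, and the choice to normalize each sample against the preceding samples only (rather than a window that includes it) are all assumptions.

```python
from collections import deque
from math import sqrt
from typing import Iterable, Iterator


def rolling_zscore(values: Iterable[float], window: int = 12) -> Iterator[float]:
    """Yield each value normalized against the preceding `window` samples.

    Samples with fewer than two preceding values, or whose trailing window
    has zero variance, yield 0.0 because no meaningful z-score exists.
    The window size of 12 is an illustrative assumption.
    """
    history: deque = deque(maxlen=window)
    for x in values:
        if len(history) >= 2:
            mean = sum(history) / len(history)
            # Population variance of the trailing window.
            var = sum((h - mean) ** 2 for h in history) / len(history)
            std = sqrt(var)
            yield (x - mean) / std if std > 0 else 0.0
        else:
            yield 0.0
        # The current sample only joins the window after it has been scored.
        history.append(x)


if __name__ == "__main__":
    cpu = [10, 12, 10, 12, 30]  # made-up CPU readings for one server
    print([round(z, 2) for z in rolling_zscore(cpu, window=4)])
```

Scoring each sample against its trailing window keeps a sudden spike from inflating its own baseline, which is one plausible reason to prefer this over a centered window.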

Real-Time Detection & Alerting Architecture

A Lambda function running every 30 seconds pulled metric batches from DynamoDB, invoked the SageMaker endpoint with 50-server CSV payloads, and applied a two-tier threshold (P1 for scores above 3.0, P2 for scores from 1.5 to 3.0). A dynamic moving-average threshold adjusted for fleet-wide load cycles, reducing false positives by 40%. A dead-letter queue (DLQ) with a reprocessor Lambda handled endpoint downtime.
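One way the two-tier classification could interact with a moving-average adjustment is sketched below. Only the 3.0 and 1.5 cut-offs come from the text; the class name `EmaThreshold`, the smoothing factor, the baseline value, and the idea of shifting both tiers upward by the EMA's excess over the baseline are assumptions made for illustration.

```python
from dataclasses import dataclass

P1_BASE = 3.0  # from the text: scores above 3.0 are P1
P2_BASE = 1.5  # from the text: scores from 1.5 to 3.0 are P2


@dataclass
class EmaThreshold:
    """Tracks an EMA of the fleet-wide mean score and offsets both tiers by it."""

    alpha: float = 0.1     # assumed smoothing factor
    baseline: float = 1.0  # assumed fleet mean score under normal load
    ema: float = 1.0       # starts at the baseline

    def update(self, fleet_mean_score: float) -> None:
        # Standard exponential moving average update.
        self.ema = self.alpha * fleet_mean_score + (1 - self.alpha) * self.ema

    def classify(self, score: float) -> str:
        # Raise both tiers when the whole fleet runs hot; never lower them.
        offset = max(0.0, self.ema - self.baseline)
        if score > P1_BASE + offset:
            return "P1"
        if score >= P2_BASE + offset:
            return "P2"
        return "OK"


if __name__ == "__main__":
    t = EmaThreshold(alpha=0.5)
    print(t.classify(3.5))  # static tiers apply while the EMA sits at baseline
    t.update(3.0)           # fleet-wide load rises, e.g. a month-end batch cycle
    print(t.classify(3.5))  # the same score is now judged against raised tiers
```

Under this design a score that would page as P1 on a quiet day is downgraded while the entire fleet is elevated, which matches the stated goal of absorbing fleet-wide load cycles.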

Model Drift Detection & Automated Retraining Loop

A custom Lambda computed the 7-day rolling mean of anomaly scores and triggered a SageMaker Pipeline retraining run when drift exceeded 1.5σ. Blue/green endpoint promotion routed 10% of traffic to the new variant for 24 hours before full cutover. Retraining events were published to SNS for ops team visibility.
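The drift check described above might look something like this. It is a sketch under assumptions: drift is taken to mean the distance between the recent mean score and a stored baseline mean, measured in baseline standard deviations. The function names and the way the baseline statistics are supplied are hypothetical; only the 1.5σ limit comes from the text.

```python
from statistics import mean
from typing import Sequence


def drift_sigma(recent_scores: Sequence[float],
                baseline_mean: float,
                baseline_std: float) -> float:
    """Return how many baseline standard deviations the recent mean has moved."""
    if baseline_std <= 0:
        raise ValueError("baseline_std must be positive")
    return abs(mean(recent_scores) - baseline_mean) / baseline_std


def should_retrain(recent_scores: Sequence[float],
                   baseline_mean: float,
                   baseline_std: float,
                   limit: float = 1.5) -> bool:
    """True when drift exceeds `limit` sigma (1.5 in the text)."""
    return drift_sigma(recent_scores, baseline_mean, baseline_std) > limit


if __name__ == "__main__":
    # Made-up scores standing in for a rolling window of anomaly scores.
    print(should_retrain([1.0, 1.2, 1.1], baseline_mean=1.0, baseline_std=0.2))
    print(should_retrain([1.4, 1.5, 1.6], baseline_mean=1.0, baseline_std=0.2))
```

In a real deployment the `True` branch would start the retraining pipeline; that call is omitted here to keep the sketch free of AWS dependencies.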

Technical Skills

  • Amazon SageMaker
  • AWS Lambda
  • Python
  • Anomaly Detection
  • CloudWatch
  • MLflow
  • Bash Scripting

Engineering Challenges

  • Metric Variance During Peak Traffic — Bimodal traffic distribution caused 30+ false positives/day with a fixed threshold. Fixed by rolling z-score normalization before inference and switching to a dynamic moving-average threshold.
  • CloudWatch API Rate Throttling — 500-server parallel ingestion hit the 400 TPS limit within 30 seconds. Fixed with token-bucket rate limiting (350 calls/sec), exponential backoff with jitter, and a TPS limit increase via AWS Support.
  • SageMaker Endpoint Cold Start During DLQ Reprocessing — DLQ reprocessor invoked the endpoint during a blue/green update window, losing events after 3 retries. Fixed by adding an endpoint health check before each retry attempt.
  • Model Drift Trigger Instability — Short drift window triggered 6 unnecessary retraining jobs in 72 hours during a maintenance window. Fixed by extending the drift evaluation window to 14 days and requiring 3 consecutive threshold breaches.
  • RecordIO Serialization Memory Spike — Building a 500-server payload in Lambda RAM consumed 480 MB, near the limit. Fixed by batching into 50-server groups with a generator-based CSV serializer, reducing peak memory to 180 MB.
  • DynamoDB Buffer Write Conflicts — Concurrent Lambda writes to the same item caused silent metric loss. Fixed by switching to an append-only design with composite keys and GSI queries for the rolling window.
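
For the CloudWatch throttling item above, a token bucket combined with jittered exponential backoff can be sketched as follows. The 350 calls/sec rate comes from the text; the class and function names, the bucket capacity, the backoff base and cap, and the "full jitter" variant are assumptions, and the injectable `clock` and `rng` parameters exist only to make the sketch testable.

```python
import random
import time


class TokenBucket:
    """Allows up to `rate` acquisitions per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng=random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


if __name__ == "__main__":
    bucket = TokenBucket(rate=350, capacity=350)
    for attempt in range(5):
        if bucket.try_acquire():
            print("token granted; the CloudWatch call would go here")
            break
        time.sleep(backoff_delay(attempt))
```

Keeping the bucket rate below the service limit (350 against a 400 TPS ceiling in the text) leaves headroom for other callers on the same account, while the jitter spreads retries so that throttled workers do not all wake at once.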

Project Outcomes

  • Detected 15+ critical server anomalies before user-visible impact across a 500-server telecom fleet.
  • Reduced MTTD from 1 hour to 4.3 seconds, a reduction of more than 99% in detection latency.
  • Achieved 95% detection precision on an 87-incident labeled holdout vs 2.7% for the legacy system.
  • Automated the full ML lifecycle (ingestion → training → deployment → drift → retraining) with zero human intervention for routine maintenance.
  • Reduced monthly alert noise from 847 to 51 — a 94% reduction in alert fatigue.
  • Finished 42% below budget on total monthly AWS cost after the Spot migration and endpoint auto-scaling.