EKS Deployment and Monitoring Project

Achievement Log

2025-03 Week 1: Designed the 3-tier VPC, wrote the Terraform vpc/ and eks/ modules, and provisioned EKS 1.29 with a private endpoint. Resolved 3 IAM permission gaps during apply.

2025-03 Week 2: Deployed the ALB controller, PostgreSQL RDS, ElastiCache Redis, and n8n via Helm. Hit CrashLoopBackOff caused by a missing ElastiCache security group rule and fixed it. Consolidated 4 ALBs to 1 ($88 → $22/month).

2025-03 Week 3: Found a dual-execution bug during a load test and migrated to n8n queue mode with Redis locking. Deployed Open WebUI. Deployed kube-prometheus-stack, an OTEL DaemonSet, and 4 Grafana dashboards. Fixed a silent Fluentd log drop (wrong OIDC ARN in IRSA).

2025-04 Week 1–2: Hit Prometheus TSDB corruption (50Gi volume full) and resized the volume to 200Gi. Resolved HPA oscillation by adding a stabilization window. Deployed PgBouncer (RDS connections: 100 → 12, stable). k6 load test passed: 1,200 executions, zero failures. Implemented WARN/ERROR log filtering (45 GB/day → 7.2 GB/day). Rightsized worker CPU requests (1000m → 250m). Delivered full handover documentation.

Overview

Production Kubernetes infrastructure automating the deployment and monitoring of n8n workflow automation, Open WebUI, and a full observability stack (Prometheus, Grafana, OpenTelemetry) on a 20-node AWS EKS cluster. Provisioned entirely via Terraform with zero manual console actions. Reduced incident MTTR from 45 minutes to under 8 minutes and achieved 99.9% uptime over a 6-week production period.

Core Technologies

AWS EKS (Kubernetes), Terraform, Prometheus & Grafana, OpenTelemetry (OTEL), Helm & Kubernetes Manifests

Implementation & Architecture

VPC & EKS Cluster Foundation (Terraform)

3-tier VPC (public/private/isolated subnets across 3 AZs) with NAT Gateways, a private EKS endpoint, and VPC Flow Logs. IRSA roles were scoped for cluster-autoscaler, the ALB controller, external-secrets, and the OTEL collector. Four ALBs were consolidated into one via IngressGroup annotations, a 75% ALB cost reduction ($88 → $22/month). Node taints segregated monitoring pods from user workloads.
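
The IngressGroup consolidation can be illustrated with a short sketch. The AWS Load Balancer Controller merges Ingress resources that carry the same alb.ingress.kubernetes.io/group.name annotation onto a single ALB. The Python below only builds example manifests as dictionaries; the group name, hostnames, service names, and ports are hypothetical placeholders, not values taken from this project.

```python
import json


def grouped_ingress(name, host, service, port, group="shared-alb", order=10):
    """Build an Ingress manifest that joins a shared ALB IngressGroup.

    All argument values used below are hypothetical placeholders.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "annotations": {
                # Ingresses with the same group.name share one ALB.
                "alb.ingress.kubernetes.io/group.name": group,
                # Rule ordering within the group; annotation values are strings.
                "alb.ingress.kubernetes.io/group.order": str(order),
            },
        },
        "spec": {
            "ingressClassName": "alb",
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {
                                        "name": service,
                                        "port": {"number": port},
                                    }
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


# Hypothetical services; each would otherwise get its own ALB.
manifests = [
    grouped_ingress("n8n", "n8n.example.com", "n8n-main", 5678, order=10),
    grouped_ingress("open-webui", "chat.example.com", "open-webui", 8080, order=20),
    grouped_ingress("grafana", "grafana.example.com", "grafana", 80, order=30),
]

if __name__ == "__main__":
    print(json.dumps(manifests[0], indent=2))
```

Because every manifest carries the same group name, the controller would reconcile them into one load balancer with host-based rules, which is where the per-ALB cost saving comes from.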

n8n Stateful Workflow Automation Stack

n8n in queue mode (main plus workers): 1 main pod (webhook/API) and 3 worker pods consuming a Redis-backed Bull queue, backed by PostgreSQL RDS Multi-AZ. An HPA scaled workers between 2 and 8 replicas against a 70% CPU utilization target. The S3 binary data manager replaced NFS for multi-pod binary access. PgBouncer in transaction pooling mode capped RDS connections at 12 regardless of worker pod count.
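
To make the worker scaling behaviour concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 and clamped to the configured bounds. The 70% target and the 2 to 8 range come from the description above; the 10% tolerance is the Kubernetes default, and the function name is hypothetical. The stabilization window added to stop oscillation is not modeled here.

```python
import math


def desired_replicas(current, cpu_pct, target_pct=70.0,
                     min_replicas=2, max_replicas=8, tolerance=0.1):
    """Approximate the HPA replica calculation for a CPU utilization target.

    Simplified sketch: ignores stabilization windows, pod readiness,
    and missing metrics, all of which the real controller accounts for.
    """
    ratio = cpu_pct / target_pct
    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current
    proposed = math.ceil(current * ratio)
    # Clamp to the configured minimum and maximum.
    return max(min_replicas, min(max_replicas, proposed))


if __name__ == "__main__":
    for replicas, cpu in [(3, 140), (3, 72), (6, 200), (3, 10)]:
        print(replicas, cpu, desired_replicas(replicas, cpu))
```

With 3 workers at 140% average utilization the sketch proposes 6 replicas, while 72% falls inside the tolerance band and leaves the count unchanged. Rapid swings across that band are what a scale-down stabilization window is meant to dampen.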

Observability Stack (Prometheus + Grafana + OpenTelemetry)

kube-prometheus-stack with 4 custom Grafana dashboards, 4 Prometheus alerting rules, and a Fluentd DaemonSet forwarding only WARN and ERROR logs to CloudWatch Logs, which reduced daily ingestion from 45 GB to 7.2 GB (84% reduction). The Grafana alert on workflow failure rate included a direct link to the failing workflow's execution log.
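
The WARN/ERROR-only forwarding amounts to a level predicate applied to each record before shipping. In Fluentd this kind of rule is typically expressed with a grep filter; the Python below is only an illustration of the logic, and the record shape (a dictionary with a "level" key) and the function names are assumptions rather than the project's actual log schema.

```python
# Levels that are forwarded; everything else is dropped before shipping.
KEEP_LEVELS = {"WARN", "WARNING", "ERROR"}


def should_forward(record):
    """Return True if a log record is WARN or ERROR (case-insensitive).

    Assumes records are dicts with a 'level' key; records without a
    level are dropped in this sketch.
    """
    return str(record.get("level", "")).upper() in KEEP_LEVELS


def filter_records(records):
    """Keep only the records that should be forwarded."""
    return [r for r in records if should_forward(r)]


if __name__ == "__main__":
    sample = [
        {"level": "info", "msg": "workflow started"},
        {"level": "warn", "msg": "retrying webhook"},
        {"level": "ERROR", "msg": "execution failed"},
        {"level": "debug", "msg": "queue poll"},
        {"msg": "no level field"},
    ]
    for record in filter_records(sample):
        print(record["level"], record["msg"])
```

One design consideration with this approach: records that lack a level field are dropped silently, so a real pipeline might prefer to forward unparseable records rather than lose them.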

Project Outcomes

✓ Full cluster provisioned via Terraform in under 25 minutes, with zero manual AWS console actions.
✓ 99.9% uptime for n8n over 6 weeks, including 3 node failures handled automatically by cluster autoscaler.
✓ Incident MTTR reduced from 45 minutes to under 8 minutes via pre-built Grafana dashboards with direct alert-to-log links (82% MTTR reduction).
✓ 20-node EKS cluster scaled autonomously through a 200-concurrent-workflow stress test.
✓ CloudWatch Logs ingestion reduced 84% (45 GB/day → 7.2 GB/day) without losing actionable signal.
✓ ALB costs reduced 75% ($88 → $22/month) through IngressGroup consolidation.
✓ Zero duplicate workflow executions after migrating to n8n queue mode, validated across 1,200 test runs.
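
The percentages quoted in these outcomes follow from the before and after figures, and the short sketch below reproduces the arithmetic. It also converts the 99.9% uptime figure into a downtime budget, assuming the 6-week period is counted as 42 full days; the helper names are hypothetical.

```python
def reduction_pct(before, after):
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


def downtime_budget_minutes(uptime_pct, days):
    """Minutes of downtime allowed by an uptime percentage over `days` days."""
    return days * 24 * 60 * (1 - uptime_pct / 100)


if __name__ == "__main__":
    print(f"Log ingestion: {reduction_pct(45, 7.2):.1f}% reduction")
    print(f"ALB cost:      {reduction_pct(88, 22):.1f}% reduction")
    # MTTR is reported as "under 8 minutes", so 8 gives a lower bound.
    print(f"MTTR:          {reduction_pct(45, 8):.1f}% reduction (at least)")
    print(f"Downtime budget at 99.9% over 42 days: "
          f"{downtime_budget_minutes(99.9, 42):.1f} minutes")
```

Under these assumptions the reductions come out at roughly 84%, 75%, and 82%, and 99.9% uptime over 42 days leaves a budget of about an hour of downtime.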