EKS Deployment and Monitoring Project
Achievement Log
Overview
Production Kubernetes infrastructure automating the deployment and monitoring of n8n workflow automation, Open WebUI, and a full observability stack (Prometheus, Grafana, OpenTelemetry) on a 20-node AWS EKS cluster. Provisioned entirely via Terraform with zero manual console actions. Reduced incident MTTR from 45 minutes to under 8 minutes and achieved 99.9% uptime over a 6-week production period.
Core Technologies
Implementation & Architecture
VPC & EKS Cluster Foundation (Terraform)
3-tier VPC (public/private/isolated subnets across 3 AZs) with NAT Gateways, private EKS endpoint, and VPC Flow Logs. IRSA roles scoped for cluster-autoscaler, ALB controller, external-secrets, and OTEL collector. ALB consolidated from 4 to 1 via IngressGroup annotations — 75% ALB cost reduction. Node taints segregated monitoring pods from user workloads.
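As a rough illustration of the IngressGroup consolidation, the sketch below builds Ingress manifests that all carry the same `alb.ingress.kubernetes.io/group.name` annotation, which is what lets the AWS Load Balancer Controller merge them onto one ALB. The group name, hostnames, service names, and ports are hypothetical placeholders, not values from this project.

```python
from typing import Any

# Hypothetical group name; Ingresses that share it are merged onto one ALB.
GROUP_NAME = "shared-alb"


def build_ingress(name: str, host: str, service: str, port: int, order: int) -> dict[str, Any]:
    """Return a networking.k8s.io/v1 Ingress manifest that joins the shared group."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": name,
            "annotations": {
                "alb.ingress.kubernetes.io/group.name": GROUP_NAME,
                # Lower order values are evaluated first within the group.
                "alb.ingress.kubernetes.io/group.order": str(order),
            },
        },
        "spec": {
            "ingressClassName": "alb",
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


# Placeholder hosts and services, one per former ALB.
ingresses = [
    build_ingress("n8n", "n8n.example.com", "n8n-main", 5678, 10),
    build_ingress("open-webui", "chat.example.com", "open-webui", 8080, 20),
    build_ingress("grafana", "grafana.example.com", "grafana", 3000, 30),
    build_ingress("prometheus", "prometheus.example.com", "prometheus", 9090, 40),
]

groups = {i["metadata"]["annotations"]["alb.ingress.kubernetes.io/group.name"] for i in ingresses}
print(f"{len(ingresses)} Ingress resources share {len(groups)} ALB group")
```

Routing stays per-host inside the single ALB, so each service keeps its own rule while the hourly load balancer charge is paid once.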
n8n Stateful Workflow Automation Stack
n8n in queue mode (main plus workers): 1 main pod (webhook/API) and 3 worker pods consuming a Redis-backed Bull queue, with PostgreSQL on RDS Multi-AZ as the database. The HPA scaled workers between 2 and 8 replicas against a 70% CPU target. The S3 binary data manager replaced NFS for multi-pod binary access. PgBouncer in transaction pooling mode capped RDS connections at 12 regardless of worker pod count.
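The worker scaling described above follows the standard HPA rule, desired = ceil(current × currentMetric / targetMetric), clamped to the replica bounds. A minimal sketch, assuming the 70% CPU target and the 2 to 8 replica range from the text plus the Kubernetes default tolerance of 0.1; the function name is illustrative only.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 70.0,
                     min_replicas: int = 2, max_replicas: int = 8,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a CPU utilization target."""
    ratio = current_cpu_pct / target_cpu_pct
    # The HPA leaves the replica count alone when the ratio is close to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas / maxReplicas.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(3, 95))   # scale out: ceil(3 * 95 / 70)
print(desired_replicas(3, 72))   # within tolerance: unchanged
print(desired_replicas(6, 200))  # capped at max_replicas
print(desired_replicas(4, 20))   # floored at min_replicas
```

This ignores pod readiness and missing-metric handling, which the real controller also takes into account.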
Observability Stack (Prometheus + Grafana + OpenTelemetry)
kube-prometheus-stack with 4 custom Grafana dashboards, 4 Prometheus alerting rules, and Fluentd DaemonSet forwarding WARN/ERROR-only logs to CloudWatch Logs — reducing daily ingestion from 45 GB to 7.2 GB (84% reduction). Grafana workflow failure rate alert included a direct link to the failing workflow's execution log.
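The WARN/ERROR-only forwarding can be pictured as a severity filter applied before records leave the node. The sketch below is not the Fluentd configuration itself; it assumes JSON log lines with a `level` field (a hypothetical schema) and treats `WARNING` as an alias for `WARN`.

```python
import json
from typing import Iterable, Iterator

# Severities that would be forwarded; everything else is dropped at the node.
FORWARDED_LEVELS = {"WARN", "WARNING", "ERROR"}


def forwardable(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed records whose level is WARN or ERROR."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: unparseable lines are skipped in this sketch.
            continue
        if not isinstance(record, dict):
            continue
        if str(record.get("level", "")).upper() in FORWARDED_LEVELS:
            yield record


sample = [
    '{"level": "info", "msg": "workflow started"}',
    '{"level": "warn", "msg": "retrying webhook"}',
    '{"level": "error", "msg": "workflow failed"}',
    "not json",
]
kept = list(forwardable(sample))
print([r["msg"] for r in kept])  # ['retrying webhook', 'workflow failed']
```

In a real pipeline it may be safer to forward unparseable lines rather than drop them, since a crash dump is often not valid JSON.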
Technical Skills
- AWS EKS
- Terraform
- Kubernetes
- Helm
- Prometheus
- Amazon VPC
Engineering Challenges
- → IAM Permission Gap Blocking EKS Add-on Updates: eks:UpdateAddon and eks:DescribeAddonVersions were missing from the provisioning role's IAM policy. Fixed by expanding the inline policy and re-running terraform apply.
- → n8n Worker CrashLoopBackOff from Redis Connection Timeout: the ElastiCache security group was missing an inbound rule on port 6379 from the EKS node security group (a cross-module reference gap in Terraform). Fixed by adding an aws_security_group_rule in the eks/ module.
- → Prometheus Storage Exhaustion After 8 Days: hostmetrics generated 4× as many time series as estimated (12,000 vs 3,000), filling the 50Gi PVC and corrupting the TSDB. Fixed by resizing to 200Gi, lengthening the collection interval from 10s to 30s, and adding a retention.size cap.
- → Fluentd DaemonSet Silently Dropping Logs: the OIDC provider ARN in the IRSA trust policy was hardcoded to us-east-1 instead of eu-west-1. Identified via CloudTrail STS AssumeRole failures. Fixed by using the cluster's OIDC issuer output.
- → HPA Oscillation Under Steady-State Load: workers scaled in and out every 8–10 minutes because 6–8 minute jobs caused brief CPU dips below the scale-in threshold. Fixed by setting behavior.scaleDown.stabilizationWindowSeconds to 600.
- → PostgreSQL Connection Pool Saturation: 7 worker pods × 5 connections each exhausted RDS max_connections (100). Fixed by deploying PgBouncer in transaction pooling mode, stabilizing RDS connections at 12.
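The Fluentd fix above, deriving the IRSA trust policy from the cluster's OIDC issuer instead of a hardcoded region, can be sketched as follows. In the project this lives in Terraform; the Python version only illustrates the shape of the policy. The account ID, issuer ID, namespace, and service account name are placeholders, and the standard `aws` partition is assumed.

```python
import json
from urllib.parse import urlparse


def irsa_trust_policy(account_id: str, oidc_issuer_url: str,
                      namespace: str, service_account: str) -> dict:
    """Build an IAM trust policy for IRSA from the cluster's OIDC issuer URL."""
    parsed = urlparse(oidc_issuer_url)
    # IAM identifies the provider by the issuer URL without the https:// scheme.
    provider = f"{parsed.netloc}{parsed.path}".rstrip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Restrict the role to one service account in one namespace.
                f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder values; the issuer URL would come from the cluster, not a literal.
policy = irsa_trust_policy(
    "111122223333",
    "https://oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE",
    "logging",
    "fluentd",
)
print(json.dumps(policy, indent=2))
```

Because the region is part of the issuer URL, building the ARN from the cluster's own output keeps the trust policy correct when the same module is applied in another region.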
Project Outcomes
- ✓ Full cluster provisioned via Terraform in under 25 minutes, with zero manual AWS console actions.
- ✓ 99.9% uptime for n8n over 6 weeks, including 3 node failures handled automatically by cluster autoscaler.
- ✓ Incident MTTR reduced from 45 minutes to under 8 minutes (82% reduction) via pre-built Grafana dashboards with direct alert-to-log links.
- ✓ 20-node EKS cluster scaled autonomously through a 200-concurrent-workflow stress test.
- ✓ CloudWatch Logs ingestion reduced 84% (45 GB/day → 7.2 GB/day) without losing actionable signal.
- ✓ ALB costs reduced 75% ($88 → $22/month) through IngressGroup consolidation.
- ✓ Zero duplicate workflow executions after migrating to n8n queue mode, validated across 1,200 test runs.
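The percentages in the outcomes above can be checked with a few lines of arithmetic. The sketch treats "under 8 minutes" as exactly 8 and also derives the downtime budget that 99.9% uptime implies over a 6-week window; variable names are illustrative.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage reduction from before to after."""
    return (before - after) / before * 100


mttr = pct_reduction(45, 8)    # minutes, upper bound for "under 8"
logs = pct_reduction(45, 7.2)  # GB/day ingested into CloudWatch Logs
alb = pct_reduction(88, 22)    # USD/month for ALBs

# Downtime allowed by 99.9% uptime over 6 weeks.
total_minutes = 6 * 7 * 24 * 60
budget_minutes = total_minutes * (1 - 0.999)

print(f"MTTR reduction: {mttr:.0f}%")
print(f"Log ingestion reduction: {logs:.0f}%")
print(f"ALB cost reduction: {alb:.0f}%")
print(f"Downtime budget: {budget_minutes:.1f} of {total_minutes} minutes")
```

The budget works out to roughly an hour of downtime across the whole 6-week period.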