Private Architecture
April 2025, 2 weeks

EKS Deployment and Monitoring Project

Role: Cloud Infrastructure Engineer
Engineering Dossier

Achievement Log

  • 2025-03 Week 1: Designed 3-tier VPC, wrote Terraform vpc/ and eks/ modules, provisioned EKS 1.29 with private endpoint. Resolved 3 IAM permission gaps during apply.
  • 2025-03 Week 2: Deployed ALB controller, PostgreSQL RDS, ElastiCache Redis, and n8n via Helm. Hit CrashLoopBackOff from missing ElastiCache SG rule — fixed. Consolidated 4 ALBs to 1 ($88 → $22/month).
  • 2025-03 Week 3: Found dual-execution bug during load test — migrated to n8n queue mode with Redis locking. Deployed Open WebUI. Deployed kube-prometheus-stack, OTEL DaemonSet, 4 Grafana dashboards. Fixed Fluentd silent log-drop (wrong OIDC ARN in IRSA).
  • 2025-04 Week 1–2: Hit Prometheus TSDB corruption (50Gi full) — resized to 200Gi. Resolved HPA oscillation (added stabilization window). Deployed PgBouncer (RDS connections: 100 → 12 stable). k6 load test passed: 1,200 executions, zero failures. Implemented WARN/ERROR log filtering (45 GB/day → 7.2 GB/day). Rightsized worker CPU requests (1000m → 250m). Delivered full handover documentation.

Overview

Production Kubernetes infrastructure automating the deployment and monitoring of n8n workflow automation, Open WebUI, and a full observability stack (Prometheus, Grafana, OpenTelemetry) on a 20-node AWS EKS cluster. Provisioned entirely via Terraform with zero manual console actions. Reduced incident MTTR from 45 minutes to under 8 minutes and achieved 99.9% uptime over a 6-week production period.

Core Technologies

  • AWS EKS (Kubernetes)
  • Terraform
  • Prometheus & Grafana
  • OpenTelemetry (OTEL)
  • Helm & Kubernetes Manifests

Implementation & Architecture

VPC & EKS Cluster Foundation (Terraform)

3-tier VPC (public/private/isolated subnets across 3 AZs) with NAT Gateways, private EKS endpoint, and VPC Flow Logs. IRSA roles scoped for cluster-autoscaler, ALB controller, external-secrets, and OTEL collector. ALB consolidated from 4 to 1 via IngressGroup annotations — 75% ALB cost reduction. Node taints segregated monitoring pods from user workloads.
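The private-endpoint, flow-log and node-taint settings described above, expressed as Terraform. This is an added illustration, not the project's own vpc/ or eks/ module: the resource and variable names, the CloudWatch destination and 30-day retention for the flow logs, the IAM role variables and the node-group sizes are all assumptions.

```hcl
# Added example, not the project's Terraform. Variable names, role ARNs and
# the flow-log destination are assumptions.
variable "cluster_name"       { default = "n8n-prod" }   # assumed name
variable "vpc_id"             { type = string }          # output of the assumed vpc/ module
variable "private_subnet_ids" { type = list(string) }    # private tier of the 3-tier VPC
variable "cluster_role_arn"   { type = string }
variable "node_role_arn"      { type = string }
variable "flow_log_role_arn"  { type = string }          # role that may write to CloudWatch Logs

# VPC Flow Logs: every accepted and rejected flow is recorded, which is the
# record you consult when a pod talks to an address it should not.
resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/vpc/${var.cluster_name}/flow-logs"
  retention_in_days = 30
}

resource "aws_flow_log" "main" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL"
  log_destination_type = "cloud-watch-logs"
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  iam_role_arn         = var.flow_log_role_arn
}

# Control plane reachable only from inside the VPC. Audit and authenticator
# logs go to CloudWatch so every API call and token exchange is recorded.
resource "aws_eks_cluster" "main" {
  name     = var.cluster_name
  version  = "1.29"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids              = var.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = false
  }

  enabled_cluster_log_types = ["api", "audit", "authenticator"]
}

# Monitoring node group: the taint keeps user workloads off these nodes unless
# they explicitly tolerate it, so a misbehaving workflow pod is never scheduled
# beside Prometheus or Grafana.
resource "aws_eks_node_group" "monitoring" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "monitoring"
  node_role_arn   = var.node_role_arn
  subnet_ids      = var.private_subnet_ids

  scaling_config {
    desired_size = 2
    min_size     = 2
    max_size     = 3
  }

  taint {
    key    = "workload"
    value  = "monitoring"
    effect = "NO_SCHEDULE"
  }
}
```

With `endpoint_public_access = false`, `terraform apply` and `kubectl` must run from inside the VPC (a bastion, VPN or CI runner in a private subnet); that is the trade-off the private endpoint buys.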

n8n Stateful Workflow Automation Stack

n8n in main-worker mode: 1 main pod (webhook/API) + 3 worker pods consuming a Redis Bull queue, backed by PostgreSQL RDS Multi-AZ. HPA scaled workers from 2 to 8 replicas at 70% CPU. S3 binary data manager replaced NFS for multi-pod binary access. PgBouncer in transaction pooling mode capped RDS connections at 12 regardless of worker pod count.

Observability Stack (Prometheus + Grafana + OpenTelemetry)

kube-prometheus-stack with 4 custom Grafana dashboards, 4 Prometheus alerting rules, and Fluentd DaemonSet forwarding WARN/ERROR-only logs to CloudWatch Logs — reducing daily ingestion from 45 GB to 7.2 GB (84% reduction). Grafana workflow failure rate alert included a direct link to the failing workflow's execution log.
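The two observability settings that carried the most weight in this section, the Prometheus storage/retention cap from the TSDB incident and the Fluentd WARN/ERROR-only forwarding, as Terraform resources. This is an added example, not the project's Helm values or Fluentd configuration: the retention size of 180GB (the page gives no value), the namespaces, the CloudWatch log group name and the CRI log-line regex are assumptions, and the grep filter assumes n8n writes JSON log lines with a `level` field.

```hcl
# Added example, not the project's configuration. Chart values, namespaces
# and the log group name are assumptions; 200Gi is the size from the page.
resource "helm_release" "kube_prometheus_stack" {
  name             = "kube-prometheus-stack"
  repository       = "https://prometheus-community.github.io/helm-charts"
  chart            = "kube-prometheus-stack"
  namespace        = "monitoring"
  create_namespace = true

  values = [yamlencode({
    prometheus = {
      prometheusSpec = {
        # retentionSize becomes --storage.tsdb.retention.size; keep it below
        # the PVC so Prometheus compacts old blocks instead of filling the disk.
        retentionSize = "180GB"   # assumed value
        storageSpec = {
          volumeClaimTemplate = {
            spec = { resources = { requests = { storage = "200Gi" } } }
          }
        }
        # run on the tainted monitoring nodes only
        tolerations = [{ key = "workload", operator = "Equal", value = "monitoring", effect = "NoSchedule" }]
      }
    }
  })]
}

# Fluentd DaemonSet configuration: parse container logs, keep WARN/ERROR,
# drop everything else before it is shipped to CloudWatch.
resource "kubernetes_config_map_v1" "fluentd" {
  metadata {
    name      = "fluentd-config"
    namespace = "logging"
  }

  data = {
    "fluent.conf" = <<-EOT
      <source>
        @type tail
        path /var/log/containers/*.log
        pos_file /var/log/fluentd-containers.log.pos
        tag kubernetes.*
        <parse>
          # containerd (CRI) line format: "<time> <stream> <flags> <message>"
          @type regexp
          expression /^(?<time>[^ ]+) (?<stream>stdout|stderr) [^ ]* (?<log>.*)$/
          time_format %Y-%m-%dT%H:%M:%S.%N%:z
        </parse>
      </source>

      # The application's JSON line sits in "log"; lift its fields (incl. level).
      <filter kubernetes.**>
        @type parser
        key_name log
        reserve_data true
        <parse>
          @type json
        </parse>
      </filter>

      # Only WARN/ERROR survive this point, so INFO/DEBUG never counts toward
      # CloudWatch ingestion (the 45 GB/day -> 7.2 GB/day change on the page).
      <filter kubernetes.**>
        @type grep
        <regexp>
          key level
          pattern /^(warn|warning|error)$/i
        </regexp>
      </filter>

      <match kubernetes.**>
        @type cloudwatch_logs
        log_group_name /eks/n8n-prod/app
        use_tag_as_stream true
        auto_create_stream true
        region eu-west-1
      </match>
    EOT
  }
}
```

One caveat the filter implies: a record that fails JSON parsing has no `level` field and is dropped by `grep`. If unparsable lines might be the very signal you need (crash output, stack traces), route them to a separate `<match>` instead of letting the filter discard them.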

Technical Skills

  • AWS EKS
  • Terraform
  • Kubernetes
  • Helm
  • Prometheus
  • Amazon VPC

Engineering Challenges

  • RBAC Misconfiguration Blocking EKS Add-on Updates — eks:UpdateAddon and eks:DescribeAddonVersions were missing from the provisioning role. Fixed by expanding the inline policy and re-running terraform apply.
  • n8n Worker CrashLoopBackOff from Redis Connection Timeout — ElastiCache security group missing inbound 6379 rule from EKS node SG (cross-module reference gap in Terraform). Fixed by adding an aws_security_group_rule in the eks/ module.
  • Prometheus Storage Exhaustion After 8 Days — hostmetrics generated 4× more time series than estimated (12,000 vs 3,000), filling 50Gi PVC and corrupting the TSDB. Fixed by resizing to 200Gi, reducing collection interval from 10s to 30s, and adding a retention.size cap.
  • Fluentd DaemonSet Silently Dropping Logs — OIDC provider ARN in the IRSA trust policy was hardcoded to us-east-1 instead of eu-west-1. Identified via CloudTrail STS AssumeRole failures. Fixed by using the cluster's OIDC issuer output.
  • HPA Oscillation Under Steady-State Load — Workers scaled in/out every 8–10 minutes because 6–8 minute jobs caused brief CPU dips below the scale-in threshold. Fixed by adding scaleDown stabilizationWindowSeconds=600.
  • PostgreSQL Connection Pool Saturation — 7 worker pods × 5 connections each exhausted RDS max_connections (100). Fixed by deploying PgBouncer in transaction pooling mode, stabilizing RDS connections at 12.
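Three of the fixes in this list are small enough to show in full: the cross-module security group rule for Redis, the IRSA trust policy derived from the cluster's OIDC issuer (so the provider ARN can never carry the wrong region), and the HPA scale-down stabilization window. This is an added example, not the project's eks/ module: the cluster name, the `redis_security_group_id` variable (standing in for an output of an assumed elasticache module), the `logging`/`n8n` namespaces, the `fluentd` service-account name and the scale-down policy values are assumptions.

```hcl
# Added example, not the project's Terraform. Names, namespaces and the
# elasticache module output are assumptions.
variable "cluster_name"            { default = "n8n-prod" }
variable "redis_security_group_id" { type = string }   # output of the assumed elasticache module

# Read the cluster once; every cross-module value below is derived from it
# rather than retyped by hand.
data "aws_eks_cluster" "main" {
  name = var.cluster_name
}

# CrashLoopBackOff fix: Redis accepts 6379 only from the cluster security
# group, so nothing else in the VPC can reach the queue.
resource "aws_security_group_rule" "nodes_to_redis" {
  type                     = "ingress"
  from_port                = 6379
  to_port                  = 6379
  protocol                 = "tcp"
  security_group_id        = var.redis_security_group_id
  source_security_group_id = data.aws_eks_cluster.main.vpc_config[0].cluster_security_group_id
  description              = "EKS cluster SG to ElastiCache Redis"
}

# Silent Fluentd failure fix: the OIDC provider is built from the cluster's
# own issuer URL, so its ARN always matches the cluster's region.
locals {
  oidc_issuer = data.aws_eks_cluster.main.identity[0].oidc[0].issuer
  oidc_host   = replace(local.oidc_issuer, "https://", "")
}

data "tls_certificate" "oidc" {
  url = local.oidc_issuer
}

resource "aws_iam_openid_connect_provider" "eks" {
  url             = local.oidc_issuer
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = [data.tls_certificate.oidc.certificates[0].sha1_fingerprint]
}

data "aws_iam_policy_document" "fluentd_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]
    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.eks.arn]
    }
    # sub and aud conditions pin the role to one service account; without
    # them any pod in the cluster could assume it.
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_host}:sub"
      values   = ["system:serviceaccount:logging:fluentd"]
    }
    condition {
      test     = "StringEquals"
      variable = "${local.oidc_host}:aud"
      values   = ["sts.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "fluentd" {
  name               = "${var.cluster_name}-fluentd"
  assume_role_policy = data.aws_iam_policy_document.fluentd_trust.json
}

# HPA oscillation fix: with a 600 s scale-down window, a CPU dip during a
# 6-8 minute job is no longer enough to remove a worker.
resource "kubernetes_horizontal_pod_autoscaler_v2" "n8n_worker" {
  metadata {
    name      = "n8n-worker"
    namespace = "n8n"
  }
  spec {
    min_replicas = 2
    max_replicas = 8
    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "n8n-worker"
    }
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70
        }
      }
    }
    behavior {
      scale_down {
        stabilization_window_seconds = 600
        select_policy                = "Max"
        policy {                       # assumed: at most one pod removed per 2 minutes
          type           = "Pods"
          value          = 1
          period_seconds = 120
        }
      }
    }
  }
}
```

A quick check after applying the IRSA change is to look in CloudTrail for `AssumeRoleWithWebIdentity` events from the Fluentd service account: a mismatched provider ARN shows up there as access-denied errors, which is how the original bug was identified.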

Project Outcomes

  • Full cluster provisioned via Terraform in under 25 minutes — zero manual AWS console actions.
  • 99.9% uptime for n8n over 6 weeks, including 3 node failures handled automatically by cluster autoscaler.
  • Incident MTTR reduced from 45 minutes to under 8 minutes via pre-built Grafana dashboards with direct alert-to-log links — 82% MTTR reduction.
  • 20-node EKS cluster scaled autonomously through a 200-concurrent-workflow stress test.
  • CloudWatch Logs ingestion reduced 84% (45 GB/day → 7.2 GB/day) without losing actionable signal.
  • ALB costs reduced 75% ($88 → $22/month) through IngressGroup consolidation.
  • Zero duplicate workflow executions after migrating to n8n queue mode — validated across 1,200 test runs.