ECS Fargate deploy lifecycle
The new infra/ Pulumi project deploys Spring Boot services to AWS ECS Fargate. Each microservice runs as one ECS service and autoscales between minTasks and maxTasks (defined per env in infra/src/catalog/services.js). A push to main triggers pulumi up -s staging; a push to the production branch triggers the production deploy, which waits for manual approval. PR previews deploy to a dedicated adb-preview AWS account when someone posts a /deploy comment on the PR.
At a glance
| Fact | Value |
|---|---|
| Compute | AWS ECS Fargate, capacity providers FARGATE + FARGATE_SPOT (Spot for non-prod) |
| Networking | VPC per env (3 AZs), private subnets for tasks, public ALB with HTTPS only, optional WAFv2 (prod) |
| Image registry | ECR in adb-shared AWS account |
| Image pull | Cross-account via repository policy scoped to the org |
| Image tag pattern | prod/staging: <env>-<sha>. preview: pr-<n>-<sha>. |
| Health check | ALB target group checks /actuator/health; the ECS task-level health check uses the same path |
| Deploy strategy | rolling, deploymentMinimumHealthyPercent=50, deploymentMaximumPercent=200 |
| ECS Exec | Enabled (gated by IAM Debug permission set) |
| Logs | CloudWatch log group per service (/ecs/adb-<env>/<service>), retention 14 d non-prod / 90 d prod |
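The naming and sizing conventions in the table can be sketched as small helpers. This is a minimal illustration, not the project's actual code: the function names and the env keys "staging", "production", and "preview" are assumptions. The rounding in rollingBounds follows the ECS behaviour of rounding the minimum-healthy bound up and the maximum bound down.

```javascript
// Hypothetical helpers; names and env keys are assumptions for illustration.

// Image tag: <env>-<sha> for staging/production, pr-<n>-<sha> for previews.
function imageTag({ env, sha, prNumber }) {
  if (env === "preview") {
    if (!Number.isInteger(prNumber)) {
      throw new Error("preview tags need a PR number");
    }
    return `pr-${prNumber}-${sha}`;
  }
  return `${env}-${sha}`;
}

// Log group: /ecs/adb-<env>/<service>, kept 90 days in production, 14 days elsewhere.
function logGroup(env, service) {
  return {
    name: `/ecs/adb-${env}/${service}`,
    retentionInDays: env === "production" ? 90 : 14,
  };
}

// Task-count bounds during a rolling deploy for a given desired count.
// The lower bound is rounded up and the upper bound is rounded down.
function rollingBounds(desiredCount, minHealthyPercent = 50, maxPercent = 200) {
  return {
    minRunning: Math.ceil((desiredCount * minHealthyPercent) / 100),
    maxRunning: Math.floor((desiredCount * maxPercent) / 100),
  };
}
```

For a service with a desired count of 3, rollingBounds(3) gives minRunning 2 and maxRunning 6, so a rolling deploy never drops below two running tasks and never exceeds six.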
Details
Stack layout
flowchart TB
push[Push to main / production]
gha[GitHub Actions: deploy-staging.yml or deploy-production.yml]
pulumi[pulumi up -s staging|production]
ecr[ECR: adb-shared]
network[VPC + subnets + endpoints]
cluster[ECS Cluster]
services[ECS Services x9]
alb[ALB + listener rules]
sqs[SQS queues from catalog]
push --> gha --> pulumi
pulumi --> network
pulumi --> cluster
pulumi --> services
pulumi --> alb
pulumi --> sqs
services -.pulls images.-> ecr
services -.consumes.-> sqs
alb -->|/api/<service>/*| services
Deploy paths
| Trigger | Workflow | Stack | Approval |
|---|---|---|---|
| Push to main | .github/workflows/deploy-staging.yml | staging | none |
| Push to production branch | .github/workflows/deploy-production.yml | production | GitHub Environment manual approval |
| /deploy PR comment | .github/workflows/pr-deploy.yml | pr-<N> (created on demand) | none |
| /destroy PR comment | .github/workflows/pr-destroy.yml | pr-<N> (destroyed) | none |
| PR closed | .github/workflows/pr-destroy.yml | pr-<N> (destroyed) | none |
| Nightly schedule | .github/workflows/nightly-cleanup.yml | pr-* stacks whose PR closed more than 24 h ago (destroyed) | none |
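The nightly sweep in the last row amounts to a filter over stack names. The sketch below assumes the workflow already has the list of stack names and a map from PR number to its close time; the function name and data shapes are hypothetical, and the real nightly-cleanup.yml may gather this differently.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// stackNames: string[]; pullRequests: Map<number, { closedAt: string | null }>
// (both shapes are assumptions). Returns the preview stacks that are safe to destroy.
function stacksToSweep(stackNames, pullRequests, now = Date.now()) {
  return stackNames.filter((name) => {
    const match = /^pr-(\d+)$/.exec(name);
    if (!match) return false; // never touch staging or production
    const pr = pullRequests.get(Number(match[1]));
    if (!pr || !pr.closedAt) return false; // unknown or still open: leave alone
    return now - Date.parse(pr.closedAt) > DAY_MS;
  });
}
```

Treating an unknown PR as "leave alone" is a deliberately conservative choice here: a stack is only destroyed when there is positive evidence that its PR closed more than 24 hours ago.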
Per-service plumbing
For every entry in services.js SERVICES, the env stack creates:
- An ECR-scoped task definition with image adb-shared.dkr.ecr.eu-west-3.amazonaws.com/<image>:<env>-<sha>.
- A task IAM role with read on its own queues, buckets, and secrets only (least privilege).
- An ALB listener rule routing /api/<service>/* to a target group with the service's port.
- Autoscaling on CPU 60% target, between sizing[<env>].minTasks and maxTasks.
- A CloudWatch log group with the per-env retention.
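As a rough sketch of how one catalog entry could fan out into the resources listed above, the function below derives the per-service settings as plain data. The entry shape (name, image, port, sizing) and the function name are assumptions; the real infra/src/catalog/services.js may look different, and the IAM role is left out for brevity.

```javascript
// Hypothetical catalog entry; field names are assumptions.
const exampleService = {
  name: "orders",
  image: "adb-orders",
  port: 8080,
  sizing: {
    staging: { minTasks: 1, maxTasks: 2 },
    production: { minTasks: 2, maxTasks: 6 },
  },
};

const REGISTRY = "adb-shared.dkr.ecr.eu-west-3.amazonaws.com";

// Derives the settings a stack would feed into the task definition,
// listener rule, autoscaling target, and log group for one service.
function plumbingFor(service, env, sha) {
  const sizing = service.sizing[env];
  if (!sizing) {
    throw new Error(`no sizing for ${service.name} in ${env}`);
  }
  if (sizing.minTasks > sizing.maxTasks) {
    throw new Error(`minTasks exceeds maxTasks for ${service.name} in ${env}`);
  }
  return {
    image: `${REGISTRY}/${service.image}:${env}-${sha}`,
    listenerRule: {
      pathPattern: `/api/${service.name}/*`,
      targetPort: service.port,
    },
    autoscaling: {
      minCapacity: sizing.minTasks,
      maxCapacity: sizing.maxTasks,
      cpuTargetPercent: 60,
    },
    logGroupName: `/ecs/adb-${env}/${service.name}`,
  };
}
```

Keeping this derivation as a pure function makes it easy to unit-test the catalog without running Pulumi; the resulting object can then be passed to whatever resource constructors the stack uses.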
The whole topology is reproducible: running pulumi destroy and then pulumi up rebuilds it deterministically.
Local mirror
infra/docker/compose.yaml runs the same 9 services + MongoDB replica set + Keycloak + LocalStack (SQS/SNS/S3/Secrets Manager) + Mailpit. The LocalStack init script provisions queues using the same names the catalog produces, so messaging code works locally without changes.
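One way to keep local and cloud queue names aligned is to generate the LocalStack init commands from the same naming function the catalog uses. The sketch below is illustrative only: the adb-<service>-<queue> naming rule, the queues field, and both function names are assumptions, not the project's actual convention. The awslocal sqs create-queue command is the LocalStack wrapper around the AWS CLI.

```javascript
// Hypothetical naming rule; the real catalog may name queues differently.
function queueName(service, queue) {
  return `adb-${service}-${queue}`;
}

// Builds one `awslocal sqs create-queue` command per queue in the catalog.
// services: [{ name: string, queues?: string[] }] (assumed shape).
function localstackInitCommands(services) {
  return services.flatMap((svc) =>
    (svc.queues ?? []).map(
      (q) => `awslocal sqs create-queue --queue-name ${queueName(svc.name, q)}`
    )
  );
}
```

Because both the stack and the init script would call queueName, renaming a queue in the catalog changes the local environment and the deployed environment together.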
Open questions
- Container Insights is enabled but no dashboard is provisioned. Worth adding a per-env CloudWatch dashboard with p95 latency, 5xx rate, queue depth, and DLQ count.
- We deploy from images tagged <env>-<sha> but don't currently restrict which SHAs can deploy. A future hardening: signed images + ECR scan-on-push gating.
- Spring Boot startup is slow (~30–60 s); healthCheckGracePeriodSeconds on the ECS service may need tuning before the first prod cutover.