There is a specific kind of optimism that exists only in the first five minutes of a new greenfield project. You have a fresh directory, a blank main.tf, and a mental image of the architecture that is clean, logical, and devoid of friction.
Last weekend, I sat down to build what I considered a "commodity" stack: a Node.js backend running on AWS Fargate, fronted by an Application Load Balancer (ALB), and fed by a continuous deployment pipeline triggered from GitHub. I’ve drawn this architecture on whiteboards multiple times.
But as I began translating that whiteboard into HCL (HashiCorp Configuration Language), I was reminded of a hard truth that diagrams conveniently omit: the map is not the territory.
This is not a tutorial on how to write Terraform. This is a story about the gap between infrastructure correctness and operational reality, and the friction points where the "perfect" declarative model of Terraform grinds against the messy, stateful, imperative reality of AWS.
The Architecture Blueprint
Before diving into the failures, it is important to understand what "success" was supposed to look like. The goal was a fully automated delivery system for a Node.js API.
I separated the project into two distinct repositories to maintain separation of concerns:
Application Code: devmetrics-server - The Node.js backend service responsible for fetching a user’s GitHub contributions, computing meaningful statistics, and drawing inferences from the data.
Infrastructure Code: aws-terraform-backend-cicd - The Terraform definitions that manage the AWS estate.
The architecture relied on a strictly defined set of AWS resources, each serving a specific role in the lifecycle of a code commit:
1. The CI/CD Pipeline Layer
AWS CodePipeline: The orchestrator. It listens for changes on the main branch of the GitHub repository.
AWS CodeStar Connections: The bridge between AWS and GitHub. Unlike legacy OAuth tokens, this resource handles the authentication handshake securely.
AWS CodeBuild: The workhorse. It spins up a temporary container to run npm install, build the application, and create the Docker container.
Amazon ECR (Elastic Container Registry): The library. CodeBuild pushes the versioned Docker images here, tagged by commit SHA.
2. The Compute & Networking Layer
VPC (Virtual Private Cloud): Designed with public subnets for the load balancer and private subnets for the application logic.
NAT Gateways: Essential plumbing that allows the private Fargate tasks to reach the internet (to pull images or talk to external APIs) without exposing them to incoming traffic.
Application Load Balancer (ALB): The public face of the application. It handles SSL termination and routes traffic to the healthy containers.
AWS ECS (Fargate): The serverless compute engine. It runs the Node.js containers pulled from ECR, injecting environment variables at runtime.
3. Configuration & Observability
AWS Secrets Manager: The vault. It stores sensitive database credentials and API keys, injecting them into the ECS tasks only at runtime, as sketched just after this list.
CloudWatch & SNS: The nervous system. Logs are streamed here, and alarms trigger SNS topics to email me when CPU spikes or tasks fail to start.
It looked perfect on paper. Then I ran terraform init.
The First Lie: The Clean Slate
We often talk about Infrastructure as Code (IaC) as if it allows us to command the cloud to exist. You declare resource "aws_vpc" "main", and like a digital fiat, there is a VPC.
I started with a modular approach. I wanted clean separation: a module for networking, a module for the database, and a module for the CI/CD pipeline. I ran terraform init. The green text was comforting. I ran terraform plan. It promised to create 45 resources.
The first crack in the façade appeared when I tried to use Terraform Data Sources to reference resources I was creating in the same run.
In my mind, the dependency graph was clear. The VPC creates the subnets; the ECS cluster needs the subnets. However, inside my ECS module, I attempted to look up the availability zones using a data "aws_availability_zones" block to dynamically assign subnets.
Terraform exploded.
It wasn’t a syntax error. It was a logic error in my own mental model. I was asking Terraform to query the AWS API for information about resources that, strictly speaking, didn't exist yet in the "real" world, even though they existed in the "planned" world.
This is where you learn that Terraform is not a wizard; it is a state machine. When you use a data block, you are effectively saying, "Go ask AWS what this looks like right now." If you are building the infrastructure from zero, the answer is "null." I had to shift my thinking from "lookup" to "pass-through." I had to explicitly pipe resource IDs from the networking module outputs into the ECS module inputs. The implicit global visibility I had assumed was a phantom; in Terraform, modules are black boxes, and variables are the only windows.
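The fix, roughly, is to make the wiring explicit: the networking module publishes IDs as outputs, and the root module pipes them into the ECS module as input variables. A minimal sketch, where the module paths and names are illustrative rather than the repo's actual layout:

# modules/networking/outputs.tf
output "private_subnet_ids" {
  description = "IDs of the private subnets created by this module"
  value       = aws_subnet.private[*].id
}

# root main.tf
module "networking" {
  source = "./modules/networking"
}

module "ecs" {
  source = "./modules/ecs"

  # Explicit pass-through: the ECS module never queries AWS for these,
  # so Terraform can resolve the dependency entirely inside the plan.
  private_subnet_ids = module.networking.private_subnet_ids
}

# modules/ecs/variables.tf
variable "private_subnet_ids" {
  type = list(string)
}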
The CodeStar Friction
The architecture required a CI/CD pipeline. The flow is standard: push to GitHub, trigger AWS CodePipeline, build in CodeBuild, deploy to Fargate.
Historically, we used GitHub OAuth tokens stored in Secrets Manager. But AWS has deprecated that in favor of CodeStar Connections. This is a more secure, app-based authentication method. I wrote the Terraform resource:
resource "aws_codestarconnections_connection" "github" {
name = "production-connection"
provider_type = "GitHub"
}
I ran terraform apply. It succeeded.
Then the pipeline failed immediately.
This brings us to one of the most jarring "breaks" in the automation dream. You can create a CodeStar connection via Terraform, but you cannot activate it. The connection initializes in a PENDING state. For security reasons, AWS mandates that a human being must log into the AWS Console, click "Update Pending Connection," and authorize the handshake with GitHub in a browser window.
I spent an hour debugging IAM roles, assuming my CodePipeline didn't have permission to pull the source. I churned through policy documents, adding codestar-connections:UseConnection permissions, only to realize the infrastructure code was perfect. The infrastructure state was incomplete.
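For what it's worth, the permission I kept re-adding boils down to a single statement on the pipeline's role. This is a sketch; the role and policy names are assumptions, not the repository's actual identifiers:

resource "aws_iam_role_policy" "pipeline_use_connection" {
  name = "use-codestar-connection"
  role = aws_iam_role.codepipeline.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = ["codestar-connections:UseConnection"]
        # Scope the permission to the one connection the pipeline uses.
        Resource = aws_codestarconnections_connection.github.arn
      }
    ]
  })
}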
This creates a terrifying "chicken and egg" problem in a fully automated disaster recovery scenario. If this region burns down and I need to redeploy this stack elsewhere, the script will halt. A human must intervene. It was a stark reminder that we haven't actually automated away the operator; we've just changed when they have to show up.
The Silent Account Blockers
Once the pipeline was flowing, the next hurdle was the Application Load Balancer (ALB).
The Terraform plan was flawless. The security groups allowed ingress on port 80 and 443. The subnets were public. The internet gateway was attached.
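For context, those ingress rules were nothing exotic; something close to this sketch, with names of my own choosing:

resource "aws_security_group" "alb" {
  name   = "alb-public"
  vpc_id = aws_vpc.main.id

  # Public HTTP/HTTPS in from anywhere; the ALB terminates TLS.
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}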
I ran apply. The ALB resource creation hung.
If you’ve used Terraform enough, you know the feeling of the timer ticking past 3 minutes, then 5, then 8. Terraform is waiting for the AWS API to return a "steady state" signal. Finally, it timed out with a generic error.
I assumed it was a networking issue. I double-checked the route tables. I verified the NACLs. I spent two hours treating this as an engineering problem.
It wasn't an engineering problem. It was a billing problem.
The AWS account I was using was relatively new. To prevent fraud, AWS places silent quotas on new accounts, specifically regarding ELB (Elastic Load Balancing) and EC2 usage. My account hadn't been fully verified for "production-level" limits. The API wasn't rejecting the request with "Permission Denied" or "Quota Exceeded"; it was simply swallowing the request into a black hole of provisioning status that never resolved.
This is the difference between infrastructure correctness and account readiness. Your Terraform can be syntactic poetry, but if your credit card on file triggered a fraud alert, your load balancer will simply never exist. DevOps involves debugging the organization and the vendor, not just the code.
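If I were rebuilding this, I would at least surface those limits in the plan itself. The AWS provider exposes Service Quotas as a data source; this sketch assumes the quota name below matches the current Service Quotas catalogue, and it will not catch an account that simply hasn't been verified yet:

data "aws_servicequotas_service_quota" "alb_per_region" {
  service_code = "elasticloadbalancing"
  quota_name   = "Application Load Balancers per Region"
}

output "alb_quota" {
  # Makes the effective limit visible in `terraform output` before you
  # spend two hours blaming your route tables.
  value = data.aws_servicequotas_service_quota.alb_per_region.value
}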
The "ResourceAlreadyExists" Paradox
After resolving the account limits and fixing the CodeStar connection, I had to re-run the deployment. But during the previous failed debugging sessions, the terraform apply had crashed halfway through creation due to a timeout.
When I ran terraform apply again, I was greeted with:
Error: creating CloudWatch Log Group (/ecs/backend-app): ResourceAlreadyExistsException
This error usually confuses juniors. "If it exists," they ask, "why doesn't Terraform just use it?"
This reveals the rigid nature of Terraform's state file (terraform.tfstate). Terraform doesn't look at your AWS account to see what's there; it looks at its state file to see what it thinks is there. Because the previous run crashed, Terraform never recorded that it successfully created the Log Group.
So, when I ran it again, Terraform tried to create an object that AWS said was already there.
This forces you into the messy world of state surgery. You have two choices:
1. Go into the console, delete the log group, and let Terraform create it again. (The "Destructive" approach).
2. Use terraform import to map the existing real-world resource to the Terraform configuration. (The "Reconciliation" approach).
I chose the latter. It is a humbling experience to manually tell your automation tool, "Hey, I know you're confused, but this thing is actually right here." It breaks the illusion of the omnipotent declarative tool.
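Concretely, the reconciliation is one command: terraform import, pointed at the resource address in the configuration and the real-world identifier from the error message. On Terraform 1.5 and newer the same intent can be expressed declaratively with an import block; the resource address here is illustrative and has to match whatever the configuration actually calls that log group:

# Equivalent of running:
#   terraform import aws_cloudwatch_log_group.backend_app /ecs/backend-app
import {
  # The resource block that already exists in the configuration.
  to = aws_cloudwatch_log_group.backend_app
  # The identifier AWS already knows about (from the error message).
  id = "/ecs/backend-app"
}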
The Observability Afterthought
Finally, the stack was up. The container was running. The ALB passed health checks. I could hit the endpoint and get a JSON response.
But a system that is "up" isn't necessarily "working."
I realized I had built a black box. I had no idea if the Fargate tasks were struggling with memory limits. I didn't know if the CodeBuild project was queuing.
I had to go back and instrument the stack. This is where Terraform shines, but also where it becomes verbose. Creating a CloudWatch Dashboard in HCL is painful. You are essentially writing JSON inside of HCL string literals.
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "backend-overview"
dashboard_body = jsonencode({
widgets = [
{
type = "metric"
# ... 40 lines of configuration ...
}
]
})
}
I set up SNS topics for alarms. I routed them to email. The first time the pipeline ran successfully, I got an email.
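The alarm wiring itself is only a handful of resources. A minimal sketch, with placeholder names, thresholds, and email address:

resource "aws_sns_topic" "alerts" {
  name = "backend-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  # Another spot where a human has to show up: AWS emails a confirmation
  # link that must be clicked before the subscription goes active.
  endpoint = "ops@example.com"
}

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "backend-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.backend.name
  }
}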
It wasn't until I saw that email that I felt the project was done. The infrastructure code wasn't just about building the server; it was about building the confidence that the server was healthy.
Conclusion: The Maturity of Failure
Looking back at the finished repository, it looks clean. The main.tf is organized. The modules are decoupled. The graph is green.
But the git history tells the real story. It tells a story of wrestling with IAM policies for CodeBuild to talk to ECR. It tells the story of the depends_on flag I had to hack into the routing table associations. It tells the story of the hard-coded ARN I had to use temporarily while the CodeStar connection was pending.
We often sell DevOps as a way to "simplify" operations. In reality, tools like Terraform and AWS don't simplify the complexity; they abstract it into a different domain. You trade the complexity of manual server configuration for the complexity of state management, API quotas, and dependency graphs.
The shift from "it works on my machine" to "it works in production" isn't about typing terraform apply. It's about understanding the boundaries of your system. It's about knowing that AWS is an eventually consistent distributed system that sometimes lies to you. It's about accepting that your automation tool has a partial view of reality.
Real DevOps maturity isn't built by staring at architecture diagrams. It is built in those quiet, frustrating moments when the plan fails, the state is corrupted, and you have to dig into the logs to find out where reality diverged from your code.
References
Infrastructure Repository: TanyaMushonga/aws-terraform-backend-cicd
Application Repository: TanyaMushonga/devmetrics-server
Terraform AWS Provider Documentation: HashiCorp Registry
AWS Fargate Guide: Serverless Compute for Containers
AWS CodeStar Connections: GitHub Connections Documentation
Terraform State Management: Remote State in S3
AWS CodePipeline User Guide: CodePipeline Documentation