Terraform State Management at Scale: Patterns That Actually Work

Remote state, workspaces, and module versioning each solve a different problem. We share the patterns we've settled on after managing Terraform at scale across dozens of AWS and Azure accounts.

State is where Terraform gets complicated

The Terraform documentation is good. The getting-started experience — write a config, run apply, see resources created — is smooth. The part that gets complicated, and that most tutorials gloss over, is state management at scale.

State is how Terraform tracks what it has created. When you're working alone on a single environment, a local state file in your project directory works fine. When you have multiple engineers, multiple environments, and multiple accounts, the decisions you make about state storage, locking, and structure determine whether Terraform is a productive tool or a source of ongoing toil.

Here are the patterns we've settled on, and the reasoning behind each.

Remote state is non-negotiable for teams

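Local state files don't work for teams. Each engineer ends up with their own copy of the state, nothing locks across machines, and two people running `terraform apply` against the same infrastructure will leave those copies disagreeing with each other and with reality. And when the laptop holding the state file is lost or reset, the state goes with it.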

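Remote state is the minimum viable setup for any environment that more than one person touches. On AWS that means storing the state file in S3, with a DynamoDB table for locking; on Azure it means Azure Blob Storage, where locking is built in through blob leases. Newer Terraform releases (1.10 and later) can also lock directly in S3 via the backend's `use_lockfile` setting, which removes the need for the DynamoDB table.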

The state backend configuration should be the first thing you set up in a new Terraform project, before any resources are defined. Migrating from local to remote state on an existing project with real resources is a solvable problem, but it's an unnecessary one.
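
The bucket and lock table have to exist before `terraform init` can point a backend at them, so they are usually created once from a small bootstrap configuration. A minimal sketch, assuming AWS provider v4 or later and the same illustrative bucket and table names used elsewhere in this article; the resource labels are our own invention:

```hcl
# Bootstrap configuration (a sketch, not a complete hardening baseline).
# One option is to apply this once with local state, then leave it alone.

resource "aws_s3_bucket" "terraform_state" {
  bucket = "your-org-terraform-state" # illustrative name

  # Guard against accidentally destroying the bucket that holds all state.
  lifecycle {
    prevent_destroy = true
  }
}

# Versioning keeps earlier state snapshots recoverable after a bad write.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State files often contain secrets; keep the bucket private.
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The S3 backend expects a string partition key named "LockID".
resource "aws_dynamodb_table" "terraform_state_lock" {
  name         = "terraform-state-lock" # illustrative name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```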

Our standard AWS backend block:

terraform {
  backend "s3" {
    bucket         = "your-org-terraform-state"
    key            = "product/environment/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}

The `key` path structure matters: `product/environment/terraform.tfstate` means each product and environment has its own isolated state file. Don't use a single key for all your infrastructure.
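
The same key convention carries over to Azure. A sketch of the equivalent `azurerm` backend block; the resource group, storage account, and container names are placeholders rather than values from a real setup:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state" # placeholder
    storage_account_name = "yourorgtfstate"     # placeholder; must be globally unique
    container_name       = "tfstate"            # placeholder
    key                  = "product/environment/terraform.tfstate"
  }
}
```

There is no lock table to configure here, because the backend locks the state blob with a lease.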

Workspaces: useful but limited

Terraform workspaces let you maintain multiple state files from a single configuration, selected with `terraform workspace select`. The common use case is managing dev, staging, and production environments from a single set of Terraform files with different variable values.

This works, and it's simpler than maintaining separate directories per environment. But workspaces have a limitation that makes them unsuitable for environments that genuinely need to be different: the configuration is the same across all workspaces.

If your production environment needs different instance types, different scaling policies, different security group rules, or different service configurations from your development environment, you'll find yourself writing increasingly contorted `locals` and `count` expressions to handle the differences. At some point, separate directories (with shared modules) are cleaner.
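
To make the trade-off concrete, here is a sketch of both ends of that spectrum. The workspace names, instance types, and resource names are invented for illustration; `terraform.workspace` is the built-in value that holds the name of the selected workspace.

```hcl
# Works well: environments differ only in values, looked up by workspace name.
locals {
  instance_type_by_workspace = {
    dev        = "t3.small"
    staging    = "t3.medium"
    production = "m5.large"
  }

  instance_type = local.instance_type_by_workspace[terraform.workspace]
}

variable "ami_id" {
  type = string
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = local.instance_type
}

# Starts to creak: a resource that should exist only in production.
resource "aws_sns_topic" "pager_alerts" {
  count = terraform.workspace == "production" ? 1 : 0
  name  = "app-pager-alerts"
}

# Every reference now has to cope with the resource possibly being absent.
output "pager_topic_arn" {
  value = one(aws_sns_topic.pager_alerts[*].arn) # null outside production
}
```

One or two conditionals like this are manageable. When most resources carry one, the configuration is really describing several different environments.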

Our rule of thumb: use workspaces when environments are identical except for variable values. Use separate directories when environments have structural differences. Most teams start with workspaces and migrate to separate directories as their infrastructure matures — it's fine to do it that way, just recognise the inflection point when it comes.
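
For the separate-directories side of that rule, a sketch of what one environment's root configuration might look like. The directory layout, the `service` module, and its input names are all hypothetical:

```hcl
# Layout (hypothetical):
#   environments/dev/main.tf
#   environments/staging/main.tf
#   environments/production/main.tf   <- this file

# The shared part lives in a module. Each environment pins its own version,
# so dev can adopt a new release before production does.
module "service" {
  source = "git::https://github.com/your-org/terraform-modules//service?ref=v1.4.2"

  # Input names below are assumptions about a hypothetical module.
  environment   = "production"
  instance_type = "m5.large"
  min_capacity  = 3
}

# Structural differences are ordinary resources in this directory only,
# with no count or workspace conditionals.
resource "aws_sns_topic" "pager_alerts" {
  name = "app-pager-alerts"
}
```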

The module versioning problem

Modules are how you reuse Terraform code. A well-designed module for "a standard EKS cluster with our security baseline" can be referenced from every team's Terraform code, and updates to the module propagate to everyone.

The danger is unversioned modules. If every project references a module via a local path or an unpinned Git ref (`ref=main`), then a change to the module affects all consumers simultaneously. This is fine for internal tooling. It is not fine for production infrastructure.

Pin your module versions. Use Git tags and reference them explicitly:

module "eks_cluster" {
  source = "git::https://github.com/your-org/terraform-modules//eks?ref=v1.4.2"
  # ...
}

Adopting a new module version is then a deliberate act — update the ref, review the changelog, run `terraform plan`, verify the diff, apply. Not an accident caused by a module author pushing a breaking change.

For modules from the public Terraform Registry (including the widely used AWS and Azure modules, many of which are community-maintained rather than official), constrain them to a version range:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1"
}

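The `~>` operator lets only the rightmost version component you specify increase. `~> 5.1` therefore accepts any 5.x release from 5.1 upwards (minor and patch updates) but not 6.0, while `~> 5.1.0` accepts patch updates only. Adjust the range based on how carefully you track upstream changes.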
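
A side-by-side sketch of the common constraint forms and the ranges they resolve to. The module labels and the specific version numbers are illustrative:

```hcl
# Minor and patch updates within 5.x: >= 5.1.0, < 6.0.0
module "vpc_minor_updates" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1"
}

# Patch updates only: >= 5.1.0, < 5.2.0
module "vpc_patch_only" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.1.0"
}

# Exact pin: nothing changes until you edit this line.
module "vpc_exact" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.0"
}
```

The `version` argument applies only to registry sources. Git sources are pinned through `ref`, as in the earlier example.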

Separating state by blast radius

One state file per project is a reasonable starting point, but as infrastructure grows, you want to think about blast radius: if a `terraform apply` goes wrong, how much infrastructure is at risk?

A single state file covering your entire AWS account means a bad apply can affect everything. A state file that covers only a specific application's resources means a bad apply affects only that application.

We separate state along two dimensions:

Layer. Foundation infrastructure (VPCs, DNS, shared IAM roles, logging) lives in one state file. Cluster infrastructure (EKS, RDS, ElastiCache) lives in another. Application infrastructure (ECS services, Lambda functions, S3 buckets) lives in a third. Each layer depends on outputs from the layer below it, accessed via `terraform_remote_state` data sources or (better) SSM Parameter Store, which lets a consumer read a single published value without needing read access to the lower layer's entire state.

Team or product. Within the application layer, each product team manages their own state. The payments team's Terraform code and state file are separate from the messaging team's. Changes to one don't risk breaking the other.
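
The dependency between layers can be sketched as follows. The parameter path, state key, and resource names are assumptions made for illustration, and the three parts would live in different root configurations:

```hcl
# --- Foundation layer: publish the value ---------------------------------
# Assumes aws_vpc.main is defined elsewhere in the foundation configuration.
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/foundation/production/vpc_id" # hypothetical naming scheme
  type  = "String"
  value = aws_vpc.main.id
}

# --- Cluster layer, option A: read from SSM Parameter Store ---------------
data "aws_ssm_parameter" "vpc_id" {
  name = "/foundation/production/vpc_id"
}

# --- Cluster layer, option B: read the foundation layer's state outputs ---
# Requires an `output "vpc_id"` in the foundation configuration, plus read
# access to the foundation state object.
data "terraform_remote_state" "foundation" {
  backend = "s3"

  config = {
    bucket = "your-org-terraform-state"
    key    = "foundation/production/terraform.tfstate"
    region = "us-east-1"
  }
}

locals {
  # The provider marks the SSM value as sensitive, so it is redacted in plans.
  vpc_id_from_ssm   = data.aws_ssm_parameter.vpc_id.value
  vpc_id_from_state = data.terraform_remote_state.foundation.outputs.vpc_id
}
```

A real consumer would pick one option, not both.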

This separation also makes access control tractable: you can give the payments team write access to their own state backend without giving them access to anything else.
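
A sketch of what that scoping might look like on AWS when state keys are prefixed by product. The `payments/` prefix, policy name, and account ID are placeholders, and the action list follows what the S3 backend documents as required rather than anything specific to our setup:

```hcl
data "aws_iam_policy_document" "payments_state_access" {
  # The backend needs to list the bucket. This exposes key names only,
  # not state contents.
  statement {
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::your-org-terraform-state"]
  }

  # Read and write only the state objects under the team's prefix.
  statement {
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::your-org-terraform-state/payments/*"]
  }

  # Acquire and release locks. The lock table is shared across teams here.
  statement {
    actions   = ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:DeleteItem"]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/terraform-state-lock"] # placeholder account ID
  }
}

resource "aws_iam_policy" "payments_state_access" {
  name   = "payments-terraform-state"
  policy = data.aws_iam_policy_document.payments_state_access.json
}
```

This covers the state backend only. The permissions the team needs to manage its actual resources are a separate policy.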

The practical tooling question

Terraform CLI is sufficient for small teams. For larger teams, a CI/CD integration such as HCP Terraform (formerly Terraform Cloud), Atlantis (self-hosted), or Spacelift provides PR-based plan reviews, automated applies on merge, and centralised state management.

The key feature is plan visibility in pull requests. When a PR shows the Terraform plan as a comment, down to the summary line ("Plan: 3 to add, 1 to change, 0 to destroy"), reviewers can reason about what the infrastructure change actually does, not just what the Terraform code says. This is the single most valuable workflow improvement for teams that operate infrastructure collaboratively.

Whichever tool you use, the state management principles are the same. The tooling is a delivery mechanism; the patterns are what matter.