Infrastructure as Code from scratch
Problem
The client had been running their GCP environment for two years entirely through the cloud console. Every resource — VM instances, VPC networks, Cloud SQL databases, service accounts, IAM bindings — had been created by hand, often by whoever happened to need it at the time. There was no central record of what existed. Engineers would discover resources by stumbling across them, or worse, by being surprised by a line item on the billing statement. When we asked who owned a particular service account with project-level editor permissions, nobody could answer with confidence.
Onboarding a new backend engineer took the better part of two days. A senior engineer would sit alongside them, walking through which consoles to visit, which projects to request access to, which service accounts to download credentials for, and which environment variables to set locally. None of this was written down in a way that reflected reality — the runbook existed, but it had drifted so far from the actual state of the environment that following it introduced as many questions as it answered.
The deeper problem was that nobody had a clear picture of what the infrastructure was supposed to look like versus what it actually looked like. Resources from abandoned experiments sat idle, costing money. IAM roles had accumulated permissions through incremental additions with no corresponding removals. The environment had become, in effect, undocumented legacy infrastructure — despite being less than two years old.
What we did
We started with an import phase rather than a rebuild. Throwing away existing infrastructure and recreating it cleanly is tempting but rarely practical — databases have data, DNS records are live, and service accounts are referenced by running workloads. Instead, we spent the first week writing Terraform definitions for every resource in the environment and importing them into state one by one. In total we imported 180+ resources across seven GCP projects, covering Compute Engine instances, Cloud Run services, VPC networks and firewall rules, Cloud SQL instances, GCS buckets, Cloud DNS zones, Secret Manager secrets, and IAM bindings at both project and resource level. We ran every import against a plan and verified zero drift before moving on.
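The pattern for each resource looked roughly like the sketch below, shown here using Terraform's declarative `import` blocks (the same result can be achieved with the older `terraform import <address> <id>` CLI command). All names, project IDs, and settings are illustrative, not the client's actual values:

```hcl
# Hypothetical example: adopting an existing Cloud SQL instance into state.
# The import block tells Terraform to bind the real resource to the
# definition below instead of creating a new one.
import {
  # Resource ID in the format the google provider expects
  id = "projects/acme-prod/instances/orders-db"
  to = google_sql_database_instance.orders
}

resource "google_sql_database_instance" "orders" {
  name             = "orders-db"
  project          = "acme-prod"
  region           = "europe-west1"
  database_version = "POSTGRES_14"

  settings {
    tier = "db-custom-2-7680"
  }
}
```

After each import, `terraform plan` acts as the verification step: if the written definition matches the live resource exactly, the plan shows no changes, which is the "zero drift" check described above. Any diff means the definition is incomplete and needs another pass before moving to the next resource.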
The state was organized in layers rather than a single monolithic workspace. Foundational resources — VPC networks, shared service accounts, DNS zones — live in a foundation root module with its own state file. Individual services each have their own workspace, which references foundation outputs via terraform_remote_state. This separation means a change to a Cloud Run service doesn't require locking the same state as network infrastructure, and it limits the blast radius of a bad plan. For cost visibility, we wired Infracost into the GitHub Actions CI pipeline. Every pull request that touches Terraform gets an automated comment showing the estimated monthly cost delta — broken down by resource, with a total. GitHub Actions handles plan, Infracost comment, and apply on merge to main, with state stored in GCS and concurrent runs serialized by the GCS backend's built-in locking.
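A minimal sketch of what a service workspace looks like under this layering — bucket names, prefixes, and resource names are assumptions for illustration, not the client's actual configuration:

```hcl
# Hypothetical service workspace. Each workspace gets its own state path
# (prefix) in the shared GCS bucket, so plans lock independently.
terraform {
  backend "gcs" {
    bucket = "acme-tf-state"       # assumed state bucket
    prefix = "services/orders-api" # one prefix per workspace
  }
}

# Read-only view of the foundation layer's outputs (networks, shared
# service accounts, DNS zones) — consumed, never modified, from here.
data "terraform_remote_state" "foundation" {
  backend = "gcs"
  config = {
    bucket = "acme-tf-state"
    prefix = "foundation"
  }
}

resource "google_cloud_run_v2_service" "orders_api" {
  name     = "orders-api"
  location = "europe-west1"

  template {
    # Reuse the shared service account defined in the foundation layer
    service_account = data.terraform_remote_state.foundation.outputs.run_service_account_email
    containers {
      image = "europe-west1-docker.pkg.dev/acme-prod/apps/orders-api:latest"
    }
  }
}
```

The design choice worth noting is that the dependency is one-directional: services read foundation outputs, but the foundation layer knows nothing about individual services, so a bad plan in a service workspace cannot touch networking or shared IAM.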
Result
The most immediate payoff was visibility. Within a week of completing the import, the team had their first complete picture of what was running and what it cost. Several resources that had been forgotten — a Cloud SQL instance provisioned for a proof of concept six months prior, a handful of Compute Engine instances running at n1-standard-4 with near-zero CPU utilization — were identified and decommissioned. The monthly GCP bill dropped noticeably before we’d written a single line of new infrastructure code.
Onboarding changed from a guided tour to a pull request. The new engineer flow is now: clone the repo, run a bootstrap script that calls gcloud auth and configures application default credentials, and open a PR adding your IAM bindings to the relevant Terraform module. A reviewer approves, it merges, CI applies it, and you have the access you need. The entire process takes under an hour and leaves a clear record of who requested what and who approved it.
The layered state architecture has held up well operationally. Teams working on application services can iterate on their Terraform without touching foundational resources, and the client's engineering team, who had no prior Terraform experience, was writing new resource definitions independently within the first week after handoff.
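An onboarding PR in this setup amounts to a few lines of HCL. The sketch below assumes member-level grants (`google_project_iam_member`, which is additive per role/member rather than authoritative over the whole role); the user, project, and roles are illustrative:

```hcl
# Hypothetical onboarding change: the new engineer's PR adds their own
# grants. Each google_project_iam_member manages only this one
# member/role pair, so the change cannot clobber anyone else's access.
resource "google_project_iam_member" "jane_logs_viewer" {
  project = "acme-staging"
  role    = "roles/logging.viewer"
  member  = "user:jane@example.com"
}

resource "google_project_iam_member" "jane_run_developer" {
  project = "acme-staging"
  role    = "roles/run.developer"
  member  = "user:jane@example.com"
}
```

Because the grant lives in version control, revoking access later is the same reviewable motion in reverse: delete the resource, open a PR, and the apply on merge removes the binding.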
Key highlights
- 180+ resources imported into Terraform without downtime
- Onboarding reduced from 2 days to a single PR
- Cost estimation on every pull request via Infracost
- Layered state architecture — networking, compute, and IAM isolated
Tech stack
Terraform · Google Cloud Platform · GitHub Actions · Infracost