IaC · 2 weeks

IAM-based service-to-service auth on GCP

Problem

When we first audited the client’s service mesh, we found that authentication was handled differently in almost every service. Some services passed long-lived API keys through environment variables, others had rolled their own JWT signing and verification logic, and a handful used a shared HMAC secret that had been copy-pasted into seventeen different deployment configs over the years. The credentials themselves had grown organically: no rotation policy, no expiry, and in two cases no record of who had originally generated them. The blast radius of any single compromised secret was effectively the entire internal API surface.

The deeper problem was that none of this was auditable in any meaningful sense. When we asked “which services can call the billing API?”, the honest answer was “whichever ones have the secret.” There was no enforcement layer, no policy document, and no way to answer that question from first principles without reading every service’s source code. For a team preparing for SOC 2 Type II, that was a serious gap — auditors want to see access control that is defined, enforced, and reviewable, not access control that is implied by who happened to receive a Slack message containing a token.

What we did

The fix was to stop treating identity as something each service had to implement and start treating it as something the platform provides. On GCP, that meant leaning fully into Workload Identity: each Cloud Run service was assigned a dedicated service account with a narrow IAM role, and inter-service calls were authenticated using short-lived OIDC ID tokens that GCP issues and refreshes automatically. The calling service never stores a long-lived credential; it asks the metadata server for a token whose audience is the target service, and the receiving side verifies the token’s signature against Google’s public keys. The entire handshake is handled by the runtime and the HTTP client library.
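The caller side of that handshake can be sketched in a few lines of Python. This is an illustrative sketch, not the code used on the project: in practice a client library such as google-auth normally makes this request for you. The metadata server path and the Metadata-Flavor header are the documented GCP ones; the function names and the billing URL in the usage note are hypothetical.

```python
import urllib.parse
import urllib.request

# Metadata server endpoint that issues identity (OIDC ID) tokens for the
# service account attached to the running workload.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)


def build_token_request(audience: str) -> urllib.request.Request:
    """Build the metadata-server request for a token scoped to `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},  # required by the metadata server
    )


def fetch_id_token(audience: str, timeout: float = 5.0) -> str:
    """Fetch a short-lived ID token. Only resolves inside a GCP runtime."""
    request = build_token_request(audience)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def build_authenticated_request(url: str, token: str) -> urllib.request.Request:
    """Attach the token as a bearer credential on the outgoing call."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

Inside a deployed service, a call would look like `build_authenticated_request(url, fetch_id_token(audience))`, where the audience is typically the receiving service's URL (for example a hypothetical `https://billing.example.internal`). No secret appears anywhere in the calling service's config.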

We wrote the service account definitions and IAM bindings in Terraform, which meant the full access control matrix lived in version-controlled infrastructure code rather than scattered across CI secrets and deployment dashboards. Adding a new service-to-service trust relationship became a two-line Terraform change with a pull request, a reviewer, and a permanent audit trail. We also ran a systematic sweep to remove the old credentials — 23 hardcoded secrets pulled out of environment variable configs, Kubernetes secrets, and a handful of places where they had been inlined directly into Helm values files. Once the new auth path was validated in staging, the legacy code paths — custom middleware, token exchange endpoints, signature verification utilities — were deleted outright, coming to roughly 800 lines across the codebase.
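To show why explicit bindings make the access matrix reviewable, here is a small Python model of the idea. It is a sketch only: the real bindings lived in Terraform, and every service name, service account email, and function below is hypothetical. The point is that once each trust relationship is a declared record, "which services can call the billing API?" becomes a lookup rather than a code-reading exercise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvokerBinding:
    """One declared trust relationship: `caller` may invoke `target`."""
    caller: str  # service account of the calling service (hypothetical)
    target: str  # name of the receiving service (hypothetical)


# Hypothetical access matrix; in the project this role was played by
# version-controlled Terraform IAM bindings.
BINDINGS = [
    InvokerBinding("checkout@example-project.iam.gserviceaccount.com", "billing-api"),
    InvokerBinding("invoicing@example-project.iam.gserviceaccount.com", "billing-api"),
    InvokerBinding("checkout@example-project.iam.gserviceaccount.com", "inventory-api"),
]


def callers_of(target: str, bindings=BINDINGS) -> list[str]:
    """List every identity that is explicitly allowed to call `target`."""
    return sorted({b.caller for b in bindings if b.target == target})


def is_allowed(caller: str, target: str, bindings=BINDINGS) -> bool:
    """Default deny: access exists only if a binding declares it."""
    return InvokerBinding(caller, target) in bindings
```

Adding a trust relationship in this model is a single new record, which mirrors the small, reviewable change described above.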

Result

The SOC 2 audit was the most immediate validation. The auditors asked for evidence of least-privilege access between internal services, and we were able to show them a Terraform file where every binding was explicit, every role was scoped, and every change was traceable to a commit and a code review. There were no shared credentials to account for, no rotation schedule to verify, and no exceptions to explain. The auditors flagged it as a model implementation — one of the cleaner IAM configurations they had seen for a microservices deployment of this size.

Beyond compliance, the operational benefit was quieter but real. The on-call rotation had previously dealt with a recurring class of incidents where a rotated or expired secret would silently break a service-to-service call in a way that was hard to distinguish from a network partition. That class of incident disappeared. Token issuance failures are now surfaced through GCP’s own credential infrastructure with clear error codes, and because there are no long-lived secrets to expire unexpectedly, the failure mode is almost always a misconfigured IAM binding — which is visible in Cloud Audit Logs within seconds. The two-week timeline was tight, but the migration was safe to do incrementally because we could run the old and new auth paths in parallel per-service until we had confidence in each cutover.
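One possible shape for running both auth paths side by side during a per-service cutover is sketched below in Python. The source does not describe how the parallel paths were wired, so the mode names, the `authenticate` function, and the verifier callables are all assumptions made for illustration. Real verifiers would validate the OIDC token or the legacy credential; here they are injected so the routing logic stays visible.

```python
from enum import Enum
from typing import Callable, Mapping, Optional


class AuthMode(Enum):
    """Hypothetical per-service migration stages."""
    LEGACY_ONLY = "legacy_only"  # before migration
    PARALLEL = "parallel"        # both paths accepted during cutover
    OIDC_ONLY = "oidc_only"      # after the legacy path is removed


# A verifier inspects request headers and returns the caller identity,
# or None if the credential is missing or invalid.
Verifier = Callable[[Mapping[str, str]], Optional[str]]


def authenticate(
    headers: Mapping[str, str],
    mode: AuthMode,
    verify_oidc: Verifier,
    verify_legacy: Verifier,
) -> Optional[tuple[str, str]]:
    """Return (caller, path_used), or None if the request is rejected."""
    if mode is not AuthMode.LEGACY_ONLY:
        caller = verify_oidc(headers)
        if caller is not None:
            return caller, "oidc"
    if mode is not AuthMode.OIDC_ONLY:
        caller = verify_legacy(headers)
        if caller is not None:
            return caller, "legacy"
    return None
```

Recording which path authenticated each request gives a simple cutover signal under this design: once a service sees no "legacy" results for a while, it can move to the OIDC-only mode and the old verifier can be deleted.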

Key highlights

Removed 23 hardcoded secrets from environment variables, Kubernetes secrets, and Helm values files
Zero shared credentials between any two services
SOC 2 audit passed with no auth-related findings
~800 lines of custom auth code deleted (middleware, token exchange endpoints, signature verification utilities)

Tech stack

GCP IAM
Workload Identity
Terraform
Cloud Run
