Cloud Migration 2 weeks

Self-managed database to managed cloud

Problem

The client was running PostgreSQL on a self-managed VM cluster — a setup that had made sense years earlier but had quietly become one of the most expensive line items on their engineering budget. A dedicated DBA was spending the majority of their time on operational work: applying minor version patches during scheduled maintenance windows, babysitting logical replication slots, monitoring WAL archiving, and responding to alerts at 2am when a long-running VACUUM process caused autovacuum to fall behind. None of this was differentiating work — it was pure infrastructure tax.

The failure mode was the most concerning part. Their high-availability setup relied on Patroni for automatic leader election and a hand-rolled set of shell scripts for failover orchestration. In theory, it worked. In practice, every time the primary went down, someone had to watch the logs, confirm the replica had caught up, and manually verify the connection pooler had rerouted traffic before declaring the incident resolved. The mean time to recovery was acceptable on paper; the actual human cost of each event was not. Postmortems consistently pointed to the same root cause: too much operational complexity maintained by too few people.

What we did

We started with a two-week audit of the existing database workload — slow query logs, index usage statistics, connection pooling patterns, and peak IOPS — before touching anything in production. This gave us a sizing baseline for Cloud SQL and surfaced three unused indexes and one particularly destructive N+1 query pattern that we flagged for the application team. Migrating to a managed service was the right call, but we wanted to land on a better-configured instance, not just a managed version of the existing mess.

The migration itself used Google’s Database Migration Service with continuous replication. We provisioned a Cloud SQL for PostgreSQL instance via Terraform — including replica configuration, automated backup schedules, and maintenance window settings — so the entire target environment was reproducible and code-reviewed before we promoted it. DMS handled the initial full dump and then streamed ongoing changes over logical replication, keeping replication lag consistently below 30 seconds throughout the two-week parallel-run period. Cutover was a coordinated sequence: quiesce writes at the application layer, confirm zero lag on the DMS job, update the connection string in Secret Manager, re-enable writes. Total application downtime was under 90 seconds. We kept the source instance running in read-only mode for 72 hours post-cutover as a rollback option, then decommissioned it cleanly once confidence was established.

Result

The immediate operational impact was straightforward to measure: the DBA on-call rotation was eliminated entirely. Cloud SQL handles minor version patching automatically during the configured maintenance window with a rolling restart that keeps the instance available. Automated backups run on a defined schedule with point-in-time recovery granularity down to the second — no more manual pg_basebackup jobs or worrying about whether the WAL archive was actually usable. Failover to the read replica, when tested, completed in under 60 seconds with no manual intervention required.

The cost reduction came from two places. First, the managed instance right-sized to actual workload rather than the original VM cluster that had been provisioned for a peak that never materialized — that alone accounted for roughly 25% of the savings. Second, and more significantly, the DBA role was reoriented entirely toward application-layer database work: schema design review, query optimization, and advising on data modeling decisions. The operational overhead that had consumed most of that capacity simply disappeared. Combined, monthly database-related costs dropped by 40%. Query Insights in Cloud SQL gave the team the slow query visibility they had previously needed a separate monitoring stack to achieve, which simplified the observability tooling as a secondary benefit.

Key highlights

Zero-downtime migration with < 30s replication lag
Eliminated DBA on-call rotation entirely
40% reduction in monthly database costs
Automated backups with point-in-time recovery to any second

Tech stack

Cloud SQL (PostgreSQL)TerraformDatabase Migration Service

Have a similar challenge?

Book a call