OpenTelemetry for end-to-end observability
Problem
The engineering team had grown the platform to eight microservices running on Cloud Run, each emitting logs in slightly different formats — some JSON, some plaintext, some mixing both depending on which developer had last touched the service. Metrics were split across Cloud Monitoring for infrastructure and a self-hosted Prometheus for application-level data, but the two were never correlated. There was no distributed tracing at all. When a latency spike or error surge hit production, the on-call engineer’s first move was to pick the service that seemed most likely to be at fault, grep through its logs, and work backward from there.
This approach had a ceiling, and the team hit it hard during a series of incidents in the months before we engaged. A p99 latency regression on the checkout flow took nearly four hours to resolve — not because the fix was complex, but because identifying the offending database query required manually correlating timestamps across three different log streams and two dashboards in different tools. Another incident involving a downstream API timeout was initially misattributed to the wrong service entirely, wasting over an hour of debugging effort. The fundamental issue was that the observability tooling treated each service as an island, with no shared request context threading through the system.
What we did
We started with a two-day audit of every service’s existing instrumentation, cataloguing what was being emitted, in what format, and where it was going. From there we designed a unified OpenTelemetry pipeline: each Node.js service would emit traces, structured logs, and metrics through the OTel SDK, all flowing to a single OpenTelemetry Collector deployment that forwarded everything to Grafana Cloud. Critically, we chose to instrument at the HTTP middleware layer first — adding trace context propagation via W3C TraceContext headers meant that every inbound request to any service automatically inherited or initiated a trace, with span context injected into all downstream calls without requiring changes to business logic.
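The pipeline described above hinges on a single Collector deployment receiving OTLP from every service and forwarding it to Grafana Cloud. A minimal sketch of that Collector configuration might look like the following; the endpoint URL and the GRAFANA_CLOUD_AUTH environment variable are placeholders, not the actual values used.

```yaml
# OTel Collector: receive OTLP from all eight services, batch, and
# forward traces, metrics, and logs to Grafana Cloud over OTLP/HTTP.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:    # batch telemetry before export to reduce outbound requests

exporters:
  otlphttp:
    endpoint: https://otlp-gateway.example.grafana.net/otlp  # placeholder endpoint
    headers:
      Authorization: "Basic ${env:GRAFANA_CLOUD_AUTH}"       # placeholder credential

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Routing all three signals through one Collector is what makes the later consolidation into a single backend a configuration change rather than a per-service migration.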
For structured logging, we replaced the ad-hoc console.log calls scattered across services with a shared logging wrapper that automatically attached trace_id and span_id from the active OTel context. This single change meant that every log line was now correlated to its parent trace, making it possible to jump from a Grafana Explore query directly into the full trace view for any log entry. We also wired up the OpenTelemetry metrics SDK to emit request duration histograms and error rate counters per service and per route, then built a set of Grafana alerting rules targeting p95 latency thresholds and error rate percentages — thresholds informed by two weeks of baseline data we collected during the instrumentation rollout. The entire migration was done service-by-service over three weeks with no downtime, using feature flags to control whether the OTel exporter was active in each environment.
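The p95 alerting rules can be illustrated with a PromQL expression of the following shape; the metric name request_duration_seconds is hypothetical, standing in for whatever name the OTel metrics SDK histograms are exported under.

```promql
# Hypothetical p95 latency per service and route, computed from an
# OTel-exported duration histogram; an alert rule compares this value
# against the threshold derived from the two-week baseline.
histogram_quantile(
  0.95,
  sum by (service, route, le) (rate(request_duration_seconds_bucket[5m]))
)
```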
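The shared logging wrapper can be sketched roughly as below. This is an illustrative reconstruction, not the actual wrapper: the span-context accessor is injected as a function so the sketch runs standalone, whereas in the real services it would wrap the OpenTelemetry API's trace.getActiveSpan().

```javascript
// Illustrative sketch of a shared structured logger that attaches
// trace_id and span_id from the active OpenTelemetry span context.
// getSpanContext is injected so the sketch is self-contained; in a
// real Node.js service it would be something like
//   () => trace.getActiveSpan()?.spanContext()
// from @opentelemetry/api.
function createLogger(getSpanContext) {
  const emit = (level, message, fields = {}) => {
    const ctx = getSpanContext();
    const line = {
      timestamp: new Date().toISOString(),
      level,
      message,
      ...fields,
      // Correlate the log line with its parent trace when one is active.
      ...(ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {}),
    };
    console.log(JSON.stringify(line));
    return line;
  };
  return {
    info: (msg, fields) => emit('info', msg, fields),
    error: (msg, fields) => emit('error', msg, fields),
  };
}
```

Because every line is a single JSON object carrying trace_id, a Grafana Explore query on any log entry can pivot straight into the corresponding trace view.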
Result
The most immediate change was in how the team experienced incidents. Within the first week of full rollout, an alert fired on elevated p95 latency in the order-processing service. The on-call engineer opened Grafana, clicked through to the trace view, and within three minutes had identified that the slowdown originated in a single database query inside a background job that was unintentionally running on the hot path during peak traffic. Fix deployed, incident closed — total time under twenty minutes. The same class of issue would previously have required grepping through logs across at least three services before even forming a hypothesis.
Over the following six weeks, the proactive alerting rules caught three separate regressions before they were reported by users: a memory pressure issue causing elevated GC pauses in the notification service, a third-party API that had begun to degrade, visible as rising p99 latency on outbound spans, and a bad deployment that introduced an N+1 query pattern. In each case, the alert fired within minutes of the regression appearing in the trace data. Consolidating from four tools — Cloud Monitoring, a self-hosted Prometheus, a separate log aggregator, and ad-hoc Cloud Run log tailing — into Grafana Cloud also eliminated a meaningful amount of operational overhead. The team now has a single pane of glass where a trace can be expanded to show every service hop, the database queries executed within each span, and the structured log lines emitted at each stage.
Key highlights
- Mean time to resolution (MTTR) dropped from ~4 hours to under 20 minutes
- Traces span 8 services end-to-end, including database queries
- Proactive alerting caught 3 latency regressions before users noticed
- Consolidated 4 separate logging/monitoring tools into one
Tech stack
Node.js · OpenTelemetry (SDK + Collector) · Cloud Run · Grafana Cloud · W3C TraceContext
Have a similar challenge?
Book a call