Skip to main content

Command Palette

Search for a command to run...

Your Lambda Is Running, But Are You Actually Seeing the Problems?

Updated
5 min read
  Your Lambda Is Running, But Are You Actually Seeing the Problems?

There's a particular kind of silence that should terrify every engineer running serverless workloads. It's the silence of a dashboard full of green. Invocations: normal. Duration: normal. Errors: zero. Throttles: zero. Everything, according to CloudWatch, is fine.

And yet, somewhere out there, your users are screaming.

This isn't a hypothetical. This happened to us. And the lesson cost us the trust we had to earn back, one customer at a time.

The Day Everything Was "Fine"

We had a Lambda powering a critical flow. Nothing fancy — HTTP in, business logic, HTTP out. It had been running for months without issue. Our alerts were tuned, our dashboards were polished, and our on-call rotation was quiet.

Then the support tickets started coming in.

Not a flood. A trickle at first. "It's not working." "I can't complete this action." A couple of customers, then a handful, then dozens. By the time we connected the dots, we were staring at a Slack channel full of complaints stretching back days.

We pulled up the dashboards. Green. We checked the error rate. Flat. We looked at the alarms. Silent.

The Lambda was running. It just wasn't working.

The Bug Beneath the Bug

Here's what had happened: somewhere in our error handling, an exception path was returning a 403 Forbidden to the client. Not a 500. Not a 502. A clean, intentional-looking 403.

From the platform's perspective, the Lambda had done its job. It received a request, executed code, returned a response. No timeout. No crash. No 5xx. Nothing for AWS to flag, nothing for our alerts to catch.

But to the customer? Every single request was failing. The flow was completely dead.

The actual bug — the one that had triggered the bad error path — was almost beside the point. The real failure was that we had no way of seeing what was happening. Our system was lying to us, calmly and confidently, in the language of HTTP status codes.

The Illusion of Default Metrics

Lambda gives you a small set of metrics out of the box: invocations, duration, errors, throttles, concurrent executions. These are infrastructure metrics. They tell you whether the runtime is healthy. They tell you nothing about whether your product is healthy.

A Lambda that returns 403 on every request for six hours straight is, by AWS's definition, a perfectly healthy Lambda.

This is the trap. The cloud provider gives you observability into its own concerns — was the function invoked, did it crash, did it run out of memory. Your concerns are different. Your concerns are: did the user successfully do the thing they came here to do?

If your alerts only fire on 5xx, you are explicitly betting that every failure mode in your system will surface as a 5xx. That bet is almost never correct.

What We Should Have Been Watching

After we fixed it, we sat down and asked an uncomfortable question: what else are we not seeing right now?

A few things came out of that conversation.

Status code distributions matter as much as error rates. A sudden spike in 403s is not "normal traffic." A jump in 404s isn't either. Any non-2xx response, in volume, is a signal. We started alerting on rate-of-change across the entire response code spectrum, not just the 5xx bucket.

Business outcomes beat HTTP semantics. The question we actually cared about wasn't "are we returning errors?" — it was "are customers completing the flow?" We added counters for the meaningful steps of the journey, with alerts on conversion-rate drops. When a flow goes from 95% completion to 0%, you find out in minutes, not days.

Logs need structure, and structure needs to be queryable. Our logs were there. The information was technically present. But "grep for 403" across days of unstructured logs from a Lambda you don't suspect is broken is not a strategy. Structured logging, with consistent fields and an actual log aggregator behind them, turns "we had no idea" into "we noticed within minutes."

Traces close the loop. When the alert fires, the question becomes: why? Distributed tracing — connecting the request from the edge through every internal hop — turns that question from a multi-hour archaeology project into a single click. We had been treating tracing as a nice-to-have. After this incident, it became table stakes.

The Real Lesson

The most dangerous outages aren't the ones where your service is down. Those are loud, obvious, and well-instrumented. Everyone knows what to do when the 5xx graph goes vertical.

The dangerous ones are the ones where your service is up and wrong. Where the metrics lie because nobody told them what the truth was supposed to look like. Where the alerts are silent because they were written against the failure modes you imagined, not the ones that actually happened.

Observability is not a feature you bolt on after the fact. It is the question you ask before you ship:

▎ If this flow stopped working tomorrow, in a way I haven't predicted, how long would it take me to know?

If the honest answer is "a customer would have to tell me" — and for most teams, most of the time, it is — then you don't have observability. You have hope.

And hope is a terrible monitoring strategy.

__________________________________________________________

Enjoyed this? The Practical Serverless newsletter covers this kind of stuff every week — no filler, just real serverless patterns from the field.

AI Disclaimer: AI has been utilized to refine the text. The experiences and content are my own.