<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Practical Serverless]]></title><description><![CDATA[**Practical Serverless** is a blog about building real-world serverless systems.

Here you'll find practical insights on designing, building, and operating serverless architectures in production. I write about event-driven systems, cloud-native patterns, scalability, reliability, and the trade-offs that come with distributed systems.

Expect deep dives, architecture breakdowns, lessons learned from real implementations, and pragmatic guidance for engineers building serverless platforms.

If you're interested in serverless, distributed systems, and modern cloud architecture, you're in the right place.
]]></description><link>https://practicalserverless.blog</link><image><url>https://cdn.hashnode.com/uploads/logos/69b3cd56c9e75ce33d841724/a1fe57e2-6356-45b2-823d-a3612b2098ff.png</url><title>Practical Serverless</title><link>https://practicalserverless.blog</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 09 Apr 2026 13:47:50 GMT</lastBuildDate><atom:link href="https://practicalserverless.blog/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[How I Built a Serverless Testing Library That Cuts Test Setup by 90%]]></title><description><![CDATA[Every Lambda test starts the same way: you need an event object — and crafting one is annoying. API Gateway v2 events have 30+ fields, SQS needs message IDs, receipt handles, and ARNs, and DynamoDB St]]></description><link>https://practicalserverless.blog/how-i-built-a-serverless-testing-library-that-cuts-test-setup-by-90</link><guid isPermaLink="true">https://practicalserverless.blog/how-i-built-a-serverless-testing-library-that-cuts-test-setup-by-90</guid><category><![CDATA[serverless, testing, lambda]]></category><category><![CDATA[serverless]]></category><category><![CDATA[Testing]]></category><category><![CDATA[lambda]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[Lucas Brogni]]></dc:creator><pubDate>Wed, 08 Apr 2026 09:19:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b3cd56c9e75ce33d841724/c7044fc0-41a6-4c59-932b-9eb80d38d253.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every Lambda test starts the same way: you need an event object — and crafting one is annoying. API Gateway v2 events have 30+ fields, SQS needs message IDs, receipt handles, and ARNs, and DynamoDB Streams expect marshaled AttributeValue maps. The usual options are copy‑pasting a 60‑line JSON fixture or spending 20 minutes hand‑crafting one from memory.</p>
<p>I built <code>@sls-testing</code> to stop that. It provides typed, composable one‑line builders that give sensible defaults, automatic marshaling, and easy overrides so your tests only express what matters.</p>
<p>The payoff: what used to be a 30–60 line fixture becomes a single builder call — cutting test setup by roughly 90%. Below, I’ll show before/after examples, the API surface, and how it handles common event types (API Gateway, SQS, S3, DynamoDB Streams).</p>
<p>Here's what the before/after looks like.</p>
<h2>The Problem: 60 Lines to Say "POST /users"</h2>
<p>Testing a Lambda handler behind API Gateway v2 requires an <code>APIGatewayProxyEventV2</code> object. Here's the minimum viable event most teams copy around:</p>
<pre><code class="language-typescript">const event = {
  version: '2.0',
  routeKey: '$default',
  rawPath: '/users',
  rawQueryString: '',
  headers: {
    'content-type': 'application/json',
    'accept': 'application/json',
  },
  isBase64Encoded: false,
  body: JSON.stringify({ name: 'Lucas' }),
  requestContext: {
    accountId: '123456789012',
    apiId: 'test-api-id',
    domainName: 'test-api-id.execute-api.us-east-1.amazonaws.com',
    domainPrefix: 'test-api-id',
    http: {
      method: 'POST',
      path: '/users',
      protocol: 'HTTP/1.1',
      sourceIp: '127.0.0.1',
      userAgent: 'jest',
    },
    requestId: 'some-uuid-here',
    routeKey: '$default',
    stage: '$default',
    time: '01/Jan/2024:00:00:00 +0000',
    timeEpoch: 1704067200000,
  },
}
</code></pre>
<p>That's <strong>30+ lines</strong> for an event where the only things you actually care about are the method, path, and body. The rest is structural noise — correct enough to not crash, meaningless to your test.</p>
<p>Now multiply that by every event type in your service. SQS needs <code>messageId</code>, <code>receiptHandle</code>, <code>attributes</code>, <code>eventSourceARN</code>. S3 needs <code>bucket</code>, <code>key</code>, <code>responseElements</code>, <code>userIdentity</code>. DynamoDB Streams need marshalled <code>AttributeValue</code> maps where <code>"hello"</code> becomes <code>{ S: "hello" }</code> and <code>42</code> becomes <code>{ N: "42" }</code>.</p>
<p>Most teams solve this one of three ways:</p>
<ol>
<li><p><strong>Copy-paste JSON fixtures</strong> — Brittle, verbose, drift from reality over time.</p>
</li>
<li><p><strong>Hand-roll factory functions</strong> — Every team writes their own, slightly differently, and they're never complete.</p>
</li>
<li><p><strong>Skip testing</strong> — The honest answer when the setup cost exceeds the perceived value.</p>
</li>
</ol>
<p>None of these are good.</p>
<h2>The Solution: Express Intent, Not Structure</h2>
<p>With <code>@sls-testing/core</code>, the same test becomes:</p>
<pre><code class="language-typescript">import { buildApiGatewayEvent } from '@sls-testing/core'

const event = buildApiGatewayEvent({
  method: 'POST',
  path: '/users',
  body: JSON.stringify({ name: 'Lucas' }),
})
</code></pre>
<p><strong>Three lines. Same fully-typed event.</strong> Every field you didn't specify gets a sensible default — a real-looking request ID, a timestamp, valid ARNs. The TypeScript types come from <code>@types/aws-lambda</code>, so your IDE autocompletes every field if you need to override something specific.</p>
<p>The pattern is the same across all six event types:</p>
<pre><code class="language-typescript">// SQS — bodies auto-serialized, each record gets a unique messageId
const sqsEvent = buildSQSEvent({
  records: [
    { body: { orderId: 'abc-123', amount: 99.9 } },
    { body: { orderId: 'def-456', amount: 49.9 } },
  ],
})

// S3 — just bucket and key, everything else filled in
const s3Event = buildS3Event({
  bucket: 'uploads',
  key: 'images/photo.png',
})

// DynamoDB Streams — plain objects auto-marshalled to AttributeValue
const streamEvent = buildDynamoDBStreamEvent({
  records: [{
    eventName: 'INSERT',
    keys: { id: 'abc' },
    newImage: { id: 'abc', name: 'Lucas', count: 42 },
  }],
})

// EventBridge
const ebEvent = buildEventBridgeEvent({
  source: 'app.orders',
  'detail-type': 'OrderPlaced',
  detail: { orderId: 'abc-123' },
})

// SNS
const snsEvent = buildSNSEvent({
  records: [{ message: { action: 'notify' } }],
})
</code></pre>
<p>The DynamoDB builder is where the savings are most dramatic. Manually constructing a <code>DynamoDBStreamEvent</code> with marshalled values is easily 40-50 lines. The builder does the marshalling for you — pass <code>{ count: 42 }</code> and it becomes <code>{ N: "42" }</code> automatically.</p>
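<p>For intuition, here is a minimal sketch of what that marshalling step does. This is an illustrative recursion, not the library's actual implementation (a production marshaller, like the one in AWS's <code>@aws-sdk/util-dynamodb</code>, also handles binary data, sets, and number precision):</p>
<pre><code class="language-typescript">// Illustrative sketch of plain-object -&gt; AttributeValue marshalling.
type AttributeValue =
  | { S: string }
  | { N: string }
  | { BOOL: boolean }
  | { NULL: true }
  | { L: AttributeValue[] }
  | { M: Record&lt;string, AttributeValue&gt; }

function marshall(value: unknown): AttributeValue {
  if (value === null) return { NULL: true }
  if (typeof value === 'string') return { S: value }
  if (typeof value === 'number') return { N: String(value) } // numbers travel as strings
  if (typeof value === 'boolean') return { BOOL: value }
  if (Array.isArray(value)) return { L: value.map(marshall) }
  if (typeof value === 'object') {
    const out: Record&lt;string, AttributeValue&gt; = {}
    for (const [k, v] of Object.entries(value)) out[k] = marshall(v)
    return { M: out }
  }
  throw new Error('Unsupported type: ' + typeof value)
}

marshall({ id: 'abc', count: 42 })
// { M: { id: { S: 'abc' }, count: { N: '42' } } }
</code></pre>
<p>The builder applies this conversion to inputs like <code>keys</code> and <code>newImage</code> in the earlier example, so your test stays in plain objects.</p>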
<h2>Beyond Events: Lambda Context</h2>
<p>Events are half the story. Your handler also receives a <code>Context</code> object, and AWS's type definition has 12 fields. Most tests either ignore it (<code>handler(event, {} as any)</code> — hello, runtime crash) or build an incomplete mock.</p>
<pre><code class="language-typescript">import { buildLambdaContext } from '@sls-testing/core'

const context = buildLambdaContext({
  functionName: 'order-service-dev-processOrder',
  memoryLimitInMB: '512',
  remainingTimeOverride: 5000,
})

context.getRemainingTimeInMillis() // 5000 — actually works
</code></pre>
<p>Every field has a default. <code>getRemainingTimeInMillis()</code> returns the value you configure. The <code>awsRequestId</code> is a real UUID. The <code>logGroupName</code> derives from the function name. It's a real <code>Context</code> object, not a type-cast empty object.</p>
<h2>Assertions That Speak Serverless</h2>
<p>The companion package <code>@sls-testing/jest</code> adds custom Jest matchers that understand Lambda response shapes:</p>
<pre><code class="language-typescript">import '@sls-testing/jest'

const result = await handler(event, context)

// Status code assertions
expect(result).toHaveStatusCode(200)
expect(result).toBeSuccessfulApiResponse()  // any 2xx
expect(result).toBeClientError()             // any 4xx
expect(result).toBeServerError()             // any 5xx

// Deep response matching with asymmetric matchers
expect(result).toMatchLambdaResponse({
  statusCode: 201,
  body: { userId: expect.any(String) },
  headers: { 'content-type': 'application/json' },
})

// SQS batch response assertions
expect(result).toHaveNoFailedMessages()
expect(result).toHaveFailedMessage('msg-id-2')
</code></pre>
<p><code>toMatchLambdaResponse</code> automatically parses the JSON body for comparison — you don't need to <code>JSON.parse(result.body)</code> in every test. Asymmetric matchers like <code>expect.any(String)</code> work inside the body, so you can assert structure without pinning every generated value.</p>
<p>The error messages are designed for Lambda. When <code>toHaveStatusCode</code> fails, it shows you both the expected and actual status codes plus the response body — because when a Lambda returns 500 instead of 200, the first thing you need is the error message, not a generic "expected 200 but received 500".</p>
<h2>What the Numbers Actually Look Like</h2>
<p>Let me do the math on a real scenario — a service with three Lambda functions (API Gateway handler, SQS consumer, DynamoDB Stream processor), each with 3-4 test cases.</p>
<h3>Without @sls-testing</h3>
<table>
<thead>
<tr>
<th>Component</th>
<th>Lines</th>
</tr>
</thead>
<tbody><tr>
<td>API Gateway event fixture</td>
<td>~35</td>
</tr>
<tr>
<td>SQS event fixture (2 records)</td>
<td>~45</td>
</tr>
<tr>
<td>DynamoDB Stream event fixture</td>
<td>~50</td>
</tr>
<tr>
<td>Lambda context mock</td>
<td>~20</td>
</tr>
<tr>
<td>Helper: JSON body parser for assertions</td>
<td>~10</td>
</tr>
<tr>
<td>Helper: status code checker</td>
<td>~8</td>
</tr>
<tr>
<td>Copy-paste overhead across test files</td>
<td>~40</td>
</tr>
<tr>
<td><strong>Total test infrastructure</strong></td>
<td><strong>~208</strong></td>
</tr>
</tbody></table>
<h3>With @sls-testing</h3>
<table>
<thead>
<tr>
<th>Component</th>
<th>Lines</th>
</tr>
</thead>
<tbody><tr>
<td>API Gateway event (per test)</td>
<td>3-4</td>
</tr>
<tr>
<td>SQS event (per test)</td>
<td>3-5</td>
</tr>
<tr>
<td>DynamoDB Stream event (per test)</td>
<td>4-6</td>
</tr>
<tr>
<td>Lambda context (per test)</td>
<td>1-3</td>
</tr>
<tr>
<td>Import + matcher setup</td>
<td>2</td>
</tr>
<tr>
<td><strong>Total test infrastructure</strong></td>
<td><strong>~20</strong></td>
</tr>
</tbody></table>
<p>That's roughly a <strong>90% reduction</strong> in test setup code. But the real win isn't the line count — it's the cognitive load. When a test file is 80% fixture and 20% assertion, you can't see what's being tested. When it's 20% setup and 80% assertion, the intent is obvious.</p>
<h2>Design Decisions</h2>
<p>A few choices I made that shaped the library:</p>
<p><strong>Sensible defaults, full override.</strong> Every builder returns a complete, valid event with zero arguments. Pass a <code>DeepPartial</code> override to change any field. This means the simple case is one line, but you can still construct precise edge cases when you need to test specific header combinations or malformed payloads.</p>
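<p>Mechanically, that pattern is just a complete default event deep-merged with the caller's partial. A toy sketch of the idea (the real builders work over the full AWS event types and a proper <code>DeepPartial</code> type; this version skips arrays-of-objects and other edge cases):</p>
<pre><code class="language-typescript">// Illustrative defaults-plus-override merge. A real DeepPartial type
// would make the override type-safe against the event shape.
function deepMerge(base: any, override: any): any {
  const out: any = { ...base }
  for (const [key, value] of Object.entries(override ?? {})) {
    out[key] =
      value !== null &amp;&amp; typeof value === 'object' &amp;&amp; !Array.isArray(value)
        ? deepMerge(base?.[key] ?? {}, value)
        : value
  }
  return out
}

const defaults = {
  rawPath: '/',
  requestContext: { http: { method: 'GET', path: '/' } },
}

// Override only what the test cares about; everything else keeps its default.
const event = deepMerge(defaults, {
  requestContext: { http: { method: 'POST' } },
})
// event.requestContext.http = { method: 'POST', path: '/' }
</code></pre>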
<p><strong>Auto-serialization.</strong> SQS bodies and SNS messages are automatically <code>JSON.stringify</code>'d. DynamoDB images are automatically marshalled. You pass plain objects; the builder handles the format Lambda actually receives.</p>
<p><strong>Framework-agnostic core.</strong> <code>@sls-testing/core</code> works with Jest, Vitest, Mocha, or any test runner. The Jest-specific matchers are a separate package. Vitest adapters are planned for v2.</p>
<p><strong>Types from the source.</strong> All event types come from <code>@types/aws-lambda</code> — the community-maintained definitions that match the actual AWS runtime. No custom type definitions that could drift.</p>
<p><strong>Unique identifiers per call.</strong> Every <code>buildSQSEvent()</code> call generates unique <code>messageId</code>s, every context gets a unique <code>awsRequestId</code>. This prevents subtle test pollution where two tests accidentally share the same ID.</p>
<h2>Getting Started</h2>
<pre><code class="language-bash">npm install @sls-testing/core @sls-testing/jest --save-dev
</code></pre>
<p>Add the Jest setup (or import per file):</p>
<pre><code class="language-json">{
  "setupFilesAfterEnv": ["@sls-testing/jest"]
}
</code></pre>
<p>Write a test:</p>
<pre><code class="language-typescript">import { buildApiGatewayEvent, buildLambdaContext } from '@sls-testing/core'
import '@sls-testing/jest'
import { handler } from './handler'

it('creates a user', async () =&gt; {
  const event = buildApiGatewayEvent({
    method: 'POST',
    path: '/users',
    body: JSON.stringify({ name: 'Lucas' }),
  })

  const result = await handler(event, buildLambdaContext())

  expect(result).toHaveStatusCode(201)
  expect(result).toMatchLambdaResponse({
    body: { name: 'Lucas', id: expect.any(String) },
  })
})
</code></pre>
<p>That's it. No fixture files. No factory functions. No <code>as any</code> casts.</p>
<h2>What's Next</h2>
<p>The library is at v1 and covers the six most common Lambda event sources. The roadmap includes:</p>
<ul>
<li><p><strong>Vitest adapter</strong> — Same matchers, native Vitest integration</p>
</li>
<li><p><strong>Serverless Framework plugin</strong> — Bridge <code>serverless.yml</code> config into tests so function names, timeouts, and env vars stay in sync automatically</p>
</li>
<li><p><strong>More event types</strong> — Cognito triggers, CloudWatch Events, Kinesis</p>
</li>
<li><p><strong>Snapshot testing</strong> — Assert that response shapes haven't changed across deploys</p>
</li>
<li><p><strong>Error simulation</strong> — Builders for timeout, OOM, and cold start scenarios</p>
</li>
</ul>
<p>The repo is at <a href="https://github.com/brognilucas/sls-testing">github.com/brognilucas/sls-testing</a>. Contributions welcome — especially if you have event types you'd like to see supported.</p>
<hr />
<p><em>Testing serverless applications shouldn't require more boilerplate than the business logic itself. If your test files are 80% fixture setup, something is wrong with the tooling, not with your tests.</em></p>
]]></content:encoded></item><item><title><![CDATA[How to Choose the Right Database for Your Serverless Application]]></title><description><![CDATA[Serverless promises to free teams from infrastructure worries, but picking the wrong database can hurt your performance, increase your costs, and affect developer velocity.
As with everything in softw]]></description><link>https://practicalserverless.blog/how-to-choose-the-right-database-for-your-serverless-application</link><guid isPermaLink="true">https://practicalserverless.blog/how-to-choose-the-right-database-for-your-serverless-application</guid><category><![CDATA[Databases]]></category><category><![CDATA[serverless]]></category><dc:creator><![CDATA[Lucas Brogni]]></dc:creator><pubDate>Wed, 01 Apr 2026 18:56:27 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b3cd56c9e75ce33d841724/4ee74504-c9a1-48dd-9c05-9decbe82c607.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Serverless promises to free teams from infrastructure worries, but picking the wrong database can hurt your performance, increase your costs, and affect developer velocity.</p>
<p>As with everything in software, the database choice comes with trade-offs, and understanding what those are is extremely important. Scaling characteristics, connection handling and concurrency, latency, consistency and transactional needs, operational overhead, and pricing model are all factors to consider.</p>
<p>This article unpacks those trade-offs, compares common patterns (serverless‑native databases, managed relational options, caches, and streaming stores), and offers practical rules of thumb so you can pick a database that fits your application rather than creating new operational headaches. By the end, you’ll have a concise checklist to make the decision faster and more confidently.</p>
<h2>Why database choice matters more in serverless</h2>
<p>In traditional servers, database connections are opened once and reused across thousands of requests. Application instances are long‑lived and predictable. Serverless flips that model: functions spin up, live for seconds or minutes, and vanish. Each invocation may be a fresh process with no previous state, no persistent connection, and no guarantee of locality to previous requests. That changes the calculus: connection limits, cold‑start penalties, and per‑operation pricing matter far more than they did in long‑running servers.</p>
<p>A database that works well behind a long‑lived app can cause connection storms, latency spikes, or runaway costs when used directly from a fleet of ephemeral functions. The goal is to match your workload’s requirements (throughput, latency, consistency, transactions) with a storage option whose trade‑offs align with serverless behavior.</p>
<h2>Key trade-offs to weigh</h2>
<ul>
<li><p>Scaling characteristics: Does the database scale horizontally without connection limits or shard coordination that conflicts with ephemeral clients?</p>
</li>
<li><p>Connection handling and concurrency: Can thousands of short‑lived connections be supported efficiently, or do you need a pooling/proxy layer?</p>
</li>
<li><p>Latency: Are single‑digit‑millisecond reads required, or can you accept higher, variable latency?</p>
</li>
<li><p>Consistency and transactions: Do you need strong ACID guarantees across multiple keys/tables, or is eventual consistency acceptable?</p>
</li>
<li><p>Operational overhead: How much maintenance, tuning, backups, and failover handling will your team manage?</p>
</li>
<li><p>Pricing model: Per‑operation, provisioned capacity, or storage‑centric billing—how do patterns of traffic (spiky vs steady) affect cost?</p>
</li>
</ul>
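<p>Of these, connection handling is usually the first to bite. A common Lambda mitigation is to create the client once per container, at module scope, so warm invocations reuse it. A sketch with a hypothetical stand-in client factory (a real handler would use an actual driver such as <code>pg</code>, ideally behind a pooler or proxy):</p>
<pre><code class="language-typescript">// Stand-in for a real database driver; counts how many connections open.
let connectionsOpened = 0

function createClient() {
  connectionsOpened += 1 // a real driver would open a TCP connection here
  return {
    async query(sql: string) {
      return [{ ok: true }]
    },
  }
}

// Module scope: runs once per cold start, reused while the container is warm.
const client = createClient()

async function handler(event: unknown) {
  // If the client were created inside the handler instead, every
  // invocation would open its own connection.
  return client.query('SELECT 1')
}
</code></pre>
<p>Two warm invocations still mean one connection. Note that 1,000 concurrent containers still mean 1,000 connections, which is why a proxy or pooler in front of a relational database matters at scale.</p>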
<h2>Common patterns and how they map to serverless</h2>
<ul>
<li><p>Serverless‑native databases (e.g., serverless NoSQL or fully serverless managed stores):</p>
<ul>
<li><p>Pros: Auto‑scaling, connectionless or HTTP/SDK access, fine‑grained billing, low operational overhead.</p>
</li>
<li><p>Cons: Weaker transactional guarantees or complex modeling for relational data; can be expensive at very high sustained throughput.</p>
</li>
<li><p>When to use: Spiky workloads, simple access patterns, evented architectures, or when you want minimal ops.</p>
</li>
</ul>
</li>
<li><p>Managed relational databases (serverless variants or provisioned RDS/Aurora/etc.):</p>
<ul>
<li><p>Pros: Familiar SQL, strong transactions, complex queries.</p>
</li>
<li><p>Cons: Connection limits and scaling challenges; may require connection pooling (proxy, pooler, or Data API) and can incur cold‑start latency.</p>
</li>
<li><p>When to use: Applications that require ACID across multiple records or complex joins and cannot be re‑modeled easily.</p>
</li>
</ul>
</li>
<li><p>Caches and in‑memory stores (Redis, Memcached, or managed variants):</p>
<ul>
<li><p>Pros: Extremely low latency for hot reads, useful for rate limiting, sessions, and ephemeral state.</p>
</li>
<li><p>Cons: Not a durable primary store (unless using persistence features), additional operational cost, eventual consistency with origin store.</p>
</li>
<li><p>When to use: Read‑heavy, low‑latency needs, offloading hotspots from a primary datastore.</p>
</li>
</ul>
</li>
<li><p>Streaming/append logs (Kafka, Kinesis, Pulsar, streaming databases):</p>
<ul>
<li><p>Pros: Durable event delivery, great for event‑sourcing, async processing, and decoupling components.</p>
</li>
<li><p>Cons: Not a drop‑in replacement for arbitrary reads/transactions; requires different application patterns.</p>
</li>
<li><p>When to use: Event‑driven architectures, audit logs, long‑running workflows.</p>
</li>
</ul>
</li>
</ul>
<h2>Practical rules of thumb</h2>
<ul>
<li><p>If your functions open many short‑lived DB connections, use a serverless‑friendly datastore or a connection proxy. Don’t rely on direct DB connections from unpooled functions.</p>
</li>
<li><p>For strong multi‑row/multi‑table transactions choose managed relational options—but consider a serverless (Data API) or pooled access pattern to avoid connection storms.</p>
</li>
<li><p>For spiky traffic with bursty reads, prefer serverless‑native stores and caches; they scale on demand and bill for usage.</p>
</li>
<li><p>If your app can tolerate eventual consistency, embracing key‑value or document models often reduces complexity and cost.</p>
</li>
<li><p>Use streaming stores for durable event capture and decoupling; combine with a materialized view or read store for low‑latency queries.</p>
</li>
<li><p>Measure cost at expected traffic patterns—serverless pricing can be higher for sustained, heavy throughput than for bursty, intermittent use.</p>
</li>
</ul>
<h2>Closing thoughts</h2>
<p>Choosing a database for serverless shouldn’t be guesswork. Match your access patterns and operational constraints to the storage option whose trade‑offs you can live with, and use small experiments to validate latency, scaling, and cost under realistic load. This keeps serverless simple, where it should be—letting your team move faster without trading away reliability or spiraling costs.</p>
]]></content:encoded></item><item><title><![CDATA[Events, Messages & Commands: The Concepts That Make or Break Your Serverless Architecture]]></title><description><![CDATA[You might have created a Lambda function that "handles events." But take a moment to question yourself about what an event actually is.
Let's forget the object that you can access on the lambda, and t]]></description><link>https://practicalserverless.blog/events-messages-commands-the-concepts-that-make-or-break-your-serverless-architecture</link><guid isPermaLink="true">https://practicalserverless.blog/events-messages-commands-the-concepts-that-make-or-break-your-serverless-architecture</guid><category><![CDATA[serverless]]></category><category><![CDATA[event-driven-architecture]]></category><category><![CDATA[events]]></category><category><![CDATA[architecture]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[software design]]></category><category><![CDATA[Software Engineering]]></category><dc:creator><![CDATA[Lucas Brogni]]></dc:creator><pubDate>Wed, 25 Mar 2026 11:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b3cd56c9e75ce33d841724/896866c8-c91d-4cf3-a27b-413cb2c45870.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You might have created a Lambda function that "handles events." But take a moment to question yourself about what an event actually is.</p>
<p>Let's set aside the object you receive in the Lambda handler and think about the concept itself: what makes something an event, rather than a command or a message?</p>
<p>In serverless, I believe knowing this concept matters a lot. The whole ecosystem is built on a deeply event-driven model. EventBridge, SQS, SNS, DynamoDB Streams, and S3 notifications all depend on events.</p>
<p>In this post, we'll return to the basics. We'll explain what events really are, how they differ from commands and messages, and why these differences matter in every serverless system you create.</p>
<h2>What is an event</h2>
<p>A few years ago, during a talk by James Eastham, I learned something crucial: an event is a fact, and facts cannot be undone. You can't reverse an event. Consider writing a post for this blog: once you publish it, the action is irreversible. The <code>post.published</code> event has already happened.<br />You might wonder: if I delete the post, haven't I undone the action? Not quite. You haven't reversed the publication; you've added another event to the sequence.</p>
<p>That's the essence of an event. In simple terms, an event represents an action that has occurred in the real world within your system.</p>
<h2>What is a Command</h2>
<p>If an event is something that <em>has happened</em>, a command is something you're <em>asking to happen</em>. It's a request, not a fact. And unlike events, commands can be rejected.</p>
<p>Think of it this way: when a user clicks "Publish" on your blog editor, your frontend might send a <code>PublishPost</code> command to your backend. That command can fail. The post might not meet validation rules, the user might not have the right permissions, or the system might be temporarily unavailable. The command is an intention, not a truth.</p>
<p>This distinction has real architectural consequences. Commands generally have an intended recipient. You don't broadcast a command to anyone who might be listening. You send it to the one service or function responsible for handling it. There's an implicit contract: someone is expected to act on it.</p>
<p>In serverless terms, an SQS queue carrying a <code>ResizeImage</code> instruction is a good example of a command channel. One producer, one consumer, one clear responsibility.</p>
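<p>Because a command is a request, its handler owns validation and is allowed to say no. A sketch (names are illustrative, not from any framework):</p>
<pre><code class="language-typescript">// A command handler validates intent before acting; rejection is a
// normal, expected outcome.
type CommandResult = { accepted: true } | { accepted: false; reason: string }

interface PublishPostCommand {
  postId: string
  authorId: string
}

function handlePublishPost(cmd: PublishPostCommand): CommandResult {
  if (!cmd.postId) return { accepted: false, reason: 'postId is required' }
  if (!cmd.authorId) return { accepted: false, reason: 'authorId is required' }
  // ...persist the post, then emit post.published as a fact
  return { accepted: true }
}

handlePublishPost({ postId: '', authorId: 'lucas' })
// { accepted: false, reason: 'postId is required' }
</code></pre>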
<h2><strong>What is a Message</strong></h2>
<p>A message is the broadest of the three. Both events and commands travel as messages. The word "message" tells you about the <em>transport</em>, not the <em>intent</em>.</p>
<p>This is where a lot of confusion creeps in. Developers see SNS delivering a payload and call it "just a message." Technically, yes. But what matters architecturally is what's <em>inside.</em> Is it announcing something that happened, or requesting something to be done?</p>
<p>Getting that wrong leads to systems where consumers start making assumptions they shouldn't. A consumer that receives an event shouldn't be the one deciding whether the action was valid. That ship has sailed. But a consumer that receives a command absolutely should validate it before acting.</p>
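<p>One way to make the transport/intent split concrete in code: the envelope carries transport metadata, while the payload's kind carries the intent. The types here are illustrative, not a standard:</p>
<pre><code class="language-typescript">type OrderPlaced = { kind: 'event'; type: 'order.placed'; orderId: string }
type ResizeImage = { kind: 'command'; type: 'ResizeImage'; key: string }

// The envelope is the message: transport metadata only, no intent.
interface MessageEnvelope {
  messageId: string
  publishedAt: string
  payload: OrderPlaced | ResizeImage
}

// A consumer branches on intent, not on transport details.
function describe(msg: MessageEnvelope): string {
  if (msg.payload.kind === 'event') {
    return 'fact: ' + msg.payload.type + ' has already happened'
  }
  return 'request: please perform ' + msg.payload.type
}

describe({
  messageId: 'm-1',
  publishedAt: new Date().toISOString(),
  payload: { kind: 'event', type: 'order.placed', orderId: 'abc-123' },
})
// 'fact: order.placed has already happened'
</code></pre>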
<h2><strong>Why These Differences Matter in Serverless</strong></h2>
<p>In a distributed architecture, the distinction between events and commands changes how you design your application, how you deal with errors, and how you handle retry logic.</p>
<p>With events, every listener is an observer. They react to facts. If a <code>user.registered</code> event triggers a welcome email Lambda and that function fails, you don't "undo" the registration — you retry the email. The event remains true regardless.</p>
<p>With commands, the listeners are executors. They own the outcome. A failed <code>ProcessPayment</code> command is not something you silently retry without careful thought. The intent hasn't been fulfilled, and that matters.</p>
<p>EventBridge is a great example of an event bus done right: it's designed around broadcasting facts to multiple consumers. SQS, on the other hand, lends itself naturally to commands. It's point-to-point, with visibility timeouts and dead-letter queues that reflect the expectation that <em>someone must handle this</em>.</p>
<h2>Conclusion</h2>
<p>Understanding the difference between events, commands, and messages is more than academic — it's foundational to building reliable, scalable serverless systems.</p>
<p>Events are immutable facts about things that have already happened; commands are intent to perform an action; messages are the vehicles that convey either. Treating them correctly changes how you design APIs, choose services, handle failures, and reason about system behavior.</p>
<p>Key takeaways and practical guidance:</p>
<ul>
<li><p>Name things clearly: events in past tense (e.g., <code>post.published</code>), commands as imperatives (e.g., <code>createPost</code>), messages as contextual envelopes.</p>
</li>
<li><p>Model events as immutable facts: persist them, append rather than overwrite, and use them to drive downstream state and side effects.</p>
</li>
<li><p>Use commands when you need explicit intent and control over execution (and choose queuing patterns that preserve ordering and retries).</p>
</li>
<li><p>Expect duplicates and out-of-order delivery in distributed systems: make consumers idempotent and design for eventual consistency.</p>
</li>
<li><p>Keep schemas explicit and versioned; consider a registry or strict contracts for producers and consumers.</p>
</li>
<li><p>Pick the right tool for the job: event buses like EventBridge for broadcasting facts, point-to-point queues like SQS for commands.</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[When Messages Fail: How DLQs Save Your Event-Driven System]]></title><description><![CDATA[In recent interviews, I asked candidates a system-design question about managing failures in a serverless, event-driven architecture. I was surprised by how many didn't include retry mechanisms or a D]]></description><link>https://practicalserverless.blog/when-messages-fail-how-dlqs-save-your-event-driven-system</link><guid isPermaLink="true">https://practicalserverless.blog/when-messages-fail-how-dlqs-save-your-event-driven-system</guid><dc:creator><![CDATA[Lucas Brogni]]></dc:creator><pubDate>Wed, 18 Mar 2026 12:04:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/69b3cd56c9e75ce33d841724/431a5284-e096-4a57-b755-5fcbf3a6ac7c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In recent interviews, I asked candidates a system-design question about managing failures in a serverless, event-driven architecture. I was surprised by how many didn't include retry mechanisms or a Dead Letter Queue (DLQ) for investigation. In serverless systems, where functions are stateless, and communication often depends on event-driven messaging, failures can be silent and difficult to trace, making proper error handling essential. This gap inspired this article, which explains what a DLQ is, why it is important, and how to use one effectively in your serverless and event-driven workflows.</p>
<h2>What is a DLQ?</h2>
<p>Before explaining the importance of it, let's make sure we are aligned on what a DLQ is.</p>
<p>A Dead Letter Queue, or simply DLQ, is a message queue that stores messages a consumer could not process successfully. When a message fails processing, regardless of the reason, instead of being lost or retried forever, it is redirected to the DLQ and stored there.</p>
<p>Imagine it as a holding area for problem messages. Instead of letting failures vanish or stop your system, the DLQ catches them. This allows engineers to check, fix, and handle them later without affecting the main process.</p>
<h2>Why use a DLQ?</h2>
<p>Now that we understand what a DLQ is, let's talk about why you should use one and why not having one is a red flag in any event-driven or message-based architecture.</p>
<h3>Prevent message loss</h3>
<p>Without a DLQ, a message that fails to be processed can simply disappear. Depending on your configuration, it might be discarded, leaving no trace of what went wrong. A DLQ ensures that no message is silently dropped. You can count on the fact that every failure is preserved and accounted for.</p>
<h3>Avoid infinite retry loops</h3>
<p>Retries are great, and we should absolutely have them. But retries alone are not enough. If a message is fundamentally broken, for instance because it has an invalid format or references data that no longer exists, retrying it indefinitely wastes resources, drives up costs, and can block other messages from being processed. A DLQ acts as the exit door for those unrecoverable failures.</p>
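<p>Conceptually, this is the redrive behavior queues like SQS give you via a policy (e.g., <code>maxReceiveCount</code>). A pure-code simulation of the flow, just to illustrate the mechanics:</p>
<pre><code class="language-typescript">// Simulates a queue's retry-then-dead-letter flow. Real queues do this
// for you; this sketch only exists to show the mechanics.
function processWithDlq(
  messages: any[],
  handle: (msg: any) =&gt; void,
  maxReceiveCount = 3,
) {
  const processed: any[] = []
  const deadLettered: any[] = []
  for (const msg of messages) {
    let attempts = 0
    while (true) {
      attempts += 1
      try {
        handle(msg)
        processed.push(msg)
        break
      } catch {
        if (attempts &gt;= maxReceiveCount) {
          deadLettered.push(msg) // the exit door for unrecoverable failures
          break
        }
      }
    }
  }
  return { processed, deadLettered }
}

const result = processWithDlq(
  [{ orderId: 'abc' }, { orderId: null }],
  function (msg) {
    if (msg.orderId === null) throw new Error('invalid payload')
  },
)
// result.processed has 1 message, result.deadLettered has 1
</code></pre>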
<h3>Improved observability and debugging</h3>
<p>When a message lands in a DLQ, it presents an opportunity: you can examine the payload, understand what caused the failure, and improve your system. Without a DLQ, that context is lost; with one, you gain a valuable feedback loop for your application's reliability.</p>
<p>A useful practice I've learned over the years is to use DLQ payloads when writing tests. A replayed payload pinpoints where the error occurred and serves as documentation for the fix.</p>
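<p>A sketch of that practice (the handler and payload below are hypothetical, not from a real incident): store the dead-lettered payload verbatim as a fixture and turn it into a permanent regression test for the fix.</p>

```python
# A payload captured from the DLQ, stored verbatim as a test fixture.
DEAD_LETTERED_PAYLOAD = {"order_id": None, "amount": "19.90"}

def handle_order(event: dict) -> str:
    """The fixed handler: rejects a missing order_id instead of crashing."""
    if event.get("order_id") is None:
        return "rejected: missing order_id"
    return f"processed {event['order_id']}"

def test_dead_lettered_payload_no_longer_crashes():
    # The exact message that once failed in production now guards against regressions.
    assert handle_order(DEAD_LETTERED_PAYLOAD) == "rejected: missing order_id"

test_dead_lettered_payload_no_longer_crashes()
```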
<h3>Operational safety net</h3>
<p>Systems fail. That is a fact.</p>
<p>Sooner or later, the network will become unreachable, the third-party service you integrate with will go down, or a bug will slip into your application and a previously valid payload will no longer be accepted.</p>
<p>A DLQ will provide architectural resilience and ensure that transient failures don't cause permanent data loss. Once the underlying issue is resolved, messages can be reprocessed from the DLQ as if nothing had happened.</p>
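<p>A minimal sketch of that reprocessing step (the function names are illustrative; the queue operations are injected so the same loop works against an SQS client in production and plain lists in a test. Amazon SQS also offers a managed DLQ redrive that does this for you):</p>

```python
def redrive(receive, send, delete, limit: int = 100) -> int:
    """Move up to `limit` messages from the DLQ back to the source queue.

    `receive`, `send`, and `delete` are injected callables, e.g. thin
    wrappers over an SQS client in production, or list operations in tests.
    """
    moved = 0
    for _ in range(limit):
        msg = receive()  # e.g. receive one message from the DLQ, or None if empty
        if msg is None:
            break
        send(msg)        # e.g. send it back to the source queue
        delete(msg)      # only delete from the DLQ after the send succeeded
        moved += 1
    return moved

# Usage with in-memory stand-ins for the two queues:
dlq, main = [{"id": 1}, {"id": 2}], []
moved = redrive(lambda: dlq[0] if dlq else None, main.append, dlq.remove)
```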
<h2>In short: build for failure, design for resilience</h2>
<p>Dead Letter Queues are a fundamental safety net for event-driven systems: they prevent silent failures, preserve the context needed for diagnosing issues, and allow teams to address problematic messages without disrupting normal processing. When paired with strong observability and clear operational playbooks, DLQs enhance the reliability and maintainability of event-driven systems.</p>
<p>Quick practical checklist:</p>
<ul>
<li><p>Define sensible retry limits and exponential backoff to ensure only truly problematic messages reach the DLQ.</p>
</li>
<li><p>Capture detailed metadata (timestamps, error reasons, processing context) with each dead-lettered message.</p>
</li>
<li><p>Monitor DLQ size and rate, setting alerts for spikes or stagnation.</p>
</li>
<li><p>Provide tools and processes for safe reprocessing, manual inspection, and automated remediation.</p>
</li>
<li><p>Treat DLQs as integral components in architecture reviews and tests.</p>
</li>
</ul>
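<p>For the metadata point in the checklist above, one possible envelope (the field names are illustrative) wraps the failed payload with everything a future debugging session will need:</p>

```python
import json
import time

def to_dead_letter(payload: dict, error: Exception, context: dict) -> str:
    """Wrap a failed payload with the metadata needed to debug it later."""
    return json.dumps({
        "payload": payload,                  # the original message, untouched
        "error_type": type(error).__name__,  # e.g. "ValueError"
        "error_message": str(error),
        "failed_at": time.time(),            # epoch seconds of the failure
        "context": context,                  # e.g. function name, request id, attempt count
    })
```

<p>The consumer sends this envelope to the DLQ instead of the bare payload, so inspection never requires correlating logs by hand.</p>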
<p>Adopting DLQs turns failures into actionable insights, keeping your system resilient and operable under real-world conditions.</p>
<p><em>Lucas Brogni is a Senior Software Engineer with 10+ years of experience building distributed systems.</em></p>
]]></content:encoded></item><item><title><![CDATA[Why I'm Writing This]]></title><description><![CDATA[I've been building with serverless since 2021.
Not just tinkering — using it as the primary architectural choice for production systems, advocating for it in hiring conversations, writing about it, gi]]></description><link>https://practicalserverless.blog/why-i-m-writing-this</link><guid isPermaLink="true">https://practicalserverless.blog/why-i-m-writing-this</guid><dc:creator><![CDATA[Lucas Brogni]]></dc:creator><pubDate>Fri, 13 Mar 2026 09:48:20 GMT</pubDate><content:encoded><![CDATA[<p>I've been building with serverless since 2021.</p>
<p>Not just tinkering — using it as the primary architectural choice for production systems, advocating for it in hiring conversations, writing about it, giving talks about it, and making it the backbone of my graduate thesis on cloud-native architecture.</p>
<p>And yet, the question I get asked most often isn't about DynamoDB access patterns or cold start optimization. It's this:</p>
<p><strong>"How do I actually know when I've done it right?"</strong></p>
<p>That question has a longer answer than most people expect. That's what this blog is for.</p>
<hr />
<h2>The gap nobody warns you about</h2>
<p>There's a very seductive version of serverless that gets sold in conference talks and documentation pages. Deploy a function. It scales. You pay nothing when it's idle. Zero infrastructure to manage.</p>
<p>All of that is true. None of it prepares you for production.</p>
<p>The real learning curve in serverless isn't writing functions — it's understanding the <em>execution model</em> well enough to make good decisions under pressure. Why does your function behave differently under concurrent load? Why is that DynamoDB error only happening in production? Why did your SQS queue suddenly back up overnight with no error rate spike?</p>
<p>The answers to these questions all trace back to the same place: how Lambda actually works, and how the services around it actually behave. Not in theory. In practice.</p>
<hr />
<h2>What "practical" means here</h2>
<p>I'm not going to write tutorials that walk you through creating an S3 bucket. There are plenty of those. What I want to write — and what I wish had existed when I was figuring this out — is the thinking behind the decisions.</p>
<p>Why you should treat the handler as an entry point and nothing more.<br />Why idempotency isn't optional the moment you introduce asynchronous processing.<br />Why that IAM wildcard that "works fine" is a problem you haven't encountered yet.<br />Why your local environment is an approximation, and which differences will actually matter.</p>
<p>Each post here is going to take a concept that looks simple from the outside — and show you what it actually looks like from the inside of a running production system.</p>
<hr />
<h2>Where this comes from</h2>
<p>My day job is backend engineering on a growth team at a SaaS company. We run a serverless-first stack on AWS: Lambda, DynamoDB, SQS, EventBridge, API Gateway. I've shipped billing systems, built MCP-powered tooling, modernized test infrastructure, and handled zero-downtime schema migrations — all within this architecture.</p>
<p>I've also made most of the mistakes worth making. Misconfigured IAM roles that only failed at runtime. A trigger loop I caught in staging, barely. An SQS processor that quietly stopped processing because I hadn't understood partial batch failures. An observability gap that turned a 20-minute incident into a 3-hour one.</p>
<p>That's not a credentials flex. It's context. The patterns I write about here have been tested in the only environment that really matters.</p>
<hr />
<h2>What's coming</h2>
<p>I'll publish roughly twice a month. No rigid structure — just whatever's most worth writing about. Some posts will be conceptual, building the mental models that underpin everything else. Some will be deeply technical: specific patterns, concrete code, tradeoffs spelled out in full.</p>
<p>A few topics already in the pipeline:</p>
<ul>
<li><p><strong>The execution environment, actually explained</strong> — what init, invoke, and shutdown mean for the code you write every day</p>
</li>
<li><p><strong>Why your tests pass and production still breaks</strong> — the serverless testing gap and how to close it</p>
</li>
<li><p><strong>IAM for people who don't want to read the entire IAM docs</strong> — least privilege, per-function roles, and the wildcards that will eventually hurt you</p>
</li>
<li><p><strong>Idempotency from scratch</strong> — because "process it once" is harder than it sounds when Lambda will retry anything that fails</p>
</li>
</ul>
<p>If there's something specific you've been struggling with, I want to hear it. The goal of this blog is to be useful — not to document what I already know, but to address the questions you're actually asking.</p>
<hr />
<p>One more thing.</p>
<p>Serverless isn't perfect. It's not always the right choice. I'll say so when it isn't. The best thing I can offer here isn't enthusiasm — it's honesty about where the edges are and what happens when you hit them.</p>
<p>Let's get into it.</p>
<hr />
<p><em>Lucas Brogni is a Senior Software Engineer with 10+ years of experience building distributed systems.</em></p>
]]></content:encoded></item></channel></rss>