Imagine this. Your feature works perfectly in development. You ship it. A week later, something’s wrong. Users mention weird behavior. You check the logs.

There are no logs.

You check the metrics. There are no metrics. You check for errors. Nothing. The code is a black box. It’s doing something, but you have no idea what.

This is the difference between code that works and code you can operate. AI is excellent at generating the first kind. It has no idea the second kind matters unless you tell it.

Working code and operable code are different things. AI needs to be told to care about both.

What Makes Code Operable

Operable code answers these questions:

  • Is it working? Can I tell at a glance if this feature is healthy?
  • What’s happening? Can I see what the code is doing right now?
  • What went wrong? When it fails, can I find out why?
  • How do I fix it? Do I have the tools to diagnose and resolve issues?

AI-generated code often works but fails all four questions. It’s optimized for functionality, not operability.

The SRE Audit Prompt

Here’s the prompt that catches operability gaps:

Act as an SRE reviewing this code for operational readiness.

Assume you'll be paged at 3am when this breaks. You're tired,
you didn't write this code, and you need to fix it fast.

Review for:

1. Observability
   - Are important operations logged?
   - Are logs structured and searchable?
   - Do logs include enough context to debug?
   - Are metrics exposed for key operations?
   - Are there traces connecting related operations?

2. Failure Modes
   - What can fail?
   - How does each failure manifest?
   - Are failures silent or visible?
   - Do errors include actionable information?

3. Debuggability
   - Can you trace a request through the system?
   - Can you reproduce issues from logs alone?
   - Are there debug endpoints or tools?

4. Recoverability
   - Can the system recover automatically?
   - What manual intervention is needed?
   - Is there a runbook for common failures?

5. Alerting
   - What should trigger an alert?
   - What thresholds indicate problems?
   - How do you know before users complain?

For each gap found:
- The problem
- Why it matters at 3am
- How to fix it

Code:
[paste code]

A Real SRE Audit

Here’s code AI generated for processing card trades:

async function processTrade(tradeId: string): Promise<void> {
  const trade = await db.trade.findUnique({
    where: { id: tradeId },
    include: { offeredCards: true, requestedCards: true }
  });

  if (!trade || trade.status !== 'accepted') {
    return;
  }

  for (const card of trade.offeredCards) {
    await db.card.update({
      where: { id: card.id },
      data: { ownerId: trade.toUserId }
    });
  }

  for (const card of trade.requestedCards) {
    await db.card.update({
      where: { id: card.id },
      data: { ownerId: trade.fromUserId }
    });
  }

  await db.trade.update({
    where: { id: tradeId },
    data: { status: 'completed', completedAt: new Date() }
  });
}

The SRE audit found:

Observability Gaps:

  • No logging at all. If this fails, you won’t know it ran.
  • No metrics. You can’t track trade processing rate or success rate.
  • Silent early return. If the trade is missing or in the wrong status, nothing is logged.

Failure Mode Issues:

  • Partial failure possible. If it fails after transferring some cards, the trade is left in an inconsistent state.
  • No error handling. Database errors propagate unhandled, with nothing logged and no context attached.
  • No indication of what failed or why.

Debuggability Issues:

  • Can’t trace which trades processed or failed.
  • Can’t tell how long processing took.
  • Can’t identify which card transfer failed in a multi-card trade.

Recoverability Issues:

  • No way to retry failed trades.
  • No idempotency. Two overlapping runs could both see the trade as accepted and transfer the same cards twice.
  • No transaction. Partial state is possible.

Fixed version:

type TradeResult =
  | { success: true; tradeId: string }
  | { success: false; error: string; tradeId?: string };

async function processTrade(tradeId: string): Promise<TradeResult> {
  const startTime = Date.now();

  logger.info('Processing trade', { tradeId });
  metrics.increment('trade.processing.started');

  const trade = await db.trade.findUnique({
    where: { id: tradeId },
    include: { offeredCards: true, requestedCards: true }
  });

  if (!trade) {
    logger.warn('Trade not found', { tradeId });
    metrics.increment('trade.processing.not_found');
    return { success: false, error: 'Trade not found' };
  }

  if (trade.status !== 'accepted') {
    logger.info('Trade not in accepted status', {
      tradeId,
      currentStatus: trade.status
    });
    metrics.increment('trade.processing.wrong_status');
    return { success: false, error: `Trade status is ${trade.status}` };
  }

  try {
    await db.$transaction(async (tx) => {
      // Transfer offered cards
      for (const card of trade.offeredCards) {
        await tx.card.update({
          where: { id: card.id },
          data: { ownerId: trade.toUserId }
        });
        logger.debug('Transferred offered card', {
          tradeId,
          cardId: card.id,
          newOwnerId: trade.toUserId
        });
      }

      // Transfer requested cards
      for (const card of trade.requestedCards) {
        await tx.card.update({
          where: { id: card.id },
          data: { ownerId: trade.fromUserId }
        });
        logger.debug('Transferred requested card', {
          tradeId,
          cardId: card.id,
          newOwnerId: trade.fromUserId
        });
      }

      // Mark trade complete
      await tx.trade.update({
        where: { id: tradeId },
        data: {
          status: 'completed',
          completedAt: new Date()
        }
      });
    });

    const duration = Date.now() - startTime;
    logger.info('Trade completed successfully', {
      tradeId,
      offeredCount: trade.offeredCards.length,
      requestedCount: trade.requestedCards.length,
      durationMs: duration
    });
    metrics.increment('trade.processing.success');
    metrics.histogram('trade.processing.duration', duration);

    return { success: true, tradeId };

  } catch (error) {
    const duration = Date.now() - startTime;
    const err = error instanceof Error ? error : new Error(String(error));
    logger.error('Trade processing failed', {
      tradeId,
      error: err.message,
      stack: err.stack,
      durationMs: duration
    });
    metrics.increment('trade.processing.error');

    return { success: false, error: 'Trade processing failed', tradeId };
  }
}

The fixed version tells you everything: what’s processing, what succeeded, what failed, how long it took, and why.
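
One gap even the fixed version leaves open: the audit called out idempotency, and a transaction alone doesn’t stop two workers from both reading the trade as accepted and processing it twice. A minimal sketch of one way to close it, reusing the Prisma-style client from the examples and assuming the schema allows a 'processing' status:

// Sketch: atomically claim the trade before transferring anything.
// updateMany returns a count, so a second worker sees count === 0 and backs off.
// The 'processing' status is an assumption about the schema.
const claimed = await db.trade.updateMany({
  where: { id: tradeId, status: 'accepted' },
  data: { status: 'processing' }
});

if (claimed.count === 0) {
  logger.info('Trade already claimed or completed, skipping', { tradeId });
  metrics.increment('trade.processing.duplicate');
  return { success: false, error: 'Trade already claimed or completed' };
}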

The Logging Audit Prompt

For focused logging review:

Review the logging in this code.

For each operation, check:
1. Is success logged?
2. Is failure logged?
3. Does the log include enough context to debug?
4. Is the log level appropriate (debug/info/warn/error)?
5. Are sensitive values excluded?

Context should include:
- Request/transaction ID for correlation
- Relevant entity IDs
- User ID if applicable
- Timing information
- State before/after for mutations

Missing logs for:
[list what's not logged that should be]

Logs that need more context:
[list logs that exist but lack context]
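
Here’s what “enough context” looks like in practice, reusing the structured logger from the trade example. The field names and variables are illustrative, not a prescribed schema:

// Illustrative only: a mutation log carrying a correlation ID, entity IDs,
// timing, and before/after state, so one log line answers "what changed, for whom?"
logger.info('Card ownership transferred', {
  requestId,                       // correlation ID shared by every log in this request
  tradeId,
  cardId: card.id,
  previousOwnerId: card.ownerId,   // state before the mutation
  newOwnerId: trade.toUserId,      // state after the mutation
  durationMs: Date.now() - startTime
});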

The Metrics Audit Prompt

For metrics coverage:

What metrics should this code expose?

Categories:
1. Throughput - How many operations per time period?
2. Latency - How long do operations take?
3. Errors - What's the error rate?
4. Saturation - How close to capacity?
5. Business metrics - What matters to the product?

For this code:
[paste code]

Generate:
1. Metrics to add with names and types
2. Where in the code to instrument
3. Alert thresholds for each metric
4. Dashboard queries to visualize
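
Applied to the trade example, the output maps onto the same metrics helper used earlier. Names are illustrative, and the gauge call is an assumption; the earlier code only showed increment and histogram:

// Throughput: count every attempt; the rate over time gives trades per minute.
metrics.increment('trade.processing.started');

// Latency: record duration as a histogram so you can alert on p95/p99, not averages.
metrics.histogram('trade.processing.duration', Date.now() - startTime);

// Errors: count failures separately; error rate = errors / started.
metrics.increment('trade.processing.error');

// Saturation: how much backlog is feeding this worker (queueDepth is illustrative).
metrics.gauge('trade.queue.depth', queueDepth);

// Business metric: what the product actually cares about.
metrics.increment('trade.cards_transferred',
  trade.offeredCards.length + trade.requestedCards.length);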

The Failure Mode Prompt

For comprehensive failure analysis:

Analyze the failure modes of this code.

For each failure:
1. What causes it?
2. How does it manifest? (error, timeout, wrong result, silent)
3. What's the blast radius? (one user, all users, data corruption)
4. How would you detect it?
5. How would you recover?

Consider:
- Network failures
- Database failures
- Invalid data
- Concurrency issues
- Resource exhaustion
- Dependency failures
- Configuration errors

Code:
[paste code]
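
One failure mode this analysis reliably surfaces is a dependency that hangs instead of erroring. A minimal sketch of bounding that and making it visible, reusing the logger and metrics helpers from earlier (the label and timeout value are illustrative):

// Sketch: bound a dependency call and make "slow" visible as its own failure mode.
async function withTimeout<T>(label: string, operation: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  try {
    return await Promise.race([operation, timeout]);
  } catch (error) {
    // Distinguish "dependency was slow" from "dependency returned an error".
    const isTimeout = error instanceof Error && error.message.includes('timed out');
    metrics.increment(isTimeout ? `${label}.timeout` : `${label}.error`);
    logger.error('Dependency call failed', { label, timeoutMs, error: String(error) });
    throw error;
  } finally {
    if (timer) clearTimeout(timer);
  }
}

// Usage: await withTimeout('trade.lookup', db.trade.findUnique({ where: { id: tradeId } }), 2000);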

The Runbook Generation Prompt

For operational documentation:

Generate a runbook for operating this feature.

Feature: [describe it]
Code: [paste code]

Include:
1. Health check - How to verify it's working
2. Common issues - What usually goes wrong
3. Diagnostic steps - How to investigate problems
4. Resolution steps - How to fix common issues
5. Escalation - When to page someone else
6. Recovery procedures - How to restore service

Write for an on-call engineer who's never seen this code.
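
Item 1, the health check, is the part most worth expressing in code rather than prose. A minimal sketch, assuming an Express-style app and the same Prisma-style db client as the trade example:

// Sketch: a health endpoint an on-call engineer can curl at 3am.
app.get('/healthz', async (_req, res) => {
  try {
    // Verify the database is reachable; add checks for other critical dependencies.
    await db.$queryRaw`SELECT 1`;
    res.status(200).json({ status: 'ok' });
  } catch (error) {
    logger.error('Health check failed', { error: String(error) });
    res.status(503).json({ status: 'unhealthy' });
  }
});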

The 3am Test

For every piece of code, ask:

I'm on call. It's 3am. This code is broken. Users are complaining.

With the current logging, metrics, and tooling:
1. How would I know something is wrong?
2. How would I find the relevant logs?
3. How would I identify the root cause?
4. How would I fix it or mitigate?
5. How would I verify the fix worked?

What's missing that would make this easier?

If the answer to any question is “I don’t know,” the code isn’t ready for production.

The Operability Checklist

Before shipping:

□ Key operations logged with context
□ Errors logged with stack traces and context
□ Logs are structured (JSON) and searchable
□ Metrics exposed for throughput, latency, errors
□ Alerts configured for critical thresholds
□ Failure modes identified and handled
□ Runbook documented for common issues
□ Health check endpoint exists
□ Can trace requests through the system (see the tracing sketch below)
□ Recovery procedures documented
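
The tracing item is the one most often skipped. One common approach is propagating a request ID with Node’s AsyncLocalStorage so every log line in a request shares a correlation ID; the middleware shape below assumes Express:

import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Sketch: carry a request ID through the whole async call chain.
const requestContext = new AsyncLocalStorage<{ requestId: string }>();

// Express-style middleware (assumption): establish the context once per request.
app.use((req, res, next) => {
  const requestId = req.header('x-request-id') ?? randomUUID();
  res.setHeader('x-request-id', requestId);
  requestContext.run({ requestId }, next);
});

// Anywhere deeper in the call stack, logs can pick up the same ID.
function logWithRequestId(message: string, fields: Record<string, unknown> = {}) {
  const requestId = requestContext.getStore()?.requestId;
  logger.info(message, { requestId, ...fields });
}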

Building Operability Into Prompts

Don’t audit after. Build it in from the start:

Build this feature with full operational readiness.

Feature: [description]

Include:
- Structured logging for all operations
- Metrics for throughput, latency, and errors
- Error handling that logs full context
- Health check capability
- Graceful degradation where possible

Imagine you'll be paged when this breaks. Build what you'd want to have.
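
If “graceful degradation” feels abstract, show the model one concrete shape of it. A sketch under stated assumptions: the db client is the Prisma-style one from earlier, `cache` is a hypothetical key-value client, and Card is the generated model type:

// Sketch: degrade to recently cached data instead of failing the whole request.
async function getCardCatalog(): Promise<Card[]> {
  try {
    const cards = await db.card.findMany();
    await cache.set('card-catalog', cards, { ttlSeconds: 300 }); // cache API is assumed
    return cards;
  } catch (error) {
    logger.warn('Card catalog lookup failed, serving cached copy', { error: String(error) });
    metrics.increment('card_catalog.degraded');
    const cached = await cache.get<Card[]>('card-catalog');
    if (cached) return cached;
    throw error; // No fallback available; surface the original failure.
  }
}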

Tomorrow

You’ve audited for operability. Now who writes the tests? Tomorrow I’ll show you how to use AI as a test generator, creating comprehensive test suites that catch bugs before they reach the code you just made operable.


Try This Today

  1. Take a piece of AI-generated code that’s in production
  2. Run the SRE audit prompt
  3. Ask: “If this broke at 3am, what would I wish I had?”

The gap between what you have and what you’d wish for is your operability debt. Start paying it down before you get that 3am page.