Production Hardening
AI Feature Observability on Payload and Medusa: The Four Signals We Wire on Day One
Every Claude feature we ship on Payload or Medusa gets four signals wired before it sees production traffic. Skip any of them and the regression will be silent — and expensive.

*Generic APM tells you the request returned 200. It will not tell you the model started hallucinating refunds on Tuesday.*
Every AI feature we ship on Payload or Medusa hits the same problem in month two: the team cannot tell whether it is still working. Sentry reports the endpoint at 99.98% success. Vercel Analytics shows response times inside the SLO. The Anthropic dashboard shows spend, but not per-feature. And yet the product description job started returning empty strings on Tuesday, the support copilot has been quoting the wrong return policy since a prompt tweak on the 14th, and the editors have quietly overridden 71% of the AI-suggested internal links this week. Nobody knows any of this.
The failure mode of an AI feature is not an HTTP 500. It is a 200 that returns confident garbage. Generic APM does not catch it because from the transport layer everything is fine. This post is the exact instrumentation shape we wire on day one of any Claude feature — four signals, one Postgres table, one wrapper around the SDK, and a rollout order we have argued with clients about enough times to have opinions on. If you have one or two AI features live and no real visibility into what they cost, when they regress, or how often your team silently undoes them, this is the pattern we would install before we touched anything else.
Why generic APM misses every AI regression that matters
Sentry, Datadog, and Vercel Analytics are excellent at what they measure: exceptions, spans, cold starts, RUM. None of them know that `response.content[0].text` is supposed to be valid JSON matching a Zod schema. None of them know that the same prompt cost you $0.004 last week and $0.011 this week because you switched from Claude 3.5 Haiku to Sonnet for a "quick test" that never got reverted. And crucially, none of them know that your editors have been quietly clicking "regenerate" or overwriting the field entirely — the strongest signal you have that the feature is dead.
The four signals we wire are picked precisely because they cover the failure modes APM cannot see: cost regression, latency regression, semantic regression, and trust regression. We store them in one Postgres table, we query them from a boring SQL dashboard, and we alert on a handful of per-feature thresholds. That is the entire system.
The Postgres table that anchors everything
Before the four signals make sense, here is the table. Every AI call — whether it originated in a Payload `afterChange` hook, a Medusa subscriber, or a Next.js Server Action — writes exactly one row here. We do not use a separate observability vendor for this; the data is cheap, the query patterns are simple, and the team already knows Postgres.
create table ai_calls (
  id               uuid primary key default gen_random_uuid(),
  created_at       timestamptz not null default now(),
  feature          text not null,            -- 'product_description', 'support_copilot', 'alt_text'
  model            text not null,            -- 'claude-3-5-sonnet-20241022'
  input_tokens     int not null,
  output_tokens    int not null,
  cost_usd         numeric(10, 6) not null,
  latency_ms       int not null,
  status           text not null,            -- 'ok' | 'timeout' | 'invalid_output' | 'error'
  validation_error text,                     -- populated when status = 'invalid_output'
  entity_type      text,                     -- 'product', 'article', 'ticket'
  entity_id        text,                     -- FK-ish, not enforced
  prompt_version   text not null,            -- 'v3', bumped on every prompt edit
  request_id       text                      -- Anthropic request id for support tickets
);

create index ai_calls_feature_created on ai_calls (feature, created_at desc);
create index ai_calls_status_created on ai_calls (status, created_at desc)
  where status <> 'ok';
create index ai_calls_entity on ai_calls (entity_type, entity_id);

Two things worth calling out. First, `prompt_version` is a string, not a number, and it lives next to the prompt in the codebase; every edit bumps it. This is the single most useful column in the table when you are trying to correlate "the output validity dropped" with "someone changed the prompt." Second, the partial index on non-ok status means our alerting queries are effectively free even when the table has millions of rows.
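As an illustration of that correlation, here is a minimal TypeScript sketch that groups rows by feature and `prompt_version` and reports the share of `invalid_output` calls in each group. The `AiCallRow` type and the `invalidRateByPromptVersion` helper are hypothetical names for this example, and it assumes the rows have already been read from `ai_calls`; the same grouping could equally be done in SQL.

```typescript
// Only the ai_calls columns this sketch needs (hypothetical row type).
type AiCallRow = {
  feature: string;
  prompt_version: string;
  status: 'ok' | 'timeout' | 'invalid_output' | 'error';
};

type VersionStats = { calls: number; invalid: number; invalidRate: number };

// Groups rows by "feature@prompt_version" and tracks how many calls in
// each group ended as invalid_output.
function invalidRateByPromptVersion(rows: AiCallRow[]): Map<string, VersionStats> {
  const stats = new Map<string, VersionStats>();
  for (const row of rows) {
    const key = `${row.feature}@${row.prompt_version}`;
    const s = stats.get(key) ?? { calls: 0, invalid: 0, invalidRate: 0 };
    s.calls += 1;
    if (row.status === 'invalid_output') s.invalid += 1;
    s.invalidRate = s.invalid / s.calls;
    stats.set(key, s);
  }
  return stats;
}
```

A jump in `invalidRate` between two adjacent versions of the same feature points at the prompt edit rather than at the model or the schema.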
Signal 1 — cost per call, bucketed by feature and model
The Anthropic console shows you total spend. It does not show you that 82% of it went to a product-description backfill you ran once and forgot about. Every row in `ai_calls` carries `input_tokens`, `output_tokens`, and a computed `cost_usd` — computed at write time using the model's published pricing, not looked up later, because pricing tables change and you want the cost that was true when the call happened.
// Pricing is frozen at write time: do not look this up in a shared module
// that gets edited when Anthropic changes prices.
// Values are USD per token, derived from per-million-token list prices.
const PRICING: Record<string, { input: number; output: number }> = {
  'claude-3-5-sonnet-20241022': { input: 3.00 / 1_000_000, output: 15.00 / 1_000_000 },
  'claude-3-5-haiku-20241022': { input: 0.80 / 1_000_000, output: 4.00 / 1_000_000 },
};

export function computeCostUsd(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const p = PRICING[model];
  if (!p) throw new Error(`unknown model pricing: ${model}`);
  return inputTokens * p.input + outputTokens * p.output;
}

The query we actually look at every Monday is this: cost per feature per model, week over week. It is the report that catches "someone switched to Sonnet three weeks ago and never switched back."
select
  feature,
  model,
  date_trunc('week', created_at) as week,
  count(*) as calls,
  sum(cost_usd)::numeric(10, 2) as cost_usd,
  avg(input_tokens)::int as avg_in,
  avg(output_tokens)::int as avg_out
from ai_calls
where created_at > now() - interval '8 weeks'
group by feature, model, week
order by week desc, cost_usd desc;

Signal 2 — latency and timeout rate
p95 latency is the signal that predicts a bad launch week before your users complain. Claude latency drifts. The Anthropic API is fast and reliable, but "fast" is a distribution and the tail matters — a job that averages 2.1s with a p95 of 4.8s will start timing out the moment you add a retrieval step or bump the max_tokens. We track p50, p95, and timeout rate per feature. See the Anthropic API reference for the streaming and timeout semantics we build against.
Our default timeout on non-streaming calls is 30 seconds. For streaming calls from a Next.js Server Action into a Payload admin UI, we set the outer timeout at 90 seconds but track "time to first token" separately — that is the number that predicts user abandonment.
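To make the three per-feature numbers concrete, the sketch below computes p50, p95, and timeout rate from rows already read out of `ai_calls`. It is an illustration under stated assumptions rather than the dashboard query itself: the `LatencyRow` type and the helper names are hypothetical, percentiles use the nearest-rank method, and timed-out calls stay in the latency distribution.

```typescript
// Only the ai_calls columns this sketch needs (hypothetical row type).
type LatencyRow = { latency_ms: number; status: string };

type LatencySummary = { p50: number; p95: number; timeoutRate: number };

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return 0;
  // Multiply before dividing so integer percentiles stay exact.
  const rank = Math.ceil((p * sortedAsc.length) / 100);
  return sortedAsc[Math.max(0, rank - 1)];
}

function summariseLatency(rows: LatencyRow[]): LatencySummary {
  const latencies = rows.map((r) => r.latency_ms).sort((a, b) => a - b);
  const timeouts = rows.filter((r) => r.status === 'timeout').length;
  return {
    p50: percentile(latencies, 50),
    p95: percentile(latencies, 95),
    timeoutRate: rows.length === 0 ? 0 : timeouts / rows.length,
  };
}
```

Running this per feature over a rolling window gives the trend line; time to first token would need its own column, which the table above does not include.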
Signal 3 — output validity
This is the signal every team skips and every team regrets. "The model returned something" is not a success metric. The question is whether what it returned is what your downstream code expects. We use Zod at every model boundary. If a call returns text that fails validation, `status` is written as `invalid_output` and the validation error message is captured in `validation_error`. Nothing downstream ever sees the bad output: the wrapper either returns a validated object or throws.
import Anthropic from '@anthropic-ai/sdk';
import type { ZodSchema } from 'zod';
import { db } from '@/db';
import { computeCostUsd } from './pricing';

const anthropic = new Anthropic();

type CallOpts<T> = {
  feature: string;
  model: string;
  promptVersion: string;
  schema: ZodSchema<T>;
  system: string;
  user: string;
  maxTokens?: number;
  entityType?: string;
  entityId?: string;
  timeoutMs?: number;
};

export async function callClaude<T>(opts: CallOpts<T>): Promise<T> {
  const started = Date.now();
  let status: 'ok' | 'timeout' | 'invalid_output' | 'error' = 'error';
  let validationError: string | null = null;
  let inputTokens = 0;
  let outputTokens = 0;
  let requestId: string | null = null;

  try {
    const res = await anthropic.messages.create(
      {
        model: opts.model,
        max_tokens: opts.maxTokens ?? 1024,
        system: opts.system,
        messages: [{ role: 'user', content: opts.user }],
      },
      { timeout: opts.timeoutMs ?? 30_000 },
    );
    // Prefer the request-id header the SDK exposes on responses; fall back
    // to the message id on SDK versions that do not populate it.
    requestId = res._request_id ?? res.id;
    inputTokens = res.usage.input_tokens;
    outputTokens = res.usage.output_tokens;

    const text = res.content
      .filter((b): b is Anthropic.TextBlock => b.type === 'text')
      .map((b) => b.text)
      .join('');

    // JSON.parse throws on malformed text; record that as invalid_output
    // rather than letting it fall through as a generic error.
    let json: unknown;
    try {
      json = JSON.parse(text);
    } catch {
      status = 'invalid_output';
      validationError = 'response was not valid JSON';
      throw new Error('output was not valid JSON');
    }

    const parsed = opts.schema.safeParse(json);
    if (!parsed.success) {
      status = 'invalid_output';
      validationError = parsed.error.message.slice(0, 500);
      throw new Error('output failed schema validation');
    }
    status = 'ok';
    return parsed.data;
  } catch (err) {
    if (status === 'error' && err instanceof Anthropic.APIConnectionTimeoutError) {
      status = 'timeout';
    }
    throw err;
  } finally {
    // Instrumentation must never take down the feature: a failed insert
    // (or a model missing from the pricing table) is logged and swallowed.
    try {
      await db.insert('ai_calls', {
        feature: opts.feature,
        model: opts.model,
        input_tokens: inputTokens,
        output_tokens: outputTokens,
        cost_usd: computeCostUsd(opts.model, inputTokens, outputTokens),
        latency_ms: Date.now() - started,
        status,
        validation_error: validationError,
        entity_type: opts.entityType,
        entity_id: opts.entityId,
        prompt_version: opts.promptVersion,
        request_id: requestId,
      });
    } catch (logErr) {
      console.error('ai_calls insert failed', logErr);
    }
  }
}

The `finally` block is doing the important work: the row is written regardless of outcome, so `invalid_output`, `timeout`, and `error` are all first-class citizens in the dataset. If the DB insert itself fails, we log locally and continue, because instrumentation must never take down the feature.
Signal 4 — human-override rate
This is the signal nobody thinks to instrument, and it is the one that tells you the truth. If your Payload editors are rewriting the AI-generated product description more than half the time, the feature does not work — no matter what your validity rate says. If your support agents are silently editing the copilot's draft response before sending, same story.
We wire this through Payload `afterChange` hooks. Any field populated by AI carries a sibling `<field>_ai` field (for example `description_ai`) holding the generated value and the `ai_calls.id` that produced it. On every subsequent update, the hook compares the current value against the AI-generated one and records the delta. See the Payload hooks reference for the hook lifecycle we hang this off of.
import type { CollectionAfterChangeHook } from 'payload';
// Assumption: any Levenshtein implementation works here; this import uses
// the `distance` export of the fastest-levenshtein package.
import { distance as levenshtein } from 'fastest-levenshtein';
import { db } from '@/db';

export const trackAiOverride: CollectionAfterChangeHook = async ({
  doc,
  previousDoc,
  operation,
}) => {
  if (operation !== 'update') return doc;

  const fields = ['description', 'metaDescription'] as const;
  for (const field of fields) {
    const aiMeta = doc[`${field}_ai`];
    if (!aiMeta?.callId) continue;

    // Treat a cleared field as an empty string so the length and
    // distance calculations below cannot throw.
    const current: string = doc[field] ?? '';

    // Record only the first edit away from the AI-generated value.
    const wasAiValue = previousDoc[field] === aiMeta.value;
    const isEdited = current !== aiMeta.value;
    if (wasAiValue && isEdited) {
      await db.insert('ai_overrides', {
        call_id: aiMeta.callId,
        field,
        original_length: aiMeta.value.length,
        edited_length: current.length,
        char_distance: levenshtein(aiMeta.value, current),
      });
    }
  }
  return doc;
};

The report we care about: override rate per feature per week. Under 20% is healthy. 20–50% is a prompt problem. Above 50% is a feature that should be turned off until you understand why. We have shipped this exact hook on four projects; on two of them it caught features that were technically "working" but that editors had given up on.
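Those three bands can be written down as a small classifier. This is a sketch under assumptions: `classifyOverrideRate` and the band names are hypothetical, the denominator is taken to be the number of AI-populated field writes for the feature in the week, and the boundary values (exactly 20% and exactly 50%) are placed in the middle band.

```typescript
type OverrideBand = 'healthy' | 'prompt_problem' | 'turn_off';

// aiWrites: AI-populated field writes in the window (assumed denominator).
// overrides: rows recorded in ai_overrides for the same window.
function classifyOverrideRate(
  aiWrites: number,
  overrides: number,
): { rate: number; band: OverrideBand } {
  const rate = aiWrites === 0 ? 0 : overrides / aiWrites;
  if (rate < 0.2) return { rate, band: 'healthy' };
  if (rate <= 0.5) return { rate, band: 'prompt_problem' };
  return { rate, band: 'turn_off' };
}
```

With 100 AI writes and 68 overrides in a week, this returns the `turn_off` band, which is the cue to stop and investigate rather than to keep tuning.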
A war story from a Medusa copilot we shipped
On one Medusa support copilot we wired, everything was green: p95 latency 1.8s, validity 99.4%, cost tracking to budget. Then the override rate report showed agents were rewriting 68% of drafts. The root cause was not the model; it was a policy doc we had indexed that had been superseded three months earlier. The copilot was accurate to the wrong document. The four-signal dashboard did not tell us the answer, but it told us the question to ask on Monday morning instead of the following quarter.
Where to call the wrapper from — Payload hooks vs Medusa subscribers
The `callClaude` wrapper is stack-agnostic. Where you call it from matters. On Payload, we call it from a `beforeChange` hook when the AI output must exist before the document is saved (alt text, meta descriptions), and from an `afterChange` hook when the AI work can happen asynchronously (translation, internal linking). On Medusa, we call it from subscribers on `order.placed` or `product.created`, never inline in a checkout path — the p99 latency of any AI call is too unpredictable to sit in a purchase flow.
Alerting thresholds — the four we set and the two we deliberately do not
Alert: invalid_output rate > 2% over 1 hour, per feature. This is the semantic canary — it almost always means a prompt version regression or an upstream schema change.
Alert: timeout rate > 1% over 1 hour, per feature. Usually correlates with an Anthropic incident or a max_tokens bump nobody flagged.
Alert: daily cost exceeds 150% of the 7-day rolling average, per feature. Catches runaway backfills and accidental Sonnet-instead-of-Haiku switches.
Alert: override rate > 60% over a 7-day window, per feature. Weekly digest, not a page — this is a product signal, not an incident.
We do NOT alert on: individual call latency. It fluctuates, it wakes people up at 3am, and by the time you look, the p95 has already normalised. Look at the trend line during business hours.
We do NOT alert on: total monthly spend. It is a lagging indicator. If per-feature daily spend is under control, monthly takes care of itself — and if you alert on monthly you learn about the problem three weeks late.
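The thresholds above can be sketched as one pure function over per-feature window counts. The `FeatureWindow` shape, its field names, and `evaluateAlerts` are hypothetical; only the threshold values come from the list above. It assumes the counts were produced by queries against `ai_calls` and `ai_overrides`, and it deliberately has no rule for individual call latency or monthly spend.

```typescript
// Per-feature inputs, assumed to be precomputed by SQL (hypothetical shape).
type FeatureWindow = {
  feature: string;
  callsLastHour: number;
  invalidLastHour: number;
  timeoutsLastHour: number;
  costTodayUsd: number;
  costPrev7DaysUsd: number[]; // one entry per day
  overrideRate7d: number; // fraction between 0 and 1
};

type Alert = {
  feature: string;
  kind: 'invalid_output' | 'timeout' | 'cost' | 'override';
  delivery: 'alert' | 'weekly_digest';
};

function evaluateAlerts(w: FeatureWindow): Alert[] {
  const alerts: Alert[] = [];
  if (w.callsLastHour > 0) {
    // invalid_output rate above 2% over the last hour
    if (w.invalidLastHour / w.callsLastHour > 0.02) {
      alerts.push({ feature: w.feature, kind: 'invalid_output', delivery: 'alert' });
    }
    // timeout rate above 1% over the last hour
    if (w.timeoutsLastHour / w.callsLastHour > 0.01) {
      alerts.push({ feature: w.feature, kind: 'timeout', delivery: 'alert' });
    }
  }
  // daily cost above 150% of the 7-day rolling average
  const days = w.costPrev7DaysUsd.length;
  const avg = days === 0 ? 0 : w.costPrev7DaysUsd.reduce((a, b) => a + b, 0) / days;
  if (avg > 0 && w.costTodayUsd > 1.5 * avg) {
    alerts.push({ feature: w.feature, kind: 'cost', delivery: 'alert' });
  }
  // override rate above 60% over 7 days goes to the weekly digest
  if (w.overrideRate7d > 0.6) {
    alerts.push({ feature: w.feature, kind: 'override', delivery: 'weekly_digest' });
  }
  return alerts;
}
```

Keeping the rules in one function makes the thresholds reviewable in a pull request, next to the prompt versions they are meant to protect.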
What we would NOT build on day one
The temptation, on any observability rollout, is to build the dashboard first. Do not. A bespoke Grafana or Metabase board on day one is a distraction; the handful of SQL queries behind these signals, saved in a shared doc, are enough for the first three months. Also skip: per-user cost attribution (you do not have the volume to justify the JOIN complexity), and a Langfuse-style trace tree for a three-prompt product (real value at 20+ chained calls, overkill below). If you cross those thresholds later, the `ai_calls` table already has everything the more sophisticated tool needs to import.
The rollout order we recommend
Signal 3 first — output validity. Zod at every boundary, `invalid_output` written as its own status. This alone changes how the team thinks about AI features. One day of work.
Signal 1 — cost per call. The `ai_calls` table, the wrapper, the weekly SQL query. Half a day if signal 3 is already wired.
Signal 2 — latency and timeout rate. Falls out of the wrapper for free once cost tracking is in place. An hour to add the alert query.
Signal 4 — human-override rate. Hardest to wire, most valuable to look at. One to two days depending on how many AI-populated fields you have across Payload collections. Ship this last, but ship it — this is the signal that separates "the feature runs" from "the feature works."
What good looks like at month three
Three months in, on a healthy rollout, the Monday report takes ten minutes to read and looks like this: cost per feature within 10% of forecast, invalid_output under 0.5% across all features, p95 latency stable, override rate trending down as prompts get tuned against real editor behaviour. When something regresses, you know within 24 hours which feature, which prompt version, and whether it is a cost, latency, semantic, or trust regression, because the four signals disambiguate exactly those four failure modes.
Most of the copilots and enrichment jobs we instrument this way live on Medusa. See how we wire AI features into Medusa storefronts — the same four-signal shape ships on every project.
If you have a Claude or GPT feature live on Payload or Medusa and no real visibility into what it costs, when it regresses, or how often your team silently undoes it, send us the AI feature you cannot see into. We will tell you which of the four signals to wire first.
On every Payload or Medusa project with AI features that we ship or inherit, this instrumentation is now the first thing we install — before we tune a prompt, before we swap a model, before we argue about RAG vs fine-tuning. The four signals do not make the feature better. They tell you whether it is working. Everything else follows from that.
Questions operators ask next
Does this pattern work if we are using OpenAI or Gemini instead of Claude?
Yes — the wrapper shape is identical, only the SDK call and the pricing table change. The `ai_calls` schema is model-agnostic; we run it in production against Claude, GPT-4o, and Gemini in the same table with the `model` column doing the disambiguation. Token counting semantics differ slightly per vendor, so verify against each provider's usage response shape.
How does the wrapper behave when Postgres is unavailable — does it take down the AI feature?
No. The `finally` block wraps the insert in its own try/catch and logs to stderr on failure. Instrumentation must never fail the primary operation. In practice we have never seen the insert fail in production, but the guarantee matters — you want engineers to trust the wrapper enough to put it in the hot path.
Is the ai_calls table going to blow up in size at scale?
At 100k calls/day the table adds about 20MB/day uncompressed — trivial for Postgres. We partition by month once a project crosses 500k calls/day, and we archive rows older than 90 days to cold storage. The partial index on non-ok status keeps alerting queries fast even at tens of millions of rows.
How does this compare to Langfuse or Helicone for LLM observability?
Langfuse and Helicone are excellent when you have chained agents, multi-step tool calls, or need trace trees across 20+ prompts. For the three-to-five-prompt features most teams actually run in production, the Postgres table is simpler, cheaper, and owned by your team. When you cross the complexity threshold, `ai_calls` exports cleanly into either tool.
How do we track override rate for AI outputs that never touch Payload — like a support copilot draft?
Same idea, different surface. For a copilot draft, log the AI output and the final agent-sent message against the same `call_id`, then compute Levenshtein or semantic distance in a batch job. We use a nightly cron that populates `ai_overrides` for anything that did not go through a Payload hook. The signal is what matters, not where the diff is computed.
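A minimal sketch of that batch step, assuming the nightly job has already joined each AI draft to the message the agent actually sent. The `DraftPair` type, the `buildOverrideRows` helper, and the `'copilot_draft'` field label are hypothetical; the output columns mirror the `ai_overrides` insert used in the Payload hook. The Levenshtein function is a plain dynamic-programming version that compares UTF-16 code units.

```typescript
// Levenshtein distance, keeping only two rows of the DP table.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost, // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

type DraftPair = { callId: string; draft: string; sent: string };

type OverrideRow = {
  call_id: string;
  field: string;
  original_length: number;
  edited_length: number;
  char_distance: number;
};

// One row per draft that the agent changed before sending; untouched
// drafts produce no row.
function buildOverrideRows(pairs: DraftPair[]): OverrideRow[] {
  return pairs
    .filter((p) => p.draft !== p.sent)
    .map((p) => ({
      call_id: p.callId,
      field: 'copilot_draft',
      original_length: p.draft.length,
      edited_length: p.sent.length,
      char_distance: levenshtein(p.draft, p.sent),
    }));
}
```

Character distance is a blunt measure: a two-character edit that changes "30 days" to "14 days" is small in distance and large in meaning, which is why semantic distance is worth considering for policy-heavy copilots.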
What is the day-one cost of wiring this into an existing Payload + Medusa project?
Two to four engineering days for signals 1–3, another one to two days for signal 4 depending on how many AI-populated fields exist. No new vendor cost — Postgres and the existing Anthropic account cover it. The payback is typically within the first month, either through a cost regression caught early or a broken feature identified before a stakeholder notices.
The failure mode of an AI feature is not a 500. It is a 200 that returns confident garbage, an editor quietly rewriting every output, and a token bill that triples the month you shipped a new prompt.