AI Engineering

RAG Support Copilots on Medusa: The Retrieval Shape We Ship Before Claude Sees an Order

Support copilots do not hallucinate refunds because the model is weak. They hallucinate because retrieval is shaped wrong. Here is the three-call retrieval pattern we wire on Medusa + Payload before Claude generates a token.

13 Jun 2026 · 9 min read · By Krešimir Galić · Founder & Principal Engineer

*Most copilot failures are retrieval failures wearing a model's clothes — wrong order, stale policy, invented SKU, then a confident sentence on top.*

Every support copilot we have been asked to rescue this year failed in the same place. Not the model. Not the prompt. Retrieval. The copilot was handed the wrong order, a policy doc from 2022, or a SKU that never existed, and then a perfectly fluent sentence was generated on top of that bad context. To the customer it reads like a confident refund promise. To the finance team it reads like a liability.

If you are the CTO at a Medusa-backed D2C brand and your Head of CX is pushing for an AI assistant by end of quarter, the question is not which model to pick. Claude, GPT-4.1, Gemini 2.5 — they will all hallucinate the same refund if you feed them the same wrong order. The work that matters happens before the first token. It happens in how you shape the three retrieval calls that build the context window, what you embed versus what you query relationally, and where you put the hard gate that refuses to answer.

This is the retrieval architecture we now ship by default on Medusa + Payload support copilots. It is boring in the right places and opinionated in the places that have burned us. We will show the Payload policy collection, the tool definitions, the pre-flight context builder, and the confidence gate. And we will name the things we still refuse to automate.

Why support copilots fail at retrieval, not generation

In production we see three failure modes, in roughly this order of frequency:

Wrong-customer context. The copilot pulled the most recent order matching an email fragment, not the order this customer is asking about. A returning customer with six orders gets told their non-existent order #1142 shipped yesterday.
Stale policy. The return policy was embedded eight months ago. Legal updated the EU window from 14 to 30 days. The vector store still serves the old chunk because nobody wired the re-embed on publish.
Hallucinated structure. The model was asked to summarise the order and invented a line item, a tracking number, or a refund amount because the JSON it was given was incomplete and it filled the gap.

None of these are model bugs. They are architecture bugs. And all three disappear when you stop treating the copilot as a single RAG pipeline and start treating it as three separate retrieval calls with different storage, different freshness guarantees, and different failure behaviour.

The three-call retrieval shape

Before Claude generates a token, three retrieval calls run in parallel inside a single context-builder function. Each has a different source of truth and a different failure mode.

Structured order lookup — Postgres relational query against Medusa's `order`, `line_item`, `fulfillment`, and `return` tables, joined by verified customer ID. No vector search. No fuzzy match. If we cannot resolve the customer to a single ID, we stop.
Policy RAG — pgvector similarity search against embedded chunks from the Payload policies collection, filtered by locale and policy type. Returns top-k with a cosine threshold; below the threshold, the policy slot in the context window stays empty.
Conversation memory — the last N turns of this session, stored in a short-lived Postgres table keyed by session ID. Not embedded. Not summarised by an LLM. Just the raw turns, capped at a token budget.
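
The token cap on conversation memory can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the production code: the `Turn` type, the `capTurnsToBudget` name, and the four-characters-per-token estimate are all hypothetical, and a real build would count tokens with the model's own tokenizer.

```typescript
// Hypothetical shape of a stored turn; the article only says raw turns are kept.
type Turn = { role: 'user' | 'assistant'; content: string }

// Rough estimate (assumption): about four characters per token.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// Keep the most recent turns that fit in the budget, returned oldest-first.
function capTurnsToBudget(turnsOldestFirst: Turn[], maxTokens: number): Turn[] {
  const kept: Turn[] = []
  let used = 0
  // Walk newest to oldest so the latest turns win when the budget is tight.
  for (const turn of [...turnsOldestFirst].reverse()) {
    const cost = estimateTokens(turn.content)
    if (used + cost > maxTokens) break
    kept.unshift(turn)
    used += cost
  }
  return kept
}
```

Dropping whole turns from the oldest end keeps the memory raw, in line with the "not summarised by an LLM" rule above.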

The context window Claude finally sees is assembled from these three sources with explicit section headers. The model is told, in the system prompt, that anything outside these sections does not exist. That instruction does not make hallucination impossible — but combined with tool-only data access and a confidence gate, it gets the rate low enough to ship.
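
As a rough sketch of what "explicit section headers" can look like, the function below renders the three sources into one labelled string. The section names, the `(empty)` placeholder, and the `renderContext` and `ContextInput` names are assumptions made for illustration; the article does not specify the exact format.

```typescript
// Hypothetical, simplified shapes for the three retrieval results.
type ContextInput = {
  orders: { displayId: string; status: string }[]
  policyHits: { payloadBlockId: string; content: string }[]
  memory: { role: 'user' | 'assistant'; content: string }[]
}

// Render one section; an empty section is stated explicitly, never omitted.
function section(title: string, lines: string[]): string {
  const body = lines.length > 0 ? lines.join('\n') : '(empty)'
  return `## ${title}\n${body}`
}

// Assemble the context the model sees, one header per retrieval source.
function renderContext(input: ContextInput): string {
  return [
    section('ORDERS', input.orders.map((o) => `#${o.displayId}: ${o.status}`)),
    section('POLICY', input.policyHits.map((p) => `[${p.payloadBlockId}] ${p.content}`)),
    section('CONVERSATION', input.memory.map((t) => `${t.role}: ${t.content}`)),
  ].join('\n\n')
}
```

Printing `(empty)` rather than dropping the header gives the system prompt something concrete to refer to when it tells the model that an empty section means "do not answer from memory".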

What lives in pgvector vs what stays in Postgres relations

This is the decision that flips the failure rate. We use pgvector inside the same Postgres instance that Medusa uses — no separate vector DB, no Pinecone, no Weaviate. At the scale most D2C brands operate (a few hundred policy chunks, a few thousand macros, maybe tens of thousands of FAQ entries), a dedicated vector store is overhead we cannot justify.

What gets embedded:

Policy documents — returns, shipping, warranty, sizing, materials, care instructions. Versioned, locale-tagged, authored in Payload.
Macros and canned responses — the CX team's existing reply library, tagged by intent.
Product care and usage content — long-form guidance that lives outside the PDP.

What never gets embedded:

Orders. Queried by ID. Joined relationally. Returned as structured JSON to the tool call.
Customers. Resolved by verified email or session token, never by similarity.
Inventory and SKU rows. Queried by exact match.
Tracking numbers, refund amounts, fulfilment statuses. Anything where a wrong digit is a wrong answer.

SQL

-- The policy chunks table; lives in the same Postgres as Medusa
create table policy_chunks (
  id uuid primary key default gen_random_uuid(),
  policy_id uuid not null references policies(id) on delete cascade,
  policy_version int not null,
  locale text not null,
  policy_type text not null, -- 'returns' | 'shipping' | 'warranty' | 'care'
  chunk_index int not null,
  content text not null,
  embedding vector(1536) not null,
  payload_block_id text not null, -- the Payload block this chunk came from
  created_at timestamptz not null default now(),
  -- Assumed to be the composite key the re-embed upsert targets
  unique (policy_id, policy_version, chunk_index)
);

create index policy_chunks_embedding_idx
  on policy_chunks using hnsw (embedding vector_cosine_ops);

create index policy_chunks_filter_idx
  on policy_chunks (locale, policy_type, policy_version);

-- Retrieval query: filter HARD on locale and type, keep only the latest
-- version of each policy, then rank by similarity.
-- The version check is correlated on policy_id: a single max() across all
-- matching rows would drop every policy whose latest version number is lower
-- than another policy's.
select c.id, c.content, c.payload_block_id,
       1 - (c.embedding <=> $1) as similarity
from policy_chunks c
where c.locale = $2
  and c.policy_type = any($3)
  and c.policy_version = (
    select max(latest.policy_version)
    from policy_chunks latest
    where latest.policy_id = c.policy_id
  )
order by c.embedding <=> $1
limit 5;

The `payload_block_id` column is the citation hook — it lets the copilot link back to the exact Payload block it used. We will come back to that.

The Payload side: policies as versioned blocks

Policies live in Payload as a collection of versioned documents, each composed of typed blocks (returns clause, shipping clause, warranty clause, exceptions). The block-level structure matters because we embed per-block, not per-document. A policy update re-embeds only the changed blocks, and the copilot can cite the specific clause rather than the whole policy page.

TypeScript

// payload/collections/Policies.ts
import type { CollectionConfig } from 'payload'
import { enqueuePolicyReembed } from '../jobs/reembed-policy'

export const Policies: CollectionConfig = {
  slug: 'policies',
  versions: { drafts: true, maxPerDoc: 20 },
  access: { read: () => true },
  fields: [
    { name: 'title', type: 'text', required: true },
    {
      name: 'policyType',
      type: 'select',
      required: true,
      options: ['returns', 'shipping', 'warranty', 'care'],
    },
    {
      name: 'locale',
      type: 'select',
      required: true,
      options: ['en', 'de', 'fr', 'nl', 'hr'],
    },
    {
      name: 'effectiveFrom',
      type: 'date',
      required: true,
    },
    {
      name: 'clauses',
      type: 'blocks',
      required: true,
      blocks: [
        {
          slug: 'clause',
          fields: [
            { name: 'heading', type: 'text', required: true },
            { name: 'body', type: 'richText', required: true },
            {
              name: 'appliesTo',
              type: 'select',
              hasMany: true,
              options: ['standard', 'sale', 'subscription', 'gift'],
            },
          ],
        },
      ],
    },
  ],
  hooks: {
    afterChange: [
      async ({ doc, previousDoc, operation, req }) => {
        // Drafts are enabled, so this hook also fires on draft saves.
        // Only published policies should reach the vector store.
        if (doc._status !== 'published') return
        if (operation === 'create' || operation === 'update') {
          // Diff clauses; only re-embed blocks whose body changed.
          await enqueuePolicyReembed({
            policyId: doc.id,
            previousClauses: previousDoc?.clauses ?? [],
            nextClauses: doc.clauses,
            locale: doc.locale,
            policyType: doc.policyType,
          })
        }
      },
    ],
  },
}

The `afterChange` hook is the freshness guarantee. Without it, you ship a copilot whose return-policy chunks drift out of date the first time legal updates a clause and nobody knows the vector store needed touching. We have walked into this exact failure twice. Both times the editorial team thought they had updated the policy. Both times the copilot kept quoting the old window for weeks. The fix is the hook above — embedding is a publish-time concern, not a cron-job concern. See Payload's hooks reference for the full lifecycle.
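
The clause diff that the re-embed job performs is not shown in the article. The following is one plausible sketch of it, assuming each Payload block carries a stable `id` and that comparing serialised bodies is good enough to detect a change. The `Clause` type and the `diffClauses` name are hypothetical.

```typescript
// Hypothetical minimal clause shape: a stable block id plus the rich-text body.
type Clause = { id: string; body: unknown }

type ClauseDiff = {
  toEmbed: Clause[] // new clauses, or clauses whose body changed
  toDelete: string[] // ids of clauses removed from the policy
}

function diffClauses(previous: Clause[], next: Clause[]): ClauseDiff {
  // Serialise each previous body once so comparison is a string equality check.
  // Note: JSON.stringify is sensitive to key order, so this can over-report changes.
  const previousBodies = new Map<string, string>()
  for (const clause of previous) {
    previousBodies.set(clause.id, JSON.stringify(clause.body))
  }
  const nextIds = new Set(next.map((clause) => clause.id))

  return {
    toEmbed: next.filter(
      (clause) => previousBodies.get(clause.id) !== JSON.stringify(clause.body),
    ),
    toDelete: previous.filter((clause) => !nextIds.has(clause.id)).map((clause) => clause.id),
  }
}
```

Over-reporting a change only costs an extra embedding call, while under-reporting serves a stale clause, so a conservative comparison is the safer default here.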

Tool definitions for Claude: narrow, typed, no raw SQL

Claude does not get a database connection. Claude gets three tools, each with a tight JSON schema, each backed by a server-side handler that does the actual query. The tools exist so the model can ask for structured data when it needs it — but the surface area is deliberately small. No `runQuery`. No `searchOrders`. No `getCustomer`. Just the verbs we want the copilot to be able to perform.

TypeScript

// support-copilot/tools.ts
import Anthropic from '@anthropic-ai/sdk'

export const tools: Anthropic.Tool[] = [
  {
    name: 'getOrderByReference',
    description:
      'Look up a single order belonging to the authenticated customer. Returns order, line items, fulfilments, and any open returns. Use only when the customer references an order.',
    input_schema: {
      type: 'object',
      properties: {
        orderReference: {
          type: 'string',
          description: 'The order display ID, e.g. #10421',
        },
      },
      required: ['orderReference'],
    },
  },
  {
    name: 'getPolicyForContext',
    description:
      'Retrieve the active policy clauses for a given topic and the customer locale. Returns clauses with citation IDs.',
    input_schema: {
      type: 'object',
      properties: {
        policyType: {
          type: 'string',
          enum: ['returns', 'shipping', 'warranty', 'care'],
        },
      },
      required: ['policyType'],
    },
  },
  {
    name: 'getReturnEligibility',
    description:
      'Check whether a specific line item on a specific order is currently eligible for return. Returns a boolean, the reason, and the deadline.',
    input_schema: {
      type: 'object',
      properties: {
        orderReference: { type: 'string' },
        lineItemId: { type: 'string' },
      },
      required: ['orderReference', 'lineItemId'],
    },
  },
]

Two things to notice. First, `getOrderByReference` is scoped to the authenticated customer server-side — the model cannot pass a customer ID to fetch someone else's order, because the customer ID is injected from the session, not from the tool call. This is the leak prevention. Second, `getReturnEligibility` exists as a separate tool with deterministic logic — we do not ask the model to compute whether a 32-day-old purchase is within a 30-day window. The model gets a boolean and a deadline; the math runs in TypeScript. See the Anthropic tool use docs for the full handshake.
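
To make the "math runs in TypeScript" point concrete, here is a minimal sketch of the deterministic check behind `getReturnEligibility`. It assumes the window is counted in whole 24-hour days from a single start timestamp (delivery or purchase, whichever the policy specifies). The `EligibilityInput` shape and the `checkReturnEligibility` name are hypothetical, not the handler from the build described here.

```typescript
// Hypothetical inputs: the handler would load these from the order and the policy.
type EligibilityInput = {
  windowStartsAt: Date // delivery or purchase date, per the policy (assumption)
  returnWindowDays: number
  now: Date
}

type Eligibility = { eligible: boolean; reason: string; deadline: Date }

const MS_PER_DAY = 24 * 60 * 60 * 1000

function checkReturnEligibility(input: EligibilityInput): Eligibility {
  // The deadline is computed once, in code, and handed to the model as data.
  const deadline = new Date(input.windowStartsAt.getTime() + input.returnWindowDays * MS_PER_DAY)
  const eligible = input.now.getTime() <= deadline.getTime()
  return {
    eligible,
    reason: eligible ? 'Within the return window.' : 'The return window has closed.',
    deadline,
  }
}
```

With a 30-day window, a request made 32 days after the start comes back with `eligible: false` and the exact deadline, so the model only has to phrase the result.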

The pre-flight context builder

Before the first model call, a context builder runs the three retrievals in parallel. The goal is a sub-200ms cold path so that time to first token stays under a second. On a typical Medusa Postgres in the same region as the Next.js runtime, we see 30–80ms for the order lookup and 40–120ms for the pgvector query; the conversation-memory read is essentially free.

TypeScript

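// support-copilot/context.ts
import { embed } from './embeddings'
import { sql } from './db'

type Ctx = {
  customerId: string
  locale: string
  sessionId: string
  latestUserMessage: string
}

// Similarity floor for policy hits; gate.ts applies the same value.
export const POLICY_SIMILARITY_FLOOR = 0.72

export async function buildPreflightContext(ctx: Ctx) {
  // The three retrievals are independent, so they run concurrently.
  const [recentOrders, policyHits, memory] = await Promise.all([
    sql`
      select id, display_id, status, total, currency_code, created_at
      from "order"
      where customer_id = ${ctx.customerId}
      order by created_at desc
      limit 3
    `,
    (async () => {
      const queryEmbedding = await embed(ctx.latestUserMessage)
      return sql`
        select c.id, c.content, c.payload_block_id,
               1 - (c.embedding <=> ${queryEmbedding}::vector) as similarity
        from policy_chunks c
        where c.locale = ${ctx.locale}
          -- Serve only the latest version of each policy, never a stale chunk
          and c.policy_version = (
            select max(latest.policy_version)
            from policy_chunks latest
            where latest.policy_id = c.policy_id
          )
        order by c.embedding <=> ${queryEmbedding}::vector
        limit 4
      `
    })(),
    sql`
      select role, content from copilot_turns
      where session_id = ${ctx.sessionId}
      order by created_at desc
      limit 8
    `,
  ])

  // Rows arrive ordered by distance, so the first row is the best match.
  const topSimilarity = policyHits[0]?.similarity ?? 0

  return {
    recentOrders,
    policyHits: topSimilarity >= POLICY_SIMILARITY_FLOOR ? policyHits : [],
    memory: memory.reverse(), // back to chronological order
    retrievalConfidence: topSimilarity,
  }
}

// Exported because gate.ts imports this type.
export type PreflightContext = Awaited<ReturnType<typeof buildPreflightContext>>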

The `0.72` threshold is not magic — it is the floor we settled on after measuring false positives on a 600-question evaluation set across two pilots. Below that, the policy section of the context window is empty. The model is told, in the system prompt, that an empty policy section means it must escalate rather than guess. That single rule cuts the largest class of confident-wrong answers.
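
Picking a floor like this can be sketched as a sweep over a labelled evaluation set. The code below is an illustration only: the `EvalRow` shape, the function names, and the sample rows in the usage note are hypothetical, and the metric is the share of accepted retrievals whose top hit was labelled irrelevant, which is one reasonable reading of "false positives" here.

```typescript
// Hypothetical evaluation row: the best similarity score for a question,
// plus a human label saying whether that top chunk actually answered it.
type EvalRow = { topSimilarity: number; topHitIsRelevant: boolean }

// Share of accepted retrievals (score >= threshold) whose top hit was wrong.
function wrongAcceptRate(rows: EvalRow[], threshold: number): number {
  const accepted = rows.filter((row) => row.topSimilarity >= threshold)
  if (accepted.length === 0) return 0
  const wrong = accepted.filter((row) => !row.topHitIsRelevant).length
  return wrong / accepted.length
}

// Lowest candidate threshold that keeps the wrong-accept rate within the target.
function lowestThresholdWithin(
  rows: EvalRow[],
  candidates: number[],
  maxWrongAcceptRate: number,
): number | undefined {
  return [...candidates]
    .sort((a, b) => a - b)
    .find((threshold) => wrongAcceptRate(rows, threshold) <= maxWrongAcceptRate)
}
```

Choosing the lowest threshold that meets the target keeps as many questions answerable as possible; raising it further only increases escalations.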

The grounding gate

After the context is built but before the model is called, a gate runs. It is about thirty lines of code and it does more for hallucination rate than any prompt engineering we have tried.

TypeScript

// support-copilot/gate.ts
import type { PreflightContext } from './context'

type GateResult =
  | { proceed: true }
  | { proceed: false; reason: string; handoff: 'human' | 'clarify' }

export function groundingGate(
  ctx: PreflightContext,
  intent: 'order_status' | 'return' | 'policy' | 'other',
): GateResult {
  if (intent === 'order_status' && ctx.recentOrders.length === 0) {
    return {
      proceed: false,
      reason: 'No orders on file for this customer.',
      handoff: 'human',
    }
  }
  if (intent === 'policy' && ctx.retrievalConfidence < 0.72) {
    return {
      proceed: false,
      reason: 'Policy retrieval below confidence threshold.',
      handoff: 'human',
    }
  }
  if (intent === 'return' && ctx.recentOrders.length === 0) {
    return {
      proceed: false,
      reason: 'Cannot evaluate return without an order on file.',
      handoff: 'clarify',
    }
  }
  return { proceed: true }
}

When the gate refuses, the copilot does not call the model at all. It returns a deterministic message that hands off to a human or asks one clarifying question. This is the line we will not cross: the model is never allowed to answer a policy question without a grounded policy chunk in its context.
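
The refusal path can be sketched as a lookup that never touches the model. The `GateResult` type mirrors the gate's return type shown earlier. The message copy, the `HANDOFF_COPY` table, and the `deterministicReply` name are assumptions for illustration.

```typescript
type GateResult =
  | { proceed: true }
  | { proceed: false; reason: string; handoff: 'human' | 'clarify' }

// Hypothetical canned copy; a real build would localise and brand these.
const HANDOFF_COPY: Record<'human' | 'clarify', string> = {
  human: 'I am passing this to a member of our support team, who will reply shortly.',
  clarify: 'Could you share the order number you are asking about?',
}

// Returns the canned reply when the gate refuses, or null when the model may be called.
// The internal `reason` is for logs and is deliberately not shown to the customer.
function deterministicReply(gate: GateResult): string | null {
  if (gate.proceed) return null
  return HANDOFF_COPY[gate.handoff]
}
```

A caller would check this first and only build the model request when it returns `null`, which keeps the refusal path free of any model call.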

Citations: linking back to the Payload block

Every policy chunk we hand the model carries its `payload_block_id`. The system prompt instructs the model to attach a citation marker to any sentence that draws on a policy clause. We post-process the response, resolve the markers to Payload block URLs, and render them as inline links in the chat UI.
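
The post-processing step might look like the sketch below. The `[[cite:ID]]` marker syntax, the Markdown link output, and the `resolveCitations` name are assumptions; the article does not say which marker format the system prompt asks for. The one deliberate design choice is that a marker is resolved only if its block ID was actually in the context, so an invented citation is dropped rather than linked.

```typescript
// Replace citation markers with links, but only for block ids that were
// actually supplied to the model in this turn.
function resolveCitations(
  text: string,
  allowedBlockIds: Set<string>,
  urlFor: (blockId: string) => string,
): string {
  return text.replace(/\[\[cite:([A-Za-z0-9_-]+)\]\]/g, (_marker, blockId: string) =>
    allowedBlockIds.has(blockId) ? `[source](${urlFor(blockId)})` : '',
  )
}

// Usage with a hypothetical URL scheme for Payload blocks.
const rendered = resolveCitations(
  'You have 30 days to return the item.[[cite:blk_returns_eu]]',
  new Set(['blk_returns_eu']),
  (blockId) => `https://cms.example.com/policies#${blockId}`,
)
```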

The operator value of this is not just trust — it is auditability. When CX leadership asks why the copilot told a customer they had 30 days to return, we can show the exact clause it used and the version of the policy that was active at that moment. We have not yet had to defend a response in a regulatory context, but the pattern is built to make that defence trivial.

What we measured on a 12-week pilot

On the most recent pilot — a European D2C brand on Medusa with roughly 40 support tickets per day — we ran a baseline copilot with naive RAG (everything embedded, single retrieval call) against the architecture above. We will not name the client, but the shape of the numbers is consistent with two earlier builds:

Hallucination rate on policy questions dropped from roughly 11–14% on the naive baseline to under 2% on the structured-retrieval build (evaluated against a 200-question labelled set reviewed by the CX lead).
Wrong-order context — the copilot referencing the wrong order in its answer — went from 6–8% to effectively zero, because order lookup is no longer fuzzy.
First-token latency stayed under 900ms on the p95, because the three retrieval calls run in parallel and pgvector at this scale is fast.
Escalation rate rose from 18% to 27%. This is on purpose — the gate refuses more often, and that is the trade we want. Confident-wrong is worse than escalated.

What we still refuse to automate

Even with this architecture, there are actions we do not let the copilot take end-to-end. The model can draft, the human approves:

Refunds above a per-brand threshold (typically €50–€150). The copilot can initiate; a human confirms.
Address changes after fulfilment has started. The race condition with the 3PL is not worth the AI win.
Anything touching subscription billing cadence. Too many edge cases, too much downstream blast radius.
Regulated categories — supplements, cosmetics with active ingredients, anything where misstating an ingredient or a contraindication is a legal exposure, not a CX miss.

If you are scoping a support copilot or a broader AI rollout on a headless commerce stack, see how we ship Medusa storefronts and copilots in production, including the retrieval architecture, the Payload policy modelling, and the human-in-the-loop boundaries we ship by default.

Building this on Medusa + Payload right now, or rescuing a copilot that is hallucinating refunds? Tell us what you are wiring up — send the architecture diagram or the system prompt and we will tell you which of the three retrieval calls is the one biting you.

On every Medusa + Payload support copilot we ship now, this is the default shape: pgvector for policy and macros, relational queries for orders and customers, three tightly scoped tools, a confidence gate before the model is called, and citations from the model back to the Payload block. The model choice is the last decision, not the first — and it is rarely the one that determines whether the copilot is safe to put in front of customers.

// After the call

Questions operators ask next

Does this architecture work with OpenAI or Gemini instead of Claude?
Yes — the retrieval shape is model-agnostic. The tool-call surface differs slightly (OpenAI's function-calling and Gemini's function declarations have their own schemas), but the three-call pattern, the pgvector store, the Payload policy collection, and the grounding gate are identical. We have shipped variants on Claude 3.5 Sonnet, GPT-4.1, and Gemini 2.5 Pro. Pick on cost, latency, and your structured-output reliability needs, not architecture.
Can we run pgvector on the same Postgres instance as Medusa, or do we need a separate vector DB?
Same instance, in our experience, up to a few hundred thousand policy chunks. With an HNSW index, p95 similarity queries stay under 100ms at that scale on a modest managed Postgres. We have not yet needed to split out a dedicated vector DB for support-copilot workloads. If you are also embedding product catalog or doing multi-million-row semantic search, that calculus changes.
How do you handle webhook retries or duplicate re-embed jobs when a policy is updated rapidly?
The re-embed job is keyed by (policy_id, policy_version, clause_index) and is idempotent — running it twice writes the same row. We also debounce in the afterChange hook so rapid saves within a 30-second window collapse into one enqueue. The pgvector write itself uses an upsert on the composite key, so concurrent jobs cannot create duplicates.
What does the cost look like at, say, 1,000 tickets per day on Claude?
At roughly 3,500 input tokens and 400 output tokens per conversation turn (the structured retrieval keeps the context lean), Claude 3.5 Sonnet runs around $0.015–$0.025 per resolved ticket. At 1,000 tickets per day with a 70% containment rate, that is $300–$500/month in model spend. Embeddings and pgvector storage are noise compared to that. The real cost is the engineering build, not the inference.
How do you prevent the copilot from leaking one customer's order to another?
The customer ID is bound to the session on the server side and injected into every tool handler — the model never receives it and cannot pass an arbitrary customer ID. Order lookups are always filtered by that bound ID at the SQL layer. If a customer is not authenticated, `getOrderByReference` returns nothing and the gate refuses order-status intents. There is no path where the model can request another customer's data, because the tool surface does not expose one.
Can this pattern extend to other channels — WhatsApp, email autoresponders, Shopify chat?
Yes. The retrieval layer and the gate are channel-agnostic — they sit behind a service boundary. We have wired the same context builder into a Next.js chat widget, an email triage worker (Resend inbound webhooks), and a WhatsApp Business handler. The only channel-specific work is intent classification on inbound and rendering citations on outbound. The grounding architecture does not change.
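
The cost answer above can be checked with a few lines of arithmetic. This sketch assumes list prices of $3 per million input tokens and $15 per million output tokens for Claude 3.5 Sonnet, a 30-day month, and that only contained tickets are counted, as the answer does. The function names are hypothetical and prices should be checked against the current rate card.

```typescript
// Assumed list prices for Claude 3.5 Sonnet, in USD per million tokens.
const INPUT_USD_PER_MTOK = 3
const OUTPUT_USD_PER_MTOK = 15

// Model cost of a single conversation turn.
function costPerTurn(inputTokens: number, outputTokens: number): number {
  return (inputTokens * INPUT_USD_PER_MTOK + outputTokens * OUTPUT_USD_PER_MTOK) / 1_000_000
}

// Monthly spend for the tickets the copilot resolves itself.
function monthlySpend(
  ticketsPerDay: number,
  containmentRate: number,
  usdPerTicket: number,
  daysPerMonth = 30,
): number {
  return ticketsPerDay * daysPerMonth * containmentRate * usdPerTicket
}

const perTurn = costPerTurn(3_500, 400) // about 0.0165 USD per turn
const lowEstimate = monthlySpend(1_000, 0.7, 0.015) // about 315 USD per month
const highEstimate = monthlySpend(1_000, 0.7, 0.025) // about 525 USD per month
```

Under these assumptions a single turn costs a little under two cents, and the monthly range works out to roughly $315–$525, in line with the rounded figure quoted in the answer.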

Pull quote

Embed the policy. Query the order. If the copilot is doing vector search to find out who the customer is, the architecture is already wrong.