AI Automation

Catalog Enrichment on Medusa with Claude: The Import-Time Pipeline We Ship for 2,000+ SKU Brands

Bolting AI onto live product pages burns trust by week three. Here is the import-time enrichment pipeline we wire into Medusa with Claude, structured outputs, and a Payload approval queue — including token cost per 1,000 SKUs and the regressions it prevents.

30 May 2026 · 8 min read · By Krešimir Galić · Founder & Principal Engineer

*The cheapest place to put AI in your catalog is the one most brands skip — before the row ever reaches the storefront.*

Every D2C brand we talk to with 2,000+ SKUs has the same drawer of unfinished copy: half-written descriptions, missing attributes, SEO meta that says "Buy [Product Name] online" on 400 PDPs. The Head of E-commerce wants Claude to write all of it by Friday. The CTO has seen what happens when you do that — a month later, half the catalog reads like it was written by the same slightly drunk copywriter, attributes are hallucinated, and the brand voice has drifted into beige.

We have shipped this enrichment shape on Medusa catalogs five times now. The pattern that works is not "AI on the PDP". It is AI at import time, with structured outputs, a Payload approval queue, and a rollback path that does not require re-importing the CSV. The result for a 2,000-SKU brand: roughly €3–€6 in token cost for a full catalog pass (about €1.50–€3 per 1,000 SKUs with the two-pass strategy covered in the token economics section), 3–5 weeks to ship, and a content team that reviews instead of writes.

This is the shape, the schema, the guardrails, and the regressions it prevents. CFO can read the cost section; CTO can read the rest.

The PDP regression problem: why bolting AI on live products burns trust

The lazy version of this project takes about a week to build and about three weeks to regret. You wire Claude or GPT-4 behind a route that generates a description on PDP render, cache it for a day, ship it. It looks magical for the first 50 products.

Then the regressions land. The same product gets two different descriptions across two cache windows. An attribute is invented that the warehouse cannot fulfil ("machine washable" on a wool blazer). A category page lists six products and Claude has decided three of them are "the perfect choice for everyday wear" — verbatim. A returns spike follows a week later because a description over-promised a feature.

Where the AI step belongs: at import, not at render

Moving the AI step into the product import pipeline changes the cost and risk profile completely. The generation runs once per SKU per version. Each output is a draft, not a published field. A human reviews before it ships. Token cost is amortised across the product's lifetime instead of per-pageview. And, the part everyone forgets, you get a queryable history of every AI-written field, which means rollback is a single bulk action in the approval queue, not a CSV re-import.

On Medusa, the import pipeline is the right hook point. We subscribe to `product.created` and `product.updated` events from Medusa's event bus, push the SKU onto a Redis queue, and a worker calls Claude with the product's existing structured fields (title, category, raw attributes from the supplier feed, any photos' alt text).

```typescript
// src/subscribers/enrich-on-import.ts
import type { SubscriberConfig, SubscriberArgs } from "@medusajs/medusa"
import { enrichmentQueue } from "../lib/queues"

// 24-hour cooldown between enrichment runs for the same product.
const COOLDOWN_MS = 86_400_000

export default async function enrichOnImport({
  data,
  container,
}: SubscriberArgs<{ id: string }>) {
  const productService = container.resolve("productService")
  const product = await productService.retrieve(data.id, {
    relations: ["variants", "categories", "images", "tags"],
  })

  // Skip products already in the approval queue or recently enriched.
  // Number(...) means enriched_at is read as epoch milliseconds.
  if (product.metadata?.enrichment_status === "pending") return
  if (product.metadata?.enriched_at &&
      Date.now() - Number(product.metadata.enriched_at) < COOLDOWN_MS) return

  // Queue the SKU; failed jobs retry up to 3 times with exponential backoff.
  await enrichmentQueue.add("enrich-sku", {
    productId: product.id,
    supplierAttributes: product.metadata?.raw_attributes ?? {},
    locale: "en-GB",
  }, { attempts: 3, backoff: { type: "exponential", delay: 5000 } })
}

export const config: SubscriberConfig = {
  event: ["product.created", "product.updated"],
  context: { subscriberId: "enrich-on-import" },
}
```
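The skip rules in the subscriber are easier to test when pulled into a pure function. The sketch below is one way to do that. `shouldEnqueue` and `parseEnrichedAt` are hypothetical names, and the ISO-string fallback is an assumption for the case where `enriched_at` is not stored as epoch milliseconds.

```typescript
// Hypothetical helper: metadata keys mirror the subscriber, the rest is an assumption.
type EnrichmentMetadata = {
  enrichment_status?: string
  enriched_at?: string | number
}

const COOLDOWN_WINDOW_MS = 24 * 60 * 60 * 1000

// Accepts epoch milliseconds (number or numeric string) or an ISO-8601 date string.
function parseEnrichedAt(value: string | number | undefined): number | null {
  if (value === undefined || value === "") return null
  const asNumber = Number(value)
  if (Number.isFinite(asNumber)) return asNumber
  const asDate = Date.parse(String(value))
  return Number.isNaN(asDate) ? null : asDate
}

// Returns true when the product should be pushed onto the enrichment queue.
function shouldEnqueue(
  metadata: EnrichmentMetadata | null | undefined,
  now: number = Date.now(),
): boolean {
  if (metadata?.enrichment_status === "pending") return false
  const enrichedAt = parseEnrichedAt(metadata?.enriched_at)
  if (enrichedAt !== null && now - enrichedAt < COOLDOWN_WINDOW_MS) return false
  return true
}
```

Passing `now` as a parameter keeps the cooldown check deterministic in unit tests.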

Structured outputs over prose: the JSON schema we hand Claude

If you ask Claude for "a product description" you get prose, and prose is unparseable. We use Claude's tool use to force structured output — the model fills a schema, not a textarea. This single decision removes 80% of the regressions.

The schema we ship for catalog enrichment has four field groups: description (long form + short form), bullets (3–5 selling points), attributes (a whitelist — see below), and seo (meta title, meta description, focus keyword). Every field has a length constraint. Every attribute is constrained to a closed enum drawn from the brand's taxonomy.

```typescript
// src/lib/enrichment-schema.ts
import Anthropic from "@anthropic-ai/sdk"

export const enrichmentTool: Anthropic.Tool = {
  name: "submit_product_copy",
  description: "Submit enriched copy and attributes for a single SKU.",
  input_schema: {
    type: "object",
    required: ["description_long", "description_short", "bullets", "attributes", "seo"],
    properties: {
      description_long: { type: "string", minLength: 220, maxLength: 700 },
      description_short: { type: "string", minLength: 80, maxLength: 180 },
      bullets: {
        type: "array", minItems: 3, maxItems: 5,
        items: { type: "string", minLength: 20, maxLength: 110 },
      },
      attributes: {
        type: "object",
        properties: {
          material: { type: "string", enum: ["cotton", "linen", "wool", "silk", "synthetic-blend"] },
          care: { type: "string", enum: ["machine-wash", "hand-wash", "dry-clean", "wipe-clean"] },
          season: { type: "string", enum: ["ss", "aw", "all-season"] },
          confidence: { type: "number", minimum: 0, maximum: 1 },
        },
        required: ["confidence"],
      },
      seo: {
        type: "object",
        required: ["meta_title", "meta_description", "focus_keyword"],
        properties: {
          meta_title: { type: "string", maxLength: 60 },
          meta_description: { type: "string", maxLength: 155 },
          focus_keyword: { type: "string", maxLength: 80 },
        },
      },
    },
  },
}
```
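To make the model answer through the tool rather than in prose, the request can pin `tool_choice` to that tool. The sketch below is a minimal illustration, not the worker we ship: `buildEnrichmentRequest` and `extractToolInput` are hypothetical helpers, the local types are structural stand-ins for the SDK types so the block runs without installing `@anthropic-ai/sdk`, and the `max_tokens` value is an assumption.

```typescript
// Structural stand-ins for SDK types (assumption: real code imports them from the SDK).
type ToolDefinition = { name: string; description?: string; input_schema: object }
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; id: string; name: string; input: unknown }

// Hypothetical helper: builds a Messages API request body that forces the tool call.
function buildEnrichmentRequest(opts: {
  model: string            // model ID supplied by the caller
  brandVoicePrompt: string // the versioned system prompt
  tool: ToolDefinition
  productContext: Record<string, unknown>
}) {
  return {
    model: opts.model,
    max_tokens: 1024, // assumption: headroom above the ~600-token average output
    system: opts.brandVoicePrompt,
    tools: [opts.tool],
    // Pin the tool so the model must respond with a tool_use block.
    tool_choice: { type: "tool" as const, name: opts.tool.name },
    messages: [
      { role: "user" as const, content: JSON.stringify(opts.productContext) },
    ],
  }
}

// Hypothetical helper: pulls the structured payload out of the response content.
function extractToolInput(content: ContentBlock[], toolName: string): unknown {
  const block = content.find(
    (b): b is Extract<ContentBlock, { type: "tool_use" }> =>
      b.type === "tool_use" && b.name === toolName,
  )
  if (!block) throw new Error(`Model did not call ${toolName}`)
  return block.input
}
```

In a worker this would be used roughly as `client.messages.create(buildEnrichmentRequest(...))` followed by `extractToolInput(response.content, "submit_product_copy")`. It is still worth re-checking length limits and enums on the returned object before the row lands in the queue.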

The `confidence` field on attributes is doing more work than it looks. If Claude returns `confidence: 0.4` on `material: wool`, the approval queue surfaces that row first and flags it as "AI uncertain" in the diff view. Editors learn within a week which signals to trust.
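As a rough illustration of that ordering, the sketch below sorts queue rows so the least confident come first and attaches the flag. `QueueRow`, `triage`, and the 0.7 cut-off are assumptions: 0.7 is borrowed from the Sonnet re-run threshold in the token economics section, and the article does not state the exact value used for the "AI uncertain" flag.

```typescript
type QueueRow = { medusaProductId: string; confidence: number }

// Assumption: reuse the 0.7 re-run threshold as the "AI uncertain" cut-off.
const UNCERTAIN_BELOW = 0.7

// Returns a new array, lowest confidence first, with an aiUncertain flag per row.
function triage(rows: QueueRow[]) {
  return [...rows]
    .sort((a, b) => a.confidence - b.confidence)
    .map((row) => ({ ...row, aiUncertain: row.confidence < UNCERTAIN_BELOW }))
}
```

Copying the array before sorting leaves the caller's list untouched, which matters if the same rows feed other views.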

The Payload approval queue: drafts in, signed-off fields out

Claude's output never writes directly to Medusa. It writes to a `product-enrichments` collection in Payload, where each row is a draft tied to a Medusa product ID. The editor opens Payload, sees a side-by-side diff (current Medusa field vs proposed AI field), and either approves, edits, or rejects. Approval triggers a Payload `afterChange` hook that calls Medusa's admin API to update the product.

```typescript
// payload/collections/ProductEnrichments.ts
import type { CollectionConfig } from "payload"
import { pushToMedusa } from "../lib/medusa-sync"

export const ProductEnrichments: CollectionConfig = {
  slug: "product-enrichments",
  admin: { useAsTitle: "medusaProductId", defaultColumns: ["medusaProductId", "status", "updatedAt"] },
  fields: [
    { name: "medusaProductId", type: "text", required: true, index: true },
    { name: "status", type: "select", required: true, defaultValue: "pending",
      options: ["pending", "approved", "rejected", "published", "rolled-back"] },
    { name: "sourceSnapshot", type: "json", admin: { readOnly: true } },
    { name: "proposed", type: "json" },
    { name: "approvedBy", type: "relationship", relationTo: "users", admin: { readOnly: true } },
    { name: "tokenCost", type: "number", admin: { readOnly: true, description: "USD" } },
    { name: "modelVersion", type: "text", admin: { readOnly: true } },
  ],
  hooks: {
    beforeChange: [({ data, req, originalDoc }) => {
      if (originalDoc?.status !== "approved" && data.status === "approved") {
        data.approvedBy = req.user?.id
      }
      return data
    }],
    afterChange: [async ({ doc, previousDoc, req }) => {
      if (previousDoc?.status !== "approved" && doc.status === "approved") {
        await pushToMedusa(doc, { reqId: req.headers.get("x-request-id") })
      }
    }],
  },
}
```

Two things to notice. First, `sourceSnapshot` stores the Medusa fields as they were when Claude was called — this is what makes rollback trivial. Second, `tokenCost` and `modelVersion` are persisted per row, so when the CFO asks "what did this batch cost us" the answer is a single SQL query, not a guess against the Anthropic console.
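The cost question is a plain group-and-sum over those two fields. The sketch below shows the same aggregation in memory, assuming the rows have already been fetched from Payload. `CostRow` and `costByModel` are hypothetical names, and the model version strings in the usage example are placeholders.

```typescript
type CostRow = { tokenCost: number; modelVersion: string }

// Sums tokenCost per modelVersion; the SQL equivalent is a GROUP BY on modelVersion.
function costByModel(rows: CostRow[]): Map<string, number> {
  const totals = new Map<string, number>()
  for (const row of rows) {
    totals.set(row.modelVersion, (totals.get(row.modelVersion) ?? 0) + row.tokenCost)
  }
  return totals
}

// Usage with placeholder rows.
const batchTotals = costByModel([
  { tokenCost: 0.25, modelVersion: "model-a" },
  { tokenCost: 0.5, modelVersion: "model-b" },
  { tokenCost: 0.125, modelVersion: "model-a" },
])
console.log(Object.fromEntries(batchTotals))
```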

Guardrails we ship by default

- Attribute whitelist: Claude cannot invent a new `material` value. The schema enum is the source of truth, drawn from the brand's PIM taxonomy. New values require a human to extend the enum.
- Brand-voice prompt: a 400-word system prompt drawn from the brand's existing top-10 best-written PDPs, with "do not write" and "do write" examples. We version this in Git, not in a Notion doc that drifts.
- Banned-claims list: a hard list of phrases the model is instructed to never produce ("clinically proven", "hypoallergenic", "vegan" unless the supplier attribute confirms it). We also post-filter with a regex check before the row lands in the queue.
- Diff view: editors see the proposed copy next to the current copy with word-level highlighting. The single most-used feature after week one.
- Per-batch sampling: every 50th SKU gets sent to a stricter model (Sonnet) even on Haiku batches, so we catch quality drift before it spreads.
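The regex post-filter from the banned-claims guardrail can be sketched as below. This is an illustration under assumptions: the rule shape, the `allowedIf` key, and the function names are hypothetical, and it reads the "unless the supplier attribute confirms it" exception as applying to "vegan" only.

```typescript
// Hypothetical rule shape: allowedIf names a supplier attribute that,
// when truthy, permits the phrase.
type BannedClaim = { phrase: string; allowedIf?: string }

const BANNED_CLAIMS: BannedClaim[] = [
  { phrase: "clinically proven" },
  { phrase: "hypoallergenic" },
  { phrase: "vegan", allowedIf: "vegan" },
]

// Escapes regex metacharacters so phrases are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
}

// Returns the banned phrases found in the text (case-insensitive, whole words).
function findBannedClaims(
  text: string,
  supplierAttributes: Record<string, unknown> = {},
  rules: BannedClaim[] = BANNED_CLAIMS,
): string[] {
  return rules
    .filter((rule) => !(rule.allowedIf && supplierAttributes[rule.allowedIf]))
    .filter((rule) => new RegExp(`\\b${escapeRegExp(rule.phrase)}\\b`, "i").test(text))
    .map((rule) => rule.phrase)
}
```

A non-empty result would keep the row out of the queue or mark it for rejection. Which of the two to do is a workflow choice the article does not specify.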

Token economics: what 1,000 SKUs actually cost

Numbers from our recent rollouts, using Anthropic's published pricing. Input averages 1,800 tokens per SKU (product context + system prompt + supplier attributes). Output averages 600 tokens (the structured response). Costs are estimates rounded for budgeting — your mileage varies with category complexity and locale count.

- Claude Haiku 3.5 · ~€0.40 per 1,000 SKUs · acceptable for short descriptions, bullet lists, basic attribute extraction. We use this for the first pass on bulk imports.
- Claude Sonnet 4 · ~€8–12 per 1,000 SKUs · what we use for hero products, category leaders, and any SKU where `confidence < 0.7` on the Haiku pass.
- Two-pass strategy (Haiku then Sonnet on flags) · ~€1.50–€3 per 1,000 SKUs · the default we recommend. About 15–20% of rows get re-run on Sonnet.
- Caching · we cache the brand-voice system prompt and category context blocks using Anthropic's prompt caching, which cuts input cost by ~60% on a batch of 1,000+ SKUs from the same brand.

Real-world arithmetic for a 2,000-SKU catalog with the two-pass strategy and caching: €3–€6 in token cost for a full catalog regeneration. The CFO line item is not the model — it is the 60–80 hours of editorial review.
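The two-pass figure can be reproduced from the per-model numbers in this section: every SKU pays the Haiku pass, and the 15–20% that get flagged also pay the Sonnet pass. The sketch below works that through with the article's own rounded figures. The function and type names are hypothetical, and pairing low with low and high with high is a simplifying assumption.

```typescript
type Range = { low: number; high: number }

// Cost per 1,000 SKUs = first-pass cost + (share re-run) x second-pass cost.
function blendedCostPer1000(firstPass: number, secondPass: Range, rerunShare: Range): Range {
  return {
    low: firstPass + rerunShare.low * secondPass.low,
    high: firstPass + rerunShare.high * secondPass.high,
  }
}

// Figures quoted in this section, in euros per 1,000 SKUs.
const perThousand = blendedCostPer1000(0.4, { low: 8, high: 12 }, { low: 0.15, high: 0.2 })

// Scale to a 2,000-SKU catalog.
const fullCatalog: Range = { low: perThousand.low * 2, high: perThousand.high * 2 }

console.log(perThousand, fullCatalog)
```

That works out to roughly €1.60–€2.80 per 1,000 SKUs and about €3.20–€5.60 for 2,000 SKUs, consistent with the rounded €1.50–€3 and €3–€6 ranges.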

What this pipeline prevents (and what it does not)

Three regressions this shape kills before they hit production:

- Hallucinated attributes: the whitelist enum + confidence score + human review mean no SKU ships with an invented material or care instruction. We hit this on an early build where the cache happily served "100% organic cotton" on a polyester blend for nine days.
- Tone drift across the catalog: because every output is reviewed against a brand-voice prompt and the editor sees a diff, the catalog stays coherent. The "every product is the perfect choice" failure mode does not survive a diff view.
- Over-promising claims that drive returns: the banned-claims regex catches the obvious ones, and editorial review catches the rest. The cost of getting this wrong is not the model bill; it is the returns rate.

What this does not solve: category pages, collection blurbs, and SEO landing copy. Those are higher-stakes, brand-defining surfaces. We ship them through a different Payload workflow with a senior editor in the loop — never on import-time autopilot.

Rollback: reverting a bad batch without re-importing the CSV

Because every approved row has a `sourceSnapshot`, rollback is a single Payload action that sets `status` to `rolled-back` and pushes the snapshot fields back to Medusa via the admin API. We exposed this as a custom Payload list view action: select 200 rows, click "Roll back batch", confirm. Total time: under a minute for a few thousand rows.

On one project we caught a prompt-drift issue 48 hours after a 1,400-SKU batch went live — a system-prompt change had pushed descriptions toward a clinical tone the brand hated. Rolling back was a 30-second operation. Without `sourceSnapshot` and the approval queue, that would have been a panic CSV export from a backup snapshot and a 2-hour cutover.
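The rollback action boils down to turning selected rows into "restore these fields, then mark the row" steps. The sketch below shows only that planning step as a pure function. `planRollback` and the row and step types are hypothetical, the actual push to Medusa is left out, and skipping rows that were never approved or have no snapshot is an assumption about sensible behaviour.

```typescript
type EnrichmentStatus = "pending" | "approved" | "rejected" | "published" | "rolled-back"

type SelectedRow = {
  id: string
  medusaProductId: string
  status: EnrichmentStatus
  sourceSnapshot: Record<string, unknown> | null
}

type RollbackStep = {
  rowId: string
  medusaProductId: string
  restore: Record<string, unknown> // fields to push back to Medusa
  nextStatus: "rolled-back"
}

// Builds the list of restores for a batch; rows that cannot be reverted are reported.
function planRollback(rows: SelectedRow[]): { steps: RollbackStep[]; skipped: string[] } {
  const steps: RollbackStep[] = []
  const skipped: string[] = []
  for (const row of rows) {
    const wasPushed = row.status === "approved" || row.status === "published"
    if (!wasPushed || !row.sourceSnapshot) {
      skipped.push(row.id)
      continue
    }
    steps.push({
      rowId: row.id,
      medusaProductId: row.medusaProductId,
      restore: row.sourceSnapshot,
      nextStatus: "rolled-back",
    })
  }
  return { steps, skipped }
}
```

Keeping the plan separate from the push makes it possible to show the editor what will change before anything is written.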

What we would budget for a 2,000-SKU rollout

- Weeks 1–2: schema design, brand-voice prompt extraction from existing top PDPs, attribute taxonomy lockdown. This is the unglamorous half of the project and the one that determines quality.
- Weeks 2–3: Medusa subscriber + worker + Anthropic integration, Payload collection + approval queue + diff view.
- Weeks 3–4: pilot batch of 100 SKUs, editorial calibration, prompt iteration. Expect to throw away the first prompt entirely.
- Weeks 4–5: full catalog run in Haiku, Sonnet flagging, editorial review at scale.
- Cost range: €18k–€32k for the build, depending on how messy the supplier attribute feed is and how many locales need separate prompts. Token cost is a rounding error on top.

If you are scoping the storefront alongside the enrichment layer, see how we ship Medusa storefronts end to end, including the catalog, checkout, and AI workflows that earn their seat.

If you have a 1,000+ SKU catalog on Medusa (or are migrating to one) and your content team is drowning, send us your catalog and stack. We will tell you which fields are worth automating, which are not, and what your token bill looks like at your scale.

On every Medusa + Payload catalog we ship at this scale, this is the shape we wire on day one: import-time enrichment, structured outputs, approval queue, source snapshots, rollback. The cost is small, the editorial leverage is large, and the regressions you avoid are the ones that quietly damage trust for months before anyone notices. That is the actual product — not the AI, but the gate in front of it.

Questions operators ask next

How does this pipeline handle Medusa product updates from the admin UI versus bulk CSV imports?

The subscriber listens to `product.created` and `product.updated` events, so both paths trigger enrichment. We dedupe via a 24-hour cooldown on `metadata.enriched_at` and skip products already `pending` in the queue. Admin edits and CSV imports flow through the same path; there is no special case.

What is the realistic editorial review time per SKU once the queue is running?

On a well-tuned brand-voice prompt with the diff view, experienced editors review 40–60 SKUs per hour for short-form fields and 20–30 per hour for long descriptions. Budget roughly 60–80 hours of editorial time for a 2,000-SKU first pass. Ongoing maintenance is closer to 2–4 hours per week for typical newness cadence.

Can we use OpenAI or Gemini instead of Claude for the generation step?

Yes. The schema-driven approach works with OpenAI's [structured outputs](https://platform.openai.com/docs/guides/structured-outputs) and Gemini's function calling. We default to Claude because Sonnet's adherence to constrained enums has been the most reliable in production, but the surrounding pipeline (subscriber, queue, Payload approval, rollback) is model-agnostic: swap the client, keep everything else.

How do you handle multiple locales? Does each get its own Claude call?

Yes, one call per locale per SKU, but we feed the approved English version into the localisation prompt rather than starting from supplier attributes again. This roughly doubles token cost per added locale but cuts review time significantly because translators verify rather than rewrite. Caching the brand-voice prompt across locales keeps costs sane.

What happens if Medusa's admin API call fails after a Payload approval?

The `afterChange` hook wraps the push in a retry with exponential backoff. If all retries fail, the row stays at `status: approved` but with a `syncError` field populated and a Slack alert fires. The editor can re-trigger the push from the Payload UI once the underlying issue is resolved, with no data loss and no half-applied updates.

Does this pattern work for B2B catalogs where attributes are more important than copy?

It works better, actually. B2B buyers care about correct attribute extraction (dimensions, tolerances, compatibility codes) more than prose. We tighten the schema, drop the long-form description, and add a stricter confidence floor, typically 0.85 for attributes that drive search filters. The approval queue stays the same; the prompts and validators shift.

Pull quote

AI at render time is a liability. AI at import time is an asset — because every word has a draft, a reviewer, and a rollback path before a customer sees it.