AI Automation
Claude vs Gemini vs GPT-4o-mini for Catalog and Editorial AI: The €/1k-SKU Cost Framework We Use
Operators keep asking which model to wire into their Payload or Medusa pipeline. The honest answer: the model name moves the bill less than batching, caching, and structured outputs do. Here is the framework we use on real client work, with line-item numbers.

Every operator running Payload or Medusa with an AI budget eventually sends us the same question, usually forwarded from their CTO with a line that reads something like "finance wants to know why we picked Sonnet over Haiku". The honest answer is that the model name moves the bill less than three architecture choices above it: whether you batch, whether you cache, and whether you ask for structured output. We have shipped catalog enrichment and editorial drafting on Claude Sonnet 4, Claude Haiku 4, Gemini 2.5 Flash, and GPT-4o-mini across real client work on Payload and Medusa. Same prompts, same content, different invoices. The spread between "we picked the wrong model" and "we picked the right model" is usually 20–40% of the bill. The spread between "we picked the right model but skipped prompt caching" and "we wired it properly" is 60–80%.
This is the framework we use when an operator asks us to keep their AI catalog and editorial pipeline under €2k/month at 2,000 SKUs and 40 articles across 4 locales — and the quality floor we refuse to cross to get there.
The wrong question: which model is best
The right question is which model survives the batch. A model that is 30% cheaper per token but needs two retries to produce valid JSON is not cheaper. A model that is 50% cheaper per token but cannot keep brand voice across 2,000 SKUs is not cheaper either — it is just spending your editor's salary instead of your API budget. We have watched both happen on client projects we inherited.
Our cost framework: three numbers we hold the line on
Before we pick a model, we write down three numbers for the client. They are deliberately boring and deliberately specific:
Cost per SKU enriched: target €0.004–€0.012 for a description + 8 attributes + SEO meta in one locale. Above €0.02 means we have not turned on caching; above €0.04 means we are using the wrong model for the job.
Cost per 1,000-word article translated: target €0.03–€0.08 per locale with cached system context. Above €0.15 means we are re-sending the style guide on every request.
Quality floor pass rate: the percentage of generated outputs that clear our eval harness without human edits. We will not ship a pipeline below 85% on catalog or 75% on editorial. Below those, the editor cost eats whatever the model saves.
Every model choice in the rest of this post is judged against those three. The pricing pages we cite are the official ones — Anthropic pricing, Google Gemini API pricing, OpenAI pricing — as of November 2025; verify before you sign anything.
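The thresholds above are simple enough to encode as a check in a cost report or CI job. The following is a minimal sketch, not our production code: the function names and verdict labels are made up for illustration, and the "watch" band between the top of the target range and the first alarm level is an assumption, since the framework only names a target range and two alarm levels.

```typescript
// Thresholds come from the three numbers above; names and labels are illustrative.
type SkuCostVerdict =
  | 'on-target'
  | 'watch'
  | 'caching-likely-off'
  | 'wrong-model-likely'

function diagnoseSkuCost(eurPerSku: number): SkuCostVerdict {
  if (eurPerSku > 0.04) return 'wrong-model-likely'
  if (eurPerSku > 0.02) return 'caching-likely-off'
  if (eurPerSku > 0.012) return 'watch' // assumed band: above target, below first alarm
  return 'on-target'
}

// Minimum share of outputs that must clear the eval harness without human edits.
const QUALITY_FLOOR = { catalog: 0.85, editorial: 0.75 } as const

function clearsQualityFloor(
  job: keyof typeof QUALITY_FLOOR,
  passed: number,
  total: number,
): boolean {
  if (total <= 0) return false // no eval results means no evidence, so do not ship
  return passed / total >= QUALITY_FLOOR[job]
}
```

Treating an empty eval run as a failure is a deliberate choice in this sketch: a pipeline with no measurements should not pass by default.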
Sonnet vs Haiku vs Gemini Flash vs GPT-4o-mini: what we use each for
Criteria first, verdicts second. We grade each model on five things: structured output reliability, brand-voice retention over a batch of 500+ items, latency at p95, input-token caching support, and behaviour under JSON schema constraints. Here is where each one earns its keep on Payload and Medusa work:
Claude Sonnet 4 · editorial drafting, long-form translation, anything where brand voice matters across 800+ words · the model we default to when the output is what the reader sees and the editor only does QA, not rewrite.
Claude Haiku 4 · catalog enrichment at volume (descriptions, attribute extraction, SEO meta), classification, tagging · 4–5× cheaper than Sonnet on input, holds the quality floor on structured tasks, fails on nuanced brand voice.
Gemini 2.5 Flash · bulk translation with context caching, image-to-attribute extraction from product photos, anything where the input is large and reused · context caching makes per-SKU cost collapse when you reuse the same brand guide across thousands of items.
GPT-4o-mini · structured outputs with strict JSON schema enforcement; the structured outputs API gives schema guarantees the other three only approximate · we use it when downstream Payload validation cannot tolerate a malformed field.
Notice what we do not say: "Sonnet is best". Sonnet on a 2,000-SKU enrichment run is a €60 mistake we have watched a client make. Haiku on the homepage hero copy is a brand drift we have watched another client roll back after two weeks.
Prompt caching: where 60–80% of the bill actually hides
Anthropic's prompt caching is the single biggest lever on a catalog pipeline. Cached input tokens are billed at roughly 10% of the standard input rate (cache reads) with a one-time write premium. For a catalog enrichment job, the system prompt + brand guide + attribute schema is the same on every call — that is 2,000–6,000 tokens of context you would otherwise re-pay for on every SKU.
The publish-hook shape that actually uses it looks like this on a Payload collection. The cache control marker on the system block is what activates the discount; the variable per-SKU content goes after the cached prefix, so every call after the first hits Anthropic's cache as long as the calls arrive within the cache TTL.
// src/collections/Products/hooks/enrichOnPublish.ts
import type { CollectionAfterChangeHook } from 'payload'
import Anthropic from '@anthropic-ai/sdk'
import { productEnrichmentSchema } from './schema'
import { brandVoiceGuide } from '../../../lib/prompts/brand-voice'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY })

// Join the text blocks of the response and pull out the outermost {...} span.
const extractJson = (content: Anthropic.Messages.ContentBlock[]): string => {
  const text = content.map((block) => (block.type === 'text' ? block.text : '')).join('')
  const match = text.match(/\{[\s\S]*\}/)
  if (!match) throw new Error('No JSON object found in model response')
  return match[0]
}

export const enrichProductOnPublish: CollectionAfterChangeHook = async ({
  context,
  doc,
  operation,
  req,
}) => {
  // The update at the end fires afterChange again; skip our own write.
  if (context.skipHooks) return doc
  if (operation !== 'create' && !doc._enrichmentRequested) return doc
  if (doc.aiEnrichedAt && !doc._enrichmentRequested) return doc

  const response = await anthropic.messages.create({
    model: 'claude-haiku-4-20250514', // confirm the exact ID against Anthropic's model list
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: brandVoiceGuide, // ~4,200 tokens, identical across the batch
        cache_control: { type: 'ephemeral' },
      },
    ],
    messages: [
      {
        role: 'user',
        content: `Enrich this SKU as JSON matching the schema.\nSKU: ${doc.sku}\nTitle: ${doc.title}\nRaw attributes: ${JSON.stringify(doc.rawAttributes)}`,
      },
    ],
  })

  // Throws on malformed JSON or a schema violation, so bad output never reaches Payload.
  const enriched = productEnrichmentSchema.parse(
    JSON.parse(extractJson(response.content)),
  )

  await req.payload.update({
    collection: 'products',
    id: doc.id,
    data: {
      description: enriched.description,
      attributes: enriched.attributes,
      seo: enriched.seo,
      aiEnrichedAt: new Date().toISOString(),
      aiModel: 'claude-haiku-4',
    },
    context: { skipHooks: true }, // read by the guard at the top of this hook
  })

  return doc
}

On a 2,000-SKU run with a 4,200-token system prompt, the cache saves roughly €18–€24 per full pass vs the un-cached version on Haiku. Run that pipeline twice a month and caching has paid for the half-day it took to wire up.
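To sanity-check a caching estimate like the one above against your own pricing, the arithmetic for the cached prefix can be sketched as follows. This is an illustration under stated assumptions: the function and field names are hypothetical, the read and write multipliers (0.1× and 1.25× of the base input price) reflect our understanding of Anthropic's 5-minute cache pricing and should be verified, and the price in the example is a round placeholder, not a quoted rate. It covers only the repeated prefix, not the per-SKU input or the output tokens.

```typescript
interface CacheRunInput {
  calls: number             // requests in the run, e.g. one per SKU
  prefixTokens: number      // tokens in the cached system prefix
  inputPricePerMTok: number // base input price per million tokens (look this up)
  readMultiplier: number    // cache-read price as a fraction of the base price
  writeMultiplier: number   // cache-write price as a multiple of the base price
  cacheWrites: number       // how often the prefix gets (re)written, at least 1
}

function estimatePrefixCost(run: CacheRunInput) {
  const perCall = (run.prefixTokens * run.inputPricePerMTok) / 1_000_000
  const uncached = run.calls * perCall
  const reads = Math.max(run.calls - run.cacheWrites, 0)
  const cached =
    run.cacheWrites * perCall * run.writeMultiplier +
    reads * perCall * run.readMultiplier
  return { uncached, cached, saved: uncached - cached }
}

// Placeholder price of 1.00 per million input tokens, NOT a vendor rate.
const example = estimatePrefixCost({
  calls: 2000,
  prefixTokens: 4200,
  inputPricePerMTok: 1.0,
  readMultiplier: 0.1,
  writeMultiplier: 1.25,
  cacheWrites: 1,
})
console.log(example)
```

With the placeholder price, the prefix costs about 8.40 uncached and about 0.84 cached, and the saving scales linearly with the real input price. Raise `cacheWrites` if the run has pauses longer than the cache TTL, because every expiry means paying the write premium again.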
Batching vs streaming: pick the one the user is actually waiting for
Streaming exists for the editor watching tokens land in the Lexical editor. A cron job enriching 2,000 SKUs at 02:00 does not care — it cares about throughput and retries. We default to batch APIs for catalog work (Anthropic's Message Batches API is roughly 50% cheaper than synchronous calls; OpenAI's Batch API is the same discount shape) and synchronous + streaming only for the editorial UI.
The rule we give clients: if a human is staring at the screen waiting for the answer, stream. If the result lands in Payload while everyone is asleep, batch. The 50% discount applies only to the spend you move to batch: if about 60% of a €2k/month API bill is offline work, batching alone takes it to €1.4k without changing a single prompt.
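The stream-or-batch rule, and the request list an offline run would submit, can be sketched like this. Treat it as an assumption-laden illustration: the type and function names are hypothetical, and the `custom_id` plus `params` request shape mirrors our understanding of Anthropic's Message Batches API, so check it against the current SDK types before relying on it. The model ID is passed in rather than hard-coded.

```typescript
interface Job {
  humanWaiting: boolean // true when an editor is watching the output arrive
}

function chooseTransport(job: Job): 'stream' | 'batch' {
  return job.humanWaiting ? 'stream' : 'batch'
}

interface SkuInput {
  sku: string
  title: string
  rawAttributes: Record<string, unknown>
}

// Assumed shape of one entry in a Message Batches request list.
interface BatchRequest {
  custom_id: string
  params: {
    model: string
    max_tokens: number
    system: { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } }[]
    messages: { role: 'user'; content: string }[]
  }
}

function buildBatchRequests(
  skus: SkuInput[],
  model: string,
  brandVoiceGuide: string,
): BatchRequest[] {
  return skus.map((item): BatchRequest => ({
    custom_id: `enrich-${item.sku}`, // used to match each result back to its SKU
    params: {
      model,
      max_tokens: 1024,
      // Same cached prefix on every request; only the user message varies.
      system: [
        { type: 'text', text: brandVoiceGuide, cache_control: { type: 'ephemeral' } },
      ],
      messages: [
        {
          role: 'user',
          content: `Enrich this SKU as JSON matching the schema.\nSKU: ${item.sku}\nTitle: ${item.title}\nRaw attributes: ${JSON.stringify(item.rawAttributes)}`,
        },
      ],
    },
  }))
}
```

In a real run the resulting array would be handed to the SDK's batch creation call (`anthropic.messages.batches.create({ requests })` in the TypeScript SDK as we understand it). If your SKUs contain characters the API does not accept in `custom_id`, map them to a safe identifier first.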
Structured outputs vs free-form
Catalog goes through JSON schema. Editorial goes through Lexical blocks. We do not mix these. On catalog, every output field maps to a Payload field, and a malformed JSON response means a failed publish — so we use either GPT-4o-mini with strict structured outputs, or Claude with a Zod parser and one retry budget. On editorial, the model writes prose into a Lexical-compatible block tree; structured outputs there cost you the voice.
// src/collections/Products/hooks/schema.ts
import { z } from 'zod'

export const productEnrichmentSchema = z.object({
  description: z.string().min(80).max(600),
  attributes: z.object({
    material: z.string().nullable(),
    color: z.string().nullable(),
    sizeRange: z.string().nullable(),
    careInstructions: z.string().nullable(),
    origin: z.string().nullable(),
  }),
  seo: z.object({
    title: z.string().max(60),
    description: z.string().max(155),
    keywords: z.array(z.string()).max(8),
  }),
})

export type ProductEnrichment = z.infer<typeof productEnrichmentSchema>

We hit this on a Medusa + Payload project where Haiku occasionally returned an SEO description at 168 characters, past Google's truncation point and past our schema. Without the Zod parse + retry, those went straight to production. With it, the malformed 0.4% of responses regenerate; the rest write straight to Payload. The bug class disappeared.
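The "parse, then retry once" pattern described above can be sketched generically. This is an illustration rather than our production helper: `generateValidated` and its parameters are hypothetical names, and the validator is kept abstract so the snippet runs without Zod or an API client. In the pipeline above, the validator would be something like `(raw) => productEnrichmentSchema.parse(JSON.parse(raw))`.

```typescript
// A validator returns the parsed value or throws when the output is unusable.
type Validator<T> = (raw: string) => T

async function generateValidated<T>(
  generate: (attempt: number) => Promise<string>,
  validate: Validator<T>,
  retryBudget = 1, // extra attempts allowed after the first one
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= retryBudget; attempt++) {
    try {
      return validate(await generate(attempt))
    } catch (err) {
      lastError = err // remember why this attempt was rejected, then try again
    }
  }
  throw new Error(
    `Output still invalid after ${retryBudget + 1} attempts: ${String(lastError)}`,
  )
}
```

Keeping the budget small is the point: a model that routinely needs the retry is a model that fails the quality floor, and the failure should surface as an error instead of an open-ended loop on the API bill.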
The quality floor: the eval harness we run before swapping models
Before we swap a model in production, we run a fixed eval set of 100 SKUs and 20 articles through the candidate and score four things: schema validity, length compliance, brand-voice similarity (cosine against a reference embedding of 30 hand-written examples), and factual consistency against the raw attributes. We refuse to ship a model that scores below 85% on catalog tasks or 75% on editorial, the same floors we set in the cost framework.
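The scoring step can be sketched in a few lines. Two assumptions are baked in and worth stating: an item counts as passing only when all four checks pass (the text above does not say how the four scores are combined), and the brand-voice similarity threshold is a parameter you calibrate yourself, since no value is given here. The names are illustrative, and producing the embeddings is out of scope for this snippet.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('Vectors must be non-empty and of equal length')
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0 // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

interface EvalResult {
  schemaValid: boolean
  lengthOk: boolean
  voiceSimilarity: number // cosine against the reference embedding
  factuallyConsistent: boolean
}

// Share of items that clear every check; compare against the 85% / 75% floors.
function passRate(results: EvalResult[], voiceThreshold: number): number {
  if (results.length === 0) return 0
  const passed = results.filter(
    (r) =>
      r.schemaValid &&
      r.lengthOk &&
      r.factuallyConsistent &&
      r.voiceSimilarity >= voiceThreshold,
  ).length
  return passed / results.length
}
```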
Locale economics: caching collapses cost, QA does not
Translation cost per word with context caching on Gemini Flash or Claude lands around €0.00003–€0.00008 per source word. A 1,000-word article translated to four locales is somewhere between €0.12 and €0.32 — call it negligible. What is not negligible: the QA pass. We never automate the final review for a locale where the brand has paying customers in market. The model writes the draft; a native-speaker editor signs off.
The locale we refuse to fully automate is the one where the client sells the most. The risk asymmetry is wrong: a botched English translation on a low-traffic locale is a fix-on-next-publish problem; a botched French page on a brand doing 40% of revenue in France is a brand problem that does not show up in your eval harness.
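Both halves of this section, the cheap translation and the expensive review, fit in a short sketch. The names are hypothetical, and the label for locales without paying customers is our assumption: the text only commits to never automating the final review where customers are in market.

```typescript
// API cost of translating one article into several locales.
function translationCostEur(
  sourceWords: number,
  locales: number,
  eurPerSourceWord: number,
): number {
  return sourceWords * locales * eurPerSourceWord
}

interface LocaleProfile {
  code: string
  hasPayingCustomers: boolean
}

type ReviewMode = 'native-editor-signoff' | 'automated-with-spot-checks'

function reviewMode(locale: LocaleProfile): ReviewMode {
  // Paying customers in market: a native-speaker editor always signs off.
  return locale.hasPayingCustomers
    ? 'native-editor-signoff'
    : 'automated-with-spot-checks' // assumed fallback, not stated in the text
}

const low = translationCostEur(1000, 4, 0.00003)
const high = translationCostEur(1000, 4, 0.00008)
console.log(low.toFixed(2), high.toFixed(2))
```

The two calls at the end reproduce the range quoted above for a 1,000-word article in four locales, roughly €0.12 to €0.32. The review routing is the part that drives the real cost.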
The €2k/month walkthrough
Here is the math on the brief we get most often — a Payload + Medusa client running 2,000 SKUs and publishing 40 articles per month across 4 locales. Numbers below are estimates from our own runs at current public pricing; verify against the vendor pages before signing a budget.
Catalog enrichment · 2,000 SKUs/month on Haiku 4 with prompt caching + batch API · ~€16–€22/month
Catalog re-enrichment on attribute change · ~600 SKUs/month average churn · ~€5–€8/month
Editorial drafting · 40 articles × ~1,200 tokens output on Sonnet 4 with cached brand guide · ~€28–€40/month
Translation · 40 articles × 4 locales on Gemini 2.5 Flash with context caching · ~€18–€28/month
Structured SEO meta regeneration · 2,000 SKUs + 40 articles on GPT-4o-mini structured outputs · ~€6–€10/month
Eval harness runs · weekly regression on 100 SKUs + 20 articles across all 4 models · ~€8–€12/month
Buffer for retries, embeddings, ad-hoc copilot calls · ~€40–€60/month
Total · ~€120–€180/month at the API layer · the rest of the €2k budget is editor QA time, observability tooling, and the developer hours to maintain the pipeline
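For anyone who wants to re-run the total with their own estimates, the line items above sum up as follows. The figures are copied from the walkthrough; only the variable names are ours.

```typescript
interface LineItem {
  name: string
  low: number  // EUR per month, lower estimate
  high: number // EUR per month, upper estimate
}

const lineItems: LineItem[] = [
  { name: 'Catalog enrichment', low: 16, high: 22 },
  { name: 'Catalog re-enrichment', low: 5, high: 8 },
  { name: 'Editorial drafting', low: 28, high: 40 },
  { name: 'Translation', low: 18, high: 28 },
  { name: 'Structured SEO meta', low: 6, high: 10 },
  { name: 'Eval harness runs', low: 8, high: 12 },
  { name: 'Buffer', low: 40, high: 60 },
]

const total = lineItems.reduce(
  (sum, item) => ({ low: sum.low + item.low, high: sum.high + item.high }),
  { low: 0, high: 0 },
)

// Share of a 2,000 EUR monthly budget that goes to API calls.
const budget = 2000
const apiShare = { low: total.low / budget, high: total.high / budget }
console.log(total, apiShare)
```

The sum comes to €121–€180, which is the "~€120–€180" total above, or roughly 6–9% of the €2k budget.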
The API bill is rarely the issue. The infrastructure around it — observability, eval, retry budgets, schema validation, locale routing, the publish hooks that wire it all into Payload — is where the budget actually goes. That is the part we ship; the API call is twelve lines.
What we would not do
Route a customer-facing copilot through the cheapest model. The model that talks to your customers handles refund language, returns policy, and inventory questions — Haiku-class models hallucinate on order context in ways Sonnet-class models do not. We have measured this.
Let finance pick the model without an eval. The €/token number is meaningless until you run the eval harness against your own content. We have seen €300/month "savings" cost €1,200/month in editor rewrites.
Skip prompt caching to "keep it simple". Caching is twenty lines of config and 60–80% off the input bill. Refusing to wire it is leaving money on the table for a comfort that does not exist.
Use the same model for catalog and copy. They are different jobs with different quality floors. Pick per job; the multi-model pipeline pays for itself in the first month.
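Picking a model per job is mostly a configuration question. A minimal sketch, assuming the job split from the €2k walkthrough: the job names are illustrative, and the model strings are short labels taken from this article, not verified API identifiers.

```typescript
type PipelineJob =
  | 'catalog-enrichment'
  | 'editorial-drafting'
  | 'translation'
  | 'seo-meta'

// Defaults follow the walkthrough's line items; replace with real model IDs.
const MODEL_BY_JOB: Record<PipelineJob, string> = {
  'catalog-enrichment': 'claude-haiku-4',
  'editorial-drafting': 'claude-sonnet-4',
  translation: 'gemini-2.5-flash',
  'seo-meta': 'gpt-4o-mini',
}

// Overrides let an eval run try a candidate model on one job without touching the rest.
function modelFor(
  job: PipelineJob,
  overrides: Partial<Record<PipelineJob, string>> = {},
): string {
  return overrides[job] ?? MODEL_BY_JOB[job]
}
```

For example, `modelFor('catalog-enrichment', { 'catalog-enrichment': 'claude-sonnet-4' })` routes only the enrichment job to the candidate, which keeps a model swap to one config entry plus a re-run of the eval.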
If you are weighing this on a Medusa build specifically (catalog enrichment, attribute extraction, multi-locale PDPs), see how we ship Medusa storefronts with AI catalog ops wired in from import pipeline through publish hooks.
If your AI invoice is climbing and you cannot tell whether the model or the architecture is the problem, send us your stack and your token bill: we will tell you where the leak is before we touch a prompt.
On every Payload + Medusa AI pipeline we ship, we now wire prompt caching, batch APIs for offline jobs, structured outputs on catalog, and the eval harness on day one. The model choice is the last decision, not the first. It is also the easiest one to change — once the architecture around it is right, swapping Haiku for Sonnet on a single job is a one-line config change and a re-run of the eval. That is the position you want to be in when your CFO asks the question again next quarter.
// After the call
Questions operators ask next
How much should an AI catalog and editorial pipeline actually cost per month at 2,000 SKUs?
At the API layer, €120–€180/month is achievable with prompt caching, batch APIs, and the right model per job. The full operational cost including editor QA, observability, and pipeline maintenance typically lands €1,500–€2,500/month. If your API bill alone is past €1,500, the architecture needs work before the model does.
Does prompt caching work the same way on Anthropic, OpenAI, and Gemini?
The mechanics differ. Anthropic uses explicit cache_control markers with a 5-minute or 1-hour TTL and roughly 10% read cost. Gemini uses explicit context caching with a configurable TTL. OpenAI introduced automatic prompt caching with no opt-in but less control over what gets cached. The pricing math is similar; the wiring is not.
Can we run all of this through Payload hooks or do we need a separate worker service?
For under ~500 SKUs/day, Payload afterChange hooks with a queue (BullMQ on Redis, or Payload's built-in jobs queue in v3.30+) handle it fine. Past that, we move enrichment to a dedicated worker that calls Payload's Local API on completion. The threshold is usually how long your serverless function can run, not the model speed.
Does the eval harness need to be a separate service or can it live in the same repo?
Same repo, separate runner. We keep it as a CLI script that loads a fixed eval set from a JSON fixture, runs it against each candidate model, and writes results to a Payload collection so editors can review regressions in the admin UI. It runs in CI on every prompt change and weekly on a cron against production.
What is the quickest win if we are already over budget on AI spend?
Three things in order: turn on prompt caching for any repeated system context (typically 60–80% off input cost), move offline jobs to the batch API (50% off), and split the model per job instead of using one model for everything. Most clients see a 50–70% bill reduction without changing prompts or quality.
Should we wait for cheaper models before building this pipeline?
No. The pipeline architecture — hooks, schemas, eval harness, observability — is the expensive part to build and the part that does not change when models get cheaper. Build it now against today's models; swap the model name in config when the next generation lands. Clients who waited 18 months for "cheaper models" lost 18 months of content velocity.