Alt Text on Autopilot: The Payload + Claude Vision Pipeline We Ship for 10,000-Image Archives
Alt text is the AI workflow with the highest ROI and the lowest glamour. Here is the exact Payload upload-hook, queue, and Claude Vision pipeline we ship — including the locales we still refuse to automate.

*Editors will not write 10,000 alt strings. Auditors will not accept empty ones. This is the gap we close.*
Every Payload project we ship past 5,000 images hits the same wall: an editorial team that will not retro-write alt text, an accessibility auditor who will not sign off without it, and an SEO lead quietly counting the missing strings. The Head of Content wants the archive compliant by next quarter. The engineering lead does not want another half-finished script running on someone's laptop. The CFO does not want a €4,000/month vendor.
Alt text is the AI workflow with the highest ROI and the lowest glamour. It is unglamorous because it does not demo well — no chatbot, no agent, no slide. But it earns its tokens on day one: it satisfies WCAG 1.1.1, it feeds image search, and it removes a recurring tax from the editorial calendar. On a recent media archive shape — roughly 38,000 historical images plus ~400 new uploads per week — the Claude Haiku + Payload upload-hook pipeline we ship runs the backfill in under 48 hours and costs less than a single freelance copy day.
This is the exact shape we wire: the collection schema, the hook signature, the queue, the prompt, the validators, the cost model, and — importantly — the locales and image classes where we still refuse to automate. If you are evaluating this on your own stack, the snippets are production-shaped enough to lift directly.
What auditors check and what Google actually reads
Two mandates collide here, and operators usually conflate them. WCAG 2.2 (success criterion 1.1.1) requires a text alternative that serves the same purpose as the image. Decorative images must have `alt=""` — not missing, empty. Informational images need a description of function, not pixels. Google's image SEO guidance wants descriptive, concise, context-aware text — and explicitly penalises keyword stuffing.
That gap is where the prompt does its work. A vision model left to its own devices will write "image of a woman holding a coffee cup", which fails both mandates: the words "image of" are redundant because screen readers already announce the element as an image (W3C's alt-text guidance advises leaving them out; it is a convention, not a WCAG success criterion), and the description has zero context about why the image is on the page. The prompt has to inject the surrounding article, the locale, and the role (hero, inline, gallery, decorative).
Where alt text lives in a Payload Media collection
We add four fields to every Media collection on every Payload project, regardless of whether AI alt text ships in v1. They cost nothing to add and they make the later automation trivial. If you wire them on day one, you avoid a migration later.
```typescript
// collections/Media.ts
import type { CollectionConfig } from 'payload'
import { generateAltTextHook } from '../hooks/generateAltText'

export const Media: CollectionConfig = {
  slug: 'media',
  upload: {
    staticDir: 'media',
    mimeTypes: ['image/*'],
    imageSizes: [
      { name: 'thumbnail', width: 400 },
      { name: 'card', width: 768 },
      { name: 'hero', width: 1600 },
    ],
  },
  access: { read: () => true },
  fields: [
    {
      name: 'alt',
      type: 'text',
      required: false, // we no longer block uploads on this
      admin: { description: 'Auto-generated on upload. Edit freely.' },
    },
    {
      name: 'altSource',
      type: 'select',
      defaultValue: 'pending',
      options: ['pending', 'ai', 'human', 'human-edited-ai', 'decorative'],
      admin: { position: 'sidebar' },
    },
    {
      name: 'altLocale',
      type: 'text',
      admin: { position: 'sidebar', description: 'Locale used to generate' },
    },
    {
      name: 'altReviewedAt',
      type: 'date',
      admin: { position: 'sidebar' },
    },
  ],
  hooks: {
    afterChange: [generateAltTextHook],
  },
}
```

Three things to notice. First, `alt` is no longer `required: true` at the field level. We used to do that and it broke bulk uploads from the editorial team. The validator runs at publish time on the *consuming* document, not on the Media doc itself. Second, `altSource` is the audit trail field: when a DPA or accessibility auditor asks "how do you know which strings are AI-generated", you answer with one query. Third, `altLocale` is the field that lets us re-run a single locale without nuking the rest.
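The publish-time check on the consuming document is not shown here. A minimal sketch of what it might look like, assuming the document's image relations have already been populated into plain objects. `MediaRef` and `findAltViolations` are hypothetical names for illustration, not Payload APIs:

```typescript
type AltSource = 'pending' | 'ai' | 'human' | 'human-edited-ai' | 'decorative'

// Hypothetical shape: the subset of a populated Media doc the check needs.
interface MediaRef {
  id: string
  alt?: string | null
  altSource?: AltSource | null
}

// Returns the ids of images that should block publishing.
function findAltViolations(images: MediaRef[]): string[] {
  return images
    .filter((img) => {
      // Decorative images are expected to render alt="", so an empty string is valid.
      if (img.altSource === 'decorative') return false
      // Everything else needs a non-blank description.
      return (img.alt ?? '').trim().length === 0
    })
    .map((img) => img.id)
}

const violations = findAltViolations([
  { id: 'hero', alt: 'Barista pouring milk into a cup', altSource: 'ai' },
  { id: 'divider', alt: '', altSource: 'decorative' },
  { id: 'inline-2', alt: '', altSource: 'pending' },
])
```

In a Payload project this would plausibly run from a `beforeChange` hook on the consuming collection when its status moves to published; that wiring depends on the collection and is left out.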
The pipeline: upload → queue → Claude Vision → validator → write-back
The afterChange hook does almost nothing. It enqueues. The work happens in a BullMQ worker that pulls from Redis, calls Claude Vision, validates the output, and writes back via the Payload Local API. Keeping the worker out of the request lifecycle is the single most important architectural decision in this pipeline — it is what lets the same code handle a 40,000-image backfill and a single net-new upload.
```typescript
// hooks/generateAltText.ts
import type { CollectionAfterChangeHook } from 'payload'
import { altTextQueue } from '../queue/altText'

export const generateAltTextHook: CollectionAfterChangeHook = async ({
  doc,
  operation,
  req,
}) => {
  // Only on create, or when the file itself changed
  if (operation !== 'create' && !req.file) return doc

  // Skip if a human already wrote alt text
  if (doc.altSource === 'human' || doc.altSource === 'human-edited-ai') {
    return doc
  }

  // Skip explicitly decorative images
  if (doc.altSource === 'decorative') return doc

  await altTextQueue.add(
    'generate',
    {
      mediaId: doc.id,
      filename: doc.filename,
      mimeType: doc.mimeType,
      locale: req.locale ?? 'en',
    },
    {
      attempts: 4,
      backoff: { type: 'exponential', delay: 2000 },
      removeOnComplete: 1000,
      removeOnFail: false, // keep failed jobs for inspection
    },
  )
  return doc
}
```

The worker is where the prompt lives. We pull the image URL, fetch surrounding context if the media is already referenced by an article or product, and call Claude's vision API with a system prompt tuned for accessibility-grade output.
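`buildContext` is imported by the worker but its body is not shown. The Payload lookup depends on which collections reference Media, so the sketch below covers only the pure part: turning a referencing document into the `summary` and `role` the prompt consumes. The `ReferencingDoc` shape, the 300-character budget, and the fallback values are all assumptions for illustration:

```typescript
type ImageRole = 'hero' | 'inline' | 'gallery' | 'decorative'

// Hypothetical shape of whatever document references the image.
interface ReferencingDoc {
  title: string
  excerpt?: string
  role: ImageRole
}

interface PromptContext {
  summary: string
  role: ImageRole
}

// Assumed budget so context does not dominate the input tokens.
const MAX_SUMMARY_CHARS = 300

function toPromptContext(ref: ReferencingDoc | null): PromptContext {
  // No referencing document yet (fresh upload): fall back to neutral values.
  if (!ref) return { summary: 'none available', role: 'inline' }

  const joined = [ref.title, ref.excerpt ?? ''].filter(Boolean).join('. ')
  // Collapse whitespace so pasted rich text does not waste tokens.
  const normalized = joined.replace(/\s+/g, ' ').trim()
  return { summary: normalized.slice(0, MAX_SUMMARY_CHARS), role: ref.role }
}

const heroContext = toPromptContext({
  title: 'Spring menu launch',
  excerpt: 'Five new   seasonal drinks arrive in April.',
  role: 'hero',
})
```

Keeping this part pure makes it easy to unit test separately from the database query that finds the referencing document.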
```typescript
// workers/altText.ts
import { Worker } from 'bullmq'
import Anthropic from '@anthropic-ai/sdk'
import { getPayload } from 'payload'
import config from '@payload-config'
import { validateAlt } from '../lib/validateAlt'
import { buildContext } from '../lib/buildContext'
// Assumed file location; the constant itself is shown in the prompt section.
import { ALT_TEXT_SYSTEM_PROMPT } from '../lib/altTextPrompt'

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

export const altTextWorker = new Worker(
  'alt-text',
  async (job) => {
    const { mediaId, locale } = job.data
    const payload = await getPayload({ config })
    const media = await payload.findByID({ collection: 'media', id: mediaId })
    const context = await buildContext(payload, mediaId, locale)

    const response = await anthropic.messages.create({
      model: 'claude-haiku-4-5',
      max_tokens: 200,
      system: ALT_TEXT_SYSTEM_PROMPT,
      messages: [
        {
          role: 'user',
          content: [
            // The API fetches this URL itself, so it must be absolute and publicly reachable.
            { type: 'image', source: { type: 'url', url: media.url! } },
            {
              type: 'text',
              text: `Locale: ${locale}\nContext: ${context.summary}\nRole: ${context.role}`,
            },
          ],
        },
      ],
    })

    const raw = response.content[0].type === 'text' ? response.content[0].text : ''
    const { ok, alt, reason, decorative } = validateAlt(raw, { locale })
    // Throwing marks the job failed; BullMQ retries it per the job's attempts/backoff.
    if (!ok) throw new Error(`alt validation failed: ${reason}`)

    await payload.update({
      collection: 'media',
      id: mediaId,
      data: {
        alt,
        // DECORATIVE comes back as an empty alt; record it as decorative, not as 'ai'.
        altSource: decorative ? 'decorative' : 'ai',
        altLocale: locale,
        altReviewedAt: null,
      },
    })
  },
  {
    concurrency: 8,
    // Assumption: Redis location comes from the environment; adjust to your setup.
    connection: {
      host: process.env.REDIS_HOST ?? '127.0.0.1',
      port: Number(process.env.REDIS_PORT ?? 6379),
    },
  },
)
```

A concurrency of eight is the sweet spot we land on for most clients. Higher, and we start hitting Anthropic rate limits on shared org keys; lower, and a 40,000-image backfill runs overnight instead of in an afternoon.
The prompt: describe, do not interpret
This is the prompt we ship. It is the result of about a dozen iterations on real archives. The rules are not negotiable — every one of them is a failure mode we have seen in production.
```typescript
export const ALT_TEXT_SYSTEM_PROMPT = `
You write alt text for images on a published website. Your output is read aloud by screen readers and indexed by search engines.
Rules:
1. Never start with "image of", "picture of", "photo of", "a photograph showing". Describe directly.
2. Write one or two sentences. Hard cap: 160 characters.
3. Describe what the image shows and its likely function in the surrounding context. Do not interpret mood, intent, or symbolism.
4. Do not invent proper nouns. If a brand, person, or place is not named in the provided context, do not name it.
5. Do not include the word "alt text" or any meta-commentary.
6. Match the locale exactly. If locale is hr, write in Croatian. If de, German. Never mix languages in one string.
7. If the image appears purely decorative (abstract texture, generic background, divider), respond with the single token: DECORATIVE
8. If the image is unreadable, blurred, or you cannot describe it with confidence, respond with the single token: UNCERTAIN
Return only the alt text. No quotes, no prefix, no explanation.
`
```

Rule 7 and Rule 8 are the most important. They are the model's escape hatches. Without them, the model invents: it will confidently describe a blurred image, or write four sentences of "description" about a beige rectangle. With them, the validator can route `DECORATIVE` to `altSource: 'decorative'` and `UNCERTAIN` to a human review queue. This is the difference between a pipeline that runs unattended and a pipeline that requires babysitting.
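The routing of the two sentinel tokens can be sketched as a small pure function that runs before any length or prefix checks. This is an illustration, not the shipped validator; `routeModelOutput` and the `Routed` type are hypothetical names, and treating empty output as a review case is an added assumption:

```typescript
type Routed =
  | { kind: 'decorative'; data: { alt: string; altSource: 'decorative' } }
  | { kind: 'needs-review'; reason: string }
  | { kind: 'candidate'; text: string }

function routeModelOutput(raw: string): Routed {
  const text = raw.trim()

  // Rule 7: decorative images get an empty alt and their own altSource value.
  if (text === 'DECORATIVE') {
    return { kind: 'decorative', data: { alt: '', altSource: 'decorative' } }
  }

  // Rule 8 (and empty output): hand off to a human instead of retrying blindly.
  if (text === 'UNCERTAIN' || text.length === 0) {
    return { kind: 'needs-review', reason: text ? 'model-uncertain' : 'empty-output' }
  }

  // Anything else is a real description that still has to pass validation.
  return { kind: 'candidate', text }
}

const routed = routeModelOutput('  DECORATIVE\n')
```

Splitting routing from validation keeps the two escape hatches out of the retry path: a retry rarely turns an unreadable image into a readable one.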
Validators that catch the failure modes
We have seen every one of these in production. The validator is short, but each rule maps to a specific real-world incident.
```typescript
// lib/validateAlt.ts
const BANNED_PREFIXES = [
  'image of', 'picture of', 'photo of', 'photograph of',
  'a photo', 'a photograph', 'an image', 'a picture', 'this image',
  'slika', 'fotografija', 'bild', 'foto',
]

export function validateAlt(raw: string, { locale }: { locale: string }) {
  // Strip whitespace and one layer of wrapping quotes or backticks
  const alt = raw.trim().replace(/^["'`]|["'`]$/g, '')

  // Sentinel tokens from rules 7 and 8 of the prompt
  if (alt === 'DECORATIVE') return { ok: true, alt: '', decorative: true }
  if (alt === 'UNCERTAIN') return { ok: false, reason: 'model-uncertain' }

  if (alt.length < 8) return { ok: false, reason: 'too-short' }
  // Note: the prompt asks for 160 characters; this check allows up to 180
  if (alt.length > 180) return { ok: false, reason: 'too-long' }

  const lower = alt.toLowerCase()
  for (const prefix of BANNED_PREFIXES) {
    // Match whole words only, so "foto" rejects "Foto von ..." but not "Fotografin ..."
    const next = lower.charAt(prefix.length)
    if (lower.startsWith(prefix) && !/\p{L}/u.test(next)) {
      return { ok: false, reason: `banned-prefix:${prefix}` }
    }
  }

  // Locale drift: cheap heuristic, not perfect, catches obvious cases
  if (locale === 'hr' && /\bthe\b|\band\b|\bof\b/i.test(alt)) {
    return { ok: false, reason: 'locale-drift-en-in-hr' }
  }

  // Common hallucination signature
  if (/©|all rights reserved|getty|shutterstock/i.test(alt)) {
    return { ok: false, reason: 'hallucinated-attribution' }
  }

  return { ok: true, alt }
}
```

The locale drift check is crude but it has caught real regressions. We hit this on a Payload + Claude project where the locale parameter was being passed correctly to the API but the model still drifted to English on Croatian product photography. The visual content was "universal" enough that the model defaulted to its strongest language. The validator catches it; the job retries with a stronger locale instruction in the user message; the second attempt holds. Without the validator, those English strings would have shipped to a Croatian audience.
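The "stronger locale instruction" on retry is not shown in the worker snippet. One way it could be sketched, assuming the attempt count is passed in (in a BullMQ worker it would plausibly come from `job.attemptsMade`); `buildUserText`, the language map, and the wording of the extra instruction are hypothetical:

```typescript
// Assumed mapping from locale code to a language name the model can be told explicitly.
const LANGUAGE_NAMES: { [locale: string]: string | undefined } = {
  en: 'English',
  hr: 'Croatian',
  de: 'German',
  sl: 'Slovenian',
}

interface UserTextInput {
  locale: string
  summary: string
  role: string
  attempt: number // 0 on the first try, 1+ on retries
}

function buildUserText({ locale, summary, role, attempt }: UserTextInput): string {
  const base = `Locale: ${locale}\nContext: ${summary}\nRole: ${role}`
  // First attempt: same user message as the worker snippet.
  if (attempt === 0) return base

  // Retry: name the language explicitly instead of relying on the locale code.
  const language = LANGUAGE_NAMES[locale] ?? locale
  return `${base}\nIMPORTANT: a previous answer was rejected. Write the alt text in ${language} only.`
}

const firstTry = buildUserText({ locale: 'hr', summary: 'Product page', role: 'inline', attempt: 0 })
const retry = buildUserText({ locale: 'hr', summary: 'Product page', role: 'inline', attempt: 1 })
```

Escalating only on retry keeps the common path short and spends the extra tokens on the minority of images that actually drift.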
Backfilling 40,000 legacy images
Net-new images are easy — the hook fires, the queue handles it. The backfill is where teams blow their token budget. We ship a one-off script that pages through the Media collection, filters on `altSource: 'pending'` and missing/empty `alt`, and enqueues in chunks with a delay between batches to stay under rate limits.
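```typescript
// scripts/backfillAltText.ts
import { getPayload } from 'payload'
import config from '@payload-config'
import { altTextQueue } from '../queue/altText'

const BATCH = 100
const DELAY_MS = 1000

async function main() {
  const payload = await getPayload({ config })
  let page = 1
  let enqueued = 0

  // Caveat: if workers are already running, they write `alt` while this loop pages,
  // so the filtered set shrinks and later pages can skip documents.
  // Running the enqueue with the worker paused avoids gaps.
  while (true) {
    const { docs, hasNextPage } = await payload.find({
      collection: 'media',
      where: {
        and: [
          { mimeType: { contains: 'image/' } },
          { or: [{ alt: { exists: false } }, { alt: { equals: '' } }] },
          // Decorative images legitimately have an empty alt; do not re-enqueue them
          { altSource: { not_in: ['human', 'human-edited-ai', 'decorative'] } },
        ],
      },
      limit: BATCH,
      page,
      depth: 0,
    })

    for (const doc of docs) {
      await altTextQueue.add('generate', {
        mediaId: doc.id,
        filename: doc.filename,
        mimeType: doc.mimeType,
        locale: 'en',
      }, {
        delay: enqueued * 50,
        // Same retry policy as the upload hook, so 429s are retried here too
        attempts: 4,
        backoff: { type: 'exponential', delay: 2000 },
      })
      enqueued++
    }

    console.log(`page ${page}: enqueued ${docs.length}, total ${enqueued}`)
    if (!hasNextPage) break
    page++
    await new Promise(r => setTimeout(r, DELAY_MS))
  }
}

main()
  .then(() => process.exit(0))
  .catch((err) => {
    console.error(err)
    process.exit(1)
  })
```

The `delay: enqueued * 50` staggers the queue: instead of every job becoming runnable the moment it is enqueued, jobs are released 50ms apart, a ceiling of 20 per second. That ceiling is well above what the worker can process, so throughput is set by the worker, not the stagger. At a concurrency of 8 this lands around 160 images/minute (roughly 3 seconds per image), or ~4 hours for 40,000 images.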
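The throughput figure is simple arithmetic and worth being able to rerun for other archive sizes. A small sketch; the 3 seconds per image is an assumption back-derived from 160 images/minute at a concurrency of 8, not a separately measured number:

```typescript
interface BackfillInput {
  images: number
  concurrency: number
  secondsPerImage: number // assumed end-to-end time per job
  staggerMs: number       // spacing between job release times
}

function estimateBackfill({ images, concurrency, secondsPerImage, staggerMs }: BackfillInput) {
  // How fast the worker can drain jobs.
  const workerPerMinute = (concurrency * 60) / secondsPerImage
  // How fast the staggered delays release jobs.
  const releasePerMinute = 60_000 / staggerMs
  // The slower of the two is the real rate.
  const perMinute = Math.min(workerPerMinute, releasePerMinute)
  return { perMinute, hours: images / perMinute / 60 }
}

const estimate = estimateBackfill({
  images: 40_000,
  concurrency: 8,
  secondsPerImage: 3,
  staggerMs: 50,
})
// estimate.perMinute is 160; estimate.hours is about 4.2
```

With these inputs the stagger would allow 1,200 jobs per minute, so the worker is the bottleneck; raising `staggerMs` above 375 would flip that.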
Cost model: what 10,000 images actually runs
Numbers below are estimates based on Claude Haiku 4.5 vision pricing as of late 2025 (see Anthropic pricing) and our measured average of ~1,200 input tokens (image + prompt + context) and ~60 output tokens per call. Verify before quoting to a CFO; vendor pricing moves.
Claude Haiku 4.5 · ~1,260 tokens/image · estimated ~$15 per 10k images at list prices of $1 / $5 per million input / output tokens · our default for archive backfill and generic editorial
Claude Sonnet 4.5 · same token shape · estimated ~$45 per 10k images at list prices of $3 / $15 per million input / output tokens · we use this for regulated verticals or hero imagery where description quality matters
Gemini 2.5 Flash · comparable per-image cost to Haiku · we use it as a fallback when Anthropic rate limits hit during large backfills
GPT-4o-mini · close to Haiku on cost · we have shipped it on two projects but prefer Claude for the `DECORATIVE`/`UNCERTAIN` rule-following
The headline: a 10,000-image backfill on Haiku costs less than a takeaway lunch. A 40,000-image backfill costs less than half a freelance copywriting day. This is why we say alt text has the highest ROI of any AI workflow we ship — the alternative is genuinely tens of thousands of euros in editorial time or a permanent compliance gap.
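Because vendor pricing moves, it helps to keep the estimate as a formula. A minimal sketch using the token averages quoted above; the per-million-token prices in the usage example are assumed list prices and should be replaced with whatever the current rate card says:

```typescript
interface CostInput {
  images: number
  inputTokensPerImage: number
  outputTokensPerImage: number
  inputUsdPerMTok: number  // price per million input tokens
  outputUsdPerMTok: number // price per million output tokens
}

function estimateCostUsd(c: CostInput): number {
  const inputUsd = ((c.images * c.inputTokensPerImage) / 1_000_000) * c.inputUsdPerMTok
  const outputUsd = ((c.images * c.outputTokensPerImage) / 1_000_000) * c.outputUsdPerMTok
  return inputUsd + outputUsd
}

// Assumed prices for illustration only; verify before quoting.
const per10k = estimateCostUsd({
  images: 10_000,
  inputTokensPerImage: 1_200,
  outputTokensPerImage: 60,
  inputUsdPerMTok: 1,
  outputUsdPerMTok: 5,
})
```

Input tokens dominate here because the image itself is billed as input, so the input price matters far more than the output price for this workload.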
Where we still refuse to automate
This is the part vendors will not tell you. Three image classes still get human-written alt text on every project we ship, regardless of budget.
Editorial photography with named subjects. If the image shows a person, place, or product that the article identifies by name, the model is one context-injection-failure away from inventing or misattributing. Humans write these.
Medical, legal, and financial imagery. A diagram of a procedure, a chart in a regulated filing, a piece of pharmaceutical packaging — the regulatory cost of a hallucinated description outweighs every euro the automation saves. Humans write these, with sign-off.
Locales below a confidence threshold. Our Croatian and Slovenian clients keep human review on Croatian and Slovenian alt text. The model is competent in both, but "competent" is not "native-grade," and these audiences notice. Editors still edit; the AI provides the first draft they correct.
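These three refusals can be encoded as a gate that runs before anything is enqueued. A sketch under stated assumptions: the `ImageMeta` fields are hypothetical metadata the project would have to supply (for example from the referencing collection), and the locale list reflects only the two locales named above:

```typescript
type Vertical = 'general' | 'medical' | 'legal' | 'financial'
type AltPolicy = 'human-only' | 'ai-draft-human-review' | 'ai'

// Hypothetical metadata; how it is derived depends on the project.
interface ImageMeta {
  hasNamedSubject: boolean
  vertical: Vertical
  locale: string
}

// Locales where editors review every AI draft (assumed from the text above).
const REVIEW_LOCALES = new Set(['hr', 'sl'])

function altPolicyFor(meta: ImageMeta): AltPolicy {
  // Named people, places, or products: risk of misattribution.
  if (meta.hasNamedSubject) return 'human-only'
  // Regulated imagery: a hallucinated description costs more than it saves.
  if (meta.vertical !== 'general') return 'human-only'
  // Competent but not native-grade: AI drafts, humans correct.
  if (REVIEW_LOCALES.has(meta.locale)) return 'ai-draft-human-review'
  return 'ai'
}

const policy = altPolicyFor({ hasNamedSubject: false, vertical: 'general', locale: 'hr' })
```

Putting the gate in code means the refusal list is reviewed in pull requests rather than living in someone's head.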
What we ship on day one vs month three
The 80/20 we give clients: do not try to ship the whole pipeline in week one. Ship the schema, the hook, the queue, the worker, the validator, and English. That is week one to week three on a Payload codebase the team already has.
Day one to week three · Media collection fields, upload hook, BullMQ worker, English-only Claude Haiku, basic validator. Backfill runs over a weekend.
Month two · Additional locales with stronger prompt-side instructions, locale-drift validator, human review queue for `UNCERTAIN` outputs.
Month three · Context injection from referencing documents (article body, product description), per-collection prompt overrides, `altReviewedAt` workflow for compliance reporting.
Never on day one · Multi-model fallback, per-image cost dashboards, fine-tuning. Operators ask for these. They are not where the ROI is.
What we ship by default now
On every Payload project we start in 2025, the Media collection ships with `alt`, `altSource`, `altLocale`, and `altReviewedAt` from commit one — even if the AI worker comes online in month three. The schema cost is zero, and it means we never have to run a migration on a six-figure media table later. The hook is a fifteen-line addition. The worker is a separate process the client's ops team can scale independently.
If you are evaluating Payload for a content-heavy build and want to see what we wire into every project by default, see how we ship Payload CMS builds. The alt-text pipeline is one of about a dozen patterns we ship on day one.
If you have a Payload archive between 5,000 and 50,000 images and a WCAG or SEO mandate hanging over Q1, send us your Media schema and archive size, and we will tell you what this pipeline costs to wire on your stack and how long the backfill runs.
We have shipped this shape on media archives, D2C product catalogs, and editorial publishers. The pattern is the same; the prompt context and the refused locales change per vertical. If your team is staring at a five-figure image table and an auditor's checklist, this is the cheapest credible thing you can ship in a quarter.
Questions operators ask next
Does this pattern work with Payload's Local API writes, or only with admin uploads?
Both. The `afterChange` hook fires on Local API `create` calls too, so programmatic imports — a one-off migration script, a CMS-to-CMS sync, a CDN backfill — enqueue alt-text jobs the same way an editor upload does. The only thing to watch is `req.file` being undefined on Local API calls; gate the enqueue on `operation === 'create'` instead, as the snippet does.
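The gate described here is easy to pull out of the hook as a pure function, which makes the Local API and admin-upload cases unit-testable without a running Payload instance. A sketch; `shouldEnqueue` and `HookInput` are hypothetical names, and `hasNewFile` stands in for the `req.file` check:

```typescript
type AltSource = 'pending' | 'ai' | 'human' | 'human-edited-ai' | 'decorative'

interface HookInput {
  operation: 'create' | 'update'
  hasNewFile: boolean // stands in for Boolean(req.file)
  altSource?: AltSource | null
}

function shouldEnqueue({ operation, hasNewFile, altSource }: HookInput): boolean {
  // Updates without a new file are metadata edits, including the worker's own write-back.
  if (operation !== 'create' && !hasNewFile) return false
  // Never overwrite human work or an explicit decorative flag.
  return altSource !== 'human' && altSource !== 'human-edited-ai' && altSource !== 'decorative'
}

// A Local API import: no req.file, but operation is 'create'.
const localApiCreate = shouldEnqueue({ operation: 'create', hasNewFile: false, altSource: 'pending' })
```

The first branch is also what stops the pipeline from looping: the worker's `payload.update` is an update without a new file, so it does not enqueue another job.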
How do we handle Claude Vision rate limits during a 40,000-image backfill?
Two levers. First, BullMQ concurrency — we land at 8 workers for most org-tier Anthropic keys; bump higher only if you have a dedicated key. Second, the `delay` parameter on enqueue smooths the job rate so Redis does not flood the worker. If you still hit 429s, BullMQ's exponential backoff (4 attempts, 2s base) absorbs them; failed jobs stay in the queue for inspection rather than silently disappearing.
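How much rate-limit pressure "4 attempts, 2s base" actually absorbs can be worked out by hand. The sketch below assumes the retry delay is 2^(attemptsMade − 1) × base delay, which is how BullMQ's documentation describes its built-in exponential strategy; treat the formula as an assumption and check it against the version you run:

```typescript
// Delays between attempts for an exponential backoff policy.
// `attempts` counts the first try, so there are attempts - 1 retries.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = []
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    delays.push(Math.round(2 ** (attemptsMade - 1) * baseDelayMs))
  }
  return delays
}

const delays = retryDelaysMs(4, 2000) // [2000, 4000, 8000]
const totalWaitMs = delays.reduce((sum, d) => sum + d, 0) // 14000
```

Under that assumption a job survives roughly 14 seconds of sustained 429s before it lands in the failed set, so a longer rate-limit window calls for more attempts or a larger base delay rather than more concurrency.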
Can the same pipeline generate captions or long-form descriptions, not just alt text?
Yes, with a separate field and a separate prompt. We sometimes ship a `caption` field alongside `alt` for image-heavy editorial: the alt stays under the 160-character cap the prompt enforces (WCAG itself sets no length limit, but short strings read better in a screen reader), while the caption can run longer and serve as marketing copy. Use two separate jobs (or a single job with two `messages.create` calls) rather than one prompt asking for both; the model follows length rules more reliably when each call has one job.
What does this cost annually for a publisher uploading 400 images per week?
On Claude Haiku 4.5 at list prices of $1 / $5 per million input / output tokens and ~1,260 tokens per image, 400 images per week is about 20,800 images per year, or roughly $30 per year in inference. The backfill is a one-off cost: for a 40,000-image archive, roughly $60 total. Verify against current pricing before budgeting. The dominant cost is engineering time to wire and tune the pipeline (typically 1–2 weeks on an existing Payload codebase), not tokens.
Will this still work when we move from Payload 3 to a future major version?
The hook signature (`CollectionAfterChangeHook`) and Local API surface have been stable across Payload 3.x. The pipeline does not depend on any internal APIs — only the documented hook contract, the Local API `update`/`find` methods, and the standard `CollectionConfig` shape. Future-proofing-wise, the queue and worker are entirely outside Payload's lifecycle, so a Payload upgrade rarely touches them.
How do we audit which alt strings are AI-generated versus human-written?
The `altSource` enum is the audit trail — query `altSource: 'ai'` for a count of unreviewed AI strings, `altSource: 'human-edited-ai'` for strings a human improved, and `altSource: 'human'` or `'decorative'` for the rest. Pair this with `altReviewedAt` to show an auditor exactly how much of the archive has been signed off and when. We typically expose this as a small Payload custom view for the compliance team.
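The compliance view boils down to a handful of counts. A sketch of the aggregation over already-fetched Media documents; `summarizeAltAudit` and `AuditDoc` are hypothetical names, and on a large table the same counts would more sensibly come from count queries than from loading every document:

```typescript
type AltSource = 'pending' | 'ai' | 'human' | 'human-edited-ai' | 'decorative'

// Subset of the Media fields the report needs.
interface AuditDoc {
  altSource: AltSource
  altReviewedAt?: string | null
}

function summarizeAltAudit(docs: AuditDoc[]) {
  const bySource: Record<AltSource, number> = {
    pending: 0,
    ai: 0,
    human: 0,
    'human-edited-ai': 0,
    decorative: 0,
  }
  let reviewed = 0
  let unreviewedAi = 0

  for (const doc of docs) {
    bySource[doc.altSource]++
    if (doc.altReviewedAt) reviewed++
    // AI strings nobody has signed off yet: the number auditors ask about first.
    if (doc.altSource === 'ai' && !doc.altReviewedAt) unreviewedAi++
  }

  return { total: docs.length, bySource, reviewed, unreviewedAi }
}

const report = summarizeAltAudit([
  { altSource: 'ai', altReviewedAt: null },
  { altSource: 'ai', altReviewedAt: '2025-11-03T09:00:00.000Z' },
  { altSource: 'human-edited-ai', altReviewedAt: '2025-11-04T10:30:00.000Z' },
  { altSource: 'decorative' },
])
```

The sample timestamps are made up for the example; only their presence or absence matters to the counts.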
Pull quote
Alt text is the AI workflow operators underestimate and editors hate writing. It is also the one where a vision model earns its tokens on day one.