Streaming Claude from Server Actions into Payload: The Resume-on-Refresh Pattern We Ship

Streaming Claude from a Next.js Server Action looks fine on localhost and falls apart the first week in production. Here is the shape we ship instead: kick off the job, stream into Postgres via Payload, let the client subscribe to the row.

19 Jun 2026 · 8 min read · By Krešimir Galić · Founder & Principal Engineer

*The demo streams Claude beautifully. Then a user refreshes the tab and the message is gone forever. This is the wiring we ship instead.*

Every Claude integration we ship on Next.js + Payload hits the same week-three failure mode. The demo works. The Loom is gorgeous. Then a real user opens the tool, asks Claude a 4,000-token question, refreshes the tab to check something — and the answer is gone. Not paused. Not resumable. Gone. The Server Action that was streaming server-sent events to their browser closed the moment they navigated, and nothing on the server ever wrote a row.

We have wired this pattern six times now — support copilots on Medusa, editorial drafting tools inside Payload, internal RAG dashboards on Postgres — and the shape that survives production is not the shape the Vercel AI SDK quickstart shows you. The quickstart streams Claude's tokens straight from a Server Action to the client. That works for a chat toy. It does not work when a Head of Content closes their laptop mid-generation and expects the draft to be there in the morning, and it does not work when Vercel's function ceiling guillotines a 90-second generation at 60 seconds with no warning.

This is the wiring we now ship by default: the Server Action kicks off the job, Claude streams into Postgres via the Payload Local API, and the client subscribes to the row. The browser never owns the stream. Postgres does. Below is the schema, the worker, the client subscription, and the failure modes we instrument before any of it ships.

Why we stopped streaming Claude directly to the client

The Vercel AI SDK pattern — `streamText` inside a Server Action, returning a `ReadableStream` the client consumes with `useChat` — is fine for the first sprint. We have shipped it. We have also been bitten by it on every project that survived past the prototype demo:

Refresh kills the message. The stream lives in the HTTP response. Close the tab, refresh, switch networks on mobile — the Server Action's writer is gone and nothing wrote the partial output anywhere durable.
Vercel function ceiling. Serverless functions run against a hard duration limit: 60 seconds in the configuration that bit us (raising `maxDuration` on Pro or enabling Fluid Compute pushes this further, but you are still on a clock). A Claude Sonnet response running tool calls plus a long completion routinely crosses 45 seconds. We have watched it die at 59.8s in front of a CFO.
Double-submit costs real money. A user double-clicks the submit button, the Server Action fires twice, you pay for two completions. On Claude Sonnet 4.5 at current pricing that is roughly $0.20–$0.60 per accidental duplicate on a longer prompt. Multiply by a content team of twelve.
No audit trail. When the support team asks 'what did Claude say to this customer at 14:32?', you have nothing. The bytes streamed through your serverless function and were never persisted.

The fix is not better client code. The fix is making Postgres — via Payload — the single source of truth for every token Claude emits, from the first chunk.

The shape we ship

Three moving parts. The Server Action is thin. The worker is durable. The client polls or subscribes to a row.

Server Action receives the user's prompt, hashes it together with the conversation id and user id to produce a `requestHash`, creates an `aiMessages` row in `pending` status via the Payload Local API, triggers the worker, and returns the id to the client.
Worker (a separate route handler on a longer-lived runtime, or a background job) flips the row to `streaming`, opens the Anthropic stream, writes batched chunks to the row's `chunks` jsonb array via the Payload Local API, and sets the status to `complete` (or `error`) at the end.
Client receives the id from the Server Action, then subscribes to the row — polling every 500ms, or via SSE from a thin `/api/messages/[id]/subscribe` route, or via Supabase Realtime if the project already uses it.

Refresh the tab, the row is still there. Close the laptop, the worker keeps writing. Reopen on mobile, the client subscribes to the same id and renders whatever Postgres holds — partial, complete, or errored.

The Payload collection we ship

This is the collection definition we drop in on day one. It is intentionally boring. Status enum, chunks as jsonb, a `requestHash` we index for idempotency, audit fields the support team reads when something goes sideways.

TypeScript

import type { CollectionConfig } from 'payload'

export const AiMessages: CollectionConfig = {
  slug: 'aiMessages',
  admin: { useAsTitle: 'id', defaultColumns: ['status', 'model', 'tokensOut', 'createdAt'] },
  access: {
    read: ({ req }) => Boolean(req.user),
    create: () => false, // only server code creates these
    update: () => false,
    delete: ({ req }) => req.user?.role === 'admin',
  },
  fields: [
    { name: 'conversationId', type: 'text', required: true, index: true },
    { name: 'userId', type: 'relationship', relationTo: 'users', required: true, index: true },
    { name: 'requestHash', type: 'text', required: true, unique: true, index: true },
    { name: 'model', type: 'text', required: true }, // 'claude-sonnet-4-5'
    { name: 'prompt', type: 'textarea', required: true },
    {
      name: 'status',
      type: 'select',
      required: true,
      defaultValue: 'pending',
      options: ['pending', 'streaming', 'complete', 'error', 'timeout'],
      index: true,
    },
    { name: 'chunks', type: 'json', defaultValue: [] }, // append-only token chunks
    { name: 'output', type: 'textarea' }, // populated on complete
    { name: 'tokensIn', type: 'number' },
    { name: 'tokensOut', type: 'number' },
    { name: 'costUsd', type: 'number' },
    { name: 'errorMessage', type: 'textarea' },
    { name: 'startedAt', type: 'date' },
    { name: 'completedAt', type: 'date' },
  ],
  timestamps: true,
}

Two design choices worth defending. First, `requestHash` is unique — a double-click cannot create two rows for the same prompt in the same conversation, which means it cannot trigger two completions. Second, `chunks` is jsonb, not a relationship to a `messageChunks` collection. We tried the relational shape on one project; the write amplification under streaming load (one row insert per token chunk, with Payload running access checks each time) crushed write latency. A single jsonb append per N tokens is two orders of magnitude cheaper.

The Server Action: thin, idempotent, fast

The Server Action does four things and exits. It does not stream. It does not wait for Claude. It returns within ~80ms so the UI can render the empty message bubble immediately.

TypeScript

'use server'

import { getPayload } from 'payload'
import config from '@payload-config'
import { createHash } from 'crypto'
import { headers } from 'next/headers'
import { revalidatePath } from 'next/cache'

export async function startClaudeMessage(input: {
  conversationId: string
  prompt: string
}): Promise<{ id: string; resumed: boolean }> {
  const payload = await getPayload({ config })
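  // `headers()` is async in Next.js 15; awaiting it is harmless on 14.
  // `getUserFromHeaders` is this project's own auth helper (not shown).
  const user = await getUserFromHeaders(await headers())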
  if (!user) throw new Error('unauthorized')

  const requestHash = createHash('sha256')
    .update(`${input.conversationId}:${user.id}:${input.prompt}`)
    .digest('hex')

  // Idempotency: if this exact request is in flight or complete, return the existing row.
  const existing = await payload.find({
    collection: 'aiMessages',
    where: { requestHash: { equals: requestHash } },
    limit: 1,
  })
  if (existing.docs[0]) {
    return { id: String(existing.docs[0].id), resumed: true }
  }

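  // The unique index on requestHash is the real guard. If a concurrent call wins the
  // race between the find above and this create, fall back to the row it created.
  let row
  try {
    row = await payload.create({
      collection: 'aiMessages',
      data: {
        conversationId: input.conversationId,
        userId: user.id,
        requestHash,
        model: 'claude-sonnet-4-5',
        prompt: input.prompt,
        status: 'pending',
        chunks: [],
        startedAt: new Date().toISOString(),
      },
    })
  } catch (err) {
    const raced = await payload.find({
      collection: 'aiMessages',
      where: { requestHash: { equals: requestHash } },
      limit: 1,
    })
    if (raced.docs[0]) return { id: String(raced.docs[0].id), resumed: true }
    throw err
  }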

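  // Hand off to the worker. The worker endpoint should acknowledge immediately
  // (for example with a 202) and keep running; we await only that acknowledgement,
  // never the completion.
  await fetch(`${process.env.WORKER_URL}/run`, {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-worker-secret': process.env.WORKER_SECRET!,
    },
    body: JSON.stringify({ id: row.id }),
  })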

  revalidatePath(`/conversations/${input.conversationId}`)
  return { id: String(row.id), resumed: false }
}

Note what is missing: there is no `await streamText(...)`, no `ReadableStream` returned to the client, no `useChat` hook on the other end. The Server Action's only job is to write a durable row and hand off.
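The hash step above can be pulled into a small helper, which also leaves room for an explicit regenerate action. This is a minimal sketch, not the code we ship: `buildRequestHash` and the optional `nonce` field are hypothetical names, and the tuple is JSON-encoded instead of joined with `:` so that a colon inside one field cannot blur the boundary with the next.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical helper. The first three fields mirror the Server Action's hash input;
// `nonce` is an assumption for an explicit "regenerate" path.
type HashInput = {
  conversationId: string
  userId: string
  prompt: string
  nonce?: string // set only when the user deliberately asks again
}

export function buildRequestHash(input: HashInput): string {
  // JSON-encode the tuple so field boundaries stay unambiguous.
  const material = JSON.stringify([
    input.conversationId,
    input.userId,
    input.prompt,
    input.nonce ?? '',
  ])
  return createHash('sha256').update(material).digest('hex')
}
```

Without a nonce, identical (conversation, user, prompt) tuples still collapse to one hash, so the double-click protection is unchanged. A regenerate button would pass a fresh nonce and get a new row.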

The worker: where Claude actually streams

The worker is where the runtime decision matters. Three options, ranked by how often we pick them:

A dedicated Node route on Railway or Fly — our default when streams routinely exceed 30 seconds. No function ceiling, predictable memory, can be scaled independently of the web app. Costs $5–$20/mo for low volume.
A Next.js route handler with `maxDuration = 300` on Vercel Pro — fine for completions that finish under five minutes. We have shipped this on internal tools where the upgrade to Railway was not worth the operational overhead.
Inngest, Trigger.dev, or BullMQ on a dedicated worker — when we need retries, scheduling, fan-out, or observability beyond what a single route gives us. We reach for this on projects with more than ~1,000 AI calls per day.

The worker body itself is straightforward. It uses the Anthropic Messages streaming API and the Payload Local API to append chunks. The only real subtlety is the token-rate guard — we batch writes so we are not hitting Postgres on every single token.

TypeScript

import Anthropic from '@anthropic-ai/sdk'
import { getPayload } from 'payload'
import config from '@payload-config'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! })
const FLUSH_EVERY_MS = 250
const MAX_CHUNK_CHARS = 32 // buffer.length counts characters, not tokens

export async function runMessage(messageId: string) {
  const payload = await getPayload({ config })
  const row = await payload.findByID({ collection: 'aiMessages', id: messageId })
  // Best-effort guard against a duplicate trigger. This read-then-write check is
  // not atomic, so it only covers the common case.
  if (!row || row.status !== 'pending') return

  await payload.update({
    collection: 'aiMessages',
    id: messageId,
    data: { status: 'streaming' },
  })

  let buffer = ''
  // Every chunk produced so far. `row` is a snapshot taken before streaming started,
  // so each flush writes this running array instead of spreading row.chunks.
  const allChunks: string[] = []
  let unflushed = 0
  let lastFlush = Date.now()
  let tokensOut = 0 // counts delta events: a rough proxy until the final usage arrives

  const flush = async () => {
    if (unflushed === 0) return
    unflushed = 0
    await payload.update({
      collection: 'aiMessages',
      id: messageId,
      data: { chunks: [...allChunks], tokensOut },
    })
    lastFlush = Date.now()
  }

  try {
    const stream = anthropic.messages.stream({
      model: 'claude-sonnet-4-5',
      max_tokens: 4096,
      messages: [{ role: 'user', content: row.prompt }],
    })

    for await (const event of stream) {
      if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
        buffer += event.delta.text
        tokensOut += 1
        if (buffer.length >= MAX_CHUNK_CHARS) {
          allChunks.push(buffer)
          unflushed += 1
          buffer = ''
        }
        if (Date.now() - lastFlush > FLUSH_EVERY_MS) await flush()
      }
    }
    // Pick up whatever is left in the buffer, then write one last time.
    if (buffer) {
      allChunks.push(buffer)
      unflushed += 1
    }
    await flush()

    const final = await stream.finalMessage()
    // flatMap keeps only text blocks and lets TypeScript narrow the block type.
    const output = final.content.flatMap(c => (c.type === 'text' ? [c.text] : [])).join('')

    await payload.update({
      collection: 'aiMessages',
      id: messageId,
      data: {
        status: 'complete',
        output,
        tokensIn: final.usage.input_tokens,
        tokensOut: final.usage.output_tokens,
        costUsd: calculateCost(final.usage), // calculateCost is a project helper (not shown)
        completedAt: new Date().toISOString(),
      },
    })
  } catch (err) {
    await payload.update({
      collection: 'aiMessages',
      id: messageId,
      data: {
        status: 'error',
        errorMessage: err instanceof Error ? err.message : String(err),
        completedAt: new Date().toISOString(),
      },
    })
  }
}

The flush logic is the part we tune per project. 250ms is the default — short enough that a polling client sees the message growing in real time, long enough that we are not hammering Postgres with hundreds of writes per second on a long generation.
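To make the two tuning knobs concrete, here is a minimal sketch of the batching rule on its own, separated from Payload and the Anthropic SDK. The name `createChunkBatcher` and the injected clock are our assumptions for illustration; the thresholds play the same roles as the chunk-size and flush-interval constants in the worker.

```typescript
// Hypothetical helper: the clock is injected so the rule can be exercised without waiting.
type Clock = () => number

export function createChunkBatcher(
  maxChunkChars: number,
  flushEveryMs: number,
  now: Clock = Date.now,
) {
  const all: string[] = [] // every sealed chunk so far
  let buffer = ''
  let dirty = false // true when `all` holds chunks not yet handed out
  let lastFlush = now()

  const seal = () => {
    if (!buffer) return
    all.push(buffer)
    buffer = ''
    dirty = true
  }
  const snapshot = () => {
    dirty = false
    lastFlush = now()
    return [...all]
  }

  return {
    // Add a text delta. Returns the full chunk array when a write is due, else null.
    push(delta: string): string[] | null {
      buffer += delta
      if (buffer.length >= maxChunkChars) seal()
      const due = now() - lastFlush >= flushEveryMs
      return due && dirty ? snapshot() : null
    },
    // Call once when the stream ends to include whatever is still buffered.
    finish(): string[] {
      seal()
      return snapshot()
    },
  }
}
```

In a worker, a non-null return from `push` would become the `chunks` value of the next row update. Raising the interval trades visible latency on the client for fewer writes.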

The client: polling beats SSE more often than you think

There are three reasonable ways for the client to read the row. We default to polling. Here is why.

Polling every 400–600ms — one line of code with SWR or TanStack Query, works through every proxy and corporate firewall, survives mobile network switches, costs ~2 cheap Postgres reads per second per active user. Default.
SSE from a dedicated route — lower perceived latency, no polling overhead, but you are back to long-lived connections that die on Vercel function timeouts. Only worth it on a Node runtime, only when polling latency is a measured problem.
Supabase Realtime / Postgres LISTEN/NOTIFY — clean, push-based, but adds a dependency and requires care around connection limits. Worth it only if the project already uses Supabase for other reasons.

TSX

'use client'

import useSWR from 'swr'

type AiMessage = {
  id: string
  status: 'pending' | 'streaming' | 'complete' | 'error' | 'timeout'
  chunks: string[]
  output?: string
  errorMessage?: string
}

export function StreamingMessage({ messageId }: { messageId: string }) {
  const { data } = useSWR<AiMessage>(
    `/api/messages/${messageId}`,
    (url) => fetch(url).then(r => r.json()),
    {
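      // Stop polling once the row reaches any terminal status, including `timeout`.
      refreshInterval: (latest) =>
        latest && ['complete', 'error', 'timeout'].includes(latest.status) ? 0 : 500,
      revalidateOnFocus: true, // refetch when user returns to the tab
    },
  )

  if (!data) return <div className="opacity-50">Starting…</div>
  if (data.status === 'error' || data.status === 'timeout') {
    return <div className="text-red-600">{data.errorMessage ?? 'Generation timed out.'}</div>
  }

  // Fall back to the streamed chunks if `output` has not been written yet.
  const text = (data.status === 'complete' && data.output) || (data.chunks ?? []).join('')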
  return (
    <div>
      <p className="whitespace-pre-wrap">{text}</p>
      {data.status === 'streaming' && <span className="animate-pulse">▍</span>}
    </div>
  )
}

Resume-on-refresh falls out of the data flow rather than any single option. The user refreshes, the component remounts with the same `messageId`, SWR fetches on mount, and the message renders from whatever Postgres holds: partial, complete, or errored. `revalidateOnFocus: true` covers the neighboring case where the user switches tabs and comes back, because SWR refetches as soon as the tab regains focus. No state to rehydrate. No client-side cache to reconcile. The row is the truth.
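The two decisions the component makes, when to stop polling and which text to show, can be written as pure functions and tested away from React. A minimal sketch; `pollIntervalFor` and `visibleText` are hypothetical names, and the status values are the ones from the collection's `status` field.

```typescript
type Status = 'pending' | 'streaming' | 'complete' | 'error' | 'timeout'
type AiMessageView = { status: Status; chunks?: string[]; output?: string }

// Statuses after which the row will not change again.
const TERMINAL: readonly Status[] = ['complete', 'error', 'timeout']

// 0 tells SWR to stop polling; any other value is the interval in milliseconds.
export function pollIntervalFor(latest: AiMessageView | undefined, activeMs = 500): number {
  return latest && TERMINAL.includes(latest.status) ? 0 : activeMs
}

// Prefer the reconciled `output` once complete, otherwise join whatever chunks exist.
export function visibleText(msg: AiMessageView): string {
  if (msg.status === 'complete' && msg.output) return msg.output
  return (msg.chunks ?? []).join('')
}
```

`pollIntervalFor` has the shape SWR expects for a function-valued `refreshInterval`, so it can be passed there directly.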

The war story

We hit this exact failure on a Payload + Claude editorial tool last year. The first version used the AI SDK quickstart: the Server Action returned a stream and the client consumed it. It worked beautifully in QA. Two days after launch, the content lead Slacked us: 'I lost a 1,200-word draft because I switched tabs to check a source.' We dug in. The Server Action had been killed when she navigated, no row had been written, the tokens were gone. The fix took three days: the schema above, a worker on Railway, polling on the client. Two months in, the team had not lost a single draft to a tab switch, and the support copilot we wired on top of the same pattern survived its first Vercel timeout without anyone noticing, because the worker kept writing and the client kept polling.

Failure modes we instrument

The pattern is not done when the happy path works. It is done when the four ugly paths are visible in the Payload admin and alerting somewhere.

Abandoned streams — a cron checks for rows in `streaming` status with `startedAt` older than 10 minutes, flips them to `timeout`, alerts. Catches worker crashes and silently dropped Anthropic connections.
Partial completions — the row is `complete` but `output` is shorter than the sum of chunks. Indicates the final-message reconciliation failed. Rare but worth catching.
Token overruns — `tokensOut` approaching `max_tokens`. Either the model is rambling or the prompt needs a tighter stop condition. We surface this in the admin so editors can tune their own prompts.
Cost spikes — daily aggregate of `costUsd` over a threshold pages the on-call. Cheaper than discovering a runaway loop a week later on the Anthropic invoice.
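Two of the checks above reduce to small pure functions. The sketch below is illustrative and the function names are hypothetical. `staleStreamWhere` builds a Payload-style `where` object using the `equals` and `less_than` query operators, and `isPartialCompletion` compares `output` against the joined chunks.

```typescript
type Row = {
  status: 'pending' | 'streaming' | 'complete' | 'error' | 'timeout'
  startedAt?: string
  chunks?: string[]
  output?: string
}

// Query for rows stuck in `streaming` longer than maxAgeMinutes (10 by default).
export function staleStreamWhere(now: Date, maxAgeMinutes = 10) {
  const cutoff = new Date(now.getTime() - maxAgeMinutes * 60_000).toISOString()
  return {
    and: [
      { status: { equals: 'streaming' } },
      { startedAt: { less_than: cutoff } },
    ],
  }
}

// A `complete` row whose output is shorter than what was streamed needs a look.
export function isPartialCompletion(row: Row): boolean {
  if (row.status !== 'complete') return false
  const streamedLength = (row.chunks ?? []).join('').length
  return (row.output ?? '').length < streamedLength
}
```

A cron job could pass the result of `staleStreamWhere(new Date())` as the `where` argument of a Local API `update` that sets `status: 'timeout'`, then alert on the number of rows touched.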

Where this pattern stops scaling

We have run this shape comfortably up to ~5,000 AI generations per day on a single Postgres instance and a single Railway worker. Past that, three things change. The `chunks` jsonb writes start showing up in slow query logs — we move to a separate `messageChunks` table with append-only inserts and a covering index on `(messageId, sequence)`. The single worker becomes a bottleneck — we move to BullMQ with a Redis queue and 4–8 worker replicas. The polling load on the API route becomes non-trivial — we either move that one route to a longer-running runtime or switch to Supabase Realtime.
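For the first of those moves, the append-only shape is easy to picture as data. The sketch below assumes a row type with `messageId`, `sequence`, and `text`; the type and function names are hypothetical. It shows how a flushed batch becomes rows and how a reader reassembles them in order.

```typescript
// Assumed row shape for a separate chunk table keyed by (messageId, sequence).
type ChunkRow = { messageId: string; sequence: number; text: string }

// Turn one flushed batch into insert-ready rows, continuing from startSequence.
export function toChunkRows(messageId: string, startSequence: number, batch: string[]): ChunkRow[] {
  return batch.map((text, i) => ({ messageId, sequence: startSequence + i, text }))
}

// Rebuild the message text; sorting by sequence makes read order independent of insert order.
export function assembleText(rows: ChunkRow[]): string {
  return [...rows]
    .sort((a, b) => a.sequence - b.sequence)
    .map((r) => r.text)
    .join('')
}
```

Each flush then costs one insert per new chunk instead of a rewrite of the whole array, which is the property that matters once messages get long.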

Until then, the pattern above is roughly 200 lines of code, survives refresh, survives Vercel timeouts, survives double-clicks, and gives the operator team an audit trail their CFO can actually defend.

We wire this pattern on every Payload project that touches Claude or OpenAI. See how we ship Payload + AI workflows for the full shape, the hooks, and the editorial UX we ship around it.

If you are streaming Claude or OpenAI into a Next.js + Payload app and the demo is starting to feel fragile, tell us what you are wiring up: send the schema, the route handler, the failure you are trying to design out. We will tell you what we would change.

On every Payload + Claude project we now ship the `aiMessages` collection, the thin Server Action, and a worker on a longer-runtime as the day-one scaffold — before any prompt engineering, before any UI polish. It is the cheapest insurance we know against the week-three Slack message that starts with 'I just lost…'.

// After the call

Questions operators ask next

Why not just use the Vercel AI SDK's `streamText` directly from a Server Action?
It works for chat toys and short completions. It breaks the moment the user refreshes, switches networks, or the function hits Vercel's duration ceiling. The worker streams with the Anthropic SDK, and the stream's destination is Postgres via the Payload Local API, not the client. The client subscribes to the row.
Does polling every 500ms not hammer Postgres?
Two indexed reads per second per active streaming message is trivial on any Postgres instance from a $20 Neon plan upward — we typically see sub-5ms query times with the `(id)` primary key lookup. The flush cadence on the worker side (one write per 250ms) is where you tune for write load, not the read side.
Can the worker live on the same Vercel deployment, or do we always need Railway / Fly?
If completions reliably finish well inside the configured limit and you do not need long tool-calling loops, a Next.js route handler with `export const maxDuration = 300` on Vercel Pro is fine. We move to Railway or Fly when streams routinely exceed 45s, when we need long-lived background work, or when the operational benefit of independent scaling outweighs the extra deployment target.
How does the `requestHash` idempotency hold up against legitimate retries — same prompt, same user, intentional second ask?
It does not, by design. If a user genuinely wants to regenerate, the UI should include a regenerate action that mutates the conversation id or appends a nonce to the hash input. The default behavior — collapsing identical (conversation, user, prompt) tuples to one row — is what stops accidental double-charges. Make the regenerate path explicit.
Will this pattern survive Next.js 15's caching changes and Server Action revalidation semantics?
Yes. The Server Action only writes a row and triggers a worker; it does not depend on streaming response semantics or `unstable_*` APIs. The `revalidatePath` call invalidates the conversation page so the new message id renders. We have run this on Next.js 14 and 15 without changes — the Anthropic SDK's streaming surface is the only moving part to keep an eye on.
How does this interact with Payload's access control on a multi-tenant build?
The `aiMessages` collection's `read` access reads `req.user` and should be scoped per tenant, typically by adding a `tenant` relationship field and returning a query constraint such as `{ tenant: { equals: req.user.tenant } }` from the access function. Worker writes go through the Local API, which skips access control unless told otherwise (`overrideAccess` defaults to `true`), because the worker runs as a system actor, not as the user, which is the correct boundary to draw.
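The tenant-scoped `read` rule from the last answer could look like the sketch below. It uses minimal local types so it stands alone; in a real project the argument and return types come from Payload, and the `tenant` field on the user is an assumption about your `users` collection. Returning a query object instead of `true` is what makes Payload filter rows, not just allow or deny the request.

```typescript
// Minimal local stand-ins for Payload's types, for illustration only.
type TenantUser = { id: string; tenant?: string }
type AccessArgs = { req: { user?: TenantUser | null } }
type Where = Record<string, unknown>

// Hypothetical access function: signed-in users read only rows from their own tenant.
export const readOwnTenant = ({ req }: AccessArgs): boolean | Where => {
  const user = req.user
  if (!user) return false // anonymous: no access
  if (!user.tenant) return false // no tenant assigned: fail closed
  return { tenant: { equals: user.tenant } }
}
```

This would replace the `read: ({ req }) => Boolean(req.user)` line in the collection once the `tenant` field exists on both collections.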

If the client owns the stream, a refresh kills the message. If Postgres owns the stream, the client is just a window into a row that already exists.

Back to insights

AI Engineering

Streaming Claude from Server Actions into Payload: The Resume-on-Refresh Pattern We Ship

19 Jun 20268 min readBy Krešimir Galić · Founder & Principal Engineer

*The demo streams Claude beautifully. Then a user refreshes the tab and the message is gone forever. This is the wiring we ship instead.*

Why we stopped streaming Claude directly to the client

Refresh kills the message. The stream lives in the HTTP response. Close the tab, refresh, switch networks on mobile — the Server Action's writer is gone and nothing wrote the partial output anywhere durable.
Vercel function ceiling. On the Hobby and Pro plans, serverless functions cap at 60 seconds (Fluid Compute pushes this further but you are still on a clock). A Claude Sonnet response running tool calls plus a long completion routinely crosses 45 seconds. We have watched it die at 59.8s in front of a CFO.
Double-submit costs real money. A user double-clicks the submit button, the Server Action fires twice, you pay for two completions. On Claude Sonnet 4.5 at current pricing that is roughly $0.20–$0.60 per accidental duplicate on a longer prompt. Multiply by a content team of twelve.
No audit trail. When the support team asks 'what did Claude say to this customer at 14:32?', you have nothing. The bytes streamed through your serverless function and were never persisted.

The fix is not better client code. The fix is making Postgres — via Payload — the single source of truth for every token Claude emits, from the first chunk.

The shape we ship

Three moving parts. The Server Action is thin. The worker is durable. The client polls or subscribes to a row.

Server Action receives the user's prompt, hashes it with the conversation id to produce a `requestHash`, creates an `aiMessages` row in `pending` status via Payload Local API, returns the id to the client, and triggers the worker.
Worker (a separate route handler on a longer-runtime, or a background job) opens the Anthropic stream, appends each chunk to the row's `chunks` jsonb array via Payload Local API, flips status to `streaming → complete` (or `error`) at the end.
Client receives the id from the Server Action, then subscribes to the row — polling every 500ms, or via SSE from a thin `/api/messages/[id]/subscribe` route, or via Supabase Realtime if the project already uses it.

The Payload collection we ship

TypeScript

import type { CollectionConfig } from 'payload'

export const AiMessages: CollectionConfig = {
  slug: 'aiMessages',
  admin: { useAsTitle: 'id', defaultColumns: ['status', 'model', 'tokensOut', 'createdAt'] },
  access: {
    read: ({ req }) => Boolean(req.user),
    create: () => false, // only server code creates these
    update: () => false,
    delete: ({ req }) => req.user?.role === 'admin',
  },
  fields: [
    { name: 'conversationId', type: 'text', required: true, index: true },
    { name: 'userId', type: 'relationship', relationTo: 'users', required: true, index: true },
    { name: 'requestHash', type: 'text', required: true, unique: true, index: true },
    { name: 'model', type: 'text', required: true }, // 'claude-sonnet-4-5'
    { name: 'prompt', type: 'textarea', required: true },
    {
      name: 'status',
      type: 'select',
      required: true,
      defaultValue: 'pending',
      options: ['pending', 'streaming', 'complete', 'error', 'timeout'],
      index: true,
    },
    { name: 'chunks', type: 'json', defaultValue: [] }, // append-only token chunks
    { name: 'output', type: 'textarea' }, // populated on complete
    { name: 'tokensIn', type: 'number' },
    { name: 'tokensOut', type: 'number' },
    { name: 'costUsd', type: 'number' },
    { name: 'errorMessage', type: 'textarea' },
    { name: 'startedAt', type: 'date' },
    { name: 'completedAt', type: 'date' },
  ],
  timestamps: true,
}

The Server Action: thin, idempotent, fast

The Server Action does four things and exits. It does not stream. It does not wait for Claude. It returns within ~80ms so the UI can render the empty message bubble immediately.

TypeScript

'use server'

import { getPayload } from 'payload'
import config from '@payload-config'
import { createHash } from 'crypto'
import { headers } from 'next/headers'
import { revalidatePath } from 'next/cache'

export async function startClaudeMessage(input: {
  conversationId: string
  prompt: string
}): Promise<{ id: string; resumed: boolean }> {
  const payload = await getPayload({ config })
  const user = await getUserFromHeaders(headers())
  if (!user) throw new Error('unauthorized')

  const requestHash = createHash('sha256')
    .update(`${input.conversationId}:${user.id}:${input.prompt}`)
    .digest('hex')

  // Idempotency: if this exact request is in flight or complete, return the existing row.
  const existing = await payload.find({
    collection: 'aiMessages',
    where: { requestHash: { equals: requestHash } },
    limit: 1,
  })
  if (existing.docs[0]) {
    return { id: String(existing.docs[0].id), resumed: true }
  }

  const row = await payload.create({
    collection: 'aiMessages',
    data: {
      conversationId: input.conversationId,
      userId: user.id,
      requestHash,
      model: 'claude-sonnet-4-5',
      prompt: input.prompt,
      status: 'pending',
      chunks: [],
      startedAt: new Date().toISOString(),
    },
  })

  // Fire-and-forget trigger to the worker. Do NOT await the completion.
  await fetch(`${process.env.WORKER_URL}/run`, {
    method: 'POST',
    headers: { 'x-worker-secret': process.env.WORKER_SECRET! },
    body: JSON.stringify({ id: row.id }),
  })

  revalidatePath(`/conversations/${input.conversationId}`)
  return { id: String(row.id), resumed: false }
}

The worker: where Claude actually streams

The worker is where the runtime decision matters. Three options, ranked by how often we pick them:

A dedicated Node route on Railway or Fly — our default when streams routinely exceed 30 seconds. No function ceiling, predictable memory, can be scaled independently of the web app. Costs $5–$20/mo for low volume.
A Next.js route handler with `maxDuration = 300` on Vercel Pro — fine for completions that finish under five minutes. We have shipped this on internal tools where the upgrade to Railway was not worth the operational overhead.
Inngest, Trigger.dev, or BullMQ on a dedicated worker — when we need retries, scheduling, fan-out, or observability beyond what a single route gives us. We reach for this on projects with more than ~1,000 AI calls per day.

TypeScript

import Anthropic from '@anthropic-ai/sdk'
import { getPayload } from 'payload'
import config from '@payload-config'

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! })
const FLUSH_EVERY_MS = 250
const MAX_CHUNK_TOKENS = 32

export async function runMessage(messageId: string) {
  const payload = await getPayload({ config })
  const row = await payload.findByID({ collection: 'aiMessages', id: messageId })
  if (!row || row.status !== 'pending') return // idempotency: another worker grabbed it

  await payload.update({
    collection: 'aiMessages',
    id: messageId,
    data: { status: 'streaming' },
  })

  let buffer = ''
  let pendingChunks: string[] = []
  let lastFlush = Date.now()
  let tokensOut = 0

  const flush = async () => {
    if (pendingChunks.length === 0) return
    const toWrite = pendingChunks
    pendingChunks = []
    await payload.update({
      collection: 'aiMessages',
      id: messageId,
      data: { chunks: [...(row.chunks ?? []), ...toWrite], tokensOut },
    })
    lastFlush = Date.now()
  }

  try {
    const stream = anthropic.messages.stream({
      model: 'claude-sonnet-4-5',
      max_tokens: 4096,
      messages: [{ role: 'user', content: row.prompt }],
    })

    for await (const event of stream) {
      if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
        buffer += event.delta.text
        tokensOut += 1
        if (buffer.length >= MAX_CHUNK_TOKENS) {
          pendingChunks.push(buffer)
          buffer = ''
        }
        if (Date.now() - lastFlush > FLUSH_EVERY_MS) await flush()
      }
    }
    if (buffer) pendingChunks.push(buffer)
    await flush()

    const final = await stream.finalMessage()
    const output = final.content.filter(c => c.type === 'text').map(c => c.text).join('')

    await payload.update({
      collection: 'aiMessages',
      id: messageId,
      data: {
        status: 'complete',
        output,
        tokensIn: final.usage.input_tokens,
        tokensOut: final.usage.output_tokens,
        costUsd: calculateCost(final.usage),
        completedAt: new Date().toISOString(),
      },
    })
  } catch (err) {
    await payload.update({
      collection: 'aiMessages',
      id: messageId,
      data: {
        status: 'error',
        errorMessage: err instanceof Error ? err.message : String(err),
        completedAt: new Date().toISOString(),
      },
    })
  }
}
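
The worker calls `calculateCost(final.usage)`, which this listing does not define. A minimal sketch of what it might look like, assuming per-million-token pricing passed in explicitly. The `Pricing` type and the rates in `EXAMPLE_PRICING` are illustrative placeholders, so check your provider's current price list before relying on them:

```typescript
// Hypothetical helper: the worker calls calculateCost but the listing does not show it.
type Usage = { input_tokens: number; output_tokens: number }
type Pricing = { inputPerMTok: number; outputPerMTok: number } // USD per million tokens

// Placeholder rates for illustration only; replace with your model's real pricing.
const EXAMPLE_PRICING: Pricing = { inputPerMTok: 3, outputPerMTok: 15 }

function calculateCost(usage: Usage, pricing: Pricing = EXAMPLE_PRICING): number {
  const usd =
    (usage.input_tokens / 1_000_000) * pricing.inputPerMTok +
    (usage.output_tokens / 1_000_000) * pricing.outputPerMTok
  // Round to 6 decimal places so stored values stay stable across runs.
  return Math.round(usd * 1e6) / 1e6
}
```

Keeping the rates in one object means a price change is a single edit, and the daily cost alert keeps reading the same `costUsd` column.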

The client: polling beats SSE more often than you think

There are three reasonable ways for the client to read the row. We default to polling. Here is why.

Polling every 400–600ms — one line of code with SWR or TanStack Query, works through every proxy and corporate firewall, survives mobile network switches, costs ~2 cheap Postgres reads per second per active user. Default.
SSE from a dedicated route — lower perceived latency, no polling overhead, but you are back to long-lived connections that die on Vercel function timeouts. Only worth it on a Node runtime, only when polling latency is a measured problem.
Supabase Realtime / Postgres LISTEN/NOTIFY — clean, push-based, but adds a dependency and requires care around connection limits. Worth it only if the project already uses Supabase for other reasons.

TSX

'use client'

import useSWR from 'swr'

type AiMessage = {
  id: string
  status: 'pending' | 'streaming' | 'complete' | 'error' | 'timeout'
  chunks: string[]
  output?: string
  errorMessage?: string
}

export function StreamingMessage({ messageId }: { messageId: string }) {
  const { data } = useSWR<AiMessage>(
    `/api/messages/${messageId}`,
    (url) => fetch(url).then(r => r.json()),
    {
      refreshInterval: (latest) =>
        // Stop polling once the row reaches any terminal status, including `timeout`.
        latest && ['complete', 'error', 'timeout'].includes(latest.status) ? 0 : 500,
      revalidateOnFocus: true, // refetch when user returns to the tab
    },
  )

  if (!data) return <div className="opacity-50">Starting…</div>
  if (data.status === 'error') return <div className="text-red-600">{data.errorMessage}</div>
  if (data.status === 'timeout') return <div className="text-red-600">The response timed out.</div>

  const text = data.status === 'complete' ? data.output : (data.chunks ?? []).join('')
  return (
    <div>
      <p className="whitespace-pre-wrap">{text}</p>
      {data.status === 'streaming' && <span className="animate-pulse">▍</span>}
    </div>
  )
}
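
The component polls `/api/messages/${messageId}`, a route this section does not show. One possible shape for its response, as a sketch: project the stored row down to the fields in `AiMessage`, so worker-side columns such as `prompt` or `costUsd` are not sent to the browser. `StoredMessage` and `toClientMessage` are hypothetical names, and the stored field list is assumed from the worker code above.

```typescript
type Status = 'pending' | 'streaming' | 'complete' | 'error' | 'timeout'

// Assumed shape of the stored row, based on the fields the worker writes.
type StoredMessage = {
  id: string
  status: Status
  prompt: string
  chunks?: string[] | null
  output?: string | null
  errorMessage?: string | null
  costUsd?: number | null
}

// Mirrors the AiMessage type the client component expects.
type ClientMessage = {
  id: string
  status: Status
  chunks: string[]
  output?: string
  errorMessage?: string
}

function toClientMessage(row: StoredMessage): ClientMessage {
  const message: ClientMessage = {
    id: row.id,
    status: row.status,
    chunks: row.chunks ?? [], // the client always gets an array
  }
  if (row.output != null) message.output = row.output
  if (row.errorMessage != null) message.errorMessage = row.errorMessage
  return message
}
```

A route handler would load the row with the caller's access control applied, return 404 when it is missing, and otherwise respond with `toClientMessage(row)` as JSON.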

The war story

Failure modes we instrument

The pattern is not done when the happy path works. It is done when the four ugly paths are visible in the Payload admin and alerting somewhere.

Abandoned streams — a cron checks for rows in `streaming` status with `startedAt` older than 10 minutes, flips them to `timeout`, alerts. Catches worker crashes and silently dropped Anthropic connections.
Partial completions — the row is `complete` but `output` is shorter than the sum of chunks. Indicates the final-message reconciliation failed. Rare but worth catching.
Token overruns — `tokensOut` approaching `max_tokens`. Either the model is rambling or the prompt needs a tighter stop condition. We surface this in the admin so editors can tune their own prompts.
Cost spikes — daily aggregate of `costUsd` over a threshold pages the on-call. Cheaper than discovering a runaway loop a week later on the Anthropic invoice.
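
The four checks above reduce to small predicates over a row, which makes them easy to unit-test before wiring them into a cron or an admin view. A minimal sketch, assuming the field names used in this article. The function names, the `maxTokens` parameter, and the 95% overrun threshold are illustrative choices, not part of the original setup:

```typescript
type Row = {
  status: 'pending' | 'streaming' | 'complete' | 'error' | 'timeout'
  startedAt?: string // ISO timestamp
  completedAt?: string // ISO timestamp
  chunks?: string[]
  output?: string
  tokensOut?: number
  costUsd?: number
}

const TEN_MINUTES_MS = 10 * 60 * 1000

// Abandoned stream: still `streaming` but started more than maxAgeMs ago.
function isAbandoned(row: Row, now: Date, maxAgeMs = TEN_MINUTES_MS): boolean {
  if (row.status !== 'streaming' || !row.startedAt) return false
  return now.getTime() - new Date(row.startedAt).getTime() > maxAgeMs
}

// Partial completion: `complete`, yet output is shorter than the joined chunks.
function isPartialCompletion(row: Row): boolean {
  if (row.status !== 'complete') return false
  return (row.output ?? '').length < (row.chunks ?? []).join('').length
}

// Token overrun: output tokens at or above a fraction of the max_tokens budget.
function isTokenOverrun(row: Row, maxTokens: number, threshold = 0.95): boolean {
  return (row.tokensOut ?? 0) >= maxTokens * threshold
}

// Cost spike: total costUsd for rows completed on `day` (YYYY-MM-DD, UTC) exceeds the limit.
function isCostSpike(rows: Row[], day: string, limitUsd: number): boolean {
  const total = rows
    .filter(r => r.completedAt?.startsWith(day))
    .reduce((sum, r) => sum + (r.costUsd ?? 0), 0)
  return total > limitUsd
}
```

The day comparison relies on `completedAt` being written with `toISOString()`, which is always UTC, so the "daily" window is a UTC day unless you convert first.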

Where this pattern stops scaling

We wire this pattern into every Payload project that touches Claude or OpenAI. See how we ship Payload + AI workflows for the full shape, the hooks, and the editorial UX we build around it.

Questions operators ask next

Why not just use the Vercel AI SDK's `streamText` directly from a Server Action?
It works for chat toys and short completions. It breaks the moment the user refreshes, switches networks, or the function hits Vercel's 60s ceiling. We use the AI SDK's primitives inside the worker, but the stream's destination is Postgres via the Payload Local API, not the client. The client subscribes to the row.
Does polling every 500ms not hammer Postgres?
Two indexed reads per second per active streaming message is trivial on any Postgres instance from a $20 Neon plan upward — we typically see sub-5ms query times with the `(id)` primary key lookup. The flush cadence on the worker side (one write per 250ms) is where you tune for write load, not the read side.
Can the worker live on the same Vercel deployment, or do we always need Railway / Fly?
If completions reliably finish under 60s and you do not need tool-calling loops, a Next.js route handler with `export const maxDuration = 300` on Vercel Pro is fine. We move to Railway or Fly when streams routinely exceed 45s, when we need long-lived background work, or when the operational benefit of independent scaling outweighs the extra deployment target.
How does the `requestHash` idempotency hold up against legitimate retries — same prompt, same user, intentional second ask?
It does not, by design. If a user genuinely wants to regenerate, the UI should include a regenerate action that mutates the conversation id or appends a nonce to the hash input. The default behavior — collapsing identical (conversation, user, prompt) tuples to one row — is what stops accidental double-charges. Make the regenerate path explicit.
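
A sketch of how the regenerate path could feed into the hash, assuming `requestHash` is a SHA-256 over the (conversation, user, prompt) tuple. The actual hash inputs are defined elsewhere in the article, and `computeRequestHash` and `regenerateNonce` are hypothetical names:

```typescript
import { createHash, randomUUID } from 'node:crypto'

type HashInput = {
  conversationId: string
  userId: string
  prompt: string
  regenerateNonce?: string // set only when the user explicitly asks to regenerate
}

function computeRequestHash(input: HashInput): string {
  // JSON-encode an array so field boundaries are unambiguous:
  // ("ab", "c") and ("a", "bc") must not hash to the same value.
  const material = JSON.stringify([
    input.conversationId,
    input.userId,
    input.prompt,
    input.regenerateNonce ?? '',
  ])
  return createHash('sha256').update(material).digest('hex')
}

// Accidental double submit: identical inputs collapse to the same hash.
const first = computeRequestHash({ conversationId: 'c1', userId: 'u1', prompt: 'Summarise this' })
const retry = computeRequestHash({ conversationId: 'c1', userId: 'u1', prompt: 'Summarise this' })

// Explicit regenerate: a fresh nonce produces a new hash, and therefore a new row.
const regenerated = computeRequestHash({
  conversationId: 'c1',
  userId: 'u1',
  prompt: 'Summarise this',
  regenerateNonce: randomUUID(),
})
```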
Will this pattern survive Next.js 15's caching changes and Server Action revalidation semantics?
Yes. The Server Action only writes a row and triggers a worker; it does not depend on streaming response semantics or `unstable_*` APIs. The `revalidatePath` call invalidates the conversation page so the new message id renders. We have run this on Next.js 14 and 15 without changes — the Anthropic SDK's streaming surface is the only moving part to keep an eye on.
How does this interact with Payload's access control on a multi-tenant build?
The `aiMessages` collection's `read` access reads `req.user` and should be scoped per tenant — typically by adding a `tenant` relationship field and filtering on `where: { tenant: { equals: req.user.tenant } }`. Worker writes use the Local API with `overrideAccess: true` because the worker runs as a system actor, not as the user, which is the correct boundary to draw.
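
A minimal sketch of that read rule. Payload access functions can return either a boolean or a query constraint; the local types below stand in for Payload's own so the example is self-contained, and the assumption that `req.user.tenant` holds the tenant id directly (rather than a populated document) may not match your schema:

```typescript
// Simplified stand-ins for Payload's access types.
type User = { id: string; tenant?: string | null }
type AccessArgs = { req: { user?: User | null } }
type Where = { [field: string]: { equals: string } }

// Read access for `aiMessages`: deny anonymous or tenant-less users,
// otherwise constrain the query to the user's own tenant.
function readOwnTenant({ req }: AccessArgs): boolean | Where {
  const tenant = req.user?.tenant
  if (!tenant) return false
  return { tenant: { equals: tenant } }
}
```

In the collection config this would sit under `access: { read: readOwnTenant }`. Returning a constraint instead of `true` lets Payload apply the tenant filter to list queries as well as single-document reads.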

If the client owns the stream, a refresh kills the message. If Postgres owns the stream, the client is just a window into a row that already exists.