04 · Shipped AI · Evals & review

AI Integrations

Claude, OpenAI, Gemini — in the product, not the press release.

AI-native search, support agents, content pipelines, internal copilots, RAG over your data, automation flows. We architect the integration, build the prompt and tool layer in TypeScript, ship to production with evals.

Typical engagement · scoped per project · evals included

Symptoms we recognise

You should talk to us if…

  • You've shipped a chatbot demo, but it never made it to production.
  • Your team copy-pastes between ChatGPT and your tools — and wants that workflow inside the product.
  • You've spent months on a RAG prototype that nobody trusts.
  • You want measurable AI features, not a press-release line item.

What ships

/ 04 deliverables

Concrete deliverables, not vague promises.

Production-grade

An AI feature that ships, not demos.

Claude, OpenAI, or Gemini — wired into your product through a typed prompt + tool layer that survives the next model release.

Evals first

A test suite for prompts, not vibes.

Before any prompt change goes live, it runs through evals on real cases. Regressions caught in CI, not by users.
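
A minimal sketch of that harness, in TypeScript. Every name here (`cases`, `generateAnswer`, the 90% gate) is illustrative, not a fixed implementation:

```ts
// eval.ts: prompt changes run against recorded cases before merge (sketch).

type EvalCase = {
  input: string;         // real user query, pulled from logs
  mustContain: string[]; // facts the answer has to include
};

// 20 to 50 of these, written before the first prompt. The eval is the spec.
const cases: EvalCase[] = [
  { input: "How do I cancel my plan?", mustContain: ["Settings", "Billing"] },
  { input: "Do you support SSO?", mustContain: ["SAML"] },
];

async function generateAnswer(input: string): Promise<string> {
  // Swap in the Claude / OpenAI / Gemini SDK call; stubbed so the sketch runs.
  return `stub answer for: ${input}`;
}

async function runEvals(): Promise<void> {
  let passed = 0;
  for (const c of cases) {
    const answer = await generateAnswer(c.input);
    if (c.mustContain.every((fact) => answer.includes(fact))) passed++;
    else console.error(`FAIL: ${c.input}`);
  }
  const accuracy = passed / cases.length;
  console.log(`eval accuracy: ${(accuracy * 100).toFixed(0)}%`);
  if (accuracy < 0.9) process.exit(1); // CI gate: a regression never merges
}

runEvals();
```

String containment is the simplest useful check; real suites mix exact checks, schema checks, and model-graded rubrics. The same harness doubles as the benchmark when we compare models on your data.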

Tracing & versioning

Visibility into every model call.

Each call traced, every prompt version recorded, cost tracked. You debug in the open, not by guessing what the model did.
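
The wrapper is deliberately thin. A sketch with illustrative names and rates; in production the trace ships to whatever observability stack you already run:

```ts
// trace.ts: no model call happens off the books (sketch).

const PROMPT_VERSION = "support-answer@v7"; // prompts versioned like code

type Trace = {
  promptVersion: string;
  inputTokens: number;
  outputTokens: number;
  costEur: number;
  latencyMs: number;
};

// Illustrative rates in € per million tokens; use your provider's real prices.
function costEur(inputTokens: number, outputTokens: number): number {
  return (inputTokens * 3 + outputTokens * 15) / 1_000_000;
}

async function traced<T>(
  call: () => Promise<{ value: T; inputTokens: number; outputTokens: number }>,
): Promise<T> {
  const start = Date.now();
  const { value, inputTokens, outputTokens } = await call();
  const trace: Trace = {
    promptVersion: PROMPT_VERSION,
    inputTokens,
    outputTokens,
    costEur: costEur(inputTokens, outputTokens),
    latencyMs: Date.now() - start,
  };
  console.log(JSON.stringify(trace)); // or send to your tracing backend
  return value;
}
```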

Human in the loop

Review where it matters.

Customer-facing copy and high-stakes actions go through a human approval step. AI accelerates; humans accept.
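
The gate itself is boring on purpose. A minimal sketch; the risk labels and queue are hypothetical stand-ins for your own review workflow:

```ts
// approve.ts: AI drafts, a human accepts (sketch).

type Draft = { id: string; body: string; risk: "low" | "high" };

const reviewQueue: Draft[] = [];
const approved: Draft[] = [];

function route(draft: Draft): void {
  if (draft.risk === "high") {
    reviewQueue.push(draft); // customer-facing or high-stakes: a human signs off
  } else {
    approved.push(draft); // low-risk output flows straight through
  }
}

// Triggered by a reviewer, never by a timer: high-risk work waits for a human.
function approve(id: string): void {
  const i = reviewQueue.findIndex((d) => d.id === id);
  if (i >= 0) approved.push(...reviewQueue.splice(i, 1));
}
```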

How this engagement runs

Four phases. No bench.

  1. 01

    Week 1 · Eval design

    Start with the test set, not the prompt.

    We write 20–50 representative cases from your real data before any model call. The eval is the spec; the prompt comes second.

  2. 02

    Weeks 2–3 · Prompt + tool layer

    TypeScript wrappers around the SDK.

    Each tool is small, typed, and individually testable. Prompts versioned like code. Schema-validated outputs, not freeform JSON (sketched after this list).

  3. 03

    Week 4 · Integrate

    Wire into your real UI and workflow.

    Feature-flagged rollout. Tracing on every call. Internal users first, then a 5% canary, then the full base (rollout sketch after this list).

  4. 04

    Measure · Iterate

    Track the metric that matters.

    Deflection rate, resolution time, accuracy on the eval set. Iterate prompts. Swap models when something better lands. No ego in the model choice.
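
The phase-02 tool layer, sketched with Zod. The refund tool and its fields are illustrative, not a prescription:

```ts
import { z } from "zod";

// One small, typed, individually testable tool (sketch).
const RefundInput = z.object({
  orderId: z.string(),
  amountCents: z.number().int().positive(),
  reason: z.string().min(1),
});
type RefundInput = z.infer<typeof RefundInput>;

// The model proposes arguments; the schema decides whether they're usable.
export function parseRefundCall(raw: unknown): RefundInput {
  return RefundInput.parse(raw); // throws on freeform or malformed JSON
}

// The handler is plain TypeScript: testable with no model in the loop.
export async function executeRefund(input: RefundInput): Promise<string> {
  // ...call your billing API here...
  return `refunded ${input.amountCents} cents on order ${input.orderId}`;
}
```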
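And the phase-03 rollout gate, sketched. The allowlist and hash are illustrative; most teams wire this into an existing flag service:

```ts
// rollout.ts: internal users first, then a 5% canary, then everyone (sketch).

const INTERNAL_DOMAINS = ["yourco.example"]; // hypothetical allowlist
let CANARY_PERCENT = 5; // raise to 100 for the full base

// Stable hash so a user stays in or out of the canary across sessions.
function inCanary(userId: string): boolean {
  let hash = 0;
  for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  return hash % 100 < CANARY_PERCENT;
}

export function aiFeatureEnabled(userId: string, email: string): boolean {
  if (INTERNAL_DOMAINS.some((d) => email.endsWith(`@${d}`))) return true;
  return inCanary(userId);
}
```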

Stack we ship against

  • Claude · OpenAI · Gemini SDKs
  • TypeScript prompt + tool layer
  • Evals · tracing · prompt versioning

We pass when

You want a slide deck about AI instead of a measurable product change, or you can't commit to evals and human review before merge.

Field questions

Honest answers.

  • Which model should we use?

    Whichever wins on your eval set. We benchmark Claude, GPT, and Gemini on your data before recommending. Often the answer is two — one for quality, one for cost.

  • How do you handle hallucinations?

    Tool calls instead of freeform generation wherever possible. Retrieval over your data. Citations on every answer. Evals catch regressions before deploy.

  • Can we self-host the model?

    Sometimes. We ship Claude and OpenAI via API, and we ship open-source models on Bedrock or vLLM when data residency requires it. We pick what fits, not what's trendy.

  • What about cost?

    Tracked per call from day one. Most production features land at €0.02–€0.20 per interaction. We design to a cost ceiling, not a vague budget line (sketch below).
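
Designing to a ceiling is arithmetic, not hope. A sketch with illustrative token counts and rates; plug in your provider's real prices:

```ts
// cost.ts: estimate cost per interaction against a ceiling (sketch).

const CEILING_EUR = 0.2; // the line this feature must not cross

function interactionCostEur(opts: {
  steps: number;        // model calls per interaction, e.g. plan + answer
  inputTokens: number;  // avg prompt size per call
  outputTokens: number; // avg completion size per call
  eurPerMInput: number; // € per million input tokens
  eurPerMOutput: number; // € per million output tokens
}): number {
  const perCall =
    (opts.inputTokens * opts.eurPerMInput +
      opts.outputTokens * opts.eurPerMOutput) /
    1_000_000;
  return perCall * opts.steps;
}

const estimate = interactionCostEur({
  steps: 2,
  inputTokens: 4_000,
  outputTokens: 600,
  eurPerMInput: 3,
  eurPerMOutput: 15,
});

// 2 * (4000*3 + 600*15) / 1e6 = €0.042: inside the €0.02–€0.20 band.
console.log(estimate.toFixed(3), estimate <= CEILING_EUR ? "within ceiling" : "over");
```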

From the principal engineer

AI without evals is a demo. AI with evals is software. The first impresses the boardroom; the second is what customers actually use.

Related work

— / no shipped cases under this door yet

Discovery

Pick a door. Or describe what's broken.

We pitch the problem, not the platform.

Reply within one business day · No discovery deck tax