The drawing set

The engineering under every system

Ten capability clusters, from retrieval to evals to classic ML. The tools are named, the limits are stated, the proof is linked. If something is on this page, someone here can defend it on a call.

Get a free audit

See the proof

The architecture, drawn

One production retrieval path, end to end

This is the architecture our retrieval systems actually run, not a brain with gears. Every node states something true. The dashed lane is the manual process it replaces.

FIG. 01

The retrieval path: recall measured before generation; fails closed at the gate. REV 2026.06

01 · Ingest
Chunking and indexing strategy is where most retrieval quality is won or lost, before any model is involved.
02 · Hybrid retrieve
Dense vectors and keyword search in parallel, fused. Pure vector retrieval misses exact terms: names, codes, citations.
03 · Rerank, then generate
A second-pass precision score on the top results. The cheapest quality win in most production RAG systems.
04 · The eval gate
Faithfulness measured against a golden set before the answer ships. The answer cites its sources, or it fails closed.
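
A minimal sketch of the fusion step in node 02, assuming reciprocal rank fusion (RRF), which the retrieval cluster below lists next to BM25. The function name, the document IDs and the constant k=60 are illustrative assumptions, not the production code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for one query.
dense_hits = ["doc_a", "doc_b", "doc_c"]      # vector search
keyword_hits = ["doc_c", "doc_a", "doc_d"]    # keyword search

fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
print(fused)
```

Documents that both retrievers return rise to the top, which is how an exact-term match the vector search ranked low can still reach the reranker.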

The matrix

Problem, approach, and where the proof lives

Every row links to something published. No row exists without one.

Problem

Approach

Proof

Answers from your documents, with citations

Hybrid retrieval plus reranking, recall measured before generation, faithfulness measured before the answer ships

PE deal desk case

A back-office workflow that runs itself

Deterministic execution, model judgment at decision points only, human gate before send

Breez outbound engine

Proving it works before it ships

Golden datasets and regression gates wired into CI; deploys blocked below threshold

The method, written up

Documents into structured data

VLM extraction with schema validation and a human review queue

Legal document automation

A pilot that never made it to production

Production hardening: deployment, monitoring, eval-gated CI, documented handover

Enterprise adoption case

The clusters

Ten clusters, honestly tiered

Organized by capability, not by logo wall. Each cluster says when we reach for it and when we will not, because the second answer is the one that proves the first.

In production: Running in our own or client systems today.
On demand: Proven capability we bring when the job calls for it.
Your stack: Your existing environment. We deploy into it instead of replacing it.

Model training

The workshop tools for building and modifying neural networks ourselves, not just calling someone else's API.

When we reach for it

Three real triggers: proprietary data with no off-the-shelf equivalent, hard latency or edge constraints, or IP ownership requirements. We check for all three before recommending custom model work, because at least two are usually absent.

When we won’t

Custom training is rare and expensive, and most problems that look like training problems are retrieval or fine-tuning problems in disguise. We can build models from scratch; most of the value we add is knowing when you should not.

In production

Python

On demand

PyTorch
Hugging Face Transformers

Your stack

TensorFlow / Keras

Fine-tuning and customization

Teaching an existing model your domain's style, vocabulary or task. Far cheaper than building one from scratch.

When we reach for it

Tone of voice, structured-output reliability, narrow task specialization, and cost reduction where a small fine-tuned model replaces a big general one. Distillation is the quiet win: a frontier model generates the training data, a small model serves it at a fraction of the unit cost.

When we won’t

Fine-tuning is a cost, latency and consistency optimization, not a knowledge store. If you want the model to know your documents, that is retrieval, not training, and we will tell you which one you actually need before you pay for either.

On demand

LoRA / QLoRA
SFT
DPO
Distillation
OpenAI / Anthropic fine-tuning APIs

Inference and serving

Making models run fast and cheap in production. The difference between a demo and a system that serves real volume without melting the budget.

When we reach for it

Self-hosting wins in exactly three cases: data sovereignty, very high sustained volume, or a fine-tuned small model. Then we reach for vLLM, quantization, and careful prompt-cache and KV-cache engineering, and we benchmark the unit economics before and after.

When we won’t

Outside those three cases, managed APIs win on total cost of ownership, and we say so. Most engagements run on Claude and GPT through their APIs because that is what the math supports, not because it is easier to sell.

In production

Claude / GPT APIs
Modal

On demand

vLLM
Ollama
Gemini
Llama / Mistral / Qwen / DeepSeek
Quantization (GGUF / AWQ)
ONNX Runtime
Together AI / Replicate

Your stack

AWS Bedrock / Azure OpenAI / Vertex AI

Retrieval and RAG

Connecting AI to your actual documents so it answers from your knowledge, accurately and with citations, instead of making things up.

When we reach for it

Internal knowledge assistants, support copilots, contract and policy Q&A, research tools. We start with pgvector inside the Postgres you already run, use hybrid search because pure vector retrieval misses exact terms, and rerank because it is the cheapest quality win in most systems. Retrieval quality is measured separately from generation, and before it.
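
A minimal sketch of measuring retrieval on its own, assuming recall@k against a golden set as the metric. The question IDs, document IDs and the choice of k are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical golden set: question -> documents a correct answer must draw on.
golden = {
    "q1": {"doc_a", "doc_c"},
    "q2": {"doc_f"},
}
# Hypothetical retriever output for the same questions, best match first.
retrieved = {
    "q1": ["doc_a", "doc_b", "doc_c", "doc_d"],
    "q2": ["doc_e", "doc_g", "doc_f", "doc_h"],
}

scores = {q: recall_at_k(retrieved[q], golden[q], k=2) for q in golden}
mean_recall = sum(scores.values()) / len(scores)
print(scores, mean_recall)
```

No model is called here. If the right documents are not in the top k, no prompt change downstream will recover them, so this number is worth knowing first.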

When we won’t

A dedicated vector database is justified by scale, not by default. Most of RAG quality is won or lost in chunking and indexing strategy, not in the model choice, and we have walked clients back from RAG to plain search when that was the honest answer.

In production

Supabase + pgvector
Claude / GPT-4o
OpenAI embeddings

On demand

Hybrid search (BM25 + RRF)
Rerankers (Cohere Rerank, cross-encoders)
LlamaIndex
GraphRAG
Pinecone
Weaviate

Agents and orchestration

AI that does the work instead of just answering: looks things up, calls your systems, completes multi-step tasks, and knows when to hand off to a human.

When we reach for it

Back-office automation, research and report pipelines, customer-ops copilots. Every agent we ship runs on the same architecture: plain-text directives your team can edit, model judgment spent only at decision points, deterministic scripts for everything else, an approval gate before anything irreversible, and an audit log of every action.
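
A minimal sketch of that shape, assuming a single pass with one decision point. The names run_workflow and AuditLog, and the stubbed decide, approve and send callables, are hypothetical stand-ins, not the shipped system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, action, detail):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "detail": detail,
        })


def run_workflow(task, decide, approve, send, log):
    """One guarded pass: deterministic prep, one judgment call, a gate, then send."""
    # Deterministic step: plain code, no model involved.
    cleaned = task.strip()
    log.record("prepare", cleaned)

    # Decision point: the only place model judgment is spent.
    draft = decide(cleaned)
    log.record("decide", draft)

    # Approval gate: the irreversible action runs only after a person says yes.
    if not approve(draft):
        log.record("rejected", draft)
        return None
    send(draft)
    log.record("sent", draft)
    return draft


# Stand-ins so the sketch runs without any external service.
sent_messages = []
log = AuditLog()
result = run_workflow(
    "  follow up with lead  ",
    decide=lambda text: f"DRAFT: {text}",   # stub for a model call
    approve=lambda draft: False,            # the reviewer declines
    send=sent_messages.append,
    log=log,
)
print(result, [entry["action"] for entry in log.entries])
```

The reviewer declines in this run, so nothing is sent, and the log still shows every step that ran.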

When we won’t

Multi-agent swarms are oversold; most jobs need one well-guarded loop. If a cron job and a script solve it, that is what we build, because 90 percent per-step accuracy compounds to 59 percent over five chained steps and no framework changes that arithmetic.

In production

MCP
Claude (Agent SDK)
Claude Code
n8n
Playwright
Apify
Perplexity API
Slack
Instantly
HeyReach
PandaDoc
Stripe
Google Workspace

On demand

LangGraph
OpenAI Agents SDK
Pydantic AI
CrewAI / AutoGen
DSPy
LangChain
Temporal
Apollo

Your stack

Salesforce / HubSpot / Pipedrive
SAP / NetSuite / QuickBooks
Jira / Confluence / Linear / Teams

Evaluation and observability

How we prove the AI actually works: measured, monitored and regression-tested like real software, not vibes.

When we reach for it

Every engagement. A golden dataset before we build, regression gates before we ship, tracing in production after. The client keeps the eval suite; it is part of the deliverable, because a system you cannot measure is a system you cannot trust.

When we won’t

There is no engagement where we skip this. The honest variable is depth: a document pipeline gets faithfulness and extraction suites, an outbound agent gets human review sampling, a classifier gets a held-out test set. We size the harness to the risk, never to zero.

In production

Golden datasets + regression gates
Playwright + Vitest

On demand

promptfoo
Langfuse
LangSmith
Arize Phoenix
Helicone
Guardrails AI

Your stack

OpenTelemetry

Classic and predictive ML

Not every problem needs a language model. Predicting numbers such as churn, demand or fraud risk is usually solved better, cheaper and more explainably with proven statistical ML.

When we reach for it

Tabular prediction: churn, lead scoring, risk, pricing, demand forecasting. Gradient-boosted trees still beat deep learning on most tabular business data, run for pennies, and produce feature attributions an auditor can actually read.

When we won’t

When the input is language, judgment or unstructured documents, classic ML underperforms and we say so. The discipline runs both ways: if your problem is a prediction problem, you will hear from us that it does not need an LLM before you pay for one.

On demand

XGBoost
LightGBM
scikit-learn
statsmodels / Prophet

Your stack

Snowflake / Databricks / BigQuery

Vision, documents and speech

AI for eyes and ears: reading documents, watching camera feeds, transcribing calls.

When we reach for it

Document-heavy workflows are the most common win we see: invoices, contracts and forms into structured, validated data, with a human review queue on the output. Vision-language extraction with schema validation now beats bespoke OCR at most volumes, and Whisper-class transcription feeds call analytics and voice agents.
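
A minimal sketch of schema validation feeding a review queue, assuming the model returns one dictionary per invoice and that only records failing validation go to a person. The Invoice fields, the allowed currencies and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    invoice_number: str
    total: float
    currency: str


ALLOWED_CURRENCIES = {"USD", "EUR", "GBP"}  # assumed for the example


def validate_invoice(raw):
    """Return (Invoice, []) when the payload passes, else (None, list of problems)."""
    problems = []
    number = raw.get("invoice_number")
    if not isinstance(number, str) or not number.strip():
        problems.append("invoice_number missing or empty")
    total = raw.get("total")
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(total, bool) or not isinstance(total, (int, float)) or total < 0:
        problems.append("total must be a non-negative number")
    currency = raw.get("currency")
    if currency not in ALLOWED_CURRENCIES:
        problems.append("currency not recognized")
    if problems:
        return None, problems
    return Invoice(number.strip(), float(total), currency), []


def route(extractions):
    """Split model output into accepted records and a human review queue."""
    accepted, review_queue = [], []
    for raw in extractions:
        invoice, problems = validate_invoice(raw)
        if invoice is None:
            review_queue.append({"raw": raw, "problems": problems})
        else:
            accepted.append(invoice)
    return accepted, review_queue


# Hypothetical model output: the second record has a malformed total.
extracted = [
    {"invoice_number": "INV-001", "total": 1250.0, "currency": "USD"},
    {"invoice_number": "INV-002", "total": "12,50", "currency": "EUR"},
]
accepted, review_queue = route(extracted)
print(len(accepted), review_queue[0]["problems"])
```

A stricter policy would also send a sample of passing records to review. The sketch leaves that out.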

When we won’t

Bespoke computer vision is justified by volume and latency, not by novelty. Below that bar, a vision-language model on demand is cheaper to run and easier to maintain, and we will tell you which side of the bar you are on before anything is built.

In production

Claude / GPT-4o vision

On demand

Whisper / faster-whisper
YOLO (v8 to v11)
Segment Anything (SAM)
PaddleOCR
ElevenLabs / OpenAI TTS
Retell AI

Your stack

Azure / Google Document AI

MLOps and infrastructure

The plumbing that keeps AI running in production: deployment, scaling, monitoring and retraining, so it does not quietly degrade after launch.

When we reach for it

Every system we ship lands on infrastructure we can defend: serverless Python on Modal, dashboards on Vercel, Postgres on Supabase, CI with eval gates on GitHub Actions. Production hardening of a stalled internal pilot is a standalone engagement we take often.

When we won’t

We do not park your system on infrastructure only we understand. If your platform team runs Kubernetes on AWS, we deploy there. The architecture has to survive us leaving; that is the point of it.

In production

Modal
Vercel + Next.js
Supabase (Postgres)
TypeScript
GitHub Actions

On demand

Docker
Railway
MLflow / model registries

Your stack

Kubernetes
AWS / Azure / Google Cloud
SageMaker / Vertex AI
Kafka / RabbitMQ
Airflow / dbt
Prometheus / Grafana
MySQL / MongoDB / Redis / SQLite
Java / Go / Ruby / .NET stacks
REST / GraphQL / webhooks

Security, privacy and governance

Keeping your data yours, and your AI safe to put in front of customers.

When we reach for it

Every build: row-level security in the database, per-client isolation, model APIs under no-training terms, approval gates and audit logs on anything that acts. Customer-facing deployments add prompt-injection testing and output filtering before launch, not after the first incident.

When we won’t

We do not claim certifications we do not hold, and we will not ship a customer-facing agent without a human gate and an eval suite. If a vendor cannot offer no-training terms, it does not get into the stack.

In production

Supabase row-level security
JWT auth + RBAC
No-training API terms
Audit logging

On demand

OWASP LLM Top 10 red-teaming
Output filtering / Guardrails
PII redaction

Your stack

VPC / private-cloud deployment
Regional model residency
OAuth / SSO / MFA / TLS

Why deterministic

Orchestration under guardrails

Agents are a reliability engineering problem. The flow below is the system we actually ship: the dot stops at the dashed human node because the system does.

FIG. 02

A real agent workflow: deterministic retries, a human approval gate, an audit log. REV 2026.06

Five chained steps of model judgment

59%

Compound success rate of a five-step workflow at 90 percent per-step accuracy. Arithmetic, not a benchmark. It is why every step that can be deterministic becomes deterministic code, and the model is spent only where judgment is the job.

Compound success rate by number of chained model steps at 90 percent per-step accuracy
Steps chained	Per-step	Compound success
1	90%	90%
2	90%	81%
3	90%	73%
4	90%	66%
5	90%	59%
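
The table can be reproduced in a few lines, assuming each step succeeds or fails independently of the others. The function name is illustrative.

```python
def compound_success(per_step, steps):
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


rows = [(n, f"{compound_success(0.9, n):.0%}") for n in range(1, 6)]
for n, rate in rows:
    print(n, rate)
```

Replacing one of the five model steps with deterministic code that always succeeds leaves four, which takes the figure back up from 59% to 66%.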

Evals

How we know it works

We do not ship what we cannot measure. Every engagement leaves with an eval suite the client keeps.

01 · Golden datasets

A held-out set of real cases with known-good answers, agreed before we build. The system is graded against it, not against vibes.

02 · Regression gates

Eval suites run in CI on every change, with tools like promptfoo. Scores below threshold block the deploy. It fails closed.

03 · Tracing in production

Every call, cost, latency and failure is inspectable, with tools like Langfuse. Quality decay gets caught by monitoring, not by your customers.
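
A minimal sketch of the gate logic in step 02, assuming each golden-set case yields a score between 0 and 1 and the gate compares the mean to a threshold. The scores, the threshold and the function name are hypothetical. This is not how promptfoo or any specific tool implements it.

```python
def gate(scores, threshold):
    """Return a process exit code: 0 allows the deploy, 1 blocks it.

    Fails closed: an empty result set blocks the deploy too.
    """
    if not scores:
        return 1
    mean = sum(scores) / len(scores)
    return 0 if mean >= threshold else 1


# Hypothetical per-case scores from one eval run.
case_scores = [1.0, 1.0, 0.0, 1.0]
exit_code = gate(case_scores, threshold=0.9)
print(exit_code)
# A CI job would end with sys.exit(exit_code); a nonzero code stops the pipeline.
```

Treating a missing or empty result as a failure is the important part. An eval job that crashes should block the deploy, not let it through.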

The same discipline runs our own pipeline

The transcript on the right is output from the system that sells our systems: reply detection, research, a three-stage strategy generator, a generated deck, and a hard stop at human review. The figures are from a documented run and are labeled as such, because the rule about uncited numbers applies to us first.

See how we’d measure yours

REPLY-TO-DECK PIPELINE · stage 3 of 5
lead: [redacted] · channel: email · tone: professional

[1] research analyst
    three pain points extracted, each from a different
    business function, evidence attached
[2] roi calculator
    annual waste identified      $132,600
    projected savings            $85,800
    roi multiple                 17x
    recovery assumption: 60-80% of wasted hours, never 100
    math verified before submit
[3] message writer
    one personalization hook · no sales language
    deck link placeholder inserted

-> pandadoc deck generated from pipeline data
-> posted to #3-strategy-ready for human review
-> nothing sends without a person approving it

Output from our own sales pipeline. Illustrative numbers from a real run.
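
The transcript says the math is verified before submit. A sketch of one such check, assuming it compares the projected savings to the stated 60-80% recovery band. The function names are hypothetical; the two dollar figures are the ones in the transcript.

```python
def recovery_rate(annual_waste, projected_savings):
    """Share of the identified waste that the projection claims to recover."""
    return projected_savings / annual_waste


def within_assumption(rate, low=0.60, high=0.80):
    """True when the rate sits inside the stated 60-80% recovery band."""
    return low <= rate <= high


rate = recovery_rate(132_600, 85_800)
ok = within_assumption(rate)
print(f"{rate:.1%}", ok)
```

The projection recovers about 64.7% of the identified waste, inside the band. A projection claiming 100% recovery would fail the check.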

Classic ML

Not everything is an LLM problem

We did ML before the chat window. If your problem is predicting a number, a gradient-boosted model is usually better, cheaper and more explainable, and we will tell you so before you pay for a language model.

Where gradient-boosted models beat language models on tabular prediction work
Dimension	Gradient-boosted model	Large language model
Cost per prediction	fractions of a cent	orders of magnitude more
Latency	milliseconds	seconds
Explainability	auditable attributions	post-hoc rationales
Drift handling	scheduled retrain, measured	re-prompt, re-evaluate
Tabular prediction	the default	the exception

The reverse holds too: when the input is language, judgment or messy documents, classic ML loses and we reach for the model. The discipline is owning both answers. The classic ML cluster above lists the tools.

The boundary

What we won’t build

Blockchain, VR and IoT side quests.

Breadth claims across hype domains predict mediocrity in all of them. We do AI systems for operations; that is the whole list.

Fine-tuning when retrieval solves it.

A fine-tuned model is expensive to build, hard to audit, and stale the day training ends. If the job is knowing your documents, the answer is retrieval, and we will say so in the first call.

Agents where a script will do.

Model judgment is spent only where judgment is the job. Everything that can be deterministic becomes deterministic code that either works or fails loudly.

Customer-facing AI without a gate and a test suite.

No eval suite, no human approval path, no launch. We size the harness to the risk, but it never rounds down to zero.

Guaranteed accuracy percentages.

Anyone promising a number before measuring your data is guessing. We commit to a measurement method first, then to the numbers it produces.

Written up

The thinking, published

One genuinely technical write-up outperforms ten capability adjectives. Here are some of ours.

2026-06-10

Why we push complexity into deterministic code

Five 90% steps compound to 59%. How the DOE architecture keeps judgment in the model, control in deterministic code, and a human at every gate.

2026-03-23

AI lead generation: how automated prospecting works

How an AI prospecting pipeline works stage by stage, from sourcing to reply detection, and why most AI lead generation tools fail to deliver.

2026-03-23

How to choose an AI partner for your business

Build-only agency, consultant, or build plus educate partner: the three models compared, green and red flags, and six questions to ask before signing.

All insights

Start free. Know your number in five days.

A 3 to 5 day audit of your operations, ending in a plan with the ROI math attached. No obligation.

Get the free audit