Advizr

The drawing set

The engineering under every system

Ten capability clusters, from retrieval to evals to classic ML. The tools are named, the limits are stated, the proof is linked. If something is on this page, someone here can defend it on a call.

The architecture, drawn

One production retrieval path, end to end

This is the architecture our retrieval systems actually run, not a brain with gears. Every node states something true. The dashed lane is the manual process it replaces.

FIG. 01
The retrieval path. Documents flow through chunking and embeddings into hybrid retrieval with parallel vector and keyword lanes, then reranking, generation, and an eval gate before a cited answer. The dashed lane above is the manual search it replaces; the eval suite feeds back on every change.
Nodes: Your docs, Chunking, Embeddings, Vector search, Keyword search, Rerank, Generate, Eval gate, Cited answer.
Annotations: Before: search by hand, paste, hope. Recall measured before generation. Rerank: cheapest quality win. Fails closed. Eval suite reruns on every change.
The retrieval path: recall measured before generation; fails closed at the gate. REV 2026.06
01 · Ingest

Chunking and indexing strategy is where most retrieval quality is won or lost, before any model is involved.

02 · Hybrid retrieve

Dense vectors and keyword search in parallel, fused. Pure vector retrieval misses exact terms: names, codes, citations.

03 · Rerank, then generate

A second-pass precision score on the top results. The cheapest quality win in most production RAG systems.

04 · The eval gate

Faithfulness measured against a golden set before the answer ships. The answer cites its sources, or it fails closed.
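
The fusion in step 02 can be sketched with reciprocal rank fusion (RRF), the method named in the retrieval cluster below. This is a minimal illustration, not the production implementation: the function name, the document ids and the constant `k=60` (a commonly used default) are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from the two parallel lanes.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_d", "doc_a", "doc_b"]  # doc_d matched an exact term only

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents found by both lanes rise to the top, and `doc_d`, which only the keyword lane found, still makes the fused list instead of being dropped.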

The matrix

Problem, approach, and where the proof lives

Every row links to something published. No row exists without one.

Answers from your documents, with citations

Hybrid retrieval plus reranking, recall measured before generation

PE deal desk case

A back-office workflow that runs itself

Deterministic execution, model judgment at decision points only, human gate before send

Breez outbound engine

Proving it works before it ships

Golden datasets and regression gates wired into CI; deploys blocked below threshold

The method, written up

Documents into structured data

VLM extraction with schema validation and a human review queue

Legal document automation

A pilot that never made it to production

Production hardening: deployment, monitoring, eval-gated CI, documented handover

Enterprise adoption case

The clusters

Ten clusters, honestly tiered

Organized by capability, not by logo wall. Each cluster says when we reach for it and when we will not, because the second answer is the one that proves the first.

In production
Running in our own or client systems today.
On demand
Proven capability we bring when the job calls for it.
Your stack
Your existing environment. We deploy into it instead of replacing it.

01

Model training

The workshop tools for building and modifying neural networks ourselves, not just calling someone else's API.

When we reach for it

Three real triggers: proprietary data with no off-the-shelf equivalent, hard latency or edge constraints, or IP ownership requirements. We check for all three before recommending custom model work, because at least two are usually absent.

When we won’t

Custom training is rare and expensive, and most problems that look like training problems are retrieval or fine-tuning problems in disguise. We can build models from scratch; most of the value we add is knowing when you should not.

In production

  • Python

On demand

  • PyTorch
  • Hugging Face Transformers

Your stack

  • TensorFlow / Keras

02

Fine-tuning and customization

Teaching an existing model your domain's style, vocabulary or task. Far cheaper than building one from scratch.

When we reach for it

Tone of voice, structured-output reliability, narrow task specialization, and cost reduction where a small fine-tuned model replaces a big general one. Distillation is the quiet win: a frontier model generates the training data, a small model serves it at a fraction of the unit cost.

When we won’t

Fine-tuning is a cost, latency and consistency optimization, not a knowledge store. If you want the model to know your documents, that is retrieval, not training, and we will tell you which one you actually need before you pay for either.

On demand

  • LoRA / QLoRA
  • SFT
  • DPO
  • Distillation
  • OpenAI / Anthropic fine-tuning APIs

03

Inference and serving

Making models run fast and cheap in production. The difference between a demo and a system that serves real volume without melting the budget.

When we reach for it

Self-hosting wins in exactly three cases: data sovereignty, very high sustained volume, or a fine-tuned small model. Then we reach for vLLM, quantization, and careful prompt-cache and KV-cache engineering, and we benchmark the unit economics before and after.

When we won’t

Outside those three cases, managed APIs win on total cost of ownership, and we say so. Most engagements run on Claude and GPT through their APIs because that is what the math supports, not because it is easier to sell.

In production

  • Claude / GPT APIs
  • Modal

On demand

  • vLLM
  • Ollama
  • Gemini
  • Llama / Mistral / Qwen / DeepSeek
  • Quantization (GGUF / AWQ)
  • ONNX Runtime
  • Together AI / Replicate

Your stack

  • AWS Bedrock / Azure OpenAI / Vertex AI

04

Retrieval and RAG

Connecting AI to your actual documents so it answers from your knowledge, accurately and with citations, instead of making things up.

When we reach for it

Internal knowledge assistants, support copilots, contract and policy Q&A, research tools. We start with pgvector inside the Postgres you already run, use hybrid search because pure vector retrieval misses exact terms, and rerank because it is the cheapest quality win in most systems. Retrieval quality is measured separately from generation, and before it.
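
Measuring retrieval separately from generation can be as simple as recall at k over a golden set. The sketch below is a minimal version under stated assumptions: the case format (`retrieved`, `relevant`), the chunk ids and the function names are hypothetical, and no model call is involved.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the known-relevant ids that appear in the top k retrieved."""
    if not relevant:
        raise ValueError("each golden case needs at least one relevant id")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)


def mean_recall_at_k(cases, k):
    """Average recall at k across all golden cases."""
    return sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / len(cases)


# Hypothetical golden cases: what the retriever returned vs. what it should find.
golden_cases = [
    {"retrieved": ["c1", "c7", "c3", "c9"], "relevant": ["c1", "c3"]},
    {"retrieved": ["c4", "c2", "c8", "c5"], "relevant": ["c5", "c6"]},
]

score = mean_recall_at_k(golden_cases, k=3)
print(score)
```

If this number is low, no prompt or model change downstream will recover the missing context, which is the reason to measure it first.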

When we won’t

A dedicated vector database is justified by scale, not by default. Most of RAG quality is won or lost in chunking and indexing strategy, not in the model choice, and we have walked clients back from RAG to plain search when that was the honest answer.

In production

  • Supabase + pgvector
  • Claude / GPT-4o
  • OpenAI embeddings

On demand

  • Hybrid search (BM25 + RRF)
  • Rerankers (Cohere Rerank, cross-encoders)
  • LlamaIndex
  • GraphRAG
  • Pinecone
  • Weaviate

05

Agents and orchestration

AI that does the work instead of just answering: looks things up, calls your systems, completes multi-step tasks, and knows when to hand off to a human.

When we reach for it

Back-office automation, research and report pipelines, customer-ops copilots. Every agent we ship runs on the same architecture: plain-text directives your team can edit, model judgment spent only at decision points, deterministic scripts for everything else, an approval gate before anything irreversible, and an audit log of every action.
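
The approval gate and audit log can be illustrated in a few lines. This is a sketch of the pattern only, assuming an in-memory log and a caller that supplies the approver; `GuardedRunner`, `ApprovalRequired` and the action names are hypothetical.

```python
import time


class ApprovalRequired(Exception):
    """Raised when an irreversible action has no human approver."""


class GuardedRunner:
    def __init__(self):
        # In-memory for the sketch; a real system would persist this.
        self.audit_log = []

    def _log(self, event, action, **details):
        self.audit_log.append({"ts": time.time(), "event": event, "action": action, **details})

    def run(self, action, fn, *, irreversible=False, approved_by=None):
        """Run fn(); irreversible actions are held until a person approves."""
        if irreversible and not approved_by:
            self._log("held_for_approval", action)
            raise ApprovalRequired(action)
        result = fn()
        self._log("executed", action, approved_by=approved_by)
        return result


runner = GuardedRunner()
runner.run("draft_email", lambda: "draft")  # reversible: runs immediately
try:
    runner.run("send_email", lambda: "sent", irreversible=True)
    held = False
except ApprovalRequired:
    held = True  # the system stops here until a person approves
sent = runner.run("send_email", lambda: "sent", irreversible=True,
                  approved_by="reviewer@example.com")
events = [entry["event"] for entry in runner.audit_log]
print(events)
```

The hold itself is logged, so the audit trail shows what the system declined to do as well as what it did.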

When we won’t

Multi-agent swarms are oversold; most jobs need one well-guarded loop. If a cron job and a script solve it, that is what we build, because 90 percent per-step accuracy compounds to 59 percent over five chained steps and no framework changes that arithmetic.

In production

  • MCP
  • Claude (Agent SDK)
  • Claude Code
  • n8n
  • Playwright
  • Apify
  • Perplexity API
  • Slack
  • Instantly
  • HeyReach
  • PandaDoc
  • Stripe
  • Google Workspace

On demand

  • LangGraph
  • OpenAI Agents SDK
  • Pydantic AI
  • CrewAI / AutoGen
  • DSPy
  • LangChain
  • Temporal
  • Apollo

Your stack

  • Salesforce / HubSpot / Pipedrive
  • SAP / NetSuite / QuickBooks
  • Jira / Confluence / Linear / Teams

06

Evaluation and observability

How we prove the AI actually works: measured, monitored and regression-tested like real software, not vibes.

When we reach for it

Every engagement. A golden dataset before we build, regression gates before we ship, tracing in production after. The client keeps the eval suite; it is part of the deliverable, because a system you cannot measure is a system you cannot trust.

When we won’t

There is no engagement where we skip this. The honest variable is depth: a document pipeline gets faithfulness and extraction suites, an outbound agent gets human review sampling, a classifier gets a held-out test set. We size the harness to the risk, never to zero.

In production

  • Golden datasets + regression gates
  • Playwright + Vitest

On demand

  • promptfoo
  • Langfuse
  • LangSmith
  • Arize Phoenix
  • Helicone
  • Guardrails AI

Your stack

  • OpenTelemetry

07

Classic and predictive ML

Not every problem needs a language model. Predicting numbers (churn, demand, fraud risk) is usually solved better, cheaper and more explainably with proven statistical ML.

When we reach for it

Tabular prediction: churn, lead scoring, risk, pricing, demand forecasting. Gradient-boosted trees still beat deep learning on most business data, run for pennies, and produce feature attributions an auditor can actually read.

When we won’t

When the input is language, judgment or unstructured documents, classic ML underperforms and we say so. The discipline runs both ways: if your problem is a prediction problem, you will hear from us that it does not need an LLM before you pay for one.

On demand

  • XGBoost
  • LightGBM
  • scikit-learn
  • statsmodels / Prophet

Your stack

  • Snowflake / Databricks / BigQuery

08

Vision, documents and speech

AI for eyes and ears: reading documents, watching camera feeds, transcribing calls.

When we reach for it

Document-heavy workflows are the most common win we see: invoices, contracts and forms into structured, validated data, with a human review queue on the output. Vision-language extraction with schema validation now beats bespoke OCR at most volumes, and Whisper-class transcription feeds call analytics and voice agents.
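
Schema validation with a review queue can be sketched as follows. The invoice schema, field names and routing rule are assumptions for illustration: here only records that fail validation go to a person, which is one possible policy and not necessarily the one a given engagement uses.

```python
from datetime import date

# Hypothetical schema: field name -> (expected type, required?)
INVOICE_SCHEMA = {
    "invoice_number": (str, True),
    "issue_date": (str, True),  # expected as an ISO 8601 date string
    "total": (float, True),
    "po_number": (str, False),
}


def validate_invoice(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, (expected, required) in INVOICE_SCHEMA.items():
        value = record.get(field)
        if value is None:
            if required:
                problems.append(f"missing {field}")
        elif not isinstance(value, expected):
            problems.append(f"{field} should be {expected.__name__}")
    if isinstance(record.get("issue_date"), str):
        try:
            date.fromisoformat(record["issue_date"])
        except ValueError:
            problems.append("issue_date is not an ISO date")
    return problems


def route(records):
    """Split extracted records into accepted ones and a human review queue."""
    accepted, review_queue = [], []
    for record in records:
        problems = validate_invoice(record)
        if problems:
            review_queue.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, review_queue


# Hypothetical model output for two scanned invoices.
extracted = [
    {"invoice_number": "INV-001", "issue_date": "2026-01-15", "total": 1200.0},
    {"invoice_number": "INV-002", "issue_date": "15/01/2026", "total": "1,200"},
]
accepted, review_queue = route(extracted)
print(len(accepted), review_queue[0]["problems"])
```

The second record is not silently coerced: it reaches a reviewer with the reasons attached.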

When we won’t

Bespoke computer vision is justified by volume and latency, not by novelty. Below that bar, a vision-language model on demand is cheaper to run and easier to maintain, and we will tell you which side of the bar you are on before anything is built.

In production

  • Claude / GPT-4o vision

On demand

  • Whisper / faster-whisper
  • YOLO (v8 to v11)
  • Segment Anything (SAM)
  • PaddleOCR
  • ElevenLabs / OpenAI TTS
  • Retell AI

Your stack

  • Azure / Google Document AI

09

MLOps and infrastructure

The plumbing that keeps AI running in production: deployment, scaling, monitoring and retraining, so it does not quietly degrade after launch.

When we reach for it

Every system we ship lands on infrastructure we can defend: serverless Python on Modal, dashboards on Vercel, Postgres on Supabase, CI with eval gates on GitHub Actions. Production hardening of a stalled internal pilot is a standalone engagement we take often.

When we won’t

We do not park your system on infrastructure only we understand. If your platform team runs Kubernetes on AWS, we deploy there. The architecture has to survive us leaving; that is the point of it.

In production

  • Modal
  • Vercel + Next.js
  • Supabase (Postgres)
  • TypeScript
  • GitHub Actions

On demand

  • Docker
  • Railway
  • MLflow / model registries

Your stack

  • Kubernetes
  • AWS / Azure / Google Cloud
  • SageMaker / Vertex AI
  • Kafka / RabbitMQ
  • Airflow / dbt
  • Prometheus / Grafana
  • MySQL / MongoDB / Redis / SQLite
  • Java / Go / Ruby / .NET stacks
  • REST / GraphQL / webhooks

10

Security, privacy and governance

Keeping your data yours, and your AI safe to put in front of customers.

When we reach for it

Every build: row-level security in the database, per-client isolation, model APIs under no-training terms, approval gates and audit logs on anything that acts. Customer-facing deployments add prompt-injection testing and output filtering before launch, not after the first incident.

When we won’t

We do not claim certifications we do not hold, and we will not ship a customer-facing agent without a human gate and an eval suite. If a vendor cannot offer no-training terms, it does not get into the stack.

In production

  • Supabase row-level security
  • JWT auth + RBAC
  • No-training API terms
  • Audit logging

On demand

  • OWASP LLM Top 10 red-teaming
  • Output filtering / Guardrails
  • PII redaction

Your stack

  • VPC / private-cloud deployment
  • Regional model residency
  • OAuth / SSO / MFA / TLS

Why deterministic

Orchestration under guardrails

Agents are a reliability engineering problem. The flow below is the system we actually ship: the dot stops at the dashed human node because the system does.

FIG. 02
Orchestration under guardrails. A plain-text directive drives orchestration, which makes tool calls, holds at a dashed human approval gate, then executes; every action lands in an audit log. A deterministic retry loop runs between tool calls and orchestration.
Nodes: Directive, Orchestration, Tool calls, Approval, Execution, Audit log.
Annotations: Holds for approval. Retry: deterministic. Every action logged.
A real agent workflow: deterministic retries, a human approval gate, an audit log. REV 2026.06

Five chained steps of model judgment

59%

Compound success rate of a five-step workflow at 90 percent per-step accuracy. Arithmetic, not a benchmark. It is why every step that can be deterministic becomes deterministic code, and the model is spent only where judgment is the job.

Compound success rate by number of chained model steps at 90 percent per-step accuracy
| Steps chained | Per-step | Compound success |
| --- | --- | --- |
| 1 | 90% | 90% |
| 2 | 90% | 81% |
| 3 | 90% | 73% |
| 4 | 90% | 66% |
| 5 | 90% | 59% |
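
The table can be reproduced in a few lines. The one assumption, stated loudly: each step succeeds or fails independently of the others, so the per-step rates multiply. The function name is ours.

```python
def compound_success(per_step, steps):
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


# Rounded to whole percentages, as in the table.
rows = [(n, round(compound_success(0.90, n) * 100)) for n in range(1, 6)]
for steps, percent in rows:
    print(f"{steps} step(s): {percent}%")
```

Replacing a model step with deterministic code moves its per-step rate toward 1.0, which is the only lever that changes the product.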

Evals

How we know it works

We do not ship what we cannot measure. Every engagement leaves with an eval suite the client keeps.

01 · Golden datasets

A held-out set of real cases with known-good answers, agreed before we build. The system is graded against it, not against vibes.

02 · Regression gates

Eval suites run in CI on every change, with tools like promptfoo. Scores below threshold block the deploy. It fails closed.
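
A regression gate reduces to a threshold check whose exit code CI respects. The sketch below is tool-agnostic and does not show promptfoo's own configuration; the metric names and thresholds are hypothetical. A metric missing from the results counts as a failure, which is what failing closed means here.

```python
def gate(scores, thresholds):
    """Return the metrics that fail; a missing score counts as a failure."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        if score is None or score < minimum:
            failures.append(metric)
    return failures


def main(scores, thresholds):
    """Print failures and return a process exit code for the CI job."""
    failures = gate(scores, thresholds)
    for metric in failures:
        print(f"FAIL {metric}: got {scores.get(metric)}, need >= {thresholds[metric]}")
    return 1 if failures else 0  # a non-zero exit code blocks the deploy


# Hypothetical thresholds and one eval run that never produced a recall score.
thresholds = {"faithfulness": 0.90, "recall_at_5": 0.85}
scores = {"faithfulness": 0.93}

exit_code = main(scores, thresholds)
# In CI the last line would be: sys.exit(main(scores, thresholds))
```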

03 · Tracing in production

Every call, cost, latency and failure is inspectable, with tools like Langfuse. Quality decay gets caught by monitoring, not by your customers.

The same discipline runs our own pipeline

The transcript below is output from the system that sells our systems: reply detection, research, a three-stage strategy generator, a generated deck, and a hard stop at human review. The figures are from a documented run and are labeled as such, because the rule about uncited numbers applies to us first.

REPLY-TO-DECK PIPELINE · stage 3 of 5
lead: [redacted] · channel: email · tone: professional
[1] research analyst
    three pain points extracted, each from a different business function, evidence attached
[2] roi calculator
    annual waste identified $132,600
    projected savings $85,800
    roi multiple 17x
    recovery assumption: 60-80% of wasted hours, never 100
    math verified before submit
[3] message writer
    one personalization hook · no sales language
    deck link placeholder inserted
-> pandadoc deck generated from pipeline data
-> posted to #3-strategy-ready for human review
-> nothing sends without a person approving it

Output from our own sales pipeline. Illustrative numbers from a real run.

Classic ML

Not everything is an LLM problem

We did ML before the chat window. If your problem is predicting a number, a gradient-boosted model is usually better, cheaper and more explainable, and we will tell you so before you pay for a language model.

Where gradient-boosted models beat language models on tabular prediction work
| Dimension | Gradient-boosted model | Large language model |
| --- | --- | --- |
| Cost per prediction | fractions of a cent | orders of magnitude more |
| Latency | milliseconds | seconds |
| Explainability | auditable attributions | post-hoc rationales |
| Drift handling | scheduled retrain, measured | re-prompt, re-evaluate |
| Tabular prediction | the default | the exception |

The reverse holds too: when the input is language, judgment or messy documents, classic ML loses and we reach for the model. The discipline is owning both answers. The classic ML cluster above lists the tools.

The boundary

What we won’t build

01

Blockchain, VR and IoT side quests.

Breadth claims across hype domains predict mediocrity in all of them. We do AI systems for operations; that is the whole list.

02

Fine-tuning when retrieval solves it.

A fine-tuned model is expensive to build, hard to audit, and stale the day training ends. If the job is knowing your documents, the answer is retrieval, and we will say so in the first call.

03

Agents where a script will do.

Model judgment is spent only where judgment is the job. Everything that can be deterministic becomes deterministic code that either works or fails loudly.

04

Customer-facing AI without a gate and a test suite.

No eval suite, no human approval path, no launch. We size the harness to the risk, but it never rounds down to zero.

05

Guaranteed accuracy percentages.

Anyone promising a number before measuring your data is guessing. We commit to a measurement method first, then to the numbers it produces.

Start free. Know your number in five days.

A 3 to 5 day audit of your operations, ending in a plan with the ROI math attached. No obligation.

2x ROI in 90 days. Or we work for free.