Playbook — Data / ML / AI
Stay boring until you measure a reason not to. Most “AI features” are a well-prompted LLM call + a good UX, not a RAG-agent-framework stack.
Project shape
- Python service (FastAPI) for anything non-trivial.
- Postgres + pgvector as the default store — including for vector search.
- Evals in the repo, run in CI when the prompt or model changes.
src/app/
  llm/        # Prompt construction, model clients
  pipelines/  # Multi-step flows (retrieval, extraction)
  tools/      # If building agents with tool use
  db/
  routers/
evals/
  golden/     # Expected inputs + outputs
  runner.py   # Runs prompts against golden set
  metrics.py  # Scoring
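A golden-set file can be as simple as a JSON array of input/expected pairs. The field names here are illustrative, not a required schema:

```json
[
  {
    "id": "classify-001",
    "input": "Cancel my subscription effective today.",
    "expected": {"intent": "cancellation"}
  },
  {
    "id": "classify-002",
    "input": "How do I update my billing address?",
    "expected": {"intent": "account_update"}
  }
]
```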
LLM choice
- Default: Anthropic Claude via the API. Pick the right size per task:
- Haiku — cheap, fast, simple classification/extraction.
- Sonnet — general-purpose workhorse. Most features land here.
- Opus — hard reasoning, agent loops, anything where quality matters more than cost.
- Prompt caching on by default for any repeated-context call (system prompt, long context, tool definitions). Cuts cost and latency materially — use it.
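A minimal sketch of what a cache-marked request looks like. The model name and system prompt are placeholders; the `cache_control` block shape follows Anthropic's ephemeral prompt-cache format. Building the payload as a dict keeps it testable without a network call:

```python
# Sketch: request kwargs for client.messages.create with the repeated prefix
# (system prompt) marked cacheable. Model id and prompt text are placeholders.

SYSTEM_PROMPT = "You are a support-ticket classifier."  # long in practice

def build_request(user_message: str) -> dict:
    """Build kwargs for messages.create with cacheable blocks marked."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder; pin your own
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Mark the repeated prefix as cacheable across calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Pass the dict straight through: `client.messages.create(**build_request(msg))`.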
- Other providers (OpenAI, Gemini, open models) only when a concrete need — modality, price point, on-device — demands it. Don’t “hedge” by maintaining two clients.
Orchestration
- Start with plain SDK calls. Most features are one or two calls + some glue. That’s fine.
- Don’t reach for LangChain / LlamaIndex / framework-of-the-week until you have a multi-step agent loop with tool use that you’re tired of hand-rolling.
- Agent loops: prefer Anthropic’s native tool use API over a framework abstraction. Debugging a framework call stack mid-incident is miserable.
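The loop you'd hand-roll is small. This sketch uses a stand-in `client` and a simplified reply shape so it runs offline; with the real SDK the control flow is the same: call the model, execute any requested tool, feed the result back, stop on a final text answer:

```python
# Sketch of a hand-rolled tool-use loop, no framework. `client` and the reply
# dicts are stand-ins, not the real SDK types.

TOOLS = {
    "get_weather": lambda city: f"18C and clear in {city}",  # stub tool
}

def run_agent(client, user_message: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = client.create(messages=messages)  # stand-in for messages.create
        if reply["type"] == "text":
            return reply["text"]  # model produced a final answer
        # Model asked for a tool: execute it and append the result.
        result = TOOLS[reply["tool"]](**reply["input"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": {"tool_result": result}})
    raise RuntimeError("agent did not converge within max_turns")
```

When something breaks, the entire state is one `messages` list you can print.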
Retrieval / RAG
- Postgres + pgvector under ~1M vectors. Fine for almost everything we’ll build.
- Dedicated vector DB (Pinecone, Weaviate, Qdrant) only past that, or when filtering + vector search at high QPS demands it.
- Embedding model: voyage-3 or OpenAI text-embedding-3-small. Pick one per project; don’t mix.
- Chunk sensibly (512–1024 tokens, with overlap) and store the chunking scheme alongside the vectors. When you change chunking, you’re re-embedding everything.
- Hybrid retrieval (vector + keyword/BM25) beats pure vector on most real corpora. Postgres does both.
- Always return citations to the UI. If the user can’t verify the source, they won’t trust the answer.
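A minimal chunker sketch. It counts words as a rough token proxy (real code would use the embedding model's tokenizer), and the sizes are illustrative. Store `size`/`overlap` next to the vectors so you know how the corpus was cut:

```python
# Sketch: fixed-size chunking with overlap. Word count stands in for token
# count; swap in a real tokenizer for production.

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    words = text.split()
    step = size - overlap  # each chunk starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # final chunk reached the end of the text
    return chunks
```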
Evals (non-negotiable)
- Golden set: 20–100 real inputs with expected outputs, stored as JSON in the repo.
- Runner script replays inputs against the current prompt + model, scores outputs (exact match, LLM-as-judge for open-ended, metric-based for classification).
- CI runs evals on PRs that touch prompts, model config, or retrieval. A regression blocks merge.
- Version your evals as the product evolves. An eval set that hasn’t changed in 6 months is probably out of date.
- LLM-as-judge is acceptable but validate your judge — spot-check its verdicts vs human review weekly until you trust it.
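The exact-match path of the runner fits in a few lines. `run_model` is whatever calls the LLM with the current prompt; taking it as a parameter keeps the runner testable offline:

```python
# Sketch of evals/runner.py: replay golden inputs through the current
# prompt+model and score exact match. LLM-as-judge scoring would slot in
# where the == comparison is.
import json
from pathlib import Path

def run_evals(golden_path: Path, run_model) -> float:
    """Return the fraction of golden cases whose output matches exactly."""
    cases = json.loads(golden_path.read_text())
    passed = sum(1 for c in cases if run_model(c["input"]) == c["expected"])
    return passed / len(cases)
```

In CI, fail the job when the score drops below the last recorded baseline.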
Cost & latency
- Log token counts per call (prompt, completion, cached). Put it on a dashboard.
- Alert on spikes — 2x your baseline daily spend.
- Cache aggressively — prompt caching at the API, response caching (Redis) for deterministic prompts with identical inputs.
- Stream responses to the user so p95 perceived latency is bearable even when totals are slow.
- Batch async workloads via Anthropic’s batch API for 50% cost savings when latency isn’t urgent.
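For the response cache, build the key from everything that changes the answer, including a system prompt version, so a prompt change invalidates stale entries (this is the cache-key mistake called out under common mistakes). A sketch, with Redis swapped for whatever store you use:

```python
# Sketch: response-cache key that includes the system prompt version.
# Bumping SYSTEM_PROMPT_VERSION when the prompt file changes invalidates
# every old entry automatically.
import hashlib

SYSTEM_PROMPT_VERSION = "classify-v3"  # bump when the prompt changes

def cache_key(model: str, user_prompt: str) -> str:
    raw = f"{SYSTEM_PROMPT_VERSION}|{model}|{user_prompt}"
    return hashlib.sha256(raw.encode()).hexdigest()
```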
PII & safety
- Redact PII before logging. Regex on emails/phones/addresses; tokenize IDs.
- Never send unreviewed user data to a third-party model from a paying client’s domain without an explicit DPA and a lead’s OK.
- Output filtering / moderation on any user-facing generation.
- Document the system prompt + data flow in the project README. Audits happen.
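The redaction step before logging can start as two regexes. These patterns are deliberately simple sketches; extend them for your locale's phone formats and for addresses:

```python
# Sketch: strip emails and phone numbers before anything reaches logs.
# Patterns are intentionally loose starting points, not exhaustive.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```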
Prompts
- Keep prompts in versioned files (prompts/classify-v3.md), not string-embedded in code. Load at runtime. Diff-friendly.
- A prompt is code — it goes through review. No sneaking prompt changes into a “tiny fix” PR.
- Write prompts like specs: concrete, with examples, with the failure modes you’ve seen.
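Loading a versioned prompt file is a one-liner worth centralizing. The `prompts/<name>-<version>.md` layout follows the convention above; the cache keeps hot paths off disk:

```python
# Sketch: runtime loader for versioned prompt files
# (prompts/<name>-<version>.md). Cached so repeated calls don't hit disk.
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def load_prompt(name: str, version: str, base: Path = Path("prompts")) -> str:
    return (base / f"{name}-{version}.md").read_text()
```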
Pipelines (non-LLM data work)
- Python + pandas / polars for transforms. Polars when you’re hitting perf walls.
- Orchestration: start with cron + scripts. Reach for Prefect or Dagster only when you have real DAGs with dependencies and failure handling worth abstracting.
- Output schema validated with Pandera or Pydantic — typed at both ends.
- Idempotent pipelines. Running the same day twice shouldn’t double-count.
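Idempotency usually means a keyed upsert rather than a blind insert. A sketch using sqlite3 as a stand-in for Postgres (the `ON CONFLICT` shape is the same); re-running the same day overwrites instead of double-counting:

```python
# Sketch: idempotent daily load via keyed upsert. The (day, metric) primary
# key makes a re-run overwrite rather than duplicate.
import sqlite3

def load_daily(conn: sqlite3.Connection, day: str, rows: list[tuple[str, int]]) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_metrics "
        "(day TEXT, metric TEXT, value INTEGER, PRIMARY KEY (day, metric))"
    )
    conn.executemany(
        "INSERT INTO daily_metrics (day, metric, value) VALUES (?, ?, ?) "
        "ON CONFLICT (day, metric) DO UPDATE SET value = excluded.value",
        [(day, m, v) for m, v in rows],
    )
    conn.commit()
```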
Common mistakes to avoid
- Building a framework before building the feature. Ship the one-call version first.
- RAG when the user’s context already fits in the prompt. Just put it in the prompt.
- Tuning a prompt without evals. You’re flying blind.
- Caching with a hash of the full prompt, forgetting the cache is stale when you change the system prompt. Include the system prompt version in the key.
- Storing embeddings without the model version. You can’t debug drift if you don’t know which model produced them.
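On the last point, the fix is to make the model id part of the stored row. A minimal sketch; the field names and scheme label are illustrative:

```python
# Sketch: embedding row that carries its provenance, so you can tell which
# model (and chunking scheme) produced any given vector.
from dataclasses import dataclass

EMBEDDING_MODEL = "voyage-3"  # pin one per project

@dataclass
class EmbeddingRow:
    chunk_id: str
    vector: list[float]
    model: str = EMBEDDING_MODEL
    chunk_scheme: str = "800tok-100overlap"  # illustrative label
```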