Playbook — Data / ML / AI
Stay boring until you measure a reason not to. Most “AI features” are a well-prompted LLM call + a good UX, not a RAG-agent-framework stack.
Project shape
- Python service (FastAPI) for anything non-trivial.
- Postgres + pgvector as the default store — including for vector search.
- Evals in the repo, run in CI when the prompt or model changes.
src/app/
  llm/        # Prompt construction, model clients
  pipelines/  # Multi-step flows (retrieval, extraction)
  tools/      # If building agents with tool use
  db/
  routers/
evals/
  golden/     # Expected inputs + outputs
  runner.py   # Runs prompts against golden set
  metrics.py  # Scoring
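A golden-set file can be as simple as a JSON array of input/expected pairs. The field names here are illustrative, not a required schema:

```json
[
  {
    "id": "classify-001",
    "input": "Cancel my subscription effective today.",
    "expected": {"intent": "cancellation"}
  },
  {
    "id": "classify-002",
    "input": "How do I update my billing address?",
    "expected": {"intent": "account_update"}
  }
]
```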
LLM choice
- Default: Anthropic Claude via the API. Pick the right size per task:
- Haiku — cheap, fast, simple classification/extraction.
- Sonnet — general-purpose workhorse. Most features land here.
- Opus — hard reasoning, agent loops, anything where quality matters more than cost.
- Prompt caching on by default for any repeated-context call (system prompt, long context, tool definitions). Cuts cost and latency materially — use it.
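A minimal sketch of what a cache-marked request looks like. The model name and system prompt are placeholders; the `cache_control` block shape follows Anthropic's ephemeral prompt-cache format. Building the payload as a dict keeps it testable without a network call:

```python
# Sketch: request kwargs for client.messages.create with the repeated prefix
# (system prompt) marked cacheable. Model id and prompt text are placeholders.

SYSTEM_PROMPT = "You are a support-ticket classifier."  # long in practice

def build_request(user_message: str) -> dict:
    """Build kwargs for messages.create with cacheable blocks marked."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder; pin your own
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Mark the repeated prefix as cacheable across calls.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Pass the dict straight through: `client.messages.create(**build_request(msg))`.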
- Other providers (OpenAI, Gemini, open models) only when a concrete need — modality, price point, on-device — demands it. Don’t “hedge” by maintaining two clients.
Orchestration
- Start with plain SDK calls. Most features are one or two calls + some glue. That’s fine.
- Don’t reach for LangChain / LlamaIndex / framework-of-the-week until you have a multi-step agent loop with tool use that you’re tired of hand-rolling.
- Agent loops: prefer Anthropic’s native tool use API over a framework abstraction. Debugging a framework call stack mid-incident is miserable.
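The loop you'd hand-roll is small. This sketch uses a stand-in `client` and a simplified reply shape so it runs offline; with the real SDK the control flow is the same: call the model, execute any requested tool, feed the result back, stop on a final text answer:

```python
# Sketch of a hand-rolled tool-use loop, no framework. `client` and the reply
# dicts are stand-ins, not the real SDK types.

TOOLS = {
    "get_weather": lambda city: f"18C and clear in {city}",  # stub tool
}

def run_agent(client, user_message: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = client.create(messages=messages)  # stand-in for messages.create
        if reply["type"] == "text":
            return reply["text"]  # model produced a final answer
        # Model asked for a tool: execute it and append the result.
        result = TOOLS[reply["tool"]](**reply["input"])
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": {"tool_result": result}})
    raise RuntimeError("agent did not converge within max_turns")
```

When something breaks, the entire state is one `messages` list you can print.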
Retrieval / RAG
- Postgres + pgvector under ~1M vectors. Fine for almost everything we’ll build.
- Dedicated vector DB (Pinecone, Weaviate, Qdrant) only past that, or when filtering + vector search at high QPS demands it.
- Embedding model: voyage-3 or OpenAI text-embedding-3-small. Pick one per project; don’t mix.
- Chunk sensibly (512–1024 tokens, with overlap) and store the chunking scheme alongside the vectors. When you change chunking, you’re re-embedding everything.
- Hybrid retrieval (vector + keyword/BM25) beats pure vector on most real corpora. Postgres does both.
- Always return citations to the UI. If the user can’t verify the source, they won’t trust the answer.
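A minimal chunker sketch. It counts words as a rough token proxy (real code would use the embedding model's tokenizer), and the sizes are illustrative. Store `size`/`overlap` next to the vectors so you know how the corpus was cut:

```python
# Sketch: fixed-size chunking with overlap. Word count stands in for token
# count; swap in a real tokenizer for production.

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    words = text.split()
    step = size - overlap  # each chunk starts `step` words after the last
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # final chunk reached the end of the text
    return chunks
```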
Evals (non-negotiable)
- Golden set: 20–100 real inputs with expected outputs, stored as JSON in the repo.
- Runner script replays inputs against the current prompt + model, scores outputs (exact match, LLM-as-judge for open-ended, metric-based for classification).
- CI runs evals on PRs that touch prompts, model config, or retrieval. A regression blocks merge.
- Version your evals as the product evolves. An eval set that hasn’t changed in 6 months is probably out of date.
- LLM-as-judge is acceptable but validate your judge — spot-check its verdicts vs human review weekly until you trust it.
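The exact-match path of the runner fits in a few lines. `run_model` is whatever calls the LLM with the current prompt; taking it as a parameter keeps the runner testable offline:

```python
# Sketch of evals/runner.py: replay golden inputs through the current
# prompt+model and score exact match. LLM-as-judge scoring would slot in
# where the == comparison is.
import json
from pathlib import Path

def run_evals(golden_path: Path, run_model) -> float:
    """Return the fraction of golden cases whose output matches exactly."""
    cases = json.loads(golden_path.read_text())
    passed = sum(1 for c in cases if run_model(c["input"]) == c["expected"])
    return passed / len(cases)
```

In CI, fail the job when the score drops below the last recorded baseline.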
Cost & latency
- Log token counts per call (prompt, completion, cached). Put it on a dashboard.
- Alert on spikes — 2x your baseline daily spend.
- Cache aggressively — prompt caching at the API, response caching (Redis) for deterministic prompts with identical inputs.
- Stream responses to the user so p95 perceived latency is bearable even when totals are slow.
- Batch async workloads via Anthropic’s batch API for 50% cost savings when latency isn’t urgent.
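For the response cache, build the key from everything that changes the answer, including a system prompt version, so a prompt change invalidates stale entries (this is the cache-key mistake called out under common mistakes). A sketch, with Redis swapped for whatever store you use:

```python
# Sketch: response-cache key that includes the system prompt version.
# Bumping SYSTEM_PROMPT_VERSION when the prompt file changes invalidates
# every old entry automatically.
import hashlib

SYSTEM_PROMPT_VERSION = "classify-v3"  # bump when the prompt changes

def cache_key(model: str, user_prompt: str) -> str:
    raw = f"{SYSTEM_PROMPT_VERSION}|{model}|{user_prompt}"
    return hashlib.sha256(raw.encode()).hexdigest()
```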
PII & safety
- Redact PII before logging. Regex on emails/phones/addresses; tokenize IDs.
- Never send unreviewed user data to a third-party model from a paying client’s domain without an explicit DPA and a lead’s OK.
- Output filtering / moderation on any user-facing generation.
- Document the system prompt + data flow in the project README. Audits happen.
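The redaction step before logging can start as two regexes. These patterns are deliberately simple sketches; extend them for your locale's phone formats and for addresses:

```python
# Sketch: strip emails and phone numbers before anything reaches logs.
# Patterns are intentionally loose starting points, not exhaustive.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```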
Prompts
- Keep prompts in versioned files (prompts/classify-v3.md), not string-embedded in code. Load at runtime. Diff-friendly.
- A prompt is code — it goes through review. No sneaking prompt changes into a “tiny fix” PR.
- Write prompts like specs: concrete, with examples, with the failure modes you’ve seen.
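Loading a versioned prompt file is a one-liner worth centralizing. The `prompts/<name>-<version>.md` layout follows the convention above; the cache keeps hot paths off disk:

```python
# Sketch: runtime loader for versioned prompt files
# (prompts/<name>-<version>.md). Cached so repeated calls don't hit disk.
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def load_prompt(name: str, version: str, base: Path = Path("prompts")) -> str:
    return (base / f"{name}-{version}.md").read_text()
```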
Pipelines (non-LLM data work)
- Python + pandas / polars for transforms. Polars when you’re hitting perf walls.
- Orchestration: start with cron + scripts. Reach for Prefect or Dagster only when you have real DAGs with dependencies and failure handling worth abstracting.
- Output schema validated with Pandera or Pydantic — typed at both ends.
- Idempotent pipelines. Running the same day twice shouldn’t double-count.
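Idempotency usually means a keyed upsert rather than a blind insert. A sketch using sqlite3 as a stand-in for Postgres (the `ON CONFLICT` shape is the same); re-running the same day overwrites instead of double-counting:

```python
# Sketch: idempotent daily load via keyed upsert. The (day, metric) primary
# key makes a re-run overwrite rather than duplicate.
import sqlite3

def load_daily(conn: sqlite3.Connection, day: str, rows: list[tuple[str, int]]) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_metrics "
        "(day TEXT, metric TEXT, value INTEGER, PRIMARY KEY (day, metric))"
    )
    conn.executemany(
        "INSERT INTO daily_metrics (day, metric, value) VALUES (?, ?, ?) "
        "ON CONFLICT (day, metric) DO UPDATE SET value = excluded.value",
        [(day, m, v) for m, v in rows],
    )
    conn.commit()
```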
Common mistakes to avoid
- Building a framework before building the feature. Ship the one-call version first.
- RAG when the user’s context already fits in the prompt. Just put it in the prompt.
- Tuning a prompt without evals. You’re flying blind.
- Caching with a hash of the full prompt, forgetting the cache is stale when you change the system prompt. Include the system prompt version in the key.
- Storing embeddings without the model version. You can’t debug drift if you don’t know which model produced them.
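On the last point, the fix is to make the model id part of the stored row. A minimal sketch; the field names and scheme label are illustrative:

```python
# Sketch: embedding row that carries its provenance, so you can tell which
# model (and chunking scheme) produced any given vector.
from dataclasses import dataclass

EMBEDDING_MODEL = "voyage-3"  # pin one per project

@dataclass
class EmbeddingRow:
    chunk_id: str
    vector: list[float]
    model: str = EMBEDDING_MODEL
    chunk_scheme: str = "800tok-100overlap"  # illustrative label
```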