EGCA Engineering Handbook
Internal reference · v1

Playbook — Data / ML / AI

Stay boring until you measure a reason not to. Most “AI features” are a well-prompted LLM call plus a good UX, not a RAG-agent-framework stack.

Project shape

  • Python service (FastAPI) for anything non-trivial.
  • Postgres + pgvector as the default store — including for vector search.
  • Evals in the repo, run in CI when the prompt or model changes.
src/app/
  llm/              # Prompt construction, model clients
  pipelines/        # Multi-step flows (retrieval, extraction)
  tools/            # If building agents with tool use
  db/
  routers/
evals/
  golden/           # Expected inputs + outputs
  runner.py         # Runs prompts against golden set
  metrics.py        # Scoring

LLM choice

  • Default: Anthropic Claude via the API. Pick the right size per task:
    • Haiku — cheap, fast, simple classification/extraction.
    • Sonnet — general-purpose workhorse. Most features land here.
    • Opus — hard reasoning, agent loops, anything where quality matters more than cost.
  • Prompt caching on by default for any repeated-context call (system prompt, long context, tool definitions). Cuts cost and latency materially — use it.
  • Other providers (OpenAI, Gemini, open models) only when a concrete need — modality, price point, on-device — demands it. Don’t “hedge” by maintaining two clients.
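Prompt caching is opt-in at the request level. A minimal sketch of marking a stable system prompt as a cache breakpoint, assuming the Anthropic Messages API shape; it builds the payload only (no network call), and the model id is a placeholder to pin per project:

```python
# Sketch: flag a long, repeated system prompt for prompt caching.
# Builds the request payload only. Model id is a placeholder.

SYSTEM_PROMPT = "You are a support-ticket classifier. ..."  # long, stable text

def build_request(user_message: str) -> dict:
    """Payload for the Messages API with the system block marked cacheable."""
    return {
        "model": "claude-sonnet-latest",  # placeholder: pin a real model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # cache_control marks this block as a cache breakpoint;
                # repeated calls with an identical prefix hit the cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("Order #123 arrived damaged.")
```

Everything before the breakpoint (system prompt, tool definitions) should be byte-identical across calls, or the cache misses.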

Orchestration

  • Start with plain SDK calls. Most features are one or two calls + some glue. That’s fine.
  • Don’t reach for LangChain / LlamaIndex / framework-of-the-week until you have a multi-step agent loop with tool use that you’re tired of hand-rolling.
  • Agent loops: prefer Anthropic’s native tool use API over a framework abstraction. Debugging a framework call stack mid-incident is miserable.
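The hand-rolled loop is mostly a dispatch table. A sketch in the shape of the native tool-use blocks, with the model client stubbed out and a hypothetical tool name; a real loop feeds the `tool_result` blocks back as the next user turn:

```python
# Minimal hand-rolled tool dispatch in the shape of native tool use.
# The model client is stubbed; get_order_status is a hypothetical tool.

TOOLS = [{
    "name": "get_order_status",
    "description": "Look up an order's shipping status by id.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def get_order_status(order_id: str) -> str:
    return f"order {order_id}: shipped"  # stand-in for a real lookup

DISPATCH = {"get_order_status": get_order_status}

def run_tool_calls(content_blocks: list[dict]) -> list[dict]:
    """Execute tool_use blocks; return tool_result blocks for the next turn."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue
        output = DISPATCH[block["name"]](**block["input"])
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],
            "content": output,
        })
    return results

results = run_tool_calls([
    {"type": "tool_use", "id": "tu_1", "name": "get_order_status",
     "input": {"order_id": "A17"}},
])
```

When this loop stops fitting in a screenful, that is the signal to reconsider a framework, not before.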

Retrieval / RAG

  • Postgres + pgvector under ~1M vectors. Fine for almost everything we’ll build.
  • Dedicated vector DB (Pinecone, Weaviate, Qdrant) only beyond that scale, or when combined filtering + vector search at high QPS demands it.
  • Embedding model: voyage-3 or OpenAI text-embedding-3-small. Pick one per project; don’t mix.
  • Chunk sensibly (512–1024 tokens, with overlap) and store the chunking scheme alongside the vectors. When you change chunking, you’re re-embedding everything.
  • Hybrid retrieval (vector + keyword/BM25) beats pure vector on most real corpora. Postgres does both.
  • Always return citations to the UI. If the user can’t verify the source, they won’t trust the answer.
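The chunking scheme itself is worth five lines. A sketch of fixed-window chunking with overlap, splitting on whitespace as a stand-in for a real tokenizer; the 512/64 numbers are illustrative, and the scheme record is what gets stored alongside the vectors:

```python
# Sketch of fixed-window chunking with overlap. Whitespace split stands in
# for a real tokenizer; sizes are illustrative.

def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    assert 0 <= overlap < size
    tokens = text.split()
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break
    return chunks

# Store the scheme with the vectors so a chunking change is detectable.
CHUNKING_SCHEME = {"strategy": "fixed-window", "size": 512, "overlap": 64}

parts = chunk_text("one two three four five six", size=4, overlap=1)
```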

Evals (non-negotiable)

  • Golden set: 20–100 real inputs with expected outputs, stored as JSON in the repo.
  • Runner script replays inputs against the current prompt + model, scores outputs (exact match, LLM-as-judge for open-ended, metric-based for classification).
  • CI runs evals on PRs that touch prompts, model config, or retrieval. A regression blocks merge.
  • Version your evals as the product evolves. An eval set that hasn’t changed in 6 months is probably out of date.
  • LLM-as-judge is acceptable but validate your judge — spot-check its verdicts vs human review weekly until you trust it.
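The runner does not need to be clever. A minimal exact-match version, with the model call stubbed; the golden-record shape (`{"input": ..., "expected": ...}`) is an assumption, not a fixed format:

```python
import json

# Minimal exact-match eval runner over a golden set. The model call is
# stubbed; the golden-record shape is an assumption.

def generate(prompt: str) -> str:
    """Stand-in for the real prompt + model call."""
    return {"2+2?": "4", "capital of France?": "Paris"}.get(prompt, "")

def run_evals(golden: list[dict]) -> float:
    """Return the exact-match pass rate; CI blocks merge below a threshold."""
    passed = sum(1 for case in golden
                 if generate(case["input"]) == case["expected"])
    return passed / len(golden)

golden = json.loads('[{"input": "2+2?", "expected": "4"},'
                    ' {"input": "capital of France?", "expected": "Paris"}]')
score = run_evals(golden)
```

Swap the scorer per task: exact match for classification, an LLM-as-judge call for open-ended outputs.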

Cost & latency

  • Log token counts per call (prompt, completion, cached). Put it on a dashboard.
  • Alert on spikes — anything beyond 2× your baseline daily spend.
  • Cache aggressively — prompt caching at the API, response caching (Redis) for deterministic prompts with identical inputs.
  • Stream responses to the user so p95 perceived latency is bearable even when totals are slow.
  • Batch async workloads via Anthropic’s batch API for 50% cost savings when latency isn’t urgent.
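Per-call accounting is a flattening job. A sketch that turns a Messages-API-style usage block into a dashboard record; the field names follow Anthropic's usage object, and the prices are placeholders, not current rates:

```python
# Sketch: per-call token accounting from a usage block. Prices are
# placeholders (USD per million tokens), not current rates.

PRICE_PER_MTOK = {"input": 3.0, "output": 15.0, "cache_read": 0.30}

def log_usage(usage: dict) -> dict:
    """Flatten a usage block into the record shipped to the dashboard."""
    record = {
        "prompt_tokens": usage.get("input_tokens", 0),
        "completion_tokens": usage.get("output_tokens", 0),
        "cached_tokens": usage.get("cache_read_input_tokens", 0),
    }
    record["est_cost_usd"] = (
        record["prompt_tokens"] / 1e6 * PRICE_PER_MTOK["input"]
        + record["completion_tokens"] / 1e6 * PRICE_PER_MTOK["output"]
        + record["cached_tokens"] / 1e6 * PRICE_PER_MTOK["cache_read"]
    )
    return record

rec = log_usage({"input_tokens": 1000, "output_tokens": 200,
                 "cache_read_input_tokens": 9000})
```

The cached/uncached split is the number that tells you whether prompt caching is actually working.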

PII & safety

  • Redact PII before logging. Regex on emails/phones/addresses; tokenize IDs.
  • Never send unreviewed user data to a third-party model from a paying client’s domain without an explicit DPA and a lead’s OK.
  • Output filtering / moderation on any user-facing generation.
  • Document the system prompt + data flow in the project README. Audits happen.
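Pre-log redaction can start as two regexes. A sketch for emails and phone numbers; the patterns are deliberately simple and should be tuned to your data, not treated as complete:

```python
import re

# Sketch of pre-log PII redaction. Patterns are deliberately simple;
# tune them to the corpus before relying on them.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    """Replace emails and phone numbers before the text reaches any log."""
    text = EMAIL.sub("<email>", text)
    text = PHONE.sub("<phone>", text)
    return text

clean = redact("Reach me at jane@example.com or +1 (555) 010-7788.")
```

Run redaction at the logging boundary, not scattered through business logic, so nothing slips past it.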

Prompts

  • Keep prompts in versioned files (prompts/classify-v3.md), not string-embedded in code. Load at runtime. Diff-friendly.
  • A prompt is code — it goes through review. No sneaking prompt changes into a “tiny fix” PR.
  • Write prompts like specs: concrete, with examples, with the failure modes you’ve seen.
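Loading a versioned prompt file at runtime is a few lines. A sketch using stdlib `string.Template` for variable substitution; the `$`-placeholder convention and the throwaway file standing in for `prompts/classify-v3.md` are both assumptions:

```python
from pathlib import Path
from string import Template
import tempfile

# Sketch: prompts live in versioned files and load at runtime.
# $-placeholders via string.Template are an assumption, not a house rule.

def load_prompt(path: Path, **variables: str) -> str:
    """Read a prompt file and fill its $placeholders."""
    return Template(path.read_text()).substitute(variables)

# Demo with a throwaway file standing in for prompts/classify-v3.md.
with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "classify-v3.md"
    p.write_text("Classify the ticket into $labels.")
    prompt = load_prompt(p, labels="billing|bug|other")
```

`substitute` raises on a missing variable, which is what you want: a half-filled prompt should fail loudly, not ship.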

Pipelines (non-LLM data work)

  • Python + pandas / polars for transforms. Polars when you’re hitting perf walls.
  • Orchestration: start with cron + scripts. Reach for Prefect or Dagster only when you have real DAGs with dependencies and failure handling worth abstracting.
  • Output schema validated with Pandera or Pydantic — typed at both ends.
  • Idempotent pipelines. Running the same day twice shouldn’t double-count.
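The usual idempotency pattern is replace-by-partition rather than append. A sketch with an in-memory dict standing in for the warehouse table; in Postgres the same idea is a delete-then-insert (or upsert) keyed on the partition:

```python
# Sketch of an idempotent daily load: replace the day's partition instead of
# appending, so reruns never double-count. Dict stands in for a table.

STORE: dict[str, list[dict]] = {}  # partition key (day) -> rows

def load_day(day: str, rows: list[dict]) -> None:
    """Delete-then-insert by partition: rerunning the same day is safe."""
    STORE[day] = list(rows)  # overwrite, never append

load_day("2024-06-01", [{"orders": 10}])
load_day("2024-06-01", [{"orders": 10}])  # rerun: same partition, same rows
total = sum(r["orders"] for rows in STORE.values() for r in rows)
```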

Common mistakes to avoid

  • Building a framework before building the feature. Ship the one-call version first.
  • RAG when the user’s context already fits in the prompt. Just put it in the prompt.
  • Tuning a prompt without evals. You’re flying blind.
  • Caching with a hash of the full prompt, forgetting the cache is stale when you change the system prompt. Include the system prompt version in the key.
  • Storing embeddings without the model version. You can’t debug drift if you don’t know which model produced them.
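The stale-cache mistake above has a one-function fix: fold the system-prompt version and model id into the key. A sketch with placeholder identifiers:

```python
import hashlib

# Sketch: a response-cache key that folds in the system-prompt version and
# model id, so bumping either invalidates stale entries. Ids are placeholders.

SYSTEM_PROMPT_VERSION = "classify-v3"  # bump when the prompt file changes
MODEL = "claude-sonnet-latest"         # placeholder model id

def cache_key(user_prompt: str) -> str:
    """Stable key over model, prompt version, and user input."""
    material = "\x1f".join([MODEL, SYSTEM_PROMPT_VERSION, user_prompt])
    return hashlib.sha256(material.encode()).hexdigest()

k1 = cache_key("hello")
k2 = cache_key("hello")
```

The `\x1f` separator avoids collisions between adjacent fields; any unambiguous delimiter works.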