EGCA Engineering Handbook
Internal reference · v1

Playbook — Backend API / Service

When a Next.js route is the wrong shape for the work — split into a service.

When to split from Next.js

You should split when one of these is true. Not when it “feels cleaner.”

  • Work is CPU- or I/O-heavy (more than a few seconds per request) and doesn’t fit within Vercel’s function timeout.
  • Work is long-running / background (video processing, report generation, AI pipelines).
  • A non-Next consumer exists (mobile app, partner integration, internal Python batch).
  • A different scaling profile — e.g. an AI endpoint that needs GPU/warm memory while the web app needs spiky scale-to-zero.
  • Different team ownership — another team owns this logic and deploy coupling is painful.

Not reasons to split: “microservices scale,” “it’s a separate concern,” “I want to try X.”

Service shape

TypeScript — default: Hono

  • Hono over Express. Fast, tiny, TypeScript-native, runs on Node / Bun / Deno / Cloudflare Workers / Vercel.
  • Structure:
    src/
      routes/        # One file per resource (users, invoices, etc.)
      lib/
        db/          # Drizzle
        auth/        # JWT verification
        errors/      # Typed error classes
      middleware/    # Auth, logging, rate-limit
      schemas/       # Zod schemas, shared with clients
      index.ts       # App bootstrap
  • Validate with Zod; derive types from the schema. The schema is the contract.

Python — default: FastAPI

  • FastAPI with Pydantic v2 models.
  • uv for deps. ruff + black for style. pytest for tests.
  • Structure:
    src/app/
      routers/
      models/        # Pydantic
      db/            # SQLAlchemy or raw SQL via psycopg
      deps.py        # FastAPI dependencies
      main.py
  • Async by default (async def). Use asyncpg or SQLAlchemy 2.0 async.

Choose TS or Python by:

  • Who consumes it? JS / web clients → TS (shared Zod types are a huge win).
  • What does it do? ML / data / pandas-heavy → Python.
  • What does the team know? When in doubt, TS — most of the team is already there.

API contract

  • OpenAPI schema auto-generated from Zod (@hono/zod-openapi) or FastAPI (built-in). Never hand-written.
  • Clients import types from the schema, not from the service’s internal code.
  • Versioning: /v1/... in the path. Introduce /v2/ when you break compatibility; don’t break /v1 clients silently.
  • Errors: consistent shape — { error: { code, message, details? } }. Document the codes.
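
The error shape above can be sketched as a small helper. This is a minimal, framework-agnostic sketch in stdlib Python, not an existing EGCA module: the names ApiError and to_response and the codes INVOICE_NOT_FOUND and INTERNAL_ERROR are hypothetical. Unknown exceptions are mapped to a generic 500 so that internal details stay in the logs.

```python
from typing import Any, Dict, Optional, Tuple


class ApiError(Exception):
    """Hypothetical typed error: holds only what the client is allowed to see."""

    def __init__(self, code: str, message: str, status: int = 400,
                 details: Optional[Dict[str, Any]] = None) -> None:
        super().__init__(message)
        self.code = code          # stable, documented, machine-readable
        self.message = message    # safe, human-readable
        self.status = status      # HTTP status to send
        self.details = details    # optional structured context


def to_response(exc: Exception) -> Tuple[int, Dict[str, Any]]:
    """Build (status, body) in the { error: { code, message, details? } } shape."""
    if not isinstance(exc, ApiError):
        # Unexpected failure: the caller logs the full error; the client gets a generic one.
        exc = ApiError("INTERNAL_ERROR", "Something went wrong.", status=500)
    error: Dict[str, Any] = {"code": exc.code, "message": exc.message}
    if exc.details is not None:   # "details" is optional, so omit the key when unset
        error["details"] = exc.details
    return exc.status, {"error": error}
```

A route handler would catch exceptions at the edge, log them in full, and send the pair returned by to_response.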

Auth between services

  • Short-lived JWT (≤ 15 min) signed by an internal IdP, verified with JWKS. No shared secret passed around.
  • mTLS for service-to-service calls behind a private network (Azure VNet, Tailscale on VPS).
  • API keys only for external integrations. Rotate them. Scope them.
  • Never trust the client — validate the JWT on every request, re-check permissions per endpoint.
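
The per-request checks above can be sketched as follows. This assumes the signature has already been verified against the IdP's JWKS by a JWT library and only covers the policy checks that remain. The iat and exp claims are standard JWT claims; treating scope as a space-separated string, and the names check_claims and AuthError, are assumptions made for illustration.

```python
import time
from typing import Any, Mapping, Optional

MAX_TOKEN_LIFETIME_S = 15 * 60  # the "≤ 15 min" rule from the list above


class AuthError(Exception):
    """Raised when signature-verified claims still fail the service's policy."""


def check_claims(claims: Mapping[str, Any], required_scope: str,
                 now: Optional[float] = None) -> None:
    """Reject expired, overly long-lived, or under-privileged tokens."""
    now = time.time() if now is None else now
    iat, exp = claims.get("iat"), claims.get("exp")
    if not isinstance(iat, (int, float)) or not isinstance(exp, (int, float)):
        raise AuthError("missing iat/exp")
    if exp <= now:
        raise AuthError("token expired")
    if exp - iat > MAX_TOKEN_LIFETIME_S:
        raise AuthError("token lifetime exceeds 15 minutes")
    # Permission is re-checked per endpoint, never assumed from an earlier call.
    granted = str(claims.get("scope", "")).split()
    if required_scope not in granted:
        raise AuthError("missing scope: " + required_scope)
```

Each endpoint would call check_claims with its own required scope, for example check_claims(claims, "invoices:read").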

Database

  • Every service owns its tables. No cross-service DB sharing.
  • Migrations live in the service that owns the tables.
  • Cross-service reads happen via API, not a SQL JOIN.
  • Acceptable exception: a read-only replica / warehouse for analytics. Not for live product code.

Background jobs

  • BullMQ on Redis (TS) or Celery / RQ (Python) for async work.
  • Separate worker process, not tacked onto the API server. Deploy them independently.
  • Idempotent by design — jobs can be retried. Use a job id and skip already-processed ones.
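
The "job id, skip already-processed" rule can be sketched like this. It is a minimal in-memory illustration, not BullMQ, Celery, or RQ code: the class name IdempotentWorker is hypothetical, and a real worker would keep the processed ids in Redis or the database rather than in a Python set.

```python
from typing import Any, Callable, Dict, Set


class IdempotentWorker:
    """Runs a handler at most once per successfully completed job id."""

    def __init__(self, handler: Callable[[Dict[str, Any]], None]) -> None:
        self._handler = handler
        self._done: Set[str] = set()  # stand-in for a durable store

    def process(self, job_id: str, payload: Dict[str, Any]) -> bool:
        """Return True if the job ran, False if it was skipped as a duplicate."""
        if job_id in self._done:
            return False              # retry or redelivery of a finished job
        self._handler(payload)        # may raise; the job then stays retryable
        self._done.add(job_id)        # recorded only after the handler succeeds
        return True
```

Recording the id only after success means a crash between the handler and the record can still run the job twice, so the handler itself should tolerate a repeat, or the side effect and the record should be committed in one transaction.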

Observability

  • Structured JSON logs. Every request: method, path, status, duration_ms, trace_id, user_id.
  • Datadog (or Application Insights on Azure) for uncaught-error reporting, plus performance traces on the critical endpoints.
  • OpenTelemetry traces if you have > 2 services that call each other. Skip until then.
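
A per-request log line with the fields listed above might look like this. It is a minimal stdlib sketch: the function name request_log_line is hypothetical, and generating a trace id when none arrives with the request is an assumption, not a rule from this playbook.

```python
import json
import time
import uuid
from typing import Optional


def request_log_line(method: str, path: str, status: int, started_at: float,
                     trace_id: Optional[str] = None, user_id: Optional[str] = None,
                     now: Optional[float] = None) -> str:
    """Serialize one request as a single JSON object on one line."""
    now = time.monotonic() if now is None else now
    record = {
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": round((now - started_at) * 1000, 1),
        "trace_id": trace_id or uuid.uuid4().hex,  # assumption: mint one if absent
        "user_id": user_id,                        # None for unauthenticated requests
    }
    return json.dumps(record, separators=(",", ":"))
```

Middleware would capture started_at with time.monotonic() before the handler runs and print the returned line to stdout once the response is ready.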

Rate limiting & abuse

  • Token bucket per (auth’d user + endpoint), backed by Redis.
  • Aggressive limits on auth and expensive endpoints. Lenient on reads.
  • Return 429 with Retry-After, not a 500.
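
The token bucket above can be sketched as follows. This is an in-memory illustration only: the dict stands in for Redis, where the read-and-update step would have to be made atomic, and the class name TokenBucketLimiter and its parameters are hypothetical.

```python
import math
import time
from typing import Dict, Optional, Tuple


class TokenBucketLimiter:
    """One bucket per (user, endpoint); each request spends one token."""

    def __init__(self, capacity: int, refill_per_s: float) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        # key -> (tokens remaining, time of last update)
        self._buckets: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def allow(self, user_id: str, endpoint: str,
              now: Optional[float] = None) -> Tuple[bool, int]:
        """Return (allowed, retry_after_s); retry_after_s is 0 when allowed."""
        now = time.monotonic() if now is None else now
        key = (user_id, endpoint)
        tokens, last = self._buckets.get(key, (float(self.capacity), now))
        # Refill for the time elapsed since the last call, capped at capacity.
        tokens = min(float(self.capacity), tokens + (now - last) * self.refill_per_s)
        if tokens >= 1:
            self._buckets[key] = (tokens - 1, now)
            return True, 0
        self._buckets[key] = (tokens, now)
        # Whole seconds until one full token is available again.
        return False, math.ceil((1 - tokens) / self.refill_per_s)
```

When allow returns False, the handler would respond with 429 and set the Retry-After header to the returned number of seconds.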

Deployment

  • Dockerfile — multi-stage, non-root user, HEALTHCHECK.
  • Health endpoint /health: returns 200 only when the DB ping and dependency checks pass.
  • Graceful shutdown — handle SIGTERM, finish in-flight requests, close DB pool.
  • Hosting: Azure Web Apps (MS shop) or Docker + VPS. See 05-deployment-ops.
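
One reading of the /health bullet above is that the endpoint runs each check and reports 200 only when all of them pass. The sketch below assumes that reading. Returning 503 on failure, the "ok" / "degraded" labels, and the function name run_health_checks are assumptions rather than handbook rules.

```python
from typing import Any, Callable, Dict, Tuple


def run_health_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[int, Dict[str, Any]]:
    """Run every check; a check passes if it returns without raising."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            check()                   # e.g. a "SELECT 1" against the DB pool
            results[name] = "ok"
        except Exception:
            results[name] = "fail"    # keep exception text out of the response
    healthy = all(state == "ok" for state in results.values())
    status = 200 if healthy else 503  # assumption: 503 marks the instance unready
    return status, {"status": "ok" if healthy else "degraded", "checks": results}
```

The /health route would pass in one callable per dependency, for example {"db": ping_db, "redis": ping_redis}, where ping_db and ping_redis are hypothetical helpers that raise on failure.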

Testing

  • Integration tests with a real Postgres in CI (testcontainers). This is where most bugs hide.
  • Contract tests: round-trip every Zod/Pydantic schema against fixtures.
  • Load test the critical path once with k6 before prod.

Common mistakes to avoid

  • Splitting a service too early. You can always split a function later. You can’t un-split a service easily.
  • One shared “utils” service that everyone depends on. That’s a distributed singleton. Ship it as a library that each service pulls in instead.
  • GraphQL because “it’s flexible.” Flexibility has a cost; only pay it when there’s a concrete reason.
  • Exposing internal errors to clients. Log the full error; return a generic message + error code.