Playbook — Backend API / Service

When a Next.js route is the wrong shape for the work, split it into a service.

When to split from Next.js

You should split when one of these is true. Not when it “feels cleaner.”

Work is CPU or IO heavy (> a few seconds per request) and doesn’t fit Vercel’s function timeout.
Work is long-running / background (video processing, report generation, AI pipelines).
A non-Next consumer exists (mobile app, partner integration, internal Python batch).
The work needs a different scaling profile, e.g. an AI endpoint that needs a GPU and warm memory while the web app needs spiky scale-to-zero.
Another team owns this logic, and coupling its deploys to the web app is painful.

Not reasons to split: “microservices scale,” “it’s a separate concern,” “I want to try X.”

Service shape

TypeScript — default: Hono

Hono over Express. Fast, tiny, TypeScript-native, runs on Node / Bun / Deno / Cloudflare Workers / Vercel.

Structure:

src/
  routes/        # One file per resource (users, invoices, etc.)
  lib/
    db/          # Drizzle
    auth/        # JWT verification
    errors/      # Typed error classes
  middleware/    # Auth, logging, rate-limit
  schemas/       # Zod schemas, shared with clients
  index.ts       # App bootstrap

Validate with Zod; derive types from the schema. The schema is the contract.

Python — default: FastAPI

FastAPI with Pydantic v2 models.
uv for deps. ruff + black for style. pytest for tests.

Structure:

src/app/
  routers/
  models/        # Pydantic
  db/            # SQLAlchemy or raw SQL via psycopg
  deps.py        # FastAPI dependencies
  main.py

Async by default (async def). Use asyncpg, psycopg 3 in async mode, or SQLAlchemy 2.0 async, so the driver matches the async handlers.

Choose TS or Python by:

Who consumes it? JS / web clients → TS (shared Zod types are a huge win).
What does it do? ML / data / pandas-heavy → Python.
What does the team know? When in doubt, TS — most of the team is already there.

API contract

OpenAPI schema auto-generated from Zod (@hono/zod-openapi) or FastAPI (built-in). Never hand-written.
Clients import types from the schema, not from the service’s internal code.
Versioning: /v1/... in the path. Introduce /v2/ when you break compatibility; don’t break /v1 clients silently.
Errors: consistent shape — { error: { code, message, details? } }. Document the codes.
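
As a rough illustration of the error shape above, the sketch below maps thrown errors to the { error: { code, message, details? } } envelope. The names ApiError, toErrorResponse, and the internal_error code are hypothetical, not part of any library, and the framework wiring (for example a Hono error handler) is left out on purpose.

```typescript
// Error envelope from the playbook: { error: { code, message, details? } }.
type ErrorEnvelope = {
  error: { code: string; message: string; details?: unknown };
};

// Hypothetical typed error: carries a documented code and an HTTP status.
class ApiError extends Error {
  readonly code: string;
  readonly status: number;
  readonly details?: unknown;

  constructor(code: string, status: number, message: string, details?: unknown) {
    super(message);
    this.name = "ApiError";
    this.code = code;
    this.status = status;
    this.details = details;
  }
}

type ErrorResponse = { status: number; body: ErrorEnvelope };

// Known ApiErrors pass through unchanged. Anything else becomes a generic 500,
// so stack traces and driver messages never reach the client.
function toErrorResponse(err: unknown): ErrorResponse {
  if (err instanceof ApiError) {
    const error: ErrorEnvelope["error"] = { code: err.code, message: err.message };
    if (err.details !== undefined) {
      error.details = err.details;
    }
    return { status: err.status, body: { error } };
  }
  return {
    status: 500,
    body: { error: { code: "internal_error", message: "Something went wrong." } },
  };
}
```

The caller would still log the original error in full before returning the mapped response.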

Auth between services

Short-lived JWT (≤ 15 min) signed by an internal IdP, verified with JWKS. No shared secret passed around.
mTLS for service-to-service calls behind a private network (Azure VNet, Tailscale on VPS).
API keys only for external integrations. Rotate them. Scope them.
Never trust the client — validate the JWT on every request, re-check permissions per endpoint.
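
A minimal sketch of the per-endpoint check described above. It assumes the signature has already been verified against the IdP's JWKS by a JWT library, which is not shown. The VerifiedClaims shape, the space-separated scope claim, and the error codes are assumptions for illustration only.

```typescript
// Claims as they might look after signature verification.
// iat and exp are the standard JWT "issued at" and "expires" claims, in seconds.
// A space-separated `scope` string is an assumption about the internal IdP.
type VerifiedClaims = { sub: string; iat: number; exp: number; scope: string };

// Policy from the playbook: tokens live at most 15 minutes.
const MAX_TOKEN_LIFETIME_SECONDS = 15 * 60;

type AuthResult =
  | { ok: true }
  | { ok: false; status: 401 | 403; code: string };

// Re-checks expiry, lifetime policy, and the scope this endpoint requires.
function authorize(
  claims: VerifiedClaims,
  requiredScope: string,
  nowSeconds: number,
): AuthResult {
  if (claims.exp <= nowSeconds) {
    return { ok: false, status: 401, code: "token_expired" };
  }
  // Reject tokens minted with a longer lifetime than the policy allows.
  if (claims.exp - claims.iat > MAX_TOKEN_LIFETIME_SECONDS) {
    return { ok: false, status: 401, code: "token_lifetime_too_long" };
  }
  const granted = claims.scope.split(" ");
  if (granted.indexOf(requiredScope) === -1) {
    return { ok: false, status: 403, code: "insufficient_scope" };
  }
  return { ok: true };
}
```

Each route would call this with its own required scope, so a token that is valid for one endpoint is not automatically valid for another.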

Database

Every service owns its tables. No cross-service DB sharing.
Migrations live in the service that owns the tables.
Cross-service reads happen via API, not a SQL JOIN.
Acceptable exception: a read-only replica / warehouse for analytics. Not for live product code.

Background jobs

BullMQ on Redis (TS) or Celery / RQ (Python) for async work.
Separate worker process, not tacked onto the API server. Deploy them independently.
Idempotent by design — jobs can be retried. Use a job id and skip already-processed ones.
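
The idempotency rule can be sketched without any queue library. The example below uses an in-memory set as a stand-in for a durable store (in practice a Redis set or a table with a unique constraint on the job id). Job, ProcessedJobs, and runOnce are hypothetical names, not BullMQ APIs.

```typescript
// A job as the queue might deliver it. `id` doubles as the idempotency key.
type Job<T> = { id: string; payload: T };

// In-memory stand-in for a durable "processed jobs" store.
class ProcessedJobs {
  private readonly seen = new Set<string>();

  has(id: string): boolean {
    return this.seen.has(id);
  }

  markDone(id: string): void {
    this.seen.add(id);
  }
}

// Runs the handler at most once per job id. A retry of a finished job is skipped.
function runOnce<T>(
  job: Job<T>,
  store: ProcessedJobs,
  handler: (payload: T) => void,
): "processed" | "skipped" {
  if (store.has(job.id)) {
    return "skipped";
  }
  // If the handler throws, the id is never marked, so a retry runs it again.
  handler(job.payload);
  store.markDone(job.id);
  return "processed";
}
```

The check-then-mark sequence here is only safe with a single worker. With several workers, the store needs an atomic claim (for example an insert that fails on a duplicate id) instead of a separate read and write.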

Observability

Structured JSON logs. Every request: method, path, status, duration_ms, trace_id, user_id.
Datadog (or Application Insights on Azure) for uncaught-error reporting, plus performance traces on the critical endpoints.
OpenTelemetry traces if you have > 2 services that call each other. Skip until then.
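
A small sketch of the request log line listed above, written as a pure function so it can sit behind any logger. The six request fields come from the list. The ts and level fields, and using null for an anonymous user_id, are assumptions added for illustration.

```typescript
// Fields the playbook asks for on every request log line.
type RequestLog = {
  method: string;
  path: string;
  status: number;
  duration_ms: number;
  trace_id: string;
  user_id: string | null; // null for unauthenticated requests (an assumption)
};

// Serialises one request as a single JSON line, with a level derived from the status.
function formatRequestLog(entry: RequestLog, now: Date): string {
  const level = entry.status >= 500 ? "error" : entry.status >= 400 ? "warn" : "info";
  return JSON.stringify({ ts: now.toISOString(), level, ...entry });
}
```

A logging middleware would call this once per request, after the response status and duration are known, and write the result to stdout.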

Rate limiting & abuse

Token bucket per (auth’d user + endpoint), backed by Redis.
Aggressive limits on auth and expensive endpoints. Lenient on reads.
Return 429 with Retry-After, not a 500.
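
To make the token bucket concrete, here is a sketch with one bucket per user and endpoint. It keeps state in an in-memory Map purely as a stand-in for Redis, so it only illustrates the arithmetic and is not a multi-instance limiter. TokenBucketLimiter and its parameters are hypothetical names.

```typescript
type Bucket = { tokens: number; updatedAtMs: number };

type Decision =
  | { allowed: true }
  | { allowed: false; status: 429; retryAfterSeconds: number };

// In-memory stand-in for the Redis-backed limiter.
class TokenBucketLimiter {
  private readonly buckets = new Map<string, Bucket>();
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
  }

  take(userId: string, endpoint: string, nowMs: number): Decision {
    const key = userId + ":" + endpoint;
    // A key seen for the first time starts with a full bucket.
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, updatedAtMs: nowMs };

    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSeconds = Math.max(0, (nowMs - bucket.updatedAtMs) / 1000);
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSeconds * this.refillPerSecond);
    bucket.updatedAtMs = nowMs;
    this.buckets.set(key, bucket);

    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return { allowed: true };
    }
    // Seconds until one full token is available, rounded up for Retry-After.
    const retryAfterSeconds = Math.ceil((1 - bucket.tokens) / this.refillPerSecond);
    return { allowed: false, status: 429, retryAfterSeconds };
  }
}
```

An aggressive limit for a login endpoint might be a small capacity with a slow refill, while read endpoints get a larger capacity. In a Redis-backed version the refill and decrement would have to happen atomically, for example in a single script.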

Deployment

Dockerfile — multi-stage, non-root user, HEALTHCHECK.
Health endpoint /health: returns 200 only when the DB ping and dependency checks pass, 503 otherwise.
Graceful shutdown — handle SIGTERM, finish in-flight requests, close DB pool.
Hosting: Azure Web Apps (MS shop) or Docker + VPS. See 05-deployment-ops.
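
One way the /health logic from the list above could be structured, shown as a framework-independent function. HealthCheck and checkHealth are hypothetical names. The actual checks (a DB ping, a Redis ping) are passed in by the caller and are not shown, and per-check timeouts are left out for brevity.

```typescript
// A named dependency check: resolves when healthy, rejects otherwise.
type HealthCheck = { name: string; run: () => Promise<void> };

type HealthReport = {
  status: 200 | 503;
  body: { ok: boolean; checks: Record<string, "up" | "down"> };
};

// Runs every dependency check. Any failure turns the whole report into a 503,
// so the orchestrator can stop routing traffic to this instance.
async function checkHealth(checks: HealthCheck[]): Promise<HealthReport> {
  const results: Record<string, "up" | "down"> = {};
  let ok = true;
  for (const check of checks) {
    try {
      await check.run();
      results[check.name] = "up";
    } catch (_err) {
      results[check.name] = "down";
      ok = false;
    }
  }
  return { status: ok ? 200 : 503, body: { ok, checks: results } };
}
```

The route handler would call checkHealth and return the status and body as-is. The body names which dependency is down without exposing the underlying error.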

Testing

Integration tests with a real Postgres in CI (testcontainers). This is where most bugs hide.
Contract tests: round-trip every Zod/Pydantic schema against fixtures.
Load test the critical path once with k6 before prod.

Common mistakes to avoid

Splitting a service too early. You can always extract a function into a service later. Merging a service back in is much harder.
One shared “utils” service that everyone depends on. That’s a distributed singleton. Make it a library that each service pulls in instead.
GraphQL because “it’s flexible.” Flexibility has a cost; only pay it when there’s a concrete reason.
Exposing internal errors to clients. Log the full error; return a generic message + error code.