Playbook — Backend API / Service
When a Next.js route is the wrong shape for the work — split into a service.
When to split from Next.js
You should split when one of these is true. Not when it “feels cleaner.”
- Work is CPU or IO heavy (> a few seconds per request) and doesn’t fit Vercel’s function timeout.
- Work is long-running / background (video processing, report generation, AI pipelines).
- A non-Next consumer exists (mobile app, partner integration, internal Python batch).
- A different scaling profile — e.g. an AI endpoint that needs GPU/warm memory while the web app needs spiky scale-to-zero.
- Different team ownership — another team owns this logic and deploy coupling is painful.
Not reasons to split: “microservices scale,” “it’s a separate concern,” “I want to try X.”
Service shape
TypeScript — default: Hono
- Hono over Express. Fast, tiny, TypeScript-native, runs on Node / Bun / Deno / Cloudflare Workers / Vercel.
- Structure:
  src/
    routes/       # One file per resource (users, invoices, etc.)
    lib/
      db/         # Drizzle
      auth/       # JWT verification
      errors/     # Typed error classes
    middleware/   # Auth, logging, rate-limit
    schemas/      # Zod schemas, shared with clients
    index.ts      # App bootstrap
- Validate with Zod; derive types from the schema. The schema is the contract.
Python — default: FastAPI
- FastAPI with Pydantic v2 models.
- uv for deps. ruff + black for style. pytest for tests.
- Structure:
  src/app/
    routers/
    models/     # Pydantic
    db/         # SQLAlchemy or raw SQL via psycopg
    deps.py     # FastAPI dependencies
    main.py
- Async by default (async def). Use asyncpg or SQLAlchemy 2.0 async.
Choose TS or Python by:
- Who consumes it? JS / web clients → TS (shared Zod types are a huge win).
- What does it do? ML / data / pandas-heavy → Python.
- What does the team know? When in doubt, TS — most of the team is already there.
API contract
- OpenAPI schema auto-generated from Zod (@hono/zod-openapi) or FastAPI (built-in). Never hand-written.
- Clients import types from the schema, not from the service’s internal code.
- Versioning: /v1/... in the path. Introduce /v2/ when you break compatibility; don’t break /v1 clients silently.
- Errors: consistent shape, { error: { code, message, details? } }. Document the codes.
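The error shape above can be sketched as a small typed helper. This is a minimal illustration, not the service's actual code: the code list in `ERROR_STATUS`, the status mapping, and the `apiError` helper are all hypothetical names chosen for the example.

```typescript
// Error envelope from the contract: { error: { code, message, details? } }.
type ApiError = {
  error: { code: string; message: string; details?: unknown };
};

// Hypothetical code list; a real service would document its own codes.
const ERROR_STATUS = {
  validation_failed: 400,
  unauthorized: 401,
  not_found: 404,
  rate_limited: 429,
  internal: 500,
} as const;

type ErrorCode = keyof typeof ERROR_STATUS;

// Build the HTTP status and JSON body for a known error code.
function apiError(
  code: ErrorCode,
  message: string,
  details?: unknown,
): { status: number; body: ApiError } {
  const error: ApiError["error"] = { code, message };
  if (details !== undefined) error.details = details; // omit the key when empty
  return { status: ERROR_STATUS[code], body: { error } };
}

const res = apiError("not_found", "Invoice not found", { id: "inv_123" });
console.log(res.status, JSON.stringify(res.body));
```

Keeping the codes in one typed map means an undocumented code fails to compile rather than leaking to clients.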
Auth between services
- Short-lived JWT (≤ 15 min) signed by an internal IdP, verified with JWKS. No shared secret passed around.
- mTLS for service-to-service calls behind a private network (Azure VNet, Tailscale on VPS).
- API keys only for external integrations. Rotate them. Scope them.
- Never trust the client — validate the JWT on every request, re-check permissions per endpoint.
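The per-request checks can be sketched as below. This assumes a JWT library has already verified the signature against the IdP's JWKS and handed back the decoded claims; only the claim checks are shown. The `scope` claim, the `Claims` type, and `checkClaims` are assumptions made for illustration.

```typescript
// Claims as they might look after a JWT library has verified the signature
// against the IdP's JWKS. The `scope` claim is an assumption for this sketch.
type Claims = { sub: string; iat: number; exp: number; scope: string[] };

const MAX_LIFETIME_SEC = 15 * 60; // "short-lived" limit from the rule above

type Decision = { ok: true } | { ok: false; reason: string };

// Run on every request: reject expired or over-long tokens, then re-check
// the permission this specific endpoint needs.
function checkClaims(claims: Claims, requiredScope: string, nowSec: number): Decision {
  if (claims.exp <= nowSec) return { ok: false, reason: "expired" };
  if (claims.exp - claims.iat > MAX_LIFETIME_SEC) {
    return { ok: false, reason: "lifetime_too_long" };
  }
  if (!claims.scope.includes(requiredScope)) {
    return { ok: false, reason: "missing_scope" };
  }
  return { ok: true };
}

const nowSec = Math.floor(Date.now() / 1000);
const decision = checkClaims(
  { sub: "svc-billing", iat: nowSec - 60, exp: nowSec + 600, scope: ["invoices:read"] },
  "invoices:write",
  nowSec,
);
console.log(decision);
```

Passing `nowSec` in rather than reading the clock inside the function keeps the check deterministic in tests.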
Database
- Every service owns its tables. No cross-service DB sharing.
- Migrations live in the service that owns the tables.
- Cross-service reads happen via API, not a SQL JOIN.
- Acceptable exception: a read-only replica / warehouse for analytics. Not for live product code.
Background jobs
- BullMQ on Redis (TS) or Celery / RQ (Python) for async work.
- Separate worker process, not tacked onto the API server. Deploy them independently.
- Idempotent by design — jobs can be retried. Use a job id and skip already-processed ones.
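A minimal sketch of the idempotency rule, assuming the job id is the deduplication key. The in-memory `Set` stands in for a durable store (a database table or Redis set in a real worker), and `Job`, `handleJob`, and `sideEffects` are hypothetical names.

```typescript
// A job as it might come off the queue. `id` is the idempotency key.
type Job = { id: string; payload: { invoiceId: string } };

// In-memory stand-in for a durable store; a real worker would persist this.
const processed = new Set<string>();
const sideEffects: string[] = [];

// Returns "done" when the work ran, "skipped" when this job id was already handled.
function handleJob(job: Job): "done" | "skipped" {
  if (processed.has(job.id)) return "skipped"; // retry of a finished job
  sideEffects.push(`report generated for ${job.payload.invoiceId}`);
  processed.add(job.id); // mark only after the work succeeded
  return "done";
}

const job: Job = { id: "job-1", payload: { invoiceId: "inv_123" } };
console.log(handleJob(job)); // first delivery: "done"
console.log(handleJob(job)); // redelivery after a retry: "skipped"
```

With concurrent workers the check and the mark would need to be atomic (for example a unique constraint on the job id); this single-process sketch does not cover that.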
Observability
- Structured JSON logs. Every request: method, path, status, duration_ms, trace_id, user_id.
- Datadog (or Application Insights on Azure) on uncaught errors + perf traces on the critical endpoints.
- OpenTelemetry traces if you have > 2 services that call each other. Skip until then.
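The per-request log line could look like the sketch below. The field names come from the list above; the `requestLog` helper, its input shape, and the choice of `null` for unauthenticated users are assumptions for illustration.

```typescript
// The six fields listed above for every request log line.
type RequestLog = {
  method: string;
  path: string;
  status: number;
  duration_ms: number;
  trace_id: string;
  user_id: string | null; // null when the request is unauthenticated (assumption)
};

// Build one log entry; timestamps are passed in so the function stays pure.
function requestLog(
  req: { method: string; path: string; traceId: string; userId: string | null },
  status: number,
  startedAtMs: number,
  finishedAtMs: number,
): RequestLog {
  return {
    method: req.method,
    path: req.path,
    status,
    duration_ms: finishedAtMs - startedAtMs,
    trace_id: req.traceId,
    user_id: req.userId,
  };
}

// One JSON object per line, so log shippers can parse it without regexes.
const started = Date.now();
const entry = requestLog(
  { method: "GET", path: "/v1/invoices", traceId: "trace-abc", userId: "user-1" },
  200,
  started,
  started + 42,
);
console.log(JSON.stringify(entry));
```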
Rate limiting & abuse
- Token bucket per (auth’d user + endpoint), backed by Redis.
- Aggressive limits on auth and expensive endpoints. Lenient on reads.
- Return 429 with Retry-After, not a 500.
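A token bucket keyed by user and endpoint can be sketched as follows. This version keeps state in process memory purely for illustration; the playbook's version is backed by Redis so all API instances share the counts. `TokenBucketLimiter` and its parameters are hypothetical names.

```typescript
// In-memory token bucket keyed by "userId:endpoint". A real deployment would
// keep this state in Redis so every API instance sees the same counts.
type Bucket = { tokens: number; updatedAtMs: number };

type TakeResult = { allowed: true } | { allowed: false; retryAfterSec: number };

class TokenBucketLimiter {
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly buckets = new Map<string, Bucket>();

  constructor(capacity: number, refillPerSec: number) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
  }

  // Try to spend one token; on refusal, report how long until one is available.
  take(userId: string, endpoint: string, nowMs: number): TakeResult {
    const key = `${userId}:${endpoint}`;
    const bucket = this.buckets.get(key) ?? { tokens: this.capacity, updatedAtMs: nowMs };
    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSec = (nowMs - bucket.updatedAtMs) / 1000;
    bucket.tokens = Math.min(this.capacity, bucket.tokens + elapsedSec * this.refillPerSec);
    bucket.updatedAtMs = nowMs;
    this.buckets.set(key, bucket);
    if (bucket.tokens >= 1) {
      bucket.tokens -= 1;
      return { allowed: true };
    }
    // Seconds until one whole token exists; this value goes in Retry-After.
    return { allowed: false, retryAfterSec: Math.ceil((1 - bucket.tokens) / this.refillPerSec) };
  }
}

// Strict limit for an auth endpoint: burst of 2, one new token per second.
const demoLimiter = new TokenBucketLimiter(2, 1);
const t = Date.now();
console.log(demoLimiter.take("user-1", "/v1/login", t));
```

When `take` refuses, the handler would respond with status 429 and set the `Retry-After` header to `retryAfterSec`.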
Deployment
- Dockerfile: multi-stage, non-root user, HEALTHCHECK.
- Health endpoint /health: returns 200 only after the DB ping and dependency checks pass.
- Graceful shutdown: handle SIGTERM, finish in-flight requests, close DB pool.
- Hosting: Azure Web Apps (MS shop) or Docker + VPS. See 05-deployment-ops.
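The health endpoint's decision logic might be sketched like this. The probes themselves (DB ping, dependency checks) are assumed to run elsewhere and pass their results in; returning 503 on failure is an assumption of the sketch, as are the names `CheckResult` and `healthResponse`.

```typescript
// Result of one dependency probe (DB ping, Redis ping, ...). The probes
// themselves are assumed to run elsewhere and are not shown.
type CheckResult = { name: string; ok: boolean };

type HealthResponse = {
  status: 200 | 503;
  body: { status: "ok" | "unhealthy"; failed: string[] };
};

// 200 only when every check passed. Returning 503 otherwise is an assumption
// of this sketch; it gives the container HEALTHCHECK a clear failure signal.
function healthResponse(checks: CheckResult[]): HealthResponse {
  const failed = checks.filter((c) => !c.ok).map((c) => c.name);
  return failed.length === 0
    ? { status: 200, body: { status: "ok", failed } }
    : { status: 503, body: { status: "unhealthy", failed } };
}

console.log(
  JSON.stringify(healthResponse([{ name: "db", ok: true }, { name: "redis", ok: false }])),
);
```

Listing the failed check names in the body makes it quicker to see which dependency took the service out of rotation.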
Testing
- Integration tests with a real Postgres in CI (testcontainers). This is where most bugs hide.
- Contract tests: round-trip every Zod/Pydantic schema against fixtures.
- Load test the critical path once with k6 before prod.
Common mistakes to avoid
- Splitting a service too early. You can always split a function later. You can’t un-split a service easily.
- One shared “utils” service that everyone depends on. That’s a distributed singleton; ship it as a library that each service pulls in instead.
- GraphQL because “it’s flexible.” Flexibility has a cost; only pay it when there’s a concrete reason.
- Exposing internal errors to clients. Log the full error; return a generic message + error code.