02 — Architecture
Rule of thumb: boring wins. Monolith until it hurts. Split only when you can point at the pain.
Monolith vs semi-microservices
Default: monolith. One Next.js app, one Postgres, one deploy. This scales further than juniors expect.
Split into semi-microservices only when one of these is true:
- One part of the app has a different scaling profile (e.g. an AI pipeline that needs a long-running worker, or a CPU-heavy report generator).
- A team outside the web team owns part of the logic and deploy coupling is slowing them down.
- You have long-running jobs (> ~60s) that don’t fit Vercel’s request budget.
- A non-JS consumer needs the API (mobile native, partner, internal batch).
“Semi-microservices” here means Next.js app + 1 to 3 purpose-built services. Not 20 services. Not a service per model. If you find yourself naming a fifth service, stop and ask a lead.
Anti-examples — do NOT split for these reasons:
- “It feels cleaner.”
- “The blog said microservices scale better.”
- “I want to try Go.”
Database
Postgres everywhere. No exceptions. No MongoDB, no DynamoDB, no Firestore, no Firebase. If a client insists on one of these, push back first — they’re almost always wrong, and we’ve been burned by schema drift and vendor lock-in before.
Where the Postgres lives depends on the env:
| Env | Where Postgres runs | Notes |
|---|---|---|
| Local / dev | Docker Postgres on your machine (or a plain local install). | Run via docker compose up. Tests run against this — not a shared dev DB. |
| CI | Docker Postgres via testcontainers / GitHub Actions service. | Ephemeral, real DB, fast. |
| Preview / staging | Neon or Supabase — free tier, branchable. | Acceptable for prototypes and preview deploys. Not for prod. |
| Production | Azure Database for PostgreSQL (Flexible Server) — always. | Provisioned via Azure CLI or portal, sized per app (B-series for light apps, GP series for heavier). Integrates with Key Vault + Entra. |
Other rules:
- One Postgres per app. Shared DB across services is a distributed monolith with extra latency.
- Use Postgres schemas for logical separation (tenants, bounded contexts) before you reach for separate DBs.
- Multi-tenant → row-level: a `tenant_id` column, indexed, plus RLS if security requires it. Schema-per-tenant only for genuine isolation needs.
- Migrations: one tool per project (Drizzle Kit or Prisma Migrate). Never hand-edited SQL in prod.
- Backups: automatic daily snapshots (Azure Postgres does this by default — verify it’s actually on). Verify restore at least once — untested backups are not backups.
- Connection pooling on serverless: PgBouncer in front of Azure Postgres, or use the provider’s pooled endpoint on Neon/Supabase. Never open a raw connection per request on Vercel.
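The `tenant_id` rule above, sketched with Drizzle (assumes `drizzle-orm` is installed; the table and column names are illustrative, not a prescribed schema):

```typescript
// Hypothetical multi-tenant table: every row carries tenant_id, indexed.
import { pgTable, uuid, text, index } from 'drizzle-orm/pg-core'

export const projects = pgTable(
  'projects',
  {
    id: uuid('id').primaryKey().defaultRandom(),
    tenantId: uuid('tenant_id').notNull(), // row-level tenancy column
    name: text('name').notNull(),
  },
  // Index tenant_id — every tenant-scoped query filters on it.
  (t) => [index('projects_tenant_id_idx').on(t.tenantId)],
)
```

Every query then filters on `tenantId`; RLS on top of this is additive, not a replacement for the filter.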
Provisioning Azure Postgres (quick reference)
```bash
az postgres flexible-server create \
  --resource-group <rg> \
  --name <app>-prod-db \
  --location <region> \
  --tier Burstable --sku-name Standard_B1ms \
  --storage-size 32 --version 16 \
  --admin-user <user> --admin-password <vault-ref>
```
Bump the tier (GeneralPurpose / MemoryOptimized) for heavier apps. Put the admin password in Key Vault from day one — never in a shell history or a Teams message.
ORM
| Pick | When |
|---|---|
| Drizzle (default) | Typed SQL, minimal magic, excellent TS inference, edge-runtime friendly. |
| Prisma | When team familiarity strongly favors it, or you need its migrate/studio tooling. |
| Raw SQL (with a thin helper like `postgres`/`pg`) | Scripts, perf-critical paths, one-off analytics. |
Don’t mix two ORMs in one project.
Cache
- Don’t add Redis until you’ve measured a hot path. “We might need caching” is not a reason.
- Serverless (Vercel): Upstash Redis or Vercel KV. HTTP-based, serverless-friendly.
- Self-hosted: Redis in Docker on the same VPS.
- Cache keys: include a version prefix (`v1:user:123`) so you can invalidate by bumping the prefix.
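The version-prefix rule as a tiny helper (a sketch; the constant and key parts are illustrative):

```typescript
// Hypothetical cache-key helper. Bumping CACHE_VERSION invalidates every
// key at once — no need to enumerate and delete old entries in Redis.
const CACHE_VERSION = 'v1'

export function cacheKey(...parts: (string | number)[]): string {
  return [CACHE_VERSION, ...parts].join(':')
}

// cacheKey('user', 123) → 'v1:user:123'
```

Centralizing key construction also prevents the usual drift where two call sites spell the same key differently.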
Queues & background jobs
- Light recurring work (hourly syncs, cleanup) → Vercel Cron.
- Heavier async work (emails, report generation, AI pipelines) → BullMQ on Redis (TS) or Azure Service Bus (if already on Azure).
- Avoid building queues until you have async work. Don’t pre-wire it.
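For the Vercel Cron case, a minimal `vercel.json` fragment (the route path is illustrative):

```json
{
  "crons": [
    { "path": "/api/cron/cleanup", "schedule": "0 * * * *" }
  ]
}
```

The route itself is publicly reachable, so the handler should still guard itself (Vercel's convention is checking the `CRON_SECRET` bearer token) before doing any work.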
Auth
| Pick | When |
|---|---|
| Microsoft Entra (Azure AD) — restricted to @egca.io tenant | Internal tools and admin dashboards only. Staff sign in with their company account; offboarding happens automatically via Entra. Single-tenant app registration — reject any account outside @egca.io. |
| Clerk | Client-facing prod apps. Ship in a day. Pay for their team doing what you shouldn’t. |
| Auth.js (NextAuth) | Self-hosted flexibility for client apps, or as the Entra wrapper on internal tools. Multiple providers, low cost. |
| Custom | Only with a lead’s sign-off. Usually a mistake. |
Never use Entra for public / client-facing apps. Clients don’t have @egca.io accounts and shouldn’t be forced into the company tenant. If an app has both a client UI and an admin panel, split: client on Clerk, admin behind a separate Entra-gated subdomain (admin.app.com) or a separate Next.js app.
Entra tenant restriction — the config that matters:
```typescript
// Auth.js Microsoft Entra provider
MicrosoftEntraID({
  clientId: env.AUTH_ENTRA_CLIENT_ID,
  clientSecret: env.AUTH_ENTRA_CLIENT_SECRET,
  tenantId: env.AUTH_ENTRA_TENANT_ID, // the egca.io tenant id
  authorization: { params: { prompt: 'select_account' } },
})
```
And in the sign-in callback, double-check the email domain as a belt-and-braces guard:
```typescript
async signIn({ profile }) {
  return profile?.email?.endsWith('@egca.io') ?? false
}
```
The tenantId alone blocks other tenants. The domain check catches weird edge cases (guest accounts, personal MSAs slipping through).
API shape inside the app
- Next.js Server Components + Server Actions for calls from the app's own UI to its own backend. No separate REST layer needed.
- Public API / mobile / non-Next consumers → expose via Hono (TS) or FastAPI (Python), not Next route handlers. Keeps concerns separate.
Cross-service communication
- HTTP + JSON with Zod / Pydantic validation at the boundary. Typed.
- Short-lived JWT between services behind a gateway. No shared DB. No RPC frameworks.
- Retry + timeout on every outbound call. 5s default timeout, 3 retries with jitter.
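The retry + timeout rule, sketched as a plain `fetch` wrapper (Node 18+; the function name and budgets are illustrative):

```typescript
// Hypothetical outbound-call helper: 5s timeout per attempt,
// retries with full jitter, only retries server-side failures.
export async function callService(
  url: string,
  init: RequestInit = {},
  retries = 3,
): Promise<Response> {
  let lastError: unknown
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, { ...init, signal: AbortSignal.timeout(5_000) })
      if (res.status < 500) return res // 4xx is the caller's problem — don't retry
      lastError = new Error(`upstream returned ${res.status}`)
    } catch (err) {
      lastError = err // timeout or network failure
    }
    // Full jitter: sleep a random slice of an exponential backoff window.
    const backoffMs = Math.random() * 2 ** attempt * 250
    await new Promise((resolve) => setTimeout(resolve, backoffMs))
  }
  throw lastError
}
```

Jitter matters: without it, every caller retries on the same beat and you hammer a recovering service in sync.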
Observability baked in from day one
- Structured logs (JSON) — no `console.log` in committed code beyond local dev.
- Every outbound call: trace id + duration + status logged.
- Error tracking wired on every service — Datadog by default, Application Insights when the service is on Azure. See 05-deployment-ops.
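The logging rules above, sketched as a minimal helper (field names are illustrative — match whatever your Datadog / App Insights pipeline expects; one JSON object per line to stdout is the structured pattern, distinct from the ad-hoc `console.log` debugging the rule bans):

```typescript
// Hypothetical structured log entry for an outbound call.
type OutboundLog = {
  traceId: string
  url: string
  status: number
  durationMs: number
}

export function logOutbound(entry: OutboundLog): void {
  // Emit one JSON object per line — what log agents ingest from stdout.
  console.log(
    JSON.stringify({
      level: 'info',
      event: 'outbound_call',
      ts: new Date().toISOString(),
      ...entry,
    }),
  )
}
```

Pairing this with the outbound-call wrapper gives every cross-service request a trace id, duration, and status for free.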