02 — Architecture
Rule of thumb: boring wins. Monolith until it hurts. Split only when you can point at the pain.
Monolith vs semi-microservices
Default: monolith. One Next.js app, one Postgres, one deploy. This scales further than juniors expect.
Split into semi-microservices only when one of these is true:
- One part of the app has a different scaling profile (e.g. an AI pipeline that needs a long-running worker, or a CPU-heavy report generator).
- A team outside the web team owns part of the logic and deploy coupling is slowing them down.
- You have long-running jobs (> ~60s) that don’t fit Vercel’s request budget.
- A non-JS consumer needs the API (mobile native, partner, internal batch).
“Semi-microservices” here means Next.js app + 1 to 3 purpose-built services. Not 20 services. Not a service per model. If you find yourself naming a fifth service, stop and ask a lead.
Anti-examples — do NOT split for these reasons:
- “It feels cleaner.”
- “The blog said microservices scale better.”
- “I want to try Go.”
Database
Postgres everywhere. No exceptions. No MongoDB, no DynamoDB, no Firestore, no Firebase. If a client insists on one of these, push back first — they’re almost always wrong, and we’ve been burned by schema drift and vendor lock-in before.
Where the Postgres lives depends on the env:
| Env | Where Postgres runs | Notes |
|---|---|---|
| Local / dev | Docker Postgres on your machine (or a plain local install). | Run via docker compose up. Tests run against this — not a shared dev DB. |
| CI | Docker Postgres via testcontainers / GitHub Actions service. | Ephemeral, real DB, fast. |
| Preview / staging | Neon or Supabase — free tier, branchable. | Acceptable for prototypes and preview deploys. Not for prod. |
| Production | Azure Database for PostgreSQL (Flexible Server) — always. | Provisioned via Azure CLI or portal, sized per app (B-series for light apps, GP series for heavier). Integrates with Key Vault + Entra. |
Other rules:
- One Postgres per app. Shared DB across services is a distributed monolith with extra latency.
- Use Postgres schemas for logical separation (tenants, bounded contexts) before you reach for separate DBs.
- Multi-tenant → row-level: a `tenant_id` column, indexed, plus RLS if security requires it. Schema-per-tenant only for genuine isolation needs.
- Migrations: one tool per project (Drizzle Kit or Prisma Migrate). Never hand-edited SQL in prod.
- Backups: automatic daily snapshots (Azure Postgres does this by default — verify it’s actually on). Verify restore at least once — untested backups are not backups.
- Connection pooling on serverless: PgBouncer in front of Azure Postgres, or use the provider’s pooled endpoint on Neon/Supabase. Never open a raw connection per request on Vercel.
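The `tenant_id` rule above, sketched with Drizzle (assumes `drizzle-orm` is installed; the table and column names are illustrative, not a prescribed schema):

```typescript
// Hypothetical multi-tenant table: every row carries tenant_id, indexed.
import { pgTable, uuid, text, index } from 'drizzle-orm/pg-core'

export const projects = pgTable(
  'projects',
  {
    id: uuid('id').primaryKey().defaultRandom(),
    tenantId: uuid('tenant_id').notNull(), // row-level tenancy column
    name: text('name').notNull(),
  },
  // Index tenant_id — every tenant-scoped query filters on it.
  (t) => [index('projects_tenant_id_idx').on(t.tenantId)],
)
```

Every query then filters on `tenantId`; RLS on top of this is additive, not a replacement for the filter.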
Provisioning Azure Postgres (quick reference)
```bash
az postgres flexible-server create \
  --resource-group <rg> \
  --name <app>-prod-db \
  --location <region> \
  --tier Burstable --sku-name Standard_B1ms \
  --storage-size 32 --version 16 \
  --admin-user <user> --admin-password <vault-ref>
```
Bump the tier (GeneralPurpose / MemoryOptimized) for heavier apps. Put the admin password in Key Vault from day one — never in a shell history or a Teams message.
ORM
| Pick | When |
|---|---|
| Drizzle (default) | Typed SQL, minimal magic, excellent TS inference, edge-runtime friendly. |
| Prisma | When team familiarity strongly favors it, or you need its migrate/studio tooling. |
| Raw SQL (with a thin helper like `postgres`/`pg`) | Scripts, perf-critical paths, one-off analytics. |
Don’t mix two ORMs in one project.
Cache
- Don’t add Redis until you’ve measured a hot path. “We might need caching” is not a reason.
- Serverless (Vercel): Upstash Redis or Vercel KV. HTTP-based, serverless-friendly.
- Self-hosted: Redis in Docker on the same VPS.
- Cache keys: include a version prefix (`v1:user:123`) so you can invalidate by bumping the prefix.
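The version-prefix rule as a tiny helper (a sketch; the constant and key parts are illustrative):

```typescript
// Hypothetical cache-key helper. Bumping CACHE_VERSION invalidates every
// key at once — no need to enumerate and delete old entries in Redis.
const CACHE_VERSION = 'v1'

export function cacheKey(...parts: (string | number)[]): string {
  return [CACHE_VERSION, ...parts].join(':')
}

// cacheKey('user', 123) → 'v1:user:123'
```

Centralizing key construction also prevents the usual drift where two call sites spell the same key differently.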
Queues & background jobs
- Light recurring work (hourly syncs, cleanup) → Vercel Cron.
- Heavier async work (emails, report generation, AI pipelines) → BullMQ on Redis (TS) or Azure Service Bus (if already on Azure).
- Avoid building queues until you have async work. Don’t pre-wire it.
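For the Vercel Cron case, a minimal `vercel.json` fragment (the route path is illustrative):

```json
{
  "crons": [
    { "path": "/api/cron/cleanup", "schedule": "0 * * * *" }
  ]
}
```

The route itself is publicly reachable, so the handler should still guard itself (Vercel's convention is checking the `CRON_SECRET` bearer token) before doing any work.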
Auth
| Pick | When |
|---|---|
| Microsoft Entra (Azure AD) — restricted to @egca.io tenant | Internal tools and admin dashboards only. Staff sign in with their company account; offboarding happens automatically via Entra. Single-tenant app registration — reject any account outside @egca.io. |
| Clerk | Client-facing prod apps. Ship in a day. Pay for their team doing what you shouldn’t. |
| Auth.js (NextAuth) | Self-hosted flexibility for client apps, or as the Entra wrapper on internal tools. Multiple providers, low cost. |
| Custom | Only with a lead’s sign-off. Usually a mistake. |
Never use Entra for public / client-facing apps. Clients don’t have @egca.io accounts and shouldn’t be forced into the company tenant. If an app has both a client UI and an admin panel, split: client on Clerk, admin behind a separate Entra-gated subdomain (admin.app.com) or a separate Next.js app.
Entra tenant restriction — the config that matters:
```typescript
// Auth.js Microsoft Entra provider
MicrosoftEntraID({
  clientId: env.AUTH_ENTRA_CLIENT_ID,
  clientSecret: env.AUTH_ENTRA_CLIENT_SECRET,
  tenantId: env.AUTH_ENTRA_TENANT_ID, // the egca.io tenant id
  authorization: { params: { prompt: 'select_account' } },
})
```
And in the sign-in callback, double-check the email domain as a belt-and-braces guard:
```typescript
async signIn({ profile }) {
  return profile?.email?.endsWith('@egca.io') ?? false
}
```
The tenantId alone blocks other tenants. The domain check catches weird edge cases (guest accounts, personal MSAs slipping through).
API shape inside the app
- Next.js Server Components + Server Actions for calls from the app's own UI to its own backend. No separate REST layer needed.
- Public API / mobile / non-Next consumers → expose via Hono (TS) or FastAPI (Python), not Next route handlers. Keeps concerns separate.
Cross-service communication
- HTTP + JSON with Zod / Pydantic validation at the boundary. Typed.
- Short-lived JWT between services behind a gateway. No shared DB. No RPC frameworks.
- Retry + timeout on every outbound call. 5s default timeout, 3 retries with jitter.
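The retry + timeout rule, sketched as a plain `fetch` wrapper (Node 18+; the function name and budgets are illustrative):

```typescript
// Hypothetical outbound-call helper: 5s timeout per attempt,
// retries with full jitter, only retries server-side failures.
export async function callService(
  url: string,
  init: RequestInit = {},
  retries = 3,
): Promise<Response> {
  let lastError: unknown
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      const res = await fetch(url, { ...init, signal: AbortSignal.timeout(5_000) })
      if (res.status < 500) return res // 4xx is the caller's problem — don't retry
      lastError = new Error(`upstream returned ${res.status}`)
    } catch (err) {
      lastError = err // timeout or network failure
    }
    // Full jitter: sleep a random slice of an exponential backoff window.
    const backoffMs = Math.random() * 2 ** attempt * 250
    await new Promise((resolve) => setTimeout(resolve, backoffMs))
  }
  throw lastError
}
```

Jitter matters: without it, every caller retries on the same beat and you hammer a recovering service in sync.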
Observability baked in from day one
- Structured logs (JSON) — no `console.log` in committed code beyond local dev.
- Every outbound call: trace id + duration + status logged.
- Error tracking wired on every service — Datadog by default, Application Insights when the service is on Azure. See 05-deployment-ops.
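The logging rules above, sketched as a minimal helper (field names are illustrative — match whatever your Datadog / App Insights pipeline expects; one JSON object per line to stdout is the structured pattern, distinct from the ad-hoc `console.log` debugging the rule bans):

```typescript
// Hypothetical structured log entry for an outbound call.
type OutboundLog = {
  traceId: string
  url: string
  status: number
  durationMs: number
}

export function logOutbound(entry: OutboundLog): void {
  // Emit one JSON object per line — what log agents ingest from stdout.
  console.log(
    JSON.stringify({
      level: 'info',
      event: 'outbound_call',
      ts: new Date().toISOString(),
      ...entry,
    }),
  )
}
```

Pairing this with the outbound-call wrapper gives every cross-service request a trace id, duration, and status for free.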