EGCA Engineering Handbook
Internal reference · v1

02 — Architecture

Rule of thumb: boring wins. Monolith until it hurts. Split only when you can point at the pain.

Monolith vs semi-microservices

Default: monolith. One Next.js app, one Postgres, one deploy. This scales further than juniors expect.

Split into semi-microservices only when one of these is true:

  • One part of the app has a different scaling profile (e.g. an AI pipeline that needs a long-running worker, or a CPU-heavy report generator).
  • A team outside the web team owns part of the logic and deploy coupling is slowing them down.
  • You have long-running jobs (> ~60s) that don’t fit Vercel’s request budget.
  • A non-JS consumer needs the API (mobile native, partner, internal batch).

“Semi-microservices” here means Next.js app + 1 to 3 purpose-built services. Not 20 services. Not a service per model. If you find yourself naming a fifth service, stop and ask a lead.

Anti-examples — do NOT split for these reasons:

  • “It feels cleaner.”
  • “The blog said microservices scale better.”
  • “I want to try Go.”

Database

Postgres everywhere. No exceptions. No MongoDB, no DynamoDB, no Firestore, no Firebase. If a client insists on one of these, push back first — they’re almost always wrong, and we’ve been burned by schema drift and vendor lock-in before.

Where the Postgres lives depends on the env:

| Env | Where Postgres runs | Notes |
| --- | --- | --- |
| Local / dev | Docker Postgres on your machine (or a plain local install). | Run via `docker compose up`. Tests run against this — not a shared dev DB. |
| CI | Docker Postgres via testcontainers / GitHub Actions service. | Ephemeral, real DB, fast. |
| Preview / staging | Neon or Supabase — free tier, branchable. | Acceptable for prototypes and preview deploys. Not for prod. |
| Production | Azure Database for PostgreSQL (Flexible Server) — always. | Provisioned via Azure CLI or portal, sized per app (B-series for light apps, GP series for heavier). Integrates with Key Vault + Entra. |
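
As a starting point for the local setup, a minimal `docker-compose.yml` might look like this (service name, database name, and credentials are illustrative — match them to your app's env vars; the throwaway password is for local dev only):

```yaml
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: app        # local dev only — never reuse in prod
      POSTGRES_DB: app_dev
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```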

Other rules:

  • One Postgres per app. Shared DB across services is a distributed monolith with extra latency.
  • Use Postgres schemas for logical separation (tenants, bounded contexts) before you reach for separate DBs.
  • Multi-tenant → an indexed tenant_id column on every row, plus RLS if security requires it. Schema-per-tenant only for genuine isolation needs.
  • Migrations: one tool per project (Drizzle Kit or Prisma Migrate). Never hand-edited SQL in prod.
  • Backups: automatic daily snapshots (Azure Postgres does this by default — verify it’s actually on). Verify restore at least once — untested backups are not backups.
  • Connection pooling on serverless: PgBouncer in front of Azure Postgres, or use the provider’s pooled endpoint on Neon/Supabase. Never open a raw connection per request on Vercel.
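
The multi-tenant pattern above can be sketched on the Postgres side like this (table, policy, and setting names are illustrative, not an existing schema):

```sql
-- Indexed tenant column on every tenant-scoped table
ALTER TABLE orders ADD COLUMN tenant_id uuid NOT NULL;
CREATE INDEX orders_tenant_id_idx ON orders (tenant_id);

-- Row-level security: each session only sees its own tenant's rows
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON orders
  USING (tenant_id = current_setting('app.tenant_id')::uuid);

-- The app sets the tenant per transaction before querying:
-- SET LOCAL app.tenant_id = '<tenant-uuid>';
```

Note that RLS policies don't apply to superusers or the table owner by default; run app queries as a dedicated non-owner role.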

Provisioning Azure Postgres (quick reference)

```shell
az postgres flexible-server create \
  --resource-group <rg> \
  --name <app>-prod-db \
  --location <region> \
  --tier Burstable --sku-name Standard_B1ms \
  --storage-size 32 --version 16 \
  --admin-user <user> --admin-password <vault-ref>
```

Bump the tier (GeneralPurpose / MemoryOptimized) for heavier apps. Put the admin password in Key Vault from day one — never in a shell history or a Teams message.

ORM

| Pick | When |
| --- | --- |
| Drizzle (default) | Typed SQL, minimal magic, excellent TS inference, edge-runtime friendly. |
| Prisma | When team familiarity strongly favors it, or you need its migrate/studio tooling. |
| Raw SQL (with a thin helper like `postgres`/`pg`) | Scripts, perf-critical paths, one-off analytics. |

Don’t mix two ORMs in one project.

Cache

  • Don’t add Redis until you’ve measured a hot path. “We might need caching” is not a reason.
  • Serverless (Vercel): Upstash Redis or Vercel KV. HTTP-based, serverless-friendly.
  • Self-hosted: Redis in Docker on the same VPS.
  • Cache keys: include a version prefix (v1:user:123) so you can invalidate by bumping the prefix.
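
The versioned-key idea can be captured in a tiny helper (names are illustrative, not an existing util):

```typescript
// Bump CACHE_VERSION to invalidate every cached entry at once —
// old keys simply stop being read and expire via TTL.
const CACHE_VERSION = 'v1';

function cacheKey(...parts: (string | number)[]): string {
  return [CACHE_VERSION, ...parts].join(':');
}

// cacheKey('user', 123) -> 'v1:user:123'
```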

Queues & background jobs

  • Light recurring work (hourly syncs, cleanup) → Vercel Cron.
  • Heavier async work (emails, report generation, AI pipelines) → BullMQ on Redis (TS) or Azure Service Bus (if already on Azure).
  • Avoid building queues until you have async work. Don’t pre-wire it.

Auth

| Pick | When |
| --- | --- |
| Microsoft Entra (Azure AD) — restricted to @egca.io tenant | Internal tools and admin dashboards only. Staff sign in with their company account; offboarding happens automatically via Entra. Single-tenant app registration — reject any account outside @egca.io. |
| Clerk | Client-facing prod apps. Ship in a day. Pay for their team doing what you shouldn’t. |
| Auth.js (NextAuth) | Self-hosted flexibility for client apps, or as the Entra wrapper on internal tools. Multiple providers, low cost. |
| Custom | Only with a lead’s sign-off. Usually a mistake. |

Never use Entra for public / client-facing apps. Clients don’t have @egca.io accounts and shouldn’t be forced into the company tenant. If an app has both a client UI and an admin panel, split: client on Clerk, admin behind a separate Entra-gated subdomain (admin.app.com) or a separate Next.js app.

Entra tenant restriction — the config that matters:

```ts
// Auth.js Microsoft Entra provider
import MicrosoftEntraID from 'next-auth/providers/microsoft-entra-id'

MicrosoftEntraID({
  clientId: env.AUTH_ENTRA_CLIENT_ID,
  clientSecret: env.AUTH_ENTRA_CLIENT_SECRET,
  tenantId: env.AUTH_ENTRA_TENANT_ID,        // the egca.io tenant id
  authorization: { params: { prompt: 'select_account' } },
})
```

And in the sign-in callback, double-check the email domain as a belt-and-braces guard:

```ts
async signIn({ profile }) {
  return profile?.email?.endsWith('@egca.io') ?? false
}
```

The tenantId alone blocks other tenants. The domain check catches weird edge cases (guest accounts, personal MSAs slipping through).

API shape inside the app

  • Next.js Server Components + Server Actions for the same-app UI → API calls. No REST layer needed.
  • Public API / mobile / non-Next consumers → expose via Hono (TS) or FastAPI (Python), not Next route handlers. Keeps concerns separate.

Cross-service communication

  • HTTP + JSON with Zod / Pydantic validation at the boundary. Typed.
  • Short-lived JWT between services behind a gateway. No shared DB. No RPC frameworks.
  • Retry + timeout on every outbound call. 5s default timeout, 3 retries with jitter.
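
The retry/timeout rule can be sketched as a small wrapper (a minimal sketch with the defaults above — 5s timeout, 3 retries, jittered exponential backoff; names and base delay are illustrative):

```typescript
// Jittered exponential backoff: attempt 0 waits 100–200ms, attempt 1
// waits 200–400ms, and so on. Jitter avoids thundering-herd retries.
function backoffMs(attempt: number, baseMs = 200): number {
  const exp = baseMs * 2 ** attempt;
  return exp / 2 + Math.random() * (exp / 2);
}

async function withRetry<T>(
  fn: (signal: AbortSignal) => Promise<T>,
  { retries = 3, timeoutMs = 5000 } = {},
): Promise<T> {
  let lastErr: unknown;
  // initial attempt + `retries` retries
  for (let attempt = 0; attempt <= retries; attempt++) {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), timeoutMs);
    try {
      return await fn(ctrl.signal);
    } catch (err) {
      lastErr = err;
      if (attempt < retries) {
        await new Promise((r) => setTimeout(r, backoffMs(attempt)));
      }
    } finally {
      clearTimeout(timer);
    }
  }
  throw lastErr;
}

// Usage: withRetry((signal) => fetch('https://svc.internal/api', { signal }))
```

Pass the `AbortSignal` through to `fetch` so the timeout actually cancels the request rather than just abandoning it.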

Observability baked in from day one

  • Structured logs (JSON) — no console.log in committed code beyond local dev.
  • Every outbound call: trace id + duration + status logged.
  • Error tracking wired on every service — Datadog by default, Application Insights when the service is on Azure. See 05-deployment-ops.