# 05 — Deployment & Ops
Ship small, ship often. You cannot debug what you cannot see.
## Hosting playbooks
### Vercel (default for Next.js)
- Preview deploy per PR — enabled out of the box.
- `main` branch → production. Nothing else.
- Env vars in Vercel dashboard — split by environment (Dev / Preview / Production). Pull locally with `vercel env pull`.
- Postgres: Neon or Supabase for preview/staging only. Production DB is always Azure Postgres — even when the app itself lives on Vercel. Point `DATABASE_URL` at Azure from the Vercel production env. See 02-architecture § Database.
- Logs: Vercel's log drain → Datadog (if the project warrants it) or just the dashboard.
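The env-pull step above, as a sketch — the filename and `--environment` flag follow current Vercel CLI conventions, so verify against your installed CLI version:

```sh
# Pull Preview-environment vars into a local file. Production values stay
# in Vercel / Azure and should not land on laptops.
vercel env pull .env.local --environment=preview
```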
### Azure Web Apps (default for Microsoft-shop apps)
- Deploy via GitHub Actions using `azure/webapps-deploy@v3`. Service principal for auth, stored in repo secrets.
- Deployment slots: `staging` slot for verification, slot swap into production for zero-downtime.
- Secrets in Azure Key Vault, referenced from the Web App's app settings — never inline.
- Custom domain + managed TLS cert — Azure handles cert rotation, don’t DIY.
- Logs: Application Insights on every Azure-hosted service.
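A sketch of the Actions deploy described above — the app name, resource group, and `AZURE_CREDENTIALS` secret name are placeholders, and in practice you may want a manual approval gate between the slot deploy and the swap:

```yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}   # service principal JSON
      - uses: azure/webapps-deploy@v3
        with:
          app-name: contoso-web        # placeholder
          slot-name: staging           # deploy to the staging slot first
          package: .
      # verify the staging slot, then zero-downtime swap into production
      - run: >
          az webapp deployment slot swap
          -g contoso-rg -n contoso-web
          --slot staging --target-slot production
```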
### Docker + VPS (self-hosted)
For long-running workers, stateful services, or cost-sensitive bulk compute.
- Dockerfile per service. Multi-stage build. Non-root user. `HEALTHCHECK` directive.
- `docker-compose.yml` to orchestrate the service + Postgres + Redis.
- Caddy or Traefik as reverse proxy for automatic TLS (Let’s Encrypt).
- Deploy: GitHub Actions → `ssh` into VPS → `docker compose pull && docker compose up -d`. Or use a thin tool like dokku / coolify if ceremony becomes a problem.
- Host: Hetzner first choice (price), DigitalOcean second (polish). One host until you measurably outgrow it.
- Backups: automated `pg_dump` to S3 / R2, rotated, restore-tested quarterly.
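The Dockerfile bullets above, sketched for a Node service — the base image, build output paths, port, and `/healthz` endpoint are assumptions, not requirements from this playbook:

```dockerfile
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY --from=build --chown=node:node /app/dist ./dist
COPY --from=build --chown=node:node /app/node_modules ./node_modules
# run as the non-root user that ships with the official node images
USER node
# busybox wget ships with alpine; assumes the service exposes /healthz
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -qO- http://127.0.0.1:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
```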
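The backup bullet can be sketched as a nightly script — `DB_URL` and `BUCKET` are placeholders for your Azure Postgres connection string and S3/R2 bucket, not values from this doc:

```shell
#!/bin/sh
set -eu

# Dump, compress, and ship to object storage (assumes aws CLI configured).
backup() {
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  out="backup-${stamp}.sql.gz"
  pg_dump "$DB_URL" | gzip > "$out"
  aws s3 cp "$out" "s3://${BUCKET}/pg/${out}"
}

# Rotation step: keep only the newest $1 local dumps, delete the rest.
prune_local() {
  ls -1t backup-*.sql.gz 2>/dev/null | tail -n +"$(($1 + 1))" | xargs -r rm --
}
```

The quarterly restore test is the part that matters: a dump you have never restored is a hope, not a backup.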
### When NOT Vercel
- Background workers that run > 60s per task → split to a backend service on Azure / VPS.
- Stateful anything (a persistent websocket fleet, a long-running bot) → VPS.
- Per-request cost is higher than fixed-compute cost for your traffic → VPS once measured.
## Secrets
- Vercel env / Azure Key Vault are the only acceptable stores. Vercel for anything Vercel-hosted; Key Vault for everything else.
- Rotate on: employee off-boarding, suspected leak, provider compromise, 12-month timer.
- Never in git (including `.env`), Teams, Notion, tickets, screenshots.
- If someone commits a secret: rotate it first, then force-push the history clean. Treat anything that touched git as compromised forever.
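The committed-secret bullet as concrete steps — `git-filter-repo` is a separate install (not built into git), and `.env` is an example path:

```sh
# 1. Rotate the credential FIRST — history rewriting is cleanup, not containment.
# 2. Scrub the file from every commit:
git filter-repo --invert-paths --path .env
# 3. Force-push the rewritten history; every existing clone must re-clone.
git push --force --all
git push --force --tags
```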
## Monitoring (defaults across every app)
| Tool | Use |
|---|---|
| Datadog | Our default for errors, performance (APM), logs, and infra — across every app regardless of host. Source maps uploaded in CI; release tagged per deploy. |
| Azure Application Insights | Free, first-party observability for Azure Web Apps. Turn it on with one checkbox when deploying — use it in addition to Datadog for Azure-hosted services, or instead of Datadog if the app is internal and small. |
| Vercel Logs + Speed Insights | Built into Vercel. Enough for small-team apps on their own. Pipe logs to Datadog when you need to correlate with a backend service. |
Wire these up on day one, not after the first incident. You cannot retroactively know what your p95 latency was last Tuesday.
We do not use Sentry or PostHog. If a team needs product analytics or session replay for a specific client app, propose it in a PR with a clear reason — we’ll evaluate per project.
## Email

| Scenario | Use |
|---|---|
| External / transactional (client-facing) | Resend — SDK is TS-native, templates in React Email. |
| Internal tools (must come from company domain) | Azure Entra app + Microsoft Graph API — sends as an actual company mailbox, no SPF/DKIM pain. |
Don’t use personal SMTP credentials or the Gmail API for anything production-adjacent.
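For the transactional path, Resend is also reachable over plain HTTP; a hedged sketch (API key, sender domain, and addresses are placeholders — check the current Resend API docs before relying on the payload shape):

```sh
curl -s https://api.resend.com/emails \
  -H "Authorization: Bearer $RESEND_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"from":"Acme <noreply@acme.dev>","to":["user@example.com"],"subject":"Receipt","html":"<p>Thanks!</p>"}'
```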
## Incident response
Short runbook — live this from day one, not after the first outage.
- Declare: post in the Incidents channel (Engineering team on Teams) with one line — “Site is down / Checkout failing / etc.” Include severity (SEV1 = customer impact, SEV2 = degraded, SEV3 = internal).
- Owner: one named person drives. Others support. Don’t run an incident by committee.
- Comms: owner posts updates every 15 min, even if “still investigating.” Silence is worse than no progress.
- Mitigate, then fix. Rollback is a valid mitigation. A revert + calm investigation beats a 3am hotfix.
- Postmortem within 48 hours for SEV1/SEV2. Blameless. Template: what happened, timeline, root cause, what went well, what didn’t, action items with owners + dates.
## On-call
- With a team this small, rotation is informal — one primary for the week, one backup.
- Primary response time: 15 min during business hours, 1 hour nights/weekends for SEV1/SEV2. SEV3 waits for business hours.
- Pager-worthy alerts only. If PagerDuty is firing for non-actionable reasons, the alert is broken — fix or delete it.
## What “production-ready” means here
Before flipping a new app to prod traffic, all of these are true:
- Error tracking wired (Datadog, or Application Insights for Azure-hosted), receiving events from the production env.
- Health check endpoint returning 200 + a real DB ping.
- CI green; test suite includes the critical path.
- Secrets in the right vault, not env files on someone’s laptop.
- Backup configured + one successful restore tested.
- Rollback path documented (revert commit + redeploy, or slot swap back).
- On-call knows the app exists and where the runbook lives.
- One page of docs: what it does, how to run locally, how to deploy, who owns it.
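The documented-rollback item, as concrete commands — resource names and the SHA are placeholders:

```sh
# Git-based hosts (Vercel): revert the bad commit and let CI redeploy.
git revert <bad-sha>
git push origin main

# Azure Web Apps: a slot swap is symmetric — running the same swap again
# puts the previous build back into production.
az webapp deployment slot swap \
  -g contoso-rg -n contoso-web \
  --slot staging --target-slot production
```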