# 05 — Deployment & Ops
Ship small, ship often. You cannot debug what you cannot see.
## Hosting playbooks
### Vercel (default for Next.js)
- Preview deploy per PR — enabled out of the box.
- `main` branch → production. Nothing else.
- Env vars in Vercel dashboard — split by environment (Dev / Preview / Production). Pull locally with `vercel env pull`.
- Postgres: Neon or Supabase for preview/staging only. Production DB is always Azure Postgres — even when the app itself lives on Vercel. Point `DATABASE_URL` at Azure from the Vercel production env. See 02-architecture § Database.
- Logs: Vercel's log drain → Datadog (if the project warrants it) or just the dashboard.
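The env-pull step above, as a sketch — the filename and `--environment` flag follow current Vercel CLI conventions, so verify against your installed CLI version:

```sh
# Pull Preview-environment vars into a local file. Production values stay
# in Vercel / Azure and should not land on laptops.
vercel env pull .env.local --environment=preview
```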
### Azure Web Apps (default for Microsoft-shop apps)
- Deploy via GitHub Actions using `azure/webapps-deploy@v3`. Service principal for auth, stored in repo secrets.
- Deployment slots: `staging` slot for verification, slot swap into production for zero-downtime.
- Secrets in Azure Key Vault, referenced from the Web App's app settings — never inline.
- Custom domain + managed TLS cert — Azure handles cert rotation, don’t DIY.
- Logs: Application Insights on every Azure-hosted service.
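A sketch of the Actions deploy described above — the app name, resource group, and `AZURE_CREDENTIALS` secret name are placeholders, and in practice you may want a manual approval gate between the slot deploy and the swap:

```yaml
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}   # service principal JSON
      - uses: azure/webapps-deploy@v3
        with:
          app-name: contoso-web        # placeholder
          slot-name: staging           # deploy to the staging slot first
          package: .
      # verify the staging slot, then zero-downtime swap into production
      - run: >
          az webapp deployment slot swap
          -g contoso-rg -n contoso-web
          --slot staging --target-slot production
```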
### Docker + VPS (self-hosted)
For long-running workers, stateful services, or cost-sensitive bulk compute.
- Dockerfile per service. Multi-stage build. Non-root user. `HEALTHCHECK` directive.
- `docker-compose.yml` to orchestrate the service + Postgres + Redis.
- Caddy or Traefik as reverse proxy for automatic TLS (Let’s Encrypt).
- Deploy: GitHub Actions → `ssh` into VPS → `docker compose pull && docker compose up -d`. Or use a thin tool like dokku / coolify if ceremony becomes a problem.
- Host: Hetzner first choice (price), DigitalOcean second (polish). One host until you measurably outgrow it.
- Backups: automated `pg_dump` to S3 / R2, rotated, restore-tested quarterly.
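The Dockerfile bullets above, sketched for a Node service — the base image, build output paths, port, and `/healthz` endpoint are assumptions, not requirements from this playbook:

```dockerfile
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM node:20-alpine
WORKDIR /app
COPY --from=build --chown=node:node /app/dist ./dist
COPY --from=build --chown=node:node /app/node_modules ./node_modules
# run as the non-root user that ships with the official node images
USER node
# busybox wget ships with alpine; assumes the service exposes /healthz
HEALTHCHECK --interval=30s --timeout=3s \
  CMD wget -qO- http://127.0.0.1:3000/healthz || exit 1
CMD ["node", "dist/server.js"]
```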
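The backup bullet can be sketched as a nightly script — `DB_URL` and `BUCKET` are placeholders for your Azure Postgres connection string and S3/R2 bucket, not values from this doc:

```shell
#!/bin/sh
set -eu

# Dump, compress, and ship to object storage (assumes aws CLI configured).
backup() {
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  out="backup-${stamp}.sql.gz"
  pg_dump "$DB_URL" | gzip > "$out"
  aws s3 cp "$out" "s3://${BUCKET}/pg/${out}"
}

# Rotation step: keep only the newest $1 local dumps, delete the rest.
prune_local() {
  ls -1t backup-*.sql.gz 2>/dev/null | tail -n +"$(($1 + 1))" | xargs -r rm --
}
```

The quarterly restore test is the part that matters: a dump you have never restored is a hope, not a backup.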
### When NOT Vercel
- Background workers that run > 60s per task → split to a backend service on Azure / VPS.
- Stateful anything (a persistent websocket fleet, a long-running bot) → VPS.
- Per-request cost is higher than fixed-compute cost for your traffic → VPS once measured.
## Secrets
- Vercel env / Azure Key Vault are the only acceptable stores. Vercel for anything Vercel-hosted; Key Vault for everything else.
- Rotate on: employee off-boarding, suspected leak, provider compromise, 12-month timer.
- Never in git (including `.env`), Teams, Notion, tickets, screenshots.
- If someone commits a secret: rotate it first, then force-push the history clean. Treat anything that touched git as compromised forever.
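The committed-secret bullet as concrete steps — `git-filter-repo` is a separate install (not built into git), and `.env` is an example path:

```sh
# 1. Rotate the credential FIRST — history rewriting is cleanup, not containment.
# 2. Scrub the file from every commit:
git filter-repo --invert-paths --path .env
# 3. Force-push the rewritten history; every existing clone must re-clone.
git push --force --all
git push --force --tags
```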
## Monitoring (defaults across every app)
| Tool | Use |
|---|---|
| Datadog | Our default for errors, performance (APM), logs, and infra — across every app regardless of host. Source maps uploaded in CI; release tagged per deploy. |
| Azure Application Insights | Free, first-party observability for Azure Web Apps. Turn it on with one checkbox when deploying — use it in addition to Datadog for Azure-hosted services, or instead of Datadog if the app is internal and small. |
| Vercel Logs + Speed Insights | Built into Vercel. Enough for small-team apps on their own. Pipe logs to Datadog when you need to correlate with a backend service. |
Wire these up on day one, not after the first incident. You cannot retroactively know what your p95 latency was last Tuesday.
We do not use Sentry or PostHog. If a team needs product analytics or session replay for a specific client app, propose it in a PR with a clear reason — we’ll evaluate per project.
## Email

| Scenario | Use |
|---|---|
| External / transactional (client-facing) | Resend — SDK is TS-native, templates in React Email. |
| Internal tools (must come from company domain) | Azure Entra app + Microsoft Graph API — sends as an actual company mailbox, no SPF/DKIM pain. |
Don’t use personal SMTP credentials or the Gmail API for anything production-adjacent.
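For the transactional path, Resend is also reachable over plain HTTP; a hedged sketch (API key, sender domain, and addresses are placeholders — check the current Resend API docs before relying on the payload shape):

```sh
curl -s https://api.resend.com/emails \
  -H "Authorization: Bearer $RESEND_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"from":"Acme <noreply@acme.dev>","to":["user@example.com"],"subject":"Receipt","html":"<p>Thanks!</p>"}'
```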
## Incident response
Short runbook — live this from day one, not after the first outage.
- Declare: post in the Incidents channel (Engineering team on Teams) with one line — “Site is down / Checkout failing / etc.” Include severity (SEV1 = customer impact, SEV2 = degraded, SEV3 = internal).
- Owner: one named person drives. Others support. Don’t run an incident by committee.
- Comms: owner posts updates every 15 min, even if “still investigating.” Silence is worse than no progress.
- Mitigate, then fix. Rollback is a valid mitigation. A revert + calm investigation beats a 3am hotfix.
- Postmortem within 48 hours for SEV1/SEV2. Blameless. Template: what happened, timeline, root cause, what went well, what didn’t, action items with owners + dates.
## On-call
- With a team this small, rotation is informal — one primary for the week, one backup.
- Primary response time: 15 min during business hours, 1 hour nights/weekends for SEV1/SEV2. SEV3 waits for business hours.
- Pager-worthy alerts only. If PagerDuty is firing for non-actionable reasons, the alert is broken — fix or delete it.
## What “production-ready” means here
Before flipping a new app to prod traffic, all of these are true:
- Error tracking wired (Datadog, or Application Insights for Azure-hosted), receiving events from the production env.
- Health check endpoint returning 200 + a real DB ping.
- CI green; test suite includes the critical path.
- Secrets in the right vault, not env files on someone’s laptop.
- Backup configured + one successful restore tested.
- Rollback path documented (revert commit + redeploy, or slot swap back).
- On-call knows the app exists and where the runbook lives.
- One page of docs: what it does, how to run locally, how to deploy, who owns it.
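The documented-rollback item, as concrete commands — resource names and the SHA are placeholders:

```sh
# Git-based hosts (Vercel): revert the bad commit and let CI redeploy.
git revert <bad-sha>
git push origin main

# Azure Web Apps: a slot swap is symmetric — running the same swap again
# puts the previous build back into production.
az webapp deployment slot swap \
  -g contoso-rg -n contoso-web \
  --slot staging --target-slot production
```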