Operations
Running the system outside dev: deployment topology, logging, observability, backups, runbooks. The deployment target has not been chosen yet — this doc captures the requirements and the shape; pick a hosting provider and fill in the concrete steps when you do.
1. Deployment topology
At minimum, production needs:
- API process: the NestJS app. Stateless; horizontally scalable. Needs outbound access to MongoDB and inbound access from the internet (via a TLS-terminating load balancer).
- MongoDB: managed (Atlas, DigitalOcean) or a self-hosted replica set. Transactions require a replica set; a standalone Mongo does not suffice once tag rename lands.
- Object storage: for thumbnail mips once the pipeline is built (S3, Cloudflare R2, Spaces). Not needed on day one.
- CDN: in front of thumbnails and (optionally) /catalog and /meta responses. Optional; the app's own Cache-Control + ETag already gives most of the win.
Not required v1: message queue, Redis, background worker fleet. Events are in-process. When event volume demands durability, graduate to BullMQ + Redis.
Pod / container sizing (guess, tune on data)
- API: 256 MB memory, 0.25 vCPU baseline. Scales on concurrency.
- MongoDB: 2 GB memory minimum; more as working set grows.
2. Secrets
| Secret | Where stored | Rotation cadence |
|---|---|---|
| MONGO_URI (includes creds) | Platform secret manager | On breach / yearly |
| JWT_ACCESS_SECRET | Platform secret manager | Yearly or on suspected compromise |
| JWT_REFRESH_SECRET | Platform secret manager | Yearly or on suspected compromise |
| ME_PASSWORD (Mongo Express) | Not deployed to prod | N/A (dev only) |
| Mongo root password | Platform secret manager | On provisioning; rotate yearly |
Never commit .env files. .env.example is the only checked-in file and has only placeholder values.
Rotation runbook — JWT secrets
- Decide which secret to rotate (access or refresh).
- For access secret rotation: users will experience a forced refresh on their next request after deployment. No further action.
- For refresh secret rotation: all existing refresh tokens become invalid — users are forced to re-login. This is visible; schedule during low-traffic.
- Two-secret verify strategy (optional, v2): support JWT_ACCESS_SECRET_CURRENT and JWT_ACCESS_SECRET_PREVIOUS; accept tokens signed by either; sign new tokens with the current one. This allows rotation without forcing a refresh storm. Same for refresh. Revisit when traffic justifies the complexity.
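
The two-secret strategy can be sketched as below. This is a minimal illustration under stated assumptions, not the app's implementation: `RotatingSecrets`, `verifyWithRotation`, and `signingSecret` are hypothetical names, and the actual signature check is injected so the sketch does not assume a particular JWT library.

```typescript
// Hypothetical helper; names and shapes are illustrative, not from the codebase.
// A Verifier returns the decoded payload or throws on a bad signature or expiry.
type Verifier<T> = (token: string, secret: string) => T;

interface RotatingSecrets {
  current: string;   // e.g. the value of JWT_ACCESS_SECRET_CURRENT
  previous?: string; // e.g. JWT_ACCESS_SECRET_PREVIOUS; unset outside a rotation window
}

// Try the current secret first, then fall back to the previous one.
// Any failure triggers the fallback; a stricter version would fall back
// only on a signature mismatch, not on expiry.
function verifyWithRotation<T>(
  token: string,
  secrets: RotatingSecrets,
  verify: Verifier<T>,
): T {
  try {
    return verify(token, secrets.current);
  } catch (err) {
    if (secrets.previous === undefined) throw err;
    return verify(token, secrets.previous);
  }
}

// New tokens are always signed with the current secret only.
function signingSecret(secrets: RotatingSecrets): string {
  return secrets.current;
}
```

With a library such as jsonwebtoken, the injected verifier could be a thin wrapper around its synchronous `verify(token, secret)` call. Once every token signed with the previous secret has expired, the previous secret can be unset.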
3. Logging and request-id
Every log line includes a request ID so concurrent traffic stays traceable.
Format
Structured JSON to stdout:
{"ts":"2026-04-21T12:00:00.123Z","level":"info","requestId":"req_01H…","route":"GET /content","userId":"64fb…","durationMs":14,"statusCode":200,"msg":"request.completed"}
Nest's default logger is replaced at bootstrap with a Pino logger wrapping the request-id context. main.ts installs RequestIdMiddleware first thing; the middleware reads X-Request-Id if present, generates req_<ulid> if not, stores it on the request, and echoes it back in the response header + meta.requestId.
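
A rough sketch of the request-ID behaviour described above, written against minimal structural types so it does not depend on Express or Nest. The names `RequestLike`, `ResponseLike`, `placeholderId`, and `requestIdMiddleware` are assumptions for illustration; the real RequestIdMiddleware may look different. The default generator is a stand-in and does not produce a spec-compliant ULID.

```typescript
// Minimal structural types; the real middleware would use the framework's types.
interface RequestLike {
  headers: Record<string, string | string[] | undefined>;
  requestId?: string;
}
interface ResponseLike {
  setHeader(name: string, value: string): void;
}

// Stand-in generator: NOT a real ULID. Swap in a ULID library in practice.
function placeholderId(): string {
  const time = Date.now().toString(36);
  const rand = Math.random().toString(36).slice(2, 12);
  return "req_" + time + rand;
}

function requestIdMiddleware(
  req: RequestLike,
  res: ResponseLike,
  next: () => void,
  newId: () => string = placeholderId,
): void {
  const incoming = req.headers["x-request-id"]; // Node lower-cases header names
  const fromHeader = Array.isArray(incoming) ? incoming[0] : incoming;
  const trimmed = fromHeader ? fromHeader.trim() : "";
  const id = trimmed !== "" ? trimmed : newId();
  req.requestId = id;                // stored on the request for the logger
  res.setHeader("X-Request-Id", id); // echoed back in the response header
  next();
}
```

Populating meta.requestId in the response body is left out here; it would typically happen wherever the response envelope is built.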
Levels
- error: unhandled exceptions, 5xx responses, security events (refresh-token reuse detected).
- warn: 4xx that smells adversarial (rate-limited IPs, repeated auth failures).
- info: successful requests (one line per request), lifecycle events (startup, shutdown).
- debug: Mongoose queries (dev only), cache hits/misses.
Never log:
- Passwords or hashes.
- Access tokens or refresh tokens (not even prefixes in prod).
- Full request bodies of auth endpoints.
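
One way to enforce the never-log list is to scrub objects before they reach the logger. The sketch below is an assumption about how that could look: the key names in `SENSITIVE_KEYS` are guesses at the payload shapes, and `redact` is a hypothetical helper that handles plain JSON-like data only.

```typescript
// Hypothetical key list (compared case-insensitively); extend to match real payloads.
const SENSITIVE_KEYS = new Set([
  "password",
  "passwordhash",
  "accesstoken",
  "refreshtoken",
  "authorization",
]);

// Return a deep copy with sensitive values replaced; the input is not mutated.
// Intended for plain JSON-like data (objects, arrays, primitives).
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
        ? "[REDACTED]"
        : redact(source[key]);
    }
    return out;
  }
  return value;
}
```

Pino also has a built-in `redact` option based on key paths, which may be the simpler choice once the logger is wired. For auth endpoints the safer default is still to skip the body entirely rather than rely on redaction.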
4. Observability placeholders
Not wired in v1. When wiring, these are the minimums:
- Health endpoint: GET /health returning 200 once Mongo ping succeeds. Wire to the platform's liveness probe.
- Readiness endpoint: GET /ready returning 200 when Mongo is connected AND the catalog cache has been primed. Wire to readiness probe.
- Metrics (v2): /metrics Prometheus endpoint via @willsoto/nestjs-prometheus or similar. Track: request count by route and status, request duration, Mongo query duration, cache hit ratio on /catalog.
- Error tracking (v2): Sentry or equivalent. Capture 5xx and security events with request ID tags.
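
The health and readiness rules could be expressed as two small functions that a controller then exposes. This is a sketch under assumptions: `ProbeResult`, `checkHealth`, and `checkReady` are hypothetical names, the Mongo ping is injected, and the 503 failure status is a choice made here rather than something this doc specifies.

```typescript
interface ProbeResult {
  statusCode: 200 | 503; // 503 on failure is an assumption
  body: { ok: boolean; checks: Record<string, boolean> };
}

// Health: 200 once a Mongo ping succeeds. `pingMongo` is supplied by the caller.
async function checkHealth(pingMongo: () => Promise<unknown>): Promise<ProbeResult> {
  let mongo = false;
  try {
    await pingMongo();
    mongo = true;
  } catch (err) {
    mongo = false;
  }
  return { statusCode: mongo ? 200 : 503, body: { ok: mongo, checks: { mongo } } };
}

// Readiness: Mongo connected AND catalog cache primed.
function checkReady(mongoConnected: boolean, catalogCachePrimed: boolean): ProbeResult {
  const ok = mongoConnected && catalogCachePrimed;
  return {
    statusCode: ok ? 200 : 503,
    body: { ok, checks: { mongo: mongoConnected, catalogCache: catalogCachePrimed } },
  };
}
```

Keeping the checks separate from the HTTP layer makes them easy to unit test and reuse if the probe mechanism changes with the hosting provider.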
5. Backups
Managed Mongo (Atlas, etc.) handles backups natively. For self-hosted:
- Daily mongodump to object storage, retained 30 days.
- Weekly full to cold storage, retained 1 year.
- Restore drill quarterly.
Nothing the API stores is irreplaceable except user accounts and curated content. Thumbnail mips can be regenerated from source URLs.
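
For the self-hosted case, the retention windows can be captured in a small pruning helper. This is an illustrative sketch: the `Backup` shape and function names are hypothetical, and "1 year" is taken as 365 days.

```typescript
type BackupKind = "daily" | "weekly";

interface Backup {
  kind: BackupKind;
  takenAt: Date;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Retention windows from the policy: 30 days for dailies, 1 year (365 days) for weeklies.
const RETENTION_DAYS: Record<BackupKind, number> = { daily: 30, weekly: 365 };

function isExpired(backup: Backup, now: Date): boolean {
  const ageMs = now.getTime() - backup.takenAt.getTime();
  return ageMs > RETENTION_DAYS[backup.kind] * DAY_MS;
}

// Returns the backups that have aged out and can be removed.
function backupsToDelete(backups: Backup[], now: Date): Backup[] {
  return backups.filter((b) => isExpired(b, now));
}
```

If the object store supports lifecycle rules, those may replace this logic entirely.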
6. Deployment checklist
Before every production deploy:
- .env.example covers every variable the code reads (grep ConfigService.get).
- Database migration scripts (if any this deploy) are tested against a recent prod backup.
- CORS allowed origins include the extension's Web Store ID.
- AUTH_ALLOW_SELF_REGISTRATION is set correctly for this environment.
- Secrets in the platform secret manager match the secrets in the build's expected env.
- Health endpoint returns 200 against the new build in staging.
- Rollback plan: previous image tag recorded; revert is a redeploy.
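
The first checklist item can be partly automated. The sketch below compares variable names declared in .env.example against string-literal keys passed to `.get(...)` calls in source text. It is a heuristic under assumptions: the function names are hypothetical, the regex misses computed keys, and it may match unrelated `.get("...")` calls, so treat its output as a prompt for review rather than a verdict.

```typescript
// Collect variable names declared in a .env-style file (KEY=value lines).
// Comment lines starting with # do not match and are skipped.
function parseEnvKeys(envText: string): Set<string> {
  const keys = new Set<string>();
  for (const line of envText.split(/\r?\n/)) {
    const match = /^\s*(?:export\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*=/.exec(line);
    if (match) keys.add(match[1]);
  }
  return keys;
}

// Collect string-literal keys passed to .get(...) or .get<T>(...) calls.
function findConfigKeys(source: string): Set<string> {
  const keys = new Set<string>();
  const pattern = /\.get(?:<[^>]*>)?\(\s*['"]([A-Za-z0-9_.]+)['"]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) keys.add(match[1]);
  return keys;
}

// Keys the code reads that .env.example does not declare, sorted.
function missingFromExample(envExample: string, sources: string[]): string[] {
  const declared = parseEnvKeys(envExample);
  const used = new Set<string>();
  for (const src of sources) findConfigKeys(src).forEach((k) => used.add(k));
  return Array.from(used).filter((k) => !declared.has(k)).sort();
}
```

A CI step could read the real files, call `missingFromExample`, and fail the build when the result is non-empty.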
7. Runbooks
Skeletons — fill in platform-specific steps once a target is chosen.
API is returning 5xx
- Check health endpoint. Is Mongo reachable?
- Check recent error logs filtered by level=error and a short time window.
- Get a request ID of a failing request; pull its log line.
- If Mongo is the cause: check Mongo metrics / health on the provider dashboard.
- If code is the cause: roll back to the previous image.
Mongo unreachable
- Is the Mongo service up in the provider's console?
- Is there a connection cap? Increase, or bounce API instances to release stale connections.
- Is DNS resolving?
- If the outage is provider-side, put up a maintenance page (TBD mechanism) and wait.
Refresh token compromise suspected
Symptom: auth.refresh.reused log events spike, or an individual user reports unexpected logouts.
- Identify the user. In Mongo: db.refresh_tokens.find({ userId }).sort({ createdAt: -1 }).
- Revoke all their refresh tokens: db.refresh_tokens.updateMany({ userId }, { $set: { revokedAt: new Date() } }).
- Force password reset for that user (TBD flow; for now, operator resets via admin tool).
- If many users are affected simultaneously, consider rotating JWT_REFRESH_SECRET, which nukes all refresh tokens globally.
- Review logs for common attributes across affected accounts (shared IP, shared UA, same referrer) to identify the attack vector.
Rate limit thundering herd
Symptom: 429s spike for a single IP or range.
- Inspect error.code: ratelimit.exceeded logs. Group by client IP.
- If it's a single actor, optionally add a temporary IP block at the load balancer.
- If it's a legitimate integration, invite them to authenticate (authenticated limits are per-user, usually higher).
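
The "group by client IP" step might look like this when working from parsed JSON log lines. This is a sketch: the `clientIp` field name is an assumption (the log format shown earlier does not list it), and `countRateLimitedByIp` is a hypothetical helper.

```typescript
interface LogLine {
  error?: { code?: string };
  clientIp?: string; // assumed field name; adjust to the real log schema
}

// Count ratelimit.exceeded events per client IP, highest count first.
function countRateLimitedByIp(lines: LogLine[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    if (line.error?.code !== "ratelimit.exceeded" || !line.clientIp) continue;
    counts.set(line.clientIp, (counts.get(line.clientIp) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

A single IP dominating the top of the list points to one actor; a flat distribution across a range suggests a wider source.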
Storage for thumbnails full
Once thumbnails are live:
- Expected growth ~N MB per approved content item.
- Alert at 75% capacity; purchase more or run retention job.
- Retention job: delete mips for content with isActive: false older than 30 days.
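
The alert threshold and the retention selection could be sketched as follows. Both helpers are hypothetical, and using `updatedAt` as the age reference is an assumption: the doc does not say which timestamp "older than 30 days" refers to.

```typescript
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Alert once usage reaches 75% of capacity.
function shouldAlertOnCapacity(usedBytes: number, capacityBytes: number): boolean {
  return capacityBytes > 0 && usedBytes / capacityBytes >= 0.75;
}

// Mongo-style filter for content whose mips are eligible for deletion.
// `updatedAt` is an assumed field; substitute whichever timestamp marks deactivation.
function inactiveContentFilter(now: Date): { isActive: false; updatedAt: { $lt: Date } } {
  const cutoff = new Date(now.getTime() - THIRTY_DAYS_MS);
  return { isActive: false, updatedAt: { $lt: cutoff } };
}
```

The retention job would query content with this filter and then delete the matching mips from object storage.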
8. SLOs (to set once there's traffic)
Starting targets, revisit after 30 days of data:
- 99th percentile response time under 300 ms for /content, /catalog, /meta.
- 99.5% availability monthly.
- Zero 5xx on /health and /ready.
Measure via the metrics endpoint once wired.
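
Until metrics are wired, the targets can be sanity-checked against sampled data with a couple of small helpers. This is a sketch with hypothetical function names; the percentile uses the nearest-rank method, which is one of several common definitions and may differ slightly from what a metrics backend reports.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// True when the 99th percentile duration is under the 300 ms target.
function meetsLatencySlo(durationsMs: number[], targetMs: number = 300): boolean {
  return percentile(durationsMs, 99) < targetMs;
}

// Minutes of downtime allowed in a window at a given availability target.
function errorBudgetMinutes(targetPercent: number, windowDays: number): number {
  return (1 - targetPercent / 100) * windowDays * 24 * 60;
}
```

For scale, a 99.5% monthly target leaves roughly 216 minutes (about 3.6 hours) of downtime in a 30-day month, since 0.005 × 30 × 24 × 60 = 216.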