Operations

Running the system outside dev: deployment topology, logging, observability, backups, runbooks. The deployment target has not been chosen yet — this doc captures the requirements and the shape; pick a hosting provider and fill in the concrete steps when you do.

1. Deployment topology

At minimum, production needs:

API process: the NestJS app. Stateless and horizontally scalable. Needs outbound access to MongoDB and inbound traffic from the internet (via a TLS-terminating load balancer).
MongoDB: managed (Atlas, DigitalOcean) or a self-hosted replica set. Transactions require a replica set, so a standalone Mongo will not suffice once tag rename lands.
Object storage: for thumbnail mips once the pipeline is built (S3, Cloudflare R2, Spaces). Not needed on day one.
CDN: in front of thumbnails and (optionally) /catalog and /meta responses. Optional; the app's own Cache-Control + ETag headers already give most of the win.

Not required for v1: message queue, Redis, background worker fleet. Events are in-process. When event volume demands durability, graduate to BullMQ + Redis.

Pod / container sizing (guess, tune on data)

API: 256 MB memory, 0.25 vCPU baseline. Scales on concurrency.
MongoDB: 2 GB memory minimum; more as working set grows.

2. Secrets

Secret	Where stored	Rotation cadence
`MONGO_URI` (includes creds)	Platform secret manager	On breach / yearly
`JWT_ACCESS_SECRET`	Platform secret manager	Yearly or on suspected compromise
`JWT_REFRESH_SECRET`	Platform secret manager	Yearly or on suspected compromise
`ME_PASSWORD` (Mongo Express)	Not deployed to prod	N/A — dev only
Mongo root password	Platform secret manager	On provisioning; rotate yearly

Never commit .env files. .env.example is the only env file checked in, and it contains only placeholder values.

Rotation runbook — JWT secrets

Decide which secret to rotate (access or refresh).
For access secret rotation: users will experience a forced refresh on their next request after deployment. No further action.
For refresh secret rotation: all existing refresh tokens become invalid — users are forced to re-login. This is visible; schedule during low-traffic.
Two-secret verify strategy (optional, v2): support JWT_ACCESS_SECRET_CURRENT and JWT_ACCESS_SECRET_PREVIOUS; accept tokens signed by either; sign new tokens with the current one. This allows rotation without forcing a refresh storm. The same applies to the refresh secret. Revisit when traffic justifies the complexity.
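The two-secret strategy can be sketched without committing to a JWT library. The following is a minimal illustration using only Node's built-in crypto module and HS256; the helper names (signToken, verifyWithRotation) are hypothetical, and a real implementation would more likely configure whichever JWT library the app already uses. Expiry and header checks are left out on purpose.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical helpers for illustration only; not part of the codebase.
const encodeJson = (value: unknown): string =>
  Buffer.from(JSON.stringify(value), "utf8").toString("base64url");

function hmacSha256(data: string, secret: string): Buffer {
  return createHmac("sha256", secret).update(data).digest();
}

// Sign a payload as an HS256 JWT. New tokens are always signed with the current secret.
function signToken(payload: Record<string, unknown>, secret: string): string {
  const header = encodeJson({ alg: "HS256", typ: "JWT" });
  const body = encodeJson(payload);
  const signature = hmacSha256(`${header}.${body}`, secret).toString("base64url");
  return `${header}.${body}.${signature}`;
}

// Try each secret in order (current first, then previous).
// Returns the decoded payload, or null when no secret matches.
// Note: this sketch checks the signature only, not exp/nbf or the header's alg.
function verifyWithRotation(
  token: string,
  secrets: string[],
): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts;
  const given = Buffer.from(signature, "base64url");
  for (const secret of secrets) {
    const expected = hmacSha256(`${header}.${body}`, secret);
    if (given.length === expected.length && timingSafeEqual(given, expected)) {
      return JSON.parse(Buffer.from(body, "base64url").toString("utf8"));
    }
  }
  return null;
}
```

In this shape, a rotation deploy would pass [JWT_ACCESS_SECRET_CURRENT, JWT_ACCESS_SECRET_PREVIOUS] as the secrets list, and the previous secret would be dropped once every token signed with it has expired.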

3. Logging and request-id

Every log line includes a request ID so concurrent traffic stays traceable.

Format

Structured JSON to stdout:

{"ts":"2026-04-21T12:00:00.123Z","level":"info","requestId":"req_01H…","route":"GET /content","userId":"64fb…","durationMs":14,"statusCode":200,"msg":"request.completed"}

Nest's default logger is replaced at bootstrap with a Pino logger wrapping the request-id context. main.ts installs RequestIdMiddleware first thing; the middleware reads X-Request-Id if present, generates req_<ulid> if not, stores it on the request, and echoes it back in the response header + meta.requestId.
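The middleware behaviour described above can be sketched in a framework-free form. The Req and Res shapes below are simplified stand-ins for the real request and response objects, and the generator uses a dash-stripped UUID in place of a ULID to stay dependency-free; both are assumptions made for this illustration. Writing meta.requestId into the response body is not shown.

```typescript
import { randomUUID } from "node:crypto";

// Simplified stand-ins for the framework's request/response types (assumptions).
type IncomingHeaders = Record<string, string | string[] | undefined>;
interface Req { headers: IncomingHeaders; requestId?: string }
interface Res { setHeader(name: string, value: string): void }

// Stand-in generator: the real format is req_<ulid>. A UUID is used here only
// to avoid a dependency; swap in a ULID library in the actual middleware.
function generateRequestId(): string {
  return `req_${randomUUID().replace(/-/g, "")}`;
}

// Reuse the caller's X-Request-Id when present and non-empty, else generate one.
function resolveRequestId(headers: IncomingHeaders): string {
  const raw = headers["x-request-id"]; // Node lowercases incoming header names
  const incoming = Array.isArray(raw) ? raw[0] : raw;
  const trimmed = incoming?.trim();
  return trimmed ? trimmed : generateRequestId();
}

// Store the ID on the request and echo it back in the response header.
function requestIdMiddleware(req: Req, res: Res, next: () => void): void {
  const id = resolveRequestId(req.headers);
  req.requestId = id;
  res.setHeader("X-Request-Id", id);
  next();
}
```

A production version may also want to cap the length or character set of a client-supplied ID before trusting it in log lines; that validation is a design choice the text above does not specify.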

Levels

error — unhandled exceptions, 5xx responses, security events (refresh-token reuse detected).
warn — 4xx that smells adversarial (rate-limited IPs, repeated auth failures).
info — successful requests (one line per request), lifecycle events (startup, shutdown).
debug — Mongoose queries (dev only), cache hits/misses.

Never log:

Passwords or hashes.
Access tokens or refresh tokens (not even prefixes in prod).
Full request bodies of auth endpoints.
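One way to enforce the list above is to scrub log payloads before they reach the logger. The sketch below is a dependency-free illustration; the key names in SENSITIVE_KEYS are assumptions about how these fields might be named, not a list taken from the codebase. Pino also ships a built-in redact option that could serve the same purpose.

```typescript
// Assumed field names (compared case-insensitively); adjust to the real payload shapes.
const SENSITIVE_KEYS = new Set([
  "password",
  "passwordhash",
  "accesstoken",
  "refreshtoken",
  "authorization",
]);

// Return a deep copy of the value with sensitive fields replaced by a marker.
// The input object is never mutated.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, inner] of Object.entries(value as Record<string, unknown>)) {
      out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return out;
  }
  return value;
}
```

For auth endpoints, the stricter rule in the list still applies: the body should not be logged at all, redacted or not.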

4. Observability placeholders

Not wired in v1. When wiring, these are the minimums:

Health endpoint GET /health returning 200 once Mongo ping succeeds. Wire to the platform's liveness probe.
Readiness endpoint GET /ready returning 200 when Mongo is connected AND the catalog cache has been primed. Wire to readiness probe.
Metrics (v2) — /metrics Prometheus endpoint via @willsoto/nestjs-prometheus or similar. Track: request count by route and status, request duration, Mongo query duration, cache hit ratio on /catalog.
Error tracking (v2) — Sentry or equivalent. Capture 5xx and security events with request ID tags.
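The difference between the health and readiness checks can be made concrete with a small decision function. This is a sketch under assumptions: the DependencyState shape is hypothetical, and returning 503 on failure is a common convention rather than something specified above.

```typescript
// Hypothetical snapshot of dependency state, gathered by the endpoint handlers.
interface DependencyState {
  mongoPingOk: boolean;        // result of the most recent Mongo ping
  mongoConnected: boolean;     // driver reports an open connection
  catalogCachePrimed: boolean; // catalog cache has been loaded at least once
}

// GET /health: 200 once the Mongo ping succeeds. 503 otherwise (assumed status).
function healthStatus(state: DependencyState): 200 | 503 {
  return state.mongoPingOk ? 200 : 503;
}

// GET /ready: 200 only when Mongo is connected AND the catalog cache is primed.
function readyStatus(state: DependencyState): 200 | 503 {
  return state.mongoConnected && state.catalogCachePrimed ? 200 : 503;
}
```

Right after startup, an instance can therefore be healthy but not yet ready, which keeps it out of the load balancer rotation until the cache is primed.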

5. Backups

Managed Mongo (Atlas, etc.) handles backups natively. For self-hosted:

Daily mongodump to object storage, retained 30 days.
Weekly full to cold storage, retained 1 year.
Restore drill quarterly.

Nothing the API stores is irreplaceable except user accounts and curated content. Thumbnail mips can be regenerated from source URLs.

6. Deployment checklist

Before every production deploy:

.env.example covers every variable the code reads (grep ConfigService.get).
Database migration scripts (if any this deploy) are tested against a recent prod backup.
CORS allowed origins include the extension's Web Store ID.
AUTH_ALLOW_SELF_REGISTRATION is set correctly for this environment.
Secrets in the platform secret manager match the environment variables the build expects.
Health endpoint returns 200 against the new build in staging.
Rollback plan: previous image tag recorded; revert is a redeploy.
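The first checklist item (every variable the code reads appears in .env.example) can be automated. The sketch below is an illustration only: it scans source text for get / getOrThrow calls with an upper-case string key, which is a heuristic that can over-match unrelated .get() calls, and the function names are hypothetical.

```typescript
// Collect variable names defined in an env file, skipping blanks and comments.
function parseEnvKeys(envFile: string): Set<string> {
  const keys = new Set<string>();
  for (const line of envFile.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq > 0) keys.add(trimmed.slice(0, eq).trim());
  }
  return keys;
}

// Heuristic: find .get('KEY'), .get<T>("KEY") and .getOrThrow('KEY') calls.
// Only upper-case keys are matched, to reduce noise from Map.get and similar.
function extractConfigKeys(source: string): Set<string> {
  const keys = new Set<string>();
  const pattern = /\.get(?:OrThrow)?(?:<[^>]*>)?\(\s*['"]([A-Z0-9_]+)['"]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    keys.add(match[1]);
  }
  return keys;
}

// Names the code reads that .env.example does not document, sorted for stable output.
function missingFromExample(source: string, envExample: string): string[] {
  const documented = parseEnvKeys(envExample);
  return Array.from(extractConfigKeys(source))
    .filter((key) => !documented.has(key))
    .sort();
}
```

Run against the concatenated source files in CI, a non-empty result would fail the build before the deploy step.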

7. Runbooks

Skeletons — fill in platform-specific steps once a target is chosen.

API is returning 5xx

Check health endpoint. Is Mongo reachable?
Check recent error logs filtered by level=error and a short time window.
Get a request ID of a failing request; pull its log line.
If Mongo is the cause: check Mongo metrics / health on the provider dashboard.
If code is the cause: roll back to the previous image.

Mongo unreachable

Is the Mongo service up in the provider's console?
Has the connection cap been hit? Raise it, or restart API instances to release stale connections.
Is DNS resolving?
If the outage is provider-side, put up a maintenance page (TBD mechanism) and wait.

Refresh token compromise suspected

Symptom: auth.refresh.reused log events spike, or an individual user reports unexpected logouts.

Identify the user. In Mongo: db.refresh_tokens.find({ userId }).sort({ createdAt: -1 }).
Revoke all their refresh tokens: db.refresh_tokens.updateMany({ userId }, { $set: { revokedAt: new Date() } }).
Force password reset for that user (TBD flow; for now, operator resets via admin tool).
If many users affected simultaneously, consider rotating JWT_REFRESH_SECRET — nukes all refresh tokens globally.
Review logs for common attributes across affected accounts (shared IP, shared UA, same referrer) to identify the attack vector.

Rate limit thundering herd

Symptom: 429s spike for a single IP or range.

Inspect error.code: ratelimit.exceeded logs. Group by client IP.
If it's a single actor, optionally add a temporary IP block at the load balancer.
If it's a legitimate integration, invite them to authenticate (authenticated limits are per-user, usually higher).
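Grouping rate-limit logs by client IP can be done with a short script over exported log lines. This sketch assumes each line is a JSON object with a nested error.code field and a clientIp field; clientIp in particular is a hypothetical field name, since the log format shown earlier does not include an IP.

```typescript
// Assumed shape of a rate-limit log line; clientIp is a hypothetical field.
interface RateLimitLog {
  clientIp?: string;
  error?: { code?: string };
}

// Count ratelimit.exceeded entries per client IP, busiest first.
// Lines that are not valid JSON or lack the expected fields are skipped.
function countRateLimitedByIp(lines: string[]): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    let entry: RateLimitLog | null;
    try {
      entry = JSON.parse(line) as RateLimitLog | null;
    } catch {
      continue; // not JSON, ignore
    }
    if (entry?.error?.code !== "ratelimit.exceeded" || !entry.clientIp) continue;
    counts.set(entry.clientIp, (counts.get(entry.clientIp) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

A single IP dominating the top of the result points at one actor; a flat distribution across a range suggests a block at the network level is the better tool.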

Storage for thumbnails full

Once thumbnails are live:

Expected growth ~N MB per approved content item.
Alert at 75% capacity; purchase more or run the retention job.
Retention job: delete mips for content with isActive: false older than 30 days.
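The retention rule can be expressed as a pure selection function, which keeps the deletion logic easy to test before it touches object storage. This sketch assumes "older than 30 days" means the item was deactivated more than 30 days ago and that a deactivatedAt timestamp exists; both are assumptions, since the rule above does not say which date is measured.

```typescript
// Assumed content shape; deactivatedAt is a hypothetical field.
interface ContentItem {
  id: string;
  isActive: boolean;
  deactivatedAt?: Date;
}

const RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Return the IDs of items whose mips are eligible for deletion:
// inactive, with a known deactivation time more than RETENTION_DAYS before `now`.
function selectForMipDeletion(items: ContentItem[], now: Date): string[] {
  const cutoff = now.getTime() - RETENTION_DAYS * MS_PER_DAY;
  return items
    .filter(
      (item) =>
        !item.isActive &&
        item.deactivatedAt !== undefined &&
        item.deactivatedAt.getTime() < cutoff,
    )
    .map((item) => item.id);
}
```

Items with no deactivation timestamp are deliberately kept, so missing data never causes a deletion.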

8. SLOs (to set once there's traffic)

Starting targets, revisit after 30 days of data:

99th percentile response time under 300 ms for /content, /catalog, /meta.
99.5% availability monthly.
Zero 5xx on /health and /ready.

Measure via the metrics endpoint once wired.
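Until the metrics endpoint exists, the starting targets can be checked against request log samples. The sketch below uses the nearest-rank percentile method and treats availability as the share of requests that did not return 5xx; that request-based definition is an assumption, since availability could also be measured in uptime minutes.

```typescript
// Fields taken from the structured log format: durationMs and statusCode.
interface RequestSample {
  durationMs: number;
  statusCode: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Share of requests that did not end in a 5xx response (assumed definition).
function availability(samples: RequestSample[]): number {
  if (samples.length === 0) throw new Error("no samples");
  const successes = samples.filter((s) => s.statusCode < 500).length;
  return successes / samples.length;
}

// Check the starting targets: p99 under 300 ms and at least 99.5% availability.
function meetsTargets(samples: RequestSample[]): boolean {
  const p99 = percentile(samples.map((s) => s.durationMs), 99);
  return p99 < 300 && availability(samples) >= 0.995;
}
```

Filtering the samples to /content, /catalog, and /meta before calling meetsTargets would match the latency target's scope.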
