Operations

Running the system outside dev: deployment topology, logging, observability, backups, runbooks. The deployment target has not been chosen yet — this doc captures the requirements and the shape; pick a hosting provider and fill in the concrete steps when you do.


1. Deployment topology

At minimum, production needs:

  • API process — the NestJS app. Stateless and horizontally scalable. Needs outbound access to MongoDB and inbound traffic from the internet (via a TLS-terminating load balancer).
  • MongoDB — managed (Atlas, DigitalOcean) or self-hosted replica set. Transactions require a replica set — a standalone Mongo does not suffice once tag rename lands.
  • Object storage — for thumbnail mips once the pipeline is built (S3, Cloudflare R2, Spaces). Not needed on day one.
  • CDN — in front of thumbnails and (optionally) /catalog and /meta responses. Optional; the app's own Cache-Control + ETag already gives most of the win.

Not required in v1: message queue, Redis, or a background worker fleet. Events are in-process. When event volume demands durability, graduate to BullMQ + Redis.

Pod / container sizing (guess, tune on data)

  • API: 256 MB memory, 0.25 vCPU baseline. Scales on concurrency.
  • MongoDB: 2 GB memory minimum; more as working set grows.

2. Secrets

| Secret | Where stored | Rotation cadence |
| --- | --- | --- |
| MONGO_URI (includes creds) | Platform secret manager | On breach / yearly |
| JWT_ACCESS_SECRET | Platform secret manager | Yearly or on suspected compromise |
| JWT_REFRESH_SECRET | Platform secret manager | Yearly or on suspected compromise |
| ME_PASSWORD (Mongo Express) | Not deployed to prod | N/A — dev only |
| Mongo root password | Platform secret manager | On provisioning; rotate yearly |

Never commit .env files. .env.example is the only env file checked in, and it contains only placeholder values.

Rotation runbook — JWT secrets

  1. Decide which secret to rotate (access or refresh).
  2. For access secret rotation: users will experience a forced refresh on their next request after deployment. No further action.
  3. For refresh secret rotation: all existing refresh tokens become invalid — users are forced to re-login. This is visible; schedule during low-traffic.
  4. Two-secret verify strategy (optional, v2): support JWT_ACCESS_SECRET_CURRENT and JWT_ACCESS_SECRET_PREVIOUS; accept tokens signed by either; sign new with current. Allows rotation without forcing refresh storm. Same for refresh. Revisit when traffic justifies the complexity.
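The two-secret strategy in step 4 can be sketched as follows. This is a minimal illustration using a hand-rolled HS256 signer from node:crypto so it stays self-contained; the helper names (sign, verifyWithRotation) are illustrative, not the app's real API.

```typescript
// Sketch of the optional two-secret verify strategy: accept tokens signed by
// either the current or the previous secret; always sign new tokens with current.
import { createHmac, timingSafeEqual } from "node:crypto";

const b64url = (buf: Buffer) => buf.toString("base64url");

// Minimal HS256 JWT signer — just enough to demonstrate rotation.
function sign(payload: object, secret: string): string {
  const header = b64url(Buffer.from(JSON.stringify({ alg: "HS256", typ: "JWT" })));
  const body = b64url(Buffer.from(JSON.stringify(payload)));
  const sig = b64url(createHmac("sha256", secret).update(`${header}.${body}`).digest());
  return `${header}.${body}.${sig}`;
}

function verifiesWith(token: string, secret: string): boolean {
  const [header, body, sig] = token.split(".");
  const expected = b64url(createHmac("sha256", secret).update(`${header}.${body}`).digest());
  return sig.length === expected.length &&
    timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
}

// Try JWT_ACCESS_SECRET_CURRENT first, fall back to JWT_ACCESS_SECRET_PREVIOUS.
function verifyWithRotation(token: string, current: string, previous?: string): boolean {
  return verifiesWith(token, current) ||
    (previous !== undefined && verifiesWith(token, previous));
}
```

During a rotation window, deploy with previous set to the retiring secret; once all tokens signed by it have expired, drop it.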

3. Logging and request-id

Every log line includes a request ID so concurrent traffic stays traceable.

Format

Structured JSON to stdout:

{"ts":"2026-04-21T12:00:00.123Z","level":"info","requestId":"req_01H…","route":"GET /content","userId":"64fb…","durationMs":14,"statusCode":200,"msg":"request.completed"}

Nest's default logger is replaced at bootstrap with a Pino logger that wraps the request-id context. main.ts installs RequestIdMiddleware before anything else; the middleware reads X-Request-Id if present, generates req_<ulid> if not, stores it on the request, and echoes it back in the response header and meta.requestId.
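The id-minting behavior can be sketched as below. The real RequestIdMiddleware is a NestJS class; this standalone function and the hand-rolled ULID helper (48-bit timestamp plus 80 random bits, Crockford base32) stand in for whatever library the app actually uses.

```typescript
// Honor a caller-supplied X-Request-Id; otherwise mint req_<ulid>.
import { randomBytes } from "node:crypto";

const CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

// 26-char ULID: 10 timestamp chars + 16 random chars, Crockford base32.
function ulid(now = Date.now()): string {
  let time = "";
  for (let i = 0; i < 10; i++) {
    time = CROCKFORD[now % 32] + time;
    now = Math.floor(now / 32);
  }
  let rand = "";
  for (const byte of randomBytes(16)) rand += CROCKFORD[byte % 32];
  return time + rand;
}

function requestId(incoming?: string): string {
  return incoming ?? `req_${ulid()}`;
}
```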

Levels

  • error — unhandled exceptions, 5xx responses, security events (refresh-token reuse detected).
  • warn — 4xx that smells adversarial (rate-limited IPs, repeated auth failures).
  • info — successful requests (one line per request), lifecycle events (startup, shutdown).
  • debug — Mongoose queries (dev only), cache hits/misses.

Never log:

  • Passwords or hashes.
  • Access tokens or refresh tokens (not even prefixes in prod).
  • Full request bodies of auth endpoints.
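One way to enforce the never-log rules mechanically is a redaction pass over every object before it reaches the logger. The field names below are assumptions for illustration, not an exhaustive list from the codebase.

```typescript
// Strip known sensitive fields before a log line is emitted.
const SENSITIVE = new Set([
  "password", "passwordHash", "accessToken", "refreshToken", "authorization",
]);

function redact(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    out[key] = SENSITIVE.has(key) ? "[REDACTED]" : value;
  }
  return out;
}
```

Pino's own redact option can do the same thing declaratively once paths are known.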

4. Observability placeholders

Not wired in v1. When wiring, these are the minimums:

  • Health endpoint GET /health returning 200 once Mongo ping succeeds. Wire to the platform's liveness probe.
  • Readiness endpoint GET /ready returning 200 when Mongo is connected AND the catalog cache has been primed. Wire to readiness probe.
  • Metrics (v2) — /metrics Prometheus endpoint via @willsoto/nestjs-prometheus or similar. Track: request count by route and status, request duration, Mongo query duration, cache hit ratio on /catalog.
  • Error tracking (v2) — Sentry or equivalent. Capture 5xx and security events with request ID tags.
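The probe semantics above reduce to two small decisions, sketched here as pure functions. The function names and the 503 failure code are assumptions; the actual handlers would live in a Nest controller.

```typescript
// /health: 200 once the Mongo ping succeeds.
function healthStatus(mongoPingOk: boolean): number {
  return mongoPingOk ? 200 : 503;
}

// /ready: 200 only when Mongo is connected AND the catalog cache is primed.
function readyStatus(mongoConnected: boolean, catalogCachePrimed: boolean): number {
  return mongoConnected && catalogCachePrimed ? 200 : 503;
}
```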

5. Backups

Managed Mongo (Atlas, etc.) handles backups natively. For self-hosted:

  • Daily mongodump to object storage, retained 30 days.
  • Weekly full to cold storage, retained 1 year.
  • Restore drill quarterly.

Nothing the API stores is irreplaceable except user accounts and curated content. Thumbnail mips can be regenerated from source URLs.


6. Deployment checklist

Before every production deploy:

  • .env.example covers every variable the code reads (grep ConfigService.get).
  • Database migration scripts (if any this deploy) are tested against a recent prod backup.
  • CORS allowed origins includes the extension's Web Store ID.
  • AUTH_ALLOW_SELF_REGISTRATION is set correctly for this environment.
  • Secrets in the platform secret manager match the secrets in the build's expected env.
  • Health endpoint returns 200 against the new build in staging.
  • Rollback plan: previous image tag recorded; revert is a redeploy.
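The first checklist item can be partially automated. This sketch greps source text for .get('SOME_KEY') calls and diffs the names against .env.example keys; the regex is a heuristic (real call sites may read this.config.get(...)), and a real check would walk the repo rather than take strings.

```typescript
// Collect config keys read via ConfigService-style .get('UPPER_SNAKE') calls.
function configKeys(source: string): Set<string> {
  const keys = new Set<string>();
  for (const m of source.matchAll(/\.get\(\s*['"]([A-Z0-9_]+)['"]/g)) {
    keys.add(m[1]);
  }
  return keys;
}

// Collect variable names declared in a dotenv-format file.
function envExampleKeys(envExample: string): Set<string> {
  const keys = new Set<string>();
  for (const line of envExample.split("\n")) {
    const m = line.match(/^([A-Z0-9_]+)=/);
    if (m) keys.add(m[1]);
  }
  return keys;
}

function missingFromExample(source: string, envExample: string): string[] {
  const present = envExampleKeys(envExample);
  return [...configKeys(source)].filter((k) => !present.has(k));
}
```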

7. Runbooks

Skeletons — fill in platform-specific steps once a target is chosen.

API is returning 5xx

  1. Check health endpoint. Is Mongo reachable?
  2. Check recent error logs filtered by level=error and a short time window.
  3. Get a request ID of a failing request; pull its log line.
  4. If Mongo is the cause: check Mongo metrics / health on the provider dashboard.
  5. If code is the cause: roll back to the previous image.

Mongo unreachable

  1. Is the Mongo service up in the provider's console?
  2. Is there a connection cap? Increase, or bounce API instances to release stale connections.
  3. Is DNS resolving?
  4. If the outage is provider-side, put up a maintenance page (TBD mechanism) and wait.

Refresh token compromise suspected

Symptom: auth.refresh.reused log events spike, or an individual user reports unexpected logouts.

  1. Identify the user. In Mongo: db.refresh_tokens.find({ userId }).sort({ createdAt: -1 }).
  2. Revoke all their refresh tokens: db.refresh_tokens.updateMany({ userId }, { $set: { revokedAt: new Date() } }).
  3. Force password reset for that user (TBD flow; for now, operator resets via admin tool).
  4. If many users affected simultaneously, consider rotating JWT_REFRESH_SECRET — nukes all refresh tokens globally.
  5. Review logs for common attributes across affected accounts (shared IP, shared UA, same referrer) to identify the attack vector.

Rate limit thundering herd

Symptom: 429s spike for a single IP or range.

  1. Inspect error.code: ratelimit.exceeded logs. Group by client IP.
  2. If it's a single actor, optionally add a temporary IP block at the load balancer.
  3. If it's a legitimate integration, invite them to authenticate (authenticated limits are per-user, usually higher).

Storage for thumbnails full

Once thumbnails are live:

  1. Expected growth ~N MB per approved content item.
  2. Alert at 75% capacity; purchase more or run retention job.
  3. Retention job: delete mips for content with isActive: false older than 30 days.
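The selection step of that retention job is a simple filter. The record shape below (a deactivatedAt timestamp marking when isActive flipped) is an assumption about the content schema.

```typescript
interface ContentItem {
  id: string;
  isActive: boolean;
  deactivatedAt?: Date; // when isActive flipped to false, if ever
}

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// Mips eligible for deletion: inactive for longer than 30 days.
function mipsToDelete(items: ContentItem[], now = new Date()): string[] {
  return items
    .filter((c) => !c.isActive && c.deactivatedAt !== undefined &&
      now.getTime() - c.deactivatedAt.getTime() > THIRTY_DAYS_MS)
    .map((c) => c.id);
}
```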

8. SLOs (to set once there's traffic)

Starting targets, revisit after 30 days of data:

  • 99th percentile response time under 300 ms for /content, /catalog, /meta.
  • 99.5% availability monthly.
  • Zero 5xx on /health and /ready.

Measure via the metrics endpoint once wired.
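Once raw durations are available, the latency target can be checked like this. Nearest-rank is one common percentile method among several; the function names are illustrative.

```typescript
// Nearest-rank percentile over raw request durations.
function percentile(durationsMs: number[], p: number): number {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// True when the p99 duration is within the target (300 ms to start).
function meetsSlo(durationsMs: number[], targetMs = 300): boolean {
  return percentile(durationsMs, 99) <= targetMs;
}
```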