System design

How this site is built

A walk through the stack — from a request leaving a browser to the rendered page coming back. The shape of the system is opinionated around operational simplicity, global latency, and being honest about which work happens at the edge, on the server, and in the database.

Browser: React 19 · hydrated UI

Cloudflare: DNS · WAF · cache · Workers (routing)

Vercel · Next.js 16: SSR · ISR · static · edge fns · image opt

Application: auth (JWT) · API routes · AI agents

Observability: Sentry · Vercel analytics · logs

Redis: cache · rate limit · queues

MongoDB: primary store today · Postgres under consideration

Object storage: S3 (existing assets) · Vercel for new uploads

The request, narrated

A user in Los Angeles types positivetech.org/calendar.
DNS resolves to Cloudflare's anycast network. The nearest CF POP terminates TLS and decides: is this in cache?
For a static asset (/_next/static/*, fonts, images), Cloudflare returns it locally — the origin is never hit.
For an HTML page, CF checks its page cache. ISR-eligible pages can be served straight from the edge.
Cache miss → CF proxies to Vercel. A Worker may rewrite the request first (A/B tests, legacy redirects).
Vercel routes to the closest serverless region. Next.js renders the page server-side, fetching from MongoDB and Redis as needed.
HTML streams back through CF, gets cached at the POP, lands in the browser. React hydrates and the page becomes interactive.
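
The cache decision in the steps above can be sketched as a small classifier. Only the `/_next/static/*` prefix and the example HTML routes (`/decks/*`, `/architecture`) come from this page; the extension list and every function and type name are assumptions for illustration, not Cloudflare's actual logic.

```typescript
// Hypothetical sketch of the per-request decision at the POP.
// The extension list and all names below are illustrative assumptions.
type EdgeDecision =
  | "serve-static-from-edge"
  | "check-page-cache"
  | "proxy-to-origin";

const STATIC_EXTENSIONS = [".woff2", ".png", ".jpg", ".svg", ".avif", ".webp"];
const CACHEABLE_HTML_PREFIXES = ["/decks/", "/architecture"];

function classifyRequest(pathname: string): EdgeDecision {
  // Static assets never need the origin.
  const isStaticAsset =
    pathname.startsWith("/_next/static/") ||
    STATIC_EXTENSIONS.some((ext) => pathname.endsWith(ext));
  if (isStaticAsset) return "serve-static-from-edge";

  // Static and ISR-eligible HTML may be answered from the page cache.
  const isCacheableHtml = CACHEABLE_HTML_PREFIXES.some((prefix) =>
    pathname.startsWith(prefix)
  );
  return isCacheableHtml ? "check-page-cache" : "proxy-to-origin";
}
```

Under these assumptions, `/calendar` always falls through to the origin, which matches its SSR treatment later on this page.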

Edge layer — Cloudflare

Cloudflare sits in front of Vercel as a second CDN with security and routing on top. Vercel already has its own edge network, so CF is additive — not a replacement.

DNS + TLS

Authoritative DNS on Cloudflare (joel.ns.cloudflare.com, tessa.ns.cloudflare.com). TLS terminated at the POP. SSL mode: Full (strict) so the CF→Vercel hop is also verified.

WAF + DDoS

Bot detection, rate-limiting rules, and L3/L4 DDoS absorption included on the free plan. WAF rules can block patterns before they ever reach Vercel.

Page cache

HTML pages aren't cached by default. Cache rules promote static and ISR-eligible routes (e.g. /decks/*, /architecture) to be cached at the POP with a short TTL, with stale-while-revalidate so users never wait.
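
A minimal sketch of how a short TTL combined with stale-while-revalidate plays out for one cached page. The 60 s and 300 s values are assumptions (the page only says "short TTL"), and whether the policy is set through origin headers or a Cloudflare cache rule is not stated here.

```typescript
// Illustrative model of TTL + stale-while-revalidate. All names are hypothetical.
interface CachePolicy {
  ttlSeconds: number;   // window in which the cached copy counts as fresh
  staleSeconds: number; // extra window in which a stale copy may still be served
}

type CacheState = "fresh" | "stale-revalidate" | "expired";

function cacheState(ageSeconds: number, policy: CachePolicy): CacheState {
  if (ageSeconds < policy.ttlSeconds) return "fresh";
  // Serve the stale copy immediately and refresh it in the background.
  if (ageSeconds < policy.ttlSeconds + policy.staleSeconds) return "stale-revalidate";
  // Too old to serve: the request has to wait for the origin.
  return "expired";
}

// The same policy expressed as a standard Cache-Control value for shared caches.
function cacheControlHeader(policy: CachePolicy): string {
  return `public, s-maxage=${policy.ttlSeconds}, stale-while-revalidate=${policy.staleSeconds}`;
}
```

Only the `expired` state makes a visitor wait on the origin, which is the sense in which users "never wait" while a page is inside the stale window.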

Workers (routing)

Small JS at the edge to split traffic by URL — useful when migrating from a legacy backend, doing geographic routing, or running an A/B split. Currently unused; reserved for the next legacy carve-out.
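
Since no Worker is deployed yet, the following is only a sketch of the routing decision one might make: a deterministic split that sends a percentage of visitors on legacy paths to the new stack. The path prefixes, the visitor ID source, and all names are assumptions.

```typescript
// FNV-1a 32-bit hash: small and deterministic, adequate for traffic bucketing.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // coerce to an unsigned 32-bit value
}

type Upstream = "legacy" | "next";

// Hypothetical routing rule: paths outside the legacy prefixes always go to
// the Next.js app; legacy paths are split by a stable per-visitor bucket.
function pickUpstream(
  pathname: string,
  visitorId: string,
  legacyPrefixes: string[],
  nextPercent: number // 0..100, share of legacy traffic moved to the new stack
): Upstream {
  const isLegacyPath = legacyPrefixes.some((prefix) => pathname.startsWith(prefix));
  if (!isLegacyPath) return "next";
  return fnv1a(visitorId) % 100 < nextPercent ? "next" : "legacy";
}
```

Hashing a stable visitor ID instead of drawing a random number keeps each visitor on the same side of the split across requests.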

Compute — Vercel + Next.js 16

The app is deployed to Vercel. Next.js gives us three rendering strategies; the right one depends on how dynamic the content is and who's allowed to see it.

Strategy	When to use	Example route here
Static (SSG)	Content rarely changes; same HTML for everyone. Built once at deploy, served from the edge.	`/architecture`, `/decks/what-if`
ISR	Content changes occasionally; first request after TTL expiry regenerates, others get the cached copy. Good for content with "eventually fresh" tolerance.	Marketing pages, prompt-curated content lists.
SSR	Personalized or auth-gated. Every request renders fresh on the server.	`/calendar` (reads user progress), `/10-day-review`
Client-only	Highly interactive widgets where SEO doesn't matter. Hydrated on first paint.	Gratitude chat launcher, calendar grid checkboxes.
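
The table reads as a decision rule, which can be written down as a small function. This is a restatement of the table under assumed field names, not something Next.js provides.

```typescript
type Strategy = "static" | "isr" | "ssr" | "client-only";

// Hypothetical description of a route or widget; field names are illustrative.
interface RouteProfile {
  personalized: boolean;                // output depends on the signed-in user
  changeRate: "rarely" | "occasionally";
  interactiveWidget: boolean;           // highly interactive, SEO doesn't matter
}

function pickStrategy(route: RouteProfile): Strategy {
  if (route.interactiveWidget) return "client-only";
  if (route.personalized) return "ssr"; // personalized or auth-gated
  return route.changeRate === "rarely" ? "static" : "isr";
}
```

For example, `/calendar` reads user progress, so it is personalized and resolves to `ssr`, while `/architecture` rarely changes and resolves to `static`.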

Edge functions + middleware

Lightweight checks that run before SSR — auth cookies, redirects, geo headers. Cheaper and faster than spinning up a full Node fn.
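
A sketch of the kind of check that could run before SSR, written as a pure function so it stands alone. The cookie name `session`, the `next` query parameter, and the choice of protected prefixes are assumptions; `/login` is the redirect target mentioned in the E2E scenarios.

```typescript
type GateResult =
  | { action: "continue" }
  | { action: "redirect"; location: string };

const SESSION_COOKIE = "session";                          // hypothetical cookie name
const PROTECTED_PREFIXES = ["/calendar", "/10-day-review"]; // assumed to be gated

// Parse a Cookie request header into name/value pairs.
function parseCookies(header: string): Map<string, string> {
  const cookies = new Map<string, string>();
  for (const part of header.split(";")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue;
    cookies.set(part.slice(0, eq).trim(), part.slice(eq + 1).trim());
  }
  return cookies;
}

// Cheap presence check only; verifying the JWT signature is left to the
// server-side handler in this sketch.
function gate(pathname: string, cookieHeader: string): GateResult {
  const isProtected = PROTECTED_PREFIXES.some(
    (prefix) => pathname === prefix || pathname.startsWith(prefix + "/")
  );
  if (!isProtected) return { action: "continue" };

  const hasSession = Boolean(parseCookies(cookieHeader).get(SESSION_COOKIE));
  return hasSession
    ? { action: "continue" }
    : { action: "redirect", location: `/login?next=${encodeURIComponent(pathname)}` };
}
```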

Image optimization

next/image serves resized, modern-format images (AVIF/WebP) from Vercel's optimizer, cached at the edge. We avoid shipping original-resolution images to mobile devices.

Code splitting + dynamic imports

Per-route JS bundles by default. Heavy widgets (charts, the chat dialog) load via dynamic() so they don't block first paint.

Streaming + Suspense

SSR streams HTML in chunks so the shell paints before slow data (e.g. AI responses) finishes. Skeletons fill in via Suspense.

Application services

Auth (JWT + bcryptjs)

Password hashes with bcryptjs; session as a signed JWT via jose in an httpOnly cookie. No third-party auth provider — keeps the dependency surface small and the user data local. The signing key lives in Vercel env vars.
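
A minimal sketch of the cookie that would carry the signed JWT. Only `HttpOnly` is stated above; the cookie name, `Secure`, `SameSite=Lax`, and the lifetime are assumptions, and the token itself would come from jose in the real code.

```typescript
interface SessionCookieOptions {
  name: string;          // hypothetical, e.g. "session"
  maxAgeSeconds: number; // assumed session lifetime
}

// Build a Set-Cookie header value for an already-signed JWT.
function buildSessionCookie(jwt: string, options: SessionCookieOptions): string {
  return [
    `${options.name}=${jwt}`,
    "Path=/",
    `Max-Age=${options.maxAgeSeconds}`,
    "HttpOnly",     // not readable from client-side JavaScript
    "Secure",       // assumption: sent over HTTPS only
    "SameSite=Lax", // assumption: basic protection against cross-site requests
  ].join("; ");
}
```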

API routes

REST-style endpoints under /api/* handle calendar progress, gratitude history, 10-day review submissions, and the calendar .ics export. Thin handlers — business logic lives in src/lib/*.

AI agents

The Gratitude Chat uses @ai-sdk/anthropic (Claude) with streaming responses; the 10-Day Review uses an @openai/agents pipeline for structured strength-tagging. Both run as Vercel functions — long timeouts where needed, short ones for everything else.

Rate limiting

Per-IP and per-user limits on AI endpoints to bound cost. Redis sorted sets or Upstash's ratelimit library — the abuse surface is small but real.
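
The sorted-set approach amounts to a sliding-window counter. The sketch below keeps timestamps in memory so it runs on its own; in Redis the same three steps would map to `ZREMRANGEBYSCORE`, `ZCARD`, and `ZADD`. The class name, key format, and limits are illustrative.

```typescript
// In-memory stand-in for a Redis sorted set per key (score = timestamp in ms).
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed and records it.
  allow(key: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // Drop hits that fell out of the window (ZREMRANGEBYSCORE).
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    // Count what is left (ZCARD) and record this hit if under the limit (ZADD).
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(nowMs);
    this.hits.set(key, recent);
    return allowed;
  }
}
```

Per-IP and per-user limits would use different keys against the same limiter, for example `ip:203.0.113.7` and `user:42` (both hypothetical formats).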

Data layer

MongoDB (current)

Primary store today. The document model fits our shapes — calendars of date-keyed progress, free-form gratitude entries, review submissions with nested strength tags — without forcing schema migrations every iteration.

Postgres (considering)

On the table for the future. Stronger constraints, easier relational queries, mature analytics tooling, and what most CMS-style platforms (Drupal etc.) expect. Tradeoff: more migration ceremony. Likely to migrate when query patterns get more relational than document-y.

Redis (cache + rate limit)

Two jobs. Cache: memoize expensive reads (aggregated review insights, AI responses to identical prompts). Coordination: rate-limit counters and the occasional lightweight queue for background work.
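
The caching job follows the cache-aside pattern: look up, compute on a miss, store with a TTL. The sketch below uses an in-memory map in place of Redis and a synchronous compute function to stay self-contained; the names and the TTL are assumptions.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for Redis GET / SET with an expiry.
class TtlCache<V> {
  private store = new Map<string, CacheEntry<V>>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Return the cached value while it is live; otherwise compute and store it.
  getOrCompute(key: string, nowMs: number, compute: () => V): V {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > nowMs) return hit.value;

    const value = compute();
    this.store.set(key, { value, expiresAt: nowMs + this.ttlMs });
    return value;
  }
}
```

For AI responses to identical prompts, the key would typically be a hash of the prompt text, so that two identical prompts resolve to the same entry.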

Object storage

S3 for existing imagery (the review tile is served from belcastro.s3.us-east-1.amazonaws.com today). New uploads will go to Vercel Blob — colocated with the app, simpler IAM, fewer accounts to keep credentials in sync.

Observability & ops

Error + behavior tracking (Sentry)

Server and client errors with source maps. Tag releases via the Vercel deploy hook so we can attribute regressions to a specific commit. Session replay turned on for authenticated flows only.

Performance (Vercel)

Web Vitals via @vercel/speed-insights, traffic shape via @vercel/analytics. Both ship as small scripts and give the metrics that actually move SEO and conversion.

Logs

Vercel function logs for short-term debugging; ship to a long-term sink (Logflare/Axiom) once volume justifies. Structured logs (JSON) from day one — easier to grep later.
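
A minimal sketch of what a structured log line could look like: one JSON object per line with a few reserved keys. The field names (`ts`, `level`, `message`) are assumptions, not an established schema for this project.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Serialize one log event as a single JSON line. Reserved keys are written
// last so caller-supplied fields cannot overwrite them.
function logLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date()
): string {
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, message });
}
```

A call such as `logLine("error", "review submit failed", { route: "/api/review", userId: "u_1" })` (hypothetical values) produces a line that a log sink can filter by `route` or `level` without regex parsing.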

Uptime + alerts

Cloudflare health checks on the apex; Sentry alerts for error spikes; Vercel deploy notifications in Slack (planned). Alerts route by severity — see below.

Alert routing

Not every error deserves a 3am phone call. Sentry alert rules route each issue to one of three tiers based on impact and urgency. The rule of thumb: P0 wakes you up, P1 ruins your morning coffee, P2 shows up in a Friday digest.

Severity	Channels	Example triggers
P0 — page now (user-visible outage or money on fire)	📱 SMS (Twilio) · 📧 Email · 💬 Slack #alerts-urgent (planned)	5xx rate > 5% over 2 min · Cloudflare health check failing · MongoDB connection errors > 10/min · Auth: login failures spike > 50/min (possible brute force or outage) · AI spend > $5/min (cost runaway) · Sentry "unhandled exception" on `/api/auth/*`
P1 — Slack soon (real problem, ~15 min response OK)	💬 Slack #alerts (planned) · 📧 Email (no SMS)	Error rate > 2× baseline within 30 min of a deploy · A specific route 500ing > 5×/min · p95 latency on `/calendar` or `/10-day-review` > 3s · Hydration errors clustering on a new release · Background job failures > 3 in an hour · Anthropic / OpenAI 5xx rate > 20% (degrade gracefully but flag it)
P2 — daily digest (informational, fix when convenient)	📧 Email (daily roll-up) · 💬 Slack #alerts-digest (planned)	Top 10 new errors by frequency · Slowest 5 routes (p95 trend) · Rate limit hits on AI endpoints · Cloudflare WAF blocks summary · Deprecation warnings from Next.js / Mongo driver

Slack integration is not built yet — the channels above (#alerts-urgent, #alerts, #alerts-digest) are placeholders. The plan is to wire Sentry's Slack integration with one webhook per severity tier so the routing stays declarative in Sentry, not scattered in custom code. Until then: SMS via Twilio for P0, email for everything else.

A note on tuning: every alert that fires without action becomes noise, and noise trains you to ignore the next one. Each rule has a quarterly review — if it's never actionable, it gets demoted or deleted.
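
The numeric thresholds in the severity table can be restated as a small classifier. This is only an illustration of the tiers: the actual routing is meant to live declaratively in Sentry alert rules, and the field names below are assumptions. Triggers without a numeric threshold in the table (for example hydration error clustering) are left out.

```typescript
type Severity = "P0" | "P1" | "P2";

// Hypothetical snapshot of the signals the table refers to.
interface Signals {
  serverErrorRate: number;        // fraction of 5xx responses over the last 2 min
  healthCheckFailing: boolean;    // Cloudflare health check
  mongoConnErrorsPerMin: number;
  loginFailuresPerMin: number;
  aiSpendUsdPerMin: number;
  worstRouteErrorsPerMin: number; // 500s per minute on the worst single route
  p95LatencySeconds: number;      // on /calendar or /10-day-review
  jobFailuresPerHour: number;
  aiProviderErrorRate: number;    // fraction of Anthropic / OpenAI 5xx responses
}

function classify(s: Signals): Severity {
  const isP0 =
    s.serverErrorRate > 0.05 ||
    s.healthCheckFailing ||
    s.mongoConnErrorsPerMin > 10 ||
    s.loginFailuresPerMin > 50 ||
    s.aiSpendUsdPerMin > 5;
  if (isP0) return "P0";

  const isP1 =
    s.worstRouteErrorsPerMin > 5 ||
    s.p95LatencySeconds > 3 ||
    s.jobFailuresPerHour > 3 ||
    s.aiProviderErrorRate > 0.2;
  return isP1 ? "P1" : "P2";
}
```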

Testing

The strategy is a pyramid, not a barbell: lots of fast static + unit checks at the base, a meaningful integration layer in the middle, and a small set of E2E tests on the flows that actually matter. The goal is to catch regressions before they reach prod without making every PR a 20-minute CI wait.

Layer	Tool	What it covers	When it runs
Static	TypeScript (`tsc --noEmit`) · ESLint	Type errors, dead code, unsafe patterns. Catches ~70% of bugs before the code ever runs.	Editor on save · pre-commit · PR
Unit	Vitest	Pure functions in `src/lib/*` — date math, JWT signing, progress calculations, prompt builders. Fast, no network.	Pre-commit (changed only) · PR (all)
Component	Vitest + React Testing Library	Critical interactive components: calendar grid, auth form, gratitude chat shell. Tests behavior, not implementation.	PR
Integration	Vitest against a real Mongo in CI	API routes end-to-end — auth flow, calendar progress read/write, review submission. Uses an ephemeral test DB per run.	PR
E2E	Playwright (Chromium · WebKit)	~10 critical-path scenarios: signup, login, complete a calendar day, run a 10-day review, send a gratitude chat message. Headless against a preview deploy.	PR (against preview URL) · post-deploy smoke
Visual regression	Playwright screenshot diffs (planned)	Landing page, architecture page, calendar grid at three viewports. Catches accidental CSS regressions on shared tokens.	PR (changed routes only)
Accessibility	`@axe-core/playwright` + manual review	Automated checks for contrast, missing labels, focus traps; manual screen-reader walk-through on new pages before merge.	PR + before every release
Performance	Lighthouse CI · Vercel Speed Insights	Budgets per route: LCP < 2.5s, CLS < 0.1, INP < 200ms. Build fails if the home or calendar page drops below a perf budget.	PR · production (continuous via Speed Insights)
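
The performance budgets in the last row are simple enough to express as a check. The sketch below assumes the three metrics have already been collected (by Lighthouse CI or similar); the function and field names are illustrative.

```typescript
interface Vitals {
  lcpMs: number; // Largest Contentful Paint, milliseconds
  cls: number;   // Cumulative Layout Shift, unitless
  inpMs: number; // Interaction to Next Paint, milliseconds
}

// Budgets from the table: LCP < 2.5s, CLS < 0.1, INP < 200ms.
const BUDGET: Vitals = { lcpMs: 2500, cls: 0.1, inpMs: 200 };

// Returns one message per metric that meets or exceeds its budget.
function budgetViolations(measured: Vitals): string[] {
  const violations: string[] = [];
  if (measured.lcpMs >= BUDGET.lcpMs) violations.push(`LCP ${measured.lcpMs}ms`);
  if (measured.cls >= BUDGET.cls) violations.push(`CLS ${measured.cls}`);
  if (measured.inpMs >= BUDGET.inpMs) violations.push(`INP ${measured.inpMs}ms`);
  return violations;
}
```

A CI step would fail the build when the returned list is non-empty for the home or calendar page.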

E2E in practice

E2E tests are the most expensive layer to run and the most painful to keep green, so they earn their slot only on flows where a regression would directly hurt a real user. Everything else is covered by lower, faster layers.

The actual scenarios

auth.signup — new email + password → account created → lands on home with the user's name in the header.
auth.login — existing user signs in, session cookie set, refresh keeps them signed in.
auth.logout — sign out, protected pages bounce to /login.
calendar.check-off — auth'd user opens /calendar, ticks day 5, reloads the page, day 5 is still ticked and the progress counter incremented.
calendar.ics-download — auth'd user clicks the download link, the response is a valid text/calendar file with 30 VEVENT entries.
gratitude.chat — open the chat launcher, send a message, assert the streamed response renders progressively (token by token), no console errors, dialog closes cleanly.
review.submit — paste sample 10-day text into /10-day-review, submit, see the two-paragraph summary and at least three character-strength tags returned.
landing.experiments — anonymous visit to /, every experiment card renders with the correct href; clicking each one navigates without a hard reload.
nav.404 — unknown route returns the custom not-found page with working links back.
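
As one example of what a scenario asserts, the `calendar.ics-download` check boils down to two helpers like the ones below. This is a sketch of a plausible assertion, not the project's test code; it checks structure only and does not validate the full iCalendar grammar.

```typescript
// Split on CRLF or LF and ignore surrounding whitespace and blank lines.
function icsLines(ics: string): string[] {
  return ics
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// The file must be wrapped in a single VCALENDAR envelope.
function looksLikeCalendar(ics: string): boolean {
  const lines = icsLines(ics);
  return (
    lines.length >= 2 &&
    lines[0] === "BEGIN:VCALENDAR" &&
    lines[lines.length - 1] === "END:VCALENDAR"
  );
}

// Count events by their opening marker.
function countVEvents(ics: string): number {
  return icsLines(ics).filter((line) => line === "BEGIN:VEVENT").length;
}
```

The scenario would then assert `looksLikeCalendar(body)` and `countVEvents(body) === 30` on the downloaded response body.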

How auth is handled

One globalSetup logs in a fixture user once and saves the session via Playwright's storageState. Every test that needs auth loads that state instead of going through the login UI — fast and removes login flakiness from unrelated tests. The login flow itself still has dedicated tests (auth.signup, auth.login) that exercise the form end-to-end.

Test data + isolation

Each run gets a fresh ephemeral Mongo (Docker in CI) seeded from tests/fixtures/. E2E-created users use a prefixed email pattern (e2e-<run-id>-<n>@positivetech.test) and are wiped in globalTeardown. Tests do not share state across files — a failing earlier test can't cause cascading failures.
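
The prefixed email pattern makes cleanup a matter of matching on the run ID. A minimal sketch, with hypothetical helper names; only the `e2e-<run-id>-<n>@positivetech.test` pattern comes from the text.

```typescript
const E2E_DOMAIN = "positivetech.test";

// Build the address for the n-th user created during a given run.
function e2eEmail(runId: string, n: number): string {
  return `e2e-${runId}-${n}@${E2E_DOMAIN}`;
}

// Escape regex metacharacters so an arbitrary run ID can be embedded safely.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// True only for addresses created by this specific run, which is what a
// teardown step would use to decide which users to delete.
function isE2eEmail(email: string, runId: string): boolean {
  const pattern = new RegExp(`^e2e-${escapeRegExp(runId)}-\\d+@positivetech\\.test$`);
  return pattern.test(email);
}
```

Scoping the match to the run ID means two CI runs that overlap in time cannot delete each other's users.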

AI calls in tests

We do not hit Anthropic or OpenAI from CI — too slow, too flaky, and the cost adds up across PRs. Two strategies depending on what we're testing:

Mock at the network layer with page.route() intercepting the SSE stream. Returns a canned token-by-token response so we can assert streaming UI behavior (placeholder → tokens → done state) without any model calls.
Record-and-replay for a small set of scenarios where the exact shape of the response matters (review strength-tagging). Re-recorded manually when prompts change.
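
For the network-layer mock, the canned response is just a server-sent-events body. The sketch below builds and parses one; the `{ token }` frame shape and the `[DONE]` sentinel are assumptions for illustration, since the real wire format depends on the SDK in use.

```typescript
const DONE_SENTINEL = "[DONE]"; // hypothetical end-of-stream marker

// Turn a list of tokens into an SSE body, one `data:` frame per token.
function toSseBody(tokens: string[]): string {
  const frames = tokens.map((token) => `data: ${JSON.stringify({ token })}\n\n`);
  return frames.join("") + `data: ${DONE_SENTINEL}\n\n`;
}

// Inverse of toSseBody: recover the tokens, stopping at the sentinel.
function parseSseTokens(body: string): string[] {
  const tokens: string[] = [];
  for (const frame of body.split("\n\n")) {
    if (!frame.startsWith("data: ")) continue;
    const payload = frame.slice("data: ".length);
    if (payload === DONE_SENTINEL) break;
    tokens.push((JSON.parse(payload) as { token: string }).token);
  }
  return tokens;
}
```

Inside a Playwright test, a body like this could be returned from a `page.route()` handler via `route.fulfill({ contentType: "text/event-stream", body })`. Note that this sketch does not model pacing between frames.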

A separate nightly job runs the AI scenarios against real providers and reports drift — that's where prompt regressions surface, decoupled from the PR pipeline.

Flakiness controls

Role-based selectors (getByRole, getByLabel) over CSS — survive refactors, match how users find things.
No page.waitForTimeout ever. Auto-waiting + expect.poll for anything time-dependent.
retries: 2 in CI, 0 locally — locally a flaky test is a bug to fix; in CI up to two retries absorb infrastructure noise. Playwright reports a test that passes only on retry as flaky, so real flakes stay visible.
trace: 'on-first-retry' + video: 'retain-on-failure' — every failure uploads a viewable trace + recording to the GitHub Actions run.

Browsers, viewports, parallelism

Browsers: Chromium (Chrome / Edge) and WebKit (Safari). Firefox skipped — small user share, not worth the extra CI minutes.
Viewports: primary suite at 1280×800; a mobile-tagged subset (signup, calendar check-off, chat) also runs at 390×844.
Parallelism: sharded by test file across 4 workers in CI. Full suite typically finishes in under 4 minutes.
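
The flakiness controls and the browser and viewport choices above map onto Playwright's config shape roughly as follows. The sketch is a plain object with no imports so it stands alone; in a real `playwright.config.ts` it would be passed to `defineConfig()`. The `@mobile` tag, the project names, and running the mobile subset on WebKit are assumptions.

```typescript
// Build a Playwright-style config object. `isCI` would normally be derived
// from an environment variable; it is a parameter here to keep the sketch pure.
function buildE2eConfig(isCI: boolean) {
  const desktop = { width: 1280, height: 800 };
  const mobile = { width: 390, height: 844 };

  return {
    retries: isCI ? 2 : 0,            // retry only in CI
    workers: isCI ? 4 : undefined,    // undefined lets Playwright pick locally
    use: {
      trace: "on-first-retry",        // capture a trace when a test is retried
      video: "retain-on-failure",     // keep recordings only for failures
    },
    projects: [
      { name: "chromium", use: { browserName: "chromium", viewport: desktop } },
      { name: "webkit", use: { browserName: "webkit", viewport: desktop } },
      // Hypothetical: the mobile-tagged subset selected by a title tag.
      { name: "mobile", grep: /@mobile/, use: { browserName: "webkit", viewport: mobile } },
    ],
  };
}
```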

Where they run

On every PR, Playwright runs against that PR's Vercel preview URL — same code, same edge, same Mongo connection path as production, just pointed at staging data. A small smoke subset (auth.login, calendar.check-off, landing.experiments) re-runs against production immediately after deploy promotion. If the smoke run fails, the deploy is rolled back via Vercel's Instant Rollback.

What we don't test (and why)

Pure presentational components. Snapshot tests on JSX are noisy and rarely catch real regressions. Visual diffs do that job better.
Third-party libs. We trust Mongo's driver, Next's router, Anthropic's SDK. Wrapping them in tests is testing their authors, not us.
AI output content. Non-deterministic by design. We test that the call happens, the response shape, and the streaming behavior — not the words Claude picks.
Cross-browser at scale. Playwright runs Chromium and WebKit; we accept the small risk on Firefox-specific bugs given the user base.

Test data + environments

Integration and E2E tests use an ephemeral Mongo per CI run (Docker in GitHub Actions) and deterministic seed data committed to tests/fixtures/. Preview deploys share a long-lived staging Mongo so manual exploratory testing is repeatable. Prod data is never reachable from CI — different connection string, different Vercel environment, different secret.

Delivery (CI/CD)

GitHub → Vercel. Every PR gets a preview deploy at a unique URL — which makes review and stakeholder feedback async and concrete. Main branch deploys to production automatically. Database migrations run via a one-off script before the deploy promotes; secrets are per-environment in the Vercel dashboard, never committed.

Preview environments share staging data, not prod — so destructive experiments are safe.
Type checking + lint run on every PR; the build fails if either does.
Feature flags (planned) for risky rollouts — let us merge incomplete code behind a flag rather than long-lived branches.

Tradeoffs worth naming

Two CDNs (Cloudflare + Vercel). Adds a hop and a second cache to invalidate. Worth it for WAF, Workers, and controlled cache rules; not worth it if those are unused.
Serverless functions. No cold-start tuning, but connection pooling to Mongo/Postgres needs care — we use the provider's pooled connection string and reuse clients across invocations.
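
Reusing clients across invocations usually means caching the connection promise at module scope, which survives between warm invocations of the same function instance. A minimal sketch with a hypothetical `DbClient` type standing in for the real driver client:

```typescript
// Stand-in for a real database client; the shape is illustrative only.
interface DbClient {
  id: number;
}

// Module-scoped cache: shared by every invocation served by this instance.
let clientPromise: Promise<DbClient> | undefined;

// Return the shared connection, creating it on first use. Caching the promise
// (not the resolved client) means concurrent callers share one connect attempt.
function getClient(connect: () => Promise<DbClient>): Promise<DbClient> {
  if (!clientPromise) {
    clientPromise = connect().catch((error) => {
      clientPromise = undefined; // allow a later invocation to retry
      throw error;
    });
  }
  return clientPromise;
}
```

Without this, every invocation would open its own connection, and a burst of traffic could exhaust the database's connection limit.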