System design
How this site is built
A walk through the stack — from a request leaving a browser to the rendered page coming back. The shape of the system is opinionated around operational simplicity, global latency, and being honest about which work happens at the edge, on the server, and in the database.
The request, narrated
- A user in Los Angeles types positivetech.org/calendar.
- DNS resolves to Cloudflare's anycast network. The nearest CF POP terminates TLS and decides: is this in cache?
- For a static asset (/_next/static/*, fonts, images), Cloudflare returns it locally; the origin is never hit.
- For an HTML page, CF checks its page cache. ISR-eligible pages can be served straight from the edge.
- Cache miss → CF proxies to Vercel. A Worker may rewrite the request first (A/B tests, legacy redirects).
- Vercel routes to the closest serverless region. Next.js renders the page server-side, fetching from MongoDB and Redis as needed.
- HTML streams back through CF, gets cached at the POP, lands in the browser. React hydrates and the page becomes interactive.
Edge layer — Cloudflare
Cloudflare sits in front of Vercel as a second CDN with security and routing on top. Vercel already has its own edge network, so CF is additive — not a replacement.
DNS + TLS
Authoritative DNS on Cloudflare (joel.ns.cloudflare.com, tessa.ns.cloudflare.com). TLS terminated at the POP. SSL mode: Full (strict) so the CF→Vercel hop is also verified.
WAF + DDoS
Bot detection, rate-limiting rules, and L3/L4 DDoS absorption included on the free plan. WAF rules can block patterns before they ever reach Vercel.
Page cache
HTML pages aren't cached by default. Cache rules promote static and ISR-eligible routes (e.g. /decks/*, /architecture) to be cached at the POP with a short TTL, with stale-while-revalidate so users never wait.
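To make the short-TTL plus stale-while-revalidate idea concrete, here is a minimal sketch of a helper that builds the Cache-Control value an origin could send for a cacheable route. The helper name and the TTL numbers are assumptions for illustration; the real rules may live entirely in Cloudflare's cache rules rather than in application code.

```typescript
// Hypothetical helper: builds a Cache-Control value aimed at shared caches.
interface EdgeCachePolicy {
  sMaxAgeSeconds: number;              // how long the edge may serve the page as fresh
  staleWhileRevalidateSeconds: number; // how long it may serve stale while refetching
}

function cacheControlFor(policy: EdgeCachePolicy): string {
  return [
    "public",
    `s-maxage=${policy.sMaxAgeSeconds}`,
    `stale-while-revalidate=${policy.staleWhileRevalidateSeconds}`,
  ].join(", ");
}

// Example values only; the actual TTLs are not stated in this document.
const deckPolicy: EdgeCachePolicy = { sMaxAgeSeconds: 60, staleWhileRevalidateSeconds: 300 };
const header = cacheControlFor(deckPolicy);
```

With these example values, a request arriving after the 60-second freshness window but within the following 300 seconds could still be answered from the POP while a refresh happens in the background.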
Workers (routing)
Small JS at the edge to split traffic by URL — useful when migrating from a legacy backend, doing geographic routing, or running an A/B split. Currently unused; reserved for the next legacy carve-out.
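Since no Worker is deployed yet, the following is only a sketch of what the routing decision could look like once a legacy carve-out exists. The path prefixes and the upstream names are hypothetical; a Worker's fetch handler would call this function and forward the request to the matching origin.

```typescript
// Hypothetical routing table: path prefixes that would still belong to a legacy backend.
const LEGACY_PREFIXES: string[] = ["/old-blog", "/wp-content"];

type Upstream = "legacy" | "vercel";

// Decide which origin should serve a path. Matching is done on whole path
// segments so that "/old-blogger" is not mistaken for "/old-blog".
function pickUpstream(pathname: string): Upstream {
  const isLegacy = LEGACY_PREFIXES.some(
    (prefix) => pathname === prefix || pathname.indexOf(prefix + "/") === 0,
  );
  return isLegacy ? "legacy" : "vercel";
}
```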
Compute — Vercel + Next.js 16
The app is deployed to Vercel. Next.js gives us four rendering strategies; the right one depends on how dynamic the content is and who's allowed to see it.
| Strategy | When to use | Example route here |
|---|---|---|
| Static (SSG) | Content rarely changes; same HTML for everyone. Built once at deploy, served from the edge. | /architecture, /decks/what-if |
| ISR | Content changes occasionally; first request after TTL expiry regenerates, others get the cached copy. Good for content with "eventually fresh" tolerance. | Marketing pages, prompt-curated content lists. |
| SSR | Personalized or auth-gated. Every request renders fresh on the server. | /calendar (reads user progress), /10-day-review |
| Client-only | Highly interactive widgets where SEO doesn't matter. Hydrated on first paint. | Gratitude chat launcher, calendar grid checkboxes. |
Edge functions + middleware
Lightweight checks that run before SSR — auth cookies, redirects, geo headers. Cheaper and faster than spinning up a full Node fn.
Image optimization
next/image serves resized, modern-format images (AVIF/WebP) from Vercel's optimizer, cached at the edge. We avoid shipping original-resolution images to mobile devices.
Code splitting + dynamic imports
Per-route JS bundles by default. Heavy widgets (charts, the chat dialog) load via dynamic() so they don't block first paint.
Streaming + Suspense
SSR streams HTML in chunks so the shell paints before slow data (e.g. AI responses) finishes. Skeletons fill in via Suspense.
Application services
Auth (JWT + bcryptjs)
Password hashes with bcryptjs; session as a signed JWT via jose in an httpOnly cookie. No third-party auth provider — keeps the dependency surface small and the user data local. The signing key lives in Vercel env vars.
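As an illustration of the cookie side of this setup, the sketch below serializes a session cookie carrying an already-signed JWT. Signing itself (via jose) is left out. The cookie name, the SameSite value, and the lifetime are assumptions; only the httpOnly flag comes from the description above.

```typescript
// Hypothetical helper: serialize the session cookie that carries the signed JWT.
function sessionCookie(jwt: string, maxAgeSeconds: number): string {
  return [
    `session=${jwt}`,           // cookie name "session" is an assumption
    "HttpOnly",                 // not readable from client-side JavaScript
    "Secure",                   // only sent over HTTPS
    "SameSite=Lax",             // assumed setting; limits cross-site sends
    "Path=/",
    `Max-Age=${maxAgeSeconds}`, // assumed lifetime, chosen by the caller
  ].join("; ");
}
```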
API routes
REST-style endpoints under /api/* handle calendar progress, gratitude history, 10-day review submissions, and the calendar .ics export. Thin handlers — business logic lives in src/lib/*.
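The .ics export is a good example of logic that belongs in a library function rather than in the handler. The sketch below shows one way such a function could build a 30-day calendar of all-day events. The function names, the PRODID, the UID scheme, and the event titles are all assumptions, not the site's actual implementation.

```typescript
const DAY_MS = 86400000; // milliseconds in one day

// Format a Date as an iCalendar DATE value (YYYYMMDD), in UTC.
function icsDate(d: Date): string {
  return d.toISOString().slice(0, 10).replace(/-/g, "");
}

// Format a Date as an iCalendar UTC DATE-TIME (YYYYMMDDTHHMMSSZ).
function icsDateTime(d: Date): string {
  return d.toISOString().replace(/[-:]/g, "").replace(/\.\d{3}Z$/, "Z");
}

// Build a calendar with one all-day VEVENT per day, starting at `start`.
function buildCalendarIcs(start: Date, days: number, now: Date): string {
  const lines: string[] = [
    "BEGIN:VCALENDAR",
    "VERSION:2.0",
    "PRODID:-//positivetech.org//calendar//EN", // assumed product identifier
  ];
  for (let i = 0; i < days; i++) {
    const day = new Date(start.getTime() + i * DAY_MS);
    lines.push(
      "BEGIN:VEVENT",
      `UID:day-${i + 1}@positivetech.org`, // assumed UID scheme
      `DTSTAMP:${icsDateTime(now)}`,
      `DTSTART;VALUE=DATE:${icsDate(day)}`,
      `SUMMARY:Day ${i + 1}`,              // placeholder title
      "END:VEVENT",
    );
  }
  lines.push("END:VCALENDAR");
  return lines.join("\r\n") + "\r\n"; // iCalendar lines end with CRLF
}

const startDay = new Date(Date.UTC(2025, 0, 1));
const ics = buildCalendarIcs(startDay, 30, startDay);
const eventCount = ics.split("BEGIN:VEVENT").length - 1;
```

A handler under /api/* would then only need to set a text/calendar content type and return the string.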
AI agents
The Gratitude Chat uses @ai-sdk/anthropic (Claude) with streaming responses; the 10-Day Review uses an @openai/agents pipeline for structured strength-tagging. Both run as Vercel functions — long timeouts where needed, short ones for everything else.
Rate limiting
Per-IP and per-user limits on AI endpoints to bound cost. Redis sorted sets or Upstash's ratelimit library — the abuse surface is small but real.
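The sorted-set approach is a sliding-window limiter: keep a timestamp per request, drop the ones that fell out of the window, and count what remains. The sketch below shows that logic with an in-memory Map standing in for Redis, so the class name, limits, and key format are assumptions. A real version would issue the equivalent Redis commands instead.

```typescript
// In-memory stand-in for one Redis sorted set per key, holding request timestamps.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed, false if the key is over its limit.
  allow(key: string, nowMs: number): boolean {
    const windowStart = nowMs - this.windowMs;
    // Like ZREMRANGEBYSCORE: discard timestamps at or before the window start.
    const recent = (this.hits.get(key) ?? []).filter((t) => t > windowStart);
    if (recent.length >= this.limit) { // like checking ZCARD against the limit
      this.hits.set(key, recent);
      return false;
    }
    recent.push(nowMs);                // like ZADD with the timestamp as score
    this.hits.set(key, recent);
    return true;
  }
}

// Example: at most 2 requests per key per 60 seconds (illustrative numbers).
const limiter = new SlidingWindowLimiter(2, 60000);
```

Keys such as "ip:203.0.113.7" or "user:42" would give the per-IP and per-user limits separate counters.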
Data layer
MongoDB (current)
Primary store today. The document model fits our shapes — calendars of date-keyed progress, free-form gratitude entries, review submissions with nested strength tags — without forcing schema migrations every iteration.
Postgres (considering)
On the table for the future. Stronger constraints, easier relational queries, mature analytics tooling, and what most CMS-style platforms (Drupal etc.) expect. Tradeoff: more migration ceremony. Likely to migrate when query patterns get more relational than document-y.
Redis (cache + rate limit)
Two jobs. Cache: memoize expensive reads (aggregated review insights, AI responses to identical prompts). Coordination: rate-limit counters and the occasional lightweight queue for background work.
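The cache job follows the usual cache-aside pattern: look up the key, return it if still fresh, otherwise compute and store with a TTL. The sketch below shows the pattern with a Map standing in for Redis and a synchronous compute function to keep it short. The class and method names are assumptions, and a real version would be asynchronous.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds after which the entry is stale
}

// Minimal cache-aside helper. The Map stands in for Redis.
class TtlCache<T> {
  private store = new Map<string, CacheEntry<T>>();
  private now: () => number;

  // The clock is injectable so the expiry logic can be tested deterministically.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  getOrCompute(key: string, ttlMs: number, compute: () => T): T {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > this.now()) return hit.value; // fresh hit
    const value = compute();                                 // miss: do the expensive read
    this.store.set(key, { value, expiresAt: this.now() + ttlMs });
    return value;
  }
}
```

For AI responses, the key would need to capture everything that affects the output (prompt text, model, parameters), otherwise "identical prompts" could return mismatched answers.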
Object storage
S3 for existing imagery (the review tile is served from belcastro.s3.us-east-1.amazonaws.com today). New uploads will go to Vercel Blob — colocated with the app, simpler IAM, fewer accounts to keep credentials in sync.
Observability & ops
Error + behavior tracking (Sentry)
Server and client errors with source maps. Tag releases via the Vercel deploy hook so we can attribute regressions to a specific commit. Session replay turned on for authenticated flows only.
Performance (Vercel)
Web Vitals via @vercel/speed-insights, traffic shape via @vercel/analytics. Both ship as small scripts and give the metrics that actually move SEO and conversion.
Logs
Vercel function logs for short-term debugging; ship to a long-term sink (Logflare/Axiom) once volume justifies. Structured logs (JSON) from day one — easier to grep later.
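A structured log line can be as simple as one JSON object per event. The sketch below is one possible shape; the field names (ts, level, msg) and the example event are assumptions rather than the site's actual log schema.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Build one JSON log line. Core fields are written last so that a caller's
// extra fields cannot overwrite ts, level, or msg.
function logLine(
  level: LogLevel,
  msg: string,
  fields: Record<string, unknown>,
  now: Date,
): string {
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, msg });
}

// Example event name and fields are illustrative only.
const line = logLine("info", "review.submitted", { userId: "u_123", durationMs: 842 }, new Date(0));
```

Because every line parses as JSON, a later move to a log sink can index fields such as userId without regex-based parsing.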
Uptime + alerts
Cloudflare health checks on the apex; Sentry alerts for error spikes; Vercel deploy notifications in Slack (planned). Alerts route by severity — see below.
Alert routing
Not every error deserves a 3am phone call. Sentry alert rules route each issue to one of three tiers based on impact and urgency. The rule of thumb: P0 wakes you up, P1 ruins your morning coffee, P2 shows up in a Friday digest.
| Severity | Channels | Example triggers |
|---|---|---|
| P0 (page now): user-visible outage or money on fire | 📱 SMS (Twilio) · 💬 Slack #alerts-urgent (planned) | |
| P1 (Slack soon): real problem, ~15 min response OK | 💬 Slack #alerts (planned) · 📧 Email (no SMS) | |
| P2 (daily digest): informational, fix when convenient | 📧 Email (daily roll-up) · 💬 Slack #alerts-digest (planned) | |
Slack integration is not built yet — the channels above (#alerts-urgent, #alerts, #alerts-digest) are placeholders. The plan is to wire Sentry's Slack integration with one webhook per severity tier so the routing stays declarative in Sentry, not scattered in custom code. Until then: SMS via Twilio for P0, email for everything else.
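Read as data, the tiers and the interim fallback look like the sketch below. This is only a way to express the table, not a proposal to move routing into application code, since the plan is to keep it in Sentry's alert rules. The type and function names are assumptions.

```typescript
type Severity = "P0" | "P1" | "P2";
type Channel =
  | "sms"
  | "email"
  | "email-digest"
  | "slack:#alerts-urgent"
  | "slack:#alerts"
  | "slack:#alerts-digest";

// Channels that work today: SMS for P0, email for everything else.
const BASE_CHANNELS: Record<Severity, Channel[]> = {
  P0: ["sms"],
  P1: ["email"],
  P2: ["email-digest"],
};

// Planned Slack channel per tier (placeholders until the integration exists).
const SLACK_CHANNELS: Record<Severity, Channel> = {
  P0: "slack:#alerts-urgent",
  P1: "slack:#alerts",
  P2: "slack:#alerts-digest",
};

function channelsFor(severity: Severity, slackReady: boolean): Channel[] {
  const base = BASE_CHANNELS[severity];
  return slackReady ? [...base, SLACK_CHANNELS[severity]] : [...base];
}
```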
A note on tuning: every alert that fires without action becomes noise, and noise trains you to ignore the next one. Each rule has a quarterly review — if it's never actionable, it gets demoted or deleted.
Testing
The strategy is a pyramid, not a barbell: lots of fast static + unit checks at the base, a meaningful integration layer in the middle, and a small set of E2E tests on the flows that actually matter. The goal is to catch regressions before they reach prod without making every PR a 20-minute CI wait.
| Layer | Tool | What it covers | When it runs |
|---|---|---|---|
| Static | TypeScript (tsc --noEmit) · ESLint | Type errors, dead code, unsafe patterns. Catches ~70% of bugs before the code ever runs. | Editor on save · pre-commit · PR |
| Unit | Vitest | Pure functions in src/lib/* — date math, JWT signing, progress calculations, prompt builders. Fast, no network. | Pre-commit (changed only) · PR (all) |
| Component | Vitest + React Testing Library | Critical interactive components: calendar grid, auth form, gratitude chat shell. Tests behavior, not implementation. | PR |
| Integration | Vitest against a real Mongo in CI | API routes end-to-end — auth flow, calendar progress read/write, review submission. Uses an ephemeral test DB per run. | PR |
| E2E | Playwright (Chromium · WebKit) | ~10 critical-path scenarios: signup, login, complete a calendar day, run a 10-day review, send a gratitude chat message. Headless against a preview deploy. | PR (against preview URL) · post-deploy smoke |
| Visual regression | Playwright screenshot diffs (planned) | Landing page, architecture page, calendar grid at three viewports. Catches accidental CSS regressions on shared tokens. | PR (changed routes only) |
| Accessibility | @axe-core/playwright + manual review | Automated checks for contrast, missing labels, focus traps; manual screen-reader walk-through on new pages before merge. | PR + before every release |
| Performance | Lighthouse CI · Vercel Speed Insights | Budgets per route: LCP < 2.5s, CLS < 0.1, INP < 200ms. Build fails if the home or calendar page drops below a perf budget. | PR · production (continuous via Speed Insights) |
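The performance budgets in the last row reduce to a simple threshold check. The sketch below shows that check in isolation, using the three limits from the table. The type and function names are assumptions, and in practice Lighthouse CI would apply the budgets through its own configuration.

```typescript
interface WebVitals {
  lcpMs: number; // Largest Contentful Paint, milliseconds
  cls: number;   // Cumulative Layout Shift, unitless
  inpMs: number; // Interaction to Next Paint, milliseconds
}

// Budgets from the table: LCP < 2.5s, CLS < 0.1, INP < 200ms.
const BUDGET: WebVitals = { lcpMs: 2500, cls: 0.1, inpMs: 200 };

// Returns the names of metrics at or over budget; an empty list means pass.
function budgetViolations(measured: WebVitals): string[] {
  const violations: string[] = [];
  if (measured.lcpMs >= BUDGET.lcpMs) violations.push("LCP");
  if (measured.cls >= BUDGET.cls) violations.push("CLS");
  if (measured.inpMs >= BUDGET.inpMs) violations.push("INP");
  return violations;
}
```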
E2E in practice
E2E tests are the most expensive layer to run and the most painful to keep green, so they earn their slot only on flows where a regression would directly hurt a real user. Everything else is covered by lower, faster layers.
The actual scenarios
- auth.signup: new email + password → account created → lands on home with the user's name in the header.
- auth.login: existing user signs in, session cookie set, refresh keeps them signed in.
- auth.logout: sign out, protected pages bounce to /login.
- calendar.check-off: auth'd user opens /calendar, ticks day 5, reloads the page, day 5 is still ticked and the progress counter incremented.
- calendar.ics-download: auth'd user clicks the download link, the response is a valid text/calendar file with 30 VEVENT entries.
- gratitude.chat: open the chat launcher, send a message, assert the streamed response renders progressively (token by token), no console errors, dialog closes cleanly.
- review.submit: paste sample 10-day text into /10-day-review, submit, see the two-paragraph summary and at least three character-strength tags returned.
- landing.experiments: anonymous visit to /, every experiment card renders with the correct href; clicking each one navigates without a hard reload.
- nav.404: unknown route returns the custom not-found page with working links back.
How auth is handled
One globalSetup logs in a fixture user once and saves the session via Playwright's storageState. Every test that needs auth loads that state instead of going through the login UI, which is faster and removes login flakiness from unrelated tests. The login flow itself still has dedicated tests (auth.signup, auth.login) that exercise the form end-to-end.
Test data + isolation
Each run gets a fresh ephemeral Mongo (Docker in CI) seeded from tests/fixtures/. E2E-created users use a prefixed email pattern (e2e-<run-id>-<n>@positivetech.test) and are wiped in globalTeardown. Tests do not share state across files — a failing earlier test can't cause cascading failures.
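The prefixed email pattern is what makes teardown safe: a cleanup step can match on the prefix and domain instead of tracking created users. A minimal sketch of the two helpers involved follows; the function names and the example run ID are assumptions.

```typescript
const E2E_DOMAIN = "positivetech.test";

// Build an address of the form e2e-<run-id>-<n>@positivetech.test.
function e2eEmail(runId: string, n: number): string {
  return `e2e-${runId}-${n}@${E2E_DOMAIN}`;
}

// Used by teardown to decide which users belong to this run and may be deleted.
function isE2eUser(email: string, runId: string): boolean {
  const prefix = `e2e-${runId}-`;
  const suffix = `@${E2E_DOMAIN}`;
  return (
    email.indexOf(prefix) === 0 &&
    email.length > prefix.length + suffix.length &&
    email.slice(-suffix.length) === suffix
  );
}
```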
AI calls in tests
We do not hit Anthropic or OpenAI from CI: too slow, too flaky, and the cost adds up across PRs. Two strategies depending on what we're testing:
- Mock at the network layer with page.route() intercepting the SSE stream. Returns a canned token-by-token response so we can assert streaming UI behavior (placeholder → tokens → done state) without any model calls.
- Record-and-replay for a small set of scenarios where the exact shape of the response matters (review strength-tagging). Re-recorded manually when prompts change.
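For the network-layer mock, most of the work is producing a believable Server-Sent Events body. The sketch below builds one from a list of tokens. The payload shape ({ token }) and the "[DONE]" sentinel are assumptions and would need to match whatever the real chat endpoint emits.

```typescript
// Build a canned Server-Sent Events body with one event per token.
// Each SSE event is a "data:" line followed by a blank line.
function cannedSseBody(tokens: string[]): string {
  const events = tokens.map((t) => `data: ${JSON.stringify({ token: t })}\n\n`);
  return events.join("") + "data: [DONE]\n\n"; // assumed end-of-stream marker
}

const body = cannedSseBody(["Thank", " you"]);
```

In a Playwright test, this string could be passed to route.fulfill with a text/event-stream content type inside a page.route() handler. A fulfilled response is delivered in one piece, so asserting truly progressive rendering may need extra pacing on top of this.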
A separate nightly job runs the AI scenarios against real providers and reports drift — that's where prompt regressions surface, decoupled from the PR pipeline.
Flakiness controls
- Role-based selectors (getByRole, getByLabel) over CSS: they survive refactors and match how users find things.
- No page.waitForTimeout, ever. Auto-waiting + expect.poll for anything time-dependent.
- retries: 2 in CI, 0 locally. Locally a flaky test is a bug to fix; in CI a bounded number of retries absorbs infrastructure noise without masking real flakes.
- With trace: 'on-first-retry' and video: 'retain-on-failure', every failure uploads a viewable trace + recording to the GitHub Actions run.
Browsers, viewports, parallelism
- Browsers: Chromium (Chrome / Edge) and WebKit (Safari). Firefox skipped — small user share, not worth the extra CI minutes.
- Viewports: primary suite at 1280×800; a mobile-tagged subset (signup, calendar check-off, chat) also runs at 390×844.
- Parallelism: sharded by test file across 4 workers in CI. Full suite typically finishes in under 4 minutes.
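Taken together, the retry, trace, browser, viewport, and worker settings above map onto a Playwright config roughly like the sketch below. It is written as a plain function returning an object so it stands alone; in a real project the object would be passed to defineConfig from @playwright/test. The project names, the @mobile tag, and the choice of WebKit for the mobile subset are assumptions.

```typescript
// Plain-object sketch of the E2E settings described in this section.
function e2eConfig(isCI: boolean) {
  return {
    retries: isCI ? 2 : 0,          // retry in CI only; locally a flake is a bug
    workers: isCI ? 4 : undefined,  // undefined lets Playwright pick a default locally
    use: {
      viewport: { width: 1280, height: 800 }, // primary suite viewport
      trace: "on-first-retry",
      video: "retain-on-failure",
    },
    projects: [
      { name: "chromium", use: { browserName: "chromium" } },
      { name: "webkit", use: { browserName: "webkit" } },
      {
        name: "mobile",
        grep: /@mobile/, // assumed tag for the mobile subset
        use: { browserName: "webkit", viewport: { width: 390, height: 844 } },
      },
    ],
  };
}
```

In CI, the isCI flag would typically come from an environment variable set by the runner.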
Where they run
On every PR, Playwright runs against that PR's Vercel preview URL — same code, same edge, same Mongo connection as production would see, just pointed at staging data. A small smoke subset (auth.login, calendar.check-off, landing.experiments) re-runs against production immediately after deploy promotion. If the smoke fails, the deploy is rolled back via Vercel's instant-rollback.
What we don't test (and why)
- Pure presentational components. Snapshot tests on JSX are noisy and rarely catch real regressions. Visual diffs do that job better.
- Third-party libs. We trust Mongo's driver, Next's router, Anthropic's SDK. Wrapping them in tests is testing their authors, not us.
- AI output content. Non-deterministic by design. We test that the call happens, the response shape, and the streaming behavior — not the words Claude picks.
- Cross-browser at scale. Playwright runs Chromium and WebKit; we accept the small risk on Firefox-specific bugs given the user base.
Test data + environments
Integration and E2E tests use an ephemeral Mongo per CI run (Docker in GitHub Actions) and deterministic seed data committed to tests/fixtures/. Preview deploys share a long-lived staging Mongo so manual exploratory testing is repeatable. Prod data is never reachable from CI — different connection string, different Vercel environment, different secret.
Delivery (CI/CD)
GitHub → Vercel. Every PR gets a preview deploy at a unique URL — which makes review and stakeholder feedback async and concrete. Main branch deploys to production automatically. Database migrations run via a one-off script before the deploy promotes; secrets are per-environment in the Vercel dashboard, never committed.
- Preview environments share staging data, not prod — so destructive experiments are safe.
- Type checking + lint run on every PR; the build fails if either does.
- Feature flags (planned) for risky rollouts — let us merge incomplete code behind a flag rather than long-lived branches.
Tradeoffs worth naming
- Two CDNs (Cloudflare + Vercel). Adds a hop and a second cache to invalidate. Worth it for WAF, Workers, and controlled cache rules; not worth it if those are unused.
- Serverless functions. No cold-start tuning, but connection pooling to Mongo/Postgres needs care: we use the provider's pooled connection string and reuse clients across invocations.
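Reusing clients across invocations usually means caching the connection promise at module scope, so that warm invocations of the same function instance share one client. The sketch below shows the pattern generically. The helper name is an assumption, and the factory stands in for a real driver's connect call rather than reproducing one.

```typescript
// Generic "connect once, reuse afterwards" helper. The cached value is the
// promise itself, so concurrent callers share a single in-flight connect.
function memoizeClient<T>(factory: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!cached) {
      cached = factory().catch((err) => {
        cached = undefined; // forget a failed attempt so the next call can retry
        throw err;
      });
    }
    return cached;
  };
}
```

A module would call memoizeClient once at import time and export the returned getter, so each handler awaits the same client instead of opening a new connection per request.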