Technical Architecture
Groundwork — Shared Reality Platform for Home Renovation Transparency
0 Architecture Principles
These principles govern every technical decision in this document. When trade-offs arise, they serve as the tiebreaker. Any deviation from these principles requires an explicit ADR.
Single Responsibility
Every module, service, and function does one thing and does it well. The API layer does not contain business logic. Business logic does not know about the database schema.
Defense in Depth
Security controls exist at every layer: network, application, data, and audit. No single failure exposes sensitive data. Homeowner and contractor data are isolated at the database level, not just the application level.
Principle of Least Privilege
Every process, user role, and service account has only the permissions it needs and nothing more. Database roles are scoped per operation type. API tokens expire. Admin actions are logged.
Fail Fast, Recover Gracefully
Invalid input is rejected at the boundary with clear error messages. Integration failures (permit APIs, carrier tracking) fall back to cached state, never silently corrupt data or crash the application.
12-Factor App
Configuration lives in environment variables. Logs are streams to stdout. The app is stateless — any instance can handle any request. Dev/prod parity is maintained. Backing services are attached resources.
Explicit Over Implicit
No magic. No clever abstractions that hide what the system is doing. A junior developer reading any module should be able to trace a request from HTTP handler through business logic to database query in under 10 minutes.
Observability First
Structured logs, metrics, and error tracking are built in from day one. The system must be debuggable in production without attaching a debugger. Every background job emits completion metrics.
Evolutionary Architecture
Year 1 infrastructure must not over-engineer for Year 3 load. But Year 1 code must not make Year 3 scaling unnecessarily painful. Schema migrations, API versioning, and service boundaries are designed to evolve.
Layered Separation of Concerns
Every feature is implemented across clearly separated layers. No layer reaches across its boundary into another layer's responsibility.
1 System Overview
Groundwork is a two-sided platform connecting homeowners and contractors within a shared project workspace. The architecture is a monolithic API (intentionally — see ADR-07) deployed on Fly.io, a SvelteKit frontend on Netlify, PostgreSQL on Neon for all relational data, Cloudflare R2 for document storage, and Fly.io workers for background integration polling.
Key Architectural Decisions Visible in This Diagram
2 Frontend Architecture
Rendering Strategy
SvelteKit supports multiple rendering modes per route. Groundwork uses a deliberate split: SSR for public-facing pages (SEO and first-contentful-paint), and client-side SPA rendering for the authenticated application. This gives the landing page sub-1-second LCP without sacrificing the interactive richness of the project dashboard.
Public Routes — SSR (Server-Side Rendered)
/— Landing page (SEO critical)/pricing,/about— Marketing pages/login,/signup— Auth entry points/invite/[token]— Contractor invite landing
Rendered at request time on Netlify Edge Functions. HTML is indexable by search engines. Hydrated client-side after load.
App Routes — CSR (Client-Side SPA)
/dashboard— Project list/projects/[id]— Project workspace/projects/[id]/milestones/projects/[id]/documents/projects/[id]/payments/projects/[id]/change-orders/admin— Admin panel
JavaScript bundle loaded once. Route transitions are instant. Auth guard at layout level — unauthenticated users redirected to /login.
Component Architecture
Components follow the Container / Presentational pattern. No business logic in components. Data fetching happens in +page.ts load functions or via stores — never inline in component markup.
// Directory structure — feature-first, not type-first
src/
lib/
components/
ui/ // Button, Badge, Modal, Toast — design system atoms
project/ // ProjectCard, ProjectHeader, MilestoneTimeline
change-order/ // ChangeOrderForm, ChangeOrderStatus
payment/ // PaymentSchedule, PaymentRow
document/ // DocumentUploader, DocumentList
permit/ // PermitStatus, PermitTimeline
stores/
project.ts // Writable<Project> — current project context
notifications.ts // real-time notification queue
session.ts // User session, role, permissions
api/
client.ts // Typed fetch wrapper — all API calls go through here
projects.ts // Project CRUD functions
milestones.ts
documents.ts
utils/
format.ts // Currency, date formatting — pure functions
validate.ts // Client-side validation schemas (Zod)
routes/
(public)/ // SSR group
(app)/ // Auth-guarded CSR group
State Management
Svelte's built-in stores are sufficient for Groundwork's data model. No Zustand, Redux, or Pinia. The session store holds the current user. The project store holds the current project (populated by the route's load function). Notifications arrive via Server-Sent Events and push to the notifications store.
GET /api/stream, an SSE endpoint. When a background worker detects a permit status change or a contractor marks a milestone complete, PostgreSQL LISTEN/NOTIFY triggers the API to push a typed event to all connected clients on that project. The notification store updates, Svelte's reactivity re-renders the relevant component. No WebSocket complexity, no third-party realtime service.
Performance Budget
Achieving the performance budget:
- SvelteKit compiles away the framework — no virtual DOM, minimal runtime. Landing page JS is typically 20–40kb gzipped.
- Images served through Cloudflare's image resizing — correct size and format (WebP/AVIF) per viewport.
- Fonts loaded via
font-display: swapwith preconnect hints. Only 2 font families, 3 weights each. - Route-level code splitting is automatic in SvelteKit — the project dashboard doesn't load the admin panel's code.
- SvelteKit's
preloadstrategy: link hover triggers prefetch of the next route's data.
Security Considerations — Frontend
- CSRF protection: Session cookies use
SameSite=Strict. All state-mutating requests include a CSRF token from the session, validated server-side. - Content Security Policy: Strict CSP header from Netlify — no inline scripts, no eval, allowed hosts enumerated explicitly.
- Sensitive data: PII (addresses, contractor license numbers) is never stored in
localStorageorsessionStorage— only in memory. Refresh clears it; the API re-fetches as needed. - Input sanitization: All text content rendered via Svelte's
{}binding (not{@html}) — auto-escaped, XSS-safe by default.
3 Backend Architecture
Framework: Hono on Node.js (Fly.io)
The API is built with Hono, a lightweight TypeScript-first HTTP framework with excellent performance characteristics and middleware composability. It runs on Node.js on a Fly.io VM (not serverless — see ADR-07). The codebase is organized as a modular monolith: separate router files per domain, shared middleware, dependency injection via closure rather than a DI container.
Why REST over GraphQL — see ADR-05. Short version: Groundwork's data model is relational, access patterns are predictable CRUD, and GraphQL's N+1 risks and schema complexity add cost without benefit at this scale.
Key API Endpoints
| Method | Path | Description | Auth | Rate Limit |
|---|---|---|---|---|
| POST | /auth/login | Create session with email + password | Public | 5/min per IP |
| POST | /auth/logout | Invalidate session, clear cookie | Session | — |
| GET | /auth/me | Current user + role + permissions | Session | 60/min |
| POST | /projects | Create new project (homeowner initiates) | Homeowner | 10/hour |
| GET | /projects | List user's projects (both roles) | Session | 60/min |
| GET | /projects/:id | Project detail + health score | Member | 120/min |
| PATCH | /projects/:id | Update project metadata | Contractor | 30/min |
| GET | /projects/:id/milestones | List milestones with status | Member | 120/min |
| PATCH | /projects/:id/milestones/:mid | Mark milestone complete (triggers notification) | Contractor | 30/min |
| POST | /projects/:id/change-orders | Submit change order for approval | Contractor | 20/hour |
| PATCH | /projects/:id/change-orders/:cid | Approve or reject change order | Homeowner | 30/min |
| GET | /projects/:id/documents | List documents with signed download URLs | Member | 60/min |
| POST | /projects/:id/documents/upload-url | Generate signed R2 upload URL | Member | 20/min |
| POST | /projects/:id/documents/confirm | Confirm upload complete, persist metadata | Member | 20/min |
| GET | /projects/:id/payments | Payment schedule + status | Member | 60/min |
| GET | /projects/:id/permits | Permit status (cached, refreshed by poller) | Member | 60/min |
| GET | /projects/:id/health | Contract health score + breakdown | Member | 60/min |
| GET | /api/stream | SSE stream for real-time project events | Session | 5 connections/user |
| POST | /invites | Send contractor invite link | Homeowner | 10/day |
| GET | /invites/:token | Validate invite token, return project info | Public | 20/min per IP |
Authentication: Session Cookies (not JWT)
See ADR-03 for the full rationale. Sessions are stored server-side in Postgres (with a Redis cache for fast lookup). The client receives only an opaque session ID in an HttpOnly; Secure; SameSite=Strict cookie. Sessions expire after 24 hours of inactivity and are invalidated immediately on logout.
// Session table — stored in Postgres, cached in Redis
CREATE TABLE sessions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_seen TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL,
ip_address INET,
user_agent TEXT
);
-- Redis key: "session:{id}" → JSON(user_id, role, expires_at)
-- TTL: 24 hours, refreshed on each request
RBAC: Role-Based Access Control
Three roles: homeowner, contractor, admin. Role is stored on the project_members join table (not globally on the user) — a user can be a homeowner on one project and a contractor on another. The middleware resolves the caller's role within the requested project before routing.
GET /projects/abc/milestones resolves: is this user a member of project abc? What is their role within that project? This prevents the class of bug where a contractor on Project B can see Project A's data by guessing a UUID.
| Permission | Homeowner | Contractor | Admin |
|---|---|---|---|
| View project data | Yes | Yes | Yes |
| Create project | Yes | No | Yes |
| Invite contractor | Yes | No | Yes |
| Mark milestone complete | No | Yes | Yes |
| Submit change order | No | Yes | Yes |
| Approve change order | Yes | No | Yes |
| Upload documents | Yes | Yes | Yes |
| Delete documents | Own only | Own only | Yes |
| View admin panel | No | No | Yes |
Rate Limiting Strategy
Rate limits are enforced at the API middleware layer using a sliding window counter stored in Redis (Upstash). Three tiers:
- Global per-IP: 300 req/min for all authenticated requests. Blocks volumetric abuse before hitting business logic.
- Endpoint-specific: Tighter limits on destructive or expensive endpoints (see table above). Authentication endpoints are the most restrictive — 5 attempts/min/IP with exponential backoff.
- Per-user: 60 req/min across all authenticated requests for a given user ID, regardless of IP. Prevents credential-stuffed accounts from being used as scrapers.
4 Data Architecture
Primary Schema: PostgreSQL (Neon)
All relational data lives in a single PostgreSQL database hosted on Neon. The schema below covers the 7 primary entities plus supporting tables. Row Level Security is enabled on all tables containing project data.
Caching Strategy: When to Use Redis vs. Postgres
Not everything needs to be cached. Cache when: the data is read far more often than it changes, computing it is expensive, or latency for stale data is acceptable. Do not cache when: correctness is critical (payment amounts, audit events) or the dataset fits in a single Postgres query.
| Data | Cache? | TTL | Reason |
|---|---|---|---|
| Contract health score | Redis | 1 hour | Expensive to compute; acceptable if slightly stale |
| Permit status | Redis | 30 min | Polled externally; DB write + cache invalidate on update |
| User session | Redis | 24 hours | Hit on every request — must be sub-millisecond |
| Rate limit counters | Redis | 1 min | Requires atomic increment; Redis INCR is atomic |
| Project list | Postgres | — | Small result set; query is <10ms with index |
| Milestone list | Postgres | — | Must always reflect latest contractor updates |
| Payment amounts | Never | — | Financial data — must be authoritative from DB |
| Audit log | Never | — | Legal record — no caching ever |
| Signed R2 URLs | Redis | 55 min | Signed URLs expire at 60 min; cache saves regeneration |
Real-Time: PostgreSQL LISTEN/NOTIFY
PostgreSQL's built-in pub/sub mechanism handles real-time project updates without a separate WebSocket server or message broker. A Postgres trigger fires on writes to milestones, change_orders, and permits — emitting a notification on a channel named project:{project_id}. The API server maintains a single persistent Postgres connection per project that has active SSE subscribers, forwarding notifications to those clients.
-- Trigger on milestone update
CREATE OR REPLACE FUNCTION
notify_project_change()
RETURNS TRIGGER AS $$
BEGIN
PERFORM pg_notify(
'project:' || NEW.project_id,
json_build_object(
'entity', TG_TABLE_NAME,
'id', NEW.id,
'event', TG_OP
)::text
);
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER milestone_notify
AFTER INSERT OR UPDATE ON milestones
FOR EACH ROW EXECUTE FUNCTION
notify_project_change();
Document Storage: Cloudflare R2
Documents (contracts, permits, invoices, photos) are stored in Cloudflare R2. The flow is a two-step upload: the API generates a presigned PUT URL (valid 15 minutes), the client uploads directly to R2 (bypassing the API entirely), then the client POSTs a confirmation to the API which persists the metadata. This keeps large binaries off the application server.
5 Integration Architecture
City Permit APIs
This is the most technically challenging integration. There is no standard API — every municipality has its own portal, data model, and access method. Groundwork's Year 1 target (one city) allows for a focused adapter. Year 2 requires a pluggable adapter pattern.
Adapter Pattern
Each city permit source implements a common PermitAdapter interface:
interface PermitAdapter {
fetchByAddress(
address: string,
city: string,
state: string
): Promise<PermitRecord[]>;
fetchByPermitNumber(
id: string
): Promise<PermitRecord>;
normalizeStatus(
raw: string
): PermitStatus;
}
Failure Modes & Fallbacks
- API timeout (>10s): Return cached status from last successful poll. Surface staleness timestamp to user ("Last updated 3 hours ago").
- HTTP 4xx/5xx: Increment error counter. After 3 consecutive failures, mark permit as "Status Unavailable" and alert admin via Sentry.
- Schema change: Normalization layer logs unknown status values, maps to "Unknown" — does not throw. Alert is raised for manual review.
- City changes access method: Adapter is swapped without changing the caller. Raw responses stored in
permits.raw_responseJSONB for re-parsing.
Delivery Tracking
Carrier APIs (UPS, FedEx, USPS) provide REST APIs with reasonable documentation. Groundwork tracks deliveries linked to project milestones (e.g., "lumber delivery" before "framing" milestone).
| Carrier | Method | Auth | Rate Limit | Failure Mode |
|---|---|---|---|---|
| UPS | REST API | OAuth 2.0 | 40 req/day (free) | Cache last known status; retry 1h later |
| FedEx | REST API | OAuth 2.0 | 5000 req/day | Cache last known status; retry 1h later |
| USPS | REST API (v3) | OAuth 2.0 | 500 req/day | Cache last known status; retry 4h later |
License Verification
Contractor license verification is the most fragile integration. Most state databases offer no API — they require either screen scraping of a public search form or access to a third-party aggregator.
Tiered Verification Approach
- Tier 1 — Official API: A small number of states (CA, TX, FL, NY) have machine-readable license lookup APIs. Use these directly.
- Tier 2 — Structured scrape: States with consistent HTML structures get a Playwright-based scraper running in a Fly.io worker. Results cached 7 days.
- Tier 3 — Third-party aggregator: For states where scraping is unreliable, use a service like CSLB API or Contractor Check as a paid fallback.
- Tier 4 — Manual review queue: If all automated methods fail, flag the contractor for manual admin verification. Platform remains usable; verification shown as "Pending".
Security considerations
License numbers are PII — they are stored encrypted in users.license_number (column-level encryption via pgcrypto). Only admin role can view plaintext. The verification result (verified/not verified/pending) is stored as a separate non-encrypted field so it can be queried without decryption.
Verification is re-checked at 90-day intervals. An expired or revoked license triggers an admin alert and changes the contractor's trust indicator on shared projects.
Email & SMS (Notifications)
Outbound notifications use Resend for transactional email (excellent deliverability, TypeScript SDK, generous free tier) and Twilio for SMS on critical events (milestone payment due, change order requiring approval within 48h).
6 Background Processing
Background jobs run as separate Fly.io worker machines. They consume jobs from the Postgres job_queue table using the SKIP LOCKED pattern — multiple workers can run concurrently without stepping on each other, and jobs are never lost (they live in the database, not in memory). See ADR-04 for the trade-off analysis.
-- Worker claim query — safe for concurrent workers
UPDATE job_queue
SET
status = 'processing',
locked_at = NOW(),
attempts = attempts + 1
WHERE id = (
SELECT id FROM job_queue
WHERE status = 'pending'
AND run_at <= NOW()
AND attempts < max_attempts
ORDER BY run_at ASC
LIMIT 1
FOR UPDATE SKIP LOCKED
)
RETURNING *;
Job Types
| Job Type | Trigger | Frequency | SLA | Failure Action |
|---|---|---|---|---|
permit.poll |
Scheduled (cron) | Hourly per active project | Complete within 55 min | Retry 3x, then mark stale + alert |
health.compute |
On milestone/change-order write | On-demand | Complete within 30s | Retry 3x, serve stale score with staleness label |
notification.send |
On project events | On-demand | Delivered within 5 min | Retry with backoff, dead-letter after 3 failures |
delivery.track |
On tracking number added | Every 2 hours until delivered | Complete within 90 min | Retry 3x with 1h backoff |
digest.generate |
Scheduled (cron) | Daily at 7 AM project timezone | Complete before 8 AM | Skip day, alert admin if 3 consecutive failures |
license.verify |
On contractor join + every 90 days | On-demand + scheduled | Complete within 1 hour | Retry, then queue for manual review |
session.cleanup |
Scheduled (cron) | Daily at 2 AM UTC | Complete before 3 AM | Log failure, retry next day |
Contract Health Score Algorithm
The ContractHealthScore is a 0–100 composite score computed by the health.compute worker. It is the most business-critical computation in the system. The algorithm is deterministic and testable as a pure function.
- Schedule adherence (30 pts): Ratio of on-time milestone completions vs. late completions. A milestone is late if completed more than 3 days past its due date, or not yet complete and past due date.
- Change order frequency (25 pts): Projects with more than 2 change orders per 10 milestones are penalized. High change order rates correlate with scope creep or poor planning.
- Documentation completeness (20 pts): Key documents present: signed contract, active permit(s), current insurance certificate, inspection records.
- Payment timeliness (15 pts): Payments released within 72h of milestone approval. Delayed payments reduce the score.
- Communication currency (10 pts): Time since last activity in project (milestone update, document upload, message). Projects inactive >14 days lose points.
7 Security Architecture
Authentication Flow
Data Encryption
At Rest
- Database: Neon encrypts all data at rest using AES-256. No additional action required from the application.
- Column-level encryption: PII fields (address, license_number, phone) are additionally encrypted using
pgcryptowith a key stored in environment variables, not in the database. An attacker with a Postgres dump cannot read PII without the application key. - R2: Cloudflare R2 encrypts all objects at rest. Encryption keys are managed by Cloudflare.
In Transit
- All external traffic: TLS 1.3 enforced. TLS 1.0 and 1.1 are disabled at the Cloudflare WAF level. HSTS with
max-age=31536000; includeSubDomains. - API → Neon: TLS-encrypted connection. Neon requires TLS by default; plaintext connections are rejected.
- API → Redis: Upstash enforces TLS. The connection string includes
tls://. - API → R2: HTTPS only. Presigned URLs include a content-type restriction to prevent MIME sniffing attacks.
Multi-Tenant Isolation: Row Level Security
Row Level Security (RLS) is the last line of defense for multi-tenant data isolation. Even if the application layer has a bug that constructs an incorrect query, the database itself will not return data from another tenant's project.
-- Enable RLS on all project-scoped tables
ALTER TABLE milestones ENABLE ROW LEVEL SECURITY;
ALTER TABLE change_orders ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE permits ENABLE ROW LEVEL SECURITY;
-- Policy: user can only see milestones from their projects
CREATE POLICY milestones_member_policy
ON milestones FOR ALL TO app_user
USING (
project_id IN (
SELECT project_id
FROM project_members
WHERE user_id = current_setting('app.current_user_id')::uuid
)
);
-- API sets the user context before each query
-- SELECT set_config('app.current_user_id', $1, true)
-- Separate DB role for workers — cannot write to audit_events
CREATE ROLE app_worker;
GRANT SELECT, UPDATE ON job_queue TO app_worker;
GRANT SELECT, INSERT ON permits TO app_worker;
GRANT SELECT, UPDATE ON projects TO app_worker;
PII Inventory
| PII Field | Table | Encrypted? | Who Can Access | Retention |
|---|---|---|---|---|
| Email address | users | At rest (Neon) | User (own), Admin | Until account deletion |
| Phone number | users | Column-level (pgcrypto) | User (own), Admin | Until account deletion |
| Home address | projects | Column-level (pgcrypto) | Project members, Admin | Project lifetime + 7 years |
| License number | users | Column-level (pgcrypto) | Admin only | Until account deletion |
| Financial documents | R2 (key in documents) | R2 at-rest + TLS | Project members, Admin | Project lifetime + 7 years |
| IP addresses | sessions, audit_events | At rest (Neon) | Admin only | 90 days |
| Payment amounts | milestones, change_orders | At rest (Neon) | Project members, Admin | Project lifetime + 7 years |
Security Checklist by Threat Vector
Injection & Input Attacks
- All SQL via parameterized queries (Drizzle ORM — never string interpolation)
- All input validated with Zod schemas at the API boundary before reaching business logic
- File uploads: MIME type validated server-side (not client-supplied
Content-Type); max size enforced at R2 presigned URL generation; filename sanitized before storage - HTML rendering in SvelteKit uses
{}bindings — auto-escaped.{@html}is banned by ESLint rule.
Broken Access Control
- RBAC checked in middleware before every handler
- RLS as second enforcement layer in database
- Object-level authorization checked before every signed URL generation (is this user a member of this project?)
- Admin endpoints behind separate admin role; no privilege escalation path for homeowner/contractor roles
Auth & Session
- Passwords hashed with bcrypt (cost=12) — never stored plaintext
- Session IDs are UUIDs (128-bit entropy) — not sequential
- Sessions expire after 24h inactivity; absolute max 30 days
- Logout immediately invalidates session in Redis and Postgres
- Password reset tokens are single-use, 15-minute expiry, stored as bcrypt hashes
- CSRF tokens required for all state-mutating requests
Infrastructure
- Cloudflare WAF in front of Netlify and Fly.io
- Fly.io private networking for API → Neon (not public internet)
- Secrets in Fly.io secrets (env vars), never in code or logs
- Dependency scanning via
npm auditin CI - Security headers: CSP, X-Frame-Options, X-Content-Type-Options, HSTS
8 Scaling Strategy
Groundwork is designed to start simple and scale predictably. The architectural choices in Year 1 deliberately avoid premature optimization while maintaining clear paths to Year 3 capacity.
500 Projects
Single city — low volume, validate the product
- 1x Fly.io API machine (shared-cpu-2x, 512MB)
- 1x Fly.io worker machine (shared-cpu-1x)
- Neon free tier (0.5 vCPU, 1 GB)
- Upstash Redis free tier (10K commands/day)
- Estimated cost: ~$30/month
5,000 Projects
Multi-city expansion — 10x growth
- 2–3x Fly.io API machines (auto-scale on CPU)
- 2x Fly.io worker machines (more parallel pollers)
- Neon Launch ($19/month — 10 GB)
- Upstash Redis Pay-as-you-go
- PgBouncer connection pooling (Neon built-in)
- Estimated cost: ~$200/month
50,000 Projects
National scale — 100x from Year 1
- Auto-scaling Fly.io pool (4–10 machines)
- Neon Scale tier ($69/month + read replicas)
- Read replica for analytics queries
- Cloudflare R2 bandwidth savings at scale
- Cache hit rate target: >90% for health scores
- Estimated cost: ~$800–1,200/month
Database Scaling Path
The most important scaling risk is the database. The strategy is to exhaust Postgres vertical scaling before introducing horizontal complexity.
| Stage | Action | When to trigger |
|---|---|---|
| 1 | Add missing indexes; review slow query log | p95 query time > 200ms |
| 2 | Enable PgBouncer connection pooling (Neon built-in) | >50 concurrent connections |
| 3 | Add Redis caching for expensive derived queries | Same query hit rate > 10x/minute |
| 4 | Vertical scale Neon tier (more vCPU, RAM) | CPU > 70% sustained |
| 5 | Add read replica; route analytics to replica | Read/write ratio > 90%:10% |
| 6 | Partition audit_events by month (range partitioning) | Table > 50M rows |
| 7 | Consider sharding by city/region (unlikely before 500K projects) | >100M total rows, write contention |
API Scaling: Stateless by Design
The API server is stateless — all session state is in Redis, all data is in Postgres. Adding a second or tenth API machine requires zero coordination. Fly.io's built-in load balancer distributes traffic across instances. The SSE stream connections are sticky (Fly.io session affinity) to avoid the complexity of cross-instance event broadcasting.
Cache Strategy: TTLs and Invalidation
Cache invalidation is the root of many production bugs. The strategy here is explicit: cache entries are invalidated on write, not relying on TTL expiry for correctness. TTLs are a safety net, not the primary invalidation mechanism.
| Cache Key | TTL | Invalidated by |
|---|---|---|
session:{id} | 24h | Logout, password change, admin revocation |
health:{project_id} | 2h | Any milestone/change-order write to that project |
permit:{project_id} | 30m | Permit poller completing successfully |
r2_url:{document_id} | 55m | Never — URLs naturally expire at 60m |
rate:{ip}:{endpoint} | 60s sliding | TTL only (intended to expire) |
9 Observability
A system that cannot be debugged in production is incomplete. Observability is built in from day one. Every request, every background job, and every integration failure is traceable.
Structured Logging
All application logs are written to stdout as newline-delimited JSON (NDJSON) using Pino. Log levels: error, warn, info, debug. Production defaults to info. Fly.io captures stdout and streams to a log drain.
// Every log entry includes standard fields
{
"level": "info",
"time": 1744000000000,
"pid": 1234,
"requestId": "req_01J...", // Ulid — unique per request
"userId": "usr_01J...", // null if unauthenticated
"projectId": "proj_01J...", // null if not project-scoped
"method": "PATCH",
"path": "/projects/abc/milestones/xyz",
"status": 200,
"latencyMs": 42,
"msg": "milestone updated"
}
Key Metrics
API Metrics (emitted to Fly.io)
| Metric | Type | Alert Threshold |
|---|---|---|
| http_request_duration_ms | Histogram | p95 > 500ms |
| http_error_rate | Counter | > 1% 5xx in 5 min |
| active_sse_connections | Gauge | > 200 per instance |
| auth_failures_per_minute | Counter | > 50 failures/min |
Business Metrics (Postgres views)
| Metric | Query |
|---|---|
| Active projects | COUNT WHERE status='active' |
| Pending change orders | COUNT WHERE status='pending' |
| Job queue depth | COUNT WHERE status='pending' |
| Failed jobs (24h) | COUNT WHERE status='failed' |
Error Tracking: Sentry
Sentry captures unhandled exceptions and explicit error calls in both the frontend and backend. Configuration:
- Source maps uploaded during CI deploy — Sentry shows original TypeScript, not compiled JS
- PII scrubbing enabled in Sentry SDK (
send_default_pii: false) - Alerts: any new error issue triggers a Slack notification to
#eng-alerts - Performance tracing: 10% of transactions sampled (1% in Year 1 to stay within free tier)
- Integration failures (permit API, carrier API) logged as Sentry issues with breadcrumbs
Uptime Monitoring
UptimeRobot (free tier): HTTP monitors on GET /health (API) and the Netlify frontend. 5-minute check interval. Alerts via email + Slack.
The GET /health endpoint performs a lightweight check:
// Health check — fast, meaningful
GET /health → {
"status": "ok",
"db": "ok", // SELECT 1 against Neon
"redis": "ok", // PING against Upstash
"uptime": 86400 // seconds since last restart
}
On-Call Runbooks
Each Sentry alert links to a runbook in /docs/runbooks/. Every recurring alert must have a runbook before the system launches. Runbooks include: alert description, probable causes, diagnostic queries, resolution steps, and escalation path. This is non-negotiable — alerts without runbooks create panic, not resolution.
10 Architecture Decision Records
Each decision below documents context, the decision taken, and the trade-offs accepted. These are living records — if a decision is revisited, the ADR is updated with the date and reason.
SvelteKit over Next.js
Accepted April 2026Groundwork is a two-sided application — a marketing site that needs SEO, and a complex multi-view app with real-time updates. SvelteKit handles both in a single framework via per-route rendering strategies. Svelte compiles to vanilla JS (no virtual DOM), producing smaller bundles and faster runtime performance. The team prefers Svelte's developer experience and signal-based reactivity.
Next.js (React): Larger ecosystem, more hiring options, more examples online. Chosen against because the React bundle overhead (~40KB min+gzip) matters for the LCP target, and the team's Svelte familiarity removes the ecosystem advantage for this project size.
Nuxt (Vue): Similar bundle size advantages to Svelte. Rejected because team preference is Svelte and Vue's composition API is less ergonomic for complex stores than Svelte's writable stores.
PostgreSQL over MongoDB
Accepted April 2026Groundwork's core entities are strongly relational: a Project has many Milestones; a Milestone belongs to a Project and optionally to a ChangeOrder; a Payment is tied to a Milestone; an AuditEvent references a Project, an Actor, and an Entity. These relationships benefit from foreign key constraints, JOIN performance, and transactional guarantees that PostgreSQL provides natively. The audit log and financial records legally require ACID compliance.
MongoDB: Flexible schema would help with the fragmented permit.raw_response field. This is handled by storing raw permit responses in a JSONB column — PostgreSQL's JSONB gives the flexibility of document storage where needed, without sacrificing relational integrity on core entities.
PlanetScale (MySQL): Rejected because Postgres's LISTEN/NOTIFY and SKIP LOCKED patterns are core to the architecture. MySQL has no equivalent without a separate broker.
Session Cookies over JWT
Accepted April 2026Sessions can be immediately invalidated on logout, on suspicious activity detection, or by admin action — this is essential for a platform handling home renovation finances. JWTs cannot be invalidated before their expiry without building a token blacklist (which is, ironically, a server-side session store). The HttpOnly cookie attribute prevents XSS from stealing the session token — a crucial control given that contractor-submitted content might include uploaded documents with embedded scripts.
JWT: Stateless, no DB lookup on every request. Rejected because: (1) instant revocation is not optional for this use case; (2) JWTs stored in localStorage are XSS-vulnerable; (3) the Redis lookup we use for session validation is <1ms — the "no DB lookup" argument is moot at this scale.
Accepted cost: The SSE endpoint requires server affinity (sticky sessions) in Year 3 multi-instance deployment. This is managed via Fly.io's session affinity header — not an architectural constraint.
Postgres SKIP LOCKED over Redis for Job Queue
Accepted April 2026SKIP LOCKED pattern for the background job queue. No Redis queue (BullMQ, Sidekiq, etc.).Groundwork's job throughput is low — a few dozen jobs per hour at Year 1, a few hundred per hour at Year 3. A Postgres-backed queue is sufficient for this volume and eliminates an entire infrastructure dependency. Jobs survive database restarts (unlike in-memory Redis queues). The SKIP LOCKED pattern is well-documented, safe, and widely used in production systems at similar scales (Sidekiq used this before Redis was common).
BullMQ (Redis): Excellent DX, rich feature set, handles high throughput. Rejected because it requires Redis — which we already use for sessions and caching but at minimal volume. Adding a second Redis-dependent workload increases infrastructure coupling. If we already needed high-throughput Redis queuing, BullMQ would win.
Scale threshold: If job volume exceeds 10,000/hour consistently, or if job priority queuing becomes complex, migrate to BullMQ. The job schema is designed so this migration can be done without changing the worker interface.
REST over GraphQL
Accepted April 2026Groundwork's access patterns are well-defined and mostly CRUD: get project, get milestones for project, update milestone status. There are no deeply nested cross-entity queries from the frontend that would benefit from GraphQL's selectivity. The frontend is built alongside the API — there is no third-party consumer or mobile app requiring custom field selection. REST is simpler to cache (standard HTTP caching semantics), simpler to rate limit (per-endpoint), and simpler to audit-log.
GraphQL: Strong type system with schema-first development, excellent DX for complex nested queries. For Groundwork, the N+1 problem (a known GraphQL pitfall) would require DataLoader, adding complexity. The over-fetching problem that GraphQL solves is minimal here — we control both ends of the wire. If Groundwork ever builds a public API for third-party integrators, GraphQL becomes more compelling.
tRPC: Type-safe end-to-end without code generation is appealing for a TypeScript monorepo. Rejected because the SvelteKit frontend and Hono backend are in separate repositories (Netlify vs. Fly.io deploy). tRPC's main advantage — shared types without code generation — requires a monorepo to be ergonomic.
Neon over Supabase
Accepted April 2026Neon is a pure Postgres provider — it does not bundle opinions about auth, storage, or realtime. Groundwork has intentional choices in each of these areas (session cookies for auth, R2 for storage, LISTEN/NOTIFY for realtime) that would conflict with Supabase's bundled solutions. Neon's serverless architecture with auto-suspend is cost-effective for Year 1's low traffic. Neon supports branching — a useful feature for staging environments that share production schema.
Supabase: Would provide auth, storage (S3-compatible), and realtime out of the box — reducing integration work. Rejected because bundling these services into one platform creates vendor lock-in across multiple concerns. Groundwork's choices for auth and storage are intentional (see ADR-03, and R2 rationale in ADR-08) and Supabase's alternatives would require either accepting Supabase's approach or maintaining parallel implementations.
RDS (AWS): Battle-tested, no serverless cold starts, fine-grained IAM. Rejected as operationally heavyweight for Year 1. Migrate to RDS if Neon's connection limits or performance become a constraint above 50K projects.
Fly.io over Serverless (Lambda/Vercel Functions)
Accepted April 2026The SSE real-time endpoint and the background polling workers require persistent connections — a cold-started Lambda function cannot maintain an open SSE stream or hold a LISTEN/NOTIFY connection. Fly.io VMs start in ~300ms and remain running, allowing long-lived connections. Workers need to run on a schedule and maintain state across invocations (e.g., tracking which projects were last polled). Serverless would require external state coordination (DynamoDB, etc.) for this, eliminating the simplicity argument.
Serverless (Vercel/Lambda): Zero-ops scaling, pay-per-request pricing. Rejected because: (1) SSE requires persistent connections; (2) background polling workers need to run continuously, not on-demand; (3) cold start latency (100–500ms) hurts the sub-80ms p50 API target; (4) per-execution pricing becomes expensive at sustained load vs. always-on Fly.io VMs.
Accepted cost: We pay for VM uptime even during off-peak hours. At Year 1 traffic, this is ~$15–20/month — an acceptable cost for the architectural simplicity gained.
Cloudflare R2 over AWS S3
Accepted April 2026Groundwork's documents (contracts, permit applications, inspection photos) will be downloaded regularly by both homeowners and contractors. AWS S3 charges $0.09/GB for egress — on a platform where a single project might have 50–100 MB of documents viewed many times, egress costs compound quickly. Cloudflare R2 has zero egress fees. R2 is S3-compatible (same SDK, same presigned URL patterns), so the migration risk to S3 in the future is low. Documents are served via Cloudflare's global CDN, which also improves download speeds globally.
AWS S3: More mature, more third-party integrations, AWS Lambda triggers for post-upload processing. Rejected primarily on egress cost grounds. At Year 3 (50K projects × 75 MB average documents × frequent downloads), egress costs on S3 could be $500–2,000/month. R2 makes that $0.
Tigris on Fly.io: S3-compatible object storage that runs on Fly.io infrastructure — attractive because it collocates storage with the API. Considered seriously. R2 chosen over Tigris because Cloudflare's WAF and CDN are already in the stack; R2 integrates more tightly with Cloudflare's edge network for global download performance. Revisit if Tigris matures significantly.
Decision Summary
| # | Decision | Chosen | Primary Reason |
|---|---|---|---|
| 01 | Frontend framework | SvelteKit | Bundle size, team preference, SSR+SPA hybrid |
| 02 | Primary database | PostgreSQL (Neon) | Relational model, ACID, LISTEN/NOTIFY, SKIP LOCKED |
| 03 | Authentication | Session cookies | Instant revocation, XSS-safe HttpOnly cookie |
| 04 | Job queue | Postgres SKIP LOCKED | No extra dependency, durable, sufficient throughput |
| 05 | API style | REST | Simple CRUD patterns, no N+1 risk, cacheable |
| 06 | Postgres host | Neon | Pure Postgres, no bundled opinion, branching |
| 07 | Compute | Fly.io VMs | Persistent connections required (SSE, workers) |
| 08 | Object storage | Cloudflare R2 | Zero egress fees, CDN integration |