Technical Architecture

Groundwork — Shared Reality Platform for Home Renovation Transparency

Version1.0 — Draft
DateApril 2026
ScopeMVP through Year 3 scale
StatusEngineering Ready
Requirements72 functional requirements / 10 feature areas
Purpose This document proves Groundwork is buildable. It provides enough specificity that an engineering team can begin implementation immediately. Every architectural decision is justified with trade-offs documented. Security and performance are treated as first-class concerns evaluated in every section.

0 Architecture Principles

These principles govern every technical decision in this document. When trade-offs arise, they serve as the tiebreaker. Any deviation from these principles requires an explicit ADR.

Single Responsibility

Every module, service, and function does one thing and does it well. The API layer does not contain business logic. Business logic does not know about the database schema.

Defense in Depth

Security controls exist at every layer: network, application, data, and audit. No single failure exposes sensitive data. Homeowner and contractor data are isolated at the database level, not just the application level.

Principle of Least Privilege

Every process, user role, and service account has only the permissions it needs and nothing more. Database roles are scoped per operation type. API tokens expire. Admin actions are logged.

Fail Fast, Recover Gracefully

Invalid input is rejected at the boundary with clear error messages. Integration failures (permit APIs, carrier tracking) fall back to cached state, never silently corrupt data or crash the application.

12-Factor App

Configuration lives in environment variables. Logs are streams to stdout. The app is stateless — any instance can handle any request. Dev/prod parity is maintained. Backing services are attached resources.

Explicit Over Implicit

No magic. No clever abstractions that hide what the system is doing. A junior developer reading any module should be able to trace a request from HTTP handler through business logic to database query in under 10 minutes.

Observability First

Structured logs, metrics, and error tracking are built in from day one. The system must be debuggable in production without attaching a debugger. Every background job emits completion metrics.

Evolutionary Architecture

Year 1 infrastructure must not over-engineer for Year 3 load. But Year 1 code must not make Year 3 scaling unnecessarily painful. Schema migrations, API versioning, and service boundaries are designed to evolve.

Layered Separation of Concerns

Every feature is implemented across clearly separated layers. No layer reaches across its boundary into another layer's responsibility.

Presentation Layer SvelteKit components, stores, route handlers — rendering only, no business rules
API Contract Layer HTTP handlers, request validation (Zod/Joi), response serialization, auth middleware
Business Logic Layer Domain models, rules (contract health scoring, change order approval), pure functions
Data Access Layer Repository pattern — SQL queries, cache reads/writes, zero business logic
Infrastructure Layer PostgreSQL, R2 object storage, Upstash Redis, external API clients — all replaceable

1 System Overview

Groundwork is a two-sided platform connecting homeowners and contractors within a shared project workspace. The architecture is a monolithic API (intentionally — see ADR-07) deployed on Fly.io, a SvelteKit frontend on Netlify, PostgreSQL on Neon for all relational data, Cloudflare R2 for document storage, and Fly.io workers for background integration polling.

CLIENT NETLIFY CDN FLY.IO — API FLY.IO — WORKERS DATA LAYER EXTERNAL INTEGRATIONS CLOUDFLARE Homeowner Browser / PWA Contractor Browser / PWA SvelteKit SSR + SPA Auth Middleware Session validation · RBAC REST API (Hono) Route handlers · Validation Business Logic Domain rules · Health scoring Data Access (Drizzle) Repositories · Cache layer Permit Poller Hourly Delivery Tracker On-demand Notif. Worker Event-driven Health Scorer On-change PostgreSQL Neon (managed) RLS · LISTEN/NOTIFY Redis Cache Upstash (serverless) Sessions · Scores · Permits Job Queue Postgres + SKIP LOCKED Durable · No Redis needed City Permit APIs (fragmented) Carrier APIs UPS · FedEx · USPS License DBs State registries Email / SMS Resend · Twilio R2 Storage Cloudflare R2 Documents · Photos CDN / WAF Cloudflare DDoS · Rate limit · TLS HTTPS REST poll signed URL SECURITY PERIMETER TLS 1.3 everywhere Postgres RLS per tenant Signed URLs for R2 WAF at CDN edge PERFORMANCE CDN caches static assets Redis: scores + permits LISTEN/NOTIFY: real-time SKIP LOCKED: no queue lag

Key Architectural Decisions Visible in This Diagram

Monolithic API, not microservices At 500–50,000 projects, a well-structured monolith is operationally simpler, easier to debug, and avoids distributed-transaction complexity. Service boundaries are enforced by module structure, not network calls.
Workers are separate processes Background polling workers run as separate Fly.io machines — isolated from the API so a slow permit scrape cannot degrade API response times. They communicate via the Postgres job queue, not in-process.
Multi-Tenant Isolation Every database query is constrained by Row Level Security policies in PostgreSQL. The application layer cannot accidentally return data from one project to another — the database itself enforces the boundary regardless of application bugs.

2 Frontend Architecture

Rendering Strategy

SvelteKit supports multiple rendering modes per route. Groundwork uses a deliberate split: SSR for public-facing pages (SEO and first-contentful-paint), and client-side SPA rendering for the authenticated application. This gives the landing page sub-1-second LCP without sacrificing the interactive richness of the project dashboard.

Public Routes — SSR (Server-Side Rendered)

  • / — Landing page (SEO critical)
  • /pricing, /about — Marketing pages
  • /login, /signup — Auth entry points
  • /invite/[token] — Contractor invite landing

Rendered at request time on Netlify Edge Functions. HTML is indexable by search engines. Hydrated client-side after load.

App Routes — CSR (Client-Side SPA)

  • /dashboard — Project list
  • /projects/[id] — Project workspace
  • /projects/[id]/milestones
  • /projects/[id]/documents
  • /projects/[id]/payments
  • /projects/[id]/change-orders
  • /admin — Admin panel

JavaScript bundle loaded once. Route transitions are instant. Auth guard at layout level — unauthenticated users redirected to /login.

Component Architecture

Components follow the Container / Presentational pattern. No business logic in components. Data fetching happens in +page.ts load functions or via stores — never inline in component markup.

// Directory structure — feature-first, not type-first
src/
  lib/
    components/
      ui/               // Button, Badge, Modal, Toast — design system atoms
      project/          // ProjectCard, ProjectHeader, MilestoneTimeline
      change-order/     // ChangeOrderForm, ChangeOrderStatus
      payment/          // PaymentSchedule, PaymentRow
      document/         // DocumentUploader, DocumentList
      permit/           // PermitStatus, PermitTimeline
    stores/
      project.ts        // Writable<Project> — current project context
      notifications.ts  // real-time notification queue
      session.ts        // User session, role, permissions
    api/
      client.ts         // Typed fetch wrapper — all API calls go through here
      projects.ts       // Project CRUD functions
      milestones.ts
      documents.ts
    utils/
      format.ts         // Currency, date formatting — pure functions
      validate.ts       // Client-side validation schemas (Zod)
  routes/
    (public)/           // SSR group
    (app)/              // Auth-guarded CSR group

State Management

Svelte's built-in stores are sufficient for Groundwork's data model. No Zustand, Redux, or Pinia. The session store holds the current user. The project store holds the current project (populated by the route's load function). Notifications arrive via Server-Sent Events and push to the notifications store.

Real-Time Updates via Server-Sent Events (SSE) The authenticated app subscribes to GET /api/stream, an SSE endpoint. When a background worker detects a permit status change or a contractor marks a milestone complete, PostgreSQL LISTEN/NOTIFY triggers the API to push a typed event to all connected clients on that project. The notification store updates, Svelte's reactivity re-renders the relevant component. No WebSocket complexity, no third-party realtime service.

Performance Budget

< 1.5s LCP (target)
< 100ms FID / INP
< 0.1 CLS
< 120kb JS Bundle (gzip)

Achieving the performance budget:

Security Considerations — Frontend

3 Backend Architecture

Framework: Hono on Node.js (Fly.io)

The API is built with Hono, a lightweight TypeScript-first HTTP framework with excellent performance characteristics and middleware composability. It runs on Node.js on a Fly.io VM (not serverless — see ADR-07). The codebase is organized as a modular monolith: separate router files per domain, shared middleware, dependency injection via closure rather than a DI container.

Why REST over GraphQL — see ADR-05. Short version: Groundwork's data model is relational, access patterns are predictable CRUD, and GraphQL's N+1 risks and schema complexity add cost without benefit at this scale.

Key API Endpoints

Method Path Description Auth Rate Limit
POST /auth/login Create session with email + password Public 5/min per IP
POST /auth/logout Invalidate session, clear cookie Session
GET /auth/me Current user + role + permissions Session 60/min
POST /projects Create new project (homeowner initiates) Homeowner 10/hour
GET /projects List user's projects (both roles) Session 60/min
GET /projects/:id Project detail + health score Member 120/min
PATCH /projects/:id Update project metadata Contractor 30/min
GET /projects/:id/milestones List milestones with status Member 120/min
PATCH /projects/:id/milestones/:mid Mark milestone complete (triggers notification) Contractor 30/min
POST /projects/:id/change-orders Submit change order for approval Contractor 20/hour
PATCH /projects/:id/change-orders/:cid Approve or reject change order Homeowner 30/min
GET /projects/:id/documents List documents with signed download URLs Member 60/min
POST /projects/:id/documents/upload-url Generate signed R2 upload URL Member 20/min
POST /projects/:id/documents/confirm Confirm upload complete, persist metadata Member 20/min
GET /projects/:id/payments Payment schedule + status Member 60/min
GET /projects/:id/permits Permit status (cached, refreshed by poller) Member 60/min
GET /projects/:id/health Contract health score + breakdown Member 60/min
GET /api/stream SSE stream for real-time project events Session 5 connections/user
POST /invites Send contractor invite link Homeowner 10/day
GET /invites/:token Validate invite token, return project info Public 20/min per IP

Authentication: Session Cookies (not JWT)

See ADR-03 for the full rationale. Sessions are stored server-side in Postgres (with a Redis cache for fast lookup). The client receives only an opaque session ID in an HttpOnly; Secure; SameSite=Strict cookie. Sessions expire after 24 hours of inactivity and are invalidated immediately on logout.

// Session table — stored in Postgres, cached in Redis
CREATE TABLE sessions (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_seen   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at  TIMESTAMPTZ NOT NULL,
  ip_address  INET,
  user_agent  TEXT
);

-- Redis key: "session:{id}" → JSON(user_id, role, expires_at)
-- TTL: 24 hours, refreshed on each request

RBAC: Role-Based Access Control

Three roles: homeowner, contractor, admin. Role is stored on the project_members join table (not globally on the user) — a user can be a homeowner on one project and a contractor on another. The middleware resolves the caller's role within the requested project before routing.

Role resolution is per-project, not per-user GET /projects/abc/milestones resolves: is this user a member of project abc? What is their role within that project? This prevents the class of bug where a contractor on Project B can see Project A's data by guessing a UUID.
PermissionHomeownerContractorAdmin
View project dataYesYesYes
Create projectYesNoYes
Invite contractorYesNoYes
Mark milestone completeNoYesYes
Submit change orderNoYesYes
Approve change orderYesNoYes
Upload documentsYesYesYes
Delete documentsOwn onlyOwn onlyYes
View admin panelNoNoYes

Rate Limiting Strategy

Rate limits are enforced at the API middleware layer using a sliding window counter stored in Redis (Upstash). Three tiers:

Performance: API Response Time Targets p50 < 80ms, p95 < 300ms, p99 < 800ms for all non-streaming endpoints. The SSE stream endpoint has no latency target but must not block the Node.js event loop. CPU-intensive health score computation runs in a worker process, not in the API request path.

4 Data Architecture

Primary Schema: PostgreSQL (Neon)

All relational data lives in a single PostgreSQL database hosted on Neon. The schema below covers the 7 primary entities plus supporting tables. Row Level Security is enabled on all tables containing project data.

-- ────────────────────────────────────────────────── -- USERS -- ────────────────────────────────────────────────── CREATE TABLE users ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), email TEXT NOT NULL UNIQUE, display_name TEXT NOT NULL, password_hash TEXT NOT NULL, -- bcrypt, cost=12 phone TEXT, -- nullable — not required at signup avatar_key TEXT, -- R2 object key license_number TEXT, -- contractors only license_state CHAR(2), -- two-letter state code license_verified_at TIMESTAMPTZ, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), last_login TIMESTAMPTZ ); CREATE INDEX idx_users_email ON users(email); -- ────────────────────────────────────────────────── -- PROJECTS -- ────────────────────────────────────────────────── CREATE TABLE projects ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), name TEXT NOT NULL, address TEXT NOT NULL, -- encrypted at rest (pgcrypto) city TEXT NOT NULL, state CHAR(2) NOT NULL, zip TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'active', -- active|paused|completed|disputed contract_total NUMERIC(12,2), start_date DATE, end_date DATE, health_score SMALLINT, -- 0-100, recomputed by worker health_computed_at TIMESTAMPTZ, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); CREATE INDEX idx_projects_status ON projects(status); -- ────────────────────────────────────────────────── -- PROJECT MEMBERS (multi-tenant pivot) -- ────────────────────────────────────────────────── CREATE TABLE project_members ( project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE, user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE, role TEXT NOT NULL, -- 'homeowner' | 'contractor' joined_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), PRIMARY KEY (project_id, user_id) ); CREATE INDEX idx_project_members_user ON project_members(user_id); -- ────────────────────────────────────────────────── -- MILESTONES -- ────────────────────────────────────────────────── CREATE TABLE milestones ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE, title TEXT NOT NULL, description TEXT, status TEXT NOT NULL DEFAULT 'pending', -- pending|in_progress|complete|disputed due_date DATE, completed_at TIMESTAMPTZ, completed_by UUID REFERENCES users(id), payment_amount NUMERIC(12,2), -- portion of contract due at completion sort_order SMALLINT NOT NULL DEFAULT 0, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); CREATE INDEX idx_milestones_project ON milestones(project_id, sort_order); -- ────────────────────────────────────────────────── -- CHANGE ORDERS -- ────────────────────────────────────────────────── CREATE TABLE change_orders ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE, submitted_by UUID NOT NULL REFERENCES users(id), title TEXT NOT NULL, description TEXT NOT NULL, cost_delta NUMERIC(12,2) NOT NULL, -- positive = increase, negative = credit time_delta INTERVAL, -- schedule impact status TEXT NOT NULL DEFAULT 'pending', -- pending|approved|rejected|withdrawn reviewed_by UUID REFERENCES users(id), reviewed_at TIMESTAMPTZ, review_note TEXT, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); -- ────────────────────────────────────────────────── -- DOCUMENTS -- ────────────────────────────────────────────────── CREATE TABLE documents ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE, uploaded_by UUID NOT NULL REFERENCES users(id), filename TEXT NOT NULL, mime_type TEXT NOT NULL, size_bytes BIGINT NOT NULL, r2_key TEXT NOT NULL UNIQUE, -- R2 object key — never exposed to client category TEXT, -- contract|permit|invoice|photo|other created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); CREATE INDEX idx_documents_project ON documents(project_id, created_at DESC); -- ────────────────────────────────────────────────── -- PERMITS -- ────────────────────────────────────────────────── CREATE TABLE permits ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE, permit_number TEXT, permit_type TEXT NOT NULL, -- building|electrical|plumbing|mechanical status TEXT NOT NULL, -- applied|issued|inspected|final|expired issued_date DATE, expiry_date DATE, city_api_id TEXT, -- external ID in city's system last_polled TIMESTAMPTZ, raw_response JSONB, -- normalized later, raw kept for debugging created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); -- ────────────────────────────────────────────────── -- AUDIT LOG — append-only, never updated -- ────────────────────────────────────────────────── CREATE TABLE audit_events ( id BIGSERIAL PRIMARY KEY, project_id UUID REFERENCES projects(id) ON DELETE SET NULL, actor_id UUID REFERENCES users(id) ON DELETE SET NULL, event_type TEXT NOT NULL, entity_type TEXT NOT NULL, entity_id UUID, payload JSONB NOT NULL DEFAULT '{}', ip_address INET, occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); CREATE INDEX idx_audit_project ON audit_events(project_id, occurred_at DESC); CREATE INDEX idx_audit_actor ON audit_events(actor_id, occurred_at DESC); -- ────────────────────────────────────────────────── -- JOB QUEUE (Postgres SKIP LOCKED pattern) -- ────────────────────────────────────────────────── CREATE TABLE job_queue ( id BIGSERIAL PRIMARY KEY, job_type TEXT NOT NULL, payload JSONB NOT NULL, status TEXT NOT NULL DEFAULT 'pending', attempts SMALLINT NOT NULL DEFAULT 0, max_attempts SMALLINT NOT NULL DEFAULT 3, run_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), locked_at TIMESTAMPTZ, completed_at TIMESTAMPTZ, error TEXT, created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() ); CREATE INDEX idx_job_queue_ready ON job_queue(run_at) WHERE status = 'pending';

Caching Strategy: When to Use Redis vs. Postgres

Not everything needs to be cached. Cache when: the data is read far more often than it changes, computing it is expensive, or latency for stale data is acceptable. Do not cache when: correctness is critical (payment amounts, audit events) or the dataset fits in a single Postgres query.

DataCache?TTLReason
Contract health scoreRedis1 hourExpensive to compute; acceptable if slightly stale
Permit statusRedis30 minPolled externally; DB write + cache invalidate on update
User sessionRedis24 hoursHit on every request — must be sub-millisecond
Rate limit countersRedis1 minRequires atomic increment; Redis INCR is atomic
Project listPostgresSmall result set; query is <10ms with index
Milestone listPostgresMust always reflect latest contractor updates
Payment amountsNeverFinancial data — must be authoritative from DB
Audit logNeverLegal record — no caching ever
Signed R2 URLsRedis55 minSigned URLs expire at 60 min; cache saves regeneration

Real-Time: PostgreSQL LISTEN/NOTIFY

PostgreSQL's built-in pub/sub mechanism handles real-time project updates without a separate WebSocket server or message broker. A Postgres trigger fires on writes to milestones, change_orders, and permits — emitting a notification on a channel named project:{project_id}. The API server maintains a single persistent Postgres connection per project that has active SSE subscribers, forwarding notifications to those clients.

-- Trigger on milestone update
CREATE OR REPLACE FUNCTION
  notify_project_change()
RETURNS TRIGGER AS $$
BEGIN
  PERFORM pg_notify(
    'project:' || NEW.project_id,
    json_build_object(
      'entity', TG_TABLE_NAME,
      'id',     NEW.id,
      'event',  TG_OP
    )::text
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER milestone_notify
AFTER INSERT OR UPDATE ON milestones
FOR EACH ROW EXECUTE FUNCTION
  notify_project_change();
Why not WebSockets? LISTEN/NOTIFY is sufficient for Groundwork's real-time needs. Events are low-frequency (a milestone update, a permit status change), not high-frequency streaming data. Server-Sent Events (one-directional) are simpler to implement, proxy-safe, and auto-reconnecting in browsers. WebSockets add connection management complexity with no benefit here.
Scale limit LISTEN/NOTIFY works well up to ~100 concurrent project subscriptions per API instance. At Year 3 scale, if concurrent active sessions exceed this, the API auto-scales horizontally (each instance holds its own LISTEN connections). No architectural change needed.

Document Storage: Cloudflare R2

Documents (contracts, permits, invoices, photos) are stored in Cloudflare R2. The flow is a two-step upload: the API generates a presigned PUT URL (valid 15 minutes), the client uploads directly to R2 (bypassing the API entirely), then the client POSTs a confirmation to the API which persists the metadata. This keeps large binaries off the application server.

Security: Documents are never publicly accessible R2 bucket is private. Download links are presigned GET URLs generated server-side, valid for 60 minutes, scoped to the requesting user's session. The R2 object key is opaque (UUID-based) — guessing another project's document key is computationally infeasible. RBAC is checked before generating any signed URL.

5 Integration Architecture

Integration Philosophy External integrations are the highest-risk part of the system — they are outside our control, poorly documented, frequently down, and occasionally change without notice. Every integration is wrapped in an adapter with a defined failure mode. The system must degrade gracefully: a permit API being down never prevents a homeowner from viewing their project.

City Permit APIs

This is the most technically challenging integration. There is no standard API — every municipality has its own portal, data model, and access method. Groundwork's Year 1 target (one city) allows for a focused adapter. Year 2 requires a pluggable adapter pattern.

Adapter Pattern

Each city permit source implements a common PermitAdapter interface:

interface PermitAdapter {
  fetchByAddress(
    address: string,
    city: string,
    state: string
  ): Promise<PermitRecord[]>;

  fetchByPermitNumber(
    id: string
  ): Promise<PermitRecord>;

  normalizeStatus(
    raw: string
  ): PermitStatus;
}

Failure Modes & Fallbacks

  • API timeout (>10s): Return cached status from last successful poll. Surface staleness timestamp to user ("Last updated 3 hours ago").
  • HTTP 4xx/5xx: Increment error counter. After 3 consecutive failures, mark permit as "Status Unavailable" and alert admin via Sentry.
  • Schema change: Normalization layer logs unknown status values, maps to "Unknown" — does not throw. Alert is raised for manual review.
  • City changes access method: Adapter is swapped without changing the caller. Raw responses stored in permits.raw_response JSONB for re-parsing.

Delivery Tracking

Carrier APIs (UPS, FedEx, USPS) provide REST APIs with reasonable documentation. Groundwork tracks deliveries linked to project milestones (e.g., "lumber delivery" before "framing" milestone).

CarrierMethodAuthRate LimitFailure Mode
UPSREST APIOAuth 2.040 req/day (free)Cache last known status; retry 1h later
FedExREST APIOAuth 2.05000 req/dayCache last known status; retry 1h later
USPSREST API (v3)OAuth 2.0500 req/dayCache last known status; retry 4h later
Carrier detection Tracking numbers are routed to the correct carrier by pattern matching (UPS starts with "1Z", FedEx is 12 or 15 digits, USPS follows USPS formats). If pattern is ambiguous, the system tries carriers in priority order and returns the first successful result.

License Verification

Contractor license verification is the most fragile integration. Most state databases offer no API — they require either screen scraping of a public search form or access to a third-party aggregator.

Tiered Verification Approach

  1. Tier 1 — Official API: A small number of states (CA, TX, FL, NY) have machine-readable license lookup APIs. Use these directly.
  2. Tier 2 — Structured scrape: States with consistent HTML structures get a Playwright-based scraper running in a Fly.io worker. Results cached 7 days.
  3. Tier 3 — Third-party aggregator: For states where scraping is unreliable, use a service like CSLB API or Contractor Check as a paid fallback.
  4. Tier 4 — Manual review queue: If all automated methods fail, flag the contractor for manual admin verification. Platform remains usable; verification shown as "Pending".

Security considerations

License numbers are PII — they are stored encrypted in users.license_number (column-level encryption via pgcrypto). Only admin role can view plaintext. The verification result (verified/not verified/pending) is stored as a separate non-encrypted field so it can be queried without decryption.

Verification is re-checked at 90-day intervals. An expired or revoked license triggers an admin alert and changes the contractor's trust indicator on shared projects.

Email & SMS (Notifications)

Outbound notifications use Resend for transactional email (excellent deliverability, TypeScript SDK, generous free tier) and Twilio for SMS on critical events (milestone payment due, change order requiring approval within 48h).

Notification failure handling Notification dispatch is always async — via the job queue. If Resend is down, the job is retried with exponential backoff (1 min, 5 min, 30 min). After 3 failures, the job is moved to a dead-letter table and the user's in-app notification is flagged as "Email delivery failed." The user experience is never blocked by notification infrastructure failures.

6 Background Processing

Background jobs run as separate Fly.io worker machines. They consume jobs from the Postgres job_queue table using the SKIP LOCKED pattern — multiple workers can run concurrently without stepping on each other, and jobs are never lost (they live in the database, not in memory). See ADR-04 for the trade-off analysis.

-- Worker claim query — safe for concurrent workers
UPDATE job_queue
SET
  status    = 'processing',
  locked_at = NOW(),
  attempts  = attempts + 1
WHERE id = (
  SELECT id FROM job_queue
  WHERE  status = 'pending'
    AND  run_at <= NOW()
    AND  attempts < max_attempts
  ORDER BY run_at ASC
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;

Job Types

Job TypeTriggerFrequencySLAFailure Action
permit.poll Scheduled (cron) Hourly per active project Complete within 55 min Retry 3x, then mark stale + alert
health.compute On milestone/change-order write On-demand Complete within 30s Retry 3x, serve stale score with staleness label
notification.send On project events On-demand Delivered within 5 min Retry with backoff, dead-letter after 3 failures
delivery.track On tracking number added Every 2 hours until delivered Complete within 90 min Retry 3x with 1h backoff
digest.generate Scheduled (cron) Daily at 7 AM project timezone Complete before 8 AM Skip day, alert admin if 3 consecutive failures
license.verify On contractor join + every 90 days On-demand + scheduled Complete within 1 hour Retry, then queue for manual review
session.cleanup Scheduled (cron) Daily at 2 AM UTC Complete before 3 AM Log failure, retry next day

Contract Health Score Algorithm

The ContractHealthScore is a 0–100 composite score computed by the health.compute worker. It is the most business-critical computation in the system. The algorithm is deterministic and testable as a pure function.

Health Score Components
  • Schedule adherence (30 pts): Ratio of on-time milestone completions vs. late completions. A milestone is late if completed more than 3 days past its due date, or not yet complete and past due date.
  • Change order frequency (25 pts): Projects with more than 2 change orders per 10 milestones are penalized. High change order rates correlate with scope creep or poor planning.
  • Documentation completeness (20 pts): Key documents present: signed contract, active permit(s), current insurance certificate, inspection records.
  • Payment timeliness (15 pts): Payments released within 72h of milestone approval. Delayed payments reduce the score.
  • Communication currency (10 pts): Time since last activity in project (milestone update, document upload, message). Projects inactive >14 days lose points.

7 Security Architecture

Security is non-negotiable Groundwork stores home addresses, financial arrangements, contractor license numbers, and signed legal documents. A breach would cause real harm to real people. Security controls are specified precisely here and enforced as code — not documentation aspiration.

Authentication Flow

CLIENT AUTH MIDDLEWARE REDIS CACHE POSTGRES AUDIT LOG POST /auth/login email + password Rate limit check 5/min per IP (Redis) INCR counter TTL sliding window Limit exceeded 429 → client Lookup user by email bcrypt compare hash Wrong password 401 → client Write audit event login_success INSERT session row UUID session ID SET session cache TTL 24h Set-Cookie header HttpOnly; Secure; SameSite 200 OK Session established SUBSEQUENT REQUESTS: Cookie → Middleware → Redis lookup (session:id) → if miss → Postgres sessions table → RBAC → handler Session lookup is Redis-first (sub-millisecond). Falls back to Postgres on cache miss. Session freshness validated on every request.

Data Encryption

At Rest

  • Database: Neon encrypts all data at rest using AES-256. No additional action required from the application.
  • Column-level encryption: PII fields (address, license_number, phone) are additionally encrypted using pgcrypto with a key stored in environment variables, not in the database. An attacker with a Postgres dump cannot read PII without the application key.
  • R2: Cloudflare R2 encrypts all objects at rest. Encryption keys are managed by Cloudflare.

In Transit

  • All external traffic: TLS 1.3 enforced. TLS 1.0 and 1.1 are disabled at the Cloudflare WAF level. HSTS with max-age=31536000; includeSubDomains.
  • API → Neon: TLS-encrypted connection. Neon requires TLS by default; plaintext connections are rejected.
  • API → Redis: Upstash enforces TLS. The connection string includes tls://.
  • API → R2: HTTPS only. Presigned URLs include a content-type restriction to prevent MIME sniffing attacks.

Multi-Tenant Isolation: Row Level Security

Row Level Security (RLS) is the last line of defense for multi-tenant data isolation. Even if the application layer has a bug that constructs an incorrect query, the database itself will not return data from another tenant's project.

-- Enable RLS on all project-scoped tables
ALTER TABLE milestones     ENABLE ROW LEVEL SECURITY;
ALTER TABLE change_orders  ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents      ENABLE ROW LEVEL SECURITY;
ALTER TABLE permits        ENABLE ROW LEVEL SECURITY;

-- Policy: user can only see milestones from their projects
CREATE POLICY milestones_member_policy
ON milestones FOR ALL TO app_user
USING (
  project_id IN (
    SELECT project_id
    FROM   project_members
    WHERE  user_id = current_setting('app.current_user_id')::uuid
  )
);

-- API sets the user context before each query
-- SELECT set_config('app.current_user_id', $1, true)

-- Separate DB role for workers — cannot write to audit_events
CREATE ROLE app_worker;
GRANT SELECT, UPDATE ON job_queue TO app_worker;
GRANT SELECT, INSERT ON permits   TO app_worker;
GRANT SELECT, UPDATE ON projects  TO app_worker;

PII Inventory

PII FieldTableEncrypted?Who Can AccessRetention
Email addressusersAt rest (Neon)User (own), AdminUntil account deletion
Phone numberusersColumn-level (pgcrypto)User (own), AdminUntil account deletion
Home addressprojectsColumn-level (pgcrypto)Project members, AdminProject lifetime + 7 years
License numberusersColumn-level (pgcrypto)Admin onlyUntil account deletion
Financial documentsR2 (key in documents)R2 at-rest + TLSProject members, AdminProject lifetime + 7 years
IP addressessessions, audit_eventsAt rest (Neon)Admin only90 days
Payment amountsmilestones, change_ordersAt rest (Neon)Project members, AdminProject lifetime + 7 years

Security Checklist by Threat Vector

Injection & Input Attacks

  • All SQL via parameterized queries (Drizzle ORM — never string interpolation)
  • All input validated with Zod schemas at the API boundary before reaching business logic
  • File uploads: MIME type validated server-side (not client-supplied Content-Type); max size enforced at R2 presigned URL generation; filename sanitized before storage
  • HTML rendering in SvelteKit uses {} bindings — auto-escaped. {@html} is banned by ESLint rule.

Broken Access Control

  • RBAC checked in middleware before every handler
  • RLS as second enforcement layer in database
  • Object-level authorization checked before every signed URL generation (is this user a member of this project?)
  • Admin endpoints behind separate admin role; no privilege escalation path for homeowner/contractor roles

Auth & Session

  • Passwords hashed with bcrypt (cost=12) — never stored plaintext
  • Session IDs are UUIDs (128-bit entropy) — not sequential
  • Sessions expire after 24h inactivity; absolute max 30 days
  • Logout immediately invalidates session in Redis and Postgres
  • Password reset tokens are single-use, 15-minute expiry, stored as bcrypt hashes
  • CSRF tokens required for all state-mutating requests

Infrastructure

  • Cloudflare WAF in front of Netlify and Fly.io
  • Fly.io private networking for API → Neon (not public internet)
  • Secrets in Fly.io secrets (env vars), never in code or logs
  • Dependency scanning via npm audit in CI
  • Security headers: CSP, X-Frame-Options, X-Content-Type-Options, HSTS

8 Scaling Strategy

Groundwork is designed to start simple and scale predictably. The architectural choices in Year 1 deliberately avoid premature optimization while maintaining clear paths to Year 3 capacity.

Year 1

500 Projects

Single city — low volume, validate the product

  • 1x Fly.io API machine (shared-cpu-2x, 512MB)
  • 1x Fly.io worker machine (shared-cpu-1x)
  • Neon free tier (0.5 vCPU, 1 GB)
  • Upstash Redis free tier (10K commands/day)
  • Estimated cost: ~$30/month
Year 2

5,000 Projects

Multi-city expansion — 10x growth

  • 2–3x Fly.io API machines (auto-scale on CPU)
  • 2x Fly.io worker machines (more parallel pollers)
  • Neon Launch ($19/month — 10 GB)
  • Upstash Redis Pay-as-you-go
  • PgBouncer connection pooling (Neon built-in)
  • Estimated cost: ~$200/month
Year 3

50,000 Projects

National scale — 100x from Year 1

  • Auto-scaling Fly.io pool (4–10 machines)
  • Neon Scale tier ($69/month + read replicas)
  • Read replica for analytics queries
  • Cloudflare R2 bandwidth savings at scale
  • Cache hit rate target: >90% for health scores
  • Estimated cost: ~$800–1,200/month

Database Scaling Path

The most important scaling risk is the database. The strategy is to exhaust Postgres vertical scaling before introducing horizontal complexity.

StageActionWhen to trigger
1Add missing indexes; review slow query logp95 query time > 200ms
2Enable PgBouncer connection pooling (Neon built-in)>50 concurrent connections
3Add Redis caching for expensive derived queriesSame query hit rate > 10x/minute
4Vertical scale Neon tier (more vCPU, RAM)CPU > 70% sustained
5Add read replica; route analytics to replicaRead/write ratio > 90%:10%
6Partition audit_events by month (range partitioning)Table > 50M rows
7Consider sharding by city/region (unlikely before 500K projects)>100M total rows, write contention

API Scaling: Stateless by Design

The API server is stateless — all session state is in Redis, all data is in Postgres. Adding a second or tenth API machine requires zero coordination. Fly.io's built-in load balancer distributes traffic across instances. The SSE stream connections are sticky (Fly.io session affinity) to avoid the complexity of cross-instance event broadcasting.

SSE Scaling Note With session affinity, each API instance maintains its own set of LISTEN/NOTIFY connections. If a user's SSE connection lands on Instance A, all events for their projects are received by Instance A's Postgres LISTEN connection and forwarded. This scales linearly — each new API machine adds capacity for more concurrent SSE connections. At very high concurrency (>1000 projects per instance), SSE connections can be moved to a dedicated machine.

Cache Strategy: TTLs and Invalidation

Cache invalidation is the root of many production bugs. The strategy here is explicit: cache entries are invalidated on write, not relying on TTL expiry for correctness. TTLs are a safety net, not the primary invalidation mechanism.

Cache KeyTTLInvalidated by
session:{id}24hLogout, password change, admin revocation
health:{project_id}2hAny milestone/change-order write to that project
permit:{project_id}30mPermit poller completing successfully
r2_url:{document_id}55mNever — URLs naturally expire at 60m
rate:{ip}:{endpoint}60s slidingTTL only (intended to expire)

9 Observability

A system that cannot be debugged in production is incomplete. Observability is built in from day one. Every request, every background job, and every integration failure is traceable.

Structured Logging

All application logs are written to stdout as newline-delimited JSON (NDJSON) using Pino. Log levels: error, warn, info, debug. Production defaults to info. Fly.io captures stdout and streams to a log drain.

// Every log entry includes standard fields
{
  "level":      "info",
  "time":       1744000000000,
  "pid":        1234,
  "requestId":  "req_01J...",   // Ulid — unique per request
  "userId":     "usr_01J...",   // null if unauthenticated
  "projectId":  "proj_01J...",  // null if not project-scoped
  "method":     "PATCH",
  "path":       "/projects/abc/milestones/xyz",
  "status":     200,
  "latencyMs":  42,
  "msg":        "milestone updated"
}
PII in logs is prohibited Log entries must never contain: email addresses, phone numbers, home addresses, license numbers, payment amounts, document names, or any field from the PII inventory in Section 7. Sensitive fields are redacted at the logging middleware layer before the JSON is serialized.

Key Metrics

API Metrics (emitted to Fly.io)

MetricTypeAlert Threshold
http_request_duration_msHistogramp95 > 500ms
http_error_rateCounter> 1% 5xx in 5 min
active_sse_connectionsGauge> 200 per instance
auth_failures_per_minuteCounter> 50 failures/min

Business Metrics (Postgres views)

MetricQuery
Active projectsCOUNT WHERE status='active'
Pending change ordersCOUNT WHERE status='pending'
Job queue depthCOUNT WHERE status='pending'
Failed jobs (24h)COUNT WHERE status='failed'

Error Tracking: Sentry

Sentry captures unhandled exceptions and explicit error calls in both the frontend and backend. Configuration:

Uptime Monitoring

UptimeRobot (free tier): HTTP monitors on GET /health (API) and the Netlify frontend. 5-minute check interval. Alerts via email + Slack.

The GET /health endpoint performs a lightweight check:

// Health check — fast, meaningful
GET /health  {
  "status": "ok",
  "db":     "ok",   // SELECT 1 against Neon
  "redis":  "ok",   // PING against Upstash
  "uptime": 86400  // seconds since last restart
}

On-Call Runbooks

Each Sentry alert links to a runbook in /docs/runbooks/. Every recurring alert must have a runbook before the system launches. Runbooks include: alert description, probable causes, diagnostic queries, resolution steps, and escalation path. This is non-negotiable — alerts without runbooks create panic, not resolution.

10 Architecture Decision Records

Each decision below documents context, the decision taken, and the trade-offs accepted. These are living records — if a decision is revisited, the ADR is updated with the date and reason.

ADR-01

SvelteKit over Next.js

Accepted   April 2026
Decision: Use SvelteKit as the frontend framework, deployed on Netlify.

Groundwork is a two-sided application — a marketing site that needs SEO, and a complex multi-view app with real-time updates. SvelteKit handles both in a single framework via per-route rendering strategies. Svelte compiles to vanilla JS (no virtual DOM), producing smaller bundles and faster runtime performance. The team prefers Svelte's developer experience and signal-based reactivity.

Next.js (React): Larger ecosystem, more hiring options, more examples online. Chosen against because the React bundle overhead (~40KB min+gzip) matters for the LCP target, and the team's Svelte familiarity removes the ecosystem advantage for this project size.

Nuxt (Vue): Similar bundle size advantages to Svelte. Rejected because team preference is Svelte and Vue's composition API is less ergonomic for complex stores than Svelte's writable stores.

ADR-02

PostgreSQL over MongoDB

Accepted   April 2026
Decision: PostgreSQL (Neon managed) as the sole primary data store.

Groundwork's core entities are strongly relational: a Project has many Milestones; a Milestone belongs to a Project and optionally to a ChangeOrder; a Payment is tied to a Milestone; an AuditEvent references a Project, an Actor, and an Entity. These relationships benefit from foreign key constraints, JOIN performance, and transactional guarantees that PostgreSQL provides natively. The audit log and financial records legally require ACID compliance.

MongoDB: Flexible schema would help with the fragmented permit.raw_response field. This is handled by storing raw permit responses in a JSONB column — PostgreSQL's JSONB gives the flexibility of document storage where needed, without sacrificing relational integrity on core entities.

PlanetScale (MySQL): Rejected because Postgres's LISTEN/NOTIFY and SKIP LOCKED patterns are core to the architecture. MySQL has no equivalent without a separate broker.

ADR-03

Session Cookies over JWT

Accepted   April 2026
Decision: Server-side sessions stored in Postgres + Redis. Session ID transmitted as HttpOnly Secure cookie. No JWT.

Sessions can be immediately invalidated on logout, on suspicious activity detection, or by admin action — this is essential for a platform handling home renovation finances. JWTs cannot be invalidated before their expiry without building a token blacklist (which is, ironically, a server-side session store). The HttpOnly cookie attribute prevents XSS from stealing the session token — a crucial control given that contractor-submitted content might include uploaded documents with embedded scripts.

JWT: Stateless, no DB lookup on every request. Rejected because: (1) instant revocation is not optional for this use case; (2) JWTs stored in localStorage are XSS-vulnerable; (3) the Redis lookup we use for session validation is <1ms — the "no DB lookup" argument is moot at this scale.

Accepted cost: The SSE endpoint requires server affinity (sticky sessions) in Year 3 multi-instance deployment. This is managed via Fly.io's session affinity header — not an architectural constraint.

ADR-04

Postgres SKIP LOCKED over Redis for Job Queue

Accepted   April 2026
Decision: Use the Postgres SKIP LOCKED pattern for the background job queue. No Redis queue (BullMQ, Sidekiq, etc.).

Groundwork's job throughput is low — a few dozen jobs per hour at Year 1, a few hundred per hour at Year 3. A Postgres-backed queue is sufficient for this volume and eliminates an entire infrastructure dependency. Jobs survive database restarts (unlike in-memory Redis queues). The SKIP LOCKED pattern is well-documented, safe, and widely used in production systems at similar scales (Sidekiq used this before Redis was common).

BullMQ (Redis): Excellent DX, rich feature set, handles high throughput. Rejected because it requires Redis — which we already use for sessions and caching but at minimal volume. Adding a second Redis-dependent workload increases infrastructure coupling. If we already needed high-throughput Redis queuing, BullMQ would win.

Scale threshold: If job volume exceeds 10,000/hour consistently, or if job priority queuing becomes complex, migrate to BullMQ. The job schema is designed so this migration can be done without changing the worker interface.

ADR-05

REST over GraphQL

Accepted   April 2026
Decision: REST API with typed JSON responses. No GraphQL layer.

Groundwork's access patterns are well-defined and mostly CRUD: get project, get milestones for project, update milestone status. There are no deeply nested cross-entity queries from the frontend that would benefit from GraphQL's selectivity. The frontend is built alongside the API — there is no third-party consumer or mobile app requiring custom field selection. REST is simpler to cache (standard HTTP caching semantics), simpler to rate limit (per-endpoint), and simpler to audit-log.

GraphQL: Strong type system with schema-first development, excellent DX for complex nested queries. For Groundwork, the N+1 problem (a known GraphQL pitfall) would require DataLoader, adding complexity. The over-fetching problem that GraphQL solves is minimal here — we control both ends of the wire. If Groundwork ever builds a public API for third-party integrators, GraphQL becomes more compelling.

tRPC: Type-safe end-to-end without code generation is appealing for a TypeScript monorepo. Rejected because the SvelteKit frontend and Hono backend are in separate repositories (Netlify vs. Fly.io deploy). tRPC's main advantage — shared types without code generation — requires a monorepo to be ergonomic.

ADR-06

Neon over Supabase

Accepted   April 2026
Decision: Neon as the managed PostgreSQL provider. Supabase not used.

Neon is a pure Postgres provider — it does not bundle opinions about auth, storage, or realtime. Groundwork has intentional choices in each of these areas (session cookies for auth, R2 for storage, LISTEN/NOTIFY for realtime) that would conflict with Supabase's bundled solutions. Neon's serverless architecture with auto-suspend is cost-effective for Year 1's low traffic. Neon supports branching — a useful feature for staging environments that share production schema.

Supabase: Would provide auth, storage (S3-compatible), and realtime out of the box — reducing integration work. Rejected because bundling these services into one platform creates vendor lock-in across multiple concerns. Groundwork's choices for auth and storage are intentional (see ADR-03, and R2 rationale in ADR-08) and Supabase's alternatives would require either accepting Supabase's approach or maintaining parallel implementations.

RDS (AWS): Battle-tested, no serverless cold starts, fine-grained IAM. Rejected as operationally heavyweight for Year 1. Migrate to RDS if Neon's connection limits or performance become a constraint above 50K projects.

ADR-07

Fly.io over Serverless (Lambda/Vercel Functions)

Accepted   April 2026
Decision: Deploy API and workers as persistent VMs on Fly.io. No serverless functions for the API.

The SSE real-time endpoint and the background polling workers require persistent connections — a cold-started Lambda function cannot maintain an open SSE stream or hold a LISTEN/NOTIFY connection. Fly.io VMs start in ~300ms and remain running, allowing long-lived connections. Workers need to run on a schedule and maintain state across invocations (e.g., tracking which projects were last polled). Serverless would require external state coordination (DynamoDB, etc.) for this, eliminating the simplicity argument.

Serverless (Vercel/Lambda): Zero-ops scaling, pay-per-request pricing. Rejected because: (1) SSE requires persistent connections; (2) background polling workers need to run continuously, not on-demand; (3) cold start latency (100–500ms) hurts the sub-80ms p50 API target; (4) per-execution pricing becomes expensive at sustained load vs. always-on Fly.io VMs.

Accepted cost: We pay for VM uptime even during off-peak hours. At Year 1 traffic, this is ~$15–20/month — an acceptable cost for the architectural simplicity gained.

ADR-08

Cloudflare R2 over AWS S3

Accepted   April 2026
Decision: Cloudflare R2 for object storage (documents, photos). S3 not used.

Groundwork's documents (contracts, permit applications, inspection photos) will be downloaded regularly by both homeowners and contractors. AWS S3 charges $0.09/GB for egress — on a platform where a single project might have 50–100 MB of documents viewed many times, egress costs compound quickly. Cloudflare R2 has zero egress fees. R2 is S3-compatible (same SDK, same presigned URL patterns), so the migration risk to S3 in the future is low. Documents are served via Cloudflare's global CDN, which also improves download speeds globally.

AWS S3: More mature, more third-party integrations, AWS Lambda triggers for post-upload processing. Rejected primarily on egress cost grounds. At Year 3 (50K projects × 75 MB average documents × frequent downloads), egress costs on S3 could be $500–2,000/month. R2 makes that $0.

Tigris on Fly.io: S3-compatible object storage that runs on Fly.io infrastructure — attractive because it collocates storage with the API. Considered seriously. R2 chosen over Tigris because Cloudflare's WAF and CDN are already in the stack; R2 integrates more tightly with Cloudflare's edge network for global download performance. Revisit if Tigris matures significantly.

Decision Summary

#DecisionChosenPrimary Reason
01Frontend frameworkSvelteKitBundle size, team preference, SSR+SPA hybrid
02Primary databasePostgreSQL (Neon)Relational model, ACID, LISTEN/NOTIFY, SKIP LOCKED
03AuthenticationSession cookiesInstant revocation, XSS-safe HttpOnly cookie
04Job queuePostgres SKIP LOCKEDNo extra dependency, durable, sufficient throughput
05API styleRESTSimple CRUD patterns, no N+1 risk, cacheable
06Postgres hostNeonPure Postgres, no bundled opinion, branching
07ComputeFly.io VMsPersistent connections required (SSE, workers)
08Object storageCloudflare R2Zero egress fees, CDN integration

Implementation Starting Point An engineering team can begin from this document. Recommended sequence: (1) Bootstrap the Hono API with auth middleware and session handling. (2) Create the database schema with RLS policies. (3) Build the SvelteKit app shell with route structure and SSE subscription. (4) Implement the job queue worker runner. (5) Build feature areas in milestone order per the BRD. Each step has sufficient specification above to begin without additional architecture decisions.