Technical Architecture

Groundwork — Shared Reality Platform for Home Renovation Transparency

Version: 1.0 (Draft)

Date: April 2026

Scope: MVP through Year 3 scale

Status: Engineering Ready

Requirements: 72 functional requirements / 10 feature areas

Purpose: This document proves Groundwork is buildable. It provides enough specificity that an engineering team can begin implementation immediately. Every architectural decision is justified, with its trade-offs documented. Security and performance are treated as first-class concerns and evaluated in every section.

0 Architecture Principles

These principles govern every technical decision in this document. When trade-offs arise, they serve as the tiebreaker. Any deviation from these principles requires an explicit ADR.

Single Responsibility

Every module, service, and function does one thing and does it well. The API layer does not contain business logic. Business logic does not know about the database schema.

Defense in Depth

Security controls exist at every layer: network, application, data, and audit. No single failure exposes sensitive data. Homeowner and contractor data are isolated at the database level, not just the application level.

Principle of Least Privilege

Every process, user role, and service account has only the permissions it needs and nothing more. Database roles are scoped per operation type. API tokens expire. Admin actions are logged.

Fail Fast, Recover Gracefully

Invalid input is rejected at the boundary with clear error messages. Integration failures (permit APIs, carrier tracking) fall back to cached state, never silently corrupt data or crash the application.

12-Factor App

Configuration lives in environment variables. Logs are streams to stdout. The app is stateless — any instance can handle any request. Dev/prod parity is maintained. Backing services are attached resources.

Explicit Over Implicit

No magic. No clever abstractions that hide what the system is doing. A junior developer reading any module should be able to trace a request from HTTP handler through business logic to database query in under 10 minutes.

Observability First

Structured logs, metrics, and error tracking are built in from day one. The system must be debuggable in production without attaching a debugger. Every background job emits completion metrics.

Evolutionary Architecture

Year 1 infrastructure must not over-engineer for Year 3 load. But Year 1 code must not make Year 3 scaling unnecessarily painful. Schema migrations, API versioning, and service boundaries are designed to evolve.

Layered Separation of Concerns

Every feature is implemented across clearly separated layers. No layer reaches across its boundary into another layer's responsibility.

Presentation Layer: SvelteKit components, stores, route handlers (rendering only, no business rules)

API Contract Layer: HTTP handlers, request validation (Zod), response serialization, auth middleware

Business Logic Layer: Domain models, rules (contract health scoring, change order approval), pure functions

Data Access Layer: Repository pattern (SQL queries, cache reads/writes, zero business logic)

Infrastructure Layer: PostgreSQL, R2 object storage, Upstash Redis, external API clients (all replaceable)
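
As an illustration of these layer boundaries, the following minimal sketch walks one operation (approving a change order) through three layers. All names here (approveChangeOrder, ChangeOrderRepository, handleApprove, the 404/409 mapping) are hypothetical and chosen for the example; the repository is synchronous and in-memory only to keep the sketch self-contained, whereas a SQL-backed implementation would be asynchronous.

```typescript
// Business logic layer: a pure rule with no knowledge of HTTP or SQL.
type ChangeOrderStatus = "pending" | "approved" | "rejected" | "withdrawn";

interface ChangeOrder {
  id: string;
  status: ChangeOrderStatus;
  costDelta: number;
}

function approveChangeOrder(order: ChangeOrder): ChangeOrder {
  if (order.status !== "pending") {
    throw new Error(`cannot approve a change order in status ${order.status}`);
  }
  return { ...order, status: "approved" };
}

// Data access layer: storage details stay behind this interface.
// (Synchronous for brevity; a SQL-backed version would return Promises.)
interface ChangeOrderRepository {
  findById(id: string): ChangeOrder | undefined;
  save(order: ChangeOrder): void;
}

// In-memory stand-in, useful for unit tests of the layers above it.
class InMemoryChangeOrderRepository implements ChangeOrderRepository {
  private readonly rows = new Map<string, ChangeOrder>();

  findById(id: string): ChangeOrder | undefined {
    return this.rows.get(id);
  }

  save(order: ChangeOrder): void {
    this.rows.set(order.id, order);
  }
}

// API contract layer: maps the rule's outcome onto an HTTP-style response.
interface ApiResponse {
  status: number;
  body: ChangeOrder | { error: string };
}

function handleApprove(repo: ChangeOrderRepository, id: string): ApiResponse {
  const order = repo.findById(id);
  if (order === undefined) return { status: 404, body: { error: "not_found" } };
  try {
    const approved = approveChangeOrder(order);
    repo.save(approved);
    return { status: 200, body: approved };
  } catch (err) {
    const message = err instanceof Error ? err.message : "unknown error";
    return { status: 409, body: { error: message } };
  }
}
```

Because the rule is a pure function and the repository is an interface, each layer can be tested without the others: the rule with plain objects, the handler with the in-memory repository.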

1 System Overview

Groundwork is a two-sided platform connecting homeowners and contractors within a shared project workspace. The architecture is a monolithic API (intentionally — see ADR-07) deployed on Fly.io, a SvelteKit frontend on Netlify, PostgreSQL on Neon for all relational data, Cloudflare R2 for document storage, and Fly.io workers for background integration polling.

Key Architectural Decisions Visible in This Diagram

Monolithic API, not microservices At 500–50,000 projects, a well-structured monolith is operationally simpler, easier to debug, and avoids distributed-transaction complexity. Service boundaries are enforced by module structure, not network calls.

Workers are separate processes Background polling workers run as separate Fly.io machines — isolated from the API so a slow permit scrape cannot degrade API response times. They communicate via the Postgres job queue, not in-process.

Multi-Tenant Isolation Every database query is constrained by Row Level Security policies in PostgreSQL. The application layer cannot accidentally return data from one project to another — the database itself enforces the boundary regardless of application bugs.

2 Frontend Architecture

Rendering Strategy

SvelteKit supports multiple rendering modes per route. Groundwork uses a deliberate split: SSR for public-facing pages (SEO and first-contentful-paint), and client-side SPA rendering for the authenticated application. This gives the landing page sub-1-second LCP without sacrificing the interactive richness of the project dashboard.

Public Routes — SSR (Server-Side Rendered)

/ — Landing page (SEO critical)
/pricing, /about — Marketing pages
/login, /signup — Auth entry points
/invite/[token] — Contractor invite landing

Rendered at request time on Netlify Edge Functions. HTML is indexable by search engines. Hydrated client-side after load.

App Routes — CSR (Client-Side SPA)

/dashboard — Project list
/projects/[id] — Project workspace
/projects/[id]/milestones
/projects/[id]/documents
/projects/[id]/payments
/projects/[id]/change-orders
/admin — Admin panel

JavaScript bundle loaded once. Route transitions are instant. Auth guard at layout level — unauthenticated users redirected to /login.

Component Architecture

Components follow the Container / Presentational pattern. No business logic in components. Data fetching happens in +page.ts load functions or via stores — never inline in component markup.

// Directory structure — feature-first, not type-first
src/
  lib/
    components/
      ui/               // Button, Badge, Modal, Toast — design system atoms
      project/          // ProjectCard, ProjectHeader, MilestoneTimeline
      change-order/     // ChangeOrderForm, ChangeOrderStatus
      payment/          // PaymentSchedule, PaymentRow
      document/         // DocumentUploader, DocumentList
      permit/           // PermitStatus, PermitTimeline
    stores/
      project.ts        // Writable<Project> — current project context
      notifications.ts  // real-time notification queue
      session.ts        // User session, role, permissions
    api/
      client.ts         // Typed fetch wrapper — all API calls go through here
      projects.ts       // Project CRUD functions
      milestones.ts
      documents.ts
    utils/
      format.ts         // Currency, date formatting — pure functions
      validate.ts       // Client-side validation schemas (Zod)
  routes/
    (public)/           // SSR group
    (app)/              // Auth-guarded CSR group

State Management

Svelte's built-in stores are sufficient for Groundwork's data model. No Zustand, Redux, or Pinia. The session store holds the current user. The project store holds the current project (populated by the route's load function). Notifications arrive via Server-Sent Events and push to the notifications store.

Real-Time Updates via Server-Sent Events (SSE): The authenticated app subscribes to GET /api/stream, an SSE endpoint. When a background worker detects a permit status change or a contractor marks a milestone complete, PostgreSQL LISTEN/NOTIFY triggers the API to push a typed event to all connected clients on that project. The notifications store updates, and Svelte's reactivity re-renders the relevant component. This avoids WebSocket complexity and any third-party realtime service.
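
To make the "typed event" concrete, here is a minimal sketch of how one event could be serialized into the SSE wire format and read back. The ProjectEvent shape and the event name are assumptions for illustration (the fields mirror the entity, id, and event keys of the NOTIFY payload described in the Data Architecture section); the parser is a simplified stand-in for what the browser's EventSource does.

```typescript
// Hypothetical event shape for illustration.
type ProjectEvent = {
  entity: "milestones" | "change_orders" | "permits";
  id: string;
  event: "INSERT" | "UPDATE";
};

// Serialize one event in SSE wire format: an "event:" line naming the event,
// a "data:" line with the JSON payload, and a blank line ending the message.
// JSON.stringify never emits raw newlines, so one data line is enough.
function formatSseMessage(eventName: string, payload: ProjectEvent): string {
  return `event: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Parse a single SSE message back into its parts (simplified).
function parseSseMessage(raw: string): { eventName: string; payload: ProjectEvent } {
  let eventName = "message"; // SSE default when no "event:" line is present
  const dataLines: string[] = [];
  for (const line of raw.split("\n")) {
    if (line.startsWith("event:")) {
      eventName = line.slice("event:".length).trim();
    } else if (line.startsWith("data:")) {
      dataLines.push(line.slice("data:".length).trim());
    }
  }
  return { eventName, payload: JSON.parse(dataLines.join("\n")) as ProjectEvent };
}
```

On the client, the parsed payload would be pushed into the notifications store; only the store write triggers re-rendering.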

Performance Budget

< 1.5s LCP (target)

< 100ms FID / INP

< 0.1 CLS

< 120kb JS Bundle (gzip)

Achieving the performance budget:

Svelte compiles components to plain JavaScript at build time, so there is no virtual DOM and only a minimal runtime. Landing page JS is typically 20–40kb gzipped.
Images served through Cloudflare's image resizing — correct size and format (WebP/AVIF) per viewport.
Fonts loaded via font-display: swap with preconnect hints. Only 2 font families, 3 weights each.
Route-level code splitting is automatic in SvelteKit — the project dashboard doesn't load the admin panel's code.
SvelteKit's preload strategy: link hover triggers prefetch of the next route's data.

Security Considerations — Frontend

CSRF protection: Session cookies use SameSite=Strict. All state-mutating requests include a CSRF token from the session, validated server-side.
Content Security Policy: Strict CSP header from Netlify — no inline scripts, no eval, allowed hosts enumerated explicitly.
Sensitive data: PII (addresses, contractor license numbers) is never stored in localStorage or sessionStorage — only in memory. Refresh clears it; the API re-fetches as needed.
Input sanitization: All text content rendered via Svelte's {} binding (not {@html}) — auto-escaped, XSS-safe by default.

3 Backend Architecture

Framework: Hono on Node.js (Fly.io)

The API is built with Hono, a lightweight TypeScript-first HTTP framework with excellent performance characteristics and middleware composability. It runs on Node.js on a Fly.io VM (not serverless — see ADR-07). The codebase is organized as a modular monolith: separate router files per domain, shared middleware, dependency injection via closure rather than a DI container.

Why REST over GraphQL — see ADR-05. Short version: Groundwork's data model is relational, access patterns are predictable CRUD, and GraphQL's N+1 risks and schema complexity add cost without benefit at this scale.

Key API Endpoints

Method	Path	Description	Auth	Rate Limit
POST	/auth/login	Create session with email + password	Public	5/min per IP
POST	/auth/logout	Invalidate session, clear cookie	Session	—
GET	/auth/me	Current user + role + permissions	Session	60/min
POST	/projects	Create new project (homeowner initiates)	Homeowner	10/hour
GET	/projects	List user's projects (both roles)	Session	60/min
GET	/projects/:id	Project detail + health score	Member	120/min
PATCH	/projects/:id	Update project metadata	Contractor	30/min
GET	/projects/:id/milestones	List milestones with status	Member	120/min
PATCH	/projects/:id/milestones/:mid	Mark milestone complete (triggers notification)	Contractor	30/min
POST	/projects/:id/change-orders	Submit change order for approval	Contractor	20/hour
PATCH	/projects/:id/change-orders/:cid	Approve or reject change order	Homeowner	30/min
GET	/projects/:id/documents	List documents with signed download URLs	Member	60/min
POST	/projects/:id/documents/upload-url	Generate signed R2 upload URL	Member	20/min
POST	/projects/:id/documents/confirm	Confirm upload complete, persist metadata	Member	20/min
GET	/projects/:id/payments	Payment schedule + status	Member	60/min
GET	/projects/:id/permits	Permit status (cached, refreshed by poller)	Member	60/min
GET	/projects/:id/health	Contract health score + breakdown	Member	60/min
GET	/api/stream	SSE stream for real-time project events	Session	5 connections/user
POST	/invites	Send contractor invite link	Homeowner	10/day
GET	/invites/:token	Validate invite token, return project info	Public	20/min per IP

Authentication: Session Cookies (not JWT)

See ADR-03 for the full rationale. Sessions are stored server-side in Postgres (with a Redis cache for fast lookup). The client receives only an opaque session ID in an HttpOnly; Secure; SameSite=Strict cookie. Sessions expire after 24 hours of inactivity and are invalidated immediately on logout.

-- Session table: stored in Postgres, cached in Redis
CREATE TABLE sessions (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  user_id     UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_seen   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at  TIMESTAMPTZ NOT NULL,
  ip_address  INET,
  user_agent  TEXT
);

-- Redis key: "session:{id}" → JSON(user_id, role, expires_at)
-- TTL: 24 hours, refreshed on each request
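
The inactivity rule can be sketched as a pair of pure functions. This is an illustration under assumptions, not the actual middleware: SessionRecord is a hypothetical in-memory view of a sessions row, and the clock is passed in so the logic stays testable.

```typescript
const INACTIVITY_MS = 24 * 60 * 60 * 1000; // 24 hours of allowed inactivity

// Hypothetical in-memory view of a row in the sessions table.
interface SessionRecord {
  id: string;
  userId: string;
  lastSeen: Date;
  expiresAt: Date;
}

// A session is usable only while its sliding expiry lies in the future.
function isSessionActive(session: SessionRecord, now: Date): boolean {
  return session.expiresAt.getTime() > now.getTime();
}

// On each authenticated request, push the expiry forward by the inactivity
// window. Returns a new record rather than mutating the input.
function touchSession(session: SessionRecord, now: Date): SessionRecord {
  return {
    ...session,
    lastSeen: now,
    expiresAt: new Date(now.getTime() + INACTIVITY_MS),
  };
}
```

The same refreshed expiry would be written to both the Postgres row and the Redis key TTL so the two stay in agreement.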

RBAC: Role-Based Access Control

Three roles: homeowner, contractor, admin. Role is stored on the project_members join table (not globally on the user) — a user can be a homeowner on one project and a contractor on another. The middleware resolves the caller's role within the requested project before routing.

Role resolution is per-project, not per-user: GET /projects/abc/milestones resolves two questions. Is this user a member of project abc? What is their role within that project? This prevents the class of bug where a contractor on Project B can see Project A's data by guessing a UUID.

Permission	Homeowner	Contractor	Admin
View project data	Yes	Yes	Yes
Create project	Yes	No	Yes
Invite contractor	Yes	No	Yes
Mark milestone complete	No	Yes	Yes
Submit change order	No	Yes	Yes
Approve change order	Yes	No	Yes
Upload documents	Yes	Yes	Yes
Delete documents	Own only	Own only	Yes
View admin panel	No	No	Yes
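
The permission matrix above translates directly into a lookup table. The sketch below is one possible encoding; the permission identifiers (for example "changeOrder.approve") and the can function are hypothetical names, while the yes/no/own values are copied from the table.

```typescript
type Role = "homeowner" | "contractor" | "admin";

// Hypothetical permission identifiers, one per row of the matrix.
type Permission =
  | "project.view"
  | "project.create"
  | "contractor.invite"
  | "milestone.complete"
  | "changeOrder.submit"
  | "changeOrder.approve"
  | "document.upload"
  | "document.delete"
  | "admin.view";

// "own" means the action is allowed only on resources the caller created.
type Rule = "yes" | "no" | "own";

const MATRIX: Record<Permission, Record<Role, Rule>> = {
  "project.view":        { homeowner: "yes", contractor: "yes", admin: "yes" },
  "project.create":      { homeowner: "yes", contractor: "no",  admin: "yes" },
  "contractor.invite":   { homeowner: "yes", contractor: "no",  admin: "yes" },
  "milestone.complete":  { homeowner: "no",  contractor: "yes", admin: "yes" },
  "changeOrder.submit":  { homeowner: "no",  contractor: "yes", admin: "yes" },
  "changeOrder.approve": { homeowner: "yes", contractor: "no",  admin: "yes" },
  "document.upload":     { homeowner: "yes", contractor: "yes", admin: "yes" },
  "document.delete":     { homeowner: "own", contractor: "own", admin: "yes" },
  "admin.view":          { homeowner: "no",  contractor: "no",  admin: "yes" },
};

// role is the caller's role within the requested project, as resolved by
// the middleware; ownsResource only matters for "own" rules.
function can(role: Role, permission: Permission, ownsResource = false): boolean {
  const rule = MATRIX[permission][role];
  return rule === "yes" || (rule === "own" && ownsResource);
}
```

Keeping the matrix as data rather than scattered if-statements means a change to the table is a one-line diff that is easy to review against this document.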

Rate Limiting Strategy

Rate limits are enforced at the API middleware layer using a sliding window counter stored in Redis (Upstash). Three tiers:

Global per-IP: 300 req/min for all authenticated requests. Blocks volumetric abuse before hitting business logic.
Endpoint-specific: Tighter limits on destructive or expensive endpoints (see table above). Authentication endpoints are the most restrictive — 5 attempts/min/IP with exponential backoff.
Per-user: 60 req/min across all authenticated requests for a given user ID, regardless of IP. Prevents credential-stuffed accounts from being used as scrapers.
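
For readers unfamiliar with the technique, here is a minimal in-memory sketch of a sliding window counter. It is an illustration only: the class name and method are hypothetical, state lives in a Map rather than Redis, and a production version would need the read-and-increment to be atomic across API instances. The estimate weights the previous fixed window by how much of it still overlaps the sliding window.

```typescript
interface WindowState {
  windowStart: number; // start of the current fixed window (ms)
  current: number;     // requests counted in the current window
  previous: number;    // requests counted in the window just before it
}

class SlidingWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly state = new Map<string, WindowState>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and counts the request if the caller is under the limit.
  allow(key: string, nowMs: number): boolean {
    const windowStart = Math.floor(nowMs / this.windowMs) * this.windowMs;
    const prior = this.state.get(key);
    let s: WindowState;
    if (prior === undefined) {
      s = { windowStart, current: 0, previous: 0 };
    } else if (prior.windowStart === windowStart) {
      s = prior;
    } else {
      // Roll forward; the old count only matters if it is the adjacent window.
      const adjacent = windowStart - prior.windowStart === this.windowMs;
      s = { windowStart, current: 0, previous: adjacent ? prior.current : 0 };
    }
    // Fraction of the current window that has elapsed, in [0, 1).
    const elapsed = (nowMs - windowStart) / this.windowMs;
    const estimate = s.previous * (1 - elapsed) + s.current;
    const allowed = estimate < this.limit;
    if (allowed) s.current += 1;
    this.state.set(key, s);
    return allowed;
  }
}
```

A login limiter matching the table would be constructed as new SlidingWindowLimiter(5, 60000) and keyed by IP address. Compared with a plain fixed window, this avoids the burst of up to twice the limit that can occur across a window boundary.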

Performance (API response time targets): p50 < 80ms, p95 < 300ms, p99 < 800ms for all non-streaming endpoints. The SSE stream endpoint has no latency target but must not block the Node.js event loop. CPU-intensive health score computation runs in a worker process, not in the API request path.

4 Data Architecture

Primary Schema: PostgreSQL (Neon)

All relational data lives in a single PostgreSQL database hosted on Neon. The schema below covers the 7 primary entities plus supporting tables. Row Level Security is enabled on all tables containing project data.

-- ──────────────────────────────────────────────────
-- USERS
-- ──────────────────────────────────────────────────
CREATE TABLE users (
  id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  email               TEXT NOT NULL UNIQUE,
  display_name        TEXT NOT NULL,
  password_hash       TEXT NOT NULL,        -- bcrypt, cost=12
  phone               TEXT,                 -- nullable; not required at signup
  avatar_key          TEXT,                 -- R2 object key
  license_number      TEXT,                 -- contractors only
  license_state       CHAR(2),              -- two-letter state code
  license_verified_at TIMESTAMPTZ,
  created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_login          TIMESTAMPTZ
);
CREATE INDEX idx_users_email ON users(email);

-- ──────────────────────────────────────────────────
-- PROJECTS
-- ──────────────────────────────────────────────────
CREATE TABLE projects (
  id                 UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name               TEXT NOT NULL,
  address            TEXT NOT NULL,         -- encrypted at rest (pgcrypto)
  city               TEXT NOT NULL,
  state              CHAR(2) NOT NULL,
  zip                TEXT NOT NULL,
  status             TEXT NOT NULL DEFAULT 'active',  -- active|paused|completed|disputed
  contract_total     NUMERIC(12,2),
  start_date         DATE,
  end_date           DATE,
  health_score       SMALLINT,              -- 0-100, recomputed by worker
  health_computed_at TIMESTAMPTZ,
  created_at         TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at         TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_projects_status ON projects(status);

-- ──────────────────────────────────────────────────
-- PROJECT MEMBERS (multi-tenant pivot)
-- ──────────────────────────────────────────────────
CREATE TABLE project_members (
  project_id UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  user_id    UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE,
  role       TEXT NOT NULL,                 -- 'homeowner' | 'contractor'
  joined_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  PRIMARY KEY (project_id, user_id)
);
CREATE INDEX idx_project_members_user ON project_members(user_id);

-- ──────────────────────────────────────────────────
-- MILESTONES
-- ──────────────────────────────────────────────────
CREATE TABLE milestones (
  id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id     UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  title          TEXT NOT NULL,
  description    TEXT,
  status         TEXT NOT NULL DEFAULT 'pending',  -- pending|in_progress|complete|disputed
  due_date       DATE,
  completed_at   TIMESTAMPTZ,
  completed_by   UUID REFERENCES users(id),
  payment_amount NUMERIC(12,2),             -- portion of contract due at completion
  sort_order     SMALLINT NOT NULL DEFAULT 0,
  created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_milestones_project ON milestones(project_id, sort_order);

-- ──────────────────────────────────────────────────
-- CHANGE ORDERS
-- ──────────────────────────────────────────────────
CREATE TABLE change_orders (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id   UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  submitted_by UUID NOT NULL REFERENCES users(id),
  title        TEXT NOT NULL,
  description  TEXT NOT NULL,
  cost_delta   NUMERIC(12,2) NOT NULL,      -- positive = increase, negative = credit
  time_delta   INTERVAL,                    -- schedule impact
  status       TEXT NOT NULL DEFAULT 'pending',  -- pending|approved|rejected|withdrawn
  reviewed_by  UUID REFERENCES users(id),
  reviewed_at  TIMESTAMPTZ,
  review_note  TEXT,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- ──────────────────────────────────────────────────
-- DOCUMENTS
-- ──────────────────────────────────────────────────
CREATE TABLE documents (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id  UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  uploaded_by UUID NOT NULL REFERENCES users(id),
  filename    TEXT NOT NULL,
  mime_type   TEXT NOT NULL,
  size_bytes  BIGINT NOT NULL,
  r2_key      TEXT NOT NULL UNIQUE,         -- R2 object key; never exposed to client
  category    TEXT,                         -- contract|permit|invoice|photo|other
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_documents_project ON documents(project_id, created_at DESC);

-- ──────────────────────────────────────────────────
-- PERMITS
-- ──────────────────────────────────────────────────
CREATE TABLE permits (
  id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  project_id    UUID NOT NULL REFERENCES projects(id) ON DELETE CASCADE,
  permit_number TEXT,
  permit_type   TEXT NOT NULL,              -- building|electrical|plumbing|mechanical
  status        TEXT NOT NULL,              -- applied|issued|inspected|final|expired
  issued_date   DATE,
  expiry_date   DATE,
  city_api_id   TEXT,                       -- external ID in city's system
  last_polled   TIMESTAMPTZ,
  raw_response  JSONB,                      -- normalized later, raw kept for debugging
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- ──────────────────────────────────────────────────
-- AUDIT LOG (append-only, never updated)
-- ──────────────────────────────────────────────────
CREATE TABLE audit_events (
  id          BIGSERIAL PRIMARY KEY,
  project_id  UUID REFERENCES projects(id) ON DELETE SET NULL,
  actor_id    UUID REFERENCES users(id) ON DELETE SET NULL,
  event_type  TEXT NOT NULL,
  entity_type TEXT NOT NULL,
  entity_id   UUID,
  payload     JSONB NOT NULL DEFAULT '{}',
  ip_address  INET,
  occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_audit_project ON audit_events(project_id, occurred_at DESC);
CREATE INDEX idx_audit_actor   ON audit_events(actor_id, occurred_at DESC);

-- ──────────────────────────────────────────────────
-- JOB QUEUE (Postgres SKIP LOCKED pattern)
-- ──────────────────────────────────────────────────
CREATE TABLE job_queue (
  id           BIGSERIAL PRIMARY KEY,
  job_type     TEXT NOT NULL,
  payload      JSONB NOT NULL,
  status       TEXT NOT NULL DEFAULT 'pending',
  attempts     SMALLINT NOT NULL DEFAULT 0,
  max_attempts SMALLINT NOT NULL DEFAULT 3,
  run_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  locked_at    TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  error        TEXT,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_job_queue_ready ON job_queue(run_at) WHERE status = 'pending';

Caching Strategy: When to Use Redis vs. Postgres

Not everything needs to be cached. Cache when the data is read far more often than it changes, when computing it is expensive, or when serving slightly stale data is acceptable. Do not cache when correctness is critical (payment amounts, audit events) or when the data can be served by a single fast Postgres query.

Data	Cache?	TTL	Reason
Contract health score	Redis	1 hour	Expensive to compute; acceptable if slightly stale
Permit status	Redis	30 min	Polled externally; DB write + cache invalidate on update
User session	Redis	24 hours	Hit on every request, so lookup latency must be minimal
Rate limit counters	Redis	1 min	Requires atomic increment; Redis INCR is atomic
Project list	Postgres	—	Small result set; query is <10ms with index
Milestone list	Postgres	—	Must always reflect latest contractor updates
Payment amounts	Never	—	Financial data — must be authoritative from DB
Audit log	Never	—	Legal record — no caching ever
Signed R2 URLs	Redis	55 min	Signed URLs expire at 60 min; cache saves regeneration
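
The Redis rows in this table all follow the same cache-aside pattern with a TTL. A minimal in-memory sketch of that pattern follows; the TtlCache class and the key format are hypothetical, and a Map stands in for Redis so the example is self-contained.

```typescript
// Cache-aside with a per-entry TTL. In production the Map would be Redis;
// the clock is passed in explicitly so expiry is easy to test.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  // Return the cached value if still fresh; otherwise compute, store, return.
  getOrCompute(key: string, ttlMs: number, nowMs: number, compute: () => V): V {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > nowMs) return hit.value;
    const value = compute();
    this.entries.set(key, { value, expiresAt: nowMs + ttlMs });
    return value;
  }

  // Called on writes that make the cached value wrong (e.g. a permit update).
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

For the health score row, a call would look like cache.getOrCompute("health:" + projectId, 60 * 60 * 1000, Date.now(), loadScore), where loadScore is whatever reads the authoritative value.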

Real-Time: PostgreSQL LISTEN/NOTIFY

PostgreSQL's built-in pub/sub mechanism handles real-time project updates without a separate WebSocket server or message broker. A Postgres trigger fires on writes to milestones, change_orders, and permits — emitting a notification on a channel named project:{project_id}. The API server maintains a single persistent Postgres connection per project that has active SSE subscribers, forwarding notifications to those clients.

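-- Shared trigger function: publishes a change event on the project's channel
CREATE OR REPLACE FUNCTION notify_project_change()
RETURNS TRIGGER AS $$
BEGIN
  PERFORM pg_notify(
    'project:' || NEW.project_id::text,
    json_build_object(
      'entity', TG_TABLE_NAME,
      'id',     NEW.id,
      'event',  TG_OP
    )::text
  );
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- One trigger per table that emits project events
CREATE TRIGGER milestone_notify
AFTER INSERT OR UPDATE ON milestones
FOR EACH ROW EXECUTE FUNCTION notify_project_change();

CREATE TRIGGER change_order_notify
AFTER INSERT OR UPDATE ON change_orders
FOR EACH ROW EXECUTE FUNCTION notify_project_change();

CREATE TRIGGER permit_notify
AFTER INSERT OR UPDATE ON permits
FOR EACH ROW EXECUTE FUNCTION notify_project_change();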

Why not WebSockets? Server-Sent Events fed by LISTEN/NOTIFY are sufficient for Groundwork's real-time needs. Events are low-frequency (a milestone update, a permit status change), not high-frequency streaming data. SSE is one-directional, simpler to implement, proxy-safe, and auto-reconnecting in browsers. WebSockets add connection management complexity with no benefit here.

Scale limit: LISTEN/NOTIFY works well up to ~100 concurrent project subscriptions per API instance. At Year 3 scale, if concurrent subscriptions exceed this, the API auto-scales horizontally (each instance holds its own LISTEN connections). No architectural change is needed.

Document Storage: Cloudflare R2

Documents (contracts, permits, invoices, photos) are stored in Cloudflare R2. The flow is a two-step upload: the API generates a presigned PUT URL (valid 15 minutes), the client uploads directly to R2 (bypassing the API entirely), then the client POSTs a confirmation to the API which persists the metadata. This keeps large binaries off the application server.

Security: Documents are never publicly accessible. The R2 bucket is private. Download links are presigned GET URLs generated server-side and valid for 60 minutes. A presigned URL works for anyone who holds it until it expires, so RBAC is checked against the requesting user's session before any signed URL is generated. The R2 object key is opaque (UUID-based), so guessing another project's document key is computationally infeasible.

5 Integration Architecture

Integration Philosophy: External integrations are the highest-risk part of the system. They are outside our control, poorly documented, frequently down, and occasionally change without notice. Every integration is wrapped in an adapter with a defined failure mode. The system must degrade gracefully: a permit API being down never prevents a homeowner from viewing their project.

City Permit APIs

This is the most technically challenging integration. There is no standard API — every municipality has its own portal, data model, and access method. Groundwork's Year 1 target (one city) allows for a focused adapter. Year 2 requires a pluggable adapter pattern.

Adapter Pattern

Each city permit source implements a common PermitAdapter interface:

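interface PermitAdapter {
  // Look up all permits recorded for a street address.
  fetchByAddress(address: string, city: string, state: string): Promise<PermitRecord[]>;

  // Look up a single permit by its permit number.
  fetchByPermitNumber(id: string): Promise<PermitRecord>;

  // Map a city-specific status string onto Groundwork's PermitStatus values.
  normalizeStatus(raw: string): PermitStatus;
}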

Failure Modes & Fallbacks

API timeout (>10s): Return cached status from last successful poll. Surface staleness timestamp to user ("Last updated 3 hours ago").
HTTP 4xx/5xx: Increment error counter. After 3 consecutive failures, mark permit as "Status Unavailable" and alert admin via Sentry.
Schema change: Normalization layer logs unknown status values, maps to "Unknown" — does not throw. Alert is raised for manual review.
City changes access method: Adapter is swapped without changing the caller. Raw responses stored in permits.raw_response JSONB for re-parsing.
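
Two of these failure modes, the non-throwing normalization and the three-strike counter, can be sketched as small pure functions. The raw status strings in the lookup table are invented for illustration (each real city portal has its own vocabulary), and the function names are hypothetical; the PermitStatus values come from the permits.status column plus the "unknown" fallback described above.

```typescript
type PermitStatus = "applied" | "issued" | "inspected" | "final" | "expired" | "unknown";

// Hypothetical raw values. A Map is used instead of an object literal so
// inputs like "constructor" cannot match inherited properties.
const STATUS_MAP = new Map<string, PermitStatus>([
  ["application received", "applied"],
  ["issued", "issued"],
  ["inspection passed", "inspected"],
  ["finaled", "final"],
  ["expired", "expired"],
]);

// Never throws: unrecognized values are reported and mapped to "unknown".
function normalizeStatus(
  raw: string,
  onUnknown: (raw: string) => void = () => {},
): PermitStatus {
  const mapped = STATUS_MAP.get(raw.trim().toLowerCase());
  if (mapped === undefined) {
    onUnknown(raw); // hook for logging and raising the manual-review alert
    return "unknown";
  }
  return mapped;
}

const FAILURE_THRESHOLD = 3;

// Consecutive-failure counter: any success resets it to zero.
function nextFailureCount(previous: number, pollSucceeded: boolean): number {
  return pollSucceeded ? 0 : previous + 1;
}

// True once the permit should be shown as "Status Unavailable".
function isStatusUnavailable(consecutiveFailures: number): boolean {
  return consecutiveFailures >= FAILURE_THRESHOLD;
}
```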

Delivery Tracking

Carrier APIs (UPS, FedEx, USPS) provide REST APIs with reasonable documentation. Groundwork tracks deliveries linked to project milestones (e.g., "lumber delivery" before "framing" milestone).

Carrier	Method	Auth	Rate Limit	Failure Mode
UPS	REST API	OAuth 2.0	40 req/day (free)	Cache last known status; retry 1h later
FedEx	REST API	OAuth 2.0	5000 req/day	Cache last known status; retry 1h later
USPS	REST API (v3)	OAuth 2.0	500 req/day	Cache last known status; retry 4h later

Carrier detection: Tracking numbers are routed to the correct carrier by pattern matching (UPS starts with "1Z", FedEx is 12 or 15 digits, USPS follows USPS formats). If the pattern is ambiguous, the system tries carriers in priority order and returns the first successful result.
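
A minimal sketch of the routing step is shown below. The regular expressions are assumptions: the text only states the "1Z" prefix and the FedEx digit counts, so the 18-character UPS length and both USPS patterns are illustrative guesses that would need checking against each carrier's published formats. The function names are hypothetical.

```typescript
type Carrier = "ups" | "fedex" | "usps";

// Listed in priority order. Patterns are illustrative assumptions.
const PATTERNS: ReadonlyArray<{ carrier: Carrier; pattern: RegExp }> = [
  { carrier: "ups", pattern: /^1Z[0-9A-Z]{16}$/ },
  { carrier: "fedex", pattern: /^(\d{12}|\d{15})$/ },
  { carrier: "usps", pattern: /^(\d{20,22}|[A-Z]{2}\d{9}US)$/ },
];

// Strip spaces and dashes, then upper-case, before matching.
function normalizeTrackingNumber(input: string): string {
  return input.replace(/[\s-]/g, "").toUpperCase();
}

// Carriers to query, in priority order. If no pattern matches, the number is
// treated as ambiguous and every carrier is tried in turn.
function carriersToTry(trackingNumber: string): Carrier[] {
  const normalized = normalizeTrackingNumber(trackingNumber);
  const matches = PATTERNS.filter((p) => p.pattern.test(normalized)).map((p) => p.carrier);
  return matches.length > 0 ? matches : PATTERNS.map((p) => p.carrier);
}
```

The caller would iterate over the returned list and stop at the first carrier API that recognizes the number.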

License Verification

Contractor license verification is the most fragile integration. Most state databases offer no API — they require either screen scraping of a public search form or access to a third-party aggregator.

Tiered Verification Approach

Tier 1 — Official API: A small number of states (CA, TX, FL, NY) have machine-readable license lookup APIs. Use these directly.
Tier 2 — Structured scrape: States with consistent HTML structures get a Playwright-based scraper running in a Fly.io worker. Results cached 7 days.
Tier 3 — Third-party aggregator: For states where scraping is unreliable, use a service like CSLB API or Contractor Check as a paid fallback.
Tier 4 — Manual review queue: If all automated methods fail, flag the contractor for manual admin verification. Platform remains usable; verification shown as "Pending".

Security considerations

License numbers are PII — they are stored encrypted in users.license_number (column-level encryption via pgcrypto). Only admin role can view plaintext. The verification result (verified/not verified/pending) is stored as a separate non-encrypted field so it can be queried without decryption.

Verification is re-checked at 90-day intervals. An expired or revoked license triggers an admin alert and changes the contractor's trust indicator on shared projects.

Email & SMS (Notifications)

Outbound notifications use Resend for transactional email (excellent deliverability, TypeScript SDK, generous free tier) and Twilio for SMS on critical events (milestone payment due, change order requiring approval within 48h).

Notification failure handling: Notification dispatch is always asynchronous, via the job queue. If Resend is down, the job is retried with exponential backoff (1 min, 5 min, 30 min). After 3 failures, the job is moved to a dead-letter table and the user's in-app notification is flagged as "Email delivery failed." The user experience is never blocked by notification infrastructure failures.
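
The retry schedule can be expressed as a small decision function. One point is an assumption: the text lists three delays and a three-failure cutoff without saying whether the initial attempt counts, so this sketch reads the cutoff as three failed retries, which uses each listed delay exactly once. The names are hypothetical.

```typescript
// Delays before retry 1, 2, and 3, in milliseconds (1 min, 5 min, 30 min).
const RETRY_DELAYS_MS: readonly number[] = [1, 5, 30].map((minutes) => minutes * 60000);

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "dead-letter" };

// failedAttempts counts every failed send so far, including the first one.
// Assumption: the job is dead-lettered once all listed delays are used up.
function decideAfterFailure(failedAttempts: number): RetryDecision {
  if (!Number.isInteger(failedAttempts) || failedAttempts < 1) {
    throw new RangeError("failedAttempts must be a positive integer");
  }
  const delayMs = RETRY_DELAYS_MS[failedAttempts - 1];
  if (delayMs === undefined) return { action: "dead-letter" };
  return { action: "retry", delayMs };
}
```

With the job queue, a "retry" decision would translate into setting run_at to the current time plus delayMs; a "dead-letter" decision would move the row and flag the in-app notification.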

6 Background Processing

Background jobs run as separate Fly.io worker machines. They consume jobs from the Postgres job_queue table using the SKIP LOCKED pattern — multiple workers can run concurrently without stepping on each other, and jobs are never lost (they live in the database, not in memory). See ADR-04 for the trade-off analysis.

-- Worker claim query — safe for concurrent workers
UPDATE job_queue
SET
  status    = 'processing',
  locked_at = NOW(),
  attempts  = attempts + 1
WHERE id = (
  SELECT id FROM job_queue
  WHERE  status = 'pending'
    AND  run_at <= NOW()
    AND  attempts < max_attempts
  ORDER BY run_at ASC
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING *;

Job Types

Job Type	Trigger	Frequency	SLA	Failure Action
`permit.poll`	Scheduled (cron)	Hourly per active project	Complete within 55 min	Retry 3x, then mark stale + alert
`health.compute`	On milestone/change-order write	On-demand	Complete within 30s	Retry 3x, serve stale score with staleness label
`notification.send`	On project events	On-demand	Delivered within 5 min	Retry with backoff, dead-letter after 3 failures
`delivery.track`	On tracking number added	Every 2 hours until delivered	Complete within 90 min	Retry 3x with 1h backoff
`digest.generate`	Scheduled (cron)	Daily at 7 AM project timezone	Complete before 8 AM	Skip day, alert admin if 3 consecutive failures
`license.verify`	On contractor join + every 90 days	On-demand + scheduled	Complete within 1 hour	Retry, then queue for manual review
`session.cleanup`	Scheduled (cron)	Daily at 2 AM UTC	Complete before 3 AM	Log failure, retry next day

Contract Health Score Algorithm

The ContractHealthScore is a 0–100 composite score computed by the health.compute worker. It is the most business-critical computation in the system. The algorithm is deterministic and testable as a pure function.

Health Score Components

Schedule adherence (30 pts): Ratio of on-time milestone completions vs. late completions. A milestone is late if it was completed more than 3 days past its due date, or if it is still incomplete past its due date.
Change order frequency (25 pts): Projects with more than 2 change orders per 10 milestones are penalized. High change order rates correlate with scope creep or poor planning.
Documentation completeness (20 pts): Key documents present: signed contract, active permit(s), current insurance certificate, inspection records.
Payment timeliness (15 pts): Payments released within 72h of milestone approval. Delayed payments reduce the score.
Communication currency (10 pts): Time since last activity in project (milestone update, document upload, message). Projects inactive >14 days lose points.
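
The five components can be combined in a pure function along the following lines. The point weights (30/25/20/15/10) and thresholds (2 change orders per 10 milestones, four key document types, 14 days of inactivity) come from the list above; how points fall off within each component is not specified there, so every scaling rule in this sketch is an assumption, as are the input field names.

```typescript
// Hypothetical pre-aggregated inputs; the worker would derive these from
// milestones, change_orders, documents, and activity timestamps.
interface HealthInputs {
  milestonesOnTime: number;
  milestonesLate: number;       // completed late, or incomplete and past due
  totalMilestones: number;
  changeOrders: number;
  requiredDocsPresent: number;  // out of the four key document types
  paymentsOnTime: number;       // released within 72h of milestone approval
  paymentsTotal: number;
  daysSinceLastActivity: number;
}

const REQUIRED_DOC_TYPES = 4;

// Share in [0, 1]; an empty denominator counts as a perfect share (assumption).
function share(numerator: number, denominator: number): number {
  if (denominator === 0) return 1;
  return Math.min(1, Math.max(0, numerator / denominator));
}

function computeHealthScore(i: HealthInputs): number {
  // 30 pts: proportion of milestones that are on time (assumed linear).
  const schedule = 30 * share(i.milestonesOnTime, i.milestonesOnTime + i.milestonesLate);

  // 25 pts: full marks up to 2 change orders per 10 milestones, then scaled
  // by threshold / actual rate (assumed penalty curve).
  const perTen = i.totalMilestones === 0 ? 0 : (i.changeOrders * 10) / i.totalMilestones;
  const changeOrders = 25 * (perTen <= 2 ? 1 : 2 / perTen);

  // 20 pts: proportion of key document types present.
  const documentation = 20 * share(i.requiredDocsPresent, REQUIRED_DOC_TYPES);

  // 15 pts: proportion of payments released on time.
  const payments = 15 * share(i.paymentsOnTime, i.paymentsTotal);

  // 10 pts: full marks through day 14, then one point lost per day (assumed).
  const communication = Math.max(0, 10 - Math.max(0, i.daysSinceLastActivity - 14));

  return Math.round(schedule + changeOrders + documentation + payments + communication);
}
```

Because the function takes plain numbers and returns a number, it can be unit-tested with table-driven cases and recomputed by the health.compute worker without touching the database layer.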

7 Security Architecture

Security is non-negotiable: Groundwork stores home addresses, financial arrangements, contractor license numbers, and signed legal documents. A breach would cause real harm to real people. Security controls are specified precisely here and enforced as code, not left as aspirations in documentation.

Authentication Flow

Data Encryption

At Rest

Database: Neon encrypts all data at rest using AES-256. No additional action required from the application.
Column-level encryption: PII fields (address, license_number, phone) are additionally encrypted using pgcrypto with a key stored in environment variables, not in the database. An attacker with a Postgres dump cannot read PII without the application key.
R2: Cloudflare R2 encrypts all objects at rest. Encryption keys are managed by Cloudflare.

In Transit

All external traffic: TLS 1.3 enforced. TLS 1.0 and 1.1 are disabled at the Cloudflare WAF level. HSTS with max-age=31536000; includeSubDomains.
API → Neon: TLS-encrypted connection. Neon requires TLS by default; plaintext connections are rejected.
API → Redis: Upstash enforces TLS. The connection string uses the rediss:// scheme.
API → R2: HTTPS only. Presigned URLs include a content-type restriction to prevent MIME sniffing attacks.

Multi-Tenant Isolation: Row Level Security

Row Level Security (RLS) is the last line of defense for multi-tenant data isolation. Even if the application layer has a bug that constructs an incorrect query, the database itself will not return data from another tenant's project.

-- Enable RLS on all project-scoped tables
ALTER TABLE milestones     ENABLE ROW LEVEL SECURITY;
ALTER TABLE change_orders  ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents      ENABLE ROW LEVEL SECURITY;
ALTER TABLE permits        ENABLE ROW LEVEL SECURITY;

-- Policy: user can only see milestones from their projects
CREATE POLICY milestones_member_policy
ON milestones FOR ALL TO app_user
USING (
  project_id IN (
    SELECT project_id
    FROM   project_members
    WHERE  user_id = current_setting('app.current_user_id')::uuid
  )
);

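-- API sets the user context at the start of each transaction.
-- The third argument (is_local = true) scopes the setting to that transaction,
-- so it must run in the same transaction as the queries it protects.
-- SELECT set_config('app.current_user_id', $1, true)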

-- Separate DB role for workers — cannot write to audit_events
CREATE ROLE app_worker;
GRANT SELECT, UPDATE ON job_queue TO app_worker;
GRANT SELECT, INSERT ON permits   TO app_worker;
GRANT SELECT, UPDATE ON projects  TO app_worker;
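
On the API side, the set_config call and the queries it protects need to share one transaction. The sketch below shows one way to guarantee that. TxClient is a hypothetical minimal interface, not a specific driver's API; a real client would need to be adapted to it.

```typescript
// Hypothetical minimal client shape, used here so the sketch stays driver-agnostic.
interface TxClient {
  query(text: string, values?: unknown[]): Promise<unknown>;
}

// Runs `work` inside a transaction whose RLS context is bound to `userId`.
async function withUserContext<T>(
  client: TxClient,
  userId: string,
  work: (client: TxClient) => Promise<T>,
): Promise<T> {
  await client.query("BEGIN");
  try {
    // is_local = true: the setting is discarded at COMMIT or ROLLBACK,
    // so it cannot leak to the next request on a pooled connection.
    await client.query("SELECT set_config('app.current_user_id', $1, true)", [userId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Routing every project-scoped query through a wrapper like this keeps the RLS context from being forgotten in individual handlers.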

PII Inventory

PII Field	Table	Encrypted?	Who Can Access	Retention
Email address	`users`	At rest (Neon)	User (own), Admin	Until account deletion
Phone number	`users`	Column-level (pgcrypto)	User (own), Admin	Until account deletion
Home address	`projects`	Column-level (pgcrypto)	Project members, Admin	Project lifetime + 7 years
License number	`users`	Column-level (pgcrypto)	Admin only	Until account deletion
Financial documents	R2 (key in `documents`)	R2 at-rest + TLS	Project members, Admin	Project lifetime + 7 years
IP addresses	`sessions`, `audit_events`	At rest (Neon)	Admin only	90 days
Payment amounts	`milestones`, `change_orders`	At rest (Neon)	Project members, Admin	Project lifetime + 7 years

Security Checklist by Threat Vector

Injection & Input Attacks

All SQL via parameterized queries (Drizzle ORM — never string interpolation)
All input validated with Zod schemas at the API boundary before reaching business logic
File uploads: MIME type validated server-side (not client-supplied Content-Type); max size enforced at R2 presigned URL generation; filename sanitized before storage
HTML rendering in SvelteKit uses {} bindings — auto-escaped. {@html} is banned by ESLint rule.
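
The server-side MIME check and filename sanitization from the list above can be sketched as follows. The allow-list of types and the sanitization rules are assumptions for illustration, and the sketch assumes the server can read the first bytes of the uploaded object.

```typescript
// Magic-byte signatures for an assumed allow-list of upload types.
const SIGNATURES: ReadonlyArray<{ mime: string; bytes: number[] }> = [
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46] }, // "%PDF"
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "image/jpeg", bytes: [0xff, 0xd8, 0xff] },
];

// Returns the detected MIME type, or null if the content matches nothing allowed.
// The client-supplied Content-Type header is deliberately ignored.
function sniffMime(head: Uint8Array): string | null {
  for (const sig of SIGNATURES) {
    if (head.length >= sig.bytes.length && sig.bytes.every((b, i) => head[i] === b)) {
      return sig.mime;
    }
  }
  return null;
}

// Drops any directory components, replaces unexpected characters with "_",
// strips leading dots, and caps the length at 255 characters.
function sanitizeFilename(name: string): string {
  const base = name.split(/[\\/]/).pop() ?? "";
  const cleaned = base.replace(/[^A-Za-z0-9._-]/g, "_").replace(/^\.+/, "");
  return cleaned.slice(0, 255) || "upload";
}
```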

Broken Access Control

RBAC checked in middleware before every handler
RLS as second enforcement layer in database
Object-level authorization checked before every signed URL generation (is this user a member of this project?)
Admin endpoints behind separate admin role; no privilege escalation path for homeowner/contractor roles
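
The object-level check before signed URL generation might be structured like this. The sketch uses in-memory records and a caller-supplied signer in place of the real database lookup and R2 presigner. All type and function names are hypothetical.

```typescript
interface Membership {
  userId: string;
  projectId: string;
}

interface DocumentRecord {
  id: string;
  projectId: string;
  r2Key: string;
}

class ForbiddenError extends Error {}

// Throws unless the user belongs to the project that owns the document.
function authorizeDocumentAccess(
  userId: string,
  doc: DocumentRecord,
  memberships: Membership[],
): void {
  const isMember = memberships.some(
    (m) => m.userId === userId && m.projectId === doc.projectId,
  );
  if (!isMember) throw new ForbiddenError("not a member of this project");
}

// The signer is only reached after authorization succeeds.
function getDownloadUrl(
  userId: string,
  doc: DocumentRecord,
  memberships: Membership[],
  sign: (r2Key: string) => string,
): string {
  authorizeDocumentAccess(userId, doc, memberships);
  return sign(doc.r2Key);
}
```

Keeping the check inside the same function that calls the signer makes it hard for a new handler to produce a URL without passing through authorization.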

Auth & Session

Passwords hashed with bcrypt (cost=12) — never stored plaintext
Session IDs are random UUIDs (version 4, 122 bits of entropy), not sequential
Sessions expire after 24h inactivity; absolute max 30 days
Logout immediately invalidates session in Redis and Postgres
Password reset tokens are single-use, 15-minute expiry, stored as bcrypt hashes
CSRF tokens required for all state-mutating requests
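
The two session expiry rules (24h inactivity, 30-day absolute maximum) combine as shown in this sketch. The Session shape and field names are assumptions; the real session record may carry more fields.

```typescript
const IDLE_TIMEOUT_MS = 24 * 60 * 60 * 1000; // 24h inactivity
const ABSOLUTE_MAX_MS = 30 * 24 * 60 * 60 * 1000; // 30-day hard cap

interface Session {
  createdAtMs: number; // when the user logged in
  lastSeenAtMs: number; // last authenticated request
}

// A session is valid only if BOTH limits hold. Activity extends the idle
// window but never the absolute lifetime.
function isSessionValid(session: Session, nowMs: number): boolean {
  const withinIdleWindow = nowMs - session.lastSeenAtMs < IDLE_TIMEOUT_MS;
  const withinAbsoluteMax = nowMs - session.createdAtMs < ABSOLUTE_MAX_MS;
  return withinIdleWindow && withinAbsoluteMax;
}
```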

Infrastructure

Cloudflare WAF in front of Netlify and Fly.io
Fly.io private networking for API → Neon (not public internet)
Secrets in Fly.io secrets (env vars), never in code or logs
Dependency scanning via npm audit in CI
Security headers: CSP, X-Frame-Options, X-Content-Type-Options, HSTS
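
The security headers listed above could be produced by a single helper so every response carries the same set. The HSTS value matches the one specified under In Transit. The CSP and X-Frame-Options values are placeholder assumptions and would need tuning for the real frontend.

```typescript
// Sketch: one source of truth for response security headers.
function securityHeaders(): Record<string, string> {
  return {
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY", // assumed: the app is never framed
    "Content-Security-Policy": "default-src 'self'; frame-ancestors 'none'", // assumed policy
  };
}

// Merges the security headers over a handler's own headers so a handler
// cannot accidentally weaken them.
function withSecurityHeaders(handlerHeaders: Record<string, string>): Record<string, string> {
  return { ...handlerHeaders, ...securityHeaders() };
}
```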

8 Scaling Strategy

Groundwork is designed to start simple and scale predictably. The architectural choices in Year 1 deliberately avoid premature optimization while maintaining clear paths to Year 3 capacity.

Year 1

500 Projects

Single city — low volume, validate the product

1x Fly.io API machine (shared-cpu-2x, 512MB)
1x Fly.io worker machine (shared-cpu-1x)
Neon free tier (0.5 vCPU, 1 GB)
Upstash Redis free tier (10K commands/day)
Estimated cost: ~$30/month

Year 2

5,000 Projects

Multi-city expansion — 10x growth

2–3x Fly.io API machines (auto-scale on CPU)
2x Fly.io worker machines (more parallel pollers)
Neon Launch ($19/month — 10 GB)
Upstash Redis Pay-as-you-go
PgBouncer connection pooling (Neon built-in)
Estimated cost: ~$200/month

Year 3

50,000 Projects

National scale — 100x from Year 1

Auto-scaling Fly.io pool (4–10 machines)
Neon Scale tier ($69/month + read replicas)
Read replica for analytics queries
Cloudflare R2 bandwidth savings at scale
Cache hit rate target: >90% for health scores
Estimated cost: ~$800–1,200/month

Database Scaling Path

The most important scaling risk is the database. The strategy is to exhaust Postgres vertical scaling before introducing horizontal complexity.

Stage	Action	When to trigger
1	Add missing indexes; review slow query log	p95 query time > 200ms
2	Enable PgBouncer connection pooling (Neon built-in)	>50 concurrent connections
3	Add Redis caching for expensive derived queries	Same query executed > 10 times/minute
4	Vertical scale Neon tier (more vCPU, RAM)	CPU > 70% sustained
5	Add read replica; route analytics to replica	Read/write ratio > 90:10
6	Partition `audit_events` by month (range partitioning)	Table > 50M rows
7	Consider sharding by city/region (unlikely before 500K projects)	>100M total rows, write contention

API Scaling: Stateless by Design

The API server is stateless — all session state is in Redis, all data is in Postgres. Adding a second or tenth API machine requires zero coordination. Fly.io's built-in load balancer distributes traffic across instances. The SSE stream connections are sticky (Fly.io session affinity) to avoid the complexity of cross-instance event broadcasting.

SSE Scaling Note: With session affinity, each API instance maintains its own set of LISTEN/NOTIFY connections. If a user's SSE connection lands on Instance A, all events for their projects are received by Instance A's Postgres LISTEN connection and forwarded. This scales linearly: each new API machine adds capacity for more concurrent SSE connections. At very high concurrency (>1000 projects per instance), SSE connections can be moved to a dedicated machine.
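
The per-instance forwarding step can be pictured as a small in-process hub: the instance's LISTEN handler calls publish, and each open SSE response registers a send callback. This is an illustrative sketch with hypothetical names, and it assumes each payload is a single-line string such as serialized JSON.

```typescript
type Send = (frame: string) => void;

class ProjectEventHub {
  private subscribers = new Map<string, Set<Send>>();

  // Registers an SSE connection for a project; returns an unsubscribe function
  // to call when the client disconnects.
  subscribe(projectId: string, send: Send): () => void {
    let set = this.subscribers.get(projectId);
    if (!set) {
      set = new Set<Send>();
      this.subscribers.set(projectId, set);
    }
    set.add(send);
    const subs = set;
    return () => {
      subs.delete(send);
      // Only drop the map entry if it still points at this (now empty) set.
      if (subs.size === 0 && this.subscribers.get(projectId) === subs) {
        this.subscribers.delete(projectId);
      }
    };
  }

  // Called when a NOTIFY arrives for a project. Returns how many local
  // connections received the event.
  publish(projectId: string, payload: string): number {
    const set = this.subscribers.get(projectId);
    if (!set) return 0;
    set.forEach((send) => send(`data: ${payload}\n\n`)); // SSE wire format
    return set.size;
  }
}
```

Because the hub is local to one process, events for a project only reach connections held by that instance, which is why each instance keeps its own LISTEN connection.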

Cache Strategy: TTLs and Invalidation

Cache invalidation is the root of many production bugs. The strategy here is explicit: cache entries are invalidated on write rather than relying on TTL expiry for correctness. TTLs are a safety net, not the primary invalidation mechanism.

Cache Key	TTL	Invalidated by
`session:{id}`	24h	Logout, password change, admin revocation
`health:{project_id}`	2h	Any milestone/change-order write to that project
`permit:{project_id}`	30m	Permit poller completing successfully
`r2_url:{document_id}`	55m	Never — URLs naturally expire at 60m
`rate:{ip}:{endpoint}`	60s sliding	TTL only (intended to expire)
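
The invalidate-on-write rule for the health score key can be sketched as follows. An in-memory map stands in for Redis, and the function names are hypothetical. The 2h TTL comes from the table above.

```typescript
interface CacheEntry {
  value: string;
  expiresAtMs: number;
}

// In-memory stand-in for Redis with per-key TTLs and an injectable clock.
class TtlCache {
  private store = new Map<string, CacheEntry>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  get(key: string): string | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAtMs <= this.now()) {
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: string, ttlMs: number): void {
    this.store.set(key, { value, expiresAtMs: this.now() + ttlMs });
  }

  del(key: string): void {
    this.store.delete(key);
  }
}

const HEALTH_TTL_MS = 2 * 60 * 60 * 1000; // safety net only

// Read path: serve from cache, compute and store on a miss.
function getHealthScore(cache: TtlCache, projectId: string, compute: () => number): number {
  const key = `health:${projectId}`;
  const hit = cache.get(key);
  if (hit !== undefined) return Number(hit);
  const score = compute();
  cache.set(key, String(score), HEALTH_TTL_MS);
  return score;
}

// Write path: any milestone or change-order write deletes the key immediately.
function onProjectWrite(cache: TtlCache, projectId: string): void {
  cache.del(`health:${projectId}`);
}
```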

9 Observability

A system that cannot be debugged in production is incomplete. Observability is built in from day one. Every request, every background job, and every integration failure is traceable.

Structured Logging

All application logs are written to stdout as newline-delimited JSON (NDJSON) using Pino. Log levels: error, warn, info, debug. Production defaults to info. Fly.io captures stdout and streams to a log drain.

// Every log entry includes standard fields
{
  "level":      "info",
  "time":       1744000000000,
  "pid":        1234,
  "requestId":  "req_01J...",   // Ulid — unique per request
  "userId":     "usr_01J...",   // null if unauthenticated
  "projectId":  "proj_01J...",  // null if not project-scoped
  "method":     "PATCH",
  "path":       "/projects/abc/milestones/xyz",
  "status":     200,
  "latencyMs":  42,
  "msg":        "milestone updated"
}

PII in logs is prohibited. Log entries must never contain email addresses, phone numbers, home addresses, license numbers, payment amounts, document names, or any field from the PII inventory in Section 7. Sensitive fields are redacted at the logging middleware layer before the JSON is serialized.
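
The redaction step could be implemented as a recursive key filter applied before serialization. The key list below is an assumption derived from the PII inventory; actual field names may differ. Pino also ships a built-in redact option that takes key paths, which may be preferable to a hand-rolled filter.

```typescript
// Assumed field names; align these with the real schema before use.
const REDACTED_KEYS = new Set<string>([
  "email",
  "phone",
  "address",
  "licenseNumber",
  "amount",
  "documentName",
]);

// Returns a deep copy with sensitive keys replaced, leaving the input untouched.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map((item) => redact(item));
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      out[key] = REDACTED_KEYS.has(key) ? "[REDACTED]" : redact(source[key]);
    }
    return out;
  }
  return value;
}
```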

Key Metrics

API Metrics (emitted to Fly.io)

Metric	Type	Alert Threshold
http_request_duration_ms	Histogram	p95 > 500ms
http_error_rate	Counter	> 1% 5xx in 5 min
active_sse_connections	Gauge	> 200 per instance
auth_failures_per_minute	Counter	> 50 failures/min

Business Metrics (Postgres views)

Metric	Query
Active projects	COUNT WHERE status='active'
Pending change orders	COUNT WHERE status='pending'
Job queue depth	COUNT WHERE status='pending'
Failed jobs (24h)	COUNT WHERE status='failed'

Error Tracking: Sentry

Sentry captures unhandled exceptions and explicit error calls in both the frontend and backend. Configuration:

Source maps uploaded during CI deploy — Sentry shows original TypeScript, not compiled JS
PII scrubbing enabled in Sentry SDK (sendDefaultPii: false)
Alerts: any new error issue triggers a Slack notification to #eng-alerts
Performance tracing: 10% of transactions sampled (1% in Year 1 to stay within free tier)
Integration failures (permit API, carrier API) logged as Sentry issues with breadcrumbs

Uptime Monitoring

UptimeRobot (free tier): HTTP monitors on GET /health (API) and the Netlify frontend. 5-minute check interval. Alerts via email + Slack.

The GET /health endpoint performs a lightweight check:

// Health check — fast, meaningful
GET /health → {
  "status": "ok",
  "db":     "ok",   // SELECT 1 against Neon
  "redis":  "ok",   // PING against Upstash
  "uptime": 86400  // seconds since last restart
}
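
The response body above could be assembled by a pure function once the two dependency checks have run. This sketch assumes the endpoint returns HTTP 503 when either dependency is down, so that an HTTP uptime monitor registers the failure. The status code and the "fail" value are assumptions, not specified above.

```typescript
type CheckStatus = "ok" | "fail";

interface HealthBody {
  status: CheckStatus;
  db: CheckStatus;
  redis: CheckStatus;
  uptime: number; // seconds since last restart
}

// dbOk: result of SELECT 1 against Neon; redisOk: result of PING against Upstash.
function buildHealth(
  dbOk: boolean,
  redisOk: boolean,
  startedAtMs: number,
  nowMs: number,
): { httpStatus: number; body: HealthBody } {
  const healthy = dbOk && redisOk;
  return {
    httpStatus: healthy ? 200 : 503,
    body: {
      status: healthy ? "ok" : "fail",
      db: dbOk ? "ok" : "fail",
      redis: redisOk ? "ok" : "fail",
      uptime: Math.floor((nowMs - startedAtMs) / 1000),
    },
  };
}
```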

On-Call Runbooks

Trade-offs & Alternatives

AWS S3: More mature, more third-party integrations, AWS Lambda triggers for post-upload processing. Rejected primarily on egress cost grounds. At Year 3 (50K projects × 75 MB average documents × frequent downloads), egress costs on S3 could be $500–2,000/month. R2 makes that $0.

Tigris on Fly.io: S3-compatible object storage that runs on Fly.io infrastructure — attractive because it collocates storage with the API. Considered seriously. R2 chosen over Tigris because Cloudflare's WAF and CDN are already in the stack; R2 integrates more tightly with Cloudflare's edge network for global download performance. Revisit if Tigris matures significantly.
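
The S3 egress estimate above can be checked with simple arithmetic. The sketch assumes an S3 internet egress price of about $0.09 per GB, which is an assumed list price and not taken from this document, and treats "frequent downloads" as some number of full-corpus downloads per month.

```typescript
const PROJECTS = 50000;
const AVG_DOCS_MB_PER_PROJECT = 75;
const S3_EGRESS_USD_PER_GB = 0.09; // assumed list price, verify before relying on it

// Cost of downloading the entire document corpus `fullDownloadsPerMonth` times.
function monthlyEgressUsd(fullDownloadsPerMonth: number): number {
  const corpusGb = (PROJECTS * AVG_DOCS_MB_PER_PROJECT) / 1000; // 3,750 GB
  return corpusGb * fullDownloadsPerMonth * S3_EGRESS_USD_PER_GB;
}
```

Under these assumptions one full download of the corpus per month costs about $337.50, so the $500–2,000 range corresponds to roughly 1.5 to 6 full-corpus downloads per month.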

Decision Summary

#	Decision	Chosen	Primary Reason
01	Frontend framework	SvelteKit	Bundle size, team preference, SSR+SPA hybrid
02	Primary database	PostgreSQL (Neon)	Relational model, ACID, LISTEN/NOTIFY, SKIP LOCKED
03	Authentication	Session cookies	Instant revocation, XSS-safe HttpOnly cookie
04	Job queue	Postgres SKIP LOCKED	No extra dependency, durable, sufficient throughput
05	API style	REST	Simple CRUD patterns, no N+1 risk, cacheable
06	Postgres host	Neon	Pure Postgres, no bundled opinion, branching
07	Compute	Fly.io VMs	Persistent connections required (SSE, workers)
08	Object storage	Cloudflare R2	Zero egress fees, CDN integration
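
Decision 04 relies on Postgres row locking to let several workers poll the same table without claiming the same job twice. A minimal sketch of the claim query is shown below. The columns locked_by, locked_at, created_at, and payload are assumed for illustration; only the job_queue table and the 'pending' status appear elsewhere in this document.

```typescript
interface SqlStatement {
  text: string;
  values: string[];
}

// Claims the oldest pending job. Rows already locked by another worker's
// in-flight claim are skipped instead of waited on.
function claimNextJob(workerId: string): SqlStatement {
  const text = `
UPDATE job_queue
SET    status = 'running', locked_by = $1, locked_at = now()
WHERE  id = (
  SELECT id
  FROM   job_queue
  WHERE  status = 'pending'
  ORDER  BY created_at
  LIMIT  1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, payload`;
  return { text, values: [workerId] };
}
```

If the statement returns no row, the queue is empty or every pending job is currently being claimed by another worker, and the caller can sleep until the next poll.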

Implementation Starting Point: An engineering team can begin from this document. Recommended sequence: (1) Bootstrap the Hono API with auth middleware and session handling. (2) Create the database schema with RLS policies. (3) Build the SvelteKit app shell with route structure and SSE subscription. (4) Implement the job queue worker runner. (5) Build feature areas in milestone order per the BRD. Each step has sufficient specification above to begin without additional architecture decisions.