Infrastructure Plan

Groundwork — Shared Reality Platform for Home Renovation

Version: 1.0

Status: Draft

Last Updated: April 2026

Audience: Engineering / DevOps

ⓘ

This document is the authoritative reference for how Groundwork is deployed and operated. It is specific enough that a DevOps engineer with access to the listed accounts can stand up the full stack in a single day. All CLI commands assume macOS/Linux with flyctl, netlify-cli, and gh installed.

1. Infrastructure Overview

Groundwork runs on four independent control planes: Netlify for the frontend, Fly.io for the API and background workers, Neon for the database, and Cloudflare R2 for file storage. This decomposition keeps each layer independently scalable and avoids any single-vendor lock-in on the critical path.

Frontend

Netlify

Pro plan — $19/mo

SvelteKit SSR via adapter-netlify
Global CDN for static assets
Preview deploys per pull request
Auto-SSL, custom headers, redirects

API

Fly.io API

shared-cpu-1x · 512MB — $5–50/mo

Node.js (Fastify) REST + WebSocket
Primary region: iad (US East)
Auto-scale 1–4 machines
Private 6PN networking to workers

Workers

Fly.io Workers

Separate app · scales to zero

Background job processing queue
Email renders + dispatch via Resend
PDF generation, notification fan-out
Cold-start acceptable (async work)

Database

Neon PostgreSQL

Free → Pro — $0–69/mo

Serverless autoscale (0–4 CU)
Built-in connection pooler (pgbouncer)
PITR 7-day retention (Pro)
Branch per PR for schema testing

Storage

Cloudflare R2

10GB free · $0.015/GB thereafter

Zero egress fees
S3-compatible API (AWS SDK works)
Project photos, PDFs, attachments
Presigned URLs for private objects (a public bucket does not need them)

Cache

Upstash Redis

Free tier (10K req/day) · $10/mo at scale

Session cache, rate-limit counters
Deferred until Growth phase
Serverless — pay per request
REST API (no persistent connection)

2. Hosting Architecture

2.1 Netlify — Frontend

The SvelteKit application is deployed to Netlify using @sveltejs/adapter-netlify. Static assets (JS bundles, CSS, fonts, images) are served from Netlify's global CDN. Server-side rendering runs inside Netlify Functions (AWS Lambda under the hood, 1 vCPU, 1024MB, 10s timeout).

      netlify.toml
      TOML
    

[build]
  command   = "npm run build"
  publish   = "build"   # adapter-netlify writes its static output here

[build.environment]
  NODE_VERSION = "20"

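# SSR function: adapter-netlify generates it at build time, so no
# functions block is needed here. If custom functions are added later,
# set their directory in a single [functions] table (not [[functions]]).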

# Cache static assets aggressively
[[headers]]
  for = "/_app/immutable/*"
  [headers.values]
    Cache-Control = "public, max-age=31536000, immutable"

# Security headers on all routes
[[headers]]
  for = "/*"
  [headers.values]
    X-Frame-Options        = "DENY"
    X-Content-Type-Options = "nosniff"
    Referrer-Policy        = "strict-origin-when-cross-origin"
    Permissions-Policy     = "camera=(), microphone=(), geolocation=()"
    Strict-Transport-Security = "max-age=63072000; includeSubDomains; preload"

# Proxy /api/* to the Fly.io API (status 200 = rewrite; the browser URL stays on groundwork.app)
[[redirects]]
  from   = "/api/*"
  to     = "https://api.groundwork.app/:splat"
  status = 200
  force  = true

2.2 Fly.io — API Service

The API is a Node.js (Fastify) application running on Fly.io. Node.js is recommended over Go here because the team can share code and types between the SvelteKit frontend and the API (e.g., Zod schemas, shared utility functions), reducing duplication and accelerating the early build. Go would be appropriate if the API needs to handle sustained CPU-bound work at high concurrency — that's not the Launch or Growth profile.

      fly.toml — groundwork-api
      TOML
    

app      = "groundwork-api"
primary_region = "iad"

[build]
  dockerfile = "Dockerfile"

[env]
  PORT        = "8080"
  NODE_ENV    = "production"
  LOG_LEVEL   = "info"

# One service definition only: [http_service] and [[services]] should not
# both describe the same port.
[http_service]
  internal_port        = 8080
  force_https          = true
  auto_stop_machines   = "stop"   # stop idle machines above the minimum
  auto_start_machines  = true
  min_machines_running = 1        # keep at least one machine warm in iad

  [http_service.concurrency]
    type       = "requests"
    hard_limit = 200
    soft_limit = 150

  [[http_service.checks]]
    interval     = "15s"
    timeout      = "5s"
    grace_period = "10s"
    method       = "GET"
    path         = "/health"

# The upper bound of 4 is not set in fly.toml. It is the number of machines
# created for the app, e.g. flyctl scale count 4 --app groundwork-api

[[vm]]
  cpu_kind = "shared"
  cpus     = 1
  memory_mb= 512

2.3 Fly.io — Background Workers

Workers are a separate Fly.io app (groundwork-workers) so they can scale independently of the API and be deployed or restarted without impacting live traffic. They pull jobs from a PostgreSQL-backed queue stored in Neon (using the pg-boss library) and are intended to scale to zero between bursts.

      fly.toml — groundwork-workers
      TOML
    

app            = "groundwork-workers"
primary_region = "iad"

[build]
  dockerfile = "Dockerfile.worker"

# No public service — workers are internal only
# They communicate outbound only (Resend, Neon, R2)

[[vm]]
  cpu_kind  = "shared"
  cpus      = 1
  memory_mb = 512

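# Scale to zero: a stopped machine cannot poll pg-boss, and Fly Proxy
# autostart only applies to apps that expose a service. With no service
# defined here, something else has to start the worker when jobs arrive
# (for example the API calling the Fly Machines API after enqueueing),
# and the worker process should exit once the queue is idle so the
# machine stops. Do not add an [http_service] block unless the worker
# actually listens on a port.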

⚠

pg-boss for job queue: Rather than adding Redis in the early stages, use pg-boss which implements a reliable job queue directly on top of PostgreSQL. This removes one external dependency at Launch. Migrate to a dedicated queue (BullMQ + Upstash Redis) only when you hit Growth phase and see queue contention.

2.4 Region Strategy

Phase	Fly.io Regions	Neon Region	Rationale
Launch (Mo 1–3)	iad	us-east-1	US-focused early users; minimize latency to Neon
Growth (Mo 4–12)	iad + ord	us-east-1	Add Chicago for US resilience; primary DB stays east
Scale (Year 2)	iad + ord + lax	us-east-1 (+ read replica)	West coast coverage; Neon read replica cuts latency

3. Database Setup

3.1 Neon Project Configuration

One Neon project holds all environments as separate branches. The project lives in us-east-1 (AWS), co-located with Fly.io iad to minimize network round-trips.

      Initial setup
      SHELL
    

# Install Neon CLI
npm install -g neonctl
neonctl auth

# Create project (do this once)
neonctl projects create \
  --name groundwork \
  --region-id aws-us-east-1 \
  --pg-version 16

# Create staging branch from main (must exist before its URL can be listed)
neonctl branches create --name staging --parent main

# Print branch connection strings (the branch is a positional argument)
neonctl connection-string main
neonctl connection-string staging

# Create a per-PR branch (run in CI)
neonctl branches create \
  --name "preview/pr-$PR_NUMBER" \
  --parent staging

3.2 Connection Pooling

Every application connects through Neon's built-in pgbouncer endpoint (port 5432, transaction pooling mode). Direct connections (port 5432 on the non-pooler hostname) are used only for migrations, which require session mode.

Connection Type	Hostname Pattern	Port	Use For
Pooled (pgbouncer)	ep-xxx-pooler.us-east-1.aws.neon.tech	5432	API, Workers (all runtime queries)
Direct	ep-xxx.us-east-1.aws.neon.tech	5432	Migrations only (requires session mode)

      Connection string management — .env structure
      ENV
    

# Runtime connection (pooled — use this in app code)
DATABASE_URL=postgres://user:pass@ep-xxx-pooler.us-east-1.aws.neon.tech/neondb?sslmode=require

# Migration connection (direct — use only in migration scripts)
DATABASE_URL_DIRECT=postgres://user:pass@ep-xxx.us-east-1.aws.neon.tech/neondb?sslmode=require

# SSL is enforced; never disable sslmode
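
The hostname pattern in the table above differs only by a "-pooler" suffix on the endpoint ID, so one way to avoid the two variables drifting apart is to derive the pooled URL from the direct one. The sketch below is a hypothetical helper (toPooledUrl is not part of Neon's tooling) and assumes the hostname follows the ep-xxx pattern shown above.

```typescript
// Hypothetical helper: derive the pooled connection string from the direct
// one by appending "-pooler" to the endpoint ID (the first hostname label).
function toPooledUrl(directUrl: string): string {
  const host = new URL(directUrl).hostname;
  const [endpointId, ...rest] = host.split(".");
  if (!endpointId.startsWith("ep-")) {
    throw new Error(`Not a Neon endpoint hostname: ${host}`);
  }
  if (endpointId.endsWith("-pooler")) {
    return directUrl; // already a pooled URL, leave it unchanged
  }
  const pooledHost = [`${endpointId}-pooler`, ...rest].join(".");
  // Replace only the host part that follows the credentials.
  return directUrl.replace(`@${host}`, `@${pooledHost}`);
}

// Example with the placeholder values from the .env structure above.
const exampleDirect =
  "postgres://user:pass@ep-xxx.us-east-1.aws.neon.tech/neondb?sslmode=require";
console.log(toPooledUrl(exampleDirect));
```

Keeping DATABASE_URL_DIRECT as the single stored value and deriving DATABASE_URL from it is a design choice, not a requirement; storing both explicitly, as the .env structure does, works equally well.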

3.3 Backup & Point-in-Time Recovery

Neon's Pro plan includes continuous WAL archiving with a 7-day PITR window. No additional backup tooling is required for the Launch or Growth phases.

      Branch restore from PITR
      SHELL
    

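# Create a restore branch from a specific point in time.
# neonctl takes the timestamp through --parent; a bare timestamp is
# resolved against the default branch (main).
neonctl branches create \
  --name restore-2026-04-01 \
  --parent "2026-04-01T12:00:00Z"

# Verify data in the restore branch, then make it the default branch.
neonctl branches set-default restore-2026-04-01
# The restore branch has its own endpoint hostname, so update DATABASE_URL
# and DATABASE_URL_DIRECT wherever they are stored after switching.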

3.4 Monitoring

What to Watch	Where	Alert Threshold
Active connections	Neon dashboard → Monitoring	>80 pooler connections
Slow queries	Neon → Query stats (pg_stat_statements)	p99 > 500ms
Storage usage	Neon dashboard → Usage	>80% of plan limit
Compute uptime	Neon dashboard → Compute	Unexpected auto-suspend during peak

4. CI/CD Pipeline

All code flows through GitHub. Netlify auto-deploys are the primary mechanism for the frontend. Fly.io deployments are driven by GitHub Actions to ensure migrations run before the new application code starts serving traffic.

Stage 1

Lint + Test

ESLint + Prettier
TypeScript check
Vitest unit tests
Playwright smoke

Stage 2

Preview Deploy

Netlify preview URL
Neon PR branch
Run migrations
Post URL to PR

Stage 3

Staging

Fly.io staging app
Neon staging branch
Integration tests
Manual approval gate

Stage 4

Production

Run DB migrations
Fly.io deploy
Netlify auto-deploy
Smoke test suite

      .github/workflows/deploy.yml
      YAML
    

name: Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
  NEON_API_KEY:  ${{ secrets.NEON_API_KEY }}
  NEON_PROJECT_ID: ${{ secrets.NEON_PROJECT_ID }}

jobs:
  # ── 1. Lint & Test ─────────────────────────────────────
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run check     # svelte-check + tsc
      - run: npm run test:unit

  # ── 2. PR Preview (branch deploys only) ────────────────
  preview:
    if: github.event_name == 'pull_request'
    needs: [test]
    runs-on: ubuntu-latest
    steps:
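      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci             # npm run migrate needs dependencies installed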

      # Create a Neon branch for this PR
      - name: Create Neon preview branch
        uses: neondatabase/create-branch-action@v5
        id: neon-branch
        with:
          project_id: ${{ env.NEON_PROJECT_ID }}
          api_key:    ${{ env.NEON_API_KEY }}
          branch_name: preview/pr-${{ github.event.number }}
          parent:     staging

      # Run migrations against the preview branch
      - name: Run migrations
        run: npm run migrate
        env:
          DATABASE_URL_DIRECT: ${{ steps.neon-branch.outputs.db_url }}

      # Netlify builds the preview deploy itself via its Git integration.
      # This step only records the Neon branch name on the PR.
      - name: Comment preview DB branch on PR
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `Preview DB branch: \`preview/pr-${context.issue.number}\``
            })

  # ── 3. Production Deploy (main branch only) ────────────
  deploy:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    needs: [test]
    runs-on: ubuntu-latest
    environment: production    # GitHub environment with approval gate
    steps:
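      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci             # npm run migrate needs dependencies installed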

      # Run migrations BEFORE deploying new code
      - name: Run production migrations
        run: npm run migrate
        env:
          DATABASE_URL_DIRECT: ${{ secrets.DATABASE_URL_DIRECT_PROD }}

      # Deploy API to Fly.io
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - name: Deploy API
        run: flyctl deploy --app groundwork-api --strategy rolling

      # Deploy workers
      - name: Deploy Workers
        run: flyctl deploy --app groundwork-workers --strategy immediate

      # Netlify deploys automatically on push to main via Git integration
      # No step needed here unless you want to block on Netlify completion

4.1 Rollback Procedures

Layer	Rollback Command	Time
Fly.io API	flyctl releases list --app groundwork-api, then flyctl deploy --image <previous-image> --app groundwork-api	~2 min
Netlify	Netlify dashboard → Deploys → select previous deploy → Publish deploy, or netlify deploy --prod --dir=<old-build>	~1 min
Database	neonctl branches create --name rollback --parent <ts> (the timestamp is passed through --parent)	~5 min

⚠

Migration safety rule: All migrations must be backward-compatible with the previous version of the application code. Use a two-phase approach: first deploy a migration that is compatible with both old and new code, then deploy the code change. Never drop columns or rename them in the same deploy as the code that removes their usage.

5. Environment Management

Local

Frontend: vite dev (port 5173)
API: node --watch (port 8080)
Database: Neon dev branch
Storage: R2 dev bucket
Jobs: in-process (no worker app)

Preview

Frontend: Netlify preview URL
API: groundwork-api (prod)
Database: Neon preview/pr-<N> branch (schema testing only; the prod API keeps its own DATABASE_URL)
Storage: R2 dev bucket
Jobs: groundwork-workers

Staging

Frontend: staging.groundwork.app
API: groundwork-api-staging
Database: Neon staging branch
Storage: R2 staging bucket
Jobs: groundwork-workers-staging

Production

Frontend: groundwork.app
API: groundwork-api
Database: Neon main branch
Storage: R2 production bucket
Jobs: groundwork-workers

5.1 Secrets Management

Secret	Stored In	Injected Into
DATABASE_URL	Fly.io secrets	API, Workers runtime
DATABASE_URL_DIRECT	GitHub Actions secrets (stored as DATABASE_URL_DIRECT_PROD)	Migration step in CI only
RESEND_API_KEY	Fly.io secrets	Workers runtime
SENTRY_DSN	Netlify env vars + Fly.io secrets	Frontend build + API runtime
R2_ACCESS_KEY_ID	Fly.io secrets	API runtime
R2_SECRET_ACCESS_KEY	Fly.io secrets	API runtime
SESSION_SECRET	Fly.io secrets	API runtime
UPSTASH_REDIS_URL	Fly.io secrets	API runtime (Growth+)
FLY_API_TOKEN	GitHub Actions secrets	CI deploy step
NEON_API_KEY	GitHub Actions secrets	CI branch creation step
NEON_PROJECT_ID	GitHub Actions secrets	CI branch creation step

      Setting secrets on Fly.io
      SHELL
    

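# Set API secrets (never committed to git)
flyctl secrets set \
  DATABASE_URL="postgres://..." \
  R2_ACCESS_KEY_ID="..." \
  R2_SECRET_ACCESS_KEY="..." \
  SESSION_SECRET="$(openssl rand -hex 32)" \
  --app groundwork-api

# Set worker secrets (RESEND_API_KEY is injected into the Workers runtime)
flyctl secrets set \
  DATABASE_URL="postgres://..." \
  RESEND_API_KEY="re_..." \
  --app groundwork-workers

# Verify (values are redacted in output)
flyctl secrets list --app groundwork-api
flyctl secrets list --app groundwork-workers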
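
A missing secret is easiest to catch at process start rather than on the first request that needs it. The following is a minimal sketch of a startup check, assuming the API's required names are the ones the secrets table injects into the API runtime; missingSecrets and assertSecrets are hypothetical helper names, and the environment is passed in as a plain object so the check can be exercised without touching the real process environment.

```typescript
type Env = Record<string, string | undefined>;

// Names taken from the secrets table (API runtime rows at Launch).
const API_REQUIRED_SECRETS: readonly string[] = [
  "DATABASE_URL",
  "R2_ACCESS_KEY_ID",
  "R2_SECRET_ACCESS_KEY",
  "SESSION_SECRET",
];

// Returns the names that are unset or blank in the given environment.
function missingSecrets(required: readonly string[], env: Env): string[] {
  return required.filter((name) => (env[name] ?? "").trim() === "");
}

// Throws with a single message listing every missing name.
function assertSecrets(required: readonly string[], env: Env): void {
  const missing = missingSecrets(required, env);
  if (missing.length > 0) {
    throw new Error(`Missing required secrets: ${missing.join(", ")}`);
  }
}

// Example: one secret absent, one blank.
const exampleEnv: Env = {
  DATABASE_URL: "postgres://...",
  R2_ACCESS_KEY_ID: "abc",
  SESSION_SECRET: "  ",
};
console.log(missingSecrets(API_REQUIRED_SECRETS, exampleEnv));
```

In the API entry point this would be called once as assertSecrets(API_REQUIRED_SECRETS, process.env) before the server starts listening, so a misconfigured deploy fails its health check instead of serving partial functionality.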

6. Monitoring & Observability

6.1 Metrics

Fly.io provides built-in machine metrics (CPU, memory, network) visible in the dashboard at fly.io/apps/groundwork-api/metrics. No additional agent is required for infrastructure metrics at the Launch phase.

Signal	Source	Dashboard
CPU / Memory	Fly.io built-in	fly.io/apps/groundwork-api/metrics
HTTP request rate + latency	Fly.io built-in	Same dashboard, request metrics tab
Machine restarts	Fly.io events	fly.io/apps/groundwork-api/events
DB query stats	Neon console	console.neon.tech → Monitoring
Uptime / availability	UptimeRobot	uptimerobot.com dashboard

6.2 Error Tracking — Sentry

Sentry is installed in both the SvelteKit frontend (@sentry/sveltekit) and the Fastify API (@sentry/node). They share one Sentry project, differentiated by environment tags.

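      Sentry initialization — SvelteKit hooks.client.ts
      TypeScript
    

// src/hooks.client.ts
// Session Replay is a browser-only integration, so this config belongs in
// the client hook. The server hook (src/hooks.server.ts) calls Sentry.init
// with dsn, environment, and tracesSampleRate only.
import * as Sentry from '@sentry/sveltekit';

Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN,
  environment: import.meta.env.MODE,
  tracesSampleRate: 0.1,      // 10% trace sampling
  profilesSampleRate: 0.05,    // 5% profiling (only effective once a profiling integration is added)
  integrations: [
    Sentry.replayIntegration({
      maskAllText: true,         // PII protection
      blockAllMedia: false,
    }),
  ],
  replaysSessionSampleRate:  0.01,
  replaysOnErrorSampleRate: 1.0,
});

// Forward unhandled client-side errors to Sentry
export const handleError = Sentry.handleErrorWithSentry();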

6.3 UptimeRobot — Synthetic Checks

Five monitors cover the critical user journeys. Alert contacts: email + Slack webhook. Check interval: 5 minutes on the free plan.

#	Monitor Name	URL	Type	Expected
1	Homepage	https://groundwork.app/	HTTP(S)	200, <3s
2	API Health	https://api.groundwork.app/health	HTTP(S)	200, JSON {"status":"ok"}
3	Login Page	https://groundwork.app/login	HTTP(S)	200, <3s
4	API DB Check	https://api.groundwork.app/health/db	HTTP(S)	200, confirms Neon connectivity
5	File Upload CDN	https://files.groundwork.app/health.txt	HTTP(S)	200, confirms R2 public access
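
Monitors 2 and 4 imply two distinct endpoints: a cheap liveness check and a readiness check that actually touches Neon. A minimal sketch of the response logic follows, written as framework-independent functions so it can be wired into Fastify route handlers. The function names, the 2-second timeout, and the "db" field in the body are assumptions, not part of the plan above; ping stands for whatever query the API uses to test connectivity (for example SELECT 1).

```typescript
interface HealthResponse {
  statusCode: 200 | 503;
  body: { status: "ok" | "error"; db?: "up" | "down" };
}

// GET /health: liveness only, never touches the database.
function liveness(): HealthResponse {
  return { statusCode: 200, body: { status: "ok" } };
}

// GET /health/db: readiness, fails if the ping rejects or exceeds timeoutMs.
async function dbReadiness(
  ping: () => Promise<unknown>,
  timeoutMs = 2000,
): Promise<HealthResponse> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("db ping timed out")), timeoutMs);
  });
  try {
    await Promise.race([ping(), timeout]);
    return { statusCode: 200, body: { status: "ok", db: "up" } };
  } catch {
    return { statusCode: 503, body: { status: "error", db: "down" } };
  } finally {
    if (timer !== undefined) clearTimeout(timer); // avoid a dangling timer
  }
}
```

Keeping /health free of database calls means a Neon incident does not cause Fly.io to restart otherwise healthy machines, while /health/db still lets UptimeRobot see the outage.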

6.4 Structured Logging

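The API and Workers emit structured JSON logs (via pino) to stdout. Fly.io captures stdout and shows it in flyctl logs and the dashboard, but does not retain it for long. At Growth phase, ship logs to Grafana Cloud (Loki) for retention and search using Fly.io's fly-log-shipper app.

      Configure Fly.io → Grafana Cloud log shipping
      SHELL
    

# Ship logs to Grafana Cloud Loki (Growth phase). flyctl has no
# "logs drain" subcommand; Fly.io's route is the fly-log-shipper app.
# App name and variable names are assumptions: check its README first.
flyctl launch --no-deploy --name groundwork-log-shipper \
  --image ghcr.io/superfly/fly-log-shipper:latest
flyctl secrets set --app groundwork-log-shipper \
  ORG="personal" ACCESS_TOKEN="<fly-token>" \
  LOKI_URL="https://logs-prod-us-central1.grafana.net" \
  LOKI_USERNAME="<grafana-user>" LOKI_PASSWORD="<grafana-token>"
flyctl deploy --app groundwork-log-shipper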

# Sample structured log output (pino)
{"level":"info","time":1712073600000,"reqId":"abc-123",
 "method":"POST","url":"/api/projects","statusCode":201,
 "responseTime":42,"userId":"usr_xyz","projectId":"prj_456"}

6.5 Alert Rules

Condition	Threshold	Action	Owner
API health check fails	2 consecutive failures	PAGE immediately	On-call engineer
Neon DB connectivity fails	1 failure	PAGE immediately	On-call engineer
Fly.io machine crash loop	3 restarts in 5 min	PAGE immediately	On-call engineer
API p99 latency	> 2000ms for 5 min	Slack alert — investigate	Engineering channel
Sentry error rate spike	> 10 errors/min (new issue)	Slack alert — investigate	Engineering channel
CPU > 85%	Sustained 10 min	Slack alert — scale up	Engineering channel
Neon storage > 80% quota	Daily check	Slack alert — plan upgrade	Engineering channel
UptimeRobot homepage down	2 consecutive failures	PAGE immediately	On-call engineer
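
Two of the thresholds above are windowed conditions rather than simple comparisons: "N consecutive failures" and "3 restarts in 5 min". The sketch below shows one way to evaluate them, assuming check results and restart timestamps are already being collected somewhere; the function names and input shapes are hypothetical.

```typescript
// True when the most recent `threshold` checks all failed.
// `results` is ordered oldest to newest; true means the check passed.
function consecutiveFailures(results: readonly boolean[], threshold = 2): boolean {
  if (results.length < threshold) return false;
  return results.slice(-threshold).every((passed) => !passed);
}

// True when at least `maxRestarts` restarts fall inside the trailing window.
// Timestamps are epoch milliseconds.
function isCrashLoop(
  restartTimesMs: readonly number[],
  nowMs: number,
  maxRestarts = 3,
  windowMs = 5 * 60_000,
): boolean {
  const recent = restartTimesMs.filter(
    (t) => t <= nowMs && nowMs - t <= windowMs,
  );
  return recent.length >= maxRestarts;
}

// Example: one pass followed by two failures triggers the page.
console.log(consecutiveFailures([true, false, false]));
// Example: three restarts at 0, 1 and 2 minutes, evaluated at minute 4.
console.log(isCrashLoop([0, 60_000, 120_000], 240_000));
```

With UptimeRobot's 5-minute interval, "2 consecutive failures" means an outage can run for up to roughly 10 minutes before anyone is paged, which is worth keeping in mind when reading the RTO target.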

7. Cost Projections

All prices are based on publicly listed rates as of Q1 2026. "Launch" assumes 50 projects and ~100 users. "Growth" assumes 500 projects and ~1,500 users. "Scale" assumes 5,000 projects and ~15,000 users.

Service	Plan / Tier	Launch (Mo 1–3)	Growth (Mo 4–12)	Scale (Year 2)
Netlify	Pro ($19/mo flat)	$19/mo	$19/mo	$19/mo
Fly.io API	shared-cpu-1x, 512MB	$5/mo	$15/mo	$50/mo
Fly.io Workers	Same machine type, idle→scale	$0–5/mo	$10/mo	$30/mo
Neon PostgreSQL	Free → Pro ($19/mo) → Pro+	$0/mo	$19/mo	$69/mo
Cloudflare R2	10GB free, $0.015/GB after	$0/mo	$0/mo	$5/mo
Upstash Redis	Free (10K req/day) → Pay-per-use	$0/mo	$0/mo	$10/mo
Sentry	Free (5K errors/mo) → Team $26	$0/mo	$0/mo	$26/mo
Resend	Free (3K/mo) → Pro $20	$0/mo	$0/mo	$20/mo
UptimeRobot	Free (50 monitors, 5-min checks)	$0/mo	$0/mo	$0/mo
Domain + DNS	Cloudflare Registrar	~$1.25/mo	~$1.25/mo	~$1.25/mo
Total		~$25/mo	~$65/mo	~$230/mo

✓

FinOps notes: The free-tier strategy on Neon, Sentry, Resend, and Upstash saves approximately $75/mo during the Launch phase, based on the paid-tier prices in the table ($19 + $26 + $20 + $10). Upgrade triggers should be set proactively: move Neon to Pro when the project count exceeds 30 (to ensure PITR coverage before data becomes critical), and Sentry to Team when error volume approaches 4,000/month (80% of the 5K free limit).
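
The Total row can be reproduced directly from the per-service figures, which makes it easy to re-check whenever a price changes. This is a small worked calculation using only the numbers in the table; the Launch figure for Fly.io Workers uses the low end of the $0–5 range, and the variable names are illustrative.

```typescript
type Phase = "launch" | "growth" | "scale";

// Monthly cost in USD per service and phase, copied from the table above.
const monthlyCosts: Record<string, Record<Phase, number>> = {
  "Netlify":         { launch: 19,   growth: 19,   scale: 19 },
  "Fly.io API":      { launch: 5,    growth: 15,   scale: 50 },
  "Fly.io Workers":  { launch: 0,    growth: 10,   scale: 30 }, // launch: low end of $0-5
  "Neon PostgreSQL": { launch: 0,    growth: 19,   scale: 69 },
  "Cloudflare R2":   { launch: 0,    growth: 0,    scale: 5 },
  "Upstash Redis":   { launch: 0,    growth: 0,    scale: 10 },
  "Sentry":          { launch: 0,    growth: 0,    scale: 26 },
  "Resend":          { launch: 0,    growth: 0,    scale: 20 },
  "UptimeRobot":     { launch: 0,    growth: 0,    scale: 0 },
  "Domain + DNS":    { launch: 1.25, growth: 1.25, scale: 1.25 },
};

// Sum one column of the table.
function totalFor(phase: Phase): number {
  return Object.values(monthlyCosts).reduce((sum, row) => sum + row[phase], 0);
}

for (const phase of ["launch", "growth", "scale"] as const) {
  console.log(`${phase}: $${totalFor(phase).toFixed(2)}/mo`);
}
```

The sums come to $25.25, $64.25, and $230.25, which match the rounded totals of ~$25, ~$65, and ~$230; with a worker machine running at Launch the first figure rises to about $30.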

7.1 Fly.io Machine Cost Breakdown

Machine Type	$/mo (1 machine)	Launch Count	Growth Count	Scale Count
shared-cpu-1x 512MB	~$3.19/mo	1 API + 0–1 worker	2 API + 1 worker	4 API + 2 workers
shared-cpu-1x 1GB	~$5.70/mo	—	—	Consider at 200 req/s

8. Security Infrastructure

8.1 TLS / Transport Security

Layer	Certificate Provider	Minimum TLS	Notes
Netlify (frontend)	Let's Encrypt (auto-renew)	TLS 1.2	HSTS preloaded via header
Fly.io (API)	Let's Encrypt (auto-renew)	TLS 1.2	Auto-configured per app
Neon (database)	AWS ACM	TLS 1.2	Enforced; `sslmode=require` mandatory
Cloudflare R2	Cloudflare managed	TLS 1.2	Presigned URLs expire in 15 min

8.2 Network Security

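Fly.io's 6PN (private networking) is used for API-to-Worker communication. Workers never expose a public port. Neon's IP allowlist restricts database access to the addresses that Fly.io machines use for outbound traffic. These need to be static egress IPs: the addresses an app is reachable on are not necessarily the ones it connects out from.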

      Neon IP allowlist — add Fly.io outbound NAT IPs
      SHELL
    

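# List the IPs allocated to the app. These are its inbound addresses;
# confirm the machines' outbound (egress) IPs before relying on them,
# and allocate static egress IPs if the outbound addresses are dynamic.
flyctl ips list --app groundwork-api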

# Add to Neon via API or console
# Console: console.neon.tech → Project Settings → IP Allow
# Add each Fly.io IPv4 in CIDR notation: 1.2.3.4/32

# Verify connectivity from a running machine
flyctl ssh console --app groundwork-api
# Inside machine:
psql $DATABASE_URL -c "SELECT version();"

8.3 Content Security Policy

      CSP header — netlify.toml addition
      TOML
    

[[headers]]
  for = "/*"
  [headers.values]
    Content-Security-Policy = """
      default-src 'self';
      script-src 'self' 'unsafe-inline' https://browser.sentry-cdn.com;
      style-src 'self' 'unsafe-inline' https://fonts.googleapis.com;
      font-src 'self' https://fonts.gstatic.com;
      img-src 'self' data: https://files.groundwork.app;
      connect-src 'self' https://api.groundwork.app wss://api.groundwork.app
                  https://*.sentry.io https://o4504.ingest.sentry.io;
      frame-ancestors 'none';
      base-uri 'self';
      form-action 'self';
    """

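A long single-string policy is easy to break with a missing semicolon. As an alternative way to maintain it, the policy can be kept as data and serialized into the header value. The sketch below assumes the same directives as the netlify.toml block above; buildCsp is a hypothetical helper, and where the resulting string is applied (netlify.toml, or a response header set in a SvelteKit server hook) is left open.

```typescript
type CspDirectives = Record<string, readonly string[]>;

// Directive values copied from the CSP header above.
const directives: CspDirectives = {
  "default-src": ["'self'"],
  "script-src": ["'self'", "'unsafe-inline'", "https://browser.sentry-cdn.com"],
  "style-src": ["'self'", "'unsafe-inline'", "https://fonts.googleapis.com"],
  "font-src": ["'self'", "https://fonts.gstatic.com"],
  "img-src": ["'self'", "data:", "https://files.groundwork.app"],
  "connect-src": [
    "'self'",
    "https://api.groundwork.app",
    "https://*.sentry.io",
    "https://o4504.ingest.sentry.io",
  ],
  "frame-ancestors": ["'none'"],
  "base-uri": ["'self'"],
  "form-action": ["'self'"],
};

// Serialize to a single-line header value: "name src src; name src; ...".
function buildCsp(d: CspDirectives): string {
  return Object.entries(d)
    .filter(([, sources]) => sources.length > 0) // skip empty directives
    .map(([name, sources]) => `${name} ${sources.join(" ")}`)
    .join("; ");
}

console.log(buildCsp(directives));
```

Producing a single-line value also sidesteps any question about how line breaks inside a multi-line TOML string are handled when the header is emitted.
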
8.4 Rate Limiting

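Application-level rate limiting is implemented in the Fastify API using @fastify/rate-limit, backed by in-memory storage at Launch and Upstash Redis at Growth phase. In-memory counters are kept per machine, so while more than one API machine is running the effective limit is the table value multiplied by the machine count until the shared Redis store is in place.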

Endpoint Group	Limit	Window	Strategy
POST /auth/*	5 requests	1 minute	Per IP — prevents brute force
POST /api/projects	20 requests	1 minute	Per authenticated user
GET /api/*	200 requests	1 minute	Per authenticated user
Global fallback	500 requests	1 minute	Per IP — prevents DDoS
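
To make the Limit and Window columns concrete, here is a minimal fixed-window counter of the kind an in-memory store implements. It is an illustrative sketch, not the internals of @fastify/rate-limit; the class name and the key format are assumptions, and the clock is passed in so the behavior is easy to verify.

```typescript
interface RateRule {
  limit: number;    // max requests per window
  windowMs: number; // window length in milliseconds
}

class FixedWindowLimiter {
  private readonly rule: RateRule;
  private readonly counters = new Map<string, { windowStart: number; count: number }>();

  constructor(rule: RateRule) {
    this.rule = rule;
  }

  // Returns true if the request identified by `key` is allowed at `nowMs`.
  allow(key: string, nowMs: number): boolean {
    const entry = this.counters.get(key);
    // No entry yet, or the previous window has elapsed: start a new window.
    if (!entry || nowMs - entry.windowStart >= this.rule.windowMs) {
      this.counters.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.rule.limit) return false;
    entry.count += 1;
    return true;
  }
}

// POST /auth/* row: 5 requests per minute, keyed by client IP.
const authLimiter = new FixedWindowLimiter({ limit: 5, windowMs: 60_000 });
const results: boolean[] = [];
for (let i = 0; i < 6; i++) results.push(authLimiter.allow("ip:203.0.113.7", i * 1000));
console.log(results); // five allowed requests, then one rejected
```

Because the map lives in process memory, each API machine counts independently, which is the reason the plan moves the store to Redis once more than one machine is serving traffic.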

8.5 IAM Principles

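Fly.io deploy tokens should be scoped to individual apps (flyctl tokens create deploy --app <app>), so that the token for groundwork-api cannot deploy to groundwork-workers and vice versa. The workflow in Section 4 uses a single FLY_API_TOKEN for both apps, which requires either an org-scoped token or splitting the deploy into two jobs with one per-app token each.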
Neon roles: one app_user role with SELECT/INSERT/UPDATE/DELETE on app tables; a separate migrator role with DDL rights used only in CI migrations.
R2 API tokens: one token scoped to the production bucket with write access; a separate read-only token for presigned URL generation if needed.
Resend API keys: one per environment (production, staging). Rotate quarterly.

9. Disaster Recovery

Metric	Target	Mechanism
RTO (Recovery Time Objective)	30 minutes	Fly.io machine restart (auto <2 min) + Neon branch restore (manual, up to 28 min)
RPO (Recovery Point Objective)	5 minutes	Neon continuous WAL archiving — data loss window is the WAL shipping interval

Fly.io API machine crash / unresponsive

Severity: High Auto-recovery: Yes

Fly.io auto-restarts the crashed machine within 30–60 seconds. If min_machines_running = 1, a new machine is started immediately.
If auto-restart fails repeatedly (crash loop), SSH into the machine: flyctl ssh console --app groundwork-api and inspect logs: flyctl logs --app groundwork-api.
If the latest deploy is the cause, roll back immediately: flyctl deploy --image <previous-image-id> --app groundwork-api.
If the issue is a dependency (Neon down, Resend down), check respective status pages and implement a 503 maintenance response in the health check.
Once stable, write a postmortem and add a test case that would have caught the regression.

Neon PostgreSQL regional outage

Severity: Critical Manual: ~25 min

Confirm outage is on Neon's side: check status.neon.tech. If Neon is healthy, the issue is the connection string or IP allowlist.
Enable maintenance mode on the API (return 503 with Retry-After header) to prevent partial failures from reaching users.
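If Neon declares an outage lasting >15 min, create a restore branch from the last known-good point in time: neonctl branches create --name dr-restore --parent <last-known-good-ts>. Branches live in the same region as the project, so this step is only possible once Neon's control plane for us-east-1 is reachable again.
Update DATABASE_URL in Fly.io secrets to point to the restore branch, for both apps: flyctl secrets set DATABASE_URL="<new-url>" --app groundwork-api, then repeat with --app groundwork-workers.
flyctl secrets set redeploys the app's machines, so no separate restart is needed; confirm with flyctl status --app groundwork-api. Disable maintenance mode. Monitor for errors.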
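
The maintenance-mode step above (503 with a Retry-After header) can be sketched as a small pure function that a Fastify onRequest hook would consult before routing. The flag source, the function name, and the 120-second default are assumptions for illustration; the plan does not specify how maintenance mode is toggled.

```typescript
interface MaintenanceReply {
  statusCode: 503;
  headers: Record<string, string>;
  body: { error: string; retryAfterSeconds: number };
}

// Returns the reply to send while maintenance mode is on, or null to let
// the request continue to its normal handler.
function maintenanceReply(
  enabled: boolean,
  retryAfterSeconds = 120,
): MaintenanceReply | null {
  if (!enabled) return null;
  return {
    statusCode: 503,
    headers: {
      // Retry-After takes a whole number of seconds (or an HTTP date).
      "Retry-After": String(Math.max(1, Math.round(retryAfterSeconds))),
      "Cache-Control": "no-store", // keep intermediaries from caching the 503
    },
    body: { error: "maintenance", retryAfterSeconds },
  };
}

console.log(maintenanceReply(true, 300));
```

If the flag is read from a Fly.io secret (for example a hypothetical MAINTENANCE_MODE variable), toggling it redeploys the machines; a flag stored in the database would not work for this scenario, since the database is the component that is down.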

Accidental data deletion (user or application bug)

Severity: High Manual: ~15 min

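Immediately identify the timestamp of the bad operation from Fly.io logs: flyctl logs --no-tail --app groundwork-api | grep "<affected-entity-id>" (without --no-tail the command keeps streaming and never exits).
Create a Neon restore branch at the moment before the deletion: neonctl branches create --name data-restore --parent "<ISO-timestamp>". neonctl takes the point in time through --parent, and a bare timestamp is resolved against the default branch (main).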
Connect to the restore branch and export the affected rows to a SQL file: pg_dump --table=affected_table --data-only -f restore.sql <restore-branch-url>.
Re-import the rows into the production branch: psql $DATABASE_URL_DIRECT < restore.sql.
Verify row counts and spot-check data integrity. Delete the restore branch: neonctl branches delete data-restore.

Bad production deploy breaks the application

Severity: High Recovery: ~3 min

Identify the bad deploy via Sentry error spike or UptimeRobot alert.
Find the last good image ID: flyctl releases list --app groundwork-api.
Roll back the API immediately: flyctl deploy --image <previous-image-id> --app groundwork-api --strategy immediate.
For Netlify, roll back in the Netlify dashboard under Deploys → select previous deploy → Publish deploy. Or via CLI: netlify deploy --prod --dir=<previous-publish-dir>.
If a migration was run as part of the bad deploy and needs to be reversed, apply a compensating migration (never use destructive rollbacks on production data).

Cloudflare R2 objects corrupted or bucket deleted

Severity: Medium Manual: ~30 min

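R2 does not offer built-in object versioning, so there is nothing to enable in the Cloudflare dashboard and a deleted or overwritten object cannot be recovered from R2 itself. At Growth phase, reduce the exposure at the application level by writing each upload under a unique key instead of overwriting objects in place.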
For the Launch phase, the database stores file metadata (key, size, content-type). If objects are deleted from R2 but records exist in Neon, the data loss is limited to the binary files only.
At Scale phase, implement a nightly sync job that copies all R2 objects to an archival Cloudflare R2 bucket in a different account as a cold backup.
For immediate recovery of a corrupted file, check if the user has a local copy or if the file was uploaded recently (Sentry + API logs will show the upload request).
Communicate timeline to affected users via in-app notification (dispatched via the Workers job queue) and by direct email.

10. Launch Checklist

Complete all items before marking the infrastructure as production-ready. Commands assume you are authenticated with Fly.io (flyctl auth login), Netlify (netlify login), and Neon (neonctl auth).

01

Create Neon project in us-east-1: neonctl projects create --name groundwork --region-id aws-us-east-1 --pg-version 16
02

Create Neon branches staging and dev (main exists by default): neonctl branches create --name staging --parent main, then neonctl branches create --name dev --parent staging
03

Create Fly.io app for API: flyctl apps create groundwork-api --org personal, then copy fly.toml from Section 2.2
04

Create Fly.io app for Workers: flyctl apps create groundwork-workers --org personal, then copy fly.toml from Section 2.3
05

Set all Fly.io secrets for both apps: run flyctl secrets set DATABASE_URL="..." R2_ACCESS_KEY_ID="..." R2_SECRET_ACCESS_KEY="..." SESSION_SECRET="$(openssl rand -hex 32)" --app groundwork-api, then flyctl secrets set DATABASE_URL="..." RESEND_API_KEY="..." --app groundwork-workers
06

Configure Neon IP allowlist with Fly.io egress IPs: run flyctl ips list --app groundwork-api, confirm the listed addresses match the machines' outbound IPs, then add each IP in Neon Console → Project Settings → IP Allow
07

Run all database migrations against production branch: DATABASE_URL_DIRECT="<neon-direct-url>" npm run migrate. Verify all migrations succeed with exit code 0.
08

Create Cloudflare R2 production bucket: In Cloudflare dashboard → R2 → Create bucket named groundwork-production. Create an API token with Object Read & Write scope restricted to this bucket only.
09

Connect Netlify site to GitHub repo: netlify init in the project root, or connect via Netlify dashboard. Confirm build command is npm run build and publish directory matches netlify.toml.
10

Set all Netlify environment variables: In Netlify dashboard → Site settings → Environment variables, set VITE_SENTRY_DSN, VITE_API_BASE_URL=https://api.groundwork.app, VITE_R2_PUBLIC_URL
11

Configure custom domain on Netlify: Netlify dashboard → Domain management → Add custom domain groundwork.app. Point DNS to Netlify's nameservers or add the CNAME/A record as directed.
12

Configure custom domain on Fly.io API: flyctl certs create api.groundwork.app --app groundwork-api. Add the CNAME record to DNS as instructed by the output. Verify with flyctl certs check api.groundwork.app --app groundwork-api.
13

Add GitHub Actions secrets: In GitHub → repo → Settings → Secrets, add FLY_API_TOKEN (prefer a scoped token from flyctl tokens create over the personal token printed by flyctl auth token), NEON_API_KEY, NEON_PROJECT_ID, DATABASE_URL_DIRECT_PROD
14

Create GitHub Actions environment named "production" with approval gate: GitHub → repo → Settings → Environments → New environment → production. Add required reviewers. This gates all production deploys.
15

First production deploy via GitHub Actions: Push to main branch. Verify the workflow completes without errors. Check that flyctl status --app groundwork-api shows all machines as started.
16

Configure Sentry project and verify error reporting: Create Sentry project for groundwork. Trigger a test error via the Sentry debug endpoint. Confirm it appears in the Sentry dashboard within 30 seconds.
17

Set up UptimeRobot monitors for all 5 endpoints: Create monitors as listed in Section 6.3. Set alert contacts to on-call email and Slack webhook. Run a test alert to verify delivery.
18

Verify Neon PITR is active on Pro plan: Confirm the project is on the Neon Pro plan (required for PITR). Check Console → Backups to see WAL archiving is enabled and shows recent timestamps.
19

Run end-to-end smoke test against production: Create a test user account, create a project, upload a photo, trigger an email notification, verify it arrives. Check Sentry for zero new errors. Check Fly.io logs for clean request logs.
20

Document all service account credentials in team password manager: Store Neon project ID + admin credentials, Fly.io org slug, Cloudflare Account ID + R2 bucket name, Resend API key, Sentry DSN, and UptimeRobot API key in 1Password or Bitwarden under the "Groundwork Infrastructure" vault.

✓

All 20 items complete? The infrastructure is production-ready. Estimated total setup time for an experienced DevOps engineer: 4–6 hours. Share this document link in the team Slack channel and schedule a 30-minute runbook walkthrough with the full engineering team before the first user-facing launch.