Infrastructure Plan

Groundwork — Shared Reality Platform for Home Renovation
Version: 1.0
Status: Draft
Last Updated: April 2026
Audience: Engineering / DevOps

This document is the authoritative reference for how Groundwork is deployed and operated. It is specific enough that a DevOps engineer with access to the listed accounts can stand up the full stack in a single day. All CLI commands assume macOS/Linux with flyctl, netlify-cli, and gh installed.

1. Infrastructure Overview

Groundwork runs on four independent control planes: Netlify for the frontend, Fly.io for the API and background workers, Neon for the database, and Cloudflare R2 for file storage. This decomposition keeps each layer independently scalable and avoids any single-vendor lock-in on the critical path.

[Architecture diagram: four layers and their connections. Client layer: Browser / PWA at groundwork.app over HTTPS (TLS 1.3). Frontend / edge layer: Netlify CDN (static assets, 100+ PoPs, auto-SSL) and SvelteKit SSR in Netlify Functions (server load functions, session / auth cookies, form actions, API proxy to Fly.io). API / data layer: Fly.io API (Node.js, REST + WebSocket, autoscale 1 to 4 machines), Fly.io Workers (separate app, reached over 6PN), Neon PostgreSQL (pooled SQL connections, PITR), Cloudflare R2 (S3-compatible API). External services: Resend (email), Sentry (errors), Upstash Redis (cache), UptimeRobot (checks), Grafana Cloud (logs). The legend distinguishes request flow from async / internal traffic.]
Frontend
Netlify
Pro plan — $19/mo
  • SvelteKit SSR via adapter-netlify
  • Global CDN for static assets
  • Preview deploys per pull request
  • Auto-SSL, custom headers, redirects
API
Fly.io API
shared-cpu-1x · 512MB — $5–50/mo
  • Node.js (Fastify) REST + WebSocket
  • Primary region: iad (US East)
  • Auto-scale 1–4 machines
  • Private 6PN networking to workers
Workers
Fly.io Workers
Separate app · scales to zero
  • Background job processing queue
  • Email renders + dispatch via Resend
  • PDF generation, notification fan-out
  • Cold-start acceptable (async work)
Database
Neon PostgreSQL
Free → Pro — $0–69/mo
  • Serverless autoscale (0–4 CU)
  • Built-in connection pooler (pgbouncer)
  • PITR 7-day retention (Pro)
  • Branch per PR for schema testing
Storage
Cloudflare R2
10GB free · $0.015/GB thereafter
  • Zero egress fees
  • S3-compatible API (AWS SDK works)
  • Project photos, PDFs, attachments
  • R2 public bucket for presigned URLs
Cache
Upstash Redis
Free tier (10K req/day) · $10/mo at scale
  • Session cache, rate-limit counters
  • Deferred until Growth phase
  • Serverless — pay per request
  • REST API (no persistent connection)

2. Hosting Architecture

2.1 Netlify — Frontend

The SvelteKit application is deployed to Netlify using @sveltejs/adapter-netlify. Static assets (JS bundles, CSS, fonts, images) are served from Netlify's global CDN. Server-side rendering runs inside Netlify Functions (AWS Lambda under the hood, 1 vCPU, 1024MB, 10s timeout).

netlify.toml TOML
[build]
  command   = "npm run build"
  publish   = "build"   # output directory written by adapter-netlify

[build.environment]
  NODE_VERSION = "20"

# SSR function handler: adapter-netlify generates and registers it during
# the build, so no [functions] entry is declared here

# Cache static assets aggressively
[[headers]]
  for = "/_app/immutable/*"
  [headers.values]
    Cache-Control = "public, max-age=31536000, immutable"

# Security headers on all routes
[[headers]]
  for = "/*"
  [headers.values]
    X-Frame-Options        = "DENY"
    X-Content-Type-Options = "nosniff"
    Referrer-Policy        = "strict-origin-when-cross-origin"
    Permissions-Policy     = "camera=(), microphone=(), geolocation=()"
    Strict-Transport-Security = "max-age=63072000; includeSubDomains; preload"

# Proxy /api/* to the Fly.io API (status 200 makes this a rewrite, not a redirect)
[[redirects]]
  from   = "/api/*"
  to     = "https://api.groundwork.app/:splat"
  status = 200
  force  = true

2.2 Fly.io — API Service

The API is a Node.js (Fastify) application running on Fly.io. Node.js is recommended over Go here because the team can share code and types between the SvelteKit frontend and the API (e.g., Zod schemas, shared utility functions), reducing duplication and accelerating the early build. Go would be appropriate if the API needs to handle sustained CPU-bound work at high concurrency — that's not the Launch or Growth profile.

fly.toml — groundwork-api TOML
app      = "groundwork-api"
primary_region = "iad"

[build]
  dockerfile = "Dockerfile"

[env]
  PORT        = "8080"
  NODE_ENV    = "production"
  LOG_LEVEL   = "info"

[[services]]
  protocol      = "tcp"
  internal_port = 8080

  # Autoscale between 1 and N machines, where N is the number of machines
  # created for the app (for a maximum of 4: flyctl scale count 4)
  auto_stop_machines   = true   # stop idle machines above the minimum
  auto_start_machines  = true
  min_machines_running = 1      # keep at least one machine warm

  [services.concurrency]
    type       = "requests"
    hard_limit = 200
    soft_limit = 150

  [[services.ports]]
    port     = 443
    handlers = ["tls", "http"]

  [[services.http_checks]]
    interval      = "15s"
    timeout       = "5s"
    grace_period  = "10s"
    method        = "GET"
    path          = "/health"
    protocol      = "http"

[[vm]]
  cpu_kind = "shared"
  cpus     = 1
  memory_mb= 512

2.3 Fly.io — Background Workers

Workers are a separate Fly.io app (groundwork-workers) so they can scale independently of the API and be deployed or restarted without impacting live traffic. They pull jobs from a queue stored in Neon (via the pg-boss library) and scale to zero between bursts.

fly.toml — groundwork-workers TOML
app            = "groundwork-workers"
primary_region = "iad"

[build]
  dockerfile = "Dockerfile.worker"

# No public service — workers are internal only
# They communicate outbound only (Resend, Neon, R2)

[[vm]]
  cpu_kind  = "shared"
  cpus      = 1
  memory_mb = 512

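# Scale to zero. A stopped machine cannot poll pg-boss, so it has to be woken
# from outside: Fly.io only auto-starts a machine when a request arrives for
# it through the Fly proxy. Assumption (not in the original plan): the API
# sends a private wake-up request to the port below after enqueueing a job.
# Do not allocate a public IP to this app.
[http_service]
  internal_port        = 8081   # assumed wake-up / health port
  auto_stop_machines   = true
  auto_start_machines  = true
  min_machines_running = 0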

pg-boss for job queue: Rather than adding Redis in the early stages, use pg-boss which implements a reliable job queue directly on top of PostgreSQL. This removes one external dependency at Launch. Migrate to a dedicated queue (BullMQ + Upstash Redis) only when you hit Growth phase and see queue contention.

2.4 Region Strategy

Phase | Fly.io Regions | Neon Region | Rationale
Launch (Mo 1–3) | iad | us-east-1 | US-focused early users; minimize latency to Neon
Growth (Mo 4–12) | iad + ord | us-east-1 | Add Chicago for US resilience; primary DB stays east
Scale (Year 2) | iad + ord + lax | us-east-1 (+ read replica) | West coast coverage; Neon read replica cuts latency

3. Database Setup

3.1 Neon Project Configuration

One Neon project holds all environments as separate branches. The project lives in us-east-1 (AWS), co-located with Fly.io iad to minimize network round-trips.

Initial setup SHELL
# Install Neon CLI
npm install -g neonctl
neonctl auth

# Create project (do this once)
neonctl projects create \
  --name groundwork \
  --region-id aws-us-east-1 \
  --pg-version 16

# Create staging branch from main
neonctl branches create --name staging --parent main

# List branch connection strings (the branch name is a positional argument)
neonctl connection-string main
neonctl connection-string staging

# Create a per-PR branch (run in CI)
neonctl branches create \
  --name "preview/pr-$PR_NUMBER" \
  --parent staging

3.2 Connection Pooling

Every application connects through Neon's built-in pgbouncer endpoint (port 5432, transaction pooling mode). Direct connections (port 5432 on the non-pooler hostname) are used only for migrations, which require session mode.

Connection Type | Hostname Pattern | Port | Use For
Pooled (pgbouncer) | ep-xxx-pooler.us-east-1.aws.neon.tech | 5432 | API, Workers (all runtime queries)
Direct | ep-xxx.us-east-1.aws.neon.tech | 5432 | Migrations only (requires session mode)
Connection string management — .env structure ENV
# Runtime connection (pooled — use this in app code)
DATABASE_URL=postgres://user:pass@ep-xxx-pooler.us-east-1.aws.neon.tech/neondb?sslmode=require

# Migration connection (direct — use only in migration scripts)
DATABASE_URL_DIRECT=postgres://user:pass@ep-xxx.us-east-1.aws.neon.tech/neondb?sslmode=require

# SSL is enforced; never disable sslmode
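
The pooled and direct connection strings differ only in the "-pooler" suffix on the endpoint hostname, so one can be derived from the other. The following is a minimal sketch of that derivation, assuming the hostname pattern shown in the table above; the function names (splitNeonUrl, toPooledUrl, toDirectUrl, hasSslRequired) are hypothetical and not part of any Neon library.

```typescript
// Hypothetical helpers; the "-pooler" hostname convention is taken from the
// connection table in this section.
interface UrlParts {
  prefix: string;   // scheme and credentials, e.g. "postgres://user:pass@"
  endpoint: string; // first hostname label, e.g. "ep-xxx" or "ep-xxx-pooler"
  rest: string;     // remaining hostname, path, and query string
}

function splitNeonUrl(url: string): UrlParts {
  const match = /^(postgres(?:ql)?:\/\/(?:[^@\/]+@)?)([^.\/:?]+)(\..*)$/.exec(url);
  if (match === null) {
    throw new Error("Not a recognisable Postgres connection string");
  }
  return { prefix: match[1] ?? "", endpoint: match[2] ?? "", rest: match[3] ?? "" };
}

// Runtime URL: add the "-pooler" suffix unless it is already present.
function toPooledUrl(url: string): string {
  const { prefix, endpoint, rest } = splitNeonUrl(url);
  const pooled = endpoint.endsWith("-pooler") ? endpoint : `${endpoint}-pooler`;
  return `${prefix}${pooled}${rest}`;
}

// Migration URL: strip the "-pooler" suffix if present.
function toDirectUrl(url: string): string {
  const { prefix, endpoint, rest } = splitNeonUrl(url);
  return `${prefix}${endpoint.replace(/-pooler$/, "")}${rest}`;
}

// Guard for the "never disable sslmode" rule.
function hasSslRequired(url: string): boolean {
  return /[?&]sslmode=require(?:&|$)/.test(url);
}
```

A startup check could call hasSslRequired on DATABASE_URL and refuse to boot when it returns false, which turns the SSL rule into something enforced rather than documented.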

3.3 Backup & Point-in-Time Recovery

Neon's Pro plan includes continuous WAL archiving with a 7-day PITR window. No additional backup tooling is required for the Launch or Growth phases.

Branch restore from PITR SHELL
# Create a restore branch from a specific timestamp
# (a point-in-time parent is written as <branch>@<timestamp>)
neonctl branches create \
  --name restore-2026-04-01 \
  --parent "main@2026-04-01T12:00:00Z"

# Verify data in the restore branch, then make it the default branch.
# Applications keep using the old branch until DATABASE_URL is repointed.
neonctl branches set-default restore-2026-04-01

3.4 Monitoring

What to Watch | Where | Alert Threshold
Active connections | Neon dashboard → Monitoring | >80 pooler connections
Slow queries | Neon → Query stats (pg_stat_statements) | p99 > 500ms
Storage usage | Neon dashboard → Usage | >80% of plan limit
Compute uptime | Neon dashboard → Compute | Unexpected auto-suspend during peak

4. CI/CD Pipeline

All code flows through GitHub. Netlify auto-deploys are the primary mechanism for the frontend. Fly.io deployments are driven by GitHub Actions to ensure migrations run before the new application code starts serving traffic.

Stage 1
Lint + Test
  • ESLint + Prettier
  • TypeScript check
  • Vitest unit tests
  • Playwright smoke
Stage 2
Preview Deploy
  • Netlify preview URL
  • Neon PR branch
  • Run migrations
  • Post URL to PR
Stage 3
Staging
  • Fly.io staging app
  • Neon staging branch
  • Integration tests
  • Manual approval gate
Stage 4
Production
  • Run DB migrations
  • Fly.io deploy
  • Netlify auto-deploy
  • Smoke test suite
.github/workflows/deploy.yml YAML
name: Deploy

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
  NEON_API_KEY:  ${{ secrets.NEON_API_KEY }}
  NEON_PROJECT_ID: ${{ secrets.NEON_PROJECT_ID }}

jobs:
  # ── 1. Lint & Test ─────────────────────────────────────
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci
      - run: npm run lint
      - run: npm run check     # svelte-check + tsc
      - run: npm run test:unit

  # ── 2. PR Preview (branch deploys only) ────────────────
  preview:
    if: github.event_name == 'pull_request'
    needs: [test]
    runs-on: ubuntu-latest
    steps:
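      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci            # the migrate script needs project dependencies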

      # Create a Neon branch for this PR
      - name: Create Neon preview branch
        uses: neondatabase/create-branch-action@v5
        id: neon-branch
        with:
          project_id: ${{ env.NEON_PROJECT_ID }}
          api_key:    ${{ env.NEON_API_KEY }}
          branch_name: preview/pr-${{ github.event.number }}
          parent:     staging

      # Run migrations against the preview branch
      - name: Run migrations
        run: npm run migrate
        env:
          DATABASE_URL_DIRECT: ${{ steps.neon-branch.outputs.db_url }}

      # Netlify handles the actual preview deploy automatically
      # We just need to inject the Neon branch URL as an env var
      - name: Comment preview URL on PR
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `Preview DB branch: \`preview/pr-${context.issue.number}\``
            })

  # ── 3. Production Deploy (main branch only) ────────────
  deploy:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    needs: [test]
    runs-on: ubuntu-latest
    environment: production    # GitHub environment with approval gate
    steps:
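      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20', cache: 'npm' }
      - run: npm ci            # the migrate script needs project dependencies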

      # Run migrations BEFORE deploying new code
      - name: Run production migrations
        run: npm run migrate
        env:
          DATABASE_URL_DIRECT: ${{ secrets.DATABASE_URL_DIRECT_PROD }}

      # Deploy API to Fly.io
      - uses: superfly/flyctl-actions/setup-flyctl@master
      - name: Deploy API
        run: flyctl deploy --app groundwork-api --strategy rolling

      # Deploy workers
      - name: Deploy Workers
        run: flyctl deploy --app groundwork-workers --strategy immediate

      # Netlify deploys automatically on push to main via Git integration
      # No step needed here unless you want to block on Netlify completion

4.1 Rollback Procedures

Layer | Rollback Command | Time
Fly.io API | flyctl releases list --app groundwork-api, then flyctl deploy --image <previous-image> --app groundwork-api | ~2 min
Netlify | netlify deploys list, then netlify deploy --prod --dir=<old-build> | ~1 min
Database | neonctl branches create --name rollback --parent main@<ts> | ~5 min

Migration safety rule: All migrations must be backward-compatible with the previous version of the application code. Use a two-phase approach: first deploy a migration that is compatible with both old and new code, then deploy the code change. Never drop columns or rename them in the same deploy as the code that removes their usage.
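
The rule above can be partly enforced in CI by scanning migration files for statements that break the previous application version. The following is a minimal sketch, assuming migrations are plain SQL files; the function name findDestructiveStatements and the rule names are hypothetical, and the patterns are deliberately simple (for example, they do not catch ALTER TABLE ... RENAME old TO new written without the COLUMN keyword).

```typescript
// Hypothetical lint for the migration safety rule: flags statements that are
// unsafe to ship in the same deploy as the code that stops using them.
interface Finding {
  line: number;      // 1-based line number in the migration file
  rule: string;      // which pattern matched
  statement: string; // the offending line with comments stripped
}

const DESTRUCTIVE_PATTERNS: ReadonlyArray<{ rule: string; pattern: RegExp }> = [
  { rule: "drop-column", pattern: /\bDROP\s+COLUMN\b/i },
  { rule: "rename", pattern: /\bRENAME\s+(?:COLUMN|TO)\b/i },
  { rule: "drop-table", pattern: /\bDROP\s+TABLE\b/i },
];

function findDestructiveStatements(sql: string): Finding[] {
  const findings: Finding[] = [];
  sql.split("\n").forEach((raw, index) => {
    // Ignore SQL line comments so commented-out statements are not flagged.
    const statement = raw.replace(/--.*$/, "").trim();
    if (statement === "") return;
    for (const { rule, pattern } of DESTRUCTIVE_PATTERNS) {
      if (pattern.test(statement)) {
        findings.push({ line: index + 1, rule, statement });
      }
    }
  });
  return findings;
}
```

A CI step could fail the build when the returned list is non-empty unless the pull request carries an explicit "phase 2" label, which keeps destructive changes possible but deliberate.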

5. Environment Management

Environment | Frontend | API | Database | Storage | Jobs
Local | vite dev (port 5173) | node --watch (port 8080) | Neon dev branch | R2 dev bucket | in-process (no worker app)
Preview | Netlify preview URL | groundwork-api (prod) | Neon preview/pr-<N> branch | R2 dev bucket | groundwork-workers
Staging | staging.groundwork.app | groundwork-api-staging | Neon staging branch | R2 staging bucket | groundwork-workers-staging
Production | groundwork.app | groundwork-api | Neon main branch | R2 production bucket | groundwork-workers

5.1 Secrets Management

Secret | Stored In | Injected Into
DATABASE_URL | Fly.io secrets | API, Workers runtime
DATABASE_URL_DIRECT | GitHub Actions secrets | Migration step in CI only
RESEND_API_KEY | Fly.io secrets | Workers runtime
SENTRY_DSN | Netlify env vars + Fly.io secrets | Frontend build + API runtime
R2_ACCESS_KEY_ID | Fly.io secrets | API runtime
R2_SECRET_ACCESS_KEY | Fly.io secrets | API runtime
SESSION_SECRET | Fly.io secrets | API runtime
UPSTASH_REDIS_URL | Fly.io secrets | API runtime (Growth+)
FLY_API_TOKEN | GitHub Actions secrets | CI deploy step
NEON_API_KEY | GitHub Actions secrets | CI branch creation step
Setting secrets on Fly.io SHELL
# Set secrets (never committed to git)
flyctl secrets set \
  DATABASE_URL="postgres://..." \
  RESEND_API_KEY="re_..." \
  R2_ACCESS_KEY_ID="..." \
  R2_SECRET_ACCESS_KEY="..." \
  SESSION_SECRET="$(openssl rand -hex 32)" \
  --app groundwork-api

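# Workers are a separate app with their own secrets
# (DATABASE_URL and RESEND_API_KEY, per the table above)
flyctl secrets set \
  DATABASE_URL="postgres://..." \
  RESEND_API_KEY="re_..." \
  --app groundwork-workers

# Verify (values are redacted in output)
flyctl secrets list --app groundwork-api
flyctl secrets list --app groundwork-workers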

6. Monitoring & Observability

6.1 Metrics

Fly.io provides built-in machine metrics (CPU, memory, network) visible in the dashboard at fly.io/apps/groundwork-api/metrics. No additional agent is required for infrastructure metrics at the Launch phase.

Signal | Source | Dashboard
CPU / Memory | Fly.io built-in | fly.io/apps/groundwork-api/metrics
HTTP request rate + latency | Fly.io built-in | Same dashboard, request metrics tab
Machine restarts | Fly.io events | fly.io/apps/groundwork-api/events
DB query stats | Neon console | console.neon.tech → Monitoring
Uptime / availability | UptimeRobot | uptimerobot.com dashboard

6.2 Error Tracking — Sentry

Sentry is installed in both the SvelteKit frontend (@sentry/sveltekit) and the Fastify API (@sentry/node). They share one Sentry project, differentiated by environment tags.

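Sentry initialization: SvelteKit hooks.client.ts TypeScript
// src/hooks.client.ts
// Session Replay runs only in the browser, so this configuration belongs in
// the client hook. hooks.server.ts calls Sentry.init with the same dsn,
// environment, and tracesSampleRate, without the replay options.
import * as Sentry from '@sentry/sveltekit';

Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN,
  environment: import.meta.env.MODE,
  tracesSampleRate: 0.1,      // 10% trace sampling
  profilesSampleRate: 0.05,    // 5% profiling (needs a profiling integration to take effect)
  integrations: [
    Sentry.replayIntegration({
      maskAllText: true,         // PII protection
      blockAllMedia: false,
    }),
  ],
  replaysSessionSampleRate:  0.01,
  replaysOnErrorSampleRate: 1.0,
});

// Forward unhandled client-side errors to Sentry
export const handleError = Sentry.handleErrorWithSentry();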

6.3 UptimeRobot — Synthetic Checks

Five monitors cover the critical user journeys. Alert contacts: email + Slack webhook. Check interval: 5 minutes on the free plan.

# | Monitor Name | URL | Type | Expected
1 | Homepage | https://groundwork.app/ | HTTP(S) | 200, <3s
2 | API Health | https://api.groundwork.app/health | HTTP(S) | 200, JSON {status:"ok"}
3 | Login Page | https://groundwork.app/login | HTTP(S) | 200, <3s
4 | API DB Check | https://api.groundwork.app/health/db | HTTP(S) | 200, confirms Neon connectivity
5 | File Upload CDN | https://files.groundwork.app/health.txt | HTTP(S) | 200, confirms R2 public access
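
Monitors 2 and 4 depend on the API exposing /health and /health/db with the response shapes listed above. The following is a framework-independent sketch of those two handlers, assuming a liveness check that never touches the database and a separate database check that runs a caller-supplied ping (for example SELECT 1). The names HttpReply, healthReply, and dbHealthReply are hypothetical; wiring them into Fastify routes is left out.

```typescript
// Framework-independent reply shape; a Fastify route would copy these fields
// onto its own reply object.
interface HttpReply {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

function jsonReply(statusCode: number, payload: Record<string, string>): HttpReply {
  return {
    statusCode,
    // Health responses should never be cached by an intermediary.
    headers: { "Content-Type": "application/json", "Cache-Control": "no-store" },
    body: JSON.stringify(payload),
  };
}

// GET /health: liveness only, so a database outage does not restart machines.
function healthReply(): HttpReply {
  return jsonReply(200, { status: "ok" });
}

// GET /health/db: runs the supplied ping and maps the outcome to a status code.
async function dbHealthReply(ping: () => Promise<unknown>): Promise<HttpReply> {
  try {
    await ping();
    return jsonReply(200, { status: "ok", db: "reachable" });
  } catch {
    return jsonReply(503, { status: "error", db: "unreachable" });
  }
}
```

Keeping the two checks separate is a design choice: the Fly.io health check in fly.toml points at /health, so only the UptimeRobot database monitor reacts to a Neon problem.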

6.4 Structured Logging

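The API and Workers emit structured JSON logs (via pino) to stdout, which Fly.io captures. At Growth phase, ship these logs to Grafana Cloud (Loki) for retention and search.

Ship Fly.io logs to Grafana Cloud SHELL
# Growth phase. flyctl has no built-in log drain command; Fly.io's documented
# route is the fly-log-shipper app, deployed as its own Fly app. The variable
# names below follow that project's README; verify them against the current
# version before use.
flyctl launch --no-deploy --image ghcr.io/superfly/fly-log-shipper:latest
flyctl secrets set \
  ORG="personal" \
  ACCESS_TOKEN="<fly-read-token>" \
  LOKI_URL="https://logs-prod-us-central1.grafana.net/loki/api/v1/push" \
  LOKI_USERNAME="<grafana-user-id>" \
  LOKI_PASSWORD="<grafana-token>"
flyctl deploy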

# Sample structured log output (pino)
{"level":"info","time":1712073600000,"reqId":"abc-123",
 "method":"POST","url":"/api/projects","statusCode":201,
 "responseTime":42,"userId":"usr_xyz","projectId":"prj_456"}

6.5 Alert Rules

Condition | Threshold | Action | Owner
API health check fails | 2 consecutive failures | PAGE immediately | On-call engineer
Neon DB connectivity fails | 1 failure | PAGE immediately | On-call engineer
Fly.io machine crash loop | 3 restarts in 5 min | PAGE immediately | On-call engineer
API p99 latency | > 2000ms for 5 min | Slack alert: investigate | Engineering channel
Sentry error rate spike | > 10 errors/min (new issue) | Slack alert: investigate | Engineering channel
CPU > 85% | Sustained 10 min | Slack alert: scale up | Engineering channel
Neon storage > 80% quota | Daily check | Slack alert: plan upgrade | Engineering channel
UptimeRobot homepage down | 2 consecutive failures | PAGE immediately | On-call engineer
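
Two of the paging conditions above ("2 consecutive failures" and "3 restarts in 5 min") are easy to misread, so it can help to pin them down as code. The following is a minimal sketch of both predicates; the function names and input shapes are assumptions for illustration, not the format of any UptimeRobot or Fly.io API.

```typescript
// True when the most recent `threshold` checks all failed.
// `results` is ordered oldest to newest; true means the check passed.
function hasConsecutiveFailures(results: readonly boolean[], threshold: number): boolean {
  if (threshold <= 0 || results.length < threshold) return false;
  return results.slice(-threshold).every((passed) => !passed);
}

// True when at least `maxRestarts` restarts happened within the trailing
// window ending at `nowMs`. Timestamps are Unix epoch milliseconds.
function isCrashLooping(
  restartTimesMs: readonly number[],
  nowMs: number,
  windowMs: number = 5 * 60 * 1000,
  maxRestarts: number = 3,
): boolean {
  const recent = restartTimesMs.filter((t) => t <= nowMs && nowMs - t <= windowMs);
  return recent.length >= maxRestarts;
}
```

Note that hasConsecutiveFailures looks only at the tail of the history: a failure followed by a success resets the count, which matches the intent of not paging on a single blip.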

7. Cost Projections

All prices are based on publicly listed rates as of Q1 2026. "Launch" assumes 50 projects and ~100 users. "Growth" assumes 500 projects and ~1,500 users. "Scale" assumes 5,000 projects and ~15,000 users.

Service | Plan / Tier | Launch (Mo 1–3) | Growth (Mo 4–12) | Scale (Year 2)
Netlify | Pro ($19/mo flat) | $19/mo | $19/mo | $19/mo
Fly.io API | shared-cpu-1x, 512MB | $5/mo | $15/mo | $50/mo
Fly.io Workers | Same machine type, idle→scale | $0–5/mo | $10/mo | $30/mo
Neon PostgreSQL | Free → Pro ($19/mo) → Pro+ | $0/mo | $19/mo | $69/mo
Cloudflare R2 | 10GB free, $0.015/GB after | $0/mo | $0/mo | $5/mo
Upstash Redis | Free (10K req/day) → Pay-per-use | $0/mo | $0/mo | $10/mo
Sentry | Free (5K errors/mo) → Team $26 | $0/mo | $0/mo | $26/mo
Resend | Free (3K/mo) → Pro $20 | $0/mo | $0/mo | $20/mo
UptimeRobot | Free (50 monitors, 5-min checks) | $0/mo | $0/mo | $0/mo
Domain + DNS | Cloudflare Registrar | ~$1.25/mo | ~$1.25/mo | ~$1.25/mo
Total | | ~$25/mo | ~$65/mo | ~$230/mo

FinOps notes: The free-tier strategy on Neon, Sentry, Resend, and Upstash saves approximately $75/mo during the Launch phase ($19 + $26 + $20 + $10 at the paid-tier prices listed above). Upgrade triggers should be set proactively: move Neon to Pro when the project count exceeds 30 (to ensure PITR coverage before data becomes critical), and Sentry to Team when error volume approaches 4,000/month (80% of the 5K free limit).
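
The totals in the cost table can be rechecked mechanically whenever a price changes. The following sketch copies the per-service figures from the table and sums them per phase; the names monthlyCosts and totalFor are hypothetical, and the Launch figure for Fly.io Workers uses the lower bound of the listed $0 to $5 range.

```typescript
type Phase = "launch" | "growth" | "scale";

// Monthly cost in USD per phase, copied from the cost table.
// Fly.io Workers at Launch is listed as a range; the lower bound is used.
const monthlyCosts: ReadonlyArray<{ service: string } & Record<Phase, number>> = [
  { service: "Netlify", launch: 19, growth: 19, scale: 19 },
  { service: "Fly.io API", launch: 5, growth: 15, scale: 50 },
  { service: "Fly.io Workers", launch: 0, growth: 10, scale: 30 },
  { service: "Neon PostgreSQL", launch: 0, growth: 19, scale: 69 },
  { service: "Cloudflare R2", launch: 0, growth: 0, scale: 5 },
  { service: "Upstash Redis", launch: 0, growth: 0, scale: 10 },
  { service: "Sentry", launch: 0, growth: 0, scale: 26 },
  { service: "Resend", launch: 0, growth: 0, scale: 20 },
  { service: "UptimeRobot", launch: 0, growth: 0, scale: 0 },
  { service: "Domain + DNS", launch: 1.25, growth: 1.25, scale: 1.25 },
];

// Sum one phase's column across all services.
function totalFor(phase: Phase): number {
  return monthlyCosts.reduce((sum, row) => sum + row[phase], 0);
}
```

With these inputs the sums come to 25.25, 64.25, and 230.25 for Launch, Growth, and Scale, which agree with the rounded totals in the table.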

7.1 Fly.io Machine Cost Breakdown

Machine Type | $/mo (1 machine) | Launch Count | Growth Count | Scale Count
shared-cpu-1x 512MB | ~$3.19/mo | 1 API + 0–1 worker | 2 API + 1 worker | 4 API + 2 workers
shared-cpu-1x 1GB | ~$5.70/mo | Consider at 200 req/s

8. Security Infrastructure

8.1 TLS / Transport Security

Layer | Certificate Provider | Minimum TLS | Notes
Netlify (frontend) | Let's Encrypt (auto-renew) | TLS 1.2 | HSTS preloaded via header
Fly.io (API) | Let's Encrypt (auto-renew) | TLS 1.2 | Auto-configured per app
Neon (database) | AWS ACM | TLS 1.2 | Enforced; sslmode=require mandatory
Cloudflare R2 | Cloudflare managed | TLS 1.2 | Presigned URLs expire in 15 min

8.2 Network Security

Fly.io's 6PN (private networking) is used for API-to-Worker communication. Workers never expose a public port. Neon's IP allowlist restricts database access to Fly.io's NAT gateway IPs.

Neon IP allowlist — add Fly.io outbound NAT IPs SHELL
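# List the app's allocated IP addresses. Note: these are its inbound
# addresses; outbound traffic can leave from other addresses, so confirm the
# egress IPs the machines actually use before relying on the allowlist.
flyctl ips list --app groundwork-api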

# Add to Neon via API or console
# Console: console.neon.tech → Project Settings → IP Allow
# Add each Fly.io IPv4 in CIDR notation: 1.2.3.4/32

# Verify connectivity from a running machine
flyctl ssh console --app groundwork-api
# Inside machine:
psql $DATABASE_URL -c "SELECT version();"

8.3 Content Security Policy

CSP header — netlify.toml addition TOML
[[headers]]
  for = "/*"
  [headers.values]
    Content-Security-Policy = """
      default-src 'self';
      script-src 'self' 'unsafe-inline' https://browser.sentry-cdn.com;
      style-src 'self' 'unsafe-inline' https://fonts.googleapis.com;
      font-src 'self' https://fonts.gstatic.com;
      img-src 'self' data: https://files.groundwork.app;
      connect-src 'self' https://api.groundwork.app https://*.sentry.io
                  https://o4504.ingest.sentry.io;
      frame-ancestors 'none';
      base-uri 'self';
      form-action 'self';
    """

8.4 Rate Limiting

Application-level rate limiting is implemented in the Fastify API using @fastify/rate-limit, backed by in-memory storage at Launch and Upstash Redis at Growth phase.

Endpoint Group | Limit | Window | Strategy
POST /auth/* | 5 requests | 1 minute | Per IP; prevents brute force
POST /api/projects | 20 requests | 1 minute | Per authenticated user
GET /api/* | 200 requests | 1 minute | Per authenticated user
Global fallback | 500 requests | 1 minute | Per IP; prevents DDoS
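
To make the limits in the table concrete, here is a minimal in-memory fixed-window counter. It illustrates the general technique only: it is not the implementation used by @fastify/rate-limit, and the names FixedWindowLimiter and rateLimitRules are hypothetical. Like any in-memory store, its counts are per machine, which is why the plan moves to Upstash Redis once several API machines run in parallel.

```typescript
interface RateLimitRule {
  limit: number;    // maximum requests per window
  windowMs: number; // window length in milliseconds
}

// Rules from the rate-limit table, keyed by endpoint group.
const rateLimitRules: Record<string, RateLimitRule> = {
  auth: { limit: 5, windowMs: 60000 },
  createProject: { limit: 20, windowMs: 60000 },
  read: { limit: 200, windowMs: 60000 },
  global: { limit: 500, windowMs: 60000 },
};

class FixedWindowLimiter {
  private readonly rule: RateLimitRule;
  private readonly now: () => number;
  private readonly windows = new Map<string, { startedAt: number; count: number }>();

  // The clock is injectable so the limiter can be tested without waiting.
  constructor(rule: RateLimitRule, now: () => number = () => Date.now()) {
    this.rule = rule;
    this.now = now;
  }

  // Returns true if the request identified by `key` (IP or user id) may proceed.
  allow(key: string): boolean {
    const t = this.now();
    const current = this.windows.get(key);
    if (current === undefined || t - current.startedAt >= this.rule.windowMs) {
      // First request, or the previous window has expired: start a new window.
      this.windows.set(key, { startedAt: t, count: 1 });
      return true;
    }
    if (current.count >= this.rule.limit) return false;
    current.count += 1;
    return true;
  }
}
```

A rejected request would normally be answered with HTTP 429. A production version also needs to evict expired entries from the map, which this sketch omits for brevity.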

8.5 IAM Principles

9. Disaster Recovery

Metric | Target | Mechanism
RTO (Recovery Time Objective) | 30 minutes | Fly.io machine restart (auto <2 min) + Neon branch restore (manual, up to 28 min)
RPO (Recovery Point Objective) | 5 minutes | Neon continuous WAL archiving; the data loss window is the WAL shipping interval
Scenario 1: Fly.io API machine crash / unresponsive
Severity: High · Auto-recovery: Yes
  1. Fly.io auto-restarts the crashed machine within 30–60 seconds. If min_machines_running = 1, a new machine is started immediately.
  2. If auto-restart fails repeatedly (crash loop), SSH into the machine: flyctl ssh console --app groundwork-api and inspect logs: flyctl logs --app groundwork-api.
  3. If the latest deploy is the cause, roll back immediately: flyctl deploy --image <previous-image-id> --app groundwork-api.
  4. If the issue is a dependency (Neon down, Resend down), check respective status pages and implement a 503 maintenance response in the health check.
  5. Once stable, write a postmortem and add a test case that would have caught the regression.
Scenario 2: Neon PostgreSQL regional outage
Severity: Critical · Manual: ~25 min
  1. Confirm outage is on Neon's side: check status.neon.tech. If Neon is healthy, the issue is the connection string or IP allowlist.
  2. Enable maintenance mode on the API (return 503 with Retry-After header) to prevent partial failures from reaching users.
  3. If Neon declares an outage lasting >15 min, create a restore branch from the last known-good point in time: neonctl branches create --name dr-restore --parent main@<last-known-good-ts>.
  4. Update DATABASE_URL in Fly.io secrets to point to the restore branch: flyctl secrets set DATABASE_URL="<new-url>" --app groundwork-api.
  5. Setting the secret redeploys the machines so they pick up the new value; if any machine still runs with the old configuration, restart the app: flyctl apps restart groundwork-api. Disable maintenance mode. Monitor for errors.
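
Step 2 of this scenario calls for a maintenance mode that returns 503 with a Retry-After header. A minimal sketch of that response follows, assuming the flag and retry interval come from configuration the API can read at request time; the name maintenanceReply and the response body shape are hypothetical.

```typescript
interface MaintenanceReply {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
}

// Returns the 503 reply to send while maintenance mode is on, or null when
// the request should be handled normally.
function maintenanceReply(enabled: boolean, retryAfterSeconds: number): MaintenanceReply | null {
  if (!enabled) return null;
  // Retry-After takes a whole number of seconds; round up and never send 0.
  const retryAfter = Math.max(1, Math.ceil(retryAfterSeconds));
  return {
    statusCode: 503,
    headers: {
      "Content-Type": "application/json",
      "Retry-After": String(retryAfter),
    },
    body: JSON.stringify({ status: "maintenance", retryAfterSeconds: retryAfter }),
  };
}
```

In a Fastify API this check would typically run in an early request hook so that no route handler reaches the database while the flag is set.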
Scenario 3: Accidental data deletion (user or application bug)
Severity: High · Manual: ~15 min
  1. Immediately identify the timestamp of the bad operation from Fly.io logs: flyctl logs --app groundwork-api | grep "<affected-entity-id>".
  2. Create a Neon restore branch to the moment before the deletion: neonctl branches create --name data-restore --parent "main@<ISO-timestamp>".
  3. Connect to the restore branch and export the affected rows to a SQL file: pg_dump --table=affected_table --data-only -f restore.sql <restore-branch-url>.
  4. Re-import the rows into the production branch: psql $DATABASE_URL_DIRECT < restore.sql.
  5. Verify row counts and spot-check data integrity. Delete the restore branch: neonctl branches delete data-restore.
Scenario 4: Bad production deploy breaks the application
Severity: High · Recovery: ~3 min
  1. Identify the bad deploy via Sentry error spike or UptimeRobot alert.
  2. Find the last good image ID: flyctl releases list --app groundwork-api.
  3. Roll back the API immediately: flyctl deploy --image <previous-image-id> --app groundwork-api --strategy immediate.
  4. For Netlify, roll back in the Netlify dashboard under Deploys → select previous deploy → Publish deploy. Or via CLI: netlify deploy --prod --dir=<previous-publish-dir>.
  5. If a migration was run as part of the bad deploy and needs to be reversed, apply a compensating migration (never use destructive rollbacks on production data).
Scenario 5: Cloudflare R2 objects corrupted or bucket deleted
Severity: Medium · Manual: ~30 min
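  1. R2 does not offer built-in object versioning at the time of writing, so an overwritten or deleted object cannot be recovered from R2 itself. If versioning has become available by the Growth phase, enable it on the production bucket in the Cloudflare dashboard; otherwise bring forward the backup copy described in step 3.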
  2. For the Launch phase, the database stores file metadata (key, size, content-type). If objects are deleted from R2 but records exist in Neon, the data loss is limited to the binary files only.
  3. At Scale phase, implement a nightly sync job that copies all R2 objects to an archival Cloudflare R2 bucket in a different account as a cold backup.
  4. For immediate recovery of a corrupted file, check if the user has a local copy or if the file was uploaded recently (Sentry + API logs will show the upload request).
  5. Communicate timeline to affected users via in-app notification (dispatched via the Workers job queue) and by direct email.

10. Launch Checklist

Complete all items before marking the infrastructure as production-ready. Commands assume you are authenticated with Fly.io (flyctl auth login), Netlify (netlify login), and Neon (neonctl auth).

  1. Create Neon project in us-east-1: neonctl projects create --name groundwork --region-id aws-us-east-1 --pg-version 16
  2. Create Neon branches (main, staging, dev): neonctl branches create --name staging --parent main and neonctl branches create --name dev --parent staging
  3. Create Fly.io app for API: flyctl apps create groundwork-api --org personal, then copy fly.toml from Section 2.2
  4. Create Fly.io app for Workers: flyctl apps create groundwork-workers --org personal, then copy fly.toml from Section 2.3
  5. Set all Fly.io secrets for both apps: run flyctl secrets set DATABASE_URL="..." RESEND_API_KEY="..." R2_ACCESS_KEY_ID="..." R2_SECRET_ACCESS_KEY="..." SESSION_SECRET="$(openssl rand -hex 32)" --app groundwork-api, then set the Workers secrets listed in Section 5.1 with --app groundwork-workers
  6. Configure Neon IP allowlist with Fly.io NAT IPs: flyctl ips list --app groundwork-api, then add each IP in Neon Console → Project Settings → IP Allow
  7. Run all database migrations against the production branch: DATABASE_URL_DIRECT="<neon-direct-url>" npm run migrate. Verify that all migrations succeed with exit code 0.
  8. Create Cloudflare R2 production bucket: in Cloudflare dashboard → R2 → Create bucket named groundwork-production. Create an API token with Object Read & Write scope restricted to this bucket only.
  9. Connect Netlify site to GitHub repo: netlify init in the project root, or connect via Netlify dashboard. Confirm build command is npm run build and publish directory matches netlify.toml.
  10. Set all Netlify environment variables: in Netlify dashboard → Site settings → Environment variables, set VITE_SENTRY_DSN, VITE_API_BASE_URL=https://api.groundwork.app, VITE_R2_PUBLIC_URL
  11. Configure custom domain on Netlify: Netlify dashboard → Domain management → Add custom domain groundwork.app. Point DNS to Netlify's nameservers or add the CNAME/A record as directed.
  12. Configure custom domain on Fly.io API: flyctl certs create api.groundwork.app --app groundwork-api. Add the CNAME record to DNS as instructed by the output. Verify with flyctl certs check api.groundwork.app --app groundwork-api.
  13. Add GitHub Actions secrets: in GitHub → repo → Settings → Secrets, add FLY_API_TOKEN (from flyctl auth token), NEON_API_KEY, NEON_PROJECT_ID, DATABASE_URL_DIRECT_PROD
  14. Create GitHub Actions environment named "production" with approval gate: GitHub → repo → Settings → Environments → New environment → production. Add required reviewers. This gates all production deploys.
  15. First production deploy via GitHub Actions: push to main branch. Verify the workflow completes without errors. Check flyctl status --app groundwork-api shows all machines as started.
  16. Configure Sentry project and verify error reporting: create a Sentry project for groundwork. Trigger a test error via the Sentry debug endpoint. Confirm it appears in the Sentry dashboard within 30 seconds.
  17. Set up UptimeRobot monitors for all 5 endpoints: create monitors as listed in Section 6.3. Set alert contacts to on-call email and Slack webhook. Run a test alert to verify delivery.
  18. Verify Neon PITR is active on Pro plan: confirm the project is on the Neon Pro plan (required for PITR). Check Console → Backups to see WAL archiving is enabled and shows recent timestamps.
  19. Run end-to-end smoke test against production: create a test user account, create a project, upload a photo, trigger an email notification, verify it arrives. Check Sentry for zero new errors. Check Fly.io logs for clean request logs.
  20. Document all service account credentials in team password manager: store Neon project ID + admin credentials, Fly.io org slug, Cloudflare Account ID + R2 bucket name, Resend API key, Sentry DSN, and UptimeRobot API key in 1Password or Bitwarden under the "Groundwork Infrastructure" vault.

All 20 items complete? The infrastructure is production-ready. Estimated total setup time for an experienced DevOps engineer: 4–6 hours. Share this document link in the team Slack channel and schedule a 30-minute runbook walkthrough with the full engineering team before the first user-facing launch.