Infrastructure Plan
This document is the authoritative reference for how Groundwork is deployed and operated. It is specific enough that a DevOps engineer with access to the listed accounts can stand up the full stack in a single day. All CLI commands assume macOS/Linux with flyctl, netlify-cli, and gh installed.
1. Infrastructure Overview
Groundwork runs on four independent control planes: Netlify for the frontend, Fly.io for the API and background workers, Neon for the database, and Cloudflare R2 for file storage. This decomposition keeps each layer independently scalable and avoids any single-vendor lock-in on the critical path.
Netlify (frontend)
- SvelteKit SSR via adapter-netlify
- Global CDN for static assets
- Preview deploys per pull request
- Auto-SSL, custom headers, redirects
Fly.io (API)
- Node.js (Fastify) REST + WebSocket
- Primary region: iad (US East)
- Auto-scale 1–4 machines
- Private 6PN networking to workers
Fly.io (background workers)
- Background job processing queue
- Email renders + dispatch via Resend
- PDF generation, notification fan-out
- Cold-start acceptable (async work)
Neon (PostgreSQL)
- Serverless autoscale (0–4 CU)
- Built-in connection pooler (pgbouncer)
- PITR 7-day retention (Pro)
- Branch per PR for schema testing
Cloudflare R2 (file storage)
- Zero egress fees
- S3-compatible API (AWS SDK works)
- Project photos, PDFs, attachments
- R2 public bucket for presigned URLs
Upstash Redis (cache)
- Session cache, rate-limit counters
- Deferred until Growth phase
- Serverless, pay per request
- REST API (no persistent connection)
2. Hosting Architecture
2.1 Netlify — Frontend
The SvelteKit application is deployed to Netlify using @sveltejs/adapter-netlify. Static assets (JS bundles, CSS, fonts, images) are served from Netlify's global CDN. Server-side rendering runs inside Netlify Functions (AWS Lambda under the hood, 1 vCPU, 1024MB, 10s timeout).
    [build]
      command = "npm run build"
      publish = ".svelte-kit/netlify/static"

    [build.environment]
      NODE_VERSION = "20"

    # SSR Function handler
    [functions]
      directory = ".svelte-kit/netlify/functions"

    # Cache static assets aggressively
    [[headers]]
      for = "/_app/immutable/*"
      [headers.values]
        Cache-Control = "public, max-age=31536000, immutable"

    # Security headers on all routes
    [[headers]]
      for = "/*"
      [headers.values]
        X-Frame-Options = "DENY"
        X-Content-Type-Options = "nosniff"
        Referrer-Policy = "strict-origin-when-cross-origin"
        Permissions-Policy = "camera=(), microphone=(), geolocation=()"
        Strict-Transport-Security = "max-age=63072000; includeSubDomains; preload"

    # Proxy /api/* to the Fly.io API (status 200 makes this a rewrite, not a redirect)
    [[redirects]]
      from = "/api/*"
      to = "https://api.groundwork.app/:splat"
      status = 200
      force = true
2.2 Fly.io — API Service
The API is a Node.js (Fastify) application running on Fly.io. Node.js is recommended over Go here because the team can share code and types between the SvelteKit frontend and the API (e.g., Zod schemas, shared utility functions), reducing duplication and accelerating the early build. Go would be appropriate if the API needs to handle sustained CPU-bound work at high concurrency — that's not the Launch or Growth profile.
    app = "groundwork-api"
    primary_region = "iad"

    [build]
      dockerfile = "Dockerfile"

    [env]
      PORT = "8080"
      NODE_ENV = "production"
      LOG_LEVEL = "info"

    [[services]]
      protocol = "tcp"
      internal_port = 8080
      # Auto-scale: min 1, max 4 machines
      # (the upper bound is the number of machines created for the app)
      auto_stop_machines = false   # keep ≥1 warm
      auto_start_machines = true
      min_machines_running = 1

      [services.concurrency]
        type = "requests"
        hard_limit = 200
        soft_limit = 150

      [[services.ports]]
        port = 443
        handlers = ["tls", "http"]

      [[services.http_checks]]
        interval = "15s"
        timeout = "5s"
        grace_period = "10s"
        method = "GET"
        path = "/health"
        protocol = "http"

    [[vm]]
      cpu_kind = "shared"
      cpus = 1
      memory_mb = 512
2.3 Fly.io — Background Workers
Workers are a separate Fly.io app (groundwork-workers) so they can scale independently of the API and be deployed or restarted without impacting live traffic. They consume a PostgreSQL-backed job queue stored in Neon (via the pg-boss library) and scale to zero between bursts. A stopped machine cannot poll the queue, so something must start a worker machine when jobs are enqueued.
    app = "groundwork-workers"
    primary_region = "iad"

    [build]
      dockerfile = "Dockerfile.worker"

    # No public service: workers are internal only
    # They communicate outbound only (Resend, Neon, R2)

    [[vm]]
      cpu_kind = "shared"
      cpus = 1
      memory_mb = 512

    # Scale to zero when idle. Fly Proxy starts stopped machines only for
    # incoming requests, so pg-boss polling alone does not wake a stopped worker.
    [http_service]
      auto_stop_machines = true
      auto_start_machines = true
      min_machines_running = 0
pg-boss for job queue: Rather than adding Redis in the early stages, use pg-boss which implements a reliable job queue directly on top of PostgreSQL. This removes one external dependency at Launch. Migrate to a dedicated queue (BullMQ + Upstash Redis) only when you hit Growth phase and see queue contention.
2.4 Region Strategy
| Phase | Fly.io Regions | Neon Region | Rationale |
|---|---|---|---|
| Launch (Mo 1–3) | iad | us-east-1 | US-focused early users; minimize latency to Neon |
| Growth (Mo 4–12) | iad + ord | us-east-1 | Add Chicago for US resilience; primary DB stays east |
| Scale (Year 2) | iad + ord + lax | us-east-1 (+ read replica) | West coast coverage; Neon read replica cuts latency |
3. Database Setup
3.1 Neon Project Configuration
One Neon project holds all environments as separate branches. The project lives in us-east-1 (AWS), co-located with Fly.io iad to minimize network round-trips.
    # Install Neon CLI
    npm install -g neonctl
    neonctl auth

    # Create project (do this once)
    neonctl projects create \
      --name groundwork \
      --region-id aws-us-east-1 \
      --pg-version 16

    # Create staging branch from main
    neonctl branches create --name staging --parent main

    # List branch connection strings (staging must exist first)
    neonctl connection-string --branch main
    neonctl connection-string --branch staging

    # Create a per-PR branch (run in CI)
    neonctl branches create \
      --name "preview/pr-$PR_NUMBER" \
      --parent staging
3.2 Connection Pooling
Every application connects through Neon's built-in pgbouncer endpoint (port 5432, transaction pooling mode). Direct connections (port 5432 on the non-pooler hostname) are used only for migrations, which require session mode.
| Connection Type | Hostname Pattern | Port | Use For |
|---|---|---|---|
| Pooled (pgbouncer) | ep-xxx-pooler.us-east-1.aws.neon.tech | 5432 | API, Workers (all runtime queries) |
| Direct | ep-xxx.us-east-1.aws.neon.tech | 5432 | Migrations only (requires session mode) |
    # Runtime connection (pooled: use this in app code)
    DATABASE_URL=postgres://user:pass@ep-xxx-pooler.us-east-1.aws.neon.tech/neondb?sslmode=require

    # Migration connection (direct: use only in migration scripts)
    DATABASE_URL_DIRECT=postgres://user:pass@ep-xxx.us-east-1.aws.neon.tech/neondb?sslmode=require

    # SSL is enforced; never disable sslmode
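The two URLs differ only in the `-pooler` suffix on the first hostname label, so one can be derived from the other. The following is a minimal sketch under that assumption; `toPooledUrl` and `toDirectUrl` are hypothetical helper names, not part of any Neon tooling.

```typescript
// Hypothetical helpers: convert between Neon direct and pooled connection URLs.
// Assumes the endpoint ID is the first hostname label (ep-xxx) and that the
// pooled hostname appends "-pooler" to it, as in the table above.

function toPooledUrl(directUrl: string): string {
  // Capture: scheme + credentials, first hostname label, remainder of the URL.
  const match = directUrl.match(/^(postgres(?:ql)?:\/\/[^@/]+@)([^./:]+)(\..*)$/);
  if (!match) {
    throw new Error("Unrecognized connection string format");
  }
  const [, prefix, endpoint, rest] = match;
  // Leave the URL unchanged if it already points at the pooler.
  const pooled = endpoint.endsWith("-pooler") ? endpoint : `${endpoint}-pooler`;
  return `${prefix}${pooled}${rest}`;
}

function toDirectUrl(pooledUrl: string): string {
  // Remove "-pooler" from the first hostname label; no-op for direct URLs.
  return pooledUrl.replace(
    /^(postgres(?:ql)?:\/\/[^@/]+@[^./:]+?)-pooler(\.)/,
    "$1$2",
  );
}
```

A migration script could call `toDirectUrl` on `DATABASE_URL` as a guard against accidentally running DDL through the pooler, although keeping two explicit variables, as above, is the simpler design.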
3.3 Backup & Point-in-Time Recovery
Neon's Pro plan includes continuous WAL archiving with a 7-day PITR window. No additional backup tooling is required for the Launch or Growth phases.
    # Create a restore branch from a specific timestamp
    neonctl branches create \
      --name restore-2026-04-01 \
      --parent main \
      --timestamp "2026-04-01T12:00:00Z"

    # Verify data in the restore branch, then make it the default branch if correct
    neonctl branches set-default restore-2026-04-01
3.4 Monitoring
| What to Watch | Where | Alert Threshold |
|---|---|---|
| Active connections | Neon dashboard → Monitoring | >80 pooler connections |
| Slow queries | Neon → Query stats (pg_stat_statements) | p99 > 500ms |
| Storage usage | Neon dashboard → Usage | >80% of plan limit |
| Compute uptime | Neon dashboard → Compute | Unexpected auto-suspend during peak |
4. CI/CD Pipeline
All code flows through GitHub. Netlify auto-deploys are the primary mechanism for the frontend. Fly.io deployments are driven by GitHub Actions to ensure migrations run before the new application code starts serving traffic.
Lint & Test
- ESLint + Prettier
- TypeScript check
- Vitest unit tests
- Playwright smoke
PR Preview
- Netlify preview URL
- Neon PR branch
- Run migrations
- Post URL to PR
Staging
- Fly.io staging app
- Neon staging branch
- Integration tests
- Manual approval gate
Production
- Run DB migrations
- Fly.io deploy
- Netlify auto-deploy
- Smoke test suite
    name: Deploy

    on:
      push:
        branches: [main]
      pull_request:
        branches: [main]

    env:
      FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
      NEON_API_KEY: ${{ secrets.NEON_API_KEY }}
      NEON_PROJECT_ID: ${{ secrets.NEON_PROJECT_ID }}

    jobs:
      # ── 1. Lint & Test ─────────────────────────────────────
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with: { node-version: '20', cache: 'npm' }
          - run: npm ci
          - run: npm run lint
          - run: npm run check      # svelte-check + tsc
          - run: npm run test:unit

      # ── 2. PR Preview (branch deploys only) ────────────────
      preview:
        if: github.event_name == 'pull_request'
        needs: [test]
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          # Each job starts on a clean runner, so install dependencies again
          - uses: actions/setup-node@v4
            with: { node-version: '20', cache: 'npm' }
          - run: npm ci

          # Create a Neon branch for this PR
          - name: Create Neon preview branch
            uses: neondatabase/create-branch-action@v5
            id: neon-branch
            with:
              project_id: ${{ env.NEON_PROJECT_ID }}
              api_key: ${{ env.NEON_API_KEY }}
              branch_name: preview/pr-${{ github.event.number }}
              parent: staging

          # Run migrations against the preview branch
          - name: Run migrations
            run: npm run migrate
            env:
              DATABASE_URL_DIRECT: ${{ steps.neon-branch.outputs.db_url }}

          # Netlify handles the actual preview deploy automatically
          # We just need to inject the Neon branch URL as an env var
          - name: Comment preview URL on PR
            uses: actions/github-script@v7
            with:
              script: |
                await github.rest.issues.createComment({
                  issue_number: context.issue.number,
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  body: `Preview DB branch: \`preview/pr-${context.issue.number}\``
                })

      # ── 3. Production Deploy (main branch only) ────────────
      deploy:
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'
        needs: [test]
        runs-on: ubuntu-latest
        environment: production   # GitHub environment with approval gate
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-node@v4
            with: { node-version: '20', cache: 'npm' }
          - run: npm ci

          # Run migrations BEFORE deploying new code
          - name: Run production migrations
            run: npm run migrate
            env:
              DATABASE_URL_DIRECT: ${{ secrets.DATABASE_URL_DIRECT_PROD }}

          # Deploy API to Fly.io
          - uses: superfly/flyctl-actions/setup-flyctl@master
          - name: Deploy API
            run: flyctl deploy --app groundwork-api --strategy rolling

          # Deploy workers
          - name: Deploy Workers
            run: flyctl deploy --app groundwork-workers --strategy immediate

          # Netlify deploys automatically on push to main via Git integration
          # No step needed here unless you want to block on Netlify completion
4.1 Rollback Procedures
| Layer | Rollback Command | Time |
|---|---|---|
| Fly.io API | `flyctl releases list --app groundwork-api`, then `flyctl deploy --image <previous-image> --app groundwork-api` | ~2 min |
| Netlify | `netlify deploys list`, then `netlify deploy --prod --dir=<old-build>` | ~1 min |
| Database | `neonctl branches create --name rollback --parent main --timestamp <ts>` | ~5 min |
Migration safety rule: All migrations must be backward-compatible with the previous version of the application code. Use a two-phase approach: first deploy a migration that is compatible with both old and new code, then deploy the code change. Never drop columns or rename them in the same deploy as the code that removes their usage.
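One way to enforce this rule is a CI check that flags destructive statements in a migration file so they can be held back until the code that stops using the column has shipped. The sketch below is illustrative only: `findDestructiveStatements` is a hypothetical helper, and its simple pattern matching will miss forms that omit the optional `COLUMN` keyword.

```typescript
// Hypothetical CI lint: flag statements that would break the previous
// application version if they shipped in the same deploy as the code change.
const DESTRUCTIVE_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "DROP COLUMN", pattern: /\bdrop\s+column\b/i },
  { label: "RENAME COLUMN", pattern: /\brename\s+column\b/i },
];

function findDestructiveStatements(sql: string): string[] {
  // Strip "--" line comments so commented-out SQL is not flagged.
  const withoutComments = sql.replace(/--.*$/gm, "");
  return DESTRUCTIVE_PATTERNS
    .filter(({ pattern }) => pattern.test(withoutComments))
    .map(({ label }) => label);
}

// Phase 1 (expand): additive change, safe for both old and new code.
const expandMigration = "ALTER TABLE projects ADD COLUMN title text;";

// Phase 2 (contract): run only after no deployed code reads the old column.
const contractMigration = "ALTER TABLE projects DROP COLUMN name;";
```

Here `findDestructiveStatements(expandMigration)` returns an empty list, while the contract migration is reported as containing `DROP COLUMN` and would need an explicit sign-off.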
5. Environment Management
| Component | Local development | PR preview | Staging | Production |
|---|---|---|---|---|
| Frontend | vite dev (port 5173) | Netlify preview URL | staging.groundwork.app | groundwork.app |
| API | node --watch (port 8080) | groundwork-api (prod) | groundwork-api-staging | groundwork-api |
| Database | Neon dev branch | Neon pr-<N> branch | Neon staging branch | Neon main branch |
| Storage | R2 dev bucket | R2 dev bucket | R2 staging bucket | R2 production bucket |
| Jobs | in-process (no worker app) | groundwork-workers | groundwork-workers-staging | groundwork-workers |
5.1 Secrets Management
| Secret | Stored In | Injected Into |
|---|---|---|
| DATABASE_URL | Fly.io secrets | API, Workers runtime |
| DATABASE_URL_DIRECT | GitHub Actions secrets | Migration step in CI only |
| RESEND_API_KEY | Fly.io secrets | Workers runtime |
| SENTRY_DSN | Netlify env vars + Fly.io secrets | Frontend build + API runtime |
| R2_ACCESS_KEY_ID | Fly.io secrets | API runtime |
| R2_SECRET_ACCESS_KEY | Fly.io secrets | API runtime |
| SESSION_SECRET | Fly.io secrets | API runtime |
| UPSTASH_REDIS_URL | Fly.io secrets | API runtime (Growth+) |
| FLY_API_TOKEN | GitHub Actions secrets | CI deploy step |
| NEON_API_KEY | GitHub Actions secrets | CI branch creation step |
    # Set secrets (never committed to git)
    flyctl secrets set \
      DATABASE_URL="postgres://..." \
      RESEND_API_KEY="re_..." \
      R2_ACCESS_KEY_ID="..." \
      R2_SECRET_ACCESS_KEY="..." \
      SESSION_SECRET="$(openssl rand -hex 32)" \
      --app groundwork-api

    # Verify (values are redacted in output)
    flyctl secrets list --app groundwork-api
6. Monitoring & Observability
6.1 Metrics
Fly.io provides built-in machine metrics (CPU, memory, network) visible in the dashboard at fly.io/apps/groundwork-api/metrics. No additional agent is required for infrastructure metrics at the Launch phase.
| Signal | Source | Dashboard |
|---|---|---|
| CPU / Memory | Fly.io built-in | fly.io/apps/groundwork-api/metrics |
| HTTP request rate + latency | Fly.io built-in | Same dashboard, request metrics tab |
| Machine restarts | Fly.io events | fly.io/apps/groundwork-api/events |
| DB query stats | Neon console | console.neon.tech → Monitoring |
| Uptime / availability | UptimeRobot | uptimerobot.com dashboard |
6.2 Error Tracking — Sentry
Sentry is installed in both the SvelteKit frontend (@sentry/sveltekit) and the Fastify API (@sentry/node). They share one Sentry project, differentiated by environment tags.
    // src/hooks.client.ts (Session Replay runs in the browser only)
    import * as Sentry from '@sentry/sveltekit';

    Sentry.init({
      dsn: import.meta.env.VITE_SENTRY_DSN,
      environment: import.meta.env.MODE,
      tracesSampleRate: 0.1,      // 10% trace sampling
      profilesSampleRate: 0.05,   // 5% profiling
      integrations: [
        Sentry.replayIntegration({
          maskAllText: true,      // PII protection
          blockAllMedia: false,
        }),
      ],
      replaysSessionSampleRate: 0.01,
      replaysOnErrorSampleRate: 1.0,
    });
6.3 UptimeRobot — Synthetic Checks
Five monitors cover the critical user journeys. Alert contacts: email + Slack webhook. Check interval: 5 minutes on the free plan.
| # | Monitor Name | URL | Type | Expected |
|---|---|---|---|---|
| 1 | Homepage | https://groundwork.app/ | HTTP(S) | 200, <3s |
| 2 | API Health | https://api.groundwork.app/health | HTTP(S) | 200, JSON {status:"ok"} |
| 3 | Login Page | https://groundwork.app/login | HTTP(S) | 200, <3s |
| 4 | API DB Check | https://api.groundwork.app/health/db | HTTP(S) | 200, confirms Neon connectivity |
| 5 | File Upload CDN | https://files.groundwork.app/health.txt | HTTP(S) | 200, confirms R2 public access |
6.4 Structured Logging
The API and Workers emit structured JSON logs (via pino) to stdout. Fly.io captures these and forwards them to a log drain. At Growth phase, configure Fly.io's Grafana Cloud log drain for retention and search.
    # Add Grafana Cloud log drain (Growth phase)
    flyctl logs drain create \
      --app groundwork-api \
      --type http \
      --url "https://logs-prod-us-central1.grafana.net/loki/api/v1/push" \
      --header "Authorization: Basic <grafana-token>"

    # Sample structured log output (pino)
    {"level":"info","time":1712073600000,"reqId":"abc-123",
     "method":"POST","url":"/api/projects","statusCode":201,
     "responseTime":42,"userId":"usr_xyz","projectId":"prj_456"}
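Because log search depends on every line having the same fields, it can help to pin the shape down as a shared type. The sketch below mirrors the sample line above; `RequestLog`, `formatRequestLog`, and the status-to-level mapping in `levelForStatus` are assumptions for illustration, not pino's API or the project's actual logger configuration.

```typescript
// Shape of one request log line, mirroring the sample output above.
interface RequestLog {
  level: "info" | "warn" | "error";
  time: number;          // epoch milliseconds
  reqId: string;
  method: string;
  url: string;
  statusCode: number;
  responseTime: number;  // milliseconds
  userId?: string;
  projectId?: string;
}

// Assumed convention: 5xx logs as error, 4xx as warn, everything else as info.
function levelForStatus(statusCode: number): RequestLog["level"] {
  if (statusCode >= 500) return "error";
  if (statusCode >= 400) return "warn";
  return "info";
}

// Serialize to a single JSON line suitable for stdout.
function formatRequestLog(entry: RequestLog): string {
  return JSON.stringify(entry);
}

const sampleLine = formatRequestLog({
  level: levelForStatus(201),
  time: 1712073600000,
  reqId: "abc-123",
  method: "POST",
  url: "/api/projects",
  statusCode: 201,
  responseTime: 42,
  userId: "usr_xyz",
  projectId: "prj_456",
});
```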
6.5 Alert Rules
| Condition | Threshold | Action | Owner |
|---|---|---|---|
| API health check fails | 2 consecutive failures | PAGE immediately | On-call engineer |
| Neon DB connectivity fails | 1 failure | PAGE immediately | On-call engineer |
| Fly.io machine crash loop | 3 restarts in 5 min | PAGE immediately | On-call engineer |
| API p99 latency | > 2000ms for 5 min | Slack alert — investigate | Engineering channel |
| Sentry error rate spike | > 10 errors/min (new issue) | Slack alert — investigate | Engineering channel |
| CPU > 85% | Sustained 10 min | Slack alert — scale up | Engineering channel |
| Neon storage > 80% quota | Daily check | Slack alert — plan upgrade | Engineering channel |
| UptimeRobot homepage down | 2 consecutive failures | PAGE immediately | On-call engineer |
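Two of the thresholds above ("2 consecutive failures" and "3 restarts in 5 min") are simple enough to state precisely in code. This is a sketch of the evaluation logic only, with hypothetical function names; in practice UptimeRobot and Fly.io evaluate these conditions themselves.

```typescript
// True when the most recent `threshold` health checks all failed.
// `results` is ordered oldest to newest; true means the check passed.
function hasConsecutiveFailures(results: boolean[], threshold: number): boolean {
  if (threshold < 1 || results.length < threshold) return false;
  return results.slice(-threshold).every((ok) => !ok);
}

// True when at least `maxRestarts` restarts fall inside the trailing window.
function isCrashLoop(
  restartTimesMs: number[],
  nowMs: number,
  maxRestarts: number = 3,
  windowMs: number = 5 * 60 * 1000,
): boolean {
  const recent = restartTimesMs.filter(
    (t) => t <= nowMs && nowMs - t <= windowMs,
  );
  return recent.length >= maxRestarts;
}
```

For example, `hasConsecutiveFailures([true, false, false], 2)` is true and should page, while a single failed check between two passing ones does not.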
7. Cost Projections
All prices are based on publicly listed rates as of Q1 2026. "Launch" assumes 50 projects and ~100 users. "Growth" assumes 500 projects and ~1,500 users. "Scale" assumes 5,000 projects and ~15,000 users.
| Service | Plan / Tier | Launch (Mo 1–3) | Growth (Mo 4–12) | Scale (Year 2) |
|---|---|---|---|---|
| Netlify | Pro ($19/mo flat) | $19/mo | $19/mo | $19/mo |
| Fly.io API | shared-cpu-1x, 512MB | $5/mo | $15/mo | $50/mo |
| Fly.io Workers | Same machine type, idle→scale | $0–5/mo | $10/mo | $30/mo |
| Neon PostgreSQL | Free → Pro ($19/mo) → Pro+ | $0/mo | $19/mo | $69/mo |
| Cloudflare R2 | 10GB free, $0.015/GB after | $0/mo | $0/mo | $5/mo |
| Upstash Redis | Free (10K req/day) → Pay-per-use | $0/mo | $0/mo | $10/mo |
| Sentry | Free (5K errors/mo) → Team $26 | $0/mo | $0/mo | $26/mo |
| Resend | Free (3K/mo) → Pro $20 | $0/mo | $0/mo | $20/mo |
| UptimeRobot | Free (50 monitors, 5-min checks) | $0/mo | $0/mo | $0/mo |
| Domain + DNS | Cloudflare Registrar | ~$1.25/mo | ~$1.25/mo | ~$1.25/mo |
| Total | | ~$25/mo | ~$65/mo | ~$230/mo |
FinOps notes: The free-tier strategy on Neon, Sentry, Resend, and Upstash saves approximately $75/mo during the Launch phase, measured against the paid-tier figures in the table above ($19 + $26 + $20 + $10). Upgrade triggers should be set proactively: move Neon to Pro when the project count exceeds 30 (to ensure PITR coverage before data becomes critical), and Sentry to Team when error volume approaches 4,000/month (80% of the 5K free limit).
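The totals row of the cost table can be reproduced by summing each column. The sketch below copies the per-service figures from the table and assumes the low end of the $0–5 range for Fly.io Workers at Launch; the type and function names are illustrative.

```typescript
type Phase = "launch" | "growth" | "scale";

interface CostRow {
  service: string;
  launch: number;
  growth: number;
  scale: number;
}

// Monthly cost in USD per service, copied from the projections table.
const costRows: CostRow[] = [
  { service: "Netlify", launch: 19, growth: 19, scale: 19 },
  { service: "Fly.io API", launch: 5, growth: 15, scale: 50 },
  { service: "Fly.io Workers", launch: 0, growth: 10, scale: 30 }, // low end of $0–5
  { service: "Neon PostgreSQL", launch: 0, growth: 19, scale: 69 },
  { service: "Cloudflare R2", launch: 0, growth: 0, scale: 5 },
  { service: "Upstash Redis", launch: 0, growth: 0, scale: 10 },
  { service: "Sentry", launch: 0, growth: 0, scale: 26 },
  { service: "Resend", launch: 0, growth: 0, scale: 20 },
  { service: "UptimeRobot", launch: 0, growth: 0, scale: 0 },
  { service: "Domain + DNS", launch: 1.25, growth: 1.25, scale: 1.25 },
];

function totalFor(phase: Phase): number {
  return costRows.reduce((sum, row) => sum + row[phase], 0);
}

// totalFor("launch") → 25.25, totalFor("growth") → 64.25, totalFor("scale") → 230.25
```

These sums round to the ~$25, ~$65, and ~$230 figures in the totals row.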
7.1 Fly.io Machine Cost Breakdown
| Machine Type | $/mo (1 machine) | Launch Count | Growth Count | Scale Count |
|---|---|---|---|---|
| shared-cpu-1x 512MB | ~$3.19/mo | 1 API + 0–1 worker | 2 API + 1 worker | 4 API + 2 workers |
| shared-cpu-1x 1GB | ~$5.70/mo | — | — | Consider at 200 req/s |
8. Security Infrastructure
8.1 TLS / Transport Security
| Layer | Certificate Provider | Minimum TLS | Notes |
|---|---|---|---|
| Netlify (frontend) | Let's Encrypt (auto-renew) | TLS 1.2 | HSTS preloaded via header |
| Fly.io (API) | Let's Encrypt (auto-renew) | TLS 1.2 | Auto-configured per app |
| Neon (database) | AWS ACM | TLS 1.2 | Enforced; sslmode=require mandatory |
| Cloudflare R2 | Cloudflare managed | TLS 1.2 | Presigned URLs expire in 15 min |
8.2 Network Security
Fly.io's 6PN (private networking) is used for API-to-Worker communication. Workers never expose a public port. Neon's IP allowlist restricts database access to Fly.io's NAT gateway IPs.
    # List the IP addresses allocated to the app.
    # These are the app's allocated addresses; confirm the machines' egress
    # IPs before relying on them for the allowlist.
    flyctl ips list --app groundwork-api

    # Add to Neon via API or console
    # Console: console.neon.tech → Project Settings → IP Allow
    # Add each Fly.io IPv4 in CIDR notation: 1.2.3.4/32

    # Verify connectivity from a running machine
    flyctl ssh console --app groundwork-api
    # Inside machine:
    psql $DATABASE_URL -c "SELECT version();"
8.3 Content Security Policy
    [[headers]]
      for = "/*"
      [headers.values]
        Content-Security-Policy = """
          default-src 'self';
          script-src 'self' 'unsafe-inline' https://browser.sentry-cdn.com;
          style-src 'self' 'unsafe-inline' https://fonts.googleapis.com;
          font-src 'self' https://fonts.gstatic.com;
          img-src 'self' data: https://files.groundwork.app;
          connect-src 'self' https://api.groundwork.app https://*.sentry.io https://o4504.ingest.sentry.io;
          frame-ancestors 'none';
          base-uri 'self';
          form-action 'self';
        """
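A policy of this length is easier to review when it is generated from a structured list rather than edited as one long string. The following sketch rebuilds the same header value; `buildCsp` and `groundworkCsp` are hypothetical names, and nothing here is specific to Netlify.

```typescript
// One CSP directive: its name and the list of allowed sources.
type Directive = [string, string[]];

// Join directives into a single header value: "name src1 src2; name src1".
function buildCsp(directives: Directive[]): string {
  return directives
    .map(([name, sources]) => `${name} ${sources.join(" ")}`)
    .join("; ");
}

// The policy from the configuration above, expressed as data.
const groundworkCsp: Directive[] = [
  ["default-src", ["'self'"]],
  ["script-src", ["'self'", "'unsafe-inline'", "https://browser.sentry-cdn.com"]],
  ["style-src", ["'self'", "'unsafe-inline'", "https://fonts.googleapis.com"]],
  ["font-src", ["'self'", "https://fonts.gstatic.com"]],
  ["img-src", ["'self'", "data:", "https://files.groundwork.app"]],
  ["connect-src", ["'self'", "https://api.groundwork.app", "https://*.sentry.io", "https://o4504.ingest.sentry.io"]],
  ["frame-ancestors", ["'none'"]],
  ["base-uri", ["'self'"]],
  ["form-action", ["'self'"]],
];

const cspHeaderValue = buildCsp(groundworkCsp);
```

Keeping the sources as arrays makes a change such as adding a new image host a one-line diff.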
8.4 Rate Limiting
Application-level rate limiting is implemented in the Fastify API using @fastify/rate-limit, backed by in-memory storage at Launch and Upstash Redis at Growth phase.
| Endpoint Group | Limit | Window | Strategy |
|---|---|---|---|
| POST /auth/* | 5 requests | 1 minute | Per IP — prevents brute force |
| POST /api/projects | 20 requests | 1 minute | Per authenticated user |
| GET /api/* | 200 requests | 1 minute | Per authenticated user |
| Global fallback | 500 requests | 1 minute | Per IP — prevents DDoS |
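To make the limits in the table concrete, the sketch below implements a fixed-window counter of the kind an in-memory store would keep at Launch. It illustrates the counting logic only and is not the implementation used by @fastify/rate-limit; the class and constant names are assumptions.

```typescript
interface RateLimitRule {
  max: number;       // allowed requests per window
  windowMs: number;  // window length in milliseconds
}

// Limits from the table above, all with a 1-minute window.
const RULES = {
  auth: { max: 5, windowMs: 60000 },           // POST /auth/*, keyed by IP
  createProject: { max: 20, windowMs: 60000 }, // POST /api/projects, keyed by user
  apiRead: { max: 200, windowMs: 60000 },      // GET /api/*, keyed by user
  globalFallback: { max: 500, windowMs: 60000 }, // keyed by IP
};

class FixedWindowLimiter {
  private rule: RateLimitRule;
  private counters = new Map<string, { windowStart: number; count: number }>();

  constructor(rule: RateLimitRule) {
    this.rule = rule;
  }

  // Returns true if the request identified by `key` is allowed at `nowMs`.
  allow(key: string, nowMs: number): boolean {
    const entry = this.counters.get(key);
    // Start a fresh window on first use or once the previous one has expired.
    if (!entry || nowMs - entry.windowStart >= this.rule.windowMs) {
      this.counters.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.rule.max) return false;
    entry.count += 1;
    return true;
  }
}
```

With `new FixedWindowLimiter(RULES.auth)`, the sixth login attempt from one IP inside a minute is rejected while other IPs are unaffected. Because the counters live in process memory, each Fly.io machine counts separately, which is one reason to move the store to Upstash Redis once more than one API machine is running.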
8.5 IAM Principles
- Fly.io deploy tokens are scoped to individual apps: the GitHub Actions token for `groundwork-api` cannot deploy to `groundwork-workers` and vice versa.
- Neon roles: one `app_user` role with `SELECT`/`INSERT`/`UPDATE`/`DELETE` on app tables; a separate `migrator` role with `DDL` rights used only in CI migrations.
- R2 API tokens: one token scoped to the production bucket with write access; a separate read-only token for presigned URL generation if needed.
- Resend API keys: one per environment (production, staging). Rotate quarterly.
9. Disaster Recovery
| Metric | Target | Mechanism |
|---|---|---|
| RTO (Recovery Time Objective) | 30 minutes | Fly.io machine restart (auto <2 min) + Neon branch restore (manual, up to 28 min) |
| RPO (Recovery Point Objective) | 5 minutes | Neon continuous WAL archiving — data loss window is the WAL shipping interval |
API service crash
- Fly.io auto-restarts the crashed machine within 30–60 seconds. If `min_machines_running = 1`, a new machine is started immediately.
- If auto-restart fails repeatedly (crash loop), SSH into the machine with `flyctl ssh console --app groundwork-api` and inspect logs with `flyctl logs --app groundwork-api`.
- If the latest deploy is the cause, roll back immediately: `flyctl deploy --image <previous-image-id> --app groundwork-api`.
- If the issue is a dependency (Neon down, Resend down), check the respective status pages and implement a `503` maintenance response in the health check.
- Once stable, write a postmortem and add a test case that would have caught the regression.
Database outage
- Confirm the outage is on Neon's side: check `status.neon.tech`. If Neon is healthy, the issue is the connection string or IP allowlist.
- Enable maintenance mode on the API (return `503` with a `Retry-After` header) to prevent partial failures from reaching users.
- If Neon declares an outage lasting >15 min, create a restore branch from the latest WAL snapshot: `neonctl branches create --name dr-restore --parent main --timestamp <last-known-good-ts>`.
- Update `DATABASE_URL` in Fly.io secrets to point to the restore branch: `flyctl secrets set DATABASE_URL="<new-url>" --app groundwork-api`.
- Restart machines to pick up the new secret: `flyctl machines restart --app groundwork-api`. Disable maintenance mode. Monitor for errors.
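The maintenance-mode step in the database outage procedure can be sketched as a pure function that a health or request handler would call. The names and the 120-second default for `Retry-After` are assumptions for illustration; the `{status:"ok"}` body follows the health check expectation in Section 6.3.

```typescript
interface HealthResponse {
  statusCode: number;
  headers: Record<string, string>;
  body: { status: string };
}

// Build the response for the health endpoint.
// In maintenance mode, return 503 with Retry-After so clients back off.
function healthResponse(options: {
  maintenance: boolean;
  retryAfterSeconds?: number; // assumed default: 120
}): HealthResponse {
  if (options.maintenance) {
    const retryAfter = options.retryAfterSeconds ?? 120;
    return {
      statusCode: 503,
      headers: { "Retry-After": String(retryAfter) },
      body: { status: "maintenance" },
    };
  }
  return { statusCode: 200, headers: {}, body: { status: "ok" } };
}
```

Driving the `maintenance` flag from a Fly.io secret or environment variable would let the on-call engineer toggle it without a code deploy, at the cost of a machine restart.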
Accidental data deletion
- Immediately identify the timestamp of the bad operation from Fly.io logs: `flyctl logs --app groundwork-api | grep "<affected-entity-id>"`.
- Create a Neon restore branch to the moment before the deletion: `neonctl branches create --name data-restore --parent main --timestamp "<ISO-timestamp>"`.
- Connect to the restore branch and export the affected rows to a SQL file: `pg_dump --table=affected_table --data-only -f restore.sql <restore-branch-url>`.
- Re-import the rows into the production branch: `psql $DATABASE_URL_DIRECT < restore.sql`.
- Verify row counts and spot-check data integrity. Delete the restore branch: `neonctl branches delete data-restore`.
Bad deploy
- Identify the bad deploy via Sentry error spike or UptimeRobot alert.
- Find the last good image ID: `flyctl releases list --app groundwork-api`.
- Roll back the API immediately: `flyctl deploy --image <previous-image-id> --app groundwork-api --strategy immediate`.
- For Netlify, roll back in the Netlify dashboard under Deploys → select previous deploy → Publish deploy. Or via CLI: `netlify deploy --prod --dir=<previous-publish-dir>`.
- If a migration was run as part of the bad deploy and needs to be reversed, apply a compensating migration (never use destructive rollbacks on production data).
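File storage (R2) data loss
- R2 does not offer built-in object versioning, so a deleted or overwritten object cannot be recovered from R2 itself. Revisit this at Growth phase and enable versioning on the production bucket if Cloudflare offers it by then.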
- For the Launch phase, the database stores file metadata (key, size, content-type). If objects are deleted from R2 but records exist in Neon, the data loss is limited to the binary files only.
- At Scale phase, implement a nightly sync job that copies all R2 objects to an archival Cloudflare R2 bucket in a different account as a cold backup.
- For immediate recovery of a corrupted file, check if the user has a local copy or if the file was uploaded recently (Sentry + API logs will show the upload request).
- Communicate timeline to affected users via in-app notification (dispatched via the Workers job queue) and by direct email.
10. Launch Checklist
Complete all items before marking the infrastructure as production-ready. Commands assume you are authenticated with Fly.io (flyctl auth login), Netlify (netlify login), and Neon (neonctl auth).
1. Create Neon project in us-east-1: `neonctl projects create --name groundwork --region-id aws-us-east-1 --pg-version 16`
2. Create Neon branches (main, staging, dev): `neonctl branches create --name staging --parent main` and `neonctl branches create --name dev --parent staging`
3. Create Fly.io app for API: `flyctl apps create groundwork-api --org personal`, then copy `fly.toml` from Section 2.2
4. Create Fly.io app for Workers: `flyctl apps create groundwork-workers --org personal`, then copy `fly.toml` from Section 2.3
5. Set all Fly.io secrets for both apps. Run `flyctl secrets set DATABASE_URL="..." RESEND_API_KEY="..." R2_ACCESS_KEY_ID="..." R2_SECRET_ACCESS_KEY="..." SESSION_SECRET="$(openssl rand -hex 32)" --app groundwork-api`, then repeat with `--app groundwork-workers` for the secrets the workers use (see Section 5.1).
6. Configure Neon IP allowlist with Fly.io NAT IPs: `flyctl ips list --app groundwork-api`, then add each IP in Neon Console → Project Settings → IP Allow
7. Run all database migrations against the production branch: `DATABASE_URL_DIRECT="<neon-direct-url>" npm run migrate`; verify all migrations succeed with exit code 0
8. Create Cloudflare R2 production bucket. In Cloudflare dashboard → R2 → Create bucket named `groundwork-production`. Create an API token with Object Read & Write scope restricted to this bucket only.
9. Connect Netlify site to GitHub repo: `netlify init` in the project root, or connect via Netlify dashboard. Confirm build command is `npm run build` and publish directory matches `netlify.toml`.
10. Set all Netlify environment variables. In Netlify dashboard → Site settings → Environment variables: `VITE_SENTRY_DSN`, `VITE_API_BASE_URL=https://api.groundwork.app`, `VITE_R2_PUBLIC_URL`
11. Configure custom domain on Netlify. Netlify dashboard → Domain management → Add custom domain `groundwork.app`. Point DNS to Netlify's nameservers or add the CNAME/A record as directed.
12. Configure custom domain on Fly.io API: `flyctl certs create api.groundwork.app --app groundwork-api`. Add the CNAME record to DNS as instructed by the output. Verify with `flyctl certs check api.groundwork.app --app groundwork-api`.
13. Add GitHub Actions secrets. In GitHub → repo → Settings → Secrets: add `FLY_API_TOKEN` (from `flyctl auth token`), `NEON_API_KEY`, `NEON_PROJECT_ID`, `DATABASE_URL_DIRECT_PROD`
14. Create GitHub Actions environment named "production" with approval gate. GitHub → repo → Settings → Environments → New environment → `production`. Add required reviewers. This gates all production deploys.
15. First production deploy via GitHub Actions. Push to `main` branch. Verify the workflow completes without errors. Check `flyctl status --app groundwork-api` shows all machines as `started`.
16. Configure Sentry project and verify error reporting. Create Sentry project for `groundwork`. Trigger a test error via the Sentry debug endpoint. Confirm it appears in the Sentry dashboard within 30 seconds.
17. Set up UptimeRobot monitors for all 5 endpoints. Create monitors as listed in Section 6.3. Set alert contacts to on-call email and Slack webhook. Run a test alert to verify delivery.
18. Verify Neon PITR is active on Pro plan. Confirm the project is on the Neon Pro plan (required for PITR). Check Console → Backups to see WAL archiving is enabled and shows recent timestamps.
19. Run end-to-end smoke test against production. Create a test user account, create a project, upload a photo, trigger an email notification, verify it arrives. Check Sentry for zero new errors. Check Fly.io logs for clean request logs.
20. Document all service account credentials in team password manager. Store Neon project ID + admin credentials, Fly.io org slug, Cloudflare Account ID + R2 bucket name, Resend API key, Sentry DSN, and UptimeRobot API key in 1Password or Bitwarden under the "Groundwork Infrastructure" vault.
All 20 items complete? The infrastructure is production-ready. Estimated total setup time for an experienced DevOps engineer: 4–6 hours. Share this document link in the team Slack channel and schedule a 30-minute runbook walkthrough with the full engineering team before the first user-facing launch.