Ops & Deployability
Below are all 15 findings recorded in this dimension during the audit, sorted by severity then launch priority. Each card carries the full anatomy: severity / priority / effort, code location, evidence, technical issue, business impact, plain-language explanation, fix steps, related dimensions, and references.
Findings — Ops & Deployability
<repo-root>/.github/, .gitlab-ci.yml, vercel.json, netlify.toml: files do not exist
ls of .github directory returns absent. No .gitlab-ci.yml, .circleci, bitbucket-pipelines.yml, vercel.json, netlify.toml, or Dockerfile at repo root. package.json defines only dev/build/build:dev/preview/lint/format scripts -- no test, deploy, or ci script. Stack profile Section 3 confirms "No CI/CD pipeline files detected."
There is no automated gate between a developer working tree and production. No PR check runs the linter, the single test, a type-check, or builds the worker. Deploys flow through an external Lovable platform with zero artifact in the repo describing branch protections, required checks, environment promotion, or rollback hooks.
A regression introduced at 02:00 ships straight to production users with no automated safety net. Combined with the absence of error tracking (OPS-005), structured logging (OPS-006), runbook (OPS-008), and feature flags (OPS-013), the team cannot tell whether a deploy succeeded, broke something, or quietly degraded a coach flow. For a B2C health-adjacent app launching in EU markets, this is a launch blocker.
The project has no automated checks that run before code goes live. Anyone can push a change and it reaches real users without the linter, type-checker, or tests running first, and there is no record showing how a deploy is supposed to happen or who is supposed to approve it.
Add .github/workflows/ci.yml that runs on every PR: install deps, lint, tsc --noEmit, vitest run, build. Add a deploy workflow gated on push-to-main running wrangler deploy with secrets from GitHub Actions environments. Enable branch protection on main requiring CI to pass.
M — 1–3 days
_clients/SONI-remix-new/package.json: dependencies + devDependencies (full inventory)
grep across repo for sentry, honeybadger, rollbar, bugsnag returned 0 source matches; one false-positive in src/i18n/locales/es.json (locale text). package.json has no @sentry/*, @honeybadger-io/*, rollbar, or @bugsnag/* in dependencies. src/router.tsx provides a DefaultErrorComponent that renders an error UI but does NOT forward the error anywhere.
There is no client-side error reporter and no server-side error reporter. When a user hits a stack trace in coach chat at 02:00 in Madrid, no telemetry leaves the device, no Worker log line is correlated across the SSE stream, and the operator has no way to learn the error happened until the user emails support.
Mean-time-to-detect for any production bug is effectively infinite. Mean-time-to-resolution depends on a user noticing, finding a contact method (harder because the imprint is missing, LEG-010), and writing in. For an AI-coach health app, silent failures will erode user trust for weeks before anyone notices.
When the app crashes on a user phone, nothing is sent back to the team to say it crashed. The team can only find out a bug exists if a user writes in and complains -- which is rare enough that most bugs will live in production for weeks or months without ever being seen.
Install @sentry/react (client) and @sentry/cloudflare (Worker). Initialize in src/router.tsx and the server entry. Upload source maps in the build step. Use environment-scoped DSNs. Wrap the default error component to call Sentry.captureException(error). Attach userId (anonymized) and a request correlation id to every event.
M — 1–3 days
<repo-root>/.env.example: file does not exist
Stack profile Sections 3 and 9 confirm .env.example not present. Required env vars discovered: SUPABASE_URL, SUPABASE_PUBLISHABLE_KEY, SUPABASE_SERVICE_ROLE_KEY, LOVABLE_API_KEY, VAPID_SUBJECT, VAPID_PUBLIC_KEY, VAPID_PRIVATE_KEY, plus VITE_SUPABASE_URL / VITE_SUPABASE_PUBLISHABLE_KEY / VITE_SUPABASE_PROJECT_ID -- 10 distinct keys across 38+ server files plus client. No README at repo root.
There is no canonical list of which environment variables the application requires, what they mean, where to obtain them, or which are secret vs public. Onboarding a new engineer requires grepping the codebase for process.env. and import.meta.env. and reverse-engineering the surface from src/integrations/supabase/client*.ts, src/server/push-admin.server.ts, and ~38 AI-using server files.
Bringing on a second engineer is a one-day archaeology exercise before they can run vite dev. In a 02:00 incident, a fresh on-call engineer cannot stand up a local repro because they do not know what to put in their .env. Disaster recovery (rebuild on a new Cloudflare account) is similarly blocked.
There is no list of the secret keys and URLs the app needs to run. A new developer would have to read large parts of the source code just to figure out what to put in their configuration file before they can start working.
Create .env.example listing every variable name with a one-line comment per variable: where it is used (server/client/cron), where to obtain it (Supabase dashboard, Lovable, npx web-push generate-vapid-keys), required vs optional. Add a top-level README.md with Local Setup and Deploying sections. Add a CI check that fails if a new process.env.X reference appears without a corresponding line in .env.example.
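The CI check described above could be sketched as follows. This is a minimal illustration, not the project's tooling: the function names are hypothetical, the regular expression only catches direct dotted references (not destructuring or bracket access), and the skipped names are Vite's built-in import.meta.env constants. A real check would read the source files and .env.example from disk and exit non-zero when the returned list is non-empty.

```typescript
// Hypothetical helpers: none of these names exist in the audited repo.

// Vite's built-in import.meta.env constants; these need no .env.example entry.
const VITE_BUILTINS = ["DEV", "PROD", "MODE", "SSR", "BASE_URL"];

/** Collect names referenced as process.env.X or import.meta.env.X. */
function findEnvReferences(source: string): string[] {
  const pattern = /\b(?:process\.env|import\.meta\.env)\.([A-Z][A-Z0-9_]*)/g;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    const varName = match[1];
    if (VITE_BUILTINS.indexOf(varName) === -1 && found.indexOf(varName) === -1) {
      found.push(varName);
    }
  }
  return found.sort();
}

/** Parse KEY=value lines from .env.example, ignoring blanks and # comments. */
function parseEnvExample(text: string): string[] {
  const documented: string[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq > 0) documented.push(line.slice(0, eq).trim());
  }
  return documented;
}

/** Return every referenced variable that .env.example does not list. */
function findUndocumentedEnvVars(sources: string[], envExample: string): string[] {
  const documented = parseEnvExample(envExample);
  const missing: string[] = [];
  for (const source of sources) {
    for (const varName of findEnvReferences(source)) {
      if (documented.indexOf(varName) === -1 && missing.indexOf(varName) === -1) {
        missing.push(varName);
      }
    }
  }
  return missing.sort();
}
```

Keeping the comparison as a pure function over strings means it can be unit-tested without touching the file system; the file reading and the non-zero exit belong in a thin wrapper script.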
S — under ½ day
_clients/SONI-remix-new/wrangler.jsonc:1-7 (entire file)
wrangler.jsonc full content has only name, compatibility_date, compatibility_flags, main. No env.staging, env.production, vars, triggers, observability blocks. Supabase project id oyajjhkigkffvudjgybp hard-coded in supabase/config.toml -- single project for all environments.
There is no way in the current wrangler config to deploy to a separate staging Worker against a separate Supabase project. Development, integration testing and production share the same database, the same service-role key, the same Lovable API key, and the same Worker URL. Combined with absence of feature flags (OPS-013), partial rollout and pre-prod verification are impossible without affecting real users.
Every test against a non-trivial backend interaction (auth, coach chat, body-progress photo upload) lands in production data. A migration that breaks coach_messages cannot be detected on staging before it ships. AI cost experiments run against the same Lovable API cap as production traffic.
There is only one version of the app. There is no separate copy where the team can try out new features safely before real users see them -- every change is tested directly on the live system with real user data.
Add env.staging and env.production blocks to wrangler.jsonc with distinct Worker names. Create a second Supabase project for staging. Update scripts: deploy:staging -> wrangler deploy --env staging; deploy:production -> wrangler deploy --env production. Gate deploy:production behind manual approval.
M — 1–3 days
_clients/SONI-remix-new/wrangler.jsonc, _clients/SONI-remix-new/.env: wrangler.jsonc: no vars block, no secret refs; .env is git-tracked
git ls-files .env returns .env -- file is committed. wrangler.jsonc has no vars block and no docs that secrets flow via wrangler secret put. .gitignore does not contain .env or .env.* patterns. Per Charter Rule 7 the .env file contents were NOT opened by this agent.
Whether or not the values in .env are real production credentials, the deployment pipeline has no recorded secret-management discipline. There is no wrangler.jsonc evidence that secrets flow via wrangler secret put, no GitHub Actions secret references, no documented rotation procedure. SEC-001 covers the leak risk; this finding covers the deploy-side gap: no infrastructure-as-code description of how production gets its secrets.
Credential rotation requires editing .env in the repo and pushing a commit. There is no audit trail of when LOVABLE_API_KEY or SUPABASE_SERVICE_ROLE_KEY were last rotated. A leaked secret cannot be revoked at the deploy-platform layer because the deploy platform is not the system of record.
The platform that runs the app (Cloudflare Workers) is not where the secret keys are stored. The keys live in a file in the project itself, which means rotating a leaked key would require editing the project and pushing a change rather than just clicking rotate in a dashboard.
Move every secret out of .env into the Cloudflare Workers secret store via wrangler secret put (per environment). Delete .env from working tree, add .env, .env.*, *.env to .gitignore, git-rm the historical file (cross-ref SEC-001). Replace with .env.example (cross-ref OPS-002). Document secret list and rotation cadence in docs/secrets.md.
M — 1–3 days
<repo-wide> 81 files under src/: see grep summary
grep for console.(log|error|warn|info|debug) under src/ returned 177 occurrences across 81 files (top hits: src/components/CoachPage.tsx:12, src/server/onboarding/daily-block.functions.ts:8, src/server/meal-analysis.ts:7, src/routes/api.coach-chat.ts:9). grep for pino, winston, bunyan returned no source matches. The Cloudflare Worker wrangler.jsonc has no observability section. No log-aggregation library (Logflare, Datadog, Better Stack) in package.json.
Production debugging relies on wrangler tail (or whatever the Lovable platform exposes) reading raw console output with no correlation between a single coach-chat request, the AI gateway call inside it, the fact-extraction follow-up AI call, and the Supabase queries that fired alongside. There is no structured logger emitting JSON with requestId, userId, route, level, model, latencyMs fields. There is no log level -- dev-time console.log statements emit at the same priority as console.error.
When an on-call needs to trace a single user-reported coach failure, they must scroll a stream of un-correlated console lines and guess which ones belong to that request. Multi-step failures cannot be reconstructed from logs alone. Combined with no error tracker (OPS-005), debugging time-to-resolution is bounded by guesswork.
The app prints messages to the console like a developer notebook, but nothing in those messages says which user, which request, or which step they belong to. When something breaks, the on-call engineer cannot connect the dots between the user report and the lines in the log.
Add a thin structured logger module at src/lib/log.ts that wraps console.log/error and emits JSON with fields ts, level, msg, requestId, userId?, route?, durationMs?, ...rest. Replace console.log in src/server/* and src/routes/api*.ts with the structured logger. Generate a requestId in auth-middleware.ts and propagate via request context. Enable Worker observability in wrangler.jsonc. Pipe Worker logs to a long-term sink (Cloudflare Logs Engine, Logflare, or Better Stack).
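A minimal sketch of what such a logger module might look like follows. The field names (ts, level, msg, requestId, userId, route, durationMs) come from the fix steps above; the createLogger name, the sink parameter, and the default minimum level of info are assumptions made for illustration.

```typescript
// Hypothetical shape for src/lib/log.ts; nothing here exists in the repo yet.
type LogLevel = "debug" | "info" | "warn" | "error";
type LogFields = Record<string, unknown>;
type LogSink = (level: LogLevel, line: string) => void;

interface LogContext {
  requestId: string;
  userId?: string;
  route?: string;
}

const LEVEL_ORDER: LogLevel[] = ["debug", "info", "warn", "error"];

// Default sink: one JSON line per entry, errors routed to console.error.
const consoleSink: LogSink = (level, line) => {
  if (level === "error") {
    console.error(line);
  } else {
    console.log(line);
  }
};

function createLogger(
  context: LogContext,
  sink: LogSink = consoleSink,
  minLevel: LogLevel = "info",
) {
  function emit(level: LogLevel, msg: string, rest: LogFields = {}): void {
    // Drop entries below the configured minimum level.
    if (LEVEL_ORDER.indexOf(level) < LEVEL_ORDER.indexOf(minLevel)) return;
    // Caller fields go first so they cannot overwrite ts, level, msg or the context.
    const entry = { ...rest, ...context, ts: new Date().toISOString(), level, msg };
    sink(level, JSON.stringify(entry));
  }
  return {
    debug: (msg: string, rest?: LogFields) => emit("debug", msg, rest),
    info: (msg: string, rest?: LogFields) => emit("info", msg, rest),
    warn: (msg: string, rest?: LogFields) => emit("warn", msg, rest),
    error: (msg: string, rest?: LogFields) => emit("error", msg, rest),
  };
}
```

A request handler would create one logger per request, for example createLogger({ requestId, route: "/api/coach-chat" }), and pass it down, so every line emitted while serving that request shares the same requestId.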
M — 1–3 days
<repo-wide> -- no monitoring config detected: absent
grep for datadog, newrelic, honeycomb, logflare, betterstack, uptimerobot, pagerduty returned no source matches. wrangler.jsonc has no observability block. No alerts configured for Worker error rate, Supabase Postgres connection count, or Lovable AI Gateway spend. No README mentions an on-call rotation, alert email, or Slack channel.
There is no uptime probe configured, no APM emitting latency p50/p95/p99, no error-rate alert, no alert on Lovable API token spend or Supabase connection exhaustion, and no defined alert sink (PagerDuty, Slack, email). For an AI-coach product whose dominant cost driver is a third-party token budget (cross-ref SCA-007 and AI-003), the absence of a spend alert is by itself a financial-risk finding.
An outage at 02:00 will not page anyone. A Lovable API key drained by a runaway user (AI-003) will only be noticed when the next user gets a 429 and complains. Worker latency p95 silently doubling after a deploy will not surface until users report sluggish coach replies.
Nothing is watching the app to see if it is up, fast, or running up a bill. If a service goes down in the middle of the night or someone abuses the AI features and triggers a large invoice, nobody on the team will get a notification -- they will only find out the next time they happen to log in.
Add an external uptime probe using UptimeRobot, Better Stack, or Cloudflare Health Checks. Route alerts to a shared Slack channel and an on-call email. Enable Worker observability in wrangler.jsonc and surface error-rate and CPU-time dashboards. Configure a Lovable AI Gateway monthly spend alert at 50/75/90 percent of the cap. Configure a Supabase project usage alert.
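If the gateway does not offer native spend alerts, the 50/75/90 percent rule can be evaluated by a small scheduled job. The sketch below is one possible way to do that, assuming the job can read the month-to-date spend and the cap from somewhere; the function name and its inputs are hypothetical.

```typescript
// Thresholds taken from the fix steps: alert at 50, 75 and 90 percent of the cap.
const SPEND_THRESHOLDS_PERCENT = [50, 75, 90];

/**
 * Return the thresholds crossed between two spend readings, so each alert
 * fires once rather than on every poll. Spend and cap share the same unit.
 */
function newlyCrossedThresholds(
  previousSpend: number,
  currentSpend: number,
  cap: number,
  thresholds: number[] = SPEND_THRESHOLDS_PERCENT,
): number[] {
  if (cap <= 0) throw new Error("cap must be positive");
  // Compare spend * 100 against threshold * cap to avoid division rounding.
  return thresholds.filter(
    (t) => previousSpend * 100 < t * cap && currentSpend * 100 >= t * cap,
  );
}
```

Comparing the previous and current readings, rather than only the current one, keeps the alert channel quiet between threshold crossings.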
M — 1–3 days
<repo-root>/docs/, RELEASE.md, RUNBOOK.md: files do not exist
Stack profile Section 9: README file not present. No docs/ directory at repo root. find for README, RUNBOOK, RELEASE, DEPLOY returned no matches. No on-call rotation documented in any of the existing project files.
There is no document describing how to deploy, how to roll back a bad deploy, what to verify after deploy, who to escalate to during an incident, or how to handle a partial outage (Worker up but Supabase down, or Worker up but Lovable Gateway throttling). The Lovable platform may provide a redeploy-previous button, but there is no in-repo evidence that anyone has practiced it or documented the smoke-test set that proves a rollback was successful.
During a production incident, the responder must invent procedure on the spot. A bad migration that broke coach_messages cannot be rolled back without ad-hoc SQL invented under pressure (cross-ref DAT-002, DAT-008). The single-engineer bus factor is brutal: a second engineer cannot take over without a verbal handover.
There is no written guide for how to safely release the app, how to undo a release that broke something, or who to call when things go wrong. Every incident becomes a fresh exercise in figuring it out from scratch, which is slow and dangerous.
Write a release runbook that covers:
- pre-deploy checklist (lint, tests, type-check pass; staging deploy verified);
- deploy command and approval gate;
- smoke-test checklist (sign-in, send a coach message, view bio-twin, log a meal, view weekly report);
- rollback procedure (wrangler rollback or Lovable platform equivalent, plus Supabase point-in-time-restore steps);
- on-call escalation tree with names, phone numbers, and second-on-call.
Schedule a quarterly rollback drill.
S — under ½ day
_clients/SONI-remix-new/supabase/migrations/ (10 files referencing pg_cron), _clients/SONI-remix-new/src/routes/hooks/weekly-reports.ts, _clients/SONI-remix-new/src/routes/api/public/hooks/: cron call sites + endpoints
grep for pg_cron, schedule under supabase/ returned 10 migration files referencing pg_cron. src/routes/hooks/weekly-reports.ts and src/routes/api/public/hooks/bio-twin-snapshots.ts and body-plateau-detect.ts are hook endpoints called by cron. Cross-ref SCA-006 confirms cron endpoints process all users in a tight sequential for-loop. SEC-002 confirms unauthenticated cron endpoint uses service-role key and SEC-003 cron endpoint uses publishable (anon) key as the bearer secret. None of the migration files reviewed include a cron-failure logging table or an alert hook.
Scheduled jobs fire on a pg_cron timer, hit unauthenticated or weakly authenticated endpoints (SEC-002, SEC-003), iterate all users sequentially (SCA-006), and either succeed silently or fail silently. There is no per-run row inserted into a cron_run_log table with job, started_at, finished_at, candidate_count, processed_count, error. There is no alert when N consecutive runs fail or when a scheduled run is missed entirely.
If the weekly-report cron breaks on a Sunday at 03:00, no user gets a report and no operator notices until a user complains on Tuesday. If the bio-twin-snapshot cron fails for two weeks running, snapshots silently miss the gap and the coach loses context. Combined with no APM (OPS-007), missed runs are invisible.
The app runs scheduled background jobs (for example generating weekly reports). If one of these jobs fails, nothing is written down anywhere -- the team has no way to find out it stopped working until a user notices their report did not appear.
Create a cron_run_log table with job text, started_at timestamptz, finished_at timestamptz, candidate_count int, processed_count int, error text. Every cron endpoint should INSERT a row on start and UPDATE it on finish/failure. Add a Better Stack heartbeat per cron job that the endpoint pings on success -- missed heartbeats trigger an alert. Surface a cron_health admin view that lists last-run-at + last-success-at per job.
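The insert-on-start, update-on-finish pattern could be wrapped once and reused by every cron endpoint. The sketch below assumes a small store interface so the logging logic stays independent of the database client; CronRunStore, withCronRunLog and the row-id return value are hypothetical names, while the column names mirror the table proposed above.

```typescript
// Columns mirror the proposed cron_run_log table.
interface CronRunRow {
  job: string;
  started_at: string;
  finished_at: string | null;
  candidate_count: number | null;
  processed_count: number | null;
  error: string | null;
}

// Hypothetical persistence interface; a real one would wrap the Supabase client.
interface CronRunStore {
  insertStart(row: CronRunRow): Promise<number>; // resolves to the new row id
  finish(id: number, patch: Partial<CronRunRow>): Promise<void>;
}

interface CronResult {
  candidateCount: number;
  processedCount: number;
}

/** Record a run row before the job starts and update it on success or failure. */
async function withCronRunLog(
  job: string,
  store: CronRunStore,
  run: () => Promise<CronResult>,
): Promise<CronResult> {
  const id = await store.insertStart({
    job,
    started_at: new Date().toISOString(),
    finished_at: null,
    candidate_count: null,
    processed_count: null,
    error: null,
  });
  try {
    const result = await run();
    await store.finish(id, {
      finished_at: new Date().toISOString(),
      candidate_count: result.candidateCount,
      processed_count: result.processedCount,
    });
    return result;
  } catch (err) {
    await store.finish(id, {
      finished_at: new Date().toISOString(),
      error: err instanceof Error ? err.message : String(err),
    });
    throw err; // rethrow so the endpoint still reports the failure
  }
}
```

Because the start row is written before the job body runs, a run that crashes the Worker outright still leaves a row with finished_at null, which a cron_health view can surface as a stuck run.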
M — 1–3 days
_clients/SONI-remix-new/src/routes/: no api.health.ts / health.ts / healthz.ts route exists
grep for health, healthz under src/ returned only one match in src/server/blueprint-initial.ts which is a prompt string containing the word health as text, not an endpoint. ls of src/routes/ shows 32 files, none of them health-related. Stack profile Section 8 enumerates routes -- no /health, /healthz, /api/health, or /api/status appears.
External uptime monitoring (OPS-007) cannot verify the worker plus its critical downstream dependencies (Supabase, Lovable AI Gateway) are healthy because there is no endpoint that probes them. A 200 OK on / only proves the SPA shell loads -- it does not prove that auth, the database, or the AI gateway are reachable.
Even after adding an uptime probe, the probe will only confirm the SPA is served -- it will continue returning 200 OK during a Supabase outage or a Lovable gateway outage. Real user impact (coach chat fails) goes undetected until users complain.
There is no special address that says "I am alive and so are all the things I depend on." Even when monitoring is added, it will only check that the home page loads -- not that the database or the AI service is actually working.
Add a health route under src/routes/ that performs the following:
- select 1 from Supabase with a 1500ms timeout;
- HEAD or short OPTIONS against the AI gateway;
- return 503 if any downstream check fails.
Point UptimeRobot/Better Stack at this endpoint with a 60s interval.
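The core of such a health route can be sketched independently of the routing framework. In the sketch below each downstream check is passed in as a probe function, so the Supabase query and the AI gateway request are left to the caller; HealthProbe, withTimeout and runHealthChecks are hypothetical names, and the 1500ms default and the 503 status follow the steps above.

```typescript
// A probe resolves when the dependency is reachable and rejects otherwise.
type HealthProbe = () => Promise<void>;

interface HealthReport {
  status: 200 | 503;
  checks: Record<string, string>; // "ok" or the failure message per dependency
}

/** Reject if the promise has not settled within ms milliseconds. */
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

/** Run all probes in parallel; any failure or timeout yields a 503 report. */
async function runHealthChecks(
  probes: Record<string, HealthProbe>,
  timeoutMs: number = 1500,
): Promise<HealthReport> {
  const checks: Record<string, string> = {};
  const probeNames = Object.keys(probes);
  await Promise.all(
    probeNames.map(async (probeName) => {
      try {
        await withTimeout(probes[probeName](), timeoutMs);
        checks[probeName] = "ok";
      } catch (err) {
        checks[probeName] = err instanceof Error ? err.message : String(err);
      }
    }),
  );
  const healthy = probeNames.every((probeName) => checks[probeName] === "ok");
  return { status: healthy ? 200 : 503, checks };
}
```

The route handler would then serialize the report as JSON and use report.status as the HTTP status. Running the probes in parallel keeps the worst-case response time close to a single timeout rather than the sum of all of them.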
S — under ½ day
_clients/SONI-remix-new/supabase/migrations/ (89 SQL files), <repo-root>/docs/: no deploy-side migration runbook exists
Stack profile Section 4: 89 SQL files under supabase/migrations/, timestamp-prefixed. No GitHub Actions workflow runs supabase db push (no workflows at all). No README documents whether migrations apply via Lovable platform, via supabase db push from a developer laptop, or via Supabase dashboard SQL editor. No down-migration files (no _down.sql siblings). DAT-006 confirms silent migration drift for tables referenced in code after a drop migration.
Migration application is undocumented and probably manual. There is no CI step that runs supabase db push or applies migrations from the repo as part of deploy. There is no convention for how to roll back a bad migration -- no down-migration files, no documented point-in-time-restore steps, no canary against staging (cross-ref OPS-003 -- staging does not exist).
A migration that breaks a hot table (e.g. coach_messages) cannot be rolled back via a single command. The only recourse is Supabase point-in-time restore, which (a) is undocumented (DAT-011) and (b) loses any data written after the bad migration applied. For a health-adjacent app under GDPR, the data-loss exposure is regulatory not just operational.
When the team needs to change the database, there is no automated, repeatable way to apply that change to the live system, and no documented way to undo a change that turned out to be wrong. Every database change is a manual procedure invented on the spot.
Add a deploy workflow step that runs supabase db push --linked against the target environment with secrets from the workflow store. Adopt a convention for breaking migrations (add column nullable -> backfill -> set NOT NULL in a later migration). Document the rollback strategy in docs/deploy.md (Supabase PITR steps + which tables to verify after restore). Mirror every destructive migration (DROP COLUMN, DROP TABLE) with an explicit comment block describing the manual undo.
M — 1–3 days
_clients/SONI-remix-new/wrangler.jsonc, _clients/SONI-remix-new/package.json: wrangler.jsonc:3 (name = tanstack-start-app); package.json:2 (name = tanstack_start_ts)
wrangler.jsonc has name tanstack-start-app -- the default scaffold name from npm create cloudflare. package.json name is tanstack_start_ts. Last commit (HEAD 7237266a) message is "Lovable update" -- no project-specific commit messages in recent history. supabase/config.toml pins a single project_id oyajjhkigkffvudjgybp.
The Cloudflare Workers dashboard, the npm package name, and the Lovable platform deployment slug all use a generic scaffold name. In an account with multiple Workers (or after a remix into a sibling project), tanstack-start-app is ambiguous. Commit messages of the form "Lovable update" carry no semantic information for git-bisect during an incident.
During an incident, the responder reading the Cloudflare dashboard cannot tell which Worker belongs to SONI vs other tenants. Searching git log for the change that introduced a regression yields a wall of Lovable update commits. Onboarding a second engineer is harder because every artifact looks like a scaffold.
The app name on the hosting dashboard is still the generic default (tanstack-start-app) and the version-control history entries are all called Lovable update. This makes it hard to tell which project is which when looking at the production dashboard, and hard to figure out which change introduced which bug.
Rename the Worker in wrangler.jsonc to soni-production (and soni-staging when OPS-003 is implemented). Rename the npm package to soni-app in package.json. Adopt a commit-message convention (conventional commits or just human English summaries) for any non-platform commit. Where Lovable controls commit messages, surface a project-side CHANGELOG.md that captures meaningful release notes.
S — under ½ day
<repo-wide> -- no flag library detected: absent
package.json contains no LaunchDarkly, Unleash, Statsig, GrowthBook, PostHog, or ConfigCat SDK. grep for featureFlag, feature_flag, flags. under src/ returned no application-level flag plumbing. The only import.meta.env.DEV usage is in src/router.tsx:30 for showing error messages in dev -- that is a build-time mode flag, not a runtime feature flag.
There is no way to ship a feature behind an off-by-default flag, no way to enable a new coach prompt for 10 percent of users before flipping it on for everyone, and no way to kill-switch a misbehaving feature without a redeploy. For an AI-coach product where prompt changes and AI provider changes can have user-visible regressions (cross-ref AI-009, AI-010), a kill-switch is the minimum prudent fallback.
Every feature ships to 100 percent of users at the moment it merges. A bad AI prompt that doubles cost or generates unsafe coach content (cross-ref AI-001 fabricated citations, AI-002 role-injection) cannot be turned off without a code change and a redeploy. There is no way to A/B-test a coach behavior change safely.
There is no switch the team can flip to turn off a broken feature or to release a new feature to only some users first. Every release goes to every user at once, and the only way to undo a problem is another full release.
Adopt a minimum-viable feature-flag layer: a feature_flags Supabase table keyed by key, env, value, rollout_percent, queried at server-function entry with a 60s in-memory cache. Or adopt PostHog feature flags (also gives free analytics, which the project currently lacks). Wire the riskiest surfaces first: coach prompt version, AI provider/model, voice-coach feature, push notifications.
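For the table-backed option, the evaluation logic might look like the sketch below. It assumes the row shape described above (key, env, value, rollout_percent) with value treated as a boolean kill-switch, and it buckets users with an FNV-1a-style hash over UTF-16 code units so the same user always lands in the same bucket. All function names are hypothetical, and the loader that queries Supabase for the current environment's rows is left to the caller.

```typescript
// Row shape follows the proposed feature_flags table; `value` is assumed boolean.
interface FlagRow {
  key: string;
  env: string;
  value: boolean;
  rollout_percent: number; // 0..100
}

/** Deterministically map (flag, user) to a bucket in 0..99. */
function rolloutBucket(flagKey: string, userId: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  const input = `${flagKey}:${userId}`;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0) % 100;
}

/** A missing row or value=false acts as a kill-switch; otherwise apply the rollout. */
function isFlagEnabled(row: FlagRow | undefined, userId: string): boolean {
  if (!row || !row.value) return false;
  if (row.rollout_percent >= 100) return true;
  return rolloutBucket(row.key, userId) < row.rollout_percent;
}

/** Cache loaded rows in memory and reload once the TTL (default 60s) has elapsed. */
function createFlagCache(
  load: () => Promise<FlagRow[]>,
  ttlMs: number = 60000,
  now: () => number = Date.now,
) {
  let rows: FlagRow[] = [];
  let loadedAt = -Infinity;
  return async function getFlag(key: string): Promise<FlagRow | undefined> {
    if (now() - loadedAt >= ttlMs) {
      rows = await load();
      loadedAt = now();
    }
    return rows.filter((row) => row.key === key)[0];
  };
}
```

Including the flag key in the hash input means a user who falls in the first 10 percent for one flag is not automatically in the first 10 percent for every other flag. Note that an in-memory cache lives per Worker isolate, so a flag flip may take up to the TTL to be seen everywhere.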
M — 1–3 days
_clients/SONI-remix-new/package.json: scripts block (lines 6-12)
package.json scripts: dev, build, build:dev, preview, lint, format. No test script. Stack profile Section 7: Test files: 1 file total -- src/lib/locale-region.test.ts. vitest.config.ts is present.
Even after CI is added (OPS-001), there is no canonical command to invoke tests. A new contributor running npm test will get a missing-script error because no test script is defined. The vitest binary must be invoked directly via npx vitest run. Combined with the fact that there is only one test file in the entire 86,000-LOC codebase, the project has no testing posture.
Formalizing how tests are run is itself a precondition for the test count ever growing. Without a script, contributors will not add tests; without tests, regressions ship.
There is no shortcut command to run the project tests, and the project currently has only one test for a codebase of about 86,000 lines. Even if the team wanted to add more tests, the basic plumbing for running them is not in place.
Add test: vitest run, test:watch: vitest, and typecheck: tsc --noEmit to package.json scripts. Wire npm test and npm run typecheck into the CI workflow created in OPS-001 as required checks on PRs.
XS
_clients/SONI-remix-new/public/sw.js: SW_VERSION constant
Stack profile Section 9: Service-worker version is a string constant (SW_VERSION = 2026-05-07-skip-to-app) hard-coded into public/sw.js rather than derived from a build hash -- push-notification update behavior is fully manual.
A new release that ships a fixed sw.js will not invalidate clients unless the SW_VERSION string was manually bumped. Forgetting to bump it means users keep the old service worker and the old push handler indefinitely.
Push-notification regressions (and any cached-route regressions) can persist on a user device across deploys until the developer remembers to bump the version. For a notifications-driven habit-coach product, this is a non-trivial UX risk.
The version string for the background worker that handles notifications is updated by hand. If a developer forgets to bump it, users get stuck on the old version even after the team releases an update.
Replace the hard-coded SW_VERSION with a Vite-injected build hash via define: __SW_VERSION__: JSON.stringify(commitSha) in vite.config.ts, or generate sw.js from a template at build time. Add a release-checklist item in docs/deploy.md to verify sw.js version bumped.
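The template route may be the simpler of the two options, since files under public/ are copied as-is and are not necessarily touched by Vite's define substitution. A minimal sketch of the substitution step follows; the __SW_VERSION__ placeholder matches the name used above, while renderServiceWorker and the idea of a separate template file are assumptions for illustration.

```typescript
const SW_VERSION_PLACEHOLDER = "__SW_VERSION__";

/**
 * Replace every placeholder occurrence with the build id as a quoted string
 * literal, so `const SW_VERSION = __SW_VERSION__;` becomes valid JavaScript.
 */
function renderServiceWorker(template: string, buildId: string): string {
  // Restrict the id to characters that are safe in a version string.
  if (!/^[A-Za-z0-9._-]+$/.test(buildId)) {
    throw new Error(`unexpected characters in build id: ${buildId}`);
  }
  // Fail the build loudly if the template lost its placeholder.
  if (template.indexOf(SW_VERSION_PLACEHOLDER) === -1) {
    throw new Error("template does not contain the version placeholder");
  }
  return template.split(SW_VERSION_PLACEHOLDER).join(JSON.stringify(buildId));
}
```

A build script would read the template, call renderServiceWorker with the commit SHA or a content hash, and write the result to public/sw.js before the Vite build runs. Throwing when the placeholder is missing turns a forgotten version bump into a failed build rather than a silent stale service worker.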
S — under ½ day