# Application Update — Testing Guide

The Application Update feature (superadmin **Settings → Application Update**)
allows uploading a release ZIP and applying it in-place. This document covers:

1. Cutting a production release (the supported packaging flow)
2. Automated test suite (vitest + supertest)
3. Manual end-to-end checklist for a deployed/staging instance

---

## 0. Cutting a production release

Use the two release commands shipped with the repo. **Do NOT upload
GitHub's "Download ZIP"** — that archive contains a tracked symlink
and Replit/agent files which the upload validator rejects.

```bash
# 1. Bump the version everywhere in lockstep
npm run version:set -- 3.7.3
#    → updates VERSION, package.json, and the root version
#      entries of package-lock.json. Dependency version strings
#      (e.g. jszip "^3.7.1") are left untouched.

# 2. (Optional) commit + tag
git commit -am "chore: bump 3.7.3"
git tag v3.7.3

# 3. Build the customer-ready ZIP
npm run build:prod-zip
#    → produces dist-release/whatsway-3.7.3-prod.zip
#      (no node_modules, no .git, no .replit/replit.nix/replit.md,
#       no .local/.cache/.upm/.agents, no attached_assets/uploads,
#       no symlinks, no .env files except .env.example)
```

Then upload `dist-release/whatsway-<version>-prod.zip` via
**Superadmin → Application Update → Choose File**.

The build script verifies the staged tree contains zero symlinks and
zero `replit`-named entries before zipping; if either check fires it
exits non-zero so a bad archive can never be produced.

### What to exclude when packaging by hand

If you must build a release ZIP without the helper script, exclude:

- `node_modules/` — the server runs `npm ci` after copying.
- `.git/`, `.github/` — VCS metadata, not needed at runtime.
- `.cache/`, `.upm/`, `.config/`, `.local/`, `.agents/` — agent and tool state.
- `.update-backups/`, `.update-temp/` — runtime backup artifacts.
- `dist/`, `build/` — regenerated by `npm run build`.
- `attached_assets/`, `uploads/` — user/session content.
- `.env`, `.env.*` (keep `.env.example`) — local secrets.

The server validator now allows ZIPs that contain internal-only
symlinks (e.g. workspace `node_modules/.bin/*` entries are fine), but
will still reject any symlink whose target escapes the archive root or
is dangling, naming the offending path in the error.

The destructive **execute** step (which deletes `node_modules`, runs
`npm install`, `npm run build`, `npm run db:push --force`, and tries
`pm2 restart all`) **must NOT be run against the dev environment**. The
automated suite mocks every shell command and filesystem mutation, so it
is safe to run anywhere.

---

## 1. Automated tests

```
npm test
```

This runs `vitest run` against `server/__tests__/app-update.test.ts`. The
suite spins up an in-process express app that wires the same
`csrfMiddleware`, `requireAuth`, `requireRole("superadmin")`, multer, and
controller as production. It points the controller at a fresh tmpdir
(via `APP_UPDATE_ROOT`) so nothing touches the real project tree.

### Coverage matrix

| Area | Cases |
|---|---|
| Auth | unauthenticated → 401; non-superadmin → 403; superadmin → 200 |
| CSRF | upload / execute / rollback without `X-CSRF-Token` → 403 |
| Status | reports current `VERSION`, no backup, lock state |
| Upload — rejections | no file, non-`.zip` filename, missing `package.json`, missing `server/`+`client/`, path-traversal entry, garbage bytes, lock held (409) |
| Upload — happy | parses `VERSION`, returns `{success, newVersion, currentVersion}`, leaves `.update-temp/extracted/` ready |
| Execute | missing prior upload → SSE error; happy path emits `backup → replace → dependencies → build → database → restart → complete` (manual restart message when pm2 missing); pm2-success path reports "restarted successfully"; npm install failure triggers rollback flow; concurrent execute → 409; lock released after both success and failure |
| Rollback | no backup → 404; with backup → 200 + `restoredVersion`; while update in progress → 409 |

### Test seams

The controller exposes a small test surface (production code paths
remain unchanged):

- `__deps.{runCommand, backupCurrentApp, syncTree, restoreFromBackup}` —
  spy-replaceable via `vi.spyOn(__deps, "fn")`. Every production call
  site dispatches through `__deps`.
- `__resetUpdateLockForTests()` — resets the in-memory lock between
  tests.
- `APP_UPDATE_ROOT` env var — overrides the project root that the
  controller and the multer destination use. Defaults to
  `process.cwd()`.

### System dependencies

The suite shells out to `unzip` (controller) and `zip` (test harness, to
craft path-traversal fixtures). Both are installed via Nix on the dev
container; on a fresh environment install with:

```
# via the package management skill
installSystemDependencies({ packages: ["unzip", "zip"] })
```

---

## 2. Manual end-to-end checklist (deployed/staging instance)

> Only run this against a **disposable** instance — the execute step
> wipes `node_modules`, rebuilds, force-pushes the schema, and restarts
> the process. Do not run against production-with-customer-data.

Pre-flight:

- [ ] Log in as a superadmin user.
- [ ] Open `/app-update`.
- [ ] Confirm **Current Version** matches `VERSION` (or `package.json`
      `version`) on the server.

Status & guards:

- [ ] In a separate browser session, log in as a tenant `admin` user
      and try to navigate to `/app-update` — you should be blocked /
      redirected.
- [ ] In devtools, send `POST /api/app-update/upload` without an
      `X-CSRF-Token` header — expect **403**.

Upload validation (use small synthetic ZIPs):

- [ ] Upload `bogus.txt` → rejected with "Only .zip files are accepted".
- [ ] Upload a `.zip` missing `package.json` → "Invalid application
      ZIP".
- [ ] Upload a `.zip` containing only `package.json` (no `server/` or
      `client/`) → "Invalid application ZIP".
- [ ] Upload a hand-crafted ZIP with a `../evil` entry → rejected with
      either "traversal" or "Failed to extract".
- [ ] Upload a real release ZIP → success toast shows
      `Version X.Y.Z ready to install`.

Execute (only in disposable env):

- [ ] Click **Apply Update**. Confirm the SSE log shows steps in order:
      backup → replace → dependencies → build → database → restart →
      complete.
- [ ] After completion, refresh the page: **Current Version** now
      reflects the uploaded ZIP's version.
- [ ] **Backup Available** card now shows the previous version and a
      timestamp.

Concurrency & lock:

- [ ] Start an **Apply Update** and, before it finishes, in another tab
      try to upload again — expect **409 / "An update is already in
      progress"**.

Rollback:

- [ ] Click **Rollback to Previous Version** → confirm the prompt →
      success toast shows the restored version. Refresh: **Current
      Version** now matches the backup.

Failure-path observation (optional, only with disposable env):

- [ ] Force a build failure (e.g. upload a ZIP whose `npm run build`
      will fail) and confirm the SSE log emits a `dependencies` /
      `build` error followed by `rollback: running` and
      `rollback: done`.

If pm2 isn't available (e.g. Replit dev), the final "complete" SSE will
say "Please restart the application manually" instead of "restarted
successfully" — this is expected.

---

## 3. Recovering a stuck update

The `database` step runs `npm run db:push --force`, which under the
hood invokes `drizzle-kit push`. On ambiguous schema changes (table
renames, adding a unique constraint to a non-empty table, etc.)
drizzle prints an interactive question and waits on stdin for an
answer. As of this fix, the updater:

1. Always closes the spawned command's stdin (after writing a stream
   of newlines for the `database` step), so drizzle accepts the
   highlighted default for any prompt and exits instead of hanging.
2. Caps the `database` step at a 2-minute watchdog (down from the
   default 5-minute command timeout). If the push genuinely stalls,
   the SSE log surfaces `database:error` within 2 minutes and the
   automatic rollback runs.
3. Looks for drizzle prompt text (`Is X table created or renamed
   from another table?`, `Do you want to truncate ... table?`, etc.)
   in the captured output and includes it verbatim in the
   `database:error` (or `database:done` advisory) message — so the
   superadmin can see *which* question drizzle was asking instead of
   just "command failed".

### Symptoms of a stuck update (older releases or unrelated hang)

- The progress bar in **Settings → Application Update** sits on
  "Updating database schema..." with no log activity for >30s.
- The "Previous update log" panel shows a spinner that never
  resolves after a page refresh, even though the header
  "Current Version" still reads the *old* version.
- `pm2 logs` (or your process manager equivalent) shows
  `drizzle-kit push` running with no further output.

### How to recover

1. **Wait for the watchdog.** With this fix in place, the database
   step will fail itself within 2 minutes and the updater will
   automatically restore the previous version from the backup. No
   manual action needed in the common case.

2. **Force-fail a stuck command.** If the host is on an older build
   (pre-fix) or another command in the pipeline is wedged, SSH into
   the server and run:

   ```
   pkill -f drizzle-kit
   ```

   (or `pkill -f "npm run db:push"` for the wrapper). The updater's
   `runCommand` will see the non-zero exit, emit `database:error`,
   and the automatic rollback will run.

3. **Verify the rollback.** Hard-refresh `/app-update` in the
   browser. The header **Current Version** should match the
   **Backup Status** version (the previous release). If the rest of
   the app loads normally, you are recovered.

4. **If the in-memory lock is still set.** If a subsequent upload
   returns `409 / "An update is already in progress"` after the
   rollback finished, restart the app process (`pm2 restart all` or
   equivalent) to clear the lock.

### Manual upgrade path (clients pulling code without the in-app updater)

For clients who upgrade by SSHing in and running the update steps by
hand (instead of using the in-app updater), the recommended single
command is:

```
npm run db:upgrade
```

This runs `scripts/prod-v37-precheck-and-cleanup.sql` via `psql` and
then `drizzle-kit push --force`. The precheck:

1. Drops duplicate `templates` rows (per `(channel_id,
   whatsapp_template_id)`) so the new `template_channel_wa_id_unique`
   constraint applies without prompting to truncate.
2. Drops duplicate `automation_edges` rows (per `(automation_id,
   source_node_id, target_node_id, source_handle)`) so the
   `automation_edges_unique_handle_idx` applies without prompting.
3. Converts empty-string and orphan values in nullable FK varchar
   columns (`conversations.assigned_to`, `conversations.channel_id`,
   `conversations.contact_id`, `users.created_by`, `users.channel_id`,
   `contacts.created_by`) back to `NULL` so FK constraints stop
   throwing `Key (col)=()` violations during push.

`db:precheck` requires `psql` on the host (already installed on
standard Ubuntu/Debian deployments via `postgresql-client`). If `psql`
is missing the precheck step will fail before any schema change is
attempted, so it is safe to retry after installing it.

If you cannot install `psql`, the same cleanup runs at server boot via
`server/startup-migration.ts` — but that only happens **after** the
push, so a dirty database will still fail `db:push` once. Install
`postgresql-client` and rerun `npm run db:upgrade`.

The in-app updater (Settings → Application Update) does not need this
script: it always runs `db:push --force` with stdin closed and a
2-minute watchdog (Task #148), and the same cleanups run on the next
server boot.

### Note on the v3.6 → v3.7 release

The v3.6 → v3.7 schema change introduced new tables (e.g.
`update_run_events`) that `drizzle-kit push` flagged as
*possibly renamed from* an existing table, which triggered the
hang on hosts running pre-fix updaters. With this fix in place the
prompt is auto-answered with "create new table" (the correct
choice) and the same v3.7 ZIP installs cleanly without re-issuing
the release.

## Why the restart no longer freezes the panel

Older releases finished the Application Update by:

1. emitting `restart:running` on the SSE stream,
2. awaiting `pm2 restart all` inline, then
3. emitting `complete:done` and writing `status: success` to
   `update_runs`.

The problem: `pm2 restart all` sends `SIGINT` to the very Node
process serving the SSE stream. Steps 2/3 never finished, the
socket died mid-transfer, and the `update_runs` row stayed at
`status: running` forever. On hard refresh the panel hydrated
that orphan row and showed a perpetual 95% spinner — even though
the install had actually succeeded and the new version was live.

The lifecycle was reordered as part of Task #100:

1. detect pm2 availability synchronously (`which pm2`),
2. emit `restart:done` (or `restart:warning` when pm2 is missing)
   and `complete:done`,
3. **persist `status: success` and close the SSE response** via
   the controller's `finish()` helper,
4. only then schedule a **detached** `pm2 restart all` on a 500ms
   delay so the OS has time to flush the response body before
   `SIGINT` arrives.

Two consequences:

- The user sees the panel reach 100% with a green completion
  message before pm2 restarts the process. After the restart, a
  hard refresh shows the run as `success` (not a spinner) because
  the row was finalised before the socket closed.
- The detached child pm2 restarter survives the parent's death
  because it is in its own session (`detached: true`), and our
  Node process is not waiting on it, so no `unhandledRejection`
  fires when pm2 kills us.

### Recovering an `interrupted` run

If a previous Node process *did* die mid-update before the
finaliser ran (e.g. an old release without this fix, an OOM
kill, or a crash), the row is recovered on the next server
boot by the **startup reconciler**:

- If the on-disk `VERSION` matches the run's `to_version`, the
  row is marked `success` with a "reconciled at startup" note —
  the install actually landed, only the SSE event was lost.
- Otherwise the row is marked `interrupted` with a clear
  message, the panel stops spinning, and a fresh upload may be
  attempted immediately.

Look for `[app-update] Startup reconciler: settled N stale
'running' update_runs row(s).` in the server logs to confirm
the sweep ran.
