dev_arc_aws/TERMIX_MIGRATION.md
Claude e10acfd4a1
Close verification gaps for Phases 1b, 6, 7 via real infra + browser tests
With iproute2 and Playwright/Chromium now available in the sandbox:
- Re-verified host-metrics network/ports/firewall collectors against a real
  root SSH host (real eth0, ss ports with process names, parsed iptables rules).
- Browser-verified the host-metrics page, the terminal tabs/split-panes/theme
  UI (live remote prompt, 1->2->4 xterm panes, prefs persisted), and the
  host-to-host transfer UI (live progress panel to completion + on-disk check).

Updates documentation only; no code changes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01BbJV5nm8KPVH1oNJYKpnoF
2026-06-19 16:02:40 +00:00


# Termix → ArchNest Migration Plan
Status doc for porting Termix's full feature set into ArchNest as a single app, single backend, single auth, single database — reskinned to match ArchNest's design. Written so any session (human or AI) can see exactly what's done, what's next, and why decisions were made.
Source: `https://github.com/SamuelSJames/Termix` (user's fork), cloned for reference at the time of writing. Upstream is `Termix-SSH/Termix`, an Electron + Express + Drizzle ORM self-hosted SSH/RDP/VNC management app — **not** a small terminal widget. It ships as its own Docker image with a `guacd` sidecar for RDP/VNC.
## Decision: why merge into ArchNest's backend, not Termix's
ArchNest's backend (Fastify + better-sqlite3 + JWT) is small and already has things worth keeping: the bookmarks system, the integration adapter framework (Proxmox/AWS/NetBird/Cloudflare/Weather/SSH health checks — see `backend/src/integrations/`), the audit log, and a working auth/profile system built this session. Termix's backend is much bigger but its *value* is the SSH/tunnel/file-manager/Docker/RDP feature logic, not its auth system (OIDC/LDAP/2FA) or its Drizzle schema. So: **port Termix's feature modules onto ArchNest's existing Fastify app and auth**, don't adopt Termix's backend wholesale.
## What is explicitly NOT being ported (user-approved tradeoff)
- Electron desktop app + native installers (Chocolatey/Flatpak/AppImage/MSI/Cask) — ArchNest is a web app.
- OIDC/LDAP/2FA/SSO and Termix's own multi-user auth system — replaced by ArchNest's existing JWT auth. User confirmed they don't currently use 2FA/OIDC/LDAP, so this is an accepted downgrade, not an oversight.
- ~30 language translations (i18n) — not a stated goal, not being ported.
- All Termix branding — logos, icons, About/product copy, links to Termix's Discord/docs/GitHub. Every ported UI component gets reskinned to ArchNest's Tailwind theme (gold `#C8A434`, the existing dark palette) as part of porting it, not as a separate pass.
Everything else — SSH terminal, tunnels, file manager, Docker management, RDP/VNC/Telnet, host metrics — is in scope to fully port, feature-equivalent, just rebuilt on ArchNest's stack.
## Phases
Each phase is independently committable and testable. Do not start a later phase before the previous one is working end-to-end and committed — this is a large port and needs to land in reviewable chunks.
### Phase 1 — SSH Terminal (DONE)
The actual `/terminal` page: a real interactive SSH terminal in the browser (xterm.js + WebSocket), reusing the SSH credentials already stored in ArchNest's integrations (no second "add a host" flow — Termix's separate host-manager concept is being merged into ArchNest's existing `integrations` table/SSH adapter, not duplicated).
**Termix source files this phase is based on** (sizes as of the fork snapshot, for scoping):
- `src/backend/ssh/terminal.ts` (2,570 lines) — WebSocket route handling, message protocol (connect/data/resize/disconnect), output buffering.
- `src/backend/ssh/terminal-session-manager.ts` (570 lines) — session lifecycle, reattach-on-reconnect, per-user session caps, idle timeout, optional session logging to disk.
- `src/backend/ssh/ssh-connection-pool.ts` (225 lines) — connection reuse.
- `src/backend/ssh/host-resolver.ts`, `jump-host-chain.ts`, `terminal-jump-hosts.ts` (~900 lines combined) — jump-host / bastion chaining.
- `src/backend/ssh/auth-manager.ts`, `credential-username.ts`, `host-key-verifier.ts`, `terminal-auth-helpers.ts` (~950 lines combined) — credential resolution, host key verification/trust-on-first-use.
- `src/backend/ssh/opkssh-auth.ts`, `opkssh-cert-auth.ts` (~1,350 lines) — OPKSSH (OpenPubkey SSH) certificate auth.
- `src/backend/ssh/tmux-monitor.ts`, `tmux-helper.ts`, `tmux-monitor-helpers.ts` (~1,350 lines) — tmux session detection/monitoring inside the terminal.
- Frontend: `src/ui/features/terminal/*` — xterm.js wrapper, tab system, up-to-4-panel split screen, theme/font customization.
**Scope split for this phase, given the size above:**
- **Phase 1a (doing this now)**: core single-session SSH terminal. WebSocket connect/data/resize/disconnect, using ArchNest's existing SSH integration config/secrets (host/port/username/password/privateKey/passphrase — already in `backend/src/integrations/ssh.ts`) instead of Termix's separate host table. One terminal per tab, no split panes yet, no jump hosts, no OPKSSH, no tmux monitor, no session recording/logging. Ported onto Fastify's WebSocket support, reusing ArchNest's JWT auth for the WS handshake.
- **Phase 1b (follow-up, not blocking 1a)**: jump-host/bastion chaining, host-key verification/trust-on-first-use UI, tab system + up to 4 split panes, terminal theme/font customization settings.
- **Phase 1c (follow-up, lower priority)**: OPKSSH cert auth, tmux session monitor/reattach, session recording/logging to disk.
Rationale for splitting: 1a alone is a real, useful terminal (matches what `/terminal` needs to stop being a placeholder) and is testable end-to-end on its own. Bundling jump-hosts/OPKSSH/tmux into the first pass risks a large unreviewable change with no working checkpoint in between.
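The connect/data/resize/disconnect protocol described for Phase 1a can be pictured as a small discriminated union plus a defensive parser. This is a minimal sketch only: the field names (`integrationId`, `cols`, `rows`, `data`) and the `parseClientMessage` helper are assumptions for illustration, not taken from `terminal.ts`.

```typescript
// Hypothetical message shapes; the real field names in terminal.ts may differ.
type ClientMessage =
  | { type: "connect"; integrationId: number; cols: number; rows: number }
  | { type: "data"; data: string }
  | { type: "resize"; cols: number; rows: number }
  | { type: "disconnect" };

function isPositiveInt(v: unknown): v is number {
  return typeof v === "number" && Number.isInteger(v) && v > 0;
}

// Parse one raw WebSocket text frame; returns null for anything malformed.
function parseClientMessage(raw: string): ClientMessage | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const { type, integrationId, cols, rows, data } = parsed as Record<string, unknown>;
  switch (type) {
    case "connect":
      return isPositiveInt(integrationId) && isPositiveInt(cols) && isPositiveInt(rows)
        ? { type: "connect", integrationId, cols, rows }
        : null;
    case "data":
      return typeof data === "string" ? { type: "data", data } : null;
    case "resize":
      return isPositiveInt(cols) && isPositiveInt(rows) ? { type: "resize", cols, rows } : null;
    case "disconnect":
      return { type: "disconnect" };
    default:
      return null;
  }
}
```

Rejecting malformed frames before they reach the SSH layer keeps the route handler's switch exhaustive and simple.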
**Status:**
- **Phase 1a — done.** `/terminal` is a real interactive SSH terminal: `backend/src/routes/terminal.ts` (WebSocket, connect/input/resize/disconnect over `ssh2`), `backend/src/db/secrets.ts` (shared secret loader), `src/pages/Terminal.tsx` (xterm.js + host picker, reuses ArchNest's existing SSH integrations — no duplicate host table). Verified end-to-end against a real test SSH server. No jump hosts, no tabs/split panes, no OPKSSH, no tmux monitor yet — see 1b/1c below.
- **Phase 1b — done.**
- **Jump-host chaining**: an SSH integration's config can carry `jumpHostIntegrationId` referencing another SSH integration. `backend/src/routes/terminal.ts` connects to the jump host first, opens a `forwardOut()` channel to the real target, and connects the target `Client` over that channel (single-hop; mirrors Termix's core mechanism without its multi-hop/credential-sharing complexity). Verified end-to-end with two real test SSH servers (one as jump, one as target).
- **Host-key verification (TOFU)**: new `ssh_host_keys` table (`backend/src/db/index.ts`) stores a SHA-256 fingerprint per SSH integration on first successful connect; subsequent connects are rejected if the fingerprint changes, via `ssh2`'s `hostVerifier` connect option. No interactive accept/reject-changed-key UI yet — first-use accept-and-store, hard-reject on mismatch. Verified both the accept-on-first-use and reject-on-mismatch paths against a real test server.
- **Settings UI for multiple SSH hosts**: `src/pages/Settings.tsx` previously could only show/edit one integration per type, which silently broke multi-host SSH. Added a dedicated `SshHostsSection` with its own per-host cards (Save/Test/Delete) and an "Add SSH Host" flow, including a `Jump Host` dropdown populated from the other configured SSH hosts.
- **Tabs + up to 4 split panes**: `src/pages/Terminal.tsx` rewritten around a `TerminalPane` component (one xterm + WebSocket connection each, reusable). Each tab holds 1/2/4 panes (single / split-2 / 2x2 grid); each pane connects independently to whichever SSH host is clicked while it's focused.
- **Terminal theme/font customization**: a preferences bar (theme preset, font size, font family) persisted to `localStorage` (`archnest-terminal-prefs`), applied per-pane on connect.
- Verified via a clean production build (`tsc -b && vite build`), and subsequently **browser-verified** (Playwright/Chromium, once available): logged in, opened `/terminal`, connected a pane to a real SSH host (confirmed by the live remote prompt `uitester@vm:~$` and a `Connected — <host>` status), split into 2 and 4 panes (confirmed 1→2→4 live `xterm` instances rendering as a 2×2 grid), opened a new tab, and changed the theme preference — confirmed it persisted to `localStorage` (`archnest-terminal-prefs` → `{"themeName":"Matrix",...}`). The original build-only caveat is now closed.
- **Phase 1c — done.** The one verification gap from the original pass (cert auth) has since been closed; see below.
- **OPKSSH / certificate auth**: `ssh2` (the npm library) has no support for OpenSSH certificates — confirmed by inspecting its type definitions and README, no certificate-related auth flow exists. Implemented `connectWithCertificate()` in `backend/src/routes/terminal.ts`: writes the stored private key + certificate to a temp dir (mode `0600`) and shells out to the system `ssh` binary (which natively understands `-o CertificateFile=`) under a real `node-pty` pty. Used automatically when an SSH integration has a `certificate` secret configured (new field added to Settings' SSH host form). Does **not** support jump-host chaining (documented limitation, not silently dropped — Termix's own OPKSSH path doesn't generally chain through jump hosts either). **Verified end-to-end** (gap from the original pass now closed): with `openssh-client`/`openssh-server` available, built a real SSH CA, signed a user key into an OpenSSH certificate (principal `certuser`), configured a real `sshd` with `TrustedUserCAKeys` + `PasswordAuthentication no` (so only cert auth could succeed), created a real `ssh`-type integration carrying the private key + certificate as secrets, and drove ArchNest's actual `/api/terminal` WebSocket route: it reached `connected`, spawned the cert-auth pty, and a real shell echoed back a marker as `certuser` — i.e. authentication genuinely happened via the certificate, not a password or plain key.
- **tmux session monitor/reattach**: new WebSocket message `list_tmux` execs `tmux list-sessions` on the target host and returns session names; `connect` accepts an optional `tmuxSession` (validated against `^[A-Za-z0-9_-]{1,64}$` before being interpolated into a shell command, to prevent injection) which attaches to that tmux session or creates it if missing, via `exec('tmux attach -t <name> || tmux new-session -s <name>', { pty: ... })` instead of a plain `client.shell()`. `src/pages/Terminal.tsx`'s pane header gained a tmux session picker (plain shell / new session / attach to an existing one). **Verified end-to-end** against a real test SSH server running real `bash`/`tmux` processes (via `node-pty`): listed zero sessions, created a `testsess` tmux session through the WS protocol, confirmed a follow-up `list_tmux` call returned `['testsess']`.
- **Session recording/logging to disk**: new SSH integration config field `sessionLogging` (checkbox in Settings' SSH host form). When set, all outbound terminal output (both the `ssh2` path and the cert-auth pty path) is appended to `<ARCHNEST_SESSION_LOG_DIR ?? './data/session-logs'>/<integrationId>_<timestamp>.log`. No log browsing/download UI yet (not built — out of scope for this pass, not silently dropped). **Verified end-to-end**: a real shell session's output was confirmed present in its log file on disk.
- Everything in this phase was tested against live processes (real `sshd`, real `tmux`, real cert-auth via a real SSH CA), not mocked, so every backend path, including cert auth, is now exercised end-to-end. The Phase 1b UI (tabs/split panes/theme) was build/type-verified only when this phase landed; it has since been browser-verified (see the Phase 1b status above). All cert-auth test artifacts (CA, signed cert, test `sshd`, test OS user, test backend/DB) were cleaned up afterward.
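The tmux attach-or-create step from Phase 1c hinges on validating the session name before it is interpolated into a shell command. A minimal sketch of that check follows, using the regex and command string quoted above; the helper name `buildTmuxCommand` is hypothetical.

```typescript
// Same pattern as quoted in the Phase 1c notes: 1-64 chars of [A-Za-z0-9_-].
const TMUX_SESSION_RE = /^[A-Za-z0-9_-]{1,64}$/;

// Hypothetical helper: returns the attach-or-create command, or throws
// if the name could smuggle shell metacharacters into the command line.
function buildTmuxCommand(session: string): string {
  if (!TMUX_SESSION_RE.test(session)) {
    throw new Error(`invalid tmux session name: ${JSON.stringify(session)}`);
  }
  return `tmux attach -t ${session} || tmux new-session -s ${session}`;
}
```

Because the allow-list excludes spaces, quotes, `;`, `|`, and `$`, the interpolated name cannot terminate the command or start a new one.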
### Phase 2 — SSH Tunnels (DONE)
Source: `src/backend/ssh/tunnel.ts` (2,414 lines) + `tunnel-c2s-relay.ts`, `tunnel-socks5-relay.ts`, `tunnel-ssh-primitives.ts`, `tunnel-utils.ts`, `tunnel-c2s-relay-utils.ts` (~830 lines combined) + frontend `src/ui/features/tunnel/*`.
**Scope decision**: Termix distinguishes "S2S" (server-to-server, backend-managed) and "C2S" (client-to-server, routed through Termix's desktop/Electron app) tunnels. ArchNest has no desktop client (explicitly out of scope per the top of this doc), so only the **S2S model** was ported — a single persistent backend process manages all tunnels, same as Termix's S2S path. C2S's WebSocket data-multiplexing-to-a-desktop-client layer was not ported; it has no equivalent need in a pure web app.
**What was built:**
- `backend/src/ssh/connect.ts` — extracted `loadSshHost`/`baseConnectConfig`/`connectTarget` (jump-host chaining + TOFU host-key verification) out of `terminal.ts` into a shared module, since tunnels need the exact same SSH-connection logic terminal sessions do.
- `backend/src/tunnels/manager.ts` — in-memory tunnel runtime manager (`Map<tunnelId, RuntimeState>`), mirroring Termix's `activeTunnels`/`connectionStatus` maps but scoped down to this app's needs. Three modes:
- **Local forward**: a `net.Server` listens on `sourcePort`; each inbound connection calls `client.forwardOut()` to `endpointHost:endpointPort` and pipes the two sockets together.
- **Remote forward**: `client.forwardIn('0.0.0.0', sourcePort)` asks the SSH server to bind that port; incoming `'tcp connection'` events are piped to a local `net.connect()` against `endpointHost:endpointPort`.
- **Dynamic (SOCKS5)**: a `net.Server` listens on `sourcePort` running a minimal SOCKS5 handshake (`backend/src/tunnels/socks5.ts`, CONNECT-only, no-auth — sufficient for this use case, not a general SOCKS5 server), then `forwardOut()`s to whatever target the client requested per-connection.
- Automatic reconnection: on SSH error/close or listener bind failure, schedules a retry after `retryIntervalMs`, up to `maxRetries`, then settles into an `error` status (mirrors Termix's retry/backoff but simplified to a fixed interval rather than exponential — sufficient for this scale).
- `startAutoStartTunnels()` is called once at server boot to bring up any tunnel with `autoStart` set.
- `backend/src/routes/tunnels.ts` — REST CRUD (`GET/POST /api/tunnels`, `DELETE /api/tunnels/:id`) plus `POST /api/tunnels/:id/connect` / `/disconnect`. Status (`stopped`/`connecting`/`connected`/`retrying`/`error` + retry count + last error) is read directly off the in-memory runtime state on every `GET /api/tunnels` (simple polling from the frontend every 3s — no SSE/EventSource, unlike Termix; not needed at this scale and keeps the implementation smaller).
- `backend/src/db/index.ts` — new `tunnels` table: `id, name, integration_id, mode, source_port, endpoint_host, endpoint_port, auto_start, max_retries, retry_interval_ms, created_at`. Each tunnel references an existing SSH `integrations` row (no separate host table, consistent with the rest of this migration) — no separate "preset" concept needed since a tunnel row already *is* the saved preset.
- `src/pages/Tunnels.tsx` — new page (`/tunnels`, added to the sidebar with a `Waypoints` icon) with a creation form (name, SSH host picker, mode, source port, endpoint host/port, auto-start) and a card grid showing each tunnel's status, mode, route, and Start/Stop/Delete actions, polling every 3 seconds.
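For the dynamic (SOCKS5) mode in the list above, the per-connection target comes from the client's CONNECT request. The sketch below parses that request as laid out in RFC 1928; it is not the code in `backend/src/tunnels/socks5.ts`, the name `parseSocks5Connect` is hypothetical, and IPv6 addresses are left out for brevity.

```typescript
import { Buffer } from "node:buffer";

interface ConnectTarget {
  host: string;
  port: number;
}

// Parse a SOCKS5 request: VER(0x05) CMD RSV(0x00) ATYP DST.ADDR DST.PORT.
// Only CMD = CONNECT (0x01) is accepted; anything else yields null.
function parseSocks5Connect(req: Buffer): ConnectTarget | null {
  if (req.length < 5) return null;
  if (req.readUInt8(0) !== 0x05 || req.readUInt8(1) !== 0x01 || req.readUInt8(2) !== 0x00) {
    return null;
  }
  const atyp = req.readUInt8(3);
  if (atyp === 0x01) {
    // IPv4: 4 address bytes followed by a 2-byte big-endian port
    if (req.length < 10) return null;
    const octets = [4, 5, 6, 7].map((i) => req.readUInt8(i));
    return { host: octets.join("."), port: req.readUInt16BE(8) };
  }
  if (atyp === 0x03) {
    // Domain name: 1 length byte, the name, then the port
    const len = req.readUInt8(4);
    if (len === 0 || req.length < 5 + len + 2) return null;
    return { host: req.toString("ascii", 5, 5 + len), port: req.readUInt16BE(5 + len) };
  }
  return null; // IPv6 (0x04) and unknown address types are not handled in this sketch
}
```

The returned `host`/`port` pair is what would be handed to `forwardOut()` for that connection.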
**Verified end-to-end** against a real test SSH server (extending the same real-`ssh2`-`Server` + `node-pty` pattern used in Phase 1c) that genuinely handles `tcpip` (forwardOut) and `tcpip-forward`/`cancel-tcpip-forward` (forwardIn) requests, plus a real upstream TCP echo server: created one tunnel of each mode (local/remote/dynamic), connected all three, and confirmed real data flowed through each — local forward and remote forward both delivered the upstream server's banner through the tunnel, and the dynamic tunnel completed a real SOCKS5 CONNECT handshake and relayed data. Also verified disconnect correctly tears down the local listener (`ECONNREFUSED` after stopping). All test artifacts (test SSH server, test backend instance, test DB, tokens) were cleaned up afterward.
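The fixed-interval retry behaviour described for the tunnel manager can be expressed as a pure state transition. This is a sketch under assumptions: the status names are the ones listed for `GET /api/tunnels`, but `TunnelRuntime`, `RetryPolicy`, and `onTunnelFailure` are hypothetical names, not the shapes used in `manager.ts`.

```typescript
type TunnelStatus = "stopped" | "connecting" | "connected" | "retrying" | "error";

interface TunnelRuntime {
  status: TunnelStatus;
  retryCount: number;
  lastError: string | null;
}

interface RetryPolicy {
  maxRetries: number;
  retryIntervalMs: number;
}

// Given a failure, return the next runtime state and the delay before the
// next attempt (null when retries are exhausted and the tunnel settles in "error").
function onTunnelFailure(
  state: TunnelRuntime,
  policy: RetryPolicy,
  error: string,
): { next: TunnelRuntime; retryInMs: number | null } {
  if (state.retryCount >= policy.maxRetries) {
    return { next: { status: "error", retryCount: state.retryCount, lastError: error }, retryInMs: null };
  }
  return {
    next: { status: "retrying", retryCount: state.retryCount + 1, lastError: error },
    retryInMs: policy.retryIntervalMs, // fixed interval, no exponential backoff
  };
}
```

Keeping the transition pure makes the retry cap easy to test without real sockets or timers; the caller would pass `retryInMs` to `setTimeout`.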
### Phase 3 — Remote File Manager (DONE, with documented gaps)
Source: `src/backend/ssh/file-manager*.ts` (six files, ~3,900 lines combined: list/content/action/operation/download routes + session + utils) + frontend `src/ui/features/file-manager/*`.
**Scope decisions:**
- **Ephemeral SFTP connections** instead of Termix's pooled/long-lived sessions: each request opens a fresh SSH+SFTP connection (`backend/src/ssh/sftp.ts`'s `withSftp()`), does one operation, and tears the connection down. Simpler than managing a third long-lived connection lifecycle alongside terminal and tunnel sessions, and acceptable at this app's scale.
- **No sudo/permission-elevation support.** Termix falls back to shell commands, with a stored sudo password piped in, when SFTP returns a permission error. That fallback was not ported in this pass: no privileged remote test target was available in this sandbox to verify against safely (the same category of gap as the original OPKSSH cert-auth gap in Phase 1c, which has since been closed). Documented here rather than silently dropped.
- **No server-to-server transfer** — this matches Termix's actual behavior (its own cross-host "transfer" is just sequential `download` then `upload` through the browser; same-host moves use shell `mv`/`cp`, which isn't ported since sudo isn't). Not a regression.
- **Whole-file-in-memory model** for view/edit, same as Termix: `GET/PUT /api/files/:id/content` reads/writes the entire file via `sftp.readFile`/`writeFile`. Files over 50MB (`MAX_EDITABLE_SIZE`) are rejected with a message pointing at download/upload instead. Binary detection (so binary files are shown as a "can't edit" message rather than mangled text) uses the same heuristic as Termix: scan the first 8KB for a null byte or a >1% ratio of other control bytes.
- **Streaming download** (`GET /api/files/:id/download`) for files of any size, via `sftp.createReadStream()` piped straight into the HTTP response rather than buffered in memory.
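The binary-detection heuristic mentioned in the scope decisions (scan the first 8KB for a null byte or more than 1% other control bytes) might look like the sketch below. Which bytes count as "control" is an assumption here (everything below `0x20` except tab, LF, and CR), and `looksBinary` is a hypothetical name.

```typescript
import { Buffer } from "node:buffer";

const SAMPLE_BYTES = 8 * 1024;

// Returns true when the leading sample of the file looks like binary data.
function looksBinary(content: Buffer): boolean {
  const sample = content.subarray(0, SAMPLE_BYTES);
  if (sample.length === 0) return false; // an empty file is editable text
  let control = 0;
  for (let i = 0; i < sample.length; i++) {
    const byte = sample.readUInt8(i);
    if (byte === 0x00) return true; // a null byte is decisive on its own
    // Assumption: tab (0x09), LF (0x0a), and CR (0x0d) are ordinary text
    if (byte < 0x20 && byte !== 0x09 && byte !== 0x0a && byte !== 0x0d) control++;
  }
  return control / sample.length > 0.01;
}
```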
**What was built:**
- `backend/src/ssh/sftp.ts` — `withSftp(integrationId, fn)`: opens an ephemeral SSH+SFTP connection (reusing `connect.ts`'s jump-host-chaining + TOFU logic from Phase 1/2), runs `fn`, then tears the connection down.
- `backend/src/routes/files.ts` — `GET /api/files/:id/list`, `GET/PUT /api/files/:id/content`, `POST /api/files/:id/mkdir`, `POST /api/files/:id/rename`, `POST /api/files/:id/delete`, `POST /api/files/:id/chmod`, `GET /api/files/:id/download`, `POST /api/files/:id/upload` (multipart, via newly-added `@fastify/multipart`, 1GB limit).
- `src/pages/Files.tsx` — new page (`/files`, sidebar entry with a `FolderOpen` icon): SSH host picker, breadcrumb-navigable directory browser, inline text editor for non-binary files, new-folder/rename/delete/chmod-via-octal-display/upload/download actions.
**Verified end-to-end** against a real filesystem-backed SFTP server built specifically for this (using `ssh2`'s server-side low-level SFTP protocol API — genuine `OPEN`/`READ`/`WRITE`/`READDIR`/`RENAME`/`REMOVE`/`MKDIR`/`STAT`/`SETSTAT` handlers backed by real `fs` calls against a real directory on disk, not a mock). Confirmed by inspecting the actual files/permissions on disk after each operation (`cat`, `ls`, `stat -c '%a'`), not just the HTTP response: list, read, write, mkdir, rename, delete, chmod, upload, and download (byte-for-byte `diff` match against the uploaded source file) all round-tripped correctly. One real bug was caught and fixed during this verification: the download route's wrapping `Promise` was resolving immediately after `reply.send(stream)` instead of waiting for the response to actually finish, which raced Fastify into ending the HTTP response (and the route's `cleanup()` into closing the underlying SSH connection) before the SFTP stream had sent any data — produced a 0-byte download with a "stream closed prematurely" log line. Fixed by letting `reply.send(stream)`'s return value resolve the promise instead of resolving synchronously, and moving connection cleanup to the response's own `finish`/`close` events. All test artifacts (test SFTP server, test backend instance, test DB, tokens, temp files) were cleaned up afterward.
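The download bug above comes down to ordering: cleanup must not run until the stream has actually been delivered. The sketch below shows that ordering with Node's standard `pipeline` from `node:stream/promises`; it is a stdlib analogue, not the Fastify-specific fix described above, and `streamThenCleanup` is a hypothetical name.

```typescript
import { Readable, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";

// Pipe source into response and run cleanup only once the transfer has
// finished (or failed), never synchronously after starting the pipe.
async function streamThenCleanup(
  source: Readable,
  response: Writable,
  cleanup: () => void,
): Promise<void> {
  try {
    await pipeline(source, response); // resolves after the destination finishes
  } finally {
    cleanup(); // e.g. close the SSH connection backing the SFTP read stream
  }
}
```

Running cleanup in `finally` also covers the error path, so a failed transfer does not leak the underlying connection.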
### Phase 4 — Docker Container Management (DONE, with documented gaps)
**Architecture decision**: Termix's source (`src/backend/ssh/docker.ts`, `docker-container-routes.ts`, `docker-console.ts`) drives Docker over SSH+CLI. ArchNest's existing `backend/src/integrations/docker.ts` adapter already talks to the **Docker Engine HTTP API directly** via a stored `baseUrl` (the only config field exposed in Settings for a docker integration — no SSH credentials, no TLS client certs). Rather than bolt on a second SSH-based Docker code path, Phase 4 extends the existing Engine-API approach: all new code talks straight to `dockerd`'s HTTP API.
**What was built:**
- `backend/src/docker/client.ts` — `loadDockerHost(integrationId)`, `dockerFetch`/`dockerJson` thin wrappers over the Engine API, `demuxDockerStream()` (best-effort parser for the 8-byte-frame multiplexed stdout/stderr format used by non-TTY containers' `logs`/`stats` endpoints, falling back to raw text for TTY containers).
- `backend/src/docker/exec.ts` — `openExecStream()` opens a `docker exec` session and performs the raw HTTP "hijack": after `POST /exec/{id}/start`, the daemon switches the TCP socket to a raw bidirectional byte stream (no further HTTP framing), so the implementation connects via `net`/`tls` directly, writes the HTTP request by hand, and strips the response headers before treating the rest as raw I/O.
- `backend/src/routes/docker.ts` — `dockerRoutes` (REST: list/stats/logs/start/stop/restart/pause/unpause/remove, behind the standard `app.authenticate` hook) and `dockerExecRoutes` (websocket `/api/docker/exec`, auth via a `token` query param verified on the `connect` message, mirroring `terminal.ts`'s pattern since websocket upgrades can't carry an `Authorization` header).
- `src/pages/Containers.tsx` — new page (`/containers`, sidebar entry with a `Box` icon): Docker host picker, container table (state, image, live CPU/memory from `stats`, ports) with start/stop/restart/pause/unpause/remove actions, a logs modal, and an exec-terminal modal reusing `Terminal.tsx`'s xterm.js + `FitAddon` pattern (base64-encoded I/O over the websocket).
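The 8-byte-frame format mentioned for `demuxDockerStream()` puts the stream type in byte 0 (1 = stdout, 2 = stderr), three zero bytes after it, and a big-endian uint32 payload length in bytes 4-7. A best-effort parser with a raw-text fallback could be sketched as follows; `demuxFrames` is a hypothetical stand-in, not the project's implementation.

```typescript
import { Buffer } from "node:buffer";

interface DemuxedOutput {
  stdout: string;
  stderr: string;
}

// Split a multiplexed Docker stream into stdout/stderr. If the bytes do not
// look like valid frames (e.g. a TTY container), treat everything as raw text.
function demuxFrames(buf: Buffer): DemuxedOutput {
  const raw = (): DemuxedOutput => ({ stdout: buf.toString("utf8"), stderr: "" });
  const out: DemuxedOutput = { stdout: "", stderr: "" };
  let offset = 0;
  while (offset + 8 <= buf.length) {
    const streamType = buf.readUInt8(offset);
    const padding = buf.readUIntBE(offset + 1, 3); // bytes 1-3 must be zero
    const size = buf.readUInt32BE(offset + 4);
    const end = offset + 8 + size;
    if (streamType > 2 || padding !== 0 || end > buf.length) return raw();
    const payload = buf.toString("utf8", offset + 8, end);
    if (streamType === 2) out.stderr += payload;
    else out.stdout += payload;
    offset = end;
  }
  // Leftover bytes that do not form a full header mean this was not framed data
  return offset === buf.length ? out : raw();
}
```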
**Verified end-to-end** against a real Docker daemon (`dockerd`) started inside the sandbox on a TCP port, with a real container built from a `docker import` of the host's own rootfs (no network access to a registry was available, so a minimal real image was constructed locally rather than pulled). Confirmed via real container state transitions (`docker inspect`) cross-checked against the API responses: list, stats, logs (including the frame-demuxed multi-line case), start/stop/restart/pause/unpause, and remove all worked correctly through the new REST routes. The exec-terminal websocket path was exercised with a real `ws` client driving an interactive shell inside the real container (sent `echo HELLO_FROM_EXEC`, got the echoed output back through the hijacked socket) and a live resize.
One real bug was caught and fixed during this verification: `openExecStream()` originally called `POST /exec/{id}/resize` immediately after creating the exec instance but before starting it — confirmed via a raw `curl` repro that the Docker daemon blocks that request indefinitely until the exec's process actually exists, which hung every exec session before it ever reached `ready`. Fixed by passing the initial terminal size via `ConsoleSize` in the exec-create payload instead, and only using the explicit resize endpoint for later live resizes (sent after the exec is already running, so it's safe there, and was verified working in that position).
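The shape of that fix can be sketched as two small builders: one for the exec-create body carrying the initial size, one for the later live-resize path. `ConsoleSize` is a `[height, width]` pair in the Docker Engine API and the resize endpoint takes `h`/`w` query parameters; the helper names and the remaining field choices (`Tty`, attach flags) are assumptions for illustration.

```typescript
interface ExecCreateBody {
  AttachStdin: boolean;
  AttachStdout: boolean;
  AttachStderr: boolean;
  Tty: boolean;
  Cmd: string[];
  ConsoleSize: [number, number]; // [height, width]
}

// Hypothetical helper: initial size travels with exec-create, so no resize
// call is needed before the exec process exists.
function buildExecCreateBody(cmd: string[], cols: number, rows: number): ExecCreateBody {
  return {
    AttachStdin: true,
    AttachStdout: true,
    AttachStderr: true,
    Tty: true,
    Cmd: cmd,
    ConsoleSize: [rows, cols],
  };
}

// Hypothetical helper: path for live resizes, used only after the exec has started.
function execResizePath(execId: string, cols: number, rows: number): string {
  return `/exec/${encodeURIComponent(execId)}/resize?h=${rows}&w=${cols}`;
}
```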
**Documented gap**: no browser is available in this sandbox, so `Containers.tsx` was verified by type-checking and a production `vite build`, and by manually exercising every backend endpoint it calls against the real daemon above — but it has not been clicked through in an actual browser. All test artifacts (test `dockerd` instance, test image/container, test backend instance, test DB, tokens, temp files) were cleaned up afterward.
### Phase 5 — RDP/VNC/Telnet (DONE)
**Architecture decision**: Termix's own approach (`new GuacamoleLite({ server }, ...)`) attaches an unfiltered `'upgrade'` listener to the whole HTTP server, which would have collided with `@fastify/websocket`'s existing routes (`/api/terminal`, `/api/docker/exec`). Instead, `guacamole-lite`'s lower-level `ClientConnection`/`Crypt` classes (imported directly from their CJS lib files, typed via a small ambient `.d.ts`) are driven from inside our own Fastify `{websocket: true}` route, on a socket Fastify has already upgraded — no interaction with the HTTP server's `'upgrade'` event at all. `guacd` itself remains a required sidecar process (a real `guacd` binary, available via `apt`), but is not wired into a `docker-compose.yml` yet — see gap below.
**What was built:**
- `backend/src/integrations/types.ts` / `registry.ts` / `routes/integrations.ts` — new `remote_desktop` integration type (config: `protocol`/`hostname`/`port`/`username`/`domain`, secret: `password`).
- `backend/src/integrations/remoteDesktop.ts` — `testConnection()` does a raw TCP probe of the configured port (distinct from the real Guacamole-protocol tunnel below).
- `backend/src/routes/guacamole.ts` — `/api/guacamole` websocket route: authenticates the `token` query param via `app.jwt.verify` (same pattern as `terminal.ts`/`docker.ts`, since websocket upgrades can't carry an `Authorization` header), loads the `remote_desktop` integration's config + decrypted secrets, server-side constructs and encrypts a Guacamole connection token via `Crypt`, then instantiates `ClientConnection` directly on the open socket and calls `.connect({ host, port })` against `guacd` (configurable via `ARCHNEST_GUACD_HOST`/`ARCHNEST_GUACD_PORT`, default `127.0.0.1:4822`). New env var `ARCHNEST_GUAC_CRYPT_KEY` (32-byte AES-256-CBC key) added to `.env.example`.
- `src/pages/RemoteDesktop.tsx` — new page (`/remote-desktop`, sidebar entry with a `MonitorSmartphone` icon): host picker + a `guacamole-common-js` `Guacamole.Client`/`Guacamole.WebSocketTunnel` canvas viewer. Note: `Guacamole.WebSocketTunnel` appends its own `"?" + data` query string inside `connect()`, so the tunnel URL passed to its constructor must be bare, with `token`/`integrationId` passed as the string argument to `client.connect(...)` instead — this was caught and fixed during browser verification (see below).
- `src/pages/Settings.tsx` — generic integration card extended with a `remote_desktop` entry (protocol/hostname/port/username/domain/password fields).
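To make the 32-byte AES-256-CBC key requirement concrete, here is a sketch of encrypting and decrypting a connection payload with Node's `crypto` module. The envelope format (base64 of a JSON object holding a base64 `iv` and `value`) and the function names are assumptions for illustration; the `Crypt` class's actual format may differ.

```typescript
import { Buffer } from "node:buffer";
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt a JSON-serialisable payload with AES-256-CBC; key must be 32 bytes.
function encryptToken(payload: unknown, key: Buffer): string {
  const iv = randomBytes(16); // fresh IV per token
  const cipher = createCipheriv("aes-256-cbc", key, iv);
  const value = Buffer.concat([cipher.update(JSON.stringify(payload), "utf8"), cipher.final()]);
  const envelope = { iv: iv.toString("base64"), value: value.toString("base64") };
  return Buffer.from(JSON.stringify(envelope), "utf8").toString("base64");
}

// Reverse of encryptToken; throws if the key is wrong or the token is corrupt.
function decryptToken(token: string, key: Buffer): unknown {
  const envelope = JSON.parse(Buffer.from(token, "base64").toString("utf8")) as {
    iv: string;
    value: string;
  };
  const decipher = createDecipheriv("aes-256-cbc", key, Buffer.from(envelope.iv, "base64"));
  const plain = Buffer.concat([
    decipher.update(Buffer.from(envelope.value, "base64")),
    decipher.final(),
  ]);
  return JSON.parse(plain.toString("utf8"));
}
```

A key of any other length makes `createCipheriv` throw, which is a useful fail-fast check at startup when `ARCHNEST_GUAC_CRYPT_KEY` is misconfigured.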
**Verified end-to-end** against real, locally-installed infrastructure (no mocking): a real `guacd` (v1.3.0, installed via `apt`) and a real `Xtightvnc`/`vncserver` desktop. A raw `ws` client test first confirmed the tunnel itself — JWT auth, integration lookup, token encryption, and the guacd handshake — by observing real Guacamole-protocol `size`/`img` instructions come back over the websocket. Then the actual `RemoteDesktop.tsx` page was exercised in a real headless Chromium (Playwright) against a real running Vite dev server + backend: logged in, navigated to `/remote-desktop`, selected the configured VNC host, and confirmed the UI reaches `Connected` state with a live VNC framebuffer (cursor visible) rendered on canvas — not just a build/typecheck pass.
One real bug was caught and fixed during this browser verification: the page initially called `client.connect()` with no arguments while the tunnel URL already had `token=...&integrationId=...` appended, producing a malformed `...&integrationId=1?undefined` URL and an `ECONNREFUSED`-style failure. Root cause (confirmed by reading `Guacamole.WebSocketTunnel`'s source): it always appends its own `"?" + data` itself. Fixed by passing a bare tunnel URL and moving the query data into the `client.connect(data)` call.
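The failure mode is easy to reproduce in isolation. The function below only mimics the append behaviour described above (`url + "?" + data`); it is not the library's code, and the host name and token value are placeholders.

```typescript
// Mimics the described behaviour: the tunnel always appends "?" + data itself.
function tunnelRequestUrl(tunnelUrl: string, data?: string): string {
  return `${tunnelUrl}?${data}`;
}

// Broken: query already baked into the URL, connect() called with no data.
const broken = tunnelRequestUrl("ws://host.example/api/guacamole?token=T&integrationId=1");

// Fixed: bare tunnel URL, query data passed through connect(data).
const fixed = tunnelRequestUrl("ws://host.example/api/guacamole", "token=T&integrationId=1");

// In the broken URL the stray suffix is swallowed into the last parameter's value.
const brokenId = new URL(broken).searchParams.get("integrationId");
const fixedId = new URL(fixed).searchParams.get("integrationId");
```

Here `brokenId` ends up as `1?undefined` rather than `1`, which is why the server could not resolve the integration.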
**Documented gaps**:
- Telnet was not verified — no real telnet server could be installed in this sandbox (`telnetd`/`inetutils-telnetd` 404'd against the available `apt` mirror snapshot). RDP was not verified either (no real RDP target was available); only the VNC path has a live, browser-confirmed end-to-end test. The route code path is identical across all three protocols (same `ClientConnection`/`guacd` flow, differing only in the `connection.type` and per-protocol settings), so this is a coverage gap rather than a known defect.
- `guacd` is not yet added to a `docker-compose.yml` for actual deployment on `racknerd1` — it currently must be run as a sidecar process/container manually, pointed at via `ARCHNEST_GUACD_HOST`/`ARCHNEST_GUACD_PORT`. Wiring that into the real deployment compose file is follow-up work, not done here.
- All test artifacts (test `guacd`/`vncserver` processes, test backend instance, test DB, tokens, temp files, Playwright scripts) were cleaned up afterward.
### Phase 6 — Host Metrics Widgets (DONE, with documented gaps)
**Architecture decision**: Termix's `host-metrics.ts` route (2,584 lines) is tightly coupled to its own Drizzle schema, multi-user auth, SOCKS5/jump-host chaining, TOTP-gated metrics sessions, and a metrics cache/backoff/request-queue layer — none of that scaffolding was ported. The actual reusable value is the 10 `widgets/*-collector.ts` files: small, near-backend-agnostic functions that take a raw `ssh2.Client`, run a few shell commands, and return null-tolerant typed metrics. Those collectors were reimplemented against ArchNest's own `ssh2` connection objects (reusing `loadSshHost`/`connectTarget` from Phase 1/2, not Termix's pool/cache/session substrate). Delivery is simple on-demand REST + 5s client-side polling — the same low-tech approach Phase 2 used for tunnel status — rather than Termix's own caching/backoff system. This was built as a new standalone page (`/host-metrics`) rather than folded into `Infrastructure.tsx`: the existing Infrastructure page is a fleet-wide overview (one row per resource), while these widgets are a deep per-host live view, closer in spirit to `Terminal.tsx`/`RemoteDesktop.tsx`'s "pick a host, see one rich view" pattern. The existing `backend/src/integrations/ssh.ts` `listResources` probe (disk/mem/load percentages for the Infrastructure overview) is left as-is and unrelated — it answers "is this host healthy at a glance," not "show me everything about this host."
**What was built:**
- `backend/src/ssh/metrics/common.ts` — shared `execCommand()` (exec + timeout + cleanup) and small numeric helpers, ported from Termix's `widgets/common-utils.ts`.
- `backend/src/ssh/metrics/{cpu,memory,disk,uptime,network,system,processes,ports,firewall,login-stats}.ts` — 10 collectors ported from Termix's `widgets/*-collector.ts`, each independently null-safe. `ports.ts` only implements the `ss`-based path (Termix also had a `netstat` fallback parser, dropped as redundant on any modern target).
- `backend/src/ssh/metrics/index.ts` — `collectHostMetrics()` aggregator.
- `backend/src/routes/metrics.ts` — `GET /api/integrations/:id/metrics`, authenticated, connects via `connectTarget` (transparent jump-host support inherited for free) and runs the aggregator.
- `src/pages/HostMetrics.tsx` — new page (`/host-metrics`, sidebar entry with a `Gauge` icon): SSH host picker + CPU/memory/disk gauges, uptime/system card, network interfaces, listening ports, top processes table, firewall summary, login activity summary. Polls every 5s while a host is selected.
- `src/lib/api.ts`: `getHostMetrics()` + `HostMetrics` type.
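The "exec + timeout + cleanup" shape of `execCommand()` can be illustrated with a small wrapper. This is a minimal sketch under assumptions, not the code in `common.ts`: the name `withTimeout`, the `onTimeout` callback, and the choice to resolve `null` on timeout or failure are all hypothetical, picked to match the null-tolerant behaviour the collectors are described as having.

```typescript
// Hypothetical helper, not the actual execCommand(): race some async work
// against a timer, run a cleanup hook if the timer wins, and resolve null
// instead of rejecting so callers can stay null-tolerant.
function withTimeout<T>(
  work: Promise<T>,
  ms: number,
  onTimeout: () => void,
): Promise<T | null> {
  return new Promise<T | null>((resolve) => {
    let settled = false;

    const timer = setTimeout(() => {
      if (settled) return;
      settled = true;
      onTimeout(); // e.g. close the SSH exec channel that never finished
      resolve(null);
    }, ms);

    work.then(
      (value) => {
        if (settled) return;
        settled = true;
        clearTimeout(timer); // cleanup: do not leave the timer pending
        resolve(value);
      },
      () => {
        if (settled) return;
        settled = true;
        clearTimeout(timer);
        resolve(null); // a failed command becomes "no data", not an exception
      },
    );
  });
}
```

In a collector, `work` would be the promise wrapping one remote command, and `onTimeout` would tear down that command's channel.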
**Verified end-to-end** against a real, locally-installed `sshd` (not mocked): installed `openssh-server`, created a real test user, ran a real ArchNest backend + SQLite DB, created a real `ssh`-type integration, and hit `GET /api/integrations/:id/metrics` over a real SSH connection. CPU, memory, disk, uptime, system, and processes all returned real, correct data from the live container (verified CPU% against `/proc/stat` math, memory/disk against `free`/`df`, process list against a parallel manual `ps aux`).
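For reference, the `/proc/stat` arithmetic used as the cross-check can be sketched as below. This is an illustration of the standard two-sample calculation, not the contents of `cpu.ts`; the function and type names are hypothetical, and treating `iowait` as idle time is an assumption of this sketch.

```typescript
// Hypothetical names; illustrates the usual two-sample /proc/stat calculation.
interface CpuSample {
  idle: number; // idle + iowait jiffies
  total: number; // sum of the eight counted fields
}

// Parse the aggregate "cpu" line of /proc/stat.
// Field order: user nice system idle iowait irq softirq steal [guest guest_nice]
// guest/guest_nice are skipped because they are already included in user/nice.
function parseProcStat(text: string): CpuSample | null {
  const line = text.split("\n").find((l) => l.startsWith("cpu "));
  if (!line) return null;
  const fields = line.trim().split(/\s+/).slice(1, 9).map(Number);
  if (fields.length < 4 || fields.some((n) => !Number.isFinite(n))) return null;
  const [user = 0, nice = 0, system = 0, idle = 0, iowait = 0, irq = 0, softirq = 0, steal = 0] = fields;
  return {
    idle: idle + iowait,
    total: user + nice + system + idle + iowait + irq + softirq + steal,
  };
}

// Busy percentage between two samples, rounded to one decimal place.
function cpuPercent(before: CpuSample, after: CpuSample): number | null {
  const totalDelta = after.total - before.total;
  const idleDelta = after.idle - before.idle;
  if (totalDelta <= 0) return null; // counters did not advance; no usable reading
  return Math.round((1 - idleDelta / totalDelta) * 1000) / 10;
}
```

The two samples come from reading `/proc/stat` twice with a short pause in between; a single read only gives counters since boot.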
One real bug was caught and fixed: the first version ran all 10 collectors via `Promise.all`, which opens 15-20 concurrent SSH exec channels — this silently exceeded OpenSSH's default `MaxSessions 10` and starved whichever collectors lost the race (`network`/`processes`/`ports`/`firewall`/`loginStats` came back empty while `cpu`/`memory`/`disk`/`uptime`/`system` succeeded). Fixed by running collectors sequentially in `collectHostMetrics()` — acceptable since this is on-demand polling, not a latency-critical path.
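The fix amounts to awaiting each collector before starting the next, so a single connection never holds more than one collector's channels at a time. A minimal sketch, assuming each collector is an async function that may throw; `collectSequentially` and the `Collector` type are hypothetical names, not the real `collectHostMetrics()` signature:

```typescript
// Hypothetical shape: each collector resolves to its metrics or throws.
type Collector = () => Promise<unknown>;

// Run collectors one at a time instead of via Promise.all, so the number of
// simultaneously open exec channels stays well under the server's session limit.
// A failing collector yields null for its own key and does not abort the rest.
async function collectSequentially(
  collectors: Record<string, Collector>,
): Promise<Record<string, unknown>> {
  const result: Record<string, unknown> = {};
  for (const name of Object.keys(collectors)) {
    try {
      result[name] = await collectors[name]();
    } catch {
      result[name] = null;
    }
  }
  return result;
}
```

The cost is that total latency becomes the sum of the collectors' run times rather than the maximum, which is the trade-off accepted above for an on-demand polling endpoint.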
**Follow-up verification (gaps from the first pass now closed):** with `iproute2` installed and a test `sshd` configured for root login, the three previously-unverified collectors were re-run against a real host over the real API and all returned correct data:
- `network` → `eth0` with its real IP (`192.0.2.2/24`) and state `UP`.
- `ports` → `source: "ss"`, 6 listening ports, with real process names and PIDs (`sshd`, etc.).
- `firewall` → after adding two `iptables` rules (`--dport 22`/`--dport 80 -j ACCEPT`) and connecting as root, `type: "iptables"`, `status: "active"`, and the INPUT chain parsed back the two rules correctly.
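To make the `ports` result concrete, here is a rough sketch of turning `ss` output into structured entries. It assumes the command is `ss -H -tlnp` (no header, TCP, listening, numeric, with process info) and is not the parser in `ports.ts`; the `ListeningPort` shape and function name are hypothetical. Process details only appear when the connecting user is allowed to see them, which is consistent with the root login used above.

```typescript
// Hypothetical result shape; the real collector's type may differ.
interface ListeningPort {
  address: string;
  port: number;
  process: string | null;
  pid: number | null;
}

// Parse lines such as:
//   LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=812,fd=3))
// Columns: State, Recv-Q, Send-Q, Local Address:Port, Peer Address:Port, Process
function parseSsListening(output: string): ListeningPort[] {
  const ports: ListeningPort[] = [];
  for (const raw of output.split("\n")) {
    const cols = raw.trim().split(/\s+/);
    if (cols.length < 5 || cols[0] !== "LISTEN") continue;
    const local = cols[3] ?? "";
    const sep = local.lastIndexOf(":"); // last colon, so "[::]:80" still works
    const port = Number(local.slice(sep + 1));
    if (sep < 0 || !Number.isInteger(port)) continue;
    // The process column is absent when the caller lacks permission to see it.
    const proc = /users:\(\("([^"]+)",pid=(\d+)/.exec(raw);
    ports.push({
      address: local.slice(0, sep),
      port,
      process: proc?.[1] ?? null,
      pid: proc ? Number(proc[2]) : null,
    });
  }
  return ports;
}
```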
The **frontend was also browser-verified** (Playwright/Chromium, now available): logged in, opened `/host-metrics`, selected the host, and confirmed all widgets render with real live data (CPU/memory/disk gauges, uptime, the `eth0` interface, listening ports with process names, the top-processes table, the `iptables` firewall summary with 2 rules, and login activity) — see screenshot evidence captured during the run.
**Remaining documented gap**:
- `loginStats` returned empty because the test host's `wtmp` had no login history and neither `/var/log/auth.log` nor `/var/log/secure` was populated; `last` and `grep` both ran successfully but had nothing to report. This is a data-availability limitation, not a code defect, so the collector remains unverified against a host with real login history.

All test artifacts (test `sshd` process, test OS users, test iptables rules, test backend instance, test DB, tokens, temp files) were cleaned up afterward.
### Phase 7 — Host-to-Host File Transfer (DONE)
**Architecture decision**: Termix's `host-transfer.ts` (3,428 lines, plus `transfer-paths.ts`/`transfer-routing.ts`) is a heavily over-engineered system — parallel-segment workers, a tar-vs-per-file-SFTP method selector driven by incompressibility heuristics, hung-stream watchdogs, retry orchestration, worker caches, archive-method previews. Per the same stance taken in every prior phase, only the **core value** was ported: streaming a file/directory from one SSH host to another through the backend (read from the source's SFTP, write to the destination's SFTP, item by item). This is exactly the `item_sftp` path Termix itself falls back to in most cases; the parallel/tar/watchdog machinery is left behind as unjustified at this app's scale. Reuses ArchNest's existing `connectTarget` SSH helper (jump-host support inherited for free on both ends), not Termix's connection pool/session manager. Delivery mirrors Phase 2/6: an in-memory transfer registry + REST polling, no websockets.
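As an illustration of what an in-memory transfer registry with time-based cleanup can look like, here is a minimal sketch. The names `activeTransfers` and `cleanupOldTransfers` match the ones used in `transfer.ts`, but the entry fields, the status values other than those quoted in this document, and the function signature are assumptions made for the example.

```typescript
// Assumed entry shape; the real TransferProgress type carries more fields.
type TransferStatus = "running" | "completed" | "failed" | "cancelled";

interface TransferEntry {
  id: string;
  status: TransferStatus;
  finishedAt: number | null; // epoch ms, null while still running
}

const ONE_HOUR_MS = 60 * 60 * 1000;

// Process-local registry: lost on restart, which is acceptable for
// short-lived progress data that the UI polls over REST.
const activeTransfers = new Map<string, TransferEntry>();

// Drop entries that finished more than maxAgeMs ago; running entries are kept.
// Returns how many entries were removed.
function cleanupOldTransfers(now: number = Date.now(), maxAgeMs: number = ONE_HOUR_MS): number {
  let removed = 0;
  activeTransfers.forEach((entry, id) => {
    if (entry.finishedAt !== null && now - entry.finishedAt > maxAgeMs) {
      activeTransfers.delete(id);
      removed++;
    }
  });
  return removed;
}
```

A status endpoint then only needs a map lookup by id, and the cleanup can run on a timer or opportunistically whenever a new transfer starts.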
**What was built:**
- `backend/src/ssh/transfer.ts` — the transfer engine. `startTransfer()` returns a `transferId` and runs asynchronously: opens an SFTP connection to both hosts, scans the source tree up front (depth-first walk) to compute `totalFiles`/`totalBytes` for a real progress bar, recreates the directory structure on the destination, then streams each file (source `createReadStream` → dest `createWriteStream`). Tracks live progress in an in-memory `activeTransfers` map; supports `move` (deletes the source tree, files-then-dirs-deepest-first, after a successful copy) and cooperative cancellation (a flag checked between files and on every read chunk). `cleanupOldTransfers()` drops finished entries after an hour.
- `backend/src/routes/transfer.ts`: `POST /api/transfers` (start), `GET /api/transfers` (list), `GET /api/transfers/:id` (status), `POST /api/transfers/:id/cancel`. All authenticated; start is zod-validated.
- `src/pages/Files.tsx` — added a per-entry "Send to another host" action (disabled unless ≥2 SSH hosts exist) opening a modal (destination host dropdown, destination directory, move checkbox), plus a live "Host-to-Host Transfers" panel that polls (1s while any transfer is running, 5s otherwise) and shows per-transfer progress bars, current file, status, and a cancel button.
- `src/lib/api.ts`: `startTransfer`/`listTransfers`/`getTransfer`/`cancelTransfer` + `TransferProgress` type.
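The up-front scan and the deletion order used for `move` can be sketched on an in-memory tree. This is a simplified model, not the engine in `transfer.ts`: `TreeNode`, `planTransfer`, and `deletionOrder` are hypothetical names, and a real scan would walk the source over SFTP instead of a prebuilt object.

```typescript
// Hypothetical in-memory stand-in for a remote directory listing.
interface TreeNode {
  name: string;
  size?: number; // files only
  children?: TreeNode[]; // directories only
}

interface TransferPlan {
  dirs: string[]; // parents always appear before their children
  files: { path: string; size: number }[];
  totalFiles: number;
  totalBytes: number;
}

// Depth-first walk that records totals for the progress bar and the
// directories to recreate on the destination, in a safe creation order.
function planTransfer(root: TreeNode): TransferPlan {
  const plan: TransferPlan = { dirs: [], files: [], totalFiles: 0, totalBytes: 0 };
  const walk = (node: TreeNode, parent: string): void => {
    const path = parent ? `${parent}/${node.name}` : node.name;
    if (node.children) {
      plan.dirs.push(path);
      for (const child of node.children) walk(child, path);
    } else {
      const size = node.size ?? 0;
      plan.files.push({ path, size });
      plan.totalFiles += 1;
      plan.totalBytes += size;
    }
  };
  walk(root, "");
  return plan;
}

// For move: remove files first, then directories deepest-first, so every
// directory is already empty by the time it is removed.
function deletionOrder(plan: TransferPlan): string[] {
  const depth = (p: string): number => p.split("/").length;
  const dirs = [...plan.dirs].sort((a, b) => depth(b) - depth(a));
  return [...plan.files.map((f) => f.path), ...dirs];
}
```

Knowing `totalBytes` before the first byte is copied is what lets the progress bar show a real percentage instead of an indeterminate spinner.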
**Verified end-to-end** against two real SSH endpoints (a real `sshd` with two real OS users as source/dest, not mocked): created two real `ssh`-type integrations and exercised all four behaviours over the real API:
- **Recursive directory copy** of a tree (text file + a 100 KB random binary + a nested subdir): completed 3/3 files / 100,019 bytes; verified on disk that the directory structure was recreated, text content was intact, and the binary's `md5sum` matched the source exactly.
- **Move**: a single file transferred with `move:true` — confirmed present on the destination and **deleted from the source** afterward.
- **Error handling**: a transfer of a nonexistent source path ended `status: "failed"` with a clear `"No such file"` error rather than hanging.
- **Cancellation**: an 80 MB transfer cancelled ~0.3 s in stopped at 162 KB with `status: "cancelled"` — confirming the mid-stream cancel flag actually interrupts the copy.
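The cancellation result above depends on the flag being checked per chunk, not only between files. A minimal sketch of that cooperative pattern follows; it uses a hypothetical pull-based `ChunkSource` in place of the real SFTP read and write streams, and `copyWithCancel` is not a function from `transfer.ts`.

```typescript
// Hypothetical stand-ins for the SFTP read/write streams.
interface ChunkSource {
  read(): Promise<Uint8Array | null>; // null signals end of file
}

interface CancelFlag {
  cancelled: boolean;
}

interface CopyResult {
  status: "completed" | "cancelled";
  bytes: number;
}

// Copy chunk by chunk, checking the shared flag before every read so a cancel
// request takes effect mid-file instead of waiting for the file to finish.
async function copyWithCancel(
  source: ChunkSource,
  write: (chunk: Uint8Array) => Promise<void>,
  flag: CancelFlag,
): Promise<CopyResult> {
  let bytes = 0;
  for (;;) {
    if (flag.cancelled) return { status: "cancelled", bytes };
    const chunk = await source.read();
    if (chunk === null) return { status: "completed", bytes };
    await write(chunk);
    bytes += chunk.length;
  }
}
```

Because the check is cooperative, the copy stops at the next chunk boundary after the flag is set, which is why a cancelled transfer reports a small nonzero byte count and not zero.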
The **frontend transfer UI was also browser-verified** (Playwright/Chromium): logged in, opened the Files page, switched to a source SSH host, navigated into a directory, clicked the per-row "Send to another host" action, picked the destination host + directory in the modal, and confirmed the live "Host-to-Host Transfers" panel rendered the transfer and reached a full `completed` progress bar — then verified on the destination host's disk that the file actually landed with correct content.
All test artifacts (test `sshd`, both test OS users + their home dirs, test backend instance, test DB, temp files) were cleaned up afterward.
### Also worth checking during/after the phases above
- Data export/import of SSH hosts/credentials/file-manager data — a nice-to-have, not yet scheduled.
## Tracking
Update the phase status lines above as work lands. Each phase should get its own commit(s) on `claude/wonderful-faraday-qxym5t`, following the existing commit message style (descriptive title + why, plus `Co-Authored-By`/`Claude-Session` trailers).