dev_arc_aws/docs/mesh-prerequisite-gate.md
Samuel James cdd93f204e
docs: sync HANDOFF/README/design-decisions; add mesh-gate design (#32)
Bring the docs in line with what shipped since the auth phases, and hand
off the next planned feature cleanly for another agent to pick up.

- HANDOFF.md: new TL;DR (auth complete; persistent terminals + Docker
  three-ways shipped); prominent "next task = Mesh Prerequisite Gate"
  callout warning not to code before the open decisions are answered;
  corrected standing rules (kiro/<feature> branches, gh-based workflow,
  npm run build over plain tsc, Co-authored-by trailers); architecture
  sections updated for TerminalSessionContext, dockerSsh/agents routes,
  docker_agent_reports table, ssh/docker.ts, and the new agent env vars;
  new "Docker: three ways" section.
- README.md: Containers/Terminal page rows, route-group list, SSH layer,
  agent/ dir, ARCHNEST_AGENT_TOKEN/ARCHNEST_AGENT_STALE_MS, current-state
  paragraph, and doc reading order.
- design-decisions.md: Terminal (persistence) and Containers (three
  sources + detail tab) page notes; backend Docker-transport note; mesh
  gate flagged under Future Integration Notes.
- docs/mesh-prerequisite-gate.md (new): full design with lockout-safety
  invariants and the open decisions (A-D) needed before implementation.

Docs only; no code changed.

Co-authored-by: Samuel James <ssamjame@amazon.com>
Co-authored-by: Kiro <noreply@kiro.dev>
2026-06-20 16:42:47 -04:00


# Mesh Network Prerequisite Gate — Design
Design doc for requiring a **mesh network (NetBird) to be configured, tested,
and verified before the rest of ArchNest can be configured**. Written before
implementation. The hard problem here is **not locking the admin out**, so this
doc leads with that.
> Status: DESIGN — not yet implemented. Decisions marked **[DECIDE]** need the
> user's input before coding.
## Goal
After account setup, an admin must establish a verified mesh connection before
they can configure integrations, bookmarks, tunnels, etc. The intent: ArchNest
is meant to operate over a private mesh, and other features (e.g. the Docker
agent ingest, SSH to mesh hosts) assume mesh reachability. The gate makes that a
first-class, enforced prerequisite rather than an operational assumption.
## The lockout problem (read first)
A naive gate that blocks *everything* until mesh is verified is dangerous: if
the mesh test fails (wrong token, NetBird down, transient network), the admin
could be unable to reach the very settings needed to fix it. The existing
codebase already takes lockout seriously (the "last active admin" guards in
`auth.ts`). The gate must follow the same principle:
**Invariants (non-negotiable):**
1. The gate **never blocks** `/api/auth/*` (login, logout, sessions, password).
2. The gate **never blocks** the mesh configuration + test endpoints, nor the
integration create/update/test routes needed to configure the mesh.
3. Enforcement is primarily **UI-level** (a gate screen that *is itself* the
mesh-config UI), so the admin always has a way forward — the gate screen
lets them enter/edit/test the mesh right there.
4. There is an explicit, logged **admin override** ("skip / I'll set this up
later") — see **[DECIDE A]**. Without an override, a hard outage of the mesh
provider could brick configuration access.
5. The mesh config row is always editable even when the gate is unsatisfied.
## What counts as "verified"? [DECIDE B]
Options, from loosest to strictest:
- **(B1) Reachable:** a NetBird integration exists and `testConnection`
succeeds (the NetBird API answers `/api/peers` with the token). Proves the
control-plane token works, not that *this host* is on the mesh.
- **(B2) On the mesh:** the backend host itself has an interface/IP in the mesh
range (e.g. `100.64.0.0/10`), checked server-side. Proves the ArchNest host
is actually meshed.
- **(B3) Reachable + peers present:** B1 plus `listResources()` returning ≥1
connected peer.
Recommendation: **B1 as the baseline verification** (it's what the existing
NetBird adapter already supports and is deterministic), with **B2 as an
additional optional check** surfaced as info ("this host's mesh IP: …"). B3 is
a nice extra, but a single-peer network is legitimate, so peers should not be
required.
This needs your call — see [DECIDE B] at the end.
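
As a rough illustration of the B2 check, the sketch below tests whether any of a host's IPv4 addresses falls inside the mesh range. The function names (`ipv4ToInt`, `isInCidr`, `findMeshIp`) are hypothetical and not part of the existing codebase, and the default range is simply the `100.64.0.0/10` example given above; a real deployment may use a different range.

```typescript
// Parse a dotted-quad IPv4 address into an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside `cidr` (for example "100.64.0.0/10").
function isInCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf('/');
  if (slash < 0) return false;
  const bitsText = cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  if (bits > 32) return false;
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(cidr.slice(0, slash));
  if (ipInt === null || baseInt === null) return false;
  // Compare network prefixes with division rather than bit shifts,
  // which would treat the values as signed 32-bit integers.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Return the first candidate address inside the mesh range, if any.
// The default range is an assumption taken from the example in the text.
function findMeshIp(addresses: string[], meshCidr: string = '100.64.0.0/10'): string | undefined {
  return addresses.find((address) => isInCidr(address, meshCidr));
}
```

On the server, the candidate addresses would presumably come from Node's `os.networkInterfaces()`, filtered to IPv4 entries. Keeping the range check a pure function makes it easy to unit-test without touching real interfaces.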
## Where state lives
There is **no server-side key-value config store** today; all config is in the
`integrations` table. Two options:
- **(S1) Derive from the NetBird integration:** "mesh verified" = there exists
a `netbird` integration with `status = 'connected'` (optionally within a
freshness window). No new table. Simplest, but conflates "an integration that
happens to be NetBird" with "the designated mesh".
- **(S2) New `system_config` key-value table:** explicit keys like
`mesh.integrationId`, `mesh.verifiedAt`, `mesh.overrideUntil`. Cleaner, gives
us a real home for future system-level settings (and the override flag), at
the cost of a new table + endpoints.
Recommendation: **S2 — a small `system_config` kv table.** The gate needs to
persist an override flag and a "designated mesh integration" pointer that S1
can't cleanly represent, and ArchNest will want a system-config store for other
things eventually (this is also where a future "mesh required: on/off" toggle
lives). Proposed schema:
```sql
CREATE TABLE IF NOT EXISTS system_config (
  key        TEXT PRIMARY KEY,
  value      TEXT NOT NULL,
  updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
```
Keys: `mesh.integrationId` (the designated NetBird integration), `mesh.verifiedAt`
(ISO timestamp of last successful verify), `mesh.overrideUntil` (optional ISO
timestamp for a temporary admin skip), `mesh.required` (`"true"`/`"false"`,
default true — lets the whole gate be turned off).
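
To make the key layout concrete, here is a minimal sketch of how rows from the proposed `system_config` table might be folded into a typed object. `ConfigRow`, `MeshConfig`, `readMeshConfig`, and `isOverrideActive` are hypothetical names, not existing code. The sketch models only the temporary `mesh.overrideUntil` form of the override, since the permanent-skip variant is still an open decision.

```typescript
// One row of the proposed system_config table (hypothetical shape).
interface ConfigRow {
  key: string;
  value: string;
}

// Typed view of the mesh.* keys described above.
interface MeshConfig {
  integrationId: string | null;
  verifiedAt: Date | null;
  overrideUntil: Date | null;
  required: boolean;
}

// Parse an ISO timestamp; a missing or unparseable value becomes null.
function parseTimestamp(text: string | undefined): Date | null {
  if (text === undefined) return null;
  const parsed = new Date(text);
  return Number.isNaN(parsed.getTime()) ? null : parsed;
}

function readMeshConfig(rows: ConfigRow[]): MeshConfig {
  const byKey = new Map<string, string>();
  for (const row of rows) byKey.set(row.key, row.value);
  return {
    integrationId: byKey.get('mesh.integrationId') ?? null,
    verifiedAt: parseTimestamp(byKey.get('mesh.verifiedAt')),
    overrideUntil: parseTimestamp(byKey.get('mesh.overrideUntil')),
    // A missing row falls back to the documented default: the gate is on.
    required: byKey.get('mesh.required') !== 'false',
  };
}

// A temporary override counts only while its timestamp is still in the future.
function isOverrideActive(config: MeshConfig, now: Date): boolean {
  return config.overrideUntil !== null && config.overrideUntil.getTime() > now.getTime();
}
```

Treating anything other than the literal string `"false"` as "required" means a corrupted `mesh.required` value leaves the gate on; whether that is the right failure direction depends on how [DECIDE D] is resolved.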
## Frontend flow
### New auth status
Add `'needs-mesh'` to the `AuthStatus` union in `AuthContext.tsx`. `refresh()`
currently: token → `api.me()` → `'logged-in'`. New: after a successful
`api.me()`, also call `api.getMeshStatus()`; if mesh is **required and not
verified and not overridden**, set `'needs-mesh'` instead of `'logged-in'`.
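
The decision rule above can be sketched as a small pure function. `MeshStatus` here mirrors only the three boolean fields the rule needs, and `PostLoginStatus` and `statusAfterLogin` are hypothetical names; the real `AuthStatus` union in `AuthContext.tsx` has other members that are not reproduced here.

```typescript
// Only the fields of the mesh status that the rule depends on (assumed shape).
interface MeshStatus {
  required: boolean;
  verified: boolean;
  overridden: boolean;
}

// The two outcomes possible once api.me() has succeeded.
type PostLoginStatus = 'logged-in' | 'needs-mesh';

// Gate only when the mesh is required AND not verified AND not overridden.
function statusAfterLogin(mesh: MeshStatus): PostLoginStatus {
  const gated = mesh.required && !mesh.verified && !mesh.overridden;
  return gated ? 'needs-mesh' : 'logged-in';
}
```

Any one of the three conditions being false is enough to let the user through, so turning `mesh.required` off, verifying the mesh, and using the override all lead to `'logged-in'`.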
### App routing (`App.tsx`)
Insert a branch **after** `logged-out`/`enrolling` and **before** `Dashboard`:
```tsx
// Reached only after the logged-out and enrolling branches have returned.
if (status === 'needs-mesh') return <MeshGate />;
return <Dashboard />;
```
### `MeshGate` page
A focused, full-screen page (styled like Enrollment) that:
- Explains the prerequisite.
- Lets the admin **configure the NetBird mesh** (reuse the integration
create/test form — same `createIntegration` + `testIntegration` calls
Enrollment's `ConnectForm` already uses), or pick an existing NetBird
integration as the designated mesh.
- Runs **Detect → Test → Verify**: shows the test result, and (B2) the detected
mesh IP of the ArchNest host.
- On success, marks `mesh.verifiedAt`, then calls `refresh()` → advances to
`Dashboard`.
- Provides the **admin override** control per [DECIDE A].
- Shows **members (non-admins)** a "waiting on an admin to finish mesh setup"
message instead of a config form. A member who logs in while the mesh is
unverified can't fix it, because only admins configure integrations.
**[DECIDE C]**: do we even allow member login pre-verification, or block all
use until verified?
### Enrollment
Keep Enrollment's account step. The mesh step can either be folded into
Enrollment as a mandatory step before `finishEnrollment()`, or live purely as
the post-login `needs-mesh` gate. Recommendation: **gate only** (don't duplicate
in Enrollment) — one code path, and it also covers existing installs that
predate the gate.
## Backend
- `GET /api/system/mesh-status` (mirrors `setup-status`): returns
`{ required, verified, overridden, meshIntegrationId, hostMeshIp? }`. Behind
`authenticate` (any logged-in user can read).
- `POST /api/system/mesh/verify` (admin): designates a NetBird integration as
the mesh, runs its `testConnection`, (B2) checks host mesh IP, persists
`mesh.integrationId` + `mesh.verifiedAt` on success. Returns the result.
- `POST /api/system/mesh/override` (admin) **[DECIDE A]**: sets
`mesh.overrideUntil` (or a permanent skip). Writes a `logEvent`.
- Optional `PUT /api/system/mesh/required` (admin): toggle `mesh.required`.
- **Lockout safety:** none of the gate enforcement lives in a global request
hook that could block auth/integration/system routes. If we add any
server-side enforcement at all (beyond the UI gate), it must explicitly
exempt `/api/auth/*`, `/api/integrations*`, and `/api/system/*`.
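
If server-side enforcement is ever added, the exemption list above could be expressed as a single predicate that the enforcement code consults before blocking anything. This is a sketch under that assumption; `GATE_EXEMPT_PREFIXES` and `isGateExempt` are hypothetical names, and `/api/bookmarks` in the usage note stands in for any gated route.

```typescript
// Route prefixes the gate must never block, taken from the list above.
// "/api/integrations" has no trailing slash so it mirrors the "/api/integrations*" pattern.
const GATE_EXEMPT_PREFIXES: readonly string[] = [
  '/api/auth/',
  '/api/integrations',
  '/api/system/',
];

// True when a request path must bypass any mesh-gate enforcement.
function isGateExempt(path: string): boolean {
  // Drop the query string so "/api/integrations?type=netbird" still matches.
  const queryStart = path.indexOf('?');
  const pathname = queryStart < 0 ? path : path.slice(0, queryStart);
  return GATE_EXEMPT_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}
```

With this in place, `isGateExempt('/api/system/mesh/verify')` is true while `isGateExempt('/api/bookmarks')` is false. Keeping the list in one constant makes the lockout-safety invariant easy to review and to cover with a test.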
## Decisions needed before coding
- **[DECIDE A] Override:** Should the admin have a "skip for now" escape hatch?
Strongly recommend **yes** (lockout safety). If yes: temporary (e.g. 24h,
re-prompts) or permanent-until-changed? And does skipping still let them into
the Dashboard fully, or into a limited state?
- **[DECIDE B] Verified definition:** B1 (reachable), B2 (host on mesh), or
B1+B2? Recommend B1 baseline + B2 as informational.
- **[DECIDE C] Member behavior pre-verification:** block all non-admin login
until mesh verified, or let members in with a "setup in progress" notice?
- **[DECIDE D] Existing install / this very deployment:** the live instance has
no mesh row yet. Turning the gate on **will immediately gate the running
production app** at next login. Do we (i) default `mesh.required = false` and
let the admin opt in, or (ii) default it on but rely on the override? This is
the riskiest part for the deployed instance.
## Explicitly out of scope
- Auto-installing/joining NetBird from ArchNest (we only verify, not provision).
- Supporting non-NetBird meshes (Tailscale, etc.) — possible later via the
same `mesh.integrationId` indirection.