dev_arc_aws/docs/mesh-prerequisite-gate.md
Samuel James cdd93f204e
docs: sync HANDOFF/README/design-decisions; add mesh-gate design (#32)
Bring the docs in line with what shipped since the auth phases, and hand
off the next planned feature cleanly for another agent to pick up.

- HANDOFF.md: new TL;DR (auth complete; persistent terminals + Docker
  three-ways shipped); prominent "next task = Mesh Prerequisite Gate"
  callout warning not to code before the open decisions are answered;
  corrected standing rules (kiro/<feature> branches, gh-based workflow,
  npm run build over plain tsc, Co-authored-by trailers); architecture
  sections updated for TerminalSessionContext, dockerSsh/agents routes,
  docker_agent_reports table, ssh/docker.ts, and the new agent env vars;
  new "Docker: three ways" section.
- README.md: Containers/Terminal page rows, route-group list, SSH layer,
  agent/ dir, ARCHNEST_AGENT_TOKEN/ARCHNEST_AGENT_STALE_MS, current-state
  paragraph, and doc reading order.
- design-decisions.md: Terminal (persistence) and Containers (three
  sources + detail tab) page notes; backend Docker-transport note; mesh
  gate flagged under Future Integration Notes.
- docs/mesh-prerequisite-gate.md (new): full design with lockout-safety
  invariants and the open decisions (A-D) needed before implementation.

Docs only; no code changed.

Co-authored-by: Samuel James <ssamjame@amazon.com>
Co-authored-by: Kiro <noreply@kiro.dev>
2026-06-20 16:42:47 -04:00

Mesh Network Prerequisite Gate — Design

Design doc for requiring a mesh network (NetBird) to be configured, tested, and verified before the rest of ArchNest can be configured. Written before implementation. The hard problem here is avoiding an admin lockout, so this doc leads with that.

Status: DESIGN — not yet implemented. Decisions marked [DECIDE] need the user's input before coding.

Goal

After account setup, an admin must establish a verified mesh connection before they can configure integrations, bookmarks, tunnels, etc. The intent: ArchNest is meant to operate over a private mesh, and other features (e.g. the Docker agent ingest, SSH to mesh hosts) assume mesh reachability. The gate makes that a first-class, enforced prerequisite rather than an operational assumption.

The lockout problem (read first)

A naive gate that blocks everything until mesh is verified is dangerous: if the mesh test fails (wrong token, NetBird down, transient network failure), the admin could be locked out of the very settings needed to fix it. The existing codebase already takes lockout seriously (the "last active admin" guards in auth.ts). The gate must follow the same principle:

Invariants (non-negotiable):

  1. The gate never blocks /api/auth/* (login, logout, sessions, password).
  2. The gate never blocks the mesh configuration + test endpoints, nor the integration create/update/test routes needed to configure the mesh.
  3. Enforcement is primarily UI-level (a gate screen that is itself the mesh-config UI), so the admin always has a way forward — the gate screen lets them enter/edit/test the mesh right there.
  4. There is an explicit, logged admin override ("skip / I'll set this up later") — see [DECIDE A]. Without an override, a hard outage of the mesh provider could brick configuration access.
  5. The mesh config row is always editable even when the gate is unsatisfied.
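If any server-side enforcement is ever added, invariants 1 and 2 could be expressed as a single allow-list check. The following is a minimal sketch, not the implementation: `GATE_EXEMPT_PREFIXES`, `isGateExempt`, and `shouldBlock` are hypothetical names, and the prefixes are the ones this doc lists under Backend (`/api/auth/*`, `/api/integrations*`, `/api/system/*`).

```typescript
// Hypothetical allow-list: routes the gate must never block.
const GATE_EXEMPT_PREFIXES: readonly string[] = [
  "/api/auth/",        // login, logout, sessions, password
  "/api/integrations", // integration create/update/test
  "/api/system/",      // mesh status, verify, override
];

// True when the request targets a route the gate must leave alone.
function isGateExempt(requestUrl: string): boolean {
  // Strip the query string so only the path is compared.
  const queryStart = requestUrl.indexOf("?");
  const pathname = queryStart === -1 ? requestUrl : requestUrl.slice(0, queryStart);
  return GATE_EXEMPT_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}

// A request is blocked only if the gate is active AND the route is not exempt.
function shouldBlock(requestUrl: string, gateActive: boolean): boolean {
  return gateActive && !isGateExempt(requestUrl);
}
```

Keeping the exemption as a pure function makes it easy to unit-test the lockout invariants independently of whatever request hook ends up calling it.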

What counts as "verified"? [DECIDE B]

Options, from loosest to strictest:

  • (B1) Reachable: a NetBird integration exists and testConnection succeeds (the NetBird API answers /api/peers with the token). Proves the control-plane token works, not that this host is on the mesh.
  • (B2) On the mesh: the backend host itself has an interface/IP in the mesh range (e.g. 100.64.0.0/10), checked server-side. Proves the ArchNest host is actually meshed.
  • (B3) Reachable + peers present: B1 plus listResources() returning ≥1 connected peer.

Recommendation: B1 as the baseline verification (it's what the existing NetBird adapter already supports and is deterministic), with B2 as an additional optional check surfaced as info ("this host's mesh IP: …"). B3 is nice but a single-peer network is legitimate, so don't require peers.

This needs your call — see [DECIDE B] at the end.
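To make B2 concrete, here is a rough sketch of the server-side "is this host on the mesh" check. It assumes the mesh range is 100.64.0.0/10 (the example range above) and that the caller passes in the result of Node's `os.networkInterfaces()`; the function names and the `IfaceAddress` shape are illustrative, not existing ArchNest code.

```typescript
// Minimal shape of one address entry as returned by os.networkInterfaces().
interface IfaceAddress {
  address: string;
  family: string | number; // "IPv4" on current Node; some versions reported 4
  internal: boolean;
}

// Convert a dotted-quad IPv4 string to an unsigned integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside `cidr` (IPv4 only).
function isInCidr(ip: string, cidr: string): boolean {
  const [base = "", bitsText = ""] = cidr.split("/");
  if (bitsText === "") return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  // Two addresses share a network when they fall in the same block of 2^(32-bits).
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Return the first non-internal IPv4 address inside the mesh range, if any.
function findHostMeshIp(
  interfaces: Record<string, IfaceAddress[] | undefined>,
  meshCidr: string = "100.64.0.0/10", // assumption: example range from this doc
): string | undefined {
  for (const addrs of Object.values(interfaces)) {
    for (const addr of addrs ?? []) {
      const isV4 = addr.family === "IPv4" || addr.family === 4;
      if (isV4 && !addr.internal && isInCidr(addr.address, meshCidr)) {
        return addr.address;
      }
    }
  }
  return undefined;
}
```

Under the recommendation above, an `undefined` result would be surfaced as information only and would not fail verification.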

Where state lives

There is no server-side key-value config store today; all config is in the integrations table. Two options:

  • (S1) Derive from the NetBird integration: "mesh verified" = there exists a netbird integration with status = 'connected' (optionally within a freshness window). No new table. Simplest, but conflates "an integration that happens to be NetBird" with "the designated mesh".
  • (S2) New system_config key-value table: explicit keys like mesh.integrationId, mesh.verifiedAt, mesh.overrideUntil. Cleaner, gives us a real home for future system-level settings (and the override flag), at the cost of a new table + endpoints.

Recommendation: S2 — a small system_config kv table. The gate needs to persist an override flag and a "designated mesh integration" pointer that S1 can't cleanly represent, and ArchNest will want a system-config store for other things eventually (this is also where a future "mesh required: on/off" toggle lives). Proposed schema:

CREATE TABLE IF NOT EXISTS system_config (
  key   TEXT PRIMARY KEY,
  value TEXT NOT NULL,
  updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);

Keys: mesh.integrationId (the designated NetBird integration), mesh.verifiedAt (ISO timestamp of last successful verify), mesh.overrideUntil (optional ISO timestamp for a temporary admin skip), mesh.required ("true"/"false", default true — lets the whole gate be turned off).
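As an illustration of how these keys might be read back, the sketch below turns raw `system_config` rows into a typed object. `ConfigRow`, `MeshConfig`, and `parseMeshConfig` are hypothetical names, and storing the integration id as text is an assumption that follows from the `value TEXT` column.

```typescript
// One row of the proposed system_config table (key/value only).
interface ConfigRow {
  key: string;
  value: string;
}

// Typed view over the mesh.* keys.
interface MeshConfig {
  required: boolean;
  integrationId: string | null;
  verifiedAt: Date | null;
  overrideUntil: Date | null;
}

// Parse an ISO timestamp; missing or unparseable values become null.
function parseIsoDate(value: string | undefined): Date | null {
  if (!value) return null;
  const parsed = new Date(value);
  return Number.isNaN(parsed.getTime()) ? null : parsed;
}

function parseMeshConfig(rows: ConfigRow[]): MeshConfig {
  const kv = new Map<string, string>();
  for (const row of rows) kv.set(row.key, row.value);
  return {
    // Default is true: only an explicit "false" turns the gate off.
    required: kv.get("mesh.required") !== "false",
    integrationId: kv.get("mesh.integrationId") ?? null,
    verifiedAt: parseIsoDate(kv.get("mesh.verifiedAt")),
    overrideUntil: parseIsoDate(kv.get("mesh.overrideUntil")),
  };
}
```

Note that this default makes an empty table mean "gate on", which is exactly the case [DECIDE D] asks about for existing installs.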

Frontend flow

New auth status

Add 'needs-mesh' to the AuthStatus union in AuthContext.tsx. Today refresh() goes token → api.me() → 'logged-in'. New behavior: after a successful api.me(), also call api.getMeshStatus(); if mesh is required, not verified, and not overridden, set 'needs-mesh' instead of 'logged-in'.
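The decision itself is small enough to isolate as a pure function. This is a sketch under assumptions: `MeshStatus` mirrors the fields proposed for the mesh-status response, and `statusAfterLogin` is a hypothetical helper that refresh() could call once api.me() has succeeded.

```typescript
// The two outcomes possible once api.me() has succeeded.
type PostLoginStatus = "logged-in" | "needs-mesh";

// Subset of the proposed mesh-status response that the decision needs.
interface MeshStatus {
  required: boolean;
  verified: boolean;
  overridden: boolean;
}

// Gate only when mesh is required AND not verified AND not overridden.
function statusAfterLogin(mesh: MeshStatus): PostLoginStatus {
  const gated = mesh.required && !mesh.verified && !mesh.overridden;
  return gated ? "needs-mesh" : "logged-in";
}
```

What refresh() should do when the mesh-status call itself fails is deliberately left out here; that choice interacts with the lockout invariants and is not settled by this doc.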

App routing (App.tsx)

Insert a branch after logged-out/enrolling and before Dashboard:

if (status === 'needs-mesh') return <MeshGate />
return <Dashboard />

MeshGate page

A focused, full-screen page (styled like Enrollment) that:

  • Explains the prerequisite.
  • Lets the admin configure the NetBird mesh (reuse the integration create/test form — same createIntegration + testIntegration calls Enrollment's ConnectForm already uses), or pick an existing NetBird integration as the designated mesh.
  • Runs Detect → Test → Verify: shows the test result, and (B2) the detected mesh IP of the ArchNest host.
  • On success, marks mesh.verifiedAt, then calls refresh() → advances to Dashboard.
  • Provides the admin override control per [DECIDE A].
  • For members (non-admins), shows a "waiting on an admin to finish mesh setup" message instead of the config form. A member who logs in while mesh is unverified can't fix it, because only admins configure integrations. [DECIDE C: do we even allow member login pre-verification, or block all use until verified?]

Enrollment

Keep Enrollment's account step. The mesh step can either be folded into Enrollment as a mandatory step before finishEnrollment(), or live purely as the post-login needs-mesh gate. Recommendation: gate only (don't duplicate in Enrollment) — one code path, and it also covers existing installs that predate the gate.

Backend

  • GET /api/system/mesh-status (mirrors setup-status): returns { required, verified, overridden, meshIntegrationId, hostMeshIp? }. Behind authenticate (any logged-in user can read).
  • POST /api/system/mesh/verify (admin): designates a NetBird integration as the mesh, runs its testConnection, (B2) checks host mesh IP, persists mesh.integrationId + mesh.verifiedAt on success. Returns the result.
  • POST /api/system/mesh/override (admin) [DECIDE A]: sets mesh.overrideUntil (or a permanent skip). Writes a logEvent.
  • Optional PUT /api/system/mesh/required (admin): toggle mesh.required.
  • Lockout safety: none of the gate enforcement lives in a global request hook that could block auth/integration/system routes. If we add any server-side enforcement at all (beyond the UI gate), it must explicitly exempt /api/auth/*, /api/integrations*, and /api/system/*.
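The mesh-status payload could be derived from the stored keys roughly as follows. This is a sketch, not the endpoint: `MeshConfig` and `buildMeshStatus` are hypothetical, only the time-limited form of the override is modeled (a permanent skip depends on [DECIDE A]), and no freshness window is applied to mesh.verifiedAt.

```typescript
// Parsed mesh.* keys from system_config (hypothetical shape).
interface MeshConfig {
  required: boolean;
  integrationId: string | null;
  verifiedAt: Date | null;
  overrideUntil: Date | null;
}

// Response shape proposed for GET /api/system/mesh-status.
interface MeshStatusResponse {
  required: boolean;
  verified: boolean;
  overridden: boolean;
  meshIntegrationId: string | null;
  hostMeshIp?: string;
}

function buildMeshStatus(
  config: MeshConfig,
  now: Date,
  hostMeshIp?: string,
): MeshStatusResponse {
  // Verified: a designated integration exists and a successful verify was recorded.
  const verified = config.integrationId !== null && config.verifiedAt !== null;
  // Overridden: a temporary skip is set and has not yet expired.
  const overridden =
    config.overrideUntil !== null && config.overrideUntil.getTime() > now.getTime();
  const status: MeshStatusResponse = {
    required: config.required,
    verified,
    overridden,
    meshIntegrationId: config.integrationId,
  };
  // hostMeshIp is informational (B2) and omitted when not detected.
  if (hostMeshIp !== undefined) status.hostMeshIp = hostMeshIp;
  return status;
}
```

Passing `now` in as a parameter keeps the expiry logic deterministic in tests.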

Decisions needed before coding

  • [DECIDE A] Override: Should the admin have a "skip for now" escape hatch? Strongly recommend yes (lockout safety). If yes: temporary (e.g. 24h, re-prompts) or permanent-until-changed? And does skipping still let them into the Dashboard fully, or into a limited state?
  • [DECIDE B] Verified definition: B1 (reachable), B2 (host on mesh), or B1+B2? Recommend B1 baseline + B2 as informational.
  • [DECIDE C] Member behavior pre-verification: block all non-admin login until mesh verified, or let members in with a "setup in progress" notice?
  • [DECIDE D] Existing install / this very deployment: the live instance has no mesh row yet. Turning the gate on will immediately gate the running production app at next login. Do we (i) default mesh.required = false and let the admin opt in, or (ii) default it on but rely on the override? This is the riskiest part for the deployed instance.

Explicitly out of scope

  • Auto-installing/joining NetBird from ArchNest (we only verify, not provision).
  • Supporting non-NetBird meshes (Tailscale, etc.) — possible later via the same mesh.integrationId indirection.