Bring the docs in line with what shipped since the auth phases, and hand off the next planned feature cleanly for another agent to pick up. - HANDOFF.md: new TL;DR (auth complete; persistent terminals + Docker three-ways shipped); prominent "next task = Mesh Prerequisite Gate" callout warning not to code before the open decisions are answered; corrected standing rules (kiro/<feature> branches, gh-based workflow, npm run build over plain tsc, Co-authored-by trailers); architecture sections updated for TerminalSessionContext, dockerSsh/agents routes, docker_agent_reports table, ssh/docker.ts, and the new agent env vars; new "Docker: three ways" section. - README.md: Containers/Terminal page rows, route-group list, SSH layer, agent/ dir, ARCHNEST_AGENT_TOKEN/ARCHNEST_AGENT_STALE_MS, current-state paragraph, and doc reading order. - design-decisions.md: Terminal (persistence) and Containers (three sources + detail tab) page notes; backend Docker-transport note; mesh gate flagged under Future Integration Notes. - docs/mesh-prerequisite-gate.md (new): full design with lockout-safety invariants and the open decisions (A-D) needed before implementation. Docs only; no code changed. Co-authored-by: Samuel James <ssamjame@amazon.com> Co-authored-by: Kiro <noreply@kiro.dev>
Mesh Network Prerequisite Gate — Design
Design doc for requiring a mesh network (NetBird) to be configured, tested, and verified before the rest of ArchNest can be configured. Written before implementation. The hard problem here is avoiding admin lockout, so this doc leads with that.
Status: DESIGN — not yet implemented. Decisions marked [DECIDE] need the user's input before coding.
Goal
After account setup, an admin must establish a verified mesh connection before they can configure integrations, bookmarks, tunnels, etc. The intent: ArchNest is meant to operate over a private mesh, and other features (e.g. the Docker agent ingest, SSH to mesh hosts) assume mesh reachability. The gate makes that a first-class, enforced prerequisite rather than an operational assumption.
The lockout problem (read first)
A naive gate that blocks everything until mesh is verified is dangerous: if
the mesh test fails (wrong token, NetBird down, transient network), the admin
could be unable to reach the very settings needed to fix it. The existing
codebase already takes lockout seriously (the "last active admin" guards in
auth.ts). The gate must follow the same principle:
Invariants (non-negotiable):
- The gate never blocks /api/auth/* (login, logout, sessions, password).
- The gate never blocks the mesh configuration + test endpoints, nor the integration create/update/test routes needed to configure the mesh.
- Enforcement is primarily UI-level (a gate screen that is itself the mesh-config UI), so the admin always has a way forward — the gate screen lets them enter/edit/test the mesh right there.
- There is an explicit, logged admin override ("skip / I'll set this up later") — see [DECIDE A]. Without an override, a hard outage of the mesh provider could brick configuration access.
- The mesh config row is always editable even when the gate is unsatisfied.
What counts as "verified"? [DECIDE B]
Options, from loosest to strictest:
- (B1) Reachable: a NetBird integration exists and testConnection succeeds (the NetBird API answers /api/peers with the token). Proves the control-plane token works, not that this host is on the mesh.
- (B2) On the mesh: the backend host itself has an interface/IP in the mesh range (e.g. 100.64.0.0/10), checked server-side. Proves the ArchNest host is actually meshed.
- (B3) Reachable + peers present: B1 plus listResources() returning ≥1 connected peer.
Recommendation: B1 as the baseline verification (it's what the existing NetBird adapter already supports and is deterministic), with B2 as an additional optional check surfaced as info ("this host's mesh IP: …"). B3 is nice but a single-peer network is legitimate, so don't require peers.
This needs your call — see [DECIDE B] at the end.
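As an illustration of the B2 check, here is a minimal sketch of detecting whether the backend host has an IPv4 address inside the mesh range. The helper names (ipv4ToInt, isInCidr, detectHostMeshIp) are hypothetical and not part of the codebase, and the default range is just the 100.64.0.0/10 example given above; the actual range for a given NetBird network may differ.

```typescript
import { networkInterfaces } from "node:os";

// Parse a dotted-quad IPv4 string into an unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` falls inside the IPv4 block `cidr`, e.g. "100.64.0.0/10".
function isInCidr(ip: string, cidr: string): boolean {
  const [base, bitsText] = cidr.split("/");
  if (!/^\d{1,2}$/.test(bitsText ?? "")) return false;
  const bits = Number(bitsText);
  if (bits > 32) return false;
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base ?? "");
  if (ipInt === null || baseInt === null) return false;
  // Two addresses share a /bits prefix when they land in the same block of 2^(32-bits).
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Hypothetical helper: first non-internal IPv4 address on this host inside the mesh range.
function detectHostMeshIp(meshCidr = "100.64.0.0/10"): string | undefined {
  for (const addrs of Object.values(networkInterfaces())) {
    for (const addr of addrs ?? []) {
      if (addr.family === "IPv4" && !addr.internal && isInCidr(addr.address, meshCidr)) {
        return addr.address;
      }
    }
  }
  return undefined;
}
```

Under the recommendation above, a missing result would be surfaced as information only, not treated as a verification failure.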
Where state lives
There is no server-side key-value config store today; all config is in the
integrations table. Two options:
- (S1) Derive from the NetBird integration: "mesh verified" = there exists a netbird integration with status = 'connected' (optionally within a freshness window). No new table. Simplest, but conflates "an integration that happens to be NetBird" with "the designated mesh".
- (S2) New system_config key-value table: explicit keys like mesh.integrationId, mesh.verifiedAt, mesh.overrideUntil. Cleaner, gives us a real home for future system-level settings (and the override flag), at the cost of a new table + endpoints.
Recommendation: S2 — a small system_config kv table. The gate needs to
persist an override flag and a "designated mesh integration" pointer that S1
can't cleanly represent, and ArchNest will want a system-config store for other
things eventually (this is also where a future "mesh required: on/off" toggle
lives). Proposed schema:
CREATE TABLE IF NOT EXISTS system_config (
key TEXT PRIMARY KEY,
value TEXT NOT NULL,
updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
Keys: mesh.integrationId (the designated NetBird integration), mesh.verifiedAt
(ISO timestamp of last successful verify), mesh.overrideUntil (optional ISO
timestamp for a temporary admin skip), mesh.required ("true"/"false",
default true — lets the whole gate be turned off).
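To make the key semantics concrete, here is a rough sketch of reading the proposed mesh.* keys into a typed object and deciding whether the gate is open. The names ConfigRow, MeshConfig, readMeshConfig, and isGateOpen are hypothetical, and the sketch assumes an unexpired mesh.overrideUntil opens the gate, which is still subject to [DECIDE A].

```typescript
// Shape of one row from the proposed system_config table (assumed).
interface ConfigRow {
  key: string;
  value: string;
}

// Typed view over the mesh.* keys.
interface MeshConfig {
  required: boolean;
  integrationId: string | undefined;
  verifiedAt: Date | undefined;
  overrideUntil: Date | undefined;
}

// Parse an ISO timestamp, treating missing or malformed values as absent.
function parseIsoDate(value: string | undefined): Date | undefined {
  if (!value) return undefined;
  const date = new Date(value);
  return Number.isNaN(date.getTime()) ? undefined : date;
}

// Build a MeshConfig from raw rows. mesh.required defaults to true when unset.
function readMeshConfig(rows: ConfigRow[]): MeshConfig {
  const kv = new Map<string, string>();
  for (const row of rows) kv.set(row.key, row.value);
  return {
    required: kv.get("mesh.required") !== "false",
    integrationId: kv.get("mesh.integrationId"),
    verifiedAt: parseIsoDate(kv.get("mesh.verifiedAt")),
    overrideUntil: parseIsoDate(kv.get("mesh.overrideUntil")),
  };
}

// The gate is open when it is disabled, verified, or covered by an unexpired override.
function isGateOpen(config: MeshConfig, now: Date = new Date()): boolean {
  if (!config.required) return true;
  if (config.integrationId !== undefined && config.verifiedAt !== undefined) return true;
  return config.overrideUntil !== undefined && config.overrideUntil.getTime() > now.getTime();
}
```

Treating a malformed timestamp as absent keeps a corrupted row from silently opening the gate; whether that is the right failure mode is a judgment call for the implementer.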
Frontend flow
New auth status
Add 'needs-mesh' to the AuthStatus union in AuthContext.tsx. refresh()
currently: token → api.me() → 'logged-in'. New: after a successful
api.me(), also call api.getMeshStatus(); if mesh is required and not
verified and not overridden, set 'needs-mesh' instead of 'logged-in'.
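The status decision described here could be sketched roughly as follows. MeshStatus mirrors the required/verified/overridden fields of the proposed mesh-status response; PostLoginStatus, MeshApi, statusAfterLogin, and resolvePostLoginStatus are hypothetical names, and how refresh() in AuthContext.tsx actually wires this in is not shown in this doc.

```typescript
// Subset of the proposed GET /api/system/mesh-status response used by the gate.
interface MeshStatus {
  required: boolean;
  verified: boolean;
  overridden: boolean;
}

// The two outcomes possible once api.me() has succeeded.
type PostLoginStatus = "logged-in" | "needs-mesh";

// Gate only when mesh is required AND not verified AND not overridden.
function statusAfterLogin(mesh: MeshStatus): PostLoginStatus {
  const gated = mesh.required && !mesh.verified && !mesh.overridden;
  return gated ? "needs-mesh" : "logged-in";
}

// Hypothetical client shape; the real api object may differ.
interface MeshApi {
  getMeshStatus(): Promise<MeshStatus>;
}

// Fetch mesh status and map it to an auth status. Errors propagate to the
// caller, which must decide whether a failed fetch should gate or not.
async function resolvePostLoginStatus(api: MeshApi): Promise<PostLoginStatus> {
  return statusAfterLogin(await api.getMeshStatus());
}
```

What to do when getMeshStatus() itself fails is left open on purpose, since failing closed there could reintroduce the lockout risk this doc is trying to avoid.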
App routing (App.tsx)
Insert a branch after logged-out/enrolling and before Dashboard:
if (status === 'needs-mesh') return <MeshGate />
return <Dashboard />
MeshGate page
A focused, full-screen page (styled like Enrollment) that:
- Explains the prerequisite.
- Lets the admin configure the NetBird mesh (reuse the integration create/test form, the same createIntegration + testIntegration calls Enrollment's ConnectForm already uses), or pick an existing NetBird integration as the designated mesh.
- Runs Detect → Test → Verify: shows the test result, and (B2) the detected mesh IP of the ArchNest host.
- On success, marks mesh.verifiedAt, then calls refresh() → advances to Dashboard.
- Provides the admin override control per [DECIDE A].
- Members (non-admins): a member who logs in while mesh is unverified can't fix it (only admins configure integrations). They should see a "waiting on an admin to finish mesh setup" message, not a config form. [DECIDE C: do we even allow member login pre-verification, or block all use until verified?]
Enrollment
Keep Enrollment's account step. The mesh step can either be folded into
Enrollment as a mandatory step before finishEnrollment(), or live purely as
the post-login needs-mesh gate. Recommendation: gate only (don't duplicate
in Enrollment) — one code path, and it also covers existing installs that
predate the gate.
Backend
- GET /api/system/mesh-status (mirrors setup-status): returns { required, verified, overridden, meshIntegrationId, hostMeshIp? }. Behind authenticate (any logged-in user can read).
- POST /api/system/mesh/verify (admin): designates a NetBird integration as the mesh, runs its testConnection, (B2) checks host mesh IP, persists mesh.integrationId + mesh.verifiedAt on success. Returns the result.
- POST /api/system/mesh/override (admin) [DECIDE A]: sets mesh.overrideUntil (or a permanent skip). Writes a logEvent.
- Optional PUT /api/system/mesh/required (admin): toggle mesh.required.
- Lockout safety: none of the gate enforcement lives in a global request hook that could block auth/integration/system routes. If we add any server-side enforcement at all (beyond the UI gate), it must explicitly exempt /api/auth/*, /api/integrations*, and /api/system/*.
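If server-side enforcement is ever added, the exemption rule above might be expressed as a small predicate like the one below. This is only a sketch: GATE_EXEMPT_PREFIXES and isGateExempt are hypothetical names, and the patterns are read as simple path prefixes, which is an assumption about what the globs are meant to cover.

```typescript
// Route prefixes the gate must never block, taken from the lockout-safety rule.
const GATE_EXEMPT_PREFIXES: readonly string[] = [
  "/api/auth/",
  "/api/integrations",
  "/api/system/",
];

// True when a request URL targets a route exempt from gate enforcement.
function isGateExempt(url: string): boolean {
  // Ignore any query string so "/api/auth/login?next=/" still matches.
  const queryStart = url.indexOf("?");
  const pathname = queryStart === -1 ? url : url.slice(0, queryStart);
  return GATE_EXEMPT_PREFIXES.some((prefix) => pathname.startsWith(prefix));
}
```

A request hook would call isGateExempt first and skip all gate logic when it returns true, so the auth, integration, and system routes stay reachable even when the mesh is unverified.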
Decisions needed before coding
- [DECIDE A] Override: Should the admin have a "skip for now" escape hatch? Strongly recommend yes (lockout safety). If yes: temporary (e.g. 24h, re-prompts) or permanent-until-changed? And does skipping still let them into the Dashboard fully, or into a limited state?
- [DECIDE B] Verified definition: B1 (reachable), B2 (host on mesh), or B1+B2? Recommend B1 baseline + B2 as informational.
- [DECIDE C] Member behavior pre-verification: block all non-admin login until mesh verified, or let members in with a "setup in progress" notice?
- [DECIDE D] Existing install / this very deployment: the live instance has no mesh row yet. Turning the gate on will immediately gate the running production app at next login. Do we (i) default mesh.required = false and let the admin opt in, or (ii) default it on but rely on the override? This is the riskiest part for the deployed instance.
Explicitly out of scope
- Auto-installing/joining NetBird from ArchNest (we only verify, not provision).
- Supporting non-NetBird meshes (Tailscale, etc.); possible later via the same mesh.integrationId indirection.