Bring the docs in line with what shipped since the auth phases, and hand off the
next planned feature cleanly for another agent to pick up.

- HANDOFF.md: new TL;DR (auth complete; persistent terminals + Docker three-ways
  shipped); prominent "next task = Mesh Prerequisite Gate" callout warning not to
  code before the open decisions are answered; corrected standing rules
  (kiro/<feature> branches, gh-based workflow, npm run build over plain tsc,
  Co-authored-by trailers); architecture sections updated for
  TerminalSessionContext, dockerSsh/agents routes, docker_agent_reports table,
  ssh/docker.ts, and the new agent env vars; new "Docker: three ways" section.
- README.md: Containers/Terminal page rows, route-group list, SSH layer, agent/
  dir, ARCHNEST_AGENT_TOKEN/ARCHNEST_AGENT_STALE_MS, current-state paragraph, and
  doc reading order.
- design-decisions.md: Terminal (persistence) and Containers (three sources +
  detail tab) page notes; backend Docker-transport note; mesh gate flagged under
  Future Integration Notes.
- docs/mesh-prerequisite-gate.md (new): full design with lockout-safety
  invariants and the open decisions (A-D) needed before implementation.

Docs only; no code changed.

Co-authored-by: Samuel James <ssamjame@amazon.com>
Co-authored-by: Kiro <noreply@kiro.dev>

# Mesh Network Prerequisite Gate — Design

Design doc for requiring a **mesh network (NetBird) to be configured, tested,
and verified before the rest of ArchNest can be configured**. Written before
implementation. The hard problem here is **not locking the admin out**, so this
doc leads with that.

> Status: DESIGN, not yet implemented. Decisions marked **[DECIDE]** need the
> user's input before coding.

## Goal

After account setup, an admin must establish a verified mesh connection before
they can configure integrations, bookmarks, tunnels, etc. The intent: ArchNest
is meant to operate over a private mesh, and other features (e.g. the Docker
agent ingest, SSH to mesh hosts) assume mesh reachability. The gate makes that a
first-class, enforced prerequisite rather than an operational assumption.

## The lockout problem (read first)

A naive gate that blocks *everything* until mesh is verified is dangerous: if
the mesh test fails (wrong token, NetBird down, transient network), the admin
could be unable to reach the very settings needed to fix it. The existing
codebase already takes lockout seriously (the "last active admin" guards in
`auth.ts`). The gate must follow the same principle:

**Invariants (non-negotiable):**
1. The gate **never blocks** `/api/auth/*` (login, logout, sessions, password).
2. The gate **never blocks** the mesh configuration + test endpoints, nor the
   integration create/update/test routes needed to configure the mesh.
3. Enforcement is primarily **UI-level** (a gate screen that *is itself* the
   mesh-config UI), so the admin always has a way forward: the gate screen
   lets them enter/edit/test the mesh right there.
4. There is an explicit, logged **admin override** ("skip / I'll set this up
   later"); see **[DECIDE A]**. Without an override, a hard outage of the mesh
   provider could brick configuration access.
5. The mesh config row is always editable even when the gate is unsatisfied.

## What counts as "verified"? [DECIDE B]

Options, from loosest to strictest:
- **(B1) Reachable:** a NetBird integration exists and `testConnection`
  succeeds (the NetBird API answers `/api/peers` with the token). Proves the
  control-plane token works, not that *this host* is on the mesh.
- **(B2) On the mesh:** the backend host itself has an interface/IP in the mesh
  range (e.g. `100.64.0.0/10`), checked server-side. Proves the ArchNest host
  is actually meshed.
- **(B3) Reachable + peers present:** B1 plus `listResources()` returning ≥1
  connected peer.

Recommendation: **B1 as the baseline verification** (it's what the existing
NetBird adapter already supports and is deterministic), with **B2 as an
additional optional check** surfaced as info ("this host's mesh IP: …"). B3 is
nice but a single-peer network is legitimate, so don't require peers.

This needs your call; see [DECIDE B] at the end.
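
Option B2 boils down to a CIDR membership test over the host's interface
addresses. The following is a minimal TypeScript sketch under stated
assumptions: IPv4 only, the example range `100.64.0.0/10` from above, and
hypothetical helper names (`ipv4ToInt`, `isInCidr`, `findHostMeshIp`) that do
not exist in ArchNest today. The address list is passed in so the logic stays
testable; on the backend it could be collected from Node's
`os.networkInterfaces()`.

```typescript
// Hypothetical helpers for the B2 check; names are illustrative only.

/** Parse a dotted-quad IPv4 string into an unsigned integer, or null if malformed. */
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet; // plain arithmetic avoids signed 32-bit bitwise pitfalls
  }
  return value;
}

/** True if `ip` falls inside the IPv4 block `cidr`, e.g. "100.64.0.0/10". */
function isInCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash < 0) return false;
  const bitsText = cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(cidr.slice(0, slash));
  if (ipInt === null || baseInt === null || bits > 32) return false;
  // Two addresses share a block when they agree after dropping the host bits.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

/** Return the first address inside the mesh range, or undefined if none is. */
function findHostMeshIp(
  addresses: string[],
  meshCidr: string = "100.64.0.0/10", // assumed default; the real range may differ
): string | undefined {
  return addresses.find((address) => isInCidr(address, meshCidr));
}
```

Because B2 is recommended as informational only, a caller would surface an
`undefined` result as "no mesh IP detected" rather than fail verification.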

## Where state lives

There is **no server-side key-value config store** today; all config is in the
`integrations` table. Two options:

- **(S1) Derive from the NetBird integration:** "mesh verified" = there exists
  a `netbird` integration with `status = 'connected'` (optionally within a
  freshness window). No new table. Simplest, but conflates "an integration that
  happens to be NetBird" with "the designated mesh".
- **(S2) New `system_config` key-value table:** explicit keys like
  `mesh.integrationId`, `mesh.verifiedAt`, `mesh.overrideUntil`. Cleaner, gives
  us a real home for future system-level settings (and the override flag), at
  the cost of a new table + endpoints.

Recommendation: **S2, a small `system_config` kv table.** The gate needs to
persist an override flag and a "designated mesh integration" pointer that S1
can't cleanly represent, and ArchNest will want a system-config store for other
things eventually (this is also where a future "mesh required: on/off" toggle
lives). Proposed schema:

```sql
CREATE TABLE IF NOT EXISTS system_config (
  key        TEXT PRIMARY KEY,
  value      TEXT NOT NULL,
  updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
```

Keys: `mesh.integrationId` (the designated NetBird integration), `mesh.verifiedAt`
(ISO timestamp of last successful verify), `mesh.overrideUntil` (optional ISO
timestamp for a temporary admin skip), `mesh.required` (`"true"`/`"false"`,
default true; lets the whole gate be turned off).
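
To make the key semantics concrete, here is a hedged TypeScript sketch of how
the four keys could be folded into a status object. `ConfigRow`, `MeshStatus`,
`parseTimestamp`, and `deriveMeshStatus` are hypothetical names, and the rule
"verified means a designated integration plus a parseable `mesh.verifiedAt`" is
an assumption; this design does not yet settle whether a freshness window
applies.

```typescript
// Hypothetical shapes; rows as they might come back from
// `SELECT key, value FROM system_config`.
type ConfigRow = { key: string; value: string };

interface MeshStatus {
  required: boolean;
  verified: boolean;
  overridden: boolean;
  meshIntegrationId: string | null;
}

/** Parse a timestamp to epoch milliseconds; null if missing or unparseable. */
function parseTimestamp(value: string | undefined): number | null {
  if (value === undefined) return null;
  const ms = Date.parse(value);
  return Number.isNaN(ms) ? null : ms;
}

/** Fold system_config rows into the gate's status (assumed semantics, see lead-in). */
function deriveMeshStatus(rows: ConfigRow[], now: Date = new Date()): MeshStatus {
  const config = new Map<string, string>();
  for (const row of rows) config.set(row.key, row.value);

  // A missing key means "required", matching the proposed default of true.
  const required = config.get("mesh.required") !== "false";
  const meshIntegrationId = config.get("mesh.integrationId") ?? null;

  // Assumption: any parseable verifiedAt counts; no freshness window is applied.
  const verified =
    meshIntegrationId !== null &&
    parseTimestamp(config.get("mesh.verifiedAt")) !== null;

  // A temporary override is active only while its timestamp lies in the future.
  const overrideUntil = parseTimestamp(config.get("mesh.overrideUntil"));
  const overridden = overrideUntil !== null && overrideUntil > now.getTime();

  return { required, verified, overridden, meshIntegrationId };
}
```

Passing `now` explicitly keeps expiry of `mesh.overrideUntil` easy to test. Note
that an empty table yields `required: true, verified: false`, which is exactly
the existing-install situation raised in [DECIDE D].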

## Frontend flow

### New auth status
Add `'needs-mesh'` to the `AuthStatus` union in `AuthContext.tsx`. `refresh()`
currently: token → `api.me()` → `'logged-in'`. New: after a successful
`api.me()`, also call `api.getMeshStatus()`; if mesh is **required and not
verified and not overridden**, set `'needs-mesh'` instead of `'logged-in'`.
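
The three-part condition above can be expressed as a small pure function. This
is a sketch only: the `AuthStatus` union lists just the members this doc
mentions (the real union in `AuthContext.tsx` may contain others), and
`MeshStatusResponse` and `statusAfterLogin` are hypothetical names.

```typescript
// Only the members named in this doc; the real union may have more.
type AuthStatus = "logged-out" | "enrolling" | "needs-mesh" | "logged-in";

// Assumed subset of what api.getMeshStatus() would return.
interface MeshStatusResponse {
  required: boolean;
  verified: boolean;
  overridden: boolean;
}

/** Decide the status to set after api.me() has already succeeded. */
function statusAfterLogin(mesh: MeshStatusResponse): AuthStatus {
  // Gate only when all three hold: required, not verified, not overridden.
  const gated = mesh.required && !mesh.verified && !mesh.overridden;
  return gated ? "needs-mesh" : "logged-in";
}
```

Keeping the decision out of `refresh()` itself means the rule can be unit-tested
without mocking the API client.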

### App routing (`App.tsx`)
Insert a branch **after** `logged-out`/`enrolling` and **before** `Dashboard`:
```tsx
// Existing logged-out / enrolling branches run first; then:
if (status === 'needs-mesh') return <MeshGate />
return <Dashboard />
```

### `MeshGate` page
A focused, full-screen page (styled like Enrollment) that:
- Explains the prerequisite.
- Lets the admin **configure the NetBird mesh** (reuse the integration
  create/test form, the same `createIntegration` + `testIntegration` calls
  Enrollment's `ConnectForm` already uses), or pick an existing NetBird
  integration as the designated mesh.
- Runs **Detect → Test → Verify**: shows the test result, and (B2) the detected
  mesh IP of the ArchNest host.
- On success, marks `mesh.verifiedAt`, then calls `refresh()` → advances to
  `Dashboard`.
- Provides the **admin override** control per [DECIDE A].
- **Members (non-admins):** a member who logs in while mesh is unverified can't
  fix it (only admins configure integrations). They should see a "waiting on an
  admin to finish mesh setup" message, not a config form. [DECIDE C: do we even
  allow member login pre-verification, or block all use until verified?]

### Enrollment
Keep Enrollment's account step. The mesh step can either be folded into
Enrollment as a mandatory step before `finishEnrollment()`, or live purely as
the post-login `needs-mesh` gate. Recommendation: **gate only** (don't duplicate
in Enrollment). That keeps one code path, and it also covers existing installs
that predate the gate.

## Backend

- `GET /api/system/mesh-status` (mirrors `setup-status`): returns
  `{ required, verified, overridden, meshIntegrationId, hostMeshIp? }`. Behind
  `authenticate` (any logged-in user can read).
- `POST /api/system/mesh/verify` (admin): designates a NetBird integration as
  the mesh, runs its `testConnection`, (B2) checks host mesh IP, persists
  `mesh.integrationId` + `mesh.verifiedAt` on success. Returns the result.
- `POST /api/system/mesh/override` (admin) **[DECIDE A]**: sets
  `mesh.overrideUntil` (or a permanent skip). Writes a `logEvent`.
- Optional `PUT /api/system/mesh/required` (admin): toggle `mesh.required`.
- **Lockout safety:** none of the gate enforcement lives in a global request
  hook that could block auth/integration/system routes. If we add any
  server-side enforcement at all (beyond the UI gate), it must explicitly
  exempt `/api/auth/*`, `/api/integrations*`, and `/api/system/*`.
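
If server-side enforcement is ever added, the exemption rule in the last bullet
could be captured in one predicate that any hook consults first. This is a
sketch under assumptions: `isGateExempt` is a hypothetical name, and
`/api/integrations*` is read here as "the collection route or anything beneath
it", which is slightly stricter than a raw prefix match.

```typescript
// Hypothetical exemption check; route prefixes are taken from the bullet above.
const GATE_EXEMPT_ROOTS = ["/api/auth", "/api/integrations", "/api/system"];

/** True if a request path must never be blocked by the mesh gate. */
function isGateExempt(urlPath: string): boolean {
  // Ignore any query string so "/api/integrations?type=netbird" still matches.
  const queryStart = urlPath.indexOf("?");
  const path = queryStart >= 0 ? urlPath.slice(0, queryStart) : urlPath;

  return GATE_EXEMPT_ROOTS.some((root) => {
    if (path.startsWith(root + "/")) return true; // anything beneath the root
    // Only the integrations collection itself is exempt as an exact match;
    // /api/auth/* and /api/system/* require a sub-path.
    return root === "/api/integrations" && path === root;
  });
}
```

An enforcement hook would return early whenever `isGateExempt(path)` is true,
which keeps invariants 1 and 2 in a single, testable place.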

## Decisions needed before coding

- **[DECIDE A] Override:** Should the admin have a "skip for now" escape hatch?
  Strongly recommend **yes** (lockout safety). If yes: temporary (e.g. 24h,
  re-prompts) or permanent-until-changed? And does skipping still let them into
  the Dashboard fully, or into a limited state?
- **[DECIDE B] Verified definition:** B1 (reachable), B2 (host on mesh), or
  B1+B2? Recommend B1 baseline + B2 as informational.
- **[DECIDE C] Member behavior pre-verification:** block all non-admin login
  until mesh verified, or let members in with a "setup in progress" notice?
- **[DECIDE D] Existing install / this very deployment:** the live instance has
  no mesh row yet. Turning the gate on **will immediately gate the running
  production app** at next login. Do we (i) default `mesh.required = false` and
  let the admin opt in, or (ii) default it on but rely on the override? This is
  the riskiest part for the deployed instance.

## Explicitly out of scope
- Auto-installing/joining NetBird from ArchNest (we only verify; we do not
  provision).
- Supporting non-NetBird meshes (Tailscale, etc.); possible later via the
  same `mesh.integrationId` indirection.