# Mesh Network Prerequisite Gate — Design

Design doc for requiring a **mesh network (NetBird) to be configured, tested,
and verified before the rest of ArchNest can be configured**. Written before
implementation. The hard problem here is **not locking the admin out**, so this
doc leads with that.

> Status: DESIGN, not yet implemented. Decisions marked **[DECIDE]** need the
> user's input before coding.

## Goal

After account setup, an admin must establish a verified mesh connection before
they can configure integrations, bookmarks, tunnels, etc. The intent: ArchNest
is meant to operate over a private mesh, and other features (e.g. the Docker
agent ingest, SSH to mesh hosts) assume mesh reachability. The gate makes that a
first-class, enforced prerequisite rather than an operational assumption.

## The lockout problem (read first)

A naive gate that blocks *everything* until mesh is verified is dangerous: if
the mesh test fails (wrong token, NetBird down, transient network), the admin
could be unable to reach the very settings needed to fix it. The existing
codebase already takes lockout seriously (the "last active admin" guards in
`auth.ts`). The gate must follow the same principle:

**Invariants (non-negotiable):**
1. The gate **never blocks** `/api/auth/*` (login, logout, sessions, password).
2. The gate **never blocks** the mesh configuration + test endpoints, nor the
   integration create/update/test routes needed to configure the mesh.
3. Enforcement is primarily **UI-level** (a gate screen that *is itself* the
   mesh-config UI), so the admin always has a way forward: the gate screen
   lets them enter/edit/test the mesh right there.
4. There is an explicit, logged **admin override** ("skip / I'll set this up
   later"); see **[DECIDE A]**. Without an override, a hard outage of the mesh
   provider could brick configuration access.
5. The mesh config row is always editable even when the gate is unsatisfied.

## What counts as "verified"? [DECIDE B]

Options, from loosest to strictest:
- **(B1) Reachable:** a NetBird integration exists and `testConnection`
  succeeds (the NetBird API answers `/api/peers` with the token). Proves the
  control-plane token works, not that *this host* is on the mesh.
- **(B2) On the mesh:** the backend host itself has an interface/IP in the mesh
  range (e.g. `100.64.0.0/10`), checked server-side. Proves the ArchNest host
  is actually meshed.
- **(B3) Reachable + peers present:** B1 plus `listResources()` returning ≥1
  connected peer.

Recommendation: **B1 as the baseline verification** (it's what the existing
NetBird adapter already supports and is deterministic), with **B2 as an
additional optional check** surfaced as info ("this host's mesh IP: …"). B3 is
nice but a single-peer network is legitimate, so don't require peers.

This needs your call; see [DECIDE B] at the end.

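To make the B2 check concrete, here is a minimal sketch of testing whether any of the host's IPv4 addresses falls inside a mesh CIDR. The function names `isInCidr` and `findHostMeshIp` are hypothetical, and `100.64.0.0/10` is only the example range mentioned above; the real range depends on how the NetBird network is configured.

```typescript
// Parse a dotted-quad IPv4 address into an unsigned 32-bit number, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when `ip` lies inside `cidr` (IPv4 only, e.g. "100.64.0.0/10").
function isInCidr(ip: string, cidr: string): boolean {
  const slash = cidr.indexOf("/");
  if (slash < 0) return false;
  const bitsText = cidr.slice(slash + 1);
  if (!/^\d{1,2}$/.test(bitsText)) return false;
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(cidr.slice(0, slash));
  if (ipInt === null || baseInt === null || bits > 32) return false;
  // Two addresses share a network when they fall in the same block of 2^(32 - bits).
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ipInt / blockSize) === Math.floor(baseInt / blockSize);
}

// Return the first host address inside the mesh range, if any (hypothetical helper).
function findHostMeshIp(addresses: string[], meshCidr = "100.64.0.0/10"): string | undefined {
  return addresses.find((address) => isInCidr(address, meshCidr));
}
```

On Node, the address list could be gathered from `os.networkInterfaces()`; that step is left out so the sketch stays free of runtime dependencies. Returning `undefined` rather than failing fits the recommendation to treat B2 as informational.
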
## Where state lives

There is **no server-side key-value config store** today; all config is in the
`integrations` table. Two options:

- **(S1) Derive from the NetBird integration:** "mesh verified" = there exists
  a `netbird` integration with `status = 'connected'` (optionally within a
  freshness window). No new table. Simplest, but conflates "an integration that
  happens to be NetBird" with "the designated mesh".
- **(S2) New `system_config` key-value table:** explicit keys like
  `mesh.integrationId`, `mesh.verifiedAt`, `mesh.overrideUntil`. Cleaner, gives
  us a real home for future system-level settings (and the override flag), at
  the cost of a new table + endpoints.

Recommendation: **S2, a small `system_config` kv table.** The gate needs to
persist an override flag and a "designated mesh integration" pointer that S1
can't cleanly represent, and ArchNest will want a system-config store for other
things eventually (this is also where a future "mesh required: on/off" toggle
lives). Proposed schema:

```sql
CREATE TABLE IF NOT EXISTS system_config (
  key TEXT PRIMARY KEY,
  value TEXT NOT NULL,
  updated_at TEXT NOT NULL DEFAULT (datetime('now'))
);
```

Keys: `mesh.integrationId` (the designated NetBird integration), `mesh.verifiedAt`
(ISO timestamp of last successful verify), `mesh.overrideUntil` (optional ISO
timestamp for a temporary admin skip), `mesh.required` (`"true"`/`"false"`,
default true; lets the whole gate be turned off).

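As a rough illustration of how the four keys above might be read back into typed values, the sketch below parses `system_config` rows that have already been loaded into a `Map`. The `MeshConfig` shape and the names `readMeshConfig` and `isOverrideActive` are hypothetical, not existing ArchNest code; treating an unparseable timestamp as absent is likewise an assumption.

```typescript
// Hypothetical typed view over the mesh.* keys in system_config.
interface MeshConfig {
  required: boolean;
  integrationId: string | null;
  verifiedAt: Date | null;
  overrideUntil: Date | null;
}

// Parse an ISO timestamp; a missing or malformed value is treated as absent.
function parseIsoDate(value: string | undefined): Date | null {
  if (value === undefined) return null;
  const date = new Date(value);
  return Number.isNaN(date.getTime()) ? null : date;
}

// Build a MeshConfig from key/value rows (key -> value).
function readMeshConfig(rows: ReadonlyMap<string, string>): MeshConfig {
  return {
    // Default is true: only the literal string "false" turns the gate off.
    required: rows.get("mesh.required") !== "false",
    integrationId: rows.get("mesh.integrationId") ?? null,
    verifiedAt: parseIsoDate(rows.get("mesh.verifiedAt")),
    overrideUntil: parseIsoDate(rows.get("mesh.overrideUntil")),
  };
}

// A temporary skip is active only while overrideUntil lies in the future.
function isOverrideActive(config: MeshConfig, now: Date): boolean {
  return config.overrideUntil !== null && config.overrideUntil.getTime() > now.getTime();
}
```

With an empty table this yields `required: true` and everything else `null`, which is the "gate on, nothing verified" state that [DECIDE D] is concerned about for existing installs.
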
## Frontend flow

### New auth status
Add `'needs-mesh'` to the `AuthStatus` union in `AuthContext.tsx`. `refresh()`
currently: token → `api.me()` → `'logged-in'`. New: after a successful
`api.me()`, also call `api.getMeshStatus()`; if mesh is **required and not
verified and not overridden**, set `'needs-mesh'` instead of `'logged-in'`.

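The decision itself is small enough to isolate as a pure function, sketched below. `MeshStatus` is a subset of the response fields proposed in the Backend section; the type `PostLoginStatus` and the function `resolvePostLoginStatus` are hypothetical names, and the full `AuthStatus` union in `AuthContext.tsx` is not reproduced here.

```typescript
// Subset of the fields returned by the proposed mesh-status endpoint.
interface MeshStatus {
  required: boolean;
  verified: boolean;
  overridden: boolean;
}

// Only the two statuses reachable after a successful api.me() call.
type PostLoginStatus = "logged-in" | "needs-mesh";

// Gate only when mesh is required AND not verified AND not overridden.
function resolvePostLoginStatus(mesh: MeshStatus): PostLoginStatus {
  const gated = mesh.required && !mesh.verified && !mesh.overridden;
  return gated ? "needs-mesh" : "logged-in";
}
```

Keeping this out of `refresh()` would let the three-way condition be unit-tested without mocking the API client.
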
### App routing (`App.tsx`)
Insert a branch **after** `logged-out`/`enrolling` and **before** `Dashboard`:
```tsx
// Runs after the existing logged-out / enrolling branches.
if (status === 'needs-mesh') return <MeshGate />
return <Dashboard />
```

### `MeshGate` page
A focused, full-screen page (styled like Enrollment) that:
- Explains the prerequisite.
- Lets the admin **configure the NetBird mesh** (reuse the integration
  create/test form, i.e. the same `createIntegration` + `testIntegration` calls
  Enrollment's `ConnectForm` already uses), or pick an existing NetBird
  integration as the designated mesh.
- Runs **Detect → Test → Verify**: shows the test result, and (B2) the detected
  mesh IP of the ArchNest host.
- On success, marks `mesh.verifiedAt`, then calls `refresh()` → advances to
  `Dashboard`.
- Provides the **admin override** control per [DECIDE A].
- **Members (non-admins):** a member who logs in while mesh is unverified can't
  fix it (only admins configure integrations). They should see a "waiting on an
  admin to finish mesh setup" message, not a config form. [DECIDE C: do we even
  allow member login pre-verification, or block all use until verified?]

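The success path in the list above (test, record `mesh.verifiedAt`, then `refresh()`) depends on a server-side verify step. A minimal sketch of that step under the B1 definition follows; `VerifyDeps`, `VerifyResult`, `verifyMesh`, and `setConfig` are hypothetical names, and the dependencies are injected so the flow does not assume any particular adapter or database API.

```typescript
// Hypothetical dependencies, injected so the flow can run without a real backend.
interface VerifyDeps {
  testConnection: (integrationId: string) => Promise<boolean>;
  setConfig: (key: string, value: string) => Promise<void>;
  now: () => Date;
}

interface VerifyResult {
  verified: boolean;
  verifiedAt?: string;
}

// Test the designated integration, and persist the mesh keys only on success.
async function verifyMesh(integrationId: string, deps: VerifyDeps): Promise<VerifyResult> {
  const ok = await deps.testConnection(integrationId);
  if (!ok) {
    // Nothing is written on failure, so any earlier verification is left as-is.
    return { verified: false };
  }
  const verifiedAt = deps.now().toISOString();
  await deps.setConfig("mesh.integrationId", integrationId);
  await deps.setConfig("mesh.verifiedAt", verifiedAt);
  return { verified: true, verifiedAt };
}
```

Whether a failed re-test should also clear a previous `mesh.verifiedAt` is not settled in this doc; the sketch leaves it untouched.
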
### Enrollment
Keep Enrollment's account step. The mesh step can either be folded into
Enrollment as a mandatory step before `finishEnrollment()`, or live purely as
the post-login `needs-mesh` gate. Recommendation: **gate only** (don't duplicate
in Enrollment). That keeps one code path, and it also covers existing installs
that predate the gate.

## Backend

- `GET /api/system/mesh-status` (mirrors `setup-status`): returns
  `{ required, verified, overridden, meshIntegrationId, hostMeshIp? }`. Behind
  `authenticate` (any logged-in user can read).
- `POST /api/system/mesh/verify` (admin): designates a NetBird integration as
  the mesh, runs its `testConnection`, (B2) checks host mesh IP, persists
  `mesh.integrationId` + `mesh.verifiedAt` on success. Returns the result.
- `POST /api/system/mesh/override` (admin) **[DECIDE A]**: sets
  `mesh.overrideUntil` (or a permanent skip). Writes a `logEvent`.
- Optional `PUT /api/system/mesh/required` (admin): toggle `mesh.required`.
- **Lockout safety:** none of the gate enforcement lives in a global request
  hook that could block auth/integration/system routes. If we add any
  server-side enforcement at all (beyond the UI gate), it must explicitly
  exempt `/api/auth/*`, `/api/integrations*`, and `/api/system/*`.

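If server-side enforcement is ever added, the exemption rule in the last bullet could be kept as one small predicate that any hook must consult first. The sketch below assumes plain prefix matching on the URL path; `GATE_EXEMPT_PREFIXES`, `isGateExempt`, and `shouldBlockRequest` are hypothetical names and not tied to any web framework.

```typescript
// Prefixes taken from the lockout-safety bullet above.
const GATE_EXEMPT_PREFIXES = ["/api/auth/", "/api/integrations", "/api/system/"] as const;

// True when the request path must never be blocked by the mesh gate.
function isGateExempt(url: string): boolean {
  // Ignore any query string so "/api/system/mesh-status?x=1" still matches.
  const queryStart = url.indexOf("?");
  const path = queryStart === -1 ? url : url.slice(0, queryStart);
  return GATE_EXEMPT_PREFIXES.some((prefix) => path.startsWith(prefix));
}

// Block only non-exempt routes, and only while the gate is unsatisfied.
function shouldBlockRequest(url: string, gateSatisfied: boolean): boolean {
  return !gateSatisfied && !isGateExempt(url);
}
```

Centralizing the list makes the invariants from the lockout section testable in isolation: a unit test can assert that every auth, integration, and system route passes even when `gateSatisfied` is `false`.
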
## Decisions needed before coding

- **[DECIDE A] Override:** Should the admin have a "skip for now" escape hatch?
  Strongly recommend **yes** (lockout safety). If yes: temporary (e.g. 24h,
  re-prompts) or permanent-until-changed? And does skipping still let them into
  the Dashboard fully, or into a limited state?
- **[DECIDE B] Verified definition:** B1 (reachable), B2 (host on mesh), or
  B1+B2? Recommend B1 baseline + B2 as informational.
- **[DECIDE C] Member behavior pre-verification:** block all non-admin login
  until mesh verified, or let members in with a "setup in progress" notice?
- **[DECIDE D] Existing install / this very deployment:** the live instance has
  no mesh row yet. Turning the gate on **will immediately gate the running
  production app** at next login. Do we (i) default `mesh.required = false` and
  let the admin opt in, or (ii) default it on but rely on the override? This is
  the riskiest part for the deployed instance.

## Explicitly out of scope

- Auto-installing/joining NetBird from ArchNest (we only verify, not provision).
- Supporting non-NetBird meshes (Tailscale, etc.); possible later via the
  same `mesh.integrationId` indirection.