# Docker Agent Monitoring (self-hosted, push model)

Design doc for the self-hosted **Docker push-agent monitoring** feature (Option 1 in `ROADMAP.md` → "Docker monitoring agent"). Written before implementation; this is the contract the code should match.

## Goal

Let ArchNest **monitor** Docker containers across multiple VMs without ArchNest reaching into those VMs. A small agent script runs on each Docker host, gathers rich container data, and **pushes** it to ArchNest. ArchNest stores the latest report per host and renders it read-only on the Containers page.

This is monitoring only. **Management (start/stop/restart/exec) is unchanged** and continues to use the existing Docker-over-SSH path (`backend/src/ssh/docker.ts`, `backend/src/routes/dockerSsh.ts`) and the Docker Engine TCP integration (`backend/src/docker/`). A one-way push cannot perform actions, by design, so nothing about management is removed.

## Why push (for self-hosted)

- VMs need **outbound-only** reachability to ArchNest. No exposed port, no dockerd TCP socket, no inbound SSH required for monitoring.
- Decoupled from SSH auth entirely (sidesteps the cert/OPKSSH auth gap that affects the Docker-over-SSH path).
- Simplest thing to "drop on any VM": a bash script + cron/systemd timer.

The richer **pull agent** (on-demand monitor + manage via a local authenticated HTTP API on each VM) is the **paid** tier; see `ROADMAP.md`. It is not built here.

## Architecture

```
Docker VM (agent.sh, every N s)                                               ArchNest backend          Browser

docker ps --format json  ─┐
docker inspect ...        ├─> JSON report ──POST /api/agents/docker/report──> upsert latest
docker stats --no-stream ─┘   (Bearer: ARCHNEST_AGENT_TOKEN)                  per host_id in SQLite
                                                                                    │
                                               GET /api/agents/docker/...  <────────┘  (user JWT)
                                                                                    │
                                                                       Containers page (read-only)
```

## Security

- **Ingest is token-gated, not user-gated.** `POST /api/agents/docker/report` is authenticated by a single shared secret `ARCHNEST_AGENT_TOKEN` (env var on the backend, same value in each agent script), compared in **constant time**. If the env var is unset, the ingest endpoint is **disabled** (returns 503): the server never accepts unauthenticated reports.
- **Ingest must be reachable on the mesh / non-public IP only.** The token is the application-layer guard; at the network layer, the endpoint should not be exposed publicly. (A separate, later initiative, the "mesh prerequisite gate", will enforce mesh setup app-wide; this doc does not implement that gate. Until it exists, mesh-only reachability is an operational/deployment responsibility.)
- **Ingest only stores data; it never executes anything from the agent.** The payload is validated with zod and persisted as-is; there is no command path, so there is no injection surface from agent input.
- **Read endpoints are behind the normal user `authenticate` hook**, so any logged-in user can view monitoring data (consistent with the Phase 3 model: members can view everything). They are read-only.
- Single shared token now; **per-host revocable tokens** are a noted future improvement, not in this iteration.

## Report schema (rich)

The agent posts one report per host. `hostId` is a stable, user-chosen identifier; `hostname` is informational.

```jsonc
{
  "hostId": "proxmox-vm-1",               // stable id, [A-Za-z0-9._-], required
  "hostname": "docker01",                 // informational
  "agentVersion": "1",
  "reportedAt": "2026-06-20T19:30:00Z",   // agent clock; server also records its own receivedAt
  "containers": [
    {
      "id": "",
      "name": "myapp",
      "image": "nginx:1.27",
      "imageId": "sha256:...",
      "state": "running",                 // running|exited|paused|created|restarting|dead
      "status": "Up 3 hours",             // human string from docker ps
      "createdAt": "2026-06-20T16:00:00Z",
      "startedAt": "2026-06-20T16:00:01Z",
      "restartCount": 0,
      "restartPolicy": "unless-stopped",
      "health": "healthy",                // healthy|unhealthy|starting|none
      "ports": [                          // normalized from inspect
        { "hostIp": "0.0.0.0", "hostPort": 8080, "containerPort": 80, "proto": "tcp" }
      ],
      "networks": [ { "name": "bridge", "ip": "172.17.0.2" } ],
      "mounts": [
        { "type": "volume", "source": "myapp_data", "destination": "/data", "rw": true }
      ],
      "env": [                            // SECRETS MASKED (see below)
        { "key": "NODE_ENV", "value": "production" },
        { "key": "DB_PASSWORD", "value": "********" }
      ],
      "command": "nginx -g 'daemon off;'",
      "labels": { "com.docker.compose.project": "myapp" },
      "stats": {                          // snapshot from docker stats --no-stream
        "cpuPercent": 1.4,
        "memUsage": 20971520,
        "memLimit": 536870912,
        "netRxBytes": 12345,
        "netTxBytes": 67890,
        "blockReadBytes": 0,
        "blockWriteBytes": 0
      }
    }
  ]
}
```

### Env masking

The agent masks values whose key matches a secret-ish pattern (`/(PASS|SECRET|TOKEN|KEY|PRIVATE|CREDENTIAL)/i`) before sending, replacing the value with `********`. The full value never leaves the VM. (Defense in depth; the backend also will not display unmasked secrets.)

### Source capability note

The Containers page already aggregates three sources (Docker TCP API, Docker over SSH, and now agent). Not every field exists for every source, so the UI must **degrade gracefully** and show "—" / "not available from this source" rather than erroring. The agent is the richest source (it runs `docker inspect`).

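To make the Env masking rule concrete, here is a minimal sketch in TypeScript. It is an illustration only: the real agent is a bash script, and the names `maskEnv`, `EnvEntry`, and `MASK` are hypothetical. The regex is copied from the rule above, and the input shape assumes the `KEY=value` strings that `docker inspect` reports under `Config.Env`.

```typescript
// Pattern copied from the "Env masking" rule; everything else here is an
// illustrative assumption, not the agent's actual implementation.
const SECRET_KEY_PATTERN = /(PASS|SECRET|TOKEN|KEY|PRIVATE|CREDENTIAL)/i;
const MASK = "********";

interface EnvEntry {
  key: string;
  value: string;
}

// Turns "KEY=value" strings into { key, value } pairs, masking the value
// whenever the key matches the secret-ish pattern.
function maskEnv(rawEnv: string[]): EnvEntry[] {
  return rawEnv.map((line) => {
    // Split on the first "=" only, so values that contain "=" stay intact.
    const eq = line.indexOf("=");
    const key = eq === -1 ? line : line.slice(0, eq);
    const value = eq === -1 ? "" : line.slice(eq + 1);
    return { key, value: SECRET_KEY_PATTERN.test(key) ? MASK : value };
  });
}

const masked = maskEnv(["NODE_ENV=production", "DB_PASSWORD=hunter2", "MONKEY=banana"]);
console.log(JSON.stringify(masked));
```

Because the pattern is a substring match, a harmless key such as `MONKEY` is masked too (it contains `KEY`). Over-masking is the safer failure mode for a rule whose purpose is to keep secrets on the VM.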
## Backend

### DB

New table, latest-report-per-host (idempotent migration in `backend/src/db/index.ts`):

```sql
CREATE TABLE IF NOT EXISTS docker_agent_reports (
  host_id     TEXT PRIMARY KEY,
  hostname    TEXT,
  report_json TEXT NOT NULL,                          -- the full containers array as JSON
  reported_at TEXT,                                   -- agent-supplied timestamp
  received_at TEXT NOT NULL DEFAULT (datetime('now')) -- server receive time (source of truth for staleness)
);
```

We keep only the latest report per `host_id` (upsert). Historical time-series is out of scope for this iteration.

### Endpoints

- `POST /api/agents/docker/report`: **token-gated** (Bearer `ARCHNEST_AGENT_TOKEN`, constant-time). 503 if the token is unconfigured, 401 on mismatch, 400 on invalid payload. Upserts the row for `hostId`.
- `GET /api/agents/docker/hosts`: user-auth. Returns each reported host with `hostId`, `hostname`, `receivedAt`, `containerCount`, and a `stale` flag (`true` if `received_at` is older than `STALE_AFTER_MS`; default about 90 s, tunable).
- `GET /api/agents/docker/hosts/:hostId/containers`: user-auth. Returns the parsed container list for that host (the spreadsheet rows plus enough for the detail view).
- `GET /api/agents/docker/hosts/:hostId/containers/:containerId`: user-auth. Returns the single container's full detail object.

`api.ts` gets matching functions + TS interfaces (`AgentHost`, `AgentContainer`, etc.).

## Agent script

`agent/archnest-docker-agent.sh` is portable bash. Dependencies: `docker`, `curl`, and `jq`. The script builds the report by combining `docker ps --format '{{json .}}'`, `docker inspect`, and `docker stats --no-stream --format '{{json .}}'`, and uses `jq` to assemble and mask the result. Making `jq` optional was considered, but the decision is to require it: it is the only sane way to assemble + mask nested JSON in bash reliably, and `jq` is a one-line install on every distro. The script checks for it and exits with a clear message if it is missing.

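On the receiving side of the agent's POST, the token gate described under Endpoints could be sketched roughly as follows. This is a sketch under assumptions, not the backend's actual code: `checkAgentToken` and `IngestAuthStatus` are hypothetical names, and hashing both values before `crypto.timingSafeEqual` is one common way to satisfy that function's equal-length requirement; the doc itself only requires a constant-time comparison.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Status codes from the endpoint contract: 503 when the token is not
// configured, 401 on a missing or mismatched token, 200 when the gate passes.
type IngestAuthStatus = 503 | 401 | 200;

// Hash to a fixed length so timingSafeEqual never sees unequal-length
// buffers (it throws in that case).
function sha256(value: string): Buffer {
  return createHash("sha256").update(value, "utf8").digest();
}

// Hypothetical helper: decides the auth outcome for one ingest request.
function checkAgentToken(
  configuredToken: string | undefined,
  authorizationHeader: string | undefined,
): IngestAuthStatus {
  // Unset (or empty) env var: ingest is disabled rather than left open.
  if (!configuredToken) return 503;

  const presented = /^Bearer (.+)$/.exec(authorizationHeader ?? "")?.[1];
  if (presented === undefined) return 401;

  return timingSafeEqual(sha256(presented), sha256(configuredToken)) ? 200 : 401;
}

console.log(checkAgentToken(undefined, "Bearer abc"));      // 503
console.log(checkAgentToken("s3cret", "Bearer s3cret"));    // 200
console.log(checkAgentToken("s3cret", "Bearer not-it"));    // 401
```

A route handler would call a helper like this before payload validation, so an unconfigured or wrong token is rejected before any report body is parsed.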
Configuration via env (script header or `/etc/archnest/agent.env`):

- `ARCHNEST_URL`: e.g. `http://:4000` (mesh address).
- `ARCHNEST_AGENT_TOKEN`: shared token.
- `ARCHNEST_HOST_ID`: stable id for this VM.

Scheduling: provide both a **cron** line and a **systemd service + timer** example. Recommended interval 30s (must be < backend `STALE_AFTER_MS`).

## Frontend: Containers page

The Containers page becomes **tabbed**:

- **Tab 1 "Containers"**: the existing spreadsheet view (Name, Image, State, CPU, Memory, Ports, Actions), now also including agent-reported hosts. The host selector lists Docker-API, SSH, and agent hosts.
- **Clicking a container Name** opens a **new tab** in the Containers page showing that container's detail (tabs are dynamic; closeable).

### Detail tab contents (graceful per-source degradation)

- **Overview:** name, image + tag, image id, short/full id, created, started, uptime, restart count, restart policy.
- **State & health:** state, exit code (if stopped), healthcheck status.
- **Stats:** CPU %, mem usage/limit, net RX/TX, block I/O (snapshot; agent & Docker-API have it, SSH list does not).
- **Ports / Networks / Mounts:** tables.
- **Environment & labels:** env vars with secret values masked; labels.
- **Command/entrypoint.**
- **Logs:** recent tail (reuse existing logs path where the source supports it).

Fields unavailable from the active source render as "—" / a small "not reported by this source" note.

## Explicitly deferred (not in this work)

- **Mesh prerequisite gate** (require mesh detected/tested/verified in Settings before anything else can be configured): its own initiative, needs its own design (lockout-safety is the hard part). This doc assumes mesh-only ingest is handled operationally for now.
- **Option 2 paid pull-agent** (local authenticated HTTP API per VM, on-demand monitor + manage): see `ROADMAP.md`.
- **Per-host tokens**, **historical/time-series metrics**, **live log tailing for agent hosts**.