# Docker Agent Monitoring (self-hosted, push model)

Design doc for the self-hosted **Docker push-agent monitoring** feature
(Option 1 in `ROADMAP.md` → "Docker monitoring agent"). Written before
implementation; this is the contract the code should match.

## Goal

Let ArchNest **monitor** Docker containers across multiple VMs without ArchNest
reaching into those VMs. A small agent script runs on each Docker host, gathers
rich container data, and **pushes** it to ArchNest. ArchNest stores the latest
report per host and renders it read-only on the Containers page.

This is monitoring only. **Management (start/stop/restart/exec) is unchanged**
and continues to use the existing Docker-over-SSH path
(`backend/src/ssh/docker.ts`, `backend/src/routes/dockerSsh.ts`) and the Docker
Engine TCP integration (`backend/src/docker/`). A one-way push cannot perform
actions, by design, so nothing about management is removed.

## Why push (for self-hosted)

- VMs need **outbound-only** reachability to ArchNest. No exposed port, no
  dockerd TCP socket, no inbound SSH required for monitoring.
- Decoupled from SSH auth entirely (sidesteps the cert/OPKSSH auth gap that
  affects the Docker-over-SSH path).
- Simplest thing to "drop on any VM": a bash script + cron/systemd timer.

The richer **pull agent** (on-demand monitor + manage via a local authenticated
HTTP API on each VM) is the **paid** tier; see `ROADMAP.md`. It is not built here.

## Architecture

```
Docker VM (agent.sh, every N s)                ArchNest backend                          Browser
  docker ps --format json   ─┐
  docker inspect <id>...     ├─> JSON report ──POST /api/agents/docker/report──> upsert latest
  docker stats --no-stream  ─┘   (Bearer: ARCHNEST_AGENT_TOKEN)                  per host_id in SQLite
                                                                                       │
                                                   GET /api/agents/docker/... <────────┘ (user JWT)
                                                                                       │
                                                                          Containers page (read-only)
```

## Security

- **Ingest is token-gated, not user-gated.** `POST /api/agents/docker/report`
  is authenticated by a single shared secret `ARCHNEST_AGENT_TOKEN` (env var on
  the backend, same value in each agent script), compared in **constant time**.
  If the env var is unset, the ingest endpoint is **disabled** (returns 503), so
  the server never accepts unauthenticated reports.
- **Ingest must be reachable on the mesh / non-public IP only.** The token is
  the application-layer guard; at the network layer, the endpoint should not be
  exposed publicly. (A separate, later initiative, the "mesh prerequisite gate",
  will enforce mesh setup app-wide; this doc does not implement that gate. Until
  it exists, mesh-only reachability is an operational/deployment responsibility.)
- **Ingest only stores data; it never executes anything from the agent.** The
  payload is validated with zod and persisted as-is. There is no command path,
  so agent input has no command-injection surface.
- **Read endpoints are behind the normal user `authenticate` hook**, so any
  logged-in user can view monitoring data (consistent with the Phase 3 model:
  members can view everything). They are read-only.
- Single shared token now; **per-host revocable tokens** are a noted future
  improvement, not in this iteration.

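
The token rules above (503 when unconfigured, 401 on a missing or wrong token, constant-time comparison) could be sketched as follows. This is a minimal illustration, not the actual route code: `checkAgentToken` and `safeEqual` are hypothetical names, and hashing both sides before `timingSafeEqual` is one assumed way to avoid its equal-length requirement.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// HTTP status the ingest route would answer with (200 stands for "continue").
type IngestAuthStatus = 200 | 401 | 503;

// Hash both inputs to fixed-length digests so timingSafeEqual never
// receives buffers of different lengths (it throws in that case).
function safeEqual(a: string, b: string): boolean {
  const digestA = createHash("sha256").update(a).digest();
  const digestB = createHash("sha256").update(b).digest();
  return timingSafeEqual(digestA, digestB);
}

// Hypothetical helper: maps the Authorization header and the configured
// ARCHNEST_AGENT_TOKEN value to the status codes listed in this doc.
function checkAgentToken(
  authorization: string | undefined,
  configuredToken: string | undefined,
): IngestAuthStatus {
  if (!configuredToken) return 503; // ingest disabled when the env var is unset
  const prefix = "Bearer ";
  if (!authorization || !authorization.startsWith(prefix)) return 401;
  const presented = authorization.slice(prefix.length);
  return safeEqual(presented, configuredToken) ? 200 : 401;
}
```

In a route handler this would be called with the request's `Authorization` header and `process.env.ARCHNEST_AGENT_TOKEN` before any payload parsing.
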
## Report schema (rich)

The agent posts one report per host. `hostId` is a stable, user-chosen
identifier (stored as `host_id`); `hostname` is informational.

```jsonc
{
  "hostId": "proxmox-vm-1",            // stable id, [A-Za-z0-9._-], required
  "hostname": "docker01",              // informational
  "agentVersion": "1",
  "reportedAt": "2026-06-20T19:30:00Z", // agent clock; server also records its own receivedAt
  "containers": [
    {
      "id": "<full container id>",
      "name": "myapp",
      "image": "nginx:1.27",
      "imageId": "sha256:...",
      "state": "running",              // running|exited|paused|created|restarting|dead
      "status": "Up 3 hours",          // human string from docker ps
      "createdAt": "2026-06-20T16:00:00Z",
      "startedAt": "2026-06-20T16:00:01Z",
      "restartCount": 0,
      "restartPolicy": "unless-stopped",
      "health": "healthy",             // healthy|unhealthy|starting|none
      "ports": [                       // normalized from inspect
        { "hostIp": "0.0.0.0", "hostPort": 8080, "containerPort": 80, "proto": "tcp" }
      ],
      "networks": [
        { "name": "bridge", "ip": "172.17.0.2" }
      ],
      "mounts": [
        { "type": "volume", "source": "myapp_data", "destination": "/data", "rw": true }
      ],
      "env": [                         // SECRETS MASKED (see below)
        { "key": "NODE_ENV", "value": "production" },
        { "key": "DB_PASSWORD", "value": "********" }
      ],
      "command": "nginx -g 'daemon off;'",
      "labels": { "com.docker.compose.project": "myapp" },
      "stats": {                       // snapshot from docker stats --no-stream
        "cpuPercent": 1.4,
        "memUsage": 20971520,
        "memLimit": 536870912,
        "netRxBytes": 12345,
        "netTxBytes": 67890,
        "blockReadBytes": 0,
        "blockWriteBytes": 0
      }
    }
  ]
}
```

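
The doc calls for zod validation of this payload. As a dependency-free stand-in, the sketch below checks only the constraints spelled out in the schema comments: the `hostId` character set and the `state` enum. `parseReportEnvelope` and `ReportEnvelope` are hypothetical names, and a real schema would cover the remaining fields too.

```typescript
// Constraints taken from the schema comments above.
const HOST_ID_PATTERN = /^[A-Za-z0-9._-]+$/;
const CONTAINER_STATES = ["running", "exited", "paused", "created", "restarting", "dead"] as const;
type ContainerState = (typeof CONTAINER_STATES)[number];

// Deliberately partial view of the report: only the fields validated here.
interface ReportEnvelope {
  hostId: string;
  containers: { id: string; name: string; state: ContainerState }[];
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns the validated subset, or null when the payload should get a 400.
function parseReportEnvelope(input: unknown): ReportEnvelope | null {
  if (!isRecord(input)) return null;
  const { hostId, containers } = input;
  if (typeof hostId !== "string" || !HOST_ID_PATTERN.test(hostId)) return null;
  if (!Array.isArray(containers)) return null;

  const parsed: ReportEnvelope["containers"] = [];
  for (const container of containers) {
    if (!isRecord(container)) return null;
    const { id, name, state } = container;
    if (typeof id !== "string" || typeof name !== "string") return null;
    if (typeof state !== "string") return null;
    if (!(CONTAINER_STATES as readonly string[]).includes(state)) return null;
    parsed.push({ id, name, state: state as ContainerState });
  }
  return { hostId, containers: parsed };
}
```
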
### Env masking

The agent masks values whose key matches a secret-ish pattern
(`/(PASS|SECRET|TOKEN|KEY|PRIVATE|CREDENTIAL)/i`) before sending, replacing the
value with `********`. The full value never leaves the VM. (Defense in depth;
the backend also will not display unmasked secrets.)

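
The agent itself does the masking in the shell script. As an illustration of the same rule on the backend side (the defense-in-depth half), a minimal sketch might look like this; `maskEnv` and `EnvEntry` are hypothetical names, while the pattern and mask string are the ones given above.

```typescript
// Pattern and mask value as specified in this section.
const SECRET_KEY_PATTERN = /(PASS|SECRET|TOKEN|KEY|PRIVATE|CREDENTIAL)/i;
const MASK = "********";

interface EnvEntry {
  key: string;
  value: string;
}

// Returns a copy of the env list with secret-looking values replaced.
// The regex has no `g` flag, so `test` carries no state between calls.
function maskEnv(env: EnvEntry[]): EnvEntry[] {
  return env.map(({ key, value }) => ({
    key,
    value: SECRET_KEY_PATTERN.test(key) ? MASK : value,
  }));
}
```

Re-masking an already masked value is harmless, so applying this to agent-supplied data is safe.
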
### Source capability note

The Containers page already aggregates three sources (Docker TCP API, Docker
over SSH, and now agent). Not every field exists for every source, so the UI must
**degrade gracefully** and show "—" / "not available from this source" rather
than erroring. The agent is the richest source (it runs `docker inspect`).

## Backend

### DB

New table, latest-report-per-host (idempotent migration in
`backend/src/db/index.ts`):

```sql
CREATE TABLE IF NOT EXISTS docker_agent_reports (
  host_id     TEXT PRIMARY KEY,
  hostname    TEXT,
  report_json TEXT NOT NULL,  -- the full containers array as JSON
  reported_at TEXT,           -- agent-supplied timestamp
  received_at TEXT NOT NULL DEFAULT (datetime('now'))  -- server receive time (source of truth for staleness)
);
```

We keep only the latest report per `host_id` (upsert). Historical
time-series is out of scope for this iteration.

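
One way the "latest report per `host_id`" upsert could be written against this table is sketched below. It assumes SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE`) and deliberately names no database driver; `UPSERT_REPORT_SQL`, `IncomingReport`, and `upsertParams` are hypothetical names.

```typescript
// Insert a new row, or overwrite the existing row for the same host_id.
// received_at is always set by the server, never taken from the agent.
const UPSERT_REPORT_SQL = `
  INSERT INTO docker_agent_reports (host_id, hostname, report_json, reported_at, received_at)
  VALUES (?, ?, ?, ?, datetime('now'))
  ON CONFLICT(host_id) DO UPDATE SET
    hostname    = excluded.hostname,
    report_json = excluded.report_json,
    reported_at = excluded.reported_at,
    received_at = excluded.received_at
`;

// Subset of the validated report needed for the row.
interface IncomingReport {
  hostId: string;
  hostname?: string;
  reportedAt?: string;
  containers: unknown[];
}

// Positional parameters matching the four `?` placeholders above.
function upsertParams(report: IncomingReport): [string, string | null, string, string | null] {
  return [
    report.hostId,
    report.hostname ?? null,
    JSON.stringify(report.containers), // report_json holds only the containers array
    report.reportedAt ?? null,
  ];
}
```

The statement and parameters can then be handed to whichever SQLite binding the backend already uses.
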
### Endpoints

- `POST /api/agents/docker/report`: **token-gated** (Bearer
  `ARCHNEST_AGENT_TOKEN`, constant-time). 503 if token unconfigured, 401 on
  mismatch, 400 on invalid payload. Upserts the row for `hostId`.
- `GET /api/agents/docker/hosts`: user-auth. Returns each reported host with
  `hostId`, `hostname`, `receivedAt`, `containerCount`, and a `stale` flag
  (`true` if `received_at` is older than `STALE_AFTER_MS`; default about 90 s,
  tunable).
- `GET /api/agents/docker/hosts/:hostId/containers`: user-auth. Returns the
  parsed container list for that host (the spreadsheet rows plus enough data for
  the detail view).
- `GET /api/agents/docker/hosts/:hostId/containers/:containerId`: user-auth.
  Returns the single container's full detail object.

`api.ts` gets matching functions + TS interfaces (`AgentHost`,
`AgentContainer`, etc.).

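
The `stale` flag could be computed as in the sketch below. One detail worth handling: SQLite's `datetime('now')` produces `YYYY-MM-DD HH:MM:SS` in UTC with no zone marker, which JavaScript's `Date` would otherwise read as local time. `isStale` and `parseSqliteUtc` are hypothetical names, and the 90-second default mirrors the "about 90 s" figure above.

```typescript
// Default threshold; the doc describes STALE_AFTER_MS as tunable.
const DEFAULT_STALE_AFTER_MS = 90 * 1000;

// Convert SQLite's "YYYY-MM-DD HH:MM:SS" (UTC, no zone suffix) into epoch ms
// by rewriting it as an ISO 8601 string with an explicit Z.
function parseSqliteUtc(receivedAt: string): number {
  return Date.parse(receivedAt.replace(" ", "T") + "Z");
}

// A host is stale when its last report is older than the threshold.
// `nowMs` is passed in so the check stays deterministic and testable.
function isStale(
  receivedAt: string,
  nowMs: number,
  staleAfterMs: number = DEFAULT_STALE_AFTER_MS,
): boolean {
  const receivedMs = parseSqliteUtc(receivedAt);
  if (Number.isNaN(receivedMs)) return true; // unparseable timestamps count as stale
  return nowMs - receivedMs > staleAfterMs;
}
```

A route would typically call `isStale(row.received_at, Date.now())` for each host row.
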
## Agent script

`agent/archnest-docker-agent.sh` is portable bash with three dependencies:
`docker`, `curl`, and `jq`. The script builds the report by combining
`docker ps --format '{{json .}}'`, `docker inspect`, and
`docker stats --no-stream --format '{{json .}}'`, and uses `jq` to assemble and
mask the result.

Decision: require `jq`. A `jq`-free fallback was considered, but `jq` is the
only sane way to assemble + mask nested JSON in bash reliably, and it is a
one-line install on every distro. The script checks for it and exits with a
clear message if it is missing.

Configuration via env (script header or `/etc/archnest/agent.env`):
- `ARCHNEST_URL`: e.g. `http://<archnest-mesh-ip>:4000` (mesh address).
- `ARCHNEST_AGENT_TOKEN`: shared token.
- `ARCHNEST_HOST_ID`: stable id for this VM.

Scheduling: provide both a **cron** line and a **systemd service + timer**
example. Recommended interval 30s (must be < backend `STALE_AFTER_MS`). cron
cannot schedule below one minute, so the cron example has to run the script
twice per minute (for example with a `sleep 30` offset on the second entry).

## Frontend — Containers page

The Containers page becomes **tabbed**:
- **Tab 1 "Containers"**: the existing spreadsheet view (Name, Image, State,
  CPU, Memory, Ports, Actions), now also including agent-reported hosts. The
  host selector lists Docker-API, SSH, and agent hosts.
- **Clicking a container Name** opens a **new tab** in the Containers page
  showing that container's detail (tabs are dynamic; closeable).

### Detail tab contents (graceful per-source degradation)

- **Overview:** name, image + tag, image id, short/full id, created, started,
  uptime, restart count, restart policy.
- **State & health:** state, exit code (if stopped), healthcheck status.
- **Stats:** CPU %, mem usage/limit, net RX/TX, block I/O (snapshot; agent &
  Docker-API have it, SSH list does not).
- **Ports / Networks / Mounts:** tables.
- **Environment & labels:** env vars with secret values masked; labels.
- **Command/entrypoint.**
- **Logs:** recent tail (reuse existing logs path where the source supports it).

Fields unavailable from the active source render as "—" / a small "not
reported by this source" note.

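
The per-field fallback described above could be centralized in one small helper, sketched here under the assumption that missing fields arrive as `null` or `undefined`. `displayField` and `NOT_AVAILABLE` are hypothetical names.

```typescript
// Placeholder shown when the active source does not report a field.
const NOT_AVAILABLE = "—";

// Render a possibly-missing value. Only null/undefined count as missing,
// so legitimate values such as 0 or "" are still formatted normally.
function displayField<T>(
  value: T | null | undefined,
  format: (present: T) => string = String,
): string {
  if (value === null || value === undefined) return NOT_AVAILABLE;
  return format(value);
}
```

For example, `displayField(stats?.cpuPercent, (v) => v.toFixed(1) + " %")` yields a formatted percentage for agent hosts and the placeholder for sources without stats.
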
## Explicitly deferred (not in this work)

- **Mesh prerequisite gate** (require mesh detected/tested/verified in Settings
  before anything else can be configured): its own initiative, needs its own
  design (lockout-safety is the hard part). This doc assumes mesh-only ingest is
  handled operationally for now.
- **Option 2 paid pull-agent** (local authenticated HTTP API per VM, on-demand
  monitor + manage): see `ROADMAP.md`.
- **Per-host tokens**, **historical/time-series metrics**, **live log tailing
  for agent hosts**.