dev_arc_aws/docs/docker-agent-monitoring.md

# Docker Agent Monitoring (self-hosted, push model)

Design doc for the self-hosted **Docker push-agent monitoring** feature
(Option 1 in `ROADMAP.md` → "Docker monitoring agent"). Written before
implementation; this is the contract the code should match.

## Goal

Let ArchNest **monitor** Docker containers across multiple VMs without ArchNest
reaching into those VMs. A small agent script runs on each Docker host, gathers
rich container data, and **pushes** it to ArchNest. ArchNest stores the latest
report per host and renders it read-only on the Containers page.

This is monitoring only. **Management (start/stop/restart/exec) is unchanged**
and continues to use the existing Docker-over-SSH path
(`backend/src/ssh/docker.ts`, `backend/src/routes/dockerSsh.ts`) and the Docker
Engine TCP integration (`backend/src/docker/`). A one-way push cannot perform
actions, by design — so nothing about management is removed.

## Why push (for self-hosted)

- VMs need **outbound-only** reachability to ArchNest. No exposed port, no
  dockerd TCP socket, no inbound SSH required for monitoring.
- Decoupled from SSH auth entirely (sidesteps the cert/OPKSSH auth gap that
  affects the Docker-over-SSH path).
- Simplest thing to "drop on any VM": a bash script + cron/systemd timer.

The richer **pull agent** (on-demand monitor + manage via a local authenticated
HTTP API on each VM) is the **paid** tier — see `ROADMAP.md`, not built here.

## Architecture

```
Docker VM (agent.sh, every N s)              ArchNest backend            Browser
  docker ps --format json   ─┐
  docker inspect <id>...     ├─> JSON report ──POST /api/agents/docker/report──> upsert latest
  docker stats --no-stream  ─┘   (Bearer: ARCHNEST_AGENT_TOKEN)        per host_id in SQLite
                                                                              │
                                          GET /api/agents/docker/... <────────┘ (user JWT)
                                                                              │
                                                              Containers page (read-only)
```

## Security

- **Ingest is token-gated, not user-gated.** `POST /api/agents/docker/report`
  is authenticated by a single shared secret `ARCHNEST_AGENT_TOKEN` (env var on
  the backend, same value in each agent script), compared in **constant time**.
  If the env var is unset, the ingest endpoint is **disabled** (returns 503) —
  the server never accepts unauthenticated reports.
- **Ingest must be reachable on the mesh / non-public IP only.** The token is
  the application-layer guard; at the network layer, the endpoint should not be
  exposed publicly. (A separate, later initiative, the "mesh prerequisite gate",
  will enforce mesh setup app-wide; this doc does not implement that gate. Until
  it exists, mesh-only reachability is an operational/deployment responsibility.)
- **Ingest only stores data; it never executes anything from the agent.** The
  payload is validated with zod and persisted as-is. No agent-supplied value is
  passed to a shell or command, so agent input has no command-injection path.
  The stored strings are still untrusted text and must be rendered as such.
- **Read endpoints are behind the normal user `authenticate` hook**, so any
  logged-in user can view monitoring data (consistent with the Phase 3 model:
  members can view everything). They are read-only.
- Single shared token now; **per-host revocable tokens** are a noted future
  improvement, not in this iteration.
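
The token gate described above can be sketched as a pure function that maps the request's `Authorization` header to the documented status codes (503 when the token is unset, 401 on mismatch). This is a minimal sketch, not the backend's actual code: `checkAgentAuth` and `constantTimeEqual` are hypothetical names, and the hand-rolled comparison only keeps the example dependency-free. In a Node backend, `crypto.timingSafeEqual` over equal-length buffers would be the more usual choice.

```typescript
// Hypothetical helper names; status codes follow the ingest contract in this doc.
type IngestAuthStatus = 200 | 401 | 503;

// Compare two strings without returning early on the first differing character.
function constantTimeEqual(a: string, b: string): boolean {
  const len = Math.max(a.length, b.length);
  let diff = a.length ^ b.length; // a length mismatch already counts as a difference
  for (let i = 0; i < len; i++) {
    // charCodeAt returns NaN past the end of a string; `| 0` coerces that to 0.
    diff |= (a.charCodeAt(i) | 0) ^ (b.charCodeAt(i) | 0);
  }
  return diff === 0;
}

function checkAgentAuth(
  authorization: string | undefined,
  configuredToken: string | undefined,
): IngestAuthStatus {
  if (!configuredToken) return 503; // ARCHNEST_AGENT_TOKEN unset: ingest disabled
  const prefix = "Bearer ";
  if (!authorization || !authorization.startsWith(prefix)) return 401;
  const presented = authorization.slice(prefix.length);
  return constantTimeEqual(presented, configuredToken) ? 200 : 401;
}
```

Checking for the unset token first means a misconfigured server answers 503 to every request, including ones that carry no header at all.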

## Report schema (rich)

The agent posts one report per host. `hostId` is a stable, user-chosen
identifier (stored in the `host_id` column); `hostname` is informational.

```jsonc
{
  "hostId": "proxmox-vm-1",          // stable id, [A-Za-z0-9._-], required
  "hostname": "docker01",            // informational
  "agentVersion": "1",
  "reportedAt": "2026-06-20T19:30:00Z", // agent clock; server also records its own receivedAt
  "containers": [
    {
      "id": "<full container id>",
      "name": "myapp",
      "image": "nginx:1.27",
      "imageId": "sha256:...",
      "state": "running",            // running|exited|paused|created|restarting|dead
      "status": "Up 3 hours",        // human string from docker ps
      "createdAt": "2026-06-20T16:00:00Z",
      "startedAt": "2026-06-20T16:00:01Z",
      "restartCount": 0,
      "restartPolicy": "unless-stopped",
      "health": "healthy",           // healthy|unhealthy|starting|none
      "ports": [                      // normalized from inspect
        { "hostIp": "0.0.0.0", "hostPort": 8080, "containerPort": 80, "proto": "tcp" }
      ],
      "networks": [
        { "name": "bridge", "ip": "172.17.0.2" }
      ],
      "mounts": [
        { "type": "volume", "source": "myapp_data", "destination": "/data", "rw": true }
      ],
      "env": [                        // SECRETS MASKED (see below)
        { "key": "NODE_ENV", "value": "production" },
        { "key": "DB_PASSWORD", "value": "********" }
      ],
      "command": "nginx -g 'daemon off;'",
      "labels": { "com.docker.compose.project": "myapp" },
      "stats": {                      // snapshot from docker stats --no-stream
        "cpuPercent": 1.4,
        "memUsage": 20971520,
        "memLimit": 536870912,
        "netRxBytes": 12345,
        "netTxBytes": 67890,
        "blockReadBytes": 0,
        "blockWriteBytes": 0
      }
    }
  ]
}
```
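
The `ports` array is described as "normalized from inspect". As an illustration of that mapping, the sketch below converts the `NetworkSettings.Ports` object from `docker inspect` (keys such as `80/tcp`, values either a list of host bindings or `null`) into the flat shape used in the schema. The agent itself does this in `jq`; the TypeScript version, the `normalizePorts` name, and the choice to skip unpublished ports are assumptions made for the example.

```typescript
// Shape of NetworkSettings.Ports in `docker inspect` output.
type InspectPorts = Record<string, { HostIp: string; HostPort: string }[] | null>;

interface ReportPort {
  hostIp: string;
  hostPort: number;
  containerPort: number;
  proto: string;
}

// Hypothetical helper: flatten inspect's port map into the report's `ports` array.
function normalizePorts(ports: InspectPorts | null | undefined): ReportPort[] {
  const out: ReportPort[] = [];
  for (const [key, bindings] of Object.entries(ports ?? {})) {
    if (!bindings) continue; // exposed but not published: skipped (assumption)
    const [portStr, proto = "tcp"] = key.split("/");
    for (const b of bindings) {
      out.push({
        hostIp: b.HostIp,
        hostPort: Number(b.HostPort), // inspect reports host ports as strings
        containerPort: Number(portStr),
        proto,
      });
    }
  }
  return out;
}
```

For example, `{ "80/tcp": [{ HostIp: "0.0.0.0", HostPort: "8080" }], "443/tcp": null }` yields the single entry shown in the schema above.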

### Env masking
Before sending, the agent masks any env value whose key matches a secret-like
pattern (`/(PASS|SECRET|TOKEN|KEY|PRIVATE|CREDENTIAL)/i`), replacing the value
with `********`. The original value never leaves the VM. (Defense in depth:
the backend also will not display unmasked secrets.)
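
The masking rule is small enough to show directly. The sketch below applies the same pattern on the backend side, as the defense-in-depth pass mentioned above; the agent's own masking runs in `jq` on the VM. `maskEnv` and `EnvEntry` are hypothetical names.

```typescript
interface EnvEntry {
  key: string;
  value: string;
}

// Pattern and mask string are taken from this doc.
const SECRET_KEY_PATTERN = /(PASS|SECRET|TOKEN|KEY|PRIVATE|CREDENTIAL)/i;
const MASK = "********";

// Hypothetical helper: return a copy of the env list with secret-like values masked.
function maskEnv(env: EnvEntry[]): EnvEntry[] {
  return env.map(({ key, value }) => ({
    key,
    value: SECRET_KEY_PATTERN.test(key) ? MASK : value,
  }));
}
```

The match is a case-insensitive substring test, so it errs toward over-masking: a key such as `KEYBOARD_LAYOUT` is masked as well.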

### Source capability note
The Containers page aggregates three sources (Docker TCP API, Docker over SSH,
and now the agent). Not every field exists for every source, so the UI must
**degrade gracefully** and show "—" / "not available from this source" rather
than erroring. The agent is the richest source (it runs `docker inspect`).

## Backend

### DB
New table, latest-report-per-host (idempotent migration in
`backend/src/db/index.ts`):

```sql
CREATE TABLE IF NOT EXISTS docker_agent_reports (
  host_id     TEXT PRIMARY KEY,
  hostname    TEXT,
  report_json TEXT NOT NULL,        -- the full containers array as JSON
  reported_at TEXT,                 -- agent-supplied timestamp
  received_at TEXT NOT NULL DEFAULT (datetime('now'))  -- server receive time (source of truth for staleness)
);
```

We keep only the latest report per `host_id` (upsert). Historical
time-series is out of scope for this iteration.
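
One detail of the upsert is easy to miss: the `DEFAULT (datetime('now'))` on `received_at` only fires on the first insert, so the update branch has to refresh it explicitly or the staleness check would keep using the first report's time. The sketch below shows one way to write the statement with SQLite's `ON CONFLICT ... DO UPDATE` syntax. `UPSERT_REPORT_SQL`, `upsertParams`, and `IngestReport` are hypothetical names, and the statement is not tied to any particular SQLite driver.

```typescript
interface IngestReport {
  hostId: string;
  hostname?: string;
  reportedAt?: string;
  containers: unknown[];
}

// received_at is set in VALUES and copied on conflict, so every report refreshes it.
const UPSERT_REPORT_SQL = `
INSERT INTO docker_agent_reports (host_id, hostname, report_json, reported_at, received_at)
VALUES (?, ?, ?, ?, datetime('now'))
ON CONFLICT(host_id) DO UPDATE SET
  hostname    = excluded.hostname,
  report_json = excluded.report_json,
  reported_at = excluded.reported_at,
  received_at = excluded.received_at
`;

// Positional parameters in the same order as the four placeholders above.
function upsertParams(report: IngestReport): (string | null)[] {
  return [
    report.hostId,
    report.hostname ?? null,
    JSON.stringify(report.containers),
    report.reportedAt ?? null,
  ];
}
```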

### Endpoints
- `POST /api/agents/docker/report` — **token-gated** (Bearer
  `ARCHNEST_AGENT_TOKEN`, constant-time). 503 if token unconfigured, 401 on
  mismatch, 400 on invalid payload. Upserts the row for `hostId`.
- `GET /api/agents/docker/hosts` — user-auth. Returns each reported host with
  `hostId`, `hostname`, `receivedAt`, `containerCount`, and a `stale` flag
  (`true` if `received_at` is older than `STALE_AFTER_MS`; default about 90 s, tunable).
- `GET /api/agents/docker/hosts/:hostId/containers` — user-auth. Returns the
  parsed container list for that host (the list-view rows, with enough fields to render the detail tab).
- `GET /api/agents/docker/hosts/:hostId/containers/:containerId` — user-auth.
  Returns the single container's full detail object.

`api.ts` gets matching functions + TS interfaces (`AgentHost`,
`AgentContainer`, etc.).
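
As a sketch of the `stale` flag, the code below turns a stored row into the host summary returned by the hosts endpoint. One point worth handling explicitly: SQLite's `datetime('now')` produces `YYYY-MM-DD HH:MM:SS` in UTC with no zone marker, so the value is converted to an ISO string before parsing. The `AgentHost` fields follow the endpoint description; `ReportRow`, `parseSqliteUtc`, `isStale`, and `toAgentHost` are hypothetical names, and 90000 ms stands in for the tunable default.

```typescript
interface AgentHost {
  hostId: string;
  hostname: string | null;
  receivedAt: string;
  containerCount: number;
  stale: boolean;
}

interface ReportRow {
  host_id: string;
  hostname: string | null;
  report_json: string;
  received_at: string;
}

const STALE_AFTER_MS = 90000; // assumed default; tunable per this doc

// "2026-06-20 19:30:00" (UTC, no zone marker) -> epoch milliseconds.
function parseSqliteUtc(ts: string): number {
  return Date.parse(ts.replace(" ", "T") + "Z");
}

function isStale(receivedAt: string, nowMs: number, staleAfterMs: number = STALE_AFTER_MS): boolean {
  const receivedMs = parseSqliteUtc(receivedAt);
  if (Number.isNaN(receivedMs)) return true; // unparseable timestamp: treat as stale
  return nowMs - receivedMs > staleAfterMs;
}

function toAgentHost(row: ReportRow, nowMs: number): AgentHost {
  const containers: unknown = JSON.parse(row.report_json);
  return {
    hostId: row.host_id,
    hostname: row.hostname,
    receivedAt: row.received_at,
    containerCount: Array.isArray(containers) ? containers.length : 0,
    stale: isStale(row.received_at, nowMs),
  };
}
```

Passing `nowMs` in, rather than calling `Date.now()` inside, keeps the staleness rule easy to test.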

## Agent script

`agent/archnest-docker-agent.sh`: portable bash. Dependencies: `docker`,
`curl`, and `jq`. The script builds the report by combining
`docker ps --format '{{json .}}'`, `docker inspect`, and
`docker stats --no-stream --format '{{json .}}'`, using `jq` to assemble the
JSON and mask secrets. (Decision: require `jq`. Making it optional was
considered, but it is the only reliable way to assemble and mask nested JSON in
bash, and it is a one-line package install on mainstream distros. The script
checks for it and exits with a clear message if missing.)
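
The `stats` fields in the schema are byte counts, while `docker stats` prints human-readable pairs such as `20MiB / 512MiB`. The sketch below illustrates the conversion the agent has to perform. It is written in TypeScript only for consistency with the backend examples, since the agent does this in `jq`. It assumes binary units (`KiB`, `MiB`, `GiB`) for memory and decimal units (`kB`, `MB`, `GB`) for network and block I/O, which matches the usual `docker stats` output but should be checked against the Docker versions in use. `parseSize` and `parsePair` are hypothetical names.

```typescript
// Multipliers for the unit suffixes assumed to appear in `docker stats` output.
const UNIT_FACTORS: Record<string, number> = {
  B: 1,
  kB: 1e3, MB: 1e6, GB: 1e9, TB: 1e12,
  KiB: 1024, MiB: 1024 ** 2, GiB: 1024 ** 3, TiB: 1024 ** 4,
};

// "20MiB" -> 20971520; returns null for anything that does not parse.
function parseSize(text: string): number | null {
  const m = /^\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]+)\s*$/.exec(text);
  if (!m) return null;
  const factor = UNIT_FACTORS[m[2]];
  if (factor === undefined) return null;
  return Math.round(Number(m[1]) * factor);
}

// "20MiB / 512MiB" -> [20971520, 536870912]
function parsePair(text: string): [number | null, number | null] {
  const [left = "", right = ""] = text.split("/");
  return [parseSize(left), parseSize(right)];
}
```

With these factors, `20MiB / 512MiB` maps to the `memUsage` and `memLimit` values used in the schema example.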

Configuration via env (script header or `/etc/archnest/agent.env`):
- `ARCHNEST_URL` — e.g. `http://<archnest-mesh-ip>:4000` (mesh address).
- `ARCHNEST_AGENT_TOKEN` — shared token.
- `ARCHNEST_HOST_ID` — stable id for this VM.

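Scheduling: provide both a **cron** line and a **systemd service + timer**
example. Recommended interval 30s (must be < backend `STALE_AFTER_MS`). Cron's
finest granularity is one minute, so the cron example either runs every 60s
(still under the ~90s default) or pairs two entries, one offset with `sleep 30`.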

## Frontend — Containers page

The Containers page becomes **tabbed**:
- **Tab 1 "Containers"** — the existing spreadsheet view (Name, Image, State,
  CPU, Memory, Ports, Actions), now also including agent-reported hosts. The
  host selector lists Docker-API, SSH, and agent hosts.
- **Clicking a container Name** opens a **new tab** in the Containers page
  showing that container's detail (tabs are dynamic; closeable).

### Detail tab contents (graceful per-source degradation)
- **Overview:** name, image + tag, image id, short/full id, created, started,
  uptime, restart count, restart policy.
- **State & health:** state, exit code (if stopped), healthcheck status.
- **Stats:** CPU %, mem usage/limit, net RX/TX, block I/O (snapshot; agent &
  Docker-API have it, SSH list does not).
- **Ports / Networks / Mounts:** tables.
- **Environment & labels:** env vars with secret values masked; labels.
- **Command/entrypoint.**
- **Logs:** recent tail (reuse existing logs path where the source supports it).

Fields unavailable from the active source render as "—" / a small "not
reported by this source" note.
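
A small formatting helper is one way to keep the degradation rule consistent across the detail tab. This is a minimal sketch: `displayField` and `NOT_AVAILABLE` are hypothetical names, and the placeholder is the dash character this doc specifies, written as a Unicode escape.

```typescript
// Placeholder shown when the active source did not report a field.
const NOT_AVAILABLE = "\u2014";

// Hypothetical helper: render a possibly-missing field as display text.
function displayField(value: string | number | null | undefined): string {
  // Explicit checks rather than a falsy test, so a legitimate 0 still renders.
  if (value === null || value === undefined || value === "") return NOT_AVAILABLE;
  return String(value);
}
```

The explicit checks matter for fields such as `restartCount`, where `0` is a real value and not a missing one.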

## Explicitly deferred (not in this work)

- **Mesh prerequisite gate** (require mesh detected/tested/verified in Settings
  before anything else can be configured) — its own initiative, needs its own
  design (lockout-safety is the hard part). This doc assumes mesh-only ingest is
  handled operationally for now.
- **Option 2 paid pull-agent** (local authenticated HTTP API per VM, on-demand
  monitor + manage) — `ROADMAP.md`.
- **Per-host tokens**, **historical/time-series metrics**, **live log tailing
  for agent hosts**.

Add Docker-over-SSH management and push-agent monitoring (#31)

Expands the Containers feature with two new ways to see and manage Docker
containers without exposing the Docker Engine TCP socket, plus the docs and
roadmap entries that frame them.

Docker over SSH (management):
- Runs the `docker` CLI on a remote SSH host instead of talking to the Engine
  TCP API, reusing the existing SSH transport (jump-host chaining, host-key
  verification, key/password auth) via connectTarget + execCommand. No dockerd
  socket has to be exposed; the mesh + SSH auth are the gate.
- backend/src/ssh/docker.ts: list/logs/start/stop/restart/pause/unpause/remove
  and an interactive `docker exec` shell builder. Container refs are validated
  against a strict allowlist and single-quoted to prevent command injection;
  action verbs are whitelisted.
- backend/src/routes/dockerSsh.ts: REST routes mirroring the TCP Docker API
  shape (mutating actions gated by adminOnly) + a /api/docker-ssh/exec
  WebSocket modeled on the terminal PTY plumbing.
- Note: the SSH path uses the ssh2 key/password auth; it does not implement the
  OpenSSH-certificate (OPKSSH) fallback that the terminal route has.

Docker push-agent monitoring (self-hosted, read-only):
- A small bash agent (agent/archnest-docker-agent.sh) runs on each Docker VM,
  collects a rich snapshot (docker ps + inspect + a stats snapshot), masks
  secret-looking env values locally, and POSTs it to ArchNest. VMs need
  outbound-only mesh access: no exposed port, no SSH for monitoring.
- backend/src/routes/agents.ts: token-gated ingest
  (POST /api/agents/docker/report, ARCHNEST_AGENT_TOKEN, constant-time compare;
  503 when unset, so it is disabled by default) plus user-auth read endpoints
  (hosts list with staleness flag, per-host containers, single-container
  detail). New docker_agent_reports table (latest report per host).
- Ingest stores data only; it never executes anything from the agent.

Containers page:
- Host selector now spans Docker API, SSH, and Agent sources.
- Intra-page tabs: a Containers list plus dynamic, closeable per-container
  detail tabs opened by clicking a container name. Agent detail shows
  overview/state/stats/ports/networks/mounts/env(masked)/labels; docker/ssh
  degrade gracefully. Agent rows are read-only; docker/ssh keep management.

Docs/roadmap:
- docs/docker-agent-monitoring.md (design doc, written before implementation).
- ROADMAP.md: LXC management (paid), Docker monitoring agent tiering (push
  self-hosted now / pull-agent paid), terminal grid tiering.

Deferred (documented, not built here): the mesh-prerequisite setup gate, the
paid pull-agent (Option 2), per-host tokens, time-series metrics.

Requires ARCHNEST_AGENT_TOKEN in the backend env to enable agent ingest.

Verified: backend `tsc --noEmit` and frontend `tsc -b && vite build` both pass;
agent jq filters, byte conversion, and `bash -n` checked locally.

Co-authored-by: Samuel James <ssamjame@amazon.com>
Co-authored-by: Kiro <noreply@kiro.dev>

2026-06-20 16:24:57 -04:00
