dev_arc_aws/docs/docker-agent-monitoring.md

# Docker Agent Monitoring (self-hosted, push model)

Design doc for the self-hosted **Docker push-agent monitoring** feature
(Option 1 in `ROADMAP.md` → "Docker monitoring agent"). Written before
implementation; this is the contract the code should match.

## Goal

Let ArchNest **monitor** Docker containers across multiple VMs without ArchNest
reaching into those VMs. A small agent script runs on each Docker host, gathers
rich container data, and **pushes** it to ArchNest. ArchNest stores the latest
report per host and renders it read-only on the Containers page.

This is monitoring only. **Management (start/stop/restart/exec) is unchanged**
and continues to use the existing Docker-over-SSH path
(`backend/src/ssh/docker.ts`, `backend/src/routes/dockerSsh.ts`) and the Docker
Engine TCP integration (`backend/src/docker/`). A one-way push cannot perform
actions, by design — so nothing about management is removed.

## Why push (for self-hosted)

- VMs need **outbound-only** reachability to ArchNest. No exposed port, no
  dockerd TCP socket, no inbound SSH required for monitoring.
- Decoupled from SSH auth entirely (sidesteps the cert/OPKSSH auth gap that
  affects the Docker-over-SSH path).
- Simplest thing to "drop on any VM": a bash script + cron/systemd timer.

The richer **pull agent** (on-demand monitor + manage via a local authenticated
HTTP API on each VM) is the **paid** tier — see `ROADMAP.md`, not built here.

## Architecture

```
Docker VM (agent.sh, every N s)              ArchNest backend            Browser
  docker ps --format json   ─┐
  docker inspect <id>...     ├─> JSON report ──POST /api/agents/docker/report──> upsert latest
  docker stats --no-stream  ─┘   (Bearer: ARCHNEST_AGENT_TOKEN)        per host_id in SQLite
                                                                              │
                                          GET /api/agents/docker/... <────────┘ (user JWT)
                                                                              │
                                                              Containers page (read-only)
```

## Security

- **Ingest is token-gated, not user-gated.** `POST /api/agents/docker/report`
  is authenticated by a single shared secret `ARCHNEST_AGENT_TOKEN` (env var on
  the backend, same value in each agent script), compared in **constant time**.
  If the env var is unset, the ingest endpoint is **disabled** (returns 503) —
  the server never accepts unauthenticated reports.
- **Ingest must be reachable on the mesh / non-public IP only.** The token is
  the application-layer guard; network-layer the endpoint should not be exposed
  publicly. (A separate, later initiative — the "mesh prerequisite gate" — will
  enforce mesh setup app-wide; this doc does not implement that gate. Until it
  exists, mesh-only reachability is an operational/deployment responsibility.)
- **Ingest only stores data — it never executes anything from the agent.** The
  payload is validated with zod and persisted as-is; there is no command path,
  so there is no injection surface from agent input.
- **Read endpoints are behind the normal user `authenticate` hook**, so any
  logged-in user can view monitoring data (consistent with the Phase 3 model:
  members can view everything). They are read-only.
- Single shared token now; **per-host revocable tokens** are a noted future
  improvement, not in this iteration.

## Report schema (rich)

The agent posts one report per host. `host_id` is a stable, user-chosen
identifier; `hostname` is informational.

```jsonc
{
  "hostId": "proxmox-vm-1",          // stable id, [A-Za-z0-9._-], required
  "hostname": "docker01",            // informational
  "agentVersion": "1",
  "reportedAt": "2026-06-20T19:30:00Z", // agent clock; server also records its own receivedAt
  "containers": [
    {
      "id": "<full container id>",
      "name": "myapp",
      "image": "nginx:1.27",
      "imageId": "sha256:...",
      "state": "running",            // running|exited|paused|created|restarting|dead
      "status": "Up 3 hours",        // human string from docker ps
      "createdAt": "2026-06-20T16:00:00Z",
      "startedAt": "2026-06-20T16:00:01Z",
      "restartCount": 0,
      "restartPolicy": "unless-stopped",
      "health": "healthy",           // healthy|unhealthy|starting|none
      "ports": [                      // normalized from inspect
        { "hostIp": "0.0.0.0", "hostPort": 8080, "containerPort": 80, "proto": "tcp" }
      ],
      "networks": [
        { "name": "bridge", "ip": "172.17.0.2" }
      ],
      "mounts": [
        { "type": "volume", "source": "myapp_data", "destination": "/data", "rw": true }
      ],
      "env": [                        // SECRETS MASKED (see below)
        { "key": "NODE_ENV", "value": "production" },
        { "key": "DB_PASSWORD", "value": "********" }
      ],
      "command": "nginx -g 'daemon off;'",
      "labels": { "com.docker.compose.project": "myapp" },
      "stats": {                      // snapshot from docker stats --no-stream
        "cpuPercent": 1.4,
        "memUsage": 20971520,
        "memLimit": 536870912,
        "netRxBytes": 12345,
        "netTxBytes": 67890,
        "blockReadBytes": 0,
        "blockWriteBytes": 0
      }
    }
  ]
}
```

### Env masking
The agent masks values whose key matches a secret-ish pattern
(`/(PASS|SECRET|TOKEN|KEY|PRIVATE|CREDENTIAL)/i`) before sending, replacing the
value with `********`. The full value never leaves the VM. (Defense in depth;
the backend also will not display unmasked secrets.)

### Source capability note
The Containers page already aggregates three sources (Docker TCP API, Docker
over SSH, and now agent). Not every field exists for every source — the UI must
**degrade gracefully** and show "—" / "not available from this source" rather
than erroring. The agent is the richest source (it runs `docker inspect`).

## Backend

### DB
New table, latest-report-per-host (idempotent migration in
`backend/src/db/index.ts`):

```sql
CREATE TABLE IF NOT EXISTS docker_agent_reports (
  host_id     TEXT PRIMARY KEY,
  hostname    TEXT,
  report_json TEXT NOT NULL,        -- the full containers array as JSON
  reported_at TEXT,                 -- agent-supplied timestamp
  received_at TEXT NOT NULL DEFAULT (datetime('now'))  -- server receive time (source of truth for staleness)
);
```

We keep only the latest report per `host_id` (upsert). Historical
time-series is out of scope for this iteration.

### Endpoints
- `POST /api/agents/docker/report` — **token-gated** (Bearer
  `ARCHNEST_AGENT_TOKEN`, constant-time). 503 if token unconfigured, 401 on
  mismatch, 400 on invalid payload. Upserts the row for `hostId`.
- `GET /api/agents/docker/hosts` — user-auth. Returns each reported host with
  `hostId`, `hostname`, `receivedAt`, `containerCount`, and a `stale` flag
  (`true` if `received_at` older than `STALE_AFTER_MS`, default ~90s / tunable).
- `GET /api/agents/docker/hosts/:hostId/containers` — user-auth. Returns the
  parsed container list for that host (the spreadsheet rows + enough for detail).
- `GET /api/agents/docker/hosts/:hostId/containers/:containerId` — user-auth.
  Returns the single container's full detail object.

`api.ts` gets matching functions + TS interfaces (`AgentHost`,
`AgentContainer`, etc.).

## Agent script

`agent/archnest-docker-agent.sh` — portable bash, dependencies: `docker`,
`curl`, and a JSON tool. To avoid forcing `jq`, the script builds the report by
combining `docker ps --format '{{json .}}'`, `docker inspect`, and
`docker stats --no-stream --format '{{json .}}'`; if `jq` is present it is used
to assemble/mask robustly, otherwise a documented `jq`-required note is shown.
(Decision: require `jq` — it is the only sane way to assemble + mask nested
JSON in bash reliably; `jq` is a one-line install on every distro. The script
checks for it and exits with a clear message if missing.)

Configuration via env (script header or `/etc/archnest/agent.env`):
- `ARCHNEST_URL` — e.g. `http://<archnest-mesh-ip>:4000` (mesh address).
- `ARCHNEST_AGENT_TOKEN` — shared token.
- `ARCHNEST_HOST_ID` — stable id for this VM.

Scheduling: provide both a **cron** line and a **systemd service + timer**
example. Recommended interval 30s (must be < backend `STALE_AFTER_MS`).

## Frontend — Containers page

The Containers page becomes **tabbed**:
- **Tab 1 "Containers"** — the existing spreadsheet view (Name, Image, State,
  CPU, Memory, Ports, Actions), now also including agent-reported hosts. The
  host selector lists Docker-API, SSH, and agent hosts.
- **Clicking a container Name** opens a **new tab** in the Containers page
  showing that container's detail (tabs are dynamic; closeable).

### Detail tab contents (graceful per-source degradation)
- **Overview:** name, image + tag, image id, short/full id, created, started,
  uptime, restart count, restart policy.
- **State & health:** state, exit code (if stopped), healthcheck status.
- **Stats:** CPU %, mem usage/limit, net RX/TX, block I/O (snapshot; agent &
  Docker-API have it, SSH list does not).
- **Ports / Networks / Mounts:** tables.
- **Environment & labels:** env vars with secret values masked; labels.
- **Command/entrypoint.**
- **Logs:** recent tail (reuse existing logs path where the source supports it).

Fields unavailable from the active source render as "—" / a small "not
reported by this source" note.

## Explicitly deferred (not in this work)

- **Mesh prerequisite gate** (require mesh detected/tested/verified in Settings
  before anything else can be configured) — its own initiative, needs its own
  design (lockout-safety is the hard part). This doc assumes mesh-only ingest is
  handled operationally for now.
- **Option 2 paid pull-agent** (local authenticated HTTP API per VM, on-demand
  monitor + manage) — `ROADMAP.md`.
- **Per-host tokens**, **historical/time-series metrics**, **live log tailing
  for agent hosts**.