From 74f8b9f70522f08fc4cbb70cd1b907c982544473 Mon Sep 17 00:00:00 2001
From: Claude
Date: Mon, 22 Jun 2026 14:54:47 +0000
Subject: [PATCH] Add RDP debugging handoff doc for next investigator

---
 docs/rdp-debug-handoff.md | 203 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 203 insertions(+)
 create mode 100644 docs/rdp-debug-handoff.md

diff --git a/docs/rdp-debug-handoff.md b/docs/rdp-debug-handoff.md
new file mode 100644
index 0000000..1162115
--- /dev/null
+++ b/docs/rdp-debug-handoff.md
@@ -0,0 +1,203 @@
+# RDP Connection Debugging — Handoff Doc
+
+## Goal
+
+ArchNest is a self-hosted dashboard product. One of its integrations is a "Remote Desktop"
+connection type that proxies RDP/VNC/Telnet sessions through `guacd` (Apache Guacamole's
+proxy daemon) so users can open a remote desktop session in the browser. This needs to work
+reliably for *any* user's RDP server, not just this one — so the immediate goal is to get
+this specific connection working, but treat every root cause found as a potential general
+fix (config option, docs, code change) since other users will hit the same servers
+(gnome-remote-desktop, xrdp, Windows RDP, etc).
+
+**You have hands-on access to both machines involved.** Use that — actively connect to both,
+run diagnostics on both sides simultaneously, and correlate logs/timestamps. Do not guess from
+one side alone; multiple times in this debugging session, a theory formed from only one
+machine's logs turned out to be wrong once the other machine's logs were checked.
+
+## The two machines
+
+1. **`racknerd-712b73a`** — the VPS running the ArchNest stack (this repo) in Docker.
+   - Container `archnest-backend` — the Node/Fastify backend. Route of interest:
+     `backend/src/routes/guacamole.ts` — bridges a browser WebSocket to `guacd` using
+     `guacamole-lite`'s `ClientConnection`/`Crypt` classes.
+     Builds a Guacamole connection token (protocol, hostname, port, username, password,
+     domain, security, ignore-cert) and hands it to `guacd`.
+   - Container `archnest-guacd` — Apache Guacamole's `guacd` (v1.5.5), the proxy daemon that
+     actually speaks RDP/VNC/Telnet to the target. Listens on port 4822. On the
+     `archnest_default` Docker network, internal IP `172.18.0.2`, DNS aliases
+     `archnest-guacd`/`guacd`. Backend env vars: `ARCHNEST_GUACD_HOST=guacd`,
+     `ARCHNEST_GUACD_PORT=4822`.
+   - Diagnostic command: `docker logs -f archnest-guacd` — shows each connection attempt,
+     the security mode negotiated, certificate validation results, and the final
+     success/refusal message from FreeRDP (the RDP client library `guacd` uses internally).
+   - Also useful: `docker exec archnest-backend env | grep ARCHNEST_GUACD`,
+     `docker inspect archnest-guacd` (to confirm network/IP), `nc -zv 192.168.122.55 3389`
+     (already confirmed reachable from racknerd).
+
+2. **Fedora VM (`192.168.122.55`)** — appears to be a libvirt VM co-located on the same
+   physical host as racknerd (it's in libvirt's default NAT range, and is reachable from
+   racknerd over a private 192.168.x address despite racknerd otherwise looking like a public
+   VPS). Running Fedora 44, GPU is a `Red Hat, Inc. Virtio 1.0 GPU (rev 01)` (confirmed via
+   `lspci`). User `sam`, password `happy2026` (test/lab credentials, not a real secret).
+   - RDP is served by **`gnome-remote-desktop`** (GNOME's built-in RDP/VNC daemon), running
+     as a **per-user systemd service**: `systemctl --user status gnome-remote-desktop`,
+     `systemctl --user restart gnome-remote-desktop`.
+   - Configured via the `grdctl` CLI: `grdctl status --show-credentials`, `grdctl rdp enable`,
+     `grdctl rdp set-credentials `, `grdctl rdp set-tls-cert/set-tls-key`,
+     `grdctl rdp disable-view-only`.
+   - Diagnostic command: `journalctl --user -u gnome-remote-desktop -f` — shows the daemon's
+     own startup/shutdown/error logs.
+   - There is a confirmed active, unlocked, real graphical session: `loginctl list-sessions`
+     showed session `51` (seat0, tty2, class `user`), and
+     `loginctl show-session 51 -p Type -p State -p Active` returned
+     `Type=wayland`, `Active=yes`, `State=active`. So gnome-remote-desktop has a real
+     Wayland session to attach to — this is NOT a "no session" problem.
+
+## What's already been fixed (confirmed working, do not re-investigate these)
+
+1. **DNS**: an earlier hostname (`fedora`) didn't resolve from the backend container —
+   resolved by using the IP `192.168.122.55` directly instead.
+2. **Self-signed cert rejection**: FreeRDP/guacd rejected the target's self-signed RDP cert
+   by default. Fixed in code — `backend/src/routes/guacamole.ts` now sets
+   `settings['ignore-cert'] = 'true'` whenever `protocol === 'rdp'`. Confirmed deployed via
+   `docker exec archnest-backend grep -A2 "ignore-cert" /app/dist/routes/guacamole.js`.
+3. **No way to override RDP security mode**: added a `security` field to the connection
+   token (`settings.security = security || 'any'`) and exposed it in the Settings UI
+   (`src/pages/Settings.tsx`, field key `security`, hint text about NLA). User has tried
+   `any`, `nla`, `tls`, and `rdp` — all fail identically (see below).
+4. **GNOME's own RDP TLS cert was corrupt**: `journalctl` showed
+   `[ERROR][com.freerdp.crypto] - [x509_utils_from_pem]: BIO_new failed for certificate` /
+   `RDP server certificate is invalid`. Fixed by regenerating the cert/key on the Fedora VM:
+   ```bash
+   openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 -subj "/CN=fedora" \
+     -keyout ~/.local/share/gnome-remote-desktop/rdp-tls.key \
+     -out ~/.local/share/gnome-remote-desktop/rdp-tls.crt
+   grdctl rdp set-tls-cert ~/.local/share/gnome-remote-desktop/rdp-tls.crt
+   grdctl rdp set-tls-key ~/.local/share/gnome-remote-desktop/rdp-tls.key
+   systemctl --user restart gnome-remote-desktop
+   ```
+   Confirmed fixed — later journal output shows no cert error on startup.
+5. **RDP sharing disabled** at the gnome-remote-desktop level (`grdctl status` showed
+   `Status: disabled` even though the daemon process was running and the port was
+   listening). Fixed via `grdctl rdp enable` +
+   `systemctl --user restart gnome-remote-desktop`.
+6. **Credentials missing / GNOME Keyring locked**: `grdctl rdp set-credentials sam happy2026`
+   failed with `Cannot create an item in a locked collection` because the keyring wasn't
+   unlocked (likely an artifact of an SSH-only login rather than a real unlocked graphical
+   login). Fixed via:
+   ```bash
+   echo -n 'your-login-password' | gnome-keyring-daemon --unlock
+   grdctl rdp set-credentials sam happy2026
+   ```
+   `grdctl status --show-credentials` now consistently shows `Unit status: active`,
+   `RDP: Status: enabled`, `Username: sam`, `Password: happy2026`.
+
+## The unresolved problem
+
+Despite all of the above being fixed and verified consistent, connecting through ArchNest
+(browser → backend → guacd → Fedora VM) still fails with:
+
+```
+Error: Server refused connection (wrong security type?)
+```
+
+This has been tried with `security` set to `any`, `nla`, `tls`, and `rdp` — **identical
+failure every time**, regardless of mode. That's suspicious: if it were a genuine security
+negotiation mismatch, different modes should fail differently (or some should succeed).
+The fact that they all fail identically suggests the real failure might be happening
+*after* security negotiation succeeds — e.g. at session-start/framebuffer-creation time —
+and FreeRDP's client-side error message is a generic/misleading bucket for "the connection
+didn't complete," not literally a security-type mismatch.
+
+### Open theory (unconfirmed)
+
+`journalctl --user -u gnome-remote-desktop` shows, on every daemon startup, EGL/Mesa/Zink
+rendering errors:
+```
+libEGL warning: failed to get driver name for fd -1
+MESA-LOADER: failed to retrieve device information
+MESA: error: ZINK: failed to choose pdev
+libEGL warning: egl: failed to create dri2 screen
+```
+There was also one observed instance of "RDP server started" immediately followed by
+"RDP server stopped" with timing consistent with an actual connection attempt. The theory
+is that gnome-remote-desktop can't create a renderable framebuffer for screen capture (no
+working GPU/software-render path) and crashes/aborts when a client actually tries to start
+a session — which a FreeRDP client then reports as "wrong security type" because that's
+the generic refusal message FreeRDP shows for several different underlying failure modes.
+
+**This theory has NOT been confirmed.** It's a leading hypothesis based on log timing
+correlation only — no one has yet proven the EGL/Mesa errors are causal vs. just noise from
+gnome-remote-desktop probing GPU paths at startup (which may be harmless/expected on a
+Virtio-GPU VM that falls back to software rendering anyway).
+
+### Diagnostic step that was in progress, never completed
+
+A direct `xfreerdp` test, bypassing guacd entirely, to isolate whether gnome-remote-desktop
+rejects ANY RDP client (not just guacd/FreeRDP-via-guacd), or whether this is specific to how
+guacd's embedded FreeRDP negotiates. `freerdp`/`xfreerdp` has now been installed on both
+machines, but the actual test was never run/reported back.
+This should be your first move:
+
+```bash
+# From racknerd (mimics guacd's exact network path: container -> VM):
+xfreerdp /v:192.168.122.55 /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only
+xfreerdp /v:192.168.122.55 /sec:nla /cert-ignore /u:sam /p:happy2026 +auth-only
+xfreerdp /v:192.168.122.55 /sec:rdp /cert-ignore /u:sam /p:happy2026 +auth-only
+
+# From the Fedora VM itself (rules out networking, tests gnome-remote-desktop alone):
+xfreerdp /v:localhost /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only
+```
+
+Run these WHILE simultaneously tailing both:
+```bash
+# on racknerd:
+docker logs -f archnest-guacd
+# on the Fedora VM:
+journalctl --user -u gnome-remote-desktop -f
+```
+
+Correlate the exact moment of failure across both logs. This is the single most valuable
+piece of evidence currently missing.
+
+## Instructions
+
+1. Get hands-on access to both `racknerd-712b73a` and the Fedora VM (`192.168.122.55`).
+2. Run the `xfreerdp` direct tests above, with both logs tailing simultaneously, and read
+   the actual FreeRDP client-side error output (not just "wrong security type" — xfreerdp's
+   raw stderr/exit code will usually have more detail than what bubbles up through
+   guacd/Guacamole's client to the ArchNest UI).
+3. If `xfreerdp` succeeds where ArchNest's guac connection fails, the bug is in how
+   `backend/src/routes/guacamole.ts` builds the connection settings/token, or in the
+   `guacamole-lite`/`guacd` version compatibility — debug from there, comparing exactly what
+   settings xfreerdp used successfully vs. what ArchNest sends.
+4. If `xfreerdp` *also* fails identically, the problem is squarely on the
+   gnome-remote-desktop / Fedora VM side.
+   Investigate the EGL/Mesa/Zink rendering theory
+   directly — check whether software rendering (llvmpipe) is available
+   (`glxinfo -B` from an actual Wayland session, not an SSH shell — note: an earlier attempt
+   from an SSH shell failed with `Error: unable to open display`, which is expected and not
+   informative; you need to run it from within session 51 or equivalent), and whether the
+   VM's libvirt XML has virtio-gpu with working 3D/virgl acceleration configured on the
+   hypervisor side.
+5. If gnome-remote-desktop turns out to be fundamentally unable to serve a real client
+   (vs. screen-sharing GNOME's own "Remote Login" feature, which is its primary intended use
+   case), consider recommending **xrdp** as a replacement RDP server on the Fedora VM, and
+   note this in your report as a general product recommendation (since other ArchNest users
+   may hit the same gnome-remote-desktop limitation).
+6. Keep ArchNest's product goal in mind throughout: any fix that's specific to *this* user's
+   VM is fine for unblocking them, but if you find a root cause that's likely to recur for
+   other users (e.g. a guacd config default, a missing Settings field, a code bug in
+   `backend/src/routes/guacamole.ts`), make the corresponding code/config fix in this repo,
+   not just a one-off operational fix on this VM.
+
+## What to report back when done
+
+Write a concise report (for the engineer/AI who handed this off) covering:
+- The root cause, with the specific log lines/evidence that proved it (not just a theory).
+- The exact fix applied, including any commands run on either machine and any code changes
+  made in this repo (with file paths and diffs).
+- Whether the fix is specific to this VM or represents a general product issue that other
+  ArchNest users could hit — and if general, what was changed in the codebase to address it.
+- Current working/non-working status of the connection after the fix, with the actual test
+  performed to confirm it works end-to-end through ArchNest's UI (not just via direct
+  `xfreerdp`).
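
Reviewer note (not part of the patch): the handoff doc asks for the direct `xfreerdp` tests to be correlated against both logs, and for the report to cite the log lines that proved the root cause. A minimal sketch of one way to collect that evidence follows. It wraps the same three `+auth-only` attempts from the doc with UTC start/end markers and exit codes. `RDP_CLIENT`, `utc_now` and `run_rdp_matrix` are illustrative names introduced here, not anything in the ArchNest repo, and the sketch assumes the client binary accepts the flags exactly as the doc writes them.

```shell
#!/usr/bin/env bash
# Sketch only: run one +auth-only attempt per RDP security mode and print
# UTC START/END markers with the client's exit code, so each attempt can be
# matched against timestamped guacd and gnome-remote-desktop logs.
# RDP_CLIENT, utc_now and run_rdp_matrix are hypothetical helper names.

# Override RDP_CLIENT if the FreeRDP binary has a different name on a host.
RDP_CLIENT="${RDP_CLIENT:-xfreerdp}"

utc_now() {
  # Second resolution only, to stay portable across date implementations.
  date -u +%Y-%m-%dT%H:%M:%SZ
}

run_rdp_matrix() {
  local host="$1" user="$2" pass="$3" mode rc
  for mode in tls nla rdp; do
    printf '%s START host=%s sec=%s\n' "$(utc_now)" "$host" "$mode"
    rc=0
    # Same flags as the manual tests in the doc. Client output is sent to
    # stderr so the START/END markers on stdout stay easy to grep.
    "$RDP_CLIENT" "/v:$host" "/sec:$mode" /cert-ignore \
      "/u:$user" "/p:$pass" +auth-only 1>&2 || rc=$?
    printf '%s END host=%s sec=%s exit=%s\n' "$(utc_now)" "$host" "$mode" "$rc"
  done
}
```

A possible invocation is `run_rdp_matrix 192.168.122.55 sam happy2026 2>xfreerdp-stderr.log | tee rdp-matrix.log`. To get comparable timestamps on the other side, `docker logs -t archnest-guacd` and `journalctl --user -u gnome-remote-desktop --utc -o short-iso-precise` both prefix each line with a timestamp. Whether the three modes produce different exit codes or stderr text is exactly the open question, so no expected output is shown.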