Add RDP debugging handoff doc for next investigator
parent 9578820bbd
commit 74f8b9f705
1 changed file with 203 additions and 0 deletions
docs/rdp-debug-handoff.md
@@ -0,0 +1,203 @@

# RDP Connection Debugging — Handoff Doc

## Goal

ArchNest is a self-hosted dashboard product. One of its integrations is a "Remote Desktop"
connection type that proxies RDP/VNC/Telnet sessions through `guacd` (Apache Guacamole's
proxy daemon) so users can open a remote desktop session in the browser. This needs to work
reliably for *any* user's RDP server, not just this one. The immediate goal is to get
this specific connection working, but treat every root cause found as a potential general
fix (config option, docs, code change), since other users will hit the same servers
(gnome-remote-desktop, xrdp, Windows RDP, etc.).

**You have hands-on access to both machines involved.** Use it: actively connect to both,
run diagnostics on both sides simultaneously, and correlate logs/timestamps. Do not guess from
one side alone; several times in this debugging session, a theory formed from only one
machine's logs turned out to be wrong once the other machine's logs were checked.

## The two machines

1. **`racknerd-712b73a`**: the VPS running the ArchNest stack (this repo) in Docker.
   - Container `archnest-backend`: the Node/Fastify backend. Route of interest:
     `backend/src/routes/guacamole.ts`, which bridges a browser WebSocket to `guacd` using
     `guacamole-lite`'s `ClientConnection`/`Crypt` classes. It builds a Guacamole connection
     token (protocol, hostname, port, username, password, domain, security, ignore-cert)
     and hands it to `guacd`.
   - Container `archnest-guacd`: Apache Guacamole's `guacd` (v1.5.5), the proxy daemon that
     actually speaks RDP/VNC/Telnet to the target. Listens on port 4822. On the
     `archnest_default` Docker network, internal IP `172.18.0.2`, DNS aliases
     `archnest-guacd`/`guacd`. Backend env vars: `ARCHNEST_GUACD_HOST=guacd`,
     `ARCHNEST_GUACD_PORT=4822`.
   - Diagnostic command: `docker logs -f archnest-guacd`. It shows each connection attempt,
     the security mode negotiated, certificate validation results, and the final
     success/refusal message from FreeRDP (the RDP client library `guacd` uses internally).
   - Also useful: `docker exec archnest-backend env | grep ARCHNEST_GUACD`,
     `docker inspect archnest-guacd` (to confirm network/IP), `nc -zv 192.168.122.55 3389`
     (already confirmed reachable from racknerd).

2. **Fedora VM (`192.168.122.55`)**: appears to be a libvirt VM co-located on the same
   physical host as racknerd (it's in libvirt's default NAT range, and is reachable from
   racknerd over a private 192.168.x address despite racknerd otherwise looking like a public
   VPS). Running Fedora 44; the GPU is a `Red Hat, Inc. Virtio 1.0 GPU (rev 01)` (confirmed via
   `lspci`). User `sam`, password `happy2026` (test/lab credentials, not a real secret).
   - RDP is served by **`gnome-remote-desktop`** (GNOME's built-in RDP/VNC daemon), running
     as a **per-user systemd service**: `systemctl --user status gnome-remote-desktop`,
     `systemctl --user restart gnome-remote-desktop`.
   - Configured via the `grdctl` CLI: `grdctl status --show-credentials`, `grdctl rdp enable`,
     `grdctl rdp set-credentials <user> <pass>`, `grdctl rdp set-tls-cert/set-tls-key`,
     `grdctl rdp disable-view-only`.
   - Diagnostic command: `journalctl --user -u gnome-remote-desktop -f`, which shows the
     daemon's own startup/shutdown/error logs.
   - There is a confirmed active, unlocked, real graphical session: `loginctl list-sessions`
     showed session `51` (seat0, tty2, class `user`), and
     `loginctl show-session 51 -p Type -p State -p Active` returned
     `Type=wayland`, `Active=yes`, `State=active`. So gnome-remote-desktop has a real
     Wayland session to attach to; this is NOT a "no session" problem.
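
The `nc -zv 192.168.122.55 3389` reachability check from item 1 can be approximated on a box where `nc` is not installed. The sketch below is a hypothetical helper (the names `port_open` and `report_port` are not part of ArchNest) and assumes `bash` and coreutils `timeout` are available where it runs, which may not hold inside minimal container images.

```bash
#!/usr/bin/env bash
# Hypothetical helper: TCP reachability check using bash's /dev/tcp redirection.
# This is a bash feature, so it needs bash rather than plain sh.

# port_open HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
port_open() {
  local host=$1 port=$2 secs=${3:-3}
  # $0 and $1 inside the child shell are the host and port passed after the script text.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# report_port HOST PORT
# Prints a one-line verdict suitable for pasting into notes.
report_port() {
  if port_open "$1" "$2"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 NOT reachable"
  fi
}
```

For example, `report_port 192.168.122.55 3389` on racknerd checks the RDP port, and `report_port 172.18.0.2 4822` checks guacd on the Docker network. A successful connect only proves the port accepts TCP connections, not that the RDP handshake works.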

## What's already been fixed (confirmed working, do not re-investigate these)

1. **DNS**: an earlier hostname (`fedora`) didn't resolve from the backend container.
   Worked around by using the IP `192.168.122.55` directly instead.
2. **Self-signed cert rejection**: FreeRDP/guacd rejected the target's self-signed RDP cert
   by default. Fixed in code: `backend/src/routes/guacamole.ts` now sets
   `settings['ignore-cert'] = 'true'` whenever `protocol === 'rdp'`. Confirmed deployed via
   `docker exec archnest-backend grep -A2 "ignore-cert" /app/dist/routes/guacamole.js`.
3. **No way to override RDP security mode**: added a `security` field to the connection
   token (`settings.security = security || 'any'`) and exposed it in the Settings UI
   (`src/pages/Settings.tsx`, field key `security`, hint text about NLA). User has tried
   `any`, `nla`, `tls`, and `rdp`; all fail identically (see below).
4. **GNOME's own RDP TLS cert was corrupt**: `journalctl` showed
   `[ERROR][com.freerdp.crypto] - [x509_utils_from_pem]: BIO_new failed for certificate` /
   `RDP server certificate is invalid`. Fixed by regenerating the cert/key on the Fedora VM:
   ```bash
   # Generate a fresh self-signed cert/key pair, point gnome-remote-desktop at it, restart.
   openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 -subj "/CN=fedora" \
     -keyout ~/.local/share/gnome-remote-desktop/rdp-tls.key \
     -out ~/.local/share/gnome-remote-desktop/rdp-tls.crt
   grdctl rdp set-tls-cert ~/.local/share/gnome-remote-desktop/rdp-tls.crt
   grdctl rdp set-tls-key ~/.local/share/gnome-remote-desktop/rdp-tls.key
   systemctl --user restart gnome-remote-desktop
   ```
   Confirmed fixed: later journal output shows no cert error on startup.
5. **RDP sharing disabled** at the gnome-remote-desktop level (`grdctl status` showed
   `Status: disabled` even though the daemon process was running and the port was
   listening). Fixed via `grdctl rdp enable` +
   `systemctl --user restart gnome-remote-desktop`.
6. **Credentials missing / GNOME Keyring locked**: `grdctl rdp set-credentials sam happy2026`
   failed with `Cannot create an item in a locked collection` because the keyring wasn't
   unlocked (likely an artifact of an SSH-only login rather than a real unlocked graphical
   login). Fixed via:
   ```bash
   # Unlock the keyring (substitute the real login password), then store the RDP credentials.
   echo -n 'your-login-password' | gnome-keyring-daemon --unlock
   grdctl rdp set-credentials sam happy2026
   ```
   `grdctl status --show-credentials` now consistently shows `Unit status: active`,
   `RDP: Status: enabled`, `Username: sam`, `Password: happy2026`.
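
Since item 4 was a corrupt certificate, it may be worth re-checking that the regenerated cert and key still parse and belong together before ruling TLS out again. The following is a minimal sketch, assuming `openssl` is installed; `cert_matches_key` is a hypothetical helper name, not something provided by `grdctl` or ArchNest.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of ArchNest or grdctl.

# cert_matches_key CERT KEY
# Returns 0 only if both files parse and the certificate's public key equals
# the public key derived from the private key. Returns 1 otherwise.
cert_matches_key() {
  local cert=$1 key=$2 cert_pub key_pub
  # Extract the public key embedded in the certificate (PEM form).
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null) || return 1
  # Derive the public key from the private key (PEM form).
  key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || return 1
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}
```

On the Fedora VM this could be run as `cert_matches_key ~/.local/share/gnome-remote-desktop/rdp-tls.crt ~/.local/share/gnome-remote-desktop/rdp-tls.key && echo OK`. A non-zero result means either a file failed to parse or the pair does not match; it does not say which.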

## The unresolved problem

Despite all of the above being fixed and verified consistent, connecting through ArchNest
(browser → backend → guacd → Fedora VM) still fails with:

```
Error: Server refused connection (wrong security type?)
```

This has been tried with `security` set to `any`, `nla`, `tls`, and `rdp`: **identical
failure every time**, regardless of mode. That's suspicious: if it were a genuine security
negotiation mismatch, different modes should fail differently (or some should succeed).
The fact that they all fail identically suggests the real failure might be happening
*after* security negotiation succeeds (e.g. at session-start/framebuffer-creation time),
and that FreeRDP's client-side error message is a generic/misleading bucket for "the connection
didn't complete," not literally a security-type mismatch.
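
The "identical across modes" observation can be made mechanical once outcomes are written down per mode. The sketch below assumes a hand-recorded input of `mode<TAB>outcome` lines (a format invented here for illustration) and reports whether every mode produced the same outcome text; `summarize_outcomes` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: summarise per-mode outcomes given as "mode<TAB>outcome" lines on stdin.

# summarize_outcomes
# Prints "identical" if every mode produced the same outcome text,
# "varies" if at least two outcomes differ, and "empty" if there was no input.
summarize_outcomes() {
  local distinct
  # Drop the mode column, de-duplicate the outcome text, count what is left.
  # "|| true" keeps grep's exit status 1 (zero matches) from aborting under set -e.
  distinct=$(cut -f2- | sort -u | grep -c . || true)
  case "$distinct" in
    0) echo "empty" ;;
    1) echo "identical" ;;
    *) echo "varies" ;;
  esac
}
```

An "identical" result across `any`, `nla`, `tls`, and `rdp` is consistent with the reasoning above, but it is only circumstantial: it does not by itself show where in the handshake the failure occurs.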

### Open theory (unconfirmed)

`journalctl --user -u gnome-remote-desktop` shows, on every daemon startup, EGL/Mesa/Zink
rendering errors:
```
libEGL warning: failed to get driver name for fd -1
MESA-LOADER: failed to retrieve device information
MESA: error: ZINK: failed to choose pdev
libEGL warning: egl: failed to create dri2 screen
```
There was also one observed instance of "RDP server started" immediately followed by
"RDP server stopped" with timing consistent with an actual connection attempt. The theory
is that gnome-remote-desktop can't create a renderable framebuffer for screen capture (no
working GPU/software-render path) and crashes/aborts when a client actually tries to start
a session, which a FreeRDP client then reports as "wrong security type" because that's
the generic refusal message FreeRDP shows for several different underlying failure modes.

**This theory has NOT been confirmed.** It's a leading hypothesis based on log timing
correlation only. No one has yet proven that the EGL/Mesa errors are causal rather than just
noise from gnome-remote-desktop probing GPU paths at startup (which may be harmless/expected
on a Virtio-GPU VM that falls back to software rendering anyway).

### Diagnostic step that was in progress, never completed

A direct `xfreerdp` test, bypassing guacd entirely, to isolate whether gnome-remote-desktop
rejects ANY RDP client (not just guacd/FreeRDP-via-guacd), or whether this is specific to how
guacd's embedded FreeRDP negotiates. `freerdp`/`xfreerdp` has now been installed on both
machines, but the actual test was never run/reported back. This should be your first move:

```bash
# From racknerd (approximates guacd's network path to the VM; note this runs on
# the host, not inside the archnest-guacd container):
xfreerdp /v:192.168.122.55 /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only
xfreerdp /v:192.168.122.55 /sec:nla /cert-ignore /u:sam /p:happy2026 +auth-only
xfreerdp /v:192.168.122.55 /sec:rdp /cert-ignore /u:sam /p:happy2026 +auth-only

# From the Fedora VM itself (rules out networking, tests gnome-remote-desktop alone):
xfreerdp /v:localhost /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only
```

`+auth-only` is meant to disconnect once authentication completes, so a successful run does
not exercise session start. If these succeed, repeat without `+auth-only`, since the open
theory places the failure at session start.

Run these WHILE simultaneously tailing both:
```bash
# on racknerd (only records attempts that go through ArchNest/guacd, so also
# reproduce the failure from the ArchNest UI to get a matching entry here):
docker logs -f archnest-guacd
# on the Fedora VM:
journalctl --user -u gnome-remote-desktop -f
```

Correlate the exact moment of failure across both logs. This is the single most valuable
piece of evidence currently missing.
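
To make the correlation easier, the direct tests above can be wrapped so each attempt records a UTC start/end time and the client's exit code. This is a sketch under assumptions: `probe_modes` and the `RDP_CLIENT` override variable are hypothetical names introduced here, the flags are copied from the commands above (newer FreeRDP builds may prefer `/cert:ignore` over `/cert-ignore`), and the password is passed on the command line, which is only acceptable for lab credentials.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the xfreerdp invocations shown above.
# RDP_CLIENT is an assumed override hook; it defaults to xfreerdp.

# probe_modes HOST USER PASS [MODE...]
# Prints one tab-separated line per mode: start (UTC), end (UTC), mode, exit code.
# The client's stdout/stderr are saved to probe-<mode>.log in the current directory.
probe_modes() {
  local host=$1 user=$2 pass=$3
  shift 3
  local client=${RDP_CLIENT:-xfreerdp} mode start end rc
  # Default to the three modes used in the manual commands.
  [ "$#" -gt 0 ] || set -- tls nla rdp
  for mode in "$@"; do
    start=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    rc=0
    "$client" "/v:$host" "/sec:$mode" /cert-ignore "/u:$user" "/p:$pass" +auth-only \
      >"probe-$mode.log" 2>&1 || rc=$?
    end=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    printf '%s\t%s\t%s\t%s\n' "$start" "$end" "$mode" "$rc"
  done
}
```

Running `probe_modes 192.168.122.55 sam happy2026` on racknerd yields a small table whose time window can be looked up in the gnome-remote-desktop journal on the VM. This assumes both machines' clocks agree closely; if they do not, note the offset before comparing timestamps.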

## Instructions

1. Get hands-on access to both `racknerd-712b73a` and the Fedora VM (`192.168.122.55`).
2. Run the `xfreerdp` direct tests above, with both logs tailing simultaneously, and read
   the actual FreeRDP client-side error output (not just "wrong security type"; xfreerdp's
   raw stderr/exit code will usually have more detail than what bubbles up through
   guacd/Guacamole's client to the ArchNest UI).
3. If `xfreerdp` succeeds where ArchNest's guac connection fails, the bug is in how
   `backend/src/routes/guacamole.ts` builds the connection settings/token, or in the
   `guacamole-lite`/`guacd` version compatibility. Debug from there, comparing exactly what
   settings xfreerdp used successfully vs. what ArchNest sends.
4. If `xfreerdp` *also* fails identically, the problem is squarely on the
   gnome-remote-desktop / Fedora VM side. Investigate the EGL/Mesa/Zink rendering theory
   directly. Check whether software rendering (llvmpipe) is available by running
   `glxinfo -B` from within an actual Wayland session (session 51 or equivalent), not an SSH
   shell: an earlier attempt from an SSH shell failed with `Error: unable to open display`,
   which is expected and not informative. Also check whether the VM's libvirt XML has
   virtio-gpu with working 3D/virgl acceleration configured on the hypervisor side.
5. If gnome-remote-desktop turns out to be fundamentally unable to serve a real client
   (as opposed to serving GNOME's own screen-sharing / "Remote Login" feature, which is its
   primary intended use case), consider recommending **xrdp** as a replacement RDP server on
   the Fedora VM, and note this in your report as a general product recommendation (since
   other ArchNest users may hit the same gnome-remote-desktop limitation).
6. Keep ArchNest's product goal in mind throughout: any fix that's specific to *this* user's
   VM is fine for unblocking them, but if you find a root cause that's likely to recur for
   other users (e.g. a guacd config default, a missing Settings field, a code bug in
   `backend/src/routes/guacamole.ts`), make the corresponding code/config fix in this repo,
   not just a one-off operational fix on this VM.
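
For the software-rendering check in step 4, the relevant part of `glxinfo -B` is its `OpenGL renderer string` line. The sketch below classifies that output from stdin; `classify_renderer` is a hypothetical helper, and treating `llvmpipe` and `softpipe` as the software rasterizer names is an assumption about what Mesa reports on this VM.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify `glxinfo -B` output read from stdin.

# classify_renderer
# Prints "software" when the OpenGL renderer string names a Mesa software
# rasterizer (llvmpipe or softpipe), "hardware-or-other" for any other renderer
# string, and "unknown" when no renderer line is present at all.
classify_renderer() {
  local line
  # Take the first renderer line; "|| true" tolerates grep finding nothing.
  line=$(grep -i -m1 'OpenGL renderer string' || true)
  if [ -z "$line" ]; then
    echo "unknown"
  elif printf '%s\n' "$line" | grep -qiE 'llvmpipe|softpipe'; then
    echo "software"
  else
    echo "hardware-or-other"
  fi
}
```

From a terminal inside the graphical session, `glxinfo -B 2>&1 | classify_renderer` gives a one-word answer. An "unknown" result from an SSH shell is expected, for the same reason as the `unable to open display` error noted in step 4.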

## What to report back when done

Write a concise report (for the engineer/AI who handed this off) covering:
- The root cause, with the specific log lines/evidence that proved it (not just a theory).
- The exact fix applied, including any commands run on either machine and any code changes
  made in this repo (with file paths and diffs).
- Whether the fix is specific to this VM or represents a general product issue that other
  ArchNest users could hit, and if general, what was changed in the codebase to address it.
- Current working/non-working status of the connection after the fix, with the actual test
  performed to confirm it works end-to-end through ArchNest's UI (not just via direct
  `xfreerdp`).