RDP Connection Debugging — Handoff Doc

Goal

ArchNest is a self-hosted dashboard product. One of its integrations is a "Remote Desktop" connection type that proxies RDP/VNC/Telnet sessions through guacd (Apache Guacamole's proxy daemon) so users can open a remote desktop session in the browser. This needs to work reliably for any user's RDP server, not just this one. The immediate goal is to get this specific connection working, but treat every root cause you find as a potential general fix (config option, docs, code change), since other users will run the same RDP servers (gnome-remote-desktop, xrdp, Windows RDP, etc.).

You have hands-on access to both machines involved. Use that — actively connect to both, run diagnostics on both sides simultaneously, and correlate logs/timestamps. Do not guess from one side alone; multiple times in this debugging session, a theory formed from only one machine's logs turned out to be wrong once the other machine's logs were checked.

The two machines

  1. racknerd-712b73a — the VPS running the ArchNest stack (this repo) in Docker.

    • Container archnest-backend — the Node/Fastify backend. Route of interest: backend/src/routes/guacamole.ts — bridges a browser WebSocket to guacd using guacamole-lite's ClientConnection/Crypt classes. Builds a Guacamole connection token (protocol, hostname, port, username, password, domain, security, ignore-cert) and hands it to guacd.
    • Container archnest-guacd — Apache Guacamole's guacd (v1.5.5), the proxy daemon that actually speaks RDP/VNC/Telnet to the target. Listens on port 4822. On the archnest_default Docker network, internal IP 172.18.0.2, DNS aliases archnest-guacd/guacd. Backend env vars: ARCHNEST_GUACD_HOST=guacd, ARCHNEST_GUACD_PORT=4822.
    • Diagnostic command: docker logs -f archnest-guacd — shows each connection attempt, the security mode negotiated, certificate validation results, and the final success/refusal message from FreeRDP (the RDP client library guacd uses internally).
    • Also useful: docker exec archnest-backend env | grep ARCHNEST_GUACD, docker inspect archnest-guacd (to confirm network/IP), nc -zv 192.168.122.55 3389 (already confirmed reachable from racknerd).
  2. Fedora VM (192.168.122.55) — appears to be a libvirt VM co-located on the same physical host as racknerd (it's in libvirt's default NAT range, and is reachable from racknerd over a private 192.168.x address despite racknerd otherwise looking like a public VPS). Running Fedora 44, GPU is a Red Hat, Inc. Virtio 1.0 GPU (rev 01) (confirmed via lspci). User sam, password happy2026 (test/lab credentials, not a real secret).

    • RDP is served by gnome-remote-desktop (GNOME's built-in RDP/VNC daemon), running as a per-user systemd service: systemctl --user status gnome-remote-desktop, systemctl --user restart gnome-remote-desktop.
    • Configured via the grdctl CLI: grdctl status --show-credentials, grdctl rdp enable, grdctl rdp set-credentials <user> <pass>, grdctl rdp set-tls-cert/set-tls-key, grdctl rdp disable-view-only.
    • Diagnostic command: journalctl --user -u gnome-remote-desktop -f — shows the daemon's own startup/shutdown/error logs.
    • There is a confirmed active, unlocked, real graphical session: loginctl list-sessions showed session 51 (seat0, tty2, class user), and loginctl show-session 51 -p Type -p State -p Active returned Type=wayland, Active=yes, State=active. So gnome-remote-desktop has a real Wayland session to attach to — this is NOT a "no session" problem.

What's already been fixed (confirmed working, do not re-investigate these)

  1. DNS: an earlier hostname (fedora) didn't resolve from the backend container — resolved by using the IP 192.168.122.55 directly instead.
  2. Self-signed cert rejection: FreeRDP/guacd rejected the target's self-signed RDP cert by default. Fixed in code — backend/src/routes/guacamole.ts now sets settings['ignore-cert'] = 'true' whenever protocol === 'rdp'. Confirmed deployed via docker exec archnest-backend grep -A2 "ignore-cert" /app/dist/routes/guacamole.js.
  3. No way to override RDP security mode: added a security field to the connection token (settings.security = security || 'any') and exposed it in the Settings UI (src/pages/Settings.tsx, field key security, hint text about NLA). User has tried any, nla, tls, and rdp — all fail identically (see below).
  4. GNOME's own RDP TLS cert was corrupt: journalctl showed [ERROR][com.freerdp.crypto] - [x509_utils_from_pem]: BIO_new failed for certificate / RDP server certificate is invalid. Fixed by regenerating the cert/key on the Fedora VM:
    openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 -subj "/CN=fedora" \
      -keyout ~/.local/share/gnome-remote-desktop/rdp-tls.key \
      -out ~/.local/share/gnome-remote-desktop/rdp-tls.crt
    grdctl rdp set-tls-cert ~/.local/share/gnome-remote-desktop/rdp-tls.crt
    grdctl rdp set-tls-key ~/.local/share/gnome-remote-desktop/rdp-tls.key
    systemctl --user restart gnome-remote-desktop
    
    Confirmed fixed — later journal output shows no cert error on startup.
  5. RDP sharing disabled at the gnome-remote-desktop level (grdctl status showed Status: disabled even though the daemon process was running and the port was listening). Fixed via grdctl rdp enable + systemctl --user restart gnome-remote-desktop.
  6. Credentials missing / GNOME Keyring locked: grdctl rdp set-credentials sam happy2026 failed with Cannot create an item in a locked collection because the keyring wasn't unlocked (likely an artifact of an SSH-only login rather than a real unlocked graphical login). Fixed via:
    echo -n 'your-login-password' | gnome-keyring-daemon --unlock
    grdctl rdp set-credentials sam happy2026
    
    grdctl status --show-credentials now consistently shows Unit status: active, RDP: Status: enabled, Username: sam, Password: happy2026.
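
The end state described in items 5 and 6 can be re-checked in one step before each test run. This is a regression check, not a re-investigation. A minimal sketch, assuming grdctl status prints an "RDP:" section header followed by "Status: ..." and a "Unit status: ..." line, using the labels quoted above; the exact layout is an assumption and the function name grd_rdp_ready is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `grdctl status` output on stdin and succeeds only
# if the unit is active AND the RDP section reports "Status: enabled".
# Assumption: sections start with a bare "Name:" line (e.g. "RDP:").
grd_rdp_ready() {
  awk '
    /Unit status:[[:space:]]*active/        { unit_ok = 1 }
    /^[[:space:]]*[A-Za-z]+:[[:space:]]*$/  { in_rdp = 0 }  # any section header ends the RDP section
    /^[[:space:]]*RDP:/                     { in_rdp = 1 }  # ...unless it is the RDP header itself
    in_rdp && /Status:[[:space:]]*enabled/  { rdp_ok = 1 }
    END { exit !(unit_ok && rdp_ok) }
  '
}
```

On the Fedora VM this would be used as: grdctl status | grd_rdp_ready && echo "grd ok". If the real output layout differs, adjust the patterns rather than trusting a false negative.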

The unresolved problem

Despite all of the above being fixed and verified consistent, connecting through ArchNest (browser → backend → guacd → Fedora VM) still fails with:

Error: Server refused connection (wrong security type?)

This has been tried with security set to any, nla, tls, and rdp: the failure is identical every time, regardless of mode. That's suspicious: if it were a genuine security negotiation mismatch, different modes should fail differently (or some should succeed). The fact that they all fail identically suggests the real failure might be happening after security negotiation succeeds (e.g. at session-start/framebuffer-creation time), and that FreeRDP's client-side error message is a generic/misleading bucket for "the connection didn't complete," not literally a security-type mismatch.

Open theory (unconfirmed)

journalctl --user -u gnome-remote-desktop shows, on every daemon startup, EGL/Mesa/Zink rendering errors:

libEGL warning: failed to get driver name for fd -1
MESA-LOADER: failed to retrieve device information
MESA: error: ZINK: failed to choose pdev
libEGL warning: egl: failed to create dri2 screen

There was also one observed instance of "RDP server started" immediately followed by "RDP server stopped" with timing consistent with an actual connection attempt. The theory is that gnome-remote-desktop can't create a renderable framebuffer for screen capture (no working GPU/software-render path) and crashes/aborts when a client actually tries to start a session — which a FreeRDP client then reports as "wrong security type" because that's the generic refusal message FreeRDP shows for several different underlying failure modes.

This theory has NOT been confirmed. It's a leading hypothesis based on log timing correlation only — no one has yet proven the EGL/Mesa errors are causal vs. just noise from gnome-remote-desktop probing GPU paths at startup (which may be harmless/expected on a Virtio-GPU VM that falls back to software rendering anyway).
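
The timing half of this theory can be checked mechanically: flag every "RDP server started" that is followed by "RDP server stopped" within a few seconds, then compare those times against connection attempts. A minimal sketch, assuming the journal is read with journalctl --user -u gnome-remote-desktop -o short-unix (which prefixes each line with an epoch timestamp) and that the two message strings appear exactly as quoted above; the function name and the 5-second default window are assumptions:

```bash
#!/usr/bin/env bash
# Reads short-unix journal lines on stdin. Prints one line for each
# "started" -> "stopped" pair that is at most $1 seconds apart (default 5).
flag_quick_stops() {
  awk -v window="${1:-5}" '
    /RDP server started/ { started = $1; have_start = 1; next }
    /RDP server stopped/ && have_start {
      gap = $1 - started
      if (gap <= window)
        printf "quick stop: started=%s stopped=%s gap=%.3fs\n", started, $1, gap
      have_start = 0
    }
  '
}
```

Usage on the Fedora VM: journalctl --user -u gnome-remote-desktop -o short-unix | flag_quick_stops 5. A hit only shows that the daemon stopped shortly after starting; it does not by itself show that the EGL/Mesa errors caused it.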

Diagnostic step that was in progress, never completed

A direct xfreerdp test, bypassing guacd entirely, to isolate whether gnome-remote-desktop rejects ANY RDP client (not just guacd/FreeRDP-via-guacd), or whether this is specific to how guacd's embedded FreeRDP negotiates. freerdp/xfreerdp has now been installed on both machines, but the actual test was never run/reported back. This should be your first move:

# From racknerd (approximates guacd's network path; note guacd itself runs in a container on this host):
xfreerdp /v:192.168.122.55 /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only
xfreerdp /v:192.168.122.55 /sec:nla /cert-ignore /u:sam /p:happy2026 +auth-only
xfreerdp /v:192.168.122.55 /sec:rdp /cert-ignore /u:sam /p:happy2026 +auth-only

# From the Fedora VM itself (rules out networking, tests gnome-remote-desktop alone):
xfreerdp /v:localhost /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only

Run these WHILE simultaneously tailing both:

# on racknerd:
docker logs -f archnest-guacd
# on the Fedora VM:
journalctl --user -u gnome-remote-desktop -f

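Correlate the exact moment of failure across both logs. This is the single most valuable piece of evidence currently missing. Note that +auth-only is meant to stop once authentication completes, so these commands do not exercise session start. If they all succeed, the session-start theory above is still open, and a full connection without +auth-only is the next test.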
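
Correlating by eye across two terminals is error-prone, so it can help to capture both logs to files with timestamps and merge them into one timeline. A minimal sketch, assuming the guacd log was saved with docker logs -t archnest-guacd (RFC 3339 UTC timestamps) and the journal with journalctl --user -u gnome-remote-desktop --utc -o short-iso-precise, and assuming the two machines' clocks agree closely enough to order events; the function name is hypothetical:

```bash
#!/usr/bin/env bash
# Tags each line with its source and sorts the union by the leading timestamp.
# Only the first 26 characters (YYYY-MM-DDTHH:MM:SS.ffffff) are used as the key,
# because the two tools differ in sub-microsecond digits and zone suffix.
# Lines that do not start with a timestamp (e.g. journal banners) will sort oddly.
merge_by_time() {
  local guacd_log="$1" grd_log="$2"
  {
    awk '{ print substr($1, 1, 26) "\tguacd\t" $0 }' "$guacd_log"
    awk '{ print substr($1, 1, 26) "\tgrd\t"   $0 }' "$grd_log"
  } | LC_ALL=C sort -s -t $'\t' -k1,1
}
```

Example: merge_by_time guacd.log grd.log | less. The output columns are sort key, source tag, and the original line, so the last guacd line before FreeRDP's refusal can be read next to whatever gnome-remote-desktop logged at that moment.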

Instructions

  1. Get hands-on access to both racknerd-712b73a and the Fedora VM (192.168.122.55).
  2. Run the xfreerdp direct tests above, with both logs tailing simultaneously, and read the actual FreeRDP client-side error output (not just "wrong security type" — xfreerdp's raw stderr/exit code will usually have more detail than what bubbles up through guacd/Guacamole's client to the ArchNest UI).
  3. If xfreerdp succeeds where ArchNest's guac connection fails, the bug is in how backend/src/routes/guacamole.ts builds the connection settings/token, or in the guacamole-lite/guacd version compatibility — debug from there, comparing exactly what settings xfreerdp used successfully vs. what ArchNest sends.
  4. If xfreerdp also fails identically, the problem is squarely on the gnome-remote-desktop / Fedora VM side. Investigate the EGL/Mesa/Zink rendering theory directly — check whether software rendering (llvmpipe) is available (glxinfo -B from an actual Wayland session, not an SSH shell — note: an earlier attempt from an SSH shell failed with Error: unable to open display, which is expected and not informative; you need to run it from within session 51 or equivalent), and whether the VM's libvirt XML has virtio-gpu with working 3D/virgl acceleration configured on the hypervisor side.
  5. If gnome-remote-desktop turns out to be fundamentally unable to serve a real client (vs. screen-sharing GNOME's own "Remote Login" feature, which is its primary intended use case), consider recommending xrdp as a replacement RDP server on the Fedora VM, and note this in your report as a general product recommendation (since other ArchNest users may hit the same gnome-remote-desktop limitation).
  6. Keep ArchNest's product goal in mind throughout: any fix that's specific to this user's VM is fine for unblocking them, but if you find a root cause that's likely to recur for other users (e.g. a guacd config default, a missing Settings field, a code bug in backend/src/routes/guacamole.ts), make the corresponding code/config fix in this repo, not just a one-off operational fix on this VM.
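
For the software-rendering check in step 4, the relevant line of glxinfo -B output is "OpenGL renderer string". A minimal sketch that classifies it, assuming Mesa's software rasterizers identify themselves as llvmpipe or softpipe in that string; the function name and the three output labels are hypothetical:

```bash
#!/usr/bin/env bash
# Reads `glxinfo -B` output on stdin and prints one of:
#   software  - renderer string names llvmpipe or softpipe
#   hardware  - a renderer string is present but names something else
#   unknown   - no renderer string found (e.g. glxinfo could not open a display)
classify_renderer() {
  local line
  line="$(grep -i -m1 'OpenGL renderer string' || true)"
  case "$line" in
    '')                     echo unknown ;;
    *llvmpipe*|*softpipe*)  echo software ;;
    *)                      echo hardware ;;
  esac
}
```

Run it from inside the graphical session (not an SSH shell): glxinfo -B | classify_renderer. A result of "unknown" from a real session would itself be worth recording, since it means no GL renderer was reported at all.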

What to report back when done

Write a concise report (for the engineer/AI who handed this off) covering:

  • The root cause, with the specific log lines/evidence that proved it (not just a theory).
  • The exact fix applied, including any commands run on either machine and any code changes made in this repo (with file paths and diffs).
  • Whether the fix is specific to this VM or represents a general product issue that other ArchNest users could hit — and if general, what was changed in the codebase to address it.
  • Current working/non-working status of the connection after the fix, with the actual test performed to confirm it works end-to-end through ArchNest's UI (not just via direct xfreerdp).