dev_arc_aws/docs/rdp-debug-handoff.md
Samuel James d1fefd3a63
Resolve RDP debugging: root cause + xrdp fix for gnome-remote-desktop (#41)
The "Server refused connection (wrong security type?)" failure was root-caused
end-to-end: guacd 1.5.5 ships FreeRDP 2.11.5, whose NLA/CredSSP client cannot
authenticate against gnome-remote-desktop, which mandates NLA
(HYBRID_REQUIRED_BY_SERVER) with no option to disable it. The earlier
EGL/Mesa/Zink GPU theory was a red herring.

Proven at every layer: direct xfreerdp v3 to the VM, the real guacd protocol
path (all security modes fail identically), and guacd's own logs. Also verified
guacd:1.6.0 still ships FreeRDP 2.11.7, so an image bump would NOT fix it.

Fix applied to the test VM: replaced gnome-remote-desktop with xrdp (masked the
GNOME user service so it can't re-grab port 3389), which interoperates with
guacd's FreeRDP 2. Verified a real session streams through guacd with
security=any. No ArchNest code change was needed — the existing security/
ignore-cert handling in guacamole.ts is correct.

Documents this as a general finding since other users will hit GNOME's built-in
RDP the same way.

Co-authored-by: Samuel James <ssamjame@amazon.com>
Co-authored-by: Kiro <noreply@kiro.dev>
2026-06-22 14:18:04 -04:00


# RDP Connection Debugging — Handoff Doc
## ✅ RESOLVED (2026-06-22) — root cause found, proven end-to-end
**Root cause: guacd 1.5.5 ships FreeRDP 2.11.5, whose NLA/CredSSP client cannot
complete authentication against gnome-remote-desktop, which *mandates* NLA.**
Proven at every layer (not a theory — the EGL/Mesa/Zink hypothesis below was a red herring):
1. **Server mandates NLA.** Direct `xfreerdp` (v3) from the Fedora VM to its own
gnome-remote-desktop returns, for `/sec:tls` and `/sec:rdp`:
`[WARN][com.freerdp.core.nego] Error: HYBRID_REQUIRED_BY_SERVER [0x00000005]`
`Protocol Security Negotiation Failure`. `grdctl rdp set-auth-methods` only offers
`credentials` (NLA) and `kerberos`: **there is no non-NLA / plain-RDP mode**, so NLA cannot be turned off.
2. **guacd's FreeRDP 2 can't do NLA against it.** Driving the *real* guacd path
(guacd 172.18.0.2:4822 → VM) with `security` set to `nla`, `tls`, `rdp`, and `any` returns, in
every case, the identical Guacamole error `Server refused connection (wrong security type?)` (code 519).
guacd's own log confirms it tried correctly: `Security mode: NLA` … then
`RDP server closed/refused connection: Server refused connection (wrong security type?)`.
The fact that all four modes fail *identically* was the tell — it's not a mode mismatch,
it's that FreeRDP 2's CredSSP handshake is incompatible with gnome-remote-desktop's.
3. **Bumping guacd does NOT fix it.** `guacamole/guacd:1.6.0` still ships FreeRDP **2.11.7**
(verified by inspecting the image). FreeRDP **3.x** is what fixes gnome-remote-desktop NLA
interop, and Apache's guacd image doesn't ship FreeRDP 3 yet. So an image bump is wasted.
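
The three findings above each rest on recognising a specific log line. A minimal sketch of a classifier for captured output, assuming the strings quoted above appear verbatim; the function name `classify_rdp_failure` and its labels are illustrative and not part of ArchNest, guacd, or FreeRDP:

```bash
# Hypothetical helper: reads captured xfreerdp or guacd output on stdin and
# prints one label. The patterns are the log strings quoted in this document.
classify_rdp_failure() {
  local log
  log=$(cat)
  case $log in
    *HYBRID_REQUIRED_BY_SERVER*)
      # Seen from direct xfreerdp with /sec:tls or /sec:rdp.
      echo "server-mandates-nla" ;;
    *"Server refused connection (wrong security type?)"*)
      # Seen in guacd's log for every security mode tried.
      echo "guacd-refused" ;;
    *)
      echo "unknown" ;;
  esac
}
```

For example, `docker logs archnest-guacd 2>&1 | classify_rdp_failure` would label the most recent guacd capture. The labels only say which known line was seen; they do not prove causality on their own.
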
### Fix / recommendation (general — other ArchNest users will hit this)
gnome-remote-desktop is **not a reliable RDP target for guacd-based gateways** (this affects
Fedora/Ubuntu 22.04+ desktops using GNOME's built-in "Remote Desktop"). The fix applied here,
plus the alternative considered:
- **Applied & verified (operational, per-VM): replaced gnome-remote-desktop with `xrdp`** on
the test VM. xrdp's RDP-security path interoperates with guacd's FreeRDP 2. Steps run:
`sudo dnf install -y xrdp && sudo systemctl enable --now xrdp`; then disabled + **masked**
gnome-remote-desktop's user service (`systemctl --user mask gnome-remote-desktop.service`)
and killed the lingering daemon that was still holding port 3389 so xrdp could bind it.
Verified end-to-end through the real guacd path: with `security=any`, guacd authenticates and
streams live desktop frames. **`security` MUST be `any` (or blank → defaults to `any`)** for
xrdp's default config — `nla` fails (`Security negotiation failed`) and `rdp` errors out.
Note: xrdp gives a fresh X login session, not a takeover of the existing Wayland session.
- **Alternative (infra, affects everyone): a custom guacd build with FreeRDP 3.** Not worth it
yet — it's a 30+ min from-source build to maintain in `docker-compose.yml`, for one upstream
gap that Apache will eventually close. Revisit if/when `guacamole/guacd` ships FreeRDP 3.
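
The per-VM swap described in the first bullet can be sketched as a dry-run script. This is an illustration under assumptions, not the exact commands that were run: the `run` and `swap_grd_for_xrdp` names are hypothetical, the `pkill` pattern assumes the lingering process is named `gnome-remote-desktop-daemon`, and the order is changed so that port 3389 is freed before xrdp starts.

```bash
# With DRY_RUN=1 (the default in this sketch) commands are printed, not executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

swap_grd_for_xrdp() {
  # Stop GNOME's RDP user service and mask it so it cannot re-grab port 3389.
  run systemctl --user disable --now gnome-remote-desktop.service
  run systemctl --user mask gnome-remote-desktop.service
  # Assumption: the lingering daemon is named gnome-remote-desktop-daemon.
  run pkill -u "$(id -un)" -f gnome-remote-desktop-daemon
  # Install and start xrdp (the two commands recorded above).
  run sudo dnf install -y xrdp
  run sudo systemctl enable --now xrdp
}
```

Calling `swap_grd_for_xrdp` prints the plan; `DRY_RUN=0 swap_grd_for_xrdp` would execute it. Reviewing the printed plan first is the safer habit on a VM that someone else may be logged in to.
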
No ArchNest code change was required — the `security` field + `ignore-cert` handling in
`backend/src/routes/guacamole.ts` (added earlier this debugging arc) are correct and remain
useful for other RDP servers. The blocker was purely the guacd↔gnome NLA incompatibility.
The original investigation notes below are kept for history.
---
## Goal
ArchNest is a self-hosted dashboard product. One of its integrations is a "Remote Desktop"
connection type that proxies RDP/VNC/Telnet sessions through `guacd` (Apache Guacamole's
proxy daemon) so users can open a remote desktop session in the browser. This needs to work
reliably for *any* user's RDP server, not just this one — so the immediate goal is to get
this specific connection working, but treat every root cause found as a potential general
fix (config option, docs, code change) since other users will hit the same servers
(gnome-remote-desktop, xrdp, Windows RDP, etc).
**You have hands-on access to both machines involved.** Use that — actively connect to both,
run diagnostics on both sides simultaneously, and correlate logs/timestamps. Do not guess from
one side alone; multiple times in this debugging session, a theory formed from only one
machine's logs turned out to be wrong once the other machine's logs were checked.
## The two machines
1. **`racknerd-712b73a`** — the VPS running the ArchNest stack (this repo) in Docker.
- Container `archnest-backend` — the Node/Fastify backend. Route of interest:
`backend/src/routes/guacamole.ts` — bridges a browser WebSocket to `guacd` using
`guacamole-lite`'s `ClientConnection`/`Crypt` classes. Builds a Guacamole connection
token (protocol, hostname, port, username, password, domain, security, ignore-cert)
and hands it to `guacd`.
- Container `archnest-guacd` — Apache Guacamole's `guacd` (v1.5.5), the proxy daemon that
actually speaks RDP/VNC/Telnet to the target. Listens on port 4822. On the
`archnest_default` Docker network, internal IP `172.18.0.2`, DNS aliases
`archnest-guacd`/`guacd`. Backend env vars: `ARCHNEST_GUACD_HOST=guacd`,
`ARCHNEST_GUACD_PORT=4822`.
- Diagnostic command: `docker logs -f archnest-guacd` — shows each connection attempt,
the security mode negotiated, certificate validation results, and the final
success/refusal message from FreeRDP (the RDP client library `guacd` uses internally).
- Also useful: `docker exec archnest-backend env | grep ARCHNEST_GUACD`,
`docker inspect archnest-guacd` (to confirm network/IP), `nc -zv 192.168.122.55 3389`
(already confirmed reachable from racknerd).
2. **Fedora VM (`192.168.122.55`)** — appears to be a libvirt VM co-located on the same
physical host as racknerd (it's in libvirt's default NAT range, and is reachable from
racknerd over a private 192.168.x address despite racknerd otherwise looking like a public
VPS). Running Fedora 44, GPU is a `Red Hat, Inc. Virtio 1.0 GPU (rev 01)` (confirmed via
`lspci`). User `sam`, password `happy2026` (test/lab credentials, not a real secret).
- RDP is served by **`gnome-remote-desktop`** (GNOME's built-in RDP/VNC daemon), running
as a **per-user systemd service**: `systemctl --user status gnome-remote-desktop`,
`systemctl --user restart gnome-remote-desktop`.
- Configured via the `grdctl` CLI: `grdctl status --show-credentials`, `grdctl rdp enable`,
`grdctl rdp set-credentials <user> <pass>`, `grdctl rdp set-tls-cert/set-tls-key`,
`grdctl rdp disable-view-only`.
- Diagnostic command: `journalctl --user -u gnome-remote-desktop -f` — shows the daemon's
own startup/shutdown/error logs.
- There is a confirmed active, unlocked, real graphical session: `loginctl list-sessions`
showed session `51` (seat0, tty2, class `user`), and
`loginctl show-session 51 -p Type -p State -p Active` returned
`Type=wayland`, `Active=yes`, `State=active`. So gnome-remote-desktop has a real
Wayland session to attach to — this is NOT a "no session" problem.
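
The "real session" check in the last bullet can be made repeatable. A minimal sketch, assuming `loginctl show-session` prints `KEY=VALUE` lines as quoted above; the function name `session_is_live_wayland` is hypothetical:

```bash
# Reads `loginctl show-session <id> -p Type -p State -p Active` output on stdin
# and succeeds only if it describes an active Wayland session.
session_is_live_wayland() {
  local type="" state="" active="" key value
  while IFS='=' read -r key value; do
    case $key in
      Type)   type=$value ;;
      State)  state=$value ;;
      Active) active=$value ;;
    esac
  done
  [ "$type" = "wayland" ] && [ "$state" = "active" ] && [ "$active" = "yes" ]
}
```

Usage on the Fedora VM would look like `loginctl show-session 51 -p Type -p State -p Active | session_is_live_wayland && echo "session OK"`, with `51` replaced by whatever `loginctl list-sessions` currently reports.
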
## What's already been fixed (confirmed working, do not re-investigate these)
1. **DNS**: an earlier hostname (`fedora`) didn't resolve from the backend container —
resolved by using the IP `192.168.122.55` directly instead.
2. **Self-signed cert rejection**: FreeRDP/guacd rejected the target's self-signed RDP cert
by default. Fixed in code — `backend/src/routes/guacamole.ts` now sets
`settings['ignore-cert'] = 'true'` whenever `protocol === 'rdp'`. Confirmed deployed via
`docker exec archnest-backend grep -A2 "ignore-cert" /app/dist/routes/guacamole.js`.
3. **No way to override RDP security mode**: added a `security` field to the connection
token (`settings.security = security || 'any'`) and exposed it in the Settings UI
(`src/pages/Settings.tsx`, field key `security`, hint text about NLA). User has tried
`any`, `nla`, `tls`, and `rdp` — all fail identically (see below).
4. **GNOME's own RDP TLS cert was corrupt**: `journalctl` showed
`[ERROR][com.freerdp.crypto] - [x509_utils_from_pem]: BIO_new failed for certificate` /
`RDP server certificate is invalid`. Fixed by regenerating the cert/key on the Fedora VM:
```bash
openssl req -new -newkey rsa:4096 -days 365 -nodes -x509 -subj "/CN=fedora" \
-keyout ~/.local/share/gnome-remote-desktop/rdp-tls.key \
-out ~/.local/share/gnome-remote-desktop/rdp-tls.crt
grdctl rdp set-tls-cert ~/.local/share/gnome-remote-desktop/rdp-tls.crt
grdctl rdp set-tls-key ~/.local/share/gnome-remote-desktop/rdp-tls.key
systemctl --user restart gnome-remote-desktop
```
Confirmed fixed — later journal output shows no cert error on startup.
5. **RDP sharing disabled** at the gnome-remote-desktop level (`grdctl status` showed
`Status: disabled` even though the daemon process was running and the port was
listening). Fixed via `grdctl rdp enable` + `systemctl --user restart
gnome-remote-desktop`.
6. **Credentials missing / GNOME Keyring locked**: `grdctl rdp set-credentials sam happy2026`
failed with `Cannot create an item in a locked collection` because the keyring wasn't
unlocked (likely an artifact of an SSH-only login rather than a real unlocked graphical
login). Fixed via:
```bash
echo -n 'your-login-password' | gnome-keyring-daemon --unlock
grdctl rdp set-credentials sam happy2026
```
`grdctl status --show-credentials` now consistently shows `Unit status: active`,
`RDP: Status: enabled`, `Username: sam`, `Password: happy2026`.
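
The defaulting rule from item 3 (`settings.security = security || 'any'`) can be mirrored in shell when scripting tests against guacd. This is a sketch, not the ArchNest implementation: `normalize_security` is a hypothetical name, and the validation against the four modes tried in this document is an addition that `guacamole.ts` is not known to perform. Guacamole may accept further values that this sketch deliberately ignores.

```bash
# Blank or missing input falls back to "any"; anything outside the four modes
# exercised in this document is rejected.
normalize_security() {
  local mode=${1:-any}
  case $mode in
    any|nla|tls|rdp)
      echo "$mode" ;;
    *)
      echo "unsupported security mode: $mode" >&2
      return 1 ;;
  esac
}
```

For instance, `normalize_security ""` prints `any`, which is the value xrdp's default configuration needed in the resolution above.
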
## The unresolved problem
Despite all of the above being fixed and verified consistent, connecting through ArchNest
(browser → backend → guacd → Fedora VM) still fails with:
```
Error: Server refused connection (wrong security type?)
```
This has been tried with `security` set to `any`, `nla`, `tls`, and `rdp` — **identical
failure every time**, regardless of mode. That's suspicious: if it were a genuine security
negotiation mismatch, different modes should fail differently (or some should succeed).
The fact that they all fail identically suggests the real failure might be happening
*after* security negotiation succeeds — e.g. at session-start/framebuffer-creation time —
and FreeRDP's client-side error message is a generic/misleading bucket for "the connection
didn't complete," not literally a security-type mismatch.
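
The "all modes fail identically" observation is easy to check mechanically once each attempt's error has been recorded. A minimal sketch, assuming one `mode<TAB>error message` line per attempt; both the input format and the name `failures_identical` are hypothetical:

```bash
# Succeeds when every recorded attempt produced the same error message.
failures_identical() {
  local distinct
  # Drop the mode column, then count the distinct messages that remain.
  distinct=$(cut -f2- | sort -u | wc -l | tr -d ' ')
  [ "$distinct" = "1" ]
}
```

A successful return says only that the failure does not depend on the requested mode. It does not say where in the handshake the failure occurs; that still has to come from the logs on both machines.
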
### Open theory (unconfirmed)
`journalctl --user -u gnome-remote-desktop` shows, on every daemon startup, EGL/Mesa/Zink
rendering errors:
```
libEGL warning: failed to get driver name for fd -1
MESA-LOADER: failed to retrieve device information
MESA: error: ZINK: failed to choose pdev
libEGL warning: egl: failed to create dri2 screen
```
There was also one observed instance of "RDP server started" immediately followed by
"RDP server stopped" with timing consistent with an actual connection attempt. The theory
is that gnome-remote-desktop can't create a renderable framebuffer for screen capture (no
working GPU/software-render path) and crashes/aborts when a client actually tries to start
a session — which a FreeRDP client then reports as "wrong security type" because that's
the generic refusal message FreeRDP shows for several different underlying failure modes.
**This theory has NOT been confirmed.** It's a leading hypothesis based on log timing
correlation only — no one has yet proven the EGL/Mesa errors are causal vs. just noise from
gnome-remote-desktop probing GPU paths at startup (which may be harmless/expected on a
Virtio-GPU VM that falls back to software rendering anyway).
### Diagnostic step that was in progress, never completed
A direct `xfreerdp` test, bypassing guacd entirely, to isolate whether gnome-remote-desktop
rejects ANY RDP client (not just guacd/FreeRDP-via-guacd), or whether this is specific to how
guacd's embedded FreeRDP negotiates. `freerdp`/`xfreerdp` has now been installed on both
machines, but the actual test was never run/reported back. This should be your first move:
```bash
# From racknerd (approximates guacd's network path; the host, not the guacd container, originates the connection):
xfreerdp /v:192.168.122.55 /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only
xfreerdp /v:192.168.122.55 /sec:nla /cert-ignore /u:sam /p:happy2026 +auth-only
xfreerdp /v:192.168.122.55 /sec:rdp /cert-ignore /u:sam /p:happy2026 +auth-only
# From the Fedora VM itself (rules out networking, tests gnome-remote-desktop alone):
xfreerdp /v:localhost /sec:tls /cert-ignore /u:sam /p:happy2026 +auth-only
```
Run these WHILE simultaneously tailing both:
```bash
# on racknerd:
docker logs -f archnest-guacd
# on the Fedora VM:
journalctl --user -u gnome-remote-desktop -f
```
Correlate the exact moment of failure across both logs. This is the single most valuable
piece of evidence currently missing.
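
To make that correlation easier, the direct tests can be wrapped so each attempt is stamped with a UTC time and its exit code. A sketch under assumptions: `probe_modes`, `XFREERDP_BIN`, and `RDP_PASSWORD` are hypothetical names, and the `xfreerdp` flags are copied from the commands above rather than checked against a particular FreeRDP version.

```bash
# Prints "UTC-timestamp<TAB>mode<TAB>exit-code" per attempt on stdout.
# The client's own output is sent to stderr so the raw error text stays visible.
probe_modes() {
  local host=$1 user=$2 bin=${XFREERDP_BIN:-xfreerdp} mode rc
  for mode in tls nla rdp; do
    if "$bin" "/v:$host" "/sec:$mode" /cert-ignore "/u:$user" \
         "/p:${RDP_PASSWORD:-}" +auth-only 1>&2; then
      rc=0
    else
      rc=$?
    fi
    printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$mode" "$rc"
  done
}
```

A run such as `RDP_PASSWORD=happy2026 probe_modes 192.168.122.55 sam` yields three rows that can be lined up against `docker logs -t archnest-guacd` and `journalctl --utc --user -u gnome-remote-desktop`, both of which print timestamps.
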
## Instructions
1. Get hands-on access to both `racknerd-712b73a` and the Fedora VM (`192.168.122.55`).
2. Run the `xfreerdp` direct tests above, with both logs tailing simultaneously, and read
the actual FreeRDP client-side error output (not just "wrong security type" — xfreerdp's
raw stderr/exit code will usually have more detail than what bubbles up through
guacd/Guacamole's client to the ArchNest UI).
3. If `xfreerdp` succeeds where ArchNest's guac connection fails, the bug is in how
`backend/src/routes/guacamole.ts` builds the connection settings/token, or in the
`guacamole-lite`/`guacd` version compatibility — debug from there, comparing exactly what
settings xfreerdp used successfully vs. what ArchNest sends.
4. If `xfreerdp` *also* fails identically, the problem is squarely on the
gnome-remote-desktop / Fedora VM side. Investigate the EGL/Mesa/Zink rendering theory
directly — check whether software rendering (llvmpipe) is available
(`glxinfo -B` from an actual Wayland session, not an SSH shell — note: an earlier attempt
from an SSH shell failed with `Error: unable to open display`, which is expected and not
informative; you need to run it from within session 51 or equivalent), and whether the
VM's libvirt XML has virtio-gpu with working 3D/virgl acceleration configured on the
hypervisor side.
5. If gnome-remote-desktop turns out to be fundamentally unable to serve a real client
(vs. screen-sharing GNOME's own "Remote Login" feature, which is its primary intended use
case), consider recommending **xrdp** as a replacement RDP server on the Fedora VM, and
note this in your report as a general product recommendation (since other ArchNest users
may hit the same gnome-remote-desktop limitation).
6. Keep ArchNest's product goal in mind throughout: any fix that's specific to *this* user's
VM is fine for unblocking them, but if you find a root cause that's likely to recur for
other users (e.g. a guacd config default, a missing Settings field, a code bug in
`backend/src/routes/guacamole.ts`), make the corresponding code/config fix in this repo,
not just a one-off operational fix on this VM.
## What to report back when done
Write a concise report (for the engineer/AI who handed this off) covering:
- The root cause, with the specific log lines/evidence that proved it (not just a theory).
- The exact fix applied, including any commands run on either machine and any code changes
made in this repo (with file paths and diffs).
- Whether the fix is specific to this VM or represents a general product issue that other
ArchNest users could hit — and if general, what was changed in the codebase to address it.
- Current working/non-working status of the connection after the fix, with the actual test
performed to confirm it works end-to-end through ArchNest's UI (not just via direct
`xfreerdp`).