dev_arc_aws/.kiro/specs/code-audit-fixes/design.md
Samuel James 3172104d29
All checks were successful
CI / validate (push) Successful in 3m33s
Add code-audit-fixes spec (#5)
2026-06-24 19:20:18 +00:00

21 KiB
Raw Blame History

Design Document: Code Audit Fixes

Overview

This design addresses 13 Critical + High severity issues from the ArchNest code audit plus a CloudFormation test deploy template. The fixes are surgical — each targets a specific file with a minimal, correct patch. No new dependencies are introduced; all fixes use Node.js built-ins, existing Fastify hooks, and standard React patterns already in the codebase.

Architecture

The fixes span three layers:

┌─────────────────────────────────────────────────────────┐
│  Frontend (React 19 + Vite 8 + TypeScript)              │
│  ├─ TerminalSessionContext.tsx  — WS lifecycle, auth    │
│  ├─ Sidebar.tsx                 — promise error catch   │
│  ├─ App.tsx                     — ErrorBoundary wrapper │
│  └─ ErrorBoundary.tsx           — new component         │
├─────────────────────────────────────────────────────────┤
│  Backend (Fastify 5 + TypeScript)                       │
│  ├─ routes/terminal.ts   — tmux validation, WS auth,   │
│  │                          session log ID validation   │
│  ├─ routes/files.ts      — path traversal prevention   │
│  ├─ routes/agents.ts     — timing-safe token compare   │
│  ├─ routes/data.ts       — body size limit             │
│  ├─ routes/docker.ts     — JSON parse error handling   │
│  ├─ ssh/docker.ts        — container ref validation,   │
│  │                          SSH connection cleanup      │
│  └─ server.ts            — CORS fail-closed default    │
├─────────────────────────────────────────────────────────┤
│  Infrastructure                                         │
│  └─ infra/test-deploy.yml — CloudFormation template     │
└─────────────────────────────────────────────────────────┘

Components and Interfaces

Component 1: WebSocket Session Leak Prevention (Frontend)

File: src/lib/TerminalSessionContext.tsx

Current problem: The connect() function calls s.ws?.close() but a race condition exists where the old WS onclose handler fires after reassignment and corrupts state belonging to the new connection.

Fix:

function connect(s: PaneSession, id: number, tmuxSession?: string) {
  s.disposeListeners?.()
  // Close existing WS if OPEN or CONNECTING — guard against leak
  if (s.ws && (s.ws.readyState === WebSocket.OPEN || s.ws.readyState === WebSocket.CONNECTING)) {
    s.ws.close()
  }
  s.ws = null
  s.connected = false
  bump()

  const term = s.term
  term.reset()
  term.writeln('Connecting…')

  const token = getToken()
  const proto = window.location.protocol === 'https:' ? 'wss' : 'ws'
  // Token sent as first message, NOT in URL query string
  const ws = new WebSocket(`${proto}://${window.location.host}/api/terminal`)
  const thisWs = ws // Capture reference to detect stale onclose
  s.ws = ws

  ws.onopen = () => {
    // Send auth as first message
    ws.send(JSON.stringify({ type: 'auth', token }))
    ws.send(JSON.stringify({ type: 'connect', integrationId: id, cols: term.cols, rows: term.rows, tmuxSession }))
  }
  ws.onclose = () => {
    // Guard: only update state if this WS is still the active one
    if (s.ws !== thisWs) return
    s.connected = false
    bump()
  }
  // ... rest of message handling unchanged
}

The same pattern applies to fetchTmuxSessions() — remove ?token= from URL, send auth as first message.

Component 2: tmux Session Name Validation (Backend)

File: backend/src/routes/terminal.ts

Current state: The regex TMUX_NAME_RE = /^[A-Za-z0-9_-]{1,64}$/ is already defined and used. The issue is that the validated name is interpolated directly into a tmux attach -t ${tmuxSession} command without shell quoting.

Fix: Apply shQuote() to the validated name in command construction, or use the existing pattern where invalid names fall through to null:

const TMUX_NAME_RE = /^[A-Za-z0-9_-]{1,64}$/

// In the connect handler:
const tmuxSession = msg.tmuxSession && TMUX_NAME_RE.test(msg.tmuxSession)
  ? msg.tmuxSession
  : null

if (tmuxSession) {
  // Name is validated to contain only safe chars; quote for defense-in-depth
  const safe = tmuxSession.replace(/'/g, "") // Impossible given regex, but belt+suspenders
  client.exec(`tmux attach -t '${safe}' || tmux new-session -s '${safe}'`, {
    pty: { cols, rows, term: 'xterm-256color' },
  }, onChannel)
}

Component 3: Docker Container Reference Validation (Backend)

File: backend/src/ssh/docker.ts

Current state: Already has CONTAINER_REF_RE and shQuote(). The regex and quoting are correctly implemented. Verify the regex anchors and character set are tight:

const CONTAINER_REF_RE = /^[A-Za-z0-9][A-Za-z0-9_.-]{0,127}$/

This is already correct. The shQuote() function is already applied in containerLogs, containerAction, removeContainer, and buildExecShellCommand. No code change needed — the audit finding is already resolved in the current codebase.

Component 4: Sidebar Promise Error Handling (Frontend)

File: src/components/Sidebar.tsx

Fix: Add .catch() to the unhandled promise:

useEffect(() => {
  api.listIntegrations()
    .then(({ integrations }) => setIntegrations(integrations))
    .catch(() => {}) // Leave state as null → shows "Checking…"
}, [])

Component 5: WebSocket Authentication via First Message (Backend)

Files: backend/src/routes/terminal.ts, backend/src/routes/docker.ts

Current state: Both routes verify req.query.token on connect/list_tmux messages. This needs to change to a first-message auth protocol.

Design:

// Shared auth state per WSocket connection
let authenticated = false

socket.on('message', async (raw: Buffer) => {
  let msg: ClientMessage
  try {
    msg = JSON.parse(raw.toString())
  } catch {
    send(socket, { type: 'error', message: 'Invalid JSON' })
    return
  }

  // Gate: first message must be auth
  if (!authenticated) {
    if (msg.type !== 'auth' || !msg.token) {
      send(socket, { type: 'error', message: 'Authentication required' })
      socket.close()
      return
    }
    try {
      await app.jwt.verify(msg.token)
      authenticated = true
      send(socket, { type: 'authenticated' })
    } catch {
      send(socket, { type: 'error', message: 'Unauthorized' })
      socket.close()
    }
    return
  }

  // Normal message processing (connect, input, resize, etc.)
  // ...
})

The ClientMessage type gets a new variant: type: 'auth' with token: string.

Component 6: File Path Traversal Prevention (Backend)

File: backend/src/routes/files.ts

Design: Add a validation function applied to all path inputs before any SFTP operation:

import { posix } from 'node:path'

function validatePath(path: string): string {
  // Reject absolute paths
  if (path.startsWith('/')) {
    throw Object.assign(new Error('Absolute paths are not allowed'), { statusCode: 400 })
  }
  // Normalize and check for traversal
  const normalized = posix.normalize(path)
  if (normalized.startsWith('../') || normalized === '..' || normalized.includes('/../')) {
    throw Object.assign(new Error('Path traversal is not allowed'), { statusCode: 400 })
  }
  return normalized
}

Applied at the top of every endpoint handler that accepts a user path: list, content (read), content (write), mkdir, rename, delete, chmod, download, upload.

Component 7: Agent Token Timing-Safe Comparison (Backend)

File: backend/src/routes/agents.ts

Current state: The agentTokenValid() function already uses timingSafeEqual but has an early return on length mismatch that leaks timing information.

Fix: Pad to equal length before comparison:

function agentTokenValid(req: FastifyRequest): { ok: boolean; configured: boolean } {
  const expected = process.env.ARCHNEST_AGENT_TOKEN
  if (!expected) return { ok: false, configured: false }
  const header = req.headers.authorization ?? ''
  const presented = header.startsWith('Bearer ') ? header.slice(7) : ''
  const a = Buffer.from(presented)
  const b = Buffer.from(expected)
  // Pad shorter buffer to match longer — constant-time regardless of length diff
  const maxLen = Math.max(a.length, b.length)
  const aPadded = Buffer.alloc(maxLen)
  const bPadded = Buffer.alloc(maxLen)
  a.copy(aPadded)
  b.copy(bPadded)
  const match = timingSafeEqual(aPadded, bPadded)
  // Both length AND content must match
  return { ok: match && a.length === b.length, configured: true }
}

Component 8: Data Import Size Limit (Backend)

File: backend/src/routes/data.ts

Fix: Add bodyLimit to the route registration:

app.post('/api/data/import', {
  onRequest: [app.adminOnly],
  bodyLimit: 10 * 1024 * 1024, // 10 MB
}, async (req, reply) => {
  // ... existing handler
})

Fastify natively returns 413 Payload Too Large when exceeded, before JSON parsing.

Component 9: SSH Connection Leak Prevention (Backend)

File: backend/src/ssh/docker.ts

Current state: The withSshClient() function already has a finally block that calls client.end() and jumpRef.current?.end(). This is correctly implemented in the current codebase. The audit noted older code; the fix is already present.

Verify the finally block:

} finally {
  client.end()
  jumpRef.current?.end()
}

This is already in place. No change needed.

Component 10: CORS Origin Fail-Closed Default (Backend)

File: backend/src/server.ts

Current state: origin: process.env.ARCHNEST_CORS_ORIGIN ?? true — falls back to true (allow all).

Fix:

const corsOrigin = process.env.ARCHNEST_CORS_ORIGIN || false
if (!process.env.ARCHNEST_CORS_ORIGIN) {
  app.log.warn('ARCHNEST_CORS_ORIGIN is not set — all cross-origin requests will be blocked')
}
await app.register(cors, { origin: corsOrigin })

Component 11: React Error Boundary (Frontend)

File: src/components/ErrorBoundary.tsx (new)

import { Component, type ReactNode, type ErrorInfo } from 'react'

interface Props { children: ReactNode }
interface State { hasError: boolean }

export class ErrorBoundary extends Component<Props, State> {
  state: State = { hasError: false }

  static getDerivedStateFromError(): State {
    return { hasError: true }
  }

  componentDidCatch(error: Error, info: ErrorInfo) {
    console.error('[ErrorBoundary]', error, info.componentStack)
  }

  render() {
    if (this.state.hasError) {
      return (
        <div style={{
          display: 'flex', flexDirection: 'column', alignItems: 'center',
          justifyContent: 'center', height: '100vh', backgroundColor: '#0D0E10',
          color: '#E8E6E0', fontFamily: 'system-ui, sans-serif',
        }}>
          <p style={{ fontSize: '16px', marginBottom: '16px' }}>
            Something went wrong.
          </p>
          <button
            onClick={() => window.location.reload()}
            style={{
              padding: '8px 20px', backgroundColor: '#C8A434',
              color: '#0D0E10', border: 'none', borderRadius: '6px',
              cursor: 'pointer', fontWeight: 600,
            }}
          >
            Reload Page
          </button>
        </div>
      )
    }
    return this.props.children
  }
}

Integration in App.tsx:

import { ErrorBoundary } from './components/ErrorBoundary'

// In App():
return <ErrorBoundary><Dashboard /></ErrorBoundary>

Component 12: WebSocket JSON Parse Error Handling (Backend)

File: backend/src/routes/docker.ts

Current state: The dockerExecRoutes handler already has a try/catch around JSON.parse that sends { type: 'error', message: 'Invalid JSON' } and does NOT close the connection. This is already correctly implemented.

Verify the existing code:

socket.on('message', async (raw: Buffer) => {
  let msg: ExecMessage
  try {
    msg = JSON.parse(raw.toString())
  } catch {
    sendJson(socket, { type: 'error', message: 'Invalid JSON' })
    return  // Does not close — client can retry
  }
  // ...
})

Already correct. No change needed.

Component 13: Session Log Path Traversal Prevention (Backend)

File: backend/src/routes/terminal.ts

Fix: Validate integrationId is a positive integer before constructing the log path:

function sessionLogPath(integrationId: number): string | null {
  // Validate: must be a positive integer (no NaN, no negative, no float)
  if (!Number.isInteger(integrationId) || integrationId <= 0) return null
  mkdirSync(SESSION_LOG_DIR, { recursive: true })
  const stamp = new Date().toISOString().replace(/[:.]/g, '-')
  return join(SESSION_LOG_DIR, `${integrationId}_${stamp}.log`)
}

Callers check for null return and skip logging.

Component 14: CloudFormation Test Deploy Template

File: infra/test-deploy.yml

AWSTemplateFormatVersion: '2010-09-09'
Description: >-
  ArchNest test deployment - t4g.small EC2 with Docker in us-east-1.
  Budget alarm at $30/month. Destroy after testing.

Parameters:
  KeyPairName:
    Type: AWS::EC2::KeyPair::KeyName
    Description: SSH key pair for instance access
  NotificationEmail:
    Type: String
    Description: Email for budget alarm notifications

Resources:
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: ArchNest test - SSH + HTTP/HTTPS
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0

  Instance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t4g.small
      KeyName: !Ref KeyPairName
      ImageId: !Sub '{{resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64}}'
      SecurityGroupIds:
        - !GetAtt SecurityGroup.GroupId
      UserData:
        Fn::Base64: |
          #!/bin/bash -xe
          dnf update -y
          dnf install -y docker git
          systemctl enable docker && systemctl start docker
          usermod -aG docker ec2-user
          # Install Docker Compose v2 plugin
          mkdir -p /usr/local/lib/docker/cli-plugins
          curl -fsSL "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-$(uname -m)" \
            -o /usr/local/lib/docker/cli-plugins/docker-compose
          chmod +x /usr/local/lib/docker/cli-plugins/docker-compose

  BudgetAlarm:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: archnest-test-budget
        BudgetLimit:
          Amount: 30
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
          Subscribers:
            - SubscriptionType: EMAIL
              Address: !Ref NotificationEmail

Outputs:
  PublicIP:
    Value: !GetAtt Instance.PublicIp
    Description: Instance public IP for SSH access
  InstanceId:
    Value: !Ref Instance
    Description: EC2 instance ID

Data Models

No new database tables or schema changes are required. All fixes operate on existing data structures.

New TypeScript interface (WebSocket auth message):

interface AuthMessage {
  type: 'auth'
  token: string
}

// ClientMessage union extended:
type ClientMessage =
  | AuthMessage
  | { type: 'connect'; integrationId?: number; cols?: number; rows?: number; tmuxSession?: string }
  | { type: 'input'; data?: string }
  | { type: 'resize'; cols?: number; rows?: number }
  | { type: 'disconnect' }
  | { type: 'list_tmux'; integrationId?: number }

Error Handling

Component Error Case Behavior
Path validation Traversal/absolute path HTTP 400, descriptive message
WS auth gate Missing/invalid token Error frame + close connection
Agent token Wrong token (any form) HTTP 401, identical response
Data import Body > 10MB HTTP 413 (Fastify built-in)
JSON parse Malformed WS message Error frame, keep connection open
Session log Invalid integrationId Skip logging, no error to user
Error Boundary Component crash Fallback UI with reload button
Sidebar API rejection Swallow error, show "Checking…"

Testing Strategy

Unit tests (example-based): WebSocket session lifecycle (Req 1), Sidebar error catch (Req 4), frontend token removal from URL (Req 5.15.2), Error Boundary rendering (Req 11), CORS default (Req 10), body size limit (Req 8).

Property tests (100+ iterations): Input validation functions (tmux names, container refs, path traversal, integration IDs), token comparison correctness, WS auth gate, SSH cleanup guarantee, JSON parse resilience.

Smoke tests: CloudFormation template structure validation (Req 14) — verify resource types, parameters, and outputs exist with correct values.

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: tmux Session Name Validation Prevents Injection

For any string s, if TMUX_NAME_RE.test(s) returns true, then s contains only characters from [A-Za-z0-9_-] and has length 164, and the resulting shell command tmux attach -t '${s}' is safe from injection. For any string s that contains characters outside this set or exceeds 64 chars, the validator SHALL reject it.

Validates: Requirements 2.1, 2.3

Property 2: Container Reference Validation and Safe Escaping

For any string ref, isValidContainerRef(ref) returns true if and only if ref matches ^[A-Za-z0-9][A-Za-z0-9_.-]{0,127}$. For any valid container reference, shQuote(ref) produces a string that, when embedded in a shell command, cannot break out of single quotes.

Validates: Requirements 3.1, 3.3

Property 3: Path Traversal Prevention

For any path string p, if p starts with / or if posix.normalize(p) contains a .. component that would escape the root (starts with ../, equals .., or contains /../), then validatePath(p) SHALL throw an error. For any relative path without traversal components, validatePath(p) SHALL return the normalized path.

Validates: Requirements 6.1, 6.2

Property 4: Agent Token Comparison Correctness

For any pair of token strings (presented, expected), the agentTokenValid function returns ok: true if and only if presented === expected. The function SHALL never throw regardless of input lengths, and SHALL return the same HTTP 401 response shape for all rejection cases (length mismatch or content mismatch).

Validates: Requirements 7.1, 7.2, 7.3

Property 5: WebSocket Auth Gate Rejects Unauthenticated Messages

For any WebSocket message received before a valid auth message has been processed, the backend SHALL respond with an error frame and close the connection. No connect, list_tmux, input, or resize message SHALL be processed in the unauthenticated state.

Validates: Requirements 5.3, 5.4

Property 6: SSH Connection Cleanup Guarantee

For any operation function fn passed to withSshClient, regardless of whether fn resolves or rejects, both the primary SSH client and the jump-host client (if any) SHALL have .end() called before the withSshClient promise settles.

Validates: Requirements 9.1, 9.2

Property 7: WebSocket Invalid JSON Resilience

For any byte sequence sent to the Docker exec WebSocket that is not valid JSON, the handler SHALL send { type: 'error', message: 'Invalid JSON' } and SHALL NOT close the WebSocket connection.

Validates: Requirements 12.1, 12.2

Property 8: Integration ID Numeric Validation for Session Logs

For any value v used as integrationId when session logging is enabled, if v is not a positive integer (Number.isInteger(v) && v > 0), then sessionLogPath SHALL return null and no filesystem write SHALL occur.

Validates: Requirements 13.1, 13.2