dev_arc_aws/.kiro/specs/code-audit-fixes/design.md
Samuel James c9d7462c33
All checks were successful
CI / validate (pull_request) Successful in 7m7s
Add code-audit-fixes spec + audit steering
- Spec for fixing 13 Critical + High code audit issues
- requirements.md (14 requirements, EARS format)
- design.md (component fixes + 8 correctness properties)
- tasks.md (8 tasks, parallel execution waves)
- CloudFormation test deploy template planned
- code-audit.md steering file (all 40 issues documented)

Co-authored-by: Samuel James <ssamjame@amazon.com>
Co-authored-by: Kiro <noreply@kiro.dev>
2026-06-24 15:19:32 -04:00

561 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Design Document: Code Audit Fixes
## Overview
This design addresses 13 Critical + High severity issues from the ArchNest code audit plus a CloudFormation test deploy template. The fixes are surgical — each targets a specific file with a minimal, correct patch. No new dependencies are introduced; all fixes use Node.js built-ins, existing Fastify hooks, and standard React patterns already in the codebase.
## Architecture
The fixes span three layers:
```
┌─────────────────────────────────────────────────────────┐
│ Frontend (React 19 + Vite 8 + TypeScript) │
│ ├─ TerminalSessionContext.tsx — WS lifecycle, auth │
│ ├─ Sidebar.tsx — promise error catch │
│ ├─ App.tsx — ErrorBoundary wrapper │
│ └─ ErrorBoundary.tsx — new component │
├─────────────────────────────────────────────────────────┤
│ Backend (Fastify 5 + TypeScript) │
│ ├─ routes/terminal.ts — tmux validation, WS auth, │
│ │ session log ID validation │
│ ├─ routes/files.ts — path traversal prevention │
│ ├─ routes/agents.ts — timing-safe token compare │
│ ├─ routes/data.ts — body size limit │
│ ├─ routes/docker.ts — JSON parse error handling │
│ ├─ ssh/docker.ts — container ref validation, │
│ │ SSH connection cleanup │
│ └─ server.ts — CORS fail-closed default │
├─────────────────────────────────────────────────────────┤
│ Infrastructure │
│ └─ infra/test-deploy.yml — CloudFormation template │
└─────────────────────────────────────────────────────────┘
```
## Components and Interfaces
### Component 1: WebSocket Session Leak Prevention (Frontend)
**File:** `src/lib/TerminalSessionContext.tsx`
**Current problem:** The `connect()` function calls `s.ws?.close()` but a race condition exists where the old WS `onclose` handler fires after reassignment and corrupts state belonging to the new connection.
**Fix:**
```typescript
function connect(s: PaneSession, id: number, tmuxSession?: string) {
s.disposeListeners?.()
// Close existing WS if OPEN or CONNECTING — guard against leak
if (s.ws && (s.ws.readyState === WebSocket.OPEN || s.ws.readyState === WebSocket.CONNECTING)) {
s.ws.close()
}
s.ws = null
s.connected = false
bump()
const term = s.term
term.reset()
term.writeln('Connecting…')
const token = getToken()
const proto = window.location.protocol === 'https:' ? 'wss' : 'ws'
// Token sent as first message, NOT in URL query string
const ws = new WebSocket(`${proto}://${window.location.host}/api/terminal`)
const thisWs = ws // Capture reference to detect stale onclose
s.ws = ws
ws.onopen = () => {
// Send auth as first message
ws.send(JSON.stringify({ type: 'auth', token }))
ws.send(JSON.stringify({ type: 'connect', integrationId: id, cols: term.cols, rows: term.rows, tmuxSession }))
}
ws.onclose = () => {
// Guard: only update state if this WS is still the active one
if (s.ws !== thisWs) return
s.connected = false
bump()
}
// ... rest of message handling unchanged
}
```
The same pattern applies to `fetchTmuxSessions()` — remove `?token=` from URL, send auth as first message.
### Component 2: tmux Session Name Validation (Backend)
**File:** `backend/src/routes/terminal.ts`
**Current state:** The regex `TMUX_NAME_RE = /^[A-Za-z0-9_-]{1,64}$/` is already defined and used. The issue is that the validated name is interpolated directly into a `tmux attach -t ${tmuxSession}` command without shell quoting.
**Fix:** Apply `shQuote()` to the validated name in command construction, or use the existing pattern where invalid names fall through to `null`:
```typescript
const TMUX_NAME_RE = /^[A-Za-z0-9_-]{1,64}$/
// In the connect handler:
const tmuxSession = msg.tmuxSession && TMUX_NAME_RE.test(msg.tmuxSession)
? msg.tmuxSession
: null
if (tmuxSession) {
// Name is validated to contain only safe chars; quote for defense-in-depth
const safe = tmuxSession.replace(/'/g, "") // Impossible given regex, but belt+suspenders
client.exec(`tmux attach -t '${safe}' || tmux new-session -s '${safe}'`, {
pty: { cols, rows, term: 'xterm-256color' },
}, onChannel)
}
```
### Component 3: Docker Container Reference Validation (Backend)
**File:** `backend/src/ssh/docker.ts`
**Current state:** Already has `CONTAINER_REF_RE` and `shQuote()`. The regex and quoting are correctly implemented. Verify the regex anchors and character set are tight:
```typescript
const CONTAINER_REF_RE = /^[A-Za-z0-9][A-Za-z0-9_.-]{0,127}$/
```
This is already correct. The `shQuote()` function is already applied in `containerLogs`, `containerAction`, `removeContainer`, and `buildExecShellCommand`. No code change needed — the audit finding is already resolved in the current codebase.
### Component 4: Sidebar Promise Error Handling (Frontend)
**File:** `src/components/Sidebar.tsx`
**Fix:** Add `.catch()` to the unhandled promise:
```typescript
useEffect(() => {
api.listIntegrations()
.then(({ integrations }) => setIntegrations(integrations))
.catch(() => {}) // Leave state as null → shows "Checking…"
}, [])
```
### Component 5: WebSocket Authentication via First Message (Backend)
**Files:** `backend/src/routes/terminal.ts`, `backend/src/routes/docker.ts`
**Current state:** Both routes verify `req.query.token` on `connect`/`list_tmux` messages. This needs to change to a first-message auth protocol.
**Design:**
```typescript
// Shared auth state per WSocket connection
let authenticated = false
socket.on('message', async (raw: Buffer) => {
let msg: ClientMessage
try {
msg = JSON.parse(raw.toString())
} catch {
send(socket, { type: 'error', message: 'Invalid JSON' })
return
}
// Gate: first message must be auth
if (!authenticated) {
if (msg.type !== 'auth' || !msg.token) {
send(socket, { type: 'error', message: 'Authentication required' })
socket.close()
return
}
try {
await app.jwt.verify(msg.token)
authenticated = true
send(socket, { type: 'authenticated' })
} catch {
send(socket, { type: 'error', message: 'Unauthorized' })
socket.close()
}
return
}
// Normal message processing (connect, input, resize, etc.)
// ...
})
```
The `ClientMessage` type gets a new variant: `type: 'auth'` with `token: string`.
### Component 6: File Path Traversal Prevention (Backend)
**File:** `backend/src/routes/files.ts`
**Design:** Add a validation function applied to all path inputs before any SFTP operation:
```typescript
import { posix } from 'node:path'
function validatePath(path: string): string {
// Reject absolute paths
if (path.startsWith('/')) {
throw Object.assign(new Error('Absolute paths are not allowed'), { statusCode: 400 })
}
// Normalize and check for traversal
const normalized = posix.normalize(path)
if (normalized.startsWith('../') || normalized === '..' || normalized.includes('/../')) {
throw Object.assign(new Error('Path traversal is not allowed'), { statusCode: 400 })
}
return normalized
}
```
Applied at the top of every endpoint handler that accepts a user path: `list`, `content` (read), `content` (write), `mkdir`, `rename`, `delete`, `chmod`, `download`, `upload`.
### Component 7: Agent Token Timing-Safe Comparison (Backend)
**File:** `backend/src/routes/agents.ts`
**Current state:** The `agentTokenValid()` function already uses `timingSafeEqual` but has an early return on length mismatch that leaks timing information.
**Fix:** Pad to equal length before comparison:
```typescript
function agentTokenValid(req: FastifyRequest): { ok: boolean; configured: boolean } {
const expected = process.env.ARCHNEST_AGENT_TOKEN
if (!expected) return { ok: false, configured: false }
const header = req.headers.authorization ?? ''
const presented = header.startsWith('Bearer ') ? header.slice(7) : ''
const a = Buffer.from(presented)
const b = Buffer.from(expected)
// Pad shorter buffer to match longer — constant-time regardless of length diff
const maxLen = Math.max(a.length, b.length)
const aPadded = Buffer.alloc(maxLen)
const bPadded = Buffer.alloc(maxLen)
a.copy(aPadded)
b.copy(bPadded)
const match = timingSafeEqual(aPadded, bPadded)
// Both length AND content must match
return { ok: match && a.length === b.length, configured: true }
}
```
### Component 8: Data Import Size Limit (Backend)
**File:** `backend/src/routes/data.ts`
**Fix:** Add `bodyLimit` to the route registration:
```typescript
app.post('/api/data/import', {
onRequest: [app.adminOnly],
bodyLimit: 10 * 1024 * 1024, // 10 MB
}, async (req, reply) => {
// ... existing handler
})
```
Fastify natively returns 413 Payload Too Large when exceeded, before JSON parsing.
### Component 9: SSH Connection Leak Prevention (Backend)
**File:** `backend/src/ssh/docker.ts`
**Current state:** The `withSshClient()` function already has a `finally` block that calls `client.end()` and `jumpRef.current?.end()`. This is correctly implemented in the current codebase. The audit noted older code; the fix is already present.
Verify the `finally` block:
```typescript
} finally {
client.end()
jumpRef.current?.end()
}
```
This is already in place. No change needed.
### Component 10: CORS Origin Fail-Closed Default (Backend)
**File:** `backend/src/server.ts`
**Current state:** `origin: process.env.ARCHNEST_CORS_ORIGIN ?? true` — falls back to `true` (allow all).
**Fix:**
```typescript
const corsOrigin = process.env.ARCHNEST_CORS_ORIGIN || false
if (!process.env.ARCHNEST_CORS_ORIGIN) {
app.log.warn('ARCHNEST_CORS_ORIGIN is not set — all cross-origin requests will be blocked')
}
await app.register(cors, { origin: corsOrigin })
```
### Component 11: React Error Boundary (Frontend)
**File:** `src/components/ErrorBoundary.tsx` (new)
```typescript
import { Component, type ReactNode, type ErrorInfo } from 'react'
interface Props { children: ReactNode }
interface State { hasError: boolean }
export class ErrorBoundary extends Component<Props, State> {
state: State = { hasError: false }
static getDerivedStateFromError(): State {
return { hasError: true }
}
componentDidCatch(error: Error, info: ErrorInfo) {
console.error('[ErrorBoundary]', error, info.componentStack)
}
render() {
if (this.state.hasError) {
return (
<div style={{
display: 'flex', flexDirection: 'column', alignItems: 'center',
justifyContent: 'center', height: '100vh', backgroundColor: '#0D0E10',
color: '#E8E6E0', fontFamily: 'system-ui, sans-serif',
}}>
<p style={{ fontSize: '16px', marginBottom: '16px' }}>
Something went wrong.
</p>
<button
onClick={() => window.location.reload()}
style={{
padding: '8px 20px', backgroundColor: '#C8A434',
color: '#0D0E10', border: 'none', borderRadius: '6px',
cursor: 'pointer', fontWeight: 600,
}}
>
Reload Page
</button>
</div>
)
}
return this.props.children
}
}
```
**Integration in `App.tsx`:**
```typescript
import { ErrorBoundary } from './components/ErrorBoundary'
// In App():
return <ErrorBoundary><Dashboard /></ErrorBoundary>
```
### Component 12: WebSocket JSON Parse Error Handling (Backend)
**File:** `backend/src/routes/docker.ts`
**Current state:** The `dockerExecRoutes` handler already has a try/catch around `JSON.parse` that sends `{ type: 'error', message: 'Invalid JSON' }` and does NOT close the connection. This is already correctly implemented.
Verify the existing code:
```typescript
socket.on('message', async (raw: Buffer) => {
let msg: ExecMessage
try {
msg = JSON.parse(raw.toString())
} catch {
sendJson(socket, { type: 'error', message: 'Invalid JSON' })
return // Does not close — client can retry
}
// ...
})
```
Already correct. No change needed.
### Component 13: Session Log Path Traversal Prevention (Backend)
**File:** `backend/src/routes/terminal.ts`
**Fix:** Validate `integrationId` is a positive integer before constructing the log path:
```typescript
function sessionLogPath(integrationId: number): string | null {
// Validate: must be a positive integer (no NaN, no negative, no float)
if (!Number.isInteger(integrationId) || integrationId <= 0) return null
mkdirSync(SESSION_LOG_DIR, { recursive: true })
const stamp = new Date().toISOString().replace(/[:.]/g, '-')
return join(SESSION_LOG_DIR, `${integrationId}_${stamp}.log`)
}
```
Callers check for `null` return and skip logging.
### Component 14: CloudFormation Test Deploy Template
**File:** `infra/test-deploy.yml`
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: >-
ArchNest test deployment - t4g.small EC2 with Docker in us-east-1.
Budget alarm at $30/month. Destroy after testing.
Parameters:
KeyPairName:
Type: AWS::EC2::KeyPair::KeyName
Description: SSH key pair for instance access
NotificationEmail:
Type: String
Description: Email for budget alarm notifications
Resources:
SecurityGroup:
Type: AWS::EC2::SecurityGroup
Properties:
GroupDescription: ArchNest test - SSH + HTTP/HTTPS
SecurityGroupIngress:
- IpProtocol: tcp
FromPort: 22
ToPort: 22
CidrIp: 0.0.0.0/0
- IpProtocol: tcp
FromPort: 80
ToPort: 80
CidrIp: 0.0.0.0/0
- IpProtocol: tcp
FromPort: 443
ToPort: 443
CidrIp: 0.0.0.0/0
Instance:
Type: AWS::EC2::Instance
Properties:
InstanceType: t4g.small
KeyName: !Ref KeyPairName
ImageId: !Sub '{{resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-arm64}}'
SecurityGroupIds:
- !GetAtt SecurityGroup.GroupId
UserData:
Fn::Base64: |
#!/bin/bash -xe
dnf update -y
dnf install -y docker git
systemctl enable docker && systemctl start docker
usermod -aG docker ec2-user
# Install Docker Compose v2 plugin
mkdir -p /usr/local/lib/docker/cli-plugins
curl -fsSL "https://github.com/docker/compose/releases/latest/download/docker-compose-linux-$(uname -m)" \
-o /usr/local/lib/docker/cli-plugins/docker-compose
chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
BudgetAlarm:
Type: AWS::Budgets::Budget
Properties:
Budget:
BudgetName: archnest-test-budget
BudgetLimit:
Amount: 30
Unit: USD
TimeUnit: MONTHLY
BudgetType: COST
NotificationsWithSubscribers:
- Notification:
NotificationType: ACTUAL
ComparisonOperator: GREATER_THAN
Threshold: 80
Subscribers:
- SubscriptionType: EMAIL
Address: !Ref NotificationEmail
Outputs:
PublicIP:
Value: !GetAtt Instance.PublicIp
Description: Instance public IP for SSH access
InstanceId:
Value: !Ref Instance
Description: EC2 instance ID
```
## Data Models
No new database tables or schema changes are required. All fixes operate on existing data structures.
**New TypeScript interface (WebSocket auth message):**
```typescript
interface AuthMessage {
type: 'auth'
token: string
}
// ClientMessage union extended:
type ClientMessage =
| AuthMessage
| { type: 'connect'; integrationId?: number; cols?: number; rows?: number; tmuxSession?: string }
| { type: 'input'; data?: string }
| { type: 'resize'; cols?: number; rows?: number }
| { type: 'disconnect' }
| { type: 'list_tmux'; integrationId?: number }
```
## Error Handling
| Component | Error Case | Behavior |
|-----------|-----------|----------|
| Path validation | Traversal/absolute path | HTTP 400, descriptive message |
| WS auth gate | Missing/invalid token | Error frame + close connection |
| Agent token | Wrong token (any form) | HTTP 401, identical response |
| Data import | Body > 10MB | HTTP 413 (Fastify built-in) |
| JSON parse | Malformed WS message | Error frame, keep connection open |
| Session log | Invalid integrationId | Skip logging, no error to user |
| Error Boundary | Component crash | Fallback UI with reload button |
| Sidebar | API rejection | Swallow error, show "Checking…" |
## Testing Strategy
**Unit tests** (example-based): WebSocket session lifecycle (Req 1), Sidebar error catch (Req 4), frontend token removal from URL (Req 5.15.2), Error Boundary rendering (Req 11), CORS default (Req 10), body size limit (Req 8).
**Property tests** (100+ iterations): Input validation functions (tmux names, container refs, path traversal, integration IDs), token comparison correctness, WS auth gate, SSH cleanup guarantee, JSON parse resilience.
**Smoke tests**: CloudFormation template structure validation (Req 14) — verify resource types, parameters, and outputs exist with correct values.
## Correctness Properties
*A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*
### Property 1: tmux Session Name Validation Prevents Injection
*For any* string `s`, if `TMUX_NAME_RE.test(s)` returns true, then `s` contains only characters from `[A-Za-z0-9_-]` and has length 164, and the resulting shell command `tmux attach -t '${s}'` is safe from injection. *For any* string `s` that contains characters outside this set or exceeds 64 chars, the validator SHALL reject it.
**Validates: Requirements 2.1, 2.3**
### Property 2: Container Reference Validation and Safe Escaping
*For any* string `ref`, `isValidContainerRef(ref)` returns true if and only if `ref` matches `^[A-Za-z0-9][A-Za-z0-9_.-]{0,127}$`. *For any* valid container reference, `shQuote(ref)` produces a string that, when embedded in a shell command, cannot break out of single quotes.
**Validates: Requirements 3.1, 3.3**
### Property 3: Path Traversal Prevention
*For any* path string `p`, if `p` starts with `/` or if `posix.normalize(p)` contains a `..` component that would escape the root (starts with `../`, equals `..`, or contains `/../`), then `validatePath(p)` SHALL throw an error. *For any* relative path without traversal components, `validatePath(p)` SHALL return the normalized path.
**Validates: Requirements 6.1, 6.2**
### Property 4: Agent Token Comparison Correctness
*For any* pair of token strings `(presented, expected)`, the `agentTokenValid` function returns `ok: true` if and only if `presented === expected`. The function SHALL never throw regardless of input lengths, and SHALL return the same HTTP 401 response shape for all rejection cases (length mismatch or content mismatch).
**Validates: Requirements 7.1, 7.2, 7.3**
### Property 5: WebSocket Auth Gate Rejects Unauthenticated Messages
*For any* WebSocket message received before a valid `auth` message has been processed, the backend SHALL respond with an error frame and close the connection. No `connect`, `list_tmux`, `input`, or `resize` message SHALL be processed in the unauthenticated state.
**Validates: Requirements 5.3, 5.4**
### Property 6: SSH Connection Cleanup Guarantee
*For any* operation function `fn` passed to `withSshClient`, regardless of whether `fn` resolves or rejects, both the primary SSH client and the jump-host client (if any) SHALL have `.end()` called before the `withSshClient` promise settles.
**Validates: Requirements 9.1, 9.2**
### Property 7: WebSocket Invalid JSON Resilience
*For any* byte sequence sent to the Docker exec WebSocket that is not valid JSON, the handler SHALL send `{ type: 'error', message: 'Invalid JSON' }` and SHALL NOT close the WebSocket connection.
**Validates: Requirements 12.1, 12.2**
### Property 8: Integration ID Numeric Validation for Session Logs
*For any* value `v` used as `integrationId` when session logging is enabled, if `v` is not a positive integer (`Number.isInteger(v) && v > 0`), then `sessionLogPath` SHALL return `null` and no filesystem write SHALL occur.
**Validates: Requirements 13.1, 13.2**