Real-Time Chat System Architecture — What PostIdea Chose and Why

Architecture decisions for a real-time chat application aren't just technical preferences. Each one has a concrete consequence for what fails verification if you skip it. This page walks through every decision the PostIdea architecture engine made for the chat spec — what was decided, why, and what breaks in your implementation if you ignore it.

The decisions

Database: PostgreSQL

The spec requires message history persistence (FR-04), searchable message content (FR-06), and room participant tracking (FR-02). These are relational queries with clear entity relationships: rooms have messages, messages belong to users, users belong to rooms. PostgreSQL is the correct default.

What fails verification if you skip this: Message search (FR-06) requires full-text search on message content. Without PostgreSQL's tsvector or an equivalent, you'll likely fall back to a LIKE '%keyword%' query. A leading wildcard cannot use an ordinary B-tree index, so every search scans the whole table and degrades as history grows, becoming unusable above roughly 10,000 messages. NFR-02 (10,000 concurrent users) implies a message volume that makes naive search untenable.
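As a rough illustration of the tsvector approach, the sketch below keeps the DDL and the search query as SQL strings and builds parameters for a DB-API driver that uses %s placeholders (psycopg, for example). The table name messages, its columns, and the index name are hypothetical; the spec does not define a schema. The generated column assumes PostgreSQL 12 or later.

```python
# Hypothetical schema: messages(id, room_id, body, created_at).
# One-time DDL: a stored tsvector column plus a GIN index over it.
SEARCH_DDL = """
ALTER TABLE messages
    ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (to_tsvector('english', body)) STORED;
CREATE INDEX messages_search_idx ON messages USING GIN (search_vector);
"""

# Full-text match instead of LIKE '%keyword%'; the GIN index serves the @@ test.
SEARCH_SQL = """
SELECT id, body, created_at
FROM messages
WHERE room_id = %s
  AND search_vector @@ plainto_tsquery('english', %s)
ORDER BY created_at DESC
LIMIT %s
"""


def build_search_query(room_id, keyword, limit=50):
    """Return (sql, params) ready for cursor.execute(sql, params)."""
    cleaned = keyword.strip()
    if not cleaned:
        raise ValueError("keyword must not be empty")
    return SEARCH_SQL, (room_id, cleaned, limit)
```

Passing the keyword as a parameter rather than interpolating it into the string keeps user input out of the SQL text.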

Service Structure: Monolith

Expected user load is 10,000 concurrent connections. The rules engine's threshold for recommending microservices is 50,000. At 10,000 users, microservices add 3–6 months of operational overhead (service mesh, distributed tracing, inter-service auth) with no measurable benefit. A monolith is the correct architecture at this scale.

What fails verification if you skip this: Nothing fails immediately. But if you build microservices at 10,000 users, your implementation will contain infrastructure code (API gateways, service registries, health check endpoints) that the spec never asked for. Layer 3 semantic audit will flag these as out-of-scope.

Realtime Layer: WebSocket

The spec explicitly requires real-time message delivery (FR-01) with p95 latency < 200ms (NFR-01). WebSocket is the only viable choice for bidirectional real-time communication at this latency target. Server-Sent Events (SSE) is unidirectional and adds 100–300ms overhead for client-to-server messages via HTTP POST.

What fails verification if you skip this: If you implement SSE instead of WebSocket, NFR-01 (p95 < 200ms) will fail under any real load. The round-trip time for a message send (HTTP POST) plus receive (SSE event) exceeds 200ms on 4G networks. WebSocket keeps a single persistent connection open in both directions, so sending a message does not pay the cost of a separate HTTP request each time.
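To make the NFR-01 target concrete, here is a minimal sketch of a p95 check over measured delivery latencies. How the verification step actually samples latency is not described on this page; the function names and the nearest-rank percentile method are assumptions for illustration.

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) using integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_nfr_01(samples_ms, budget_ms=200):
    """True when the p95 latency is strictly below the budget."""
    return p95(samples_ms) < budget_ms
```

With 20 samples, the check tolerates one slow outlier but not two, which is why a transport that is occasionally slow can pass while one that is routinely slow cannot.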

Cache Layer: Redis (Pub/Sub)

Redis is required for two reasons: (1) WebSocket connection state must be shared across multiple server instances for horizontal scaling (NFR-02), and (2) message broadcast to all connected clients in a room requires a pub/sub mechanism. Redis pub/sub is the standard solution.

What fails verification if you skip this: Without Redis pub/sub, you cannot horizontally scale WebSocket servers. A message sent to server A will not reach clients connected to server B. NFR-02 (10,000 concurrent connections) requires multiple server instances behind a load balancer. Without Redis, you're limited to a single server, which caps you at ~5,000 connections before CPU saturation.
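The cross-instance problem can be sketched without a running Redis. In the example below, InMemoryBroker is a stand-in for Redis pub/sub, and ChatServer models one WebSocket server instance with clients represented as plain lists. All class, method, and channel names are hypothetical; a real deployment would publish through a Redis client instead.

```python
from collections import defaultdict


class InMemoryBroker:
    """Stand-in for Redis pub/sub: every subscriber of a channel gets each message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in list(self._subscribers[channel]):
            callback(message)


class ChatServer:
    """One server instance; each connected client is modeled as an inbox list."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker
        self.rooms = defaultdict(list)  # room -> inboxes of local clients

    def connect(self, room, inbox):
        if room not in self.rooms:
            # First local client in this room: subscribe this instance once.
            self.broker.subscribe(
                f"room:{room}", lambda msg, r=room: self._deliver(r, msg)
            )
        self.rooms[room].append(inbox)

    def _deliver(self, room, message):
        for inbox in self.rooms[room]:
            inbox.append(message)

    def send(self, room, message):
        # Publish instead of delivering locally, so every instance sees it.
        self.broker.publish(f"room:{room}", message)
```

A client on server A and a client on server B in the same room both receive a message sent through either instance, because delivery always goes through the shared channel rather than the sender's local client list.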

Auth Strategy: JWT Stateless

The spec contains no signals that require server-side session management. Stateless JWT auth is the correct default for a chat application at this scale. WebSocket connections authenticate once, on handshake, using the JWT.

What fails verification if you skip this: Nothing fails against the spec as written, but stateless JWT auth has one known failure mode: revocation. If a user account is compromised, you cannot invalidate their token until it expires. The spec doesn't require revocation, but if your implementation adds it later without a Redis blacklist of revoked tokens, you'll have a security gap. The architecture decision is correct for the spec as written.
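A handshake-time check might look like the sketch below, which verifies an HS256-signed JWT using only the standard library. The function names are hypothetical, and the choice of HS256 with a shared secret is an assumption; the spec does not name an algorithm. In production you would more likely rely on a maintained JWT library than on hand-rolled verification.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(segment):
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims, secret):
    """Issue a token; included only so the example is self-contained."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode("ascii"), hashlib.sha256)
    return f"{header}.{payload}.{_b64url_encode(mac.digest())}"


def verify_on_handshake(token, secret, now=None):
    """Return the claims if the token is valid, otherwise raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    if not isinstance(claims, dict):
        raise ValueError("malformed claims")
    signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    current = time.time() if now is None else now
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        raise ValueError("token expired or missing exp")
    return claims
```

Because the check runs once per connection, a token that expires mid-session stays usable until the socket closes. That is the same revocation gap described above, so short expiry times and periodic reconnects are worth considering.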

Background Jobs: Message Persistence Queue

NFR-03 requires ≥99.9% of messages persisted to database within 1 second of send. Synchronous database writes in the WebSocket handler will breach NFR-01 (p95 < 200ms). Messages must be queued for async persistence.

What fails verification if you skip this: If you write messages to PostgreSQL synchronously in the WebSocket handler, NFR-01 (p95 < 200ms) will fail. Database writes add 20–50ms latency per message. At 10,000 concurrent users, database connection pool exhaustion will push latency above 500ms. A background job queue (Redis-backed or RabbitMQ) is required to meet both NFR-01 and NFR-03.
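The enqueue-then-persist pattern can be sketched with asyncio. Here the handler only puts the message on a queue, and a separate worker drains it in batches. The in-process asyncio.Queue is a stand-in chosen to keep the example runnable; it loses messages if the process crashes, which is one reason a Redis-backed queue or RabbitMQ is called for. The function names and fake_write_batch are hypothetical.

```python
import asyncio


async def persistence_worker(queue, write_batch, max_batch=100):
    """Drain queued messages and pass them to write_batch in groups."""
    while True:
        batch = [await queue.get()]
        while len(batch) < max_batch and not queue.empty():
            batch.append(queue.get_nowait())
        try:
            await write_batch(batch)
        finally:
            for _ in batch:
                queue.task_done()


async def handle_incoming(queue, message):
    """WebSocket handler path: enqueue and return without touching the database."""
    queue.put_nowait(message)


async def demo(message_count=250):
    stored = []

    async def fake_write_batch(batch):
        # Stands in for a bulk INSERT into PostgreSQL.
        await asyncio.sleep(0.01)
        stored.extend(batch)

    queue = asyncio.Queue()
    worker = asyncio.create_task(persistence_worker(queue, fake_write_batch))
    for i in range(message_count):
        await handle_incoming(queue, {"id": i, "body": f"msg {i}"})
    await queue.join()  # wait until every queued message has been written
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return stored
```

Batching also reduces the number of database round trips, which helps with the connection pool pressure described above.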

Presence Tracking: Redis TTL

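FR-03 requires online user presence tracking. Redis TTL keys are the standard solution: set a key with a 30-second TTL and refresh it on every WebSocket heartbeat. The heartbeat interval must be shorter than the TTL, otherwise the key lapses between heartbeats. If the key expires, the user is offline.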

What fails verification if you skip this: If you track presence in PostgreSQL, you'll breach NFR-01. Presence updates happen on every WebSocket heartbeat (every 10–30 seconds per user). At 10,000 concurrent users, that's roughly 330–1,000 database writes per second (10,000 ÷ 30 up to 10,000 ÷ 10) just for presence. Redis handles this trivially. PostgreSQL does not.
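The TTL mechanism can be sketched as follows. PresenceStore is an in-memory stand-in that mimics a Redis key written with an expiry (as with SET key value EX 30); it is not a Redis client, and the class and function names are hypothetical. The helper at the end reproduces the write-rate arithmetic for a given heartbeat interval.

```python
import time


class PresenceStore:
    """In-memory stand-in for Redis keys that expire after a TTL."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._expires_at = {}

    def heartbeat(self, user_id):
        # Each heartbeat rewrites the key and restarts its TTL.
        self._expires_at[user_id] = self.clock() + self.ttl

    def is_online(self, user_id):
        expires = self._expires_at.get(user_id)
        if expires is None:
            return False
        if self.clock() >= expires:
            del self._expires_at[user_id]  # key has lapsed: user is offline
            return False
        return True


def heartbeat_writes_per_second(users, interval_seconds):
    """Average presence writes per second across all connected users."""
    return users / interval_seconds
```

No explicit "user went offline" write is needed: a client that disconnects uncleanly simply stops sending heartbeats and its key lapses.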

Architecture diagram

graph TD
    User([User])
    FE[Frontend]
    WS[WebSocket Server]
    API[REST API]
    DB[(PostgreSQL)]
    Redis[(Redis)]
    Queue[Message Queue]
    User --> FE
    FE --> WS
    FE --> API
    WS --> Redis
    WS --> Queue
    API --> DB
    Queue --> DB
    Redis --> WS

The point of this page

Every architecture decision above was made deterministically from the spec constraints — no LLM guessing, no preference-based choices. The rules engine reads your constraints and outputs decisions with explicit tradeoffs and future consequences.

The "what fails verification" annotations are the part that matters most. Architecture decisions aren't abstract. They have direct consequences for whether your implementation passes or fails the spec it was built against.
