Component Overview
Concordance runs as a single Bun process per node. All components are wired together in `main.ts` via callback injection: no dependency injection framework, no service locator. Bun’s single-threaded event loop serializes all message handling naturally, eliminating the need for mutexes or locks.
Raft Consensus
Concordance implements the Raft consensus algorithm from Ongaro and Ousterhout (2014). Every write goes through Raft to guarantee linearizable consistency across all nodes.

Roles
Each node is in one of three states at any time:

| Role | Description |
|---|---|
| Leader | Accepts client writes, replicates log entries to followers, sends heartbeats |
| Follower | Accepts log entries from the leader, votes in elections, serves reads from local state |
| Candidate | Running for leader after election timeout expires |
Election
- A follower’s election timer expires (randomized 150–300ms)
- It increments its term, votes for itself, and transitions to candidate
- It sends `RequestVote` RPCs to all peers
- If it receives votes from a majority (quorum), it becomes leader
- The new leader sends an immediate heartbeat (empty `AppendEntries`) to assert authority
- A noop entry is committed to establish the leader’s term in the log
Log Replication
When the leader receives a write:

- It appends the command to its local log (SQLite `raft.db`)
- It sends `AppendEntries` RPCs to all followers
- Each follower checks log consistency at `prevLogIndex`/`prevLogTerm`
- On success, the follower appends the entries and acknowledges
- On failure (log gap or conflict), the leader decrements `nextIndex` and retries
- Once a majority of nodes has acknowledged, the entry is committed
- The leader advances `commitIndex` and resolves the client’s pending promise
- Followers learn of the new `commitIndex` via the next `AppendEntries` and apply it locally
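The follower-side consistency check described above can be sketched as follows; the `LogEntry` shape and function signature are illustrative assumptions, not Concordance's actual API:

```typescript
// Sketch of the follower-side AppendEntries consistency check.
interface LogEntry {
  index: number;
  term: number;
  data: string;
}

function handleAppendEntries(
  log: LogEntry[],
  prevLogIndex: number,
  prevLogTerm: number,
  entries: LogEntry[],
): { success: boolean; log: LogEntry[] } {
  // The follower must hold an entry at prevLogIndex with a matching term
  const prev = log.find((e) => e.index === prevLogIndex);
  if (prevLogIndex > 0 && (!prev || prev.term !== prevLogTerm)) {
    // Log gap or conflict: the leader will decrement nextIndex and retry
    return { success: false, log };
  }
  // Truncate any conflicting suffix, then append the new entries
  const kept = log.filter((e) => e.index <= prevLogIndex);
  return { success: true, log: [...kept, ...entries] };
}
```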
Timing Parameters
| Parameter | Default | Notes |
|---|---|---|
| Election timeout | 150–300ms | Randomized per node to prevent split votes |
| Heartbeat interval | 50ms | Must be well below election timeout |
| Max entries per AppendEntries | 100 | Caps batch size for replication RPCs |
| Snapshot threshold | 10,000 entries | Log compaction trigger |
| Proposal timeout | 5,000ms | Max wait for quorum commit |
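The randomized election timeout reduces to a one-line helper; the constants come from the table above, while the timer wiring itself is assumed:

```typescript
// Randomized election timeout, per the timing table above.
const ELECTION_TIMEOUT_MIN_MS = 150;
const ELECTION_TIMEOUT_MAX_MS = 300;

// Each node draws a fresh random timeout so two followers rarely time
// out simultaneously, which prevents repeated split votes.
function randomElectionTimeout(): number {
  return (
    ELECTION_TIMEOUT_MIN_MS +
    Math.random() * (ELECTION_TIMEOUT_MAX_MS - ELECTION_TIMEOUT_MIN_MS)
  );
}
```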
Log Entry Types
| Type | Value | Description |
|---|---|---|
| Command | 0 | KV operation (set, delete, batch) |
| Config | 1 | Cluster membership change |
| Noop | 2 | Leader establishment (committed after election) |
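The table maps naturally onto a numeric enum; the surrounding entry shape below is a guess at what the log stores, not the actual persisted type:

```typescript
// Entry type tags, matching the values in the table above.
enum EntryType {
  Command = 0, // KV operation (set, delete, batch)
  Config = 1,  // cluster membership change
  Noop = 2,    // leader establishment after election
}

// Hypothetical shape of a persisted log entry.
interface RaftLogEntry {
  index: number;
  term: number;
  type: EntryType;
  data: Uint8Array; // JSON-serialized command payload for Command entries
}
```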
SQLite Storage
Each node maintains two SQLite databases:

state.db (KV Store)
Stores the actual key-value data and change history. Uses WAL mode with `PRAGMA synchronous=NORMAL` for performance.
Indexes: `kv(namespace)`, `kv(expires_at) WHERE expires_at IS NOT NULL`, and `changelog(tenant_id, seq)`.
raft.db (Raft Log)
Stores the Raft log and persistent metadata. Uses `PRAGMA synchronous=FULL` because Raft correctness requires that committed entries survive crashes.
Why Two Databases
Separating the KV state from the Raft log enables independent lifecycle management:

- Snapshots serialize `state.db` as a single `Uint8Array` via `bun:sqlite`’s `db.serialize()`. The Raft log can then be compacted (entries deleted) without affecting the KV state.
- Durability tradeoffs differ: the Raft log needs `synchronous=FULL` (every write is fsync’d), while the KV store uses `synchronous=NORMAL` (faster, and safe because Raft guarantees replay on crash).
- Recovery is straightforward: if `state.db` is lost, replay the Raft log. If `raft.db` is lost, request a snapshot from the leader.
Namespace Design
All keys are organized under tenant-scoped namespaces. The namespace is the first component of the compound primary key in `state.db`.
Format
Scope Resolution
Clients specify a high-level scope (e.g., “user”, “tenant”, “session”). Concordance resolves this to a full namespace using the authenticated identity:

| Client Scope | Resolved Namespace | Additional Param |
|---|---|---|
| user | tenant:{tid}/user:{uid}/preferences | — |
| tenant | tenant:{tid}/settings | — |
| session | tenant:{tid}/sessions/{sid} | sessionId required |
| project | tenant:{tid}/projects/{pid} | projectId required |
| device | tenant:{tid}/user:{uid}/devices/{did} | deviceId optional |
Tenant Isolation
The `extractTenantId()` function parses the tenant ID from any namespace string. This enables:
- Filtered change polling: `GET /api/v1/changes?tenant=acme` returns only changes for that tenant
- Scoped pub/sub: changes are published to both `tenant:{id}` (broad) and `ns:{namespace}` (targeted) topics
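Given the namespace formats in the scope table, `extractTenantId()` presumably amounts to parsing the leading `tenant:{tid}` segment. This sketch assumes that format rather than reproducing the real implementation:

```typescript
// Parse the tenant ID from a namespace like "tenant:acme/user:u1/preferences".
// Assumes every namespace begins with a "tenant:{tid}" segment.
function extractTenantId(namespace: string): string | null {
  const match = namespace.match(/^tenant:([^/]+)/);
  return match ? match[1] : null;
}

extractTenantId("tenant:acme/user:u1/preferences"); // → "acme"
```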
Pub/Sub Flow
Real-time change notifications flow through Bun’s native pub/sub system.

Subscription Lifecycle
- Client connects to the `/stream` WebSocket endpoint (with Bearer token auth)
- Client sends `watch/subscribe` with a namespace
- Server calls `ws.subscribe("ns:{namespace}")` (Bun’s native topic subscription)
- When a change occurs in that namespace, the FSM publishes a `watch/change` notification
- Bun routes the message to all subscribed sockets automatically
- On disconnect, all subscriptions are cleaned up via `ws.unsubscribe()`
Change notifications are delivered as JSON-RPC notifications (no `id`, no response expected).
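The subscribe step can be sketched as a message handler. The message shape and the minimal `WatchSocket` interface are assumptions; the `ns:{namespace}` topic naming comes from the lifecycle above:

```typescript
// Minimal interface covering the Bun ServerWebSocket methods used here.
type WatchSocket = {
  subscribe(topic: string): void;
  unsubscribe(topic: string): void;
};

// Handle one inbound /stream message; only the subscribe case is shown.
function handleStreamMessage(ws: WatchSocket, raw: string): void {
  const msg = JSON.parse(raw);
  if (msg.method === "watch/subscribe" && msg.params?.namespace) {
    // Bun's native topic subscription: subsequent publishes to this
    // topic are fanned out to every subscribed socket automatically.
    ws.subscribe(`ns:${msg.params.namespace}`);
  }
}
```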
Write Path (Detailed)
A complete write operation flows through these stages:

- Client request arrives via HTTP PUT or WebSocket `kv/set`
- Leader check: if this node is not the leader, return a 307 redirect (HTTP) or a `-32001 Not Leader` error (WebSocket) with the leader’s address
- Proposal: the leader serializes the command to JSON bytes and calls `raft.propose(data)`
- Log append: the entry is written to `raft.db` with `synchronous=FULL`
- Replication: `AppendEntries` RPCs are sent to all followers
- Quorum: once a majority acknowledges, `commitIndex` advances
- Apply: `raft.onApply` calls `fsm.apply()`, which writes to `state.db`
- Pub/sub: the FSM publishes the change event to Bun’s topic system
- Response: the pending promise resolves, and the client receives the result
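The proposal and response stages imply per-proposal pending promises bounded by the 5,000 ms proposal timeout from the timing table. A sketch, with invented names (`pending`, `awaitCommit`, `resolveCommit`):

```typescript
const PROPOSAL_TIMEOUT_MS = 5000;

type Pending = {
  resolve: (result: unknown) => void;
  timer: ReturnType<typeof setTimeout>;
};

// Pending proposals keyed by log index (illustrative bookkeeping).
const pending = new Map<number, Pending>();

// Called by the leader after appending the entry at `index` to its log.
function awaitCommit(index: number): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      pending.delete(index);
      reject(new Error("proposal timed out waiting for quorum"));
    }, PROPOSAL_TIMEOUT_MS);
    pending.set(index, { resolve, timer });
  });
}

// Called once commitIndex passes `index` and the FSM has applied the entry.
function resolveCommit(index: number, result: unknown): void {
  const entry = pending.get(index);
  if (!entry) return;
  clearTimeout(entry.timer);
  pending.delete(index);
  entry.resolve(result);
}
```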
Snapshots
When the Raft log grows beyond 10,000 entries (configurable via `snapshotThreshold`), the SnapshotManager creates a snapshot:
- `store.serialize()` produces the entire `state.db` as a `Uint8Array`
- The Raft log is compacted: all entries up to `lastIncludedIndex` are deleted from `raft.db`
- Snapshot metadata (`lastIncludedIndex`, `lastIncludedTerm`) is stored in memory
A follower that has fallen behind the compacted log receives an `InstallSnapshot` notice followed by the binary snapshot data over WebSocket. The follower:
- Receives the JSON notice with metadata
- Receives the binary `state.db` data
- Writes the snapshot to disk and opens a new `KvStore`
- Compacts its Raft log up to `lastIncludedIndex`
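The snapshot trigger reduces to counting entries since the previous snapshot; the function name here is invented:

```typescript
const SNAPSHOT_THRESHOLD = 10_000; // configurable via snapshotThreshold

// True when enough entries have accumulated since the last snapshot.
function shouldSnapshot(lastApplied: number, lastIncludedIndex: number): boolean {
  return lastApplied - lastIncludedIndex >= SNAPSHOT_THRESHOLD;
}
```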
Single-Port Design
All traffic flows through one `Bun.serve()` listener (default `:4100`):
| Path | Protocol | Purpose |
|---|---|---|
/api/v1/* | HTTP | REST API for KV operations and cluster management |
/stream | WebSocket | Client JSON-RPC (reads, writes, subscriptions) |
/raft | WebSocket | Peer-to-peer Raft messages (JSON) and snapshots (binary) |
The `/raft` endpoint uses a `nodeId` query parameter to identify the connecting peer.
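The dispatch inside the single `fetch` handler reduces to path matching. This pure helper illustrates the routing table above; it is not the actual server code:

```typescript
type Route = "api" | "stream" | "raft" | null;

// Map a request path onto the three endpoint families from the table.
function routeFor(pathname: string): Route {
  if (pathname.startsWith("/api/v1/")) return "api";
  if (pathname === "/stream") return "stream";
  if (pathname === "/raft") return "raft";
  return null;
}
```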