APRIL 17, 2026

Building hnswz: a vector database, in Zig

Mehdi Boujid · 14 min read

How it started

I ran into a problem where I needed to search through a pile of local files semantically — not by filename, not by grep, but by meaning. “Find me that thing I read about compaction last month” sort of search.

The path of least resistance is to pipe everything to a hosted embedding API and pay by the token. I didn’t want to do that. The files are on my laptop, the laptop can clearly do the work, and paying a cloud every time I poke around my own data is a weird reversal of who’s supposed to be serving whom.

The other thing I wanted was an excuse to write Zig again.

So: how hard could it be to write a vector database?

Turns out the ceiling on “hard” is pretty high. This post is a tour of what hnswz became — what’s in it, which pieces are actually interesting, and which parts I’m still thinking about.

Status: work in progress. Don’t point production at it.

What a vector DB is, in one paragraph

Text goes in, a model turns it into a fixed-dimension float vector (“embedding”), and similar-meaning texts end up close together in that space. Search = find the nearest vectors to a query vector. Doing this exactly is O(N·dim) per query, which at N = 1M and dim = 4096 is not great. So you use an approximate nearest-neighbor index. The one everyone uses is HNSW — Hierarchical Navigable Small World graphs.
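
For contrast, here is roughly what the exact version amounts to. This is a sketch I'm including for scale, not hnswz code, and the names are made up: walk every stored vector, keep the best score, pay the full N·dim multiply-adds on every query.

const std = @import("std");

// Hypothetical brute-force baseline, not hnswz code: scan every stored vector
// and keep the single best match. This is the O(N·dim) cost HNSW exists to avoid.
fn bruteForceNearest(vectors: []const f32, dim: usize, query: []const f32) usize {
    const n = vectors.len / dim;
    var best_id: usize = 0;
    var best_score: f32 = -std.math.floatMax(f32);
    for (0..n) |id| {
        const v = vectors[id * dim .. (id + 1) * dim];
        var score: f32 = 0;
        for (v, query) |x, y| score += x * y; // dot product = similarity for normalized vectors
        if (score > best_score) {
            best_score = score;
            best_id = id;
        }
    }
    return best_id;
}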

The pieces

[Figure: high-level architecture of hnswz. Local files feed an embedder (Ollama, text → []f32), which feeds the vector store ([]f32, tombstones) and the HNSW graph (layered, navigable), producing ranked results.]
Fig. 01 — File text is embedded, stored, indexed, and ranked.

There are a few building blocks I needed before the interesting part:

  • a vector store — a big flat []f32 of dim × capacity, with tombstones and a free list so deletes don’t fragment
  • a distance function — cosine / dot, but fast
  • an embedder — something to turn text into those vectors; I use Ollama locally
  • a graph index — HNSW, where the actual work happens

I’ll skim the first three and spend real time on HNSW and the server.

Storage

Flat, contiguous, 64-byte aligned, allocated once. dim is a runtime field set from config, not a comptime parameter — I want one binary, many configurations. Deletes mark a tombstone bit and push the slot onto a free-list; the next add pops from that LIFO before bumping the high-water mark. That’s all it is.
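
A sketch of that shape, with field names that are mine rather than hnswz's actual layout (a bool slice stands in for the tombstone bit):

// Sketch of the store described above. One flat buffer, runtime dim,
// LIFO free list. Field names are illustrative, not hnswz's real types.
const VectorStore = struct {
    data: []align(64) f32, // dim * capacity floats, allocated once at startup
    dim: usize,            // runtime config value, not a comptime parameter
    high_water: u32 = 0,   // next never-used slot
    tombstone: []bool,     // true => slot is deleted but still allocated
    free: []u32,           // LIFO of reusable slot ids
    free_len: usize = 0,

    fn add(self: *VectorStore, vec: []const f32) u32 {
        // Pop a tombstoned slot if one exists, otherwise bump the high-water mark.
        var id: u32 = undefined;
        if (self.free_len > 0) {
            self.free_len -= 1;
            id = self.free[self.free_len];
        } else {
            id = self.high_water;
            self.high_water += 1;
        }
        @memcpy(self.data[id * self.dim ..][0..self.dim], vec);
        self.tombstone[id] = false;
        return id;
    }

    fn remove(self: *VectorStore, id: u32) void {
        self.tombstone[id] = true;
        self.free[self.free_len] = id;
        self.free_len += 1;
    }
};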

Distance, SIMD

Zig’s @Vector makes the SIMD dot product basically free to write:

const std = @import("std");

const VecLen = std.simd.suggestVectorLength(f32) orelse 4;
const Vec4 = @Vector(VecLen, f32);

pub fn dot(a: []const f32, b: []const f32) f32 {
    var acc: Vec4 = @splat(0);
    var i: usize = 0;
    while (i + VecLen <= a.len) : (i += VecLen) {
        const va: Vec4 = a[i..][0..VecLen].*;
        const vb: Vec4 = b[i..][0..VecLen].*;
        acc += va * vb;
    }
    var sum: f32 = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) sum += a[i] * b[i];
    return sum;
}

std.simd.suggestVectorLength asks the target what lane-width it wants, so I don’t bake a number in. On Apple Silicon that’s 4 × f32; a modern x86 would be 8 or 16. One file, portable, no intrinsics.

Cosine distance is 1 - dot(a,b) / (||a|| · ||b||). If the caller normalizes vectors on the way in, we skip the norms at query time — the hot path becomes a single dot product.
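
Concretely, with L2-normalized inputs the norms are both 1, so the distance collapses to 1 - dot. A sketch reusing the dot above; normalize is the once-per-insert step, not part of the query path:

// With L2-normalized inputs, ||a|| = ||b|| = 1, so cosine distance is just
// 1 - dot(a, b). Sketch only: assumes the caller normalized on the way in.
pub fn cosineDistance(a: []const f32, b: []const f32) f32 {
    return 1.0 - dot(a, b);
}

// Normalization on insert, done once, off the query path.
pub fn normalize(v: []f32) void {
    const n = @sqrt(dot(v, v));
    if (n == 0) return;
    for (v) |*x| x.* /= n;
}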

Embedder

A tiny vtable interface (embed(text, out)) with an Ollama implementation behind it. All buffers preallocated at startup — no allocations on the request path. That’s a pattern you’ll see repeated everywhere in this project, because it’s what lets you reason about memory.
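
The interface is the usual Zig vtable dance, roughly this shape (a sketch; hnswz's actual field names may differ):

// Sketch of the vtable-style interface. Shapes are illustrative.
const Embedder = struct {
    ptr: *anyopaque,
    embedFn: *const fn (ptr: *anyopaque, text: []const u8, out: []f32) anyerror!void,

    pub fn embed(self: Embedder, text: []const u8, out: []f32) !void {
        return self.embedFn(self.ptr, text, out);
    }
};

The Ollama implementation would keep its HTTP client and preallocated request/response buffers behind that ptr and cast back to its own type inside its embed function; the caller only ever sees embed(text, out).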

HNSW — the actual core

This is the part worth zooming in on.

The idea: build a layered graph where the bottom layer (layer 0) contains every vector and is densely connected, and each layer above is a progressively sparser sampling. To search, you enter at the top, greedily hop toward the query at each level, and when you hit the bottom you run a beam search to collect the final neighbors.

[Figure: HNSW layered graph (layers 2, 1, 0) with a search path entering at the top and descending toward the query.]
Fig. 02 — Entry at the top; greedy descent through layers; beam search at layer 0.

Picking a level

When a new vector comes in, you pick the highest layer it lives on by sampling from a geometric distribution. Most vectors get level 0 (bottom only); very few climb high. This is where the “hierarchical” part comes from:

pub fn randomLevel(self: *Self) u8 {
    const uni = 1.0 - random.float(f64);     // (0, 1]
    const l: f64 = @floor(-@log(uni) * ml);  // ml = 1 / ln(M)
    return @intFromFloat(@min(l, MAX_LEVEL));
}

M is the graph fan-out (neighbors per node per layer). With M=16 the expected layer-above share is 1/16, which is exactly what makes “enter at the top, descend” cheap: upper layers are small, so the greedy phase touches a handful of nodes.
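
To put numbers on that: with this level formula, P(level ≥ ℓ) = (1/M)^ℓ, so at M = 16 and a million vectors you'd expect roughly 62,500 nodes to reach layer 1, about 3,900 to reach layer 2, and about 240 to reach layer 3. The greedy descent only ever walks that tiny upper skeleton.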

Beam search on layer 0

Once greedy descent lands us on a decent entry node at layer 0, we switch to beam search with a fixed width ef:

  • a min-heap of candidates to expand next (closest-first)
  • a max-heap of results of size ef (farthest-of-the-best on top, so we can evict in O(log ef))
  • a visited set (bitset over the max-vector capacity)

Pop the closest candidate. If it’s already farther from the query than the worst result we’re keeping, stop — nothing left on the frontier can improve the answer. Otherwise expand its neighbors: push each unvisited one onto the candidate heap, and if it beats the current worst result, add it to the results and evict that worst entry.
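
A sketch of that loop with made-up helper names: ws is the per-worker Workspace described below, and its heap, bitset, and neighborsOf calls are stand-ins, not hnswz's real API.

// Sketch only. candidates: min-heap by distance to the query;
// results: max-heap capped at ef. All names are illustrative.
fn searchLayer0(ws: *Workspace, query: []const f32, entry: u32, ef: usize) void {
    const d0 = ws.distanceToQuery(query, entry);
    ws.candidates.push(.{ .id = entry, .dist = d0 });
    ws.results.push(.{ .id = entry, .dist = d0 });
    ws.visited.set(entry);

    while (ws.candidates.pop()) |cand| {
        // Stop: the closest unexpanded candidate can't improve the ef results we hold.
        if (cand.dist > ws.results.worst().dist) break;

        for (ws.neighborsOf(cand.id)) |n| {
            if (ws.visited.isSet(n)) continue;
            ws.visited.set(n);
            const d = ws.distanceToQuery(query, n);
            if (ws.results.count() < ef or d < ws.results.worst().dist) {
                ws.candidates.push(.{ .id = n, .dist = d });
                ws.results.push(.{ .id = n, .dist = d });
                if (ws.results.count() > ef) ws.results.popWorst(); // O(log ef) eviction
            }
        }
    }
}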

[Figure: beam-search frontier on layer 0 around a query vector, showing results, candidates, visited and unvisited nodes.]
Fig. 03 — The frontier tightens around the query; results (sage) stay closest, candidates (blue) wait to be expanded.

Every buffer the search touches — the two heaps, the visited bitset, a few scratch slices — lives inside a Workspace struct that’s sized once at startup and handed out from a per-worker pool. No alloc on the hot path. That’s how you get predictable p99 latencies.

The neighbor-selection heuristic

When a new node connects, you don’t just take the M closest candidates — that over-clusters. The paper’s heuristic is “prefer diverse directions”: accept a candidate only if it’s closer to the new node than to any already-accepted neighbor. This keeps the graph well-connected across the space instead of collapsing into a few dense regions.
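
In code it's a small filter over the distance-sorted candidate list. A sketch with illustrative names; dist is whatever node-to-node distance the index uses, and the real routine also handles the paper's "fill remaining slots" fallback:

// Sketch of the "prefer diverse directions" heuristic. candidates arrive
// sorted by distance to the new node, closest first; out has room for M ids.
fn selectNeighbors(candidates: []const u32, new_node: u32, out: []u32) usize {
    var count: usize = 0;
    for (candidates) |c| {
        if (count == out.len) break;
        // Accept c only if it's closer to the new node than to every
        // already-accepted neighbor; otherwise it adds no new direction.
        var diverse = true;
        for (out[0..count]) |s| {
            if (dist(c, s) < dist(c, new_node)) {
                diverse = false;
                break;
            }
        }
        if (diverse) {
            out[count] = c;
            count += 1;
        }
    }
    return count;
}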

Deletes without rebuilding

HNSW doesn’t want you to delete. The standard answer is “just tombstone it, keep it as a routing hop, skip it in results.” That’s what I do — with one wrinkle. Each node has some number of upper-layer slots (literally zero for level-0 nodes, more for high-level ones). When a node dies, its slot run goes into a free-list keyed by run length, and the next insert that sampled the same level pops from that bucket first before bumping the arena cursor. Exact-match only, no splitting. Works because insert and delete sample the same distribution, so the buckets stay balanced in expectation.
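
Roughly, the bookkeeping looks like this. A sketch with made-up names and a made-up MAX_LEVEL; the real arena and its buckets are hnswz's own structures, but the exact-match pop-or-bump logic is the idea:

// Sketch of upper-layer slot recycling. A node at level L owns a contiguous
// run of L slots in the neighbor arena; when it dies, the run goes into a
// bucket keyed by its exact length. Names and MAX_LEVEL are illustrative.
const MAX_LEVEL = 16;

const SlotRecycler = struct {
    buckets: [MAX_LEVEL + 1][]u32, // buckets[len] holds offsets of freed runs of exactly len slots
    bucket_len: [MAX_LEVEL + 1]usize = [_]usize{0} ** (MAX_LEVEL + 1),
    cursor: u32 = 0,               // arena high-water mark

    fn alloc(self: *SlotRecycler, run_len: u8) u32 {
        // Exact-match only: reuse a dead run of the same length, or bump the cursor.
        if (self.bucket_len[run_len] > 0) {
            self.bucket_len[run_len] -= 1;
            return self.buckets[run_len][self.bucket_len[run_len]];
        }
        const off = self.cursor;
        self.cursor += run_len;
        return off;
    }

    fn free(self: *SlotRecycler, offset: u32, run_len: u8) void {
        self.buckets[run_len][self.bucket_len[run_len]] = offset;
        self.bucket_len[run_len] += 1;
    }
};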

From a script to a server

Original flow was:

hnswz build ./docs     # embed every .txt, build graph, save to disk
hnswz query            # REPL: ask questions, get filenames back

Perfect for my folder-of-notes use case. Breaks the moment I want to insert or delete a single document without reindexing the whole corpus. So I added serve — a long-running process speaking a custom binary TCP protocol.

Why binary

A single vector at dim=4096 is 16 KiB of raw floats. Framing that as JSON would mean stringifying 4096 numbers per call, on both sides, for no reason. The wire format is dull on purpose:

request frame:
  u32   body_len
  u8    opcode
  u32   req_id
  []    opcode-specific payload

Nine bytes of header, little-endian throughout, matching the on-disk formats so serialization is the same code everywhere. req_id is echoed in the response, so pipelined/async clients work without a format break. Full spec lives in src/protocol.zig.
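
Decoding those nine bytes is a few readInt calls. A sketch, not the code in src/protocol.zig; the field names are mine:

const std = @import("std");

// Sketch of the 9-byte request header described above. Illustrative only;
// see src/protocol.zig for the real layout and opcode set.
const Header = struct {
    body_len: u32,
    opcode: u8,
    req_id: u32,

    fn decode(bytes: *const [9]u8) Header {
        return .{
            .body_len = std.mem.readInt(u32, bytes[0..4], .little),
            .opcode = bytes[4],
            .req_id = std.mem.readInt(u32, bytes[5..9], .little),
        };
    }

    fn encode(self: Header, out: *[9]u8) void {
        std.mem.writeInt(u32, out[0..4], self.body_len, .little);
        out[4] = self.opcode;
        std.mem.writeInt(u32, out[5..9], self.req_id, .little);
    }
};

The same little-endian read/write helpers serve the on-disk formats, which is what keeps serialization down to one code path.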

The concurrency story

This is the part where naive designs get embarrassing.

At dim=4096, a single HNSW insert takes ~19 ms and a search ~4 ms on my laptop. Run that synchronously on the event-loop thread and every request stalls every other client. So the architecture is:

  • Main thread: a kqueue reactor (src/io/darwin.zig) — accepts connections, runs the per-connection state machine (read_header → read_body → dispatch → write_response), and hands requests off.
  • Worker pool (src/dispatcher.zig): N threads, each with its own HNSW Workspace and scratch. A std.Thread.RwLock guards the {store, index, metadata} triple. Searches hold it shared. Inserts/deletes/replace/snapshot hold it exclusive.
  • Wakeups: workers post finished requests onto a done queue and write one byte to a pipe(2). The main loop has the pipe’s read end registered as a readReady completion, so it wakes on that and drains the queue. No polling, no timers.

[Figure: server concurrency architecture. Main thread: accept → read → dispatch → write, with a pending queue. Worker pool (worker 1..3) behind an RwLock over (store, index), a done queue, and a pipe(2) wake.]
Fig. 04 — The reactor dispatches into the pool; a pipe byte wakes it when results are ready.
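
The wake-up path itself is tiny: the whole cross-thread signal is one written byte and one drained read. A sketch in which DoneQueue, Completion, and writeResponse are stand-ins for the dispatcher's real types:

const std = @import("std");

// Sketch of the pipe(2) wake-up described above; types are illustrative.
fn workerFinished(done: *DoneQueue, c: *Completion, wake_write_fd: std.posix.fd_t) void {
    done.push(c); // mutex-protected queue of finished requests
    _ = std.posix.write(wake_write_fd, &[_]u8{1}) catch {}; // wake the kqueue loop
}

fn onWakeReadable(done: *DoneQueue, wake_read_fd: std.posix.fd_t) void {
    var buf: [64]u8 = undefined;
    _ = std.posix.read(wake_read_fd, &buf) catch {}; // drain coalesced wake bytes
    while (done.pop()) |c| writeResponse(c);         // responses go out on the loop thread
}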

The style is deliberately close to TigerBeetle’s — caller-owned Completion objects, an *anyopaque context, a small tagged Operation. The difference is that TB runs everything on the loop thread; we can’t, because our compute isn’t bounded enough.

Durability: write-ahead log

Every mutation is appended to a WAL and fsync’d before the server acknowledges the client. On startup the WAL is replayed on top of the last snapshot. Each record carries a CRC32 covering its length prefix and body, so a torn write at any byte boundary produces a mismatch and terminates replay at the last fully-valid record.
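
Record framing is the same flavor as the wire protocol plus the checksum. A sketch, not hnswz's exact on-disk layout (and the real record also carries more than a raw body; see the level trap below):

const std = @import("std");

// Sketch of appending one WAL record. The CRC covers the length prefix and the
// body, so a torn write anywhere inside the record fails the check on replay.
fn appendRecord(wal: std.fs.File, body: []const u8) !void {
    var len_bytes: [4]u8 = undefined;
    std.mem.writeInt(u32, &len_bytes, @intCast(body.len), .little);

    var crc = std.hash.Crc32.init();
    crc.update(&len_bytes);
    crc.update(body);
    var crc_bytes: [4]u8 = undefined;
    std.mem.writeInt(u32, &crc_bytes, crc.final(), .little);

    try wal.writeAll(&len_bytes);
    try wal.writeAll(&crc_bytes);
    try wal.writeAll(body);
    try wal.sync(); // fsync before the client sees an ack
}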

Snapshots atomically truncate the WAL via rename-over-a-tempfile — so if the truncate itself crashes, either the old or the new WAL is intact on disk. Acknowledged writes survive a crash; unacknowledged ones don’t. That’s the whole contract.

One trap I stepped in: the snapshot serializes vector ids from the store’s high-water mark, but HNSW insert draws a level from an RNG that isn’t part of that state. Early versions recorded the id but re-sampled the level during WAL replay, which produced a different graph than the one that had been acknowledged. The fix was to draw the level before writing the WAL record, persist it in the record, and feed it into insertWithLevel on replay. Small, boring, but the kind of invariant that’ll silently corrupt your index if you miss it.

Numbers

A turnkey comparison harness against hnswlib lives in bench/ — downloads SIFT1M, runs both on the same bytes with matched parameters (M=16, ef_construction=200, ef_search=100, top-k=10), diffs the JSON reports.

SIFT1M, 1M × 128d, 10k queries, single-threaded Apple M-series:

metric              hnswz       hnswlib
build wall time     358.6 s     473.1 s
build throughput    2.8k / s    2.1k / s
search QPS          4.6k / s    3.5k / s
search p50          218 µs      291 µs
search p99          335 µs      393 µs
recall@10           0.9804      0.9772

Faster than hnswlib on every axis, with recall a touch higher (0.9804 vs 0.9772). One laptop, single-threaded, cosine distance over L2-normalized vectors. Don’t extrapolate to a cloud fleet; do run it on your own hardware if you care.

[Figure: hnswz vs hnswlib, relative performance on SIFT1M. Build 2.8k vs 2.1k vectors/s (higher better), search 4.6k vs 3.5k QPS (higher better), p50 218 µs vs 291 µs (lower better).]
Fig. 05 — SIFT1M · 1M × 128d · 10k queries · single-threaded. Recall@10 within 0.003 of each other.

The honest caveat: hnswlib parallelizes add_items natively and hnswz’s writer path still serializes. That’s the obvious next thing.

What I learned

  • Zig’s @Vector gives you portable SIMD in a few lines, no intrinsics.
  • Being explicit about allocation turns out to be a superpower for a database. Every buffer has a known size, a known owner, a known lifetime.
  • The hard parts weren’t the HNSW algorithm. The HNSW paper is short and the pseudocode translates cleanly. The hard parts were the things around it: the WAL invariants, the kqueue state machine, making sure the id-space stays consistent across snapshot-then-replay.
  • Building your own thing is always more work than you estimate, and always worth it anyway.

What’s next

  • Multi-threaded ingest. Single-writer today. Not a fundamental limit — just the obvious next investment.
  • Richer metadata. Right now a vector is (id, filename). I want arbitrary payloads attached to each record without bloating the hot path.
  • Linux backend. I built this for my own use on a Mac, so the reactor is kqueue-only. Next up: spin up a Linux VM and port the I/O layer to io_uring.
  • A real REPL mode. Today’s REPL loads a pre-built index + vectors from disk and you just type text to get results back. I want to replace it with a REPL that connects to the server and speaks the wire protocol — so you can drive any opcode, not just search.

If you want to poke at it, the source is in the repo — zig build gets you a binary, and bench/run.sh siftsmall runs a 20-second smoke test against hnswlib. Feedback welcome; bug reports more welcome.