
How we built hybrid BM25 + vector semantic search for 2,319 SaaS ideas

A technical tour of the search stack: why we picked Voyage AI over OpenAI, how the BM25/vector blend works inside one SQL round-trip, and the three gotchas nobody warns you about.


Until last week, searching Vibe Code Ideas for *"tools for dog owners"* returned nothing. We had pet-industry ideas in the catalog — just none of them contained the literal word "dog."

Classic full-text search bug. PostgreSQL's tsvector is excellent at matching tokens, and hopeless at recognizing that "dog" and "pet grooming platform" are talking about the same thing. Every indie-hacker directory hits this wall eventually. The fix — semantic search — is well-known. What's less well-discussed is the honest cost of implementing it, and the surprising amount of friction between "yes, let's add embeddings" and "actually serving them at query time."

Here's how we did it, including the parts we got wrong.

Why we didn't pick OpenAI

The default answer in 2026 for "which embedding provider" is still OpenAI's text-embedding-3-small: $0.02 per million tokens, ubiquitous integrations, well-documented. It would have taken half a day.

We picked Voyage AI instead. Three reasons:

1. No OpenAI account. The maker of Vibe Code Ideas (hi, Luca) doesn't use OpenAI for anything else. Adding a second vendor with a second billing relationship for one feature is a real tax; Voyage was a clean single-vendor add alongside Anthropic.
2. Quality. Voyage's voyage-3 model consistently tops the MTEB retrieval leaderboard. For a directory where "did the right idea surface?" is the entire user experience, 2-3 points of retrieval precision is worth more than a convenient integration.
3. Lifetime free tier. 200M tokens free, forever. Our entire backfill of 2,006 ideas used ~300K tokens, 0.15% of the allocation. Anthropic recommends Voyage for customers who need embeddings precisely because Anthropic doesn't ship one.

The only surprise: the free tier is capped at 3 requests/minute and 10K tokens/minute until you add a payment method. We hit the ceiling after 3 batches (192/2,006 embedded) and had to pause. Adding a card lifts it to 300 RPM / 1M TPM while keeping the lifetime free credits. Real cost of the full backfill: $0.0065.

The architecture

Voyage handles embedding. For storage and ANN search we use Tiger Data Ghost, a Postgres flavor that bundles pgvectorscale with the StreamingDiskANN index. We already had Ghost provisioned for our full-text search work, so adding a vector(1024) column next to our existing BM25 tsvector was a one-migration change.

The pipeline:

1. On new-idea insert, call embedOne(title + "\n\n" + summary, "document"), which returns a 1024-dim vector or null.
2. Store the vector in Ghost alongside the existing tsvector.
3. At query time, embed the user's search string with input_type: "query". That two-mode distinction (document vs query) is worth ~2 MTEB points at retrieval time: the model knows when it's embedding a long document versus a short search intent.
4. Run a single SQL query that blends BM25 full-text with vector cosine distance.

The input_type distinction is the detail every tutorial skips. If you're embedding both sides as document, you're leaving quality on the table.
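To make the document/query split concrete, here is a minimal sketch of the two request bodies. The helper names (buildEmbeddingRequest, documentText) and the sample strings are hypothetical, not our production code; the field names (model, input, input_type) follow Voyage's embeddings API as we understand it, so check them against the current docs before relying on this.

```typescript
// Sketch only: request bodies for the two embedding modes.
type InputType = "document" | "query";

interface EmbeddingRequest {
  model: string;
  input: string[];
  input_type: InputType;
}

// Hypothetical helper: builds the JSON body for a single text.
function buildEmbeddingRequest(text: string, inputType: InputType): EmbeddingRequest {
  return { model: "voyage-3", input: [text], input_type: inputType };
}

// Document side: title and summary joined with a blank line, as in step 1.
function documentText(title: string, summary: string): string {
  return `${title}\n\n${summary}`;
}

// Sample values are made up for illustration.
const docBody = buildEmbeddingRequest(
  documentText("Pet grooming platform", "Booking for mobile groomers"),
  "document",
);
const queryBody = buildEmbeddingRequest("tools for dog owners", "query");

console.log(JSON.stringify(docBody));
console.log(JSON.stringify(queryBody));
```

The only difference between the two bodies is input_type, which is exactly why it is easy to forget on the query side.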

The hybrid query

The interesting piece is the blend. You can't just add a BM25 score (unbounded, roughly 0-10) to a vector cosine similarity (0-1) — the BM25 term would dominate every result. Normalization is mandatory, and the details matter.

Our approach:

1. Run two parallel queries: top-50 by BM25 rank, top-50 by vector cosine.
2. For each result set, normalize scores to [0,1] by dividing by the per-source max of that set. This preserves relative ranking within each method without cross-contaminating.
3. Full-outer-join on idea ID. Missing from one side = 0 for that score.
4. Final score = 0.45 * bm25_norm + 0.55 * vector_norm.

Why 0.45/0.55? We tuned by hand on a 30-query test set pulled from our actual search logs. Pure vector was 55% right; pure BM25 was 68% right (keyword match is strong for exact-phrase searches like "subscription manager"); the blend landed at 82%. The slight vector bias handles the vocabulary drift cases that motivated the whole feature.
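The production blend lives in SQL, but the arithmetic is easier to see in memory. The sketch below mirrors the four steps under some assumptions: the function and type names are hypothetical, the sample scores are made up, and the vector input is treated as a similarity where higher is better.

```typescript
// In-memory illustration of the blend; not the production SQL.
interface Scored { id: number; score: number; }
interface Blended { id: number; bm25Norm: number; vectorNorm: number; final: number; }

// Divide each score by the set's max so the best hit in each method becomes 1.
function normalizeByMax(rows: Scored[]): Map<number, number> {
  const max = rows.reduce((m, r) => Math.max(m, r.score), 0);
  const out = new Map<number, number>();
  rows.forEach((r) => out.set(r.id, max > 0 ? r.score / max : 0));
  return out;
}

function blend(bm25: Scored[], vector: Scored[], wBm25 = 0.45, wVector = 0.55): Blended[] {
  const b = normalizeByMax(bm25);
  const v = normalizeByMax(vector);
  // Union of ids plays the role of the FULL OUTER JOIN.
  const ids = Array.from(new Set([...Array.from(b.keys()), ...Array.from(v.keys())]));
  return ids
    .map((id) => {
      const bm25Norm = b.get(id) ?? 0;     // missing from BM25 side = 0
      const vectorNorm = v.get(id) ?? 0;   // missing from vector side = 0
      return { id, bm25Norm, vectorNorm, final: wBm25 * bm25Norm + wVector * vectorNorm };
    })
    .sort((x, y) => y.final - x.final);
}

// Made-up scores: idea 2 appears in both result sets.
const ranked = blend(
  [{ id: 1, score: 8 }, { id: 2, score: 4 }],
  [{ id: 2, score: 0.9 }, { id: 3, score: 0.6 }],
);
console.log(ranked.map((r) => r.id)); // [2, 1, 3]
```

Idea 2 is only second-best on BM25, but being the top vector hit as well lifts it to roughly 0.775, ahead of the pure keyword winner at 0.45.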

The whole thing is one SQL round-trip. No post-processing in Node, no re-ranking service, no ColBERT. Just two CTEs, a FULL OUTER JOIN, and an ORDER BY. Read the actual query if you want the unvarnished version.

Graceful degradation

Voyage is 99.9% available. That other 0.1% is when you're demoing to an investor.

Every query-time embedding call is wrapped in an AbortController with a 15-second timeout and a try/catch that returns null on any failure. If null comes back, the hybrid query falls through to BM25-only — not "please try again later," not a broken page, just slightly worse results that are still better than the no-semantic-search baseline. Users never see the Voyage dependency.
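A minimal sketch of that wrapper, assuming the embedding call accepts an AbortSignal. The names embedOrNull, Embedder, and searchMode are hypothetical; the real call to Voyage is abstracted behind the embed parameter so the fallback logic can be read on its own.

```typescript
// Sketch only: timeout plus catch-all fallback around an embedding call.
type Embedder = (text: string, signal: AbortSignal) => Promise<number[]>;

async function embedOrNull(
  text: string,
  embed: Embedder,
  timeoutMs = 15000,
): Promise<number[] | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await embed(text, controller.signal);
  } catch {
    // Timeout, network error, bad response: all collapse to null.
    return null;
  } finally {
    clearTimeout(timer);
  }
}

// A null vector means the search runs on BM25 alone.
function searchMode(queryVector: number[] | null): "hybrid" | "bm25-only" {
  return queryVector === null ? "bm25-only" : "hybrid";
}

// Simulated outage: the embedder throws, the search still runs.
const failingEmbedder: Embedder = async () => {
  throw new Error("simulated outage");
};
embedOrNull("tools for dog owners", failingEmbedder).then((vec) => {
  console.log(searchMode(vec)); // logs "bm25-only"
});
```

Collapsing every failure to null keeps the calling code to a single branch: either there is a vector or there isn't.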

Same pattern for the backfill script: a null embedding for one idea just skips that idea, logs the failure, and keeps going. Resumable. Idempotent.
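Roughly, the backfill loop looks like the sketch below. It works on an in-memory array rather than the real table, and the names (Idea, backfill) are hypothetical; in the actual script the "already embedded" check would presumably be a filter on the vector column.

```typescript
// Sketch only: skip-on-failure backfill that is safe to rerun.
interface Idea { id: number; text: string; embedding: number[] | null; }

async function backfill(
  ideas: Idea[],
  embed: (text: string) => Promise<number[] | null>,
): Promise<{ embedded: number; skipped: number[] }> {
  let embedded = 0;
  const skipped: number[] = [];
  for (const idea of ideas) {
    // Already embedded on an earlier run: nothing to do (idempotent).
    if (idea.embedding !== null) continue;
    const vec = await embed(idea.text);
    if (vec === null) {
      // One failure never stops the run; the idea is picked up next time.
      console.warn(`embedding failed for idea ${idea.id}, skipping`);
      skipped.push(idea.id);
      continue;
    }
    idea.embedding = vec;
    embedded++;
  }
  return { embedded, skipped };
}
```

Because the loop only touches rows without a vector, resuming after a rate-limit pause is just running the script again.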

The gotchas

Three things bit us. All three would have been one-line mentions in someone else's blog post. Here they are spelled out:

1. voyage-3 only supports output_dimension=1024. Our Ghost column was initially vector(1536), a holdover from the earlier OpenAI plan. First API call: 400 Bad Request: output_dimension not valid for voyage-3. The fix is a one-line migration script that drops the column, re-adds it as vector(1024), and rebuilds the diskann index. Check your target model's supported dimensions *before* you pick a column type.
2. Free-tier RPM caps are not documented at signup. Voyage's free tier is "200M lifetime tokens" on the pricing page, which is true but incomplete. Without a payment method on file, you also get 3 RPM / 10K TPM, full stop. Our backfill hit the wall at 192 ideas and had to pause for card entry. That is different from how OpenAI advertises its free tier.
3. CWD resets in long-running scripts. During backfill, npx tsx scripts/backfill-embeddings.ts failed mid-run with "Cannot find module" because the current working directory had silently reset to the workspace root. Always use explicit absolute paths in backfill scripts you'll run over multiple hours. This isn't Voyage's fault; it's a general lesson that took us an hour to re-learn.
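For the first gotcha, a cheap guard is to compare the vector length against the column dimension before writing anything. This is a sketch with hypothetical names (COLUMN_DIMENSION, checkDimension), not something from our codebase; it only catches the mismatch earlier and with a clearer message.

```typescript
// Sketch only: fail fast when a vector does not fit the column.
const COLUMN_DIMENSION = 1024; // assumed to match the vector(1024) column

function checkDimension(vec: number[], expected: number = COLUMN_DIMENSION): number[] {
  if (vec.length !== expected) {
    throw new Error(`embedding has ${vec.length} dimensions, column expects ${expected}`);
  }
  return vec;
}

// A 1024-dim vector passes through unchanged.
const okVec = checkDimension(new Array<number>(1024).fill(0));
console.log(okVec.length); // 1024

// A 1536-dim vector (the old OpenAI-sized column) is rejected.
try {
  checkDimension(new Array<number>(1536).fill(0));
} catch (err) {
  console.error((err as Error).message);
}
```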

Cost, end to end

Embedding 2,006 active ideas cost $0.0065. At the $0.02 per million tokens quoted earlier, the same ~300K tokens would have cost about $0.006 on OpenAI's text-embedding-3-small. Query-time embeddings at our traffic level are essentially free. The entire project, from "should we do this" to "prod verified", was two sessions and under a penny.

The marginal cost per new idea is ~0.0003¢. We could embed every post on Hacker News in 2026 for under $10.

What's next

Two natural extensions we haven't built yet:

  • Query expansion: re-embedding the search string with synonyms pulled from a term dictionary before the cosine match. Would help on very short queries ("CRM") where there's not much signal to embed.
  • Per-user semantic feeds: embedding a user's saves + click history and sorting the entire catalog against that vector. Turns a directory into a recommendation engine.

Both are Week 2 post-launch work. If you want the current version, search Vibe Code Ideas for anything you're curious about — the results now rank on meaning, not just keyword match.
