Apr 04, 2026·5 min read·RAG

Choosing an Embedding Model: Benchmarks Over Brand

OpenAI is not always the best choice. On Sefaria's benchmark, Gemini scored 24 points higher recall than OpenAI on Rabbinic texts (93.9% vs 69.9%), and a free open-source model came close behind.

When you build a RAG system, one of the first choices is: which embedding model do you use to convert your text into vectors?

Most tutorials default to OpenAI. It is popular, well-documented, and works well on English text. But "works well on English text" does not mean "works well on YOUR text."

Sefaria (the open-source Jewish texts library) tested 18 embedding models on their own Rabbinic texts. The results were surprising.


The benchmark

Sefaria published a Rabbinic Embedding Leaderboard on HuggingFace. They tested how well each model retrieves the correct English translation for a given Hebrew Rabbinic text. This is a real-world task: can the model understand that a Hebrew verse and its English translation are about the same thing?

  • Gemini Embedding 001 - 93.9% recall, $0.15/1M tokens (API)
  • Qwen3-Embedding-8B - 89.4% recall, $0 (open-source, runs locally)
  • OpenAI text-embedding-3-large - 69.9% recall, $0.13/1M tokens (API)
  • OpenAI text-embedding-3-small - ~65% recall, $0.02/1M tokens (API)
  • DictaBERT (Hebrew specialized) - 1.7% recall, $0 (local)

Two things stand out:

Gemini's recall is 24 points higher than OpenAI's on Rabbinic texts: 93.9% vs 69.9%, a relative improvement of about 34%. For every 100 queries, Gemini finds the correct passage about 94 times while OpenAI finds it about 70 times.

But Gemini is expensive at scale. I initially estimated $2.50 to embed all 4.4 million Sefaria texts, based on a price of $0.006 per 1M tokens taken from an outdated source. The actual price is $0.15 per 1M tokens, 25x what I assumed, and the full dataset would cost ~$27, roughly 11x my estimate. I hit my Google Cloud spending cap at $5 after indexing only 94K texts.

The lesson: never trust pricing from blog posts or training data. Always check the official pricing page, and always test on a small batch (1,000 documents) before committing to millions. I now plan to migrate to Qwen3-Embedding-0.6B, an open-source model that runs locally with no API fees. The 89.4% recall listed above belongs to the larger Qwen3-Embedding-8B (4.5 points behind Gemini), so the 0.6B variant still has to be benchmarked on the same data.
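The small-batch check described above can be sketched in a few lines. This is a minimal illustration, not a billing tool: the function name and the sample token counts are hypothetical, and the token counts are assumed to come from whatever tokenizer or usage report your provider offers.

```python
def estimate_embedding_cost(sample_token_counts, total_documents, price_per_million_tokens):
    """Extrapolate full-corpus embedding cost (USD) from a small sample of documents."""
    if not sample_token_counts:
        raise ValueError("sample_token_counts must not be empty")
    # Average tokens per document in the pilot batch.
    avg_tokens = sum(sample_token_counts) / len(sample_token_counts)
    # Scale up to the whole corpus, then convert tokens to dollars.
    total_tokens = avg_tokens * total_documents
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical pilot batch: three documents of 30, 40, and 50 tokens.
sample = [30, 40, 50]
for label, price in [("assumed price", 0.006), ("official price", 0.15)]:
    cost = estimate_embedding_cost(sample, total_documents=4_400_000,
                                   price_per_million_tokens=price)
    print(f"{label}: ${cost:,.2f}")
```

Running the same estimate with the assumed price and the official price side by side makes a wrong price obvious before any money is spent.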


Why the default choice is often wrong

Most developers pick an embedding model based on:

  • "OpenAI is the most popular"
  • "I already have an OpenAI API key"
  • "The tutorial I followed uses OpenAI"

None of these are good reasons. The right way to choose an embedding model:

1. Find a benchmark for YOUR domain

General-purpose benchmarks such as MTEB draw heavily on sources like news articles, Wikipedia, and Stack Overflow. Your data might be completely different. Sefaria's data is Rabbinic Hebrew from the Talmud, medieval commentaries, and liturgical texts. A model that scores well on English Wikipedia might completely fail on this content.

Look for domain-specific benchmarks. If none exist, create a small evaluation set (50-100 query-document pairs) and test the models yourself.

2. Test on your actual data

Embed 1,000 documents from your dataset with each candidate model. Run 20-30 real queries. Check if the top results are actually relevant.

This takes a few hours and a few dollars. It saves you from re-indexing millions of documents later when you realize the model is not good enough.
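One way to score such a test is recall@k over a small set of query-document pairs. The sketch below assumes you have already embedded your queries and documents with a candidate model and hold the vectors in plain dictionaries. The function names, IDs, and toy vectors are illustrative, not part of Sefaria's benchmark code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_at_k(query_vectors, doc_vectors, expected_doc_ids, k=1):
    """Fraction of queries whose expected document appears in the top-k results."""
    if not query_vectors:
        raise ValueError("query_vectors must not be empty")
    hits = 0
    for query_id, query_vec in query_vectors.items():
        # Rank every document by similarity to this query, best first.
        ranked = sorted(
            doc_vectors,
            key=lambda doc_id: cosine_similarity(query_vec, doc_vectors[doc_id]),
            reverse=True,
        )
        if expected_doc_ids[query_id] in ranked[:k]:
            hits += 1
    return hits / len(query_vectors)


# Toy 2-dimensional vectors standing in for real embeddings.
queries = {"q1": [1.0, 0.0], "q2": [0.0, 1.0]}
docs = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [0.7, 0.7]}
expected = {"q1": "d1", "q2": "d3"}

print(recall_at_k(queries, docs, expected, k=1))
print(recall_at_k(queries, docs, expected, k=2))
```

Run the same evaluation set through each candidate model and compare the scores. A brute-force ranking like this is fine for 1,000 documents and needs no vector database.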

3. Consider cost at scale

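Embedding cost matters when you have millions of documents. At $0.15/1M tokens, Gemini costs ~$27 for the full Sefaria dataset (4.4M texts), which implies roughly 180M tokens. At the listed $0.13/1M tokens, OpenAI text-embedding-3-large would cost about $23, assuming a similar token count, so the price gap between the two APIs is small. An open-source model like Qwen3 costs $0 in API fees (you still pay for your own hardware) and can be re-indexed as many times as you like. When you are iterating on chunking strategies, that flexibility can matter more than the best benchmark score.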
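To see how re-indexing multiplies API spend, here is a rough sketch. The prices are the ones listed in this article, and the 180M-token corpus size is an assumption back-calculated from ~$27 at $0.15/1M tokens. Real token counts differ per tokenizer, and the local model's hardware and electricity costs are ignored.

```python
# Prices per 1M tokens as listed in this article; verify against official pricing pages.
PRICE_PER_MILLION = {
    "gemini-embedding-001": 0.15,
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "qwen3-embedding (local)": 0.0,  # API fees only; hardware cost not modeled
}


def cost_across_reindex_runs(total_tokens, runs, prices=PRICE_PER_MILLION):
    """API spend (USD) per model for embedding the whole corpus `runs` times."""
    return {
        name: total_tokens / 1_000_000 * price * runs
        for name, price in prices.items()
    }


# Assumption: ~180M tokens in the corpus, re-indexed 5 times while tuning chunking.
for name, cost in cost_across_reindex_runs(180_000_000, runs=5).items():
    print(f"{name}: ${cost:,.2f}")
```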


The DictaBERT surprise

You might think: "Hebrew text? Use a Hebrew-specialized model!" DictaBERT is trained specifically on Hebrew text. It scores... 1.7% recall. Almost zero.

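Why? Two likely reasons. First, DictaBERT was trained on modern Hebrew, not Rabbinic Hebrew. The Talmud and medieval commentaries use a completely different vocabulary, grammar, and style. A model trained on Israeli news articles cannot understand Rashi. Second, the benchmark task is cross-lingual: the model has to match a Hebrew passage to its English translation, which a Hebrew-only model is poorly equipped to do.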

Gemini, trained on massive multilingual data, somehow learned to understand Rabbinic Hebrew better than a model specifically built for Hebrew. Scale beats specialization when the specialization is on the wrong sub-domain.


What this means for your project

If you are building a RAG system:

  1. Do not default to OpenAI. It might be the best, or it might not. Test it.
  2. Look for domain benchmarks. Someone might have already tested models on data similar to yours.
  3. Test on your actual data. 1,000 documents, 20 queries, a few hours of work.
  4. Consider the full cost. Embedding cost + re-indexing flexibility + accuracy. The cheapest accurate model wins.
  5. Specialized models are not always better. A model trained on "your language" might fail on "your sub-domain."
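The "cheapest accurate model wins" rule from the list above can be written down as a small selection function. This is only a sketch: the recall figures come from the leaderboard numbers quoted in this article, the Gemini cost is the ~$27 estimate, and the OpenAI cost is an assumption derived from the listed $0.13/1M price and roughly 180M tokens. The function and field names are made up for illustration.

```python
def pick_model(candidates, min_recall):
    """Return the cheapest candidate whose measured recall meets the threshold, or None."""
    eligible = [c for c in candidates if c["recall"] >= min_recall]
    if not eligible:
        return None
    # Cheapest first; break cost ties in favor of higher recall.
    return min(eligible, key=lambda c: (c["cost_usd"], -c["recall"]))


CANDIDATES = [
    {"name": "Gemini Embedding 001", "recall": 0.939, "cost_usd": 27.0},
    {"name": "Qwen3-Embedding-8B", "recall": 0.894, "cost_usd": 0.0},  # API fees only
    {"name": "OpenAI text-embedding-3-large", "recall": 0.699, "cost_usd": 23.4},  # derived estimate
    {"name": "DictaBERT", "recall": 0.017, "cost_usd": 0.0},
]

for threshold in (0.85, 0.90, 0.95):
    choice = pick_model(CANDIDATES, min_recall=threshold)
    print(threshold, "->", choice["name"] if choice else "no model qualifies")
```

The threshold is the real decision: it encodes how much recall you are willing to trade for cost, and it should come from testing on your own data rather than from a leaderboard alone.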

For our project, we started with Gemini Embedding 001 (93.9% recall) but discovered the per-token price was 25x higher than estimated. We are now migrating to Qwen3-Embedding-0.6B ($0 in API fees, runs locally). The leaderboard's 89.4% recall is for Qwen3-Embedding-8B, so the 0.6B variant's recall still needs to be measured; if it lands close, a gap of about 5 points is worth the unlimited re-indexing flexibility for a POC. The lesson: benchmark accuracy AND cost on your actual data before committing.
