Skillia

Choosing an embedding model: benchmarks matter more than brand

OpenAI is not always the best choice. How Sefaria's benchmark showed Gemini is 24 points more accurate on Rabbinic texts, at 22x less cost.

When you build a RAG system, one of the first choices is: which embedding model do you use to convert your text into vectors?

Most tutorials default to OpenAI. It is popular, well-documented, and works well on English text. But "works well on English text" does not mean "works well on YOUR text."

Sefaria (the open-source Jewish texts library) tested 18 embedding models on their own Rabbinic texts. The results were surprising.


The benchmark

Sefaria published a Rabbinic Embedding Leaderboard on HuggingFace. They tested how well each model retrieves the correct English translation for a given Hebrew Rabbinic text. This is a real-world task: can the model understand that a Hebrew verse and its English translation are about the same thing?

Model                            Recall@1   Cost to embed all Sefaria   Type
Gemini Embedding 001             93.9%      ~$2.50                      API
OpenAI text-embedding-3-large    69.9%      ~$55                        API
OpenAI text-embedding-3-small    ~65%       ~$8                         API
Cohere embed-english-v3          ~72%       ~$12                        API
DictaBERT (Hebrew specialized)   1.7%       $0                          Local

Two things stand out:

Gemini is 24 points more accurate than OpenAI on Rabbinic texts. 93.9% recall vs 69.9%. That is not a small difference. It means that for every 100 queries, Gemini finds the correct passage 94 times while OpenAI finds it 70 times.

Gemini is 22x cheaper. $2.50 to embed all 4.4 million Sefaria texts, vs $55 for OpenAI's best model.
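Recall@1, the metric behind these numbers, is simple to compute yourself: for each query, rank every candidate document by similarity to the query embedding and count how often the correct document comes out on top. Sefaria's actual harness is not shown here; the sketch below assumes embeddings are already computed as plain lists of floats and uses cosine similarity for ranking.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recall_at_1(query_vecs, doc_vecs, gold):
    """Fraction of queries whose top-ranked document is the correct one.
    gold[i] is the index in doc_vecs of the right answer for query i."""
    hits = 0
    for i, q in enumerate(query_vecs):
        best = max(range(len(doc_vecs)), key=lambda j: cosine(q, doc_vecs[j]))
        if best == gold[i]:
            hits += 1
    return hits / len(query_vecs)
```

The same loop generalizes to recall@k by keeping the top k indices instead of the single best one.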


Why the default choice is often wrong

Most developers pick an embedding model based on:

  • "OpenAI is the most popular"
  • "I already have an OpenAI API key"
  • "The tutorial I followed uses OpenAI"

None of these are good reasons. The right way to choose an embedding model:

1. Find a benchmark for YOUR domain

General English benchmarks (MTEB) test on news articles, Wikipedia, and Stack Overflow. Your data might be completely different. Sefaria's data is Rabbinic Hebrew from the Talmud, medieval commentaries, and liturgical texts. A model that scores well on English Wikipedia might completely fail on this content.

Look for domain-specific benchmarks. If none exist, create a small evaluation set (50-100 query-document pairs) and test the models yourself.
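An evaluation set does not need any special tooling; a JSONL file of query-document pairs is enough. The format and the sample queries below are invented for illustration, not taken from Sefaria's benchmark:

```python
import json

# Hypothetical eval-set format: each line pairs a query with the ID of the
# document that should come back first. Queries and IDs are illustrative.
pairs = [
    {"query": "From when may one recite the evening Shema?",
     "relevant_id": "Berakhot.1.1"},
    {"query": "What blessing is said over bread?",
     "relevant_id": "Berakhot.6.1"},
]

with open("eval_set.jsonl", "w", encoding="utf-8") as f:
    for p in pairs:
        f.write(json.dumps(p, ensure_ascii=False) + "\n")
```

Fifty to a hundred such lines, written by someone who knows the domain, beat any amount of generic benchmark data.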

2. Test on your actual data

Embed 1,000 documents from your dataset with each candidate model. Run 20-30 real queries. Check if the top results are actually relevant.

This takes a few hours and a few dollars. It saves you from re-indexing millions of documents later when you realize the model is not good enough.
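The comparison itself can be one small harness. The sketch below assumes you wrap each provider's API client in a callable that maps a list of strings to a list of embedding vectors (the `toy` embedder in the usage example stands in for a real client); it then reports, per model, how often the correct document lands in the top k results.

```python
from math import sqrt

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def compare_models(embedders, docs, queries, gold, k=5):
    """embedders: {model_name: callable(list[str]) -> list of vectors}.
    gold[i] is the index in docs of the relevant document for query i.
    Returns {model_name: fraction of queries with the gold doc in the top k}."""
    scores = {}
    for name, embed in embedders.items():
        doc_vecs = embed(docs)
        query_vecs = embed(queries)
        hits = 0
        for i, q in enumerate(query_vecs):
            ranked = sorted(range(len(doc_vecs)),
                            key=lambda j: _cosine(q, doc_vecs[j]),
                            reverse=True)
            if gold[i] in ranked[:k]:
                hits += 1
        scores[name] = hits / len(queries)
    return scores
```

Usage with a stand-in embedder (a real run would pass one wrapper per candidate API):

```python
def toy_embed(texts):
    vocab = ["apple", "banana", "cherry"]
    return [[1.0 if w in t else 0.0 for w in vocab] for t in texts]

compare_models({"toy": toy_embed},
               docs=["apple pie", "banana bread"],
               queries=["fresh apple", "ripe banana"],
               gold=[0, 1], k=1)
```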

3. Consider cost at scale

Embedding cost matters when you have millions of documents. The difference between $2.50 and $55 is small in absolute terms, but it means you can re-index 22 times with Gemini for the price of one OpenAI run. That flexibility matters when you are iterating on chunking strategies or metadata.
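The cost arithmetic is simple enough to script before you commit. The price in the example below is an assumption for illustration, not a quoted rate; check your provider's current pricing page, which is usually expressed in dollars per million tokens.

```python
def embedding_cost(n_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """Back-of-the-envelope embedding cost in dollars."""
    total_tokens = n_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Illustrative: 4.4M texts at ~100 tokens each, at an assumed $0.13 per
# million tokens, comes to about $57 -- the same ballpark as the ~$55
# figure in the table above.
cost = embedding_cost(4_400_000, 100, 0.13)
```

Running this once per candidate model, with each model's real price, tells you immediately how many full re-index passes your budget allows.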


The DictaBERT surprise

You might think: "Hebrew text? Use a Hebrew-specialized model!" DictaBERT is trained specifically on Hebrew text. It scores... 1.7% recall. Almost zero.

Why? Because DictaBERT was trained on modern Hebrew, not Rabbinic Hebrew. The Talmud and medieval commentaries use a completely different vocabulary, grammar, and style. A model trained on Israeli news articles cannot understand Rashi.

Gemini, trained on massive multilingual data, somehow learned to understand Rabbinic Hebrew better than a model specifically built for Hebrew. Scale beats specialization when the specialization is on the wrong sub-domain.


What this means for your project

If you are building a RAG system:

  1. Do not default to OpenAI. It might be the best, or it might not. Test it.
  2. Look for domain benchmarks. Someone might have already tested models on data similar to yours.
  3. Test on your actual data. 1,000 documents, 20 queries, a few hours of work.
  4. Consider the full cost. Embedding cost + re-indexing flexibility + accuracy. The cheapest accurate model wins.
  5. Specialized models are not always better. A model trained on "your language" might fail on "your sub-domain."

For us, the choice was clear: Gemini Embedding 001 at 93.9% recall and $2.50 total cost. Not because it is from Google, but because the benchmark proved it on our actual data.