
Torah Study AI

Status: in progress

Production RAG pipeline on 3.5M sacred texts. Hybrid search, Cohere reranking, strict anti-hallucination guardrails. Built with FastAPI, Weaviate, and Gemini.

Stack: Python, FastAPI, Weaviate, Gemini 2.5 Flash, Cohere Rerank, Next.js, shadcn/ui, Docker

Step 9: Load Sefaria Datasets - How I Build

Downloading 886K English and 3.55M Hebrew texts from HuggingFace with the datasets library.

5 min read

The Goal

Get the entire Sefaria library (3.5M+ Jewish texts) onto my machine in a format ready for embedding. This is the knowledge base that powers the RAG pipeline.

HuggingFace Datasets Library

Sefaria publishes their full text corpus on HuggingFace. Two datasets:

  • Sefaria/Sefaria-English - 886,076 texts
  • Sefaria/Sefaria-Hebrew - 3,553,912 texts

Loading them is trivial with the datasets library:

from datasets import load_dataset

# Load English texts
english = load_dataset("Sefaria/Sefaria-English", split="train")
print(f"English texts: {len(english):,}")  # 886,076

# Load Hebrew texts
hebrew = load_dataset("Sefaria/Sefaria-Hebrew", split="train")
print(f"Hebrew texts: {len(hebrew):,}")  # 3,553,912

Each record has these fields:

{
    "ref": "Genesis 1:1",           # Sefaria reference (unique ID)
    "text": "In the beginning...",   # The actual text content
    "category": "Tanakh",           # Top-level category
    "subcategory": "Torah",         # Sub-category
    "book": "Genesis",             # Book name
}
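Since ref is a unique ID, it doubles as a natural lookup key. A minimal sketch on toy records shaped like the schema above (the real records come from the loaded dataset):

```python
# Toy records mimicking the Sefaria schema; real rows come from load_dataset.
records = [
    {"ref": "Genesis 1:1", "text": "In the beginning...", "category": "Tanakh",
     "subcategory": "Torah", "book": "Genesis"},
    {"ref": "Genesis 1:2", "text": "Now the earth...", "category": "Tanakh",
     "subcategory": "Torah", "book": "Genesis"},
]

# O(1) lookup by ref -- useful later when wiring retrieved chunks back to sources.
by_ref = {r["ref"]: r for r in records}
print(by_ref["Genesis 1:1"]["book"])  # Genesis
```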

Pre-Chunked by Sefaria

This is the key insight that saved me weeks of work. Sefaria already chunks their texts by the natural unit of each corpus:

  • Tanakh: one verse per record
  • Mishnah: one mishna per record
  • Talmud: one sugya (passage) per record
  • Midrash: one section per record
  • Halakhah: one law per record

This means I do not need a chunking strategy. No RecursiveCharacterTextSplitter, no token-based splitting, no overlap calculations. Each record is already a semantically coherent unit defined by centuries of Jewish scholarship.

Compare this to typical RAG projects where you spend days tuning chunk size (500 tokens? 1000? Overlap 50 or 100?). Sefaria did this work for me.

The 17 Categories Breakdown

I explored the dataset to understand what I am working with:

from collections import Counter

categories = Counter(english["category"])
for cat, count in categories.most_common():
    print(f"  {cat}: {count:,}")

Results:

  • Commentary: 191,245
  • Tanakh: 153,421
  • Talmud: 117,832
  • Liturgy: 100,456
  • Halakhah: 74,210
  • Midrash: 62,847
  • Kabbalah: 41,203
  • Philosophy: 38,912
  • Responsa: 31,456
  • Chasidut: 28,733
  • Musar: 21,547
  • Reference: 15,832
  • Parshanut: 12,456
  • History: 8,234
  • Elucidation: 4,567
  • Modern Works: 2,891
  • Other: 234

Commentary dominates because every verse has multiple commentators (Rashi, Rambam, Ibn Ezra, etc.). Tanakh and Talmud are the core texts.
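If a first index needs to be smaller, the category field makes subsetting cheap. A sketch with toy rows (the category choice here is illustrative; on the real dataset this would be a `dataset.filter` call as in the cleaning step below):

```python
# Illustrative choice of "core" categories -- not a recommendation from the post.
CORE_CATEGORIES = {"Tanakh", "Talmud", "Commentary"}

# Toy rows standing in for the HuggingFace dataset.
rows = [
    {"ref": "Genesis 1:1", "category": "Tanakh"},
    {"ref": "Berakhot 2a:1", "category": "Talmud"},
    {"ref": "Weekday Shacharit 1", "category": "Liturgy"},
]

core = [r for r in rows if r["category"] in CORE_CATEGORIES]
print(len(core))  # 2
```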

Why NER Was Skipped (Step 10)

The original plan had a Step 10: Named Entity Recognition to detect inline citations. For example, when Rashi says "as it is written in Bereishit 1:1," NER would extract that cross-reference.

I skipped it for three reasons:

  1. Sefaria already has the ref field. Every text has a unique reference. I do not need to extract citations from the text itself for basic RAG to work.
  2. NER on religious texts is hard. Standard NER models (spaCy, Stanza) are trained on modern English. They do not recognize "Bereishit Rabbah 12:4" as a reference.
  3. It is a V2 feature. Graph RAG (connecting texts that cite each other) is powerful but not blocking for the MVP. The MVP needs vector search, not graph search.
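For that eventual V2, a lightweight regex pass might get most of the way to cross-reference extraction without NER. A hypothetical sketch (the pattern and the book list are illustrative, not exhaustive; a real version would use Sefaria's full index of book names):

```python
import re

# Hypothetical mini book list; longest names first so "Bereishit Rabbah"
# matches before the bare "Bereishit".
BOOKS = ["Bereishit", "Bereishit Rabbah", "Genesis", "Shemot"]
pattern = re.compile(
    r"\b(" + "|".join(sorted(BOOKS, key=len, reverse=True)) + r")\s+(\d+):(\d+)"
)

def extract_citations(text: str) -> list[str]:
    """Return inline refs like 'Bereishit Rabbah 12:4' found in the text."""
    return [f"{book} {ch}:{v}" for book, ch, v in pattern.findall(text)]

print(extract_citations("as it is written in Bereishit 1:1; see Bereishit Rabbah 12:4"))
# ['Bereishit 1:1', 'Bereishit Rabbah 12:4']
```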

Token Count Estimation (the mistake)

I estimated the total tokens to plan embedding costs:

import tiktoken

# Using cl100k_base as approximation for Gemini tokenizer
enc = tiktoken.get_encoding("cl100k_base")

total_tokens = 0
for text in english["text"]:
    total_tokens += len(enc.encode(text))

print(f"Total tokens (English): {total_tokens:,}")
# Output: ~174,000,000 tokens (174M)

My initial calculation: 174M tokens at $0.006/1M = $1.04. With a buffer, call it ~$2.60.

This was wrong. The actual Gemini embedding price was $0.15/1M tokens, not $0.006. A 25x error. More on this disaster in Step 11.

Lesson: always check the pricing page directly. Do not trust numbers from blog posts or memory. And always run a small sample (1000 texts) before committing to the full batch.
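The arithmetic itself is trivial; the failure was the input. One way to keep the price an explicit, checkable constant rather than a number from memory (prices below are the two figures quoted in this post, to be verified against the pricing page):

```python
def embedding_cost(total_tokens: int, price_per_million_usd: float) -> float:
    """Estimated embedding cost in USD for a given per-1M-token price."""
    return total_tokens / 1_000_000 * price_per_million_usd

tokens = 174_000_000  # English-corpus estimate from above

print(f"At $0.006/1M: ${embedding_cost(tokens, 0.006):.2f}")  # the wrong price: $1.04
print(f"At $0.15/1M:  ${embedding_cost(tokens, 0.15):.2f}")   # the real price: $26.10
```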

Data Exploration Script

Before embedding anything, I wrote a script to understand the data:

from collections import Counter

def explore_dataset(dataset, name: str):
    """Print statistics about a Sefaria dataset."""
    print(f"\n{'='*50}")
    print(f"Dataset: {name}")
    print(f"Total records: {len(dataset):,}")

    # Average text length
    lengths = [len(text) for text in dataset["text"]]
    print(f"Avg text length: {sum(lengths) / len(lengths):.0f} chars")
    print(f"Min: {min(lengths)} chars")
    print(f"Max: {max(lengths):,} chars")
    print(f"Median: {sorted(lengths)[len(lengths)//2]:,} chars")

    # Empty texts
    empty = sum(1 for t in dataset["text"] if not t.strip())
    print(f"Empty texts: {empty:,} ({empty/len(dataset)*100:.1f}%)")

    # Categories
    categories = Counter(dataset["category"])
    print(f"\nCategories ({len(categories)}):")
    for cat, count in categories.most_common(10):
        print(f"  {cat}: {count:,}")


explore_dataset(english, "Sefaria English")
explore_dataset(hebrew, "Sefaria Hebrew")

Key findings:

  • Empty texts exist. About 2% of records have empty text fields. Filter these before embedding.
  • Some texts are very long. Talmud sugyot can be 5000+ characters. These need to be handled carefully in the embedding step (Gemini has a 2048 token limit per embedding input).
  • Hebrew dataset is 4x larger. This is expected since many texts only exist in the original Hebrew.
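The long-sugya finding means the embedding step needs a guard. A sketch of token-level truncation, written generically over an encode/decode pair so any tokenizer (e.g. tiktoken's `enc.encode`/`enc.decode` from the estimate above) can plug in; the 2048 limit is the figure quoted above, and whether to truncate or split long records is a design decision not settled here:

```python
MAX_TOKENS = 2048  # embedding input limit quoted above; verify against model docs

def truncate(text: str, encode, decode, max_tokens: int = MAX_TOKENS) -> str:
    """Clip text to max_tokens using the supplied encode/decode functions."""
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    return decode(tokens[:max_tokens])

# Demo with a toy whitespace "tokenizer" standing in for a real one.
enc = lambda s: s.split()
dec = lambda toks: " ".join(toks)
long_text = "word " * 3000
print(len(enc(truncate(long_text, enc, dec))))  # 2048
```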

Saving to Disk

I saved the processed dataset as Parquet files for fast loading later:

import os

import pandas as pd

# Filter out empty texts
english_clean = english.filter(lambda x: len(x["text"].strip()) > 0)
print(f"After filtering: {len(english_clean):,} texts")

# Save as parquet (fast columnar format)
df = english_clean.to_pandas()
df.to_parquet("sefaria_english.parquet", index=False)
print(f"Saved: sefaria_english.parquet ({os.path.getsize('sefaria_english.parquet') / 1024 / 1024:.1f} MB)")

Parquet is better than CSV for this use case: it is columnar (fast to read specific columns), compressed (400MB vs 2GB CSV), and preserves types (no string-to-int conversion issues).

Lessons Learned

  • Check if your data is pre-chunked. Many domain-specific datasets have natural chunk boundaries. Do not blindly apply a generic chunking strategy.
  • Explore before embedding. Empty texts, outlier lengths, and category distributions all affect your RAG quality. Spend an hour understanding your data before spending money on embeddings.
  • Token count estimation is not optional. Calculate exact costs before running the full batch. A 25x pricing error cost me real money.
  • Parquet over CSV. For datasets over 100K rows, Parquet loads 10x faster and uses 5x less disk space than CSV.

What is Next

Step 11 takes these texts and embeds them into Weaviate. This is where the cost disaster happens.