Torah Study AI

in progress

Production RAG pipeline over 3.5M sacred-text segments. Hybrid search, Cohere reranking, strict anti-hallucination guardrails. Built with FastAPI, Weaviate, and Gemini.

Python, FastAPI, Weaviate, Gemini 2.5 Flash, Cohere Rerank, Next.js, shadcn/ui, Docker

System Architecture

Stack overview

| Component | Choice | Cost |
|---|---|---|
| Frontend | Next.js + shadcn/ui | $0 |
| State management | useState | $0 |
| Backend | FastAPI (Python) | $0 |
| Database | SQLite | $0 |
| Auth | JWT + MFA (email code) | $0 |
| RAG framework | LangChain | $0 |
| Embeddings | Gemini Embedding 001 | ~$27 for all Sefaria ($0.15/1M tokens) |
| Vector DB | Weaviate (Elestio) | ~$6/month |
| LLM | Gemini 2.5 Flash | ~$0 (free tier) |
| Reranking | Cohere Rerank | $0 (free tier) |
| Observability | LangFuse (Elestio) | $0 |
| Parashah data | Sefaria API /calendars | $0 |
| Hosting | Elestio | ~$10/month |
| Total POC | | ~$20-30/month |
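
The embedding line in the table implies a corpus size that can be back-computed from the quoted price. The sketch below does that arithmetic using only figures from the table; the resulting token count is an inference from those two numbers, not a figure published by Sefaria, and the function names are illustrative.

```python
# Back-of-the-envelope check of the costs quoted in the stack table.
PRICE_PER_MILLION_TOKENS_USD = 0.15  # Gemini Embedding 001, from the table
TOTAL_EMBEDDING_COST_USD = 27.0      # "~$27 for all Sefaria", from the table


def implied_tokens_millions(total_cost_usd: float, price_per_million_usd: float) -> float:
    """Return how many million tokens the quoted total cost would pay for."""
    return total_cost_usd / price_per_million_usd


def monthly_recurring_usd(*line_items_usd: float) -> float:
    """Sum the recurring monthly line items."""
    return sum(line_items_usd)


tokens_m = implied_tokens_millions(TOTAL_EMBEDDING_COST_USD, PRICE_PER_MILLION_TOKENS_USD)
recurring = monthly_recurring_usd(6.0, 10.0)  # Weaviate + hosting, from the table

print(f"Implied corpus size: ~{tokens_m:.0f}M tokens")
print(f"Itemized recurring cost: ~${recurring:.0f}/month")
```

The embedding cost is a one-time charge, while Weaviate and hosting recur monthly. The itemized recurring items sum to less than the ~$20-30/month total in the table, which presumably leaves headroom for usage beyond the free tiers; that reading is an assumption.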

Embeddings

Sefaria tested 18 models on their own texts and published the results:

| Model | Recall@1 | Cost for all Sefaria | Type |
|---|---|---|---|
| Gemini Embedding 001 | 93.9% | ~$27 ($0.15/1M tokens) | API |
| OpenAI text-embedding-3-large | 69.9% | ~$55 | API |
| OpenAI text-embedding-3-small | ~65% | ~$8 | API |
| DictaBERT (Hebrew specialized) | 1.7% | $0 | Local |

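Our choice: Gemini Embedding 001. On Rabbinic texts it costs roughly half as much as OpenAI text-embedding-3-large (~$27 vs ~$55) and scores 24 points higher on Recall@1 (93.9% vs 69.9%). OpenAI text-embedding-3-small is cheaper (~$8) but far less accurate (~65%).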
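
For readers unfamiliar with the metric: Recall@1 is the fraction of queries whose top-ranked result is the expected passage. A minimal sketch of that computation follows. The query IDs and references are made-up illustration data, and Sefaria's actual benchmark harness may differ.

```python
from typing import Mapping, Sequence


def recall_at_1(
    ranked_results: Mapping[str, Sequence[str]],
    expected: Mapping[str, str],
) -> float:
    """Fraction of queries whose first retrieved item is the expected one."""
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = 0
    for query_id, gold_ref in expected.items():
        ranking = ranked_results.get(query_id, ())
        if ranking and ranking[0] == gold_ref:
            hits += 1
    return hits / len(expected)


# Hypothetical toy data: three queries, two answered correctly at rank 1.
results = {
    "q1": ["Genesis 1:1", "Exodus 20:2"],
    "q2": ["Berakhot 2a", "Shabbat 31a"],
    "q3": ["Exodus 20:2", "Genesis 12:1"],
}
gold = {"q1": "Genesis 1:1", "q2": "Shabbat 31a", "q3": "Exodus 20:2"}

print(round(recall_at_1(results, gold), 3))  # → 0.667
```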

Database and Auth

Database: SQLite. A single file on disk. No server to manage. Good enough for a POC. We move to PostgreSQL when we need multiple concurrent users.
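
As an illustration of how little setup the single-file approach needs, here is a minimal sketch using Python's built-in sqlite3 module. The `users` table, its columns, and the helper names are assumptions made for this example, not the project's actual schema.

```python
import sqlite3


def open_db(path: str = "app.db") -> sqlite3.Connection:
    """Open (or create) the single-file database and ensure the schema exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS users (
            id INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE,
            verified INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.commit()
    return conn


def add_user(conn: sqlite3.Connection, email: str) -> int:
    """Insert a new, unverified user and return its row id."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    return cur.lastrowid


conn = open_db(":memory:")  # in-memory for the demo; use a file path in practice
user_id = add_user(conn, "student@example.com")
row = conn.execute(
    "SELECT email, verified FROM users WHERE id = ?", (user_id,)
).fetchone()
print(row)  # → ('student@example.com', 0)
```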

Auth: JWT + MFA with email code.

  • User registers with email
  • Receives a verification code by email (MFA)
  • Gets a JWT token after verification
  • Token sent with every API request
  • Stateless auth, no session storage needed
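
The flow above can be sketched with the standard library alone. This is a hand-rolled HS256 JWT for illustration only: the secret, claim names, code length, and function names are all assumptions, and a real deployment would more likely use a maintained JWT library and send the code through an email service.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time
from typing import Optional

SECRET_KEY = b"change-me"  # hypothetical; load from configuration in practice


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text, restoring the stripped padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str) -> str:
    digest = hmac.new(SECRET_KEY, signing_input.encode(), hashlib.sha256).digest()
    return _b64url(digest)


def new_email_code() -> str:
    """Six-digit one-time code to send to the user's email address."""
    return f"{secrets.randbelow(10**6):06d}"


def issue_token(email: str, ttl_seconds: int = 3600, now: Optional[float] = None) -> str:
    """Create a signed token carrying the user's email and an expiry time."""
    issued = int(time.time() if now is None else now)
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": email, "exp": issued + ttl_seconds}).encode())
    return f"{header}.{payload}.{_sign(header + '.' + payload)}"


def verify_token(token: str, now: Optional[float] = None) -> dict:
    """Return the claims if the signature is valid and the token is unexpired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("malformed token")
    header, payload, signature = parts
    expected = _sign(header + "." + payload)
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if claims["exp"] <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims


code = new_email_code()                     # emailed to the user at registration
token = issue_token("student@example.com")  # issued once the code is confirmed
print(verify_token(token)["sub"])           # → student@example.com
```

Because the server only needs the secret key to check a token, no session table is required, which is what makes the scheme stateless.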

RAG Pipeline

Parashah of the week

The app calls Sefaria's /calendars API endpoint, which returns the current week's Torah portion. We display it as a suggested study topic when the user opens the app.
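
A minimal sketch of that lookup, using only the standard library. The full URL and the response shape (a `calendar_items` list whose entries carry `title`, `displayValue`, and `ref`) are assumptions based on the public API and should be checked against a live response; the sample payload is illustrative, not real output.

```python
import json
from typing import Optional
from urllib.request import urlopen

CALENDARS_URL = "https://www.sefaria.org/api/calendars"  # assumed full URL


def extract_parashah(payload: dict) -> Optional[dict]:
    """Pick the weekly Torah portion out of a /calendars response.

    Assumes the relevant entry has the English title "Parashat Hashavua".
    """
    for item in payload.get("calendar_items", []):
        if item.get("title", {}).get("en") == "Parashat Hashavua":
            return {
                "name": item.get("displayValue", {}).get("en"),
                "ref": item.get("ref"),
            }
    return None


def fetch_parashah(url: str = CALENDARS_URL, timeout: float = 10.0) -> Optional[dict]:
    """Fetch the calendar and return the weekly portion, if present."""
    with urlopen(url, timeout=timeout) as response:
        return extract_parashah(json.load(response))


# Illustrative payload in the assumed shape (not a real API response).
sample_payload = {
    "calendar_items": [
        {
            "title": {"en": "Daf Yomi"},
            "displayValue": {"en": "Berakhot 2"},
            "ref": "Berakhot 2",
        },
        {
            "title": {"en": "Parashat Hashavua"},
            "displayValue": {"en": "Bereshit"},
            "ref": "Genesis 1:1-6:8",
        },
    ]
}

print(extract_parashah(sample_payload))
```

Keeping the parsing separate from the network call makes the lookup easy to test offline and to cache, since the weekly portion changes only once a week.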

What Sefaria gives us (HuggingFace)

| Resource | What it is | Impact |
|---|---|---|
| hebrew_library | 3.55M text segments, already chunked | Skip manual chunking |
| english_library | 886K English segments | Ready to embed directly |
| links | 3.74M cross-references | Knowledge graph for V2 |
| NER models | Detect citations in text | Auto-extract Sefaria refs |
| Embedding Leaderboard | 18 models benchmarked | Proved Gemini is best |
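
To make the "knowledge graph" idea for the links dataset concrete, here is one possible sketch that treats each cross-reference as an undirected edge between two refs. The pair format and the sample refs are assumptions for illustration; the actual column names in the dataset are not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_link_graph(links: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (ref_a, ref_b) pairs."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for ref_a, ref_b in links:
        graph[ref_a].add(ref_b)
        graph[ref_b].add(ref_a)
    return dict(graph)


# Hypothetical sample pairs in the assumed (ref_a, ref_b) format.
sample_links = [
    ("Genesis 1:1", "Rashi on Genesis 1:1:1"),
    ("Genesis 1:1", "Bereshit Rabbah 1:1"),
    ("Exodus 20:2", "Rashi on Exodus 20:2:1"),
]

graph = build_link_graph(sample_links)
print(sorted(graph["Genesis 1:1"]))
# → ['Bereshit Rabbah 1:1', 'Rashi on Genesis 1:1:1']
```

With such a map, a retrieved segment could be expanded with its linked commentaries before reranking. That is one way the links might be used in V2, not a description of the current pipeline.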

Acknowledgment

This project would not exist without Sefaria and their incredible work making Jewish texts accessible to everyone. Their open-source datasets on HuggingFace, their embedding benchmark, their NER models, and their API are what make projects like this possible. As Sara Wolkenfeld (Chief Learning Officer at Sefaria) said: "Making our data available for others to use creatively has always been a core part of Sefaria's mission."