
RAG Chatbot

Status: active

Enterprise-grade RAG pipeline built from scratch. PDF ingestion, hybrid search, Cohere reranking, and LLM generation.

RAG · Python · Weaviate · Cohere · LangChain

A complete RAG (Retrieval-Augmented Generation) pipeline built as a learning project for AI Engineering.

Stack

  • Ingestion: Docling (PDF to Markdown), LangChain (chunking), Cohere (embeddings), Weaviate (vector DB)
  • Search: Hybrid search (BM25 + vector), Cohere Rerank, LLM generation via LiteLLM
  • Interface: Streamlit web app, deployed on HuggingFace Spaces

Architecture

Two Python files handle the entire pipeline:

  • ingest.py: PDF conversion, chunking, embedding, indexation
  • chat.py: query embedding, hybrid search, reranking, prompt construction, LLM generation

The system uses incremental ingestion with a SHA-256 manifest to avoid re-processing unchanged documents.

Benchmarks

Benchmark 000: Baseline Pipeline (Post-Migration)


Date: 2026-03-25
What: Baseline metrics after migrating from LiteLLM/Cohere embeddings to Google Gemini for both embeddings and LLM generation.

Setup

| Component | Value |
| --- | --- |
| Embedding model | Google gemini-embedding-001 (3072 dimensions) |
| Vector DB | Weaviate on Elestio (hybrid search) |
| Hybrid alpha | 0.5 (50% BM25 + 50% vector) |
| Search K | 20 candidates |
| Reranker | Cohere rerank-english-v3.0 |
| Rerank Top N | 5 |
| LLM | Google Gemini 2.5 Flash-Lite |
| Max tokens | 2048 |
| Chunk size | 500 tokens, 50 overlap |
| Documents | 2 PDFs (AI Engineering Guidebook, AI Engineering by Chip Huyen) |
| Total chunks | 804 |

Ingestion Results

| Metric | Value |
| --- | --- |
| Total chunks | 804 |
| Batch size | 50 chunks |
| Pause between batches | 60 seconds |
| Total batches | 17 (1 for collection creation + 16 for remaining) |
| Total ingestion time | ~20 minutes (including Docling conversion from cache) |
| Docling conversion | Loaded from cache (previously converted) |

Rate limit findings

| Batch size | Pause | Result |
| --- | --- | --- |
| 500 | 5s | 429 RESOURCE_EXHAUSTED |
| 100 | 10s | 429 after batch 1 (too many sub-requests) |
| 50 | 60s | OK — all 17 batches completed |

Lesson: langchain-google-genai splits each batch into sub-batches of ~20 texts internally. Even 100 chunks triggers 5-6 rapid API calls. 50 chunks + 60s pause is safe.
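
A minimal sketch of this batching pattern (the `embed_and_index` callback is a stand-in for the real embedding + Weaviate insert step, which is not shown here):

```python
import time
from typing import Callable, Sequence

def ingest_in_batches(chunks: Sequence[str],
                      embed_and_index: Callable[[Sequence[str]], None],
                      batch_size: int = 50,
                      pause_s: float = 60.0) -> int:
    """Send chunks in small batches, pausing between them to stay under rate limits.

    Returns the number of batches sent.
    """
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    for n, batch in enumerate(batches, start=1):
        embed_and_index(batch)   # e.g. the vector store's add_texts() call
        if n < len(batches):     # no pause needed after the final batch
            time.sleep(pause_s)
    return len(batches)
```

With 804 chunks and a batch size of 50, this yields ceil(804 / 50) = 17 batches, matching the run above.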

Query Test: "What is RAG?"

Retrieval

| Metric | Value |
| --- | --- |
| Hybrid search candidates | 20 |
| Rerank scores (top 5) | 0.997, 0.994, 0.984, 0.978, 0.963 |
| Sources returned | AI Engineering Guidebook.pdf, AI Engineering (Chip Huyen) |

Latency

| Step | Time |
| --- | --- |
| Weaviate connection | ~200ms |
| Google embedding (query) | ~600ms |
| Hybrid search | ~1.2s |
| Cohere rerank | ~800ms |
| Gemini generation | ~1.3s |
| Total end-to-end | ~4s |

Answer quality (manual assessment)

  • Correct: Yes — accurately explains RAG
  • Grounded: Yes — uses information from the indexed documents
  • Hallucination: None detected
  • Sources cited: Correctly identifies both source PDFs
  • Tone: Clear, well-structured with numbered steps

Cost Estimate

| API | Usage | Cost |
| --- | --- | --- |
| Google embeddings (ingestion) | ~400k tokens (804 chunks) | ~$0.06 |
| Google embeddings (per query) | ~50 tokens | almost nothing |
| Google Gemini Flash-Lite (per query) | ~2500 input + ~500 output | ~$0.0005 |
| Cohere rerank (per query) | 1 call, 20 docs | Free (trial) |

Previous Stack Comparison

| Aspect | Before (LiteLLM) | After (Google) |
| --- | --- | --- |
| Embedding model | Cohere embed-english-v3.0 (1024d) | Google gemini-embedding-001 (3072d) |
| LLM | Via LiteLLM proxy on Elestio | Google Gemini 2.5 Flash-Lite (direct) |
| API keys needed | 4 (Cohere, LiteLLM URL, LiteLLM key, Weaviate) | 3 (Google, Cohere, Weaviate) |
| Dependencies | langchain-openai, langchain-cohere | langchain-google-genai, langchain-cohere |
| LLM proxy | Self-hosted LiteLLM on Elestio | None (direct API) |
| Embedding dimensions | 1024 | 3072 |
| Batching required | Yes (Cohere 100k tokens/min limit) | No (Google paid tier is generous) |

What's Next

  • Phase 0 of the RAG Mastery Roadmap: build evaluate.py with automated metrics (Recall@K = how many correct results in top K, MRR = how high the first correct result ranks, Faithfulness = does the answer match the sources) to replace manual assessment
  • Create data/eval/questions.json test set for repeatable evaluation
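
The two ranking metrics named above can be sketched as plain functions over retrieved chunk ids (an illustrative sketch, not the planned evaluate.py):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

For example, if the first relevant chunk appears at rank 2, MRR is 0.5; averaging these values over the questions.json test set gives the pipeline-level scores. Faithfulness is harder to compute and typically needs an LLM judge, so it is not sketched here.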

Decisions

Decision: Chunk size of 500 tokens with 50 token overlap


Date: March 2026
Status: Active

Context

After converting documents to Markdown with Docling, we need to split them into chunks for embedding and retrieval. The chunk size directly impacts retrieval quality.

Options considered

| Chunk size | Overlap | Pros | Cons |
| --- | --- | --- | --- |
| 200 tokens | 20 | Very precise retrieval | Too small — loses context, more chunks to embed |
| 500 tokens | 50 | Good balance of precision and context | May split some ideas across chunks |
| 1000 tokens | 100 | Rich context per chunk | Less precise — chunk may contain irrelevant parts |
| Full document | 0 | Complete context | Too large for embedding, single vector per doc |

Decision: 500 tokens / 50 overlap

Why 500 tokens

  • ~375 words — roughly one topic or paragraph
  • Large enough to contain a complete idea
  • Small enough that the embedding captures a specific concept (not a blurry mix of many topics)
  • Fits well within embedding model input limits
  • Standard recommendation in most RAG guides and frameworks

Why 50 token overlap

  • 10% of chunk size — prevents cutting a sentence in half at chunk boundaries
  • If a key sentence falls right at the edge of a chunk, it appears in both the current and next chunk
  • Small enough to avoid excessive duplication (only ~37 extra words per chunk)

Splitter choice: RecursiveCharacterTextSplitter

Uses tiktoken's cl100k_base tokenizer (the tool that splits text into tokens) to measure length in tokens, not characters, and tries break points in this order:

  1. \n\n (paragraph break — best split point)
  2. \n (line break)
  3. " " (space — last resort)

This keeps paragraphs intact when possible.
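
The windowing arithmetic behind chunk size and overlap can be illustrated with a simplified token-list splitter (this ignores the paragraph-aware break points the real RecursiveCharacterTextSplitter uses):

```python
def split_with_overlap(tokens: list[str],
                       chunk_size: int = 500,
                       overlap: int = 50) -> list[list[str]]:
    """Slide a fixed window over the token list, stepping forward by
    chunk_size - overlap so each chunk repeats the last `overlap` tokens
    of the previous one."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

A 1000-token text with these defaults yields three chunks (0–499, 450–949, 900–999), and the last 50 tokens of each chunk reappear at the start of the next, so no sentence is lost at a boundary.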

Impact

With our 2 PDF books:

  • Total text: ~550k characters
  • Chunks created: 804
  • Average chunk: ~375 words

Future experiment

Try different chunk sizes (200, 500, 1000) and compare retrieval quality with the same set of test questions. Save results in docs/benchmarks/.

Decision: Why we originally chose Cohere for embeddings (then switched to Google)


Date: March 2026
Status: Superseded — migrated to Google gemini-embedding-001

Context

We needed an embedding model to convert document chunks into vectors for the RAG pipeline.

Options considered

| Provider | Model | Dimensions | Free tier | Quality |
| --- | --- | --- | --- | --- |
| Cohere | embed-english-v3.0 | 1024 | Trial (100k tokens/min) | Very good |
| Google | gemini-embedding-001 | 3072 | Free (1500 req/min) | Very good |
| OpenAI | text-embedding-3-small | 1536 | Paid only | Very good |
| Local (HuggingFace) | all-MiniLM-L6-v2 | 384 | Free (local) | Good |

Original decision: Cohere

  • Free trial tier was generous enough for development
  • Good LangChain integration (langchain-cohere)
  • Supports search_document / search_query input types (made to work better for search)
  • Also provides reranking (one provider for both)

Why we switched to Google

  • Supply chain attack on LiteLLM (March 2026) prompted a review of all providers
  • We were already using a GOOGLE_API_KEY for the LLM (Gemini)
  • Using Google for both embeddings and LLM = one API key, simpler stack
  • Google's free tier is generous enough
  • 3072 dimensions vs 1024 (small quality improvement)

Trade-off

Switching embedding models required a full re-index of all documents because the vector dimensions changed (1024 → 3072). Old vectors don't work with the new model.

Current stack

  • Embeddings: Google gemini-embedding-001 (3072d)
  • Reranking: Cohere rerank-english-v3.0 (kept — no Google equivalent)
  • LLM: Google Gemini 2.5 Flash-Lite

Decision: Why Weaviate as vector database


Date: March 2026
Status: Active

Context

We needed a vector database to store document chunks and their embeddings, with support for hybrid search (BM25 + vector).

Options considered

| Database | Hybrid search | Hosting | Free tier | LangChain support |
| --- | --- | --- | --- | --- |
| Weaviate | Yes (native) | Self-hosted / Cloud | Elestio | Yes |
| Pinecone | No (vector only) | Cloud only | Free tier | Yes |
| Qdrant | No (vector only) | Self-hosted / Cloud | Free tier | Yes |
| ChromaDB | No (vector only) | In-memory / local | Free (local) | Yes |
| pgvector | Partial (with extra setup) | Self-hosted | Free (local) | Yes |

Decision: Weaviate

Reasons

  1. Native hybrid search - Weaviate runs BM25 and vector search in parallel and merges results with an alpha you can adjust. Other databases need extra setup for keyword search.

  2. Self-hostable on Elestio - Full control over data, not stuck with one provider, predictable cost. Elestio provides managed Docker deployment with backups.

  3. Good LangChain integration - langchain-weaviate provides WeaviateVectorStore with from_documents(), similarity_search(), and a hybrid search alpha parameter.

  4. Scales well - Handles millions of vectors. Our ~800 chunks are tiny, but it will still work when we have more data.

  5. gRPC support - gRPC is a fast binary protocol; the Weaviate client uses it during ingestion for quicker bulk inserts.
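
Weaviate performs the hybrid fusion server-side; a simplified illustration of how the alpha parameter blends the two ranked lists (Weaviate's actual relative score fusion differs in detail, and all names here are illustrative):

```python
def hybrid_scores(bm25: dict[str, float],
                  vector: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Blend keyword and vector scores per document id.

    alpha=1.0 means pure vector search, alpha=0.0 pure BM25. Each score
    list is min-max normalized first so the two scales are comparable.
    """
    def norm(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    b, v = norm(bm25), norm(vector)
    ids = set(b) | set(v)
    return {i: alpha * v.get(i, 0.0) + (1 - alpha) * b.get(i, 0.0) for i in ids}
```

A document that ranks top in only one of the two lists gets roughly half the score of one that ranks top in both when alpha is 0.5, which is why the pipeline retrieves 20 candidates and lets the reranker make the final call.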

Trade-offs

  • More complex setup than ChromaDB (needs a running server)
  • Elestio hosting has a monthly cost (vs free local ChromaDB)
  • Weaviate Python client v4 API needs more code compared to simpler alternatives

Configuration

  • HTTP port 443 (HTTPS via Elestio reverse proxy)
  • gRPC port 50052
  • Basic auth (username/password)
  • Collection: RagChunks with text + vector fields