
RAG Chatbot

Status: active

Enterprise-grade RAG pipeline built from scratch. PDF ingestion, hybrid search, Cohere reranking, and LLM generation.

RAG · Python · Weaviate · Cohere · LangChain

A complete RAG (Retrieval-Augmented Generation) pipeline built as a learning project for AI Engineering.

Stack

  • Ingestion: Docling (PDF to Markdown), LangChain (chunking), Cohere (embeddings), Weaviate (vector DB)
  • Search: Hybrid search (BM25 + vector), Cohere Rerank, LLM generation via LiteLLM
  • Interface: Streamlit web app, deployed on HuggingFace Spaces

Architecture

Two Python files handle the entire pipeline:

  • ingest.py: PDF conversion, chunking, embedding, indexation
  • chat.py: query embedding, hybrid search, reranking, prompt construction, LLM generation

The system uses incremental ingestion with a SHA-256 manifest to avoid re-processing unchanged documents.

Benchmarks

Benchmark 000: Baseline Pipeline (Post-Migration)


Date: 2026-03-25
What: Baseline metrics after migrating from LiteLLM/Cohere embeddings to Google Gemini for both embeddings and LLM generation.

Setup

| Component | Value |
| --- | --- |
| Embedding model | Google gemini-embedding-001 (3072 dimensions) |
| Vector DB | Weaviate on Elestio (hybrid search) |
| Hybrid alpha | 0.5 (50% BM25 + 50% vector) |
| Search K | 20 candidates |
| Reranker | Cohere rerank-english-v3.0 |
| Rerank Top N | 5 |
| LLM | Google Gemini 2.5 Flash-Lite |
| Max tokens | 2048 |
| Chunk size | 500 tokens, 50 overlap |
| Documents | 2 PDFs (AI Engineering Guidebook, AI Engineering by Chip Huyen) |
| Total chunks | 804 |

Ingestion Results

| Metric | Value |
| --- | --- |
| Total chunks | 804 |
| Batch size | 50 chunks |
| Pause between batches | 60 seconds |
| Total batches | 17 (1 for collection creation + 16 for remaining) |
| Total ingestion time | ~20 minutes (including Docling conversion from cache) |
| Docling conversion | Loaded from cache (previously converted) |

Rate limit findings

| Batch size | Pause | Result |
| --- | --- | --- |
| 500 | 5s | 429 RESOURCE_EXHAUSTED |
| 100 | 10s | 429 after batch 1 (too many sub-requests) |
| 50 | 60s | OK — all 17 batches completed |

Lesson: langchain-google-genai splits each batch into sub-batches of ~20 texts internally. Even 100 chunks triggers 5-6 rapid API calls. 50 chunks + 60s pause is safe.
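
A minimal sketch of this batching pattern (the `embed_and_index` callback is a stand-in for the real embedding + Weaviate insert step, which is not shown here):

```python
import time
from typing import Callable, Sequence

def ingest_in_batches(chunks: Sequence[str],
                      embed_and_index: Callable[[Sequence[str]], None],
                      batch_size: int = 50,
                      pause_s: float = 60.0) -> int:
    """Send chunks in small batches, pausing between them to stay under rate limits.

    Returns the number of batches sent.
    """
    batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
    for n, batch in enumerate(batches, start=1):
        embed_and_index(batch)   # e.g. the vector store's add_texts() call
        if n < len(batches):     # no pause needed after the final batch
            time.sleep(pause_s)
    return len(batches)
```

With 804 chunks and a batch size of 50, this yields ceil(804 / 50) = 17 batches, matching the run above.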

Query Test: "What is RAG?"

Retrieval

| Metric | Value |
| --- | --- |
| Hybrid search candidates | 20 |
| Rerank scores (top 5) | 0.997, 0.994, 0.984, 0.978, 0.963 |
| Sources returned | AI Engineering Guidebook.pdf, AI Engineering (Chip Huyen) |

Latency

| Step | Time |
| --- | --- |
| Weaviate connection | ~200ms |
| Google embedding (query) | ~600ms |
| Hybrid search | ~1.2s |
| Cohere rerank | ~800ms |
| Gemini generation | ~1.3s |
| Total end-to-end | ~4s |

Answer quality (manual assessment)

  • Correct: Yes — accurately explains RAG
  • Grounded: Yes — uses information from the indexed documents
  • Hallucination: None detected
  • Sources cited: Correctly identifies both source PDFs
  • Tone: Clear, well-structured with numbered steps

Cost Estimate

| API | Usage | Cost |
| --- | --- | --- |
| Google embeddings (ingestion) | ~400k tokens (804 chunks) | ~$0.06 |
| Google embeddings (per query) | ~50 tokens | almost nothing |
| Google Gemini Flash-Lite (per query) | ~2500 input + ~500 output | ~$0.0005 |
| Cohere rerank (per query) | 1 call, 20 docs | Free (trial) |

Previous Stack Comparison

| Aspect | Before (LiteLLM) | After (Google) |
| --- | --- | --- |
| Embedding model | Cohere embed-english-v3.0 (1024d) | Google gemini-embedding-001 (3072d) |
| LLM | Via LiteLLM proxy on Elestio | Google Gemini 2.5 Flash-Lite (direct) |
| API keys needed | 4 (Cohere, LiteLLM URL, LiteLLM key, Weaviate) | 3 (Google, Cohere, Weaviate) |
| Dependencies | langchain-openai, langchain-cohere | langchain-google-genai, langchain-cohere |
| LLM proxy | Self-hosted LiteLLM on Elestio | None (direct API) |
| Embedding dimensions | 1024 | 3072 |
| Batching required | Yes (Cohere 100k tokens/min limit) | No (Google paid tier is generous) |

What's Next

  • Phase 0 of the RAG Mastery Roadmap: build evaluate.py with automated metrics (Recall@K = how many correct results in top K, MRR = how high the first correct result ranks, Faithfulness = does the answer match the sources) to replace manual assessment
  • Create data/eval/questions.json test set for repeatable evaluation
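
The two ranking metrics named above can be sketched as plain functions over retrieved chunk ids (an illustrative sketch, not the planned evaluate.py):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```

For example, if the first relevant chunk appears at rank 2, MRR is 0.5; averaging these values over the questions.json test set gives the pipeline-level scores. Faithfulness is harder to compute and typically needs an LLM judge, so it is not sketched here.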

Decisions

Decision: Chunk size of 500 tokens with 50 token overlap


Date: March 2026
Status: Active

Context

After converting documents to Markdown with Docling, we need to split them into chunks for embedding and retrieval. The chunk size directly impacts retrieval quality.

Options considered

| Chunk size | Overlap | Pros | Cons |
| --- | --- | --- | --- |
| 200 tokens | 20 | Very precise retrieval | Too small — loses context, more chunks to embed |
| 500 tokens | 50 | Good balance of precision and context | May split some ideas across chunks |
| 1000 tokens | 100 | Rich context per chunk | Less precise — chunk may contain irrelevant parts |
| Full document | 0 | Complete context | Too large for embedding, single vector per doc |

Decision: 500 tokens / 50 overlap

Why 500 tokens

  • ~375 words — roughly one topic or paragraph
  • Large enough to contain a complete idea
  • Small enough that the embedding captures a specific concept (not a blurry mix of many topics)
  • Fits well within embedding model input limits
  • Standard recommendation in most RAG guides and frameworks

Why 50 token overlap

  • 10% of chunk size — prevents cutting a sentence in half at chunk boundaries
  • If a key sentence falls right at the edge of a chunk, it appears in both the current and next chunk
  • Small enough to avoid excessive duplication (only ~37 extra words per chunk)

Splitter choice: RecursiveCharacterTextSplitter

Uses tiktoken's cl100k_base tokenizer (the tool that splits text into tokens) to measure length in tokens, not characters, and tries break points in this order:

  1. \n\n (paragraph break — best split point)
  2. \n (line break)
  3. " " (space — last resort)

This keeps paragraphs intact when possible.
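
The windowing arithmetic behind chunk size and overlap can be illustrated with a simplified token-list splitter (this ignores the paragraph-aware break points the real RecursiveCharacterTextSplitter uses):

```python
def split_with_overlap(tokens: list[str],
                       chunk_size: int = 500,
                       overlap: int = 50) -> list[list[str]]:
    """Slide a fixed window over the token list, stepping forward by
    chunk_size - overlap so each chunk repeats the last `overlap` tokens
    of the previous one."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

A 1000-token text with these defaults yields three chunks (0–499, 450–949, 900–999), and the last 50 tokens of each chunk reappear at the start of the next, so no sentence is lost at a boundary.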

Impact

With our 2 PDF books:

  • Total text: ~550k characters
  • Chunks created: 804
  • Average chunk: ~375 words

Future experiment

Try different chunk sizes (200, 500, 1000) and compare retrieval quality with the same set of test questions. Save results in docs/benchmarks/.

Decision: Why we originally chose Cohere for embeddings (then switched to Google)


Date: March 2026
Status: Superseded — migrated to Google gemini-embedding-001

Context

We needed an embedding model to convert document chunks into vectors for the RAG pipeline.

Options considered

| Provider | Model | Dimensions | Free tier | Quality |
| --- | --- | --- | --- | --- |
| Cohere | embed-english-v3.0 | 1024 | Trial (100k tokens/min) | Very good |
| Google | gemini-embedding-001 | 3072 | Free (1500 req/min) | Very good |
| OpenAI | text-embedding-3-small | 1536 | Paid only | Very good |
| Local (HuggingFace) | all-MiniLM-L6-v2 | 384 | Free (local) | Good |

Original decision: Cohere

  • Free trial tier was generous enough for development
  • Good LangChain integration (langchain-cohere)
  • Supports search_document / search_query input types (made to work better for search)
  • Also provides reranking (one provider for both)

Why we switched to Google

  • Supply chain attack on LiteLLM (March 2026) prompted a review of all providers
  • We were already using a GOOGLE_API_KEY for the LLM (Gemini)
  • Using Google for both embeddings and LLM = one API key, simpler stack
  • Google's free tier is generous enough
  • 3072 dimensions vs 1024 (small quality improvement)

Trade-off

Switching embedding models required a full re-index of all documents because the vector dimensions changed (1024 → 3072). Old vectors don't work with the new model.

Current stack

  • Embeddings: Google gemini-embedding-001 (3072d)
  • Reranking: Cohere rerank-english-v3.0 (kept — no Google equivalent)
  • LLM: Google Gemini 2.5 Flash-Lite

Decision: Why Weaviate as vector database


Date: March 2026
Status: Active

Context

We needed a vector database to store document chunks and their embeddings, with support for hybrid search (BM25 + vector).

Options considered

| Database | Hybrid search | Hosting | Free tier | LangChain support |
| --- | --- | --- | --- | --- |
| Weaviate | Yes (native) | Self-hosted / Cloud | Elestio | Yes |
| Pinecone | No (vector only) | Cloud only | Free tier | Yes |
| Qdrant | No (vector only) | Self-hosted / Cloud | Free tier | Yes |
| ChromaDB | No (vector only) | In-memory / local | Free (local) | Yes |
| pgvector | Partial (with extra setup) | Self-hosted | Free (local) | Yes |

Decision: Weaviate

Reasons

  1. Native hybrid search - Weaviate runs BM25 and vector search in parallel and merges results with an alpha you can adjust. Other databases need extra setup for keyword search.

  2. Self-hostable on Elestio - Full control over data, not stuck with one provider, predictable cost. Elestio provides managed Docker deployment with backups.

  3. Good LangChain integration - langchain-weaviate provides WeaviateVectorStore with from_documents(), similarity_search(), and a hybrid search alpha parameter.

  4. Scales well - Handles millions of vectors. Our ~800 chunks are tiny, but it will still work when we have more data.

  5. gRPC support - gRPC is a fast binary protocol; the Weaviate client uses it during ingestion for quicker bulk inserts.
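
Weaviate performs the hybrid fusion server-side; a simplified illustration of how the alpha parameter blends the two ranked lists (Weaviate's actual relative score fusion differs in detail, and all names here are illustrative):

```python
def hybrid_scores(bm25: dict[str, float],
                  vector: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Blend keyword and vector scores per document id.

    alpha=1.0 means pure vector search, alpha=0.0 pure BM25. Each score
    list is min-max normalized first so the two scales are comparable.
    """
    def norm(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {k: (v - lo) / span for k, v in scores.items()}

    b, v = norm(bm25), norm(vector)
    ids = set(b) | set(v)
    return {i: alpha * v.get(i, 0.0) + (1 - alpha) * b.get(i, 0.0) for i in ids}
```

A document that ranks top in only one of the two lists gets roughly half the score of one that ranks top in both when alpha is 0.5, which is why the pipeline retrieves 20 candidates and lets the reranker make the final call.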

Trade-offs

  • More complex setup than ChromaDB (needs a running server)
  • Elestio hosting has a monthly cost (vs free local ChromaDB)
  • Weaviate Python client v4 API needs more code compared to simpler alternatives

Configuration

  • HTTP port 443 (HTTPS via Elestio reverse proxy)
  • gRPC port 50052
  • Basic auth (username/password)
  • Collection: RagChunks with text + vector fields