Writing about what I learn.
AI engineering in production: RAG, agents, evaluations. One post when I ship something or break it. No newsletter speak.
5 techniques to make your RAG system actually work
A vanilla RAG retrieves documents and hopes for the best. Here are 5 techniques that production RAG systems use to go from 'it works sometimes' to 'it works reliably'.
How a RAG Server Works, Step by Step
A RAG server has two phases: prepare the knowledge base once, then answer questions forever. Here is what happens at each step.
Why strict RAG matters on sensitive data
When your LLM can fall back to general knowledge, it will. On religious texts, legal docs, or medical data, that is not acceptable. Here is why.
How to handle bad RAG results gracefully
Your RAG system found nothing relevant. Now what? The industry patterns for fallback strategies, relevance thresholds, and honest abstention.
Cohere Rerank: When to Use It (and When Not)
Reranking improved our search from 'sort of related' to 'exactly what you asked for.' Here is how the scores work and when to add it to your pipeline.
Choosing an Embedding Model: Benchmarks Over Brand
OpenAI is not always the best choice. How Sefaria's benchmark showed Gemini is 40% more accurate on Rabbinic texts, at a third of the cost.
RAG vs Long Context: do you still need a vector database?
Context windows now hold millions of tokens. So why not just dump everything in? Here's when RAG still wins, when long context is better, and how to choose.
SHA-256: How It Works
What SHA-256 is, its key properties, and why we use it to track file changes.
Why Reranking Matters
What reranking does, how cross-encoders work, and why it dramatically improves RAG quality.
Hybrid Search Explained
What hybrid search is, how the alpha parameter works, and when to adjust it.
What Are Embedding Dimensions?
What dimensions mean in embedding vectors, whether more is better, and when it matters.
BM25 vs Vector Search
The difference between keyword search (BM25) and semantic search (vectors), and why you need both.
What are tokens (and why they cost you money)
Tokens are the currency of LLMs. Understand how they work and you'll understand your bill, the limits, and the quirks of AI.
What is an LLM?
Large Language Models explained simply. What they are, how they work, and what they can do.
CLAUDE.md and AGENTS.md: giving your AI agent a memory
Your agent forgets everything between conversations. CLAUDE.md and AGENTS.md fix that. Here's what works everywhere, and what Claude Code adds on top.