RAG
03·RAG·updated 2026-04-19

Vector Databases

TL;DR

A vector database stores unstructured data (text, image, audio) as high-dimensional vectors called embeddings. It supports fast approximate nearest neighbor (ANN) search over millions to billions of vectors. In RAG, the vector DB is the LLM's external memory: the place where the LLM can "look up" information it was not trained on.

The historical problem

Relational databases excel at structured data and exact matches. They are bad at "find me documents similar in meaning to this one". SQL has no native notion of semantic similarity.

Early attempts (keyword search, TF-IDF, BM25) work on lexical overlap. They miss paraphrases and cross-language matches. "car" and "automobile" are unrelated to a BM25 index.

With the rise of deep-learning embeddings (Word2Vec 2013, BERT 2018, sentence transformers 2019), each text can be compressed into a vector where semantic similarity maps to geometric proximity. Fruits cluster together, cities cluster together, the vector of "king - man + woman" lands near "queen".

Then you need a database specialized in nearest-neighbor queries on millions of these vectors. That is the vector database.

How it works

1. Embeddings are vectors

Each item (chunk, image, row) gets transformed into a fixed-size vector. Typical dimensions:

  • OpenAI text-embedding-3-small: 1536
  • OpenAI text-embedding-3-large: 3072
  • Cohere embed-multilingual-v3: 1024
  • BGE-M3: 1024
  • CLIP (images): 512 or 768

2. Similarity metrics

Given two vectors a and b:

  • Cosine similarity: a·b / (|a||b|). 1 = identical direction, 0 = orthogonal, -1 = opposite.
  • Dot product: a·b. Equal to cosine when both vectors are L2-normalized.
  • Euclidean (L2): |a - b|. A distance, so lower is closer.

Most vector DBs default to cosine for text embeddings.
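
The three metrics can be written out in a few lines of plain Python (a minimal sketch; real systems use vectorized library code, not loops):

```python
import math

def cosine(a, b):
    # a·b / (|a||b|): invariant to vector magnitude
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dot(a, b):
    # equals cosine when both vectors are L2-normalized
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # a distance, not a similarity: lower means closer
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine(a, b))     # 0.0 — orthogonal
print(euclidean(a, b))  # ~1.414
```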

3. Exact vs approximate nearest neighbor

  • Exact NN: compute similarity with every vector in the DB. O(N) per query. Fine for 100k vectors, slow at 10M, impractical at 1B.
  • Approximate NN (ANN): use indexes like HNSW, IVF, or ScaNN to find "probably the closest k" in sublinear time. Trade-off: 95-99% recall vs exact, 100-1000x faster.

Production systems use ANN.
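
Exact NN is just a full scan, which makes the O(N) cost obvious (a toy sketch, pure Python):

```python
def exact_knn(query, vectors, k=3):
    # O(N) scan: score every stored vector against the query
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den
    scored = [(cos(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)            # highest similarity first
    return [i for _, i in scored[:k]]

vectors = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
print(exact_knn([1, 0], vectors, k=2))   # [0, 1]
```

An ANN index (HNSW, IVF) replaces the full scan with a sublinear traversal of a precomputed structure, at the cost of occasionally missing a true neighbor.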

4. Payload and metadata

A vector database stores three things per item:

  • The vector itself (for similarity search)
  • The original content (payload: the chunk text, the image, etc.)
  • Metadata (source URL, author, date, tags, language, user_id)

Metadata enables filtering ("retrieve top-k from docs published after 2024 and in English"). Essential in production.
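
A hedged sketch of the record layout and a pre-filtered search (field names are illustrative, not any particular DB's schema):

```python
# Hypothetical record layout: vector + payload + metadata per item.
records = [
    {"vector": [0.9, 0.1], "payload": "Q4 revenue grew 12%.",
     "metadata": {"lang": "en", "year": 2025}},
    {"vector": [0.8, 0.2], "payload": "Le chiffre d'affaires a augmente.",
     "metadata": {"lang": "fr", "year": 2025}},
    {"vector": [0.1, 0.9], "payload": "Office dog policy.",
     "metadata": {"lang": "en", "year": 2023}},
]

def search(query, records, k=2, filt=None):
    # Pre-filter on metadata, then score only the surviving vectors
    pool = [r for r in records if filt is None or filt(r["metadata"])]
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        return num / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))
    pool.sort(key=lambda r: cos(query, r["vector"]), reverse=True)
    return [r["payload"] for r in pool[:k]]

# "retrieve top-k from docs published after 2024 and in English"
hits = search([1.0, 0.0], records,
              filt=lambda m: m["lang"] == "en" and m["year"] > 2024)
print(hits)  # ['Q4 revenue grew 12%.']
```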

5. ANN index structures

  • HNSW (Hierarchical Navigable Small World): multi-layer graph. Most common. Fast, high recall, more RAM.
  • IVF (Inverted File): partition the vector space into Voronoi cells, search within relevant cells. Lower RAM, slower.
  • ScaNN: Google's algorithm, anisotropic quantization. Used inside BigQuery, Vertex AI.
  • DiskANN: disk-based for billion-scale.

The role of vector databases in RAG

The core loop

Ingestion (offline):
  Text -> Embedding model -> Vector -> Vector DB (+ payload + metadata)

Query (online):
  User query -> Embedding model (same!) -> Query vector -> ANN search
                                                            |
                                                            v
                                                  Top-k similar vectors
                                                            |
                                                            v
                                                  Retrieve payloads
                                                            |
                                                            v
                                                  Stuff into LLM prompt
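
The two phases above, sketched end to end. The bag-of-words `embed` below is a toy stand-in for a real embedding model; the point it illustrates is that ingestion and query must go through the same model:

```python
VOCAB = ["cats", "purr", "happy", "stock", "markets", "fell", "why", "do"]

def embed(text):
    # Toy stand-in for an embedding model; tokens outside the tiny
    # vocabulary are ignored. Real systems call the SAME model here
    # at ingestion time and at query time.
    vec = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in VOCAB:
            vec[VOCAB.index(tok)] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]

store = []  # the "vector DB": (vector, payload) pairs

# Ingestion (offline)
for chunk in ["cats purr when happy", "stock markets fell today"]:
    store.append((embed(chunk), chunk))

# Query (online): embed -> search (here: exact scan) -> payloads -> prompt
qv = embed("why do cats purr")
ranked = sorted(store, key=lambda p: sum(a * b for a, b in zip(qv, p[0])),
                reverse=True)
context = "\n".join(payload for _, payload in ranked[:1])
prompt = f"Answer using only this context:\n{context}\n\nQ: why do cats purr"
print(context)  # cats purr when happy
```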

What the vector DB replaces

Before vector DBs, you would either:

  • Retrain the LLM on new data (slow, expensive)
  • Stuff everything into the prompt (limited by context window)

A vector DB lets you:

  • Scale the knowledge base to billions of items
  • Refresh data without retraining
  • Support multi-tenant isolation (per-user vectors)
  • Keep data private (it never enters the LLM's training)

See rag workflow for the full pipeline.

Relevance today (2026)

The market consolidated

Landscape in 2026:

  • Managed: Pinecone (leader), Weaviate Cloud, Qdrant Cloud, Cohere embeddings + their own store
  • Open source self-hosted: Qdrant, Weaviate, Milvus, Chroma (devex leader)
  • Bolt-on to existing DBs: pgvector (Postgres), Elasticsearch dense vector, Redis vector, MongoDB Atlas Vector Search, SQLite sqlite-vec
  • Cloud-native: Vertex AI Vector Search, AWS OpenSearch, Azure AI Search

The big shift: bolt-on to existing DBs overtook specialized DBs for many teams. If you already use Postgres, pgvector is often enough up to 50M vectors.

Hybrid search is standard

Dense vectors (cosine similarity) miss exact matches: product codes, SKUs, rare names. BM25 sparse retrieval misses paraphrases. Hybrid: run both, fuse rankings (reciprocal rank fusion, or weighted sum). In 2026, every serious vector DB supports this natively.
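
Reciprocal rank fusion itself is only a few lines; a sketch using k=60, the constant from the original RRF paper:

```python
def rrf(rankings, k=60):
    # rankings: one ranked list of doc ids per retriever (dense, BM25, ...).
    # Each doc accumulates 1 / (k + rank) across the lists it appears in,
    # so items found by several retrievers rise to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # semantic neighbors
sparse = ["d7", "d2", "d3"]  # BM25 exact-match hits
print(rrf([dense, sparse]))  # d3 and d7 rank first: found by both retrievers
```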

Embedding dimensions shrank (Matryoshka)

Matryoshka Representation Learning produces embeddings you can truncate (1536 -> 512 -> 128) with minimal loss. Cuts storage and index cost drastically. OpenAI text-embedding-3-* models support this.
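
Truncation is just slicing plus re-normalization. This only preserves quality for embeddings trained with MRL (such as the text-embedding-3-* family); truncating an arbitrary embedding this way destroys it:

```python
def truncate_embedding(vec, dim):
    # Keep the first `dim` components, then re-normalize to unit length
    # so cosine/dot-product comparisons remain valid.
    head = vec[:dim]
    norm = sum(x * x for x in head) ** 0.5
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]   # pretend 6-dim MRL embedding
short = truncate_embedding(full, 4)
print(len(short))                  # 4
print(sum(x * x for x in short))   # ~1.0: still unit length
```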

Quantization and compression

Product quantization, scalar quantization (int8), binary quantization (1 bit) reduce vector size 4x-32x with small recall cost. Qdrant, Pinecone, Milvus all support it. Billion-scale on commodity hardware became practical.
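
Scalar quantization is the simplest of the three; a minimal int8 sketch (4x smaller than float32, per-vector scale):

```python
def quantize_int8(vec):
    # Map floats in [-m, m] to integers in [-127, 127].
    m = max(abs(x) for x in vec) or 1.0
    scale = 127.0 / m
    return [round(x * scale) for x in vec], scale

def dequantize(qvec, scale):
    return [q / scale for q in qvec]

vec = [0.12, -0.98, 0.45, 0.03]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(vec, approx))
print(q)          # small ints, e.g. [16, -127, 58, 4]
print(err < 0.01) # True: reconstruction error is tiny
```

Production systems (Qdrant, Milvus) typically quantize per segment rather than per vector and keep the original floats around for rescoring, but the size/accuracy trade-off is the same idea.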

Multimodal vectors are routine

CLIP-style embeddings let you search images by text and vice versa. By 2026, many DBs index multiple vector columns per record (text embedding + image embedding + metadata) and let queries filter on all three.

Critical question: do you need one?

If your corpus has <10k chunks, a simple Python list plus numpy brute-force search is fine. Do not adopt Pinecone for a demo. As you scale, a rough decision tree:

  • <100k vectors: pgvector, SQLite, Chroma
  • 100k-10M: Qdrant, Weaviate, pgvector with proper tuning
  • 10M-1B: Pinecone, Milvus, Qdrant cluster
  • >1B: Milvus, Vertex Vector Search, custom

Critical questions

  • Why store the original text as payload if you only query by vector? (You need it for the LLM prompt after retrieval.)
  • Why cosine and not Euclidean for text? (Cosine is invariant to magnitude. Embedding norms reflect token count, which is not semantic.)
  • HNSW is memory-heavy. When do you pick IVF or disk-based indexes instead? (Billion-scale, cost-sensitive, or read-heavy-but-rare queries.)
  • What happens when you upgrade your embedding model? (Re-embed everything, or keep two indexes during transition. Mixed-space comparisons are meaningless.)
  • Why filter by metadata before or after ANN? (Pre-filter: cheaper, might miss items if filter is very selective. Post-filter: correct but wastes ANN work. Most DBs offer both; Qdrant and Weaviate do well here.)
  • Should you use a dedicated vector DB or pgvector? (Start with pgvector if you already use Postgres. Migrate when you hit performance or scale limits.)

Production pitfalls

  • Over-indexing on "the best vector DB". The differentiator is almost never the store. It is chunking, embedding quality, reranking. Pick a reasonable DB and move on.
  • Mixing embedding models. You re-ran with a new model and forgot to wipe the old index. Results degrade silently.
  • Ignoring payload size. Storing full HTML pages per vector inflates storage 50x. Store just the chunk.
  • No pre-aggregation for filters. Filters like "posts by user X" on 100M rows with ANN can be brutally slow. Partition your index by tenant or heavy-use filter key.
  • Naive cold start. Embedding 10M chunks with OpenAI at $0.02/1M tokens is manageable, but pay attention to rate limits. Batch smartly.
  • Missing re-ranking. Raw ANN recall is imperfect. A reranker on top is near-free in latency but big in accuracy.
  • Security. Multi-tenant vector stores without row-level security leak data between users. Audit.

Alternatives / Comparisons

Option | Strengths | Weaknesses | Good for
Pinecone | Managed, mature, fast | Closed source, costs at scale | Prod teams, enterprise
Qdrant | Open source, fast, great filtering | More ops | Self-hosted prod
Weaviate | Built-in hybrid, multimodal | Complex config | Hybrid-first teams
Chroma | Dev ergonomics, local | Scale limits | Prototypes, small apps
pgvector | Reuse existing Postgres | Slower at huge scale | Teams with Postgres
Milvus | Billion-scale, mature ANN | Heavyweight | Very large corpora
Elastic + dense_vector | Existing stack | Mediocre ANN | Log-heavy teams
Redis vector | Real-time, in-memory | Pricey at scale | Low-latency, small corpus
sqlite-vec | Local, zero-ops | Single-node | Edge, mobile, dev

Mental parallels (non-AI)

  • GPS coordinates for meaning: embeddings place each concept on a map. "Close on the map = similar in meaning". The vector DB is the map provider.
  • Spotify recommendation: songs are embedded based on audio and user behavior. "Find similar songs" is ANN on a vector DB.
  • Face recognition: each face is a vector. Matching a new photo is ANN on the database of known faces.
  • Card catalog with fuzzy search: traditional libraries use exact catalogs. Vector DBs are like a librarian who understands what you MEAN, not just what you typed.

Mini-lab

See rag workflow lab. Specifically for vector DBs:

  1. Embed 10k chunks with text-embedding-3-small.
  2. Load into three stores: Qdrant, pgvector, Chroma.
  3. Measure:
    • Indexing time
    • Query latency at top-10
    • Recall vs exact NN
    • Disk and RAM usage
  4. Add metadata filters ("only chunks from source X") and measure the impact.
  5. Try quantization (int8 in Qdrant) and compare.

Goal: feel which DB actually fits your shape. Decisions flow from measurements, not marketing.
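
The "recall vs exact NN" measurement from step 3 is just the overlap between what the ANN index returned and the brute-force ground truth:

```python
def recall_at_k(ann_ids, exact_ids):
    # Fraction of the true top-k that the ANN index actually returned.
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

exact = [4, 9, 17, 23, 42]     # ground truth from a brute-force scan
ann = [4, 9, 23, 42, 51]       # what the ANN index returned (illustrative ids)
print(recall_at_k(ann, exact)) # 0.8 — it missed doc 17
```

Average this over a few hundred held-out queries; a single query tells you nothing.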

Tags: vector-db, embeddings, ann, indexing, similarity-search