RAG
03·RAG·updated 2026-04-19

Vector Databases

TL;DR

A vector database stores unstructured data (text, image, audio) as high-dimensional vectors called embeddings. It supports fast approximate nearest neighbor (ANN) search over millions to billions of vectors. In RAG, the vector DB is the LLM's external memory: the place where the LLM can "look up" information it was not trained on.

The historical problem

Relational databases excel at structured data and exact matches. They are bad at "find me documents similar in meaning to this one". SQL has no native notion of semantic similarity.

Early attempts (keyword search, TF-IDF, BM25) work on lexical overlap. They miss paraphrases and cross-language matches. "car" and "automobile" are unrelated to a BM25 index.

With the rise of deep-learning embeddings (Word2Vec 2013, BERT 2018, sentence transformers 2019), each text can be compressed into a vector where semantic similarity maps to geometric proximity. Fruits cluster together, cities cluster together, the vector of "king - man + woman" lands near "queen".

Then you need a database specialized in nearest-neighbor queries on millions of these vectors. That is the vector database.

How it works

1. Embeddings are vectors

Each item (chunk, image, row) gets transformed into a fixed-size vector. Typical dimensions:

  • OpenAI text-embedding-3-small: 1536
  • OpenAI text-embedding-3-large: 3072
  • Cohere embed-multilingual-v3: 1024
  • BGE-M3: 1024
  • CLIP (images): 512 or 768

2. Similarity metrics

Given two vectors a and b:

  • Cosine similarity: a·b / (|a||b|). 1 = identical direction, 0 = orthogonal, -1 = opposite.
  • Dot product: a·b. Equal to cosine when both vectors are L2-normalized.
  • Euclidean (L2): |a - b|. A distance, so lower is closer.

Most vector DBs default to cosine for text embeddings.
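
The three metrics can be written out in a few lines of plain Python (a minimal sketch; real systems use vectorized library code, not loops):

```python
import math

def cosine(a, b):
    # a·b / (|a||b|): invariant to vector magnitude
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dot(a, b):
    # equals cosine when both vectors are L2-normalized
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # a distance, not a similarity: lower means closer
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine(a, b))     # 0.0 — orthogonal
print(euclidean(a, b))  # ~1.414
```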

3. Exact vs approximate nearest neighbor

  • Exact NN: compute similarity with every vector in the DB. O(N) per query. Fine for 100k vectors, slow at 10M, impractical at 1B.
  • Approximate NN (ANN): use indexes like HNSW, IVF, or ScaNN to find "probably the closest k" in sublinear time. Trade-off: 95-99% recall vs exact, 100-1000x faster.

Production systems use ANN.
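
Exact NN is just a full scan, which makes the O(N) cost obvious (a toy sketch, pure Python):

```python
def exact_knn(query, vectors, k=3):
    # O(N) scan: score every stored vector against the query
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den
    scored = [(cos(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)            # highest similarity first
    return [i for _, i in scored[:k]]

vectors = [[1, 0], [0.9, 0.1], [0, 1], [-1, 0]]
print(exact_knn([1, 0], vectors, k=2))   # [0, 1]
```

An ANN index (HNSW, IVF) replaces the full scan with a sublinear traversal of a precomputed structure, at the cost of occasionally missing a true neighbor.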

4. Payload and metadata

A vector database stores three things per item:

  • The vector itself (for similarity search)
  • The original content (payload: the chunk text, the image, etc.)
  • Metadata (source URL, author, date, tags, language, user_id)

Metadata enables filtering ("retrieve top-k from docs published after 2024 and in English"). Essential in production.
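
A hedged sketch of the record layout and a pre-filtered search (field names are illustrative, not any particular DB's schema):

```python
# Hypothetical record layout: vector + payload + metadata per item.
records = [
    {"vector": [0.9, 0.1], "payload": "Q4 revenue grew 12%.",
     "metadata": {"lang": "en", "year": 2025}},
    {"vector": [0.8, 0.2], "payload": "Le chiffre d'affaires a augmente.",
     "metadata": {"lang": "fr", "year": 2025}},
    {"vector": [0.1, 0.9], "payload": "Office dog policy.",
     "metadata": {"lang": "en", "year": 2023}},
]

def search(query, records, k=2, filt=None):
    # Pre-filter on metadata, then score only the surviving vectors
    pool = [r for r in records if filt is None or filt(r["metadata"])]
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        return num / ((sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5))
    pool.sort(key=lambda r: cos(query, r["vector"]), reverse=True)
    return [r["payload"] for r in pool[:k]]

# "retrieve top-k from docs published after 2024 and in English"
hits = search([1.0, 0.0], records,
              filt=lambda m: m["lang"] == "en" and m["year"] > 2024)
print(hits)  # ['Q4 revenue grew 12%.']
```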

5. ANN index structures

  • HNSW (Hierarchical Navigable Small World): multi-layer graph. Most common. Fast, high recall, more RAM.
  • IVF (Inverted File): partition the vector space into Voronoi cells, search within relevant cells. Lower RAM, slower.
  • ScaNN: Google's algorithm, anisotropic quantization. Used inside BigQuery, Vertex AI.
  • DiskANN: disk-based for billion-scale.

The role of vector databases in RAG

The core loop

Ingestion (offline):
  Text -> Embedding model -> Vector -> Vector DB (+ payload + metadata)

Query (online):
  User query -> Embedding model (same!) -> Query vector -> ANN search
                                                            |
                                                            v
                                                  Top-k similar vectors
                                                            |
                                                            v
                                                  Retrieve payloads
                                                            |
                                                            v
                                                  Stuff into LLM prompt
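
The two phases above, sketched end to end. The bag-of-words `embed` below is a toy stand-in for a real embedding model; the point it illustrates is that ingestion and query must go through the same model:

```python
VOCAB = ["cats", "purr", "happy", "stock", "markets", "fell", "why", "do"]

def embed(text):
    # Toy stand-in for an embedding model; tokens outside the tiny
    # vocabulary are ignored. Real systems call the SAME model here
    # at ingestion time and at query time.
    vec = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in VOCAB:
            vec[VOCAB.index(tok)] += 1.0
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]

store = []  # the "vector DB": (vector, payload) pairs

# Ingestion (offline)
for chunk in ["cats purr when happy", "stock markets fell today"]:
    store.append((embed(chunk), chunk))

# Query (online): embed -> search (here: exact scan) -> payloads -> prompt
qv = embed("why do cats purr")
ranked = sorted(store, key=lambda p: sum(a * b for a, b in zip(qv, p[0])),
                reverse=True)
context = "\n".join(payload for _, payload in ranked[:1])
prompt = f"Answer using only this context:\n{context}\n\nQ: why do cats purr"
print(context)  # cats purr when happy
```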

What the vector DB replaces

Before vector DBs, you would either:

  • Retrain the LLM on new data (slow, expensive)
  • Stuff everything into the prompt (limited by context window)

A vector DB lets you:

  • Scale the knowledge base to billions of items
  • Refresh data without retraining
  • Support multi-tenant isolation (per-user vectors)
  • Keep data private (it never enters the LLM's training)

See rag workflow for the full pipeline.

Relevance today (2026)

The market consolidated

Landscape in 2026:

  • Managed: Pinecone (leader), Weaviate Cloud, Qdrant Cloud, Cohere embeddings + their own store
  • Open source self-hosted: Qdrant, Weaviate, Milvus, Chroma (devex leader)
  • Bolt-on to existing DBs: pgvector (Postgres), Elasticsearch dense vector, Redis vector, MongoDB Atlas Vector Search, SQLite sqlite-vec
  • Cloud-native: Vertex AI Vector Search, AWS OpenSearch, Azure AI Search

The big shift: bolt-on to existing DBs overtook specialized DBs for many teams. If you already use Postgres, pgvector is often enough up to 50M vectors.

Hybrid search is standard

Dense vectors (cosine similarity) miss exact matches: product codes, SKUs, rare names. BM25 sparse retrieval misses paraphrases. Hybrid: run both, fuse rankings (reciprocal rank fusion, or weighted sum). In 2026, every serious vector DB supports this natively.
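
Reciprocal rank fusion itself is only a few lines; a sketch using k=60, the constant from the original RRF paper:

```python
def rrf(rankings, k=60):
    # rankings: one ranked list of doc ids per retriever (dense, BM25, ...).
    # Each doc accumulates 1 / (k + rank) across the lists it appears in,
    # so items found by several retrievers rise to the top.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # semantic neighbors
sparse = ["d7", "d2", "d3"]  # BM25 exact-match hits
print(rrf([dense, sparse]))  # d3 and d7 rank first: found by both retrievers
```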

Embedding dimensions shrank (Matryoshka)

Matryoshka Representation Learning produces embeddings you can truncate (1536 -> 512 -> 128) with minimal loss. Cuts storage and index cost drastically. OpenAI text-embedding-3-* models support this.
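
Truncation is just slicing plus re-normalization. This only preserves quality for embeddings trained with MRL (such as the text-embedding-3-* family); truncating an arbitrary embedding this way destroys it:

```python
def truncate_embedding(vec, dim):
    # Keep the first `dim` components, then re-normalize to unit length
    # so cosine/dot-product comparisons remain valid.
    head = vec[:dim]
    norm = sum(x * x for x in head) ** 0.5
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0]   # pretend 6-dim MRL embedding
short = truncate_embedding(full, 4)
print(len(short))                  # 4
print(sum(x * x for x in short))   # ~1.0: still unit length
```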

Quantization and compression

Product quantization, scalar quantization (int8), binary quantization (1 bit) reduce vector size 4x-32x with small recall cost. Qdrant, Pinecone, Milvus all support it. Billion-scale on commodity hardware became practical.
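
Scalar quantization is the simplest of the three; a minimal int8 sketch (4x smaller than float32, per-vector scale):

```python
def quantize_int8(vec):
    # Map floats in [-m, m] to integers in [-127, 127].
    m = max(abs(x) for x in vec) or 1.0
    scale = 127.0 / m
    return [round(x * scale) for x in vec], scale

def dequantize(qvec, scale):
    return [q / scale for q in qvec]

vec = [0.12, -0.98, 0.45, 0.03]
q, scale = quantize_int8(vec)
approx = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(vec, approx))
print(q)          # small ints, e.g. [16, -127, 58, 4]
print(err < 0.01) # True: reconstruction error is tiny
```

Production systems (Qdrant, Milvus) typically quantize per segment rather than per vector and keep the original floats around for rescoring, but the size/accuracy trade-off is the same idea.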

Multimodal vectors are routine

CLIP-style embeddings let you search images by text and vice versa. By 2026, many DBs index multiple vector columns per record (text embedding + image embedding + metadata) and let queries filter on all three.

Critical question: do you need one?

If your corpus has <10k chunks, a simple Python list plus numpy brute-force search is fine. Do not adopt Pinecone for a demo. As you scale, a rough decision tree:

  • <100k vectors: pgvector, SQLite, Chroma
  • 100k-10M: Qdrant, Weaviate, pgvector with proper tuning
  • 10M-1B: Pinecone, Milvus, Qdrant cluster
  • >1B: Milvus, Vertex Vector Search, custom

Critical questions

  • Why store the original text as payload if you only query by vector? (You need it for the LLM prompt after retrieval.)
  • Why cosine and not Euclidean for text? (Cosine is invariant to magnitude. Embedding norms reflect token count, which is not semantic.)
  • HNSW is memory-heavy. When do you pick IVF or disk-based indexes instead? (Billion-scale, cost-sensitive, or read-heavy-but-rare queries.)
  • What happens when you upgrade your embedding model? (Re-embed everything, or keep two indexes during transition. Mixed-space comparisons are meaningless.)
  • Why filter by metadata before or after ANN? (Pre-filter: cheaper, might miss items if filter is very selective. Post-filter: correct but wastes ANN work. Most DBs offer both; Qdrant and Weaviate do well here.)
  • Should you use a dedicated vector DB or pgvector? (Start with pgvector if you already use Postgres. Migrate when you hit performance or scale limits.)

Production pitfalls

  • Over-indexing on "the best vector DB". The differentiator is almost never the store. It is chunking, embedding quality, reranking. Pick a reasonable DB and move on.
  • Mixing embedding models. You re-ran with a new model and forgot to wipe the old index. Results degrade silently.
  • Ignoring payload size. Storing full HTML pages per vector inflates storage 50x. Store just the chunk.
  • No pre-aggregation for filters. Filters like "posts by user X" on 100M rows with ANN can be brutally slow. Partition your index by tenant or heavy-use filter key.
  • Naive cold start. Embedding 10M chunks with OpenAI at $0.02/1M tokens is manageable, but pay attention to rate limits. Batch smartly.
  • Missing re-ranking. Raw ANN recall is imperfect. A reranker on top is near-free in latency but big in accuracy.
  • Security. Multi-tenant vector stores without row-level security leak data between users. Audit.

Alternatives / Comparisons

Option | Strengths | Weaknesses | Good for
Pinecone | Managed, mature, fast | Closed source, costs at scale | Prod teams, enterprise
Qdrant | Open source, fast, great filtering | More ops | Self-hosted prod
Weaviate | Built-in hybrid, multimodal | Complex config | Hybrid-first teams
Chroma | Dev ergonomics, local | Scale limits | Prototypes, small apps
pgvector | Reuse existing Postgres | Slower at huge scale | Teams with Postgres
Milvus | Billion-scale, mature ANN | Heavyweight | Very large corpora
Elastic + dense_vector | Existing stack | Mediocre ANN | Log-heavy teams
Redis vector | Real-time, in-memory | Pricey at scale | Low-latency, small corpus
sqlite-vec | Local, zero-ops | Single-node | Edge, mobile, dev

Mental parallels (non-AI)

  • GPS coordinates for meaning: embeddings place each concept on a map. "Close on the map = similar in meaning". The vector DB is the map provider.
  • Spotify recommendation: songs are embedded based on audio and user behavior. "Find similar songs" is ANN on a vector DB.
  • Face recognition: each face is a vector. Matching a new photo is ANN on the database of known faces.
  • Card catalog with fuzzy search: traditional libraries use exact catalogs. Vector DBs are like a librarian who understands what you MEAN, not just what you typed.

Mini-lab

See rag workflow lab. Specifically for vector DBs:

  1. Embed 10k chunks with text-embedding-3-small.
  2. Load into three stores: Qdrant, pgvector, Chroma.
  3. Measure:
    • Indexing time
    • Query latency at top-10
    • Recall vs exact NN
    • Disk and RAM usage
  4. Add metadata filters ("only chunks from source X") and measure the impact.
  5. Try quantization (int8 in Qdrant) and compare.

Goal: feel which DB actually fits your shape. Decisions flow from measurements, not marketing.
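
The "recall vs exact NN" measurement from step 3 is just the overlap between what the ANN index returned and the brute-force ground truth:

```python
def recall_at_k(ann_ids, exact_ids):
    # Fraction of the true top-k that the ANN index actually returned.
    return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

exact = [4, 9, 17, 23, 42]     # ground truth from a brute-force scan
ann = [4, 9, 23, 42, 51]       # what the ANN index returned (illustrative ids)
print(recall_at_k(ann, exact)) # 0.8 — it missed doc 17
```

Average this over a few hundred held-out queries; a single query tells you nothing.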

Tags: vector-db, embeddings, ann, indexing, similarity-search