What Are Embedding Dimensions?
Question
When an embedding model outputs a vector with 1024 or 3072 dimensions, what does that mean? Does more = better?
Explanation
A dimension is one number in the vector. Each number captures one aspect of the text's meaning.
Think of describing a person:
- 2 dimensions (height, weight) - you can tell people apart, but not very well
- 5 dimensions (+ age, hair color, eye color) - much better
- 3072 dimensions - captures small details you couldn't even name, learned by the model during training
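The person analogy can be sketched in code: with only two features, two distinct people can look identical, and adding dimensions separates them. A toy illustration (the feature values are made up):

```python
import math

# Two people described by (height_cm, weight_kg): indistinguishable in 2D.
alice_2d = (170.0, 65.0)
bob_2d = (170.0, 65.0)
print(math.dist(alice_2d, bob_2d))  # 0.0 - the 2D "embedding" can't tell them apart

# Add dimensions (+ age, hair color and eye color as rough numeric codes)
# and the same two people separate cleanly.
alice_5d = (170.0, 65.0, 34.0, 1.0, 2.0)
bob_5d = (170.0, 65.0, 51.0, 3.0, 1.0)
print(math.dist(alice_5d, bob_5d))  # > 0 - more dimensions, finer distinctions
```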
More dimensions = better?
Generally yes, but with diminishing returns:
- 384d (MiniLM, runs locally, free) - decent
- 768d (older Google models) - good
- 1024d (Cohere embed-v3) - very good
- 3072d (Google gemini-embedding-001) - excellent
Going from 384 to 1024 is a big jump in quality. Going from 1024 to 3072 is a smaller improvement for 3x the storage (and similarity-computation) cost.
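The storage trade-off is simple arithmetic. Assuming float32 vectors (4 bytes per dimension) and a hypothetical corpus of one million chunks:

```python
BYTES_PER_FLOAT32 = 4
num_vectors = 1_000_000  # hypothetical corpus size

for dims in (384, 1024, 3072):
    gb = dims * BYTES_PER_FLOAT32 * num_vectors / 1e9
    print(f"{dims}d: {gb:.1f} GB")
# 1024d -> ~4.1 GB, 3072d -> ~12.3 GB: exactly 3x, before any index overhead.
```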
Is it configurable?
Usually no - the dimension is fixed by the model. The exception is Matryoshka embeddings (named after Russian nesting dolls): the model is trained so that you can truncate the vector to fewer dimensions and it still works. The first dimensions capture the most important information, the later ones add fine detail.
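A minimal sketch of how a Matryoshka vector is cut down, assuming the model was actually trained this way (the 8d vector below is made up, standing in for a real 3072d one): keep the first k dimensions, then L2-renormalize so cosine similarity still behaves.

```python
import math

def truncate_matryoshka(vec: list[float], k: int) -> list[float]:
    """Keep the first k dims of a Matryoshka-trained embedding, then renormalize."""
    head = vec[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, -0.3, 0.8, 0.1, 0.05, -0.02, 0.01, 0.005]  # made-up "embedding"
short = truncate_matryoshka(full, 4)
print(len(short))                            # 4
print(math.sqrt(sum(x * x for x in short)))  # ~1.0 - unit length again
```

Note this only works because training pushed the important information into the early dimensions; truncating an ordinary embedding this way degrades it badly.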
What matters more than dimensions
- Model quality - a good 384d model beats a bad 1024d model
- Training domain - a model trained on English tech docs is better for your PDFs than a generic one
- Chunking - bad chunks = bad embeddings, no matter the dimension
- Consistency - you MUST use the same model for indexing and searching
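The consistency rule is easy to enforce mechanically: store the model name (and dimension) alongside the index and refuse to query with anything else. A sketch with hypothetical names:

```python
# Hypothetical metadata saved next to the vector index at indexing time.
index_meta = {"model": "gemini-embedding-001", "dims": 3072}

def check_query_model(model: str, dims: int) -> None:
    """Fail loudly instead of silently returning garbage matches."""
    if model != index_meta["model"] or dims != index_meta["dims"]:
        raise ValueError(
            f"Index built with {index_meta['model']} ({index_meta['dims']}d), "
            f"got {model} ({dims}d). Re-index or switch the query model."
        )

check_query_model("gemini-embedding-001", 3072)  # OK, passes silently
# check_query_model("embed-v3", 1024)            # would raise ValueError
```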
Example
We switched from Cohere (1024d) to Google gemini-embedding-001 (3072d). This required a full re-index because the old vectors live on a completely different "map" - you can't compare 1024-number coordinates with 3072-number coordinates, and even same-size vectors from different models aren't comparable.
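Why the re-index is mandatory: similarity is computed term by term, so the two vectors must have the same length, and even then the numbers only mean the same thing if they came from the same model. A sketch (vector values are made up):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError(f"Dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

old_vec = [0.1] * 1024  # vector from the old 1024d model
new_vec = [0.1] * 3072  # vector from the new 3072d model
try:
    cosine_similarity(old_vec, new_vec)
except ValueError as e:
    print(e)  # Dimension mismatch: 1024 vs 3072
```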