RAG
03·RAG·updated 2026-04-19

REFRAG and CAG (Cache-Augmented Generation)

TL;DR

REFRAG and CAG both attack a core RAG inefficiency: the LLM wastes compute on irrelevant or repeatedly-seen context. REFRAG (Meta, 2025) compresses each chunk into a single vector and only expands the relevant ones, achieving 30x faster time-to-first-token. CAG moves stable knowledge into the model's KV cache so it is never re-fetched or re-processed. In practice, combine: use CAG for stable data, RAG (optionally REFRAG) for volatile data.

The historical problem

Problem 1: wasted tokens in RAG

Classic RAG retrieves top-k chunks and stuffs them all into the LLM context. Many of these chunks are near-duplicates or tangentially relevant. The LLM pays compute for every token, whether useful or not.

Typical numbers for a naive RAG pipeline:

  • Top-10 chunks = 5,000-10,000 tokens in the prompt
  • Only 1-2 chunks are truly relevant
  • 70-80% of the retrieved tokens add nothing

Problem 2: re-processing stable context

Many apps have context that barely changes:

  • Company policies (rarely updated)
  • Product catalog metadata
  • Legal boilerplate
  • Agent system prompt

RAG fetches and passes them on every query, and the LLM re-processes them every time: wasted prefill compute and needless KV-cache churn.

REFRAG tackles Problem 1. CAG tackles Problem 2. They are complementary.

How REFRAG works

Paper: "REFRAG: Rethinking RAG based Decoding" (Meta AI, 2025).

Key idea: compress, filter, selectively expand

1. Encode each chunk as a single compressed embedding (not hundreds of token embeddings).
2. A lightweight RL-trained policy evaluates the compressed embeddings vs the query.
3. The policy keeps only the most relevant chunks.
4. Only chosen chunks are expanded back into full token-level embeddings for the LLM.
5. Rejected chunks remain as their single compressed vector, which the LLM still sees but cannot attend to in detail.

Step-by-step (from Daily Dose DS)

Step 1-2)  Encode the docs and store them in a vector DB.
Step 3-5)  Encode the full user query and retrieve the matching chunks.
Step 6)    Use the relevance policy (trained via RL) to select the chunks to keep.
Step 7)    Compute token-level embeddings for the query and the selected chunks.
Step 8)    Concatenate:
             - token-level representation of the query
             - token-level embeddings of SELECTED chunks
             - compressed single-vector representation of REJECTED chunks
Step 9-10) Send all of that to the LLM.
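The pipeline above can be sketched in a few lines. Everything here is an illustrative assumption, not the paper's actual architecture: the shapes, the dot-product "policy" standing in for the RL-trained one, and the top-2 cutoff.

```python
# Toy sketch of REFRAG's compress -> filter -> expand flow.
# Shapes, the dot-product "policy", and the top-2 cutoff are all
# illustrative assumptions, not the paper's actual components.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                    # embedding dimension
n_chunks, tokens_per_chunk = 8, 100

# Steps 1-2: one compressed vector per chunk (vs ~100 token vectors)
compressed = rng.normal(size=(n_chunks, d))

# Steps 3-6: score compressed vectors against the query, keep the best
query_vec = rng.normal(size=d)
scores = compressed @ query_vec           # stand-in for the RL policy
keep = set(np.argsort(scores)[-2:].tolist())

# Steps 7-8: expand only the kept chunks back to token-level embeddings
def expand(idx):                          # stand-in for re-encoding tokens
    return rng.normal(size=(tokens_per_chunk, d))

llm_input = [expand(i) if i in keep else compressed[i:i+1]
             for i in range(n_chunks)]

# Rejected chunks cost 1 vector each; kept chunks cost ~100 each
total_vectors = sum(x.shape[0] for x in llm_input)
print(total_vectors)                      # 2*100 + 6*1 = 206, not 800
```

The payoff is visible in the last line: the LLM attends over ~200 vectors instead of ~800, which is where the TTFT and effective-context gains come from.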

Published results

  • 30.85x faster time-to-first-token (3.75x better than previous SOTA)
  • 16x larger effective context window
  • Outperforms LLaMA on 16 RAG benchmarks using 2-4x fewer decoder tokens
  • No accuracy loss across RAG, summarization, multi-turn conversation tasks

Meaning: you can process 16x more context at 30x the speed with the same accuracy.

How CAG works

CAG stands for Cache-Augmented Generation.

Key idea: cache stable knowledge in the KV cache, not in the retrieval loop

Recall from the kv-cache note: the KV cache holds past tokens' Key and Value vectors. Prompt caching APIs (Anthropic, OpenAI, Google) let you mark a prompt prefix as cacheable so the server keeps those KV vectors across requests.

CAG uses this for RAG-like data:

Static knowledge (company policies, reference guides, product specs)
  -> packed once into a cacheable prompt prefix
  -> sent once, cached by the LLM server
  -> reused on every subsequent query at ~90% cost reduction

Dynamic knowledge (today's prices, last user message, new tickets)
  -> continues to be fetched via RAG
  -> appended after the cached prefix

Hybrid RAG + CAG workflow

Incoming query
  |
  +--> Classifier: is this about stable knowledge, dynamic, or both?
          |
          +-- stable only:     cached prefix + query            (CAG path)
          +-- dynamic only:    standard RAG retrieval           (RAG path)
          +-- mixed:           cached prefix + RAG chunks + query  (Hybrid)
  |
  +--> LLM
  |
  +--> Answer
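The router above can be sketched as follows. The keyword classifier and all names here are placeholders; in production the classifier would be a small model or a rules engine.

```python
# Hedged sketch of the hybrid routing above. classify() is a toy
# keyword classifier; real systems use a small model or rules engine.
STABLE_KW = {"policy", "refund", "warranty", "spec"}
DYNAMIC_KW = {"price", "order", "ticket", "today"}

def classify(query: str) -> str:
    words = set(query.lower().replace("?", "").split())
    s, d = bool(words & STABLE_KW), bool(words & DYNAMIC_KW)
    return "mixed" if s and d else "stable" if s else "dynamic"

def build_prompt(query, cached_prefix, retrieve):
    path = classify(query)
    if path == "stable":                       # CAG path: cached prefix only
        return cached_prefix + [query]
    if path == "dynamic":                      # RAG path: fresh retrieval only
        return retrieve(query) + [query]
    return cached_prefix + retrieve(query) + [query]   # hybrid path

print(classify("what is the refund policy?"))  # -> stable
print(classify("price of order 1234?"))        # -> dynamic
```

The design choice that matters: the cached prefix is always placed first, so the "stable" and "mixed" paths share the same cacheable byte-for-byte prefix across users.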

What to cache, what to retrieve (rule of thumb)

  • Cache (cold): refresh < 1x/week, read > 100x/week, stable semantics. E.g., product docs, policies, FAQ.
  • Retrieve (hot): refresh > 1x/day, small items, per-user. E.g., customer tickets, recent orders, live prices.

If you cache everything, you blow the context limit. If you retrieve everything, you pay the prefill tax every query. Split.
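A back-of-envelope model of that prefill tax. The absolute price and the multipliers are assumptions (Anthropic-style: cache writes ~1.25x base input price, cached reads ~0.1x):

```python
# When does caching a 50K-token prefix beat re-sending it every query?
# Prices and multipliers are illustrative assumptions (Anthropic-style:
# cache write ~1.25x base input, cached read ~0.1x base input).
PRICE = 3e-6          # $/input token (assumed)

def rag_cost(prefix_tok, query_tok):
    # prefix re-sent and re-processed on every query
    return (prefix_tok + query_tok) * PRICE

def cag_cost(prefix_tok, query_tok, queries_per_ttl):
    write = prefix_tok * PRICE * 1.25          # paid once per cache fill
    read = prefix_tok * PRICE * 0.10           # paid on every cache hit
    return write / queries_per_ttl + read + query_tok * PRICE

print(f"RAG : ${rag_cost(50_000, 500):.4f}/query")   # -> $0.1515/query
print(f"CAG : ${cag_cost(50_000, 500, 1000):.4f}/query")  # -> $0.0167/query
```

Note the `queries_per_ttl` term: at low QPS the one-time write cost is amortized over few hits, which is exactly the low-traffic failure mode described in the pitfalls below.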

Relevance today (2026)

Prompt caching APIs make CAG trivial

By 2026, Anthropic and OpenAI have mature prompt caching (5 min TTL default, can extend). Gemini launched its own in 2024. Implementing CAG is now:

# Anthropic: mark the stable prefix as cacheable
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-5",  # any cache-capable model
    max_tokens=1024,
    messages=[{"role": "user", "content": [
        {"type": "text", "text": large_policy_doc,   # stable prefix
         "cache_control": {"type": "ephemeral"}},    # <- cached
        {"type": "text", "text": user_query},        # dynamic suffix
    ]}],
)

One annotation. Cost drops 90% on cached tokens after the first hit.

REFRAG is newer, less mainstream

REFRAG is a 2025 Meta paper. Not yet a standard library in 2026. Implementations are emerging, mostly research-grade. Expect production-ready libraries by late 2026 or 2027.

When each shines

  Scenario                                      | Use
  ----------------------------------------------|-----------------------------------
  High QPS, stable corpus (< 200K tokens)       | CAG
  Large corpus, volatile, latency-sensitive     | REFRAG when it matures
  Moderate corpus, moderate QPS                 | Classic RAG + Contextual + Rerank
  Extreme corpus (> 10M chunks), billion-scale  | RAG with sharding, not CAG

Why this matters strategically

As LLM context windows grow and prompt caching gets cheap, the line between "retrieval" and "context loading" blurs. REFRAG and CAG both say: "smart compression + smart caching will beat brute-force retrieval for a growing share of use cases".

The question an engineer should ask in 2026: "Can I cache this context instead of retrieving it?" Often the answer changes the architecture entirely.

Critical questions

  • Why does REFRAG not lose accuracy despite compressing? (The RL policy learns which compressed signals are safe to skip expanding. The LLM still sees the compressed vectors of rejected chunks, just not their full token detail.)
  • What if my "stable" knowledge updates once a day? (Still cacheable. Bust the cache on update. Anthropic supports explicit cache breakpoints.)
  • Can CAG replace RAG entirely? (Only for small static corpora that fit in the context. Past that, you need RAG for scale.)
  • How do REFRAG and prompt caching differ? (Prompt caching stores the full token-level KV of an unchanging prefix. REFRAG compresses chunks at the embedding level and never expands them if irrelevant. Different layers of the stack.)
  • Does REFRAG help with streaming outputs? (Yes, faster TTFT is the headline. Users see the first token sooner.)
  • Do cache hits help on Anthropic if every user has a different preamble? (No. Cache is per-prefix. Share prefixes across users. Per-user personalization goes AFTER the cache.)

Production pitfalls

CAG pitfalls

  • Caching dynamic data. You cache today's prices. Tomorrow they are wrong. Cache only stable content.
  • Cache bust on every small change. Keep your stable content strictly stable. If you edit a comma, you pay prefill again.
  • Mixing per-user content into the cached prefix. Breaks sharing. Put per-user content AFTER the cache breakpoint.
  • TTL misunderstanding. Anthropic's default cache is 5 min. For high QPS it is great. For low QPS you pay full cost every call. Use 1h cache extensions if supported.
  • Silent cache miss. Monitor cache hit rate. If it drops, investigate prefix drift.
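For the last pitfall, a sketch of hit-rate monitoring. `cache_read_input_tokens` and `cache_creation_input_tokens` are the usage fields Anthropic's messages API reports; the alert threshold is an arbitrary assumption.

```python
# Monitor cache hit rate from the usage block of each API response.
# cache_read_input_tokens / cache_creation_input_tokens are Anthropic
# usage fields; the 0.5 alert threshold is an arbitrary assumption.
def cache_hit_ratio(usage: dict) -> float:
    read = usage.get("cache_read_input_tokens", 0)
    write = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + write + fresh
    return read / total if total else 0.0

def check(usage, threshold=0.5):
    ratio = cache_hit_ratio(usage)
    if ratio < threshold:
        print(f"WARN: cache hit ratio {ratio:.2f} - investigate prefix drift")
    return ratio

# Healthy: 45K of 50K input tokens served from cache
print(check({"cache_read_input_tokens": 45_000,
             "cache_creation_input_tokens": 0,
             "input_tokens": 5_000}))       # -> 0.9
```

A sustained drop in this ratio usually means the "stable" prefix changed byte-for-byte (a template edit, a reordered doc), which silently reverts you to full prefill cost.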

REFRAG pitfalls

  • Immature tooling. In 2026, implementations are mostly research code. Expect rough edges.
  • Retraining the RL policy. On new domains, the shipped policy may underfit. Budget for domain adaptation.
  • Opacity. You cannot easily tell why the policy rejected a chunk. Add logging.

Alternatives / Comparisons

  Approach                     | What it accelerates   | Infra cost          | 2026 maturity
  -----------------------------|-----------------------|---------------------|--------------------------
  Naive RAG                    | -                     | Low                 | Mature
  RAG + rerank                 | Relevance             | Low                 | Mature
  REFRAG                       | Latency, token count  | Research-grade      | Emerging
  CAG (prompt caching)         | Cost on stable prefix | Trivial to add      | Mature
  CAG + RAG hybrid             | Both                  | Trivial to add      | Mature, underused
  Long-context stuffing (1M)   | Eliminates retrieval  | High per-query cost | Works, often wasteful
  Fine-tuning on corpus        | In-weights knowledge  | Training cost       | Mature, not always right

Mental parallels (non-AI)

  • Browser cache + CDN: CAG is like a CDN for your prompt. Static assets cached at the edge. Dynamic pages rendered per request.
  • Mise-en-place in a kitchen: CAG is the pre-chopped vegetables. Done once, reused across services. REFRAG is the sous-chef who only pulls the ingredients needed for each dish instead of bringing the whole fridge.
  • Physical libraries: CAG is the reference section (stable, always available). REFRAG is a smart clerk who fetches only the pages you need, with a summary of the rest in case.
  • Git diff: REFRAG sends compressed summaries unless a chunk is flagged as relevant, at which point it sends the full content. Similar to shallow clone vs full clone.

Mini-lab

labs/refrag-cag/ (to create):

  1. Take a corpus of 100 product policy docs (stable) + a live feed of customer tickets (dynamic).
  2. Implement three pipelines:
    • Naive RAG: retrieve from everything every time
    • CAG: cache the policies, RAG only the tickets
    • CAG + REFRAG (if available): same + REFRAG on tickets
  3. Measure:
    • Cost per query
    • Latency
    • Answer accuracy on a 50-query golden set
  4. Plot cost vs QPS for CAG vs RAG. Identify the break-even point.

Stack: uv, anthropic (for prompt caching), langchain, qdrant.

Related concepts

rag · refrag · cag · kv-cache · compression · meta · caching