Reasoning Prompting Techniques (CoT, Self-Consistency, ToT, ARQ)
Watch or read first
- Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022): the founding paper. Short, very readable: https://arxiv.org/abs/2201.11903
- Yao et al., "Tree of Thoughts" (2023): https://arxiv.org/abs/2305.10601.
- Daily Dose DS - section "3 prompting techniques for reasoning in LLMs" in the AI Engineering Guidebook (2025, paid): https://www.dailydoseofds.com/ai-engineering-guidebook/
TL;DR
Four prompt-level techniques push an LLM to reason harder instead of guessing. Chain of Thought (CoT) asks for steps. Self-Consistency runs CoT many times and votes on the majority answer. Tree of Thoughts (ToT) explores a tree of partial reasonings and picks the best branch. Attentive Reasoning Queries (ARQs) replace free-form chains with a JSON schema of explicit sub-questions.
The historical problem
Early GPT-3 outputs in 2020-2022 were impressive but shallow. On math, logic, or multi-step questions, the model jumped straight to an answer and got it wrong in a plausible-looking way.
Two insights shifted the field:
- Reasoning is latent in the model, but you must elicit it. A model that cannot solve a problem in one shot often can in five steps.
- The quality of the output depends on what happens between the prompt and the final tokens. The LLM needs "thinking space".
CoT (Wei et al., 2022) was the first prompt-level trick. Self-Consistency and Tree of Thoughts came next. ARQ (2024-2025, used in Parlant) is the most recent evolution, replacing free-form chains with structured reasoning queries.
How it works
1. Chain of Thought (CoT)
Ask the model to show its steps.
WITHOUT CoT:
User: If a train travels 60 km in 45 minutes, what is its speed in km/h?
Model: 45 km/h <- wrong
WITH CoT:
User: ... Let's think step by step.
Model:
Step 1: 45 minutes = 0.75 hour
Step 2: speed = distance / time = 60 / 0.75
Step 3: 60 / 0.75 = 80
Answer: 80 km/h <- correct
Zero-shot CoT is often "Let's think step by step". Few-shot CoT provides 1-3 worked examples in the prompt.
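The two prompt styles can be sketched with plain string templates. This is a minimal illustration, not the paper's exact wording; the worked example and function names are made up:

```python
# Sketch: building zero-shot vs few-shot CoT prompts.
# The actual model call is out of scope; any chat-completion API would do.

FEW_SHOT_EXAMPLE = (
    "Q: A car travels 120 km in 90 minutes. What is its speed in km/h?\n"
    "Step 1: 90 minutes = 1.5 hours\n"
    "Step 2: speed = 120 / 1.5 = 80\n"
    "Answer: 80 km/h\n"
)

def zero_shot_cot(question: str) -> str:
    """Append the canonical zero-shot CoT trigger phrase."""
    return f"{question}\nLet's think step by step."

def few_shot_cot(question: str, examples: list[str]) -> str:
    """Prepend worked examples so the model imitates the step format."""
    return "\n".join(examples) + f"\nQ: {question}\n"

prompt = zero_shot_cot(
    "If a train travels 60 km in 45 minutes, what is its speed in km/h?"
)
```

The trigger phrase does all the work in the zero-shot case; the few-shot variant trades prompt tokens for tighter control over the step format.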
2. Self-Consistency (majority voting over CoT)
CoT with temperature > 0 produces different reasoning paths on each run. Not all reach the same answer. Self-Consistency:
1. Run CoT N times (e.g., N = 10)
2. Extract the final answer from each run
3. Take the majority answer
This smooths noise. Works best when the answer is a discrete value (number, category, A/B/C).
3. Tree of Thoughts (ToT)
ToT treats reasoning as search. At each step, the model generates M candidate next steps. A separate scorer (same LLM with a judge prompt, or a small classifier) ranks them. You keep the best, expand, repeat. Breadth-first or depth-first search.
Question
|
+-- Thought A1 (score 0.3)
| +-- Thought A1.a (score 0.6)
| +-- Thought A1.b (score 0.4)
+-- Thought A2 (score 0.8) <- keep
      +-- Thought A2.a (score 0.9) <- keep
      +-- Thought A2.b (score 0.2)
More compute, more cost, better results on games, planning, puzzles.
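The search above can be sketched as a tiny breadth-first beam search. Here `generate` and `score` are toy stand-ins for the proposer LLM and the judge prompt, and the search space mirrors the diagram:

```python
# Sketch: breadth-first Tree of Thoughts with a configurable beam width.
# generate() proposes candidate next thoughts; score() plays the judge.

def tree_of_thoughts(root, generate, score, depth, beam=1):
    """Expand the best-scored partial reasonings level by level."""
    frontier = [root]
    for _ in range(depth):
        candidates = [t for thought in frontier for t in generate(thought)]
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]  # keep only the best branch(es)
    return frontier[0]

# Toy search space mirroring the diagram: Q -> A2 -> A2.a is the best path.
children = {
    "Q": ["A1", "A2"],
    "A1": ["A1.a", "A1.b"],
    "A2": ["A2.a", "A2.b"],
}
scores = {"A1": 0.3, "A2": 0.8, "A1.a": 0.6, "A1.b": 0.4, "A2.a": 0.9, "A2.b": 0.2}

best = tree_of_thoughts("Q", lambda t: children.get(t, []), scores.get, depth=2)
print(best)  # -> A2.a
```

With a real LLM, every `generate` and `score` call is a model call, which is where the "more compute, more cost" comes from: depth x beam x candidates calls instead of one.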
4. ARQ (Attentive Reasoning Queries)
Instead of free-form "thinking aloud", ARQ encodes each reasoning step as a targeted key inside a JSON schema. The LLM fills the JSON, and the domain-specific questions keep the critical policies in attention.
{
"did_user_request_refund": true,
"refund_policy_applies": false,
"escalation_needed": true,
"recommended_action": "escalate_to_human_agent",
"final_response": "I will loop in a specialist ..."
}
Daily Dose DS numbers across 87 test scenarios:
- ARQ: 90.2%
- CoT: 86.1%
- Direct: 81.5%
Used inside Parlant, an open-source framework for instruction-following agents.
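A minimal harness around the schema above would parse the model's JSON and check that every reasoning key was actually filled. The key names come from the example; the validation logic is an illustrative sketch, not Parlant's implementation:

```python
import json

# Sketch: validating a filled ARQ. A missing or mistyped key means the
# model skipped one of the explicit sub-questions.

ARQ_SCHEMA = {
    "did_user_request_refund": bool,
    "refund_policy_applies": bool,
    "escalation_needed": bool,
    "recommended_action": str,
    "final_response": str,
}

def validate_arq(raw: str) -> dict:
    """Parse the model output and enforce the reasoning schema."""
    filled = json.loads(raw)
    for key, expected in ARQ_SCHEMA.items():
        if not isinstance(filled.get(key), expected):
            raise ValueError(f"ARQ key missing or mistyped: {key}")
    return filled

raw = """{"did_user_request_refund": true, "refund_policy_applies": false,
          "escalation_needed": true, "recommended_action": "escalate_to_human_agent",
          "final_response": "I will loop in a specialist ..."}"""
result = validate_arq(raw)
```

This is what makes ARQ auditable: each reasoning step is a typed field you can log, test, and gate on, rather than a paragraph you have to parse.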
Relevance today (2026)
The landscape has shifted considerably since CoT was introduced in 2022.
Reasoning models made CoT implicit
Claude Opus 4.5 (thinking mode), OpenAI o1 and o3, DeepSeek-R1, Gemini 2.5 Thinking - all produce chain-of-thought internally. You do not need to prompt "let's think step by step" anymore, and adding it can actually hurt on these models because they already think.
In 2026 you split your stack:
- Non-reasoning models (GPT-4o, Claude Haiku, Sonnet default): CoT still helps a lot, especially for math and logic.
- Reasoning models (o1, o3, Opus 4.5 thinking, R1): let them think. Do NOT ask for step-by-step.
Self-Consistency is more expensive than reasoning models
10x CoT samples used to be the cheap way to boost accuracy. By 2026, one reasoning model call is often cheaper and better. Self-Consistency survives in eval pipelines and in research, less so in prod.
Tree of Thoughts moved inside the models
o1 / Opus 4.5 thinking / R1 all do an internal form of tree search during their "thinking" budget. Explicit ToT at the prompt level is rare in prod. It lives in agent frameworks like LangGraph for complex planning tasks.
ARQ is underrated
ARQ is the most practical of the four for building agents in 2026. It combines CoT with structured output, which means:
- Cheaper (fewer tokens)
- Auditable (each reasoning step is a JSON key)
- Compatible with function calling and guardrails
- Less drift on long prompts (the schema keeps attention anchored)
Daily Dose DS is right to highlight it. Most teams still ignore it.
Key question: when to use which?
| Technique | Use case | Cost | 2026 status |
|---|---|---|---|
| CoT | Non-reasoning model, short task | 1 call | Foundational, always valid |
| Self-Consistency | High-stakes single answer, discrete output | N calls | Mostly replaced by reasoning models |
| ToT | Planning, games, puzzles | Expensive | Migrating into agent frameworks |
| ARQ | Production agents with strict domain rules | 1 call + schema | Rising |
| Just use a reasoning model | Complex problem, cost is OK | 1 call (expensive internally) | Default for hard problems |
Critical questions
- Why does CoT improve accuracy if the model has the same weights? (It uses more tokens at inference to "compute" an answer via text. Reasoning is unrolled autoregressively.)
- Why does Self-Consistency only help on problems with a discrete answer? (Majority voting needs you to bucket outputs.)
- ToT on a 2026 reasoning model is often worse than just letting the model think. Why? (The reasoning model has been trained to explore internally. An external tree adds latency without new capability.)
- ARQ forces a JSON schema. Does that limit exploration? (Yes, intentionally. Exploration is the enemy in a production agent that must follow policies.)
- For a chat app with a 2000-word system prompt, which technique keeps the agent on track? (ARQ. CoT drifts after a few turns. This is the Parlant insight.)
Production pitfalls
- Asking a reasoning model to "think step by step". Double-thinking. Wastes tokens, sometimes hurts accuracy. Check the model card before prompting.
- Self-Consistency without answer extraction. You get 10 verbose responses and no tooling to tally. Always define a grammar or a final answer tag.
- ToT in production. Expensive, slow, often brittle. Unless you have a clear search space and a scorer, skip it.
- ARQ schema too loose. If the keys are vague, the LLM fills them with generic text. Each key must force a specific, checkable answer.
- CoT leaks into the final answer. Without structure, the model returns the entire chain instead of the final answer. Use a tag like `<answer>` or a JSON field.
- Mixing CoT and tool calls naively. The chain can contain fake tool results. Use ARQ or function calling when tools are involved.
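The answer-tag pitfall is cheap to handle with a small extraction helper. A sketch; the `<answer>` tag name is just a convention your prompt has to enforce:

```python
import re

def final_answer(completion: str) -> str:
    """Return only the content of the <answer> tag, discarding the chain.
    Falls back to the full completion if the model forgot the tag."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    return m.group(1).strip() if m else completion.strip()

out = "Step 1: 45 min = 0.75 h\nStep 2: 60 / 0.75 = 80\n<answer>80 km/h</answer>"
print(final_answer(out))  # -> 80 km/h
```

The fallback matters in production: a missing tag should degrade gracefully (or trigger a retry), not crash the pipeline.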
Alternatives / Comparisons
Beyond these four, consider:
- Plan-then-execute: an explicit planning pass followed by step execution. Used in LangGraph, LlamaIndex.
- Program of Thoughts (PoT): the model writes code, the code runs, the result is the answer. Great for math.
- Reflection: the model critiques its own output and revises. See agentic design patterns.
- Best-of-N + reward model: generate N answers, rank with a learned reward. Research heavy.
Mental parallels (non-AI)
- CoT = showing your work on a math test. The teacher gives partial credit, and you catch your own mistakes along the way.
- Self-Consistency = second opinions. Ask 10 doctors with the same info, take the majority diagnosis.
- Tree of Thoughts = chess engines. Stockfish expands a tree of positions and keeps the best branches.
- ARQ = a checklist in surgery. The Checklist Manifesto by Atul Gawande. You do not freestyle in an OR, you tick boxes. Fewer errors.
Mini-lab
labs/reasoning-prompting/ (to create):
- Take 20 GSM8K math questions.
- Run 4 configurations on a mid-tier model (Claude Haiku 4.5 or Gemini 2.0 Flash):
- Direct answer
- Zero-shot CoT ("Let's think step by step")
- Self-Consistency with N=10
- ARQ schema (question_understood, variables_identified, formula, computation, answer)
- Measure accuracy and token cost for each.
- Compare to one call of a reasoning model (o1-mini or DeepSeek-R1).
Goal: feel the accuracy-cost trade-off in your hands. Decide which technique would survive in your next project.
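The lab's scoring step can be sketched as a plain tally over `(predicted, gold, tokens)` triples per configuration. The numbers below are made up for illustration:

```python
# Sketch: compute accuracy and average token cost per configuration.

def score_config(results: list[tuple[str, str, int]]) -> dict:
    """results: one (predicted, gold, tokens_used) triple per question."""
    correct = sum(1 for pred, gold, _ in results if pred == gold)
    tokens = sum(t for _, _, t in results)
    return {
        "accuracy": correct / len(results),
        "avg_tokens": tokens / len(results),
    }

# Toy numbers: CoT gets more right but spends far more tokens than direct.
direct = [("5", "5", 40), ("12", "13", 35), ("7", "7", 38), ("9", "8", 42)]
cot = [("5", "5", 210), ("13", "13", 240), ("7", "7", 225), ("8", "8", 230)]

print(score_config(direct))  # accuracy 0.5, avg_tokens 38.75
print(score_config(cot))     # accuracy 1.0, avg_tokens 226.25
```

Plot accuracy against avg_tokens for the four configurations and the reasoning-model baseline: the trade-off the lab is meant to make tangible is that single ratio.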
Further reading
Canonical papers
- CoT: Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2022) - https://arxiv.org/abs/2201.11903
- Self-Consistency: Wang et al., 2022 - https://arxiv.org/abs/2203.11171
- Tree of Thoughts: Yao et al., 2023 - https://arxiv.org/abs/2305.10601
- ARQ / Parlant: https://github.com/emcie-co/parlant
Related in this KB
- structured outputs - the structured counterpart to ARQ
- json prompting - the syntax that makes ARQ work
- verbalized sampling - complementary technique for diversity
- react pattern - CoT + tools
- sampling and temperature - why Self-Consistency needs temperature > 0
Tools
- Parlant: https://github.com/emcie-co/parlant
- LangChain SelfConsistencyChain: https://python.langchain.com/docs/introduction/
- DSPy: framework that optimizes CoT/ToT programmatically: https://github.com/stanfordnlp/dspy