Multi-Agent Debate & LLM Reasoning Glossary

Plain-English definitions for the terms used across MAD Studio research — from M-MAD and CoT-SC to Degeneration of Thought and the Bullshit Index.

Multi-Agent Debate (MAD)

Also known as: MAD, LLM debate, multi-agent AI debate

Multi-Agent Debate is an LLM reasoning technique in which two or more language model instances argue across structured rounds, critiquing each other's claims before a verdict. Compared to single-prompt baselines, multi-agent debate is reported to improve factuality, calibration, and adversarial robustness on a range of peer-reviewed benchmarks, although the gains depend on the task and on how the debate is configured.
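
As a rough illustration of the round structure, the Python sketch below lets two stub agents speak once per round and then asks a stub judge for a verdict. The names `run_debate`, `advocate`, `critic`, and `judge` are hypothetical, and the agents are plain functions standing in for model calls; this is not MAD Studio's implementation.

```python
from typing import Callable, Dict, List, Tuple

# An agent maps (question, transcript so far) to its next message.
Agent = Callable[[str, List[str]], str]

def run_debate(question: str, agents: Dict[str, Agent],
               judge: Callable[[str, List[str]], str],
               rounds: int = 2) -> Tuple[str, List[str]]:
    """Let every agent speak once per round, then ask the judge for a verdict."""
    transcript: List[str] = []
    for _ in range(rounds):
        for name, agent in agents.items():
            transcript.append(f"{name}: {agent(question, transcript)}")
    return judge(question, transcript), transcript

# Hypothetical stand-ins for real model calls.
def advocate(question: str, transcript: List[str]) -> str:
    return "The claim holds." if not transcript else "The objection misses the evidence."

def critic(question: str, transcript: List[str]) -> str:
    # The critic always responds to the most recent message.
    return f"I dispute this: '{transcript[-1]}'"

def judge(question: str, transcript: List[str]) -> str:
    return f"verdict after {len(transcript)} messages"

verdict, transcript = run_debate(
    "Is the claim true?", {"advocate": advocate, "critic": critic}, judge
)
print(verdict)
```

Each agent receives the full transcript here; a real system might give each agent a different view of it.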

Related: self consistency, m mad, degeneration of thought

Self-Consistency (CoT-SC)

Also known as: CoT-SC, chain-of-thought self-consistency, majority voting LLM

Self-Consistency is a reasoning strategy that samples multiple chain-of-thought paths from one model at non-zero temperature, then majority-votes the final answer. Introduced by Wang et al. (ICLR 2023), it is fast, embarrassingly parallel, and the strong default for math and multiple-choice tasks with a single correct label.
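
The voting step can be sketched in a few lines of Python. The sampler below is a hypothetical stand-in that replays canned answers; in practice it would call a model at non-zero temperature and extract the final answer from each chain of thought.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, List

def self_consistency(sample_answer: Callable[[str], str], question: str, n: int = 5) -> str:
    """Sample n final answers and return the most frequent one."""
    answers: List[str] = [sample_answer(question) for _ in range(n)]
    # most_common(1) returns [(answer, count)]; on a tie, the answer seen first wins.
    return Counter(answers).most_common(1)[0][0]

# Hypothetical stand-in for a model sampled at non-zero temperature.
_canned = cycle(["42", "41", "42", "42", "40"])

def fake_sampler(question: str) -> str:
    return next(_canned)

winner = self_consistency(fake_sampler, "What is 6 * 7?", n=5)
print(winner)
```

Because the samples are independent, the list comprehension could be replaced by parallel calls without changing the vote.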

Related: multi agent debate, chain of thought

M-MAD

M-MAD (Multidimensional Multi-Agent Debate) is a verdict-scoring protocol from Feng et al. (ACL 2025) that runs independent arbiter passes on separate dimensions — correctness, evidence use, responsiveness, calibration, and citation quality — instead of one holistic 'who won' judgment. MAD Studio uses M-MAD as the backbone of its Truth-Seeking Debate verdict.

Related: multi agent debate, arbiter, dimension sweep

Degeneration of Thought (DoT)

Also known as: DoT

Degeneration of Thought is a failure mode formalized by Liang et al. (EMNLP 2024) where an LLM, once committed to an answer, fails to produce genuinely novel reasoning during self-reflection, even when the initial answer is wrong. Multi-agent debate mitigates this by separating critic and advocate into agents with distinct context.

Related: multi agent debate, self refine

Bullshit Index

The Bullshit Index is MAD Studio's real-time fact-checking layer. Every claim made by an agent is extracted and cross-referenced against the evidence pack, the public web, and the agent's earlier turns. Hallucinated citations, drifted positions, false precision, and contradicted statements all push the meter higher.

Related: hallucination, fact checking, calibration

Saga

Saga is MAD Studio's recursive optimization engine. It spawns hidden child sessions from a source conversation, scores each transcript against a rubric, applies the best optimizer suggestion, and re-runs, generation after generation, until the score curve flattens or a stop condition fires. Its direct precedents are Reflexion and Self-Refine.
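
The score-and-re-run loop described above might look roughly like the sketch below. It is a generic hill-climbing loop, not Saga's actual code: `optimize`, `propose`, `score`, and `min_gain` are hypothetical names, and plain integers stand in for conversation configurations so the example stays runnable.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")

def optimize(seed: T, propose: Callable[[T], List[T]], score: Callable[[T], float],
             max_generations: int = 10, min_gain: float = 0.01) -> Tuple[T, List[float]]:
    """Keep the best-scoring candidate each generation until gains flatten."""
    best, best_score = seed, score(seed)
    history = [best_score]
    for _ in range(max_generations):          # stop condition: generation budget
        candidates = propose(best)            # stands in for spawning child sessions
        if not candidates:
            break
        challenger = max(candidates, key=score)
        challenger_score = score(challenger)
        if challenger_score - best_score < min_gain:
            break                             # stop condition: score curve has flattened
        best, best_score = challenger, challenger_score
        history.append(best_score)
    return best, history

# Toy problem: the rubric peaks at 3, and each generation tries one step either way.
best, history = optimize(
    seed=0,
    propose=lambda x: [x - 1, x + 1],
    score=lambda x: -(x - 3) ** 2,
)
print(best, history)
```

The `history` list is the score curve; plotting it is one way to see where the loop stopped improving.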

Related: reflexion, self refine, experiments

Truth-Seeking Debate

Truth-Seeking Debate is MAD Studio's verdict-grade protocol — a 10-phase pipeline that combines structured rebuttals, claim extraction, and an independent M-MAD arbiter sweep. Designed for cases where the deliverable must be an auditable per-dimension scorecard rather than a single confidence number.

Related: m mad, arbiter, multi agent debate

Open Discussion

Open Discussion is MAD Studio's brainstorming protocol — agents take turns surfacing objections, evidence, and angles without a fixed verdict pipeline. Best for exploratory pre-mortems, early-stage idea expansion, and surfacing the long tail of objections before pressure-testing the strongest ones with Truth-Seeking Debate.

Related: truth seeking debate, team discussion

Team Discussion

Team Discussion is MAD Studio's two-team protocol. Battle mode pits an advocate team against an adversary team across bounded rounds; collaboration mode lets the teams synthesize. Useful for political messaging, product positioning, and any setting where the strongest opposing case must be staged before a decision is locked in.

Related: open discussion, truth seeking debate

Dimension-Sweep Arbiter

A Dimension-Sweep Arbiter scores a debate across independent passes — typically correctness, evidence use, responsiveness, calibration, and citation quality — rather than producing a single holistic judgment. Introduced by Feng et al. (ACL 2025) as M-MAD, it prevents an arbiter from retroactively rewriting its own verdict.
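
A minimal sketch of the idea, assuming one scoring call per dimension. The function `dimension_sweep`, the `toy_arbiter` stub, and the fixed scores are hypothetical; a real arbiter would call a model once per dimension.

```python
from typing import Callable, Dict

DIMENSIONS = ["correctness", "evidence use", "responsiveness",
              "calibration", "citation quality"]

def dimension_sweep(transcript: str,
                    score_dimension: Callable[[str, str], float]) -> Dict[str, float]:
    """Run one independent scoring pass per dimension."""
    # Each call sees only the transcript and its own dimension, never another pass's score.
    return {dimension: score_dimension(transcript, dimension) for dimension in DIMENSIONS}

# Hypothetical stand-in for an arbiter model call, returning fixed toy scores.
TOY_SCORES = {"correctness": 0.8, "evidence use": 0.6, "responsiveness": 0.9,
              "calibration": 0.5, "citation quality": 0.4}

def toy_arbiter(transcript: str, dimension: str) -> float:
    return TOY_SCORES[dimension]

scorecard = dimension_sweep("...debate transcript...", toy_arbiter)
for dimension, score in scorecard.items():
    print(f"{dimension}: {score:.1f}")
```

The result is a per-dimension scorecard rather than a single number, so a weak dimension stays visible instead of being averaged away.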

Related: m mad, arbiter

Mixture-of-Agents (MoA)

Also known as: MoA

Mixture-of-Agents is a layered multi-LLM architecture from Wang et al. (ICLR 2025) where each layer's agents refine the previous layer's outputs. An MoA built only from open-source models reached 65.1% on AlpacaEval 2.0, beating GPT-4 Omni, which is evidence that an orchestrated ensemble can outperform a single frontier model on that benchmark.

Related: multi agent debate

Chain-of-Thought (CoT)

Also known as: CoT

Chain-of-Thought prompting (Wei et al., 2022) is the technique of getting an LLM to write out intermediate reasoning steps before giving an answer, either by showing worked examples in the prompt or by adding an instruction such as 'think step by step'. It unlocks reasoning that direct answer-only prompting misses and is the foundation that Self-Consistency and Multi-Agent Debate both build on.

Related: self consistency

Reflexion

Reflexion (Shinn et al., NeurIPS 2023) is a verbal self-critique loop where an agent reflects on prior attempts, generates a critique, and retries. It iteratively raises performance without weight updates — the direct precedent for MAD Studio's Saga recursive optimization passes.

Related: saga, self refine

Self-Refine

Self-Refine (Madaan et al., NeurIPS 2023) is single-model iterative refinement using self-generated feedback. It is the minimal version of what multi-agent debate scales up across roles — the same critique-and-revise loop, but performed by separate agents with distinct context to avoid Degeneration of Thought.

Related: reflexion, degeneration of thought

ChatEval

ChatEval (Chan et al., ICLR 2024) demonstrated that multi-agent debate panels evaluate generated text more reliably than single-judge baselines. It is one of the foundational results behind using LLM-as-judge debates for tasks like response quality scoring and open-ended evaluation.

Related: llm as judge, arbiter

AutoGen

AutoGen (Microsoft, COLM 2024) is a multi-agent conversation framework that orchestrates role-specialized agents through structured dialogues. It is one of the most-cited alternatives to MAD Studio for general multi-agent tasks; MAD Studio differentiates itself with peer-reviewed debate protocols and the M-MAD arbiter pipeline.

Related: multi agent debate

CrewAI

CrewAI is an open-source framework for role-based multi-agent orchestration with sequential or hierarchical task execution. Compared to MAD Studio, it is general-purpose agent orchestration; MAD Studio is specifically built for debate, arbitration, and recursive optimization with academic-grade protocols.

Related: multi agent debate, autogen

LangGraph

LangGraph is LangChain's graph-based agent orchestration library, modeling multi-agent flows as stateful directed graphs. MAD Studio uses purpose-built debate protocols rather than user-authored graphs, with built-in M-MAD scoring and Bullshit Index hallucination detection.

Related: multi agent debate, autogen

Model Context Protocol (MCP)

Also known as: MCP

The Model Context Protocol is an open standard for connecting LLM clients to external tools and data sources. MAD Studio ships a native MCP server so debates can be triggered as a callable tool from Claude Desktop, Cursor, or any MCP-compatible client — your agent of agents.

Sycophancy

Sycophancy is the tendency of LLMs to agree with users or peers even when that agreement contradicts the correct answer. In multi-agent debate, Wynn et al. (2025) showed agents can shift from correct to incorrect answers to match peers, which makes adversary calibration and adaptive stopping essential.

Related: agreement bias, calibration

Adaptive Stability Detection

Adaptive Stability Detection (Hu et al., NeurIPS 2025) is a stopping rule for multi-agent debates: stop the moment the arbiter verdict stabilizes across consecutive rounds, rather than running fixed-length debates. It improves accuracy over majority vote while cutting token cost by 30–60% in reported benchmarks.
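
The sketch below shows the general shape of such a stopping rule. It is a deliberately simplified stand-in, not the published procedure: it stops as soon as the last `patience` verdicts are identical. The names `debate_until_stable`, `run_round`, and `patience` are assumptions made for illustration.

```python
from typing import Callable, List

def debate_until_stable(run_round: Callable[[int], str],
                        patience: int = 2, max_rounds: int = 10) -> List[str]:
    """Run rounds until the verdict repeats `patience` times in a row."""
    verdicts: List[str] = []
    for round_index in range(max_rounds):
        verdicts.append(run_round(round_index))
        recent = verdicts[-patience:]
        # Stable means: enough rounds have run and the recent verdicts all agree.
        if len(recent) == patience and len(set(recent)) == 1:
            break
    return verdicts

# Hypothetical scripted verdicts standing in for an arbiter's per-round output.
scripted = ["A", "B", "B", "B", "A"]
verdicts = debate_until_stable(lambda i: scripted[i], patience=2, max_rounds=len(scripted))
print(verdicts)
```

With the scripted verdicts above, the loop stops after the third round and never pays for the last two.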

Related: m mad, truth seeking debate

GroupDebate

GroupDebate (Liu et al., AAMAS 2026) introduces group-based discussion to scale multi-agent debate efficiency. By partitioning agents into discussion groups before plenary synthesis, it cuts token cost relative to all-pairs debate while preserving accuracy on hard reasoning tasks.
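
To see why grouping saves cost, the sketch below partitions six agents into groups of three and counts how many agent pairs exchange messages in each setup. The `partition` helper and the pair-count comparison are illustrative assumptions; they ignore the plenary synthesis step and are not taken from the paper.

```python
from math import comb
from typing import List

def partition(agents: List[str], group_size: int) -> List[List[str]]:
    """Split agents into consecutive groups of at most group_size."""
    return [agents[i:i + group_size] for i in range(0, len(agents), group_size)]

agents = [f"agent_{i}" for i in range(6)]
groups = partition(agents, group_size=3)

pairs_all = comb(len(agents), 2)                      # every agent talks to every other
pairs_grouped = sum(comb(len(g), 2) for g in groups)  # agents talk only within a group
print(groups)
print(pairs_all, pairs_grouped)
```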

Related: multi agent debate, sparse topology

Sparse Communication Topology

Sparse Communication Topology (Li et al., EMNLP 2024 Findings) shows that fully-connected agent communication is wasteful — selecting a sparse subset of inter-agent edges preserves multi-agent debate accuracy while dramatically reducing context length and cost.
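
The difference in edge count is easy to check. The sketch below compares a fully connected graph with a ring in which each agent hears only from its two neighbours; the ring is chosen here purely for illustration and is not necessarily the topology the paper evaluates.

```python
from typing import Set, Tuple

Edge = Tuple[int, int]  # (sender, receiver)

def fully_connected(n: int) -> Set[Edge]:
    """Every agent sends to every other agent."""
    return {(i, j) for i in range(n) for j in range(n) if i != j}

def ring(n: int) -> Set[Edge]:
    """Every agent sends only to its left and right neighbours."""
    return {(i, (i + 1) % n) for i in range(n)} | {(i, (i - 1) % n) for i in range(n)}

n_agents = 6
dense, sparse = fully_connected(n_agents), ring(n_agents)
print(len(dense), len(sparse))
```

Each edge corresponds to one agent's message landing in another agent's context, so fewer edges means shorter prompts per round.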

Related: groupdebate, multi agent debate

Calibration

In LLM evaluation, calibration measures whether a model's stated confidence matches its actual accuracy. A well-calibrated agent that reports 80% confidence is right about 80% of the time, and it signals uncertainty on the questions it is likely to get wrong. Multi-agent debate generally improves calibration; the M-MAD arbiter scores it as a separate dimension.
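
One common way to put a number on calibration is expected calibration error (ECE): group predictions into confidence bins and average the gap between stated confidence and observed accuracy. The sketch below is a plain-Python version with made-up data; it is offered as an illustration and is not claimed to be the metric M-MAD or MAD Studio uses.

```python
from typing import List, Tuple

def expected_calibration_error(predictions: List[Tuple[float, bool]],
                               n_bins: int = 5) -> float:
    """predictions holds (stated confidence in [0, 1], was the answer correct?)."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in predictions:
        index = min(int(confidence * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((confidence, correct))
    total = len(predictions)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of the predictions.
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece

# Made-up data: an agent that says 90% but is right only half the time.
overconfident = [(0.9, True), (0.9, True), (0.9, False), (0.9, False)]
print(round(expected_calibration_error(overconfident), 2))
```

A score of 0 means stated confidence and accuracy match in every bin; larger values mean a bigger gap.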

Related: m mad, bullshit index

Steelman

Steelmanning is the rhetorical practice of constructing the strongest possible version of an opposing argument before rebutting it. In multi-agent debate, steelman quality is a scored dimension: adversaries that produce only strawmen score low and reduce trust in the verdict.

Related: red teaming

Position Drift

Position Drift is the failure mode where an agent silently abandons or contradicts an earlier claim without acknowledgment. The Bullshit Index tracks drift across turns: if an agent endorses A in turn 3 and contradicts it in turn 9 without explicit reversal, the meter rises.
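
The turn 3 versus turn 9 example can be expressed as a small check over extracted claims. The sketch assumes claims have already been extracted and normalized into (agent, proposition, stance) records, which is the hard part in practice; the `Claim` fields and `find_drift` function are hypothetical and do not describe how the Bullshit Index is implemented.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Claim:
    turn: int
    agent: str
    proposition: str
    endorses: bool                       # True = asserts it, False = denies it
    acknowledged_reversal: bool = False  # True if the agent openly changed its mind

def find_drift(claims: List[Claim]) -> List[Tuple[int, int, str, str]]:
    """Return (earlier turn, later turn, agent, proposition) for each silent flip."""
    last_seen: Dict[Tuple[str, str], Claim] = {}
    drifts = []
    for claim in sorted(claims, key=lambda c: c.turn):
        key = (claim.agent, claim.proposition)
        previous = last_seen.get(key)
        if (previous is not None and previous.endorses != claim.endorses
                and not claim.acknowledged_reversal):
            drifts.append((previous.turn, claim.turn, claim.agent, claim.proposition))
        last_seen[key] = claim
    return drifts

claims = [
    Claim(turn=3, agent="advocate", proposition="A", endorses=True),
    Claim(turn=9, agent="advocate", proposition="A", endorses=False),
]
print(find_drift(claims))
```

An explicit reversal (`acknowledged_reversal=True`) is not flagged, matching the definition above.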

Related: bullshit index, sycophancy

Hallucination

An LLM hallucination is a confidently stated claim that is fabricated, contradicted by sources, or unsupported by the evidence pack. Multi-agent debate reduces hallucination rates by forcing claims through adversarial cross-examination; the Bullshit Index quantifies them in real time.

Related: bullshit index, calibration

LLM-as-Judge

LLM-as-Judge is the practice of using a language model to score, rank, or arbitrate the outputs of other models. ChatEval and M-MAD both work in this paradigm. Risks include positional bias, length bias, and self-preference — mitigated by debate among judges and dimension-sweep scoring.

Related: chateval, m mad

Red-Teaming

In AI strategy, red-teaming means stress-testing an idea, claim, or decision against hostile scrutiny before commitment. Multi-agent debate operationalizes this by staging structured adversary teams with specific personas, then capturing the surviving claims and unresolved objections as deliverables.

Related: steelman, team discussion

Best-of-N

Best-of-N sampling runs an LLM N times and selects the highest-scoring output by a reward model or rubric. It is conceptually close to Self-Consistency but selects rather than votes. MAD Studio's Lab Experiments use Best-of-N over parameter sweeps with a validation prompt as the selector.
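
A minimal sketch of the select-rather-than-vote step. The `generate` and `score` callables are hypothetical stand-ins for a model call and a reward model or rubric; the toy scorer simply prefers longer drafts.

```python
from typing import Callable, List

def best_of_n(generate: Callable[[int], str], score: Callable[[str], float], n: int) -> str:
    """Generate n candidates and keep the one the scorer rates highest."""
    candidates: List[str] = [generate(i) for i in range(n)]
    return max(candidates, key=score)

# Hypothetical stand-ins: canned drafts and a scorer that rewards length.
drafts = ["ok", "a fuller answer", "short one"]
best = best_of_n(lambda i: drafts[i], score=len, n=len(drafts))
print(best)
```

Unlike majority voting, this works for open-ended outputs where no two samples are identical, but its quality depends entirely on the scorer.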

Related: self consistency, experiments

Worker

In MAD Studio, a Worker is a reusable agent definition: model, system prompt, persona, and provider configuration. Workers are snapshotted into a conversation when used, so editing a Worker definition never rewrites historical transcripts. Workers are the primitive that makes compositions of 2 to 100 agents practical.
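
One way to get this snapshot behaviour is to make the definition immutable and create a new object on every edit. The sketch below uses a frozen dataclass; the field names and the `Conversation` class are assumptions for illustration, not MAD Studio's schema.

```python
from dataclasses import dataclass, field, replace
from typing import List

@dataclass(frozen=True)
class Worker:
    model: str
    system_prompt: str
    persona: str
    provider: str

@dataclass
class Conversation:
    # Frozen Worker objects captured at the moment the conversation starts.
    participants: List[Worker] = field(default_factory=list)

critic = Worker(model="example-model", system_prompt="Challenge every claim.",
                persona="skeptic", provider="example-provider")
conversation = Conversation(participants=[critic])

# Editing produces a new Worker; the snapshot inside the conversation is untouched.
critic_v2 = replace(critic, system_prompt="Challenge only unsupported claims.")
print(conversation.participants[0].system_prompt)
print(critic_v2.system_prompt)
```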

Related: persona, playbook