What is multi-agent debate?

Multi-agent debate is an AI reasoning technique where multiple language models (or multiple instances of the same model with different roles) argue about a question across structured rounds. Peer-reviewed research shows that debate can produce more factual, better-calibrated answers than single-prompt baselines, especially on hard reasoning and evaluation tasks.

How is MAD Studio different from running prompts in ChatGPT or Claude?

A single prompt gives you one model's first-pass answer. MAD Studio runs 2–100 reasoning agents through five built-in protocol engines — Truth-Seeking Debate (10-phase M-MAD), Open Discussion, Team Discussion (battle/collaboration), Blind Ping Pong, Scored Debate (FREE-MAD) — plus a custom Protocol Library you can fork and save. Claims get rebutted, evidence gets weighed, and verdicts come with auditable per-dimension scorecards.

Which AI models can I use with MAD Studio?

MAD Studio supports any model on OpenRouter (including GPT 5.5, Claude Opus 4.7, Gemini, Llama, Mixtral, and dozens more), local models served by LM Studio, and deterministic dummy providers for testing. You can mix providers per agent and configure automatic fallbacks.

Is multi-agent debate scientifically validated?

Yes. MAD Studio is built on peer-reviewed research from MIT, Google Brain, Anthropic, Tencent AI Lab, and others. Every protocol traces back to published methodology. Key papers:

Where can I read more about multi-agent debate?

We publish free, in-depth guides on multi-agent debate methodology — no signup required. Start here:

What can I use multi-agent debate for?

Political campaigns stress-test messaging against simulated opposition. Researchers run hypotheses through skeptical peer-review panels. Marketers debate competing campaign angles. Lawyers map adversarial arguments. Product teams institutionalize the devil's advocate. Educators make critical thinking visible. Anyone can run debates for fun — pick a topic, pick six agents, hit start.

What is the Bullshit Index?

The Bullshit Index is MAD Studio's real-time fact-checking layer. Every claim made by an agent is extracted, cross-referenced against your evidence pack, the public web, and the agent's earlier turns. Hallucinated citations, drifted positions, false precision, and contradicted statements all push the meter up. It's hallucination detection built directly into the debate loop.

Can I integrate MAD Studio into my own product?

Yes. MAD Studio offers a full REST API and a native Model Context Protocol (MCP) server. Spin up sessions, inject human turns, stream transcripts, and run experiments programmatically. The MCP server drops directly into Claude Desktop, Cursor, and any MCP-compatible client.

What is Saga?

Saga is MAD Studio's recursive optimization engine. It spawns hidden child sessions from a source conversation, scores each transcript against your rubric, applies the best optimizer suggestion, and re-runs, generation after generation, until the score curve flattens or a stop condition fires. It's how you find answers that no single prompt would have produced.
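The generate, score, keep-the-best, repeat loop described here can be sketched generically. This is a minimal illustration under stated assumptions, not Saga's actual implementation: `optimize`, `propose`, `score`, `max_generations`, and `min_gain` are hypothetical names, and a toy numeric objective stands in for rubric-scored transcripts.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def optimize(
    seed: T,
    propose: Callable[[T], Iterable[T]],  # hypothetical: variants of the current best
    score: Callable[[T], float],          # hypothetical: rubric score, higher is better
    max_generations: int = 20,
    min_gain: float = 1e-6,
) -> Tuple[T, List[float]]:
    """Keep the best-scoring variant each generation until the gain flattens."""
    best, best_score = seed, score(seed)
    history = [best_score]
    for _ in range(max_generations):
        candidates = list(propose(best))
        if not candidates:
            break  # nothing left to try
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score - best_score < min_gain:
            break  # score curve has flattened
        best, best_score = top, top_score
        history.append(best_score)
    return best, history


# Toy stand-in: the "transcript" is an integer and the rubric peaks at 3.
best, history = optimize(0, lambda x: [x - 1, x + 1], lambda x: -(x - 3) ** 2)
print(best, history)  # 3 [-9, -4, -1, 0]
```

The `max_generations` ceiling plays the role of a stop condition, so the loop terminates even if scores keep creeping upward.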

Is MAD Studio an alternative to AutoGen, CrewAI, or LangGraph?

MAD Studio is purpose-built for multi-agent debate specifically — verdict-grade protocols, the M-MAD arbiter pipeline, and the Bullshit Index — rather than general-purpose agent orchestration. If you need a graph of role-specialized agents executing arbitrary tasks, AutoGen, CrewAI, and LangGraph are excellent. If you need auditable structured disagreement with per-dimension scoring, MAD Studio is the right tool.

How much does multi-agent debate cost in tokens?

Token cost scales with agents × rounds, so a 6-agent, 5-round Truth-Seeking Debate is roughly 30× a single-prompt baseline before arbiter passes. MAD Studio mitigates this with rolling summaries, sparse communication topology (Li et al., EMNLP 2024), adaptive stopping (Hu et al., NeurIPS 2025), and hard cost caps that self-terminate sessions before they burn budget.
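The arithmetic behind that estimate is simple enough to check before a run. The sketch below assumes every agent speaks once per round and every call costs about the same number of tokens; it ignores transcript growth, which makes later turns more expensive in practice. The function names and `tokens_per_call` are illustrative, not part of MAD Studio.

```python
def debate_calls(agents: int, rounds: int, arbiter_passes: int = 0) -> int:
    """Count model calls for one debate: every agent speaks once per round."""
    if agents < 1 or rounds < 1 or arbiter_passes < 0:
        raise ValueError("agents and rounds must be >= 1, arbiter_passes >= 0")
    return agents * rounds + arbiter_passes


def within_cap(agents: int, rounds: int, tokens_per_call: int, token_cap: int,
               arbiter_passes: int = 0) -> bool:
    """Check a planned run against a hard token ceiling before starting it."""
    return debate_calls(agents, rounds, arbiter_passes) * tokens_per_call <= token_cap


print(debate_calls(6, 5))                    # 30
print(debate_calls(6, 5, arbiter_passes=5))  # 35
print(within_cap(6, 5, 800, 20_000))         # False: 30 * 800 = 24,000 tokens
```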

Does multi-agent debate actually beat majority voting?

It depends on the task. For discrete math and multiple-choice with one correct answer, Self-Consistency (CoT-SC) is usually the better default. For factuality, open-ended strategy, and adversarial red-teaming, multi-agent debate consistently wins in peer-reviewed benchmarks (Du et al. ICML 2024, Khan et al. ICML 2024). MAD Studio supports both paradigms and a hybrid mode that combines them.
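For the discrete-answer case, the self-consistency baseline amounts to sampling several answers and taking the most common one. A minimal sketch, assuming the sampled answers are already extracted as short strings (`majority_vote` is an illustrative name):

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the share of samples that agree with it."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


answer, agreement = majority_vote(["42", "42 ", "41", "42", "40"])
print(answer, agreement)  # 42 0.6
```

A low agreement share is a useful signal in its own right: it marks questions where the samples disagree and a debate round may be worth the extra cost.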

Can I use MAD Studio without writing any code?

Yes. Every protocol — Open Discussion, Truth-Seeking Debate, Team Discussion, Saga, Lab Experiments — is configurable from the web UI. The REST API and MCP server are there if you want to drive debates programmatically from your own stack, but they are optional. The platform ships with reusable Personas, Playbooks, and Teams so a typical first session takes under two minutes to configure.

Is my data used to train AI models?

No. MAD Studio sends prompts to whichever provider you configure (OpenRouter, LM Studio, or your own endpoint) — we do not retrain models on your transcripts and do not share session data with third parties. Local-only deployments via LM Studio are fully private. Transcripts are stored in your Supabase workspace and you can purge them at any time.

What is the Degeneration of Thought problem?

Degeneration of Thought, formalized by Liang et al. (EMNLP 2024), is the failure mode where an LLM commits to an answer and then cannot produce genuinely novel reasoning during self-reflection — even when wrong. The critic and advocate inside one model share the same latent commitment. Multi-agent debate fixes this by separating roles into agents with distinct context.

How to Red-Team Ideas with Multi-Agent Debate

TL;DR

Red-teaming with LLMs means simulating your harshest critics before real critics get the chance — not checking for toxic outputs after the fact.
Single-prompt “play devil's advocate” fails because one model playing both sides collapses into agreement (Degeneration-of-Thought).
Multi-agent debate works when agents have different roles, evidence, and incentives — then an independent arbiter scores the result.
Run bounded rounds with hard cost caps; the transcript is the deliverable, not just the final verdict.

Every organization has a version of this story: a confident strategy doc, a campaign message, a research claim, or a product bet that looked bulletproof in internal review — until someone outside the room asked the one question nobody thought to ask. Red-teaming exists to manufacture that question on demand.

With LLMs, the temptation is to paste your draft into ChatGPT and ask “what's wrong with this?” That helps, but it is not adversarial deliberation. It is one model negotiating with itself. This guide covers how to red-team ideas properly using multi-agent debate — what to configure, what the research says, and where it fails.

Two kinds of “red teaming”

The term gets used loosely. In AI safety, red-teaming usually means probing a model for harmful outputs, jailbreaks, or policy violations — work that Khan et al. and others frame as debate between persuasive agents. In strategy, journalism, law, and politics, red-teaming means something different: stress-testing an idea against hostile scrutiny before you commit to it.

This guide focuses on the second kind — adversarial idea testing — though the same architecture applies to both. The goal is not a thumbs-up/thumbs-down. The goal is a structured record of objections, weak evidence, missing citations, and failure modes you can act on.

Why single-prompt devil's advocate fails

Ask one LLM to critique its own proposal and you hit three walls:

Degeneration-of-Thought. Liang et al. (EMNLP 2024) showed that once a model commits to an answer, self-reflection rarely produces genuinely novel reasoning — even when the initial answer is wrong. The critic and the advocate share the same latent commitment.
Agreement bias. Wynn et al. found agents shift from correct to incorrect answers to match peers, favoring social agreement over challenging flawed reasoning. A single model playing both roles drifts toward the path of least resistance.
No auditable adversary. You get a bullet list of concerns with no record of which objections survived scrutiny, which were rebutted, and which the arbiter weighted most heavily. That is brainstorming, not deliberation.

Multi-agent debate fixes this by separating agents into distinct roles with separate context — each agent defends or attacks from a defined perspective, across multiple rounds, before a verdict.

The red-team debate architecture

A production-grade red-team run has five components. Skip any one and you are back to expensive theater.

Component	Role	Common mistake
Subject matter	The claim, draft, or decision under test — shared evidence pack	Vague prompts with no shared document for agents to cite
Advocate team	Agents defending the proposal with assigned personas	All agents given identical system prompts
Adversary team	Agents assigned to attack — skeptic, competitor, regulator, journalist	Adversaries that are too polite or too generic
Debate protocol	Bounded rounds with structured turn order and rebuttal rules	Open-ended chat with no turn limits or cost caps
Independent arbiter	Separate scoring pass on fixed dimensions — not a debater	Letting the strongest debater declare victory

Feng et al. (M-MAD, ACL 2025) formalized the arbiter side: instead of one holistic “who won?” judgment, run independent passes on correctness, evidence use, responsiveness to counterarguments, calibration, and citation quality. A claim can lose on evidence while winning on rhetoric — and you want to know that before publication, not after.
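One way to keep those passes separate in your own tooling is to store one score per dimension rather than a single number. The sketch below is illustrative only: the `Scorecard` class, the 0-to-1 scale, and the 0.5 threshold are assumptions, and the dimension names are taken from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List

DIMENSIONS = (
    "correctness",
    "evidence_use",
    "responsiveness",
    "calibration",
    "citation_quality",
)


@dataclass
class Scorecard:
    claim: str
    scores: Dict[str, float]  # assumed scale: 0.0 (worst) to 1.0 (best)

    def __post_init__(self) -> None:
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")

    def weakest(self) -> str:
        """Dimension with the lowest score."""
        return min(DIMENSIONS, key=lambda d: self.scores[d])

    def flags(self, threshold: float = 0.5) -> List[str]:
        """Dimensions scoring below the threshold, in canonical order."""
        return [d for d in DIMENSIONS if self.scores[d] < threshold]


card = Scorecard("Claim 2", {
    "correctness": 0.8, "evidence_use": 0.4, "responsiveness": 0.7,
    "calibration": 0.6, "citation_quality": 0.2,
})
print(card.weakest())  # citation_quality
print(card.flags())    # ['evidence_use', 'citation_quality']
```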

Five red-team scenarios that work

1. Political and campaign messaging

Configure opposition agents seeded with personas across the political spectrum — not caricatures, but plausible voters and commentators with specific priors. Run Team Discussion in battle mode: one team defends the message, one team attacks it with the strongest available rebuttals. The deliverable is a list of claims that survived cross-examination and a list that did not.

What you are looking for: lines that sound persuasive in isolation but collapse when an agent with different values engages. That is the question a hostile interviewer or opponent will ask on camera.

2. Product and strategy decisions

Assign agents to competitor personas, customer archetypes, and internal risk lenses (legal, security, finance). Use Open Discussion for exploratory objection surfacing, then Truth-Seeking Debate to pressure-test the top three risks. Du et al. found debate particularly strong on factuality: agents drop uncertain claims when peers challenge them, which is exactly what you want in a roadmap review.

3. Research claims and pre-print review

Simulate peer review before submission. Configure agents with field-specific skepticism: methodology critic, replication skeptic, statistics reviewer, domain expert with opposing priors. The arbiter scores each major claim on evidence and citation quality. Gaps become explicit: “Claim 2 rated low on citation quality — no primary source cited for the effect size.”

4. Legal and due diligence

Map prosecutor/defense argument trees over a shared evidence pack. SWE-Debate showed competitive multi-agent debate helps when the problem spans multiple parts of a codebase; the same logic applies to multi-document legal arguments where fault propagation matters. Each agent traces a different line of reasoning; debate consolidates the fix plan (or in legal terms, the claims ledger).

5. Investigative journalism pre-flight

Before publication, simulate the most aggressive defense your subject could mount. Configure a subject advocate with the best available counter-narrative, a legal counsel agent focused on defamation risk, and a skeptical editor agent. If the story survives that panel, you have done more diligence than most newsrooms can afford on deadline.

How to configure agents that actually disagree

ChatEval (ICLR 2024) found that diverse role prompts are essential — identical personas degrade multi-agent performance. For red-teaming specifically:

Give adversaries real incentives in the prompt. “Find the weakest claim” beats “offer constructive feedback.”
Mix models if you can. Heterogeneous agents surface blind spots homogeneous pools miss — though Liang et al. note LLMs may not be fair judges when debaters use different models. Keep the arbiter separate and consistent.
Share the evidence, not the conclusion. All agents read the same source document; none start with a pre-written verdict.
Cap rounds and cost. GroupDebate shows token cost scales brutally with agents × rounds. Red-teaming should have hard ceilings — stop when the arbiter stabilizes or you hit your budget.
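The configuration rules above (distinct roles, shared evidence, hard ceilings) fit in a small debate loop. This is a sketch under assumptions, not MAD Studio's engine: the `Agent` class and `run_debate` are hypothetical, the agents are stub functions rather than model calls, and word count stands in for a real token count.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    name: str
    role: str  # e.g. "advocate" or "adversary"
    respond: Callable[[str, List[str]], str]  # (evidence, transcript) -> next turn


def run_debate(agents: List[Agent], evidence: str,
               max_rounds: int, max_tokens: int) -> Tuple[List[str], str]:
    """Run bounded rounds; stop at the round cap or before exceeding the cost cap."""
    transcript: List[str] = []
    spent = 0
    for round_no in range(1, max_rounds + 1):
        for agent in agents:
            turn = agent.respond(evidence, transcript)
            cost = len(turn.split())  # crude proxy for tokens
            if spent + cost > max_tokens:
                return transcript, "cost_cap"
            spent += cost
            transcript.append(f"[r{round_no}] {agent.name} ({agent.role}): {turn}")
    return transcript, "round_cap"


# Stub agents: every agent reads the same evidence, none starts with a verdict.
advocate = Agent("Ana", "advocate", lambda ev, t: "the plan is sound")
skeptic = Agent("Sam", "adversary", lambda ev, t: "find the weakest claim")

turns, reason = run_debate([advocate, skeptic], "draft.txt", max_rounds=2, max_tokens=10)
print(len(turns), reason)  # 2 cost_cap
```

Checking the cap before appending the turn means a run never overshoots its budget by a partial round.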

Warning: debate can make things worse

If adversaries are too weak, too polite, or outnumbered by advocates, debate becomes rubber-stamping. If agents are homogeneous, Wynn et al. show accuracy can decrease as agents conform. Calibrate adversary strength and use adaptive stopping (Hu et al., NeurIPS 2025) so you do not keep debating after consensus is stable.
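A simple stand-in for adaptive stopping is to halt once the arbiter's score has stopped moving. The heuristic below is not the method from Hu et al.; it is a plain stability check, and `is_stable`, `window`, and `tolerance` are illustrative names and defaults.

```python
from typing import Sequence


def is_stable(scores: Sequence[float], window: int = 3, tolerance: float = 0.02) -> bool:
    """True when the last `window` arbiter scores differ by at most `tolerance`."""
    if window < 2:
        raise ValueError("window must be at least 2")
    if len(scores) < window:
        return False  # not enough rounds to judge stability
    recent = scores[-window:]
    return max(recent) - min(recent) <= tolerance


print(is_stable([0.40, 0.60, 0.70]))              # False: still moving
print(is_stable([0.40, 0.60, 0.71, 0.72, 0.72]))  # True: last three rounds agree
```

Stability only means the panel has stopped changing its mind. It says nothing about whether the stable answer is correct, which is why adversary strength still needs calibrating.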

What a good red-team output looks like

Do not optimize for a binary pass/fail. A useful red-team run produces:

A claims ledger — every material assertion extracted and tracked across rounds
Objection log — which counterarguments were raised, which were rebutted, which stand
Dimension scores — M-MAD-style breakdown, not a single gut-check score
Citation gaps — claims marked unsupported or weakly sourced
Full transcript — auditable record for compliance, editorial, or legal review

The transcript is often more valuable than the verdict. It shows why a claim failed — the reasoning chain a human reviewer can follow and challenge.
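If you post-process a transcript yourself, the claims ledger and objection log map naturally onto two small record types. The sketch below is one possible shape, not MAD Studio's export format: `Claim`, `Objection`, and the three status labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Objection:
    raised_by: str
    text: str
    rebutted: bool = False


@dataclass
class Claim:
    text: str
    sources: List[str] = field(default_factory=list)
    objections: List[Objection] = field(default_factory=list)

    def standing_objections(self) -> List[Objection]:
        """Objections that no agent successfully rebutted."""
        return [o for o in self.objections if not o.rebutted]

    def status(self) -> str:
        if not self.sources:
            return "unsupported"  # citation gap
        return "contested" if self.standing_objections() else "survived"


claim = Claim(
    "Feature X cuts churn",
    sources=["q3-cohort-report"],
    objections=[
        Objection("skeptic", "sample is self-selected"),
        Objection("competitor", "rivals ship this already", rebutted=True),
    ],
)
print(claim.status())                    # contested
print(len(claim.standing_objections()))  # 1
```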

Red-team debate vs other LLM checks

Method	Best for	Weakness
Single-prompt critique	Quick sanity check, typos, obvious gaps	DoT, agreement bias, no audit trail
Self-consistency voting	Discrete answers (math, classification)	No adversarial scrutiny; shared blind spots
RAG fact-check	Verifying claims against a corpus	Corpus-limited; no argument structure
Multi-agent red-team debate	Strategy, messaging, research, legal arguments	Higher cost; needs protocol tuning

For a deeper comparison of debate vs self-consistency on reasoning tasks, see our guide on when to use multi-agent debate vs self-consistency.

Running red-team debate in MAD Studio

MAD Studio is built for exactly this workflow. Configure adversary and advocate Workers with distinct personas and playbooks, then snapshot them into a Team Discussion (battle mode) or Truth-Seeking Debate run. Set cost and turn caps before you start. Inject human guidance mid-run if an agent misses an obvious line of attack. When the run finishes, review the M-MAD dimension scores and export the full transcript.

Saga and Lab Experiments help tune adversary strength — sweep temperature and persona variants in hidden child runs until the red team reliably surfaces the objections your human reviewers would raise.

Stress-test your next idea before it ships

MAD Studio runs structured adversarial debate with M-MAD arbiter scoring, 2–100 agents, and full transcript persistence. Join the beta waitlist to red-team without building the orchestration yourself.
