Closed beta · 139 teams already in

The world's most advanced multi-agent debate platform.

MAD Studio is an operational thinking console for structured AI deliberation. Configure 2–100 reasoning agents, run them through peer-reviewed debate protocols, and watch them pressure-test ideas one turn at a time.

Request free beta access · Read the research guides
Join 139 teams · Free during beta · No credit card

Agents per session: 2 – 100
Debate phases: 1 – 10
Concurrent runs: 1 – 100

01 / Scientific foundation

Built on peer-reviewed research, not vibes.

MAD Studio operationalizes the leading academic frameworks in multi-agent debate, turning them from research notebooks into a production-grade workbench. Every protocol links to its published source — on arXiv, ACL Anthology, or NeurIPS.

New to the field? Read our guide on when to use multi-agent debate vs self-consistency.

Paper 01 · MIT / Google Brain · ICML 2024
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Du, Li, Torralba, Tenenbaum, Mordatch
Foundational result: agents critiquing each other across rounds converge on more factual, better-reasoned answers.
Read on arXiv PDF
Paper 02 · Tencent AI Lab · EMNLP 2024
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Liang, He, Jiao, Wang, Wang, Wang, Yang, Shi, Tu
Introduces the Degeneration-of-Thought problem and shows multi-agent debate unlocks divergent thinking where self-reflection stalls.
Read on ACL Anthology PDF arXiv
Paper 03 · ACL 2025
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation
Feng, Su, Zheng, Ren, Zhang, Wu, Wang, Liu
Introduces the per-dimension arbiter sweep that powers MAD Studio's truth-seeking verdict scoring.
Read on arXiv PDF
Paper 04 · ICLR 2024
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Chan, Chen, Su, Yu, Xue, Zhang, Fu, Liu
Demonstrates that multi-agent debate panels evaluate generated text more reliably than single-judge baselines.
Read on arXiv PDF
Paper 05 · Anthropic · ICML 2024 (Best Paper)
Debating with More Persuasive LLMs Leads to More Truthful Answers
Khan, Hughes, Valentine, Ruis, Sachan, Radhakrishnan, Grefenstette, Bowman, Rocktäschel, Perez
Strong empirical evidence that debate makes weaker judges reliably select truthful answers from stronger debaters.
Read on arXiv PDF
Paper 06 · NeurIPS 2025
Multi-Agent Debate for LLM Judges with Adaptive Stability Detection
Hu, Tan, Wang, Qu, Chen
Formalizes debate among LLM judges and adds adaptive stability detection so debates stop when consensus stabilizes — improving accuracy over majority vote.
Read on NeurIPS
Paper 07 · Microsoft · COLM 2024
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Wu, Bansal, Zhang, Wu, Li, Zhu, et al.
Shows that role-specialized agent groups orchestrated through structured conversation consistently outperform monolithic prompts on complex tasks.
Read on arXiv PDF
Paper 08 · Northeastern · NeurIPS 2023
Reflexion: Language Agents with Verbal Reinforcement Learning
Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao
Verbal self-critique loops iteratively raise agent performance — the direct precedent for Saga's recursive optimization passes.
Read on arXiv PDF
Paper 09 · CMU · NeurIPS 2023
Self-Refine: Iterative Refinement with Self-Feedback
Madaan, Tandon, Gupta, Hallinan, Gao, Wiegreffe, Alon, et al.
Single-model iterative refinement via self-generated feedback. The minimal version of what multi-agent debate scales up across roles.
Read on arXiv PDF
Paper 10 · Together AI · ICLR 2025
Mixture-of-Agents Enhances Large Language Model Capabilities
Wang, Wang, Athiwaratkun, Zhang, Zou
Layered multi-LLM collaboration where each layer's agents refine the previous layer's outputs. Open-source MoA reaches 65.1% on AlpacaEval 2.0, beating GPT-4 Omni.
Read on arXiv PDF
Paper 11 · 2026
Demystifying Multi-Agent Debate: The Role of Confidence and Diversity
Choi, Zhu, Li, et al.
Pinpoints when multi-agent debate actually beats majority vote: diversity-aware initialization plus calibrated confidence communication.
Read on arXiv PDF

Full reference list

Research guides

Practical guides on multi-agent debate

Peer-reviewed methodology, explained for builders — free to read, no signup required.

View all guides

02 / Capabilities

Five engines. One protocol library. Zero duct tape.

Five built-in debate engines and a custom protocol library you can fork and save. Each mode is peer-reviewed or production-hardened — from 10-phase M-MAD truth-seeking to blind pairwise ping-pong.

Truth-Seeking Debate
10-phase · M-MAD
Independent openings and rebuttals, then a five-dimension arbiter sweep — correctness, evidence, responsiveness, calibration, citations — before a binding verdict.
Open Discussion
Multi-agent
Rolling brainstorm where 2–100 agents reason in public. Models nominate the next speaker; human interventions steer mid-run.
Team Discussion
Battle · Collaborate
Two saved Teams of 2–6 Workers each. Private huddles, public spokesperson turns — battle mode for adversarial critique, collaboration mode for joint synthesis.
Blind Ping Pong
Masked 1:1
Two participants alternate in a blind, human-style chat. Identities masked until external stop — ideal for unbiased pairwise reasoning.
Scored Debate
FREE-MAD
Round-robin debate on a single structured question. Score-based decisions across rounds with anti-conformity prompts built in.
Custom Protocol Library
Your rules
Fork any built-in engine, edit prompt sections, and save named variants. Your protocol — without forking the codebase.
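The five-dimension arbiter sweep in the Truth-Seeking Debate card can be pictured roughly as follows. This is a minimal sketch, not MAD Studio's implementation: the dimension names come from the card, while `arbiter_sweep`, `Verdict`, `toy_scorer`, and the unweighted-mean aggregation are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Dimension names are taken from the Truth-Seeking Debate description.
DIMENSIONS = ("correctness", "evidence", "responsiveness", "calibration", "citations")


@dataclass(frozen=True)
class Verdict:
    """Immutable result: dimension scores cannot be edited after the sweep."""
    scores: Tuple[Tuple[str, float], ...]  # ((dimension, score), ...) in sweep order
    overall: float


def arbiter_sweep(transcript: str, score_dimension: Callable[[str, str], float]) -> Verdict:
    # One independent pass per dimension; no pass sees another pass's score.
    scores = tuple((dim, score_dimension(transcript, dim)) for dim in DIMENSIONS)
    # Assumed aggregation: unweighted mean of the five dimension scores.
    overall = sum(score for _, score in scores) / len(scores)
    return Verdict(scores=scores, overall=overall)


def toy_scorer(transcript: str, dimension: str) -> float:
    # Stand-in for a model call: counts mentions of the dimension name.
    return float(transcript.lower().count(dimension))


verdict = arbiter_sweep("Evidence was cited; correctness and evidence held up.", toy_scorer)
print(dict(verdict.scores), verdict.overall)
```

Freezing the dataclass is one simple way to mirror the idea that each dimension is scored before the final call and cannot be rewritten afterwards.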

01
Evidence packs & claims ledger
Attach plain-text evidence to truth-seeking runs. Claims are tracked across phases so the arbiter scores against what was actually cited — not what agents wish they had said.
02
2 to 100 reasoning agents
Compose participants from a reusable Worker library. Snapshotted into the conversation so editing a Worker never rewrites historical transcripts.
03
Dimension-sweep arbiter
Five independent arbiter passes feed a structured verdict the model cannot retroactively rewrite — each dimension scored before the final call.
04
Rolling summary + recent window
Bounded prompts that scale to long-running runs. Full transcripts stay persisted, but each turn is fed a compact, faithful context.
05
Saga recursive optimization
Hidden child runs iteratively refine the parent prompt. Stops on convergence, score threshold, cost cap, or generation limit.
06
Lab Experiments
Sweep temperature, repetition, frequency, and presence penalties across hidden child copies. Score transcripts against a validation prompt and expected outcome.
07
Live human intervention
Inject guidance mid-run, optionally targeting the next responding agent. Interventions are first-class transcript inputs.
08
Multi-provider routing
OpenRouter, LM Studio, and deterministic dummy providers. Configure per-agent fallbacks with server-side prompt caching.
09
Cost, runtime, and turn caps
Hard ceilings on every dimension that matters. Sessions self-terminate before they burn budget or time.
10
Durable orchestration
Supabase-backed job claims, runtime locks, and transactional mutations. One failed conversation never stalls the others.
11
Personas, Playbooks, Teams
Reusable prompt guidance, hidden discussion rules, and snapshotted 2–6 Worker teams. Configure once, run forever.
12
In-app + email notifications
Get pinged when a run finishes or hits its cost cap. Resend integration when configured, in-app fallback when not.
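The "Rolling summary + recent window" item above describes bounded prompts over a fully persisted transcript. A minimal sketch of that idea, assuming a fixed window size and a caller-supplied summarizer; `RollingContext` and `naive_summarize` are hypothetical names, and the real summarizer would presumably be a model call rather than string truncation.

```python
from collections import deque
from typing import Callable, Deque, List


class RollingContext:
    """Keeps the full transcript, but builds prompts from a summary plus recent turns."""

    def __init__(self, window: int, summarize: Callable[[str, str], str]):
        self.window = window
        self.summarize = summarize
        self.summary = ""
        self.recent: Deque[str] = deque()
        self.full_transcript: List[str] = []  # never truncated

    def add_turn(self, turn: str) -> None:
        self.full_transcript.append(turn)
        self.recent.append(turn)
        # Fold turns that fall out of the window into the rolling summary.
        while len(self.recent) > self.window:
            evicted = self.recent.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def prompt(self) -> str:
        return "\n".join([f"[summary] {self.summary}", *self.recent])


def naive_summarize(summary: str, turn: str) -> str:
    # Stand-in for a model-written summary: keep the first five words of the turn.
    head = " ".join(turn.split()[:5])
    return f"{summary} | {head}".strip(" |")


ctx = RollingContext(window=2, summarize=naive_summarize)
for i in range(1, 6):
    ctx.add_turn(f"turn {i}: agent argues point number {i} at length")
print(ctx.prompt())
```

The prompt size depends on the window and the summary length, not on how many turns the run has accumulated.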

03 / Programmable

Drive every debate from your own stack.

Full REST API and a first-class Model Context Protocol server. Spin up sessions, inject human turns, stream transcripts, and run entire experiments — programmatically, from anywhere.

REST API · v1

Full REST API

Every UI action is mirrored as a documented endpoint with stable contracts, idempotency keys, and webhook callbacks for long-running runs.

POST /v1/sessions
POST /v1/sessions/{id}/start
POST /v1/sessions/{id}/intervene
GET /v1/sessions/{id}/turns
GET /v1/sessions/{id}/transcript
POST /v1/teams/{a}/vs/{b}
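A client for the endpoints listed above might look something like the sketch below. Only the paths and HTTP methods come from this page. The base URL, the payload fields (`topic`, `agents`), and the `Idempotency-Key` header name are assumptions for illustration, so check the API documentation for the real contract. The code builds requests without sending them.

```python
import json
from urllib.request import Request

# Hypothetical host: the page does not state the API base URL.
BASE_URL = "https://api.example.invalid"


def create_session_request(payload: dict, idempotency_key: str) -> Request:
    # Payload shape and header name are assumptions, not documented fields.
    return Request(
        f"{BASE_URL}/v1/sessions",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json", "Idempotency-Key": idempotency_key},
    )


def start_session_request(session_id: str) -> Request:
    return Request(f"{BASE_URL}/v1/sessions/{session_id}/start", data=b"", method="POST")


def transcript_request(session_id: str) -> Request:
    return Request(f"{BASE_URL}/v1/sessions/{session_id}/transcript", method="GET")


req = create_session_request({"topic": "Is P equal to NP?", "agents": 4}, "demo-key-1")
print(req.get_method(), req.full_url)
```

Passing each `Request` to `urllib.request.urlopen` would send it. That step is left out here because the host above is a placeholder.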

MCP server · stdio · http

Model Context Protocol native

Drop MAD Studio into Claude Desktop, Cursor, or any MCP-compatible client. Orchestrate full debates as a callable tool — your agent of agents.

mad.session.create
mad.session.start
mad.session.intervene
mad.session.transcript
mad.team.battle
mad.experiment.run

{
  "mcpServers": {
    "mad-studio": {
      "command": "npx",
      "args": ["-y", "@mad-studio/mcp"]
    }
  }
}

04 / Recursive

Find what a single prompt can't find.

Saga and Experiments run hidden recursive debates until the agents stop surprising each other. Surface insights no model — not GPT 5.5, not Opus 4.7 — would give you from a single shot.

Saga · recursive · convergence-stopped

Let them argue until something stabilizes.

Saga spawns zero-turn child sessions, scores each transcript against your rubric, applies the best optimizer suggestion, and re-runs. It stops when the score curve flattens or when another stop condition fires: a score threshold, a cost cap, or a generation limit.

Hidden child runs — never pollute the source conversation
Per-generation scorecard, applied patch, optimizer suggestion
Stops on convergence, score threshold, cost cap, or generation limit
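The four stop conditions in the list above can be sketched as a single check run after each generation. This is an illustrative sketch only. The function name, the default thresholds, and the patience-based definition of convergence are assumptions, since the page does not say how Saga detects a flat score curve.

```python
from typing import List, Optional


def saga_stop_reason(
    scores: List[float],
    total_cost: float,
    *,
    score_threshold: float = 0.9,
    cost_cap: float = 5.0,
    max_generations: int = 10,
    patience: int = 2,
    min_delta: float = 0.01,
) -> Optional[str]:
    """Return the first stop condition that fires, or None to keep iterating."""
    if not scores:
        return None
    if scores[-1] >= score_threshold:
        return "score_threshold"
    if total_cost >= cost_cap:
        return "cost_cap"
    if len(scores) >= max_generations:
        return "generation_limit"
    # Assumed convergence rule: the best score of the last `patience`
    # generations beats the earlier best by less than `min_delta`.
    if len(scores) > patience:
        best_before = max(scores[:-patience])
        best_recent = max(scores[-patience:])
        if best_recent - best_before < min_delta:
            return "convergence"
    return None


print(saga_stop_reason([0.50, 0.60, 0.60, 0.60], total_cost=1.2))
```

The order of the checks is a design choice: here a run that reaches the score threshold reports that reason even if it also exhausted its budget on the same generation.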

Experiments · parameter sweep · best-of-N

Run it a hundred times. Keep what wins.

Experiments fan out hidden child runs across temperature, repetition, frequency, and presence sweeps. Score every transcript against a validation prompt. Promote the configuration that beats the field. The answer you would have hand-written, rewritten by the search.

Run 1 mirrors the source; later runs randomize sampling per agent
Validation prompt + expected outcome power transcript scoring
Stops on iteration limit, score threshold, or total cost cap
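The three bullets above describe a best-of-N search loop. A rough sketch under stated assumptions: `run_experiment`, `toy_score`, the sampling ranges, and the flat per-run cost are hypothetical, and only temperature and presence penalty are randomized here to keep the example short.

```python
import random
from typing import Callable, Dict, Tuple

Sampling = Dict[str, float]
Config = Dict[str, Sampling]  # agent name -> sampling parameters


def run_experiment(
    source: Config,
    score_run: Callable[[Config], float],
    cost_per_run: float,
    *,
    max_runs: int = 20,
    score_threshold: float = 0.95,
    cost_cap: float = 10.0,
    seed: int = 0,
) -> Tuple[Config, float, int]:
    rng = random.Random(seed)
    best_cfg, best_score = source, float("-inf")
    spent, runs = 0.0, 0
    # Stop on the iteration limit or before a run would exceed the cost cap.
    while runs < max_runs and spent + cost_per_run <= cost_cap:
        if runs == 0:
            cfg = source  # run 1 mirrors the source configuration
        else:
            # Later runs randomize sampling per agent (ranges are assumed).
            cfg = {
                agent: {
                    "temperature": round(rng.uniform(0.0, 1.5), 2),
                    "presence_penalty": round(rng.uniform(0.0, 1.0), 2),
                }
                for agent in source
            }
        score = score_run(cfg)
        runs += 1
        spent += cost_per_run
        if score > best_score:
            best_cfg, best_score = cfg, score
        if best_score >= score_threshold:
            break  # score threshold reached
    return best_cfg, best_score, runs


def toy_score(cfg: Config) -> float:
    # Stand-in for transcript scoring: rewards a mean temperature near 0.7.
    temps = [s["temperature"] for s in cfg.values()]
    return 1.0 - abs(sum(temps) / len(temps) - 0.7)


SOURCE: Config = {
    "A": {"temperature": 0.2, "presence_penalty": 0.0},
    "B": {"temperature": 0.2, "presence_penalty": 0.0},
}
best_cfg, best_score, runs = run_experiment(SOURCE, toy_score, cost_per_run=0.05)
print(runs, round(best_score, 3))
```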

The promise

Beyond what one prompt can reach. Both Saga and Experiments surface answers no human or single model would have arrived at unaided.

Join the beta

05 / Evaluation

Twelve dimensions. One verdict you can defend.

The MAD Studio Evaluation Matrix scores every agent, turn, and run across twelve rigorous dimensions — from logical consistency to steelman quality to calibration. The single number you ship is the single number you can prove.

Live matrix · session #84a3 · turn 14 / 28

Score bands: ≥ 90 · 80–89 · 70–79 · 60–69 · < 60

Agent	Logical	Evidence	Rebuttal	Steelman	Calibration	Novelty	Signal-to-noise	Citation	Position	Concession	Falsifiability	Decision-readiness
Debater A	92	78	88	71	84	66	81	90	58	73	87	79
Debater B	85	91	76	82	70	88	74	83	65	80	72	86
Arbiter	94	89	95	90	93	71	92	96	84	91	88	95

aggregate 87.3 · winner Debater B (Δ +2.1) · confidence 0.91 · 12 dims · 36 cells · 1.4s
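As a reading aid for the matrix above, the sketch below loads the 36 cells and computes a plain unweighted mean per agent plus each agent's weakest dimension. The page does not say how its aggregate is weighted, so this makes no attempt to reproduce the 87.3 figure or the Δ shown. `agent_means` and `weakest_dimension` are hypothetical helpers.

```python
from statistics import mean
from typing import Dict, List

DIMENSIONS = [
    "Logical", "Evidence", "Rebuttal", "Steelman", "Calibration", "Novelty",
    "Signal-to-noise", "Citation", "Position", "Concession", "Falsifiability",
    "Decision-readiness",
]

# Scores copied from the matrix, in the column order above.
MATRIX: Dict[str, List[int]] = {
    "Debater A": [92, 78, 88, 71, 84, 66, 81, 90, 58, 73, 87, 79],
    "Debater B": [85, 91, 76, 82, 70, 88, 74, 83, 65, 80, 72, 86],
    "Arbiter":   [94, 89, 95, 90, 93, 71, 92, 96, 84, 91, 88, 95],
}


def agent_means(matrix: Dict[str, List[int]]) -> Dict[str, float]:
    # Unweighted mean across the twelve dimensions (an assumed aggregation).
    return {agent: round(mean(scores), 2) for agent, scores in matrix.items()}


def weakest_dimension(scores: List[int]) -> str:
    return DIMENSIONS[scores.index(min(scores))]


means = agent_means(MATRIX)
for agent, scores in MATRIX.items():
    print(agent, means[agent], weakest_dimension(scores))
```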

The Bullshit Index · beta

We built a bullshit meter. It works.

Every factual claim is extracted, cross-checked against your evidence pack, the public web, and the agent's own earlier turns. Hallucinated citations, drifted positions, false precision, and unsupported assertions all push the needle. Bullshit gets flagged before it becomes a quote.

verified: 73 claims
flagged: 9 hedged
contradicted: 3 cases
false positives: 0.04%
avg latency: 1.2s / claim
sources: evidence + web

Session BS-index: low

6 / 100 · ▼ 4 vs prev run

clean · hedged · contradicted · fabricated

Latest flag

Debater A · turn 11 cited a 2027 study that does not exist.

source: web cross-check · score impact: −2.3 calibration · −4 evidence

Custom rubrics
Define your own dimensions, weights, and scoring guidance. The arbiter pipeline picks them up without re-architecting the protocol.
ELO across topics
Pairwise debate outcomes update per-agent, per-topic ELO ratings. Find which Worker is actually best at adversarial law versus product strategy.
Heatmaps + diffs
Compare two runs side by side. Spot the dimension that shifted when you tweaked the prompt. Export to CSV, JSON, or paste straight into a report.
Auditable verdicts
Every score is traceable to the turn that earned it. Click any cell to jump to the supporting transcript span and the arbiter rationale.
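The "ELO across topics" item above can be illustrated with the standard Elo update, keyed by (Worker, topic) so each topic keeps its own ladder. This is a sketch of the textbook formula, not a description of MAD Studio's internals. The K-factor of 32 and the starting rating of 1500 are assumed values.

```python
from typing import Dict, Tuple

K_FACTOR = 32.0        # assumed; the page does not state a K-factor
START_RATING = 1500.0  # assumed starting rating for unseen (worker, topic) pairs

Ratings = Dict[Tuple[str, str], float]


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_outcome(ratings: Ratings, topic: str, winner: str, loser: str) -> None:
    key_w, key_l = (winner, topic), (loser, topic)
    r_w = ratings.get(key_w, START_RATING)
    r_l = ratings.get(key_l, START_RATING)
    # The winner gains exactly what the loser gives up, so the total is conserved.
    delta = K_FACTOR * (1.0 - expected_score(r_w, r_l))
    ratings[key_w] = r_w + delta
    ratings[key_l] = r_l - delta


ratings: Ratings = {}
record_outcome(ratings, "adversarial law", "Worker A", "Worker B")
print(ratings)
```

Because the topic is part of the key, a win in adversarial law leaves the same Worker's product strategy rating untouched.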

06 / Use cases

One platform. Every domain that needs structured argument.

MAD Studio is the only platform on the web that delivers peer-reviewed multi-agent debate as a daily-driver tool. Where AutoGen, CrewAI, and LangGraph give you a graph of agents, MAD Studio gives you a verdict — built on M-MAD scoring and the Bullshit Index. From war rooms to lab notebooks to weekend curiosity — the same engine, the same rigor.

Political Campaigns

Stress-test every message before it ships.

Stage opposition agents that hammer your platform with the strongest rebuttals from across the spectrum. Find weak claims before journalists do.

Academic Research

Simulate the toughest peer review you'll ever get.

Run hypotheses through a panel of skeptical agents with persona-specific priors. Capture structured rebuttals, citations needed, and open questions.

Marketing & Brand

Debate positioning until only the strongest survives.

Run two competing campaign angles as Teams. Battle mode surfaces critique; collaboration mode synthesizes the best of both into a single brief.

Legal & Due Diligence

Map adversarial arguments end-to-end.

Configure prosecutor and defense agents over a shared evidence pack. Get a structured claims ledger and a five-dimension verdict scorecard.

Product Strategy

Institutionalize the devil's advocate.

Pressure-test roadmap decisions, pricing models, and launch plans against agents seeded with competitor personas, customer archetypes, and risk lenses.

Education & Training

Watch reasoning happen, step by step.

Make critical thinking visible. Students follow turn-by-turn argument structure, evidence handling, and the arbiter's dimension-level rationale.

Investigative Journalism

Pre-flight your strongest counterstory.

Before publication, simulate the most aggressive defense your subject could mount. Identify the holes that will get raised — and patch them.

Just for Fun

Send GPT 5.5 and Opus 4.7 into a five-round debate.

Pick a spicy topic. Pick six agents. Hit start. Watch them go. Export the transcript. No wrong answers — that's what the arbiter is for.

Frequently asked

Multi-agent debate, demystified.

Quick answers to the questions teams ask before they run their first MAD Studio session.

01 What is multi-agent debate?

Multi-agent debate is an AI reasoning technique where multiple language models (or multiple instances of the same model with different roles) argue about a question across structured rounds. Peer-reviewed research consistently shows debate produces more factual, better-calibrated answers than single-prompt baselines, especially on hard reasoning and evaluation tasks.

02 How is MAD Studio different from running prompts in ChatGPT or Claude?

A single prompt gives you one model's first-pass answer. MAD Studio runs 2–100 reasoning agents through five built-in protocol engines — Truth-Seeking Debate (10-phase M-MAD), Open Discussion, Team Discussion (battle/collaboration), Blind Ping Pong, Scored Debate (FREE-MAD) — plus a custom Protocol Library you can fork and save. Claims get rebutted, evidence gets weighed, and verdicts come with auditable per-dimension scorecards.

03 Which AI models can I use with MAD Studio?

MAD Studio supports any model on OpenRouter (including GPT 5.5, Claude Opus 4.7, Gemini, Llama, Mixtral, and dozens more), local models served by LM Studio, and deterministic dummy providers for testing. You can mix providers per agent and configure automatic fallbacks.

04 Is multi-agent debate scientifically validated?

Yes. MAD Studio is built on peer-reviewed research from MIT, Google Brain, Anthropic, Tencent AI Lab, and others. Every protocol traces back to published methodology. Key papers:

See all 11 foundational papers ↓

05 Where can I read more about multi-agent debate?

We publish free, in-depth guides on multi-agent debate methodology — no signup required. Start here:

See all 11 foundational papers ↓

06 What can I use multi-agent debate for?

Political campaigns stress-test messaging against simulated opposition. Researchers run hypotheses through skeptical peer-review panels. Marketers debate competing campaign angles. Lawyers map adversarial arguments. Product teams institutionalize the devil's advocate. Educators make critical thinking visible. Anyone can run debates for fun — pick a topic, pick six agents, hit start.

07 What is the Bullshit Index?

The Bullshit Index is MAD Studio's real-time fact-checking layer. Every claim made by an agent is extracted, cross-referenced against your evidence pack, the public web, and the agent's earlier turns. Hallucinated citations, drifted positions, false precision, and contradicted statements all push the meter up. It's hallucination detection built directly into the debate loop.

08 Can I integrate MAD Studio into my own product?

Yes. MAD Studio offers a full REST API and a native Model Context Protocol (MCP) server. Spin up sessions, inject human turns, stream transcripts, and run experiments programmatically. The MCP server drops directly into Claude Desktop, Cursor, and any MCP-compatible client.

09 What does Saga do?

Saga is MAD Studio's recursive optimization engine. It spawns hidden child sessions from a source conversation, scores each transcript against your rubric, applies the best optimizer suggestion, and re-runs — generation after generation — until the score curve flattens or a stop condition fires. It's how you find answers that no single prompt would have produced.

10 Is MAD Studio an alternative to AutoGen, CrewAI, or LangGraph?

MAD Studio is purpose-built for multi-agent debate specifically — verdict-grade protocols, the M-MAD arbiter pipeline, and the Bullshit Index — rather than general-purpose agent orchestration. If you need a graph of role-specialized agents executing arbitrary tasks, AutoGen, CrewAI, and LangGraph are excellent. If you need auditable structured disagreement with per-dimension scoring, MAD Studio is the right tool.

11 How much does multi-agent debate cost in tokens?

Token cost scales with agents × rounds, so a 6-agent, 5-round Truth-Seeking Debate is roughly 30× a single-prompt baseline before arbiter passes. MAD Studio mitigates this with rolling summaries, sparse communication topology (Li et al., EMNLP 2024), adaptive stopping (Hu et al., NeurIPS 2025), and hard cost caps that self-terminate sessions before they burn budget.
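The agents × rounds estimate above is easy to turn into a back-of-the-envelope calculator. A minimal sketch, assuming every agent turn and every arbiter pass costs about one single-prompt baseline; real runs will differ because context grows across rounds. `debate_cost_multiplier` is a hypothetical helper.

```python
def debate_cost_multiplier(agents: int, rounds: int, arbiter_passes: int = 0) -> int:
    """Rough cost of a debate in units of one single-prompt baseline."""
    if agents < 1 or rounds < 1 or arbiter_passes < 0:
        raise ValueError("agents and rounds must be >= 1, arbiter_passes >= 0")
    # Assumption: one baseline unit per agent turn and per arbiter pass.
    return agents * rounds + arbiter_passes


# The 6-agent, 5-round example from the answer above, before arbiter passes.
print(debate_cost_multiplier(6, 5))                     # 30
# The same run with a five-dimension arbiter sweep added.
print(debate_cost_multiplier(6, 5, arbiter_passes=5))   # 35
```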

12 Does multi-agent debate actually beat majority voting?

It depends on the task. For discrete math and multiple-choice with one correct answer, Self-Consistency (CoT-SC) is usually the better default. For factuality, open-ended strategy, and adversarial red-teaming, multi-agent debate consistently wins in peer-reviewed benchmarks (Du et al. ICML 2024, Khan et al. ICML 2024). MAD Studio supports both paradigms and a hybrid mode that combines them.

13 Can I use MAD Studio without writing any code?

Yes. Every protocol — Open Discussion, Truth-Seeking Debate, Team Discussion, Saga, Lab Experiments — is configurable from the web UI. The REST API and MCP server are there if you want to drive debates programmatically from your own stack, but they are optional. The platform ships with reusable Personas, Playbooks, and Teams so a typical first session takes under two minutes to configure.

14 Is my data used to train AI models?

No. MAD Studio sends prompts to whichever provider you configure (OpenRouter, LM Studio, or your own endpoint) — we do not retrain models on your transcripts and do not share session data with third parties. Local-only deployments via LM Studio are fully private. Transcripts are stored in your Supabase workspace and you can purge them at any time.

15 What is the Degeneration of Thought problem?

Degeneration of Thought, formalized by Liang et al. (EMNLP 2024), is the failure mode where an LLM commits to an answer and then cannot produce genuinely novel reasoning during self-reflection — even when wrong. The critic and advocate inside one model share the same latent commitment. Multi-agent debate fixes this by separating roles into agents with distinct context.

07 / Beta access

Be among the first to deploy a real debate engine.

Closed beta is rolling out now. Join 139 teams already on the waitlist — free early access, no card, no commitment, no spam.

By joining, you agree to receive infrequent product updates. Unsubscribe anytime. Questions? mad@multiagentdebates.com

Further reading

Full reference list

Practical guides on multi-agent debate

When to Use Multi-Agent Debate vs Self-Consistency

How to Red-Team Ideas with Multi-Agent Debate

Multi-Agent Debate Glossary

The Bullshit Index
