Closed beta · 127 teams already in

The world's most advanced multi-agent debate platform.

MAD Studio is an operational thinking console for structured AI deliberation. Configure 2–100 reasoning agents, run them through peer-reviewed debate protocols, and watch them pressure-test ideas one turn at a time.

Request free beta access
Join 127 teams · Free during beta · no credit card
Agents per session: 2 – 100
Debate phases: 1 – 10
Concurrent runs: 1 – 100
01 / Scientific foundation

Built on peer-reviewed research, not vibes.

MAD Studio operationalizes the leading academic frameworks in multi-agent debate, turning them from research notebooks into a production-grade workbench. Every protocol is traceable to a published methodology.

  1. Paper 01 · MIT / Google Brain · ICML 2024
    Improving Factuality and Reasoning in Language Models through Multiagent Debate

    Du, Li, Torralba, Tenenbaum, Mordatch

    Foundational result: agents critiquing each other across rounds converge on more factual, better-reasoned answers.

  2. Paper 02 · Tencent AI Lab · EMNLP 2024
    Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

    Liang, He, Jiao, Wang, Wang, Wang, Yang, Shi, Tu

    Establishes that adversarial multi-agent debate counteracts degeneration of thought and unlocks deeper reasoning.

  3. Paper 03 · ACL 2025
    M-MAD: Multidimensional Multi-Agent Debate for Translation Evaluation

    Feng, Zhao, Lyu, Li, Tu, Wang

    Introduces the per-dimension arbiter sweep that powers MAD Studio's truth-seeking verdict scoring.

  4. Paper 04 · ICLR 2024
    ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

    Chan, Chen, Su, Yu, Xue, Zhang, Fu, Liu

    Demonstrates that multi-agent debate panels evaluate generated text more reliably than single-judge baselines.

  5. Paper 05 · Anthropic · ICML 2024 (Best Paper)
    Debating with More Persuasive LLMs Leads to More Truthful Answers

    Khan, Hughes, Valentine, Ruis, Sachan, Radhakrishnan, Bowman, Perez

    Strong empirical evidence that debate makes weaker judges reliably select truthful answers from stronger debaters — the modern, capable-model successor to the original debate-as-alignment thesis.

  6. Paper 06 · Microsoft · COLM 2024
    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Wu, Bansal, Zhang, Wu, Li, Zhu, Jiang, Zhang, Zhang, Liu, Awadallah, White, Burger, Wang

    Shows that role-specialized agent groups orchestrated through structured conversation consistently outperform monolithic prompts on complex tasks.

  7. Paper 07 · Northeastern · NeurIPS 2023
    Reflexion: Language Agents with Verbal Reinforcement Learning

    Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao

    Verbal self-critique loops iteratively raise agent performance — the direct precedent for Saga's recursive optimization passes.

  8. Paper 08 · CMU · NeurIPS 2023
    Self-Refine: Iterative Refinement with Self-Feedback

    Madaan, Tandon, Gupta, Hallinan, Gao, Wiegreffe, Alon, et al.

    Single-model iterative refinement via self-generated feedback. The minimal version of what multi-agent debate scales up across roles.

  9. Paper 09 · Together AI · ICLR 2025
    Mixture-of-Agents Enhances Large Language Model Capabilities

    Wang, Wang, Athiwaratkun, Zhang, Zou

    Layered multi-LLM collaboration where each layer's agents refine the previous layer's outputs. Open-source MoA reaches 65.1% on AlpacaEval 2.0, beating GPT-4 Omni.

  10. Paper 10 · 2026
    Demystifying Multi-Agent Debate: The Role of Confidence and Diversity

    Choi, Zhu, Li, et al.

    Pinpoints when multi-agent debate actually beats majority vote: diversity-aware initialization plus calibrated confidence communication. Directly informs MAD Studio's persona and confidence design.
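
The round-based critique loop that Paper 01 describes can be sketched in a few lines. This is an illustrative sketch only, not MAD Studio's implementation: `Agent`, `make_stub_agent`, and `run_debate` are hypothetical names, and the stub agents are deterministic stand-ins for real model calls.

```python
from collections import Counter
from typing import Callable, List

# An agent maps (question, peer answers from the previous round) to an answer.
Agent = Callable[[str, List[str]], str]


def make_stub_agent(initial: str) -> Agent:
    """Deterministic stand-in for an LLM (assumption): keeps its answer
    unless a strict majority of its peers agrees on a different one."""
    state = {"answer": initial}

    def agent(question: str, peer_answers: List[str]) -> str:
        if peer_answers:
            top, count = Counter(peer_answers).most_common(1)[0]
            if count > len(peer_answers) / 2:
                state["answer"] = top
        return state["answer"]

    return agent


def run_debate(question: str, agents: List[Agent], rounds: int) -> List[List[str]]:
    """Round 0 collects independent answers; each later round shows every
    agent the other agents' previous answers and lets it revise."""
    history = [[agent(question, []) for agent in agents]]
    for _ in range(rounds):
        previous = history[-1]
        current = []
        for i, agent in enumerate(agents):
            peers = previous[:i] + previous[i + 1:]
            current.append(agent(question, peers))
        history.append(current)
    return history


agents = [make_stub_agent("42"), make_stub_agent("42"), make_stub_agent("41")]
history = run_debate("What is 6 * 7?", agents, rounds=2)
final_answer = Counter(history[-1]).most_common(1)[0][0]
```

With these stubs the dissenting agent adopts the majority answer after one round. Real agents would return critiques and revised reasoning rather than bare answers.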

02 / Capabilities

Engineered for serious deliberation.

Every primitive a researcher, strategist, or analyst needs to run a structured AI debate at production quality — without rebuilding the scaffolding from scratch.

  • 01

    Three rigorous protocols

    Open Discussion for brainstorming, Truth-Seeking Debate with a 10-phase M-MAD verdict, and Team Discussion for two-team battle or collaboration modes.

  • 02

    2 to 100 reasoning agents

    Compose participants from a reusable Worker library. Each Worker is snapshotted into the conversation, so editing a Worker never rewrites historical transcripts.

  • 03

    Dimension-sweep arbiter

    Five independent arbiter passes — correctness, evidence use, responsiveness, calibration, citation quality — feed a structured verdict the model cannot retroactively rewrite.

  • 04

    Rolling summary + recent window

    Bounded prompts that scale to long-running runs. Full transcripts stay persisted, but each turn is fed a compact, faithful context.

  • 05

    Saga recursive optimization

    Hidden child runs iteratively refine the parent prompt. Stops on convergence, score threshold, cost cap, or generation limit.

  • 06

    Lab Experiments

    Sweep temperature, repetition, frequency, and presence penalties across hidden child copies. Score transcripts against a validation prompt and expected outcome.

  • 07

    Live human intervention

    Inject guidance mid-run, optionally targeting the next responding agent. Interventions are first-class transcript inputs.

  • 08

    Multi-provider routing

    OpenRouter, LM Studio, and deterministic dummy providers. Configure per-agent fallbacks with server-side prompt caching.

  • 09

    Cost, runtime, and turn caps

    Hard ceilings on every dimension that matters. Sessions self-terminate before they burn budget or time.

  • 10

    Durable orchestration

    Supabase-backed job claims, runtime locks, and transactional mutations. One failed conversation never stalls the others.

  • 11

    Personas, Playbooks, Teams

    Reusable prompt guidance, hidden discussion rules, and snapshotted 2–6 Worker teams. Configure once, run forever.

  • 12

    In-app + email notifications

    Get pinged when a run finishes or hits its cost cap. Resend integration when configured, in-app fallback when not.
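
The "rolling summary + recent window" idea from capability 04 can be illustrated as follows. This is a minimal sketch under stated assumptions: `RollingContext` and `summarize` are hypothetical names, and the summarizer is a truncating placeholder where a real system would presumably call a model.

```python
from dataclasses import dataclass, field
from typing import List


def summarize(summary: str, turn: str, max_chars: int = 200) -> str:
    """Placeholder summarizer (assumption): append the evicted turn and
    keep only the last `max_chars` characters."""
    combined = (summary + " " + turn).strip()
    return combined[-max_chars:]


@dataclass
class RollingContext:
    """Keeps the last `window` turns verbatim and folds older turns into a summary."""
    window: int = 3
    summary: str = ""
    recent: List[str] = field(default_factory=list)
    transcript: List[str] = field(default_factory=list)  # full history is kept

    def add_turn(self, turn: str) -> None:
        self.transcript.append(turn)
        self.recent.append(turn)
        # Evict the oldest turns into the summary once the window overflows.
        while len(self.recent) > self.window:
            evicted = self.recent.pop(0)
            self.summary = summarize(self.summary, evicted)

    def prompt_context(self) -> str:
        """Bounded context handed to the next agent turn."""
        parts = []
        if self.summary:
            parts.append("Summary so far: " + self.summary)
        parts.extend(self.recent)
        return "\n".join(parts)


ctx = RollingContext(window=2)
for i in range(1, 6):
    ctx.add_turn(f"turn {i}")
```

The prompt size stays bounded by the summary cap plus the window, while `transcript` still holds every turn.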

03 / Programmable

Drive every debate from your own stack.

Full REST API and a first-class Model Context Protocol server. Spin up sessions, inject human turns, stream transcripts, and run entire experiments — programmatically, from anywhere.

REST API · v1

Full REST API

Every UI action is mirrored as a documented endpoint with stable contracts, idempotency keys, and webhook callbacks for long-running runs.

  • POST /v1/sessions
  • POST /v1/sessions/{id}/start
  • POST /v1/sessions/{id}/intervene
  • GET /v1/sessions/{id}/turns
  • GET /v1/sessions/{id}/transcript
  • POST /v1/teams/{a}/vs/{b}
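
As a rough illustration of calling the endpoints listed above, the sketch below builds the HTTP requests without sending them. Only the paths come from this page. The host, the bearer-token auth scheme, the `Idempotency-Key` header name, and every request-body field are assumptions made for the example.

```python
import json
import uuid
from typing import Optional
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder host (assumption)


def build_request(method: str, path: str, api_key: str,
                  body: Optional[dict] = None) -> Request:
    """Build (but do not send) a request for one of the v1 endpoints."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # auth scheme is an assumption
        "Accept": "application/json",
    }
    data = None
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    if method == "POST":
        # Header name is an assumption; the page only says idempotency keys exist.
        headers["Idempotency-Key"] = str(uuid.uuid4())
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


session_id = "sess_demo"  # in practice this would come from the create response
create = build_request("POST", "/v1/sessions", "demo-key",
                       {"protocol": "truth_seeking", "agent_count": 4})
intervene = build_request("POST", f"/v1/sessions/{session_id}/intervene",
                          "demo-key", {"message": "Focus on the cost estimate."})
transcript = build_request("GET", f"/v1/sessions/{session_id}/transcript", "demo-key")
```

Against a real host, each request could then be sent with `urllib.request.urlopen`. The published API documentation is the authority on field names and headers.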
MCP server · stdio · http

Model Context Protocol native

Drop MAD Studio into Claude Desktop, Cursor, or any MCP-compatible client. Orchestrate full debates as a callable tool — your agent of agents.

  • mad.session.create
  • mad.session.start
  • mad.session.intervene
  • mad.session.transcript
  • mad.team.battle
  • mad.experiment.run
{
  "mcpServers": {
    "mad-studio": {
      "command": "npx",
      "args": ["-y", "@mad-studio/mcp"]
    }
  }
}
04 / Recursive

Find what a single prompt can't find.

Saga and Experiments run hidden recursive debates until the agents stop surprising each other. Surface insights no model — not GPT 5.5, not Opus 4.7 — would give you from a single shot.

Saga · recursive · convergence-stopped

Let them argue until something stabilizes.

Saga spawns zero-turn child sessions, scores each transcript against your rubric, applies the best optimizer suggestion, and re-runs. It stops when the score curve flattens or when a score threshold, cost cap, or generation limit is reached.

  • Hidden child runs — never pollute the source conversation
  • Per-generation scorecard, applied patch, optimizer suggestion
  • Stops on convergence, score threshold, cost cap, or generation limit
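
The four stop conditions listed above can be made concrete with a small loop. This is a sketch under stated assumptions, not Saga's actual code: `SagaLimits`, `run_saga`, and `toy_evaluate` are hypothetical, each generation here evaluates a single suggestion rather than picking the best of several, and the scoring function is a toy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SagaLimits:
    score_threshold: float = 0.95
    cost_cap: float = 1.00          # total spend allowed, arbitrary units
    max_generations: int = 10
    convergence_eps: float = 0.01   # smaller gains count as a flat curve


def run_saga(prompt: str,
             evaluate: Callable[[str], Tuple[float, float, str]],
             limits: SagaLimits) -> Tuple[str, List[float], str]:
    """`evaluate(prompt)` stands for one hidden child run and returns
    (score, cost, suggested_prompt). Returns (best_prompt, scores, reason)."""
    scores: List[float] = []
    spent = 0.0
    best_prompt, best_score = prompt, float("-inf")
    current = prompt
    for _ in range(limits.max_generations):
        score, cost, suggestion = evaluate(current)
        spent += cost
        scores.append(score)
        improvement = score - best_score
        if score > best_score:
            best_prompt, best_score = current, score
        if best_score >= limits.score_threshold:
            return best_prompt, scores, "score_threshold"
        if spent >= limits.cost_cap:
            return best_prompt, scores, "cost_cap"
        if len(scores) > 1 and improvement < limits.convergence_eps:
            return best_prompt, scores, "convergence"
        current = suggestion  # apply the optimizer's suggested patch
    return best_prompt, scores, "generation_limit"


def toy_evaluate(prompt: str) -> Tuple[float, float, str]:
    """Toy child run: each '+' appended to the prompt halves the remaining gap."""
    n = prompt.count("+")
    return 1 - 0.5 ** (n + 1), 0.1, prompt + "+"


best_prompt, scores, reason = run_saga("p", toy_evaluate, SagaLimits())
```

With the default limits the toy run ends on the score threshold after five generations. Tightening `cost_cap` or `max_generations` makes one of the other conditions fire first.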
Experiments · parameter sweep · best-of-N

Run it a hundred times. Keep what wins.

Experiments fan out hidden child runs across temperature, repetition, frequency, and presence sweeps. Score every transcript against a validation prompt. Promote the configuration that beats the field. The answer you would have hand-written, rewritten by the search.

  • Run 1 mirrors the source; later runs randomize sampling per agent
  • Validation prompt + expected outcome power transcript scoring
  • Stops on iteration limit, score threshold, or total cost cap
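
A parameter sweep of this shape might look like the sketch below. Everything here is illustrative: `run_experiment`, `sample_config`, and `toy_score` are hypothetical names, the sweep ranges are assumed (the page does not state them), and the scorer is a toy that stands in for transcript scoring against a validation prompt.

```python
import random
from typing import Callable, Dict, Optional, Tuple

Sampling = Dict[str, float]
AgentConfigs = Dict[str, Sampling]

# Assumed sweep ranges; not taken from MAD Studio.
RANGES = {
    "temperature": (0.0, 1.5),
    "repetition_penalty": (1.0, 1.3),
    "frequency_penalty": (0.0, 1.0),
    "presence_penalty": (0.0, 1.0),
}


def sample_config(rng: random.Random) -> Sampling:
    return {name: round(rng.uniform(lo, hi), 3) for name, (lo, hi) in RANGES.items()}


def run_experiment(source: AgentConfigs,
                   score_run: Callable[[AgentConfigs], float],
                   *, iterations: int, score_threshold: float,
                   cost_cap: float, cost_per_run: float,
                   seed: int = 0) -> Tuple[Optional[Tuple[float, AgentConfigs]], str]:
    """Returns ((best_score, best_configs), stop_reason)."""
    rng = random.Random(seed)
    best: Optional[Tuple[float, AgentConfigs]] = None
    spent = 0.0
    for run in range(1, iterations + 1):
        if spent + cost_per_run > cost_cap:
            return best, "cost_cap"
        # Run 1 mirrors the source; later runs randomize sampling per agent.
        configs = source if run == 1 else {a: sample_config(rng) for a in source}
        score = score_run(configs)
        spent += cost_per_run
        if best is None or score > best[0]:
            best = (score, configs)
        if score >= score_threshold:
            return best, "score_threshold"
    return best, "iteration_limit"


def toy_score(configs: AgentConfigs) -> float:
    """Toy scorer: rewards temperatures close to 0.7."""
    temps = [c["temperature"] for c in configs.values()]
    return 1.0 - sum(abs(t - 0.7) for t in temps) / len(temps)


source = {"debater_a": {"temperature": 1.2}, "debater_b": {"temperature": 0.2}}
best, reason = run_experiment(source, toy_score, iterations=5,
                              score_threshold=2.0, cost_cap=10.0, cost_per_run=0.1)
```

The winning configuration is simply the highest-scoring run seen before a stop condition fires. A fixed `seed` keeps the sweep reproducible.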
The promise

Beyond what one prompt can reach. Both Saga and Experiments surface answers no human or single model would have arrived at unaided.

Join the beta
05 / Evaluation

Twelve dimensions. One verdict you can defend.

The MAD Studio Evaluation Matrix scores every agent, turn, and run across twelve rigorous dimensions — from logical consistency to steelman quality to calibration. The single number you ship is the single number you can prove.

live matrix · session #84a3 · turn 14 / 28
Legend: ≥ 90 · 80–89 · 70–79 · 60–69 · < 60

Dimension            Debater A   Debater B   Arbiter
Logical              92          85          94
Evidence             78          91          89
Rebuttal             88          76          95
Steelman             71          82          90
Calibration          84          70          93
Novelty              66          88          71
Signal-to-noise      81          74          92
Citation             90          83          96
Position             58          65          84
Concession           73          80          91
Falsifiability       87          72          88
Decision-readiness   79          86          95

aggregate 87.3 · winner Debater B (Δ +2.1) · confidence 0.91 · 12 dims · 36 cells · 1.4s
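
To show how a matrix like this can be summarized, the sketch below loads the sample values from the panel above and computes plain unweighted per-agent means, the weakest dimension per agent, and the legend band for a score. The page does not state how MAD Studio weights dimensions into its aggregate, so this is not expected to reproduce the panel's own figures. The names `SCORES`, `band`, and `weakest` are hypothetical.

```python
from statistics import mean

DIMENSIONS = [
    "Logical", "Evidence", "Rebuttal", "Steelman", "Calibration", "Novelty",
    "Signal-to-noise", "Citation", "Position", "Concession",
    "Falsifiability", "Decision-readiness",
]

# Sample values copied from the demo matrix on this page.
SCORES = {
    "Debater A": [92, 78, 88, 71, 84, 66, 81, 90, 58, 73, 87, 79],
    "Debater B": [85, 91, 76, 82, 70, 88, 74, 83, 65, 80, 72, 86],
    "Arbiter":   [94, 89, 95, 90, 93, 71, 92, 96, 84, 91, 88, 95],
}


def band(score: int) -> str:
    """Map a score to the legend bands shown with the matrix."""
    for floor, label in [(90, "≥ 90"), (80, "80–89"), (70, "70–79"), (60, "60–69")]:
        if score >= floor:
            return label
    return "< 60"


def weakest(agent: str) -> str:
    """Name of the lowest-scoring dimension for an agent."""
    row = SCORES[agent]
    return DIMENSIONS[row.index(min(row))]


# Unweighted means (assumption: equal weight per dimension).
averages = {agent: round(mean(row), 1) for agent, row in SCORES.items()}
```

A custom rubric would replace the equal weighting with per-dimension weights.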
the bullshit index · beta

We built a bullshit meter. It works.

Every factual claim is extracted, cross-checked against your evidence pack, the public web, and the agent's own earlier turns. Hallucinated citations, drifted positions, false precision, and unsupported assertions all push the needle. Bullshit gets flagged before it becomes a quote.

  • verified: 73 claims
  • flagged: 9 hedged
  • contradicted: 3 cases
  • false positives: 0.04%
  • avg latency: 1.2s / claim
  • sources: evidence + web
Session BS-index: low
6 / 100 · ▼ 4 vs prev run
Legend: clean · hedged · contradicted · fabricated
Latest flag
Debater A · turn 11 cited a 2027 study that does not exist.
source: web cross-check · score impact: −2.3 calibration · −4 evidence
  • Custom rubrics

    Define your own dimensions, weights, and scoring guidance. The arbiter pipeline picks them up without re-architecting the protocol.

  • ELO across topics

    Pairwise debate outcomes update per-agent, per-topic ELO ratings. Find which Worker is actually best at adversarial law versus product strategy.

  • Heatmaps + diffs

    Compare two runs side by side. Spot the dimension that shifted when you tweaked the prompt. Export to CSV, JSON, or paste straight into a report.

  • Auditable verdicts

    Every score is traceable to the turn that earned it. Click any cell to jump to the supporting transcript span and the arbiter rationale.
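
The per-agent, per-topic ratings mentioned under "ELO across topics" can be illustrated with the standard Elo update. This is a sketch under stated assumptions: the K-factor of 32 and the starting rating of 1500 are conventional defaults rather than MAD Studio's documented values, and `TopicElo` is a hypothetical name.

```python
from collections import defaultdict

K = 32            # assumed K-factor
INITIAL = 1500.0  # assumed starting rating


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


class TopicElo:
    """Tracks a separate rating for every (worker, topic) pair."""

    def __init__(self) -> None:
        self.ratings = defaultdict(lambda: INITIAL)

    def record(self, topic: str, winner: str, loser: str, draw: bool = False) -> None:
        ra = self.ratings[(winner, topic)]
        rb = self.ratings[(loser, topic)]
        ea = expected_score(ra, rb)
        sa = 0.5 if draw else 1.0
        # Each side moves by K times (actual result minus expected result).
        self.ratings[(winner, topic)] = ra + K * (sa - ea)
        self.ratings[(loser, topic)] = rb + K * ((1 - sa) - (1 - ea))


elo = TopicElo()
elo.record("adversarial law", winner="Worker A", loser="Worker B")
```

Keying ratings by topic means a Worker's win in one domain does not move its rating in another.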

06 / Use cases

One platform. Every domain that needs structured argument.

MAD Studio is the only platform on the web that delivers peer-reviewed multi-agent debate as a daily-driver tool. From war rooms to lab notebooks to weekend curiosity — the same engine, the same rigor.

Political Campaigns

Stress-test every message before it ships.

Stage opposition agents that hammer your platform with the strongest rebuttals from across the spectrum. Find weak claims before journalists do.

Academic Research

Simulate the toughest peer review you'll ever get.

Run hypotheses through a panel of skeptical agents with persona-specific priors. Capture structured rebuttals, citations needed, and open questions.

Marketing & Brand

Debate positioning until only the strongest survives.

Run two competing campaign angles as Teams. Battle mode surfaces critique; collaboration mode synthesizes the best of both into a single brief.

Legal & Due Diligence

Map adversarial arguments end-to-end.

Configure prosecutor and defense agents over a shared evidence pack. Get a structured claims ledger and a five-dimension verdict scorecard.

Product Strategy

Institutionalize the devil's advocate.

Pressure-test roadmap decisions, pricing models, and launch plans against agents seeded with competitor personas, customer archetypes, and risk lenses.

Education & Training

Watch reasoning happen, step by step.

Make critical thinking visible. Students follow turn-by-turn argument structure, evidence handling, and the arbiter's dimension-level rationale.

Investigative Journalism

Pre-flight your strongest counterstory.

Before publication, simulate the most aggressive defense your subject could mount. Identify the holes that will get raised — and patch them.

Just for Fun

Send GPT 5.5 and Opus 4.7 into a five-round debate.

Pick a spicy topic. Pick six agents. Hit start. Watch them go. Export the transcript. No wrong answers — that's what the arbiter is for.

Frequently asked

Multi-agent debate, demystified.

Quick answers to the questions teams ask before they run their first MAD Studio session.

01 · What is multi-agent debate?
Multi-agent debate is an AI reasoning technique where multiple language models (or multiple instances of the same model with different roles) argue about a question across structured rounds. Peer-reviewed research consistently shows debate produces more factual, better-calibrated answers than single-prompt baselines, especially on hard reasoning and evaluation tasks.
02 · How is MAD Studio different from running prompts in ChatGPT or Claude?
A single prompt gives you one model's first-pass answer. MAD Studio runs 2–100 reasoning agents through formal protocols — open discussion, truth-seeking debate with a 10-phase M-MAD verdict, or two-team battle and collaboration modes — so claims get rebutted, evidence gets weighed, and verdicts come with auditable per-dimension scorecards instead of a single confidence number.
03 · Which AI models can I use with MAD Studio?
MAD Studio supports any model on OpenRouter (including GPT 5.5, Claude Opus 4.7, Gemini, Llama, Mixtral, and dozens more), local models served by LM Studio, and deterministic dummy providers for testing. You can mix providers per agent and configure automatic fallbacks.
04 · Is multi-agent debate scientifically validated?
Yes. MAD Studio is built on peer-reviewed research from MIT, Google Brain, Anthropic, Tencent AI Lab, and others — including the ICML 2024 Best Paper on debate-based truthfulness, the M-MAD multi-dimensional arbiter framework, and the Mixture-of-Agents architecture that beats GPT-4 Omni on AlpacaEval. Every protocol traces back to published methodology.
05 · What can I use multi-agent debate for?
Political campaigns stress-test messaging against simulated opposition. Researchers run hypotheses through skeptical peer-review panels. Marketers debate competing campaign angles. Lawyers map adversarial arguments. Product teams institutionalize the devil's advocate. Educators make critical thinking visible. Anyone can run debates for fun — pick a topic, pick six agents, hit start.
06 · What is the Bullshit Index?
The Bullshit Index is MAD Studio's real-time fact-checking layer. Every claim made by an agent is extracted, cross-referenced against your evidence pack, the public web, and the agent's earlier turns. Hallucinated citations, drifted positions, false precision, and contradicted statements all push the meter up. It's hallucination detection built directly into the debate loop.
07 · Can I integrate MAD Studio into my own product?
Yes. MAD Studio offers a full REST API and a native Model Context Protocol (MCP) server. Spin up sessions, inject human turns, stream transcripts, and run experiments programmatically. The MCP server drops directly into Claude Desktop, Cursor, and any MCP-compatible client.
08 · What does Saga do?
Saga is MAD Studio's recursive optimization engine. It spawns hidden child sessions from a source conversation, scores each transcript against your rubric, applies the best optimizer suggestion, and re-runs — generation after generation — until the score curve flattens or a stop condition fires. It's how you find answers that no single prompt would have produced.
07 / Beta access

Be among the first to deploy a real debate engine.

Closed beta is rolling out now. Join 127 teams already on the waitlist — free early access, no card, no commitment, no spam.

By joining, you agree to receive infrequent product updates. Unsubscribe anytime.