
How to Red-Team Ideas with Multi-Agent Debate

A practical playbook for adversarial AI deliberation — stress-test political messaging, product decisions, research claims, and legal arguments before they ship.

TL;DR

  • Red-teaming with LLMs means simulating your harshest critics before real critics get the chance — not checking for toxic outputs after the fact.
  • Single-prompt “play devil's advocate” fails because one model playing both sides collapses into agreement (Degeneration-of-Thought).
  • Multi-agent debate works when agents have different roles, evidence, and incentives — then an independent arbiter scores the result.
  • Run bounded rounds with hard cost caps; the transcript is the deliverable, not just the final verdict.

Every organization has a version of this story: a confident strategy doc, a campaign message, a research claim, or a product bet that looked bulletproof in internal review — until someone outside the room asked the one question nobody thought to ask. Red-teaming exists to manufacture that question on demand.

With LLMs, the temptation is to paste your draft into ChatGPT and ask “what's wrong with this?” That helps, but it is not adversarial deliberation. It is one model negotiating with itself. This guide covers how to red-team ideas properly using multi-agent debate — what to configure, what the research says, and where it fails.

Two kinds of “red teaming”

The term gets used loosely. In AI safety, red-teaming usually means probing a model for harmful outputs, jailbreaks, or policy violations — work that Khan et al. and others frame as debate between persuasive agents. In strategy, journalism, law, and politics, red-teaming means something different: stress-testing an idea against hostile scrutiny before you commit to it.

This guide focuses on the second kind — adversarial idea testing — though the same architecture applies to both. The goal is not a thumbs-up/thumbs-down. The goal is a structured record of objections, weak evidence, missing citations, and failure modes you can act on.

Why single-prompt devil's advocate fails

Ask one LLM to critique its own proposal and you hit three walls:

  1. Degeneration-of-Thought. Liang et al. (EMNLP 2024) showed that once a model commits to an answer, self-reflection rarely produces genuinely novel reasoning — even when the initial answer is wrong. The critic and the advocate share the same latent commitment.
  2. Agreement bias. Wynn et al. found agents shift from correct to incorrect answers to match peers, favoring social agreement over challenging flawed reasoning. A single model playing both roles drifts toward the path of least resistance.
  3. No auditable adversary. You get a bullet list of concerns with no record of which objections survived scrutiny, which were rebutted, and which the arbiter weighted most heavily. That is brainstorming, not deliberation.

Multi-agent debate fixes this by separating agents into distinct roles with separate context — each agent defends or attacks from a defined perspective, across multiple rounds, before a verdict.
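As a rough illustration of that separation, the sketch below gives each agent its own role, system prompt, and private context, and runs a fixed number of rounds over a shared evidence string. Every name here (`Agent`, `run_debate`, `echo_responder`) is hypothetical, and the responder is a stand-in for whatever model call you actually use; it is not the API of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    name: str
    side: str            # "advocate" or "adversary"
    system_prompt: str   # distinct per agent, never shared
    context: list[str] = field(default_factory=list)  # private to this agent


# Hypothetical signature for a model call: (agent, evidence, public turns) -> reply
Responder = Callable[[Agent, str, list[str]], str]


def run_debate(agents: list[Agent], evidence: str, respond: Responder,
               max_rounds: int = 3) -> list[dict]:
    """Run a bounded number of rounds with a fixed turn order; return the transcript."""
    transcript: list[dict] = []
    for round_no in range(1, max_rounds + 1):
        for agent in agents:
            # Each agent sees the shared evidence and the public transcript so far.
            public = [f"{t['agent']}: {t['text']}" for t in transcript]
            text = respond(agent, evidence, public)
            agent.context.append(text)  # the agent's own record of its turns
            transcript.append({"round": round_no, "agent": agent.name,
                               "side": agent.side, "text": text})
    return transcript


def echo_responder(agent: Agent, evidence: str, public: list[str]) -> str:
    """Deterministic placeholder so the sketch runs without a model."""
    stance = "defends" if agent.side == "advocate" else "attacks"
    return f"{agent.name} {stance} the claim (turn {len(public) + 1})"


agents = [
    Agent("A", "advocate", "Defend the proposal using only the evidence pack."),
    Agent("B", "adversary", "Find the weakest claim in the proposal."),
]
transcript = run_debate(agents, "shared evidence pack", echo_responder, max_rounds=2)
```

Swapping `echo_responder` for a real model call is the only change needed to turn this into a working loop; the bounded `max_rounds` and the returned transcript are the parts worth keeping.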

The red-team debate architecture

A production-grade red-team run has five components. Skip any one and you are back to expensive theater.

| Component | Role | Common mistake |
|---|---|---|
| Subject matter | The claim, draft, or decision under test, with a shared evidence pack | Vague prompts with no shared document for agents to cite |
| Advocate team | Agents defending the proposal with assigned personas | All agents given identical system prompts |
| Adversary team | Agents assigned to attack: skeptic, competitor, regulator, journalist | Adversaries that are too polite or too generic |
| Debate protocol | Bounded rounds with structured turn order and rebuttal rules | Open-ended chat with no turn limits or cost caps |
| Independent arbiter | Separate scoring pass on fixed dimensions; not a debater | Letting the strongest debater declare victory |

Feng et al. (M-MAD, ACL 2025) formalized the arbiter side: instead of one holistic “who won?” judgment, run independent passes on correctness, evidence use, responsiveness to counterarguments, calibration, and citation quality. A claim can lose on evidence while winning on rhetoric — and you want to know that before publication, not after.
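A minimal sketch of how per-dimension arbiter scores might be summarized, assuming each dimension is scored independently on a 0 to 1 scale. The scale, the 0.5 threshold, and the names `DIMENSIONS` and `summarize_scores` are illustrative assumptions, not part of M-MAD or any product API.

```python
from statistics import mean

# The five dimensions named above; the snake_case keys are an assumed naming.
DIMENSIONS = ("correctness", "evidence_use", "responsiveness",
              "calibration", "citation_quality")


def summarize_scores(scores: dict[str, dict[str, float]],
                     threshold: float = 0.5) -> dict[str, dict]:
    """Report the mean and the weak dimensions for each claim.

    `scores` maps claim id -> {dimension: score in [0, 1]}.
    """
    report = {}
    for claim, dims in scores.items():
        missing = [d for d in DIMENSIONS if d not in dims]
        if missing:
            raise ValueError(f"{claim} is missing dimensions: {missing}")
        weak = [d for d in DIMENSIONS if dims[d] < threshold]
        report[claim] = {
            "mean": round(mean(dims[d] for d in DIMENSIONS), 3),
            "weak_dimensions": weak,  # kept separate so the mean cannot hide them
        }
    return report


report = summarize_scores({
    "claim_2": {"correctness": 0.9, "evidence_use": 0.3, "responsiveness": 0.8,
                "calibration": 0.7, "citation_quality": 0.2},
})
```

Here `claim_2` has a passable average but is flagged on `evidence_use` and `citation_quality`, which is the kind of split a single holistic score would hide.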

Five red-team scenarios that work

1. Political and campaign messaging

Configure opposition agents seeded with personas across the political spectrum — not caricatures, but plausible voters and commentators with specific priors. Run Team Discussion in battle mode: one team defends the message, one team attacks it with the strongest available rebuttals. The deliverable is a list of claims that survived cross-examination and a list that did not.

What you are looking for: lines that sound persuasive in isolation but collapse when an agent with different values engages. That is the question a hostile interviewer or opponent will ask on camera.

2. Product and strategy decisions

Assign agents to competitor personas, customer archetypes, and internal risk lenses (legal, security, finance). Start with an Open Discussion to surface objections, then run a Truth-Seeking Debate to pressure-test the top three risks. Du et al. found debate particularly strong on factuality: agents drop uncertain claims when peers challenge them, which is exactly what you want in a roadmap review.

3. Research claims and pre-print review

Simulate peer review before submission. Configure agents with field-specific skepticism: methodology critic, replication skeptic, statistics reviewer, domain expert with opposing priors. The arbiter scores each major claim on evidence and citation quality. Gaps become explicit: “Claim 2 rated low on citation quality — no primary source cited for the effect size.”

4. Legal and due diligence

Map prosecutor/defense argument trees over a shared evidence pack. SWE-Debate showed that competitive multi-agent debate helps when a problem spans multiple parts of a codebase, and the same logic applies to multi-document legal arguments where fault propagation matters. Each agent traces a different line of reasoning, and the debate consolidates them into a fix plan (or, in legal terms, a claims ledger).

5. Investigative journalism pre-flight

Before publication, simulate the most aggressive defense your subject could mount. Configure a subject advocate with the best available counter-narrative, a legal counsel agent focused on defamation risk, and a skeptical editor agent. If the story survives that panel, you have done more diligence than most newsrooms can afford on deadline.

How to configure agents that actually disagree

ChatEval (ICLR 2024) found that diverse role prompts are essential — identical personas degrade multi-agent performance. For red-teaming specifically:

  • Give adversaries real incentives in the prompt. “Find the weakest claim” beats “offer constructive feedback.”
  • Mix models if you can. Heterogeneous agents surface blind spots homogeneous pools miss — though Liang et al. note LLMs may not be fair judges when debaters use different models. Keep the arbiter separate and consistent.
  • Share the evidence, not the conclusion. All agents read the same source document; none start with a pre-written verdict.
  • Cap rounds and cost. GroupDebate shows token cost scales brutally with agents × rounds. Red-teaming should have hard ceilings — stop when the arbiter stabilizes or you hit your budget.
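To make the cost point concrete, here is a back-of-the-envelope token estimate. It assumes a simple setup in which every agent rereads the evidence pack and the full transcript on each turn; that assumption, the function names, and the numbers are illustrative, not taken from GroupDebate.

```python
def estimate_tokens(n_agents: int, n_rounds: int,
                    evidence_tokens: int, turn_tokens: int) -> int:
    """Estimate total tokens when each turn rereads evidence plus all prior turns."""
    total = 0
    prior_turns = 0
    for _ in range(n_rounds):
        for _ in range(n_agents):
            total += evidence_tokens + prior_turns * turn_tokens  # input side
            total += turn_tokens                                  # output side
            prior_turns += 1
    return total


def max_rounds_within_budget(n_agents: int, evidence_tokens: int, turn_tokens: int,
                             budget_tokens: int, hard_cap: int = 10) -> int:
    """Largest round count (up to hard_cap) whose estimate fits the budget."""
    rounds = 0
    while (rounds < hard_cap and
           estimate_tokens(n_agents, rounds + 1, evidence_tokens, turn_tokens)
           <= budget_tokens):
        rounds += 1
    return rounds


one_round = estimate_tokens(n_agents=2, n_rounds=1, evidence_tokens=100, turn_tokens=10)
two_rounds = estimate_tokens(n_agents=2, n_rounds=2, evidence_tokens=100, turn_tokens=10)
affordable = max_rounds_within_budget(2, 100, 10, budget_tokens=400)
```

Under this model the transcript term grows with the square of the total number of turns, so doubling the rounds more than doubles the bill. Running the estimate before the debate starts gives you the hard ceiling to set.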

Warning: debate can make things worse

If adversaries are too weak, too polite, or outnumbered by advocates, debate becomes rubber-stamping. If agents are homogeneous, Wynn et al. show accuracy can decrease as agents conform. Calibrate adversary strength and use adaptive stopping (Hu et al., NeurIPS 2025) so you do not keep debating after consensus is stable.
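One simple way to approximate a stability check is to stop once the arbiter's score has moved very little over the last few rounds. The sketch below is a stand-in for illustration only: it is not the method from Hu et al., and the window size, tolerance, and function names are assumptions to tune.

```python
def is_stable(history: list[float], window: int = 3, tolerance: float = 0.05) -> bool:
    """True when the last `window` scores differ by at most `tolerance`."""
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) <= tolerance


def rounds_to_run(score_per_round: list[float], max_rounds: int,
                  window: int = 3, tolerance: float = 0.05) -> int:
    """Return the round at which the debate stops: stable verdict or hard cap."""
    history: list[float] = []
    for round_no, score in enumerate(score_per_round, start=1):
        history.append(score)
        if is_stable(history, window, tolerance) or round_no >= max_rounds:
            return round_no
    return len(history)


# Hypothetical arbiter scores for one claim, one score per round.
scores = [0.2, 0.5, 0.61, 0.6, 0.62, 0.9]
stopped_at = rounds_to_run(scores, max_rounds=6)
```

In this example the run stops at round 5, before the late swing in round 6 is ever generated. That is the trade-off of any stopping rule: a wider window costs more rounds but is less likely to stop on a temporary plateau.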

What a good red-team output looks like

Do not optimize for a binary pass/fail. A useful red-team run produces:

  • A claims ledger — every material assertion extracted and tracked across rounds
  • Objection log — which counterarguments were raised, which were rebutted, which stand
  • Dimension scores — M-MAD-style breakdown, not a single gut-check score
  • Citation gaps — claims marked unsupported or weakly sourced
  • Full transcript — auditable record for compliance, editorial, or legal review

The transcript is often more valuable than the verdict. It shows why a claim failed — the reasoning chain a human reviewer can follow and challenge.
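The claims ledger and objection log can be kept as plain records. The sketch below shows one possible shape, with hypothetical class and field names; it treats any objection that has not been rebutted as still standing, which is a deliberately conservative assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    RAISED = "raised"
    REBUTTED = "rebutted"
    STANDS = "stands"


@dataclass
class Objection:
    claim_id: str
    raised_by: str
    round_no: int
    text: str
    status: Status = Status.RAISED


@dataclass
class Claim:
    claim_id: str
    text: str
    sources: list[str] = field(default_factory=list)


@dataclass
class Ledger:
    claims: dict[str, Claim] = field(default_factory=dict)
    objections: list[Objection] = field(default_factory=list)

    def add_claim(self, claim_id: str, text: str, sources=()) -> None:
        self.claims[claim_id] = Claim(claim_id, text, list(sources))

    def raise_objection(self, claim_id: str, raised_by: str,
                        round_no: int, text: str) -> Objection:
        if claim_id not in self.claims:
            raise KeyError(claim_id)
        objection = Objection(claim_id, raised_by, round_no, text)
        self.objections.append(objection)
        return objection

    def standing(self, claim_id: str) -> list[Objection]:
        """Objections to a claim that have not been rebutted."""
        return [o for o in self.objections
                if o.claim_id == claim_id and o.status is not Status.REBUTTED]

    def citation_gaps(self) -> list[str]:
        """Claims with no source attached."""
        return [c.claim_id for c in self.claims.values() if not c.sources]

    def survived(self) -> list[str]:
        """Claims with no unrebutted objection."""
        return [cid for cid in self.claims if not self.standing(cid)]


ledger = Ledger()
ledger.add_claim("C1", "Feature X reduces churn.", ["retention report, p. 3"])
ledger.add_claim("C2", "Competitors cannot match this within a year.")
o1 = ledger.raise_objection("C1", "skeptic", 1, "Sample excludes trial users.")
o2 = ledger.raise_objection("C2", "competitor", 1, "A rival already ships this.")
o1.status = Status.REBUTTED
o2.status = Status.STANDS
```

After this run, `ledger.survived()` lists only `C1` and `ledger.citation_gaps()` lists `C2`, which gives a reviewer the two lists to act on without rereading the transcript.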

Red-team debate vs other LLM checks

| Method | Best for | Weakness |
|---|---|---|
| Single-prompt critique | Quick sanity check, typos, obvious gaps | Degeneration-of-Thought (DoT), agreement bias, no audit trail |
| Self-consistency voting | Discrete answers (math, classification) | No adversarial scrutiny; shared blind spots |
| RAG fact-check | Verifying claims against a corpus | Corpus-limited; no argument structure |
| Multi-agent red-team debate | Strategy, messaging, research, legal arguments | Higher cost; needs protocol tuning |

For a deeper comparison of debate vs self-consistency on reasoning tasks, see our guide on when to use multi-agent debate vs self-consistency.

Running red-team debate in MAD Studio

MAD Studio is built for exactly this workflow. Configure adversary and advocate Workers with distinct personas and playbooks, snapshotted into a Team Discussion (battle mode) or Truth-Seeking Debate run. Set cost and turn caps before you start. Inject human guidance mid-run if an agent misses an obvious line of attack. When the run finishes, review the M-MAD dimension scores and export the full transcript.

Saga and Lab Experiments help tune adversary strength — sweep temperature and persona variants in hidden child runs until the red team reliably surfaces the objections your human reviewers would raise.

Stress-test your next idea before it ships

MAD Studio runs structured adversarial debate with M-MAD arbiter scoring, 2–100 agents, and full transcript persistence. Join the beta waitlist to red-team without building the orchestration yourself.
