TL;DR
- Red-teaming with LLMs means simulating your harshest critics before real critics get the chance — not checking for toxic outputs after the fact.
- Single-prompt “play devil's advocate” fails because one model playing both sides collapses into agreement (Degeneration-of-Thought).
- Multi-agent debate works when agents have different roles, evidence, and incentives — then an independent arbiter scores the result.
- Run bounded rounds with hard cost caps; the transcript is the deliverable, not just the final verdict.
Every organization has a version of this story: a confident strategy doc, a campaign message, a research claim, or a product bet that looked bulletproof in internal review — until someone outside the room asked the one question nobody thought to ask. Red-teaming exists to manufacture that question on demand.
With LLMs, the temptation is to paste your draft into ChatGPT and ask “what's wrong with this?” That helps, but it is not adversarial deliberation. It is one model negotiating with itself. This guide covers how to red-team ideas properly using multi-agent debate — what to configure, what the research says, and where it fails.
Two kinds of “red teaming”
The term gets used loosely. In AI safety, red-teaming usually means probing a model for harmful outputs, jailbreaks, or policy violations — work that Khan et al. and others frame as debate between persuasive agents. In strategy, journalism, law, and politics, red-teaming means something different: stress-testing an idea against hostile scrutiny before you commit to it.
This guide focuses on the second kind — adversarial idea testing — though the same architecture applies to both. The goal is not a thumbs-up/thumbs-down. The goal is a structured record of objections, weak evidence, missing citations, and failure modes you can act on.
Why single-prompt devil's advocate fails
Ask one LLM to critique its own proposal and you hit three walls:
- Degeneration-of-Thought. Liang et al. (EMNLP 2024) showed that once a model commits to an answer, self-reflection rarely produces genuinely novel reasoning — even when the initial answer is wrong. The critic and the advocate share the same latent commitment.
- Agreement bias. Wynn et al. found agents shift from correct to incorrect answers to match peers, favoring social agreement over challenging flawed reasoning. A single model playing both roles drifts toward the path of least resistance.
- No auditable adversary. You get a bullet list of concerns with no record of which objections survived scrutiny, which were rebutted, and which the arbiter weighted most heavily. That is brainstorming, not deliberation.
Multi-agent debate fixes this by separating agents into distinct roles with separate context — each agent defends or attacks from a defined perspective, across multiple rounds, before a verdict.
The red-team debate architecture
A production-grade red-team run has five components. Skip any one and you are back to expensive theater.
| Component | Role | Common mistake |
|---|---|---|
| Subject matter | The claim, draft, or decision under test — shared evidence pack | Vague prompts with no shared document for agents to cite |
| Advocate team | Agents defending the proposal with assigned personas | All agents given identical system prompts |
| Adversary team | Agents assigned to attack — skeptic, competitor, regulator, journalist | Adversaries that are too polite or too generic |
| Debate protocol | Bounded rounds with structured turn order and rebuttal rules | Open-ended chat with no turn limits or cost caps |
| Independent arbiter | Separate scoring pass on fixed dimensions — not a debater | Letting the strongest debater declare victory |
Feng et al. (M-MAD, ACL 2025) formalized the arbiter side: instead of one holistic “who won?” judgment, run independent passes on correctness, evidence use, responsiveness to counterarguments, calibration, and citation quality. A claim can lose on evidence while winning on rhetoric — and you want to know that before publication, not after.
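As a rough illustration of dimension-by-dimension scoring, the sketch below runs one independent pass per dimension and reports the weak ones separately. The dimension names come from the paragraph above. The `score_dimension` callable, the 0 to 1 scale, and the 0.5 threshold are assumptions made for this example, not M-MAD's actual interface.

```python
from typing import Callable, Dict, List

# Dimension names taken from the paragraph above.
DIMENSIONS = [
    "correctness",
    "evidence_use",
    "responsiveness",
    "calibration",
    "citation_quality",
]


def score_claim(
    claim: str,
    transcript: str,
    score_dimension: Callable[[str, str, str], float],
) -> Dict[str, float]:
    """Run one independent scoring pass per dimension.

    `score_dimension(dimension, claim, transcript)` is a hypothetical
    callable (for example, a wrapper around an arbiter model) that
    returns a score in [0, 1]. Each pass is a separate call, so a
    strong score on one axis is never averaged into another.
    """
    return {dim: score_dimension(dim, claim, transcript) for dim in DIMENSIONS}


def weak_dimensions(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the dimensions scoring below the (assumed) threshold."""
    return [dim for dim, value in scores.items() if value < threshold]
```

Keeping the per-dimension scores as a dictionary, rather than collapsing them into one number, is what lets a reviewer see a claim that reads well but is poorly sourced.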
Five red-team scenarios that work
1. Political and campaign messaging
Configure opposition agents seeded with personas across the political spectrum — not caricatures, but plausible voters and commentators with specific priors. Run Team Discussion in battle mode: one team defends the message, one team attacks it with the strongest available rebuttals. The deliverable is a list of claims that survived cross-examination and a list that did not.
What you are looking for: lines that sound persuasive in isolation but collapse when an agent with different values engages. That is the question a hostile interviewer or opponent will ask on camera.
2. Product and strategy decisions
Assign agents to competitor personas, customer archetypes, and internal risk lenses (legal, security, finance). Start with Open Discussion to surface objections, then switch to Truth-Seeking Debate to pressure-test the top three risks. Du et al. found debate particularly strong on factuality: agents drop uncertain claims when peers challenge them, which is exactly what you want in a roadmap review.
3. Research claims and pre-print review
Simulate peer review before submission. Configure agents with field-specific skepticism: methodology critic, replication skeptic, statistics reviewer, domain expert with opposing priors. The arbiter scores each major claim on evidence and citation quality. Gaps become explicit: “Claim 2 rated low on citation quality — no primary source cited for the effect size.”
4. Legal and due diligence
Map prosecutor/defense argument trees over a shared evidence pack. SWE-Debate showed that competitive multi-agent debate helps when a problem spans multiple parts of a codebase, and the same logic applies to multi-document legal arguments where fault propagation matters. Each agent traces a different line of reasoning, and the debate consolidates them into a single fix plan (in legal terms, a claims ledger).
5. Investigative journalism pre-flight
Before publication, simulate the most aggressive defense your subject could mount. Configure a subject advocate with the best available counter-narrative, a legal counsel agent focused on defamation risk, and a skeptical editor agent. If the story survives that panel, you have done more diligence than most newsrooms can afford on deadline.
How to configure agents that actually disagree
ChatEval (ICLR 2024) found that diverse role prompts are essential — identical personas degrade multi-agent performance. For red-teaming specifically:
- Give adversaries real incentives in the prompt. “Find the weakest claim” beats “offer constructive feedback.”
- Mix models if you can. Heterogeneous agents surface blind spots homogeneous pools miss — though Liang et al. note LLMs may not be fair judges when debaters use different models. Keep the arbiter separate and consistent.
- Share the evidence, not the conclusion. All agents read the same source document; none start with a pre-written verdict.
- Cap rounds and cost. GroupDebate shows that token cost grows steeply as agents and rounds multiply. Red-teaming should have hard ceilings: stop when the arbiter's verdict stabilizes or you hit your budget.
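The round and cost caps from the list above can be sketched as a small loop. This is a minimal sketch under stated assumptions: `take_turn` and `judge` are hypothetical callables standing in for agent and arbiter model calls, and the default budget numbers are placeholders, not recommendations.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class DebateBudget:
    max_rounds: int = 4         # hard ceiling on rounds (placeholder value)
    max_tokens: int = 200_000   # hard ceiling on total tokens (placeholder value)
    stable_rounds: int = 2      # stop once the verdict repeats this many times


@dataclass
class DebateResult:
    transcript: List[str] = field(default_factory=list)
    verdicts: List[str] = field(default_factory=list)
    tokens_used: int = 0
    stop_reason: str = ""


def run_debate(
    agents: List[str],
    take_turn: Callable[[str, List[str]], Tuple[str, int]],
    judge: Callable[[List[str]], str],
    budget: DebateBudget,
) -> DebateResult:
    """Run bounded rounds; stop on token cap, stable verdict, or round cap.

    `take_turn(agent, transcript)` returns (message, tokens_spent).
    `judge(transcript)` returns the arbiter's current verdict label.
    The token cap is checked after each turn, so usage can exceed the
    cap by at most one turn's worth of tokens.
    """
    result = DebateResult()
    for _ in range(budget.max_rounds):
        for agent in agents:
            message, tokens = take_turn(agent, result.transcript)
            result.transcript.append(f"{agent}: {message}")
            result.tokens_used += tokens
            if result.tokens_used >= budget.max_tokens:
                result.stop_reason = "token_budget"
                return result
        # The arbiter scores once per completed round.
        result.verdicts.append(judge(result.transcript))
        recent = result.verdicts[-budget.stable_rounds:]
        if len(recent) == budget.stable_rounds and len(set(recent)) == 1:
            result.stop_reason = "arbiter_stable"
            return result
    result.stop_reason = "max_rounds"
    return result
```

Recording `stop_reason` alongside the transcript makes it clear afterward whether a run ended because the verdict settled or because the budget ran out, which are very different signals.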
Warning: debate can make things worse
If adversaries are too weak, too polite, or outnumbered by advocates, debate becomes rubber-stamping. If agents are homogeneous, Wynn et al. show accuracy can decrease as agents conform. Calibrate adversary strength and use adaptive stopping (Hu et al., NeurIPS 2025) so you do not keep debating after consensus is stable.
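Two of the failure conditions above (outnumbered adversaries and homogeneous agents) can be caught before a run starts. The check below is a hypothetical pre-flight lint, assuming a simple `Agent` record with a `side` and a `system_prompt`. It cannot detect adversaries that are merely too polite; that still needs a human read of the prompts or a calibration run.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Agent:
    name: str
    side: str            # "advocate" or "adversary" (assumed labels)
    system_prompt: str


def lint_panel(agents: List[Agent]) -> List[str]:
    """Return warnings for panel setups likely to rubber-stamp."""
    warnings = []
    advocates = [a for a in agents if a.side == "advocate"]
    adversaries = [a for a in agents if a.side == "adversary"]

    if not adversaries:
        warnings.append("no adversaries configured")
    elif len(adversaries) < len(advocates):
        warnings.append("adversaries are outnumbered by advocates")

    # Identical prompts (ignoring case and surrounding whitespace)
    # mean the agents are effectively the same persona.
    prompts = [a.system_prompt.strip().lower() for a in agents]
    if len(set(prompts)) < len(prompts):
        warnings.append("two or more agents share an identical system prompt")

    return warnings
```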
What a good red-team output looks like
Do not optimize for a binary pass/fail. A useful red-team run produces:
- A claims ledger: every material assertion extracted and tracked across rounds
- An objection log: which counterarguments were raised, which were rebutted, which still stand
- Dimension scores: an M-MAD-style breakdown, not a single gut-check score
- Citation gaps: claims marked unsupported or weakly sourced
- A full transcript: an auditable record for compliance, editorial, or legal review
The transcript is often more valuable than the verdict. It shows why a claim failed — the reasoning chain a human reviewer can follow and challenge.
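One possible shape for the claims ledger and objection log is sketched below. The field names and the two objection statuses are assumptions chosen for illustration; they are not a MAD Studio schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Objection:
    raised_by: str
    text: str
    round_raised: int
    status: str = "standing"   # "standing" or "rebutted" (assumed labels)


@dataclass
class Claim:
    claim_id: str
    text: str
    sources: List[str] = field(default_factory=list)
    objections: List[Objection] = field(default_factory=list)


def summarize(ledger: List[Claim]) -> Dict[str, List[str]]:
    """Group claim IDs into the buckets a reviewer acts on."""
    return {
        # Claims with no source attached at all.
        "citation_gaps": [c.claim_id for c in ledger if not c.sources],
        # Claims with at least one objection nobody rebutted.
        "contested": [
            c.claim_id
            for c in ledger
            if any(o.status == "standing" for o in c.objections)
        ],
        # Claims that were challenged and had every objection rebutted.
        "survived": [
            c.claim_id
            for c in ledger
            if c.objections and all(o.status == "rebutted" for o in c.objections)
        ],
    }
```

A claim that was never challenged lands in neither "contested" nor "survived", which is deliberate: an untested claim has not earned a pass.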
Red-team debate vs other LLM checks
| Method | Best for | Weakness |
|---|---|---|
| Single-prompt critique | Quick sanity check, typos, obvious gaps | DoT, agreement bias, no audit trail |
| Self-consistency voting | Discrete answers (math, classification) | No adversarial scrutiny; shared blind spots |
| RAG fact-check | Verifying claims against a corpus | Corpus-limited; no argument structure |
| Multi-agent red-team debate | Strategy, messaging, research, legal arguments | Higher cost; needs protocol tuning |
For a deeper comparison of debate vs self-consistency on reasoning tasks, see our guide on when to use multi-agent debate vs self-consistency.
Running red-team debate in MAD Studio
MAD Studio is built for exactly this workflow. Configure adversary and advocate Workers with distinct personas and playbooks, snapshotted into a Team Discussion (battle mode) or Truth-Seeking Debate run. Set cost and turn caps before you start. Inject human guidance mid-run if an agent misses an obvious line of attack. When the run finishes, review the M-MAD dimension scores and export the full transcript.
Saga and Lab Experiments help tune adversary strength — sweep temperature and persona variants in hidden child runs until the red team reliably surfaces the objections your human reviewers would raise.
Stress-test your next idea before it ships
MAD Studio runs structured adversarial debate with M-MAD arbiter scoring, 2–100 agents, and full transcript persistence. Join the beta waitlist to red-team without building the orchestration yourself.