Internalizing Self-Consistency in Language Models through Multi-Agent Debate
Paper: arXiv:2509.15172
MACA (Multi-Agent Consensus Alignment) trains language models to be more consistent reasoners through multi-agent debate and consensus-based reinforcement learning.
Key Features:
- Multi-Agent Debate: Orchestrate debates between agents for improved reasoning (a minimal sketch follows this list)
- Consensus Training: Post-train on debate outputs using agreement patterns as rewards
- Distributed Processing: Multi-GPU parallel training with QLoRA adapters
- Analysis Tools: Built-in performance tracking and visualization
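To make the debate-and-consensus loop concrete, here is a minimal, self-contained sketch in plain Python. The names (`Agent`, `run_debate`, `majority_vote`) and the prompt format are illustrative assumptions, not the repository's actual API; the real orchestration lives in debate.py and model.py.

```python
# Illustrative sketch only: Agent, run_debate, and majority_vote are hypothetical
# names and do not mirror the actual implementation in debate.py / model.py.
from collections import Counter
from typing import Callable, List, Tuple

Agent = Callable[[str], str]  # an agent maps a prompt to an answer string

def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of agents that agree."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

def run_debate(agents: List[Agent], question: str, rounds: int = 2) -> Tuple[str, float]:
    """Each round, agents answer after seeing the other agents' previous answers."""
    answers = [agent(question) for agent in agents]
    for _ in range(rounds - 1):
        context = "\n".join(f"Agent {i}: {a}" for i, a in enumerate(answers))
        prompt = f"{question}\n\nOther agents answered:\n{context}\n\nGive your final answer."
        answers = [agent(prompt) for agent in agents]
    # The consensus answer and agreement rate can serve as the training signal.
    return majority_vote(answers)

if __name__ == "__main__":
    # Three toy "agents" that return canned answers, for demonstration only.
    toy_agents = [lambda _: "42", lambda _: "42", lambda _: "41"]
    consensus, agreement = run_debate(toy_agents, "What is 6 * 7?")
    print(consensus, agreement)  # -> "42", ~0.67
```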
Installation:

conda env create -f env.yml
pip install -e .

Multi-Agent Training (main.py):

python main.py --model qwen2b --dataset gsm8k --gpus_per_model 1 --max_concurrent_tasks 4 --train_size 1500 --test_size 500 --lora_r 128 --lora_alpha 128 --dpo --epoch_dpo 3 --batch_dpo 6 --lr_dpo 1e-5 --beta_dpo 0.1 --gradient_accumulation_steps_dpo 4 --seed 1 --wandb

Single-Agent Training (maca_single_agent.py):

python maca_single_agent.py --output_dir q2b_sa_runs --model qwen2b --phase kto --kto --train_datasets math gsm8k mathqa --test_datasets math gsm8k mathqa svamp gpqa csqa --use_full_test --lora_r_range 64 --lora_alpha_range 64 --lr_kto 1e-5 --evaluation_batch_size 24 --wandb

Key Arguments:
- --model: Quantized base model to use (llama1b/3b/8b, phi4b, qwen2b/7b, gemma4b, mistral7b)
- --dataset: Dataset (gsm8k, math, mathqa, gpqa, svamp, csqa)
- --agents: Number of agents in the debate (default: 3)
- --finetune: Enable Majority Vote Supervised Fine-Tuning (MV-SFT)
- --post_train: Enable Majority Vote Group-Relative Policy Optimization (MV-GRPO) training
- --dpo: Enable Majority Vote Direct Preference Optimization (MV-DPO) training
- --kto: Enable Majority Vote Kahneman-Tversky Optimization (MV-KTO) training
- --use_consensus_reward: Enable consensus-based rewards (a sketch of such a reward follows this list)
- --wandb: Enable Weights & Biases logging
- --project_name: W&B project name (default: llm-marl; requires --wandb)
- --entity_name: W&B entity/team name (default: llm-marl; requires --wandb)
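As context for --use_consensus_reward, the sketch below shows one way an agreement-based reward could be computed over a group of sampled completions. `consensus_reward` and the toy answer extractor are hypothetical names for illustration; the actual reward functions are defined in model.py.

```python
# Hedged sketch: reward each completion by how strongly its parsed answer agrees
# with the group, so majority-aligned completions receive the highest reward.
from collections import Counter
from typing import Callable, List

def consensus_reward(completions: List[str], extract_answer: Callable[[str], str]) -> List[float]:
    answers = [extract_answer(c) for c in completions]
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]

# Example: three completions agree on "7", one says "8".
print(consensus_reward(["... 7", "... 7", "... 7", "... 8"], lambda c: c.split()[-1]))
# -> [0.75, 0.75, 0.75, 0.25]
```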
Project Structure:

maca/
├── main.py                              # Main training entry point
├── maca_single_agent.py                 # Single-agent hyperparameter tuning and testing
├── model.py                             # Agent implementation and reward functions
├── debate.py                            # Multi-agent debate orchestration
├── orchestrator.py                      # Training coordination and management
├── data.py                              # Dataset loading and preprocessing
├── parser.py                            # Answer parsing and grading utilities
├── args.py                              # Command-line argument definitions
├── utils.py                             # Utility functions and helpers
├── scheduler.py                         # Dynamic job scheduling for adapters
├── train_agent_subprocess.py            # Subprocess training management
├── analyze_experiment_performance.py    # Debate results analysis
├── read_debate_performance.py           # Debate results reading utilities
├── data/                                # Dataset storage and splits
├── experiments/                         # Experiment outputs and results
└── checkpoints/                         # Model checkpoints and adapters
Built on Hugging Face TRL, MACA supports multiple post-training paradigms with majority-vote (MV) variants (a sketch of the consensus-to-preference-data mapping follows this list):
- MV-SFT: Supervised fine-tuning on consensus examples
- MV-GRPO: Reinforcement learning with consensus rewards
- MV-KTO/DPO: Preference optimization methods
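As referenced above, here is a hedged sketch of how debate outputs could be turned into MV-DPO preference pairs: completions whose final answers match the majority become "chosen", dissenting completions become "rejected". The record layout and `build_preference_pairs` are illustrative assumptions, not the repository's actual pipeline; the prompt/chosen/rejected schema is the standard preference format consumed by TRL's DPOTrainer.

```python
# Illustrative sketch only: build_preference_pairs() and the input record shape
# are assumptions for demonstration, not the repository's actual data pipeline.
from collections import Counter
from typing import Dict, List

def build_preference_pairs(prompt: str, agent_outputs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """agent_outputs: [{"final_answer": "42", "completion": "...full reasoning..."}, ...]"""
    majority, _ = Counter(o["final_answer"] for o in agent_outputs).most_common(1)[0]
    chosen = [o["completion"] for o in agent_outputs if o["final_answer"] == majority]
    rejected = [o["completion"] for o in agent_outputs if o["final_answer"] != majority]
    # Pair each majority-aligned completion with each dissenting completion.
    return [{"prompt": prompt, "chosen": c, "rejected": r} for c in chosen for r in rejected]

# The prompt/chosen/rejected records match the paired-preference format that TRL's
# DPOTrainer expects (KTOTrainer instead takes unpaired prompt/completion/label
# records), e.g. datasets.Dataset.from_list(pairs) can be passed as train_dataset.
```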
See args.py for complete argument documentation.
This work was developed at Meta AI in collaboration with Meta Superintelligence Labs and the LIINC Lab at Columbia University.
If you use this framework in your research, please cite:
@misc{samanta2025maca,
  title={Internalizing Self-Consistency in Language Models: Multi-Agent Consensus Alignment},
  author={Ankur Samanta and Akshayaa Magesh and Youliang Yu and Runzhe Wu and Ayush Jain and Daniel Jiang and Boris Vidolov and Paul Sajda and Yonathan Efroni and Kaveh Hassani},
  year={2025},
  eprint={2509.15172},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://doi.org/10.48550/arXiv.2509.15172}
}

MACA is MIT licensed, as found in the LICENSE file.
