AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
Authors: Florian Grötschla, Luis Müller, Mikhail Galkin, Jan Tönshoff, Bryan Perozzi
Paper: https://arxiv.org/abs/2507.08616
Code: https://github.com/floriangroetschla/AgentsNet
Dataset: https://huggingface.co/datasets/disco-eth/AgentsNet
TL;DR
The authors introduce AgentsNet, a novel benchmark designed to evaluate the coordination and collaborative reasoning of multi-agent LLM systems. It grounds evaluation in five fundamental problems from distributed computing theory: (Δ+1)-coloring, minimal vertex cover, maximal matching, leader election, and consensus. Agents are situated within diverse network topologies (e.g., small-world, scale-free) and interact via a structured, multi-round message-passing protocol. A key innovation is its scalability: it tests systems with up to 100 agents, a significant leap from the 2-5 agents in typical benchmarks.
This work is significant because it moves beyond measuring simple task performance to rigorously assess the core competencies of decentralized coordination, self-organization, and communication. The findings reveal a critical bottleneck: while today's top LLMs perform well in small groups, their collaborative abilities collapse as the network size increases, with performance dropping to near-zero in 100-agent scenarios. AgentsNet provides a much-needed, scalable tool to diagnose these failures and guide the development of more robust, truly collaborative AI systems.
Details
A New Frontier: From Individual AI to Collective Intelligence
The rapid progress in Large Language Models (LLMs) has sparked immense interest in multi-agent systems, where multiple AI agents collaborate to solve problems beyond the scope of any single model. This paradigm mirrors human teamwork and the principles of distributed computing. While prior frameworks like GPTSwarm (https://arxiv.org/abs/2402.16823) have demonstrated the potential of structured multi-agent systems, a critical question has remained largely unanswered: how do we meaningfully evaluate whether these systems are truly collaborating or merely acting as a collection of powerful individuals? Existing benchmarks often focus on end-task performance in small groups, failing to assess the underlying mechanisms of coordination and communication, especially as complexity scales.
The paper "AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs" directly confronts this challenge by introducing a new benchmark that serves as a litmus test for collective intelligence.
A Principled Approach to Evaluating Coordination
The core of AgentsNet is its methodology, which is thoughtfully grounded in decades of research on distributed systems and graph theory. Instead of ad-hoc tasks, the benchmark is built upon five fundamental distributed computing problems, each requiring a different form of collaboration (illustrative validity checks are sketched after the list):
(Δ+1)-Coloring: Assigning roles or groups while avoiding conflicts with neighbors.
Minimal Vertex Cover: Identifying a minimal set of "monitor" agents to observe all interactions.
Maximal Matching: Forming agent pairs for resource allocation or task assignment.
Leader Election: Establishing hierarchy by converging on a single decision-maker.
Consensus: Achieving global agreement on a decision through local communication.
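To make these task definitions concrete, here is a minimal sketch of how candidate solutions to three of them could be verified on a networkx graph. The function names and the choice of networkx are illustrative assumptions, not the benchmark's actual evaluation code; for brevity, the minimality condition of the vertex cover task is omitted.

```python
import networkx as nx

def is_valid_coloring(G: nx.Graph, color: dict) -> bool:
    """(Δ+1)-coloring: adjacent agents must pick different colors,
    using at most max_degree + 1 distinct colors overall."""
    max_degree = max(dict(G.degree).values())
    if len(set(color.values())) > max_degree + 1:
        return False
    return all(color[u] != color[v] for u, v in G.edges)

def is_vertex_cover(G: nx.Graph, cover: set) -> bool:
    """Vertex cover: every edge must have at least one endpoint in the cover
    (the minimality condition of the benchmark task is not checked here)."""
    return all(u in cover or v in cover for u, v in G.edges)

def is_maximal_matching(G: nx.Graph, matching: set) -> bool:
    """Maximal matching: matched edges are disjoint, and no further edge
    could be added without sharing an endpoint with an existing one."""
    matched = [v for edge in matching for v in edge]
    if len(matched) != len(set(matched)):
        return False  # two matched edges share an agent
    return all(u in matched or v in matched for u, v in G.edges)
```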
These tasks are not arbitrary; they are classics from distributed computing theory, carefully chosen to represent a spectrum of coordination challenges. They range from local problems like Coloring (solvable in O(log* n) rounds) to global problems like Leader Election and Consensus (requiring O(D) rounds, where D is the network diameter) that necessitate network-wide information propagation.
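To make the round model behind these complexity bounds concrete, here is a minimal sketch of a synchronous message-passing loop in Python, in the spirit of the protocol described above. The `step` and `answer` methods on the agent objects are hypothetical interfaces introduced purely for illustration; this is not the benchmark's implementation.

```python
import networkx as nx

def run_synchronous_rounds(G: nx.Graph, agents: dict, num_rounds: int) -> dict:
    """Generic synchronous message-passing loop: in each round, every agent
    sees only the messages its neighbors produced in the previous round."""
    inbox = {v: [] for v in G.nodes}
    for _ in range(num_rounds):
        # Each agent turns its current inbox into one outgoing message.
        outgoing = {v: agents[v].step(inbox[v]) for v in G.nodes}
        # Messages travel only along edges of the network.
        inbox = {v: [outgoing[u] for u in G.neighbors(v)] for v in G.nodes}
    # After the final round, every agent commits to its answer.
    return {v: agents[v].answer() for v in G.nodes}
```

Because a message can only travel one edge per round in this model, global tasks such as Leader Election and Consensus need a number of rounds that grows with the network diameter D, whereas (Δ+1)-Coloring can be solved with far fewer rounds of purely local communication.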
This theoretical grounding sets AgentsNet apart from other benchmarks. The approach also mirrors classic experiments in sociology on how humans coordinate on similar graph-based tasks (https://www.science.org/doi/10.1126/science.1127207), bridging the gap between AI evaluation and research on human collective intelligence.
The Sobering Reality of Scaling Collaboration
The experimental results provide a clear-eyed view of the current state of multi-agent LLMs. While top-tier models like Gemini 2.5 Pro (achieving an overall score of 0.80) and Claude 3.7 Sonnet (0.70) demonstrate strong capabilities on small networks of 4 to 16 agents, their performance tells a different story as the system scales (Table 2).
The most striking finding is the sharp decline in performance as the number of agents increases.
In a scalability test using Gemini 2.0 Flash on networks of up to 100 agents, the success rate drops steadily to near zero (Figure 5). This suggests that current LLMs cannot yet maintain coherent global strategies under the strain of growing communication and memory demands.
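For a feel for how such a scaling study can be set up, the sketch below generates small-world and scale-free networks of increasing size with standard networkx generators. The specific generators and parameters are my own assumptions for illustration, not necessarily those used in the paper.

```python
import networkx as nx

def make_topologies(num_agents: int, seed: int = 0) -> dict:
    """Illustrative small-world and scale-free networks of a given size;
    the generator parameters here are assumptions, not the paper's values."""
    return {
        # Watts-Strogatz graph: ring of agents, each wired to 4 neighbors,
        # with 10% of edges rewired at random (small-world structure).
        "small_world": nx.watts_strogatz_graph(num_agents, k=4, p=0.1, seed=seed),
        # Barabasi-Albert graph: each new agent attaches to 2 existing hubs
        # via preferential attachment (scale-free degree distribution).
        "scale_free": nx.barabasi_albert_graph(num_agents, m=2, seed=seed),
    }

for n in (16, 50, 100):
    graphs = make_topologies(n)
    print(n, {name: G.number_of_edges() for name, G in graphs.items()})
```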
Qualitative analysis reveals the underlying reasons for these failures. In one revealing example (Appendix E), agents on a complete graph correctly agree on a strategy. However, a single agent then confidently—and incorrectly—claims the network is a "star graph." This one message causes its neighbors to abandon the correct, agreed-upon strategy, leading to a cascade of errors. This highlights a critical vulnerability: agents are often too trusting of their peers, even when presented with contradictory information.
Impact and Future Directions
AgentsNet represents a significant step forward, moving the evaluation of multi-agent systems toward more solid theoretical ground. It provides a principled, scalable, and open-source framework. The stark results challenge the optimistic assumption that scaling individual model intelligence will automatically translate into effective collective intelligence. Instead, they highlight that decentralized coordination is a fundamentally hard problem that requires dedicated solutions.
The authors acknowledge the limitations of their current approach, such as the use of a synchronous communication model and the assumption of purely cooperative agents (Section 6). These limitations naturally point toward future research directions, including exploring asynchronous protocols, studying heterogeneous agent teams, and assessing system robustness in the presence of faulty or adversarial agents.
Conclusion
This paper delivers a crucial reality check for the field of multi-agent AI. By creating a rigorous and scalable benchmark, the authors demonstrate that while the building blocks for collaborative AI are in place, the ability to orchestrate them into effective, large-scale collectives remains a major hurdle. AgentsNet is not just an evaluation tool; it is a roadmap that illuminates the key challenges—such as strategy coordination and robust decentralized reasoning—that the research community must now address to unlock the true potential of multi-agent systems. It offers a valuable contribution and a clear path forward for building more capable and intelligent collaborative AI.