Group chat is AutoGen's most powerful and most opaque feature. Here is a toolkit for when it goes wrong.
The Group Chat Debugging Problem
AutoGen's GroupChat is powerful: multiple agents collaborate, debate, and delegate in a shared conversation. It is also opaque. When something goes wrong -- agents looping, the wrong agent picking up a task, costs exploding, the group getting stuck -- it is hard to know why without the right tools.
This article covers the five most common group chat failure modes and exactly how to debug and fix each one.
Failure 1: Agents Loop Forever
The group chat runs, agents keep responding to each other, and the conversation never terminates. This is almost always a missing or broken termination condition.
The fix: always set max_round and a termination string
```python
import ag2

# Define the user proxy first so it can be passed into the GroupChat below
# (the original ordering referenced user_proxy before it existed).
user_proxy = ag2.UserProxyAgent(
    name="user",
    human_input_mode="NEVER",
    # Terminate when any agent says TERMINATE
    is_termination_msg=lambda msg: "TERMINATE" in msg.get("content", ""),
    max_consecutive_auto_reply=3,  # prevent any single agent from dominating
)

groupchat = ag2.GroupChat(
    agents=[assistant, critic, user_proxy],
    messages=[],
    max_round=12,  # hard cap on total turns
    speaker_selection_method="auto",
)

manager = ag2.GroupChatManager(
    groupchat=groupchat,
    llm_config={"model": "gpt-4o", "api_key": "..."},
)
```

Add explicit instructions in each agent's system prompt: 'When the task is complete, end your message with TERMINATE.' Without this instruction, agents will keep elaborating indefinitely.
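The pairing of instruction and check can be sketched as follows. The constant name, the example system message, and the `or ""` guard are illustrative additions, not AutoGen APIs:

```python
# The completion convention lives in two places: the agent is told to
# emit the marker, and the user proxy checks for it.
COMPLETION_INSTRUCTION = "When the task is complete, end your message with TERMINATE."

assistant_system_message = "You are a helpful assistant. " + COMPLETION_INSTRUCTION

def is_termination_msg(msg: dict) -> bool:
    # `or ""` guards against messages whose content is missing or None
    # (e.g. messages that carry only a tool call, no text).
    return "TERMINATE" in (msg.get("content") or "")
```

Pass this named function as `is_termination_msg=is_termination_msg` instead of the inline lambda if you want the None-safety.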
Set max_round conservatively -- lower than you think you need. If the task genuinely requires more turns, increase it incrementally. A runaway group chat with GPT-4o can accumulate significant cost in minutes.

Failure 2: The Wrong Agent Gets Selected
In auto speaker selection mode, the GroupChatManager LLM decides which agent should speak next based on the conversation context. If the LLM makes a poor selection (e.g. always picking the same agent, or picking an agent that is clearly wrong for the task), the group chat produces bad outputs.
Diagnosis
```python
# Re-create the manager, then inspect the transcript to see which
# agent was selected each round.
manager = ag2.GroupChatManager(
    groupchat=groupchat,
    llm_config={"model": "gpt-4o", "api_key": "..."},
    is_termination_msg=lambda msg: "TERMINATE" in msg.get("content", ""),
)

# Print the full conversation to trace which agent spoke each round
for message in groupchat.messages:
    print(f"[{message['role']} / {message.get('name', 'unknown')}]: "
          f"{message['content'][:200]}")
```

Fix Option 1: Use round_robin for predictable turn order
```python
groupchat = ag2.GroupChat(
    agents=[researcher, writer, reviewer],
    messages=[],
    max_round=9,
    speaker_selection_method="round_robin",  # guaranteed order, no LLM selection
)
```

Fix Option 2: Use a custom speaker selection function
```python
def custom_speaker_selector(last_speaker, groupchat):
    messages = groupchat.messages
    if not messages:
        return researcher  # always start with researcher
    last_content = messages[-1].get("content", "")
    if "RESEARCH COMPLETE" in last_content:
        return writer
    if "DRAFT COMPLETE" in last_content:
        return reviewer
    return last_speaker  # default: same agent continues

groupchat = ag2.GroupChat(
    agents=[researcher, writer, reviewer],
    messages=[],
    max_round=12,
    speaker_selection_method=custom_speaker_selector,
)
```

Failure 3: Cost Explosion
Group chat passes the full conversation history to every agent on every turn. In a 10-round group chat with 3 agents, each using GPT-4o, you are paying for the growing context window 30 times. Costs grow roughly quadratically with round count, because each new round re-sends an ever-longer history.
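A back-of-the-envelope model makes the growth concrete. The numbers below are illustrative, and the model ignores system prompts, speaker-selection calls, and completion tokens:

```python
def total_prompt_tokens(rounds: int, tokens_per_message: int) -> int:
    # Each round re-sends the entire history so far as the prompt.
    total = 0
    history = 0
    for _ in range(rounds):
        total += history
        history += tokens_per_message  # the new reply joins the history
    return total

# 10 rounds of ~500-token messages: 22,500 prompt tokens billed.
# Doubling to 20 rounds more than quadruples the bill: 95,000.
```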
Fixes
- Set max_round aggressively -- start low (6-8 rounds) and increase only if needed
- Use a cheaper model for the GroupChatManager (it only selects speakers, not the task work)
- Use a cheaper model for lower-stakes agents (e.g. critic, reviewer) -- reserve GPT-4o for the primary worker
- Summarise long tool outputs before they enter the conversation (a 5000-token search result becomes 200 tokens)
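The last fix can be sketched as a wrapper around any tool function. The head-and-tail truncation below is a cheap stand-in for a real summarisation call to a small model; both helper names are hypothetical, not AutoGen APIs:

```python
def summarise(text: str, limit: int = 800) -> str:
    # Stand-in for a cheap-model summarisation call: keep the head
    # and tail of an over-long tool output.
    if len(text) <= limit:
        return text
    half = limit // 2
    return text[:half] + "\n...[truncated]...\n" + text[-half:]

def with_summary(tool):
    # Wrap a tool so its raw output is compressed before it enters
    # the shared conversation history.
    def wrapped(*args, **kwargs):
        return summarise(str(tool(*args, **kwargs)))
    return wrapped
```

Register the wrapped function with the agent instead of the raw tool, so the full output never reaches the shared transcript.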
```python
# Cost-optimised group chat: cheap manager, mixed models
manager = ag2.GroupChatManager(
    groupchat=groupchat,
    llm_config={"model": "gpt-4o-mini", "api_key": "..."},  # cheap selector
)

researcher = ag2.AssistantAgent(
    name="researcher",
    llm_config={"model": "gpt-4o", "api_key": "..."},  # best model for research
)

reviewer = ag2.AssistantAgent(
    name="reviewer",
    llm_config={"model": "gpt-4o-mini", "api_key": "..."},  # cheaper for review
)
```

Failure 4: Agents Produce Outputs the Next Agent Cannot Use
Agent A produces a JSON blob. Agent B expects plain text. Agent C produces a numbered list when the downstream code expects a dict. The group chat 'works' but produces garbage that fails silently downstream.
The fix: enforce output contracts in system prompts
```python
researcher = ag2.AssistantAgent(
    name="researcher",
    system_message=(
        "You are a research analyst. When your research is complete, output "
        "EXACTLY this structure and nothing else:\n"
        "FINDINGS:\n"
        "- [finding 1]\n"
        "- [finding 2]\n"
        "SOURCES:\n"
        "- [url 1]\n"
        "Then write RESEARCH COMPLETE on a new line."
    ),
    llm_config={"model": "gpt-4o", "api_key": "..."},
)
```

Alternatively, use structured outputs (if your model supports them) to enforce a schema at the API level rather than relying on prompt instructions alone.
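On the receiving side, a small guard can check a reply against the contract before the next agent consumes it. This is a hypothetical helper, not an AutoGen API; it simply requires the three markers from the system prompt above to appear in order:

```python
def follows_contract(text: str) -> bool:
    # Require the contract markers in order:
    # FINDINGS, then SOURCES, then RESEARCH COMPLETE.
    markers = ["FINDINGS:", "SOURCES:", "RESEARCH COMPLETE"]
    pos = -1
    for marker in markers:
        pos = text.find(marker, pos + 1)
        if pos == -1:
            return False
    return True
```

Failing fast here turns a silent downstream failure into a visible one: reject the reply (or re-prompt the researcher) instead of passing malformed output to the writer.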
Failure 5: Human Input Mode Blocks Automation
In development, human_input_mode='ALWAYS' is useful -- it lets you step through the conversation. In production automation, it blocks execution waiting for terminal input that never comes. This is a common reason why group chats that work locally hang silently in production.
```python
# Development:
user_proxy = ag2.UserProxyAgent(
    human_input_mode="ALWAYS",  # prompts for input each turn
    ...
)

# Production:
user_proxy = ag2.UserProxyAgent(
    human_input_mode="NEVER",  # fully automated
    max_consecutive_auto_reply=5,
    is_termination_msg=lambda m: "TERMINATE" in m.get("content", ""),
    ...
)
```

Quick Reference
- Always set max_round -- start at 6-8, increase only if needed
- Always set is_termination_msg with a TERMINATE string convention
- Print groupchat.messages after a run to trace which agent was selected each round
- Use round_robin or a custom function for deterministic turn order
- Use cheaper models for GroupChatManager and lower-stakes agents
- Summarise long tool outputs before injecting them into the conversation
- Switch human_input_mode to NEVER in all production deployments