Evaluating conversational AI systems powered by large language models (LLMs) presents a critical challenge in artificial intelligence. These systems must handle multi-turn dialogues, integrate domain-specific tools, and adhere to complex policy constraints, capabilities that traditional evaluation methods struggle to assess. Existing benchmarks rely on small, hand-curated datasets and coarse metrics, and fail to capture the dynamic interplay of policies, user interactions, and real-world variability. This gap limits the ability to diagnose weaknesses or optimize agents for deployment in high-stakes environments such as healthcare or finance, where reliability is non-negotiable.
Current evaluation frameworks, such as τ-bench and ALMITA, focus on narrow domains like customer support and use static, limited datasets. For example, τ-bench evaluates airline and retail chatbots but includes only 50 to 115 manually crafted samples per domain. These benchmarks prioritize end-to-end success rates, overlooking granular details like policy violations or dialogue consistency. Other tools, such as those evaluating retrieval-augmented generation (RAG) systems, lack support for multi-turn interactions. Reliance on human curation restricts scalability and diversity, leaving conversational AI evaluations incomplete and impractical for real-world demands. To address these limitations, Plurai researchers have introduced IntellAgent, an open-source multi-agent framework designed to automate the creation of diverse, policy-driven scenarios. Unlike previous methods, IntellAgent combines graph-based policy modeling, synthetic event generation, and interactive simulations to evaluate agents comprehensively.
In essence, IntellAgent uses a policy graph to model the relationships and complexities of domain-specific rules. Each node in the graph represents an individual policy (for example, “refunds must be processed within 5 to 7 days”) and carries a complexity score. Edges between nodes denote the probability that two policies co-occur in a conversation; a policy on changing flight reservations, for instance, could be linked to one on refund terms. The graph is constructed by an LLM, which extracts policies from the system prompt, ranks their difficulty, and estimates co-occurrence probabilities. This structure allows IntellAgent to generate synthetic events, as shown in Figure 4 (user requests combined with valid database states), through a weighted random walk. Starting from an initial, uniformly sampled policy, the system traverses the graph, accumulating policies until the total complexity reaches a predefined threshold. This approach ensures that events span an even distribution of complexities while maintaining realistic policy combinations.
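The weighted-random-walk step can be sketched in a few lines of Python. Note that the policy names, complexity scores, and co-occurrence weights below are invented for illustration; in the actual framework, an LLM derives all three from the system prompt.

```python
import random

# Hypothetical policy graph for an airline domain.
# Nodes: policy -> complexity score (higher = harder to satisfy).
complexity = {
    "refund_window": 2,
    "change_reservation": 3,
    "verify_identity": 4,
    "baggage_fees": 1,
}

# Edges: estimated probability that two policies co-occur in one conversation.
co_occurrence = {
    "refund_window": {"change_reservation": 0.6, "baggage_fees": 0.2},
    "change_reservation": {"refund_window": 0.6, "verify_identity": 0.5},
    "verify_identity": {"change_reservation": 0.5},
    "baggage_fees": {"refund_window": 0.2},
}

def sample_event(threshold, rng):
    """Walk the graph, accumulating policies until the total complexity
    reaches the threshold (or no unvisited neighbour remains)."""
    current = rng.choice(list(complexity))       # uniformly sampled start
    chosen, total = [current], complexity[current]
    while total < threshold:
        neighbours = {p: w for p, w in co_occurrence.get(current, {}).items()
                      if p not in chosen}
        if not neighbours:
            break
        # Weighted step: policies that co-occur more often are picked more often.
        current = rng.choices(list(neighbours),
                              weights=list(neighbours.values()))[0]
        chosen.append(current)
        total += complexity[current]
    return chosen

print(sample_event(threshold=6, rng=random.Random(0)))
```

Sweeping the threshold from low to high is what yields the even spread of event complexities the paper describes.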
Once the events are generated, IntellAgent simulates dialogues between a user agent and the chatbot under test, as shown in Figure 5. The user agent initiates requests based on the event details and monitors the chatbot's compliance with policies. If the chatbot violates a rule or completes the task, the interaction ends. A critic component then analyzes the dialogue, identifying which policies were tested and which were violated. In the airline domain, for example, the critic could flag a failure to verify the user's identity before modifying a reservation. This step produces detailed diagnostics that reveal not only overall performance but also specific weaknesses, such as difficulties with user-consent policies, a category that τ-bench overlooks.
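The interaction loop and critic pass might look roughly like the following sketch. The `ScriptedUser`, `NaiveBot`, and keyword-based `critique` are toy stand-ins invented for this example; IntellAgent's actual user agent and critic are LLM-driven.

```python
from dataclasses import dataclass, field

@dataclass
class Transcript:
    turns: list = field(default_factory=list)    # (user message, bot reply) pairs

def simulate(user_agent, chatbot, max_turns=20):
    """Run the dialogue until the user agent signals a policy violation or
    task completion, or the turn budget runs out."""
    transcript = Transcript()
    message = user_agent.open_request()          # built from the event details
    for _ in range(max_turns):
        reply = chatbot.respond(message)
        transcript.turns.append((message, reply))
        verdict = user_agent.check(reply)        # monitors policy compliance
        if verdict in ("violation", "done"):
            return transcript, verdict
        message = user_agent.next_message(reply)
    return transcript, "budget_exhausted"

def critique(transcript, rules):
    """Toy critic: a policy counts as violated when its trigger phrase appears
    in a reply without the phrase the policy requires. (The real critic is an
    LLM; keyword rules just keep this sketch self-contained.)"""
    return [policy for policy, (trigger, required) in rules.items()
            if any(trigger in r and required not in r
                   for _, r in transcript.turns)]

class ScriptedUser:
    """Stand-in user agent driven by a single scripted request."""
    def open_request(self):
        return "Please change my flight to Friday."
    def check(self, reply):
        return ("violation" if "changed" in reply and "identity" not in reply
                else "done")
    def next_message(self, reply):
        return "Thanks."

class NaiveBot:
    """Stand-in chatbot that modifies the booking without verifying identity."""
    def respond(self, message):
        return "Done, I changed your booking."

transcript, reason = simulate(ScriptedUser(), NaiveBot())
print(reason, critique(transcript, {"verify_identity": ("changed", "identity")}))
# → violation ['verify_identity']
```

The value of the critic pass is exactly this kind of output: not just "the task failed," but which policy failed and in which turn.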
To validate IntellAgent, the researchers compared its synthetic benchmarks with τ-bench using state-of-the-art LLMs such as GPT-4o, Claude-3.5, and Gemini-1.5. Despite relying entirely on automated data generation, IntellAgent achieved Pearson correlations of 0.98 (airline) and 0.92 (retail) with τ-bench's hand-curated results. More importantly, it uncovered nuanced insights: all models failed user-consent policies, and performance declined predictably as complexity increased, although degradation patterns varied across models. For example, Gemini-1.5-pro outperformed GPT-4o-mini at lower complexity levels but converged with it at higher levels. These findings highlight IntellAgent's ability to guide model selection based on specific operational needs. The framework's modular design enables seamless integration of new domains, policies, and tools, supported by an open-source implementation built on the LangGraph library.
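For readers unfamiliar with how such benchmark agreement is measured, a Pearson correlation over per-model success rates can be computed as below. The scores here are made up purely for illustration; they are not the paper's numbers.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-model success rates on the two benchmarks.
intellagent_scores = [0.62, 0.55, 0.48, 0.71]
tau_bench_scores   = [0.60, 0.57, 0.45, 0.70]
print(round(pearson(intellagent_scores, tau_bench_scores), 3))
```

A correlation near 1.0, as reported for the airline and retail domains, means the two benchmarks rank and score models almost identically, which is what justifies replacing hand-curated samples with synthetic ones.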
In conclusion, IntellAgent addresses a critical bottleneck in conversational AI development by replacing static, limited assessments with dynamic, scalable diagnostics. Its policy graph and automated event generation enable comprehensive testing across diverse scenarios, while detailed critiques point to actionable improvements. By correlating closely with existing benchmarks and exposing previously undetected weaknesses, the framework bridges the gap between research and real-world deployment. Future enhancements, such as incorporating real user interactions to refine policy graphs, could further elevate its usefulness, solidifying IntellAgent as a critical tool for developing reliable, policy-aware conversational agents.
Check out the Paper and the <a target="_blank" href="https://github.com/plurai-ai/intellagent" rel="noreferrer noopener">GitHub page</a>. All credit for this research goes to the researchers of this project.
Vineet Kumar is a Consulting Intern at MarktechPost. He is currently pursuing his bachelor's degree at the Indian Institute of Technology (IIT), Kanpur. He is a machine learning enthusiast, passionate about research and the latest advances in deep learning, computer vision, and related fields.