Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Whereas previous work focused on evaluation over stateless web services (RESTful APIs), based on a single-turn user message or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models exhibit a significant performance gap, and that complex tasks defined in ToolSandbox, such as state dependency, canonicalization, and insufficient information, are challenging even for the most capable SOTA LLMs, providing new insights into the tool-use capabilities of LLMs.