Main text

Source: arXiv 2604.11557
Title: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents
Links: arXiv abstract page · PDF · code repository

Key points I pulled out first

Abstract, line by line

Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls.

Tool use is a core capability of LLM agents: it lets the model interact with external systems through structured function calls.

However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks.

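Existing work, however, has three clear problems: inconsistent interaction representations, too little attention to the structural distribution of tool-use trajectories, and evaluation benchmarks that are incompatible with one another.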

We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation.

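The authors propose UniToolCall, a unified framework for tool learning that tries to standardize the whole pipeline, from toolset construction and dataset generation through to evaluation.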

The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories.

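The framework curates a large pool of more than 22,000 tools and builds a hybrid training corpus of more than 390,000 instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories.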

It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures.

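It explicitly models several interaction patterns (single-hop vs. multi-hop, single-turn vs. multi-turn) and covers both serial and parallel execution structures.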
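To make the three axes concrete, here is a minimal sketch of how a trajectory could be labeled along them. The abstract does not define these terms formally, so the definitions used here are assumptions: a turn is one user query, a hop is one step of tool calls inside a turn, and a step with more than one call counts as parallel. The names `Call`, `Turn`, and `describe` are hypothetical and do not come from the paper or its code.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One tool invocation (hypothetical structure, not the paper's schema)."""
    name: str
    arguments: dict = field(default_factory=dict)


@dataclass
class Turn:
    """One user query plus the tool-call steps taken to answer it.

    `steps` is a list of hops; each hop is a list of calls issued together.
    """
    query: str
    steps: list


def describe(turns):
    """Label a trajectory along the three axes, under the assumed definitions."""
    multi_turn = len(turns) > 1
    # Assumption: a turn is multi-hop when it needs more than one step.
    multi_hop = any(len(t.steps) > 1 for t in turns)
    # Assumption: execution is parallel when any single step holds several calls.
    parallel = any(len(step) > 1 for t in turns for step in t.steps)
    return {
        "turns": "multi-turn" if multi_turn else "single-turn",
        "hops": "multi-hop" if multi_hop else "single-hop",
        "execution": "parallel" if parallel else "serial",
    }


if __name__ == "__main__":
    trajectory = [
        Turn("Compare the weather in Paris and Rome",
             [[Call("get_weather", {"city": "Paris"}),
               Call("get_weather", {"city": "Rome"})],
              [Call("compare", {"metric": "temperature"})]]),
        Turn("And tomorrow?", [[Call("get_forecast", {"city": "Paris"})]]),
    ]
    print(describe(trajectory))
```

Labels like these are what a "structurally controlled" generator would need in order to balance a corpus, though how the paper actually controls the distribution is something to check in the full text.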

To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies.

To support coherent multi-turn reasoning, the authors also introduce an Anchor Linkage mechanism that enforces dependencies across turns.

Furthermore, we convert 7 public benchmarks into a unified Query-Action-Observation-Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels.

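In addition, the authors convert 7 public benchmarks into a unified QAOA (Query-Action-Observation-Answer) representation and evaluate at three granularities: function call, turn, and conversation.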
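As a rough illustration of what a QAOA record and three-level scoring might look like, the sketch below stores one turn as query, actions, observations, and answer, then scores predicted actions against gold actions. Everything here is an assumption for illustration: the field names, the exact-match rule, and the function `score` are hypothetical, and this is not the paper's schema or its Strict Precision metric.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One function call; arguments are sorted (key, value) pairs so it is hashable."""
    name: str
    arguments: tuple = ()


def action(name, **arguments):
    # Argument values must themselves be hashable (str, int, ...) in this sketch.
    return Action(name, tuple(sorted(arguments.items())))


@dataclass
class QAOATurn:
    """Hypothetical QAOA record for a single turn."""
    query: str
    actions: list
    observations: list
    answer: str


def score(pred_turns, gold_turns):
    """Exact-match scores at call, turn, and conversation level (assumed rules)."""
    if not gold_turns or len(pred_turns) != len(gold_turns):
        raise ValueError("need the same non-zero number of predicted and gold turns")
    matched = total = 0
    turn_ok = []
    for pred, gold in zip(pred_turns, gold_turns):
        p, g = Counter(pred.actions), Counter(gold.actions)
        matched += sum((p & g).values())   # calls that match a gold call exactly
        total += sum(g.values())
        turn_ok.append(p == g)             # turn passes only if all calls match
    return {
        "call": matched / total if total else 1.0,
        "turn": sum(turn_ok) / len(turn_ok),
        "conversation": float(all(turn_ok)),  # every turn must pass
    }
```

The point of the sketch is the nesting: a single wrong argument lowers the call-level score slightly, fails its whole turn, and fails the whole conversation, which is why reporting all three levels is more informative than any one of them.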

Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.

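Experiments show that fine-tuning Qwen3-8B on this dataset substantially improves tool-use ability; in the distractor-heavy Hybrid-20 setting, its single-turn Strict Precision reaches 93.0%, ahead of commercial models including GPT, Gemini, and Claude.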

Is this paper worth reading?

Worth reading

But keep expectations in check

The parts I suggest you focus on

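  1. How exactly the QAOA representation is defined, and how it differs in substance from existing function-calling traces
  2. How the structurally controlled synthetic trajectories are generated, and whether they end up too templated
  3. The formal definition of Anchor Linkage and how it is injected into training
  4. How the Hybrid-20 evaluation setting constructs its distractor tools
  5. How many real problems the unified approach actually solves compared with existing work such as ToolBench, BFCL, and APIBench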

Notes