LINUX DO Channel

chengyongru (@yongru_cheng) posted in "arXiv:2605.27922: Does agent capability depend on the model or the harness? Harness-Bench"

Paper: [2605.27922] Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
Code: GitHub - Qihoo360/harness-bench

In short, Harness-Bench fixes the tasks and the model, swaps only the harness, and measures how much the agent's performance changes.
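
To make the comparison concrete, here is a minimal sketch of how a harness effect could be summarised once scores are in hand. The score table, the model and harness names, and the `harness_spread` helper are all hypothetical placeholders, not the paper's data or code. The spread (max minus min across harnesses, per model) is just one simple way to express the gap and is an assumption here, not necessarily the metric the paper reports.

```python
from collections import defaultdict

# Hypothetical mean scores keyed by (model, harness).
# These numbers are made up for illustration only.
scores = {
    ("model_a", "harness_x"): 0.62,
    ("model_a", "harness_y"): 0.48,
    ("model_b", "harness_x"): 0.71,
    ("model_b", "harness_y"): 0.66,
}


def harness_spread(scores):
    """For each model, return max - min score across its harnesses."""
    by_model = defaultdict(list)
    for (model, _harness), score in scores.items():
        by_model[model].append(score)
    return {model: max(vals) - min(vals) for model, vals in by_model.items()}


spread = harness_spread(scores)
for model, gap in sorted(spread.items()):
    print(f"{model}: spread across harnesses = {gap:.2f}")
```

Because the model and the task set are held constant within each row, any gap that shows up can only come from the harness (plus run-to-run noise).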
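Method
106 sandboxed offline tasks across 8 categories (SWE, data analysis, DevOps, long-horizon state maintenance, etc.), each with its own independent oracle grader.
Evaluation covers three dimensions: completion score, LLM judge score, and security score.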
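
The setup described above can be pictured as a simple data model: one task record carrying its own grader, and one result record with the three scores. This is a rough sketch only. The `Task`, `RunScores`, `grade_readme`, and `evaluate` names, the dict-based workspace, and the idea that the oracle grader is what produces the completion score are all assumptions made for illustration, not the actual harness-bench code.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Task:
    task_id: str
    category: str  # e.g. "SWE", "DevOps"
    # Per-task oracle: maps the final workspace to a score in [0, 1].
    grader: Callable[[Dict[str, str]], float]


@dataclass
class RunScores:
    completion: float
    llm_judge: float
    security: float


def grade_readme(workspace):
    """Toy oracle grader: pass only if the expected file has the expected text."""
    return 1.0 if workspace.get("README.md") == "hello" else 0.0


def evaluate(task, workspace, llm_judge, security):
    # Assumption: only the completion score comes from the oracle grader;
    # the other two are supplied by separate evaluators.
    return RunScores(
        completion=task.grader(workspace),
        llm_judge=llm_judge,
        security=security,
    )


task = Task("demo-001", "DevOps", grade_readme)
result = evaluate(task, {"README.md": "hello"}, llm_judge=0.8, security=1.0)
print(result)
```

Keeping the grader attached to the task means every harness is scored by exactly the same check, which is what makes the cross-harness comparison fair.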
Six currently popular agents were tested (OpenClaw, nanobot, Hermes, ZeroClaw, NullClaw, Molti...