
{"id":9740,"date":"2026-06-28T21:39:51","date_gmt":"2026-06-28T13:39:51","guid":{"rendered":"https:\/\/infernews.com\/blog\/evaluation-harness\/"},"modified":"2026-06-28T21:39:51","modified_gmt":"2026-06-28T13:39:51","slug":"evaluation-harness","status":"publish","type":"post","link":"https:\/\/infernews.com\/blog\/evaluation-harness\/","title":{"rendered":"GauntletBench Evaluation Harness Pinpoints Agent Blind Spots"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/infernews.com\/blog\/wp-content\/uploads\/2026\/06\/pasted-83b3723fe2d4.jpg\" alt=\"GauntletBench logo\"><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">GauntletBench is a highly challenging web-based benchmark that measures how well agentic systems generalize to complex, vision-based professional tasks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">GauntletBench is built around five lesser-known application scenarios (a video editor, a workflow builder, a 3D modeller, a flight analyser, and a circuit designer) and evaluates three underexplored capabilities: temporal perception, graphical understanding, and 3D reasoning. The benchmark comprises 100 human-completable tasks, a modular evaluation pipeline, and automated domain-specific scoring. It reveals a significant gap between frontier agents and human performance: the strongest agent evaluated reaches a success rate of only 19.1%, while non-expert human annotators exceed 80%, which indicates that current agents are still far from reliable performance on complex real-world tasks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most existing benchmarks focus on popular applications and relatively straightforward tasks, so scores from the newest generation of agents tend to saturate and may not reflect how far those agents are from doing real work. GauntletBench takes the opposite approach: it deliberately avoids common apps and instead uses five rarely covered environments (Circuit Designer, Flight Analyser, Video Editor, 3D Modeller, and Workflow Builder), reframing the question as \u201ccan an agent complete vision-intensive work in an unfamiliar interface?\u201d<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The GitHub project is not a model but a framework for running the evaluation. According to the README, experiments can be run for a single task, for a whole application, or in batches defined in JSON, and the harness also supports parallel execution and YAML task files. The underlying agent run mechanics reuse the browser harness and task loop from REAL; what this project adds is the evaluation framework, a batch runner, objective and LLM-as-a-judge evaluators, and the new task suites.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>100 tasks<\/strong>, 20 per application, all of them vision-intensive<\/li>\n<li><strong>Default model parameter<\/strong>: the model can be set with <code>--model<\/code> and defaults to <code>o3<\/code><\/li>\n<li><strong>Extensible test setup<\/strong>, with support for YAML task files and JSON batch configuration<\/li>\n<li><strong>A clear signal in the results<\/strong>: the best agent reaches roughly 19.1% to 20.9% success, while non-expert human annotators score upwards of 80% to 90%<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The most notable finding is a very practical gap: agent frameworks generally outperform raw models, yet overall they remain far behind humans, and open-source models mostly score below 1%. Video Editor is among the more tractable applications, while Circuit Designer is close to \u201cnearly impossible\u201d. The tool is therefore particularly useful for teams working on agentic systems, computer-use agents, web automation, and multimodal capabilities, who can use it to pinpoint where a model is not merely \u201canswering wrongly\u201d but fundamentally failing to understand time, graphics, and spatial structure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/gauntlet-landing-page.vercel.app\/\" rel=\"noopener noreferrer\" 
target=\"_blank\"><strong>Project page<\/strong><\/a> \u00b7 <a href=\"https:\/\/github.com\/gauntlet-benchmark\/evaluation-harness\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>GitHub<\/strong><\/a> \u00b7 <a href=\"https:\/\/arxiv.org\/pdf\/2606.14397\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>Paper<\/strong><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>An evaluation tool for testing how well agents generalize. It avoids common applications and focuses on vision-intensive, fairly specialized web tasks.<\/p>\n","protected":false},"author":8,"featured_media":9739,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ai_generated_summary":"","footnotes":""},"categories":[133,185,163,164,140,116,31,38,132,119,76,191,197],"tags":[],"class_list":["post-9740","post","type-post","status-publish","format-standard","hentry","category-133","category-qwen","category-163","category-164","category-gemini","category-agentic","category-video","category-38","category-3d","category-119","category-76","category-anthropic","category-framework"],"_links":{"self":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/9740","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/comments?post=9740"}],"version-history":[{"count":0,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/9740\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/media\/9739"}],"wp:attachment":[{"href":"https:\/\/infernews.com\/blo
g\/wp-json\/wp\/v2\/media?parent=9740"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/categories?post=9740"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/tags?post=9740"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}