
{"id":8851,"date":"2026-06-06T01:10:59","date_gmt":"2026-06-05T17:10:59","guid":{"rendered":"https:\/\/infernews.com\/blog\/official-repository-for-our-paper-adaplanbench-evaluating-adaptive-planning-in-l\/"},"modified":"2026-06-06T01:10:59","modified_gmt":"2026-06-05T17:10:59","slug":"official-repository-for-our-paper-adaplanbench-evaluating-adaptive-planning-in-l","status":"publish","type":"post","link":"https:\/\/infernews.com\/blog\/official-repository-for-our-paper-adaplanbench-evaluating-adaptive-planning-in-l\/","title":{"rendered":"AdaPlanBench: A New Yardstick for Adaptive Planning in LLM Agents"},"content":{"rendered":"<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/infernews.com\/blog\/wp-content\/uploads\/2026\/06\/pasted-28279377ea4f.jpg\" alt=\"Pipeline Overview\"><\/figure>\n<p>In practice, when AI agents help us plan schedules or operate tools, they rarely know every constraint up front; they discover new ones as they work. <strong>AdaPlanBench<\/strong> (Adaptive Planning Benchmark) is an evaluation benchmark designed for exactly this ability to adjust on the fly. It starts from 307 household tasks, then uses a scalable constraint-construction pipeline to attach two types of constraints to each task and reveal them to the agent progressively.<\/p>\n<p>What sets the benchmark apart is its combination of <strong>dual constraints<\/strong> and <strong>progressive disclosure<\/strong>. The first type, World Constraints, covers tools and objects that are unavailable or broken in the environment. The second type, User Constraints, covers things the user prefers to rule out: tool attributes, ways of using a tool, or behaviors. On each turn the agent submits a plan, and the evaluator checks it against the constraints revealed so far and scores it. When the plan violates a constraint, the evaluator returns error feedback, and the agent has to keep revising its strategy over multiple turns.<\/p>\n<p>The results show that this is not easy for existing models. At the medium constraint level, the strongest model, GPT-5, reaches only 67.75% accuracy; most models score below 45%, and open-weight models generally sit around 30%. The study also finds that a high valid plan rate (VPR) does not imply task success, that performance drops markedly as constraints increase, and that User Constraints are especially challenging.<\/p>\n<p><strong>Who is this project for?<\/strong> If you study planning, interactive decision-making, or multi-turn reasoning in LLM agents, or if you build applications on computer-use agents (CUAs) or environments such as OSWorld, AdaPlanBench offers a realistic test scenario with controllable difficulty. Constraint load comes in low, medium, and high tiers (plus tiers 4\u20136 for stress testing), so difficulty can be adjusted as needed.<\/p>\n<p>The key points worth noting about this benchmark are:<\/p>\n<ul>\n<li><strong>Joint testing of dual constraints<\/strong>: World and User Constraints are tested together within the same planning turn, which is closer to reality than a single-constraint setup.<\/li>\n<li><strong>Incremental disclosure by design<\/strong>: constraints are revealed gradually over the conversation, forcing the agent to infer and track them from feedback instead of relying on a complete one-shot specification.<\/li>\n<li><strong>Adjustable difficulty<\/strong>: each query comes with six environment settings, of which the Low, Medium, and High tiers have been released, supporting stress tests of varying intensity.<\/li>\n<li><strong>Multi-turn feedback loop<\/strong>: the agent keeps iterating until it succeeds, stops early, or runs out of turns, which yields richer behavioral data.<\/li>\n<li><strong>Multi-dimensional metrics<\/strong>: beyond accuracy, the benchmark records valid plan rate, average number of turns, and repeated-violation rate to help diagnose failure modes.<\/li>\n<\/ul>\n<p>The models covered include GPT-5, the Claude family, and several mainstream open-weight LLMs. The results consistently point to the same conclusion: when constraints keep accumulating, current LLM agents still struggle to plan adaptively in a robust way.<\/p>\n<p><strong>GitHub:<\/strong> <a href=\"https:\/\/github.com\/JiayuJeff\/AdaPlanBench\" rel=\"noopener noreferrer\">https:\/\/github.com\/JiayuJeff\/AdaPlanBench<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AdaPlanBench tests how well large language model agents revise their plans on the fly under dual constraints, and shows that current models still struggle to adapt reliably to dynamically changing environments and user preferences.<\/p>\n","protected":false},"author":8,"featured_media":8850,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ai_generated_summary":"","wpai_meta_description":"","footnotes":""},"categories":[133,197],"tags":[],"class_list":["post-8851","post","type-post","status-publish","format-standard","hentry","category-133","category-framework"],"_links":{"self":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/8851","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/comments?post=8851"}],"version-history":[{"count":0,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/8851\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/media\/8850"}],"wp:attachment":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/media?parent=8851"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/categories?post=8851"},{"taxonomy
":"post_tag","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/tags?post=8851"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}