
{"id":9851,"date":"2026-07-03T23:02:53","date_gmt":"2026-07-03T15:02:53","guid":{"rendered":"https:\/\/infernews.com\/blog\/mrpo\/"},"modified":"2026-07-03T23:41:10","modified_gmt":"2026-07-03T15:41:10","slug":"mrpo","status":"publish","type":"post","link":"https:\/\/infernews.com\/blog\/mrpo\/","title":{"rendered":"MRPO\uff1a\u91ab\u7642\u591a\u6a21\u614b\u63a8\u7406\u8a13\u7df4\u65b0\u8def\u7dda"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img data-dominant-color=\"dcdad9\" data-has-transparency=\"false\" style=\"--dominant-color: #dcdad9;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"582\" src=\"https:\/\/infernews.com\/blog\/wp-content\/uploads\/2026\/07\/pasted-250de31c3fce.jpg\" alt=\"alt text\" class=\"wp-image-9850 not-transparent\" srcset=\"https:\/\/infernews.com\/blog\/wp-content\/uploads\/2026\/07\/pasted-250de31c3fce.jpg 1024w, https:\/\/infernews.com\/blog\/wp-content\/uploads\/2026\/07\/pasted-250de31c3fce-300x171.jpg 300w, https:\/\/infernews.com\/blog\/wp-content\/uploads\/2026\/07\/pasted-250de31c3fce-768x437.jpg 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">MRPO \u662f\u4e00\u500b\u7528\u65bc\u91ab\u7642\u591a\u6a21\u614b\u63a8\u7406\u7684\u5f37\u5316\u5b78\u7fd2\u6846\u67b6\uff08reinforcement learning framework\uff09\u3002\u5b83\u8981\u89e3\u6c7a\u7684\u554f\u984c\u4e0d\u662f\u55ae\u7d14\u7b54\u5c0d\u8207\u5426\uff0c\u800c\u662f\u91ab\u7642 VQA \u904e\u7a0b\u4e2d\u63a8\u7406\u93c8\u4e00\u65e9\u51fa\u932f\uff0c\u4e4b\u5f8c\u4e00\u8def\u9023\u9396\u5931\u8aa4\uff0c\u4ee4\u6700\u5f8c\u7b54\u6848\u504f\u96e2\u3002<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u73fe\u6709 post-training \u505a\u6cd5\u591a\u6578\u504f\u5411 outcome-centric\uff0c\u4e3b\u8981\u770b final answer correctness \u6216 sequence-level preferences\u3002\u4f5c\u8005\u8a8d\u70ba\u9019\u7a2e\u7bc4\u5f0f\u7684\u554f\u984c\u662f sparse credit assignment\uff0c\u6a21\u578b\u77e5\u9053\u7b54\u932f\uff0c\u537b\u672a\u5fc5\u77e5\u9053\u7a76\u7adf\u7531\u54ea\u4e00\u6b65\u958b\u59cb\u5931\u6e96\uff1bMRPO \u56e0\u800c\u6539\u5beb GRPO-style advantages\uff0c\u7d50\u5408 answer-level reward \u8207 step-wise process rewards\uff0c\u4e26\u5728\u6700\u7d42\u7b54\u6848\u932f\u8aa4\u6642\uff0c\u5c0d\u8f03\u65e9\u51fa\u73fe\u7684 invalid steps \u7d66\u4e88\u66f4\u5927\u61f2\u7f70\u3002<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u9019\u500b\u8a2d\u8a08\u7684\u53d6\u5411\u5f88\u660e\u78ba\uff1a\u5b83\u4e0d\u662f\u53ea\u7f70\u932f\u7b54\u6848\uff0c\u800c\u662f\u91cd\u65b0\u5206\u914d\u5b78\u7fd2\u8a0a\u865f\uff0c\u512a\u5148\u4fee\u6b63\u6700\u65e9\u767c\u751f\u7684\u63a8\u7406\u932f\u8aa4\uff0c\u907f\u514d failure cascades \u64f4\u5927\u3002README \u63d0\u5230\uff0cMRPO \u5728\u4e09\u500b multimodal LLM backbones \u4e0a\u90fd\u512a\u65bc standard GRPO \u8207\u53e6\u4e00\u500b\u8fd1\u671f RL baseline\uff1b\u5728 Qwen3-VL-8B-Instruct \u4e0a\uff0c\u66f4\u4ee5\u53ea\u7528 13K training samples \u8d85\u904e\u8f03\u5927\u7684\u91ab\u7642 MLLMs\uff0c\u4f8b\u5982 HuatuoGPT-Vision-34B\uff0c\u5206\u6578\u9ad8\u51fa 2.79\u3002<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u6838\u5fc3\u65b9\u6cd5\uff1a\u4ee5 answer-level reward \u52a0 step-wise process rewards \u91cd\u6574 GRPO-style advantages<\/li>\n\n\n\n<li>\u4e3b\u8981\u5dee\u7570\uff1a\u91cd\u9ede\u653e\u5728 first failure\uff0c\u800c\u4e0d\u662f\u53ea\u770b\u6700\u5f8c\u6709\u5187\u7b54\u4e2d<\/li>\n\n\n\n<li>\u5df2\u516c\u5e03\u5167\u5bb9\uff1a\u5b8c\u6574 reinforcement learning recipe\u3001code\u3001datasets \u540c infrastructure<\/li>\n\n\n\n<li>\u53ef\u91cd\u73fe\u65b9\u5f0f\uff1a\u9805\u76ee\u63d0\u4f9b\u74b0\u5883\u8173\u672c\u3001\u8cc7\u6599\u4e0b\u8f09\u8207\u524d\u8655\u7406\u6d41\u7a0b\uff0c\u8a13\u7df4\u8cc7\u6599\u5305\u542b image\u3001problem\u3001solution \u6b04\u4f4d<\/li>\n\n\n\n<li>\u76f8\u95dc\u6a21\u578b\uff1aQwen3-VL-8B-Instruct\u3001HuatuoGPT-Vision-34B\uff0c\u4ee5\u53ca README \u63d0\u53ca\u7684\u53e6\u5916\u5169\u500b multimodal LLM backbones<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">\u91cf\u5316\u7d50\u679c\u6700\u503c\u5f97\u7559\u610f\u7684\u662f\u63a8\u7406\u8cea\u7d20\u5206\u6790\u3002MRPO \u5c07 early-stage reasoning failures \u7531 64.0% \u964d\u5230 13.0%\uff0c\u53cd\u6620\u5b83\u4e0d\u53ea\u662f\u628a\u7b54\u6848\u5206\u6578\u63a8\u9ad8\uff0c\u800c\u662f\u4ee4\u4e2d\u9014\u63a8\u7406\u8f03\u5c11\u4e00\u958b\u59cb\u5c31\u504f\u96e2\uff1b\u9019\u5c0d\u91ab\u7642\u5f71\u50cf\u554f\u7b54\u5c24\u5176\u91cd\u8981\uff0c\u56e0\u70ba\u932f\u8aa4\u5f80\u5f80\u4e0d\u662f\u51fa\u5728\u6700\u5f8c\u4e00\u53e5\uff0c\u800c\u662f\u524d\u9762\u89c0\u5bdf\u8207\u5224\u65b7\u5df2\u7d93\u5931\u7126\u3002<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u9019\u500b\u9805\u76ee\u8f03\u9069\u5408\u7814\u7a76\u91ab\u7642 AI\u3001\u91ab\u7642\u5f71\u50cf\u554f\u7b54\u3001multimodal reasoning post-training \u7684\u5718\u968a\u53c3\u8003\uff0c\u4e5f\u9069\u5408\u60f3\u6bd4\u8f03 RL \u8a13\u7df4\u914d\u65b9\u5dee\u7570\u7684\u4eba\u95b1\u8b80\u8207\u91cd\u73fe\u3002\u5b83\u73fe\u968e\u6bb5\u66f4\u63a5\u8fd1\u7814\u7a76\u539f\u578b\u8207\u8a13\u7df4\u65b9\u6cd5\u5c55\u793a\uff0c\u4e0d\u662f\u5373\u88dd\u5373\u7528\u7684\u81e8\u5e8a\u7522\u54c1\uff1b\u91cd\u9ede\u50f9\u503c\u5728\u65bc\uff0c\u5b83\u628a\u300c\u6a21\u578b\u54ea\u4e00\u6b65\u958b\u59cb\u8ad7\u932f\u300d\u6b63\u5f0f\u7d0d\u5165\u8a13\u7df4\u8a0a\u865f\uff0c\u70ba\u91ab\u7642 MLLMs \u63d0\u4f9b\u4e00\u689d\u6bd4\u53ea\u770b\u6700\u7d42\u7b54\u6848\u66f4\u7d30\u7dfb\u7684\u512a\u5316\u65b9\u5411\u3002<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/jungjunha.github.io\/MRPO-page\/\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>\u9805\u76ee\u4e3b\u9801<\/strong><\/a> \u00b7 <a href=\"https:\/\/github.com\/dmis-lab\/MRPO\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>GitHub<\/strong><\/a> \u00b7 <a href=\"https:\/\/huggingface.co\/collections\/dmis-lab\/mrpo\" rel=\"noopener noreferrer\" target=\"_blank\"><strong>\u6a21\u578b<\/strong><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>MRPO \u4e0d\u662f\u518d\u8ffd\u6700\u7d42\u7b54\u6848\u5206\u6578\uff0c\u800c\u662f\u76f4\u63a5\u4fee\u88dc\u63a8\u7406\u4e2d\u9014\u51fa\u932f\u7684\u4f4d\u7f6e\u3002\u91ab\u7642 VQA \u8868\u73fe\u56e0\u6b64\u66f4\u7a69\uff0c\u4e5f\u66f4\u6709\u53ef\u9077\u79fb\u6027\u3002<\/p>\n","protected":false},"author":8,"featured_media":9850,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"ai_generated_summary":"","footnotes":""},"categories":[133,147,30,144,153,185,119,197,76,127],"tags":[],"class_list":["post-9851","post","type-post","status-publish","format-standard","hentry","category-133","category-deepseek","category-image","category-medical","category-openai","category-qwen","category-119","category-framework","category-76","category-127"],"_links":{"self":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/9851","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/comments?post=9851"}],"version-history":[{"count":1,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/9851\/revisions"}],"predecessor-version":[{"id":9854,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/9851\/revisions\/9854"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/media\/9850"}],"wp:attachment":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/media?parent=9851"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/categories?post=9851"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/tags?post=9851"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}