
{"id":3230,"date":"2024-07-10T05:25:03","date_gmt":"2024-07-09T21:25:03","guid":{"rendered":"https:\/\/infernews.com\/?p=3230"},"modified":"2024-07-10T05:43:46","modified_gmt":"2024-07-09T21:43:46","slug":"tokenization-%e6%a8%99%e8%a8%98%e5%8c%96%e5%9f%ba%e7%a4%8e%e8%aa%8d%e8%ad%98","status":"publish","type":"post","link":"https:\/\/infernews.com\/blog\/tokenization-%e6%a8%99%e8%a8%98%e5%8c%96%e5%9f%ba%e7%a4%8e%e8%aa%8d%e8%ad%98\/","title":{"rendered":"Tokenization Basics"},"content":{"rendered":"\n<p>An LLM does not actually process the words in a text directly; instead it uses tokens, which can represent a word, part of a word, or even a group of words. During training and inference (interacting with the model through prompts), a preprocessing component called a tokenizer converts text into tokens using a specific tokenization algorithm. When an LLM learns the statistical relationships in its training data, and later generates new text probabilistically from those learned relationships, it works with tokens, not words.<\/p>\n\n\n\n<p>Let's look at a highly simplified, hypothetical, non-standard example in which the tokenization scheme splits words into syllables:<\/p>\n\n\n\n<p>A human would likely see this sentence as 6 words: A cat is a furry animal.<\/p>\n\n\n\n<p>An LLM using our hypothetical tokenizer, however, would see 10 tokens: A cat is a fur ry an i mal .<\/p>\n\n\n\n<p>In practice, tokenization algorithms such as Byte Pair Encoding (BPE) or WordPiece are more complex; this example is for illustration only.<\/p>\n\n\n\n<p>Distinguishing between words and tokens matters for several reasons:<\/p>\n\n\n\n<p>1) When we talk about the context window and maximum context (the amount of data a model can process in a single prompt, for example an 8K context), we are talking about the number of tokens, not the number of characters or words.<\/p>\n\n\n\n<p>2) Knowing that LLMs operate on tokens rather than words helps reinforce the idea that LLMs generate text from statistical patterns rather than from understanding.<\/p>\n\n\n\n<p>3) This approach is not without complications. Tokenization is one of several reasons LLMs struggle with tasks such as counting words or counting how many times a letter appears in a word.<\/p>\n\n\n\n<p>So why use tokens rather than words? Tokenization has actually been in use for decades (well before LLMs appeared), and for some of the same reasons.<\/p>\n\n\n\n<p>1) Better &#8220;meaning&#8221;: the model can form more precise relationships between parts of the training data, both during training and later at inference. For example, if you ask \u201cWhat is a cat?\u201d, the LLM can start predicting \u201cA cat is fur\u201d and then choose among \u201c-ry\u201d, \u201c-red\u201d, \u201c-covered\u201d, and so on, which is more efficient than having to build relationships between all individual words.<\/p>\n\n\n\n<p>2) Greater efficiency: grouping similar word parts reduces model size and computational load.<\/p>\n\n\n\n<p>3) Benefits for cross-language and translation tasks, and for making sense of unknown words.<\/p>\n\n\n\n<p>On the downside, tokenization does add complexity and can cause trouble for LLMs in certain problem areas, such as counting words or counting the letters in a word.<\/p>\n\n\n\n<p>In short, LLMs do not use or understand language the way humans do. Instead, they break language into pieces, learn the relationships between those pieces during training, and then break user input down the same way in order to generate new text based on probabilities.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An LLM 
does not actually process the words in a text directly; instead it uses tokens, which can represent a word, part of a word, or even a group of words. During training and inference (interacting with the model through prompts), a preprocessing component called a tokenizer converts text into tokens using a specific tokenization algorithm. When an LLM learns the statistical relationships in its training data, and later generates new text probabilistically from those learned relationships, it works with tokens, not words. Let's look at a highly simplified, hypothetical, non-standard example in which the tokenization scheme splits words into syllables: A human would likely see this sentence as 6 words: A cat is a furry animal. An LLM using our hypothetical tokenizer, however, would see 10 tokens: A cat is a fur ry an i mal . In practice, tokenization algorithms such as Byte Pair Encoding (BPE) or WordPiece are more complex; this example is for illustration only. Distinguishing between words and tokens matters for several reasons: 1) When we talk about the context window and maximum context (the amount of data a model can process in a single prompt, for example an 8K context), we are talking about the number of tokens, not the number of characters or words. 2) Knowing that LLMs operate on tokens rather than words helps reinforce the idea that LLMs generate text from statistical patterns rather than from understanding. 3) This approach is not without complications. Tokenization is one of several reasons LLMs struggle with tasks such as counting words or counting how many times a letter appears in a word. So why use tokens rather than words? Tokenization has actually been in use for decades (well before LLMs appeared), and for some of the same reasons. 1) Better &#8220;meaning&#8221;: the model can form more precise relationships between parts of the training data, both during training and later at inference. For example, if you ask \u201cWhat is a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"googlesitekit_rrm_CAowvqSiDA:productID":"","footnotes":""},"categories":[27],"tags":[],"class_list":["post-3230","post","type-post","status-publish","format-standard","hentry","category-paper"],"_links":{"self":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/3230","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/comments?post=3230"}],"version-history":[{"count":0,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/posts\/3230\/revisions"}],"wp:attachment":[{"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/media?parent=3230"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/categories?post=3230"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/infernews.com\/blog\/wp-json\/wp\/v2\/tags?post=3230"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}