CARVIEW |
Navigation Menu
-
Notifications
You must be signed in to change notification settings - Fork 637
Releases: PaddlePaddle/FastDeploy
v2.2.1
e42dc8c
Compare
ๆฐๅขๅ่ฝ
- ๆฐๅขๅจ็บฟๆ้ๆดๆฐๆฏๆๅผๅฏPrefix Caching
- ๆฐๅขGLM 4.5 Airๆจกๅ้จ็ฝฒๆฏๆ
What's Changed
- [docs] update best practice docs for release/2.2 by @zoooo0820 in #3970
- [Docs] release 2.2.0 by @ming1753 in #3991
- [docs] update readme by @yangjianfengo1 in #3996
- [Optimize]Error messages about Model api. by @AuferGachet in #3972
- [Cherry-Pick] get org_vocab_size from args by @zeroRains in #3984
- ใFIXใChange the name of sparse attn from moba to plas by @yangjianfengo1 in #4006
- Fix down projection weight shape in fused MOE layer by @yuanlehome in #4041
- [Fix] fix multi api server log dir by @ltd0924 in #3966
- Fixed the issue of metrics file conflicts between multiple instances โฆ by @zhuangzhuang12 in #4010
- [Feature] Support mixed deployment with yiyan adapter in release22 by @rainyfly in #3974
- [CI] update paddlepaddle==3.2.0 in release/2.2 by @EmmonsCurse in #3997
- [setup optimize]Support git submodule (#4033) by @YuanRisheng in #4080
- [CP]Glm45 air 2.2 by @ckl117 in #4073
- [feat] support prefix cache clearing when
/clear_load_weight
is called by @liyonghua0910 in #4091 - [BugFix]fix tp/ep group gid by @gzy19990617 in #4038
- Support limit thinking lengths. by @K11OntheBoat in #4070
- Add assertion for ENABLE_V1_KVCACHE_SCHEDULER by @Jiang-Jia-Jun in #4146
- [fix] fix ep group all-reduce by @liyonghua0910 in #4140
- [Cherry-pick] fix MTP load with v1 loader by @zoooo0820 in #4153
- [CP2.2] Machete support group scale & wint8 & v1 loader by @Sunny-bot1 in #4166
- [Feature] support rdma IB transfer by @ltd0924 in #4123
- [BugFix]2.2 glm all reduce tp group by @ckl117 in #4188
- [Executor] Adjust signal sending order in RL training (#3773) (#4066) by @gongshaotian in #4178
- [fix] initialize available_gpu_block_num with max_gpu_block_num by @liyonghua0910 in #4193
- [fix]Modify follow-up push parameters and Modify the verification method for thinking length by @luukunn in #4177
- Fix noaux_tc cuda Error 700 in CUDAGraph and Add wfp8apf8 moe quant method by @ckl117 in #4115
- [Feature]CP support data clear by @ltd0924 in #4214
- [fix] fix clearing caches synchronization and add more logs by @liyonghua0910 in #4212
- fix ernie vl distributed attr. by @ZHUI in #4217
- [2.2]include_stop_str_in_output=False not return eos text by @ckl117 in #4231
- [fix]update apply_chat_template by @luukunn in #4249
- [fix]remove reasoning_max_tokens=max_toksns*0.8 in sampling_params by @luukunn in #4294
- ใfixใRemove the logic that assigns the default value of 80% to reasoning_max_tokens in the offline component of FastDeploy by @kxz2002 in #4304
- [feature]2.2 custom_allreduce support cudagraph recapture by @ckl117 in #4307
- [BUGFIX] clear request by @ltd0924 in #4320
Full Changelog: v2.2.0...v2.2.1
Assets 2
v2.2.0
d40a104
Compare
ๆฐๅขๅ่ฝ
- ้ๆ ท็ญ็ฅไธญ็bad_wordsๆฏๆไผ ๅ ฅtoken ids
- ๆฐๅขQwen2.5-VL็ณปๅๆจกๅๆฏๆ(่ง้ข่ฏทๆฑไธๆฏๆenable-chunked-prefill)
- API-Server completionsๆฅๅฃprompt ๅญๆฎตๆฏๆไผ ๅ ฅtoken idๅ่กจ๏ผๅๆถๆฏๆๆน้ๆจ็
- ๆฐๅขfunction call่งฃๆๅ่ฝ๏ผๆฏๆ้่ฟ
tool-call-parse
่งฃๆfunction call็ปๆ - ๆฏๆๆๅกๅฏๅจๆ่ฏทๆฑไธญ่ชๅฎไนchat_template
- ๆฏๆๆจกๅchat_template.jinjaๆไปถ็ๅ ่ฝฝ
- ่ฏทๆฑๆฅ้็ปๆๅขๅ ๅผๅธธๅ ๆ ไฟกๆฏ๏ผๅฎๅๅผๅธธlog่ฎฐๅฝ
- ๆฐๅขๆททๅMTPใNgram็ๆๆบ่งฃ็ ๆนๆณ
- ๆฏๆ็จไบๆๆบ่งฃ็ ็Tree Attentionๅ่ฝ
- ๆจกๅๅ ่ฝฝๅ่ฝๅขๅผบ๏ผๅฎ็ฐไบไฝฟ็จ่ฟญไปฃๅจๅ ่ฝฝๆจกๅๆ้๏ผๅ ่ฝฝ้ๅบฆๅๅ ๅญๅ ็จ่ฟไธๆญฅไผๅ
- API-Serverๅฎๅๆฅๅฟๆ ผๅผ๏ผๅขๅ ๆถ้ดไฟกๆฏ
- ๆฐๅขๆไปถๆบๅถ๏ผๅ ่ฎธ็จๆทๅจไธไฟฎๆนFastDeployๆ ธๅฟไปฃ็ ็ๅๆไธๆฉๅฑ่ชๅฎไนๅ่ฝ
- ๆฏๆMarlin kernelๆไปถๅจ็ผ่ฏ้ถๆฎตๆ็ งๆจก็้ ็ฝฎ่ชๅจ็ๆ
- ๆฏๆๅ ่ฝฝ HuggingFaceๅ็Safetensorsๆ ผๅผ็ๆๅฟใQwen็ณปๅๆจกๅ
- ๅฎๅDP+TP+EPๆททๅๅนถ่กๆจ็
ๆง่ฝไผๅ
- ๆฐๅขW4Afp8 MoE Group GEMM็ฎๅญ
- CUDA Graphๅขๅ ๅฏน่ถ 32K้ฟๆ็ๆฏๆ
- ไผๅmoe_topk_select็ฎๅญๆง่ฝ๏ผๆๅMoEๆจกๅๆง่ฝ
- ๆฐๅขMachete WINT4 GEMM็ฎๅญ๏ผไผๅWINT4 GEMMๆง่ฝ๏ผ้่ฟFD_USE_MACHETE=1ๅผๅฏ
- Chunked prefill ้ป่ฎคๅผๅฏ
- V1 KVCache่ฐๅบฆ็ญ็ฅไธไธไธๆ็ผๅญ้ป่ฎคๅผๅฏ
- MTPๆฏๆๆดๅค่็จฟtokenๆจ็๏ผๆๅๅคๆญฅๆฅๅ็
- ๆฐๅขๅฏๆๆ่ฝป้ๅ็จ็ๆณจๆๅๅ ้้ฟๆๆจ็
- ้ๅฏนDecodeๆฏๆ่ช้ๅบๅ้ถๆฎต็All-to-All้ไฟก๏ผๆๅ้ไฟก้ๅบฆ
- ๆฏๆDeepSeek็ณปๅๆจกๅMLA Bankend encoder้ถๆฎตๅฏ็จFlash-Attrntion-V3
- ๆฏๆDeepSeek็ณปๅๆจกๅq_a_proj & kv_a_proj_with_mqa linearๆจชๅ่ๅ
- API-Serverๆฐๅขzmq dealer ๆจกๅผ้ไฟก็ฎก็ๆจกๅ๏ผๆฏๆ่ฟๆฅๅค็จ่ฟไธๆญฅๆฉๅฑๆๅกๅฏๆฏๆ็ๆๅคงๅนถๅๆฐ
Bugไฟฎๅค
- completionๆฅๅฃechoๅๆพๆฏๆ
- ไฟฎๅค V1่ฐๅบฆไธไธไธๆ็ผๅญ็็ฎก็ bug
- ไฟฎๅค Qwen ๆจกๅๅบๅฎ top_p=0 ไธคๆฌก่พๅบไธไธ่ด็้ฎ้ข
- ไฟฎๅค uvicorn ๅคworkerๅฏๅจใ่ฟ่กไธญ้ๆบๆๆ้ฎ้ข
- ไฟฎๅค API-Server completionsๆฅๅฃไธญๅคไธช prompt ็ logprobs ่ๅๆนๅผ
- ไฟฎๅค MTP ็้ๆ ท้ฎ้ข
- ไฟฎๅคPD ๅ็ฆปcache ไผ ่พไฟกๅท้่ฏฏ
- ไฟฎๅคๅผๅธธๆๅบๆต้ๆงๅถไฟกๅท้ๆพ้ฎ้ข
- ไฟฎๅค
max_tokens
ไธบ0 ๅผๅธธๆๅบๅคฑ่ดฅ้ฎ้ข - ไฟฎๅคEP + DP ๆททๅๆจกๅผไธ็ฆป็บฟๆจ็้ๅบhang้ฎ้ข
ๆๆกฃ
- ๆดๆฐไบๆไฝณๅฎ่ทตๆๆกฃไธญไธไบๆๆฏ็็จๆณๅๅฒ็ชๅ ณ็ณป
- ๆฐๅขๅคๆบๅผ ้ๅนถ่ก้จ็ฝฒๆๆกฃ
- ๆฐๅขๆฐๆฎๅนถ่ก้จ็ฝฒๆๆกฃ
ๅ ถๅฎ
- CIๆฐๅขๅฏน่ชๅฎไน็ฎๅญ็Approveๆฆๆช
- Configๆด็ๅ่ง่ๅ
What's Changed
- Describe PR diff coverage using JSON file by @XieYunshen in #3114
- [CI] add xpu ci case by @plusNew001 in #3111
- disable test_cuda_graph.py by @XieYunshen in #3124
- [CE] Add base test class for web server testing by @DDDivano in #3120
- [OPs] MoE Preprocess OPs Support 160 Experts by @ckl117 in #3121
- [Docs] Optimal Deployment by @ming1753 in #2768
- fix stop seq unittest by @zoooo0820 in #3126
- [XPU]Fix out-of-memory issue during single-XPU deployment by @iosmers in #3133
- [Code Simplification] Refactor Post-processing in VL Model Forward Method by @DrRyanHuang in #2937
- add case by @DDDivano in #3150
- fix ci by @XieYunshen in #3141
- Fa3 ๆฏๆ้ไธญๅผ by @yangjianfengo1 in #3112
- Add CI cases by @ZhangYulongg in #3155
- [XPU]Updata XPU dockerfiles by @plusNew001 in #3144
- [Feature] remove dependency on enable_mm and refine multimodal's code by @ApplEOFDiscord in #3014
- ใInference OptimizeใSupport automatic generation of marlin kernel by @chang-wenbin in #3149
- Update init.py by @DDDivano in #3163
- fix load_pre_sharded_checkpoint by @bukejiyu in #3152
- ใFeatureใadd fd plugins && rm model_classes by @gzy19990617 in #3123
- [Bug Fix] fix pd disaggregated kv cache signal by @ltd0924 in #3172
- Update test_base_chat.py by @DDDivano in #3183
- Fix approve shell scripts by @YuanRisheng in #3108
- [Bug Fix] fix the bug in test_sampler by @zeroRains in #3157
- ใFeatureใsupport qwen3 name_mapping by @gzy19990617 in #3179
- remove useless code by @zhoutianzi666 in #3166
- [Bug fix] Fix cudagraph when use ep. by @Wanglongzhi2001 in #3130
- [Bugfix] Fix uninitialized decoded_token and add corresponding unit tโฆ by @sunlei1024 in #3195
- [CI] add test_compare_top_logprobs by @EmmonsCurse in #3191
- fix expertwise_scale by @rsmallblue in #3181
- [FIX]fix bad_words when sending requests consecutively by @Sunny-bot1 in #3197
- [plugin] Custom model_runner/model support by @lizhenyun01 in #3186
- Add more base chat cases by @DDDivano in #3203
- Add switch to apply fine-grained per token quant fp8 by @RichardWooSJTU in #3192
- [Bug Fix]Fix bug of append attention test case by @gongshaotian in #3202
- add more cases by @DDDivano in #3207
- fix coverage report by @XieYunshen in #3198
- [New Feature] fa3 ๆฏๆflash mask by @yangjianfengo1 in #3184
- [Test] scaled_gemm_f8_i4_f16 skip test while sm != 89 by @ming1753 in #3210
- [EP] Refactor DeepEP Engine Organization for Mixed Mode & Buffer Management Optimization by @RichardWooSJTU in #3182
- [Bug fix] Fix lm head bias by @RichardWooSJTU in #3185
- Ce add repitation early stop cases by @DDDivano in #3213
- [BugFix]fix test_air_top_p_sampling name by @ckl117 in #3211
- [BugFix] support real batch_size by @lizexu123 in #3109
- Ce add bad cases by @DDDivano in #3215
- revise noaux_tc by @rsmallblue in #3164
- [Bug Fix] Fix bug of MLA Attention Backend by @gongshaotian in #3176
- support qk norm for append attn by @rsmallblue in #3145
- Fix approve ci by @XieYunshen in #3212
- [Trace]add trace when fd start by @sg263 in #3174
- [New Feature] Support W4Afp8 MoE GroupGemm by @yangjianfengo1 in #3171
- Perfect approve error message by @YuanRisheng in #3224
- Fix the confused enable_early_stop when only set early_stop_config by @zeroRains in #3214
- [CI] Add ci case for min token and max token by @xjkmfa in #3229
- add some evil cases by @DDDivano in #3240
- support qwen3moe by @bukejiyu in #3084
- [Feature] support seed parameter by @lizexu123 in #3161
- ใFix Bugใ ไฟฎๅค fa3 ๆฏๆ้ไธญๅผbug by @yangjianfengo1 in #3235
- [bugfix]fix blockwisefp8 and all_reduce by @bukejiyu in #3243
- [Feature] multi source download by @Yzc216 in #3125
- [fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3247
- [Doc][XPU] Update deps and fix dead links by @hong19860320 in #3252
- Fix approve ci bug by @YuanRisheng in #3239
- [Executor]Update graph test case and delete test_attention by @gongshaotian in #3257
- [CI] remove useless case by @EmmonsCurse in #3261
- Ce add benchmark test by @DDDivano in #3262
- [stop_seq] fix out-bound value for stop sequence by @zoooo0820 in #3216
- [fix] multi source download by @Yzc216 in #3259
- [Bug fix] support logprob in scheduler v1 by @rainyfly in #3249
- [feat]add fast_weights_iterator by @bukejiyu in #3258
- [Iluvatar GPU] Optimze attention and moe performance by @wuyujiji in #3234
- delete parallel_state.py by @yuanlehome in #3250
- [bugfix]qwen3_fix and qwq fix by @bukejiyu in #3255
- ใFixใใMTPใFix MTP sample bug by @freeliuzc in #3139
- [CI] add CI logprobs case by @plusNew001 in #3189
- Move create_parameters to init in FuseMOE for CultassBackend and TritonBackend by @zeroRains in #3148
- [Bugfix] Fix model accuracy in some ops by @gzy19990617 in #3231
- add base test ci by @XieYunshen in #3225
- [BugFix] fix too many ...
Assets 2
v2.1.1
c49c43d
Compare
ๆๆกฃ
- ๆฐๅขๅคๆบๅผ ้ๅนถ่ก้จ็ฝฒๆๆกฃ
- ๆๅฟ็ณปๅๆจกๅๆไฝณๅฎ่ทตๆๆกฃๆดๆฐๅฐๆๆฐ็จๆณ
- ๆดๆฐCUDA Graphไฝฟ็จ่ฏดๆ
ๆฐๅขๅ่ฝ
- ่ฟๅ็ปๆๆฐๅข
completion_tokens
ไธprompt_tokens
๏ผๆฏๆ่ฟๅๅๅง่พๅ ฅไธๆจกๅๅๅง่พๅบๆๆฌ - completionๆฅๅฃๆฏๆ
echo
ๅๆฐ
Bugไฟฎๅค
- ไฟฎๅคV1 KVCache่ฐๅบฆไธLogProbๆ ๆณ่ฟๅ้ฎ้ข
- ไฟฎๅค
chat_template_kwargs
ๅๆฐๆ ๆณ็ๆ้ฎ้ข - ไฟฎๅคๆททๅๆถๆ้จ็ฝฒไธ็EPๅนถ่ก้ฎ้ข
- ไฟฎๅคcompletionๆฅๅฃ่ฟๅ็ปๆไธญ่พๅบToken่ฎกๆฐ้่ฏฏ้ฎ้ข
- ไฟฎๅคlogprobs่ฟๅ็ปๆ่ๅ้ฎ้ข
What's Changed
- [Docs] Add Multinode deployment document by @ltd0924 in #3416
- [docs] cherry-pick update docs by @zoooo0820 in #3422
- [Docs]update installation readme by @yongqiangma in #3435
- [Docs] release 2.1 by @ming1753 in #3441
- [Docs]Updata docs of graph opt backend by @gongshaotian in #3443
- [Feature] Support logprob in scheduler v1 for release/2.1 by @rainyfly in #3446
- [Bugfix]fix config bug in dynamic_weight_manager by @gzy19990617 in #3432
- [Feature] Pass through the chat_template_kwargs to the data processing module by @luukunn in #3469
- [CI] fix run_ci error in release/2.1 by @EmmonsCurse in #3499
- [BugFix] fix ep real_bsz by @lizexu123 in #3396
- [Feature] add prompt_tokens and completion_tokens by @memoryCoderC in #3505
- [fix] setting disable_chat_template while passing prompt_token_ids led to response error by @liyonghua0910 in #3511
- [Excutor] Fixed the issue of CUDA graph execution failure caused by dโฆ by @gongshaotian in #3512
- [Feature] add tool parser by @luukunn in #3518
- [BUGFIX] fix ep mixed bug by @ltd0924 in #3513
- [BugFix] Api server bugs by @ltd0924 in #3530
- [Feature] Support limit thinking len for text models by @K11OntheBoat in #3527
- [Bug Fix] Close get think_end_id for XPU for now. by @K11OntheBoat in #3563
- [Feature] Support mixed deployment with yiyan adapter by @rainyfly in #3533
- [Cherry-Pick] Launch expert_service before kv_cache initialization in worker_process by @zeroRains in #3558
- ใBugFixใcompletionๆฅๅฃechoๅๆพๆฏๆ by @AuferGachet in #3477
- [fix] fix completion stream api output_tokens not in usage by @liyonghua0910 in #3588
- [fix] fix ZmqIpcClient.close() error by @liyonghua0910 in #3600
- [Bugfix] Correct logprobs aggregation for multiple prompts in /completions endpoint by @sunlei1024 in #3620
- [BugFix] ep mixed mode offline exit failed by @ltd0924 in #3623
- ใBugfixใไฟฎๅค2.1ๅๆฏไธ0.3Bๆจกๅๆง่ฝๅคงๅน ไธ้ by @AuferGachet in #3624
- [CI] add cleanup logic in release/2.1 workflows by @EmmonsCurse in #3655
- [BugFix] fix parameter is 0 by @ltd0924 in #3663
- [fix] qwen output inconsistency when top_p=0 (#3634) by @liyonghua0910 in #3662
- Revert "[BugFix] fix parameter is 0" by @Jiang-Jia-Jun in #3681
- [feat] add metrics for yiyan adapter by @liyonghua0910 in #3615
- [bugfix]PR3663 parameter is 0 by @ltd0924 in #3679
- [BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER. by @lizexu123 in #3670
- Revert "[BugFix] Modify the bug in Qwen2 when enabling ENABLE_V1_KVCACHE_SCHEDULER." by @Jiang-Jia-Jun in #3719
- [Cherry-Pick] fix the bug when num_key_value_heads < tensor_parallel_size by @zeroRains in #3722
- [Optimize] Increase zmq buffer size to prevent apiserver too slowly tโฆ by @gongshaotian in #3728
- [Fix] Do not drop result when request result slowly by @rainyfly in #3704
- [Bug fix] Fix prefix cache in v1 by @rainyfly in #3710
- [Bug fix] Fix mix deployment perf with yiyan adapter in release21 by @rainyfly in #3703
Full Changelog: v2.1.0...v2.1.1
Assets 2
v2.1.0
d998efb
Compare
FastDeploy v2.1.0้่ฟๅ็บงKVCache่ฐๅบฆๆบๅถใๅขๅผบ้ซๅนถๅๅบๆฏ่ฝๅไปฅๅไธฐๅฏ้ๆ ท็ญ็ฅ๏ผ่ฟไธๆญฅๆๅ็จๆทไฝ้ชๅๆๅก็จณๅฎๆง๏ผ้่ฟCUDA GraphไปฅๅMTP็ญๅค้กนไผๅๆๅๆจ็ๆง่ฝ๏ผๆญคๅค๏ผ่ฟๆฐๅขๆฏๆๅคๆฌพๅฝไบง็กฌไปถไธๆๅฟๅผๆบๆจกๅ็ๆจ็่ฝๅใ
ไฝฟ็จไฝ้ชไผๅ
- KVCache่ฐๅบฆๆบๅถๅ็บง๏ผ้็จ่พๅ
ฅไธ่พๅบ็KVCache็ปไธ็ฎก็ๆนๅผ๏ผ่งฃๅณๆญคๅ็ฑไบ
kv_cache_ratio
ๅๆฐ้ ็ฝฎไธๅฝๅฏผ่ด็OOM้ฎ้ข๏ผ่งฃๅณๅคๆจกๆๆจกๅ็ฑไบ่พๅบKVCacheไธ่ถณ๏ผ็ๆๆๅ็ปๆ็้ฎ้ขใ้จ็ฝฒๆถ้่ฟ้ ็ฝฎ็ฏๅขๅ้export ENABLE_V1_KVCACHE_SCHEDULER=1
ๅฏ็จ๏ผไธไธช็ๆฌไผ้ป่ฎคๅผๅฏ๏ผ๏ผๅณๅฏไธๅไพ่ตkv_cache_ratio
็่ฎพ็ฝฎ๏ผๆจ่ไฝฟ็จใ - ้ซๅนถๅๅบๆฏๅ่ฝๅขๅผบ๏ผๅขๅ
max_concurrency
/max_waiting_time
ๆงๅถๅนถๅ๏ผๅฏนไบ่ถ ๆถ่ฏทๆฑ่ฟ่กๆ็ปไผๅ็จๆทไฝ้ช๏ผไฟ้ๆๅก็จณๅฎๆงใ - ๅคๆ ท็้ๆ ทๆนๅผๆฏๆ๏ผๆฐๅข
min_p
ใtop_k_top_p
้ๆ ทๆนๅผๆฏๆ๏ผไฝฟ็จๆนๅผๅ่ ้ๆ ท่ฏดๆ๏ผๅๆถๅขๅ ๅบไบRepetition็ญ็ฅๅๅบไบstop่ฏๅ่กจๆฉๅ่ฝๅ๏ผ่ฏฆ่ง ๆฉๅ่ฏดๆใ - ๆๅกๅ้จ็ฝฒ่ฝๅๆๅ๏ผๅขๅ
return_token_ids
/include_stop_str_in_output
/logprobs
็ญๅๆฐๆฏๆ่ฟๅๆดๅฎๆด็ๆจ็ไฟกๆฏใ - ้ป่ฎคๅๆฐไธๆง่ฝๆๅ๏ผๅขๅผบๅ max_num_seqs้ป่ฎคๅผไธๅฎ้ ๅนถๅไธไธ่ดๆถๆง่ฝไธ้้ฎ้ข๏ผ้ฟๅ ๆๅจไฟฎๆนmax_num_seqsใ
ๆจ็ๆง่ฝไผๅ
- CUDA Graph่ฆ็ๆดๅคๅบๆฏ๏ผ่ฆ็ๅคๅกๆจ็๏ผๆฏๆไธไธไธๆ็ผๅญใChunked Prefillๅๆถไฝฟ็จ๏ผๅจERNIE 4.5็ณปๅใQwen3็ณปๅๆจกๅไธๆง่ฝๆๅ17%~91%๏ผ่ฏฆ็ปไฝฟ็จๅฏไปฅๅ่ๆไฝณๅฎ่ทตๆๆกฃใ
- MTPๆๆบ่งฃ็ ๆง่ฝๆๅ ๏ผไผๅ็ฎๅญๆง่ฝ๏ผๅๅฐCPU่ฐๅบฆๅผ้๏ผๆๅๆดไฝๆง่ฝ๏ผๅๆถ๏ผ็ธๆฏv2.0.0็ๆฌๆฐๅขERNIE-4.5-21B-A3BๆจกๅๆฏๆMTPๆๆบ่งฃ็ ใ
- ็ฎๅญๆง่ฝไผๅ๏ผไผๅW4A8ใ KVCache INT4ใWINT2 Group GEMM็ญ่ฎก็ฎKernel๏ผๆๅๆง่ฝ๏ผๅฆERNIE-4.5-300B-A47B WINT2ๆจกๅๆง่ฝๆๅ25.5%ใ
- PDๅ็ฆปๅฎๆๆดๅคๆจกๅ้ช่ฏ๏ผP่็นๅฎๅFlashAttentionๅ็ซฏ๏ผๆๅ้ฟๆๆจ็ๆง่ฝ๏ผๅนถๅบไบERNIE-4.5-21B-A3B็ญ่ฝป้ๆจกๅๅฎๆ้ช่ฏใ
ๅฝไบง็กฌไปถ้จ็ฝฒ่ฝๅๅ็บง
- ๆฐๅขๆฏๆๆไป่ฏP800ไธERNIE-4.5-21B-A3Bๆจกๅ้จ็ฝฒ๏ผๆดๅค่ฏดๆๅ่ ๆไป่ฏP800้จ็ฝฒๆๆกฃใ
- ๆฐๅขๆฏๆๆตทๅ K100-AIไธERNIE4.5ๆๆฌ็ณปๅๆจกๅ้จ็ฝฒ๏ผๆดๅค่ฏดๆๅ่ ๆตทๅ K100-AI้จ็ฝฒๆๆกฃใ
- ๆฐๅขๆฏๆ็งๅS60ไธERNIE4.5ๆๆฌ็ณปๅๆจกๅ็้จ็ฝฒ๏ผๆดๅค่ฏดๆๅ่ ็งๅS60้จ็ฝฒๆๆกฃใ
- ๆฐๅขๆฏๆๅคฉๆฐๅคฉๅ150ไธERNIE-4.5-300B-A47BๅERNIE-4.5-21B-A3Bๆจกๅ้จ็ฝฒ๏ผๅนถไผๅๆจ็ๆง่ฝ๏ผๆดๅค่ฏดๆๅ่ ๅคฉๆฐ้จ็ฝฒๆๆกฃใ
ERNIE4.5 ๆจกๅๅฝไบง็กฌไปถๆจ็้้ ๆ ๅต๏ผโ ๅทฒๆฏๆ ๐ง ้้ ไธญ โๆๆ ่ฎกๅ๏ผ | ||||||
---|---|---|---|---|---|---|
ๆจกๅ | ๆไป่ฏP800 | ๆ่ พ910B | ๆตทๅ K100-AI | ๅคฉๆฐๅคฉๅ150 | ๆฒๆฆๆฆไบC550 | ็งๅS60/L600 |
ERNIE4.5-VL-424B-A47B | ๐ง | ๐ง | โ | โ | โ | โ |
ERNIE4.5-300B-A47B | โ | ๐ง | โ | โ | ๐ง | โ |
ERNIE4.5-VL-28B-A3B | ๐ง | ๐ง | โ | ๐ง | โ | โ |
ERNIE4.5-21B-A3B | โ | ๐ง | โ | โ | โ | โ |
ERNIE4.5-0.3B | โ | ๐ง | โ | โ | โ | โ |
็ธๅ ณๆๆกฃๅ่ฏดๆ
- ๅ็บงๅฏน้ฃๆกจๆกๆถ็ไพ่ต**๏ผFastDeploy v2.1.0็ๆฌไพ่ตPaddlePaddle v3.1.1็ๆฌ**๏ผPaddlePaddleๅฎ่ฃ ๆนๅผ่ฏทๅ่้ฃๆกจๅฎ็ฝๅฎ่ฃ ่ฏดๆ
- FastDeploy v2.1.0็ๆๅก้จ็ฝฒ่ฏทๆฑไธๅๆจ่ไฝฟ็จmetadataๅญๆฎต๏ผDeprecated๏ผv2.1.0็ๆฌๅฏไฝฟ็จ๏ผๆชๆฅไผ็งป้ค๏ผ๏ผๆดๆฐไธบไฝฟ็จextra_body๏ผ่ฏฆ่งๅๆฐๆฏๆ่ฏดๆ
- FastDeployๅค็กฌไปถๅฎ่ฃ ๅ็ผ่ฏ่ฏดๆ
- FastDeploy้จ็ฝฒๅๆฐ
- ๆๅกๅ้จ็ฝฒไฝฟ็จ่ฏดๆ
- GPU้จ็ฝฒๆไฝณๅฎ่ทต
ๆด่ฏฆ็ป็่ฏดๆๅไธพๅฆไธ๏ผ
-
ๆฐๅขๅ่ฝ
- PDๅ็ฆปDๆๅกๆฏๆW4A8ๅจ็บฟ/็ฆป็บฟ้ๅ
- PDๅ็ฆปๅผๅฏChunked Prefillไธๆฏๆ้Chunk็KVCacheไผ ่พ
- ๆฏๆlogprobs่ฟๅ
- ๆฏๆOpenTelemetry้้่ฏทๆฑๅค็็ถๆ
- ๆฐๅขreturn_token_idsๅๆฐ๏ผๆฏๆ่ฟๅ่ฏทๆฑ็่พๅ ฅๅ่พๅบToken IDๅ่กจ
- ๆฐๅขinclude_stop_str_in_outputๅๆฐ๏ผๆฏๆ็ปๆ็ฌฆ็่ฟๅ
- ๆฐๅขQwQๆจกๅ enable_thinkingๅๆฐๆงๅถๆ่ๆจกๅผๅผๅ ณ
- ๆฐๅขrepetitionๆฉๅๅ่ฝๆฏๆ
- ๆฐๅขstopๅๆฐๆฏๆ
- ๆฐๅขๅคๆบๅผ ้ๅนถ่ก้จ็ฝฒๆฏๆ
- ๆฐๅขๆๅก่ฏทๆฑๅนถๅไธ่ถ ๆถๆงๅถ
- ๆฏๆmin_p/top_k_top_p้ๆ ท
- ๆฏๆbad_words
- ไผๅOpenAI API-Serverๆฅๅฃ๏ผๆฏๆextra_bodyๆฉๅ ้ขๅคๅๆฐๆฏๆ๏ผๅบๅผmetadata็ไฝฟ็จ
-
ๆง่ฝไผๅ
- PDๅ็ฆปEPๅนถ่กไธDecode็W4A8่ฎก็ฎๆง่ฝไผๅ
- ๅบไบๆ้้ๆไผๅWINT2 Group-GEMM็ฎๅญKernelๆง่ฝ
- MTPไผๅไธๆฏๆๅผๅฏChunked Prefill
- ไผๅMTP & ๆๆบ่งฃ็ ๆจ็ๆง่ฝ
- ๅบไบTritonไผๅBlockwise FP8้ๅๆง่ฝ
- CUDA Graph ๆฏๆ Padding Batch๏ผๆพๅญๅ ็จๅคงๅน ๅๅฐ
- ๆฐๅขCustom All Reduce็ฎๅญ๏ผCUDA GraphๆฏๆTPๅนถ่ก
- ๆฏๆChunked PrefillไธๅผๅฏCUDA Graph
- GetBolockShapeAndSplitKVBlock็ฎๅญๆง่ฝไผๅ
- AttentionๆฏๆC4้ๅฏน็งฐ้ๅๆจ็
- FlashAttnๅ็ซฏ้้ TPๅนถ่กๅๆฏๆFlashAttention V2
- KVCache็ฎก็ๆบๅถๅ็บง๏ผๅฝๅไป ๆฏๆGPU๏ผ้่ฟexport ENABLE_V1_KVCACHE_SCHEDULER=1ๅฏ็จ
- FlashAttention V3ไธๆฏๆๅผๅฏC16/C8/C4็Chunked Prefillไผๅ
- ๆๅก้จ็ฝฒๆฏๆEngine่ชๅจ่ๅ็ๆ็ปๆๆๅๆๅกไธๅฎขๆท็ซฏ้ไฟกๆ็
-
ๅค็กฌไปถๆฏๆ
- ๆไป่ฏ P800ๆฏๆERNIE-21B-A3B Wint4/Wint8ๆจกๅ
- ๆตทๅ K100-AIๆฏๆERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3Bๆจกๅ
- ็งๅS60ๆฏๆERNIE4.5็ณปๅๆจกๅ
- ๅคฉๆฐๆฏๆERNIE-4.5-300B-A47B & ERNIE-4.5-21B-A3Bๆจกๅ๏ผๅนถ่ฟ่กๆง่ฝไผๅ
-
Bugไฟฎๅค
- ไฟฎๅคPDๅ็ฆป้จ็ฝฒๆถๆๅผๅฏMTPๆถDๆๅก้ฆToken้่ฏฏ้ฎ้ข
- ไฟฎๅคSFTๅๆๅฟ็บฏๆๆจกๅToken้ๆ ท่ถ็้ฎ้ข
- ไฟฎๅคXPU ้0ๅกๅฏๅจๆพๅญOOM็้ฎ้ข
- ไฟฎๅคXPUไฝฟ็จENABLE_V1_KVCACHE_SCHEDULER=1ๆง่ฝไธ้้ฎ้ข
- ไฟฎๅคChunked Prefillไธๅคๆจกๆๆจกๅๅนถๅๆจ็ๆจกๅๅดฉๆบ้ฎ้ข
- ไฟฎๅคQwen3-8Bๆจกๅ็ๆ็ปๆไนฑ็ ็้ฎ้ข
- ไฟฎๅคRMSNorm็กฌ็ผ็ ็้ฎ้ข
- ไฟฎๅคlinear.pyไธญqkv_biasๆฒกๅฎไน็้ฎ้ข
- ไฟฎๅคmax_tokens=1ๆถๆฅ้็้ฎ้ข
- ไฟฎๅคtoken_processor่พๅ ฅๆฅๅฟๆ ผๅผ้ฎ้ข
- ไฟฎๅคchunked_prefillไธ๏ผchunk size ๅฐไบblock sizeๆๅกhang้ฎ้ข
- ไฟฎๅคvl ๅบๆฏไธๆฐๆฎไฟๅญ้ฎ้ข
-
ๆๆกฃ
- ๅขๅ ไธญๆReadMEๅMKDocsๆฏๆ
- ๆฐๅขๅๆจกๅ้จ็ฝฒๆไฝณๅฎ่ทตๆๆกฃ
- ๅขๅ SamplingๅEarly Stoppingไฝฟ็จๆๆกฃ่ฏดๆ
- ๆดๆฐCUDA Graphไธๅจ่ฝฌ้ไฝฟ็จๆฅๅฃๅๆๆกฃ
- ๆดๆฐๆจกๅๆฏๆๆๆกฃ
-
ๅ ถๅฎ
- ๆฐๅขไธญ่ฑๆๆๆกฃๆฏๆ
- ไผๅๆจกๅๅ ่ฝฝ้ๅๆจกๅๅๆฐๆฅ้ไฟกๆฏ
- ็ปไธๅคๆจกๆๆจกๅๅ็บฏๆๆจกๅ็ModelRunner
- ๅบไบtriton_utilsๆดๆฐWINT2 Triton็ฎๅญ
- ไผๅไปฃ็ ไธญๅคไธชConfigๅฎ็ฐ็นๆ้ฎ้ข
What's Changed
- add wint2 performance by @ZhangHandi in #2673
- Update gh-pages.yml by @DDDivano in #2680
- add --force-reinstall --no-cache-dir when pip install fastdeploy*.whl by @yuanlehome in #2682
- [Sync] Update to latest code by @Jiang-Jia-Jun in #2679
- [doc] update docs by @kevincheng2 in #2690
- [Bug] fix logger format by @ltd0924 in #2689
- [feat] support fa3 backend for pd disaggregated by @yuanlehome in #2695
- add quick benchmark script by @DDDivano in #2703
- [Doc] modify reasoning_output docs by @LiqinruiG in #2696
- [MTP] Support chunked_prefill in speculative decoding(MTP) by @freeliuzc in #2705
- [RL] update reschedule finish reason by @ltd0924 in #2709
- [feature]add fd whl version info by @gzy19990617 in #2698
- Extract eh_proj Layer from ParallelLMHead for MTP to Avoid Weight Transposition Issue by @Deleter-D in #2707
- ๆทปๅ XPU CI, test=model by @quanxiang-liu in #2701
- [CI] Add validation for MTP and CUDAGraph by @EmmonsCurse in #2710
- add support QWQ enable_thinking by @lizexu123 in #2706
- [BugFix] fix paddle_git_commit_id error by @EmmonsCurse in #2714
- spec token map lazy. by @wtmlon in #2715
- fix bug. by @wtmlon in #2718
- ไฟฎๆนXPU CI, test=model by @quanxiang-liu in #2721
- [LLM] support multi node deploy by @ltd0924 in #2708
- [Doc]Update eb45-0.3B minimum memory requirement by @ckl117 in #2686
- [RL] Check if the controller port is available by @lddfym in #2724
- remove redundant install whl of fastdeploy by @yuanlehome in #2726
- support FastDeploy version setting by @XieYunshen in #2725
- [iluvatar_gpu] Adapt for iluvatar gpu by @liddk in #2684
- [Optimize] Optimize tensorwise fp8 performance by @ming1753 in #2729
- [Bug fix] fix complie bug when sm < 89 by @ming1753 in #2738
- [SOT] Remove BreakGraph with
paddle.maximum
by @DrRyanHuang in #2731 - ใFeartureใsupport qwen2 some func by @gzy19990617 in #2740
- [GCU] Support gcu platform by @EnflameGCU in #2702
- [Bug fix] Fixed the garbled text issues in Qwen3-8B by @lizexu123 in #2737
- [Bug fix] Add the missing
pod_ip
param to the launch_cache_manager function. by @Wanglongzhi2001 in #2742 - [Bug fix] fix attention rank init by @RichardWooSJTU in #2743
- add precision check for ci by @xiegetest in #2732
- [SOT] Make custom_op dy&st unified by @DrRyanHuang in #2733
- Revert "[Bug fix] fix attention rank init" by @RichardWooSJ...
Assets 2
v2.0.0
a1fa84e
Compare
FastDeploy 2.0: Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
News
๐ฅ Released FastDeploy v2.0: Supports inference and deployment for ERNIE 4.5. Furthermore, we open-source an industrial-grade PD disaggregation with context caching, dynamic role switching for effective resource utilization to further enhance inference performance for MoE models.
About
FastDeploy is an inference and deployment toolkit for large language models and visual language models based on PaddlePaddle. It delivers production-ready, out-of-the-box deployment solutions with core acceleration technologies:
- ๐ Load-Balanced PD Disaggregation: Industrial-grade solution featuring context caching and dynamic instance role switching. Optimizes resource utilization while balancing SLO compliance and throughput.
- ๐ Unified KV Cache Transmission: Lightweight high-performance transport library with intelligent NVLink/RDMA selection.
- ๐ค OpenAI API Server and vLLM Compatible: One-command deployment with vLLM interface compatibility.
- ๐งฎ Comprehensive Quantization Format Support: W8A16, W8A8, W4A16, W4A8, W2A16, FP8, and more.
- โฉ Advanced Acceleration Techniques: Speculative decoding, Multi-Token Prediction (MTP) and Chunked Prefill.
- ๐ฅ๏ธ Multi-Hardware Support: NVIDIA GPU, Kunlunxin XPU, Hygon DCU, Ascend NPU, Iluvatar GPU, Enflame GCU, MetaX GPU etc.
Supported Models
Model | Data Type | PD Disaggregation | Chunked Prefill | Prefix Caching | MTP | CUDA Graph | Maximum Context Length |
---|---|---|---|---|---|---|---|
ERNIE-4.5-300B-A47B | BF16/WINT4/WINT8/W4A8C8/WINT2/FP8 | โ | โ | โ | โ (WINT4) | WIP | 128K |
ERNIE-4.5-300B-A47B-Base | BF16/WINT4/WINT8 | โ | โ | โ | โ (WINT4) | WIP | 128K |
ERNIE-4.5-VL-424B-A47B | BF16/WINT4/WINT8 | WIP | โ | WIP | โ | WIP | 128K |
ERNIE-4.5-VL-28B-A3B | BF16/WINT4/WINT8 | โ | โ | WIP | โ | WIP | 128K |
ERNIE-4.5-21B-A3B | BF16/WINT4/WINT8/FP8 | โ | โ | โ | WIP | โ | 128K |
ERNIE-4.5-21B-A3B-Base | BF16/WINT4/WINT8/FP8 | โ | โ | โ | WIP | โ | 128K |
ERNIE-4.5-0.3B | BF16/WINT8/FP8 | โ | โ | โ | โ | โ | 128K |