Release v3.0.0-beta4 · PaddlePaddle/PaddleNLP
v3.0.0-beta4
Pre-release
Compare · 15 commits to release/3.0-beta4-new since this release · a286abc
This commit was created on GitHub.com and signed with GitHub’s verified signature.
In this release we fully integrate DeepSeek R1-style reasoning models. The inference team has deeply optimized model inference, delivering industry-leading speed. In addition, we release our in-house PP-UIE information extraction models. The key updates are as follows.
Highlights:
- New models
- Inference & deployment
  - Full support for FP8, INT8, and 4-bit quantized inference of the full-size DeepSeek V3/R1 models, plus MTP speculative decoding.
  - FP8 inference exceeds 1,000 output tokens/s on a single machine; 4-bit single-machine deployment exceeds 2,100 output tokens/s!
  - In a first collaboration with the inference team, we publish a unified inference-deployment image for one-click deployment of popular models. The inference deployment documentation has been fully refreshed for a much better experience; see the docs, and the request sketch after this list.
- Model training:
  - Added large-model embedding training, with INF-CL support for extremely large batch sizes.
  - Added the MergeKit model-merging tool to mitigate the cost of alignment; see the docs.
  - Fully optimized low-resource training: training now runs smoothly on GPUs with as little as 16 GB of memory.
- Other key features:
  - The documentation site now shows a model list where users can view and download the corresponding model files; see the docs.
  - Training adds the adam-mini optimizer, and AdamW now supports BF16 momentum (a configuration sketch appears at the end of section 2).
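For the unified deployment image above, PR #9828 (section 3) makes the served endpoint speak the OpenAI API, so any OpenAI-style client can talk to a deployed model. A minimal request sketch, assuming a service is already running locally; the base URL, port, API key, and model id are placeholder assumptions, not values documented in this release:

```python
# Minimal sketch: query a locally deployed PaddleNLP service through its
# OpenAI-compatible API (PR #9828). base_url, api_key, and the model id
# are placeholder assumptions, not values documented in this release.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder model id
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```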
The corresponding update details follow:
1. Model and framework component updates
- New models
  - Newly added models:
    - paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B
    - deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base, deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero,
    - deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    - Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen2.5-14B-Instruct-1M, Qwen/QwQ-32B, Qwen/QwQ-32B-Preview
  - PR #9738: added the DeepSeek V3 model. PR #9876: added MTP support. PR #9797: fixed a TP issue. PR #9643: added model notes for DeepSeek and Llama 3.3 (@DrownFish19)
  - PR #9906: DeepSeek V3 supports loading Float8 weights directly in dynamic-graph mode for inference (@ZHUI)
  - PR #9845: added the PP-UIE model series (@Fantasy-02). PR #9911 & PR #9913: PP-UIE documentation updates (@DrownFish19); a Taskflow usage sketch closes this section
- Tokenizer improvements
  - PR #9548, PR #9577, PR #9594: the “Hackathon No.43” series, rounding out TokenizerFast support (@yinfan98)
  - PR #9745: fixed an AutoTokenizer issue (@DrownFish19). PR #9837: save extra special tokens (@DesmonDay)
- Unified Checkpoint:
- MergeKit enhancements and optimizations
  - New features and optimizations
    - PR #9561: added mergekit_with_sparsify, enabling sparsified merging (@Mangodadada).
    - PR #9702: improved MergeKit's GPU support for more efficient processing (@Mangodadada).
    - PR #9811: added LoRA (low-rank adapter) merging, extending model-fusion capabilities (@lugimzzz).
  - Tool updates and maintenance
    - PR #9885: code updates and maintenance for the MergeKit tool, streamlining its overall logic.
  - Logging and debugging support
    - PR #9948: added logging to MergeKit (@lugimzzz).
- Low-resource optimizations
  - PR #9804: added use_fused_linear_cross_entropy support to reduce GPU memory, and introduced pre_divided_factor to avoid FP16 overflow.
- Documentation updates and miscellaneous:
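As referenced in the PP-UIE item above, the new models plug into Taskflow's information_extraction task. A minimal sketch, assuming the Taskflow usage documented in the PP-UIE README; the schema entries and input sentence are illustrative only:

```python
# Minimal sketch: information extraction with a PP-UIE model via Taskflow
# (PR #9845). The schema entries and input sentence are illustrative.
from paddlenlp import Taskflow

schema = ["时间", "选手", "赛事名称"]  # entity types to extract
ie = Taskflow("information_extraction",
              schema=schema,
              model="paddlenlp/PP-UIE-0.5B")
print(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
```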
2. LLM training updates
- General training
  - PR #9204: updated tensor/pipeline parallelism for chatglmv2 (@DrownFish19)
  - PR #9827: added pipeline and flashmask support for Qwen2Moe and DeepSeek (@DrownFish19)
- Embedding training
  - PR #9508: added the embedding trainer (@DesmonDay). PR #9673: added INF-CL support for extremely large batch sizes (@jie-z-0607)
  - PR #9656: fixed loading of the RNG state in the Trainer (@DesmonDay)
  - PR #9721: fixed embedding randomness (@DesmonDay)
- DPO training
- New features
  - PR #9542: added adam-mini optimizer support (@lugimzzz)
  - PR #9732: support AdamW training with BF16 momentum (@lugimzzz); see the optimizer sketch at the end of this section
  - PR #9830: fixed checkpoint saving in non-flash mode (@SylarTiaNII)
  - PR #9705: cherry-pick: validate loss before the optimizer step (@SylarTiaNII)
  - PR #9704: cherry-pick: added an asynchronous metrics dumper for LLM training (@SylarTiaNII)
- Training documentation and fixes
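For the optimizer items above (adam-mini in PR #9542, BF16 momentum for AdamW in PR #9732), a minimal sketch of how they would be selected through the trainer's arguments. The optim value "adamw_mini" is an assumption inferred from the PR title; check the PRs for the authoritative flag names:

```python
# Minimal sketch: selecting this release's optimizer additions through
# PaddleNLP trainer arguments. The optim value "adamw_mini" is an assumption
# inferred from PR #9542; PR #9732 extends AdamW momentum to BF16.
from paddlenlp.trainer import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    optim="adamw_mini",  # assumed name of the adam-mini optimizer (PR #9542)
    bf16=True,           # BF16 run; PR #9732 keeps AdamW momentum in BF16 too
)
print(args.optim, args.bf16)
```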
3. Inference updates
- Predictor & Flask updates
  - PR #9831: fixed multi-batch inference (@DrownFish19)
  - PR #9841: fixed position_ids-related issues (@DrownFish19)
  - PR #9864: updated DeepSeek inference (@DrownFish19); see the generation sketch at the end of this section
  - PR #9828: the Flask server makes inference compatible with the OpenAI API (@ZHUI)
- MTP improvements
  - PR #9856: support MTP with DeepSeek-V3 in inference (@freeliuzc)
  - PR #9894: fixed MTP for Deepseek_v3 in multi-GPU mode (@freeliuzc)
  - PR #9936: added MTP serving support (@freeliuzc)
- Deployment improvements
  - PR #9872: support multi-machine LLM deployment (@ltd0924)
  - PR #9791: merged parts of the fastdeploy code (@kevincheng2)
- Kernel optimizations
- Documentation updates and tests
  - PR #9613: inference support for llama3.2 and documentation updates (@yuanlehome)
  - PR #9921: fixed the block_size setting for llama (@zhaohaixu)
  - PR #9711: added common-model and parameter unit tests for the LLM predictor (@aooxin)
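Beyond the service path, the newly added checkpoints can be exercised directly in dynamic-graph mode through PaddleNLP's Auto classes. A minimal sketch; the model id comes from the new-model list in section 1, while the dtype and generation parameters are illustrative assumptions:

```python
# Minimal sketch: dynamic-graph generation with one of the newly added
# distilled DeepSeek-R1 checkpoints. dtype and generation parameters are
# illustrative assumptions.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="bfloat16")

inputs = tokenizer("What is 1 + 1?", return_tensors="pd")  # "pd" = Paddle tensors
outputs = model.generate(**inputs, max_new_tokens=64)
# PaddleNLP's generate returns (token_ids, scores); decode the first sequence.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```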
4. AutoParallel / distributed training updates
- Auto parallel
  - PR #9578: added a llama2-7b-cinn test (@zhangbo9674)
- Base configuration and CI integration
  - PR #9538: added qwen model_auto and CI (@blacksheep-Aristotle)
  - PR #9541: added a llama3.1 auto-parallel configuration (@zhiqiu)
  - PR #9551: added auto CI support for gpt and baichuan (@blacksheep-Aristotle)
  - PR #9591: added CE support for gpt, baichuan, and qwen (@blacksheep-Aristotle)
  - PR #9412: added the single_model network and use of the intermediate API (@blacksheep-Aristotle)
  - PR #9943: control split input via training_args (@blacksheep-Aristotle)
- Tests, validation, and feature switches
  - PR #9621: added a PIR recompute test (@waliwali777)
  - PR #9647: adjusted loss_base after dropout gained SPMD support (@deepllz)
  - PR #9714: added switches for stage-1 tensor fusion (@AndSonder)
  - PR #9672: fixed the recompute test when running with to_static=1 (@waliwali777)
  - PR #9688: merge checkpoints for inference under auto parallelism (@xuxinyi389)
  - PR #9750 & PR #9753: fixed CI errors in the ernie auto trainer (@blacksheep-Aristotle)
  - PR #9749: enabled tensor fusion for benchmarks (@AndSonder)
  - PR #9810: added a save/load switch for sharding tensor fusion (@AndSonder)
  - PR #9862: support DP/MP for deepseekv2 (@xuxinyi389)
  - PR #9823: added PPO checkpoint support (@xuxinyi389)
5. CI, documentation, benchmark, and test-script updates
- CI scripts and warning filtering
  - PR #9547: updated CI scripts (@Liujie0926)
  - PR #9612: filter paddle.to_tensor warnings in CI (@DrownFish19)
  - PR #9626: updated the a100 loss_base configuration (@Liujie0926)
  - PR #9889: CI script updates (@Liujie0926)
  - PR #9524: added qwen2.5-7b to the LLM benchmark (@Liujie0926)
  - PR #9662 & PR #9722: updated the LLM_benchmark scripts (@Liujie0926)
- Documentation improvements
  - PR #9585: fixed dead links in the docs (@DrownFish19)
  - PR #9668: updated README.md (@ZHUI)
  - PR #9785: updated the documentation README (@ZHUI)
  - PR #9746: documentation fixes (@DrownFish19)
  - PR #9725: adjusted benchmark environment variables and model configurations (@XieYunshen)
  - PR #9877: corrected the inference and serving docs (@ZHUI)
  - PR #9834: published the DeepSeek announcement and notes (@DrownFish19)
  - PR #9922: corrected errors in the fine-tuning docs (@sijunhe)
- Benchmark configuration and tests
  - PR #9651: fixed abnormal exits of multi-machine benchmark jobs (@XieYunshen)
  - PR #9891: updated the best configuration for gpt-13b in dygraph mode (@liym27)
6. NPU/XPU and hardware-related updates
- NPU adaptation and fixes
  - PR #9499: adapted FusedHeadAndCrossEntropy for NPU (@tianhaodongbd)
  - PR #9573: fixed a where bug on NPU (@tianhaodongbd)
  - PR #9762: adapted to the new flash_attention_npu API (@will-jl944)
- XPU features and optimizations
  - PR #9549: qwen2 supports flash_attn on XPU (@will-jl944)
  - PR #9660: qwen2 supports fused_rope (@will-jl944)
  - PR #9789: support empty_cache on XPU (@will-jl944)
  - PR #9796: XPU support for auto-parallel LLaMa (@From00)
  - PR #9854: added XPU fused ops for deepseek (@QingshuChen)
7. Bug fixes, performance optimizations, and other improvements
- State loading and multithreading
  - PR #9464: fixed load_state_dict under multithreading (@DesmonDay)
- Assorted model and operator fixes
  - PR #9603: fixed a d2s bug in qwen2 modeling (@wawltor)
  - PR #9569: fixed norm outputs in dynamic and static modes (@Wangzheee)
  - PR #9652: fixed paddle.where (@will-jl944)
  - PR #9638: added the replace_with_c_embedding config (@Xing-lil)
  - PR #9699: fixed a LoRA-GA AMP issue (@greycooker)
  - PR #9752: fixed a bug in get_block_shape_and_split_kv_block (@lizhenyun01)
  - PR #9759: fixed the speculate_verify_and_update op (@Wanglongzhi2001)
  - PR #9674: merged speculate_step into the step op (@Wanglongzhi2001)
  - PR #9757: updated sequence parallel in the Trainer module (@DesmonDay)
  - PR #9765: fixed LoRA-GA merging (@greycooker)
  - PR #9777: cherry-pick: support fused optimizer in distributed training (@SylarTiaNII)
  - PR #9783: fixed CE errors (@blacksheep-Aristotle)
  - PR #9779: fixed unsafe pickle loading (@DrownFish19)
  - PR #9760: fixed expert parallel in the MoE module (@DesmonDay)
  - PR #9790: added a pir_model path for server inference (@aooxin)
  - PR #9706: cherry-pick: integrated the PDC SDK for LLM training fault tolerance (@SylarTiaNII)
  - PR #9624: added FLAGS replacing four parameters, tunable for better speedup (@zhink)
  - PR #9806: fixed a LLAMA argument-parsing bug (@will-jl944)
  - PR #9829: updated mixtral.md (@yuanlehome)
  - PR #9859: fixed a DeepSeek (dsk) RoPE discrepancy (@yuanlehome)
8. Environment, dependency, and version-compatibility updates
- Requirements and installation updates
  - PR #9514: updated requirements.txt for py38 (@ZHUI)
  - PR #9118: updated installation dependencies (@DrownFish19)
  - PR #9953: added a tokenizers dependency for py38 (@DrownFish19)
- Python version compatibility
  - PR #9853: resolved Python-version compatibility of type annotations (@zty-king)
What's Changed
- Update requirements.txt for py38 by @ZHUI in #9514
- [Unified Checkpoint] fix single card loading without master weights by @DesmonDay in #9540
- Fix multi-threading load_state_dict by @DesmonDay in #9464
- delete generate_rank_mapping when export multi cards model by @yuanlehome in #9552
- [LLM] dpo support qwen2 with flashmask by @wtmlon in #9543
- [XPU] qwen2 supports flash_attn on XPU by @will-jl944 in #9549
- [AutoParallel]: add qwen model_auto and ci by @blacksheep-Aristotle in #9538
- add llama3.1 config for auto_parallel by @zhiqiu in #9541
- Add more model support for speculate_decoding and refactor speculate_decoding by @Wanglongzhi2001 in #9504
- [Intel_HPU]FSDPA custom kernel API update by @yanfeich in #9556
- [Unified Checkpoint] fix load missing keys by @DesmonDay in #9523
- 【Hackathon 7th No.43】Complete TokenizerFast support, part 3 by @yinfan98 in #9548
- adapt code to amsgrad supported adamw by @HydrogenSulfate in #9568
- [CI]update scripts by @Liujie0926 in #9547
- Adapting npu for FusedHeadAndCrossEntropy by @tianhaodongbd in #9499
- 【Hackathon 7th No.43】Complete TokenizerFast support, part 4 by @yinfan98 in #9577
- fix(export_model): fix export_model.py python path by @thinking-computer in #9571
- Fix_ckpt_oom_paddlenlp by @Xing-lil in #9507
- Add GPUEventTimer by @sneaxiy in #9582
- [npu] fix where bug by @tianhaodongbd in #9573
- [doc] Fix dead links by @DrownFish19 in #9585
- [AutoParallel]:add gpt & baichuan auto ci by @blacksheep-Aristotle in #9551
- Add llama2-7b-cinn test by @zhangbo9674 in #9578
- [AutoParallel]:add gpt&baichuan&qwen ce by @blacksheep-Aristotle in #9591
- fix dpo pp eval by @lugimzzz in #9607
- [LLM] update tensor and pipeline parallel for chatglmv2 by @DrownFish19 in #9204
- [Install] Update requirements.txt by @DrownFish19 in #9118
- [Trainer]Fix _get_eval_sampler by @greycooker in #9374
- fix benchmark scripts by @XieYunshen in #9597
- [Trainer] Add embedding trainer by @DesmonDay in #9608
- [CI] filter paddle.to_tensor warnings when set_state_dict by @DrownFish19 in #9612
- fix ckpt quant log by @wtmlon in #9606
- fix the d2s bug in qwen2 modeling by @wawltor in #9603
- 【Hackathon 7th No.43】Complete TokenizerFast support, part 5 by @yinfan98 in #9594
- fix pp_config bug by @tianhaodongbd in #9605
- Speedup FusedHeadAndCrossEntropy by @will-jl944 in #9601
- fix get_save_output op and refactor specu_decoding by @Wanglongzhi2001 in #9576
- [Inference] Fix docs and support llama3.2 by @yuanlehome in #9613
- fix by @DrownFish19 in #9628
- fix norm outputs in dynamic and static mode by @Wangzheee in #9569
- [CI]update a100 loss_base for gpt by @Liujie0926 in #9626
- [LLM benchmark]add qwen2.5-7b by @Liujie0926 in #9524
- Checkpoint Compression Doc by @wtmlon in #9614
- Update unified_checkpoint.md by @DesmonDay in #9634
- add llama and nv-embed training by @Li-Z-Q in #9323
- [News] Unified Checkpoint by @DrownFish19 in #9632
- feat(sdaa): support sdaa backend infer by @thinking-computer in #9570
- [llm]update dpo criterion by @lugimzzz in #9620
- [llm]add adam-mini by @lugimzzz in #9542
- Update version for beta3 by @ZHUI in #9553
- [LLM DOCs] Add deepseek llama3.3 new models by @DrownFish19 in #9643
- [Tokenizer] Fix tokenizer of llama3.3 by @DrownFish19 in #9641
- [AutoParallel] Add test for PIR recompute by @waliwali777 in #9621
- Update README.md for 3.0 beta3 by @ZHUI in #9644
- Add replace_with_parallel_cross_entropy flag by @waliwali777 in #9579
- [AutoParallel] change loss_base after dropout support spmd by @deepllz in #9647
- [Embedding] Add embedding training by @DesmonDay in #9508
- [PEFT]Add LoRA-GA by @greycooker in #9592
- mergekit_with_sparsify by @Mangodadada in #9561
- Fix paddle.where by @will-jl944 in #9652
- Add config replace_with_c_embedding by @Xing-lil in #9638
- Update embedding trainer state by @DesmonDay in #9629
- MoRA Implementation by @lcykww in #9562
- [llm]update peft docs by @lugimzzz in #9655
- [Trainer] Fix loading rng state by @DesmonDay in #9656
- fix qwen & baichuan & gpt ci error by @blacksheep-Aristotle in #9650
- [llm] fix lora by @lugimzzz in #9659
- [XPU] qwen2 supports fused_rope by @will-jl944 in #9660
- update hygon dcu docs by @TimeYWL in #9298
- Make the timer compatible with devices other than GPU by @deepllz in #9665
- [Trainer] update remove_master_weight by @DesmonDay in #9640
- [DOC] Update README.md by @ZHUI in #9668
- [Mthreads] support llama 13B train by @shang-mt in #9666
- Structured Index of Documents by @dfmz759837901 in #9411
- 【Qwen2-VL Inference】add qwen2-vl high performance inference by @chang-wenbin in #9575
- merge docs by @Mangodadada in #9657
- [CI]update blacklist for gpt3 by @Liujie0926 in #9555
- [UX optimization] Consolidate the training CUDA and Triton kernels into paddlenlp_kernel by @JunnYu in #9471
- [Unified Checkpoint] bug fix by @DesmonDay in #9669
- Add tied_weight_keys for pipeline model by @DesmonDay in #9663
- Optimize performance for Qwen2 model by @sneaxiy in #9616
- [MLU] add mlu llama readme by @PeiyuLau in #9671
- Set tensor parallel name mapping when fusion is used by @sneaxiy in #9685
- [LLM] add deploy server by @kevincheng2 in #9581
- [Embedding] Add inf-cl in embedding trainer by @jie-z-0607 in #9673
- [Fix]fix loraga amp by @greycooker in #9699
- [LLM INFER] cutlass 3.x gemm on sm90 by @ckl117 in #9398
- [Iluvatar] Add readme for llama-13b by @tianyuzhou668 in #9670
- [AutoParallel] merge ckpt for inference by @xuxinyi389 in #9688
- update gpt&baichuan&qwen ce name by @blacksheep-Aristotle in #9697
- fix docs by @xuxinyi389 in #9703
- [Inference] Use cuda core(int8_sq) for m <=4 in gemm_dequant OP by @zhink in #9707
- [LLM] [Cherry-Pick] valid loss before optimizer step (#9255) by @SylarTiaNII in #9705
- [llm]support dpo pp for qwen & llama by @lugimzzz in #9695
- support qwen dpo fused kernel by @wtmlon in #9686
- [AutoParallel] Fix recompute test running under to_static=1 by @waliwali777 in #9672
- [LLM_benchmark]update LLM_benchmark scripts by @Liujie0926 in #9662
- [LLM] [Cherry-Pick] add asynchronous metrics dumper for llm training by @SylarTiaNII in #9704
- [llm] Add KTO by @lugimzzz in #9689
- [Embedding] Fix embedding random by @DesmonDay in #9721
- remove refined recompute deep copy by @JunnYu in #9617
- add single_model network and use intermediate api by @blacksheep-Aristotle in #9412
- Refactor custom devices. by @ZHUI in #9734
- Add offload_recompute_inputs by @will-jl944 in #9715
- [LLM] [Cherry-Pick] Integrate PDC SDK for LLM training fault tolerance platform by @SylarTiaNII in #9706
- add common models and common params unit test for llm predictor. by @aooxin in #9711
- Added FLAGS to replace four params and the value can be adjusted for better speedup by @zhink in #9624
- [AutoParallel] add parameter enable_stage1_tensor_fusion_blanced_save_load and enable_stage1_tensor_fusion by @AndSonder in #9714
- Adapt to new npu flash_attention api by @will-jl944 in #9735
- [AutoParallel] Add test for PIR refined recompute by @waliwali777 in #9679
- [Docs] Fix by @DrownFish19 in #9746
- Bugfix update predictor.py by @ZHUI in #9742
- Modify the environment variables and model configuration of the bench… by @XieYunshen in #9725
- [Unified Checkpoint] Fix expert parallel by @DesmonDay in #9741
- [AutoParallel]: fix ernie ci error by @blacksheep-Aristotle in #9750
- fix import bugs. by @aooxin in #9751
- [AutoParallel]ckpt support local views keys to global views keys by @xuxinyi389 in #9604
- Add XLMRoBERTaModel in paddlenlp by @jie-z-0607 in #9720
- [AutoParallel]:fix ernine auto_trainer error by @blacksheep-Aristotle in #9753
- fix get_block_shape_and_split_kv_block by @lizhenyun01 in #9752
- fix speculate_verify_and_update op by @Wanglongzhi2001 in #9759
- [Inference]merge speculate_step into step op by @Wanglongzhi2001 in #9674
- [NPU] Adapt to new flash_attention_npu api by @will-jl944 in #9762
- [Trainer] update sequence parallel by @DesmonDay in #9757
- [tokenizer] Fix AutoTokenizer by @DrownFish19 in #9745
- [LLM] Add DeepseekV3 by @DrownFish19 in #9738
- [AutoParallel] open tensor_fusion for benchmark by @AndSonder in #9749
- fix loraga merge by @greycooker in #9765
- Fix ernie ci auto trainer error by @blacksheep-Aristotle in #9758
- Update README.md by @ZHUI in #9766
- Fix matryoshka norm loss by @DesmonDay in #9774
- [Distributed] [Cherry-Pick] support fuse optimizer (#9519) by @SylarTiaNII in #9777
- Update register_sequence_parallel_allreduce_hooks by @DesmonDay in #9782
- Fix ce error by @blacksheep-Aristotle in #9783
- fix pickle unsafe-load by @DrownFish19 in #9779
- [MoE] fix expert parallel by @DesmonDay in #9760
- fix dpo pp criterion by @wtmlon in #9786
- add pir_model path for server infer. by @aooxin in #9790
- [LLM] [Cherry-Pick] support flash device on static model (#9619) by @SylarTiaNII in #9787
- [LLM Benchmark]update scripts by @Liujie0926 in #9722
- mergekit gpu 1226 by @Mangodadada in #9702
- [LLM] merge code from fastdeploy by @kevincheng2 in #9791
- support eagle for llama by @freeliuzc in #9812
- [CI] Fix by @ZHUI in #9633
- wrap model when lora is ON and only do evaluation. by @wtmlon in #9803
- Update README.md for documention by @ZHUI in #9785
- [Checkpoint compression] Support sharding stage1 v2 by @DesmonDay in #9817
- [LLM] Update model convert and fix TP for deepseekv3 by @DrownFish19 in #9797
- [AutoParallel] add sharding tensor_fusion save load switch by @AndSonder in #9810
- Fix handling of abnormal exits in multi-machine benchmark jobs by @XieYunshen in #9651
- Fix LLAMA arg parsing bug in pp by @will-jl944 in #9806
- Update mixtral.md by @yuanlehome in #9829
- [XPU] Support empty_cache on XPUs by @will-jl944 in #9789
- [Inference] Fix multibatch inference by @DrownFish19 in #9831
- Fix position_ids for infra by @DrownFish19 in #9841
- [LLM] Add pipeline and flashmask for Qwen2Moe and Deepseek by @DrownFish19 in #9827
- [Mergekit]update & add LoRA merge by @lugimzzz in #9811
- [Unified Checkpoint] Fix expert parallel by @DesmonDay in #9821
- [Inference] Flask server compatible with OpenAI api. by @ZHUI in #9828
- [LLM] fix checkpoint save for non flash mode by @SylarTiaNII in #9830
- [DSK] support deepseek-v3/r1 (mha/fp16/bf16/wint8/wint4) by @yuanlehome in #9769
- Resolve Python-version compatibility of type annotations by @zty-king in #9853
- [Tokenizer] save extra special tokens by @DesmonDay in #9837
- [Bugfix] Fix dsk rope diff by @yuanlehome in #9859
- Support lower memory cards. by @ZHUI in #9804
- Support XPU for auto-parallel LLaMa by @From00 in #9796
- [XPU] Add fused op for deepseek by @QingshuChen in #9854
- [Inference] Update deepseek by @DrownFish19 in #9864
- [PreTrain] Support deepseek mfu for pretraining and fix tflops for pretrain pipe model by @ZHUI in #9855
- [Inference]Support mtp with deepseek-v3 by @freeliuzc in #9856
- [AutoParallel] Support deepseekv2 with DP/MP by @xuxinyi389 in #9862
- [LLM] move modeling.py and modeling_nv.py to transformers by @Li-Z-Q in #9676
- [Docs] fix docs for inference and servering by @ZHUI in #9877
- [Docs] news of DeepSeek by @DrownFish19 in #9834
- [AutoParallel]support_ppo_ckpt by @xuxinyi389 in #9823
- support intermediate_api llama test by @liym27 in #9850
- Update MergeKit by @lugimzzz in #9885
- [LLM] Support multi machine deployment by @ltd0924 in #9872
- 【SpecInfer】Fix the low acceptance rate of InferenceWithReference by @Wanglongzhi2001 in #9880
- update the best conf for gpt-13b in dygraph mode by @liym27 in #9891
- [Inference]fix deepseek_v3 with mtp in multi-gpu mode by @freeliuzc in #9894
- [TaskFlow] Fix pir for taskflow by @DrownFish19 in #9822
- [LLM-IE] Add pp-uie to Taskflow by @Fantasy-02 in #9845
- [DOC] Update README for PP-UIE by @DrownFish19 in #9911
- 【benchmark】align benchmark conf for static baichuan2 gpt3 by @liym27 in #9901
- [DOC] PP-UIE by @DrownFish19 in #9913
- add gpu whl by @bukejiyu in #9890
- add count trained tokens by @lugimzzz in #9800
- Correct errors in the fine-tuning docs by @sijunhe in #9922
- [CI]update ci scripts by @Liujie0926 in #9889
- [LLM]: fix block_size setting for llama. by @zhaohaixu in #9921
- support qwen2_5_vl by @chang-wenbin in #9924
- [DSK] Fix some bugs for dsk-v3 by @yuanlehome in #9874
- support intermediate_api gpt-3 test by @Function-Samuel in #9912
- support intermediate_api qwen test by @Function-Samuel in #9910
- [LLM] Add MTP for Deepseekv3 by @DrownFish19 in #9876
- [taskflow] Fix taskflow bug by @Fantasy-02 in #9930
- 【Inference】Support mtp serving by @freeliuzc in #9936
- [Autoparallel] Mtp for DeepSeekV3 by @xuxinyi389 in #9884
- [Unified Checkpoint] Fix split param loading directly when using ignore_merge_optimizer by @DesmonDay in #9935
- [DSK] Implement MLA using matrix absorption by @yuanlehome in #9875
- use training_args to control split input by @blacksheep-Aristotle in #9943
- [requirements] tokenizers for py38 by @DrownFish19 in #9953
- [LLM] update llm server dockerfiles by @kevincheng2 in #9940
- 【Inference】fix dynamic_forward of mtp by @freeliuzc in #9947
- [RL] Fix PPO and add GRPO by @DrownFish19 in #9925
- [doc] update config and add docs for grpo by @DrownFish19 in #9962
- Add Process Reward Model. by @XuLingnan in #9598
- [Feature] Support float8 dtype storage and deepseek v3 with fp8 inference. by @ZHUI in #9906
- [AutoParallel] Add auto parallel moe layer by @pkuzyc in #9886
- [llm]add bf16 moment adamw by @lugimzzz in #9732
- [MergeKit]add log by @lugimzzz in #9948
- Longlora by @micelvrice in #9970
- Fix update paddle_patch.py by @ZHUI in #9968
- support MMLU eval by @vivienfanghuagood in #9967
- Update paddle_patch.py by @ZHUI in #9978
- [XPU] change llama loss func on xpu by @AndSonder in #9973
- [Inference] refine csrc/tools/build_wheel.sh by @bukejiyu in #9971
- [DSK] mla use tensor core by @yuanlehome in #9952
- Update paddle_patch.py by @ZHUI in #9984
- [LLM]fix ci by @lugimzzz in #9986
- Fix mtp speed by @freeliuzc in #9987
- [Trainer]fix wandb proxy by @greycooker in #9960
- [LLM] add moe parallel groups by @SylarTiaNII in #9982
- [AutoParallel] Fix pipeline visualization tool by @AndSonder in #9976
- [llm]fix ci by @lugimzzz in #9989
- [DSK] DeepSeek Support FP8 by @ming1753 in #9956
- [LLM INFER] update step_paddle by @yuanlehome in #9991
- [AutoParallel] Add pp stage id by @xuxinyi389 in #9965
- [CI] Fix tokenizer load in PRM by @DrownFish19 in #9997
- fix tensor-core precision when splitting kv by @lizhenyun01 in #9994
- support intermediate_api baichuan test by @Function-Samuel in #9988
- 【Inference】Add benchmark client test scripts by @gzy19990617 in #9996
- [LLM] Support for automatic deployment of services, modification of environment variable names by @ltd0924 in #9966
- Add moe flex dispatcher by @umiswing in #9977
- add deepseek doc by @yuanlehome in #9964
- [Feat] Sage Attention Kernels Support for sm80, sm89, sm90 by @l1cacheDell in #9848
- [LLM] support fix seq len and cmd run service by @ltd0924 in #10004
- [Doc] Add Qwen/QwQ-32B model ids by @DrownFish19 in #10005
- [LLM] Fix MTP for pipeline parallel by @DrownFish19 in #9972
- [LLM] Update license by @DrownFish19 in #10003
- 【Infer】remove some buggy configs for block gemm by @ckl117 in #10002
- Default set FLAGS_cascade_attention_max_partition_size as 32K by @lizhenyun01 in #10013
- [Distribution] Support DualPipeV for GPT3 by @zhangyuqin1998 in #9993
- [inference]add docker doc by @bukejiyu in #9998
- 【Docs】Update speculate decoding docs by @freeliuzc in #10017
- [LLM] fix llm model path and support download from txt by @ltd0924 in #10029
- [CherryPick] add import check of local_layer by @pkuzyc in #10038
- fix mla nan in mtp by @lizhenyun01 in #10041
- [cherry-pick] update doc by @bukejiyu in #10043
- [CI] fix install issue for requirements-dev.txt by @ZHUI in #10051
- [Doc] Support user-downloaded static graphs by @ltd0924 in #10046
- [cherry-pick] (PR10034 [server]Add a model download script and fix bugs for the server) by @bukejiyu in #10035
- [LLM] Add version number by @ltd0924 in #10056
- check mtp triton cache by @ckl117 in #10065
- Update version setup.py by @ZHUI in #10070
- [Doc] Improve documentation; add model environment and hardware requirements by @ltd0924 in #10073
- change_h100_to_h800 by @yuanlehome in #10091
- [Doc] Polish documentation and update example models by @ltd0924 in #10085
- 【Serving】Fix serving bug for release by @freeliuzc in #10101
New Contributors
- @thinking-computer made their first contribution in #9571
- @Wangzheee made their first contribution in #9569
- @lcykww made their first contribution in #9562
- @shang-mt made their first contribution in #9666
- @dfmz759837901 made their first contribution in #9411
- @jie-z-0607 made their first contribution in #9673
- @aooxin made their first contribution in #9711
- @zty-king made their first contribution in #9853
- @Fantasy-02 made their first contribution in #9845
- @zhaohaixu made their first contribution in #9921
- @XuLingnan made their first contribution in #9598
- @micelvrice made their first contribution in #9970
Full Changelog: v3.0.0-beta3...v3.0.0-beta4