Release v3.0.0-beta4 · PaddlePaddle/PaddleNLP
v3.0.0-beta4
Pre-release
Compare · 15 commits to release/3.0-beta4-new since this release · a286abc
This commit was created on GitHub.com and signed with GitHub’s verified signature.
In this release we fully integrate DeepSeek R1-style reasoning models. The inference team has deeply optimized model inference, delivering industry-leading speed. In addition, we release our in-house PP-UIE information extraction models. The key updates are as follows.
Highlights:
- New models
- Inference & deployment
  - Full support for FP8, INT8, and 4-bit quantized inference of the full-size DeepSeek V3/R1 models, plus MTP speculative decoding.
  - FP8 inference exceeds 1,000 output tokens/s on a single machine; 4-bit single-machine deployment exceeds 2,100 output tokens/s!
  - In a first collaboration with the inference team, we publish a unified inference-deployment image for one-click deployment of popular models. The inference deployment documentation has been fully refreshed for a much better experience; see the docs, and the request sketch after this list.
- Model training:
  - Added large-model embedding training, with INF-CL support for extremely large batch sizes.
  - Added the MergeKit model-merging tool to mitigate the cost of alignment; see the docs.
  - Fully optimized low-resource training: training now runs smoothly on GPUs with as little as 16 GB of memory.
- Other key features:
  - The documentation site now shows a model list where users can view and download the corresponding model files; see the docs.
  - Training adds the adam-mini optimizer, and AdamW now supports BF16 momentum (a configuration sketch appears at the end of section 2).
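For the unified deployment image above, PR #9828 (section 3) makes the served endpoint speak the OpenAI API, so any OpenAI-style client can talk to a deployed model. A minimal request sketch, assuming a service is already running locally; the base URL, port, API key, and model id are placeholder assumptions, not values documented in this release:

```python
# Minimal sketch: query a locally deployed PaddleNLP service through its
# OpenAI-compatible API (PR #9828). base_url, api_key, and the model id
# are placeholder assumptions, not values documented in this release.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # placeholder model id
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```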
The corresponding update details follow:
1. Model and framework component updates
- New models
  - Newly added models:
    - paddlenlp/PP-UIE-0.5B, paddlenlp/PP-UIE-1.5B, paddlenlp/PP-UIE-7B, paddlenlp/PP-UIE-14B
    - deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3-Base, deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero,
    - deepseek-ai/DeepSeek-R1-Distill-Llama-70B, deepseek-ai/DeepSeek-R1-Distill-Llama-8B, deepseek-ai/DeepSeek-R1-Distill-Qwen-14B, deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B, deepseek-ai/DeepSeek-R1-Distill-Qwen-32B, deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
    - Qwen/Qwen2.5-7B-Instruct-1M, Qwen/Qwen2.5-14B-Instruct-1M, Qwen/QwQ-32B, Qwen/QwQ-32B-Preview
  - PR #9738: added the DeepSeek V3 model. PR #9876: added MTP support. PR #9797: fixed a TP issue. PR #9643: added model notes for DeepSeek and Llama 3.3 (@DrownFish19)
  - PR #9906: DeepSeek V3 supports loading Float8 weights directly in dynamic-graph mode for inference (@ZHUI)
  - PR #9845: added the PP-UIE model series (@Fantasy-02). PR #9911 & PR #9913: PP-UIE documentation updates (@DrownFish19); a Taskflow usage sketch closes this section
- Tokenizer improvements
  - PR #9548, PR #9577, PR #9594: the “Hackathon No.43” series, rounding out TokenizerFast support (@yinfan98)
  - PR #9745: fixed an AutoTokenizer issue (@DrownFish19). PR #9837: save extra special tokens (@DesmonDay)
- Unified Checkpoint:
- MergeKit enhancements and optimizations
  - New features and optimizations
    - PR #9561: added mergekit_with_sparsify, enabling sparsified merging (@Mangodadada).
    - PR #9702: improved MergeKit's GPU support for more efficient processing (@Mangodadada).
    - PR #9811: added LoRA (low-rank adapter) merging, extending model-fusion capabilities (@lugimzzz).
  - Tool updates and maintenance
    - PR #9885: code updates and maintenance for the MergeKit tool, streamlining its overall logic.
  - Logging and debugging support
    - PR #9948: added logging to MergeKit (@lugimzzz).
- Low-resource optimizations
  - PR #9804: added use_fused_linear_cross_entropy support to reduce GPU memory, and introduced pre_divided_factor to avoid FP16 overflow.
- Documentation updates and miscellaneous:
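As referenced in the PP-UIE item above, the new models plug into Taskflow's information_extraction task. A minimal sketch, assuming the Taskflow usage documented in the PP-UIE README; the schema entries and input sentence are illustrative only:

```python
# Minimal sketch: information extraction with a PP-UIE model via Taskflow
# (PR #9845). The schema entries and input sentence are illustrative.
from paddlenlp import Taskflow

schema = ["时间", "选手", "赛事名称"]  # entity types to extract
ie = Taskflow("information_extraction",
              schema=schema,
              model="paddlenlp/PP-UIE-0.5B")
print(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌!"))
```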
2. LLM training updates
- General training
  - PR #9204: updated tensor/pipeline parallelism for chatglmv2 (@DrownFish19)
  - PR #9827: added pipeline and flashmask support for Qwen2Moe and DeepSeek (@DrownFish19)
- Embedding training
  - PR #9508: added the embedding trainer (@DesmonDay). PR #9673: added INF-CL support for extremely large batch sizes (@jie-z-0607)
  - PR #9656: fixed loading of the RNG state in the Trainer (@DesmonDay)
  - PR #9721: fixed embedding randomness (@DesmonDay)
- DPO training
- New features
  - PR #9542: added adam-mini optimizer support (@lugimzzz)
  - PR #9732: support AdamW training with BF16 momentum (@lugimzzz); see the optimizer sketch at the end of this section
  - PR #9830: fixed checkpoint saving in non-flash mode (@SylarTiaNII)
  - PR #9705: cherry-pick: validate loss before the optimizer step (@SylarTiaNII)
  - PR #9704: cherry-pick: added an asynchronous metrics dumper for LLM training (@SylarTiaNII)
- Training documentation and fixes
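For the optimizer items above (adam-mini in PR #9542, BF16 momentum for AdamW in PR #9732), a minimal sketch of how they would be selected through the trainer's arguments. The optim value "adamw_mini" is an assumption inferred from the PR title; check the PRs for the authoritative flag names:

```python
# Minimal sketch: selecting this release's optimizer additions through
# PaddleNLP trainer arguments. The optim value "adamw_mini" is an assumption
# inferred from PR #9542; PR #9732 extends AdamW momentum to BF16.
from paddlenlp.trainer import TrainingArguments

args = TrainingArguments(
    output_dir="./checkpoints",
    optim="adamw_mini",  # assumed name of the adam-mini optimizer (PR #9542)
    bf16=True,           # BF16 run; PR #9732 keeps AdamW momentum in BF16 too
)
print(args.optim, args.bf16)
```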
3. Inference updates
- Predictor & Flask updates
  - PR #9831: fixed multi-batch inference (@DrownFish19)
  - PR #9841: fixed position_ids-related issues (@DrownFish19)
  - PR #9864: updated DeepSeek inference (@DrownFish19); see the generation sketch at the end of this section
  - PR #9828: the Flask server makes inference compatible with the OpenAI API (@ZHUI)
- MTP improvements
  - PR #9856: support MTP with DeepSeek-V3 in inference (@freeliuzc)
  - PR #9894: fixed MTP for Deepseek_v3 in multi-GPU mode (@freeliuzc)
  - PR #9936: added MTP serving support (@freeliuzc)
- Deployment improvements
  - PR #9872: support multi-machine LLM deployment (@ltd0924)
  - PR #9791: merged parts of the fastdeploy code (@kevincheng2)
- Kernel optimizations
- Documentation updates and tests
  - PR #9613: inference support for llama3.2 and documentation updates (@yuanlehome)
  - PR #9921: fixed the block_size setting for llama (@zhaohaixu)
  - PR #9711: added common-model and parameter unit tests for the LLM predictor (@aooxin)
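Beyond the service path, the newly added checkpoints can be exercised directly in dynamic-graph mode through PaddleNLP's Auto classes. A minimal sketch; the model id comes from the new-model list in section 1, while the dtype and generation parameters are illustrative assumptions:

```python
# Minimal sketch: dynamic-graph generation with one of the newly added
# distilled DeepSeek-R1 checkpoints. dtype and generation parameters are
# illustrative assumptions.
from paddlenlp.transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="bfloat16")

inputs = tokenizer("What is 1 + 1?", return_tensors="pd")  # "pd" = Paddle tensors
outputs = model.generate(**inputs, max_new_tokens=64)
# PaddleNLP's generate returns (token_ids, scores); decode the first sequence.
print(tokenizer.decode(outputs[0][0], skip_special_tokens=True))
```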
4. AutoParallel / distributed training updates
- Auto parallel
  - PR #9578: added a llama2-7b-cinn test (@zhangbo9674)
- Base configuration and CI integration
  - PR #9538: added qwen model_auto and CI (@blacksheep-Aristotle)
  - PR #9541: added a llama3.1 auto-parallel configuration (@zhiqiu)
  - PR #9551: added auto CI support for gpt and baichuan (@blacksheep-Aristotle)
  - PR #9591: added CE support for gpt, baichuan, and qwen (@blacksheep-Aristotle)
  - PR #9412: added the single_model network and use of the intermediate API (@blacksheep-Aristotle)
  - PR #9943: control split input via training_args (@blacksheep-Aristotle)
- Tests, validation, and feature switches
  - PR #9621: added a PIR recompute test (@waliwali777)
  - PR #9647: adjusted loss_base after dropout gained SPMD support (@deepllz)
  - PR #9714: added switches for stage-1 tensor fusion (@AndSonder)
  - PR #9672: fixed the recompute test when running with to_static=1 (@waliwali777)
  - PR #9688: merge checkpoints for inference under auto parallelism (@xuxinyi389)
  - PR #9750 & PR #9753: fixed CI errors in the ernie auto trainer (@blacksheep-Aristotle)
  - PR #9749: enabled tensor fusion for benchmarks (@AndSonder)
  - PR #9810: added a save/load switch for sharding tensor fusion (@AndSonder)
  - PR #9862: support DP/MP for deepseekv2 (@xuxinyi389)
  - PR #9823: added PPO checkpoint support (@xuxinyi389)
5. CI, documentation, benchmark, and test-script updates
- CI scripts and warning filtering
  - PR #9547: updated CI scripts (@Liujie0926)
  - PR #9612: filter paddle.to_tensor warnings in CI (@DrownFish19)
  - PR #9626: updated the a100 loss_base configuration (@Liujie0926)
  - PR #9889: CI script updates (@Liujie0926)
  - PR #9524: added qwen2.5-7b to the LLM benchmark (@Liujie0926)
  - PR #9662 & PR #9722: updated the LLM_benchmark scripts (@Liujie0926)
- Documentation improvements
  - PR #9585: fixed dead links in the docs (@DrownFish19)
  - PR #9668: updated README.md (@ZHUI)
  - PR #9785: updated the documentation README (@ZHUI)
  - PR #9746: documentation fixes (@DrownFish19)
  - PR #9725: adjusted benchmark environment variables and model configurations (@XieYunshen)
  - PR #9877: corrected the inference and serving docs (@ZHUI)
  - PR #9834: published the DeepSeek announcement and notes (@DrownFish19)
  - PR #9922: corrected errors in the fine-tuning docs (@sijunhe)
- Benchmark configuration and tests
  - PR #9651: fixed abnormal exits of multi-machine benchmark jobs (@XieYunshen)
  - PR #9891: updated the best configuration for gpt-13b in dygraph mode (@liym27)
6. NPU/XPU and hardware-related updates
- NPU adaptation and fixes
  - PR #9499: adapted FusedHeadAndCrossEntropy for NPU (@tianhaodongbd)
  - PR #9573: fixed a where bug on NPU (@tianhaodongbd)
  - PR #9762: adapted to the new flash_attention_npu API (@will-jl944)
- XPU features and optimizations
  - PR #9549: qwen2 supports flash_attn on XPU (@will-jl944)
  - PR #9660: qwen2 supports fused_rope (@will-jl944)
  - PR #9789: support empty_cache on XPU (@will-jl944)
  - PR #9796: XPU support for auto-parallel LLaMa (@From00)
  - PR #9854: added XPU fused ops for deepseek (@QingshuChen)
7. Bug fixes, performance optimizations, and other improvements
- State loading and multithreading
  - PR #9464: fixed load_state_dict under multithreading (@DesmonDay)
- Assorted model and operator fixes
  - PR #9603: fixed a d2s bug in qwen2 modeling (@wawltor)
  - PR #9569: fixed norm outputs in dynamic and static modes (@Wangzheee)
  - PR #9652: fixed paddle.where (@will-jl944)
  - PR #9638: added the replace_with_c_embedding config (@Xing-lil)
  - PR #9699: fixed a LoRA-GA AMP issue (@greycooker)
  - PR #9752: fixed a bug in get_block_shape_and_split_kv_block (@lizhenyun01)
  - PR #9759: fixed the speculate_verify_and_update op (@Wanglongzhi2001)
  - PR #9674: merged speculate_step into the step op (@Wanglongzhi2001)
  - PR #9757: updated sequence parallel in the Trainer module (@DesmonDay)
  - PR #9765: fixed LoRA-GA merging (@greycooker)
  - PR #9777: cherry-pick: support fused optimizer in distributed training (@SylarTiaNII)
  - PR #9783: fixed CE errors (@blacksheep-Aristotle)
  - PR #9779: fixed unsafe pickle loading (@DrownFish19)
  - PR #9760: fixed expert parallel in the MoE module (@DesmonDay)
  - PR #9790: added a pir_model path for server inference (@aooxin)
  - PR #9706: cherry-pick: integrated the PDC SDK for LLM training fault tolerance (@SylarTiaNII)
  - PR #9624: added FLAGS replacing four parameters, tunable for better speedup (@zhink)
  - PR #9806: fixed a LLAMA argument-parsing bug (@will-jl944)
  - PR #9829: updated mixtral.md (@yuanlehome)
  - PR #9859: fixed a DeepSeek (dsk) RoPE discrepancy (@yuanlehome)
8. Environment, dependency, and version-compatibility updates
- Requirements and installation updates
  - PR #9514: updated requirements.txt for py38 (@ZHUI)
  - PR #9118: updated installation dependencies (@DrownFish19)
  - PR #9953: added a tokenizers dependency for py38 (@DrownFish19)
- Python version compatibility
  - PR #9853: resolved Python-version compatibility of type annotations (@zty-king)
What's Changed
- Update requirements.txt for py38 by @ZHUI in #9514
- [Unified Checkpoint] fix single card loading without master weights by @DesmonDay in #9540
- Fix multi-threading load_state_dict by @DesmonDay in #9464
- delete generate_rank_mapping when export multi cards model by @yuanlehome in #9552
- [LLM] dpo support qwen2 with flashmask by @wtmlon in #9543
- [XPU] qwen2 supports flash_attn on XPU by @will-jl944 in #9549
- [AutoParallel]: add qwen model_auto and ci by @blacksheep-Aristotle in #9538
- add llama3.1 config for auto_parallel by @zhiqiu in #9541
- Add more model support for speculate_decoding and refactor speculate_decoding by @Wanglongzhi2001 in #9504
- [Intel_HPU]FSDPA custom kernel API update by @yanfeich in #9556
- [Unified Checkpoint] fix load missing keys by @DesmonDay in #9523
- 【Hackathon 7th No.43】Complete TokenizerFast support, part 3 by @yinfan98 in #9548
- adapt code to amsgrad supported adamw by @HydrogenSulfate in #9568
- [CI]update scripts by @Liujie0926 in #9547
- Adapting npu for FusedHeadAndCrossEntropy by @tianhaodongbd in #9499
- 【Hackathon 7th No.43】Complete TokenizerFast support, part 4 by @yinfan98 in #9577
- fix(export_model): fix export_model.py python path by @thinking-computer in #9571
- Fix_ckpt_oom_paddlenlp by @Xing-lil in #9507
- Add GPUEventTimer by @sneaxiy in #9582
- [npu] fix where bug by @tianhaodongbd in #9573
- [doc] Fix dead links by @DrownFish19 in #9585
- [AutoParallel]:add gpt & baichuan auto ci by @blacksheep-Aristotle in #9551
- Add llama2-7b-cinn test by @zhangbo9674 in #9578
- [AutoParallel]:add gpt&baichuan&qwen ce by @blacksheep-Aristotle in #9591
- fix dpo pp eval by @lugimzzz in #9607
- [LLM] update tensor and pipeline parallel for chatglmv2 by @DrownFish19 in #9204
- [Install] Update requirements.txt by @DrownFish19 in #9118
- [Trainer]Fix _get_eval_sampler by @greycooker in #9374
- fix benchmark scripts by @XieYunshen in #9597
- [Trainer] Add embedding trainer by @DesmonDay in #9608
- [CI] filter paddle.to_tensor warnings when set_state_dict by @DrownFish19 in #9612
- fix ckpt quant log by @wtmlon in #9606
- fix the d2s bug in qwen2 modeling by @wawltor in #9603
- 【Hackathon 7th No.43】Complete TokenizerFast support, part 5 by @yinfan98 in #9594
- fix pp_config bug by @tianhaodongbd in #9605
- Speedup FusedHeadAndCrossEntropy by @will-jl944 in #9601
- fix get_save_output op and refactor specu_decoding by @Wanglongzhi2001 in #9576
- [Inference] Fix docs and support llama3.2 by @yuanlehome in #9613
- fix by @DrownFish19 in #9628
- fix norm outputs in dynamic and static mode by @Wangzheee in #9569
- [CI]update a100 loss_base for gpt by @Liujie0926 in #9626
- [LLM benchmark]add qwen2.5-7b by @Liujie0926 in #9524
- Checkpoint Compression Doc by @wtmlon in #9614
- Update unified_checkpoint.md by @DesmonDay in #9634
- add llama and nv-embed training by @Li-Z-Q in #9323
- [News] Unified Checkpoint by @DrownFish19 in #9632
- feat(sdaa): support sdaa backend infer by @thinking-computer in #9570
- [llm]update dpo criterion by @lugimzzz in #9620
- [llm]add adam-mini by @lugimzzz in #9542
- Update version for beta3 by @ZHUI in #9553
- [LLM DOCs] Add deepseek llama3.3 new models by @DrownFish19 in #9643
- [Tokenizer] Fix tokenizer of llama3.3 by @DrownFish19 in #9641
- [AutoParallel] Add test for PIR recompute by @waliwali777 in #9621
- Update README.md for 3.0 beta3 by @ZHUI in #9644
- Add replace_with_parallel_cross_entropy flag by @waliwali777 in #9579
- [AutoParallel] change loss_base after dropout support spmd by @deepllz in #9647
- [Embedding] Add embedding training by @DesmonDay in #9508
- [PEFT]Add LoRA-GA by @greycooker in #9592
- mergekit_with_sparsify by @Mangodadada in #9561
- Fix paddle.where by @will-jl944 in #9652
- Add config replace_with_c_embedding by @Xing-lil in #9638
- Update embedding trainer state by @DesmonDay in #9629
- MoRA Implementation by @lcykww in #9562
- [llm]update peft docs by @lugimzzz in #9655
- [Trainer] Fix loading rng state by @DesmonDay in #9656
- fix qwen & baichuan & gpt ci error by @blacksheep-Aristotle in #9650
- [llm] fix lora by @lugimzzz in #9659
- [XPU] qwen2 supports fused_rope by @will-jl944 in #9660
- update hygon dcu docs by @TimeYWL in #9298
- Make the timer compatible with devices other than GPU by @deepllz in #9665
- [Trainer] update remove_master_weight by @DesmonDay in #9640
- [DOC] Update README.md by @ZHUI in #9668
- [Mthreads] support llama 13B train by @shang-mt in #9666
- Structured Index of Documents by @dfmz759837901 in #9411
- 【Qwen2-VL Inference】add qwen2-vl high performance inference by @chang-wenbin in #9575
- merge docs by @Mangodadada in #9657
- [CI]update blacklist for gpt3 by @Liujie0926 in #9555
- [UX optimization] Consolidate the training CUDA and Triton kernels into paddlenlp_kernel by @JunnYu in #9471
- [Unified Checkpoint] bug fix by @DesmonDay in #9669
- Add tied_weight_keys for pipeline model by @DesmonDay in #9663
- Optimize performance for Qwen2 model by @sneaxiy in #9616
- [MLU] add mlu llama readme by @PeiyuLau in #9671
- Set tensor parallel name mapping when fusion is used by @sneaxiy in #9685
- [LLM] add deploy server by @kevincheng2 in #9581
- [Embedding] Add inf-cl in embedding trainer by @jie-z-0607 in #9673
- [Fix]fix loraga amp by @greycooker in #9699
- [LLM INFER] cutlass 3.x gemm on sm90 by @ckl117 in #9398
- [Iluvatar] Add readme for llama-13b by @tianyuzhou668 in #9670
- [AutoParallel] merge ckpt for inference by @xuxinyi389 in #9688
- update gpt&baichuan&qwen ce name by @blacksheep-Aristotle in #9697
- fix docs by @xuxinyi389 in #9703
- [Inference] Use cuda core(int8_sq) for m <=4 in gemm_dequant OP by @zhink in #9707
- [LLM] [Cherry-Pick] valid loss before optimizer step (#9255) by @SylarTiaNII in #9705
- [llm]support dpo pp for qwen & llama by @lugimzzz in #9695
- support qwen dpo fused kernel by @wtmlon in #9686
- [AutoParallel] Fix recompute test running under to_static=1 by @waliwali777 in #9672
- [LLM_benchmark]update LLM_benchmark scripts by @Liujie0926 in #9662
- [LLM] [Cherry-Pick] add asynchronous metrics dumper for llm training by @SylarTiaNII in #9704
- [llm] Add KTO by @lugimzzz in #9689
- [Embedding] Fix embedding random by @DesmonDay in #9721
- remove refined recompute deep copy by @JunnYu in #9617
- add single_model network and use intermediate api by @blacksheep-Aristotle in #9412
- Refactor custom devices. by @ZHUI in #9734
- Add offload_recompute_inputs by @will-jl944 in #9715
- [LLM] [Cherry-Pick] Integrate PDC SDK for LLM training fault tolerance platform by @SylarTiaNII in #9706
- add common models and common params unit test for llm predictor. by @aooxin in #9711
- Added FLAGS to replace four params and the value can be adjusted for better speedup by @zhink in #9624
- [AutoParallel] add parameter enable_stage1_tensor_fusion_blanced_save_load and enable_stage1_tensor_fusion by @AndSonder in #9714
- Adapt to new npu flash_attention api by @will-jl944 in #9735
- [AutoParallel] Add test for PIR refined recompute by @waliwali777 in #9679
- [Docs] Fix by @DrownFish19 in #9746
- Bugfix update predictor.py by @ZHUI in #9742
- Modify the environment variables and model configuration of the bench… by @XieYunshen in #9725
- [Unified Checkpoint] Fix expert parallel by @DesmonDay in #9741
- [AutoParallel]: fix ernie ci error by @blacksheep-Aristotle in #9750
- fix import bugs. by @aooxin in #9751
- [AutoParallel]ckpt support local views keys to global views keys by @xuxinyi389 in #9604
- Add XLMRoBERTaModel in paddlenlp by @jie-z-0607 in #9720
- [AutoParallel]:fix ernine auto_trainer error by @blacksheep-Aristotle in #9753
- fix get_block_shape_and_split_kv_block by @lizhenyun01 in #9752
- fix speculate_verify_and_update op by @Wanglongzhi2001 in #9759
- [Inference]merge speculate_step into step op by @Wanglongzhi2001 in #9674
- [NPU] Adapt to new flash_attention_npu api by @will-jl944 in #9762
- [Trainer] update sequence parallel by @DesmonDay in #9757
- [tokenizer] Fix AutoTokenizer by @DrownFish19 in #9745
- [LLM] Add DeepseekV3 by @DrownFish19 in #9738
- [AutoParallel] open tensor_fusion for benchmark by @AndSonder in #9749
- fix loraga merge by @greycooker in #9765
- Fix ernie ci auto trainer error by @blacksheep-Aristotle in #9758
- Update README.md by @ZHUI in #9766
- Fix matryoshka norm loss by @DesmonDay in #9774
- [Distributed] [Cherry-Pick] support fuse optimizer (#9519) by @SylarTiaNII in #9777
- Update register_sequence_parallel_allreduce_hooks by @DesmonDay in #9782
- Fix ce error by @blacksheep-Aristotle in #9783
- fix pickle unsafe-load by @DrownFish19 in #9779
- [MoE] fix expert parallel by @DesmonDay in #9760
- fix dpo pp criterion by @wtmlon in #9786
- add pir_model path for server infer. by @aooxin in #9790
- [LLM] [Cherry-Pick] support flash device on static model (#9619) by @SylarTiaNII in #9787
- [LLM Benchmark]update scripts by @Liujie0926 in #9722
- mergekit gpu 1226 by @Mangodadada in #9702
- [LLM] merge code from fastdeploy by @kevincheng2 in #9791
- support eagle for llama by @freeliuzc in #9812
- [CI] Fix by @ZHUI in #9633
- wrap model when lora is ON and only do evaluation. by @wtmlon in #9803
- Update README.md for documention by @ZHUI in #9785
- [Checkpoint compression] Support sharding stage1 v2 by @DesmonDay in #9817
- [LLM] Update model convert and fix TP for deepseekv3 by @DrownFish19 in #9797
- [AutoParallel] add sharding tensor_fusion save load switch by @AndSonder in #9810
- Fix handling of abnormal exits in multi-machine benchmark jobs by @XieYunshen in #9651
- Fix LLAMA arg parsing bug in pp by @will-jl944 in #9806
- Update mixtral.md by @yuanlehome in #9829
- [XPU] Support empty_cache on XPUs by @will-jl944 in #9789
- [Inference] Fix multibatch inference by @DrownFish19 in #9831
- Fix position_ids for infra by @DrownFish19 in #9841
- [LLM] Add pipeline and flashmask for Qwen2Moe and Deepseek by @DrownFish19 in #9827
- [Mergekit]update & add LoRA merge by @lugimzzz in #9811
- [Unified Checkpoint] Fix expert parallel by @DesmonDay in #9821
- [Inference] Flask server compatible with OpenAI api. by @ZHUI in #9828
- [LLM] fix checkpoint save for non flash mode by @SylarTiaNII in #9830
- [DSK] support deepseek-v3/r1 (mha/fp16/bf16/wint8/wint4) by @yuanlehome in #9769
- Resolve Python-version compatibility of type annotations by @zty-king in #9853
- [Tokenizer] save extra special tokens by @DesmonDay in #9837
- [Bugfix] Fix dsk rope diff by @yuanlehome in #9859
- Support lower memory cards. by @ZHUI in #9804
- Support XPU for auto-parallel LLaMa by @From00 in #9796
- [XPU] Add fused op for deepseek by @QingshuChen in #9854
- [Inference] Update deepseek by @DrownFish19 in #9864
- [PreTrain] Support deepseek mfu for pretraining and fix tflops for pretrain pipe model by @ZHUI in #9855
- [Inference]Support mtp with deepseek-v3 by @freeliuzc in #9856
- [AutoParallel] Support deepseekv2 with DP/MP by @xuxinyi389 in #9862
- [LLM] move modeling.py and modeling_nv.py to transformers by @Li-Z-Q in #9676
- [Docs] fix docs for inference and servering by @ZHUI in #9877
- [Docs] news of DeepSeek by @DrownFish19 in #9834
- [AutoParallel]support_ppo_ckpt by @xuxinyi389 in #9823
- support intermediate_api llama test by @liym27 in #9850
- Update MergeKit by @lugimzzz in #9885
- [LLM] Support multi machine deployment by @ltd0924 in #9872
- 【SpecInfer】Fix the low acceptance rate of InferenceWithReference by @Wanglongzhi2001 in #9880
- update the best conf for gpt-13b in dygraph mode by @liym27 in #9891
- [Inference]fix deepseek_v3 with mtp in multi-gpu mode by @freeliuzc in #9894
- [TaskFlow] Fix pir for taskflow by @DrownFish19 in #9822
- [LLM-IE] Add pp-uie to Taskflow by @Fantasy-02 in #9845
- [DOC] Update README for PP-UIE by @DrownFish19 in #9911
- 【benchmark】align benchmark conf for static baichuan2 gpt3 by @liym27 in #9901
- [DOC] PP-UIE by @DrownFish19 in #9913
- add gpu whl by @bukejiyu in #9890
- add count trained tokens by @lugimzzz in #9800
- Correct errors in the fine-tuning docs by @sijunhe in #9922
- [CI]update ci scripts by @Liujie0926 in #9889
- [LLM]: fix block_size setting for llama. by @zhaohaixu in #9921
- support qwen2_5_vl by @chang-wenbin in #9924
- [DSK] Fix some bugs for dsk-v3 by @yuanlehome in #9874
- support intermediate_api gpt-3 test by @Function-Samuel in #9912
- support intermediate_api qwen test by @Function-Samuel in #9910
- [LLM] Add MTP for Deepseekv3 by @DrownFish19 in #9876
- [taskflow] Fix taskflow bug by @Fantasy-02 in #9930
- 【Inference】Support mtp serving by @freeliuzc in #9936
- [Autoparallel] Mtp for DeepSeekV3 by @xuxinyi389 in #9884
- [Unified Checkpoint] Fix split param loading directly when using ignore_merge_optimizer by @DesmonDay in #9935
- [DSK] Implement MLA using matrix absorption by @yuanlehome in #9875
- use training_args to control split input by @blacksheep-Aristotle in #9943
- [requirements] tokenizers for py38 by @DrownFish19 in #9953
- [LLM] update llm server dockerfiles by @kevincheng2 in #9940
- 【Inference】fix dynamic_forward of mtp by @freeliuzc in #9947
- [RL] Fix PPO and add GRPO by @DrownFish19 in #9925
- [doc] update config and add docs for grpo by @DrownFish19 in #9962
- Add Process Reward Model. by @XuLingnan in #9598
- [Feature] Support float8 dtype storage and deepseek v3 with fp8 inference. by @ZHUI in #9906
- [AutoParallel] Add auto parallel moe layer by @pkuzyc in #9886
- [llm]add bf16 moment adamw by @lugimzzz in #9732
- [MergeKit]add log by @lugimzzz in #9948
- Longlora by @micelvrice in #9970
- Fix update paddle_patch.py by @ZHUI in #9968
- support MMLU eval by @vivienfanghuagood in #9967
- Update paddle_patch.py by @ZHUI in #9978
- [XPU] change llama loss func on xpu by @AndSonder in #9973
- [Inference] refine csrc/tools/build_wheel.sh by @bukejiyu in #9971
- [DSK] mla use tensor core by @yuanlehome in #9952
- Update paddle_patch.py by @ZHUI in #9984
- [LLM]fix ci by @lugimzzz in #9986
- Fix mtp speed by @freeliuzc in #9987
- [Trainer]fix wandb proxy by @greycooker in #9960
- [LLM] add moe parallel groups by @SylarTiaNII in #9982
- [AutoParallel] Fix pipeline visualization tool by @AndSonder in #9976
- [llm]fix ci by @lugimzzz in #9989
- [DSK] DeepSeek Support FP8 by @ming1753 in #9956
- [LLM INFER] update step_paddle by @yuanlehome in #9991
- [AutoParallel] Add pp stage id by @xuxinyi389 in #9965
- [CI] Fix tokenizer load in PRM by @DrownFish19 in #9997
- fix tensor-core precision when splitting kv by @lizhenyun01 in #9994
- support intermediate_api baichuan test by @Function-Samuel in #9988
- 【Inference】Add benchmark client test scripts by @gzy19990617 in #9996
- [LLM] Support for automatic deployment of services, modification of environment variable names by @ltd0924 in #9966
- Add moe flex dispatcher by @umiswing in #9977
- add deepseek doc by @yuanlehome in #9964
- [Feat] Sage Attention Kernels Support for sm80, sm89, sm90 by @l1cacheDell in #9848
- [LLM] support fix seq len and cmd run service by @ltd0924 in #10004
- [Doc] Add Qwen/QwQ-32B model ids by @DrownFish19 in #10005
- [LLM] Fix MTP for pipeline parallel by @DrownFish19 in #9972
- [LLM] Update license by @DrownFish19 in #10003
- 【Infer】remove some buggy configs for block gemm by @ckl117 in #10002
- Default set FLAGS_cascade_attention_max_partition_size as 32K by @lizhenyun01 in #10013
- [Distribution] Support DualPipeV for GPT3 by @zhangyuqin1998 in #9993
- [inference]add docker doc by @bukejiyu in #9998
- 【Docs】Update speculate decoding docs by @freeliuzc in #10017
- [LLM] fix llm model path and support download from txt by @ltd0924 in #10029
- [CherryPick] add import check of local_layer by @pkuzyc in #10038
- fix mla nan in mtp by @lizhenyun01 in #10041
- [cherry-pick] update doc by @bukejiyu in #10043
- [CI] fix install issue for requirements-dev.txt by @ZHUI in #10051
- [Doc] Support user-downloaded static graphs by @ltd0924 in #10046
- [cherry-pick] (PR10034 [server]Add a model download script and fix bugs for the server) by @bukejiyu in #10035
- [LLM] Add version number by @ltd0924 in #10056
- check mtp triton cache by @ckl117 in #10065
- Update version setup.py by @ZHUI in #10070
- [Doc] Improve documentation; add model environment and hardware requirements by @ltd0924 in #10073
- change_h100_to_h800 by @yuanlehome in #10091
- [Doc] Polish documentation and update example models by @ltd0924 in #10085
- 【Serving】Fix serving bug for release by @freeliuzc in #10101
New Contributors
- @thinking-computer made their first contribution in #9571
- @Wangzheee made their first contribution in #9569
- @lcykww made their first contribution in #9562
- @shang-mt made their first contribution in #9666
- @dfmz759837901 made their first contribution in #9411
- @jie-z-0607 made their first contribution in #9673
- @aooxin made their first contribution in #9711
- @zty-king made their first contribution in #9853
- @Fantasy-02 made their first contribution in #9845
- @zhaohaixu made their first contribution in #9921
- @XuLingnan made their first contribution in #9598
- @micelvrice made their first contribution in #9970
Full Changelog: v3.0.0-beta3...v3.0.0-beta4