Releases · vectorch-ai/ScaleLLM · GitHub
Release timestamps, newest first: 27 May 07:58, 02 Mar 02:34, 26 Jan 22:13, 26 Oct 03:12, 04 Sep 23:00, 22 Aug 01:49, 04 Aug 00:38, 25 Jul 12:02, 24 Jul 06:12, 04 Jul 00:34
Releases: vectorch-ai/ScaleLLM
v0.2.5
What's Changed
- ci: fix wheel build script by @guocuimi in #418
- kernel: added attention combine kernel to support split kv by @guocuimi in #419
- kernel: refactor and added more unittests for attn combine kernel by @guocuimi in #420
- moe: added token dispatcher interface for MOE layer. by @guocuimi in #421
- moe: added local token dispatcher pytorch implementation for testing by @guocuimi in #422
- nccl: added all2all for nccl process group by @guocuimi in #423
- moe: added all to all token dispatcher pytorch implementation by @guocuimi in #424
- upgrade cutlass to 3.9 by @guocuimi in #425
- kernel: added fused gate for moe by @guocuimi in #426
- chore: added pre-commit-config by @guocuimi in #427
- kernel: added moe permute kernels by @guocuimi in #428
- chore: clean up attn dependencies by @guocuimi in #429
- chore: clean up JinjaChatTemplate by @guocuimi in #430
- test: added different dtype unittests for moe permute kernels by @guocuimi in #431
- refactor: use __ldlu to load/store data and refactor code for moe permute kernels by @guocuimi in #432
- upgrade pytorch to 2.7 by @guocuimi in #434
- chore: build manylinux2_28 builder image by @guocuimi in #435
- fix: fix manylinux2_28 build by @guocuimi in #436
- upgrade vcpkg after switch to manylinux_2_28 by @guocuimi in #437
- chore: add option to install py module into scalellm folder by @guocuimi in #438
- chore: add script to install zsh for devbox by @guocuimi in #439
- ci: enable docker cache by @guocuimi in #441
- kernel: add kernel for moe permutation with mask map by @guocuimi in #433
- kernel: added align block permutation kernel for moe by @guocuimi in #442
- build: added build for blackwell by @guocuimi in #459
- chore: upgrade cutlass to v4.0 by @guocuimi in #460
- ci: change self-hosted runner tags by @guocuimi in #461
Full Changelog: v0.2.4...v0.2.5
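The "attention combine kernel to support split kv" entry (#419) refers to merging attention results computed over separate chunks of the KV sequence. The following is a minimal pure-Python sketch of that general idea, not the project's CUDA kernel: each chunk returns its output together with the log-sum-exp of its scores, and the chunks are merged with weights derived from those values. The function names `attention` and `combine` are hypothetical, and the usual `1/sqrt(d)` scaling is omitted for brevity.

```python
import math

def attention(q, keys, values):
    """Single-query attention over one KV chunk.

    Returns (output, lse), where lse = log(sum(exp(scores))).
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)

def combine(partials):
    """Merge per-chunk (output, lse) pairs into the full-attention output."""
    max_lse = max(lse for _, lse in partials)
    # Each chunk's weight is proportional to its softmax denominator.
    scales = [math.exp(lse - max_lse) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
            for d in range(dim)]
```

Running `attention` on two halves of the keys and values and passing both results to `combine` should reproduce, up to floating-point error, the output of `attention` over the whole sequence.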
Assets 14
v0.2.4
What's Changed
- ci: add option to skip nvbench build by @guocuimi in #390
- ci: build devel image with cuda 12.8 for blackwell by @guocuimi in #391
- kernel: added query packing support for attention by @guocuimi in #392
- refactor: rename attention to mha to differentiate it from mla by @guocuimi in #393
- kernel: added triton aot compiler by @guocuimi in #394
- kernel: generate smaller kernel instantiations by @guocuimi in #395
- kernel: fix register spilling issue for attention head_dim=256 by @guocuimi in #397
- upgrade libtorch to 2.6.0 and cutlass to 3.8.0 by @guocuimi in #398
- kernel: added simple MLA kernel by @guocuimi in #396
- kernel: added pipeline support for mla by @guocuimi in #399
- kernel: added ping-pong rmem support for MLA by @guocuimi in #400
- kernel: revert experimental TiledMMA separation change. by @guocuimi in #401
- kernel: put query always in registers for mha by @guocuimi in #402
- kernel: use 8 warps to avoid register spilling for mla with hdim=512 by @guocuimi in #403
- kernel: revert mla ping-pong rmem change by @guocuimi in #404
- kernel: refactor mask logic to avoid using hard-coded stride. by @guocuimi in #405
- kernel: added causal mask for MLA kernel by @guocuimi in #406
- kernel: added blk_n=16 for MLA to support sm_86/sm_89 with only 100kb smem by @guocuimi in #407
- kernel: fix mask bugs for MLA by @guocuimi in #408
- kernel: use different TiledMma for GEMM qk and pv by @guocuimi in #409
- kernel: added stage support for MLA kernel by @guocuimi in #410
- misc: upgrade cuda version and add devcontainer for manylinux by @guocuimi in #412
- kernel: added q and kv oob handling for MLA kernel by @guocuimi in #413
- kernel: optimize mask loop for MLA kernel by @guocuimi in #414
- kernel: added paged kv support for MLA kernel by @guocuimi in #415
- kernel: fix kv oob issue and added more unittests for paged MLA by @guocuimi in #416
- kernel: use FastDivmod in attention kernels by @guocuimi in #417
Full Changelog: v0.2.3...v0.2.4
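The "use FastDivmod in attention kernels" entry (#417) concerns replacing integer division by a runtime-constant divisor with a multiply and a shift, which is cheaper on GPUs. The sketch below illustrates the classic multiply-shift technique for 32-bit unsigned values in Python. It is an illustration of the idea only, not CUTLASS's `FastDivmod` implementation, and the function names are hypothetical.

```python
def make_fast_divmod(divisor):
    """Precompute (multiplier, shift) for dividing 32-bit unsigned ints by `divisor`."""
    if not 1 <= divisor < 2**32:
        raise ValueError("divisor must be positive and fit in 32 bits")
    shift = (divisor - 1).bit_length()  # ceil(log2(divisor))
    multiplier = (2**32 * (2**shift - divisor)) // divisor + 1
    return multiplier, shift

def fast_divmod(n, divisor, multiplier, shift):
    """Return (n // divisor, n % divisor) using only multiply, add and shift."""
    high = (multiplier * n) >> 32  # upper 32 bits of the 64-bit product
    quotient = (high + n) >> shift
    return quotient, n - quotient * divisor
```

The precomputation runs once on the host for each divisor (for example a head count or block size), after which every index calculation avoids a hardware divide.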
Assets 18
v0.2.3
What's Changed
- misc: remove legacy logic to support quantization for other types. by @guocuimi in #350
- upgrade pytorch to 2.5.1 by @guocuimi in #351
- added cuda 12.6 build image by @guocuimi in #353
- fix cmake version issue for manylinux image by @guocuimi in #354
- kernel: added attention kernel for sm80 (Happy new year!) by @guocuimi in #355
- ci: fix package test workflow by @guocuimi in #357
- kernel: refactor attention kernel for readability by @guocuimi in #358
- dev: config dev container with proper extensions by @guocuimi in #359
- kernel: added attention bench for profiling before optimization by @guocuimi in #360
- kernel: added logits soft cap support for attention by @guocuimi in #362
- tools: added attention traits viewer by @guocuimi in #363
- kernel: added swizzle for shared memory to avoid bank conflict by @guocuimi in #364
- kernel: added causal, alibi, sliding window mask for attention by @guocuimi in #365
- kernel: refactor attention kernel and add more unittests by @guocuimi in #366
- kernel: added M/N OOB handling for attention by @guocuimi in #367
- tools: update svg build to generate small file by @guocuimi in #368
- kernel: Added attention params and tile for different input types. by @guocuimi in #369
- kernel: added mqa and gqa support for attention by @guocuimi in #370
- kernel: added var len and paged kv cache support for attention by @guocuimi in #371
- kernel: added varlen and pagedkv unittests for attention by @guocuimi in #372
- kernel: added attention kernel launch by @guocuimi in #373
- kernel: added build script to generate kernel instantiations for attention by @guocuimi in #374
- kernel: change attention input shape from [head, seq, dim] to [seq, head, dim] by @guocuimi in #375
- kernel: added head_dim=96 support for attention by @guocuimi in #376
- kernel: optimize attention kernel performance by @guocuimi in #377
- upgrade cutlass to 3.7.0 by @guocuimi in #379
- kernel: handle kv block range for attention kernel by @guocuimi in #382
- kernel: use cp_async_zfill instead of cute::clear for oob handling by @guocuimi in #383
- kernel: separate oob iterations for better performance. by @guocuimi in #384
- refactor: remove batch_prefill interface by @guocuimi in #385
- refactor: stop build flash_infer kernel by @guocuimi in #386
- feat: integrate in-house scale attention and use it by default by @guocuimi in #380
- kernel: only zfill k once to improve perf for attention by @guocuimi in #387
- refactor: skip flash_attn build by @guocuimi in #388
- refactor: clean up kv cache set/get apis and improve slot id calculation perf by @guocuimi in #389
Full Changelog: v0.2.2...v0.2.3
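The "logits soft cap support for attention" entry (#362) refers to bounding attention scores before the softmax. A common formulation, used here as an assumption about what the kernel computes, is `cap * tanh(logit / cap)`. The sketch below shows that formula in plain Python; the function names and the "cap <= 0 disables capping" convention are hypothetical.

```python
import math

def soft_cap(logit, cap):
    """Smoothly bound a logit to [-cap, cap]; cap <= 0 disables capping (assumed convention)."""
    if cap <= 0:
        return logit
    return cap * math.tanh(logit / cap)

def softmax_with_soft_cap(logits, cap):
    """Apply the soft cap to every logit, then a numerically stable softmax."""
    capped = [soft_cap(x, cap) for x in logits]
    m = max(capped)
    exps = [math.exp(x - m) for x in capped]
    total = sum(exps)
    return [e / total for e in exps]
```

Small logits pass through almost unchanged, while very large ones saturate near the cap, which keeps a single outlier score from dominating the softmax.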
Assets 29
v0.2.2
What's Changed
- kernel: added flash infer attention impl by @guocuimi in #327
- refactor: flatten block tables to 1d tensor by @guocuimi in #328
- kernel: added script to generate instantiation for flashinfer kernels by @guocuimi in #329
- refactor: move flash attn and flash infer into attention folder by @guocuimi in #330
- kernel: port flash infer handler + wrapper logics by @guocuimi in #331
- ut: added unittests for flash infer kernels by @guocuimi in #332
- refactor: replaced last_page_len with kv_indptr for flash infer kernel by @guocuimi in #333
- feat: added pass-in alibi slopes support for flash infer kernel by @guocuimi in #334
- refactor: move paged kv related logic into paged_kv_t by @guocuimi in #335
- ut: added fp8 kv unittests for flash infer kernel by @guocuimi in #336
- ci: added pip cache to avoid redownloading by @guocuimi in #337
- upgrade pytorch to 2.4.1 by @guocuimi in #341
- ci: run package test in docker by @guocuimi in #345
- ci: build cuda 12.4 for scalellm cpp images by @guocuimi in #346
- Upgrade pytorch to 2.5.0 by @guocuimi in #347
- ut: add more tests for different warp layout by @guocuimi in #340
- misc: attention kernel refactoring by @guocuimi in #339
Full Changelog: v0.2.1...v0.2.2
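Two entries in this release, "flatten block tables to 1d tensor" (#328) and "replaced last_page_len with kv_indptr" (#333), point at a CSR-style layout for paged KV caches: one flat list of block ids plus an offsets array marking where each sequence's blocks start. The sketch below shows one way such a layout could work. The names `flatten_block_tables`, `slot_for_token` and `indptr` are hypothetical and are not taken from the ScaleLLM source.

```python
def flatten_block_tables(block_tables):
    """Pack ragged per-sequence block lists into (flat_blocks, indptr), CSR style."""
    flat, indptr = [], [0]
    for blocks in block_tables:
        flat.extend(blocks)
        indptr.append(len(flat))  # end offset of this sequence's blocks
    return flat, indptr

def slot_for_token(flat, indptr, seq_idx, token_pos, block_size):
    """Map a token position within a sequence to its physical slot in the cache."""
    start, end = indptr[seq_idx], indptr[seq_idx + 1]
    block_offset, in_block = divmod(token_pos, block_size)
    if start + block_offset >= end:
        raise IndexError("token position is outside the sequence's allocated blocks")
    return flat[start + block_offset] * block_size + in_block
```

Compared with a padded 2D table, the flat form needs no padding when sequences own different numbers of blocks, at the cost of one extra offsets lookup per sequence.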
Assets 29
v0.2.1
What's Changed
- feat: added awq marlin qlinear by @guocuimi in #315
- build: speed up compilation for marlin kernels by @guocuimi in #316
- test: added unittests for marlin kernels by @guocuimi in #317
- refactor: clean up build warnings and refactor marlin kernels by @guocuimi in #318
- fix: clean up build warnings: "LOG" redefined by @guocuimi in #319
- cmake: make includes private and disable jinja2cpp build by @guocuimi in #320
- ci: allow build without requiring a physical gpu device by @guocuimi in #321
- fix: put item into asyncio.Queue in a thread-safe way by @guocuimi in #324
- refactor: added static switch for marlin kernel dispatch by @guocuimi in #325
- feat: fix and use marlin kernel for awq by default by @guocuimi in #326
Full Changelog: v0.2.0...v0.2.1
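The "put item into asyncio.Queue in a thread-safe way" fix (#324) addresses a general Python pitfall: `asyncio.Queue` is not thread-safe, so a worker thread must not call its methods directly. The sketch below shows the standard remedy, scheduling the put on the event loop with `loop.call_soon_threadsafe`. It is a generic illustration; the producer thread and the `None` sentinel are assumptions, not the project's actual code.

```python
import asyncio
import threading

async def main():
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def producer():
        # Runs on a worker thread: hand each put to the event loop thread
        # instead of touching the queue directly.
        for token in ["Hello", ",", " world"]:
            loop.call_soon_threadsafe(queue.put_nowait, token)
        loop.call_soon_threadsafe(queue.put_nowait, None)  # sentinel: stream finished

    threading.Thread(target=producer, daemon=True).start()

    received = []
    while (item := await queue.get()) is not None:
        received.append(item)
    return received
```

Calling `asyncio.run(main())` collects the items in the order the producer emitted them, because callbacks scheduled with `call_soon_threadsafe` run in FIFO order.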
Assets 37
v0.2.0
What's Changed
- kernel: port softcap support for flash attention by @guocuimi in #298
- test: added unittests for attention sliding window by @guocuimi in #299
- model: added gemma2 with softcap and sliding window support by @guocuimi in #300
- kernel: support kernel test in python via pybind by @guocuimi in #301
- test: added unittests for marlin fp16xint4 gemm by @guocuimi in #302
- fix: move eos out of stop token list to honor ignore_eos option by @guocuimi in #305
- refactor: move models to upper folder by @guocuimi in #306
- kernel: port gptq marlin kernel and fp8 marlin kernel by @guocuimi in #307
- rust: upgrade rust libs to latest version by @guocuimi in #309
- refactor: remove the logic loading individual weight from shared partitions by @guocuimi in #311
- feat: added fused column parallel linear by @guocuimi in #313
- feat: added gptq marlin qlinear layer by @guocuimi in #312
- kernel: port awq repack kernel by @guocuimi in #314
Full Changelog: v0.1.9...v0.2.0
Assets 37
v0.1.9
What's Changed
- ci: cancel all previous runs if a new one is triggered by @guocuimi in #283
- pypi: fix invalid classifier by @guocuimi in #284
- refactor: remove exllama kernels by @guocuimi in #285
- kernel: added marlin dense and sparse kernels by @guocuimi in #287
- debug: added environment collection script. by @guocuimi in #288
- kernel: added triton kernel build support by @guocuimi in #289
- feat: added THUDM/glm-4* support by @guocuimi in #292
- fix: handle unfinished utf8 bytes for tiktoken tokenizer by @guocuimi in #293
- triton: fix build error and add example with unittest by @guocuimi in #294
- model: added qwen2 support by @guocuimi in #295
- feat: added sliding window support for QWen2 by @guocuimi in #296
- ci: fix pytest version to avoid flakiness by @guocuimi in #297
Full Changelog: v0.1.8...v0.1.9
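The "handle unfinished utf8 bytes for tiktoken tokenizer" fix (#293) relates to a streaming problem: byte-level tokenizers can emit a token whose bytes end in the middle of a multi-byte UTF-8 character. The sketch below shows one generic way to handle this in Python with an incremental decoder that holds back incomplete bytes until the rest arrives. It is not the project's C++ implementation, and `stream_decode` is a hypothetical name.

```python
import codecs

def stream_decode(token_byte_chunks):
    """Yield text per chunk, holding back bytes that end mid-character."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in token_byte_chunks:
        text = decoder.decode(chunk)  # incomplete trailing bytes stay buffered
        if text:
            yield text
    tail = decoder.decode(b"", final=True)  # flush whatever is left at end of stream
    if tail:
        yield tail
```

For example, feeding the chunks `b"caf"`, `b"\xc3"` and `b"\xa9!"` yields `"caf"` and then `"é!"`, rather than emitting a broken character after the second chunk.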
Assets 37
v0.1.8
2e14170
This commit was created on GitHub.com and signed with GitHub's verified signature.
Assets 37
v0.1.7
f0f7e07
This commit was created on GitHub.com and signed with GitHub's verified signature.
What's Changed
- build: fix build error with gcc-13 by @guocuimi in #264
- kernel: upgrade cutlass to 3.5.0 + cuda 12.4 for sm89 fp8 support by @guocuimi in #265
- cmake: define header only library instead of symbol link for cutlass and flashinfer by @guocuimi in #266
- feat: added range to support Range-for loops by @guocuimi in #267
- kernel: added attention cpu implementation for testing by @guocuimi in #268
- build: added nvbench as submodule by @guocuimi in #269
- build: upgrade cmake required version from 3.18 to 3.26 by @guocuimi in #270
- ci: build and test in devel docker image by @guocuimi in #272
- ci: use manylinux image to build wheel and run pytest by @guocuimi in #271
- attention: added tile logic using cute::local_tile into cpu attention by @guocuimi in #273
- kernel: added playground for learning and experimenting with cute. by @guocuimi in #274
- feat: added rope scaling support for llama3.1 by @guocuimi in #277
- update docs for llama3.1 support and bump up version by @guocuimi in #278
Full Changelog: v0.1.6...v0.1.7
Assets 26
v0.1.6
7aeb7fa
This commit was created on GitHub.com and signed with GitHub's verified signature.
What's Changed
- allow deploy docs when triggered on demand by @guocuimi in #253
- [model] support vision language model llava. by @liutongxuan in #178
- dev: fix issues in run_in_docker script by @guocuimi in #254
- dev: added cuda 12.4 build support by @guocuimi in #255
- build: fix multiple definition issue by @guocuimi in #256
- fix: check against num_tokens instead of num_prompt_tokens for shared blocks by @guocuimi in #257
- bugfix: fix invalid max_cache_size when device is cpu. by @liutongxuan in #259
- ci: fail test if not all tests were passed successfully by @guocuimi in #263
- Revert "[model] support vision language model llava. (#178)" by @guocuimi in #262
Full Changelog: v0.1.5...v0.1.6
Assets 26