Blog – PyTorch
Blog
SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips
TLDR: Efficient full-parameter fine-tuning of GPT-OSS-20B & Qwen3-14B models on a single NVIDIA GH200 and…
When Quantization Isn’t Enough: Why 2:4 Sparsity Matters
Blog, Community
TL;DR Combining 2:4 sparsity with quantization offers a powerful approach to compress large language models…
TorchAO Quantized Models and Quantization Recipes Now Available on HuggingFace Hub
Blog
PyTorch now offers native quantized variants of Phi4-mini-instruct, Qwen3, SmolLM3-3B and gemma-3-270m-it through a collaboration…
Experience in Reducing PT2 Compilation Time for Meta Internal Workloads
Blog
The Challenge of PyTorch 2.0 Compilation Since the release of PyTorch 2.0 (PT2) and its…
High-performance quantized LLM inference on Intel CPUs with native PyTorch
Blog
PyTorch 2.8 has just been released with a set of exciting new features, including a…
Intel PyTorch Team | September 17, 2025
PyTorch 2.8 Brings Native XCCL Support to Intel GPUs: Case Studies from Argonne National Laboratory
Blog
Intel announces a major enhancement for distributed training in PyTorch 2.8: the native integration of…
Intel PyTorch Team, Argonne National Laboratory | September 12, 2025
Disaggregated Inference at Scale with PyTorch & vLLM
Blog, Community
Key takeaways: PyTorch and vLLM have been organically integrated to accelerate cutting-edge generative AI applications,…
Distributed Checkpoint: Efficient checkpointing in large-scale jobs
Blog
As training jobs become larger, the likelihood of failures such as preemptions, crashes, or infrastructure…
Yellow Teaming on Arm: A look inside our responsible AI workshop
Blog, Community
A few months back, I traveled to Berlin to attend the WeAreDevelopers World Congress. During…
Annie Tallund | September 5, 2025
Fast 2-Simplicial Attention: Hardware-Efficient Kernels in TLX
Blog
In this blog post, we explore the kernel design details presented in the paper Fast…
PyTorch 2.8+TorchAO: Unlock Efficient LLM Inference on Intel® AI PCs
Blog
Large Language Models (LLMs) have transformed tasks across numerous industries, including drafting emails, generating code,…
Intel PyTorch Team | September 3, 2025
Accelerating 2K scale pre-training up to 1.28x with TorchAO, MXFP8 and TorchTitan on Crusoe B200 Cluster
Blog
TL;DR: 1.22x - 1.28x training acceleration with MXFP8, equivalent convergence compared to BF16. We recently…
A Primer on LLM Post-Training
Blog
Large Language Models (LLMs) have revolutionized how we write and consume documents. In the past…
Davide Testuggine | August 26, 2025
DRAMA Model Inference Efficiency Boosted by 1.7x-2.3x
Blog
TL;DR NJTs (Nested Jagged Tensors) boost DRAMA model inference efficiency by 1.7x-2.3x, making it more…
Shreya Goyal | August 22, 2025
ZenFlow: Stall-Free Offloading Engine for LLM Training
Blog
Introduction ZenFlow is a new extension to DeepSpeed introduced in summer 2025, designed as a…
Accelerating MoE’s with a Triton Persistent Cache-Aware Grouped GEMM Kernel
Blog
In this post, we present an optimized Triton BF16 Grouped GEMM kernel for running training…
Less Wright, Adnan Hoque, Garrett Goon | August 18, 2025
PyTorch Wheel Variants, the Frontier of Python Packaging
Blog
Tweet by charliemarsh, creator of uv. PyTorch is the leading machine learning framework for developing and…
Eli Uriegas | August 13, 2025
PyTorch Day China Recap
Blog, Community
On June 7, 2025, PyTorch Day China was held in Beijing, co-hosted by PyTorch Foundation…
PyTorch Foundation | August 12, 2025
Introducing Mixed Precision Training in Opacus
Blog
Introduction We integrate mixed and low-precision training with Opacus to unlock increased throughput and training…
Iden Kalemaj, Huanyu Zhang | August 12, 2025
Bringing Generative AI to the Masses with ExecuTorch and KleidiAI
Blog
Key Takeaways: ExecuTorch 0.7 now enables KleidiAI by default, delivering automatic acceleration on Arm CPUs…
© 2025 PyTorch. Copyright © The Linux Foundation®. All rights reserved.