Efficient AI Computing,
Transforming the Future.
Who We Are
Welcome to MIT HAN Lab! We specialize in efficient generative AI, including large language models (LLMs), multi-modal models (VLMs/VLAs), and diffusion models. Today’s foundation models are remarkably powerful but prohibitively costly in terms of computation, energy, and scalability. At MIT HAN Lab, we integrate algorithm–system co-design to push the frontier of AI efficiency and performance. Our research spans the entire AI stack—from pre-training and post-training to model compression and deployment—bridging fundamental breakthroughs with real-world applications. By rethinking how AI is designed with GPU efficiency in mind, we aim to make generative AI faster, greener, and more accessible.
Alumni: Ji Lin (OpenAI), Hanrui Wang (Co-Founder @Eigen AI), Zhijian Liu (assistant professor @UCSD), Han Cai (NVIDIA Research), Haotian Tang (Google DeepMind -> Meta), Yujun Lin (NVIDIA Research), Wei-Chen Wang (Co-Founder @Eigen AI), Wei-Ming Chen (NVIDIA).
Highlights
Accelerating LLM and Generative AI [slides]:
- LLM Quantization: AWQ and TinyChat enable on-device LLM inference with 4-bit quantization (Best Paper Award at MLSys'24), with 19 million downloads on HuggingFace. SmoothQuant is a training-free and accuracy-preserving 8-bit post-training quantization (PTQ) solution for LLMs. QServe speeds up large-scale LLM serving with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache). COAT enables memory-efficient FP8 training. (A minimal quantization sketch follows this list.)
- Long Context LLM: StreamingLLM enables LLMs to generate infinite-length texts with a fixed memory budget by preserving the "attention sinks" in the KV cache. StreamingVLM introduced a streaming-aware KV cache with attention sinks to enable real-time understanding of infinite video streams. Quest leverages query-aware sparsity in the long-context KV cache to boost inference throughput. DuoAttention reduces both LLM decoding and pre-filling memory and latency with retrieval and streaming heads. LServe accelerates long-context LLM serving with a hardware-aware unified sparse attention framework.
- Sparse Attention: SpAtten invented cascade KV cache pruning and head pruning. XAttention accelerates long-context prefilling with block sparse attention and antidiagonal scoring. Sparse VideoGen introduced an online profiling strategy to identify spatial-temporal sparsity and a hardware-efficient layout transformation. Radial Attention identified the Spatiotemporal Energy Decay phenomenon and proposed a corresponding O(n log n) sparse attention mechanism. Sparse VideoGen2 introduced semantic-aware permutation and efficient dynamic block size attention kernels.
- Efficient Visual Generation: HART is an autoregressive visual generation model capable of directly generating 1024×1024 images on a laptop. SANA enables 4K image synthesis under low computation, using deep compression auto-encoder (DC-AE) and linear diffusion transformer. SANA-1.5 explores efficient training scaling and inference scaling for diffusion models. SANA-Sprint is a one-step distilled diffusion model enabling real-time generation. SVDQuant further enables 4-bit diffusion models (W4A4) by absorbing the outliers with low-rank components. SANA-Video introduced the Linear Diffusion Transformer and a constant-memory KV cache. DC-VideoGen introduced a chunk-causal Deep Compression Video Autoencoder and the AE-Adapt-V adaptation strategy.
- Efficient Visual Language Models: VILA, VILA-U, LongVILA are a family of efficient visual language models for both understanding and generation. LongVILA efficiently scales to 6K frames of video.
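For readers curious what 4-bit weight quantization means concretely, here is a minimal group-wise INT4 quantize/dequantize sketch in NumPy. The group size, asymmetric rounding, and layout are illustrative assumptions, not the actual AWQ or QServe kernels.

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Illustrative group-wise asymmetric INT4 quantization of a weight row."""
    w = w.reshape(-1, group_size)                      # split the row into groups
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0                     # 4 bits -> 16 levels (0..15)
    zero = np.round(-w_min / scale)                    # per-group zero point
    q = np.clip(np.round(w / scale) + zero, 0, 15)     # integer codes
    return q.astype(np.uint8), scale, zero

def dequantize(q, scale, zero):
    """Recover approximate FP weights from codes, scales, and zero points."""
    return (q.astype(np.float32) - zero) * scale

w = np.random.randn(4096).astype(np.float32)
q, s, z = quantize_int4_groupwise(w)
print("max abs error:", np.abs(dequantize(q, s, z).reshape(-1) - w).max())
```

The integer codes take 4 bits per weight plus a small per-group scale/zero overhead, which is where the roughly 4x memory saving for weight-only quantization comes from.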
We Work On
The incredible potential of large models in Artificial Intelligence Generated Content (AIGC), including cutting-edge technologies like Large Language Models (LLMs) and Diffusion Models, has revolutionized a wide range of applications, spanning natural language processing, content generation, creative arts, and more. However, large model sizes and high memory and computational requirements present formidable challenges. We aim to tackle these hurdles head-on and make these advanced AI technologies more practical, democratizing access to these future-changing technologies for everyone.



Efficiency improvements in deep learning often start with refining algorithms, but theoretical gains such as reduced FLOPs and model size do not automatically translate into practical speed and energy savings. Bridging this gap calls for specialized hardware and software systems. These systems form a fresh design dimension independent of the algorithm space, opening up opportunities for holistic optimization by co-designing the algorithm together with the software/hardware system.






News
- Oct 2026: SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer (in submission). SANA-Video is a fast, efficient diffusion model that generates high-quality, minute-long videos at up to 720×1280 resolution. It uses linear attention and a constant-memory KV cache to handle long videos with fixed memory, enabling real-time (27 FPS) one-minute video generation.
- Mar 2026: Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter (to appear at ASPLOS 2026). TLT is a lossless acceleration framework for reasoning-oriented LLM RL training, introducing adaptive speculative decoding to eliminate long-tail generation bottlenecks. It achieves over 1.7× end-to-end speedup while fully preserving model quality and producing a high-quality draft model for efficient deployment.
- Oct 2025: StreamingVLM: Real-Time Understanding for Infinite Video Streams (arXiv 2025). StreamingVLM enables real-time understanding of infinite videos with low, stable latency. By aligning training on overlapped video chunks with an efficient KV cache, it runs at 8 FPS on a single H100. It achieves a 66.18% win rate vs. GPT-4o mini on a new benchmark with videos averaging over 2 hours long.
- Sep 2026: DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder (arXiv). We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation with a Deep Compression Video Autoencoder and a robust adaptation strategy, AE-Adapt-V.
- Oct 2025: SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation (ICCV 2025). SANA-Sprint is a one-step distilled diffusion model enabling real-time generation; deployable on a laptop GPU; top-notch GenEval & DPGBench results.
- Jul 2025: SANA-1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer (ICML 2025). SANA-1.5 explores efficient training scaling and inference scaling for diffusion models; deployable on a laptop GPU; top-notch GenEval & DPGBench results.
- Aug 2025: Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search (NeurIPS 2025). Jet-Nemotron is a family of hybrid models leveraging both full and linear attention, offering accuracy on par with leading full-attention LMs like Qwen3, Llama 3.2, and Gemma3n. Jet-Nemotron-2B provides a 47× generation throughput speedup under a 64K context length compared to Qwen3-1.7B-Base, achieving top-tier accuracy with exceptional efficiency.
- Jul 2025: XAttention: Block Sparse Attention with Antidiagonal Scoring (ICML 2025). A plug-and-play method that uses antidiagonal sums to efficiently identify important parts of the attention matrix, achieving up to 13.5× speedup on long-context tasks with accuracy comparable to full attention.
- Dec 2025: Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation (NeurIPS 2025). An O(n log n) sparse attention mask for long video generation.
- Mar 2025: HART. HART has been highlighted by MIT News: AI tool generates high-quality images faster than state-of-the-art approaches!
- Dec 2024: AWQ. 🔥⚡ We release TinyChat 2.0, the latest version with significant advancements in the prefilling speed of edge LLMs and VLMs, 1.5-1.7× faster than the previous version of TinyChat. Please refer to our blog for more details.
- Dec 2024: DistriFusion. DistriFusion is integrated into NVIDIA's TensorRT-LLM for distributed inference on high-resolution image generation.
- Aug 2024: AWQ. 🔥 NVIDIA TensorRT-LLM, AMD, Google Vertex AI, Amazon SageMaker, Intel Neural Compressor, FastChat, vLLM, HuggingFace TGI, and LMDeploy adopt AWQ to improve LLM serving efficiency. Our AWQ models on HuggingFace have received over 6 million downloads.
- May 2024: Congrats on graduation! Cheers on the next move: Zhijian Liu: assistant professor at UCSD; Hanrui Wang: assistant professor at UCLA; Ji Lin: OpenAI; Han Cai: NVIDIA Research; Wei-Chen Wang (postdoc): Amazon; Wei-Ming Chen (postdoc): NVIDIA.
- Mar 2024: SmoothQuant. We show SmoothQuant can enable W8A8 quantization for Llama-1/2, Falcon, Mistral, and Mixtral models with negligible loss.
- Feb 2024: AWQ. We supported VILA vision language models in AWQ & TinyChat! Check our latest demos with multi-image inputs!
- Jan 2024: StreamingLLM. StreamingLLM is integrated by HPC-AI Tech SwiftInfer to support infinite input length for LLM inference.
- Dec 2023: StreamingLLM. StreamingLLM is integrated by CMU, UW, and OctoAI, enabling endless and efficient LLM generation on iPhone!
- Dec 2023: Congrats to Ji Lin on completing and defending his PhD thesis, "Efficient Deep Learning Computing: From TinyML to Large Language Model". Ji joined OpenAI after graduation.
- Dec 2023: AWQ. AWQ is integrated into NVIDIA TensorRT-LLM; it can fit Falcon-180B on a single H200 GPU with INT4 AWQ and runs Llama-70B 6.7× faster than on A100.
- Nov 2023: AWQ. 🔥 AWQ is now integrated natively in Hugging Face transformers through from_pretrained. You can either load quantized models from the Hub or your own HF quantized models.
- Oct 2023: StreamingLLM. Attention Sinks, a community library, enables StreamingLLM on more HuggingFace LLMs. See the blog for details.
- Oct 2023: QuantumNAS. Congrats to the QuantumNAS team on the 1st Place Award of the ACM Quantum Computing for Drug Discovery Contest at ICCAD 2023.
- Jun 2019: ProxylessNAS. Congrats to the ProxylessNAS team on First Place in the Visual Wake Words Challenge (TF-Lite track) at CVPR 2019.
- Jun 2019: Park. Congrats to Hanrui Wang and the Park team on the Best Paper Award of the ICML 2019 Reinforcement Learning for Real Life Workshop.
- Apr 2023: QuantumNAT. Congrats to Hanrui Wang and the QuantumNAT team on the Best Poster Award at the 2023 NSF Athena AI Institute.
- Sep 2022: Congrats to Hanrui Wang and team on the Best Paper Award of the IEEE International Conference on Quantum Computing and Engineering (QCE).
- Jun 2023: EIE Retrospective. Congrats to Song Han and the EIE team: EIE was recognized among the Top 5 cited papers in 50 years of ISCA.
- May 2022: QuantumNAS. Congrats to Hanrui Wang and the QuantumNAS team on the Best Poster Award at the 2022 NSF Athena AI Institute.
- Nov 2025: New blog post "Infinite Context Length with Global but Constant Attention Memory". By reducing complexity from O(N^2) to O(N), linear attention is the key to processing ultra-long sequences. This post explores its mathematical core, "state accumulation", and how it unlocks infinite context for LLMs and long video generation.
- Aug 2025: New blog post "Statistics behind Block Sparse Attention". A statistical model revealing how block sparse attention achieves efficiency and accuracy through learned similarity gaps.
- Aug 2025: New blog post "Why Stacking Sliding Windows Can't See Very Far". A mathematical explanation of why sliding window attention's effective receptive field is O(W) rather than the theoretical O(LW), regardless of depth, due to information dilution and exponential decay from residual connections.
- Aug 2025: New blog post "How Attention Sinks Keep Language Models Stable". We discovered why language models catastrophically fail on long conversations: when old tokens are removed to save memory, models produce complete gibberish. We found models dump massive attention onto the first few tokens as "attention sinks": places to park unused attention, since softmax requires weights to sum to 1. Our solution, StreamingLLM, simply keeps these first 4 tokens permanently while sliding the window for everything else, enabling stable processing of 4 million+ tokens instead of just thousands. This mechanism is now in HuggingFace, NVIDIA TensorRT-LLM, and OpenAI's latest models. (A minimal sketch of this cache policy appears after this news list.)
- Jul 2025: New blog post "Radial Attention: O(nlogn) Sparse Attention for Long Video Generation with 2–4× Speedups in Training and Inference". A sparse attention mechanism with O(n log n) computational complexity for long video generation. It can speed up both training and inference by 2–4×. The code is available at https://github.com/mit-han-lab/radial-attention
- Feb 2025: New blog post "SVDQuant Meets NVFP4: 4× Smaller and 3× Faster FLUX with 16-bit Quality on NVIDIA Blackwell GPUs". SVDQuant supports NVFP4 on NVIDIA Blackwell GPUs with 3× speedup over BF16 and better image quality than INT4. Try our interactive demo at https://svdquant.mit.edu/! Our code is available at https://github.com/mit-han-lab/nunchaku
- Feb 2025: New blog post "RTX 5090 Workstation Configuration Journey". With the arrival of the RTX 5090, we built a high-performance workstation to maximize its AI computing potential. In this blog post, we share our experience, from overcoming setup challenges to testing its performance.
- Dec 2024: New blog post "TinyChat 2.0: Accelerating Edge AI with Efficient LLM and VLM Deployment". Explore the latest advancement in TinyChat: the 2.0 version with significant advancements in the prefilling speed of edge LLMs and VLMs. Apart from the 3-4× decoding speedups achieved with AWQ quantization, TinyChat 2.0 now delivers state-of-the-art time-to-first-token, 1.5-1.7× faster than the legacy version of TinyChat.
- Nov 2024: New blog post "SVDQuant: Accurate 4-Bit Quantization Powers 12B FLUX on a 16GB 4090 Laptop with 3x Speedup". A new post-training quantization paradigm for diffusion models, which quantizes both the weights and activations of FLUX.1 to 4 bits, achieving 3.5× memory and 8.7× latency reduction on a 16GB laptop 4090 GPU. Code: https://www.github.com/mit-han-lab/nunchaku
- Oct 2024: New blog post "Block Sparse Attention". We introduce Block Sparse Attention, a library of sparse attention kernels that supports various sparse patterns, including streaming attention with token granularity, streaming attention with block granularity, and block-sparse attention. By incorporating these patterns, Block Sparse Attention can significantly reduce the computational costs of LLMs, thereby enhancing their efficiency and scalability. We release the implementation of Block Sparse Attention, modified based on FlashAttention 2.4.2.
- Mar 2024: New blog post "Patch Conv: Patch Convolution to Avoid Large GPU Memory Usage of Conv2D". In this blog, we introduce PatchConv to reduce the memory footprint when generating high-resolution images. PatchConv cuts memory usage by over 2.4× compared to the existing PyTorch implementation. Code: https://github.com/mit-han-lab/patch_conv
- Feb 2024: New blog post "DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models". In this blog, we introduce DistriFusion, a training-free algorithm that harnesses multiple GPUs to accelerate diffusion model inference without sacrificing image quality. It can reduce SDXL latency by up to 6.1× on 8 A100s. Our work has been accepted by CVPR 2024 as a highlight. Code: https://github.com/mit-han-lab/distrifusion
- Mar 2024: New blog post "TinyChat: Visual Language Models & Edge AI 2.0". Explore the latest advancement in TinyChat and AWQ: the integration of visual language models (VLMs) on the edge! The exciting advancements in VLMs allow LLMs to comprehend visual inputs, enabling seamless image understanding tasks like caption generation, question answering, and more. With the latest release, TinyChat now supports leading VLMs such as VILA, which can be easily quantized with AWQ, giving users a seamless experience for image understanding tasks.
- Nov 2022: New blog post "On-Device Training Under 256KB Memory". In MCUNetV3, we enable on-device training under 256KB SRAM and 1MB Flash, using less than 1/1000 of the memory of PyTorch while matching the accuracy on the visual wake words application. It enables the model to adapt to newly collected sensor data, and users can enjoy customized services without uploading data to the cloud, thus protecting privacy.
- May 2020: New blog post "Efficiently Understanding Videos, Point Cloud and Natural Language on NVIDIA Jetson Xavier NX". Thanks to NVIDIA's amazing deep learning ecosystem, we were able to deploy three applications on Jetson Xavier NX soon after receiving the kit: efficient video understanding with the Temporal Shift Module (TSM, ICCV'19), efficient 3D deep learning with Point-Voxel CNN (PVCNN, NeurIPS'19), and efficient machine translation with the Hardware-Aware Transformer (HAT, ACL'20).
- Jul 2020: New blog post "Auto Hardware-Aware Neural Network Specialization on ImageNet in Minutes". This tutorial introduces how to use the Once-for-All (OFA) network to get specialized ImageNet models for target hardware in minutes with only your laptop.
- Jul 2020: New blog post "Reducing the carbon footprint of AI using the Once-for-All network". "The aim is smaller, greener neural networks," says Song Han, an assistant professor in the Department of Electrical Engineering and Computer Science. "Searching efficient neural network architectures has until now had a huge carbon footprint. But we reduced that footprint by orders of magnitude with these new methods."
- Sep 2023: New blog post "TinyChat: Large Language Model on the Edge". Running large language models (LLMs) on the edge is of great importance. In this blog, we introduce TinyChat, an efficient and lightweight system for LLM deployment on the edge. It runs Meta's latest LLaMA-2 model at 30 tokens/second on NVIDIA Jetson Orin and can easily support different models and hardware.
- Jun 2023: Zhijian Liu presented "Efficient 3D Perception for Autonomous Vehicles" at the CVPR Workshop on Efficient Computer Vision.
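As referenced in the attention-sinks post above, here is a minimal sketch of the sink-plus-sliding-window cache policy. The four permanently kept tokens follow the post; the data structure itself is an illustrative simplification, not StreamingLLM's actual implementation.

```python
from collections import deque

class SinkSlidingWindowCache:
    """Keep the first `num_sinks` tokens forever; slide a window over the rest."""

    def __init__(self, num_sinks=4, window=8):
        self.num_sinks = num_sinks
        self.sinks = []                     # attention-sink tokens, never evicted
        self.window = deque(maxlen=window)  # recent tokens, oldest evicted first

    def append(self, kv):
        """Add one token's KV entry to the cache."""
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.window.append(kv)          # deque drops the oldest automatically

    def view(self):
        """KV entries visible to attention: attention sinks + recent window."""
        return self.sinks + list(self.window)

cache = SinkSlidingWindowCache(num_sinks=4, window=8)
for t in range(20):
    cache.append(f"kv_{t}")
print(cache.view())   # kv_0..kv_3 plus the 8 most recent entries
```

Because the cache size is bounded by num_sinks + window regardless of how many tokens have been processed, memory stays constant while the sinks keep the softmax distribution stable.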
Our Full-Stack Projects

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720×1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design a block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16× faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4× speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.
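For intuition on the "constant-memory state, derived from the cumulative properties of linear attention" mentioned above, here is a minimal sketch of state accumulation in linear attention. The feature map and normalization are illustrative assumptions, not SANA-Video's block linear attention kernel.

```python
import numpy as np

def phi(x):
    """Illustrative non-negative feature map for linear attention."""
    return np.maximum(x, 0.0) + 1e-6

def linear_attention_streaming(qs, ks, vs):
    """Process tokens one by one with a constant-size running state.

    Instead of keeping all past K/V (memory grows with sequence length),
    we accumulate S = sum_i phi(k_i)^T v_i and z = sum_i phi(k_i),
    so memory stays O(d^2) no matter how many tokens have been seen.
    """
    d = qs.shape[-1]
    S = np.zeros((d, d))          # running sum of outer products phi(k) x v
    z = np.zeros(d)               # running normalizer
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        S += np.outer(fk, v)      # constant-memory "KV cache" update
        z += fk
        fq = phi(q)
        outs.append(fq @ S / (fq @ z))   # attention output for this token
    return np.stack(outs)

T, d = 1000, 64
q, k, v = (np.random.randn(T, d) for _ in range(3))
print(linear_attention_streaming(q, k, v).shape)   # (1000, 64), O(d^2) memory
```

The fixed-size state S plays the role of a global KV cache: it summarizes everything generated so far at a constant memory cost, which is what makes minute-long autoregressive generation feasible.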

SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
SANA-Video is a fast, efficient diffusion model that generates high-quality, minute-long videos at up to 720×1280 resolution. It uses linear attention and a constant-memory KV cache to handle long videos with fixed memory, enabling real-time (27 FPS) one-minute video generation.

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation. DC-VideoGen can be applied to any pre-trained video diffusion model, improving efficiency by adapting it to a deep compression latent space with lightweight fine-tuning. The framework builds on two key innovations: (i) a Deep Compression Video Autoencoder with a novel chunk-causal temporal design that achieves 32x/64x spatial and 4x temporal compression while preserving reconstruction quality and generalization to longer videos; and (ii) AE-Adapt-V, a robust adaptation strategy that enables rapid and stable transfer of pre-trained models into the new latent space. Adapting the pre-trained Wan-2.1-14B model with DC-VideoGen requires only 10 GPU days on the NVIDIA H100 GPU. The accelerated models achieve up to 14.8x lower inference latency than their base counterparts without compromising quality, and further enable 2160x3840 video generation on a single GPU.
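For a rough sense of why deep compression matters, here is a back-of-the-envelope sketch of the latent token count under the 32× spatial / 4× temporal compression described above. The clip size and the 8×-spatial-compression baseline are illustrative assumptions, not figures from the paper.

```python
def latent_tokens(frames, height, width, spatial_f=32, temporal_f=4):
    """Token count of the latent grid after autoencoder compression."""
    return (frames // temporal_f) * (height // spatial_f) * (width // spatial_f)

# Illustrative example: an 80-frame 720x1280 clip.
baseline = 80 * (720 // 8) * (1280 // 8)      # assumed 8x-spatial, no temporal compression
deep = latent_tokens(80, 720, 1280)           # 32x spatial, 4x temporal compression
print(baseline, deep, f"about {baseline / deep:.0f}x fewer tokens")
```

Since diffusion transformer cost grows quickly with token count, shrinking the latent grid this aggressively is what drives the reported latency reductions, provided the autoencoder preserves reconstruction quality.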

DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder
We introduce DC-VideoGen, a post-training acceleration framework for efficient video generation with a Deep Compression Video Autoencoder and a robust adaptation strategy, AE-Adapt-V.

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement Learning (RL), encounters critical efficiency bottlenecks: response generation during RL training exhibits a persistent long-tail distribution, where a few very long responses dominate execution time, wasting resources and inflating costs. To address this, we propose TLT, a system that accelerates reasoning RL training losslessly by integrating adaptive speculative decoding. Applying speculative decoding in RL is challenging due to the dynamic workloads, evolving target model, and draft model training overhead. TLT overcomes these obstacles with two synergistic components: (1) Adaptive Drafter, a lightweight draft model trained continuously on idle GPUs during long-tail generation to maintain alignment with the target model at no extra cost; and (2) Adaptive Rollout Engine, which maintains a memory-efficient pool of pre-captured CUDAGraphs and adaptively selects suitable SD strategies for each input batch. Evaluations demonstrate that TLT achieves over 1.7× end-to-end RL training speedup over state-of-the-art systems, preserves the model accuracy, and yields a high-quality draft model as a free byproduct suitable for efficient deployment.
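For context, here is a minimal sketch of the greedy speculative-decoding loop that systems like this build on. It is a generic illustration with toy stand-ins for the models, not TLT's adaptive drafter or rollout engine.

```python
def speculative_decode_step(target_next_token, draft_next_token, prefix, k=4):
    """One round of greedy speculative decoding.

    draft_next_token / target_next_token: callables mapping a token prefix to
    that model's next token (toy stand-ins for real LLM forward passes).
    The draft proposes k tokens; the target verifies them, and we keep the
    longest agreeing prefix plus one corrected token.
    """
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next_token(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. The target model verifies every position (in practice, one batched pass).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        t_star = target_next_token(ctx)
        if t_star != t:              # first disagreement: stop and correct
            accepted.append(t_star)
            return accepted
        accepted.append(t)
        ctx.append(t)
    return accepted                  # all k draft tokens accepted

# Toy usage: both "models" just repeat the last token, so all drafts are accepted.
toy = lambda ctx: ctx[-1]
print(speculative_decode_step(toy, toy, prefix=[1, 2, 3], k=4))  # [3, 3, 3, 3]
```

The speedup comes from the draft model being much cheaper than the target: when most proposals are accepted, several tokens are produced per expensive target pass, which is exactly the regime long-tail rollouts put RL training in.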

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
TLT is a lossless acceleration framework for reasoning-oriented LLM RL training, introducing adaptive speculative decoding to eliminate long-tail generation bottlenecks. It achieves over 1.7× end-to-end speedup while fully preserving model quality and producing a high-quality draft model for efficient deployment.
Radial Attention: O(nlogn) Sparse Attention with Energy Decay for Long Video Generation
Recent advances in diffusion models have enabled high-quality video generation, but the additional temporal dimension significantly increases computational costs, making training and inference on long videos prohibitively expensive. In this paper, we identify a phenomenon we term Spatiotemporal Energy Decay in video diffusion models: post-softmax attention scores diminish as the spatial and temporal distances between tokens increase, akin to the physical decay of signals or waves over space and time in nature. Motivated by this, we propose Radial Attention, a scalable sparse attention mechanism with O(n log n) complexity that translates energy decay into exponentially decaying compute density, which is significantly more efficient than standard O(n^2) dense attention and more expressive than linear attention. Specifically, Radial Attention employs a simple, static attention mask where each token attends to spatially nearby tokens, with the attention window size shrinking with temporal distance. Moreover, it allows pre-trained video diffusion models to extend their generation length with efficient LoRA-based fine-tuning. Extensive experiments show that Radial Attention maintains video quality across Wan2.1-14B, HunyuanVideo, and Mochi 1, achieving up to a 1.9× speedup over the original dense attention. With minimal tuning, it enables video generation up to 4× longer while reducing training costs by up to 4.4× compared to direct fine-tuning and accelerating inference by up to 3.7× compared to dense attention inference.
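As a rough illustration of a static mask whose spatial window shrinks with temporal distance, here is a minimal sketch. The window sizes and the halving rule are illustrative assumptions, not the paper's exact mask construction.

```python
import numpy as np

def radial_style_mask(n_frames, tokens_per_frame, base_window=8):
    """Boolean attention mask where the spatial window shrinks with temporal distance.

    Query token (frame i, position p) attends to key token (frame j, position q)
    only if |p - q| is within a window that roughly halves each time the temporal
    distance |i - j| doubles, so compute density decays away from the diagonal.
    """
    n = n_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for qi in range(n):
        for ki in range(n):
            fi, pi = divmod(qi, tokens_per_frame)
            fj, pj = divmod(ki, tokens_per_frame)
            dt = abs(fi - fj)
            if dt <= 1:
                window = base_window
            else:
                window = max(base_window >> (dt.bit_length() - 1), 1)
            mask[qi, ki] = abs(pi - pj) <= window
    return mask

m = radial_style_mask(n_frames=8, tokens_per_frame=16)
print(m.shape, f"density = {m.mean():.2f}")   # far sparser than a dense mask
```

Because the per-frame window decays exponentially with temporal distance, the total number of attended pairs grows on the order of n log n rather than n^2, which is the source of the claimed complexity reduction.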
Radial Attention: O(nlogn) Sparse Attention with Energy Decay for Long Video Generation
An O(n log n) sparse attention mask for long video generation.
Our Impacts
We actively collaborate with industry partners on efficient AI, model compression and acceleration. Our research has influenced and landed in many industrial products: Intel OpenVino, Intel Neural Network Distiller, Intel Neural Compressor, Apple Neural Engine, NVIDIA Sparse Tensor Core, NVIDIA TensorRT LLM, AMD-Xilinx Vitis AI, Qualcomm AI Model Efficiency Toolkit (AIMET), Amazon AutoGluon, Facebook PyTorch, Microsoft NNI, SONY Neural Architecture Search Library, SONY Model Compression Toolkit, ADI MAX78000/MAX78002 Model Training and Synthesis Tool.








