Hi, this is Wei Huang (黄炜)’s website!

I am currently a Ph.D. student at HKU, supervised by Prof. Xiaojuan Qi and Prof. Shiming Zhang. I am also co-supervised by Prof. Zhongrui Wang.

I obtained my bachelor’s degree in June 2023, supervised by Prof. Si Liu.

Now, I am fortunate to intern at NVIDIA Research, working with Dr. Yukang Chen and supervised by Prof. Song Han. I am also guided by Dr. Hongxu Yin and Dr. Sifei Liu.

I focus on efficient & tiny deep learning for lightweight, long-sequence, and fast AI.

This direction covers, but is not limited to, the following topics:

🚀 Efficient Compression: Compression of LLMs, VLMs, and diffusion models (ultra-low-bit quantization, pruning, and sparsity).

🧠 Efficient Reasoning: Reinforcement learning for long-sequence, low-cost reasoning in LLMs and VLMs.

🎬 Efficient Generation: Real-time and interactive long-video generation.

Wearable AI: Edge AI for wearable scenarios and for sensitive organic electrochemical transistors (OECTs).

🔥 Brain-inspired Computing: Neuromorphic computing and hardware acceleration (e.g., spiking neural networks, SNNs).

🔥 News

  • 2025.10:  🎉🎉 One paper on wearable-AI-guided glucose management (A Wearable, Dual Closed-loop Insulin Delivery System for Precision Diabetes Management) is accepted by Advanced Materials, a top interdisciplinary journal (IF = 26.8)!
  • 2025.09:  🎉🎉 Two papers are accepted by NeurIPS’25! One on scaling long-video reasoning (Long-RL: Scaling RL to Long Videos) and one on a unified reasoning model (MindOmni: Unleashing Reasoning Generation in Vision Language Models with RGPO). All code is open-sourced now!
  • 2025.05:  🎉🎉 One paper on structured mixed-precision low-bit quantization for LLMs (SliM-LLM) is accepted by ICML’25! All code is open-sourced now!
  • 2025.02:  🎉🎉 One paper on an efficient fine-grained chain-of-thought video understanding framework (VideoEspresso) is accepted by CVPR’25 as an Oral paper (top 0.73%)! All code is open-sourced now!
  • 2025.01:  🎉🎉 Three papers are accepted by ICLR’25! One on MoE-LLM compression (MC-MoE) and two on data efficiency and dynamic neural networks (InfoMax: data pruning; From-Layers-to-States: dynamic neural network layers). All code is open-sourced now!
  • 2024.12:  🎉🎉 One technical report is accepted by Visual Intelligence.
  • 2024.05:  🎉🎉 One paper on SNN security on RRAM is accepted by ICCAD’24! All code is open-sourced now!
  • 2024.04:  🎉🎉 One paper on post-training binary quantization of LLMs is accepted by ICML’24! All code is open-sourced now!

💬 Invited Talks and Reports

  • 2025.11: 青稞社区 online talk on QeRL. Please see the video.
  • 2025.10: Our OmniVinci was reported by 机器之心 and Sina Finance (新浪财经). Please see the link.
  • 2025.10: Our LongLive was reported by 新智元. Please see the link.
  • 2025.07: Our Scaling RL to Long Videos was reported by 机器之心. Please see the link.
  • 2025.06: AI-Time online talk on VideoEspresso. Please see the video.
  • 2024.05: BiLLM was reported by IEEE Spectrum. Thanks to Matthew for the interview and report. Please see the link.
  • 2024.05: AI-Time online talk on BiLLM. Please see the video.
  • 2024.04: Our empirical study How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study (new version: An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs) was reported by QbitAI (量子位). Please see the link.
  • 2024.03: Our BiLLM: Pushing the Limit of Post-Training Quantization for LLMs was reported by QbitAI (量子位). Please see the link.

📝 Publications

arXiv 2025

QeRL: Beyond Efficiency – Quantization-enhanced Reinforcement Learning for LLMs

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen

  • 🧠 4-bit quantized RL training.
  • 💪 Train a 32B LLM on a single H100 GPU.
  • 🎯 Accuracy on par with bfloat16 training.
  • 🔥 Supports the NVFP4 quantization format (see the sketch below).
[paper] [code] [abstract]
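
For intuition only, here is a minimal, hypothetical PyTorch sketch of the core idea: rollouts run with fake 4-bit-quantized weights while gradient updates still target the full-precision parameters. This is not the QeRL implementation; the NVFP4 format is replaced by plain int4 grouping for simplicity, and all names here (fake_quant_int4, QuantizedRolloutLinear, group_size) are illustrative assumptions.

```python
# Toy sketch (not the official QeRL code): fake 4-bit weight quantization
# applied to a policy's linear layers during RL rollouts, while the
# optimizer keeps updating the full-precision master weights.
import torch
import torch.nn as nn

def fake_quant_int4(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric 4-bit fake quantization with per-group scales."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    scale = w_groups.abs().amax(dim=-1, keepdim=True) / 7.0  # int4 range: [-8, 7]
    q = torch.clamp(torch.round(w_groups / scale), -8, 7)
    return (q * scale).reshape(out_features, in_features)

class QuantizedRolloutLinear(nn.Module):
    """Forward pass uses quantized weights; gradients flow to the fp weights."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        w = self.linear.weight
        # Straight-through estimator: forward with quantized weights,
        # backward as if the weights were full precision.
        w_q = w + (fake_quant_int4(w) - w).detach()
        return nn.functional.linear(x, w_q, self.linear.bias)

policy = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 256))
policy[0] = QuantizedRolloutLinear(policy[0])
policy[2] = QuantizedRolloutLinear(policy[2])
logits = policy(torch.randn(4, 1024))  # rollout pass with 4-bit-quantized weights
```
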
arXiv 2025

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen

  • Generates video in real time as users enter text prompts.
  • 20.7 FPS on a single H100, up to 240s per clip. Fine-tunes SOTA short-video models (e.g., Wan) into long-video generators.
  • One step closer to World Models.
[paper] [code] [abstract]
NeurIPS 2025

Scaling RL to Long Videos

Yukang Chen*, Wei Huang*, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

  • MR-SP infrastructure with sequence parallelism and vLLM-based cached-embedding rollouts, enabling RL over up to 8,192 frames (hour-long videos) on 8×A100 GPUs with a 2.1× speedup, and strong results with LongVILA-R1-7B (VideoMME: 65.1% without subtitles, 71.1% with subtitles).
  • Two-stage pipeline combining CoT-SFT and RL to scale reasoning for long-horizon video understanding.
  • LongVideo-Reason (104K long-video QA pairs) with high-quality chain-of-thought annotations across diverse domains.
[paper] [code] [abstract]
ICML 2025

SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi

  • A novel scheme that observes and proves the structured clustering of salient elements in LLM weight matrices.
  • The first group-wise mixed-precision quantization framework for LLMs.
  • Serves as a plug-and-play extension to GPTQ/OmniQuant/…, improving these inference-friendly methods under low-bit quantization (see the sketch below).
[paper] [code] [abstract]
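
The sketch below illustrates the general flavor of salience-driven, group-wise bit allocation, not the released SliM-LLM algorithm: a per-group salience score (weight magnitude weighted by calibration activation norms, an assumption here) decides which groups receive the higher bit-width under a mixed 2-/4-bit budget. All function names and the 50% high-bit ratio are made up for illustration.

```python
# Hypothetical sketch of group-wise mixed-precision quantization:
# more salient weight groups are assigned more bits.
import torch

def quantize_group(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-row quantization of one column group."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax + 1e-8
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

def mixed_precision_quantize(weight, act_norm, group_size=128,
                             high=4, low=2, high_ratio=0.5):
    """weight: (out, in); act_norm: per-input-channel activation norms (in,)."""
    out_f, in_f = weight.shape
    groups = weight.reshape(out_f, in_f // group_size, group_size)
    act = act_norm.reshape(in_f // group_size, group_size)
    # One salience score per group: weight magnitude weighted by activations.
    salience = (groups.abs() * act).sum(dim=(0, 2))
    n_high = int(high_ratio * salience.numel())
    high_groups = set(salience.topk(n_high).indices.tolist())
    q = torch.empty_like(groups)
    for g in range(groups.shape[1]):
        bits = high if g in high_groups else low
        q[:, g] = quantize_group(groups[:, g], bits)
    return q.reshape(out_f, in_f)

w = torch.randn(512, 1024)
act_norm = torch.rand(1024)           # e.g., collected from calibration data
w_q = mixed_precision_quantize(w, act_norm)
print((w - w_q).pow(2).mean())        # quantization error of the mixed scheme
```
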
CVPR 2025 Oral

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, Si Liu

  • A novel dataset designed to enhance video reasoning by addressing the limitations of existing datasets in terms of scale and granularity.
  • We propose a Hybrid LVLM Collaboration framework that achieves cost-effective and accurate video reasoning, outperforming baseline models on the majority of tasks in our proposed benchmark.
  • VideoEspresso sets a new starting point in video reasoning, offering rich annotations that facilitate advanced multimodal understanding.
ICLR 2025

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

  • MC-MoE for accurate weight-only quantization (weights at 1.5–2.5 bits).
  • MC-MoE for efficient online dynamic pruning (additional compression ratio > 10%); see the sketch below.
  • MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency.
  • For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%.
[paper] [code] [abstract]
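
As a rough illustration of the two ingredients (static mixed-precision allocation plus dynamic expert pruning), here is a toy sketch; it is not the MC-MoE code. The greedy bit-allocation heuristic, the routing-frequency inputs, and the keep_ratio value are all assumptions for demonstration.

```python
# Illustrative sketch: allocate per-expert bit-widths from routing
# statistics, then drop low-contribution experts at inference time.
import torch

def allocate_bits(routing_freq, budget_bits=2.5, choices=(1.5, 2.0, 3.0)):
    """Greedy allocation: frequently routed experts get more bits,
    subject to an average-bit budget."""
    order = torch.argsort(routing_freq, descending=True)
    bits = torch.full_like(routing_freq, choices[0])
    for idx in order:
        for b in sorted(choices, reverse=True):
            trial = bits.clone()
            trial[idx] = b
            if trial.mean() <= budget_bits:
                bits[idx] = b
                break
    return bits

def dynamic_prune(gate_logits, keep_ratio=0.85):
    """Keep only the top-scoring routed experts and re-normalize the gates."""
    probs = gate_logits.softmax(dim=-1)
    k = max(1, int(keep_ratio * probs.shape[-1]))
    topk = probs.topk(k, dim=-1)
    mask = torch.zeros_like(probs).scatter(-1, topk.indices, 1.0)
    return probs * mask / (probs * mask).sum(dim=-1, keepdim=True)

routing_freq = torch.tensor([0.30, 0.25, 0.20, 0.10, 0.08, 0.04, 0.02, 0.01])
print(allocate_bits(routing_freq))        # per-expert bit-widths
print(dynamic_prune(torch.randn(2, 8)))   # re-normalized gates after pruning
```
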
Visual Intelligence

An Empirical Study of LLaMA3 Quantization: From LLMs to MLLMs

Wei Huang, Xingyu Zheng, Xudong Ma, Haotong Qin, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno

  • Explores the performance of LLaMA3-series models under existing post-training quantization and LoRA fine-tuning methods.
  • Points out the significant performance loss of LLaMA3-based MLLMs under low-bit post-training quantization.
  • Highlights the significant low-bit performance gap that needs to be bridged in future development.
[paper] [code] [abstract]
ICML 2024

BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi

  • Compresses LLM weights to as low as 1.08–1.1 bits and exceeds the performance of previous quantization methods at 2-bit or even 3-bit.
  • Implements high-performance binary LLMs in PTQ mode, efficiently achieving ~1-bit LLM compression without additional training or backpropagation (see the sketch below).
[paper] [code] [abstract]
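
A simplified sketch of the general binarization idea follows; it is not the official BiLLM procedure. It binarizes each row as alpha * sign(w) and spends an extra residual binarization pass only on a small set of salient input channels, which is how an average bit-width slightly above 1 can arise. The salience criterion and the 10% ratio here are illustrative assumptions.

```python
# Simplified illustration of post-training binarization: non-salient
# weights get a single sign/scale pair, while a small salient fraction
# gets an additional residual binarization pass.
import torch

def binarize(w: torch.Tensor) -> torch.Tensor:
    """Row-wise 1-bit approximation: w ≈ alpha * sign(w)."""
    alpha = w.abs().mean(dim=1, keepdim=True)
    return alpha * torch.sign(w)

def billm_style_quant(w: torch.Tensor, salient_ratio: float = 0.1) -> torch.Tensor:
    in_f = w.shape[1]
    # Rank input channels by magnitude; treat the top fraction as salient.
    col_score = w.abs().sum(dim=0)
    n_salient = max(1, int(salient_ratio * in_f))
    salient = col_score.topk(n_salient).indices
    w_hat = binarize(w)
    # Residual binarization on salient columns (~2 bits there, so roughly
    # 1.1 bits on average for a small salient_ratio).
    residual = w[:, salient] - w_hat[:, salient]
    w_hat[:, salient] += binarize(residual)
    return w_hat

w = torch.randn(256, 1024)
w_q = billm_style_quant(w)
print((w - w_q).pow(2).mean())   # mean squared quantization error
```
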
arXiv

On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks

Wei Huang, Haotong Qin, Yangdong Liu, Jingzhuo Liang, Yulun Zhang, Ying Li, Xianglong Liu

  • Combines IP-core-level awareness of chip runtime clock and power with network sensitivity, achieving a better balance of computational efficiency and accuracy on edge devices (see the sketch below).
  • Allows target networks to be compressed and deployed with high accuracy on edge chips with limited computational resources and ultra-low power consumption.
  • Efficiently performs online quantization and optimization without additional devices or data access.
[paper] [abstract]
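
To make the idea concrete, here is a hypothetical sketch of hardware-aware bit allocation: each layer's bit-width is chosen by trading measured quantization sensitivity against an on-chip cost proxy (e.g., power × latency at that bit-width). The sensitivity and cost numbers and the brute-force search are placeholders, not the paper's method.

```python
# Toy hardware-aware mixed-precision search: pick per-layer bit-widths
# that minimize accuracy drop under an on-chip cost budget.
import itertools

# Per-layer sensitivity: accuracy drop (%) when quantized to each bit-width
# (placeholder values; in practice measured on a calibration set).
sensitivity = {
    "layer0": {8: 0.0, 4: 0.2, 2: 1.5},
    "layer1": {8: 0.0, 4: 0.1, 2: 0.4},
    "layer2": {8: 0.0, 4: 0.5, 2: 3.0},
}
# Relative hardware cost (power x latency) at each bit-width on the target chip.
hw_cost = {8: 1.0, 4: 0.45, 2: 0.2}

def best_allocation(cost_budget: float):
    """Brute-force search over per-layer bit-widths under a cost budget."""
    layers = list(sensitivity)
    best, best_drop = None, float("inf")
    for combo in itertools.product([8, 4, 2], repeat=len(layers)):
        cost = sum(hw_cost[b] for b in combo)
        drop = sum(sensitivity[l][b] for l, b in zip(layers, combo))
        if cost <= cost_budget and drop < best_drop:
            best, best_drop = dict(zip(layers, combo)), drop
    return best, best_drop

print(best_allocation(cost_budget=1.2))
```
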

📖 Education

  • 2023.09 - now, Ph.D. student in the Department of Electrical and Electronic Engineering, The University of Hong Kong.
  • 2019.09 - 2023.06, B.Eng. in Computer Science, School of Computer Science and Engineering, Beihang University.

🗒️ Academic Services

  • Conference Reviewer: ICLR, NeurIPS, ICML, ECCV, CVPR, ICCV.
  • Journal Reviewer: IEEE TPAMI, Neural Networks.
  • Program Committee Member, Practical Deep Learning Workshop, IEEE CAI 2024.

🎖 Honors and Awards

  • 2023: Outstanding Graduate, Beihang University.
  • 2023: Outstanding Project of the 16th National College Student Innovation and Entrepreneurship Competition, China.
  • 2022: Outstanding Project of the 15th National College Student Innovation and Entrepreneurship Competition, China.

💻 Internships & Teaching Services

  • 2025.06 - Now, Multimodal Large Language Model Intern, NVIDIA.
  • 2022.09 - 2023.01, AI Algorithm Intern (model inference acceleration), Enflame, China.
  • 2022.08 - 2023.01, TA for Frontiers in Artificial Intelligence, Beihang University.
  • 2022.08 - 2023.01, TA (head of the TA team) for Computer Hardware Basics, Beihang University.
  • 2021.08 - 2022.01, TA (head of the TA team) for Computer Hardware Basics, Beihang University.
  • 2021.03 - 2021.06, TA (head of the TA team) for Discrete Mathematics, Beihang University.