Xiuyu Li
I am a Ph.D. candidate affiliated with Berkeley AI Research (BAIR) at UC Berkeley, advised by Prof. Kurt Keutzer. Previously, I received a B.A. in Computer Science and Math from Cornell University. During my undergrad years, I was fortunate to work with Prof. Zhiru Zhang, Prof. Vitaly Shmatikov, and Prof. Song Han.
Email: xiuyu [at] berkeley [dot] edu
Research
My current research focuses on enhancing the reasoning capabilities of large language models (LLMs) and developing scalable AI agents. This work builds on my broader expertise in making generative models more efficient, in both training and inference, across language and vision.
Efficient Generative Models (Quantization & Sparsity): SparseLoRA (ICML'25) speeds up LLM finetuning with contextual sparsity. Q-Diffusion (ICCV'23) and SVDQuant (ICLR'25) are pioneering works on diffusion model quantization. SVG (ICML'25) accelerates video generation by 2x via attention sparsity. SqueezeLLM (ICML'24) achieves near-lossless 3-bit quantization for LLMs.
Long-context LLMs/VLMs: STORM (ICCV'25 CLVL) and NVILA (CVPR'25) propose efficient VLM architectures for long video understanding. LLoCO (EMNLP'24) improves long-context LLMs via context compression and parameter-efficient finetuning.
ML Systems: LongVILA (ICLR'25) is a framework for distributed training of VLMs on hour-long videos. TorchSparse (MLSys'22, MICRO'23) is a high-performance CUDA library for sparse convolution.
Evaluation: RouterBench (ICML'24 Agentic Markets) is the first benchmark for LLM routing. ArtBench (arXiv'22) is a high-quality dataset for artwork generation. LINKX (NeurIPS'21) offers diverse large-scale non-homophilous graph datasets with a strong baseline.
Selected Publications
For the most up-to-date list of publications, please see Google Scholar. * indicates co-first author; † indicates project lead.
Learning Adaptive Parallel Reasoning with Language Models
Jiayi Pan*, Xiuyu Li*, Long Lian*, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr
COLM, 2025
[abs] [paper] [code]
Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
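To make the spawn()/join() control flow concrete, here is a minimal Python sketch of how a parent reasoning thread might fan work out to child threads and fold their condensed summaries back into its own context. The propose_subqueries and solve_subquery functions are hypothetical placeholders standing in for model calls; in APR itself, the policy learns when and what to spawn via end-to-end reinforcement learning.

```python
# Minimal sketch of the spawn()/join() control flow described above.
# `propose_subqueries` and `solve_subquery` are toy placeholders standing in
# for model calls; they are NOT the paper's implementation, which trains the
# parent/child policies end-to-end with RL.
from concurrent.futures import ThreadPoolExecutor


def propose_subqueries(problem: str, width: int) -> list[str]:
    # Placeholder: a trained parent thread decides *whether* and *what*
    # to spawn based on its own partial reasoning trace.
    return [f"{problem} [branch {i}]" for i in range(width)]


def solve_subquery(subquery: str) -> str:
    # Placeholder: each child thread explores one branch independently,
    # keeping its tokens out of the parent's context window.
    return f"summary({subquery})"


def adaptive_parallel_reasoning(problem: str, width: int = 4) -> str:
    # spawn(): launch child inference threads in parallel.
    subqueries = propose_subqueries(problem, width)
    with ThreadPoolExecutor(max_workers=width) as pool:
        summaries = list(pool.map(solve_subquery, subqueries))
    # join(): return only the children's condensed summaries to the parent,
    # which then continues serial reasoning over them.
    joined_context = problem + "\n" + "\n".join(summaries)
    return f"final_answer_from({joined_context})"


if __name__ == "__main__":
    print(adaptive_parallel_reasoning("countdown: reach 24 from [3, 5, 7, 9]"))
```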
SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
Samir Khaki*, Xiuyu Li*†, Junxian Guo*, Ligeng Zhu, Konstantinos N. Plataniotis, Amir Yazdanbakhsh, Kurt Keutzer, Song Han, Zhijian Liu
ICML, 2025
[abs] [paper] [code] [website]
Fine-tuning LLMs is both computationally and memory-intensive. While parameter-efficient fine-tuning methods, such as DoRA, reduce the number of trainable parameters and lower memory usage, they do not decrease computational cost. In some cases, they even slow down fine-tuning. In this paper, we introduce SparseLoRA, a method that accelerates LLM fine-tuning through contextual sparsity. We propose a lightweight, training-free SVD sparsity estimator that dynamically selects a sparse subset of weights for loss and gradient computation. We also systematically analyze and address sensitivity across layers, tokens, and training steps. Our experimental results show that SparseLoRA reduces computational cost by up to 1.7x, with a measured speedup of up to 1.4x, while maintaining accuracy across various downstream tasks, including commonsense and arithmetic reasoning. We will release our code to encourage further research into fine-tuning methods that are both parameter- and computation-efficient.
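As a rough illustration of a training-free, SVD-based contextual-sparsity estimator, the NumPy sketch below uses a low-rank proxy of a weight matrix to guess which output channels matter for the current activation, then computes only those channels densely. The shapes, sparsity ratio, and scoring rule are illustrative assumptions, not the SparseLoRA recipe (which also decides where across layers, tokens, and training steps to apply sparsity, and how it interacts with the LoRA branch).

```python
# Rough NumPy sketch of using a low-rank (SVD) proxy of a weight matrix to
# cheaply estimate which output channels matter for the current input, then
# computing only those channels densely. Illustrative only; not SparseLoRA.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, keep = 512, 512, 16, 128

W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)   # frozen base weight
x = rng.standard_normal(d_in)                            # current activation

# Offline: low-rank factors of W (computed once, reused for every input).
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_r, S_r, Vt_r = U[:, :rank], S[:rank], Vt[:rank]

# Online: cheap per-channel importance estimate, O(rank * d) instead of
# O(d_out * d_in).
scores = np.abs(U_r @ (S_r * (Vt_r @ x)))
active = np.argsort(scores)[-keep:]                      # top-k channels

# Compute only the selected rows of W densely.
y_sparse = W[active] @ x
y_dense = W @ x

# Sanity check: how well does the cheap estimator agree with the oracle?
true_top = np.argsort(np.abs(y_dense))[-keep:]
overlap = len(set(active) & set(true_top)) / keep
print(f"estimator/oracle top-{keep} channel overlap: {overlap:.2f}")
```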
S*: Test Time Scaling for Code Generation
Dacheng Li*, Shiyi Cao*, Chengkun Cao, Xiuyu Li, Shangyin Tan, Kurt Keutzer, Jiarong Xing, Joseph E. Gonzalez, Ion Stoica
EMNLP Findings, 2025
[abs] [paper] [code]
Increasing test-time compute for LLMs shows promise across domains but remains underexplored in code generation, despite extensive study in math. In this paper, we propose S*, the first hybrid test-time scaling framework that substantially improves the coverage and selection accuracy of generated code. S* extends the existing parallel scaling paradigm with sequential scaling to push performance boundaries. It further leverages a novel selection mechanism that adaptively generates distinguishing inputs for pairwise comparison, combined with execution-grounded information, to robustly identify correct solutions. We evaluate across 12 Large Language Models and Large Reasoning Models and show: (1) S* consistently improves performance across model families and sizes, enabling a 3B model to outperform GPT-4o-mini; (2) S* enables non-reasoning models to surpass reasoning models: GPT-4o-mini with S* outperforms o1-preview by 3.7% on LiveCodeBench; (3) S* further boosts state-of-the-art reasoning models: DeepSeek-R1-Distill-Qwen-32B with S* achieves 85.7% on LiveCodeBench, approaching o1 (high) at 88.5%.
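The toy sketch below illustrates the execution-grounded selection idea in isolation: when two candidate programs disagree, run both on a distinguishing input and keep the one consistent with the expected behavior. In S* itself, both the candidate programs and the distinguishing inputs are produced by the LLM; the hard-coded candidates and test below are purely for illustration.

```python
# Toy sketch of execution-grounded pairwise selection: run candidate programs
# on an input chosen to expose their disagreement and keep the one matching
# the expected behaviour. Candidates and the distinguishing input are
# hard-coded stand-ins for what the LLM would generate in S*.
CANDIDATES = [
    # Buggy for even-length lists: picks the upper middle element.
    "def solve(nums):\n    return sorted(nums)[len(nums) // 2]",
    # Correct median for both odd- and even-length lists.
    "def solve(nums):\n    s = sorted(nums)\n    n = len(nums)\n"
    "    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2",
]


def run(candidate_src: str, test_input):
    namespace = {}
    exec(candidate_src, namespace)          # compile the candidate program
    return namespace["solve"](test_input)


def select(candidates, distinguishing_input, expected_output):
    # Execute every candidate on an input that exposes disagreement, then
    # keep the one grounded by the expected behaviour.
    best = None
    for src in candidates:
        if run(src, distinguishing_input) == expected_output:
            best = src
    return best


if __name__ == "__main__":
    # [1, 2, 3, 4] distinguishes the two candidates: the true median is 2.5.
    winner = select(CANDIDATES, [1, 2, 3, 4], 2.5)
    print("selected candidate:\n" + winner)
```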
Token-Efficient Long Video Understanding for Multimodal LLMs
Jindong Jiang*, Xiuyu Li*, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon
ICCV CLVL workshop, 2025
[abs] [paper] [website]
Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to 8× and the decoding latency by 2.4-2.9× for a fixed number of input frames.
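The PyTorch snippet below sketches one token-reduction idea in isolation: once a temporal module has mixed information across frames, tokens from neighboring frames can be average-pooled before reaching the LLM. The tensor shapes and pooling factor are made up for illustration; STORM's actual pipeline combines a Mamba-based temporal encoder with several sampling and pooling strategies.

```python
# Minimal PyTorch sketch of temporal token reduction: after temporal mixing,
# group consecutive frames and average their tokens so the LLM sees far fewer
# video tokens. Shapes and the pooling factor are illustrative assumptions.
import torch

frames, tokens_per_frame, dim = 32, 196, 1024
pool_factor = 4  # keep 1 of every 4 frames' worth of tokens

video_tokens = torch.randn(frames, tokens_per_frame, dim)  # encoder output

# Temporal average pooling: each group of `pool_factor` frames collapses
# into one "super-frame" of tokens.
grouped = video_tokens.view(frames // pool_factor, pool_factor,
                            tokens_per_frame, dim)
pooled = grouped.mean(dim=1)

llm_input = pooled.flatten(0, 1)  # (frames/pool_factor * tokens, dim)
print("tokens before:", frames * tokens_per_frame,
      "after:", llm_input.shape[0])
```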
LLoCO: Learning Long Contexts Offline
Sijun Tan*, Xiuyu Li*, Shishir Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, Raluca Ada Popa
EMNLP, 2024
[abs] [paper] [code]
Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead of the self-attention mechanism and the substantial KV cache sizes during generation. We propose a novel approach to address this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning. Our method enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions accurately. We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens. We evaluate our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning while using 30x fewer tokens during inference. LLoCO achieves up to 7.62x speed-up and substantially reduces the cost of long document question answering, making it a promising solution for efficient long context processing. Our code is publicly available online.
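The sketch below shows how the offline and online stages fit together at a high level: compress long documents once, train a LoRA adapter so the model can read the compressed representations, then retrieve and condition on the compressed context at serving time. Every function body is a stub invented for illustration, not the LLoCO implementation.

```python
# High-level sketch of the offline-learn / online-serve split described above.
# All function bodies are stubs; they only show how context compression,
# retrieval, and a LoRA adapter fit together.


def compress(document: str) -> list[float]:
    # Offline: a context encoder turns a long document into a short sequence
    # of summary embeddings (faked here as a fixed-size vector).
    return [float(len(document) % 7)] * 8


def finetune_lora(compressed_corpus) -> dict:
    # Offline: a parameter-efficient (LoRA) adapter is trained so the LLM
    # learns to read the compressed representations for this domain.
    return {"lora_rank": 8, "num_docs": len(compressed_corpus)}


def retrieve(query: str, store: dict):
    # Online: pick the compressed representation of the relevant document
    # (a naive name match here; a real system would use an embedding
    # retriever over the compressed corpus).
    for name, compressed in store.items():
        if name in query:
            return name, compressed
    return next(iter(store.items()))


def answer(query: str, store: dict, adapter: dict) -> str:
    doc_id, compressed = retrieve(query, store)
    # Online: the LLM (with the LoRA adapter attached) conditions on the
    # short compressed context instead of the full long document.
    return f"answer to {query!r} using {doc_id} ({len(compressed)} summary tokens)"


corpus = {"contract.txt": "..." * 1000, "paper.txt": "..." * 2000}
store = {name: compress(text) for name, text in corpus.items()}
adapter = finetune_lora(store)
print(answer("what is the termination clause in contract.txt?", store, adapter))
```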
Q-Diffusion: Quantizing Diffusion Models
Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, Kurt Keutzer
ICCV, 2023
[abs] [paper] [code] [website] [talk]
Integration: NVIDIA TensorRT
Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computation intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out-of-the-box on diffusion models. We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture of the diffusion models, which compresses the noise estimation network to accelerate the generation process. We identify the key difficulty of diffusion model quantization as the changing output distributions of noise estimation networks over multiple time steps and the bimodal activation distribution of the shortcut layers within the noise estimation network. We tackle these challenges with time step-aware calibration and shortcut-splitting quantization in this work. Experimental results show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance (small FID change of at most 2.34 compared to >100 for traditional PTQ) in a training-free manner. Our approach can also be applied to text-guided image generation, where we can run stable diffusion in 4-bit weights with high generation quality for the first time.
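The NumPy sketch below reproduces the two phenomena on synthetic data: calibrating quantization ranges on a single denoising step versus across all steps, and quantizing a shortcut concatenation with one shared scale versus one scale per half. The distributions and bit-width are synthetic assumptions chosen to make the effect visible; this is not the released Q-Diffusion code.

```python
# NumPy illustration of (1) time-step-aware calibration and (2) shortcut
# "split" quantization, on synthetic activations. Not the Q-Diffusion code.
import numpy as np

rng = np.random.default_rng(0)


def quantize(x, n_bits=4, scale=None):
    # Symmetric uniform quantization with a per-tensor scale.
    if scale is None:
        scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale


# (1) Activations whose spread shrinks as denoising progresses.
timesteps = np.linspace(1.0, 0.05, 50)
acts = [rng.standard_normal(1024) * t for t in timesteps]

scale_last_step = np.abs(acts[-1]).max() / 7          # calibrated on one step
scale_all_steps = np.abs(np.stack(acts)).max() / 7    # calibrated across steps
err_last = np.mean([(quantize(a, 4, scale_last_step) - a) ** 2 for a in acts])
err_all = np.mean([(quantize(a, 4, scale_all_steps) - a) ** 2 for a in acts])
print(f"MSE, single-step calibration: {err_last:.4f}  all-step: {err_all:.4f}")

# (2) A shortcut concatenation of two halves with very different ranges.
deep = rng.standard_normal(512) * 0.1
skip = rng.standard_normal(512) * 10.0
concat = np.concatenate([deep, skip])

joint = quantize(concat, 4)                                     # shared scale
split = np.concatenate([quantize(deep, 4), quantize(skip, 4)])  # per-half scales
zeros_joint = np.mean(joint[:512] == 0)
zeros_split = np.mean(split[:512] == 0)
print(f"deep-half values crushed to zero: joint {zeros_joint:.0%}, "
      f"split {zeros_split:.0%}")
```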
SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim*, Coleman Hooper*, Amir Gholami*, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer
ICML, 2024
[abs] [paper] [code]
Integration: Intel oneAPI
Generative Large Language Models (LLMs) have demonstrated remarkable results for a wide range of tasks. However, deploying these models for inference has been a significant challenge due to their unprecedented resource requirements. This has forced existing deployment frameworks to use multi-GPU inference pipelines, which are often complex and costly, or to use smaller and less performant models. In this work, we demonstrate that the main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, specifically for single batch inference. While quantization has emerged as a promising solution by representing model weights with reduced precision, previous efforts have often resulted in notable performance degradation. To address this, we introduce SqueezeLLM, a post-training quantization framework that not only enables lossless compression to ultra-low precisions of up to 3-bit, but also achieves higher quantization performance under the same memory constraint. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format. When applied to the LLaMA models, our 3-bit quantization significantly reduces the perplexity gap from the FP16 baseline by up to 2.1x as compared to the state-of-the-art methods with the same memory requirement. Furthermore, when deployed on an A6000 GPU, our quantized models achieve up to 2.3x speedup compared to the baseline. Our code is open-sourced and available online.
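To illustrate the two ingredients on synthetic data, the sketch below builds a sensitivity-weighted k-means codebook (non-uniform 3-bit values) and pulls a small fraction of outliers into a separate full-precision sparse part. The random "sensitivities" stand in for the second-order (Fisher) information used by the real method; the weight vector and ratios are made up.

```python
# NumPy sketch of (i) sensitivity-weighted k-means quantization and
# (ii) dense-and-sparse outlier decomposition, on a synthetic weight vector.
# Random "sensitivities" replace the real method's second-order information.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4096) * 0.02
w[rng.choice(4096, 8, replace=False)] *= 50        # a few large outliers
sens = rng.random(4096) + 1e-3                     # stand-in sensitivities


def weighted_kmeans(x, weights, k=8, iters=25):
    # Lloyd's algorithm where each point is weighted by its sensitivity, so
    # centroids are pulled toward weights that matter more to the loss.
    centroids = np.quantile(x, np.linspace(0.05, 0.95, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            m = assign == j
            if m.any():
                centroids[j] = np.average(x[m], weights=weights[m])
    return centroids, assign


# (ii) Dense-and-sparse decomposition: keep the top 0.5% outliers exact.
cut = np.quantile(np.abs(w), 0.995)
sparse_mask = np.abs(w) > cut
dense = np.where(sparse_mask, 0.0, w)

# (i) 3-bit (k = 8) sensitivity-weighted codebook for the dense part.
centroids, assign = weighted_kmeans(dense, sens, k=8)
w_hat = np.where(sparse_mask, w, centroids[assign])

print(f"outliers kept sparse: {sparse_mask.sum()} / {w.size}")
print(f"reconstruction MSE:   {np.mean((w - w_hat) ** 2):.2e}")
```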
TorchSparse: Efficient Point Cloud Inference Engine
Haotian Tang*, Zhijian Liu*, Xiuyu Li*, Yujun Lin, Song Han
MLSys, 2022
[abs] [paper] [code] [website]
Deep learning on point clouds has received increased attention thanks to its wide applications in AR/VR and autonomous driving. These applications require low latency and high accuracy to provide real-time user experience and ensure user safety. Unlike conventional dense workloads, the sparse and irregular nature of point clouds poses severe challenges to running sparse CNNs efficiently on the general-purpose hardware, and existing sparse acceleration techniques for 2D images do not translate to 3D point clouds. In this paper, we introduce TorchSparse, a high-performance point cloud inference engine that accelerates the sparse convolution computation on GPUs. TorchSparse directly optimizes the two bottlenecks of sparse convolution: data movement and irregular computation. It optimizes the data orchestration by quantization and fused locality-aware memory access, reducing the memory movement cost by 2.7×. It also adopts adaptive MM grouping to trade computation for better regularity, achieving 1.4-1.5× speedup for matrix multiplication. Evaluated on seven representative models across three benchmark datasets, TorchSparse achieves 1.6× and 1.5× measured end-to-end speedup over the state-of-the-art MinkowskiEngine and SpConv, respectively.
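For readers unfamiliar with sparse convolution, the pure-NumPy toy below spells out the gather-GEMM-scatter structure that engines like TorchSparse optimize: per kernel offset, gather the matching input/output sites, run one dense matrix multiplication, and scatter-add the result. The coordinate hashing and Python loops here are deliberately naive; TorchSparse's contribution is making these stages fast on GPUs (fused locality-aware data movement, adaptive MM grouping), which this sketch does not attempt.

```python
# Pure-NumPy toy of the gather - GEMM - scatter pattern behind sparse
# convolution. Illustrative only; nothing here is the TorchSparse kernel code.
import itertools
import numpy as np

rng = np.random.default_rng(0)
num_points, c_in, c_out = 1000, 16, 32

coords = rng.integers(0, 50, size=(num_points, 3))            # voxel coords
coords = np.unique(coords, axis=0)                            # de-duplicate
feats = rng.standard_normal((len(coords), c_in))
offsets = list(itertools.product((-1, 0, 1), repeat=3))       # 3x3x3 kernel
weights = rng.standard_normal((len(offsets), c_in, c_out)) * 0.1

coord_to_row = {tuple(c): i for i, c in enumerate(coords)}
out_feats = np.zeros((len(coords), c_out))

for k, offset in enumerate(offsets):
    # Gather: input rows whose coordinate + offset is also an active site.
    in_rows, out_rows = [], []
    for i, c in enumerate(coords):
        j = coord_to_row.get(tuple(c + np.array(offset)))
        if j is not None:
            in_rows.append(i)
            out_rows.append(j)
    if not in_rows:
        continue
    # GEMM: one dense matmul per kernel offset over the gathered rows.
    partial = feats[in_rows] @ weights[k]
    # Scatter: accumulate the partial sums back into the sparse output.
    np.add.at(out_feats, out_rows, partial)

print("active sites:", len(coords), "output feature shape:", out_feats.shape)
```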
Talks
NVIDIA
Bain Capital Ventures
Salesforce AI Research FutureForum
Projects
A Mamba-2.8B model finetuned with DPO. It is one of the most downloaded Mamba models on Hugging Face.