I received my Ph.D. degree ('21) from the Department of Computer Science at the University of California, Los Angeles (UCLA). My research interests lie at the intersection of statistical machine learning, natural language processing and cognition. Current research themes include:
Trustworthy AI: Crafting faithful, interpretable and trustworthy AI frameworks.
Human-like Conversational Agents: Building interactive models that align with human values and social norms.
Efficient Language Models: Efficient training and inference of long-context language models.
I am always looking for self-motivated students and long-term collaborators. Please contact me if you have a strong background or share similar research interests.
Two papers have been accepted to NeurIPS 2025! Absolute Zero was selected as a Spotlight (top 3.2%)!
Aug, 2025
I will be serving as Area Chair for ICLR 2026.
Aug, 2025
Three papers on MoE routers (RouterLens), reinforced query reasoners for deep retrieval (TongSearch), and a new preference optimization formula with utility anchors (UAPO) have been accepted to EMNLP 2025!
Jun, 2025
VideoLLaMB has been accepted to ICCV 2025. Congratulations to Yuxuan and Yiqi!
OmniMMI has been accepted to CVPR’25. We devised the first-ever benchmark for streaming interactive Omni understanding. Please try your models on the OmniMMI Leaderboard.
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
@inproceedings{zhao2025absolutezero,title={Absolute Zero: Reinforced Self-play Reasoning with Zero Data},author={Andrew Zhao and Yiran Wu and Yang Yue and Tong Wu and Quentin Xu and Yang Yue and Matthieu Lin and Shenzhi Wang and Qingyun Wu and Zilong Zheng and Gao Huang},year={2025},booktitle={Advances in Neural Information Processing Systems (NeurIPS)},url={https://arxiv.org/abs/2505.03335},}
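As a rough illustration of the executor-based reward described in the Absolute Zero abstract, the Python sketch below runs a proposed program to validate a task and then checks a predicted output against the executed result. The function names (`run_program`, `validate_task`, `verifiable_reward`), the convention that a task defines a function `f`, the determinism check, and the binary reward are all assumptions made for this example, not details taken from the paper. A real system would also need to sandbox execution.

```python
def run_program(source, arg):
    """Execute `source`, which is assumed to define f, and return f(arg).

    Returns None if the program fails to compile or raises at run time.
    Note: a program that legitimately returns None is treated as a failure
    in this simplified sketch, and no sandboxing is performed.
    """
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["f"](arg)
    except Exception:
        return None


def validate_task(source, arg):
    """Accept a proposed task only if it runs and gives the same output twice."""
    first = run_program(source, arg)
    second = run_program(source, arg)
    return first is not None and first == second


def verifiable_reward(source, arg, predicted):
    """Binary reward (an assumption): 1.0 if the prediction matches execution."""
    expected = run_program(source, arg)
    if expected is not None and predicted == expected:
        return 1.0
    return 0.0


if __name__ == "__main__":
    task = "def f(x):\n    return sorted(x)[::-1]"
    print(validate_task(task, [3, 1, 2]))
    print(verifiable_reward(task, [3, 1, 2], [3, 2, 1]))
```

In this toy setup the executor plays both roles mentioned in the abstract: it filters out proposed tasks that do not run, and it supplies the ground-truth output used to score a solver's answer.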
Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce Minecraft Universe (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments.
@inproceedings{zheng2025mcu,title={MCU: An Evaluation Framework for Open-Ended Game Agents},author={Zheng, Xinyue and Lin, Haowei and He, Kaichen and Wang, Zihao and Zheng, Zilong and Liang, Yitao},booktitle={Proceedings of the 42nd International Conference on Machine Learning},year={2025}}
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths.
@inproceedings{wu2025tokenswift,title={TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation},author={Wu, Tong and Shen, Junzhe and Jia, Zixia and Wang, Yuxuan and Zheng, Zilong},booktitle={Proceedings of the 42nd International Conference on Machine Learning},year={2025}}
Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a 5.5-point improvement over its competitors across three VideoQA benchmarks, and 2.06 points on egocentric planning. Comprehensive results on MVBench show that VideoLLaMB-7B achieves markedly better results than previous 7B models built on the same LLM. Remarkably, it maintains robust performance as PLLaVA even as video length increases up to 8 times. In addition, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory scaling, ensuring both high performance and cost-effectiveness, thereby setting a new foundation for long-form video-language models in both academic and practical applications.
@inproceedings{wang2025videollamb,title={VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges},author={Yuxuan Wang and Cihang Xie and Yang Liu and Zilong Zheng},year={2025},booktitle={International Conference on Computer Vision},url={https://arxiv.org/abs/2409.01071},}
Reinforced Query Reasoners for Reasoning-intensive Retrieval Tasks EMNLP'25
Xubo Qin, Jun Bai, Jiaqi Li, Zixia Jia#, and Zilong Zheng#, in EMNLP, 2025.
Traditional information retrieval (IR) methods excel at textual and semantic matching but struggle in reasoning-intensive retrieval tasks that require multi-hop inference or complex semantic understanding between queries and documents. One promising solution is to explicitly rewrite or augment queries using large language models (LLMs) to elicit reasoning-relevant content prior to retrieval. However, the widespread use of large-scale LLMs like GPT-4 or LLaMA3-70B remains impractical due to their high inference cost and limited deployability in real-world systems. In this work, we introduce Reinforced Query Reasoner (RQR), a family of small-scale language models for query reasoning and rewriting in reasoning-intensive retrieval. Our approach frames query reformulation as a reinforcement learning problem and employs a novel semi-rule-based reward function. This enables smaller language models, e.g., Qwen2.5-7B-Instruct and Qwen2.5-1.5B-Instruct, to achieve reasoning performance rivaling large-scale LLMs without their prohibitive inference costs. Experimental results on the BRIGHT benchmark show that, with BM25 as the retriever, both RQR-7B and RQR-1.5B models significantly outperform existing baselines, including prompt-based query reasoners and some of the latest dense retrievers trained for reasoning-intensive retrieval tasks, offering superior adaptability for real-world deployment.
@inproceedings{qin2025rqr,title={Reinforced Query Reasoners for Reasoning-intensive Retrieval Tasks},author={Xubo Qin and Jun Bai and Jiaqi Li and Zixia Jia and Zilong Zheng},year={2025},booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)}}
Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs EMNLP'25
Jun Bai, Minghao Tong, Yang Liu, Zixia Jia#, and Zilong Zheng#, in EMNLP, 2025.
Context faithfulness is essential for reliable reasoning in context-dependent scenarios. However, large language models often struggle to ground their outputs in the provided context, resulting in irrelevant responses. Inspired by the emergent expert specialization observed in mixture-of-experts architectures, this work investigates whether certain experts exhibit specialization in context utilization, offering a potential pathway toward targeted optimization for improved context faithfulness. To explore this, we propose Router Lens, a method that accurately identifies context-faithful experts. Our analysis reveals that these experts progressively amplify attention to relevant contextual information, thereby enhancing context grounding. Building on this insight, we introduce Context-faithful Expert Fine-Tuning (CEFT), a lightweight optimization approach that selectively fine-tunes context-faithful experts. Experiments across a wide range of benchmarks and models demonstrate that CEFT matches or surpasses the performance of full fine-tuning while being significantly more efficient.
@inproceedings{bai2025routerlens,title={Understanding and Leveraging the Expert Specialization of Context Faithfulness in Mixture-of-Experts LLMs},author={Jun Bai and Minghao Tong and Yang Liu and Zixia Jia and Zilong Zheng},year={2025},booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)}}
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 real-world interactive videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enhance real-time interactive reasoning with minimum finetuning on pre-trained MLLMs. Extensive experimental results reveal that the existing MLLMs fall short in interactive streaming understanding, particularly struggling with proactive tasks and multi-turn queries. Our proposed M4, though lightweight, demonstrates a significant improvement in handling proactive tasks and real-time interactions.
@inproceedings{cvpr25omnimmi,title={OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts},author={Wang, Yuxuan and Wang, Yueqian and Chen, Bo and Wu, Tong and Zhao, Dongyan and Zheng, Zilong},booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)},year={2025}}
Rule-based reasoning has been acknowledged as one of the fundamental problems in reasoning, while deviations in rule formats, types, and complexity in real-world applications pose severe challenges. Recent studies have shown that large reasoning models (LRMs) have remarkable reasoning capabilities, and their performance is substantially enhanced by reinforcement learning (RL). However, it remains an open question whether small reasoning models (SRMs) can learn rule-based reasoning effectively with robust generalization across diverse tasks and domains. To address this, we introduce Reinforced Rule-based Reasoning, a.k.a. RuleReasoner, a simple yet effective method to conduct rule-based reasoning via a wide collection of curated tasks and a novel domain-aware dynamic sampling approach. Specifically, RuleReasoner resamples each training batch by updating the sampling weights of different domains based on historical rewards. This facilitates domain augmentation and flexible online learning schedules for RL, obviating the need for pre-hoc human-engineered mix-training recipes used in existing methods. Empirical evaluations on in-distribution (ID) and out-of-distribution (OOD) benchmarks reveal that RuleReasoner outperforms frontier LRMs by a significant margin (Δ4.1% average points on eight ID tasks and Δ10.4% average points on three OOD tasks over OpenAI-o1). Notably, our approach also exhibits higher computational efficiency compared to prior dynamic sampling methods for RL.
@misc{liu2025rulereasoner,title={RuleReasoner: Reinforced Rule-based Reasoning via Domain-aware Dynamic Sampling},author={Yang Liu and Jiaqi Li and Zilong Zheng},year={2025},eprint={2506.08672},archivePrefix={arXiv},primaryClass={cs.CL},url={https://arxiv.org/abs/2506.08672},}
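The RuleReasoner abstract says that each training batch is resampled by updating per-domain sampling weights from historical rewards, but it does not give the update rule. The Python sketch below is one plausible reading under stated assumptions: an exponential moving average of rewards per domain, and a softmax over `1 - average_reward` so that domains with lower recent reward are sampled more often. The function names, `decay`, and `temperature` are hypothetical and not taken from the paper.

```python
import math
import random


def update_domain_weights(avg_rewards, new_rewards, decay=0.9, temperature=0.5):
    """Update per-domain reward averages and derive sampling weights.

    avg_rewards: dict mapping domain -> running average reward in [0, 1]
    new_rewards: dict mapping domain -> list of rewards from the latest batch
    Returns (updated_averages, sampling_weights); the weights sum to 1.
    """
    updated = {}
    for domain, old in avg_rewards.items():
        batch = new_rewards.get(domain)
        if batch:
            batch_mean = sum(batch) / len(batch)
            updated[domain] = decay * old + (1.0 - decay) * batch_mean
        else:
            # No new evidence for this domain: keep the old average.
            updated[domain] = old

    # Assumption: lower average reward means the domain needs more practice.
    logits = {d: (1.0 - r) / temperature for d, r in updated.items()}
    max_logit = max(logits.values())
    exps = {d: math.exp(v - max_logit) for d, v in logits.items()}
    total = sum(exps.values())
    weights = {d: e / total for d, e in exps.items()}
    return updated, weights


def sample_batch(weights, batch_size, rng):
    """Draw the domain label for each example in the next training batch."""
    domains = list(weights)
    probs = [weights[d] for d in domains]
    return rng.choices(domains, weights=probs, k=batch_size)


if __name__ == "__main__":
    averages = {"logic": 0.5, "math": 0.5}
    latest = {"logic": [1.0, 1.0], "math": [0.0, 0.0]}
    averages, weights = update_domain_weights(averages, latest)
    print(weights)
    print(sample_batch(weights, 8, random.Random(0)))
```

The direction of the weighting (favoring low-reward domains) is a design choice for this sketch; the paper may weight domains differently.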
Reasoning ability, a core component of human intelligence, continues to pose a significant challenge for Large Language Models (LLMs) in the pursuit of AGI. Although model performance has improved under the training scaling law, significant challenges remain, particularly with respect to training algorithms, such as catastrophic forgetting, and the limited availability of novel training data. As an alternative, test-time scaling enhances reasoning performance by increasing test-time computation without parameter updating. Unlike prior methods in this paradigm focused on token space, we propose leveraging latent space for more effective reasoning and better adherence to the test-time scaling law. We introduce LatentSeek, a novel framework that enhances LLM reasoning through Test-Time Instance-level Adaptation (TTIA) within the model's latent space. Specifically, LatentSeek leverages policy gradient to iteratively update latent representations, guided by self-generated reward signals. LatentSeek is evaluated on a range of reasoning benchmarks, including GSM8K, MATH-500, and AIME2024, across multiple LLM architectures. Results show that LatentSeek consistently outperforms strong baselines, such as Chain-of-Thought prompting and fine-tuning-based methods. Furthermore, our analysis demonstrates that LatentSeek is highly efficient, typically converging within a few iterations for problems of average complexity, while also benefiting from additional iterations, thereby highlighting the potential of test-time scaling in the latent space. These findings position LatentSeek as a lightweight, scalable, and effective solution for enhancing the reasoning capabilities of LLMs.
@misc{li2025seekdarkreasoningtesttime,title={Seek in the Dark: Reasoning via Test-Time Instance-Level Policy Gradient in Latent Space},author={Hengli Li and Chenxi Li and Tong Wu and Xuekai Zhu and Yuxuan Wang and Zhaoxin Yu and Eric Hanchen Jiang and Song-Chun Zhu and Zixia Jia and Ying Nian Wu and Zilong Zheng},year={2025},eprint={2505.13308},archivePrefix={arXiv},primaryClass={cs.LG},url={https://arxiv.org/abs/2505.13308},}
A prerequisite for social coordination is bidirectional communication between teammates, each playing two roles simultaneously: as receptive listeners and expressive speakers. For robots working with humans in complex situations with multiple goals that differ in importance, failure to fulfill the expectation of either role could undermine group performance due to misalignment of values between humans and robots. Specifically, a robot needs to serve as an effective listener to infer human users’ intents from instructions and feedback and as an expressive speaker to explain its decision processes to users. Here, we investigate how to foster effective bidirectional human-robot communications in the context of value alignment—collaborative robots and users form an aligned understanding of the importance of possible task goals. We propose an explainable artificial intelligence (XAI) system in which a group of robots predicts users’ values by taking in situ feedback into consideration while communicating their decision processes to users through explanations. To learn from human feedback, our XAI system integrates a cooperative communication model for inferring human values associated with multiple desirable goals. To be interpretable to humans, the system simulates human mental dynamics and predicts optimal explanations using graphical models. We conducted psychological experiments to examine the core components of the proposed computational framework. Our results show that real-time human-robot mutual understanding in complex cooperative tasks is achievable with a learning model based on bidirectional communication. We believe that this interaction framework can shed light on bidirectional value alignment in communicative XAI systems and, more broadly, in future human-machine teaming systems. An explainable artificial intelligence collaboration framework enables in situ bidirectional human-robot value alignment.
@article{doi:10.1126/scirobotics.abm4183,author={Luyao Yuan and Xiaofeng Gao and Zilong Zheng and Mark Edmonds and Ying Nian Wu and Federico Rossano and Hongjing Lu and Yixin Zhu and Song-Chun Zhu },title={In situ bidirectional human-robot value alignment},journal={Science Robotics},volume={7},number={68},pages={eabm4183},year={2022},doi={10.1126/scirobotics.abm4183},URL={https://www.science.org/doi/abs/10.1126/scirobotics.abm4183},eprint={https://www.science.org/doi/pdf/10.1126/scirobotics.abm4183}}
Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning CVPR'21 Oral
Exploiting internal statistics of a single natural image has long been recognized as a significant research paradigm where the goal is to learn the internal distribution of patches within the image without relying on external training data. Different from prior works that model such a distribution implicitly with a top-down latent variable model (e.g., generator), this paper proposes to explicitly represent the statistical distribution within a single natural image by using an energy-based generative framework, where a pyramid of energy functions, each parameterized by a bottom-up deep neural network, is used to capture the distributions of patches at different resolutions. Meanwhile, a coarse-to-fine sequential training and sampling strategy is presented to train the model efficiently. Besides learning to generate random samples from white noise, the model can learn in parallel with a self-supervised task (e.g., recover the input image from its corrupted version), which can further improve the descriptive power of the learned model. The proposed model is simple and natural in that it does not require an auxiliary model (e.g., discriminator) to assist the training. Besides, it also unifies internal statistics learning and image generation in a single framework. Experimental results presented on various image generation and manipulation tasks, including super-resolution, image editing, harmonization, style transfer, etc., have demonstrated the effectiveness of our model for internal learning.
@inproceedings{zheng2021patchgencn,title={Patchwise Generative ConvNet: Training Energy-Based Models from a Single Natural Image for Internal Learning},author={Zheng, Zilong and Xie, Jianwen and Li, Ping},booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)},year={2021}}
Reasoning Visual Dialogs with Structural and Partial Observations CVPR'19 Oral
We propose a novel model to address the task of Visual Dialog which exhibits complex dialog structures. To obtain a reasonable answer based on the current question and the dialog history, the underlying semantic dependencies between dialog entities are essential. In this paper, we explicitly formalize this task as inference in a graphical model with partially observed nodes and unknown graph structures (relations in dialog). The given dialog entities are viewed as the observed nodes. The answer to a given question is represented by a node with a missing value. We first introduce an Expectation Maximization algorithm to infer both the underlying dialog structures and the missing node values (desired answers). Based on this, we proceed to propose a differentiable graph neural network (GNN) solution that approximates this process. Experimental results on the VisDial and VisDial-Q datasets show that our model outperforms comparative methods. It is also observed that our method can infer the underlying dialog structure for better dialog reasoning.
@inproceedings{zheng2019reasoning,title={Reasoning Visual Dialogs with Structural and Partial Observations},author={Zheng, Zilong and Wang, Wenguan and Qi, Siyuan and Zhu, Song-Chun},booktitle={Computer Vision and Pattern Recognition (CVPR), 2019 IEEE Conference on},year={2019}}
Learning Descriptor Networks for 3D Shape Synthesis and Analysis CVPR'18 Oral
This paper proposes a 3D shape descriptor network, which is a deep convolutional energy-based model, for modeling volumetric shape patterns. The maximum likelihood training of the model follows an “analysis by synthesis” scheme and can be interpreted as a mode seeking and mode shifting process. The model can synthesize 3D shape patterns by sampling from the probability distribution via MCMC such as Langevin dynamics. The model can be used to train a 3D generator network via MCMC teaching. The conditional version of the 3D shape descriptor net can be used for 3D object recovery and 3D object super-resolution. Experiments demonstrate that the proposed model can generate realistic 3D shape patterns and can be useful for 3D shape analysis.
@inproceedings{xie2018learning,title={Learning Descriptor Networks for 3D Shape Synthesis and Analysis},author={Xie, Jianwen and Zheng, Zilong and Gao, Ruiqi and Wang, Wenguan and Zhu, Song-Chun and Wu, Ying Nian},booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)},pages={8629--8638},year={2018}}
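The 3D descriptor network abstract mentions synthesizing samples from an energy-based model via MCMC such as Langevin dynamics. As a minimal sketch of that sampler only, the Python code below runs unadjusted Langevin updates on a toy one-dimensional energy U(x) = x^2 / 2, whose density exp(-U) is a standard normal. The toy energy, the function names, and the step size are assumptions for illustration; the paper's model uses a deep convolutional energy over volumetric shapes.

```python
import random


def langevin_sample(grad_energy, x0, step_size, n_steps, rng):
    """Run unadjusted Langevin dynamics and return the final state.

    Each step moves against the energy gradient and adds Gaussian noise:
        x <- x - (step_size**2 / 2) * grad_energy(x) + step_size * noise
    """
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - 0.5 * step_size ** 2 * grad_energy(x) + step_size * noise
    return x


def quadratic_energy_grad(x):
    """Gradient of the toy energy U(x) = x**2 / 2 (a standard normal target)."""
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [
        langevin_sample(quadratic_energy_grad, 0.0, 0.1, 500, rng)
        for _ in range(2000)
    ]
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / len(samples)
    print(mean, var)
```

With a small step size and enough steps, the sample mean and variance should land close to 0 and 1, the moments of the target distribution.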