Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved
remarkable progress. However, despite their success in the single-view setting, these methods often
struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence
in hallucinated regions remains challenging due to the inherent stochasticity of generative
models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative
hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out
video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval
strategy that adaptively selects salient videos from previous generations as conditional inputs.
In addition, our training incorporates progressive context scaling to improve convergence,
self-conditioning to enhance robustness against long-range visual degradation caused by error
accumulation, and a long-video conditioning mechanism to support extended video generation.
Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer
achieves state-of-the-art video re-rendering, delivering superior view synchronization,
high-fidelity visuals, accurate camera control, and diverse view transformations (e.g.,
third-person → third-person, and head-view → gripper-view in robotic manipulation).
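The retrieval step can be illustrated with a minimal sketch of camera-guided selection of conditioning clips, assuming 4x4 camera-to-world poses per frame and a simple translation-plus-rotation distance; the scoring function and data layout are illustrative assumptions, not PlenopticDreamer's implementation.

    import numpy as np

    def pose_distance(T_a, T_b, lambda_rot=1.0):
        # Distance between two 4x4 camera-to-world poses: translation gap
        # plus a weighted relative-rotation angle (hypothetical scoring).
        t_gap = np.linalg.norm(T_a[:3, 3] - T_b[:3, 3])
        R_rel = T_a[:3, :3].T @ T_b[:3, :3]
        cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
        return t_gap + lambda_rot * np.arccos(cos_angle)

    def retrieve_condition_videos(target_traj, past_clips, k=3):
        # Pick the k previously generated clips whose camera trajectories lie
        # closest to the target trajectory; they become the multi-in conditional
        # inputs for the next autoregressive generation step.
        scores = []
        for clip in past_clips:  # each clip: {"latents": ..., "traj": [4x4 poses]}
            d = np.mean([min(pose_distance(Tq, Tc) for Tc in clip["traj"])
                         for Tq in target_traj])
            scores.append(d)
        order = np.argsort(scores)[:k]
        return [past_clips[i] for i in order]
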
Learning Video Generation for Robotic Manipulation with Collaborative Trajectory
Control
Recent advances in video diffusion models have demonstrated strong potential for generating
robotic decision-making data, with trajectory conditions further enabling fine-grained control.
However, existing trajectory-based methods primarily focus on individual object motion and
struggle to capture the multi-object interactions crucial in complex robotic manipulation. This
limitation arises from multi-feature entanglement in overlapping regions, which leads to degraded
visual fidelity. To address this, we present RoboMaster, a novel framework that models
inter-object dynamics through a collaborative trajectory formulation. Unlike prior methods that
decompose objects, our core idea is to decompose the interaction process into three sub-stages:
pre-interaction, interaction, and post-interaction. Each stage is modeled using the feature of the
dominant object, specifically the robotic arm in the pre- and post-interaction phases and the
manipulated object during interaction, thereby mitigating the drawback of multi-object feature
fusion present during interaction in prior work. To further ensure subject semantic consistency
throughout the video, we incorporate appearance- and shape-aware latent representations for
objects. Extensive experiments on the challenging Bridge V2 dataset, as well as in-the-wild
evaluation, demonstrate that our method outperforms existing approaches, establishing new
state-of-the-art performance in trajectory-controlled video generation for robotic manipulation.
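As a rough, hypothetical illustration of the stage-wise dominant-subject idea (not the released RoboMaster code), the sketch below selects which subject's latent feature drives each frame, assuming per-frame phase labels and pre-extracted arm and object trajectory features.

    import torch

    def collaborative_trajectory_features(arm_feat, obj_feat, phase):
        # arm_feat, obj_feat: [T, C] latent features of the robotic arm and the
        #                     manipulated object along their trajectories
        # phase:              [T] long tensor, 0 = pre-interaction,
        #                     1 = interaction, 2 = post-interaction
        # The robot arm dominates before and after contact; the manipulated
        # object dominates while it is being interacted with.
        use_obj = (phase == 1).unsqueeze(-1)             # [T, 1] boolean mask
        return torch.where(use_obj, obj_feat, arm_feat)  # [T, C]
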
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
This paper aims to manipulate multi-entity 3D motions in video generation. Previous methods on
controllable video generation primarily leverage 2D control signals to manipulate object motions
and have achieved remarkable synthesis results. However, 2D control signals are inherently limited
in expressing the 3D nature of object motions. To overcome this problem, we introduce
3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D
space, given user-desired 6DoF pose (location and rotation) sequences of entities. At the core of
our approach is a plug-and-play 3D-motion grounded object injector that fuses multiple input
entities with their respective 3D trajectories through a gated self-attention mechanism. In
addition, we exploit an injector architecture to preserve the video diffusion prior, which is
crucial for generalization ability.
To mitigate video quality degradation, we introduce a domain adaptor during training and employ an
annealed sampling strategy during inference. To address the lack of suitable training data, we
construct a 360°-Motion Dataset, which first correlates collected 3D human and animal
assets with GPT-generated trajectories and then captures their motion with 12 evenly spaced surrounding
cameras on diverse 3D UE platforms. Extensive experiments show that 3DTrajMaster sets a new
state-of-the-art in both accuracy and generalization for controlling multi-entity 3D motions.
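The gated self-attention fusion can be sketched as below; the token layout, dimensions, and zero-initialised gate are assumptions in the spirit of the description above, not the released 3DTrajMaster architecture.

    import torch
    import torch.nn as nn

    class GatedObjectInjector(nn.Module):
        # Plug-and-play sketch: entity tokens are fused with their 6DoF
        # pose-sequence tokens through self-attention, and the result is added
        # back through a zero-initialised gate so the pre-trained video
        # diffusion prior is preserved at the start of training.
        def __init__(self, dim=1024, num_heads=8, pose_dim=12):
            super().__init__()
            self.pose_proj = nn.Linear(pose_dim, dim)   # embed per-frame [R|t]
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))

        def forward(self, video_tokens, entity_tokens, pose_seq):
            # video_tokens: [B, N, dim]; entity_tokens: [B, E, dim]
            # pose_seq:     [B, E, T, pose_dim] flattened 6DoF poses per entity
            B, E, T, _ = pose_seq.shape
            pose_tokens = self.pose_proj(pose_seq).reshape(B, E * T, -1)
            ctx = torch.cat([video_tokens, entity_tokens, pose_tokens], dim=1)
            fused, _ = self.attn(ctx, ctx, ctx)
            # gated residual: only the video-token slice is injected back
            return video_tokens + torch.tanh(self.gate) * fused[:, :video_tokens.shape[1]]
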
GeoWizard: Unleashing Diffusion Prior for 3D Geometry Estimation from a Single
Image
We introduce GeoWizard, a new generative foundation model designed for estimating geometric
attributes, e.g., depth and normals, from single images. While significant research has already
been conducted in this area, the progress has been substantially limited by the low diversity
and poor quality of publicly available datasets. As a result, prior works are either
constrained to limited scenarios or suffer from an inability to capture geometric details. In
this paper, we demonstrate that generative models, as opposed to traditional discriminative
models (e.g., CNNs and Transformers), can effectively address this inherently ill-posed problem.
We further show that leveraging diffusion priors can markedly improve generalization, detail
preservation, and resource efficiency. Specifically, we extend the original Stable
Diffusion model to jointly predict depth and normals, allowing mutual information exchange and
high consistency between the two representations. More importantly, we propose a simple yet
effective strategy to segregate the complex data distribution of various scenes into distinct
sub-distributions. This strategy enables our model to recognize different scene layouts,
capturing 3D geometry with remarkable fidelity. GeoWizard sets new benchmarks for zero-shot
depth and normal prediction, significantly enhancing many downstream applications such as 3D
reconstruction, 2D content creation, and novel viewpoint synthesis.
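A minimal sketch of the joint depth-and-normal prediction with a scene sub-distribution switch is given below; the convolutional stand-in for the extended diffusion U-Net and the FiLM-style scene modulation are illustrative assumptions, not GeoWizard's implementation.

    import torch
    import torch.nn as nn

    class JointGeometryDenoiser(nn.Module):
        # Depth and normal latents are stacked with the image latent along the
        # channel axis so one shared denoiser can exchange information between
        # the two branches; a learned embedding of the scene sub-distribution
        # (e.g. indoor / outdoor / object-centric) modulates the input.
        def __init__(self, latent_ch=4, num_scene_types=3):
            super().__init__()
            self.denoiser = nn.Conv2d(3 * latent_ch, 2 * latent_ch, 3, padding=1)  # U-Net stand-in
            self.scene_scale = nn.Embedding(num_scene_types, 3 * latent_ch)

        def forward(self, image_lat, depth_lat, normal_lat, scene_type):
            # latents: [B, latent_ch, H, W]; scene_type: [B] long tensor
            x = torch.cat([image_lat, depth_lat, normal_lat], dim=1)
            x = x * (1 + self.scene_scale(scene_type)[:, :, None, None])
            depth_eps, normal_eps = self.denoiser(x).chunk(2, dim=1)
            return depth_eps, normal_eps
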
PanopticNeRF-360: Panoramic 3D-to-2D Label Transfer in Urban Scenes
Training perception systems for self-driving cars requires substantial annotations. However,
manual labeling in 2D images is highly labor-intensive. While existing datasets provide rich
annotations for pre-recorded sequences, they fall short in labeling rarely encountered
viewpoints, potentially hampering the generalization ability of perception models. In this
paper, we present PanopticNeRF-360, a novel approach that combines coarse 3D annotations with
noisy 2D semantic cues to generate consistent panoptic labels and high-quality images from any
viewpoint. Our key insight lies in exploiting the complementarity of 3D and 2D priors to
mutually enhance geometry and semantics. Specifically, we propose to leverage noisy semantic and
instance labels in both 3D and 2D spaces to guide geometry optimization. Simultaneously, the
improved geometry assists in filtering noise present in the 3D and 2D annotations by merging
them in 3D space via a learned semantic field. To further enhance appearance, we combine an MLP and
hash grids to yield hybrid scene features, striking a balance between high-frequency appearance
and predominantly contiguous semantics. Our experiments demonstrate PanopticNeRF-360's
state-of-the-art performance over existing label transfer methods on the challenging urban
scenes of the KITTI-360 dataset. Moreover, PanopticNeRF-360 enables omnidirectional rendering of
high-fidelity, multi-view and spatiotemporally consistent appearance, semantic and instance
labels.
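The hybrid-feature idea can be sketched as follows, with a toy spatial-hash encoding standing in for a full multi-resolution hash grid; resolutions, dimensions, and the hashing scheme are illustrative assumptions, not PanopticNeRF-360's code.

    import torch
    import torch.nn as nn

    class HybridSceneFeatures(nn.Module):
        # The hash-grid branch supplies high-frequency appearance detail, while
        # the smooth coordinate-MLP branch favours contiguous semantics.
        def __init__(self, grid_dim=32, mlp_dim=32, levels=8, table_size=2**16):
            super().__init__()
            self.tables = nn.ModuleList(
                nn.Embedding(table_size, grid_dim // levels) for _ in range(levels))
            self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, mlp_dim))

        def hash_lookup(self, xyz, level):
            # spatial hash of quantised coordinates (simplified: no trilinear blend)
            res = 16 * 2 ** level
            q = (xyz.clamp(0, 1) * res).long()
            idx = q[..., 0] * 73856093 ^ q[..., 1] * 19349663 ^ q[..., 2] * 83492791
            return self.tables[level](idx % self.tables[level].num_embeddings)

        def forward(self, xyz):  # xyz in [0, 1]^3, shape [N, 3]
            grid_feat = torch.cat(
                [self.hash_lookup(xyz, l) for l in range(len(self.tables))], dim=-1)
            return torch.cat([grid_feat, self.mlp(xyz)], dim=-1)  # hybrid feature
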
Panoptic NeRF: 3D-to-2D Label Transfer for Panoptic Urban Scene Segmentation
Large-scale training data with high-quality annotations is critical for training semantic and
instance segmentation models. Unfortunately, pixel-wise annotation is labor-intensive and
costly, raising the demand for more efficient labeling strategies. In this work, we present a
novel 3D-to-2D label transfer method, Panoptic NeRF, which aims to obtain per-pixel 2D
semantic and instance labels from easy-to-obtain coarse 3D bounding primitives. Our method
utilizes NeRF as a differentiable tool to unify coarse 3D annotations and 2D semantic cues
transferred from existing datasets. We demonstrate that this combination allows for improved
geometry guided by semantic information, enabling rendering of accurate semantic maps across
multiple views. Furthermore, this fusion process resolves label ambiguity of the coarse 3D
annotations and filters noise in the 2D predictions. By inferring in 3D space and rendering to
2D labels, our 2D semantic and instance labels are multi-view consistent by design. Experimental
results show that Panoptic NeRF outperforms existing label transfer methods in terms of accuracy
and multi-view consistency on challenging urban scenes of the KITTI-360 dataset.
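A minimal sketch of the 3D-to-2D transfer step, rendering per-ray semantic distributions and supervising them with both noisy 2D cues and coarse 3D primitive labels, is shown below; tensor shapes and the equal loss weighting are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def render_semantics(sigmas, logits, deltas):
        # Volume-render per-sample class logits into a per-ray label distribution.
        # sigmas: [R, S] densities; logits: [R, S, K] class logits;
        # deltas: [R, S] distances between consecutive samples along each ray.
        alpha = 1.0 - torch.exp(-sigmas * deltas)
        trans = torch.cumprod(
            torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
            dim=1)[:, :-1]
        weights = alpha * trans                                   # [R, S]
        probs = torch.softmax(logits, dim=-1)
        return (weights.unsqueeze(-1) * probs).sum(dim=1)         # [R, K]

    def label_transfer_loss(ray_probs, labels_2d, sample_logits, labels_3d):
        # Dual supervision: noisy 2D semantic cues per ray plus coarse
        # 3D bounding-primitive labels per sample point.
        loss_2d = F.nll_loss(torch.log(ray_probs + 1e-8), labels_2d)
        loss_3d = F.cross_entropy(sample_logits.flatten(0, 1), labels_3d.flatten())
        return loss_2d + loss_3d
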
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Camera control has been actively studied in text or image conditioned video generation tasks.
However, altering camera trajectories of a given video remains under-explored, despite its
importance in the field of video creation. It is non-trivial due to the extra constraints of
maintaining multiple-frame appearance and dynamic synchronization. To address this, we present
ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the
dynamic scene of an input video at novel camera trajectories. The core innovation lies in
harnessing the generative capabilities of pre-trained text-to-video models through a simple yet
powerful video conditioning mechanism, a capability often overlooked in current research. To
overcome the scarcity of qualified training data, we construct a comprehensive multi-camera
synchronized video dataset using Unreal Engine 5, which is carefully curated to follow
real-world filming characteristics, covering diverse scenes and camera movements. It helps the
model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse
inputs through a meticulously designed training strategy. Extensive experiments show that our
method substantially outperforms existing state-of-the-art approaches and strong baselines. Our
method also finds promising applications in video stabilization, super-resolution, and
outpainting.
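One plausible reading of such a video conditioning mechanism, sketched below under assumed shapes, is to concatenate the source-video latents with the noisy target latents along the frame axis so the pre-trained model's own spatio-temporal attention relates the two clips; the camera-embedding hookup is likewise an illustrative assumption, not necessarily ReCamMaster's exact design.

    import torch

    def condition_by_frame_concat(target_latents, source_latents, cam_embed):
        # target_latents, source_latents: [B, F, C, H, W] video latents
        # cam_embed: [B, F, C] per-frame embedding of the novel camera trajectory
        tgt = target_latents + cam_embed[..., None, None]   # inject target cameras
        # the concatenated [B, 2F, C, H, W] sequence is fed to the video backbone,
        # whose spatio-temporal attention now attends across both clips
        return torch.cat([source_latents, tgt], dim=1)
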
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception,
Reconstruction and Generation
Recent advances in modeling 3D objects mostly rely on synthetic datasets due to the lack of
large-scale real-scanned 3D databases. To facilitate the development of 3D perception,
reconstruction, and generation in the real world, we propose OmniObject3D, a large-vocabulary 3D
object dataset with massive high-quality real-scanned 3D objects. OmniObject3D has several
appealing properties: 1) Large Vocabulary: It comprises 6,000 scanned objects in 190 daily
categories, sharing common classes with popular 2D datasets (e.g., ImageNet and LVIS), benefiting
the pursuit of generalizable 3D representations. 2) Rich Annotations: Each 3D object is captured
with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images,
and multiple real-captured videos. 3) Realistic Scans: The professional scanners support
high-quality object scans with precise shapes and realistic appearances. With the vast exploration
space offered by OmniObject3D, we carefully set up four evaluation tracks: a) robust 3D perception,
b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation. Extensive
studies are performed on these four benchmarks, revealing new observations, challenges, and
opportunities for future research in realistic 3D vision.
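For concreteness, a hypothetical per-object record mirroring the annotations listed above might look like the sketch below; the field names and file layout are illustrative, not the dataset's actual format.

    from dataclasses import dataclass, field
    from pathlib import Path
    from typing import List

    @dataclass
    class OmniObjectRecord:
        # One real-scanned object from one of the ~190 daily categories.
        category: str
        object_id: str
        textured_mesh: Path                                   # scanned surface with texture
        point_cloud: Path
        multiview_renders: List[Path] = field(default_factory=list)
        captured_videos: List[Path] = field(default_factory=list)
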
Honors & Awards
ICCV Best Paper Award Candidate, 2025
CVPR Best Paper Award Candidate, 2023
Hong Kong PhD Fellowship Scheme (HKPFS), Hong Kong SAR, 2023
CUHK Vice-Chancellor HKPFS Scholarship, 2023
Outstanding Graduation Thesis Award of Zhejiang University, 2022
National Scholarship, Ministry of Education of P.R. China