UniVA: Universal Video Agent
towards Open-Source Next-Generation Video Generalist
🤖 Highly automated, interactive, proactive video creation experience
- Multi-round co-creation. Talk like a director; UniVA iterates shots & stories with you.
- Deep memory & context. Global + user memory keep preferences, lore, and styles consistent.
- Implicit intent reading. Understands vague & evolving instructions; less prompt hacking.
- Proactive agent. Plans, checks, and suggests better shots & stories on its own, not just obeying.
- End-to-end workspace. UniVA plans, calls tools, and delivers full videos.
- Extensibility. MCP-native, modular design, easy to extend with new models & tools.
🎬 Omnipotent, unified, industrial-grade video production engine
- Any-conditioned pipeline. Text / Image / Entity / Video → controllable video in one framework.
- Super HD & consistent. Cinematic quality with stable identity & objects.
- Complex narratives. Multi-scene, multi-role, multi-shot stories under structured control.
- Ultra-long & fine-grained editing. From long-form cuts to per-shot/per-object refinement.
- Grounded by understanding. Long-video comprehension & segmentation guide generation & edits.
Describe a universe, a campaign, a pet, or a long-form story! UniVA will plan, compose and produce the video for you.
🚀 Try UniVA Demo System
Quick Guide
Video Gallery
Object Consistency - Girl Dance
Input
I want a 20-second-long video with 4 segments. The main subject of the video is a girl dancing. However, the background of each of the 4 segments needs to be different. I need the first segment to be a cyberpunk background, the second segment to be an aesthetic ink dream, the third segment to be a retro film block, and the fourth segment to be an abstract geometric dynamic. The main subject of the four segments should be consistent and the movements should be coherent.
Output
Complex Generation - BreadTalk Ad
Input
Please create an advertisement based on the following product advertising requirements. 1. Kneading dough in hands, close-up shot, highlighting the texture of the dough. 2. Sprinkling cherry blossom petals on freshly baked bread, slow motion close-up. 3. Customers taste bread in the store and show satisfied smiles. 4. The brand logo appears, with the text: 'BreadTalk'.
Output
Complex Generation - Short Documentary
Input
Please generate a 30-second short documentary video based on the following story beats. 1. Close-up of clay meeting a spinning wheel; fingers press and a rib tool carves spirals as slip flicks outward under warm studio light. 2. Over-the-shoulder time-lapse: the vessel rises from cylinder to wide bowl; wet sheen glistens while the wheel slows. 3. Kiln-loading montage: shelves slide in, the door seals, orange heat blooms; a thermocouple readout climbs as a notebook of glaze formulas flips. 4. Slow-motion glaze pour coats the cooled bowl; cross-dissolve into a firing time-lapse where crystalline patterns emerge. 5. Morning reveal: final bowl on a wooden table beside steaming tea; the potter signs the foot and exhales in quiet satisfaction.
Output
Video2Video - Story Video
Input
Recreate a new video that mirrors the original's style—cinematic transitions, lighting, pacing, and tone—but tells the story of an elderly man reliving his youth through a dreamlike journey across time.
Output
Complex Generation - Mood Piece
Input
Please generate a 20-second mood piece based on the following sequence. 1. Macro close-up of raindrops striking a neon-lit puddle; ripples mirror street signage in shimmering bokeh. 2. Slow-motion silhouettes under translucent umbrellas traverse a zebra crossing while headlight trails streak past. 3. An elevated train thunders by; droplets bead and slide down a window as the interior sound softens to breath. 4. A cat shelters under an awning; a vendor hands a steaming paper cup to a passerby, vapor curling into mist. 5. Dawn edges in: clouds lift to a pastel sky, a final drip falls from a traffic light, and the city exhales.
Output
Video2Video - Prequel Story
Input
Create a prequel to the original video that introduces the backstory of the same characters, matching their look, voice, and animation style, but telling a different story that leads into the original events.
Output
Complex Generation - Professional Video
Input
Clip 1 (Morning Preparation): The man stands before a mirror and adjusts the collar of his grey overcoat, his eyes filled with confidence. Clip 2 (Focused at Work): The man is focused on his computer screen, his fingers typing swiftly on the keyboard. Clip 3 (Afternoon Meeting): In a bright meeting room, the man engages in a meeting, listening attentively. Clip 4 (End of the Day): At dusk, he closes his laptop, and looks out the window, with an expression of satisfaction and relief on his face.
Output
Video2Video - Style Transfer
Input
Maintain all plot and motion as-is, and apply a Chinese ink-painting style to the visuals.
Output
These are only the opening shots.
UniVA is built for longer stories, richer worlds, and multi-step video workflows that go far beyond the samples above. If you can imagine it, UniVA can orchestrate it.
Try it end-to-end with our live agentic system:
🚀 Try UniVA Demo
No fixed templates. Just your imagination + UniVA’s tools.
Technical Highlights
Dual-Agent Orchestration
A planner reasons over long-horizon goals and memory, while an executor grounds each step in MCP tool calls. Together they turn vague prompts into structured, verifiable plans instead of one-shot guesses.
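To make the planner/executor split concrete, here is a minimal Python sketch of that loop under assumed names (Plan, Step, plan_steps, execute_step); none of these are UniVA's actual API, and the stubs stand in for LLM planning and MCP tool execution.

```python
# Hypothetical planner/executor loop; names and stubs are illustrative only.
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str              # MCP tool the executor should invoke
    args: dict             # arguments resolved from the plan and memory
    result: object = None

@dataclass
class Plan:
    goal: str
    steps: list = field(default_factory=list)

def plan_steps(goal: str) -> Plan:
    """Planner: decompose a vague prompt into ordered, verifiable steps.
    A real planner would be an LLM call conditioned on memory."""
    return Plan(goal, [
        Step("storyboard", {"prompt": goal}),
        Step("text2video", {"shots": "<storyboard output>"}),
        Step("stitch", {"clips": "<generated clips>"}),
    ])

def execute_step(step: Step) -> object:
    """Executor: ground one step in an actual MCP tool call (stubbed here)."""
    return f"ran {step.tool} with {step.args}"

def run(goal: str) -> Plan:
    plan = plan_steps(goal)
    for step in plan.steps:
        step.result = execute_step(step)   # the planner can verify or replan after each step
    return plan

if __name__ == "__main__":
    for s in run("a 20-second dance video with four backgrounds").steps:
        print(s.tool, "->", s.result)
```

Because the plan is an explicit data structure rather than a single prompt, each step can be checked, retried, or revised before the next tool call is made.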
Composable Tool Fabric
UniVA connects video, vision, language, and utility tools through the Model Context Protocol. Agents dynamically select and chain tools, enabling plug-and-play expansion and multi-step pipelines instead of isolated black-box calls; a minimal registry sketch follows the list below.
- Atomic & workflow tools for generation, editing, tracking, understanding.
- Dynamic routing: the agent chooses the right tools for each stage.
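The registry-plus-routing idea can be illustrated in a few lines of Python. The tool names (text2video, style_transfer) and the register/route helpers are assumptions for this sketch, not UniVA's real MCP catalogue or client.

```python
# Illustrative tool registry with dynamic routing; not UniVA's actual MCP client.
from typing import Callable, Dict

TOOLS: Dict[str, Callable[..., str]] = {}

def register(name: str):
    """Register a callable under a tool name, MCP-style."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register("text2video")
def text2video(prompt: str) -> str:
    return f"<clip for '{prompt}'>"

@register("style_transfer")
def style_transfer(clip: str, style: str) -> str:
    return f"<{clip} rendered in {style} style>"

def route(stage: str, **kwargs) -> str:
    """Dynamic routing: the agent picks whichever registered tool matches the stage."""
    return TOOLS[stage](**kwargs)

# Chaining two atoms into a small pipeline.
clip = route("text2video", prompt="raindrops on a neon street")
print(route("style_transfer", clip=clip, style="Chinese ink-painting"))
```

New capabilities extend the pipeline simply by registering another callable; nothing in the routing logic has to change.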
Three-Level Memory Mechanism
A hierarchical memory design maintains story state, user preferences, and tool context, so characters, styles, and constraints stay coherent across long videos and iterative edits (a minimal sketch follows the list below).
- Trace memory for past plans, calls, and outcomes.
- User memory for persistent style, brand, and constraints.
- Task memory for scripts, entities, and storyboards.
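As a rough illustration of how the three levels differ, here is a minimal Python sketch; the field names (events, preferences, entities, storyboard) are assumptions, not UniVA's internal schema.

```python
# Minimal sketch of the three memory levels as plain data containers.
from dataclasses import dataclass, field

@dataclass
class TraceMemory:                 # past plans, tool calls, and outcomes
    events: list = field(default_factory=list)
    def log(self, tool: str, outcome: str):
        self.events.append({"tool": tool, "outcome": outcome})

@dataclass
class UserMemory:                  # persistent style, brand, constraints
    preferences: dict = field(default_factory=dict)

@dataclass
class TaskMemory:                  # scripts, entities, storyboards for the current task
    entities: dict = field(default_factory=dict)
    storyboard: list = field(default_factory=list)

@dataclass
class AgentMemory:
    trace: TraceMemory = field(default_factory=TraceMemory)
    user: UserMemory = field(default_factory=UserMemory)
    task: TaskMemory = field(default_factory=TaskMemory)

mem = AgentMemory()
mem.user.preferences["style"] = "cinematic, warm lighting"
mem.task.entities["hero"] = "elderly man in a grey overcoat"
mem.trace.log("text2video", "clip_01 rendered")
print(mem.user.preferences, mem.task.entities, mem.trace.events)
```

Splitting the levels lets user preferences persist across projects, while task memory is rebuilt per story and trace memory grows with every tool call.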
Function Walkthrough
UniVA integrates an extensive, modular toolset via the Model Context Protocol (MCP), enabling flexible plug-and-play extension across diverse media tasks.
- Functions are organized into three families: Video Tools, Non-Video Tools, and Non-AI Tools.
- [Atom] Single-purpose, fine-grained tools for precise generation or editing.
- [Workflow] High-level pipelines that compose multiple atoms to solve complex tasks.
Note: The above taxonomy covers only the meta-functions currently supported in UniVA. Thanks to the MCP architecture, the system is readily extensible, allowing seamless integration of new tools, capabilities, and media modalities in the future; an illustrative manifest of the taxonomy is sketched below.
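One way to picture the taxonomy is a nested manifest keyed by family and tool kind. The concrete tool names below are invented examples for this sketch, not UniVA's actual function list.

```python
# Illustrative manifest grouping atom and workflow tools by family.
# The families mirror the taxonomy above; the tool names are examples only.
TOOL_MANIFEST = {
    "video": {
        "atom": ["text2video", "video_inpaint", "object_track"],
        "workflow": ["longtext2video", "video2video_restyle"],
    },
    "non_video": {
        "atom": ["image_gen", "speech_synth"],
        "workflow": ["storyboard_from_script"],
    },
    "non_ai": {
        "atom": ["ffmpeg_cut", "ffmpeg_concat"],
        "workflow": [],
    },
}

def list_tools(family: str, kind: str) -> list:
    """Look up tools of a given kind ('atom' or 'workflow') within a family."""
    return TOOL_MANIFEST.get(family, {}).get(kind, [])

print(list_tools("video", "workflow"))   # -> ['longtext2video', 'video2video_restyle']
```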
UniVA-Bench
UniVA-Bench is a unified benchmark for agent-oriented video intelligence, mirroring real workflows where understanding, generation, and editing are intertwined rather than isolated single-step tasks; an illustrative task record is sketched at the end of this section.
Long-Video QA & Reasoning
Multi-question QA on the same long video, covering narrative, style, transitions, and fine-grained semantics to test temporal reasoning and memory.
- Shot-level & story-level queries.
- Requires using context across the entire video.
Multi-Step Long-Video Editing
Realistic editing chains (replacement, attribute change, style transfer, and composition) that must keep characters, scenes, and story logic consistent.
- Plans must select & combine proper tools.
- Rewards temporal and identity coherence.
Tool-Augmented Video Creation
Three creation modes (LongText2Video, Image/Entities2Video, and Video2Video) test whether agents can plan, preserve identity, and produce coherent multi-shot narratives.
- Storyboard-first, identity-aware generation.
- Evaluates end-to-end agentic pipelines.
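For orientation, a benchmark item across the three tracks might be represented roughly like the record below; the schema and field names are assumptions for illustration, not UniVA-Bench's released format.

```python
# Hypothetical UniVA-Bench item; the schema is illustrative, not the released format.
from dataclasses import dataclass, field

@dataclass
class BenchTask:
    track: str          # e.g. "long_video_qa", "multi_step_editing", or "creation"
    source: str         # path or URL of the input video, images, or long text
    instruction: str    # the natural-language task handed to the agent
    checks: list = field(default_factory=list)   # criteria a judge scores against

qa_task = BenchTask(
    track="long_video_qa",
    source="bench/clips/story_017.mp4",
    instruction="How does the lighting change between the first and last scene?",
    checks=["uses context across the entire video", "names both lighting states"],
)
print(qa_task.track, len(qa_task.checks))
```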
Team
Zhengyang Liang* (SMU)
Daoan Zhang* (UR)
Huichi Zhou (UCL)
Rui Huang (NUS)
Bobo Li (NUS)
Yuechen Zhang (CUHK)
Shengqiong Wu (NUS)
Xiaohan Wang (Stanford)
Jiebo Luo (UR)
Lizi Liao (SMU)
Hao Fei# (NUS)
*Core contributors, equal contribution ([email protected]; [email protected])
#Project lead, correspondence ([email protected])
Citation
If you find UniVA useful for your research, please cite our work.
@misc{liang2025univauniversalvideoagent,
title={UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist},
author={Zhengyang Liang and Daoan Zhang and Huichi Zhou and Rui Huang and Bobo Li and Yuechen Zhang and Shengqiong Wu and Xiaohan Wang and Jiebo Luo and Lizi Liao and Hao Fei},
year={2025},
eprint={2511.08521},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.08521},
}
Acknowledgement
We sincerely thank our colleagues, collaborators, and research partners for their valuable discussions and constructive feedback that helped shape the design and implementation of UniVA.
The current version of UniVA is a research prototype, and its overall capability is subject to the performance limitations of various backend modules, including perception, reasoning, and generative components. We will continue to refine and expand these modules to further enhance the agent’s reliability, scalability, and generalization ability in future releases.
Open-Source Policy: The code and models of UniVA are released under an open academic license. They are freely available for research and educational purposes; any form of commercial use without explicit written permission from the authors is strictly prohibited. Unauthorized commercial usage or redistribution may infringe intellectual property rights and will be subject to legal liability.
For collaboration, licensing inquiries, or commercial partnerships, please contact us at [email protected].