Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model
Abstract: In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and labor costs. Text-guided video generation (T2V) emerges as a promising solution to overcome this issue by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. Then, we propose a Progressive Video-Instruction Tuning scheme to transfer rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets for privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Experiments on video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrate that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate the capability of Ophora for empowering the downstream task of ophthalmic surgical workflow understanding.
Introduction
This repository is for our work submitted to MICCAI25, titled "Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model".
We have released the training and inference code for Ophora, along with the model checkpoint and dataset.
The curated large-scale dataset Ophora-160K can be accessed at Ophora-160K datasets.
Prepare Environment
Training and inference with Ophora require an environment compatible with the CogVideoX-2b model.
Please refer to its official page for installation instructions and dependencies: CogVideoX-2b on Hugging Face.
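Before training, it can help to confirm that the expected packages are importable. The following is a minimal sketch and not part of this repository; the package list (`torch`, `diffusers`, `transformers`) is an assumption based on typical CogVideoX setups, so adjust it to match the official installation instructions.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed package list; not taken from the Ophora repository.
    assumed = ["torch", "diffusers", "transformers"]
    for name, found in check_packages(assumed).items():
        print(f"{name}: {'ok' if found else 'missing'}")
```

This only checks that each package can be located; it does not verify versions or GPU availability.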
To prepare the dataset for model training, run:
bash prepare_dataset.sh
Train
Transfer Pre-Training
bash TPT.sh
Privacy-Preserving Fine-tuning
bash P2FT.sh
Inference
We provide phase captions written by professional ophthalmologists based on the phase labels in the Cataract-1K dataset.
You can use the Cataract-1K-phase_prompts.csv file for inference.
bash sample.sh
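To inspect the phase prompts before sampling, the CSV file can be read with the Python standard library. This is a sketch under stated assumptions: the column layout is not documented here, so the header names `phase` and `prompt` are hypothetical defaults, and `load_phase_prompts` is an illustrative helper rather than a function from this repository.

```python
import csv
import io


def load_phase_prompts(csv_file, phase_col="phase", prompt_col="prompt"):
    """Read (phase, prompt) pairs from an open CSV file with a header row.

    The column names are hypothetical defaults; pass the real ones if they differ.
    """
    reader = csv.DictReader(csv_file)
    pairs = []
    for row in reader:
        prompt = (row.get(prompt_col) or "").strip()
        if prompt:  # skip rows with an empty prompt
            pairs.append(((row.get(phase_col) or "").strip(), prompt))
    return pairs


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file; the contents are made up.
    sample = io.StringIO("phase,prompt\nIncision,Example caption text\nIdle,\n")
    for phase, prompt in load_phase_prompts(sample):
        print(f"{phase}: {prompt}")
```

For the real file, open it with `open("Cataract-1K-phase_prompts.csv", newline="", encoding="utf-8")` and pass the file object to the helper, supplying the actual column names if they differ from the assumed ones.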
Citation
@article{li2025ophora,
title={Ophora: A large-scale data-driven text-guided ophthalmic surgical video generation model},
author={Li, Wei and Hu, Ming and Wang, Guoan and Liu, Lihao and Zhou, Kaijin and Ning, Junzhi and Guo, Xin and Ge, Zongyuan and Gu, Lixu and He, Junjun},
journal={arXiv preprint arXiv:2505.07449},
year={2025}
}