Shiyao Xu, Benedetta Liberatori, Gül Varol, Paolo Rota
3DV 2026
Sorry for the delay, but I'm really working on updating everything 🥹

TODO list:
- update the website page
- README
- release the dataset <<< needs to be updated, but at least the `.npy` files are released
- training scripts for DEMO
- training scripts for UniMotion
- evaluation scripts (eval seems to be self-contained, if I remember correctly)
- pretrained weights (on Hugging Face?)
- dataset generation scripts
- maybe some ablation experiments and settings, if anyone is interested
For conda:

```bash
conda create python=3.9 -n demo  # pinned to 3.9 since my env broke; 3.10+ should also work
conda activate demo
# e.g. torch==2.6.0+cu124, or something else
conda install nvidia/label/cuda-12.1.1::cuda-toolkit
pip install torch torchvision torchmetrics sentencepiece peft einops fastapi gradio numpy openai opencv_python pillow ray requests shortuuid tqdm uvicorn scipy bitsandbytes deepspeed tensorboard
# i prefer transformers==4.44.0
pip install transformers==4.44.0
pip install flash-attn --no-build-isolation
```
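After installing, a quick sanity check that the CUDA build of PyTorch is active and that flash-attn imports cleanly (this snippet is just a convenience, not part of the repo):

```python
import torch

# the CUDA build must be active for deepspeed/flash-attn training
print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())

# flash-attn sometimes builds against a mismatched torch/CUDA combo;
# a clean import (and version print) is a good sign
import flash_attn
print("flash_attn:", flash_attn.__version__)
```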
```bash
# for evaluation
pip install git+https://github.com/openai/CLIP.git
# pycocoevalcap needs openjdk=8
conda install -c conda-forge openjdk=8
pip install pycocoevalcap
# bert_score is needed for SODA(B)
pip install bert_score
```

Inference results: see `datasets/result.json`.
Pretrained models: see Hugging Face or Google Drive.
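If the weights end up on Hugging Face, something like this should fetch them (the repo id below is a placeholder, not the real one):

```python
from huggingface_hub import snapshot_download

# NOTE: "username/demo-weights" is a hypothetical repo id;
# replace it with the actual one once the weights are published
local_dir = snapshot_download(repo_id="username/demo-weights")
print("weights downloaded to:", local_dir)
```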
Please follow HumanML3D: actually download the AMASS dataset, then go through the HumanML3D processing steps. Note that we only need the `new_joints` data, so the motion-representation script in HumanML3D is probably not needed.
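As a quick check that the processed data looks right, each `new_joints` file should load as per-frame 3D joint positions (this assumes the standard HumanML3D layout; the file name below is just an example):

```python
import numpy as np

# any file from your HumanML3D new_joints folder works here
joints = np.load("new_joints/000000.npy")

# HumanML3D new_joints store 3D positions of 22 joints per frame
print(joints.shape)  # expected: (num_frames, 22, 3)
```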
Download the CompMo dataset from Hugging Face.
Visualizations (mp4) and the data (`.npy` files) can also be found in this Google Drive (<<< TBD, since the mp4s are really large).
- Put HumanML3D and CompMo anywhere you want.
- Make sure you also download `stageX.json` and put it in `datasets/`.
- You need to modify the `motion` path in `stage1.json` and `stage2.json`, i.e. regenerate the JSON:
```bash
# replace the path with your downloaded dataset location
python data/generate_json.py
```
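Alternatively, if you just need to swap the dataset root in already-generated JSONs, a generic prefix rewrite like this works (`OLD_ROOT`/`NEW_ROOT` are yours to fill in; the actual schema is whatever `generate_json.py` emits, so this makes no assumptions about field names):

```python
import json

OLD_ROOT = "/old/path/to/datasets"   # whatever the JSON currently contains
NEW_ROOT = "/your/path/to/datasets"  # where you actually put the data

def rewrite(node):
    # walk the JSON and rewrite any string that starts with the old root
    if isinstance(node, str):
        return NEW_ROOT + node[len(OLD_ROOT):] if node.startswith(OLD_ROOT) else node
    if isinstance(node, list):
        return [rewrite(x) for x in node]
    if isinstance(node, dict):
        return {k: rewrite(v) for k, v in node.items()}
    return node

for name in ("datasets/stage1.json", "datasets/stage2.json"):
    with open(name) as f:
        data = json.load(f)
    with open(name, "w") as f:
        json.dump(rewrite(data), f, indent=2)
```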
Now you can train the model!
TBC. This part of the code and the scripts depends on STMC, so you can also generate your own densely annotated data.
TBD, will be released on Google Drive (or Hugging Face?).
You just need to modify your HumanML3D `new_joints` path:
```bash
deepspeed train.py --deepspeed scripts/zero2.json --freeze_backbone True --conv_type plain \
  --tune_mm_mlp_adapter True --data_path datasets/stage1.json \
  --motion_folder YOUR_JOINTS_ROOT_OF_HUMANML3D --data_root YOUR_JOINTS_ROOT_OF_HUMANML3D \
  --motion_dim 1056 --exp_name stage1 --output_dir logs/stage1 --log_base logs --vision_tower mlp
```
You need to modify your CompMo path:
```bash
deepspeed train.py --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
  --deepspeed scripts/zero2.json --conv_type llama_3 --pretrain_mm logs/stage1/mm_projector.bin \
  --group_by_modality_length True --exp_name stage2 --output_dir logs/stage2 \
  --data_path datasets/stage2.json --motion_folder YOUR_COMPMO_PATH --data_root YOUR_COMPMO_PATH \
  --motion_dim 1056 --num_train_epochs 5 --gradient_accumulation_steps 2 --per_device_train_batch_size 4 \
  --evaluation_strategy no --save_strategy steps --save_steps 5000 --save_total_limit 1 \
  --learning_rate 2e-5 --warmup_ratio 0.1 --lr_scheduler_type cosine --logging_steps 1 \
  --model_max_length 3072 --gradient_checkpointing True --max_grad_norm=1.0 \
  --dataloader_num_workers 4 --lazy_preprocess True --bf16 True --vision_tower mlp \
  --log_base logs --model_name_or_path logs/stage1
```
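Stage 2 trains LoRA adapters, so for use outside this repo you may want merged weights. I haven't verified how this codebase saves its checkpoints (and `inference.py` below loads `logs/stage2` directly), but if it is a standard PEFT adapter on top of the stage-1 model, a merge would look roughly like this:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# ASSUMPTION: logs/stage2 contains a standard PEFT adapter over the
# stage-1 base model; check how train.py actually saves checkpoints
base = AutoModelForCausalLM.from_pretrained("logs/stage1")
model = PeftModel.from_pretrained(base, "logs/stage2")

# fold the LoRA deltas into the base weights and drop the adapter wrappers
merged = model.merge_and_unload()
merged.save_pretrained("logs/stage2-merged")
```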
Run inference:

```bash
python inference.py --model_path logs/stage2 --data datasets/test.json --output datasets/results/test.json --motion_dim 1056
```
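To eyeball the output before running the metrics (the exact schema depends on `inference.py`, so this just dumps the first entry):

```python
import json

# path matches the --output of the inference command above
with open("datasets/results/test.json") as f:
    results = json.load(f)

# peek at the structure and the first entry
print(type(results).__name__, len(results))
first = results[0] if isinstance(results, list) else next(iter(results.items()))
print(json.dumps(first, indent=2, ensure_ascii=False))
```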
In the `eval` dir:

```bash
python prepare.py --pred_file ../datasets/result.json --comp_path YOUR_COMP_PATH

# for temporal metrics: tIoU, F1
python overlap.py --pred_file data/result.json --ref_file data/gt.json --verbose
```
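For reference, temporal IoU between a predicted segment and a ground-truth segment is just interval intersection over union (this is the standard definition, not necessarily the exact code in `overlap.py`):

```python
def t_iou(pred, gt):
    """IoU of two temporal segments, each given as (start, end) in seconds or frames."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. prediction [2.0, 6.0] vs ground truth [4.0, 8.0]:
# intersection = 2.0, union = 6.0 -> tIoU ~= 0.33
print(t_iou((2.0, 6.0), (4.0, 8.0)))
```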
```bash
# for CIDEr, ROUGE_L, METEOR, BLEU-1, BLEU-4
python cider.py -s data/result.json -r data/gt.json -v

# for SODA
python soda.py -p data/result.json -r data/gt.json -v

# for SODA(B), i.e. SODA with BertScore
python soda.py -p data/result.json -r data/gt.json -v -m BertScore
```
For TMR: TODO. For CAR, please refer to the ChronAccRet code.
For dense captioning, part of our evaluation follows the protocol used in Chapter-Llama; we also use the TMR similarity from TMR and the CAR evaluation from ChronAccRet.
```bibtex
@article{xu2025densemotioncaptioning,
  title={Dense Motion Captioning},
  author={Shiyao Xu and Benedetta Liberatori and Gül Varol and Paolo Rota},
  journal={arXiv preprint arXiv:2511.05369},
  url={https://arxiv.org/abs/2511.05369},
  year={2025}
}
```