RLLaVA
RLLaVA is a user-friendly framework for multi-modal RL. It features an RL-centric design that decouples algorithm logic from distributed execution, enables modular customization of algorithms, models, and engines, and is optimized for resource-constrained setups to make advanced RL research more accessible.
- RL-Centric: Implements an algorithm-driven approach tailored for RL, decoupling logic from distributed execution so researchers can focus on innovation without distributed-system complexities.
- Modular Design: Develop, extend, and customize RL algorithms and multi-modal architectures as easily as snapping together building blocks.
- Resource-Efficient: Optimized for resource-constrained teams; most tasks run on a single 24GB GPU, making multi-modal RL truly accessible.
- User-Friendly: Minimalist code with familiar HuggingFace & PyTorch APIs for seamless setup and extension.
```bash
git clone https://github.com/TinyLoopX/RLLaVA && cd RLLaVA
conda create -n rllava python==3.12 && conda activate rllava
bash ./install.sh
```

We provide ready-to-run scripts for various algorithms and tasks in the examples/ directory.

```bash
# Example: Train with GRPO
bash examples/algorithms/qwen2_5_vl_3b_geoqa3k_grpo.sh
```

You can explore more examples in the directory structure:
```
examples/
├── algorithms/       # Algorithm comparisons and ablations (GRPO, RLOO, DAPO, etc.)
└── tasks/            # End-to-end task scripts:
    ├── math/         # Geometry, reasoning, and equation solving
    ├── counting/     # Object counting and compositional queries
    ├── grounding/    # Visual grounding and detection-style tasks
    ├── agent_search/ # Web search-augmented agents
    ├── agent_code/   # Code-generation agents with tool use
    └── ...           # More real-world multi-modal benchmarks
```

RLLaVA makes it easy to define custom tasks. You only need 3 files:
- Reward function → examples/reward_function/your_task.py (a sketch follows after this list)
- Prompt template → examples/format_prompt/your_task.jinja
- Launch script / command → point to your dataset + reward + prompt (no need to modify the YAML directly):

```bash
torchrun -m rllava.train.pipeline.rlvr \
    config=examples/config.yaml \
    data.train_files=your_org/dataset@train \
    data.format_prompt=./examples/format_prompt/your_task.jinja \
    reward.reward_function=./examples/reward_function/your_task.py:compute_score \
    algorithm.adv_estimator=grpo  # Switch algorithms here (rloo, remax, ppo, etc.)
```

For detailed usage instructions, please refer to examples/README.md.
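The launch command resolves the reward from a file-path:function reference (compute_score above). The exact arguments RLLaVA passes to that function are defined by the shipped examples, so treat the following as a minimal, illustrative sketch of a rule-based reward for a boxed-answer math task, with an assumed (response, ground_truth) string signature rather than the framework's canonical interface:

```python
# examples/reward_function/your_task.py -- illustrative sketch only.
# The (response, ground_truth) signature is an assumption; check the shipped
# reward functions under examples/reward_function/ for the exact interface.
import re


def compute_score(response: str, ground_truth: str) -> float:
    """Rule-based reward: small format bonus plus exact-match accuracy."""
    # Format bonus if the answer is wrapped in \boxed{...}, a common math-task convention.
    match = re.search(r"\\boxed\{(.+?)\}", response)
    format_score = 0.1 if match else 0.0

    # Accuracy: compare the extracted (or raw) answer against the ground truth.
    predicted = match.group(1).strip() if match else response.strip()
    accuracy = 1.0 if predicted == ground_truth.strip() else 0.0

    return format_score + 0.9 * accuracy
```

Because the reward is plain Python resolved at launch time, you can swap in different matchers or verifiers without touching the training pipeline.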
We support a broad family of RL methods, enabled by simple config switches:
- GRPO, RLOO, REINFORCE++, OPO, REMAX, GPG, PPO, DAPO, GMPO, GSPO, DR-GRPO, CLIP-COV, KL-COV
Models:
- Qwen2-VL/Qwen2.5-VL/Qwen3-VL vision language models
- TinyLLaVA-style architectures with customizable vision encoders, connectors, and LLMs
- Support for LLMs (e.g., Qwen3, LLaMA) in text-only RL scenarios
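The Qwen-series VLMs above are standard HuggingFace checkpoints, so they can be inspected with the familiar transformers APIs before training; the checkpoint RLLaVA trains is typically selected through the training config (examples/config.yaml in the launch command above). A minimal, illustrative sketch (the checkpoint ID is the public HuggingFace one and is used here only as an example):

```python
# Illustrative only: loading a supported VLM with plain HuggingFace APIs.
# Requires a transformers release that includes Qwen2.5-VL support.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # public checkpoint, example only
processor = AutoProcessor.from_pretrained(model_id)  # tokenizer + image processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

print(model.config.model_type)  # -> "qwen2_5_vl"
```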
Backends:
- Training: FSDP, FSDP2, DeepSpeed
- Inference: vLLM, HuggingFace
We welcome contributions! We're especially interested in new RL algorithms, new multi-modal tasks, and improvements for resource-constrained setups. Have questions? Join our WeChat group.
Our RL algorithms and distributed training implementation draw inspiration from the open-source community, particularly veRL, EasyR1, and AReaL.
```bibtex
@misc{zhao2025rllava,
  title        = {RLLaVA: An RL-central Framework for Language and Vision Assistants},
  author       = {Lei Zhao and Zihao Ma and Boyu Lin and Yuhe Liu and Wenjun Wu and Lei Huang},
  howpublished = {\url{https://github.com/TinyLoopX/RLLaVA}},
  year         = {2025}
}
```
