Key capabilities
Unified vision tasks through next point prediction
Object Detection
Detect and localize objects by taking category names as natural language inputs, enabling flexible and intuitive text-based object detection.
Object Referring
Identify and localize objects corresponding to natural language referring expressions, enabling fine-grained alignment between linguistic descriptions and visual content.
Object Pointing
Predict the precise point location of a target object specified by a natural language description, allowing fine-grained and lightweight spatial localization.
OCR
Detect and recognize words or text lines by predicting bounding boxes or polygons corresponding to textual regions in the image.
Visual Prompting
Detect all objects belonging to the same category as the provided visual prompt, where the reference object is specified by one or more bounding boxes in the input image.
Keypoint Detection
Detect instances and output a standardized set of semantic keypoints (e.g., 17 joints for humans/animals), providing structured pose representations.
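The Quick Start below shows only the detection task in full. The other capabilities plausibly route through the same inference entry point with a different task string; the sketch here illustrates that pattern, but every task name and keyword argument beyond task="detection" is an assumption, not the confirmed Rex-Omni API. Check the repository tutorials for the real signatures.

# Hypothetical sketch: only task="detection" appears in the Quick Start below.
# All other task names and argument conventions here are assumptions.
from PIL import Image
from rex_omni import RexOmniWrapper

rex = RexOmniWrapper(model_path="IDEA-Research/Rex-Omni", backend="transformers")
image = Image.open("your_image.jpg")

# Object referring / pointing: a free-form phrase instead of category names
refer = rex.inference(images=image, task="referring", categories=["the man in a red hat"])  # assumed task name
point = rex.inference(images=image, task="pointing", categories=["the leftmost cup"])  # assumed task name

# OCR: boxes or polygons over textual regions
ocr = rex.inference(images=image, task="ocr_box")  # assumed task name

# Keypoint detection: per-instance semantic keypoints (e.g., 17 human joints)
pose = rex.inference(images=image, task="keypoint", categories=["person"])  # assumed task name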
Unified architecture for multiple vision tasks
Next point prediction framework
Rex-Omni reformulates visual perception as a next point prediction problem, unifying diverse vision tasks within a single generative framework. It predicts spatial outputs (e.g., boxes, points, polygons) auto-regressively and is optimized through a two-stage training pipeline: large-scale Supervised Fine-Tuning (SFT) for grounding, followed by GRPO-based reinforcement learning to refine geometry awareness and behavioral consistency.
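To make next point prediction concrete, the sketch below shows how continuous box coordinates can be serialized into a discrete token sequence for autoregressive decoding. The bin count and token format are illustrative assumptions, not the model's actual vocabulary.

# Illustrative sketch of coordinates-as-tokens, not Rex-Omni's actual tokenizer.
# Assumption: coordinates are normalized to [0, 1] and quantized into NUM_BINS
# discrete bins, so a box becomes four "point" tokens predicted one at a time.
NUM_BINS = 1000  # assumed quantization granularity

def quantize(v: float) -> int:
    """Map a normalized coordinate in [0, 1] to a discrete bin index."""
    return min(int(v * NUM_BINS), NUM_BINS - 1)

def box_to_tokens(x0, y0, x1, y1, img_w, img_h):
    """Serialize a pixel-space box as a sequence of coordinate tokens."""
    return [f"<{quantize(c / s)}>" for c, s in
            ((x0, img_w), (y0, img_h), (x1, img_w), (y1, img_h))]

# A 640x480 image with a box at (64, 120, 320, 360):
print(box_to_tokens(64, 120, 320, 360, 640, 480))
# ['<100>', '<250>', '<500>', '<750>']

Under this framing, points, polygons, and keypoints are just shorter or longer sequences of the same coordinate tokens, which is what lets one generative decoder cover all the tasks listed above.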
Quick Start
Get started with Rex-Omni in just a few lines of code
# Install Rex-Omni
conda create -n rexomni python=3.10
conda activate rexomni
pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu124
git clone https://github.com/IDEA-Research/Rex-Omni.git
cd Rex-Omni
pip install -v -e .
from PIL import Image
from rex_omni import RexOmniWrapper, RexOmniVisualize

# Initialize model
rex = RexOmniWrapper(
    model_path="IDEA-Research/Rex-Omni",
    backend="transformers"
)

# Load image and run detection
image = Image.open("your_image.jpg")
results = rex.inference(
    images=image,
    task="detection",
    categories=["person", "car", "dog"]
)

# Visualize results
vis = RexOmniVisualize(
    image=image,
    predictions=results[0]["extracted_predictions"]
)
vis.save("result.jpg")
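The parameter name images suggests the wrapper may also accept a list of images for batch inference. A minimal sketch under that assumption (verify against the repository before relying on it):

# Assumption: `images` accepts a list as well as a single PIL image, and the
# returned `results` then holds one entry per input image.
images = [Image.open(p) for p in ["img1.jpg", "img2.jpg"]]
results = rex.inference(images=images, task="detection", categories=["person"])
for res in results:
    print(res["extracted_predictions"])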
Open innovation
To enable the research community to build upon this work, we're publicly releasing Rex-Omni with comprehensive tutorials and examples.
Research
Complete model architecture, training details, and evaluation results
Code
Full implementation with easy-to-use Python package and tutorials
Demo
Interactive Gradio demo and comprehensive example scripts
Tutorials
Step-by-step guides for each vision task with Jupyter notebooks