Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models
Meta FAIR
*Equal contribution, order determined by coin flip.
Overview
Abstract
A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations.
In this work, we systematically analyze the performance of different RL and control-based methods under datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties—such as data diversity, trajectory quality, and environment variability—affect the performance of these approaches.
Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels at generalization to novel environment layouts, trajectory stitching, and data efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.
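To make the abstract's description concrete, the sketch below shows one way a JEPA-style latent dynamics model can be trained: an encoder maps observations to latent vectors, a predictor rolls the latent forward given an action, and the loss is computed in latent space rather than pixel space. The module shapes, hyperparameters, and anti-collapse mechanism here are illustrative assumptions, not necessarily the paper's exact choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical JEPA-style latent dynamics model; all shapes are illustrative.
class LatentDynamicsModel(nn.Module):
    def __init__(self, obs_channels=3, latent_dim=128, action_dim=2):
        super().__init__()
        # Encoder: top-down observation image -> latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Predictor: (latent state, action) -> next latent state.
        self.predictor = nn.Sequential(
            nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, obs, action):
        z = self.encoder(obs)
        return self.predictor(torch.cat([z, action], dim=-1))

def jepa_loss(model, obs_t, action_t, obs_tp1):
    """Predict the latent of the next observation; no pixel reconstruction."""
    z_pred = model(obs_t, action_t)
    with torch.no_grad():                  # stop-gradient target is one common
        z_target = model.encoder(obs_tp1)  # anti-collapse choice; others exist
    return F.mse_loss(z_pred, z_target)

Multi-step variants roll the predictor forward over whole trajectory segments; the single-step version above just conveys the idea.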
Training and Planning with a Latent Dynamics Model
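The figure accompanying this section is not reproduced here. As a stand-in, the sketch below shows one generic way to plan with a trained latent dynamics model: encode the current observation and the goal, roll candidate action sequences forward in latent space, and keep the sequences whose predicted final latent lands closest to the goal latent. It uses the cross-entropy method purely for illustration and reuses the hypothetical LatentDynamicsModel from the previous sketch; the planner actually used for PLDM may differ.

import torch

def plan_with_cem(model, obs, goal_obs, horizon=16, samples=256,
                  elites=32, iters=5, action_dim=2):
    """Cross-entropy-method planning in latent space (illustrative only)."""
    with torch.no_grad():
        z_goal = model.encoder(goal_obs)               # (1, latent_dim)
        z_init = model.encoder(obs)                    # (1, latent_dim)
        mean = torch.zeros(horizon, action_dim)
        std = torch.ones(horizon, action_dim)

        for _ in range(iters):
            # Candidate action sequences: (samples, horizon, action_dim).
            actions = mean + std * torch.randn(samples, horizon, action_dim)
            z = z_init.expand(samples, -1)
            for t in range(horizon):
                z = model.predictor(torch.cat([z, actions[:, t]], dim=-1))
            # Score each candidate by distance of its final latent to the goal.
            cost = ((z - z_goal) ** 2).sum(dim=-1)
            elite = actions[cost.topk(elites, largest=False).indices]
            mean, std = elite.mean(dim=0), elite.std(dim=0) + 1e-4

    return mean[0]  # first action of the refined plan

In a receding-horizon loop, the agent would execute this first action, observe the new state, and replan at every step (MPC-style).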
Environments & Datasets
The task: reach a specified goal in top-down navigation
We present two top-down navigation environments: Two Rooms and Diverse Mazes. In both, the task is to reach a specified goal state. A typical Two Rooms task is illustrated on the left. We test the algorithms' ability to learn from data of varying quality; an example trajectory from the offline dataset is shown on the right.
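For concreteness, a reward-free trajectory in these datasets can be thought of as a sequence of observations and actions with no reward labels. The minimal container below is a hypothetical illustration; the field names and shapes are assumptions, not the released dataset format.

from dataclasses import dataclass
import numpy as np

@dataclass
class Trajectory:
    """One reward-free trajectory: observations and actions only, no rewards."""
    observations: np.ndarray  # (T, C, H, W) top-down observation images
    actions: np.ndarray       # (T - 1, action_dim) agent movements

# A dataset is simply a collection of such trajectories of varying length and quality.
dataset: list[Trajectory] = []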
Example Trajectories from the Two Rooms Dataset
The following GIFs show example trajectories collected in the dataset. These illustrate the variety of movements and behaviors captured in different data settings.
Good-quality trajectories (length=90)
Random policy trajectories
Trajectories of length 16
Trajectories of length 32
Trajectories of length 64
Diverse Mazes: testing generalization to new environments
Main Results
Generalization to new environments
Success rates under low, medium, and high distribution shift for CRL, GCBC, GCIQL, HILP, HIQL, and PLDM.
Generalizing from suboptimal training data
Left: To test the importance of dataset quality, we mix random-policy trajectories with good-quality trajectories. As the fraction of good-quality data approaches zero, methods begin to fail, with GCIQL, HILP, and PLDM being the most robust (a mixing sketch is given after these panel descriptions).
Center: We measure the methods' performance when trained on trajectories of different lengths. Many goal-conditioned methods fail when training trajectories are short, because far-away goals become out-of-distribution for the resulting policy.
Right: We measure the methods' performance on datasets of varying sizes. PLDM and GCIQL are the most sample-efficient, reaching almost a 50% success rate even with only a few thousand transitions.
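The data-quality experiment in the left panel can be mimicked with a small helper that draws from the two trajectory pools at a chosen ratio. The function below is a hypothetical sketch (it reuses the assumed Trajectory container from earlier and is not the paper's experiment code).

import random

def mix_datasets(good_trajs, random_trajs, good_fraction, total, seed=0):
    """Build a training set with a given fraction of good-quality trajectories."""
    rng = random.Random(seed)
    n_good = int(round(good_fraction * total))
    mixed = rng.sample(good_trajs, n_good) + rng.sample(random_trajs, total - n_good)
    rng.shuffle(mixed)
    return mixed

Sweeping good_fraction from 1.0 down to 0.0 while keeping total fixed corresponds to the sweep shown in the left panel.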
Summary
We thoroughly evaluated six methods for learning from reward-free offline trajectories. The table below summarizes their performance across several key challenges: (i) transfer to new environments, (ii) zero-shot transfer to a new task, (iii) data efficiency, (iv) best-case performance when data is abundant and high-quality, (v) ability to learn from random or suboptimal trajectories, and (vi) ability to stitch suboptimal trajectories to solve long-horizon tasks.
Takeaways
- PLDM is robust to data quality, highly data-efficient, generalizes best to new layouts, and excels at adapting to tasks beyond goal-reaching;
- Learning a well-structured latent space (e.g., with HILP) enables trajectory stitching and robustness to data quality, although it is more data-hungry than other methods;
- Model-free GCRL methods are a great choice when the data is plentiful and of good quality.
BibTeX
@article{sobal2025learning,
  title={Learning from Reward-Free Offline Data: A Case for Planning with Latent Dynamics Models},
  author={Sobal, Vlad and Zhang, Wancong and Cho, Kyunghyun and Balestriero, Randall and Rudner, Tim G. J. and LeCun, Yann},
  journal={arXiv preprint arXiv:2502.14819},
  year={2025},
  archivePrefix={arXiv},
}