We also provide steps for running the encoder in `model/test_hub.py`:

```bash
python3 model/test_hub.py
```
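If you would rather load the encoder from your own script, a minimal sketch along these lines should work, assuming the repo exposes a standard `torch.hub` entrypoint (the file name `test_hub.py` suggests it does). The repo path and entrypoint name below are hypothetical placeholders, not the project's actual identifiers:

```python
import torch

# Hypothetical repo path and entrypoint name; substitute the real ones
# from model/test_hub.py.
encoder = torch.hub.load("user/repo", "encoder", pretrained=True)
encoder.eval()

# Run a dummy input through the encoder to check shapes.
with torch.no_grad():
    features = encoder(torch.randn(1, 3, 224, 224))
print(features.shape)
```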
Model Lessons
Below are some lessons we learned from training these kinds of models.
Even when the training recipe is tuned for each architecture, some architectures are simply better. We found that Hiera, a compact hierarchical ViT, was far better than the other models we tried. Thanks to SAM2 for the inspiration!
Pretraining gets you very far with ViTs; without it, they look overrated next to the tried-and-true ResNets of the world. Attention is expensive (O(n²) in the number of tokens), which is why ViTs operate on 16×16 patches rather than raw pixels. There are attention optimizations (FlashAttention, deformable attention), but we didn't get to them.
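As a back-of-the-envelope illustration (not from this repo) of why patching matters, compare the number of query-key pairs per attention layer when tokens are pixels versus 16×16 patches:

```python
def attention_pairs(image_size: int, patch_size: int) -> int:
    """Query-key pairs in one self-attention layer: n_tokens squared."""
    n_tokens = (image_size // patch_size) ** 2
    return n_tokens ** 2

# Pixels as tokens: 224*224 = 50,176 tokens -> ~2.5e9 pairs per layer.
print(f"pixels as tokens: {attention_pairs(224, 1):,}")
# 16x16 patches: 14*14 = 196 tokens -> ~38k pairs per layer.
print(f"16x16 patches:    {attention_pairs(224, 16):,}")
```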
Don't get fancy with optimizers and learning rates. If you find yourself making tiny learning-rate adjustments to get your dense prediction model to work, look at your dataset, architecture, and similar fundamentals instead.
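For concreteness, a "don't get fancy" baseline might look like the sketch below: plain AdamW with a single cosine schedule. The hyperparameters are generic placeholders, not the values used in this project:

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)  # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
```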
Sparse prediction in vision is much harder to get working than dense prediction. No one seems to have "won" sparse prediction yet, but dense prediction scales nicely with (1) a simple architecture with a simple loss function (L1), (2) good, curated data, and (3) a massive amount of that data.
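A minimal sketch of that "simple architecture + simple L1 loss" recipe for dense prediction, assuming a per-pixel regression target (e.g., a stress or deformation map); the architecture here is a toy stand-in, not this repo's model:

```python
import torch
import torch.nn as nn

# Toy encoder-decoder stand-in: any model mapping an image to a
# same-resolution dense map fits this recipe.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),  # 3 output channels per pixel
)

images = torch.randn(8, 3, 64, 64)   # dummy batch
targets = torch.randn(8, 3, 64, 64)  # dense per-pixel targets

pred = model(images)
loss = nn.functional.l1_loss(pred, targets)  # the simple L1 loss
loss.backward()
```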
High-quality, diverse data has a bigger effect than you would expect for this kind of training. Models like Pi3 easily clear VGGT, probably because they went all-in on dynamic data.