
Demo


Dust the Blinds:

Clean the Table:

Semantic segmentation

Training Pipeline

Encoding the scene photometry and geometry

VL-Fields jointly encodes the geometry and appearance of a scene, along with visual-language features. This allows us to rely solely on the neural field for re-rendering the input video, without needing a stored point cloud (as in CLIP-Fields).
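The joint encoding described above can be sketched as a single field that maps a 3D point to density, colour, and a vision-language feature from one shared trunk. This is only an illustrative sketch with hypothetical layer sizes and a random, untrained MLP; the actual VL-Fields architecture may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 64-unit trunk and a 512-d CLIP-style feature.
HIDDEN, FEAT_DIM = 64, 512

W1 = rng.normal(0, 0.1, (3, HIDDEN))            # shared trunk weights
W_sigma = rng.normal(0, 0.1, (HIDDEN, 1))       # density head
W_rgb = rng.normal(0, 0.1, (HIDDEN, 3))         # colour head
W_feat = rng.normal(0, 0.1, (HIDDEN, FEAT_DIM)) # language-feature head

def query_field(xyz):
    """Map 3D points to (density, RGB, vision-language feature)."""
    h = np.tanh(xyz @ W1)                  # shared trunk activation
    sigma = np.log1p(np.exp(h @ W_sigma))  # softplus -> non-negative density
    rgb = 1 / (1 + np.exp(-(h @ W_rgb)))   # sigmoid -> colour in [0, 1]
    feat = h @ W_feat                      # unnormalised VL feature
    return sigma, rgb, feat

pts = rng.uniform(-1, 1, (4, 3))           # batch of 4 sample points
sigma, rgb, feat = query_field(pts)
print(sigma.shape, rgb.shape, feat.shape)  # (4, 1) (4, 3) (4, 512)
```

Because all three outputs come from the same trunk, one query suffices for rendering and for language grounding, which is what removes the need for a separate stored point cloud.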


Related Work

There is a wealth of excellent work on grounding language in neural implicit representations.

DFF introduced the idea of distilling knowledge from large vision-language models in order to ground language in neural fields.
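The core of this distillation idea is a per-pixel loss that pulls features rendered from the field toward "teacher" features produced by a frozen vision-language model. A minimal sketch, assuming an L2 loss on normalised features (the concrete loss used by DFF or VL-Fields may differ):

```python
import numpy as np

def distillation_loss(rendered_feat, teacher_feat):
    """Mean L2 distance between per-pixel features rendered from the
    field and teacher features from a frozen vision-language model."""
    # Normalise both sides so the loss compares feature directions,
    # since CLIP-style embeddings are matched by cosine similarity.
    r = rendered_feat / np.linalg.norm(rendered_feat, axis=-1, keepdims=True)
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=-1, keepdims=True)
    return float(np.mean(np.sum((r - t) ** 2, axis=-1)))

rng = np.random.default_rng(1)
f = rng.normal(size=(8, 512))                  # 8 pixels, 512-d features
loss_same = distillation_loss(f, f)            # identical features -> 0.0
loss_diff = distillation_loss(f, rng.normal(size=(8, 512)))
print(loss_same, loss_diff > 0)
```

Minimising this loss over many views teaches the field to emit features that agree with the teacher, so text queries embedded by the same model can then be matched against the field directly.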

CLIP-Fields demonstrated how such models can be used in mobile robotics, enabling robots to be commanded with natural-language queries.

More recently, LERF addressed the limitations of relying on fine-tuned VL models (e.g., LSeg) by extracting vision-language features directly from CLIP.

BibTeX

@article{tsagkas2023vlfields,
  title   =  {VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations},
  author  =  {Tsagkas, Nikolaos and Mac Aodha, Oisin and Lu, Chris Xiaoxuan},
  journal =  {arXiv preprint arXiv:2305.12427},
  year    =  {2023}
}

This website is a modified version of nerfies.