LangSurf: Language-Embedded Surface Gaussians for 3D Scene Understanding

arXiv 2024

Hao Li¹,*  Roy Qin²,*,†  Zhengyu Zou¹,*  Diqi He¹
Bohan Li³  Bingquan Dai²  Dingwen Zhang¹,†  Junwei Han¹

¹Northwestern Polytechnical University ²Tsinghua University ³Shanghai Jiaotong University


We propose LangSurf, a model that aligns language features with object surfaces to enhance 3D scene understanding.

Brief introduction

Abstract. Applying Gaussian Splatting to perception tasks for 3D scene understanding is becoming increasingly popular. Most existing works focus on rendering 2D feature maps from novel viewpoints, which yields an imprecise 3D language field with outlier language features and ultimately fails to align objects in 3D space. Because these approaches extract features from masked images, they also lack essential contextual information, leading to inaccurate feature representations. To address these issues, we propose LangSurf, which accurately aligns the 3D language field with object surfaces, enabling precise 2D and 3D segmentation from text queries and broadening the range of downstream tasks such as removal and editing. The core of LangSurf is a joint training strategy that uses geometry supervision and contrastive losses to flatten the language Gaussians onto object surfaces and assign accurate language features to the Gaussians of objects. In addition, we introduce the Hierarchical-Context Awareness Module, which extracts features at the image level to capture contextual information and then performs hierarchical mask pooling, using masks segmented by SAM, to obtain fine-grained language features at different hierarchies. Extensive experiments on open-vocabulary 2D and 3D semantic segmentation demonstrate that LangSurf outperforms the previous SOTA method by a large margin. Because our method can segment objects directly in 3D space, it is also more effective for instance recognition, removal, and editing, as comprehensive experiments confirm.
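To make the mask pooling step mentioned in the abstract more concrete, here is a minimal NumPy sketch. It is not the LangSurf implementation: the function names, the level names ("coarse", "fine"), the mean-pooling choice, and the L2 normalization are all assumptions made for illustration. The feature map stands in for image-level features and the boolean masks stand in for SAM output.

```python
import numpy as np


def mask_pool(feature_map, mask):
    """Average the features of all pixels inside one boolean mask.

    feature_map: (H, W, D) image-level features (hypothetical input).
    mask:        (H, W) boolean array, True inside the segment.
    Returns a (D,) vector, L2-normalized (an assumption, not from the paper).
    """
    if not mask.any():
        raise ValueError("mask selects no pixels")
    pooled = feature_map[mask].mean(axis=0)  # (N, D) -> (D,)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


def hierarchical_mask_pool(feature_map, masks_by_level):
    """Pool one feature per mask, separately for each hierarchy level.

    masks_by_level: dict mapping a level name to a non-empty list of masks.
    Returns a dict mapping each level name to a (num_masks, D) array.
    """
    return {
        level: np.stack([mask_pool(feature_map, m) for m in masks])
        for level, masks in masks_by_level.items()
    }


# Toy data: a 4x6 image with 8-dimensional features and two masks.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 6, 8))

coarse_mask = np.zeros((4, 6), dtype=bool)
coarse_mask[:, :3] = True   # left half of the image
fine_mask = np.zeros((4, 6), dtype=bool)
fine_mask[:2, :2] = True    # small region inside the left half

pooled = hierarchical_mask_pool(
    feature_map, {"coarse": [coarse_mask], "fine": [fine_mask]}
)
print({level: feats.shape for level, feats in pooled.items()})
```

Because the features are computed on the whole image before pooling, each pooled vector can carry context from outside its mask, which is the property the abstract contrasts with extracting features from masked images.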

Overview of the proposed LangSurf. Given input views, we reconstruct a language-embedded surface field to enable 2D/3D open-vocabulary segmentation as well as downstream tasks. Our pipeline contains two main steps: 1) the Hierarchical-Context Awareness Module extracts context-aware features at multiple hierarchies; 2) Language-Embedded Training uses a joint training strategy to construct the language-embedded surface field.
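Once each Gaussian carries a language feature, a 3D open-vocabulary query could look like the sketch below. The page does not say how queries are scored, so the cosine similarity, the fixed threshold, and the name `query_gaussians` are assumptions; the text embedding is assumed to come from a text encoder that is not shown here.

```python
import numpy as np


def query_gaussians(gaussian_features, text_embedding, threshold=0.5):
    """Select Gaussians whose language feature matches a text embedding.

    gaussian_features: (N, D) array, one language feature per Gaussian.
    text_embedding:    (D,) embedding of the text query (hypothetical input).
    threshold:         cosine-similarity cutoff (an illustrative value).
    Returns (selected, similarity): an (N,) boolean mask and (N,) scores.
    """
    g = gaussian_features / np.linalg.norm(gaussian_features, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    similarity = g @ t  # cosine similarity per Gaussian
    return similarity > threshold, similarity


# Toy scene: three Gaussians with 3-dimensional features.
features = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
text = np.array([1.0, 0.0, 0.0])

selected, similarity = query_gaussians(features, text, threshold=0.8)
print(selected)             # which Gaussians match the query
kept = features[~selected]  # "removal": keep only the non-matching Gaussians
print(kept.shape)
```

In this sketch, removal amounts to dropping the selected Gaussians before rendering, and editing would modify only the selected ones. A selection like this is only clean if the language features sit on the right Gaussians, which is what aligning them with object surfaces is meant to achieve.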