Sound Localization by Self-Supervised Time Delay Estimation
Sounds in the world arrive at one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their direction. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive random walk of Jabri et al. to learn a cycle-consistent representation for binaural matching, resulting in a model that performs on par with supervised methods on "in the wild" internet recordings. We also propose a multimodal contrastive learning model that solves a visually-guided localization task: estimating the time delay for a particular person in a multi-speaker mixture, given a visual representation of their face.
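To make the notion of a time delay between two microphones concrete, here is a minimal sketch of the classical, non-learned way to estimate it: slide one channel against the other and pick the lag with the highest cross-correlation. This is not the self-supervised model described above; it only illustrates the quantity that model predicts. The function name `estimate_delay`, the 16 kHz sample rate, and the synthetic noise signal are all assumptions made for illustration.

```python
import random


def estimate_delay(left, right, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive result means `right` is a delayed copy of `left`;
    a negative result means `left` is the delayed channel.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(left):
            m = n + lag
            if 0 <= m < len(right):
                score += x * right[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic stereo pair (assumed values, not data from the paper):
# the right channel hears the same noise burst 5 samples later.
sample_rate = 16000  # Hz, hypothetical
true_delay = 5       # samples
rng = random.Random(0)
source = [rng.uniform(-1.0, 1.0) for _ in range(512)]
left = source
right = [0.0] * true_delay + source[:-true_delay]

lag = estimate_delay(left, right, max_lag=20)
delay_seconds = lag / sample_rate
print(lag, delay_seconds)
```

Brute-force correlation like this works on clean, single-source signals; it is the kind of hand-crafted matching that tends to struggle with noise, reverberation, and overlapping sources.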
In-the-wild video results
Visually-guided time delay estimation
Binaural car demo
iPhone video demo (Video Credits)
Ziyang Chen, David F. Fouhey, Andrew Owens. Sound Localization by Self-Supervised Time Delay Estimation. arXiv 2022. (arXiv)
Acknowledgements

