HyperE2VID: Improving Event-Based Video Reconstruction via Hypernetworks
Affiliations: 2 HAVELSAN Inc.; 3 ROKETSAN Inc.; 4 Koç University, Computer Engineering Department; 5 Koç University, KUIS AI Center
Since events are generated asynchronously, only when the intensity of a pixel changes, the resulting event voxel grid is a sparse tensor that carries information only from the changing parts of the scene. The sparsity of these voxel grids also varies greatly. This makes it hard for neural networks to adapt to new data and leads to unsatisfactory video reconstructions that contain blur, low contrast, or smearing artifacts. Unlike previous methods, which process this highly varying event data with static networks whose parameters are kept fixed after training, our proposed model, HyperE2VID, employs a dynamic neural network architecture. Specifically, we enhance the main network (a convolutional encoder-decoder architecture similar to E2VID) with dynamic convolutions whose parameters are generated at inference time by hypernetworks.
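To make the idea concrete, below is a minimal sketch of a hypernetwork-generated dynamic convolution in PyTorch. It is illustrative only and not the released HyperE2VID code: the module name, layer sizes, context vector, and the grouped-convolution trick for per-sample kernels are all assumptions made for this example.

```python
# Minimal sketch (not the released HyperE2VID code): a hypernetwork predicts
# the weights of a convolution at inference time, conditioned on the input.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HyperConv2d(nn.Module):
    """A conv layer whose kernel is generated per sample by a small hypernetwork."""

    def __init__(self, in_ch, out_ch, ksize=3, ctx_ch=16):
        super().__init__()
        self.in_ch, self.out_ch, self.ksize = in_ch, out_ch, ksize
        # Hypernetwork: maps a global context vector to a flat weight tensor.
        self.hyper = nn.Sequential(
            nn.Linear(ctx_ch, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, out_ch * in_ch * ksize * ksize),
        )

    def forward(self, x, ctx):
        # ctx: (B, ctx_ch) context vector summarizing the current scene.
        B = x.shape[0]
        w = self.hyper(ctx).view(B * self.out_ch, self.in_ch, self.ksize, self.ksize)
        # Grouped-convolution trick: apply a different kernel to each sample.
        x = x.reshape(1, B * self.in_ch, *x.shape[2:])
        y = F.conv2d(x, w, padding=self.ksize // 2, groups=B)
        return y.reshape(B, self.out_ch, *y.shape[2:])


# Usage: the kernel adapts to each input's context instead of being fixed.
layer = HyperConv2d(in_ch=8, out_ch=8, ksize=3, ctx_ch=16)
events = torch.randn(2, 8, 64, 64)   # e.g. features from an event voxel grid
context = torch.randn(2, 16)         # e.g. pooled context features
out = layer(events, context)         # (2, 8, 64, 64)
```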
Some important aspects of our approach are:
- The dynamically generated parameters are also spatially varying, such that there exists a separate convolutional kernel for each pixel, allowing the network to adapt not only to each input but also to different spatial locations within it. This spatial adaptation enables the network to learn and use different filters for static and dynamic parts of the scene, where events are generated at low and high rates, respectively.
- To avoid the high computational cost of generating per-pixel adaptive filters, we apply two filter decomposition steps while generating the per-pixel dynamic filters. First, we decompose the filters into dynamically generated per-pixel filter atoms. Second, we further decompose each filter atom over pre-fixed multi-scale Fourier-Bessel bases (see the sketch after this list).
- We guide the dynamic filter generation through a context that represents the current scene being observed. This context is obtained by fusing events and previously reconstructed images. These two modalities complement each other since intensity images capture static parts of the scene better, while events excel at dynamic parts. By fusing them, we obtain a context tensor that better represents both static and dynamic parts of the scene.
- We also employ a curriculum learning strategy to train the network more robustly, particularly in the early epochs of training when the reconstructed intensity images are far from optimal.
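The sketch below illustrates the per-pixel filter decomposition and context guidance described in the list above. It is a hedged approximation rather than our actual implementation: a 1x1 convolutional head (an assumed stand-in for the hypernetwork) predicts per-pixel coefficients over a fixed filter basis, random orthogonal filters stand in for the multi-scale Fourier-Bessel bases, and a 1x1 convolution stands in for the event/frame fusion that produces the context tensor.

```python
# Minimal sketch of per-pixel dynamic filtering via filter decomposition
# (illustrative, not the released implementation). Instead of predicting a
# full k*k kernel at every pixel, a small head predicts K coefficients per
# pixel over a fixed bank of basis filters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerPixelDecomposedConv(nn.Module):
    def __init__(self, channels, ctx_ch, num_bases=6, ksize=3):
        super().__init__()
        self.k, self.K, self.C = ksize, num_bases, channels
        # Fixed (non-learned) basis filters; placeholder for Fourier-Bessel bases.
        bases = torch.linalg.qr(torch.randn(ksize * ksize, num_bases))[0]
        self.register_buffer("bases", bases)              # (k*k, K)
        # Hypernetwork head: context -> per-pixel coefficients over the bases.
        self.coeff_head = nn.Conv2d(ctx_ch, num_bases, kernel_size=1)

    def forward(self, feat, ctx):
        B, C, H, W = feat.shape
        coeffs = self.coeff_head(ctx)                      # (B, K, H, W)
        # Per-pixel kernels = linear combination of the fixed bases.
        kernels = torch.einsum("bkhw,pk->bphw", coeffs, self.bases)  # (B, k*k, H, W)
        # Apply the per-pixel kernels via unfold (same kernel for all channels).
        patches = F.unfold(feat, self.k, padding=self.k // 2)        # (B, C*k*k, H*W)
        patches = patches.view(B, C, self.k * self.k, H * W)
        out = (patches * kernels.view(B, 1, self.k * self.k, H * W)).sum(dim=2)
        return out.view(B, C, H, W)


# Context tensor: events capture dynamic regions, the previous reconstruction
# captures static ones; a 1x1 conv fuses them (a simple stand-in for our fusion).
fuse = nn.Conv2d(5 + 1, 16, kernel_size=1)
voxel_grid, prev_frame = torch.randn(1, 5, 64, 64), torch.randn(1, 1, 64, 64)
ctx = fuse(torch.cat([voxel_grid, prev_frame], dim=1))
layer = PerPixelDecomposedConv(channels=8, ctx_ch=16)
out = layer(torch.randn(1, 8, 64, 64), ctx)                # (1, 8, 64, 64)
```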
For more details please see our paper.
To evaluate our method, we use sequences from three real-world datasets: the Event Camera Dataset (ECD), the Multi Vehicle Stereo Event Camera (MVSEC) dataset, and the High-Quality Frames (HQF) dataset. When high-quality, distortion-free ground-truth frames are available, we evaluate the methods using three full-reference metrics: mean squared error (MSE), structural similarity (SSIM), and learned perceptual image patch similarity (LPIPS). To assess image quality under challenging scenarios, such as low light and fast motion, where the ground-truth frames are of low quality, we use a no-reference metric, BRISQUE.
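For reference, these metrics can be computed with common open-source packages. The snippet below is a rough, hedged sketch using scikit-image for MSE/SSIM, the lpips package for LPIPS, and piq for BRISQUE; it only loosely mirrors the evaluation protocol and is not the EVREAL evaluation code.

```python
# Hedged example: computing MSE, SSIM, LPIPS (full-reference) and BRISQUE
# (no-reference) for grayscale reconstructions in [0, 1].
import torch
import lpips
import piq
from skimage.metrics import mean_squared_error, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")   # learned perceptual metric

def full_reference_metrics(pred, gt):
    """pred, gt: grayscale float arrays of shape (H, W) with values in [0, 1]."""
    mse = mean_squared_error(gt, pred)
    ssim = structural_similarity(gt, pred, data_range=1.0)
    # LPIPS expects 3-channel tensors scaled to [-1, 1].
    to_t = lambda x: torch.from_numpy(x).float()[None, None].repeat(1, 3, 1, 1) * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return mse, ssim, lp

def no_reference_metric(pred):
    """BRISQUE for frames without reliable ground truth (lower is better)."""
    x = torch.from_numpy(pred).float()[None, None]   # (1, 1, H, W) in [0, 1]
    return piq.brisque(x, data_range=1.0).item()
```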
Quantitative Results
Qualitative Results
EVREAL Result Analysis Tool
For more results and experimental analyses of HyperE2VID, please see the interactive result analysis tool of EVREAL (Event-based Video Reconstruction Evaluation and Analysis Library).
BibTeX
@article{ercan2024hypere2vid,
  title={{HyperE2VID}: Improving Event-Based Video Reconstruction via Hypernetworks},
  author={Ercan, Burak and Eker, Onur and Saglam, Canberk and Erdem, Aykut and Erdem, Erkut},
  journal={IEEE Transactions on Image Processing},
  year={2024},
  volume={33},
  pages={1826--1837},
  doi={10.1109/TIP.2024.3372460},
  publisher={IEEE}
}