PIB: Prioritized Information Bottleneck Theoretic Framework with Distributed Online Learning for Edge Video Analytics (IEEE ToN'25)
This is the open-source repository of the paper published in IEEE ToN (Paper PDF).
Collaborative perception systems leverage multiple edge devices, such as vision-enabled edge sensors or autonomous cars, to enhance sensing quality and minimize occlusions. Despite their advantages, challenges such as limited channel capacity and data redundancy impede their effectiveness. To address these issues, we introduce the Prioritized Information Bottleneck (PIB) framework for edge video analytics. This framework prioritizes the shared data based on the signal-to-noise ratio (SNR) and camera coverage of the region of interest (RoI), reducing spatial-temporal data redundancy to transmit only essential information. This strategy avoids the need for video reconstruction at edge servers and maintains low latency. It leverages a deterministic information bottleneck method to extract compact, relevant features, balancing informativeness and communication costs. For high-dimensional data, we apply variational approximations for practical optimization. To reduce communication costs in fluctuating connections, we propose a gate mechanism based on distributed online learning (DOL) to filter out less informative messages and efficiently select edge servers. Moreover, we establish the asymptotic optimality of DOL by proving the sublinearity of its regrets. To validate the effectiveness of the PIB framework, we conduct real-world experiments on three types of edge devices with varied computing capabilities. Compared to five coding methods for image and video compression, PIB improves mean object detection accuracy (MODA) by 17.8% while reducing communication costs by 82.65% under poor channel conditions.
To replicate the environment and dependencies used in this project, you will need the following packages:
kornia==0.6.1
matplotlib==3.5.3
numpy==1.21.5
pillow==9.4.0
python==3.7.12
pytorch==1.10.0
torchaudio==0.10.0
torchvision==0.11.0
tqdm==4.66.4
Figure 1: System model.
Our system includes edge cameras positioned across various scenes, each covering a specific field of view. The combined fields of view enhance comprehensive perception of each scene. In high-density pedestrian areas, the goal is to enable collaborative perception for predicting pedestrian occupancy despite limited channel capacity and poor conditions. The framework uses edge servers to receive and process video data from the cameras, which is then analyzed by a cloud server connected via fast wired links. This setup ensures efficient scene understanding and real-time analytics, prioritizing essential data for transmission and processing.
Our experiments employ the Wildtrack dataset from EPFL. This dataset features high-resolution images captured by seven cameras positioned in an urban environment, recording natural pedestrian trajectories [Chavdarova et al., 2018].
We conduct simulations using the following settings:
- Operating Frequency: 2.4 GHz
- Path Loss Exponent: 3.5
- Shadowing Deviation: 8 dB
- Interference Power: Devices emit an interference power of 0.1 Watts.
- Device Density: 10 to 100 devices per 100 square meters, testing various data processing loads.
- Bandwidth: 2 MHz
- Camera Placement: Cameras are located approximately 200 meters from the edge server.
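The settings above describe a standard log-distance path loss channel. As a rough illustration (not code from this repository), the sketch below combines these values to estimate the achievable uplink rate of a camera 200 m from the edge server; the 1 m reference loss, transmit power, noise power, and interferer distances are assumptions.

```python
import numpy as np

# Assumed channel sketch: log-distance path loss with log-normal shadowing and
# an interference-limited SINR over the 2 MHz band. Transmit power, noise power,
# and interferer distances are illustrative assumptions.

PATH_LOSS_EXPONENT = 3.5     # from the simulation settings above
SHADOWING_STD_DB = 8.0       # shadowing deviation (dB)
BANDWIDTH_HZ = 2e6           # bandwidth (Hz)
INTERFERENCE_TX_W = 0.1      # per-device interference power (W)
NOISE_POWER_W = 1e-13        # assumed thermal noise power (W)


def received_power_w(tx_power_w, distance_m, rng):
    """Received power under log-distance path loss with log-normal shadowing."""
    path_loss_db = 10 * PATH_LOSS_EXPONENT * np.log10(distance_m)
    shadowing_db = rng.normal(0.0, SHADOWING_STD_DB)
    return tx_power_w * 10 ** (-(path_loss_db + shadowing_db) / 10)


def achievable_rate_bps(tx_power_w, distance_m, n_interferers, rng):
    """Shannon rate given aggregate interference from co-channel devices."""
    signal = received_power_w(tx_power_w, distance_m, rng)
    interference = sum(
        received_power_w(INTERFERENCE_TX_W, rng.uniform(10, 100), rng)
        for _ in range(n_interferers)
    )
    sinr = signal / (interference + NOISE_POWER_W)
    return BANDWIDTH_HZ * np.log2(1 + sinr)


rng = np.random.default_rng(0)
rate = achievable_rate_bps(0.5, 200.0, n_interferers=10, rng=rng)
print(f"Achievable uplink rate at 200 m: {rate / 1e3:.1f} kbps")
```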
To evaluate the performance of our PIB framework, we compare it against five baselines, including both video coding and image coding approaches:
- **TOCOM-TEM**: A task-oriented communication framework that utilizes a temporal entropy model for edge video analytics. It applies the deterministic Information Bottleneck (IB) principle to extract and transmit compact, task-relevant features, integrating spatial-temporal data on the server for enhanced inference accuracy.
- **JPEG**: A widely used image compression standard that employs lossy compression algorithms to reduce image data size. JPEG is commonly used to decrease communication loads in networked camera systems.
- **H.265 (HEVC)**: Also known as High Efficiency Video Coding, H.265 offers up to 50% better data compression than its predecessor H.264 while maintaining the same video quality. It is crucial for efficient data transmission in high-density camera networks.
- **H.264 (AVC)**: Known as Advanced Video Coding, H.264 significantly enhances video compression efficiency, allowing high-quality video transmission at lower bit rates.
- **AV1**: AOMedia Video 1 (AV1) is an open, royalty-free video coding format developed by the Alliance for Open Media (AOMedia). It outperforms existing codecs like H.264 and H.265, making it ideal for online video applications with improved compression efficiency.
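For context, the image-coding baselines trade accuracy against transmitted bytes through their quality settings. The snippet below is a minimal, hypothetical example (using Pillow from the dependency list, not code from this repository) of how one could measure the per-frame communication cost of the JPEG baseline at different quality levels; the frame path is a placeholder.

```python
import io
from PIL import Image


def jpeg_size_kb(image_path: str, quality: int) -> float:
    """Encode an image as JPEG at the given quality and return its size in KB."""
    buffer = io.BytesIO()
    Image.open(image_path).convert("RGB").save(buffer, format="JPEG", quality=quality)
    return buffer.getbuffer().nbytes / 1024


# Placeholder frame path; Wildtrack frames are 1920x1080 RGB images.
frame_path = "/data/Wildtrack/example_frame.png"
for q in (10, 30, 50, 80):
    print(f"quality={q}: {jpeg_size_kb(frame_path, q):.1f} KB")
```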
Figure 2 shows how communication bottlenecks and delayed cameras affect perception accuracy:
Figure 2: Impact of communication bottlenecks and delayed cameras on perception accuracy.
Figure 3 illustrates the trade-off between communication bottlenecks and latency in our system:
Figure 3: Communication bottleneck vs latency.
As shown in Figure 4, our experimental setup features a practical hardware testbed that includes three distinct edge devices: NVIDIA Jetson™ Orin Nano™ 4GB, NVIDIA Jetson™ Orin NX™ 16GB, and ThinkStation™ P360. The edge devices collaboratively interact with edge servers equipped with RTX 5000 Ada GPUs for efficient video decoding.
Figure 4: Edge device configuration.
The Jetson™ Orin NX™ 16GB and Jetson™ Orin Nano™ devices are configured with a PyTorch deep learning environment. Because the Jetson platform differs from x86 architectures, setting up the environment requires following the official NVIDIA installation guide for PyTorch on Jetson. For detailed instructions, refer to the official PyTorch installation guide for Jetson or this helpful tutorial.
The encoding latency results of our PIB in different edge devices are presented in Table 1. It can be observed that the feature map generation phase dominates the overall encoding latency, while the entropy coding phase contributes a negligible amount of time. Furthermore, edge devices with higher computing capacity exhibit significantly lower encoding latency.
Table 1: Encoder Latency Across Different Platforms
| Phase | Nano (ms) | Orin NX (ms) | P360 (ms) |
|---|---|---|---|
| Feature map generation | 755.32±69.32 | 227.54±2.65 | 37.49±0.90 |
| Entropy coding | 10.83±3.51 | 1.79±0.75 | 0.40±0.11 |
| Total encoder latency | 766.15±70.55 | 229.34±2.67 | 37.80±0.94 |
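Latencies like those in Table 1 can be reproduced with standard wall-clock timing around each encoder stage. The sketch below shows one way to do this; the encoder module and input shape are placeholders, not the repository's exact API.

```python
import time
import torch


@torch.no_grad()
def time_stage(fn, *args, warmup: int = 5, repeats: int = 20):
    """Return mean and std of wall-clock latency (ms) for a callable."""
    for _ in range(warmup):                      # warm up caches / CUDA kernels
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # flush queued GPU work before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        samples.append((time.perf_counter() - start) * 1e3)
    samples = torch.tensor(samples)
    return samples.mean().item(), samples.std().item()


# Placeholder encoder: replace with the PIB feature extractor and entropy coder.
encoder = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).eval()
frame = torch.rand(1, 3, 1080, 1920)             # one Wildtrack-sized RGB frame
mean_ms, std_ms = time_stage(encoder, frame)
print(f"Feature map generation: {mean_ms:.2f}±{std_ms:.2f} ms")
```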
- Create and activate the Conda environment:
conda create -n PIB_env python=3.7.12
conda activate PIB_env
- Install the required packages:
pip install kornia==0.6.1 matplotlib==3.5.3 numpy==1.21.5 pillow==9.4.0
pip install torch==1.10.0 torchaudio==0.10.0 torchvision==0.11.0 tqdm==4.66.4
The training process consists of two main stages: feature extraction and coding/inference.
Run feature extraction using `main_feature_extraction.py`. The script supports the following parameters:
python main_feature_extraction.py \
--dataset_path "/path/to/your/dataset" \
--epochs 30 \
--beta 1e-5 \
--target_rate 80 \
--delays "X1 X2 X3 X4 X5 X6 X7" # Xi represents frame delay for i-th camera, calculated based on channel conditions
Key parameters:
- `--dataset_path`: Path to your dataset directory
- `--epochs`: Number of training epochs (default: 30)
- `--beta`: Information bottleneck trade-off parameter (default: 1e-5)
- `--target_rate`: Constraint on the communication cost (KB)
- `--delays`: Frame delays for each camera (space-separated values). Each value is the number of frames delayed for that camera, calculated based on network conditions in `utils/channel.py`
After feature extraction, run the coding and inference stage using `main_coding_and_inference.py`:
python main_coding_and_inference.py \
--dataset_path "/path/to/your/dataset" \
--model_path "/path/to/trained/model/MultiviewDetector.pth" \
--epochs 10 \
--delays "X1 X2 X3 X4 X5 X6 X7" # Xi represents frame delay for i-th camera, calculated based on channel conditions
Key parameters:
- `--dataset_path`: Path to your dataset directory
- `--model_path`: Path to the trained model from Stage 1
- `--epochs`: Number of inference epochs (default: 10)
- `--delays`: Frame delays for each camera (space-separated values). Each value is the number of frames delayed for that camera, calculated based on network conditions in `utils/channel.py`
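The `--delays` values are described as being derived from channel conditions in `utils/channel.py`. As a rough, hypothetical illustration of how such a mapping could work (the feature size, link rate, and frame rate below are assumptions), the sketch converts an achievable uplink rate into a whole number of delayed frames:

```python
import math


def frames_of_delay(feature_size_kb: float, rate_bps: float, fps: float = 2.0) -> int:
    """Hypothetical mapping from channel rate to per-camera frame delay.

    The transmission time of one coded feature map is compared against the
    frame interval; slower links therefore lag by more whole frames.
    """
    tx_time_s = feature_size_kb * 8 * 1024 / rate_bps   # time to upload one feature map
    return math.ceil(tx_time_s * fps)                    # delay expressed in frames


# Example: an 80 KB feature map over an assumed 500 kbps link at 2 fps.
print(frames_of_delay(80.0, 500e3))   # -> 3 frames of delay
```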
- First, run feature extraction:
CUDA_VISIBLE_DEVICES=0,1 python main_feature_extraction.py \
--dataset_path "/data/Wildtrack" \
--epochs 30 \
--beta 1e-5 \
--target_rate 80
- Then, run coding and inference using the trained model:
CUDA_VISIBLE_DEVICES=0,1 python main_coding_and_inference.py \
--dataset_path "/data/Wildtrack" \
--model_path "logs_feature_extraction/YYYY-MM-DD_HH-MM-SS/MultiviewDetector.pth" \
--epochs 10
Note: Replace the model path with your actual trained model path, which will be in the logs directory with a timestamp.
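If you are unsure which timestamped directory holds the latest checkpoint, a small helper like the one below (a convenience sketch, not part of the repository) can locate the most recent `MultiviewDetector.pth` under `logs_feature_extraction`:

```python
from pathlib import Path


def latest_checkpoint(log_root: str = "logs_feature_extraction") -> Path:
    """Return the most recently modified MultiviewDetector.pth under the log root."""
    checkpoints = sorted(Path(log_root).glob("*/MultiviewDetector.pth"),
                         key=lambda p: p.stat().st_mtime)
    if not checkpoints:
        raise FileNotFoundError(f"No MultiviewDetector.pth found under {log_root}")
    return checkpoints[-1]


print(latest_checkpoint())  # pass this path to --model_path
```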
The following video demonstrates the perception results from a single camera (the 4th edge camera). Note the limited field of view and the instances where objects remain undetected (highlighted regions).
single-4.mp4
The next video shows the improved perception coverage when the 4th and 7th edge cameras collaborate. While collaboration enhances the coverage, there are still some occluded regions compared to the results from seven edge cameras.
double.mp4
Finally, we utilize all seven edge cameras cooperating with each other to improve perception coverage. Although the streaming data rate grows rapidly, this configuration provides the best coverage among the combinations above.
7-camera_Compression.mp4
If you find this code useful for your research, please cite our papers:
@article{fang2025ton,
title={Prioritized Information Bottleneck Theoretic Framework with Distributed Online Learning for Edge Video Analytics},
author={Fang, Z. and Hu, S. and Wang, J. and Deng, Y. and Chen, X. and Fang, Y.},
journal={IEEE/ACM Transactions on Networking},
year={2025},
month={Jan.},
note={DOI: 10.1109/TON.2025.3526148},
publisher={IEEE}
}
@inproceedings{fang2024pib,
author = {Z. Fang and S. Hu and L. Yang and Y. Deng and X. Chen and Y. Fang},
title = {{PIB: P}rioritized Information Bottleneck Framework for Collaborative Edge Video Analytics},
booktitle = {IEEE Global Communications Conference (GLOBECOM)},
year = {2024},
month = {Dec.},
pages = {1--6},
address = {Cape Town, South Africa}
}
We gratefully acknowledge the contributions of the following projects:
- MVDet for their invaluable tools and insights into multi-view detection.
- TOCOM-TEM for providing a task-oriented communication framework for edge video analytics.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.