TDMI CVPR 2023
Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video
Jilin University, University of Birmingham
* Corresponding Author
CVPR 2023
Motivation
Directly leveraging optical flow can be distracted by irrelevant cues such as the background and
blur (a), and it sometimes fails in scenarios with fast motion or mutual occlusion (b). Our proposed
framework instead performs temporal difference encoding and useful-information disentanglement to
capture more tailored temporal dynamics (c), yielding more robust pose estimates (d).
Abstract
Temporal modeling is crucial for multi-frame human pose estimation.
Most existing methods directly employ optical flow or deformable convolution to predict
full-spectrum motion fields, which may introduce numerous irrelevant cues such as a nearby person
or the background. Without further effort to excavate meaningful motion priors, their results are
sub-optimal, especially in scenes with complicated spatiotemporal interactions. On the other hand,
temporal differences can encode representative motion information that is potentially valuable for
pose estimation but has not been fully exploited. In this paper, we present a novel multi-frame
human pose estimation framework that employs temporal differences across frames to model dynamic
contexts and leverages a mutual information objective to facilitate the disentanglement of useful
motion information. Specifically, we design a multi-stage
Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage
feature difference sequences to derive an informative motion representation. We further propose a
Representation Disentanglement module from the mutual information perspective, which captures
discriminative, task-relevant motion signals by explicitly defining the useful and noisy constituents
of the raw motion features and minimizing their mutual information. With these designs, our
approach ranks No. 1 in the Crowd Pose Estimation in Complex Events challenge on the HiEve
benchmark and achieves state-of-the-art performance on three benchmarks: PoseTrack2017,
PoseTrack2018, and PoseTrack21.
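To make the disentanglement objective above concrete, one schematic way to write it (the notation
is ours and only a sketch; the paper's exact loss and mutual-information estimator may differ) is

$$\min_{\theta}\; \mathcal{L}_{\mathrm{pose}}(F_u) \;+\; \lambda\, I(F_u;\, F_n),$$

where $F_u$ and $F_n$ denote the useful and noisy constituents of the raw motion feature,
$\mathcal{L}_{\mathrm{pose}}$ is the standard heatmap loss computed from $F_u$, $I(\cdot;\cdot)$ is
mutual information, and $\lambda$ is a trade-off weight: $F_u$ must support pose estimation while
sharing as little information as possible with $F_n$.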
Framework
Overall pipeline of the proposed framework. The goal is to estimate the human pose in the key
frame. Given an input sequence, we first extract the visual features of its frames. Our multi-stage
Temporal Difference Encoder takes these features as input and outputs a motion feature, which is
then passed to the Representation Disentanglement module to isolate its useful components. Finally,
the useful motion feature and the visual feature of the key frame are combined to obtain the final
pose estimate.
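For a code-level picture of this pipeline, a minimal PyTorch-style sketch is given below. The
module names (backbone, tde, disentangler, pose_head) are hypothetical placeholders for the
components described above, not identifiers from the released TDMI code, and the multi-stage
cascaded difference encoding is collapsed into a single feature-difference step for brevity.

import torch

def estimate_pose(frames, backbone, tde, disentangler, pose_head):
    """Sketch of the pipeline for one key frame.

    frames: tensor of shape (T, C, H, W); the key frame is frames[T // 2].
    backbone, tde, disentangler, pose_head: hypothetical stand-ins for the
    visual backbone, Temporal Difference Encoder, Representation
    Disentanglement module, and pose head described on this page.
    """
    # 1. Extract per-frame visual features.
    feats = [backbone(f.unsqueeze(0)) for f in frames]      # T tensors of shape (1, D, h, w)

    # 2. Build feature differences between neighboring frames
    #    (simplified to one stage here) and encode them into a motion feature.
    diffs = [feats[t + 1] - feats[t] for t in range(len(feats) - 1)]
    motion = tde(diffs)                                      # (1, D, h, w)

    # 3. Disentangle the motion feature; keep only the useful constituent.
    useful_motion, _noisy = disentangler(motion)

    # 4. Fuse the key-frame visual feature with the useful motion feature
    #    and regress the pose heatmaps.
    key_feat = feats[len(feats) // 2]
    heatmaps = pose_head(torch.cat([key_feat, useful_motion], dim=1))
    return heatmaps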
Qualitative Results
Visual results of our TDMI framework on the benchmark datasets, including challenging scenes with
fast motion and pose occlusion.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under
grant No. 62203184. This work is also supported in part by the MSIT, Korea, under the
ITRC program (IITP-2022-2020-0-01789) (50%) and the High-Potential Individuals Global Training Program
(RS2022-00155054) (50%) supervised by the IITP.