I am currently a Senior Manager of R&D at SenseTime, working closely with Ziwei Liu and Chen Qian. Before that, from 2019 to 2021, I completed a two-year joint post-doc between SenseTime and CASIA, supervised by Prof. Liang Wang.
I received my Ph.D. degree in 2018 from the School of Information and Communication Engineering at BUPT, advised by Prof. Honggang Zhang and Prof. Chun-Guang Li. I was also honored to be jointly supervised by Prof. Gang Wang while visiting the NTU ROSE-Lab in Singapore from 2016 to 2017. In 2012, I received my bachelor’s degree in Telecommunications from Jilin University.
My research interests include fundamental algorithms and applied technologies in computer vision, computer graphics, and machine learning. Leveraging these tools, we have delivered several industrial AI products focused on Human Perception and Editing. Moving forward, amid the trend toward large models, I will continue to explore more universal multi-modal perception, generation, and interaction technologies.
Object part parsing involves segmenting objects into semantic parts and has drawn great attention recently. Current methods, however, ignore the specific hierarchical structure of the object, which can serve as strong prior knowledge. To address this, we propose the Hierarchical Dual Transformer (HDTR) to exploit the typical structural priors of object parts. HDTR first generates pyramid multi-granularity pixel representations under the supervision of object part parsing maps at different semantic levels and then assigns each region an initial part embedding. Moreover, HDTR generates an edge pixel representation to extend the network's capability to capture detailed information. Afterward, we design a Hierarchical Part Transformer to upgrade the part embeddings to their hierarchical counterparts with the assistance of the multi-granularity pixel representations. Next, we propose a Hierarchical Pixel Transformer to infer hierarchical information from the part embeddings and enrich the pixel representations. Note that both transformer decoders rely on the structural relations between object parts, i.e., dependency, composition, and decomposition relations. Experiments on five large-scale datasets, i.e., LaPa, CelebAMask-HQ, CIHP, LIP, and Pascal Animal, demonstrate that our method sets new state-of-the-art performance for object part parsing.
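The dual decoding idea can be illustrated with a minimal PyTorch-style sketch: learned part embeddings first attend to the pixel representation (part transformer), and the pixel representation then attends back to the updated part embeddings (pixel transformer). Module names, dimensions, and the single-level setup below are hypothetical; the sketch omits the hierarchical relations, multi-granularity pyramid, and edge branch described above.

```python
import torch
import torch.nn as nn

class DualCrossAttentionSketch(nn.Module):
    """Illustrative sketch of a dual part/pixel update, not the paper's implementation.
    Part embeddings gather evidence from pixel features, then pixel features are
    enriched with part-level context."""
    def __init__(self, dim=256, num_parts=20, num_heads=8):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))  # initial part embeddings
        self.part_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pixel_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, pixel_feats):
        # pixel_feats: (B, HW, dim), a flattened pixel representation
        B = pixel_feats.size(0)
        parts = self.part_queries.unsqueeze(0).expand(B, -1, -1)
        # "Part transformer" step: part embeddings attend to pixel features
        parts, _ = self.part_attn(query=parts, key=pixel_feats, value=pixel_feats)
        # "Pixel transformer" step: pixel features attend back to the updated parts
        pixels, _ = self.pixel_attn(query=pixel_feats, key=parts, value=parts)
        return parts, pixels
```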
Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow
Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both the person and the clothes. Many existing methods rely on Generative Adversarial Networks (GANs) to achieve this, but flaws can still occur, particularly at high resolutions. Recently, the diffusion model has emerged as a promising alternative for generating high-quality images in various applications. However, simply using clothes as a condition to guide the diffusion model's inpainting is insufficient to maintain the details of the clothes. To overcome this challenge, we propose an exemplar-based inpainting approach that leverages a warping module to effectively guide the diffusion model's generation. The warping module performs initial processing on the clothes, which helps preserve their local details. We then combine the warped clothes with the clothes-agnostic person image and add noise to form the input of the diffusion model. Additionally, the warped clothes are used as a local condition at each denoising step to ensure that the resulting output retains as much detail as possible. Our approach, named Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model, and the incorporation of the warping module helps produce high-quality and realistic virtual try-on results. Experimental results on VITON-HD demonstrate the effectiveness and superiority of our method. Source code and trained models will be publicly released at: https://github.com/bcmi/DCI-VTON-Virtual-Try-On.
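The conditioning step described above can be sketched as follows: the warped clothes are pasted into the clothes-agnostic region, noise is added for the sampled timestep, and the warped clothes are kept as a local condition. Function and argument names are hypothetical, and a diffusers-style noise scheduler with an `add_noise` method is assumed; this is not the released DCI-VTON code.

```python
import torch

def prepare_diffusion_input(person_agnostic, warped_clothes, clothes_mask, timestep, scheduler):
    """Hypothetical sketch of the input preparation described in the abstract.
    person_agnostic, warped_clothes: (B, 3, H, W); clothes_mask: (B, 1, H, W) in {0, 1}."""
    # Coarse composition: paste the warped clothes onto the clothes-agnostic person image
    coarse = person_agnostic * (1 - clothes_mask) + warped_clothes * clothes_mask
    # Forward diffusion: add noise at the sampled timestep (diffusers-style scheduler assumed)
    noise = torch.randn_like(coarse)
    noisy_input = scheduler.add_noise(coarse, noise, timestep)
    # The warped clothes also serve as a local condition at every denoising step
    local_condition = warped_clothes
    return noisy_input, local_condition, noise
```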
Amodal Instance Segmentation via Prior-Guided Expansion
Junjie Chen, Li Niu, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang
Currently, video semantic segmentation faces two main challenges: 1) the demand for temporal consistency; 2) the balance between segmentation accuracy and inference efficiency. For the first challenge, existing methods usually use optical flow to capture the temporal relation across consecutive frames and maintain temporal consistency, but the low inference speed incurred by optical flow limits real-time applications. For the second challenge, flow-based key-frame warping is a mainstream solution; however, its unbalanced inference latency makes it unsatisfactory for real-time applications. Considering both segmentation accuracy and inference efficiency, we propose a novel Sparse Temporal Transformer (STT) to adaptively bridge temporal relations among video frames, equipped with query selection and key selection. The key selection and query selection strategies are applied separately to filter out temporal and spatial redundancy in our temporal transformer. As a result, STT reduces the time complexity of the temporal transformer by a large margin without harming segmentation accuracy or temporal consistency. Experiments on two benchmark datasets, Cityscapes and CamVid, demonstrate that our method achieves state-of-the-art segmentation accuracy and temporal consistency with comparable inference speed.
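The sparsification idea can be sketched in a few lines: attend from the current frame to a previous frame, but only over a selected subset of queries (spatial positions) and keys (temporal positions). The top-k selection by feature norm below is an illustrative stand-in for the paper's selection strategies, and all names and sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SparseTemporalAttentionSketch(nn.Module):
    """Illustrative sketch of sparse temporal attention with query/key selection."""
    def __init__(self, dim=128, num_heads=4, topk_q=1024, topk_k=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.topk_q, self.topk_k = topk_q, topk_k

    def forward(self, cur_feats, prev_feats):
        # cur_feats, prev_feats: (B, HW, dim) flattened per-frame features; HW must exceed top-k sizes
        q_idx = cur_feats.norm(dim=-1).topk(self.topk_q, dim=1).indices    # query selection (spatial)
        k_idx = prev_feats.norm(dim=-1).topk(self.topk_k, dim=1).indices   # key selection (temporal)
        d = cur_feats.size(-1)
        q = torch.gather(cur_feats, 1, q_idx.unsqueeze(-1).expand(-1, -1, d))
        kv = torch.gather(prev_feats, 1, k_idx.unsqueeze(-1).expand(-1, -1, d))
        out, _ = self.attn(q, kv, kv)                                      # sparse temporal attention
        # Scatter the updated queries back into the full-resolution feature map
        return cur_feats.scatter(1, q_idx.unsqueeze(-1).expand_as(out), out)
```

Attention cost drops from O((HW)^2) to O(topk_q x topk_k), which is the source of the claimed efficiency gain.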
Dual Attention Matching Network for Context-Aware Feature Sequence Based Person Re-Identification
Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C. Kot, and Gang Wang