Analysis
Failure Case Analysis
We provide more qualitative results below and analyze the failure cases. We find that our framework has the following limitations, which could be addressed in future work.
• The strategy we use to convert semantic segmentation into instance segmentation is not entirely effective. For simplicity, in TextPSG we treat each connected component in the semantic segmentation as an individual object instance. However, this strategy may fail when instances overlap or are occluded, resulting in either an underestimation or an overestimation of the number of instances. As shown below, our strategy successfully separates the two cows in (ii), but mistakenly divides the car behind the tree into three parts in (i).
• Our framework has difficulty locating small objects in the scene, due to limitations in resolution and in the grouping strategy used for localization. As shown below, in (ii) and (iv), our method can identify large objects such as the large cows, birds, grass, and sea, but struggles to locate relatively small objects such as the small cows in (ii) and the people in (iv).
• The relation prediction of our framework requires enhancement, as it is not adequately conditioned on the image. While the label generator uses both image features and predicted object semantics to determine the relation, it sometimes leans heavily on the object semantics and neglects the actual image content. As shown below, in (i), the relations between the blue car mask and the green car mask are both predicted as "in front of", which is not reasonable; "beside" would be a more appropriate prediction. (In this case, the first limitation concerning the segmentation conversion also applies.)
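The connected-component conversion described in the first limitation can be sketched as follows. This is a minimal illustration, not the paper's released code: it assumes a semantic segmentation map `sem_seg` (an H x W array of class ids) and uses 4-connected components via `scipy.ndimage.label`, so any class region split by an occluder is counted as multiple instances, which is exactly the failure mode noted above.

```python
import numpy as np
from scipy import ndimage

def semantic_to_instances(sem_seg, ignore_label=0):
    """Treat each connected component of each class as one instance."""
    instances = np.zeros_like(sem_seg, dtype=np.int32)
    next_id = 1
    for cls in np.unique(sem_seg):
        if cls == ignore_label:
            continue
        # Default structuring element gives 4-connectivity in 2D.
        comps, n = ndimage.label(sem_seg == cls)
        for c in range(1, n + 1):
            instances[comps == c] = next_id
            next_id += 1
    return instances

# A single "car" (class 1) split by a tree (class 0) into two regions
# is incorrectly counted as two instances.
sem = np.array([
    [1, 1, 0, 1],
    [1, 1, 0, 1],
])
inst = semantic_to_instances(sem)
```

Conversely, two cows that touch in the image would merge into a single connected component and be undercounted, which is the other direction of the same failure.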
Model Diagnosis
For a clearer understanding of the efficacy of our framework, we conduct a further model diagnosis to answer the following question.
• Why does our framework only achieve semantic segmentation through learning, rather than panoptic segmentation (and thus requires a further segmentation conversion to obtain instance segmentation)?
Here, we use two captions of different granularity to perform region-entity alignment: (a) one describing the two sheep individually, and (b) the other merging them in plural form. The result shows that our framework is capable of assigning distinct masks to individual instances. However, the nature of caption data, in which captions often merge objects of the same semantics into plural form, prevents our framework from differentiating instances; it is the weak supervision provided by the caption data that constrains it. We argue that an image-caption-pair dataset with finer-grained captions may enable panoptic segmentation through learning, which could be a valuable future direction to explore.
Citation
@InProceedings{Zhao_2023_ICCV,
author = {Zhao, Chengyang and Shen, Yikang and Chen, Zhenfang and Ding, Mingyu and Gan, Chuang},
title = {TextPSG: Panoptic Scene Graph Generation from Textual Descriptions},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {2839-2850}
}
Contact
If you have any questions, please feel free to contact us:
- Chengyang Zhao: zhaochengyang@pku.edu.cn
- Zhenfang Chen: chenzhenfang2013@gmail.com
- Chuang Gan: ganchuang1990@gmail.com