

Analysis


Failure Case Analysis

We provide more qualitative results below and analyze the failure cases. We find that our framework has the following limitations, which could be addressed in future work.
The strategy we use to convert semantic segmentation into instance segmentation is not entirely effective. For simplicity, TextPSG treats each connected component in the semantic segmentation as an individual object instance. However, this strategy may fail when instances overlap or are occluded: adjacent instances of the same class can merge into a single component (underestimating the number of instances), while an occluder can split one instance into several components (overestimating it). As shown below, our strategy successfully separates the two cows in (ii), but mistakenly divides the car behind the tree into three parts in (i).

Our framework has difficulty locating small objects in the scene, due to limited resolution and the grouping strategy used for localization. As shown below, in (ii) and (iv), our method can identify large objects such as large cows, birds, grass, and sea, but struggles to locate relatively small objects such as the small cows in (ii) and the people in (iv).

The relation prediction of our framework requires enhancement, as it is not adequately conditioned on the image. While the label generator uses both image features and predicted object semantics to determine the relation, it sometimes seems to lean heavily on the object semantics, potentially neglecting the actual image content. As shown below, in (i), the relations between the blue mask of the car and the green mask of the car are both predicted as "in front of", which is not reasonable. Here, "beside" may be a more appropriate prediction (this example also exhibits the first limitation, concerning the segmentation conversion).
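The connected-component conversion described in the first limitation can be sketched as follows. This is a minimal illustration, not the TextPSG implementation: the function name `semantic_to_instances`, the `ignore` parameter, the toy class ids, and the choice of 4-connectivity are all assumptions made for the example.

```python
from collections import deque


def semantic_to_instances(sem, ignore=()):
    """Treat each 4-connected region of equal class id as one instance.

    sem: 2D list of class ids (hypothetical toy format).
    Returns (instance_map, instance_classes): instance ids start at 1,
    0 marks ignored pixels, and instance_classes[i - 1] is the class
    of instance i.
    """
    h, w = len(sem), len(sem[0])
    inst = [[0] * w for _ in range(h)]
    classes = []
    for y in range(h):
        for x in range(w):
            if inst[y][x] or sem[y][x] in ignore:
                continue
            # Start a new instance and flood-fill its component (BFS).
            classes.append(sem[y][x])
            cur = len(classes)
            inst[y][x] = cur
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not inst[ny][nx]
                            and sem[ny][nx] == sem[y][x]):
                        inst[ny][nx] = cur
                        queue.append((ny, nx))
    return inst, classes


# Toy class ids (assumed): 0 = background, 1 = car, 2 = tree, 3 = cow.
# A tree occluding a car splits the car into two components.
occluded = [[1, 1, 2, 1, 1],
            [1, 1, 2, 1, 1]]
_, occluded_classes = semantic_to_instances(occluded)
print(occluded_classes.count(1))  # 2 car instances for a single car

# Two touching cows share one component and are counted once.
touching = [[3, 3, 3, 3],
            [0, 0, 0, 0]]
_, touching_classes = semantic_to_instances(touching, ignore=(0,))
print(touching_classes.count(3))  # 1 cow instance for two cows
```

The two toy inputs reproduce both failure modes: occlusion inflates the instance count, while contact between instances of the same class deflates it.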

[Figure: input images and qualitative results, examples (i) to (iv)]


Model Diagnosis

For a clearer understanding of the efficacy of our framework, we conduct a further model diagnosis to answer the following question.
Why does our framework only achieve semantic segmentation through learning, rather than panoptic segmentation (and thus require a further segmentation conversion to obtain instance segmentation)?
Here, we use two captions of different granularity to perform region-entity alignment: (a) one describes the two sheep individually, while (b) the other merges them in plural form. The result shows that our framework is capable of assigning distinct masks to individual instances. However, the nature of caption data, where captions often merge objects of the same semantics in plural form, prevents our framework from differentiating instances. It is the weak supervision provided by the caption data that constrains our framework. We argue that a superior image-caption-pair dataset with finer-grained captions may enable panoptic segmentation through learning, which could be a valuable future direction to explore.

[Figure: input image and region-entity alignment results for captions (a) and (b)]




Citation



@InProceedings{Zhao_2023_ICCV,
    author    = {Zhao, Chengyang and Shen, Yikang and Chen, Zhenfang and Ding, Mingyu and Gan, Chuang},
    title     = {TextPSG: Panoptic Scene Graph Generation from Textual Descriptions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2839-2850}
}



Contact


If you have any questions, please feel free to contact us:

  • Chengyang Zhao: zhaochengyang@pku.edu.cn
  • Zhenfang Chen: chenzhenfang2013@gmail.com
  • Chuang Gan: ganchuang1990@gmail.com