FlexCap: Describe Anything in Images in Controllable Detail
¹Google DeepMind ²Carnegie Mellon University
Accepted at NeurIPS 2024
Abstract
We introduce a versatile flexible-captioning vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images.
This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a localize-then-describe approach with FlexCap can outperform a describe-then-localize approach with other VLMs at open-ended object detection. We highlight a novel characteristic of FlexCap: its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog.
Describing Same Region with Different Lengths
FlexCap generates controllably rich localized descriptions for any region in an image. Because caption length is a conditioning input, the full spectrum of valid descriptions can be explored, from short object category names to fully detailed captions.
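To make the conditioning concrete, here is a minimal sketch of how a length-conditioned prompt for a bounding box might be assembled. FlexCap conditions generation on the box coordinates and a target caption length in words; the exact token format (`<loc …>`, `<len …>`) used below is an assumption for illustration, not the model's real vocabulary.

```python
def build_flexcap_prompt(box, num_words):
    """Build a conditioning prefix for length-controlled region captioning.

    `box` is (x1, y1, x2, y2) in pixel coordinates; `num_words` is the
    desired caption length. The token format here is hypothetical.
    """
    x1, y1, x2, y2 = box
    return f"<loc {x1} {y1} {x2} {y2}> <len {num_words}>"

# Same region, different target lengths -> different conditioning prompts:
terse = build_flexcap_prompt((10, 20, 110, 220), 1)    # e.g. an object label
detailed = build_flexcap_prompt((10, 20, 110, 220), 8) # e.g. a full sentence
```

Varying only the length token while keeping the box fixed is what lets the same region be described at different levels of detail.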
For results on length-conditioned captions, click on any image below to inspect it closely.
Describing Different Regions of Same Image
FlexCap can help in open-world detection by describing salient regions. Unlike prior dense captioning works, FlexCap generates more diverse sentences to describe visual content in controllable detail.
Here we present an interactive showcase of region-captioning results. Click on any image below to inspect it closely.
Extracting Object Attributes with Prefixes
Training FlexCap on a large dataset leads to an emergent capability: the model can extract desired information for a specific image region using input prefixes. We present below some examples of attributes that FlexCap can generate.
Click on the image to inspect the bounding box and caption closely.
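The prefixes below all instantiate the same mechanism: the decoder is forced to begin from a given text prefix and then completes it, so the prefix steers which attribute the model reports. A minimal sketch of prefix-forced greedy decoding, with a toy word chain standing in for FlexCap's decoder (the chain is an assumption for illustration, not real model output):

```python
def generate_with_prefix(step_fn, prefix_tokens, max_new=8):
    """Greedy decoding forced to begin from `prefix_tokens`.

    `step_fn` maps the tokens generated so far to the next token
    (None means stop). It stands in for FlexCap's decoder here.
    """
    tokens = list(prefix_tokens)
    for _ in range(max_new):
        nxt = step_fn(tokens)
        if nxt is None:
            break
        tokens.append(nxt)
    return tokens

# Toy "model": a fixed word chain (hypothetical, for illustration only).
chain = {"The": "person", "person": "is", "is": "skiing", "skiing": None}
caption = generate_with_prefix(lambda toks: chain.get(toks[-1]),
                               ["The", "person", "is"])
# caption == ["The", "person", "is", "skiing"]
```

Swapping in a different prefix (e.g. "It is made of") makes the same decoder complete a different kind of statement about the same region, which is how the attribute queries below work.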
Human Action
The prefix for extracting actions: "The person is _____"

Object Use
The prefix for extracting the function of an object: "This is used for _____"

Text
The prefix for performing OCR: "The sign says _____"

Book Title
The prefix for extracting book titles from cover pages: "This book is called _____"

Author
The prefix for extracting authors from cover pages: "Written by _____"

Photo Location
The prefix for extracting the location of a photo: "The photo was taken _____"

Noteworthy
The prefix for extracting noteworthy aspects of an image: "Notice _____"

Object Material
The prefix for extracting material: "It is made of _____"

Object Color
The prefix for extracting color: "The color is _____"

FlexCapLLM
Rich localized captions generated by FlexCap can be passed to large language models (LLMs) to enable zero-shot visual question answering.
Here we present some results of FlexCapLLM. Click on any of the images to inspect it closely. Note: in the images below, "FlexCap" refers to the system "FlexCapLLM".
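The pipeline above can be sketched in a few lines: FlexCap's localized captions are serialized into a text-only prompt for the LLM, which then answers the question. The prompt template below is an assumption for illustration; the system's exact prompt may differ.

```python
def build_llm_prompt(region_captions, question):
    """Turn localized region captions into a text-only prompt for an LLM.

    This is the localize-then-describe idea: region descriptions become
    textual context, so a frozen LLM can answer visual questions
    zero-shot. The template is hypothetical.
    """
    context = "\n".join(f"- {c}" for c in region_captions)
    return (f"Descriptions of image regions:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

prompt = build_llm_prompt(
    ["a red bicycle leaning against a wall",
     "a man wearing a blue jacket"],
    "What color is the bicycle?",
)
```

The LLM never sees pixels; all visual grounding is carried by the captions, which is why richer, length-controlled descriptions help downstream question answering.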
BibTeX
@inproceedings{dwibedi2024flexcap,
  title={FlexCap: Describe Anything in Images in Controllable Detail},
  author={Debidatta Dwibedi and Vidhi Jain and Jonathan Tompson and Andrew Zisserman and Yusuf Aytar},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024},
  url={https://openreview.net/forum?id=P5dEZeECGu}
}