Connecting Vision and Language with
Localized Narratives
Publication
Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari
ECCV (Spotlight), 2020
[PDF] [BibTeX] [1'30'' video] [10' video]
@inproceedings{PontTuset_eccv2020,
author = {Jordi Pont-Tuset and Jasper Uijlings and Soravit Changpinyo and Radu Soricut and Vittorio Ferrari},
title = {Connecting Vision and Language with Localized Narratives},
booktitle = {ECCV},
year = {2020}
}
Abstract
Explore Localized Narratives
License
Code
Here is the documentation about the file formats used.
Alternatively, you can manually download the data below.
Downloads
Large files are split into shards (a list of them will appear when you click below).
In parentheses, the number of Localized Narratives in each split. Please note that some images have more than one Localized Narrative annotation, e.g. 5k images in COCO are annotated 5 times.
File formats
The annotations are in JSON Lines format, that is, each line of the file is an independent valid JSON-encoded object. The largest files are split into smaller sub-files (shards) for ease of download. Since each line of the file is independent, the whole file can be reconstructed by simply concatenating the contents of the shards.
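As a minimal sketch (the shard file names below are hypothetical placeholders, not the actual download names), the shards of one split can be read back in Python as a single stream of annotations:
import glob
import json

# Placeholder pattern; substitute the actual shard files you downloaded.
shard_paths = sorted(glob.glob('localized_narratives-*.jsonl'))

annotations = []
for shard_path in shard_paths:
    with open(shard_path, 'r') as f:
        for line in f:  # each line is one independent JSON object
            annotations.append(json.loads(line))

print(f'Loaded {len(annotations)} Localized Narratives.')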
Each line represents one Localized Narrative annotation on one image by one annotator and has the following fields:
- dataset_id String identifying the dataset and split where the image belongs, e.g. mscoco_val2017.
- image_id String identifier of the image, as specified on each dataset.
- annotator_id Integer number uniquely identifying each annotator.
- caption Image caption as a string of characters.
- timed_caption List of timed utterances, i.e. {utterance, start_time, end_time} where utterance is a word (or group of words) and (start_time, end_time) is the time during which it was spoken, with respect to the start of the recording.
- traces List of trace segments, one between each time the mouse pointer enters the image and leaves it. Each trace segment is represented as a list of timed points, i.e. {x, y, t}, where x and y are the normalized image coordinates (with origin at the top-left corner of the image) and t is the time in seconds since the start of the recording. Please note that the coordinates can go slightly beyond the image, i.e. <0 or >1, because we recorded the mouse traces including a small band around the image.
- voice_recording Relative URL path with respect to https://storage.googleapis.com/localized-narratives/voice-recordings where to find the voice recording (in OGG format) for that particular image.
Below is a sample of one Localized Narrative in this format:
{
  "dataset_id": "mscoco_val2017",
  "image_id": "137576",
  "annotator_id": 93,
  "caption": "In this image there are group of cows standing and eating th...",
  "timed_caption": [{"utterance": "In this", "start_time": 0.0, "end_time": 0.4}, ...],
  "traces": [[{"x": 0.2086, "y": -0.0533, "t": 0.022}, ...], ...],
  "voice_recording": "coco_val/coco_val_137576_93.ogg"
}
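As a sketch of how these fields fit together (using the annotations loaded in the snippet above; the image size here is a placeholder), one can map a trace to pixel coordinates, pick the trace points that accompany an utterance, and build the voice-recording URL:
annotation = annotations[0]  # one parsed line, from the loading snippet above

image_width, image_height = 640, 480  # placeholder: take these from the actual image file

# Convert the first trace segment from normalized to pixel coordinates.
# Values can fall slightly outside [0, 1] because a small band around the
# image was also recorded.
first_segment = annotation['traces'][0]
pixel_points = [(p['x'] * image_width, p['y'] * image_height, p['t'])
                for p in first_segment]

# Collect the trace points drawn while a given utterance was being spoken.
def points_for_utterance(timed_word, traces):
    start, end = timed_word['start_time'], timed_word['end_time']
    return [p for segment in traces for p in segment
            if start <= p['t'] <= end]

first_word = annotation['timed_caption'][0]
print(first_word['utterance'],
      len(points_for_utterance(first_word, annotation['traces'])))

# Build the full URL of the voice recording for this annotation.
base_url = 'https://storage.googleapis.com/localized-narratives/voice-recordings'
print(f"{base_url}/{annotation['voice_recording']}")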
Below you can download the automatic speech-to-text transcriptions from the voice recordings. The format is a list of text chunks, each of which is a list of ten alternative transcriptions, each with its confidence.
Please note: the final caption text of Localized Narratives is written manually by the annotators. The automatic transcriptions below were only used to temporally align the manual transcription with the mouse traces. The timestamps used for this alignment, though, were not stored, so the process cannot be reproduced exactly. To obtain timestamps, you would need to re-run Google's speech-to-text transcription (here is the code we used). Given that the API is constantly evolving, though, the result will likely not match the transcriptions stored below.
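The exact structure of these transcription files is not spelled out above; the sketch below assumes each alternative is an object with transcript and confidence keys (as in Google's Speech-to-Text responses) and simply keeps the highest-confidence alternative of each chunk. Adjust the keys to whatever the downloaded files actually contain.
def best_transcription(chunks):
    # `chunks` is the transcription of one recording: a list of text chunks,
    # each a list of alternatives. The 'transcript' and 'confidence' keys are
    # an assumption, not part of the documentation above.
    best = [max(alternatives, key=lambda a: a['confidence'])
            for alternatives in chunks]
    return ' '.join(a['transcript'] for a in best)

# Toy example with the assumed structure:
chunks = [
    [{'transcript': 'in this image', 'confidence': 0.92},
     {'transcript': 'in these image', 'confidence': 0.41}],
    [{'transcript': 'there are cows', 'confidence': 0.88}],
]
print(best_transcription(chunks))  # -> in this image there are cows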