1. Introduction

Integrating vision and language has long been a goal of research on artificial intelligence (AI). In the past two years, we have witnessed an explosion of work bringing together vision and language, from images to videos and beyond. The available corpora have played a crucial role in advancing this area of research.

We propose a set of quality metrics for objectively evaluating and analyzing vision and language datasets. If you plan to release a dataset in this space, please demonstrate how it is similar to and different from related datasets. Current releases explain such differences in a primarily qualitative fashion; with the suggested metrics, differences can also be measured quantitatively and objectively. This is critical for understanding how well the datasets advancing research generalize, and what their "blind spots" may be.
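As a minimal sketch of what quantitative dataset comparison can look like, the snippet below computes a few simple corpus statistics (vocabulary size, mean caption length, type/token ratio) for two toy caption sets. These are illustrative metrics only, and the example captions are invented; they are not the specific metric suite or data from the survey.

```python
from collections import Counter

def caption_stats(captions):
    """Compute simple, objective statistics for a list of caption strings.

    Illustrative metrics only -- not the survey's actual metric suite.
    Tokenization is naive whitespace splitting on lowercased text.
    """
    tokens = [tok for cap in captions for tok in cap.lower().split()]
    vocab = Counter(tokens)
    return {
        "captions": len(captions),
        "tokens": len(tokens),
        "vocab_size": len(vocab),
        "mean_caption_length": len(tokens) / len(captions),
        "type_token_ratio": len(vocab) / len(tokens),
    }

if __name__ == "__main__":
    # Hypothetical examples of the two caption styles discussed below:
    user_generated = [
        "my dog at the beach last summer",
        "best latte ever",
    ]
    crowd_sourced = [
        "a dog runs on a sandy beach",
        "a cup of coffee on a wooden table",
    ]
    print("user-generated:", caption_stats(user_generated))
    print("crowd-sourced: ", caption_stats(crowd_sourced))
```

Even these crude statistics expose stylistic differences: crowd-sourced captions tend to be longer and more formulaic (lower type/token ratio) than user-generated ones.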

You may also add your own datasets to this resource; please contact us for GitHub access.

If you use this resource or the tools provided, please cite the relevant paper:

Ferraro, F., Mostafazadeh, N., Huang, T., Vanderwende, L., Devlin, J., Galley, M., and Mitchell, M. (2015). A Survey of Current Datasets for Vision and Language Research. In Proceedings of EMNLP 2015. [bibtex]

2. Comparison Tools

3. Image Captioning

3-1. User-generated Captions

3-2. Crowd-sourced Captions

4. Video Captioning

5. Beyond Visual Description Datasets