We introduce the first dataset for sequential vision-to-language and explore how this data may be used for the task of visual storytelling. The dataset includes 81,743 unique photos in 20,211 sequences, aligned to descriptive and story language. VIST was previously known as the Sequential Image Narrative Dataset (SIND).
1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|
The dog was ready to go. | He had a great time on the hike. | And was very happy to be in the field. | His mom was so proud of him. | It was a beautiful day for him. |
Photos by kameraschwein / CC BY-NC-ND 2.0
Photos by mharvey75 / CC BY-NC 2.0, janelle / CC BY-NC-ND 2.0, lance_mountain / CC BY-NC-ND 2.0
Photos by rbieber / CC BY-NC-ND 2.0