VIST: Visual Storytelling Dataset

Frequently Asked Questions (FAQ)

On the SIS dataset, several images were missing when I downloaded the images, so I simply filtered out the affected stories. I got 40071 / 4980 / 5050 (train / val / test) stories after preprocessing. Are these numbers correct?

Some images may have been removed by their owners. The owners hold the copyright on some of the scraped/crawled images, so whether the image content stays available is up to them.
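
The filtering described in the question can be sketched as follows. This is a minimal sketch and not the authors' preprocessing: it assumes each story is given as a list of image IDs and that every downloaded image is stored in one directory under its ID as the file name. The function name filter_complete_stories and the story layout are hypothetical, not the official SIS JSON schema.

```python
import os
import tempfile


def filter_complete_stories(stories, image_dir):
    """Keep only stories whose images were all downloaded.

    `stories` maps a story ID to the list of image IDs it uses
    (hypothetical layout, not the official SIS JSON schema).
    An image counts as present if some file in `image_dir` has
    that ID as its base name, whatever the extension.
    """
    available = {os.path.splitext(name)[0] for name in os.listdir(image_dir)}
    return {
        story_id: image_ids
        for story_id, image_ids in stories.items()
        if all(image_id in available for image_id in image_ids)
    }


def demo():
    # Build a throwaway directory in which image "3" is missing.
    with tempfile.TemporaryDirectory() as image_dir:
        for image_id in ("1", "2", "4"):
            open(os.path.join(image_dir, image_id + ".jpg"), "w").close()
        stories = {"a": ["1", "2"], "b": ["2", "3"], "c": ["4"]}
        return filter_complete_stories(stories, image_dir)


print(sorted(demo()))  # story "b" is dropped because image "3" is missing
```

Because the set of removed images changes over time, story counts obtained this way can differ slightly between downloads.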

In your paper's baseline method, you use an encoder-decoder GRU model. Could you provide more details on that? What kind of image features do you use? Do your encoder and decoder GRUs share weights? What is the parameter size?

The details of the training were:
1. Extract 4096-dim FC7 features using VGG16 without fine-tuning.
2. The encoder reads over the 5 images in sequence. The order of the images is reversed (i.e., the first image in the sequence is the last one read in), following what is commonly done for machine translation. This is probably not important, though.
3. The encoder and decoder are 1000-dimensional GRUs (no weight sharing).
4. The target word embedding size is 250 dimensions (i.e., the dimension of the word that was just produced when it is fed into the decoder GRU).
5. The target vocabulary is the set of words that occur 3 or more times in the training data. Other words are mapped to UNK (however, there is a constraint in the decoder that UNK cannot be produced at test time).
6. 0.5 dropout on the image FC7 input (i.e., 50% of the 4096-dim FC7 features are dropped out before being fed into the encoder GRU). This is probably not important.
7. 0.5 dropout on the decoder GRU layer before applying it to the output layer.
8. If the story model is co-trained with caption data, you should use a token in the encoder GRU to indicate which type of output to produce.
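
Steps 2 and 5 above can be illustrated with a small sketch. It is an assumption-laden illustration rather than the original training code: the helper names (build_vocab, map_to_unk, encoder_input_order), the whitespace-tokenized toy stories, and the literal string used for the UNK symbol are all hypothetical.

```python
from collections import Counter

UNK = "UNK"  # placeholder symbol; the exact token string is an assumption


def build_vocab(tokenized_stories, min_count=3):
    """Return the set of words seen at least `min_count` times in training."""
    counts = Counter(word for story in tokenized_stories for word in story)
    return {word for word, n in counts.items() if n >= min_count}


def map_to_unk(story, vocab):
    """Replace out-of-vocabulary words with UNK (training-side mapping)."""
    return [word if word in vocab else UNK for word in story]


def encoder_input_order(image_features):
    """Reverse the sequence so the first image is the last one read in."""
    return list(reversed(image_features))


train = [
    ["we", "went", "to", "the", "beach"],
    ["the", "beach", "was", "sunny"],
    ["the", "kids", "loved", "the", "beach"],
]
vocab = build_vocab(train)
print(sorted(vocab))
print(map_to_unk(["the", "beach", "was", "fun"], vocab))
print(encoder_input_order(["img1", "img2", "img3", "img4", "img5"]))
```

In this toy corpus only "the" and "beach" reach the threshold of 3 occurrences, so every other word is replaced by UNK.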

What evaluation parameters are you using?
Is it "self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, '-', '-', '-stdio', '-l', 'en', '-t', 'hter', '-norm']"?
What METEOR score do you get if you randomly shuffle the training and gt pairs? Do you see a similar problem?

That is correct, except that we did not use -norm; we normalized the text ourselves, using our own preprocessing (described in the paper). Also, -l en is the default.

So, it is:
"self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, '-', '-', '-stdio', '-t', 'hter'] "

For the final score calculations, the -refCount flag is used to score the generated sentence against the multiple references.

I'm not sure what you got with the random shuffle. We compare the top-1 generated result with the (varying) number of reference stories for the same sequence.

What are the detailed architectures of the baseline models? As far as I can tell, there are two possible architectures: a) they encode the WHOLE image sequence into a vector and use it as the initial hidden state of the decoder, which generates a complete story word by word; b) they generate each sentence word by word conditioned on the encoded feature of the CURRENT image, while the hidden state of the decoder is initialized with the final state reached after the previous sentence was generated. Could you tell me which one matches the baseline models in the paper?

“(a)” is the one. It's analogous to machine translation sequence-to-sequence models. You run an RNN (GRU in our case) over the fc7 vectors from the 5 images (in reverse order, which may not be important), then take the final hidden state and feed that into the decoder model.
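
Option (a) can be sketched in plain Python as below. This is only an illustration of the data flow, not the paper's implementation: the GRUCell class is a textbook GRU with biases omitted, the weights are random and untrained, and the toy sizes (8-dim features, 4-dim hidden state) stand in for the 4096-dim fc7 vectors and 1000-dim GRU. All names in the block are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Textbook GRU cell (biases omitted for brevity)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One input-to-hidden and one hidden-to-hidden matrix per gate:
        # z = update gate, r = reset gate, h = candidate state.
        self.W = {g: random_matrix(hidden_dim, input_dim, rng) for g in "zrh"}
        self.U = {g: random_matrix(hidden_dim, hidden_dim, rng) for g in "zrh"}

    def _pre(self, gate, x, h):
        # Pre-activation W[gate] @ x + U[gate] @ h.
        return [a + b for a, b in zip(matvec(self.W[gate], x), matvec(self.U[gate], h))]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._pre("z", x, h)]
        r = [sigmoid(v) for v in self._pre("r", x, h)]
        gated_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in self._pre("h", x, gated_h)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode_image_sequence(cell, image_features):
    """Read the images in reverse order and return the final hidden state."""
    h = [0.0] * cell.hidden_dim
    for x in reversed(image_features):
        h = cell.step(x, h)
    return h


rng = random.Random(0)
FEATURE_DIM, HIDDEN_DIM = 8, 4  # toy sizes; the FAQ uses 4096 and 1000
encoder = GRUCell(FEATURE_DIM, HIDDEN_DIM, rng)
images = [[rng.random() for _ in range(FEATURE_DIM)] for _ in range(5)]

# Option (a): one vector summarizes all 5 images and seeds the decoder.
decoder_initial_state = encode_image_sequence(encoder, images)
print(len(decoder_initial_state))
```

A separate decoder GRU (no weight sharing with the encoder) would start from decoder_initial_state and emit the whole story word by word.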

What about the parameters of the models, e.g., the word embedding size and the number of hidden units in the encoder and the decoder? I currently embed words as 300-dimensional vectors and use a hidden layer size of 256 (in both encoder and decoder). Could that work?

We used 250-dim embeddings and 1000-dimensional GRUs for the encoder and decoder. Source images are 4096-dimensional fc7 vectors, with 0.5 dropout applied to the fc7 vectors before feeding them into the RNN. There was no fine-tuning of the CNN used to generate the fc7 vectors. I also used 0.5 dropout on the target hidden state before the softmax output layer. The vocab was all words that occur 3 or more times in the training data. Other words were mapped to UNK in the training data, but I did not allow UNK to be emitted at test time (i.e., if the decoder proposes UNK as the highest probability word, I just skip it and go to the next highest probability word).
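
The test-time rule for UNK can be sketched as follows. This is a minimal sketch assuming greedy, one-word-at-a-time selection from a dictionary of word probabilities; the function name pick_next_word, the probabilities, and the literal UNK string are hypothetical, and the actual decoder may have applied the same rule inside a different search procedure.

```python
UNK = "UNK"  # placeholder symbol; the exact token string is an assumption


def pick_next_word(word_probs, banned=(UNK,)):
    """Return the most probable word that is not banned.

    `word_probs` maps each vocabulary word to its probability at the
    current decoding step.
    """
    ranked = sorted(word_probs.items(), key=lambda kv: kv[1], reverse=True)
    for word, _ in ranked:
        if word not in banned:
            return word
    raise ValueError("every candidate word is banned")


step_probs = {"UNK": 0.5, "beach": 0.3, "the": 0.2}
print(pick_next_word(step_probs))  # UNK ranks first but is skipped
```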

Another question, about the experimental results in Table 6 and Table 7: the paper says they were obtained on half of the SIS validation set. Could you please tell me which half you used? Will the baseline models obtain similar results (in terms of METEOR score) on the whole validation set or on the test set?

It was the second half of the dataset.
