That is correct, except that we did not use -norm; we normalized the text ourselves, using our own preprocessing (described in the paper). Also, -l en is the default.
So, it is:
"self.meteor_cmd = ['java', '-jar', '-Xmx2G', METEOR_JAR, '-', '-', '-stdio', '-t', 'hter'] "
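For illustration, here is a minimal sketch of how that command list could be assembled and launched from Python. The jar path and the helper names build_meteor_cmd and start_meteor are hypothetical, not taken from our code; only the argument list itself matches the line above.

```python
import subprocess

# Hypothetical location of the METEOR jar; adjust to your own setup.
METEOR_JAR = "meteor-1.5.jar"


def build_meteor_cmd(jar_path=METEOR_JAR, max_heap="2G"):
    """Return the argument list quoted above (no -norm, no explicit -l en)."""
    return ["java", "-jar", "-Xmx" + max_heap, jar_path,
            "-", "-", "-stdio", "-t", "hter"]


def start_meteor(jar_path=METEOR_JAR):
    """Start METEOR as a long-running child process that talks over stdin/stdout."""
    return subprocess.Popen(
        build_meteor_cmd(jar_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
```

Building the list in a separate function makes it easy to check that -norm is really absent before any Java process is started.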
For the final score calculations, the -refCount flag is used to score the generated sentence against the multiple references.
I'm not sure what you got from the random shuffle. We compare the top-1 generated result with the (varying) number of reference stories for the same sequence.
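As a rough sketch of what "top-1 against a varying number of references" can look like in -stdio mode: the function below builds one request line per generated story. The "SCORE ||| ref ||| ... ||| hypothesis" layout is an assumption based on the commonly used coco-caption METEOR wrapper, and build_score_line is a hypothetical helper name, so please check it against your own wrapper.

```python
def build_score_line(hypothesis, references):
    """Build one stdio request: all references first, the hypothesis last.

    Assumed layout (coco-caption style): SCORE ||| ref_1 ||| ... ||| ref_n ||| hyp
    """
    def clean(text):
        # '|||' is the field separator, so it must not appear inside a field.
        return " ".join(text.replace("|||", " ").split())

    refs = [clean(r) for r in references]
    if not refs:
        raise ValueError("at least one reference story is required")
    return " ||| ".join(["SCORE"] + refs + [clean(hypothesis)])


# Each sequence may have a different number of reference stories.
line = build_score_line("the family went to the beach",
                        ["a family visits the beach",
                         "they spent the day by the sea"])
```

The number of fields simply grows with the number of references, so no shuffling or sampling of references is needed.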