The Idiap/ETHZ Faces and Poses Dataset by L. Jie, B. Caputo and V. Ferrari contains 1703 image-caption pairs. Captions contain the names of some of the persons appearing in the corresponding image, as well as verbs indicating what they are doing. The images were collected by querying Google Images with keywords generated by combining different names (sports stars and politicians) and verbs (from sports and social interactions). The captions are derived from the snippet of text returned by Google Images and typically mention the action of at least one person in the image, as well as names and verbs that do not appear in the image. In addition to the image-caption pairs, this release also includes: (1) ground-truth associations between names and verbs in the captions; (2) ground-truth lists of which names from the caption appear in the images; (3) ground-truth locations of the persons in the images; (4) name-verb pairs extracted automatically from the captions; and (5) face and upper-body bounding boxes detected using [4,5]. These are included to facilitate a direct comparison to our results.