We present a new large-scale dataset that contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high-quality pixel-level annotations of 5000 frames in addition to a larger set of 20000 weakly annotated frames. The dataset is thus an order of magnitude larger than similar previous attempts. The Cityscapes Dataset is intended for 1) assessing the performance of vision algorithms for two major tasks of semantic urban scene understanding: pixel-level and instance-level semantic labeling; and 2) supporting research that aims to exploit large volumes of (weakly) annotated data, e.g., for training deep neural networks.

Related datasets