TAO: A Large-Scale Benchmark for Tracking Any Object

Paper Title

TAO: A Large-Scale Benchmark for Tracking Any Object

Authors

Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan

Abstract

For many years, multi-object tracking benchmarks have focused on a handful of categories. Motivated primarily by surveillance and self-driving applications, these datasets provide tracks for people, vehicles, and animals, ignoring the vast majority of objects in the world. By contrast, in the related field of object detection, the introduction of large-scale, diverse datasets (e.g., COCO) has fostered significant progress in developing highly robust solutions. To bridge this gap, we introduce a similarly diverse dataset for Tracking Any Object (TAO). It consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average. Importantly, we adopt a bottom-up approach for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks. To this end, we ask annotators to label objects that move at any point in the video, and to name them post factum. Our vocabulary is both significantly larger than and qualitatively different from those of existing tracking datasets. To ensure scalability of annotation, we employ a federated approach that focuses manual effort on labeling tracks for the relevant objects in a video (e.g., those that move). We perform an extensive evaluation of state-of-the-art trackers and make a number of important discoveries regarding large-vocabulary tracking in an open world. In particular, we show that existing single- and multi-object trackers struggle when applied to this scenario in the wild, and that detection-based, multi-object trackers are in fact competitive with user-initialized ones. We hope that our dataset and analysis will boost further progress in the tracking community.
