Paper Title
Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in Large Scenes
Paper Authors
Abstract
Crowd counting in single-view images has achieved outstanding performance on existing counting datasets. However, single-view counting is not applicable to large and wide scenes (e.g., public parks, long subway platforms, or event spaces), because a single camera cannot capture the whole scene in adequate detail for counting: the scene may be too large to fit into the camera's field-of-view, too long, so that the resolution of faraway crowds is too low, or contain large objects that occlude large portions of the crowd. Therefore, solving the wide-area counting task requires multiple cameras with overlapping fields-of-view. In this paper, we propose a deep neural network framework for multi-view crowd counting, which fuses information from multiple camera views to predict a scene-level density map on the ground plane of the 3D world. We consider three versions of the fusion framework: the late fusion model fuses camera-view density maps; the naive early fusion model fuses camera-view feature maps; and the multi-view multi-scale early fusion model ensures that features aligned to the same ground-plane point have consistent scales. A rotation selection module further ensures consistent rotation alignment of the features. We test the three fusion models on three multi-view counting datasets: PETS2009, DukeMTMC, and a newly collected multi-view counting dataset containing a crowded street intersection. Our methods achieve state-of-the-art results compared to other multi-view counting baselines.
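The geometric core of the late-fusion variant (warping each camera view's density map onto a common ground plane before combining them) can be sketched as follows. This is a minimal NumPy illustration under assumptions I introduce here, not the paper's implementation: the homographies `H_g2i` mapping ground-plane cells to image pixels are assumed given, sampling is nearest-neighbour, and overlapping views are simply averaged, whereas the paper learns the fusion with a network.

```python
import numpy as np

def project_to_ground(view_density, H_g2i, out_shape):
    """Warp a camera-view density map onto the ground plane.

    H_g2i is a 3x3 homography mapping ground-plane coordinates (gx, gy, 1)
    to image coordinates; sampling is nearest-neighbour for simplicity.
    """
    out = np.zeros(out_shape)
    for gy in range(out_shape[0]):
        for gx in range(out_shape[1]):
            p = H_g2i @ np.array([gx, gy, 1.0])
            x = int(round(p[0] / p[2]))
            y = int(round(p[1] / p[2]))
            if 0 <= y < view_density.shape[0] and 0 <= x < view_density.shape[1]:
                out[gy, gx] = view_density[y, x]
    return out

def late_fusion(view_densities, homographies, ground_shape):
    """Late fusion sketch: project each view's density map to the ground
    plane, then average cells covered by multiple views so that people
    visible in several cameras are not double-counted."""
    acc = np.zeros(ground_shape)
    cnt = np.zeros(ground_shape)
    for density, H in zip(view_densities, homographies):
        proj = project_to_ground(density, H, ground_shape)
        acc += proj
        cnt += (proj > 0)
    return acc / np.maximum(cnt, 1)
```

With two identical views and identity homographies, a density map containing one person fuses to a ground-plane map that still sums to one, which is the property the averaging step is meant to preserve.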