Paper Title
Human Instance Segmentation and Tracking via Data Association and Single-stage Detector
Paper Authors
Paper Abstract
Human video instance segmentation plays an important role in computer understanding of human activities and is widely used in video processing, video surveillance, and human modeling in virtual reality. Most current VIS methods are based on the Mask R-CNN framework, in which the target appearance and motion information used for data matching increase the computational cost and degrade real-time segmentation performance; on the other hand, existing VIS datasets pay little attention to all the people appearing in a video. In this paper, to address these problems, we develop a new method for human video instance segmentation based on a single-stage detector. To track instances across the video, we adopt a data association strategy that matches the same instance across the video sequence, jointly learning target instance appearances and their affinities in a pair of video frames in an end-to-end fashion. We also adopt a centroid sampling strategy to enhance the instance embedding extraction ability, which biases the instance position toward the interior of each instance mask under heavy overlap conditions. As a result, even if there is a sudden change in a person's activity, the instance position does not move out of the mask, so the problem of the same instance being represented by two different instances can be alleviated. Finally, we collect the PVIS dataset by assembling several video instance segmentation datasets to fill the current lack of datasets dedicated to human video segmentation. Extensive experiments based on this dataset have been conducted, and the results verify the effectiveness and efficiency of the proposed work.
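The abstract describes associating instances across a pair of frames via jointly learned embeddings and their affinities. The following is a minimal illustrative sketch of such cross-frame data association, assuming cosine-similarity affinities and Hungarian assignment; the function name `match_instances`, the similarity threshold, and these specific choices are assumptions made for illustration, not details given in the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(emb_prev, emb_curr, sim_threshold=0.5):
    """Associate instances across two frames by embedding affinity (sketch).

    emb_prev: (N, D) instance embeddings from the previous frame.
    emb_curr: (M, D) instance embeddings from the current frame.
    Returns a list of (prev_idx, curr_idx) matched pairs.
    """
    # Cosine-similarity affinity matrix between the two frames.
    a = emb_prev / np.linalg.norm(emb_prev, axis=1, keepdims=True)
    b = emb_curr / np.linalg.norm(emb_curr, axis=1, keepdims=True)
    affinity = a @ b.T                      # shape (N, M)

    # Hungarian assignment maximizing total affinity.
    rows, cols = linear_sum_assignment(-affinity)

    # Keep only confident matches; unmatched detections would start new tracks.
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] >= sim_threshold]
```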
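The centroid sampling strategy is described as biasing each instance's position toward the interior of its mask under heavy overlap. One possible interpretation is sketched below, under the assumption that the sampled position is the mask centroid snapped to the nearest foreground pixel whenever the plain centroid falls outside the mask; `interior_centroid` is a hypothetical helper, not the paper's implementation.

```python
import numpy as np

def interior_centroid(mask):
    """Return an instance position that lies inside the given mask (sketch).

    mask: (H, W) boolean array for one instance.
    The plain centroid of a non-convex or heavily overlapped mask can fall
    outside the foreground; snapping it to the nearest foreground pixel keeps
    the sampled position on the instance itself.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()           # plain mask centroid
    if mask[int(round(cy)), int(round(cx))]:
        return cy, cx
    # Snap to the closest foreground pixel of this mask.
    d2 = (ys - cy) ** 2 + (xs - cx) ** 2
    k = int(np.argmin(d2))
    return float(ys[k]), float(xs[k])
```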