Paper Title
Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection
Paper Authors
Paper Abstract
This paper presents the novel idea of generating object proposals by leveraging temporal information for video object detection. Feature aggregation in modern region-based video object detectors relies heavily on learned proposals generated by a single-frame RPN. This inevitably introduces additional components such as NMS and yields unreliable proposals on low-quality frames. To tackle these restrictions, we present SparseVOD, a novel video object detection pipeline that employs Sparse R-CNN to exploit temporal information. In particular, we introduce two modules in the dynamic head of Sparse R-CNN. First, a Temporal Feature Extraction module based on the Temporal RoI Align operation is added to extract RoI proposal features across frames. Second, motivated by sequence-level semantic aggregation, we incorporate an attention-guided Semantic Proposal Feature Aggregation module to enhance object feature representations before detection. The proposed SparseVOD effectively alleviates the overhead of complicated post-processing methods and makes the overall pipeline end-to-end trainable. Extensive experiments show that our method significantly improves the single-frame Sparse R-CNN baseline by 8%-9% mAP. Furthermore, besides achieving a state-of-the-art 80.3% mAP on the ImageNet VID dataset with a ResNet-50 backbone, our SparseVOD outperforms existing proposal-based methods by a significant margin at higher IoU thresholds (IoU > 0.5).
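The attention-guided aggregation described above can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: it assumes that per-proposal RoI features have already been extracted from a keyframe and several support frames, and it combines them with a simple similarity-based softmax attention over the temporal axis. All function and variable names here are hypothetical.

```python
import numpy as np

def attention_aggregate(query_feats, support_feats):
    """Illustrative sketch of attention-guided temporal feature aggregation.

    query_feats:   (N, D) proposal features from the keyframe
    support_feats: (T, N, D) matching proposal features from T support frames
    returns:       (N, D) aggregated proposal features
    """
    # Cosine similarity between each keyframe proposal and its
    # temporal counterparts in the support frames.
    q = query_feats / np.linalg.norm(query_feats, axis=-1, keepdims=True)
    s = support_feats / np.linalg.norm(support_feats, axis=-1, keepdims=True)
    sim = np.einsum("nd,tnd->tn", q, s)           # (T, N)

    # Softmax over the temporal axis gives per-frame attention weights,
    # so reliable frames contribute more to each proposal's feature.
    w = np.exp(sim - sim.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)          # (T, N)

    # Attention-weighted sum of support features per proposal.
    return np.einsum("tn,tnd->nd", w, support_feats)
```

If all support frames carry identical features for a proposal, the weights are uniform and the output reduces to that shared feature, which is the expected degenerate behavior of such an aggregation scheme.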