Paper Title


Feature Flow: In-network Feature Flow Estimation for Video Object Detection

Paper Authors

Ruibing Jin, Guosheng Lin, Changyun Wen, Jianliang Wang, Fayao Liu

Paper Abstract


Optical flow, which expresses pixel displacement, is widely used in many computer vision tasks to provide pixel-level motion information. However, with the remarkable progress of convolutional neural networks, recent state-of-the-art approaches solve problems directly at the feature level. Since the displacement of a feature vector is not consistent with the pixel displacement, a common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset. With this method, they expect the fine-tuned network to produce tensors encoding feature-level motion information. In this paper, we rethink this de facto paradigm and analyze its drawbacks in the video object detection task. To mitigate these issues, we propose a novel network (IFF-Net) with an In-network Feature Flow estimation module (IFF module) for video object detection. Without resorting to pre-training on any additional dataset, our IFF module is able to directly produce feature flow, which indicates the feature displacement. Our IFF module consists of a shallow module, which shares features with the detection branches. This compact design enables our IFF-Net to accurately detect objects while maintaining a fast inference speed. Furthermore, we propose a transformation residual loss (TRL) based on self-supervision, which further improves the performance of our IFF-Net. Our IFF-Net outperforms existing methods and sets a state-of-the-art performance on ImageNet VID.
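The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: a shallow convolutional head estimates a per-location displacement field ("feature flow") directly on backbone features of two adjacent frames and uses it to warp the previous frame's features toward the current frame. The class name IFFSketch, the layer sizes, and the grid_sample-based warping are illustrative assumptions, not the paper's actual IFF module or its transformation residual loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IFFSketch(nn.Module):
    """Hypothetical sketch of an in-network feature-flow estimator.

    A shallow conv head predicts a 2-channel displacement field on backbone
    features, which is then used to warp the previous frame's features toward
    the current frame. All layer choices are assumptions for illustration.
    """

    def __init__(self, channels: int = 256):
        super().__init__()
        # Shallow estimation head operating on the concatenated feature pair.
        self.flow_head = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),  # (dx, dy) per location
        )

    def forward(self, feat_prev: torch.Tensor, feat_cur: torch.Tensor) -> torch.Tensor:
        # Predict feature flow from the previous and current frame features.
        flow = self.flow_head(torch.cat([feat_prev, feat_cur], dim=1))  # (B, 2, H, W)

        # Build a normalized sampling grid shifted by the predicted flow.
        b, _, h, w = flow.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=flow.dtype, device=flow.device),
            torch.arange(w, dtype=flow.dtype, device=flow.device),
            indexing="ij",
        )
        grid_x = (xs + flow[:, 0]) / max(w - 1, 1) * 2 - 1
        grid_y = (ys + flow[:, 1]) / max(h - 1, 1) * 2 - 1
        grid = torch.stack([grid_x, grid_y], dim=-1)  # (B, H, W, 2), values in [-1, 1]

        # Warp the previous frame's features toward the current frame.
        return F.grid_sample(feat_prev, grid, align_corners=True)


if __name__ == "__main__":
    iff = IFFSketch(channels=256)
    prev = torch.randn(1, 256, 38, 50)
    cur = torch.randn(1, 256, 38, 50)
    print(iff(prev, cur).shape)  # torch.Size([1, 256, 38, 50])
```

Because the flow head works on detection features rather than raw pixels, such a design needs no optical-flow pre-training on an external dataset, which matches the abstract's claim of a compact, in-network estimator; how the actual IFF module and TRL are defined is specified in the paper itself.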
