Paper Title


Feature Flow: In-network Feature Flow Estimation for Video Object Detection

Paper Authors

Ruibing Jin, Guosheng Lin, Changyun Wen, Jianliang Wang, Fayao Liu

Paper Abstract


Optical flow, which expresses pixel displacement, is widely used in many computer vision tasks to provide pixel-level motion information. However, with the remarkable progress of convolutional neural networks, recent state-of-the-art approaches solve problems directly at the feature level. Since the displacement of a feature vector is not consistent with the pixel displacement, a common approach is to forward optical flow to a neural network and fine-tune this network on the task dataset. With this method, they expect the fine-tuned network to produce tensors encoding feature-level motion information. In this paper, we rethink this de facto paradigm and analyze its drawbacks in the video object detection task. To mitigate these issues, we propose a novel network (IFF-Net) with an In-network Feature Flow estimation module (IFF module) for video object detection. Without resorting to pre-training on any additional dataset, our IFF module is able to directly produce feature flow, which indicates the feature displacement. Our IFF module consists of a shallow module, which shares features with the detection branches. This compact design enables our IFF-Net to accurately detect objects while maintaining a fast inference speed. Furthermore, we propose a transformation residual loss (TRL) based on self-supervision, which further improves the performance of our IFF-Net. Our IFF-Net outperforms existing methods and sets a state-of-the-art performance on ImageNet VID.
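The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: a shallow convolutional head estimates a per-location displacement field ("feature flow") directly on backbone features of two adjacent frames and uses it to warp the previous frame's features toward the current frame. The class name IFFSketch, the layer sizes, and the grid_sample-based warping are illustrative assumptions, not the paper's actual IFF module or its transformation residual loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IFFSketch(nn.Module):
    """Hypothetical sketch of an in-network feature-flow estimator.

    A shallow conv head predicts a 2-channel displacement field on backbone
    features, which is then used to warp the previous frame's features toward
    the current frame. All layer choices are assumptions for illustration.
    """

    def __init__(self, channels: int = 256):
        super().__init__()
        # Shallow estimation head operating on the concatenated feature pair.
        self.flow_head = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),  # (dx, dy) per location
        )

    def forward(self, feat_prev: torch.Tensor, feat_cur: torch.Tensor) -> torch.Tensor:
        # Predict feature flow from the previous and current frame features.
        flow = self.flow_head(torch.cat([feat_prev, feat_cur], dim=1))  # (B, 2, H, W)

        # Build a normalized sampling grid shifted by the predicted flow.
        b, _, h, w = flow.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=flow.dtype, device=flow.device),
            torch.arange(w, dtype=flow.dtype, device=flow.device),
            indexing="ij",
        )
        grid_x = (xs + flow[:, 0]) / max(w - 1, 1) * 2 - 1
        grid_y = (ys + flow[:, 1]) / max(h - 1, 1) * 2 - 1
        grid = torch.stack([grid_x, grid_y], dim=-1)  # (B, H, W, 2), values in [-1, 1]

        # Warp the previous frame's features toward the current frame.
        return F.grid_sample(feat_prev, grid, align_corners=True)


if __name__ == "__main__":
    iff = IFFSketch(channels=256)
    prev = torch.randn(1, 256, 38, 50)
    cur = torch.randn(1, 256, 38, 50)
    print(iff(prev, cur).shape)  # torch.Size([1, 256, 38, 50])
```

Because the flow head works on detection features rather than raw pixels, such a design needs no optical-flow pre-training on an external dataset, which matches the abstract's claim of a compact, in-network estimator; how the actual IFF module and TRL are defined is specified in the paper itself.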
