Paper Title

Long Short-Term Relation Networks for Video Action Detection

Authors

Dong Li, Ting Yao, Zhaofan Qiu, Houqiang Li, Tao Mei

Abstract

It has been well recognized that modeling human-object or object-object relations is helpful for detection tasks. Nevertheless, the problem is not trivial, especially when exploring the interactions between human actors, objects, and scenes (collectively, human-context) to boost video action detectors. The difficulty originates from the fact that reliable relations in a video should depend not only on the short-term human-context relations in the present clip but also on the temporal dynamics distilled over a long-range span of the video. This motivates us to capture both short-term and long-term relations in a video. In this paper, we present a new Long Short-Term Relation Network, dubbed LSTR, which aggregates and propagates relations to augment features for video action detection. Technically, a Region Proposal Network (RPN) is remoulded to first produce 3D bounding boxes, i.e., tubelets, in each video clip. LSTR then models short-term human-context interactions within each clip through a spatio-temporal attention mechanism and reasons about long-term temporal dynamics across video clips via Graph Convolutional Networks (GCN) in a cascaded manner. Extensive experiments are conducted on four benchmark datasets, and superior results are reported compared to state-of-the-art methods.

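The abstract describes a two-stage, cascaded design: attention over human-context features within each clip, followed by graph convolution across clips. The following is a minimal PyTorch-style sketch of that cascade for illustration only; the module names, feature shapes, the dot-product attention form, and the softmax similarity graph are our assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShortTermAttention(nn.Module):
    """Within-clip relation module (hypothetical sketch): each actor
    tubelet attends to context features (objects/scene) and absorbs
    the attended values into its own representation."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, actors, context):
        # actors: (N, D) tubelet features; context: (M, D) context features
        scores = self.q(actors) @ self.k(context).t() / actors.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)           # (N, M) relation weights
        return actors + attn @ self.v(context)     # relation-augmented actors


class LongTermGCN(nn.Module):
    """Cross-clip relation module (hypothetical sketch): one
    graph-convolution step over a soft similarity graph whose nodes
    are per-clip actor features."""

    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim)

    def forward(self, nodes):
        # nodes: (T, D), one relation-augmented actor feature per clip
        adj = F.softmax(nodes @ nodes.t(), dim=-1)  # (T, T) soft adjacency
        return F.relu(nodes + adj @ self.w(nodes))


# Cascade: short-term attention per clip, then long-term GCN across clips.
dim = 256
short_term, long_term = ShortTermAttention(dim), LongTermGCN(dim)
clips = [(torch.randn(1, dim), torch.randn(8, dim)) for _ in range(5)]
per_clip = torch.cat([short_term(a, c) for a, c in clips], dim=0)  # (5, D)
features = long_term(per_clip)  # would feed the action classifier
```

The ordering mirrors the abstract's "cascaded manner": short-term attention first enriches each clip's actor features with local human-context relations, and the GCN then propagates those enriched features over the long-range temporal span.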