Paper Title

Temporal Fusion Network for Temporal Action Localization: Submission to ActivityNet Challenge 2020 (Task E)

Authors

Zhiwu Qing, Xiang Wang, Yongpeng Sang, Changxin Gao, Shiwei Zhang, Nong Sang

Abstract

This technical report analyzes the temporal action localization method we used in the HACS competition hosted in the ActivityNet Challenge 2020. The goal of the task is to locate the start and end times of actions in untrimmed videos and to predict the action categories. Firstly, we utilize video-level feature information to train multiple video-level action classification models; in this way, we obtain the categories of the actions in a video. Secondly, we focus on generating high-quality temporal proposals. For this purpose, we apply BMN to generate a large number of proposals and obtain high recall. We then refine these proposals with a cascade-structured network called Refine Network, which predicts position offsets and new IoU scores under the supervision of the ground truth. To make the proposals more accurate, we use a bidirectional LSTM, Non-local blocks, and a Transformer to capture temporal relationships between the local features of each proposal and the global features of the video. Finally, by fusing the results of multiple models, our method obtains 40.55% mAP on the validation set and 40.53% on the test set, achieving Rank 1 in this challenge.
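The cascade refinement described in the abstract can be sketched as follows. Each stage predicts start/end boundary offsets and a new IoU score for every proposal, and the proposal's confidence is fused with that predicted IoU before the next stage. This is a minimal illustrative sketch, not the authors' code: the function names, the `(d_start, d_end, iou)` head interface, and the multiplicative score-fusion rule are all assumptions for clarity.

```python
# Hypothetical sketch of a cascade refinement step: each stage nudges
# proposal boundaries by predicted offsets and re-scores the proposal
# with a predicted IoU. The learned regression head is stood in for by
# any callable mapping (start, end) -> (d_start, d_end, iou).

def refine_stage(proposals, predict):
    """Apply one refinement stage.

    proposals: list of (start, end, score) tuples, times in seconds.
    predict:   callable (start, end) -> (d_start, d_end, iou).
    """
    refined = []
    for start, end, score in proposals:
        d_start, d_end, iou = predict(start, end)
        new_start = max(0.0, start + d_start)
        new_end = max(new_start, end + d_end)
        # Assumed fusion rule: combine old confidence with predicted IoU.
        refined.append((new_start, new_end, score * iou))
    return refined


def cascade_refine(proposals, stages):
    """Run several refinement stages in sequence (the cascade)."""
    for predict in stages:
        proposals = refine_stage(proposals, predict)
    return proposals


if __name__ == "__main__":
    # Toy head: pull both boundaries 0.5 s inward, rate the IoU 0.9.
    toy_head = lambda s, e: (0.5, -0.5, 0.9)
    props = [(10.0, 20.0, 1.0)]
    # Two stages shrink (10, 20) to (11, 19) and decay the score by 0.9 twice.
    print(cascade_refine(props, [toy_head, toy_head]))
```

In the real system each stage would be a learned network supervised by ground-truth offsets and IoUs; the cascade lets later stages operate on already-tightened boundaries.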
