Title
Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification
Authors
Abstract
In this paper, we study the transferability of ImageNet spatial and Kinetics spatio-temporal representations to multi-label Movie Trailer Genre Classification (MTGC). In particular, we present an extensive evaluation of the transferability of ConvNet and Transformer models pretrained on ImageNet and Kinetics to Trailers12k, a new manually curated movie trailer dataset composed of 12,000 videos labeled with 10 different genres and associated metadata. We analyze different aspects that can influence transferability, such as frame rate, input video extension, and spatio-temporal modeling. To reduce the spatio-temporal structure gap between ImageNet/Kinetics and Trailers12k, we propose the Dual Image and Video Transformer Architecture (DIViTA), which performs shot detection to segment the trailer into highly correlated clips, providing a more cohesive input for pretrained backbones and improving transferability (a 1.83% increase for ImageNet and 3.75% for Kinetics). Our results demonstrate that representations learned on either ImageNet or Kinetics are comparably transferable to Trailers12k. Moreover, both datasets provide complementary information that can be combined to improve classification performance (a 2.91% gain compared to the top single pretraining). Interestingly, using lightweight ConvNets as pretrained backbones resulted in only a 3.46% drop in classification performance compared with the top Transformer while requiring only 11.82% of its parameters and 0.81% of its FLOPs.
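For intuition, below is a minimal, hypothetical PyTorch sketch of the dual-backbone idea described in the abstract: shot-segmented clips are fed to a spatial (image-style) branch and a spatio-temporal (video-style) branch, and the two pooled representations are concatenated for multi-label genre scoring. The tiny convolutional branches are stand-ins for the ImageNet- and Kinetics-pretrained backbones the paper evaluates; all names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBackboneClassifier(nn.Module):
    """Illustrative sketch of a dual image/video pipeline for multi-label
    genre classification. The two branches below are small stand-ins for
    the ImageNet-pretrained (spatial) and Kinetics-pretrained
    (spatio-temporal) backbones discussed in the abstract."""

    def __init__(self, num_genres: int = 10, dim: int = 64):
        super().__init__()
        # Stand-in "image" branch: a 2D conv applied frame by frame.
        self.image_backbone = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Stand-in "video" branch: a 3D conv over the whole clip.
        self.video_backbone = nn.Sequential(
            nn.Conv3d(3, dim, kernel_size=(3, 7, 7),
                      stride=(1, 4, 4), padding=(1, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Fused spatial + spatio-temporal features -> per-genre logits.
        self.head = nn.Linear(2 * dim, num_genres)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width),
        # e.g. one shot-detected segment of a trailer.
        b, c, t, h, w = clip.shape
        # Per-frame spatial features, averaged over the clip's frames.
        frames = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        img_feat = self.image_backbone(frames).view(b, t, -1).mean(dim=1)
        # Clip-level spatio-temporal features.
        vid_feat = self.video_backbone(clip).flatten(1)
        # Concatenating both representations reflects the complementary
        # ImageNet/Kinetics information; each genre is scored independently
        # (multi-label setting, trained with e.g. BCEWithLogitsLoss).
        return self.head(torch.cat([img_feat, vid_feat], dim=1))

# Usage: a batch of shot-segmented clips; sigmoid gives per-genre probabilities.
model = DualBackboneClassifier()
clips = torch.randn(2, 3, 16, 112, 112)   # 2 clips, 16 frames each
probs = torch.sigmoid(model(clips))       # shape: (2, 10)
```

Fusing the two pooled feature vectors before the classifier mirrors the abstract's finding that ImageNet and Kinetics pretraining provide complementary information that can be combined to improve classification performance.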