Paper Title

Audio-Visual Fusion Layers for Event Type Aware Video Recognition

Authors

Senocak, Arda; Kim, Junsik; Oh, Tae-Hyun; Ryu, Hyeonggon; Li, Dingzeyu; Kweon, In So

Abstract

The human brain is continuously inundated with multisensory information and its complex interactions coming from the outside world at any given moment. Such information is automatically analyzed by binding or segregating it in our brain. While this task might seem effortless for human brains, it is extremely challenging to build a machine that can perform similar tasks, since the complex interactions cannot be handled by a single type of integration but require more sophisticated approaches. In this paper, we propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme. Unlike previous works where a single type of fusion is used, we design event-specific layers to deal with different audio-visual relationship tasks, enabling different ways of audio-visual formation. Experimental results show that our event-specific layers can discover unique properties of the audio-visual relationships in the videos. Moreover, although our network is formulated with single labels, it can output additional true multi-labels to represent the given videos. We demonstrate that our proposed framework also exposes the modality bias of the video data, both category-wise and dataset-wise, in popular benchmark datasets.
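
The abstract's central idea, one fusion layer per event type trained jointly in a multi-task scheme, can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: the module names (`EventSpecificFusion`, `MultiTaskAVNet`), the feature dimensions, and the three fusion forms (concatenation, product, sum) are assumptions chosen only to show the structure.

```python
# Minimal sketch of event-specific audio-visual fusion layers in a multi-task
# setup. All names, dimensions, and fusion forms here are illustrative
# assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn


class EventSpecificFusion(nn.Module):
    """One fusion layer per event type; `mode` picks how audio and video combine."""

    def __init__(self, dim: int, num_classes: int, mode: str = "sum"):
        super().__init__()
        self.mode = mode
        in_dim = 2 * dim if mode == "concat" else dim
        self.head = nn.Sequential(
            nn.Linear(in_dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        if self.mode == "concat":      # joint use of both modalities
            z = torch.cat([a, v], dim=-1)
        elif self.mode == "product":   # multiplicative, correspondence-style fusion
            z = a * v
        else:                          # additive fusion
            z = a + v
        return self.head(z)


class MultiTaskAVNet(nn.Module):
    """Shared audio/visual encoders feeding one event-specific fusion layer per task."""

    def __init__(self, feat_dim: int = 512, num_classes: int = 10):
        super().__init__()
        self.audio_enc = nn.Linear(128, feat_dim)   # stand-in for an audio backbone
        self.video_enc = nn.Linear(2048, feat_dim)  # stand-in for a visual backbone
        self.tasks = nn.ModuleDict({
            mode: EventSpecificFusion(feat_dim, num_classes, mode)
            for mode in ("concat", "product", "sum")
        })

    def forward(self, audio_feat: torch.Tensor, video_feat: torch.Tensor):
        a, v = self.audio_enc(audio_feat), self.video_enc(video_feat)
        # Each event-specific layer produces its own prediction; a multi-task
        # loss would combine these heads during training.
        return {name: layer(a, v) for name, layer in self.tasks.items()}


if __name__ == "__main__":
    model = MultiTaskAVNet()
    outputs = model(torch.randn(4, 128), torch.randn(4, 2048))
    print({name: tuple(o.shape) for name, o in outputs.items()})
```

Because each head sees the same shared features but fuses them differently, the per-head predictions can diverge, which is one way the abstract's claim of recovering additional labels and exposing modality bias could be probed in such a setup.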
