Title

Sound and Visual Representation Learning with Multiple Pretraining Tasks

Authors

Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool

Abstract

Different self-supervised learning (SSL) tasks reveal different features from the data, and the learned feature representations can exhibit different performance on each downstream task. In this light, this work aims to combine multiple SSL tasks (Multi-SSL) so that the learned representation generalizes well across all downstream tasks. Specifically, we investigate binaural sounds and image data in isolation. For binaural sounds, we propose three SSL tasks, namely spatial alignment, temporal synchronization of foreground objects and binaural audio, and temporal gap prediction. We investigate several approaches to Multi-SSL and give insights into downstream task performance on video retrieval, spatial sound super-resolution, and semantic prediction on the OmniAudio dataset. Our experiments on binaural sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms both single-SSL-task models and fully supervised models in downstream task performance. To check applicability to another modality, we also formulate our Multi-SSL models for image representation learning, using the recently proposed SSL tasks MoCov2 and DenseCL. Here, Multi-SSL surpasses recent methods such as MoCov2, DenseCL, and DetCo by 2.06%, 3.27%, and 1.19% on VOC07 classification and by +2.83, +1.56, and +1.61 AP on COCO detection. Code will be made publicly available.
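To make the Multi-SSL idea concrete, below is a minimal sketch (assuming a PyTorch setup) of the simplest joint variant: several pretext-task heads trained on top of one shared encoder, with the per-task losses summed. This is not the authors' implementation; the abstract reports that incremental learning of the SSL tasks works best, whereas this sketch shows only the basic joint-sum setup, and the names `SharedAudioEncoder`, `TaskHead`, `spatial_alignment`, `temporal_sync`, and `gap_prediction` are hypothetical stand-ins for the three binaural SSL tasks named above.

```python
# Minimal sketch of joint Multi-SSL training: multiple self-supervised
# pretext tasks share one encoder, each with its own lightweight head.
import torch
import torch.nn as nn

class SharedAudioEncoder(nn.Module):
    """Hypothetical shared backbone mapping audio features to an embedding."""
    def __init__(self, in_dim=128, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

class TaskHead(nn.Module):
    """Generic per-task head; each SSL task gets its own classifier."""
    def __init__(self, emb_dim=256, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(emb_dim, n_classes)

    def forward(self, z):
        return self.fc(z)

def multi_ssl_step(encoder, heads, batches, criterion, optimizer):
    """One joint optimization step: sum the losses of all pretext tasks.

    `batches` maps task name -> (inputs, pseudo_labels). In SSL the labels
    are derived from the data itself (e.g. whether a binaural pair is
    spatially aligned), so no human annotation is required.
    """
    optimizer.zero_grad()
    total_loss = 0.0
    for name, (x, y) in batches.items():
        logits = heads[name](encoder(x))
        total_loss = total_loss + criterion(logits, y)
    total_loss.backward()
    optimizer.step()
    return float(total_loss)

if __name__ == "__main__":
    encoder = SharedAudioEncoder()
    heads = nn.ModuleDict({
        "spatial_alignment": TaskHead(),          # aligned vs. misaligned pair
        "temporal_sync": TaskHead(),              # audio/video in sync or not
        "gap_prediction": TaskHead(n_classes=4),  # e.g. bucketed gap length
    })
    optimizer = torch.optim.SGD(
        list(encoder.parameters()) + list(heads.parameters()), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # Dummy self-supervised batches; real pseudo-labels come from the data.
    batches = {
        "spatial_alignment": (torch.randn(8, 128), torch.randint(0, 2, (8,))),
        "temporal_sync": (torch.randn(8, 128), torch.randint(0, 2, (8,))),
        "gap_prediction": (torch.randn(8, 128), torch.randint(0, 4, (8,))),
    }
    print("joint loss:", multi_ssl_step(encoder, heads, batches, criterion, optimizer))
```

In the incremental-learning variant the abstract favors, the tasks would instead be introduced one after another rather than summed in a single loss from the start; the shared-encoder-plus-per-task-head structure stays the same.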
