Paper Title
Audio-Visual Scene Classification Using A Transfer Learning Based Joint Optimization Strategy
Paper Authors
Paper Abstract
Recently, audio-visual scene classification (AVSC) has attracted increasing attention from multidisciplinary communities. Previous studies tended to adopt a pipeline training strategy, which first uses well-trained visual and acoustic encoders to extract high-level representations (embeddings) and then uses them to train an audio-visual classifier. In this way, the extracted embeddings are well suited to uni-modal classifiers, but not necessarily to multi-modal ones. In this paper, we propose a joint training framework that uses acoustic features and raw images directly as inputs for the AVSC task. Specifically, we take the bottom layers of pre-trained image models as the visual encoder, and jointly optimize the scene classifier and a 1D-CNN-based acoustic encoder during training. We evaluate the approach on the development dataset of TAU Urban Audio-Visual Scenes 2021. The experimental results show that our proposed approach achieves significant improvement over the conventional pipeline training strategy. Moreover, our best single system outperforms previous state-of-the-art methods, yielding a log loss of 0.1517 and an accuracy of 94.59% on the official test fold.
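The paper itself ships no code, but the joint optimization strategy the abstract describes can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption, not the authors' exact configuration: the ResNet-50 backbone, the log-mel input shape, the 1D-CNN layer sizes, and the concatenation-based fusion are placeholders. Only the core idea follows the abstract: a single classification loss is back-propagated jointly through the scene classifier, the acoustic encoder, and the bottom layers of a pre-trained image model, instead of pre-training each encoder separately as in the pipeline strategy.

```python
# Minimal sketch of joint audio-visual training (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torchvision.models as models

class JointAVSC(nn.Module):
    def __init__(self, num_classes: int = 10, n_mels: int = 64):
        super().__init__()
        # Visual encoder: bottom layers of a pre-trained image model
        # (ResNet-50 here is an assumption; the final fc layer is dropped).
        resnet = models.resnet50(weights="IMAGENET1K_V1")
        self.visual_encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Acoustic encoder: a small 1D-CNN over log-mel features, trained
        # from scratch jointly with the classifier (sizes are placeholders).
        self.acoustic_encoder = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, padding=2),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        # Scene classifier over the concatenated audio-visual embedding.
        self.classifier = nn.Linear(2048 + 256, num_classes)

    def forward(self, image: torch.Tensor, logmel: torch.Tensor) -> torch.Tensor:
        v = self.visual_encoder(image).flatten(1)    # (B, 2048)
        a = self.acoustic_encoder(logmel).flatten(1) # (B, 256)
        return self.classifier(torch.cat([v, a], dim=1))

# One joint training step: both encoders and the classifier receive
# gradients from the same loss, unlike the pipeline strategy.
model = JointAVSC()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)  # raw image batch
logmels = torch.randn(4, 64, 500)     # (batch, n_mels, frames)
labels = torch.randint(0, 10, (4,))

optimizer.zero_grad()
loss = criterion(model(images, logmels), labels)
loss.backward()
optimizer.step()
```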