Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

Paper Title

Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

Authors

Paul Primus, Gerhard Widmer

Abstract

Standard machine learning models for tagging and classifying acoustic signals cannot handle classes that were not seen during training. Zero-Shot (ZS) learning overcomes this restriction by predicting classes based on adaptable class descriptions. This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning. To this end, we compare the very recent patchout spectrogram transformer with two classic convolutional architectures. We evaluate these three architectures on three tasks and on three different benchmark datasets: general-purpose tagging on AudioSet, environmental sound classification on ESC-50, and instrument tagging on OpenMIC. Our results show that the self-attention-based embedding methods outperform both compared convolutional architectures in all of these settings. By designing training and test data accordingly, we observe that prediction performance suffers significantly when the "semantic distance" between training and new test classes is large, an effect that will deserve more detailed investigations.
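
The idea of predicting classes from class descriptions can be illustrated with a minimal sketch. It assumes that an audio clip and each class description have already been mapped into a shared embedding space, and that the best-matching class is chosen by cosine similarity. The abstract does not specify the scoring function or how descriptions are embedded, so the function names, the similarity measure, and the toy vectors below are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(audio_embedding, class_embeddings):
    """Score every class description against the audio embedding.

    Classes are defined only by their description embeddings, so new
    classes can be added at test time without retraining (assumption:
    both embeddings live in the same space).
    """
    scores = {
        name: cosine(audio_embedding, emb)
        for name, emb in class_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-dimensional embeddings, invented for illustration only.
class_embeddings = {
    "dog bark": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.8, 0.6],
    "violin": [0.1, 0.0, 0.95],
}
audio_embedding = [0.05, 0.7, 0.7]

label, scores = zero_shot_predict(audio_embedding, class_embeddings)
print(label)
```

In a real system the audio embedding would come from a trained audio model (for example one of the architectures compared in the study) and the class embeddings from some encoding of the class descriptions. Both components are outside the scope of this sketch.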
