Paper Title

Automatic Audio Captioning using Attention weighted Event based Embeddings

Authors

Swapnil Bhosale, Rupayan Chakraborty, Sunil Kumar Kopparapu

Abstract

Automatic Audio Captioning (AAC) refers to the task of translating audio into natural language that describes the audio events, the sources of those events, and their relationships. The limited number of samples in current AAC datasets has set a trend of incorporating transfer learning with Audio Event Detection (AED) as a parent task. In this direction, we propose an encoder-decoder architecture with lightweight (i.e., fewer learnable parameters) Bi-LSTM recurrent layers for AAC and compare the performance of two state-of-the-art pre-trained AED models as embedding extractors. Our results show that an efficient AED-based embedding extractor, combined with temporal attention and augmentation techniques, is able to surpass existing literature built on computationally intensive architectures. Further, we provide evidence that the non-uniform attention-weighted encoding generated as part of our model helps the decoder glance over specific sections of the audio while generating each token.
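The temporal attention mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes frame-level AED embeddings scored against a single decoder query vector via dot products, and all names (`temporal_attention`, the toy inputs) are hypothetical.

```python
import numpy as np

def temporal_attention(frame_embeddings, query):
    """Compute a non-uniform, attention-weighted encoding over audio frames.

    frame_embeddings: (T, D) array of per-frame AED embeddings.
    query: (D,) decoder state used to score each frame's relevance.
    Returns the (D,) weighted encoding and the (T,) attention weights.
    """
    scores = frame_embeddings @ query                 # (T,) dot-product relevance
    scores = scores - scores.max()                    # shift for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over time steps
    encoding = weights @ frame_embeddings             # (D,) weighted sum of frames
    return encoding, weights

# Toy example: 4 frames with 3-dim embeddings; the query "attends" to
# frames whose embeddings align with it, yielding non-uniform weights.
emb = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.0])
encoding, weights = temporal_attention(emb, query)
```

Because the weights form a probability distribution over time, the decoder can place more mass on the frames most relevant to the token currently being generated, which is the behavior the abstract attributes to the non-uniform attention-weighted encoding.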
