Paper Title
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
Paper Authors
Paper Abstract
Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. However, the length of the textual description is considerably less than the length of the audio signal, for example, 10 words versus thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that explicitly exploits this difference in length between the two sequences, by applying temporal sub-sampling to the input audio sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as the output of the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach using the freely available Clotho dataset, and we evaluate the impact of different temporal sub-sampling factors. Our results show an improvement on all considered metrics.
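To make the sub-sampling idea concrete, below is a minimal PyTorch sketch of an encoder that drops time steps between its recurrent layers. The layer type (GRU), the hidden size, the feature dimension, and the sub-sampling factor of 2 are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SubSampledEncoder(nn.Module):
    """Sketch of an RNN encoder that temporally sub-samples its
    hidden sequence between recurrent layers. Hyper-parameters here
    are hypothetical placeholders."""

    def __init__(self, n_features: int = 64, hidden: int = 256,
                 sub_sampling_factor: int = 2) -> None:
        super().__init__()
        self.factor = sub_sampling_factor
        self.rnn_1 = nn.GRU(n_features, hidden, batch_first=True)
        self.rnn_2 = nn.GRU(hidden, hidden, batch_first=True)
        self.rnn_3 = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features), with time in the thousands.
        h, _ = self.rnn_1(x)
        h = h[:, ::self.factor, :]   # keep every factor-th time step
        h, _ = self.rnn_2(h)
        h = h[:, ::self.factor, :]   # sub-sample again before the last RNN
        _, last = self.rnn_3(h)
        # Fixed-length encoder output: the final hidden state.
        return last.squeeze(0)       # (batch, hidden)

# Usage: a clip of 430 feature vectors shrinks to ~108 time steps
# after two sub-sampling stages, then to one fixed-length vector.
enc = SubSampledEncoder()
audio = torch.randn(4, 430, 64)
print(enc(audio).shape)  # torch.Size([4, 256])
```

Each sub-sampling stage reduces the temporal resolution by the chosen factor, so the later recurrent layers operate on progressively shorter sequences, narrowing the gap between the length of the audio input and the length of the caption.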