Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio

Paper title

Towards Robust Unsupervised Disentanglement of Sequential Data -- A Case Study Using Music Audio

Authors

Luo, Yin-Jyun, Ewert, Sebastian, Dixon, Simon

Abstract

Disentangled sequential autoencoders (DSAEs) represent a class of probabilistic graphical models that describes an observed sequence with dynamic latent variables and a static latent variable. The former encode information at a frame rate identical to the observation, while the latter globally governs the entire sequence. This introduces an inductive bias and facilitates unsupervised disentanglement of the underlying local and global factors. In this paper, we show that the vanilla DSAE suffers from being sensitive to the choice of model architecture and capacity of the dynamic latent variables, and is prone to collapse the static latent variable. As a countermeasure, we propose TS-DSAE, a two-stage training framework that first learns sequence-level prior distributions, which are subsequently employed to regularise the model and facilitate auxiliary objectives to promote disentanglement. The proposed framework is fully unsupervised and robust against the global factor collapse problem across a wide range of model configurations. It also avoids typical solutions such as adversarial training which usually involves laborious parameter tuning, and domain-specific data augmentation. We conduct quantitative and qualitative evaluations to demonstrate its robustness in terms of disentanglement on both artificial and real-world music audio datasets.
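
The latent structure described in the abstract can be illustrated with a small sampling sketch: one static latent variable is drawn once per sequence, while a dynamic latent variable is drawn for every frame. This is only a toy illustration under stated assumptions, not the authors' model or the TS-DSAE training procedure. The function name `sample_dsae_sequence`, the linear decoder `W_dec`, the first-order transition matrix `A`, and all dimensions are hypothetical; a real DSAE would use learned neural networks in their place.

```python
import numpy as np

def sample_dsae_sequence(T=8, static_dim=4, dynamic_dim=2, obs_dim=16, seed=0):
    """Ancestral sampling from a toy DSAE-style generative model (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Hypothetical linear decoder and transition weights (assumptions, not from the paper).
    W_dec = rng.normal(size=(static_dim + dynamic_dim, obs_dim))
    A = 0.9 * np.eye(dynamic_dim)  # assumed first-order transition for the dynamic prior

    # Static latent variable: sampled once and shared by every frame of the sequence.
    s = rng.normal(size=static_dim)

    z = np.zeros((T, dynamic_dim))
    x = np.zeros((T, obs_dim))
    z_prev = np.zeros(dynamic_dim)
    for t in range(T):
        # Dynamic latent variable: one per frame, conditioned on the previous frame's latent.
        z[t] = A @ z_prev + 0.1 * rng.normal(size=dynamic_dim)
        # Each frame is decoded from the shared static latent and its own dynamic latent.
        x[t] = np.concatenate([s, z[t]]) @ W_dec
        z_prev = z[t]
    return s, z, x
```

Calling `sample_dsae_sequence(T=5)` returns a single static vector, five dynamic vectors, and five observed frames. The static latent variable collapse mentioned in the abstract corresponds to a trained model in which the decoder effectively ignores `s` and the dynamic latents carry all of the information.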
