Paper Title
Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation
Paper Authors
Paper Abstract
Symbolic music generation aims to generate music scores automatically. A recent trend is to use Transformer or its variants in music generation, which is, however, suboptimal, because full attention cannot efficiently model the typically long music sequences (e.g., over 10,000 tokens), and existing models have shortcomings in generating musical repetition structures. In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token attends only to a summarization of each of the other bars, rather than to their individual tokens, so as to reduce the computational cost. The advantages are two-fold. First, it can capture both music structure-related correlations via the fine-grained attention and other contextual information via the coarse-grained attention. Second, it is efficient and can model music sequences over 3x longer than its full-attention counterpart can. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structures.
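To make the attention pattern described in the abstract concrete, below is a minimal sketch of how such a fine-/coarse-grained attention mask could be built. This is an illustrative reconstruction, not the authors' released implementation: it assumes one summary token per bar, causal decoding, and the structure-related offsets (the previous 1st, 2nd, 4th and 8th bars) that the abstract gives as examples; the names (build_fc_attention_mask, bar_ids, is_summary) are hypothetical.

```python
# Illustrative sketch of a Museformer-style fine-/coarse-grained attention
# mask (assumptions: one summary token per bar, causal decoding, and
# structure-related offsets {1, 2, 4, 8}; all names are hypothetical).
import torch

STRUCTURE_RELATED_OFFSETS = (1, 2, 4, 8)  # example offsets from the abstract

def build_fc_attention_mask(bar_ids: torch.Tensor, is_summary: torch.Tensor) -> torch.Tensor:
    """bar_ids: (L,) int tensor, bar index of each token.
    is_summary: (L,) bool tensor, True at each bar's summary token.
    Returns an (L, L) bool mask where True means the query may attend to the key."""
    L = bar_ids.numel()
    q_bar = bar_ids.view(L, 1)  # bar of each query position
    k_bar = bar_ids.view(1, L)  # bar of each key position
    causal = torch.tril(torch.ones(L, L, dtype=torch.bool))

    # Fine-grained: attend token-by-token to the current bar and to the
    # structure-related previous bars (1st, 2nd, 4th and 8th before it).
    fine = q_bar == k_bar
    for off in STRUCTURE_RELATED_OFFSETS:
        fine = fine | ((q_bar - k_bar) == off)

    # Coarse-grained: for every other previous bar, attend only to its
    # single summary token instead of all of its tokens.
    coarse = (k_bar < q_bar) & is_summary.view(1, L)

    return (fine | coarse) & causal

# Toy usage: 10 bars of 4 tokens each, with the first token of each bar
# serving as its summary. The boolean mask can be passed as attn_mask to,
# e.g., torch.nn.functional.scaled_dot_product_attention.
tokens_per_bar, n_bars = 4, 10
bar_ids = torch.arange(n_bars).repeat_interleave(tokens_per_bar)
is_summary = (torch.arange(n_bars * tokens_per_bar) % tokens_per_bar) == 0
mask = build_fc_attention_mask(bar_ids, is_summary)
```

Under this pattern, the number of keys a token attends to grows roughly with the number of bars (one summary token per distant bar) rather than with the full token count, which is where the claimed efficiency gain over full attention comes from.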