Paper Title

Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity

Paper Authors

Sourya Basu, Govardana Sachitanandam Ramachandran, Nitish Shirish Keskar, Lav R. Varshney

Paper Abstract

Neural text decoding is important for generating high-quality texts using language models. To generate high-quality text, popular decoding algorithms like top-k, top-p (nucleus), and temperature-based sampling truncate or distort the unreliable low probability tail of the language model. Though these methods generate high-quality text after parameter tuning, they are ad hoc. Not much is known about the control they provide over the statistics of the output, which is important since recent reports show text quality is highest for a specific range of likelihoods. Here, first we provide a theoretical analysis of perplexity in top-k, top-p, and temperature sampling, finding that cross-entropy behaves approximately linearly as a function of p in top-p sampling whereas it is a nonlinear function of k in top-k sampling, under Zipfian statistics. We use this analysis to design a feedback-based adaptive top-k text decoding algorithm called mirostat that generates text (of any length) with a predetermined value of perplexity, and thereby high-quality text without any tuning. Experiments show that for low values of k and p in top-k and top-p sampling, perplexity drops significantly with generated text length, which is also correlated with excessive repetitions in the text (the boredom trap). On the other hand, for large values of k and p, we find that perplexity increases with generated text length, which is correlated with incoherence in the text (confusion trap). Mirostat avoids both traps: experiments show that cross-entropy has a near-linear relation with repetition in generated text. This relation is almost independent of the sampling method but slightly dependent on the model used. Hence, for a given language model, control over perplexity also gives control over repetitions. Experiments with human raters for fluency, coherence, and quality further verify our findings.
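
The abstract describes mirostat as a feedback-based adaptive top-k sampler that keeps the observed surprise (negative log probability) of sampled tokens near a chosen target, and hence controls perplexity. The sketch below is a minimal illustration of that feedback idea only, under stated assumptions, and is not the authors' published algorithm: the function name mirostat_like_step, the learning rate, the toy Zipfian distribution standing in for a language model, and the specific threshold update are all assumptions made here for illustration.

import numpy as np

def mirostat_like_step(probs, target_surprise, mu, learning_rate=0.1, rng=None):
    # One decoding step of a feedback-controlled top-k sampler (illustrative sketch,
    # not the published mirostat update rule).
    # probs: next-token probabilities from a language model (a toy stand-in below).
    # target_surprise: desired surprise (negative log probability) in nats.
    # mu: running surprise threshold, adjusted by feedback after every step.
    rng = rng if rng is not None else np.random.default_rng()
    order = np.argsort(probs)[::-1]            # token ids sorted by decreasing probability
    sorted_p = probs[order]
    surprises = -np.log(sorted_p)              # surprise of each candidate token
    k = max(1, int(np.sum(surprises < mu)))    # adaptive top-k: keep tokens below the threshold
    head = sorted_p[:k] / sorted_p[:k].sum()   # renormalize the truncated head
    idx = rng.choice(k, p=head)
    token = int(order[idx])
    observed = -np.log(probs[token])           # surprise of the token actually sampled
    mu = mu - learning_rate * (observed - target_surprise)   # feedback toward the target
    return token, mu

# Toy usage: a Zipf-like distribution stands in for model probabilities (hypothetical).
vocab_size = 50_000
ranks = np.arange(1, vocab_size + 1)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)
target = 3.0                                   # target surprise in nats
mu = 2.0 * target                              # initial threshold, set here to twice the target
for _ in range(5):
    token, mu = mirostat_like_step(probs, target, mu)
    print(token, round(mu, 3))

The point of the loop is the one the abstract makes: instead of fixing k or p ahead of time, the truncation is adjusted online so that the running surprise, and therefore the perplexity of the generated text, stays near the chosen target regardless of output length.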
