Paper Title
A Measure-Theoretic Characterization of Tight Language Models
Paper Authors
Paper Abstract
Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can "leak" onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
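To make the notion of leakage concrete, here is a minimal Python sketch of a toy autoregressive model. It is an illustration only and is not taken from the paper: it assumes the end-of-sequence (EOS) probability depends solely on the position `t`, and the names `terminated_mass`, `constant_eos`, and `vanishing_eos` are hypothetical. Under that assumption, the mass placed on finite strings of length at most `max_len` is one minus the probability that no EOS has been emitted in the first `max_len` steps.

```python
def terminated_mass(p_eos, max_len):
    """Mass on finite strings: probability that EOS is emitted within max_len steps.

    p_eos(t) is the probability of emitting EOS at step t given that the
    string has not ended yet (assumed here to depend only on the position t).
    """
    alive = 1.0  # probability that no EOS has been emitted so far
    for t in range(1, max_len + 1):
        alive *= 1.0 - p_eos(t)
    return 1.0 - alive


def constant_eos(t):
    # EOS probability bounded away from zero: every string ends almost surely.
    return 0.1


def vanishing_eos(t):
    # EOS probability decays like 1 / t^2: too fast for termination to be certain.
    return 1.0 / (t + 1) ** 2


tight_mass = terminated_mass(constant_eos, 10_000)   # approaches 1
leaky_mass = terminated_mass(vanishing_eos, 10_000)  # approaches 1/2
print(f"constant EOS:  {tight_mass:.6f}")
print(f"vanishing EOS: {leaky_mass:.6f}")
```

In this toy setting the first model is tight because its finite-string mass tends to 1, while the second is not: the product of `1 - 1/(t+1)^2` over all `t >= 1` converges to 1/2, so about half of the probability mass ends up on infinite sequences.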