Paper Title

BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing

Paper Authors

Subhro Roy, Sam Thomson, Tongfei Chen, Richard Shin, Adam Pauls, Jason Eisner, Benjamin Van Durme

Paper Abstract

Recent work has shown that generation from a prompted or fine-tuned language model can perform well at semantic parsing when the output is constrained to be a valid semantic representation. We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing, that includes context-free grammars for seven semantic parsing datasets and two syntactic parsing datasets with varied output representations, as well as a constrained decoding interface to generate only valid outputs covered by these grammars. We provide low, medium, and high resource splits for each dataset, allowing accurate comparison of various language models under different data regimes. Our benchmark supports evaluation of language models using prompt-based learning as well as fine-tuning. We benchmark eight language models, including two GPT-3 variants available only through an API. Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
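
To make the "constrained decoding interface" concrete, here is a minimal, self-contained sketch of the general idea: at each decoding step, any next token that could not be extended into a complete output covered by the grammar is masked out, so the model can only ever produce valid parses. The toy grammar, the mock LM, and all names here (`is_valid_prefix`, `mock_lm_scores`, etc.) are hypothetical illustrations, not BenchCLAMP's actual API.

```python
# A minimal sketch of grammar-constrained decoding. Everything here (the
# toy grammar, the mock LM, the function names) is illustrative only.

import random

# Toy language: nested calls such as  a,  f(b),  g(f(a),b)
VOCAB = ["f", "g", "a", "b", "(", ")", ",", "<eos>"]

def expressions(depth):
    """Enumerate every complete token sequence of the toy grammar
    E -> 'a' | 'b' | ('f'|'g') '(' E (',' E)? ')', up to a nesting depth."""
    if depth == 0:
        return
    yield ["a"]
    yield ["b"]
    for fn in ("f", "g"):
        for arg1 in expressions(depth - 1):
            yield [fn, "("] + arg1 + [")"]
            for arg2 in expressions(depth - 1):
                yield [fn, "("] + arg1 + [","] + arg2 + [")"]

# Precompute the (finite) toy language so prefix validity is a simple scan.
LANG = [e + ["<eos>"] for e in expressions(3)]

def is_valid_prefix(toks):
    """True iff `toks` can still grow into a complete grammatical output."""
    return any(s[: len(toks)] == toks for s in LANG)

def mock_lm_scores(prefix):
    """Stand-in for a real LM's next-token scores (deterministic noise)."""
    rng = random.Random(len(prefix))
    return {t: rng.random() for t in VOCAB}

def constrained_greedy_decode(max_steps=20):
    out = []
    for _ in range(max_steps):
        scores = mock_lm_scores(out)
        # The constraint: drop any token that would leave the grammar.
        legal = [t for t in VOCAB if is_valid_prefix(out + [t])]
        best = max(legal, key=scores.get)
        if best == "<eos>":
            break
        out.append(best)
    return out

print(" ".join(constrained_greedy_decode()))  # always a grammatical expression
```

A real system would not enumerate the language; it would track grammar state incrementally (e.g., with an incremental or Earley-style parser) and apply the resulting token mask to the LM's logits at each step. The sketch only shows the invariant such an interface enforces: every decoded prefix remains extendable to a valid output.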
