Paper Title
Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
Paper Authors
Paper Abstract
We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question arises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al., 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization), which learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al., 2017)), based on both automatic and human evaluations.
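To make the core idea concrete, below is a minimal, hedged sketch of what "optimizing a language generator against a reward function" can look like with a HuggingFace model: sample a continuation from the LM (the policy), score it with a reward function, and apply a plain REINFORCE-style policy-gradient update. This is not the RL4LMs API, the NLPO algorithm, or PPO; the toy length-based reward and the single-sample update loop are assumptions made purely for illustration of the general on-policy setup described in the abstract.

```python
# Illustrative sketch only: reward-driven fine-tuning of a HuggingFace LM with a
# REINFORCE-style update. Not the RL4LMs library API; the reward function and
# training loop below are hypothetical stand-ins.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def toy_reward(text: str) -> float:
    # Hypothetical stand-in for an automated measure of human preference
    # (e.g., a learned preference model or a task metric, as in GRUE).
    return min(len(text.split()), 20) / 20.0

prompt = "Summarize: reinforcement learning can align language models"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
prompt_len = inputs["input_ids"].shape[1]

for step in range(3):  # a few illustrative updates
    # Sample a continuation from the current policy (the LM itself).
    with torch.no_grad():
        generated = model.generate(
            **inputs, do_sample=True, max_new_tokens=20,
            pad_token_id=tokenizer.eos_token_id,
        )
    continuation = generated[:, prompt_len:]
    reward = toy_reward(tokenizer.decode(continuation[0]))

    # Recompute log-probabilities of the sampled tokens with gradients enabled.
    logits = model(generated).logits[:, :-1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, generated[:, 1:].unsqueeze(-1)).squeeze(-1)
    gen_log_prob = token_log_probs[:, prompt_len - 1:].sum()

    # REINFORCE: maximize the reward-weighted log-likelihood of the sampled text.
    loss = -reward * gen_log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step={step} reward={reward:.2f} loss={loss.item():.2f}")
```

In practice, the paper's methods replace this naive update with on-policy algorithms such as PPO or NLPO (which additionally masks out low-probability actions to shrink the combinatorial action space), and replace the toy reward with task-specific automated preference measures as in GRUE.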