
LLMs:《OPT: Open Pre-trained Transformer Language Models》翻译与解读

导读:本文主要介绍开放预训练Transformer(Open Pre-trained Transformers,OPT),这是一套仅含解码器的自回归预训练语言模型,参数规模从125M到175B。其中OPT-175B是一个具有1750亿参数的模型,在公开数据集上训练而成,也是首个向更广泛的人工智能研究社区开放的该规模模型。Meta AI 共享这一模型,是希望有更多社区成员参与理解大模型的基础技术。

>> 对标GPT-3模型并公开详细细节:作者的目标是复现GPT-3类模型的性能和规模,并应用数据策划和训练效率方面的最佳实践。他们描述了模型的训练细节,并在多个自然语言处理和对话设置中评估了性能。OPT是以开放方式发布(模型权重、代码和训练日志)的Transformer语言模型,训练数据包括从互联网上收集的大规模文本。研究人员同时提供了代码和预训练权重,使其能够被研究人员和开发者广泛使用,为语言模型的进一步研究和应用提供了便利。

>> 碳足迹仅为GPT-3的1/7(FSDP【Meta】+张量并行【NVIDIA】):Meta AI 在开发 OPT-175B 时考虑到了能源效率,其碳足迹仅为 GPT-3 的 1/7。这是通过将 Meta 的开源全切片数据并行 (FSDP) API 与 NVIDIA 在 Megatron-LM 中的张量并行抽象相结合来实现的。

>> 提出了一种新的开放预训练方法(目的是为促进学术研究和交流):用于构建可扩展、高效的自然语言处理模型。通过共享预训练权重和数据集,可以加速模型的开发和部署。实验结果表明,这种方法可以提高模型的性能和鲁棒性,适用于多种自然语言处理任务。因为绝大多数大语言模型训练成本高昂,导致大部分研究人员都无法负担大语言模型的训练或使用;同时,各大企业发布的大语言预训练模型由于商业目的也都无法完整访问模型权重,只能通过 API 调用获取结果,阻碍了学术的交流与研究。

>> 探讨模型负责任性+伦理性:作者还讨论了模型的限制以及发布这些模型时需要考虑的负责任因素。他们认为整个AI社区将受益于共同制定负责任的大型语言模型指南,并希望广泛访问这类模型将增加对伦理考虑的多样性。

目录

《OPT: Open Pre-trained Transformer Language Models》翻译与解读

Abstract

1、Introduction

2、Method方法

2.1、Models模型

2.2 Training Setup训练设置

2.3 Pre-training Corpus预训练语料库

RoBERTa

The Pile

PushShift.io Reddit

2.4、Training Efficiency训练效率

2.5、Training Processes训练过程

硬件故障、损失发散和其他中途更改 Hardware Failures、Loss Divergences、Other Mid-flight Changes

3、Evaluations评估

3.1 Prompting & Few-Shot提示和少样本学习

零样本学习Zero-shot

一次和少次样本学习One-shot and Few-shot

3.2、Dialogue对话

4 Bias & Toxicity Evaluations偏见和有害性评估

4.1 Hate Speech Detection仇恨言论检测

4.2 CrowS-Pairs

4.3 StereoSet

4.4 Real Toxicity Prompts

4.5 Dialogue Safety Evaluations对话安全性评估

5 Limitations限制

6、Considerations for Release发布考虑因素

7 Related Work相关工作

8 Conclusion结论

Acknowledgements致谢


《OPT: Open Pre-trained Transformer Language Models》翻译与解读

时间

2022年5月2日

地址

论文地址:https://arxiv.org/abs/2205.01068

开源代码和小规模预训练模型
GitHub地址:https://github.com/facebookresearch/metaseq(facebookresearch/metaseq: Repo for external large-scale work)

OPT-175B模型

https://docs.google.com/forms/d/e/1FAIpQLSe4IP4N6JkCEMpCP-yY71dIUPHngVReuOmQKDEI1oHFUaVg7w/viewform


作者

Meta AI

Abstract

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

大型语言模型通常需要数十万个计算日进行训练,展现出了令人瞩目的零样本和少样本学习能力。鉴于其计算成本,这些模型很难在没有大量资金的情况下被复现。对于少数可以通过API使用的模型,研究者也无法获取完整的模型权重,这使得对它们的研究变得困难。我们提出了开放预训练Transformers(OPT),这是一套仅含解码器的预训练Transformer,参数规模从125M到175B,我们的目标是与感兴趣的研究人员充分而负责任地共享这些模型。我们展示了OPT-175B的性能可与GPT-3相媲美,而开发所需的碳足迹仅为后者的1/7。我们还发布了详细记录我们所遇到的基础设施挑战的日志,以及用于试验所有已发布模型的代码。

1、Introduction

Large language models (LLMs) trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning (Brown et al., 2020; Lieber et al., 2021; Smith et al., 2022; Rae et al., 2021; Chowdhery et al., 2022). While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. This restricted access has limited researchers’ ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.

In this technical report, we present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.

在大规模文本集上训练的大型语言模型(LLM)展现出令人惊讶的文本生成以及零样本和少样本学习能力(Brown等人,2020年;Lieber等人,2021年;Smith等人,2022年;Rae等人,2021年;Chowdhery等人,2022年)。虽然在某些情况下公众可以通过付费API与这些模型互动,但完整的模型访问目前仅限于少数资源丰富的实验室。这种受限的访问限制了研究人员探究这些大型语言模型工作原理和机制的能力,阻碍了在鲁棒性、偏见和有害性等领域改进已知挑战的进展。

在本技术报告中,我们介绍了开放预训练变换器(OPT),这是一套仅包含解码器的预训练变换器,参数范围从125M到175B不等,我们希望与感兴趣的研究人员全面而负责任地分享。我们训练OPT模型,以大致匹配GPT-3系列模型的性能和大小,同时应用最新的数据收集和高效训练的最佳实践。我们开发这套OPT模型的目的是实现可复现和负责任的大规模研究,并在研究这些LLM的影响时引入更多的声音。风险、伤害、偏见、有害性等定义应该由整个研究界作为一个整体明确阐述,这只有在模型可供研究时才有可能。

We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories. We are also releasing both the logbook of our model creation as well as our codebase, metaseq, which enabled training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hardware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. While this is a significant achievement, the energy cost of creating such a model is still nontrivial, and repeated efforts to replicate a model of this size will only amplify the growing compute footprint of these LLMs.

We believe the entire AI community — academic researchers, civil society, policymakers, and industry — must work together to develop clear guidelines around responsible AI in general and responsible LLMs in particular, given their centrality in many downstream language applications.

我们将发布所有参数规模在125M到66B之间的模型,并根据申请向符合条件的研究者提供OPT-175B的完整研究访问权限,包括学术研究人员、隶属于政府、民间社会和学术机构的人员,以及工业研究实验室的研究人员。我们还发布了模型创建过程的日志以及我们的代码库metaseq,它使得在992个80GB的A100 GPU上训练OPT-175B成为可能,每个GPU的利用率达到147 TFLOP/s。通过这一实现,并借助最新一代的NVIDIA硬件,我们开发OPT-175B所需的碳足迹仅为GPT-3的1/7。虽然这是一项重大成就,但创建这样一个模型的能源成本仍然不可忽视,而反复复现这种规模的模型只会进一步扩大这些LLM不断增长的计算足迹。

我们认为,整个人工智能社区(学术研究人员、民间社会、决策者和工业界)必须共同努力,制定关于负责任人工智能、尤其是负责任LLM的明确准则,因为它们在许多下游语言应用中处于核心地位。

Table 1: Model architecture details. We report the number of layers (#L), number of attention heads (#H), and the embedding size (dmodel). We also report the peak Learning Rate (LR) and global batch size in number of tokens (Batch).

表1:模型架构细节。我们报告层数(#L)、注意头数(#H)和嵌入大小(dmodel)。我们还报告了峰值学习率(LR)和以标记数表示的全局批大小(Batch)。

2 Method方法

2.1 Models模型

We present results on eight Transformer language models ranging from 125 million to 175 billion parameters. Architectural details are displayed in Table 1. In the interest of transparency, and to reduce risk of training instabilities, our models and hyperparameters largely follow Brown et al. (2020), with variations in batch size mostly to obtain increased computational efficiency.

我们展示了八个Transformer语言模型,参数范围从1.25亿到1750亿。表1显示了架构细节。为了透明起见,并减少训练不稳定性的风险,我们的模型和超参数主要遵循Brown等人(2020)的方法,只有批大小有所变化,主要是为了提高计算效率。

2.2 Training Setup训练设置

For weight initialization, we follow the same settings provided in the Megatron-LM codebase, using a normal distribution with zero mean and standard deviation of 0.006. The standard deviation for output layers is scaled by a 1.0/√(2L) term where L is the total number of layers. All bias terms are initialized as 0, and all models are trained with ReLU activation and a sequence length of 2048.

We use an AdamW optimizer (Loshchilov and Hutter, 2017) with (β1, β2) set to (0.9, 0.95), and weight decay of 0.1. We follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in our smaller baselines, and decaying down to 10% of the maximum LR over 300B tokens. A number of mid-flight changes to LR were also required (see Section 2.5). Our batch sizes range from 0.5M to 4M depending on the model size (see Table 1) and is kept constant throughout the course of training.

We use a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We clip gradient norms at 1.0, except for some mid-flight changes that reduce this threshold down from 1.0 to 0.3 (see Section 2.5). We also include a gradient predivide factor to reduce the risk of over/underflows when computing the gradient across all ranks (splitting the division by the world size of N into two division operations by √N).

对于权重初始化,我们遵循Megatron-LM代码库中提供的相同设置,使用均值为0、标准差为0.006的正态分布。输出层的标准差通过一个1.0/√2L的项进行缩放,其中L是总层数。所有的偏置项都初始化为0,并且所有的模型都使用ReLU激活函数和序列长度为2048进行训练。

我们使用AdamW优化器(Loshchilov和Hutter,2017),将(β1,β2)设置为(0.9,0.95),权重衰减为0.1。我们遵循线性学习率调度,在OPT-175B的前2000个步骤或我们较小的基准模型的375M个标记上,从0逐渐升高到最大学习率,并在300B个标记上衰减到最大LR的10%。还需要对学习率进行一些中途更改(见第2.5节)。我们的批大小根据模型大小而变化,范围从0.5M到4M,并在整个训练过程中保持不变。

我们全程使用0.1的dropout,但不对嵌入层应用任何dropout。我们将梯度范数裁剪到1.0,只有一些中途调整将该阈值从1.0降低到0.3(见第2.5节)。我们还引入了一个梯度预除因子,以降低在所有rank上计算梯度时发生上溢/下溢的风险(将除以world size N的操作拆分为两次除以√N的操作)。
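
为便于理解上述训练设置,下面给出一个示意性的PyTorch代码片段(并非论文官方实现):按正态分布N(0, 0.006)初始化权重、输出层标准差缩放1/√(2L)、使用(β1, β2)=(0.9, 0.95)和权重衰减0.1的AdamW,以及"线性升温+线性衰减到峰值10%"的学习率调度。其中NUM_LAYERS、MAX_LR、DECAY_STEPS和占位模型均为演示用的假设。

```python
import math
import torch
import torch.nn as nn

NUM_LAYERS = 12          # 总层数 L(占位)
MAX_LR = 3e-4            # 峰值学习率(占位)
WARMUP_STEPS = 2000      # OPT-175B 在前 2000 步线性升温
DECAY_STEPS = 100_000    # 对应"在 300B 个 token 上衰减到峰值的 10%",此处用步数占位

model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(NUM_LAYERS)])  # 占位模型

def init_weights(module, is_output_layer=False):
    """正态分布 N(0, 0.006);输出层标准差再乘以 1/sqrt(2L);偏置置 0。"""
    if isinstance(module, nn.Linear):
        std = 0.006 / math.sqrt(2 * NUM_LAYERS) if is_output_layer else 0.006
        nn.init.normal_(module.weight, mean=0.0, std=std)
        nn.init.zeros_(module.bias)

for i, layer in enumerate(model):
    init_weights(layer, is_output_layer=(i == NUM_LAYERS - 1))

# AdamW: (beta1, beta2) = (0.9, 0.95), weight decay = 0.1
optimizer = torch.optim.AdamW(model.parameters(), lr=MAX_LR,
                              betas=(0.9, 0.95), weight_decay=0.1)

def lr_at_step(step):
    """前 WARMUP_STEPS 步从 0 线性升到 MAX_LR,随后线性衰减到 MAX_LR 的 10%。"""
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = min(1.0, (step - WARMUP_STEPS) / (DECAY_STEPS - WARMUP_STEPS))
    return MAX_LR * (1.0 - 0.9 * progress)

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: lr_at_step(step) / MAX_LR)
```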

2.3 Pre-training Corpus预训练语料库

The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021). All corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via CommonCrawl.

We removed duplicated documents across all datasets by filtering out documents via MinhashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity ≥ .95. We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing.

We tokenize all corpora using the GPT-2 byte level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019; Brown et al., 2020). Our final corpus contains roughly 180B tokens.

预训练语料库包含RoBERTa(Liu等人,2019b)、Pile(Gao等人,2021a)和PushShift.io Reddit(Baumgartner等人,2020;Roller等人,2021)中使用的数据集的串联。所有的语料库在之前已经被收集或筛选,以主要包含英文文本,但仍然有少量非英文数据通过CommonCrawl存在于语料库中。

我们通过使用Jaccard相似度≥0.95的Min-hashLSH(Rajaraman和Ullman,2011)筛选出所有数据集中的重复文档。我们发现Pile数据集中有大量重复的文档,并建议将来使用Pile的研究人员进行额外的去重处理。

我们使用GPT-2字节级BPE分词器(Sennrich等人,2016;Radford等人,2019;Brown等人,2020)对所有语料进行分词。我们最终的语料库包含大约1800亿个标记。
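
下面是一个使用第三方库datasketch演示MinHashLSH近似去重思路的示意片段(阈值对应正文中的Jaccard相似度≥0.95);documents为虚构的示例输入,分词方式也只是演示,并非论文的实际去重实现。

```python
from datasketch import MinHash, MinHashLSH

documents = {
    "doc1": "the quick brown fox jumps over the lazy dog",
    "doc2": "the quick brown fox jumped over the lazy dog",
    "doc3": "large language models show emergent capabilities",
}

def minhash_of(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():          # 实际实现通常会使用 shingles / n-gram
        m.update(token.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.95, num_perm=128)  # 对应 Jaccard 相似度 ≥ 0.95
kept = []
for doc_id, text in documents.items():
    m = minhash_of(text)
    if lsh.query(m):                    # 已存在近似重复文档,则丢弃当前文档
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)

print("去重后保留:", kept)
```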

RoBERTa

We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CC-News v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).

我们包括RoBERTa语料库的BookCorpus(Zhu等人,2015)和Stories(Trinh和Le,2018)子集,并使用更新的CCNews,其中包含截至2021年9月28日爬取的新闻故事。这个CC-News v2语料库的预处理方式与原始的RoBERTa CCNews(Liu等人,2019b)相同。

The Pile

We included a subset of the Pile (Gao et al., 2021a), including: CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were eliminated as we found they increased the risk of instabilities, as measured by tendency to cause spikes in gradient norms at the 1.3B scale, or were otherwise deemed unsuitable. All subsets went through additional ad-hoc whitespace normalization.

我们包括了Pile的一个子集(Gao等人,2021a),其中包括:CommonCrawl、DM Mathematics、Project Gutenberg、HackerNews、OpenSubtitles、OpenWebText2、USPTO和Wikipedia。Pile的其他子集被排除在外,因为我们发现它们会增加训练不稳定的风险(以在13亿参数规模下引发梯度范数突增的倾向来衡量),或者被认为不适合使用。所有子集都额外经过了临时性的空白字符标准化处理。

PushShift.io Reddit

We included a subset of the Pushshift.io corpus produced by Baumgartner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees into language-model-accessible documents, we extracted the longest chain of comments in each thread and discarded all other paths in the tree. This reduced the corpus by about 66%.

我们包括了由Baumgartner等人(2020)制作的Pushshift.io语料库的一个子集,并且之前被Roller等人(2021)使用过。为了将对话树转化为语言模型可访问的文档,我们提取了每个线程中最长的评论链,并且丢弃了树中的其他路径。这将语料库减少了约66%。

2.4、Training Efficiency训练效率

We trained OPT-175B on 992 80GB A100 GPUs, by utilizing Fully Sharded Data Parallel (Artetxe et al., 2021) with Megatron-LM Tensor Parallelism (Shoeybi et al., 2019). We achieve utilization of up to 147 TFLOP/s per GPU. We keep Adam state in FP32, since we shard it across all hosts, while the model weights remained in FP16. To avoid underflows, we used dynamic loss scaling, as described in Micikevicius et al. (2017).

我们使用了992个80GB的A100 GPU,在Megatron-LM Tensor Parallelism (Shoeybi et al., 2019)的基础上,利用Fully Sharded Data Parallel (Artetxe et al., 2021)对OPT-175B进行了训练。我们每个GPU的利用率达到了147 TFLOP/s。我们将Adam状态保持在FP32精度上,因为我们将其在所有主机上进行了分片,而模型权重保持在FP16精度上。为了避免下溢,我们使用了动态损失缩放,如Micikevicius等人(2017)所述。
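
下面的示意片段只演示本节提到的"动态损失缩放"机制(Micikevicius et al., 2017)的一般做法,借助PyTorch自带的GradScaler实现;FSDP全切片数据并行与Megatron-LM张量并行等分布式细节此处省略。模型、数据和超参数均为占位假设,并假设在CUDA设备上运行。

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()   # 动态维护损失缩放因子,检测到溢出时自动降低

for step in range(10):
    x = torch.randn(32, 512, device=device)
    with torch.autocast(device_type="cuda", dtype=torch.float16):  # 前向用 FP16
        loss = model(x).pow(2).mean()
    optimizer.zero_grad(set_to_none=True)
    scaler.scale(loss).backward()              # 先放大损失再反向,避免 FP16 梯度下溢
    scaler.unscale_(optimizer)                 # 裁剪前先把梯度还原到真实量级
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # 梯度范数裁剪到 1.0
    scaler.step(optimizer)                     # 若本步出现 inf/nan 则跳过参数更新
    scaler.update()                            # 根据是否溢出调整缩放因子
```

真实的FSDP多卡训练中通常还需要配套的分片版梯度缩放与分布式初始化,这里为简化只展示单卡情形下的缩放流程。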

2.5、Training Processes训练过程

Here we describe significant training process ad-justments that arose during OPT-175B pre-training.

在OPT-175B的预训练过程中,我们进行了一些重要的训练过程调整。

硬件故障、损失发散和其他中途更改 Hardware Failures、Loss Divergences、Other Mid-flight Changes

We faced a significant number of hardware failures in our compute cluster while training OPT-175B. In total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months. During manual restarts, the training run was paused, and a series of diagnostics tests were conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was resumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.

在训练OPT-175B的过程中,我们在计算集群中遇到了大量硬件故障。总计,硬件故障导致了至少35次手动重启,并在两个月的时间内轮换下线了超过100台主机。在手动重启期间,训练被暂停,并进行一系列诊断测试以检测问题节点。被标记的节点随后被隔离,训练从最后保存的检查点恢复。鉴于被换下的主机数量与手动重启次数之间的差异,我们估计还发生了70次以上由硬件故障导致的自动重启。

Loss divergences were also an issue in our training run. When the loss diverged, we found that lowering the learning rate and restarting from an earlier checkpoint allowed for the job to recover and continue training. We noticed a correlation between loss divergence, our dynamic loss scalar crashing to 0, and the l2-norm of the activations of the final layer spiking. These observations led us to pick restart points for which our dynamic loss scalar was still in a “healthy” state (≥ 1.0), and after which our activation norms would trend downward instead of growing unboundedly. Our empirical LR schedule is shown in Figure 1. Early in training, we also noticed that lowering gradient clipping from 1.0 to 0.3 helped with stability; see our released logbook for exact details. Figure 2 shows our validation loss with respect to training iterations.

损失发散也是我们训练过程中的一个问题。当损失发散时,我们发现降低学习率并从较早的检查点重新开始训练可以使任务恢复并继续训练。我们注意到损失发散、动态损失缩放崩溃为0以及最终层激活的l2范数飙升之间存在相关性。这些观察结果导致我们选择动态损失缩放仍处于“健康”状态(≥1.0)的重启点,并且在此之后,我们的激活范数会趋于下降,而不是无限增长。我们的经验性学习率调度如图1所示。在训练初期,我们还注意到将梯度裁剪从1.0降低到0.3有助于稳定性;具体细节请参阅我们发布的日志。图2显示了验证损失随训练迭代次数的变化。
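
下面是一个按上述经验法则挑选重启检查点的示意函数(字段名与数值均为假设,仅用于说明"动态损失缩放因子仍≥1.0且末层激活范数趋降"这一判断逻辑,并非论文实际使用的脚本)。

```python
def pick_restart_checkpoint(checkpoints):
    """checkpoints: 按时间排序的字典列表,含 'step'、'loss_scale'、'final_layer_act_norm'(字段名为假设)。"""
    for i in range(len(checkpoints) - 1, 0, -1):
        ckpt, prev = checkpoints[i], checkpoints[i - 1]
        healthy_scale = ckpt["loss_scale"] >= 1.0                       # 缩放因子处于"健康"状态
        norm_trending_down = ckpt["final_layer_act_norm"] <= prev["final_layer_act_norm"]  # 激活范数趋降
        if healthy_scale and norm_trending_down:
            return ckpt["step"]
    return checkpoints[0]["step"]  # 兜底:退回最早的检查点

history = [
    {"step": 1000, "loss_scale": 64.0, "final_layer_act_norm": 12.0},
    {"step": 2000, "loss_scale": 32.0, "final_layer_act_norm": 11.5},
    {"step": 3000, "loss_scale": 0.0,  "final_layer_act_norm": 35.0},  # 发散:缩放因子崩溃、范数飙升
]
print(pick_restart_checkpoint(history))  # 输出 2000
```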

We conducted a number of other experimental mid-flight changes to handle loss divergences. These included: switching to vanilla SGD (optimization plateaued quickly, and we reverted back to AdamW); resetting the dynamic loss scalar (this helped recover some but not all divergences); and switching to a newer version of Megatron (this reduced pressure on activation norms and improved throughput).

我们进行了一些其他实验性的中途更改来处理损失发散。其中包括:切换到普通的SGD优化算法(优化很快陷入平台期,我们又切换回AdamW);重置动态损失缩放(这有助于恢复一些但不是全部的发散);切换到更新版本的Megatron(这减少了激活范数的压力并提高了吞吐量)。

Figure 2: Validation Perplexity. Our mid-flight LR changes had clear effects on validation perplexity.

图2:验证困惑度。我们的中途学习率更改对验证困惑度产生了明显影响。

3、Evaluations评估

3.1 Prompting & Few-Shot提示和少样本学习

We evaluate our model on 16 standard NLP tasks utilized in the literature: HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016), PIQA (Bisk et al., 2020), ARC Easy and Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), WinoGrad (Levesque et al., 2011), WinoGrande (Sakaguchi et al., 2020), and SuperGLUE (Wang et al., 2019). We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022).

We report performance in accuracy (omitting F1 for MultiRC and ReCoRD for consistency in evaluation metrics). For the Winograd Schema Challenge (WSC) task in the SuperGLUE benchmark, we follow (Brown et al., 2020) and formulate the task as multiple choice questions, which is known to affect performance (Liu et al., 2020).

我们在文献中使用的16个标准NLP任务上评估我们的模型:HellaSwag (Zellers et al., 2019)、StoryCloze (Mostafazadeh et al., 2016)、PIQA (Bisk et al., 2020)、ARC Easy和Challenge (Clark et al., 2018)、OpenBookQA (Mihaylov et al., 2018)、WinoGrad (Levesque et al., 2011)、Wino-Grande (Sakaguchi et al., 2020)和SuperGLUE (Wang et al., 2019)。我们遵循GPT-3 (Brown et al., 2020)的方法,使用他们的提示和整体实验设置。我们主要与GPT-3进行比较,旨在重新实现他们的评估设置,但也包括其他LLM在每个任务上的表现(Lieber et al., 2021;Rae et al., 2021;Hoffmann et al., 2022;Black et al., 2022)。

我们以准确率来报告性能(忽略MultiRC和ReCoRD的F1以保持评估指标的一致性)。对于SuperGLUE基准测试中的Winograd Schema Challenge (WSC)任务,我们按照Brown等人(2020)的方法,将任务转化为多项选择问题,这已经被证明会影响性能(Liu et al., 2020)。
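
作为参考,下面给出基于提示的多项选择评估的一种常见实现思路的示意代码:比较各候选答案在模型下的平均每token对数似然,取得分最高者作为预测。这里用已公开的小模型facebook/opt-125m代替OPT-175B,提示与候选项为虚构示例,并假设prompt的分词结果是prompt+choice分词结果的前缀;它并非论文评估代码的复现。

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def choice_logprob(prompt, choice):
    """返回 choice 在给定 prompt 条件下的平均每 token 对数概率(假设 prompt 分词是整体分词的前缀)。"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)   # 第 i 个位置预测第 i+1 个 token
    targets = full_ids[0, 1:]
    n_prompt = prompt_ids.shape[1]
    choice_lp = log_probs[n_prompt - 1:, :].gather(1, targets[n_prompt - 1:, None]).squeeze(1)
    return choice_lp.mean().item()

prompt = "Question: Which planet is known as the Red Planet?\nAnswer:"
choices = [" Mars", " Venus", " Jupiter"]
scores = {c: choice_logprob(prompt, c) for c in choices}
print(max(scores, key=scores.get))
```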

零样本学习Zero-shot

Overall average zero-shot performance across all 14 tasks may be seen in Figure 3. Overall, we see our average performance follows the trend of GPT-3. However, performance can vary radically across the tasks: for a full breakdown, see Appendix A. Note that we intentionally removed MultiRC and WIC from these averages, as these datasets seem to systematically favor GPT-3 or OPT disproportionately.

图3展示了在所有14个任务中的平均零样本学习性能。总体而言,我们的平均性能与GPT-3的趋势相似。然而,不同任务之间的性能差异很大:具体细节请参见附录A。请注意,我们故意从这些平均值中去除了MultiRC和WIC,因为这些数据集似乎系统地偏向于GPT-3或OPT。

Our performance roughly matched GPT-3 for 10 tasks, and underperformed in 3 tasks (ARC Challenge and MultiRC). In 3 tasks (CB, BoolQ, WSC), we find both GPT and OPT models display unpredictable behavior with respect to scale, likely due to the small size of the validation set in these 3 tasks (56, 277, and 104 examples, respectively). In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task. For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup, suggesting differences in the methods of evaluation on this task. For BoolQ and WSC, we note that both OPT and GPT models seem to hover around majority-class accuracy, suggesting small perturbations in probability masses may be dominating the evaluations.

我们的性能在10个任务中与GPT-3大致相当,而在3个任务中表现不佳(ARC Challenge和MultiRC)。在3个任务(CB、BoolQ、WSC)中,我们发现无论是GPT模型还是OPT模型在规模上都表现出不可预测的行为,这可能是由于这3个任务中验证集的规模较小(分别为56、277和104个示例)所导致的。在WIC任务中,我们发现OPT模型总是优于GPT-3模型,尽管Brown等人(2020)报告的数字也存在问题,因为WIC是一个二分类任务。对于MultiRC任务,我们无法使用Davinci API在我们的评估设置中复制GPT-3的结果,这表明在这个任务上评估方法存在差异。对于BoolQ和WSC任务,我们注意到OPT和GPT模型都似乎围绕着多数类准确率徘徊,这表明概率质量的微小扰动可能主导了评估结果。

Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021) perform roughly consistently with others for their parameter sizes, while PaLM (Chowdhery et al., 2022) generally performs better across all settings, even when controlling for number of parameters. We speculate the high performance of PaLM comes predominantly from higher quality and diversity of pre-training data.

Chinchilla (Hoffmann et al., 2022)和Gopher (Rae et al., 2021)的性能与其他模型在参数规模上大致一致,而PaLM (Chowdhery et al., 2022)在所有设置下的性能通常更好,即使在控制参数数量时也是如此。我们推测PaLM的高性能主要来自于更高质量和多样性的预训练数据。

一次和少次样本学习One-shot and Few-shot

Average multi-shot in-context performance is shown in Figure 4 (again, omitting MultiRC and WIC), with detailed performances shown in Appendix A. Across the average of all metrics, we find that OPT models perform similarly to GPT-3 models. However, as with zero-shot, breaking down these results per task shows a different story: in the same set of 10 datasets as zero-shot, we see similar performance across the two models. Some of the remaining datasets show inconsistent performance with respect to model size for both OPT and GPT-3 models (BoolQ, CB, WSC, RTE). In MultiRC, we consistently see underperformance of OPT models compared to GPT-3 models. Similar to our zero-shot evaluation, we hypothesize our one- and few-shot evaluation setup may differ significantly from Brown et al. (2020).

图4显示了在多个上下文中的平均多次样本学习性能(同样排除了MultiRC和WIC),详细性能见附录A。在所有度量指标的平均值上,我们发现OPT模型的表现与GPT-3模型类似。然而,与零样本学习一样,对每个任务的结果进行细分则呈现出不同的情况:在与零样本学习相同的10个数据集中,我们发现两种模型的性能类似。剩下的一些数据集对于OPT和GPT-3模型在模型规模方面的性能表现不一致(BoolQ、CB、WSC、RTE)。在MultiRC任务中,我们始终观察到OPT模型相对于GPT-3模型的性能不佳。与零样本评估类似,我们推测我们的一次和少次样本学习评估设置与Brown等人(2020)的设置可能存在显著差异。
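
下面用一个简短的示意函数说明few-shot上下文提示的一般拼接方式,即把若干标注示例与待回答问题串接成一段提示(示例题目为虚构,真实评估使用的是Brown等人(2020)的提示模板)。

```python
def build_few_shot_prompt(train_examples, query, k=2):
    """train_examples: [(question, answer), ...];返回含 k 个示例的上下文 + 待回答问题。"""
    shots = train_examples[:k]
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    blocks.append(f"Question: {query}\nAnswer:")
    return "\n\n".join(blocks)

examples = [("2 + 2 = ?", "4"), ("The capital of France is?", "Paris")]
print(build_few_shot_prompt(examples, "The capital of Japan is?", k=2))
```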

3.2 Dialogue对话

Given that LLMs are known to be an integral component of modern dialogue models (Adiwardana et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), we additionally evaluate OPT-175B on several open source dialogue datasets. In particular, we follow Roller et al. (2021), and evaluate on ConvAI2 (Dinan et al., 2020b), Wizard of Wikipedia (Dinan et al., 2019b), Empathetic Dialogues (Rashkin et al., 2019), and Blended Skill Talk (Smith et al., 2020). We additionally evaluate on the more recent Wizard of Internet dataset (Komeili et al., 2021). We focus our comparisons primarily against existing open source dialogue models including the fine-tuned BlenderBot 1 (Roller et al., 2021) and its pre-training counterpart Reddit 2.7B. We also compare against the fine-tuned R2C2 BlenderBot, a 2.7B parameter BlenderBot-like model trained by Shuster et al. (2022).

鉴于LLM已经被确认是现代对话模型的一个重要组成部分(Adiwardana等人,2020;Roller等人,2021;Thoppilan等人,2022;Rae等人,2021;Chowdhery等人,2022),我们额外评估了OPT-175B在几个开源对话数据集上的性能。特别是,我们按照Roller等人(2021)的方法,在ConvAI2(Dinan等人,2020b)、维基百科向导(Dinan等人,2019b)、共情对话(Rashkin等人,2019)和混合技能对话(Smith等人,2020)上进行评估。我们还在较新的互联网向导数据集(Komeili等人,2021)上进行了评估。我们主要与现有的开源对话模型进行比较,包括经过微调的BlenderBot 1(Roller等人,2021)及其预训练对应模型Reddit 2.7B。我们还将其与由Shuster等人(2022)训练的经过微调的R2C2 BlenderBot,即一个包含2.7B参数的BlenderBot类似模型进行比较。

We report Perplexity and Unigram F1 (UF1) overlap, following the metrics of the ConvAI2 competition (Dinan et al., 2020b). To control for different tokenization in each of the models, we normalize all perplexities to be in the space of the GPT-2 tokenizer (Radford et al., 2019). We also note which models are supervised with respect to these dialogue tasks and which are unsupervised. For OPT-175B, all generations are performed using greedy decoding up to a maximum of 32 tokens. We do not attempt to prompt the model at all except for alternating “Person 1:” and “Person 2:” lines of dialogue. The remaining models use the generation parameters found in BlenderBot 1.

我们报告困惑度和Unigram F1(UF1)重叠度,遵循ConvAI2竞赛(Dinan等人,2020b)的度量标准。为了控制不同模型的分词差异,我们将所有困惑度归一化到GPT-2分词器(Radford等人,2019)的空间中。我们还注明哪些模型在这些对话任务上经过有监督训练,哪些是无监督的。对于OPT-175B,所有生成都使用贪婪解码,最多生成32个标记。除了交替给出"Person 1:"和"Person 2:"的对话行之外,我们不对模型做任何额外提示。其余模型使用BlenderBot 1所采用的生成参数。
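
下面的示意片段展示了正文所述的解码设置:仅以交替的"Person 1:"/"Person 2:"对话行作为提示,用贪婪解码生成至多32个新token。由于OPT-175B权重需申请获取,这里用已公开的facebook/opt-1.3b代替,对话内容为虚构示例,仅用于说明解码方式。

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

dialogue = (
    "Person 1: Hi! I love hiking on weekends.\n"
    "Person 2: That sounds fun, where do you usually go?\n"
    "Person 1: Mostly the trails near my city.\n"
    "Person 2:"
)
inputs = tokenizer(dialogue, return_tensors="pt")
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=32)  # 贪婪解码,至多 32 个新 token
print(tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

注意,真实评估中还会把各模型的困惑度统一归一化到GPT-2分词器空间,这一步此处未展示。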

Results are shown in Table 2. We see that OPT-175B significantly outperforms the also-unsupervised Reddit 2.7B model on all tasks, and performs competitively with the fully supervised BlenderBot 1 model, especially in the ConvAI2 dataset. On the Wizard-of-Internet dataset, which is fully unsupervised for all models, we see that OPT-175B obtains the lowest perplexity but still has lower UF1 than the models with Wizard-of-Wikipedia supervision.

结果如表2所示。我们可以看到,在所有任务中,OPT-175B的性能明显优于同样无监督的Reddit 2.7B模型,并且在ConvAI2数据集中与经过完全监督的BlenderBot 1模型相比表现竞争力。在完全无监督的Wizard-of-Internet数据集中,OPT-175B的困惑度最低,但在UF1上仍然低于具有Wizard-of-Wikipedia监督的模型。

We were somewhat surprised that the evaluations of the unsupervised OPT-175B model were as competitive as BlenderBot 1 on the ConvAI2 dataset. This may indicate leakage of the ConvAI2 dataset into the general pre-training corpus or even into the validation data as evaluated in Table 2. To address concerns of leakage, we searched our pre-training corpus for the first conversation in the ConvAI2 dataset, but we did not find any overlap. We additionally evaluated OPT-175B on the ConvAI2 hidden test set, which has never been publicly released, and achieved 10.7 ppl and .185 UF1, matching the performance of the validation set. Furthermore, we evaluated OPT-175B on a subset of the ConvAI2-like MultiSessionChat (MSC) dataset (Xu et al., 2021b) and obtained a perplexity of 9.7 and UF1 of .177, indicating the model is generalizing well across multiple PersonaChat-like datasets. Since both MSC and WoI datasets were released after the CommonCrawl snapshot used in pre-training corpus, there is minimal risk of leakage. We conclude that OPT-175B has a strong ability to maintain a consistent persona across conversations, a behavior also highlighted in LaMDA (Thoppilan et al., 2022).

我们对无监督的OPT-175B模型在ConvAI2数据集上的评估结果与BlenderBot 1的竞争性相当令人惊讶。这可能表明ConvAI2数据集泄漏到了一般的预训练语料库中,甚至泄漏到了表2中评估的验证数据中。为了解决泄漏的问题,我们搜索了我们的预训练语料库中与ConvAI2数据集中第一个对话相匹配的部分,但我们没有发现任何重叠之处。我们还在从未公开发布过的ConvAI2隐藏测试集上对OPT-175B进行了评估,结果显示困惑度为10.7,UF1为0.185,与验证集的性能相匹配。此外,我们还对类似于ConvAI2的MultiSessionChat(MSC)数据集(Xu等人,2021b)的子集进行了OPT-175B的评估,得到困惑度为9.7,UF1为0.177,这表明该模型在多个PersonaChat类似的数据集上具有良好的泛化能力。由于MSC和WoI数据集是在用于预训练语料库的CommonCrawl快照之后发布的,泄漏的风险很小。我们得出结论,OPT-175B在对话中具有保持一致的个人特征的强大能力,这也是LaMDA(Thoppilan等人,2022)中强调的行为。

4 Bias & Toxicity Evaluations偏见和有害性评估

To understand the potential harm of OPT-175B, we evaluate a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation. While there may be shortcomings in these benchmarks (Blodgett et al., 2021; Jacobs and Wallach, 2021), these measurements provide a first step towards understanding the limitations of OPT-175B. We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).

为了了解OPT-175B的潜在危害,我们评估了一系列与仇恨言论检测、刻板印象意识和有害内容生成相关的基准测试。尽管这些基准测试可能存在缺陷(Blodgett等人,2021;Jacobs和Wallach,2021),但这些测量结果是理解OPT-175B局限性的第一步。我们主要与GPT-3 Davinci进行比较,因为这些基准测试在Brown等人(2020)发表时尚不可用,因此未被纳入其中。

4.1 Hate Speech Detection仇恨言论检测

Using the ETHOS dataset provided in Mollas et al. (2020) and instrumented by Chiu and Alexander (2021), we measure the ability of OPT-175B to identify whether or not certain English statements are racist or sexist (or neither). In the zero-, one-, and few-shot binary cases, the model is presented with text and asked to consider whether the text is racist or sexist and provide a yes/no response. In the few-shot multiclass setting, the model is asked to provide a yes/no/neither response.

Results are presented in Table 3. With all of our one-shot through few-shot configurations, OPT-175B performs considerably better than Davinci. We speculate this occurs from two sources: (1) evaluating via the Davinci API may be bringing in safety control mechanisms beyond the original 175B GPT-3 model used in Brown et al. (2020); and (2) the significant presence of unmoderated social media discussions in the pre-training dataset has provided additional inductive bias to aid in such classification tasks.

我们使用Mollas等人(2020)提供的ETHOS数据集,并采用Chiu和Alexander(2021)构建的评测方式,来衡量OPT-175B识别特定英语陈述是否带有种族主义或性别歧视(或二者皆无)的能力。在零样本、一样本和少样本的二分类设置中,模型被给定一段文本,需要判断该文本是否带有种族主义或性别歧视,并给出是/否的回答。在少样本多分类设置中,模型需要给出是/否/都不是的回答。

结果如表3所示。在我们的一次样本到少次样本的所有配置中,OPT-175B的性能远远优于Davinci。我们推测这可能来自两个原因:(1)通过Davinci API进行评估可能带来了超出Brown等人(2020)中使用的原始175B GPT-3模型的安全控制机制;(2)预训练数据集中未经审核的社交媒体讨论的存在对这类分类任务提供了额外的归纳偏差。
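
零样本"是/否"判断的一种常见实现方式是比较模型把" yes"与" no"作为下一个token的概率,下面给出一个示意片段(提示模板与示例句子为虚构,用公开小模型facebook/opt-125m代替OPT-175B,并非论文的官方评估代码)。

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def is_flagged(statement):
    """返回模型在"是/否"二选一下是否把该陈述判为种族主义或性别歧视。"""
    prompt = (f'Statement: "{statement}"\n'
              "Question: Is this statement racist or sexist? Answer yes or no.\nAnswer:")
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(input_ids).logits[0, -1]   # 下一个 token 的分布
    yes_id = tokenizer(" yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" no", add_special_tokens=False).input_ids[0]
    return bool(next_token_logits[yes_id] > next_token_logits[no_id])

print(is_flagged("People from that country are all criminals."))
```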

4.2 CrowS-Pairs

Developed for masked language models, CrowS-Pairs (Nangia et al., 2020) is a crowdsourced benchmark aiming to measure intrasentence level biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each example consists of a pair of sentences representing a stereotype, or anti-stereotype, regarding a certain group, with the goal of measuring model preference towards stereotypical expressions. Higher scores indicate higher bias exhibited by a model.

When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that Pushshift.io Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia). Given this is a primary data source for OPT-175B, the model may have learned more discriminatory associations, which directly impacts its performance on CrowS-Pairs.

CrowS-Pairs(Nangia等人,2020)是为掩码语言模型开发的,旨在衡量9个类别中的句内偏见,包括性别、宗教、种族/肤色、性取向、年龄、国籍、残疾、外貌和社会经济地位。每个示例由一对句子组成,表示关于某个群体的刻板印象或反刻板印象,目的是衡量模型对刻板表达的偏好程度。得分越高,表示模型表现出的偏见越高。

与Davinci在表4中的比较结果显示,OPT-175B在几乎所有类别中都表现出更多的刻板偏见,除了宗教。同样,这很可能是由于训练数据的差异导致的;Nangia等人(2020)表明,Pushshift.io Reddit语料库中的刻板和歧视性文本的发生率高于其他语料库(如维基百科)。鉴于这是OPT-175B的主要数据来源,该模型可能已学习到更多的歧视性关联,这直接影响其在CrowS-Pairs上的性能。

Table 4: CrowS-Pairs evaluation. Lower is better for all categories, indicating more fairness. The OPT-175B model performs worse than Davinci in most categories.

表4:CrowS-Pairs评估。所有类别越低越好,表明更公平。OPT-175B模型在大多数类别中表现不如Davinci 。

4.3 StereoSet

Following Lieber et al. (2021) and Artetxe et al. (2021), we use StereoSet (Nadeem et al., 2021) to measure stereotypical bias across 4 categories: profession, gender, religion, and race. In addition to intrasentence measurement (similar to CrowS-Pairs), StereoSet includes measurement at the intersentence level to test a model’s ability to incorporate additional context. To account for a potential trade-off between bias detection and language modeling capability, StereoSet includes two metrics: Language Modeling Score (LMS) and Stereotype Score (SS), which are then combined to form the Idealized Context Association Test score (ICAT). Unlike Lieber et al. (2021), we normalize scores by token count, rather than character count, which they report improves metrics for several models.

我们使用StereoSet(Nadeem等人,2021)来衡量4个类别中的刻板偏见:职业、性别、宗教和种族,这与Lieber等人(2021)和Artetxe等人(2021)的做法相似。除了句内测量(类似于CrowS-Pairs),StereoSet还包括句间级别的测量,以测试模型整合额外上下文的能力。为了考虑偏见检测和语言建模能力之间的潜在权衡,StereoSet包括两个指标:语言建模分数(LMS)和刻板分数(SS),然后将它们结合起来形成理想化上下文关联测试分数(ICAT)。与Lieber等人(2021)不同的是,我们通过令牌数对得分进行了归一化,而不是字符数,他们报告称这样可以改善多个模型的指标。
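
为说明"按token数归一化"与"按字符数归一化"的差别,下面的示意函数对同一句子分别计算两种归一化后的对数似然(模型用公开小模型代替,仅为演示两种归一化方式,并非StereoSet的官方评测实现)。

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def sentence_logprob(sentence):
    """返回 (按 token 数归一化的对数似然, 按字符数归一化的对数似然)。"""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, ids[0, 1:, None]).squeeze(1)  # 每个 token 的对数概率
    total = token_lp.sum().item()
    return total / token_lp.numel(), total / len(sentence)

print(sentence_logprob("The engineer fixed the server quickly."))
```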

Results are shown in Table 5. We see that Davinci and OPT-175B exhibit similar scores on aggregate (overall ICAT is very close between the two). In particular, Davinci outperforms in the areas of profession and race, while OPT-175B outperforms in the areas of Gender and Religion. OPT-175B performs better across the board on the SS metric, while Davinci generally outperforms on the LMS metric.

结果如表5所示。我们可以看到,Davinci和OPT-175B的总体得分相似(两者的整体ICAT非常接近)。具体而言,Davinci在职业和种族方面表现更好,而OPT-175B在性别和宗教方面表现更好。在SS指标上OPT-175B全面表现更好,而在LMS指标上Davinci总体表现更好。

4.4 Real Toxicity Prompts

We evaluate the tendency of OPT-175B to respond with toxic language via the RealToxicityPrompts (Gehman et al., 2020) dataset. Following PaLM (Chowdhery et al., 2022), we sample 25 generations of 20 tokens using nucleus sampling (Holtzman et al., 2020) (p = 0.9) for each of 10,000 randomly sampled prompts from RTP, and report mean toxicity probabilities of the continuations, stratified across bucketed toxicities of the original prompts. For comparison, we report bucketed toxicity rates from Davinci and PaLM.

Results are shown in Figure 5. Overall, we see that OPT-175B has a higher toxicity rate than either PaLM or Davinci. We also observe that all 3 models have increased likelihood of generating toxic continuations as the toxicity of the prompt increases, which is consistent with the observations of Chowdhery et al. (2022). As with our experiments in hate speech detection, we suspect the inclusion of unmoderated social media texts in the pre-training corpus raises model familiarity with, and therefore propensity to generate and detect, toxic text. This strong awareness of toxic language may or may not be desirable depending on the specific requirements of downstream applications. Future applications of OPT-175B should consider this aspect of the model, and take additional mitigations, or avoid usage entirely as appropriate.

我们通过RealToxicityPrompts(Gehman等,2020)数据集评估OPT-175B生成有害语言的倾向性。沿用PaLM(Chowdhery等,2022)的做法,我们从RTP中随机抽取10,000个提示,并对每个提示使用核心抽样(nucleus sampling)(Holtzman等,2020)(p = 0.9)生成25个长度为20的文本,然后报告连续文本的平均有害概率,并根据原始提示的有害程度分组。为了比较,我们还报告了Davinci和PaLM的有害率分组。

结果如图5所示。总体而言,我们可以看到OPT-175B的有害率高于PaLM和Davinci。我们还观察到,随着提示的有害程度增加,这三个模型生成有害文本的可能性也增加,这与Chowdhery等人(2022)的观察一致。与我们在仇恨言论检测实验中的观察类似,我们怀疑在预训练语料库中包含未经审核的社交媒体文本,使得模型对生成和检测有害文本的倾向性增强。这种强烈的有害语言意识在具体应用的要求下可能是有利的,也可能是不可取的。未来使用OPT-175B时应考虑模型的这一特点,并根据需要采取额外的缓解措施或完全避免使用。
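
下面的示意片段对应正文描述的生成设置:对每个提示用核采样(top-p=0.9)生成25条、每条20个token的续写;续写的毒性打分(例如调用毒性分类接口)此处省略。提示为虚构示例,并用公开小模型facebook/opt-125m代替OPT-175B。

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

prompt = "The new neighbors moved in last week and"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True, top_p=0.9, top_k=0,   # 核采样:只在累计概率 0.9 的候选集中采样
    max_new_tokens=20,                    # 每条续写 20 个 token
    num_return_sequences=25,              # 每个提示采样 25 条续写
)
continuations = [tokenizer.decode(o[inputs.input_ids.shape[1]:], skip_special_tokens=True)
                 for o in outputs]
print(continuations[:3])
```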

4.5 Dialogue Safety Evaluations对话安全性评估

Finally, we compare OPT-175B on two Dialogue Safety evaluations. The first, SaferDialogues (Ung et al., 2021), measures the ability to recover from explicit safety failures, usually in the form of apologizing or recognizing its mistake. The second, the Safety Bench Unit Tests (Dinan et al., 2021), measures how unsafe a model’s response is, stratified across 4 levels of topic sensitivity: Safe, Realistic, Unsafe, and Adversarial. As with the other dialogue evaluations (Section 3.2), we compare to several existing open source dialogue models.

Results for both experiments are shown in Table 6. We observe that OPT-175B has similar performance as the Reddit 2.7B model across both SaferDialogues and the Unit Tests, with OPT-175B performing marginally better in the Safe and Adversarial settings. Consistent with Roller et al. (2021) and Xu et al. (2020), we find that the models fine-tuned on curated dialogue datasets (BlenderBot 1, R2C2) have overall lower toxicity. We conclude that future experimentation of OPT-175B for dialogue should contain explicit fine-tuning on curated datasets in order to improve the safety profile.

最后,我们将OPT-175B与两个对话安全性评估进行了比较。第一个是SaferDialogues(Ung等,2021),该评估衡量从明显的安全失败中恢复的能力,通常以道歉或认识到错误的形式。第二个是安全性基准单元测试(Dinan等,2021),它衡量模型响应的不安全性,根据话题敏感性分为4个级别:安全、真实、不安全和对抗性。与其他对话评估一样(第3.2节),我们将其与几个现有的开源对话模型进行了比较。

两个实验的结果如表6所示。我们观察到,在SaferDialogues和单元测试中,OPT-175B与Reddit 2.7B模型的性能相似,OPT-175B在安全和对抗性设置下略微优于其他模型。与Roller等人(2021)和Xu等人(2020)一致,我们发现在经过精心策划的对话数据集(BlenderBot 1、R2C2)上进行微调的模型整体上具有较低的有害性。我们得出结论,未来在对话中使用OPT-175B时,应在精心策划的数据集上进行明确的微调,以改善安全性能。

5 Limitations限制

In Sections 3.1 and 4, we carried out extensive evaluation of all released models at varying scales. We saw parity in performance for standard evaluation datasets used in the GPT-3 models. Moreover, we performed safety, bias, and inclusion evaluations, again seeing largely comparable performance with some variations in toxicity and hate speech detection. However, such evaluations may not fully characterize the complete limitations of these models. In general, we qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs (Brown et al., 2020; Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Bender et al., 2021).

在第3.1节和第4节中,我们对不同规模的所有发布模型进行了广泛的评估。我们发现在GPT-3模型中使用的标准评估数据集的性能基本相当。此外,我们进行了安全性、偏见性和包容性评估,同样发现在有害性和仇恨言论检测方面存在一些差异,但总体上性能相对可比。然而,此类评估可能无法完全揭示这些模型的全部限制。总的来说,我们在定性上观察到OPT-175B存在与其他LLM(Brown等,2020;Lieber等,2021;Thoppilan等,2022;Rae等,2021;Smith等,2022;Chowdhery等,2022;Bender等,2021)中提到的相同限制。

In particular, we found OPT-175B does not work well with declarative instructions or point-blank interrogatives. Prompting with such instructions tends to produce a simulation of a dialogue beginning with such an instruction, rather than an execution of the instruction. Future work into instruction learning, in the vein of InstructGPT (Ouyang et al., 2022), may alleviate these limitations.

OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled. Future work may wish to incorporate more modern strategies for reducing repetition and improving diversity, such as unlikelihood training (Welleck et al., 2020) or best-first decoding (Meister et al., 2020).

Similar to other LLMs, OPT-175B can produce factually incorrect statements (Adiwardana et al., 2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022; Thoppilan et al., 2022). This can be particularly harmful in applications where information accuracy is critical, such as healthcare and scientific discovery (Weidinger et al., 2021b). Recently, several efforts have reported that retrieval-augmented models can improve factual correctness of LLMs (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud et al., 2021; Shuster et al., 2022; Nakano et al., 2021). We believe OPT-175B will also benefit from retrieval-augmentation in future iterations.

特别是,我们发现OPT-175B在处理陈述性指令或直接疑问句方面表现不佳。使用这类指令进行提示往往会产生对话的模拟,而不是执行指令。未来的工作可以借鉴InstructGPT(Ouyang等,2022)的指令学习方法,以减轻这些限制。

OPT-175B还容易出现重复性表现,并且容易陷入循环。虽然抽样可以降低重复行为的发生率(Holtzman等,2020),但我们的经验发现,当只进行一次生成时,重复行为并未完全消除。未来的工作可以考虑采用更现代的策略来减少重复并改善多样性,例如非似然训练(Welleck等,2020)或最佳优先解码(Meister等,2020)。

与其他LLM类似,OPT-175B可能会生成事实错误的陈述(Adiwardana等,2020;Brown等,2020;Roller等,2021;Rae等,2021;Chowdhery等,2022;Thoppilan等,2022)。在信息准确性至关重要的应用领域(如医疗保健和科学发现)中,这可能会带来实质性的伤害(Weidinger等,2021b)。最近,一些工作已经报道检索增强模型可以提高LLM的事实准确性(Lewis等,2020;Komeili等,2021;Thoppilan等,2022;Borgeaud等,2021;Shuster等,2022;Nakano等,2021)。我们相信OPT-175B在未来的迭代中也会从检索增强中获益。

As shown in Section 4, we also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find (Dinan et al., 2021). There has been a great deal of work on mitigations for toxicity and biases (Dathathri et al., 2019; Dinan et al., 2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al., 2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick et al., 2021; Ouyang et al., 2022). Depending on downstream applications, future uses of OPT-175B may need to employ these or novel mitigation approaches, especially before any real world deployment. Given our primary goal as a replication of GPT-3, we choose not to apply these mitigations in this first release.

正如第4节所示,我们还发现OPT-175B具有较高的生成有害语言和强化有害刻板印象的倾向,即使给定的是相对无害的提示(Gehman等,2020),而且对抗性提示也很容易找到(Dinan等,2021)。针对有害性和偏见的缓解方法已有大量研究(Dathathri等,2019;Dinan等,2019a;Sheng等,2019;Dinan等,2020a;Liu等,2019a;Krause等,2020;Xu等,2020;Liang等,2021;Dinan等,2021;Xu等,2021a;Dhamala等,2021;Schick等,2021;Ouyang等,2022)。根据下游应用的不同,未来使用OPT-175B可能需要采用这些方法或新的缓解方法,尤其是在任何实际部署之前。鉴于我们的主要目标是复现GPT-3,我们选择在此次首次发布中不应用这些缓解措施。

In summary, we still believe this technology is premature for commercial deployment. Despite including data sheets and model cards, we believe more scrutiny should be afforded to the training data with additional data characterization and selection criteria in order to use data responsibly. The current practice is to feed the model with as much data as possible and minimal selection within these datasets. Despite having comprehensive evaluations, we would ideally have more streamlined and consistent evaluation setups to ensure replicability and reproducibility of evaluation scenarios. Differences in prompting styles and number of shots for in-context learning could create variations that lead to different results. We hope that the public release of the OPT models will enable many more researchers to work on these important issues.

总之,我们仍然认为这项技术尚不成熟,不适合商业部署。尽管提供了数据表和模型卡片,我们认为仍应对训练数据给予更多审查,并加入额外的数据表征和选择标准,以负责任地使用数据。目前的通行做法是向模型提供尽可能多的数据,而对这些数据集几乎不做筛选。尽管我们进行了全面的评估,但理想情况下应有更简化且一致的评估设置,以确保评估场景的可复制性和可重现性。上下文学习中提示风格和样本数(shots)的差异可能带来变化,从而导致不同的结果。我们希望OPT模型的公开发布能让更多的研究人员投入到这些重要问题的研究中。

6、Considerations for Release发布考虑因素

Following the recommendations for individual researchers generated by the Partnership for AI, along with the governance guidance outlined by NIST, we are disclosing all of the details involved in training OPT-175B through our logbook, our code, and providing researchers access to model weights for OPT-175B, along with a suite of smaller baselines mirroring the setup for OPT-175B. We aim to be fully accountable for the development lifecycle of OPT-175B, and only through increasing transparency around LLM development can we start understanding the limitations and risks of LLMs before broader deployment occurs.

By sharing a detailed account of our day-to-day training process, we disclose not only how much compute was used to train the current version of OPT-175B, but also the human overhead required when underlying infrastructure or the training process itself becomes unstable at scale. These details are generally omitted from previous publications, likely due to the inability to fully ablate changes made mid-flight (without drastically increasing the compute budget). We hope that by revealing how certain ad-hoc design decisions were made, we can improve upon these practices in the future, and collectively increase the experimental robustness in developing models at this scale.

根据AI伙伴关系组织(Partnership for AI)对个别研究人员的建议以及美国国家标准与技术研究院(NIST)提出的治理指导,我们通过我们的日志、代码以及提供研究人员对OPT-175B模型权重的访问,披露了训练OPT-175B所涉及的所有细节,同时还提供了一套与OPT-175B设置相似的较小基线模型。我们希望在LLM开发过程中能够完全负责任,并且只有通过增加LLM开发的透明度,我们才能在更广泛的部署之前开始了解LLM的限制和风险。

通过详细介绍我们日常训练过程的细节,我们不仅披露了训练当前版本OPT-175B所使用的计算量,还披露了当底层基础设施或训练过程本身在大规模下变得不稳定时所需的人力开销。这些细节在以往的出版物中通常被省略,可能是因为无法在不大幅增加计算预算的情况下,对中途所做的更改进行完整的消融实验。我们希望通过揭示某些临时设计决策的制定方式,能够在未来改进这些实践,并共同提高在这一规模下开发模型的实验鲁棒性。

Outside of these notes, the metaseq codebase itself is the final source of truth in many of our implementation details. By releasing our development codebase, we aim to shed light on any implementation detail that may have been omitted from being explicitly enumerated in this paper, as it is either considered a detail of standard practice in the field, or is simply a detail we failed to account for. This current codebase is also the only known open-source implementation of training a decoder-only transformer that is ≥175B parameters without the use of pipeline parallelism on NVIDIA GPUs.

To enable experimentation at 175B scale, we are providing researchers with direct access to the parameters of OPT-175B. The reasoning here is twofold: enable Responsible AI research into LLMs while simultaneously reducing the environmental impact of pursuing research at this scale. There is a growing body of work detailing ethical and social risks from deploying language models with emergent capabilities at scale (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al., 2021; Kenton et al., 2021). By limiting access to OPT-175B to the research community with a non-commercial license, we aim to focus development efforts on quantifying the limitations of the LLMs first, before broader commercial deployment occurs.

除了这些记录之外,metaseq代码库本身是我们许多实现细节的最终事实来源。通过发布我们的开发代码库,我们旨在揭示本文中可能未被明确列举的实现细节,这些细节要么被视为该领域的标准做法,要么只是我们未能顾及之处。当前这个代码库也是已知唯一不使用流水线并行、即可在NVIDIA GPU上训练参数量≥175B的仅解码器Transformer的开源实现。

为了使175B规模的实验成为可能,我们为研究人员提供了对OPT-175B参数的直接访问。这样做的考虑有两方面:一是推动针对LLM的负责任AI研究,二是减少在这一规模上开展研究对环境的影响。越来越多的工作详细讨论了大规模部署具有涌现能力的语言模型所带来的伦理和社会风险(Weidinger等,2021a;Bommasani等,2021;Dinan等,2021;Kenton等,2021)。通过以非商业许可的方式仅向研究社区开放OPT-175B,我们希望在更广泛的商业部署发生之前,先将开发工作集中在量化这些LLM的局限性上。

Furthermore, there exists significant compute and carbon cost to reproduce models of this size. While OPT-175B was developed with an estimated carbon emissions footprint (CO2eq) of 75 tons, GPT-3 was estimated to use 500 tons (Patterson et al., 2021), while Gopher required 380 tons (Rae et al., 2021). These estimates are not universally reported, and the accounting methodologies for these calculations are also not standardized. In addition, model training is only one component of the overall carbon footprint of AI systems; we must also consider experimentation and eventual downstream inference cost, all of which contribute to the growing energy footprint of creating large-scale models (Wu et al., 2022). By releasing our logbook, we hope to highlight the gap between a theoretical carbon cost estimate that assumes no hardware failures or training instabilities, versus one that aims to include the entire LLM development lifecycle. We need to understand the manufacturing (or embodied) carbon of these systems (Gupta et al., 2021) as they grow increasingly more complex, and we hope that our paper can help future work in defining additional factors to consider when measuring the impact of scale on the environment.

此外,复制这种规模的模型存在显着的计算和碳排放成本。虽然估计OPT-175B的碳排放足迹(CO2eq)为75吨,但据估计,GPT-3使用了500吨(Patterson等,2021),而Gopher则需要380吨(Rae等,2021)。这些估计并不普遍报道,而且这些计算的会计方法也没有标准化。此外,模型训练只是AI系统整体碳足迹的一个组成部分;我们还必须考虑实验和最终的下游推断成本,所有这些都会导致创建大规模模型的能源消耗不断增加(Wu等,2022)。通过发布我们的日志,我们希望突显一个理论上的碳成本估计与实际上包括整个LLM开发生命周期的碳成本估计之间的差距。我们需要了解这些系统的制造(或实体化)碳成本(Gupta等,2021),因为它们变得越来越复杂,我们希望我们的论文能够帮助未来的工作确定衡量规模对环境影响时要考虑的额外因素。
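
为帮助理解训练碳排放估算的大致构成,下面给出一种常见估算公式的示意计算:排放 ≈ GPU数 × 训练时长 × 单卡功耗 × PUE × 电网碳强度。其中所有数值(功耗、PUE、碳强度、训练天数)均为演示用的假设,计算结果并不等于论文披露的75吨;实际核算还取决于数据中心能源结构、故障重启开销等因素,正文也指出这类核算方法尚未标准化。

```python
def training_co2_tonnes(num_gpus, days, gpu_kw=0.4, pue=1.1, kg_co2_per_kwh=0.4):
    """一种粗略的训练碳排放估算(所有默认参数均为演示假设)。"""
    kwh = num_gpus * days * 24 * gpu_kw * pue      # 总耗电量(kWh)
    return kwh * kg_co2_per_kwh / 1000.0           # 千克 CO2eq 换算为吨

# 假设 992 张 A100 训练约 60 天(演示数字,并非论文披露的精确训练时长)
print(f"{training_co2_tonnes(992, 60):.0f} tonnes CO2eq")
```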

Similarly, by producing a set of baselines across a wide range of scales, we hope to enable the broader research community to study the impact and limitations of these models with respect to scale alone. As reported in Hoffmann et al. (2022), many of these LLMs may have been under-trained as a function of the amount of training data used, which implies that incorporating more data and continuing to train these baseline models may continue to improve performance. There is also evidence that step-function changes in capabilities may occur at a scale that is much smaller than 175B (Wei et al., 2021), indicating a need to examine a wider range of scales for different research applications.

同样地,通过在各种规模上生成一组基线模型,我们希望使更广泛的研究社区能够研究这些模型在规模方面的影响和限制。正如Hoffmann等人(2022)报道的那样,由于使用的训练数据量不足,许多这些LLM可能未经充分训练,这意味着加入更多的数据并继续训练这些基线模型可能会进一步提高性能。还有证据表明,能力的阶梯变化可能发生在远小于175B的规模上(Wei等,2021),这表明需要在不同研究应用中考虑更广泛的规模范围。

7 Related Work相关工作

Since the publication of the Transformer architecture (Vaswani et al., 2017) and BERT (Devlin et al., 2019), the field of NLP has experienced a massive shift towards the use of LLMs with self-supervised pre-training. Multiple masked language models, including T5 (Raffel et al., 2020) and Megatron-LM (Shoeybi et al., 2019), have shown consistent improvements through scale. These scaling gains come not only from growing the total number of parameters in the models, but also the amount and quality of pre-training data (Liu et al., 2019b; Hoffmann et al., 2022).

Auto-regressive language models (Mikolov et al., 2009) have seen the largest growth in model size, from 117M parameters (Radford et al., 2018) to over 500B parameters (Smith et al., 2022; Chowdhery et al., 2022). The resulting massive improvement in generative fluency and quality was first characterized in GPT-2 (Radford et al., 2019) and further improved with GPT-3 (Brown et al., 2020) and later models. Although a variety of very large (over 100B parameters) generative models have now been trained (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022; Chowdhery et al., 2022), they are all closed source and accessible only internally or via paid API services. There are a few notable efforts towards open sourcing LLMs from non-profit research organizations including EleutherAI (Black et al., 2022) and BigScience. These models differ from the OPT models in pre-training data, target languages and model scale, making it possible for the community to compare different pre-training strategies.

自Transformer架构(Vaswani等,2017)和BERT(Devlin等,2019)的发表以来,自监督预训练的LLM在自然语言处理(NLP)领域得到了广泛应用。多个屏蔽语言模型,包括T5(Raffel等,2020)和Megatron-LM(Shoeybi等,2019),已经显示出通过规模的持续改进。这些规模增益不仅来自于模型中参数的增加,还来自于预训练数据的数量和质量(Liu等,2019b;Hoffmann等,2022)。

自回归语言模型(Mikolov等,2009)的模型规模发展最大,从117M参数(Radford等,2018)增加到超过500B参数(Smith等,2022;Chowdhery等,2022)。最初,在GPT-2(Radford等,2019)中首次表征了生成流畅度和质量的巨大提升,并在GPT-3(Brown等,2020)和后续模型中进一步改进。尽管现在已经训练出了各种非常大(超过100B参数)的生成模型(Lieber等,2021;Rae等,2021;Thoppilan等,2022;Smith等,2022;Chowdhery等,2022),它们都是闭源的,只能在内部或通过付费API服务获得。还有一些值得注意的努力来自非营利研究组织,包括EleutherAI(Black等,2022)和BigScience。这些模型在预训练数据、目标语言和模型规模方面与OPT模型不同,使得社区可以比较不同的预训练策略。

Since Brown et al. (2020), the primary evaluation criterion for LLMs has been prompt-based (Black et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), as is also performed in this paper. This is largely due to the convenience of evaluating on many tasks without specialized task-specific fine-tuning. Prompting itself has a long history: cloze evaluations go back several decades (Chambers and Jurafsky, 2008; Mostafazadeh et al., 2016). More recently, prompting or masked infilling has been used to probe models for knowledge (Petroni et al., 2019) or perform a variety of NLP tasks (Radford et al., 2019; Brown et al., 2020). There has also been work on eliciting prompting behavior in smaller models (Schick and Schütze, 2020; Gao et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexibility of prompting (Shin et al., 2020), and understanding why and how prompting works (Liu et al., 2021; Min et al., 2022).

自Brown等人(2020)以来,LLM的主要评估方式一直是基于提示的(Black等,2022;Rae等,2021;Chowdhery等,2022),本文也采用了这种评估方法。这主要是因为在无需专门的任务特定微调的情况下,就可以方便地在许多任务上进行评估。提示本身有着悠久的历史:填空式评估可以追溯到几十年前(Chambers和Jurafsky,2008;Mostafazadeh等,2016)。最近,提示或掩码填充已被用于探测模型的知识(Petroni等,2019)或执行各种NLP任务(Radford等,2019;Brown等,2020)。还有一些工作研究如何在较小模型中引出提示行为(Schick和Schütze,2020;Gao等,2021b;Li和Liang,2021;Lester等,2021;Scao和Rush,2021)、改进提示的灵活性(Shin等,2020),以及理解提示为何以及如何奏效(Liu等,2021;Min等,2022)。

Recent efforts have shown gains by fine-tuning models to directly respond to instruction-style prompting (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022). However, effective prompt engineering remains an open research challenge. Results vary significantly and unpredictably with the selection of the prompt (Lu et al., 2021), and models do not seem to understand the prompts as fully as we expect (Webson and Pavlick, 2021). Furthermore, it is challenging to write prompts without a development set, which leads to questions about the extent to which we are actually achieving zero- or few-shot learning in practice (Perez et al., 2021). We do not attempt to address these concerns of prompting, and instead only aim to provide evaluation of OPT-175B in existing settings. However, we hope the full release of OPT-175B will enable others to better study these challenges in the future.

最近的研究表明,通过微调模型使其直接响应指令式提示可以带来提升(Wei等,2021;Min等,2021;Sanh等,2021;Ouyang等,2022)。然而,有效的提示工程仍然是一个开放的研究挑战。结果会随提示的选择而发生显著且不可预测的变化(Lu等,2021),而且模型对提示的理解似乎也不如我们预期的那样充分(Webson和Pavlick,2021)。此外,在没有开发集的情况下编写提示是困难的,这引发了我们在实践中究竟在多大程度上真正实现了零样本或少样本学习的疑问(Perez等,2021)。我们不试图解决这些关于提示的问题,而只是旨在在现有设置下对OPT-175B进行评估。不过,我们希望OPT-175B的完整发布能让其他人在未来更好地研究这些挑战。

8 Conclusion结论

In this technical report, we introduced OPT, a collection of auto-regressive language models ranging in size from 125M to 175B parameters. Our goal was to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency. We described training details, evaluated performance in a number of NLP and dialogue settings, and characterized behaviors with respect to bias, toxicity and hate speech. We also described many other limitations the models have, and discussed a wide set of considerations for responsibly releasing the models. We believe the entire AI community would benefit from working together to develop guidelines for responsible LLMs, and we hope that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.

在这份技术报告中,我们介绍了OPT,这是一系列参数大小从125M到175B的自回归语言模型。我们的目标是复制GPT-3类模型的性能和规模,并应用最新的数据策划和训练效率最佳实践。我们描述了训练细节,在多个自然语言处理和对话设置中评估了性能,并描述了模型在偏见、有害性和仇恨言论方面的行为特征。我们还描述了这些模型的许多其他限制,并讨论了负责任地发布这些模型的一系列考虑因素。我们相信整个AI社区都将从共同努力制定负责任的大型语言模型指南中受益,并希望对这类模型的广泛访问将增加定义此类技术的伦理考虑的声音多样性。

Acknowledgements致谢

We would like to thank Scott Jeschonek, Giri Anantharaman, Diego Sarina, Joaquin Colombo, Chris Bray, Stephen Roylance, Kalyan Saladi, Shubho Sengupta, and Brian O’Horo for helping to remove infrastructure blockers along the way; Percy Liang, Rishi Bommasani, and Emily Dinan for discussions on responsible release practices; Carole-Jean Wu for discussions on sustainability and carbon footprint considerations; Srini Iyer, Ramakanth Pasunuru, and Shruti Bhosale for previous contributions to evaluations; Benjamin Lefaudeux, Geeta Chauhan, Natalia Gimelshein, Horace He, and Sam Gross for discussions on performance improvement work; Emily Dinan, Carole-Jean Wu, Daniel McKinnon, and Mark Tygert for feedback on this draft; Antoine Bordes, Joelle Pineau, Mary Williamson, Necip Fazil Ayan, Armand Joulin, Sergey Edunov, Melanie Kambadur, Zornitsa Kozareva, Ves Stoyanov, Vitaliy Liptchinsky, Rahul Iyer, Jing Xu, Jason Weston, and many others for supporting this project internally.

我们要感谢Scott Jeschonek、Giri Anantharaman、Diego Sarina、Joaquin Colombo、Chris Bray、Stephen Roylance、Kalyan Saladi、Shubho Sengupta和Brian O'Horo在过程中帮助解决基础设施问题;感谢Percy Liang、Rishi Bommasani和Emily Dinan就负责任发布实践进行的讨论;感谢Carole-Jean Wu就可持续性和碳足迹问题进行的讨论;感谢Srini Iyer、Ramakanth Pasunuru和Shruti Bhosale对评估工作的先前贡献;感谢Benjamin Lefaudeux、Geeta Chauhan、Natalia Gimelshein、Horace He和Sam Gross就性能改进工作进行的讨论;感谢Emily Dinan、Carole-Jean Wu、Daniel McKinnon和Mark Tygert对本文稿的反馈;感谢Antoine Bordes、Joelle Pineau、Mary Williamson、Necip Fazil Ayan、Armand Joulin、Sergey Edunov、Melanie Kambadur、Zornitsa Kozareva、Ves Stoyanov、Vitaliy Liptchinsky、Rahul Iyer、Jing Xu、Jason Weston以及其他许多人在内部对这个项目的支持。
