

LLMs: Translation and Commentary on "OPT: Open Pre-trained Transformer Language Models"

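Overview: This post covers Open Pre-trained Transformers (OPT), a suite of decoder-only, autoregressive pre-trained transformer language models ranging from 125M to 175B parameters. OPT-175B, the largest, has 175 billion parameters, was trained on publicly available datasets, and is presented as the first model of this size to be made available to the broader AI research community. Meta AI shares the model in the hope that more of the community will take part in understanding the fundamental techniques behind large models.

>> Replicating GPT-3 and documenting the details: The authors aim to match the performance and sizes of the GPT-3 class of models while applying best practices in data curation and training efficiency. They describe the training details and evaluate performance across a range of NLP and dialogue settings. The OPT models are Transformer language models trained in the open (models, code, and data), with training data drawn from large-scale text collected from the internet. The researchers also provide the code and pre-trained weights so that researchers and developers can use them widely, which eases further research on and application of language models.

>> About 1/7 the carbon footprint of GPT-3 (FSDP [Meta] + tensor parallelism [NVIDIA]): Meta AI developed OPT-175B with energy efficiency in mind, and its carbon footprint is only about 1/7 that of GPT-3. This was achieved by combining Meta's open-source Fully Sharded Data Parallel (FSDP) API with NVIDIA's tensor parallel abstraction in Megatron-LM.

>> An open approach to pre-training (intended to promote academic research and exchange): The approach is used to build scalable, efficient NLP models. Sharing pre-trained weights and datasets can speed up model development and deployment. The experimental results indicate that this approach can improve model performance and robustness, and that it applies to many NLP tasks. Training most large language models is so expensive that most researchers cannot afford to train or use them. Meanwhile, for commercial reasons, the large pre-trained models released by major companies do not offer full access to model weights; results are available only through API calls, which hinders academic exchange and research.

>> Responsibility and ethics: The authors also discuss the models' limitations and the considerations for releasing them responsibly. They argue that the whole AI community would benefit from jointly developing guidelines for responsible large language models, and they hope that broad access to such models will increase the diversity of voices on ethical considerations.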


2 Method

2.1 Models

2.2 Training Setup

2.3 Pre-training Corpus

The Pile, Reddit

2.4 Training Efficiency

2.5 Training Processes

Hardware Failures, Loss Divergences, Other Mid-flight Changes

3.1 Prompting & Few-Shot

One-shot and Few-shot

4 Bias & Toxicity Evaluations

4.1 Hate Speech Detection

4.2 CrowS-Pairs

4.3 StereoSet

4.4 Real Toxicity Prompts

4.5 Dialogue Safety Evaluations

5 Limitations

6 Considerations for Release

7 Related Work

8 Conclusion


Meta AI


Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.



Large language models (LLMs) trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning (Brown et al., 2020; Lieber et al., 2021; Smith et al., 2022; Rae et al., 2021; Chowdhery et al., 2022). While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs. This restricted access has limited researchers' ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.

In this technical report, we present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.



We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories. We are also releasing both the logbook of our model creation as well as our codebase, metaseq, which enabled training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hardware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. While this is a significant achievement, the energy cost of creating such a model is still nontrivial, and repeated efforts to replicate a model of this size will only amplify the growing compute footprint of these LLMs.

We believe the entire AI community (academic researchers, civil society, policymakers, and industry) must work together to develop clear guidelines around responsible AI in general and responsible LLMs in particular, given their centrality in many downstream language applications.



Table 1: Model architecture details. We report the number of layers (#L), number of attention heads (#H), and the embedding size (d_model). We also report the peak Learning Rate (LR) and global batch size in number of tokens (Batch).


2 Method

2.1 Models

We present results on eight Transformer language models ranging from 125 million to 175 billion parameters. Architectural details are displayed in Table 1. In the interest of transparency, and to reduce risk of training instabilities, our models and hyperparameters largely follow Brown et al. (2020), with variations in batch size mostly to obtain increased computational efficiency.


2.2 Training Setup

For weight initialization, we follow the same settings provided in the Megatron-LM codebase, using a normal distribution with zero mean and standard deviation of 0.006. Standard deviation for output layers is scaled by a 1.0/√(2L) term where L is the total number of layers. All bias terms are initialized as 0, and all models are trained with ReLU activation and a sequence length of 2048.
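As a small worked example of the initialization rule above, the sketch below computes the standard deviation for regular weights and for output-layer weights. The function name `init_std` is hypothetical, and the layer count of 96 is an assumption used only for illustration, since Table 1 is not reproduced here.

```python
import math


def init_std(num_layers: int, base_std: float = 0.006, output_layer: bool = False) -> float:
    """Standard deviation for a zero-mean normal weight initialization.

    Output layers use base_std scaled by 1 / sqrt(2 * L), where L is the
    total number of layers; all other weights use base_std unchanged.
    """
    if output_layer:
        return base_std / math.sqrt(2.0 * num_layers)
    return base_std


# Illustration with an assumed depth of 96 layers.
print(init_std(96))                     # regular weights
print(init_std(96, output_layer=True))  # output-layer weights
```

With deeper models the output-layer standard deviation shrinks, which keeps the variance added by each layer's output from growing with depth.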

We use an AdamW optimizer (Loshchilov and Hutter, 2017) with (β1, β2) set to (0.9, 0.95), and weight decay of 0.1. We follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in our smaller baselines, and decaying down to 10% of the maximum LR over 300B tokens. A number of mid-flight changes to LR were also required (see Section 2.5). Our batch sizes range from 0.5M to 4M depending on the model size (see Table 1) and are kept constant throughout the course of training.
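The schedule described above can be sketched as a function of tokens seen. This is a minimal illustration, not the metaseq implementation: the function name and the peak learning rate are hypothetical, and it assumes the decay ends 300B tokens after the start of training (the text does not say whether the 300B count includes warmup).

```python
def learning_rate(tokens_seen: float, peak_lr: float,
                  warmup_tokens: float = 375e6,
                  decay_tokens: float = 300e9,
                  final_ratio: float = 0.1) -> float:
    """Linear warmup from 0 to peak_lr, then linear decay to final_ratio * peak_lr."""
    if tokens_seen < warmup_tokens:
        # Warmup phase: scale linearly from 0 up to the peak.
        return peak_lr * tokens_seen / warmup_tokens
    if tokens_seen >= decay_tokens:
        # After the decay window the rate stays at its floor.
        return peak_lr * final_ratio
    # Decay phase: interpolate between peak_lr and the floor.
    progress = (tokens_seen - warmup_tokens) / (decay_tokens - warmup_tokens)
    return peak_lr * (1.0 - (1.0 - final_ratio) * progress)


PEAK_LR = 1.0e-4  # hypothetical value, for illustration only
for tokens in (0.0, 375e6, 150e9, 300e9):
    print(tokens, learning_rate(tokens, PEAK_LR))
```

The mid-flight learning rate changes mentioned in the text would override this schedule by hand, so the empirical schedule of the 175B run does not follow this curve exactly.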

We use a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We clip gradient norms at 1.0, except for some mid-flight changes that reduce this threshold down from 1.0 to 0.3 (see Section 2.5). We also include a gradient predivide factor to reduce the risk of over/underflows when computing the gradient across all ranks (splitting the division by the world size of N into two division operations by √N).
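To make the gradient predivide factor concrete, the sketch below simulates averaging one gradient value across N ranks. It is an illustration only: the all-reduce is simulated with Python's `sum`, and the function name is hypothetical.

```python
import math


def averaged_gradient(per_rank_grads):
    """Average one gradient value across ranks using a sqrt(N) pre-divide.

    Dividing by sqrt(N) before the sum and again after it yields the same
    mean as one division by N. The intermediate sum is smaller than with a
    pure post-divide, and the summands are larger than with a full
    pre-divide, which lowers the risk of overflow and underflow.
    """
    world_size = len(per_rank_grads)
    predivide = math.sqrt(world_size)
    # Step 1: every rank scales its local gradient down by sqrt(N).
    scaled = [g / predivide for g in per_rank_grads]
    # Step 2: the all-reduce sums the scaled gradients (simulated here).
    reduced = sum(scaled)
    # Step 3: divide by the remaining sqrt(N) factor.
    return reduced / predivide


print(averaged_gradient([1.0, 2.0, 3.0, 4.0]))
```

In exact arithmetic the result equals the plain mean; the benefit only shows up in low-precision formats such as FP16, where the range of representable values is narrow.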




2.3 Pre-training Corpus

The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and Reddit (Baumgartner et al., 2020; Roller et al., 2021). All corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via CommonCrawl.

We removed duplicated documents across all datasets by filtering out documents via MinhashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity ≥ .95. We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing.
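The de-duplication idea can be sketched with a small MinHash estimate of Jaccard similarity. This is not the authors' pipeline: the shingle size, the number of hash functions, and all function names are assumptions, and the LSH banding step that makes the search scale is omitted, so the loop below compares every document against every kept document.

```python
import hashlib


def shingles(text, k=5):
    """Set of overlapping k-word shingles for a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_perm=128):
    """One minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for s in items))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.95):
    """Keep a document only if it is below the threshold against all kept ones."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the river bank",
    "large language models are trained on massive text collections today",
]
print(len(deduplicate(docs)))
```

A production setup would bucket signatures with LSH so that only candidate pairs are compared, rather than using the quadratic comparison shown here.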

We tokenize all corpora using the GPT-2 byte level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019; Brown et al., 2020). Our final corpus contains roughly 180B tokens.





RoBERTa

We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CC-News v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).


The Pile

We included a subset of the Pile (Gao et al., 2021a), including: CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were eliminated as we found they increased the risk of instabilities, as measured by tendency to cause spikes in gradient norms at the 1.3B scale, or were otherwise deemed unsuitable. All subsets went through additional ad-hoc whitespace normalization.

Reddit

We included a subset of the corpus produced by Baumgartner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees into language-model-accessible documents, we extracted the longest chain of comments in each thread and discarded all other paths in the tree. This reduced the corpus by about 66%.
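The thread-flattening step can be illustrated with a short recursive function. The nested-dictionary layout (`text`, `replies`) is a hypothetical data format, and the sketch assumes "longest" means the chain with the most comments; the source does not say whether length is counted in comments or in tokens.

```python
def longest_chain(comment):
    """Return the texts along the longest root-to-leaf path of a comment tree.

    Ties are broken in favor of the reply that appears first.
    """
    best = []
    for reply in comment.get("replies", []):
        candidate = longest_chain(reply)
        if len(candidate) > len(best):
            best = candidate
    return [comment["text"]] + best


thread = {
    "text": "root post",
    "replies": [
        {"text": "short reply", "replies": []},
        {"text": "reply", "replies": [
            {"text": "follow-up", "replies": []},
        ]},
    ],
}
# Only the deepest path survives; the "short reply" branch is discarded.
print(longest_chain(thread))
```

Keeping a single path per thread yields one linear document per conversation, at the cost of dropping every side branch.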


2.4 Training Efficiency

We trained OPT-175B on 992 80GB A100 GPUs, by utilizing Fully Sharded Data Parallel (Artetxe et al., 2021) with Megatron-LM Tensor Parallelism (Shoeybi et al., 2019). We achieve utilization of up to 147 TFLOP/s per GPU. We keep Adam state in FP32, since we shard it across all hosts, while the model weights remained in FP16. To avoid underflows, we used dynamic loss scaling, as described in Micikevicius et al. (2017).
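Dynamic loss scaling can be sketched as a small state machine: the loss is multiplied by a scale before the backward pass, the scale is cut when gradients overflow, and it is raised again after a run of clean steps. The class name and all constants below are assumptions for illustration, not the values used in metaseq.

```python
class DynamicLossScaler:
    """Minimal dynamic loss scaler (illustrative constants)."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Adjust the scale; return True if the optimizer step should be applied."""
        if found_overflow:
            # Inf/NaN gradients: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self.good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=8.0, growth_interval=2)
print(scaler.update(found_overflow=True), scaler.scale)
print(scaler.update(found_overflow=False), scaler.scale)
print(scaler.update(found_overflow=False), scaler.scale)
```

Repeated overflows drive the scale toward zero under this rule, which is why a collapsing loss scale is a useful warning sign for an unstable run.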


2.5 Training Processes

Here we describe significant training process adjustments that arose during OPT-175B pre-training.


Hardware Failures

We faced a significant number of hardware failures in our compute cluster while training OPT-175B. In total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months. During manual restarts, the training run was paused, and a series of diagnostics tests were conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was resumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.

Loss Divergences

Loss divergences were also an issue in our training run. When the loss diverged, we found that lowering the learning rate and restarting from an earlier checkpoint allowed for the job to recover and continue training. We noticed a correlation between loss divergence, our dynamic loss scalar crashing to 0, and the l2-norm of the activations of the final layer spiking. These observations led us to pick restart points for which our dynamic loss scalar was still in a “healthy” state (≥ 1.0), and after which our activation norms would trend downward instead of growing unboundedly. Our empirical LR schedule is shown in Figure 1. Early in training, we also noticed that lowering gradient clipping from 1.0 to 0.3 helped with stability; see our released logbook for exact details. Figure 2 shows our validation loss with respect to training iterations.

Other Mid-flight Changes

We conducted a number of other experimental mid-flight changes to handle loss divergences. These included: switching to vanilla SGD (optimization plateaued quickly, and we reverted back to AdamW); resetting the dynamic loss scalar (this helped recover some but not all divergences); and switching to a newer version of Megatron (this reduced pressure on activation norms and improved throughput).


Figure 2: Validation Perplexity. Our mid-flight LR changes had clear effects on validation perplexity.



3.1 Prompting & Few-Shot

We evaluate our model on 16 standard NLP tasks utilized in the literature: HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016), PIQA (Bisk et al., 2020), ARC Easy and Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), WinoGrad (Levesque et al., 2011), WinoGrande (Sakaguchi et al., 2020), and SuperGLUE (Wang et al., 2019). We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022).

We report performance in accuracy (omitting F1 for MultiRC and ReCoRD for consistency in evaluation metrics). For the Winograd Schema Challenge (WSC) task in the SuperGLUE benchmark, we follow Brown et al. (2020) and formulate the task as multiple choice questions, which is known to affect performance (Liu et al., 2020).



Zero-shot

Overall average zero-shot performance across all 14 tasks may be seen in Figure 3. Overall, we see our average performance follows the trend of GPT-3. However, performance can vary radically across the tasks: for a full breakdown, see Appendix A. Note that we intentionally removed MultiRC and WIC from these averages, as these datasets seem to systematically favor GPT-3 or OPT disproportionately.


Our performance roughly matched GPT-3 for 10 tasks, and underperformed in 3 tasks (ARC Challenge and MultiRC). In 3 tasks (CB, BoolQ, WSC), we find both GPT and OPT models display unpredictable behavior with respect to scale, likely due to the small size of the validation set in these 3 tasks (56, 277, and 104 examples, respectively). In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task. For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API within our evaluation setup, suggesting differences in the methods of evaluation on this task. For BoolQ and WSC, we note that both OPT and GPT models seem to hover around majority-class accuracy, suggesting small perturbations in probability masses may be dominating the evaluations.


Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021) perform roughly consistently with others for their parameter sizes, while PaLM (Chowdhery et al., 2022) generally performs better across all settings, even when controlling for number of parameters. We speculate the high performance of PaLM comes predominantly from higher quality and diversity of pre-training data.


One-shot and Few-shot

Average multi-shot in-context performance is shown in Figure 4 (again, omitting MultiRC and WIC), with detailed performances shown in Appendix A. Across the average of all metrics, we find that OPT models perform similarly to GPT-3 models. However, as with zero-shot, breaking down these results per task shows a different story: in the same set of 10 datasets as zero-shot, we see similar performance across the two models. Some of the remaining datasets show inconsistent performance with respect to model size for both OPT and GPT-3 models (BoolQ, CB, WSC, RTE). In MultiRC, we consistently see underperformance of OPT models compared to GPT-3 models. Similar to our zero-shot evaluation, we hypothesize our one- and few-shot evaluation setup may differ significantly from Brown et al. (2020).



3.2 Dialogue

Given that LLMs are known to be an integral component of modern dialogue models (Adiwardana et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), we additionally evaluate OPT-175B on several open source dialogue datasets. In particular, we follow Roller et al. (2021), and evaluate on ConvAI2 (Dinan et al., 2020b), Wizard of Wikipedia (Dinan et al., 2019b), Empathetic Dialogues (Rashkin et al., 2019), and Blended Skill Talk (Smith et al., 2020). We additionally evaluate on the more recent Wizard of Internet dataset (Komeili et al., 2021). We focus our comparisons primarily against existing open source dialogue models including the fine-tuned BlenderBot 1 (Roller et al., 2021) and its pre-training counterpart Reddit 2.7B. We also compare against the fine-tuned R2C2 BlenderBot, a 2.7B parameter BlenderBot-like model trained by Shuster et al. (2022).


We report Perplexity and Unigram F1 (UF1) overlap, following the metrics of the ConvAI2 competition (Dinan et al., 2020b). To control for different tokenization in each of the models, we normalize all perplexities to be in the space of the GPT-2 tokenizer (Radford et al., 2019). We also note which models are supervised with respect to these dialogue tasks and which are unsupervised. For OPT-175B, all generations are performed using greedy decoding up to a maximum of 32 tokens. We do not attempt to prompt the model at all except for alternating “Person 1:” and “Person 2:” lines of dialogue. The remaining models use the generation parameters found in BlenderBot 1.
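Unigram F1 measures word overlap between a generated reply and the reference reply. The sketch below is a simplified version that lowercases and splits on whitespace; the evaluation code used for the paper may normalize text differently (for example by stripping punctuation), so treat the tokenization here as an assumption.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall between two strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most as often
    # as it occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(unigram_f1("the cat sat", "the cat ran"))
```

Because the metric only counts shared words, a fluent reply that is phrased differently from the reference can still score low, which is one reason perplexity is reported alongside it.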


Results are shown in Table 2. We see that OPT-175B significantly outperforms the also-unsupervised Reddit 2.7B model on all tasks, and performs competitively with the fully supervised BlenderBot 1 model, especially in the ConvAI2 dataset. On the Wizard-of-Internet dataset, which is fully unsupervised for all models, we see that OPT-175B obtains the lowest perplexity but still has lower UF1 than the models with Wizard-of-Wikipedia supervision.


We were somewhat surprised that the evaluations of the unsupervised OPT-175B model were as competitive as BlenderBot 1 on the ConvAI2 dataset. This may indicate leakage of the ConvAI2 dataset into the general pre-training corpus or even into the validation data as evaluated in Table 2. To address concerns of leakage, we searched our pre-training corpus for the first conversation in the ConvAI2 dataset, but we did not find any overlap. We additionally evaluated OPT-175B on the ConvAI2 hidden test set, which has never been publicly released, and achieved 10.7 ppl and .185 UF1, matching the performance of the validation set. Furthermore, we evaluated OPT-175B on a subset of the ConvAI2-like MultiSessionChat (MSC) dataset (Xu et al., 2021b) and obtained a perplexity of 9.7 and UF1 of .177, indicating the model is generalizing well across multiple PersonaChat-like datasets. Since both MSC and WoI datasets were released after the CommonCrawl snapshot used in pre-training corpus, there is minimal risk of leakage. We conclude that OPT-175B has a strong ability to maintain a consistent persona across conversations, a behavior also highlighted in LaMDA (Thoppilan et al., 2022).


4 Bias & Toxicity Evaluations

To understand the potential harm of OPT-175B, we evaluate a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation. While there may be shortcomings in these benchmarks (Blodgett et al., 2021; Jacobs and Wallach, 2021), these measurements provide a first step towards understanding the limitations of OPT-175B. We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).


4.1 Hate Speech Detection

Using the ETHOS dataset provided in Mollas et al. (2020) and instrumented by Chiu and Alexander (2021), we measure the ability of OPT-175B to identify whether or not certain English statements are racist or sexist (or neither). In the zero-, one-, and few-shot binary cases, the model is presented with text and asked to consider whether the text is racist or sexist and provide a yes/no response. In the few-shot multiclass setting, the model is asked to provide a yes/no/neither response.

Results are presented in Table 3. With all of our one-shot through few-shot configurations, OPT-175B performs considerably better than Davinci. We speculate this occurs from two sources: (1) evaluating via the Davinci API may be bringing in safety control mechanisms beyond the original 175B GPT-3 model used in Brown et al. (2020); and (2) the significant presence of unmoderated social media discussions in the pre-training dataset has provided additional inductive bias to aid in such classification tasks.



4.2 CrowS-Pairs

Developed for masked language models, CrowS-Pairs (Nangia et al., 2020) is a crowdsourced benchmark aiming to measure intrasentence level biases in 9 categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each example consists of a pair of sentences representing a stereotype, or anti-stereotype, regarding a certain group, with the goal of measuring model preference towards stereotypical expressions. Higher scores indicate higher bias exhibited by a model.
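The benchmark score can be sketched as the percentage of sentence pairs in which the model scores the stereotypical sentence higher than its counterpart. The input format (a list of per-pair log-likelihood tuples) and the function name are hypothetical, and how each sentence is scored by an autoregressive model is left out of this sketch.

```python
def crows_pairs_score(pair_scores):
    """Percentage of pairs where the stereotypical sentence scores higher.

    pair_scores holds (stereotypical_score, anti_stereotypical_score)
    tuples, for example sentence log-likelihoods under the model.
    """
    if not pair_scores:
        raise ValueError("pair_scores must not be empty")
    preferred = sum(1 for stereo, anti in pair_scores if stereo > anti)
    return 100.0 * preferred / len(pair_scores)


# Made-up log-likelihoods for four sentence pairs.
example = [(-10.0, -12.0), (-8.0, -7.0), (-5.0, -9.0), (-3.0, -4.0)]
print(crows_pairs_score(example))
```

Under the original benchmark's definition, a model with no preference either way would score near 50, so values well above 50 indicate a lean toward stereotypical sentences.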

When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases in almost all categories except for religion. Again, this is likely due to differences in training data; Nangia et al. (2020) showed that the Reddit corpus has a higher incidence rate for stereotypes and discriminatory text than other corpora (e.g. Wikipedia). Given this is a primary data source for OPT-175B, the model may have learned more discriminatory associations, which directly impacts its performance on CrowS-Pairs.



Table 4: CrowS-Pairs evaluation. Lower is better for all categories, indicating more fairness. The OPT-175B model performs worse than Davinci in most categories.


4.3 StereoSet

Following Lieber et al. (2021) and Artetxe et al. (2021), we use StereoSet (Nadeem et al., 2021) to measure stereotypical bias across 4 categories: profession, gender, religion, and race. In addition to intrasentence measurement (similar to CrowS-Pairs), StereoSet includes measurement at the intersentence level to test a model’s ability to incorporate additional context. To account for a potential trade-off between bias detection and language modeling capability, StereoSet includes two metrics: Language Modeling Score (LMS) and Stereotype Score (SS), which are then combined to form the Idealized Context Association Test score (ICAT). Unlike Lieber et al. (2021), we normalize scores by token count, rather than character count, which they report improves metrics for several models.
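For reference, the ICAT combination defined by the StereoSet benchmark is ICAT = LMS · min(SS, 100 − SS) / 50. The short sketch below evaluates that formula; the input values are made up for illustration and are not results from the paper.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized Context Association Test score.

    lms: Language Modeling Score in [0, 100].
    ss:  Stereotype Score in [0, 100], where 50 means no preference
         between stereotypical and anti-stereotypical continuations.
    """
    return lms * min(ss, 100.0 - ss) / 50.0


# Illustrative inputs only.
print(icat(100.0, 50.0))  # ideal model: perfect LMS and neutral SS
print(icat(80.0, 60.0))
```

The `min` term penalizes deviation from 50 in either direction, so a model that always prefers anti-stereotypes scores as poorly as one that always prefers stereotypes.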


Results are shown in Table 5. We see that Davinci and OPT-175B exhibit similar scores on aggregate (overall ICAT is very close between the two). In particular, Davinci outperforms in the areas of profession and race, while OPT-175B outperforms in the areas of gender and religion. OPT-175B performs better across the board on the SS metric, while Davinci generally outperforms on the LMS metric.


4.4 Real Toxicity Prompts

We evaluate the tendency of OPT-175B to respond with toxic language via the RealToxicityPrompts (Gehman et al., 2020) dataset. Following PaLM (Chowdhery et al., 2022), we sample 25 generations of 20 tokens using nucleus sampling (Holtzman et al., 2020) (p = 0.9) for each of 10,000 randomly sampled prompts from RTP, and report mean toxicity probabilities of the continuations, stratified across bucketed toxicities of the original prompts. For comparison, we report bucketed toxicity rates from Davinci and PaLM.
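Nucleus (top-p) sampling, the decoding method named above, restricts each sampling step to the smallest set of tokens whose cumulative probability reaches p. The sketch below works on a plain list of next-token probabilities; the function name and the toy distribution are assumptions for illustration.

```python
import random


def nucleus_sample(probs, p=0.9, rng=random):
    """Sample a token index from the top-p nucleus of a distribution.

    probs: next-token probabilities that sum to 1.
    rng:   the random module or a random.Random instance.
    """
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx in ranked:
        nucleus.append(idx)
        cumulative += probs[idx]
        if cumulative >= p:
            break
    # random.choices renormalizes the weights within the nucleus.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]


toy_probs = [0.5, 0.3, 0.15, 0.05]
# With p = 0.9 the nucleus is {0, 1, 2}; index 3 is never sampled.
print(nucleus_sample(toy_probs, p=0.9, rng=random.Random(0)))
```

Cutting off the low-probability tail removes the least likely tokens while still allowing variety, which is why repeated samples per prompt (25 in the setup above) can differ from one another.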

Results are shown in Figure 5. Overall, we see that OPT-175B has a higher toxicity rate than either PaLM or Davinci. We also observe that all 3 models have increased likelihood of generating toxic continuations as the toxicity of the prompt increases, which is consistent with the observations of Chowdhery et al. (2022). As with our experiments in hate speech detection, we suspect the inclusion of unmoderated social media texts in the pre-training corpus raises model familiarity with, and therefore propensity to generate and detect, toxic text. This strong awareness of toxic language may or may not be desirable depending on the specific requirements of downstream applications. Future applications of OPT-175B should consider this aspect of the model, and take additional mitigations, or avoid usage entirely as appropriate.



4.5 Dialogue Safety Evaluations

Finally, we compare OPT-175B on two Dialogue Safety evaluations. The first, SaferDialogues (Ung et al., 2021), measures the ability to recover from explicit safety failures, usually in the form of apologizing or recognizing its mistake. The second, the Safety Bench Unit Tests (Dinan et al., 2021), measures how unsafe a model’s response is, stratified across 4 levels of topic sensitivity: Safe, Realistic, Unsafe, and Adversarial. As with the other dialogue evaluations (Section 3.2), we compare to several existing open source dialogue models.

Results for both experiments are shown in Table 6. We observe that OPT-175B has similar performance as the Reddit 2.7B model across both SaferDialogues and the Unit Tests, with OPT-175B performing marginally better in the Safe and Adversarial settings. Consistent with Roller et al. (2021) and Xu et al. (2020), we find that the models fine-tuned on curated dialogue datasets (BlenderBot 1, R2C2) have overall lower toxicity. We conclude that future experimentation of OPT-175B for dialogue should contain explicit fine-tuning on curated datasets in order to improve the safety profile.



5 Limitations

In Sections 3.1 and 4, we carried out extensive evaluation of all released models at varying scales. We saw parity in performance for standard evaluation datasets used in the GPT-3 models. Moreover, we performed safety, bias, and inclusion evaluations, again seeing largely comparable performance with some variations in toxicity and hate speech detection. However, such evaluations may not fully characterize the complete limitations of these models. In general, we qualitatively observe that OPT-175B suffers from the same limitations noted in other LLMs (Brown et al., 2020; Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery et al., 2022; Bender et al., 2021).


In particular, we found OPT-175B does not work well with declarative instructions or point-blank interrogatives. Prompting with such instructions tends to produce a simulation of a dialogue beginning with such an instruction, rather than an execution of the instruction. Future work into instruction learning, in the vein of InstructGPT (Ouyang et al., 2022), may alleviate these limitations.

OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it did not eliminate it entirely when only one generation is sampled. Future work may wish to incorporate more modern strategies for reducing repetition and improving diversity, such as unlikelihood training (Welleck et al., 2020) or best-first decoding (Meister et al., 2020).

Similar to other LLMs, OPT-175B can produce factually incorrect statements (Adiwardana et al., 2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022; Thoppilan et al., 2022). This can be particularly harmful in applications where information accuracy is critical, such as healthcare and scientific discovery (Weidinger et al., 2021b). Recently, several efforts have reported that retrieval-augmented models can improve factual correctness of LLMs (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud et al., 2021; Shuster et al., 2022; Nakano et al., 2021). We believe OPT-175B will also benefit from retrieval-augmentation in future iterations.




As shown in Section 4, we also find OPT-175B has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman et al., 2020), and adversarial prompts are trivial to find (Dinan et al., 2021). There has been a great deal of work on mitigations for toxicity and biases (Dathathri et al., 2019; Dinan et al., 2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al., 2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick et al., 2021; Ouyang et al., 2022). Depending on downstream applications, future uses of OPT-175B may need to employ these or novel mitigation approaches, especially before any real world deployment. Given our primary goal as a replication of GPT-3, we choose not to apply these mitigations in this first release.


In summary, we still believe this technology is premature for commercial deployment. Despite including data sheets and model cards, we believe more scrutiny should be afforded to the training data with additional data characterization and selection criteria in order to use data responsibly. The current practice is to feed the model with as much data as possible and minimal selection within these datasets. Despite having comprehensive evaluations, we would ideally have more streamlined and consistent evaluation setups to ensure replicability and reproducibility of evaluation scenarios. Differences in prompting styles and number of shots for in-context learning could create variations that lead to different results. We hope that the public release of the OPT models will enable many more researchers to work on these important issues.
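The point about prompting styles and shot counts can be illustrated with a small sketch. The helper `build_prompt`, the two templates, and the demonstration pairs below are all hypothetical and are not taken from the OPT evaluation code; they only show how the same query turns into different model inputs depending on these choices.

```python
def build_prompt(examples, query, num_shots,
                 template="Question: {q}\nAnswer: {a}"):
    """Assemble an in-context prompt from the first num_shots demonstrations,
    followed by the unanswered query in the same template."""
    shots = examples[:num_shots]
    blocks = [template.format(q=q, a=a) for q, a in shots]
    # The final block leaves the answer slot empty for the model to fill.
    blocks.append(template.format(q=query, a="").rstrip())
    return "\n\n".join(blocks)

# Hypothetical demonstrations and query.
demos = [("2 + 2 = ?", "4"), ("3 + 5 = ?", "8"), ("7 + 1 = ?", "8")]
query = "6 + 3 = ?"

zero_shot = build_prompt(demos, query, num_shots=0)
two_shot = build_prompt(demos, query, num_shots=2)
two_shot_alt = build_prompt(demos, query, num_shots=2,
                            template="Q: {q}\nA: {a}")
```

Here `two_shot` and `two_shot_alt` carry identical demonstrations but differ in surface form, and `zero_shot` drops the demonstrations entirely. An evaluation report that fixes and publishes both the template and the shot count removes one source of the variation described above.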


6 Considerations for Release

Following the recommendations for individual researchers generated by the Partnership for AI, along with the governance guidance outlined by NIST, we are disclosing all of the details involved in training OPT-175B through our logbook, our code, and providing researchers access to model weights for OPT-175B, along with a suite of smaller baselines mirroring the setup for OPT-175B. We aim to be fully accountable for the development lifecycle of OPT-175B, and only through increasing transparency around LLM development can we start understanding the limitations and risks of LLMs before broader deployment occurs.

By sharing a detailed account of our day-to-day training process, we disclose not only how much compute was used to train the current version of OPT-175B, but also the human overhead required when underlying infrastructure or the training process itself becomes unstable at scale. These details are generally omitted from previous publications, likely due to the inability to fully ablate changes made mid-flight (without drastically increasing the compute budget). We hope that by revealing how certain ad-hoc design decisions were made, we can improve upon these practices in the future, and collectively increase the experimental robustness in developing models at this scale.



Outside of these notes, the metaseq codebase itself is the final source of truth in many of our implementation details. By releasing our development codebase, we aim to shed light on any implementation detail that may have been omitted from being explicitly enumerated in this paper, as it is either considered a detail of standard practice in the field, or is simply a detail we failed to account for. This current codebase is also the only known open-source implementation of training a decoder-only transformer that is ≥175B parameters without the use of pipeline parallelism on NVIDIA GPUs.

To enable experimentation at 175B scale, we are providing researchers with direct access to the parameters of OPT-175B. The reasoning here is two-fold: enable Responsible AI research into LLMs while simultaneously reducing the environmental impact of pursuing research at this scale. There is a growing body of work detailing ethical and social risks from deploying language models with emergent capabilities at scale (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al., 2021; Kenton et al., 2021). By limiting access to OPT-175B to the research community with a non-commercial license, we aim to focus development efforts on quantifying the limitations of the LLMs first, before broader commercial deployment occurs.



Furthermore, there exists significant compute and carbon cost to reproduce models of this size. While OPT-175B was developed with an estimated carbon emissions footprint (CO2eq) of 75 tons, GPT-3 was estimated to use 500 tons (Patterson et al., 2021), while Gopher required 380 tons (Rae et al., 2021). These estimates are not universally reported, and the accounting methodologies for these calculations are also not standardized. In addition, model training is only one component of the overall carbon footprint of AI systems; we must also consider experimentation and eventual downstream inference cost, all of which contribute to the growing energy footprint of creating large-scale models (Wu et al., 2022). By releasing our logbook, we hope to highlight the gap between a theoretical carbon cost estimate that assumes no hardware failures or training instabilities, versus one that aims to include the entire LLM development lifecycle. We need to understand the manufacturing (or embodied) carbon of these systems (Gupta et al., 2021) as they grow increasingly more complex, and we hope that our paper can help future work in defining additional factors to consider when measuring the impact of scale on the environment.
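As a quick check on how these figures relate to the "1/7th the carbon footprint" claim in the abstract, the arithmetic can be sketched as follows. The dictionary holds only the three estimates quoted above; the helper name `relative_footprint` is an assumption for illustration, and the caveat about non-standardized accounting applies to every ratio it produces.

```python
# Estimated training footprints in tons of CO2eq, as quoted in the text above.
REPORTED_TONS_CO2EQ = {"OPT-175B": 75, "GPT-3": 500, "Gopher": 380}

def relative_footprint(model, baseline, estimates=REPORTED_TONS_CO2EQ):
    """Return the model's estimated footprint as a fraction of the baseline's."""
    return estimates[model] / estimates[baseline]

opt_vs_gpt3 = relative_footprint("OPT-175B", "GPT-3")
opt_vs_gopher = relative_footprint("OPT-175B", "Gopher")

print(f"OPT-175B / GPT-3:  {opt_vs_gpt3:.3f}")
print(f"OPT-175B / Gopher: {opt_vs_gopher:.3f}")
```

The first ratio is 75/500 = 0.15, which is close to one seventh (about 0.143), so the abstract's figure is a rounded version of these estimates. Against Gopher the ratio is just under one fifth.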


Similarly, by producing a set of baselines across a wide range of scales, we hope to enable the broader research community to study the impact and limitations of these models with respect to scale alone. As reported in Hoffmann et al. (2022), many of these LLMs may have been under-trained as a function of the amount of training data used, which implies that incorporating more data and continuing to train these baseline models may continue to improve performance. There is also evidence that step-function changes in capabilities may occur at a scale that is much smaller than 175B (Wei et al., 2021), indicating a need to examine a wider range of scales for different research applications.


7 Related Work

Since the publication of the Transformer architecture (Vaswani et al., 2017) and BERT (Devlin et al., 2019), the field of NLP has experienced a massive shift towards the use of LLMs with self-supervised pre-training. Multiple masked language models, including T5 (Raffel et al., 2020) and Megatron-LM (Shoeybi et al., 2019), have shown consistent improvements through scale. These scaling gains come not only from growing the total number of parameters in the models, but also the amount and quality of pre-training data (Liu et al., 2019b; Hoffmann et al., 2022).

Auto-regressive language models (Mikolov et al., 2009) have seen the largest growth in model size, from 117M parameters (Radford et al., 2018) to over 500B parameters (Smith et al., 2022; Chowdhery et al., 2022). The resulting massive improvement in generative fluency and quality was first characterized in GPT-2 (Radford et al., 2019) and further improved with GPT-3 (Brown et al., 2020) and later models. Although a variety of very large (over 100B parameters) generative models have now been trained (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al., 2022; Chowdhery et al., 2022), they are all closed source and accessible only internally or via paid API services. There are a few notable efforts towards open sourcing LLMs from non-profit research organizations including EleutherAI (Black et al., 2022) and BigScience. These models differ from the OPT models in pre-training data, target languages and model scale, making it possible for the community to compare different pre-training strategies.



Since Brown et al. (2020), the primary evaluation criterion for LLMs has been prompt-based (Black et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), as is also performed in this paper. This is largely due to the convenience of evaluating on many tasks without specialized task-specific fine-tuning. Prompting itself has a long history: cloze evaluations go back several decades (Chambers and Jurafsky, 2008; Mostafazadeh et al., 2016). More recently, prompting or masked infilling has been used to probe models for knowledge (Petroni et al., 2019) or perform a variety of NLP tasks (Radford et al., 2019; Brown et al., 2020). There has also been work on eliciting prompting behavior in smaller models (Schick and Schütze, 2020; Gao et al., 2021b; Li and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexibility of prompting (Shin et al., 2020), and understanding why and how prompting works (Liu et al., 2021; Min et al., 2022).


Recent efforts have shown gains by fine-tuning models to directly respond to instruction-style prompting (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022). However, effective prompt engineering remains an open research challenge. Results vary significantly and unpredictably with the selection of the prompt (Lu et al., 2021), and models do not seem to understand the prompts as fully as we expect (Webson and Pavlick, 2021). Furthermore, it is challenging to write prompts without a development set, which leads to questions about the extent to which we are actually achieving zero- or few-shot learning in practice (Perez et al., 2021). We do not attempt to address these concerns of prompting, and instead only aim to provide evaluation of OPT-175B in existing settings. However, we hope the full release of OPT-175B will enable others to better study these challenges in the future.


8 Conclusion

In this technical report, we introduced OPT, a collection of auto-regressive language models ranging in size from 125M to 175B parameters. Our goal was to replicate the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data curation and training efficiency. We described training details, evaluated performance in a number of NLP and dialogue settings, and characterized behaviors with respect to bias, toxicity and hate speech. We also described many other limitations the models have, and discussed a wide set of considerations for responsibly releasing the models. We believe the entire AI community would benefit from working together to develop guidelines for responsible LLMs, and we hope that broad access to these types of models will increase the diversity of voices defining the ethical considerations of such technologies.



We would like to thank Scott Jeschonek, Giri Anantharaman, Diego Sarina, Joaquin Colombo, Chris Bray, Stephen Roylance, Kalyan Saladi, Shubho Sengupta, and Brian O’Horo for helping to remove infrastructure blockers along the way; Percy Liang, Rishi Bommasani, and Emily Dinan for discussions on responsible release practices; Carole-Jean Wu for discussions on sustainability and carbon footprint considerations; Srini Iyer, Ramakanth Pasunuru, and Shruti Bhosale for previous contributions to evaluations; Benjamin Lefaudeux, Geeta Chauhan, Natalia Gimelshein, Horace He, and Sam Gross for discussions on performance improvement work; Emily Dinan, Carole-Jean Wu, Daniel McKinnon, and Mark Tygert for feedback on this draft; Antoine Bordes, Joelle Pineau, Mary Williamson, Necip Fazil Ayan, Armand Joulin, Sergey Edunov, Melanie Kambadur, Zornitsa Kozareva, Ves Stoyanov, Vitaliy Liptchinsky, Rahul Iyer, Jing Xu, Jason Weston, and many others for supporting this project internally.


