
LLMs之Gopher/Chinchilla:《Training Compute-Optimal Large Language Models》的翻译与解读

导读:DeepMind 2022 年发表的模型 Chinchilla(龙猫),目前在各类任务上的效果与 540B 规模的 PaLM 基本相当。Chinchilla 是为了验证作者的计算优化思路而训练的一个 70 billion(700亿)参数的语言模型。

>> 关注权衡—在给定固定的 FLOPs 预算下,如何权衡模型规模大小和训练 tokens 的数量规模—提出三种方法(固定模型大小、变化训练标记数量 / 等FLOP曲线 / 拟合参数化损失函数)。
>> 在 Chinchilla 之前的一系列大语言模型,在扩展模型参数规模的同时保持训练数据量不变,导致计算资源的浪费和大语言模型的训练不足。
>> 对于计算成本最优的训练,模型规模大小和训练 tokens 的数量应该同等比例地缩放:模型参数规模加倍时,训练 tokens 的数量也应该加倍。

>> 提出了三种不同的方法来回答研究的问题:固定模型大小、变化训练标记数量,得出模型大小和训练标记应如何随计算量的增加而扩展;固定不同的训练 FLOPs 预算(IsoFLOP),预测最佳参数数量;拟合参数化损失函数,估计最佳模型大小和训练标记数量。三种方法得出的预测结果一致表明:随着计算预算的增加,模型大小和训练数据量应以大致相等的比例增加。

>> 基于假设—训练优化的Chinchilla—扩大训练数据量来缩小模型规模(Gopher 模型的1/4):Chinchilla 的思路是给更多的数据,但是把模型规模做小。具体而言,它对标的是 Gopher 模型,Chinchilla 模型大小只有 70B,是 Gopher 的四分之一,但是付出的代价是训练数据总量,是 Gopher 的四倍,所以基本思路是通过放大训练数据量,来缩小模型规模。基于上述假设训练了计算优化模型 Chinchilla,它与 Gopher 使用相同的计算预算,但具有 70B 的参数和 4 倍多的训练数据。同时,Chinchilla 在大量下游评估任务上一致且显著优于 Gopher (280B)、GPT-3 (175B)、Jurassic-1 (178B) 和 Megatron-Turing NLG (530B)。Chinchilla 使用更少的计算来进行微调和推理,极大地促进了下游使用。Chinchilla的规模更小但训练更优化,在大量下游任务上超过Gopher和其他大模型,验证了作者的计算优化思路。但在性别偏见和有毒性分析上,Chinchilla和Gopher表现相似。

>> 基于Gopher模型的架构+4倍训练数据+稍作修改的SentencePiece tokenizer+AdamW优化器+前向/反向用bfloat16但主权重存float32=70B参数:Chinchilla 基于Gopher模型的架构,但做了以下改动:
采用与Gopher相同的transformer架构,但参数数量为70 billion,约为Gopher的四分之一;
使用与Gopher相同的MassiveText数据集进行训练,但训练标记数约为1.4万亿(1.4T)个tokens,约为Gopher的4倍;
使用稍作修改的SentencePiece tokenizer(不做NFKC归一化),对数学、化学等内容的表示更友好;
使用AdamW优化器替代Gopher的Adam优化器,有助于改善语言建模loss和下游任务表现;
在前向和反向传播时使用bfloat16来节省内存,但参数仍以float32存储于优化器状态中;
在TPUv3/TPUv4硬件上使用JAX和Haiku进行训练。

>> 提出模型大小和训练数据应该同等扩展以达到计算优化:对400多个从7000万到超过160亿参数的模型进行训练后发现,为了最佳计算效率,模型大小和训练tokens数应该同等比例扩展:模型大小每扩大一倍,训练tokens数也应该增加一倍。

(1)、根据 Chinchilla 的估计,一个 130B(1300亿参数)语言模型的最佳训练标记(tokens)数量应该在 4T 左右。
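下面用一小段 Python 做个粗略换算,说明"参数量与训练标记数等比例扩展"这一经验规则如何使用;参照点取 Chinchilla 本身(70B 参数、约 1.4T 标记),数字仅为线性外推的示意,论文中三种方法拟合出的前沿给出的具体数值会有所不同:

```python
# 以 Chinchilla(70B 参数、约 1.4T 标记)作为一个计算最优的参照点,
# 按"参数与训练标记等比例扩展"的经验规则做线性换算(仅为粗略示意)
ref_params, ref_tokens = 70e9, 1.4e12
tokens_per_param = ref_tokens / ref_params          # 约 20 个标记 / 参数

for n_params in (70e9, 130e9, 280e9, 1e12):
    print(f"{n_params:.0e} 参数 -> 约 {n_params * tokens_per_param:.1e} 个训练标记")
```

注意这只是线性外推;论文拟合出的前沿在更大计算预算下给出的"每参数标记数"会略高一些,这可以部分解释上文 4T 左右的估计为何高于简单换算出的约 2.6T。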

目录

Gopher算法简介

1、模型结构特点

2、《Scaling Language Models: Methods, Analysis & Insights from Training Gopher》论文摘要与结论

《Training Compute-Optimal Large Language Models》的翻译与解读

Abstract摘要

1、Introduction引言

2、Related Work相关工作

Large language models大型语言模型

Modelling the scaling behavior建模扩展行为

Estimating hyperparameters for large models估计大型模型的超参数

Improved model architectures改进的模型架构

3、Estimating the optimal parameter/training tokens allocation估计最佳参数/训练标记分配

3.1、Approach 1: Fix model sizes and vary number of training tokens固定模型大小,变化训练标记数量

3.2、Approach 2: IsoFLOP profiles等FLOP曲线

3.3、Approach 3: Fitting a parametric loss function拟合参数化损失函数

Model fitting模型拟合

Efficient frontier高效边界

3.4、Optimal model scaling最佳模型扩展

4、Chinchilla

4.1、Model and training details模型和训练细节

4.2、Results结果

4.2.1、Language modelling语言建模

4.2.2、MMLU大规模多任务语言理解基准测试

4.2.3、Reading comprehension阅读理解

4.2.4、BIG-bench

4.2.5、Common sense常识

Closed-book question answering闭卷问答

4.2.7、Gender bias and toxicity性别偏见和毒性

Gender bias性别偏见

Sample toxicity样本毒性

5、Discussion & Conclusion讨论与结论

6、Acknowledgements致谢


Gopher算法简介

Gopher 是 DeepMind 发布的大语言模型,参数规模达 280B。在语言模型的研究和开发过程中,DeepMind 训练了 6 个不同参数规模的系列模型,参数量包括 44M、117M、417M、1.4B、7.1B、280B(Gopher)。这些模型在 152 项不同的任务上进行了评估,在大多数任务中都实现了最先进的性能。阅读理解、事实核查和有毒语言识别等领域性能提升最大,但对于逻辑和数学推理等问题的性能提升较小。

地址

《Scaling Language Models: Methods, Analysis & Insights from Training Gopher》

论文:https://arxiv.org/abs/2112.11446

时间

2021年12月8日

作者

DeepMind

1、模型结构特点

Gopher算法结构特点:自回归Transformer架构+RMSNorm+相对位置编码+3000亿训练标记/32000词表/2048上下文窗口+Adam优化器(自适应学习率与余弦衰减)+混合精度训练
>> Gopher采用了自回归Transformer架构,并使用RMSNorm代替LayerNorm进行归一化。
>> 相对位置编码方案用于处理序列位置信息。
>> 训练和评估代码库使用JAX和Haiku构建。
>> 本文介绍了六个Transformer语言模型,参数规模从4400万到2800亿不等,其中最大的模型被称为Gopher,整个模型集合被称为Gopher家族。
>> 使用3000亿个标记进行训练,使用SentencePiece tokenizer,词表大小为32000,上下文窗口为2048个标记。
>> 采用Adam优化器,自适应学习率和余弦衰减调整学习率。随着模型规模的增加,减小最大学习率和增加每个批次的标记数。
>> 混合精度训练:为了减少内存占用和提高训练吞吐量,Gopher模型使用bfloat16数值格式进行参数和激活的计算;模型参数以bfloat16半精度保存以节省内存,但参数更新时仍使用float32副本。
>> Gopher模型的训练数据集为MassiveText,包含来自多个来源的大规模英语文本数据集,经过质量过滤、去重、去除相似文档和测试集重叠文档的处理。训练过程中对MassiveText进行子采样,并根据子集指定的采样比例进行采样,以提高下游性能。

2、《Scaling Language Models: Methods, Analysis & Insights from Training Gopher》论文摘要与结论

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

语言建模通过利用大量的人类书面知识库来更好地预测和理解世界,向智能通信系统迈出了一步。在本文中,我们对基于Transformer的语言模型在各种模型规模上的性能进行了分析,从具有数千万参数的模型到名为Gopher的2800亿参数模型。这些模型在152个不同的任务上进行了评估,并在大多数任务上取得了最先进的性能。模型规模的扩大在阅读理解、事实核查和有害语言识别等领域带来的收益最大,但在逻辑和数学推理方面的收益较少。我们对训练数据集和模型行为进行了全面的分析,涵盖了模型规模与偏见和有害性之间的交叉点。最后,我们讨论了语言模型在人工智能安全和减轻下游危害方面的应用。

The landscape of language technologies with general capabilities is progressing rapidly. Language models are a key driver of this progress, and we have shown that an emphasis on data quality and scale still yields interesting performance advances over existing work. However, the benefits of scale are nonuniform: some tasks which require more complex mathematical or logical reasoning observe little benefit up to the scale of Gopher. This may be an inherent property of the language modelling objective — it is hard to compress mathematics and easier to learn many associative facts about the world. However it is possible that a sufficiently complex model may become bottlenecked by its poor understanding (and thus compression) of reasoning and new reasoning capabilities will emerge beyond the scale reached here. Alongside the development of more powerful language models,we advocate broad development of analysis and interpretability tools to better understand model behaviour and fairness, both to guide mitigation of harms and to better inform the use of these models as a tool to scalably align artificial intelligence to societal benefit.

通用语言技术领域正在快速发展。语言模型是这一进展的关键推动因素,我们已经证明,对数据质量和规模的重视仍然能够带来有趣的性能提升,超过现有工作的成果。然而,规模的好处是不均匀的:一些需要更复杂的数学或逻辑推理的任务在Gopher的规模范围内几乎没有收益。这可能是语言建模目标的固有属性——数学很难被压缩,而学习关于世界的许多联想事实则更容易。然而,一个足够复杂的模型可能会因为其对推理的理解(因此压缩)不足而受到限制,而在此规模之外可能会出现新的推理能力。除了开发更强大的语言模型之外,我们主张广泛开发分析和可解释性工具,以更好地理解模型行为和公平性,既为了指导减轻危害,也为了更好地利用这些模型作为一种工具来可扩展地使人工智能与社会利益保持一致。

《Training Compute-Optimal Large Language Models》的翻译与解读

地址

论文:https://arxiv.org/abs/2203.15556

作者

DeepMind

时间

2022年3月29日

Abstract摘要

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

我们研究了在给定计算预算下训练Transformer语言模型的最佳模型大小和标记数量。我们发现当前的大型语言模型存在明显的训练不足,这是近期专注于扩大语言模型规模而保持训练数据量不变的结果。通过训练从7000万到超过160亿参数、使用50亿到5000亿标记的400多个语言模型,我们发现对于计算最优的训练,模型大小和训练标记数量应该等比例缩放:模型大小每加倍,训练标记数量也应该加倍。我们通过训练一个按此预测的计算最优模型Chinchilla来验证这一假设,它使用与Gopher相同的计算预算,但具有700亿参数和4倍的数据量。在大量下游评估任务中,Chinchilla一致且显著优于Gopher(2800亿)、GPT-3(1750亿)、Jurassic-1(1780亿)和Megatron-Turing NLG(5300亿)。这也意味着Chinchilla在微调和推理时使用的计算资源大大减少,极大地方便了下游使用。值得一提的是,Chinchilla在MMLU基准测试中达到了67.5%的最先进平均准确率,比Gopher提高了超过7%。

1、Introduction引言

Recently a series of Large Language Models (LLMs) have been introduced (Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022), with the largest dense language models now having over 500 billion parameters. These large autoregressive transformers (Vaswani et al., 2017) have demonstrated impressive performance on many tasks using a variety of evaluation protocols such as zero-shot, few-shot, and fine-tuning.

The compute and energy cost for training large language models is substantial (Rae et al., 2021; Thoppilan et al., 2022) and rises with increasing model size. In practice, the allocated training compute budget is often known in advance: how many accelerators are available and for how long we want to use them. Since it is typically only feasible to train these large models once, accurately estimating the best model hyperparameters for a given compute budget is critical (Tay et al., 2021).

近年来,推出了一系列的大型语言模型(LLMs)(Brown等,2020;Lieber等,2021;Rae等,2021;Smith等,2022;Thoppilan等,2022),目前最大的稠密语言模型拥有超过5000亿个参数。这些大型自回归Transformer(Vaswani等,2017)在许多任务上展示了令人印象深刻的性能,使用了各种评估协议,如零样本、少样本和微调。

训练大型语言模型的计算和能源成本是相当可观的(Rae等,2021;Thoppilan等,2022),并且随着模型规模的增大而上升。在实践中,分配的训练计算预算通常是事先已知的:有多少加速器可用,以及打算使用它们多长时间。由于这些大型模型通常只可能训练一次,准确估计给定计算预算下的最佳模型超参数至关重要(Tay等,2021)。

Kaplan et al. (2020) showed that there is a power law relationship between the number of parameters in an autoregressive language model (LM) and its performance. As a result, the field has been training larger and larger models, expecting performance improvements. One notable conclusion in Kaplan et al. (2020) is that large models should not be trained to their lowest possible loss to be compute optimal. Whilst we reach the same conclusion, we estimate that large models should be trained for many more training tokens than recommended by the authors. Specifically, given a 10× increase in computational budget, they suggest that the size of the model should increase 5.5× while the number of training tokens should only increase 1.8×. Instead, we find that model size and the number of training tokens should be scaled in equal proportions.

Following Kaplan et al. (2020) and the training setup of GPT-3 (Brown et al., 2020), many of the recently trained large models have been trained for approximately 300 billion tokens (Table 1), in line with the approach of predominantly increasing model size when increasing compute.

Kaplan等人(2020)表明,自回归语言模型(LM)中参数数量与性能之间存在幂律关系。因此,该领域一直在训练越来越大的模型,以期望获得性能改进。Kaplan等人(2020)中的一个显著结论是,大型模型不应该训练到最低可能的损失以达到计算最优。虽然我们得出了相同的结论,但我们估计大型模型的训练标记数量应该比作者建议的要多得多。具体来说,给定10倍的计算预算,他们建议模型的大小应该增加5.5倍,而训练标记数量只增加1.8倍。相反,我们发现模型大小和训练标记数量应该等比例缩放。

根据Kaplan等人(2020)和GPT-3(Brown等,2020)的训练设置,最近训练的许多大型模型通常使用约3000亿个标记进行训练(表1),这与主要是通过增加计算来增加模型大小的方法一致。

In this work, we revisit the question: Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens? To answer this question, we model the final pre-training loss L(N, D) as a function of the number of model parameters N, and the number of training tokens, D. Since the computational budget C is a deterministic function FLOPs(N, D) of the number of seen training tokens and model parameters, we are interested in minimizing L under the constraint FLOPs(N, D) = C:

The functions N_opt(C) and D_opt(C) describe the optimal allocation of a computational budget C. We empirically estimate these functions based on the losses of over 400 models, ranging from under 70M to over 16B parameters, and trained on 5B to over 400B tokens – with each model configuration trained for several different training horizons. Our approach leads to considerably different results than that of Kaplan et al. (2020). We highlight our results in Figure 1 and how our approaches differ in Section 2.

在这项工作中,我们重新审视这一问题:在固定的FLOPs预算下,如何权衡模型大小和训练标记数量?为了回答这个问题,我们将最终的预训练损失 L(N, D) 建模为模型参数数量 N 和训练标记数量 D 的函数。由于计算预算 C 是已见训练标记数量和模型参数数量的确定性函数 FLOPs(N, D),我们感兴趣的是在约束 FLOPs(N, D) = C 下最小化 L:
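原文此处的约束优化公式在转载中丢失,按上下文用 LaTeX 补写如下,其中 N、D、C 分别表示参数量、训练标记数和计算预算:

```latex
N_{\mathrm{opt}}(C),\ D_{\mathrm{opt}}(C)
  \;=\; \mathop{\arg\min}_{N,\,D\ \mathrm{s.t.}\ \mathrm{FLOPs}(N,\,D)=C} \; L(N, D)
```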

函数 N_opt(C) 和 D_opt(C) 描述了计算预算 C 的最佳分配。我们基于400多个模型的损失来经验性地估计这些函数,这些模型的参数量从不到7000万到超过160亿,训练标记数从50亿到超过4000亿,并且每个模型配置都在多个不同的训练长度上训练。我们的方法得出了与Kaplan等人(2020)明显不同的结果。我们在图1中突出显示了我们的结果,并在第2节中说明了我们的方法有何不同。

Based on our estimated compute-optimal frontier, we predict that for the compute budget used to train Gopher, an optimal model should be 4 times smaller, while being trained on 4 times more tokens. We verify this by training a more compute-optimal 70B model, called Chinchilla, on 1.4 trillion tokens. Not only does Chinchilla outperform its much larger counterpart, Gopher, but its reduced model size reduces inference cost considerably and greatly facilitates downstream uses on smaller hardware. The energy cost of a large language model is amortized through its usage for inference and fine-tuning. The benefits of a more optimally trained smaller model, therefore, extend beyond the immediate benefits of its improved performance.

根据我们估计的计算最优前沿,我们预测:对于用于训练Gopher的计算预算,最优模型的大小应缩小为其四分之一,同时训练的标记数量增加为4倍。我们通过在1.4万亿个标记上训练一个更加计算最优的700亿(70B)参数模型Chinchilla来验证这一点。Chinchilla不仅表现优于其大得多的对应模型Gopher,而且其较小的模型规模显著降低了推理成本,并极大地方便了在较小硬件上的下游使用。大型语言模型的能源成本会通过其在推理和微调中的使用而摊销。因此,训练得更优的较小模型的好处不仅限于其性能改进带来的直接好处。

2、Related Work相关工作

Large language models大型语言模型

Large language models. A variety of large language models have been introduced in the last few years. These include both dense transformer models (Brown et al., 2020; Lieber et al., 2021; Rae et al., 2021; Smith et al., 2022; Thoppilan et al., 2022) and mixture-of-expert (MoE) models (Du et al., 2021; Fedus et al., 2021; Zoph et al., 2022). The largest dense transformers have passed 500 billion parameters (Smith et al., 2022). The drive to train larger and larger models is clear—so far increasing the size of language models has been responsible for improving the state-of-the-art in many language modelling tasks. Nonetheless, large language models face several challenges, including their overwhelming computational requirements (the cost of training and inference increase with model size) (Rae et al., 2021; Thoppilan et al., 2022) and the need for acquiring more high-quality training data. In fact, in this work we find that larger, high quality datasets will play a key role in any further scaling of language models.

大型语言模型。在过去几年中,引入了各种大型语言模型。这些模型包括密集的Transformer模型(Brown等人,2020; Lieber等人,2021; Rae等人,2021; Smith等人,2022; Thoppilan等人,2022)和专家混合(MoE)模型(Du等人,2021; Fedus等人,2021; Zoph等人,2022)。最大的密集Transformer模型已经达到了5000亿个参数(Smith等人,2022)。训练越来越大的模型的动力是明确的,迄今为止,增加语言模型的大小已经在许多语言建模任务中改进了最先进的技术。然而,大型语言模型面临着几个挑战,包括巨大的计算需求(训练和推理的成本随着模型大小增加)(Rae等人,2021; Thoppilan等人,2022)和获取更多高质量训练数据的需求。实际上,在这项工作中,我们发现更大、高质量的数据集将在进一步扩大语言模型方面起到关键作用。

Modelling the scaling behavior建模扩展行为

Modelling the scaling behavior. Understanding the scaling behaviour of language models and their transfer properties has been important in the development of recent large models (Hernandez et al., 2021; Kaplan et al., 2020). Kaplan et al. (2020) first showed a predictable relationship between model size and loss over many orders of magnitude. The authors investigate the question of choosing the optimal model size to train for a given compute budget. Similar to us, they address this question by training various models. Our work differs from Kaplan et al. (2020) in several important ways. First, the authors use a fixed number of training tokens and learning rate schedule for all models; this prevents them from modelling the impact of these hyperparameters on the loss. In contrast, we find that setting the learning rate schedule to approximately match the number of training tokens results in the best final loss regardless of model size—see Figure A1. For a fixed learning rate cosine schedule to 130B tokens, the intermediate loss estimates (for D′ << 130B) are therefore overestimates of the loss of a model trained with a schedule length matching D′. Using these intermediate losses results in underestimating the effectiveness of training models on less data than 130B tokens, and eventually contributes to the conclusion that model size should increase faster than training data size as compute budget increases. In contrast, our analysis predicts that both quantities should scale at roughly the same rate. Secondly, we include models with up to 16B parameters, as we observe that there is slight curvature in the FLOP-loss frontier (see Appendix E)—in fact, the majority of the models used in our analysis have more than 500 million parameters, in contrast the majority of runs in Kaplan et al. (2020) are significantly smaller—many being less than 100M parameters.

建模扩展行为。理解语言模型的扩展行为及其迁移特性,对于最近大型模型的发展非常重要(Hernandez等人,2021;Kaplan等人,2020)。Kaplan等人(2020)首次展示了模型大小和损失之间在多个数量级上的可预测关系。作者们通过训练各种模型来研究在给定计算预算下应选择多大模型进行训练的问题,这一点与我们类似。我们的工作与Kaplan等人(2020)在几个重要方面存在不同。首先,作者们对所有模型使用了固定数量的训练标记和固定的学习率计划;这使他们无法对这些超参数对损失的影响进行建模。相反,我们发现,无论模型大小如何,将学习率计划的长度设置为大致匹配训练标记数量都会得到最佳的最终损失—参见图A1。对于一个固定地余弦衰减到130B标记的学习率计划,中间的损失估计(对于 D′ << 130B)因此高估了使用与 D′ 匹配的计划长度训练的模型的损失。使用这些中间损失会低估在少于130B标记的数据上训练模型的效果,并最终促成"随计算预算增加、模型大小应比训练数据量增长得更快"的结论。相反,我们的分析预测这两个量应以大致相同的速度扩展。其次,我们纳入了多达160亿(16B)参数的模型,因为我们观察到FLOP-损失前沿存在轻微的曲率(参见附录E)—实际上,我们分析中使用的大多数模型都具有超过5亿个参数;相比之下,Kaplan等人(2020)的大多数运行要小得多,许多运行的参数数量不到1亿。
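下面是一个余弦退火学习率的极简 Python 示意(非论文源码):衰减窗口 total_tokens 与该次训练计划使用的标记总数保持一致,并大致衰减到峰值的 1/10;其中 lr_max 等具体数值为随意假设,仅用于说明"计划长度应与训练标记数匹配"这一点。

```python
import math

def cosine_lr(tokens_seen, total_tokens, lr_max, final_ratio=0.1):
    # 余弦退火:从 lr_max 平滑衰减到 final_ratio * lr_max,
    # 衰减窗口 total_tokens 与本次训练计划见到的标记总数保持一致
    progress = min(tokens_seen / total_tokens, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return lr_max * (final_ratio + (1.0 - final_ratio) * cosine)

# 同一模型若只打算训练 130B 标记的 1/4,衰减窗口也应相应缩短,而不是沿用 130B 的计划
print(cosine_lr(3e10, 130e9, 2e-4), cosine_lr(3e10, 32.5e9, 2e-4))
```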

Recently, Clark et al. (2022) specifically looked in to the scaling properties of Mixture of Expert language models, showing that the scaling with number of experts diminishes as the model size increases—their approach models the loss as a function of two variables: the model size and the number of experts. However, the analysis is done with a fixed number of training tokens, as in Kaplan et al. (2020), potentially underestimating the improvements of branching.

最近,Clark等人(2022)专门研究了专家混合(MoE)语言模型的扩展特性,表明随着模型规模的增大,增加专家数量带来的收益会递减——他们的方法将损失建模为模型大小和专家数量这两个变量的函数。然而,与Kaplan等人(2020)一样,该分析是在固定训练标记数量的情况下进行的,可能低估了路由(branching)带来的改进。

Estimating hyperparameters for large models估计大型模型的超参数

Estimating hyperparameters for large models. The model size and the number of training tokens are not the only two parameters to choose when selecting a language model and a procedure to train it. Other important factors include learning rate, learning rate schedule, batch size, optimiser, and width-to-depth ratio. In this work, we focus on model size and the number of training steps, and we rely on existing work and provided experimental heuristics to determine the other necessary hyperparameters. Yang et al. (2021) investigates how to choose a variety of these parameters for training an autoregressive transformer, including the learning rate and batch size. McCandlish et al. (2018) finds only a weak dependence between optimal batch size and model size. Shallue et al. (2018); Zhang et al. (2019) suggest that using larger batch-sizes than those we use is possible. Levine et al. (2020) investigates the optimal depth-to-width ratio for a variety of standard model sizes. We use slightly less deep models than proposed as this translates to better wall-clock performance on our hardware.

估计大型模型的超参数。模型大小和训练标记数量不是在选择语言模型和训练程序时需要选择的唯一两个参数。其他重要因素包括学习率、学习率计划、批大小、优化器和宽度-深度比。在这项工作中,我们专注于模型大小和训练步骤的数量,并依靠现有的工作和提供的实验启发来确定其他必要的超参数。Yang等人(2021)研究了如何选择各种参数来训练自回归Transformer,包括学习率和批大小。McCandlish等人(2018)发现最佳批大小与模型大小之间的依赖关系很弱。Shallue等人(2018);Zhang等人(2019)建议可以使用比我们使用的更大的批大小。Levine等人(2020)研究了各种标准模型大小的最佳深度-宽度比。我们使用的模型深度稍低于提议的深度,因为这可以在我们的硬件上获得更好的墙钟性能。

Improved model architectures改进的模型架构

Improved model architectures. Recently, various promising alternatives to traditional dense transformers have been proposed. For example, through the use of conditional computation large MoE models like the 1.7 trillion parameter Switch transformer (Fedus et al., 2021), the 1.2 Trillion parameter GLaM model (Du et al., 2021), and others (Artetxe et al., 2021; Zoph et al., 2022) are able to provide a large effective model size despite using relatively fewer training and inference FLOPs. However, for very large models the computational benefits of routed models seems to diminish (Clark et al., 2022). An orthogonal approach to improving language models is to augment transformers with explicit retrieval mechanisms, as done by Borgeaud et al. (2021); Guu et al. (2020); Lewis et al. (2020). This approach effectively increases the number of data tokens seen during training (by a factor of ∼ 10 in Borgeaud et al. (2021)). This suggests that the performance of language models may be more dependant on the size of the training data than previously thought.

改进的模型架构。最近,人们提出了对传统密集Transformer的各种有希望的替代方案。例如,通过使用条件计算,大型MoE模型如1.7万亿参数的Switch Transformer(Fedus等人,2021)、1.2万亿参数的GLaM模型(Du等人,2021)以及其他模型(Artetxe等人,2021;Zoph等人,2022)能够在使用相对较少的训练和推理FLOPs的情况下提供较大的有效模型规模。然而,对于非常大的模型,路由模型的计算优势似乎会减弱(Clark等人,2022)。改进语言模型的另一种正交方法是为Transformer增加显式的检索机制,如Borgeaud等人(2021)、Guu等人(2020)、Lewis等人(2020)所做的那样。这种方法有效地增加了训练过程中所见的数据标记数量(在Borgeaud等人(2021)中增加了约10倍)。这表明语言模型的性能可能比以前认为的更加依赖于训练数据的规模。

3、Estimating the optimal parameter/training tokens allocation估计最佳参数/训练标记分配

We present three different approaches to answer the question driving our research: Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens? In all three cases we start by training a range of models varying both model size and the number of training tokens and use the resulting training curves to fit an empirical estimator of how they should scale. We assume a power-law relationship between compute and model size as done in Clark et al. (2022); Kaplan et al. (2020), though future work may want to include potential curvature in this relationship for large model sizes. The resulting predictions are similar for all three methods and suggest that parameter count and number of training tokens should be increased equally with more compute—with proportions reported in Table 2. This is in clear contrast to previous work on this topic and warrants further investigation.

我们提出了三种不同的方法来回答推动我们研究的问题:在固定的FLOPs预算下,如何权衡模型大小和训练标记的数量?在这三种方法中,我们首先训练一系列模型大小和训练标记数量各不相同的模型,并利用得到的训练曲线拟合一个经验估计器来描述它们应如何扩展。与Clark等人(2022)和Kaplan等人(2020)的做法相同,我们假设计算量和模型大小之间存在幂律关系,尽管未来的研究可能需要考虑这一关系在大模型规模下可能存在的曲率。三种方法得到的预测是相似的,均表明参数数量和训练标记数量应随计算量的增加而等比例增加——具体比例见表2。这与以往关于这一问题的工作形成了明显的对比,值得进一步研究。

3.1、Approach 1: Fix model sizes and vary number of training tokens固定模型大小,变化训练标记数量

In our first approach we vary the number of training steps for a fixed family of models (ranging from 70M to over 10B parameters), training each model for 4 different number of training sequences. From these runs, we are able to directly extract an estimate of the minimum loss achieved for a given number of training FLOPs. Training details for this approach can be found in Appendix D.

For each parameter count N we train 4 different models, decaying the learning rate by a factor of 10× over a horizon (measured in number of training tokens) that ranges by a factor of 16×. Then, for each run, we smooth and then interpolate the training loss curve. From this, we obtain a continuous mapping from FLOP count to training loss for each run. Then, for each FLOP count, we determine which run achieves the lowest loss. Using these interpolants, we obtain a mapping from any FLOP count C, to the most efficient choice of model size N and number of training tokens D such that FLOPs(N, D) = C. At 1500 logarithmically spaced FLOP values, we find which model size achieves the lowest loss of all models along with the required number of training tokens. Finally, we fit power laws to estimate the optimal model size and number of training tokens for any given amount of compute (see the center and right panels of Figure 2), obtaining a relationship N_opt ∝ C^a and D_opt ∝ C^b. We find that a = 0.50 and b = 0.50—as summarized in Table 2. In Section D.4, we show a head-to-head comparison at 10^21 FLOPs, using the model size recommended by our analysis and by the analysis of Kaplan et al. (2020)—using the model size we predict has a clear advantage.

在我们的第一种方法中,我们对一组固定模型(参数范围从70M到超过10B)的训练步骤进行了变化,对每个模型进行了4种不同数量的训练序列。通过这些运行,我们能够直接提取出在给定的训练FLOPs数量下实现的最低损失的估计值。有关这种方法的训练细节可以在附录D中找到。

对于每个参数数量 N,我们训练4个不同的模型,学习率在一个训练窗口(以训练标记数衡量,不同运行之间相差达16倍)内衰减10倍。然后,对于每次运行,我们对训练损失曲线进行平滑和插值,从而得到从FLOP数到该次运行训练损失的连续映射。接着,对于每个FLOP数,我们确定哪次运行达到了最低损失。利用这些插值结果,我们得到一个映射:对任意FLOP数 C,给出最高效的模型大小 N 和训练标记数 D,使得 FLOPs(N, D) = C。在1500个对数间隔的FLOP值上,我们找出所有模型中达到最低损失的模型大小以及所需的训练标记数。最后,我们拟合幂律来估计任意给定计算量下的最佳模型大小和训练标记数量(见图2的中间和右侧面板),得到关系 N_opt ∝ C^a 和 D_opt ∝ C^b。我们发现 a = 0.50、b = 0.50,如表2所总结。在D.4节中,我们展示了在 10^21 FLOPs 下的一对一比较,分别使用我们的分析和Kaplan等人(2020)的分析所推荐的模型大小——使用我们预测的模型大小具有明显优势。
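下面用 Python 勾勒方法一最后一步"拟合幂律"的做法;数组 C 和 N_opt 是编造的示意数据(真实数值需要从 400 多次训练运行的损失曲线下包络中提取),并沿用 FLOPs ≈ 6·N·D 的近似:

```python
import numpy as np

# 示意数据:每个 FLOP 预算 C 下损失最低的模型规模 N_opt(仅为演示拟合步骤而编造)
C     = np.array([1e18, 1e19, 1e20, 1e21])
N_opt = np.array([4e8, 1.3e9, 4e9, 1.3e10])
D_opt = C / (6 * N_opt)                      # 采用近似 FLOPs ≈ 6·N·D

# 在 log-log 空间做线性拟合,斜率即幂律指数:N_opt ∝ C^a, D_opt ∝ C^b
a, _ = np.polyfit(np.log10(C), np.log10(N_opt), 1)
b, _ = np.polyfit(np.log10(C), np.log10(D_opt), 1)
print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")           # 论文方法一报告 a = 0.50, b = 0.50
```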

3.2、Approach 2: IsoFLOP profiles等FLOP曲线

In our second approach we vary the model size for a fixed set of 9 different training FLOP counts (ranging from 6 × 10^18 to 3 × 10^21 FLOPs), and consider the final training loss for each point. This is in contrast with Approach 1 that considered points (N, D, L) along the entire training runs. This allows us to directly answer the question: For a given FLOP budget, what is the optimal parameter count?

在我们的第二种方法中,我们针对一组固定的9个不同训练FLOP预算(范围从 6 × 10^18 到 3 × 10^21 FLOPs)改变模型大小,并只考虑每个点的最终训练损失;这与方法1形成对比,方法1考虑的是整个训练过程中的点 (N, D, L)。这使我们能够直接回答问题:在给定的FLOP预算下,最佳参数数量是多少?

For each FLOP budget, we plot the final loss (after smoothing) against the parameter count in Figure 3 (left). In all cases, we ensure that we have trained a diverse enough set of model sizes to see a clear minimum in the loss. We fit a parabola to each IsoFLOPs curve to directly estimate at what model size the minimum loss is achieved (Figure 3 (left)). As with the previous approach, we then fit a power law between FLOPs and loss-optimal model size and number of training tokens, shown in Figure 3 (center, right). Again, we fit exponents of the form N_opt ∝ C^a and D_opt ∝ C^b and we find that a = 0.49 and b = 0.51—as summarized in Table 2.

对于每个FLOP预算,我们在图3(左)中绘制最终损失(平滑后)与参数数量的关系。在所有情况下,我们都确保训练了足够多样的模型大小,以便看到损失的明显最小值。我们对每条IsoFLOP曲线拟合一条抛物线,以直接估计损失最小值对应的模型大小(图3左)。与前一方法一样,我们随后在FLOPs与损失最优的模型大小及训练标记数量之间拟合幂律,如图3(中、右)所示。同样,我们拟合形式为 N_opt ∝ C^a 和 D_opt ∝ C^b 的指数,并发现 a = 0.49、b = 0.51,如表2所总结。
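方法二的核心一步——在固定 FLOP 预算下对 (log N, 损失) 拟合抛物线并取顶点——可以用几行 Python 表示;以下数据同样是编造的示意值:

```python
import numpy as np

# 某个固定 FLOP 预算下的示意 IsoFLOP 数据:不同模型规模 N 对应的最终训练损失
log_N = np.log10(np.array([4e8, 1e9, 2.5e9, 6e9, 1.5e10]))
loss  = np.array([2.40, 2.31, 2.28, 2.31, 2.40])

# 对 (log N, loss) 拟合抛物线,顶点位置即该预算下损失最低的模型规模
c2, c1, c0 = np.polyfit(log_N, loss, 2)
log_N_opt = -c1 / (2 * c2)
print(f"该预算下最优参数量约为 {10 ** log_N_opt:.2e}")
```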

3.3、Approach 3: Fitting a parametric loss function拟合参数化损失函数

Lastly, we model all final losses from experiments in Approach 1 & 2 as a parametric function of model parameter count and the number of seen tokens. Following a classical risk decomposition (see Section D.2), we propose the following functional form

最后,我们将方法1和方法2中的所有最终损失都建模为模型参数数量和观察到的标记数量的参数函数。按照经典的风险分解(详见D.2节),我们提出以下的函数形式:
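原文此处的参数化损失公式在转载中缺失,按下文对三项的描述补出如下(E、A、B、α、β 为待拟合的常数,N、D 仍分别表示参数量与训练标记数):

```latex
\hat{L}(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```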

The first term captures the loss for an ideal generative process on the data distribution, and should correspond to the entropy of natural text. The second term captures the fact that a perfectly trained transformer with N parameters underperforms the ideal generative process. The final term captures the fact that the transformer is not trained to convergence, as we only make a finite number of optimisation steps, on a sample of the dataset distribution.

第一项刻画了理想生成过程在数据分布上的损失,应对应于自然文本的熵。第二项刻画了这样一个事实:一个训练得很完美、具有 N 个参数的Transformer,其表现仍不及理想生成过程。最后一项刻画了Transformer没有训练到收敛的事实,因为我们只在数据集分布的一个样本上进行了有限次数的优化步骤。

Model fitting模型拟合

Model fitting. To estimate (A, B, E, α, β), we minimize the Huber loss (Huber, 1964) between the predicted and observed log loss using the L-BFGS algorithm (Nocedal, 1980):

We account for possible local minima by selecting the best fit from a grid of initialisations. The Huber loss (δ = 10^−3) is robust to outliers, which we find important for good predictive performance over held-out data points. Section D.2 details the fitting procedure and the loss decomposition.

为了估计 (A, B, E, α, β),我们使用L-BFGS算法(Nocedal, 1980)最小化预测的对数损失与观察到的对数损失之间的Huber损失(Huber, 1964):

我们通过从一组网格初始值中选择最佳拟合来应对可能的局部极小值。Huber损失(δ = 10^−3)对离群点具有鲁棒性,我们发现这对于在保留(held-out)数据点上获得良好的预测性能很重要。D.2节详细介绍了拟合过程和损失分解。
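下面给出一个按上述描述写的拟合草图(Python + SciPy,非论文源码):对 (log A, log B, log E, α, β) 做对数参数化,用 log-sum-exp 在对数空间计算 log L̂,再用 L-BFGS 从多个初始值出发最小化 Huber 损失;其中 N、D、L_obs 为编造的示意数据,初始值也是随意假设。

```python
import numpy as np
from scipy.optimize import minimize

def huber(r, delta=1e-3):
    # Huber 损失:残差小于 delta 时取二次,否则取线性,从而弱化离群点的影响
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))

def objective(theta, N, D, L_obs):
    # theta = (a, b, e, alpha, beta),其中 a = log A, b = log B, e = log E;
    # 对数参数化保证 A、B、E 为正,log L_hat 用 log-sum-exp 计算
    a, b, e, alpha, beta = theta
    log_L_hat = np.logaddexp(np.logaddexp(a - alpha * np.log(N),
                                          b - beta * np.log(D)), e)
    return huber(log_L_hat - np.log(L_obs)).sum()

# 示意数据:各次训练运行的参数量 N、训练标记数 D 与最终损失 L_obs(应替换为真实记录)
N     = np.array([7e7, 2e8, 7e8, 2e9, 7e9, 1.6e10])
D     = np.array([5e9, 1.5e10, 5e10, 1.5e11, 4e11, 4e11])
L_obs = np.array([3.35, 3.05, 2.79, 2.57, 2.38, 2.30])

# 从多组初始值出发,用 L-BFGS 取目标值最小的一组拟合结果,以规避局部极小
inits = [np.array([5.0, 5.0, 0.5, 0.3, 0.3]), np.array([8.0, 8.0, 0.2, 0.5, 0.5])]
best = min((minimize(objective, x0, args=(N, D, L_obs), method="L-BFGS-B")
            for x0 in inits), key=lambda r: r.fun)
print("拟合得到的 (log A, log B, log E, alpha, beta):", best.x)
```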

Efficient frontier高效边界

Efficient frontier. We can approximate the functions N_opt(C) and D_opt(C) by minimizing the parametric loss L̂ under the constraint FLOPs(N, D) ≈ 6ND (Kaplan et al., 2020). The resulting N_opt(C) and D_opt(C) balance the two terms in Equation (3) that depend on model size and data. By construction, they have a power-law form:

We show contours of the fitted function L̂ in Figure 4 (left), and the closed-form efficient computational frontier in blue. From this approach, we find that a = 0.46 and b = 0.54—as summarized in Table 2.

我们可以在约束 FLOPs(N, D) ≈ 6ND(Kaplan等人,2020)下最小化参数化损失 L̂,来近似函数 N_opt(C) 和 D_opt(C)。所得的 N_opt(C) 和 D_opt(C) 平衡了方程(3)中依赖于模型大小和数据量的两项。根据构造方式,它们具有幂律形式:

我们在图4(左)中展示了拟合函数 L̂ 的等值线,并用蓝色展示了闭式的高效计算前沿。根据这种方法,我们发现 a = 0.46、b = 0.54,如表2所总结。
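有了拟合好的参数化损失,就可以在 FLOPs ≈ 6·N·D 的约束下数值地求出 N_opt(C) 和 D_opt(C)。下面是一个简单的网格扫描示意(参数取接近论文报告的拟合值的近似数,仅供说明,如有出入以论文附录为准;Gopher 的计算预算按约 5.76 × 10^23 FLOPs 估算):

```python
import numpy as np

# 假设已由方法三拟合得到下列参数(示意数值,量级参考论文报告的拟合结果)
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def L_hat(N, D):
    return E + A / N ** alpha + B / D ** beta

def optimal_allocation(C):
    # 在约束 FLOPs ≈ 6·N·D = C 下,对 N 做对数网格扫描,取使 L_hat 最小的一点
    N = np.logspace(8, 13, 4000)
    D = C / (6 * N)
    i = int(np.argmin(L_hat(N, D)))
    return N[i], D[i]

N_opt, D_opt = optimal_allocation(5.76e23)   # 约等于 Gopher 的训练计算预算
print(f"N_opt ≈ {N_opt:.2e} 参数, D_opt ≈ {D_opt:.2e} 标记")
```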

3.4、Optimal model scaling最佳模型扩展

We find that the three approaches, despite using different fitting methodologies and different trained models, yield comparable predictions for the optimal scaling in parameters and tokens with FLOPs (shown in Table 2). All three approaches suggest that as compute budget increases, model size and the amount of training data should be increased in approximately equal proportions. The first and second approaches yield very similar predictions for optimal model sizes, as shown in Figure 1 and Figure A3. The third approach predicts even smaller models being optimal at larger compute budgets.

We note that the observed points (L, N, D) for low training FLOPs (C ≤ 10^21) have larger residuals ‖L − L̂(N, D)‖² than points with higher computational budgets. The fitted model places increased weight on the points with more FLOPs—automatically considering the low-computational budget points as outliers due to the Huber loss. As a consequence of the empirically observed negative curvature in the frontier C → N_opt (see Appendix E), this results in predicting a lower N_opt than the two other approaches.

我们发现,尽管三种方法使用不同的拟合方法和训练模型,但它们对于参数和标记的最佳扩展与FLOPs(在表2中显示)给出了可比较的预测结果。所有三种方法都表明,随着计算预算的增加,模型大小和训练数据量应以大致相等的比例增加。第一种和第二种方法对于最佳模型大小给出了非常相似的预测,如图1和图A3所示。第三种方法预测在更大的计算预算下,最优模型甚至更小。

我们注意到,低训练FLOPs(C ≤ 10^21)下的观察点 (L, N, D) 的残差 ‖L − L̂(N, D)‖² 大于高计算预算的点。由于Huber损失的作用,拟合模型在FLOPs较多的点上赋予了更大的权重,自动将低计算预算的点视为离群值。由于在前沿 C → N_opt 中经验观察到的负曲率(见附录E),这导致其预测的 N_opt 低于其他两种方法。

In Table 3 we show the estimated number of FLOPs and tokens that would ensure that a model of a given size lies on the compute-optimal frontier. Our findings suggest that the current generation of large language models are considerably over-sized, given their respective compute budgets, as shown in Figure 1. For example, we find that a 175 billion parameter model should be trained with a compute budget of 4.41 × 10^24 FLOPs and on over 4.2 trillion tokens. A 280 billion Gopher-like model is the optimal model to train given a compute budget of approximately 10^25 FLOPs and should be trained on 6.8 trillion tokens. Unless one has a compute budget of 10^26 FLOPs (over 250× the compute used to train Gopher), a 1 trillion parameter model is unlikely to be the optimal model to train. Furthermore, the amount of training data that is projected to be needed is far beyond what is currently used to train large models, and underscores the importance of dataset collection in addition to engineering improvements that allow for model scale. While there is significant uncertainty extrapolating out many orders of magnitude, our analysis clearly suggests that given the training compute budget for many current LLMs, smaller models should have been trained on more tokens to achieve the most performant model.

在表3中,我们给出了为使给定大小的模型位于计算最优前沿上所需的FLOPs和标记数量的估计值。我们的研究结果表明,考虑到各自的计算预算,当前一代的大型语言模型规模明显过大,如图1所示。例如,我们发现一个拥有1750亿参数的模型应该使用 4.41 × 10^24 FLOPs 的计算预算、在超过4.2万亿个标记上进行训练。在计算预算约为 10^25 FLOPs 的情况下,一个2800亿参数的类Gopher模型才是最优的训练选择,并且应该在6.8万亿个标记上进行训练。除非拥有 10^26 FLOPs 的计算预算(超过训练Gopher所用计算量的250倍),否则1万亿参数的模型不太可能是最佳的训练选择。此外,预计所需的训练数据量远远超过目前用于训练大型模型的数据量,这凸显了数据集收集的重要性,而不仅仅是支撑模型规模扩大的工程改进。虽然外推多个数量级存在显著的不确定性,但我们的分析清楚地表明:考虑到当前许多LLM的训练计算预算,应当用更多的标记训练更小的模型,以获得性能最佳的模型。
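可以用 C ≈ 6·N·D 这一近似(即上文高效前沿部分采用的 FLOPs 估算方式)粗略核对上面的几个数字:

```python
def approx_flops(n_params, n_tokens):
    # 论文采用的近似:训练总计算量 C ≈ 6·N·D
    return 6 * n_params * n_tokens

print(f"{approx_flops(175e9, 4.2e12):.2e}")   # ≈ 4.41e24,与正文给出的 175B / 4.2T 一致
print(f"{approx_flops(280e9, 6.8e12):.2e}")   # ≈ 1.14e25,与正文"约 10^25"的说法一致
```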

In Appendix C, we reproduce the IsoFLOP analysis on two additional datasets: C4 (Raffel et al., 2020a) and GitHub code (Rae et al., 2021). In both cases we reach the similar conclusion that model size and number of training tokens should be scaled in equal proportions.

在附录C中,我们对另外两个数据集C4(Raffel等人,2020a)和GitHub代码(Rae等人,2021)进行了IsoFLOP分析的复现。在这两种情况下,我们得出了相似的结论,即模型大小和训练标记的比例应相等地扩展。

4、Chinchilla

Based on our analysis in Section 3, the optimal model size for the Gopher compute budget is somewhere between 40 and 70 billion parameters. We test this hypothesis by training a model on the larger end of this range—70B parameters—for 1.4T tokens, due to both dataset and computational efficiency considerations. In this section we compare this model, which we call Chinchilla, to Gopher and other LLMs. Both Chinchilla and Gopher have been trained for the same number of FLOPs but differ in the size of the model and the number of training tokens.

根据我们在第3节中的分析,Gopher计算预算下的最优模型大小介于400亿到700亿参数之间。出于数据集和计算效率的考虑,我们选取了这一范围的较大一端——70B(700亿)参数——在1.4T个标记上训练了一个模型来检验这一假设。在本节中,我们将这个称为Chinchilla的模型与Gopher和其他大语言模型进行比较。Chinchilla和Gopher的训练使用了相同数量的FLOPs,但模型大小和训练标记数量不同。
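同样可以用 C ≈ 6·N·D 粗略核对"Chinchilla 与 Gopher 计算预算相同"这一说法(两者并非严格相等:6·N·D 只是近似,且 Gopher 的训练标记数按约 3000 亿取整):

```python
gopher_flops     = 6 * 280e9 * 300e9     # Gopher:约 280B 参数、约 3000 亿标记
chinchilla_flops = 6 * 70e9 * 1.4e12     # Chinchilla:70B 参数、1.4 万亿标记
print(f"{gopher_flops:.2e} vs {chinchilla_flops:.2e}")   # 约 5.0e23 vs 5.9e23,同一量级
```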

While pre-training a large language model has a considerable compute cost, downstream fine-tuning and inference also make up substantial compute usage (Rae et al., 2021). Due to being 4× smaller than Gopher, both the memory footprint and inference cost of Chinchilla are also smaller.

虽然预训练大型语言模型的计算成本相当高,但下游的微调和推理也占据了相当大的计算资源(Rae等人,2021)。由于Chinchilla比Gopher小4倍,因此Chinchilla的内存占用和推理成本也更低。

4.1、Model and training details模型和训练细节

The full set of hyperparameters used to train Chinchilla are given in Table 4. Chinchilla uses the same model architecture and training setup as Gopher with the exception of the differences listed below.

>> We train Chinchilla on MassiveText (the same dataset as Gopher) but use a slightly different subset distribution (shown in Table A1) to account for the increased number of training tokens.

>> We use AdamW (Loshchilov and Hutter, 2019) for Chinchilla rather than Adam (Kingma and Ba, 2014) as this improves the language modelling loss and the downstream task performance after finetuning.8

>> We train Chinchilla with a slightly modified SentencePiece (Kudo and Richardson, 2018) tokenizer that does not apply NFKC normalisation. The vocabulary is very similar– 94.15% of tokens are the same as those used for training Gopher. We find that this particularly helps with the representation of mathematics and chemistry, for example.

>> Whilst the forward and backward pass are computed in bfloat16, we store a float32 copy of the weights in the distributed optimiser state (Rajbhandari et al., 2020). See Lessons Learned from Rae et al. (2021) for additional details.

Chinchilla的训练所使用的全部超参数如表4所示。Chinchilla使用与Gopher相同的模型架构和训练设置,但存在以下差异:

>> 我们在MassiveText上训练Chinchilla(与Gopher相同的数据集),但使用稍微不同的子集分布(在表A1中显示),以适应增加的训练标记数量。

>> 我们使用AdamW(Loshchilov和Hutter,2019)作为Chinchilla的优化器,而不是Adam(Kingma和Ba,2014),因为这可以提高语言建模损失和微调后的下游任务性能。

>> 我们使用稍微修改过的SentencePiece(Kudo和Richardson,2018)分词器对Chinchilla进行训练,该分词器不应用NFKC归一化。词汇表非常相似,94.15%的标记与用于训练Gopher的标记相同。我们发现这对数学和化学等领域的表示特别有帮助。

>> 尽管前向和反向传播使用的是bfloat16,但我们在分布式优化器状态中存储了float32类型的权重副本。有关更多细节,请参阅Rae等人(2021)的“从中获得的经验教训”。
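针对上面"前向/反向用 bfloat16、主权重和更新保持 float32"的做法,这里给出一个极简的 JAX 示意(玩具线性模型,非 Chinchilla/Gopher 的实际实现,也未涉及分布式优化器状态的分片细节;模型结构与学习率均为随意假设):

```python
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    # 玩具线性模型:前向在 bfloat16 下计算,损失转回 float32 再求均值
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred.astype(jnp.float32) - y) ** 2)

def train_step(master_params, x, y, lr=1e-3):
    # 1) 将 float32 的"主权重"临时转成 bfloat16,用于前向/反向传播
    bf16_params = jax.tree_util.tree_map(lambda p: p.astype(jnp.bfloat16), master_params)
    grads = jax.grad(loss_fn)(bf16_params, x.astype(jnp.bfloat16), y)
    # 2) 参数更新作用在 float32 主副本上,避免低精度更新造成误差累积
    return jax.tree_util.tree_map(
        lambda p, g: p - lr * g.astype(jnp.float32), master_params, grads)

key = jax.random.PRNGKey(0)
master_params = {"w": jax.random.normal(key, (8, 1), dtype=jnp.float32),
                 "b": jnp.zeros((1,), dtype=jnp.float32)}
x = jax.random.normal(key, (16, 8))
y = jnp.ones((16, 1))
master_params = train_step(master_params, x, y)
```

关键点在于:梯度来自 bfloat16 的前向/反向计算,而参数更新始终作用在 float32 的主副本上,下一步训练前再重新转换为 bfloat16。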

In Appendix G we show the impact of the various optimiser related changes between Chinchilla and Gopher. All models in this analysis have been trained on TPUv3/TPUv4 (Jouppi et al., 2017) with JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020). We include a Chinchilla model card (Mitchell et al., 2019) in Table A8.‌

在附录G中,我们展示了Chinchilla和Gopher之间各种与优化器相关的改动的影响。本分析中的所有模型都是在TPUv3/TPUv4(Jouppi等人,2017)上使用JAX(Bradbury等人,2018)和Haiku(Hennigan等人,2020)训练的。我们在表A8中给出了Chinchilla的模型卡片(Mitchell等人,2019)。

4.2、Results结果

We perform an extensive evaluation of Chinchilla, comparing against various large language models. We evaluate on a large subset of the tasks presented in Rae et al. (2021), shown in Table 5. As the focus of this work is on optimal model scaling, we included a large representative subset, and introduce a few new evaluations to allow for better comparison to other existing large models. The evaluation details for all tasks are the same as described in Rae et al. (2021).

我们对Chinchilla进行了广泛的评估,与各种大型语言模型进行比较。我们在Rae等人(2021)中列出的一大部分任务上进行评估,具体列在表5中。由于本文的重点是最佳模型扩展,我们包括了一个大型代表性子集,并引入了一些新的评估,以便更好地与其他现有大型模型进行比较。所有任务的评估细节与Rae等人(2021)中描述的相同。

4.2.1、Language modelling语言建模

Chinchilla significantly outperforms Gopher on all evaluation subsets of The Pile (Gao et al., 2020), as shown in Figure 5. Compared to Jurassic-1 (178B) Lieber et al. (2021), Chinchilla is more performant on all but two subsets– dm_mathematics and ubuntu_irc– see Table A5 for a raw bits-per-byte comparison. On Wikitext103 (Merity et al., 2017), Chinchilla achieves a perplexity of 7.16 compared to 7.75 for Gopher. Some caution is needed when comparing Chinchilla with Gopher on these language modelling benchmarks as Chinchilla is trained on 4× more data than Gopher and thus train/test set leakage may artificially enhance the results. We thus place more emphasis on other tasks for which leakage is less of a concern, such as MMLU (Hendrycks et al., 2020) and BIG-bench (BIG-bench collaboration, 2021) along with various closed-book question answering and common sense analyses.

如图5所示,Chinchilla在The Pile(Gao等人,2020)的所有评估子集上均显著优于Gopher。与Jurassic-1(178B)(Lieber等人,2021)相比,Chinchilla在除dm_mathematics和ubuntu_irc之外的所有子集上都表现更好,具体请参见附录A5中的原始比特-每字节对比。在Wikitext103(Merity等人,2017)上,Chinchilla的困惑度为7.16,而Gopher为7.75。在这些语言建模基准测试中,比较Chinchilla和Gopher时需要注意,因为Chinchilla训练的数据量是Gopher的4倍,因此训练集和测试集的泄漏可能会人为地提高结果。因此,我们更加强调其他任务,这些任务的泄漏问题较小,如MMLU(Hendrycks等人,2020)和BIG-bench(BIG-bench合作组,2021),以及各种闭书问答和常识分析。

4.2.2、MMLU大规模多任务语言理解基准测试

The Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020) consists of a range of exam-like questions on academic subjects. In Table 6, we report Chinchilla’s average 5-shot performance on MMLU (the full breakdown of results is shown in Table A6). On this benchmark, Chinchilla significantly outperforms Gopher despite being much smaller, with an average accuracy of 67.6% (improving upon Gopher by 7.6%). Remarkably, Chinchilla even outperforms the expert forecast for June 2023 of 63.4% accuracy (see Table 6) (Steinhardt, 2021). Furthermore, Chinchilla achieves greater than 90% accuracy on 4 different individual tasks– high_school_gov_and_politics, international_law, sociology, and us_foreign_policy. To our knowledge, no other model has achieved greater than 90% accuracy on a subset.

Massive Multitask Language Understanding(MMLU)基准测试(Hendrycks等人,2020)包含了各种学术科目的类似考试的问题。在表6中,我们报告了Chinchilla在MMLU上的平均5-shot性能(完整的结果细分在表A6中显示)。尽管体积较小,但Chinchilla在这个基准测试中显著优于Gopher,平均准确率为67.6%(比Gopher提高了7.6%)。值得注意的是,Chinchilla甚至超过了2023年6月的专家预测准确率63.4%(参见表6)(Steinhardt,2021)。此外,Chinchilla在4个不同的任务上实现了超过90%的准确率,这些任务分别是high_school_gov_and_politics、international_law、sociology和us_foreign_policy。据我们所知,没有其他模型在子集上实现了超过90%的准确率。

In Figure 6, we show a comparison to Gopher broken down by task. Overall, we find that Chinchilla improves performance on the vast majority of tasks. On four tasks (college_mathematics, econometrics, moral_scenarios, and formal_logic) Chinchilla underperforms Gopher, and there is no change in performance on two tasks.

在图6中,我们展示了与Gopher按任务进行的比较。总体而言,我们发现Chinchilla在绝大多数任务上的性能有所提升。在四个任务(college_mathematics、econometrics、moral_scenarios和formal_logic)中,Chinchilla的表现不如Gopher,而在两个任务中表现不变。

4.2.3、Reading comprehension阅读理解

On the final word prediction dataset LAMBADA (Paperno et al., 2016), Chinchilla achieves 77.4% accuracy, compared to 74.5% accuracy from Gopher and 76.6% from MT-NLG 530B (see Table 7). On RACE-h and RACE-m (Lai et al., 2017), Chinchilla greatly outperforms Gopher, improving accuracy by more than 10% in both cases—see Table 7.

在最后一个词语预测数据集LAMBADA(Paperno等人,2016)上,Chinchilla的准确率达到77.4%,而Gopher和MT-NLG 530B的准确率分别为74.5%和76.6%(请参见表7)。在RACE-h和RACE-m(Lai等人,2017)上,Chinchilla显著优于Gopher,在两种情况下的准确率都提高了10%以上(请参见表7)。

4.2.4、BIG-bench

We analysed Chinchilla on the same set of BIG-bench tasks (BIG-bench collaboration, 2021) reported in Rae et al. (2021). Similar to what we observed in MMLU, Chinchilla outperforms Gopher on the vast majority of tasks (see Figure 7). We find that Chinchilla improves the average performance by 10.7%, reaching an accuracy of 65.1% versus 54.4% for Gopher. Of the 62 tasks we consider, Chinchilla performs worse than Gopher on only four—crash_blossom, dark_humor_detection,

我们对相同的BIG-bench任务集(BIG-bench合作组,2021)对Chinchilla进行了分析,该任务集在Rae等人(2021)中有报道。与我们在MMLU中观察到的情况类似,Chinchilla在绝大多数任务上优于Gopher(请参见图7)。我们发现,Chinchilla的平均性能提高了10.7%,准确率达到了65.1%,而Gopher的准确率为54.4%。在我们考虑的62个任务中,Chinchilla只在四个任务(crash_blossom、dark_humor_detection、...)上表现不如Gopher。

4.2.5、Common sense常识

We evaluate Chinchilla on various common sense benchmarks: PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), Winogrande (Sakaguchi et al., 2020), HellaSwag (Zellers et al., 2019), and BoolQ (Clark et al., 2019). We find that Chinchilla outperforms both Gopher and GPT-3 on all tasks and outperforms MT-NLG 530B on all but one task—see Table 8.

On TruthfulQA (Lin et al., 2021), Chinchilla reaches 43.6%, 58.5%, and 66.7% accuracy with 0-shot, 5-shot, and 10-shot respectively. In comparison, Gopher achieved only 29.5% 0-shot and 43.7% 10-shot accuracy. In stark contrast with the findings of Lin et al. (2021), the large improvements (14.1% in 0-shot accuracy) achieved by Chinchilla suggest that better modelling of the pre-training data alone can lead to substantial improvements on this benchmark.

我们对Chinchilla在各种常识基准测试上进行了评估:PIQA(Bisk等人,2020)、SIQA(Sap等人,2019)、Winogrande(Sakaguchi等人,2020)、HellaSwag(Zellers等人,2019)和BoolQ(Clark等人,2019)。我们发现Chinchilla在所有任务上表现优于Gopher和GPT-3,并且在除一个任务外优于MT-NLG 530B(请参见表8)。

在TruthfulQA(Lin等人,2021)上,Chinchilla在0-shot、5-shot和10-shot情况下的准确率分别达到43.6%、58.5%和66.7%。相比之下,Gopher仅在0-shot和10-shot情况下分别达到29.5%和43.7%的准确率。与Lin等人(2021)的研究结果形成鲜明对比的是,Chinchilla在0-shot准确率上取得了大幅改进(14.1%),这表明仅通过更好地对预训练数据进行建模就可以在这个基准测试上实现实质性的改进。

4.2.6、Closed-book question answering闭卷问答

Results on closed-book question answering benchmarks are reported in Table 9. On the Natural Questions dataset (Kwiatkowski et al., 2019), Chinchilla achieves new closed-book SOTA accuracies: 31.5% 5-shot and 35.5% 64-shot, compared to 21% and 28% respectively, for Gopher. On TriviaQA (Joshi et al., 2017) we show results for both the filtered (previously used in retrieval and open-book work) and unfiltered set (previously used in large language model evaluations). In both cases, Chinchilla substantially outperforms Gopher. On the filtered version, Chinchilla lags behind the open book SOTA (Izacard and Grave, 2020) by only 7.9%. On the unfiltered set, Chinchilla outperforms GPT-3—see Table 9.

在闭卷问答基准测试上的结果如表9所示。在Natural Questions数据集(Kwiatkowski等人,2019)上,Chinchilla取得了新的闭卷SOTA准确率:5-shot为31.5%,64-shot为35.5%,而Gopher分别为21%和28%。在TriviaQA(Joshi等人,2017)上,我们同时报告了筛选集(之前用于检索和开卷类工作)和非筛选集(之前用于大型语言模型评估)的结果。在两种情况下,Chinchilla的性能都远远超过Gopher。在筛选版本上,Chinchilla仅比开卷SOTA(Izacard和Grave,2020)低7.9%。在非筛选集上,Chinchilla胜过GPT-3,详见表9。

4.2.7、Gender bias and toxicity性别偏见和毒性

Large Language Models carry potential risks such as outputting offensive language, propagating social biases, and leaking private information (Bender et al., 2021; Weidinger et al., 2021). We expect Chinchilla to carry risks similar to Gopher because Chinchilla is trained on the same data,albeit with slightly different relative weights, and because it has a similar architecture. Here, we examine gender bias (particularly gender and occupation bias) and generation of toxic language. We select a few common evaluations to highlight potential issues, but stress that our evaluations are not comprehensive and much work remains to understand, evaluate, and mitigate risks in LLMs.

大型语言模型存在潜在的风险,如输出冒犯性语言、传播社会偏见和泄露个人信息(Bender等人,2021;Weidinger等人,2021)。我们预计Chinchilla与Gopher存在类似的风险,因为Chinchilla是在相同数据上训练的,尽管相对权重略有不同,并且具有类似的架构。在这里,我们将重点关注性别偏见(尤其是性别和职业偏见)和生成有害语言。我们选择了几个常见的评估来突出潜在问题,但我们强调我们的评估并不全面,还有许多工作需要理解、评估和减轻LLM中的风险。

Gender bias性别偏见

Gender bias. As discussed in Rae et al. (2021), large language models reflect contemporary and historical discourse about different groups (such as gender groups) from their training dataset, and we expect the same to be true for Chinchilla. Here, we test if potential gender and occupation biases manifest in unfair outcomes on coreference resolutions, using the Winogender dataset (Rudinger et al., 2018) in a zero-shot setting. Winogender tests whether a model can correctly determine if a pronoun refers to different occupation words. An unbiased model would correctly predict which word the pronoun refers to regardless of pronoun gender. We follow the same setup as in Rae et al. (2021) (described further in Section H.3).

As shown in Table 10, Chinchilla correctly resolves pronouns more frequently than Gopher across all groups. Interestingly, the performance increase is considerably smaller for male pronouns (increase of 3.2%) than for female or neutral pronouns (increases of 8.3% and 9.2% respectively). We also consider gotcha examples, in which the correct pronoun resolution contradicts gender stereotypes (determined by labor statistics). Again, we see that Chinchilla resolves pronouns more accurately than Gopher. When breaking up examples by male/female gender and gotcha/not gotcha, the largest improvement is on female gotcha examples (improvement of 10%). Thus, though Chinchilla uniformly overcomes gender stereotypes for more coreference examples than Gopher, the rate of improvement is higher for some pronouns than others, suggesting that the improvements conferred by using a more compute-optimal model can be uneven.

正如Rae等人(2021)中讨论的那样,大型语言模型反映了关于不同群体(如性别群体)的当代和历史话语,我们预计Chinchilla也是如此。在这里,我们使用Winogender数据集(Rudinger等人,2018)在零-shot设置下测试潜在的性别和职业偏见是否会在指代消解中产生不公平的结果。Winogender测试模型是否能正确判断代词是否指代不同的职业词。一个无偏见的模型应该能够正确预测代词指代的词,而不考虑代词的性别。我们按照Rae等人(2021)的设置进行测试(在第H.3节中进一步描述)。

如表10所示,Chinchilla在所有群体中解决代词的准确率都比Gopher高。有趣的是,对于男性代词,性能提升要小得多(增加3.2%),而对于女性代词或中性代词,性能提升分别为8.3%和9.2%。我们还考虑了“gotcha”示例,其中正确的代词解析与性别刻板印象(通过劳动统计数据确定)相矛盾。同样,我们发现Chinchilla比Gopher更准确地解析代词。当按照男性/女性性别和gotcha/not gotcha来分析示例时,最大的改进出现在女性gotcha示例上(提高了10%)。因此,尽管Chinchilla在解决代词指代问题上比Gopher更能克服性别刻板印象,但改进的速度对于某些代词而言更高,这表明使用更高计算效率的模型可能会不均匀地带来改进。

Sample toxicity样本毒性

Sample toxicity. Language models are capable of generating toxic language—including insults, hate speech, profanities and threats (Gehman et al., 2020; Rae et al., 2021). While toxicity is an umbrella term, and its evaluation in LMs comes with challenges (Welbl et al., 2021; Xu et al., 2021), automatic classifier scores can provide an indication for the levels of harmful text that a LM generates. Rae et al. (2021) found that improving language modelling loss by increasing the number of model parameters has only a negligible effect on toxic text generation (unprompted); here we analyze whether the same holds true for a lower LM loss achieved via more compute-optimal training. Similar to the protocol of Rae et al. (2021), we generate 25,000 unprompted samples from Chinchilla, and compare their PerspectiveAPI toxicity score distribution to that of Gopher-generated samples. Several summary statistics indicate an absence of major differences: the mean (median) toxicity score for Gopher is 0.081 (0.064), compared to 0.087 (0.066) for Chinchilla, and the 95th percentile scores are 0.230 for Gopher, compared to 0.238 for Chinchilla. That is, the large majority of generated samples are classified as non-toxic, and the difference between the models is negligible. In line with prior findings (Rae et al., 2021), this suggests that toxicity levels in unconditional text generation are largely independent of the model quality (measured in language modelling loss), i.e. that better models of the training dataset are not necessarily more toxic.

语言模型有能力生成有害语言,包括侮辱、仇恨言论、亵渎和威胁(Gehman等人,2020;Rae等人,2021)。尽管毒性是一个广义术语,且在LM中评估毒性面临挑战(Welbl等人,2021;Xu等人,2021),自动分类器分数仍可以为LM生成有害文本的程度提供一个指示。Rae等人(2021)发现,通过增加模型参数数量来改进语言建模损失,对(无提示的)毒性文本生成几乎没有影响;在这里,我们分析通过计算更优的训练获得更低的语言建模损失时,这一结论是否仍然成立。与Rae等人(2021)的协议类似,我们从Chinchilla生成了25,000个无提示样本,并将其PerspectiveAPI毒性评分分布与Gopher生成样本进行比较。几个摘要统计数据表明几乎没有重大差异:Gopher的平均(中位数)毒性评分为0.081(0.064),而Chinchilla为0.087(0.066);第95百分位评分Gopher为0.230,Chinchilla为0.238。也就是说,绝大多数生成的样本被分类为非有害文本,而模型之间的差异微不足道。与先前的研究结果一致(Rae等人,2021),这表明无条件文本生成中的毒性水平在很大程度上与模型质量(以语言建模损失度量)无关,即对训练数据集建模得更好的模型不一定更具毒性。

5、Discussion & Conclusion讨论与结论

The trend so far in large language model training has been to increase the model size, often without increasing the number of training tokens. The largest dense transformer, MT-NLG 530B, is now over 3× larger than GPT-3’s 170 billion parameters from just two years ago. However, this model, as well as the majority of existing large models, have all been trained for a comparable number of tokens—around 300 billion. While the desire to train these mega-models has led to substantial engineering innovation, we hypothesize that the race to train larger and larger models is resulting in models that are substantially underperforming compared to what could be achieved with the same compute budget.

We propose three predictive approaches towards optimally setting model size and training duration, based on the outcome of over 400 training runs. All three approaches predict that Gopher is substantially over-sized and estimate that for the same compute budget a smaller model trained on more data will perform better. We directly test this hypothesis by training Chinchilla, a 70B parameter model, and show that it outperforms Gopher and even larger models on nearly every measured evaluation task.

Whilst our method allows us to make predictions on how to scale large models when given additional compute, there are several limitations. Due to the cost of training large models, we only have two comparable training runs at large scale (Chinchilla and Gopher), and we do not have additional tests at intermediate scales. Furthermore, we assume that the efficient computational frontier can be described by a power-law relationship between the compute budget, model size, and number of training tokens. However, we observe some concavity in log(N_opt) at high compute budgets (see Appendix E). This suggests that we may still be overestimating the optimal size of large models. Finally, the training runs for our analysis have all been trained on less than an epoch of data; future work may consider the multiple epoch regime. Despite these limitations, the comparison of Chinchilla to Gopher validates our performance predictions, that have thus enabled training a better (and more lightweight) model at the same compute budget.

迄今为止,大型语言模型训练的趋势是增加模型大小,通常不增加训练标记的数量。最大的密集transformer MT-NLG 530B现在比仅仅两年前的GPT-3的1700亿参数大了3倍以上。然而,这个模型以及大多数现有的大型模型都是以大约3000亿个训练标记训练的。虽然渴望训练这些超大模型导致了实质性的工程创新,但我们假设追求训练越来越大的模型导致这些模型的性能明显不如在相同计算预算下可能实现的性能。

我们提出了三种基于400多次训练运行结果的预测方法,用于优化模型大小和训练持续时间。所有三种方法预测Gopher的规模过大,估计在相同计算预算下,一个更小的模型在更多数据上训练将表现更好。我们通过训练Chinchilla,一个70B参数的模型,直接测试了这个假设,并展示它在几乎每个评估任务上都优于Gopher甚至更大的模型。

尽管我们的方法允许我们在获得额外计算资源时对如何扩展大型模型做出预测,但它也存在一些局限。由于训练大型模型的成本很高,我们只有两个大规模的可比训练运行(Chinchilla和Gopher),并且没有在中间规模上进行额外的测试。此外,我们假设高效计算前沿可以用计算预算、模型大小和训练标记数量之间的幂律关系来描述。然而,我们在高计算预算下观察到 log(N_opt) 存在一定的凹性(见附录E)。这表明我们可能仍然高估了大型模型的最优规模。最后,我们分析所用的训练运行都只训练了不到一轮(epoch)的数据;未来的工作可以考虑多轮训练的情形。尽管存在这些局限,Chinchilla与Gopher的比较验证了我们的性能预测,从而使我们能够在相同计算预算下训练出一个更好(且更轻量)的模型。

Though there has been significant recent work allowing larger and larger models to be trained, our analysis suggests an increased focus on dataset scaling is needed. Speculatively, we expect that scaling to larger and larger datasets is only beneficial when the data is high-quality. This calls for responsibly collecting larger datasets with a high focus on dataset quality. Larger datasets will require extra care to ensure train-test set overlap is properly accounted for, both in the language modelling loss but also with downstream tasks.

Finally, training for trillions of tokens introduces many ethical and privacy concerns. Large datasets scraped from the web will contain toxic language, biases, and private information. With even larger datasets being used, the quantity (if not the frequency) of such information increases, which makes dataset introspection all the more important. Chinchilla does suffer from bias and toxicity but interestingly it seems less affected than Gopher. Better understanding how performance of large language models and toxicity interact is an important future research question.

尽管近年来已经进行了大量的工作,使得越来越大的模型能够进行训练,但我们的分析表明,需要更加重视数据集的扩展。我们推测,只有在数据具有高质量时,将规模扩展到更大的数据集中才会有益。这要求负责任地收集更大的数据集,并高度关注数据集的质量。更大的数据集将需要额外的注意,以确保适当地考虑到训练集和测试集的重叠,无论是在语言模型损失中还是在下游任务中。

最后,训练数万亿个标记会引入许多伦理和隐私问题。从网络上抓取的大型数据集中将包含有害语言、偏见和私人信息。随着使用更大的数据集,此类信息的数量(如果不是频率)会增加,这使得对数据集的自省变得更加重要。Chinchilla虽然存在偏见和有害性问题,但有趣的是,它似乎受到的影响比Gopher更小。更好地理解大型语言模型的性能和有害性如何相互作用是一个重要的未来研究问题。

While we have applied our methodology towards the training of auto-regressive language models, we expect that there is a similar trade-off between model size and the amount of data in other modalities. As training large models is very expensive, choosing the optimal model size and training steps beforehand is essential. The methods we propose are easy to reproduce in new settings.

虽然我们的方法论是应用于自回归语言模型的训练,但我们预计在其他形式的数据模态中,模型大小和数据量之间存在类似的权衡。由于训练大型模型非常昂贵,事先选择最佳的模型大小和训练步骤是至关重要的。我们提出的方法在新环境中易于复现。

6、Acknowledgements致谢

We’d like to thank Jean-baptiste Alayrac, Kareem Ayoub, Chris Dyer, Nando de Freitas, Demis Hassabis, Geoffrey Irving, Koray Kavukcuoglu, Nate Kushman and Angeliki Lazaridou for useful comments on the manuscript. We’d like to thank Andy Brock, Irina Higgins, Michela Paganini, Francis Song, and other colleagues at DeepMind for helpful discussions. We are also very grateful to the JAX and XLA team for their support and assistance.

我们感谢Jean-baptiste Alayrac、Kareem Ayoub、Chris Dyer、Nando de Freitas、Demis Hassabis、Geoffrey Irving、Koray Kavukcuoglu、Nate Kushman和Angeliki Lazaridou对手稿的有用评论。我们还感谢DeepMind的Andy Brock、Irina Higgins、Michela Paganini、Francis Song和其他同事的有益讨论。我们还非常感谢JAX和XLA团队的支持和帮助。
