LLMs之GLM-130B/ChatGLM-1:《GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL》翻译与解读

导读:2023年3月10日,千亿对话模型 ChatGLM 开始内测,62亿参数的 ChatGLM-6B 模型开源。

>> ChatGLM-1 = 基于GLM架构+参考ChatGPT设计思路+SFT+FB+RLHF+INT4量化技术:ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型,基于GLM架构,具有 62 亿参数,使用了和 ChatGPT 相似的技术,并针对中文问答和对话进行了优化。经过约 1T token 的中英双语训练,辅以监督微调(Supervised Fine-Tuning)、反馈自助(Feedback Bootstrap)、人类反馈强化学习(RLHF)等技术的加持,62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答。结合模型量化技术,用户可以在消费级显卡上进行本地部署(INT4 量化级别下最低只需 6GB 显存),虽然规模不及千亿模型,但大大降低了用户部署的门槛。

-----------------------------------------GLM-130B的模型特点-----------------------------------------------

>> 性能更好:性能明显优于GPT-3 175B(davinci)和中文ERNIE 3.0 Titan 260B;
>> 无需后期训练即可达到 INT4量化:无需后期训练即可达到 INT4 量化,且几乎没有性能损失;

>> 平民化GPU运行:GLM-130B 能够在 4×RTX 3090(24G)或 8×RTX 2080 Ti(11G)的 GPU 上进行有效推理,是迄今使用 100B 级模型所需的最实惠的 GPU 配置;它的设计目标是支持在一台 A100(40G×8)或 V100(32G×8)服务器上对千亿规模参数的模型进行推理。
-----------------------------------------GLM-130B的模型数字-----------------------------------------------

>> 130B参数+400B语料+基于Transformer(70层)+2048序列长度+150K分词器:GLM-130B 在超过 400B 个双语token(中文和英文)上进行了预训练。GLM-130B 模型含有 70 层 Transformer,隐层维度 12288,最大序列长度 2048,以及一个基于 icetk 的 150,000 个标识符的双语分词器。
-----------------------------------------GLM-130B的模型结构-----------------------------------------------

>> 双向注意力机制:采用GLM(通用语言模型)作为基础,与GPT风格模型不同,GLM-130B使用双向注意力机制以增强上下文理解。
>> 自回归空白填充—两种不同的掩码标识符:GLM-130B 以自回归空白填充作为主要预训练目标:随机掩盖连续的文本片段,并按随机采样的顺序对其进行自回归预测,从而允许被破坏片段之间相互可见。实际训练中使用两种掩码标识符——用于句中短片段填充的 [MASK] 和用于句末长文本生成的 [gMASK],以同时支持理解与生成。
>> 层归一化:采用新提出的基于DeepNorm初始化的Post-LN层归一化,使模型训练更稳定。具体公式:LayerNorm(α · x + Network(x)),其中 α = (2N)^(1/2),N 为层数。
>> PE和FFN改进—RoPE+GeGLU:基于训练稳定性与下游性能的实证比较,位置编码(PE)采用旋转位置编码(RoPE)而非ALiBi;FFN 采用带 GeLU 激活的 GLU(GeGLU)。
>> 预训练设置
● 自监督空白填充(SSBIF,95%标记):采用[MASK]和[gMASK],并在不同训练序列中应用不同的掩码策略。
● 多任务指令预训练(MIP,5%标记):在预训练阶段包含语言理解、生成和信息提取的多任务数据集,以提升下游零样本性能。
>> 数据集组成:预训练数据包括1.2T的英语Pile数据、1.0T的中文Wudao-Corpora和250G的中文网络爬取语料,形成了平衡的中英文内容。

目录

GLM模型系列

LLMs之GLM-130B/ChatGLM-1:《GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL》翻译与解读

LLMs之ChatGLM-2:ChatGLM2-6B的简介、安装、使用方法之详细攻略

LLMs之ChatGLM-3:ChatGLM3/ChatGLM3-6B的简介(多阶段增强+多模态理解+AgentTuning技术)、安装、使用方法之详细攻略

LLMs之GLM-4:GLM-4的简介(全覆盖【对话版即ChatGLM4的+工具调用+多模态文生图】能力→Agent)、安装和使用方法、案例应用之详细攻略

MLM之GLM-4:GLM-4-9B的简介、安装和使用方法、案例应用之详细攻略

实战案例

LLMs:从头到尾手把手教大家利用ChatGLM-6B模型实现训练、部署、推理(CLI/Gradio交互界面)、微调(两个提效技巧【混合精度+ZeRO零冗余提效】+三种微调方法【fine-tuning/P-tuning v2改变参数分布/LoRA低秩近似降低要更新参数量】)图文教程之详细攻略

LLMs之ChatGLM:基于Langchain框架利用text2vec-large-chinese+ChatGLM大模型(Docker 部署)接入本地知识库(生成本地知识库/分割/向量化+基于问题【Embedding+向量化+匹配TopK作为上下文】=生成Prompt喂给大模型→LLMs响应)实现问答响应项目(CLI/WebUI/VUE)图文教程之详细攻略

LLMs之ChatGLM:ChatGLM Efficient Tuning(一款高效微调ChatGLM-6B/ChatGLM2-6B的工具【LoRA/P-Tuning】)的简介、安装、使用方法之详细攻略

LLMs之ChatGLM2:ChatGLM2-6B本地部署之单机推理(API/CLI/GUI)、低成本部署(GPU量化部署/CPU及其量化部署/Mac部署/多卡部署)、有限资源下高效微调(全参/P-tuning v2)、模型评估和推理之图文教程之详细攻略

LLMs之ChatGLM2:基于ChatGLM Efficient Tuning(微调工具包)实现对ChatGLM2进行LoRA微调并进行推理测试图文教程之详细攻略

LLMs:LLaMA Efficient Tuning(一款可高效微调【全参数/LoRA/QLoRA】主流大模型【ChatGLM2/LLaMA2/Baichuan等】的高效工具【预训练+指令监督微调+奖励模型训练+PPO 训练+DPO 训练】)的简介、安装、使用方法之详细攻略

ChatGLM-6B的简介

GLM-130B的简介

《GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL》翻译与解读

ABSTRACT摘要

1、INTRODUCTION引言

GPT-3在各种基准测试上表现更好,引领了对1000亿规模LLMs的研究

GLM-130B:1300亿个参数,一个双语(英语和中文)双向密集模型,2022年5月6日至7月3日期间,在96个NVIDIA DGX-A100 (8×40G) GPU节点的集群上预训练了4000亿个令牌

不同于GPT风格架构=基于GLM+利用其双向注意力优势+自回归填充空白的目标

GLM-130B的模型性能:超过GPT-3水平、在许多情况下优于PaLM 540B,而OPT-175B和BLOOM-176B并未表现出对GPT-3的优势

GLM-130B的目的:让更多人能够参与研究,因其可在单个A100推断(或在4×RTX 3090-24G或8×RTX 2080 Ti-11G服务器上快速推断)+INT4量化

Table 1: A comparison between GLM-130B and other 100B-scale LLMs and PaLM 540B. (LN: layer norm.; FPF: floating-point format; MIP: multi-task instruction pre-training; CN : Chinese)表1:GLM-130B与其他100B规模的LLM和PaLM 540B的比较。(LN:层归一化;FPF:浮点格式;MIP:多任务指令预训练;CN:中文)

2 THE DESIGN CHOICES OF GLM-130B—GLM-130B的设计选择

2.1 GLM-130B’S ARCHITECTURE—GLM-130B的架构

骨干:基于Transformer的双向GLM,

训练目标(初始):自回归空白填充,带来两大综合性功能

两种空白:GLM的双向注意力(区别于单向注意的GPT风格),混合两个破坏目标=句子中的短空白[MASK]+句子尾的长空白[gMASK]:

两大功能:采用[MASK]时的BERT式功能+采用[gMASK]时的PrefixLM式功能

层归一化(LN)采用Post-LN、DeepNorm:提高训练稳定性

位置编码(PE)采用RoPE、前馈神经网络(FFN)采用带有GeLU的GLU

Figure 2: GLM-130B and LLMs of similar scale on zero-shot LAMBADA language modeling. Details on GLM’s bidirectional attention are provided in Du et al. (2022).图2:GLM-130B和相似规模的LLM零样本LAMBADA语言建模。Du et al.(2022)提供了GLM双向注意的详细信息。

2.2 GLM-130B’S PRE-TRAINING SETUP—GLM-130B的预训练设置

预训练两大目标:自监督GLM自回归空白填充+小部分标记的多任务学习

自监督空白填充SSBIF(95%的标记):句子中的短空白[MASK—泊松分布采样]占总样本的30%、句子尾的长空白[gMASK—均匀分布采样]占总样本的70%

预训练数据量(占比95%):1.2T的Pile、1T的中文Wudao-Corpora、网络爬取的250G中文语料

多任务指令预训练MIP(5%的标记):多任务学习可能比微调更有帮助

预训练数据量(占比5%—防止破坏LLM的其他一般能力):包括74个提示数据集

2.3、PLATFORM-AWARE PARALLEL STRATEGIES AND MODEL CONFIGURATIONS—平台感知的并行策略和模型配置

3、THE TRAINING STABILITY OF GLM-130B—GLM-130B的训练稳定性

7、CONCLUSION AND LESSONS结论和教训

ACKNOWLEDGEMENT致谢


GLM模型系列

LLMs之GLM-130B/ChatGLM-1:《GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL》翻译与解读

LLMs之ChatGLM-2:ChatGLM2-6B的简介、安装、使用方法之详细攻略

LLMs之ChatGLM-3:ChatGLM3/ChatGLM3-6B的简介(多阶段增强+多模态理解+AgentTuning技术)、安装、使用方法之详细攻略

LLMs之GLM-4:GLM-4的简介(全覆盖【对话版即ChatGLM4的+工具调用+多模态文生图】能力→Agent)、安装和使用方法、案例应用之详细攻略

MLM之GLM-4:GLM-4-9B的简介、安装和使用方法、案例应用之详细攻略

实战案例

LLMs:从头到尾手把手教大家利用ChatGLM-6B模型实现训练、部署、推理(CLI/Gradio交互界面)、微调(两个提效技巧【混合精度+ZeRO零冗余提效】+三种微调方法【fine-tuning/P-tuning v2改变参数分布/LoRA低秩近似降低要更新参数量】)图文教程之详细攻略

https://yunyaniu.blog.csdn.net/article/details/120249551

LLMs之ChatGLM:基于Langchain框架利用text2vec-large-chinese+ChatGLM大模型(Docker 部署)接入本地知识库(生成本地知识库/分割/向量化+基于问题【Embedding+向量化+匹配TopK作为上下文】=生成Prompt喂给大模型→LLMs响应)实现问答响应项目(CLI/WebUI/VUE)图文教程之详细攻略

https://yunyaniu.blog.csdn.net/article/details/130998758

LLMs之ChatGLM:ChatGLM Efficient Tuning(一款高效微调ChatGLM-6B/ChatGLM2-6B的工具【LoRA/P-Tuning】)的简介、安装、使用方法之详细攻略

LLMs之ChatGLM2:ChatGLM2-6B本地部署之单机推理(API/CLI/GUI)、低成本部署(GPU量化部署/CPU及其量化部署/Mac部署/多卡部署)、有限资源下高效微调(全参/P-tuning v2)、模型评估和推理之图文教程之详细攻略

LLMs之ChatGLM2:基于ChatGLM Efficient Tuning(微调工具包)实现对ChatGLM2进行LoRA微调并进行推理测试图文教程之详细攻略

LLMs:LLaMA Efficient Tuning(一款可高效微调【全参数/LoRA/QLoRA】主流大模型【ChatGLM2/LLaMA2/Baichuan等】的高效工具【预训练+指令监督微调+奖励模型训练+PPO 训练+DPO 训练】)的简介、安装、使用方法之详细攻略

ChatGLM-6B的简介

ChatGLM-6B 是一个开源的、支持中英双语的对话语言模型,基于 General Language Model (GLM) 架构,具有 62 亿参数。结合模型量化技术,用户可以在消费级的显卡上进行本地部署(INT4 量化级别下最低只需 6GB 显存)。 ChatGLM-6B 使用了和 ChatGPT 相似的技术,针对中文问答和对话进行了优化。经过约 1T 标识符的中英双语训练,辅以监督微调、反馈自助、人类反馈强化学习等技术的加持,62 亿参数的 ChatGLM-6B 已经能生成相当符合人类偏好的回答,更多信息请参考我们的博客。欢迎通过 chatglm.cn 体验更大规模的 ChatGLM 模型。
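
下面给出一个最小化的本地部署示意(基于 ChatGLM-6B 公开的 transformers 加载方式;假设已安装 transformers、torch 及仓库要求的依赖,且显卡满足 INT4 量化下约 6GB 显存的要求,具体细节以官方 README 为准):

# 最小示意:以 INT4 量化在消费级显卡上加载 ChatGLM-6B 并进行多轮对话(细节以官方仓库为准)
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# quantize(4) 为 ChatGLM-6B 模型代码提供的权重量化接口;half() 表示以 FP16 进行计算
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
model = model.eval()

# chat 接口维护多轮对话历史
response, history = model.chat(tokenizer, "你好,请简单介绍一下你自己", history=[])
print(response)
response, history = model.chat(tokenizer, "在 6GB 显存的显卡上如何部署你?", history=history)
print(response)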

为了方便下游开发者针对自己的应用场景定制模型,我们同时实现了基于 P-Tuning v2 的高效参数微调方法 (使用指南) ,INT4 量化级别下最低只需 7GB 显存即可启动微调。

ChatGLM-6B 权重对学术研究完全开放,在填写问卷进行登记后亦允许免费商业使用

GitHub地址:https://github.com/THUDM/ChatGLM-6B(ChatGLM-6B: An Open Bilingual Dialogue Language Model | 开源双语对话语言模型)

GLM-130B的简介

《GLM-130B: AN OPEN BILINGUAL PRE-TRAINED MODEL》翻译与解读

地址

官网:ChatGLM

文章:GLM-130B:开源的双语预训练模型 | GLM-130B

GitHub:https://github.com/THUDM/ChatGLM-6B

论文:https://openreview.net/pdf?id=-Aw0rrrPUF

时间

GLM-130B:2022年8月4日

千亿对话模型 ChatGLM 开始内测:2023年3月10日

作者

Tsinghua清华大学+智谱AI

总结

该论文介绍了一个开源的双语(英汉)1300亿参数语言模型GLM-130B。
背景痛点:
>> 相较于英文,中文NLP研究较缺乏大规模预训练模型资源支持。
>> 许多千亿(100B)级规模的语言模型(如GPT-3)因闭源等原因无法公开获取,妨碍了研究进展。
>> 训练这种规模的模型面临稳定性、效率等重重技术挑战。

具体解决方案:
>> GLM-130B采用GLM(通用语言模型)框架,可以利用双向注意力对上下文建模。
>> 预训练任务包含自监督的自回归空白填充和多任务指令预训练。
>> 采用基于DeepNorm初始化的Post-LN层归一化结构,以提升训练稳定性。
>> 采用混合精度策略提升训练效率,并采用嵌入层梯度缩减(EGS)有效防止损失发散。
>> 实现INT4权重量化,支持在更低端的GPU上高效推理。

核心特点:
>> 首个开源的千亿(100B)参数量级中英双语预训练语言模型。
>> 性能测试表明其在人机对话、阅读理解、知识问答等任务上优于同类封闭源模型。
>> 作为中文模型,在NLU和NLG任务上优于目前最大中文预训练模型ERNIE Titan。
>> 实现了低至INT4的权重量化,支持在普通GPU上高效推理。
>> 完整开源模型权重、代码、训练日志和经验教训,提高研究透明度和可复现性。

优势:
>> 破除千亿级语言模型的黑箱,提高开放性和包容性。
>> 双语学习纳入中文知识,有利于减轻模型偏见和毒性。
>> 性能稳定且超越同类模型,有利于推动人机对话和其他下游任务。
>> 低硬件门槛,惠及更广泛的研究人群。
>> 开放源代码和实验细节,有利于推动相关领域研究进一步发展。

ABSTRACT摘要

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters. It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained. Over the course of this effort, we face numerous unexpected technical and engineering challenges, particularly on loss spikes and divergence. In this paper, we introduce the training process of GLM-130B including its design choices, training strategies for both efficiency and stability, and engineering efforts. The resultant GLM-130B model offers significant outperformance over GPT-3 175B (davinci) on a wide range of popular English benchmarks while the performance advantage is not observed in OPT-175B and BLOOM-176B. It also consistently and significantly outperforms ERNIE TITAN 3.0 260B—the largest Chinese language model—across related benchmarks. Finally, we leverage a unique scaling property of GLM-130B to reach INT4 quantization without post training, with almost no performance loss, making it the first among 100B-scale models and more importantly, allowing its effective inference on 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G) GPUs, the most affordable GPUs required for using 100B-scale models. The GLM-130B model weights are publicly accessible and its code, training logs, related toolkit, and lessons learned are open-sourced at https://github.com/THUDM/GLM-130B/.

我们介绍了GLM-130B,一个具有1300亿参数的双语(英文和中文)预训练语言模型。它旨在开源一个至少与GPT-3(davinci)一样好的千亿规模模型,并揭示这样规模的模型如何成功地进行预训练。在这个过程中,我们遇到了许多意外的技术和工程挑战,特别是损失尖峰和发散问题。在本文中,我们介绍了GLM-130B的训练过程,包括其设计选择、兼顾效率和稳定性的训练策略以及工程努力。在许多流行的英文基准测试中,GLM-130B模型相对于GPT-3 175B(davinci)表现出显著优势,而OPT-175B和BLOOM-176B则没有表现出这种优势。它还在相关基准测试中始终明显优于最大的中文语言模型ERNIE TITAN 3.0 260B。最后,我们利用GLM-130B独特的缩放特性,在无需后训练的情况下实现了INT4量化,且几乎没有性能损失,使其成为千亿规模模型中的首例;更重要的是,这使其可以在4×RTX 3090(24G)或8×RTX 2080 Ti(11G)GPU上进行有效推理,这是使用千亿规模模型所需的最经济的GPU。GLM-130B模型的权重可以公开访问,其代码、训练日志、相关工具和经验教训已在https://github.com/THUDM/GLM-130B/上开源。

1、INTRODUCTION引言

GPT-3在各种基准测试上表现更好,引领了对1000亿规模LLMs的研究

Large language models (LLMs), particularly those with over 100 billion (100B) parameters (Brown et al., 2020; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022; Wang et al., 2021), have presented attractive scaling laws (Wei et al., 2022b), where emergent zero-shot and few-shot capabilities suddenly arose. Among them, GPT-3 (Brown et al., 2020) with 175B parameters pioneers the study of 100B-scale LLMs by strikingly generating better performance with 32 labeled examples than the fully-supervised BERT-Large model on a variety of benchmarks. However, both GPT-3 (and many other closed-sourced 100B-scale ones)—the model itself—and how it can be trained, have been thus far intransparent to the public. It is of critical value to train a high-quality LLM of such scale with both the model and training process shared with everyone.

We thus aim to pre-train an open and highly-accurate 100B-scale model with ethical concerns in mind. Over the course of our attempt, we have come to realize that pre-training a dense LLM at such a scale raises numerous unexpected technical and engineering challenges compared to training 10B-scale models, in terms of pre-training efficiency, stability, and convergence. Similar difficulties have also been concurrently observed in training OPT-175B (Zhang et al., 2022) and BLOOM-176B (Scao et al., 2022), further demonstrating the significance of GPT-3 as a pioneer study.

大型语言模型(LLM),特别是参数超过1000亿(100B)的模型(Brown et al., 2020;Thoppilan et al., 2022;Rae et al., 2021;Chowdhery et al., 2022;Wang et al., 2021),展现出有吸引力的缩放规律(Wei et al., 2022b),并涌现出零样本(zero-shot)和少样本(few-shot)能力。其中,拥有1750亿参数的GPT-3(Brown等人,2020年)仅用32个标注示例就在多种基准测试上超过了全监督的BERT-Large模型,开创了对千亿规模LLM的研究。然而,迄今为止,GPT-3(以及许多其他闭源的千亿规模模型)——无论是模型本身还是其训练方式——对公众都是不透明的。训练一个这种规模的高质量LLM,并将模型与训练过程向所有人公开,具有极其重要的价值。

因此,我们的目标是在考虑伦理问题的前提下,预训练一个开放且高度准确的千亿规模模型。在尝试的过程中,我们意识到:与训练百亿(10B)规模模型相比,在这种规模上预训练一个密集的LLM会在预训练效率、稳定性和收敛性方面带来许多意想不到的技术与工程挑战。类似的困难也在OPT-175B(Zhang等人,2022年)和BLOOM-176B(Scao等人,2022年)的训练中被同时观察到,进一步证明了GPT-3作为先驱研究的重要性。

GLM-130B:1300亿个参数,一个双语(英语和中文)双向密集模型,2022年5月6日至7月3日期间,在96个NVIDIA DGX-A100 (8×40G) GPU节点的集群上预训练4000亿个令牌

不同于GPT风格架构=基于GLM+利用其双向注意力优势+自回归填充空白的目标

In this work, we introduce the pre-training of a 100B-scale model—GLM-130B, in terms of engineering efforts, model design choices, training strategies for efficiency and stability, and quantization for affordable inference. As it has been widely realized that it is computationally unaffordable to empirically enumerate all possible designs for training 100B-scale LLMs, we present not only the successful part for training GLM-130B but also many of the failed options and lessons learned. Particularly, the training stability is the decisive factor in the success of training models of such a scale. Different from practices such as manually adjusting learning rates in OPT-175B and using embedding norm in the sacrifice of performance in BLOOM-176B, we experiment with various options and find the strategy of embedding gradient shrink can significantly stabilize the training of GLM-130B.

Specifically, GLM-130B is a bilingual (English and Chinese) bidirectional dense model with 130 billion parameters, pre-trained over 400 billion tokens on a cluster of 96 NVIDIA DGX-A100 (8×40G) GPU nodes between May 6 and July 3, 2022. Instead of using the GPT-style architecture, we adopt the General Language Model (GLM) algorithm (Du et al., 2022) to leverage its bidirectional attention advantage and autoregressive blank infilling objective. Table 1 summarizes the comparison between GLM-130B, GPT-3 and another two open-source efforts—OPT-175B and BLOOM-176B, as well as PaLM 540B (Chowdhery et al., 2022)—a 4× larger model—as a reference.

在这项工作中,我们从工程工作、模型设计选择、兼顾效率和稳定性的训练策略以及面向可负担推理的量化等方面,介绍了千亿规模模型GLM-130B的预训练。由于人们普遍认识到,在训练千亿规模LLM时从经验上穷举所有可能的设计在计算上是不可承受的,我们不仅呈现了训练GLM-130B的成功经验,还呈现了许多失败的尝试和经验教训。

特别是,训练的稳定性是成功训练这种规模模型的决定性因素。与在OPT-175B中手动调整学习率、以及在BLOOM-176B中以牺牲性能为代价使用嵌入归一化(embedding norm)不同,我们尝试了各种选项,并发现嵌入梯度缩减(EGS)策略可以显著稳定GLM-130B的训练。

具体来说,GLM-130B是一个双语(英语和中文)双向密集模型,具有1300亿个参数,在2022年5月6日至7月3日期间,在96个NVIDIA DGX-A100 (8×40G) GPU节点的集群上预训练了超过4000亿个令牌。与使用GPT风格架构不同,我们采用了General Language Model(GLM)算法(Du等人,2022年),以利用其双向注意力优势和自回归填充空白的目标

表1总结了GLM-130B、GPT-3和另外两个开源努力—OPT-175B和BLOOM-176B,以及PaLM 540B(Chowdhery等人,2022年)—一个4倍大的模型—作为参考之间的比较。

GLM-130B的模型性能:超过GPT-3水平、在许多情况下优于PaLM 540B,而OPT-175B和BLOOM-176B并未表现出对GPT-3的优势

Altogether, the conceptual uniqueness and engineering efforts enable GLM-130B to exhibit performance that surpasses the level of GPT-3 on a wide range of benchmarks (in total 112 tasks) and also outperforms PaLM 540B in many cases, while outperformance over GPT-3 has not been observed in OPT-175B and BLOOM-176B (Cf. Figure 1 left). For zero-shot performance, GLM-130B is better than GPT-3 175B (+5.0%), OPT-175B (+6.5%), and BLOOM-176B (+13.0%) on LAMBADA (Paperno et al., 2016), and achieves 3× better performance than GPT-3 on Big-bench-lite (Srivastava et al., 2022). For the 5-shot MMLU (Hendrycks et al., 2021) tasks, it is better than GPT-3 175B (+0.9%) and BLOOM-176B (+12.7%). As a bilingual LLM also in Chinese, it offers significantly better results than ERNIE TITAN 3.0 260B (Wang et al., 2021)—the largest Chinese LLM—on 7 zero-shot CLUE (Xu et al., 2020) datasets (+24.26%) and 5 zero-shot FewCLUE (Xu et al., 2021) ones (+12.75%). Importantly, as summarized in Figure 1 right, GLM-130B as an open model is associated with significantly less bias and generation toxicity than its 100B-scale counterparts.

总之,概念上的独特性和工程上的努力使GLM-130B在广泛的基准测试(总共112项任务)中表现出超过GPT-3水平的性能,并且在许多情况下也优于PaLM 540B,而在OPT-175B和BLOOM-176B中没有观察到优于GPT-3的性能(参见图1左)。

>> 对于零样本性能,GLM-130B在LAMBADA (Paperno等,2016)上优于GPT-3 175B(+5.0%)、OPT-175B(+6.5%)和BLOOM-176B(+13.0%),并在Big-bench-lite (Srivastava等,2022)上达到GPT-3三倍的性能。对于5样本(5-shot)的MMLU (Hendrycks等,2021)任务,它优于GPT-3 175B(+0.9%)和BLOOM-176B(+12.7%)。

>> 作为同时支持中文的双语LLM,它在7个零样本CLUE (Xu et al., 2020)数据集(+24.26%)和5个零样本FewCLUE (Xu et al., 2021)数据集(+12.75%)上的结果明显优于最大的中文LLM——ERNIE TITAN 3.0 260B (Wang et al., 2021)。

重要的是,如图1右侧所总结的,作为开放模型的GLM-130B与其他100B级模型相比,偏见和生成毒性明显更小。

GLM-130B的目的:让更多人能够参与研究,因其可在单个A100推断(或在4×RTX 3090-24G或8×RTX 2080 Ti-11G服务器上快速推断)+INT4量化

Finally, we design GLM-130B to empower as many people as possible to conduct 100B-scale LLM studies. First, instead of using 175B+ parameters as OPT and BLOOM, the 130B size is decided because such a size supports inference on a single A100 (8×40G) server. Second, to further lower the GPU requirements, we quantize GLM-130B into INT4 precision without post training while OPT and BLOOM can only reach INT8. Due to a unique property of the GLM architecture, GLM-130B’s INT4 quantization introduces negligible performance degradation, e.g., -0.74% on LAMBADA and even +0.05% on MMLU, making it still better than the uncompressed GPT-3. This enables GLM-130B’s fast inference with performance guarantee on a server of 4×RTX 3090 (24G) or 8×RTX 2080 Ti (11G), the most affordable GPU required for using 100B-scale LLMs to date.

We open-source the model checkpoints, code, training logs, related toolkits, and lessons learned.

最后,我们设计GLM-130B的目的是让尽可能多的人能够开展千亿规模LLM研究。首先,与OPT和BLOOM使用175B以上参数不同,选择130B的规模是因为这样的大小支持在单个A100(8×40G)服务器上进行推理。其次,为了进一步降低GPU要求,我们将GLM-130B量化为INT4精度且无需后训练,而OPT和BLOOM只能达到INT8。由于GLM架构的独特属性,GLM-130B的INT4量化带来的性能下降可以忽略不计,例如在LAMBADA上为-0.74%,在MMLU上甚至为+0.05%,使其仍然优于未压缩的GPT-3。这使得GLM-130B能够在4×RTX 3090(24G)或8×RTX 2080 Ti(11G)服务器上进行有性能保障的快速推理,而这是迄今为止使用千亿规模LLM所需的最经济的GPU配置。
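
下面用一段通用代码示意 INT4 权重量化的基本原理(逐行对称 absmax 量化,仅为说明权重显存为何能大幅下降;GLM-130B 实际使用定制的量化权重格式与 CUDA 内核,与此示意实现无关):

# 通用示意:对线性层权重做逐行对称 absmax INT4 量化(原理演示,非 GLM-130B 实际量化内核)
import torch

def quantize_int4(w: torch.Tensor):
    """w: [out_features, in_features];逐行量化到 [-7, 7],返回 int8 容器中的 4bit 整数与每行缩放因子。"""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -7, 7).to(torch.int8)
    return q, scale

def int4_linear(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor, bias=None):
    """推理时反量化回浮点再做矩阵乘;真实实现会用定制内核直接在低比特权重上计算。"""
    w = q.to(x.dtype) * scale.to(x.dtype)
    return torch.nn.functional.linear(x, w, bias)

# 用法示意(维度为演示值;实际部署中权重为 FP16,量化后权重显存约为 FP16 的 1/4)
w = torch.randn(1024, 1024)
q, s = quantize_int4(w)
x = torch.randn(2, 1024)
err = (int4_linear(x, q, s) - x @ w.t()).abs().mean()
print(q.dtype, s.shape, float(err))  # int8 容器、每行缩放因子、量化引入的平均误差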

我们开源了模型检查点、代码、训练日志、相关工具包和经验教训。

Table 1: A comparison between GLM-130B and other 100B-scale LLMs and PaLM 540B. (LN: layer norm.; FPF: floating-point format; MIP: multi-task instruction pre-training; CN : Chinese)表1:GLM-130B与其他100B规模的LLM和PaLM 540B的比较。(LN:层归一化;FPF:浮点格式;MIP:多任务指令预训练;CN:中文)

2 THE DESIGN CHOICES OF GLM-130B—GLM-130B的设计选择

The architecture of a machine learning model defines its inductive bias. However, it has been realized that it is computationally unaffordable to explore various architectural designs for LLMs. We introduce and explain the unique design choices of GLM-130B.

机器学习模型的架构定义了其归纳偏差。然而,探索各种架构设计对于LLM来说在计算上是不可承受的。我们介绍并解释GLM-130B的独特设计选择。

2.1 GLM-130B’S ARCHITECTURE—GLM-130B的架构

骨干:基于Transformer的双向GLM

GLM as Backbone. Most recent 100B-scale LLMs, such as GPT-3, PaLM, OPT, and BLOOM, follow the traditional GPT-style (Radford et al., 2019) architecture of decoder-only autoregressive language modeling. In GLM-130B, we instead make an attempt to explore the potential of a bidirectional GLM—General Language Model (Du et al., 2022)—as its backbone.

GLM作为主干。大多数最近的100B规模LLM,如GPT-3、PaLM、OPT和BLOOM,遵循传统的GPT风格(Radford等,2019)的仅解码器自回归语言建模架构。在GLM-130B中,我们尝试探索双向GLM(通用语言模型,Du等,2022)的潜力作为其基础。

训练目标(初始):自回归空白填充,带来两大综合性功能

GLM is a transformer-based language model that leverages autoregressive blank infilling as its training objective. Briefly, for a text sequence x = [x1, · · · , xn], text spans {s1, · · · , sm} are sampled from it, each of which si denotes a span of consecutive tokens [si,1, · · · , si,li ] and is replaced (i.e., corrupted) with a single mask token to form xcorrupt. The model is asked to recover them autoregressively. To allow interactions between corrupted spans, their visibility to each other is decided by a randomly sampled permutation on their order.

GLM是一种基于Transformer的语言模型,其训练目标是自回归空白填充。简而言之,对于一个文本序列x = [x1, ···, xn],从中采样文本片段{s1, ···, sm},其中每个si表示一段连续标记[si,1, ···, si,li],并被单个掩码标记替换(即被破坏)以形成xcorrupt。模型被要求自回归地恢复这些片段。为了允许被破坏片段之间的相互作用,它们彼此之间的可见性由对其顺序随机采样的排列决定。

两种空白:GLM的双向注意力(区别于单向注意的GPT风格),混合两个破坏目标=句子中的短空白[MASK]+句子尾的长空白[gMASK]

GLM’s bidirectional attention over unmasked (i.e., uncorrupted) contexts distinguishes GLM-130B from GPT-style LLMs in which the unidirectional attention is used. To support both understanding and generation, it mixes two corruption objectives, each indicated by a special mask token:

• [MASK]: short blanks in sentences whose lengths add up to a certain portion of the input.

• [gMASK]: random-length long blanks at the end of sentences with prefix contexts provided.

GLM在未屏蔽(即未破坏)上下文上的双向注意力,将GLM-130B与使用单向注意的GPT风格LLM区分开来。为了支持理解和生成,它混合了两个破坏目标,每个由一个特殊的掩码标记指示:

>> [MASK]:句子中的短空白,其长度加起来占输入的一定比例

>> [gMASK]:在提供前缀上下文的句子末尾、长度随机的长空白。

两大功能:采用[MASK]时的BERT式功能+采用[gMASK]时的PrefixLM式功能

Conceptually, the blank infilling objective with bidirectional attention enables a more effective comprehension of contexts than GPT-style models: when using [MASK], GLM-130B behaves as BERT (Devlin et al., 2019) and T5 (Raffel et al., 2020); when using [gMASK], GLM-130B behaves similarly to PrefixLM (Liu et al., 2018; Dong et al., 2019).

概念上,具有双向注意力的空白填充目标使GLM-130B比GPT风格模型更有效地理解上下文:使用[MASK]时,GLM-130B的行为类似于BERT(Devlin等,2019)和T5(Raffel等,2020);使用[gMASK]时,GLM-130B的行为类似于PrefixLM(Liu等,2018;Dong等,2019)。
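
下面用一段简化代码示意两种目标对应的输入构造方式([MASK] 句中片段填充与 [gMASK] 前缀续写;特殊符号名与辅助函数均为示意用的假设,并非官方数据处理实现):

# 示意:两种空白填充目标的输入构造(简化的假设性实现,非官方数据管线)
MASK, GMASK, SOP, EOP = "[MASK]", "[gMASK]", "<sop>", "<eop>"

def corrupt_short_spans(tokens, spans):
    """[MASK] 目标:把若干句中短片段替换为 [MASK],每个片段在目标端被自回归地恢复(BERT 式理解)。"""
    src, targets, cursor = [], [], 0
    for start, length in sorted(spans):
        src += tokens[cursor:start] + [MASK]
        targets.append([SOP] + tokens[start:start + length] + [EOP])
        cursor = start + length
    return src + tokens[cursor:], targets

def corrupt_suffix(tokens, prefix_len):
    """[gMASK] 目标:保留前缀作为上下文,句末长空白整体作为生成目标(PrefixLM 式生成)。"""
    return tokens[:prefix_len] + [GMASK], [SOP] + tokens[prefix_len:] + [EOP]

tokens = "GLM-130B is an open bilingual pre-trained model".split()
print(corrupt_short_spans(tokens, [(1, 2)]))  # 恢复句中短片段
print(corrupt_suffix(tokens, prefix_len=3))   # 续写句尾长空白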

Empirically, GLM-130B offers a record-high accuracy of 80.2% on zero-shot LAMBADA by outperforming both GPT-3 and PaLM 540B in Figure 2. By setting the attention mask, GLM-130B’s unidirectional variant is comparable to GPT-3 and OPT-175B. Our observations are in line with existing findings (Liu et al., 2018; Dong et al., 2019).

经验上,GLM-130B在零样本LAMBADA上的准确率达到了创纪录的80.2%,在图2中超过了GPT-3和PaLM 540B。通过设置注意力掩码,GLM-130B的单向变体与GPT-3和OPT-175B相当。我们的观察结果与现有研究结果一致(Liu等,2018;Dong等,2019)。

层归一化(LN)采用Post-LN、DeepNorm:提高训练稳定性

Layer Normalization (LN, Ba et al. (2016)). Training instability is one major challenge for training LLMs (Zhang et al., 2022; Scao et al., 2022; Chowdhery et al., 2022) (Cf. Figure 10 in Appendix for collapses in training several 100B-scale models). A proper choice of LNs can help stabilize the training of LLMs. We experiment with existing practices, e.g., Pre-LN (Xiong et al., 2020), Post-LN (Ba et al., 2016), Sandwich-LN (Ding et al., 2021), which are unfortunately incapable of stabilizing our GLM-130B test runs (Cf. Figure 3 (a) and Appendix B.2 for details).

层归一化(LN, Ba等,2016)。训练不稳定性是训练LLM的一个主要挑战(Zhang等,2022;Scao等,2022;Chowdhery等,2022)(参见附录中的图10,用于训练几个100B规模模型时的崩溃)。选择合适的LN可以帮助稳定LLM的训练。我们尝试了现有的实践,例如Pre-LN(Xiong等,2020)、Post-LN(Ba等,2016)、Sandwich-LN(Ding等,2021),但不幸的是,这些都无法稳定我们的GLM-130B测试运行(参见图3(a)和附录B.2中的详细信息)。

Our search is later focused on Post-LN due to its favorable downstream results in preliminary experiments though it does not stabilize GLM-130B. Fortunately, one of the attempts on Post-LN initialized with the newly-proposed DeepNorm (Wang et al., 2022b) generates promising training stability. Specifically, given the number of GLM-130B’s layers N, we adopt DeepNorm(x) = LayerNorm(α · x + Network(x)), where α = (2N)^(1/2), and apply the Xavier normal initialization with the scaling factor of (2N)^(−1/2) to ffn, v_proj and out_proj. Additionally, all bias terms are initialized to zero. Figure 3 shows it significantly benefits the training stability of GLM-130B.

我们后来的研究重点是Post-LN,因为它在初步实验中具有良好的下游结果,尽管它不能稳定训练GLM-130B。幸运的是,在使用新提出的DeepNorm(Wang等,2022b)初始化的Post-LN上的一次尝试产生了有希望的训练稳定性。具体来说,鉴于GLM-130B的层数N,我们采用DeepNorm(x) = LayerNorm(α · x + Network(x)),其中α = (2N)^1/2,并将Xavier正态初始化与缩放因子(2N)^(-1/2)应用于ffn、v_proj和out_proj。此外,所有偏置项初始化为零。图3显示了它显著提高了GLM-130B的训练稳定性
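
下面给出一个与上述公式对应的简化示意(占位实现,仅表达 DeepNorm 式 Post-LN 残差与 (2N)^(-1/2) 缩放的 Xavier 初始化,子层结构等细节为演示用假设):

# 示意:DeepNorm 式 Post-LN 残差块与按 (2N)^(-1/2) 缩放的 Xavier 初始化(简化占位实现)
import math
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    """输出为 LayerNorm(alpha * x + sublayer(x)),其中 alpha = (2N)^(1/2),N 为总层数。"""
    def __init__(self, d_model: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.alpha = math.sqrt(2 * num_layers)
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.alpha * x + self.sublayer(x))

def deepnorm_init_(linear: nn.Linear, num_layers: int) -> None:
    """对 ffn、v_proj、out_proj 等权重使用缩放因子 (2N)^(-1/2) 的 Xavier 正态初始化,偏置置零。"""
    beta = (2 * num_layers) ** -0.5
    nn.init.xavier_normal_(linear.weight, gain=beta)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

# 用法示意:把一个前馈子层包进 DeepNorm 残差结构(维度为演示值;GLM-130B 中 N=70、隐层维度为 12288)
N, d = 70, 1024
ffn = nn.Linear(d, d)  # 仅作占位,真实 FFN 为带 GeLU 的 GLU 结构
deepnorm_init_(ffn, N)
block = DeepNormResidual(d, ffn, N)
print(block(torch.randn(2, 4, d)).shape)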

位置编码(PE)采用RoPE、前馈神经网络(FFN)采用带有GeLU的GLU

Positional Encoding and FFNs. We empirically test different options for positional encoding (PE) and FFN improvements in terms of both training stability and downstream performance (Cf. Appendix B.3 for details). For PEs in GLM-130B, we adopt Rotary Positional Encoding (RoPE, Su et al. (2021)) rather than ALiBi (Press et al., 2021). To improve FFNs in Transformer, we pick GLU with the GeLU (Hendrycks & Gimpel, 2016) activation as the replacement.

位置编码和FFN。我们从训练稳定性和下游性能的角度,实证测试了位置编码(PE)和FFN改进的不同选项(详见附录B.3)。对于GLM-130B中的PE,我们采用了旋转位置编码(RoPE,Su等,2021),而不是ALiBi(Press等,2021)。为了改进Transformer中的FFN,我们选择带GeLU(Hendrycks & Gimpel,2016)激活的GLU作为替代。
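
下面给出 RoPE 与带 GeLU 的 GLU(GeGLU)FFN 的最小示意实现(采用常见的“前后两半维度配对”RoPE 写法,与 GLM-130B 的实际实现细节可能不同;各维度均为演示用假设):

# 示意:RoPE 与 GeGLU FFN 的最小实现(与 GLM-130B 的实际实现细节可能不同)
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: [batch, seq, heads, head_dim](head_dim 为偶数),按位置对前后两半维度做旋转。"""
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # [half]
    angles = torch.outer(torch.arange(s, dtype=torch.float32), freqs)  # [seq, half]
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GeGLUFFN(nn.Module):
    """FFN 采用 GLU 结构,门控分支使用 GeLU 激活。"""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff)
        self.w_up = nn.Linear(d_model, d_ff)
        self.w_down = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.gelu(self.w_gate(x)) * self.w_up(x))

# 用法示意(维度仅为演示;论文中隐层维度为 12288)
q = torch.randn(2, 16, 8, 64)
ffn = GeGLUFFN(d_model=512, d_ff=2048)
print(apply_rope(q).shape, ffn(torch.randn(2, 16, 512)).shape)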

Figure 2: GLM-130B and LLMs of similar scale on zero-shot LAMBADA language modeling. Details on GLM’s bidirectional attention are provided in Du et al. (2022).图2:GLM-130B和相似规模的LLM零样本LAMBADA语言建模。Du et al.(2022)提供了GLM双向注意的详细信息。

2.2 GLM-130B’S PRE-TRAINING SETUP—GLM-130B的预训练设置

预训练两大目标:自监督GLM自回归空白填充+小部分标记的多任务学习

Inspired by recent works (Aribandi et al., 2022; Wei et al., 2022a; Sanh et al., 2022), the GLM-130B pre-training objective includes not only the self-supervised GLM autoregressive blank infilling but also multi-task learning for a small portion of tokens. This is expected to help boost its downstream zero-shot performance.

受近期工作(Aribandi等,2022;Wei等,2022a;Sanh等,2022)的启发,GLM-130B的预训练目标不仅包括自监督GLM自回归空白填充,还包括一小部分标记的多任务学习。这预计将有助于提升其下游零样本性能。

自监督空白填充SSBIF(95%的标记):句子中的短空白[MASK—泊松分布采样]占总样本的30%、句子尾的长空白[gMASK—均匀分布采样]占总样本的70%

Self-Supervised Blank Infilling (95% tokens). Recall that GLM-130B uses both [MASK] and [gMASK] for this task. Each training sequence is applied with one of them independently at a time. Specifically, [MASK] is used to mask consecutive spans in 30% of training sequences for blank infilling. The lengths of spans follow a Poisson distribution (λ = 3) and add up to 15% of the input. For the other 70% sequences, the prefix of each sequence is kept as context and [gMASK] is used to mask the rest of it. The masked length is sampled from the Uniform distribution.

自监督空白填充(95%的标记)。回顾一下,GLM-130B在这一任务中使用了[MASK]和[gMASK]。每个训练序列独立地应用其中一个。具体来说,[MASK]用于在30%的训练序列中掩盖连续的片段以进行空白填充。这些跨度片段的长度遵循泊松分布(λ=3),加起来占输入的15%。对于其他70%的序列,每个序列的前缀保留为上下文,并使用[gMASK]掩盖其余部分。掩盖长度从均匀分布中采样
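
下面用一小段代码示意上述采样策略(简化版,仅表达比例与分布,并非官方数据管线;序列长度、随机种子等均为演示用假设):

# 示意:按文中比例为每条训练序列选择掩码策略(简化版,非官方实现)
import numpy as np

rng = np.random.default_rng(0)

def sample_mask_plan(seq_len: int):
    if rng.random() < 0.3:
        # [MASK]:短片段长度服从 Poisson(λ=3),累计约占输入的 15%
        budget, lengths = int(0.15 * seq_len), []
        while sum(lengths) < budget:
            lengths.append(max(1, int(rng.poisson(3))))
        return {"objective": "[MASK]", "span_lengths": lengths}
    else:
        # [gMASK]:保留前缀作为上下文,被掩盖的后缀长度从均匀分布中采样
        masked_len = int(rng.integers(1, seq_len))
        return {"objective": "[gMASK]", "prefix_len": seq_len - masked_len}

for _ in range(3):
    print(sample_mask_plan(2048))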

预训练数据量(占比95%):1.2T的Pile、1T的中文Wudao-Corpora、网络爬取的250G中文语料

The pre-training data includes 1.2T Pile (train split) (Gao et al., 2020) English, 1.0T Chinese Wudao-Corpora (Yuan et al., 2021), and 250G Chinese corpora (including online forums, encyclopedia, and QA) we crawl from the web, which form a balanced composition of English and Chinese contents.

预训练数据包括1.2T的Pile(训练拆分)(Gao等,2020)英语,1.0T的中文Wudao-Corpora(Yuan等,2021),以及我们从网络上爬取的250G中文语料(包括在线论坛、百科全书和问答),形成了平衡的英文和中文内容。

多任务指令预训练MIP(5%的标记):多任务学习可能比微调更有帮助

Multi-Task Instruction Pre-Training (MIP, 5% tokens). T5 (Raffel et al., 2020) and ExT5 (Aribandi et al., 2022) suggest that multi-task learning in pre-training can be more helpful than fine-tuning, we thus propose to include a variety of instruction prompted datasets including language understanding, generation, and information extraction in GLM-130B’s pre-training.

多任务指令预训练(MIP,5%的标记)。T5(Raffel等,2020)和ExT5(Aribandi等,2022)建议在预训练中进行多任务学习比微调更有帮助,因此我们建议在GLM-130B的预训练中包含各种指令提示的数据集,包括语言理解、生成和信息提取。

预训练数据量(占比5%—防止破坏LLM的其他一般能力):包括74个提示数据集

Compared to recent works (Wei et al., 2022a; Sanh et al., 2022) that leverage multi-task prompted fine-tuning to improve zero-shot task transfer, MIP only accounts for 5% tokens and is set in the pre-training stage to prevent spoiling LLMs’ other general ability, e.g., unconditional free generation. Specifically, we include 74 prompted datasets from (Sanh et al., 2022; Wang et al., 2022a), listed in Appendix C and Table 12. GLM-130B users are suggested to avoid evaluating its zero-shot and few-shot capabilities on these datasets according to the criterion illustrated in Section 5.

与近期利用多任务提示微调来改善零样本任务迁移的工作(Wei et al., 2022a;Sanh et al., 2022)相比,MIP仅占5%的标记,并且设置在预训练阶段,以防止破坏LLM的其他通用能力,例如无条件自由生成。

具体来说,我们包括74个提示数据集,来自(Sanh et al., 2022;Wang et al., 2022a),列于附录C和表12。建议GLM-130B用户避免根据第5节中说明的标准在这些数据集上评估其零样本和少样本能力。

2.3、PLATFORM-AWARE PARALLEL STRATEGIES AND MODEL CONFIGURATIONS—平台感知的并行策略和模型配置

GLM-130B is trained on a cluster of 96 DGX-A100 GPU (8×40G) servers with a 60-day access. The goal is to pass through as many tokens as possible, as a recent study (Hoffmann et al., 2022) suggests that most existing LLMs are largely under-trained.

The 3D Parallel Strategy. The data parallelism (Valiant, 1990) and tensor model parallelism (Shoeybi et al., 2019) are the de facto practices for training billion-scale models (Wang & Komatsuzaki, 2021; Du et al., 2022). To further handle the huge GPU memory requirement and the decrease in overall GPU utilization resulted from applying tensor parallel between nodes—as 40G rather than 80G A100s are used for training GLM-130B, we combine the pipeline model parallelism with the other two strategies to form a 3D parallel strategy.

GLM-130B在由96台DGX-A100 GPU(8×40G)服务器组成的集群上进行训练,访问期限为60天。目标是让模型处理尽可能多的token,因为最近的一项研究(Hoffmann等人,2022年)表明,大多数现有的LLM在很大程度上训练不足。

三维并行策略。数据并行(Valiant,1990年)和张量模型并行(Shoeybi等人,2019年)是训练十亿规模模型(Wang & Komatsuzaki,2021年;Du等人,2022年)的事实标准做法。为了进一步应对巨大的GPU显存需求,以及在节点之间应用张量并行所导致的整体GPU利用率下降(因为训练GLM-130B使用的是40G而非80G的A100),我们将管道(流水线)模型并行与另外两种策略相结合,形成三维并行策略。

The pipeline parallelism divides the model into sequential stages for each parallel group, and to further minimize bubbles introduced by pipeline, we leverage the PipeDream-Flush (Narayanan et al., 2021) implementation from DeepSpeed (Rasley et al., 2020) to train GLM-130B with a relative big global batch size (4,224) to reduce time and GPU memory wasting. Through both numerical and empirical examinations, we adopt 4-way tensor parallelism and 8-way pipeline parallelism (Cf. Appendix B.4 for details). Following the calculation in (Chowdhery et al., 2022), we report hardware FLOPs utilization (HFU) of 43.3% and model FLOPs utilization (MFU) of 32.5% due to re-materialization.

GLM-130B Configurations. We aim to enable our 100B-scale LLM to run a single DGX-A100 (40G) node in FP16 precision. Based on the hidden state dimension of 12,288 we adopt from GPT-3, the resultant model size has to be no more than 130B parameters, thus GLM-130B. To maximize GPU utilization, we configure the model based on the platform and its corresponding parallel strategy. To avoid insufficient memory utilization in the middle stages due to the additional word embedding at both ends, we balance the pipeline partition by removing one layer from them, making 9×8-2=70 transformer layers in GLM-130B.

管道并行将模型划分为每个并行组内的若干顺序阶段。为了进一步减少管道引入的气泡(空闲等待),我们利用DeepSpeed(Rasley等人,2020年)中的PipeDream-Flush(Narayanan等人,2021年)实现,以相对较大的全局批量大小(4,224)训练GLM-130B,以减少时间和GPU显存的浪费。通过数值与实证检验,我们采用4路张量并行和8路管道并行(详见附录B.4)。按照(Chowdhery等人,2022年)的计算方式,由于重计算(re-materialization),我们报告的硬件FLOPs利用率(HFU)为43.3%,模型FLOPs利用率(MFU)为32.5%。

GLM-130B配置。我们的目标是让这个100B规模的LLM能够以FP16精度在单个DGX-A100(40G)节点上运行。基于从GPT-3沿用的12,288隐层维度,最终的模型规模不能超过130B参数,因此命名为GLM-130B。为了最大化GPU利用率,我们根据平台及其相应的并行策略来配置模型。为了避免两端额外的词嵌入导致中间阶段显存利用不足,我们通过从两端的阶段各移除一层来平衡管道划分,使GLM-130B中共有9×8-2=70个Transformer层。
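
下面的小脚本复算文中给出的并行布局和模型规模数量级(其中“每层约 12·d² 参数”为常用的粗略近似,属个人假设,未按 GLU 的实际中间维度精确计算,仅用于核对数量级):

# 示意:复算文中的并行布局与参数量级(粗略估计)
nodes, gpus_per_node = 96, 8
tensor_parallel, pipeline_parallel = 4, 8

total_gpus = nodes * gpus_per_node                      # 768 张 A100(40G)
gpus_per_replica = tensor_parallel * pipeline_parallel  # 32:一个模型副本占用的 GPU 数
data_parallel = total_gpus // gpus_per_replica          # 24 路数据并行
print(total_gpus, gpus_per_replica, data_parallel)

layers = 9 * pipeline_parallel - 2                      # 9×8-2 = 70 层 Transformer
hidden, vocab = 12288, 150000
# 常用粗略近似:每层约 12·d^2 个参数(注意力约 4·d^2 + FFN 约 8·d^2)
approx_params = 12 * layers * hidden ** 2 + vocab * hidden
print(f"layers={layers}, approx params ≈ {approx_params / 1e9:.1f}B")  # ≈ 128.7B,与 130B 量级吻合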

During the 60-day access to the cluster, we manage to train GLM-130B for 400 billion tokens (roughly 200 billion each for Chinese and English) with a fixed sequence length of 2,048 per sample. For the [gMASK] training objective, we use a context window of 2,048 tokens. For the [MASK] and multi-task objectives, we use a context window of 512 and concatenate four samples together to cater the 2,048-sequence-length. We warm-up the batch size from 192 to 4224 over the first 2.5% samples. We use AdamW (Loshchilov & Hutter, 2019) as our optimizer with β1 and β2 set to 0.9 and 0.95, and a weight decay value of 0.1. We warm up the learning rate from 10^-7 to 8×10^-5 over the first 0.5% samples, then decay it by a 10× cosine schedule. We use a dropout rate of 0.1 and clip gradients using a clipping value of 1.0 (Cf. Table 11 for the full configurations).

在对该集群60天的访问期内,我们以每个样本2,048的固定序列长度,训练GLM-130B处理了4000亿个token(中英文各约2000亿)。对于[gMASK]训练目标,我们使用2,048个token的上下文窗口;对于[MASK]和多任务目标,我们使用512的上下文窗口,并将四个样本拼接在一起以凑满2,048的序列长度。我们在前2.5%的样本内将批量大小从192热身到4,224。我们使用AdamW(Loshchilov & Hutter,2019年)作为优化器,β1和β2分别设为0.9和0.95,权重衰减为0.1。我们在前0.5%的样本内将学习率从10^-7热身到8×10^-5,然后按10×余弦调度进行衰减。我们使用0.1的dropout率,并以1.0为阈值进行梯度裁剪(完整配置详见表11)。
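
下面按上述描述给出一个学习率调度的示意实现(假设“10×余弦调度”意为余弦衰减到峰值的 1/10,此为个人解读;总步数为演示用占位值,优化器设置按文中超参):

# 示意:线性热身 + 余弦衰减到峰值 1/10 的学习率调度(解读性实现,非官方代码)
import math

PEAK_LR, INIT_LR = 8e-5, 1e-7
WARMUP_FRAC, TOTAL_STEPS = 0.005, 100_000  # 总步数仅为演示用占位值

def lr_at(step: int) -> float:
    warmup_steps = max(1, int(WARMUP_FRAC * TOTAL_STEPS))
    if step < warmup_steps:                              # 线性热身:1e-7 → 8e-5
        return INIT_LR + (PEAK_LR - INIT_LR) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, TOTAL_STEPS - warmup_steps)
    min_lr = PEAK_LR / 10                                # 余弦衰减:峰值 → 峰值的 1/10
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * min(1.0, progress)))

# 优化器按文中设置:AdamW(β1=0.9, β2=0.95, weight decay=0.1),梯度裁剪阈值 1.0,例如:
# optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, betas=(0.9, 0.95), weight_decay=0.1)
for s in (0, 250, 500, 50_000, 100_000):
    print(s, f"{lr_at(s):.2e}")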

3、THE TRAINING STABILITY OF GLM-130B—GLM-130B的训练稳定性

The training stability is the decisive factor in GLM-130B’s quality, which is also largely impacted by the number of tokens it passes through (Hoffmann et al., 2022). Thus, given the computing usage constraint, there has to be a trade-off between efficiency and stability with regard to floating-point (FP) formats: low-precision FP formats (e.g., 16-bit precision—FP16) improve computing efficiency but are prone to overflow and underflow errors, resulting in training collapses.

Mixed-Precision. We follow the common practice of a mixed-precision (Micikevicius et al., 2018) strategy (Apex O2), i.e., FP16 for forwards and backwards and FP32 for optimizer states and master weights, to reduce the GPU memory usage and improve training efficiency. Similar to OPT-175B and BLOOM-176B (C.f. Figure 10 in Appendix), the training of GLM-130B faces frequent loss spikes resulted from this choice, which tends to become increasingly frequent as the training goes on. The precision related spikes are often without clear reasons: some recover on their own; others come with a portent of suddenly soaring gradient norm and eventually a spike or even NaN in loss. OPT-175B attempted to fix by manually skipping data and adjusting hyper-parameters; BLOOM-176B did so via the embedding norm technique (Dettmers et al., 2021). We spent months to empirically investigate the spikes and realize that a few issues emerge when transformers scale up:

训练稳定性是决定GLM-130B质量的关键因素,而模型质量又在很大程度上受其处理的token数量影响(Hoffmann等人,2022年)。因此,在计算资源受限的情况下,必须在浮点(FP)格式的效率和稳定性之间进行权衡:低精度FP格式(例如16位精度的FP16)可以提高计算效率,但容易发生上溢和下溢错误,导致训练崩溃。

混合精度。我们遵循混合精度(Micikevicius等人,2018年)策略(Apex O2)的常见做法,即前向和反向传播使用FP16,优化器状态和主权重使用FP32,以减少GPU显存占用并提高训练效率。与OPT-175B和BLOOM-176B(参见附录中的图10)类似,这一选择使GLM-130B的训练面临频繁的损失尖峰,并且随着训练的进行,这种现象变得越来越频繁。与精度相关的尖峰往往没有明确的原因:有些会自行恢复;有些则以梯度范数突然飙升为先兆,最终导致损失出现尖峰甚至NaN。OPT-175B尝试通过手动跳过数据和调整超参数来修复此问题;BLOOM-176B则采用嵌入归一化技术(Dettmers等人,2021年)。我们花费数月时间对这些尖峰进行了实证研究,并意识到当Transformer模型规模扩大时会出现以下几个问题:

First, the transformer main branch’s value scale can be extremely large in deeper layers if using Pre-LN. This is addressed in GLM-130B by using DeepNorm based Post-LN (Cf. Section 2.1), which makes the value scale always bounded.

Second, the attention scores grow so large that they exceed FP16's range, as the model scales up. There are a few options to overcome this issue in LLMs. In CogView (Ding et al., 2021), PB-Relax is proposed to remove bias terms and deduct extremum value in attention computation to avoid the problem, which unfortunately does not help avoid disconvergence in GLM-130B. In BLOOM-176B, the BF16 format is used instead of FP16, due to its wide range of values on NVIDIA Ampere GPUs (i.e., A100). However, BF16 consumes ∼15% more run-time GPU memory than FP16 in our experiments due to its conversion to FP32 in gradient accumulation, and more importantly it is not supported on other GPU platforms (e.g., NVIDIA Tesla V100), limiting the accessibility of produced LLMs. Another option from BLOOM-176B is to apply embedding norm with BF16, but in sacrifice of a significant penalty on model performance, as they notice that embedding norm can harm model’s zero-shot learning (Cf. Section 4.3 in (Scao et al., 2022)).

Embedding Layer Gradient Shrink (EGS). Our empirical search identifies that the gradient norm can serve as an informative indicator of training collapses. Specifically, we find that a training collapse usually lags behind a “spike” in gradient norm by a few training steps. Such spikes are usually caused by the embedding layer’s abnormal gradients, as we observe that its gradient norm is often several magnitude larger than those of other layers in GLM-130B’s early stage training (Cf. Figure 4 (a)). In addition, it tends to fluctuate dramatically in the early training. The problem is handled in vision models (Chen et al., 2021) via freezing the patch projection layer. Unfortunately, we cannot freeze the training of the embedding layer in language models.

首先,如果使用Pre-LN,Transformer主分支的数值规模在较深的层中可能会变得极大。GLM-130B通过使用基于DeepNorm的Post-LN(详见第2.1节)来解决这个问题,使数值规模始终有界。

其次,随着模型规模的增大,注意力分数会变得非常大,以至于超出FP16的表示范围。在LLM中有几种方法可以缓解这个问题。在CogView(Ding等人,2021年)中,提出了PB-Relax,在注意力计算中去除偏置项并减去极值以避免该问题,但不幸的是,这并不能避免GLM-130B的训练发散。在BLOOM-176B中,使用BF16格式代替FP16,因为它在NVIDIA Ampere GPU(即A100)上具有更宽的数值范围。然而,由于BF16在梯度累积中需要转换为FP32,在我们的实验中,它比FP16多消耗约15%的运行时GPU显存;更重要的是,它不被其他GPU平台(例如NVIDIA Tesla V100)支持,限制了所产出LLM的可用性。BLOOM-176B的另一种选择是将嵌入归一化与BF16一起使用,但这会以显著的模型性能损失为代价,因为他们注意到嵌入归一化可能损害模型的零样本学习能力(详见(Scao等人,2022年)第4.3节)。
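
作为补充,下面给出一个通用的数值稳定化示意(并非文中给出的GLM-130B具体方案,仅说明“在FP32中计算注意力softmax并减去行最大值”这一常见做法):

# 通用示意:在 FP32 中计算注意力 softmax 以降低 FP16 溢出风险(非论文给出的具体方案)
import torch

def stable_attention_probs(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: [batch, heads, seq, head_dim];先升到 FP32 再计算分数与 softmax,最后转回原精度。"""
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q.float(), k.float().transpose(-1, -2)) * scale
    scores = scores - scores.amax(dim=-1, keepdim=True)  # 减去每行最大值,进一步避免指数溢出
    return torch.softmax(scores, dim=-1).to(q.dtype)

q = torch.randn(1, 2, 8, 64, dtype=torch.float16)
k = torch.randn(1, 2, 8, 64, dtype=torch.float16)
print(stable_attention_probs(q, k).sum(dim=-1))  # 每行权重之和应接近 1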

嵌入层梯度缩减(EGS)。我们的实证探索发现,梯度范数可以作为训练崩溃的有效预警指标。具体而言,训练崩溃通常滞后于梯度范数的“尖峰”几个训练步。这类尖峰通常由嵌入层的异常梯度引起:我们观察到,在GLM-130B训练早期,嵌入层的梯度范数往往比其他层大几个数量级(详见图4(a)),并且在训练早期剧烈波动。视觉模型(Chen等人,2021年)通过冻结patch投影层来解决这个问题;不幸的是,在语言模型中我们无法冻结嵌入层的训练。

Finally, we find the gradient shrink on embedding layers could overcome loss spikes and thus stabilize GLM-130B’s training. It is first used in the multi-modal transformer CogView (Ding et al., 2021). Let α be the shrinking factor, the strategy can be easily implemented via word_embedding = word_embedding ∗ α + word_embedding.detach() ∗ (1 − α). Figure 4 (b) suggests that empirically, setting α = 0.1 wipes out most spikes we would have met, with negligible latency.

In fact, the final GLM-130B training run only experiences three late-stage loss divergence cases, though it fails numerous times due to hardware failures. For the three unexpected spikes, it turns out further shrinking the embedding gradient can still help stabilize the GLM-130B training. See the training notes and Tensorboard logs in our code repository for details.

最后,我们发现对嵌入层的梯度进行缩减可以克服损失尖峰,从而稳定GLM-130B的训练。这一策略最早用于多模态Transformer模型CogView(Ding等人,2021年)。设α为缩减因子,该策略可以简单地通过 word_embedding = word_embedding * α + word_embedding.detach() * (1 - α) 来实现。图4(b)的经验结果表明,设置α = 0.1可以消除我们原本会遇到的大部分尖峰,且延迟开销可以忽略不计。
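
下面把文中这行公式封装成一个可直接验证的最小模块(公式逐字对应原文;模块名、词表大小等均为演示用假设):

# 示意:嵌入层梯度缩减(EGS),前向值不变,反向只让 α 倍梯度流回嵌入权重
import torch
import torch.nn as nn

class ShrinkEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, alpha: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.alpha = alpha

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(input_ids)
        # 与文中公式对应:word_embedding * α + word_embedding.detach() * (1 - α)
        return h * self.alpha + h.detach() * (1 - self.alpha)

# 验证:嵌入权重的梯度确实被缩小到 α 倍
emb = ShrinkEmbedding(vocab_size=100, d_model=16, alpha=0.1)
ids = torch.randint(0, 100, (2, 5))
emb(ids).sum().backward()
print(emb.embed.weight.grad.abs().max())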

事实上,最终的GLM-130B训练过程只出现了三次后期损失发散情况,尽管由于硬件故障导致失败了很多次。对于这三个意外的峰值,进一步缩小嵌入层梯度仍然有助于稳定GLM-130B的训练。有关详细信息,请参阅我们代码库中的训练笔记和Tensorboard日志。

7、CONCLUSION AND LESSONS结论和教训

We introduce GLM-130B, a bilingual pre-trained language model that aims to facilitate open and inclusive LLM research. GLM-130B’s technical and engineering undertakings generate insight into LLMs’ architectures, pre-training objectives, training stability and efficiency, and affordable inference. Altogether, it contributes to the high quality of GLM-130B in terms of both language performance on 112 tasks and ethical results on bias and toxicity benchmarks. Our experiences of both success and failure are condensed into the lessons for training 100B-scale LLMs, attached in the Appendix B.10.

我们介绍了GLM-130B,一个旨在促进开放和包容的LLM研究的双语预训练语言模型。GLM-130B的技术与工程实践为LLM的架构、预训练目标、训练稳定性与效率以及可负担的推理提供了洞见。这些工作共同造就了GLM-130B的高质量:既体现在112个任务上的语言性能,也体现在偏见与毒性基准上的伦理表现。我们将成功与失败的经验浓缩为训练100B规模LLM的教训,附于附录B.10。

ACKNOWLEDGEMENT致谢

This research was supported by Natural Science Foundation of China (NSFC) 61825602, 62276148 and Zhipu.AI. We thank all our collaborators and partners from the Knowledge Engineering Group (KEG), Parallel Architecture & Compiler technology of Mobile, Accelerated, and Networked sys- tems Group (PACMAN), Natural Language Processing Group (THUNLP) at Tsinghua University, and Zhipu.AI.

本研究得到了国家自然科学基金(NSFC)61825602、62276148以及智谱AI(Zhipu.AI)的支持。我们感谢来自清华大学知识工程实验室(KEG)、并行架构与编译技术实验室(PACMAN)、自然语言处理实验室(THUNLP)以及智谱AI的所有合作者与合作伙伴。
