
Pre-training large language models

In the previous video, you were introduced to the generative AI project life cycle. As you saw, there are a few steps to take before you can get to the fun part, launching your generative AI app. Once you have scoped out your use case, and determined how you'll need the LLM to work within your application, your next step is to select a model to work with. Your first choice will be to either work with an existing model, or train your own from scratch. There are specific circumstances where training your own model from scratch might be advantageous, and you'll learn about those later in this lesson.

In general, however, you'll begin the process of developing your application using an existing foundation model. Many open-source models are available for members of the AI community like you to use in your application. The developers of some of the major frameworks for building generative AI applications, such as Hugging Face and PyTorch, have curated hubs where you can browse these models. A really useful feature of these hubs is the inclusion of model cards that describe important details, including the best use cases for each model, how it was trained, and known limitations. You'll find some links to these model hubs in the reading at the end of the week.

The exact model you choose will depend on the details of the task you need to carry out. Variants of the transformer model architecture are suited to different language tasks, largely because of differences in how the models are trained. To help you better understand these differences and to develop intuition about which model to use for a particular task, let's take a closer look at how large language models are trained. With this knowledge in hand, you'll find it easier to navigate the model hubs and find the best model for your use case.

To begin, let's take a high-level look at the initial training process for LLMs. This phase is often referred to as pre-training. As you saw in Lesson 1, LLMs encode a deep statistical representation of language. This understanding is developed during the model's pre-training phase, when the model learns from vast amounts of unstructured textual data. This can be gigabytes, terabytes, or even petabytes of text. This data is pulled from many sources, including scrapes of the Internet and corpora of text that have been assembled specifically for training language models. In this self-supervised learning step, the model internalizes the patterns and structures present in the language. These patterns then enable the model to complete its training objective, which depends on the architecture of the model, as you'll see shortly.

During pre-training, the model weights get updated to minimize the loss of the training objective. The encoder generates an embedding, or vector representation, for each token. Pre-training also requires a large amount of compute and the use of GPUs. Note that when you scrape training data from public sites such as the Internet, you often need to process the data to increase quality, address bias, and remove other harmful content. As a result of this data quality curation, often only 1-3% of tokens are used for pre-training. You should consider this when you estimate how much data you need to collect if you decide to pre-train your own model.
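
To get a feel for what that 1-3% yield means in practice, here is a minimal back-of-the-envelope sketch in Python. The target token count, the bytes-per-token figure, and the retention rates are illustrative assumptions, not numbers from the course.

```python
# Rough estimate of how much raw text you would need to scrape to end up with a
# target number of usable pre-training tokens, assuming only a small fraction
# survives quality filtering. All constants below are assumptions.

TARGET_TOKENS = 1e12      # assumed goal: 1 trillion usable training tokens
BYTES_PER_TOKEN = 4       # rough average for English text with a subword tokenizer

def raw_data_needed(target_tokens, retention_rate, bytes_per_token=BYTES_PER_TOKEN):
    """Return the raw corpus size (in terabytes) needed before filtering."""
    raw_tokens = target_tokens / retention_rate
    raw_bytes = raw_tokens * bytes_per_token
    return raw_bytes / 1e12  # bytes -> terabytes

for rate in (0.01, 0.03):  # the 1-3% range mentioned above
    print(f"retention {rate:.0%}: ~{raw_data_needed(TARGET_TOKENS, rate):,.0f} TB of raw text")
```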

Earlier this week, you saw that there are three variants of the transformer model: encoder-only, encoder-decoder, and decoder-only. Each of these is trained on a different objective, and so learns how to carry out different tasks. Encoder-only models are also known as autoencoding models, and they are pre-trained using masked language modeling. Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence. This is also called a denoising objective. Autoencoding models build bi-directional representations of the input sequence, meaning that the model has an understanding of the full context of a token and not just of the words that come before it. Encoder-only models are ideally suited to tasks that benefit from this bi-directional context. You can use them to carry out sentence classification tasks, for example sentiment analysis, or token-level tasks like named entity recognition or word classification. Some well-known examples of autoencoding models are BERT and RoBERTa.
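
As a concrete illustration of the masked language modeling objective, the short sketch below uses the Hugging Face transformers library to ask a pre-trained BERT model to fill in a masked token. The model name and example sentence are illustrative choices, and this shows inference with an already pre-trained autoencoding model rather than the pre-training loop itself.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Load a fill-mask pipeline backed by an autoencoding (encoder-only) model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the full sentence on both sides of the [MASK] token,
# which is what "bi-directional context" refers to.
predictions = fill_mask("The teacher [MASK] the students a question.")

for p in predictions[:3]:
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```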

Now, let's take a look at decoder-only, or autoregressive, models, which are pre-trained using causal language modeling. Here, the training objective is to predict the next token based on the previous sequence of tokens. Predicting the next token is sometimes called full language modeling by researchers. Decoder-based autoregressive models mask the input sequence so that the model can only see the input tokens leading up to the token in question; the model has no knowledge of the end of the sentence. The model then iterates over the input sequence token by token to predict the following token. In contrast to the encoder architecture, this means that the context is unidirectional. By learning to predict the next token from a vast number of examples, the model builds up a statistical representation of language. Models of this type make use of the decoder component of the original architecture without the encoder. Decoder-only models are often used for text generation, although larger decoder-only models show strong zero-shot inference abilities and can often perform a range of tasks well. Well-known examples of decoder-based autoregressive models are GPT and BLOOM.
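
To see the causal, next-token objective in action, here is a minimal sketch that samples continuations from GPT-2 with the Hugging Face transformers library. The model choice, prompt, and generation settings are illustrative assumptions.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# A decoder-only, autoregressive model: it only attends to tokens
# that come before the position it is predicting.
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "Large language models are trained by",
    max_new_tokens=30,        # generate up to 30 additional tokens
    num_return_sequences=2,   # return two sampled continuations
    do_sample=True,           # sample rather than greedy decode
)

for out in outputs:
    print(out["generated_text"])
    print("---")
```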

The final variation of the transformer model is the sequence-to-sequence model, which uses both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model. A popular sequence-to-sequence model, T5, pre-trains the encoder using span corruption, which masks random sequences of input tokens. Those masked sequences are then replaced with a unique sentinel token, shown here as x. Sentinel tokens are special tokens added to the vocabulary that do not correspond to any actual word from the input text. The decoder is then tasked with reconstructing the masked token sequences auto-regressively; the output is the sentinel token followed by the predicted tokens. You can use sequence-to-sequence models for translation, summarization, and question answering. They are generally useful in cases where you have a body of text as both input and output. Besides T5, which you'll use in the labs in this course, another well-known encoder-decoder model is BART (not to be confused with BERT).
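
The sketch below shows what the span-corruption format looks like with T5's sentinel tokens, which in the Hugging Face implementation are written as <extra_id_0>, <extra_id_1>, and so on. The sentences and spans are illustrative; the snippet only demonstrates how a corrupted input and its target are encoded and scored, not the full pre-training procedure.

```python
# Requires: pip install transformers sentencepiece torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Span corruption: masked spans in the input are replaced by sentinel tokens,
# and the target reconstructs each span after its matching sentinel.
corrupted_input = "The <extra_id_0> walks in <extra_id_1> park"
target = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"

input_ids = tokenizer(corrupted_input, return_tensors="pt").input_ids
labels = tokenizer(target, return_tensors="pt").input_ids

# The loss below is the denoising objective for this single example.
loss = model(input_ids=input_ids, labels=labels).loss
print(f"span-corruption loss: {loss.item():.3f}")
```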

To summarize, here's a quick comparison of the different model architectures and the targets of their pre-training objectives. Autoencoding models are pre-trained using masked language modeling. They correspond to the encoder part of the original transformer architecture, and are often used for sentence classification or token classification. Autoregressive models are pre-trained using causal language modeling. Models of this type make use of the decoder component of the original transformer architecture, and are often used for text generation. Sequence-to-sequence models use both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model; the T5 model, for example, is pre-trained using span corruption. Sequence-to-sequence models are often used for translation, summarization, and question answering. Now that you have seen how these different model architectures are trained and the specific tasks they are well suited to, you can select the type of model that is best suited to your use case.

One additional thing to keep in mind is that larger models of any architecture are typically more capable of carrying out their tasks well. Researchers have found that the larger a model, the more likely it is to work as you need it to without additional in-context learning or further training. This observed trend of increased model capability with size has driven the development of larger and larger models in recent years. This growth has been fueled by inflection points in research, such as the introduction of the highly scalable transformer architecture, access to massive amounts of data for training, and the development of more powerful compute resources. This steady increase in model size has actually led some researchers to hypothesize the existence of a new Moore's law for LLMs. Like them, you may be asking: can we just keep adding parameters to increase performance and make models smarter? Where could this model growth lead? While this may sound great, it turns out that training these enormous models is difficult and very expensive, so much so that it may be infeasible to continuously train larger and larger models. Let's take a closer look at some of the challenges associated with training large models in the next video.

