
Pre-training large language models

In the previous video, you were introduced to the generative AI project life cycle. As you saw, there are a few steps to take before you can get to the fun part, launching your generative AI app. Once you have scoped out your use case, and determined how you'll need the LLM to work within your application, your next step is to select a model to work with. Your first choice will be to either work with an existing model, or train your own from scratch. There are specific circumstances where training your own model from scratch might be advantageous, and you'll learn about those later in this lesson.

In general, however, you'll begin the process of developing your application using an existing foundation model. Many open-source models are available for members of the AI community like you to use in your application. The developers of some of the major frameworks for building generative AI applications, such as Hugging Face and PyTorch, have curated hubs where you can browse these models. A really useful feature of these hubs is the inclusion of model cards that describe important details, including the best use cases for each model, how it was trained, and known limitations. You'll find some links to these model hubs in the reading at the end of the week.

The exact model you choose will depend on the details of the task you need to carry out. Variants of the transformer model architecture are suited to different language tasks, largely because of differences in how the models are trained. To help you better understand these differences and to develop intuition about which model to use for a particular task, let's take a closer look at how large language models are trained. With this knowledge in hand, you'll find it easier to navigate the model hubs and find the best model for your use case.

To begin, let's take a high-level look at the initial training process for LLMs. This phase is often referred to as pre-training. As you saw in Lesson 1, LLMs encode a deep statistical representation of language. This understanding is developed during the model's pre-training phase, when the model learns from vast amounts of unstructured textual data. This can be gigabytes, terabytes, or even petabytes of text. This data is pulled from many sources, including scrapes of the Internet and corpora of text that have been assembled specifically for training language models. In this self-supervised learning step, the model internalizes the patterns and structures present in the language. These patterns then enable the model to complete its training objective, which depends on the architecture of the model, as you'll see shortly.

During pre-training, the model weights get updated to minimize the loss of the training objective. The encoder generates an embedding, or vector representation, for each token. Pre-training also requires a large amount of compute and the use of GPUs. Note that when you scrape training data from public sites such as the Internet, you often need to process the data to increase quality, address bias, and remove other harmful content. As a result of this data quality curation, often only 1-3% of tokens are used for pre-training. You should consider this when you estimate how much data you need to collect if you decide to pre-train your own model.
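
To get a feel for what that 1-3% yield means in practice, here is a minimal back-of-the-envelope sketch in Python. The target token count, the bytes-per-token figure, and the retention rates are illustrative assumptions, not numbers from the course.

```python
# Rough estimate of how much raw text you would need to scrape to end up with a
# target number of usable pre-training tokens, assuming only a small fraction
# survives quality filtering. All constants below are assumptions.

TARGET_TOKENS = 1e12      # assumed goal: 1 trillion usable training tokens
BYTES_PER_TOKEN = 4       # rough average for English text with a subword tokenizer

def raw_data_needed(target_tokens, retention_rate, bytes_per_token=BYTES_PER_TOKEN):
    """Return the raw corpus size (in terabytes) needed before filtering."""
    raw_tokens = target_tokens / retention_rate
    raw_bytes = raw_tokens * bytes_per_token
    return raw_bytes / 1e12  # bytes -> terabytes

for rate in (0.01, 0.03):  # the 1-3% range mentioned above
    print(f"retention {rate:.0%}: ~{raw_data_needed(TARGET_TOKENS, rate):,.0f} TB of raw text")
```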

Earlier this week, you saw that there are three variants of the transformer model: encoder-only, encoder-decoder, and decoder-only. Each of these is trained on a different objective, and so learns how to carry out different tasks. Encoder-only models are also known as autoencoding models, and they are pre-trained using masked language modeling. Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence. This is also called a denoising objective. Autoencoding models build bi-directional representations of the input sequence, meaning that the model has an understanding of the full context of a token and not just of the words that come before it. Encoder-only models are ideally suited to tasks that benefit from this bi-directional context. You can use them to carry out sentence classification tasks, for example sentiment analysis, or token-level tasks like named entity recognition or word classification. Some well-known examples of autoencoding models are BERT and RoBERTa.
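
As a concrete illustration of the masked language modeling objective, the short sketch below uses the Hugging Face transformers library to ask a pre-trained BERT model to fill in a masked token. The model name and example sentence are illustrative choices, and this shows inference with an already pre-trained autoencoding model rather than the pre-training loop itself.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# Load a fill-mask pipeline backed by an autoencoding (encoder-only) model.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT sees the full sentence on both sides of the [MASK] token,
# which is what "bi-directional context" refers to.
predictions = fill_mask("The teacher [MASK] the students a question.")

for p in predictions[:3]:
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```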

Now, let's take a look at decoder-only, or autoregressive, models, which are pre-trained using causal language modeling. Here, the training objective is to predict the next token based on the previous sequence of tokens. Predicting the next token is sometimes called full language modeling by researchers. Decoder-based autoregressive models mask the input sequence so that the model can only see the input tokens leading up to the token in question; the model has no knowledge of the end of the sentence. The model then iterates over the input sequence token by token to predict the following token. In contrast to the encoder architecture, this means that the context is unidirectional. By learning to predict the next token from a vast number of examples, the model builds up a statistical representation of language. Models of this type make use of the decoder component of the original architecture without the encoder. Decoder-only models are often used for text generation, although larger decoder-only models show strong zero-shot inference abilities and can often perform a range of tasks well. Well-known examples of decoder-based autoregressive models are GPT and BLOOM.
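
To see the causal, next-token objective in action, here is a minimal sketch that samples continuations from GPT-2 with the Hugging Face transformers library. The model choice, prompt, and generation settings are illustrative assumptions.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# A decoder-only, autoregressive model: it only attends to tokens
# that come before the position it is predicting.
generator = pipeline("text-generation", model="gpt2")

outputs = generator(
    "Large language models are trained by",
    max_new_tokens=30,        # generate up to 30 additional tokens
    num_return_sequences=2,   # return two sampled continuations
    do_sample=True,           # sample rather than greedy decode
)

for out in outputs:
    print(out["generated_text"])
    print("---")
```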

The final variation of the transformer model is the sequence-to-sequence model, which uses both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model. A popular sequence-to-sequence model, T5, pre-trains the encoder using span corruption, which masks random sequences of input tokens. Those masked sequences are then replaced with a unique sentinel token, shown here as x. Sentinel tokens are special tokens added to the vocabulary that do not correspond to any actual word from the input text. The decoder is then tasked with reconstructing the masked token sequences auto-regressively; the output is the sentinel token followed by the predicted tokens. You can use sequence-to-sequence models for translation, summarization, and question answering. They are generally useful in cases where you have a body of text as both input and output. Besides T5, which you'll use in the labs in this course, another well-known encoder-decoder model is BART (not to be confused with BERT).
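
The sketch below shows what the span-corruption format looks like with T5's sentinel tokens, which in the Hugging Face implementation are written as <extra_id_0>, <extra_id_1>, and so on. The sentences and spans are illustrative; the snippet only demonstrates how a corrupted input and its target are encoded and scored, not the full pre-training procedure.

```python
# Requires: pip install transformers sentencepiece torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Span corruption: masked spans in the input are replaced by sentinel tokens,
# and the target reconstructs each span after its matching sentinel.
corrupted_input = "The <extra_id_0> walks in <extra_id_1> park"
target = "<extra_id_0> cute dog <extra_id_1> the <extra_id_2>"

input_ids = tokenizer(corrupted_input, return_tensors="pt").input_ids
labels = tokenizer(target, return_tensors="pt").input_ids

# The loss below is the denoising objective for this single example.
loss = model(input_ids=input_ids, labels=labels).loss
print(f"span-corruption loss: {loss.item():.3f}")
```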

To summarize, here's a quick comparison of the different model architectures and the targets of their pre-training objectives. Autoencoding models are pre-trained using masked language modeling. They correspond to the encoder part of the original transformer architecture, and are often used for sentence classification or token classification. Autoregressive models are pre-trained using causal language modeling. Models of this type make use of the decoder component of the original transformer architecture, and are often used for text generation. Sequence-to-sequence models use both the encoder and decoder parts of the original transformer architecture. The exact details of the pre-training objective vary from model to model; the T5 model, for example, is pre-trained using span corruption. Sequence-to-sequence models are often used for translation, summarization, and question answering. Now that you have seen how these different model architectures are trained and the specific tasks they are well suited to, you can select the type of model that is best suited to your use case.

One additional thing to keep in mind is that larger models of any architecture are typically more capable of carrying out their tasks well. Researchers have found that the larger a model, the more likely it is to work as you need it to without additional in-context learning or further training. This observed trend of increased model capability with size has driven the development of larger and larger models in recent years. This growth has been fueled by inflection points in research, such as the introduction of the highly scalable transformer architecture, access to massive amounts of data for training, and the development of more powerful compute resources. This steady increase in model size has actually led some researchers to hypothesize the existence of a new Moore's law for LLMs. Like them, you may be asking: can we just keep adding parameters to increase performance and make models smarter? Where could this model growth lead? While this may sound great, it turns out that training these enormous models is difficult and very expensive, so much so that it may be infeasible to continuously train larger and larger models. Let's take a closer look at some of the challenges associated with training large models in the next video.

