
Paper:《Pre-trained Models for Natural Language Processing: A Survey自然语言处理的预训练模型综述》翻译与解读

目录

Paper:《Pre-trained Models for Natural Language Processing: A Survey自然语言处理的预训练模型综述》翻译与解读

Abstract

1、Introduction

2、Background

2.1 Language Representation Learning语言表征学习

Non-contextual Embeddings非上下文嵌入【静态词嵌入】

Contextual Embeddings上下文嵌入【动态词嵌入】

2.2 Neural Contextual Encoders神经网络上下文编码器

2.2.1 Sequence Models序列模型——CNN、RNN

2.2.2 Non-Sequence Models非序列模型——RecursNN、TreeLSTM、GCN、FCSA

2.2.3 Analysis分析

2.3 Why Pre-training?为什么需要预训练——三大优势

2.4 A Brief History of PTMs for NLP—NLP 的 PTM 简史

2.4.1 First-Generation PTMs: Pre-trained Word Embeddings 第一代PTMs:预训练词嵌入

2.4.2 Second-Generation PTMs: Pre-trained Contextual Encoders第二代PTMs:预训练的上下文编码器

3 Overview of PTMs—PTM的概述

3.1 Pre-training Tasks预训练任务

3.1.1 Language Modeling (LM)语言建模

3.1.2 Masked Language Modeling (MLM)掩码语言建模

3.1.3 Permuted Language Modeling (PLM)置换语言建模

3.1.4 Denoising Autoencoder (DAE)降噪自动编码器

3.1.5 Contrastive Learning (CTL)对比学习

3.1.6 Others

3.2 Taxonomy of PTMs

3.3 Model Analysis模型分析

3.3.1 Non-Contextual Embeddings非上下文嵌入

Figure 3: Taxonomy of PTMs with Representative Examples

Table 2: List of Representative PTMs有代表性的 PTMs 及其架构

3.3.2 Contextual Embeddings上下文嵌入

4 Extensions of PTMs—PTM 的扩展

4.1 Knowledge-Enriched PTMs知识丰富的 PTM

4.2 Multilingual and Language-Specific PTMs多语言和特定语言的PTMs

4.2.1 Multilingual PTMs多语言的PTMs

4.2.2 Language-Specific PTMs特定语言的 PTM

4.3 Multi-Modal PTMs多模态PTM

4.3.1 Video-Text PTMs

4.3.2 Image-Text PTMs图像-文本 PTM

4.3.3 Audio-Text PTMs音频-文本PTM

4.4 Domain-Specific and Task-Specific PTMs 特定领域和特定任务的 PTM

4.5 Model Compression模型压缩

4.5.1 Model Pruning模型剪枝——删除不太重要的参数

4.5.2 Quantization量化——用更少的比特来表示参数

4.5.3 Parameter Sharing参数共享——相似单元间共享参数

4.5.4 Knowledge Distillation知识蒸馏/提炼——训练一个更小的学生模型

关键词额外信息补充—Hard-target 和 Soft-target对比

4.5.5 Module Replacing模块替换——用更紧凑的替换

4.5.6 Early Exit早退

5 Adapting PTMs to Downstream Tasks使 PTM 适应下游任务

5.1 Transfer Learning迁移学习

5.2 How to Transfer?如何迁移

5.2.1 Choosing appropriate pre-training task, model architecture and corpus选择合适的预训练任务、模型架构和语料库

5.2.2 Choosing appropriate layers选择合适的层

5.2.3 To tune or not to tune?是否微调?

5.3 Fine-Tuning Strategies微调策略

5.3.1 Prompt-based Tuning基于提示的微调

6 Resources of PTMs—PTM 的资源

7 Applications应用

7.1 General Evaluation Benchmark通用评价基准

7.2 Question Answering / MRC

7.3 Sentiment Analysis情感分析

额外信息补充:情感分析任务之TBSA对比ABSA

7.4 Named Entity Recognition命名实体识别

7.5 Machine Translation机器翻译

7.6 Summarization摘要总结

7.7 Adversarial Attacks and Defenses对抗性攻击和防御AdvAtt

8 Future Directions未来发展方向

(1)、PTM尚未达到上限——更通用的PTM需要更深的模型、更大的语料和更具挑战性的预训练任务→带来更高成本,需要更复杂且高效的训练技术(分布式训练/混合精度/梯度累积等)→更实际的方向是在现有软硬件条件下设计更高效的架构与训练方法(如ELECTRA)

(2)、PTM架构——Transformer系列(需高计算复杂度)和非Transformer系列(如NAS)

(3)、面向任务的预训练(特殊场景需特殊架构和任务、提取部分知识)和模型压缩(NLP的PTM才初研究)

(4)、超越微调的知识转移——参数效率低→固定原始参数+自适应模块改进实现共享服务多个下游、挖掘作为外部知识实现特征提取、知识蒸馏、数据增强

(5)、 PTM的可解释性和可靠性——Transformer架构解释较难、易受到对抗性攻击(采用对抗性防御)

9 Conclusion结论

Acknowledgements

References


Paper:《Pre-trained Models for Natural Language Processing: A Survey自然语言处理的预训练模型综述》翻译与解读

作者

Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, Xuanjing Huang

时间

[提交于2020年3月18日(v1),最后修订于2021年6月23日(本版本,v4)]

来源

https://arxiv.org/abs/2003.08271

Abstract

Recently, the emergence of pre-trained models (PTMs)* has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

最近,预训练模型(PTMs)*的出现将自然语言处理(NLP)带入了一个新时代。在本综述中,我们对NLP的PTMs进行了全面回顾。我们首先简要介绍语言表征学习及其研究进展;然后从四个不同的角度对现有的PTM进行系统分类;接下来描述如何将PTM的知识应用于下游任务;最后展望PTM未来的研究方向。本综述旨在为理解、使用和开发面向各种NLP任务的PTMs提供实际操作指南。

Deep Learning, Neural Network, Natural Language Processing, Pre-trained Model, Distributed Representation, Word Embedding, Self-Supervised Learning, Language Modelling

深度学习,神经网络,自然语言处理,预训练模型,分布式表示,单词嵌入,自监督学习,语言建模

1、Introduction

With the development of deep learning, various neural networks have been widely used to solve Natural Language Processing (NLP) tasks, such as convolutional neural networks (CNNs) [1–3], recurrent neural networks (RNNs) [4, 5], graph-based neural networks (GNNs) [6–8] and attention mechanisms [9, 10]. One of the advantages of these neural models is their ability to alleviate the feature engineering problem.

Non-neural NLP methods usually heavily rely on the discrete handcrafted features, while neural methods usually use low-dimensional and dense vectors (aka. distributed representation) to implicitly represent the syntactic or semantic features of the language. These representations are learned in specific NLP tasks. Therefore, neural methods make it easy for people to develop various NLP systems.

Despite the success of neural models for NLP tasks, the performance improvement may be less significant compared to the Computer Vision (CV) field. The main reason is that current datasets for most supervised NLP tasks are rather small (except machine translation). Deep neural networks usually have a large number of parameters, which make them overfit on these small training data and do not generalize well in practice. Therefore, the early neural models for many NLP tasks were relatively shallow and usually consisted of only 1∼3 neural layers.

随着深度学习的发展,各种神经网络广泛用于解决NLP任务,如卷积神经网络(CNNs)[1-3],循环神经网络(RNNs)[4,5],基于图的神经网络(GNNs)[6-8]和注意力机制(attention)[9,10]。这些神经模型的优点之一是它们能够减轻特征工程问题。

非神经NLP方法通常严重依赖于离散的手工特征,而神经方法通常使用低维度密集的向量(又叫分布式表示)隐式地表示语言的句法语义特征。这些表示是在特定的NLP任务中学习的。因此,神经方法使人们可以很容易地开发各种NLP系统

尽管神经模型在NLP任务中取得了成功,但与计算机视觉(CV)领域相比,其性能改进可能不那么显著。主要原因是目前大多数有监督NLP任务的数据集都相当小(机器翻译除外)。深度神经网络通常具有大量的参数,这使得它们在这些小的训练数据上过拟合,在实践中不能很好地泛化。因此,许多 NLP 任务的早期神经模型相对较浅,通常仅由 1∼3 个神经层组成。

Recently, substantial work has shown that pre-trained models (PTMs) on the large corpus can learn universal language representations, which are beneficial for downstream NLP tasks and can avoid training a new model from scratch. With the development of computational power, the emergence of the deep models (i.e., Transformer [10]), and the constant enhancement of training skills, the architecture of PTMs has been advanced from shallow to deep.

The first-generation PTMs aim to learn good word embeddings. Since these models themselves are no longer needed by downstream tasks, they are usually very shallow for computational efficiencies, such as Skip-Gram [11] and GloVe [12]. Although these pre-trained embeddings can capture semantic meanings of words, they are context-free and fail to capture higher-level concepts in context, such as polysemous disambiguation, syntactic structures, semantic roles, anaphora.

The second-generation PTMs focus on learning contextual word embeddings, such as CoVe [13], ELMo [14], OpenAI GPT [15] and BERT [16]. These learned encoders are still needed to represent words in context by downstream tasks. Besides, various pre-training tasks are also proposed to learn PTMs for different purposes.

近年来,大量研究表明,在大型语料库上的预训练模型(PTMs)可以学习通用语言表示,这有利于下游的NLP任务,并且避免从头开始训练新模型。随着计算能力的发展,深度模型(如Transformer[10])的出现,以及训练技能的不断提高,PTM的架构已经由浅向深推进

第一代PTM的目标是学习良好的词嵌入。由于下游任务不再需要这些模型本身,因此出于计算效率的考虑,它们通常非常浅,如Skip-Gram[11]和GloVe[12]。尽管这些预训练的嵌入能够捕获词的语义,但它们与上下文无关,无法捕获上下文中的高级概念,如多义词消歧、句法结构、语义角色、照应等。

第二代PTM专注于学习上下文相关的词嵌入,如CoVe [13]、ELMo [14]、OpenAI GPT[15]和BERT[16]。下游任务仍需要这些学习到的编码器来表示上下文中的单词。此外,还提出了各种预训练任务,来学习用于不同目的的 PTM。

The contributions of this survey can be summarized as follows:

(1)、Comprehensive review. We provide a comprehensive review of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaption approaches, and applications.

(2)、New taxonomy. We propose a taxonomy of PTMs for NLP, which categorizes existing PTMs from four different perspectives: 1) representation type; 2) model architecture; 3) type of pre-training task; 4) extensions for specific types of scenarios.

(3)、Abundant resources. We collect abundant resources on PTMs, including open-source implementations of PTMs, visualization tools, corpora, and paper lists.

(4)、Future directions. We discuss and analyze the limitations of existing PTMs. Also, we suggest possible future research directions.

这项调查的贡献可概括如下:

(1)综合回顾。我们对NLP的PTMs进行了全面的回顾,包括背景知识、模型架构预训练任务、各种扩展、适应方法和应用

(2)、新分类。我们提出NLP的PTMs分类法,从四个不同的角度对现有的PTMs进行分类:1)表示类型;2)模型架构;3)预训练任务类型;4)针对特定场景类型的扩展。

(3)、提供丰富资源。我们收集了大量关于PTM的资源,包括PTM的开源实现、可视化工具语料库论文列表。

(4)未来发展方向。我们讨论和分析了现有的PTM的局限性。并提出了未来可能的研究方向

The rest of the survey is organized as follows. Section 2 outlines the background concepts and commonly used nota-tions of PTMs.

Section 3 gives a brief overview of PTMs and clarifies the categorization of PTMs.

Section 4 provides extensions of PTMs.

Section 5 discusses how to transfer the knowledge of PTMs to downstream tasks.

Section 6 gives the related resources on PTMs.

Section 7 presents a collection of applications across various NLP tasks.

Section 8 discusses the current challenges and suggests future directions.

Section 9 summarizes the paper.

调查的其余部分安排如下。

第2节概述了PTM的背景概念和常用符号。

第3节简要概述了PTM,并阐明了PTM的分类。

第4节提供了PTM的扩展。第5节讨论如何将PTM的知识转移到下游任务。

第6节给出了关于PTM的相关资源。

第7节介绍了跨各种NLP任务的应用程序集合。

第8节讨论了当前的挑战并提出了未来的方向。

第9节总结全文。

2、Background

2.1 Language Representation Learning语言表征学习

As suggested by Bengio et al. [17], a good representation should express general-purpose priors that are not task-specific but would be likely to be useful for a learning machine to solve AI-tasks. When it comes to language, a good representation should capture the implicit linguistic rules and common sense knowledge hiding in text data, such as lexical meanings, syntactic structures, semantic roles, and even pragmatics.

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. And each dimension of the vector has no corresponding sense, while the whole represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word dynamically changes according to the context it appears in.

正如Bengio等人[17]所建议的,一个好的表示应该表达通用的先验,这些先验不是特定任务的,但可能对学习机器解决人工智能任务有用。对于语言而言,一个好的表示应该捕捉隐藏在文本数据中的语言规则常识知识,如词汇意义句法结构语义角色,甚至语用学

分布式表示的核心思想是用低维实值向量来描述一段文本的意义。向量的每一个维度都没有相应的意义,而整体代表一个具体的概念。图1说明了NLP的通用神经结构。有两种词嵌入:非上下文嵌入上下文嵌入。它们之间的区别在于一个词的嵌入是否根据它出现的上下文动态变化

Non-contextual Embeddings非上下文嵌入【静态词嵌入

Non-contextual Embeddings

The first step of representing language is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) x in a vocabulary V, we map it to a vector e_x ∈ R^{D_e} with a lookup table E ∈ R^{D_e×|V|}, where D_e is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.

There are two main limitations to this kind of embeddings. The first issue is that the embeddings are static. The embedding for a word is always the same regardless of its context. Therefore, these non-contextual embeddings fail to model polysemous words. The second issue is the out-of-vocabulary problem. To tackle this problem, character-level word representations or sub-word representations are widely used in many NLP tasks, such as CharCNN [18], FastText [19] and Byte-Pair Encoding (BPE) [20].

Non-contextual Embeddings非上下文嵌入

表示语言的第一步是将离散的语言符号映射到分布式的嵌入空间中。形式上,对于词表V中的每个词(或子词)x,我们用查找表 E ∈ R^{D_e×|V|} 将其映射为向量 e_x ∈ R^{D_e},其中 D_e 是表示token嵌入维度的超参数。这些嵌入与其他模型参数一起,在具体任务的数据上训练得到。

这种嵌入有两个主要限制。第一个问题是嵌入是静态的:不管上下文如何,同一个词的嵌入都是一样的,因此这些非上下文嵌入无法对多义词建模。第二个问题是未登录词(out-of-vocabulary, OOV)问题。为了解决这个问题,字符级词表示或子词表示被广泛应用于许多NLP任务中,如CharCNN [18]、FastText[19]和字节对编码(BPE)[20]。
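备注:下面给出一个极简的查找表示意代码(基于PyTorch;词表、维度等均为作者注假设的玩具设置,并非论文的原始实现),用于直观说明"把离散符号映射为 e_x ∈ R^{D_e}",以及静态嵌入与上下文无关、存在OOV问题这两点。

```python
import torch
import torch.nn as nn

# 词表 V 与查找表 E ∈ R^{De×|V|}(PyTorch 习惯按 |V|×De 存储,每行对应一个 token)
vocab = {"[PAD]": 0, "[UNK]": 1, "我": 2, "喜欢": 3, "自然": 4, "语言": 5, "处理": 6}
De = 8                                     # 超参数:词嵌入维度(假设值)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=De)

tokens = ["我", "喜欢", "自然", "语言", "处理"]
ids = torch.tensor([vocab.get(t, vocab["[UNK]"]) for t in tokens])  # 未登录词回退到 [UNK]

vectors = embedding(ids)                   # 形状 [5, De];静态嵌入:同一个词的向量与上下文无关
print(vectors.shape)
```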

Contextual Embeddings上下文嵌入【动态词嵌入

Contextual Embeddings

To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts. Given a text x1, x2, · · · , xT where each token xt ∈ V is a word or sub-word, the contextual representation of xt depends on the whole text:

[h_1, h_2, · · · , h_T] = f_enc(x_1, x_2, · · · , x_T),

where f_enc(·) is a neural encoder, which is described in Section 2.2. h_t is called the contextual embedding or dynamical embedding of token x_t because of the contextual information included in it.

为了解决单词的多义性和上下文依赖性问题,我们需要区分单词在不同语境中的语义。给定一个文本 x1, x2, ···, xT,其中每个token x_t ∈ V 是一个词或子词,x_t 的上下文表示取决于整个文本,如上式所示。

其中 f_enc(·) 是神经编码器(在2.2节中描述);由于 h_t 中包含了上下文信息,它被称为token x_t 的上下文嵌入或动态嵌入。
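备注:为帮助理解 [h_1,…,h_T] = f_enc(x_1,…,x_T),下面用一个双向LSTM充当 f_enc 给出最小示意(PyTorch;所有维度均为作者注假设的玩具值,f_enc 也可以换成 Transformer 编码器等)。

```python
import torch
import torch.nn as nn

# 玩具设置(假设值):词表大小 20、嵌入维度 8、隐状态维度 16
embed = nn.Embedding(20, 8)
encoder = nn.LSTM(input_size=8, hidden_size=16, bidirectional=True, batch_first=True)

x = torch.randint(0, 20, (1, 6))        # 一个长度为 6 的 token id 序列 x_1..x_T
h, _ = encoder(embed(x))                # h: [1, 6, 32],h_t 即 token x_t 的上下文嵌入

# 同一个 token 出现在不同上下文中时,其 h_t 不同 —— 这正是"动态/上下文词嵌入"的含义
print(h.shape)
```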

Figure 1: Generic Neural Architecture for NLP

 

 Figure 2: Neural Contextual Encoders

2.2 Neural Contextual Encoders神经网络上下文编码器

Most of the neural contextual encoders can be classified into two categories: sequence models and non-sequence models. Figure 2 illustrates three representative architectures.

大多数神经网络上下文编码器可以分为两类:序列模型非序列模型。图2说明了三种代表性的架构。

2.2.1 Sequence Models序列模型——CNN、RNN

Sequence models usually capture local context of a word in sequential order.

序列模型通常按顺序捕获单词的局部上下文

Convolutional Models

Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors by convolution operations [2].

Recurrent Models

Recurrent models capture the contextual representations of words with short memory, such as LSTMs [21] and GRUs [22]. In practice, bi-directional LSTMs or GRUs are used to collect information from both sides of a word, but its performance is often affected by the long-term dependency problem.

卷积模型

卷积模型采用输入句子中的词嵌入,并通过卷积运算 [2] 聚合来自相邻词的局部信息捕获词的含义

循环模型

循环模型捕获具有短记忆的单词的上下文表示,如LSTMs[21]和GRU[22]。在实际应用中,双向LSTM或GRU用于从单词的两侧收集信息,但其性能往往受到长期依赖问题的影响。

2.2.2 Non-Sequence Models非序列模型——RecursNN、TreeLSTM、GCN、FCSA

Non-sequence models learn the contextual representation with a pre-defined tree or graph structure between words, such as the syntactic structure or semantic relation. Some popular non-sequence models include Recursive NN [6], TreeLSTM [7, 23], and GCN [24].

Although the linguistic-aware graph structure can provide useful inductive bias, how to build a good graph structure is also a challenging problem. Besides, the structure depends heavily on expert knowledge or external NLP tools, such as the dependency parser.

非序列模型通过预定义的词与词之间的树或图结构(如句法结构或语义关系)学习上下文表示。一些流行的非序列模型包括递归神经网络(Recursive NN)[6]、TreeLSTM[7,23]和GCN[24]。

虽然语言感知的图结构可以提供有用的归纳偏置,但是如何构建一个良好的图结构本身也是一个具有挑战性的问题。此外,这种结构在很大程度上依赖于专家知识或外部NLP工具,例如依存句法分析器。

Fully-Connected Self-Attention Model

In practice, a more straightforward way is to use a fully-connected graph to model the relation of every two words and let the model learn the structure by itself. Usually, the connection weights are dynamically computed by the self-attention mechanism, which implicitly indicates the connection between words. A successful instance of fully-connected self-attention model is the Transformer [10, 25], which also needs other supplement modules, such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers.

全连接自注意力模型

在实践中,更直接的方法是使用全连接图对每两个单词之间的关系进行建模,并让模型自己学习结构。通常,连接权重由自注意力机制动态计算,它隐式地表示单词之间的连接。全连接自注意力模型的一个成功实例是Transformer[10,25],它还需要其他配套模块,如位置嵌入、层归一化、残差连接和逐位置前馈网络(position-wise feed-forward network, FFN)层。
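备注:下面是缩放点积自注意力的一个最小实现示意(单头、省略位置嵌入/残差/LayerNorm,这些简化均为作者注的假设),用来说明"连接权重由自注意力机制动态计算、相当于在任意两个词之间建边"。

```python
import torch
import torch.nn.functional as F

def self_attention(H, Wq, Wk, Wv):
    """全连接自注意力:连接权重(注意力矩阵)由输入动态计算,而非预先定义的图结构。"""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # 任意两个词之间都有一条"边"
    attn = F.softmax(scores, dim=-1)                # 动态计算的连接权重
    return attn @ V, attn

T, d = 5, 16                                        # 假设:序列长度 5,隐藏维度 16
H = torch.randn(T, d)                               # 5 个词的输入表示
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out, attn = self_attention(H, Wq, Wk, Wv)
print(out.shape, attn.shape)                        # (5, 16) (5, 5):5×5 的全连接权重矩阵
```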

2.2.3 Analysis分析

Sequence models learn the contextual representation of the word with locality bias and are hard to capture the long-range interactions between words. Nevertheless, sequence models are usually easy to train and get good results for various NLP tasks.

In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency between every two words in a sequence, which is more powerful and suitable to model long range dependency of language. However, due to its heavy structure and less model bias, the Transformer usually requires a large training corpus and is easy to overfit on small or modestly-sized datasets [15, 26].

Currently, the Transformer has become the mainstream architecture of PTMs due to its powerful capacity.

序列模型学习单词的上下文表示带有局部偏差,很难捕捉单词之间的远程交互。然而,序列模型通常易于训练,并在各种NLP任务中都能获得良好的结果。

相比之下,Transformer作为实例化的全连接自注意力模型,可以直接对序列中每两个单词之间的依赖关系进行建模,功能更强大,更适合对语言的长期依赖关系进行建模。然而,由于Transformer的结构较重(庞大的结构),模型偏差较小,通常需要较大的训练语料库,并且容易在小型或中等规模的数据集上过拟合[15,26]。

目前,Transformer以其强大的能力成为 PTM 的主流架构。

2.3 Why Pre-training?为什么需要预训练——三大优势

With the development of deep learning, the number of model parameters has increased rapidly. A much larger dataset is needed to fully train model parameters and prevent overfitting. However, building large-scale labeled datasets is a great challenge for most NLP tasks due to the extremely expensive annotation costs, especially for syntax and semantically related tasks.

In contrast, large-scale unlabeled corpora are relatively easy to construct. To leverage the huge unlabeled text data, we can first learn a good representation from them and then use these representations for other tasks. Recent studies have demonstrated significant performance gains on many NLP tasks with the help of the representation extracted from the PTMs on the large unannotated corpora.

随着深度学习的发展,模型参数的数量迅速增加。需要更大的数据集充分训练模型参数防止过拟合。然而,构建大规模的标记数据集对于大多数NLP任务来说是一个巨大的挑战,因为注释成本非常昂贵,特别是对于语法和语义相关的任务。

相比之下,大规模的未标记语料库相对容易构建。要利用巨大的未标记文本数据,我们可以首先从它们学习良好的表示,然后将这些表示用于其他任务。最近的研究表明,借助从大型无注释语料库的PTM中提取的表示,在许多NLP任务中获得了显著的性能提升

The advantages of pre-training can be summarized as follows:

1、Pre-training on the huge text corpus can learn universal language representations and help with the downstream tasks.

2、Pre-training provides a better model initialization, which usually leads to a better generalization performance and speeds up convergence on the target task.

3、Pre-training can be regarded as a kind of regularization to avoid overfitting on small data [27].

预训练优势可以总结如下:

1、在庞大的文本语料库上进行预训练,可以学习通用语言表示,帮助完成下游任务。

2、预训练提供了更好的模型初始化,这通常会带来更好的泛化性能,并加快对目标任务的收敛

3、预训练可以看作是一种正则化避免在小数据上过拟合[27]。

2.4 A Brief History of PTMs for NLP—NLP 的 PTM 简史

Pre-training has always been an effective strategy to learn the parameters of deep neural networks, which are then fine-tuned on downstream tasks. As early as 2006, the breakthrough of deep learning came with greedy layer-wise unsupervised pre-training followed by supervised fine-tuning [28].

In CV, it has been in practice to pre-train models on the huge ImageNet corpus, and then fine-tune further on smaller data for different tasks. This is much better than a random initialization because the model learns general image features, which can then be used in various vision tasks.

In NLP, PTMs on large corpus have also been proved to be beneficial for the downstream NLP tasks, from the shallow word embedding to deep neural models.

预训练一直是学习深度神经网络参数有效策略,然后对下游任务进行微调。早在2006年,深度学习的突破就来自于贪婪的分层无监督预训练,然后是监督微调[28]。

在CV领域,在庞大的ImageNet语料库上预训练模型,然后在较小的数据上进一步微调以完成不同的任务已经在实践中得到了应用。这比随机初始化好得多,因为模型学习了一般的图像特征,然后可以在各种视觉任务中使用。

在NLP领域中,大型语料库上的PTMs也被证明有利于下游的NLP任务,从浅层词嵌入深层神经模型

2.4.1 First-Generation PTMs: Pre-trained Word Embeddings 第一代PTMs:预训练词嵌入

Representing words as dense vectors has a long history [29].

The “modern” word embedding is introduced in the pioneer work of the neural network language model (NNLM) [30]. Collobert et al. [31] showed that the pre-trained word embedding on the unlabelled data could significantly improve many NLP tasks. To address the computational complexity, they learned word embeddings with a pairwise ranking task instead of language modeling. Their work is the first attempt to obtain generic word embeddings useful for other tasks from unlabeled data.

Mikolov et al. [11] showed that there is no need for deep neural networks to build good word embeddings. They propose two shallow architectures: Continuous Bag-of-Words (CBOW) and Skip-Gram (SG) models. Despite their simplicity, they can still learn high-quality word embeddings to capture the latent syntactic and semantic similarities among words. Word2vec is one of the most popular implementations of these models and makes the pre-trained word embeddings accessible for different tasks in NLP. Besides, GloVe [12] is also a widely-used model for obtaining pre-trained word embeddings, which are computed by global word-word co-occurrence statistics from a large corpus.

密集向量表示单词有着悠久历史 [29]。

“现代”词嵌入是在神经网络语言模型(NNLM)[30]的开创性工作中引入的。Collobert et al.[31]表明,在未标记数据上预训练的词嵌入可以显著改善许多NLP任务。为了降低计算复杂度,他们通过成对排序任务而不是语言建模来学习词嵌入。他们的工作是第一次尝试从未标记的数据中获得对其他任务有用的通用词嵌入。

Mikolov等人[11]表明,不需要深度神经网络来构建良好的词嵌入。他们提出了两种浅层架构:连续词袋模型(Continuous Bag-of-Words, CBOW)和Skip-Gram模型(SG)。尽管它们很简单,但它们仍然可以学习高质量的词嵌入,以捕捉单词之间潜在句法语义相似性Word2vec是这些模型最流行的实现之一,它使预训练好的词嵌入可用于NLP中的不同任务。此外,GloVe[12]也是一种广泛使用的预训练词嵌入模型,它是通过从大型语料库中全局词-词共现统计来计算的。

Although pre-trained word embeddings have been shown effective in NLP tasks, they are context-independent and mostly trained by shallow models. When used on a downstream task, the rest of the whole model still needs to be learned from scratch.

During the same time period, many researchers also try to learn embeddings of paragraph, sentence or document, such as paragraph vector [32], Skip-thought vectors [33], Context2Vec [34]. Different from their modern successors, these sentence embedding models try to encode input sentences into a fixed-dimensional vector representation, rather than the contextual representation for each token.

尽管预训练的词嵌入在NLP任务中已被证明是有效的,但它们与上下文无关,并且大多由浅层模型训练。当用于下游任务时,整个模型的其余部分仍然需要从头开始学习。

在同一时期,许多研究者也尝试学习段落、句子或文档的嵌入,如段落向量(paragraph vector)[32]、Skip-thought向量[33]、Context2Vec[34]。与它们的现代后继者不同,这些句子嵌入模型试图将输入句子编码为固定维度的向量表示,而不是为每个token给出上下文表示。

2.4.2 Second-Generation PTMs: Pre-trained Contextual Encoders第二代PTMs:预训练的上下文编码器

Since most NLP tasks are beyond word-level, it is natural to pre-train the neural encoders on sentence-level or higher. The output vectors of neural encoders are also called contextual word embeddings since they represent the word semantics depending on its context.

由于大多数NLP任务都超出了单词级别,因此在句子级别或更高级别预训练神经编码器是很自然的。神经编码器的输出向量也被称为上下文词嵌入,因为它们表示依赖于上下文的词语义。

Dai and Le [35] proposed the first successful instance of PTM for NLP. They initialized LSTMs with a language model (LM) or a sequence autoencoder, and found the pre-training can improve the training and generalization of LSTMs in many text classification tasks.

Liu et al. [5] pre-trained a shared LSTM encoder with LM and fine-tuned it under the multi-task learning (MTL) framework. They found the pre-training and fine-tuning can further improve the performance of MTL for several text classification tasks.

Ramachandran et al. [36] found the Seq2Seq models can be significantly improved by unsupervised pre-training. The weights of both encoder and decoder are initialized with pre-trained weights of two language models and then fine-tuned with labeled data. Besides pre-training the contextual encoder with LM,

McCann et al.[13] pre-trained a deep LSTM encoder from an attentional sequence-to-sequence model with machine translation (MT). The context vectors (CoVe) output by the pre-trained encoder can improve the performance of a wide variety of common NLP tasks.

2015年,Dai和Le[35]提出了 NLP 的第一个成功的 PTM实例。他们使用语言模型(LM-LSTM)序列自编码器(SA-LSTM)对LSTMs进行初始化,发现预训练可以提高LSTM在许多文本分类任务中的训练和泛化能力

Liu等人[5]用LM预训练了一个共享LSTM编码器,并在多任务学习(MTL)框架下对其进行了微调。他们发现,预训练和微调可以进一步提高MTL在多个文本分类任务中的性能。

Ramachandran等人[36]发现 Seq2Seq 模型可以通过无监督预训练得到显着改善。编码器和解码器的权重均使用两种语言模型的预训练权重进行初始化,然后使用标记数据进行微调。除了使用LM预训练上下文编码器外,

McCann等人[13]还利用机器翻译(MT)任务,从一个带注意力的序列到序列(Seq2Seq)模型中预训练了深度LSTM编码器。预训练编码器输出的上下文向量(CoVe)可以提高各种常见NLP任务的性能。

Since these precursor PTMs, the modern PTMs are usually trained with larger scale corpora, more powerful or deeper architectures (e.g., Transformer), and new pre-training tasks.

Peters et al. [14] pre-trained 2-layer LSTM encoder with a bidirectional language model (BiLM), consisting of a forward LM and a backward LM. The contextual representations output by the pre-trained BiLM, ELMo (Embeddings from Language Models), are shown to bring large improvements on a broad range of NLP tasks.

Akbik et al. [37] captured word meaning with contextual string embeddings pre-trained with character-level LM. However, these two PTMs are usually used as a feature extractor to produce the contextual word embeddings, which are fed into the main model for downstream tasks. Their parameters are fixed, and the rest parameters of the main model are still trained from scratch. ULMFiT (Universal Language Model Fine-tuning) [38] attempted to fine-tune pre-trained LM for text classification (TC) and achieved state-of-the-art results on six widely-used TC datasets. ULMFiT consists of 3 phases:

1) 、pre-training LM on general-domain data;

2)、 fine-tuning LM on target data;

3)、 fine-tuning on the target task.

ULMFiT also investigates some effective fine-tuning strategies, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing.

自这些先驱PTM之后,现代PTM通常使用更大规模的语料库、更强大或更深的架构(例如Transformer)和新的预训练任务进行训练。

Peters等[14]预训练的2层LSTM编码器,具有双向语言模型(BiLM),由正向LM和反向LM组成。通过预训练的BiLM,ELMo(来自语言模型的嵌入)输出的上下文表示显示,在广泛的NLP任务上带来了巨大的改进

Akbik等人[37]用字符级LM预训练的上下文字符串嵌入捕获词义。但是,这两个PTM通常用作特征提取器来生成上下文词嵌入,这些词嵌入被馈送到用于下游任务的主模型中。它们的参数是固定的,主模型的其余参数仍然从头训练ULMFiT (通用语言模型微调)[38]尝试对预训练LM进行文本分类(TC)微调,并在六个广泛使用的TC数据集上取得了最先进的结果。ULMFiT由三个阶段组成:

1)、在通用域数据进行预训练LM;

2)、在目标数据进行微调LM;

3)、在目标任务进行微调。

ULMFiT还研究了一些有效的微调策略,包括判别式微调、倾斜三角学习率和逐渐解冻。
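备注:下面用PyTorch给出"判别式微调(逐层学习率)+ 逐渐解冻"的示意草图(模型结构、学习率等均为作者注假设的玩具设置,并非ULMFiT原始的AWD-LSTM实现;逐层除以2.6是ULMFiT论文中建议的经验值,斜三角学习率可另行用学习率调度器实现)。

```python
import torch
import torch.nn as nn

# 假设的玩具模型:3 层编码器 + 分类头
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(3)])
classifier = nn.Linear(32, 2)

# 判别式微调:越靠底层(越通用)的层使用越小的学习率,例如逐层除以 2.6
base_lr = 1e-3
param_groups = [{"params": classifier.parameters(), "lr": base_lr}]
for depth, layer in enumerate(list(layers)[::-1]):          # 从顶层到底层
    param_groups.append({"params": layer.parameters(), "lr": base_lr / (2.6 ** (depth + 1))})
optimizer = torch.optim.Adam(param_groups)

# 逐渐解冻:先冻结全部编码器层、只训练分类头,之后每个 epoch 再自顶向下多解冻一层
for p in layers.parameters():
    p.requires_grad = False
for epoch, layer in enumerate(list(layers)[::-1]):
    for p in layer.parameters():
        p.requires_grad = True                               # 第 epoch 轮解冻自顶向下的第 epoch+1 层
    # ...... 在此进行该 epoch 的训练 ......
```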

More recently, the very deep PTMs have shown their powerful ability in learning universal language representations:

e.g., OpenAI GPT (Generative Pre-training) [15] and BERT (Bidirectional Encoder Representation from Transformer) [16].

Besides LM, an increasing number of self-supervised tasks (see Section 3.1) are proposed to make the PTMs capture more knowledge from large-scale text corpora.

Since ULMFiT and BERT, fine-tuning has become the mainstream approach to adapt PTMs for the downstream tasks.

最近,非常深的PTM已经展示了它们在学习通用语言表示方面的强大能力:

例如,OpenAI GPT(生成预训练)[15]和BERT(来自Transformer的双向编码器表示)[16]。

除了LM之外,还提出了越来越多的自监督任务(参见章节3.1),以使PTM从大规模文本语料库中捕获更多的知识

自从ULMFiT BERT以来,微调已经成为为下游任务调整PTM的主流方法

3 Overview of PTMs—PTM的概述

The major differences between PTMs are the usages of contextual encoders, pre-training tasks, and purposes. We have briefly introduced the architectures of contextual encoders in Section 2.2. In this section, we focus on the description of pre-training tasks and give a taxonomy of PTMs.

PTM之间的主要区别在于上下文编码器的使用、预训练任务和目的。我们在2.2节中简要介绍了上下文编码器的架构。在本节中,我们将重点介绍预训练任务,并给出PTM的分类法。

3.1 Pre-training Tasks预训练任务

The pre-training tasks are crucial for learning the universal representation of language. Usually, these pre-training tasks should be challenging and have substantial training data. In this section, we summarize the pre-training tasks into three categories: supervised learning, unsupervised learning, and self-supervised learning.

1、Supervised learning (SL) is to learn a function that maps an input to an output based on training data consisting of input-output pairs.

2、Unsupervised learning (UL) is to find some intrinsic knowledge from unlabeled data, such as clusters, densities, latent representations.

3、Self-Supervised learning (SSL) is a blend of supervised learning and unsupervised learning. The learning paradigm of SSL is entirely the same as supervised learning, but the labels of training data are generated automatically. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the rest words.

预训练任务对于学习语言的通用表征是至关重要的。通常,这些预训练任务应该有挑战性且有大量的训练数据。在本节中,我们将预训练任务总结为三类监督学习无监督学习自监督学习

1、监督学习(Supervised learning, SL)是根据由输入-输出对组成的训练数据,学习一个将输入映射到输出的函数。

2、无监督学习(Unsupervised learning, UL)是指从未标记数据中发现一些内在知识,如聚类、密度、潜在表征等。

3、自监督学习(SSL)是监督学习无监督学习混合体。SSL的学习范式监督学习完全相同,只是训练数据的标签是自动生成的。SSL的关键思想是以某种形式,从其他部分预测输入的任何部分。例如,掩码/蒙面语言模型 (MLM) 是一项自监督的任务,它试图在给定其余单词的情况下预测句子中的掩码单词

In CV, many PTMs are trained on large supervised training sets like ImageNet. However, in NLP, the datasets of most supervised tasks are not large enough to train a good PTM. The only exception is machine translation (MT). A large-scale MT dataset, WMT 2017, consists of more than 7 million sentence pairs. Besides, MT is one of the most challenging tasks in NLP, and an encoder pre-trained on MT can benefit a variety of downstream NLP tasks. As a successful PTM, CoVe [13] is an encoder pre-trained on MT task and improves a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD).

In this section, we introduce some widely-used pre-training tasks in existing PTMs. We can regard these tasks as self-supervised learning. Table 1 also summarizes their loss functions.

CV领域,许多PTM在大型监督训练集(如ImageNet)上进行训练。但是NLP领域,大多数监督任务的数据集都不够大,无法训练出一个良好的PTM。唯一的例外是机器翻译(MT)。大规模的MT数据集WMT 2017由超过700万句对组成。此外,MT是NLP中最具挑战性的任务之一,预先在MT上训练的编码器可以受益于各种下游的NLP任务。作为一个成功的PTM, CoVe[13]是一个预训练MT任务的编码器,并改进了各种常见的NLP任务:情感分析(SST, IMDb),问题分类(TREC),蕴涵(SNLI)和问题回答(SQuAD)。

在本节中,我们将介绍一些现有PTM中广泛使用的预训练任务。我们可以把这些任务看作是自监督学习。表1还总结了它们的损失函数。

3.1.1 Language Modeling (LM)语言建模

The most common unsupervised task in NLP is probabilistic language modeling (LM), which is a classic probabilistic density estimation problem. Although LM is a general concept, in practice, LM often refers in particular to auto-regressive LM or unidirectional LM.

在NLP中最常见的无监督任务是概率语言建模(LM),这是一个经典的概率密度估计问题。虽然LM是一个一般概念,但在实践中,LM通常特指自回归LM单向LM

Given a text sequence x1:T = [x1, x2, · · · , xT], its joint probability p(x1:T) can be decomposed as

p(x_{1:T}) = ∏_{t=1}^{T} p(x_t | x_{0:t−1}),

where x_0 is a special token indicating the beginning of the sequence.

给定文本序列 x1:T = [x1, x2, ···, xT],其联合概率 p(x1:T) 可按上式分解,其中 x_0 是表示序列开始的特殊token。

The conditional probability p(x_t | x_{0:t−1}) can be modeled by a probability distribution over the vocabulary given linguistic context x_{0:t−1}. The context x_{0:t−1} is modeled by the neural encoder f_enc(·), and the conditional probability is

p(x_t | x_{0:t−1}) = g_LM(f_enc(x_{0:t−1})),

where g_LM(·) is a prediction layer.

条件概率 p(x_t | x_{0:t−1}) 可以用给定语言上下文 x_{0:t−1} 时词表上的概率分布来建模。上下文 x_{0:t−1} 由神经编码器 f_enc(·) 建模,条件概率如上式所示,其中 g_LM(·) 为预测层。

Given a huge corpus, we can train the entire network with maximum likelihood estimation (MLE).

A drawback of unidirectional LM is that the representation of each token encodes only the leftward context tokens and itself. However, better contextual representations of text should encode contextual information from both directions. An improved solution is bidirectional LM (BiLM), which consists of two unidirectional LMs: a forward left-to-right LM and a backward right-to-left LM. For BiLM, Baevski et al. [39] proposed a two-tower model in which the forward tower operates the left-to-right LM and the backward tower operates the right-to-left LM.

给定一个巨大的语料库,我们可以用最大似然估计(MLE)训练整个网络

单向LM的一个缺点是每个标记的表示仅对左向上下文标记及其自身进行编码。然而,更好的文本上下文表示应该从两个方向编码上下文信息。一种改进的解决方案是双向LM (BiLM),它由两个单向LM组成:一个向前的从左到右LM和一个向后的从右到左LM。对于BiLM, Baevski et al.[39]提出了一个双塔模型,前向塔运行从左到右的LM,后向塔运行从右到左的LM。
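备注:下面是单向(自回归)LM最大似然训练的一个极简PyTorch草图(网络结构与维度均为作者注假设的玩具设置):把联合概率按链式法则分解后,逐位置用交叉熵最小化 −log p(x_t | x_{<t})。

```python
import torch
import torch.nn.functional as F

# 玩具设置(均为假设):词表大小 |V|=100,序列长度 T=6,隐藏维度 d=32
V, T, d = 100, 6, 32
x = torch.randint(1, V, (1, T))                       # x_1..x_T
embed = torch.nn.Embedding(V, d)
lstm = torch.nn.LSTM(d, d, batch_first=True)          # 单向编码器 f_enc,只能看到左侧上下文
head = torch.nn.Linear(d, V)                          # 预测层 g_LM:输出词表上的分布

h, _ = lstm(embed(x[:, :-1]))                         # 用 x_1..x_{T-1} 预测 x_2..x_T
logits = head(h)                                      # 对应 p(x_t | x_{<t})
loss = F.cross_entropy(logits.reshape(-1, V), x[:, 1:].reshape(-1))  # 负对数似然,即 MLE
print(loss.item())
```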

 Table 1: Loss Functions of Pre-training Tasks

3.1.2 Masked Language Modeling (MLM)掩码语言建模

Masked language modeling (MLM) is first proposed by Taylor [40] in the literature, who referred to this as a Cloze task. Devlin et al. [16] adapted this task as a novel pre-training task to overcome the drawback of the standard unidirectional LM. Loosely speaking, MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens by the rest of the tokens. However, this pre-training method will create a mismatch between the pre-training phase and the fine-tuning phase because the mask token does not appear during the fine-tuning phase. Empirically, to deal with this issue, Devlin et al. [16] used a special [MASK] token 80% of the time, a random token 10% of the time and the original token 10% of the time to perform masking.

掩码语言建模(MLM)由Taylor[40]在文献中首次提出,他将其称为完形填空(Cloze)任务。Devlin et al.[16]将该任务用作一种新的预训练任务,以克服标准单向LM的缺点。简单地说,MLM首先从输入句子中屏蔽掉一些token,然后训练模型根据剩余的token预测被屏蔽的token。但是,这种预训练方法会在预训练阶段和微调阶段之间产生不匹配,因为[MASK]token不会出现在微调阶段。根据经验,为了解决这个问题,Devlin等人在80%的时间里使用特殊的[MASK]token,10%的时间里使用随机token,10%的时间里保留原始token来执行屏蔽。
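备注:下面用纯Python演示BERT式的80%/10%/10%掩码策略(其中15%的掩码比例与80/10/10的拆分来自原文,示例词表等其余设置均为作者注假设的玩具数据),模型随后只在被选中的位置上预测原始token。

```python
import random

def bert_style_mask(tokens, vocab, mask_rate=0.15, mask_token="[MASK]"):
    """对约 15% 的 token 做掩码:其中 80% 替换为 [MASK],10% 替换为随机词,10% 保持原词不变。"""
    inputs, labels = list(tokens), [None] * len(tokens)   # labels 为 None 表示该位置不计损失
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            labels[i] = tok                                # 需要被预测的原始 token
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token
            elif r < 0.9:
                inputs[i] = random.choice(vocab)
            # 否则保持原 token 不变
    return inputs, labels

random.seed(0)
vocab = ["我", "喜欢", "自然", "语言", "处理", "模型"]
print(bert_style_mask(["我", "喜欢", "自然", "语言", "处理"], vocab))
```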

Sequence-to-Sequence MLM (Seq2Seq MLM)

MLM is usually solved as classification problem. We feed the masked sequences to a neural encoder whose output vectors are further fed into a softmax classifier to predict the masked token.

Alternatively, we can use encoder-decoder (aka. sequence-to-sequence) architecture for MLM, in which the encoder is fed a masked sequence, and the decoder sequentially produces the masked tokens in auto-regression fashion. We refer to this kind of MLM as sequence-to-sequence MLM (Seq2Seq MLM), which is used in MASS [41] and T5 [42]. Seq2Seq MLM can benefit the Seq2Seq-style downstream tasks, such as question answering, summarization, and machine translation.

序列对序列的MLM(Seq2Seq  MLM)

MLM通常作为分类问题来解决。我们将掩码序列输入一个神经编码器,其输出向量进一步提供给 softmax分类器来预测掩码token。

或者,我们可以使用编码器-解码器(又名序列到序列Seq2Seq)架构,其中编码器被馈送一个掩码序列,解码器以自回归的方式依次产生掩码token。我们将这种MLM称为Seq2Seq的MLM(Seq2Seq MLM),在MASS[41]和T5[42]中使用。Seq2Seq MLM可以使Seq2Seq风格的下游任务受益,例如问题回答、摘要和机器翻译。

Enhanced Masked Language Modeling (E-MLM)

Concurrently, there are multiple studies proposing different enhanced versions of MLM to further improve on BERT. Instead of static masking, RoBERTa [43] improves BERT by dynamic masking.

UniLM [44, 45] extends the task of mask prediction on three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM [46] performs MLM on a concatenation of parallel bilingual sentence pairs, called Translation Language Modeling (TLM). SpanBERT [47] replaces MLM with Random Contiguous Words Masking and Span Boundary Objective (SBO) to integrate structure information into pre-training, which requires the system to predict masked spans based on span boundaries. Besides, StructBERT [48] introduces the Span Order Recovery task to further incorporate language structures.

Another way to enrich MLM is to incorporate external knowledge (see Section 4.1).

增强型屏蔽语言建模(E-MLM)

目前,有多个研究提出了不同的增强型MLM模型,以进一步改进BERT。RoBERTa[43]通过动态掩蔽来改善BERT,而不是静态掩蔽。

UniLM[44,45]将屏蔽预测任务扩展到三种类型的语言建模任务上:单向双向Seq2Seq预测XLM[46]在并行双语句子对的串联上执行MLM,称为翻译语言建模(TLM)。SpanBERT[47]用随机连续词掩码跨度边界目标 (SBO) 代替 MLM,将结构信息整合到预训练中,这需要系统基于跨度边界预测掩码跨度。此外,StructBERT[48]还引入了Span Order Recovery跨度顺序恢复任务来进一步整合语言结构

丰富MLM的另一种方法是整合外部知识(见章节4.1)。

3.1.3 Permuted Language Modeling (PLM)置换语言建模

Despite the wide use of the MLM task in pre-training, Yang et al. [49] claimed that some special tokens used in the pre-training of MLM, like [MASK], are absent when the model is applied on downstream tasks, leading to a gap between pre-training and fine-tuning. To overcome this issue, Permuted Language Modeling (PLM) [49] is a pre-training objective to replace MLM. In short, PLM is a language modeling task on a random permutation of input sequences. A permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as the target, and the model is trained to predict these targets, depending on the rest of the tokens and the natural positions of targets. Note that this permutation does not affect the natural positions of sequences and only defines the order of token predictions. In practice, only the last few tokens in the permuted sequences are predicted, due to the slow convergence. And a special two-stream self-attention is introduced for target-aware representations.

尽管MLM预训练任务被广泛使用,但Yang et al.[49]指出,MLM预训练中使用的一些特殊token(如[MASK])在模型应用于下游任务时并不存在,导致预训练与微调之间存在差距。为了克服这一问题,置换语言建模(PLM)[49]被提出作为替代MLM的预训练目标。简而言之,PLM是一种对输入序列的随机排列进行语言建模的任务:先从所有可能的排列中随机抽取一个排列,然后选择置换序列中的一部分token作为目标,并训练模型根据其余token和目标的自然位置来预测这些目标。注意,这种排列不影响序列的自然位置,只定义token预测的顺序。在实践中,由于收敛速度较慢,仅预测置换序列中的最后几个token。此外,还针对目标感知表示引入了一种特殊的双流自注意力机制。
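备注:下面是PLM中"随机采样一个排列 + 只把排列末尾的少量token作为预测目标"这一目标选取过程的纯Python示意(predict_ratio 等数值为作者注的假设;双流自注意力等实现细节此处省略)。

```python
import random

def plm_targets(tokens, predict_ratio=1/6):
    """随机采样一个排列,只把排列中最后一部分 token 作为预测目标,其余 token 作为可见上下文。"""
    order = list(range(len(tokens)))
    random.shuffle(order)                          # 随机排列:只定义预测顺序,不改变 token 的真实位置
    k = max(1, int(len(tokens) * predict_ratio))   # 由于收敛慢,实践中只预测排列末尾的少量 token
    targets = order[-k:]                           # 目标位置
    context = order[:-k]                           # 预测时可见的 token 位置
    return context, targets

random.seed(0)
tokens = ["New", "York", "is", "a", "city", "."]
print(plm_targets(tokens))
```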

3.1.4 Denoising Autoencoder (DAE)降噪自动编码器

Denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input. Specific to language, a sequence-to-sequence model, such as the standard Transformer, is used to reconstruct the original text. There are several ways to corrupt text [50]: 

(1)、Token Masking: Randomly sampling tokens from the input and replacing them with [MASK] elements.

(2)、Token Deletion: Randomly deleting tokens from the input. Different from token masking, the model needs to decide the positions of missing inputs.

(3)、Text Infilling: Like SpanBERT, a number of text spans are sampled and replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model needs to predict how many tokens are missing from a span.

(4)、Sentence Permutation: Dividing a document into sentences based on full stops and shuffling these sentences in random order.

(5)、Document Rotation: Selecting a token uniformly at random and rotating the document so that it begins with that token. The model needs to identify the real start position of the document.

降噪自动编码器(DAE)采用部分损坏的输入,目的是恢复原始的未失真的输入。具体到语言,使用Seq2Seq模型(如标准Transformer)来重构原文。有几种方法可以破坏文本[50]:

(1)Token Masking掩码:从输入中随机采样token,并用[MASK]元素替换它们。

(2)、Token Deletion删除:从输入中随机删除token。与token掩码不同,该模型需要确定缺失输入的位置

(3)、Text Infilling文本填充:与 SpanBERT 一样,对多个文本跨度进行采样,并用单个 [MASK] token替换。每个跨度的长度取自泊松分布(λ = 3)。该模型需要预测每个跨度中丢失了多少个token。

(4)、Sentence Permutation语句排序:根据句点将文档分成句子,并将这些句子随机排列

(5)、Document Rotation文档旋转:随机均匀地选择一个token,并旋转文档,使其从该token开始。该模型需要识别文档的真正起始位置
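备注:下面用纯Python演示其中几种文本破坏操作(Token掩码/删除、语句排列、文档旋转;比例与示例文本均为作者注的假设,文本填充的泊松跨度采样此处略去)。DAE的训练目标是从这些被破坏的输入中还原原文。

```python
import random

def token_masking(tokens, rate=0.15):
    return [("[MASK]" if random.random() < rate else t) for t in tokens]

def token_deletion(tokens, rate=0.15):
    return [t for t in tokens if random.random() >= rate]   # 模型需自行判断缺失位置

def sentence_permutation(document):
    sentences = [s for s in document.split("。") if s]       # 按句号切分(此处以中文句号演示)
    random.shuffle(sentences)
    return "。".join(sentences) + "。"

def document_rotation(tokens):
    start = random.randrange(len(tokens))                    # 随机选一个 token 作为新的开头
    return tokens[start:] + tokens[:start]

random.seed(0)
toks = ["预", "训", "练", "模", "型", "综", "述"]
print(token_masking(toks), token_deletion(toks), document_rotation(toks), sep="\n")
print(sentence_permutation("第一句。第二句。第三句。"))
```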

3.1.5 Contrastive Learning (CTL)对比学习

Contrastive learning [51] assumes some observed pairs of text that are more semantically similar than randomly sampled text. A score function s(x, y) for text pair (x, y) is learned to minimize the objective function:

E_{x, y+, y−} [ −log ( exp(s(x, y+)) / ( exp(s(x, y+)) + exp(s(x, y−)) ) ) ],

where (x, y+) is a similar pair and y− is presumably dissimilar to x. y+ and y− are typically called the positive and negative sample. The score function s(x, y) is often computed by a learnable neural encoder in two ways:

s(x, y) = f_enc(x)^⊤ f_enc(y)   or   s(x, y) = f_enc(x ⊕ y).

对比学习[51]假设某些观察到的文本对在语义上比随机采样的文本更相似。通过学习文本对 (x, y) 的评分函数 s(x, y) 来最小化目标函数:

E_{x, y+, y−} [ −log ( exp(s(x, y+)) / ( exp(s(x, y+)) + exp(s(x, y−)) ) ) ]

其中 (x, y+) 是相似的一对,y− 则被认为与 x 不相似;y+ 和 y− 通常分别称为正样本和负样本。评分函数 s(x, y) 通常由可学习的神经编码器以两种方式计算:

s(x, y) = f_enc(x)^⊤ f_enc(y) 或 s(x, y) = f_enc(x ⊕ y)。
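备注:下面用PyTorch把上面的对比目标写成可运行的小函数(编码器输出用随机向量代替、维度为作者注假设的玩具值),展示"让正样本对得分高于负样本对"这一思想。

```python
import torch
import torch.nn.functional as F

def contrastive_loss(fx, fy_pos, fy_neg):
    """以 s(x,y)=f_enc(x)^T f_enc(y) 为评分函数的对比损失:希望正样本得分高于负样本。"""
    s_pos = (fx * fy_pos).sum(-1)                      # s(x, y+)
    s_neg = (fx * fy_neg).sum(-1)                      # s(x, y-)
    # -log exp(s_pos) / (exp(s_pos) + exp(s_neg)):等价于对 (s_pos, s_neg) 做二类 softmax 交叉熵
    return -F.log_softmax(torch.stack([s_pos, s_neg], dim=-1), dim=-1)[..., 0].mean()

d = 16                                                 # 假设的表示维度
fx, fy_pos, fy_neg = torch.randn(4, d), torch.randn(4, d), torch.randn(4, d)
print(contrastive_loss(fx, fy_pos, fy_neg).item())
```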

The idea behind CTL is “learning by comparison”. Compared to LM, CTL usually has less computational complexity and therefore is a desirable alternative training criterion for PTMs.

Collobert et al. [31] proposed a pairwise ranking task to distinguish real and fake phrases. The model needs to predict a higher score for a legal phrase than an incorrect phrase obtained by replacing its central word with a random word.

Mnih and Kavukcuoglu [52] trained word embeddings efficiently with Noise-Contrastive Estimation (NCE) [53], which trains a binary classifier to distinguish real and fake samples. The idea of NCE is also used in the well-known word2vec embedding [11].

We briefly describe some recently proposed CTL tasks in the following paragraphs.

CTL背后的理念是“通过比较学习”。与LM相比,CTL通常具有较小的计算复杂度,因此是PTM的理想替代训练标准

Collobert et al.[31]提出了成对排序任务来区分真假短语。该模型需要为合法短语预测出比不正确短语(将其中心词替换为随机词得到)更高的分数。

Mnih和Kavukcuoglu[52]使用噪声对比估计(NCE)[53]有效地训练词嵌入,它训练一个二元分类器来区分真假样本。NCE的思想也被用在众所周知的word2vec嵌入[11]中。

我们将在以下段落中简要介绍一些最近提出的CTL任务

Deep InfoMax (DIM) 

Deep InfoMax (DIM) [54] is originally proposed for images, which improves the quality of the representation by maximizing the mutual information between an image representation and local regions of the image.

Kong et al. [55] applied DIM to language representation learning. The global representation of a sequence x is defined to be the hidden state of the first token (assumed to be a special start of sentence symbol) output by contextual encoder f_enc(x). The objective of DIM is to assign a higher score for f_enc(x_{i:j})^⊤ f_enc(x̂_{i:j}) than f_enc(x̃_{i:j})^⊤ f_enc(x̂_{i:j}), where x_{i:j} denotes an n-gram spanning from i to j in x, x̂_{i:j} denotes a sentence masked at position i to j, and x̃_{i:j} denotes a randomly-sampled negative n-gram from the corpus.

Deep InfoMax (DIM)

Deep InfoMax (DIM)[54]最初是针对图像提出的,它通过最大化图像表示与图像局部区域之间的互信息来提高表示的质量。

Kong et al.[55]将DIM应用于语言表征学习。序列 x 的全局表示被定义为上下文编码器 f_enc(x) 输出的第一个token(假定为特殊的句子起始符)的隐藏状态。DIM的目标是让 f_enc(x_{i:j})^⊤ f_enc(x̂_{i:j}) 的得分高于 f_enc(x̃_{i:j})^⊤ f_enc(x̂_{i:j}),其中 x_{i:j} 表示 x 中从 i 到 j 的一个 n-gram 跨度,x̂_{i:j} 表示在位置 i 到 j 处被屏蔽的句子,x̃_{i:j} 表示从语料库中随机采样的负 n-gram。

Replaced Token Detection (RTD)

Replaced Token Detection (RTD) is the same as NCE but predicts whether a token is replaced given its surrounding context.

CBOW with negative sampling (CBOW-NS) [11] can be viewed as a simple version of RTD, in which the negative samples are randomly sampled from vocabulary with simple proposal distribution.

ELECTRA [56] improves RTD by utilizing a generator to replace some tokens of a sequence. A generator G and a discriminator D are trained following a two-stage procedure:

(1) Train only the generator with MLM task for n1 steps;

(2) Initialize the weights of the discriminator with the weights of the generator. Then train the discriminator with a discriminative task for n2 steps, keeping G frozen. Here the discriminative task indicates justifying whether the input token has been replaced by G or not. The generator is thrown away after pre-training, and only the discriminator will be fine-tuned on downstream tasks.

RTD is also an alternative solution for the mismatch problem. The network sees [MASK] during pre-training but not when being fine-tuned in downstream tasks.

Similarly, WKLM [57] replaces words on the entity-level instead of token-level. Concretely, WKLM replaces entity mentions with names of other entities of the same type and trains the models to distinguish whether the entity has been replaced.

替换token检测(RTD)

替换token检测(RTD)与NCE相同,但根据周围的上下文预测token是否被替换

负抽样CBOW (CBOW-NS)[11]可以看作是RTD的一个简单版本,其中负样本是从具有简单建议分布的词汇表中随机抽取的。

ELECTRA[56]通过使用生成器替换序列的一些token来改进RTD。生成器G和鉴别器D的训练遵循两阶段的过程:

(1)、仅训练带有MLM任务的生成器n1步;

(2)、用生成器的权重初始化鉴别器的权重。然后用判别任务训练鉴别器n2步,保持G冻结。这里的判别任务是指判断输入token是否已被G替换。生成器在预训练后被丢弃,只有鉴别器会在下游任务上进行微调。

RTD也是失配问题的另一种解决方案。网络在预训练期间看到[MASK],但在下游任务中进行微调时看不到

类似地,WKLM[57]在实体级而不是token级替换单词。具体来说,WKLM将实体提及替换为其他同类型实体的名称,并训练模型区分实体是否被替换。
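备注:下面给出RTD判别阶段的最小示意(句子、替换位置与logit均为作者注假设的玩具数据;ELECTRA中被替换的token由一同训练的小型MLM生成器采样得到,这里直接手工给出),判别器对每个token做"是否被替换"的二分类。

```python
import torch
import torch.nn.functional as F

# 假设的玩具数据:原句与"生成器改写后"的句子
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate",     "the", "meal"]      # "cooked" 被替换成了 "ate"
labels = torch.tensor([float(o != c) for o, c in zip(original, corrupted)])  # 1 = 该位置被替换

# 判别器对每个 token 输出一个"是否被替换"的 logit(这里用随机 logit 代替真实模型输出)
logits = torch.randn(len(corrupted))
rtd_loss = F.binary_cross_entropy_with_logits(logits, labels)   # 逐 token 的二分类损失
print(labels, rtd_loss.item())
```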

Next Sentence Prediction (NSP)

Punctuations are the natural separators of text data. So, it is reasonable to construct pre-training methods by utilizing them. Next Sentence Prediction (NSP) [16] is just a great example of this. As its name suggests, NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus. Specifically, when choosing the sentences pair for each pre-training example, 50% of the time, the second sentence is the actual next sentence of the first one, and 50% of the time, it is a random sentence from the corpus. By doing so, it is capable to teach the model to understand the relationship between two input sentences and thus benefit downstream tasks that are sensitive to this information, such as Question Answering and Natural Language Inference.

However, the necessity of the NSP task has been questioned by subsequent work [47, 49, 43, 63]. Yang et al. [49] found the impact of the NSP task unreliable, while Joshi et al. [47] found that single-sentence training without the NSP loss is superior to sentence-pair training with the NSP loss. Moreover, Liu et al. [43] conducted a further analysis for the NSP task, which shows that when training with blocks of text from a single document, removing the NSP loss matches or slightly improves performance on downstream tasks.

NSP下一句预测任务

标点符号是文本数据的天然分隔符,因此利用它们来构建预训练任务是合理的,NSP[16]就是一个很好的例子。顾名思义,NSP训练模型区分两个输入句子是否是来自训练语料库的连续片段。具体来说,在为每个预训练示例选择句子对时,50%的情况下第二句是第一句的真实下一句,50%的情况下它是语料库中的随机句子。通过这样做,可以教会模型理解两个输入句子之间的关系,从而有利于对这类信息敏感的下游任务,如问答和自然语言推断。

然而,NSP任务的必要性受到了后续工作的质疑[47,49,43,63]。Yang et al.[49]发现NSP任务的影响不可靠,而Joshi et al.[47]发现没有NSP损失的单句训练优于有NSP损失的句子对训练。此外,Liu et al.[43]对NSP任务进行了进一步分析,结果表明,当使用来自单个文档的文本块进行训练时,去除NSP损失后下游任务的性能持平或略有提高。

Sentence Order Prediction (SOP)

To better model inter-sentence coherence, ALBERT [63] replaces the NSP loss with a sentence order prediction (SOP) loss. As conjectured in Lan et al. [63], NSP conflates topic prediction and coherence prediction in a single task. Thus, the model is allowed to make predictions merely rely on the easier task, topic prediction. Different from NSP, SOP uses two consecutive segments from the same document as positive examples, and the same two consecutive segments but with their order swapped as negative examples. As a result, ALBERT consistently outperforms BERT on various downstream tasks.

StructBERT [48] and BERTje [88] also take SOP as their self-supervised learning task.

SOP句子顺序预测任务

为了更好地建模句子间的连贯性,ALBERT[63]用句子顺序预测(SOP)损失代替了NSP损失。正如Lan等人[63]推测的那样,NSP将主题预测和连贯性预测合并在一个任务中,因此模型可以仅依靠更简单的主题预测来完成任务。与NSP不同,SOP使用同一文档中的两个连续片段作为正例,并把同样的两个连续片段交换顺序后作为负例。因此,ALBERT在各种下游任务上始终优于BERT。

StructBERT[48]和BERTje[88]也将SOP作为他们的自监督学习任务。
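备注:下面用纯Python对比NSP与SOP训练样本的构造方式(示例句子均为作者注假设的玩具数据):NSP的负例来自语料库中的随机句子,SOP的负例则是同一对连续片段交换顺序。

```python
import random

def make_nsp_pair(doc_sentences, corpus_sentences):
    """NSP:50% 取真实的下一句作为正例,50% 从语料库随机取一句作为负例。"""
    i = random.randrange(len(doc_sentences) - 1)
    first = doc_sentences[i]
    if random.random() < 0.5:
        return first, doc_sentences[i + 1], 1          # IsNext
    return first, random.choice(corpus_sentences), 0   # NotNext

def make_sop_pair(doc_sentences):
    """SOP:正例为同一文档中的两个连续片段,负例为同样的两段但交换顺序。"""
    i = random.randrange(len(doc_sentences) - 1)
    a, b = doc_sentences[i], doc_sentences[i + 1]
    return (a, b, 1) if random.random() < 0.5 else (b, a, 0)

random.seed(0)
doc = ["句子A", "句子B", "句子C"]
corpus = ["其他文档的句子X", "其他文档的句子Y"]
print(make_nsp_pair(doc, corpus))
print(make_sop_pair(doc))
```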

3.1.6 Others

Apart from the above tasks, there are many other auxiliary pre-training tasks designated to incorporate factual knowledge (see Section 4.1), improve cross-lingual tasks (see Section 4.2), multi-modal applications (see Section 4.3), or other specific tasks (see Section 4.4).

除了上述任务外,还有许多其他辅助的预训练任务,用于整合事实知识(见章节4.1)、提高跨语言任务(见章节4.2)、多模态应用(见章节4.3)或其他特定任务(见章节4.4)。

3.2 Taxonomy of PTMs

To clarify the relations of existing PTMs for NLP, we build the taxonomy of PTMs, which categorizes existing PTMs from four different perspectives:

1、Representation Type: According to the representation used for downstream tasks, we can divide PTMs into non-contextual and contextual models.

2、Architectures: The backbone network used by PTMs, including LSTM, Transformer encoder, Transformer decoder, and the full Transformer architecture. “Transformer” means the standard encoder-decoder architecture. “Transformer encoder” and “Transformer decoder” mean the encoder and decoder part of the standard Transformer architecture, respectively. Their difference is that the decoder part uses masked self-attention with a triangular matrix to prevent tokens from attending their future (right) positions.

3、Pre-Training Task Types: The type of pre-training tasks used by PTMs. We have discussed them in Section 3.1.

4、Extensions: PTMs designed for various scenarios, including knowledge-enriched PTMs, multilingual or language-specific PTMs, multi-modal PTMs, domain-specific PTMs and compressed PTMs. We will particularly introduce these extensions in Section 4.

为了理清NLP现有PTM之间的关系,我们构建了PTM的分类法,从四个不同的角度对现有PTM进行分类:

1、表征类型根据下游任务所使用的表征,我们可以将PTM分为非上下文模型上下文模型

2、架构PTM使用的骨干网络,包括LSTMTransformer编码器Transformer解码器完整的Transformer架构

“Transformer”是指标准的编码器-解码器架构。“Transformer 编码器”和“Transformer 解码器”分别指标准 Transformer 架构的编码器和解码器部分。它们的不同之处在于,解码器部分使用带有三角矩阵的屏蔽自注意力,防止token关注其未来(右侧)的位置(见本列表后的示意代码)。

3、预训练任务类型PTM使用的预训练任务类型。我们已经在3.1节中讨论过它们。

4、扩展为各种场景设计的PTM,包括知识丰富的PTM、多语言或特定语言的PTM、多模型的PTM、特定领域的PTM和压缩的PTM。我们将在第4节中详细介绍这些扩展。

Figure 3 shows the taxonomy as well as some correspond-ing representative PTMs. Besides, Table 2 distinguishes some representative PTMs in more detail.

图3显示了分类法以及一些相应的代表性PTM。此外,表2更详细地区分了一些有代表性的PTM。

3.3 Model Analysis模型分析

Due to the great success of PTMs, it is important to understand what kinds of knowledge are captured by them, and how to induce knowledge from them. There is a wide range of literature analyzing linguistic knowledge and world knowledge stored in pre-trained non-contextual and contextual embeddings.

由于PTM的巨大成功,了解它们捕获了哪些类型的知识,以及如何从它们中归纳知识是很重要的。有大量文献分析了存储在预训练的非上下文嵌入上下文嵌入中的语言知识和世界知识。

PTM捕获的两种类型知识:Linguistic Knowledge语言知识/语言学知识(2种)、World Knowledge世界知识/知识库知识(4种)
https://www.processon.com/mindmap/5b824880e4b0d4d65be6f667

3.3.1 Non-Contextual Embeddings非上下文嵌入

Static word embeddings are first probed for kinds of knowledge. Mikolov et al. [117] found that word representations learned by neural network language models are able to capture linguistic regularities in language, and the relationship between words can be characterized by a relation-specific vector offset. Further analogy experiments [11] demonstrated that word vectors produced by skip-gram model can capture both syntactic and semantic word relationships, such as vec(“China”) − vec(“Beijing”) ≈ vec(“Japan”) − vec(“Tokyo”). Besides, they find compositionality property of word vectors, for example, vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”). Inspired by these work, Rubinstein et al. [118] found that distributional word representations are good at predicting taxonomic properties (e.g., dog is an animal) but fail to learn attributive properties (e.g., swan is white). Similarly, Gupta et al. [119] showed that word2vec embeddings implicitly encode referential attributes of entities. The distributed word vectors, along with a simple supervised model, can learn to predict numeric and binary attributes of entities with a reasonable degree of accuracy.

静态词嵌入中包含哪些知识首先得到了探究。Mikolov等[117]发现神经网络语言模型学习的词表示能够捕捉语言中的语言规律,并且词与词之间的关系可以用特定关系的向量偏移来表征。进一步的类比实验[11]表明,skip-gram模型产生的词向量可以同时捕获句法和语义层面的词间关系,如 vec(“China”) − vec(“Beijing”) ≈ vec(“Japan”) − vec(“Tokyo”)。此外,他们还发现了词向量的组合性,例如 vec(“Germany”) + vec(“capital”) 接近于 vec(“Berlin”)。受这些工作的启发,Rubinstein等人[118]发现,分布式词表示擅长预测分类学属性(例如,狗是一种动物),但无法学习定语属性(例如,天鹅是白色的)。类似地,Gupta等人[119]表明word2vec嵌入隐式地编码了实体的指称属性:分布式词向量配合一个简单的监督模型,能够以合理的准确度预测实体的数值和二元属性。
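备注:上述词向量类比可以用下面的纯Python/numpy玩具示例直观复现(向量均为作者注手工构造的假设值,真实实验需在训练好的 word2vec/GloVe 向量上进行)。

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# 玩具向量:仅为说明"关系 ≈ 向量偏移"这一性质,并非真实训练得到的词向量
vec = {
    "China":   np.array([1.0, 0.1, 0.0]), "Beijing": np.array([1.0, 0.1, 1.0]),
    "Japan":   np.array([0.1, 1.0, 0.0]), "Tokyo":   np.array([0.1, 1.0, 1.0]),
}
query = vec["China"] - vec["Beijing"] + vec["Tokyo"]     # "首都→国家"这一关系的向量运算
best = max((w for w in vec if w != "Tokyo"), key=lambda w: cosine(query, vec[w]))
print(best)                                              # 期望输出 "Japan"
```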

Figure 3: Taxonomy of PTMs with Representative Examples

Table 2: List of Representative PTMs有代表性的 PTMs 及其架构

“Transformer Enc.” and “Transformer Dec.” mean the encoder and decoder part of the standard Transformer architecture respectively. Their difference is that the decoder part uses masked self-attention with a triangular matrix to prevent tokens from attending their future (right) positions. “Transformer” means the standard encoder-decoder architecture.
the averaged score on 9 tasks of the GLUE benchmark (see Section 7.1).
without the WNLI task.
indicates an ensemble result.
means whether the model is usually used in a fine-tuning fashion.
The MLM of UniLM is built on three versions of LMs: Unidirectional LM, Bidirectional LM, and Sequence-to-Sequence LM.

“Transformer Enc.”和“Transformer Dec.”分别表示标准Transformer架构的编码器和解码器部分。它们的不同之处在于,解码器部分使用带有三角矩阵的掩码自注意力来防止token关注其未来(右侧)的位置。“Transformer”是指标准的encoder-decoder架构。
GLUE基准测试9项任务的平均分(见7.1节)。
不含WNLI任务。
表示集成(ensemble)结果。
表示该模型是否通常以微调方式使用。
UniLM 的 MLM 建立在三个版本的 LM 之上:单向 LM、双向 LM 和序列到序列 LM。

3.3.2 Contextual Embeddings上下文嵌入

A large number of studies have probed and induced different types of knowledge in contextual embeddings. In general, there are two types of knowledge: linguistic knowledge and world knowledge.

大量的研究对上下文嵌入不同类型的知识进行了探究和归纳。一般来说,知识有两种类型语言知识世界知识

Linguistic Knowledge

A wide range of probing tasks are designed to investigate the linguistic knowledge in PTMs. Tenney et al. [120], Liu et al. [121] found that BERT performs well on many syntactic tasks such as part-of-speech tagging and constituent labeling. However, BERT is not good enough at semantic and fine-grained syntactic tasks, compared with simple syntactic tasks.

Besides, Tenney et al. [122] analyzed the roles of BERT's layers in different tasks and found that BERT solves tasks in a similar order to that in NLP pipelines. Furthermore, knowledge of subject-verb agreement [123] and semantic roles [124] are also confirmed to exist in BERT. Besides, Hewitt and Manning [125], Jawahar et al. [126], Kim et al. [127] proposed several methods to extract dependency trees and constituency trees from BERT, which proved BERT's ability to encode syntax structure. Reif et al. [128] explored the geometry of internal representations in BERT and find some evidence:

1)、linguistic features seem to be represented in separate semantic and syntactic subspaces;

2)、attention matrices contain gram-matical representations;

3)、BERT distinguishes word senses at a very fine level.

语言知识

研究人员设计了大量的探测任务,以考察PTMs中的语言知识。Tenney等[120]、Liu等[121]发现BERT在词性标注和成分标注等许多句法任务上表现良好。但是,与简单的语法任务相比,BERT在语义和细粒度语法任务方面还不够好。

此外,Tenney等人[122]分析了BERT各层在不同任务中的作用,发现BERT解决任务的顺序与NLP流水线中的顺序类似。主谓一致[123]和语义角色[124]等知识也被证实存在于BERT中。此外,Hewitt和Manning[125]、Jawahar等人[126]、Kim等人[127]提出了几种从BERT中提取依存树和成分树的方法,证明了BERT编码句法结构的能力。Reif等人[128]探索了BERT中内部表征的几何结构,并发现了一些证据:

1)、语言特征似乎在单独的语义和句法子空间中表示;

2)、注意矩阵包含语法表示;

3)、 BERT可以非常精细区分词义

World Knowledge

Besides linguistic knowledge, PTMs may also store world knowledge presented in the training data. A straightforward method of probing world knowledge is to query BERT with “fill-in-the-blank” cloze statements, for example, “Dante was born in [MASK]”. Petroni et al. [129] constructed the LAMA (Language Model Analysis) task by manually creating single-token cloze statements (queries) from several knowledge sources. Their experiments show that BERT contains world knowledge competitive with traditional information extraction methods. Due to the simplicity of the query generation procedure in LAMA, Jiang et al. [130] argued that LAMA just measures a lower bound for what language models know and propose more advanced methods to generate more efficient queries. Despite the surprising findings of LAMA, it has also been questioned by subsequent work [131, 132]. Similarly, several studies induce relational knowledge [133] and commonsense knowledge [134] from BERT for downstream tasks.

世界知识

除了语言知识,PTM还可以存储训练数据中呈现的世界知识。探索世界知识的一种直接方法是使用“填空”完形填空语句查询BERT,例如,“但丁出生在[MASK]”。Petroni等人[129]通过从几个知识来源手动创建单token完形语句(查询)构建了LAMA(语言模型分析)任务。他们的实验表明,BERT包含的世界知识可与传统的信息提取方法相媲美。由于LAMA中查询生成过程的简单性,Jiang等人[130]认为LAMA只是度量语言模型所知道的下界,并提出更先进的方法来生成更有效的查询。尽管LAMA的发现令人惊讶,但它也受到了后续工作的质疑[131,132]。类似地,有几项研究从BERT中归纳出用于下游任务关系知识[133]和常识知识[134]。
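备注:LAMA式的"填空"探测可以用 Hugging Face transformers 的 fill-mask 管线快速体验。下面只是一个假设的最小用法示意(需要安装 transformers 并能下载模型权重;模型检查点与返回字段请以所装版本的文档为准)。

```python
# 用完形填空式查询探测 PTM 中的世界知识
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for result in unmasker("Dante was born in [MASK]."):
    # 每个候选包含被填入的词及其概率得分
    print(result["token_str"], round(result["score"], 3))
```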

4 Extensions of PTMs—PTM 的扩展

4.1 Knowledge-Enriched PTMs知识丰富的 PTM

PTMs usually learn universal language representation from general-purpose large-scale text corpora but lack domain-specific knowledge. Incorporating domain knowledge from external knowledge bases into PTM has been shown to be effective. The external knowledge ranges from linguistic [135, 79, 77, 136], semantic [137], commonsense [138], factual [76–78, 57, 80], to domain-specific knowledge [139, 78].

PTM通常从通用大型文本语料库学习通用语言表示,但缺乏特定领域的知识。将来自外部知识库的领域知识合并到PTM中已被证明是有效的。外部知识的范围从语言学[135,79,77,136]、语义[137]、常识[138]、事实[76-78,57,80]到特定领域知识[139,78]。

On the one hand, external knowledge can be injected during pre-training. Early studies [140–143] focused on learning knowledge graph embeddings and word embedding jointly. Since BERT, some auxiliary pre-training tasks are designed to incorporate external knowledge into deep PTMs. LIBERT [135] (linguistically-informed BERT) incorporates linguistic knowledge via an additional linguistic constraint task. Ke et al. [79] integrated sentiment polarity of each word to extend the MLM to Label-Aware MLM (LA-MLM). As a result, their proposed model, SentiLR, achieves state-of-the-art performance on several sentence- and aspect-level sentiment classification tasks. Levine et al. [137] proposed SenseBERT, which is pre-trained to predict not only the masked tokens but also their supersenses in WordNet. ERNIE(THU) [76] integrates entity embeddings pre-trained on a knowledge graph with corresponding entity mentions in the text to enhance the text representation. Similarly, KnowBERT [77] trains BERT jointly with an entity linking model to incorporate entity representation in an end-to-end fashion. Wang et al. [80] proposed KEPLER, which jointly optimizes knowledge embedding and language modeling objectives. These work inject structure information of knowledge graph via entity embedding. In contrast, K-BERT [78] explicitly injects related triples extracted from KG into the sentence to obtain an extended tree-form input for BERT. CoLAKE [81] integrates knowledge context and language context into a unified graph, which is then pre-trained with MLM to obtain contextualized representation for both knowledge and language. Moreover, Xiong et al. [57] adopted entity replacement identification to encourage the model to be more aware of factual knowledge. However, most of these methods update the parameters of PTMs when injecting knowledge, which may suffer from catastrophic forgetting when injecting multiple kinds of knowledge. To address this, K-Adapter [136] injects multiple kinds of knowledge by training different adapters independently for different pre-training tasks, which allows continual knowledge infusion.

一方面,可以在预训练期间注入外部知识。早期的研究[140-143]集中在知识图嵌入词嵌入的联合学习上。自BERT以来,一些辅助预训练任务旨在将外部知识融入深度 PTMsLIBERT[135](linguistically-informed BERT,语言知情的BERT)通过额外的语言约束任务整合了语言知识。Ke等人[79]整合了每个词的情感极性,将MLM扩展为标签感知MLM(Label-Aware MLM,LA-MLM)。因此,他们提出的模型SentiLR在几个句子级方面级情感分类任务上实现了最先进的性能。Levine等人[137]提出了SenseBERT,它经过预训练,不仅可以预测掩码tokens,还可以预测WordNet中的超义。ERNIE(THU)[76]将在知识图上预训练的实体嵌入文本中提到的相应实体相结合,以增强文本表示。类似地,KnowBERT[77]将BERT实体链接模型联合训练,以端到端方式整合实体表示。Wang等[80]提出了联合优化知识嵌入语言建模目标的KEPLER。这些工作通过实体嵌入的方法注入知识图的结构信息。相比之下,K-BERT[78]显式地将从KG(知识图)中提取的相关三元组注入到句子中,以获得BERT的扩展树形输入。CoLAKE[81]将知识上下文和语言上下文整合成一个统一的图中,然后用MLM进行预训练,得到知识和语言的上下化表示。另外,Xiong等WKLM[57]采用实体替换识别鼓励模型更多地了解事实性知识。然而,这些方法大多在注入知识时更新PTM的参数,当注入多种知识时可能会出现灾难性遗忘。为了解决这个问题,K-Adapter[136]通过针对不同的预训练任务独立训练不同的适配器注入多种知识,从而实现持续的知识注入

On the other hand, one can incorporate external knowledge into pre-trained models without retraining them from scratch. As an example, K-BERT [78] allows injecting factual knowl-edge during fine-tuning on downstream tasks. Guan et al.[138] employed commonsense knowledge bases, ConceptNet and ATOMIC, to enhance GPT-2 for story generation. Yang et al. [144] proposed a knowledge-text fusion model to acquire related linguistic and factual knowledge for machine reading comprehension.

另一方面,人们可以将外部知识整合到预训练的模型中,而无需从头开始重训练它们。例如,K-BERT[78]允许在下游任务微调过程中注入事实知识边缘。Guan等人[138]利用常识知识库ConceptNetATOMIC增强了用于故事生成GPT-2。Yang等人[144]提出了一种知识-文本融合模型KT-NET,以获取机器阅读理解MRC相关语言事实知识
备注:KT-NET是由百度开创性地提出了语言表示与知识表示的深度融合模型,希望同时借助语言和知识的力量进一步提升机器阅读理解(Machine Reading Comprehension,MRC)的效果。

Besides, Logan IV et al. [145] and Hayashi et al. [146] ex-tended language model to knowledge graph language model (KGLM) and latent relation language model (LRLM) respec-tively, both of which allow prediction conditioned on knowl-edge graph. These novel KG-conditioned language models show potential for pre-training.

此外,Logan IV等[145]和Hayashi等[146]分别将语言模型扩展为知识图语言模型(KGLM)和潜在关系语言模型(LRLM),这两种语言模型都允许以知识图为条件进行预测。这些新的KG条件语言模型显示出预训练的潜力

4.2 Multilingual and Language-Specific PTMs多语言和特定语言的PTMs

4.2.1 Multilingual PTMs多语言的PTMs

Learning multilingual text representations shared across lan-guages plays an important role in many cross-lingual NLP tasks.

学习跨语言共享多语言文本表示,在许多跨语言NLP任务中起着重要作用。

Cross-Lingual Language Understanding (XLU)

Most of the early works focus on learning multilingual word em-bedding [147–149], which represents text from multiple lan-guages in a single semantic space. However, these methods usually need (weak) alignment between languages.

跨语言语言理解(XLU)

早期的研究主要集中在学习多语言词嵌入[147-149],它在一个语义空间中表示来自多种语言的文本。然而,这些方法通常需要在语言之间(弱)对齐

Multilingual BERT(3) (mBERT) is pre-trained by MLM with the shared vocabulary and weights on Wikipedia text from the top 104 languages. Each training sample is a monolingual doc-ument, and there are no cross-lingual objectives specifically designed nor any cross-lingual data. Even so, mBERT per-forms cross-lingual generalization surprisingly well [150]. K et al. [151] showed that the lexical overlap between languages plays a negligible role in cross-lingual success.

XLM [46] improves mBERT by incorporating a cross-lingual task, translation language modeling (TLM), which performs MLM on a concatenation of parallel bilingual sentence pairs. Unicoder [82] further proposes three new cross-lingual pre-training tasks, including cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model (XMLM).

XLM-RoBERTa (XLM-R) [62] is a scaled multilingual encoder pre-trained on a significantly increased amount of training data, 2.5TB clean CommonCrawl data in 100 differ-ent languages. The pre-training task of XLM-RoBERTa is monolingual MLM only. XLM-R achieves state-of-the-arts results on multiple cross-lingual benchmarks, including XNLI, MLQA, and NER.

多语言BERT(3) (mBERT)由MLM预训练,使用来自前104种语言的维基百科文本的共享词汇权重。每个训练样本都是单语言文档,没有专门设计的跨语言目标,也没有任何跨语言数据。即便如此,mBERT在跨语言泛化方面的表现还是出奇的好[150]。K等人[151]表明,语言之间的词汇重叠在跨语言成功中起着微不足道的作用。

XLM[46]通过结合跨语言任务翻译语言建模(TLM)来改进mBERT, TLM在并行双语句子对的连接/串联上执行MLM。Unicoder[82]进一步提出了三种新的跨语言预训练任务,包括跨语言单词恢复跨语言释义分类跨语言掩码语言模型(XMLM)。

XLM-RoBERTa (XLM-R)[62]是一种可伸缩多语言编码器,预训练数据量显着增加,包括2.5TB 100种不同语言的干净CommonCrawl数据。XLM-RoBERTa的预训练任务仅为单语MLMXLM-R在多种跨语言基准测试(包括XNLIMLQANER)上实现了最先进的结果。
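To make the TLM idea above concrete, the following is a minimal sketch, under our own assumptions rather than the XLM implementation, of how a TLM-style training example can be built: a parallel sentence pair is packed into one sequence and masked jointly, so the model can consult the translation when recovering masked tokens. The mBERT checkpoint name and the 15% masking rate are illustrative choices.

```python
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tlm_example(src: str, tgt: str, mask_prob: float = 0.15):
    # Encode the parallel pair as one sequence: [CLS] src [SEP] tgt [SEP]
    enc = tokenizer(src, tgt, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    labels = input_ids.clone()
    special = tokenizer.get_special_tokens_mask(
        input_ids[0].tolist(), already_has_special_tokens=True)
    for i, is_special in enumerate(special):
        if not is_special and random.random() < mask_prob:
            input_ids[0, i] = tokenizer.mask_token_id   # token to be recovered by MLM
        else:
            labels[0, i] = -100                          # position ignored by the MLM loss
    return input_ids, enc["attention_mask"], labels

ids, attn, labels = tlm_example("The cat sleeps on the mat.", "Le chat dort sur le tapis.")
```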

Cross-Lingual Language Generation (XLG)

Multilin-gual generation is a kind of tasks to generate text with different languages from the input language, such as machine transla-tion and cross-lingual abstractive summarization.

Different from the PTMs for multilingual classification, the PTMs for multilingual generation usually needs to pre-train both the encoder and decoder jointly, rather than only focusing on the encoder.

MASS [41] pre-trains a Seq2Seq model with monolingual Seq2Seq MLM on multiple languages and achieves significant improvement for unsupervised NMT. XNLG [60] performs two-stage pre-training for cross-lingual natural language gen-eration. The first stage pre-trains the encoder with monolin-gual MLM and Cross-Lingual MLM (XMLM) tasks. The second stage pre-trains the decoder by using monolingual DAE and Cross-Lingual Auto-Encoding (XAE) tasks while keeping the encoder fixed. Experiments show the benefit of XNLG on cross-lingual question generation and cross-lingual abstractive summarization. mBART [61], a multilingual exten-sion of BART [50], pre-trains the encoder and decoder jointly with Seq2Seq denoising auto-encoder (DAE) task on large-scale monolingual corpora across 25 languages. Experiments demonstrate that mBART produces significant performance gains across a wide variety of machine translation (MT) tasks.

跨语言语言生成(XLG)

多语言生成是一种从输入语言生成不同语言文本的任务,如机器翻译跨语言抽象摘要

与用于多语言分类的PTM不同,用于多语言生成的PTM通常需要对编码器和解码器进行联合预训练,而不仅是对编码器进行预训练

MASS[41]在多语言上用单语言Seq2Seq MLM预训练Seq2Seq模型,并在无监督NMT方面取得了显著改善。

XNLG[60]为跨语言自然语言生成执行两阶段的预训练。

>>第一阶段用单语MLM和跨语言MLM(XMLM)任务预训练编码器。

>> 第二阶段通过使用单语言DAE和跨语言自动编码(XAE)任务预训练解码器,同时保持编码器固定。实验证明了XNLG跨语言问题生成跨语言抽象摘要方面的优势。

mBART[61]是BART[50]的多语言扩展,它与Seq2Seq DAE(去噪自动编码器)任务联合在跨25种语言的大规模单语语料库上预训练编码器和解码器。实验证明mBART在各种机器翻译(MT)任务中产生了显著的性能提升

4.2.2 Language-Specific PTMs特定语言的 PTM

Although multilingual PTMs perform well on many languages, recent work showed that PTMs trained on a single language significantly outperform the multilingual results [89, 90, 152].

For Chinese, which does not have explicit word bound-aries, modeling larger granularity [85, 87, 86] and multi-granularity [84, 153] word representations have shown great success. Kuratov and Arkhipov [154] used transfer learn-ing techniques to adapt a multilingual PTM to a monolin-gual PTM for Russian language. In addition, some monolin-gual PTMs have been released for different languages, such as CamemBERT [89] and FlauBERT [90] for French, Fin-BERT [152] for Finnish, BERTje [88] and RobBERT [91] for Dutch, AraBERT [155] for Arabic language.

尽管多语言PTM在许多语言上表现良好,但最近的研究表明,在单一语言上训练的PTM明显优于多语言的结果[89,90,152]。

对于没有明确词边界的中文,建模更大粒度 [85-BERT-wwm-Chinese、87-ZEN、86-NEZHA] 和多粒度 [84-ERNIE、153] 词表示已显示出巨大的成功。Kuratov和Arkhipov[154]使用迁移学习技术使多语言PTM适应于俄语的单语PTM。此外,还针对不同的语言发布了一些单语PTM,如法语的CamemBERT[89]和FlauBERT[90],芬兰语的Fin-BERT[152],荷兰语的BERTje[88]和RobBERT[91],阿拉伯语的AraBERT[155]。

4.3 Multi-Modal PTMs多模态PTM

Observing the success of PTMs across many NLP tasks, some research has focused on obtaining a cross-modal version of PTMs. A great majority of these models are designed for general visual and linguistic feature encoding. And these models are pre-trained on some huge corpus of cross-modal data, such as videos with spoken words or images with captions, incorporating extended pre-training tasks to fully utilize the multi-modal features. Typically, tasks like visual-based MLM, masked visual-feature modeling and visual-linguistic matching are widely used in multi-modal pre-training, such as VideoBERT [97], VisualBERT [94], ViLBERT [92].

观察到PTM在许多NLP任务中的成功,一些研究侧重于获得PTM的跨模态版本。这些模型大多是为一般的视觉和语言特征编码而设计的。这些模型在一些庞大的跨模态数据语料库上进行预训练,如带语音的视频或带字幕的图像,并加入扩展的预训练任务,以充分利用多模态特性。通常,基于视觉的MLM、掩码视觉特征建模和视觉-语言匹配等任务被广泛应用于多模态预训练中,如VideoBERT[97]、VisualBERT[94]、ViLBERT[92]。

4.3.1 Video-Text PTMs

VideoBERT [97] and CBT [98] are joint video and text models. To obtain sequences of visual and linguistic tokens used for pre-training, the videos are pre-processed by CNN-based encoders and off-the-shelf speech recognition techniques, respectively. And a single Transformer encoder is trained on the processed data to learn the vision-language representations for downstream tasks like video captioning. Furthermore, UniViLM [156] proposes to bring in generation tasks to further pre-train the decoder used in downstream tasks.

VideoBERT[97]和CBT[98]是联合视频和文本模型。为了获得用于预训练的视觉和语言tokens序列,视频分别由基于CNN的编码器和现成的语音识别技术进行预处理。在处理过的数据上训练单个Transformer编码器,以学习用于下游任务(如视频字幕)的视觉语言表示。此外,UniViLM[156]建议引入生成任务,进一步预训练下游任务使用的解码器。

4.3.2 Image-Text PTMs图像-文本 PTM

Besides methods for video-language pre-training, several works introduce PTMs on image-text pairs, aiming to fit downstream tasks like visual question answering (VQA) and visual commonsense reasoning (VCR). Several proposed models adopt two separate encoders for image and text representation independently, such as ViLBERT [92] and LXMERT [93]. While other methods like VisualBERT [94], B2T2 [95], VL-BERT [96], Unicoder-VL [157] and UNITER [158] propose single-stream unified Transformers. Though these model architectures are different, similar pre-training tasks, such as MLM and image-text matching, are introduced in these approaches. And to better exploit visual elements, images are converted into sequences of regions by applying RoI or bounding box retrieval techniques before being encoded by pre-trained Transformers.

除了视频-语言预训练的方法外,还有一些工作在图像-文本对上引入了PTMs,旨在适应下游任务,如视觉问答(VQA)和视觉常识推理(VCR)。一些提出的模型采用两个单独的编码器分别进行图像和文本表示,如ViLBERT[92]和LXMERT[93]。而VisualBERT[94]、B2T2[95]、VL-BERT[96]、Unicoder-VL[157]、UNITER[158]等方法则提出了单流统一Transformer。虽然这些模型结构不同,但在这些方法中引入了类似的预训练任务,如MLM和图像-文本匹配。为了更好地利用视觉元素,在由预训练的Transformer编码之前,应用RoI或边界框检索技术将图像转换为区域序列。

4.3.3 Audio-Text PTMs音频-文本PTM

Moreover, several methods have explored the chance of PTMs on audio-text pairs, such as SpeechBERT [99]. This work tries to build an end-to-end Speech Question Answering (SQA) model by encoding audio and text with a single Transformer encoder, which is pre-trained with MLM on speech and text corpus and fine-tuned on Question Answering.

此外,还有一些方法探索了在音频-文本对上应用PTM的机会,如SpeechBERT[99]。该工作尝试用单个Transformer编码器对音频和文本进行编码,构建一个端到端的语音问答(SQA)模型,该编码器在语音和文本语料库上使用MLM进行预训练,并在问答任务上进行微调。

4.4 Domain-Specific and Task-Specific PTMs 特定领域和特定任务的 PTM

Most publicly available PTMs are trained on general do-main corpora such as Wikipedia, which limits their appli-cations to specific domains or tasks. Recently, some studies have proposed PTMs trained on specialty corpora, such as BioBERT [100] for biomedical text, SciBERT [101] for scien-tific text, ClinicalBERT [159, 160] for clinical text.

In addition to pre-training a domain-specific PTM, some work attempts to adapt available pre-trained models to target applications, such as biomedical entity normalization [161], patent classification [102], progress notes classification and keyword extraction [162].

Some task-oriented pre-training tasks were also proposed, such as sentiment Label-Aware MLM in SentiLR [79] for sen-timent analysis, Gap Sentence Generation (GSG) [163] for text summarization, and Noisy Words Detection for disfluency detection [164].

大多数公开可用的PTM都是在通用领域主语料库(如Wikipedia)上进行训练的,这将它们的应用程序限制在特定的领域或任务上。最近,一些研究提出了在专业语料库上训练的PTMs,如生物医学文本的BioBERT[100],科学文本的SciBERT[101],临床文本的ClinicalBERT[159, 160]。

除了对特定领域的PTM进行预训练外,一些工作还试图将可用的预训练模型适应目标应用,例如生物医学实体规范化[161]、专利分类[102]—PatentBERT、进度记录分类和关键字提取[162]。

还提出了一些面向任务的预训练任务,如SentiLR中的情感标签感知MLM[79]用于情感分析,Gap Sentence Generation (GSG)[163]用于文本摘要,以及用于不流畅检测的 Noisy Words Detection[164]—NWD

4.5 Model Compression模型压缩

Since PTMs usually consist of at least hundreds of millions of parameters, they are difficult to deploy in the online services of real-life applications and on resource-restricted devices. Model compression [165] is a potential approach to reduce the model size and increase computation efficiency.

There are five ways to compress PTMs [166]:

(1) model pruning, which removes less important parameters,

(2) weight quantization [167], which uses fewer bits to represent the parameters,

(3) parameter sharing across similar model units,

(4) knowledge distillation [168], which trains a smaller student model that learns from intermediate outputs from the original model and

(5) module replacing, which replaces the modules of original PTMs with more compact substitutes.

Table 3 gives a comparison of some representative com-pressed PTMs.

由于PTM通常包含至少数亿个参数,因此很难将它们部署在实际应用程序和资源受限设备的在线服务上。模型压缩[165]是减小模型大小和提高计算效率的一种潜在方法。

有五种方法可以压缩 PTM[166]:

(1)模型修剪删除不太重要的参数;

(2)权重量化[167],它使用更少的比特表示参数;

(3)、参数共享,在相似的模型单元之间共享参数;

(4)知识蒸馏[168],它训练一个更小学生模型,从原始模型的中间输出中学习;

(5)模块替换,它用更紧凑的替代替换原始PTMs的模块。

表3给出了一些有代表性的压缩PTM的比较。

 Table 3: Comparison of Compressed PTMs

4.5.1 Model Pruning模型剪枝——删除不太重要的参数

Model pruning refers to removing part of neural network (e.g., weights, neurons, layers, channels, attention heads), thereby achieving the effects of reducing the model size and speeding up inference time.

Gordon et al. [103] explored the timing of pruning (e.g., pruning during pre-training, after downstream fine-tuning) and the pruning regimes. Michel et al. [174] and Voita et al. [175] tried to prune the entire self-attention heads in the transformer block.

模型剪枝是指去除部分神经网络(如权重、神经元、层、通道、注意头),从而达到减小模型大小、加快推理时间的效果。

Gordon等人[103]—CompressingBERT 探讨了修剪的时机(例如,在预训练修剪,在下游微调之后修剪)和修剪机制。Michel等人[174]和Voita等人[175]试图修剪Transformer块中的整个self-attention heads

[174]量化BERT每个注意力Head的重要性且可修剪掉20~40%的注意力头。在文献《Are Sixteen Heads Really Better than One》中,深入分析了BERT多头机制中每个头到底有多大用,结果发现很多头其实没什么用。作者通过迭代的方法从BERT模型中逐步去除注意力头(attention head),他们使用了一种基于梯度检测的方法(对下游任务进行梯度估计)来估计每个注意力头的重要性,并通过绘制性能--去除的注意力头所占百分比函数来测试模型对注意力头剪枝的鲁棒性。在实践中,作者发现20 - 40%的注意力头可以修剪,且对模型准确性的影响可以忽略不计。

[175]量化Multi-Head Self-Attention中各个注意力Heads重要性并提出可修剪掉冗余Head,在文献《Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned》中,提出了一种量化注意力头重要程度的方法。多个Head的作用有大多数是冗余的,很多可以被砍掉。
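As a concrete illustration of structured pruning, the sketch below removes a few attention heads from a BERT encoder with the Hugging Face `prune_heads` API; the head indices are arbitrary placeholders, not the heads identified as redundant in [174] or [175], which in practice would be selected by an importance estimate such as the gradient-based scores discussed above.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Map layer index -> attention-head indices to remove in that layer.
# The indices here are arbitrary for illustration only.
model.prune_heads({0: [0, 1], 1: [2], 11: [5, 7]})

# The config records which heads were removed; the pruned model has fewer parameters
# in the affected attention blocks and can be fine-tuned afterwards to recover accuracy.
print(model.config.pruned_heads)
```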

4.5.2 Quantization量化——更少的比特表示参数

Quantization refers to the compression of higher precision parameters to lower precision. Works from Shen et al. [104] and Zafrir et al. [105] solely focus on this area. Note that quantization often requires compatible hardware.

量化是指将较高精度的参数压缩到较低精度。Shen等人[104]—Q-BERT 和Zafrir等人[105]—Q8BERT 的研究主要集中在这一领域。注意,量化通常需要兼容的硬件

[104]Q-BERT是一种对BERT使用二阶hessian信息的混合精度方法实现模型压缩的新型系统性方法。它是基于 BERT 的模型执行超低精度量化,旨在最小化性能下降幅度,同时保持硬件效率,能够在CV和NLP领域任务中产生前所未有的小模型。

[105]Q8BERT ,将较高精度的浮点参数(32)压缩为较低精度的浮点参数(8位),可以达到4x的压缩效果,同时把精度损失降到了最低。核心是通过把所有的FC层和embedding层的权值都量化成了8bit(因为这些权值占据了全部权值的99%)。低位表示是一种与硬件高度相关的技术,所以需要有一个针对8位的通用矩阵乘做了优化的硬件,将量化的模型布置上去后能够加速模型的推理性能,但是该论文只做了量化工作,没有做硬件的设计。
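The snippet below is a minimal sketch of post-training dynamic quantization with stock PyTorch, converting the weights of all Linear layers to int8; it is not the Q-BERT or Q8BERT recipe, and, as noted above, real low-bit deployments additionally depend on hardware with optimized int8 kernels.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Weights of all Linear layers become int8; activations are quantized dynamically at
# inference time, so no calibration data is needed.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
# The quantized model is smaller on disk and faster on int8-friendly CPUs,
# at the cost of a small accuracy drop.
```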

4.5.3 Parameter Sharing参数共享——相似单元共享参数

Another well-known approach to reduce the number of pa-rameters is parameter sharing, which is widely used in CNNs, RNNs, and Transformer [176]. ALBERT [63] uses cross-layer parameter sharing and factorized embedding parameteriza-tion to reduce the parameters of PTMs. Although the number of parameters is greatly reduced, the training and inference time of ALBERT are even longer than the standard BERT.

Generally, parameter sharing does not improve the compu-tational efficiency at inference phase.

另一种著名的减少参数参数数量的方法是参数共享,广泛应用于CNNRNNTransformer[176]。ALBERT[63]采用跨层参数共享分解嵌入参数化来减少 PTM 的参数。虽然参数量大大减少,但ALBERT训练和推理时间比标准BERT更长

一般情况下,参数共享不会提高推理阶段的计算效率
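A minimal sketch of ALBERT-style cross-layer parameter sharing is shown below: a single Transformer encoder layer is applied repeatedly, so depth grows without adding parameters, but every forward pass still runs the layer `depth` times, which is why inference is not faster. The sizes are illustrative.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Apply one Transformer encoder layer repeatedly (ALBERT-style weight sharing)."""
    def __init__(self, hidden=768, heads=12, depth=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.depth = depth  # the same weights are reused at every depth

    def forward(self, x):
        # Parameters are shared, but compute is still depth x (one layer's cost).
        for _ in range(self.depth):
            x = self.layer(x)
        return x

encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 16, 768))  # [batch, seq_len, hidden]
```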

4.5.4 Knowledge Distillation知识蒸馏/提炼——训练一个更小学生模型

Knowledge distillation (KD) [168] is a compression technique in which a small model called student model is trained to re-produce the behaviors of a large model called teacher model. Here the teacher model can be an ensemble of many models and usually well pre-trained. Different to model compres-sion, distillation techniques learn a small student model from a fixed teacher model through some optimization objectives, while compression techniques aiming at searching a sparser architecture.

Generally, distillation mechanisms can be divided into three types:

(1) distillation from soft target probabilities,

(2) dis-tillation from other knowledge, and

(3) distillation to other structures:

知识蒸馏(KD)[168]是一种压缩技术,在这种技术中,一个被称为学生模型的小模型被训练来重现一个被称为教师模型的大模型的行为。在这里,教师模型可以是许多模型的集合,并且通常经过良好的预训练。模型压缩不同,蒸馏技术通过一些优化目标从固定的教师模型中学习小型学生模型,而压缩技术则旨在搜索更稀疏的架构。

一般来说,蒸馏机制可以分为三种:

(1)、从Soft Target概率中蒸馏,

(2)、从其他知识中蒸馏,

(3)、从其他结构中蒸馏:

(1) Distillation from soft target probabilities. Bucilua et al.[165] showed that making the student approximate the teacher model can transfer knowledge from teacher to student. A com-mon method is approximating the logits of the teacher model. DistilBERT [106] trained the student model with a distillation loss over the soft target probabilities of the teacher as:

$\mathcal{L}_{KD\text{-}CE} = -\sum_{i} t_i \cdot \log(s_i)$, where $t_i$ and $s_i$ are the probabilities estimated by the teacher model and the student, respectively.

Distillation from soft target probabilities can also be used in task-specific models, such as information retrieval [177], and sequence labeling [178].

(1)、Soft Target软目标概率蒸馏。Bucilua等[165]表明,使学生近似于教师模型可以将知识从教师转移到学生。一种常用的方法是近似教师模型的对数DistilBERT[106]对学生模型进行了训练,其对教师软目标概率的蒸馏损失为:

其中ti和si分别是教师模型和学生模型估计的概率。

从软目标概率中蒸馏也可用于特定任务模型,如信息检索[177]和序列标签[178]。

关键词额外信息补充详见—Hard-target 和 Soft-target对比
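The sketch below implements the soft-target term $-\sum_i t_i \log(s_i)$ above and combines it, Hinton-style, with the ordinary hard-label cross-entropy; the temperature and mixing weight are illustrative hyper-parameters, and DistilBERT's full objective also includes additional terms (e.g., a cosine embedding loss).

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=1.0):
    """L = -sum_i t_i * log(s_i), averaged over the batch (temperature T is optional)."""
    t = F.softmax(teacher_logits / T, dim=-1)          # teacher probabilities t_i
    log_s = F.log_softmax(student_logits / T, dim=-1)  # log of student probabilities s_i
    return -(t * log_s).sum(dim=-1).mean()

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Hinton-style combination: soft-target term plus the usual hard-label cross-entropy."""
    kd = soft_target_loss(student_logits, teacher_logits, T) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage with random logits standing in for teacher/student outputs.
student = torch.randn(8, 2)
teacher = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student, teacher, labels))
```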

(2) Distillation from other knowledge. Distillation from soft target probabilities regards the teacher model as a black box and only focus on its outputs. Moreover, decomposing the teacher model and distilling more knowledge can bring improvement to the student model.

TinyBERT [107] performs layer-to-layer distillation with embedding outputs, hidden states, and self-attention distribu-tions. MobileBERT [171] also perform layer-to-layer distil-lation with soft target probabilities, hidden states, and self-attention distributions. MiniLM [108] distill self-attention distributions and self-attention value relation from teacher model.

Besides, other models distill knowledge through many ap-proaches. Sun et al. [169] introduced a “patient” teacher-student mechanism, Liu et al. [179] exploited KD to improve a pre-trained multi-task deep neural network.

(2)、从其他知识中蒸馏。从软目标概率中蒸馏,将教师模型视为一个黑箱只关注其输出。此外,对教师模型进行分解蒸馏/提炼出更多的知识,可以对学生模型进行改进

TinyBERT[107]使用嵌入输出、隐藏状态自注意分布执行层对层蒸馏。MobileBERT[171]还使用软目标概率、隐藏状态自注意分布执行层对层蒸馏。MiniLM[108]从教师模型中提取自注意分布自注意值关系。

此外,其他模型通过许多方法提取知识。Sun等人[169](PKDBert)引入了一种“耐心的”师生机制,Liu等人[179](DK_MT-DNN)利用KD来改进预训练的多任务深度神经网络。

(3) Distillation to other structures. Generally, the structure of the student model is the same as the teacher model, except for a smaller layer size and a smaller hidden size. However, not only decreasing parameters but also simplifying model structures from Transformer to RNN [180] or CNN [181] can reduce the computational complexity.

(3)、蒸馏成其他结构。一般来说,学生模型的结构与教师模型相同,只是层size和隐藏size变得更小。然而,从TransformerRNN[180]或CNN[181],除了减少参数外,还可以简化模型结构,从而降低计算复杂度

[169] PKDBert:针对Bert模型压缩提出了叫做Patient Knowledge Distillation(PKD)的方案,该方案有2种不同的蒸馏策略PKD-Last(学习最后k层)、PKD-Skip(学习中间的每k层信息)。论文介绍了一种bert模型压缩蒸馏的方法,在vanilla 知识蒸馏方法的基础上,直接学习老师模型的中间层信(充分挖掘教师模型的信息),通过学习teacher网络中间层信息提高student网络表现。
[179] DK_MT-DNN,将知识蒸馏,拓展到多任务学习以训练MT-DNN,从而打造出更稳固且通用的自然语言理解模型。过程如下所示
第一步,为每个task,训练一个由不同MT-DNNs所组成的集成学习模型(teacher模型),其性能比任何单一模型更优秀;
第二步,通过多任务学习,从多个集成学习模型(ensemble teachers)中,蒸馏训练一个单一的MT-DNN(student模型)。
该文的知识蒸馏过程即对于不同的任务,使用相同的结构在对应的数据集上进行微调,这就可以看作每个任务的Teacher,他们分别擅长解决对应的问题。

关键词额外信息补充—Hard-target 和 Soft-target对比

      Hard-target 和 Soft-target传统的神经网络训练方法是,定义一个损失函数,目标是使预测值尽可能接近于真实值(Hard- target),损失函数就是使神经网络的损失值和尽可能小。这种训练过程是对ground truth求极大似然。在知识蒸馏中,是使用大模型的类别概率作为Soft-target的训练过程。
>> Hard-target:原始数据集标注的 one-shot 标签,除了正标签为 1,其他负标签都是 0;
>> Soft-target:Teacher模型softmax层输出的类别概率,每个类别都分配了概率,正标签的概率最高;
      知识蒸馏用Teacher模型预测的 Soft-target 来辅助Hard-target训练 Student模型的方式为什么有效呢?softmax层的输出,除了正例之外,负标签也带有Teacher模型归纳推理的大量信息,比如某些负标签对应的概率远远大于其他负标签,则代表 Teacher模型在推理时认为该样本与该负标签有一定的相似性。而在传统的训练过程(Hard-target)中,所有负标签都被统一对待。也就是说,知识蒸馏的训练方式,使得每个样本给Student模型带来的信息量大于传统的训练方式

源自网络

4.5.5 Module Replacing模块替换——用更紧凑替换

Module replacing is an interesting and simple way to reduce the model size, which replaces the large modules of original PTMs with more compact substitutes. Xu et al. [109] pro-posed Theseus Compression motivated by a famous thought experiment called “Ship of Theseus”, which progressively substitutes modules from the source model with modules of fewer parameters. Different from KD, Theseus Compression only requires one task-specific loss function. The compressed model, BERT-of-Theseus, is 1.94× faster while retaining more than 98% performance of the source model.

模块替换减小模型尺寸的一种有趣而简单的方法,它用更紧凑的替代品替换原始PTM的大模块。Xu等人[109]提出 Theseus 压缩的动机是一个名为“Theseus 之船”的著名思想实验,该实验逐步用更少参数的模块替代源模型中的模块。与KD不同,Theseus 压缩只需要一个特定任务的损失函数。压缩模型BERT-of-Theseus速度快1.94倍,同时保持源模型98%以上的性能

4.5.6 Early Exit早退

Another efficient way to reduce the inference time is early exit, which allows the model to exit early at an off-ramp instead of passing through the entire model. The number of layers to be executed is conditioned on the input.

The idea of early exit is first applied in computer vision, such as BranchyNet [182] and Shallow-Deep Network [183]. With the emergence of deep pre-trained language models, early exit is recently adopted to speedup Transformer-based models. As a prior work, Universal Transformer [176] uses the Adaptive Computation Time (ACT) mechanism [184] to achieve input-adaptive computation. Elbayad et al. [185] pro-posed Depth-adaptive transformer for machine translation, which learns to predict how many decoding layers are re-quired for a particular sequence or token. Instead of learning how much computation is required, Liu et al. [186] proposed two estimation approaches based on Mutual Information (MI) and Reconstruction Loss respectively to directly allocate the appropriate computation to each sample.

另一种减少推断时间的有效方法是提前退出,这允许模型在出口(off-ramp)处提前退出,而不是通过整个模型。要执行的层数取决于输入。

早期退出的思想最早应用于计算机视觉,如BranchyNet[182]和Shallow-Deep Network[183]。随着深度预训练语言模型的出现,最近采用早期退出来加速基于transformer的模型。作为先前的工作,Universal Transformer[176]使用自适应计算时间(Adaptive Computation Time, ACT)机制[184]来实现输入自适应计算。Elbayad等人[185]提出了用于机器翻译DAdap Transformers,它可以学习预测特定序列或token需要多少解码层。Liu等[186](FDAdap Transformers)提出了两种分别基于互信息(MI)和重建损失(Reconstruction Loss)的估计方法,直接为每个样本分配适当的计算,而不是学习需要多少计算量。

More recently, DeeBERT [110], RightTool [111], Fast-BERT [112], ELBERT [187], PABEE [113] are proposed to reduce the computation of transformer encoder for natural language understanding tasks. Their methods usually contain two steps: (a) Training the injected off-ramps (aka internal classifiers), and (b) Designing an exiting strategy to decide whether or not to exit.

减少Transformer Enc的计算量

最近,DeeBERT [110], RightTool [111], Fast-BERT [112], ELBERT [187], PABEE[113]被提出来减少用于自然语言理解任务的Transformer编码器的计算量。他们的方法通常包含两个步骤:

(a)、训练注入的出入口off-ramps(又名内部分类器),

(b)、设计一个退出策略来决定是否退出。

Typically, the training objective is a weighted sum of the cross-entropy losses at all off-ramps, i.e.

$\mathcal{L} = \sum_{m=1}^{M} w_m \cdot \mathcal{L}^{CE}_m$, where M is the number of off-ramps, $w_m$ the weight of the m-th off-ramp, and $\mathcal{L}^{CE}_m$ its cross-entropy loss. FastBERT [112] adopted the self-distillation loss that trains each off-ramp with the soft target generated by the final classifier. Liao et al. [114] improved the objective by considering both the past and the future information. In particular, the off-ramps are trained to aggregate the hidden states of the past layers, and also approximate the hidden states of the future layers. Moreover, Sun et al. [115] developed a novel training objective from the perspective of ensemble learning and mutual information, by which the off-ramps are trained as an ensemble. Their proposed objective not only optimizes the accuracy of each off-ramp but also the diversity of the off-ramps.

During inference, an exiting strategy is required to decide whether to exit early or continue to the next layer. Dee-BERT [110], FastBERT [112], Liao et al. [114] adopt the entropy of the prediction distribution as the exiting criterion. Similarly, RightTool [111] use the maximum softmax score to decide whether to exit. PABEE developed a patience-based strategy that allows a sample to exit when the prediction is unchanged for successive layers. Further, Sun et al. [115] adopt a voting-based strategy to let all of the past off-ramps take a vote to decide whether or not to exit. Besides, Li et al.[116] proposed a window-based uncertainty as the exiting cri-terion to achieve token-level early exit (TokEE) for sequence labeling tasks.

通常,训练目标是所有出入口交叉熵损失的加权和,即。

其中M为出入口数量。FastBERT[112]采用自蒸馏损失,用最终分类器生成的软目标训练每个出入口。Liao等人[114](GPFEE)通过同时考虑过去和未来信息,改进了目标。特别地,off-ramps出入口被训练为聚集过去层的隐藏状态,并且也近似于未来层的隐藏状态。此外,Sun等人[115](EICEE)从集成学习互信息的角度提出了一种新的训练目标,将出入口作为一个整体进行训练。他们提出的目标不仅优化了每个出入口的准确性,而且还优化了出入口的多样性

在推理过程中,需要一个退出策略决定是提早退出还是继续到下一层。DeeBERT [110], FastBERT [112], Liao等[114](GPFEE)采用预测分布的熵作为现有的判据。类似地,RightTool[111]使用最大softmax分数来决定是否退出。PABEE开发了一种基于耐心的策略,允许样本在连续层的预测不变时退出。此外,Sun等人[115]采用基于投票的策略,让所有过去的出入口进行投票来决定是否退出。此外,Li等人[116](SentEE/TokEE)提出了基于窗口的不确定性作为退出标准,以实现序列标签任务的token级提前退出(TokEE)。

[114] Global Past-future Early Exit,(GPFEE)则尝试利用 imitation learning,一方面利用所有浅层的样本表示,另外一方面尝试预测出更深层的样本表示来作为辅助信息,进而提升分类的效果。
[115] Early exiting with ensemble internal classifiers,(EICEE)
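The following sketch shows the two ingredients in toy form: the weighted sum of off-ramp cross-entropy losses used for training, and an entropy-based exit criterion of the kind used by DeeBERT/FastBERT at inference. Uniform weights and the 0.3 entropy threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def off_ramp_training_loss(off_ramp_logits, labels, weights=None):
    """Weighted sum of cross-entropy losses over all M off-ramps (training objective above)."""
    M = len(off_ramp_logits)
    if weights is None:
        weights = [1.0 / M] * M            # uniform weights; depth-proportional weights are also common
    losses = [F.cross_entropy(logits, labels) for logits in off_ramp_logits]
    return sum(w * l for w, l in zip(weights, losses))

def entropy_exit(off_ramp_logits, threshold=0.3):
    """Inference: exit at the first off-ramp whose prediction entropy falls below the threshold."""
    for layer_idx, logits in enumerate(off_ramp_logits):
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        if entropy.item() < threshold:     # confident enough -> stop here
            return layer_idx, int(probs.argmax(dim=-1))
    return len(off_ramp_logits) - 1, int(probs.argmax(dim=-1))

# Toy usage: 12 off-ramps, a single example, 2 classes.
logits_per_ramp = [torch.randn(1, 2) for _ in range(12)]
labels = torch.tensor([1])
print(off_ramp_training_loss(logits_per_ramp, labels))
print(entropy_exit(logits_per_ramp))
```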

5 Adapting PTMs to Downstream Tasks使 PTM 适应下游任务

Although PTMs capture the general language knowledge from a large corpus, how effectively adapting their knowledge to the downstream task is still a key problem.

尽管PTM从大型语料库中获取通用语言知识,但如何有效地将其知识适应下游任务仍然是一个关键问题。

5.1 Transfer Learning迁移学习

Transfer learning [188] is to adapt the knowledge from a source task (or domain) to a target task (or domain). Fig-ure 4 gives an illustration of transfer learning.

There are many types of transfer learning in NLP, such as domain adaptation, cross-lingual learning, multi-task learning. Adapting PTMs to downstream tasks is sequential transfer learning task, in which tasks are learned sequentially and the target task has labeled data.

迁移学习[188]是将源任务(或领域)中的知识适应到目标任务(或领域)中。图4给出了迁移学习的例子。

NLP中的迁移学习有多种类型,如领域适应跨语言学习多任务学习等。使PTM适应下游任务是顺序迁移学习任务,其中任务是按顺序学习的,并且目标任务有标签的数据

5.2 How to Transfer?如何迁移

To transfer the knowledge of a PTM to the downstream NLP tasks, we need to consider the following issues:

为了将PTM的知识迁移到下游的NLP任务中,我们需要考虑以下问题:

5.2.1 Choosing appropriate pre-training task, model architecture and corpus选择合适的预训练任务、模型架构和语料库

Different PTMs usually have different effects on the same downstream task, since these PTMs are trained with various pre-training tasks, model architecture, and corpora.

(1) Currently, the language model is the most popular pre-training task and can more efficiently solve a wide range of NLP problems [58]. However, different pre-training tasks have their own bias and give different effects for different tasks. For example, the NSP task [16] makes PTM understand the relationship between two sentences. Thus, the PTM can benefit downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI).

(2) The architecture of PTM is also important for the down-stream task. For example, although BERT helps with most natural language understanding tasks, it is hard to generate language.

(3) The data distribution of the downstream task should be approximate to PTMs. Currently, there are a large number of off-the-shelf PTMs, which can just as conveniently be used for various domain-specific or language-specific downstream tasks.

Therefore, given a target task, it is always a good solution to choose the PTMs trained with appropriate pre-training task, architecture, and corpus.

不同的PTM通常对相同的下游任务有不同的影响,因为这些PTM是用各种预训练任务、模型架构和语料库训练的。

(1)、语言模型是目前最流行的预训练任务,可以更有效地解决广泛的NLP问题[58]。但是,不同的预训练任务有其自身的偏向性,对不同的任务有不同的效果。例如,NSP任务[16]让PTM理解两个句子之间的关系。因此,PTM 可以使下游任务受益,如问答(QA)和自然语言推理(NLI)等具体应用

(2) 、PTM的对下游任务也很重要。例如,尽管BERT有助于大多数自然语言理解任务,但它很难生成语言

(3)、下游任务的数据分布,应近似于PTM。目前,有大量现成的PTM,它们同样可以方便地用于各种特定领域或特定语言的下游任务。

因此,给定目标任务,选择使用适当的预训练任务语料库训练的PTM总是一个很好的解决方案。

5.2.2 Choosing appropriate layers选择合适的图层

Given a pre-trained deep model, different layers should cap-ture different kinds of information, such as POS tagging, pars-ing, long-term dependencies, semantic roles, coreference. For RNN-based models, Belinkov et al. [189] and Melamud et al.[34] showed that representations learned from different layers in a multi-layer LSTM encoder benefit different tasks (e.g., predicting POS tags and understanding word sense). For transformer-based PTMs, Tenney et al. [122] found BERT represents the steps of the traditional NLP pipeline: basic syntactic information appears earlier in the network, while high-level semantic information appears at higher layers.

给定一个预训练好的深度模型不同的层应该捕获不同类型的信息,例如POS词性标记、解析、长期依赖、语义角色、共引用。对于基于RNN的模型,Belinkov等人[189]和Melamud等人[34]表明,从多层LSTM编码器中的不同层学习的表示有利于不同的任务(例如,预测POS标签和理解词义)。对于基于Transformer的PTM, Tenney等[122]发现BERT代表了传统NLP管道的步骤:基本语法信息出现在网络的较早位置,而高级语义信息出现在较高层

Let $\mathbf{H}^{(l)}$ $(1 \le l \le L)$ denote the l-th layer representation of the pre-trained model with L layers, and $g(\cdot)$ denote the task-specific model for the target task.

There are three ways to select the representation:

a、Embedding Only. One approach is to choose only the pre-trained static embeddings, while the rest of the model still needs to be trained from scratch for a new target task.
They fail to capture higher-level information that might be even more useful. Word embeddings are only useful in capturing semantic meanings of words, but we also need to understand higher-level concepts like word sense.
b、Top Layer. The simplest and most effective way is to feed the representation at the top layer into the task-specific model $g(\mathbf{H}^{(L)})$.

c、All Layers. A more flexible way is to automatic choose the best layer in a soft version, like ELMo [14]:

设H(l)(1≤ l ≤L)表示l层预训练模型的第l层表示,g(·)表示目标任务的特定任务模型。

选择表示的方式有以下三种:

a,仅嵌入。一种方法是只选择预训练的静态嵌入,而模型的其余部分,从头开始训练以完成新的目标任务。它们无法捕捉到可能更有用的更高层次的信息。词嵌入仅在捕捉单词的语义意义时有用,但我们还需要理解更高级的概念,如词义。

b,顶部Layer。最简单有效的方法是将顶层的表示送入特定任务的模型g(H(L))。

c,所有Layers。一个更灵活的方法是在soft 版本中自动选择最好的图层,比如ELMo [14]:

$\mathbf{r}_t = \gamma \sum_{l=1}^{L} \alpha_l \mathbf{H}^{(l)}_t$, where $\alpha_l$ is the softmax-normalized weight for layer $l$ and $\gamma$ is a scalar to scale the vectors output by the pre-trained model. The mixup representation is fed into the task-specific model $g(\mathbf{r}_t)$.

其中αl为层l的softmax归一化权值,γ 是一个标量,用于缩放预训练模型输出的向量。混合表示被输入到特定任务的模型g(rt)。
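Below is a minimal sketch of the "all layers" option: a learnable, softmax-normalized mixture of the L layer representations scaled by $\gamma$, in the spirit of ELMo's scalar mixing. Shapes and the 12-layer usage example are illustrative.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learnable softmax-normalized mixture of the L layer representations, scaled by gamma."""
    def __init__(self, num_layers: int):
        super().__init__()
        self.alphas = nn.Parameter(torch.zeros(num_layers))  # unnormalized alpha_l
        self.gamma = nn.Parameter(torch.ones(1))              # global scale gamma

    def forward(self, layer_reprs):
        # layer_reprs: list of L tensors H^(l), each of shape [batch, seq_len, hidden]
        weights = torch.softmax(self.alphas, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_reprs))
        return self.gamma * mixed   # r_t, which is then fed into the task-specific model g(.)

mix = ScalarMix(num_layers=12)
hidden_states = [torch.randn(2, 8, 768) for _ in range(12)]
r = mix(hidden_states)             # [2, 8, 768]
```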

5.2.3 To tune or not to tune?是否微调?

Currently, there are two common ways of model transfer: fea-ture extraction (where the pre-trained parameters are frozen), and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).

In feature extraction way, the pre-trained models are re-garded as off-the-shelf feature extractors. Moreover, it is im-portant to expose the internal layers as they typically encode the most transferable representations [190].

Although both these two ways can significantly benefit most of NLP tasks, feature extraction way requires more com-plex task-specific architecture. Therefore, the fine-tuning way is usually more general and convenient for many different downstream tasks than feature extraction way.

目前,有两种常见的模型迁移方式:

T1、特征提取(预训练的参数被冻结)和

T2、微调(预训练的参数被解冻和微调)。

特征提取方面,将预训练好的模型视为现成的特征提取器。此外,暴露内部层很重要,因为它们通常编码最可转移的表示[190]。

虽然这两种方法都可以显著地使大多数NLP任务受益,但特征提取方法需要更复杂的特定任务架构。因此,对于许多不同的下游任务,微调方式通常比特征提取方式更通用、更方便
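A minimal sketch of the two transfer modes with a Hugging Face BERT encoder: feature extraction freezes every pre-trained parameter and trains only the task head, while fine-tuning simply leaves the encoder parameters trainable. The checkpoint name, head, and learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 2)   # task-specific head, trained from scratch

# T1: feature extraction -- freeze every pre-trained parameter; only the head is updated.
for param in encoder.parameters():
    param.requires_grad = False

# T2: fine-tuning -- simply leave requires_grad=True (the default) for the encoder as well.

trainable = [p for p in list(encoder.parameters()) + list(classifier.parameters()) if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)
```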

Table 4 gives some common combinations of adapting PTMs.

Table 4: Some common combinations of adapting PTMs.

表4给出了自适应PTM的一些常见组合。

表4:适应性PTM的一些常见组合。

5.3 Fine-Tuning Strategies微调策略

With the increase of the depth of PTMs, the representation cap-tured by them makes the downstream task easier. Therefore, the task-specific layer of the whole model is simple. Since ULMFit and BERT, fine-tuning has become the main adaption method of PTMs. However, the process of fine-tuning is often brittle: even with the same hyper-parameter values, distinct random seeds can lead to substantially different results [193].

Besides standard fine-tuning, there are also some useful fine-tuning strategies.

随着PTM深度的增加,它们捕获的表示使下游任务更容易。因此,整个模型的特定任务的层很简单。自ULMFitBERT以来,微调已成为PTM的主要适应方法。然而,微调的过程往往是脆弱的:即使具有相同的超参数值,不同的随机种子也可能导致本质上不同的结果[193]。

除了T1、标准微调,还有一些有用的微调策略

Two-stage fine-tuning

An alternative solution is two-stage transfer, which introduces an intermediate stage between pre-training and fine-tuning. In the first stage, the PTM is trans-ferred into a model fine-tuned by an intermediate task or cor-pus. In the second stage, the transferred model is fine-tuned to the target task. Sun et al. [64] showed that the “further pre-training” on the related-domain corpus can further improve the ability of BERT and achieved state-of-the-art performance on eight widely-studied text classification datasets. Phang et al. [194] and Garg et al. [195] introduced the intermedi-ate supervised task related to the target task, which brings a large improvement for BERT, GPT, and ELMo. Li et al. [65] also used a two-stage transfer for the story ending prediction. The proposed TransBERT (transferable BERT) can transfer not only general language knowledge from large-scale unla-beled data but also specific kinds of knowledge from various semantically related supervised tasks.

T2、两级微调

另一种解决方案是两阶段转移,它在预训练和微调之间引入了一个中间阶段
>> 在第一阶段,PTM被转移为一个由中间任务或语料库微调的模型。
>> 在第二阶段,传输的模型被微调到目标任务。
Sun等[64]研究表明,在相关领域语料库上进行“进一步的预训练”可以进一步提高BERT的能力,并在8个被广泛研究的文本分类数据集上取得了最先进的性能。Phang等[194]和Garg等[195]引入了与目标任务相关的中间监督任务,为BERTGPTELMo带来了很大的改进。Li等人[65]也使用了两阶段转移来预测故事结局。TransBERT (transferable BERT)不仅可以从大规模的无标签数据中转移一般语言知识,还可以从各种语义相关的监督任务中转移特定类型的知识

Multi-task fine-tuning

Liu et al. [67] fine-tuned BERT un-der the multi-task learning framework, which demonstrates that multi-task learning and pre-training are complementary technologies.

T3、多任务微调

Liu等[67]在多任务学习框架下对BERT进行了微调,表明多任务学习和预训练是互补的技术。

Fine-tuning with extra adaptation modules

The main drawback of fine-tuning is its parameter inefficiency: every downstream task has its own fine-tuned parameters. There-fore, a better solution is to inject some fine-tunable adaptation modules into PTMs while the original parameters are fixed.

Stickland and Murray [68] equipped a single shared BERT model with small additional task-specific adaptation modules, projected attention layers (PALs). The shared BERT with the PALs matches separately fine-tuned models on the GLUE benchmark with roughly 7 times fewer parameters. Similarly, Houlsby et al. [69] modified the architecture of pre-trained BERT by adding adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing.

T4、使用额外的适配模块进行微调

微调的主要缺点是参数效率低每个下游任务都有自己的微调参数。因此,更好的解决方案是在原有参数不变的情况下,在PTM中注入一些可微调的自适应模块

Stickland和Murray[68]配备了一个单独的共享BERT模型,该模型带有额外的小型特定任务适应模块,即投影注意层(PALs)。与PALs共享的BERT在GLUE基准测试中分别匹配微调过的模型,参数少了大约7倍。类似地,Houlsby等人[69]通过添加适配器模块修改了预训练BERT的架构。适配器模块产生了一个紧凑且可扩展的模型;每个任务只添加几个可训练的参数,并且可以添加新的任务,而无需重新访问以前的任务。原始网络的参数保持不变,产生了高度的参数共享
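The sketch below shows a bottleneck adapter in the spirit of Houlsby et al. [69]: a down-projection, non-linearity, and up-projection with a residual connection, inserted into each Transformer layer while the original PTM weights stay frozen. The bottleneck size of 64 is an illustrative choice.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a residual connection."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # The residual path lets a near-zero adapter leave the frozen layer's output unchanged.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter(hidden_size=768)
out = adapter(torch.randn(2, 16, 768))
# During adaptation only the adapter (and layer-norm/head) parameters are trained;
# the original PTM weights stay frozen and are shared across tasks.
```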

Others Motivated by the success of widely-used ensemble models, Xu et al. [196] improved the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation, which can improve the performance of BERT on downstream tasks without leveraging external resource or significantly de-creasing the training efficiency. They integrated ensemble and distillation within a single training process. The teacher model is an ensemble model by parameter-averaging several student models in previous time steps.

Instead of fine-tuning all the layers simultaneously, grad-ual unfreezing [38] is also an effective method that gradu-ally unfreezes layers of PTMs starting from the top layer. Chronopoulou et al. [197] proposed a simpler unfreezing method, sequential unfreezing, which first fine-tunes only the randomly-initialized task-specific layers, and then unfreezes the hidden layers of PTM, and finally unfreezes the embedding layer.

Li and Eisner [198] compressed ELMo embeddings us-ing variational information bottleneck while keeping only the information that helps the target task.

Generally, the above works show that the utility of PTMs can be further stimulated by better fine-tuning strategies.

T5、Others

受广泛使用的集成模型成功的推动,Xu等人[196]通过T5、两种有效的机制改进了BERT的微调:自集成自蒸馏,可以在不利用外部资源或显著降低训练效率的情况下,提高BERT在下游任务上的性能。他们在单一的训练过程中集成了集成和蒸馏。教师模型是通过在之前的时间步骤中对几个学生模型进行参数平均,而得到的集成模型。

与同时微调所有层不同,T6、逐渐解冻 [38] 也是一种从顶层开始逐渐解冻 PTM 层的有效方法。Chronopoulou等人[197]提出了一种更简单的解冻结方法——顺序解冻结,该方法首先只对随机初始化的特定任务层进行微调,然后解冻结PTM的隐藏层,最后解冻结嵌入层

Li和Eisner[198]压缩了ELMo嵌入的变分信息瓶颈,同时只保留有助于目标任务的信息。

总的来说,上述工作表明,更好的微调策略可以进一步激发PTM的效用

5.3.1 Prompt-based Tuning基于提示的微调

Narrowing the gap between pre-training and fine-tuning can further boost the performance of PTMs on downstream tasks. An alternative approach is reformulating the downstream tasks into a MLM task by designing appropriate prompts. Prompt-based methods have shown great power in few-shot setting [199, 200, 70, 72], zero-shot setting [129, 201], and even fully-supervised setting [74, 75]. Current prompt-based methods can be categorized as two branches according to the prompt is whether discrete or continuous.

T7、基于提示的微调

缩小预训练微调之间的差距可以进一步提高PTM在下游任务中的表现。另一种方法是通过设计适当的提示将下游任务重新组织为MLM任务。基于提示的方法在少样本设置[199,200,70,72],零样本设置[129,201],甚至全监督设置[74,75]中都表现出了强大的功能。目前基于提示的方法根据提示是离散的还是连续的可以分为两个分支。

Discrete prompts

Discrete prompt is a sequence of words to be inserted into the input text, which helps the PTM to bet-ter model the downstream task. Sun et al. [202] constructed an auxiliary sentence by transforming aspect-based sentiment analysis (ABSA) task to a sentence pair classification task, but its model parameters still need to be fine-tuned. GPT-3 [59] proposed the in-context learning that concatenates the original input with the task description and a few examples. By this, GPT-3 can achieve competitive performance without tuning the parameters. Besides, Petroni et al. [129] found that with proper manual prompt, BERT can perform well on entity pre-diction task (LAMA) without training. In addition to LAMA, Schick and Sch¨utze [200, 70] proposed PET that designed discrete prompts for various text classification and entailment tasks. However, the manually designed prompts can be sub-optimal, as a result, many methods are developed to automate the generation of prompts. LPAQA [201] uses two methods,i.e., mining-based generation and paraphrasing-based generation, to find the optimal patterns that express particular relations. AutoPrompt [71] finds the optimal prompt with gradient-guided search. LM-BFF [72] employs T5 [42] to automatically generate prompts.

T7.1、离散的提示

离散提示符,是插入到输入文本中的一系列单词序列,它帮助PTM更好地对下游任务建模。Sun等[202]将基于aspect的情感分析(ABSA)任务转化为句子对分类任务,构建了辅助句,但其模型参数仍需微调。GPT-3[59]提出了上下文学习,将原始输入与任务描述和一些示例连接起来。通过这种方式,GPT-3可以在不调整参数的情况下实现具有竞争力的性能。此外,Petroni等[129]发现在适当的手动提示下,BERT无需训练即可在实体预测任务 (LAMA) 上表现良好。除了LAMA之外,Schick和sch¨utze[200,70]还提出了PET, PET为各种文本分类和隐含任务设计离散提示。然而,手动设计的提示可能不是最优的,因此开发了许多方法来自动生成提示LPAQA[201]((挖掘自动提示)采用了两种方法:,基于挖掘的生成和基于释义的生成,以找到表达特定关系的最佳模式。AutoPrompt[71](梯度搜索提示)使用梯度引导搜索找到最佳提示。LM-BFF[72]采用T5[42]自动生成提示符。
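As a concrete illustration of a discrete prompt, the sketch below reformulates binary sentiment classification as an MLM cloze task in the spirit of PET-style methods; the template "It was [MASK].", the verbalizer {great → positive, terrible → negative}, and the checkpoint are our own illustrative choices, and without any tuning the prediction is only a rough zero-shot guess.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def prompt_classify(sentence: str) -> str:
    # Wrap the input in a manually designed prompt ending with a [MASK] slot.
    text = f"{sentence} It was {tokenizer.mask_token}."
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    # Verbalizer: map label words in the vocabulary to task labels.
    verbalizer = {"great": "positive", "terrible": "negative"}
    ids = tokenizer.convert_tokens_to_ids(list(verbalizer))
    scores = logits[0, mask_pos, :].squeeze(0)[ids]
    return list(verbalizer.values())[int(scores.argmax())]

print(prompt_classify("The movie was a waste of time."))
```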

Continuous prompts

Instead of finding the optimal con-crete prompt, another alternative is to directly optimize the prompt in continuous space, i.e. the prompt vectors are not necessarily word type embeddings of the PTM. The opti-mized continuous prompt is concatenated with word type embeddings, which is then fed into the PTM. Qin and Eisner [203] and Zhong et al. [204] found that the optimized con-tinuous prompt can outperform concrete prompts (including manual [129], mined (LPAQA [201]), and gradient-searched (AutoPrompt [71]) prompts) on relational tasks. WARP [73] inserts trainable continuous prompt tokens before, between, and after the input sequence while keeping the parameters of the PTM fixed, resulting in considerable performance on GLUE benchmark. Prefix-Tuning [74] inserts continuous prompt as prefix of the input of GPT-2 for table-to-text gen-eration and BART for summarization. Prefix-Tuning, as a parameter-efficient tuning technique, achieved comparable per-formance in fully-supervised setting and outperformed model fine-tuning in few-shot setting. Further, P-Tuning [75] showed that, with continuous prompt, GPT can also achieve compa-rable or even better performance to similar-sized BERT on natural language understanding (NLU) tasks. Very recently, Lester et al. [205] showed that prompt tuning becomes more competitive with scale. When the PTM exceeds billions of parameters, the gap between model fine-tuning and prompt tuning can be closed, which makes the prompt-based tuning a very promising method for efficient serving of large-scale PTMs.

T7.2、连续的提示

另一种方法是直接在连续空间中优化提示,而不是寻找最佳的具体提示,即提示向量不一定是 PTM 的词类型嵌入优化后的连续提示符与词类型嵌入连接,然后将其输入PTM。Qin和Eisner[203]和Zhong等人[204]发现优化后的连续提示在关系任务上优于具体提示(包括手动提示[129]、挖掘提示(LPAQA[201])和梯度搜索提示(AutoPrompt[71])。WARP[73]在保持PTM参数固定的同时,在输入序列之前、之间和之后插入可训练的连续提示token,从而在GLUE基准测试中获得相当好的性能。Prefix-Tuning前缀调优[74]将连续提示符插入GPT-2的输入前缀,用于表到文本的生成,BART用于摘要。Prefix-Tuning作为一种参数高效的调优技术,在全监督环境下具有相当的性能,在少样本环境下优于模型微调。此外,P-Tuning[75]表明,在连续提示下,GPT也可以在自然语言理解(NLU)任务上实现与类似规模的BERT相当甚至更好的性能。最近,Lester等人[205]表明,随着规模的扩大,提示调优变得更具竞争力。当PTM参数超过数十亿个时模型微调提示调优之间的差距可以缩小,这使得基于提示的调优成为高效服务大规模PTM的一种很有前途的方法。

6 Resources of PTMs—PTM 的资源

There are many related resources for PTMs available online. Table 5 provides some popular repositories, including third-party implementations, paper lists, visualization tools, and other related resources of PTMs.

Besides, there are some other good survey papers on PTMs for NLP [211, 212, 173].

网上有很多关于PTMs的相关资源。表5提供了一些流行的存储库,包括第三方实现论文列表、可视化工具和PTM的其他相关资源。

此外,还有一些其他关于 NLP 的 PTM 的优秀调查论文 [211、212、173]。

Table 5: Resources of PTMs

7 Applications应用

In this section, we summarize some applications of PTMs in several classic NLP tasks.

在本节中,我们总结了PTM在几个经典NLP任务中的一些应用

7.1 General Evaluation Benchmark通用评价基准

There is an essential issue for the NLP community that how can we evaluate PTMs in a comparable metric. Thus, large-scale-benchmark is necessary.

The General Language Understanding Evaluation (GLUE) benchmark [213] is a collection of nine natural language under-standing tasks, including single-sentence classification tasks (CoLA and SST-2), pairwise text classification tasks (MNLI, RTE, WNLI, QQP, and MRPC), text similarity task (STS-B), and relevant ranking task (QNLI). GLUE benchmark is well-designed for evaluating the robustness as well as general-ization of models. GLUE does not provide the labels for the test set but set up an evaluation server.

However, motivated by the fact that the progress in recent years has eroded headroom on the GLUE benchmark dra-matically, a new benchmark called SuperGLUE [214] was presented. Compared to GLUE, SuperGLUE has more challenging tasks and more diverse task formats (e.g., coreference resolution and question answering).

State-of-the-art PTMs are listed in the corresponding leaderboards4) 5).

对于NLP社区来说,有一个基本问题是我们如何以可比的指标评估PTM。因此,大规模的基准测试是必要的。

通用语言理解评估GLUE基准[213]是9个自然语言理解任务的集合,包括单句分类任务(CoLASST-2)、成对文本分类任务(MNLIRTEWNLIQQPMRPC)、文本相似任务(STS-B)和相关排名任务(QNLI)。GLUE基准测试是为评估模型的健壮性通用性而精心设计的。GLUE不为测试集提供标签,而是设置一个评估server

然而,由于近年来的进步大大削弱了GLUE基准的空间,因此提出了一种名为SuperGLUE[214]的新基准。与GLUE相比,SuperGLUE具有更具挑战性的任务和更多样化的任务格式(例如,协参解析和问题回答)。

最先进的PTMs被列在相应的排行榜上。

GLUE和SuperGLUE的相关文章
NLP:GLUE和SuperGLUE基准的简介、任务分类、使用方法之详细攻略

7.2 Question Answering / MRC

Question answering (QA), or a narrower concept machine reading comprehension (MRC), is an important application in the NLP community. From easy to hard, there are three types of QA tasks: single-round extractive QA (SQuAD) [215], multi-round generative QA (CoQA) [216], and multi-hop QA (HotpotQA) [217].

BERT creatively transforms the extractive QA task to the spans prediction task that predicts the starting span as well as the ending span of the answer [16]. After that, PTM as an encoder for predicting spans has become a competitive baseline. For extractive QA, Zhang et al. [218] proposed a retrospective reader architecture and initialize the encoder with PTM (e.g., ALBERT). For multi-round generative QA, Ju et al.[219] proposed a “PTM+Adversarial Training+Rationale Tag-ging+Knowledge Distillation” model. For multi-hop QA, Tu et al. [220] proposed an interpretable “Select, Answer, and Explain” (SAE) system that PTM acts as the encoder in the selection module.

问答(QA)或更狭义的机器阅读理解(MRC)是NLP社区中的一个重要应用。从容易到难,QA任务有三种类型
>> 单轮提取式QA (SQuAD)[215]、
>> 多轮生成式QA (CoQA)[216]和
>> 多跳式QA (HotpotQA)[217]。

BERT创造性地将提取QA任务转换为跨度预测任务,预测答案[16]的起始跨度和结束跨度。此后,PTM作为一种预测跨度的编码器已成为一种具有竞争力的基线。
>> 对于提取性QA, Zhang等人[218]提出了一种回溯式阅读器架构,并使用PTM(例如ALBERT)初始化编码器。
>> 对于多轮生成式QA, Ju等人[219]提出了“PTM+对抗性训练+基本原理标记+知识蒸馏”模型。

>> 对于多跳QA, Tu等人[220]提出了一种可解释的“选择、回答和解释”(SAE)系统,PTM作为选择模块中的编码器。

Generally, encoder parameters in the proposed QA model are initialized through a PTM, and other parameters are ran-domly initialized. State-of-the-art models are listed in the corresponding leaderboard. 6) 7) 8)

通常,所提出的QA模型中的编码器参数通过PTM初始化其他参数随机初始化。最先进的模型被列在相应的排行榜上。6) 7) 8)
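A minimal sketch of the span-prediction formulation described above: a PTM encoder with two heads scoring each token as the answer start and end. The checkpoint name is illustrative; with an un-fine-tuned QA head the predicted span is arbitrary, so in practice a SQuAD-fine-tuned checkpoint would be loaded.

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Note: the QA head of this checkpoint is randomly initialized; a SQuAD-fine-tuned
# checkpoint is needed for meaningful answers.
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

question, context = "Where was Dante born?", "Dante was born in Florence."
inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring start and end positions and decode the span between them.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start:end + 1])
print(answer)
```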

7.3 Sentiment Analysis情感分析

情感分类任务
BERT微调SST-2实现SOA
BERT+迁移学习技术,在日语 SA 中实现了新的SOA

情感分析任务
直接将BERT应用于ABSA效果不好;
ABSA转化为句对分类
后训练+BERT
对抗性训练+BERT
额外的池化模块+BERT中间层
基于端到端ABSA的aspect 检测和情感分类
SentiLRSentiWordNet+Label-Aware MLM来捕获情感转移关系;
基于BERT的“Mask and Infill”实现分离情感

BERT outperforms previous state-of-the-art models by simply fine-tuning on SST-2, which is a widely used dataset for senti-ment analysis (SA) [16]. Bataa and Wu [221] utilized BERT with transfer learning techniques and achieve new state-of-the-art in Japanese SA.

Despite their success in simple sentiment classification, directly applying BERT to aspect-based sentiment analysis (ABSA), which is a fine-grained SA task, shows less signif-icant improvement [202]. To better leverage the powerful representation of BERT, Sun et al. [202] constructed an auxil-iary sentence by transforming ABSA from a single sentence classification task to a sentence pair classification task. Xu et al. [222] proposed post-training to adapt BERT from its source domain and tasks to the ABSA domain and tasks. Fur-thermore, Rietzler et al. [223] extended the work of [222] by analyzing the behavior of cross-domain post-training with ABSA performance. Karimi et al. [224] showed that the per-formance of post-trained BERT could be further improved via adversarial training. Song et al. [225] added an additional pooling module, which can be implemented as either LSTM or attention mechanism, to leverage BERT intermediate lay-ers for ABSA. In addition, Li et al. [226] jointly learned aspect detection and sentiment classification towards end-to-end ABSA. SentiLR [79] acquires part-of-speech tag and prior sen-timent polarity from SentiWordNet and adopts Label-Aware MLM to utilize the introduced linguistic knowledge to capture the relationship between sentence-level sentiment labels and word-level sentiment shifts. SentiLR achieves state-of-the-art performance on several sentence- and aspect-level sentiment classification tasks.

BERT通过对SST-2进行微调就超越了以前最先进的模型,这是一个广泛用于情感分析(SA)[16]的数据集。Bataa和Wu[221]利用BERT迁移学习技术,并在日语 SA 实现了新的SOA水平

尽管他们在简单的情感分类取得了成功,但直接将BERT应用于基于aspect的情感分析(ABSA),这是一项细粒度的SA任务,显示出不太显著的改进[202]。为了更好地利用BERT强大表示法,Sun等人[202]将ABSA单句分类任务转化为句对分类任务,构造了一个辅助句。Xu等人[222]提出后训练,使BERT从其源域和任务适应到ABSA域和任务。此外,Rietzler等人[223]通过使用ABSA性能分析跨域后训练的行为,扩展了[222]的工作。Karimi等人[224]研究表明,通过对抗性训练可以进一步提高训练后BERT的表现。Song等人[225]添加了一个额外的池化模块,可以作为 LSTM 或注意力机制来实现,以利用 BERT 中间层进行 ABSA。此外,Li等[226]共同学习了基于端到端ABSA的aspect 检测和情感分类SentiLR[79]从SentiWordNet中获取词性标签和先验情感极性,并采用Label-Aware MLM利用引入的语言知识来捕获句子级情感标签单词级情感转移之间的关系SentiLR在几个句子级aspect的情感分类任务上达到了最先进的性能。

For sentiment transfer, Wu et al. [227] proposed “Mask and Infill” based on BERT. In the mask step, the model disen-tangles sentiment from content by masking sentiment tokens. In the infill step, it uses BERT along with a target sentiment embedding to infill the masked positions.

对于情感转移,Wu等[227]提出了基于BERT的“Mask and Infill”。在屏蔽步骤中,模型通过屏蔽情感tokens情感从内容中分离出来。在填充步骤中,它使用BERT目标情感嵌入填充掩码位置

额外信息补充:情感分析任务之TBSA对比ABSA

Target-Based情感分析任务/TBSA:即target-based,指的是句子中出现的词,换句话说,是句子中直接存在的名词
Aspect-Based情感分析任务/ABSA:Aspect-Based Sentiment Analysis (ABSA) ,可以是句子中未出现的词。aspect指的是句子中名词或实体的类别,是抽象出来的方面,一般情况下,不是句子中本来存在的名词;这是一项细粒度的SA任务

7.4 Named Entity Recognition命名实体识别

Named Entity Recognition (NER) is a task in information extraction and plays an important role in many NLP downstream tasks. In deep learning, most NER methods follow the sequence-labeling framework: the entity information in a sentence is transformed into a sequence of labels, one label per word, and the model is used to predict the label of each word. Since ELMo and BERT have shown their power in NLP, there is much work on pre-trained models for NER.

Akbik et al. [37] used a pre-trained character-level language model to produce word-level embedding for NER. TagLM [228] and ELMo [14] use a pre-trained language model’s last layer output and weighted-sum of each layer output as a part of word embedding. Liu et al. [229] used layer-wise pruning and dense connection to speed up ELMo’s inference on NER. Devlin et al. [16] used the first BPE’s BERT representation to predict each word’s label without CRF. Pires et al. [150] realized zero-shot NER through multilingual BERT. Tsai et al.[178] leveraged knowledge distillation to run a small BERT for NER on a single CPU. Besides, BERT is also used on domain-specific NER, such as biomedicine [230, 100], etc.

命名实体识别(NER),在信息抽取和许多 NLP 下游任务中起着重要作用。在深度学习中,大多数NER方法都是在序列标签框架中。句子中的实体信息转化标签序列一个标签对应一个单词。该模型用于预测每个单词的标签。由于ELMoBERT已经在NLP中展示了它们的强大功能,因此有很多关于 NER 预训练模型的工作。

Akbik等人[37]使用预训练的字符级语言模型为NER生成词级嵌入TagLM[228]和ELMo[14]使用预训练语言模型的最后一层输出各层输出的加权和作为词嵌入的一部分。Liu等[229]使用分层剪枝密集连接加速了ELMoNER的推断。Devlin et al.[16]使用第一个BPE的BERT表示来预测没有CRF的每个单词的标签。Pires等[150]通过多语言BERT实现了零样本NER。Tsai等人[178]利用知识蒸馏在单个CPU上为NER运行一个小型BERT。此外,BERT还用于特定领域NER,如生物医学[230,100]等。

7.5 Machine Translation机器翻译

Machine Translation (MT) is an important task in the NLP community, which has attracted many researchers. Almost all of Neural Machine Translation (NMT) models share the encoder-decoder framework, which first encodes input tokens to hidden representations by the encoder and then decodes output tokens in the target language from the decoder. Ra-machandran et al. [36] found the encoder-decoder models can be significantly improved by initializing both encoder and decoder with pre-trained weights of two language models. Edunov et al. [231] used ELMo to set the word embedding layer in the NMT model. This work shows performance im-provements on English-Turkish and English-German NMT model by using a pre-trained language model for source word embedding initialization.

Given the superb performance of BERT on other NLP tasks, it is natural to investigate how to incorporate BERT into NMT models. Conneau and Lample [46] tried to initialize the entire encoder and decoder by a multilingual pre-trained BERT model and showed a significant improvement could be achieved on unsupervised MT and English-Romanian super-vised MT. Similarly, Clinchant et al. [232] devised a series of different experiments for examining the best strategy to utilize BERT on the encoder part of NMT models. They achieved some improvement by using BERT as an initializa-tion of the encoder. Also, they found that these models can get better performance on the out-of-domain dataset. Imamura and Sumita [233] proposed a two stages BERT fine-tuning method for NMT. At the first stage, the encoder is initialized by a pre-trained BERT model, and they only train the decoder on the training set. At the second stage, the whole NMT model is jointly fine-tuned on the training set. By experiment, they show this approach can surpass the one stage fine-tuning method, which directly fine-tunes the whole model. Apart from that, Zhu et al. [192] suggested using pre-trained BERT as an extra memory to facilitate NMT models. Concretely, they first encode the input tokens by a pre-trained BERT and use the output of the last layer as extra memory. Then, the NMT model can access the memory via an extra attention mod-ule in each layer of both encoder and decoder. And they show a noticeable improvement in supervised, semi-supervised, and unsupervised MT.

机器翻译(MT)是NLP领域的一个重要课题,吸引了众多研究者的关注。几乎所有神经机器翻译(NMT)模型都共享编码器-解码器框架,该框架首先将输入tokens编码为编码器的隐藏表示,然后从解码器解码目标语言的输出tokens。Ra-machandran等人[36]发现,通过对编码器和解码器都初始化预训练好的两种语言模型的权重,编码器-解码器模型可以显著改善。Edunov等[231]在NMT模型中使用ELMo设置词嵌入层。该工作通过使用预训练的语言模型进行源词嵌入初始化提高了英语-土耳其语和英语-德语NMT模型的性能

鉴于BERT在其他NLP任务上的出色表现,研究如何将BERT纳入NMT模型是很自然的。Conneau和Lample[46]试图通过多语言预训练BERT模型初始化整个编码器和解码器,并表明在无监督MT和英语-罗马尼亚监督MT上可以实现显著改进。同样,Clinchant等人[232]设计了一系列不同的实验,以检验在NMT模型的编码器部分利用BERT最佳策略。他们通过使用BERT作为编码器的初始化实现了一些改进。此外,他们发现这些模型可以在域外数据集上获得更好的性能。Imamura和Sumita[233]提出了一种用于NMT两阶段BERT微调方法。在第一阶段,编码器由预训练的BERT模型初始化,他们在训练集上训练解码器。在第二阶段,整个NMT模型在训练集上进行联合微调。实验结果表明,该方法优于直接微调整个模型的单阶段微调方法。除此之外,Zhu等人[192]建议使用预训练的BERT作为额外的记忆来促进NMT模型。具体来说,他们首先通过预训练好BERT对输入tokens进行编码,并使用最后一层的输出作为额外的记忆。然后,NMT模型可以通过编码器和解码器的每一层中额外的注意模块访问内存。他们在有监督半监督无监督MT中表现出明显的改善。

Instead of only pre-training the encoder, MASS (Masked Sequence-to-Sequence Pre-Training) [41] utilizes Seq2Seq MLM to pre-train the encoder and decoder jointly. In the experiment, this approach can surpass the BERT-style pre-training proposed by Conneau and Lample [46] both on un-supervised MT and English-Romanian supervised MT. Dif-ferent from MASS, mBART [61], a multilingual extension of BART [50], pre-trains the encoder and decoder jointly with Seq2Seq denoising auto-encoder (DAE) task on large-scale monolingual corpora across 25 languages. Experiments demonstrated that mBART could significantly improve both supervised and unsupervised machine translation at both the sentence level and document level.

MASS(Masked Sequence-to-Sequence Pre-Training)[41]并不是只对编码器进行预训练,而是利用Seq2Seq MLM对编码器和解码器进行联合预训练。在实验中,该方法在无监督MT和英语-罗马尼亚监督MT上都能超过Conneau和Lample[46]提出的BERT式预训练。与 MASS 不同,mBART [61] 是 BART [50] 的多语言扩展,它与 Seq2Seq  DAE去噪自动编码器 (DAE) 任务联合对 25 种语言的大规模单语语料库进行预训练。实验表明,mBART句子水平文档水平上都能显著提高有监督和无监督机器翻译。

7.6 Summarization摘要总结

Summarization, aiming at producing a shorter text which pre-serves the most meaning of a longer text, has attracted the attention of the NLP community in recent years. The task has been improved significantly since the widespread use of PTM. Zhong et al. [191] introduced transferable knowledge (e.g., BERT) for summarization and surpassed previous mod-els. Zhang et al. [234] tries to pre-trained a document-level model that predicts sentences instead of words, and then apply it on downstream tasks such as summarization. More elabo-rately, Zhang et al. [163] designed a Gap Sentence Generation (GSG) task for pre-training, whose objective involves generat-ing summary-like text from the input. Furthermore, Liu and Lapata [235] proposed BERTSUM. BERTSUM included a novel document-level encoder, and a general framework for both extractive summarization and abstractive summarization.In the encoder frame, BERTSUM extends BERT by inserting multiple [CLS] tokens to learn the sentence representations. For extractive summarization, BERTSUM stacks several inter-sentence Transformer layers. For abstractive summarization, BERTSUM proposes a two-staged fine-tuning approach using a new fine-tuning schedule. Zhong et al. [236] proposed a novel summary-level framework MATCHSUM and conceptu-alized extractive summarization as a semantic text matching problem. They proposed a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary and achieved a state-of-the-art result on CNN/DailyMail (44.41 in ROUGE-1) by only using the base version of BERT.

摘要,旨在生成较短的文本保留较长文本的大部分含义,近年来引起了NLP界的关注。自PTM广泛应用以来,该任务得到了显著改进。Zhong等人[191]引入了可转移知识(如BERT)进行总结,并超越了以前的模型。Zhang等人[234]尝试预训练一个文档级模型,该模型预测句子而不是单词,然后将其应用于下游任务,如摘要。更详细的是,Zhang等人[163]为预训练设计了一个间隙句生成(GSG)任务,其目标包括从输入中生成类似摘要的文本。此外,Liu和Lapata[235]提出了BERTSUMBERTSUM包括一个新颖的文档级编码器,以及一个用于提取摘要抽象摘要的通用框架。在编码器框架中,BERTSUM通过插入多个[CLS]tokens来学习句子表示,从而扩展BERT

>> 对于提取摘要BERTSUM堆叠了几个句间Transformer层。

>> 对于抽象总结BERTSUM提出了一种使用新的微调计划的两阶段微调方法。

Zhong等人[236]提出了一种新的摘要级框架MATCHSUM概念化的提取摘要作为语义文本匹配问题。他们提出了一个Siamese-BERT架构来计算源文档和候选摘要之间的相似性,并仅使用BERT的基本版本就在CNN/DailyMail上获得了最先进的结果(ROUGE-1中为44.41)。

7.7 Adversarial Attacks and Defenses对抗性攻击和防御AdvAtt

The deep neural models are vulnerable to adversarial examples that can mislead a model to produce a specific wrong predic-tion with imperceptible perturbations from the original input. In CV, adversarial attacks and defenses have been widely stud-ied. However, it is still challenging for text due to the discrete nature of languages. Generating of adversarial samples for text needs to possess such qualities: (1) imperceptible to hu-man judges yet misleading to neural models; (2) fluent in grammar and semantically consistent with original inputs. Jin et al. [237] successfully attacked the fine-tuned BERT on text classification and textual entailment with adversarial exam-ples. Wallace et al. [238] defined universal adversarial triggers that can induce a model to produce a specific-purpose predic-tion when concatenated to any input. Some triggers can even cause the GPT-2 model to generate racist text. Sun et al. [239] showed BERT is not robust on misspellings.

PTMs also have great potential to generate adversarial sam-ples. Li et al. [240] proposed BERT-Attack, a BERT-based high-quality and effective attacker. They turned BERT against another fine-tuned BERT on downstream tasks and success-fully misguided the target model to predict incorrectly, out-performing state-of-the-art attack strategies in both success rate and perturb percentage, while the generated adversarial samples are fluent and semantically preserved.

深度神经模型很容易受到对抗性示例的影响,这些示例子可以误导模型使模型产生特定的错误预测,并对原始输入产生难以察觉的扰动。在CV领域中,对抗性攻击与防御被广泛研究。然而,由于语言的离散性文本仍然具有挑战性生成文本的对抗性样本需要具备以下特性:

(1)、人类判断难以察觉,但对神经模型具有误导性;

(2)、语法流利,语义与原始输入一致。

Jin等人[237]通过对抗性示例成功地攻击了文本分类文本蕴涵的微调BERT。Wallace等人[238]定义了通用对抗触发器,当连接到任何输入时,可以诱导模型产生特定目的的预测。一些触发器甚至会导致GPT-2模型生成种族主义文本。Sun等人[239]表明BERT拼写错误上不稳健

PTMs在生成对抗样本方面也有很大的潜力。Li等人[240]提出了BERT-Attack,一种基于BERT的高质量且有效的攻击方法。他们利用BERT去攻击另一个在下游任务上微调过的BERT,并成功地误导目标模型做出错误预测,在成功率和扰动比例方面均优于最先进的攻击策略,同时生成的对抗样本流畅且保留了原有语义。
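The following sketch illustrates the general recipe behind masked-LM-based substitution attacks such as BERT-Attack: mask one word, let a pre-trained masked LM propose context-aware replacements, and keep a replacement that flips the victim classifier's prediction. The victim_predict function, the word-importance ranking, and the semantic-similarity filtering used by the real attack are simplified assumptions here.

```python
# A simplified sketch of masked-LM-driven word substitution in the spirit of
# BERT-Attack: propose context-aware replacements for one word and keep any
# candidate that flips a target classifier's prediction. The victim model and
# the word-selection strategy are placeholders, not the authors' algorithm.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def propose_substitutes(words, position, top_k=8):
    """Use the masked LM to propose replacements for words[position]."""
    masked = words.copy()
    masked[position] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**inputs).logits
    top_ids = logits[0, mask_index].topk(top_k).indices.tolist()
    return [tokenizer.convert_ids_to_tokens(i) for i in top_ids]

def attack(words, position, victim_predict, original_label):
    """Return an adversarial sentence if some substitute flips the prediction."""
    for sub in propose_substitutes(words, position):
        if sub.startswith("##"):           # skip sub-word pieces for simplicity
            continue
        perturbed = words.copy()
        perturbed[position] = sub
        if victim_predict(" ".join(perturbed)) != original_label:
            return " ".join(perturbed)     # a single, hard-to-notice word change
    return None
```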

Besides, adversarial defenses for PTMs are also promising, which improve the robustness of PTMs and make them immune to adversarial attacks.

Adversarial training aims to improve the generalization by minimizing the maximal risk for label-preserving perturbations in embedding space. Recent work [241, 242] showed that adversarial pre-training or fine-tuning can improve both generalization and robustness of PTMs for NLP.

此外,针对PTM的对抗性防御也很有前景,它提高了PTM的鲁棒性,使其能够抵御对抗性攻击

对抗性训练旨在通过最小化嵌入空间中标签保留扰动最大风险来提高泛化能力。最近的工作[241,242]表明,对抗性预训练微调可以提高NLP的PTM的泛化鲁棒性
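As a rough illustration of perturbing the embedding space during training, the sketch below applies an FGM-style gradient-direction perturbation to the embedding matrix and accumulates the adversarial loss before the parameter update. This is one common realization of the idea, not the exact FreeLB [241] or ALUM [242] procedure; the model, batch format, and epsilon are assumptions.

```python
# A minimal sketch of adversarial fine-tuning via an FGM-style perturbation on
# the word-embedding matrix. `model` is assumed to be a PyTorch classifier and
# `embedding_weight` the nn.Parameter holding its embedding table.
import torch

def adversarial_step(model, embedding_weight, batch, loss_fn, optimizer, epsilon=1e-2):
    # 1) Clean forward/backward pass to obtain gradients on the embeddings.
    optimizer.zero_grad()
    loss = loss_fn(model(**batch["inputs"]), batch["labels"])
    loss.backward()

    # 2) Add a small gradient-direction perturbation to the embedding weights.
    grad = embedding_weight.grad
    norm = grad.norm()
    if norm and not torch.isnan(norm):
        delta = epsilon * grad / norm
        embedding_weight.data.add_(delta)

        # 3) Backward pass on the perturbed embeddings accumulates the adversarial loss.
        adv_loss = loss_fn(model(**batch["inputs"]), batch["labels"])
        adv_loss.backward()

        # 4) Restore the original embeddings before the parameter update.
        embedding_weight.data.sub_(delta)

    optimizer.step()
```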

8 Future Directions未来发展方向

Though PTMs have proven their power for various NLP tasks, challenges still exist due to the complexity of language. In this section, we suggest five future directions of PTMs.

虽然PTM已经证明了它们在各种NLP任务中的能力,但由于语言的复杂性挑战仍然存在。在本节中,我们提出了PTM未来的五个方向

(1)、PTMs当前的无上限——通用性PTMs需要更深更大更挑战性更高成本、需要更复杂有效训练技术(分布式训练/混合精度/梯度积累积)→更实际方法(基于现有的软硬件设计,如ELECTRA)

(1) Upper Bound of PTMs 

Currently, PTMs have not yet reached their upper bound. Most of the current PTMs can be further improved by more training steps and larger corpora.

The state of the art in NLP can be further advanced by increasing the depth of models, such as Megatron-LM [243] (8.3 billion parameters, 72 Transformer layers with a hidden size of 3072 and 32 attention heads) and Turing-NLG (17 billion parameters, 78 Transformer layers with a hidden size of 4256 and 28 attention heads).

The general-purpose PTMs are always our pursuit for learning the intrinsic universal knowledge of languages (even world knowledge). However, such PTMs usually need deeper architectures, larger corpora, and more challenging pre-training tasks, which further result in higher training costs. Moreover, training huge models is itself a challenging problem, which needs more sophisticated and efficient training techniques such as distributed training, mixed precision, gradient accumulation, etc. Therefore, a more practical direction is to design more efficient model architectures, self-supervised pre-training tasks, optimizers, and training skills using existing hardware and software. ELECTRA [56] is a good solution towards this direction.

(1)、PTMs当前的无上限

目前,PTMs还没有达到其上限。目前大多数PTMs都可以通过更多的训练步骤更大的语料库来进一步改进。

NLP 的最新技术可以通过增加模型的深度来进一步推进,例如Megatron-LM [243](83亿个参数,72个Transformer层,隐藏尺寸为3072,32个注意力头)和Turing-NLG(170亿参数,78个Transformer层,隐藏尺寸为4256,28个注意力头)。

通用性PTMs一直是我们学习语言内在普遍性知识(甚至世界知识)的追求。然而,这种PTMs通常需要更深的架构、更大的语料库和更具挑战性的预训练任务,这进一步导致更高的训练成本。同时,训练庞大的模型本身也是一个具有挑战性的问题,需要更复杂和有效的训练技术,如分布式训练、混合精度、梯度累积等。因此,一个更实际的方向是利用现有的硬件和软件设计更有效的模型架构、自监督预训练任务、优化器和训练技巧。ELECTRA[56]是朝着这个方向的一个很好的解决方案。
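The sketch below shows what two of the training techniques mentioned above, mixed precision and gradient accumulation, typically look like in a PyTorch training loop; the model, data loader, and accumulation factor are placeholders, and distributed training is omitted.

```python
# A minimal sketch of gradient accumulation combined with mixed-precision
# training using PyTorch AMP. Model, dataloader, and hyperparameters are
# illustrative placeholders.
import torch

scaler = torch.cuda.amp.GradScaler()
accumulation_steps = 8  # effective batch size = per-step batch size * 8

def train_epoch(model, dataloader, optimizer, loss_fn):
    model.train()
    optimizer.zero_grad()
    for step, (inputs, labels) in enumerate(dataloader):
        with torch.cuda.amp.autocast():            # mixed-precision forward pass
            loss = loss_fn(model(inputs), labels)
        scaler.scale(loss / accumulation_steps).backward()  # accumulate scaled grads
        if (step + 1) % accumulation_steps == 0:
            scaler.step(optimizer)                 # unscale gradients and update
            scaler.update()
            optimizer.zero_grad()
```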

(2)、PTM架构——Transformer系列(需高计算复杂度)和非Transformer系列(如NAS)

(2) Architecture of PTMs

The Transformer has been proved to be an effective architecture for pre-training. However, the main limitation of the Transformer is its computation complexity, which is quadratic to the input length. Limited by the memory of GPUs, most of current PTMs cannot deal with sequences longer than 512 tokens. Breaking this limit needs to improve the architecture of the Transformer. Although many works [25] tried to improve the efficiency of the Transformer, there remains much room for improvement.

Besides, searching for more efficient alternative non-Transformer architectures for PTMs is important to capture longer-range contextual information. The design of deep architecture is challenging, and we may seek help from some automatic methods, such as neural architecture search (NAS) [245].

(2)、PTM架构

Transformer已被证明是一种有效预训练架构。然而,Transformer的主要限制是它的计算复杂度,它是输入长度的二次方。受限于GPU的内存,目前大多数PTMs无法处理长度超过512 token的序列。打破这个限制需要改进Transformer的架构。虽然[25]的许多工作都试图提高Transformer的效率,但仍然有很大的改进空间。

此外,为PTM寻找更有效的替代非transformer架构对于捕获更远距离的上下文信息非常重要。深度架构的设计具有挑战性,我们可以从一些自动化方法中寻求帮助,如神经架构搜索(NAS)[245]。
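A quick back-of-the-envelope calculation makes the quadratic cost concrete: every layer and head of a Transformer materializes an n × n attention matrix, so doubling the input length quadruples that memory. The layer and head counts below are BERT-base-like assumptions, and the estimate covers only the attention score matrices.

```python
# A back-of-the-envelope sketch of why full self-attention is quadratic in the
# input length: each layer and head stores an n x n attention matrix.
# Layer/head counts are BERT-base-like assumptions.

def attention_matrix_bytes(seq_len, num_layers=12, num_heads=12, bytes_per_float=4):
    """Approximate memory for the attention score matrices of one example."""
    return num_layers * num_heads * seq_len * seq_len * bytes_per_float

for n in (512, 2048, 8192):
    print(f"seq_len={n:5d}  ~{attention_matrix_bytes(n) / 2**20:8.1f} MiB")
# Quadrupling the length multiplies this cost by 16, which is one reason most
# PTMs truncate inputs at 512 tokens and why efficient attention variants matter.
```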

(3)面向任务的预训练(特殊场景需特殊架构和任务提取部分知识)和模型压缩(NLP的PTM才初研究)

(3) Task-oriented Pre-training and Model Compression

In practice, different downstream tasks require different abilities of PTMs. The discrepancy between PTMs and downstream tasks usually lies in two aspects: model architecture and data distribution. A larger discrepancy may mean that the benefit of PTMs is insignificant. For example, text generation usually needs a specific task to pre-train both the encoder and decoder, while text matching needs pre-training tasks designed for sentence pairs.

Besides, although larger PTMs can usually lead to better performance, a practical problem is how to leverage these huge PTMs in special scenarios, such as low-capacity devices and low-latency applications. Therefore, we can carefully design the specific model architecture and pre-training tasks for downstream tasks or extract partial task-specific knowledge from existing PTMs.

Instead of training task-oriented PTMs from scratch, we can teach them with existing general-purpose PTMs by using techniques such as model compression (see Section 4.5). Although model compression is widely studied for CNNs in CV [246], compression for PTMs for NLP is just beginning. The fully-connected structure of the Transformer also makes model compression more challenging.

(3)面向任务的预训练和模型压缩

在实践中,不同的下游任务对PTM的能力要求不同。PTM与下游任务之间的差异通常体现在两个方面:模型架构和数据分布。较大的差异可能导致PTM带来的好处微不足道。例如,文本生成通常需要一个特定的任务来预训练编码器和解码器,而文本匹配则需要为句子对设计的预训练任务。

此外,尽管较大的PTM通常可以带来更好的性能,但一个实际问题是如何在特殊场景(例如低容量设备低延迟应用程序)中利用这些巨大的PTM。因此,我们可以仔细地为下游任务设计特定的模型架构预训练任务,或者从现有的PTM中提取部分特定任务的知识

我们不需要从头开始训练面向任务的PTM,而是可以借助模型压缩等技术(见第4.5节),用现有的通用PTM来"教"它们。虽然在CV领域[246]中对CNN的模型压缩已有广泛研究,但面向NLP的PTM压缩才刚刚起步。Transformer的全连接结构也使模型压缩更具挑战性。
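For reference, the distillation objective commonly used when compressing PTMs (in the spirit of the knowledge distillation discussed in Section 4.5) combines a temperature-softened soft-target term with the ordinary hard-target loss. The sketch below is a generic formulation; the temperature and mixing weight are chosen only for illustration.

```python
# A minimal sketch of a knowledge-distillation objective for compressing PTMs:
# the student matches the teacher's temperature-softened output distribution
# in addition to the ground-truth labels. Values of T and alpha are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between softened teacher and student outputs.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target term: ordinary cross-entropy with the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```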

(4)、超越微调的知识转移——参数效率低→固定原始参数+自适应模块改进实现共享服务多个下游、挖掘作为外部知识实现特征提取知识蒸馏数据增强

(4) Knowledge Transfer Beyond Fine-tuning

Currently, fine-tuning is the dominant method to transfer PTMs' knowledge to downstream tasks, but one deficiency is its parameter inefficiency: every downstream task has its own fine-tuned parameters. An improved solution is to fix the original parameters of PTMs and to add small fine-tunable adaptation modules for specific tasks [68, 69]. Thus, we can use a shared PTM to serve multiple downstream tasks. Indeed, mining knowledge from PTMs can be more flexible, such as feature extraction, knowledge distillation [210], data augmentation [247, 248], and using PTMs as external knowledge [129]. More efficient methods are expected.

(4)、超越微调的知识转移

目前,微调是将 PTM 的知识迁移到下游任务的主要方法,但其缺点是参数效率低:每个下游任务都有自己的一套微调参数。一种改进的解决方案是固定PTM的原始参数,并为特定任务添加小型可微调的自适应模块[68,69]。这样,我们就可以用一个共享的PTM来服务多个下游任务。事实上,从PTM中挖掘知识可以更加灵活,例如特征提取、知识蒸馏[210]、数据增强[247,248],以及将PTM作为外部知识[129]。人们期望有更高效的方法。
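A bottleneck adapter in the spirit of [69] can be sketched as follows: a small down-projection/up-projection pair with a residual connection is inserted into each Transformer layer, the original PTM parameters are frozen, and only the adapters (plus the task head) are fine-tuned per task. The sizes and exact placement are illustrative assumptions.

```python
# A minimal sketch of a bottleneck adapter module: the shared PTM stays frozen
# and only these small inserted modules are tuned for each downstream task.
# Hidden and bottleneck sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.activation = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen PTM's representation intact.
        return hidden_states + self.up(self.activation(self.down(hidden_states)))

# Freeze the shared PTM and train only the adapters (and the task head):
# for p in pretrained_model.parameters():
#     p.requires_grad = False
```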

(5) PTM可解释性可靠性——Transformer架构解释较难、易受到对抗性攻击(采用对抗性防御)

(5) Interpretability and Reliability of PTMs

Although PTMs reach impressive performance, their deep non-linear architecture makes the procedure of decision-making highly non-transparent.

Recently, explainable artificial intelligence (XAI) [249] has become a hotspot in the general AI community. Unlike CNNs for images, interpreting PTMs is harder due to the complexities of both the Transformer-like architecture and language. Extensive efforts (see Section 3.3) have been made to analyze the linguistic and world knowledge included in PTMs, which help us understand these PTMs with some degree of transparency. However, much work on model analysis depends on the attention mechanism, and the effectiveness of attention for interpretability is still controversial [250, 251].

Besides, PTMs are also vulnerable to adversarial attacks (see Section 7.7). The reliability of PTMs is also becoming an issue of great concern with the extensive use of PTMs in production systems. The studies of adversarial attacks against PTMs help us understand their capabilities by fully exposing their vulnerabilities. Adversarial defenses for PTMs are also promising, which improve the robustness of PTMs and make them immune to adversarial attacks.

Overall, as key components in many NLP applications, the interpretability and reliability of PTMs remain to be explored further in many respects, which helps us understand how PTMs work and provides a guide for better usage and further improvement.

(5) PTM可解释性可靠性

虽然PTM的性能令人印象深刻,但其深层非线性结构使得决策过程高度不透明

近年来,可解释人工智能(XAI)[249]已经成为整个AI社区的一个热点。与用于图像的CNN不同,由于类Transformer架构和语言本身的复杂性,解释PTM更加困难。我们已经做了大量的工作(见3.3节)来分析PTM中包含的语言知识和世界知识,这有助于我们在一定程度上理解这些PTM。然而,许多模型分析工作依赖于注意力(attention)机制,而注意力对于可解释性的有效性仍有争议[250,251]。

此外,PTM也容易受到对抗性攻击(参见7.7节)。随着PTM在生产系统中的广泛使用,PTM的可靠性也成为一个非常值得关注的问题。针对 PTM 的对抗性攻击的研究通过充分暴露其弱点来帮助我们了解它们的能力。对PTM的对抗性防御也很有前途,它提高了PTM的鲁棒性,并使其免受对抗性攻击。

总之,作为许多NLP应用中的关键组件,PTM的可解释性和可靠性在许多方面仍有待进一步研究,这有助于我们了解PTM的工作原理,并为更好地使用和进一步改进提供指导。
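For model analysis of the kind discussed above, attention maps are typically extracted directly from the model; the sketch below does so with the Hugging Face transformers API (a common tooling choice, not the only one), bearing in mind that attention weights are at best a partial and contested explanation [250, 251].

```python
# A minimal sketch of extracting attention maps for model analysis,
# assuming the Hugging Face `transformers` library and a BERT-base model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("Pre-trained models capture linguistic knowledge.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, heads, seq_len, seq_len).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
last_layer = outputs.attentions[-1][0]   # (heads, seq_len, seq_len)
head_avg = last_layer.mean(dim=0)        # average attention over heads
for token, row in zip(tokens, head_avg):
    attended = tokens[row.argmax().item()]
    print(f"{token:>12s} attends most to {attended}")
```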

9 Conclusion结论

In this survey, we conduct a comprehensive overview of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaptation approaches, related resources, and applications. Based on current PTMs, we propose a new taxonomy of PTMs from four different perspectives. We also suggest several possible future research directions for PTMs.

在本次调查中,我们对NLP的PTMs进行了全面的概述,包括背景知识模型架构预训练任务各种扩展适应方法相关资源应用。基于现有的PTM,我们从四个不同的角度提出了一种新的PTM分类法。我们还提出了几个未来可能的研究方向

Acknowledgements

We thank Zhiyuan Liu, Wanxiang Che, Minlie Huang, Danqing Wang and Luyao Huang for their valuable feedback on this manuscript. This work was supported by the National Natural Science Foundation of China (No. 61751201 and 61672162), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and ZJLab.

感谢Zhiyuan Liu、Wanxiang Che、Minlie Huang、Danqing Wang和Luyao Huang对本文的宝贵反馈。本工作得到了国家自然科学基金项目(No. 61751201、61672162)、上海市科技重大专项项目(No. 2018SHZDZX01)和ZJLab的资助。

References

[1] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom.A convolutional neural network for modelling sentences. In ACL, 2014.

[2] Yoon Kim. Convolutional neural networks for sentence classi- fication. In EMNLP, pages 1746–1751, 2014.

[3] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In ICML, pages 1243–1252, 2017.

[4] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In NeurIPS, pages 3104–3112, 2014.

[5] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In IJCAI, 2016.

[6] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642. ACL, 2013.

[7] Kai Sheng Tai, Richard Socher, and Christopher D. Manning.Improved semantic representations from tree-structured long short-term memory networks. In ACL, pages 1556–1566, 2015.

[8] Diego Marcheggiani, Joost Bastings, and Ivan Titov. Ex- ploiting semantics in neural machine translation with graph convolutional networks. In NAACL-HLT, pages 486–492, 2018.

[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio.Neural machine translation by jointly learning to align and translate. In ICLR, 2014.

[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.

[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Cor- rado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013.

[12] Jeffrey Pennington, Richard Socher, and Christopher D. Man- ning. GloVe: Global vectors for word representation. In EMNLP, 2014.

[13] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In NeurIPS, 2017.

[14] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gard- ner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.

[15] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. URL https://s3-us-west-2.amazonaws. com/openai-assets/researchcovers/languageunsupervised/ languageunderstandingpaper.pdf.

[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional trans- formers for language understanding. In NAACL-HLT, 2019.

[17] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Rep- resentation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35 (8):1798–1828, 2013.

[18] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In AAAI, 2016.

[19] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword informa- tion. TACL, 5:135–146, 2017. doi: https://doi/10.1162tacl a 00051.

[20] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.

[21] Sepp Hochreiter and Ju¨ rgen Schmidhuber. Long short-term memory. Neural Computation, 1997. doi: https://doi/10. 1162/neco.1997.9.8.1735.

[22] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.

[23] Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. Long short-term memory over recursive structures. In International Conference on Machine Learning, pages 1604–1612, 2015.

[24] Thomas N Kipf and Max Welling. Semi-supervised classifica- tion with graph convolutional networks. In ICLR, 2017.

[25] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. arXiv preprint arXiv:2106.04554, 2021.

[26] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xi- angyang Xue, and Zheng Zhang. Star-transformer. In NAACL- HLT, pages 1315–1325, 2019.

[27] Dumitru Erhan, Yoshua Bengio, Aaron C. Courville, Pierre- Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res., 11:625–660, 2010. doi: https://dl.acm/doi/10. 5555/1756006.1756025.

[28] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. doi: https://doi.org/10.1126/science.1127647.

[29] GE Hinton, JL McClelland, and DE Rumelhart. Distributed representations. In Parallel distributed processing: explo- rations in the microstructure of cognition, vol. 1: foundations, pages 77–109. 1986.

[30] Yoshua Bengio, Re´jean Ducharme, Pascal Vincent, and Chris- tian Jauvin. A neural probabilistic language model. Jour- nal of machine learning research, 3:1137–1155, 2003. doi: https://dl.acm/doi/10.5555/944919.944966.

[31] Ronan Collobert, Jason Weston, Le´on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 2011. doi: https://dl.acm/doi/10.5555/1953048.2078186.

[32] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.

[33] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In NeurIPS, pages 3294–3302, 2015.

[34] Oren Melamud, Jacob Goldberger, and Ido Dagan. Con- text2Vec: Learning generic context embedding with bidirec- tional LSTM. In CoNLL, pages 51–61, 2016.

[35] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In NeurIPS, pages 3079–3087, 2015.

[36] Prajit Ramachandran, Peter J Liu, and Quoc Le. Unsupervised pretraining for sequence to sequence learning. In EMNLP, pages 383–391, 2017.

[37] Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual string embeddings for sequence labeling. In COLING, pages 1638–1649, 2018.

[38] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, pages 328–339, 2018.

[39] Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettle- moyer, and Michael Auli. Cloze-driven pretraining of self- attention networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, EMNLP-IJCNLP, pages 5359– 5368, 2019.

[40] Wilson L. Taylor. “cloze procedure”: A new tool for measur- ing readability. Journalism Quarterly, 30(4):415–433, 1953. doi: https://doi/10.1177/107769905303000401.

[41] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: masked sequence to sequence pre-training for language generation. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936, 2019.

[42] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Pe- ter J. Liu. Exploring the limits of transfer learning with a uni- fied text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.

[43] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A ro- bustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[44] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language un- derstanding and generation. In NeurIPS, pages 13042–13054, 2019.

[45] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, et al. UniLMv2: Pseudo-masked language models for unified language model pre-training. arXiv preprint arXiv:2002.12804, 2020.

[46] Alexis Conneau and Guillaume Lample. Cross-lingual lan- guage model pretraining. In NeurIPS, pages 7057–7067, 2019.

[47] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre- training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2019. doi: https://doi/10.1162/tacl a 00300.

[48] Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. StructBERT: Incorporating language struc- tures into pre-training for deep language understanding. In ICLR, 2020.

[49] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: General- ized autoregressive pretraining for language understanding. In NeurIPS, pages 5754–5764, 2019.

[50] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.

[51] Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, pages 5628–5637, 2019.

[52] Andriy Mnih and Koray Kavukcuoglu. Learning word embed- dings efficiently with noise-contrastive estimation. In NeurIPS, pages 2265–2273, 2013.

[53] Michael Gutmann and Aapo Hyva¨rinen. Noise-contrastive estimation: A new estimation principle for unnormalized sta- tistical models. In AISTATS, pages 297–304, 2010.

[54] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.

[55] Lingpeng Kong, Cyprien de Masson d’Autume, Lei Yu, Wang Ling, Zihang Dai, and Dani Yogatama. A mutual information maximization perspective of language representation learning. In ICLR, 2019.

[56] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR, 2020.

[57] Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In ICLR, 2020.

[58] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsuper- vised multitask learners. OpenAI Blog, 2019.

[59] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

[60] Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling Mao, and Heyan Huang. Cross-lingual natural language gen- eration via pre-training. In AAAI, 2019.

[61] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine trans- lation. arXiv preprint arXiv:2001.08210, 2020.

[62] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.

[63] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representa- tions. In International Conference on Learning Representa- tions, 2020.

[64] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194– 206, 2019.

[65] Zhongyang Li, Xiao Ding, and Ting Liu. Story ending predic- tion by transferable bert. In IJCAI, pages 1800–1806, 2019.

[66] Suchin Gururangan, Ana Marasovic´, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In ACL, pages 8342–8360, 2020.

[67] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In ACL, 2019.

[68] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In ICML, pages 5986–5995, 2019.

[69] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799, 2019.

[70] Timo Schick and Hinrich Schu¨ tze. It’s not just size that mat- ters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2339–2352. Association for Computational Linguistics, 2021.

[71] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 4222–4235. Association for Computational Linguistics, 2020.

[72] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre- trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.

[73] Karen Hambardzumyan, Hrant Khachatrian, and Jonathan May. WARP: word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121, 2021.

[74] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz- ing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

[75] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. GPT understands, too. arXiv preprint arXiv:2103.10385, 2021.

[76] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: enhanced language representation with informative entities. In ACL, 2019.

[77] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. Knowledge enhanced contextual word representations. In EMNLP-IJCNLP, 2019.

[78] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. K-BERT: Enabling language representation with knowledge graph. In AAAI, 2019.

[79] Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Min- lie Huang. SentiLR: Linguistic knowledge enhanced lan- guage representation for sentiment analysis. arXiv preprint arXiv:1911.02493, 2019.

[80] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang. KEPLER: A unified model for knowledge embedding and pre-trained language representa- tion. arXiv preprint arXiv:1911.06136, 2019.

[81] Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Yaru Hu, Xuanjing Huang, and Zheng Zhang. Colake: Contextualized language and knowledge embedding. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 3660–3670. International Committee on Computational Linguistics, 2020.

[82] Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In EMNLP-IJCNLP, pages 2485–2494, 2019.

[83] Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kadras, Sylvain Gugger, and Jeremy Howard. MultiFiT: Effi- cient multi-lingual language model fine-tuning. In EMNLP- IJCNLP, pages 5701–5706, 2019.

[84] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.

[85] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for chinese BERT. arXiv preprint arXiv:1906.08101, 2019.

[86] Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. NEZHA: Neural contextualized representa- tion for chinese language understanding. arXiv preprint arXiv:1909.00204, 2019.

[87] Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yong- gang Wang. ZEN: pre-training chinese text encoder enhanced by n-gram representations. arXiv preprint arXiv:1911.00720, 2019.

[88] Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nis- sim. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582, 2019.

[89] Louis Martin, Benjamin Mu¨ ller, Pedro Javier Ortiz Sua´rez, Yoann Dupont, Laurent Romary, E´ ric Villemonte de la Clerg- erie, Djame´ Seddah, and Benoˆıt Sagot. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894, 2019.

[90] Hang Le, Lo¨ıc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoˆıt Crabbe´, Laurent Besacier, and Didier Schwab. FlauBERT: Unsupervised language model pre-training for French. arXiv preprint arXiv:1912.05372, 2019.

[91] Pieter Delobelle, Thomas Winters, and Bettina Berendt. Rob- BERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286, 2020.

[92] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViL- BERT: Pretraining task-agnostic visiolinguistic representa- tions for vision-and-language tasks. In NeurIPS, pages 13–23, 2019.

[93] Hao Tan and Mohit Bansal. LXMERT: Learning cross- modality encoder representations from transformers. In EMNLP-IJCNLP, pages 5099–5110, 2019.

[94] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant base- line for vision and language. arXiv preprint arXiv:1908.03557, 2019.

[95] Chris Alberti, Jeffrey Ling, Michael Collins, and David Re- itter. Fusion of detected objects in text for visual question answering. In EMNLP-IJCNLP, pages 2131–2140, 2019.

[96] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual- linguistic representations. In ICLR, 2020.

[97] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, pages 7463–7472. IEEE, 2019.

[98] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.

[99] Yung-Sung Chuang, Chi-Liang Liu, and Hung-yi Lee. SpeechBERT: Cross-modal pre-trained language model for end-to-end spoken question answering. arXiv preprint arXiv:1910.11559, 2019.

[100] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.

[101] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pre-trained language model for scientific text. In EMNLP-IJCNLP, pages 3613–3618, 2019.

[102] Jieh-Sheng Lee and Jieh Hsiang. PatentBERT: Patent clas- sification with fine-tuning a pre-trained BERT model. arXiv preprint arXiv:1906.02124, 2019.

[103] Mitchell A Gordon, Kevin Duh, and Nicholas Andrews. Com- pressing BERT: Studying the effects of weight pruning on transfer learning. arXiv preprint arXiv:2002.08307, 2020.

[104] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q- BERT: Hessian based ultra low precision quantization of BERT. In AAAI, 2020.

[105] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188, 2019.

[106] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

[107] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.

[108] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957, 2020.

[109] Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. BERT-of-Theseus: Compressing BERT by pro- gressive module replacing. arXiv preprint arXiv:2002.02925, 2020.

[110] Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. DeeBERT: Dynamic early exiting for accelerating BERT inference. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2246–2251, Online, July 2020. Association for Computational Linguistics.

[111] Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. The right tool for the job: Matching model and instance complexities. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 6640–6651. Association for Computational Linguistics, 2020.

[112] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. In ACL, 2020.

[113] Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. Bert loses patience: Fast and robust inference with early exit. arXiv preprint arXiv:2006.04152, 2020.

[114] Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, and Bin He. A global past-future early exit method for accelerating inference of pre-trained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2013–2023. Association for Computational Linguistics, 2021.

[115] Tianxiang Sun, Yunhua Zhou, Xiangyang Liu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, and Xipeng Qiu. Early exiting with ensemble internal classifiers. arXiv preprint arXiv: 2105.13792, 2021.

[116] Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, and Xuanjing Huang. Accelerating bert inference for sequence labeling via early-exit. In ACL, 2021.

[117] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguis- tic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.

[118] Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. How well do distributional models capture different types of semantic knowledge? In ACL, pages 726–730, 2015.

[119] Abhijeet Gupta, Gemma Boleda, Marco Baroni, and Sebastian Pado´ . Distributional vectors encode referential attributes. In EMNLP, pages 12–21, 2015.

[120] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Po- liak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. What do you learn from context? probing for sentence structure in contextualized word representations. In ICLR, 2019.

[121] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transfer- ability of contextual representations. In NAACL-HLT, pages 1073–1094, 2019.

[122] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscov- ers the classical NLP pipeline. In Anna Korhonen, David R. Traum, and Llu´ıs Ma`rquez, editors, ACL, pages 4593–4601, 2019.

[123] Yoav Goldberg. Assessing BERT’s syntactic abilities. arXiv preprint arXiv:1901.05287, 2019.

[124] Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. TACL, 8:34–48, 2020.

[125] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In NAACL-HLT, pages 4129–4138, 2019.

[126] Ganesh Jawahar, Benoˆıt Sagot, and Djame´ Seddah. What does BERT learn about the structure of language? In ACL, pages 3651–3657, 2019.

[127] Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In ICLR, 2020.

[128] Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B Vie- gas, Andy Coenen, Adam Pearce, and Been Kim. Visualizing and measuring the geometry of BERT. In NeurIPS, pages 8592–8600, 2019.

[129] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. Language models as knowledge bases? In EMNLP-IJCNLP, pages 2463–2473, 2019.

[130] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neu- big. How can we know what language models know? arXiv preprint arXiv:1911.12543, 2019.

[131] Nina Po¨rner, Ulli Waltinger, and Hinrich Schu¨tze. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. CoRR, abs/1911.03681, 2019.

[132] Nora Kassner and Hinrich Schu¨ tze. Negated LAMA: birds cannot fly. arXiv preprint arXiv:1911.03343, 2019.

[133] Zied Bouraoui, Jose´ Camacho-Collados, and Steven Schock- aert. Inducing relational knowledge from BERT. In AAAI, 2019.

[134] Joe Davison, Joshua Feldman, and Alexander M. Rush. Com- monsense knowledge mining from pretrained models. In EMNLP-IJCNLP, pages 1173–1178, 2019.

[135] Anne Lauscher, Ivan Vulic, Edoardo Maria Ponti, Anna Ko- rhonen, and Goran Glavas. Informing unsupervised pre- training with external linguistic knowledge. arXiv preprint arXiv:1909.02339, 2019.

[136] Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808, 2020.

[137] Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. SenseBERT: Driving some sense into BERT. arXiv preprint arXiv:1908.05646, 2019.

[138] Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. A knowledge-enhanced pretraining model for com- monsense story generation. arXiv preprint arXiv:2001.05139, 2020.

[139] Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, Nicholas Jing Yuan, and Tong Xu. Integrating graph contex- tualized knowledge into pre-trained language models. arXiv preprint arXiv:1912.00147, 2019.

[140] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointly embedding. In EMNLP, pages 1591–1601, 2014.

[141] Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, and Zheng Chen. Aligning knowledge and text embeddings by entity descriptions. In EMNLP, pages 267–272, 2015.

[142] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In IJCAI, 2016.

[143] Jiacheng Xu, Xipeng Qiu, Kan Chen, and Xuanjing Huang. Knowledge graph representation with jointly structural and textual encoding. In IJCAI, pages 1318–1324, 2017.

[144] An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In ACL, pages 2346–2357, 2019.

[145] Robert L. Logan IV, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. Barack’s wife hillary: Using knowledge graphs for fact-aware language modeling. In ACL, 2019.

[146] Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. Latent relation language models. In AAAI, 2019.

[147] Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In EACL, pages 462–471, 2014.

[148] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159, 2015.

[149] Karan Singla, Dog˘an Can, and Shrikanth Narayanan. A multi- task approach to learning multilingual representations. In ACL, pages 214–220, 2018.

[150] Telmo Pires, Eva Schlinger, and Dan Garrette. How multi- lingual is multilingual BERT? In ACL, pages 4996–5001, 2019.

[151] Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. Cross-lingual ability of multilingual BERT: An empirical study. In ICLR, 2020.

[152] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076, 2019.

[153] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE 2.0: A continual pre- training framework for language understanding. In AAAI, 2019.

[154] Yuri Kuratov and Mikhail Arkhipov. Adaptation of deep bidi- rectional multilingual transformers for russian language. arXiv preprint arXiv:1905.07213, 2019.

[155] Wissam Antoun, Fady Baly, and Hazem Hajj. AraBERT: Transformer-based model for Arabic language understanding. arXiv preprint arXiv:2003.00104, 2020.

[156] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, and Ming Zhou. UniViLM: A unified video and language pre-training model for multimodal under- standing and generation. arXiv preprint arXiv:2002.06353, 2020.

[157] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.

[158] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.

[159] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clin- icalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019.

[160] Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDer- mott. Publicly available clinical BERT embeddings. arXiv preprint arXiv:1904.03323, 2019.

[161] Zongcheng Ji, Qiang Wei, and Hua Xu. BERT-based rank- ing for biomedical entity normalization. arXiv preprint arXiv:1908.03548, 2019.

[162] Matthew Tang, Priyanka Gandhi, Md Ahsanul Kabir, Christo- pher Zou, Jordyn Blakey, and Xiao Luo. Progress notes clas- sification and keyword extraction using attention-based deep learning models with BERT. arXiv preprint arXiv:1910.05786, 2019.

[163] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019.

[164] Shaolei Wang, Wanxiang Che, Qi Liu, Pengda Qin, Ting Liu, and William Yang Wang. Multi-task self-supervised learning for disfluency detection. In AAAI, 2019.

[165] Cristian Bucilua, Rich Caruana, and Alexandru Niculescu- Mizil. Model compression. In KDD, pages 535–541, 2006.

[166] Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. Compressing large-scale transformer- based models: A case study on BERT. arXiv preprint arXiv:2002.11985, 2020.

[167] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Ma- honey, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. In ICCV, pages 293–302, 2019.

[168] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[169] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowl- edge distillation for BERT model compression. In EMNLP- IJCNLP, pages 4323–4332, 2019.

[170] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962, 2019.

[171] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yim- ing Yang, and Denny Zhou. MobileBERT: a compact task- agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.

[172] Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. Extreme language model compression with optimal subwords and shared projections. arXiv preprint arXiv:1909.11687, 2019.

[173] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327, 2020.

[174] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, pages 14014–14024, 2019.

[175] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, pages 5797–5808, 2019.

[176] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In ICLR, 2019.

[177] Wenhao Lu, Jian Jiao, and Ruofei Zhang. TwinBERT: Distill- ing knowledge to twin-structured BERT models for efficient retrieval. arXiv preprint arXiv:2002.06275, 2020.

[178] Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazha- gan, Xin Li, and Amelia Archer. Small and practical BERT models for sequence labeling. In EMNLP-IJCNLP, pages 3632–3636, 2019.

[179] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural networks via knowl- edge distillation for natural language understanding. arXiv preprint arXiv:1904.09482, 2019.

[180] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechto- mova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136, 2019.

[181] Yew Ken Chia, Sam Witteveen, and Martin Andrews. Trans- former to CNN: Label-scarce distillation for efficient text classification. arXiv preprint arXiv:1909.03508, 2019.

[182] Surat Teerapittayanon, Bradley McDanel, and H. T. Kung. Branchynet: Fast inference via early exiting from deep neural networks. In 23rd International Conference on Pattern Recog- nition, ICPR 2016, Cancu´ n, Mexico, December 4-8, 2016, pages 2464–2469. IEEE, 2016.

[183] Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras.Shallow-deep networks: Understanding and mitigating net- work overthinking. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceed- ings of Machine Learning Research, pages 3301–3310. PMLR, 2019.

[184] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

[185] Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. In 8th International Confer- ence on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview, 2020.

[186] Yijin Liu, Fandong Meng, Jie Zhou, Yufeng Chen, and Jinan Xu. Faster depth-adaptive transformers. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelli- gence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pages 13424–13432. AAAI Press, 2021.

[187] Keli Xie, Siyuan Lu, Meiqi Wang, and Zhongfeng Wang. El- bert: Fast albert with confidence-window based early exit. In ICASSP 2021-2021 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 7713– 7717. IEEE, 2021.

[188] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009. doi: https://doi.org/10.1109/TKDE.2009.191.

[189] Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. What do neural machine translation models learn about morphology? In ACL, pages 861–872, 2017.

[190] Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Repre- sentation Learning for NLP, RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019, pages 7–14, 2019.

[191] Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. Searching for effective neural extractive summarization: What works and what’s next. In ACL, pages 1049–1058, 2019.

[192] Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. Incorporating BERT into neural machine translation. In ICLR, 2020.

[193] Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020.

[194] Jason Phang, Thibault Fe´vry, and Samuel R Bowman. Sen- tence encoders on STILTs: Supplementary training on inter- mediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.

[195] Siddhant Garg, Thuy Vu, and Alessandro Moschitti. Tanda: Transfer and adapt pre-trained transformer models for answer sentence selection. In AAAI, 2019.

[196] Yige Xu, Xipeng Qiu, Ligao Zhou, and Xuanjing Huang.

Improving BERT fine-tuning via self-ensemble and self- distillation. arXiv preprint arXiv:2002.10345, 2020.

[197] Alexandra Chronopoulou, Christos Baziotis, and Alexandros Potamianos. An embarrassingly simple approach for transfer learning from pretrained language models. In NAACL-HLT, pages 2089–2095, 2019.

[198] Xiang Lisa Li and Jason Eisner. Specializing word embed- dings (for parsing) by information bottleneck. In EMNLP- IJCNLP, pages 2744–2754, 2019.

[199] Teven Le Scao and Alexander M. Rush. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 2627–2636. Association for Computational Linguistics, 2021.

[200] Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19-23, 2021, pages 255–269. Association for Computational Linguistics, 2021.

[201] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. How can we know what language models know. Trans. Assoc. Comput. Linguistics, 8:423–438, 2020.

[202] Chi Sun, Luyao Huang, and Xipeng Qiu. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In NAACL-HLT, 2019.

[203] Guanghui Qin and Jason Eisner. Learning how to ask: Query- ing lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 5203–5212. Association for Computational Linguistics, 2021.

[204] Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual prob- ing is [MASK]: learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 5017–5033. Association for Computational Linguistics, 2021.

[205] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

[206] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. Allennlp: A deep semantic natural language processing platform. 2017.

[207] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caim- ing Xiong, and Richard Socher. CTRL: A conditional trans- former language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.

[208] Jesse Vig. A multiscale visualization of attention in the trans- former model. In ACL, 2019.

[209] Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. exbert: A visual analysis tool to explore learned rep- resentations in transformers models. arXiv preprint arXiv:1910.05276, 2019.

[210] Ziqing Yang, Yiming Cui, Zhipeng Chen, Wanxiang Che, Ting Liu, Shijin Wang, and Guoping Hu. Textbrewer: An open-source knowledge distillation toolkit for natural language processing. arXiv preprint arXiv:2002.12620, 2020.

[211] Yuxuan Wang, Yutai Hou, Wanxiang Che, and Ting Liu. From static to dynamic word representations: a survey. International Journal of Machine Learning and Cybernetics, pages 1–20, 2020. doi: https://doi/10.1007/s13042-020-01069-8.

[212] Qi Liu, Matt J Kusner, and Phil Blunsom. A survey on con- textual embeddings. arXiv preprint arXiv:2003.07278, 2020.

[213] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language under- standing. In ICLR, 2019.

[214] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general- purpose language understanding systems. In NeurIPS, pages 3261–3275, 2019.

[215] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for machine comprehen- sion of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, EMNLP, pages 2383–2392, 2016.

[216] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. TACL, 7:249– 266, 2019. doi: https://doi/10.1162/tacl a 00266.

[217] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi- hop question answering. In EMNLP, pages 2369–2380, 2018.

[218] Zhuosheng Zhang, Junjie Yang, and Hai Zhao. Retrospective reader for machine reading comprehension. arXiv preprint arXiv:2001.09694, 2020.

[219] Ying Ju, Fubang Zhao, Shijie Chen, Bowen Zheng, Xuefeng Yang, and Yunfeng Liu. Technical report on conversational question answering. arXiv preprint arXiv:1909.10772, 2019.

[220] Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xi- aodong He, and Bowen Zhou. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI, 2020.

[221] Enkhbold Bataa and Joshua Wu. An investigation of transfer learning-based sentiment analysis in japanese. In ACL, 2019.

[222] Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. BERT post- training for review reading comprehension and aspect-based sentiment analysis. In NAACL-HLT, 2019.

[223] Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Ste- fan Engl. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. arXiv preprint arXiv:1908.11860, 2019.

[224] Akbar Karimi, Leonardo Rossi, Andrea Prati, and Katharina Full. Adversarial training for aspect-based sentiment analysis with BERT. arXiv preprint arXiv:2001.11316, 2020.

[225] Youwei Song, Jiahai Wang, Zhiwei Liang, Zhiyue Liu, and Tao Jiang. Utilizing BERT intermediate layers for aspect based sentiment analysis and natural language inference. arXiv preprint arXiv:2002.04815, 2020.

[226] Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. Exploit- ing BERT for end-to-end aspect-based sentiment analysis. In W-NUT@EMNLP, 2019.

[227] Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. ”mask and infill” : Applying masked language model to sentiment transfer. In IJCAI, 2019.

[228] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. In ACL, pages 1756–1765, 2017.

[229] Liyuan Liu, Xiang Ren, Jingbo Shang, Xiaotao Gu, Jian Peng, and Jiawei Han. Efficient contextualized representation: Lan- guage model pruning for sequence labeling. In EMNLP, pages 1215–1225, 2018.

[230] Kai Hakala and Sampo Pyysalo. Biomedical named entity recognition with multilingual BERT. In BioNLP Open Shared Tasks@EMNLP, pages 56–61, 2019.

[231] Sergey Edunov, Alexei Baevski, and Michael Auli. Pre-trained language model representations for language generation. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, NAACL-HLT, pages 4052–4059, 2019.

[232] Stephane Clinchant, Kweon Woo Jung, and Vassilina Nikoulina. On the use of BERT for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, 2019.

[233] Kenji Imamura and Eiichiro Sumita. Recycling a pre-trained BERT encoder for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, November 2019.

[234] Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, pages 5059–5069, 2019.

[235] Yang Liu and Mirella Lapata. Text summarization with pre-trained encoders. In EMNLP/IJCNLP, 2019.

[236] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. Extractive summarization as text matching. In ACL, 2020.

[237] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT really robust? Natural language attack on text classification and entailment. In AAAI, 2019.

[238] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP-IJCNLP, pages 2153–2162, 2019.

[239] Lichao Sun, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. Adv-BERT: BERT is not robust on misspellings! generating nature adversarial samples on BERT. arXiv preprint arXiv:2003.04985, 2020.

[240] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial attack against BERT using BERT. arXiv preprint arXiv:2004.09984, 2020.

[241] Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding. In ICLR, 2020.

[242] Xiulei Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.

[243] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.

[244] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Atten- tive language models beyond a fixed-length context. In ACL, pages 2978–2988, 2019.

[245] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.

[246] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.

[247] Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. Conditional BERT contextual augmentation. In International Conference on Computational Science, pages 84–95, 2019.

[248] Varun Kumar, Ashutosh Choudhary, and Eunah Cho. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245, 2020.

[249] Alejandro Barredo Arrieta, Natalia Díaz-Rodríguez, Javier Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, Salvador García, Sergio Gil-López, Daniel Molina, Richard Benjamins, et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115, 2020.

[250] Sarthak Jain and Byron C Wallace. Attention is not explanation. In NAACL-HLT, pages 3543–3556, 2019.

[251] Sofia Serrano and Noah A Smith. Is attention interpretable? In ACL, pages 2931–2951, 2019.
