
First, a link to the original paper:

https://arxiv.org/pdf/1706.03762.pdf

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Summary: mainstream sequence transduction models are mostly built from encoders and decoders based on complex recurrent networks or CNNs, and the best-performing ones also connect the encoder and decoder through an attention mechanism. Here we propose a new, simple network architecture, the Transformer, which needs no recurrence or convolution and relies only on attention. Experiments show the model performs very well, with higher parallelism and shorter training time. (The rest of the abstract reports results on machine translation benchmarks, with absolute and relative scores as well as training cost and time.) We find that the Transformer also generalizes well to other tasks and performs well on both large and small datasets.

The rough structure of the abstract is: the field's dominant approach + an overview of our model + model properties + experimental evidence (absolute scores + relative scores + training cost) + an outlook for the model.

1 Introduction

Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].

Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
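To make the sequential bottleneck concrete, here is a minimal NumPy sketch of the recurrence described above: each hidden state h_t is a function of h_{t-1} and the input at position t, so the loop over time steps cannot be parallelized within a single sequence. This is my own illustration, not code from the paper; all names and shapes are placeholders.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b):
    """Plain RNN over one sequence; x has shape (T, d_in).

    Each step needs the previous hidden state, so the loop over t
    is inherently sequential -- the constraint the paper points out
    for recurrent models.
    """
    T, _ = x.shape
    d_h = W_hh.shape[0]
    h = np.zeros(d_h)
    states = []
    for t in range(T):                       # cannot be parallelized across t
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b)
        states.append(h)
    return np.stack(states)                  # (T, d_h)

# Illustrative shapes only.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))                  # T=6 tokens, d_in=4
h = rnn_forward(x, rng.normal(size=(4, 8)),
                0.1 * rng.normal(size=(8, 8)), np.zeros(8))
print(h.shape)                               # (6, 8)
```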

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.

In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
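For contrast, here is a minimal sketch of scaled dot-product attention, the building block the Transformer relies on: it computes softmax(Q K^T / sqrt(d_k)) V with dense matrix products, so every output position attends to every input position at once, with no step-by-step loop like the RNN above. Again this is my own NumPy illustration with placeholder shapes, not the paper's multi-head implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (T_q, d_k), K: (T_k, d_k), V: (T_k, d_v).
    All positions are handled by two matrix products, so the whole
    sequence is processed in parallel.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T_q, T_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (T_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (6, 16)
```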

RNNs, and LSTM and GRU in particular, have become the state-of-the-art approaches for sequence modeling tasks such as language modeling and machine translation.

RNN sequence models: poor parallelism and slow training; for long sequences, earlier information is badly lost. Despite remedial work, the problem remains.

Attention mechanisms play an important role in sequence modeling, but most existing instances are used together with an RNN.

Our work is based only on attention; our advantages are... (much the same as in the abstract).

The pattern: the current state of research, what contributions prior work has made, what shortcomings remain, and how our model manages to address those shortcomings.

2 Background

The goal of reducing sequential computation ...

Tags: classic paper series, NLP, Attention