
DL (RNN/LSTM/GRU): Translation and Interpretation of "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling"

Contents

Translation and Interpretation of "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling"

Abstract

1 Introduction

Figure 1: Illustration of (a) LSTM and (b) gated recurrent units

3 Gated Recurrent Neural Networks

3.1 Long Short-Term Memory Unit

3.2 Gated Recurrent Unit

3.3 Discussion

5 Results and Analysis

Figure 2: Learning curves (polyphonic music datasets)

6 Conclusion

Figure 3: Learning curves (Ubisoft datasets)


Translation and Interpretation of "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling"

Link

Paper link: https://arxiv.org/abs/1412.3555

Date

December 11, 2014

Authors

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, Yoshua Bengio

Abstract

In this paper we compare different types of recurrent units in recurrent neural networks (RNNs). Especially, we focus on more sophisticated units that implement a gating mechanism, such as a long short-term memory (LSTM) unit and a recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on the tasks of polyphonic music modeling and speech signal modeling. Our experiments revealed that these advanced recurrent units are indeed better than more traditional recurrent units such as tanh units. Also, we found GRU to be comparable to LSTM.

In this paper, we compare different types of recurrent units in recurrent neural networks (RNNs). We focus in particular on the more sophisticated units that implement a gating mechanism, such as the long short-term memory (LSTM) unit and the recently proposed gated recurrent unit (GRU). We evaluate these recurrent units on polyphonic music modeling and speech signal modeling tasks. The experiments show that these advanced recurrent units indeed outperform traditional units such as tanh units, and that the GRU performs comparably to the LSTM.

1 Introduction

Recurrent neural networks have recently shown promising results in many machine learning tasks, especially when input and/or output are of variable length [see, e.g., Graves, 2012]. More recently, Sutskever et al. [2014] and Bahdanau et al. [2014] reported that recurrent neural networks are able to perform as well as the existing, well-developed systems on a challenging task of machine translation.

One interesting observation we make from these recent successes is that almost none of these successes were achieved with a vanilla recurrent neural network. Rather, it was a recurrent neural network with sophisticated recurrent hidden units, such as long short-term memory units [Hochreiter and Schmidhuber, 1997], that was used in those successful applications.

In recent years, recurrent neural networks have shown promising results in many machine learning tasks, especially when the input and/or output are of variable length [see, e.g., Graves, 2012]. More recently, Sutskever et al. [2014] and Bahdanau et al. [2014] reported that recurrent neural networks can perform as well as existing, well-developed systems on the challenging task of machine translation.

An interesting observation from these recent successes is that almost none of them were achieved with a vanilla recurrent neural network. Rather, the successful applications used recurrent neural networks with sophisticated recurrent hidden units, such as the long short-term memory unit [LSTM, Hochreiter and Schmidhuber, 1997].

Among those sophisticated recurrent units, in this paper, we are interested in evaluating two closely related variants. One is a long short-term memory (LSTM) unit, and the other is a gated recurrent unit (GRU) proposed more recently by Cho et al. [2014]. It is well established in the field that the LSTM unit works well on sequence-based tasks with long-term dependencies, but the latter has only recently been introduced and used in the context of machine translation.

In this paper, we evaluate these two units and a more traditional tanh unit on the task of sequence modeling. We consider three polyphonic music datasets [see, e.g., Boulanger-Lewandowski et al., 2012] as well as two internal datasets provided by Ubisoft in which each sample is a raw speech representation.

Based on our experiments, we concluded that, with a fixed number of parameters for all models, the GRU can outperform LSTM units on some datasets, both in terms of convergence in CPU time and in terms of parameter updates and generalization.

Among these sophisticated recurrent units, this paper focuses on evaluating two closely related variants. One is the long short-term memory (LSTM) unit, and the other is the gated recurrent unit (GRU) recently proposed by Cho et al. [2014]. It is well established in the field that the LSTM unit performs well on sequence tasks with long-term dependencies, whereas the latter has only recently been introduced and used in the context of machine translation.

In this paper, we evaluate these two units as well as the traditional tanh unit on sequence modeling tasks. We consider three polyphonic music datasets and two internal datasets provided by Ubisoft, in which each sample is a raw speech representation.

Based on our experiments, with the same number of parameters for all models, the GRU can outperform LSTM units on some datasets in terms of CPU time to convergence, parameter updates, and generalization.

Figure 1: Illustration of (a) LSTM and (b) gated recurrent units. (a) $i$, $f$ and $o$ are the input, forget and output gates, respectively. $c$ and $\tilde{c}$ denote the memory cell and the new memory cell content. (b) $r$ and $z$ are the reset and update gates, and $h$ and $\tilde{h}$ are the activation and the candidate activation.

Figure 1: Illustration of (a) the LSTM and (b) the gated recurrent unit. (a) $i$, $f$ and $o$ are the input, forget and output gates, respectively; $c$ and $\tilde{c}$ denote the memory cell and the new memory cell content. (b) $r$ and $z$ are the reset and update gates, and $h$ and $\tilde{h}$ are the activation and the candidate activation.

3 Gated Recurrent Neural Networks

In this paper, we are interested in evaluating the performance of those recently proposed recurrent units (LSTM unit and GRU) on sequence modeling. Before the empirical evaluation, we first describe each of those recurrent units in this section.

In this paper, we are interested in evaluating the performance of the recently proposed recurrent units (the LSTM unit and the GRU) on sequence modeling. Before the empirical evaluation, we first describe each of these recurrent units in this section.

3.1 Long Short-Term Memory Unit

The Long Short-Term Memory (LSTM) unit was initially proposed by Hochreiter and Schmidhuber [1997]. Since then, a number of minor modifications to the original LSTM unit have been made. We follow the implementation of LSTM as used in Graves [2013].

Unlike a recurrent unit which simply computes a weighted sum of the input signal and applies a nonlinear function, each $j$-th LSTM unit maintains a memory $c_t^j$ at time $t$. The output $h_t^j$, or the activation, of the LSTM unit is then

The long short-term memory (LSTM) unit was originally proposed by Hochreiter and Schmidhuber [1997]. Since then, a number of minor modifications have been made to the original LSTM unit. We follow the LSTM implementation used in Graves [2013].

Unlike a recurrent unit that simply computes a weighted sum of the input signal and applies a nonlinear function, each $j$-th LSTM unit maintains a memory $c_t^j$ at time $t$. The output $h_t^j$, or activation, of the LSTM unit is then:
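For reference, a sketch of the LSTM update equations in the formulation the paper follows (Graves [2013], with peephole connections), where $\sigma$ is the logistic sigmoid and the $V$ matrices are diagonal:

$$ h_t^j = o_t^j \tanh\!\left(c_t^j\right), \qquad o_t^j = \sigma\!\left( W_o x_t + U_o h_{t-1} + V_o c_t \right)^j $$

$$ c_t^j = f_t^j c_{t-1}^j + i_t^j \tilde{c}_t^j, \qquad \tilde{c}_t^j = \tanh\!\left( W_c x_t + U_c h_{t-1} \right)^j $$

$$ f_t^j = \sigma\!\left( W_f x_t + U_f h_{t-1} + V_f c_{t-1} \right)^j, \qquad i_t^j = \sigma\!\left( W_i x_t + U_i h_{t-1} + V_i c_{t-1} \right)^j $$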

Unlike the traditional recurrent unit, which overwrites its content at each time-step (see Eq. (2)), an LSTM unit is able to decide whether to keep the existing memory via the introduced gates. Intuitively, if the LSTM unit detects an important feature from an input sequence at an early stage, it easily carries this information (the existence of the feature) over a long distance, hence capturing potential long-distance dependencies.

Unlike the traditional recurrent unit, which overwrites its content at every time step (see Eq. (2)), the LSTM unit can decide, via the introduced gates, whether to keep the existing memory. Intuitively, if the LSTM unit detects an important feature in the input sequence at an early stage, it can easily carry this information (the existence of the feature) over a long distance, thereby capturing potential long-distance dependencies.
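As a concrete illustration of this gated update, here is a minimal NumPy sketch of a single LSTM step in the spirit of the Graves [2013] formulation above (peephole connections included). The parameter names, shapes, and the omission of bias terms are simplifying assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: the memory cell is updated additively and the output
    gate controls how much of it is exposed. `p` is a hypothetical dict of
    weights; biases are omitted for brevity."""
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["v_i"] * c_prev)  # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["v_f"] * c_prev)  # forget gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev)                  # new memory content
    c_t = f_t * c_prev + i_t * c_tilde                                     # keep old memory, add new
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["v_o"] * c_t)     # output gate
    h_t = o_t * np.tanh(c_t)                                               # exposed activation
    return h_t, c_t

# Tiny usage example with random weights.
n_in, n_hid = 8, 16
rng = np.random.default_rng(0)
p = {k: 0.1 * rng.standard_normal((n_hid, n_in if k.startswith("W") else n_hid))
     for k in ("W_i", "U_i", "W_f", "U_f", "W_c", "U_c", "W_o", "U_o")}
p.update({k: 0.1 * rng.standard_normal(n_hid) for k in ("v_i", "v_f", "v_o")})
h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```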

3.2 Gated Recurrent Unit

A gated recurrent unit (GRU) was proposed by Cho et al. [2014] to make each recurrent unit adaptively capture dependencies of different time scales. Similarly to the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, however, without having separate memory cells.

The gated recurrent unit (GRU) was proposed by Cho et al. [2014] to allow each recurrent unit to adaptively capture dependencies of different time scales. Like the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but it has no separate memory cells.

The activation $h_t^j$ of the GRU at time $t$ is a linear interpolation between the previous activation $h_{t-1}^j$ and the candidate activation $\tilde{h}_t^j$:

$$ h_t^j = \left(1 - z_t^j\right) h_{t-1}^j + z_t^j\, \tilde{h}_t^j $$

where an update gate $z_t^j$ decides how much the unit updates its activation, or content. The update gate is computed by

$$ z_t^j = \sigma\!\left( W_z x_t + U_z h_{t-1} \right)^j $$

This procedure of taking a linear sum between the existing state and the newly computed state is similar to the LSTM unit. The GRU, however, does not have any mechanism to control the degree to which its state is exposed, but exposes the whole state each time.
The candidate activation $\tilde{h}_t^j$ is computed similarly to that of the traditional recurrent unit (see Eq. (2)) and as in [Bahdanau et al., 2014],

$$ \tilde{h}_t^j = \tanh\!\left( W x_t + U\left( r_t \odot h_{t-1} \right) \right)^j $$

where $r_t$ is a set of reset gates and $\odot$ is an element-wise multiplication. When off ($r_t^j$ close to 0), the reset gate effectively makes the unit act as if it is reading the first symbol of an input sequence, allowing it to forget the previously computed state.

The reset gate $r_t^j$ is computed similarly to the update gate:

$$ r_t^j = \sigma\!\left( W_r x_t + U_r h_{t-1} \right)^j $$

See Fig. 1 (b) for the graphical illustration of the GRU.

The activation $h_t^j$ of the GRU at time $t$ is a linear interpolation between the previous activation $h_{t-1}^j$ and the candidate activation $\tilde{h}_t^j$:

$$ h_t^j = \left(1 - z_t^j\right) h_{t-1}^j + z_t^j\, \tilde{h}_t^j $$

where the update gate $z_t^j$ decides how much the unit updates its activation, or content. The update gate is computed by

$$ z_t^j = \sigma\!\left( W_z x_t + U_z h_{t-1} \right)^j $$

This procedure of taking a linear sum between the existing state and the newly computed state is similar to the LSTM unit. The GRU, however, has no mechanism to control the degree to which its state is exposed; it exposes the whole state each time.

The candidate activation $\tilde{h}_t^j$ is computed similarly to that of the traditional recurrent unit (see Eq. (2)) and as in [Bahdanau et al., 2014]:

$$ \tilde{h}_t^j = \tanh\!\left( W x_t + U\left( r_t \odot h_{t-1} \right) \right)^j $$

where $r_t$ is a set of reset gates and $\odot$ denotes element-wise multiplication. When the reset gate is off ($r_t^j$ close to 0), it effectively makes the unit act as if it were reading the first symbol of an input sequence, allowing it to forget the previously computed state.

The reset gate $r_t^j$ is computed similarly to the update gate:

$$ r_t^j = \sigma\!\left( W_r x_t + U_r h_{t-1} \right)^j $$

See Fig. 1 (b) for a graphical illustration of the GRU.
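For comparison with the LSTM sketch above, here is a minimal NumPy sketch of one GRU step implementing the four equations in this subsection. Again, the parameter names and the omission of bias terms are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step: no separate memory cell, and the whole state is exposed.
    `p` is a hypothetical dict of weights; biases are omitted for brevity."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))  # candidate activation
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # linear interpolation

# Tiny usage example with random weights.
n_in, n_hid = 8, 16
rng = np.random.default_rng(0)
p = {k: 0.1 * rng.standard_normal((n_hid, n_in if k.startswith("W") else n_hid))
     for k in ("W_z", "U_z", "W_r", "U_r", "W", "U")}
h_t = gru_step(rng.standard_normal(n_in), np.zeros(n_hid), p)
```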

3.3 Discussion

It is easy to notice similarities between the LSTM unit and the GRU from Fig. 1.

The most prominent feature shared between these units is the additive component of their update from t to t + 1, which is lacking in the traditional recurrent unit. The traditional recurrent unit always replaces the activation, or the content of a unit with a new value computed from the current input and the previous hidden state. On the other hand, both LSTM unit and GRU keep the existing content and add the new content on top of it (see Eqs. (4) and (5)).

The similarities between the LSTM unit and the GRU are easy to see in Fig. 1.

The most prominent feature shared by these units is the additive component of their update from $t$ to $t+1$, which the traditional recurrent unit lacks. The traditional recurrent unit always replaces the activation, or the content of the unit, with a new value computed from the current input and the previous hidden state. In contrast, both the LSTM unit and the GRU keep the existing content and add the new content on top of it (see Eqs. (4) and (5)).

This additive nature has two advantages. First, it is easy for each unit to remember the existence of a specific feature in the input stream for a long series of steps. Any important feature, decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but be maintained as it is.

Second, and perhaps more importantly, this addition effectively creates shortcut paths that bypass multiple temporal steps. These shortcuts allow the error to be back-propagated easily without too quickly vanishing (if the gating unit is nearly saturated at 1) as a result of passing through multiple, bounded nonlinearities, thus reducing the difficulty due to vanishing gradients [Hochreiter, 1991, Bengio et al., 1994].

These two units, however, have a number of differences as well. One feature of the LSTM unit that is missing from the GRU is the controlled exposure of the memory content. In the LSTM unit, the amount of the memory content that is seen, or used, by other units in the network is controlled by the output gate. On the other hand, the GRU exposes its full content without any control.

This additive nature has two advantages. First, it is easy for each unit to remember the existence of a specific feature in the input stream over a long series of steps. Any important feature, as decided by either the forget gate of the LSTM unit or the update gate of the GRU, will not be overwritten but will be maintained as it is.

Second, and perhaps more importantly, this addition effectively creates shortcut paths that bypass multiple time steps. These shortcuts allow the error to be back-propagated easily without vanishing too quickly (as long as the gating unit is nearly saturated at 1) as a result of passing through multiple bounded nonlinearities, thereby reducing the difficulty caused by vanishing gradients [Hochreiter, 1991, Bengio et al., 1994].

However, these two units also have a number of differences. One feature of the LSTM unit that the GRU lacks is the controlled exposure of the memory content. In the LSTM unit, the amount of memory content that is seen, or used, by other units in the network is controlled by the output gate. The GRU, by contrast, exposes its full content without any control.

Another difference is in the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control of the amount of information flowing from the previous time step. Rather, the LSTM unit controls the amount of the new memory content being added to the memory cell independently from the forget gate. On the other hand, the GRU controls the information flow from the previous activation when computing the new, candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate).

From these similarities and differences alone, it is difficult to conclude which types of gating units would perform better in general. Although Bahdanau et al. [2014] reported that these two units performed comparably to each other according to their preliminary experiments on machine translation, it is unclear whether this applies as well to tasks other than machine translation. This motivates us to conduct more thorough empirical comparison between the LSTM unit and the GRU in this paper.

Another difference lies in the location of the input gate, or the corresponding reset gate. The LSTM unit computes the new memory content without any separate control over the amount of information flowing from the previous time step; instead, it controls the amount of new memory content added to the memory cell independently of the forget gate. The GRU, on the other hand, controls the information flow from the previous activation when computing the new candidate activation, but does not independently control the amount of the candidate activation being added (the control is tied via the update gate).

From these similarities and differences alone, it is difficult to conclude which type of gating unit performs better in general. Although Bahdanau et al. [2014] reported that the two units performed comparably in their preliminary experiments on machine translation, it is unclear whether this also holds for tasks other than machine translation. This motivates us to conduct a more thorough empirical comparison between the LSTM unit and the GRU in this paper.
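To make the shortcut-path argument above concrete, the toy sketch below (an illustrative setup, not an experiment from the paper) compares how sensitive the state after 50 steps is to the initial state under an overwrite-style tanh update versus a gated additive interpolation with a mostly closed update gate.

```python
import numpy as np

T = 50
rng = np.random.default_rng(1)
xs = rng.standard_normal(T)

def run_overwrite(h0):
    """Traditional tanh unit: the state is rewritten through a squashing
    nonlinearity at every step."""
    h = h0
    for x in xs:
        h = np.tanh(0.5 * x + 0.5 * h)
    return h

def run_gated(h0, z=0.05):
    """Gated additive update with a mostly closed update gate: most of the
    old state is carried over unchanged, keeping a direct path back to h0."""
    h = h0
    for x in xs:
        h = (1.0 - z) * h + z * np.tanh(0.5 * x)
    return h

# Finite-difference estimate of d h_T / d h_0 for both update rules.
eps = 1e-6
sens_overwrite = (run_overwrite(1.0 + eps) - run_overwrite(1.0)) / eps
sens_gated = (run_gated(1.0 + eps) - run_gated(1.0)) / eps
print(f"sensitivity to h_0 after {T} steps: "
      f"overwrite {sens_overwrite:.2e}, gated {sens_gated:.2e}")
```

With the gate nearly closed, the dependence on the initial state shrinks only as $(1-z)^T$, whereas the overwrite update multiplies in a bounded derivative at every step, so the dependence effectively vanishes.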

5 Results and Analysis

Table 2 lists all the results from our experiments. In the case of the polyphonic music datasets, the GRU-RNN outperformed all the others (LSTM-RNN and tanh-RNN) on all the datasets except for the Nottingham. However, we can see that on these music datasets, all the three models performed closely to each other.

On the other hand, the RNNs with the gating units (GRU-RNN and LSTM-RNN) clearly outperformed the more traditional tanh-RNN on both of the Ubisoft datasets. The LSTM-RNN was best with the Ubisoft A, and with the Ubisoft B, the GRU-RNN performed best.

Table 2 lists all the results from our experiments. On the polyphonic music datasets, the GRU-RNN outperformed the other models (the LSTM-RNN and the tanh-RNN) on all datasets except Nottingham. However, on these music datasets all three models performed close to one another.

On the other hand, the RNNs with gating units (the GRU-RNN and the LSTM-RNN) clearly outperformed the more traditional tanh-RNN on both Ubisoft datasets. The LSTM-RNN performed best on Ubisoft A, and the GRU-RNN performed best on Ubisoft B.

In Figs. 2–3, we show the learning curves of the best validation runs. In the case of the music datasets (Fig. 2), we see that the GRU-RNN makes faster progress in terms of both the number of updates and actual CPU time. If we consider the Ubisoft datasets (Fig. 3), it is clear that although the computational requirement for each update of the tanh-RNN is much smaller than that of the other models, it did not make much progress per update and eventually stopped making progress at a much worse level.

These results clearly indicate the advantages of the gating units over the more traditional recurrent units. Convergence is often faster, and the final solutions tend to be better. However, our results are not conclusive in comparing the LSTM and the GRU, which suggests that the choice of the type of gated recurrent unit may depend heavily on the dataset and corresponding task.

In Figs. 2–3, we show the learning curves of the best validation runs. For the music datasets (Fig. 2), the GRU-RNN makes faster progress in terms of both the number of updates and actual CPU time. For the Ubisoft datasets (Fig. 3), it is clear that although each update of the tanh-RNN is computationally much cheaper than for the other models, it made little progress per update and eventually stopped improving at a much worse level.

These results clearly indicate the advantages of gating units over the more traditional recurrent units. Convergence is often faster, and the final solutions tend to be better. However, our results are not conclusive in comparing the LSTM and the GRU, which suggests that the choice of gated recurrent unit may depend heavily on the dataset and the corresponding task.

Figure 2: Learning curves for training and validation sets of different types of units with respect to (top) the number of iterations and (bottom) the wall clock time. The y-axis corresponds to the negative log-likelihood of the model, shown in log-scale.

Figure 2: Learning curves for the training and validation sets of different types of units with respect to (top) the number of iterations and (bottom) the wall clock time. The y-axis corresponds to the negative log-likelihood of the model, shown in log-scale.

6 Conclusion

In this paper we empirically evaluated recurrent neural networks (RNNs) with three widely used recurrent units: (1) a traditional tanh unit, (2) a long short-term memory (LSTM) unit and (3) a recently proposed gated recurrent unit (GRU). Our evaluation focused on the task of sequence modeling on a number of datasets including polyphonic music data and raw speech signal data.

The evaluation clearly demonstrated the superiority of the gated units, both the LSTM unit and the GRU, over the traditional tanh unit. This was more evident with the more challenging task of raw speech signal modeling. However, we could not make a concrete conclusion on which of the two gating units was better.

In this paper, we empirically evaluated recurrent neural networks (RNNs) with three widely used recurrent units: (1) the traditional tanh unit, (2) the long short-term memory (LSTM) unit, and (3) the recently proposed gated recurrent unit (GRU). Our evaluation focused on sequence modeling on a number of datasets, including polyphonic music data and raw speech signal data.

The evaluation clearly demonstrated the superiority of the gated units (the LSTM unit and the GRU) over the traditional tanh unit. This was even more evident on the more challenging task of raw speech signal modeling. However, we could not reach a definite conclusion as to which of the two gating units is better.

We consider the experiments in this paper as preliminary. In order to better understand how a gated unit helps learning, and to separate out the contribution of each component of the gating units, for instance the individual gates in the LSTM unit or the GRU, more thorough experiments will be required in the future.

We regard the experiments in this paper as preliminary. To better understand how a gated unit helps learning, and to separate out the contribution of each component, for instance the individual gates in the LSTM unit or the GRU, more thorough experiments will be required in the future.

Figure 3: Learning curves for training and validation sets of different types of units with respect to (top) the number of iterations and (bottom) the wall clock time. The x-axis is the number of epochs and the y-axis corresponds to the negative log-likelihood of the model, shown in log-scale.

Figure 3: Learning curves for the training and validation sets of different types of units with respect to (top) the number of iterations and (bottom) the wall clock time. The x-axis is the number of epochs and the y-axis corresponds to the negative log-likelihood of the model, shown in log-scale.
