


by Thomas Simonini

Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

This article is part of Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

In our last article about Deep Q Learning with Tensorflow, we implemented an agent that learns to play a simple version of Doom. In the video version, we trained a DQN agent that plays Space Invaders.

However, during the training, we saw that there was a lot of variability.


Deep Q-Learning was introduced in 2014. Since then, a lot of improvements have been made. So, today we’ll see four strategies that dramatically improve the training and the results of our DQN agents:

  • fixed Q-targets

  • double DQNs

  • dueling DQN (aka DDQN)

  • Prioritized Experience Replay (aka PER)


We’ll implement an agent that learns to play the Doom Deadly Corridor scenario. Our AI must navigate towards the fundamental goal (the vest) and, at the same time, make sure it survives by killing enemies.

Fixed Q-targets

Theory

We saw in the Deep Q Learning article that, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target (Q_target) and the current Q value (estimation of Q).


But we don’t have any idea of the real TD target. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.
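
Written out, the estimate described above looks roughly like this. The notation is an assumption chosen for illustration (it is not taken from the article's figures): w denotes the network weights, γ the discount rate, y the TD target and δ the TD error.

```latex
y = R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; w),
\qquad
\delta = y - Q(S_t, A_t; w)
```

Note that the same weights w appear in both the target y and the current estimate Q(S_t, A_t; w).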

However, the problem is that we are using the same parameters (weights) for estimating both the target and the Q value. As a consequence, there is a strong correlation between the TD target and the parameters (w) we are changing.

This means that at every step of training, our Q values shift but the target value shifts too. So, we’re getting closer to our target but the target is also moving. It’s like chasing a moving target! This leads to big oscillations in training.

Imagine you are a cowboy (the Q estimation) and you want to catch a cow (the Q-target): you must get closer to it (reduce the error).


At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).


This leads to a very strange path of chasing (a big oscillation in training).


Instead, we can use the idea of fixed Q-targets introduced by DeepMind:


  • Using a separate network with fixed parameters (let’s call them w-) for estimating the TD target.

  • Every Tau steps, we copy the parameters from our DQN network to update the target network.


Thanks to this procedure, we’ll have more stable learning because the target function stays fixed for a while.
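
The procedure can be sketched without any framework. The snippet below is a minimal illustration, not the course's Tensorflow implementation: the linear Q function, the helper names (`q_values`, `td_target`, `train_step`, `sync_target`), the random transitions and the values of `GAMMA`, `TAU` and `ALPHA` are all assumptions made for the example. The point to notice is that the TD target is computed with the frozen weights w-, which are only refreshed every `TAU` steps.

```python
import copy
import random

GAMMA = 0.95   # discount rate (assumed value)
TAU = 100      # copy w into w- every TAU steps (assumed value)
ALPHA = 0.1    # learning rate (assumed value)


def q_values(weights, state):
    """Linear stand-in for a Q network: one weight vector per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def td_target(reward, next_state, done, target_weights):
    """r + gamma * max_a Q(s', a; w-), computed with the frozen weights."""
    if done:
        return reward
    return reward + GAMMA * max(q_values(target_weights, next_state))


def train_step(weights, target_weights, transition):
    """One gradient step on the squared TD error; only w is modified."""
    state, action, reward, next_state, done = transition
    target = td_target(reward, next_state, done, target_weights)
    td_error = target - q_values(weights, state)[action]
    # For a linear Q, the gradient w.r.t. the chosen action's weights
    # is simply the state vector.
    weights[action] = [w + ALPHA * td_error * x
                       for w, x in zip(weights[action], state)]
    return td_error


def sync_target(weights):
    """Return an independent copy of w to be used as w-."""
    return copy.deepcopy(weights)


random.seed(0)
n_actions, n_features = 3, 4
weights = [[random.uniform(-0.1, 0.1) for _ in range(n_features)]
           for _ in range(n_actions)]
target_weights = sync_target(weights)

for step in range(1, 501):
    # Random transitions stand in for samples from the replay memory.
    state = [random.random() for _ in range(n_features)]
    next_state = [random.random() for _ in range(n_features)]
    transition = (state, random.randrange(n_actions), random.random(),
                  next_state, False)
    train_step(weights, target_weights, transition)
    if step % TAU == 0:
        target_weights = sync_target(weights)
```

Between two syncs, `train_step` changes `weights` but never `target_weights`, so the target the estimate is chasing stays put for `TAU` steps.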


Implementation

Implementing fixed Q-targets is pretty straightforward:


  • First, we create two networks (DQNetwork, TargetNetwork)

  • Then, we create a function that will take our DQNetwork parameters and copy them to our TargetNetwork


  • Finally, during the tr

