by Thomas Simonini

Improvements in Deep Q Learning: Dueling Double DQN, Prioritized Experience Replay, and fixed Q-targets

This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.

In our last article about Deep Q Learning with Tensorflow, we implemented an agent that learns to play a simple version of Doom. In the video version, we trained a DQN agent that plays Space Invaders.

However, during the training, we saw that there was a lot of variability.

Deep Q-Learning was introduced in 2014. Since then, a lot of improvements have been made. So, today we’ll see four strategies that dramatically improve the training and the results of our DQN agents:

  • fixed Q-targets
  • double DQNs
  • dueling DQN (aka DDQN)
  • Prioritized Experience Replay (aka PER)

We’ll implement an agent that learns to play the Doom Deadly Corridor scenario. Our AI must navigate towards the fundamental goal (the vest) and make sure it survives at the same time by killing enemies.

Fixed Q-targets

Theory

We saw in the Deep Q Learning article that, when we want to calculate the TD error (aka the loss), we calculate the difference between the TD target (Q_target) and the current Q value (estimation of Q).

But we don’t have any idea of the real TD target. We need to estimate it. Using the Bellman equation, we saw that the TD target is just the reward of taking that action at that state plus the discounted highest Q value for the next state.

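In symbols (a sketch using generic notation, where w denotes the network weights and y_t is just a label chosen here for the TD target):

    y_t = r_{t+1} + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; w)   % TD target
    \delta_t = y_t - \hat{Q}(s_t, a_t; w)                      % TD error (the loss)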

However, the problem is that we are using the same parameters (weights) for estimating the target and the Q value. As a consequence, there is a big correlation between the TD target and the parameters (w) we are changing.

Therefore, at every step of training, both our Q values and the target values shift. So, we’re getting closer to our target, but the target is also moving. It’s like chasing a moving target! This leads to big oscillations in training.

It’s as if you were a cowboy (the Q estimation) trying to catch a cow (the Q-target): you must get closer (reduce the error).

At each time step, you’re trying to approach the cow, which also moves at each time step (because you use the same parameters).

This leads to a very strange path of chasing (a big oscillation in training).

Instead, we can use the idea of fixed Q-targets introduced by DeepMind:

  • Using a separate network with a fixed parameter (let’s call it w-) for estimating the TD target.
  • At every Tau step, we copy the parameters from our DQN network to update the target network.

Thanks to this procedure, we’ll have more stable learning because the target function stays fixed for a while.
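
With fixed Q-targets, the TD target from above is instead computed with the frozen copy of the weights (same notation as before, with w^- denoting the fixed parameters):

    y_t = r_{t+1} + \gamma \max_{a'} \hat{Q}(s_{t+1}, a'; w^-)   % target network
    w^- \leftarrow w \quad \text{every } \tau \text{ steps}      % periodic copy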

Implementation

Implementing fixed q-targets is pretty straightforward:

  • First, we create two networks (DQNetwork, TargetNetwork)
  • Then, we create a function that will take our DQNetwork parameters and copy them to our TargetNetwork
  • Finally, during the training, we calculate the TD target using our target network and update the target network with the DQNetwork parameters every tau steps, as sketched below.
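
A minimal sketch of that copy step, assuming TensorFlow 1.x and that both networks are built with identical architectures under variable scopes named "DQNetwork" and "TargetNetwork" (the function name update_target_graph is just an illustrative choice here):

    import tensorflow as tf

    def update_target_graph():
        # Collect the trainable variables of each network by scope name.
        # This assumes both networks create their variables in the same order.
        from_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "DQNetwork")
        to_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "TargetNetwork")

        # One assign op per variable: TargetNetwork <- DQNetwork
        op_holder = []
        for from_var, to_var in zip(from_vars, to_vars):
            op_holder.append(to_var.assign(from_var))
        return op_holder

During training, these assign ops would then be run every tau steps (for example with sess.run inside the training loop), so the target network stays frozen between copies.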
