
Deep Reinforcement Learning Explained — 13

In this new post of the Deep Reinforcement Learning Explained series, we will introduce Monte Carlo Methods, one of the classical approaches to Reinforcement Learning, alongside Dynamic Programming (covered in the first part of this series) and Temporal Difference Learning (which we will cover in the following post). We will also discuss how to estimate the optimal policy and the Exploration-Exploitation Dilemma.


Monte Carlo versus Dynamic Programming

In Part 1 of this series, we presented a solution to MDPs called Dynamic Programming, pioneered by Richard Bellman. Remember that the Bellman equation allows us to define the value function recursively, and it can be solved with the Value Iteration algorithm. To summarize, Dynamic Programming provides a foundation for reinforcement learning, but we need to loop through all the states on every iteration (and the state space can grow exponentially in size, becoming very large or even infinite). Dynamic Programming also requires a model of the Environment, specifically knowing the state-transition probability p(s′,r|s,a).


In contrast, Monte Carlo methods are all about learning from experience. Any expected value can be approximated by sample means — in other words, all we need to do is play a bunch of episodes, gather the returns, and average them. Monte Carlo methods are actually a set of alternatives to the basic algorithm. These are only for episodic tasks, where the interaction stops when the Agent encounters a terminal state. That is, we assume experience is divided into episodes, and that all episodes eventually terminate no matter what actions are selected.

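As a minimal sketch of this idea (a toy example, not from the original post): the expected value of a fair die roll can be estimated as a sample mean, in exactly the same way that MC methods estimate expected returns by averaging sampled returns.

import numpy as np

# Any expected value can be approximated by a sample mean.
# Here we estimate the expected value of a fair six-sided die (true value 3.5)
# by averaging many sampled outcomes.
rng = np.random.default_rng(0)
samples = rng.integers(1, 7, size=100_000)
print(samples.mean())  # close to 3.5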

It’s important to note that Monte Carlo methods only give us a value for states and actions we’ve encountered, and if we never encounter a state its value is unknown.


Monte Carlo Methods

This post will provide a practical approach to Monte Carlo methods used in Reinforcement Learning. For a more formal explanation of the methods, I invite the reader to read Chapter 5 of the textbook Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.


Recall that the optimal policy π∗ specifies, for each Environment state s, how the Agent should select an action a towards its goal of maximising the return G. We also learned that the Agent could structure its search for an optimal policy by first estimating the optimal action-value function q∗; then, once q∗ is known, π∗ is quickly obtained.


The Agent starts with a basic policy, like the equiprobable random policy: a stochastic policy where, from each state, the Agent randomly selects one of the available actions, each with equal probability. The Agent uses it to collect some episodes and then consolidates the results to arrive at a better policy.


The way to do it is by estimating the action-value function with a table we will call Q-table. This core table in Monte Carlo Methods has a row for each state and a column for each action. The entry corresponding to state s and action a is denoted Q(s,a).

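As a minimal sketch (an illustration, not code from the post), such a Q-table can be represented as a dictionary mapping each state to an array of action values, which is the same defaultdict representation used later in this post:

import numpy as np
from collections import defaultdict

n_actions = 2                                  # number of columns (one per action)
Q = defaultdict(lambda: np.zeros(n_actions))   # one row (array of action values) per state

state = (13, 10, False)                        # any hashable state representation works as a key
print(Q[state])                                # the row for this state, initialised to zeros
print(Q[state][1])                             # the entry Q(s, a) for action a = 1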

We refer to this as the prediction problem: given a policy, how might the Agent estimate the value function for that policy? We refer to Monte Carlo (MC) approaches to the prediction problem as MC prediction methods.


We will focus our explanation on the action-value function, but the “prediction problem” also refers to approaches that can be used to estimate the state-value function.


In the algorithm for MC prediction, we begin by collecting many episodes with the policy. Then, we note that each entry in the Q-table corresponds to a particular state and action. To populate an entry of the Q-table, we use the return that followed when the Agent was in that state and chose the action.


We define every occurrence of a state-action pair in an episode as a visit to that state-action pair. A state-action pair may be visited more than once within an episode. This leads to two versions of the MC prediction algorithm:


  • Every-visit MC Prediction: Average the returns following all visits to each state-action pair, in all episodes.


  • First-visit MC Prediction: For each episode, we only consider the first visit to the state-action pair.


Both the first-visit and the every-visit methods converge to the true action-value function as the number of visits to each state-action pair goes to infinity.


In this post, we will implement the first-visit variant for our working example from OpenAI Gym: the BlackJack Environment. In this particular example, however, first-visit and every-visit MC return the same results: the same state never recurs within one episode, so there is no difference between the two methods. The pseudocode for First-Visit MC Prediction can be found below:

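In outline (a sketch that follows from the tables described next, not a verbatim copy of the original figure):

Input: a policy π, a positive integer num_episodes
Initialize: N(s, a) = 0 and returns_sum(s, a) = 0 for all states s and actions a
Repeat num_episodes times:
    Generate an episode S0, A0, R1, ..., S(T-1), A(T-1), RT following π
    For each time step t of the episode:
        If (St, At) is visited for the first time in this episode:
            N(St, At) <- N(St, At) + 1
            returns_sum(St, At) <- returns_sum(St, At) + Gt,
                where Gt = R(t+1) + γ R(t+2) + ... + γ^(T-t-1) RT
Output: Q(s, a) = returns_sum(s, a) / N(s, a) for every visited pair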

In this pseudocode, the variable num_episodes indicates the number of episodes the Agent collects, and there are three relevant tables:


  • Q: A Q-table with a row for each state and a column for each action.


  • N : A table that keeps track of the number of first visits we have made to each state-action pair.


  • returns_sum: A table that keeps track of the sum of the returns obtained after the first visits to each state-action pair.


After each episode, the N and returns_sum tables are updated to store the information contained in the episode. The final estimate for Q (the Q-table) is computed after all of the episodes have been collected.


Case study: BlackJack

To better understand how Monte Carlo works in practice, let's code it step by step for the game of Blackjack using OpenAI Gym's Blackjack-v0 Environment.


BlackJack rules

To begin with, let’s define the rules and conditions of our game:


  • The object of the popular casino card game of blackjack is to obtain cards whose numerical values sum as close to 21 as possible without exceeding it. Going over 21 results in a bust, while both parties having 21 results in a draw.

  • We consider a simplified version in which each player competes independently against the dealer. This means that we’ll be playing against the dealer only, no other players will be participating.

  • The value of numerical cards is at face value. The value of cards J, K, and Q is 10. The value of ace can be 1 or 11 depending on the player’s choice.

  • Both parties, the player and the dealer, are given two cards. The player's two cards are face-up, while only one of the dealer's cards is face-up. After seeing their own cards and the dealer's first card, the player can repeatedly choose to hit (take another card) until satisfied with the sum, after which they stand.

  • The dealer then reveals their second card; if the sum is less than 17, they continue drawing cards until reaching 17 or more, after which they stand.


OpenAI Gym: BlackJack Environment

We’ll use OpenAI’s gym Environment Blackjack-v0. Each state is a 3-tuple of:


  • the player’s current sum ∈{0,1,…,31}

  • the dealer’s face-up card ∈{1,…,10}

  • whether or not the player has a usable ace.


The Agent has two potential actions:


  • Stick (action 0): take no more cards (also known as “stand” or “stay”).


  • Hit (action 1): take another card from the dealer.


Each game of blackjack is an episode. Rewards of +1, −1, and 0 are given for winning, losing, and drawing, respectively. All rewards within a game are 0, and we do not discount (gamma = 1); therefore these terminal rewards are also the returns.


The entire code of this post can be found on GitHub and can be run as a Google Colab notebook using this link.


We begin by importing the necessary packages:


import sys
import gym
import numpy as np
from collections import defaultdict

env = gym.make('Blackjack-v0')
print(env.observation_space)
print(env.action_space)

We can see that there are 704 different states in the Environment, corresponding to 32 × 11 × 2, and two possible actions, corresponding to choosing to stick or to hit:


Tuple(Discrete(32), Discrete(11), Discrete(2)) 
Discrete(2)

To see a sample game, showing each time step at which the Agent interacts with the Environment during an episode, we can run the following code (multiple times):


state = env.reset()
while True:
    print(state)
    action = env.action_space.sample()   # pick a random action
    state, reward, done, info = env.step(action)
    if done:
        if reward > 0:
            print('Reward: ', reward, '(Player won)\n')
        else:
            # note: a draw (reward 0) is also reported as a loss here
            print('Reward: ', reward, '(Player lost)\n')
        break

One example of a sample game is:


(21, 4, True) 
(13, 4, False)
(14, 4, False)
Reward: -1.0

where the Agent lost the game. I suggest the reader run this code several times to see different plays.


A simple MC Prediction Method implementation

Consider a policy where the player decides an action based only on their current score. For instance, if the sum of the cards we have is 18 or less, we figure it is probably okay to ask for a new card, and we do that with 75% probability. If the sum of the cards is greater than 18, we consider it too dangerous to accept a new card, and we decline with 75% probability.


In summary, if the sum is greater than 18 we select the stick action with 75% probability; and, if the sum is 18 or below, we select the hit action with 75% probability.


Notice that in this first policy there is information in the state that we are ignoring, for instance the dealer's face-up card or whether we have a usable ace. This is to simplify the example and focus the explanation on the code. The following function generate_episode samples an episode using this policy:


def generate_episode(env):
    episode = []
    state = env.reset()
    while True:
        probs = [0.75, 0.25] if state[0] > 18 else [0.25, 0.75]
        action = np.random.choice(np.arange(2), p=probs)
        next_state, reward, done, info = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if done:
            break
    return episode

This code returns an episode (generated with the chosen policy π) as a list of (state, action, reward) tuples. episode[i][0], episode[i][1], and episode[i][2] correspond to the state at time step i, the action at time step i, and the reward at time step i+1, respectively.


In order to play 10 games, we can execute the previous function as follows:


for i in range(10):
    print(generate_episode(env))

The output will be (in my run):


[((19, 3, True), 1, 0.0), ((19, 3, False), 0, 1.0)] 
[((14, 5, False), 1, -1.0)]
[((15, 10, False), 0, -1.0)]
[((15, 10, False), 1, 0.0), ((17, 10, False), 1, -1.0)]
[((13, 2, False), 1, -1.0)]
[((13, 4, False), 1, 0.0), ((14, 4, False), 0, -1.0)]
[((11, 10, False), 1, 0.0), ((21, 10, False), 0, 0.0)]
[((15, 10, False), 1, -1.0)]
[((9, 9, False), 1, 0.0), ((16, 9, False), 0, 1.0)]
[((13, 7, False), 1, 0.0), ((18, 7, False), 1, 0.0), ((21, 7, False), 1, -1.0)]

The rewards are received only at the end of the game and are 1.0 if we win and -1.0 if we lose. We see that sometimes we win and sometimes we lose.


Let’s start to program a code based on the previous pseudocode. First, we need to initialize the dictionaries N, return_sum, and Q:


N = defaultdict(lambda: np.zeros(env.action_space.n))
returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
Q = defaultdict(lambda: np.zeros(env.action_space.n))

Next, the algorithm loops over episodes generated using the provided function generate_episode that uses the policy. Each episode is a list of (state, action, reward) tuples. Then we use the zip command to separate the states, actions, and rewards into separate sequences:


episode = generate_episode(env)
states, actions, rewards = zip(*episode)

Let’s back to de pseudocode to see that we loop over time steps to look at the corresponding state action pair for each time step. If that’s the first time that we visited that pair we increment the corresponding position of table N by one and add the return at this time step in the corresponding entry of the table return_sum.


Remember that in this Blackjack example first-visit and every-visit MC prediction are equivalent, so we will do this update for every time step. Then, once we have the corresponding updated values of returns_sum and N, we can use them to update our estimated Q-table. The code for this part is the following:


discounts = np.array([gamma**i for i in range(len(rewards)+1)])

for i, state in enumerate(states):
    returns_sum[state][actions[i]] += sum(rewards[i:] * discounts[:-(1+i)])
    N[state][actions[i]] += 1.0
    Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]
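
Because states do not recur within a Blackjack episode, the code above skips the explicit first-visit check. For an Environment where a state-action pair can recur within an episode, a hedged sketch of the same update with the check added (first_visit is a hypothetical helper name, not from the post) could look like this:

# Sketch only: drop-in variant of the loop above with an explicit first-visit check,
# using the same states, actions, rewards, discounts, returns_sum, N and Q variables.
first_visit = set()  # state-action pairs already updated in this episode
for i, state in enumerate(states):
    if (state, actions[i]) in first_visit:
        continue  # skip later visits to the same pair
    first_visit.add((state, actions[i]))
    returns_sum[state][actions[i]] += sum(rewards[i:] * discounts[:-(1+i)])
    N[state][actions[i]] += 1.0
    Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]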

The complete code of our MC prediction method is as follows:


def mc_prediction(env, num_episodes, generate_episode, gamma=1.0):
    # initialize empty dictionaries of arrays
    returns_sum = defaultdict(lambda: np.zeros(env.action_space.n))
    N = defaultdict(lambda: np.zeros(env.action_space.n))
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    # loop over episodes
    for i_episode in range(1, num_episodes+1):
        # generate an episode and unpack it
        episode = generate_episode(env)
        states, actions, rewards = zip(*episode)
        # prepare the discount factors
        discounts = np.array([gamma**i for i in range(len(rewards)+1)])
        # update the tables for every (state, action) pair in the episode
        for i, state in enumerate(states):
            returns_sum[state][actions[i]] += sum(rewards[i:] * discounts[:-(1+i)])
            N[state][actions[i]] += 1.0
            Q[state][actions[i]] = returns_sum[state][actions[i]] / N[state][actions[i]]
    return Q

The algorithm takes as arguments an instance of an OpenAI Gym Environment, the number of episodes to generate, the function that generates an episode, and the discount rate (with default value 1). It returns the Q-table (an estimate of the action-value function) as a dictionary of one-dimensional arrays.


It will be interesting to plot the corresponding state-value function to see which states have more value. We can obtain it from the Q-table with this simple code, which computes the value of each state by weighting the value of each action with the probability the policy assigns to it:


num_episodes = 1000000
Q = mc_prediction(env, num_episodes, generate_episode)

# V(s) weights Q(s, a) by the policy's action probabilities:
# [0.75, 0.25] when the player sum is above 18, [0.25, 0.75] otherwise
State_Value_table = {}
for state, actions in Q.items():
    State_Value_table[state] = (state[0] > 18) * np.dot([0.75, 0.25], actions) \
                             + (state[0] <= 18) * np.dot([0.25, 0.75], actions)

We could plot this with plotting code borrowed from Udacity. There are two plots, corresponding to whether or not we have a usable ace. In either case, we see that the highest state values correspond to a player sum of around 20 or 21, which seems obvious because in these cases we are most likely to win the game.

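The Udacity plotting helper is not reproduced here; as a rough substitute, a minimal sketch (assuming matplotlib is available; the heatmap layout is my assumption, not the original figure) that draws the no-usable-ace values could be:

import matplotlib.pyplot as plt

# Sketch: heatmap of V(s) for states without a usable ace,
# indexed by player sum (rows) and dealer face-up card (columns).
values = np.zeros((22, 11))  # player sums 0..21, dealer cards 0..10
for (player_sum, dealer_card, usable_ace), v in State_Value_table.items():
    if not usable_ace and player_sum <= 21:
        values[player_sum, dealer_card] = v

plt.imshow(values, origin='lower', aspect='auto')
plt.xlabel('Dealer showing')
plt.ylabel('Player sum')
plt.title('State-value function (no usable ace)')
plt.colorbar()
plt.show()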

Exploration vs. Exploitation

So far, we have learned how an Agent can take a policy like the equiprobable random policy, use it to interact with the Environment, and then use that experience to populate the corresponding Q-table, which becomes an estimate of that policy's action-value function. So, the question now is: how can we use this in our search for an optimal policy?


Greedy Policies

Well, to get a better policy (not necessarily the optimal one), we need only select, for each state, the action that maximizes the Q-table. Let's call that new policy π'. When we take a Q-table and use the action that maximizes each row to come up with the policy, we say that we are constructing the policy that is greedy with respect to the Q-table.


It is common to refer to the selected action as the greedy action. In the case of a finite MDP, the action-value function estimate is represented in a Q-table. Then, to get the greedy action, for each row in the table, we need only select the action corresponding to the column that maximizes that row.

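As an illustration (a sketch based on the dictionary-of-arrays Q-table built above; greedy_policy is an illustrative name, not from the post), extracting the greedy action for every visited state is a one-liner with np.argmax:

# greedy policy: for each state (row), pick the action (column) with the largest value
greedy_policy = {state: int(np.argmax(action_values)) for state, action_values in Q.items()}
print(greedy_policy.get((13, 10, False)))   # e.g., the greedy action for one Blackjack state, if visited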

Epsilon-Greedy policies

However, instead of always constructing a greedy policy (always selecting the greedy action), what we will do instead is construct a so-called Epsilon-Greedy policy that is most likely to pick the greedy action, but with some small but nonzero probability picks one of the other actions instead. In this case, some small positive number epsilon ϵ, which must be between zero and one, is used. This is motivated, as we will explain in more detail later, by the fact that an Agent must find a way to balance the drive to behave optimally based on its current knowledge and the need to acquire knowledge to attain better future behavior.

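A minimal sketch of such a policy (an illustrative helper, not code from the post): with probability 1 − ϵ it picks the greedy action from the Q-table, and with probability ϵ it picks among all actions uniformly at random.

def epsilon_greedy_action(Q, state, epsilon, n_actions):
    # every action gets probability epsilon / n_actions,
    # and the greedy action gets the remaining 1 - epsilon on top
    probs = np.ones(n_actions) * epsilon / n_actions
    probs[np.argmax(Q[state])] += 1.0 - epsilon
    return np.random.choice(np.arange(n_actions), p=probs)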

Monte Carlo Control

We have learned how the Agent can take a policy π, use it to interact with the Environment for many episodes, and then use the results to estimate the action-value function with a Q-table. Once the Q-table closely approximates the action-value function, the Agent can construct the policy π′ that is ϵ-greedy with respect to the Q-table, which will yield a policy that is better than the original policy π. Then we can improve the policy again by making it ϵ-greedy with respect to the new Q-table.


So, we will eventually obtain the optimal policy π∗ if the Agent alternates between these two steps over and over, obtaining successively better and better policies, in the hope that we converge to the optimal policy:


  • Step 1: using the policy π to construct the Q-table, and


  • Step 2: improving the policy by changing it to be ϵ-greedy with respect to the Q-table (denoted ϵ-greedy(Q)).


This proposed algorithm gets us very close to the optimal policy, as long as we run it for long enough.


We call this algorithm the Monte Carlo control method; it estimates the optimal policy. It is common to refer to Step 1 as policy evaluation, since it is used to determine the action-value function of the policy. Likewise, since Step 2 is used to improve the policy, we refer to it as the policy improvement step. Schematically, we could represent it as shown below:

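policy π ──(Step 1: policy evaluation)──> Q-table ──(Step 2: ϵ-greedy(Q) policy improvement)──> improved policy π' ──> ...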

Therefore, using this new terminology, we can summarize that our Monte Carlo control method alternates between policy evaluation and policy improvement steps to find the optimal policy π∗, as in the following sequence:

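π0 ──E──> q_π0 ──I──> π1 ──E──> q_π1 ──I──> π2 ──E──> ... ──I──> π∗ ──E──> q∗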

where the “E” in the arrow denotes a complete policy evaluation and “I” denotes a complete policy improvement. In the next post we will present an implementation of the Monte Carlo Control Algorithm.


However, a question arises: how do we set the value of ϵ when constructing ϵ-greedy policies? In the next section, we will see how.


Exploration-Exploitation Dilemma

Recall that the Environment's dynamics are initially unknown to our Agent, and since the goal is to maximize return, the Agent must learn about the Environment through interaction. Then, at every time step, when the Agent selects an action, it bases its decision on past experience with the Environment. The instinct could be to select the action that, based on past experience, will maximize return. As we discussed in a previous section, this policy, which is greedy with respect to the action-value function estimate, can easily lead to convergence to a sub-optimal policy.


Balance exploitation and exploration

This may happen because, in the early episodes, the Agent's knowledge is quite limited, and it dismisses non-greedy actions that are better in terms of future rewards than the known actions. That means that a successful Agent cannot act greedily at every time step; instead, in order to discover the optimal policy, it has to continue to refine the estimated return for all state-action pairs, while at the same time maintaining a certain level of greedy actions to pursue the goal of maximizing return as quickly as possible. This motivated the idea of the ϵ-greedy policy that we presented before.


We refer to the need to balance these two competing requirements as the Exploration-Exploitation Dilemma, where an Agent must find a way to balance the drive to behave optimally based on their current knowledge (exploitation) and the need to acquire knowledge to attain better judgment (exploration).


Setting the Value of Epsilon

One potential solution to this dilemma is to gradually modify the value of ϵ when constructing ϵ-greedy policies. It makes sense for the Agent to begin its interaction with the Environment by favoring exploration over exploitation, trying various strategies for maximizing return. With this in mind, the best starting policy is the equiprobable random policy, as it is equally likely to explore all possible actions from each state. Setting ϵ=1 yields an ϵ-greedy policy that is equivalent to the equiprobable random policy.


At later time steps, it makes sense to favor exploitation over exploration, where the policy gradually becomes more greedy with respect to the action-value function estimate. Setting ϵ=0 yields the greedy policy. It has been shown that initially favoring exploration over exploitation and gradually preferring exploitation over exploration is an optimal strategy.


In order to guarantee that MC control converges to the optimal policy π∗, we need to ensure that two conditions are met: every state-action pair is visited infinitely many times, and the policy converges to a policy that is greedy with respect to the action-value function estimate Q. We refer to these conditions as Greedy in the Limit with Infinite Exploration (GLIE); they ensure that the Agent continues to explore for all time steps and that the Agent gradually exploits more and explores less.


One way to satisfy these conditions is to modify the value of ϵ, making it gradually decay, when specifying an ϵ-greedy policy. However, one has to be very careful with setting the decay rate for ϵ. Determining the best decay is not trivial and requires a bit of alchemy, that is, experience. We will see an example of an implementation in the next post.

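As a simple illustration (a sketch only; the decay schedule and the eps_start, eps_decay, and eps_min values below are arbitrary assumptions, not taken from the post), a common pattern is to multiply ϵ by a decay factor after every episode, never letting it fall below a small minimum:

eps_start, eps_decay, eps_min = 1.0, 0.9999, 0.05   # assumed values, for illustration only
epsilon = eps_start
for i_episode in range(1, 10000 + 1):
    # ... generate an episode with an epsilon-greedy policy and update Q here ...
    epsilon = max(epsilon * eps_decay, eps_min)      # gradually favor exploitation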

What is next?

So far, in our current algorithm for Monte Carlo control, we collect a large number of episodes to build the Q-table. Then, after the values in the Q-table have converged, we use the table to come up with an improved policy. However, Monte Carlo prediction methods can be implemented incrementally, on an episode-by-episode basis. In the next post, we will present how to build better MC control algorithms following this idea. See you in the next post!


Translated from: https://towardsdatascience/monte-carlo-methods-9b289f030c2e
