This post is a set of reinforcement learning notes, based mainly on the following materials:

  • Reinforcement Learning: An Introduction
  • All code is taken from GitHub
  • Exercise answers reference GitHub

Contents

  • Gambler’s Problem
  • Exercise
  • Code

Gambler’s Problem

A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of $100, or loses by running out of money. On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars.

This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler’s capital, $s \in \{1, 2, \ldots, 99\}$, and the actions are stakes, $a \in \{0, 1, \ldots, \min(s, 100-s)\}$. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is $+1$. The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let $p_h$ denote the probability of the coin coming up heads. If $p_h$ is known, then the entire problem is known and it can be solved, for instance, by value iteration.
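
Concretely, with the dummy terminal values $v(0) = 0$ and $v(100) = 1$ absorbing the $+1$ reward, the value-iteration backup for this problem takes the form used in the code below:

$$v_{k+1}(s) = \max_{a \in \{0, 1, \ldots, \min(s,\, 100-s)\}} \left[\, p_h\, v_k(s + a) + (1 - p_h)\, v_k(s - a) \,\right]$$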

The policy shown in Figure 4.3 is optimal, but not unique. In fact, there is a whole family of optimal policies, all corresponding to ties for the argmax action selection with respect to the optimal value function. Can you guess what the entire family looks like?

Exercise 4.8
Why does the optimal policy for the gambler’s problem have such a curious form? In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does not. Why is this a good policy?
ANSWER
The optimal policy has this curious form because from a capital of 50 the gambler can win in a single flip with probability $p_h$. The best policy therefore bets everything at a capital of 50, and likewise at capitals such as 25 from which one all-in bet reaches 50.

Think of a capital of 51 as 50 plus 1. The gambler could bet everything, but it is better to see how much can be earned from the extra 1 dollar. If that extra return $g$ is positive, the gambler effectively has $g$ more dollars to keep staking until reaching 75, where another single-flip chance of winning appears. If instead he first bets 50 out of 51, his chance of winning on that flip is only $p_h$ and he gives up the chance to reach 75; worse, if he loses the bet he is left with only 1 dollar and must first struggle back to 25, a much weaker position.

Conclusion: the indicated optimal policy creates more chances to win and leaves the gambler better off when he loses.
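
As a rough sanity check for the book’s $p_h = 0.4$ case (assuming, as Figure 4.3 suggests, that staking everything at a capital of 50 is optimal there), the value of betting all 50 dollars is

$$p_h \cdot v(100) + (1 - p_h) \cdot v(0) = 0.4 \cdot 1 + 0.6 \cdot 0 = 0.4,$$

which matches the value estimate the code below produces at a capital of 50.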

Exercise

Implement value iteration for the gambler’s problem and solve it for $p_h = 0.25$ and $p_h = 0.55$. In programming, you may find it convenient to introduce two dummy states corresponding to termination with capital of 0 and 100, giving them values of 0 and 1 respectively. Show your results graphically, as in Figure 4.3.

Code

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail)             #
# 2016 Kenta Shimada(hyperkentakun@gmail)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

matplotlib.use('Agg')
# goal
GOAL = 100

# all states, including state 0 and state 100
STATES = np.arange(GOAL + 1)

# probability of head
HEAD_PROB = 0.4
def figure_4_3():
    # state value
    state_value = np.zeros(GOAL + 1)
    state_value[GOAL] = 1.0

    sweeps_history = []

    # value iteration
    while True:
        old_state_value = state_value.copy()
        sweeps_history.append(old_state_value)

        for state in STATES[1:GOAL]:
            # get possible actions for current state
            actions = np.arange(min(state, GOAL - state) + 1)
            action_returns = []
            for action in actions:
                action_returns.append(
                    HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])
            new_value = np.max(action_returns)
            state_value[state] = new_value
        delta = abs(state_value - old_state_value).max()
        if delta < 1e-9:
            sweeps_history.append(state_value)
            break

    # compute the optimal policy
    policy = np.zeros(GOAL + 1)
    for state in STATES[1:GOAL]:
        actions = np.arange(min(state, GOAL - state) + 1)
        action_returns = []
        for action in actions:
            action_returns.append(
                HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])

        # round to resemble the figure in the book
        # The [1:] avoids choosing the '0' action, which changes neither the state nor the expected return.
        # Since numpy.argmax picks the first option in case of ties, rounding the near-ties ensures the smallest action (bet) among them is selected.
        policy[state] = actions[np.argmax(np.round(action_returns[1:], 5)) + 1]

    plt.figure(figsize=(10, 20))

    plt.subplot(2, 1, 1)
    for sweep, state_value in enumerate(sweeps_history):
        plt.plot(state_value, label='sweep {}'.format(sweep))
    plt.xlabel('Capital')
    plt.ylabel('Value estimates')
    plt.legend(loc='best')

    plt.subplot(2, 1, 2)
    plt.scatter(STATES, policy)
    plt.xlabel('Capital')
    plt.ylabel('Final policy (stake)')

    plt.savefig('../images/figure_4_3.png')
    plt.close()


if __name__ == '__main__':
    figure_4_3()
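
To address the exercise above for $p_h = 0.25$ and $p_h = 0.55$, one small variation is to parameterize the head probability and reuse the same sweep. This is a minimal sketch, not part of the original repository (the function names and output filename are my own choices); it assumes the imports and constants defined in the script above.

def gambler_value_iteration(head_prob, threshold=1e-9):
    # value iteration with dummy terminal states: value 0 at capital 0, value 1 at the goal
    state_value = np.zeros(GOAL + 1)
    state_value[GOAL] = 1.0
    while True:
        old_state_value = state_value.copy()
        for state in STATES[1:GOAL]:
            actions = np.arange(min(state, GOAL - state) + 1)
            returns = [head_prob * state_value[state + a] +
                       (1 - head_prob) * state_value[state - a] for a in actions]
            state_value[state] = np.max(returns)
        if abs(state_value - old_state_value).max() < threshold:
            return state_value


def solve_exercise():
    # plot the converged value estimates for the two head probabilities in the exercise
    for head_prob in (0.25, 0.55):
        values = gambler_value_iteration(head_prob)
        plt.plot(STATES[1:GOAL], values[1:GOAL], label='p_h = {}'.format(head_prob))
    plt.xlabel('Capital')
    plt.ylabel('Value estimates')
    plt.legend(loc='best')
    plt.savefig('exercise_values.png')
    plt.close()

For $p_h = 0.55$ the coin favors the gambler, and betting small is then generally the better way to reach the goal, so the resulting policy looks very different from the bold bets of Figure 4.3.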
