This post is a set of reinforcement learning notes, based mainly on the following materials:

  • Reinforcement Learning: An Introduction
  • All code is taken from GitHub
  • Exercise answers reference GitHub

Contents

  • Gambler’s Problem
  • Exercise
  • Code

Gambler’s Problem

A gambler has the opportunity to make bets on the outcomes of a sequence of coin flips. If the coin comes up heads, he wins as many dollars as he has staked on that flip; if it is tails, he loses his stake. The game ends when the gambler wins by reaching his goal of $100, or loses by running out of money. On each flip, the gambler must decide what portion of his capital to stake, in integer numbers of dollars.

This problem can be formulated as an undiscounted, episodic, finite MDP. The state is the gambler’s capital, $s \in \{1, 2, \ldots, 99\}$, and the actions are stakes, $a \in \{0, 1, \ldots, \min(s, 100-s)\}$. The reward is zero on all transitions except those on which the gambler reaches his goal, when it is $+1$. The state-value function then gives the probability of winning from each state. A policy is a mapping from levels of capital to stakes. The optimal policy maximizes the probability of reaching the goal. Let $p_h$ denote the probability of the coin coming up heads. If $p_h$ is known, then the entire problem is known and it can be solved, for instance, by value iteration.
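
Concretely, with the dummy terminal values $v(0) = 0$ and $v(100) = 1$ absorbing the $+1$ reward, the value-iteration backup for this problem takes the form used in the code below:

$$v_{k+1}(s) = \max_{a \in \{0, 1, \ldots, \min(s,\, 100-s)\}} \left[\, p_h\, v_k(s + a) + (1 - p_h)\, v_k(s - a) \,\right]$$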

The policy shown in Figure 4.3 is optimal, but not unique. In fact, there is a whole family of optimal policies, all corresponding to ties for the argmax action selection with respect to the optimal value function. Can you guess what the entire family looks like?

Exercise 4.8
Why does the optimal policy for the gambler’s problem have such a curious form? In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does not. Why is this a good policy?
ANSWER
The optimal policy has this curious form because from a capital of 50 the gambler can win in a single flip with probability $p_h$. The best policy therefore bets everything at a capital of 50, and likewise at capitals such as 25 from which one all-in bet reaches 50.

Think of a capital of 51 as 50 plus 1. The gambler could bet everything, but it is better to see how much can be earned from the extra 1 dollar. If that extra return $g$ is positive, the gambler effectively has $g$ more dollars to keep staking until reaching 75, where another single-flip chance of winning appears. If instead he first bets 50 out of 51, his chance of winning on that flip is only $p_h$ and he gives up the chance to reach 75; worse, if he loses the bet he is left with only 1 dollar and must first struggle back to 25, a much weaker position.

Conclusion: the indicated optimal policy creates more chances to win and leaves the gambler better off when he loses.
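
As a rough sanity check for the book’s $p_h = 0.4$ case (assuming, as Figure 4.3 suggests, that staking everything at a capital of 50 is optimal there), the value of betting all 50 dollars is

$$p_h \cdot v(100) + (1 - p_h) \cdot v(0) = 0.4 \cdot 1 + 0.6 \cdot 0 = 0.4,$$

which matches the value estimate the code below produces at a capital of 50.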

Exercise

Implement value iteration for the gambler’s problem and solve it for $p_h = 0.25$ and $p_h = 0.55$. In programming, you may find it convenient to introduce two dummy states corresponding to termination with capital of 0 and 100, giving them values of 0 and 1 respectively. Show your results graphically, as in Figure 4.3.

Code

#######################################################################
# Copyright (C)                                                       #
# 2016-2018 Shangtong Zhang(zhangshangtong.cpp@gmail)             #
# 2016 Kenta Shimada(hyperkentakun@gmail)                         #
# Permission given to modify the code as long as you keep this        #
# declaration at the top                                              #
#######################################################################
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

matplotlib.use('Agg')
# goal
GOAL = 100

# all states, including state 0 and state 100
STATES = np.arange(GOAL + 1)

# probability of head
HEAD_PROB = 0.4
def figure_4_3():
    # state value
    state_value = np.zeros(GOAL + 1)
    state_value[GOAL] = 1.0

    sweeps_history = []

    # value iteration
    while True:
        old_state_value = state_value.copy()
        sweeps_history.append(old_state_value)

        for state in STATES[1:GOAL]:
            # get possible actions for current state
            actions = np.arange(min(state, GOAL - state) + 1)
            action_returns = []
            for action in actions:
                action_returns.append(
                    HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])
            new_value = np.max(action_returns)
            state_value[state] = new_value
        delta = abs(state_value - old_state_value).max()
        if delta < 1e-9:
            sweeps_history.append(state_value)
            break

    # compute the optimal policy
    policy = np.zeros(GOAL + 1)
    for state in STATES[1:GOAL]:
        actions = np.arange(min(state, GOAL - state) + 1)
        action_returns = []
        for action in actions:
            action_returns.append(
                HEAD_PROB * state_value[state + action] + (1 - HEAD_PROB) * state_value[state - action])

        # round to resemble the figure in the book
        # The [1:] avoids choosing the '0' action, which changes neither the state nor the expected return.
        # Since numpy.argmax picks the first option in case of ties, rounding the near-ties ensures the smallest action (bet) among them is selected.
        policy[state] = actions[np.argmax(np.round(action_returns[1:], 5)) + 1]

    plt.figure(figsize=(10, 20))

    plt.subplot(2, 1, 1)
    for sweep, state_value in enumerate(sweeps_history):
        plt.plot(state_value, label='sweep {}'.format(sweep))
    plt.xlabel('Capital')
    plt.ylabel('Value estimates')
    plt.legend(loc='best')

    plt.subplot(2, 1, 2)
    plt.scatter(STATES, policy)
    plt.xlabel('Capital')
    plt.ylabel('Final policy (stake)')

    plt.savefig('../images/figure_4_3.png')
    plt.close()


if __name__ == '__main__':
    figure_4_3()
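
To address the exercise above for $p_h = 0.25$ and $p_h = 0.55$, one small variation is to parameterize the head probability and reuse the same sweep. This is a minimal sketch, not part of the original repository (the function names and output filename are my own choices); it assumes the imports and constants defined in the script above.

def gambler_value_iteration(head_prob, threshold=1e-9):
    # value iteration with dummy terminal states: value 0 at capital 0, value 1 at the goal
    state_value = np.zeros(GOAL + 1)
    state_value[GOAL] = 1.0
    while True:
        old_state_value = state_value.copy()
        for state in STATES[1:GOAL]:
            actions = np.arange(min(state, GOAL - state) + 1)
            returns = [head_prob * state_value[state + a] +
                       (1 - head_prob) * state_value[state - a] for a in actions]
            state_value[state] = np.max(returns)
        if abs(state_value - old_state_value).max() < threshold:
            return state_value


def solve_exercise():
    # plot the converged value estimates for the two head probabilities in the exercise
    for head_prob in (0.25, 0.55):
        values = gambler_value_iteration(head_prob)
        plt.plot(STATES[1:GOAL], values[1:GOAL], label='p_h = {}'.format(head_prob))
    plt.xlabel('Capital')
    plt.ylabel('Value estimates')
    plt.legend(loc='best')
    plt.savefig('exercise_values.png')
    plt.close()

For $p_h = 0.55$ the coin favors the gambler, and betting small is then generally the better way to reach the goal, so the resulting policy looks very different from the bold bets of Figure 4.3.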
