
Prompt-Based Monte-Carlo Tree Search for Goal-oriented Dialogue Policy Planning

Authors: Xiao Yu, Maximillian Chen, Zhou Yu

Abstract

Planning for goal-oriented dialogue often requires simulating future dialogue interactions and estimating task progress.

Many approaches thus consider training neural networks to perform look-ahead search algorithms such as A* search and Monte Carlo Tree Search (MCTS).

However, such training often requires abundant annotated data, which may be noisy or scarce in low-resource settings.

We introduce GDP-ZERO, an approach using Open-Loop MCTS to perform goal-oriented dialogue policy planning without any model training.

GDP-ZERO prompts a large language model (LLM) to act as a policy prior, value function, user simulator, and system model during the tree search.

We evaluate GDP-ZERO on the goal-oriented task PersuasionForGood, and its responses are preferred over ChatGPT up to 59.32% of the time.

Code available at: here

1. Introduction

In many goal-oriented conversation tasks, interacting parties must proactively take the initiative by executing conversational strategies to lead the conversation toward a desired outcome (e.g., successful negotiation or emotional support). It is therefore imperative to have high-quality dialogue policy planners that can prescribe an ‘optimal’ strategy at each turn of the dialogue.

Optimal policy planning is a difficult task. In many goal-oriented tasks such as persuasion, individual persuaders might adopt different strategies, making it difficult to train or evaluate a policy planner. Moreover, ‘optimality’ in these complex tasks may require expert domain knowledge (e.g., negotiation skills).

In this work, we contribute a novel approach, Goal-oriented Dialogue Planning with Zero training (GDP-ZERO). GDP-ZERO prompts an LLM to perform planning by simulating future dialogue interactions (see Figure 1).

Unlike previous approaches, we treat policy planning as a stochastic game, and use prompting for every stage of an open-loop tree search.

We evaluate GDP-ZERO on PersuasionForGood due to its difficult planning task (Wang et al., 2019).


Figure 1: Using GDP-ZERO for persuasion with zero model training.

2. Related Work

Prompting Methods Prompting has largely focused on dialogue response generation, conversation synthesis, and dialogue understanding. To date, prompting has not been used for policy planning.

Dialogue Policy Planning Research on dialogue policy planning can be categorized into neural-focused and algorithm-focused approaches.

  • Neural-focused approaches use annotated dialogues to train dedicated classifiers or value functions to predict the next dialogue acts without explicit look-ahead planning. For many goal-oriented dialogues, however, both annotated strategies and dialogue responses can be suboptimal or noisy, as different people can respond differently even given the same context.
  • To reduce the reliance on a labeled dataset, much work has also attempted to combine neural networks with search algorithms, such as A* search and tree search. However, these methods still require model training for dialogue simulation or value function estimation, and are therefore highly dependent on training data quality.

3. Method (important!!!)

In this work, we introduce GDP-ZERO, an algorithm-focused dialogue policy planner for goal-oriented dialogue tasks like persuasion. GDP-ZERO uses zero model training and instead performs Open-Loop MCTS at decision time by prompting an LLM to simulate user and system responses, evaluate current task progress, and predict a prior over next dialogue acts. Our approach has two main differences from existing policy planning work:

  • we use few-shot prompting to bypass the need for model training on noisy data.
  • we use Open-Loop MCTS to reduce compounding simulation errors by continuously re-generating system and user responses during the tree search.
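To make the overall procedure concrete, below is a minimal Python sketch of this decision-time loop. It is an illustration only, not the released implementation: the `llm_simulate_stub` and `llm_value_stub` functions are hypothetical placeholders for the prompted LLM calls, and the search is flattened to a single level for brevity, whereas GDP-ZERO recurses down an open-loop tree.

```python
import random

# A minimal, illustrative sketch of GDP-ZERO's decision-time loop: repeated
# PUCT-style selection, open-loop simulation, value estimation, and statistics
# updates, followed by committing to a dialogue act. The *_stub functions are
# hypothetical placeholders for the prompted LLM calls described in Section 3.3.

DIALOGUE_ACTS = ["greeting", "logical appeal", "emotion appeal", "propose donation"]

def llm_simulate_stub(history, act):
    # Placeholder: prompt the LLM for a system utterance realizing `act`
    # and a simulated user reply (a fresh sample on every call).
    return history + [(act, "<system utterance>", "<user reply>")]

def llm_value_stub(history):
    # Placeholder: prompt the LLM to estimate the probability of task success.
    return random.random()

def plan_next_act(history, n_simulations=20, c_p=1.0):
    N = {a: 0 for a in DIALOGUE_ACTS}    # visit counts
    Q = {a: 0.0 for a in DIALOGUE_ACTS}  # running value estimates

    for _ in range(n_simulations):
        total = sum(N.values())
        # PUCT-style selection (see the formula in Section 3.3).
        act = max(DIALOGUE_ACTS,
                  key=lambda a: Q[a] + c_p * (total ** 0.5) / (1 + N[a]))
        # Open-loop simulation: responses are regenerated on every visit.
        value = llm_value_stub(llm_simulate_stub(history, act))
        N[act] += 1
        Q[act] += (value - Q[act]) / N[act]  # incremental mean of returns

    # Commit to the most-visited dialogue act, a common MCTS decision rule.
    return max(DIALOGUE_ACTS, key=lambda a: N[a])
```

With real prompted calls in place of the stubs, the same loop structure extends to deeper simulations by recursing on the sampled dialogue histories.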

3.1 Problem Definition

We first formulate planning as a Markov Decision Process (MDP).
A $t$-turn dialogue between a user and a system can be defined as:
$$h = (a_0^{sys}, u_1^{sys}, u_1^{usr}, \ldots, a_{t-1}^{sys}, u_t^{sys}, u_t^{usr})$$
where $a_i^{sys}$ is the system's dialogue act at turn $i$, $u_i^{sys}$ is the system's response, and $u_i^{usr}$ is the user's utterance at turn $i$.

We define the task of planning the next system dialogue act $a_{i+1}^{sys}$ as an MDP problem $\langle S, A, R, P, \gamma \rangle$, where:

  • $a_i^{sys}$ represents an action $a_i \in A$ of the system at the $i$-th turn;
  • the corresponding dialogue history $h = (a_0^{sys}, u_1^{sys}, u_1^{usr}, \ldots, a_{t-1}^{sys}, u_t^{sys}, u_t^{usr})$ also represents a state $s_i \in S$;
  • $R$ is a reward function associated with $(s, a)$, denoted $R(s, a)$, representing the likelihood of a desired conversational outcome, such as persuading a user to donate to a charity;
  • $P$ is a transition function, denoted $P: S \times A \rightarrow S$, representing the probability of transitioning from state $s_i$ to state $s_{i+1}$ after executing $a_i$ at the $i$-th turn;
  • $\gamma \in [0, 1)$ is the discount factor.
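As a concrete reading of this formulation, the sketch below (Python, with assumed names rather than the authors' data structures) encodes a state as the history of (dialogue act, system utterance, user utterance) turns, with a transition simply appending a newly simulated turn.

```python
from dataclasses import dataclass
from typing import Tuple

# Illustrative encoding of the MDP state from Section 3.1: a state s_i is just
# the dialogue history of turns observed so far. All names here are assumptions.

@dataclass(frozen=True)
class Turn:
    sys_act: str        # a_i^{sys}: system dialogue act at turn i
    sys_utterance: str  # u_i^{sys}: system response realizing that act
    usr_utterance: str  # u_i^{usr}: the user's reply

@dataclass(frozen=True)
class DialogueState:
    turns: Tuple[Turn, ...] = ()  # the full history h defining the state

    def extend(self, turn: Turn) -> "DialogueState":
        # A transition appends one newly simulated turn, yielding s_{i+1}.
        return DialogueState(self.turns + (turn,))

# Illustrative subset of the action space A (dialogue-act labels); the exact
# inventory depends on the task, e.g. PersuasionForGood strategies.
ACTIONS = ["greeting", "logical appeal", "emotion appeal", "propose donation"]
```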

3.2 Dialogue Planning as a Stochastic MDP


However, when simulating dialogue interactions during tree search, generating a slightly improbable system or user response for a state $s$ and storing it in the search tree could lead to a large compounding error for the rest of the subtree. This is because the state space representing all possible responses is large, and dialogue responses are diverse. We thus treat dialogue policy planning as a stochastic MDP, where the simulated next state $s' \leftarrow P(s, a)$ is drawn from a large unknown distribution and might not be representative of the most probable $s'$. Unlike previous uses of closed-loop MCTS for dialogue, which consider a deterministic transition, this formulation requires potentially different $s'$ to be returned given the same dialogue context $s$ and system action $a$.
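One way to realize such a stochastic, open-loop transition is to redraw the simulated responses on every visit instead of storing a single child state. The sketch below assumes hypothetical `sample_*` stubs in place of temperature-sampled LLM prompts.

```python
import random

# Illustrative open-loop transition s' <- P(s, a): each call draws a *fresh*
# simulated system response and user reply, so repeated visits to the same
# (state, action) pair can yield different next states. The sample_* functions
# are hypothetical stand-ins for temperature > 0 LLM prompting.

def sample_system_utterance(history, act):
    return f"<system utterance realizing '{act}', draw #{random.randrange(10000)}>"

def sample_user_utterance(history, act, sys_utt):
    return f"<simulated user reply, draw #{random.randrange(10000)}>"

def transition(history, act):
    """Stochastic transition: returns a newly sampled next state on every call."""
    sys_utt = sample_system_utterance(history, act)
    usr_utt = sample_user_utterance(history, act, sys_utt)
    return history + [(act, sys_utt, usr_utt)]
```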

3.3 GDP-ZERO


Figure 2: GDP-ZERO with a ChatGPT backbone. During Selection, simulations are either sampled from the cache or newly generated. During Expansion and Evaluation, we prompt ChatGPT for the prior policy $\pi$ and for value estimation.

Selection Given a tree state $s^{tr}$, the action $a^*$ with the highest Predictor Upper Confidence Tree Bound (PUCT) is selected to traverse the tree:
$$PUCT(s^{tr}, a) = Q(s^{tr}, a) + c_p \frac{\sqrt{\sum_{a} N(s^{tr}, a)}}{1 + N(s^{tr}, a)}$$

where $N$ records the number of times a $(s^{tr}, a)$ pair has been visited, and $c_p$ is a hyperparameter controlling exploration (details in the Appendix). We repeat this process until $s^{tr}$ becomes a leaf node.
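The PUCT rule maps directly onto a few lines of code. The sketch below assumes $N$ and $Q$ are dictionaries keyed by (tree-state key, action) and uses one of the $Q_0$ initializations mentioned in Appendix A; it is illustrative rather than the authors' implementation.

```python
import math

# Illustrative implementation of the PUCT selection rule above. state_key is
# any hashable encoding of an open-loop tree node; q_init reflects the Q_0
# initialization from Appendix A. Names here are assumptions.

def puct_score(Q, N, state_key, action, actions, c_p=1.0, q_init=0.25):
    total_visits = sum(N.get((state_key, a), 0) for a in actions)
    n_sa = N.get((state_key, action), 0)
    q_sa = Q.get((state_key, action), q_init)
    return q_sa + c_p * math.sqrt(total_visits) / (1 + n_sa)

def select_action(Q, N, state_key, actions, c_p=1.0):
    """Pick a* = argmax_a PUCT(s^tr, a)."""
    return max(actions, key=lambda a: puct_score(Q, N, state_key, a, actions, c_p))
```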

Expansion Once a leaf node is reached, we treat an LLM $M_{\theta}$ as a prior policy by prompting it to generate a distribution over next dialogue acts. This is done by sampling $M_{\theta}$ $m$ times at temperature $\tau = 1.0$ and converting the sampled dialogue acts (DAs) into a distribution (see Appendix A).
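A simple way to realize this expansion step, sketched below with a hypothetical `prompt_llm_for_act` placeholder (the paper's actual prompts are not reproduced here), is to draw $m$ samples and normalize their counts into the prior.

```python
import random
from collections import Counter

# Illustrative Expansion step: sample the backbone LLM m times at tau = 1.0,
# parse one dialogue act per sample, and normalize the counts into a prior
# distribution over next dialogue acts.

def prompt_llm_for_act(history, actions):
    return random.choice(actions)  # stand-in for one temperature-1.0 LLM sample

def prior_policy(history, actions, m=10):
    counts = Counter(prompt_llm_for_act(history, actions) for _ in range(m))
    return {a: counts[a] / m for a in actions}  # empirical prior over acts
```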

Evaluation We model the value of a state $v(s^{tr})$ by the probability that its dialogue context $h^{tr}$ can lead to task success.
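One plausible realization of this value estimate, assumed here rather than taken from the paper's prompts, is to sample several success/failure judgments from the LLM and average them.

```python
import random

# Illustrative Evaluation step: estimate v(s^tr) as the empirical probability
# that the dialogue context h^tr ends in task success (e.g., a donation).
# llm_judges_success is a hypothetical placeholder for a prompted judgment.

def llm_judges_success(history):
    return random.random() < 0.5  # stand-in for a sampled yes/no LLM judgment

def estimate_value(history, n_samples=5):
    votes = [llm_judges_success(history) for _ in range(n_samples)]
    return sum(votes) / n_samples  # a value in [0, 1]
```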

Appendix-A: Additional details on GDP-ZERO

GDP-ZERO requires a generative LLM as a backbone model, and takes in a dialogue history $h_i$ at the $i$-th turn as input. For each state, GDP-ZERO keeps a cache of size $k$ storing newly generated user and system utterances. We use $c_p = 1.0$ and $Q_0 = \{0.0, 0.25, 0.5\}$ to promote exploration.
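The cache can be pictured with the following sketch of one possible bookkeeping scheme (class and function names are assumptions, not the authors' code): the first $k$ visits to a (state, action) pair generate fresh simulations, and later visits re-sample from them.

```python
import random

# Illustrative per-node simulation cache of size k, bounding LLM calls per node.
# simulate_turn stands in for the prompted system/user simulation.

def simulate_turn(history, act):
    return (act, f"<system utterance, draw #{random.randrange(10000)}>", "<user reply>")

class SimulationCache:
    def __init__(self, k=3):
        self.k = k
        self.entries = {}  # (state_key, action) -> list of simulated turns

    def sample(self, state_key, history, act):
        cached = self.entries.setdefault((state_key, act), [])
        if len(cached) < self.k:
            cached.append(simulate_turn(history, act))  # newly generated
            return cached[-1]
        return random.choice(cached)  # re-sampled from the cache
```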

Appendix-B: Prompting Details on P4G


Appendix-C: Ablation Studies


4. Experiments

4.1 Static Evaluation


4.2 Interactive Human Evaluation

Appendix-D: Analysis of GDP-ZERO Dialogues



Appendix-E: GDP-ZERO Setup on P4G


(See the original paper here.)
