TY - JOUR
T1 - Potential-based reward shaping for finite horizon online POMDP planning
AU - Eck, Adam
AU - Soh, Leen-Kiat
AU - Devlin, Sam
AU - Kudenko, Daniel
N1 - Funding Information:
This research was partially supported by a National Science Foundation Graduate Research Fellowship (DGE-054850) and a Grant from the National Science Foundation (SES-1132015).
Publisher Copyright:
© 2015, The Author(s).
PY - 2016/5/1
Y1 - 2016/5/1
N2 - In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent’s reward function in order to lead the agent to large future rewards without having to spend as much time explicitly estimating cumulative future rewards, enabling the agent to save time to improve the breadth of planning and build higher quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful to agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results in a range of classic benchmark POMDP planning problems.
AB - In this paper, we address the problem of suboptimal behavior during online partially observable Markov decision process (POMDP) planning caused by time constraints on planning. Taking inspiration from the related field of reinforcement learning (RL), our solution is to shape the agent’s reward function in order to lead the agent to large future rewards without having to spend as much time explicitly estimating cumulative future rewards, enabling the agent to save time to improve the breadth of planning and build higher quality plans. Specifically, we extend potential-based reward shaping (PBRS) from RL to online POMDP planning. In our extension, information about belief states is added to the function optimized by the agent during planning. This information provides hints of where the agent might find high future rewards beyond its planning horizon, and thus achieve greater cumulative rewards. We develop novel potential functions measuring information useful to agent metareasoning in POMDPs (reflecting on agent knowledge and/or histories of experience with the environment), theoretically prove several important properties and benefits of using PBRS for online POMDP planning, and empirically demonstrate these results in a range of classic benchmark POMDP planning problems.
KW - Online planning
KW - POMDP
KW - Potential-based reward shaping
UR - http://www.scopus.com/inward/record.url?scp=84924192605&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84924192605&partnerID=8YFLogxK
U2 - 10.1007/s10458-015-9292-6
DO - 10.1007/s10458-015-9292-6
M3 - Article
AN - SCOPUS:84924192605
SN - 1387-2532
VL - 30
SP - 403
EP - 445
JO - Autonomous Agents and Multi-Agent Systems
JF - Autonomous Agents and Multi-Agent Systems
IS - 3
ER -