POMDP: Introduction to Partially Observable Markov Decision Processes

Hossein Kamalzadeh, Michael Hahsler

December 8, 2019

Introduction

The R package pomdp provides the infrastructure to define Partially Observable Markov Decision Process (POMDP) models and to analyze their solutions. The package includes pomdp-solve (Cassandra 2015), a solver written in C, and provides an interface to it so that the user can simply define all components of a POMDP model and solve the problem using one of several algorithms: enumeration (Sondik 1971), two pass (Sondik 1971), witness (Littman, Cassandra, and Kaelbling 1995), incremental pruning (Zhang and Liu 1996; Cassandra, Littman, and Zhang 1997), and a finite grid method that implements a form of point-based value iteration (Pineau, Gordon, and Thrun 2003). The package also contains functions to analyze and visualize POMDP solutions (e.g., the optimal policy).

In this document we will give a very brief introduction to the concept of POMDP, describe the features of the R package, and illustrate the usage with a toy example.

Partially Observable Markov Decision Processes

A partially observable Markov decision process (POMDP) is a combination of an MDP, which models the system dynamics, with a hidden Markov model that connects unobservable system states to observations. The agent can perform actions which affect the system (i.e., may cause the system state to change) with the goal of maximizing a reward that depends on the sequence of system states and the agent’s actions. However, the agent cannot directly observe the system state. Instead, at each discrete point in time, the agent makes observations that depend on the state. The agent uses these observations to form a belief about which state the system is currently in. This belief is called a belief state and is expressed as a probability distribution over the states. The solution of the POMDP is a policy prescribing which action is optimal for each belief state.

The POMDP framework is general enough to model a variety of real-world sequential decision-making problems. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The general framework of Markov decision processes with incomplete information was described by Karl Johan Åström (Åström 1965) in the case of a discrete state space, and it was further studied in the operations research community where the acronym POMDP was coined. It was later adapted for problems in artificial intelligence and automated planning by Leslie P. Kaelbling and Michael L. Littman (Kaelbling, Littman, and Cassandra 1998).

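A discrete-time POMDP can formally be described as a 7-tuple \[\mathcal{P} = (S, A, T, R, \Omega , O, \gamma),\] where \(S\) is the set of states, \(A\) is the set of actions, \(T\) is the set of conditional transition probabilities \(T(s' \mid s,a)\) between states, \(R: S \times A \rightarrow \mathbb{R}\) is the reward function, \(\Omega\) is the set of observations, \(O\) is the set of conditional observation probabilities \(O(o \mid s',a)\), and \(\gamma \in [0, 1]\) is the discount factor.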

At each time period, the environment is in some state \(s \in S\). The agent chooses an action \(a \in A\), which causes the environment to transition to state \(s' \in S\) with probability \(T(s' \mid s,a)\). At the same time, the agent receives an observation \(o \in \Omega\), which depends on the new state of the environment, with probability \(O(o \mid s',a)\). Finally, the agent receives a reward \(R(s,a)\). Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward, i.e., it chooses the actions at each time \(t\) that \[\max E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right].\]

For a finite time horizon, only the expectation over the sum up to the time horizon is used.
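
As a small illustration of the discounted sum inside the expectation, the following base R sketch evaluates it for one fixed, finite reward sequence. The function name discounted_return and the reward values are hypothetical and chosen only for this example.

```r
# Discounted sum of a reward sequence r_0, r_1, ..., r_n.
# 'discounted_return' is a hypothetical helper, not part of the pomdp package.
discounted_return <- function(rewards, discount) {
  t <- seq_along(rewards) - 1   # time steps 0, 1, 2, ...
  sum(discount^t * rewards)
}

# Hypothetical three-step reward sequence with discount factor 0.75:
# -1 + 0.75 * (-1) + 0.75^2 * 10
discounted_return(c(-1, -1, 10), discount = 0.75)
## [1] 3.875
```

The expectation in the objective averages this quantity over all state and observation sequences that the policy can produce.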

Package Functionality

Solving a POMDP problem with the pomdp package consists of two steps:

  1. Define a POMDP problem using the function POMDP, and
  2. solve the problem using solve_POMDP.

Defining a POMDP Problem

The POMDP function has the following arguments, each of which corresponds to one of the elements of a POMDP.

str(args(POMDP))
## function (states, actions, observations, transition_prob, observation_prob, 
##     reward, discount = 0.9, horizon = Inf, terminal_values = 0, start = "uniform", 
##     max = TRUE, name = NA)

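where states, actions, and observations define the sets \(S\), \(A\), and \(\Omega\); transition_prob and observation_prob specify the conditional probabilities \(T\) and \(O\); reward specifies the reward function \(R\); discount is the discount factor \(\gamma\); horizon is the number of time steps to consider; terminal_values gives the values of the states at the end of a finite horizon; start is the initial belief state; max indicates whether the problem is a maximization or a minimization problem; and name gives the model a name.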

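While specifying the discount rate and the sets of states, observations, and actions is straightforward, some arguments can be specified in different ways. The initial belief state start can be specified, for example, as a vector of probabilities over the states that sums to 1, or as the string "uniform" for a uniform distribution over all states (the default).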

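The transition probabilities, transition_prob, depend on the end.state \(s'\), the start.state \(s\) and the action \(a\). The set of conditional transition probabilities can be specified in several ways, for example as a named list with one entry per action, where each entry is either a matrix (rows are start states, columns are end states) or one of the keywords "identity" and "uniform".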

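The observation probabilities, observation_prob, depend on the action, the end.state, and the observation. The set of conditional observation probabilities can be specified in several ways, for example as a named list with one entry per action, where each entry is either a matrix (rows are end states, columns are observations) or the keyword "uniform".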
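
Whichever format is used, every conditional distribution has to sum to 1. The following base R sketch builds one transition matrix and one observation matrix for a single action and checks this property. The state and observation names, the probabilities, and the helper is_row_stochastic are hypothetical, and the sketch assumes the convention that each row holds the distribution for one conditioning state.

```r
states <- c("s1", "s2")
observations <- c("o1", "o2")

# Transition matrix for one action:
# rows are start states, columns are end states (assumed convention)
trans <- matrix(c(0.9, 0.1,
                  0.2, 0.8),
                nrow = 2, byrow = TRUE, dimnames = list(states, states))

# Observation matrix for one action:
# rows are end states, columns are observations (assumed convention)
obs <- matrix(c(0.7, 0.3,
                0.4, 0.6),
              nrow = 2, byrow = TRUE, dimnames = list(states, observations))

# Hypothetical helper: all entries non-negative and every row sums to 1
is_row_stochastic <- function(m, tol = 1e-8) {
  all(m >= 0) && all(abs(rowSums(m) - 1) < tol)
}

is_row_stochastic(trans)
is_row_stochastic(obs)
```

Running such a check before building the model is a cheap way to catch a matrix that was accidentally filled column-wise instead of row-wise.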

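The reward function, reward, depends on action, start.state, end.state and the observation. It can be specified in several ways, for example as a data frame created with the helper function R_(), or as a named list of lists of matrices with one matrix per action and start.state, whose rows correspond to end states and whose columns correspond to observations: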

reward = list(
  "action1" = list(
     "state1" = matrix(c(1, 2, 3, 4, 5, 6) , nrow = 3 , byrow = TRUE), 
     "state2" = matrix(c(3, 4, 5, 2, 3, 7) , nrow = 3 , byrow = TRUE), 
     "state3" = matrix(c(6, 4, 8, 2, 9, 4) , nrow = 3 , byrow = TRUE)), 
  "action2" = list(
     "state1" = matrix(c(3, 2, 4, 7, 4, 8) , nrow = 3 , byrow = TRUE), 
     "state2" = matrix(c(0, 9, 8, 2, 5, 4) , nrow = 3 , byrow = TRUE), 
     "state3" = matrix(c(4, 3, 4, 4, 5, 6) , nrow = 3 , byrow = TRUE)))

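To make the nested list format concrete, the sketch below looks up a single reward from a shortened version of the list above. The helper lookup_reward is hypothetical, and the sketch assumes that the matrix rows index end states and the columns index observations.

```r
# Shortened copy of the nested reward list (one action, two start states)
reward <- list(
  "action1" = list(
    "state1" = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE),
    "state2" = matrix(c(3, 4, 5, 2, 3, 7), nrow = 3, byrow = TRUE)))

# Hypothetical helper: index by action, start state, then [end state, observation]
lookup_reward <- function(reward, action, start, end, obs) {
  reward[[action]][[start]][end, obs]
}

# Reward for action1 when moving from state2 to the third end state
# while making the second observation
lookup_reward(reward, "action1", "state2", end = 3, obs = 2)
## [1] 7
```
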
Solving a POMDP

POMDP problems are solved with the function solve_POMDP with the following arguments.

str(args(solve_POMDP))
## function (model, horizon = NULL, discount = NULL, terminal_values = NULL, 
##     method = "grid", digits = 7, parameter = NULL, verbose = FALSE)

The model argument is a POMDP problem created using the POMDP function, but it can also be the name of a POMDP file using the format described in the file specification section of pomdp-solve. The horizon argument specifies the finite time horizon (i.e., the number of time steps) considered in solving the problem. If the horizon is unspecified (i.e., NULL), then the algorithm continues running iterations until it converges to the infinite horizon solution. The method argument specifies which algorithm the solver should use. Available methods include "grid", "enum", "twopass", "witness", and "incprune". Further solver parameters can be specified as a list in parameter. The list of available parameters can be obtained using the function solve_POMDP_parameter(). Finally, verbose is a logical that indicates whether the solver output should be shown in the R console or not. The output of this function is an object of class POMDP.

Helper Functions

The package offers several functions to help with managing POMDP problems and solutions.

The functions model, solution, and solver_output extract different elements from a POMDP object returned by solve_POMDP().

The package provides a plot function to visualize the solution’s policy graph using the package igraph. The graph itself can be extracted from the solution using the function policy_graph().

The Tiger Problem Example

We will demonstrate how to use the package with the Tiger Problem (Cassandra, Kaelbling, and Littman 1994).

A tiger is put with equal probability behind one of two doors, while treasure is put behind the other one. You are standing in front of the two closed doors and need to decide which one to open. If you open the door with the tiger, you will get hurt by the tiger (negative reward), but if you open the door with the treasure, you receive a positive reward. Instead of opening a door right away, you also have the option to wait and listen for tiger noises. But listening is neither free nor entirely accurate. You might hear the tiger behind the left door while it is actually behind the right door and vice versa.

The states of the system are the tiger behind the left door (tiger-left) and the tiger behind the right door (tiger-right).

Available actions are: open the left door (open-left), open the right door (open-right), or listen (listen).

Rewards associated with these actions depend on the resulting state: +10 for opening the correct door (the door with treasure), -100 for opening the door with the tiger. A reward of -1 is the cost of listening.

As a result of listening, there are two observations: either you hear the tiger on the right (tiger-right), or you hear it on the left (tiger-left).

The transition probability matrix for the action listen is the identity matrix, i.e., the position of the tiger does not change. Opening either door means that the game restarts by placing the tiger uniformly at random behind one of the doors.

Specifying the Tiger Problem

The problem can be specified using function POMDP() as follows.

library("pomdp")

Tiger <- POMDP(
  name = "Tiger Problem",
  
  discount = 0.75,
  
  states = c("tiger-left" , "tiger-right"),
  actions = c("listen", "open-left", "open-right"),
  observations = c("tiger-left", "tiger-right"),
  
  start = "uniform",
  
  transition_prob = list(
    "listen" = "identity", 
    "open-left" = "uniform", 
    "open-right" = "uniform"),

  observation_prob = list(
    "listen" = matrix(c(0.85, 0.15, 0.15, 0.85), nrow = 2, byrow = TRUE), 
    "open-left" = "uniform",
    "open-right" = "uniform"),
    
  reward = rbind(
    R_("listen",     "*",           "*", "*", -1  ),
    R_("open-left",  "tiger-left",  "*", "*", -100),
    R_("open-left",  "tiger-right", "*", "*", 10  ),
    R_("open-right", "tiger-left",  "*", "*", 10  ),
    R_("open-right", "tiger-right", "*", "*", -100)
  )
)

Tiger
## Unsolved POMDP model: Tiger Problem 
##      horizon: Inf

Note that we specify each component in the way that is most convenient for it: a list for the observation and transition probabilities, and a data frame created with the R_ function for the rewards.
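
The keywords "identity" and "uniform" used in the specification are shorthand for explicit matrices. The following base R sketch writes out what they stand for in the two-state Tiger problem. The variable names are hypothetical, and the sketch assumes the convention that rows are start states and columns are end states.

```r
states <- c("tiger-left", "tiger-right")
n <- length(states)

# "identity": the state never changes (used for the action listen)
identity_trans <- diag(n)
dimnames(identity_trans) <- list(states, states)

# "uniform": every end state is equally likely (used for opening a door)
uniform_trans <- matrix(1 / n, nrow = n, ncol = n,
                        dimnames = list(states, states))

identity_trans
uniform_trans
```

For small problems the explicit matrices are easy to write by hand, but the keywords keep the specification short and avoid typing errors.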

Solving the Tiger Problem

Now we can solve the problem using the default algorithm, the finite grid method. It implements a form of point-based value iteration that can find approximate solutions even for difficult problems.

sol <- solve_POMDP(Tiger)
sol
## Solved POMDP model: Tiger Problem 
##      solution method: grid 
##      horizon: Inf 
##      converged: TRUE 
##      total expected reward (for start probabilities): 1.933439

The output is an object of class POMDP which contains the solution.

sol$solution
## $method
## [1] "grid"
## 
## $parameter
## NULL
## 
## $horizon
## [1] Inf
## 
## $discount
## [1] 0.75
## 
## $converged
## [1] TRUE
## 
## $total_expected_reward
## [1] 1.933439
## 
## $initial_belief
##  tiger-left tiger-right 
##         0.5         0.5 
## 
## $initial_pg_node
## [1] 3
## 
## $terminal_values
## [1] 0
## 
## $belief_states
##         tiger-left  tiger-right
##  [1,] 5.000000e-01 5.000000e-01
##  [2,] 8.500000e-01 1.500000e-01
##  [3,] 1.500000e-01 8.500000e-01
##  [4,] 9.697987e-01 3.020134e-02
##  [5,] 3.020134e-02 9.697987e-01
##  [6,] 9.945344e-01 5.465587e-03
##  [7,] 5.465587e-03 9.945344e-01
##  [8,] 9.990311e-01 9.688763e-04
##  [9,] 9.688763e-04 9.990311e-01
## [10,] 9.998289e-01 1.711147e-04
## [11,] 1.711147e-04 9.998289e-01
## [12,] 9.999698e-01 3.020097e-05
## [13,] 3.020097e-05 9.999698e-01
## [14,] 9.999947e-01 5.329715e-06
## [15,] 5.329715e-06 9.999947e-01
## [16,] 9.999991e-01 9.405421e-07
## [17,] 9.405421e-07 9.999991e-01
## [18,] 9.999998e-01 1.659782e-07
## [19,] 1.659782e-07 9.999998e-01
## [20,] 1.000000e+00 2.929027e-08
## [21,] 2.929027e-08 1.000000e+00
## [22,] 1.000000e+00 5.168871e-09
## [23,] 5.168871e-09 1.000000e+00
## [24,] 1.000000e+00 9.121536e-10
## [25,] 9.121536e-10 1.000000e+00
## 
## $pg
## $pg[[1]]
##   node     action tiger-left tiger-right
## 1    1  open-left          3           3
## 2    2     listen          3           1
## 3    3     listen          4           2
## 4    4     listen          5           3
## 5    5 open-right          3           3
## 
## 
## $alpha
## $alpha[[1]]
##      tiger-left tiger-right
## [1,] -98.549921   11.450079
## [2,] -10.854299    6.516937
## [3,]   1.933439    1.933439
## [4,]   6.516937  -10.854299
## [5,]  11.450079  -98.549921
## 
## 
## attr(,"class")
## [1] "POMDP_solution"

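The solution contains, among others, the following elements: total_expected_reward is the total expected reward of the policy for the initial belief; initial_belief is the initial belief state; belief_states is the matrix of belief states used by the solver; pg is the policy graph, where each row represents a node with its action and the node to follow for each observation; and alpha contains the alpha vectors, one per policy graph node, that define the value function.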

Visualization

In this section, we will visualize the policy graph provided in the solution by the solve_POMDP function.

plot_policy_graph(sol)

The policy graph can be easily interpreted. Without prior information, the agent starts at the node marked with “initial.” In this case the agent believes that there is a 50-50 chance that the tiger is behind either door. The optimal action is displayed inside the node and in this case is to listen. The observations are labels on the arcs. Let us assume that the observation is “tiger-left”; then the agent follows the appropriate arc and ends in a node representing a belief (one or more belief states) that has a very high probability of the tiger being on the left. However, the optimal action is still to listen. If the agent again hears the tiger on the left, then it ends up in a node that has a close to 100% belief that the tiger is on the left, and open-right is the optimal action. The arcs back from the nodes with the open actions to the initial node reset the problem.
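
The beliefs behind this walk through the graph can be reproduced with Bayes' rule: the current belief is pushed through the transition matrix, weighted by the probability of the observation, and renormalized. The following base R sketch does this for the action listen, using the probabilities from the problem specification. The function update_belief and the variable names are hypothetical and not part of the package.

```r
states <- c("tiger-left", "tiger-right")

# Listening does not move the tiger (identity transition matrix)
trans_listen <- diag(2)

# Rows are the true (end) states, columns are the observations
obs_listen <- matrix(c(0.85, 0.15,
                       0.15, 0.85),
                     nrow = 2, byrow = TRUE, dimnames = list(states, states))

# Hypothetical helper implementing one Bayesian belief update
update_belief <- function(belief, trans, obs_prob, observation) {
  predicted <- as.vector(belief %*% trans)             # sum_s T(s'|s,a) b(s)
  unnormalized <- predicted * obs_prob[, observation]  # times O(o|s',a)
  unnormalized / sum(unnormalized)                     # normalize to sum to 1
}

b0 <- c("tiger-left" = 0.5, "tiger-right" = 0.5)
b1 <- update_belief(b0, trans_listen, obs_listen, "tiger-left")
b2 <- update_belief(b1, trans_listen, obs_listen, "tiger-left")
round(rbind(b0, b1, b2), 7)
```

After hearing the tiger on the left once, the belief for tiger-left is 0.85, and after hearing it twice it is about 0.9698. These values match rows 2 and 4 of belief_states in the solution shown earlier.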

Since we only have two states, we can visualize the piecewise linear convex value function as a simple plot.

alpha <- sol$solution$alpha
alpha
## [[1]]
##      tiger-left tiger-right
## [1,] -98.549921   11.450079
## [2,] -10.854299    6.516937
## [3,]   1.933439    1.933439
## [4,]   6.516937  -10.854299
## [5,]  11.450079  -98.549921
plot_value_function(sol, ylim = c(0,20))

The lines represent the nodes in the policy graph and the optimal actions are shown in the legend.
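
The plotted value function is the upper envelope of these lines: for a belief \(b\), the value is the largest inner product of \(b\) with any alpha vector, and the maximizing vector identifies the policy graph node. The following base R sketch evaluates this directly. The alpha vectors and node actions are copied from the solution output above, and the helper value_at is hypothetical.

```r
states <- c("tiger-left", "tiger-right")

# Alpha vectors copied from sol$solution$alpha (one row per policy graph node)
alpha <- matrix(c(-98.549921,  11.450079,
                  -10.854299,   6.516937,
                    1.933439,   1.933439,
                    6.516937, -10.854299,
                   11.450079, -98.549921),
                ncol = 2, byrow = TRUE, dimnames = list(NULL, states))

# Actions of the policy graph nodes, copied from sol$solution$pg
node_actions <- c("open-left", "listen", "listen", "listen", "open-right")

# Hypothetical helper: value and best node for a given belief
value_at <- function(belief, alpha) {
  scores <- as.vector(alpha %*% belief)   # one inner product per alpha vector
  list(value = max(scores), node = which.max(scores))
}

uniform <- value_at(c(0.5, 0.5), alpha)
uniform$value                    # 1.933439, the total expected reward
node_actions[uniform$node]       # "listen"

certain_left <- value_at(c(1, 0), alpha)
node_actions[certain_left$node]  # "open-right"
```

For the uniform initial belief, the maximizing vector is the third one, which agrees with initial_pg_node in the solution.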

References

Åström, K. J. 1965. “Optimal Control of Markov Processes with Incomplete State Information.” Journal of Mathematical Analysis and Applications 10 (1): 174–205. https://doi.org/10.1016/0022-247X(65)90154-X.

Cassandra, Anthony R. 2015. “The POMDP Page.” https://www.pomdp.org.

Cassandra, Anthony R., Leslie Pack Kaelbling, and Michael L. Littman. 1994. “Acting Optimally in Partially Observable Stochastic Domains.” In Proceedings of the Twelfth National Conference on Artificial Intelligence. Seattle, WA.

Cassandra, Anthony R., Michael L. Littman, and Nevin Lianwen Zhang. 1997. “Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes.” In UAI’97: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 54–61.

Kaelbling, Leslie Pack, Michael L. Littman, and Anthony R. Cassandra. 1998. “Planning and Acting in Partially Observable Stochastic Domains.” Artificial Intelligence 101 (1): 99–134. https://doi.org/10.1016/S0004-3702(98)00023-X.

Littman, Michael L., Anthony R. Cassandra, and Leslie Pack Kaelbling. 1995. “Learning Policies for Partially Observable Environments: Scaling up.” In Proceedings of the Twelfth International Conference on Machine Learning, 362–70. ICML’95. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Pineau, Joelle, Geoff Gordon, and Sebastian Thrun. 2003. “Point-Based Value Iteration: An Anytime Algorithm for POMDPs.” In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 1025–30. IJCAI’03. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Sondik, E. J. 1971. “The Optimal Control of Partially Observable Markov Decision Processes.” PhD thesis, Stanford, California.

Zhang, Nevin L., and Wenju Liu. 1996. “Planning in Stochastic Domains: Problem Characteristics and Approximation.” HKUST-CS96-31. Hong Kong University.