INPUT transition-probability matrices and reward matrices.
Initialize Vi(s) = 0 (or select an arbitrary number) for all states s∊H.
WHILE the value function Vi(s) has not converged
    FOR i = 1:T
        FOR all states s∊Hi
            Update Vi(s): take the best action a∊Ai(s), balancing the immediate reward and the expected future value.
        LOOP
    LOOP
LOOP
Figure 3. Value iteration algorithm [200]
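For illustration, a minimal Python sketch of the value-iteration update is given below for a simplified, stationary discounted MDP rather than the stage-indexed formulation of the figure; the arrays P and R, the discount factor gamma, the tolerance tol and the toy numbers are assumptions introduced for the example only.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Value iteration for a stationary discounted MDP (illustrative sketch).

    P : (n_actions, n_states, n_states) array, P[a, s, s2] = Pr(s -> s2 | a).
    R : (n_actions, n_states) array, R[a, s] = expected immediate reward.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)                      # initialise V(s) = 0 for all states
    while True:
        Q = R + gamma * (P @ V)                 # immediate reward + discounted expected future value
        V_new = Q.max(axis=0)                   # best achievable value in each state
        if np.max(np.abs(V_new - V)) < tol:     # stop once V is (numerically) unchanged
            return V_new, Q.argmax(axis=0)      # value function and greedy policy
        V = V_new

# Toy 2-state, 2-action example (numbers are purely illustrative).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.4, 0.6]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V, policy = value_iteration(P, R)

Stopping when the value function is numerically unchanged plays the role of the WHILE condition in the figure.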
Policy iteration starts with an arbitrary policy and repeatedly replaces the current policy with a better one until an optimal policy is reached. Since there are a finite number of states and actions, the algorithm eventually terminates with a policy that cannot be improved further (see Figure 3).
INPUT transition-probability matrices and reward matrices.
Select a random initial policy a∊Ai(s) for all states.
WHILE the policy has not converged
    FOR i = 1:T
        FOR all states s∊Hi
            Compute the value of the current policy for the given state:
                IF i = 1, compute it from the immediate reward;
                ELSE, compute it recursively using the value of the preceding stage.
                END
            Update the optimal policy for the given state.
        LOOP
    LOOP
LOOP
Figure 3. Policy iteration algorithm [200]
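Similarly, a minimal Python sketch of policy iteration is given below under the same simplified stationary-MDP assumptions (arrays P and R and discount factor gamma as in the previous sketch); exact policy evaluation is performed by solving a linear system, and the loop terminates once greedy improvement leaves the policy unchanged, mirroring the finite-termination argument above.

import numpy as np

def policy_iteration(P, R, gamma=0.95, seed=0):
    """Policy iteration for a stationary discounted MDP (illustrative sketch).

    P and R follow the same layout as in the value-iteration sketch above.
    """
    n_actions, n_states, _ = P.shape
    rng = np.random.default_rng(seed)
    policy = rng.integers(n_actions, size=n_states)     # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(n_states)]            # transition rows selected by the policy
        R_pi = R[policy, np.arange(n_states)]            # rewards selected by the policy
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V.
        new_policy = (R + gamma * (P @ V)).argmax(axis=0)
        if np.array_equal(new_policy, policy):           # terminate when the policy stops changing
            return V, policy
        policy = new_policy

# Reuses the toy P and R defined in the value-iteration sketch, e.g.:
# V, policy = policy_iteration(P, R)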
The backward approach involved in DP reduces the computational burden by pruning branches that are unlikely to yield a better future value of the objective function. The idea of separating past and future costs enables DP to provide the globally optimal solution by balancing the immediate reward against the future reward. Proofs of convergence to an optimal policy in DP can be found in most optimisation textbooks [56, 57, 219].
In contrast, DP is known to be of limited use for many large and complex real-world problems because of its computational requirements (e.g., computing the transition-probability matrices and the rewards for the transition times for every decision). In particular, where the sub-problems depend on each other (i.e., where past states or decisions affect the current decision), obtaining and storing these data can be very challenging and computationally expensive; this limitation motivated the development of various ADP techniques.