3.5.2 Dynamic programming
DP divides the overall problem into a sequence of sub-problems, one per time period. Each sub-problem is described by the possible states s at the beginning of that period, the possible actions a, and the immediate (i.e., one-step) reward r obtained by taking action a in state s at time period t. With i = T − t, where T and t denote the total and the current time period respectively, and γ denoting the discount rate, Bellman's value function of a policy π with i steps to go is as follows [54]:
$$
V_i^{\pi}(s) =
\begin{cases}
r\big(s, \pi(s)\big), & i = 1 \\[4pt]
r\big(s, \pi(s)\big) + \gamma \displaystyle\sum_{s' \in S} P\big(s' \mid s, \pi(s)\big)\, V_{i-1}^{\pi}(s'), & i > 1
\end{cases}
$$
Equation 3.4.
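As an illustration of Equation 3.4, the following Python sketch evaluates a fixed policy backwards over a finite horizon. The array layout is an assumption made for the example only, not taken from the thesis: P[a, s, s'] holds the transition probabilities, r[s, a] the immediate rewards, and policy[s] the action prescribed by π in state s.

```python
import numpy as np

def evaluate_policy(P, r, policy, T, gamma):
    """Evaluate a fixed policy with i = 1..T steps to go (Equation 3.4).

    Illustrative sketch; the array layout is assumed, not the thesis model:
      P[a, s, s']  probability of moving from s to s' under action a
      r[s, a]      immediate reward for taking action a in state s
      policy[s]    action prescribed by the policy pi in state s
    Returns V[i, s], the expected discounted reward with i steps to go.
    """
    n_states = r.shape[0]
    V = np.zeros((T + 1, n_states))        # V[0, :] = 0 by convention
    for i in range(1, T + 1):
        for s in range(n_states):
            a = policy[s]
            V[i, s] = r[s, a]              # i = 1 term: immediate reward only
            if i > 1:                      # add discounted continuation value
                V[i, s] += gamma * P[a, s, :] @ V[i - 1, :]
    return V
```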
The optimal policy π*: S→X is:
$$
\pi^{*}(s) = \arg\max_{a}\Big[ r(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V_{i-1}(s') \Big]
$$
Equation 3.5.
The solution procedure starts with a small portion of the original problem and is gradually enlarged until the overall problem is covered. The selected action transforms the current state s into a next state s' according to a probability distribution P(s' | s, a). The expected reward assigned to the transition from the current state to the next state can usually be interpreted as the immediate reward from making decision a in state s. The best policy for the remaining states is the action that maximises the value function, as in Equation 3.5. The difference between the reward function and the value function is that the value function is defined over states and gives an estimate of the total (possibly discounted) reward expected in the long run when a particular policy is followed from each state, whereas the reward is an immediate, possibly stochastic, payoff that results from performing an action in a state.
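A companion sketch of Equation 3.5 picks, in each state, the action that maximises the one-step reward plus the discounted expected value of the successor state. It reuses the hypothetical array layout assumed in the previous sketch and is not the thesis's own implementation.

```python
import numpy as np

def greedy_policy(P, r, V_prev, gamma):
    """Select, for each state, the action maximising Equation 3.5,
    given the value function V_prev for the remaining steps.

    Assumed layout (illustration only): P[a, s, s'], r[s, a].
    """
    n_states, n_actions = r.shape
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        # one-step reward plus discounted expected value of the next state
        q = [r[s, a] + gamma * P[a, s, :] @ V_prev for a in range(n_actions)]
        policy[s] = int(np.argmax(q))
    return policy
```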
The basis of DP is Bellman's Principle of Optimality, which states that "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision" [54]. This means that the optimal decision depends only on the current state, not on how that state was reached. The decision maker is usually assumed to have a linear, additive and risk-neutral utility function over time. A decision problem that does not satisfy these assumptions may be difficult to formulate and solve with classic DP.
The classic algorithms for solving a DP are value iteration and policy iteration [54]. Value iteration starts with arbitrary values for the value function vector and keeps updating its elements. The idea is that, although the true value of a state s ∈ Hi is not known initially, repeatedly updating the value function guarantees convergence to the optimal solution (see Figure 3.).
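A minimal value-iteration sketch, under the same assumed array layout as above, could look as follows. The stopping rule (a tolerance on the largest change in the value vector) is a common convention chosen for the example rather than a detail taken from the thesis.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration sketch: start from an arbitrary value vector and
    keep applying the Bellman optimality update until the values stop
    changing (assumed layout: P[a, s, s'], r[s, a])."""
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                         # arbitrary starting values
    for _ in range(max_iter):
        # Q[s, a] = r(s, a) + gamma * sum_s' P(s' | s, a) V(s')
        Q = np.array([[r[s, a] + gamma * P[a, s, :] @ V
                       for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1)                      # greedy backup over actions
        if np.max(np.abs(V_new - V)) < tol:        # values have stopped changing
            break
        V = V_new
    return V_new, Q.argmax(axis=1)                 # optimal values and greedy policy
```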