DP used Bellman’s value function in Equations 3.4-3.5, where the search space was constructed from the set of individual drugs SS (see Figure 3.). To allow the transition probabilities to differ depending on disease history, the health state transition space HS was constructed from the possible combinations of health states up to the time period considered: thus the number of possible health states increased from 1 at t1 to 27 at t4. All matrices of potential transitions mTransition and one-step rewards mReward were calculated before the optimisation procedure started. The problem-solving procedure started from the last decision period i. Policy iteration compared the estimates of the value function V{i}(s,a), where drug a∈SS was used for health state s∈HSi, and identified the best solution for each health state s in each period i. Once the optimal solution in the last period was identified, the optimal solution in the preceding period was selected under the policy of using the optimal solutions already identified for the later periods. This process was followed until the optimal drug sequence was identified for the first time period. DP had a limitation in restricting infeasible solutions under the decision rules assumed in the hypothetical SDDP (i.e., non-repetition of drugs after treatment failure and continuous use of the current drug after treatment success): because the optimal solution was calculated backwards in time, the medical and drug-use history was not known at the point each decision was made, so health states and drugs used in previous time periods could not be taken into account when identifying the optimal drug for the current period.
T = 3; % The number of time periods.
SS = 3; % The number of possible treatment options; this is the same for SS1, SS2, SS3 and SS4.
HS = [3^0, 3^1, 3^2, 3^3]; % The number of possible health states (with disease history) at each t, where HS1=3^0, HS2=3^1, HS3=3^2 and HS4=3^3.
% Calculate all possible transition probabilities from s_t to s_t+1 by drug a∈SS using 'function_Transition'.
mTransition = function_Transition; % mTransition{t}(s, s_next, a): probability of moving from state s to s_next under drug a.
% Calculate all immediate rewards using 'function_Reward'.
mReward = function_Reward; % mReward{t}(s, a): one-step reward of using drug a in health state s.
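The listing above only constructs the model inputs. A minimal sketch of the backward pass itself is given below, assuming T, SS, HS, mTransition and mReward are defined as in the listing; the value array V, the policy array and the zero terminal value are hypothetical additions for illustration rather than the exact implementation used.

V = cell(1, T+1);
V{T+1} = zeros(HS(T+1), 1);              % assumed terminal values: no reward beyond the last period
policy = cell(1, T);
for t = T:-1:1                           % start from the last decision period and work backwards
    V{t} = zeros(HS(t), 1);
    policy{t} = zeros(HS(t), 1);
    for s = 1:HS(t)                      % each health state (with disease history) at period t
        Q = zeros(SS, 1);
        for a = 1:SS                     % each candidate drug
            % one-step reward plus the expected value of the successor health states
            Q(a) = mReward{t}(s, a) + mTransition{t}(s, :, a) * V{t+1};
        end
        [V{t}(s), policy{t}(s)] = max(Q);   % best drug for health state s in period t
    end
end

Because this pass runs from t = T back to t = 1, policy{t}(s) depends only on the current health state s and not on the drugs used in earlier periods, which is the limitation discussed above.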