6.6 Reinforcement learning: Q-learning


The hypertension SDDP model applied the Q-learning method, a simple incremental learning method that repeatedly updates Q-values based on new cases until it converges on the optimal solution. The decision space was decomposed by consecutive time cycles, as shown in Figure 6.. The health state space HS had five sub-health state spaces HS1-HS5, each containing 2^(t-1) health states. Death was not included in the decomposed HS because it is an absorbing state in which no decision or further update is made. As the transition probabilities and the drug decision rule depended on the disease history, each health state in each sub-health state space included information about the previous health states. The search space SS was also decomposed by consecutive time cycles into SS1-SS5. Each search space included the treatment options as defined in Chapter 4.

Five Q-tables were defined by the number of possible health states and the number of possible drugs in each time period (i.e., 1x4 for Q1, 2x10 for Q2, 4x14 for Q3, 8x14 for Q4 and 16x14 for Q5). The Q-values were initially set to 0 and gradually updated as the simulator passed through the corresponding state and action at t. The outer loop was repeated between 1,000 and 1,000,000 times in order to observe the performance and convergence depending on the number of cases. The inner loop was repeated for the number of drug switching periods. Starting from the initial health state (i.e., cState), the algorithm generated the subsequent trajectories of drugs (i.e., cDrug) and health states (i.e., nState) until a terminal state at t4. As the subsequent drugs and health states were generated, the treatment history (i.e., tHist) and the disease history (i.e., dHist) were updated. The probability of the initial health state was set to 1 because the hypertension model assumed that all patients started from the same initial health state. The initial SBP was 173.5 mmHg and the SD was 21.1 mmHg.

A drug was selected using an ε-greedy method, which is often the method of first choice because of its practical effectiveness [200]. With the ε-greedy method, the simulator selects a drug greedily at each time step with a probability ε=1-(1/log(n+2)), which increases with the number of cases n. At the beginning of learning, a drug is therefore likely to be selected at random. As more learning is achieved, the action that maximises the Q-value for the current health state is selected with a higher probability.
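
The selection step itself is not shown in the listing below. A minimal sketch of the ε-greedy rule described above, using the variable names from the listing but not the thesis code verbatim (the feasibility filtering of candidate drugs is omitted here), could look as follows:

% Sketch of the epsilon-greedy drug selection (illustrative, not the thesis code).
epsilon = 1 - 1/log(n+2);                  % probability of acting greedily; increases with n
if rand < epsilon
    [~,cDrug] = max(Q{t}(cStateIdx,:));    % exploit: drug with the highest Q-value so far
else
    cDrug = SS{t}(randi(numel(SS{t})));    % explore: a random drug from the search space at t
end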

For patients who are alive, EvModelDC calculates the one-step reward for the transition from cState to nState when cDrug is used at t. The underlying evaluation model ‘EvModel’ used for enumeration, SA and GA was slightly modified so that ‘EvModelDC’ provided the one-step net benefit observed in a decomposed period rather than over the whole follow-up period. IR is a single value representing this one-step reward. fProb is a 1x3 matrix containing the transition probabilities from cState to the next possible health states. fSBP and fSBPSD are 1x2 matrices containing the mean SBPs and the SDs of the patients whose treatment was successful and unsuccessful, respectively. They were used to update the baseline information (i.e., cProb, cSBP and cSBPSD) in the next period.
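
The exact signature of EvModelDC does not appear in the listing below, so the call shown here is only an assumed illustration of how the outputs described above might feed the next period:

% Hedged illustration only: the argument list of EvModelDC is assumed, not taken from the thesis.
[IR,fProb,fSBP,fSBPSD] = EvModelDC(cState,cDrug,t,cProb,cSBP,cSBPSD);
% The outputs are then used to update the baseline information (cProb, cSBP, cSBPSD)
% carried into the next period, e.g. by selecting the element corresponding to the realised outcome.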

The target to maximise is the Q-value for the current health state, which is the sum of the one-step reward IR from the current state and the maximum Q-value from the next health state (i.e., max(Q{t+1}(nStateIdx,:))). In sensitivity analyses, Q-learning considering a two or three-step future reward was also implemented to compare the performance; in this case, the feedback is the sum of the immediate rewards and the mean total net benefit from two or three transitions in the future. Where the time to the end of the drug switching period is shorter than the time to be considered in Q-learning (e.g., decision-making in the third period where a two-step future reward was considered), the feedback was made based on the immediate reward and any future reward that would occur until the end of the drug switching period.
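
As a minimal sketch, and assuming that the discount rate DR weights the future term in the usual Q-learning fashion, the one-step feedback could be formed as follows (the multi-step variants used in the sensitivity analyses are not shown):

% Sketch of the one-step feedback (target) for a non-final decision period;
% in the final drug switching period only the immediate reward IR would be used.
target = IR + DR*max(Q{t+1}(nStateIdx,:));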

As mentioned earlier, all the values in the Q-tables are 0 until the simulator passes through the corresponding health state and drug. If a Q-value has been calculated previously, the old Q-value is updated using the difference between the new and the old Q-value (i.e., delta), weighted by the learning rate α=1/sqrt(n+2), which gradually decreased over the iterations. All the differences between the old and new Q-values were saved in Discrepancy, and their mean over every 100 cases was saved in mDiscrepancy and used to check the convergence of the Q-values. The discount rate of 0.8 determined the importance of the future rewards in decision-making.
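
Putting the two preceding steps together, a minimal sketch of the incremental update, assuming the standard Q-learning form and the variable names of the listing, is:

% Sketch of the Q-value update; alpha decreases as more cases are simulated.
alpha = 1/sqrt(n+2);
delta = target - Q{t}(cStateIdx,cDrug);              % difference between the new and the old Q-value
Q{t}(cStateIdx,cDrug) = Q{t}(cStateIdx,cDrug) + alpha*delta;
Discrepancy(end+1,1) = delta;                        % stored to monitor convergence
if mod(n,100) == 0
    mDiscrepancy(end+1,1) = mean(Discrepancy);       % summarised every 100 cases (simplified)
end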

All the updated Q-values were stored in a look-up table, where rows represent the possible health states and columns represent the possible drugs. At the end of learning, the optimal drug, i.e. the one providing the highest Q-value, was selected for each health state h at each time t. A feasibility test was included to check whether the selected action was feasible under the given drug switching rules (i.e., a drug cannot be used if it has previously been used for treatment; the initial treatment should be one of the single drugs; and three-drug combinations are only accepted after the use of a two-drug combination).
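
A hedged sketch of such a feasibility test is given below; the helper name and the grouping of drug indices into single drugs, two-drug and three-drug combinations are illustrative assumptions, not the thesis code:

% Illustrative feasibility check for a candidate drug under the stated switching rules.
function ok = isFeasible(cDrug,tHist,singleDrugs,twoDrugCombos,threeDrugCombos)
ok = ~ismember(cDrug,tHist);                          % a previously used drug cannot be selected again
if isempty(tHist)
    ok = ok && ismember(cDrug,singleDrugs);           % initial treatment must be a single drug
elseif ismember(cDrug,threeDrugCombos)
    ok = ok && any(ismember(tHist,twoDrugCombos));    % triple therapy only after a two-drug combination
end
end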

Figure 6.. Possible health states and treatment options, where the decision space is decomposed.
1) HSt represents the health state space at t; SSt represents the search space at t; 1 represents Failure and 2 represents Success.


% Select the key parameters for reinforcement learning.
T = 4; % The number of drug switching periods.
DR = 0.8; % Discount rate.
N = 1000000; % The number of cases (1000, 10000 and 100000 were also tested).
FR = 1; % Future reward considered (from one to three steps).
% Define the health state space decomposed by the time period. The possible health states include the previous disease history.
HS1 = 1;
HS2 = [1,1;1,2];
HS3 = [1,1,1;1,1,2;1,2,1;1,2,2];
HS4 = [1,1,1,1;1,1,1,2;1,1,2,1;1,1,2,2;...
    1,2,1,1;1,2,1,2;1,2,2,1;1,2,2,2];
HS5 = [1,1,1,1,1;1,1,1,1,2;1,1,1,2,1;1,1,1,2,2;...
    1,1,2,1,1;1,1,2,1,2;1,1,2,2,1;1,1,2,2,2;...
    1,2,1,1,1;1,2,1,1,2;1,2,1,2,1;1,2,1,2,2;...
    1,2,2,1,1;1,2,2,1,2;1,2,2,2,1;1,2,2,2,2];
HS = {HS1,HS2,HS3,HS4,HS5};
% Define the search space decomposed by the time period.
SS1 = (1:1:4); % Possible treatment options 1-4 at t1.
SS2 = (1:1:10); % Possible treatment options 1-10 at t2.
SS3 = (1:1:14); % Possible treatment options 1-14 at t3.
SS4 = (1:1:14); % Possible treatment options 1-14 at t4.
SS5 = (1:1:14); % Possible treatment options 1-14 at t5.
SS = {SS1,SS2,SS3,SS4,SS5};
% Initialise the Q-tables for each time period to 0.
Q1 = zeros(size(HS1,1),size(SS1,2)); % 1x4 matrix at t1.
Q2 = zeros(size(HS2,1),size(SS2,2)); % 2x10 matrix at t2.
Q3 = zeros(size(HS3,1),size(SS3,2)); % 4x14 matrix at t3.
Q4 = zeros(size(HS4,1),size(SS4,2)); % 8x14 matrix at t4.
Q5 = zeros(size(HS5,1),size(SS5,2)); % 16x14 matrix at t5.
Q = {Q1,Q2,Q3,Q4,Q5};
% Initialise the parameters to check convergence.
Discrepancy = []; mDiscrepancy = [];

% Initialise the solution tables storing the optimal solutions and the maximum reward where the optimal solutions were used.
OptSeq = zeros(2^(T-1),T); MaxV = zeros(2^(T-1),T);


% Repeat calculating Q-values for N cases.
for n = 1:N
    cState = 1;                    % Initial state.
    cStateIdx = 1;                 % Location of the current state in the Q-table.
    dHist = 1;                     % Memory variable to save the disease history.
    tHist = [];                    % Memory variable to save the treatment history.
    cMT = 0;                       % Maintenance therapy.
    cProb = 1;                     % The probability of the initial state.
    cSBP = 173.5; cSBPSD = 21.1;   % Initial SBP and SD.

    % For each time period, generate the subsequent states from cState to a terminal state.

