3.6.5 Reinforcement learning: Q-learning


Q-learning was used to solve the simple SDDP, where the health state space and the search space were constructed as for DP (see Figure 3.). Four Q-tables, Q1-Q4, were defined by the number of possible health states and the number of possible drugs (i.e., 1x3 for Q1, 3x3 for Q2, 9x3 for Q3 and 27x3 for Q4). The Q-values were initialised to 0 and gradually updated as the simulator passed through the corresponding state and action at time t. 10,000 cases were simulated at random, which was large enough to observe the Q-values for all possible cases and to achieve convergence. The target to maximise was the sum of the one-step reward IR from the transition between the current state cState and the next state nState, and the discounted Q-value that maximised the one-step future reward from nState (i.e., max(Q{t+1}(nStateIdx,:))), where the discount rate DR was set to 0.8. The future reward Q{t+1}(nStateIdx,:) remained 0 until the corresponding health state had been observed during the simulation procedure.
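As a minimal sketch of this target computation for a single observed transition at time t (using the variable names above; IR and nStateIdx are assumed to be produced elsewhere by the simulator), the update target could be formed as:

% Hypothetical sketch: Q-learning target for one observed transition at time t.
% IR        - one-step reward for moving from cState to nState under the chosen drug.
% nStateIdx - row index of nState in the health state list SqDiz{t+1}.
Qtarget = IR + DR*max(Q{t+1}(nStateIdx,:));  % one-step reward + discounted best future Q-value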

The problem-solving procedure worked forward from t1 to t4. Starting from the initial health state cState, subsequent health states nState and drugs drug were simulated until the terminal state at t4. For each state, a drug was chosen by the ε-greedy action selection method, in which the probability of selecting the currently best drug increased with the number of simulated cases n as ε = 1-(1/log(n+2)).
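A minimal sketch of this ε-greedy selection, assuming cStateIdx indexes the current state in Q{t} and n counts the simulated cases, might look as follows (the feasibility restrictions described below are omitted for brevity):

% Hypothetical sketch: epsilon-greedy drug selection for the current state.
epsilon = 1 - 1/log(n+2);               % probability of exploiting the learned Q-values
if rand < epsilon
    [~,drug] = max(Q{t}(cStateIdx,:));  % exploit: drug with the highest Q-value so far
else
    drug = randi(SS);                   % explore: pick one of the SS drugs at random
end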

The Q-values were updated by the difference dQ between the old Q-value and the newly observed Q-value. As the learning rate 1/sqrt(n+2) gradually decreased, the Q-values gradually converged to stable values. An additional memory variable, discrepancy, was included to track the variation in the Q-values, and mdiscrepancy saved the mean of the discrepancies every 100 cases. The optimal solution for each state, OptSol(:,:), was selected from the Q-table Q{t} at the end of the learning procedure. A feasibility test was included to prevent repetition of the same drug where the health state was Hu or He, and to continue the same drug where the health state was Hn.
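A minimal sketch of this update and of the convergence tracking, assuming Qtarget is formed as shown earlier, cStateIdx and drug index the current state-action pair, and discrepancy starts as an empty vector, could be:

% Hypothetical sketch: Q-value update with a decreasing learning rate.
alpha = 1/sqrt(n+2);                              % learning rate, decreasing with the case counter n
dQ = Qtarget - Q{t}(cStateIdx,drug);              % difference between old and newly observed Q-value
Q{t}(cStateIdx,drug) = Q{t}(cStateIdx,drug) + alpha*dQ;
discrepancy(end+1) = abs(dQ);                     % record the size of the change
if mod(n,100) == 0
    mdiscrepancy(end+1) = mean(discrepancy);      % save the mean discrepancy every 100 cases
    discrepancy = [];
end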


T = 3;              % The number of time periods.
SS = 3;             % The number of possible treatment options (drugs).
HS = 3;             % The number of possible health states.
N = 10000;          % The number of simulated cases.
DR = 0.8;           % Discount rate.
mdiscrepancy = [];  % Mean Q-variations, saved every 100 cases.

% Define the possible health states (including the disease history) in each time period.
SqDiz1 = 1;
SqDiz2 = [1,1;1,2;1,3];
SqDiz3 = [1,1,1;1,1,2;1,1,3;1,2,1;1,2,2;1,2,3;1,3,1;1,3,2;1,3,3];
SqDiz4 = [1,1,1,1;1,1,1,2;1,1,1,3;1,1,2,1;1,1,2,2;1,1,2,3;...
          1,1,3,1;1,1,3,2;1,1,3,3;1,2,1,1;1,2,1,2;1,2,1,3;...
          1,2,2,1;1,2,2,2;1,2,2,3;1,2,3,1;1,2,3,2;1,2,3,3;...
          1,3,1,1;1,3,1,2;1,3,1,3;1,3,2,1;1,3,2,2;1,3,2,3;...
          1,3,3,1;1,3,3,2;1,3,3,3];
SqDiz = {SqDiz1,SqDiz2,SqDiz3,SqDiz4};

% Initialise the Q-tables for each time period to 0 (states x drugs).
Q1 = zeros(HS^0,SS);  % 1x3 Q-table at t1.
Q2 = zeros(HS^1,SS);  % 3x3 Q-table at t2.
Q3 = zeros(HS^2,SS);  % 9x3 Q-table at t3.
Q4 = zeros(HS^3,SS);  % 27x3 Q-table at t4.
Q = {Q1,Q2,Q3,Q4};
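Once the N cases have been simulated with the tables initialised above, the optimal drug for each state could be read from the learned Q-tables along the following lines (a sketch only; the exact layout of OptSol in the full model is not shown here, so a cell array is assumed for illustration):

% Hypothetical sketch: extract the optimal drug per state from the learned Q-tables.
OptSol = cell(1,numel(Q));
for t = 1:numel(Q)
    [~,OptSol{t}] = max(Q{t},[],2);  % column index of the best drug for every state row in Q{t}
end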
