7.6 Reinforcement learning: Q-learning


RL, based on Q-learning, was implemented for the hypertension SDDP model. In the base-case, Q-values were updated gradually using both the immediate reward and the one-step future reward. Q-learning using a two-step or three-step future reward was also implemented for comparison; in these cases, the feedback comprised the immediate reward and the mean total net benefit from two or three transitions in the future. The total number of cases was also increased from 1,000 to 1,000,000 to test the impact on solution quality and computational time. A penalty function forced the Q-value to 0 if the selected drug violated the feasibility restrictions made in the hypertension SDDP model.
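To make the update rule concrete, the sketch below shows a minimal tabular Q-learning step with the feasibility penalty described above. It is an illustration only: the function names (choose_drug, q_update), the epsilon-greedy exploration and the parameter values are assumptions rather than the implementation used in the hypertension SDDP model.

```python
import random
from collections import defaultdict

# Minimal sketch of the Q-learning update described above.
# ALPHA, EPSILON, GAMMA and the epsilon-greedy rule are assumed values,
# not those used in the hypertension SDDP model.
ALPHA, EPSILON, GAMMA = 0.1, 0.1, 1.0

Q = defaultdict(float)  # Q[(state, drug)] -> estimated net benefit


def choose_drug(state, drugs, feasible):
    """Epsilon-greedy choice among the drugs allowed by the feasibility rules."""
    allowed = [d for d in drugs if feasible(state, d)]
    if random.random() < EPSILON:
        return random.choice(allowed)
    return max(allowed, key=lambda d: Q[(state, d)])


def q_update(state, drug, reward, future_rewards, next_state, next_drugs, feasible):
    """Move Q[(state, drug)] towards the immediate reward plus future feedback.

    reward         -- immediate net benefit of the chosen drug
    future_rewards -- net benefits observed over the next two or three
                      transitions (empty for the one-step base-case)
    """
    if not feasible(state, drug):
        Q[(state, drug)] = 0.0  # penalty: infeasible drug forced to 0
        return
    if not future_rewards:
        # base-case: one-step future reward, bootstrapped from the next state
        future = GAMMA * max((Q[(next_state, d)] for d in next_drugs), default=0.0)
    else:
        # two-/three-step variant: mean total net benefit of the observed transitions
        future = sum(future_rewards) / len(future_rewards)
    target = reward + future
    Q[(state, drug)] += ALPHA * (target - Q[(state, drug)])
```

A full run would repeat choose_drug and q_update over the simulated cases (from 1,000 up to 1,000,000) and read off the greedy policy from Q at the end.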

Table 7. summarises the results from Q-learning in the base-case. Compared with the optimal solution (and the statistically equivalent solutions) identified by enumeration, the solutions identified by RL were more heavily influenced by the three-month SBP-lowering effect of the drugs; this was more evident where the feedback was from the one-step future reward (see Table 7.) than where the feedback was from the two-step or three-step future reward (see Table 7. and Table 7.). Whereas the optimal initial solution found by enumeration was Ds or ACEIs/ARBs, the optimal solutions identified by Q-learning started with either CCBs (1,000, 100,000 and 1,000,000 cases) or BBs (10,000 cases). The optimal subsequent treatments after the initial drug were also sensitive to the number of cases. In the base-case, the optimal second-line treatment was switching to BBs+ACEIs/ARBs (1,000 and 10,000 cases) or Ds (100,000 and 1,000,000 cases). For the patients who again failed to achieve the treatment goal, the optimal third-line treatment was Ds+CCBs in all scenarios. In the fourth period, the optimal solution for the patients who had never achieved the treatment goal with the previous treatments was Ds (1,000 and 10,000 cases), Ds+BBs (100,000 cases) or Ds+BBs+CCBs (1,000,000 cases).

A better solution was found when more cases were observed. In the base-case, the solution identified from 1,000,000 cases produced a total net benefit of £312,822, whereas the solution from 1,000 cases produced £294,335. Where two-step future expected rewards were used, the total net benefit of the solution identified from 1,000,000 cases was again higher than that of the solution identified from 1,000 cases (£313,497 versus £302,699). The same pattern was observed where three-step future expected rewards were used, although the difference was smaller than in the other scenarios (£302,363 versus £302,327). This agrees with the general property that RL converges towards the optimal solution as more cases are observed.

The best solution was identified when the future feedback came from two transitions in the future (see Table 7.). Where 100,000 cases were observed, the optimal treatment sequence was to use Ds initially and then switch to Ds+ACEIs/ARBs, Ds+BBs+ACEIs/ARBs and Ds+BBs+CCBs, in that order, for the patients who consecutively failed to achieve the treatment goal; the expected total net benefit of this solution was £314,297. More stable convergence in the Q-values was achieved when more than 100,000 cases were observed, whereas the discrepancy in the Q-values did not fully converge where fewer than 10,000 cases were observed (see Figure 7.). The computational time for 1,000,000 cases was much longer than for SA and GA, and even longer than for enumeration, while the quality of the solutions was lower than for those methods; as such, Q-learning was not deemed computationally efficient when 1,000,000 cases were generated.
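The convergence check reported in Figure 7. can be illustrated with a short helper that compares two snapshots of the Q-table. This is a sketch under the assumption that convergence is judged by the largest absolute change in any Q-value between successive blocks of cases; the name max_q_discrepancy is hypothetical, not taken from the thesis code.

```python
def max_q_discrepancy(q_before, q_after):
    """Largest absolute change in any Q-value between two snapshots.

    Sketch only: assumes the Q-tables are dicts keyed by (state, drug).
    A small, stable value over successive blocks of simulated cases
    indicates the kind of convergence seen for runs of 100,000 cases or more.
    """
    keys = set(q_before) | set(q_after)
    return max((abs(q_after.get(k, 0.0) - q_before.get(k, 0.0)) for k in keys),
               default=0.0)
```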



Table 7. Results from RL depending on the number of iterations where the feedback is from the one-step future reward. Cells show the optimal solution (drug choice) for each period and treatment history (the sequence of Failure/Success outcomes against the treatment goal observed so far), by the number of cases.

| Period | Treatment history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|--------|-------------------|-------------|--------------|---------------|-----------------|
| t=1 | Failure | CCBs | BBs | CCBs | CCBs |
| t=2 | Failure, Failure | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=2 | Failure, Success | CCBs | BBs | CCBs | CCBs |
| t=3 | Failure, Failure, Failure | Ds+CCBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=3 | Failure, Failure, Success | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=3 | Failure, Success, Failure | ACEIs/ARBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=3 | Failure, Success, Success | CCBs | BBs | CCBs | CCBs |
| t=4 | Failure, Failure, Failure, Failure | Ds | Ds | Ds+BBs | Ds+BBs+CCBs |
| t=4 | Failure, Failure, Failure, Success | Ds+CCBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Failure, Success, Failure | Ds | Ds | ACEIs/ARBs | ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Success | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=4 | Failure, Success, Failure, Failure | Ds | Ds | Ds | Ds |
| t=4 | Failure, Success, Failure, Success | ACEIs/ARBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Success, Success, Failure | BBs | Ds | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Success, Success, Success | CCBs | BBs | CCBs | CCBs |
| Optimal value (£) | | 294,334.5 | 294,552.7 | 300,576.6 | 312,822.2 |
| Computational time 1) | | 1m | 11m | 1.44h | 19.45h |

1) s represents seconds, m represents minutes and h represents hours.

Table 7. Results from RL depending on the number of iterations where the feedback is from the two-step future reward. Cells show the optimal solution (drug choice) for each period and treatment history, by the number of cases.

| Period | Treatment history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|--------|-------------------|-------------|--------------|---------------|-----------------|
| t=1 | Failure | CCBs | BBs | Ds | BBs |
| t=2 | Failure, Failure | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=2 | Failure, Success | CCBs | BBs | Ds | BBs |
| t=3 | Failure, Failure, Failure | Ds | CCBs+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs | Ds |
| t=3 | Failure, Failure, Success | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=3 | Failure, Success, Failure | Ds | Ds | Ds+CCBs | Ds |
| t=3 | Failure, Success, Success | CCBs | BBs | Ds | BBs |
| t=4 | Failure, Failure, Failure, Failure | BBs+CCBs | Ds+ACEIs/ARBs | Ds+BBs+CCBs | Ds+BBs+CCBs |
| t=4 | Failure, Failure, Failure, Success | Ds | CCBs+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs | Ds |
| t=4 | Failure, Failure, Success, Failure | CCBs+ACEIs/ARBs | Ds+ACEIs/ARBs | CCBs+ACEIs/ARBs | CCBs+ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Success | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=4 | Failure, Success, Failure, Failure | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=4 | Failure, Success, Failure, Success | Ds | Ds | Ds+CCBs | Ds |
| t=4 | Failure, Success, Success, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | BBs+CCBs | Ds+BBs+CCBs |
| t=4 | Failure, Success, Success, Success | CCBs | BBs | Ds | BBs |
| Optimal value (£) | | 302,699.0 | 306,131.5 | 314,296.6 | 313,496.8 |
| Computational time 1) | | 1.06m | 10.59m | 1.73h | 17.10h |

1) s represents seconds, m represents minutes and h represents hours.

Table 7. Results from RL depending on the number of iterations where the feedback is from the three-step future reward. Cells show the optimal solution (drug choice) for each period and treatment history, by the number of cases.

| Period | Treatment history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|--------|-------------------|-------------|--------------|---------------|-----------------|
| t=1 | Failure | ACEIs/ARBs | Ds | BBs | BBs |
| t=2 | Failure, Failure | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=2 | Failure, Success | ACEIs/ARBs | Ds | BBs | BBs |
| t=3 | Failure, Failure, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=3 | Failure, Failure, Success | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=3 | Failure, Success, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=3 | Failure, Success, Success | ACEIs/ARBs | Ds | BBs | BBs |
| t=4 | Failure, Failure, Failure, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Failure, Failure, Success | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Failure, Success, Success | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=4 | Failure, Success, Failure, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Success, Failure, Success | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=4 | Failure, Success, Success, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+CCBs+ACEIs/ARBs |
| t=4 | Failure, Success, Success, Success | ACEIs/ARBs | Ds | BBs | BBs |
| Optimal value (£) | | 302,326.6 | 300,539.0 | 300,625.8 | 302,362.5 |
| Computational time 1) | | 59.13s | 9.67m | 1.61h | 15.73h |

1) s represents seconds, m represents minutes and h represents hours.













Figure 7. Convergence of the discrepancy in the Q-values from RL
