7.6 Reinforcement learning: Q-learning


RL, based on Q-learning, was implemented for the hypertension SDDP model. In the base-case, Q-values were updated gradually using both the immediate reward and the one-step future reward. Q-learning using a two-step or three-step future reward was also implemented for comparison; in these cases, the feedback comprised the immediate reward and the mean total net benefit from two or three transitions in the future. The total number of cases was also increased from 1,000 to 1,000,000 to test the impact on solution quality and computational time. A penalty function forced the Q-value to 0 if the selected drug violated the feasibility restrictions made in the hypertension SDDP model.
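To make the update rule concrete, the sketch below shows a minimal tabular Q-learning step with the feasibility penalty described above. It is an illustration only: the function names (choose_drug, q_update), the epsilon-greedy exploration and the parameter values are assumptions rather than the implementation used in the hypertension SDDP model.

```python
import random
from collections import defaultdict

# Minimal sketch of the Q-learning update described above.
# ALPHA, EPSILON, GAMMA and the epsilon-greedy rule are assumed values,
# not those used in the hypertension SDDP model.
ALPHA, EPSILON, GAMMA = 0.1, 0.1, 1.0

Q = defaultdict(float)  # Q[(state, drug)] -> estimated net benefit


def choose_drug(state, drugs, feasible):
    """Epsilon-greedy choice among the drugs allowed by the feasibility rules."""
    allowed = [d for d in drugs if feasible(state, d)]
    if random.random() < EPSILON:
        return random.choice(allowed)
    return max(allowed, key=lambda d: Q[(state, d)])


def q_update(state, drug, reward, future_rewards, next_state, next_drugs, feasible):
    """Move Q[(state, drug)] towards the immediate reward plus future feedback.

    reward         -- immediate net benefit of the chosen drug
    future_rewards -- net benefits observed over the next two or three
                      transitions (empty for the one-step base-case)
    """
    if not feasible(state, drug):
        Q[(state, drug)] = 0.0  # penalty: infeasible drug forced to 0
        return
    if not future_rewards:
        # base-case: one-step future reward, bootstrapped from the next state
        future = GAMMA * max((Q[(next_state, d)] for d in next_drugs), default=0.0)
    else:
        # two-/three-step variant: mean total net benefit of the observed transitions
        future = sum(future_rewards) / len(future_rewards)
    target = reward + future
    Q[(state, drug)] += ALPHA * (target - Q[(state, drug)])
```

A full run would repeat choose_drug and q_update over the simulated cases (from 1,000 up to 1,000,000) and read off the greedy policy from Q at the end.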

Table 7. summarises the results from Q-learning in the base-case. Compared with the optimal solution (and the statistically equivalent solutions) identified by enumeration, the solutions identified by RL were more heavily influenced by the three-month SBP-lowering effect of the drugs; this was more evident where the feedback was from the one-step future reward (see Table 7.) than where the feedback was from the two-step or three-step future reward (see Table 7. and Table 7.). Whereas the optimal initial solution found by enumeration was Ds or ACEIs/ARBs, the optimal solutions identified by Q-learning started with either CCBs (1,000, 100,000 and 1,000,000 cases) or BBs (10,000 cases). The optimal subsequent treatments after the initial drug were also sensitive to the number of cases. In the base-case, the optimal second-line treatment was switching to BBs+ACEIs/ARBs (1,000 and 10,000 cases) or Ds (100,000 and 1,000,000 cases). For the patients who again failed to achieve the treatment goal, the optimal third-line treatment was Ds+CCBs in all scenarios. In the fourth period, the optimal solution for the patients who had never achieved the treatment goal with the previous treatments was Ds (1,000 and 10,000 cases), Ds+BBs (100,000 cases) or Ds+BBs+CCBs (1,000,000 cases).

A better solution was found when more cases were observed. In the base-case, the solution identified from 1,000,000 cases produced a total net benefit of £312,822, whereas the solution from 1,000 cases produced £294,335. Where two-step future expected rewards were used, the total net benefit of the solution identified from 1,000,000 cases was again higher than that of the solution identified from 1,000 cases (£313,497 versus £302,699). The same pattern was observed where three-step future expected rewards were used, although the difference was smaller than in the other scenarios (£302,363 versus £302,327). This agrees with the general property that RL converges towards the optimal solution as more cases are observed.

The best solution was identified when the future feedback came from two transitions in the future (see Table 7.). Where 100,000 cases were observed, the optimal treatment sequence was to use Ds initially and then switch to Ds+ACEIs/ARBs, Ds+BBs+ACEIs/ARBs and Ds+BBs+CCBs, in that order, for the patients who consecutively failed to achieve the treatment goal; the expected total net benefit of this solution was £314,297. More stable convergence in the Q-values was achieved when more than 100,000 cases were observed, whereas the discrepancy in the Q-values did not fully converge where fewer than 10,000 cases were observed (see Figure 7.). The computational time for 1,000,000 cases was much longer than for SA and GA, and even longer than for enumeration, while the quality of the solutions was lower than for those methods; as such, Q-learning was not deemed computationally efficient when 1,000,000 cases were generated.
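The convergence check reported in Figure 7. can be illustrated with a short helper that compares two snapshots of the Q-table. This is a sketch under the assumption that convergence is judged by the largest absolute change in any Q-value between successive blocks of cases; the name max_q_discrepancy is hypothetical, not taken from the thesis code.

```python
def max_q_discrepancy(q_before, q_after):
    """Largest absolute change in any Q-value between two snapshots.

    Sketch only: assumes the Q-tables are dicts keyed by (state, drug).
    A small, stable value over successive blocks of simulated cases
    indicates the kind of convergence seen for runs of 100,000 cases or more.
    """
    keys = set(q_before) | set(q_after)
    return max((abs(q_after.get(k, 0.0) - q_before.get(k, 0.0)) for k in keys),
               default=0.0)
```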



Table 7. Results from RL depending on the number of iterations where the feedback is from the one-step future reward. Cells show the optimal solution (drug choice) for each period and treatment history (the sequence of Failure/Success outcomes against the treatment goal observed so far), by the number of cases.

| Period | Treatment history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|--------|-------------------|-------------|--------------|---------------|-----------------|
| t=1 | Failure | CCBs | BBs | CCBs | CCBs |
| t=2 | Failure, Failure | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=2 | Failure, Success | CCBs | BBs | CCBs | CCBs |
| t=3 | Failure, Failure, Failure | Ds+CCBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=3 | Failure, Failure, Success | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=3 | Failure, Success, Failure | ACEIs/ARBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=3 | Failure, Success, Success | CCBs | BBs | CCBs | CCBs |
| t=4 | Failure, Failure, Failure, Failure | Ds | Ds | Ds+BBs | Ds+BBs+CCBs |
| t=4 | Failure, Failure, Failure, Success | Ds+CCBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Failure, Success, Failure | Ds | Ds | ACEIs/ARBs | ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Success | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=4 | Failure, Success, Failure, Failure | Ds | Ds | Ds | Ds |
| t=4 | Failure, Success, Failure, Success | ACEIs/ARBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Success, Success, Failure | BBs | Ds | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Success, Success, Success | CCBs | BBs | CCBs | CCBs |
| Optimal value (£) | | 294,334.5 | 294,552.7 | 300,576.6 | 312,822.2 |
| Computational time 1) | | 1m | 11m | 1.44h | 19.45h |

1) s represents seconds, m represents minutes and h represents hours.

Table 7. Results from RL depending on the number of iterations where the feedback is from the two-step future reward. Cells show the optimal solution (drug choice) for each period and treatment history, by the number of cases.

| Period | Treatment history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|--------|-------------------|-------------|--------------|---------------|-----------------|
| t=1 | Failure | CCBs | BBs | Ds | BBs |
| t=2 | Failure, Failure | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=2 | Failure, Success | CCBs | BBs | Ds | BBs |
| t=3 | Failure, Failure, Failure | Ds | CCBs+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs | Ds |
| t=3 | Failure, Failure, Success | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=3 | Failure, Success, Failure | Ds | Ds | Ds+CCBs | Ds |
| t=3 | Failure, Success, Success | CCBs | BBs | Ds | BBs |
| t=4 | Failure, Failure, Failure, Failure | BBs+CCBs | Ds+ACEIs/ARBs | Ds+BBs+CCBs | Ds+BBs+CCBs |
| t=4 | Failure, Failure, Failure, Success | Ds | CCBs+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs | Ds |
| t=4 | Failure, Failure, Success, Failure | CCBs+ACEIs/ARBs | Ds+ACEIs/ARBs | CCBs+ACEIs/ARBs | CCBs+ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Success | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=4 | Failure, Success, Failure, Failure | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=4 | Failure, Success, Failure, Success | Ds | Ds | Ds+CCBs | Ds |
| t=4 | Failure, Success, Success, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | BBs+CCBs | Ds+BBs+CCBs |
| t=4 | Failure, Success, Success, Success | CCBs | BBs | Ds | BBs |
| Optimal value (£) | | 302,699.0 | 306,131.5 | 314,296.6 | 313,496.8 |
| Computational time 1) | | 1.06m | 10.59m | 1.73h | 17.10h |

1) s represents seconds, m represents minutes and h represents hours.

Table 7. Results from RL depending on the number of iterations where the feedback is from the three-step future reward. Cells show the optimal solution (drug choice) for each period and treatment history, by the number of cases.

| Period | Treatment history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|--------|-------------------|-------------|--------------|---------------|-----------------|
| t=1 | Failure | ACEIs/ARBs | Ds | BBs | BBs |
| t=2 | Failure, Failure | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=2 | Failure, Success | ACEIs/ARBs | Ds | BBs | BBs |
| t=3 | Failure, Failure, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=3 | Failure, Failure, Success | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=3 | Failure, Success, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=3 | Failure, Success, Success | ACEIs/ARBs | Ds | BBs | BBs |
| t=4 | Failure, Failure, Failure, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Failure, Failure, Success | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Failure, Success, Success | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=4 | Failure, Success, Failure, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Success, Failure, Success | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=4 | Failure, Success, Success, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+CCBs+ACEIs/ARBs |
| t=4 | Failure, Success, Success, Success | ACEIs/ARBs | Ds | BBs | BBs |
| Optimal value (£) | | 302,326.6 | 300,539.0 | 300,625.8 | 302,362.5 |
| Computational time 1) | | 59.13s | 9.67m | 1.61h | 15.73h |

1) s represents seconds, m represents minutes and h represents hours.













Figure 7. Convergence of the discrepancy in the Q-values from RL
