7.6 Reinforcement learning: Q-learning
RL, in the form of Q-learning, was implemented for the hypertension SDDP model. In the base-case, the Q-values were gradually updated using both the immediate reward and the one-step future reward. Q-learning using a two-step or three-step future reward was also implemented to compare performance; in these cases, the feedback was the immediate reward plus the mean total net benefit from two or three transitions into the future. The total number of cases was also increased from 1,000 to 1,000,000 to test the impact on the quality of the solution and on the computational time. A penalty function forced the Q-value to 0 if the selected drug violated the feasibility restrictions imposed in the hypertension SDDP model.
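As an illustration of the mechanics described above, the sketch below shows a tabular Q-learning update with an n-step return and a feasibility penalty. It is a minimal sketch only: the state and action representations, the learning rate, and the helper functions `feasible()` and `actions()` are assumptions for illustration, not the implementation used for the hypertension SDDP model, and the exact form of the multi-step feedback (described above as the mean total net benefit from two or three future transitions) may differ from the standard n-step return used here.

```python
from collections import defaultdict

# Minimal sketch of tabular Q-learning with an n-step return and a
# feasibility penalty. All names and parameter values are illustrative
# assumptions, not the thesis implementation.

ALPHA = 0.1   # learning rate (assumed value)
GAMMA = 1.0   # no discounting within the four-period horizon (assumed)

Q = defaultdict(float)   # Q[(state, action)] -> estimated net benefit


def actions(state):
    """Placeholder for the drug choices available in a given state."""
    return ["Ds", "CCBs", "BBs", "ACEIs/ARBs"]


def feasible(state, action):
    """Placeholder for the feasibility restrictions of the SDDP model."""
    return True


def q_update(trajectory, n_step=1):
    """Update Q-values from one simulated case.

    trajectory: list of (state, action, reward, next_state) tuples, where
                the reward is the net benefit observed in that period.
    n_step:     1, 2 or 3 -- how many future transitions feed back.
    """
    for t, (s, a, _, _) in enumerate(trajectory):
        if not feasible(s, a):
            Q[(s, a)] = 0.0          # penalty: infeasible choices forced to zero
            continue

        # n-step return: the immediate reward plus the rewards of up to
        # n_step - 1 subsequent transitions...
        end = min(t + n_step, len(trajectory))
        g = sum(GAMMA ** (i - t) * trajectory[i][2] for i in range(t, end))

        # ...then bootstrap from the best Q-value of the state reached,
        # unless the four-period horizon has already ended.
        if end < len(trajectory):
            s_next = trajectory[end - 1][3]
            g += GAMMA ** (end - t) * max(Q[(s_next, b)] for b in actions(s_next))

        Q[(s, a)] += ALPHA * (g - Q[(s, a)])
```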
Table 7. summarises the results from Q-learning in the base-case. Compared with the optimal solution (and the statistically equivalent solutions) identified by enumeration, the solutions identified by RL were more heavily influenced by the three-month SBP-lowering effect of the drugs; this was more evident where the feedback was the one-step future reward (see Table 7.) than where the feedback was the two-step or three-step future reward (see Table 7. and Table 7.). Whereas the optimal initial treatment found by enumeration was Ds or ACEIs/ARBs, the optimal solutions identified by Q-learning started with either CCBs (1,000, 100,000 and 1,000,000 cases) or BBs (10,000 cases). The optimal subsequent treatments after the initial drug were also sensitive to the number of cases. In the base-case, the optimal second-line treatment was switching to BBs+ACEIs/ARBs (1,000 and 10,000 cases) or Ds (100,000 and 1,000,000 cases). For the patients who again failed to achieve the treatment goal, the optimal third-line treatment was Ds+CCBs in all scenarios. In the fourth period, the optimal solution for the patients who had never achieved the treatment goal with the previous treatments was Ds (1,000 and 10,000 cases), Ds+BBs (100,000 cases) or Ds+BBs+CCBs (1,000,000 cases).
Better solutions were found as more cases were observed. In the base-case, the solution identified from 1,000,000 cases produced a total net benefit of £312,822, whereas the solution identified from 1,000 cases produced £294,335. Where two-step future rewards were used, the total net benefit of the solution identified from 1,000,000 cases was again higher than that of the solution identified from 1,000 cases (£313,497 versus £302,699). The same pattern was observed where three-step future rewards were used, although the difference was smaller than in the other scenarios (£302,363 versus £302,327). This is consistent with the general property that RL converges towards the optimal solution as more cases are continuously observed.
The best solution was identified when the future feedback came from two transitions into the future (see Table 7.). Where 100,000 cases were observed, the optimal treatment sequence was to use Ds initially and then, for patients who consecutively failed to achieve the treatment goal, to switch to Ds+ACEIs/ARBs, Ds+BBs+ACEIs/ARBs and Ds+BBs+CCBs in that order; the expected total net benefit of this solution was £314,297. More stable convergence of the Q-values was achieved when more than 100,000 cases were observed, whereas the discrepancy in the Q-values did not fully converge when fewer than 10,000 cases were observed (see Figure 7.). The computational time for 1,000,000 cases was much longer than for SA and GA, and even longer than for enumeration, while the quality of the solutions was lower than that achieved by those methods; as such, Q-learning was not deemed computationally efficient when 1,000,000 cases were generated.
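For reference, one way the convergence diagnostic shown in Figure 7. could be computed is sketched below, building on the Q-learning sketch above: after each simulated case, the largest absolute change made to any Q-value is recorded, and the resulting series is inspected against the number of cases. The definition of the discrepancy and the `simulate_case()` generator are assumptions for illustration; the thesis may track a different quantity.

```python
def run_cases(n_cases, simulate_case, n_step=1):
    """Process simulated cases and record a convergence diagnostic.

    simulate_case() is assumed to return one trajectory in the format
    expected by q_update(); the "discrepancy" recorded here is the largest
    absolute change made to any Q-value while processing that case.
    """
    discrepancies = []
    for _ in range(n_cases):
        q_before = dict(Q)                        # snapshot of the Q-table
        q_update(simulate_case(), n_step=n_step)  # learn from one case
        diff = max((abs(Q[k] - q_before.get(k, 0.0)) for k in Q), default=0.0)
        discrepancies.append(diff)
    return discrepancies
```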
Table 7. Results from RL depending on the number of iterations where the feedback is from the one-step future reward

| Period | Health state history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|---|---|---|---|---|---|
| t=1 | Failure | CCBs | BBs | CCBs | CCBs |
| t=2 | Failure, Failure | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=2 | Failure, Success | CCBs | BBs | CCBs | CCBs |
| t=3 | Failure, Failure, Failure | Ds+CCBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=3 | Failure, Failure, Success | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=3 | Failure, Success, Failure | ACEIs/ARBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=3 | Failure, Success, Success | CCBs | BBs | CCBs | CCBs |
| t=4 | Failure, Failure, Failure, Failure | Ds | Ds | Ds+BBs | Ds+BBs+CCBs |
| t=4 | Failure, Failure, Failure, Success | Ds+CCBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Failure, Success, Failure | Ds | Ds | ACEIs/ARBs | ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Success | BBs+ACEIs/ARBs | BBs+ACEIs/ARBs | Ds | Ds |
| t=4 | Failure, Success, Failure, Failure | Ds | Ds | Ds | Ds |
| t=4 | Failure, Success, Failure, Success | ACEIs/ARBs | Ds+CCBs | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Success, Success, Failure | BBs | Ds | Ds+CCBs | Ds+CCBs |
| t=4 | Failure, Success, Success, Success | CCBs | BBs | CCBs | CCBs |
| Optimal value (£) | | 294,334.5 | 294,552.7 | 300,576.6 | 312,822.2 |
| Computational time 1) | | 1m | 11m | 1.44h | 19.45h |

1) s represents seconds, m represents minutes and h represents hours.
Table 7. Results from RL depending on the number of iterations where the feedback is from the two-step future reward

| Period | Health state history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|---|---|---|---|---|---|
| t=1 | Failure | CCBs | BBs | Ds | BBs |
| t=2 | Failure, Failure | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=2 | Failure, Success | CCBs | BBs | Ds | BBs |
| t=3 | Failure, Failure, Failure | Ds | CCBs+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs | Ds |
| t=3 | Failure, Failure, Success | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=3 | Failure, Success, Failure | Ds | Ds | Ds+CCBs | Ds |
| t=3 | Failure, Success, Success | CCBs | BBs | Ds | BBs |
| t=4 | Failure, Failure, Failure, Failure | BBs+CCBs | Ds+ACEIs/ARBs | Ds+BBs+CCBs | Ds+BBs+CCBs |
| t=4 | Failure, Failure, Failure, Success | Ds | CCBs+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs | Ds |
| t=4 | Failure, Failure, Success, Failure | CCBs+ACEIs/ARBs | Ds+ACEIs/ARBs | CCBs+ACEIs/ARBs | CCBs+ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Success | Ds+ACEIs/ARBs | Ds | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=4 | Failure, Success, Failure, Failure | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs |
| t=4 | Failure, Success, Failure, Success | Ds | Ds | Ds+CCBs | Ds |
| t=4 | Failure, Success, Success, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | BBs+CCBs | Ds+BBs+CCBs |
| t=4 | Failure, Success, Success, Success | CCBs | BBs | Ds | BBs |
| Optimal value (£) | | 302,699.0 | 306,131.5 | 314,296.6 | 313,496.8 |
| Computational time 1) | | 1.06m | 10.59m | 1.73h | 17.10h |

1) s represents seconds, m represents minutes and h represents hours.
Table 7. Results from RL depending on the number of iterations where the feedback is from the three-step future reward

| Period | Health state history | 1,000 cases | 10,000 cases | 100,000 cases | 1,000,000 cases |
|---|---|---|---|---|---|
| t=1 | Failure | ACEIs/ARBs | Ds | BBs | BBs |
| t=2 | Failure, Failure | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=2 | Failure, Success | ACEIs/ARBs | Ds | BBs | BBs |
| t=3 | Failure, Failure, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=3 | Failure, Failure, Success | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=3 | Failure, Success, Failure | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=3 | Failure, Success, Success | ACEIs/ARBs | Ds | BBs | BBs |
| t=4 | Failure, Failure, Failure, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Failure, Failure, Success | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=4 | Failure, Failure, Success, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Failure, Success, Success | CCBs | CCBs | BBs+ACEIs/ARBs | CCBs |
| t=4 | Failure, Success, Failure, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+BBs |
| t=4 | Failure, Success, Failure, Success | Ds+BBs+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+ACEIs/ARBs | Ds+BBs+ACEIs/ARBs |
| t=4 | Failure, Success, Success, Failure | Ds+BBs | Ds+BBs | Ds+BBs | Ds+CCBs+ACEIs/ARBs |
| t=4 | Failure, Success, Success, Success | ACEIs/ARBs | Ds | BBs | BBs |
| Optimal value (£) | | 302,326.6 | 300,539.0 | 300,625.8 | 302,362.5 |
| Computational time 1) | | 59.13s | 9.67m | 1.61h | 15.73h |

1) s represents seconds, m represents minutes and h represents hours.
Figure 7. Convergence of the discrepancy in the Q-values from RL