Sequential drug decision problems in long-term medical conditions: a case study of primary hypertension. Eunju Kim, BA, MA, MSc




3.5.3 Reinforcement learning


One of the most widely used ADP techniques is RL, which uses simulation to overcome the curses of modelling and dimensionality of classic DP; for this reason it is often called simulation-based DP [200]. RL requires the elements of the value function of DP but not the transition matrices. Instead, the transition probabilities and rewards are generated within a simulator, which governs the environment including the random variables. The fundamental idea is that a goal-directed agent, which initially does not know what effect its actions will produce, keeps interacting with an uncertain environment and gradually learns the best action to take in each state.

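To make this interaction concrete, the following is a minimal, hypothetical sketch in Python of the agent-environment loop described above. The two-state, two-action blood-pressure simulator (state names, action names, probabilities and rewards) is an illustrative assumption for this sketch only, not the hypertension model developed in this thesis; the point is that the agent only ever observes sampled transitions, never the underlying probabilities.

```python
import random

# Minimal sketch of the agent-environment interaction described above.
# The simulator hides its transition probabilities and rewards; the agent
# only observes sampled transitions. The toy two-state, two-action dynamics
# below are illustrative assumptions, not the model used in this thesis.

STATES = ["controlled", "uncontrolled"]   # hypothetical blood-pressure states
ACTIONS = ["maintain", "switch_drug"]     # hypothetical treatment actions


def simulate_step(state, action):
    """Sample the next state and reward; the agent never sees these probabilities."""
    p_control = 0.8 if action == "switch_drug" else 0.6
    next_state = "controlled" if random.random() < p_control else "uncontrolled"
    reward = 1.0 if next_state == "controlled" else 0.0
    return next_state, reward


def run_episode(policy, horizon=12):
    """Interact with the simulator for a fixed horizon and return the trajectory."""
    state, trajectory = "uncontrolled", []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = simulate_step(state, action)
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory


# A naive starting policy: the agent initially acts without knowing the dynamics.
def random_policy(state):
    return random.choice(ACTIONS)


print(run_episode(random_policy)[:2])
```
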
The conventional approach to learning the value function is temporal-difference (TD) learning, which looks ahead to a sampled successor state, as in the Monte Carlo method, and then updates the value of the current state using an estimate of the successor state's value, as in DP (this is called bootstrapping). It neither needs to wait until the actual return is known at the terminal state, as the Monte Carlo method does, nor needs to know the actual value function of the optimal solutions following s_t, as DP does. Instead, TD samples n transitions from the current state and uses the sum, or a weighted average, of the n-step returns to come as an estimate of the expected return from the current state. The general form of TD learning is:


V(s_t) ← V(s_t) + ɑ[r_{t+1} + γV(s_{t+1}) − V(s_t)],    (Equation 3.6)

where ɑ is the learning rate and γ is the discount rate.
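
As an illustration only, the sketch below applies the update in Equation 3.6 to trajectories generated by the toy simulator in the previous sketch (it reuses run_episode and random_policy from there). The discount rate, episode count and 1/t step-size schedule are assumptions made for the example, anticipating the discussion of ɑ and γ that follows.

```python
from collections import defaultdict

# A minimal TD(0) sketch of the update in Equation 3.6, reusing run_episode
# and random_policy from the previous sketch:
#   V(s_t) <- V(s_t) + a * [r_{t+1} + g * V(s_{t+1}) - V(s_t)]
# The discount rate, episode count and 1/t step-size schedule are assumptions.

def td0_evaluate(policy, episodes=500, gamma=0.95):
    V = defaultdict(float)            # value estimates, initialised to 0
    visits = defaultdict(int)         # per-state visit counts for the 1/t schedule
    for _ in range(episodes):
        for s, a, r, s_next in run_episode(policy):
            visits[s] += 1
            alpha = 1.0 / visits[s]   # decaying learning rate, as discussed below
            td_target = r + gamma * V[s_next]    # new estimate of V(s)
            V[s] += alpha * (td_target - V[s])   # Equation 3.6
    return dict(V)


print(td0_evaluate(random_policy))
```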


In Equation 3.6, r_{t+1} is the actual return observed one step after time t. The target of the TD update is r_{t+1} + γV(s_{t+1}), and the old estimate V(s_t) is gradually updated with the difference between it and this new estimate. The discount rate γ ∈ [0,1] determines the importance given to future rewards in decision-making: as γ tends to 1, higher weight is placed on future rewards. The learning rate, or step-size parameter, ɑ (e.g., ɑ = 1/t) controls to what extent the new information (i.e., the difference between the old estimate and the new estimate) is taken into account. The learning rate starts at its initial value and gradually decays to 0 so that the learning algorithm converges to a single answer independent of ɑ. If ɑ is small, the agent tends to rely on the old information, whereas if ɑ is large, the agent tends to replace the old information with the newly obtained information. TD learning can easily be extended to the control problem, that is, learning the optimal policy π*. One widely used learning algorithm is Q-learning, a specific kind of RL that assigns values to state-action pairs, as sketched below.
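
The following is a minimal Q-learning sketch, again reusing the toy simulator (simulate_step, ACTIONS) from the first sketch. The ε-greedy exploration rule and all parameter values are illustrative assumptions rather than settings taken from this thesis; the key difference from the TD(0) sketch is that values are attached to state-action pairs and the update bootstraps on the best action available in the next state.

```python
import random
from collections import defaultdict

# A minimal Q-learning sketch, reusing simulate_step and ACTIONS from the
# first sketch. Epsilon-greedy exploration and the parameter values are
# illustrative assumptions, not settings taken from the thesis.

def q_learning(episodes=2000, horizon=12, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)        # Q[(state, action)] value estimates
    visits = defaultdict(int)     # per-pair visit counts for the 1/t schedule
    for _ in range(episodes):
        state = "uncontrolled"
        for _ in range(horizon):
            # Explore with probability epsilon, otherwise act greedily on Q.
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward = simulate_step(state, action)
            visits[(state, action)] += 1
            alpha = 1.0 / visits[(state, action)]
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            # Q-learning update: bootstrap on the best action in the next state.
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return {sa: round(v, 2) for sa, v in Q.items()}


print(q_learning())
```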
