Sequential drug decision problems in long-term medical conditions: a case study of primary hypertension. Eunju Kim, BA, MA, MSc




3.5.4 Q-learning


Q-learning is a simple incremental learning method that repeatedly updates the Q-values with new observations until it converges on the optimal solution [200]. While a simulator explores the states and actions that follow the current state s, the rewards are stored in the form of so-called Q-factors, one for each state-action pair at time t. The simplest form of Q-learning uses observations from only two cycles: the immediate reward, and the Q-values of the states one step later (see Equation 3.7).

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$   Equation 3.7.

where $\alpha$ is the learning rate and $\gamma$ is the discount rate.
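As a concrete illustration (not the implementation used in this thesis), the following Python sketch applies the one-step update in Equation 3.7 to a small NumPy Q-table; the numbers of states and actions, the learning rate and the discount rate are arbitrary assumptions.

import numpy as np

# Hypothetical problem size: e.g. 5 health states and 3 drug options per cycle.
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))   # Q-factors for every state-action pair, initially 0

def q_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.97):
    """One-step Q-learning update (Equation 3.7)."""
    td_target = reward + gamma * Q[s_next].max()   # immediate reward + discounted future value
    Q[s, a] += alpha * (td_target - Q[s, a])       # move Q(s, a) towards the target
    return Q

# Example: a reward of 0.8 is observed after taking action 1 in state 0 and moving to state 2.
Q = q_update(Q, s=0, a=1, reward=0.8, s_next=2)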


The value of the state one step later represents the remaining rewards expected in the future. The Q-values are initially set to 0 and are gradually updated as the simulator passes through the corresponding health state and drug. Where the feedback comes from multiple transitions, it is either the sum (or weighted sum) of the n-step rewards (see Equation 3.8) or the average of the n-step rewards (see Equation 3.9). Positive feedback strengthens the action tested by increasing its Q-value, while negative feedback weakens it. All that is required for correct convergence is that all state-action pairs continue to be updated.
$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{\,n-1} r_{t+n} = \sum_{k=1}^{n} \gamma^{\,k-1} r_{t+k}$   Equation 3.8.
$\bar{R}_t^{(n)} = \frac{1}{n} \sum_{k=1}^{n} r_{t+k}$   Equation 3.9.
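For illustration, a short Python fragment (with made-up reward values) computes the two feedback signals just described: the discount-weighted sum of the n-step rewards as in Equation 3.8 and their plain average as in Equation 3.9.

# Hypothetical rewards r_{t+1}, ..., r_{t+n} observed over n = 4 transitions.
rewards = [0.9, 0.8, 0.0, 0.7]
gamma = 0.97                     # discount rate

# Equation 3.8: (discount-)weighted sum of the n-step rewards.
feedback_sum = sum(gamma**k * r for k, r in enumerate(rewards))

# Equation 3.9: simple average of the n-step rewards.
feedback_avg = sum(rewards) / len(rewards)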
The most commonly used action selection rule is ε-greedy (or near-greedy), which selects an action at random at each time step with a probability governed by ε (e.g., ε = 1 − 1/log(n + 2)). Where n is the number of experiences and 0 ≤ ρ ≤ 1 is a uniform random number drawn at each time step, the agent selects action x at random if ρ > ε; otherwise, action x is the action of the learned policy with the highest estimated Q-value (see Equation 3.10). ε increases as the number of cases n increases, so the algorithm is more likely to explore the search space in the early stage of learning and to exploit previous experience to maximise the Q-value towards the end of learning. This implies that the action selection converges to the optimal solution once a sufficiently large number of trials has been obtained.
$x_t = \begin{cases} \text{a randomly selected action}, & \text{if } \rho > \varepsilon \\ \arg\max_{a} Q(s_t, a), & \text{otherwise} \end{cases}$   Equation 3.10.
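A minimal Python sketch of this near-greedy rule, assuming a NumPy Q-table like the one in the earlier fragment; the schedule ε = 1 − 1/log(n + 2) follows the text, while the function name and the use of Python's standard random number generator are illustrative choices.

import math
import random

import numpy as np

def select_action(Q, s, n):
    """Near-greedy action selection (Equation 3.10) after n learning experiences."""
    epsilon = 1.0 - 1.0 / math.log(n + 2)    # grows towards 1 as n increases
    rho = random.random()                    # uniform random number in [0, 1)
    if rho > epsilon:
        return random.randrange(Q.shape[1])  # explore: pick any action at random
    return int(Q[s].argmax())                # exploit: the action with the highest Q-value

# Example: choose an action for state 0 after 100 experiences.
Q = np.zeros((5, 3))
action = select_action(Q, s=0, n=100)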
The Q-learning algorithm starts by initialising a Q-table, a two-dimensional table containing the pairs of all possible states and all possible actions at time t (see Figure 3.). Starting from an initial state at t1, a simulator generates the current state and the subsequent states, with each action selected by the ε-greedy method, until a terminal state is reached at the end of the follow-up period. All the immediate rewards received from the transition between $s_t$ and $s_{t+1}$ are saved in the Q-table at t and used to update the Q-value for the current state $s_t$. All that is required for correct convergence is that all state-action pairs continue to be updated an infinite number of times, so that the estimates are independent of the policy being followed. At the end of the learning procedure, the action a that maximises the Q-value for $s_t$ is selected as the optimal solution for $s_t$.


[Figure: pseudocode of the Q-learning algorithm, beginning "Initialise all Q-values arbitrarily."]
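Putting the pieces together, the following self-contained Python sketch mirrors the learning procedure described above. It is not the thesis model: the problem size, the learning and discount rates, the number of episodes and the stand-in simulator step() are all illustrative assumptions.

import math
import random

import numpy as np

n_states, n_actions, n_cycles = 5, 3, 20        # assumed numbers of health states, drugs and cycles
alpha, gamma = 0.1, 0.97                        # assumed learning rate and discount rate

def step(s, a):
    """Stand-in simulator: returns a random next state and immediate reward (placeholder only)."""
    return random.randrange(n_states), random.random()

Q = np.zeros((n_states, n_actions))             # initialise all Q-values (here to zero)
for n in range(10_000):                         # learning episodes
    s = 0                                       # initial health state at t1
    for t in range(n_cycles):                   # follow-up period
        epsilon = 1.0 - 1.0 / math.log(n + 2)   # near-greedy threshold, grows with n
        if random.random() > epsilon:
            a = random.randrange(n_actions)     # explore: try a random action
        else:
            a = int(Q[s].argmax())              # exploit: take the current best action
        s_next, r = step(s, a)                  # one simulated transition and its reward
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # Equation 3.7
        s = s_next
policy = Q.argmax(axis=1)                       # learned optimal action for each state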

