3.5 Reinforcement learning

3.5.1 Bridge between DP, ADP, RL and Q-learning
DP is a well-known and efficient mathematical programming method for solving sequential or multi-stage decision problems using Bellman's value function. However, it has long been recognized that classic DP is of limited use for directly solving the large and complex sequential decision problems encountered in the real world, owing to the curse of dimensionality and its computational requirements [51]. This has led to increasing interest in ADP, which uses simulation and/or function approximation to solve large and complex sequential decision problems while building on the theoretical foundations of traditional DP.
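To make the connection concrete, the value function at the core of DP can be written in its standard Bellman optimality form (the notation below is a common textbook formulation, assumed here rather than taken from [51]):

\[ V^*(s) = \max_{a \in A(s)} \Big\{ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big\} \]

where \(V^*(s)\) is the optimal value of state \(s\), \(r(s,a)\) the immediate reward, \(\gamma \in [0,1)\) the discount factor, and \(P(s' \mid s, a)\) the transition probability. Solving this equation exactly requires sweeping over every state and action, which is precisely where the curse of dimensionality arises.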
The most widely used form of ADP is RL, a computational model inspired by the way animals learn complex behaviours from experience [200]. In the conventional RL framework, the agent does not initially know what immediate rewards or penalties its actions will produce. Instead, it gradually learns which action is best in each state to maximize its long-term reward by trying various actions in various states.
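One common formalization of this long-term reward (again standard notation, assumed here rather than drawn from the cited work) is the expected discounted return that the agent seeks to maximize:

\[ G_t = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \]

where \(r_{t+k+1}\) is the reward received \(k+1\) steps after time \(t\) and \(\gamma \in [0,1)\) discounts future rewards; the agent's goal is to learn a policy \(\pi(s)\) that maximizes \(\mathbb{E}[G_t]\) from every state.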
Q-learning is a specific type of RL that stores the value function as state-action pairs (called Q-values) during simulation. The learned action-value function directly approximates the optimal action-value function, independent of the policy being followed. Early convergence proofs showed that this simple, repeated procedure converges to the optimum provided all state-action pairs continue to be updated. The theoretical background and methodological application of DP, RL and Q-learning are detailed in the following sections.
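As a concrete illustration, the sketch below implements tabular Q-learning with the standard update \(Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]\) on a toy problem. The chain environment, the constants, and the step helper are illustrative assumptions made for this example only, not part of the methods referenced above:

```python
import random

# Hypothetical toy problem (an assumption for this sketch): a 1-D chain of
# states 0..N-1; the agent starts at state 0 and receives a reward of +1
# only upon reaching the terminal state N-1.
N_STATES = 6
ACTIONS = [+1, -1]          # move right or left (ties in max break toward +1)
ALPHA = 0.1                 # learning rate
GAMMA = 0.9                 # discount factor
EPSILON = 0.1               # exploration rate for epsilon-greedy

# Q-table: one value per state-action pair, initialized to zero.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy behaviour policy: explore occasionally,
        # otherwise act greedily with respect to the current Q-values.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: the target uses the max over next actions,
        # independent of the action the behaviour policy takes next.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy action at each state approximates the optimal policy.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```

Note that the update target uses \(\max_{a'} Q(s',a')\) regardless of which action the epsilon-greedy behaviour policy actually selects next; this is what makes Q-learning off-policy, echoing the "independent of the policy being followed" property noted above.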