

Question

Q-learning

a. How long a sequence of training examples is needed to guarantee that Q-learning will learn the optimal policy?

b. One effective TD learning approach is to use a very optimistic (high) estimate for the initial utilities of actions. Why does this help in TD learning (what problem does it help avoid)?

c. Another approach is for a Q-learning agent to act randomly on some fraction of actions, while slowly decreasing this fraction. Why does this help in Q-learning (what problem does it help avoid)?

Explanation / Answer

Solution a: No finite sequence of training examples can guarantee that Q-learning learns the optimal policy. Q-learning converges to the optimal Q-values only in the limit, and even then only under conditions such as every state-action pair being tried infinitely often and the learning rate being decayed appropriately. If the agent simply follows whichever actions look best so far, it may never discover alternatives that are actually better (the exploration vs. exploitation problem), so no specific length of training sequence, and no specific amount of time, can be guaranteed.

Solution b: Initializing all action utilities to very optimistic (high) values makes every untried action look attractive, so even a greedy agent is driven to try each action several times before the estimates settle down to realistic values. This helps avoid premature exploitation: without it, the agent can lock onto the first action that happens to look good and stop exploring, ending up with a suboptimal policy.

Solution c: Here the agent acts epsilon-greedily. Most of the time it picks the action with the highest estimated Q-value (the greedy action), but with some small probability it instead picks an action uniformly at random, ignoring the current value estimates, and this random fraction is slowly decreased over time. The random actions guarantee that, given enough trials, every action is tried an unbounded number of times, which is exactly the condition Q-learning needs to find the optimal actions; decreasing the fraction then lets the agent exploit what it has learned once the estimates are reliable. Like optimistic initialization, this addresses the exploration vs. exploitation problem and keeps the agent from getting stuck in a suboptimal policy.
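The mechanisms in parts b and c can be sketched together in a small tabular Q-learning loop. The chain environment, constants, and function name below are illustrative assumptions, not part of the original question: states 0 to n-1 in a row, action 0 moves left, action 1 moves right, and reaching the rightmost state gives reward +1 and ends the episode.

```python
import random

def q_learning(n_states=5, n_actions=2, episodes=2000,
               alpha=0.5, gamma=0.9, optimistic_init=10.0,
               eps_start=1.0, eps_min=0.05, eps_decay=0.995):
    # Q-table initialized optimistically (part b): untried actions look good,
    # so even greedy choices are pulled toward unexplored actions at first.
    Q = [[optimistic_init] * n_actions for _ in range(n_states)]
    eps = eps_start
    for _ in range(episodes):
        s = 0  # every episode starts at the left end of the chain
        while s != n_states - 1:
            # Epsilon-greedy selection (part c): with probability eps,
            # act randomly, independent of the current value estimates.
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            # Deterministic chain dynamics: 0 = left, 1 = right.
            s2 = max(0, s - 1) if a == 0 else s + 1
            done = (s2 == n_states - 1)
            r = 1.0 if done else 0.0
            # Standard Q-learning update; no bootstrap from terminal state.
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
        # Slowly shrink the exploration fraction (part c).
        eps = max(eps_min, eps * eps_decay)
    return Q
```

After enough episodes the optimistic initial values wash out and the greedy policy moves right in every non-terminal state, i.e. `Q[s][1] > Q[s][0]` for `s` in `0..3`, with `Q[3][1]` close to the true value of 1.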