human, or even surpassing these. Double DQN appears more robust to this more challenging evaluation, suggesting that appropriate generalizations occur and that the found solutions do not exploit the determinism of the environments. This is appealing, as it indicates progress towards finding general solutions rather than a deterministic sequence of steps that would be less robust.

Discussion

This paper has five contributions. First, we have shown why Q-learning can be overoptimistic in large-scale problems, even if these are deterministic, due to the inherent estimation errors of learning. Second, by analyzing the value estimates on Atari games we have shown that these overestimations are more common and severe in practice than previously acknowledged. Third, we have shown that Double Q-learning can be used at scale to successfully reduce this overoptimism, resulting in more stable and reliable learning. Fourth, we have proposed a specific implementation called Double DQN, which uses the existing architecture and deep neural network of the DQN algorithm without requiring additional networks or parameters. Finally, we have shown that Double DQN finds better policies, obtaining new state-of-the-art results on the Atari 2600 domain.
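To make the fourth contribution concrete, the sketch below shows how the Double DQN target differs from the standard DQN target: the online network selects the greedy next action, and the target network evaluates it. This is a minimal NumPy illustration rather than the paper's implementation; the function name double_dqn_targets and its arguments (batched rewards, terminal flags, and precomputed next-state Q-values from the two networks) are our own assumptions.

```python
import numpy as np

def double_dqn_targets(rewards, terminals, q_online_next, q_target_next, gamma=0.99):
    """Double DQN targets for a batch of transitions.

    q_online_next, q_target_next: (batch, n_actions) arrays of Q(S_{t+1}, .)
    under the online and target network parameters, respectively.
    """
    # Action selection with the online network, evaluation with the target network.
    best_actions = np.argmax(q_online_next, axis=1)
    evaluated = q_target_next[np.arange(len(best_actions)), best_actions]
    # Bootstrap only from non-terminal next states.
    return rewards + gamma * (1.0 - terminals) * evaluated

# Standard DQN, for contrast, selects and evaluates with the same (target) network:
# targets = rewards + gamma * (1.0 - terminals) * q_target_next.max(axis=1)
```

Because DQN already maintains a target network, this change affects only the target computation; no additional networks or parameters are introduced, which is exactly the property noted above.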
Acknowledgments

We would like to thank Tom Schaul, Volodymyr Mnih, Marc Bellemare, Thomas Degris, Georg Ostrovski, and Richard Sutton for helpful comments, and everyone at Google DeepMind for a constructive research environment.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078, 1995.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

L. Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning: Proceedings of the Twelfth International Conference, pages 30–37, 1995.

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research (JAIR), 47:253–279, 2013.

R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 3:213–231, 2003.

K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2):119–130, 1988.

L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

L. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3):293–321, 1992.

H. R. Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, 2011.

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, S. Legg, V. Mnih, K. Kavukcuoglu, and D. Silver. Massively parallel methods for deep reinforcement learning. In Deep Learning Workshop, ICML, 2015.

M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In J. Gama, R. Camacho, P. Brazdil, A. Jorge, and L. Torgo, editors, Proceedings of the 16th European Conference on Machine Learning (ECML'05), pages 317–328. Springer, 2005.

B. Sallans and G. E. Hinton. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.

A. L. Strehl, L. Li, and M. L. Littman. Reinforcement learning in finite MDPs: PAC analysis. The Journal of Machine Learning Research, 10:2413–2444, 2009.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224, 1990.

R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, 1998.

R. S. Sutton, C. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation. Advances in Neural Information Processing Systems 21 (NIPS-08), 21:1609–1616, 2008.

R. S. Sutton, A. R. Mahmood, and M. White. An emphatic approach to the problem of off-policy temporal-difference learning. arXiv preprint arXiv:1503.04269, 2015.

I. Szita and A. Lőrincz. The many faces of optimism: a unifying approach. In Proceedings of the 25th International Conference on Machine Learning, pages 1048–1055. ACM, 2008.

G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. In M. Mozer, P. Smolensky, D. Touretzky, J. Elman, and A. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, Hillsdale, NJ, 1993. Lawrence Erlbaum.

J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997.

H. van Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 23:2613–2621, 2010.

H. van Hasselt. Insights in Reinforcement Learning. PhD thesis, Utrecht University, 2011.

C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, University of Cambridge, England, 1989.