Appendix

Theorem 1. Consider a state $s$ in which all the true optimal action values are equal at $Q_*(s,a) = V_*(s)$ for some $V_*(s)$. Let $Q_t$ be arbitrary value estimates that are on the whole unbiased in the sense that $\sum_a (Q_t(s,a) - V_*(s)) = 0$, but that are not all zero, such that $\frac{1}{m}\sum_a (Q_t(s,a) - V_*(s))^2 = C$ for some $C > 0$, where $m \ge 2$ is the number of actions in $s$. Under these conditions, $\max_a Q_t(s,a) \ge V_*(s) + \sqrt{\frac{C}{m-1}}$. This lower bound is tight. Under the same conditions, the lower bound on the absolute error of the Double Q-learning estimate is zero.

Proof of Theorem 1. Define the errors for each action $a$ as $\epsilon_a = Q_t(s,a) - V_*(s)$. Suppose that there exists a setting of $\{\epsilon_a\}$ such that $\max_a \epsilon_a < \sqrt{\frac{C}{m-1}}$. Let $\{\epsilon_i^+\}$ be the set of positive $\epsilon$ of size $n$, and $\{\epsilon_j^-\}$ the set of strictly negative $\epsilon$ of size $m-n$ (such that $\{\epsilon\} = \{\epsilon_i^+\} \cup \{\epsilon_j^-\}$). If $n = m$, then $\sum_a \epsilon_a = 0 \implies \epsilon_a = 0\ \forall a$, which contradicts $\sum_a \epsilon_a^2 = mC$. Hence, it must be that $n \le m-1$. Then, $\sum_{i=1}^{n} \epsilon_i^+ \le n \max_i \epsilon_i^+ < n\sqrt{\frac{C}{m-1}}$, and therefore (using the constraint $\sum_a \epsilon_a = 0$) we also have that $\sum_{j=1}^{m-n} |\epsilon_j^-| < n\sqrt{\frac{C}{m-1}}$. This implies $\max_j |\epsilon_j^-| < n\sqrt{\frac{C}{m-1}}$. By Hölder's inequality, then

\[
\sum_{j=1}^{m-n} (\epsilon_j^-)^2 \le \sum_{j=1}^{m-n} |\epsilon_j^-| \cdot \max_j |\epsilon_j^-|
< n\sqrt{\frac{C}{m-1}} \, n\sqrt{\frac{C}{m-1}} .
\]

We can now combine these relations to compute an upper-bound on the sum of squares for all $\epsilon_a$:

\[
\sum_{a=1}^{m} \epsilon_a^2 = \sum_{i=1}^{n} (\epsilon_i^+)^2 + \sum_{j=1}^{m-n} (\epsilon_j^-)^2
< n\frac{C}{m-1} + n\sqrt{\frac{C}{m-1}} \, n\sqrt{\frac{C}{m-1}}
= C\,\frac{n(n+1)}{m-1}
\le mC .
\]

This contradicts the assumption that $\sum_{a=1}^{m} \epsilon_a^2 = mC$, and therefore $\max_a \epsilon_a \ge \sqrt{\frac{C}{m-1}}$ for all settings of $\epsilon$ that satisfy the constraints.
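As a quick illustration of this bound (not part of the paper), one can sample random unbiased error configurations with mean squared error $C$ and confirm that $\max_a \epsilon_a \ge \sqrt{C/(m-1)}$ holds in every case. The Python sketch below does this; the sampling scheme is chosen purely for illustration.

```python
# Random spot-check of the lower bound in Theorem 1 (illustrative sketch only):
# for unbiased errors with (1/m) * sum_a eps_a^2 = C, the maximum error should
# never fall below sqrt(C / (m - 1)).
import numpy as np

rng = np.random.default_rng(1)
C = 0.5

for _ in range(10_000):
    m = int(rng.integers(2, 12))             # number of actions, m >= 2
    eps = rng.normal(size=m)
    eps -= eps.mean()                        # enforce sum_a eps_a = 0
    if np.allclose(eps, 0.0):
        continue                             # skip a degenerate all-zero draw
    eps *= np.sqrt(C / np.mean(eps ** 2))    # enforce (1/m) * sum_a eps_a^2 = C
    assert eps.max() >= np.sqrt(C / (m - 1)) - 1e-12
print("lower bound held in all sampled configurations")
```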
We can check that the lower-bound is tight by setting $\epsilon_a = \sqrt{\frac{C}{m-1}}$ for $a = 1, \ldots, m-1$ and $\epsilon_m = -\sqrt{(m-1)C}$. This verifies $\sum_a \epsilon_a^2 = mC$ and $\sum_a \epsilon_a = 0$.

The only tight lower bound on the absolute error for Double Q-learning $|Q'_t(s, \mathrm{argmax}_a Q_t(s,a)) - V_*(s)|$ is zero. This can be seen because we can have

\[
Q_t(s, a_1) = V_*(s) + \sqrt{C\,\frac{m-1}{m}} ,
\quad\text{and}\quad
Q_t(s, a_i) = V_*(s) - \sqrt{\frac{C}{m(m-1)}} , \quad \text{for } i > 1 .
\]

Then the conditions of the theorem hold. If, furthermore, we have $Q'_t(s, a_1) = V_*(s)$ then the error is zero. The remaining action values $Q'_t(s, a_i)$, for $i > 1$, are arbitrary.
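The tightness configuration used above ($\epsilon_a = \sqrt{C/(m-1)}$ for $a < m$ and $\epsilon_m = -\sqrt{(m-1)C}$) is easy to verify numerically. The following sketch (not from the paper) checks, for a few values of $m$ and $C$, that it is unbiased, has mean squared error $C$, and attains the bound $\sqrt{C/(m-1)}$ exactly.

```python
# Numerical check of the tightness construction in the proof of Theorem 1
# (illustrative sketch, not code from the paper).
import numpy as np

def tight_errors(m: int, C: float) -> np.ndarray:
    """Errors eps_a = Q_t(s, a) - V_*(s) that attain the lower bound."""
    eps = np.full(m, np.sqrt(C / (m - 1)))   # eps_a for a = 1, ..., m-1
    eps[-1] = -np.sqrt((m - 1) * C)          # eps_m
    return eps

for m, C in [(2, 1.0), (5, 0.3), (10, 2.0)]:
    eps = tight_errors(m, C)
    assert np.isclose(eps.sum(), 0.0)                    # sum_a eps_a = 0
    assert np.isclose(np.mean(eps ** 2), C)              # (1/m) * sum_a eps_a^2 = C
    assert np.isclose(eps.max(), np.sqrt(C / (m - 1)))   # bound attained with equality
print("tightness construction verified")
```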
Theorem 2. Consider a state $s$ in which all the true optimal action values are equal at $Q_*(s,a) = V_*(s)$. Suppose that the estimation errors $Q_t(s,a) - Q_*(s,a)$ are independently distributed uniformly randomly in $[-1, 1]$. Then,

\[
\mathbb{E}\Big[ \max_a Q_t(s,a) - V_*(s) \Big] = \frac{m-1}{m+1} .
\]

Proof. Define $\epsilon_a = Q_t(s,a) - Q_*(s,a)$; this is a uniform random variable in $[-1, 1]$. The probability that $\max_a Q_t(s,a) \le x$ for some $x$ is equal to the probability that $\epsilon_a \le x$ for all $a$ simultaneously. Because the estimation errors are independent, we can derive

\[
P(\max_a \epsilon_a \le x) = P(X_1 \le x \wedge X_2 \le x \wedge \ldots \wedge X_m \le x)
= \prod_{a=1}^{m} P(\epsilon_a \le x) .
\]

The function $P(\epsilon_a \le x)$ is the cumulative distribution function (CDF) of $\epsilon_a$, which here is simply defined as

\[
P(\epsilon_a \le x) =
\begin{cases}
0 & \text{if } x \le -1 \\
\frac{1+x}{2} & \text{if } x \in (-1, 1) \\
1 & \text{if } x \ge 1
\end{cases}
\]

This implies that

\[
P(\max_a \epsilon_a \le x) = \prod_{a=1}^{m} P(\epsilon_a \le x) =
\begin{cases}
0 & \text{if } x \le -1 \\
\left( \frac{1+x}{2} \right)^m & \text{if } x \in (-1, 1) \\
1 & \text{if } x \ge 1
\end{cases}
\]

This gives us the CDF of the random variable $\max_a \epsilon_a$. Its expectation can be written as an integral

\[
\mathbb{E}\Big[ \max_a \epsilon_a \Big] = \int_{-1}^{1} x f_{\max}(x) \, dx ,
\]

where $f_{\max}$ is the probability density function of this variable, defined as the derivative of the CDF: $f_{\max}(x) = \frac{d}{dx} P(\max_a \epsilon_a \le x)$, so that for $x \in [-1, 1]$ we have $f_{\max}(x) = \frac{m}{2} \left( \frac{1+x}{2} \right)^{m-1}$. Evaluating the integral yields

\[
\mathbb{E}\Big[ \max_a \epsilon_a \Big] = \int_{-1}^{1} x f_{\max}(x) \, dx
= \left[ \left( \frac{x+1}{2} \right)^m \frac{mx - 1}{m+1} \right]_{-1}^{1}
= \frac{m-1}{m+1} .
\]
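A simple Monte Carlo simulation (not part of the paper) reproduces this expectation; the sketch below compares sampled values of $\max_a \epsilon_a$ for uniform errors against $(m-1)/(m+1)$, with the sample size and seed chosen arbitrarily.

```python
# Monte Carlo check of E[max_a eps_a] = (m - 1) / (m + 1) for independent
# eps_a ~ Uniform[-1, 1] (illustrative sketch; sample count and seed are arbitrary).
import numpy as np

rng = np.random.default_rng(0)
num_samples = 200_000

for m in [2, 4, 10]:
    eps = rng.uniform(-1.0, 1.0, size=(num_samples, m))  # one row per sampled state
    estimate = eps.max(axis=1).mean()                     # Monte Carlo E[max_a eps_a]
    exact = (m - 1) / (m + 1)
    print(f"m = {m}: Monte Carlo {estimate:.4f} vs exact {exact:.4f}")
```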
Experimental Details for the Atari 2600 Domain

We selected the 49 games to match the list used by Mnih et al. (2015); see the tables below for the full list. Each agent step is composed of four frames (the last selected action is repeated during these frames) and reward values (obtained from the Arcade Learning Environment (Bellemare et al., 2013)) are clipped between -1 and 1.
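To make the step composition concrete, here is a minimal sketch of one agent step under these conventions. It is not the authors' code: the `env.step(action) -> (observation, reward, done)` interface and the choice to clip the summed reward (rather than each per-frame reward) are assumptions made for illustration.

```python
# Minimal sketch of one agent step as described above: repeat the selected action
# for four frames and clip the reward to [-1, 1]. The environment interface and
# the point at which clipping is applied are assumptions, not the paper's code.

ACTION_REPEAT = 4  # frames per agent step, as stated in the text

def agent_step(env, action):
    total_reward = 0.0
    observation, done = None, False
    for _ in range(ACTION_REPEAT):
        observation, reward, done = env.step(action)  # advance one emulator frame
        total_reward += reward
        if done:
            break
    clipped_reward = max(-1.0, min(1.0, total_reward))  # clip to [-1, 1]
    return observation, clipped_reward, done
```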