# Never Give Up notes

## Jan 2, 2021 13:51 · 231 words · 2 minute read paper ai rl

[Paper] Never Give Up: Learning Directed Exploration Strategies

## Introduction

Exploration: $\epsilon$-greedy is very inefficient; the number of steps it requires grows exponentially with the size of the state space

## Method

$r_t=r^e_t+βr^i_t$

exploration bonus = intrinsic reward

$r_{t}^{i}=r_{t}^{\text {episodic }} \cdot \min \left\lbrace\max \left\lbrace\alpha_{t}, 1\right\rbrace, L\right\rbrace$

$r_{t}^{\text {episodic }}$ episodic intrinsic reward

$\alpha_{t}$ life-long curiosity factor
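The reward combination above can be sketched as follows (a minimal sketch; the clipping cap `L` and the mixing weight `beta` are hyperparameters, and the default values used here are illustrative assumptions, not taken from the paper):

```python
def intrinsic_reward(r_episodic, alpha, L=5.0):
    """r^i_t: episodic bonus scaled by the clipped life-long factor alpha_t."""
    return r_episodic * min(max(alpha, 1.0), L)

def augmented_reward(r_extrinsic, r_episodic, alpha, beta=0.3):
    """r_t = r^e_t + beta * r^i_t."""
    return r_extrinsic + beta * intrinsic_reward(r_episodic, alpha)
```

Note that when $\alpha_t \le 1$ (nothing novel life-long), the clipping leaves the episodic bonus unscaled rather than suppressing it.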

### Episodic curiosity

#### Embedding network

Using an embedding network $f: \mathcal{O} \rightarrow \mathbb{R}^{p}$

Trained using a Siamese network to predict the action $a_t$ given the observations $x_t$ and $x_{t+1}$

$p\left(a \mid x_{t}, x_{t+1}\right)=h\left(f\left(x_{t}\right), f\left(x_{t+1}\right)\right)$ where $h$ is a hidden layer MLP followed by a softmax

Learner uses last 5 frames of each sequence to train the network

learned representation == controllable state
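The inverse-dynamics classifier can be sketched in NumPy as below (untrained random weights; the layer sizes and the tanh/ReLU choices are my assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: observation dim 16, embedding dim p=8, 4 actions.
OBS, P, HIDDEN, ACTIONS = 16, 8, 32, 4

# Shared (Siamese) embedding f: O -> R^p, here a single random linear map + tanh.
W_f = rng.normal(size=(OBS, P))

def f(x):
    return np.tanh(x @ W_f)

# h: one-hidden-layer MLP followed by a softmax over actions.
W1 = rng.normal(size=(2 * P, HIDDEN))
W2 = rng.normal(size=(HIDDEN, ACTIONS))

def action_probs(x_t, x_t1):
    """p(a | x_t, x_{t+1}) = h(f(x_t), f(x_{t+1}))."""
    z = np.concatenate([f(x_t), f(x_t1)])
    hidden = np.maximum(z @ W1, 0.0)       # ReLU hidden layer
    logits = hidden @ W2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()
```

Only $f$ is kept for the memory; the classifier head $h$ exists just to force $f$ to encode what the agent can control.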

#### Episodic memory

$r_{t}^{\text {episodic }}=\frac{1}{\sqrt{n\left(f\left(x_{t}\right)\right)}} \approx \frac{1}{\sqrt{\sum_{f_{i} \in N_{k}} K\left(f\left(x_{t}\right), f_{i}\right)}+c}$

$n(f(x_t))$ counts for the visits to the abstract state $f(x_t)$

$c=0.001$

$K$ is an inverse kernel approximating a Dirac delta function

$K(x, y)=\frac{\epsilon}{\frac{d^{2}(x, y)}{d_{m}^{2}}+\epsilon}$

$\epsilon=10^{−3}$

$d$ Euclidean distance

$d^2_m$ running average of the squared Euclidean distance to the $k$ nearest neighbors
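Putting the episodic-memory pieces together (a sketch: I approximate the running average $d^2_m$ by the batch mean over the $k$ neighbours, and the zero-distance guard is my own addition):

```python
import numpy as np

def episodic_reward(f_x, memory, k=10, eps=1e-3, c=0.001):
    """Kernel pseudo-count bonus over the k nearest neighbours in episodic memory."""
    d2 = np.sum((memory - f_x) ** 2, axis=1)   # squared Euclidean distances
    d2_nn = np.sort(d2)[:k]                    # k nearest neighbours
    d2_m = max(d2_nn.mean(), 1e-8)             # batch mean stands in for the running mean
    kernel = eps / (d2_nn / d2_m + eps)        # inverse kernel K
    return 1.0 / (np.sqrt(kernel.sum()) + c)   # ~ 1 / sqrt(pseudo-count)
```

Revisiting a state already in memory drives the kernel sum up and the bonus toward zero within the episode.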

### Life-long curiosity

Random Network Distillation: train a network to predict the outputs of a fixed, randomly initialized network

$g: \mathcal{O} \rightarrow \mathbb{R}^{k}$ random untrained network

$\hat g: \mathcal{O} \rightarrow \mathbb{R}^{k}$ predictor network

Minimize $\operatorname{err}\left(x_{t}\right)=\left\|\hat{g}\left(x_{t} ; \theta\right)-g\left(x_{t}\right)\right\|^{2}$

$\alpha_{t}=1+\frac{\operatorname{err}\left(x_{t}\right)-\mu_{e}}{\sigma_{e}}$ where $\mu_{e}$ and $\sigma_{e}$ are the running mean and standard deviation of $\operatorname{err}$
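A minimal RND sketch (both networks are single random linear maps here, and the running statistics are passed in explicitly; all sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
OBS, K = 16, 8

# g: fixed randomly initialized target; g_hat: the predictor to be trained
# (left at its random initialization in this sketch).
W_g = rng.normal(size=(OBS, K))
W_hat = rng.normal(size=(OBS, K))

def err(x):
    """Prediction error ||g_hat(x) - g(x)||^2 -- the life-long novelty signal."""
    return float(np.sum((np.tanh(x @ W_hat) - np.tanh(x @ W_g)) ** 2))

def alpha(x, mu_e, sigma_e):
    """Life-long curiosity factor: error normalized by its running statistics."""
    return 1.0 + (err(x) - mu_e) / sigma_e
```

As training makes $\hat g$ match $g$ on familiar inputs, $\operatorname{err}$ falls below its running mean and $\alpha_t$ drifts below 1, where the clipping in $r^i_t$ neutralizes it.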

### NGU agent

- R2D2 base [Paper]
- Universal value function approximator (UVFA) $Q(x,a,\beta_i)$

I made a partial implementation of the Never Give Up agent; you can find it on my GitHub.
