Never Give Up notes

Jan 2, 2021 13:51 · 231 words · 2 minute read paper ai rl

[Paper] Never Give Up: Learning Directed Exploration Strategies

Introduction

Exploration: $\epsilon$-greedy is very inefficient; the number of steps it requires grows exponentially with the size of the state space

Method

$r_t = r^e_t + \beta r^i_t$

exploration bonus = intrinsic reward

$r_{t}^{i}=r_{t}^{\text {episodic }} \cdot \min \left\lbrace\max \left\lbrace\alpha_{t}, 1\right\rbrace, L\right\rbrace$

$r_{t}^{\text {episodic }}$ episodic intrinsic reward

$\alpha_{t}$ life-long curiosity factor
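A tiny numpy sketch of how the two signals get mixed (the function name and the default `L` are mine, just to make the clipping explicit):

```python
import numpy as np

def augmented_reward(r_extrinsic, r_episodic, alpha, beta, L=5.0):
    """r^i_t = r^episodic_t * clip(alpha_t, 1, L), then r_t = r^e_t + beta * r^i_t."""
    r_intrinsic = r_episodic * np.clip(alpha, 1.0, L)  # min(max(alpha_t, 1), L)
    return r_extrinsic + beta * r_intrinsic
```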

Episodic curiosity

Embedding network

Using an embedding network $f: \mathcal{O} \rightarrow \mathbb{R}^{p}$

Trained using a Siamese network to predict action $a_t$ given the observations $x_t$ and $x_{t+1}$

$p\left(a \mid x_{t}, x_{t+1}\right)=h\left(f\left(x_{t}\right), f\left(x_{t+1}\right)\right)$ where $h$ is an MLP with one hidden layer followed by a softmax

Learner uses the last 5 frames of each sequence to train the network

learned representation == controllable state (the embedding only keeps the parts of the observation the agent's actions can affect)
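A rough PyTorch sketch of this embedding / inverse-dynamics setup, assuming flat observations and discrete actions (layer sizes and dimensions are placeholders, not the paper's architecture):

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """f: O -> R^p, shared (Siamese) encoder applied to both x_t and x_{t+1}."""
    def __init__(self, obs_dim=16, p=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, p))

    def forward(self, x):
        return self.encoder(x)

class InverseModel(nn.Module):
    """h: predicts p(a | x_t, x_{t+1}) from the two embeddings."""
    def __init__(self, p=32, n_actions=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * p, 64), nn.ReLU(),
            nn.Linear(64, n_actions),  # logits; the softmax lives inside the loss below
        )

    def forward(self, e_t, e_tp1):
        return self.mlp(torch.cat([e_t, e_tp1], dim=-1))

# One training step: classify which action a_t led from x_t to x_{t+1}.
f, h = EmbeddingNet(), InverseModel()
opt = torch.optim.Adam(list(f.parameters()) + list(h.parameters()), lr=1e-4)
x_t, x_tp1 = torch.randn(8, 16), torch.randn(8, 16)
a_t = torch.randint(0, 4, (8,))
loss = nn.functional.cross_entropy(h(f(x_t), f(x_tp1)), a_t)
opt.zero_grad(); loss.backward(); opt.step()
```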

Episodic memory

$r_{t}^{\text {episodic }}=\frac{1}{\sqrt{n\left(f\left(x_{t}\right)\right)}} \approx \frac{1}{\sqrt{\sum_{f_{i} \in N_{k}} K\left(f\left(x_{t}\right), f_{i}\right)}+c}$

$n(f(x_t))$ counts the visits to the abstract state $f(x_t)$

$c=0.001$

$K$ is a kernel approximating a Dirac delta function (the inverse kernel below is used)

$K(x, y)=\frac{\epsilon}{\frac{d^{2}(x, y)}{d_{m}^{2}}+\epsilon}$

$\epsilon=10^{-3}$

$d$ Euclidean distance

$d^2_m$ running average of the squared Euclidean distance to the $k$-th nearest neighbors
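A numpy sketch of the episodic bonus, with the episodic memory reduced to a plain list of embeddings and the running average $d^2_m$ simplified to a mean over the current neighbors (so it skips some of the paper's extra normalisation details):

```python
import numpy as np

def episodic_bonus(embedding, memory, k=10, c=0.001, eps=1e-3):
    """1 / sqrt(pseudo-count), approximated over the k nearest neighbours in memory."""
    if len(memory) == 0:
        return 1.0  # first step of the episode: no neighbours yet (simplification)
    mem = np.stack(memory)
    d2 = np.sum((mem - embedding) ** 2, axis=-1)  # squared Euclidean distances
    d2 = np.sort(d2)[:k]                          # keep the k nearest neighbours
    d2_m = d2.mean() + 1e-8                       # stand-in for the running average d^2_m
    kernel = eps / (d2 / d2_m + eps)              # inverse kernel K(x, y)
    return 1.0 / (np.sqrt(kernel.sum()) + c)
```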

Life-long curiosity

Random Network Distillation: train a network to predict the outputs of a randomly initialized network

$g: \mathcal{O} \rightarrow \mathbb{R}^{k}$ random untrained network

$\hat g: \mathcal{O} \rightarrow \mathbb{R}^{k}$ predictor network

Minimize $ \operatorname{err}\left(x_{t}\right)=\left|\hat{g}\left(x_{t} ; \theta\right)-g\left(x_{t}\right)\right|^{2}$

$\alpha_{t}=1+\frac{\operatorname{err}\left(x_{t}\right)-\mu_{e}}{\sigma_{e}}$ where $\mu_{e}$ and $\sigma_{e}$ are the running mean and standard deviation of $\operatorname{err}\left(x_{t}\right)$
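A small sketch of the life-long modulator, with a fixed random target $g$, a trained predictor $\hat g$, and naive running statistics kept in a list (a real agent would use proper running estimates):

```python
import numpy as np
import torch
import torch.nn as nn

def make_net(obs_dim=16, k=32):
    # Same architecture for the random target g and the predictor g_hat
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, k))

g = make_net()        # g: random, untrained, frozen target network
g_hat = make_net()    # g_hat: predictor network, trained to match g
for p in g.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(g_hat.parameters(), lr=1e-4)

errors = []           # naive running statistics of err(x_t)

def lifelong_modulator(x_t):
    err = ((g_hat(x_t) - g(x_t)) ** 2).sum()      # err(x_t) = ||g_hat(x_t) - g(x_t)||^2
    opt.zero_grad(); err.backward(); opt.step()   # minimise err, i.e. distil g into g_hat
    errors.append(err.item())
    mu, sigma = np.mean(errors), np.std(errors) + 1e-8
    return 1.0 + (err.item() - mu) / sigma        # alpha_t

alpha_t = lifelong_modulator(torch.randn(16))
```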

NGU agent

  • R2D2 base [Paper]
  • Universal value function approximator (UVFA) $ Q(x,a,\beta_i)$, a single network for the whole family of reward mixtures (see the sketch after this list)
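A minimal sketch of what conditioning $Q$ on $\beta_i$ can look like: a feed-forward stand-in for the recurrent R2D2 network, with the mixture index appended as a one-hot vector (my simplification, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class UVFAQNet(nn.Module):
    """Q(x, a, beta_i): one network conditioned on which reward mixture it acts for."""
    def __init__(self, obs_dim=16, n_actions=4, n_mixtures=8):
        super().__init__()
        self.n_mixtures = n_mixtures
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_mixtures, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x, beta_index):
        # Condition on beta_i via a one-hot mixture index appended to the observation
        one_hot = nn.functional.one_hot(beta_index, self.n_mixtures).float()
        return self.net(torch.cat([x, one_hot], dim=-1))

q = UVFAQNet()
q_values = q(torch.randn(2, 16), torch.tensor([0, 7]))  # two different mixture indices
```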

I made a partial implementation of the Never Give Up agent. You can find it on my GitHub.
