๊ธฐ์ดˆ/์ธ๊ณต์ง€๋Šฅ

์˜ตํ‹ฐ๋งˆ์ด์ €(Optimizer) (2/2)

James Hwang๐Ÿ˜Ž 2021. 9. 2. 23:38
๐Ÿ’ก Written with reference to 'Deep Learning from Scratch' and 'CS231N'

(๊ฐ ์ ˆ์˜ ๋„˜๋ฒ„๋ง์€ ์ง€๋‚œ ๊ฒŒ์‹œ๋ฌผ์—์„œ ์ด์–ด์ง‘๋‹ˆ๋‹ค)

2. ์˜ตํ‹ฐ๋งˆ์ด์ €

  ์ง€๋‚œ ๊ฒŒ์‹œ๋ฌผ์—์„œ๋Š” SGD์˜ ๋ฌธ์ œ์ ์œผ๋กœ ์ง€์ ๋˜์—ˆ๋˜ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๊ฐ€์šด๋ฐ ์Šคํ… ๋ฐฉํ–ฅ์„ ๊ฐœ์„ ํ•œ ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋Œ€ํ•˜์—ฌ ์•Œ์•„๋ดค์Šต๋‹ˆ๋‹ค. ์˜ค๋ฒ„์ŠˆํŒ…(overshooting)์œผ๋กœ ์•ˆ์žฅ์ (saddle point)๊ณผ ์ง€์—ญ ์ตœ์†Ÿ๊ฐ’(local minima)์„ ํ†ต๊ณผํ•˜๋ฉฐ ์ „์—ญ ์ตœ์†Ÿ๊ฐ’(global minimum)์„ ์ฐพ๋˜ SGD+Momentum, NAG๋ฅผ ์ง์ ‘ ๊ตฌํ˜„ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ ๊ฒŒ์‹œ๋ฌผ์—์„œ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๊ฐฑ์‹ ๋œ ์ •๋„์— ๋”ฐ๋ผ ์Šคํ… ์‚ฌ์ด์ฆˆ๋ฅผ ์กฐ์ •ํ•˜๋ฉฐ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋Š” AdaGrad๋ฅ˜์˜ ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Figure 1. The family tree of optimizers [ํ•˜์šฉํ˜ธ, 2017]

2.5 Adaptive Gradient (AdaGrad)

  AdaGrad was introduced to fix a problem with SGD: every parameter is trained with the same step size. In neural network training the learning rate is a very important factor, and 'learning rate decay' is sometimes used to set it effectively.

ํ•™์Šต๋ฅ  ๊ฐ์†Œ
  ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋ฉด์„œ ํ•™์Šต๋ฅ ์„ ์ ์ฐจ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•. ์ฒ˜์Œ์—๋Š” ํฐ ๋ณดํญ์œผ๋กœ ํ•™์Šตํ•˜์˜€๋‹ค๊ฐ€ ์กฐ๊ธˆ์”ฉ ์ž‘๊ฒŒ ํ•™์Šตํ•จ์œผ๋กœ์จ ํšจ๊ณผ์ ์ธ ํ•™์Šต์„ ์œ ๋„ํ•จ.

  ํ•™์Šต๋ฅ ์„ ์กฐ์ ˆํ•˜๋Š” ๊ฐ€์žฅ ์‰ฌ์šด ๋ฐฉ๋ฒ•์€ ๋งค๊ฐœ๋ณ€์ˆ˜ '์ „์ฒด'์˜ ํ•™์Šต๋ฅ ์„ ์ผ๊ด„์ ์œผ๋กœ ๋‚ฎ์ถ”๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด ๊ฐœ๋…์„ ๋ฐœ์ „์‹œํ‚จ ๊ฒƒ์ด ๋ฐ”๋กœ AdaGrad์ž…๋‹ˆ๋‹ค. AdaGrad๋Š” '๊ฐ๊ฐ์˜' ๋งค๊ฐœ๋ณ€์ˆ˜์— ์ ํ•ฉํ•œ ํ•™์Šต๋ฅ ์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, AdaGrad๋Š” ๊ฐœ๋ณ„ ๋งค๊ฐœ๋ณ€์ˆ˜์— ๋Œ€ํ•ด ์ ์‘์ (adaptive)์œผ๋กœ ํ•™์Šต๋ฅ ์„ ์กฐ์ •ํ•˜๋ฉฐ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. AdaGrad์˜ ์ˆ˜์‹์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

$$ \theta_{t+1}=\theta_{t} - \frac{\eta}{\sqrt{G_{t}+\epsilon}} \odot \triangledown J(\theta_{t}) $$

$$ G_{t}=G_{t-1}+(\triangledown J(\theta_t))^2 $$

  • $ G_t $ : sum of squared gradients of the loss function
  • $ \odot $ : element-wise multiplication

  Implementing this gives the following.

from typing import Dict

import numpy as np


class AdaGrad:
    def __init__(self, lr: float=0.01):
        self.lr = lr
        self.h = None   # running sum of squared gradients (G_t)
    
    def update(self, params: Dict[str, np.ndarray],
               grads: Dict[str, np.ndarray]) -> None:
        if self.h is None:   # lazily create one accumulator per parameter
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]   # accumulate squared gradients
            params[key] -= \
                self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)

  ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐฑ์‹ ํ•˜๋Š” ๋ถ€๋ถ„์—์„œ ๋งค์šฐ ์ž‘์€ ๊ฐ’์ธ 1e-7์„ ๋”ํ•œ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” h[key] ์•ˆ์— 0์ด ๋‹ด๊ฒจ ์žˆ๋”๋ผ๋„ 0์œผ๋กœ ๋‚˜๋ˆ„๋Š” ๊ฒƒ์„ ๋ง‰์•„์ค๋‹ˆ๋‹ค. ์ด๋ฅผ ์‹œ๊ฐํ™”ํ•œ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์˜ ๊ทธ๋ฆผ 2์ž…๋‹ˆ๋‹ค.

Figure 2. Optimization path under AdaGrad

  ์ „์—ญ ์ตœ์†Ÿ๊ฐ’์„ ํ–ฅํ•ด ํšจ์œจ์ ์œผ๋กœ ์›€์ง์ด๋Š” ๊ฒƒ์„ ๊ทธ๋ฆผ์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ธฐ์šธ๊ธฐ์— ๋น„๋ก€ํ•˜์—ฌ ์›€์ง์ž„๋„ ์ ์  ์ž‘์•„์ง€๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” AdaGrad์˜ ์žฅ์ ์ด์ž ๋‹จ์ ์ž…๋‹ˆ๋‹ค. AdaGrad๋Š” ๊ณผ๊ฑฐ์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ์ œ๊ณฑํ•˜์—ฌ ๊ณ„์† ๋”ํ•ด๊ฐ€๋Š” ์„ฑ์งˆ ๋•Œ๋ฌธ์— ํ•™์Šต์„ ์ง„ํ–‰ํ• ์ˆ˜๋ก ๊ทธ ์›€์ง์ž„์ด ์•ฝํ•ด์ง‘๋‹ˆ๋‹ค. AdaGrad๋ฅผ ์ด์šฉํ•˜์—ฌ ๋ฌดํ•œํžˆ ํ•™์Šตํ•˜๋ฉด, ์–ด๋Š ์ˆœ๊ฐ„์—๋Š” ๊ฐฑ์‹ ๋Ÿ‰์ด 0์ด ๋˜์–ด ์ „ํ˜€ ๊ฐฑ์‹ ํ•˜์ง€ ์•Š๋Š” ํ˜„์ƒ์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

2.6 RMSProp

  RMSProp์€ AdaGrad๋ฅผ ๊ฐœ์„ ํ•œ ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. RMSProp์€ ๊ณผ๊ฑฐ์˜ ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ท ์ผํ•˜๊ฒŒ ๋”ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ๋จผ ๊ณผ๊ฑฐ์˜ ๊ธฐ์šธ๊ธฐ๋Š” ์„œ์„œํžˆ ์žŠ๊ณ  ์ƒˆ๋กœ์šด ๊ธฐ์šธ๊ธฐ ์ •๋ณด๋ฅผ ํฌ๊ฒŒ ๋ฐ˜์˜ํ•˜๋Š” ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ์ง€์ˆ˜์ด๋™ํ‰๊ท (Exponential Moving Average)๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์ง€์ˆ˜์ด๋™ํ‰๊ท ์„ ํ†ตํ•ด ์†์‹ค ํ•จ์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ ์ œ๊ณฑํ•ฉ์ด ๋‹จ์ˆœ ๋ˆ„์ ๋˜์–ด ๋ฌดํ•œ๋Œ€๋กœ ๋ฐœ์‚ฐํ•˜๋Š” ๊ฒƒ์„ ๋ง‰์•„์ค๋‹ˆ๋‹ค. RMSProp์˜ ๊ณต์‹์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

$$ \theta_{t+1}=\theta_{t} - \frac{\eta}{\sqrt{G_{t}+\epsilon}} \odot \triangledown J(\theta_{t}) $$

$$ G_{t}=\gamma G_{t-1}+(1-\gamma)(\triangledown J(\theta_t))^2 $$

  • $ \gamma $ : decay rate (decaying factor)

  ์—ฌ๊ธฐ์„œ ๊ฐ์†Œ์œจ์€ ์ผ๋ฐ˜์ ์œผ๋กœ 0.9 ํ˜น์€ 0.99์˜ ๋†’์€ ๊ฐ’์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด์™ธ์˜ ๊ฐœ๋…๋“ค์€ AdaGrad์™€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ๊ตฌํ˜„ํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

class RMSProp:
    def __init__(self, lr: float=0.01, decay_rate: float=0.99):
        self.lr = lr
        self.dr = decay_rate   # decay rate (gamma)
        self.h = None          # exponential moving average of squared gradients (G_t)
    
    def update(self, params: Dict[str, np.ndarray],
               grads: Dict[str, np.ndarray]) -> None:
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        
        for key in params.keys():
            # decay the old information, then mix in the new squared gradient
            self.h[key] *= self.dr
            self.h[key] += (1 - self.dr) * grads[key] * grads[key]
            params[key] -= \
                self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
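  To make the contrast with AdaGrad concrete, the small sketch below (illustrative, not from the original post) tracks both accumulators for a constant gradient: AdaGrad's simple sum grows without bound, while RMSProp's exponential moving average settles near the squared gradient.

# Illustrative comparison of the two accumulators for a constant gradient g.
g = 1.0
gamma = 0.99                  # RMSProp decay rate
G_ada, G_rms = 0.0, 0.0
for t in range(1000):
    G_ada += g ** 2                                # simple sum: grows linearly with t
    G_rms = gamma * G_rms + (1 - gamma) * g ** 2   # EMA: saturates near g ** 2
print(G_ada, G_rms)           # 1000.0 vs roughly 1.0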

2.7 Adaptive Moment Estimation (Adam)

  Adam is the technique that combines momentum and RMSProp. Through bias-corrected (unbiased) estimates, Adam keeps $ m_t $ from barely moving at the start of training and keeps $ v_t $ from producing overly large steps early on. The Adam update rule with these corrections is as follows.

$$ m_t = \beta_1 m_{t-1}+(1-\beta_1)\triangledown J(\theta_{t}) $$

$$ v_t = \beta_2 v_{t-1}+(1-\beta_2)(\triangledown J(\theta_{t}))^2 $$

$$ \theta_{t+1}=\theta_t - \frac{\eta}{\sqrt{\hat{v}_t+\epsilon}}\hat{m}_t $$

$$ (\hat{m}_t=\frac{m_t}{1-\beta^t_1},\ \ \hat{v}_t=\frac{v_t}{1-\beta^t_2}) $$

  • $ m_t $ : momentum term
  • $ v_t $ : RMSProp term
  • $ \beta $ : decay constant
  • $ \hat{m_t}, \hat{v_t} $ : bias-corrected estimates
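  To see why the bias correction is needed, consider the very first step ($ t=1 $) with $ m_0 = 0 $ and $ \beta_1 = 0.9 $: the raw first moment is only a tenth of the gradient, and dividing by $ 1-\beta_1^1 $ restores its scale.

$$ m_1=(1-\beta_1)g_1=0.1\,g_1,\ \ \hat{m}_1=\frac{m_1}{1-\beta_1}=g_1 $$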

  ์ผ๋ฐ˜์ ์œผ๋กœ $ \beta_1 $์€ 0.9๋ฅผ, $ \beta_2 $๋Š” 0.999๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ๊ตฌํ˜„ํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

class Adam:
    def __init__(self, lr: float=0.001, beta1: float=0.9,
                 beta2: float=0.999):
        self.lr = lr
        self.b1 = beta1
        self.b2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
    
    def update(self, params: Dict[str, np.ndarray],
               grads: Dict[str, np.ndarray]) -> None:
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1

        for key in params.keys():
            # Momentum
            self.m[key] = \
                self.b1 * self.m[key] + (1 - self.b1) * grads[key]
            m_hat = \
                self.m[key] / (1 - self.b1 ** self.iter)  # bias correction
            # RMSProp
            self.v[key] = \
                self.b2 * self.v[key] + (1 - self.b2) * (grads[key] ** 2)
            v_hat = \
                self.v[key] / (1 - self.b2 ** self.iter)  # bias correction
            # Update
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-7)
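  As a quick usage sketch (the toy function and values below are my own illustration, not from the original post), all of the optimizer classes here share the same interface: the parameters and gradients are dictionaries of NumPy arrays, and update() modifies the parameters in place.

# Illustrative usage on the 1-D quadratic f(w) = 0.5 * w**2, whose gradient is w.
optimizer = Adam(lr=0.1)
params = {'w': np.array([5.0])}

for _ in range(200):
    grads = {'w': params['w'].copy()}   # gradient of 0.5 * w**2 is w itself
    optimizer.update(params, grads)

print(params['w'])   # driven close to the minimum at w = 0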

  Adam์„ ์‹œ๊ฐํ™”ํ•œ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์˜ ๊ทธ๋ฆผ 3๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Figure 3. Optimization path under Adam

  Adam์˜ ์ตœ์ ํ™” ๊ฐฑ์‹  ๊ฒฝ๋กœ๋ฅผ ์‚ดํŽด ๋ณด๋ฉด, ๊ธฐ์กด์˜ ๋ชจ๋ฉ˜ํ…€๊ณผ ๊ฐ™์ด ์˜ค๋ฒ„์ŠˆํŒ…์„ ํ•˜๋ฉด์„œ๋„ ์ „์—ญ ์ตœ์†Ÿ๊ฐ’์„ ํ–ฅํ•ด ํšจ์œจ์ ์œผ๋กœ ์›€์ง์ด๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฆ‰, ์ด๋Š” ์•ž์„œ ์–ธ๊ธ‰ํ•œ ๋ชจ๋ฉ˜ํ…€์˜ ํŠน์ง•๊ณผ RMSProp์˜ ํŠน์ง• ๋ชจ๋‘๋ฅผ ์ ์šฉํ–ˆ๋‹ค๋Š” ๊ฒƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค.

2.8 Nesterov-accelerated Adaptive Moment Estimation (Nadam)

  Nadam์€ Adam์—์„œ ์ ์šฉํ•œ ๋ชจ๋ฉ˜ํ…€ ๊ธฐ๋ฒ•์„ NAG๋กœ ๋ณ€๊ฒฝํ•˜์˜€์Šต๋‹ˆ๋‹ค. Nadam์€ Adam๊ณผ NAG์˜ ์žฅ์ ์„ ํ•ฉ์ณค๊ธฐ ๋•Œ๋ฌธ์—, Adam๋ณด๋‹ค ๋” ๋น ๋ฅด๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ์ „์—ญ ์ตœ์†Ÿ๊ฐ’์„ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. Nadam์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ธฐ์กด์˜ NAG์˜ ๊ณต์‹์„ ์กฐ๊ธˆ ์ˆ˜์ •ํ•  ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. NAG์—์„œ ๋ชจ๋ฉ˜ํ…€์„ ์กฐ์ •ํ•˜๋Š” ์ˆ˜์‹์€ ์•„๋ž˜์™€ ๊ฐ™์•˜์Šต๋‹ˆ๋‹ค.

$$ g_t =\triangledown J(\theta_t - \gamma m_{t-1}) $$

$$ m_t = \gamma m_{t-1} + \eta g_t $$

$$ \theta_{t+1} = \theta_t - m_t $$

  NAG๋Š” ํ˜„์žฌ์˜ ์œ„์น˜($ \theta_t $)์—์„œ ํ˜„์žฌ์˜ ๋ชจ๋ฉ˜ํ…€($ \ m_t $)๋งŒํผ ์ด๋™ํ•œ ์ž๋ฆฌ์—์„œ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ตฌํ•˜์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ์ด์ „ ๋‹จ๊ณ„์˜ ๋ชจ๋ฉ˜ํ…€์— ๋”ํ•ด์คŒ์œผ๋กœ์จ ํ˜„์žฌ์˜ ๋ชจ๋ฉ˜ํ…€($ \ m_t $)๋ฅผ ๊ฐฑ์‹ ํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

  ์œ„์˜ NAG ๊ณต์‹์—์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐฑ์‹ ์„ ์œ„ํ•˜์—ฌ ์ด์ „ ๋‹จ๊ณ„์˜ ๋ชจ๋ฉ˜ํ…€($ m_{t-1} $)์„ 2๋ฒˆ ์‚ฌ์šฉํ–ˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Nadam์€ ์ด๋ฅผ ์กฐ๊ธˆ ๋ณ€ํ˜•ํ•ฉ๋‹ˆ๋‹ค. ์ด์ „ ๋‹จ๊ณ„์˜ ๋ชจ๋ฉ˜ํ…€($ m_{t-1} $)์„ ๋Œ€์‹ ํ•˜์—ฌ ํ˜„์žฌ์˜ ๋ชจ๋ฉ˜ํ…€($ m_t $)์„ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ๋ฏธ๋ž˜์˜ ๋ชจ๋ฉ˜ํ…€์„ ์‚ฌ์šฉํ•˜๋Š” ํšจ๊ณผ๋ฅผ ์–ป์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ NAG์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •์— ๋ฐ˜์˜ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

$$ g_t=\triangledown J(\theta_t) $$

$$ \theta_{t+1} = \theta_t - (\gamma m_t + \eta g_t) $$

  ์œ„์˜ ํšจ๊ณผ๋ฅผ Adam์— ์ ์šฉํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด์„œ๋Š” ๊ธฐ์กด์˜ Adam์ด ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ˆ˜์ •ํ•˜๋Š” ๋ถ€๋ถ„์„ ์กฐ๊ธˆ ๋” ํ’€์–ด์„œ ์ž‘์„ฑํ•  ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

$$ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{\hat{v_t}+\epsilon}}\hat{m_t} $$

$$ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{\hat{v_t}+\epsilon}}\left(\frac{\beta_1m_{t-1}}{1-\beta_{1}^{t}}+\frac{(1-\beta_1)g_t}{1-\beta_{1}^{t}}\right) $$

$$ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{\hat{v_t}+\epsilon}}\left(\beta_1 \hat{m_{t-1}}+\frac{(1-\beta_1)g_t}{1-\beta_{1}^{t}}\right) $$

  ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ ๋ฏธ๋ž˜์˜ ๋ชจ๋ฉ˜ํ…€์„ ์‚ฌ์šฉํ•˜๋Š” ํšจ๊ณผ๋ฅผ Adam์— ์ ์šฉํ•˜๋ฉด, ์•„๋ž˜์™€ ๊ฐ™์ด ๊ณต์‹์„ ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

$$ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{\hat{v_t}+\epsilon}}\left(\beta_1 \hat{m_{t}}+\frac{(1-\beta_1)g_t}{1-\beta_{1}^{t}}\right) $$

  ์œ„์˜ ๊ณต์‹์„ ์ ์šฉํ•˜์—ฌ Nadam์„ ๊ตฌํ˜„ํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

class Nadam:
    def __init__(self, lr: float=0.001, beta1: float=0.9,
                 beta2: float=0.999):
        self.lr = lr
        self.b1 = beta1
        self.b2 = beta2
        self.iter = 0
        self.m = None
        self.v = None
    
    def update(self, params: Dict[str, np.ndarray],
               grads: Dict[str, np.ndarray]) -> None:
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1

        for key in params.keys():
            # Momentum
            self.m[key] = \
                self.b1 * self.m[key] + (1 - self.b1) * grads[key]
            m_hat = \
                self.m[key] / (1 - self.b1 ** self.iter)  # bias correction
            
            # RMSProp
            self.v[key] = \
                self.b2 * self.v[key] + (1 - self.b2) * (grads[key] ** 2)
            v_hat = \
                self.v[key] / (1 - self.b2 ** self.iter)  # bias correction
            
            # Update
            params[key] -= \
                self.lr / (np.sqrt(v_hat) + 1e-7) * \
                    (self.b1 * m_hat + (1 - self.b1) * \
                        grads[key] / (1 - (self.b1 ** self.iter)))

  Nadam์˜ ์‹œ๊ฐํ™” ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Figure 4. Optimization path under Nadam

  Adam๊ณผ Nadam์˜ ์‹œ๊ฐํ™” ๊ฒฐ๊ณผ๋ฅผ ๋น„๊ตํ•ด๋ณด๋ฉด, Nadam์ด ๋” ๋น ๋ฅด๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ์ „์—ญ ์ตœ์†Ÿ๊ฐ’์„ ์ฐพ์•„๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


3. ์˜ตํ‹ฐ๋งˆ์ด์ € ๋น„๊ต

  SGD๋ฅผ ํฌํ•จํ•˜์—ฌ ์ด 7๊ฐœ์˜ ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์•Œ์•„๋ณด์•˜์Šต๋‹ˆ๋‹ค. SGD๋ฅผ ์ œ์™ธํ•œ 6๊ฐœ์˜ ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์‹œ๊ฐํ™”ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. 6๊ฐœ์˜ ์˜ตํ‹ฐ๋งˆ์ด์ € ๋ชจ๋‘ ๋™์ผํ•œ ์—ํญ์˜ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

Figure 5. Comparison of optimization methods

  ๊ทธ๋ฆผ 5์—์„œ ์‚ฌ์šฉํ•œ ๊ธฐ๋ฒ•์— ๋”ฐ๋ผ ๊ฐฑ์‹  ๊ฒฝ๋กœ๊ฐ€ ๋‹ฌ๋ผ์ง์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆผ 5๋งŒ ๋ณด๋ฉด, AdaGrad์™€ RMSPorp์ด ๊ฐ€์žฅ ๋‚˜์•„๋ณด์ž…๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ํ’€์–ด์•ผ ํ•  ๋ฌธ์ œ๊ฐ€ ๋ฌด์—‡์ธ์ง€์— ๋”ฐ๋ผ ์–ด๋–ค ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์‚ฌ์šฉํ•ด์•ผํ• ์ง€ ๊ฒฐ์ •ํ•  ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ์„ค์ •ํ•˜๋Š๋ƒ์— ๋”ฐ๋ผ์„œ๋„ ๊ทธ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

Figure 6. Optimizers searching for the global minimum [S. Ruder, 2016]

  ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋Œ€ํ•œ ์ตœ๊ทผ ์—ฐ๊ตฌ์—์„œ๋Š” SGD๊ฐ€ Adam์— ๋น„ํ•˜์—ฌ ์ผ๋ฐ˜ํ™”(generalization)๋ฅผ ์ž˜ํ•˜์ง€๋งŒ, Adam์˜ ์†๋„๊ฐ€ SGD์— ๋น„ํ•ด ํ›จ์”ฌ ๋น ๋ฅด๋‹ค๋Š” ๊ฒฐ๋ก ์„ ๋‚ด๋ ธ์Šต๋‹ˆ๋‹ค. ์ด ์—ฐ๊ตฌ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ตœ๊ทผ์—๋Š” Adam๊ณผ SGD์˜ ์žฅ์ ์„ ๊ฒฐํ•ฉํ•˜๋ ค๋Š” ์‹œ๋„๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ์ค‘ ๋Œ€ํ‘œ์ ์ธ ๊ฒƒ์ด ๋ฐ”๋กœ Adam์„ SGD๋กœ ์ „ํ™˜ํ•œ SWATS์ž…๋‹ˆ๋‹ค. ๋˜ ์ด์™ธ์—๋„ AMSBound์™€ AdaBound ๋“ฑ์ด ๋“ฑ์žฅํ•˜๋ฉฐ, Adam์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•œ ์‹œ๋„๋Š” ์ด์–ด์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์ตœ์‹  ์˜ตํ‹ฐ๋งˆ์ด์ €๋Š” ์ถ”ํ›„์˜ ๊ฒŒ์‹œ๋ฌผ์—์„œ ๋‹ค๋ฃจ๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.


์ฐธ๊ณ ์ž๋ฃŒ ์ถœ์ฒ˜

- ํ•˜์šฉํ˜ธ, "์ž์Šตํ•ด๋„ ๋ชจ๋ฅด๊ฒ ๋˜ ๋”ฅ๋Ÿฌ๋‹, ๋จธ๋ฆฟ์†์— ์ธ์Šคํ†จ์‹œ์ผœ๋“œ๋ฆฝ๋‹ˆ๋‹ค", https://www.slideshare.net/yongho/ss-79607172, 2017

- S. Ruder, "An overview of gradient descent optimization algorithms", https://ruder.io/optimizing-gradient-descent, 2016