
WaveNet: A Generative Model for Raw Audio (Summary)

James Hwang😎 2023. 6. 18. 16:40
📜 A. van den Oord et al., "WaveNet: A Generative Model for Raw Audio," in arXiv, 2016.

One-line summary of the paper

  1. WaveNet is a deep learning model that generates audio waveforms using dilated causal convolutions.

Abstract

  This paper introduces "WaveNet", a neural network that generates raw audio waveforms. WaveNet is a probabilistic, autoregressive model: the predictive distribution for each audio sample is conditioned on all previous samples. A single WaveNet can capture the characteristics of many different speakers with equal fidelity and can switch between them by conditioning on the speaker identity.


1. Introduction

  Inspired by autoregressive generative models, the paper explores techniques for generating raw audio. The central question is whether the neural network approaches studied earlier can generate wideband audio with at least 16,000 samples per second. WaveNet is an audio generation model based on the PixelCNN architecture [van den Oord et al., 2016]. The main contributions are as follows.

[Figure 1] A generated waveform signal

  • WaveNet generates speech with a subjective naturalness not previously reported in the text-to-speech (TTS) field.
  • It is a new architecture based on dilated causal convolutions, which give it a very large receptive field.
  • When conditioned on speaker identity, a single model can generate the voices of different speakers.
  • The same architecture shows strong results on a small speech recognition dataset and looks promising for generating other audio modalities such as music.

  The authors expect WaveNet's generic and flexible framework to be applicable to many tasks that involve audio generation, such as TTS and voice conversion.


2. WaveNet

  WaveNet is a generative model that operates directly on the raw audio waveform. The joint probability of a waveform $ \mathbf{x} = \left\{x_1, \ldots, x_T \right\} $ factorizes into a product of conditional probabilities:

$$ \begin{equation} p(\mathbf{x})=\prod_{t=1}^{T}{p(x_t|x_1,\ldots,x_{t-1})} \end{equation} $$

  Each audio sample $ x_t $ is therefore conditioned on the samples at all previous time steps. As in PixelCNN [van den Oord et al., 2016], the conditional probability distribution is modeled by a stack of convolutional layers.
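
  As a rough illustration of the factorization in Eq. (1), the sketch below sums conditional log-probabilities over the time steps of a short sequence. The function `toy_conditional` is a hypothetical stand-in for the network and is not taken from the paper; only the chain-rule structure is the point.

```python
import math

def toy_conditional(history, num_classes=4):
    """Hypothetical stand-in for the network: a categorical distribution
    over the next sample that mildly favours repeating the last value."""
    weights = [1.0] * num_classes
    if history:
        weights[history[-1]] += 2.0
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(x, num_classes=4):
    """log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    total = 0.0
    for t, x_t in enumerate(x):
        probs = toy_conditional(x[:t], num_classes)  # sees only earlier samples
        total += math.log(probs[x_t])
    return total

print(log_likelihood([0, 0, 1, 1]))
```

  Working in log space turns the product of Eq. (1) into a sum, which is the usual way to avoid numerical underflow for long sequences.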

Autoregressive model (AR model)
  In the classical sense, a model whose output variable depends linearly on its own previous values and on a stochastic term. WaveNet is autoregressive in the broader sense: each output depends on the previous outputs, but through a nonlinear network.

2.1 Dilated Causal Convolutions

[Figure 3] Visualization of dilated causal convolutions

  The core of WaveNet's audio generation is the causal convolution. Causal convolutions guarantee that the model cannot violate the order in which the data is modeled: the prediction $ p(x_{t+1}|x_1, \ldots, x_t) $ emitted at time step $ t $ uses none of the future samples $ x_{t+1}, \ldots, x_T $. Figure 3 shows that the output at time step $ t $ depends only on inputs up to time step $ t $.
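
  The causality property can be checked with a small pure-Python sketch. The function name `causal_conv1d` and the filter weights are illustrative assumptions; the sketch only shows that changing a future input leaves earlier outputs untouched.

```python
def causal_conv1d(x, weights):
    """y[t] = sum_i weights[i] * x[t - i], treating x[t - i] as 0 when t - i < 0.
    weights[0] multiplies the current input, weights[1] the previous one, etc."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            if t - i >= 0:            # never look at positions after t
                acc += w * x[t - i]
        y.append(acc)
    return y

x = [1.0, 2.0, 3.0, 4.0]
y = causal_conv1d(x, [0.5, 0.5])

# Replace the last (future) sample and recompute.
x_changed = x[:3] + [100.0]
y_changed = causal_conv1d(x_changed, [0.5, 0.5])

print(y[:3] == y_changed[:3])  # outputs before the changed sample are identical
```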

  During training, the ground truth $ \mathbf{x} $ is known for every time step, so the conditional predictions for all time steps can be computed in parallel. During inference, however, the predictions are sequential (autoregressive): each predicted sample is fed back into the network to predict the next one.
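
  The sequential inference loop can be sketched as follows. `toy_next_distribution` is a hypothetical placeholder for a trained network, and the class count and seed are arbitrary; the part that mirrors the text is that each sampled value is appended to the history before the next prediction.

```python
import random

def toy_next_distribution(history, num_classes=4):
    """Hypothetical stand-in for a trained network: returns p(x_t | history)."""
    weights = [1.0] * num_classes
    if history:
        weights[history[-1]] += 2.0      # mildly favour repeating the last value
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples, num_classes=4, seed=0):
    """Sample one value at a time, feeding each sample back as input."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        probs = toy_next_distribution(samples, num_classes)
        next_sample = rng.choices(range(num_classes), weights=probs, k=1)[0]
        samples.append(next_sample)      # becomes part of the next step's input
    return samples

print(generate(10))
```

  Because every step waits for the previous sample, generation cannot be parallelized over time in the way training can.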

  Because the model uses causal convolutions, it has no recurrent connections, which typically makes it faster to train than an RNN. The drawback of causal convolutions is that they need many layers or large filters to enlarge the receptive field. The paper therefore uses dilated convolutions, which increase the receptive field by orders of magnitude without a large increase in computational cost.

[Figure 4] Example of a dilated convolution network in operation

  As Figure 4 shows, a dilated convolution applies a filter over an area larger than its length by skipping input values with a fixed step. It lets the network operate on a coarser scale than a normal convolution. This resembles pooling or strided convolution, except that the output has the same size as the input.
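
  A minimal sketch of a dilated causal convolution, assuming a single channel and left zero-padding (both are simplifications chosen for illustration). With `dilation=2` and a two-tap filter, each output combines the current input with the input two steps earlier, and the output keeps the input length.

```python
def dilated_causal_conv1d(x, weights, dilation=1):
    """y[t] = sum_i weights[i] * x[t - i * dilation], zero-padded on the left."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            j = t - i * dilation         # skip `dilation - 1` inputs between taps
            if j >= 0:
                acc += w * x[j]
        y.append(acc)
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = dilated_causal_conv1d(x, [1.0, 1.0], dilation=2)
print(y)               # [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
print(len(y) == len(x))
```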

Pooling
  An operation that summarizes each local window of a convolution output with a single value. Common variants are max pooling, which takes the maximum of the window, and average pooling, which takes its mean.
Stride
  The number of positions the convolution kernel moves between successive applications.

  With dilated convolutions, the network obtains a very large receptive field from only a few layers, while the input resolution and the computational efficiency are preserved throughout the network. As shown below, the dilation is doubled at every layer up to a limit and then the pattern is repeated.

$$ 1, 2, 4, \ldots, 512, 1, 2, 4, \ldots, 512, 1, 2, 4, \ldots, 512. $$

  This configuration has two effects.

  1. Increasing the dilation factor exponentially makes the receptive field grow exponentially with depth [Yu & Koltun, 2016].
  2. Stacking these dilation blocks further increases the model capacity and the receptive field size.
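
  The receptive field of such a stack can be worked out with a few lines of arithmetic. The sketch assumes a filter width of 2 (an assumption of this example, stated as a default argument); each layer then extends the receptive field by its dilation.

```python
def receptive_field(dilations, kernel_size=2):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

block = [2 ** i for i in range(10)]      # 1, 2, 4, ..., 512

print(receptive_field(block))            # 1024
print(receptive_field(block * 3))        # 3070
```

  Ten layers with doubling dilations cover 1,024 samples, whereas ten undilated layers of the same width would cover only 11. Repeating the block three times adds depth and grows the receptive field further, but only linearly in the number of blocks.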

2.2 Softmax Distributions

  One way to model the conditional distribution $ p(x_t|x_1, \ldots, x_{t-1}) $ of an individual audio sample would be a mixture model such as a mixture density network. However, a softmax distribution tends to work better, even when the data is implicitly continuous. A softmax (categorical) distribution is more flexible and can model arbitrary distributions more easily because it makes no assumptions about their shape.
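
  To make the "no shape assumption" point concrete, here is a small sketch of a softmax over a handful of amplitude classes. The logits are made-up numbers with two separate peaks, a shape that a single Gaussian could not represent but a categorical distribution handles without any special treatment.

```python
import math

def softmax(logits):
    """Turn unnormalized scores into a categorical distribution."""
    peak = max(logits)                          # subtract the max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over 8 amplitude classes with two distinct modes.
logits = [0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0]
probs = softmax(logits)

print([round(p, 3) for p in probs])
```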

  Raw audio is typically stored as a sequence of 16-bit integer values (one per time step), so a softmax layer would have to output 65,536 probabilities per time step to cover all possible values. To make this tractable, a $ \mu $-law companding transformation is applied to the data, which is then quantized to 256 possible values.

Quantization
  Mapping values from a continuous (or very large) range to a small, finite set of representative values (e.g. 0.5 → 1).

  The transformation is given by:

$$ f(x_t) = \text{sign}(x_t)\frac{\ln(1+\mu|x_t|)}{\ln(1+\mu)} $$

  Here $ -1 < x_t < 1 $ and $ \mu = 255 $. This non-linear quantization reconstructs the signal considerably better than a simple linear quantization scheme. For speech in particular, the authors found that the signal reconstructed after quantization sounds very similar to the original audio.
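
  The following sketch applies $ \mu $-law companding with $ \mu = 255 $ and quantizes to 256 integer classes. The compression formula is the standard one; the rounding scheme in `quantize` and `dequantize` (uniform steps over the companded range) is an assumption of this example, since the summary does not spell it out.

```python
import math

MU = 255

def mu_law_compress(x, mu=MU):
    """f(x) = sign(x) * ln(1 + mu * |x|) / ln(1 + mu), for x in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

def quantize(x, mu=MU):
    """Map x in [-1, 1] to an integer class in 0..mu (assumed rounding scheme)."""
    y = mu_law_compress(x, mu)
    return min(mu, int((y + 1) / 2 * mu + 0.5))

def dequantize(q, mu=MU):
    """Map an integer class back to an amplitude in [-1, 1]."""
    y = 2 * q / mu - 1
    return mu_law_expand(y, mu)

for x in (-1.0, -0.5, -0.01, 0.0, 0.01, 0.5, 1.0):
    q = quantize(x)
    print(x, q, round(dequantize(q), 4))
```

  Small amplitudes get finer steps than large ones, which is why the reconstruction error near zero is much smaller than a linear 8-bit quantizer would give.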

2.3 Gated Activation Units

  WaveNet uses the same gated activation unit as the gated PixelCNN [van den Oord et al., 2016].

$$ \begin{equation} \mathbf{z}=\tanh{(W_{f, k}*\mathbf{x})}\odot\sigma{(W_{g, k}*\mathbf{x})} \end{equation} $$

  Here $ * $ denotes a convolution operator, $ \odot $ an element-wise multiplication, $ \sigma{(\cdot)} $ the sigmoid function, $ k $ the layer index, $ f $ and $ g $ the filter and the gate respectively, and $ W $ a learnable convolution filter. In the authors' initial experiments, this non-linearity worked better than ReLU for modeling audio signals.
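
  A single-channel sketch of the gated activation unit is given below. The helper names and filter weights are illustrative assumptions; the structure follows the equation above, with a tanh "filter" branch multiplied element-wise by a sigmoid "gate" branch.

```python
import math

def causal_conv1d(x, weights, dilation=1):
    """y[t] = sum_i weights[i] * x[t - i * dilation], zero-padded on the left."""
    return [
        sum(w * x[t - i * dilation]
            for i, w in enumerate(weights) if t - i * dilation >= 0)
        for t in range(len(x))
    ]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_activation(x, w_filter, w_gate, dilation=1):
    """z = tanh(W_f * x) (elementwise product) sigmoid(W_g * x)."""
    f = causal_conv1d(x, w_filter, dilation)
    g = causal_conv1d(x, w_gate, dilation)
    return [math.tanh(a) * sigmoid(b) for a, b in zip(f, g)]

x = [0.1, -0.3, 0.2, 0.4, -0.1]
z = gated_activation(x, w_filter=[0.6, 0.4], w_gate=[0.3, 0.7], dilation=2)
print([round(v, 4) for v in z])
```

  The sigmoid branch lies in (0, 1) and acts as a soft switch on the tanh branch, so every output stays within (-1, 1).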

2.4 Residual and Skip Connections

Figure 3. Overall architecture including the residual block

  ์ˆ˜๋ ด ์†๋„๋ฅผ ๋†’์ด๊ณ  ๋ชจ๋ธ์„ ๊นŠ๊ฒŒ ์Œ“์•„์„œ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด residual connection๊ณผ ๋งค๊ฐœ๋ณ€์ˆ˜ํ™”๋œ skip connection์„ ๋„คํŠธ์›Œํฌ ์ „์ฒด์— ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์•„๋ž˜์˜ ๊ทธ๋ฆผ์— residual block์œผ๋กœ ํ‘œํ˜„ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

2.5 Conditional WaveNets

  Given an additional input $ \mathbf{h} $, WaveNet can model the conditional distribution $ p(\mathbf{x}|\mathbf{h}) $ of the audio. With $ \mathbf{h} $, Eq. (1) becomes:

$$ p(\mathbf{x}|\mathbf{h})=\prod_{t=1}^{T}{p(x_t|x_1,\ldots,x_{t-1}, \mathbf{h})} $$

  By conditioning the model on other input variables, WaveNet's generation can be guided to produce audio with the required characteristics. For TTS, for example, information about the text has to be fed in as an extra input. The paper conditions WaveNet on its inputs in two different ways.

2.5.1 Global conditioning

  Global conditioning uses a single latent representation $ \mathbf{h} $, such as a speaker embedding, that influences the output distribution across all time steps. The activation function of Eq. (2) then becomes:

$$ \mathbf{z}=\tanh{(W_{f, k}*\mathbf{x}+V_{f, k}^{T}\mathbf{h})}\odot\sigma{(W_{g, k}*\mathbf{x}+V_{g, k}^{T}\mathbf{h})} $$

  Here $ V_{*, k} $ is a learnable linear projection, and the vector $ V_{*, k}^{T}\mathbf{h} $ is broadcast over the time dimension.
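
  A single-channel sketch of global conditioning follows. The speaker vector is a one-hot encoding, and the projections `v_f` and `v_g` are hypothetical numbers; with one channel, the projected conditioning term is a single scalar per branch that is added at every time step, which is what broadcasting over time amounts to.

```python
import math

def causal_conv1d(x, weights, dilation=1):
    """y[t] = sum_i weights[i] * x[t - i * dilation], zero-padded on the left."""
    return [
        sum(w * x[t - i * dilation]
            for i, w in enumerate(weights) if t - i * dilation >= 0)
        for t in range(len(x))
    ]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def gated_unit_global(x, h, w_f, w_g, v_f, v_g, dilation=1):
    """z = tanh(W_f * x + V_f^T h) (elementwise product) sigmoid(W_g * x + V_g^T h)."""
    bias_f = sum(v * hi for v, hi in zip(v_f, h))   # same scalar for every t
    bias_g = sum(v * hi for v, hi in zip(v_g, h))
    f = causal_conv1d(x, w_f, dilation)
    g = causal_conv1d(x, w_g, dilation)
    return [math.tanh(a + bias_f) * sigmoid(b + bias_g) for a, b in zip(f, g)]

x = [0.1, -0.3, 0.2, 0.4]
v_f, v_g = [0.5, -0.5], [1.0, -1.0]     # hypothetical projections for 2 speakers
z_speaker0 = gated_unit_global(x, one_hot(0, 2), [0.6, 0.4], [0.3, 0.7], v_f, v_g)
z_speaker1 = gated_unit_global(x, one_hot(1, 2), [0.6, 0.4], [0.3, 0.7], v_f, v_g)
print(z_speaker0 != z_speaker1)         # same audio input, different speaker
```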

2.5.2 Local conditioning

  Local conditioning uses a second time series $ h_t $, such as linguistic features, which may have a lower sampling frequency than the audio signal. A transposed convolutional network upsamples this series to the same resolution as the audio signal. The upsampled time series $ \mathbf{y}=f(\mathbf{h}) $ then enters the activation unit as follows.

$$ \mathbf{z} = \tanh{(W_{f, k} * \mathbf{x} + V_{f, k} * \mathbf{y})} \odot \sigma{(W_{g, k} * \mathbf{x} + V_{g, k} * \mathbf{y})} $$

  Here $ V_{f, k} * \mathbf{y} $ (and likewise $ V_{g, k} * \mathbf{y} $) is a $ 1 \times 1 $ convolution. As an alternative to the transposed convolutional network, one can use $ V_{f, k} * \mathbf{h} $ directly and repeat these values across time, although the paper reports that this worked slightly worse in its experiments.
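
  The sketch below illustrates local conditioning with the simpler of the two upsampling options, repetition across time, rather than a learned transposed convolution. It is single-channel, so the $ 1 \times 1 $ convolutions become scalar multiplications by `v_f` and `v_g`; all names and numbers are assumptions of the example.

```python
import math

def causal_conv1d(x, weights, dilation=1):
    """y[t] = sum_i weights[i] * x[t - i * dilation], zero-padded on the left."""
    return [
        sum(w * x[t - i * dilation]
            for i, w in enumerate(weights) if t - i * dilation >= 0)
        for t in range(len(x))
    ]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def upsample_by_repetition(h, factor):
    """Repeat each low-rate conditioning value `factor` times."""
    return [value for value in h for _ in range(factor)]

def gated_unit_local(x, y, w_f, w_g, v_f, v_g, dilation=1):
    """z = tanh(W_f * x + V_f * y) (elementwise product) sigmoid(W_g * x + V_g * y)."""
    if len(x) != len(y):
        raise ValueError("conditioning must be upsampled to the audio length")
    f = causal_conv1d(x, w_f, dilation)
    g = causal_conv1d(x, w_g, dilation)
    return [math.tanh(a + v_f * yt) * sigmoid(b + v_g * yt)
            for a, b, yt in zip(f, g, y)]

x = [0.1, -0.3, 0.2, 0.4, -0.1, 0.0]    # audio-rate input
h = [0.0, 1.0]                          # hypothetical low-rate feature
y = upsample_by_repetition(h, factor=3) # now one value per audio sample
z = gated_unit_local(x, y, [0.6, 0.4], [0.3, 0.7], v_f=0.8, v_g=-0.2)
print(y)                                # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
```

  Unlike the global case, the conditioning term now changes over time, so different parts of the same utterance can be steered by different features.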

2.6 Context Stacks

  Several ways of enlarging WaveNet's receptive field were mentioned above. A complementary approach is to use a separate, smaller context stack that processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a shorter part of the signal. Multiple context stacks with varying lengths and numbers of hidden units can also be used.


3. Experiments

  WaveNet's audio modeling performance is evaluated on three different tasks: multi-speaker speech generation, TTS, and music audio modeling. This post covers only the first two.

3.1 Multi-Speaker Speech Generation

  The speaker ID is fed to the model as a one-hot vector to control the voice of the generated speech. By conditioning on this one-hot encoding, a single WaveNet was able to model speech from any of the speakers. Training on multiple speakers gave better validation set performance than training on a single speaker, which suggests that WaveNet's internal representation is shared among the speakers. Finally, the model also picked up characteristics of the audio other than the voice itself, such as the acoustics and recording quality and the speakers' breathing and mouth movements.

3.2 Text-to-Speech

  For TTS, WaveNet is locally conditioned on linguistic features derived from the input text. It is also trained conditioned on the logarithmic fundamental frequency $ \log F_{0} $ in addition to the linguistic features, so that pitch information is available to the model. External models that predict the $ \log F_{0} $ values and phone durations from the linguistic features are trained as well. WaveNet is evaluated with a subjective mean opinion score (MOS) test; the results are shown in the table below.

Table 1. MOS comparison


4. Conclusion

  The WaveNet proposed in the paper generates audio autoregressively. It combines causal filters with dilated convolutions so that the receptive field grows exponentially with depth. The paper also shows how WaveNet can be conditioned on other inputs in a global or local way. When applied to TTS, WaveNet produced samples that were rated as more natural than those of the existing TTS systems it was compared with. Finally, WaveNet showed promising results when applied to music audio modeling and speech recognition.


Reference

  1. A. van den Oord et al., "Pixel Recurrent Neural Networks," in ICML, 2016.
  2. F. Yu and V. Koltun, "Multi-Scale Context Aggregation by Dilated Convolutions," in ICLR, 2016.

* The paper cites more works; only those mentioned in this summary are listed.


Further reading

  1. "WaveNet: A Generative Model for Raw Audio," https://www.deepmind.com/blog/wavenet-a-generative-model-for-raw-audio, 2016.
  2. "Types of Convolution Kernels: Simplified," https://towardsdatascience.com/types-of-convolution-kernels-simplified-f040cb307c37, 2019.