์ธ๊ณต์ง€๋Šฅ ๋…ผ๋ฌธ ์š”์•ฝ/Text-to-Speech

Tacotron: Towards End-to-End Speech Synthesis (Summary)

James Hwang๐Ÿ˜Ž 2023. 7. 3. 00:27
๐Ÿ“œ Y. Wang et al., "Tacotron: Towards End-to-End Speech Synthesis," in Interspeech, 2017

๋…ผ๋ฌธ 3์ค„ ์š”์•ฝ

  1. ๋ณต์žกํ•œ ๊ตฌ์กฐ์˜ ํ˜„๋Œ€ TTS ๋ชจ๋ธ์„ end-to-end ๊ตฌ์กฐ๋กœ ๋ณ€ํ™”ํ•˜์˜€๋‹ค.
  2. <ํ…์ŠคํŠธ, ์Œ์„ฑ> ์Œ์œผ๋กœ ํ•™์Šตํ•˜์—ฌ ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ, ๋” ๋‹ค์–‘ํ•œ ํŠน์ง•์˜ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ด์กŒ๋‹ค.
  3. ์˜ค๋””์˜ค ์ƒ˜ํ”Œ ๋‹จ์œ„์˜ ์ƒ์„ฑ์ด ์•„๋‹Œ, Mel-spectrogram ํ”„๋ ˆ์ž„ ๋‹จ์œ„๋กœ ์Œ์„ฑ์„ ์ƒ์„ฑํ•˜์—ฌ ๋” ๋น ๋ฅธ ํ•™์Šต๊ณผ ์ถ”๋ก ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

Abstract

  A text-to-speech (TTS) system typically consists of a frontend for text analysis, an acoustic model, and an audio synthesis module. Building each component requires extensive domain expertise, and each component may contain brittle design choices. This paper introduces Tacotron, an end-to-end TTS model that generates speech directly from characters. Given <text, audio> pairs, Tacotron can be trained completely from scratch with random initialization. The paper also presents several key techniques that make this challenging task work well. Tacotron achieves a mean opinion score (MOS) of 3.82 out of 5 on US English, outperforming a parametric system in terms of naturalness. In addition, because Tacotron generates speech at the frame level, it is substantially faster than sample-level autoregressive methods.


1. Introduction

  ํ˜„๋Œ€์˜ TTS ํŒŒ์ดํ”„๋ผ์ธ์€ ๊ต‰์žฅํžˆ ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ํ†ต๊ณ„์  parametric ๊ธฐ๋ฐ˜์˜ TTS๋Š” ์–ธ์–ด์  ์ •๋ณด(linguistic feature)๋ฅผ ์ถ”์ถœํ•˜๋Š” frontend ๋ชจ๋ธ๊ณผ duration์„ ๋ถ„์„ํ•˜๋Š” ๋ชจ๋ธ, ์Œํ–ฅ์  ์ •๋ณด(acoustic feature)๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ, ๋ณต์žกํ•œ ์‹ ํ˜ธ ์ฒ˜๋ฆฌ ๊ธฐ๋ฐ˜์˜ ๋ณด์ฝ”๋”(vocoder)๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ๊ฐ ๊ตฌ์„ฑ ์š”์†Œ๋“ค์„ ์œ„ํ•ด ๊ด‘๋ฒ”์œ„ํ•œ ๋„๋ฉ”์ธ์˜ ์ „๋ฌธ ์ง€์‹์„ ํ•„์š”๋กœ ํ•˜๋ฉฐ, ์ด๋ฅผ ์„ค๊ณ„ํ•˜๋Š” ์ผ์€ ๋งค์šฐ ์–ด๋ ต์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ๊ตฌ์„ฑ ์š”์†Œ๋ณ„๋กœ ๋…๋ฆฝ์ ์œผ๋กœ ํ•™์Šตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ตฌ์„ฑ ์š”์†Œ๋“ค์˜ ์˜ค์ฐจ๋Š” ๋ˆ„์ ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ <ํ…์ŠคํŠธ, ์Œ์„ฑ> ์Œ์œผ๋กœ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•œ end-to-end TTS ์‹œ์Šคํ…œ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์žฅ์ ์„ ์žˆ์Šต๋‹ˆ๋‹ค.

  1. ํœด๋ฆฌ์Šคํ‹ฑํ•˜๊ณ  ๋ถˆ์•ˆ์ •ํ•œ ์„ค๊ณ„๊ฐ€ ํฌํ•จ๋  ์ˆ˜ ์žˆ๋Š” feature engineering์˜ ํ•„์š”๋ฅผ ์ค„์ž…๋‹ˆ๋‹ค.
  2. ๋ฐœํ™”์ž(speaker)๋‚˜ ์–ธ์–ด(language), ๊ฐ์„ฑ(sentiment)๊ณผ ๊ฐ™์€ high-level์˜ ํŠน์ง•์„ ์‰ฝ๊ฒŒ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  3. ์ƒˆ๋กœ์šด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ adaptation์ด ๋” ์‰ฌ์›Œ์ง‘๋‹ˆ๋‹ค.
  4. ์—ฌ๋Ÿฌ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋œ ๋ชจ๋ธ๋ณด๋‹ค ๋” ๊ฒฌ๊ณ (robust)ํ•ฉ๋‹ˆ๋‹ค.

  TTS๋Š” ๋Œ€๊ทœ๋ชจ์˜ ์—ญ๋ณ€ํ™˜ ๋ฌธ์ œ(inverse problem)์ž…๋‹ˆ๋‹ค. TTS๋Š” ์ •๋ณด๊ฐ€ ๋งค์šฐ ์••์ถ•๋œ ํ…์ŠคํŠธ๋ฅผ "decompress"ํ•จ์œผ๋กœ์จ ์˜ค๋””์˜ค๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ™์€ ํ…์ŠคํŠธ๋ผ๋„ ์‚ฌ๋žŒ๋งˆ๋‹ค ๋ฐœ์Œ๊ณผ ๋งํ•˜๋Š” ๋ฐฉ์‹์ด ๋‹ค๋ฅด๊ธฐ ๋•Œ๋ฌธ์—, end-to-end ๋ชจ๋ธ์„ ์ด์šฉํ•˜์—ฌ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์€ ํŠนํžˆ ์–ด๋ ค์› ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๊ธฐ์กด์—๋Š” ์‹ ํ˜ธ ๋‹จ์œ„์—์„œ ์ฃผ์–ด์ง„ ์ž…๋ ฅ์— ๋Œ€ํ•œ ๋‹ค์–‘ํ•œ ๋ณ€ํ™”๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ–ˆ์Šต๋‹ˆ๋‹ค. ๋”์šฑ์ด end-to-end ์Œ์„ฑ ์ธ์‹์ด๋‚˜ ๊ธฐ๊ณ„ ๋ฒˆ์—ญ๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ, TTS์˜ ์ถœ๋ ฅ๊ฐ’์€ ์—ฐ์†์ ์ด๊ณ  ์ผ๋ฐ˜์ ์œผ๋กœ ์ž…๋ ฅ๋œ ๊ฐ’๋ณด๋‹ค ๊ธธ์—ˆ์Šต๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” attention์„ ํฌํ•จํ•œ sequence-to-sequence (seq2seq) ๊ธฐ๋ฐ˜์˜ end-to-end TTS ์ƒ์„ฑ ๋ชจ๋ธ์ธ 'Tacotron'์„ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. Tacotron์€ ๋ฌธ์ž(character)๋ฅผ ์ž…๋ ฅ ๋ฐ›๊ณ , linear-spectrogram์„ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. 

๋ฌธ์ž(character)
  ์šฐ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•˜๋Š” ์•ŒํŒŒ๋ฒณ, ํ•œ๊ธ€๊ณผ ๊ฐ™์€ ๊ธ€์ž๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.
์Œ์†Œ(phoneme)
  ์†Œ๋ฆฌ๋ฅผ ๋‚ด๋Š” ์–ธ์–ด์˜ ๋‚ฑ๋ง์„ ๊ตฌ๋ถ„ํ•˜๋Š” ์ด๋ก ์ ์ธ ๋‚ฑ๋‚ฑ์˜ ์†Œ๋ฆฌ๋ฅผ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์Œ์†Œ๊ฐ€ ์•„๋‹Œ ๋ฌธ์ž๋ฅผ ์‚ฌ์šฉํ•˜์ง€๋งŒ, ์ตœ๊ทผ ๋Œ€๋ถ€๋ถ„์˜ TTS ์—ฐ๊ตฌ์—์„œ๋Š” ์Œ์†Œ๋ฅผ ์ž…๋ ฅ๊ฐ’์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

2. Tacotron

[๊ทธ๋ฆผ 1] Tacotron ์ „์ฒด ์•„ํ‚คํ…์ณ

  Tacotron์˜ ๋ฐฑ๋ณธ ๋ชจ๋ธ์€ attention์„ ํฌํ•จํ•œ seq2seq ๋ชจ๋ธ(์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ)์ž…๋‹ˆ๋‹ค. ๊ทธ๋ฆผ 1์—์„œ ๋ณด์ด๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, Tacotron์€ ์ธ์ฝ”๋”์™€ attention ๊ธฐ๋ฐ˜์˜ ๋””์ฝ”๋”, post processing ๋„คํŠธ์›Œํฌ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค. ๊ฐœ๋žต์ ์œผ๋กœ ๋ณด๋ฉด, Tacotron์€ ๋ฌธ์ž๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ linear-spectrogram์„ ์ƒ์„ฑํ•˜๊ณ , Grifin-Lim์„ ํ†ตํ•ด waveform์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

2.1 CBHG Module

[๊ทธ๋ฆผ 2] CBHG ๋ชจ๋“ˆ

  Tacotron์˜ contribution ๊ฐ€์šด๋ฐ ํ•˜๋‚˜๋Š” [J. Lee et al., 2017]์—์„œ ์ œ์•ˆํ•œ ์ธ์ฝ”๋” ์•„ํ‚คํ…์ณ๋ฅผ ์ˆ˜์ •ํ•œ CBHG ๋ชจ๋“ˆ์ž…๋‹ˆ๋‹ค. Tacotron์—์„œ ์ˆ˜์ •ํ•œ ๋ถ€๋ถ„์€ ํ•ด๋‹น ๋ชจ๋“ˆ์ด ์ผ๋ฐ˜ํ™”(generalization) ์—ญ๋Ÿ‰์„ ํ–ฅ์ƒํ•˜๋Š”๋ฐ ๋„์›€์„ ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆผ 2์ฒ˜๋Ÿผ CBHG ๋ชจ๋“ˆ์€ 1D convolutional bank์™€ highway ๋„คํŠธ์›Œํฌ, Bidirectional GRU๋กœ ์ด๋ฃจ์–ด์กŒ์Šต๋‹ˆ๋‹ค.

1) Conv1D bank: The input sequence is convolved with $ K $ sets of 1-D convolutional filters, where the $ k $-th set contains $ C_k $ filters of width $ k $. Using filters of varying widths explicitly models local and contextual information, which proved very effective. The outputs of the convolution bank are stacked together.

2) Max-pooling: Max-pooling along the time axis increases local invariance. A stride of 1 is used to preserve the time resolution of the input.

3) Residual connection: The pooled sequence is passed through further 1-D convolutions, and the original input sequence is added back via a residual connection.

4) Highway network: The sequence is fed into a 4-layer highway network to extract high-level features.

5) Bidirectional RNN: A bidirectional GRU extracts sequential features from both the forward and backward context.
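
As a rough illustration, steps 1) through 3) above can be sketched in NumPy with random weights. The highway network and bidirectional GRU of steps 4) and 5) are omitted, and all dimensions are illustrative toy values, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    # x: (T, C_in), w: (k, C_in, C_out); 'same' padding along time
    k = w.shape[0]
    pad = (k - 1) // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))
    out = np.empty((x.shape[0], w.shape[2]))
    for t in range(x.shape[0]):
        out[t] = np.einsum('kc,kco->o', xp[t:t + k], w)
    return out

def cbhg_front(x, K=8, C=16):
    # 1) conv bank: filter sets of width 1..K, outputs stacked on channels
    bank = np.concatenate(
        [np.maximum(conv1d(x, rng.normal(scale=0.1, size=(k, x.shape[1], C))), 0)
         for k in range(1, K + 1)], axis=1)           # (T, K*C)
    # 2) max-pooling over time, width 2, stride 1 (time resolution kept)
    padded = np.pad(bank, ((0, 1), (0, 0)), constant_values=-np.inf)
    pooled = np.maximum(padded[:-1], padded[1:])      # (T, K*C)
    # 3) conv projection back to the input width + residual connection
    proj = conv1d(pooled, rng.normal(scale=0.1, size=(3, K * C, x.shape[1])))
    return proj + x
    # 4)/5) a highway network and a bidirectional GRU would follow here

x = rng.normal(size=(20, 8))       # (time, channels), e.g. pre-net output
y = cbhg_front(x)
print(y.shape)                     # (20, 8): time resolution preserved
```

Note how the stride-1 pooling and 'same' convolutions keep the sequence length unchanged, which is what lets the residual connection add the input back directly.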

2.2 Encoder

  ์ธ์ฝ”๋”์˜ ๋ชฉํ‘œ๋Š” ์ž…๋ ฅ ํ…์ŠคํŠธ๋กœ๋ถ€ํ„ฐ ๊ฐ•๊ฑดํ•œ sequential representation์„ ์ถ”์ถœํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

1) Input sequence: The encoder input is a character sequence in which each character is represented as a one-hot vector and embedded into a continuous vector.

2) Pre-net: A pre-net applies a non-linear transformation to the embeddings. The pre-net includes dropout and acts as a bottleneck, which helps convergence and improves generalization.

3) CBHG module: The CBHG module transforms the pre-net output into the encoder's final representation. The authors found that a CBHG-based encoder not only reduces overfitting but also produces fewer mispronunciations than a standard RNN-based encoder.
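
The first two encoder stages can be sketched as follows; the vocabulary, layer sizes, and weights are illustrative toy values, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) characters -> ids -> continuous embeddings (via an embedding table,
#    equivalent to multiplying one-hot vectors by the table)
vocab = sorted(set("hello world"))
char2id = {c: i for i, c in enumerate(vocab)}
E = rng.normal(scale=0.1, size=(len(vocab), 32))    # embedding table
ids = np.array([char2id[c] for c in "hello"])
emb = E[ids]                                        # (5, 32)

# 2) pre-net: a bottleneck of FC + ReLU + dropout layers
def prenet(x, sizes=(64, 32), p_drop=0.5, train=True):
    for n in sizes:
        W = rng.normal(scale=0.1, size=(x.shape[1], n))
        x = np.maximum(x @ W, 0)                    # FC + ReLU
        if train:                                   # inverted dropout
            x = x * (rng.random(x.shape) > p_drop) / (1 - p_drop)
    return x

h = prenet(emb)
print(h.shape)   # (5, 32) -> this would then be fed into the CBHG module
```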

2.3 Decoder

1) Attention: The paper uses a content-based tanh attention decoder, in which a stateful recurrent layer produces the attention query at each decoder time step.

Stateful recurrent model
  A recurrent model that reuses the state obtained after processing the samples in one batch as the initial state when processing the samples in the next batch. This reduces computational cost and makes it possible to handle longer sequences.

2) Decoder RNN: The attention context and the attention RNN output are concatenated and used as input to the decoder RNN, a stack of GRU layers with vertical residual connections. The authors found that the residual connections speed up convergence.
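
A minimal sketch of a GRU stack with vertical residual connections, using a toy GRU cell with random weights (dimensions are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_cell(x, h, Wz, Wr, Wh):
    # minimal GRU cell; each W* maps concat([x, h]) -> hidden
    xh = np.concatenate([x, h])
    z = 1 / (1 + np.exp(-(Wz @ xh)))              # update gate
    r = 1 / (1 + np.exp(-(Wr @ xh)))              # reset gate
    hc = np.tanh(Wh @ np.concatenate([x, r * h])) # candidate state
    return (1 - z) * h + z * hc

def residual_gru_stack(xs, n_layers=2, d=16):
    # vertical residual connections: each layer adds its own input
    # to its output before passing the sequence up the stack
    for _ in range(n_layers):
        Wz, Wr, Wh = (rng.normal(scale=0.1, size=(d, 2 * d)) for _ in range(3))
        h = np.zeros(d)
        outs = []
        for x in xs:
            h = gru_cell(x, h, Wz, Wr, Wh)
            outs.append(x + h)                    # residual connection
        xs = outs
    return np.stack(xs)

seq = rng.normal(size=(10, 16))                   # (decoder steps, hidden)
out = residual_gru_stack(list(seq))
print(out.shape)                                  # (10, 16)
```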

3) Decoder target: Although the model could predict the linear-spectrogram directly, that representation is highly redundant for the purpose of learning the alignment between the speech signal and the text, so a different target is used for seq2seq decoding than for waveform synthesis. The seq2seq target only needs to be intelligible enough for a fixed or trainable inversion process, to provide prosody information, and to be highly compressed. The paper therefore uses an 80-band mel-spectrogram as the decoder target.

Mel-spectrogram
  Motivated by the fact that humans perceive low frequencies more sensitively than high frequencies. A mel filter bank analyzes the low-frequency bands finely and the remaining bands more coarsely. The result of applying the mel filter bank is called a mel-spectrogram; converting it to a log scale yields a log-mel spectrogram.

  ๋””์ฝ”๋”์˜ ํƒ€๊ฒŸ์„ ์˜ˆ์ธกํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ์ˆœํ•œ FC layer๊ฐ€ ์‚ฌ์šฉ๋˜๋Š”๋ฐ, ์—ฌ๊ธฐ์„œ ์ €์ž๋“ค์€ ๊ฒน์น˜์ง€ ์•Š๋Š”(non-overlapping) ์—ฌ๋Ÿฌ ํ”„๋ ˆ์ž„์„ ํ•œ๋ฒˆ์— ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ด ๋งค์šฐ ์ค‘์š”ํ•œ ํŠธ๋ฆญ์ž„์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•œ๋ฒˆ์— $ r $๊ฐœ์˜ ํ”„๋ ˆ์ž„์„ ์˜ˆ์ธกํ•˜๋ฉด ์ด ๋””์ฝ”๋” ๋‹จ๊ณ„(decoder step)๋ฅผ $ r $๊ฐœ๋งŒํผ ์ค„์ผ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๋ชจ๋ธ์˜ ํฌ๊ธฐ์™€ ํ•™์Šต ์‹œ๊ฐ„, ์ถ”๋ก  ์‹œ๊ฐ„์„ ์ค„์—ฌ์ค๋‹ˆ๋‹ค. ๋˜ํ•œ, attention์œผ๋กœ๋ถ€ํ„ฐ ํ•™์Šต๋œ alignment๊ฐ€ ๋” ๋น ๋ฅด๊ณ  ์•ˆ์ •์ ์ด๋‹ค๋Š” ๊ฒƒ์„ ์ธก์ •ํ•จ์œผ๋กœ์จ ํ•ด๋‹น ํŠธ๋ฆญ์ด ์ˆ˜๋ ด ์†๋„๋ฅผ ์ƒ๋‹นํžˆ ๊ฐœ์„ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค. ์ €์ž๋“ค์€ ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๊ฐ€ ์ด์›ƒํ•œ ํ”„๋ ˆ์ž„๋“ค์ด ์„œ๋กœ ์—ฐ๊ด€๋˜์–ด ์žˆ์œผ๋ฉฐ ๊ฐ ๊ธ€์ž๊ฐ€ ์ผ๋ฐ˜์ ์œผ๋กœ ์—ฌ๋Ÿฌ ํ”„๋ ˆ์ž„๋“ค๋กœ ๊ตฌ์„ฑ๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ์ƒ๊ฐํ–ˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ํ”„๋ ˆ์ž„๋“ค์„ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์€ ํ•™์Šต ๋‹จ๊ณ„์—์„œ attention์ด ๋” ๋น ๋ฅด๊ฒŒ ์•ž์œผ๋กœ ๋‚˜์•„๊ฐ€๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

  ๋””์ฝ”๋”์˜ ์ฒซ ๋‹จ๊ณ„๋Š” <GO> ํ”„๋ ˆ์ž„์ด๋ผ ๋ฌ˜์‚ฌ๋˜๋Š” ๋ชจ๋‘ 0์ธ ํ”„๋ ˆ์ž„์— ์˜ํ•ด condition๋ฉ๋‹ˆ๋‹ค. ์ถ”๋ก  ๋‹จ๊ณ„์—์„œ ๋””์ฝ”๋”์˜ $ t + 1 $ ๋ฒˆ์งธ ๋‹จ๊ณ„์˜ ์ž…๋ ฅ์œผ๋กœ ์ง์ „ ๋‹จ๊ณ„์ธ $ t $ ๋ฒˆ์งธ ๋‹จ๊ณ„์˜ $ r $๊ฐœ์˜ ์˜ˆ์ธก ๊ฐ€์šด๋ฐ ๋งˆ์ง€๋ง‰ ํ”„๋ ˆ์ž„์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰ ํ”„๋ ˆ์ž„๋งŒ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ ad-hoc choice์ด๋ฉฐ, ์˜ˆ์ธก๋œ $ r $๊ฐœ์˜ ํ”„๋ ˆ์ž„ ๋ชจ๋‘ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ํ•™์Šต ๋‹จ๊ณ„์—์„œ๋Š” ๋””์ฝ”๋”์— ํ•ญ์ƒ ๋ชจ๋“  $ r $ ๋ฒˆ์งธ ์ฐธ๊ฐ’์ธ ํ”„๋ ˆ์ž„์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ํ•™์Šต ๋ฐฉ์‹์„ 'Teacher forcing'์ด๋ผ ๋ถ€๋ฆ…๋‹ˆ๋‹ค. ์ž…๋ ฅ๋œ ํ”„๋ ˆ์ž„์€ ์ธ์ฝ”๋”์™€ ๋™์ผํ•˜๊ฒŒ pre-net์„ ํ†ต๊ณผํ•ฉ๋‹ˆ๋‹ค.

Teacher forcing
  Teacher forcing is commonly used in seq2seq models. Early in training, the generated frames or words are often wrong, and using these wrong predictions to predict the next step causes the following predictions to be wrong as well, which slows down training. Teacher forcing addresses this problem by feeding the ground-truth value as input when predicting the current step during training. Beyond seq2seq models, models that predict a feature and then consume it are also often given the ground-truth feature as input in the same way.

2.4 Post-Processing Net and Waveform Synthesis

  ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” waveform์„ ์ƒ์„ฑํ•˜๋Š” synthesizer๋กœ Griffin-Lim ์•Œ๊ณ ๋ฆฌ์ฆ˜[D. Griffin and J. Lim, 1984] ์„ ์‚ฌ์šฉํ•˜๊ธฐ ๋•Œ๋ฌธ์—, decoder์—์„œ ์˜ˆ์ธกํ•œ Mel-spectrogram์„ Linear-spectrogram์œผ๋กœ ๋ณ€ํ™˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด post-processing net์€ ์„ ํ˜• ์ฃผํŒŒ์ˆ˜ ๊ทœ๋ชจ๋กœ ์ƒ˜ํ”Œ๋ง๋œ spectral magnitude (์‰ฝ๊ฒŒ ๋งํ•˜๋ฉด linear-spectrogram)์„ ์˜ˆ์ธกํ•˜๋„๋ก ํ•™์Šต๋ฉ๋‹ˆ๋‹ค. ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ๋งŒ ์ž‘๋™ํ•˜๋Š” seq2seq ๋ชจ๋ธ๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ, post-processing net์€ ์ด๋ฏธ ์ „์ฒด ๊ธธ์ด์˜ ๋””์ฝ”๋”ฉ๋œ ์‹œํ€€์Šค๋ฅผ ์ž…๋ ฅ๋ฐ›์œผ๋ฏ€๋กœ ๊ฐ ํ”„๋ ˆ์ž„์— ๋Œ€ํ•˜์—ฌ ์˜ˆ์ธก ์˜ค์ฐจ๋ฅผ ์ •์ •ํ•˜๊ธฐ ์œ„ํ•ด ์–‘๋ฐฉํ–ฅ์˜ ์ •๋ณด ๋ชจ๋‘๋ฅผ ์ด์šฉํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” post-processing net์œผ๋กœ CBHG ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.


3. Experiments

3.1 Ablation Analysis


[๊ทธ๋ฆผ 3] Attention alignment ๊ฒฐ๊ณผ

  ๋ช‡๋ช‡ ablation ์‹คํ—˜์„ ํ†ตํ•ด Tacotron์˜ ์ค‘์š”ํ•œ ์š”์†Œ์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ๋•๊ณ ์ž ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋น„๊ต๋ฅผ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค.

1) Vanilla seq2seq: A model using residual RNNs for both the encoder and the decoder. Figure 3 (a) shows that the vanilla seq2seq model learns a poor attention alignment: the attention tends to get stuck on many frames before moving forward, which harms the intelligibility of the synthesized speech. In contrast, Tacotron learns a much cleaner and smoother alignment, as shown in Figure 3 (c).

2) GRU ์ธ์ฝ”๋”: Tacotron์˜ CBHG ๊ธฐ๋ฐ˜์˜ ์ธ์ฝ”๋”๋ฅผ residual GRU ์ธ์ฝ”๋”๋กœ ๋Œ€์ฒดํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค. ๊ทธ๋ฆผ 3 (b)์™€ (c)์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด GRU ์ธ์ฝ”๋”์˜ alingment ๊ฒฐ๊ณผ๊ฐ€ ๋” ๋ถˆ์•ˆ์ •(noisy)ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ถˆ์•ˆ์ •ํ•œ alignment๋Š” ์ข…์ข… ์ž˜๋ชป๋œ ๋ฐœ์Œ์œผ๋กœ ์ด์–ด์ง€๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ CBHG ๊ธฐ๋ฐ˜์˜ ์ธ์ฝ”๋”๊ฐ€ ์˜ค๋ฒ„ํ”ผํŒ…์„ ์ค„์ด๊ณ  ๊ธธ๊ณ  ๋ณต์žกํ•œ ๊ตฌ์ ˆ์„ ๋” ์ž˜ ์ผ๋ฐ˜ํ™”ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

[๊ทธ๋ฆผ 4] ์˜ˆ์ธก๋œ ์ŠคํŽ™๋“œ๋กœ๊ทธ๋žจ ๊ฒฐ๊ณผ

3) Post-processing net: As Figure 4 shows, using the post-processing net yields richer contextual and harmonic information (in bins 100 to 400) and better-resolved high-frequency formant structure, which reduces synthesis artifacts.

3.2 Mean Opinion Score Tests

[ํ‘œ 1] MOS ํ‰๊ฐ€ ๊ฒฐ๊ณผ

  MOS ํ‰๊ฐ€ ๊ฒฐ๊ณผ, Tacotron์€ ๋น„๊ต ๋ชจ๋ธ์„ ๋›ฐ์–ด๋„˜๋Š” 3.82์˜ ์ ์ˆ˜๋ฅผ ๋‹ฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. Griffin-Lim ํ•ฉ์„ฑ์„ ๋„์ž…ํ•จ์œผ๋กœ์จ ๊ฐ•๋ ฅํ•œ ๋ฒ ์ด์Šค๋ผ์ธ๊ณผ ์ดํ‹ฐํŒฉํŠธ๋“ค์ด ์ฃผ์–ด์ง€๋ฉฐ, ์ด๋Š” ๋งค์šฐ ๊ฐ•๋ ฅํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

References

  1. J. Lee et al., "Fully Character-Level Neural Machine Translation without Explicit Segmentation," in TACL, 2017.
  2. D. Griffin and J. Lim, "Signal Estimation from Modified Short-Time Fourier Transform," in TASSP, 1984.

'์ธ๊ณต์ง€๋Šฅ ๋…ผ๋ฌธ ์š”์•ฝ > Text-to-Speech' ์นดํ…Œ๊ณ ๋ฆฌ์˜ ๋‹ค๋ฅธ ๊ธ€

WaveNet: A Generative Model for Raw Audio ์ •๋ฆฌ  (0) 2023.06.18