TTS 2

Tacotron: Towards End-to-End Speech Synthesis (Summary)

📜 Y. Wang et al., "Tacotron: Towards End-to-End Speech Synthesis," in Interspeech, 2017
Three-line paper summary
The complex multi-stage structure of modern TTS models is replaced with an end-to-end structure.
Training on text-audio pairs makes it possible to learn from more data and to learn more varied features.
Instead of generating speech one audio sample at a time, the model generates it one Mel-spectrogram frame at a time, which allows faster training and inference.
Abstract
A Text-to-Speech (TTS) system typically consists of a frontend for text analysis, an acoustic model, and an audio synthesis module. Building each component..
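To make the frame-level versus sample-level point in the summary concrete, here is a small arithmetic sketch that counts how many generation steps each approach needs for the same clip. The 24 kHz sample rate, the 12.5 ms frame shift, and the helper name `generation_steps` are illustrative assumptions, not values taken from the excerpt.

```python
def generation_steps(duration_s, sample_rate_hz, frame_shift_ms):
    """Return (sample-level steps, frame-level steps) for one clip.

    All arguments are illustrative assumptions, not settings quoted
    from the post.
    """
    n_samples = int(round(duration_s * sample_rate_hz))
    # Number of audio samples covered by one spectrogram frame shift.
    hop = int(round(sample_rate_hz * frame_shift_ms / 1000))
    # Ceiling division: a partial final frame still costs one step.
    n_frames = -(-n_samples // hop)
    return n_samples, n_frames


samples, frames = generation_steps(
    duration_s=1.0, sample_rate_hz=24000, frame_shift_ms=12.5
)
print(samples, frames, samples // frames)  # 24000 80 300
```

Under these assumed settings, one second of audio takes 24,000 sample-level steps but only 80 frame-level steps, which is the intuition behind the faster training and inference claim. The frames still have to be converted to a waveform afterwards, and this count leaves that cost out.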

WaveNet: A Generative Model for Raw Audio (Notes)

📜 A. Oord et al., "WaveNet: A Generative Model for Raw Audio," in arXiv, 2016.
One-line paper summary
WaveNet is a deep learning model that generates audio waveforms using dilated causal convolutions.
Abstract
This paper introduces "WaveNet", a neural network that generates audio waveforms. WaveNet is a probabilistic, auto-regressive model that predicts the distribution of each audio sample conditioned on all previous audio samples. WaveNet captures the characteristics of each speaker with similar fidelity and, by conditioning on them, can switch to the voice of a different speaker.
1. Introduct..
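The dilated causal convolution mentioned in the one-line summary can be sketched in a few lines of NumPy. This is a minimal illustration, not the WaveNet implementation: the function names, the kernel size of 2, and the dilation schedule 1, 2, 4, 8 are assumptions chosen for the example, and the sketch leaves out gating, residual connections, and the output distribution.

```python
import numpy as np


def causal_dilated_conv1d(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation].

    Inputs before t = 0 are treated as zero, so y[t] never depends
    on future samples (causality).
    """
    x = np.asarray(x, dtype=float)
    pad = (len(weights) - 1) * dilation
    padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for k, w in enumerate(weights):
        # Tap k looks back k * dilation time steps.
        start = pad - k * dilation
        y += w * padded[start:start + len(x)]
    return y


def receptive_field(kernel_size, dilations):
    """Number of input steps one output of the stacked layers can see."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Push a unit impulse through four stacked layers (hypothetical weights).
dilations = [1, 2, 4, 8]
signal = np.zeros(32)
signal[0] = 1.0
for d in dilations:
    signal = causal_dilated_conv1d(signal, weights=[1.0, 1.0], dilation=d)

print(receptive_field(2, dilations))  # 16
print(np.count_nonzero(signal))       # 16
```

Doubling the dilation at each layer makes the receptive field grow exponentially with depth while the number of weights grows only linearly. That is why this kind of stack can cover a long audio history at a modest cost.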