AI Paper Summaries 4

Tacotron: Towards End-to-End Speech Synthesis (Summary)

📜 Y. Wang et al., "Tacotron: Towards End-to-End Speech Synthesis," in Interspeech, 2017. Three-line summary: Tacotron replaces the complex multi-stage pipeline of modern TTS systems with a single end-to-end model. Training on paired text and audio makes it possible to learn from more data and a wider range of features. Generating speech one Mel-spectrogram frame at a time, rather than one audio sample at a time, allows faster training and inference. Abstract: Text-to-Speech (TTS) systems typically consist of a frontend for text analysis, an acoustic model, and an audio synthesis module. Building each of these components..
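
The speed argument in the summary above can be made concrete with a little arithmetic. The sketch below counts how many outputs a model must generate for one second of speech at the sample level versus the frame level. The 24 kHz sample rate and 12.5 ms frame shift are illustrative assumptions, not values taken from the excerpt, and both function names are hypothetical.

```python
def sample_level_steps(duration_s, sample_rate_hz):
    """Outputs needed when the model emits one audio sample per step."""
    return round(duration_s * sample_rate_hz)


def frame_level_steps(duration_s, frame_shift_ms):
    """Outputs needed when the model emits one spectrogram frame per step."""
    return round(duration_s * 1000.0 / frame_shift_ms)


# Assumed values for illustration only.
SAMPLE_RATE_HZ = 24_000
FRAME_SHIFT_MS = 12.5

samples = sample_level_steps(1.0, SAMPLE_RATE_HZ)
frames = frame_level_steps(1.0, FRAME_SHIFT_MS)

# Each frame stands in for (samples // frames) raw audio samples.
print(samples, frames, samples // frames)
```

Under these assumptions a frame-level model produces a few hundred times fewer outputs per second of audio, which is the intuition behind the faster training and inference claim.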

WaveNet: A Generative Model for Raw Audio (Notes)

📜 A. Oord et al., "WaveNet: A Generative Model for Raw Audio," in arXiv, 2016. One-line summary: WaveNet is a deep learning model that generates audio waveforms using dilated causal convolutions. Abstract: This paper introduces WaveNet, a neural network that generates audio waveforms. WaveNet is a probabilistic, autoregressive model that predicts the distribution of each audio sample conditioned on all previous samples. WaveNet can capture the characteristics of different speakers equally well and switch between their voices by conditioning on the speaker. 1. Introduct..
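
To illustrate what "dilated causal convolution" means, here is a minimal pure-Python sketch. It is not the WaveNet implementation: the kernel size of 2, the dilations 1, 2, 4, and the all-ones weights are assumptions chosen so the result is easy to check by hand. "Causal" means the output at time t only looks at the current and earlier inputs, and "dilated" means the filter taps are spaced apart.

```python
def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution over a list of floats.

    weights[0] multiplies the current sample and weights[k] multiplies the
    sample k * dilation steps in the past. Positions before the start of
    the signal are treated as zeros, so no future sample is ever used.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence a single output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Push a unit impulse through three stacked layers (assumed dilations 1, 2, 4).
signal = [1.0] + [0.0] * 9
for d in (1, 2, 4):
    signal = causal_dilated_conv(signal, weights=[1.0, 1.0], dilation=d)

print(signal)
print(receptive_field(kernel_size=2, dilations=[1, 2, 4]))
```

The impulse spreads over exactly as many time steps as the computed receptive field, which shows how doubling the dilation lets a few layers cover a long stretch of past audio.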

Very Deep Convolutional Networks for Large-Scale Image Recognition (Summary)

📜 K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR, 2014. Two-line summary: The authors improve performance by increasing network depth without departing from the classical convolutional architecture. To make the extra depth feasible, they use very small $ 3\times3 $ convolutional filters. Abstract: This work investigates how the depth of a convolutional network (ConvNet) affects accuracy in large-scale image recognition. The main contribution of this work is very small $ 3\times3 $ conv. filte..
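
One way to see why small $ 3\times3 $ filters make depth affordable is to compare a stack of them with a single larger filter covering the same area. The sketch below assumes stride 1, equal input and output channel counts, and no biases. The channel count of 64 and both function names are hypothetical choices for illustration.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Side length of the input region seen by one output pixel
    after num_layers stride-1 convolutions with the same kernel size."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count for num_layers conv layers with `channels` input and
    output channels each (biases ignored)."""
    return num_layers * kernel_size * kernel_size * channels * channels


CHANNELS = 64  # assumed value for illustration

three_small = conv_weights(3, CHANNELS, num_layers=3)
one_large = conv_weights(7, CHANNELS)

print(stacked_receptive_field(3, 3))  # same coverage as one 7x7 filter
print(three_small, one_large)
```

Three stacked $ 3\times3 $ layers see the same $ 7\times7 $ region as one large filter while using fewer weights and passing through more nonlinearities along the way.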

Going Deeper with Convolutions (Summary)

📜 C. Szegedy et al., "Going Deeper with Convolutions," in CVPR, 2014. Three-line summary: Demand has grown for using computing resources efficiently so that models run well on mobile and embedded devices. The Inception module was introduced for two purposes: cutting computation through dimensionality reduction, and adding nonlinearity. With the Inception module, the authors built GoogLeNet, a deeper and wider network with better performance at only a small increase in computing cost. Abstract: In this paper, 'Inception', which achieved strong results in classification and detection at the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014, is the name given to a deep convolution neu..
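
The computation saving from dimensionality reduction can be sketched by counting multiply-accumulate operations for a large convolution with and without a preceding $ 1\times1 $ convolution. The feature-map size and channel counts below are assumptions for illustration, and `conv_macs` is a hypothetical helper. The sketch covers only the compute side, not the added nonlinearity.

```python
def conv_macs(height, width, kernel_size, in_channels, out_channels):
    """Multiply-accumulate count for a stride-1, same-padded convolution."""
    return height * width * kernel_size * kernel_size * in_channels * out_channels


# Assumed shapes for illustration only.
H, W = 28, 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# A 5x5 convolution applied directly to all input channels.
direct = conv_macs(H, W, 5, C_IN, C_OUT)

# A 1x1 convolution shrinks the channels first, then the 5x5 runs on fewer.
reduced = conv_macs(H, W, 1, C_IN, C_REDUCED) + conv_macs(H, W, 5, C_REDUCED, C_OUT)

print(direct, reduced, round(direct / reduced, 2))
```

With these shapes the reduced path needs roughly a tenth of the operations, which is how the module can grow deeper and wider at a modest cost.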