
Very Deep Convolutional Networks for Large-Scale Image Recognition (Summary)

๐Ÿ“œ K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR, 2014 ๋…ผ๋ฌธ 2์ค„ ์š”์•ฝ ๊ณ ์ „์ ์ธ Convolution ์•„ํ‚คํ…์ฒ˜์—์„œ ๋ฒ—์–ด๋‚˜์ง€ ์•Š๊ณ  ๋„คํŠธ์›Œํฌ์˜ ๊นŠ์ด๋ฅผ ์ฆ๊ฐ€ํ•จ์œผ๋กœ์จ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค. ๋„คํŠธ์›Œํฌ์˜ ๊นŠ์ด๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ธฐ ์œ„ํ•ด ๋งค์šฐ ์ž‘์€ $ 3\times3 $ ํฌ๊ธฐ์˜ Convolutional filter๋ฅผ ์‚ฌ์šฉํ•˜์˜€๋‹ค. Abstract ๋ณธ ์—ฐ๊ตฌ๋Š” ๋Œ€๊ทœ๋ชจ ์ด๋ฏธ์ง€ ์ธ์‹ ๋ฌธ์ œ์—์„œ convolution network (ConvNet)์˜ ๊นŠ์ด๊ฐ€ ์ •ํ™•๋„(accuracy)์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ์—ฐ๊ตฌ์˜ ์ฃผ๋œ ์„ฑ๊ณผ๋Š” ๋งค์šฐ ์ž‘์€ $ 3\times3 $ Conv. ํ•„ํ„ฐ(filte..

ํ•™์Šต๊ณผ ๊ด€๋ จ๋œ ๊ธฐ์ˆ ๋“ค

๐Ÿ’ก 'Deep Learning from Scratch'๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑ 1. ๊ฐ€์ค‘์น˜์˜ ์ดˆ๊นƒ๊ฐ’(Initial value) ์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต์—์„œ ํŠนํžˆ๋‚˜ ์ค‘์š”ํ•œ ๊ฒƒ์€ ๊ฐ€์ค‘์น˜์˜ ์ดˆ๊นƒ๊ฐ’์ž…๋‹ˆ๋‹ค. ๊ฐ€์ค‘์น˜์˜ ์ดˆ๊นƒ๊ฐ’์„ ๋ฌด์—‡์œผ๋กœ ์„ค์ •ํ•˜๋Š๋ƒ์— ๋”ฐ๋ผ ์‹ ๊ฒฝ๋ง ํ•™์Šต์˜ ์„ฑํŒจ๊ฐ€ ๊ฐˆ๋ฆฌ๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. 1.1 ์ดˆ๊นƒ๊ฐ’์„ 0์œผ๋กœ ์„ค์ • ๊ฐ€์ค‘์น˜์˜ ์ดˆ๊นƒ๊ฐ’์„ ๋ชจ๋‘ 0์œผ๋กœ ์„ค์ •ํ•˜๋ฉด, ์˜ฌ๋ฐ”๋ฅธ ํ•™์Šต์ด ์ด๋ฃจ์–ด์ง€์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ฐ€์ค‘์น˜๊ฐ€ ๋ชจ๋‘ 0์ผ ๊ฒฝ์šฐ, ์ˆœ์ „ํŒŒ์‹œ ๊ฐ™์€ ๊ฐ’๋“ค์ด ๋‹ค์Œ์œผ๋กœ ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์˜ค์ฐจ์—ญ์ „ํŒŒ๋ฒ•(back-propagation)์—์„œ ๋ชจ๋“  ๊ฐ€์ค‘์น˜์˜ ๊ฐ’์ด ๋™์ผํ•˜๊ฒŒ ๊ฐฑ์‹ ๋˜๋„๋ก ๋งŒ๋“ญ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ๊ฐ€์ค‘์น˜๊ฐ€ ๊ณ ๋ฅด๊ฒŒ ๋˜์–ด๋ฒ„๋ฆฌ๋Š” ์ƒํ™ฉ์„ ๋ง‰๊ธฐ ์œ„ํ•ด ์ดˆ๊นƒ๊ฐ’์€ ๋ฌด์ž‘์œ„๋กœ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 1.2 ์€๋‹‰์ธต(Hidden layer)์˜ ํ™œ์„ฑํ™”๊ฐ’ ๋ถ„ํฌ 1.2์ ˆ์—์„œ๋Š” ๊ฐ€์ค‘..

์˜ตํ‹ฐ๋งˆ์ด์ €(Optimizer) (2/2)

๐Ÿ’ก 'Deep Learning from Scratch'์™€ 'CS231N'์„ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑ (๊ฐ ์ ˆ์˜ ๋„˜๋ฒ„๋ง์€ ์ง€๋‚œ ๊ฒŒ์‹œ๋ฌผ์—์„œ ์ด์–ด์ง‘๋‹ˆ๋‹ค) 2. ์˜ตํ‹ฐ๋งˆ์ด์ € ์ง€๋‚œ ๊ฒŒ์‹œ๋ฌผ์—์„œ๋Š” SGD์˜ ๋ฌธ์ œ์ ์œผ๋กœ ์ง€์ ๋˜์—ˆ๋˜ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๊ฐ€์šด๋ฐ ์Šคํ… ๋ฐฉํ–ฅ์„ ๊ฐœ์„ ํ•œ ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋Œ€ํ•˜์—ฌ ์•Œ์•„๋ดค์Šต๋‹ˆ๋‹ค. ์˜ค๋ฒ„์ŠˆํŒ…(overshooting)์œผ๋กœ ์•ˆ์žฅ์ (saddle point)๊ณผ ์ง€์—ญ ์ตœ์†Ÿ๊ฐ’(local minima)์„ ํ†ต๊ณผํ•˜๋ฉฐ ์ „์—ญ ์ตœ์†Ÿ๊ฐ’(global minimum)์„ ์ฐพ๋˜ SGD+Momentum, NAG๋ฅผ ์ง์ ‘ ๊ตฌํ˜„ํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ ๊ฒŒ์‹œ๋ฌผ์—์„œ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๊ฐฑ์‹ ๋œ ์ •๋„์— ๋”ฐ๋ผ ์Šคํ… ์‚ฌ์ด์ฆˆ๋ฅผ ์กฐ์ •ํ•˜๋ฉฐ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋Š” AdaGrad๋ฅ˜์˜ ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 2.5 Adaptive Gradient (AdaGrad) AdaGrad๋Š” ..

์˜ตํ‹ฐ๋งˆ์ด์ €(Optimizer) (1/2)

๐Ÿ’ก 'Deep Learning from Scratch'์™€ 'CS231N'์„ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑ ์‹ ๊ฒฝ๋ง(neural network)์˜ ํ•™์Šต ๋ชฉ์ ์€ ์†์‹ค ํ•จ์ˆ˜(loss function)์˜ ๊ฐ’์„ ์ตœ๋Œ€ํ•œ ๋‚ฎ์ถ”๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜(parameter)๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด์—ˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ๊ณง ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ์ตœ์ ๊ฐ’์„ ์ฐพ๋Š” ๋ฌธ์ œ์ด๋ฉฐ, ์ด๋ฅผ ์ตœ์ ํ™” ๋ฌธ์ œ(optimization)๋ผ ํ•ฉ๋‹ˆ๋‹ค. ์ตœ์ ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ•™์Šต ๋ฐ์ดํ„ฐ๋“ค์„ ์ด์šฉํ•ด ๊ธฐ์šธ๊ธฐ(gradient)์˜ ๊ฐ’์„ ๊ตฌํ•˜๊ณ , ๊ทธ ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ๋‚˜์•„๊ฐˆ ๋ฐฉํ–ฅ์„ ๊ฒฐ์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฒˆ ๊ฒŒ์‹œ๋ฌผ์—์„œ๋Š” ์ตœ์ ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฐพ๋Š” ๋ฐฉ๋ฒ•์ธ ์˜ตํ‹ฐ๋งˆ์ด์ €์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 1. ๊ธฐ์šธ๊ธฐ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๊ฐ ๋ณ€์ˆ˜๋“ค์— ๋Œ€ํ•œ ํŽธ๋ฏธ๋ถ„์„ ๋™์‹œ์— ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. $$ f(x_0, x_1..

์˜ค์ฐจ์—ญ์ „ํŒŒ(Back-Propagation)

๐Ÿ’ก 'Deep Learning from Scratch'๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑํ•จ ์‹ ๊ฒฝ๋ง(neural network)์˜ ํ•™์Šต์„ ์œ„ํ•ด์„œ๋Š” ๊ฐ€์ค‘์น˜ ๋งค๊ฐœ๋ณ€์ˆ˜(weight parameter)์— ๋Œ€ํ•œ ์†์‹ค ํ•จ์ˆ˜(loss function)์˜ ๊ธฐ์šธ๊ธฐ(gradient)๋ฅผ ๊ตฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ตฌํ•˜๋Š” ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ๋ฐฉ๋ฒ•์€ ์ˆ˜์น˜ ๋ฏธ๋ถ„(numerical differentation)์„ ์ด์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ˆ˜์น˜ ๋ฏธ๋ถ„์€ '๊ทผ์‚ฌ์น˜(approximation)'๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ๋‹จ์ˆœํ•˜๊ณ  ๊ตฌํ˜„ํ•˜๊ธฐ ์‰ฝ์ง€๋งŒ, ๊ณ„์‚ฐ ์‹œ๊ฐ„์ด ์˜ค๋ž˜๊ฑธ๋ฆฐ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ ๊ฒŒ์‹œ๋ฌผ์€ ๊ฐ€์ค‘์น˜ ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ๊ธฐ์šธ๊ธฐ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์ธ '์˜ค์ฐจ์—ญ์ „๋ฒ•'์„ ์„ค๋ช…ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค. 1. ์—ญ์ „ํŒŒ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ง„ํ–‰๋˜๋Š” ์‹ ๊ฒฝ๋ง์ด ์žˆ์„ ๋•Œ, ์‹ ๊ฒฝ๋ง์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ตฌํ•˜๊ธฐ ..

์†์‹ค ํ•จ์ˆ˜(Loss function)

๐Ÿ’ก 'Deep Learning from Scratch'๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑ ์‹ ๊ฒฝ๋ง์—์„œ ํ•™์Šต(train)์ด๋ž€ ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ๊ฐ€์ค‘์น˜ ๋งค๊ฐœ๋ณ€์ˆ˜(weight parameter)์˜ ์ตœ์ ๊ฐ’(optimal value)์„ ์ž๋™์œผ๋กœ ํš๋“ํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฒˆ ๊ฒŒ์‹œ๋ฌผ์—์„œ๋Š” ์‹ ๊ฒฝ๋ง์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ์ง€ํ‘œ, ์†์‹ค ํ•จ์ˆ˜์— ๋Œ€ํ•˜์—ฌ ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. 1. ๋ฐ์ดํ„ฐ์™€ ํ•™์Šต ์‹ ๊ฒฝ๋ง์€ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ ํ•™์Šตํ•œ๋‹ค๋Š” ๊ฒƒ์€ ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ๊ฐ€์ค‘์น˜ ๋งค๊ฐœ๋ณ€์ˆ˜์˜ ๊ฐ’์„ ์ž๋™์œผ๋กœ ๊ฒฐ์ •ํ•œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค. ์ด๋Š” ๋ชจ๋“  ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ˆ˜์ž‘์—…์œผ๋กœ ๊ฒฐ์ •ํ•ด์•ผ ํ•˜๋Š” ์–ด๋ ค์›€์„ ํ•ด๊ฒฐํ•ด์ค๋‹ˆ๋‹ค. 1.1 ๋ฐ์ดํ„ฐ ์ฃผ๋„ ํ•™์Šต ๊ธฐ๊ณ„ ํ•™์Šต(machine learning)์˜ ์ƒ๋ช…์€ ๋ฐ”๋กœ ๋ฐ์ดํ„ฐ์ž…๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์—์„œ ๋‹ต์„ ์ฐพ๊ณ  ๋ฐ์ดํ„ฐ์—์„œ ํŒจํ„ด์„ ๋ฐœ๊ฒฌํ•˜๊ณ  ๋ฐ์ดํ„ฐ๋กœ ์ด์•ผ..

Going Deeper with Convolutions (Summary)

๐Ÿ“œ C. Szegedy et al., "Going Deeper with Convolutions", in CVPR, 2014 ๋…ผ๋ฌธ 3์ค„ ์š”์•ฝ ๋ชจ๋ฐ”์ผ๊ณผ ์ž„๋ฒ ๋””๋“œ ์ƒ์—์„œ ์ž˜ ์ž‘๋™ํ•˜๊ธฐ ์œ„ํ•ด ์ปดํ“จํŒ… ์ž์›์„ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•ด์•ผ ํ•œ๋‹ค๋Š” ์š”๊ตฌ๊ฐ€ ๋†’์•„์กŒ๋‹ค. ์ฐจ์› ์ถ•์†Œ๋ฅผ ํ†ตํ•œ ๊ณ„์‚ฐ์–‘ ๊ฐ์†Œ์™€ ๋น„์„ ํ˜•์„ฑ ์ถ”๊ฐ€ ๋‘ ๊ฐ€์ง€๋ฅผ ๋ชฉ์ ์œผ๋กœ ์ธ์…‰์…˜ ๋ชจ๋“ˆ์„ ๋„์ž…ํ–ˆ๋‹ค. ์ธ์…‰์…˜ ๋ชจ๋“ˆ์„ ํ†ตํ•ด ์ปดํ“จํŒ… ๋น„์šฉ์€ ์ ๊ฒŒ ์ƒ์Šนํ•˜์ง€๋งŒ, ๋” ๊นŠ๊ณ  ๋„“์œผ๋ฉด์„œ ์„ฑ๋Šฅ๋„ ์ข‹์€ GoogLeNet์„ ๊ตฌ์ถ•ํ–ˆ๋‹ค. Abstract ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) 2014์—์„œ ๋ถ„๋ฅ˜์™€ ํƒ์ง€ ๋ฌธ์ œ์—์„œ ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๊ฑฐ๋‘” '์ธ์…‰์…˜(Inception)'์ด๋ผ๋Š” ์ด๋ฆ„์˜ deep convolution neu..

ํ™œ์„ฑํ™” ํ•จ์ˆ˜(Activation function)

๐Ÿ’ก 'Deep Learning from Scratch'๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ์ž‘์„ฑ 1. ํผ์…‰ํŠธ๋ก (perceptron)์—์„œ ์‹ ๊ฒฝ๋ง(neural network)์œผ๋กœ 1.1 ํผ์…‰ํŠธ๋ก  ์•ž์„œ ๊ณต๋ถ€ํ•œ ํผ์…‰ํŠธ๋ก ์€ $ x_1 $๊ณผ $ x_2 $๋ผ๋Š” ๋‘ ์‹ ํ˜ธ๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ $ y $๋ฅผ ์ถœ๋ ฅํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ์ˆ˜์‹์œผ๋กœ ๋‚˜ํƒ€๋‚ด๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค. $$ y=\begin{cases} 0\ (b+w_1x_1+w_2x_2\leq0)\\ 1\ (b+w_1x_1+w_2x_2>0) \end{cases} $$ ์—ฌ๊ธฐ์„œ $ b $๋Š” ํŽธํ–ฅ(bias)๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜(parameter)๋กœ ๋‰ด๋Ÿฐ์ด ์–ผ๋งˆ๋‚˜ ์‰ฝ๊ฒŒ ํ™œ์„ฑํ™”๋˜๋Š”์ง€๋ฅผ ์ œ์–ดํ•ฉ๋‹ˆ๋‹ค. $ w_1 $๊ณผ $ w_2 $๋Š” ๊ฐ ์‹ ํ˜ธ์˜ ๊ฐ€์ค‘์น˜(weight)๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ ๊ฐ ์‹ ํ˜ธ์˜ ์˜ํ–ฅ๋ ฅ์„ ์ œ์–ดํ•ฉ๋‹ˆ๋‹ค. ๋„คํŠธ์›Œํฌ..