
[Math] Understanding How Deep Learning Learns

by ์ž„๋ฆฌ๋‘ฅ์ ˆ 2025. 1. 20.
๋ฐ˜์‘ํ˜•
๋”๋ณด๊ธฐ

These are notes I originally kept for myself in Notion during my bootcamp, written up again here as a way to study.
Everything is my own interpretation (so it may be wrong; corrections welcome, lol).
** No lecture materials are used **
** Commercial use is prohibited **

Today's Keywords
neural network, softmax, activation function, backpropagation, chain rule

๋น„์„ ํ˜•๋ชจ๋ธ - ์‹ ๊ฒฝ๋ง neural network

๊ฐ ํ–‰๋ฒกํ„ฐ Oi = ๋ฐ์ดํ„ฐ Xi X ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ W + ์ ˆํŽธ b

์ „์ฒด ๋ฐ์ดํ„ฐ X, x๋ฅผ ๋‹ค๋ฅธ ๊ณต๊ฐ„์œผ๋กœ ๋ณด๋‚ด์ฃผ๋Š” ๊ฐ€์ค‘์น˜ W์˜ ๊ณฑ์œผ๋กœ ํ‘œํ˜„ + b(y์ ˆํŽธ)

์ด ๋•Œ ์ถœ๋ ฅ ๋ฒกํ„ฐ์˜ ์ฐจ์›์€ d -> p 

d๊ฐœ์˜ ๋ณ€์ˆ˜๋กœ p๊ฐœ์˜ ์„ ํ˜• ๋ชจ๋ธ ๋งŒ๋“ค์–ด์„œ p๊ฐœ์˜ ์ž ์žฌ๋ณ€์ˆ˜ ์„ค๋ช…

x to O๋กœ ์—ฐ๊ฒฐํ•  ๋•Œ P๊ฐœ์˜ ๋ชจ๋ธ.
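A quick numpy sanity check of the linear model above (my own toy sketch, not lecture code; the shapes n=4, d=3, p=2 are arbitrary):

import numpy as np

# n samples with d features, mapped to p latent variables
n, d, p = 4, 3, 2
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d))   # data matrix: one row vector x_i per sample
W = rng.normal(size=(d, p))   # weight matrix sending R^d -> R^p
b = rng.normal(size=(p,))     # intercept, broadcast across all rows

O = X @ W + b                 # each row is o_i = x_i W + b
print(O.shape)                # (4, 2): the dimension went d=3 -> p=2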

The softmax function

Composing softmax with the output vector o → the result can be interpreted as the probability of belonging to a particular class k.

  • ๋ชจ๋ธ์˜ ์ถœ๋ ฅ์„ ํ™•๋ฅ ๋กœ ํ•ด์„ํ•  ์ˆ˜ ์žˆ๊ฒŒ
  • ๋ถ„๋ฅ˜ ๋ฌธ์ œ ํ’€๋•Œ ๋ชจ๋ธ X ์†Œํ”„ํŠธ๋งฅ์Šค → ์˜ˆ์ธก
  • softmax(o) = softmax(Wx +b)
  • ํ•™์Šตํ•  ๋•Œ softmax O
  • ์ถ”๋ก ํ•  ๋•Œ one-hot vector ์‚ฌ์šฉ( 1๋กœ ์ถœ๋ ฅํ•˜๋Š” ์—ฐ์‚ฐ. ๊ทธ๋ž˜์„œ softmax๋ฅผ ์‚ฌ์šฉํ•˜์ง„ ์•Š๋Š”๋‹ค)
import numpy as np

def softmax(vec):
    # subtract the row-wise max before exponentiating for numerical stability
    numerator = np.exp(vec - np.max(vec, axis=-1, keepdims=True))
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    val = numerator / denominator
    return val
    
    
# [1, 2, 0] -> [0.24, 0.67, 0.09]; the entries sum to 1
import numpy as np

def one_hot(val, dim):
    # turn each index in val into a one-hot row vector of length dim
    return [np.eye(dim)[_] for _ in val]

def one_hot_encoding(vec):
    vec_dim = vec.shape[1]
    vec_argmax = np.argmax(vec, axis=-1)  # index of the largest entry in each row
    return one_hot(vec_argmax, vec_dim)

def softmax(vec):
    # subtract the row-wise max before exponentiating for numerical stability
    numerator = np.exp(vec - np.max(vec, axis=-1, keepdims=True))
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    val = numerator / denominator
    return val

# test
vec = np.array([[1, 2, 0], [-1, 0, 1], [-10, 0, 10]])
print(one_hot_encoding(vec))
print(one_hot_encoding(softmax(vec)))
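Both prints give the same one-hot rows ([0,1,0], [0,0,1], [0,0,1]): softmax is monotone and never moves the argmax, which is exactly why it can be skipped at inference time.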

The activation function σ is a nonlinear function applied individually to each node of the latent vector z = (z1, ..., zq) → a new latent vector H = (σ(z1), ..., σ(zq)).

์‹ ๊ฒฝ๋ง = ์„ ํ˜•๋ชจ๋ธ + ํ™œ์„ฑํ™”ํ•จ์ˆ˜ (activation function) (๋น„์„ ํ˜•ํ•จ์ˆ˜ ๊ฐ๊ฐ์— ์ ์šฉํ•˜๋Š”..)

Where softmax takes the whole output vector into account, the activation function applies only at its own slot.

์ด๋Ÿฐ์‹์œผ๋กœ ๋ณ€ํ˜•์‹œํ‚จ ๋ฒกํ„ฐ → hidden vector

(Figure: a perceptron)

Activation function

  • a nonlinear function
  • without an activation function, deep learning is no different from a linear model
  • sigmoid, tanh, ...; in deep learning, ReLU is used the most

ReLU → the archetypal nonlinear function; a quick sketch of the common choices follows.
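These are the standard textbook definitions in my own numpy code (not lecture code); note how each one acts entry by entry:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))  # squashes into (0, 1), saturates for large |x|

def tanh(x):
    return np.tanh(x)            # squashes into (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)      # max(0, x): keeps positives, zeroes out negatives

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z))  # each function applied elementwise
print(tanh(z))
print(relu(z))     # [0.  0.  0.  0.5 2. ]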

 

Feed z through the activation function like this to build H, then use that as the next layer's input.

 

Using two weight matrices W1 and W2 in turn → now it's a 2-layer neural network.

  • multi-layer perceptron (MLP): a neural network stacked several layers deep

σ(Z) = (σ(z1), ..., σ(zn)): the matrix formed by applying the activation function to each vector of Z; repeat this from layer 1 up to L to stack the network (a sketch follows below).
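Putting the forward pass together, a minimal sketch of an L-layer MLP (my own construction; I'm assuming ReLU between layers, with the last layer left linear so softmax can go on top for classification):

import numpy as np

def mlp_forward(X, weights, biases):
    # repeat Z(l) = H(l-1) W(l) + b(l), H(l) = sigma(Z(l)) for l = 1..L
    H = X
    for l, (W, b) in enumerate(zip(weights, biases)):
        Z = H @ W + b
        if l < len(weights) - 1:
            H = np.maximum(0, Z)  # sigma applied to each entry of Z (ReLU here)
        else:
            H = Z                 # final layer: raw output o
    return H

# example: a 3 -> 4 -> 2 network with arbitrary weights
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(3, 4)), rng.normal(size=(4, 2))]
bs = [np.zeros(4), np.zeros(2)]
print(mlp_forward(rng.normal(size=(5, 3)), Ws, bs).shape)  # (5, 2)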

  • ์ด๋ก ์ ์œผ๋ก  2์ธต ์‹ ๊ฒฝ๋ง์œผ๋กœ๋„ ๋˜๊ธดํ•จ. (universal approximation theorem)
  • BUT ๊นŠ์„์ˆ˜๋ก ๋ชฉ์ ํ•จ์ˆ˜๋ฅผ ๊ทผ์‚ฌํ•˜๋Š”๋ฐ ํ•„์š”ํ•œ ๋‰ด๋Ÿฐ(node)์˜ ์ˆซ์ž๊ฐ€ ๋นจ๋ฆฌ ์ค„์–ด๋“ค์–ด ํšจ์œจ์ ์œผ๋กœ ํ•™์Šต.
    ์ธต์ด ์–‡์œผ๋ฉด ๋‰ด๋Ÿฐ์ด ๋Š˜์–ด๋‚˜ wideํ•œ ์‹ ๊ฒฝ๋ง์ด ๋˜์–ด์•ผํ•จ.
    • ์ด๊ฒŒ ๋ญ”์†Œ๋ฆฌ๋ƒ? (๋‚ด๊ฐ€ ํ•„๊ธฐํ•œ ๊ฑธ ๋‹ค์‹œ ๋ณด๋‹ˆ๊นŒ ๋ญ”์†Œ๋ฆฐ์ง€ ๋ชจ๋ฅด๊ฒ ์Œ)
    • ๊นŠ์€ ์‹ ๊ฒฝ๋ง -> ๋” ์ ์€ ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ๋„ ํ‘œํ˜„. ๋ณต์žกํ•œ ๋ฌธ์ œ๋ฅผ ๋‹จ์ˆœ ๋ฌธ์ œ๋กœ ๋ถ„ํ•ด. 
    • ์–•์€ ์‹ ๊ฒฝ๋ง -> ๊ทธ๋งŒํผ ๋„“์–ด์•ผ๋จ (๋„ˆ๋น„), ๋ณต์žกํ•œ ๋ฌธ์ œ๋ฅผ ๋งŽ์€ ๋งค๊ฐœ๋ณ€์ˆ˜ ์จ์•ผ๋จ. 
    • ์ผ๋ฐ˜์ ์œผ๋กœ ๋„คํŠธ์›Œํฌ๋ฅผ ๋” ๊นŠ๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ๋„ˆ๋น„๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ๋ณด๋‹ค ์ •ํ™•๋„ ๊ฐœ์„ ์— ๋” ํšจ๊ณผ์ ์ด๋‹ค !

The backpropagation algorithm

์œผ๋กœ ๊ฐ ์ธต์— ์“ฐ์ด๋Š” parameter๋ฅผ ํ•™์Šต ์œ„ํ•จ.

parameter

๊ฐ๊ฐ์˜ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ W(l)์— ๋Œ€ํ•ด์„œ ์†์‹คํ•จ์ˆ˜์— ๋Œ€ํ•œ ๋ฏธ๋ถ„์„ ๊ณ„์‚ฐ

๊ฐ ์ธต ํŒŒ๋ผ๋ฉ”ํƒ€์˜ ๊ทธ๋ ˆ๋””์–ธํŠธ ๋ฒกํ„ฐ๋Š” ์œ—์ธต๋ถ€ํ„ฐ ์—ญ์ˆœ์œผ๋กœ

  • ํ•ฉ์„ฑํ•จ์ˆ˜ ๋ฏธ๋ถ„๋ฒ•์ธ chain-rule ๊ธฐ๋ฐ˜ ์ž๋™ ๋ฏธ๋ถ„ ์‚ฌ์šฉ

A simple example of the chain rule, used here to differentiate z with respect to x:
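Since the original figure is gone, here is a stand-in worked example (my own; the lecture's may differ). Take z = (x + y)^2 and name the inner part w = x + y, so z = w^2. Then dz/dx = (dz/dw)(dw/dx) = 2w · 1 = 2(x + y). Checked with sympy:

import sympy as sp

x, y = sp.symbols("x y")
w = x + y      # inner function
z = w ** 2     # outer function

# direct differentiation vs. the two chain-rule factors multiplied out
print(sp.diff(z, x))                     # 2*x + 2*y
print(sp.expand(2 * w * sp.diff(w, x)))  # 2*x + 2*y, same thing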

  • An example: the backpropagation algorithm through a 2-layer neural network

๋นจ๊ฐ„์ƒ‰์ด backward, ๋ฏธ๋ถ„์ด ์ „๋‹ฌ๋˜๋Š”

(์˜ค๋ฅธ์ชฝ๋ฐ‘์—์„œ 4) ์†์‹คํ•จ์ˆ˜๋ฅผ ์ถœ๋ ฅ o์— ๋Œ€ํ•ด ๋ฏธ๋ถ„

(์˜ค๋ฅธ์ชฝ๋ฐ‘์—์„œ 3) 4๋ฅผ h๋กœ ๋ฏธ๋ถ„.

(์˜ค๋ฅธ์ชฝ๋ฐ‘์—์„œ 2) 3์„ hidden value z ์— ๋Œ€ํ•ด ๋ฏธ๋ถ„.

(์˜ค๋ฅธ์ชฝ๋ฐ‘์—์„œ 1) 2๋ฅผ w1์— ๋Œ€ํ•ด ๋ฏธ๋ถ„

The gradient vectors computed this way for each weight matrix are then fed to SGD, cycling through the data in mini-batches, to minimize the given objective (a runnable sketch follows below).
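To make steps (4) through (1) concrete, a minimal numpy sketch of one backprop + SGD step through a 2-layer net (my own construction: I'm assuming a sigmoid activation and mean-squared-error loss, which may differ from the lecture slide):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3))   # one mini-batch: 8 samples, 3 features
Y = rng.normal(size=(8, 2))   # toy regression targets (so MSE is natural)
W1 = rng.normal(size=(3, 4))
W2 = rng.normal(size=(4, 2))

# forward pass
Z = X @ W1                    # hidden pre-activation z
H = sigmoid(Z)                # hidden vector h = sigma(z)
O = H @ W2                    # output o
loss = np.mean((O - Y) ** 2)

# backward pass, top layer down (the red arrows)
dL_dO = 2 * (O - Y) / O.size  # (4) loss differentiated w.r.t. the output o
dL_dW2 = H.T @ dL_dO          #     gradient for W2, picked up along the way
dL_dH = dL_dO @ W2.T          # (3) ...then w.r.t. h
dL_dZ = dL_dH * H * (1 - H)   # (2) ...then w.r.t. z (sigmoid' = h(1 - h))
dL_dW1 = X.T @ dL_dZ          # (1) ...finally w.r.t. W1

# one SGD step on this mini-batch
lr = 0.1
W1 -= lr * dL_dW1
W2 -= lr * dL_dW2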

ํ•™์Šต ์›๋ฆฌ : backpropagation ์—ญ์ „ํŒŒ ์•Œ๊ณ ๋ฆฌ์ฆ˜

Parameter

์œผ๋กœ ๊ฐ ์ธต์— ์“ฐ์ด๋Š” parameter๋ฅผ ํ•™์Šต ์œ„ํ•จ.

๊ฐ๊ฐ์˜ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ W(l)์— ๋Œ€ํ•ด์„œ ์†์‹คํ•จ์ˆ˜์— ๋Œ€ํ•œ ๋ฏธ๋ถ„์„ ๊ณ„์‚ฐ

๊ฐ ์ธต ํŒŒ๋ผ๋ฉ”ํƒ€์˜ ๊ทธ๋ ˆ๋””์–ธํŠธ ๋ฒกํ„ฐ๋Š” ์œ—์ธต๋ถ€ํ„ฐ ์—ญ์ˆœ์œผ๋กœ

ํ•ฉ์„ฑํ•จ์ˆ˜ ๋ฏธ๋ถ„๋ฒ•์ธ chain-rule ๊ธฐ๋ฐ˜ ์ž๋™ ๋ฏธ๋ถ„(auto-differentiation) ์‚ฌ์šฉ
z๋ฅผ x๋กœ ๋ฏธ๋ถ„ํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•˜๋Š” chain-rule ๊ฐ„๋‹จํ•œ ์˜ˆ์‹œ

์˜ˆ์ œ : 2์ธต ์‹ ๊ฒฝ๋ง ์—ญ์ „ํŒŒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ 

๋นจ๊ฐ„์ƒ‰์ด backward, ๋ฏธ๋ถ„์ด ์ „๋‹ฌ๋˜๋Š”
์ด๋ ‡๊ฒŒ ๊ณ„์‚ฐํ•œ ๊ฐ๊ฐ์˜ ๊ฐ€์ค‘์น˜ํ–‰๋ ฌ์— ๋Œ€ํ•œ gradient vector๋ฅผ sgd๋ฅผ ์ด์šฉ, ๋ฐ์ดํ„ฐ๋ฅผ mini-batch๋กœ ๋ฒˆ๊ฐˆ์•„ ๊ฐ€๋ฉฐ ํ•™์Šต, ์ฃผ์–ด์ง„ ๋ชฉ์ ์น˜๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š”.

๋”ฅ๋Ÿฌ๋‹์˜ ํ•™์Šต ์›๋ฆฌ

์–ด๋ ต์ง€๋งŒ ํ•œ๋ฒˆ ๋” ์ฐพ์•„๋ณด๊ณ  ์ง์ ‘ ์Šฌ๋ž˜์‹œ ํ•˜๋ฉด์„œ ํ•˜๋ฉด ์ดํ•ดํ•˜๊ธฐ ๋” ์‰ฝ๋‹ค 


Quiz review

The quiz questions I got wrong...

I must have typed two million question marks at this one lol. I need to get a feel for deriving the differentiation formula using k like that.

 

๋ฐ˜์‘ํ˜•

์ตœ๊ทผ๋Œ“๊ธ€

์ตœ๊ทผ๊ธ€

skin by ยฉ 2024 ttutta