Stanford CS229: Machine Learning | Lecture 11 & 12 Lecture Notes

This post translates and summarizes most of the content of Professor Andrew Ng's CS229 machine learning course, taught at Stanford in Autumn 2018.

Course : Stanford CS229: Machine Learning Full Course taught by Andrew Ng | Autumn 2018
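Course overview : CS229 provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance trade-offs, practical advice); and reinforcement learning and adaptive control. The course also discusses recent applications of machine learning, such as robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.
Lecture titles :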

Lecture 11 - Introduction to Neural Networks | Stanford CS229: Machine Learning (Autumn 2018)

Lecture 12 - Backprop & Improving Neural Networks | Stanford CS229: Machine Learning (Autumn 2018)
Lecture videos:
https://youtu.be/MfIjxPh6Pys?si=Dfk31a9B5rfg5Mxq
https://youtu.be/zUazLXZZA2U?si=Vv8ohn0tDkW2oadO

Lecture materials (restored): https://github.com/maxim5/cs229-2018-autumn
Lecture syllabus : http://cs229.stanford.edu/syllabus-autumn2018.html
Prerequisites for the lectures
Linear algebra : https://cs229.stanford.edu/notes2022fall/cs229-linear_algebra_review.pdf
Probability theory : https://cs229.stanford.edu/notes2022fall/cs229-probability_review.pdf

Outline

Logistic Regression
Neural Networks
Improving your Neural Networks

Logistic Regression

goal 1: Find a cat in an image

initialize $w,\ b$
Find the optimal $w,\ b$
- $\mathcal{L}=-[y\log\hat{y}+(1-y)\log(1-\hat{y})]$
use $\hat{y}=\sigma (wx+b)$ to predict
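
The three steps above (initialize, optimize with the loss, predict) can be sketched in NumPy roughly as follows. This is a minimal illustration, not the lecture's code: the synthetic data, the learning rate `alpha`, the step count, and the helper names `bce_loss` and `train_logistic` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(y_hat, y):
    # L = -[y log(y_hat) + (1 - y) log(1 - y_hat)], per example
    eps = 1e-12  # assumption: clip predictions to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def train_logistic(X, y, alpha=0.1, steps=500):
    m, n = X.shape
    w = np.zeros(n)  # initialize w, b
    b = 0.0
    for _ in range(steps):
        y_hat = sigmoid(X @ w + b)        # neuron = linear + activation
        grad_z = y_hat - y                # dL/dz for sigmoid + cross-entropy
        w -= alpha * (X.T @ grad_z) / m   # gradient descent on w
        b -= alpha * grad_z.mean()        # gradient descent on b
    return w, b

# Hypothetical stand-in for flattened image features: 200 examples, n = 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b = train_logistic(X, y)
y_hat = sigmoid(X @ w + b)                # use y_hat = sigma(wx + b) to predict
accuracy = ((y_hat > 0.5) == (y == 1)).mean()
mean_loss = bce_loss(y_hat, y).mean()
```

Here the model has $n+1$ parameters ($n$ weights and one bias), which is the per-neuron count that the later $3n+3$ figure is built from.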

neuron = linear + activation
model = architecture + parameter

goal 2: Find cat / lion / iguana in image

$\mathcal{L}_{3N}=-\sum\limits^{3}_{k=1}[y_{k}\log\hat{y}_{k}+(1-y_{k})\log(1-\hat{y}_{k})]$
number of parameters : $3n+3$

goal 3: (+ constraint) exactly one animal per image

$\mathcal{L}_{3N}=-\sum\limits^{3}_{k=1}y_{k}\log\hat{y}_{k}$
number of parameters : $3n+3$
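
The difference between the goal 2 and goal 3 losses can be seen in a small sketch. The notes only give the loss formulas; using three independent sigmoids for goal 2 and a softmax output for goal 3 is an assumption here, as are the input size `n = 4`, the random weights, and the function names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_multilabel(y_hat, y):
    # goal 2: one binary cross-entropy per animal, summed over k = 1..3
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def loss_single_label(y_hat, y):
    # goal 3: -sum_k y_k log(y_hat_k), with y one-hot
    return -np.sum(y * np.log(y_hat))

n = 4                                # hypothetical input size
rng = np.random.default_rng(0)
W = rng.normal(size=(3, n)) * 0.1    # 3n weights
b = np.zeros(3)                      # 3 biases
x = rng.normal(size=n)
z = W @ x + b

y = np.array([0.0, 1.0, 0.0])        # e.g. "lion" only
p_multi = sigmoid(z)                 # each animal scored independently
p_single = softmax(z)                # scores compete and sum to 1
num_params = W.size + b.size         # 3n + 3
```

Both setups use the same $3n+3$ parameters; only the output activation and the loss change.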

Neural Networks

goal 1: cat vs no cat in image

Characteristics

black box model
end-to-end learning

Propagation Equation

To optimize $w^{[1]},\ w^{[2]},\ w^{[3]},\ b^{[1]},\ b^{[2]},\ b^{[3]}$, we define a cost / loss function.

$$
\mathcal{J}(\hat{y},y)=\frac{1}{m}\sum\limits^{m}_{i=1}\mathcal{L}^{(i)}
$$

$$
\mathcal{L}^{(i)}=-\sum\limits^{3}_{k=1}[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]
$$

Backward Propagation

Backpropagation can be computed using the chain rule.

cost function : $\mathcal{J}(\hat{y},y)=\frac{1}{m}\sum\limits^{m}_{i=1}\mathcal{L}^{(i)}$
- with $\mathcal{L}^{(i)}=-\sum\limits^{3}_{k=1}[y^{(i)}\log\hat{y}^{(i)}+(1-y^{(i)})\log(1-\hat{y}^{(i)})]$
update : $w^{[l]}=w^{[l]}-\alpha\frac{\partial \mathcal{J}}{\partial w^{[l]}}$

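The remaining derivatives then follow easily from the chain rule.

$a^{[2]},\ a^{[3]}$ are obtained during forward propagation but are also needed in backward propagation. Recomputing them with a fresh forward pass each time would be inefficient, so the values are cached and reused.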
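
A minimal sketch of this caching idea for a 3-layer network is shown below. It assumes sigmoid activations in every layer, a single sigmoid output with binary cross-entropy per example, and hypothetical layer sizes `[5, 3, 2, 1]`; the dictionary layout and function names are illustrative choices, not the lecture's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """X has shape (n, m). Returns y_hat and a cache of all activations."""
    cache = {"a0": X}
    a = X
    for l in (1, 2, 3):
        z = params[f"W{l}"] @ a + params[f"b{l}"]
        a = sigmoid(z)
        cache[f"a{l}"] = a  # stored once, reused by backward()
    return a, cache

def cost(y_hat, Y):
    # J = (1/m) * sum_i L^(i), with binary cross-entropy L (assumption)
    return float(np.mean(-(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))))

def backward(params, cache, Y):
    m = Y.shape[1]
    grads = {}
    dz = cache["a3"] - Y  # m * dJ/dz3 for sigmoid + cross-entropy
    for l in (3, 2, 1):
        a_prev = cache[f"a{l - 1}"]  # cached forward value
        grads[f"W{l}"] = dz @ a_prev.T / m
        grads[f"b{l}"] = dz.sum(axis=1, keepdims=True) / m
        if l > 1:
            da_prev = params[f"W{l}"].T @ dz          # chain rule through the linear part
            dz = da_prev * a_prev * (1 - a_prev)      # sigmoid'(z) = a (1 - a)
    return grads

rng = np.random.default_rng(0)
sizes = [5, 3, 2, 1]  # hypothetical layer sizes
params = {}
for l in (1, 2, 3):
    params[f"W{l}"] = rng.normal(size=(sizes[l], sizes[l - 1])) * 0.5
    params[f"b{l}"] = np.zeros((sizes[l], 1))
X = rng.normal(size=(5, 8))
Y = (rng.random(size=(1, 8)) > 0.5).astype(float)

y_hat, cache = forward(params, X)
grads = backward(params, cache, Y)

# One update step: w := w - alpha * dJ/dw
alpha = 0.1
cost_before = cost(y_hat, Y)
new_params = {k: params[k] - alpha * grads[k] for k in params}
cost_after = cost(forward(new_params, X)[0], Y)
```

The backward pass never calls `forward` again; it reads `a1`, `a2`, and `a3` from the cache.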

Improving Neural Networks

Activation Function

The most commonly used activation functions are sigmoid, ReLU, and tanh.

$$
\text{sigmoid}(z)=\frac{1}{1+e^{-z}}
$$

$$
\tanh (z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}
$$

$$
\text{ReLU}(z)= \begin{cases}0 & \text{if}\ z\leq 0\\ z & \text{if}\ z>0 \end{cases}
$$

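sigmoid has the advantage that its output can be used as a probability in classification. For sigmoid and tanh, the gradient approaches 0 when $z$ is very small or very large, which causes the vanishing gradient problem. ReLU, by contrast, has a gradient of 1 for every $z>0$, so it does not saturate there and avoids this problem.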
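
The saturation behaviour can be checked numerically. The sketch below evaluates each activation's derivative at a few sample points; the points $z\in\{-10, 0, 10\}$ and the function names are chosen here only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1 - s)

def d_tanh(z):
    return 1 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    # derivative is 0 for z <= 0 and 1 for z > 0
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
grad_sigmoid = d_sigmoid(z)  # tiny at both ends, largest (0.25) at z = 0
grad_tanh = d_tanh(z)        # tiny at both ends, largest (1.0) at z = 0
grad_relu = d_relu(z)        # stays at 1 for any positive z
```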

Why an Activation Function Is Needed

Without activation functions, a network reduces to linear regression no matter how deep it is. The complexity of a neural network comes from its activation functions.
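
A short check of this claim: two stacked layers without an activation compute exactly the same function as one linear layer with $W=W^{[2]}W^{[1]}$ and $b=W^{[2]}b^{[1]}+b^{[2]}$. The layer sizes and random values below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))
x = rng.normal(size=(3, 5))  # 5 examples with 3 features each

# Two layers with identity activation
deep = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W = W2 @ W1
b = W2 @ b1 + b2
shallow = W @ x + b
```

Adding more layers only multiplies more matrices together, so the result stays a single linear map.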

Initialization Method

Normalizing

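Compute the mean and standard deviation of the examples, then normalize by subtracting the mean and dividing by the standard deviation.

After normalizing, the contours of the loss function become more circular, so we can expect to reach the optimum more easily and quickly.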
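
A minimal sketch of this normalization, assuming examples are stored as rows and statistics are taken per feature. The small `eps` guard and the synthetic two-feature data with very different scales are assumptions added for the example.

```python
import numpy as np

def normalize(X, eps=1e-8):
    """Subtract the per-feature mean and divide by the per-feature std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    # eps is an assumption: it guards against division by zero
    return (X - mu) / (sigma + eps), mu, sigma

rng = np.random.default_rng(0)
# Hypothetical data: two features on very different scales
X_train = rng.normal(loc=[5.0, -200.0], scale=[0.1, 50.0], size=(1000, 2))
X_norm, mu, sigma = normalize(X_train)
```

A common practice is to reuse the training `mu` and `sigma` when normalizing new data, so that every input goes through the same transformation.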

Vanishing / Exploding Gradient

Suppose the activation function is the identity function and $b^{[l]}=0$. Then $\hat{y}$ is obtained as follows.

As more layers are stacked, the first case explodes and the second case vanishes. This is shown for forward propagation, but the same thing happens when computing derivatives.
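
The effect is easy to reproduce. With identity activations and zero biases, $\hat{y}$ is just a product of weight matrices applied to $x$. The sketch below uses diagonal weights of 1.5 and 0.5 over 50 layers; these values and the function name are chosen for illustration and are assumptions, not values taken from the notes.

```python
import numpy as np

def deep_identity_output(scale, num_layers, x):
    """Forward pass with identity activation, b = 0, and W = scale * I in every layer."""
    W = scale * np.eye(x.shape[0])
    a = x
    for _ in range(num_layers):
        a = W @ a
    return a

x = np.ones((2, 1))
exploding = deep_identity_output(1.5, 50, x)  # each entry equals 1.5 ** 50
vanishing = deep_identity_output(0.5, 50, x)  # each entry equals 0.5 ** 50
```

Weights only slightly above or below 1 are enough to push the output to a huge or a negligible value once the network is deep.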

Avoiding the Problem

If $w_{i}$ is large, $z$ can become too large depending on the input size. We therefore scale $w_{i}$ according to the input size $n$, for example $w_{i}\sim \frac{1}{n}$.

sigmoid : $w^{[l]}$ = np.random.randn(shape)*np.sqrt(1/n[l-1])
ReLU : $w^{[l]}$ = np.random.randn(shape)*np.sqrt(2/n[l-1])
Xavier Initialization (for tanh) : $w^{[l]}\sim\sqrt{\frac{1}{n^{[l-1]}}}$
He Initialization (for tanh) : $w^{[l]}\sim\sqrt{\frac{2}{n^{[l]}+n^{[l-1]}}}$
- $n^{[l-1]}$ is the input size during forward propagation.
- $n^{[l]}$ is the input size during backward propagation.
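
The scaling rules listed above can be collected into one helper, sketched below. The function name `init_weights` and the scheme labels are hypothetical, and `rng.standard_normal` is used in place of `np.random.randn` so that the shape can be passed as a tuple.

```python
import numpy as np

def init_weights(n_out, n_in, scheme, rng):
    """Draw a (n_out, n_in) weight matrix whose scale depends on the layer sizes."""
    if scheme == "sigmoid":        # sqrt(1 / n[l-1])
        std = np.sqrt(1.0 / n_in)
    elif scheme == "relu":         # sqrt(2 / n[l-1])
        std = np.sqrt(2.0 / n_in)
    elif scheme == "fan_in":       # sqrt(1 / n[l-1]), the tanh rule listed as Xavier
        std = np.sqrt(1.0 / n_in)
    elif scheme == "fan_avg":      # sqrt(2 / (n[l] + n[l-1])), the rule listed as He
        std = np.sqrt(2.0 / (n_in + n_out))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return rng.standard_normal((n_out, n_in)) * std

rng = np.random.default_rng(0)
n_in, n_out = 400, 200             # hypothetical layer sizes
W_sigmoid = init_weights(n_out, n_in, "sigmoid", rng)
W_relu = init_weights(n_out, n_in, "relu", rng)

# With unit-variance inputs, z = W x keeps roughly unit scale under the 1/n rule.
X = rng.standard_normal((n_in, 1000))
Z = W_sigmoid @ X
```

Because each $z$ sums $n$ terms, dividing the weight variance by $n$ keeps the scale of $z$ roughly independent of the input size.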

Optimization

batch gradient descent can use vectorization to learn from all examples at once.
stochastic gradient descent has the advantage of fast updates.
mini-batch gradient descent is a trade-off between the strengths and weaknesses of the two.

Splitting the examples into mini-batches of 1000 in this way gives updates that are faster than batch gradient descent and optimization that is more accurate than stochastic gradient descent.

algorithm:
For iteration t=1… select batch $(X^{[t]}, Y^{[t]})$:
forward prop
$\mathcal{J}=\frac{1}{1000}\sum\limits^{1000}_{i=1}\mathcal{L}^{(i)}$
backward prop
update $w^{[l]}, b^{[l]}$
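
The loop above can be sketched as follows. To keep the example short, a single logistic unit stands in for the network's forward and backward prop; that substitution, the shuffling each epoch, the synthetic data, and the hyperparameters are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd(X, Y, batch_size=1000, alpha=0.5, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = []
    eps = 1e-12  # assumption: avoids log(0)
    for _ in range(epochs):
        order = rng.permutation(m)               # reshuffle examples each epoch
        for start in range(0, m, batch_size):    # iteration t: select batch (X_t, Y_t)
            idx = order[start:start + batch_size]
            X_t, Y_t = X[idx], Y[idx]
            y_hat = sigmoid(X_t @ w + b)         # forward prop
            J = -np.mean(Y_t * np.log(y_hat + eps)
                         + (1 - Y_t) * np.log(1 - y_hat + eps))
            costs.append(J)                      # J averaged over this batch only
            dz = (y_hat - Y_t) / len(idx)        # backward prop
            w -= alpha * (X_t.T @ dz)            # update w
            b -= alpha * dz.sum()                # update b
    return w, b, costs

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
Y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w, b, costs = minibatch_gd(X, Y)
accuracy = ((sigmoid(X @ w + b) > 0.5) == (Y == 1)).mean()
```

With 5000 examples and batches of 1000, each epoch performs 5 parameter updates instead of 1, and each recorded cost is computed on a different batch.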

The $\mathcal{J}$ curves of batch gradient descent and mini-batch gradient descent differ as follows.

Gradient Descent + Momentum

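To converge faster, the updates should move less in the vertical direction and more in the horizontal direction. Momentum does this by carrying over the velocity of previous updates: $\beta$ sets the weight given to the previous $v$ in each update.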
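
A small sketch of the momentum update on a toy problem is given below. It assumes the common form $v=\beta v+(1-\beta)\frac{\partial \mathcal{J}}{\partial w}$ followed by $w=w-\alpha v$, which the notes do not spell out. The quadratic cost with curvatures 1 (horizontal) and 50 (vertical), the starting point, and the hyperparameters are also assumptions.

```python
import numpy as np

def grad_J(w, curvature):
    # gradient of J(w) = 0.5 * sum(curvature * w**2)
    return curvature * w

def run(alpha, beta, steps, w0, curvature):
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_J(w, curvature)
        v = beta * v + (1 - beta) * g  # blend the previous velocity with the new gradient
        w = w - alpha * v              # step along the velocity
    return w

curvature = np.array([1.0, 50.0])      # shallow horizontal axis, steep vertical axis
w0 = np.array([10.0, 1.0])

# beta = 0 reduces to plain gradient descent (v equals the gradient)
w_plain = run(alpha=0.05, beta=0.0, steps=300, w0=w0, curvature=curvature)
w_momentum = run(alpha=0.05, beta=0.9, steps=300, w0=w0, curvature=curvature)
```

In this toy setting, plain gradient descent with `alpha=0.05` overshoots along the steep axis on every step and diverges, while the averaged velocity damps those oscillations and the iterate approaches the minimum at the origin.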
