Stanford CS229: Machine Learning | Lecture 3 Notes

Outline
Intro
Parametric learning algorithms and non-parametric learning algorithms
Locally weighted regression
Probabilistic Interpretation
Logistic Regression
Newton's method

This post translates and summarizes most of the content of Professor Andrew Ng's CS229 Machine Learning course, taught at Stanford in Autumn 2018.

Course : Stanford CS229: Machine Learning Full Course taught by Andrew Ng | Autumn 2018

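Course description: CS229 provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoff, practical advice); and reinforcement learning and adaptive control. The course also discusses recent applications of machine learning, such as robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing.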

Lecture title: Locally Weighted & Logistic Regression | Stanford CS229: Machine Learning - Lecture 3 (Autumn 2018)

Lecture video: https://www.youtube.com/watch?v=het9HFqo1TQ&t=114s

Lecture materials (restored): https://github.com/maxim5/cs229-2018-autumn

Lecture syllabus: http://cs229.stanford.edu/syllabus-autumn2018.html

Prerequisites for this lecture

Linear algebra: https://cs229.stanford.edu/notes2022fall/cs229-linear_algebra_review.pdf

Probability theory: https://cs229.stanford.edu/notes2022fall/cs229-probability_review.pdf

Outline

- Linear regression (Recap)
- Locally weighted regression
- Probabilistic interpretation
- Logistic regression
- Newton's method

Intro

The previous lecture covered linear regression, including gradient descent and the normal equation. This lecture covers locally weighted regression, a modification of linear regression that fits non-linear functions rather than a single straight line. It also covers a probabilistic interpretation of linear regression, which leads to the first classification algorithm, logistic regression. Finally, it covers Newton's method, an algorithm for fitting logistic regression.

Parametric learning algorithms and non-parametric learning algorithms

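A parametric learning algorithm fits a fixed set of parameters, such as $\theta_i$, to the data. It is called parametric because this set of parameters is fixed. Once the parameters are fit, predictions can be made from them alone, even if the training set used for learning is no longer in computer memory.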

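In a non-parametric learning algorithm, the number of parameters (the amount of data that must be kept) grows with the size of the dataset. The dataset has to stay in computer memory or be stored somewhere, so these algorithms are a poor fit for very large datasets. Locally weighted regression is a non-parametric learning algorithm.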

Locally weighted regression

Pasted image 20240116095147.png — Figure 1

Given a dataset like the one in Figure 1, let us evaluate $h$ at a particular $x$.

Pasted image 20240116095552.png — Figure 2

Linear regression fits $\theta$ to minimize $\frac{1}{2} \sum^m_{i=1}\left(y^{(i)}-\theta^Tx^{(i)}\right)^2$ and then returns $\theta^{T}x$.

Pasted image 20240116095909.png — Figure 3

Locally weighted regression, as shown in the figure, fits the line by focusing on the training examples close to the $x$ at which we want a prediction. Here "close" means having a similar $x$ value. Formally:
- Fit $\theta$ to minimize $\frac{1}{2} \sum^m_{i=1}w^{(i)}\left(y^{(i)}-\theta^{T}x^{(i)}\right)^2$
- where $w^{(i)}$ is a weight function, $w^{(i)}=\exp\left(-\frac{(x^{(i)}-x)^2}{2}\right)$
- If $|x^{(i)}-x|$ is small, $w^{(i)} \approx 1$
- If $|x^{(i)}-x|$ is large, $w^{(i)} \approx 0$

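The main change to the cost function is the added weighting term. If $x^{(i)}$ is far from the query point $x$, its squared error is multiplied by a constant very close to 0; if it is close, it is multiplied by a constant close to 1. The effect is that only the squared errors of the relevant examples (those near $x$) are summed. The weight function has the shape of a Gaussian density. (It is not a probability distribution, because the area under it is not 1.)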

Pasted image 20240116112209.png — Figure 4

The weight function is a bell-shaped curve centered at $x$. The weight for $x^{(i)}$ is the height of the bell at that point, that is, the value of the weight function there. The figure above shows that the weights change with the width of the bell. This width is the bandwidth parameter $\tau$, which controls how far from $x$ an example can be and still influence the fit. Including it, the weight function becomes
$w^{(i)}=\exp\left(-\frac{(x^{(i)}-x)^2}{2\tau^{2}}\right)$
Many other versions of the weight function exist as well. The bandwidth $\tau$ affects overfitting and underfitting: if $\tau$ is too large the fit underfits, and if it is too small the fit overfits.
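The weighted fit described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: it assumes one-dimensional inputs with an intercept term, so the weighted normal equations reduce to a 2x2 system that is solved by hand. The function name `lwr_predict` and the toy data are hypothetical.

```python
import math

def lwr_predict(xs, ys, x_query, tau):
    """Predict y at x_query with locally weighted linear regression (1-D inputs)."""
    # Gaussian-shaped weights: near 1 close to x_query, near 0 far away.
    w = [math.exp(-((xi - x_query) ** 2) / (2 * tau ** 2)) for xi in xs]
    # Weighted sums that form the 2x2 weighted normal equations
    # for the hypothesis h(x) = theta0 + theta1 * x.
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, xs))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, xs))
    swy = sum(wi * yi for wi, yi in zip(w, ys))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, xs, ys))
    det = sw * swxx - swx ** 2
    theta1 = (sw * swxy - swx * swy) / det
    theta0 = (swy - theta1 * swx) / sw
    return theta0 + theta1 * x_query

# Hypothetical toy data: noiseless samples of sin(x) on [0, 6].
xs = [0.5 * i for i in range(13)]
ys = [math.sin(x) for x in xs]

local_fit = lwr_predict(xs, ys, x_query=1.5, tau=0.5)     # small bandwidth
global_fit = lwr_predict(xs, ys, x_query=1.5, tau=100.0)  # nearly uniform weights
print(local_fit, global_fit, math.sin(1.5))
```

With a very large $\tau$ all weights are close to 1, so the prediction approaches that of ordinary linear regression. Note that $\theta$ is refit for every query point, which is why the training set has to be kept around. For an extremely small $\tau$ the weights can underflow to zero and the 2x2 system becomes singular; this sketch does not guard against that.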

Probabilistic Interpretation

Assumption
- $y^{(i)}=\theta^{T}x^{(i)}+\epsilon^{(i)}$
- $\epsilon^{(i)} \sim \mathcal{N}(0,\sigma^{2})$ : error(unmodeled effects, random noise)
- $P(\epsilon^{(i)})=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^{2}}\right)$
- error terms : IID(Independently and Identically Distributed)
- e.g., the error term of one house is independent of the error term of another house and follows the same distribution.
From these assumptions we obtain the following.
$P(y^{(i)}|x^{(i)};\theta)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right)$

This is a Gaussian density with mean $\theta^{T}x^{(i)}$ and variance $\sigma^{2}$, so it can also be written as follows.
$y^{(i)}|x^{(i)};\theta \sim \mathcal{N}(\theta^{T}x^{(i)},\sigma^{2})$
In the expressions above, the $;$ means "parameterized by $\theta$". Writing $P(y^{(i)}|x^{(i)},\theta)$ would mean conditioning on $\theta$, but $\theta$ is not a random variable here, so we cannot condition on it.

Under the assumptions above, the likelihood of $\theta$ is as follows.
$\begin{align} \mathcal{L}(\theta) &= p(\vec{y}|x;\theta)\\ &=\prod^{m}_{i=1}p(y^{(i)}|x^{(i)};\theta)\\ &=\prod^{m}_{i=1}\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}}\right) \end{align}$
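The difference between likelihood and probability often causes confusion. The likelihood of the parameters is the same quantity as the probability of the data. Viewed as a function of the parameters with the data held fixed, it is called the likelihood; we use this term when the training set is fixed and we want to compare different parameter values on it. Viewed as a function of the data with a particular parameter $\theta$ held fixed, it is called the probability.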

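The expression below is the log likelihood $\mathcal{l}(\theta)$, obtained by taking the log of the likelihood $\mathcal{L}(\theta)$. We use the log likelihood because it is easier to maximize than the likelihood itself, and because the log is strictly increasing, both are maximized by the same $\theta$.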
$\begin{align} \mathcal{l}(\theta) &=\log\mathcal{L}(\theta)\\ &=\log\prod^{m}_{i=1}p(y^{(i)}|x^{(i)};\theta)\\ &=\sum\limits^{m}_{i=1}\left(\log\frac{1}{\sqrt{2\pi}\sigma}+\log\exp\left(-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}} \right)\right)\\ &=m\log\frac{1}{\sqrt{2\pi}\sigma}+\sum^{m}_{i=1}-\frac{(y^{(i)}-\theta^{T}x^{(i)})^{2}}{2\sigma^{2}} \end{align}$
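Differentiating the log likelihood gives the $\theta$ that maximizes the likelihood, the maximum likelihood estimate (MLE). The first term above does not depend on $\theta$, and the second term is $-\frac{1}{\sigma^{2}}$ times $\frac{1}{2}\sum^{m}_{i=1}(y^{(i)}-\theta^{T}x^{(i)})^{2}$, so maximizing the likelihood is equivalent to minimizing $\frac{1}{2}\sum^{m}_{i=1}(y^{(i)}-\theta^{T}x^{(i)})^{2}$. This is exactly the cost function $J(\theta)$, which shows that the MLE is the $\theta$ that minimizes the least-squares cost function.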
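The equivalence between maximizing the log likelihood and minimizing $J(\theta)$ can be checked numerically. The sketch below is only an illustration under simplifying assumptions: a scalar $\theta$ with no intercept, a hypothetical four-point dataset, and a grid search over candidate values instead of calculus.

```python
import math

def log_likelihood(theta, xs, ys, sigma):
    """Gaussian log likelihood l(theta) for the model y = theta * x + noise."""
    m = len(xs)
    const = m * math.log(1.0 / (math.sqrt(2 * math.pi) * sigma))
    sq = sum((y - theta * x) ** 2 for x, y in zip(xs, ys))
    return const - sq / (2 * sigma ** 2)

def cost(theta, xs, ys):
    """Least-squares cost J(theta) = 1/2 * sum (y - theta * x)^2."""
    return 0.5 * sum((y - theta * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Candidate values theta = 1.00, 1.01, ..., 3.00.
candidates = [i / 100 for i in range(100, 301)]
best_by_likelihood = max(candidates, key=lambda t: log_likelihood(t, xs, ys, sigma=1.0))
best_by_cost = min(candidates, key=lambda t: cost(t, xs, ys))
print(best_by_likelihood, best_by_cost)
```

Both searches pick the same candidate, and changing `sigma` rescales the log likelihood without moving its maximizer, which matches the observation that $\sigma^{2}$ does not affect the MLE of $\theta$.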

Logistic Regression

For binary classification, where $y\in \{0,1\}$, using linear regression is not a good idea. The most commonly used classification algorithm is logistic regression; Professor Ng says that linear regression and logistic regression are the algorithms he uses most often. In designing logistic regression we want a hypothesis whose output lies between 0 and 1, that is, $h_{\theta}(x) \in [0,1]$. We therefore define $h_{\theta}(x)$ as follows.
$h_{\theta}(x)=g(\theta^{T}x)=\frac{1}{1+e^{-\theta^{T}x}}$
$g(z)=\frac{1}{1+e^{-z}}$ is called the "sigmoid function" or "logistic function". Linear regression defined the hypothesis as $h_{\theta}(x)=\theta^{T}x$, which can output values less than 0 or greater than 1. Applying the sigmoid function squashes this value into the range between 0 and 1. To build the logistic regression model, we make the following assumption about the distribution of $y|x;\theta$.
$\begin{align} &P(y=1|x;\theta)=h_{\theta}(x)\\ &P(y=0|x;\theta)=1-h_{\theta}(x) \end{align}$
Because $y \in \{0,1\}$, the two equations above can be combined into one.
$P(y|x;\theta)=h_{\theta}(x)^{y}(1-h_{\theta}(x))^{1-y}$
If $y$ is 1, the right-hand factor is raised to the power 0 and equals 1; if $y$ is 0, the left-hand factor equals 1.
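As a small illustration of the hypothesis and the combined expression, the sketch below evaluates $g(z)$ and $P(y|x;\theta)$ directly. The helper names `sigmoid` and `prob` and the example numbers are assumptions for illustration; the branch on the sign of `z` is an implementation detail that avoids overflow in `math.exp` and is not part of the lecture.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def prob(y, x, theta):
    """P(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return h ** y * (1.0 - h) ** (1 - y)

# Hypothetical parameters and input (x[0] = 1 is the intercept feature).
theta = [-1.0, 0.5]
x = [1.0, 3.0]
print(prob(1, x, theta), prob(0, x, theta))  # the two probabilities sum to 1
```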
Under this assumption, the likelihood is as follows.
$\begin{align} \mathcal{L}(\theta) &= p(\vec{y}|x;\theta)\\ &=\prod^{m}_{i=1}p(y^{(i)}|x^{(i)};\theta)\\ &=\prod^{m}_{i=1}h_{\theta}(x^{(i)})^{y^{(i)}}(1-h_{\theta}(x^{(i)}))^{1-y^{(i)}} \end{align}$
The MLE is the value of $\theta$ that maximizes the likelihood. As with linear regression, we use the log likelihood to simplify the computation.
$\begin{align} \mathcal{l}(\theta) &=\log\mathcal{L}(\theta)\\ &=\sum\limits^{m}_{i=1}y^{(i)}\log h_{\theta}(x^{(i)})+(1-y^{(i)})\log (1-h_{\theta}(x^{(i)})) \end{align}$
To find the $\theta$ that maximizes $\mathcal{l}(\theta)$, we use the batch gradient ascent algorithm.
$\theta_j := \theta_j + \alpha \frac{\partial}{\partial \theta_j}\mathcal{l}(\theta)$
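Unlike gradient descent, which minimizes $J(\theta)$, here we must maximize $\mathcal{l}(\theta)$, so the update uses $+$ instead of $-$. Computing the derivative gives the following update, which sums over all training examples.
$\theta_j := \theta_j + \alpha \sum^{m}_{i=1}(y^{(i)}-h_{\theta}(x^{(i)}))x^{(i)}_{j}$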
With the sigmoid function, $\mathcal{l}(\theta)$ is concave: it has no local maxima other than a single global maximum. This is one reason for using the sigmoid function.
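The batch gradient ascent update can be sketched as follows. This is a minimal illustration, not the lecture's code: the dataset, the learning rate `alpha=0.05`, and the iteration count are hypothetical choices, and the first feature is fixed to 1 to act as the intercept.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(theta, X, y):
    """l(theta) = sum_i y_i * log h_i + (1 - y_i) * log(1 - h_i)."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        total += yi * math.log(h) + (1 - yi) * math.log(1.0 - h)
    return total

def fit_logistic(X, y, alpha=0.05, iters=5000):
    """Batch gradient ascent on the log likelihood, starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        # Residuals y_i - h_theta(x_i) over the whole training set.
        errors = [yi - sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
                  for xi, yi in zip(X, y)]
        # Gradient component j: sum_i (y_i - h_i) * x_ij.
        grad = [sum(e * xi[j] for e, xi in zip(errors, X))
                for j in range(len(theta))]
        # Ascent step: note the plus sign.
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical, non-separable toy data: features are [1, x].
X = [[1.0, 0.5 * k] for k in range(1, 9)]
y = [0, 0, 1, 0, 1, 0, 1, 1]

theta = fit_logistic(X, y)
print(theta, log_likelihood(theta, X, y))
```

The labels are deliberately not linearly separable; for perfectly separable data the log likelihood has no finite maximizer and $\theta$ would keep growing.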

Newton's method

Gradient descent and gradient ascent can take many iterations to converge. Newton's method takes much larger jumps, so it can reach a good value in far fewer iterations.

Newton's method addresses a slightly different problem: given a function $f$, find a $\theta$ such that $f(\theta)=0$. In logistic regression, the $\theta$ that maximizes $\mathcal{l}(\theta)$ satisfies $\mathcal{l}^{\prime}(\theta)=0$, so we can apply Newton's method with $f=\mathcal{l}^{\prime}$.
$\theta^{(t+1)}:=\theta^{(t)}-\frac{f(\theta^{(t)})}{f^{\prime}(\theta^{(t)})}$
Newton's method has the property of "quadratic convergence": near the solution, the error is roughly squared at each iteration, for example $0.01,\ 0.0001,\ 0.00000001, \cdots$.
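The scalar update can be tried on a simple root-finding problem. The sketch below uses a hypothetical example, $f(\theta)=\theta^{2}-2$ with root $\sqrt{2}$, rather than the log likelihood, purely to make the shrinking error easy to see.

```python
import math

def newton(f, f_prime, theta, steps):
    """Run Newton's method for f(theta) = 0 and return all iterates."""
    history = [theta]
    for _ in range(steps):
        theta = theta - f(theta) / f_prime(theta)
        history.append(theta)
    return history

# Hypothetical example: solve theta^2 - 2 = 0 starting from theta = 1.
history = newton(lambda t: t * t - 2.0, lambda t: 2.0 * t, theta=1.0, steps=5)
errors = [abs(t - math.sqrt(2.0)) for t in history]
for t, e in zip(history, errors):
    print(t, e)
```

Each printed error is smaller than the square of the previous one, so the number of correct digits roughly doubles per iteration.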

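When $\theta$ is a vector, the update has to be written in vector form.
$\theta^{(t+1)}:=\theta^{(t)}-H^{-1}\nabla_{\theta}\mathcal{l}(\theta^{(t)})$
$H$ is a square matrix called the Hessian, with one row and one column per component of $\theta$ (n-by-n for an n-dimensional $\theta$).
$H_{ij}=\frac{\partial^2\mathcal{l}(\theta)}{\partial\theta_{i}\partial\theta_{j}}$
The drawback of Newton's method is that when $\theta$ is high-dimensional, inverting the Hessian is very expensive, so each iteration becomes very costly.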
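The vector update can be sketched for logistic regression with two parameters, where the Hessian is only 2x2 and can be inverted by hand. This is a minimal illustration under stated assumptions: the Hessian formula $H_{jk}=-\sum_{i}h_{\theta}(x^{(i)})(1-h_{\theta}(x^{(i)}))x^{(i)}_{j}x^{(i)}_{k}$ is the standard result for the logistic log likelihood and is not derived in these notes, and the dataset and function names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def newton_logistic(X, y, iters=10):
    """Newton's method for two-parameter logistic regression, from theta = 0."""
    theta = [0.0, 0.0]
    for _ in range(iters):
        g0 = g1 = 0.0          # gradient of the log likelihood
        h00 = h01 = h11 = 0.0  # entries of the symmetric 2x2 Hessian
        for (x0, x1), yi in zip(X, y):
            h = sigmoid(theta[0] * x0 + theta[1] * x1)
            g0 += (yi - h) * x0
            g1 += (yi - h) * x1
            s = h * (1.0 - h)
            h00 -= s * x0 * x0
            h01 -= s * x0 * x1
            h11 -= s * x1 * x1
        # d = H^{-1} * gradient, using the closed-form 2x2 inverse.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Newton update: theta := theta - H^{-1} * gradient.
        theta = [theta[0] - d0, theta[1] - d1]
    return theta

# Hypothetical, non-separable toy data: features are [1, x].
X = [[1.0, 0.5 * k] for k in range(1, 9)]
y = [0, 0, 1, 0, 1, 0, 1, 1]

theta = newton_logistic(X, y)
print(theta)
```

Ten iterations are far more than this small problem needs; the point is that each iteration requires building and inverting $H$, which is cheap for two parameters but becomes the dominant cost when $\theta$ is high-dimensional.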
