
$$ \text{L2 regularization: } R(W) = \sum_k \sum_l W_{k,l}^2 \\\text{L1 regularization: } R(W) = \sum_k \sum_l |W_{k,l}| $$
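As a minimal sketch (assuming NumPy; the weight matrix `W` and its shape below are just an example), both penalties are a single sum over every entry of W:

```python
import numpy as np

# Hypothetical weight matrix, e.g. 10 classes x 3073 input dimensions.
W = np.random.randn(10, 3073)

# L2 regularization: R(W) = sum_k sum_l W_{k,l}^2
l2_penalty = np.sum(W ** 2)

# L1 regularization: R(W) = sum_k sum_l |W_{k,l}|
l1_penalty = np.sum(np.abs(W))

print("L2 penalty:", l2_penalty)
print("L1 penalty:", l1_penalty)
```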

What is regularization, and why is it used?

It is used to make the model perform well not only on the training data but also on test (unseen) data.

The model may do slightly worse on the training data but better on the test data, so generalization improves.

In other words, it’s used to avoid overfitting.
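One common way to write the full objective (the data-loss term and the symbols $N$, $L_i$, $x_i$, $y_i$, and $\lambda$ are assumptions here, not defined above) is to add $R(W)$ to the data loss, with a hyperparameter $\lambda$ controlling how strongly simplicity is preferred:

$$ L(W) = \underbrace{\frac{1}{N}\sum_{i=1}^{N} L_i\big(f(x_i; W),\, y_i\big)}_{\text{data loss}} + \underbrace{\lambda\, R(W)}_{\text{regularization}} $$

A larger $\lambda$ pushes the weights toward smaller (L2) or sparser (L1) values, trading some training accuracy for better generalization.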

[Figure: two candidate fits to the same training points; f2 is the simpler one]

Intuition (Occam's razor): f2 might not fit/classify the training data perfectly, but it is simpler.

So f2 is preferred.

Occam's razor:

"Among multiple competing hypotheses, the simplest is the best." (William of Ockham, 1285-1347)