
The problem with sigmoid is that its derivative is at most 1/4 (at the center) and approaches 0 in both tails, so stacking sigmoid layers multiplies the backward signal by a small factor at every layer and kills the gradient.
See Andrej Karpathy's explanation in his blog post: https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b
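
Here is a minimal numpy sketch (my own example, not from Karpathy's post) showing why stacked sigmoids shrink the gradient: the derivative sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) peaks at 0.25, so by the chain rule each layer scales the gradient by at most 1/4.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative: sigmoid(x) * (1 - sigmoid(x)), maximum value 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25  -- the best case
print(sigmoid_grad(5.0))   # ~0.0066 -- already tiny in the tail

# Chain rule through 10 sigmoid layers, even at the best-case input:
print(0.25 ** 10)          # ~9.5e-07 -- the gradient has effectively vanished
```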


ReLU is great: it's cheap to compute and introduces non-linearity. But it has a problem called dead (or dying) ReLUs: if a unit's pre-activations are mostly negative, its output is zero, the gradient through it is zero, and the unit stops learning (a quick sketch below).
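
A small sketch of the dying-ReLU effect (an assumed toy example): once a unit's pre-activations are all negative, both its output and its gradient are zero, so no learning signal flows back through it.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient of ReLU: 1 where x > 0, else 0
    return (x > 0).astype(float)

# Pre-activations that ended up all negative (e.g. after a bad update or large bias shift)
z = np.array([-3.2, -0.7, -1.5, -4.1])
print(relu(z))       # [0. 0. 0. 0.] -- the unit is silent
print(relu_grad(z))  # [0. 0. 0. 0.] -- and receives zero gradient, so it stays dead
```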

A more recently introduced activation function is GELU.
It has a higher computational cost than ReLU, but small negative inputs are preserved rather than zeroed out, while large negative inputs still go to zero, much like ReLU.
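
A rough comparison sketch (my own example, using the common tanh approximation of GELU): unlike ReLU, GELU lets small negative inputs through with a small non-zero output, while large negative inputs are squashed to roughly zero.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.array([-6.0, -1.0, -0.5, 0.5, 2.0])
print(relu(xs))  # [0.    0.    0.    0.5   2.   ]
print(gelu(xs))  # approx [-0.    -0.159 -0.154  0.346  1.955] -- small negatives survive
```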