The problem with sigmoid is that its derivative is at most 1/4 (at x = 0) and approaches 0 in both tails, so stacking sigmoid layers multiplies these small factors together and kills the gradient.
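
To make this concrete, here is a minimal NumPy sketch (the helper names are just for illustration): the derivative of sigmoid is sigmoid(x) * (1 - sigmoid(x)), which peaks at 0.25 at x = 0 and decays toward 0 in the tails, so a stack of n sigmoid layers contributes at most 0.25^n to the backpropagated gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # maximum value is 0.25, reached at x = 0

print(sigmoid_grad(0.0))   # 0.25    -> the best case
print(sigmoid_grad(6.0))   # ~0.0025 -> in the tail the gradient is almost gone

# Each sigmoid layer multiplies the gradient by at most 0.25,
# so a deep stack of sigmoids shrinks it exponentially:
n_layers = 10
print(0.25 ** n_layers)    # ~9.5e-07 -> vanishing gradient
```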

Andrej Karpathy's explanation from his blog: https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b

ReLU

ReLU is great: it is simple to compute and introduces non-linearity. But it has a problem called dead (or dying) ReLUs: if a neuron's pre-activation is negative for most inputs, its gradient is zero there and the neuron stops learning. A sketch of this is shown below.
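
The dying-ReLU behaviour follows directly from the gradient; here is a minimal NumPy sketch (the names are illustrative, not from a specific library):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for positive pre-activations, exactly 0 for negative ones
    return (x > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.2, 2.0])
print(relu(pre_activations))       # [0.  0.  0.2 2. ]
print(relu_grad(pre_activations))  # [0. 0. 1. 1.]

# If a neuron's pre-activation is negative for (almost) every input,
# its gradient is 0 everywhere it is evaluated, its weights never
# update, and the neuron is effectively dead.
```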

GELU (Gaussian Error Linear Unit)

A more recently introduced activation function is GELU.

GELU has a higher computational cost than ReLU, but it preserves small negative values rather than zeroing them out; only for large negative inputs does the output go to zero.
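
For reference, GELU(x) = x * Φ(x), where Φ is the standard normal CDF. Below is a minimal NumPy/SciPy sketch of the exact form and the widely used tanh approximation (the approximation is an assumption added here, not taken from the text above):

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Common tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.array([-6.0, -1.0, -0.1, 0.1, 1.0])
print(gelu_exact(x))  # large negatives ~0, small negatives stay slightly negative
print(gelu_tanh(x))   # very close to the exact values
```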

Where are activations used in CNNs?