
The problem with sigmoid is that its derivative is at most 1/4 (at the center) and approaches 0 in both tails, so stacking sigmoid layers multiplies the backward signal by a small factor at every layer and kills the gradient.
See Andrej Karpathy's explanation in his blog post: https://karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b
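
Here is a minimal numpy sketch (my own example, not from Karpathy's post) showing why stacked sigmoids shrink the gradient: the derivative sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) peaks at 0.25, so by the chain rule each layer scales the gradient by at most 1/4.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative: sigmoid(x) * (1 - sigmoid(x)), maximum value 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25  -- the best case
print(sigmoid_grad(5.0))   # ~0.0066 -- already tiny in the tail

# Chain rule through 10 sigmoid layers, even at the best-case input:
print(0.25 ** 10)          # ~9.5e-07 -- the gradient has effectively vanished
```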


ReLU is great: it's cheap to compute and introduces non-linearity. But it has a problem called dead (or dying) ReLUs: if a unit's pre-activations are mostly negative, its output is zero, the gradient through it is zero, and the unit stops learning (a quick sketch below).
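
A small sketch of the dying-ReLU effect (an assumed toy example): once a unit's pre-activations are all negative, both its output and its gradient are zero, so no learning signal flows back through it.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient of ReLU: 1 where x > 0, else 0
    return (x > 0).astype(float)

# Pre-activations that ended up all negative (e.g. after a bad update or large bias shift)
z = np.array([-3.2, -0.7, -1.5, -4.1])
print(relu(z))       # [0. 0. 0. 0.] -- the unit is silent
print(relu_grad(z))  # [0. 0. 0. 0.] -- and receives zero gradient, so it stays dead
```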

A more recently introduced activation function is GELU.
It has a higher computational cost than ReLU, but small negative inputs are preserved rather than zeroed out, while large negative inputs still go to zero, much like ReLU.
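
A rough comparison sketch (my own example, using the common tanh approximation of GELU): unlike ReLU, GELU lets small negative inputs through with a small non-zero output, while large negative inputs are squashed to roughly zero.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.array([-6.0, -1.0, -0.5, 0.5, 2.0])
print(relu(xs))  # [0.    0.    0.    0.5   2.   ]
print(gelu(xs))  # approx [-0.    -0.159 -0.154  0.346  1.955] -- small negatives survive
```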