Images (or, more generally, inputs) are represented as vectors or matrices. For example, a 32x32x3 image (height x width x channels) contains 32 · 32 · 3 = 3072 values.
This input $x$ is passed to a neural network $f(x, W)$ with parameters $W$, and the network produces a prediction suited to the task.
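
To make the shapes concrete, here is a minimal sketch that unrolls a 32x32x3 image into a 3072-element vector and feeds it to a stand-in for $f(x, W)$. The notes do not specify the form of $f$, so the linear scoring function, the 10 classes, and the names `flatten`, `linear_scores`, and `NUM_CLASSES` are all assumptions made for illustration. It uses only the Python standard library.

```python
import random

H, W_PX, C = 32, 32, 3      # height, width, channels (from the example above)
NUM_CLASSES = 10            # hypothetical number of output classes

random.seed(0)

# A fake image stored as nested lists: image[row][col][channel].
image = [[[random.random() for _ in range(C)] for _ in range(W_PX)]
         for _ in range(H)]


def flatten(img):
    """Unroll an h x w x c image into one flat list of pixel values."""
    return [value for row in img for pixel in row for value in pixel]


def linear_scores(x, weights):
    """Assumed form of f(x, W): one dot product per class, no bias term."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in weights]


x = flatten(image)

# W has one row of 3072 small random weights per class.
weights = [[random.gauss(0, 0.01) for _ in range(len(x))]
           for _ in range(NUM_CLASSES)]

scores = linear_scores(x, weights)
predicted_class = max(range(NUM_CLASSES), key=lambda k: scores[k])

print(len(x), len(scores), predicted_class)
```

With random weights the predicted class is meaningless. The point is only that a 3072-value input is mapped to one score per class.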
There are three ways to understand this:
How do we train them?
We need a loss function that measures how well the neural network's predictions match the targets.
For example, we can compare the prediction with the ground-truth label $y$ and compute the loss from the difference $y - f(x, W)$.
There are two types of loss: the data loss (based on $y - f(x, W)$) and the regularization loss. They are summed to get the final loss $L$.
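
The sum of the two terms can be sketched as follows. The notes only say that the data loss is built from $y - f(x, W)$ and that a regularization loss is added, so the specific choices here (a squared difference for the data loss, an L2 penalty on the weights, and the `strength` parameter) are assumptions, as are the function names and the toy numbers.

```python
def data_loss(y, prediction):
    """Squared difference between label and prediction (one possible choice)."""
    return (y - prediction) ** 2


def regularization_loss(weights, strength):
    """L2 penalty: strength times the sum of squared weights (assumed form)."""
    return strength * sum(w * w for w in weights)


def total_loss(y, prediction, weights, strength=0.1):
    """Final loss L = data loss + regularization loss."""
    return data_loss(y, prediction) + regularization_loss(weights, strength)


# Toy example: a single linear unit standing in for f(x, W).
weights = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
y = 5.0

prediction = sum(w * x_i for w, x_i in zip(weights, x))
loss = total_loss(y, prediction, weights)

print(prediction, loss)
```

The data term depends on how far the prediction is from the label, while the regularization term depends only on the weights. The second term therefore penalizes large weights even when the prediction is accurate.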
Once we have the loss function, we need to optimize (that is, update) the neural network’s parameters so that it performs better. There are multiple ways to do this, such as: