Question
Demonstrate that a neural network trained to maximize the log likelihood of observing the training data is one that has softmax output nodes and minimizes the criterion function given by the negative log probability of the training data set:

$$J_0(\mathbf{w}) = -\log p\big(\{(\mathbf{x}_n, t_n) : n = 1, 2, \dots, N\}; \mathbf{w}\big) = -\log \prod_{n=1}^{N} p(t_n \mid \mathbf{x}_n; \mathbf{w})$$

Demonstrate that a neural network trained to maximize the a posteriori likelihood of observing the training data, given a Gaussian prior on the weight distribution $p(\mathbf{w}; \sigma_w) = \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$, is one that minimizes the criterion function with L2 regularization:

$$J(\mathbf{w}) = J_0(\mathbf{w}) - \log p(\mathbf{w}; \sigma_w)$$

Explanation / Answer
Neural networks with at least one hidden layer are universal approximators, which means that they can approximate any continuous function. The approximation can be improved by increasing the number of hidden neurons in the network, though doing so also increases the risk of overfitting.
A key advantage of neural networks is that they are capable of learning useful features on their own, without much human involvement.
Softmax function -
The softmax function (called such because it behaves like a "softened" maximum function) may be used as the output layer's activation function. It takes the form:

$$\text{softmax}(\mathbf{z})_k = \frac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}, \qquad k = 1, \dots, C$$
Softmax is usually used for multinomial logistic regression (i.e. multiclass classification) because it produces a categorical distribution, squashing the activation values so that they lie between 0 and 1 and sum to 1. In our lab we have also tried to use it to implement a different type of penalty (an entropy-based one) on distributions.
This function has the properties that its outputs are all positive and sum to 1, which make it well suited to modeling probability distributions.
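To make this concrete, here is a minimal NumPy sketch of a softmax function (my own illustration; the function name and the max-subtraction trick for numerical stability are not part of the original answer):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating
    so exp() never overflows; the result is unchanged because the shared
    constant cancels in the ratio."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

# Example: three output activations squashed into a probability distribution.
probs = softmax([2.0, 1.0, 0.1])
print(probs)        # approximately [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 -- sums to one, all entries positive
```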
The cost function to use with softmax is the (categorical) cross-entropy loss function. It has the nice property of having a very big gradient when the target value is 1 and the output is almost 0.
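To see why the gradient behaves this way, here is a short worked step (my own addition, using $\hat{y}_c$ for the softmax output assigned to the correct class $c$):

```latex
% Cross-entropy loss for a single example whose target class is c (t_c = 1):
L = -\log \hat{y}_c
% Gradient with respect to the predicted probability of that class:
\frac{\partial L}{\partial \hat{y}_c} = -\frac{1}{\hat{y}_c}
% This grows without bound as \hat{y}_c \to 0: the gradient is very large
% precisely when the target is 1 and the predicted probability is near 0.
```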
Negative Log-Likelihood -
In practice, the softmax function is used in tandem with the negative log-likelihood. This loss function is quite interesting when we interpret it in relation to the behavior of softmax. First, let's write down our loss function:

$$L = -\sum_{i} \log \hat{y}_{i,\,c_i}$$
Here $\hat{y}_{i,\,c_i}$ is the softmax probability assigned to the correct class $c_i$ of example $i$; the loss is summed over all training examples, one term per correct class.
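As a small sketch of this loss in code (my own illustration, assuming integer class targets and a batch of pre-softmax logits; none of the names below come from the original answer):

```python
import numpy as np

def softmax(logits):
    """Row-wise, numerically stable softmax for a batch of logit vectors."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp_l = np.exp(shifted)
    return exp_l / exp_l.sum(axis=1, keepdims=True)

def negative_log_likelihood(logits, targets):
    """Negative log probability assigned to the correct class of each
    example, summed over the training set (averaging is also common)."""
    probs = softmax(logits)
    n = logits.shape[0]
    # Pick out the probability of the correct class for every example.
    correct_class_probs = probs[np.arange(n), targets]
    return -np.sum(np.log(correct_class_probs))

# Example: 3 examples, 4 classes; targets give the correct class index.
logits = np.array([[2.0, 0.5, 0.1, -1.0],
                   [0.2, 3.0, 0.3,  0.0],
                   [1.0, 1.0, 1.0,  1.0]])
targets = np.array([0, 1, 3])
print(negative_log_likelihood(logits, targets))
```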
Recall that when training a model, we aspire to find the minima of a loss function given a set of parameters (in a neural network, these are the weights and biases). We can interpret the loss as the “unhappiness” of the network with respect to its parameters. The higher the loss, the higher the unhappiness: we don’t want that. We want to make our models happy.
For Example -
Suppose I have N images and $y_i$ is the label of image $i$, where $y_i \in \mathbb{R}^{C \times 1}$ is a binary (one-hot) vector of length C (the number of classes); $y_{ic} = 1$ when image $i$ belongs to class $c$.
Consider the following two loss functions.
$$L_1 = -\sum_{i} \sum_{c=1}^{C} y_{ic} \log P(y_{ic} \mid \mathcal{D})$$
$$L_2 = \sum_{i} \sum_{c=1}^{C} \big(y_{ic} - P(y_{ic} \mid \mathcal{D})\big)^2$$
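A quick sketch of how the two criteria could be computed (my own illustration; Y, P, and the helper names are assumptions, with Y a matrix of one-hot labels and P the model's predicted class probabilities):

```python
import numpy as np

def cross_entropy_loss(Y, P, eps=1e-12):
    """L1: negative sum of y_ic * log P(y_ic | D) over examples i and classes c."""
    return -np.sum(Y * np.log(P + eps))

def squared_error_loss(Y, P):
    """L2: sum of (y_ic - P(y_ic | D))^2 over examples i and classes c."""
    return np.sum((Y - P) ** 2)

# Example: N = 2 images, C = 3 classes, one-hot labels and predicted probabilities.
Y = np.array([[1, 0, 0],
              [0, 0, 1]], dtype=float)
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
print(cross_entropy_loss(Y, P))   # ~0.867 = -(log 0.7 + log 0.6)
print(squared_error_loss(Y, P))   # sum of squared differences
```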
The squared-error criterion L2 is not always used with neural networks; indeed, for statistical pattern recognition problems the cross-entropy loss L1 (with a softmax activation function for the output layer) is the preferred option.
Secondly, I would say that L1 corresponds to a maximum likelihood approach rather than a maximum a posteriori one, as there is no prior distribution involved, just the likelihood.
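For the second part of the question, the prior is exactly what turns maximum likelihood into maximum a posteriori. Written out (a standard textbook derivation, sketched here in the notation of the question as reconstructed above):

```latex
% Maximum a posteriori estimation with a Gaussian prior on the weights:
\hat{\mathbf{w}}_{\text{MAP}}
  = \arg\max_{\mathbf{w}} \, p(\mathbf{w} \mid \mathcal{D})
  = \arg\max_{\mathbf{w}} \, p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}; \sigma_w)
% Take the negative logarithm to turn the product into a sum to be minimized:
J(\mathbf{w})
  = -\log p(\mathcal{D} \mid \mathbf{w}) - \log p(\mathbf{w}; \sigma_w)
  = J_0(\mathbf{w}) + \frac{1}{2\sigma_w^2}\,\|\mathbf{w}\|^2 + \text{const}
% since -\log \mathcal{N}(\mathbf{w}; \mathbf{0}, \sigma_w^2 \mathbf{I})
%     = \frac{1}{2\sigma_w^2}\|\mathbf{w}\|^2 + \text{const}.
% Minimizing J is therefore maximum likelihood plus an L2 weight penalty.
```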
Asymptotically (in the limit of an infinite amount of data and hidden units), both L1 and L2 will give the same answer, because the minimiser of L1 is achieved when the output of the model is the true probability of class membership, and the minimiser of L2 is the conditional mean of the target variable, which in this case is also the true probability of class membership.
The difference arises away from these asymptotic conditions (i.e. more or less every practical case), where I would suggest that L1 is probably more efficient in converging to the optimal solution in terms of the number of training samples given, but I am not confident that the practical difference is likely to be great for most problems.
In practice, most reasonable loss functions are suitable for training neural networks. However, I almost always use a maximum likelihood criterion for training, because it is the most theoretically sound approach: it most accurately describes the variability of the target variable around its conditional mean.