


Question

As noted earlier, gradient descent can be performed for any function E that is differentiable with respect to the parameterized hypothesis space. While the basic Backpropagation algorithm defines E in terms of the sum of squared errors of the network, other definitions have been suggested in order to incorporate other constraints into the weight-tuning rule. For each new definition of E, a new weight-tuning rule for gradient descent must be derived.

One example of an alternative definition of E is adding a penalty term for weight magnitude. As discussed above, we can add a term to E that increases with the magnitude of the weight vector. This causes the gradient descent search to seek weight vectors with small magnitudes, thereby reducing the risk of overfitting. One way to do this is to redefine E as

E(w) = (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_kd − o_kd)² + γ Σ_{i,j} w_ji²

which yields a weight-update rule identical to the Backpropagation rule, except that each weight is multiplied by the constant (1 − 2γη) upon each iteration. Thus, choosing this definition of E is equivalent to using a weight decay strategy (see Exercise 4.10).

Derive the gradient descent update rule for this definition of E. Note that D is the set of training examples, t_kd is the target output for output unit k and training example d, and o_kd is the computed (actual) output for output unit k and training example d.

The standard Backpropagation training algorithm is given below for reference:

Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units.
Initialize all network weights to small random numbers (e.g., between −0.05 and 0.05).
Until the termination condition is met, do:
    For each training example, do:
        1. Propagate the input forward through the network: input the instance x to the network and compute the output o_u of every unit u in the network.
        2. Propagate the errors backward through the network: for each output unit k, calculate its error term; for each hidden unit h, calculate its error term.
        3. Update each network weight w_ji.

Explanation / Answer

Gradient descent

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of an approximate gradient) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then called gradient ascent.

Gradient descent is based on the observation that if a multivariable function E(x) is defined and differentiable in a neighborhood of a point a, then E(x) decreases fastest if one goes from a in the direction of the negative gradient of E at a, that is, −∇E(a). It follows that, if

b = a − γ ∇E(a)

for γ small enough, then E(a) ≥ E(b). With this observation in mind, one starts with a guess x₀ for a local minimum of E and considers the sequence x₀, x₁, x₂, … such that

x_{n+1} = x_n − γ ∇E(x_n),  n ≥ 0.

Here E is assumed to be defined on the plane, and its graph has a bowl shape. (The accompanying figure is not reproduced here; in it, the blue curves are the contour lines, that is, the regions on which the value of E is constant.)
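Applying this general update to the penalized error function stated in the question gives the requested rule. The following is a sketch of the derivation, using the question's notation (δ_jd and x_jid denote the usual Backpropagation error term for unit j and its i-th input on example d):

```latex
E(\vec{w}) = \frac{1}{2}\sum_{d \in D}\sum_{k \in outputs}(t_{kd}-o_{kd})^2
           + \gamma\sum_{i,j} w_{ji}^2

% The penalty term contributes 2\gamma w_{ji} to the partial derivative:
\frac{\partial E}{\partial w_{ji}}
  = \frac{\partial}{\partial w_{ji}}\Big[\frac{1}{2}\sum_{d \in D}\sum_{k}(t_{kd}-o_{kd})^2\Big]
  + 2\gamma\, w_{ji}

% Gradient descent step; the first term is the usual Backpropagation update:
\Delta w_{ji} = -\eta\,\frac{\partial E}{\partial w_{ji}}
             = \eta\sum_{d \in D}\delta_{jd}\, x_{jid} - 2\gamma\eta\, w_{ji}

% Equivalently, each weight is first decayed by the factor (1 - 2\gamma\eta):
w_{ji} \leftarrow (1-2\gamma\eta)\, w_{ji} + \eta\sum_{d \in D}\delta_{jd}\, x_{jid}
```

The last line shows the claimed equivalence: the rule is the standard Backpropagation rule with each weight multiplied by the constant (1 − 2γη) on every iteration, i.e., weight decay.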

EXAMPLE OF GRADIENT DESCENT IN PYTHON

# Find a local minimum of f(a) = a**4 - 3*a**3 via gradient descent.
a_old = 0.0
a_new = 4.0            # starting guess
gamma = 0.01           # step size (learning rate)
precision = 0.00001    # stop when successive iterates differ by less than this

def f_derivative(a):
    # Derivative of f(a) = a**4 - 3*a**3
    return 4 * a**3 - 9 * a**2

while abs(a_new - a_old) > precision:
    a_old = a_new
    a_new = a_old - gamma * f_derivative(a_old)

print("Local minimum occurs at", a_new)
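The weight-decay equivalence asked about in the question can also be checked numerically. The sketch below is illustrative, not from the original: the names eta, gamma, w, and grad_sse, as well as their values, are assumptions chosen for the example. It applies one gradient step on the penalized error directly, then applies the "multiply each weight by (1 − 2γη)" form, and the two results agree:

```python
# One gradient-descent step on E = E_sse + gamma * sum(w**2),
# compared against the equivalent weight-decay form of the update.

eta = 0.1       # learning rate (illustrative value)
gamma = 0.05    # penalty coefficient (illustrative value)

w = 0.8             # a single illustrative weight
grad_sse = -0.3     # assumed gradient of the sum-of-squared-errors term at w

# Direct gradient descent on the penalized error:
#   dE/dw = dE_sse/dw + 2*gamma*w
w_direct = w - eta * (grad_sse + 2 * gamma * w)

# Weight-decay form: shrink the weight by (1 - 2*gamma*eta),
# then apply the usual Backpropagation step.
w_decay = (1 - 2 * gamma * eta) * w - eta * grad_sse

print(w_direct, w_decay)  # both forms give the same updated weight
```

Expanding the direct form, w − η·grad_sse − 2γη·w, regroups into (1 − 2γη)·w − η·grad_sse, which is exactly the weight-decay form, so the agreement holds for any values of the constants.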