Question 4

Let f be some function so that f(θ0, θ1) outputs a number. For this problem, f is some arbitrary/unknown smooth function (not necessarily the cost function of linear regression, so f may have local optima).

Suppose we use gradient descent to try to minimize f(θ0, θ1) as a function of θ0 and θ1. Which of the following statements are true? (Check all that apply.)

  • Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of f(θ0, θ1).
  • If the learning rate is too small, then gradient descent may take a very long time to converge.
  • If θ0 and θ1 are initialized at a local minimum, then one iteration will not change their values.
  • If θ0 and θ1 are initialized so that θ0 = θ1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent, we will still have θ0 = θ1.

Answers:

| True or False | Statement | Explanation |
| --- | --- | --- |
| True | If the learning rate is too small, then gradient descent may take a very long time to converge. | If the learning rate is small, gradient descent ends up taking an extremely small step on each iteration, and therefore can take a long time to converge. |
| True | If θ0 and θ1 are initialized at a local minimum, then one iteration will not change their values. | At a local minimum, the derivative (gradient) is zero, so gradient descent will not change the parameters. |
| False | Even if the learning rate α is very large, every iteration of gradient descent will decrease the value of f(θ0, θ1). | If the learning rate is too large, one step of gradient descent can “overshoot” and actually increase the value of f(θ0, θ1). |
| False | If θ0 and θ1 are initialized so that θ0 = θ1, then by symmetry (because we do simultaneous updates to the two parameters), after one iteration of gradient descent, we will still have θ0 = θ1. | The updates to θ0 and θ1 are different (even though we’re doing simultaneous updates), so there’s no particular reason for them to remain equal after one iteration of gradient descent. |
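
To make these answers concrete, here is a minimal gradient-descent sketch (not part of the original quiz). The function f_example, the learning rates, and the iteration count are illustrative assumptions; the gradient is approximated numerically, and the simultaneous update mirrors the update rule the question refers to.

```python
def grad_descent(f, theta0, theta1, alpha, iters=100, eps=1e-6):
    """Plain gradient descent on f(theta0, theta1) with simultaneous updates."""
    for _ in range(iters):
        # Numerical partial derivatives (central differences).
        d0 = (f(theta0 + eps, theta1) - f(theta0 - eps, theta1)) / (2 * eps)
        d1 = (f(theta0, theta1 + eps) - f(theta0, theta1 - eps)) / (2 * eps)
        # Simultaneous update: both new values use the *old* theta0 and theta1.
        theta0, theta1 = theta0 - alpha * d0, theta1 - alpha * d1
    return theta0, theta1

# Illustrative smooth function with a unique minimum at (3, -1).
f_example = lambda t0, t1: (t0 - 3) ** 2 + 2 * (t1 + 1) ** 2

print(grad_descent(f_example, 0.0, 0.0, alpha=0.1))    # small alpha: converges toward (3, -1)
print(grad_descent(f_example, 0.0, 0.0, alpha=1.5))    # large alpha: overshoots and diverges
print(grad_descent(f_example, 3.0, -1.0, alpha=0.1))   # start at the minimum: parameters stay put
```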

Other Options:

| True or False | Statement | Explanation |
| --- | --- | --- |
| True | If the first few iterations of gradient descent cause f(θ0, θ1) to increase rather than decrease, then the most likely cause is that we have set the learning rate α to too large a value. | If α were small enough, gradient descent should always take a tiny downhill step and decrease f(θ0, θ1) at least a little bit. If gradient descent instead increases the objective value, that means α is too large (or you have a bug in your code!). |
| False | No matter how θ0 and θ1 are initialized, so long as the learning rate is sufficiently small, we can safely expect gradient descent to converge to the same solution. | This is not true; depending on the initial conditions, gradient descent may end up at different local optima. |
| False | Setting the learning rate to be very small is not harmful, and can only speed up the convergence of gradient descent. | If the learning rate is small, gradient descent ends up taking an extremely small step on each iteration, so this would actually slow down (rather than speed up) the convergence of the algorithm. |
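
The “different local optima” point can also be illustrated with a small sketch (again my own illustration, not from the quiz). For simplicity it uses one parameter and a hand-picked non-convex function: the same small learning rate, started from two different points, settles into two different local minima.

```python
def grad_descent_1d(df, theta, alpha=0.01, iters=2000):
    """Gradient descent on a one-parameter function, given its derivative df."""
    for _ in range(iters):
        theta -= alpha * df(theta)
    return theta

# Illustrative non-convex function f(t) = t^4 - 4*t^2 + t, which has two local minima.
df = lambda t: 4 * t ** 3 - 8 * t + 1

print(round(grad_descent_1d(df, theta=-2.0), 3))  # converges to the local minimum near t ≈ -1.47
print(round(grad_descent_1d(df, theta=+2.0), 3))  # converges to the other local minimum near t ≈ +1.35
```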
