Optimization
A good video here explains the topic.
Assume $X = \{x_1, x_2, \dots, x_n\}$ is a sample drawn from an independent and identical distribution (i.i.d.) $P(x \mid \theta)$. The maximum likelihood estimate (MLE) of $\theta$ is

$$\hat{\theta}_{MLE} = \arg\max_{\theta} P(X \mid \theta) = \arg\max_{\theta} \prod_{i=1}^{n} P(x_i \mid \theta)$$

Deep Learning uses MLE. We do not know the real distribution of the dataset, so we maximize the likelihood of the training data under the model, usually through the equivalent log-likelihood:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(x_i \mid \theta)$$

The maximum a posteriori (MAP) estimate instead maximizes the posterior $P(\theta \mid X) = P(X \mid \theta) P(\theta) / P(X)$. Since $P(X)$ is not related to $\theta$,

$$\hat{\theta}_{MAP} = \arg\max_{\theta} P(X \mid \theta)\, P(\theta)$$

MAP is usually used in Bayesian Machine Learning. The difference from MLE is the prior probability $P(\theta)$.
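As a concrete sketch, the two estimates differ only through that prior. The Gaussian-mean setup, the prior, and all numbers below are illustrative assumptions, not something given in the text:

```python
import numpy as np

# Illustration only: MLE vs. MAP for the mean of a Gaussian with known variance.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=50)      # i.i.d. sample from P(x | theta)

# MLE: argmax_theta P(X | theta) -> for a Gaussian mean this is the sample mean.
theta_mle = X.mean()

# MAP: argmax_theta P(X | theta) * P(theta), with prior theta ~ N(mu0, tau2).
mu0, tau2, sigma2, n = 0.0, 1.0, 1.0, len(X)
theta_map = (tau2 * X.sum() + sigma2 * mu0) / (n * tau2 + sigma2)

print(theta_mle, theta_map)  # the MAP estimate is pulled toward the prior mean mu0
```

With a strong prior (small tau2) the MAP estimate stays close to mu0; as the sample size grows, the two estimates converge toward each other.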
Your cost function takes both your model output and your ground truth, e.g.:
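For example, a minimal mean-squared-error cost in Python (an illustrative choice, not the specific function the linked page uses):

```python
import numpy as np

def mse_cost(prediction: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between the model output and the ground truth."""
    return float(np.mean((prediction - target) ** 2))

# Cost of a prediction against the ground-truth labels.
print(mse_cost(np.array([0.8, 0.2]), np.array([1.0, 0.0])))  # ~0.04
```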
https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html
A neural network propagates the signal of the input data forward through its parameters towards the moment of decision, and then back propagates information about the error through the network so that it can alter the parameters one step at a time.
As seen above, forward propagation can be viewed as a long series of nested equations. If you think of feed forward this way, then back propagation is merely an application of the chain rule to find the derivatives of cost with respect to any variable in the nested equation. Given a forward propagation function:

$$f(x) = A(B(C(x)))$$
This simple technique extends to any variable within a function and allows us to precisely pinpoint the exact impact each variable has on the total output.
Let's use the chain rule to calculate the derivative of cost with respect to any weight in the network. The chain rule will help us identify how much each weight contributes to our overall error and the direction to update each weight to reduce our error. Here are the equations we need to make a prediction and calculate total error, or cost:

$$Z = XW, \qquad \hat{y} = R(Z), \qquad Cost = C(\hat{y}, y)$$

where $R$ is the activation function and $C$ is the cost function.
Given a network consisting of a single neuron, total cost could be calculated as:

$$Cost = C(R(Z(W))), \qquad Z(W) = XW$$
Using the chain rule we can easily find the derivative of Cost with respect to weight $W$:

$$C'(W) = C'(R) \cdot R'(Z) \cdot Z'(W)$$
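A minimal numeric version of this chain of derivatives for a single neuron, assuming a ReLU activation $R$ and a squared-error cost $C$ (both choices, and the numbers, are assumptions for illustration):

```python
# Single neuron: Z = X * W, prediction R = relu(Z), cost C = (R - y)^2
X, W, y = 1.5, 0.8, 1.0

Z = X * W
R = max(Z, 0.0)                  # ReLU activation
C = (R - y) ** 2                 # squared-error cost

# Chain rule: C'(W) = C'(R) * R'(Z) * Z'(W)
dC_dR = 2 * (R - y)
dR_dZ = 1.0 if Z > 0 else 0.0    # derivative of ReLU
dZ_dW = X
dC_dW = dC_dR * dR_dZ * dZ_dW

print(dC_dW)                     # gradient used to update W, e.g. W -= lr * dC_dW
```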
SGD performs frequent updates with a high variance that causes the objective function to fluctuate heavily, as in Image 1.
Pro: efficient gradient updates, since each step uses only a single training example.
Con: slow convergence speed; it needs a well-tuned learning rate.
Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples. Common mini-batch sizes range between 50 and 256, but can vary for different applications.
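A sketch of that loop for a linear model with an MSE loss (the model, the loss, and the batch size of 64 are assumptions for illustration):

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, epochs=100, batch_size=64):
    """Mini-batch gradient descent for linear regression with an MSE loss."""
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(X))              # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= lr * grad                     # one update per mini-batch
    return theta

X = np.random.default_rng(1).normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5])
print(minibatch_gd(X, y))   # approaches [1, -2, 0.5]
```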
Momentum is a method that helps accelerate SGD in the relevant direction and dampens the oscillations seen in the image below.
Compute the bias-corrected first moment estimate and the bias-corrected second raw moment estimate:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
Let's say your hypothesis function contains multiple parameters, defined as $\theta_0, \theta_1, \dots, \theta_n$.

Then we need to calculate the partial derivative of the cost function $J(\theta_0, \dots, \theta_n)$ with respect to each $\theta_j$, for $j$ from 0 to $n$.

Then we can simultaneously update each $\theta_j$ (with learning rate $\alpha$):

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \dots, \theta_n)$$
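"Simultaneously" matters: every partial derivative is evaluated at the current parameters before any $\theta_j$ is changed. A small sketch, assuming a linear-regression cost:

```python
import numpy as np

def gradient_descent_step(theta, X, y, lr=0.1):
    """One simultaneous update of every theta_j for an MSE-style cost."""
    predictions = X @ theta
    # All partial derivatives are evaluated at the *current* theta...
    grads = np.array([np.mean((predictions - y) * X[:, j]) for j in range(len(theta))])
    # ...and only then is every theta_j updated at once.
    return theta - lr * grads

theta = np.zeros(2)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # bias column plus one feature
y = np.array([1.0, 3.0, 5.0])                        # generated by y = 1 + 2x
for _ in range(2000):
    theta = gradient_descent_step(theta, X, y)
print(theta)   # approximately [1, 2]
```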
A, B, and C are activation functions at different layers. Using the chain rule, we can easily calculate the derivative of $f(x)$ with respect to $x$:

$$f'(x) = A'(B(C(x))) \cdot B'(C(x)) \cdot C'(x)$$
How about the derivative with respect to B? To find the derivative with respect to B, you can pretend $B(C(x))$ is a constant, replace it with a placeholder variable $B$, and proceed to find the derivative normally with respect to B.
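A quick numerical sanity check of this chain rule, using $\sin$, $\cos$, and a square as arbitrary stand-ins for the activation functions $A$, $B$, and $C$:

```python
import numpy as np

# f(x) = A(B(C(x))) with smooth stand-ins for the activation functions.
A = np.sin
B = np.cos
C = lambda x: x ** 2

def f(x):
    return A(B(C(x)))

x = 0.7
# Chain rule: f'(x) = A'(B(C(x))) * B'(C(x)) * C'(x)
analytic = np.cos(B(C(x))) * (-np.sin(C(x))) * (2 * x)

# Central finite difference for comparison.
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)

print(analytic, numeric)   # the two values agree closely
```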
Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training dataset.
Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$. Note that we shuffle the training data at every epoch.
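A per-example sketch of this update with per-epoch shuffling (the linear model and hyperparameters are assumptions for illustration):

```python
import numpy as np

def sgd(X, y, lr=0.01, epochs=50):
    """Stochastic gradient descent: one parameter update per (x_i, y_i) pair."""
    rng = np.random.default_rng(0)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):        # shuffle the training data every epoch
            xi, yi = X[i], y[i]
            grad = 2 * (xi @ theta - yi) * xi    # gradient from a single example
            theta -= lr * grad
    return theta

X = np.random.default_rng(1).normal(size=(200, 2))
y = X @ np.array([3.0, -1.0])
print(sgd(X, y))   # roughly [3, -1]
```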
Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. $\gamma < 1$). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
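In update form (following the usual formulation, with momentum term $\gamma$ typically around 0.9 and learning rate $\eta$): $v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta)$, then $\theta = \theta - v_t$. A small sketch on a toy quadratic (the objective and hyperparameters are assumptions for illustration):

```python
import numpy as np

def sgd_momentum(grad_fn, theta, lr=0.1, gamma=0.9, steps=200):
    """Gradient descent with momentum: v = gamma*v + lr*grad; theta -= v."""
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad_fn(theta)   # velocity builds up along consistent directions
        theta = theta - v
    return theta

# Toy objective: J(theta) = 0.5 * (10*x^2 + y^2), an elongated bowl with minimum at [0, 0].
grad = lambda t: np.array([10.0 * t[0], t[1]])
print(sgd_momentum(grad, np.array([1.0, 1.0])))   # approaches the minimum at [0, 0]
```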
Adaptive Moment Estimation (Adam) is a method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.
$m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients respectively.
The authors propose default values of 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$.
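Putting the pieces together, a minimal sketch of the Adam update using the default values above ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$); the toy objective and learning rate are assumptions for illustration:

```python
import numpy as np

def adam(grad_fn, theta, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    """Adam: adaptive per-parameter learning rates from moment estimates of the gradient."""
    m = np.zeros_like(theta)   # exponentially decaying average of past gradients
    v = np.zeros_like(theta)   # exponentially decaying average of past squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment estimate
        v_hat = v / (1 - beta2 ** t)              # bias-corrected second raw moment estimate
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Same toy objective as above: J(theta) = 0.5 * (10*x^2 + y^2).
grad = lambda t: np.array([10.0 * t[0], t[1]])
print(adam(grad, np.array([1.0, 1.0])))   # approaches the minimum at [0, 0]
```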