Optimization

MLE and MAP

A good video here explains the topic.

Assume $X = \{x_1, x_2, \dots, x_n\}$ is a sample drawn independently from an identical distribution (i.i.d.).

Maximum Likelihood Estimation:

$$\hat{\theta}_{MLE} = \arg\max_\theta P(\theta ; X) = \arg\max_\theta \log P(\theta ; X) = \arg\min_\theta \big(-\log P(\theta ; X)\big)$$

  • Deep learning uses MLE, since we do not know the true distribution of the dataset

Maximum A Posteriori

$$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid X)$$

By Bayes' rule, $P(\theta \mid X) = \frac{P(X \mid \theta) P(\theta)}{P(X)}$, so taking negative logs:

$$\hat{\theta}_{MAP} = \arg\min_\theta \big(-\log P(X \mid \theta) - \log P(\theta) + \log P(X)\big)$$

Since $P(X)$ does not depend on $\theta$,

$$\hat{\theta}_{MAP} = \arg\min_\theta \big(-\log P(X \mid \theta) - \log P(\theta)\big)$$

  • MAP is usually used in Bayesian machine learning

The difference between MLE and MAP is the prior probability $P(\theta)$.
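
As a concrete sketch (not from the notes above): assume a Gaussian likelihood with known variance and a zero-mean Gaussian prior on the mean. The MLE is then the sample mean, while the MAP estimate is shrunk toward the prior mean by the $-\log P(\theta)$ term:

```python
import numpy as np

# Hypothetical setup for illustration: x_i ~ N(theta, sigma^2) with known sigma,
# and prior theta ~ N(0, tau^2).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)    # observed sample X
sigma2, tau2 = 1.0, 0.5                        # known noise variance, prior variance
n = len(x)

# MLE: argmin -log P(X | theta)  ->  the sample mean
theta_mle = x.mean()

# MAP: argmin -log P(X | theta) - log P(theta)
# For a Gaussian likelihood + Gaussian prior this has the closed form below,
# which shrinks the MLE toward the prior mean 0.
theta_map = (n / sigma2) / (n / sigma2 + 1 / tau2) * x.mean()

print(theta_mle, theta_map)
```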

Gradient Descent

Let's say your hypothesis function $h_\theta$ contains multiple parameters $\theta_0, \theta_1, \dots, \theta_n$.

Your cost function takes both your model output and your ground truth, e.g.:

$$J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\big(h_{\theta}(x^i) - y^i\big)^2$$

Then we need to calculate the partial derivative of the cost function with respect to each $\theta_j$, for $j$ from 0 to $n$.

Then we can simultaneously update every $\theta_j$:

$$\theta_j = \theta_j - \alpha\frac{\partial J(\theta_0, \theta_1, \dots, \theta_n)}{\partial \theta_j}$$
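
A minimal sketch of this loop for linear regression, where $h_\theta(x) = \theta^T x$; the function name and toy data below are illustrative, not from the notes:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for J(theta) = 1/(2m) * sum((X @ theta - y)^2)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        error = X @ theta - y           # h_theta(x^i) - y^i for every example i
        grad = (X.T @ error) / m        # dJ/d(theta_j) for every j at once
        theta = theta - alpha * grad    # simultaneous update of all theta_j
    return theta

# Toy usage: the column of ones plays the role of theta_0's feature.
X = np.c_[np.ones(5), np.arange(5)]
y = 1.0 + 2.0 * np.arange(5)
print(gradient_descent(X, y))           # approaches [1.0, 2.0]
```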

Derivative of a matrix-matrix product

$$D = W \cdot X$$
$$dW = dD \cdot X^T$$
$$dX = W^T \cdot dD$$
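
A quick numerical sanity check of these formulas (a sketch; the scalar loss `np.sum(D)` is an arbitrary choice, so the upstream gradient `dD` is all ones):

```python
import numpy as np

rng = np.random.default_rng(0)
W, X = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))
D = W @ X

# Upstream gradient dL/dD for the arbitrary scalar loss L = sum(D).
dD = np.ones_like(D)
dW = dD @ X.T          # dL/dW
dX = W.T @ dD          # dL/dX

# Finite-difference check on one entry of W.
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
numeric = (np.sum(W2 @ X) - np.sum(W @ X)) / eps
print(np.isclose(numeric, dW[0, 0]))   # True
```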

Back propagation

https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html

A neural network propagates the signal of the input data forward through its parameters towards the moment of decision, and then back propagates information about the error through the network so that it can alter the parameters one step at a time.

Chain Rule

As seen above, forward propagation can be viewed as a long series of nested equations. If you think of feed-forward this way, then back propagation is merely an application of the chain rule to find the derivatives of the cost with respect to any variable in the nested equation. Given a forward propagation function:

$$f(x) = A(B(C(x)))$$

A, B, and C are activation functions at different layers. Using the chain rule, we easily calculate the derivative of $f(x)$ with respect to $x$:

$$f'(x) = f'(A) \cdot A'(B) \cdot B'(C) \cdot C'(x)$$

How about the derivative with respect to B? To find the derivative with respect to B you can pretend $B(C(x))$ is a constant, replace it with a placeholder variable B, and proceed to find the derivative normally with respect to B.

$$f'(B) = f'(A) \cdot A'(B)$$

This simple technique extends to any variable within a function and allows us to precisely pinpoint the exact impact each variable has on the total output.
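
A small numerical sketch of this, with illustrative choices for A, B, and C (any differentiable functions would do):

```python
import numpy as np

# Illustrative activation choices and their derivatives.
C  = lambda x: 3 * x;        dC = lambda x: 3.0
B  = lambda c: np.tanh(c);   dB = lambda c: 1 - np.tanh(c) ** 2
A  = lambda b: b ** 2;       dA = lambda b: 2 * b
f  = lambda x: A(B(C(x)))

x = 0.5
analytic = dA(B(C(x))) * dB(C(x)) * dC(x)       # chain-rule product
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6    # central finite difference
print(np.isclose(analytic, numeric))            # True
```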

Applying the chain rule

Let's use the chain rule to calculate the derivative of cost with respect to any weight in the network. The chain rule will help us identify how much each weight contributes to our overall error and the direction to update each weight to reduce our error. Here are the equations we need to make a prediction and calculate total error, or cost:

Given a network consisting of a single neuron, total cost could be calculated as:

$$Cost = C(R(Z(XW)))$$

Using the chain rule we can easily find the derivative of Cost with respect to weight W.

$$C'(W) = C'(R) \cdot R'(Z) \cdot Z'(W) = (\hat{y} - y) \cdot R'(Z) \cdot X$$
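
A sketch of this single-neuron case, assuming for illustration a sigmoid activation R and a squared-error cost C (the formula above leaves both unspecified):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: Z = X*W, R = sigmoid(Z), Cost = 1/2 * (R - y)^2
X, W, y = 2.0, 0.3, 1.0
Z = X * W
R = sigmoid(Z)
cost = 0.5 * (R - y) ** 2

# Backward pass: C'(W) = C'(R) * R'(Z) * Z'(W) = (y_hat - y) * R'(Z) * X
dC_dR = R - y            # C'(R), with y_hat = R
dR_dZ = R * (1 - R)      # sigmoid'(Z)
dZ_dW = X                # Z'(W)
dC_dW = dC_dR * dR_dZ * dZ_dW
print(dC_dW)
```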

Example

NOTE:

$$E_0 = \frac{\partial L}{\partial O}$$
$$E_H = \frac{\partial L}{\partial H}$$
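
The worked example image is not reproduced here, but a minimal sketch of how these two error terms appear in a two-layer network (the sigmoid activations and squared-error loss below are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))          # one input sample
Wh = rng.normal(size=(3, 4))         # input  -> hidden weights
Wo = rng.normal(size=(4, 2))         # hidden -> output weights
y = np.array([[0.0, 1.0]])

# Forward pass
H = sigmoid(x @ Wh)                  # hidden activations
O = sigmoid(H @ Wo)                  # network output
L = 0.5 * np.sum((O - y) ** 2)       # squared-error loss

# Backward pass
E_o = O - y                          # E_0 = dL/dO from the note above
dZo = E_o * O * (1 - O)              # backprop through the output sigmoid
E_h = dZo @ Wo.T                     # E_H = dL/dH
dWo = H.T @ dZo                      # gradient for hidden -> output weights
dWh = x.T @ (E_h * H * (1 - H))      # gradient for input  -> hidden weights
print(dWo.shape, dWh.shape)
```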

Batch gradient descent

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training dataset.

```python
for i in range(nb_epochs):
  params_grad = evaluate_gradient(loss_function, data, params)  # gradient over the full dataset
  params = params - learning_rate * params_grad                 # one update per epoch
```

Stochastic gradient descent

Stochastic gradient descent (SGD) in contrast performs a parameter update for each training example $x^i$ and label $y^i$. Note that we shuffle the training data at every epoch.

SGD performs frequent updates with a high variance, which causes the objective function to fluctuate heavily.

```python
for i in range(nb_epochs):
  np.random.shuffle(data)          # reshuffle the training data every epoch
  for example in data:
    params_grad = evaluate_gradient(loss_function, example, params)  # gradient of one example
    params = params - learning_rate * params_grad                    # one update per example
```

  • Efficient gradient updates, since each update uses only a single example

  • Slow convergence; requires a well-tuned learning rate

Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples. Common mini-batch sizes range between 50 and 256, but can vary for different applications.

```python
for i in range(nb_epochs):
  np.random.shuffle(data)
  for batch in get_batches(data, batch_size=50):
    params_grad = evaluate_gradient(loss_function, batch, params)  # gradient over one mini-batch
    params = params - learning_rate * params_grad                  # one update per mini-batch
```

SGD momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.

$$m_t = \gamma m_{t-1} + \eta\nabla_{\theta}J(\theta)$$
$$\theta = \theta - m_t$$

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. $\gamma < 1$). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
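
A runnable sketch of the momentum update on a toy quadratic objective (the objective and hyperparameter values are illustrative):

```python
import numpy as np

# Toy objective J(theta) = 0.5 * theta^T A theta, so nabla_theta J = A @ theta.
A = np.diag([10.0, 1.0])                 # ill-conditioned: plain SGD oscillates here
theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)                 # momentum buffer m_t
gamma, eta = 0.9, 0.05                   # momentum term and learning rate

for _ in range(200):
    grad = A @ theta                     # nabla_theta J(theta)
    m = gamma * m + eta * grad           # m_t = gamma * m_{t-1} + eta * grad
    theta = theta - m                    # theta = theta - m_t

print(theta)                             # close to the minimum at [0, 0]
```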

ADAM

Adaptive Moment Estimation (ADAM) is a method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

  1. $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively.

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
  2. Compute the bias-corrected first moment estimate and the bias-corrected second raw moment estimate, then update the parameters:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

The authors propose default values of 0.9 for $\beta_1$, 0.999 for $\beta_2$, and $10^{-8}$ for $\epsilon$.
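
A minimal sketch of the full Adam loop on a toy quadratic objective (the `adam` helper and the toy loss are illustrative, not a reference implementation):

```python
import numpy as np

def adam(grad_fn, theta, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Adam loop following the update rules above for a single parameter vector."""
    m = np.zeros_like(theta)                       # first moment m_t
    v = np.zeros_like(theta)                       # second moment v_t
    for t in range(1, n_steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g            # m_t
        v = beta2 * v + (1 - beta2) * g ** 2       # v_t
        m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage on J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
print(adam(lambda th: th, np.array([1.0, -2.0])))  # close to [0, 0]
```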
