Optimization


MLE and MAP

A good video explains the topic.

Assume $X = \{x_1, x_2, \dots, x_n\}$ is an independent and identically distributed (i.i.d.) sample.

Maximum Likelihood Estimation

$$\hat\theta_{MLE} = \underset{\theta}{\operatorname{maximize}}\; P(X; \theta) = \underset{\theta}{\operatorname{maximize}}\; \log P(X; \theta) = \underset{\theta}{\operatorname{minimize}}\; \left(-\log P(X; \theta)\right)$$

  • Deep learning uses MLE, since we do not know the real distribution of the dataset.

Maximum A Posteriori

$$\hat\theta_{MAP} = \underset{\theta}{\operatorname{maximize}}\; P(\theta \mid X)$$

By Bayes' rule, $P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{P(X)}$, so

$$\hat\theta_{MAP} = \underset{\theta}{\operatorname{minimize}}\; \left(-\log P(X \mid \theta) - \log P(\theta) + \log P(X)\right)$$

Since $P(X)$ does not depend on $\theta$,

$$\hat\theta_{MAP} = \underset{\theta}{\operatorname{minimize}}\; \left(-\log P(X \mid \theta) - \log P(\theta)\right)$$

  • MAP is usually used in Bayesian machine learning.

The difference between the two is the prior probability term $P(\theta)$.
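
To make the contrast concrete, here is a minimal numpy sketch (an illustration, not from the original page): estimating the mean of a Gaussian with known variance, where MAP adds a zero-mean Gaussian prior.

import numpy as np

np.random.seed(0)
X = np.random.normal(loc=2.0, scale=1.0, size=20)  # i.i.d. sample
sigma = 1.0          # known likelihood std
prior_sigma = 0.5    # std of the zero-mean Gaussian prior P(theta)

# MLE: maximize log P(X; theta) -> the sample mean (closed form for a Gaussian)
mu_mle = X.mean()

# MAP: minimize -log P(X|theta) - log P(theta); the prior term shrinks the
# estimate toward the prior mean (0). Closed form for this conjugate pair:
n = len(X)
mu_map = X.sum() / (n + sigma**2 / prior_sigma**2)

print(mu_mle, mu_map)  # MAP is pulled toward 0 by the prior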

Gradient Descent

Let's say your hypothesis function contains multiple parameters, defined as $\theta_0, \theta_1, \dots, \theta_n$.

Your cost function takes both your model output and your ground truth, e.g.:

$$J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

Then we need to calculate the partial derivative of the cost function with respect to each $\theta_j$, for $j$ from 0 to $n$.

Then we can simultaneously update each parameter:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta_0, \theta_1, \dots, \theta_n)}{\partial \theta_j}$$
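
A minimal sketch of this loop for linear regression with the squared-error cost above (variable names are illustrative):

import numpy as np

np.random.seed(0)
X = np.c_[np.ones(100), np.linspace(0, 1, 100)]   # rows are [1, x], so theta_0 is the bias
y = 3.0 + 2.0 * X[:, 1] + 0.1 * np.random.randn(100)

theta = np.zeros(2)        # [theta_0, theta_1]
alpha, m = 0.5, len(y)

for _ in range(2000):
    h = X @ theta                    # h_theta(x) for every example
    grad = X.T @ (h - y) / m         # dJ/dtheta_j for all j at once
    theta = theta - alpha * grad     # simultaneous update

print(theta)  # approaches [3, 2]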

Derivative of a Matrix-Matrix Product

$$D = W \cdot X$$
$$dW = dD \cdot X^T$$
$$dX = W^T \cdot dD$$

where $dD$, $dW$, and $dX$ denote the gradients of the loss with respect to $D$, $W$, and $X$.
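
These identities are easy to sanity-check numerically. A small sketch (shapes chosen arbitrarily), treating the upstream loss as $L = \operatorname{sum}(D)$ so that $dD$ is all ones:

import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4)
X = np.random.randn(4, 5)
D = W @ X

dD = np.ones_like(D)   # dL/dD for L = D.sum()
dW = dD @ X.T          # dL/dW, same shape as W
dX = W.T @ dD          # dL/dX, same shape as X

# Finite-difference check on one entry of W:
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
numeric = ((W_pert @ X).sum() - D.sum()) / eps
print(np.isclose(dW[0, 0], numeric))  # True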

Backpropagation

A neural network propagates the signal of the input data forward through its parameters towards the moment of decision, and then back propagates information about the error through the network so that it can alter the parameters one step at a time.

Chain Rule

As seen above, forward propagation can be viewed as a long series of nested equations. If you think of feedforward this way, then backpropagation is merely an application of the chain rule to find the derivatives of the cost with respect to any variable in the nested equation (see the ML cheatsheet: https://ml-cheatsheet.readthedocs.io/en/latest/backpropagation.html). Given a forward propagation function:

$$f(x) = A(B(C(x)))$$

A, B, and C are activation functions at different layers. Using the chain rule, we can easily calculate the derivative of $f(x)$ with respect to $x$:

$$f'(x) = f'(A) \cdot A'(B) \cdot B'(C) \cdot C'(x)$$

How about the derivative with respect to $B$? To find the derivative with respect to $B$, you can pretend $B(C(x))$ is a constant, replace it with a placeholder variable $B$, and proceed to find the derivative normally with respect to $B$.

$$f'(B) = f'(A) \cdot A'(B)$$

This simple technique extends to any variable within a function and lets us pinpoint the exact impact each variable has on the total output.
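
A tiny sketch with concrete choices for A, B, and C (these functions are my own picks, not from the text), checking the chain-rule product against a finite difference:

import numpy as np

# f(x) = A(B(C(x))) with C(x) = x**2, B(u) = exp(u), A(v) = sin(v)
def f(x):
    return np.sin(np.exp(x ** 2))

def f_prime(x):
    b = np.exp(x ** 2)                # B(C(x))
    return np.cos(b) * b * (2 * x)    # A'(B) * B'(C) * C'(x)

x = 0.7
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
print(np.isclose(f_prime(x), numeric))  # True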

Applying the chain rule

Let's use the chain rule to calculate the derivative of cost with respect to any weight in the network. The chain rule will help us identify how much each weight contributes to our overall error and the direction to update each weight to reduce our error. Here are the equations we need to make a prediction and calculate total error, or cost:

Given a network consisting of a single neuron, total cost could be calculated as:

$$Cost = C(R(Z(XW)))$$

where $X$ is the input, $W$ the weight, $Z$ the linear (pre-activation) step, $R$ the activation function, and $C$ the cost function.

Using the chain rule we can easily find the derivative of Cost with respect to weight W.

$$C'(W) = C'(R) \cdot R'(Z) \cdot Z'(W) = (\hat{y} - y) \cdot R'(Z) \cdot X$$
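
A minimal sketch of this gradient for a single sigmoid neuron with squared-error cost, so that $C'(R) = \hat{y} - y$ (names and values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([0.5, -1.2, 2.0])   # one input example
W = np.array([0.1, 0.4, -0.3])   # weights
y = 1.0                          # target

Z = X @ W              # linear step Z(XW)
y_hat = sigmoid(Z)     # activation R(Z)

# C'(W) = C'(R) * R'(Z) * Z'(W), with C = 0.5*(y_hat - y)**2
dW = (y_hat - y) * y_hat * (1 - y_hat) * X
print(dW)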

Example

NOTE:

$$E_O = \frac{\partial L}{\partial O} \qquad E_H = \frac{\partial L}{\partial H}$$

where $L$ is the loss, $O$ the output-layer activation, and $H$ the hidden-layer activation.
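
As a sketch of how these two error signals drive the weight updates in a one-hidden-layer network (the shapes, sigmoid activations, and squared-error loss here are my own assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
X = np.random.randn(1, 4)                              # one input example
Wh, Wo = np.random.randn(4, 3), np.random.randn(3, 1)  # hidden and output weights
y = np.array([[1.0]])

# Forward pass: X -> hidden H -> output O
H = sigmoid(X @ Wh)
O = sigmoid(H @ Wo)

# Backward pass: the two error signals defined above
E_O = O - y                          # dL/dO for L = 0.5*(O - y)**2
E_H = (E_O * O * (1 - O)) @ Wo.T     # dL/dH via the chain rule

dWo = H.T @ (E_O * O * (1 - O))      # gradient for the output weights
dWh = X.T @ (E_H * H * (1 - H))      # gradient for the hidden weights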

Batch gradient descent

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function with respect to the parameters $\theta$ for the entire training dataset.

for i in range(nb_epochs):
  # One gradient over the full training set -> a single parameter update per epoch
  params_grad = evaluate_gradient(loss_function, data, params)
  params = params - learning_rate * params_grad

Stochastic gradient descent

Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example $x^{(i)}$ and label $y^{(i)}$. Note that we shuffle the training data at every epoch.

SGD performs frequent updates with high variance, which causes the objective function to fluctuate heavily.

for i in range(nb_epochs):
  np.random.shuffle(data)   # reshuffle the training data every epoch
  for example in data:
    # One parameter update per training example
    params_grad = evaluate_gradient(loss_function, example, params)
    params = params - learning_rate * params_grad

  • Each update is cheap: the gradient is evaluated on a single example.

  • Convergence is slow and noisy, and needs a well-tuned learning rate.

Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples. Common mini-batch sizes range between 50 and 256, but can vary for different applications.

for i in range(nb_epochs):
  np.random.shuffle(data)   # reshuffle the training data every epoch
  for batch in get_batches(data, batch_size=50):
    # One parameter update per mini-batch of 50 examples
    params_grad = evaluate_gradient(loss_function, batch, params)
    params = params - learning_rate * params_grad
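
The loop assumes a `get_batches` helper; one possible sketch of it (hypothetical, not part of the original snippet):

def get_batches(data, batch_size=50):
  # Yield consecutive slices of the already-shuffled data
  for start in range(0, len(data), batch_size):
    yield data[start:start + batch_size]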

SGD momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations.

$$m_t = \gamma m_{t-1} + \eta \nabla_\theta J(\theta)$$
$$\theta = \theta - m_t$$

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity if there is air resistance, i.e. $\gamma < 1$). The same thing happens to our parameter updates: the momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.
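
The two equations map directly onto the update loop, shown here in the same pseudocode style as the snippets above (`evaluate_gradient` and `params` are the same hypothetical helpers):

gamma, eta = 0.9, 0.01
m = np.zeros_like(params)   # velocity m_t, initialized to zero

for i in range(nb_epochs):
  params_grad = evaluate_gradient(loss_function, data, params)
  m = gamma * m + eta * params_grad   # m_t = gamma*m_{t-1} + eta*grad
  params = params - m                 # theta = theta - m_t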

ADAM

Adaptive Moment Estimation (ADAM) is a method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients $v_t$ like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum. Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

  1. $m_t$ and $v_t$ are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

  2. Compute bias-corrected first and second moment estimates, then update the parameters:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t$$

The authors propose default values of $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.
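
Putting the three steps together in the same pseudocode style as the earlier snippets (`evaluate_gradient` and `params` are hypothetical, as above):

beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
m = np.zeros_like(params)   # first-moment estimate m_t
v = np.zeros_like(params)   # second-moment estimate v_t

for t in range(1, nb_epochs + 1):
  g = evaluate_gradient(loss_function, data, params)
  m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
  v = beta2 * v + (1 - beta2) * g ** 2   # decaying average of squared gradients
  m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
  v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
  params = params - eta * m_hat / (np.sqrt(v_hat) + eps)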
