Classical Machine Learning

Basics

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered an explanatory variable, and the other a dependent variable.

  • Hypothesis function: h(x) = \theta^{T}x

  • Cost function: J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h(x^{(i)}) - y^{(i)}\big)^2 (see the gradient-descent sketch after this list)

  • \theta - the weights

  • m - total number of samples

  • To simplify our notation, we introduce the convention of letting x_0 = 1
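
A minimal NumPy sketch of batch gradient descent on the cost function above; the toy data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

# Toy data with the x0 = 1 convention column; y is generated as y = 1 + x1 + 2*x2.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])

theta = np.zeros(X.shape[1])
lr = 0.01                      # learning rate (arbitrary choice)

for _ in range(5000):
    residual = X @ theta - y   # h(x^(i)) - y^(i) for every sample
    grad = X.T @ residual      # gradient of J(theta) = 1/2 * sum(residual^2)
    theta -= lr * grad

print(theta)                   # approaches [1., 1., 2.]
```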

Closed Form of Linear Regression

\theta = (X^{T}X)^{-1}X^{T}y, which assumes (X^{T}X) is invertible. For the intuition, see [cs229-notes1](http://cs229.stanford.edu/notes/cs229-notes1.pdf).
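
The same toy problem solved in closed form; np.linalg.solve is used instead of an explicit inverse for numerical stability:

```python
import numpy as np

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 4.0, 3.0]])
y = np.array([6.0, 5.0, 12.0, 11.0])    # generated as y = 1 + x1 + 2*x2

# theta = (X^T X)^{-1} X^T y, assuming X^T X is invertible
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)                            # ~ [1., 1., 2.]
```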

Logistic Regression

  • Hypothesis function: h(x) = sigmoid(\theta^{T}x)

  • Cost function: J(\theta) = \sum_{i=1}^{m} y^{(i)}\log\big(h(x^{(i)})\big) + \sum_{i=1}^{m}(1 - y^{(i)})\log\big(1 - h(x^{(i)})\big). As written this is the log-likelihood, which is maximized (equivalently, its negative is minimized); see the sketch after this list.

  • \theta - the weights

  • m - total number of samples

  • To simplify our notation, we introduce the convention of letting x_0 = 1
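
A minimal NumPy sketch that maximizes the log-likelihood above by gradient ascent (toy 1-D data plus the x_0 = 1 column; the learning rate and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-label data with the x0 = 1 convention column.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
lr = 0.1

for _ in range(1000):
    h = sigmoid(X @ theta)
    grad = X.T @ (y - h)    # gradient of the log-likelihood J(theta)
    theta += lr * grad      # gradient *ascent*, since J is maximized

print(sigmoid(X @ theta))   # predicted probabilities: low for class 0, high for class 1
```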

KNN

  • Keywords: non-parametric method, expensive at inference time

  • Given a data point, we compute its K nearest data points (neighbors) using a distance metric (e.g., the Euclidean metric). For classification, we take the majority label of the neighbors; for regression, we take the mean of the label values (see the sketch after this list).

  • Note that for KNN we don't train a model; we simply compute at inference time. This can be computationally expensive, since each test example needs to be compared with every training example to see how close they are. There are approximation methods that can achieve faster inference time by partitioning the training data into regions (e.g., annoy).

  • When K equals 1 or another small number, the model is prone to overfitting (high variance); when K equals the number of data points or another large number, the model is prone to underfitting (high bias)
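
A minimal brute-force K-NN classifier in NumPy, illustrating that all the work happens at inference time (the toy data and K = 3 are arbitrary):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training example.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    # Classification: majority label among the k neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.95, 0.9])))   # -> 1
```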

Naive Bayes

Naive Bayes (NB) is a supervised learning algorithm based on applying Bayes' theorem: for a class y and features x_1, \dots, x_n, P(y \mid x_1, \dots, x_n) \propto P(y)\prod_{i=1}^{n} P(x_i \mid y), and the predicted class is the y that maximizes the right-hand side.

  • It is called naive because it makes the naive assumption that the features are independent of each other given the class

  • NB can make different assumptions about the feature distributions (e.g., Gaussian, Multinomial, Bernoulli); see the sketch after this list

  • Despite the over-simplified assumptions, NB classifier works quite well in real-world applications, especially for text classification (e.g., spam filtering)

  • NB can be extremely fast compared to more sophisticated methods
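
A quick sketch using scikit-learn's GaussianNB (the Gaussian variant; the toy data is made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two features per sample, two classes.
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.5, 0.5], [3.7, 0.8]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()                       # fits a per-class Gaussian to each feature
clf.fit(X, y)
print(clf.predict([[1.1, 1.9]]))         # -> [0]
print(clf.predict_proba([[1.1, 1.9]]))   # class posteriors P(y | x)
```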

SVM

Try to find an optimal hyperplane to separate two classes of data.

Cost function (a subgradient-descent sketch of this objective follows the section):

min_{\theta} \; C\sum_{i=1}^m\big[y^{(i)}cost_1(\theta^Tx^{(i)})+(1-y^{(i)})cost_0(\theta^Tx^{(i)})\big] + \frac{1}{2}\sum_{j=1}^n\theta_j^2

  • Primal Problem:

    minimize_{w, w_{0}} \; \frac{1}{2}w^{T}w

    s.t. \; \forall i: y_{i}(w^{T}x_{i} + w_{0}) \geq 1

    which is a quadratic program with d+1 variables to be optimized for and n constraints.

  • KKT conditions (original blog: http://mypages.iit.edu/~jwang134/posts/KKT-SVs-SVM.html):

    • If the dataset is not separable, slack variables \varepsilon_{i} are introduced. Applying the standard method of Lagrange multipliers, the Lagrangian function is:

      J = \frac{1}{2}w^{T}w + C\sum_{i=1}^{n} \varepsilon_{i} - \sum_{i=1}^{n}\alpha_{i}\big(y_{i}(w^{T}x_{i} + w_0) - 1 + \varepsilon_{i}\big) - \sum_{i=1}^{n}\beta_{i}\varepsilon_{i}

    • Thus, the corresponding KKT conditions (Karush-Kuhn-Tucker conditions) are as follows:

      \frac{\partial J}{\partial w} = w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i

      \frac{\partial J}{\partial w_0} = \sum_{i=1}^{n} \alpha_i y_i = 0

      \frac{\partial J}{\partial \varepsilon_{i}} = C - \alpha_i - \beta_i = 0

  • Dual Problem:

    maximize_{\alpha} \; \sum_{i=1}^{n}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} y_{i}y_{j}\alpha_{i}\alpha_{j}x_{i}^{T}x_{j}

    s.t. \; \forall i: \alpha_{i} \geq 0 \;\wedge\; \sum_{i=1}^{n} y_{i}\alpha_{i} = 0

    which is a quadratic program with n variables to be optimized for, n inequality constraints, and one equality constraint.

    Why use the dual problem? (answer from here)

    Solving the primal problem, we obtain the optimal w, but know nothing about the \alpha_{i}. In order to classify a query point x we need to explicitly compute the scalar product w^{T}x, which may be expensive if d is large.

    Solving the dual problem, we obtain the \alpha_{i} (where \alpha_{i} = 0 for all but a few points - the support vectors). In order to classify a query point x, we calculate:

    w^{T}x + w_{0} = \big(\sum_{i=1}^{n}\alpha_{i}y_{i}x_{i}\big)^{T}x + w_{0} = \sum_{i=1}^{n}\alpha_{i}y_{i}\langle x_{i}, x\rangle + w_{0}

    This term is calculated very efficiently if there are only a few support vectors. Further, since we now have a scalar product only involving data vectors, we may apply the kernel trick.

  • Can perform linear classification, nonlinear classification, or outlier detection (unsupervised), depending on the kernel function

  • Large margin classifier: with an SVM we not only get a decision boundary, but we want the boundary to be as far from the closest training points as possible

  • The closest training examples are called support vectors, since they are the points based on which the decision boundary is drawn

  • SVMs are sensitive to feature scaling

  • If C is very large, the SVM is very sensitive to outliers. If C is reasonably small, the decision boundary is more robust to outliers.

  • (Optional) Why a large margin classifier? Consider a linear SVM. Looking at the cost function, in order to minimize the cost, the inner product \theta^{T}x needs to be greater than 1 or less than -1. In this case, if \theta is not the perfect decision boundary, it will have a larger cost.
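
A minimal sketch of a linear soft-margin SVM trained by subgradient descent on the hinge-loss form of the cost above (toy separable data; C, the learning rate, and the iteration count are arbitrary):

```python
import numpy as np

# Toy linearly separable data; labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.5], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, w0 = np.zeros(2), 0.0
C, lr = 1.0, 0.01

for _ in range(2000):
    margins = y * (X @ w + w0)
    viol = margins < 1                    # points inside the margin (hinge is active)
    # Subgradient of 1/2 * ||w||^2 + C * sum(max(0, 1 - y_i * (w^T x_i + w0)))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_w0 = -C * y[viol].sum()
    w -= lr * grad_w
    w0 -= lr * grad_w0

print(w, w0, np.sign(X @ w + w0))         # predicted signs should match y
```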

Decision tree

  • Non-parametric, supervised learning algorithms

  • Given the training data, a decision tree algorithm divides the feature space into regions. For inference, we first see which region the test data point falls in, and take the mean label value (regression) or the majority label value (classification) of that region.

  • Construction: top-down; choose a question to split the data such that the target variables within each resulting region are as homogeneous as possible. Calculate the Gini impurity and information gain for each candidate question, then pick the question with the highest information gain (see the sketch at the end of this section).

  • Advantage:

    • simple to understand and interpret; mirrors human decision making

  • Disadvantages:

    • can overfit easily (and generalize poorly) if we don't limit the depth of the tree

    • can be non-robust: A small change in the training data can lead to a totally different tree

    • instability: sensitive to training set rotation due to its orthogonal decision boundaries

  • How to choose depth:

    • It depends on the possible questions you can ask for the given problem.

    • You also need to use validation data to see which depth works best
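
A minimal sketch of the Gini impurity and the impurity-based information gain of a candidate split, as used when growing the tree (toy labels; assumes the split already produced the two child label arrays):

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    # Impurity of the parent minus the weighted impurity of the children.
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]     # a perfect split
print(gini(parent), information_gain(parent, left, right))   # 0.5 0.5
```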

Random forests

An ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

  • How many trees? In general, the more trees you use, the better the results. However, the improvement decreases as the number of trees increases; at a certain point the benefit in prediction performance from learning more trees becomes lower than the cost in computation time of learning these additional trees.

Bagging in Random Forest

1. Suppose there are N observations and M features in the training data set. First, a sample is taken from the training data set randomly with replacement. (Bagging for the dataset)

2. A subset of the M features (typically \sqrt{M}) is selected randomly, and whichever feature gives the best split is used to split the node iteratively. (Bagging for features)

3. The tree is grown to the largest extent possible.

4. The above steps are repeated, and the prediction is given based on the aggregation of predictions from n trees (see the usage sketch below).
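
A quick usage sketch with scikit-learn's RandomForestClassifier; max_features="sqrt" mirrors the \sqrt{M} feature subsampling above, and the dataset here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset standing in for real training data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees: more trees help, with diminishing returns.
# max_features="sqrt" samples sqrt(M) features at each split (bagging for features).
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out split
```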

