Basics

Linear Regression

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered an explanatory variable, and the other a dependent variable.

  • Hypothesis Function: $h(x) = \theta^{T}x$

  • Cost Function: $J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2$

  • $\theta$ - the weights

  • $m$ - total number of samples

  • To simplify our notation, we introduce the convention of letting $x_0 = 1$

Closed Form of Linear Regression

$\theta = (X^{T}X)^{-1}X^{T}y$, which assumes $(X^{T}X)$ is invertible. For the intuition, see [cs229-notes1](http://cs229.stanford.edu/notes/cs229-notes1.pdf).
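
As a quick illustration, here is a minimal NumPy sketch of the normal equation; the data values are made up for the example.

```python
import numpy as np

# Toy data: m samples, 2 features (illustrative values)
X_raw = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([3.5, 3.0, 5.5, 8.0])

# Convention x_0 = 1: prepend a column of ones for the intercept term
m = X_raw.shape[0]
X = np.hstack([np.ones((m, 1)), X_raw])

# Normal equation: theta = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over explicitly inverting X^T X)
theta = np.linalg.solve(X.T @ X, X.T @ y)

predictions = X @ theta  # h(x) = theta^T x for every sample
```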

Logistic Regression

  • Hypothesis Function: $h(x) = \mathrm{sigmoid}(\theta^{T}x)$

  • Cost Function (negative log-likelihood): $J(\theta) = -\sum_{i=1}^{m}\left[y^{(i)}\log(h(x^{(i)})) + (1 - y^{(i)})\log(1 - h(x^{(i)}))\right]$

  • $\theta$ - the weights

  • $m$ - total number of samples

  • To simplify our notation, we introduce the convention of letting $x_0 = 1$
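
A minimal gradient-descent sketch for logistic regression, using the $x_0 = 1$ convention above; the learning rate, iteration count, and data are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy binary-classification data with the x_0 = 1 convention already applied
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = np.zeros(X.shape[1])
lr = 0.1  # learning rate (illustrative)

for _ in range(1000):
    h = sigmoid(X @ theta)      # h(x) = sigmoid(theta^T x)
    gradient = X.T @ (h - y)    # gradient of the negative log-likelihood
    theta -= lr * gradient

probabilities = sigmoid(X @ theta)  # predicted P(y = 1 | x)
```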

KNN

  • Keywords: non-parametric method, time-consuming inference

  • Given a data point, we compute its K nearest data points (neighbors) using a distance metric (e.g., the Euclidean metric). For classification, we take the majority label of the neighbors; for regression, we take the mean of the label values.

  • Note that for KNN we don't train a model; we simply compute at inference time. This can be computationally expensive, since each test example needs to be compared with every training example to see how close they are.

  • There are approximate methods that achieve faster inference by partitioning the training data into regions (e.g., Annoy).

  • When K is 1 or another small number, the model is prone to overfitting (high variance); when K approaches the number of data points or another large number, the model is prone to underfitting (high bias). A minimal sketch follows below.
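
A minimal NumPy sketch of KNN classification; the data, labels, and value of k are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote of its k nearest neighbors."""
    # Euclidean distances from the query point to every training example
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example (illustrative data)
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.9, 1.0]), k=3))  # -> 1
```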

Naive Bayes

NB applies Bayes' theorem with the naive (conditional independence) assumption: $P(y \mid x_1, \dots, x_n) \propto P(y)\prod_{i=1}^{n}P(x_i \mid y)$; prediction picks the class $y$ with the highest posterior.

  • Naive Bayes (NB) is a supervised learning algorithm based on applying Bayes' theorem

  • It is called naive because it makes the naive assumption that the features are conditionally independent of each other given the class

  • NB can make different assumptions (i.e., data distributions, such as Gaussian, Multinomial, Bernoulli)

  • Despite the over-simplified assumptions, NB classifiers work quite well in real-world applications, especially for text classification (e.g., spam filtering)

  • NB can be extremely fast compared to more sophisticated methods
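
A minimal sketch using scikit-learn's GaussianNB (assuming scikit-learn is available; the data is illustrative).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two small clusters (illustrative)
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 3.5], [3.2, 3.0]])
y = np.array([0, 0, 1, 1])

model = GaussianNB()  # assumes each feature is Gaussian within each class
model.fit(X, y)
print(model.predict([[1.1, 2.1]]))        # -> [0]
print(model.predict_proba([[1.1, 2.1]]))  # posterior P(y | x) under the NB assumption
```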

SVM

SVM tries to find an optimal hyperplane that separates two classes of data.

Cost function:

$\min_{\theta} C\sum_{i=1}^{m}\left[y^{i}\,\mathrm{cost}_1(\theta^{T}x^{i}) + (1 - y^{i})\,\mathrm{cost}_0(\theta^{T}x^{i})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$

  • Primal Problem:

    $\mathrm{minimize}_{w, w_{0}}\ \frac{1}{2}w^{T}w$

    $\text{s.t. } \forall i: y_{i}(w^{T}x_{i} + w_{0}) \geq 1$

    which is a quadratic program with $d+1$ variables to be optimized and one constraint per training example $i$.

  • KKT conditions:

    Original blog: http://mypages.iit.edu/~jwang134/posts/KKT-SVs-SVM.html

    • Applying the standard method of Lagrange multipliers, the Lagrangian function is:

      $J = \frac{1}{2}w^{T}w + C\sum_{i=1}^{n}\varepsilon_{i} - \sum_{i=1}^{n}\alpha_{i}\left(y_{i}(w^{T}x_{i} + w_{0}) - 1 + \varepsilon_{i}\right) - \sum_{i=1}^{n}\beta_{i}\varepsilon_{i}$

    • Thus, the corresponding KKT (Karush–Kuhn–Tucker) conditions are as follows:

      • $\frac{\partial J}{\partial w} = w - \sum_{i=1}^{n}\alpha_{i}y_{i}x_{i} = 0 \Rightarrow w = \sum_{i=1}^{n}\alpha_{i}y_{i}x_{i}$

        $\frac{\partial J}{\partial w_{0}} = \sum_{i=1}^{n}\alpha_{i}y_{i} = 0$

        If the dataset is not separable:

        $\frac{\partial J}{\partial \varepsilon_{i}} = C - \alpha_{i} - \beta_{i} = 0$

  • Dual Problem:

    $\mathrm{maximize}_{\alpha}\ \sum_{i=1}^{n}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}y_{i}y_{j}\alpha_{i}\alpha_{j}x_{i}^{T}x_{j}$

    $\text{s.t. } \forall i: \alpha_{i} \geq 0 \;\wedge\; \sum_{i=1}^{n}y_{i}\alpha_{i} = 0$

    which is a quadratic program with $n+1$ variables to be optimized, $n$ inequality constraints, and one equality constraint.

  • Why use the dual problem? (answer from here)

    1. Solving the primal problem, we obtain the optimal $w$, but know nothing about the $\alpha_{i}$. In order to classify a query point $x$ we need to explicitly compute the scalar product $w^{T}x$, which may be expensive if $d$ is large.

    2. Solving the dual problem, we obtain the $\alpha_{i}$ (where $\alpha_{i} = 0$ for all but a few points - the support vectors). In order to classify a query point $x$, we calculate:

      $w^{T}x + w_{0} = \left(\sum_{i=1}^{n}\alpha_{i}y_{i}x_{i}\right)^{T}x + w_{0} = \sum_{i=1}^{n}\alpha_{i}y_{i}\langle x_{i}, x\rangle + w_{0}$

      This term can be calculated very efficiently if there are only a few support vectors. Further, since we now have a scalar product involving only data vectors, we may apply the kernel trick.

  • Can perform linear classification, nonlinear classification, or outlier detection (unsupervised) depending on the kernel function

  • Large margin classifier: with SVM we not only obtain a decision boundary, but we also want the boundary to be as far from the closest training points as possible

  • (Optional) Why a large margin classifier? Consider a linear SVM. Looking at the cost function, in order to minimize the cost, the inner product $\theta^{T}x$ needs to be greater than 1 or less than -1. In this case, if $\theta$ is not the perfect decision boundary, it will incur a larger cost.

  • The closest training examples are called support vectors, since they are the points based on which the decision boundary is drawn

  • SVMs are sensitive to feature scaling

  • If C is very large, the SVM is very sensitive to outliers. But if C is reasonably small (not too large), the decision boundary is more robust to outliers.
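
A minimal scikit-learn sketch putting the practical points together (assuming scikit-learn is available; the kernel and C value are illustrative choices).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data (illustrative)
X = np.array([[0.0, 0.0], [0.5, 0.5], [2.5, 2.5], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

# SVMs are sensitive to feature scaling, so standardize first;
# C controls the margin/outlier trade-off, kernel="rbf" enables a nonlinear boundary
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X, y)

print(model.predict([[0.2, 0.3]]))                # -> [0]
print(model.named_steps["svc"].support_vectors_)  # the support vectors (in scaled space)
```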

Decision tree

  • Non-parametric, supervised learning algorithms

  • Given the training data, a decision tree algorithm divides the feature space into regions. For inference, we first see which region the test data point falls in, then take the mean label value (regression) or the majority label (classification).

  • Construction: top-down; at each node, choose a question to split the data such that the target variables within each region are as homogeneous as possible. Calculate the Gini impurity and information gain for each candidate split, then pick the question with the largest information gain.

  • Advantage:

    • simple to understand & interpret; mirrors human decision making

  • Disadvantage:

    • can overfit easily (and generalize poorly) if we don't limit the depth of the tree

    • can be non-robust: A small change in the training data can lead to a totally different tree

    • instability: sensitive to training set rotation due to its orthogonal decision boundaries

  • How to choose depth:

    • It depends on the set of possible questions (splits) available for the problem.

    • You also need to use validation data to see which depth works best.
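
To make the construction step concrete, here is a minimal sketch of the Gini impurity and the information gain (impurity reduction) used to score a candidate split; the function names and data are illustrative.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, left, right):
    """Impurity reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_child

# Score one candidate split on a toy label set (illustrative)
parent = np.array([0, 0, 0, 1, 1, 1])
print(information_gain(parent, parent[:3], parent[3:]))  # perfect split -> 0.5
```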

Random forests

An ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

  • The number of trees? In general, the more trees you use, the better the results. However, the improvement decreases as the number of trees increases; at a certain point, the benefit in prediction performance from learning more trees will be lower than the cost in computation time for learning these additional trees.

Bagging in Random Forest

1. Suppose there are $N$ observations and $M$ features in the training data set. First, a sample is taken randomly with replacement from the training data set. (Bagging for the dataset)

2. A subset of the $M$ features (typically $\sqrt{M}$) is selected randomly, and whichever feature gives the best split is used to split the node, iteratively. (Bagging for features)

3. The tree is grown to the largest extent possible.

4. The above steps are repeated, and the prediction is given based on the aggregation of predictions from the $n$ trees (a minimal sketch follows below).
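
A minimal scikit-learn sketch (assuming scikit-learn is available); n_estimators and max_features correspond to the number of trees and the feature subset size discussed above, with illustrative values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data (illustrative)
X = np.array([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.2], [0.1, 0.9], [0.8, 0.3]])
y = np.array([0, 0, 1, 1, 0, 1])

# n_estimators = number of trees; max_features="sqrt" draws sqrt(M) features per split;
# bootstrap=True resamples the dataset with replacement for each tree
model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               bootstrap=True, random_state=0)
model.fit(X, y)
print(model.predict([[0.15, 0.85]]))  # majority vote over the individual trees -> [0]
```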
