## The 3 parts of an SVM

The SVM combines 3 "tricks" to give one of the most influential and arguably state-of-the-art machine learning algorithms. The 3rd, the "kernel trick", is what gives the SVM its power. During the talk I will refer to Cortes and Vapnik's 1995 paper.

- A **linear** classification problem, i.e., separating a set of data points by a straight line / plane. This is very simple.
- Transforming the optimization problem to its *dual* via Lagrange multipliers. This is a standard trick that does not result in any gain / loss in problem complexity. But the *dual* form has the advantage of depending only on the **inner product** of input values.
- The **kernel trick**, which transforms the input space to a high-dimensional feature space. The transform function can be non-linear and highly complex, but it is defined *implicitly* by the inner product.

## 1. The maximum-margin classifier

The margin is the separation between these 2 hyper-planes:

$ \bar{w}^* \cdot \bar{x} = b^* + k $

$ \bar{w}^* \cdot \bar{x} = b^* - k $

so the maximum margin is given by:

$ m^* = \frac{2k}{|| \bar{w}^* ||} $

The optimization problem is to maximize $m^*$, which is equivalent to minimizing:

$$ \max (m^*) \Rightarrow \max_{\bar{w}} \left( \frac{2k}{\| \bar{w} \|} \right) \Rightarrow \min_{\bar{w}} \left( \frac{\| \bar{w} \|}{2k} \right) \Rightarrow \min_{\bar{w}} \left( \frac{1}{2} \bar{w} \cdot \bar{w} \right) $$

subject to the constraint that the hyper-plane separates the + and - data points:

$$ y_i (\bar{w} \cdot \bar{x_i} - b) \ge 1, \quad \text{i.e.,} \quad \bar{w} \cdot (y_i \bar{x_i}) \ge 1 + y_i b $$

This is a **convex optimization** problem, because the objective function has a nice quadratic form $\bar{w} \cdot \bar{w} = \sum_i w_i^2$, which has a convex shape.
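As a quick numerical sketch (a toy example of my own, not from the paper): take the hyper-plane parameters $\bar{w} = (1, 0)$, $b = 0$, $k = 1$, and one labelled point on each side. The constraint holds and the margin is $2k / \|\bar{w}\| = 2$:

```python
import numpy as np

# Toy separating hyper-plane: w.x = b +/- k, with w = (1, 0), b = 0, k = 1.
w = np.array([1.0, 0.0])
b, k = 0.0, 1.0

# Two labelled points (x_i, y_i), one on each side of the hyper-plane.
points = [(np.array([2.0, 0.0]), +1),
          (np.array([-2.0, 0.0]), -1)]

# Constraint from the text: w.(y_i x_i) >= 1 + y_i b for every point.
for x, y in points:
    assert w @ (y * x) >= 1 + y * b

# Margin between the two hyper-planes: m = 2k / ||w||.
margin = 2 * k / np.linalg.norm(w)
print(margin)  # 2.0
```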

## 2. The Lagrangian dual

The Lagrangian dual of the maximum-margin classifier is to maximize:

$$ \max_{\bar{\alpha}} \left( \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \, \bar{x_i} \cdot \bar{x_j} \right) $$

subject to the constraints:

$$ \sum_{i=1}^l \alpha_i y_i = 0, \quad \alpha_i \ge 0 $$

Lagrange multipliers are a standard technique, so I will not explain them here.

Question: why does the inner product appear in the dual? The important point is that the dual form depends only on the inner products $(\bar{x_i} \cdot \bar{x_j})$ of the input values, whereas the primal form depends on $(\bar{w} \cdot \bar{w})$. This is what allows us to apply the kernel trick to the dual form.
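The dual can be worked out by hand on a tiny data set (my own illustration): with $\bar{x}_1 = (1, 0)$, $y_1 = +1$ and $\bar{x}_2 = (-1, 0)$, $y_2 = -1$, the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = \alpha$, and the objective reduces to $2\alpha - 2\alpha^2$, maximized at $\alpha = 1/2$. A crude grid search confirms this:

```python
import numpy as np

# Tiny data set: one + point and one - point on the x-axis.
X = np.array([[1.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0])

def dual_objective(alpha):
    """sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)"""
    G = (y[:, None] * X) @ (y[:, None] * X).T  # entries: y_i y_j x_i.x_j
    return alpha.sum() - 0.5 * alpha @ G @ alpha

# sum(alpha_i y_i) = 0 forces alpha_1 = alpha_2, so search one value a >= 0.
grid = np.linspace(0.0, 2.0, 2001)
values = [dual_objective(np.array([a, a])) for a in grid]
best = grid[int(np.argmax(values))]
print(best)  # 0.5

# Weight vector recovered from the dual solution: w = sum_i alpha_i y_i x_i.
alpha = np.array([best, best])
w = (alpha * y) @ X
print(w)  # [1. 0.]
```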

## 3. The kernel trick

A **kernel** is just a symmetric inner product, defined as:

$$ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $$

Since we are free to choose $\Phi$, we can define the kernel as we like, for example $ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2 $.

In the Lagrangian dual, the dot product $(\bar{x_i} \cdot \bar{x_j})$ appears, which can be *replaced* by the kernel $k(\bar{x_i}, \bar{x_j}) = \Phi(\bar{x_i}) \cdot \Phi(\bar{x_j})$. Doing so has the effect of transforming the input space $\mathbf{x}$ into a **feature** space $\Phi(\mathbf{x})$, as shown in the diagram. But notice that $\Phi$ is *not explicitly* defined; it is defined via the kernel, as an inner product. In fact, $\Phi$ is *not uniquely* determined by a kernel function.
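Concretely, kernelizing the dual just means replacing each entry $\bar{x_i} \cdot \bar{x_j}$ of the Gram matrix with $k(\bar{x_i}, \bar{x_j})$. A sketch with the quadratic kernel from above (the data points are my own, for illustration):

```python
import numpy as np

# Quadratic kernel from the text: k(x, y) = (x . y)^2.
def k(x, y):
    return (x @ y) ** 2

# Toy input points.
X = np.array([[1.0, 2.0], [0.0, 1.0], [-1.0, 1.0]])

# Linear Gram matrix used by the plain dual: G_ij = x_i . x_j.
G_linear = X @ X.T

# Kernelized Gram matrix: replace each entry with k(x_i, x_j).
G_kernel = np.array([[k(xi, xj) for xj in X] for xi in X])

# For this kernel, that is just the element-wise square of the linear one.
print(np.allclose(G_kernel, G_linear ** 2))  # True
```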

This is a possible transformation function $\Phi$ for our example:

$$ \Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2) $$

It just makes $ \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $ equal to $ (\mathbf{x} \cdot \mathbf{y})^2 $.
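This identity is easy to check numerically, since $\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (x_1 y_1 + x_2 y_2)^2$:

```python
import numpy as np

def phi(x):
    """Explicit feature map: (x1^2, x2^2, sqrt(2) x1 x2)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

# Every pair of 2-d points satisfies phi(x) . phi(y) == (x . y)^2.
rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.normal(size=2), rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(y), (x @ y) ** 2)
print("ok")
```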

Recall that the inner product induces a **norm** via:

$ \|\mathbf{x}\| = \sqrt{\mathbf{x} \cdot \mathbf{x}} $

and the **distance** between 2 points **x** and **y** can be expressed as:

$ d(\mathbf{x},\mathbf{y})^2 = || \mathbf{x - y} ||^2 = \mathbf{x} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} - 2 \mathbf{x} \cdot \mathbf{y}$
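A small numerical sanity check of this identity on random points:

```python
import numpy as np

# Check the identity d(x, y)^2 = x.x + y.y - 2 x.y on random points.
rng = np.random.default_rng(1)
x, y = rng.normal(size=3), rng.normal(size=3)

lhs = np.linalg.norm(x - y) ** 2
rhs = x @ x + y @ y - 2 * (x @ y)
assert np.isclose(lhs, rhs)
print("ok")
```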

Thus, defining the inner product is akin to re-defining the distances between data points, effectively "warping" the input space into a feature space with a different geometry.

Some questions: What determines the dimension of the feature space? How is the decision surface determined by the parameters $\alpha_i$?

## Hilbert Space

A Hilbert space is just a **vector space** endowed with an **inner product** or **dot product**, notation $\langle \mathbf{x},\mathbf{y}\rangle$ or $(\mathbf{x} \cdot \mathbf{y})$.

The **Hilbert-Schmidt theorem** (used in equation (35) in the Cortes-Vapnik paper) says that the eigenvectors of every compact self-adjoint operator on a Hilbert space H (which can be infinite-dimensional) form an **orthogonal basis** of H. In other words, every vector in the space H has a unique representation of the form:

$$ \mathbf{x} = \sum_{i=1}^\infty \alpha_i \mathbf{u}_i $$

This is called a **spectral decomposition**.
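The finite-dimensional analogue is familiar: the eigenvectors of a symmetric (self-adjoint) matrix form an orthogonal basis, and any vector expands uniquely in that basis. A sketch:

```python
import numpy as np

# Finite-dimensional analogue: eigenvectors of a symmetric matrix
# form an orthogonal basis of the space.
rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
A = (A + A.T) / 2                       # make A self-adjoint

eigvals, U = np.linalg.eigh(A)          # columns of U: orthonormal eigenvectors
assert np.allclose(U.T @ U, np.eye(4))  # orthonormal basis

# Any x has a unique expansion x = sum_i alpha_i u_i, with alpha_i = u_i . x.
x = rng.normal(size=4)
alpha = U.T @ x
assert np.allclose(U @ alpha, x)
print("ok")
```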

As a digression, Hilbert spaces can be used to define **function spaces**. For example, consider the graph of a function with an *integer* domain, with values of $f(1), f(2), f(3), \dots$ on to infinity. So, *each graph corresponds to one function*.

Imagine an **infinite dimensional space**. Each point in such a space has dimensions or **coordinates** $(x_1, x_2, \dots)$ up to infinity. **So we can say that each point defines a function via**

$$f(1) = x_1, \quad f(2) = x_2, \quad f(3) = x_3, ... $$

In other words, *each point corresponds to one function*. That is the idea of a function space. We can generalize from the integer domain ($\aleph_0$) to the **continuum** ($2^{\aleph_0}$), thus getting the function space of continuous functions.
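The point-as-function idea can be sketched with a finite truncation (my own illustration, keeping only the first 5 coordinates): a function on an integer domain is just a coordinate vector, and the vector-space inner product becomes $\langle f, g \rangle = \sum_i f(i)\, g(i)$:

```python
import numpy as np

# A function on an integer domain is just a coordinate vector;
# here we truncate to the first 5 coordinates for illustration.
f = np.array([1.0, 4.0, 9.0, 16.0, 25.0])  # f(i) = i^2 for i = 1..5
g = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # g(i) = i

# The inner product of the two "functions": <f, g> = sum_i f(i) g(i).
print(f @ g)  # 1 + 8 + 27 + 64 + 125 = 225.0
```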

End of digression :)
