Geniferology: Support vector machines

The 3 parts of an SVM

The SVM consists of 3 "tricks" combined to give one of the most influential and arguably state-of-the-art machine learning algorithm. The 3rd, "kernel trick" is what gives SVM its power.

During the talk I will refer to Cortes and Vepnik's 1995 paper.

A linear classification problem, ie, separating a set of data points by a straight line / plane. This is very simple.
Transforming the primal optimization problem to its dual via Lagrangian multipliers. This is a standard trick that does not result in any gain / loss in problem complexity. But the dual form has the advantage of depending only on the inner product of input values.
The kernel trick that transforms the input space to a high dimensional feature space. The transform function can be non-linear and highly complex, but it is defined implicitly by the inner product.

1. The maximum-margin classifier

The margin is the separation between these 2 hyper-planes:

$ \bar{w}^* \cdot \bar{x} = b^* + k $

$ \bar{w}^* \cdot \bar{x} = b^* - k $

so the maximum margin is given by:

$ m^* = \frac{2k}{|| \bar{w}^* ||} $

The optimization problem is to maximize $m^*$, which is equivalent to minimizing:
$$ \max (m^*) \Rightarrow \max_{\bar{w}} (\frac{2k}{ || \bar{w} || })
\Rightarrow \min_{\bar{w}} (\frac{ || \bar{w} || }{2k}) \Rightarrow \min_{\bar{w}} ( \frac{1}{2} \bar{w} \cdot \bar{w} ) $$subject to the constraints that the the hyper-plane separates the + and - data points:
$$ \bar{w} \cdot (y_i \bar{x_i} ) \ge 1 + y_i b $$

This is a convex optimization problem because the objective function has a nice quadratic form $\bar{w} \cdot \bar{w} = \sum \bar{w_i}^2$ which has a convex shape:

2. The Lagrangian dual

The Lagrangian dual of the maximum-margin classifier is to maximize:
$$ \max_{\bar{\alpha}} ( \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \bar{x_i} \cdot \bar{x_j} ) $$subject to the constraints:
$$ \sum_{i=1}^l \alpha_i y_i = 0, \quad \alpha_i \ge 0 $$
Lagrangian multipliers is a standard technique so I'll not explain it here.

Question: why does the inner product appear in the dual? But the important point is that the dual form is dependent only on this inner product $(\bar{x_i} \cdot \bar{x_j})$, whereas the primal form is dependent on $(\bar{w} \cdot \bar{w})$. This allows us to apply the kernel trick to the dual form.

3. The kernel trick

A kernel is just a symmetric inner product defined as:
$$ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $$By letting $\Phi$ free we can define the kernel as we like, for example, $ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} )^2 $.

In the Lagrangian dual, the dot product $(\bar{x_i} \cdot \bar{x_j})$ appears, which can be replaced by the kernel $k(\bar{x_i}, \bar{x_j}) = (\Phi(\bar{x_i}) \cdot \Phi(\bar{x_j}))$.

Doing so has the effect of transforming the input space $\mathbf{x}$ into a feature space $\Phi(\mathbf{x})$, as shown in the diagram:

But notice that $\Phi$ is not explicitly defined; it is defined via the kernel, as an inner product. In fact, $\Phi$ is not uniquely determined by a kernel function.

This is a possible transformation function $\Phi$ for our example:
$$ \Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2) $$It just makes $ \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $ equal to $ (\mathbf{x} \cdot \mathbf{y} )^2 $.

Recall that the inner product induces a norm via:

$ ||\mathbf{x}|| = \sqrt{\mathbf{x} \cdot \mathbf{x}} $

and the distance between 2 points x and y can be expressed as:

$ d(\mathbf{x},\mathbf{y})^2 = || \mathbf{x - y} ||^2 = \mathbf{x} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} - 2 \mathbf{x} \cdot \mathbf{y}$

Thus, defining the inner product is akin to re-defining the distances between data points, effectively "warping" the input space into a feature space with a different geometry.

Some questions: What determines the dimension of the feature space? How is the decision surface determined by the parameters $\alpha_i$?

Hilbert Space

A Hilbert space is just a vector space endowed with an inner product or dot product, notation $\langle \mathbf{x},\mathbf{y}\rangle$ or $(\mathbf{x} \cdot \mathbf{y})$.

The Hilbert-Schmidt theorem (used in equation (35) in the Cortes-Vapnik paper) says that every self-adjoint operator in a Hilbert space H (which can be infinite-dimensional) forms an orthogonal basis of H. In other words, every vector in space H has a unique representation of the form:

$$ \mathbf{x} = \sum_{i=1}^\infty \alpha_i \mathbf{u}_i $$This is called a spectral decomposition.

* * *

As a digression, Hilbert spaces can be used to define function spaces. For example, this is the graph of a function with an integer domain:

with values of $f(1), f(2), f(3), ... $ on to infinity. So, each graph corresponds to one function.

Imagine an infinite dimensional space. Each point in such a space has dimensions or coordinates $(x_1, x_2, ... )$ up to infinity. So we can say that each point defines a function via
$$f(1) = x_1, \quad f(2) = x_2, \quad f(3) = x_3, ... $$
In other words, each point corresponds to one function. That is the idea of a function space. We can generalize from the integer domain ($\aleph_0$) to the continuum ($2^{\aleph_0}$), thus getting the function space of continuous functions.

End of digression :)

Blog Archive

Latex Maths

Thursday, November 29, 2012

Support vector machines

The 3 parts of an SVM

1. The maximum-margin classifier

2. The Lagrangian dual

3. The kernel trick

Hilbert Space

No comments:

Post a Comment