## The 3 parts of an SVM

The SVM consists of 3 "tricks" combined to give one of the most influential and arguably state-of-the-art machine learning algorithm.  The 3rd, "kernel trick" is what gives SVM its power.

During the talk I will refer to Cortes and Vepnik's 1995 paper.
1. linear classification problem, ie, separating a set of data points by a straight line / plane.  This is very simple.
2. Transforming the primal optimization problem to its dual via Lagrangian multipliers.  This is a standard trick that does not result in any gain / loss in problem complexity.  But the dual form has the advantage of depending only on the inner product of input values.
3. The kernel trick that transforms the input space to a high dimensional feature space.  The transform function can be non-linear and highly complex, but it is defined implicitly by the inner product.

## 1. The maximum-margin classifier

The margin is the separation between these 2 hyper-planes:
$\bar{w}^* \cdot \bar{x} = b^* + k$
$\bar{w}^* \cdot \bar{x} = b^* - k$
so the maximum margin is given by:
$m^* = \frac{2k}{|| \bar{w}^* ||}$

The optimization problem is to maximize $m^*$, which is equivalent to minimizing:
$$\max (m^*) \Rightarrow \max_{\bar{w}} (\frac{2k}{ || \bar{w} || }) \Rightarrow \min_{\bar{w}} (\frac{ || \bar{w} || }{2k}) \Rightarrow \min_{\bar{w}} ( \frac{1}{2} \bar{w} \cdot \bar{w} )$$subject to the constraints that the the hyper-plane separates the + and - data points:
$$\bar{w} \cdot (y_i \bar{x_i} ) \ge 1 + y_i b$$

This is a convex optimization problem because the objective function has a nice quadratic form $\bar{w} \cdot \bar{w} = \sum \bar{w_i}^2$ which has a convex shape:

## 2. The Lagrangian dual

The Lagrangian dual of the maximum-margin classifier is to maximize:
$$\max_{\bar{\alpha}} ( \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \bar{x_i} \cdot \bar{x_j} )$$subject to the constraints:
$$\sum_{i=1}^l \alpha_i y_i = 0, \quad \alpha_i \ge 0$$
Lagrangian multipliers is a standard technique so I'll not explain it here.

Question:  why does the inner product appear in the dual?  But the important point is that the dual form is dependent only on this inner product $(\bar{x_i} \cdot \bar{x_j})$, whereas the primal form is dependent on $(\bar{w} \cdot \bar{w})$.  This allows us to apply the kernel trick to the dual form.

## 3. The kernel trick

kernel is just a symmetric inner product defined as:
$$k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$$By letting $\Phi$ free we can define the kernel as we like, for example, $k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} )^2$.

In the Lagrangian dual, the dot product $(\bar{x_i} \cdot \bar{x_j})$ appears, which can be replaced by the kernel $k(\bar{x_i}, \bar{x_j}) = (\Phi(\bar{x_i}) \cdot \Phi(\bar{x_j}))$.

Doing so has the effect of transforming the input space $\mathbf{x}$ into a feature space $\Phi(\mathbf{x})$, as shown in the diagram:

But notice that $\Phi$ is not explicitly defined;  it is defined via the kernel, as an inner product.  In fact, $\Phi$ is not uniquely determined by a kernel function.

This is a possible transformation function $\Phi$ for our example:
$$\Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2)$$It just makes $\Phi(\mathbf{x}) \cdot \Phi(\mathbf{y})$ equal to $(\mathbf{x} \cdot \mathbf{y} )^2$.

Recall that the inner product induces a norm via:
$||\mathbf{x}|| = \sqrt{\mathbf{x} \cdot \mathbf{x}}$
and the distance between 2 points x and y can be expressed as:
$d(\mathbf{x},\mathbf{y})^2 = || \mathbf{x - y} ||^2 = \mathbf{x} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} - 2 \mathbf{x} \cdot \mathbf{y}$
Thus, defining the inner product is akin to re-defining the distances between data points, effectively "warping" the input space into a feature space with a different geometry.

Some questions:  What determines the dimension of the feature space?  How is the decision surface determined by the parameters $\alpha_i$?

## Hilbert Space

A Hilbert space is just a vector space endowed with an inner product or dot product, notation $\langle \mathbf{x},\mathbf{y}\rangle$ or $(\mathbf{x} \cdot \mathbf{y})$.

The Hilbert-Schmidt theorem (used in equation (35) in the Cortes-Vapnik paper) says that every self-adjoint operator in a Hilbert space H (which can be infinite-dimensional) forms an orthogonal basis of H.  In other words, every vector in space H has a unique representation of the form:
$$\mathbf{x} = \sum_{i=1}^\infty \alpha_i \mathbf{u}_i$$This is called a spectral decomposition.

*  *  *

As a digression, Hilbert spaces can be used to define function spaces.  For example, this is the graph of a function with an integer domain:
with values of $f(1), f(2), f(3), ...$ on to infinity.  So, each graph corresponds to one function.

Imagine an infinite dimensional space.  Each point in such a space has dimensions or coordinates $(x_1, x_2, ... )$ up to infinity.  So we can say that each point defines a function via
$$f(1) = x_1, \quad f(2) = x_2, \quad f(3) = x_3, ...$$
In other words, each point corresponds to one function.  That is the idea of a function space.  We can generalize from the integer domain ($\aleph_0$) to the continuum ($2^{\aleph_0}$), thus getting the function space of continuous functions.

End of digression :)