The 3 parts of an SVM
The SVM combines 3 "tricks" to give one of the most influential and arguably state-of-the-art machine learning algorithms. The 3rd, the "kernel trick", is what gives the SVM its power. During the talk I will refer to Cortes and Vapnik's 1995 paper.
- A linear classification problem, i.e., separating a set of data points with a straight line / hyper-plane. This part is very simple.
- Transforming the primal optimization problem to its dual via Lagrange multipliers. This is a standard trick that neither gains nor loses anything in problem complexity, but the dual form has the advantage of depending only on inner products of the input vectors.
- The kernel trick, which transforms the input space into a high-dimensional feature space. The transformation can be non-linear and highly complex, but it is defined only implicitly, via an inner product. (A quick end-to-end preview follows this list.)
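As a quick preview of where we are heading, here is a minimal end-to-end sketch using an off-the-shelf implementation. This is not from the paper; it assumes scikit-learn is available, and the dataset and parameters are arbitrary choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line in the input space.
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# A maximum-margin classifier (parts 1 and 2) with a quadratic polynomial kernel (part 3),
# i.e. k(x, y) = (gamma * x . y)^2, a scaled version of the (x . y)^2 kernel used later.
clf = SVC(kernel="poly", degree=2, coef0=0.0, C=1.0)
clf.fit(X, y)

print("training accuracy:", clf.score(X, y))
print("number of support vectors:", len(clf.support_))
```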
1. The maximum-margin classifier
The margin is the separation between these 2 hyper-planes:
$ \bar{w}^* \cdot \bar{x} = b^* + k $
$ \bar{w}^* \cdot \bar{x} = b^* - k $
so the maximum margin is given by:
$ m^* = \frac{2k}{|| \bar{w}^* ||} $
The optimization problem is to maximize $m^*$, which can be rewritten as an equivalent minimization:
$$ \max (m^*) \Rightarrow \max_{\bar{w}} (\frac{2k}{ || \bar{w} || })
\Rightarrow \min_{\bar{w}} (\frac{ || \bar{w} || }{2k}) \Rightarrow \min_{\bar{w}} ( \frac{1}{2} \bar{w} \cdot \bar{w} ) $$
(the last step holds because $k$ is a fixed constant and minimizing $|| \bar{w} ||$ is the same as minimizing $\frac{1}{2} || \bar{w} ||^2 = \frac{1}{2} \bar{w} \cdot \bar{w}$), subject to the constraints that the hyper-plane separates the + and - data points:
$$ \bar{w} \cdot (y_i \bar{x_i} ) \ge 1 + y_i b $$

(this is just $y_i ( \bar{w} \cdot \bar{x_i} - b ) \ge 1$ rearranged, with $k$ normalized to $1$).
This is a convex optimization problem: the objective function is a quadratic form, $\bar{w} \cdot \bar{w} = \sum_i w_i^2$, which has a convex (bowl-shaped) surface, and the constraints are linear.
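To make this concrete, here is a minimal sketch that hands the hard-margin primal straight to a generic convex solver. It assumes cvxpy and NumPy are available; the toy data are an arbitrary, linearly separable example:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data; labels y_i are +1 / -1.
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-2.0, -1.5], [-1.0, -2.5], [-3.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# Primal: minimize (1/2) w.w  subject to  y_i (w.x_i - b) >= 1  for every data point.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w - b) >= 1])
problem.solve()

print("w* =", w.value, " b* =", b.value, " margin =", 2 / np.linalg.norm(w.value))
```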
2. The Lagrangian dual
The Lagrangian dual of the maximum-margin classifier is to maximize:

$$ \max_{\bar{\alpha}} ( \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \bar{x_i} \cdot \bar{x_j} ) $$

subject to the constraints:
$$ \sum_{i=1}^l \alpha_i y_i = 0, \quad \alpha_i \ge 0 $$
Lagrange multipliers are a standard technique, so I won't explain them here.
Question: why does the inner product appear in the dual? (Briefly: the stationarity condition gives $\bar{w} = \sum_i \alpha_i y_i \bar{x_i}$, and substituting this back into the Lagrangian turns $\bar{w} \cdot \bar{w}$ into a double sum of inner products.) The important point is that the dual form depends on the data only through the inner products $(\bar{x_i} \cdot \bar{x_j})$, whereas the primal form depends on $(\bar{w} \cdot \bar{w})$. This is what allows us to apply the kernel trick to the dual form.
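Here is the matching sketch for the dual, again assuming cvxpy is available. Notice that the data enter only through the Gram matrix of inner products; `cp.psd_wrap` merely asserts that $Q$ is positive semi-definite, which holds here by construction:

```python
import cvxpy as cp
import numpy as np

# Same toy data as in the primal sketch above.
X = np.array([[2.0, 2.0], [2.5, 1.0], [3.0, 3.0],
              [-2.0, -1.5], [-1.0, -2.5], [-3.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
n = len(y)

K = X @ X.T                  # Gram matrix of inner products x_i . x_j -- all the dual needs
Q = np.outer(y, y) * K       # Q_ij = y_i y_j (x_i . x_j), positive semi-definite

alpha = cp.Variable(n)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, cp.psd_wrap(Q)))
constraints = [alpha >= 0, cp.sum(cp.multiply(alpha, y)) == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = X.T @ (a * y)            # recover the primal solution: w* = sum_i alpha_i y_i x_i
sv = int(np.argmax(a))       # any index with alpha_i > 0 is a support vector
b = X[sv] @ w - y[sv]        # from y_i (w . x_i - b) = 1 at a support vector
print("alpha =", np.round(a, 3), " w* =", w, " b* =", b)
```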
3. The kernel trick
A kernel is just a symmetric function that behaves as an inner product in some feature space:

$$ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) $$

Since we are free to choose $\Phi$, we can define the kernel as we like, for example $ k(\mathbf{x}, \mathbf{y}) = \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = (\mathbf{x} \cdot \mathbf{y} )^2 $.
In the Lagrangian dual, the dot product $(\bar{x_i} \cdot \bar{x_j})$ appears, which can be replaced by the kernel $k(\bar{x_i}, \bar{x_j}) = (\Phi(\bar{x_i}) \cdot \Phi(\bar{x_j}))$.
Doing so has the effect of transforming the input space $\mathbf{x}$ into a feature space $\Phi(\mathbf{x})$, as shown in the diagram:
But notice that $\Phi$ is not explicitly defined; it is defined via the kernel, as an inner product. In fact, $\Phi$ is not uniquely determined by a kernel function.
This is a possible transformation function $\Phi$ for our example:
$$ \Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2) $$

Indeed, $ \Phi(\mathbf{x}) \cdot \Phi(\mathbf{y}) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 = (\mathbf{x} \cdot \mathbf{y} )^2 $.
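A quick numeric check of this identity (plain NumPy; the random test vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # The explicit feature map above, for 2-dimensional input.
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(y)    # inner product in the 3-dimensional feature space
rhs = (x @ y) ** 2       # kernel evaluated directly in the 2-dimensional input space
print(lhs, rhs, np.isclose(lhs, rhs))   # the two values agree
```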
Recall that the inner product induces a norm via:
$ ||\mathbf{x}|| = \sqrt{\mathbf{x} \cdot \mathbf{x}} $
and the distance between two points $\mathbf{x}$ and $\mathbf{y}$ can be expressed as:
$ d(\mathbf{x},\mathbf{y})^2 = || \mathbf{x - y} ||^2 = \mathbf{x} \cdot \mathbf{x} + \mathbf{y} \cdot \mathbf{y} - 2 \mathbf{x} \cdot \mathbf{y}$
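So distances in the feature space can be computed from kernel evaluations alone, without ever constructing $\Phi$. Here is a small sketch using the quadratic kernel from the example above (the sample points are arbitrary):

```python
import numpy as np

def k(x, y):
    # The quadratic kernel from the example above: k(x, y) = (x . y)^2
    return (x @ y) ** 2

def feature_distance(x, y):
    # Distance between Phi(x) and Phi(y), using only kernel values:
    # ||Phi(x) - Phi(y)||^2 = k(x, x) + k(y, y) - 2 k(x, y)
    return np.sqrt(k(x, x) + k(y, y) - 2 * k(x, y))

x, y = np.array([1.0, 0.0]), np.array([2.0, 0.0])
print(np.linalg.norm(x - y))   # 1.0  -- distance in the input space
print(feature_distance(x, y))  # 3.0  -- distance in the feature space
```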
Thus, defining the inner product is akin to re-defining the distances between data points, effectively "warping" the input space into a feature space with a different geometry. Some questions: What determines the dimension of the feature space? How is the decision surface determined by the parameters $\alpha_i$?
Hilbert Space
A Hilbert space is a vector space endowed with an inner product (and complete with respect to the norm it induces); the inner product is written $\langle \mathbf{x},\mathbf{y}\rangle$ or $(\mathbf{x} \cdot \mathbf{y})$. The Hilbert-Schmidt theorem (used in equation (35) of the Cortes-Vapnik paper) says that the eigenvectors of a compact self-adjoint operator on a Hilbert space H (which can be infinite-dimensional) form an orthogonal basis of H. In other words, every vector in H has a unique representation of the form:
$$ \mathbf{x} = \sum_{i=1}^\infty \alpha_i \mathbf{u}_i $$

where the $\mathbf{u}_i$ are the eigenvectors. This is called a spectral decomposition.
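A finite-dimensional analogue (the spectral theorem for symmetric matrices) may help: a symmetric kernel matrix plays the role of the self-adjoint operator, and its eigenvectors form an orthogonal basis. The data and test vector below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = (X @ X.T) ** 2                    # kernel matrix for k(x, y) = (x . y)^2, symmetric

eigenvalues, U = np.linalg.eigh(K)    # columns of U are orthonormal eigenvectors u_i

v = rng.normal(size=5)
coeffs = U.T @ v                      # alpha_i = <v, u_i>
reconstruction = U @ coeffs           # v = sum_i alpha_i u_i  -- the eigenbasis expansion
print(np.allclose(v, reconstruction)) # True
```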
* * *
As a digression, Hilbert spaces can be used to define function spaces. For example, this is the graph of a function with an integer domain:
with values $f(1), f(2), f(3), \dots$ going on to infinity. So each graph corresponds to one function.
Imagine an infinite-dimensional space. Each point in such a space has infinitely many coordinates $(x_1, x_2, \dots)$. So we can say that each point defines a function via
$$f(1) = x_1, \quad f(2) = x_2, \quad f(3) = x_3, ... $$
In other words, each point corresponds to one function. That is the idea of a function space. We can generalize from the integer domain ($\aleph_0$) to the continuum ($2^{\aleph_0}$), thus getting a function space over a continuous domain.
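A tiny illustration of the integer-domain case (my own toy example, not from the talk): a function on $\{1, \dots, N\}$ is just the vector of its values, and the ordinary dot product acts as an inner product between functions:

```python
import numpy as np

N = 5
# The "function" f(i) = i^2 and the "function" g(i) = i, each stored as a point in R^N:
# coordinate i holds the value of the function at i.
f = np.array([(i + 1) ** 2 for i in range(N)], dtype=float)
g = np.array([(i + 1) for i in range(N)], dtype=float)

# The dot product <f, g> = sum_i f(i) g(i) makes this space of functions a Hilbert space.
print(f @ g)   # 1*1 + 4*2 + 9*3 + 16*4 + 25*5 = 225
```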
End of digression :)