1 Department of Applied Mathematics and Computer Science, Technical University of Denmark
2 Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark
3 Copenhagen Center for Health Technology, Center, Technical University of Denmark
Many important machine learning models, supervised and unsupervised, are based on simple Euclidean distance or orthogonal projection in a high-dimensional feature space. When such models are estimated from small training sets, the input vectors of the training set do not span the full input space. Hence, when applied to future data, the model is effectively blind to the orthogonal subspace missed by the training data. This can inflate the variance of hidden variables estimated on the training set, so that on test data the hidden variables follow a different probability law with less variance. While the problem and the basic means to reconstruct and deflate are well understood in unsupervised learning, the supervised case is less well understood. We investigate the effect of variance inflation in supervised learning, including the case of Support Vector Machines (SVMs), and we propose a non-parametric scheme to restore proper generalizability. We illustrate the algorithm and its ability to restore performance on a wide range of benchmark data sets.
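The variance inflation effect described in the abstract can be reproduced in a few lines. The following is only a minimal sketch under stated assumptions, not the scheme proposed in the paper: it uses PCA as an assumed example of an orthogonal-projection model, draws synthetic Gaussian data with a single signal direction, and compares the variance of the leading projection on training and test data. The function name `variance_inflation_demo` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def variance_inflation_demo(n_train=20, n_test=2000, dim=500,
                            signal_std=3.0, seed=0):
    """Compare train and test variance along the leading principal direction.

    Assumed toy setting: one signal direction with standard deviation
    `signal_std`, plus isotropic unit-variance Gaussian noise in `dim`
    dimensions, with far fewer training samples than dimensions.
    """
    rng = np.random.default_rng(seed)
    direction = np.zeros(dim)
    direction[0] = 1.0  # the true signal lies along the first axis

    def sample(n):
        latent = rng.normal(0.0, signal_std, size=(n, 1))
        return latent * direction + rng.normal(size=(n, dim))

    x_train = sample(n_train)
    x_test = sample(n_test)

    # Estimate the leading principal direction from the training set only.
    mean = x_train.mean(axis=0)
    _, _, vt = np.linalg.svd(x_train - mean, full_matrices=False)
    w = vt[0]

    # Project both sets onto the estimated direction (the "hidden variable").
    train_proj = (x_train - mean) @ w
    test_proj = (x_test - mean) @ w
    return train_proj.var(), test_proj.var()


if __name__ == "__main__":
    train_var, test_var = variance_inflation_demo()
    print(f"train variance: {train_var:.2f}")
    print(f"test variance:  {test_var:.2f}")
```

In this setting the estimated direction absorbs noise that is specific to the training samples, so the projected training variance comes out larger than the projected test variance. Increasing `n_train` relative to `dim` should shrink the gap.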
Pattern Recognition Letters, 2013, Vol 34, Issue 16, p. 2173-2180