Karaman, İbrahim (4); Hedemann, Mette Skou (4); Knudsen, Knud Erik Bach (4); Qannari, El Mostafa (2); Kohler, Achim (3)
(1) Department of Animal Science - Molecular nutrition and reproduction, Department of Animal Science, Science and Technology, Aarhus University
(2) UNAM University, ONIRIS, USC "Sensometrics and Chemometrics Laboratory"
(3) CIGENE - Center for Integrative Genetics, Dept. of Mathematical Sciences and Technology (IMT), Norwegian University of Life Sciences
(4) Department of Animal Science - Molecular nutrition and reproduction, Department of Animal Science, Science and Technology, Aarhus University
When applying LC-MS or NMR spectroscopy in metabolomics studies, high-dimensional data are generated and effective tools for variable selection are needed in order to detect the important metabolites. Methods based on sparsity combined with PLSR have recently attracted attention in the field of genomics (Lê Cao et al., 2008). They quickly became well established in the field of statistics because of their close relationship to the elastic net. In sparse variable selection combined with PLSR, soft thresholding is applied to each loading weight separately. In the field of chemometrics, Jack-knifing has been introduced for variable selection in PLSR (Westad and Martens, 2000). Jack-knifing has been frequently applied in the field of spectroscopy and is implemented in software tools such as The Unscrambler. In Jack-knifing, the uncertainties of the regression coefficients are estimated and a t-test is applied to these estimates in order to assess whether the regression coefficient associated with each variable is significantly different from zero. In a recent study we compared sparse PLSR and Jack-knife PLSR for FTIR spectroscopic data, metabolomics data (LC-MS, NMR) and simulated data. While sparse PLSR turned out to be very stable in terms of the selected variables and selected uninformative variables only to a minor degree, Jack-knife PLSR turned out to be prone to selecting uninformative variables when these variables have a high stability. This is because in Jack-knife PLSR the stability of the regression coefficients is estimated by cross-validation, so very stable uninformative variables may be selected even when their regression coefficients are small. We have therefore suggested adding a perturbation parameter to the estimation formula for t-values in Jack-knife PLSR. We show that by optimizing this parameter, the selection of uninformative variables in Jack-knife PLSR can be substantially suppressed.
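The two selection rules described above can be illustrated with a minimal sketch. It is not the implementation used in the study: the function names, the (M-1)/M jack-knife scaling, and in particular the form of the perturbation (a constant `eps` added to the standard error in the denominator of the t-value) are assumptions made for illustration, since the abstract does not state the formula.

```python
import math


def soft_threshold(w, lam):
    """Soft thresholding of one loading weight vector.

    Each entry is shrunk toward zero by lam; entries with |w_j| <= lam
    become zero, which is what removes variables in sparse PLSR.
    """
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in w]


def jackknife_t_values(b_full, b_segments, eps=0.0):
    """t-like values for regression coefficients from cross-validation.

    b_full     : coefficients of the model fitted on all samples
    b_segments : one coefficient vector per left-out segment
    eps        : hypothetical perturbation added to the standard error
                 (assumed form, not taken from the abstract)
    """
    m = len(b_segments)
    t_values = []
    for j, b in enumerate(b_full):
        # Squared deviations of the segment models from the full model.
        ss = sum((seg[j] - b) ** 2 for seg in b_segments)
        # Assumed jack-knife scaling of the variance estimate.
        se = math.sqrt(ss * (m - 1) / m)
        denom = se + eps
        if denom > 0.0:
            t_values.append(b / denom)
        else:
            # Zero spread and no perturbation: the ratio is undefined.
            t_values.append(math.copysign(math.inf, b) if b != 0 else 0.0)
    return t_values


# Toy data: variable 0 is large but moderately variable across segments,
# variable 1 is tiny but almost perfectly stable.
b_full = [1.0, 0.01]
b_segments = [[0.8, 0.0101], [1.2, 0.0099], [1.1, 0.0100], [0.9, 0.0100]]

t_plain = jackknife_t_values(b_full, b_segments)
t_perturbed = jackknife_t_values(b_full, b_segments, eps=0.05)
```

In this toy example the small but very stable coefficient receives the larger t-value when `eps` is zero, whereas with `eps=0.05` its t-value falls well below that of the large coefficient. In practice `eps` would have to be tuned, for example by cross-validation.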
Both sparse variable selection and Jack-knifing can be extended to multi-block methods. Whereas the extension of Jack-knife PLSR to a multi-block situation is straightforward, sparse variable selection requires effective iterative algorithms for multi-block situations. In this study we suggest a NIPALS algorithm for sparse PLSR that is equivalent to the algorithms suggested by Lê Cao et al. (2008). We show that this NIPALS algorithm can easily be extended to a multi-block situation, while the close relationship to the elastic net is preserved.
K. A. Lê Cao, D. Rossouw, C. Robert-Granié, and P. Besse, A sparse PLS for variable selection when integrating omics data, Statistical Applications in Genetics and Molecular Biology, 7 (2008).
F. Westad and H. Martens, Variable selection in near infrared spectroscopy based on significance testing in partial least squares regression, Journal of Near Infrared Spectroscopy, 8 (2000) 117-124.