1 Department of Animal Science - Molecular nutrition and reproduction, Department of Animal Science, Science and Technology, Aarhus University
2 Centre for Integrative Genetics (CIGENE), Department of Mathematical Sciences and Technology (IMT), Norwegian University of Life Sciences
3 Department of Animal Science - Molecular nutrition and reproduction, Department of Animal Science, Science and Technology, Aarhus University
Partial least squares regression (PLSR) has been applied in fields such as psychometrics, consumer science, econometrics and process control. Recently it has been applied to metabolomics data sets (GC/LC-MS, NMR) and has proven to be very powerful in situations with many variables, where it reduces over-fitting problems and provides useful interpretation tools. It offers excellent possibilities for a graphical overview of sample and variation patterns. It handles collinearity efficiently and makes it possible to use different, highly correlated data sets in one integrated approach. Because of the high number of variables in such data sets (both in the raw data and after peak picking), selecting important variables in an explorative analysis is difficult, especially when different metabolomics data sets need to be related. Variable selection (or removal of irrelevant variables) aids the model by improving predictions, providing better interpretation and decreasing measurement costs. In addition, over-fitting is an issue when dealing with a high number of variables. To overcome this, we used cross-model validation to validate the models. In this paper, different strategies for variable selection in PLSR were considered and compared with respect to the selected subset of variables and the possibility for biological validation. Sparse PLSR [1] and PLSR with jack-knifing [2] were applied to the data in order to achieve variable selection prior to comparison. Sparse PLSR is based on penalization of the loading weights (by elastic net, soft/hard thresholding etc.) of a PLSR model. In PLSR with jack-knifing, the significance of variables is calculated by an uncertainty test. The data set used in this study is LC-MS data from an animal intervention study.
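As an illustration of the penalization idea behind sparse PLSR, the following is a minimal sketch of soft thresholding applied to the first loading weight vector of a single-response PLSR model. It is not the implementation used in the study: the function names, the tuning parameter `frac` and the simulated data are assumptions introduced here for illustration only.

```python
import numpy as np


def soft_threshold(w, lam):
    """Shrink every entry of w toward zero by lam; entries below lam become 0."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)


def sparse_pls1_weights(X, y, frac=0.5):
    """First-component loading weights of a PLS1 model after soft thresholding.

    frac is a hypothetical tuning parameter (0 <= frac < 1): the threshold is
    set to frac times the largest absolute weight, so at least one variable
    always keeps a non-zero weight.
    """
    Xc = X - X.mean(axis=0)            # centre each variable
    yc = y - y.mean()                  # centre the response
    w = Xc.T @ yc                      # ordinary PLS1 direction: covariance with y
    w = w / np.linalg.norm(w)          # unit length, as in standard PLSR
    w_sparse = soft_threshold(w, frac * np.max(np.abs(w)))
    return w_sparse / np.linalg.norm(w_sparse)


# Simulated data (assumption): 40 samples, 200 variables, 3 of them informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
y = X[:, 0] + X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=40)

w = sparse_pls1_weights(X, y, frac=0.5)
selected = np.flatnonzero(w)           # indices of variables with non-zero weight
print(f"{selected.size} of {X.shape[1]} variables retained")
```

Variables whose weight is shrunk to exactly zero drop out of the component, which is how the penalty performs variable selection. In practice the threshold would be tuned by validation rather than fixed in advance.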
The aim of the metabolomics study was to investigate the metabolic profile of pigs fed various cereal fractions, with special attention to the metabolism of lignans, using an LC-MS-based metabolomics approach.

References
1. Lê Cao KA, Rossouw D, Robert-Granié C, Besse P. A Sparse PLS for Variable Selection when Integrating Omics data. Statistical Applications in Genetics and Molecular Biology, 7:Article 35, 2008.
2. Martens H, Martens M. Modified Jack-knife estimation of parameter uncertainty in bilinear modelling by partial least squares regression (PLSR). Food Quality and Preference, 11:5-16, 2000.