1 Department of Food Science, Faculty of Science, Københavns Universitet
2 Department of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Iran
3 Københavns Universitet
As it becomes easier to measure and calculate multiple descriptors per molecule in quantitative structure-activity relationship (QSAR) studies, variable selection for data reduction and improved interpretability is becoming increasingly important. While variable selection has been studied extensively in the context of supervised learning, this paper proposes an unsupervised learning method for variable selection and assesses its performance on a typical QSAR data set. The data set comprises 107 derivatives of the HEPT molecule, characterized by 160 descriptors encoding their steric, hydrophobic, electronic and structural features. Because the proposed algorithm uses no real dependent variable, the variable selection is genuinely unsupervised. Instead, scores, which are linear combinations of the data variables, serve as artificial dependent variables. The aims of the procedure are to generate a subset of relevant descriptors from the data set, eliminate redundancy, and reduce multicollinearity. The core of the methodology is the jack-knife resampling method. Here, jack-knife resampling selected 48 of the 160 initial descriptors while preserving the information in the data. Lastly, applying the influence effect on prediction yielded eight descriptors as representatives of the 160 descriptors. The model constructed with the final eight descriptors has Q²(IN) = 0.67, R² = 0.74 and Q²(EXT) = 0.85, which indicates that the strategy adequately preserves the data structure. (C) 2013 Elsevier B.V. All rights reserved.
Chemometrics and Intelligent Laboratory Systems, 2013, Vol 128, p. 135-143
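The abstract does not give the algorithm's details, so the following is only a minimal sketch of the general idea of jack-knife-based unsupervised variable screening, not the authors' method. It assumes that the "scores" are first-principal-component scores, that a variable's relevance is judged by how stable its loading is under leave-one-out resampling, and that a cutoff of 2 on the resulting ratio is reasonable. The function names, the threshold, and the synthetic data are all hypothetical.

```python
import numpy as np

def first_loading(X):
    """Return the first principal-component loading vector of column-centered X."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the loading vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def jackknife_select(X, t_threshold=2.0):
    """Keep variables whose first-PC loading is stable under leave-one-out.

    Assumption: stability is measured as |loading| / jackknife standard error.
    """
    n, p = X.shape
    ref = first_loading(X)
    reps = np.empty((n, p))
    for i in range(n):
        v = first_loading(np.delete(X, i, axis=0))
        # The sign of a singular vector is arbitrary, so align it with the reference.
        if v @ ref < 0:
            v = -v
        reps[i] = v
    mean = reps.mean(axis=0)
    # Leave-one-out jackknife standard error of each loading.
    se = np.sqrt((n - 1) / n * ((reps - mean) ** 2).sum(axis=0))
    t = np.abs(ref) / np.maximum(se, 1e-12)
    return np.flatnonzero(t > t_threshold), t

# Synthetic illustration (not the HEPT data): 5 descriptors share one latent
# factor, 7 descriptors are pure noise.
rng = np.random.default_rng(0)
z = rng.normal(size=(60, 1))
X = np.hstack([3.0 * z + 0.3 * rng.normal(size=(60, 5)),
               rng.normal(size=(60, 7))])
selected, t_values = jackknife_select(X)
print("selected descriptor indices:", selected)
```

In this toy setting the five correlated descriptors get much larger stability ratios than the noise descriptors. A second step, such as the influence-on-prediction ranking mentioned in the abstract, would be needed to prune redundant descriptors among those retained.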