Title
Evaluation of multiple variate selection methods from a biological perspective: A nutrigenomics case study
Author
Tapp, H.S.
Radonjic, M.
Kemsley, E.K.
Thissen, U.
Publication year
2012
Abstract
Genomics-based technologies produce large amounts of data. To interpret the results and identify the most important variates related to phenotypes of interest, various multivariate regression and variate selection methods are used. Although inspected for statistical performance, the relevance of multivariate models in interpreting biological data sets often remains elusive. We compare various multivariate regression and variate selection methods applied to a nutrigenomics data set in terms of performance, utility and biological interpretability. The studied data set comprised hepatic transcriptome (10,072 predictor variates) and plasma protein concentrations [2 dependent variates: Leptin (LEP) and Tissue inhibitor of metalloproteinase 1 (TIMP-1)] collected during a high-fat diet study in ApoE3Leiden mice. The multivariate regression methods used were: partial least squares, "PLS"; a genetic algorithm-based multiple linear regression, "GA-MLR"; two least-angle shrinkage methods, "LASSO" and "ELASTIC NET"; and a variant of PLS that uses covariance-based variate selection, "CovProc". Two methods of ranking the genes for Gene Set Enrichment Analysis (GSEA) were also investigated: either by their correlation with the protein data or by the stability of the PLS regression coefficients. The regression methods performed similarly, with CovProc and GA-MLR performing the best and worst, respectively (R-squared values based on "double cross-validation" predictions of 0.762 and 0.451 for LEP; and 0.701 and 0.482 for TIMP-1). CovProc, LASSO and ELASTIC NET all produced parsimonious regression models and consistently identified small subsets of variates, with high commonality between the methods. Comparison of the gene ranking approaches found a high degree of agreement, with PLS-based ranking finding fewer significant gene sets.
We recommend the use of CovProc for variate selection, in tandem with univariate methods, and the use of correlation-based ranking for GSEA-like pathway analysis methods. © The Author(s) 2012.
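The "double cross-validation" R-squared quoted in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic, the function names (ridge_fit, double_cv_predictions, r_squared) are hypothetical, and a plain ridge regression stands in for the study's methods (PLS, GA-MLR, LASSO, ELASTIC NET, CovProc). The point is only the nesting: an inner loop tunes the model on training samples alone, and an outer loop predicts samples that played no part in that tuning.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge regression in dual form, convenient when variates outnumber samples.

    Stand-in model only (an assumption): the study itself did not use ridge.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    gram = Xc @ Xc.T  # n x n matrix, cheap when n << p
    alpha = np.linalg.solve(gram + lam * np.eye(len(y)), yc)
    beta = Xc.T @ alpha
    return beta, y_mean - x_mean @ beta


def predict(model, X):
    beta, intercept = model
    return X @ beta + intercept


def kfold(n, k, rng):
    """Split the positions 0..n-1 into k shuffled folds."""
    return np.array_split(rng.permutation(n), k)


def double_cv_predictions(X, y, lams, k_outer=5, k_inner=4, seed=0):
    """Return one out-of-sample prediction per sample.

    The penalty is chosen in the inner loop using training samples only,
    so the outer test fold never influences model tuning.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    y_hat = np.empty(n)
    for test in kfold(n, k_outer, rng):
        train = np.setdiff1d(np.arange(n), test)
        inner_err = np.zeros(len(lams))
        for val_pos in kfold(len(train), k_inner, rng):
            val = train[val_pos]
            fit = np.setdiff1d(train, val)
            for j, lam in enumerate(lams):
                model = ridge_fit(X[fit], y[fit], lam)
                inner_err[j] += np.sum((y[val] - predict(model, X[val])) ** 2)
        best_lam = lams[int(np.argmin(inner_err))]
        y_hat[test] = predict(ridge_fit(X[train], y[train], best_lam), X[test])
    return y_hat


def r_squared(y, y_hat):
    """R-squared computed from the pooled cross-validated predictions."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


# Synthetic example: 40 samples, 200 variates, 5 of them informative.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 200))
y_demo = X_demo[:, :5] @ np.array([2.0, -1.5, 1.0, 1.0, -2.0]) + rng.normal(size=40)
y_hat_demo = double_cv_predictions(X_demo, y_demo, lams=[0.1, 1.0, 10.0, 100.0])
print("double cross-validated R-squared:", r_squared(y_demo, y_hat_demo))
```

Because every prediction comes from a model whose settings were tuned without that sample, an R-squared computed this way is generally less optimistic than one from a single cross-validation loop that both tunes and evaluates.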
Subject
EELS - Earth, Environmental and Life Sciences
Life
Healthy Living
Healthy for Life
High-fat diet
Microarrays
Multivariate statistical analysis
Nutrigenomics
Pathway analysis
QS - Quality & Safety
To reference this document use:
http://resolver.tudelft.nl/uuid:ce9f5ad8-7d08-47fb-a1c7-76371b15cf36
DOI
https://doi.org/10.1007/s12263-012-0288-4
TNO identifier
464282
ISSN
1555-8932
Source
Genes and Nutrition, 7 (3), 387-397
Document type
article