Evaluation of Multivariate Classification Models for Analyzing NMR Metabolomics Data

Thao Vu, Parker Siemek, Fatema Bhinderwala, Yuhang Xu, Robert Powers

Research output: Contribution to journal › Article › peer-review

19 Scopus citations


Analytical techniques such as NMR and mass spectrometry can generate large metabolomics data sets containing thousands of spectral features derived from numerous biological observations. Multivariate data analysis is routinely used to uncover the underlying biological information contained within these large metabolomics data sets. This is typically accomplished by classifying the observations into groups (e.g., control versus treated) and by identifying associated discriminating features. There are a variety of classification models to select from, which include some well-established techniques (e.g., principal component analysis [PCA], orthogonal projection to latent structure [OPLS], or partial least-squares projection to latent structures [PLS]) and newly emerging machine learning algorithms (e.g., support vector machines or random forests). However, it is unclear which classification model, if any, is an optimal choice for the analysis of metabolomics data. Herein, we present a comprehensive evaluation of five common classification models that are routinely employed in the metabolomics field and are currently available in our MVAPACK metabolomics software package. Simulated and experimental NMR data sets with various levels of group separation were used to evaluate each model. Model performance was assessed by classification accuracy rate, by the area under the receiver operating characteristic curve (AUROC), and by the identification of true discriminating features. Our findings suggest that the five classification models perform equally well with robust data sets. Only when the models are stressed with subtle data set differences does OPLS emerge as the best-performing model. OPLS maintained a high prediction accuracy rate and a large area under the ROC curve while yielding loadings closest to the true loadings when group separation was limited.
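As an illustration of two of the performance measures named in the abstract, the sketch below computes a classification accuracy rate and an AUROC for a two-group problem (control versus treated). It is a minimal, standard-library example and is not taken from the paper or from MVAPACK: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration. The AUROC is computed with the pairwise (Mann-Whitney) formulation, which gives the same value as the area under the ROC curve.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of observations whose predicted group matches the true group."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen positive (label 1) scores
    higher than a randomly chosen negative (label 0); ties count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both groups must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (not from the study): 0 = control, 1 = treated.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.5, 0.8, 0.9]  # assumed model outputs
y_pred = [int(s >= 0.5) for s in scores]  # assumed 0.5 decision threshold

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"AUROC    = {auroc(y_true, scores):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes how well the scores rank the two groups across all thresholds, which is one reason the two measures are often reported together.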

Original language: English (US)
Pages (from-to): 3282-3294
Number of pages: 13
Journal: Journal of Proteome Research
Issue number: 9
State: Published - Sep 6 2019


  • NMR
  • classification models
  • metabolomics
  • multivariate

ASJC Scopus subject areas

  • General Chemistry
  • Biochemistry

