Classification of breast cancer patients using somatic mutation profiles and machine learning approaches

Suleyman Vural, Xiaosheng Wang, Chittibabu Guda

Research output: Contribution to journal › Article › peer-review

56 Scopus citations


Background: The high degree of heterogeneity observed in breast cancers makes it very difficult to classify patients into distinct clinical subgroups and consequently limits the ability to devise effective therapeutic strategies. Several classification strategies based on ER/PR/HER2 expression or the expression profiles of a panel of genes have helped, but such methods often produce misleading results because expression levels are dynamic. In contrast, somatic DNA mutations are relatively stable and lead to the initiation and progression of many sporadic cancers. Hence, in this study, we explore the use of gene mutation profiles to classify, characterize and predict the subgroups of breast cancers.

Results: We analyzed the whole exome sequencing data from 358 ethnically similar breast cancer patients in The Cancer Genome Atlas (TCGA) project. Somatic and non-synonymous single nucleotide variants identified from each patient were assigned a quantitative score (C-score) that represents the extent of negative impact on the gene function. Applying the non-negative matrix factorization method to these scores, we clustered the patients into three subgroups. By comparing the clinical stage of patients, we identified an early-stage-enriched and a late-stage-enriched subgroup. Comparison of the mutation scores of the early-stage-enriched and late-stage-enriched subgroups identified 358 genes that carry significantly higher mutation rates in the late-stage-enriched subgroup. Functional characterization of these genes revealed important functional gene families that carry a heavy mutational load in the late-stage-enriched subgroup of patients. Finally, using the identified subgroups, we also developed a supervised classification model to predict the stage of the patients.

Conclusions: This study demonstrates that gene mutation profiles can be effectively used with unsupervised machine-learning methods to identify clinically distinguishable breast cancer subgroups. The classification model developed in this study could provide a reasonable prediction of the cancer patients' stage solely based on their mutation profiles. This study represents the first use of only somatic mutation profile data to identify and predict breast cancer subgroups, and this generic methodology can also be applied to other cancer datasets.
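The clustering step described in the Results can be illustrated with a minimal sketch. It assumes a non-negative patients-by-genes score matrix and factors it with plain multiplicative-update non-negative matrix factorization, then assigns each patient to the factor with the largest weight. The abstract does not state the authors' implementation, parameters, or assignment rule, so the function names (`nmf`, `assign_subgroups`), the iteration count, and the synthetic toy matrix standing in for real C-scores are all assumptions made for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative V (patients x genes) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius-norm
    objective. Hypothetical helper, not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps   # patient-by-factor weights
    H = rng.random((k, m)) + eps   # factor-by-gene loadings
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def assign_subgroups(W):
    """Assign each patient to the factor with the largest weight (assumed rule)."""
    return W.argmax(axis=1)

# Synthetic toy matrix standing in for per-gene mutation scores:
# 60 hypothetical patients, 30 hypothetical genes, 3 planted blocks.
rng = np.random.default_rng(42)
n_per_group, genes_per_block, k = 20, 10, 3
V = rng.random((k * n_per_group, k * genes_per_block)) * 0.1  # low background
for g in range(k):
    rows = slice(g * n_per_group, (g + 1) * n_per_group)
    cols = slice(g * genes_per_block, (g + 1) * genes_per_block)
    V[rows, cols] += 1.0 + rng.random((n_per_group, genes_per_block))

W, H = nmf(V, k)
labels = assign_subgroups(W)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("subgroup sizes:", np.bincount(labels, minlength=k))
print("relative reconstruction error:", round(float(rel_error), 3))
```

In a real analysis the resulting subgroup labels would then be cross-tabulated against clinical stage to look for early-stage-enriched and late-stage-enriched groups, as the abstract describes; that step is omitted here because it depends on clinical annotations not given in this record.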

Original language: English (US)
Article number: 62
Journal: BMC Systems Biology
State: Published - Aug 26 2016


  • Breast cancer classification
  • Breast cancer subtypes
  • Cancer stage prediction
  • Gene mutation profiles
  • TCGA
  • Unsupervised and supervised machine learning
  • Whole exome sequencing data analysis

ASJC Scopus subject areas

  • Structural Biology
  • Modeling and Simulation
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

