TY - JOUR
T1 - Model-based clustering with certainty estimation
T2 - Implication for clade assignment of influenza viruses
AU - Zhang, Shunpu
AU - Li, Zhong
AU - Beland, Kevin
AU - Lu, Guoqing
N1 - Funding Information:
This publication was made possible by NIH grant number R01 LM009985-01A1. The authors also acknowledge the UCRCA, the University of Nebraska at Omaha (UNO), for continuous funding support to this research program.
Publisher Copyright:
© 2016 The Author(s).
PY - 2016/7/21
Y1 - 2016/7/21
N2 - Background: Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. There remain issues such as how to cluster molecular sequences accurately and in particular how to evaluate the certainty of clustering results. Results: We presented a model-based clustering method to analyze molecular sequences, described a subset bootstrap scheme to evaluate a certainty of the clusters, and showed an intuitive way using 3D visualization to examine clusters. We applied the above approach to analyze influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for high pathogenic H5N1 avian influenza, which agree with previous findings. The certainty for a given sequence that can be correctly assigned to a cluster was all 1.0 whereas the certainty for a given cluster was also very high (0.92-1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; such certainty variation is likely attributed to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method are all higher than those calculated based upon the standard bootstrap method, suggesting our bootstrap scheme is applicable for the estimation of clustering certainty. Conclusions: We formulated a clustering analysis approach with the estimation of certainties and 3D visualization of sequence data. We analysed 2 sets of influenza A HA sequences and the results indicate our approach was applicable for clustering analysis of influenza viral sequences.
AB - Background: Clustering is a common technique used by molecular biologists to group homologous sequences and study evolution. There remain issues such as how to cluster molecular sequences accurately and in particular how to evaluate the certainty of clustering results. Results: We presented a model-based clustering method to analyze molecular sequences, described a subset bootstrap scheme to evaluate a certainty of the clusters, and showed an intuitive way using 3D visualization to examine clusters. We applied the above approach to analyze influenza viral hemagglutinin (HA) sequences. Nine clusters were estimated for high pathogenic H5N1 avian influenza, which agree with previous findings. The certainty for a given sequence that can be correctly assigned to a cluster was all 1.0 whereas the certainty for a given cluster was also very high (0.92-1.0), with an overall clustering certainty of 0.95. For influenza A H7 viruses, ten HA clusters were estimated and the vast majority of sequences could be assigned to a cluster with a certainty of more than 0.99. The certainties for clusters, however, varied from 0.40 to 0.98; such certainty variation is likely attributed to the heterogeneity of sequence data in different clusters. In both cases, the certainty values estimated using the subset bootstrap method are all higher than those calculated based upon the standard bootstrap method, suggesting our bootstrap scheme is applicable for the estimation of clustering certainty. Conclusions: We formulated a clustering analysis approach with the estimation of certainties and 3D visualization of sequence data. We analysed 2 sets of influenza A HA sequences and the results indicate our approach was applicable for clustering analysis of influenza viral sequences.
KW - Bootstrap
KW - Certainty
KW - Influenza A hemagglutinin (HA)
KW - Model-based clustering
KW - Multidimensional scaling
UR - http://www.scopus.com/inward/record.url?scp=84978795065&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84978795065&partnerID=8YFLogxK
U2 - 10.1186/s12859-016-1147-x
DO - 10.1186/s12859-016-1147-x
M3 - Article
C2 - 27439701
AN - SCOPUS:84978795065
VL - 17
JO - BMC Bioinformatics
JF - BMC Bioinformatics
SN - 1471-2105
IS - 1
M1 - 287
ER -