TY - JOUR
T1 - Simple alignment-free methods for protein classification
T2 - A case study from G-protein-coupled receptors
AU - Strope, Pooja K.
AU - Moriyama, Etsuko N.
N1 - Funding Information: We thank Stephen D. Scott (Computer Science, University of Nebraska at Lincoln) for helpful discussion on the machine learning aspects of this project. We thank anonymous reviewers for their constructive comments. This work was funded by a Nebraska EPSCoR Women in Science grant and an NSF EPSCoR Type II grant to E.N.M. and by Bioinformatics Interdisciplinary Research Scholars sponsored by an NSF EPSCoR Infrastructure Improvement grant, Bioinformatics Research Laboratory to P.K.S.
PY - 2007/5
Y1 - 2007/5
N2 - Computational methods of predicting protein functions rely on detecting similarities among proteins. However, sufficient sequence information is not always available for some protein families. For example, proteins of interest may be new members of a divergent protein family. The performance of protein classification methods could vary in such challenging situations. Using the G-protein-coupled receptor superfamily as an example, we investigated the performance of several protein classifiers. Alignment-free classifiers based on support vector machines using simple amino acid compositions were effective in remote-similarity detection even from short fragmented sequences. Although it is computationally expensive, a support vector machine classifier using local pairwise alignment scores showed very good balanced performance. More commonly used profile hidden Markov models were generally highly specific and well suited to classifying well-established protein family members. It is suggested that different types of protein classifiers should be applied to gain the optimal mining power.
AB - Computational methods of predicting protein functions rely on detecting similarities among proteins. However, sufficient sequence information is not always available for some protein families. For example, proteins of interest may be new members of a divergent protein family. The performance of protein classification methods could vary in such challenging situations. Using the G-protein-coupled receptor superfamily as an example, we investigated the performance of several protein classifiers. Alignment-free classifiers based on support vector machines using simple amino acid compositions were effective in remote-similarity detection even from short fragmented sequences. Although it is computationally expensive, a support vector machine classifier using local pairwise alignment scores showed very good balanced performance. More commonly used profile hidden Markov models were generally highly specific and well suited to classifying well-established protein family members. It is suggested that different types of protein classifiers should be applied to gain the optimal mining power.
KW - Amino acid composition
KW - G-protein-coupled receptors
KW - Profile hidden Markov models
KW - Protein classification
KW - Support vector machines
UR - http://www.scopus.com/inward/record.url?scp=34247112742&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34247112742&partnerID=8YFLogxK
U2 - 10.1016/j.ygeno.2007.01.008
DO - 10.1016/j.ygeno.2007.01.008
M3 - Article
C2 - 17336495
AN - SCOPUS:34247112742
SN - 0888-7543
VL - 89
SP - 602
EP - 612
JO - Genomics
JF - Genomics
IS - 5
ER -