TY - GEN
T1 - MinIsoClust
T2 - 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
AU - Behera, Sairam
AU - Deogun, Jitender S.
AU - Moriyama, Etsuko N.
N1 - Funding Information:
We thank Dr. Voshall (UNL) for suggestions and critical comments. Part of this work was completed using the Holland Computing Center of the University of Nebraska, which receives support from the Nebraska Research Initiative. This work was supported by NSF EPSCoR RII Track-1: Center for Root and Rhizobiome Innovation (CRRI) Award OIA-1557417 to ENM.
Publisher Copyright:
© 2020 ACM.
PY - 2020/9/21
Y1 - 2020/9/21
N2 - With the advent of next-generation sequencing technologies, computational transcriptome assembly of RNA-Seq data has become a critical step in many biological and biomedical studies. The accuracy of these transcriptome assembly methods is hindered by the presence of alternatively spliced transcripts (isoforms). Identifying and quantifying isoforms is also essential in understanding complex biological functions, many of which are often associated with various diseases. However, clustering of isoform sequences using only sequence identities when quality reference genomes are not available is often difficult due to heterogeneous exon composition among isoforms. Clustering of a large number of transcript sequences also requires a scalable technique. In this study, we propose a minwise-hashing-based method, MinIsoClust, for fast and accurate clustering of transcript sequences that can be used to identify groups of isoforms. We tested this new method using simulated datasets. The results demonstrated that MinIsoClust was more accurate than CD-HIT-EST, isONclust, and MMseqs2/Linclust. MinIsoClust also performed better than isONclust and MMseqs2/Linclust in terms of computational time and space efficiency. The source code of MinIsoClust is freely available at https://github.com/srbehera/MinIsoClust.
AB - With the advent of next-generation sequencing technologies, computational transcriptome assembly of RNA-Seq data has become a critical step in many biological and biomedical studies. The accuracy of these transcriptome assembly methods is hindered by the presence of alternatively spliced transcripts (isoforms). Identifying and quantifying isoforms is also essential in understanding complex biological functions, many of which are often associated with various diseases. However, clustering of isoform sequences using only sequence identities when quality reference genomes are not available is often difficult due to heterogeneous exon composition among isoforms. Clustering of a large number of transcript sequences also requires a scalable technique. In this study, we propose a minwise-hashing-based method, MinIsoClust, for fast and accurate clustering of transcript sequences that can be used to identify groups of isoforms. We tested this new method using simulated datasets. The results demonstrated that MinIsoClust was more accurate than CD-HIT-EST, isONclust, and MMseqs2/Linclust. MinIsoClust also performed better than isONclust and MMseqs2/Linclust in terms of computational time and space efficiency. The source code of MinIsoClust is freely available at https://github.com/srbehera/MinIsoClust.
KW - clustering
KW - isoform
KW - locality-sensitive hashing
KW - minwise-hashing
UR - http://www.scopus.com/inward/record.url?scp=85096965048&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85096965048&partnerID=8YFLogxK
U2 - 10.1145/3388440.3412424
DO - 10.1145/3388440.3412424
M3 - Conference contribution
AN - SCOPUS:85096965048
T3 - Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
BT - Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
PB - Association for Computing Machinery, Inc
Y2 - 21 September 2020 through 24 September 2020
ER -