MinIsoClust: Isoform clustering using minhash and locality sensitive hashing

Sairam Behera, Jitender S. Deogun, Etsuko N. Moriyama

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

With the advent of next-generation sequencing technologies, computational transcriptome assembly of RNA-Seq data has become a critical step in many biological and biomedical studies. The accuracy of these transcriptome assembly methods is hindered by the presence of alternatively spliced transcripts (isoforms). Identifying and quantifying isoforms is also essential in understanding complex biological functions, many of which are often associated with various diseases. However, clustering of isoform sequences using only sequence identities when quality reference genomes are not available is often difficult due to heterogeneous exon composition among isoforms. Clustering of a large number of transcript sequences also requires a scalable technique. In this study, we propose a minwise-hashing based method, MinIsoClust, for fast and accurate clustering of transcript sequences that can be used to identify groups of isoforms. We tested this new method using simulated datasets. The results demonstrated that MinIsoClust was more accurate than CD-HIT-EST, isONclust, and MMseqs2/Linclust. MinIsoClust also performed better than isONclust and MMseqs2/Linclust in terms of computational time and space efficiency. The source code of MinIsoClust is freely available at https://github.com/srbehera/MinIsoClust.
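
As a rough illustration of the general idea named in the abstract (MinHash signatures combined with locality-sensitive hashing), the following minimal Python sketch groups sequences that share k-mers. It is not the MinIsoClust implementation: the function names (`kmers`, `minhash_signature`, `lsh_clusters`), the parameter values (`k=8`, `num_hashes=32`, `bands=8`), the seeded BLAKE2b hash, and the union-find merging step are all assumptions chosen for illustration, and every input sequence is assumed to be at least `k` bases long.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=8):
    """Return the set of overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # Deterministic 64-bit hash of a k-mer; the seed selects the hash function.
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, num_hashes=32, k=8):
    """MinHash signature: the minimum hash value of the k-mer set per seed."""
    shingles = kmers(seq, k)
    if not shingles:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


def lsh_clusters(seqs, num_hashes=32, bands=8, k=8):
    """Cluster sequences whose signatures agree on at least one LSH band."""
    rows = num_hashes // bands
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    buckets = {}  # (band index, band values) -> first sequence seen there
    for idx, seq in enumerate(seqs):
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            if key in buckets:
                parent[find(idx)] = find(buckets[key])  # merge the two clusters
            else:
                buckets[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(seqs)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

In this sketch, sequences with identical k-mer sets always land in the same cluster, while sequences with no k-mers in common are separated unless two hash values happen to collide. Fewer rows per band make the grouping more permissive, which is the usual trade-off when tuning banded LSH.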

Original language: English (US)
Title of host publication: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
Publisher: Association for Computing Machinery, Inc
ISBN (Electronic): 9781450379649
State: Published - Sep 21 2020
Event: 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020 - Virtual, Online, United States
Duration: Sep 21 2020 → Sep 24 2020

Publication series

Name: Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020

Conference

Conference: 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, BCB 2020
Country/Territory: United States
City: Virtual, Online
Period: 9/21/20 → 9/24/20

Keywords

  • clustering
  • isoform
  • locality-sensitive hashing
  • minwise-hashing

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
  • Biomedical Engineering
  • Health Informatics
