TY - GEN
T1 - KmerEstimate
T2 - 9th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM-BCB 2018
AU - Behera, Sairam
AU - Gayen, Sutanu
AU - Deogun, Jitender S.
AU - Vinodchandran, N. V.
N1 - Funding Information:
The works of S. G. and N. V. V. are supported in part by NSF grant CCF-1422668.
Publisher Copyright:
© 2018 ACM.
PY - 2018/8/15
Y1 - 2018/8/15
N2 - The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Some examples of these include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive sampling based streaming algorithm due to Bar-Yossef et al. for approximating distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard as the sample size is almost 85% less than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.
AB - The frequency distribution of k-mers (substrings of length k in a DNA/RNA sequence) is very useful for many bioinformatics applications that use next-generation sequencing (NGS) data. Some examples of these include de Bruijn graph based assembly, read error correction, genome size prediction, and digital normalization. In developing tools for such applications, counting (or estimating) k-mers with low frequency is a pre-processing phase. However, computing the k-mer frequency histogram becomes computationally challenging for large-scale genomic data. We present KmerEstimate, a streaming algorithm that approximates the count of k-mers with a given frequency in a genomic data set. Our algorithm is based on a well-known adaptive sampling based streaming algorithm due to Bar-Yossef et al. for approximating distinct elements in a data stream. We implemented and tested our algorithm on several data sets. The results of our algorithm are better than those of other streaming approaches used so far for this problem (notably ntCard, the state-of-the-art streaming approach) and are within a 0.6% error rate. It uses less memory than ntCard as the sample size is almost 85% less than that of ntCard. In addition, our algorithm has provable approximation and space usage guarantees. We also show certain space complexity lower bounds. The source code of our algorithm is available at https://github.com/srbehera11/KmerEstimate.
KW - Genome assembly
KW - K-mer counting
KW - Streaming algorithm
UR - http://www.scopus.com/inward/record.url?scp=85056109624&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85056109624&partnerID=8YFLogxK
U2 - 10.1145/3233547.3233587
DO - 10.1145/3233547.3233587
M3 - Conference contribution
AN - SCOPUS:85056109624
T3 - ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
SP - 438
EP - 447
BT - ACM-BCB 2018 - Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
PB - Association for Computing Machinery, Inc
Y2 - 29 August 2018 through 1 September 2018
ER -