TY - GEN
T1 - Next generation sequence assembler mis-assembly of phage genomes with terminal redundancy
AU - Warnke-Sommer, Julia
AU - Thapa, Ishwor
AU - Ali, Hesham
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/12/16
Y1 - 2015/12/16
N2 - Next generation sequencing (NGS) has become the platform of numerous biomedical applications. The study of viral genomes using NGS technologies has led to the characterization of viral species in numerous environments including the human gut microbiome and plant hosts. Many viral genomes are circular or have terminally redundant ends. Circular or linear viral genomes with indeterminate starting and ending points pose a challenge for NGS assemblers, which may erroneously duplicate sections of these genomes. The length of an assembly, often characterized by the N50 length, is frequently used as an indication of an assembly's completeness and even quality. In this paper, we show that the longest contig produced by various assemblers is not always the best assembly for circular or terminally redundant phage genomes and may represent erroneously repeated genomic regions. Results demonstrate that assembly tools may even produce assembled genomes of different lengths for the same species, depending on content inaccurately repeated, leading to results that might be confusing to or inaccurately used by a researcher. To overcome this problem, we introduce strategies for using coverage depth to identify inaccurately repeated content in circular or terminally redundant phage genomes. We conclude the paper by providing the results of assembling two bacteriophage genomes and a bacteriophage metagenomics dataset, highlighting the impact of using the proposed strategies.
KW - Assembly validation
KW - Next generation sequencing
KW - Viral genome assembly
UR - http://www.scopus.com/inward/record.url?scp=84962425677&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84962425677&partnerID=8YFLogxK
U2 - 10.1109/BIBM.2015.7359836
DO - 10.1109/BIBM.2015.7359836
M3 - Conference contribution
AN - SCOPUS:84962425677
T3 - Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
SP - 1102
EP - 1108
BT - Proceedings - 2015 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
A2 - Schapranow, Ing. Matthieu
A2 - Zhou, Jiayu
A2 - Hu, Xiaohua Tony
A2 - Ma, Bin
A2 - Rajasekaran, Sanguthevar
A2 - Miyano, Satoru
A2 - Yoo, Illhoi
A2 - Pierce, Brian
A2 - Shehu, Amarda
A2 - Gombar, Vijay K.
A2 - Chen, Brian
A2 - Pai, Vinay
A2 - Huan, Jun
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2015
Y2 - 9 November 2015 through 12 November 2015
ER -