TY - GEN
T1 - Parallel NGS assembly using distributed assembly graphs enriched with biological knowledge
AU - Warnke-Sommer, Julia D.
AU - Ali, Hesham H.
PY - 2017/6/30
Y1 - 2017/6/30
N2 - High performance computing has become essential for many biomedical applications as the production of biological data continues to increase. Next Generation Sequencing (NGS) technologies are capable of producing millions to even billions of short DNA fragments called reads. These short reads are assembled into larger sequences called contigs by graph-theoretic software tools called assemblers. High performance computing has been applied to reduce the computational burden of several steps of the NGS data assembly process. Several parallel de Bruijn graph assemblers rely on a distributed assembly graph. However, the majority of assemblers that utilize distributed assembly graphs do not take the input properties of the data set into consideration to improve the graph partitioning process. Furthermore, the graph-theoretic foundation for the majority of these assemblers is a distributed de Bruijn graph. In this paper, we introduce a distributed overlap-graph-based model upon which our parallel assembler Focus is built. The contribution of this paper is three-fold. First, we demonstrate that the application of data-specific knowledge regarding the inherent linearity of DNA sequences can be used to improve the partitioning processes for distributing the assembly graph. Second, we implement several parallel graph algorithms for assembly with greatly improved speedup. Finally, we demonstrate that for metagenomics datasets, the graph partitioning provides insights into the structure of the microbial community.
AB - High performance computing has become essential for many biomedical applications as the production of biological data continues to increase. Next Generation Sequencing (NGS) technologies are capable of producing millions to even billions of short DNA fragments called reads. These short reads are assembled into larger sequences called contigs by graph-theoretic software tools called assemblers. High performance computing has been applied to reduce the computational burden of several steps of the NGS data assembly process. Several parallel de Bruijn graph assemblers rely on a distributed assembly graph. However, the majority of assemblers that utilize distributed assembly graphs do not take the input properties of the data set into consideration to improve the graph partitioning process. Furthermore, the graph-theoretic foundation for the majority of these assemblers is a distributed de Bruijn graph. In this paper, we introduce a distributed overlap-graph-based model upon which our parallel assembler Focus is built. The contribution of this paper is three-fold. First, we demonstrate that the application of data-specific knowledge regarding the inherent linearity of DNA sequences can be used to improve the partitioning processes for distributing the assembly graph. Second, we implement several parallel graph algorithms for assembly with greatly improved speedup. Finally, we demonstrate that for metagenomics datasets, the graph partitioning provides insights into the structure of the microbial community.
KW - algorithms
KW - assembly graph
KW - high performance computing
KW - next generation sequencing
UR - http://www.scopus.com/inward/record.url?scp=85028059789&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85028059789&partnerID=8YFLogxK
U2 - 10.1109/IPDPSW.2017.143
DO - 10.1109/IPDPSW.2017.143
M3 - Conference contribution
AN - SCOPUS:85028059789
T3 - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
SP - 273
EP - 282
BT - Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 31st IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017
Y2 - 29 May 2017 through 2 June 2017
ER -