Abstract
Next generation sequencing (NGS) has become a major focus in many recent biological research applications. NGS produces thousands to millions of short DNA fragments in a single run. Individually, these fragments represent only a small fraction of the original biological sample. To obtain useful information, overlapping fragments must be assembled into long stretches of contiguous sequence. Various assemblers have been developed to address this fragment assembly problem. Most current assemblers fill an important gap; however, they were developed with a purely computational focus, without taking the properties of the input datasets into consideration. NGS dataset characteristics such as fragment coverage and underlying genome complexity vary dramatically between sequencing applications, so generic, data-independent assemblers are unlikely to produce accurate solutions in all problem domains. In this study, we propose a graph-theoretic approach based on the concept of tolerance graphs to develop a domain-specific assembler. The proposed assembler is designed to extract signals associated with local features in the input dataset and reintegrate this knowledge into the assembly process through customized tolerance graph parameters. We conducted a number of experiments to study the impact of various input parameters on the quality of the assembled genomes. Results from this study show that the proposed assembler produces excellent results and outperforms other known assembly algorithms for some input datasets. This approach also provides a foundation for developing domain-specific assemblers that can be applied in an intelligent and customized manner to a wide variety of input instances, resulting in more efficient assembly tactics and improved overall assembly quality.
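As a rough illustration of the tolerance graph concept named in the abstract, the sketch below uses the standard definition: each vertex carries an interval and a positive tolerance, and two vertices are adjacent when their intervals overlap by at least the smaller of the two tolerances. The `Read` class, its fields, the function names, and the sample coordinates are all hypothetical. Fragment positions are assumed to be known here purely for illustration, and the abstract does not say how the proposed assembler derives its tolerance parameters, so this is not the authors' method.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Read:
    """Hypothetical fragment placed on a coordinate axis (illustration only)."""
    name: str
    start: int      # inclusive start coordinate
    end: int        # exclusive end coordinate
    tolerance: int  # positive minimum overlap associated with this fragment


def overlap_length(a: Read, b: Read) -> int:
    """Length of the intersection of the two intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def tolerance_edges(reads: list[Read]) -> list[tuple[str, str]]:
    """Edges of the tolerance graph: overlap >= min of the two tolerances."""
    edges = []
    for a, b in combinations(reads, 2):
        if overlap_length(a, b) >= min(a.tolerance, b.tolerance):
            edges.append((a.name, b.name))
    return edges


# Made-up sample data: r1/r2 overlap by 25 bases, r2/r3 by only 10.
reads = [
    Read("r1", 0, 100, tolerance=30),
    Read("r2", 75, 175, tolerance=20),
    Read("r3", 165, 265, tolerance=40),
]
edges = tolerance_edges(reads)
```

With these made-up values, only `r1` and `r2` are joined: their 25-base overlap meets the smaller tolerance (20), while the 10-base overlap between `r2` and `r3` does not. Raising or lowering per-fragment tolerances changes which overlaps count as edges, which is the kind of dataset-dependent tuning the abstract alludes to.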
Original language | English (US) |
---|---|
Title of host publication | Proceedings - IEEE 13th International Conference on Data Mining Workshops, ICDMW 2013 |
Publisher | IEEE Computer Society |
Pages | 88-95 |
Number of pages | 8 |
State | Published - 2013 |
Event | 2013 13th IEEE International Conference on Data Mining Workshops, ICDMW 2013 - Dallas, TX; Duration: Dec 7 2013 → Dec 10 2013 |
Keywords
- Graph theory
- Knowledge-based genome assembly
- Next generation sequencing
- Tolerance graph
ASJC Scopus subject areas
- Software