For a complete overview of genome assembly, take a look at Assembly of large genomes using second-generation sequencing.
glossary of terms
- sequence (noun): the order of nucleotide bases that compose a molecule of DNA: adenine (A), guanine (G), cytosine (C), and thyamine (T)
- sequence (verb): to determine the order of nucleotide bases that compose a molecule of DNA
- read (noun): the sequence of a fragment as proposed by a DNA sequencer. Examples of DNA sequencers include Illumina HiSeq, PacBio RS, and Roche FLX.
problem overview
Eukaryotic genomes are too large and complex for current technology to sequence from beginning to end. Scientists are overcoming this problem by breaking the genome up into fragments and sequencing them in parallel. Each sequenced fragment is referred to as a read. The task of piecing together the reads to reconstruct the original genome, without prior information of the genome sequence, is de novo genome assembly.
complicating factors
-
Reads are relatively tiny and generally do not completely cover repeat regions.
- There has to be more reads to compensate for the shorter read length. This necessitates more computation.
- There is no direct way to tell the length of a repeat that isn’t spanned by a single read.
-
There is up to a 2% error rate in nucleotide bases predicted by DNA sequencers.
the classic approach
Mathematical graphs naturally represents the overlaps in reads. A mathematical graph is composed of vertices that are connected by edges. Each read is represented by a vertex and overlaps between reads are represented by an edge connecting the overlapping reads (vertices).
The problem of genome assembly is then reduced to finding a path through the graph that visits each edge at least once.
dwarf’s contribution
dwarf differs from existing approaches in a few ways:
- Uses the BSP model instead of serial graph transversal for computation. This allows for more efficient algorithms in distributed memory architectures.
- Specifically designed to run on Amazon’s Elastic Compute Cloud (EC2)
- Error correction centers around converting the directed graph to a directed acyclic graph (DAG). Mate pairs are then used to decide between alternate routes through the DAG.