dwarf, genome assembly

For a complete overview of genome assembly, take a look at Assembly of large genomes using second-generation sequencing.

glossary of terms

problem overview

Eukaryotic genomes are too large and complex for current technology to sequence from beginning to end. Scientists work around this limitation by breaking the genome into fragments and sequencing them in parallel. Each sequenced fragment is called a read. Piecing the reads back together to reconstruct the original genome, without prior knowledge of the genome sequence, is called de novo genome assembly.

complicating factors

the classic approach

Mathematical graphs naturally represent the overlaps between reads. A graph is composed of vertices connected by edges. Each read is represented by a vertex, and each overlap between two reads is represented by an edge connecting the corresponding vertices.
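As a minimal sketch of this representation, the following Python builds an overlap graph from a handful of toy reads. The function names, the `min_overlap` threshold, and the choice of directed edges (from a read whose suffix matches another read's prefix) are assumptions made for illustration and are not taken from dwarf. Real assemblers also have to handle sequencing errors and reverse complements, which this sketch ignores.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap=3):
    """Return the length of the longest suffix of `a` that is also a
    prefix of `b`, or 0 if it is shorter than `min_overlap`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_overlap=3):
    """Map each read (vertex) to the reads it overlaps (edges),
    storing the overlap length on each edge."""
    graph = {read: {} for read in reads}
    for a, b in permutations(reads, 2):
        length = overlap_length(a, b, min_overlap)
        if length:
            graph[a][b] = length
    return graph


# Hypothetical toy reads, chosen so that consecutive reads share 3 bases.
reads = ["ATGGCG", "GCGTGC", "TGCAAT"]
graph = build_overlap_graph(reads)
```

Comparing every pair of reads takes time quadratic in the number of reads, which is fine for a toy example but not for real data sets.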

The problem of genome assembly is then reduced to finding a path through the graph that visits each vertex (read) exactly once, known as a Hamiltonian path.
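A minimal sketch of the path search on a toy overlap graph follows, assuming the goal is an ordering that uses every read exactly once (a Hamiltonian path). The graph literal, the function names, and the backtracking strategy are illustrative assumptions, not dwarf's method. Finding a Hamiltonian path is NP-complete in general, so this brute-force search is only practical for very small inputs.

```python
def find_read_path(graph):
    """Backtracking search for a path that visits every vertex exactly once.
    Returns the path as a list of reads, or None if no such path exists."""
    def extend(path, visited):
        if len(path) == len(graph):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in graph:
        found = extend([start], {start})
        if found:
            return found
    return None


def merge_path(path, graph):
    """Concatenate the reads along the path, dropping the overlapping prefix
    of each successive read."""
    sequence = path[0]
    for prev, nxt in zip(path, path[1:]):
        sequence += nxt[graph[prev][nxt]:]
    return sequence


# Hypothetical overlap graph: read -> {overlapping read: overlap length}.
toy_graph = {
    "ATGGCG": {"GCGTGC": 3},
    "GCGTGC": {"TGCAAT": 3},
    "TGCAAT": {},
}
path = find_read_path(toy_graph)
assembled = merge_path(path, toy_graph)
print(assembled)  # ATGGCGTGCAAT
```

For these three reads the only valid ordering follows the two overlaps in sequence, and merging along it yields a single 12-base sequence.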

dwarf’s contribution

dwarf differs from existing approaches in a few ways: