dwarf is a distributed memory Python sparse graph library that can handle billions of vertices. It was inspired by Google’s Pregel which was in turn inspired by Valiant’s bulk synchronous parallel paradigm.
It is being designed with de novo genome assembly specifically in mind.
Much of the project is still being ironed out. If you are interested in using dwarf or learning more, email spenthil@gmail.com. To whet your appetite, the following code walks you through creating a distributed graph, finding the distance from a given vertex to all other vertices, and finding the PageRank of each vertex.
First we need to create a graph instance
import dwarf
graph = dwarf.Graph()
# find and join an existing masterless cluster on this subnet
# if there is no existing cluster, create one
graph.join()
Edges and vertices added on one computer will automatically be known by all members of the cluster.
map(graph.add_edge, (
('John', 'Mary'),
('John', 'Jennifer'),
('Mary', 'Jennifer'),
('Jennifer', 'Nick'),
('Nick', 'Tom'),
))
The cluster now has the following graph represented across it:

dwarf breaks down computation into a series of supersteps. Within each superstep, each vertex can:
- Send messages to other vertices
- Receive messages sent to it in the previous superstep
- Add/remove vertices/edges. This will take into affect at the beginning of the next superstep.
- Modify its internal state.
Since the graph is effectively immutable within a superstep, computation can be be performed on all vertices in parallel without concern for concurrency issues.
Using this paradigm, we can concisely ask questions while effortlessly distributing computation across the cluster. For example, the following finds the distance from a given source vertex to every other vertex in the graph:
def distance_from(vertex, combiner=min):
# `combiner` is used on `vertex.messages` before each superstep
if vertex.state is None or vertex.messages < vertex.state:
# tell vertex's neighbors that there is a path there with
# length `vertex.messages + 1`
graph.send(vertex.successors, vertex.messages + 1)
# update vertex's state in the next superstep
return vertex.messages
# didn't find a shorter way, maintain the current distance
return vertex.state
# message the source vertex to kick off the first superstep
graph.send("John", 0)
graph.do(distance_from)
# "Tom" is the most distant vertex from "John"
assert(graph.max_state() == "Tom")
The following finds the PageRank of each vertex:
def pagerank(vertex, combiner=sum):
# calculate `vertex`'s PageRank for this superstep
score = 0.15/graph.number_of_vertices + 0.85*vertex.messages
# keep on calculating for 30 supersteps
if vertex.superstep < 30:
# if `vertex` is a sink, give all vertices in the graph a
# equal portion of `vertex`'s PageRank, otherwise give all
# successors of `vertex` an equal share of `vertex`'s PageRank
if not vertex.out_degree:
graph.broadcast(score/number_of_vertices)
else:
graph.send(vertex.successors, score/vertex.out_degree)
# ensure this vertex wakes up for next superstep
graph.send(vertex.id, 0)
# the state of `vertex` in the next superstep will be the
# calculated PageRank from this superstep
return score
# message all the vertices with their initial PageRank value
# to kickoff the first superstep
graph.broadcast(1/graph.number_of_vertices)
graph.do(pagerank)
# "Tom" has the highest PageRank
assert(graph.max_state() == "Tom")