dwarf, about

dwarf is a distributed memory Python sparse graph library that can handle billions of vertices. It was inspired by Google’s Pregel which was in turn inspired by Valiant’s bulk synchronous parallel paradigm.

It is being designed with de novo genome assembly specifically in mind.

Much of the project is still being ironed out. If you are interested in using dwarf or learning more, email spenthil@gmail.com. To whet your appetite, the following code walks you through creating a distributed graph, finding the distance from a given vertex to all other vertices, and finding the PageRank of each vertex.

First we need to create a graph instance

import dwarf
graph = dwarf.Graph()
# find and join an existing masterless cluster on this subnet
# if there is no existing cluster, create one
graph.join()

Edges and vertices added on one computer will automatically be known by all members of the cluster.

map(graph.add_edge, (
    ('John', 'Mary'),
    ('John', 'Jennifer'),
    ('Mary', 'Jennifer'),
    ('Jennifer', 'Nick'),
    ('Nick', 'Tom'),
))

The cluster now has the following graph represented across it:

graph

dwarf breaks down computation into a series of supersteps. Within each superstep, each vertex can:

Since the graph is effectively immutable within a superstep, computation can be be performed on all vertices in parallel without concern for concurrency issues.

Using this paradigm, we can concisely ask questions while effortlessly distributing computation across the cluster. For example, the following finds the distance from a given source vertex to every other vertex in the graph:

def distance_from(vertex, combiner=min):
    # `combiner` is used on `vertex.messages` before each superstep
    if vertex.state is None or vertex.messages < vertex.state:
        # tell vertex's neighbors that there is a path there with
        # length `vertex.messages + 1`
        graph.send(vertex.successors, vertex.messages + 1)
        # update vertex's state in the next superstep
        return vertex.messages
    # didn't find a shorter way, maintain the current distance
    return vertex.state
# message the source vertex to kick off the first superstep
graph.send("John", 0)
graph.do(distance_from)

# "Tom" is the most distant vertex from "John"
assert(graph.max_state() == "Tom")

The following finds the PageRank of each vertex:

def pagerank(vertex, combiner=sum):
    # calculate `vertex`'s PageRank for this superstep
    score = 0.15/graph.number_of_vertices + 0.85*vertex.messages
    # keep on calculating for 30 supersteps
    if vertex.superstep < 30:
        # if `vertex` is a sink, give all vertices in the graph a
        # equal portion of `vertex`'s PageRank, otherwise give all 
        # successors of `vertex` an equal share of `vertex`'s PageRank
        if not vertex.out_degree:
            graph.broadcast(score/number_of_vertices)
        else:
            graph.send(vertex.successors, score/vertex.out_degree)
        # ensure this vertex wakes up for next superstep
        graph.send(vertex.id, 0)
    # the state of `vertex` in the next superstep will be the 
    # calculated PageRank from this superstep
    return score
# message all the vertices with their initial PageRank value 
# to kickoff the first superstep
graph.broadcast(1/graph.number_of_vertices)
graph.do(pagerank)

# "Tom" has the highest PageRank
assert(graph.max_state() == "Tom")