So far, we have seen pretty limited parallelism in our graph algorithms:
Is there any way to expose more parallelism?
What would a divide and conquer algorithm look like for graphs?
Could we partition the graph and combine solutions to each partition?
Want nearly equal-sized partitions -- this is hard to compute!
The edges between partitions will make it difficult to solve subproblems independently.
Recall contraction, which we used to implement scan:
We'll develop contraction algorithms to achieve better parallelism in graph algorithms.
How can we partition a graph?
Recall the notion of graph cut:
A graph cut of a graph $G = (V, E)$ is a partitioning of the vertices into $V_1 \subset V$ and $V_2 = V - V_1$.
Each vertex set $V_i \subset V$ defines a vertex-induced subgraph consisting of edges where both endpoints are in $V_i$.
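As a quick sketch of this definition (the helper name induced_subgraph is ours, and edges are assumed to be stored as a set of pairs, as in the code later in this section):

def induced_subgraph(V_i, edges):
    # keep only the edges with both endpoints in the vertex set V_i
    return set([(u, v) for (u, v) in edges if u in V_i and v in V_i])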
For example:
In this partition, we have:
The cut edges are those that join the two subgraphs, e.g., $\{(b,e), (d,f)\}$.
More generally, a graph partition of $G = (V, E)$ is a collection of graphs $\{G_1 = (V_1, E_1), \ldots, G_{k} = (V_{k}, E_{k})\}$ such that each $G_i$ is a vertex-induced subgraph of $G$ and each vertex in $V$ appears in exactly one $V_i$.
We refer to each subgraph $G_i$ as a block or part of $G$.
For a given partition, we have two types of edges in $E$: internal edges, with both endpoints in the same block, and cut edges, with endpoints in different blocks.
Graph contraction solves a problem in rounds, each with three steps (sketched in code below):
contract step: partition the graph into blocks and collapse each block into a single super vertex, yielding a smaller graph.
recursive step: solve the problem on the smaller, contracted graph.
expansion step: expand the result on the contracted graph back into a result for the original graph.
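A minimal sketch of this three-step structure, assuming the caller supplies all three pieces; partition_graph_f, solve_base_f, and expand_f are illustrative placeholder names, not functions defined in these notes:

def contract_solve(vertices, edges, partition_graph_f, solve_base_f, expand_f):
    if len(edges) == 0:
        return solve_base_f(vertices)  # base case: solve trivially
    # contract step: group vertices into blocks of super vertices
    new_vertices, vertex_map = partition_graph_f(vertices, edges)
    new_edges = set([(vertex_map[u], vertex_map[v])
                     for (u, v) in edges if vertex_map[u] != vertex_map[v]])
    # recursive step: solve the problem on the smaller graph
    sub_result = contract_solve(new_vertices, new_edges,
                                partition_graph_f, solve_base_f, expand_f)
    # expansion step: map the result back to the original vertices
    return expand_f(sub_result, vertex_map)

For num_components below, the expansion step is trivial: contracting connected blocks preserves the number of connected components, so the recursive answer is already the final answer.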
We want partitions that can be computed with low span and that reduce the number of vertices by a constant factor each round.
We'll look at different ways of partitioning in a moment. For now, let's look at an example of how contraction works to solve a specific problem without worrying too much about the details of the partitioning.
# graph contraction
# we still need to specify partition_graph_f!
def contract_graph(vertices, edges, partition_graph_f):
    if len(edges) == 0:
        return vertices, edges
    else:
        # partition the graph
        # vertex_map is a dict from vertex -> super vertex,
        # e.g., {'a': 'a', 'b': 'a', 'c': 'a', ...} in the above example
        new_vertices, vertex_map = partition_graph_f(vertices, edges)
        # keep only cut edges, with endpoints renamed to their super vertices
        new_edges = set([(vertex_map[e[0]], vertex_map[e[1]])
                         for e in edges if vertex_map[e[0]] != vertex_map[e[1]]])
        return contract_graph(new_vertices, new_edges, partition_graph_f)
vertices = set(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
edges = set([('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'g'),
('d', 'e'), ('d', 'f'), ('e', 'f'),
('h', 'i'), ('i', 'g'), ('h', 'g')])
contract_graph(vertices, edges, partition_graph_f)
Recall that in recitation we computed the number of connected components in a graph.
How did we do this? What was the worst-case span of our approach?
Now let's think how we might do this with graph contraction.
What does the connectivity in the contracted graph tell us about the connectivity in the original graph?
Since $a$, $b$, $c$ are placed in the same partition, we know they are connected.
Since $a$ and $g$ are connected in the contracted graph, every node in the $g$ partition is reachable from every node in the $a$ partition.
Similarly, since $d$ is not connected to $a$ or $g$, we know that no node in the $d$ partition is reachable from any node in either the $a$ or $g$ partition.
What does the final contracted graph tell us about the number of connected components?
# we still need to specify partition_graph_f!
def num_components(vertices, edges, partition_graph_f):
    if len(edges) == 0:
        # base case: return the number of super vertices in the final partition
        return len(vertices)
    else:
        new_vertices, vertex_map = partition_graph_f(vertices, edges)
        # keep only cut edges
        # can use filter here to do in parallel: O(log |E|) span
        new_edges = set([(vertex_map[e[0]], vertex_map[e[1]])
                         for e in edges if vertex_map[e[0]] != vertex_map[e[1]]])
        return num_components(new_vertices, new_edges, partition_graph_f)
vertices = set(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
edges = set([('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'g'),
('d', 'e'), ('d', 'f'), ('e', 'f'),
('h', 'i'), ('i', 'g'), ('h', 'g')])
num_components(vertices, edges, partition_graph_f)
While we have not yet specified partition_graph_f, let's assume it has $O(\lg |V|)$ span and reduces the number of vertices by half each round.
What is the recurrence for the full algorithm to compute num_components?
$S(|V|) = S\left(\frac{|V|}{2}\right) + \lg (|V|)$
which evaluates to?
$O(\lg^2 |V|)$
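One way to see this: the recursion bottoms out after $\lg |V|$ halvings, and each level contributes at most $\lg |V|$:
$S(|V|) = \sum_{i=0}^{\lg |V|} \lg \frac{|V|}{2^i} \le (\lg |V| + 1) \cdot \lg |V| \in O(\lg^2 |V|)$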
This is much better than the span for our BFS/DFS implementations of num_components:
$O(|V| + |E|)$
Take-aways?
To get this level of parallelism, we need a way of partitioning a graph that has low span and reduces the number of vertices by a constant (geometric) factor each round.
This is nice because it bounds the number of contraction rounds by $O(\lg |V|)$.
Here's a simple way to find an edge partition: randomly pick half the edges in the graph.
What can go wrong with this approach?
Two randomly chosen edges may share a vertex, and we don't want the same vertex to end up in two partitions.
What is a greedy algorithm to avoid this case?
Initialize an empty edge set M
Iterate over each edge e in the graph:
If neither vertex of e appears in an edge already in M, add e to M (see the sketch below)
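A minimal sequential sketch of this greedy algorithm, assuming the same set-of-pairs edge representation as above (the name greedy_matching is ours):

def greedy_matching(edges):
    M = set()        # selected edges
    matched = set()  # vertices already covered by an edge in M
    for (u, v) in edges:
        if u not in matched and v not in matched:
            M.add((u, v))
            matched.update([u, v])
    return M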
What is the span of this algorithm?
$O(|E|)$ -- it is sequential, every choice depends on previous choices.
Can we come up with a parallel version of this idea? Hint: use randomness.
We can randomly select edges, contingent on certain conditions: flip a coin for each edge in parallel, then filter on all the results, selecting those sampled edges $e$ where each endpoint of $e$ appears in exactly one sampled edge.
Here, we make partitions for two edges: $\{a,c\}, \{g,h\}$.
The rest are put in their own singleton partitions: $\{b\}, \{d\}, \{e\}, \{f\}, \{i\}$.
import random
from collections import Counter

def edge_contract(vertices, edges):
    # sample each edge with 50% chance
    sampled_edges = [e for e in edges if random.choice([True, False])]
    # count how often each vertex appears in the sampled edges;
    # could do this in parallel (map-reduce) in O(log |V|) span
    vertex_counts = Counter()
    for p in sampled_edges:
        vertex_counts.update(p)
    # now, do a filter to get those edges where both vertices
    # appear only once in sampled_edges
    valid_edges = [e for e in sampled_edges
                   if vertex_counts[e[0]] == 1 and vertex_counts[e[1]] == 1]
    # make partitions: contract each valid edge into a single super vertex
    vertex_map = dict()
    for e in valid_edges:
        vertex_map[e[0]] = e[1]
    # put all remaining vertices in their own singleton partitions
    for v in vertices:
        if v not in vertex_map:
            vertex_map[v] = v
    return set(vertex_map.values()), vertex_map
Now let's run num_components (defined above) using edge_contract as our partitioning function:
vertices = set(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])
edges = set([('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'g'),
('d', 'e'), ('d', 'f'), ('e', 'f'),
('h', 'i'), ('i', 'g'), ('h', 'g')])
num_components(vertices, edges, edge_contract)
2
By construction, we know that this will be a valid partitioning.
But, will it make the graph small enough fast enough?
Consider a cycle graph, where each vertex has exactly two neighbors:
What is the probability that a given edge $e$ will be selected for a partition?
This is the probability that its coin is heads and its two neighboring edges' coins are tails: $\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}$.
So, if there are $|E|$ edges, and each has probability of $\frac{1}{8}$ of being selected, then what is the expected number of edges selected at each round?
Since $|V| = |E|$ in a cycle graph, and since we remove one vertex for each selected edge, we remove $\frac{|V|}{8}$ vertices at each round, in expectation (a sanity-check simulation appears below).
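Here is a quick simulation, our own sanity check rather than anything from the notes, of one sampling round on a cycle graph; the surviving fraction of edges should be close to $\frac{1}{8}$:

import random
from collections import Counter

def selected_fraction(n, trials=1000):
    # cycle graph on n vertices: |V| = |E| = n
    edges = [(i, (i + 1) % n) for i in range(n)]
    total = 0
    for _ in range(trials):
        # sample each edge with 50% chance
        sampled = [e for e in edges if random.choice([True, False])]
        counts = Counter()
        for e in sampled:
            counts.update(e)
        # count sampled edges whose endpoints appear in no other sampled edge
        total += sum(1 for (u, v) in sampled if counts[u] == 1 and counts[v] == 1)
    return total / (trials * n)

print(selected_fraction(100))  # expected to be close to 1/8 = 0.125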
Thus, the recurrence: $S(|V|) = S\left(\frac{7}{8}|V|\right) + O(\lg |V|)$, which again solves to $O(\lg^2 |V|)$ in expectation.
But we have assumed a cycle graph. What would be a graph structure where edge contraction performs poorly, i.e., selects too few edges per round?
Next time: Star Graphs!