Indian Computing Olympiad

Training Material


Advanced graph algorithms

Disjoint-set datastructure (union-find)

We have a universe U = {1,2,...,N}. Initially, U is divided into N singleton subsets {{1},{2},...,{N}}.

We want to maintain a collection of disjoint subsets of U (components) to support the following operations.

  • Given an element, find which subset it belongs to (return a unique identifier for that subset)
  • Merge two subsets (union)

We could maintain an array Comp[1..N] so that Comp[i] is a unique label identifying the subset that i belongs to. We can check whether two vertices are in the same component in constant time. However, to merge two components, we have to traverse Comp and make the labels of the two components the same. This takes time O(N), which is too expensive --- it will take MN steps across M merges.

Alternatively, for each label (component), we maintain a (linked) list of vertices that have that label. This will not help us much because the size of each component could be about N.

Suppose, in addition, we maintain an array Size that records the size of component. When we merge two components, we merge the smaller component into the larger one --- that is, we change the labels of the smaller component to be the same as the larger one. (For this, we assume that we maintain, for each component, both an array mapping vertices to components and a list of the elements that belong to that component. We use the list to efficiently identify all the vertices in the component to be renamed and update each of them in the array in constant time, so that "find" remains a constant time operation.)

The find operation continues to be constant time.

What is the cost of union? Naively, each union could take N time, as before. However, let us count more carefully, across all the merges.

Each union relabels every element of the smaller set. How many times does the label of an element change? Each time the label of an element s changes, the size of the component that s belongs to (at least) doubles. Thus, in log N steps, the component that s belongs to will be the full set, so no element can be relabelled more than log N times. Thus, over any sequence of unions, the total number of relabel operations is at most N log N.

Path Compression

With each element, store a pointer to the component that it belongs to. The last element in a component, points to itself.

Initially, each element is a singleton component and points to itself.

  ----------   ----------   ----------       ----------
 | 1 | self | | 2 | self | | 3 | self | ... | N | self |
  ----------   ----------   ----------       ----------
      

When we merge, we make pointer for the head of one component point to the head of the other one. For instance, if we have components {1,2,3,4} and {5,9} as on the left, we can merge component 5 into component 1 by making the pointer from 5 point to 1, as shown on the right. (In the picture, all edges point up.)

   Component 1     Component 5     Component 1 after merge

   {1,2,3,4}          {5,9}             {1,2,3,4,5,9}

      1 self          5 self                1 self
     / \              |                   /  |  \
    3    4            9                 3    4     5
    |                                   |          |
    2                                   2          9
      

Merge thus takes constant time.

What about find? To find the component of an element, we start with the pointer attached to that element and walk to the root of the tree to get the identity of the component to which it belongs.

If we merge from N,N-1,...2,1 in that order

((... ((N U N-1) U N-2) ....) U 2 ) U 1)

we could end up with a tree consisting of a single path

N → N-1 → N-2 → N-3 → ... → 3 → 2 → 1 self

so find could take time O(N).

Once again, we can maintain the size of each component and insist on merging smaller components into larger components. This will guarantee that the height of a component is only (log N).

Alternatively, we can ignore the sizes of the components and do "path compression". When we find an element for the first time, we walk up a path to the root of the component to identify the component. Now, we make a second pass from that element and replace pointers along the way to point directly to the root. This corresponds to "compressing" the path from each element along the way to a direct edge to the root.

It turns out that with path compression, a sequence of K finds will take time about 4K. (Strictly speaking, the "constant" 4 is not a constant but a very very slow growing function of K.)

In this representation, we don't need to maintain sizes --- it works whichever direction we merge. Also, we don't need to maintain a list of elements of each component.

This is called the "union-find" or "merge-find" data structure.

Video material

Video lectures on Disjoint set data structure (Union-find), NPTEL online course on Design and Analysis of Algorithms



Problems based on disjoint sets