Wednesday, September 24

A Union Algorithm For Sorted Lists

To union two sets is to create a new one which has all the members of both. For example, using a list (or technically a set since there aren't any duplicates) of numbers:

[1,2,3] U [2,3,4] = [1,2,3,4]

Let's say that we have two Hashmaps, L and R, that we wish to join into a larger collection. By definition, the collections of keys on these Hashmaps are sets, and so a union operation would be perfect to construct the new larger Hashmap. In practice this is simply achieved by adding each of the items in R to L at cost O(1) each time, handling collisions as appropriate (by implementation default you'd take R's, but there's no reason why you can't handle that differently). To create a new set from Hashmaps of size n and m respectively, this takes time O(n+m), O(n) to add L's items and O(m) to add R's.

But since we're all now using sorted lists for our indexed item needs, the cost of inserting is now O(log n) and so the algorithm above results in a total time of O(n log n + m log (m+n)). Not great, but perhaps there's something we can do to better this.

The trick is in leveraging the sorted property of a sorted list. Imagine if we had two piles of sorted cards in front of us - it would be easy enough to create a union of these by repeatedly taking the lower value card off the top of each and placing it in a new pile, while chucking away any duplicates. Since we only go through each list once, we would do this in O(n+m) time. Super! And again, as a bonus, the resultant set would have been sorted by key.

Although it's a straightforward enough algorithm translating the above to code can be tricky. What follows is some pseudocode which tries to nail it; ListL/R represent the sorted lists, while IndexL/R are pointers (integers) to items in the list:

if ListL.Size is zero, then return ListR;
otherwise if ListR.Size is zero, then return ListL;
otherwise
  create an empty ListU for the results;
  set IndexL = 0; set IndexR = 0;
  while (true)
    if ListL[IndexL] is the same as ListR[IndexR]
      add ListL[IndexL] into ListU;
      //or ListR[IndexR],
      //or manage the conflict.
      increment IndexL; increment IndexR;
    else if ListL[IndexL] is lower than ListR[IndexR]
      add ListL[IndexL] into ListU;
      increment IndexL;
    else if ListL[IndexL] is higher than ListR[IndexR]
      add ListR[IndexR] into ListU;
      increment IndexR;
    if IndexL is greater than ListL.Size
    or IndexR is greater than ListR.Size
    //invalid indexes
      quit the loop;
  restart loop
  if IndexL is still pointing to a valid item in ListL
    add the remainder of ListL to ListU;
  if IndexR is still pointing to a valid item in ListR
    add the remainder of ListR to ListU;
  return ListU
end;


Still not convinced? Well for completeness, let's prove that this algorithm actually works. The parts outside the main loop can be proven trivially; the union of two empty sets is empty, and L U Empty is L too.

For the final two cases we can rely on the following observation:

if max(ListU) < min(ListL) then ListU U ListL = ListU + ListL

This is true since it's guaranteed that there will be no items shared between U and L.

Which leaves us with the loop proper. For this, we can use induction; the above prove the trivial or base cases, so let's just consider when we're in the thick of things:

Let IndexL and IndexR point to somewhere in the middle of ListL and ListR respectively. Assume that the loop has been working correctly so far: so all lists are sorted, last(ListU) < min(ListL[IndexL], ListR[IndexR]), and ListU is the union of all items already seen. Let's call these observations the inductive property - we're going to go through the loop and see if they all still hold.

Now, three things could happen. Let Index+ denote an incremented index:

  1. ListL[IndexL] = ListR[IndexR]. Here we add ListL[IndexL] to ListU, and increment both Indexes. By definition of our sorted list and its unique keys, ListL[IndexL] = ListR[IndexR] < ListL[IndexL+] and ListL[IndexL] = ListR[IndexR] < ListR[IndexR+]. Since last(ListU) = ListL[IndexL], last(ListU) < ListL[IndexL+] and last(ListU) < ListR[IndexR+], or last(ListU) < min(ListL[IndexL+], ListR[IndexR+]). Since ListL[IndexL] is greater than all the items originally in ListU, ListU is stil sorted. ListL and ListR are still sorted, trivially. And finally, since ListU now contains ListL[IndexL], it's the union of all items seen already.
  2. ListL[IndexL] < ListR[IndexR]. Here we add ListL[IndexL] to ListU, and increment IndexL. By definition of our sorted list and its unique keys, ListL[IndexL] < ListL[IndexL+]. Since last(ListU) = ListL[IndexL], last(ListU) < ListL[IndexL+], and so last(ListU) < min(ListL[IndexL+], ListR[IndexR]). Since ListL[IndexL] is greater than all the items originally in ListU, ListU is stil sorted. ListL and ListR are still sorted, trivially. And finally, since ListU now contains ListL[IndexL], it's the union of all items seen already.
  3. ListR[IndexR] < ListL[IndexL]. Similar to the above case, except for ListR instead.
In all three cases, the inductive property holds. Hence when loop exits, ListU will be the union of all items already seen.

Phew. Still here? I'd congratulate you, but I'm wondering exactly how bored you are.

Nevertheless you might also be wondering what the whole point of the above is. Although what we want to do is simple enough, it's sometimes good to think about why an algorithm works, if anything to back up the original hunch that it does. Conversely, if you struggle to prove a simple algorithm, that might just indicate that it's not quite as "obviously correct" as you think it might be. Now that I've proven the union algorithm (albeit informally, since there is still a whole bunch missing) I can remain confident that it actually does what it's supposed to.

5 comments:

  1. Or since they are just sets you use the Set data type and do

    set1.addAll(set2);

    Which does the union for you and if youre worried about keeping a ref to both sets then

    Set setUnion = set1;
    setUnion.addAll(set2);

    voila much more simple eh?

    ReplyDelete
  2. Zahera (if that indeed is your real name),

    You're absolutely right - in fact you could just go back to the Hashmap implementation and use its containsKey method to do pretty much the same thing. But this is a discussion on computational complexity, not implementation.

    I covered why one may want to use sorted lists instead of other data structures in a previous post (linked to above if you're interested). I touch on key sets too.

    In short though a sorted list will outperform a Hashmap in a few specific cases (like iteration or sorting, or with respect to space) and presumably a Set too.

    For these cases, I still need union and intersect operators and the above is discussing the best way to do this. Furthermore although I've not looked into Set as much as Hashmap, I'm guessing addAll runs in O(nm) time. That's pretty slow compared to the O(n+m) above.

    ReplyDelete
  3. Oh and what does the following actually do wrt references?

    Set setUnion = set1;
    setUnion.addAll(set2);

    ReplyDelete
  4. :-( think i preffered it when you didnt reply on your blog- atleast then i could pretend im all clever.

    :-D someone hacked into my computer.

    ReplyDelete
  5. Anonymous03:15

    losers

    ReplyDelete