Solution to Problem 2: What is the Rank?
You are given a list of integers a_1, ... a_N, and your aim is to determine the rank of a_i among a_1,...,a_i. Let us dispose off the direct algorithm. For each i, simply scan from a_i down to a_1 and determine the rank of a_i by counting the number of elements below (or above) a_i. This takes time proportional to i. Thus, summing up over all i, this algorithm takes roughly N*N steps. This will get you 30% of the marks. On the face of it, this problems seems similar to the "End of Corruption" problem, where we had a stream of numbers coming in and we had to insert the numbers into a list. We were also required to find the current maximum element in the list and delete it. Recall, from the discussion in that problem, that we could maintain the list either in sorted order or in the order in which the numbers arrive. Maintaining the list in sorted order would require us to spend time proportional to the length of the list in order to insert an element into it. On the other hand, if we choose to maintain the numbers in the order in which they arrive, then, each insertion cost us only one instruction. If the list is maintained in arbitrary order, then finding the rank of an element will take take proportional to the length of the list. The naive algorthim described above uses this idea and cannot be improved. On the other hand if the list is maintained in sorted order, then we can find the rank of an element in time proportional to log(n) where n is the number of elements in the list. Here is how: Suppose a_1 > a_2 > a_3 > ... a_n and the next element is x. (Recall from the statment of the problem the x is different from a_1 ... a_n.) If x > a_1 (or x < a_n) then its rank is 1 (is n+1). Otherwise, compare x with the "middle element" of the list. The middle element is a_m where m = n/2 (/ denotes integer division). 1) If x > a_m then it suffices to find the rank of x in the list a_1 ... a_{m-1}. 2) If x < a_m and x > a_{m+1} then rank of x is m+1. 3) If x < a_{m+1} then it suffices to find the rank of x in the list a_{m+1} ... a_n and add m to it. Notice that after making the necessary comparisons required by 1, 2 and 3, we are left with a list whose length is < n/2. In finding the rank of x in this smaller list we will apply the same ideas above, i.e. compare it with the middle element and then restrict the search to a list whose length < (n/2)/2 = n/4 and so on. This repeated halving can go on for at most log(n) steps. The details are as follows: We now formulate a recursive algorithm motivated by the above idea. ---------------------------------------------------------------- /* The numbers are available in the array A. The number of elements in the array is n. The array A is in descending order */ /* findrank takes 3 arguments, x,l and r where 1 ≤ l ≤ r ≤ n and returns the rank of x in the list A[l], A[l+1], ... A[r] */ findrank(x,l,r) { if (x > A[l]) return(1) if (x < A[r]) return(r-l+2) mid = (l + r)/2 if (x > A[mid]) return(findrank(x,l,mid-1)) /* 1 above else if (x > A[mid+1]) return(mid-l+1) /* 2 above else return(mid+findrank(x,mid+1,r)) /* 3 above } print(findrank(x,1,n)) ----------------------------------------------------------- As mentioned earlier if there are n elements in the list (i.e. between l and r) when findrank is called, in any recursive call there are at most n/2 elements. Thus, there are at most log(n) (i.e. the smallest k such that 2^k is more than n) recursive calls before the list is empty. Each call, carries out a fixed number, say k instructions and thus the total number of instructions executed is log(n)*k which is roughly log(n). Exercises: 1) Write down a non-recursive version of this algorithm. 2) If we are only interested in the rank of x in the entire array (and do not require that findrank(x,l,r) return the rank of x among the elements A[l],..., A[r], then, we can simply the above function considerably. How? So what do we have so far: 1) Maintain the list in the order in which the numbers arrive. Each insert takes only a contant number of steps but finding the rank takes time proportional to the length of the list. Thus, if there are N elements in all we would time proportional to N*1 + N*N (the first term is contributed by the inserts and the second one is the time taken in finding the ranks). Thus the time taken is roughly N*N. 2) Maintain the array in sorted order. Each insert takes time proportional to the length of the list. Thus inserting N element takes time proportional to N*N. Finding the rank of an element takes time proportional to log(n) where n is the current length of the list. So the overall time time taken to find the rank of the N elements (as they are inserted) is not more than N*log(N). N*log(N) + N*N is roughly N*N (since log(N) is much smaller than N). So, there is no real difference between the two. But, if we learnt our lessons (from the "End of Corruption" problem) right, we would like to balance the efficiency of the inserts in the first algorithm with the efficiency of the rank finding in the second algorithm. And this is how you do it: We use Sqrt(N) lists, each containing at most Sqrt(N) numbers, and each of these lists is maintained in sorted order. To insert an element, we look for the first list with space (that will take time proportional to Sqrt(N)) and then insert it into the list at the appropriate place (to maintain the sorted order) and this take time proportional to Sqrt(N). Thus each insert takes time proportional to Sqrt(N). To find the rank of an element, we find its rank in each of the Sqrt(N) lists. That takes time proportional to Sqrt(N) * log(Sqrt(N)). Suppose, the rank in the Sqrt(N) lists are r_1,r_2,...,r_Sqrt(N). Then, the overall rank is (r_1 - 1) + (r_2 - 1) + ... + (r_Sqrt(N) - 1) + 1. (Since there are (r_i - 1) elements in list i bigger than the given number.) Thus the rank can be found in time proportional to Sqrt(N)*log(Sqrt(N)). So the overall time of inserting and finding the current rank of N elements is proportional to N*(Sqrt(N) + log(Sqrt(N))). Note that log(N) = 2 * log(Sqrt(N)). Thus, this is roughly N*Sqrt(N)*log(N). Since log(N) is much smaller than Sqrt(N) (for example if N is 40000, Sqrt(N) is 200 and log(N) is 16) this algorithm performs significantly better than one that takes time proportional to N*N. This algorithm would get you 100% marks. Here is the full algorithm. We use a 2-dimensional array of size Sqrt(N) X Sqrt(N). Each row stores a list. We also keep an array Next[] of size Sqrt(N) to keep track of the first vacant position in each of the Sqrt(N) lists. --------------------------------------------------------------------- integer array A[Sqrt(N)][Sqrt(N)] integer array Next[Sqrt(N)] for i=1 to Sqrt(N) { Next[i] = 1 } for i=1 to N { x = read next input y = Rank(x) Insert(x) print(y) } Insert(x) { Find the first i such that Next[i] ≤ Sqrt(N). /* Find a list with space */ Find the first j such that A[i][j] ≤ x. /* Have to insert x at (If no such j is found return Next[i]) position j */ for (k=Next[i] down to j+1) /* create space to insert x */ A[i][k]=A[i][k-1] A[i][j] = x /* insert x */ Next[i] = Next[i] + 1 } /* A function that finds the rank of the element x among the elements between positions l and r in row i of the Array A (The same as the findrank above except that instead of working over a one dimensional array A it works over the row i of a two dimensional array A) */ findrowrank(i,x,l,r) { if (x > A[i][l]) return(1) if (x < A[i][r]) return(r-l+2) mid = (l + r)/2 if (x > A[i][mid]) return(findrowrank(i,x,l,mid-1)) else if (x > A[i][mid+1]) return(mid-l+1) else return(mid+findrowrank(i,x,mid+1,r)) } Rank(x) { rank = 0 for (i=1 to Sqrt(N)) { /* add up the number of elements below x in all the rows */ rank = rank + findrowrank(i,x,1,Next[i]-1) - 1 } /* rank has the number of elements below x in the entire two dimensional array now. So the Rank of x is rank+1 */ return(rank+1) } -------------------------------------------------------------------- It turns out that one do a lot better than what is described here using a very specialised data structure. One can do the N inserts and find the ranks in overall time proportional to N*log(N). Thus, for N = 40000, that algorithm would be about 200 (Sqrt(40000)) times faster!! We postpone a discussion on this data structure for the moment. We would like to add now that the algorithm for finding the rank described above can be modified to find out whether x is a already present in sorted array of size N in time proportional to log(N). This technique is called "binary search" and we describe that algorithm below for the sake of completeness (you are encouraged to work it out for yourselves and not read on!). Binary Search: -------------- Let A be a array with N elements sorted in ASCENDING order. Given x, we would like to know if x appears in A. (The array could contain duplicates and of course x could appear in A.) The naive algorithm would scan the array and would take time proportional to the length of the array. The binary search algorithm works very much like the "findrank" described above. ----------------------------------------------------------------- /* The algorithm bsearch takes 3 inputs, x, l and r. It returns 0 if x does not appear among A[l] ... A[r]. Otherwise, it returns an index j, l ≤ j ≤ r such that A[j] = x */ int bsearch(int x, int l, int r) { int mid; mid = (l+r)/2; /* x is smaller than the smallest element in A */ if (x < A[l]) return(0); /* x is bigger than the largest element in A */ if (x > A[r]) return(0); /* x is smaller than A[mid]. So it cannot appear among A[mid],A[mid+1],...A[r] */ if (x < A[mid]) return(bsearch(x,l,mid-1)); /* x is bigger than A[mid] so it cannot appear among A[1],A[2],... A[mid] */ if (x > A[mid]) { /* if x is also less than A[mid+1] then x does not appear in A */ if (x < A[mid+1]) return(0); /* Otherwise, search among A[m+1], ... A[r] */ return(bsearch(x,mid+1,r)); } /* x equals A[mid] */ return(mid); } ----------------------------------------------------------------------- As in the case of findrank, noting that each recursive call reduces the distance between l and r by half, guarantees that there are at most Log(N) recursive calls to bsearch and since each call takes only a constant number of operations the overall algorithm takes roughly Log(n) steps. Exercise: Can we use the binary search algorithm to solve the Average problem (the other problem in the January problem set) in N*N*log(N) steps (needless to add, without using the ideas leading to the N*N solution).