Indian Computing Olympiad

Training Material


Dynamic Programming→DNA (APIO 2008)

One interesting use of computer is to analyze biological data such as DNA sequences. Biologically, a strand of DNA is a chain of nucleotides Adenine, Cytosine, Guanine, and Thymine. The four nucleotides are represented by characters A, C, G, and T, respectively. Thus, a strand of DNA can be represented by a string of these four characters. We call such a string a DNA sequence.

It is possible that the biologists cannot determine some nucleotides in a DNA strand. In such a case, the character N is used to represent an unknown nucleotides in the DNA sequence of the strand. In other words, N is a wildcard character for any one character among A, C, G or T. We call a DNA sequence with one or more character N an incomplete sequence ; otherwise, it is called a complete sequence. A complete sequence is said to agree with an incomplete sequence if it is a result of substituting each N in the incomplete sequence with one of the four nucleotides. For example, ACCCT agrees with ACNNT, but AGGAT does not.

Researchers often order the four nucleotides the way we order the English alphabets: A comes before C, C comes before G, G comes before T. A DNA sequence is classified as form-1 if every nucleotide in it is the same as or comes before the nucleotides immediately to its right. For example, AACCGT is form-1, but AACGTC is not.

In general, a sequence is form-j , for j>1, if it is a form-(j-1) or it is a concatenation of a form-(j-1) sequence and a form-1 sequence. For example, AACCC, ACACC, and ACACA are form-3, but GCACAC and ACACACA are not.

Again, researchers order DNA sequences lexicographically the way we order words in a dictionary. As such, the first form-3 sequence of length 5 is AAAAA, and the last is TTTTT. As another example, consider the incomplete sequence ACANNCNNG. The first seven form-3 sequences that agree with it are:

	ACAAACAAG
	ACAAACACG
	ACAAACAGG
	ACAAACCAG
	ACAAACCCG
	ACAAACCGG
	ACAAACCTG
      

Given an incomplete sequence of length M, and two values K and R, the task to find the Rth form-K sequence that agrees with the given incomplete sequence.

Here M≤50000, K≤10, R≤212.

Solution

A form-K sequence can be decomposed into at most K blocks, where block has letters in non-decreasing order.

To start with let us write a recurrence to compute the number of form-K sequences of a given length, without any constraints.

For 1≤i≤m, 1≤j≤K, X in {A,C,G,T}, let f(i,j,X) denote the number of form-K sequences from positions i to M starting with X.

There are two cases to consider:

  • The letter at position i+1 is strictly smaller than X. If so, a new block starts at i+1 so we need to consider form-(j-1) sequences from i+1
  • The letter at position i+1 is greater than or equal to X. If so, the block at i continues into i+1, so we need to consider form-j sequences from i+1.

This gives us the recurrence

  • f(i,j,X) = sum_{Y<X} f(i+1,j-1,Y) + sum_{Y≥X} f(i+1,j,Y)

Now, how do we count the number of form-K sequences compatible with a given pattern? The pattern fixes values for certain positions. Suppose the p[i], the ith position in the pattern, is 'A'. Then, for all j

  • f(i,j,X) = 0 for X not equal to 'A'

In this way, we fill in 0's for all values f(i,j,X) such that p[i] is not N and p[i] is not equal to X. We then process the recurrence as before to get the total number of form-K sequences compatible with the pattern.

Now, how do we find the Rth such sequence?

The total number of form-K sequences is, in general, the sum

f(1,j,'A') + f(1,j,'C') + f(1,j,'G') + f(1,j,'T')

Let these values be n_A, n_C, n_G and n_T, respectively. Clearly, the first n_A sequences in lexicographic order start with A, the next n_C sequence start with C, …. If R≤n_A, we know the first letter is A. If n_A<R≤n_A + n_C, the first letter is C, and so on.

Suppose that n_A<R≤n_A+n_C. This fixes the first letter to be C and we now look for the sequence with index R_1 = (R - n_A) starting with C.

As before, we know that n_C is the sum of four terms, n_CA, n_CC, n_CG and n_CT. Again, by checking if R1≤n_CA or n_CA<R1≤n_CA+n_CC, … we can determine the second letter of the sequence.

This yields a new value R2 with which we can compute the third letter, and so on.