
Entropy

Holy Entropy! It's boiling!
Mr. Tomkins in [6]
There is some kind of mystique about Entropy, be it in Physics, Mathematics or Computing. It has been said [4,7] that von Neumann suggested the term to Shannon, saying that ``it will give you a great edge in debates because nobody really knows what entropy is anyway''.

Indeed, our first encounter seems to confirm the worst fears: the entropy of a (discrete-valued) random variable $X$, denoted $H(X)$, is defined to be

\begin{displaymath}H(X) := - \sum_x \mbox{{\rm Pr}}[X=x] \log \mbox{{\rm Pr}}[X=x], \end{displaymath}

i.e. it is (the negative of) the expected value of the logarithm (in base 2, of course) of the distribution function. This is quite mysterious, and it certainly seems quite arbitrary as a measure of the ``uncertainty in $X$'' or the ``amount of information obtained when $X$ is revealed'', which is how this fundamental formula of Shannon is usually motivated. In actual fact, it emerges quite ineluctably as the unique measure satisfying some rather natural properties that one can axiomatically require any information measure to satisfy. Good discussions of this can be found in the classic texts of Khinchin and Rényi, where it is illustrated in discussions of the famous coin-weighing problems. A good modern account is the book by Cover and Thomas [3].
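
For concreteness, here is a minimal Python sketch of the definition (my own illustration, not part of the original notes), evaluated on a few familiar distributions:

from math import log2

def entropy(probs):
    # Shannon entropy (in bits) of a discrete distribution given by its probabilities.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin
print(entropy([1.0]))         # 0.0 bits: no uncertainty at all
print(entropy([1/8] * 8))     # 3.0 bits: uniform over 8 values, i.e. log 8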

But in fact, the concept of entropy - its power and versatility - can be understood and illustrated most effectively through very simple examples in combinatorics and computing. One such, with which Pranab began this series of talks, is Bregman's Theorem: the number of perfect matchings in a bipartite graph $G=(A,B,E)$ with $\vert A\vert = \vert B\vert$ is at most $\prod_{v \in A} (d(v)!)^{1/d(v)}$. Here is the gist of the argument: a perfect matching is simply given by specifying, for each $v \in A$, its partner $\sigma(v) \in B$. Suppose $\sigma$ is such an assignment chosen uniformly at random. The entropy of this random variable is $\log N$, where $N$ is the (unknown) number of perfect matchings. Now compute this another way: $\sigma$ is determined fully by giving the values of $\sigma(v)$ for $v \in A$, and thus $H(\sigma) \leq \sum_{v \in A} H(\sigma(v)) \leq \sum_{v \in A} \log d(v)$. This shows that $N \leq \prod_{v \in A} d(v)$. Now it seems quite silly to derive this obvious fact using entropy! However, a slightly more careful modification of this very same argument gives Bregman's Theorem.
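
As a quick numerical check, here is a small Python sketch (my own illustration; the $3 \times 3$ bipartite graph is a made-up example) comparing a brute-force count of perfect matchings with Bregman's bound:

from itertools import permutations
from math import factorial, prod

def count_perfect_matchings(adj):
    # Brute force: adj[v] is the set of neighbours in B of vertex v in A,
    # with both sides indexed 0..n-1.
    n = len(adj)
    return sum(all(sigma[v] in adj[v] for v in range(n))
               for sigma in permutations(range(n)))

def bregman_bound(adj):
    # Bregman's bound: the product over v in A of (d(v)!)^(1/d(v)).
    return prod(factorial(len(adj[v])) ** (1 / len(adj[v])) for v in adj)

adj = {0: {0, 1}, 1: {1, 2}, 2: {0, 1, 2}}
print(count_perfect_matchings(adj), "<=", bregman_bound(adj))    # 3 <= 3.63...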

In Computer Science proper, take the mother of all examples: sorting. Every baby knows the bound $\Theta(n \log n)$, and also (hopefully!) one favourite sorting method (Quicksort, Mergesort) and a lower bound argument (decision tree) for this. However, what if you are given some additional information: say, that the input values are only of $r$ distinct kinds? Or, even more generally, that there are $n_1, n_2, \ldots, n_r$ of each kind respectively? What is the bound now? The beautiful answer is that it is $n H(X)$, where $X$ is a random variable taking each of the distinct input values with probability proportional to its frequency: $n_1/n, n_2/n, \ldots, n_r/n$ respectively. This answer includes all the cases: when no information is available a priori, $H(X) = \log n$ (all distinct input values), and if one knows there are $r$ distinct values (but not how many of each kind), $H(X) = \log r$ and the sorting bound is $n \log r$. Moreover, some simple modifications of the standard sorting algorithms achieve these bounds in an output-sensitive manner. (Incidentally, it also shows that sorting gets easier when there are non-distinct elements. I once heard a speaker at a conference claim the opposite! He was speaking on a very technical result about the tail distribution of Quicksort, employing some very high-powered generating function and complex analytic technology ...)
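
A small Python sketch (my own illustration, with made-up multiplicities) comparing the bound $n H(X)$ with the information-theoretic lower bound, the logarithm of the number of distinguishable input orderings:

from math import log2, factorial

def entropy_bound(counts):
    # n * H(X), where X takes the i-th value with probability counts[i] / n.
    n = sum(counts)
    return -n * sum((c / n) * log2(c / n) for c in counts)

def orderings_bound(counts):
    # log2 of the number of distinguishable orderings, n! / (n_1! ... n_r!).
    n = sum(counts)
    m = factorial(n)
    for c in counts:
        m //= factorial(c)
    return log2(m)

counts = [5, 3, 2]                  # n = 10 inputs of three kinds
print(entropy_bound(counts))        # about 14.85
print(orderings_bound(counts))      # log2(2520), about 11.30, indeed at most n H(X)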

Or take this curious geometrical fact: given $n$ distinct points in three dimensions, project them onto the $yz$, $zx$ and $xy$ planes to get $n_x$, $n_y$ and $n_z$ points respectively. Then $n^2 \leq n_x n_y n_z$. There is a proof by Cauchy-Schwarz, but it is always a challenge to reproduce it. On the other hand, there is a charming four-line proof via entropy: take a point at random and compute its entropy in two different ways, directly as $\log n$, and indirectly as the result of revealing the point via its three projections. The crucial point, intuitively, is that the latter gives the full information ``twice over'', since each coordinate appears in two of the three projections.
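
Written out (a standard derivation, with the details filled in): let $(X,Y,Z)$ be a uniformly random one of the $n$ points, so that $H(X,Y,Z) = \log n$. Using $H(Y,Z) \geq H(Y\vert X) + H(Z\vert X,Y)$ and $H(Z,X) \geq H(X) + H(Z\vert X,Y)$ together with the chain rule,

\begin{displaymath}2 H(X,Y,Z) \leq H(Y,Z) + H(Z,X) + H(X,Y) \leq \log n_x + \log n_y + \log n_z, \end{displaymath}

where the last step holds because the three projections take at most $n_x$, $n_y$ and $n_z$ values respectively. Since the left hand side is $2 \log n$, this is exactly $n^2 \leq n_x n_y n_z$.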

This simple example has a fairly natural generalization called Shearer's Lemma. The statement is somewhat technical, so I won't give it here, but the proof is almost the same as that above. What is truly striking is the ease with which the inequality can be applied, to great effect, in diverse situations. One example was counting independent sets in regular bipartite graphs. But the most impressive and elegant one was the solution by Friedgut and Kahn to the following question of Erdős (is it? must be!): given a graph $H$, what is the maximum number $N(H,\ell)$ of copies of $H$ that can appear in a graph with $\ell$ edges? The answer is:

\begin{displaymath}c_1 \ell^{\rho^*(H)} \leq N(H, \ell) \leq c_2 \ell^{\rho^*(H)}, \end{displaymath}

where $c_1, c_2 > 0$ are constants and $\rho^*(H)$ is the fractional edge cover number of $H$, i.e. the optimum of the LP relaxation of the natural integer programming formulation of minimum edge cover for $H$. We saw an alternative proof of the result (also due to Friedgut and Kahn) in the harmonic analysis series, which the authors claim is ``even more interesting'' (however, it does not extend to hypergraphs).
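
As a concrete instance (my own example, not from the talks): for the triangle $H = K_3$, putting weight $1/2$ on each edge covers every vertex and no fractional cover can do better, so $\rho^*(K_3) = 3/2$; the theorem then recovers the classical fact that a graph with $\ell$ edges has at most $O(\ell^{3/2})$ triangles, attained by a clique on roughly $\sqrt{2\ell}$ vertices:

\begin{displaymath}N(K_3, \ell) = \Theta(\ell^{3/2}). \end{displaymath}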

The next segment of talks dealt with a notion of graph entropy due originally to János Körner. I have often been motivated to look at it via the problem of prefix-free codes, which is discussed in elementary algorithms courses: assign binary codes to a bunch of $n$ objects so that no code word is a prefix of another. We want to minimise the average (or total) length of the code words, given the frequencies of occurrence of the objects. The prefix-free condition is motivated by the need to decode online. Huffman coding comes within 1 of the entropy lower bound $H(X)$, where $X$ is a random variable taking the different objects with probabilities proportional to their frequencies of occurrence. Now the generalization is to do the same for objects not all of which need to be distinguished from each other. There is a graph on the underlying objects as vertices, where an edge between two objects indicates that they must be distinguished from each other. The classical case corresponds to the complete graph. Graph entropy is an approach to this problem, which, tantalizingly, seems to be still largely unresolved.
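
Here is a minimal Python sketch (my own illustration; the frequencies are made up) checking that the average Huffman code length lies between $H(X)$ and $H(X)+1$:

import heapq
from math import log2

def huffman_lengths(freqs):
    # Code-word lengths of a Huffman code: repeatedly merge the two lightest
    # subtrees; every symbol under a merged subtree gets one bit deeper.
    lengths = [0] * len(freqs)
    heap = [(f, [i]) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, s1 = heapq.heappop(heap)
        f2, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (f1 + f2, s1 + s2))
    return lengths

freqs = [45, 13, 12, 16, 9, 5]
probs = [f / sum(freqs) for f in freqs]
avg = sum(p * l for p, l in zip(probs, huffman_lengths(freqs)))
H = -sum(p * log2(p) for p in probs)
print(H, "<=", avg, "<", H + 1)    # roughly 2.22 <= 2.24 < 3.22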

Starting with the seminal paper of Körner, where he used graph entropy to give a very simple proof of earlier results of Fredman and Komlós on the sizes of covering families, graph entropy has been successfully applied to a variety of problems in combinatorics and complexity. So it certainly seems to be a candidate for the essential toolkit. If one looks at the literature on the subject, however, one is confronted with two seemingly completely different, and equally mystifying, definitions, which are nevertheless allegedly equivalent. One definition has a distinctly combinatorial optimization flavour: the entropy of a graph $G$ is the minimum of the convex function $-\frac{1}{\vert G\vert} \sum_v \log a_v$, where $a := (a_v, v \in V(G))$ ranges over the vertex packing polytope of $G$, i.e. the convex hull of the characteristic vectors of the independent sets of $G$. The other definition seems hardcore information theory: it is the minimum of the mutual information $I(X;Y)$ over all joint distributions of a pair $(X,Y)$, where $X$ is uniformly distributed over the vertices of $G$ and $Y$ ranges over the independent sets of $G$, subject to the condition that $X \in Y$ with probability 1.
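
As a quick sanity check on the polyhedral definition (a simple worked example of my own, not from the talks): for the complete graph $K_n$ the independent sets are the singletons, so the vertex packing polytope is the simplex $\{a \geq 0 : \sum_v a_v \leq 1\}$ and, by symmetry and convexity, the minimum is attained at $a_v = 1/n$; for the empty graph the all-ones vector lies in the polytope. Thus

\begin{displaymath}H(K_n) = -\frac{1}{n}\sum_v \log \frac{1}{n} = \log n, \qquad H(\overline{K_n}) = 0. \end{displaymath}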

Promising to clear the mists, Jaikumar started his talks with the following simple puzzle: what is the minimum number of bipartite graphs whose union will cover the complete graph on $n$ vertices? There are a number of ways to tackle this.

One is the deletion method: try to maintain as large an independent set as the edges of the bipartite graphs permit. At the outset we have the full vertex set. Consider the first bipartite graph: we must delete some vertices to maintain the independence property, but we can always retain the larger of the two halves of its bipartition. Thus, as we consider successive bipartite graphs, we can keep at least half of the surviving vertices at each step. At the end, the surviving set must be independent in the complete graph, so it contains at most a single vertex; since $k$ bipartite graphs leave at least $n/2^k$ vertices, we need at least $\log n$ bipartite graphs. Now, one can tinker with this deletion process (making it probabilistic!) and we are led to the polyhedral definition. I believe this is an original route to arrive at the definition, and Jaikumar should be given the credit officially somewhere (could he be persuaded to convert the notes into a full-fledged survey for publication?).

The other is the entropy method: a vertex in the graph is completely specified by giving its colour (i.e. its side) in each of the bipartite graphs. Two distinct vertices cannot have the same colour in all the bipartite graphs, because the edge connecting them must appear in at least one of them. So, pick a vertex at random, and let $\chi_i$ be its colour in bipartite graph $i$. Now compute the entropy of the random vertex in two different ways, as usual, and we get the same result: at least $\log n$ bipartite graphs are necessary. Once again, one can improve this argument, and this leads naturally to the information-theoretic definition.
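
The same encoding idea gives the matching upper bound: label the vertices in binary and let the $i$-th bipartite graph separate vertices according to their $i$-th bit, so $\lceil \log n \rceil$ bipartite graphs suffice. A small Python sketch of this construction (my own illustration, not from the talks):

from itertools import combinations

def bipartite_cover(n):
    # Cover the edges of K_n by ceil(log2 n) complete bipartite graphs:
    # the k-th graph separates vertices according to the k-th bit of their label.
    k = max(1, (n - 1).bit_length())
    return [([v for v in range(n) if not (v >> bit) & 1],
             [v for v in range(n) if (v >> bit) & 1]) for bit in range(k)]

def covers_Kn(graphs, n):
    # Check that every edge {u, v} of K_n lies in some complete bipartite graph.
    covered = {frozenset((u, v)) for left, right in graphs for u in left for v in right}
    return all(frozenset((u, v)) in covered for u, v in combinations(range(n), 2))

graphs = bipartite_cover(10)
print(len(graphs), covers_Kn(graphs, 10))    # 4 True: four bipartite graphs cover K_10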

Amongst the most basic and useful properties of graph entropy is its subadditivity: the entropy of a union of graphs is at most the sum of the entropies of the components. This lies at the heart of its nature as a measure of complexity and is used repeatedly in all the applications. A beautiful sidelight is that it also provides an information-theoretic characterization of Berge's perfect graphs: a graph $G$ is perfect iff it splits entropy perfectly, i.e. $H(G)+H(\overline{G}) = H(K_n)$ ($= \log n$).
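
For a quick illustration of one direction (my own check, using the polyhedral definition above): the 5-cycle $C_5$, the smallest imperfect graph, is vertex-transitive with independence number 2, so by symmetry and convexity the minimum in the polyhedral definition is attained at the symmetric point $a_v = 2/5$, giving $H(C_5) = \log(5/2)$. Since $C_5$ is self-complementary, the entropy split is strictly imperfect:

\begin{displaymath}H(C_5) + H(\overline{C_5}) = 2 \log \frac{5}{2} = \log \frac{25}{4} > \log 5. \end{displaymath}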

The subadditivity of graph entropy led Ravi Boppana to give an elegant lower bound argument in parallel complexity. It was this argument that inspired Kahn and Kim in their wonderful solution to the problem of sorting under partial information mentioned earlier. The argument is very easy to outline: at the start we have the partial order $P$, identified with its underlying comparability graph $G$, and at the end we have the total order, whose underlying comparability graph is $K_n$; so the sorting process increases the graph entropy by $H(K_n) - H(P) = \log n - H(P)$. So, if we are able to show that the change in entropy resulting from a single (well chosen) comparison is $\Theta(1/n)$, then the total number of comparisons required is $\Theta(n(\log n - H(P)))$. How does this relate to $\log e(P)$, the information-theoretic bound (where $e(P)$ is the number of linear extensions of $P$)? Kahn and Kim show two things. First, for some constant $C$,


\begin{displaymath}n(\log n - H(P)) \;\geq\; \log e(P) \;\geq\; \max\left(\log(n!) - nH(P),\ Cn(\log n - H(P))\right). \end{displaymath}


Second, they show that the required comparison always exists: in any poset, the increase in entropy after a single comparison is never more than $2/n$, but for a suitable choice of the comparison it is at least $c/n$ for some constant $c>0$. Given that there is a polynomial time deterministic algorithm to compute graph entropy via the ellipsoid algorithm, we then have a complete answer to the problem of sorting under partial information. Interestingly, the comparison made at any stage need not always involve the most balanced pair! The simplicity of the strategy notwithstanding, the proof is quite elaborate and involves, inter alia, a nice laminar decomposition of the poset which might have other applications. For details, you could see [14], but there is a much better exposition in the workshop lecture notes [1]. An open problem is to get an algorithm that avoids the ellipsoid algorithm.


Meena Mahajan 2002-02-13