``Holy Entropy! It's boiling!'' (Mr. Tompkins in [6])

There is some kind of mystique about Entropy, be it in Physics, Mathematics or Computing. It has been said [4,7] that von Neumann suggested the term to Shannon, saying that ``it will give you a great edge in debates because nobody really knows what entropy is anyway''.
Indeed, our first encounter seems to confirm the worst fears: the
entropy of a (discrete valued) random variable $X$, denoted $H(X)$, is
defined to be
\[
H(X) := \sum_{x} \Pr[X=x] \log \frac{1}{\Pr[X=x]}.
\]
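For concreteness, here is that definition as a couple of lines of Python (my own aside, not part of the talks; base-2 logarithms, so the answer comes out in bits):

\begin{verbatim}
from math import log2

def entropy(probs):
    # Shannon entropy of a discrete distribution given as probabilities
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits
\end{verbatim}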
But in fact, the concept of entropy - its power and
versatility - can be understood and illustrated most
effectively through very simple examples in combinatorics
and computing. One such example, with which Pranab began this series of
talks, is Bregman's Theorem: the number of
perfect matchings in a bipartite graph in which the vertices on one
side have degrees $d_1, \dots, d_n$ is at most
$\prod_{i=1}^{n} (d_i!)^{1/d_i}$. Here
is the gist of the argument: a perfect matching here is
simply given by specifying, for each vertex $i$ on one side,
its partner $\sigma(i)$ on the other. Suppose $\sigma$ is such an assignment
chosen uniformly at random. The entropy of this random
variable is $\log M$, where $M$ is the (unknown) number of
perfect matchings. Now compute this another way: $\sigma$ is
determined fully by giving the values of $\sigma(i)$ for
$i = 1, \dots, n$, and thus
$\log M = H(\sigma) \le \sum_{i=1}^{n} H(\sigma(i)) \le \sum_{i=1}^{n} \log d_i$. This shows that
$M \le \prod_{i=1}^{n} d_i$. Now it seems quite silly to derive
this obvious fact using entropy! However, a slightly
more careful modification of this very same
argument gives Bregman's Theorem.
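As a quick sanity check of the two bounds (the crude bound $\prod_i d_i$ and Bregman's bound $\prod_i (d_i!)^{1/d_i}$), here is a brute-force computation on a small made-up bipartite graph; the adjacency lists below are just an arbitrary test instance of mine, not anything from the talks:

\begin{verbatim}
from itertools import permutations
from math import factorial, prod

# adj[i] = set of partners available to left vertex i (a made-up example)
adj = [{0, 1}, {1, 2}, {0, 2, 3}, {2, 3}]
n = len(adj)

# brute-force count of perfect matchings: sigma is a bijection [n] -> [n]
M = sum(all(sigma[i] in adj[i] for i in range(n))
        for sigma in permutations(range(n)))

degrees = [len(a) for a in adj]
crude = prod(degrees)                                       # M <= prod d_i
bregman = prod(factorial(d) ** (1.0 / d) for d in degrees)  # Bregman bound

print(M, crude, round(bregman, 2))   # 3, 24, 5.14
\end{verbatim}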
In Computer Science proper, take the mother of all examples:
sorting. Every baby knows the bound $\Theta(n \log n)$ for sorting $n$ elements, and
also (hopefully!) one favourite sorting method (Quicksort,
Mergesort) and a lower bound argument (decision tree) for
this. However, what if you are given some additional
information: say that the input values are only of $k$
distinct kinds? Or even more generally, that there are
$n_1, n_2, \dots, n_k$ of each kind respectively? What is the
bound now? The beautiful answer is that it is $\Theta(n\,H(X))$,
where $X$ is a random variable taking one of the $k$ distinct input
values with probabilities proportional to the corresponding
frequencies: $n_1/n, n_2/n, \dots, n_k/n$ respectively. This
answer includes all the cases: when no information is
available a priori, $H(X) = \log n$ (all distinct
input values), and if one knows there are $k$ distinct values
(but not how many of each kind), $H(X)$ is at most $\log k$
and the worst-case sorting bound is $\Theta(n \log k)$. Moreover, some simple
modifications of the standard sorting algorithms result in
output-sensitive performance achieving these
bounds. (Incidentally, it also shows that sorting gets
easier when there are repeated elements. I once heard a
speaker at a conference claim the opposite! He was speaking
on a very technical result about the tail distribution of
quicksort, employing some very high powered generating
function and complex analytic technology ...)
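To make the claim concrete, here is a small check of my own (not from the talks) comparing $n\,H(X)$ with $\log_2\binom{n}{n_1,\dots,n_k}$, the logarithm of the number of distinguishable input orderings; the frequency vectors are arbitrary examples. The multinomial bound never exceeds $n\,H(X)$, and the two agree up to lower-order terms:

\begin{verbatim}
from math import factorial, log2

def entropy_bound(counts):                 # n * H(X), in bits
    n = sum(counts)
    return n * sum((c / n) * log2(n / c) for c in counts)

def multinomial_bound(counts):             # log2( n! / (n_1! ... n_k!) )
    n = sum(counts)
    m = factorial(n)
    for c in counts:
        m //= factorial(c)
    return log2(m)

for counts in [[1] * 16, [4, 4, 4, 4], [10, 3, 2, 1]]:
    print(counts, round(entropy_bound(counts), 1),
          round(multinomial_bound(counts), 1))
\end{verbatim}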
Or take this curious geometrical fact: given $N$ distinct
points in three dimensions, project them onto the $XY$, $YZ$
and $ZX$ planes to get $N_{XY}$, $N_{YZ}$ and $N_{ZX}$ points
respectively. Then $N^2 \le N_{XY}\, N_{YZ}\, N_{ZX}$. There is a proof by
Cauchy-Schwarz, but it is always a challenge to reproduce
it. On the other hand, there is a charming four line proof
via entropy: take a point at random and compute the entropy
two different ways, directly as $\log N$ and indirectly as
the result of revealing the point via its three
projections. The crucial point, intuitively, is that the
latter gives the full information ``twice over'' (since the
information in each coordinate is repeated twice).
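Here is a quick numerical check of the inequality (my own toy example with random integer points, not something from the lectures):

\begin{verbatim}
import random

random.seed(0)
points = {(random.randint(0, 4), random.randint(0, 4), random.randint(0, 4))
          for _ in range(40)}

N = len(points)
N_xy = len({(x, y) for x, y, z in points})
N_yz = len({(y, z) for x, y, z in points})
N_zx = len({(z, x) for x, y, z in points})

print(N * N, "<=", N_xy * N_yz * N_zx)
assert N * N <= N_xy * N_yz * N_zx    # holds for every point set
\end{verbatim}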
This simple example has a fairly natural generalization
called Shearer's Lemma. The statement is somewhat
technical, so I won't give it here, but the proof is almost
the same as that above. But what is truly impressive is the
ease with which the inequality can be applied in diverse
situations to great effect. One example was counting
independent sets in regular bipartite graphs. But the most
impressive and elegant one was the solution by Friedgut and
Kahn to the following question of Erdős (is it? must be!):
given a graph $H$, what is the maximum number of
copies of $H$ that can appear in a graph with $\ell$ edges?
The answer is $\Theta(\ell^{\rho^*(H)})$, where $\rho^*(H)$ is the
fractional edge cover number of $H$.
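For the special case where $H$ is a triangle, the fractional edge cover number is $3/2$ (weight $1/2$ on each of the three edges covers every vertex), so the answer specialises to the classical fact that a graph with $\ell$ edges has $O(\ell^{3/2})$ triangles. A small brute-force check on an arbitrary random graph (my own example, just to see the numbers):

\begin{verbatim}
import random
from itertools import combinations

random.seed(0)
vertices = range(30)
edges = set(random.sample(list(combinations(vertices, 2)), 120))
adj = {v: set() for v in vertices}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

triangles = sum(1 for a, b, c in combinations(vertices, 3)
                if b in adj[a] and c in adj[a] and c in adj[b])
print(triangles, "<=", len(edges) ** 1.5)
\end{verbatim}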
The next segment of talks dealt with a notion of graph
entropy due originally to János Körner. I have often
been motivated to look at it via the problem of prefix-free
codes which is discussed in elementary algorithms courses:
assign binary codes to a bunch of objects so that no two
codes are prefixes of each other. We want to minimise the
average (or total) length of the code words, given the
frequencies of occurrence of the objects. The prefix-free
condition is motivated by the need to decode online. Huffman
coding comes within 1 bit of the optimum, which is $H(X)$, where $X$
is a random variable taking the different objects with
probabilities proportional to the frequencies of
occurrence. Now the generalization is to do the same for
objects not all of which need to be distinguished from each
other. There is a graph on the underlying objects as
vertices, where an edge between two objects indicates that
they must be distinguished from each other. The classical
case corresponds to the complete graph. Graph entropy is an
approach to this problem, which seems to be tantalizingly
still largely unresolved.
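Here is a minimal Huffman-coding sketch for the classical case (the standard textbook construction, written by me rather than taken from the talks; the frequencies are an arbitrary example), illustrating that the average code length lands between $H(X)$ and $H(X)+1$:

\begin{verbatim}
import heapq
from math import log2

freqs = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}

# build the Huffman tree; the integer id breaks ties between equal weights
heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
heapq.heapify(heap)
uid = len(heap)
while len(heap) > 1:
    w1, _, left = heapq.heappop(heap)
    w2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (w1 + w2, uid, (left, right)))
    uid += 1
tree = heap[0][2]

def code_lengths(node, depth=0):
    # depth of each leaf symbol = length of its codeword
    if isinstance(node, str):
        return {node: max(depth, 1)}
    left, right = node
    lengths = code_lengths(left, depth + 1)
    lengths.update(code_lengths(right, depth + 1))
    return lengths

lengths = code_lengths(tree)
total = sum(freqs.values())
avg_len = sum(freqs[s] * lengths[s] for s in freqs) / total
H = sum((w / total) * log2(total / w) for w in freqs.values())
print(round(H, 3), round(avg_len, 3))   # H <= avg_len < H + 1
\end{verbatim}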
Starting with the seminal paper of Körner where he used
graph entropy to give a very simple proof of earlier results
by Fredman and Komlós on sizes of covering families, graph
entropy has been successfully applied to a variety of
problems in combinatorics and complexity. So it certainly
seems to be a candidate for the essential toolkit. If one
looks at the literature on the subject, however, one is
confronted with two seemingly completely different and
equally mystifying definitions, which are nevertheless allegedly
equivalent. One is a definition with a distinctly
combinatorial optimization flavour: the entropy of a graph $G$,
with respect to a probability distribution $P = (p_v)$ on its vertices,
is the minimum of the convex function
$\sum_{v} p_v \log \frac{1}{a_v}$, where $a = (a_v)$
ranges over the vertex packing polytope of $G$, i.e. the convex hull of
the characteristic vectors of the independent sets in
$G$. The other definition, on the
other hand, seems hardcore information theory: it is the
minimum of the mutual information $I(X;Y)$ over all possible
joint distributions of the pair $(X, Y)$, where $X$ ranges
over the vertices of $G$ (with distribution $P$) and $Y$ over its independent sets,
subject to the condition that $X \in Y$
with probability 1.
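To get a feel for the polyhedral definition, here is a small numerical illustration of my own (not from the notes): the 5-cycle $C_5$ with the uniform distribution on its vertices. The symmetric point $a = (2/5, \dots, 2/5)$, the uniform mixture of the five maximum independent sets, gives the value $\log_2(5/2) \approx 1.32$; by symmetry and convexity this is in fact the minimum, and a crude random search over the polytope finds nothing better:

\begin{verbatim}
import random
from itertools import combinations
from math import log2

n = 5
edges = {(i, (i + 1) % n) for i in range(n)}            # the 5-cycle
def independent(s):
    return all((a, b) not in edges and (b, a) not in edges
               for a, b in combinations(s, 2))

ind_sets = [s for r in range(1, n + 1)
            for s in combinations(range(n), r) if independent(s)]

def objective(weights):     # weights = a convex combination of ind. sets
    a = [sum(w for w, s in zip(weights, ind_sets) if v in s)
         for v in range(n)]
    return sum((1 / n) * log2(1 / a_v) for a_v in a)

# symmetric point: uniform mixture of the five maximum independent sets
maxi = [i for i, s in enumerate(ind_sets) if len(s) == 2]
sym = [1 / len(maxi) if i in maxi else 0 for i in range(len(ind_sets))]
best = objective(sym)       # = log2(5/2)

random.seed(1)
for _ in range(20000):      # crude random search over the polytope
    w = [random.random() for _ in ind_sets]
    t = sum(w)
    best = min(best, objective([x / t for x in w]))
print(round(best, 4))       # 1.3219
\end{verbatim}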
Promising to clear the mists, Jaikumar started his talks
with the following simple puzzle: what is the minimum number
of bipartite graphs whose union will cover the complete
graph on $n$ vertices? There are a number of ways to tackle
this.
One is the deletion method: try to maintain as large an
independent set as permitted by the edges of the bipartite
graphs. At the outset we have the full vertex set. Consider
the first bipartite graph: now we must delete some vertices
to maintain the independence property. But we can always
retain the larger of the two halves of the bipartition. Thus,
as we consider successive bipartite graphs, we can keep at
least a half of the vertices. At the end, we have only a
single vertex (a maximal independent set in the complete
graph). Hence we need at least $\log_2 n$ bipartite
graphs. Now, one can tinker with this deletion process
(making it probabilistic!) and we are led to the polyhedral
definition. I believe this is an original route to arrive at
the definition and Jaikumar should be given the credit
officially somewhere (could he be persuaded to convert the
notes into a full-fledged survey for publication?).
The other is the entropy method: a vertex in the graph is
completely specified by giving its colour in each of the
bipartitions. Two different vertices cannot have the same
colour in all the bipartite graphs, because the edge
connecting them must appear in at least one of them. So,
pick a vertex at random, and let $X_i$ be its colour in
bipartite graph $i$. Now compute the entropy of the random
vertex in two different ways as usual and we get the same
result: at least $\log_2 n$ bipartite graphs are
necessary. Once again, one can improve this argument, and
this leads naturally to the information theoretic
definition.
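The same bit-colouring view immediately gives the matching upper bound (a folklore construction, not spelled out above; the code is my own sketch): give each vertex a distinct string of $\lceil \log_2 n \rceil$ bits and let the $i$-th bipartite graph split the vertices according to bit $i$. Every edge of the complete graph is covered precisely because two distinct vertices differ in some bit, which is exactly the fact the entropy argument exploits.

\begin{verbatim}
from math import ceil, log2
from itertools import combinations

n = 13                        # an arbitrary example size
k = ceil(log2(n))             # number of bipartite graphs used

def bit(v, i):
    return (v >> i) & 1

# edge {u, v} is covered by bipartite graph i iff bit i of u and v differ
covered = all(any(bit(u, i) != bit(v, i) for i in range(k))
              for u, v in combinations(range(n), 2))
print(k, covered)             # 4 True
\end{verbatim}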
Amongst the most basic and useful properties of graph
entropy is its subadditivity: the entropy of a union
of graphs is at most the sum of the entropies of the
components. This lies at the heart of its nature as a
measure of complexity and is repeatedly used in all the
applications. A beautiful side light is that it also
provides an information-theoretic characterization of
Berge's perfect graphs: a graph $G$ is perfect iff
it splits entropy perfectly, i.e.
$H(G, P) + H(\bar{G}, P) = H(P)$
(for every probability distribution $P$ on the vertices).
The subadditivity of entropy led Ravi Boppana to give an
elegant lower bound argument in parallel complexity. It was
this argument that inspired Kahn and Kim to their wonderful
solution to the problem of sorting under partial information
mentioned earlier. The argument is very easy to outline: at
the start we have the partial order identified with its
underlying comparability graph $G$, and at the end we have
the total order whose underlying comparability graph is
the complete graph $K_n$; so the sorting process leads to an increase
in graph entropy (taken with the uniform distribution on the vertices)
by $H(K_n) - H(G) = \log n - H(G)$. So, if
we are able to show the change in entropy resulting from a
single comparison is $\Theta(1/n)$, then the total number of
comparisons required is $\Theta(n(\log n - H(G)))$. How does this
relate to $\log e(P)$, the information theoretic bound, where $e(P)$
is the number of linear extensions of $P$? Kahn
and Kim show that for some constant $c > 0$,
\[
c\, n(\log n - H(G)) \;\le\; \log e(P) \;\le\; n(\log n - H(G)).
\]
Second, they show that
the required comparison always exists: in any poset, the increase in
entropy after a single comparison is never more than $O(1/n)$, but for a
suitable choice of the comparison it is at least $c'/n$
for some constant $c' > 0$. Given the fact that there is a polynomial time deterministic
algorithm to compute graph entropy via the ellipsoid algorithm, we
then have a complete answer to the problem of sorting under partial
information. Interestingly, the comparison made at any stage need not
always involve the most balanced pair! The simplicity of the strategy
notwithstanding, the proof is quite elaborate and involves,
inter alia, a nice laminar decomposition of the poset which might have
other applications. For details, you could see [14], but
there's a much better exposition in the workshop lecture notes
[1]. An open problem is to get an algorithm that
avoids the ellipsoid algorithm.