Entropy is a property of physical systems. In communication theory, it’s a number that tells you how many bits you need to store a typical sequence produced by some source (more on that in a moment). In other words, it’s the *size of a piece of memory*. It’s a dimensionless quantity, but it doesn’t have to be – if your bits are fixed in size, you can measure entropy in inches! A high entropy source requires a large register to store its messages, while a zero-entropy source doesn’t need to store them at all.

It turns out that when you boil off all the mathematical formalism, mutual infomation has a similar intuitive physical interpretation, as the *useful size of a message*. I always forget how this argument works, so I thought I’d record it here.

Imagine you’re given access to some source of data, which produces a symbol $x$ each time it’s queried. You’re tasked with communicating this information as efficiently as possible to someone else, and you’re allowed to use whatever encoding scheme you like. Doing so efficiently just means using as little space as possible in communicating the message.

For some sources, this is quite easy. For example, if the source just responds with “hello!” at each query – and if we know it will *only* ever do this – then we don’t have to record the particulars of any message, except its length (the number of times the source was queried). To transmit the result of querying the source $N$ times, we would only need to write down the number $N$ – we could then later reconstruct the message exactly by repeating “hello!” $N$ times. Sources that don’t change much require less space.

We can quantify this by assigning a probability $p(x)$ to each possible symbol $x$. To figure out how much space is typically required, we can consider “typical sequences” – those where each symbol shows up exactly as many times as we’d expect it to. For sequences of length $N$, the mean number of occurences of $x$ in a particular sequence is $Np(x)$. And if each symbol is drawn independently from $p(x)$, then for very large $N$, by the central limit theorem the number of $x$ appearances will be normally distributed around $Np$, with a standard deviation proportional to $\sqrt{N}$. This means that the fluctuation in the number of occurrences is tiny compared to the total message size, as long as the message is long enough; the upshot is that for calculating message statistics we only have to worry about typical sequences.

This is great, because it means that if we want to encode a length-$N$ string, it’s enough to have a system for encoding each possible typical sequence. Here’s one way to do it:

- Assign an integer label to each of the typical sequences
- Whenever a message comes in (assuming it’s a typical sequence), write down a really compact description of the integer (like a binary encoding) which labels the sequence

That’s it!

With the source that only says “hello!”, there’s only one typical sequence – formed by a bunch of “hello!”’s – so only a single label is required. In general, the exact number of these is not hard to calculate – it’s a multinomial coefficient,

$$ N_t = \textrm{Number of typical sequences} = \frac{N!}{\left(Np(x_1)\right)! … \left(Np(x_k)\right)!}, $$

if there are $k$ possible symbols $x_1, … , x_k$. Applying Stirling’s approximation, which is not a bad idea when $N$ is large, and rearranging a bit,

$$ \log N_t \approx N \left( - \sum_{i=1}^k p(x_i) \log p(x_i) \right) $$

That thing on the right is what’s used to define the entropy (I’ll assume base-2 logs):

$$ H(X) = - \sum_{i=1}^k p(x_i) \log p(x_i) $$

which means that the number of typical sequences is just $N_t = 2^{N H(X)}$. Cool. Note that $H(X)$ is just a property of the source, and has nothing to do with the individual messages or how long they are.

So under the simple labeling scheme above, in order to pass on any typical sequence, you’ll need $2^{N H(X)}$ different labels, and therefore room for $\log 2^{NH(X)} = NH(X)$ digits on whatever piece of paper you use to record each message. The entropy is the number of output digits required (per input digit). It’s a size.

Just as entropy can be defined formally for a single random variable, without ever thinking about messages and code sizes, mutual information can be defined abstractly for any pair of random variables $X$ and $Y$. Here it is:

$$ I(X, Y) = H(X) + H(Y) - H(X, Y) $$

where $H(X, Y) = \sum_{x,y} -p(x,y) \log p(x,y)$ is the joint entropy.

As it turns out, there’s a nice physical interpretation of the mutual information as well: given some communication channel, it tells you how many bits of an input actually make it through to the output. It’s the size of the conveyed message.

To be more precise, imagine someone pushing a sequence of symbols $x$ into a long pipe; you’re sitting at the other end, recording outputs $y \sim p(y | x)$. Your task is to write down the input sequence. Strictly speaking this is impossible, since given some output $y$, you generally can’t be certain about the corresponding value of the input – the pipe is of dubious quality, set by $p(y | x)$. After you’ve finished recording the outputs, therefore, you’ll want to leave some space on the paper in order to write down all the possible inputs that are compatible with what you’ve observed. How much space do you need to leave?

Well, fix one particular output symbol $y_i$. Each time it appears, a number of different inputs could be present; if it appears $n_i$ times in a sequence of length $N$, these compatible inputs are length-$n_i$ sequences sampled from $p(x | y_i)$. Assuming $n_i$ to be large, we need only consider typical sequences – and as we saw above, there are about $2^{n_i H(X |y_i)}$ of these, where

$$ H(X| y_i) = \sum_x - p(x|y_i) \log p(x|y_i) $$

is the entropy of the input, conditioned on $y_i$. So, you’ll need to reserve $n_i H(X | y_i)$ digits for that particular set of missing input symbols. The same argument holds for every other output symbol, meaning that in total you’ll have to reserve

$$ N_{\textrm{missing}} = \sum_{i=1}^k n_i H(X| y_i) $$

digits to hold the possible inputs. Approximating $n_i = N p(y_i)$, this means

$$ N_{\textrm{missing}} / N = \sum_{i=1}^k p(y_i) H(X| y_i) $$

On the other hand, a bit of algebra shows that the definition of the mutual information given above can be rewritten as

$$ I(X, Y) = H(X) - \sum_{i=1}^k p(y_i) H(X| y_i), $$

or presented more suggestively,

$$ I(X, Y) = \frac{ N_{\textrm{input}} - N_{\textrm{missing}} }{N} $$

where $ N_{\textrm{input}} = N H(X)$ is the number of bits you’d need to record the input if you had direct access to it. So the mutual information measures the difference between the size of the input, and the size of the output memory that must be reserved due to uncertainty in the input. This, too, is just a measure of space! – in particular, how much of the output memory is actually conveying useful information. If the two variables have maximal mutual information, $I(X, Y) = N_{\textrm{input}}$, and the output captures all of the input. If they have none, $N_{\textrm{missing}} = N_{\textrm{input}}$ and you know nothing about the input at all.