Subject Filter

A scientist's take on the Game of Kings

Tuesday, November 13, 2012

Introduction to Mutual Information


Despite the vague-sounding term, the concept behind mutual information is a rather simple one. It is essentially a measurement of the dependence between two random variables. Huh? Put simply, it describes a sort of correlation between the outcomes of two different varying quantities.



Below, I try to introduce the concept and some uses of mutual information. Since my experience with this statistical technique has focused on bioinformatic analysis, most of the examples and uses I describe relate to biology and sequence analysis, but there are a multitude of other uses!

What ways have you used mutual information, or what problems do you think it can be applied to? Please share below by leaving a comment.

The image above (and some of those below) come from Wikipedia; I claim no rights to them.




Take me out to the ballgame

To make this idea easier to comprehend, I will borrow an analogy from the world of baseball. A number of baseball statistics have been developed to record the progress and performance of players. In this example, I'll focus on the RBI (runs batted in, or the number of runs a player is responsible for) and batting average (number of hits per at bat).

Consider two players, Keith and Darryl (yes, I'm a Mets fan), who bat in consecutive order. Keith has a batting average of 0.250 (meaning that, in roughly 1 out of every 4 at bats, he gets a hit). He also has an AB/RBI (at bats per run batted in) of 14 (or, if you'd like, an RBI/AB of 0.07). Darryl has a batting average of 0.333 and an RBI/AB of 0.08. Note that it is possible to get more than one RBI per at bat (for example, if the player hits a home run with men on base).

From these statistics alone, you could predict the probability of a hit, or of a run batted in, for any individual at bat by either Keith or Darryl. The outcomes of those at bats can be thought of as random variables (the variable being, for example, a hit or an out for a particular at bat).

These variables, however, are not independent. It's easy to see that the probability that Darryl obtains a certain RBI number depends on Keith's performance in the preceding at bat; if Keith is already on base, then Darryl will have a higher probability of hitting an RBI (or two).

Therefore, you could say that there is mutual information between the batting average of Keith and the RBI of Darryl. When Keith gets a hit, Darryl tends (relatively more often) to get an RBI.

How to calculate Mutual Information

Mutual information is calculated from the following equation (courtesy of the Wikipedia entry on this topic):

I(X, Y) = SUM over all values of X and Y of: p(X, Y) * Log( p(X, Y) / ( p(X) * p(Y) ) )



In this equation, the mutual information score I(X, Y) is a sum, taken over each possible value of the variables X and Y, of terms built from the following quantities:

p(X, Y): The probability or frequency of the combination of the two variables; in other words, the observed frequency of X and Y jointly taking a particular pair of values, out of all possible combinations of X and Y.

p(X) times p(Y): The product of the individual (marginal) probabilities or frequencies of a particular value of X and a particular value of Y. This is the joint frequency you would expect if the two variables were independent.

Log( p(X, Y) / ( p(X) * p(Y) ) ): This term compares the two quantities above. Note that if there is no mutual information or dependence between the two variables, then p(X, Y) will equal p(X) * p(Y), and Log(1) = 0.

There are a number of ways to organize a large amount of data and calculate mutual information between two variables. One method that I have used, which is somewhat of a dumb, ad hoc approach (although pretty clever if you consider the limited tools I use), is the SUMPRODUCT formula in Microsoft Excel. How I actually turn a couple of formulas in Excel into statistical magic is outside the scope of this article (and might be the subject of a future post, either here or somewhere else); a rough sketch of the same calculation in code is given below.
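
For readers who would rather see it in code than in a spreadsheet, here is a minimal Python sketch of the calculation. This is an illustration, not the Excel workbook described above; the function name and the toy counts are my own inventions.

```python
import math

def mutual_information(joint_counts):
    """Mutual information (in nats) from a dict of (x, y) -> observed count."""
    total = sum(joint_counts.values())
    # Marginal frequencies p(X) and p(Y)
    px, py = {}, {}
    for (x, y), n in joint_counts.items():
        px[x] = px.get(x, 0.0) + n / total
        py[y] = py.get(y, 0.0) + n / total
    # Sum of p(X, Y) * Log( p(X, Y) / ( p(X) * p(Y) ) )
    mi = 0.0
    for (x, y), n in joint_counts.items():
        p_xy = n / total
        if p_xy > 0:
            mi += p_xy * math.log(p_xy / (px[x] * py[y]))
    return mi

# Made-up counts: when Keith gets a hit, Darryl tends to drive in a run.
dependent = {("hit", "rbi"): 20, ("hit", "no rbi"): 5,
             ("out", "rbi"): 5,  ("out", "no rbi"): 70}
independent = {("hit", "rbi"): 25, ("hit", "no rbi"): 25,
               ("out", "rbi"): 25, ("out", "no rbi"): 25}

print(mutual_information(dependent))    # clearly greater than zero
print(mutual_information(independent))  # essentially zero
```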

Closer to home: analyzing sequences of letters

Mutual information analysis can be adapted to a variety of problems and fields. Instead of measuring frequencies of hits, strike-outs, and RBIs, this statistical technique can be coupled to frequency analysis of words or other strings of letters (or characters). I will discuss later how this approach is applied to biological sequences; for now, let us consider an example using words in the English language.

I have created a list of 179 five-letter words, 117 of which start with the letters BA, with the remaining words starting with the letters TH. (I omit the list here for space considerations, but I can provide it on request.) For the first letter, we observe that B occurs at a frequency of 65%, and T at 35%. Likewise, frequencies of 65% and 35% are observed for A and H, respectively, at the second letter.

From this alone, you might conclude that the combination BA should occur 65% x 65%, or approximately 42%, of the time. The same approach yields the following expected frequencies (again, looking only at the first two letters):

Letter Combination    Expected Frequency    Observed Frequency
BA                    42.3%                 65%
TH                    12.3%                 35%
BH                    22.8%                 0%
TA                    22.8%                 0%

In this particular set, however, there are no words that begin with the letter pairs BH or TA. This is reflected in the observed frequency column of the grid above. This word set is an extreme example, specifically selected to illustrate the point, but I suspect that a much larger, unbiased set would still uncover particular letter combinations at higher-than-expected frequencies. In particular, TH begins a great many English words, while most other pairings of consonants with T or H do not begin words at all (SH, CH, and ST are some notable exceptions).

In this case, there is a functional connection between the first and second letters. If we apply mutual information analysis across all five letter positions (for all 179 words), we can generate a grid with a score for each pair of positions; a small sketch of this calculation follows below. For more information on how the scores are derived, please see the next section. (For any additional details, please contact me.)
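
The sketch below shows one way to build such a grid in Python. The handful of words listed in it are stand-ins; the actual 179-word BA/TH set is not reproduced here, so the numbers it prints will differ from the grids shown in this post.

```python
import math
from collections import Counter

# A stand-in word list; the real analysis used 179 five-letter BA/TH words.
words = ["bacon", "badge", "baker", "basil", "thing", "thick", "thorn", "these"]

def position_mi(words, i, j):
    """Mutual information between letter positions i and j across all words."""
    n = len(words)
    pair_counts = Counter((w[i], w[j]) for w in words)
    col_i = Counter(w[i] for w in words)
    col_j = Counter(w[j] for w in words)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi

length = len(words[0])
grid = [[position_mi(words, i, j) for j in range(length)] for i in range(length)]
for row in grid:
    print("  ".join("%.2f" % score for score in row))
```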

[Grid of unadjusted mutual information scores for each pair of the five letter positions]

In the above grid, the score for the covariance of each pair of positions in a five-letter word is given. The diagonal (highlighted by the staircase pattern and italicized), where a position is compared against itself, gives some of the highest scores. This is to be expected; for example, the frequency of a B at position 1, given that position 1 is a B, is 100%. The differences in scores among the diagonal points reflect the relative variety at each position. As stated earlier, this data set only contains words that start with the letter pairs BA and TH. Therefore, there is not much variety at positions 1 and 2, just two choices at each position. This limits the size of the possible score, and is an inherent limitation of mutual information analysis (I will expand upon this point later).

From the score grid, it would appear that the positions sharing the most mutual information are letters 3 and 4, with the pairing of 4 and 5 a close second (squares highlighted in yellow). Intuitively, we know this is not the right conclusion; it is simply a reflection of the low variety of choices at the other positions (as stated in the preceding paragraph). What we expected was to find a higher score for the pair 1 and 2 (green squares). In fact, this pair would score higher, but it is limited by the fact that these two positions do not have a great variety of possible letters. To correct for this suppression of scores at less varied positions, we can express the mutual information score in each row as a percentage of the score found at the diagonal, as in the short sketch below. As you might expect, this inflates the score for the 1-2 pair (in the unadjusted grid above, they are equal to their respective diagonals).
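
Reusing the grid and length variables from the sketch above, the adjustment itself is only a few lines (again, an illustration of the idea rather than the exact procedure behind the grids shown here):

```python
# Express each score as a percentage of its row's diagonal (self vs. self) score,
# so that positions with little variety (like 1 and 2 here) are not penalized.
adjusted = [[100.0 * grid[i][j] / grid[i][i] if grid[i][i] > 0 else 0.0
             for j in range(length)]
            for i in range(length)]
for row in adjusted:
    print("  ".join("%3.0f%%" % score for score in row))
```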

[Adjusted grid: each score expressed as a percentage of its row's diagonal score]

As you can see, when this adjustment is made, the letter positions that display covariance receive much higher scores. From this grid, we can conclude that there is a connection, or mutual information, between the first and second letter positions (bright yellow squares). This confirms what we already knew, since this set of words was crafted specifically to give this result (as an example). It is also not surprising to observe higher scores for adjacent pairs, even though this word set was not intentionally designed to produce high scores for the pairs 2-3, 3-4, and 4-5 (light yellow squares). This again makes some intuitive sense: certain letter pairs are found at greater than expected frequency because they combine to produce a particular sound. In this case, the need for a certain sound constrains the choice of letters, leading to covariance between different letters in a word.

Why should I bother finding Mutual Information?

There are a number of fields which utilize the computation of mutual information. One in particular that I have experience with is bioinformatic analysis in molecular biology. Here, mutual information is used to determine if the sequence (either nucleotide or amino acid, for nucleic acid or protein) of a particular gene or molecule shares a dependency with the sequence of another gene. Sometimes, this type of analysis is done comparing regions within the same gene.

Mutual information can arise because of a functional connection between a pair of variables. In the baseball analogy, there is a functional connection between the batting average of one player and the run production of the subsequent batter. In the five-letter example, the functional connection between letter pairs is built around the sounds they produce. There also exists a functional basis for the mutual information found in biological systems. Most often, the variables in a biological system subjected to this type of analysis are sequence positions. This is a discrete form of mutual information (as opposed to continuous) and is very similar to the analysis of five-letter words given above.

The differences in the sequence of a gene observed between different species are caused by random mutation, but are then subjected to evolutionary pressure. Therefore, the changes that are observed in a given gene or molecule are ones that the organism can tolerate (it can still survive with the altered version of that particular molecule). For a refresher on the importance and flow of biological information, please see an earlier post.

The interactions between individual amino acids in a protein help determine its structure. In addition to dictating the three-dimensional shape of the protein, some of these amino acids interact with other proteins or molecules within the cell. The identity of the amino acid (the sequence at a particular point) will determine how well it fulfills its particular role (whether that is binding, catalysis, etc.). Therefore, in some cases, changing the amino acid at a particularly important position in the molecule can severely impair function. As already stated, this can be caused by disrupting an interaction that residue has with other parts of the same molecule, or with another molecule entirely. If the amino acid or residue on the other side of this disrupted interaction has also changed, it is possible that the function of the protein is restored.

Therefore, at certain positions, the frequency of a particular amino acid will depend on the sequence in another part of the molecule, or another molecule entirely. These two positions then are said to share mutual information. Note that some positions have a very strict amino acid requirement for function, and thus no variation is seen in the protein across species. It is therefore impossible to observe coevolution at these points.

Detecting mutual information between two proteins (or between a protein and an RNA, or two RNAs) can be very useful to the biologist. Mutual information suggests that the two positions are evolutionarily dependent on one another, which in turn suggests a functional interaction. Therefore, armed only with a large batch of sequences (of the same gene across many different species), a biologist can make predictions about which amino acids or nucleotides form interacting pairs. This information is very important, because it can yield clues about both the three-dimensional structure of a molecule and the residues important for function and binding to other molecules.

Limits to Mutual Information Analysis

There are a few important caveats that anybody using a mutual information technique should keep in mind. Since the bulk of my experience using mutual information is with sequences that are biological in nature, I will use those examples. The problems, however, can transcend fields.

Mutual information analysis can uncover links between two variables, or in the case of biology, a link between two residues. As stated in the preceding section, if there is a direct and sequence specific interaction between two residues, this will link and shape the evolution of each residue.

The forces that cause a pair of residues to coevolve can also be exceedingly complex. For example, suppose there is a direct interaction between residues A and B, as well as between residues B and C. Mutual information analysis will detect links not only between A-B and B-C, but also between A and C, despite the lack of a direct interaction. So, while this analysis can correctly identify coevolving pairs, it cannot determine a priori which subset of those pairs is involved in direct biochemical interactions. One can attempt this by judging the size of the mutual information score between residues (one would expect A-B and B-C to score higher than A-C), but experimental evidence is necessary to definitively determine whether two residues interact. At the very least, ranking interactions by score might give a researcher a prioritized list of predicted interactions to test (this strategy rests on the same underlying theme of Occam's razor that I have posted about previously).

The complex network of residue pairs displaying mutual information can be difficult to disentangle. This task is made even more daunting by the noise and bias inherent in mutual information analysis. The level of conservation of a particular residue may skew the scores. Ideally, a particular residue will vary in a homogeneous fashion (approximately equal occurrence of each choice) and will display covariance with an equally homogeneous partner. In some situations, however, one or both of the residues will feature one choice / nucleotide / amino acid at a very high frequency. This limits the size of the score that is possible, an effect evident in the analysis of the five-letter words in the preceding section. Similar problems can be introduced by oversampling of a particular group of sequences (phylogenetic bias) or by a small sample size at a particular position.

Various techniques and adjustments have been developed to combat the factors that skew mutual information scores. These include the adjustment featured above for the five-letter word set, as well as techniques such as row and column weighting (RCW), sequence count corrections, and the average-product correction (APC; arguably the best method in this category, and sketched below). There are even techniques, such as direct-coupling analysis (DCA), which attempt to disentangle direct and indirect contacts. A more detailed explanation of how these techniques address the shortcomings of mutual information is outside the scope of this survey; I encourage the reader to explore the references given below for more details.
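
For the curious, here is a rough sketch of the average-product correction as described by Dunn et al. (2008): each pairwise score has subtracted from it the product of its row and column mean scores divided by the overall mean, which removes much of the background and phylogenetic signal. The function name and matrix layout are my own, and the sketch assumes a symmetric matrix with at least two columns and a nonzero average score; treat it as an outline rather than a reference implementation.

```python
# `mi_grid` is assumed to be a symmetric matrix (list of lists) of raw mutual
# information scores between alignment columns; the diagonal is ignored.
def apc_correct(mi_grid):
    n = len(mi_grid)
    # Mean MI of each column against all other columns, and the overall mean.
    col_mean = [sum(mi_grid[i][j] for j in range(n) if j != i) / (n - 1)
                for i in range(n)]
    overall_mean = sum(col_mean) / n
    # Corrected score: MI(i, j) minus the "average product" background term.
    return [[mi_grid[i][j] - (col_mean[i] * col_mean[j]) / overall_mean
             if i != j else 0.0
             for j in range(n)]
            for i in range(n)]
```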

A final word... for now

In some ways, I learned about mutual information out of necessity. I was attempting to demonstrate coevolution of residues in two genes, SmpB and tmRNA, which function in the process of trans-translation. My initial attempts were successful at counting the occurrence of particular sequence pairings, but I had no good way to score the resulting numbers. Mutual information provided the framework I needed to judge the level of covariation between two positions in the sequence of each gene.

Recently, as I work with others to expand this analysis, I have started to wonder what else I might use this approach for. I have a simple, ad hoc way to process data and compute mutual information from medium to large data sets, and I would like to leverage it for more than making predictions about biological interactions.

In a future post, I will ruminate on the possibilities for using this statistical technique to examine chess positions and openings, as well as the arguably much more important (or at least lucrative) analysis of stock market prices!

References (Resources for further reading)

Brandman et al. Sequence Coevolution between RNA and Protein Characterized by Mutual Information between Residue Triplets. PLoS ONE (2012) vol. 7 (1).

Burger, L and van Nimwegen, E. Disentangling Direct from Indirect Co-Evolution of Residues in Protein Alignments. PLoS Computational Biology (2010) vol. 6 (1).

Dunn et al. Mutual information without the influence of phylogeny or entropy dramatically improves residue contact prediction. Bioinformatics (2008) vol. 24 (3) pp. 333-340. (Average-product correction)

Gloor et al. Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry (2005) vol. 44 (19) pp. 7156-7165.

Gouveia-Oliveira and Pedersen. Finding coevolving amino acid residues using row and column weighting of mutual information and multi-dimensional amino acid representation. Algorithms Mol Biol (2007) vol. 2 p. 12.

Harish et al. Ribosomal History Reveals Origins of Modern Protein Synthesis. PLoS ONE (2012) vol. 7 (3).

Morcos et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. PNAS (2011) vol. 108 (49) pp. E1293-E1301.
