Experiment Aggregation

In machine-learning setups, it often happens that we have several experiments measuring the same effect. For example, when we perform k-fold cross-validation, we try to measure the generalization gap of the same learning algorithm, just using different subsets of the same dataset. In those cases, we're typically not interested in the individual experiment values; instead, we want the distribution of the average effect across experiments.

In bayes_conf_mat, we call this experiment aggregation. Using only samples drawn from the \(M\) individual empirical metric distributions, \(\mu\sim p_{m}(\mu)\), we wish to find the aggregate distribution \(q(\mu)\). Ideally, the aggregate distribution consolidates the information present in each experiment distribution, producing a distribution that is more confident about the true metric value than any individual experiment could be.

For bayes_conf_mat, we utilize two frameworks for producing such aggregate distributions.

Meta-Analysis

The first, and by far the most common, approach to combining different probability distributions is meta-analysis. The simplest meta-analysis estimator is the fixed-effects model, which infers the parameters of the aggregate distribution from an inverse-variance weighted mean of the individual distributions:

\[
\begin{aligned}
w_{i}&=\dfrac{\sigma_{i}^{-2}}{\sum_{j=1}^{M}\sigma_{j}^{-2}} \\
\tilde{\mu}&=\sum_{i=1}^{M}w_{i}\mu_{i} \\
\tilde{\sigma}^{2}&=\dfrac{1}{\sum_{i=1}^{M}\sigma_{i}^{-2}} \\
q(\mu;\tilde{\mu},\tilde{\sigma}^{2})&=\mathcal{N}(\mu;\tilde{\mu},\tilde{\sigma}^{2})
\end{aligned}
\]
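As a concrete illustration, the estimator can be sketched in a few lines of NumPy. The `fixed_effects_aggregate` helper below is hypothetical and for exposition only; it is not the interface of the library's FEGaussianAggregator:

```python
import numpy as np

def fixed_effects_aggregate(means, variances):
    """Inverse-variance weighted aggregation of per-experiment estimates.

    A minimal sketch of the fixed-effects model above; `means` and
    `variances` hold each experiment's metric mean and variance.
    """
    means = np.asarray(means, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)  # sigma_i^{-2}
    weights = precisions / precisions.sum()                # w_i
    mu_tilde = np.sum(weights * means)                     # aggregate mean
    var_tilde = 1.0 / precisions.sum()                     # aggregate variance
    return mu_tilde, var_tilde

# Three experiments measuring the same metric:
mu, var = fixed_effects_aggregate([0.81, 0.79, 0.84], [0.002, 0.001, 0.004])
# var is smaller than any individual variance: the aggregate is more confident.
```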

There are many more meta-analysis tools, including Bayesian approaches, but these come with additional complexity and assumptions, with little added benefit for bayes_conf_mat.

We implement the following aggregation methods within this framework:

  1. FEGaussianAggregator: the standard fixed-effects model
  2. REGaussianAggregator: a random-effects model that tries to correct for inter-experiment heterogeneity

A standard work on meta-analysis (from the perspective of systematic reviews) is the Cochrane Handbook [1].

Conflation

The second framework for consolidating different experiments is conflation, a term coined by Hill (2011) [2] and Hill & Miller (2011) [3]. It leverages a very simple method derived from first principles of probability theory. Specifically, given a set of \(M\) probability distributions over the same space (in our case, the classification evaluation metric \(\mu\)), the conflated distribution is computed as:

\[
\begin{aligned}
q(\mu)&=\&\left(p_{1}(\mu),p_{2}(\mu),\ldots,p_{M}(\mu)\right) \\
&=\dfrac{\prod_{i=1}^{M}p_{i}(\mu)}{\int_{-\infty}^{\infty}\prod_{j=1}^{M}p_{j}(\mu)\,d\mu}
\end{aligned}
\]

In other words, the conflation is the renormalized product of the individual distributions. For discrete distributions, the integral becomes a sum over the shared support.
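For intuition, here is a minimal NumPy sketch of conflating discrete distributions defined on a shared support. The `conflate_pmfs` helper is hypothetical, not part of the library's API:

```python
import numpy as np

def conflate_pmfs(pmfs):
    """Conflate discrete distributions defined on the same support.

    Elementwise product of the PMFs, renormalized so the result sums
    to one (the sum plays the role of the integral above).
    """
    product = np.prod(np.asarray(pmfs, dtype=float), axis=0)
    total = product.sum()
    if total == 0.0:
        raise ValueError("supports do not overlap; conflation is undefined")
    return product / total

p1 = np.array([0.1, 0.6, 0.3])
p2 = np.array([0.2, 0.5, 0.3])
q = conflate_pmfs([p1, p2])  # mass concentrates where both agree
```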

This method has a nice intuition: the only way a value \(\mu\) receives large probability mass or density in the aggregate distribution is if all the individual distributions mutually agree on it.

In their papers, Hill & Miller [2][3] go on to show that conflation uniquely minimizes the loss of Shannon information due to aggregation. This means the conflated distribution minimizes the additional 'surprise' incurred when replacing the individual \(p_{i}(\mu)\) with the single distribution \(\&(p_{1}(\mu),\ldots,p_{M}(\mu))\). They also prove a variety of other useful properties, but most importantly, they show that under normality assumptions, the fixed-effects meta-analytical estimator is just a special case of conflation.
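This special case is easy to verify: multiplying \(M\) Gaussian densities and renormalizing yields another Gaussian, and completing the square shows its parameters are exactly the fixed-effects estimates from above:

\[
\begin{aligned}
\&\left(\mathcal{N}(\mu;\mu_{1},\sigma_{1}^{2}),\ldots,\mathcal{N}(\mu;\mu_{M},\sigma_{M}^{2})\right)
&\propto\prod_{i=1}^{M}\exp\left(-\dfrac{(\mu-\mu_{i})^{2}}{2\sigma_{i}^{2}}\right) \\
&\propto\exp\left(-\dfrac{(\mu-\tilde{\mu})^{2}}{2\tilde{\sigma}^{2}}\right)
\end{aligned}
\]

with \(\tilde{\mu}\) and \(\tilde{\sigma}^{2}\) as defined in the meta-analysis section.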

We implement the following aggregation methods within this framework:

  1. FEGaussianAggregator: the conflation of several Gaussian distributions; equivalent to the fixed-effects meta-analysis estimator above
  2. BetaAggregator: the conflation of several Beta distributions; good for bounded metrics with values near their maximum (or minimum); a closed-form sketch follows this list
  3. GammaAggregator: the conflation of several Gamma distributions; good for positive, unbounded metrics
  4. HistogramAggregator: the conflation of discretized histogram approximations of the individual experiment samples
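For the parametric aggregators, the conflation has a closed form. For instance, the product of \(M\) Beta densities is proportional to another Beta density, so the conflation of \(\mathrm{Beta}(\alpha_{i},\beta_{i})\) distributions is \(\mathrm{Beta}\left(\sum_{i}\alpha_{i}-(M-1),\sum_{i}\beta_{i}-(M-1)\right)\). A SciPy-flavored sketch (a hypothetical helper, independent of the actual BetaAggregator interface):

```python
from scipy import stats

def conflate_betas(params):
    """Closed-form conflation of Beta(a_i, b_i) distributions."""
    M = len(params)
    a = sum(a_i for a_i, _ in params) - (M - 1)
    b = sum(b_i for _, b_i in params) - (M - 1)
    return stats.beta(a, b)

# Two experiments' accuracy distributions, conflated:
q = conflate_betas([(90.0, 10.0), (85.0, 15.0)])
print(q.mean(), q.std())  # more concentrated than either input
```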

Parametric Assumptions

So far, we've been careful not to introduce parametric assumptions into the distributions of experiment metrics. Despite this, each metric distribution has its own distinct shape, even when using uninformative priors.

Most of the listed experiment aggregation methods, however, require the user to make some assumptions about the metric distributions. While the conflation operator makes non-parametric experiment aggregation possible in theory, it still requires access to the probability density/mass function. By design, we don't have access to the PDF/PMF, and have to estimate it from the samples. Unfortunately, especially as distributions become narrower, it can happen that large parts of the support receive zero density. As a result, the product in the conflation operator becomes zero as well, and with it the normalizing constant, resulting in an indeterminate expression.
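This failure mode is easy to reproduce with plain NumPy histograms (a sketch, not the library's HistogramAggregator):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two narrow experiment distributions concentrated in disjoint regions:
samples_a = rng.normal(0.70, 0.005, size=10_000)
samples_b = rng.normal(0.80, 0.005, size=10_000)

# Histogram both sample sets on a shared grid:
edges = np.linspace(0.65, 0.85, 101)
p_a, _ = np.histogram(samples_a, bins=edges, density=True)
p_b, _ = np.histogram(samples_b, bins=edges, density=True)

# Wherever one histogram is zero, the product is zero; here that is
# every bin, so the normalizing constant is 0 and conflation is 0/0.
product = p_a * p_b
print(product.sum())  # 0.0
```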

If all the individual distributions share a support, we can simply compute the product over the region where every distribution has non-zero density. There are situations, however, where we would not expect such a region to exist: specifically, whenever there is high inter-experiment heterogeneity.

This can be seen in the following example. We have two experiments whose empirical distributions do not overlap: the region between them, precisely where we would expect the aggregate distribution to lie, receives zero density from both.

Figure: Experiment aggregation under heterogeneity with a histogram aggregator.

Using the histogram aggregator, we get an implausible aggregate distribution, even though it concentrates mass where the individual experiment distributions have high density. Hence, making a parametric assumption (as we do in the how-to guides) would be prudent here.