Compare Many Experiments
In Getting Started/02 Comparing Experiments, we discussed comparing two experiments against each other using a variety of methods. This provided lots of information about the probability of existence and significance, as well as visual intuition for how different the metric distributions of two different experiments are.
Repeating this process for many experiments is tedious, however: the number of pairwise comparisons grows quadratically with the number of experiments (\(n(n-1)/2\) pairs for \(n\) experiments)!
One setting where we can expect to compare many experiments at once is an open-source competition (e.g., Kaggle). This is exactly the scenario tested by Tötsch & Hoffmann (2020) [1], who computed the expected winnings for the top-10 participants in the 'Recursion Cellular Image Classification' challenge, based on the probability that each participant's classifier outperformed those of all other participants.
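To give a feel for how quickly the number of pairwise comparisons grows, the short sketch below counts the unordered pairs among \(n\) experiments. The function name `num_pairwise_comparisons` is purely illustrative and is not part of the library.

```python
from math import comb


def num_pairwise_comparisons(n_experiments: int) -> int:
    """Number of unordered pairs among n experiments, i.e. n * (n - 1) / 2."""
    return comb(n_experiments, 2)


# Print the number of pairwise comparisons for a few study sizes
for n in (2, 5, 10, 50):
    print(f"{n} experiments -> {num_pairwise_comparisons(n)} pairwise comparisons")
```

For the ten competitors considered in this section, that already amounts to 45 separate pairwise reports.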
Competition Winnings
While the test set confusion matrices are hidden, from the accuracy scores and the size of the test set we can recreate a confusion matrix that would have produced those accuracy scores. The table for the top 10 participants might look like this:
Rank | TeamId | Score | TP+TN | FP+FN | TP | FN | FP | TN |
---|---|---|---|---|---|---|---|---|
1 | 3467175 | 0.99763 | 15087 | 36 | 7544 | 18 | 18 | 7544 |
2 | 3394520 | 0.99672 | 15073 | 50 | 7537 | 25 | 25 | 7537 |
3 | 3338942 | 0.99596 | 15062 | 61 | 7531 | 31 | 31 | 7531 |
4 | 3339018 | 0.99512 | 15049 | 74 | 7525 | 37 | 37 | 7525 |
5 | 3338836 | 0.99498 | 15047 | 76 | 7524 | 38 | 38 | 7524 |
6 | 3429037 | 0.99380 | 15029 | 94 | 7515 | 47 | 47 | 7515 |
7 | 3346448 | 0.99296 | 15017 | 106 | 7509 | 53 | 53 | 7509 |
8 | 3338664 | 0.99296 | 15017 | 106 | 7509 | 53 | 53 | 7509 |
9 | 3338358 | 0.99282 | 15014 | 109 | 7507 | 55 | 55 | 7507 |
10 | 3339624 | 0.99240 | 15008 | 115 | 7504 | 58 | 58 | 7504 |
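As a rough illustration of how such a table can be derived, the sketch below turns an accuracy score and a test set size into one possible confusion matrix. It assumes a binary problem with the correct and incorrect predictions split as evenly as possible between the two classes, and a test set size of 15,123 (the sum of the TP+TN and FP+FN columns above). The helper `reconstruct_confusion_matrix` is hypothetical, and its rounding may differ by one count from the table.

```python
def reconstruct_confusion_matrix(accuracy: float, test_set_size: int) -> dict[str, int]:
    """Build one confusion matrix that is consistent with the given accuracy.

    Assumption: correct and incorrect predictions are split as evenly as
    possible between the positive and negative class.
    """
    correct = round(accuracy * test_set_size)  # TP + TN
    errors = test_set_size - correct           # FP + FN

    tn = correct // 2
    tp = correct - tn  # receives the extra count when `correct` is odd
    fn = errors // 2
    fp = errors - fn   # receives the extra count when `errors` is odd

    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Accuracy of the top-ranked team; test set size inferred from the table above
cm = reconstruct_confusion_matrix(accuracy=0.99763, test_set_size=15123)
print(cm)
```

Many different confusion matrices produce the same accuracy, so this reconstruction is only suitable for metrics, like accuracy, that depend solely on the total number of correct predictions.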
Using the Study.report_listwise_comparison method, we can request a table with the probability that each competitor's accuracy score achieved a certain rank:
Group | Experiment | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 | Rank 8 | Rank 9 | Rank 10 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3467175 | 0.9297 | 0.0677 | 0.0025 | | | | | | | |
2 | 3394520 | 0.0678 | 0.7916 | 0.1264 | 0.0090 | 0.0053 | | | | | |
3 | 3338942 | 0.0024 | 0.1254 | 0.6522 | 0.1282 | 0.0902 | 0.0016 | | | | |
4 | 3339018 | | 0.0137 | 0.1659 | 0.4415 | 0.3568 | 0.0191 | 0.0012 | 0.0013 | 0.0004 | |
5 | 3338836 | | 0.0016 | 0.0502 | 0.3609 | 0.4572 | 0.0997 | 0.0120 | 0.0124 | 0.0049 | 0.0010 |
6 | 3429037 | | | 0.0026 | 0.0518 | 0.0755 | 0.5209 | 0.1282 | 0.1281 | 0.0688 | 0.0241 |
8 | 3338664 | | | 0.0001 | 0.0070 | 0.0121 | 0.2024 | 0.2622 | 0.2572 | 0.1764 | 0.0825 |
7 | 3346448 | | | | 0.0014 | 0.0024 | 0.0964 | 0.2524 | 0.2540 | 0.2381 | 0.1552 |
9 | 3338358 | | | | 0.0002 | 0.0006 | 0.0445 | 0.2099 | 0.2116 | 0.2734 | 0.2598 |
10 | 3339624 | | | | | | 0.0154 | 0.1341 | 0.1353 | 0.2379 | 0.4773 |
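To build some intuition for where such rank probabilities come from, here is a minimal Monte Carlo sketch using only the Python standard library. It is not the library's implementation: it assumes each team's accuracy follows an independent Beta posterior (with a uniform prior) over the correct and incorrect counts from the first table, whereas the library samples entire confusion matrices. The function `rank_probabilities` and its parameters are hypothetical.

```python
import random

# (correct, incorrect) prediction counts per team, i.e. the TP+TN and FP+FN
# columns of the reconstructed confusion matrix table
counts = {
    "3467175": (15087, 36),
    "3394520": (15073, 50),
    "3338942": (15062, 61),
    "3339018": (15049, 74),
    "3338836": (15047, 76),
    "3429037": (15029, 94),
    "3346448": (15017, 106),
    "3338664": (15017, 106),
    "3338358": (15014, 109),
    "3339624": (15008, 115),
}


def rank_probabilities(counts, num_samples=5000, seed=0):
    """Estimate P(team achieves rank k) by sampling accuracies and ranking them."""
    rng = random.Random(seed)
    teams = list(counts)
    tallies = {team: [0] * len(teams) for team in teams}

    for _ in range(num_samples):
        # Assumption: Beta(correct + 1, incorrect + 1) posterior for each accuracy
        draws = {
            team: rng.betavariate(correct + 1, incorrect + 1)
            for team, (correct, incorrect) in counts.items()
        }
        # Highest sampled accuracy gets rank 1 (index 0)
        ordering = sorted(teams, key=draws.get, reverse=True)
        for rank_index, team in enumerate(ordering):
            tallies[team][rank_index] += 1

    return {team: [t / num_samples for t in tally] for team, tally in tallies.items()}


probs = rank_probabilities(counts)
for team, row in probs.items():
    print(team, " ".join(f"{p:.4f}" for p in row))
```

Because of the simplified model and the smaller number of samples, the estimates will not match the table exactly, but the same qualitative picture should emerge: the top rank is fairly certain, while the lower ranks are spread over several teams.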
While we see no rank inversions, there is considerable ambiguity in many of the ranks. This indicates that, despite a large test set of \(N=15.1k\) images, the margins between the top competitors are so narrow that it is difficult to say definitively which team achieved which rank (especially for ranks \(r>2\)).
The competition organizers offered a $10,000 prize for 1st place, a $2,000 prize for 2nd place and a $1,000 prize for 3rd place. Using these probabilities, we can compute the expected prize money each competitor should have received in a fair division of the competition winnings. With the study, this is done by calling study.report_expected_reward(metric='acc', rewards=[10000, 2000, 1000]):
Group | Experiment | E[Reward] |
---|---|---|
1 | 3467175 | 9435.19 |
2 | 3394520 | 2385.68 |
3 | 3338942 | 930.13 |
4 | 3339018 | 146.50 |
5 | 3338836 | 100.89 |
6 | 3429037 | 1.57 |
8 | 3338664 | 0.03 |
7 | 3346448 | 0.01 |
9 | 3338358 | 0.00 |
10 | 3339624 | 0.00 |
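The expected reward is simply the rank probabilities weighted by the prize for each rank, \(E[\text{reward}] = \sum_k P(\text{rank}=k) \cdot \text{reward}_k\). The sketch below reproduces this calculation by hand for the top three teams, using the rounded probabilities from the rank table. The helper `expected_reward` is hypothetical and only illustrates the arithmetic, not how the library computes it internally.

```python
def expected_reward(rank_probs, rewards):
    """Sum of P(rank = k) * reward_k; ranks without a prize contribute nothing."""
    return sum(p * r for p, r in zip(rank_probs, rewards))


# Prize money for 1st, 2nd and 3rd place
rewards = [10000, 2000, 1000]

# Probabilities of achieving rank 1, 2 and 3, taken from the rank table
top_three = {
    "3467175": [0.9297, 0.0677, 0.0025],
    "3394520": [0.0678, 0.7916, 0.1264],
    "3338942": [0.0024, 0.1254, 0.6522],
}

for team, rank_probs in top_three.items():
    print(team, round(expected_reward(rank_probs, rewards), 2))
```

Since the table entries are rounded to four decimals, the hand-computed values land close to, but not exactly on, the reported expected rewards.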
So while ranks 1 and 2 clearly did deserve the lion's share of the competition winnings, ranks 4 and 5 deserved substantially more than the $0 they received.
The following figure (source code can be found in Explanation/A Replication of Tötsch, N. & Hoffmann, D. (2020). 'Classifier uncertainty: evidence, potential impact, and probabilistic treatment' ) summarizes the situation:
[1] Tötsch, N. & Hoffmann, D. (2020). 'Classifier uncertainty: evidence, potential impact, and probabilistic treatment'.