The current implementation of the challenge score metric uses Monte Carlo tie-breaking with 10**4 permutations to approximate the expected confusion matrix. This has several drawbacks:
- Computationally expensive: runtime grows as O(num_permutations · n log n).
- Non-reproducible: results vary between runs due to the randomness, which complicates benchmarking and CI.
- When bootstrapping to estimate performance distributions (as we did in our paper), the repeated Monte Carlo sampling makes the metric prohibitively slow and effectively unusable.
Proposal:
Replace the sampling with an exact computation based on the hypergeometric expectation: within each group of tied scores, the number of positives that end up above the cutoff under random tie-breaking follows a hypergeometric distribution, so its expectation has a closed form. This removes the stochasticity, guarantees reproducibility, and reduces runtime to O(n log n).
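To make the idea concrete, here is a minimal sketch of the exact computation. The function name and signature are illustrative, not the repository's actual API; it assumes the expected confusion counts are needed for a top-m cutoff over scores with ties. Within a tie group of size g containing k positives, filling t remaining slots uniformly at random draws a hypergeometric number of positives with expectation t · k / g, so no permutations are sampled.

```python
import numpy as np

def expected_confusion_counts(y_true, y_score, m):
    """Expected (TP, FP) when selecting the top-m scores with random
    tie-breaking, computed exactly via the hypergeometric expectation.
    Hypothetical sketch; names are assumptions, not the real API."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(-y_score, kind="stable")  # sort descending by score
    y_true, y_score = y_true[order], y_score[order]

    exp_tp = 0.0
    taken = 0
    i = 0
    n = len(y_score)
    while i < n and taken < m:
        # Find the extent of the current tie group.
        j = i
        while j < n and y_score[j] == y_score[i]:
            j += 1
        g = j - i                   # tie-group size
        k = int(y_true[i:j].sum())  # positives in the group
        t = min(g, m - taken)       # slots left to fill from this group
        exp_tp += t * k / g         # hypergeometric expectation
        taken += t
        i = j
    exp_fp = min(m, n) - exp_tp
    return exp_tp, exp_fp
```

The sort dominates the cost, giving the O(n log n) runtime claimed above; the scan over tie groups is linear. For example, with `y_true = [1, 0, 1, 0]`, `y_score = [0.9, 0.5, 0.5, 0.1]`, and `m = 2`, one positive is selected outright and the tied pair at 0.5 contributes an expected 0.5 positives, yielding (1.5, 0.5).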
Related PR: #4