Paper Title

Analyzing the Impact of Undersampling on the Benchmarking and Configuration of Evolutionary Algorithms

Paper Authors

Diederick Vermetten, Hao Wang, Manuel López-Ibáñez, Carola Doerr, Thomas Bäck

Paper Abstract

The stochastic nature of iterative optimization heuristics leads to inherently noisy performance measurements. Since these measurements are often gathered once and then used repeatedly, the number of collected samples will have a significant impact on the reliability of algorithm comparisons. We show that care should be taken when making decisions based on limited data. Particularly, we show that the number of runs used in many benchmarking studies, e.g., the default value of 15 suggested by the COCO environment, can be insufficient to reliably rank algorithms on well-known numerical optimization benchmarks. Additionally, methods for automated algorithm configuration are sensitive to insufficient sample sizes. This may result in the configurator choosing a 'lucky' but poor-performing configuration despite exploring better ones. We show that relying on mean performance values, as many configurators do, can require a large number of runs to provide accurate comparisons between the considered configurations. Common statistical tests can greatly improve the situation in most cases but not always. We show examples of performance losses of more than 20%, even when using statistical races to dynamically adjust the number of runs, as done by irace. Our results underline the importance of appropriately considering the statistical distribution of performance values.
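
The undersampling effect described in the abstract can be illustrated with a small, self-contained simulation. The sketch below is not from the paper: the cost distributions (Gaussian), their means, the noise level, and the tested run counts are all assumptions chosen only for illustration. It estimates how often a configuration with a worse expected cost looks better than a truly better one when both are compared by their mean over a limited number of runs.

```python
# Illustrative sketch (not from the paper): a Monte Carlo estimate of how often
# a worse-on-average configuration wins a mean-based comparison purely by luck.
# All distributions and parameters below are assumptions for illustration.
import random

random.seed(42)

TRUE_MEAN_GOOD = 1.0   # hypothetical expected cost of the better configuration
TRUE_MEAN_BAD = 1.2    # hypothetical expected cost of the worse configuration
NOISE = 0.5            # assumed run-to-run standard deviation
TRIALS = 10_000        # number of simulated comparisons per sample size


def mean_cost(true_mean: float, runs: int) -> float:
    """Average observed cost over a given number of independent runs."""
    return sum(random.gauss(true_mean, NOISE) for _ in range(runs)) / runs


for runs in (5, 15, 50, 200):
    # Count how often the worse configuration appears better ("lucky" wins).
    wrong = sum(
        mean_cost(TRUE_MEAN_BAD, runs) < mean_cost(TRUE_MEAN_GOOD, runs)
        for _ in range(TRIALS)
    )
    print(f"runs={runs:4d}  P(worse config looks better) ~ {wrong / TRIALS:.3f}")
```

With these assumed parameters, the probability of ranking the two configurations incorrectly shrinks as the number of runs grows, which mirrors the abstract's point that small budgets such as 15 runs may not suffice for reliable comparisons.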
