Paper Title
A Hierarchical Approach to Scaling Batch Active Search Over Structured Data
Paper Authors
Paper Abstract
Active search is the process of identifying high-value data points in a large and often high-dimensional parameter space, where each evaluation can be expensive. Traditional active search techniques like Bayesian optimization trade off exploration and exploitation over consecutive evaluations, and have historically focused on evaluating a single example or a small number (<5) of examples per round. As modern datasets grow, so does the need to scale active search to large datasets and batch sizes. In this paper, we present a general hierarchical framework based on bandit algorithms to scale active search to large batch sizes by maximizing information derived from the unique structure of each dataset. Our hierarchical framework, Hierarchical Batch Bandit Search (HBBS), strategically distributes batch selection across a learned embedding space by facilitating wide exploration of different structural elements within a dataset. We focus our application of HBBS on modern biology, where large batch experimentation is often fundamental to the research process, and demonstrate batch design of biological sequences (protein and DNA). We also present a new Gym environment to easily simulate diverse biological sequences and to enable more comprehensive evaluation of active search methods across heterogeneous datasets. The HBBS framework improves upon standard performance, wall-clock, and scalability benchmarks for batch search by using a broad exploration strategy across coarse partitions and fine-grained exploitation within each partition of structured data.
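
To make the idea of "broad exploration across coarse partitions and fine-grained exploitation within each partition" concrete, here is a minimal sketch in Python. It is not the authors' HBBS implementation; the abstract does not specify the algorithm. The sketch assumes Thompson sampling with Beta posteriors as the partition-level bandit and a greedy pick inside each partition, and the names `select_batch`, `partition_of`, `score`, and `stats` are hypothetical. In the toy usage, the first nucleotide stands in for a cluster in a learned embedding space and GC count stands in for a surrogate model score.

```python
import random
from collections import defaultdict


def select_batch(candidates, partition_of, score, stats, batch_size, seed=0):
    """Pick a batch by sampling a partition per slot, then exploiting inside it.

    candidates   : list of items (e.g. sequences) not yet evaluated
    partition_of : hypothetical function mapping an item to a coarse partition id
    score        : hypothetical surrogate score (higher means more promising)
    stats        : dict partition id -> (successes, failures) from earlier rounds
    """
    rng = random.Random(seed)

    # Group candidates by partition and sort ascending, so pop() yields the best.
    pools = defaultdict(list)
    for item in candidates:
        pools[partition_of(item)].append(item)
    for pool in pools.values():
        pool.sort(key=score)

    batch = []
    while len(batch) < batch_size and pools:
        # Coarse level: one Thompson draw per partition that still has candidates.
        draws = {}
        for pid in pools:
            successes, failures = stats.get(pid, (0, 0))
            draws[pid] = rng.betavariate(successes + 1, failures + 1)
        chosen = max(draws, key=draws.get)

        # Fine level: greedily take the highest-scoring remaining candidate.
        batch.append(pools[chosen].pop())
        if not pools[chosen]:
            del pools[chosen]
    return batch


# Toy usage with made-up DNA strings (illustrative data only).
toy_candidates = ["ACGT", "AGGC", "CCGA", "CTTA", "GGCC", "GATA", "TTAA", "TGCG"]
toy_stats = {"A": (3, 1), "C": (0, 4), "G": (2, 2)}  # partition "T" is unseen
toy_batch = select_batch(
    toy_candidates,
    partition_of=lambda seq: seq[0],
    score=lambda seq: seq.count("G") + seq.count("C"),
    stats=toy_stats,
    batch_size=4,
)
```

Because each slot in the batch draws afresh from the partition posteriors, a large batch tends to spread over several partitions rather than collapsing onto the single partition with the best mean, which is one simple way to get the wide exploration the abstract describes. Unseen partitions fall back to a uniform Beta(1, 1) prior in this sketch.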