Abstract

Design researchers have struggled to produce quantitative predictions for exactly why and when diversity might help or hinder design search efforts. This paper addresses that problem by studying one ubiquitously used search strategy—Bayesian optimization (BO)—on a 2D test problem with modifiable convexity and difficulty. Specifically, we test how providing diverse versus non-diverse initial samples to BO affects its performance during search and introduce a fast ranked-determinantal point process method for computing diverse sets, which we need to detect sets of highly diverse or non-diverse initial samples. We initially found, to our surprise, that diversity did not appear to affect BO, neither helping nor hurting the optimizer’s convergence. However, follow-on experiments illuminated a key trade-off. Non-diverse initial samples hastened posterior convergence for the underlying model hyper-parameters—a model building advantage. In contrast, diverse initial samples accelerated exploring the function itself—a space exploration advantage. Both advantages help BO, but in different ways, and the initial sample diversity directly modulates how BO trades those advantages. Indeed, we show that fixing the BO hyper-parameters removes the model building advantage, causing models initialized with diverse samples to always outperform models trained with non-diverse samples. These findings shed light on why, at least for BO-type optimizers, the use of diversity has mixed effects and caution against the ubiquitous use of space-filling initializations in BO. To the extent that humans use explore-exploit search strategies similar to BO, our results provide a testable conjecture for why and when diversity may affect human-subject or design team experiments.

1 Introduction and Related Work

One open question within design research is when or under what conditions providing diverse stimuli or starting solutions to either humans or algorithms can improve their designs’ final performance. Researchers have struggled to produce quantitative predictions or explanations for exactly why and when diversity might help or hinder design search efforts. In studies of human designers or teams, there have been numerous empirical results on the effect of diverse stimuli or sets of stimuli on designers, typically referred to under the topic of design fixation (for recent reviews, see Refs. [1,2]). In general, available empirical results are mixed and it is difficult to quantitatively predict, for a new problem or person, whether diversity in problem stimuli will help. For instance, there are a number of empirical demonstrations of positive effects of example diversity on novelty and diversity of ideas [3–5], but substantially more mixed results on the effects of diversity on solution quality, with some observations of positive effects [6–8], some null or contingent effects [4,9–14], and even some negative effects on solution quality [15,16].

Likewise, in research focused purely on optimization, common academic and industrial practice initializes search algorithms with different strategies like Latin hypercube sampling (LHS) [17] and others in an attempt to “fill” or “cover” a space as uniformly as possible [18] or via quasi-random methods [19–21]. Some methods build diversity-encouraging loss functions directly into their core search algorithms, as in common meta-heuristic optimizers [22] like particle swarm optimization, simulated annealing, and genetic algorithms, with one of the most well-known diversity-inducing examples being the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [23]. For Bayesian optimization (BO) specifically, a common strategy is to build diversity directly into the acquisition function used in sampling new points from the Gaussian process (GP) posterior [24]. As with human-subjects experiments, the precise effect of diversity on optimization performance is often problem dependent [22] and difficult to predict a priori. Nevertheless, optimization practitioners take these steps to improve initial sample diversity with the hope that the optimizer will converge faster or find better global optima.

But is encouraging initial diversity in this way always a good idea? If so, when and why is it good? Are there any times or conditions when diversity might hurt rather than help our search for good designs?

(Spoiler Alert: Yes, it can—see Sec. 4 for how and Sec. 6 for why.)

To address the above questions, this paper studies one type of commonly used search strategy—BO—and how the diversity of its initialization points affects its performance on a search task. We uncover a fascinating dance that occurs between two competing advantages that initial samples endow upon BO—a model building versus space exploration advantage that we define later—and how the initial samples’ diversity directs the choreography. While the fundamental reason for this interplay will later appear straightforward (and perhaps even discernible through thought experiments rather than numerical experiments), it nevertheless flies in the face of how most practitioners initialize their BO routines or conduct optimal experimental design studies. It also posits a testable prediction about how to induce greater effects of diversity on novice human designers or the conditions under which there may be mixed or even negative effects (see Sec. 6).

Before describing our particular experiment and results, we will first review why BO is a meaningful and generalizable class of search algorithm to use, as well as past work that has tried to understand how diversity affects search processes such as optimization.

Why model design search as Bayesian optimization? While this paper addresses only BO, this is an important algorithm in that it plays an out-sized role within the design research and optimization community. For example, BO underlies a vast number of industrially relevant gradient-free surrogate modeling approaches implemented in major design or analysis packages, where it is referred to under a variety of names, including Kriging methods or meta-modeling [25,26]. Its use in applications of computationally expensive multidisciplinary optimization problems is, while not universal [27], quite widespread. Likewise, researchers studying human designers often use BO as a proxy model [28] to understand human search, due to the interplay between exploration and exploitation that lies at the heart of most BO acquisition functions like expected improvement (EI). More generally, there is a robust history of fruitful research in cognitive science modeling human cognition as Bayesian processing [29], such as concept learning in cognitive development [30], causal learning [31], and analogical reasoning [32].

While the bulk of BO-related papers focus on new algorithms or acquisition functions, few papers focus on how BO is initialized, preferring instead the general use of space-filling initializations that have a long history in the field of optimal experiment design [27]. In contrast, this paper shows that, in certain situations, faith in space-filling designs might be misplaced, particularly when the BO kernel hyper-parameters are adjusted or fit during search.

What does it even mean for samples to be diverse? As a practical matter, if we wish to study how diverse samples impact BO, we face a subtle but surprisingly non-trivial problem: how exactly do you quantify whether one set of samples is more or less diverse than another? This is a set-based (i.e., combinatorially large) problem with its own rich history, too large to cover extensively here; however, our past work on diversity measurement [33–35], computation [36], and optimization [37] provides further pointers for interested readers, and in particular the thesis of Ahmed provides a good starting point for the broader literature and background in this area [38].

For the purposes of understanding how this paper relates to existing approaches, it suffices to know the following regarding common approaches to quantifying diversity: (1) most diversity measurement approaches focus on some variant of a hyper-volume objective spanned by the set of selected points; (2) since this measure depends on a set rather than individual points, it becomes combinatorially expensive, necessitating fast polynomial-time approximation, one common tool for which is a determinantal point process (DPP) [39]; however, (3) while sampling the most diverse set via DPPs is easy, sampling percentile sets from the DPP distribution to get the top 5%, median, or lowest 5% of diverse sets becomes exceedingly slow for a large sample pool.

In contrast, for this paper, we created a faster DPP-type sampling method to extract different percentiles of the distribution without actually needing to observe the entire DPP distribution and whose sampling error we can bound using concentration inequalities. Section 2 provides further mathematical background, including information on DPP hyper-parameters and how to select them intelligently, and the Supplementary Material on the ASME Digital Collection provides further algorithmic details. With an understanding of diversity distribution measures in hand, we can now address diversity’s specific effects on optimization more generally.

How does diversity in initial inputs affect optimizers? While there are a number of papers that propose either different initialization strategies or benchmarking of existing strategies for optimization, there is limited prior work addressing the direct effect of initial sample diversity.

For general reviews and benchmarking on how to initialize optimizers and the effects of different strategies, papers such as Refs. [20,22] compare initialization strategies for particular optimizers and quantify performance differences. An overall observation across these contributions is the inability of any single initialization method to improve performance across functions of varying complexity. These studies also do not directly measure or address the role of sample diversity, only noting such behavior where it correlates indirectly with the sampling strategy.

A second body of work tries to customize initialization strategies on a per-problem basis, often achieving faster convergence on domain-specific problems [18,19,40–42]. While useful in their designed domain, these studies do not directly address the role of diversity either. In contrast, this paper addresses diversity directly using properties of BO that are sufficiently general to apply across multiple domains and applications.

Lastly, how to initialize optimizers has garnered new interest from the machine learning community, for example, in the initial settings of weights and biases in a neural network and the downstream effects on network performance [43,44]. There is also general interest in how to collect diverse samples during learning, either in an active learning [45] or reinforcement learning context [46,47]; however, those lines of work address only diversity throughout data collection, rather than the impact of initial samples considered in this paper.

What does this paper contribute beyond past work? This paper’s specific contributions are:

  1. To compute diversity: we describe a fast DPP-based diversity scoring method for selecting diverse initial examples of a fixed size k. Any set of size k can then be scored to approximate the percentile of diversity to which it belongs. This method requires selecting a hyper-parameter for the DPP measure; we describe a principled method for selecting this parameter in Sec. 2.1 and provide numerical evidence of the improved sampling performance in the Supplementary Material. Compared to prior work, this makes percentile sampling of DPP distributions computationally tractable.

  2. To study effects on BO: we empirically evaluate how diverse initial samples affect the convergence rate of a Bayesian optimizer. Section 4 finds that low diversity samples provide a model building advantage to BO while diverse samples provide a space exploration advantage that helps BO converge faster. Section 5 shows that removing the model building advantage makes having diverse initial samples uniformly better than non-diverse samples.2

We will next describe our overall experimental approach and common procedures used across all three of our main experiments. We will introduce individual experiment-specific methods only when relevant in each separate experiment section. To fully reproduce the below results, the link in the footnote provides the code associated with this paper.3

2 Overall Experimental Approach

This section will first describe how we compute diverse initial samples, including how we set a key hyper-parameter that controls the DPP kernel needed to measure sample set diversity. It then briefly describes the controllable 2D test problem that we use in our experiments. It ends with a description of how we set up the BO search process and the hyper-parameters that we study more deeply in each individual experiment.

Fig. 1: Correlation matrix showing the relative correlation between two gammas by comparing the way our DPP approach ranks 10,000 sampled sets of cardinality k = 10. The gamma values on both axes are logarithmic with base 10.

2.1 Measuring and Sampling From Diverse Sets Using Determinantal Point Processes.

As mentioned above, we measure the diversity of a set of points using a DPP, which gets its name from the fact that it computes the determinant of a matrix referred to as an L-ensemble (as seen in Eq. (1)); this determinant correlates with the volume spanned by a set of samples Y drawn from the collection of all possible subsets 𝒴, given a diversity/similarity (feature) metric.

P(𝐘 = Y) ∝ det(L_Y)  (1)

Here, L is the ensemble defined by any positive semi-definite matrix [39], and K is the kernel matrix. For sampling diverse examples, this positive semi-definite matrix is typically chosen as a kernel matrix (K) that defines the similarity measure between pairs of data points. For this paper, we use a standard and commonly used similarity measure defined using a radial basis function (RBF) kernel matrix [48]. Specifically, each entry in L_Y for two data points with indices i and j is

L_ij = exp(−γ ‖x_i − x_j‖²)  (2)
The hyper-parameter γ in the DPP kernel can be set in the interval (0, ∞) and will turn out to be quite important in how well we can measure diversity. The next section explores this choice in more depth, but to provide some initial intuition: set γ too high and any selection of points looks equally diverse compared to any other set, essentially destroying the discriminative power of the DPP, while setting γ too low causes the determinant of L to collapse to zero for any set of cardinality greater than the feature-length of x.
With L in hand, we can now turn Eq. (1) into an equality by using the fact that ∑_{Y⊆𝒴} det(L_Y) = det(L + I), where I is an identity matrix of the same shape as the ensemble matrix L. Then, using Theorem 2.2 from Ref. [39], we can write P(𝐘 = Y) as follows:

P(𝐘 = Y) = det(L_Y) / det(L + I)  (3)

This is the probability that a given set of points Y is highly diverse compared to the other possible sets in 𝒴—that is, the higher P(Y), the more diverse the set. The popularity of DPP-type measures is due to their ability to efficiently sample diverse sets of fixed size k. Sampling a set of k points from a DPP is done using a conditional DPP called a k-DPP [49]. k-DPPs are able to compute marginal and conditional probabilities with polynomial complexity, in turn allowing sampling from the DPP in polynomial time. k-DPPs are also well researched, and there exist several methods to speed up the sampling process [50,51]. Our approach allows sampling in constant complexity; however, there is a trade-off in the complexity of generating the DPP distribution: generating traditional DPP distributions has a complexity independent of k, while our approach has linear dependence on k. Existing k-DPP approaches, however, lack the ability to efficiently sample from different percentiles of diversity, which makes it computationally expensive to regenerate the distribution in order to sample from alternative percentiles.

To tackle this problem, our approach is designed to sample efficiently from different percentiles of diversity. This is made possible by creating an absolute diversity score, generated by taking the log determinant of the kernel matrix defined over the set Y. Randomly sampling the k-DPP space allows us to bound the error of this absolute score through the use of concentration inequalities. The details of how to sample from this distribution and calculate the score are given in the Supplementary Material, so as not to disrupt the paper’s main narrative. Additionally, the Supplementary Material provides empirical results to support our earlier claims regarding efficient sampling with our approach versus the traditional k-DPP approach, as well as the trade-off in complexity when generating the DPP distribution. Figure 2 shows example sets of five points and their corresponding DPP scores, where the diversity score is monotonic and a positive score corresponds to a more diverse subset.
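To make the scoring concrete, the following minimal Python sketch computes the log-determinant RBF score for random size-k subsets and then rejection-samples a set near a target diversity percentile. It is a brute-force illustration of the idea rather than the bounded-error sampler described above (whose details are in the Supplementary Material); the candidate pool, domain, and γ value are placeholders of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(42)


def diversity_score(points, gamma):
    # Log-determinant of the RBF L-ensemble over the set: our reading of the
    # absolute score described above (the bounded sampler itself is in the
    # paper's Supplementary Material).
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.linalg.slogdet(np.exp(-gamma * sq))[1]


# Empirical score distribution from M random k-subsets of a candidate pool.
pool = rng.uniform(0.0, 100.0, size=(1000, 2))   # hypothetical design-space samples
M, k, gamma = 5000, 10, 1e-3                     # gamma is a placeholder; pick it per Sec. 2.1.1
scores = np.array([
    diversity_score(pool[rng.choice(len(pool), size=k, replace=False)], gamma)
    for _ in range(M)
])


def sample_near_percentile(target_pct, tol=1.0, max_tries=100_000):
    # Naive rejection sampling of a k-subset whose score lands near the target
    # percentile of the empirical distribution (e.g., 5th or 95th).
    lo, hi = np.percentile(scores, [target_pct - tol, target_pct + tol])
    for _ in range(max_tries):
        cand = pool[rng.choice(len(pool), size=k, replace=False)]
        if lo <= diversity_score(cand, gamma) <= hi:
            return cand
    raise RuntimeError("no set found near the requested percentile")


low_div_set = sample_near_percentile(5)    # ~5th percentile of diversity
high_div_set = sample_near_percentile(95)  # ~95th percentile of diversity
```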

Fig. 2: Scatter plots showing randomly chosen sets of k = 5 high-diversity and low-diversity samples, with the diversity score shown above each chosen set.

2.1.1 Selecting the Hyper-parameter for the Determinantal Point Process Kernel.

As mentioned above, the choice of γ impacts the accuracy of the DPP score. When we initially fixed γ to |Y_i|/10 (where Y_i is the set of data points over which the RBF kernel computes the DPP score), as suggested by Ref. [52], the DPP produced largely random scores. To select an appropriate γ, we designed a kernel-independent diagnostic method for assessing the DPP kernel, consisting of four steps.

First, we randomly generate M sets of size k (think of these as random k-sized samples from the design space). Second, we compute their DPP scores for different possible γ values and then sort those M sets by that score. Third, we compute the rank correlation between the orderings produced by different pairs of γ—intuitively, if the rank correlation is high (toward 1), then either choice of γ would produce the same rank order of which sets were considered diverse, meaning the (relative) DPP scores are insensitive to γ. In contrast, if the rank correlation is 0, then the two γ values produce essentially random orderings. This rank correlation between two different γ settings is the color/value shown in each cell of the matrix in Fig. 1. Large ranges of γ with high rank correlation mean that the rankings of DPP scores are stable or robust to small perturbations in γ. Lastly, we identify this “robust γ” region by choosing the largest range of γ values that have a relative correlation index of 0.95 or higher. We compute the mean of this range and use that as our selected γ in our later experiments. We should note that the functional range of γ depends on the sample size k, and so this “robust γ” needs to be recomputed for different initialization sizes.

The detailed settings for the results shown in Fig. 1 are as follows: M = 10,000; k = 10; and γ ∈ [10⁻⁷, 10⁻²]. The correlation matrix shows a range of γ with strongly correlated relative orderings of the test sets. All γ within this range provide a consistent ranking.
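The sketch below illustrates the four-step diagnostic using Spearman rank correlation. The assumed design-space domain, the reduced M, and the block-search heuristic for picking the “robust γ” region are simplifications of our own, not the authors' exact implementation.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)


def dpp_log_score(points, gamma):
    # Same log-determinant RBF score as in Sec. 2.1.
    sq = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.linalg.slogdet(np.exp(-gamma * sq))[1]


# Step 1: M random sets of size k (M reduced here for speed; the paper uses 10,000).
M, k = 2000, 10
sets = rng.uniform(0.0, 100.0, size=(M, k, 2))   # assumed design-space domain

# Steps 2-3: score every set under each candidate gamma, then compute the rank
# correlation between the orderings produced by every pair of gammas (Fig. 1 cells).
gammas = np.logspace(-7, -2, 11)
scores = np.array([[dpp_log_score(s, g) for s in sets] for g in gammas])
corr = np.array([[spearmanr(scores[i], scores[j]).correlation
                  for j in range(len(gammas))] for i in range(len(gammas))])

# Step 4: take the largest contiguous block of gammas whose pairwise rank
# correlations all stay at or above 0.95, and use the mean of that block.
best = (0, 0)
for i in range(len(gammas)):
    for j in range(i, len(gammas)):
        if corr[i:j + 1, i:j + 1].min() >= 0.95 and (j - i) > (best[1] - best[0]):
            best = (i, j)
gamma_star = float(gammas[best[0]:best[1] + 1].mean())
```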

2.2 A Test Function With Tunable Complexity.

A problem that is common across the study of initialization methods is their inconsistency across problems of varying difficulty, motivating the need to test BO’s search behavior on a problem class with variable complexity. Synthetic objective functions are often used to test the efficiency of different optimizers, and there are several online libraries of such functions to choose from [53], though these functions are largely static, in the sense that there is only a single test function definition. There has been research into developing objective function generators; for example, in Ref. [54], the author uses a mixture of four features to generate synthetic objective functions. These have been well categorized, and the relative performance of different optimizers is documented on each landscape. Similarly, Ref. [55] looks at using a mixture of different sinusoidal functions to create a noisy 1D function. Both generators are capable of producing complicated landscapes, but because the complexity arises from mixing different randomly generated sinusoids, they are unable to control or quantify the complexity of the generated landscapes.

To address this controllable complexity problem directly, we created a simple 2D test function generator with tunable complexity parameters that allow us to instantiate multiple random surfaces of similar optimization difficulty. We modified this function from the one used in Ref. [56], where it was referred to as “Wildcat Wells,” though the landscape is functionally just a normal distribution with additive noise of different frequency spectra. We used four factors, plus a random seed, to control the synthetic objective functions: (1) the number of peaks, (2) noise amplitude, (3) smoothness, and (4) the distance between peaks. The number of peaks controls how many single-peak multivariate normal layers are combined. The noise amplitude, in the range [0,1], controls the relative height of the noise compared to the height of the peaks; setting it to 1 would make the noise as tall as the peaks and effectively give the function infinitely many peaks. Smoothness, in the range [0,1], controls the weighted contribution of the smooth Gaussian component relative to the rugged noise in the wildcat wells landscape; setting it to 1 removes the noise because the normal distribution then completely dominates the function. The last parameter, the distance between peaks, can be tuned in the range [0,1] and prevents overlap of peaks when the function is generated with more than one peak.

Some of these parameters overlap in their effects. For example, the number of peaks and the ruggedness amplitude (which controls the height of the noise) interact: increasing the noise automatically increases the number of peaks in the function, so we only vary the ruggedness amplitude. Similarly, ruggedness frequency (the distance between peaks) plays the same role as smoothness (the radius of influence of each individual peak on its neighbors). Thus, for the numerical experiments presented in Secs. 3–5, only the ruggedness amplitude and smoothness are varied, between 0.2 and 0.8 in increments of 0.2. To provide some visual examples of the effect of these parameters on the generated functions, Fig. 3 visualizes example random surfaces generated with different smoothness and ruggedness amplitude parameters.
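For readers who want a concrete starting point, the sketch below generates a single-peak landscape in the spirit of the generator described above. The domain, peak width, and noise filtering are assumptions on our part; the authors' actual wildcat wells code (adapted from Ref. [56]) may differ in these details.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def wildcat_wells_2d(smoothness, ruggedness_amp, seed, n=100):
    # Single-peak landscape blended with band-limited noise, in the spirit of
    # the generator described above. The domain, peak width, and noise filter
    # are placeholders; the authors' code (adapted from Ref. [56]) may differ.
    rng = np.random.default_rng(seed)
    xs = np.linspace(0.0, 100.0, n)
    X, Y = np.meshgrid(xs, xs)
    cx, cy = rng.uniform(20.0, 80.0, size=2)          # random peak location
    peak = np.exp(-(((X - cx) ** 2 + (Y - cy) ** 2) / (2.0 * 15.0 ** 2)))
    noise = gaussian_filter(rng.standard_normal((n, n)), sigma=3)
    noise = (noise - noise.min()) / (noise.max() - noise.min())
    # smoothness weights the smooth Gaussian component against the rugged noise;
    # ruggedness_amp scales the noise height relative to the peak height.
    return smoothness * peak + (1.0 - smoothness) * ruggedness_amp * noise


# e.g., a smooth, mildly rugged instance
surface = wildcat_wells_2d(smoothness=0.8, ruggedness_amp=0.2, seed=0)
```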

Fig. 3: A grid plot showing how the landscape of wildcat wells changes with smoothness and ruggedness amplitude.

2.3 Bayesian Optimization.

BO has emerged as a popular sample-efficient approach for optimizing expensive black-box functions. BO models the black-box function using a surrogate model, typically a GP. The next design to evaluate is then selected according to an acquisition function, which uses the GP posterior to make the next recommendation for function evaluation by balancing exploration and exploitation: it explores regions with high uncertainty in the objective function and exploits regions where the mean of the objective function is near-optimal. At each iteration, the GP is updated with the newly selected sample, and this process continues iteratively according to the available budget.

In the context of Bayesian optimization, each function evaluation is extremely expensive; thus, there is a need to select an informative set of initial samples for the optimization process. Toward this end, this paper investigates how the level of initial diverse coverage of the input space affects the convergence of Bayesian optimization policies.

For the purpose of the numerical experiments, the optimizer used is from the BoTorch library [57]. The optimizer uses a single-task GP model with the expected improvement acquisition function; the kernel used is a Matérn kernel.
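A minimal BoTorch loop consistent with this setup might look like the sketch below: a single-task GP with a scaled Matérn kernel, refit at every iteration, driving an expected improvement acquisition function. The toy objective, the unit-square bounds, and ν = 2.5 are stand-ins we chose for illustration, not the paper's exact configuration.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import ExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.kernels import MaternKernel, ScaleKernel
from gpytorch.mlls import ExactMarginalLogLikelihood


def toy_objective(X):
    # Toy stand-in for a wildcat wells instance: a single smooth peak at (0.6, 0.6).
    return -((X - 0.6) ** 2).sum(dim=-1)


def bo_step(train_X, train_Y, bounds):
    # One BO iteration: fit a single-task GP with a scaled Matern kernel by
    # maximizing the marginal log likelihood (the kernel hyper-parameters are
    # refit here each iteration), then maximize expected improvement.
    covar = ScaleKernel(MaternKernel(nu=2.5, ard_num_dims=train_X.shape[-1]))
    model = SingleTaskGP(train_X, train_Y, covar_module=covar)
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_mll(mll)
    ei = ExpectedImprovement(model, best_f=train_Y.max())
    candidate, _ = optimize_acqf(ei, bounds=bounds, q=1,
                                 num_restarts=10, raw_samples=128)
    return candidate, model


# Usage with k = 10 initial samples on the unit square (double precision).
bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.double)
train_X = torch.rand(10, 2, dtype=torch.double)   # replace with a 5th- or 95th-percentile set
train_Y = toy_objective(train_X).unsqueeze(-1)
next_x, model = bo_step(train_X, train_Y, bounds)
```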

A GP is specified by its mean and covariance functions as
f(x) ∼ 𝒢𝒫(μ(x), k(x, x′))  (4)
where μ(.) and k(.,.) are the mean function and a real-valued kernel function encoding the prior belief on the correlation among the samples in the design space. In Gaussian process regression, the kernel function dictates the structure of the surrogate model we can fit. An important kernel for Bayesian optimization is the Matérn kernel, which incorporates a smoothness parameter ν to permit greater flexibility in modeling functions:
k(x, x′) = (2^(1−ν) / Γ(ν)) (√(2ν) ‖x − x′‖ / θ)^ν H_ν(√(2ν) ‖x − x′‖ / θ)  (5)
where Γ(·) and H_ν(·) are the Gamma function and the Bessel function of order ν, and θ is the lengthscale hyper-parameter, which captures the correlation between points within each dimension and specifies the distance over which points in the design space influence one another. Here, we use a constant mean for the mean function. The model building advantage that we refer to in this paper corresponds to learning these hyper-parameters. The hyper-parameters of the Gaussian process, namely the parameters of the kernel function and the mean function, are:

Lengthscale of the Matérn Kernel: In Eq. (5), θ is the lengthscale parameter of the kernel. This parameter controls the ruggedness expected by the Bayesian optimizer in the black box function being studied.

The effects of this parameter are similar to those of ν, but ν is not learned during the optimization process while the lengthscale is. Thus, ν is not studied as a parameter that influences modeling behavior, but rather as an additional parameter for sensitivity analysis.

Output Scale of Scale Kernel: The output scale controls how the Matérn kernel is scaled for each batch. Since our Bayesian optimizer uses a single-task GP, we do not use batch optimization, so this parameter is a single value in our setting; the way it is implemented in BoTorch can be seen in Eq. (6).
k_scaled(x, x′) = θ_output · k(x, x′)  (6)

Noise for Likelihood Calculations: The noise parameter models measurement error or noise in the data, so as the Gaussian process receives more data, the noise term decreases. Ideally, this term should converge to 0 by the time the Bayesian optimizer has found an optimal value, since our test functions did not have any added noise.

Constant for Mean Module: This constant is used as the mean for the normal distribution that forms the prior of the Gaussian process as shown in Eq. (4).

Further studies and results regarding the effects of the hyper-parameters are available in the Supplementary Material.

We now describe the first experiment where we explore the effects of diversity of initial samples on the convergence of Bayesian optimizers.

3 Experiment 1: Does Diversity Affect Optimization Convergence?

3.1 Methods.

To test the effects of initial sample diversity on optimizer convergence, we first generated sets of initial training samples of size k = 10, drawn from either the low-diversity (5th percentile) or high-diversity (95th percentile) portion of the diversity distribution using our procedure in Sec. 2.1. Next, we created 100 different instances of the wildcat wells function with different randomly generated seeds for each cell in a 4 × 4 factor grid of four values each of the smoothness and ruggedness amplitude parameters (ranging from 0.2 to 0.8, in steps of 0.2). For simplicity, we refer to these combinations as families of the wildcat wells function. This resulted in 1600 function instances.

Our experiment consisted of 200 runs of the Bayesian optimizer within each of the smoothness–ruggedness function families, where each run consisted of 100 iterations, and half of the runs were initialized with a low-diversity training sample, and half were initialized with a high-diversity training sample.

We then compared the cumulative optimality gap (COG) across iterations between runs with low-diversity initializations and runs with high-diversity initializations within each smoothness–ruggedness combination family. We did this by computing bootstrapped means and confidence intervals within the low-diversity and high-diversity sets of runs within each family. Given the full convergence data, the COG is simply the area under the optimality gap curve, computed for both the 5th and 95th percentile diversity curves. Intuitively, a larger COG corresponds to worse overall performance by the optimizer. Using these COG values, we can numerically calculate the improvement of the optimizer initialized with 95th-percentile samples relative to 5th-percentile samples. This net improvement in COG between the 5th and 95th percentiles is also presented as text in each subplot of Fig. 4.
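The sketch below shows one way to compute the COG and the bootstrapped confidence intervals described above; the trapezoidal integration and the bootstrap settings are our reading of the procedure rather than the authors' exact code.

```python
import numpy as np


def cumulative_optimality_gap(best_so_far, f_opt):
    # Area under the optimality-gap curve across iterations (trapezoidal rule);
    # a larger COG corresponds to worse overall optimizer performance.
    gap = f_opt - np.asarray(best_so_far, dtype=float)
    return np.trapz(gap)


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    # Bootstrapped mean and (1 - alpha) confidence interval across runs.
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    return values.mean(), np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])


# Net COG (the text printed in each cell of Fig. 4): positive when the diverse
# (95th-percentile) initialization outperforms the non-diverse (5th-percentile) one.
# ncog = np.mean(cog_runs_5th) - np.mean(cog_runs_95th)
```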

Fig. 4: Experiment 1: optimality gap grid plot showing the difference in current optimality gap between optimizers initialized with 5th versus 95th percentile diverse samples (y-axis) as a function of optimization iteration (x-axis). The factor grid plots the effects of diversity as the noise amplitude and smoothness are varied in the range [0.2, 0.8]. Each subplot also has text indicating the net cumulative optimality gap (NCOG); a positive value corresponds to better performance by high-diversity samples compared to low-diversity samples. The plot shows that BO benefits from diversity in some cases but not others, and there are no obvious trends in how the NCOG values change across the grid. The results are further discussed in Sec. 3.

3.2 Results.

As Fig. 4 shows, the cumulative optimality gap does not seem to have a consistent effect across the grid. Diversity produces a positive convergence effect for some cells, but is negative in others. Moreover, there are wide empirical confidence bounds on the mean effect overall, indicating that should an effect exist at all, it likely does not have a large effect size. Changing the function ruggedness or smoothness did not significantly modulate the overall effect. As expected, given sufficient samples (far right on the x-axis), both diverse and non-diverse initializations have the same optimality gap, since at that point the initial samples have been crowded out by the new samples gathered by BO during its search.

3.3 Discussion.

Overall, the results from Fig. 4 seem to indicate that diversity helps in some cases and hurts in others, and regardless has a limited impact one way or the other. This seems counter to the widespread practice of diversely sampling the initial input space using techniques like LHS. Figure 4 shows that it has little effect.

Why would this be? Given decades of research into initialization schemes for BO and optimal experiment design, we expected diversity to have at least some (perhaps small but at least consistent) positive effect on convergence rates, and not the mixed bag that we see in Fig. 4. How were the non-diverse samples gaining such an upper hand when the diverse samples had a head start on exploring the space—what we call a space exploration advantage?

The next section details an experiment we conducted to test a hypothesis regarding a potential implicit advantage that non-diverse samples might endow to BO that would impact the convergence of BO’s hyper-parameter posteriors. As we will see next, this accelerated hyper-parameter posterior convergence caused by non-diverse initialization is the Achilles’ heel of diversely initialized BO that allows the non-diverse samples to keep pace and even exceed diverse BO.

4 Experiment 2: Do Lower Diversity Samples Improve Hyper-parameter Posterior Convergence?

After reviewing the results from Fig. 4, we tried to determine why the space exploration advantage of diversity was not helping BO as we thought it should. We considered as a thought experiment the one instance where a poorly initialized BO model with the same acquisition function might outperform another: if one model’s kernel hyper-parameter settings were so grossly incorrect that the model would waste many samples exploring areas that it did not need to if it had the correct hyper-parameters.

Could this misstep be happening in the diversely sampled BO but not in the non-diverse case? If so, this might explain how non-diverse BO was able to keep pace: while diverse samples might give BO a head start, it might be unintentionally blindfolding BO to the true function posteriors, making it run ragged in proverbial directions that it need not. If this hypothesis was true, then we would see this reflected in the comparative accuracy of the kernel hyper-parameters learned by the diverse versus non-diverse BO samples. This experiment set out to test that hypothesis.

4.1 Methods.

The key difference from experiment 1 is that rather than comparing the overall optimization convergence, we instead focus on how the initial samples’ diversity affects BO’s hyper-parameter posterior convergence, and compare how far each is from the “ground truth” optimal hyper-parameters.

As with experiment 1, we used the same smoothness and ruggedness amplitude families of the wildcat wells function. To generate the data for each instance in one of these families, we sampled 20 sets of initial samples: half from the low-diversity (5th) percentile and half from the high-diversity (95th) percentile.

For each initial sample set, we then maximized the GP kernel's marginal log likelihood (via BoTorch's GP fit method) and recorded the resulting hyper-parameters for all 20 initial sets. The mean over the ten low-diversity sets was then used as one point in the box plot's low-diversity distribution, as seen in Fig. 5. We repeated this process for the high-diversity initial sets. Each point in the box plot can then be understood as the mean hyper-parameter learned by BoTorch given just the initial sample of size k = 10 points. To get the full box plot distribution for each family, the above process is repeated over 100 seeds, and Fig. 5 provides the resulting box plots for both diverse and non-diverse initial samples for all 16 families of the wildcat wells function described in experiment 1.
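In BoTorch terms, this amounts to fitting the GP once on the initial sample and reading off the learned lengthscale, roughly as sketched below; the explicit Matérn kernel, ν = 2.5, and the averaging over ARD dimensions are assumptions on our part rather than the authors' exact settings.

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from gpytorch.kernels import MaternKernel, ScaleKernel
from gpytorch.mlls import ExactMarginalLogLikelihood


def learned_lengthscale(init_X, init_Y):
    # Fit the GP to the k = 10 initial samples only (no BO iterations) by
    # maximizing the marginal log likelihood, then read off the learned Matern
    # lengthscale. With ARD the lengthscale is a vector per input dimension,
    # so we summarize it by its mean here (a simplification on our part).
    covar = ScaleKernel(MaternKernel(nu=2.5, ard_num_dims=init_X.shape[-1]))
    model = SingleTaskGP(init_X, init_Y, covar_module=covar)
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_mll(mll)
    return model.covar_module.base_kernel.lengthscale.detach().mean().item()
```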

Fig. 5: Experiment 2: box plot showing the distribution of the “lengthscale” hyper-parameter learned by BO when initialized with diverse (right side) and non-diverse samples (left side) for 16 different families of wildcat wells functions with the same parameters but 100 different seeds. The optimal hyper-parameter for each of the 100 wildcat wells instances from each family is also plotted as a horizontal line—in many but not all cases these overlap. Each cell also shows the 95th percentile confidence bound on MAE for both diverse and non-diverse samples. The MAE confidence bounds for non-diverse samples are smaller than those for diverse samples for all families of the wildcat wells function, indicating the presence of a model building advantage for non-diverse initial samples. The results of this figure are further discussed in Sec. 4.

To provide a ground truth for the true hyper-parameter settings, we ran a binary search to find the sample size (k_optimal) at which BO's kernel hyper-parameters converged for all families. The hyper-parameter found by providing k_optimal points for each instance in the family was then plotted as a horizontal line in each box plot. An interesting observation is that some families have non-overlapping horizontal lines; this is because some families have more than one mode of “optimal hyper-parameters.” The mode chosen as the “optimal hyper-parameter” is the more frequently observed one. The process for finding the “optimal hyper-parameter,” and how the mode is chosen, is described in the Supplementary Material.
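A simple version of this search, reusing the lengthscale-fitting helper sketched above, might look as follows. The relative-tolerance convergence criterion and the k_max cap are our assumptions; the authors' exact procedure is described in their Supplementary Material.

```python
import numpy as np
import torch


def k_for_hyperparameter_convergence(objective, fit_lengthscale, k_max=512, tol=0.05, seed=0):
    # Binary search for the smallest sample size whose learned lengthscale lies
    # within a relative tolerance of the lengthscale learned at k_max.
    # `fit_lengthscale(X, Y)` can be the helper sketched after Sec. 4.1.
    rng = np.random.default_rng(seed)

    def lengthscale_at(k):
        X = torch.tensor(rng.uniform(0.0, 1.0, size=(k, 2)))
        Y = objective(X).unsqueeze(-1)
        return fit_lengthscale(X, Y)

    target = lengthscale_at(k_max)
    lo, hi = 10, k_max
    while lo < hi:
        mid = (lo + hi) // 2
        if abs(lengthscale_at(mid) - target) / abs(target) <= tol:
            hi = mid   # converged at mid; try smaller sample sizes
        else:
            lo = mid + 1
    return lo
```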

If an initial sample provides a good initial estimate of the kernel hyper-parameter posterior, then the box plot should align well or close to the horizontal lines of the true posterior. Figure 5 only shows the results for the Matérn kernel’s lengthscale parameter, given its out-sized importance in controlling the GP function posteriors compared to the other hyper-parameters (e.g., output scale, noise, etc.), which we do not plot here for space reasons. We provide further details and plots for all hyper-parameters in the Supplementary Material for interested readers.

To quantify the average distance between the learned and true hyper-parameters, we also report in Fig. 5 the mean absolute error (MAE) for both highly diverse (95th) and less diverse (5th) initializations. The MAE is the mean absolute distance between each predicted hyper-parameter and the optimal hyper-parameter for the particular surface of each wildcat wells function. The range shown in each cell of Fig. 5 corresponds to a 95th percentile confidence bound on the MAE across all 100 runs.

4.2 Results and Discussion.

The results in Fig. 5 show that the MAE values for low diversity samples are always lower compared to the MAE for high-diversity samples. This general behavior is also qualitatively visible in the box plot. This means that after only the initial samples, the non-diverse samples provided much more accurate estimates of the kernel hyper-parameters compared to diverse samples. Moreover, BO systematically underestimates the correct lengthscale with diverse samples—this corresponds to the diverse BO modeling function posteriors that have higher frequency components than the true function actually does (as shown via the pedagogical examples in the Supplementary Material).

This provides evidence for the model building advantage of non-diverse samples that we defined in Sec. 2.3. It also confirms our previous conjecture from the thought experiment that diverse samples might be impacting BO by causing slower or less accurate convergence to the right BO hyper-parameters. The space exploration advantage of the diverse samples helps it compensate somewhat for its poor hyper-parameters, but BO trained with non-diverse samples can leverage the better hyper-parameters to make more judicious choices about what points to select next.

We did not see major differences in the other three kernel hyper-parameters such as output scale, noise, or the mean function (see Supplementary Material); however, this is not surprising, since BO is not highly sensitive to any of these parameters and the lengthscale parameter dominates large changes in BO behavior.

Comparing the different smoothness and ruggedness settings, when the function is more complex (the top right of the grid at low smoothness and high ruggedness amplitude values), the function’s lengthscale is lower and closer to the value learned by the diverse samples. Looking at the low diversity MAE values (“MAE 5”), we can see they are much closer to those of the high-diversity samples (“MAE 95”), in contrast to when the function is less complex (bottom left side of the grid). Under such conditions, low-diversity samples lose some of the relative model building advantage they have over high-diversity samples. This conjecture aligns with experiment 1 (Fig. 4) where the COG values on the top right part are positive while those on the bottom left are negative.

Figure 5 demonstrated our hypothesized model building advantage that non-diverse initial samples confer to BO. But how do we know that this is the actual causal factor that accelerates BO convergence, and not just correlated with some other effect? If correct, our conjecture posits a natural testable hypothesis: if we fix the values of the hyper-parameter posteriors to identical values between the non-diverse and diverse samples and do not allow the BO to update or optimize them, then this should effectively eliminate the model building advantage, and diverse samples should always outperform non-diverse samples. Metaphorically, if we were to take away the arrow that Paris used against Achilles, would the Battle of Troy have ended differently? Our next experiment finds this out.

5 Experiment 3: Does Diversity Affect Optimization Convergence If Hyper-parameters Are Fixed to Optimal Values?

5.1 Methods.

This experiment is identical to experiment 1, with two key differences: (1) we now fix the kernel hyper-parameters to the “optimal hyper-parameter” values found in experiment 2 for all the instances in each family of the wildcat wells function, and (2) we do not allow either BO model to further optimize the kernel hyper-parameters. This should remove the hypothesized model building advantage of non-diverse samples without altering any other aspect of experiment 1 and the results in Fig. 4.
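Concretely, in a GPyTorch/BoTorch setting this amounts to assigning the experiment 2 values to the kernel and likelihood and then never refitting them, roughly as sketched below. The specific hyper-parameters set here (lengthscale, output scale, noise) follow the list in Sec. 2.3, while the freezing mechanism is our own choice of implementation rather than the authors' exact code.

```python
import torch
from botorch.models import SingleTaskGP
from gpytorch.kernels import MaternKernel, ScaleKernel


def build_fixed_gp(train_X, train_Y, lengthscale, outputscale, noise):
    # GP whose kernel hyper-parameters are pinned to the "optimal" values found
    # in experiment 2 and never refit: we simply skip fit_gpytorch_mll and, as a
    # safeguard, freeze all parameters. The particular values are whatever
    # experiment 2 produced for the current wildcat wells family.
    covar = ScaleKernel(MaternKernel(nu=2.5, ard_num_dims=train_X.shape[-1]))
    model = SingleTaskGP(train_X, train_Y, covar_module=covar)
    model.covar_module.base_kernel.lengthscale = lengthscale
    model.covar_module.outputscale = outputscale
    model.likelihood.noise = noise
    for p in model.parameters():
        p.requires_grad_(False)
    return model
```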

5.2 Results and Discussion.

Figure 6 shows that once the kernel hyper-parameters are fixed—removing the model building advantage of non-diverse samples—diverse samples consistently and robustly outperform non-diverse initial samples. This holds for both the initial optimality gap at the beginning of the search and the cumulative optimality gap, and is not qualitatively affected by the function smoothness or ruggedness amplitude. Unlike in experiment 1, where diversity could either help or hurt the optimizer, once we remove the model building advantage, diversity only helps.

Fig. 6: Experiment 3: optimality gap plot showing the effects of diversity when the optimizer is not allowed to fit the hyper-parameters of the Gaussian process and the hyper-parameters are instead fixed to the values found in experiment 2. The results show positive NCOG values for all families of the wildcat wells function, showing that once the model building advantage is taken away, diverse samples outperform non-diverse samples. Further discussion of this plot can be found in Sec. 5.

6 General Discussion and Conclusions

6.1 Summary and Interpretation of Findings.

This paper’s original goal was to investigate how and when diverse initial samples help or hurt Bayesian optimizers. Overall, we found that the initial diversity of the provided samples created two competing effects. First, experiment 2 showed that non-diverse samples improved BO’s ability to quickly converge to optimal hyper-parameters—we called this a model building advantage. Second, experiment 3 showed that, conditioned on the same fixed hyper-parameters, diverse samples improved BO’s convergence to the optima through faster exploration of the space—we called this a space exploration advantage. In experiment 1, diversity had mixed-to-negligible effects since both of these advantages were in play and competed with one another. This interaction provides insight for academic or industrial BO users since common practice recommends initializing BO with space-filling samples (to take advantage of the space exploration advantage) and ignores the model building advantage of non-diverse samples.

Beyond our main empirical result, our improvements to existing diverse sampling approaches (Sec. 2.1) provide new methods for studying how different percentile diversity sets affect phenomena. Researchers may find this contribution of separate technical and scientific interest for related studies that investigate the impact of diversity.

6.2 Implications and Future Work.

Beyond the individual results we observed and summarized in each experiment, there are some overall implications and limitations that may guide future work or interpretation of our results more broadly, which we address below.

Where does this model building advantage induced by non-diverse samples come from? As we conjectured in experiment 2 (Sec. 4) and confirmed in experiment 3 (Sec. 5), the key advantage of using non-diverse initial samples lies in their ability to induce faster and more accurate posterior convergence when inferring the optimal kernel hyper-parameters, such as length scale and others. This allowed the BO to make more judicious and aggressive choices about what points to sample next, so while the diversely initialized models might get a head start on exploring the space, non-diversely initialized models needed to explore less of the space overall, owing to tighter posteriors of possible functions under the Gaussian process.

While we do not have space to include it in the main paper, the Supplementary Material document’s Sec. 5 shows how this model building advantage occurs as we provide BO with a greater number of initial samples. Briefly, there are three “regimes”: (1) sample-deficient, where there are too few samples to induce a modeling advantage regardless of how diversely we sample the initial points; (2) the “modeling advantage” region, where low-diversity samples induce better hyper-parameter convergence than high-diversity samples; and (3) sample-saturated, where there are enough initial samples to induce accurate hyper-parameter posteriors regardless of how diversely we sample initial points. We direct interested readers to Sec. 5 of the Supplementary Material for a deeper discussion on this.

What this behavior implies more broadly is that non-diverse samples, whether given to an algorithm or a person, have a unique and perhaps underrated value in cases where we have high entropy priors over the Gaussian process hyper-parameters or kernel. In such cases, sacrificing a few initial non-diverse points to better infer key length scales in the GP model may well be a worthwhile trade.

We also saw that in cases where the BO hyper-parameters were not further optimized (as in experiment 3 where hyper-parameters were fixed to optimal values), using diverse points only helped BO. Researchers or practitioners using BO would benefit from carefully reviewing what kernel optimization strategy their library or implementation of choice actually does since that will affect whether or not the model building advantage of non-diverse samples is actually in play.

What if hyper-parameters are fixed to non-optimal values? We showed in experiment 3 that fixing BO hyper-parameters to their optimal values ahead of time using an oracle allowed diverse initial samples to unilaterally outperform non-diverse samples. An interesting avenue of future work that we did not explore here for scope reasons would be to see if this holds when hyper-parameters are fixed to non-optimal values. In practical problems, we will rarely know the optimal hyper-parameters ahead of time as we did in experiment 3 (which is what produced diversity’s unilateral advantage), so we do not have evidence to generalize beyond this setting. However, our explanation of the model building advantage would predict that, so long as the hyper-parameters remain fixed (to any value), BO would not have a practical mechanism to benefit much from non-diverse samples, on average.

What are the implications for how we currently initialize BO? One of our result’s most striking implications is how it might influence BO initialization procedures that are often considered standard practice. For example, it is common to initialize a BO procedure with a small number of initial space-filling designs, using techniques like LHS before allowing BO to optimize its acquisition function for future samples. In cases where the BO hyper-parameters will remain fixed, experiment 3 implies that this standard practice is excellent advice and far better than non-diverse samples. However, in cases where you plan to optimize the BO kernel during search, using something like LHS becomes more suspect.

In principle, from experiment 1, we see that diverse samples may help or hurt BO, depending on how much leverage the model building advantage of the non-diverse samples can provide. For example, in the upper right of Fig. 4, the function is effectively random noise, and so there is not a strong model building advantage to be gained. In contrast, in the lower left, the smooth and well-behaved functions allowed non-diverse initialization to gain an upper hand.

Our results propose a perhaps now obvious initialization strategy: if you plan on optimizing the BO hyper-parameters, use some non-diverse samples to strategically provide an early model building advantage, while leveraging the rest of the samples to diversely cover the space.

How might other acquisition functions modulate diversity’s effect? While we have been referring to BO as though it is a single method throughout this paper, individual BO implementations can vary, both in terms of their kernel structure and their choice of acquisition function—or how BO uses information about the underlying fitted Gaussian process to select subsequent points. In this paper’s experiments, we used EI since it is one of the most widespread choices and behaves qualitatively like other common improvement-based measures such as probability of improvement, posterior mean, and upper confidence bound functions. Indeed, we hypothesize that part of the reason non-diverse initial samples are able to gain a model building advantage over diverse samples is a faster collapse in the posterior distribution of possible GP functions, which serves as a strong input to EI methods and related variants.

Yet EI and its cousins are only one class of acquisition function; would our results hold if we were to pick an acquisition function that directly attacked the GP’s posterior variance—for example, entropy-based or active learning-based acquisition functions? This paper did not test this, and it would be a logical and valuable future study. Our experimental results and proposed explanation would predict the following: the model building advantage seen by non-diverse samples should reduce or disappear in cases where the acquisition function explicitly samples new points to shrink the posterior over GP function classes, since in such cases BO itself would try to select samples that reduce overall GP variance, reducing its dependence on what the initial samples provide.

To what extent should we expect these results to generalize to other types of problems? We selected a simple 2D function with controllable complexity in this paper to aid in experimental simplicity, speed, replicability, and ease of visualization; however, this does raise the question of whether or not these results would truly transfer to more complex problems of engineering interest. While future work would have to address more complex problems, we performed two additional experiments studying how the above phenomena change as we (1) increased the wildcat wells function from two to three dimensions and (2) how this behavior changes for other types of common optimization test functions—specifically, we chose the N-dimensional sphere, Rastrigin, and Rosenbrock functions from two to five dimensions. While the existing paper length limits did not allow us to include all of these additional results in the paper’s main body, we direct interested readers to Secs. 6 and 7 of the Supplementary Material document. Briefly, our results align overall with what we described above for the 2D wildcat wells function, and we do not believe that the phenomena we observed are restricted to only our chosen test function or dimension, although clearly future research would need to conduct further tests on other problems to say this with any certainty. Beyond these supplemental results, we can also look at a few critical problem-specific factors and ask what our proposed explanatory model would predict.

For higher dimensional problems, standard GP kernel choices like RBF or Matérn begin to face exponential cost increases due to how hyper-volumes expand. In such cases, having strong constraints (via hyper-parameter priors or posteriors) over possible GP functions becomes increasingly important for fast BO convergence. Our results would posit that any model building advantages from non-diverse sampling would become increasingly important or impactful in cases where it helped BO rapidly collapse the hyper-parameter posteriors.

For discontinuous functions (or GP kernels that assume as much), the model building advantage of non-diverse samples would decrease since large sudden jumps in the GP posterior mean and variance would make it harder for BO to exploit a model building advantage. However, in discontinuous cases where common global smoothness parameters still govern the continuous portions, the model building advantage would still accelerate BO convergence.

How might the results guide human-subject experiments or understanding of human designers? One possible implication of our results for human designers is that the effects of example diversity on design outcomes may vary as a function of the designer’s prior knowledge of the design problem. More specifically, the model building advantage observed in experiment 2 (and subsequent removal in experiment 3) suggests that when designers have prior knowledge of how quickly the function changes in a local area of the design space, they can more reliably benefit from the space exploration advantage of diverse examples. This leads to a potentially counter-intuitive prediction that domain experts may benefit more from diverse examples compared to domain novices since domain experts would tend to have prior knowledge of the nature of the design problem (a model building advantage). Additionally, perhaps under conditions of uncertainty about the nature of the design problem, it would be useful to combine the strengths of diverse and non-diverse examples; this could be accomplished with a cluster-sampling approach, where we sample diverse points of the design space, but include local non-diverse clusters of examples that are nearby, to facilitate learning of the shape of the design function.

While these implications might be counter-intuitive in that common guidance suggests that the most informative method is to only diversely sample initial points, the crux of our paper’s argument is that non-diverse points can, surprisingly, be informative to Bayesian optimization due to their ability to quickly concentrate the posterior distribution of the kernel hyper-parameters, and thus accelerate later optimization. Given this tension, a natural question is “how many non-diverse samples do I really need to take advantage of the modeling advantage without giving up the space exploration advantage?” If I have, say, a budget of ten experiments, should I spend only one low-diversity sample? Or do I need two? Half of my budget? We did not explore these practical questions in this work, due to space constraints, but we think this would be an excellent avenue for continued study.

Footnotes

2. For grammatical simplicity and narrative flow, we use the phrase “non-diverse” throughout the paper to refer to cases where samples are taken from the 5th percentile of diverse sets. These are technically “low-diversity” rather than absolutely “non-diverse,” which would only be the case if all points in the set were identical, but we trust that readers can keep this minor semantic distinction in mind.

4. See Note 3.

Acknowledgment

This research was supported by funding from the National Science Foundation through award #1826083.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The data and information that support the findings of this article are freely available.4

