## Abstract

Bayesian optimization (BO) is an efficient and flexible global optimization framework that is applicable to a very wide range of engineering applications. To leverage the capability of the classical BO, many extensions, including multi-objective, multi-fidelity, parallelization, and latent-variable modeling, have been proposed to address the limitations of the classical BO framework. In this work, we propose a novel multi-objective BO formalism, called srMO-BO-3GP, to solve multi-objective optimization problems in a sequential setting. Three different Gaussian processes (GPs) are stacked together, where each of the GPs is assigned with a different task. The first GP is used to approximate a single-objective computed from the multi-objective definition, the second GP is used to learn the unknown constraints, and the third one is used to learn the uncertain Pareto frontier. At each iteration, a multi-objective augmented Tchebycheff function is adopted to convert multi-objective to single-objective, where the regularization with a regularized ridge term is also introduced to smooth the single-objective function. Finally, we couple the third GP along with the classical BO framework to explore the convergence and diversity of the Pareto frontier by the acquisition function for exploitation and exploration. The proposed framework is demonstrated using several numerical benchmark functions, as well as a thermomechanical finite element model for flip-chip package design optimization.

## 1 Introduction

Global optimization is a common problem that appears in many contexts and particularly important for engineering design. The most used global optimization methods are population based searching algorithms, such as evolutionary algorithms. In contrast, surrogate-based searching methods rely on the guidance of surrogate models to perform sequential sampling. Bayesian optimization (BO) method is such an example. BO is a derivative-free surrogate-based global optimization technique that has started being used in engineering domains. A surrogate of the objective function is constructed and continuously updated during the search. The sequential sampling is based on the maximization of an acquisition function, also constructed in the same searching space. Different acquisition functions such as expected improvement (EI), probability of improvement (PI), upper confidence bound (UCB), and entropy-based approaches have been used. The classical BO methods were developed for single-objective optimization, whereas multi-objective optimization problems are of interest in practice. In real-world applications, objectives are often found to conflict with each other. Design decision-making requires us to enumerate options and make trade-offs between these objectives. Thus, it is important to obtain the set of optimal solutions in the Pareto sense so that the relationships among objectives can be fully explored.

Some efforts [1,2] have been made to develop multi-objective Bayesian optimization (MOBO) methods. Jeong and Obayashi [3] proposed an EI-based criteria and employed genetic algorithm (GA) to search for potential sampling points on the Pareto frontier. Knowles [4] introduced the ParEGO framework with the augmented Tchebycheff function to scalarize the multi-objective functions, which was later extended by Davins-Valldaura et al. [5] by including the probability of Pareto frontier with another GP. Dächert et al. [6] studied the importance of the augmented weighted Tchebycheff regularization parameter between *ℓ*^{∞} and *ℓ*^{1} norms. Zhang et al. [7] proposed the MOEA/D framework with the classical Tchebycheff scalarization function and decomposed a multi-objective (MO) problem into a number of single-objective optimization subproblems. Li et al. [8,9] improved the MO GA methods with the K-MOGA framework, where the posterior mean and posterior variance are considered to assist GA methods. Feliot et al. [10] proposed a multi-objective EI acquisition function considering constraints based on the posterior mean and posterior variance. Yang et al. [11] proposed an expected hypervolume improvement gradient focusing on bi-objective problems and further accelerated the computation of hypervolume using an analytical derivation of gradients. Valladares and Tovar [12] compared MO EI (or MEI) criteria against Tchebycheff in *ℓ*^{1} and *ℓ*^{∞} and concluded that randomized weights indeed help diversifying the Pareto frontier.

Gupta et al. [13] proposed a batch MOBO framework with a weighted linear average objective and demonstrated by optimizing the heat treatment process of an Al-Sc alloy. Shu et al. [14] developed a composite acquisition function for MOBO to efficiently sample with simultaneous improvement of convergence and diversity in constructing the Pareto frontier. Hernández-Lobato et al. [15,16] proposed a predictive entropy search acquisition function for MO problems considering constraints. Abdolshah et al. [17] proposed a MOBO framework where various computational efforts are taken into account when exploring the input domain. Gaudrie et al. [18] proposed a sequential and batch extension for MOBO method by maximizing the expected hypervolume improvement. Palar et al. [19] compared the impact of different covariance functions in MOBO and concluded that Matérn-3/2 is the most robust kernel for design optimization applications. Golovin et al. [20] proposed a hypervolume scalarization algorithm and proved the regret bound scales as $O(T)$. Han and Zheng [21] proposed an EIR2 indicator as a minimax extension for EI acquisition function in the context of MO optimization problems. While hypervolume-based approaches (also known as $S$-metric or Lebesgue measure) aim at both convergence and diversity, it also comes with a computational cost to compute the hypervolume. Let *s* be the number of objectives and *n* be the number of data points. The grid-based hypervolume algorithm scales at $O(m\u22c5nm)$, whereas the Walking Fish Group algorithm has a complexity of $O(s\u22c52n)$ [22]. Beume et al. [23] proved the lower bound of $O(nlogn)$ for any number of objectives *s* > 1 and presented an algorithm that scales particularly so for *s* = 3. We propose an improved acquisition function that can avoid the cost of computing hypervolume for arbitrarily many objective functions.

In this paper, we propose a new framework, called srMO-BO-3GP, to stack three GPs to solve the MO constrained optimization problems. Furthermore, we explore the possibility of introducing regularization in the Tchebycheff scalarization method. The approach is unique in three different perspectives. First, we employ GP as a machine learning technique to probabilistically search for the Pareto frontier, by stacking this classifier over the objective GP. The acquisition function in the classical BO method is employed to balance the exploitation and the exploration of this Pareto GP, which enhances its richness, diversity, and convergence in the Pareto frontier. Second, we combine the multi-objective functions to a single-objective function with a randomized weight vector and regularize this single-objective function by a ridge regularization to smoothen the single-objective function. The philosophy of our approach is intrinsically similar to that of classical BO methods. Compared to other approach in literature, the heuristic proposed in this research is greatly simplified, yet its efficiency is comparable to other methods. Our approach is more general in the sense that it does not restrict to a particular form of the acquisition function and leave the choice of acquisition functions up to the users. The idea is extended based on the work of Davins-Valldaura et al. [5]; however, our work differs from their work in treating the Pareto GP as well as the acquisition function. In our approach, we opt to include both exploitation and exploration of this Pareto GP by considering uncertainty, whereas in their approach, only the probability of Pareto frontier (i.e., exploitation) is considered. Another novelty in our approach is the ridge regularized term in the acquisition to smooth the acquisition function.

Two significant advantages of the proposed srMO-BO-3GP algorithms are highlighted as follows. First, the regularization terms in the acquisition can significantly mitigate the effects of noise on observations. Second, our approach is not limited by the number of objective functions, as in other approaches. For example, if hypervolume-based approach is considered, the number of objectives would have a scalability effect on the computation of the hypervolume. Our proposed approach avoids limitation by converting the identification of Pareto frontier into an uncertain classification problems in machine learning context, with a flavor of uncertainty quantification. Thus, as the Pareto GP classifier gains more accuracy, it converges to the true Pareto frontier through the mean of classical Bayesian optimization approach.

For the remaining of this paper, Sec. 2 describes the proposed srMO-BO-3GP framework. Section 3 provides the numerical results for several numerical analytical functions, as well as an engineering thermomechanical finite element model (FEM) using the proposed approach. Section 5 provides discussions, and Sec. 6 concludes the paper. Compared to our previous preliminary work in Ref. [24], in this paper, we provide a comprehensive benchmark for the proposed srMO-BO-3GP framework and update the weighted parameters balancing *ℓ*^{∞} and *ℓ*^{0} norms in the augmented Tchebycheff scalarization function.

## 2 Methodology

*d*continuous variables in

*d*-dimensional space, whereas $y={yj}j=1s$ as

*s*outputs. Traditionally, BO solves the single-objective optimization problem

*c*(

**x**) ≤ 0. In this paper, we consider the scenario of MO optimization problem,

**g**(

**x**) ≤ 0.

### 2.1 Gaussian Process.

*f*is a function of

**x**, where $x\u2208X$ is a

*d*-dimensional input, and

*y*is the observation. Let the dataset $D=(xi,yi)i=1n$, where

*n*is the number of observations. A GP regression assumes that

**f**=

*f*

_{1:n}is jointly Gaussian, and the observation

*y*is normally distributed given

*f*,

*m*

_{i}:=

*μ*(

**x**

_{i}) and

*K*

_{i,j}:=

*k*(

**x**

_{i},

**x**

_{j}).

**K**is a choice of modeling covariance between inputs. At an unknown sampling location

**x**, the predicted response is described by a posterior Gaussian distribution, where the posterior mean is

*k*(

**x**) is the covariance vector between the query point

**x**and

**x**

_{1:n}. The classical GP formulation assumes stationary covariance matrix, which only depends on the distance $r=\Vert x\u2212x\u2032\Vert $. Several most common kernels for GP include [26]

*θ*at the computational cost of $O(n3)$ due to the cost to compute the inverse of the covariance matrix.

### 2.2 Multi-Objective Function.

We adopt the definition of Pareto-dominant from Rojas-Gonzalez et al. [2] to define the Pareto frontier.

**x**_{1}*is said to dominate***x**_{2}, *denoted as*$x1\u2aafx2$, *if and only if*$\u2200j:1\u2264j\u2264s$, *such that**y*_{j}(**x**_{1}) ≤ *y*_{j}(**x**_{2}), *and*$\u2203j:1\u2264j\u2264s$, *such that**y*_{j}(**x**_{1}) < *y*_{j}(**x**_{2}).

**x**_{1}*is said to strictly dominate***x**_{2}, *denoted as*$x1\u227ax2$, *if and only if*$\u2200j:1\u2264j\u2264s$, *such that**y*_{j}(**x**_{1}) < *y*_{j}(**x**_{2}).

A MO problem can be solved by converting it to a single-objective problem, where the single-objective function is a weighted sum of multiple objectives [2], such as

weighted Tchebycheff scalarization function $y=max1\u2264i\u2264s$$wi(yi(x)\u2212zi*)$,

the weighted sum scalarization function: $y=\u2211i=1swiyi(x)$,

and the augmented (weighted) Tchebycheff scalarization function $y=max1\u2264i\u2264swi(yi(x)\u2212zi*)+\rho \u2211i=1swiyi(x)$,

*i*th objective, the weights 0 ≤

*w*

_{i}≤ 1, $\u2211i=1mwi=1$, and

*ρ*is a positive constant. Typically,

*ρ*is set as 0.05 in [4,5]. However, we follow Ref. [27] and set

*ρ*= 0.65 due to its superior performance; the idea of tuning

*ρ*adaptively remains open for future research.

*λ*is a regularization parameter. The first term can be thought of as an

*ℓ*

^{∞}-norm, whereas the second term can be thought of as a

*ℓ*

^{1}-norm [6]. Various treatments for the Tchebycheff decomposition exist in literature [28]. The regularization term is used to smoothen the objective function and mitigate the singularity effect when Tchebycheff decomposition is introduced. More specifically, the term max

_{1≤i≤s}

*w*

_{j}

*y*

_{j}(

**x**) does not have a continuous first derivative, while the assumption for fitting GP is that the underlying function needs to be smooth. In that sense, it is necessary to introduce a small ridge regularization term in

*ℓ*

^{2}-norm.

### 2.3 Constraints.

We consider the known constraints and hidden constraints, where the known constraints are known before the functional evaluation, whereas the hidden constraints must be learn indirectly through the functional evaluation. For real-world engineering design applications, some constraints are imposed to the design performance which does not have a simple analytical form to evaluate. Simulations such as finite element model and computational fluid dynamics need to be run to check if the constraints are satisfied. Those are called unknown constraints, which typically correspond to unforeseeable and reproducible failure of the applications with low performance for a certain region of inputs. It could also be due to numerical issues in simulations, e.g., mesh failure, numerical instability, and numerical divergence. Such errors must be accounted for, but they are only known after the functional evaluation of *f*(·) (in Eq. (2)) is invoked with simulations. We refer to the constraints that are related to the functional evaluation *f*(·) are unknown constraints. On the contrary, the known constraints *c*(**x**) are those that can be evaluated or invoked separately based on the available analytical forms without calling the functional evaluation *f*(·) by simulation.

#### 2.3.1 Known Constraints.

Known constraints can be represented as a set of inequalities **g**(**x**) ≤ **0**, where **g** is a relatively cheap set of functions to evaluate, compared to the real objective function *f*. The known constraints can be easily implemented by directly penalizing the acquisition by setting it to zero if known constraints are violated, while leaving unviolated sampling points as is.

*g*

_{k}denotes the

*k*th constraint in the set of known constraints.

#### 2.3.2 Unknown/hidden Constraints.

We adopt our previous strategy by employing a probabilistic binary classifier to learn the unknown or hidden constraints. As a result, feasible and infeasible regions are separated in the input domain $X$. These two regions are mutually exclusive, i.e., disjoint, because a sampling point cannot be both feasible and infeasible at the same time. The labels of feasible/infeasible for sampling points are fixed, in the sense that as the optimization process advances, the labels do not change.

Denote the feasibility dataset as ${xi,ci}i=1n$, where *n* is the number of data points and *c* is associated with unknown constraints. We assign *c*_{i} = 1 if **x**_{i} is feasible, and *c*_{i} = 0 if **x**_{i} is infeasible. At an unknown **x**, the feasibility classifier provides a probability mass function, with Pr(**x**|*c*(**x**) = 1) as the predicted probability of satisfying the hidden constraints, and Pr(**x**|*c*(**x**) = 0) as the predicted probability of failing the hidden constraints. It is noteworthy that their sum adds up Pr(**x**|*c*(**x**) = 0) + Pr(**x**|*c*(**x**) = 1), as they are mutually exclusive, and there are only two possibilities. The probability of passing the unknown constraints will be used to condition on the acquisition, resulting in the multiplication of Pr(**x**|*c*(**x**) = 1) in the conditioned acquisition.

Even though there are many available probabilistic binary classifier in the context of machine learning, for example, *k*-NN [29], AdaBoost [30], RandomForest [31], support vector machine [32] (SVM), least squares support vector machine (LSSVM) [33], and convolutional neural network [34]; in this work, we restrict our methodology to GP as a binary probabilistic classifier. Labeling feasibility as described above, the posterior mean of feasibility GP can be used to predict the probability of satisfying unknown constraints, i.e., *μ*_{feasible}(**x**) = Pr(**x**|*c*(**x**) = 1).

### 2.4 Pareto Frontier With an Uncertain Gaussian Process Classifier.

At each iteration, we construct the current Pareto frontier, which is subjected to change as the optimization process advances. If a sampling point is currently Pareto-dominant, the point is labeled as 1, and if the sampling point is not Pareto-dominant (i.e., it is dominated by another point in the dataset), the sampling point is labeled as 0. The classification process is thus uncertain, in the sense that the labels change from one iteration to another. This is to contrast with the constraint feasibility classifier *μ*_{feasible}(**x**), where the labels are fixed and do not change, as the optimization process advances. The uncertainty in the Pareto frontier classification gradually decreases, as the number of sampling data points increases.

We explore the possibility of using GP as an uncertain Pareto frontier classifier. Surprisingly, GP is one of a few well-established machine learning techniques that considers uncertainty in prediction through its posterior variance function. Because of this particular reason, GP is employed as an uncertain classifier to construct the Pareto frontier, where the uncertainty is quantified by the GP posterior variance function (Eq. (6)).

The main idea is to force the Pareto GP classifier to balance its learning by exploitation and exploration, especially when the Pareto frontier prediction is not accurate by focusing on the unknown region at the beginning of the optimization process. As the optimization advances, if the Pareto frontier can be classified with high accuracy, the BO framework should be exploited to promote convergence and explored to promote the diversity in the MO optimization settings. This is consistent with the philosophy of the classical BO approach, which strikes for the balance of exploration and exploitation. Furthermore, the convergence and diversity, which are the two keys measure of MO optimization problems [2], are promoted by employing common acquisition functions on the Pareto frontier GP. As a result, the acquisition of Pareto GP classifier is included as another in the main acquisition function for the classical BO method (Eq. (10)).

It is noteworthy to point out that the Pareto classification problem is also mutually exclusive, in the sense that a sampling point cannot be both Pareto-dominant and Pareto-non-dominant at the same time. For an unknown location **x**, the Pareto GP classifier provides both the probability of being Pareto-dominant *μ*_{Pareto}(**x**), which is bounded between 0 and 1, as well as the uncertainty associated with the probability as $\sigma Pareto2(x)$. Instead of employing logistic logit function which maps from (−∞, + ∞) to [0, 1] for binary classification problem [35], we simply use the regression formulation for GP and clip the prediction if the responses are beyond the range of [0, 1].

### 2.5 Acquisition Function.

**w**= (

*w*

_{1}, …,

*w*

_{s}) is sampled to combine multiple objectives ${yj}j=1s$ to a single-objective function

*y*. A GP model is fitted using the dataset of

*n*data points ${xi,yi}i=1n$. An acquisition function

*a*

_{obj}(

**x**) is formed, as a function of posterior mean

*μ*

_{obj}(

**x**) and posterior variance $\sigma obj2(x)$. The second term, $aPareto(x;\mu Pareto(x)$, $\sigma Pareto2(x))$, is the acquisition function based on the Pareto frontier GP classifier. The third term, Pr(

**x**|

*c*(

**x**) = 1) =

*μ*

_{feasible}(

**x**), is the probability of passing unknown constraints, provided by posterior mean of the second GP. The fourth term, $I(x)$, is the indicator function describing known constraints as in Eq. (9). The next sampling point

**x*** is obtained by maximizing the acquisition function described in Eq. (10), i.e.,

### 2.6 Summary.

The srMO-BO-3GP framework is summarized in Algorithm 1. CMA-ES [36,37] is used as the sub-optimizer to obtain the next sampling point in Eq. (11), where the default settings are retained. An interface between the srMO-BO-3GP optimizer and the engineering applications is constructed by matlab, python, and Shell scripts. Also, since the maximization is set as a default setting, the Tchebycheff function is algebraically manipulated to conform with the default setting.

**Input:** dataset $Dn$ consisting of inputs, observation, feasibility $(x,y,c)i=1n$, multi-objective $(xi,yi)i=1n$, constraint GP $(x,ci)i=1n$,

1: **for**$n=1,2,\u2026,$**do**

2: randomize a weight vector **w**

3: combine ${yj}j=1s$ to $y$ ▹ multi- to single- objective - Eq. 8)

4: construct single-objective GP ▹ GP #1: $\mu obj(x)$, $\sigma obj2(x)$

5: construct Pareto front ▹ GP #2: $\mu Pareto(x)$, $\sigma Pareto2(x)$

6: find current Pareto front

7: construct Pareto classifier GP

8: construct constraints classifier GP ▹ GP #3: $\mu \,feasible(x),\sigma \,feasible2(x)$

9: locate the next sampling point $xn+1$ ▹ (Eq. 10)

10: query objectives $yn+1\u2190{yj}j=1s$ and feasibility $cn+1$

11: augment dataset $Dn+1\u2190{Dn\u222a(xn+1,yn+1,cn+1)}$

12: **end for**

## 3 Numerical Benchmarks

In this section, we follow the standard test suite [5,38] to benchmark and investigate the effectiveness and the efficiency of the proposed srMO-BO-3GP algorithm by ZDT [39] and DTLZ [40] test suites, where the geometry of Pareto frontiers are well visualized by Jin et al. [41] as well as Ishibuchi et al. [42]; analytical solutions of Pareto frontiers are provided by the work of Deb et al. [40]. The regularization *λ* parameter is set to 0.01, even though its effect requires a further benchmark study. We compare the numerical performance between variants of srMO-BO-3GP, particularly for

variations of acquisition function for the objective GP: PI, EI, and UCB;

variations of acquisition function for the Pareto GP: PI, EI, and UCB;

regularized versus non-regularized Tchebycheff objective function.

This results in 18 different variants of the srMO-BO-3GP. The name convention for these variants is Reg/NoReg-{PI,EI,UCB}-{PI,EI,UCB}, which corresponds the option of regularization, the objective GP, and the Pareto GP. Note that in the DTLZ benchmark, *d* = *M* + *k* − 1, and the function *g*(**x**) is constructed based on *k* = *d* − *M* + 1 variables, $(xM,\u2026,xd)$. The benchmark is repeated five times for each variant BO method and each benchmark function, resulting in almost 1000 runs in total. In all the optimization runs, five initial sampling points are randomly chosen using Monte Carlo sampling. The maximum number of iteration is 1500. For ZDT and DTLZ benchmarks, the dimensions are 3 and 4, respectively. For ZDT benchmark, the number of objectives is 2, whereas for DTLZ benchmark, the number of objectives is 3.

### 3.1 ZDT Test Suite

#### 3.1.1 ZDT1.

*d*-dimensional input

**x**∈ [0, 1]

^{d}, a more general ZDT1 function can be introduced [39]

#### 3.1.2 ZDT2.

*d*-dimensional input

**x**∈ [0, 1]

^{d}, the ZDT2 function can be described as [39]

#### 3.1.3 ZDT3.

*d*-dimensional input

**x**∈ [0, 1]

^{d}, the ZDT3 function can be described as [39]

#### 3.1.4 ZDT4.

*d*-dimensional input

**x**∈ [0, 1]

^{1}× [ − 5, 5]

^{d−1}, the ZDT4 function can be described as [39]

*g*(

**x**) = 1.

#### 3.1.5 ZDT6.

*d*-dimensional input

**x**∈ [0, 1]

^{d}, the ZDT6 function can be described as [39]

### 3.2 DTLZ Test Suite

#### 3.2.1 DTLZ1.

*d*-dimensional input

**x**∈ [0, 1]

^{d}with

*M*objectives,

*f*

_{1:M}, is defined as [40]

#### 3.2.2 DTLZ2.

*d*-dimensional input

**x**∈ [0, 1]

^{d}with

*M*objectives,

*f*

_{1:M}, is defined as [40]

#### 3.2.3 DTLZ3.

*d*-dimensional input

**x**∈ [0, 1]

^{d}with

*M*objectives,

*f*

_{1:M}, is defined as [40]

#### 3.2.4 DTLZ4.

*d*-dimensional input

**x**∈ [0, 1]

^{d}with

*M*objectives,

*f*

_{1:M}, is defined as [40]

#### 3.2.5 DTLZ5.

*d*-dimensional input

**x**∈ [0, 1]

^{d}with

*M*objectives,

*f*

_{1:M}, is defined as [40]

*i*≤

*M*, $g=\u2211i=Md(xi\u22120.5)2$.

#### 3.2.6 DTLZ6.

*d*-dimensional input

**x**∈ [0, 1]

^{d}with

*M*objectives,

*f*

_{1:M}, is defined as [40]

*i*≤

*M*, $g=\u2211i=Mdxi0.1$.

### 3.3 Performance Metrics.

We followed and adopted the notation from Jin et al. [41] and Hernández-Lobato et al. [15], to measure numerical performance for our proposed method, namely the generational distance (GD), the inverted generational distance (IGD), and the hypervolume (HV). These metrics also require prior knowledge of the true Pareto frontier. For the sake of clarity, we denote PF* as a set of the discretized points on the true Pareto frontier of an arbitrary benchmark function, and PF as the set of obtained Pareto frontier.

#### 3.3.1 Generational Distance (GD).

*d*is the minimum Euclidean distance between an obtained Pareto-dominant point and the true Pareto frontier. In other words, $d(u,PF*)=argminv\u2208PF*d(u,v)$. The denominator, |#PF|, is the number of obtained solution in PF. A smaller value of GD indicates a better convergence to the true Pareto frontier.

#### 3.3.2 Inverted Generational Distance (IGD).

*v*on the true Pareto frontier with the obtained Pareto frontier. The denominator, |#PF*|, is the number of discretization Pareto-dominant point. As pointed out by Jin et al. [41], if |#PF*| is sufficiently large, i.e., when the true Pareto frontier is sufficiently discretized, then IGD can measure both convergence and diversity. A smaller of IGD indicates a better convergence and diversity to the true Pareto frontier.

#### 3.3.3 Log of Relative Hypervolume Difference (LRHD).

_{ideal}is the ideal hypervolume and HV > HV

_{ideal}is the computed hypervolume of the obtained Pareto frontier. The origin is chosen as the reference point to compute the hypervolume. A smaller of LRHD indicates a better convergence and diversity to the true Pareto frontier.

### 3.4 Comparison Between Variants.

In Figs. 1 and 2, the label of the *x*-axis is noted by the objective acquisition function, followed by the Pareto acquisition function. For example, UCB-PI means UCB is used as the objective acquisition function for the scalarized Tchebycheff, and PI is used to search for the Pareto frontier. Seaborn package [45] is used to visualize the performance metrics, measure the log of generational distance, log of inverted generational distance, and log of relative hypervolume difference.

Figure 1 shows the comparison between 18 variants of the srMO-BO-3GP methods in ZDT test suite, including regularized versus non-regularized augmented Tchebycheff scalarized single-objective functions. Overall, regularized objective function tends to slightly outperform non-regularized objective function. For the Pareto frontier acquisition function, EI and PI acquisition functions perform better than UCB acquisition function. However, for the objective acquisition function, UCB seems to perform best, then to EI, then to PI.

Figure 2 shows the comparison between 18 variants of the srMO-BO-3GP methods in DTLZ test suite, including regularized versus non-regularized augmented Tchebycheff scalarized single-objective functions. Again, regularization seems to slightly improve the performance in terms of hypervolume. For the Pareto acquisition function, all UCB, EI, and PI acquisition functions seem to perform on par with each other. For the objective acquisition function, UCB outperforms both EI and PI, and EI and PI are of the same performance. Figure 3 compares Pareto frontiers obtained by the proposed algorithm against the true Pareto frontiers, showing an acceptable agreement in ZDT and DTLZ benchmarks. Overall, the proposed algorithm performs relatively well on most benchmarks, except for those highly oscillatory with multiple modes, but the optimum” objectives are reasonably close with the global optimum” objectives.

## 4 Engineering Applications: Thermomechanical Finite Element for Flip-Chip Package Design Optimization

We demonstrate the applicability of our proposed framework to a thermomechanical FEM model for flip-chip package design, where five objectives are considered. Figure 4 shows the geometric model of the thermomechanical problem, where the mesh density varies for different levels of fidelity. Two design variables are associated with the die, three are associated with the substrate, three more are associated with the stiffener ring, two are with the underfill, and the last one is with the PCB board. Only two levels of fidelity are considered in this example. Table 1 show the design variables, the physical meaning of the design variables, as well as their lower and upper bounds in this case study.

Variable | Design part | Lower bound | Upper bound | Optimal value |
---|---|---|---|---|

x_{1} | Die | 20000 | 30000 | 20702 |

x_{2} | Die | 300 | 750 | 320 |

x_{3} | Substrate | 30000 | 40000 | 35539 |

x_{4} | Substrate | 100 | 1800 | 1614 |

x_{5} | Substrate | 10 · 10^{−6} | 17 · 10^{−6} | 17 · 10^{−6} |

x_{6} | Stiffener ring | 2000 | 6000 | 4126 |

x_{7} | Stiffener ring | 100 | 2500 | 1646 |

x_{8} | Stiffener ring | 8 · 10^{−6} | 25 · 10^{−6} | 8.94 · 10^{−6} |

x_{9} | Underfill | 1.0 | 3.0 | 1.52 |

x_{10} | Underfill | 0.5 | 1.0 | 0.804 |

x_{11} | PCB board | 12.0 · 10^{−6} | 16.7 · 10^{−6} | 16.7 · 10^{−6} |

Variable | Design part | Lower bound | Upper bound | Optimal value |
---|---|---|---|---|

x_{1} | Die | 20000 | 30000 | 20702 |

x_{2} | Die | 300 | 750 | 320 |

x_{3} | Substrate | 30000 | 40000 | 35539 |

x_{4} | Substrate | 100 | 1800 | 1614 |

x_{5} | Substrate | 10 · 10^{−6} | 17 · 10^{−6} | 17 · 10^{−6} |

x_{6} | Stiffener ring | 2000 | 6000 | 4126 |

x_{7} | Stiffener ring | 100 | 2500 | 1646 |

x_{8} | Stiffener ring | 8 · 10^{−6} | 25 · 10^{−6} | 8.94 · 10^{−6} |

x_{9} | Underfill | 1.0 | 3.0 | 1.52 |

x_{10} | Underfill | 0.5 | 1.0 | 0.804 |

x_{11} | PCB board | 12.0 · 10^{−6} | 16.7 · 10^{−6} | 16.7 · 10^{−6} |

After the numerical solution is obtained, the component warpage at 20°C, 200°C, and the strain energy density of the furthest solder joint are calculated. The five objectives are listed as follows,

Objective-1: warpage at 20°C,

Objective-2: warpage at 200°C,

Objective-3: damage metric for BGA lifetime prediction,

Objective-4: damage metric for C4 interconnection lifetime prediction,

Objective-5: first principal stress at C4 corner, where cracking and delamination frequently occurs,

Figure 5(a) shows the scatter plot matrix between the five objectives and their joint densities, whereas Fig. 5(b) presents the Pearson correlation, which is bounded between (−1) and (+1). 851 simulations are performed, in which only 256 simulations are feasible. Out of 256 feasible simulations, 84 of those represent the current Pareto frontier of this numerical study. The objective 1 is very positively correlated with the objective 2, as both of them corresponds to the warpages at different temperatures. This implies that minimizing the object 1 would also minimize the objective 2, thus no trade-off is found. The same argument applies for objectives 3 and 5, where positive correlation is found. The objective 4 is poorly correlated with other objectives. Trade-offs are found between members of the group of objectives 1 and 2 and those of the group of objective 3 and 5. Overall, it is challenging to plot the Pareto front on high-dimensional space for a practical visualization. However, we have demonstrated that our methodology is applicable to problems with many objective functions. In practice, the number of objectives are often reduced to minimum, before the optimization study is conducted.

## 5 Discussion

BO is a powerful and flexible framework which allows for many useful extensions. For example, the local GP approach [46,47] could be used to improve the scalability of GP. An extended version of local GP has also been developed [48] to solve the mixed-integer optimization problems, where the number of discrete/categorical variables is relatively small. Accelerated BO methods by exploiting computational resource on high-performance computing platform have been proposed [49,50] to reduce the amount of physical waiting time. Multi-fidelity BO approaches [51,52] have been developed to couple information between different levels of fidelity by exploiting the correlation between low- and high-fidelity models to reduce the computational efforts. Its success has been demonstrated, at least in the field of bioengineering [53] and computational fluid dynamics [54,55].

For multi-objective problems, hypervolume-based approaches remain more popular than the scalarization approaches. Furthermore, there are limitations in combining multiple objectives into single objective as this approach relies on the weights, which are randomly sampled and may lead to convergence and diversity issues. Perhaps, combining the Pareto classifier with a hypervolume-based approach would result in a better numerical performance. This topic remains as an interesting direction in the future.

An advantage of the proposed method is that the inclusion of the uncertain Pareto classifier makes the effort of locating the Pareto frontier easier in the input space $X$, regardless of whether the Pareto frontier is discontinuous, convex, or concave, because this uncertain Pareto classifier only considers the input space. As shown in this paper, the proposed method performs relatively well for multiple objectives in both ZDT and DTLZ test suites.

As usual, the predictive performance of GP is highly dependent on the dimensionality of the problem, i.e. the number of design variables, and the characteristics of the underlying function, which is typically assumed to be smooth. It is worthy to note that the proposed algorithm does not depend on the number of objective functions, as the Pareto classifier is constructed directly on the input space $X$ and the weighted Tchebycheff scheme combines multiple objectives to a single objective. Therefore, the proposed srMO-BO-3GP method has a competitive advantage for problems with high number of objective functions. Furthermore, it does not depend on the characteristics of the Pareto frontiers, i.e., whether they are convex, concave, or discontinuous. The known constraints do not have any effect on the optimization problem, as we assume they can be cheaply evaluated and the functional evaluation *f*(·) would not be invoked if one of the known constraints *g*_{k} is violated. The number of unknown constraints also does not have any effect on the optimization problems, as we are primarily concerned with the input space $X$, and the probability of satisfying all unknown constraints is modeled by another GP, which can be considered as a black-box function. In summary, the main factor that may have the most impact on the numerical performance of srMO-BO-3GP is the dimensionality of the optimization problem.

The drawback of Tchebycheff decomposition for MO problems is that it can only identify a unique Pareto solution point for a fixed weight vector at best (cf. Miettinen [56]). The diversity of the Pareto frontier solely depends on the weight vectors, which is uniformly distributed in this case. Furthermore, because the Tchebycheff scheme does not consider the current Pareto frontiers, it does not promote diversity the Pareto frontiers, even though one might argue that the convergence is preserved. The second term in the composite acquisition function (Eq. (10)), which corresponds to the Pareto classifier, is inserted to promote the diversity in Pareto frontiers and bypass the weight adjustment problem.

As theoretically formulated in the BO framework, the acquisition function strikes for the balance of exploration and exploitation, which would probabilistically converge to the global optimum, regardless of the number and locations of initial sampling points, as described in Bull [57]. Strong theoretical guarantees are often provided in the case of single objective function as BO is well known as a no-regret optimization algorithm. Along with the Tchebycheff argument in Miettinen [56], where optimizing with a fixed weight vector would lead to a unique Pareto point, the proposed srMO-BO-3GP method does not depend on the initial sampling points, at least from a theoretical point of view.

The proposed method does not cope well with high-dimensional problems, as they are the usual challenges for GP. While some solutions have been proposed, the scalability problem remains as an active research area and is posed as the future extension to this work.

## 6 Conclusion

In this paper, we propose a multi-objective Bayesian optimization framework, called srMO-BO-3GP in a sequential setting with applications to engineering-based simulations. In this framework, three distinct GPs are coupled together. The first GP is used to approximate a single-objective function, which is converted from the original multi-objective functions using a regularization-augmented Tchebycheff function. The second GP is used to learn the unknown constraints, which are evaluated simultaneously with the set of objective functions, supporting the general case where some performance metrics reflect design aspirations (objectives) while others must satisfy prescribed requirements (constraints). The third GP is used as an uncertain binary classifier to learn the Pareto frontier, where its own acquisition function is embedded in the main acquisition function. The srMO-BO-3GP framework is demonstrated using two numerical benchmarking test suites, ZDT and DTLZ, and an engineering thermomechanical FEM application.

## Acknowledgment

The views expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government. Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

## Conflict of Interest

There are no conflicts of interest.