
Efficient sampling for decision making in materials discovery


Yuan Tian (田原), Turab Lookman, and Dezhen Xue (薛德禎)

1 State Key Laboratory for Mechanical Behavior of Materials, Xi'an Jiaotong University, Xi'an 710049, China

2 Los Alamos National Laboratory, Los Alamos, NM 87545, USA

Keywords: sampling methods, active learning, decision making, material design, Bayesian optimization

1. Introduction

Material research and development have so far largely been based on expert knowledge and "trial-and-error". This has the drawback of relatively low efficiency and high costs.[1] With improvements in measurement capabilities and advances in scientific computing, materials data is being accumulated at unprecedented levels.[2] Over the last ten years, the information sciences have played a pivotal role in the analysis and prediction from data. Together with experiment, theory, and computer simulations,[3–5] we are at the threshold of a new age in which we will be able to obtain relationships between composition, structure, processing, and performance faster, more accurately, and more economically than ever before.[6–10] In addition, we will be able to employ the tools of decision making to guide the discovery of targeted materials with the fewest experiments, minimizing cost and labor.[11–21]

Materials science is characterized by data that are often expensive to obtain, as experiments are not easy, together with a materials space that is vast and complex. Hence, often only a relatively small data set is available for training, and we have to make predictions in a high-dimensional materials design space.[22–24] Machine learning models trained on such small data sets are then usually accompanied by large uncertainties, which makes the decision making process challenging.[12–21,25–29] Using methods to improve decision outcomes and effectiveness is essential in guiding and accelerating the discovery of materials with targeted properties.[15–21,26,28,30]

A supervised learning approach requires a substantial amount of labeled data to train the model adequately.[31] However, in materials science we usually have a large pool of unlabeled data. This is where a subfield of machine learning, known as active learning,[32–34] becomes important. Active learning depends on taking relatively few special labeled data points, those labeled by an expert, and training a model on those data. The claim is that active learning can achieve similar or better performance using fewer data than a supervised model does.[28,37–40] Thus, active learning endeavors to select the most "informative" data samples in order to successively improve the model in an iterative loop.[35,36,41–44] It involves a continuous learning process, in contrast to passive learning, which relies on making predictions from the available training data. It is also called "query learning", as it depends on selecting or querying an oracle or experiment for the next label, and is related to "optimal experimental design"[12,45] or "sequential design".[46,47] Bayesian optimization and adaptive design are methods following a similar strategy, which require goal-directed iterative feedback.[21,26,28]

The different ways in which a learner can make queries include (a) query synthesis, (b) stream-based selective sampling, and (c) pool-based sampling.[32] In query synthesis, the learner learns by making membership queries[48–50] from any unlabeled data in the input space, such as the feature space, and queries the oracle for the label. Though quite reasonable for certain problems,[51] query synthesis can give rise to problematic choices, for example, ambiguous images when learning to discriminate characters. In stream-based sampling,[52] each unlabeled data point is typically drawn continuously from an underlying distribution, and the learner decides whether to query it or to discard it. This decision can be made by defining a utility function or information measure, so that samples with higher utility are more likely to be queried. Whereas one data point is sampled sequentially from a streaming data source in stream-based learning, pool-based active learning[53,54] considers all of the unlabeled data pooled together before making the selection, using a utility as the means to sample. It is the most common setting for active learning, especially in applications to materials problems.

Figure 1 shows an active learning loop with two aspects, the learner or model prediction and the query. Starting from a "small data" set, an inference model with uncertainties is established. The results of the prediction or probability estimates are applied to a pool of unlabeled instances or data. A number of query methods can then be employed to decide on the next selection. We review below several pool-based sampling methods, including committee-based strategies, large margin-based strategies, posterior probability-based strategies, sampling strategies based on directed exploration, and the mean objective cost of uncertainty. The sampling then guides the next measurement (oracle) to label the data.
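To make this loop concrete, the following minimal Python sketch (using scikit-learn) implements a pool-based loop: fit a model on the small labeled set, score the unlabeled pool with an uncertainty-based utility, query the oracle, and feed the new label back. The SVR-based committee, the synthetic oracle, and the pool here are illustrative placeholders, not the models or data of the works reviewed below.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)

def oracle(x):
    # Stand-in for the experiment or measurement that labels a queried sample.
    return np.sin(3 * x[0]) + 0.1 * rng.normal()

# "Small data" training set and a large pool of unlabeled candidates.
X_train = rng.uniform(0, 2, size=(5, 1))
y_train = np.array([oracle(x) for x in X_train])
X_pool = rng.uniform(0, 2, size=(200, 1))

for it in range(10):
    # Committee of bootstrapped SVR models: the spread of predictions acts as uncertainty.
    model = BaggingRegressor(SVR(kernel="rbf"), n_estimators=50, random_state=it)
    model.fit(X_train, y_train)
    preds = np.stack([est.predict(X_pool[:, feats])
                      for est, feats in zip(model.estimators_, model.estimators_features_)])
    uncertainty = preds.std(axis=0)

    # Query the pool point the committee is least certain about.
    idx = int(np.argmax(uncertainty))
    x_new, y_new = X_pool[idx], oracle(X_pool[idx])

    # Feedback: label it and move it from the pool into the training set.
    X_train = np.vstack([X_train, x_new])
    y_train = np.append(y_train, y_new)
    X_pool = np.delete(X_pool, idx, axis=0)
```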

Fig. 1. The active learning loop, with emphasis on the query step.

In this review, we outline in Subsection 2.1 the motivation underlying commonly used sampling methods, which use uncertainties in model predictions to make the next queries. We also discuss other uncertainty-based sampling strategies, such as methods based on directed exploration and the mean objective cost of uncertainty in Subsections 2.2 and 2.3, to guide the choice of the next samples to test. We conclude in Section 3 with a number of examples from materials science related to classification and regression applications, which show how active learning can be applied in practice.

2. Sampling strategies

Selecting specific data samples for labeling from a pool of unlabeled data depends on the sampling strategy. Active learning uses these strategies in the form of utility or acquisition functions, i.e., expected utilities computed from a model constructed from the data. The sample with the largest utility is then chosen to be labeled. The sampling strategies include committee-based strategies, large margin-based strategies, posterior probability-based strategies, etc. We elaborate on these sampling strategies below.

2.1. Commonly used sampling strategies

Committee-based strategies Here we use the predictions from several models to guide us on which data samples to label. The set of models is known as a committee,[55,56] so that k different models provide k labels, and the data sample chosen is the one giving maximum disagreement in the predictions. The distinct models for the same data set can be built with different hyperparameters or with different algorithms, such as support vector machines (SVM),[57] decision trees, logistic regression, etc. The disagreement in the predictions obtained from the committee can be measured using the concept of entropy, which is an aspect of uncertainty sampling.[58] Thus, if V(yi) indicates the number of votes for the label yi, then the chosen sample is the x* that maximizes the entropy function given by[59]

x* = argmax_x { −Σ_i [V(yi)/C] log[V(yi)/C] },

where C is a constant.
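As a hedged illustration (a sketch on synthetic data, not a specific published implementation), the vote-entropy disagreement of a committee can be computed as follows; the three classifiers and the data are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_lab, y_lab = make_classification(n_samples=40, n_features=4, random_state=0)
X_pool, _ = make_classification(n_samples=200, n_features=4, random_state=1)

# A committee of C models trained on the same labeled data.
committee = [LogisticRegression(max_iter=1000), SVC(), DecisionTreeClassifier()]
votes = np.stack([m.fit(X_lab, y_lab).predict(X_pool) for m in committee])   # shape (C, N)

C = len(committee)
classes = np.unique(y_lab)
# V(y_i)/C: the fraction of committee votes each class receives at each pool point.
frac = np.stack([(votes == c).mean(axis=0) for c in classes], axis=1)        # shape (N, n_classes)

# Vote entropy: large values indicate maximum disagreement among committee members.
with np.errstate(divide="ignore", invalid="ignore"):
    vote_entropy = -np.nansum(frac * np.log(frac), axis=1)

x_star = X_pool[np.argmax(vote_entropy)]   # the sample to label next
```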

The KL-divergence is a slight modification of the entropy definition and can also be utilized to represent the difference between two probability distributions p(y) and q(y). It allows us to evaluate how much information is lost when we approximate one distribution by another. The KL-divergence is defined by[60,61]

D_KL(p‖q) = Σ_y p(y) log[ p(y)/q(y) ].

The data sample which produces the most disagreement across the predictions is labeled accordingly using, say, the above measures; for regression-based models, the variance can be used as a measure of disagreement.[62] Thus, high variance in the predictions indicates samples that are likely to be most informative. We can also use bootstrap or bagging[63] in conjunction with the entropy or variance approaches: k training sets are created from the original data set by drawing samples randomly with replacement, giving k predictions from k models. Another approach to constructing a committee of models is to use the feature space, so that each model works on the same data set but with a different set of features. This can potentially provide a greater degree of divergence in the predictions of the models.

Large margin-based strategies Margin-based strategies are especially relevant for classification problems. The model uncertainty of classifiers such as the margin-based SVM can be represented by the distance to the separating hyperplane.[64] Evaluating this distance is a straightforward way to measure the model uncertainty on an unknown sample. The support vectors of the SVM define the decision boundary and can be considered the most informative samples, or those with the greatest uncertainty, whereas a sample far from the decision boundary, whose class assignment is made with high confidence, is not interesting for future sampling. Thus, the margin sampling method considers the data sample for which the prediction of one of the classes is most uncertain, corresponding to the minimum distance from the hyperplane. The informative data samples fall within the margin, and the chosen samples are given by

x* = argmin_xi | f(xi, ω) |,

where f(xi, ω) represents the distance between the data sample and the hyperplane for class ω. A single data sample is chosen for querying the oracle per iteration, so there is computational overhead. The multi-class level strategy extends this to a multi-class setting. For a data sample, the classes with the largest and second-largest distances from the margin are noted by identifying the distance from the hyperplane for each class of each data sample. The multi-class level uncertainty is then defined via the difference between the two largest distances corresponding to two different classes. Thus,[65]

x* = argmin_xi [ f(xi, ω1) − f(xi, ω2) ],

where ω1 and ω2 are the classes with the largest and second-largest distances.

The goal is to minimize this difference so that the most uncertain and informative unlabeled data sample can be chosen.
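A minimal sketch of margin-based selection with a binary SVM, assuming synthetic data: the signed distance to the hyperplane comes from scikit-learn's decision_function, and the sample with the smallest absolute distance is queried.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_lab, y_lab = make_classification(n_samples=60, n_features=5, random_state=0)
X_pool, _ = make_classification(n_samples=300, n_features=5, random_state=1)

clf = SVC(kernel="rbf").fit(X_lab, y_lab)

# Signed distance of each unlabeled sample to the separating hyperplane in the
# kernel-induced feature space; a small |distance| means the sample lies in the margin.
dist = clf.decision_function(X_pool)
x_star = X_pool[np.argmin(np.abs(dist))]   # most uncertain sample to query next
```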

Posterior probability-based strategies The posterior probability distribution captures a model's ability and uncertainty in assigning a particular class to a data sample, and it can also be queried to determine whether an unlabeled data sample should be labeled or not. Such a strategy can be used with any model capable of predicting output probabilities. A well-known approach is least confidence, which selects the next sample whose prediction or class assignment is least certain. Thus, if Pθ(ŷ|x) is the highest posterior probability for the label ŷ predicted by model θ, then maximizing 1 − Pθ(ŷ|x) gives us the data sample with the highest uncertainty. That is,[66]

x* = argmax_x [ 1 − Pθ(ŷ|x) ].

The strategy can be extended to multiple classes, in which we consider the two most probable classes: one with the highest probability, ŷ1, and the other with the second-highest probability, ŷ2, both obtained using model θ. We can minimize the difference between the probabilities of the two most probable classes, as it is indicative of the uncertainty of the model in predicting the unexplored data sample. The smallest difference therefore represents the most uncertain and informative sample, which is given by[67]

x* = argmin_x [ Pθ(ŷ1|x) − Pθ(ŷ2|x) ].

Such a sampling method is sometimes also referred to as margin sampling, similar to the large margin-based strategy mentioned above. The difference is that this algorithm uses posterior probabilities, while the other is based on margin-based models. Going beyond the one or two most probable classes requires us to consider the information lying in the remaining class probabilities, which is otherwise unused. This is where the entropy measure is useful for sampling. Entropy is a measure of disorder or impurity in a system and can also be used as a measure of model uncertainty; higher entropy values indicate high model uncertainty about class membership. The data sample for which the model is least certain, or most uncertain, about the assigned class can be recommended by maximizing the following entropy function:[68]

x* = argmax_x [ −Σ_i Pθ(yi|x) log Pθ(yi|x) ].
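The three posterior-probability criteria can be computed from any classifier exposing class probabilities; the sketch below, on synthetic three-class data with a random forest (an arbitrary choice), scores a pool by least confidence, margin, and entropy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_lab, y_lab = make_classification(n_samples=60, n_features=5, n_classes=3,
                                   n_informative=3, random_state=0)
X_pool, _ = make_classification(n_samples=300, n_features=5, n_classes=3,
                                n_informative=3, random_state=1)

proba = RandomForestClassifier(random_state=0).fit(X_lab, y_lab).predict_proba(X_pool)
sorted_p = np.sort(proba, axis=1)[:, ::-1]        # class probabilities, descending

least_conf = 1.0 - sorted_p[:, 0]                 # 1 - P(y_hat | x)
margin = sorted_p[:, 0] - sorted_p[:, 1]          # P(y_hat1 | x) - P(y_hat2 | x)
entropy = -np.sum(np.where(proba > 0, proba * np.log(proba), 0.0), axis=1)

x_lc = X_pool[np.argmax(least_conf)]   # least-confidence query
x_ms = X_pool[np.argmin(margin)]       # margin-sampling query
x_ea = X_pool[np.argmax(entropy)]      # entropy-based query
```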

We have seen the importance of several measures of uncertainty in the general sampling approaches outlined above. Entropy, margin, variance, and least confidence all provide definitions of uncertainties within the appropriate modeling context. Here we will consider a number of other sampling approaches recently employed in materials science that use uncertainties intimately to sample new data points for measurements.

2.2. Sampling strategies based on directed exploration

Sampling in an active learning loop often demands exploring the whole compound space adequately to find the optimal targets. The appropriate degree of exploration, i.e., how much exploration needs to be performed, is determined by the cost of collecting new information and the value associated with that information, as experiments in materials science are often costly. It is thus important to choose an appropriate degree of exploration so that the most important states can be sampled efficiently, especially when much of the space is unknown.

Exploration techniques can be classified as undirected or directed. Undirected exploration selects candidates randomly from some distribution, such as a uniform, semi-uniform, or Boltzmann distribution. In the uniform-distribution case, exploration is purely random and costs or rewards are not taken into account. If the distribution is other than uniform, the treatment is the same, as the actions are still random. Whitehead[74] proved that, under certain conditions, random-walk exploration leads to learning times that scale exponentially with the size of the state space.

The other exploration technique is directed exploration.[73] This kind of exploration uses some criterion or knowledge to guide the exploration. It helps to make decisions that explore the environment most effectively, aiming to maximize the improvement over time. However, it is impossible to know how much a candidate improves performance unless it has been measured. Thus, all directed exploration techniques are designed to optimize the knowledge gained. The exploration may be achieved by choosing states based on frequency of occurrence (counter-based exploration), high predicted uncertainty (maximum variance), or functions including various degrees of exploitation. Various trade-off methods based on Bayesian optimization, such as efficient global optimization (EGO) and knowledge gradient (KG), also fall into this broad category.

Maximum variance If the purpose is to predict continuous values such as hardness, strength, temperature, etc., the variability at each point can be computed through bootstrap methods,[75] the jackknife method, etc.[76–79] The uncertainty is thus given by

s²(x) = [1/(n − 1)] Σ_{j=1}^{n} [ μ_j(x) − μ̄(x) ]²,

where μ_j(x) is the prediction of the j-th resampled model and μ̄(x) is their mean.

Maximum variance samples the point with the largest uncertainty in the whole space. If the deviation of the true value from the predicted value is large, the variance can also be large, so choosing the point with the largest variance reduces the total deviation once that point is measured. It can be employed in many materials problems to optimize a material property curve/surface, such as the potential energy surface for Ar–SH[80] and a phase boundary.[27]
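A minimal sketch of the maximum-variance criterion, assuming a bootstrapped SVR surrogate on a synthetic one-dimensional problem; the predicted mean and the bootstrap spread at each candidate point are also what the trade-off utilities below consume.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 2, size=(20, 1))
y_train = np.sin(3 * X_train[:, 0]) + 0.05 * rng.normal(size=20)
X_space = np.linspace(0, 2, 500).reshape(-1, 1)      # candidate (virtual) search space

# Bootstrap: refit the learner on resampled data and collect its predictions.
boot_preds = []
for _ in range(200):
    idx = rng.integers(0, len(X_train), len(X_train))            # sample with replacement
    boot_preds.append(SVR(kernel="rbf").fit(X_train[idx], y_train[idx]).predict(X_space))
boot_preds = np.stack(boot_preds)

mu = boot_preds.mean(axis=0)        # predicted mean at each candidate point
s = boot_preds.std(axis=0)          # bootstrap uncertainty s(x)

x_maxvar = X_space[np.argmax(s)]    # maximum-variance query
```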

Pure exploitation method This chooses the point corresponding to the 'best' model prediction for the next selection,

x* = argmax_x μ̂(x) (or argmin_x μ̂(x) for a minimization target).

Similarly, pure exploration corresponds to the choice with the maximum variance.

Exploration–exploitation trade-off methods Both the predicted value and the variance serve as inputs to these methods.

(a) Efficient global optimization (EGO). The purpose is to maximize the expected improvement over the best value observed so far, ymax,[81,82]

E[I(x)] = s [ z Φ(z) + φ(z) ], with z = ( μ̄* − ymax ) / s,

where s is the uncertainty associated with the mean value μ̄* of the model prediction, and φ(·) and Φ(·) are the probability density function and cumulative distribution function of the normal distribution, respectively.
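A short sketch of the EGO acquisition, assuming the predicted mean mu and uncertainty s at each candidate come from a surrogate such as the bootstrapped SVR above, and y_best is the best value measured so far.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, s, y_best, maximize=True):
    """EGO utility: E[I(x)] = s * (z * Phi(z) + phi(z))."""
    s = np.maximum(s, 1e-12)                                  # guard against zero uncertainty
    z = (mu - y_best) / s if maximize else (y_best - mu) / s
    return s * (z * norm.cdf(z) + norm.pdf(z))

# Example, continuing the bootstrap sketch above:
# ei = expected_improvement(mu, s, y_best=y_train.max())
# x_next = X_space[np.argmax(ei)]
```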

(b) Knowledge gradient (KG). This is designed to select the alternative that produces the greatest expected value from a single observation.[83–86]

2.3. Mean objective cost of uncertainty

The emphasis of the mean objective cost of uncertainty (MOCU) is on minimizing the uncertainty in the objective function itself, rather than on reducing the overall uncertainty or entropy to track information in a system.[69,70] A prior distribution is assumed, which is then updated to form a posterior after the outcome of an experiment is known. MOCU tracks how a system loses performance because of the presence of uncertainty, and the next experiment is chosen to reduce MOCU, i.e., the one which reduces the variance in the posterior distribution the most.[71]

For a cost function f(x), let us define x_robust as the point that maximizes the expected value of f(x) over the unknown parameters θ. That is, x_robust = argmax_x E_θ[f_θ(x)] is the best or 'average' result we can obtain given that θ is unknown. It is not optimal because we do not know θ; if θ were known, the optimum would be given by x+ = argmax_x f_θ(x) for that value of θ. The difference between the optimal value f_θ(x+) and the value associated with the robust selection, f_θ(x+) − f_θ(x_robust), is the loss we incur due to uncertainty for each value of θ. This difference is known as the objective cost of uncertainty (OCU), and its expectation over θ defines MOCU. That is, we define the expected MOCU as

MOCU = E_θ [ f_θ(x+) − f_θ(x_robust) ].

The next experiment is then selected by considering all possible outcomes of a given experiment and choosing that which reduces MOCU the most. The application of MOCU within a modeling framework for shape memory alloys is discussed in the study of Dehghannasiri et al.[72]
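A toy numerical sketch of MOCU on a discrete grid of designs x and uncertain parameters θ; the parametric objective and the prior here are illustrative assumptions, not the shape-memory-alloy model of Ref. [72].

```python
import numpy as np

# Discrete design space x, uncertain parameter theta with a prior p(theta),
# and a known parametric objective f_theta(x) that we wish to maximize.
x_grid = np.linspace(0, 1, 101)
theta_grid = np.linspace(0.5, 1.5, 21)
p_theta = np.ones_like(theta_grid) / len(theta_grid)             # uniform prior over theta

def f(theta, x):
    return -(x - theta / 2.0) ** 2                               # illustrative objective

F = np.array([[f(t, x) for x in x_grid] for t in theta_grid])    # shape (n_theta, n_x)

# Robust design: maximizes the prior-expected objective E_theta[f_theta(x)].
x_robust_idx = np.argmax(p_theta @ F)

# OCU(theta): loss from using x_robust instead of the theta-optimal design.
ocu = F.max(axis=1) - F[:, x_robust_idx]

# MOCU: the expectation of OCU over the prior.
mocu = float(p_theta @ ocu)
print(f"x_robust = {x_grid[x_robust_idx]:.2f}, MOCU = {mocu:.4f}")
```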

3. Accelerating materials discovery

3.1. Model prediction

For classification problems, logistic regression, decision trees, random forests, gradient-boosted trees, support vector machines (SVM), and naive Bayes can be used to construct a model. The random forest and gradient-boosted tree are ensemble models, making use of the predictions of an ensemble of models to make robust predictions; decision trees are usually the base learners and the ensemble is over many such trees. The SVM performs non-linear classification using a kernel function to map the problem to a high-dimensional feature space, where a hyperplane separates the data. Naive Bayes is a probabilistic classifier based on Bayes' theorem and provides class probabilities that can be used for the next sampling.

Regression models include linear and polynomial regression, support vector regression using a kernel function, and the Gaussian process based on a Gaussian kernel for the covariance. A polynomial regressor is usually fit by least squares, which yields minimum-variance unbiased estimates of the coefficients.

3.2. Case studies

3.2.1. Posterior probability-based strategies

Example 1: phase diagram construction We review an example where uncertainty sampling methods are employed to efficiently construct a phase diagram.[66] The active learning loop (Fig. 2) shows how, starting with a phase diagram with few labeled points, a new candidate may be sampled to optimize the initial phase diagram. The probability distributions P(p|x) of the phases for each point x can be obtained through semi-supervised learning models, including label propagation (LP) and label spreading (LS), where p is the label of the phase. In LP, the labels of the measured points are fixed; LP provides the probability of phase p at x by calculating the probability of reaching phase p first via a random walk from x. In LS, the labels of measured points can be changed according to their environment, so LS is effective when the label noise is large. Each unmeasured point x on the phase diagram is thus labeled by choosing the phase with the highest probability. The uncertainty score is calculated via different sampling approaches. Three uncertainty sampling (US) methods mentioned above are adopted in this work: the least confident (LC) method, margin sampling (MS), and the entropy-based approach (EA). Unmeasured points located closest to the margin of multiple phases have higher uncertainty than those further away from the margin. The next candidate is determined from the unmeasured points with the highest uncertainty score.

Uncertainty sampling is compared with random sampling (RS) to construct the phase diagrams of H2O under lower pressure (H2O-L), H2O under higher pressure (H2O-H), and the ternary phase diagram of glass-ceramic glazes of SiO2, Al2O3, and MgO (SiO2–Al2O3–MgO). LC (LP + LC) is used to evaluate the uncertainty score. Beginning with the same nine data points, after a given number of iterations the sampling follows different trajectories for the different sampling methods, as shown in Fig. 3. Figures 3(d)–3(f) show the points sampled by uncertainty sampling. These sampled points are clearly distributed near the phase boundaries and represent the phase boundary locations well. In contrast, random sampling hardly captures the margin between two classes with the same number of points as US. Moreover, if there are small regions of certain phases, it is difficult for RS to sample points in such a region, which is then easily missed. For example, no point is sampled in region III of the H2O-H phase diagram with the RS approach, whereas this region is estimated efficiently by US in Fig. 3(e).
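The sketch below illustrates the LP + LC combination on a toy two-phase grid (a hedged stand-in for the phase diagrams of Ref. [66]): label propagation supplies P(phase|x) at every grid point, and the least-confidence score picks the next point to measure.

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

# Grid of (composition, temperature)-like points; -1 marks unmeasured phase labels.
X = np.array([[c, T] for c in np.linspace(0, 1, 25) for T in np.linspace(0, 1, 25)])
labels = np.full(len(X), -1)
measured_idx = np.linspace(0, len(X) - 1, 9, dtype=int)          # nine initial measurements
labels[measured_idx] = (X[measured_idx, 0] > 0.5).astype(int)    # toy "phase" labels

# Label propagation gives P(phase | x) at every unmeasured grid point.
lp = LabelPropagation(kernel="rbf", gamma=20).fit(X, labels)
proba = lp.predict_proba(X)

# Least-confidence uncertainty score; already measured points are excluded.
uncertainty = 1.0 - proba.max(axis=1)
uncertainty[measured_idx] = -np.inf
next_point = X[np.argmax(uncertainty)]      # recommended next measurement
```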

Fig. 2. The schematic of the phase diagram estimation and optimization process based on uncertainty sampling methods. Reproduced with permission from Ref. [66].

Fig. 3. Phase diagrams for (a) H2O-L, (b) H2O-H, (c) SiO2–Al2O3–MgO, and the corresponding results for these phase diagrams utilizing (d)–(f) uncertainty sampling and (g)–(i) RS approaches. Reproduced with permission from Ref. [66].

Fig. 4. Performances of different uncertainty sampling approaches on phase diagram construction. Reproduced with permission from Ref. [66].

The performances of the various uncertainty sampling methods are quantified in Fig. 4. The results indicate that the LP method is better suited than the LS method, and the EA is found not to be useful. If the number of phases is small, MS is better suited, whereas LC is powerful when many phases co-exist in a phase diagram. Thus, efficient selection can be realized using LC to construct complicated phase diagrams. Uncertainty sampling can decrease the number of sampled points to 20% and still construct an accurate phase diagram. Furthermore, the US approach can find an undetected new phase rapidly, and a smaller number of initial sampling points is sufficient to obtain an accurate phase diagram.

3.2.2. Sampling strategies based on directed exploration

3.2.2.1. Single property optimization

Example 2: search for NiTi-based shape memory alloys with very low thermal hysteresis (ΔT). To pursue a desired property, multi-component alloys are usually considered. The increase in the number of elements and the variability of the elemental concentrations lead to an exponentially larger search space, so it is a challenge to find a new material in such a huge unexplored space. The work in Ref. [28] demonstrates an active learning loop consisting of 1) data collection, 2) an inference model, 3) experimental design, and 4) feedback to optimize a materials property, as shown in Fig. 5. After training on 22 measured data points synthesized in the laboratory with a support vector regression/Gaussian process regression model, the properties and uncertainties of the candidates in the large virtual space Ni50−x−y−zTi50CuxFeyPdz of 797482 potential alloys can be predicted. To guide the next measurement, different utility functions are used, such as KG/EGO, which combine exploration and exploitation, and the greedy pure exploitation method Min. We start from randomly selected samples and combine different models with different utility functions to find the optimal alloy. For robust results, we repeat this process 2000 times to investigate the performance of several regressor:utility function combinations as a function of the size of the data. SVRrbf:KG is found to outperform the other regressor:utility function combinations on the training set, as shown in Fig. 6.
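The sketch below shows, under the assumption of synthetic training data and a toy three-component composition space, how bootstrapped SVRrbf predictions over a virtual space can be combined with an expected-improvement-type utility for a property to be minimized, such as ΔT; the compositions, data, and virtual-space size are placeholders, not those of Ref. [28].

```python
import numpy as np
from scipy.stats import norm
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Toy "measured" alloys: features = dopant contents, target = property to minimize.
X_meas = rng.uniform(0, 5, size=(22, 3))
y_meas = (X_meas ** 2).sum(axis=1) / 10 + rng.normal(0, 0.2, 22)

# Virtual space of candidate compositions (a coarse stand-in for ~8e5 alloys).
X_virt = rng.uniform(0, 5, size=(5000, 3))

# Bootstrapped SVR gives a predicted mean and uncertainty for every candidate.
preds = []
for _ in range(100):
    idx = rng.integers(0, len(X_meas), len(X_meas))
    preds.append(SVR(kernel="rbf").fit(X_meas[idx], y_meas[idx]).predict(X_virt))
preds = np.stack(preds)
mu, s = preds.mean(axis=0), np.maximum(preds.std(axis=0), 1e-12)

# Expected improvement below the smallest measured value (minimization form).
z = (y_meas.min() - mu) / s
ei = s * (z * norm.cdf(z) + norm.pdf(z))
next_alloy = X_virt[np.argmax(ei)]          # recommended for synthesis and characterization
```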

Fig. 5. The adaptive design loop based on the model predictions and uncertainties, used to guide the experiments efficiently. Reproduced with permission from Ref. [28].

Fig. 6. The performances of various combinations of regressor and selector on the NiTi SMA training data set. Reproduced with permission from Ref. [28].

The predicted value and its associated uncertainty for the thermal hysteresis (ΔT) of all the data in the search space are provided using 1000 bootstrap samples with the SVRrbf base learner. This is followed by KG to explore the whole space and then focus on a local area with the apparent global minimum. Ti50.0Ni46.7Cu0.8Fe2.3Pd0.2, possessing the smallest ΔT (1.84 K), was discovered in the sixth iteration of our loop. Our design framework thus provides the opportunity for efficient decision making to accelerate the process of finding materials with desired properties.

Example 3: finding NiTi-based shape memory alloys with high phase transition temperatures. We demonstrate a statistical learning approach encoded in an active learning loop to predict martensitic transformation temperatures in this work.[29] Material descriptors related to the physical properties and the microstructure, such as chemical bonding, atomic radii, etc., are employed in the model to capture the possible mechanisms underlying the dependence of the transformation temperature, which enables us to extend the method to other alloys. Pauling electronegativity (en), valence electron number (ven), and Waber–Cromer's pseudopotential radii (dor) are chosen as the final features through a Pearson correlation map followed by best subset selection. We compare different machine learning models built with the three preselected features, including a linear model, a polynomial model, and support vector regression models (rbf, linear, and polynomial kernels). The performance of the machine learning inference models on the training data set is shown in Fig. 7, and the best performer, SVR with the rbf kernel, is chosen as the model in the adaptive design framework. Based on the predicted mean and standard deviation provided by bootstrap results from the SVR model, decision making is completed via utility functions. The design loop chooses the best candidate for synthesis and characterization. We show that iteratively learning and improving the statistical model can accelerate the search for SMAs with targeted transformation temperatures. Ti50Ni25Pd25, with the highest transition temperature of 182.89 °C, was discovered in the Ti50(Ni50−x−y−zCuxFeyPdz) alloy system of 1652470 possibilities in total.
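The feature-selection step can be sketched as a Pearson-correlation filter followed by an exhaustive best-subset search scored by cross-validation; the synthetic descriptors below are placeholders rather than the en/ven/dor descriptors of Ref. [29].

```python
import numpy as np
from itertools import combinations
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                              # candidate descriptors (columns)
y = X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=50)   # synthetic target property

# Pearson map: drop one of any pair of strongly correlated descriptors.
corr = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.9 for k in keep):
        keep.append(j)

# Best-subset selection among the remaining descriptors via cross-validation.
best_score, best_subset = -np.inf, None
for r in range(1, 4):
    for subset in combinations(keep, r):
        cols = list(subset)
        score = cross_val_score(SVR(kernel="rbf"), X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, cols
print("selected descriptor columns:", best_subset)
```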

In addition, there is domain knowledge in materials science, and combining it with the active learning loop makes the sampling much more efficient. Examples include the accelerated search for BaTiO3-based piezoelectrics with a vertical morphotropic phase boundary (MPB) using Bayesian learning.[26] With the knowledge that a vertical MPB provides a temperature-independent d33 piezoelectric property for the compounds considered, the solid solution (Ba0.5Ca0.5)TiO3-Ba(Ti0.7Zr0.3)O3 is predicted and validated with piezoelectric properties that show good temperature reliability. Another example is the accelerated search for BaTiO3-based ceramics with large energy storage at low fields.[37] In this work, the size of the virtual space is shrunk from ~9 million to seven hundred thousand compounds based on domain knowledge; combined with machine learning and the EGO sampling method, the compound (Ba0.86Ca0.14)(Ti0.79Zr0.11Hf0.10)O3, with the largest energy storage density of 73 mJ·cm−3 at a field of 20 kV·cm−1, is found in only two iterations.

Fig. 7. The performances of the machine learning inference models: (a) linear regression (LIN); (b) polynomial regression (POLY); (c) support vector regression with radial basis function kernel (SVR.rbf); (d) support vector regression with linear kernel (SVR.lin); and (e) support vector regression with polynomial kernel (SVR.poly). Reproduced with permission from Ref. [29].

3.2.2.2. Multi-objective optimization

Example 4: design of a new Mg alloy with high strength and ductility. In developing new materials with optimized performance, we often desire the material to have more than one excellent property, and even expect these properties to improve simultaneously during the optimization. However, many properties require a balance, as not all can be equally improved; the most typical example is the strength and ductility of metals. Therefore, in order to obtain new materials with multiple excellent properties using as few experiments as possible, an efficient decision making method is needed as a guide.

In this work,[87] we applied machine learning assisted multi-objective optimization of materials processing parameters, aiming at enhancing the strength and ductility of an as-cast ZE62 (Mg–6 wt.% Zn–2 wt.% RE (Y, Gd, Ce, Nd)) magnesium alloy. All the steps, including machine learning prediction of the multiple objectives, efficient sampling, and experimental feedback, are enclosed in an active learning strategy to perform multi-objective optimization of the as-cast ZE62 magnesium alloy. The sampling is performed by scalarizing the set of objectives into a single objective. Figure 8 shows the schematic of our design strategy, illustrating how to search for the best solution to meet the trade-off optimization requirements. The goal is to push the newly discovered materials toward the target (yti).

A well-known way of handling a multi-objective problem is to find the Pareto front. In the first approach, we determine the Pareto front first, then aim to minimize the included angle θp between the two vectors ωt and ωp in Fig. 8. Vector ωt is defined from the origin towards the target, and vector ωp from the origin towards a point on the Pareto front of the virtual space. The next candidate is given by x = argmin θp, where θp is given by

θp = arccos[ (ωt · ωp) / ( |ωt| |ωp| ) ].

In the second approach, each candidate j is assigned a scalar δj that measures the deviation of its predicted objectives from the target, and the option x = argmin(δj) is then chosen as the next experiment to be performed. Both methods transform the two-objective optimization problem into a single-objective optimization problem.
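Both scalarizations are easy to state in code. In the sketch below the predicted objectives and the target are invented numbers, and the second strategy is written with a Euclidean distance to the target as an assumed choice of δj.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([300.0, 25.0])        # e.g. (strength, ductility) goal; illustrative values
preds = rng.uniform([150.0, 5.0], [280.0, 20.0], size=(500, 2))   # predicted objectives

# Strategy 1: minimize the included angle between the target direction omega_t and each
# candidate direction omega_p (in Ref. [87] omega_p points to candidates on the Pareto
# front of the virtual space; here all candidates are scored for simplicity).
w_t = target / np.linalg.norm(target)
w_p = preds / np.linalg.norm(preds, axis=1, keepdims=True)
theta_p = np.arccos(np.clip(w_p @ w_t, -1.0, 1.0))
next_by_angle = np.argmin(theta_p)

# Strategy 2: minimize a scalar deviation delta_j of each candidate from the target.
delta = np.linalg.norm(preds - target, axis=1)
next_by_distance = np.argmin(delta)
```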

Fig. 8. The schematic of our design strategy, illustrating how to search for the best solution to meet the trade-off optimization requirements. Reproduced with permission from Ref. [87].

We utilized the two strategies to sample the heat treatment conditions of alloy ZE62 in the large parameter space. According to the recommendations, the corresponding heat treatments are performed and the strength and ductility are characterized via stress–strain curves. The active learning loop runs until the requirements are met. As a result, the strength and ductility of the alloy are increased by 27% and 13.5%, respectively, with only four new experiments. This work offers two sampling recipes for solving multi-objective optimization problems in designing the processing parameters of materials.

3.2.2.3. Material property curve/surface optimization

Example 5: efficient estimation of material property curves and surfaces. Material properties are affected by variables such as microstructure parameters, composition, heat treatment conditions, as well as the ambient environment. A material property curve or surface exhibits how the property changes and facilitates the production process. However, with each new component or materials parameter, the space of candidate experiments grows exponentially. Especially when the target curve/surface is very complex, with critical points, a large number of steps is required to determine such a curve/surface through experiments or calculations. Thus, learning from existing data, coupled with sampling the most informative options from the search space, allows the curve/surface to be constructed at the least cost. In this work,[80] we employ five kinds of exploration-based sampling methods, compared with random sampling, in different feature dimensions, including 1) one-dimensional cases: the fatigue life curve for 304L stainless steel (SS304L) and the liquidus line of the Fe–C phase diagram; and 2) multi-dimensional cases: the intermolecular potential of Ar–SH and the Curie temperature of BaTiO3 ferroelectrics modeled in 4D.

Figure 9 shows the optimization result for the fatigue life curve of SS304L steel. We start from a random training data set of n = 5 points and calculate the deviation of the predicted curve from the true one. The mean absolute error (MAE) and maximum absolute error (Max.AE) are tracked as a function of the number of new measurements for the different sampling methods. Figure 9 plots the average values and 95% confidence intervals of MAE and Max.AE for the different utility functions over 100 trials. Maximum variance (Max-v) and B.EGO, which search for the option with the maximum variability in the function, perform well. These two converge in relatively few iterations, followed by random sampling, which also converges but with more iterations. The trade-off methods EGO and KG decrease the error quickly in the first three iterations, but then relax slowly, nevertheless also converging after a large number of experiments. The greedy, pure exploitation Min-u shows very little relaxation after a few iterations.

Fig. 9. The performance of five utility functions compared with random selection in optimizing the fatigue life curve for SS304L steel. (a) and (b) Mean absolute error and maximum absolute error during the optimization process. Reproduced with permission from Ref. [80].

Here, the 1D case of the fatigue life curve for SS304L steel is reviewed; the performances of the different sampling strategies in the other cases are shown in Ref. [80]. In general, directed exploration via maximum variance performs better than the others in mapping the property curve. Following Max-v, the variability utility function B.EGO based on bootstrap samples is sometimes also a good performer. However, under some circumstances the trade-off methods employing various degrees of exploitation can perform at least as well as maximum variance (Max-v), and even better in a few cases. The choice of sampling method is sensitive to many factors, such as the distribution of the training and subsequently acquired data, the model performance, the noise, as well as the budget, which determines the number of iterations allowed.

4. Conclusion

We have provided an overview of sampling strategies, including uncertainty sampling as well as those based on directed exploration of the search space. Trade-off methods, which balance exploration with exploitation functions that use the best model outcomes, belong to the latter class and have been used widely in sequential decision making. Efficient sampling guides the selection of the optimal next experiment and is a key component in minimizing the number of experiments and hence the costs required. We have shown how several materials problems can be addressed by employing sampling strategies, and one of our conclusions is that maximum variance is a robust strategy across the several examples we have investigated. There is a need for further numerical and analytical studies to distill guiding principles for which strategies are best suited to different types of materials data.
