CHAPTER I
INTRODUCTION

I.1 Background

Data-driven science and engineering by Machine Learning (ML) is currently emerging as a popular approach that performs well in physics modeling tasks (Brunton et al., 2020). Attempts to apply ML methods in the engineering field actually began decades ago (Quinlan, 1986), but are currently resurfacing due to recent advances in computing capabilities and the so-called big data era, in which the abundance of data available in the engineering field enables exciting and complex applications such as object recognition (Bouarfa et al., 2020), predictive weather forecasting (Kim et al., 2021), medical diagnosis (Li et al., 2020), and autonomous vehicles (You et al.), and, in particular, predictive modeling in aerospace engineering and fluid science, where the phenomena are mostly high-dimensional and non-linear (Brunton et al., 2021).

Data-driven ML methods such as Polynomial Chaos Expansion (PCE), Gaussian Process Regression (GPR), and Artificial Neural Networks (ANN) usually perform better as more data are provided during model training, and some require a large amount of training data to perform well (L'Heureux et al., 2017). However, labeled data for training ML models in the aerospace engineering field are usually obtained from computational simulations or experiments, which in some cases, such as multi-physics numerical simulations (Ojha et al., 2019) or wind tunnel testing (Topbas et al., 2020), can be very expensive. Thus, recent research in the ML and engineering fields encourages the development of models that perform better in the small-data regime (Qi and Luo, 2022).
As ML methods become widely used, the interpretability (Carvalho et al., 2019) of a model becomes important, especially in applications that require justifiable reasons behind the model's predictions, such as the medical, military, and aerospace fields (Kapteyn et al., 2020), where data-driven models must be certifiable and verifiable for safety reasons (Miller, 2019). The ML models currently favored in engineering for their outstanding predictive performance, such as GPR and ANN, are mostly classified as black-box models, whose formulation is considered too complicated for humans or domain experts to understand or interpret in practical applications (Loyola-González, 2019). Black-box models often predict the correct answer for the wrong reasons, yielding good predictive performance on training data but failing to generalize to test data (Rudin, 2019). On the other hand, interpretable ML models, or so-called white-box models, are based on rules or mathematical expressions closer to human language that can be better understood by experts in their domain (Rudin et al., 2021).

Addressing the twin challenges of performing well with less training data and remaining interpretable, there exists an ML approach that performs better under little training data and provides more interpretability than commonly used ML methods such as PCE, GPR, and ANN: the Symbolic Regression (SR) approach (Vaddireddy et al., 2020). The SR approach optimizes both the structure and the parameters of an analytical model to obtain a mathematical expression whose behavior can be analyzed mathematically (periodicity, asymptotic behavior, etc.), thereby providing interpretability (La Cava et al.).
Previously, Cramer (Cramer, 1985) introduced the Genetic Programming (GP) method, further developed by Koza (Koza, 1990), which applies a Genetic Algorithm (GA) (Holland, 1992) to evolve a population of candidate mathematical expression trees, consisting of functions and operations over input variables and constants, into mathematical functions that fit the training data. Nonetheless, the GP approach suffers from the non-linear, complex structure of its parse trees: genetic operators acting at the tree level can generate structural impossibilities, producing invalid candidate solutions that reduce the algorithm's effectiveness in obtaining solutions. To overcome this invalid-solution issue, Ferreira (Ferreira, 2001) developed Gene Expression Programming (GEP), which uses simpler, linear solution representations that can encode trees of various sizes and shapes, similar to the parse trees of GP, but never produce invalid expressions.

However, despite the model-specific high interpretability of white-box models, their predictive accuracy is limited, especially on high-dimensional, non-linear, complex problems: it is commonly understood that the more interpretable a model is, the lower its complexity, and hence its predictive capability (Morocho-Cayamcela et al., 2019). The present challenge with using white-box models for the sake of interpretability is that, for more complex problems, their predictive accuracy is simply too low to be accepted in most applications, which makes black-box models more popular. There have been various attempts to make black-box predictions interpretable by separating the explanation from the ML model itself, via model-agnostic interpretation methods.
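To make the contrast between GP and GEP concrete, the following is a minimal sketch of how a GEP chromosome in Karva notation decodes, breadth-first, into an expression tree. The function set, head/tail lengths, and example gene are illustrative choices, not taken from any particular GEP implementation; the key property shown is that, because the tail contains only terminals, any head/tail string of the correct lengths decodes to a valid expression, avoiding the invalid offspring produced by GP's tree-level operators.

```python
# Minimal, illustrative sketch of GEP genotype decoding (not a full GEP run).
import operator

FUNCS = {"+": (operator.add, 2), "-": (operator.sub, 2),
         "*": (operator.mul, 2), "/": (lambda a, b: a / b if b else 1.0, 2)}

def decode(gene):
    """Read a Karva-notation gene breadth-first into a nested-list tree."""
    nodes = [[sym] for sym in gene]            # each node: [symbol, child, ...]
    queue, i = [nodes[0]], 1
    while queue:
        node = queue.pop(0)
        arity = FUNCS[node[0]][1] if node[0] in FUNCS else 0
        for _ in range(arity):                 # attach the next unused symbols
            node.append(nodes[i])
            queue.append(nodes[i])
            i += 1
    return nodes[0]

def evaluate(node, env):
    """Evaluate the decoded tree given terminal values in `env`."""
    sym = node[0]
    if sym in FUNCS:
        fn, _ = FUNCS[sym]
        return fn(*(evaluate(child, env) for child in node[1:]))
    return env[sym]

# Head "+*a" (length 3) plus tail "bcab" (terminals only, length 4):
# the gene decodes to (b*c) + a; unused tail symbols are simply ignored.
gene = "+*abcab"
tree = decode(gene)
print(evaluate(tree, {"a": 2.0, "b": 3.0, "c": 1.0}))  # prints 5.0
```

Note that mutating any head symbol to another function or terminal, or any tail symbol to another terminal, still yields a decodable gene, which is precisely the robustness GEP gains over raw parse-tree manipulation.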
With model-agnostic methods, interpretability does not depend on the structure of the model itself, as in model-specific methods, but only on the model's predictions; thus any ML model, whether white- or black-box, can be used (Ribeiro et al., 2016b). The model-agnostic methods used in this research are the widely used Partial Dependence Plot (PDP), Individual Conditional Expectation (ICE), Local Interpretable Model-agnostic Explanations (LIME), and Shapley Additive Explanations (SHAP) (Molnar, 2022).

Addressing the aforementioned demands for model interpretability and better predictive performance with less training data, especially in the aerospace engineering field, this work aims to demonstrate the application of the SR approach, in particular GEP, as an ML alternative that outperforms black-box models under less training data owing to better extrapolation capabilities, and that can even be used to improve the performance of black-box models such as GPR. This work also aims to demonstrate how white-box models provide model-specific interpretability, but that as problems grow more complex and higher-dimensional, white-box models eventually fall short in predictive performance and black-box models must be used. Moreover, when resorting to black-box models for their better predictive performance, their predictions can still be interpreted with model-agnostic interpretation methods. Three case studies are presented: flutter prediction as a less complex problem, and wind turbine blade design and re-entry trajectory analysis as more complex problems.

I.2 Problem Formulation

Based on the research background, the research problems that this work aims to resolve are defined as follows.

1.
Could gene expression programming provide better predictive performance than other machine learning alternatives in less-training-data environments, or even be used to improve other models' performance?
2. How can white-box models provide model-specific interpretation, and for what kinds of problems are they suitable?
3.
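To illustrate the model-agnostic idea underlying the methods named in Section I.1, the sketch below computes a one-dimensional partial dependence: for each grid value of a chosen feature, that feature is overwritten across the whole dataset and the model's predictions are averaged, marginalizing out the other features. The toy model, data, and function names here are placeholders of the author's own, not part of this work's case studies; in practice `predict` would be any fitted black-box model such as a GPR or an ANN.

```python
# Minimal sketch of a 1-D Partial Dependence Plot (PDP) computation.
import numpy as np

def partial_dependence(predict, X, feature, grid):
    """Average predictions over X while sweeping one feature along `grid`."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # force the feature to the grid value
        pd_values.append(predict(X_mod).mean())
    return np.array(pd_values)

# Toy "black-box" model standing in for a fitted predictor: f = 3*x0 + x1**2.
predict = lambda X: 3 * X[:, 0] + X[:, 1] ** 2
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))

grid = np.linspace(-1.0, 1.0, 5)
pd0 = partial_dependence(predict, X, feature=0, grid=grid)
# pd0 rises linearly with slope 3 (plus a constant offset from x1**2),
# recovering the additive effect of x0 without inspecting the model's form.
```

Plotting each row's prediction curve separately, instead of averaging, gives the corresponding ICE curves; LIME and SHAP likewise query only the prediction function, which is what makes all four methods model-agnostic.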