Abstract
Deep generative models have shown significant promise in improving performance in design space exploration, but there is limited understanding of their interpretability, a necessity when model explanations are desired and problems are ill-defined. Interpretability involves learning the design features behind design performance, a process we call designer learning. This study explores the effects of human–machine collaboration on designer learning and design performance. We conduct an experiment (N = 42) designing mechanical metamaterials using a conditional variational autoencoder. The independent variables are: (i) the level of automation of design synthesis, i.e., manual (where the user manually manipulates design variables), manual feature-based (where the user manipulates the weights of the features learned by the encoder), and semi-automated feature-based (where the agent generates a local design based on a start design and a user-selected step size); and (ii) feature semanticity, i.e., meaningful versus abstract features. We assess feature-specific learning using item response theory and design performance using utopia distance and hypervolume improvement. The results suggest that design performance depends on the subjects’ feature-specific knowledge, emphasizing the precursory role of learning. The semi-automated synthesis locally improves the utopia distance, but it does not yield higher global hypervolume improvement than manual design synthesis, and it reduces designer learning compared to manual feature-based synthesis. The subjects learn semantic features better than abstract features only when design performance is sensitive to them. Potential cognitive constructs influencing learning in human–machine collaborative settings, such as cognitive load and recognition heuristics, are discussed.
1 Introduction
Deep learning methods have been applied to a variety of engineering design problems such as airfoil design [1], structural design [2–4], and metamaterial design [5]. Indeed, design space exploration (DSE) using deep learning, referred to as deep generative design, creates novel designs efficiently and shows improvements over traditional optimization methods [6,7]. Deep learning methods can effectively optimize well-defined design performance metrics and meet quantitative requirements under constraints [8]. Deep generative design helps by distilling high-dimensional input data into low-dimensional representations, which we call features. However, to be useful for deriving insights, the features need to be understood by designers, a key requirement for model interpretability. In this context, designer learning includes identifying successful designs, understanding the features behind successful designs and constraints, and knowing the analogical association between designs [9,10]. Knowing driving features can aid in directing the exploration of a large design space [11–13]. Understanding key features is also necessary when designers must explain design decisions to stakeholders, which requires rationales, especially in the early design phase. This learning process is a prerequisite when a design problem is ill-defined, and the knowledge from early design tasks needs to be transferred to subsequent design processes.
In this paper, we highlight that bringing a human designer into the DSE can potentially better balance a designer’s learning and design performance compared to a “black-box”-only optimization [14,15]. Existing approaches to human-in-the-loop design space exploration take designer inputs on design and feature selection [16–18] and present feedback on performance metrics and the diversity of generated designs [3,19]. The interaction between a human designer and the computer includes visual graphical user interfaces [16], natural language processing interfaces for question answering and textual explanations [20], and tangible physical interfaces [14]. Despite this progress, there is a limited understanding of whether generative design methods are effective for human learning. Specifically, the evaluation of how and whether different types of features (semantic versus abstract) and modes of interaction improve designer learning and design performance is limited.
Existing approaches apply automation to different functions such as design search, design analysis, and design evaluation [21]. Within each function, the level of automation can vary from low to high, i.e., from manual to fully automatic. This paper explicitly analyzes the level of automation of design exploration. The type of feature is also associated with engineering significance. Features are information sets that refer to the form, function, material, or precision attributes of a part [22]. The semantic nature of a feature can exploit an individual’s dense prior knowledge, which abstract, data-driven features lack [23]. Therefore, the research objective of this paper is to quantify the effects of changing two interactivity-related factors in DSE: (i) the level of automation of the search function, e.g., whether a designer generates a design manually from its constituent parts (low automation), manually from predefined features, or automatically using a deep generative design method (high automation); and (ii) the semanticity of features, e.g., features can have a semantic meaning, or they can be abstract latent features output by a generative design algorithm. The evaluative criteria for the level of automation and the feature type are designer learning and design performance.
The approach follows a human subject experiment and a quantitative measurement of designer learning and design performance. A conditional variational autoencoder (C-VAE) [24] enables the generative design of mechanical metamaterials with strength-based and density-based objectives. The human subject experiment instantiates variations of the C-VAE based on the independent variables under study. Overall, the experimental data include 42 subjects from a within-subject experiment. We measure design performance with established multi-disciplinary design optimization measures. Similarly, a questionnaire measures designer learning after each experimental condition separately [25,26]. An item response theory (IRT) model estimates the subjects’ feature-specific “abilities” based on the questionnaire responses [27].
This study contributes critical behavioral insights and an IRT model for assessing learning in interactive deep generative design.
The analysis reveals the intertwined nature of design performance and designer learning, a hypothesis promoted by Sim and Duffy [9,28,29]. For example, the semi-automated synthesis mode with a high level of automation adversely impacts the subjects’ feature-specific learning and overall design performance. Barriers to feature learning likely diminish the designer’s ability to generate better designs.
The study identifies behavioral patterns in how individuals learn about feature importance in interactive deep generative design. The positive influence of semanticity on how much the subjects learn and their performance depends on the features’ performance sensitivity. The higher the performance sensitivity due to a feature, the higher the related learning. These insights can help design better interactive and learning-focused deep generative design tools.
The paper also contributes a unique approach combining experiments and IRT to evaluate component-level learning of features. There have been qualitative approaches to assess feature understanding [30,31]. However, the presented method is the first to develop a quantitative IRT model for evaluating learning in a human–machine collaborative setting. Provided that a questionnaire is implemented and relevant design features are embedded in the test questions, the IRT model can scale to other design problems for evaluating feature-specific abilities.
The rest of the paper is structured as follows. Section 2 reviews existing interactive methods for generative design and related research studies. Section 3 presents the mathematical details of the C-VAE-based interactive tool, the IRT model, and the experiment design. Section 4 provides results from the analysis of the experimental data. Section 5 explains the main findings and suggests future design support tools. Section 6 presents the conclusion. The developed tool is available.2
2 Related Work
2.1 Interactive Generative Design Methods.
Generative design refers to computational design methods that can automatically conduct DSE under constraints [4,8]. Generative design methods create multiple optimal designs by varying the weights of multiple objectives and the design parameters using gradient-based (e.g., stochastic gradient descent) or gradient-free (e.g., genetic algorithms) techniques. This paper focuses on deep generative design, which refers to algorithms that generate new designs using deep learning [32]. Deep neural networks (DNNs) and convolutional neural networks (CNNs) are frequently used to build surrogate models for engineering problems due to their high performance in learning patterns from images to recognize objects. A CNN consists of convolution and pooling layers, with fully connected layers at the end. Variational autoencoders use multiple CNNs. An encoder network transforms inputs into low-dimensional latent features. A decoder network reconstructs a design from the features, maximizing similarity to the original inputs [33].
There are many variations in how functions are allocated between humans and algorithms in the optimization. This paper includes one option, in which the user selects the step size; other options include the user repairing designs to make them feasible, giving feedback to the agent about its designs, or providing additional constraints to the optimizer in real time. More formally, existing interactive generative design approaches vary along the dimensions of input type, knowledge outcomes, and the type of human–machine interface. Table 1 presents a description of these dimensions. The input type pertains to how user feedback is incorporated. A user might steer design exploration by choosing a desired set of designs or design variable values [34], select design parameters or features [17,35], or set a range for desired objective values [15,19]. Using high-level rather than low-level features can reduce the number of input commands. In the context of detailed parametric tasks, generative design methods utilize form features, material features, precision features, or primitive features [22]. Furthermore, generative design typically attempts to solve multi-objective problems with multiple conflicting design criteria. Any desired DSE outcomes must be represented as a particular loss function. Existing generative design methods can optimize predefined performance metrics like structural compliance [36,5,4], maximize the diversity of generated designs [3,7,8], or learn driving features behind selected designs [17,35]. Finally, the mode of interaction between a user and the underlying tool can be a graphical user interface, natural language interface, or tangible interface. The human–machine interface also provides feedback to the user through visual representations of generated designs [37–40], explanations about evaluation models and driving features [19], and evaluations of physical prototypes [14]. This paper adopts a methodology where a user makes parametric changes in the feature space to optimize performance and learning objectives using visual design representations on a graphical user interface.
Table 1 Dimensions of existing interactive generative design approaches

| Dimension | Categories | Examples |
|---|---|---|
| Input type | Design space | Choose desired designs from generated designs to guide further exploration |
| | Feature space | Parametric change or selection made to latent embedding or feature values |
| | Objective space | A user selects a desired range or values for a specific design objective |
| DSE outcomes | Performance-driven | Generated designs maximize fixed performance metrics or converge towards the true Pareto front |
| | Diversity-driven | Designs are generated to increase diversity in decision/feature and objective spaces |
| | Learning-driven | Designs are generated to learn the main aspects driving the problem, such as sensitivities or features common among Pareto designs |
| Human–machine interface | Graphical user interface | Generated designs and/or features are visualized as images or graphs, and objective/feature spaces are visualized as scatter plots |
| | Natural language interface | A user asks questions through voice or a chatbox, and the agent provides answers such as explanations about a design’s performance |
| | Tangible interface | A user creates designs by manipulating a physical representation (e.g., wooden blocks on a tabletop) while visualizing the tradespace information on a computer screen |
2.2 Interpretability Approaches for Deep Generative Models.
Interpretability broadly refers to the ability to present and explain, in understandable terms, the cause-and-effect relationship between the inputs and outputs of a machine learning model [41,42]. The need for interpretability arises when predictions and calculated metrics do not suffice for making informed decisions. For example, in conceptual design, the real-world objectives are challenging to quantify, decision risks or stakes are high, and there are tradeoffs between objectives. The reasons for analyzing interpretability are to explain complex deep learning models, enhance the fairness of model results, create white-box models, or test the robustness/sensitivity of predictions [43]. The scope of interpretability is broader than that of explainability, which refers to explanations of the internal logic and mechanisms of deep learning models. We focus on evaluating the interpretability of existing deep learning models, given their prevalence in the current design literature.
The type of interpretability evaluation depends on the machine learning task and whether real humans are involved in experiments. Application-grounded, human-grounded, and functionally grounded evaluations are three types of evaluation approaches [42]. Application-grounded evaluation conducts experiments with domain experts within a real-world application, e.g., public testing of self-driving cars. Such evaluations require substantial time and effort but are sometimes necessary for real-world validation. Human-grounded evaluations involve experiments with lay humans in the real world with simplified tasks. Human experiments test hypotheses by questioning human participants about their preference between different explanations or by identifying correct predictions from the presented input/explanation. Finally, functionally grounded evaluations use a formal definition of interpretability for automated interpretations. This approach requires defining quantitative metrics as proxies for interpretability. Functionally grounded metrics are applicable when working with models that have been validated, such as by human experiments. The survey by Linardatos et al. [43] summarizes various model-specific and model-agnostic methods for functionally grounded evaluation. Some model-specific methods analyze gradients of outputs with respect to inputs to find salient features, e.g., sensitivity analysis for DNNs [44], deep learning important features (DeepLIFT) [45], and visual explanations for CNNs [46]. Some model-agnostic methods compute importance values for input features within predictions, e.g., local interpretable model-agnostic explanations (LIME) [47] and Shapley additive explanations (SHAP) [48].
In this study, we use human-grounded evaluation of interpretability for interactive deep generative design because such methods have received limited behavioral validation. On a related point, Sec. 2.3 points to conflicting findings on the effectiveness of deep generative design.
2.3 Learning and Performance Outcomes of Deep Generative Design.
The effectiveness of interactive generative design tools may depend on task complexity, usability, and users’ expertise. Viros i Martin and Selva [49] compare two versions of a human–machine agent with a natural language interface varying in functionality and level of proactiveness. An “assistant” version only answers technical questions from the designer (e.g., querying databases and doing data analysis), whereas the “peer” version provides recommendations to improve design solutions. They find that more interactions with the tool in both versions improve design performance and learning. Recent research finds that the collaboration of a human and a computer agent significantly improves design performance compared to human-only or agent-only processes [14,19,50–52]. A body of research on hybrid human–machine teams [53,54] finds that low-performing players benefit from decision support, but this support can be overly conservative for high-performing players. A cognitive agent boosts the performance of low-performing teams in a changing problem setting but hurts the performance of high-performing teams [40]. Also, the evidence in Ref. [15] highlights differences in learning between expert designers and novices.
Despite being a crucial part of design decision-making, designer learning has received little attention in evaluating deep generative models. Recent studies have proposed approaches to measuring knowledge. One general approach is to pose questions testing individuals’ specific learning about the problem at hand, as demonstrated by Bang and Selva [30] for tradespace exploration. Related to this, IRT offers a consistent approach to estimating concept-specific ability from observations of test responses. Mathematically, IRT defines the functional relationship between the individual’s ability/knowledge on a topic and the likelihood that they correctly answer questions on the same topic [25]. The simplest IRT model uses a scalar ability parameter and the binary response, related through a Sigmoid function. More complex IRT models have been applied for estimating multi-dimensional ability levels, which can be either independent of each other [26] or interconnected in a Bayesian network [27]. This work presents a new variation of IRT for quantifying learning from design space exploration.
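For reference, the simplest such model (a Rasch-type one-parameter logistic model) relates a subject's scalar ability and a question's difficulty to the probability of a correct binary response; the difficulty parameter is standard in this family of models, though the text above only specifies the scalar ability and sigmoid link:

```latex
P(u_{ij} = 1 \mid \theta_i, b_j) = \frac{1}{1 + e^{-(\theta_i - b_j)}}
```

Here, $u_{ij}$ is subject $i$'s binary response to question $j$, $\theta_i$ is the ability, and $b_j$ is the question difficulty; multi-dimensional variants replace $\theta_i$ with one ability per concept.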
3 Methodology
Interactive design space exploration implicitly supports learning and performance goals by allowing visualization, generation, and evaluation of alternative designs. An example of a learning goal is identifying the driving features that make up good designs such as Pareto-optimal designs. An example of a performance goal is maximizing one or more design objectives.
3.1 Implementation of the Interactive Deep Generative Design.
We use a conditional variational autoencoder (C-VAE) to represent the relationships between designs, features, and objectives. Figures 1(a) and 1(b) present the network structure of C-VAE. Suppose a vector or matrix x represents a design. Features z1 are predefined, deterministic functions of design x mainly representing mechanical and geometric features of designs such as shape and size. Grayscale image I denotes a visual representation of design x, with each pixel taking a value between [0,1].
Two neural networks, an encoder and a decoder, operate on image I as part of the variational autoencoder. The encoder network E : {I, z1} → {μa, σa} converts image I and the predefined features z1 into mean μa and standard deviation σa vectors that have the same length as the latent dimension. A latent feature vector z2 is a sample from the normal distribution N(μa, σa) with those mean and standard deviation vectors. The decoder network D then transforms the predefined features z1 and the latent features z2 into a reconstructed image Î. Furthermore, separate neural networks post-process the C-VAE outcomes. First, the adaptation network A reformats the grayscale image Î into a binary array of the same size as x, resulting in a reconstructed design x̂ = A(Î). Second, the regression network R predicts the design objective values ŷ = R(z1, z2) from the features. The network structures in Fig. 1(b) include operators such as 2D convolution (Conv2D), 2D transposed convolution (Conv2DT), linear transformation (Dense), and activation functions such as rectified linear units (ReLU) and the sigmoid function (Sigm) [55]. The dense layer does not have an activation function.
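To make the structure concrete, below is a minimal PyTorch-style sketch of the encoder, decoder, and regression networks described above. The channel counts, kernel sizes, and layer sizes are illustrative assumptions; only the 28 × 28 image size and the five predefined and five latent features follow the text, and the exact architecture of Fig. 1(b) is not reproduced.

```python
import torch
import torch.nn as nn

LATENT_DIM = 5   # abstract latent features z2 (five, per Sec. 3.3.3)
COND_DIM = 5     # predefined semantic features z1 (five, per Sec. 3.3.3)

class Encoder(nn.Module):
    """E: {I, z1} -> {mu, sigma} for the latent features z2."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten())
        self.fc_mu = nn.Linear(32 * 7 * 7 + COND_DIM, LATENT_DIM)
        self.fc_logvar = nn.Linear(32 * 7 * 7 + COND_DIM, LATENT_DIM)

    def forward(self, image, z1):
        h = torch.cat([self.conv(image), z1], dim=1)
        return self.fc_mu(h), self.fc_logvar(h)

class Decoder(nn.Module):
    """D: {z1, z2} -> reconstructed 28x28 grayscale image."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(LATENT_DIM + COND_DIM, 32 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),    # 7x7 -> 14x14
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())  # 14x14 -> 28x28

    def forward(self, z1, z2):
        h = self.fc(torch.cat([z1, z2], dim=1)).view(-1, 32, 7, 7)
        return self.deconv(h)

class Regressor(nn.Module):
    """R: {z1, z2} -> predicted objectives (vertical stiffness, volume fraction)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + COND_DIM, 32),
                                 nn.ReLU(), nn.Linear(32, 2))

    def forward(self, z1, z2):
        return self.net(torch.cat([z1, z2], dim=1))

def reparameterize(mu, logvar):
    """Sample z2 ~ N(mu, sigma) with the reparameterization trick."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```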
We propose three modes of human–machine collaboration concerning the level of automation of search decisions a designer must take:
(Manual design synthesis) A user defines design x with all its constituent parts;
(Manual feature-based design synthesis); and
(Semi-automated feature-based design synthesis).
3.1.1 Manual Design Synthesis.
The first collaborative mode involves a user manually creating a design x with all its constituent components and the C-VAE evaluating the design objectives ŷ. A user only sees the design objective outputs and does not observe the intermediate latent features z2.
3.1.2 Manual Feature-Based Design Synthesis.
In the second mode, a user manually selects features z1, z2 individually to generate designs with the decoder network D. Selecting feature values is done one feature at a time, either from predefined features or latent features. For every feature value adjustment, the C-VAE automatically generates a new design. A user can further decide whether to evaluate design objectives at any generated design or not. Suppose initial design x and its features z1 and z2 are given. A user makes Δz2 change to the latent features and evaluates the new design corresponding to the updated latent features z′2 = z2 + Δz2. The output of this process is a newly reconstructed design x′ = A(D(z1, z′2)) and its design objective values y′ = R(z1, z′2).
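As a sketch, one evaluation step in this mode can be written as a small function; `decoder`, `adapter`, and `regressor` below are hypothetical callables standing in for the networks D, A, and R defined in Sec. 3.1:

```python
def evaluate_feature_change(z1, z2, delta_z2, decoder, adapter, regressor):
    """Apply a user-selected change to the latent features and return the
    reconstructed design and its predicted objectives (Sec. 3.1.2)."""
    z2_new = z2 + delta_z2            # z2' = z2 + delta z2
    image_hat = decoder(z1, z2_new)   # reconstructed image D(z1, z2')
    x_new = adapter(image_hat)        # x' = A(D(z1, z2')): binary design array
    y_new = regressor(z1, z2_new)     # y' = R(z1, z2')
    return x_new, y_new
```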
3.1.3 Semi-Automated Feature-Based Design Synthesis.
In the third mode, a user selects an initial design and the maximum amount of change desired with respect to that design, expressed as a step size that bounds the search neighborhood in the feature space. The tool then runs a gradient-descent search with a fixed number of iterations (≈50) to find the feature changes within that neighborhood that best improve the predicted design objectives, and it reconstructs and displays the resulting design together with the corresponding feature differences. As in the other modes, the user decides whether to evaluate the suggested design.
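A minimal sketch of one way such a local search could be implemented, using the hypothetical `regressor` and `decoder` callables from the earlier sketches and a simple weighted objective; the paper's exact formulation (Eq. (1)) is not reproduced here, so the projection step, learning rate, and objective weights are assumptions:

```python
import torch

def semi_automated_step(z1, z2_init, step_size, regressor, decoder,
                        n_iter=50, lr=0.05, weights=(1.0, -1.0)):
    """Gradient-based search for latent features near z2_init that improve a
    weighted combination of predicted objectives (maximize stiffness, minimize
    volume fraction), constrained to a ball of radius `step_size`."""
    z2 = z2_init.clone().detach().requires_grad_(True)
    w = torch.tensor(weights)
    opt = torch.optim.Adam([z2], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        y_hat = regressor(z1, z2)          # predicted [stiffness, volume fraction]
        loss = -(w * y_hat).sum()          # ascend stiffness, descend volume fraction
        loss.backward()
        opt.step()
        with torch.no_grad():              # project back into the allowed neighborhood
            delta = z2 - z2_init
            norm = delta.norm()
            if norm > step_size:
                z2.copy_(z2_init + delta * (step_size / norm))
    return decoder(z1, z2.detach()), z2.detach()
```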
3.2 Measures of Design Performance and Designer Learning
3.2.1 Multi-objective Performance Metrics.
We use three established performance measures calculated from the values of the design objectives. First, hypervolume improvement is a measure commonly used in multi-objective optimization [56]. Given a set S of points (e.g., the output of a design search process), the hypervolume indicator of S is the area (in the 2D case) of the union of the regions of the objective space dominated by each point in S and bounded by a user-defined reference point. The reference point is at or near the anti-utopia point, i.e., the smallest value for each objective, assuming the problem requires objective maximization.
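As an illustrative sketch (not the paper's implementation), the 2D hypervolume indicator and the resulting improvement can be computed as follows, assuming both objectives are rescaled so that larger is better (e.g., using 1 − volume fraction) and normalized to [0, 1], with the reference point at the anti-utopia point:

```python
import numpy as np

def hypervolume_2d(points, ref=(0.0, 0.0)):
    """2D hypervolume: area dominated by `points` and bounded by `ref`,
    assuming both objectives are to be maximized."""
    pts = np.asarray(points, dtype=float)
    pts = pts[(pts[:, 0] > ref[0]) & (pts[:, 1] > ref[1])]   # keep points beyond ref
    if len(pts) == 0:
        return 0.0
    pts = pts[np.argsort(-pts[:, 0])]                        # sort by objective 1, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                                       # non-dominated slice contributes
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def hypervolume_improvement(final_front, initial_front, ref=(0.0, 0.0)):
    """Improvement of the final Pareto front over the initial front."""
    return hypervolume_2d(final_front, ref) - hypervolume_2d(initial_front, ref)
```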
Second, a metric based on credit assignment strategies from multi-armed bandit theory evaluates designs more locally [57]. If an initial design x is modified to produce a new design x′, then the value of x′ is determined based on whether or not x′ dominates x. If the new design dominates the initial one, i.e., if it is better than the initial design in all objectives, it receives a value of 1. Conversely, if the initial design dominates the new design, the new design receives a score of 0. If neither design dominates the other, the new design receives a score of 0.5.
Third, the distance to the utopia point is the closest distance between the generated designs set S and the utopia point. In maximization, the utopia point has coordinates equal to the largest possible objective values, whereas in minimization, the utopia point has the smallest possible objective values as its coordinates.
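Minimal sketches of these two local measures, again under a maximize-both convention (in the paper's raw objective space, with stiffness maximized and volume fraction minimized, the utopia point is [1, 0]); the function names are illustrative:

```python
import numpy as np

def dominance_credit(y_new, y_init):
    """Credit for a new design relative to the design it modifies: 1 if better
    in all objectives, 0 if worse in all objectives, 0.5 otherwise (Sec. 3.2.1)."""
    y_new, y_init = np.asarray(y_new), np.asarray(y_init)
    if np.all(y_new > y_init):
        return 1.0
    if np.all(y_init > y_new):
        return 0.0
    return 0.5

def utopia_distance(designs, utopia=(1.0, 1.0)):
    """Closest Euclidean distance between a set of design objective vectors
    and the utopia point."""
    designs = np.asarray(designs, dtype=float)
    return float(np.min(np.linalg.norm(designs - np.asarray(utopia), axis=1)))
```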
3.2.2 Designer Learning: Feature-Specific Abilities.
After exploring the design space, we implement a psychometric assessment approach to measure designer learning. This approach involves multiple-choice questions and an IRT model to estimate the feature-specific ability from an individual’s responses. The feature-specific ability measures the degree to which a designer understands the effect of that feature on design objectives. These features are the same as the predefined and latent features in the C-VAE in Sec. 3.1. We use two types of questions to assess a person’s knowledge: (i) design comparison and (ii) feature identification [30]. A design comparison question includes two given designs (say A and B) and requires a person to choose the design they think has a higher value for a given objective. For a given pair of designs (say A and B), a subject selects one of four choices: (i) “Option A,” (ii) “Option B,” (iii) “Minimal difference,” and (iv) “Not sure.” The “Not sure” option reduces the likelihood of a false-positive response. Furthermore, a feature identification question tests a person’s ability to correctly identify a particular feature’s effect on a design objective in the context of adding the feature to a specific design. The question assumes that only the given feature changes value while other features are kept constant. In response to how the given objective changes, the person chooses one of four options: “Increases,” “Decreases,” “Minimal change,” and “Not sure.”
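As an illustration of how feature-specific abilities can be estimated from binary-scored responses, the sketch below fits one ability per feature for a single subject using a simple logistic (Rasch-type) likelihood with a standard normal prior. The data layout, the assumed question difficulties, and the independent-ability structure are assumptions for illustration, not the exact IRT model used in this study:

```python
import numpy as np
from scipy.optimize import minimize

def fit_feature_abilities(responses, question_feature, n_features, difficulty=None):
    """MAP estimate of one ability per feature for a single subject.
    responses[j]: 1/0 score for question j; question_feature[j]: index of the
    feature tested by question j; difficulty[j]: assumed (pre-calibrated)
    question difficulty; a standard normal prior regularizes the abilities."""
    responses = np.asarray(responses, dtype=float)
    q_feat = np.asarray(question_feature)
    b = np.zeros(len(responses)) if difficulty is None else np.asarray(difficulty)

    def neg_log_posterior(theta):
        logits = theta[q_feat] - b                              # Rasch-type 1PL model
        p = np.clip(1.0 / (1.0 + np.exp(-logits)), 1e-9, 1 - 1e-9)
        log_lik = np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))
        log_prior = -0.5 * np.sum(theta ** 2)                   # theta_k ~ N(0, 1)
        return -(log_lik + log_prior)

    result = minimize(neg_log_posterior, x0=np.zeros(n_features), method="L-BFGS-B")
    return result.x                                             # ability estimate per feature

# Example: six questions covering three features for one subject
abilities = fit_feature_abilities(
    responses=[1, 1, 0, 1, 0, 0],
    question_feature=[0, 0, 1, 1, 2, 2],
    n_features=3)
```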
3.3 Human Subject Experiment
3.3.1 Mechanical Metamaterial Design Problem.
The experimental task involves the design of 2D mechanical metamaterials. A mechanical metamaterial is an artificially engineered structure of lattice units replicated in all directions. The lattice topology exhibits unique and tunable properties. Surjadi et al. [58] discuss unique properties of metamaterials and their applications in structural design and additive manufacturing.
We consider designs consisting of a unit cell structure repeating horizontally and vertically. The unit cell structure consists of multiple links joining nodes in a 3 × 3 grid in the XY plane. This defines a design space of 2^36 possible designs, which is reduced to 2^28 unique designs if we account for duplicates due to the replication of the unit cell structure in 2D space. Such a design problem presents the right level of complexity for student subjects to develop problem understanding.
A metamaterial is evaluated using two design objectives: maximize vertical stiffness (which relates to strength) and minimize volume fraction (which relates to weight). The default model for computing stiffness is a Hooke’s law-based truss stiffness model (termed the “truss model”), taken from Ref. [59, Ch. 9]. The 2D metamaterial is treated as a truss structure in which each member is assumed to experience only axial forces. The individual stiffness matrix of each member is determined from the Hooke’s law relationship. In cases where the truss model fails due to isolated members, a lower-fidelity fiber stiffness model (called “the fiber model”) is employed, based on Cox [60]. The fiber model considers each member as a fiber and computes a length-normalized approximation of the effective stiffness.
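For illustration, the building block of such a truss model is the element stiffness matrix of a 2D axial (pin-jointed) member. The sketch below uses the standard textbook formulation with assumed unit modulus and cross-sectional area; assembly of the global stiffness matrix, boundary conditions, and the extraction of vertical stiffness are omitted and follow the cited reference rather than this sketch:

```python
import numpy as np

def truss_element_stiffness(p1, p2, E=1.0, A=1.0):
    """4x4 global-coordinate stiffness matrix of a 2D axial member between
    nodes p1=(x1, y1) and p2=(x2, y2): k = (E*A/L) * outer([c, s], [c, s])
    expanded to the degrees of freedom of both nodes."""
    (x1, y1), (x2, y2) = p1, p2
    L = np.hypot(x2 - x1, y2 - y1)          # member length
    c, s = (x2 - x1) / L, (y2 - y1) / L     # direction cosines
    return (E * A / L) * np.array([[ c*c,  c*s, -c*c, -c*s],
                                   [ c*s,  s*s, -c*s, -s*s],
                                   [-c*c, -c*s,  c*c,  c*s],
                                   [-c*s, -s*s,  c*s,  s*s]])
```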
The design problem also requires a feasibility constraint that metamaterials should satisfy: No two links in a unit cell should intersect, except at nodes; and a resulting metamaterial should be connected in the sense of a network graph, i.e., it should not have any disconnected subcomponents.
3.3.2 User Interface.
Figure 2 presents the interactive design exploration platform used in the experiment. On the top panel, the tradespace plot shows a collection of existing designs (denoted by circles) and user-generated designs (denoted by triangles). A user may click on any design in the tradespace plot to visualize its details, including the unit cell structure on the bottom left panel (design visualization panel). On the bottom right panel, a user generates a new design through one of the three modes of interaction described in Sec. 3.1: manual design synthesis (labeled “Change Design” in the figure), manual feature-based design synthesis (“Change Feature”), and semi-automated feature-based design synthesis (“Auto Feature Changes”). The manual design synthesis allows users to create a metamaterial design by specifying a unit cell structure. Upon clicking the “Test metamaterial” button, the tradespace plot displays the objective values of the new design, and the design visualization panel shows the newly tested design. In the manual feature-based design synthesis mode (see Fig. 3), a user selects a change in individual features and considers the effect of the feature change on the selected design and its objectives. A newly generated design is updated in real time on the design visualization panel whenever there is a change in feature values. The user must click the “Test metamaterial” button to evaluate the generated design. In the semi-automated feature-based design synthesis mode, a user selects the maximum amount of change desired with respect to the initial design, and the C-VAE predicts the best possible design within that neighborhood of the initial design, according to Sec. 3.1.3. A newly generated design and the differences in features for the set change are visualized in real time. Here again, the user must intentionally click the “Test metamaterial” button to evaluate a newly generated design.
3.3.3 Experiment Design.
Table 2 presents the experiment protocol, including the order of the design synthesis tasks and learning tests. The experimental conditions vary in two independent variables: (i) the level of automation and (ii) the semanticity of features. The level of automation involves the interaction between a designer and the conditional variational autoencoder (C-VAE). That is, a subject completes one of the following tasks at any given time: the manual design synthesis (task 1), the manual feature-based design synthesis (task 2), and the semi-automated feature-based design synthesis (task 3), as described in Sec. 3.1. In each task, the user can only generate new designs using one functionality. Furthermore, the predefined features z1 in the C-VAE have semantic meanings (semantic features), whereas the latent features z2 are mathematical variables (abstract features). Table 3 provides a brief description for individual features. We select five semantic features for the mechanical metamaterial design problem: horizontal lines, vertical lines, diagonal lines, triangles, and three-star nodes. The other five abstract features are the outputs from the encoder network, with the probability distribution approximately equal to the standard normal distribution for each one.
Table 2 Experiment protocol

| Part | Activity | Description |
|---|---|---|
| I | Design pretest | 16 design comparison questions |
| II | Task 1 | Manual design synthesis (8 min) |
| | Design posttest | 16 design comparison questions |
| III | Task 2 | Manual feature-based design synthesis (8 min) |
| | First feature test | 20 feature identification questions |
| IV | Task 3 | Semi-automated feature-based design synthesis (8 min) |
| | Second feature test | 20 feature identification questions |
Note: The protocol also implements the reverse order (Part I, IV, III, and II) for approximately 20 of 42 subjects.
Table 3 Description of the abstract (C-VAE-derived) and semantic features and the sensitivity of each design objective to them

| Feature | Description | Volume fraction sensitivity | Vertical stiffness sensitivity |
|---|---|---|---|
| Feature 1 | Derived from C-VAE | None | None |
| Feature 2 | Derived from C-VAE | None | High |
| Feature 3 | Derived from C-VAE | High | None |
| Feature 4 | Derived from C-VAE | None | None |
| Feature 5 | Derived from C-VAE | None | None |
| Horizontal lines | Number of horizontal links connecting two nodes | High | None |
| Vertical lines | Number of vertical links connecting two nodes | High | High |
| Diagonals | Number of inclined links connecting two nodes | High | High |
| Triangles | Number of three links (any orientation) connecting three nodes to each other | High | None |
| Three-stars | Number of three links connecting a single center node to three outer nodes separately | High | None |
The experiment involves 42 junior-, senior-, and graduate-level students from engineering disciplines at Texas A&M University. Each subject completes the three tasks with different levels of automation. The order of the three tasks, given by Table 2, is reversed for about half of the subject pool so that no design synthesis task always follows the same task in both orders. This setup helps counterbalance order effects. Task 1 with a pretest is always conducted at the start of the experiment. For every subject, the total of ten semantic and abstract features is randomly divided into two groups of five each, one for task 2 and the other for task 3. Because the features are randomly assigned to different tasks for each subject, all ten features still appear in every task over the entire subject population. This within-subject design ensures that each subject completes all three tasks and sees five abstract features and five semantic features at some point between tasks 2 and 3. We can still partition the collected data into different levels of automation and types of features. The design allows us to study relative differences between the experimental conditions. The total experiment lasts about 45 min, and each subject receives a fixed payment of 20 USD at the end. The subjects must spend a minimum of 5 min on the instructions, which include textual and graphical details of the metamaterial problem and the user interface. The 8-min task duration, selected after pilot testing, provides extra time for the subjects to familiarize themselves with the interface.
We administer four learning tests throughout the experiment, as shown in Table 2. Before any design synthesis task, part I includes a design pretest with 16 design comparison questions to test the prior knowledge of the subject about the mechanical metamaterial design problem. With the task ordering shown in Table 2, part II includes a manual design synthesis task and a design posttest with 16 design comparison questions to measure resultant learning. We do not repeat questions between design pretest and posttest to prevent the subjects from remembering answers. The questions in both tests still have a similar distribution of question complexity, as measured by the feature difference Δz2 (see Fig. 4(a)). Parts III and IV, respectively, include the manual- and semi-automated feature-based design synthesis and first and second feature tests, which include 20 feature identification questions each. About half of all subjects complete the tasks in the reverse order of parts I, IV, III, and II to mitigate the impact of task order in the data.
3.4 Model Training.
We train the conditional variational autoencoder on a dataset of 21,444 designs, which were generated from a greedy search using genetic algorithms [61]. The training data include 3 × 3 metamaterial lattice structures represented as 28 × 28 pixel grayscale images and 28-bit binary vectors, and the objectives vector for each design, made of vertical stiffness, volume fraction, and feasibility constraint. An image is generated from the binary vector representation of a metamaterial design. A pixel has a value of 1 if it falls on an active link or 0 otherwise. An image of an example unit lattice is highlighted in the bottom left part of Fig. 2. The image acts as an input I to the encoder network.
The loss function of the C-VAE comprises four terms. The reconstruction loss measures the difference between input designs and reconstructed designs. The Kullback–Leibler divergence (KLD) loss measures the difference between the posterior feature distribution and the standard normal distribution to reduce correlation among different features [62]; this loss is weighted by a factor of ten. The regression loss compares the predicted and observed values of the objectives. Finally, a correlation loss term maximizes the correlation of feature 2 and feature 3, respectively, with vertical stiffness and volume fraction. This loss artificially introduces strong sensitivity between the design objectives and select abstract features to help with the assessment. The hypothesis is that if those features are strongly correlated with the design objectives, the user should be able to learn those features more correctly. Table 3 differentiates high versus low sensitivity features in the trained model based on the total-effect Sobol index. The semantic features are converted from integers to normalized float values by centering with the sample mean and scaling by the sample standard deviation. These values feed into the C-VAE as vector z1. We ran the Adam-based stochastic optimization algorithm for 50 epochs with a batch size of 128.
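A sketch of how these four terms could be combined in training code is shown below. The KLD weight of ten follows the text; the binary cross-entropy reconstruction term, the exact form and weight of the correlation term, and the column ordering of features and objectives are assumptions:

```python
import torch
import torch.nn.functional as F

def cvae_loss(image, image_hat, mu, logvar, y_true, y_pred, z2,
              beta_kld=10.0, w_corr=1.0):
    """Composite C-VAE loss: reconstruction + weighted KLD + regression +
    correlation terms. Assumes feature 2 / feature 3 are z2[:, 1] / z2[:, 2]
    and y[:, 0] / y[:, 1] are vertical stiffness / volume fraction."""
    recon = F.binary_cross_entropy(image_hat, image, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    regression = F.mse_loss(y_pred, y_true)

    def corr(a, b):  # Pearson correlation over the batch
        a, b = a - a.mean(), b - b.mean()
        return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

    # Encourage feature 2 ~ stiffness and feature 3 ~ volume fraction
    correlation = -(corr(z2[:, 1], y_true[:, 0]) + corr(z2[:, 2], y_true[:, 1]))
    return recon + beta_kld * kld + regression + w_corr * correlation
```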
4 Results
The results present descriptive statistics and the posterior estimates from the item response theory model. The results use the aggregated data of both experimental task orders, as described in Table 2. We highlight the order-specific differences whenever relevant. The rest of the section is divided into the designer learning and performance outcomes.
4.1 Designer Learning Outcomes.
The experimental task increases the subjects’ ability to differentiate designs based on design objectives in the design comparison questions asked. Figure 4(a) presents the average correctness of responses in the learning tests. We observe that the average correctness is higher in the design posttest than in the pretest (relative t-statistic = 4.12, two-sided p-value < 0.001, Cohen’s d = 0.78). This difference is statistically significant irrespective of the task order. The average correctness of the design posttest is higher during the forward task order of parts I, II, III, IV (relative t-statistic = 2.47, two-sided p-value = 0.018, Cohen’s d = 0.73) and during the reverse task order (t-statistic = 2.58, two-sided p-value = 0.014, Cohen’s d = 0.84). Furthermore, in the design posttest, the average correctness of response increases in proportion to the feature distance between the designs being compared (slope = 0.13 (± 0.035), intercept = 0.05 (± 0.18), r-value = 0.16, and a one-sided p-value < 0.001). Here, the feature distance ‖Δz‖2 is the mean-squared distance between the features of the two designs in a test question. The more different the two designs are, the easier it should be for the subjects to correctly predict the influence of features on a given objective. Note that the correlation between the feature distance and the distance in the objective space (|Δy|) is statistically insignificant, according to the results in Fig. 4(b). Thus, |Δy| is not expected to confound the effect of ‖Δz‖2 on the average correctness of responses.
Overall, the subjects most accurately learn the effect strength and direction for features with inherently significant and positive effects on the design objectives. The semanticity further improves the accuracy of responses. According to Fig. 4(c), high sensitivity semantic features such as “horizontal lines,” “vertical lines,” and “diagonals” collectively have higher mean correctness of response than the other semantic features (relative t-statistic = 4.5, two-sided p-value < 0.001, Cohen’s d = 0.69), especially for task 2. Furthermore, these three semantic features have better correctness of response than the high sensitivity abstract features, i.e., “feature 2” and “feature 3” (relative t-statistic = 3.40, two-sided p-value < 0.001, Cohen’s d = 0.54).
Task 3 produces a lower accuracy of responses compared to task 2. The “horizontal lines” feature has lower average correct responses in feature test 2 than feature test 1, according to Figs. 4(c) and 4(d) (t-statistic = 1.74, two-sided p-value = 0.08, Cohen’s d = 0.36). A similar effect is observed for the “vertical lines” feature (t-statistic = 2.24, two-sided p-value = 0.03, Cohen’s d = 0.40). The differences in other features are not statistically significant for the average correctness metric.
For a consistent comparison of feature-specific knowledge, Fig. 5 presents feature-specific abilities estimated using the item response theory model. A boxplot shows the first, second, and third quartiles as horizontal lines and the sample mean as a filled marker. The hollow circles outside a boxplot are sample outliers. Between the design pretest and posttest, we observe that the subjects exhibit an increased understanding of the effects of semantic features, except for the “diagonals” feature. In the design pretest, the subjects, on average, have a poor understanding of the influence of horizontal lines, which do not influence the vertical stiffness. In the design posttest questionnaire, the largest estimated ability is for vertical lines.
Figure 5 further shows the differences in the feature-based abilities measured from the feature tests. We observe that the estimated abilities for semantic features, such as horizontal, vertical, and diagonal lines, are higher than those for feature 2 and feature 3 combined (t-statistic = 71.60, p-value = 0.001, Cohen’s d = 2.23) in feature test 1. The subjects also perform worse on the knowledge of horizontal lines and vertical lines in feature test 2 (task 3) than in feature test 1 (task 2). These results are consistent with the descriptive results presented in Fig. 4.
4.2 Design Performance Outcomes.
The degree of performance improvement compared to the initial Pareto front varies across conditions. When comparing the overall performance based on all generated designs, the manual design synthesis (task 1) provides better mean performance than the other two conditions. Figure 6 presents the distribution of 1000 bootstrapped means for the various performance measures. Bootstrapping allows hypothesis testing by resampling multiple sample sets from the experimental data [63]. Hypervolume improvement measures the improvement in the final generated Pareto front relative to the initial Pareto front; the higher the hypervolume improvement, the better. Since all design objectives are normalized between [0, 1], a hypervolume of 1 corresponds to reaching the utopia point. In Fig. 6(a), the mean hypervolume improvement is larger in task 1 than in task 3 (t-statistic = 2.78, p-value = 0.008, Cohen’s d = 0.43). At the same time, the smallest distance to utopia measures the distance between the final generated Pareto front and the utopia point ([1, 0]); the smaller the distance, the better the performance. This metric is smaller in task 1 than in task 3 (t-statistic = 3.19, p-value = 0.003, Cohen’s d = 0.45), as given in Fig. 6(b).
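For reference, a minimal sketch of the bootstrap of condition means (1000 resamples, as in the text); the random seed and data layout are assumptions, and the significance testing itself is not reproduced:

```python
import numpy as np

def bootstrap_means(values, n_boot=1000, seed=0):
    """Distribution of the sample mean under resampling with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    return values[idx].mean(axis=1)
```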
Task 3 performs better at the level of individual generated designs when compared to task 1. The local dominance metric measures the improvement in each generated design relative to the initial design that it modifies. From Fig. 6(c), a substantially larger share of generated designs in task 3 dominate their respective initial designs than in task 1 (about 10%). The difference between the number of dominant and dominated generated designs in task 3 is large and statistically significant (t-statistic = 3.11, p-value = 0.004, Cohen’s d = 0.74). However, a generated design in task 1 is likely to be twice as close to the utopia point as a generated design in task 3 when the initial design is kept the same (t-statistic = 6.08, p-value < 0.001, Cohen’s d = 0.71), according to Fig. 6(d).
Among the semantic features, the changes made in the number of horizontal and vertical lines have a large, statistically significant correlation with the corresponding changes in overall hypervolume, as given in Table 4. Similarly, a significant correlation is observed for “Feature 4.” Since the objectives are negligibly sensitive to “feature 4,” this result could occur due to potential higher-order interaction effects. The subject population tested all features with similar frequency. Despite this effort, some features do not exhibit a high correlation with positive outcomes, possibly due to the relatively low influence of these features or the subjects’ low feature-specific abilities.
Table 4 Effort (number of feature changes) and the correlation between feature changes and the corresponding changes in hypervolume

| Feature | Effort (feature changes) | Pearson’s r |
|---|---|---|
| Feature 1 | 215 | 0.014 |
| Feature 2 | 229 | 0.057 |
| Feature 3 | 187 | 0.116 |
| Feature 4 | 226 | −0.189a |
| Feature 5 | 150 | −0.182 |
| Horizontal lines | 154 | −0.295a |
| Vertical lines | 156 | 0.625a |
| Diagonals | 161 | −0.114 |
| Triangles | 186 | 0.038 |
| Three stars | 150 | −0.0366 |
Note: a denotes correlation coefficients that are statistically significant with a two-sided p-value < 0.005.
5 Discussion
5.1 Positive Influence of Certain Semantic Features on Designer Learning.
The results in Figs. 4 and 5 show that the effects of certain semantic features (e.g., horizontal and vertical lines) are easier for the subjects to learn than those of other semantic and abstract features. These high-ability features are semantic, and the design objectives are considerably sensitive to them. As observed from Table 4, the subjects also learn to improve design performance by manipulating these features.
The constructs that may explain the above observations are the recognition of semantic features and intertwined effects of design performance and designer learning. First, the semantic nature can exploit an individual’s dense prior knowledge [23], which the data-driven features lack, to explain feature behavior. The recognition from memory, i.e., recognition heuristic [64], places a higher value on identifiable features. The recognition heuristic might even be one of the first simple cues humans use to make decisions [65]. The recognition and simplicity could be a differentiating factor between single-link features (such as horizontal lines, vertical lines, and diagonals) and multi-link features (such as triangles and three-stars). Furthermore, retrospective learning triggers involve the need to learn from successful and failed designs [9]. The high sensitivity of certain semantic features likely allows the subjects to observe significant variations in objective values, thus triggering feature-specific learning for certain features. On the other hand, the low feature sensitivity does not provide a clear association between successful or failed designs and the changes in respective features.
5.2 Mixed Influence of High Automation on Design Performance.
From Fig. 6, we observe that higher automation in task 3 improves the local dominance of a newly generated design compared to an initial design. However, the local improvement in such a generated design could be half of that of a manually synthesized design, as seen from Fig. 6. This local dominance also does not necessarily translate into more significant hypervolume improvement. Potential explanations for this result could be related to (i) the low diversity of user-selected initial designs, (ii) the low amount of user-selected change (step size γ in Eq. (1)) in the initial design, (iii) the fixed number of iterations (≈50) used in the gradient descent algorithm, (iv) the non-convex objective function, or (v) the cognitive load in understanding the model output.
Higher autonomy offers users more freedom in testing custom features and creating new designs. The subjects develop metamaterial designs based on the limited number of features in the experiment. Allowing users to define and test their features could facilitate learning [17]. Also, the cognitive load involved in parsing the automated suggestions should be a concern. A large amount of information on the user interface may complicate the comprehension and trustworthiness of results and could likely reduce the design performance of hybrid human–machine teams [40]. In task 3, the subjects view suggestions for five features simultaneously. However, in parametric design activities, a designer commonly evaluates one design variable at a time [66]. Besides the interpretability of information, human decision-makers are likely only to consider a single automated suggestion from a machine learning model at a time [67].
Also, higher model accuracy and complexity of design representation may influence the results. For example, more advanced deep learning models could increase the design performance in task 3, but whether that improves designer learning is not guaranteed. The ease of use and complexity of design representation may influence how often the model gets used and, thus, designer learning. The semi-automated design synthesis may become more desirable for a complicated representation where manual design synthesis is not feasible.
5.3 Implications for Engineering Design Decision Support.
Human–machine collaborative design requires that the human designer comprehend and trust the model results. Design decisions depend on an accurate understanding of model inputs, outputs, and the causal effects of the latent features. We observe that the semanticity of features and their effect size on the design objectives influence the causal understanding of features. Semanticity can form a basis for human–machine collaboration and would scale up to more complex problems as long as meaningful semantic design features can be defined. While the method remains applicable, model training may require a larger amount of data for more complex problems. Accordingly, future design assistants should explicitly describe the underlying features in roughly linguistic terms and clarify their effects on design objectives. The knowledge representation hypothesis by Brian Smith [68] similarly states that any mechanically intelligent system should embody a semantic representation of knowledge. The emerging approaches to achieving interpretability, such as transparency and post hoc explanations [41], offer additional ways to improve the designer’s causal understanding.
The findings also highlight the role of designer learning in DSE and its effect on performance. Restricted learning, whether due to a higher level of automation, cognitive overload from model outputs, or abstract features, reduces the potential for higher design performance. Even though optimization using deep generative design can provide incremental improvements, global performance improvement is also a function of designer learning, especially in a human–machine collaborative setting.
5.4 Limitations and Future Directions.
More validation with different deep learning methods, subject populations, and design problems is necessary to generalize the findings. This paper uses deep learning, specifically the conditional variational autoencoder, for generative design. Some alternatives are evolutionary computation, adaptive or component-specific step size algorithms, or more advanced neural network architectures. Future work can evaluate such options in a human–machine collaborative setting. Additionally, we do not compare or test interpretability tools such as saliency maps, feature importance graphs, partial dependence plots, or specificity versus coverage plots. On the upside, the graphical interface and item response theory model provide a unique way to evaluate such alternatives in the future.
In data collection, the subjects did not have detailed domain knowledge and learned about mechanical metamaterials during the experimental task. While their engineering education forms a basis for their decisions, the lack of domain knowledge can drive their focus toward certain semantic features. A more complicated design problem would need subjects with significant expertise to have external validity. However, laboratory experiments are still scalable to more complex problems. Recent research suggests that representativeness in lab experiments depends not on matching subjects, tasks, and context separately, but rather on the behavior that emerges from the interplay of these three dimensions [69]. Moreover, the prevalence of open-source tools makes it easier to design user interfaces (e.g., oTree and MATLAB) and recruit lay subjects (e.g., Amazon Mechanical Turk). Future work can still validate the findings by comparing the patterns of designers’ learning and performance outcomes between novices and experts. As with any experimental study, one needs to perform context-specific validation using more experiments when applying the insights to new settings.
6 Conclusion
The rise of deep learning applications in human–machine collaborative design necessitates the analysis of model interpretability, mainly to satisfy designer learning and performance goals. This paper facilitates such analysis by combining an interactive deep generative tool, human subject experiments, and a learning assessment based on item response theory. The findings provide essential mathematical tools and behavioral insights for future design assistants. The subjects in our experiment appear to understand the sensitivity of certain semantic features better than that of abstract features. Cognitive factors such as cognitive load, together with feature semanticity, appear essential in mediating the overall design performance. If the findings hold, future interactive deep generative design platforms should emphasize discovering influential features and explaining them in the context of the problem definition. Interpretability measures would help maximize learning outcomes and performance while enlisting computational intelligence for design exploration.
Footnote
Acknowledgment
The authors gratefully acknowledge the financial support from the US National Science Foundation (NSF) CMMI through Grant # 1907541.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The datasets generated and supporting the findings of this article are obtainable from the corresponding author upon reasonable request.