Motivation for Out-of-Distribution (OOD) Generalization for Materials
Developing new materials and molecules is a cornerstone of technological innovation. From semiconductors to biodegradable polymers, the ability to tailor materials with specific properties underpins advances across industries. There is particular interest in property values outside the known property-value distribution, as these are the most likely to lead to new materials that will, in turn, unlock new capabilities and technologies.
For instance, in solar energy, discovering solid-state materials with enhanced energy conversion efficiency could lead to substantial improvements in the performance of solar panels, contributing to more sustainable energy solutions. In healthcare, discovering drug molecules with exceptional binding affinities or novel biological activities could open new avenues for treating diseases that currently lack effective therapies, thus advancing the medical field. In environmental sustainability, materials with exceptional carbon dioxide adsorption capacities could greatly improve carbon capture technologies, helping to reduce greenhouse gas emissions and combat climate change.
The discovery of such materials, those with properties beyond the range of materials already known and characterized, holds the promise of addressing pressing challenges in modern technology.
OOD What?
When considering OOD generalization for materials, one can assess the uniqueness of generated materials in relation to the chemical or structural space of the training data, or evaluate the uniqueness of the property values of these materials. In our context, OOD materials are defined as those with property values that fall outside the distribution of the known data (i.e., training data). These materials may also correspond to novel chemical or structural spaces, though this is not always the case.
Case Study: EDM model
EDM is an E(3) equivariant diffusion model for generating molecules in 3D [Hoogeboom et al, 2022]. This is a seminal work that introduced molecular generation through diffusion. Molecules are modeled as point clouds with coordinates \(x=(x_1, \ldots, x_M)\in \mathbb{R}^{M\times 3}\) and corresponding features \(h=(h_1, \ldots, h_M) \in\mathbb{R}^{M\times d}\), where \(M\) is the number of atoms in the molecule. While \(x\) holds continuous coordinates, \(h\) holds the atomic types, which are discrete and represented as one-hot vectors. EDM constructs a joint diffusion process over the two variables: during training, \(x\) diffuses towards conventional Gaussian noise \(\mathcal{N}(0, I)\) while the distribution of \(h\) diffuses towards a uniform categorical distribution, as presented in the figure below. Starting from standard normal noise \(z_T\), the model denoises by sampling from \(p(z_{t-1}\mid z_t)\) iteratively, learning the noise to be subtracted at each step, separately for \(x\) and \(h\).
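As a rough illustration, a single forward noising step of this joint process might look like the following (a minimal sketch, not the authors' code; the schedule values \(\alpha_t, \sigma_t\) and tensor shapes are assumptions, and \(h\) is treated as continuous one-hot vectors, as in EDM):

```python
import torch

def forward_noise_step(x, h, alpha_t, sigma_t):
    """Jointly noise coordinates x (M, 3) and one-hot atom types h (M, d),
    as in q(z_t | x, h) = N(z_t | alpha_t [x, h], sigma_t^2 I)."""
    eps_x = torch.randn_like(x)   # coordinate noise
    eps_h = torch.randn_like(h)   # feature noise (one-hot treated as continuous)
    z_x = alpha_t * x + sigma_t * eps_x
    z_h = alpha_t * h + sigma_t * eps_h
    return z_x, z_h, eps_x, eps_h  # eps_* are the regression targets for the denoiser
```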
For the model to generalize well across molecular data, it needs to be invariant with respect to the E(3) group, which includes all 3D Euclidean symmetries, i.e. rotations, translations, and inversions. However, to account for chiral molecules, SE(3) invariance should be considered. The premise of this work relies on the following theorem:
Theorem:
If \( x \sim p(x) \) is invariant to a group and the transition probabilities of a Markov Chain \( y \sim p(y \mid x)\) are equivariant, then the marginal distribution of \( y \) at any time step is invariant to group transformations as well.
This means that if the standard noise \(p(z_T)\) is invariant to E(3) and the neural network used to parameterize the diffusion process \(p(z_{t-1} \mid z_t)\) is equivariant to E(3), then the marginal distribution \(p(x)\) of the output of the denoising model will be invariant as desired!
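To make the theorem concrete, here is the induction step for a single denoising transition, for an orthogonal transformation \(R\) (so the substitution \(z_t = Ru\) has unit Jacobian): if \(p(z_t)\) is invariant and the kernel is equivariant, i.e. \(p(Rz_{t-1} \mid Rz_t) = p(z_{t-1} \mid z_t)\), then
\[p(Rz_{t-1}) = \int p(Rz_{t-1} \mid z_t)\, p(z_t)\, dz_t = \int p(Rz_{t-1} \mid Ru)\, p(Ru)\, du = \int p(z_{t-1} \mid u)\, p(u)\, du = p(z_{t-1})\]Applying this step from \(t = T\) down to \(t = 1\) propagates the invariance of \(p(z_T)\) all the way to the final marginal \(p(x)\).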
The forward diffusion process is modelled as follows:
\[q(z_t \mid x,h) = \mathcal{N}_{xh}(z_t \mid \alpha_t [x,h], \sigma_t^2 I) = \mathcal{N}_x(z_t^x \mid \alpha_t x, \sigma_t^2 I)\, \mathcal{N}_h(z_t^h \mid \alpha_t h, \sigma_t^2 I)\]In order for the standard noise to be invariant, the forward diffusion process needs to produce invariant noise. \(h\) is invariant w.r.t. \(E(3)\) transformations, as the atomic types of a molecule are not affected by Euclidean operators, so the noise distribution for \(h\) can be the conventional normal distribution \(\mathcal{N}\). Generally, for a distribution to be invariant to translation it needs to be constant; but if it is constant, it cannot integrate to 1. To construct a translation-invariant noise distribution, the noising process of \(x\) can be restricted to a translation-invariant linear subspace, by centering the nodes such that their center of gravity is always zero [Satorras et al, 2021].
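In code, this restriction amounts to projecting both the data and every coordinate-noise sample onto the zero center-of-gravity subspace (a minimal sketch; shapes are assumed as above):

```python
import torch

def center_gravity_zero(x):
    """Project coordinates (or coordinate noise) onto the linear subspace
    where the mean over atoms is zero; on this subspace the noise
    distribution is well-defined and translation invariant."""
    return x - x.mean(dim=0, keepdim=True)

M = 19  # number of atoms (example value)
eps_x = center_gravity_zero(torch.randn(M, 3))  # translation-invariant coordinate noise
```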
The reverse (denoising) process is modelled as follows:
\[q(z_{t-1}|z_t) = \mathcal{N}_{xh}(z_{t-1}|\mu_t([\hat{x},\hat{h}], z_t),\sigma_t^2 I)\]Where:
\[\hat{x},\hat{h} = \frac{1}{\alpha_t}[z_t^x, z_t^h] - \frac{\sigma_t}{\alpha_t}[\hat{\epsilon}_t^x,\hat{\epsilon}_t^h]\]And:
\[\hat{\epsilon}_t^x,\hat{\epsilon}_t^h = \text{EGNN}(z_t^x, z_t^h)\]In order for the denoising process to be equivariant, the noise is approximated by an equivariant graph neural network, EGNN (instead of the commonly used, non-equivariant U-Net).
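Putting the pieces together, one reverse step can be sketched as follows (hedged: `egnn` stands in for the equivariant noise predictor, and the update uses the clean-data estimate \(\hat{x},\hat{h}\) directly as the mean, a simplification of EDM's exact posterior parameterization):

```python
import torch

def denoise_step(z_x, z_h, t, egnn, alpha_t, sigma_t, sigma_step):
    """One step of p(z_{t-1} | z_t): predict the noise with an equivariant
    network, recover the clean-data estimate, and add fresh (centered) noise."""
    eps_x, eps_h = egnn(z_x, z_h, t)           # equivariant noise estimates
    x_hat = (z_x - sigma_t * eps_x) / alpha_t  # \hat{x} from the formula above
    h_hat = (z_h - sigma_t * eps_h) / alpha_t  # \hat{h}
    noise_x = torch.randn_like(z_x)
    noise_x = noise_x - noise_x.mean(dim=0, keepdim=True)  # stay in zero-CoG subspace
    noise_h = torch.randn_like(z_h)
    return x_hat + sigma_step * noise_x, h_hat + sigma_step * noise_h
```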
This work achieves state-of-the-art performance with 98.7% atom stability (the proportion of atoms with correct valency), 82% molecule stability (the proportion of molecules with all stable atoms), and 91.9% validity of the generated molecules.
OOD Generalization of the EDM model
In this fork, I’ve introduced additional features for conditional molecule generation: users can specify target property values for controlled molecule generation. Additionally, this implementation supports extrapolation, allowing users to explore molecules specifically generated to match property values outside the training range. At inference, the model receives a range of desired property values and generates 100 molecules along an equally spaced grid over that range. To validate the property values of the generated molecules, Density Functional Theory (DFT) calculations — a quantum mechanics simulation method that computes the electronic structure and properties of materials from the system’s electron density — were employed. These calculated values are accurate enough to be considered ground truth and can be used to evaluate the model’s ability to generate stable, valid molecules with OOD property values.
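Conceptually, the sweep over a user-specified property range looks like this (a sketch; `sample_conditional` is a hypothetical wrapper around the conditional EDM sampler, and property normalization is omitted):

```python
import numpy as np

def generate_sweep(sample_conditional, c_min, c_max, n_samples=100):
    """Generate one molecule per target value on an equally spaced grid
    over [c_min, c_max], which may extend beyond the training range."""
    targets = np.linspace(c_min, c_max, n_samples)
    return [sample_conditional(c) for c in targets]
```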
OOD Generalization Results
The model was initially trained conditionally on isotropic polarizability \(\alpha\) values \(x, h \sim p(x, h \mid c)\) drawn from the training distribution. During inference, generation was conditioned on out-of-distribution (OOD) property values (represented by the red extension of the distribution). Among the 100 molecules generated, only 38% were stable enough to run quantum mechanics (QM) calculations. The remaining molecules had issues such as incorrect charge or valency, rendering them completely invalid for QM simulations. Notably, the model appears to have generated some molecules with OOD isotropic polarizability values, as the histogram of the generated molecules’ \(\alpha\) values extends beyond the OOD threshold!
However, out of the 100 molecules generated, only 1% were valid molecules. Unfortunately, this single valid molecule had an isotropic polarizability of \(\alpha = 105.6\), placing it well within the distribution rather than in the OOD range. The remaining molecules exhibited incorrect bond lengths that would break the molecule in reality, as illustrated in the GIF and figures below.
What about the top In-Distribution values?
It’s interesting to examine the model’s ability to generate molecules with extreme in-distribution (ID) values, as these samples are rare in the dataset. This highlights the model’s ability to generalize to cases that, although infrequent, were still seen during training.
In this case, 55% of the 100 molecules were stable enough for QM calculations, and 13% were actually valid.
While the model was able to produce more valid molecules in the desired extreme \(\alpha\) range, the highest value was \(\alpha = 112.6\), far from the distribution edge.
Conclusion
When conditioning on values from the main distribution, the model’s validity reaches 90%, as reported in the paper. These results are not entirely surprising, as out-of-distribution (OOD) generalization is notoriously challenging for machine learning models. However, overcoming these challenges and improving OOD generalization for materials data remains a key area of research, as it holds the potential to drive significant advancements in materials design and the discovery of new, high-performance materials.
Introduction
Materials embody a diverse array of chemical and physical properties, intricately shaping their suitability for various applications. The representation of materials as graphs, where atoms serve as nodes and chemical bonds as edges, facilitates a systematic analysis. Graph Neural Networks (GNNs) have emerged as promising tools for deciphering relationships and patterns within materials data. The utilization of GNNs holds the potential to develop computational tools that deepen our understanding and aid in designing structure-property relationships in atomic systems.
In recent years, there has been a heightened focus on employing machine learning for the accelerated discovery of molecules and materials with desired properties [Min and Cho, 2020; Pyzer-Knapp et al, 2022; Merchant et al, 2023]. Notably, these methods are exclusively applied to stable systems in physical equilibrium, where such systems correspond to local minima of the potential energy surface \(E(r_1, \ldots, r_n)\), with \(r_i\) representing the position of atom \(i\) [Schütt et al, 2018].
The diverse arrangements of atoms in a system result in varying potential energy values, influencing chemical stability. In the GIF below, different trajectories of the Ethane molecule can be seen. The Ethane molecule spends 99% of its time in a specific conformation, in which the substituents are at the maximum distance from each other. This conformation is called the staggered conformation. Looking at the molecule from a position on the C-C (main) axis (as in the second half of the animation), the staggered conformation is reached when the H atoms of the front C atom are exactly between the H atoms of the other C atom. The animation also shows the 3-fold symmetry of the molecule around the main axis. All three staggered conformations have the same energy value, as they are completely equivalent. The intermediate conformations result in higher energy values, as they are energetically less favorable. Different conformations can also exhibit elongations of some bond lengths and variations in angle values. Predicting stable arrangements of atomic systems is in itself an important challenge!
In the three-dimensional Euclidean space, materials and physical systems in general, inherently exhibit rotation, translation, and inversion symmetries. These operations form the E(3) symmetry group, a group of transformations that preserve the Euclidean distance between any two points in 3D space. When adopting a graph-based approach, a generic GNN may be sensitive to these operations, but an E(3) equivariant GNN excels in handling such complexities. Its inherent capability to grasp rotations, translations, and inversions allows for a more nuanced understanding, enabling the capture of underlying physical symmetries within the material structures [Batzner et al, 2022].
Data
The MD17 dataset, an extensive repository of ab-initio molecular dynamics trajectories [Chmiela et al, 2019], was employed in this study.
Each trajectory within the dataset includes the Cartesian positions of atoms (in Angstrom), their atomic numbers, the total energy (in kcal/mol), and the forces (in kcal/mol/Angstrom) acting on each atom. The latter two serve as regression targets in our analyses.
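For reference, one convenient way to access these trajectories is through PyTorch Geometric's `MD17` wrapper (an assumption about tooling, not necessarily how the data was loaded here):

```python
from torch_geometric.datasets import MD17

dataset = MD17(root='data/MD17', name='aspirin')
sample = dataset[0]
print(sample.z)       # atomic numbers
print(sample.pos)     # Cartesian positions (Angstrom)
print(sample.energy)  # total energy (kcal/mol)
print(sample.force)   # forces (kcal/mol/Angstrom)
```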
Our focus narrowed down to the molecules Aspirin, Ethanol, and Toluene:
The distributions of energy values (kcal/mol) for various conformations of the three molecules, within the training and validation sets, are illustrated in the histograms below.
The training sets for Aspirin, Ethanol, and Toluene each comprise 1000 conformations; their validation sets contain 500, 1000, and 500 conformations, respectively.
Method
In this project, our objective is to conduct a comparative analysis of two Graph Neural Network (GNN) architectures: an E(3) equivariant network and a non-equivariant (specifically E(3) Invariant) one. The primary focus is on energy prediction tasks related to atomic systems, with a particular emphasis on exploring the distinctions within the latent representations of these architectures and their interpretability.
All GNNs are permutation invariant by design [Keriven and Peyré, 2019]. Our baseline GNN achieves rotation and translation invariance by simply operating on interatomic distances instead of the absolute positions of the atoms. This design choice ensures that both the output and the internal features of the network remain invariant to rotations. In contrast, our equivariant GNN utilizes relative position vectors rather than distances (scalars), together with features comprised not only of scalars but also of higher-order geometric tensors.
In our Invariant GNN, the node-wise formulation of the message passing is given by:
\[x_i' = \Theta^\top \sum_{j \in \mathcal{N}(i) \cup \{i\}} \frac{e_{j,i}}{\sqrt{d_j d_i}}\, x_j\]Where \(x_i, x_j\) are the feature vectors of the target and source nodes, respectively, defined as a one-hot representation of the atomic number of that node, and \(\Theta\) is a learned weight matrix. The summation is performed over the neighborhood \(\mathcal{N}(i)\) of atom \(i\), defined by a radial cutoff around each node - a tunable parameter typically set around 4-5 angstroms. That is, the concept of neighborhood is based on the distance between nodes, not their connectivity. Additionally, \(d_i = 1 + \sum_{j \in \mathcal{N}(i)} e_{j,i}\), where \(e_{j,i}\) represents the edge weight from the source node \(j\) to the target node \(i\), and is defined as the interatomic distance.
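A minimal sketch of such a baseline in PyTorch Geometric (the layer count, widths, and readout are illustrative assumptions; `GCNConv`'s symmetric normalization matches the \(d_i\) definition above):

```python
import torch
from torch_geometric.nn import GCNConv, radius_graph, global_add_pool

class InvariantGNN(torch.nn.Module):
    def __init__(self, num_atom_types=10, hidden=64, cutoff=4.5):
        super().__init__()
        self.cutoff = cutoff
        self.conv1 = GCNConv(num_atom_types, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, 1)

    def forward(self, x, pos, batch):
        # Neighborhoods come from a radial cutoff; edge weights are interatomic
        # distances, so the network never sees absolute positions.
        edge_index = radius_graph(pos, r=self.cutoff, batch=batch)
        dist = (pos[edge_index[0]] - pos[edge_index[1]]).norm(dim=-1)
        h = self.conv1(x, edge_index, edge_weight=dist).relu()
        h = self.conv2(h, edge_index, edge_weight=dist).relu()
        return self.readout(global_add_pool(h, batch))  # per-molecule energy
```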
For constructing our equivariant GNN, e3nn was employed - a PyTorch-based library designed for building O(3) equivariant networks. Following the method presented in [Batzner et al, 2022], a neural network that exhibits invariance to translation and equivariance to rotation and inversion was constructed. Two key aspects of e3nn facilitating the construction of O(3) equivariant neural networks are the use of irreducible representations (Irreps) for data structuring and the encapsulation of geometrical information in spherical harmonics. Irreps are data structures that describe how the data behaves under rotation. We can think of them as data types, in the sense that this structure includes the values of the data alongside instructions for their interpretation. The spherical harmonics form an orthonormal basis of functions on the sphere, and they are equivariant with respect to rotations, which makes them very useful (and popular!) for expanding expressions in physical settings with spherical symmetry.
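A small taste of these two building blocks using e3nn's public API (the specific irreps string is just an example):

```python
import torch
from e3nn import o3

# Irreps: "8x0e + 4x1o" means 8 even (parity +1) scalars plus 4 odd (parity -1) vectors.
irreps = o3.Irreps("8x0e + 4x1o")
print(irreps.dim)  # 8*1 + 4*3 = 20

# Spherical harmonics up to l=2, evaluated on normalized relative vectors.
vec = torch.randn(5, 3)
sh = o3.spherical_harmonics([0, 1, 2], vec, normalize=True, normalization='component')
print(sh.shape)  # (5, 1 + 3 + 5) = (5, 9)
```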
For the equivariant GNN, the node-wise formulation of the message is:
\[f_i' = \frac{1}{\sqrt{z}} \sum_{j \in \partial(i)} f_j \;\otimes\!\big(h(\|x_{ij}\|)\big)\; Y\!\left(\frac{x_{ij}}{\|x_{ij}\|}\right)\]Where \(f_i, f_j\) are the target and source node feature vectors, defined similarly as a one-hot representation of the atomic number. \(z\) is the average degree (number of neighbors) of the nodes, and the neighborhood \(\partial(i)\) is once again defined using a radial cutoff. \(x_{ij}\) is the relative position vector, \(h\) is a multi-layer perceptron, and \(Y\) are the spherical harmonics. The expression \(x \,\otimes(w)\, y\) denotes a tensor product of \(x\) with \(y\) using weights \(w\). This signifies that the message passing formula involves a convolution over node feature vectors, with filters constrained to be a product of a learned radial function and the spherical harmonics.
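Following the pattern of e3nn's convolution tutorial, the message computation can be sketched as below (the irreps choices, basis size, and cutoff are illustrative assumptions, not the exact configuration used in this project; the radial MLP plays the role of \(h\)):

```python
import torch
from torch_scatter import scatter
from e3nn import o3, nn
from e3nn.math import soft_one_hot_linspace

irreps_in = o3.Irreps("8x0e")                    # scalar input node features
irreps_sh = o3.Irreps.spherical_harmonics(lmax=2)
irreps_out = o3.Irreps("8x0e + 4x1o")            # outputs carry vectors too
num_basis, max_radius = 10, 4.5

# Tensor product whose weights come from an MLP on the interatomic distance.
tp = o3.FullyConnectedTensorProduct(irreps_in, irreps_sh, irreps_out, shared_weights=False)
radial = nn.FullyConnectedNet([num_basis, 16, tp.weight_numel], torch.relu)

def conv(f, pos, edge_src, edge_dst, z):
    x_ij = pos[edge_src] - pos[edge_dst]                       # relative position vectors
    sh = o3.spherical_harmonics(irreps_sh, x_ij, normalize=True,
                                normalization='component')      # angular part Y
    emb = soft_one_hot_linspace(x_ij.norm(dim=-1), 0.0, max_radius,
                                num_basis, basis='smooth_finite', cutoff=True)
    messages = tp(f[edge_src], sh, radial(emb))                # f_j (x)(w) Y(x_ij)
    return scatter(messages, edge_dst, dim=0, dim_size=f.shape[0]) / z**0.5
```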
Results
The performance of the two GNNs was compared on the task of predicting the total energy of a molecule’s conformation - a scalar property. By constraining the equivariant GNN to predict a scalar output, it becomes overall invariant to the E(3) group. However, the use of higher-order geometric tensors in the intermediate representations and operations of the E-GNN makes its internal features equivariant to rotation and inversion. This enables the passage of angular information through the network using rotationally equivariant filters (spherical harmonics) in the node feature convolution. This is the essential difference between the two architectures.
The learning curves of the two GNNs for each molecule are presented in the figures below:
The models were trained for 50 epochs using a mean absolute error (MAE) objective for predicting the normalized energy (in kcal/mol). An Adam optimizer with a learning rate of 0.01 and a learning rate scheduler were employed. The E-GNN achieves a lower MAE for all three molecules.
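In code, the training setup amounts to roughly the following (a sketch; the exact scheduler is not specified in the text, so `ReduceLROnPlateau` is an assumption, and batches are assumed to expose normalized energies as `batch.energy`):

```python
import torch

def train(model, train_loader, val_loader, epochs=50):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    loss_fn = torch.nn.L1Loss()  # mean absolute error on normalized energies
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch).squeeze(-1), batch.energy)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_mae = sum(float(loss_fn(model(b).squeeze(-1), b.energy))
                          for b in val_loader) / len(val_loader)
        scheduler.step(val_mae)  # reduce the learning rate when validation MAE plateaus
    return model
```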
Next, let’s examine the latent representations of the two models! The last-layer activations for the validation data of both models were projected to a 2D representation using t-SNE and color-coded according to the target energy values, as shown in the figures below.
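Such a projection can be produced with scikit-learn (a sketch; `latents` is the matrix of last-layer activations and `energies` the corresponding targets, both assumed precomputed):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent(latents, energies, title):
    """Project (n_samples, hidden_dim) activations to 2D, colored by energy."""
    emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latents)
    sc = plt.scatter(emb[:, 0], emb[:, 1], c=energies, cmap='viridis', s=8)
    plt.colorbar(sc, label='energy (kcal/mol)')
    plt.title(title)
    plt.show()
```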
A color gradient can be seen in all three projections of the equivariant GNN, and it is clearest for Ethanol. The invariant GNN’s latent projections do not exhibit a similar structure, except perhaps for Ethanol’s conformations. Moreover, in Ethanol’s case, the invariant GNN’s projection appears to be quite one-dimensional.
The apparent color gradient according to the target values in the E-GNN latent space is impressive, suggesting that the model leverages this information when embedding data conformations for predictions. Multiple “locations” in the latent space denote various high-energy conformations, indicating that the model considers not only the target energy value but also structural differences.
To assess whether there’s molecular structural ordering in the embeddings, we construct system-specific variables for each molecule and visualize the latent space accordingly. Ethanol, with its relatively simple structure, showcases three important variables: the distance between the two Carbons (C-C bond), the distance between Carbon and Oxygen (C-O bond), and the angle formed by the three atoms. The distributions of these variables in Ethanol’s train and validation sets are depicted in the figure below:
The distributions appear very similar for each variable in the train and validation sets. Now, let’s examine the latent projection of Ethanol’s validation conformations, color-coded with respect to the target and the three system-specific variables:
A clear gradient is observed for the main angle and C-C bond! The target gradient appears from the top left corner to the bottom right; the C-C bond gradient seems to go from bottom left to top right, and the main angle gradient isn’t as linear, appearing to spiral from the bottom to the top right corner clockwise. The C-O bond projection doesn’t seem to follow a discernible gradient, suggesting it’s not as influential on the target as the other two variables.
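For completeness, these internal coordinates are straightforward to compute from the Cartesian positions (a NumPy sketch; the atom indices of the two carbons and the oxygen are assumptions that depend on the dataset's atom ordering):

```python
import numpy as np

def ethanol_internal_coords(pos, i_c1=0, i_c2=1, i_o=2):
    """Return the C-C bond length, C-O bond length, and C-C-O angle (degrees)."""
    v_cc = pos[i_c2] - pos[i_c1]   # C1 -> C2
    v_co = pos[i_o] - pos[i_c2]    # C2 -> O
    d_cc = np.linalg.norm(v_cc)
    d_co = np.linalg.norm(v_co)
    cos_angle = np.dot(-v_cc, v_co) / (d_cc * d_co)  # angle at the central carbon
    angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return d_cc, d_co, angle
```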
Cool huh? The equivariant GNN appears to embed the data not only according to the target value but also according to the system’s geometrical structure! This suggests that the model leverages its E(3) equivariant convolution layers to capture and encode information about both the target values and the intricate geometric features of the molecular systems.
Conclusion
In conclusion, our exploration has demonstrated the efficiency of the E(3) equivariant GNN, compared to an invariant GNN, in predicting the total energy of molecular conformations. Though both models were compared on predicting energy, a scalar property, the E-GNN’s ability to leverage the inherent symmetries of the system allowed it to effectively capture and encode the relationship between the arrangement of atoms and the conformation’s energy. This was illustrated through the latent representation visualizations and was particularly evident in the case of Ethanol, where discernible gradients in the latent space were observed, correlating with the target energy value and with variations in the C-C bond length and main angle. However, interpretability varies among the latent projections for the more complex molecules investigated in this project. Potential improvements could be achieved with additional data and a more expressive equivariant network.