b School of Electrical and Information Engineering, Xi'an Technological University, Xi'an 710021, China;
c School of Materials Science and Chemical Engineering, Xi'an Technological University, Xi'an 710021, China
Accurate prediction of structures in chemical compound space (CCS) is crucial for the design and synthesis of novel materials [1]. Although theoretical structure prediction methods have been widely adopted in materials research [2,3], the computational cost associated with large-scale structural systems remains a challenge. The CALYPSO software, based on swarm intelligence, has been developed to accelerate structure prediction [4,5]. However, existing first-principles methods based on density functional theory (DFT) are time-consuming and economically expensive when applied to large-scale structures. Consequently, there is an urgent need to accelerate material structure prediction [6].
Typically, structure prediction methods involve several steps [6]: (1) randomly generating an initial population of structures under symmetry constraints; (2) assessing the similarity among the generated structures and eliminating duplicates; (3) performing local optimization and energy calculations on the generated structures using DFT; (4) constructing the next generation of structures using swarm intelligence algorithms, where low-energy structures evolve into new structures while high-energy structures are replaced by randomly generated ones. However, for large-scale structures containing hundreds or thousands of atoms, the computational time required for the energy calculations is significant, and performing energy calculations and optimizations for many randomly generated structures further compounds the time cost. This limitation significantly impedes the prediction of large-scale material structures in most structure prediction software. Deep learning (DL) methods have achieved substantial success in computer vision, pattern recognition, and other fields [7,8], but their application to predicting and generating new structures in computational chemistry remains limited [9].
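As a toy illustration of the four steps above (not the CALYPSO implementation), the following sketch evolves a population of random three-dimensional "structures" against a placeholder energy function; all names here (evolve_population, energy_fn) are ours, and a real workflow would replace the energy call with a DFT calculation and the tuples with crystal structures:

```python
import random

def evolve_population(energy_fn, n_structures=10, n_generations=5,
                      keep_frac=0.6, seed=0):
    """Toy sketch of the four steps: random init, de-duplication,
    energy evaluation, and swarm-style replacement of high-energy members."""
    rng = random.Random(seed)
    # Step 1: random initial "structures" (here just 3-D coordinate triples).
    pop = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(n_structures)]
    for _ in range(n_generations):
        # Step 2: drop duplicates (real codes use structural fingerprints).
        pop = list(dict.fromkeys(pop))
        # Step 3: energy evaluation (a DFT call in the real workflow).
        pop.sort(key=energy_fn)
        # Step 4: keep low-energy structures, perturb them slightly,
        # and refill the rest of the population at random.
        survivors = pop[: int(keep_frac * n_structures)]
        children = [tuple(x + rng.gauss(0, 0.05) for x in s) for s in survivors]
        randoms = [tuple(rng.uniform(-1, 1) for _ in range(3))
                   for _ in range(n_structures - len(children))]
        pop = children + randoms
    return min(pop, key=energy_fn)

# Example: minimise a simple quadratic "energy surface".
best = evolve_population(lambda s: sum(x * x for x in s))
```

The expensive part in practice is step 3, which is exactly what the energy prediction model in this work is meant to replace.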
In short, accelerating material structure prediction is a critical problem. While traditional methods struggle with large-scale structures, DL methods offer a potential solution: by leveraging DL techniques, it may be possible to overcome the time constraints associated with energy calculations and to streamline the prediction process. The application of DL in computational chemistry thus holds promise for advancing structure prediction and the generation of new structures.
In recent years, the rapid development of artificial intelligence has made machine learning (ML) a method that researchers are eager to explore. ML has become popular in application scenarios that require prediction, classification, and decision-making [4,10]. With the availability of large-scale quantum mechanical data [11–13], researchers have built ML models to predict material properties such as formation energies, defect energies, elasticity, and other mechanical properties [4]. For instance, Hansen et al. [14] employed a linear regression (LR) algorithm to learn the relationship between structural information and cluster energy and predicted the energies of new clusters from atomic Cartesian coordinates and nuclear charges. Meanwhile, neural network (NN) methods [15,16] have been leveraged to accelerate the construction of potential energy surfaces. Rupp et al. [17] introduced an ML model that predicts the atomization energies (AEs) of different organic molecules. Gupta et al. [18,19] used DL to establish the correlation between molecular topological features and accurate AE predictions, achieving a low mean absolute error (MAE).
In addition, some DL methods have been used to replace the DFT calculation step, reducing the computational complexity and accelerating structure prediction in an end-to-end manner [20]. Owing to the strong feature extraction and feature learning abilities of DL methods, we introduce a natural-parameter network (NPN) [21] model that predicts the AEs of different molecules in an end-to-end manner based solely on nuclear charges and atomic positions. To address this task, an end-to-end energy prediction model was constructed; its schematic diagram is shown in Fig. 1.
Fig. 1. Schematic diagram of an end-to-end chemical energy prediction process.
Let Φ be an AE prediction model. The model Φ_NPN uses a neural network architecture for probabilistic modeling, leveraging the properties of exponential family distributions to transform the neural network output into the natural parameters of an exponential family distribution. This allows the NPN to conveniently parameterize the probabilistic model and to maximize the log-likelihood function with gradient-based optimization during training. The NPN expresses the conditional probability p(Y|X) between the input and output variables in the form of an exponential family distribution. From the molecular structure information, we extract the coordinates of all atoms of each molecule as the input of Φ_NPN, represented by X; the AE of each molecule is represented by Y. Here we have:
Y = Φ_NPN(X; θ_NPN),    (1)
where X is the coordinate information of all atoms of each molecule and θ_NPN denotes the parameters of the NPN model. The NPN consists of both linear and nonlinear transformation layers. We first introduce the linear form of the NPN. For simplicity, assume a Gaussian distribution with two natural parameters ξ = (q, d)^T. We decompose the distribution over a weight matrix as p(W^l | W_q^l, W_d^l) = Π_{i,j} p(W_{ij}^l | W_{q,ij}^l, W_{d,ij}^l), where (W_{q,ij}^l, W_{d,ij}^l) are the corresponding natural parameters. We assume similarly decomposed exponential-family distributions for b^l, z^l, and x^l. For computational convenience, we approximate z^l with another exponential family distribution by matching its mean and variance. Computing (W_m^l, W_s^l) = f(W_q^l, W_d^l) and (b_m^l, b_s^l) = f(b_q^l, b_d^l), we can then obtain z_q^l and z_d^l from the mean z_m^l and variance z_s^l of z^l as follows:
z_m^l = x_m^{l-1} W_m^l + b_m^l,    (2)

z_s^l = x_s^{l-1} W_s^l + x_s^{l-1} (W_m^l ⊙ W_m^l) + (x_m^{l-1} ⊙ x_m^{l-1}) W_s^l + b_s^l,    (3)

(z_q^l, z_d^l) = f^{-1}(z_m^l, z_s^l),    (4)
where the symbol ⊙ denotes the element-wise product, the bijective function f(·, ·) maps the natural parameters of a distribution to its mean and variance, and f^{-1}(·, ·) denotes the inverse transformation. The means and variances of W^l and b^l obtained from the natural parameters are represented by W_m^l, W_s^l, b_m^l, and b_s^l, respectively. The computed z_m^l and z_s^l can be used to recover z_q^l and z_d^l, thereby streamlining the feed-forward calculation of the nonlinear transformation.
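As an illustrative sketch (not the authors' code), the Gaussian-case mappings f and f^{-1} and the moment-matched linear layer of Eqs. (2)-(4) can be written as follows; the helper names are ours:

```python
import numpy as np

def f(q, d):
    """Map Gaussian natural parameters (q, d) to (mean, variance)."""
    s = -1.0 / (2.0 * d)          # variance
    m = q * s                     # mean = -q / (2 d)
    return m, s

def f_inv(m, s):
    """Inverse map: (mean, variance) -> natural parameters (q, d)."""
    return m / s, -1.0 / (2.0 * s)

def npn_linear(x_m, x_s, W_m, W_s, b_m, b_s):
    """Moment-matched linear layer of a Gaussian NPN (Eqs. (2)-(3)):
    propagate input mean x_m and variance x_s through a random-weight layer."""
    z_m = x_m @ W_m + b_m
    z_s = (x_s @ W_s
           + x_s @ (W_m * W_m)
           + (x_m * x_m) @ W_s
           + b_s)
    return z_m, z_s
```

With zero input variance and deterministic weights (W_s = b_s = 0) the layer reduces to an ordinary linear layer, which is a quick sanity check on Eq. (3).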
Once we have the linearly transformed distribution over z^l, defined by the natural parameters z_q^l and z_d^l, we apply an element-wise nonlinear transformation v(·) to it. The resulting distribution of activations is p_x(x^l) = p_z(v^{-1}(x^l)) |(v^{-1})'(x^l)|, where p_z is the factorized distribution over z^l defined by (z_q^l, z_d^l). Even if p_x(x^l) is not an exponential-family distribution, we can approximate it with p(x^l | x_q^l, x_d^l) by matching the first two moments. Once the mean x_m^l and variance x_s^l of p_x(x^l) are obtained, the corresponding natural parameters follow from f^{-1}(·, ·). The feed-forward computation can be expressed as:
x_m^l = E_{p_z}[v(z^l)],    (5)

x_s^l = E_{p_z}[v(z^l) ⊙ v(z^l)] − x_m^l ⊙ x_m^l.    (6)
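For the Gaussian case with a ReLU activation v, the moments in Eqs. (5) and (6) have a well-known closed form; the sketch below (the helper name relu_moments is ours) computes the mean and variance of ReLU(z) for z ~ N(m, s), with s the variance:

```python
import math

def relu_moments(m, s):
    """Mean and variance of ReLU(z) for z ~ N(m, s), s = variance.
    Closed-form moment matching for the nonlinear step (Eqs. (5)-(6))."""
    sigma = math.sqrt(s)
    alpha = m / sigma
    Phi = 0.5 * (1.0 + math.erf(alpha / math.sqrt(2.0)))          # Gaussian CDF
    phi = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)  # Gaussian pdf
    x_m = m * Phi + sigma * phi                    # E[max(z, 0)]
    second = (m * m + s) * Phi + m * sigma * phi   # E[max(z, 0)^2]
    return x_m, second - x_m * x_m
```

For a mean far above zero the ReLU is effectively the identity, so the output moments approach (m, s), which matches intuition about Eq. (6).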
To model distributions over weights and neurons, the natural parameters must be learned. The NPN takes a random vector distribution as input, such as a multivariate Gaussian, multiplies it by a random matrix distribution, and applies a nonlinear transformation before outputting another distribution. Since all three distributions in this process can be characterized by their natural parameters, learning and prediction can be performed in the space of natural parameters. During back-propagation, for distributions characterized by two natural parameters, the gradient is composed of two terms; for instance, ∂E/∂z_q = (∂E/∂x_m) ⊙ (∂x_m/∂z_q) + (∂E/∂x_s) ⊙ (∂x_s/∂z_q), where E represents the error term of the network. Naturally, nonlinear NPN layers can be stacked to form a deep NPN, as shown in Algorithm 1. The NPN does not need expensive inference algorithms, such as variational inference or Markov chain Monte Carlo (MCMC). Moreover, it has the flexibility to choose different types of exponential family distributions for the weights and neurons.
Algorithm 1. Deep NPN's algorithm.
Following a previous study [22], χ can be represented with Coulomb matrices (CM). A CM encodes the nuclear charges and Cartesian coordinates of the atoms in a molecule and can be calculated using the following formula:
M_ij = 0.5 Z_i^{2.4} for i = j;  M_ij = Z_i Z_j / |R_i − R_j| for i ≠ j,    (7)
where 1 ≤ i, j ≤ 23, Z_i is the nuclear charge of atom i, and R_i is its Cartesian coordinate.
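Eq. (7) translates directly into code; the following sketch (the function name coulomb_matrix is ours) builds the zero-padded 23 × 23 CM used for QM7:

```python
import numpy as np

def coulomb_matrix(Z, R, size=23):
    """Coulomb matrix of Eq. (7), zero-padded to size x size as in QM7.
    Z: nuclear charges, shape (n,); R: Cartesian coordinates, shape (n, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((size, size))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4                      # diagonal term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])  # pairwise term
    return M
```

Because the entries depend only on charges and interatomic distances, the matrix is symmetric and invariant to translations and rotations of the molecule.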
Specifically, QM7 is a dataset of 7165 molecules taken from the larger GDB-13 database, which comprises nearly one billion synthetic organic molecules. Each molecule in QM7 is composed of the elements H, C, N, O, and S, with at most 23 atoms per molecule. The molecules in QM7 exhibit a diverse range of chemical structures, such as double and triple bonds, cycles, carboxylic acids, cyanides, amides, alcohols, and epoxides [23]. The CM representation is used to describe the molecular structures, and it is invariant to translations and rotations of the molecule. In addition to the CM, the dataset includes AEs ranging from −800 kcal/mol to −2000 kcal/mol [14].
The main focus of this article is the mapping between χ and the AEs.
For the NPN, GCN, CNN, and LSTM models, we used the rectified linear unit (ReLU) activation function in the hidden layers. We used the Adam optimizer [24] with a learning rate of 0.0001 and a decay rate of 0.00001, and, based on our experience and knowledge, we evaluated different combinations of network hyper-parameters for model training and fine-tuning. The learning rate was chosen from {0.01, 0.001, 0.0001}, the number of training epochs from {100, 200, 300}, the batch size from {32, 64, 128}, and the number of hidden-layer nodes from {100, 200, 300}, and we trained models with the different combinations of these values. The experimental results showed that model performance was optimal with a learning rate of 0.0001, a batch size of 64, 200 hidden-layer nodes, and 200 training epochs. The proposed models were implemented with the PyTorch library. As for the hardware configuration, the processor is an Intel(R) Xeon(R) Silver 4110 CPU @ 2.10 GHz, the RAM has a capacity of 32.0 GB, and the graphics card is an NVIDIA RTX 2080 Ti [25]. We randomly divided the dataset into a training dataset and a test dataset in the ratio of 8:2.
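The hyper-parameter search described above amounts to enumerating a small grid; a minimal sketch (the variable and function names are ours, while the grid values and the best combination are those listed in the text):

```python
from itertools import product

# Hyper-parameter grid described in the text.
grid = {
    "lr": [0.01, 0.001, 0.0001],
    "epochs": [100, 200, 300],
    "batch_size": [32, 64, 128],
    "hidden": [100, 200, 300],
}

def all_configs(grid):
    """Enumerate every combination of hyper-parameter values."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in product(*(grid[k] for k in keys))]

configs = all_configs(grid)
# The best combination reported in the text:
best = {"lr": 0.0001, "epochs": 200, "batch_size": 64, "hidden": 200}
```

With four axes of three values each, this grid yields 81 training runs, which is why the search is feasible only for fast surrogate models.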
We used the proposed method to train the AE prediction models on the training data and validated their performance on the testing data; the losses of the proposed models during training are shown in Fig. 2.
Fig. 2. Training losses of the different DL models.
By training on a large set of diverse molecules in QM7, we use the MAE and the root mean square error (RMSE) as evaluation metrics to assess the prediction performance of the various algorithms under different molecular representations. The comparison between our method and other methods [22] is shown in Table 1. The results indicate that the proposed method exhibits lower prediction errors, demonstrating excellent predictive performance. Building on this, we trained and tested the DL models we designed on the QM7 dataset [26] and collected information reflecting model performance, such as the number of model parameters, testing time, and testing errors. Experiments on this real-world dataset [26] show that NPN achieves state-of-the-art performance on the regression task. The performance of the different methods is compared in Table 2.
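For reference, the two evaluation metrics are straightforward to compute; a minimal sketch with helper names of our own choosing:

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over paired true/predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error over paired true/predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

RMSE weights large errors more heavily than MAE, so reporting both (as in Table 1) separates average accuracy from sensitivity to outlier molecules.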
Table 1. Comparison of prediction errors across multiple algorithms under different representation types using MAE and RMSE metrics.
Table 2. Performance comparison of the different networks.
In previous experiments [14], researchers often concatenated the eigenvalues/eigenvectors and feature centers of the CM with the flattened CM to form a 7165-sample dataset with N-dimensional features. However, integrating new molecular property information introduces a new problem: concatenating old and new features may lead to unwanted "heterogeneity" in the feature vectors, and including more features may actually add noise to the dataset [30]. It should be noted that some traditional ML techniques cannot identify meaningful patterns in these newly added features, so merging new features may yield poorer results [31].
We present an analysis of the AE predictions for four different molecules and their isomers, as shown in Fig. 3. Based on the structural information, our proposed method achieved good AE predictions by using the extended CM to characterize the molecules. However, we also found that the model shows a significant prediction error for molecules with fewer atoms [32], which may be due to the additional information in the extended CM interfering with the model's learning of structural features. For some molecules with compact rather than spatially sparse structures, the prediction errors were also relatively large; we attribute this to the dense distribution of many identical atoms in the molecule, which hampers the model's extraction of the CM features.
Fig. 3. Visualization of different conformations of four molecules and comparison of their AE prediction results. The AEs predicted by our model are shown in parentheses; all AEs are in kcal/mol. Our model has a prediction error of about 0.2–3 kcal/mol for the AE. Data marked in blue indicate an error between 5 and 10 kcal/mol, while data marked in red indicate an error over 10 kcal/mol.
To validate the robustness and effectiveness of our proposed method, we randomly divided the BC2P data from Fu et al. [33] into a training dataset and a test dataset with an 8:2 ratio. We trained a model on the training data using the proposed method, and the trained model accurately predicts the corresponding energies [33]. The training loss curve is shown in Fig. 4. On the test data, the model had an average prediction time of 0.391 ms for the energy of each molecular conformation and an average test loss of 0.236 eV/atom. The performance evaluation of the NPN model on BC2P is shown in Table 2.
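The random 8:2 split used here and for QM7 can be sketched as follows (the function name train_test_split is ours; a fixed seed makes the split reproducible):

```python
import random

def train_test_split(items, test_frac=0.2, seed=0):
    """Randomly partition a dataset into training and test subsets."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)        # reproducible shuffle
    n_test = int(round(test_frac * len(items)))
    test_idx = set(idx[:n_test])
    train = [items[i] for i in idx if i not in test_idx]
    test = [items[i] for i in idx[:n_test]]
    return train, test
```

Every sample lands in exactly one subset, so no conformation is seen both during training and evaluation.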
Fig. 4. Training loss curve of our model on the BC2P data.
This study proposes and validates a method for predicting atomization energies. We first introduced the extended CM representation of molecular structure along with its inherent characteristics. We then compared the performance of various ML and DL models for predicting AEs. These models not only enable fast AE prediction but can also expedite the structure prediction process. To overcome the limitations of the current method, we plan to incorporate the three-dimensional spatial pattern of material structure information in future research and to employ the graph convolutional network (GCN) method to mathematically represent atoms in chemical molecules. In addition to model development, we discussed the results of the numerical experiments conducted in this study; the comparative experiments and analyses provide insight into the performance and potential applications of the proposed method. Potential applications include catalyst design, materials discovery, optimization of energy storage and conversion, and material performance prediction. In the future, we envision further in-depth research on material performance prediction and materials discovery: by applying machine learning and data mining techniques to large datasets, hidden correlations within complex data can be uncovered, allowing new material structures and their corresponding properties to be predicted. Ultimately, this research contributes to the advancement of materials science and opens up new possibilities for designing innovative materials with tailored properties.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Nos. 61671362 and 62071366).
[1] G. Montavon, M. Rupp, V. Gobre, et al., New J. Phys. 15 (2013) 095003. DOI: 10.1088/1367-2630/15/9/095003
[2] E. Tuca, G. DiLabio, A. Otero-de-la-Roza, J. Chem. Inf. Model. 62 (2022) 4107-4121. DOI: 10.1021/acs.jcim.2c00656
[3] D. Yang, L. Wang, P. Yuan, et al., Chin. Chem. Lett. 34 (2023) 107964. DOI: 10.1016/j.cclet.2022.107964
[4] Q. Tong, L. Xue, J. Lv, et al., Faraday Discuss. 211 (2018) 31-43. DOI: 10.1039/C8FD00055G
[5] P. Gao, S. Wang, J. Lv, et al., RSC Adv. 7 (2017) 39869-39876. DOI: 10.1039/C7RA07461A
[6] X. Wei, C. Zhou, X. Shen, et al., J. Jilin Univ. 51 (2021) 667-676.
[7] L. Torres, J.P. Arrais, B. Ribeiro, Neural Comput. Appl. 35 (2023) 13167-13185. DOI: 10.1007/s00521-023-08403-5
[8] M. Sumanto, M.A. Martoprawiro, A.L. Ivansyah, J. Phys. Conf. Ser. 2072 (2021) 012005. DOI: 10.1088/1742-6596/2072/1/012005
[9] G.A. Pinheiro, J. Mucelini, M.D. Soares, et al., J. Phys. Chem. A 124 (2020) 9854-9866. DOI: 10.1021/acs.jpca.0c05969
[10] M. Tsubaki, T. Mizoguchi, Phys. Rev. Lett. 125 (2020) 206401. DOI: 10.1103/PhysRevLett.125.206401
[11] P. Reddy, A.B. Bhattacherjee, Mach. Learn. Sci. Technol. 2 (2021) 025019. DOI: 10.1088/2632-2153/abd486
[12] N. Dandu, L. Ward, R.S. Assary, et al., J. Phys. Chem. A 124 (2020) 5804-5811. DOI: 10.1021/acs.jpca.0c01777
[13] L. Ward, B. Blaiszik, I. Foster, et al., MRS Commun. 9 (2019) 891-899. DOI: 10.1557/mrc.2019.107
[14] K. Hansen, G. Montavon, F. Biegler, et al., J. Chem. Theory Comput. 9 (2013) 3404-3419. DOI: 10.1021/ct400195d
[15] A. Jussupow, V.R.I. Kaila, J. Chem. Theory Comput. 19 (2023) 1965-1975. DOI: 10.1021/acs.jctc.2c01027
[16] B. Selvaratnam, R.T. Koodali, P. Miró, J. Phys. Chem. 60 (2020) 1928-1935.
[17] M. Rupp, A. Tkatchenko, K.R. Müller, et al., J. Cheminf. 4 (2012) 1. DOI: 10.1186/1758-2946-4-1
[18] A.K. Gupta, K. Raghavachari, J. Chem. Theory Comput. 18 (2022) 2132-2143. DOI: 10.1021/acs.jctc.1c00504
[19] E.M. Collins, K. Raghavachari, J. Comput. Chem. 19 (2023) 2804-2810.
[20] M. Kilgour, J. Rogal, M. Tuckerman, J. Chem. Theory Comput. (2023). arXiv: 2303.10140
[21] H. Wang, X. Shi, D.Y. Yeung, Adv. Neural Inf. Process. Syst. (2016) 118-126.
[22] G. Montavon, K. Hansen, S. Fazli, et al., Adv. Neural Inf. Process. Syst. 1 (2012) 449-457.
[23] J. Hoja, L. Medrano Sandonas, B.G. Ernst, et al., Sci. Data 8 (2021) 43.
[24] D.P. Kingma, J. Ba, arXiv: 1412.6980 [cs.LG] (2014).
[25] C. Chu, Q. Xiao, Y. Zhang, et al., Int. J. Pattern Recognit. Artif. Intell. 36 (2022) 2250036. DOI: 10.1142/S0218001422500367
[26] M. Rupp, A. Tkatchenko, K.R. Müller, et al., Phys. Rev. Lett. 108 (2012) 058301. DOI: 10.1103/PhysRevLett.108.058301
[27] M. Krykunov, T.K. Woo, J. Chem. Theory Comput. 14 (2018) 5229-5237. DOI: 10.1021/acs.jctc.8b00788
[28] S. Laghuvarapu, Y. Pathak, U.D. Priyakumar, J. Comput. Chem. 41 (2020) 790-799. DOI: 10.1002/jcc.26128
[29] A.B. Tchagang, J.J. Valdés, Prediction of the Atomization Energy of Molecules Using Coulomb Matrix and Atomic Composition in a Bayesian Regularized Neural Networks, ICANN, 2019, pp. 793-803.
[30] S. Wengert, G. Csányi, K. Reuter, et al., J. Chem. Theory Comput. 18 (2022) 4586-4593. DOI: 10.1021/acs.jctc.2c00343
[31] Y.H. Tang, W.A. de Jong, J. Chem. Phys. 150 (2019) 044107. DOI: 10.1063/1.5078640
[32] M. Tsubaki, T. Mizoguchi, J. Chem. Theory Comput. 17 (2021) 7814-7821. DOI: 10.1021/acs.jctc.1c00568
[33] X. Fu, X. Cheng, C. He, et al., Phys. Chem. Chem. Phys. 25 (2023) 2430-2438. DOI: 10.1039/D2CP04941D
2024, Vol. 35 