Critical Properties and Acentric Factors of Pure Compounds Modelling Based on QSPR-SVM with Dragonfly Algorithm

This work aimed to model the critical temperature, critical pressure, critical volume, and acentric factor of 6700 pure compounds using five relevant molecular descriptors and two thermodynamic properties as inputs. To that end, four methods were used to model each property: multi-linear regression (MLR), artificial neural networks (ANNs), support vector machines (SVMs) trained with sequential minimal optimisation (SMO), and a hybrid of SVM with the Dragonfly optimisation algorithm (SVM-DA). The hybrid SVM-DA gave better prediction performance than the other models, with an average absolute relative deviation (AARD%) of {0.7551, 1.962, 1.929, and 2.173} and R² of {0.9699, 0.9673, 0.9856, and 0.9766} for critical temperature, critical pressure, critical volume, and acentric factor, respectively. The developed models can be used to estimate the properties of newly designed compounds only from their molecular structure.


Introduction
Physical properties of chemical compounds, such as the critical properties and the acentric factor, are important in designing and modelling chemical processes [1] such as supercritical extraction, and in the application of different equations of state. Experimental determination of these properties is costly, time-consuming, and sometimes even impossible, given that some larger or higher molecular weight compounds may chemically degrade before they reach critical conditions. [1] Therefore, different prediction methods have been published in the literature to compute these critical properties when experimental values are not available. To date, two approaches have been proposed to estimate critical properties: group contribution models and quantitative structure-property relationship (QSPR) models. Since group contribution based models have some limitations, [2] the QSPR approach has been used to estimate physical properties solely from molecular descriptors, which are computed by applying certain mathematical algorithms to the molecular structure of the compounds. [2-4] Several techniques have been employed to estimate these properties, namely group contribution from molecular structures [5,6] and multi-linear regression based on QSPR descriptors. [1] In addition, computational techniques such as artificial neural networks (ANNs), [7,8] adaptive neuro-fuzzy inference systems (ANFIS), and support vector machines (SVMs) have been successfully used to estimate these properties of different compounds from descriptors. [1,9-12] SVM is a well-known intelligent learning method and a powerful predictive tool for fitting the non-linear behaviour of a system. [13] This approach has great modelling capability when large datasets are available and the relationships between parameters are complicated. [14] In addition, the SVM approach, originally developed by Cortes and Vapnik, [15] has been identified as a technique with few parameters, a rapid calculation process for finding the best tuning parameters, and robust performance in overcoming the overfitting problem compared with traditional intelligent learning algorithms.
Therefore, the novelty of this work lies in the use of a hybrid of SVM and the Dragonfly optimisation algorithm (DA) to model the critical properties and acentric factors of pure compounds based on a large and representative dataset.

Materials and methods
In this work, a dataset of 6700 chemical compounds was collected from previously published papers and books. [1,16] Descriptors were computed from the simplified molecular-input line-entry system (SMILES) representation of each compound using the alvaDesc software, which computes 5305 descriptors (i.e., 0D, 1D, 2D, and 3D descriptors). [17] The relevant descriptors were pre-selected in two steps. The first selection was performed in alvaDesc by reducing the descriptor set as follows:
• Constant and near-constant descriptors were removed;
• Descriptors with at least one missing value were removed;
• Descriptors with a relative standard deviation < 0.001 were removed;
• Descriptors with a pair correlation ≥ 0.95 were removed.
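The first pre-selection step can be illustrated with a minimal Python sketch. It is not the alvaDesc implementation: the function names, the toy data, and the rule of keeping the first descriptor of each highly correlated pair are assumptions made for illustration only.

```python
import math

def relative_std(values):
    """Relative standard deviation |std / mean| (0 for a constant column)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        return 0.0
    return math.inf if mean == 0 else abs(std / mean)

def pearson(a, b):
    """Pearson correlation coefficient of two equally long, non-constant columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def prefilter(descriptors, rsd_min=0.001, corr_max=0.95):
    """descriptors: dict mapping a name to a list of values (None = missing)."""
    kept = {}
    for name, values in descriptors.items():
        if any(v is None for v in values):
            continue  # at least one missing value
        if relative_std(values) < rsd_min:
            continue  # constant or near-constant descriptor
        # Assumption: of two highly correlated descriptors, keep the first seen.
        if any(abs(pearson(values, other)) >= corr_max for other in kept.values()):
            continue
        kept[name] = values
    return list(kept)

# Hypothetical toy descriptors for four compounds.
demo = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.1],   # almost perfectly correlated with A
    "C": [5.0, 5.0, 5.0, 5.0],   # constant
    "D": [1.0, None, 3.0, 4.0],  # contains a missing value
    "E": [4.0, 1.0, 3.0, 2.0],
}
selected = prefilter(demo)
```

In this toy case only descriptors A and E survive the four rules.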
The second step consisted of using stepwise MLR as a final screening to select only the relevant descriptors. Five descriptors remained after the final screening: {Sp: sum of atomic polarizabilities (scaled on carbon atom), Mp: mean atomic polarizability (scaled on carbon atom), GD: graph density, H%: percentage of H atoms, and C%: percentage of C atoms}. Each property was therefore modelled with seven inputs: the five descriptors, the boiling temperature (Tb), and the molecular weight (Mw).
The performance of the SVM method for regression depends mainly on several parameters and on the kernel function.
Since the relationship between the selected molecular descriptors and the modelled properties is non-linear, the Gaussian radial basis function (RBF), a commonly used kernel for regression tasks, was adopted for its good general performance; the method is well explained in ref. [18]
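For illustration, the Gaussian RBF kernel and the kernel expansion that an SVM regressor evaluates at prediction time can be sketched in Python as follows. The convention K(x, z) = exp(−γ‖x − z‖²) is an assumption: some software instead divides the inputs by a kernel scale before applying the kernel, so the γ values reported in this work may follow a different parametrisation. The support vectors, coefficients, and bias in the usage lines are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel, assuming the form exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma=1.0):
    """Kernel expansion: sum_i coef_i * K(sv_i, x) + bias."""
    return sum(
        c * rbf_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, dual_coefs)
    ) + bias

# Hypothetical two-input model with two support vectors.
svs = [[0.0, 0.0], [1.0, 1.0]]
coefs = [0.8, -0.3]
y_hat = svr_predict([0.0, 0.0], svs, coefs, bias=0.1, gamma=0.5)
```

A larger γ makes the kernel more local, so each support vector influences only nearby points.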

Description of SVM and DA method
SVMs are a supervised learning method presented for the first time in 1992 by B. E. Boser et al. [19] A few years later, they were introduced by C. Cortes and V. Vapnik [15] for classification cases. In 1995 they were expanded and adjusted to regression problems by V. Vapnik [20] to map the non-linear relationship between inputs and outputs: the regression estimation function is constructed in a high-dimensional feature space Φ(x) and mapped back to the original space through a suitable kernel function K, where $x_i$ is the input value, $y_i$ is the output value, i = 1,…,m, and m is the size of the learning dataset D = {($x_i$, $y_i$)}. [21] The regression function is given by
$$f(x) = W^{T}\Phi(x) + b$$
where W is the weight vector and b is the bias. [22] The major advantages of the SVM model are the absence of local minima during the training stage, its ability to handle non-linear regression, the sparsity of its solution, and its use of kernel functions. [23] More details about the SVM optimisation problem and the kernel functions can be found in refs. [24-26]
A novel metaheuristic optimisation algorithm named Dragonfly (DA) was developed by S. Mirjalili [27] based on the behaviour of dragonflies. In this work, this algorithm was coupled with the SVM method to tune its hyper-parameters. [28]

Results and discussion

Critical temperature
The critical temperature values predicted by the developed MLR model were compared with the experimental values using two metrics, the average absolute relative deviation (AARD%) and the determination coefficient (R²). Fig. 1(a) shows the performance of the MLR model in terms of {AARD% = 3.376 % and R² = 0.9104}. In this figure, the red solid line indicates the exact fit between the experimental and predicted values, whereas the black circles represent the training points and the blue squares the test points. The closer the points lie to the solid line, the more accurate the predicted values.
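The two metrics can be computed as in the short Python sketch below. AARD% is taken as 100/N times the sum of |predicted − experimental| / |experimental|. For R², the form 1 − SS_res/SS_tot is an assumption, since the squared correlation coefficient is another common choice and the exact definition used is not stated here. The sample values are hypothetical.

```python
def aard_percent(y_exp, y_pred):
    """Average absolute relative deviation, in percent."""
    n = len(y_exp)
    return 100.0 / n * sum(abs((p - e) / e) for e, p in zip(y_exp, y_pred))

def r_squared(y_exp, y_pred):
    """Determination coefficient, assuming R^2 = 1 - SS_res / SS_tot."""
    mean = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot

# Hypothetical experimental and predicted critical temperatures (K).
tc_exp = [500.0, 600.0, 700.0]
tc_pred = [505.0, 594.0, 707.0]
aard = aard_percent(tc_exp, tc_pred)
r2 = r_squared(tc_exp, tc_pred)
```

Because AARD% divides by the experimental value, properties with values close to zero, such as small acentric factors, can inflate it even when absolute errors are small.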
The equation of this model is written as follows:
Secondly, the parameters of the best ANN model were determined by trial and error. The best multi-layer perceptron feed-forward back-propagation neural network trained with the Levenberg-Marquardt algorithm (MLP-ANN) was obtained with an architecture of {7, 15, 1} neurons in the input layer, hidden layer, and output layer, respectively.
Hyperbolic tangent (f) and linear (g) transfer functions were used in the hidden layer and the output layer, respectively. The regression curve is presented in Fig. 1(b), which shows the performance of the ANN model in terms of {AARD% = 2.639 % and R² = 0.9450}. Its expression can be written as follows:
$$y = g\left(\sum_{j=1}^{15} W^{S}_{1,j}\, f\left(\sum_{i=1}^{7} W^{I}_{j,i}\, X_i + b_{hj}\right) + b_{O1}\right)$$
where $X_i$ is the input vector, $W^{I}_{j,i}$ and $W^{S}_{1,j}$ are the input-hidden layer and hidden-output layer weight matrices, and $b_{hj}$ and $b_{O1}$ are the biases of the hidden and output layers, respectively.
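As an illustration of how a network with this architecture produces a prediction, the following Python sketch evaluates one forward pass through a {7, 15, 1} network with a hyperbolic tangent hidden layer and a linear output. The weights, biases, and input values are random placeholders, not the fitted parameters of this work, and all names are hypothetical.

```python
import math
import random

def mlp_forward(x, w_in, b_hidden, w_out, b_out):
    """One hidden layer: tanh transfer in the hidden layer, linear output."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_in, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

random.seed(0)
n_inputs, n_hidden = 7, 15

# Placeholder parameters; a trained model would supply these.
w_in = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [random.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [random.uniform(-1, 1) for _ in range(n_hidden)]
b_out = 0.0

x_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]  # seven scaled inputs (hypothetical)
y_demo = mlp_forward(x_demo, w_in, b_hidden, w_out, b_out)
```

In practice the inputs and the output are normally scaled before training, so a prediction has to be mapped back to physical units afterwards.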
A third model was developed using the SVM approach. The best model was obtained by optimising its parameters, namely the capacity (box constraint, C), the kernel width parameter (kernel scale, γ), the quantity of support vectors (QSV), and the kernel function. The most accurate model for the critical temperature was found with the RBF kernel function and {C = 10, γ = 1, and QSV = 4008}. Its performance was {AARD% = 1.224 % and R² = 0.9375}, as reported in Fig. 1(c).
The last model was a hybrid of SVM with the recent Dragonfly optimisation algorithm, implemented in MATLAB. The graphical comparison between the predicted and the experimental critical temperatures was conducted using the postreg MATLAB function. Based on the results presented in Fig. 1(d), the SVM-DA model reached {AARD% = 0.7551 % and R² = 0.9699} for the critical temperature.

Critical pressure
For the ANN, the best structure was {7, 15, 1} neurons in the input layer, hidden layer, and output layer, respectively. The regression comparison is depicted in Fig. 2(b), where the performance of the ANN model was found to be {AARD% = 5.897 % and R² = 0.911}. The best SVM model for the critical pressure was obtained with the RBF kernel function and {C = 10, γ = 1, and QSV = 4990}. The accuracy of this model was {AARD% = 2.575 % and R² = 0.9491}, and the regression plot is presented in Fig. 2(c).
The graphical comparison presented in Fig. 2(d) between the SVM-DA predicted and the experimental values of the critical pressure, together with the calculated statistical parameters, shows the performance of this model in terms of {AARD% = 1.962 % and R² = 0.9673}. This model was found with an RBF kernel function and {C = 10, γ = 0.5, and QSV = 4785}, and showed a higher capability of predicting the critical pressure than the other models tested in this study.

Acentric factor
The acentric factor (ω) was modelled using the best MLR model with a performance of {AARD% = 23.76 % and R² = 0.232}, as presented in Fig. 4(a). The equation of this model is written as follows:
The best ANN model for the acentric factor gave a performance of {AARD% = 9.733 % and R² = 0.890}, as presented in Fig. 4(b). This model was optimised with a structure of {7, 20, 1} neurons in the input layer, hidden layer, and output layer, respectively. Hyperbolic tangent (f) and linear (g) transfer functions were used in the hidden layer and the output layer, respectively.
The best SVM model for the acentric factor gave a performance of {AARD% = 6.140 % and R² = 0.8725}, as presented in Fig. 4(c). The most accurate model was found with the RBF kernel function and {C = 10, γ = 1, and QSV = 4349}.
The best SVM-DA model for the acentric factor gave a performance of {AARD% = 2.173 % and R² = 0.9766}, as presented in Fig. 4(d). This SVM-DA model was found with the RBF kernel function and {C = 100, γ = 0.5, and QSV = 3032}. As may be seen in Fig. 5, the SVM-DA model was able to predict the selected properties with high accuracy in comparison to the other models. Therefore, the sensitivity analysis, the applicability domain, and the statistical evaluation and validation were carried out on the obtained SVM-DA models.

Applicability domain of the developed models
In this study, the applicability domain is presented using a Williams plot. This plot shows the standardised residuals against the leverage values (h). The leverage of the i-th compound ($h_i$) is defined as follows:
$$h_i = X_i^{T}\left(X^{T}X\right)^{-1}X_i$$
where X is the descriptor matrix of the training set, $X_i$ is the descriptor vector of the compound considered, and the superscript T denotes the transpose. In this plot, the critical leverage ($h^{*}$) was computed according to the following equation:
$$h^{*} = \frac{3(P+1)}{n}$$
where P is the number of descriptors in the model and n is the number of molecular structures in the training set. The standardised residuals δ were calculated by:
$$\delta_i = \frac{y_i - \hat{y}_i}{\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(y_k - \hat{y}_k\right)^{2}}}$$
where $y_i$ and $\hat{y}_i$ are the experimental value and calculated value for the i-th compound, respectively.
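The quantities behind a Williams plot can be sketched in plain Python as follows. This is an illustration only: the small Gauss-Jordan inverse, the scaling of the residuals by their root mean square, the flagging rule (|δ| > 3 or h > h*), and the toy data are assumptions, and all function names are hypothetical.

```python
import math

def invert(a):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def leverages(X):
    """h_i = x_i^T (X^T X)^-1 x_i for every row x_i of X."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    inv = invert(xtx)
    return [sum(x[i] * inv[i][j] * x[j] for i in range(p) for j in range(p))
            for x in X]

def critical_leverage(n_compounds, n_descriptors):
    """h* = 3 (P + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_compounds

def standardised_residuals(y_exp, y_pred):
    """Residuals divided by their root mean square (assumed scaling)."""
    res = [e - p for e, p in zip(y_exp, y_pred)]
    rms = math.sqrt(sum(r * r for r in res) / len(res))
    return [r / rms for r in res]

def flag_outliers(h, delta, h_star, limit=3.0):
    """True where a point falls outside the assumed applicability domain."""
    return [abs(d) > limit or hi > h_star for hi, d in zip(h, delta)]

# Hypothetical toy data: four compounds, two columns.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
h = leverages(X)
delta = standardised_residuals([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.1, 3.9])
flags = flag_outliers(h, delta, critical_leverage(len(X), len(X[0])))
```

A useful sanity check is that the leverages sum to the number of columns of X when X has full column rank.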
Outliers are data points whose standardised residuals fall outside the range of ±3 standard deviation units or whose leverage is higher than h*. As may be observed in Fig. 6, the outliers represent only about 4.791, 4.382, 4.772, and 5.492 % of the total compounds in the dataset for Tc, pc, Vc, and ω, respectively. The developed models were capable of predicting the majority of the experimental data with standardised residuals within ±3. A few points lie outside the applicability domain, with larger values of h or of the standardised residual. However, the presence of outliers does not mean that the prediction capability of the model is poor; the AARD% and R² are the measures of the predictive ability of the SVM-DA model. Fig. 7 illustrates the absolute relative deviation (ARD%) of the modelled properties. The results indicate that these models made good predictions, yielding a very low error for almost all of the dataset, and support the view that the selected descriptors for each model were not chosen by chance.

Global sensitivity analysis
The objective of the global sensitivity analysis is to assess the effect of the inputs on each output using the cosine amplitude method (CAM). [29-31] This method is based on the calculation of the relevance factor ($r_{ij}$), which measures the strength of the relationship between the output $x_j$ and the input $x_i$. This factor can be represented by the following expression: [26]
$$r_{ij} = \frac{\sum_{k=1}^{m} x_{ik}\,x_{jk}}{\sqrt{\sum_{k=1}^{m} x_{ik}^{2}\,\sum_{k=1}^{m} x_{jk}^{2}}}$$
The data pairs used build a data array X, defined as X = {$X_1$, $X_2$, $X_3$, ..., $X_m$}, where each $X_i$ is a vector of length m, expressed as $X_i$ = {$x_{i1}$, $x_{i2}$, $x_{i3}$, ..., $x_{im}$}. Here i and k are the dimensions of the input matrix (i = 1 to 6700, k = 1 to m), j represents the dimension of the output vector (j = 1 to 6700), and m is the number of input parameters (m = 7). Fig. 8 shows the importance of the selected inputs for each output. All parameters have approximately the same effect on each output.
The best SVM-DA model was implemented in a user-friendly graphical user interface designed in MATLAB and presented in Fig. 10. This interface can compute the desired outputs from the selected inputs alone for the 6700 systems.
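The relevance factor used in the sensitivity analysis can be sketched in Python as below. It is the cosine of the angle between two data vectors, one for an input and one for an output, taken over the same set of compounds. The function name and the sample vectors are hypothetical.

```python
import math

def relevance_factor(x_in, x_out):
    """Cosine amplitude: sum(x*y) / sqrt(sum(x^2) * sum(y^2))."""
    num = sum(a * b for a, b in zip(x_in, x_out))
    den = math.sqrt(sum(a * a for a in x_in) * sum(b * b for b in x_out))
    return num / den

# Hypothetical values of one input and one output for five compounds.
boiling_t = [350.0, 400.0, 450.0, 500.0, 550.0]
critical_t = [520.0, 590.0, 650.0, 720.0, 780.0]
r = relevance_factor(boiling_t, critical_t)
```

For strictly positive data the factor lies between 0 and 1 and tends to be close to 1 for every input, which is worth bearing in mind when the inputs appear to have nearly equal influence.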

Conclusion
The goal of this study was to test the applicability of four computational methods in modelling the critical properties and acentric factors of 6700 pure compounds.
Four models were designed, namely, MLR, ANN, SVM, and SVM-DA. The experimental dataset of the properties was collected from the literature, while the input matrix consisted of five descriptors and two thermodynamic properties.
All models were assessed statistically in terms of AARD% and R². The results showed that the SVM-DA achieved a very low deviation (AARD%) of {0.7551, 1.962, 1.929, and 2.173} and a high determination coefficient R² of {0.9699, 0.9673, 0.9856, and 0.9766} for critical temperature, critical pressure, critical volume, and acentric factor, respectively. It can be concluded that the SVM-DA model is capable of establishing a satisfactory non-linear relationship between the molecular descriptors and the critical properties of pure compounds with high accuracy. Based on the SVM-DA model, a sensitivity analysis was performed to assess the effects of the inputs on each output. The results showed that all inputs had a strong effect on the output, and their choice was based on their contribution to the studied phenomenology. The applicability domain of the model was also examined against the dataset used, and the results show that the SVM-DA model can be applied to fit almost 95 % of the data used, which indicates the capability of the model for the studied properties in comparison to the other models. Finally, to facilitate the use of this model, a user-friendly graphical user interface was designed in MATLAB so that the model can be used without knowing the details of the phenomenology or of the software.

The authors thank the anonymous reviewers for their constructive comments, which helped to improve the quality and presentation of this paper.