Artificial Neural Network and Support Vector Regression Applied in Quantitative Structure-property Relationship Modelling of Solubility of Solid Solutes in Supercritical CO2

In this study, the solubility of 145 solid solutes in supercritical CO2 (scCO2) was correlated using computational intelligence techniques based on Quantitative Structure-Property Relationship (QSPR) models. A database of 3637 solubility values was collected from previously published papers. Dragon software was used to calculate the molecular descriptors of the 145 solid systems. A genetic algorithm (GA) was implemented to optimise the subset of significantly contributing descriptors. The support vector regression (SVR)-QSPR model gave an overall average absolute relative deviation (MAARD) of about 1.345 % between experimental and calculated solubility values for the 145 solid solutes in scCO2, which is better than the 2.772 % obtained with the ANN-QSPR model. The results show that the developed SVR-QSPR model is more accurate and can be used as a powerful alternative modelling tool for QSPR studies of the solubility of solid solutes in scCO2. The accuracy of the proposed model was evaluated by statistical analysis and by comparison with other models reported in the literature.


Introduction
Supercritical carbon dioxide (scCO2) is widely used in separation processes in the food, chemical, pharmaceutical, and other industries. The design and optimisation of extraction, fractionation, and purification processes rely mainly on knowledge of the solubility of solid solutes in scCO2 [1]. The experimental measurement of the solubility (a thermo-physical property) of such compounds in scCO2 is laborious and costly [2]. To avoid expensive and tedious experiments, and to compensate for the lack of solubility data and/or of the pure-component property data required to estimate solubility, flexible and robust predictive models are needed that estimate the solubility of solid solutes in supercritical solvents from limited information. The literature reports three major approaches to modelling the solubility of a solid solute in scCO2: equations of state [3], semi-empirical equations [2], and computational intelligence techniques such as artificial neural networks (ANN) and support vector machines (SVM) [4]. The solid solute solubility in scCO2 is mostly modelled using density-based correlations, such as that in [5]. Moreover, the multilayer perceptron ANN (MLP-ANN) has also been used successfully to model the solid solute solubility in scCO2 based only on thermodynamic parameters [6]. The least-squares support vector machine (LS-SVM) may be considered an alternative to classical approaches, such as semi-empirical correlations, for modelling the solid solute solubility in scCO2 [7]. The most common type of ANN model in current use is the multilayer perceptron (MLP) feedforward neural network (FNN) trained by the back-propagation (BP) algorithm [8]. This ANN type has also been used successfully to map complex, highly nonlinear input/output relationships between one dependent variable and several independent variables of any system.
However, ANNs have several limitations, for instance the difficulty of choosing the number of hidden neurons and layers, as well as over-fitting [9]. SVM avoids some of these problems because it is formulated as a quadratic optimisation problem, which ensures a global optimal solution. Some studies have reported the superiority of SVM over ANN [10]. The SVM method has found a wide range of engineering applications in forecasting and regression analysis owing to its attractive features and remarkable generalisation performance [11]. Therefore, the present study was carried out to evaluate the predictive performance of computational intelligence techniques, namely MLP-FNN as the most widely used neural network type and SVM, in predicting the solubility of 145 different solid solutes in scCO2 from a combination of thermodynamic properties and molecular descriptors. The novelty of the developed model compared with previous models is its ability to extrapolate or interpolate the solubility of several solid solutes in scCO2 with a single set of fitted parameters obtained during the training stage. In addition, the independent parameters were a combination of thermodynamic properties and structural descriptors. Moreover, a simple and user-friendly graphical user interface was designed to compute the solubility without requiring knowledge of MATLAB. Finally, a sensitivity analysis was performed by calculating the relative importance of each input parameter on the output.

Methodology
Developing an ANN-QSPR or SVR-QSPR model requires several steps, as outlined below (Fig. 1) [12]. Before the sample data were given to either machine learning model, the relevant descriptors of each system were calculated and optimised, and added to the thermodynamic parameters of each system. The data were then scaled to reduce the problems of noise and correlated data. Next, the data were split into a training set and a test set. After the architecture of each model had been fixed, its parameters were optimised during the training phase. The next phase was to evaluate the performance of the optimised models using data not used during the training stage. Lastly, statistical parameters were calculated to assess the accuracy of each model and to validate it for further prediction of solid solute solubility in scCO2. The modelling procedure was run in MATLAB R2018a on Windows 7.

Data preparation
The reliability and effectiveness of any model depend strongly on the availability of accurate experimental data. The data required for training and testing the computational intelligence techniques were therefore extracted from previous experimental investigations in the literature. Table 1 lists several properties of the 145 selected binary solute-scCO2 systems, which comprise 3637 experimental solubility data points in scCO2. These properties include the name of the corresponding drug, the molecular weight, the working temperature and pressure ranges, the experimental solubility, and the references for each of the 145 selected systems.

Determination of descriptors
Quantitative Structure-Property Relationship (QSPR) models can also be used for property estimation, since they establish quantitative correlations between diverse molecular properties and the chemical structure [3]. A QSPR model consists of a mathematical relationship between the property or activity and several molecular descriptors, such as structural/topological indices or electronic/quantum-chemical properties. One of the most important steps in the construction of QSPR models is the quantification of the structural information of the studied molecules [104] in the form of so-called molecular descriptors. Currently, there exist more than 11145 molecular descriptors that can be used to solve problems in different fields, such as chemistry, biology, and other related sciences [105]. In this study, 1666 molecular descriptors were calculated online using the E-Dragon 1.0 software [106]. The computation of these descriptors was preceded by a preliminary step, i.e., the generation of simplified molecular-input line-entry system (SMILES) strings using the ChemDraw software. After the molecular descriptors had been calculated, they were filtered as follows: constant and near-constant descriptors were removed, descriptors with one or more missing values were removed, any descriptor with a relative standard deviation < 0.001 was removed, and descriptors with a pair correlation ≥ 0.95 were removed. Finally, the remaining set of 641 descriptors was reduced as far as possible using the stepwise regression method, to select only the relevant subset of descriptors that contribute significantly to the solubility of solid solutes. After this pre-treatment, three pertinent descriptors were obtained for each system to build the ANN and SVR models [107] (Table 2).
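The descriptor pre-treatment described above can be illustrated with a minimal Python sketch. It is not the procedure actually run in the study: the function names (`rsd`, `pearson`, `prefilter`), the toy descriptor table, and the rule that the first descriptor of a highly correlated pair is the one retained are all assumptions made for illustration. Only the two thresholds (0.001 and 0.95) come from the text.

```python
import math

def rsd(values):
    """Relative standard deviation (population std / |mean|); 0 for a constant column."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return 0.0
    return std / abs(mean) if mean != 0 else math.inf

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length, non-constant columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def prefilter(descriptors, rsd_min=0.001, corr_max=0.95):
    """descriptors: dict name -> list of values (None marks a missing value)."""
    kept = {}
    for name, col in descriptors.items():
        if any(v is None for v in col):
            continue  # at least one missing value
        if rsd(col) < rsd_min:
            continue  # constant or near-constant descriptor
        # Assumption: of a correlated pair, the descriptor seen first is kept.
        if any(abs(pearson(col, other)) >= corr_max for other in kept.values()):
            continue
        kept[name] = col
    return kept

# Hypothetical toy table (5 descriptors, 4 compounds).
demo = {
    "D1": [1.0, 2.0, 3.0, 4.0],
    "D2": [2.0, 4.0, 6.0, 8.1],   # almost perfectly correlated with D1
    "D3": [5.0, 5.0, 5.0, 5.0],   # constant
    "D4": [1.0, None, 2.0, 3.0],  # missing value
    "D5": [4.0, 1.0, 3.0, 2.0],
}
selected = prefilter(demo)
print(sorted(selected))
```

The subsequent stepwise regression that reduces the surviving descriptors to the final three is not shown here.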

Statistical performance evaluation criteria
The internal predictive capability of the ANN and SVR models was evaluated by the root mean square error (RMSE), the average absolute relative deviation (AARD), the leave-one-out (LOO) cross-validation coefficient Q²LOO on the training set, and the coefficient of determination (COD, R²). The definitions of these measures are given in Table 3, where N represents the number of data points. The values of RMSE and AARD range from 0 to ∞, whereas R² measures the degree of agreement between the predicted and observed values. In general, the best value for R² and Q²LOO is 1, and the best value for RMSE and AARD is zero, which indicates high model performance.
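As an illustration of these criteria, the following Python sketch uses the commonly quoted forms of RMSE, AARD (in percent), R² and Q²LOO. These forms are assumed here and may differ in detail from the definitions in Table 3 (for example, in which mean is used in the Q²LOO denominator). All function names and the toy data are hypothetical, and a straight-line fit stands in for the actual ANN or SVR model in the leave-one-out loop.

```python
import math

def rmse(y_exp, y_calc):
    return math.sqrt(sum((c - e) ** 2 for e, c in zip(y_exp, y_calc)) / len(y_exp))

def aard_percent(y_exp, y_calc):
    return 100.0 / len(y_exp) * sum(abs((c - e) / e) for e, c in zip(y_exp, y_calc))

def r_squared(y_exp, y_calc):
    mean = sum(y_exp) / len(y_exp)
    ss_res = sum((e - c) ** 2 for e, c in zip(y_exp, y_calc))
    ss_tot = sum((e - mean) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot

def q2_loo(xs, ys, fit, predict):
    """Leave-one-out Q2: refit without point i, then predict point i."""
    mean = sum(ys) / len(ys)  # assumption: overall mean in the denominator
    press = 0.0
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - predict(model, xs[i])) ** 2
    return 1.0 - press / sum((y - mean) ** 2 for y in ys)

# Stand-in model (assumption): one-variable least-squares line.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

y_exp = [2.0, 4.0, 5.0]
y_calc = [2.2, 3.8, 5.0]
print(rmse(y_exp, y_calc), aard_percent(y_exp, y_calc), r_squared(y_exp, y_calc))

q2 = q2_loo([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], fit_line, predict_line)
print(q2)
```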

ANN-QSPR model results
Multi-layer perceptron networks are the most widely used feedforward artificial neural networks, and are considered a powerful nonlinear black-box model for learning complex nonlinear relationships between input and output variables. A multi-layer perceptron neural network consists of three types of connected layers: input, hidden, and output layers. The first layer receives the input data and sends them to the hidden layer. There, interconnected neurons apply weights, biases, and non-linear transfer functions, and the results from the hidden layer are sent to the output layer to produce the outputs [110]. The weights of the ANN model are modified by the training algorithm until the error criterion is satisfied. In this study, the Bayesian regularisation backpropagation algorithm was used to train the ANN models. An objective function, the mean squared error between the outputs of the network and the target (experimental) values, is used to determine the biases and weights of the feedforward neural network. To avoid fitting problems, the number of hidden neurons is usually determined by a trial-and-error methodology.
In this study, the number of hidden neurons was varied from 10 to 30, using one and two hidden layers. This procedure was repeated until the best precision was achieved for each architecture. The results showed that the ANN-QSPR model with two hidden layers performed better than the model with a single hidden layer. The input parameters for the proposed model were temperature, pressure, density of scCO2, critical temperature, critical pressure, acentric factor, and the three most accurate selected 2D descriptors {MPC2, WTPT-1, and AATS0i}, while the output was the logarithm of the mole fraction solubility.
Because the inputs of the ANN model have different magnitudes, the inputs and outputs of the network must be normalised (usually to between −1 and +1) to reduce the error and achieve fast convergence. The data set was randomly split into three subsets: 66 % for the training phase, 17 % for the testing phase, and 17 % for the validation phase of the ANN model.
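The scaling and splitting steps can be sketched as follows. This is an illustrative Python version, not the MATLAB code used in the study: the min-max formula, the function names, the random seed, and the rounding of the subset sizes are assumptions. Only the target range (−1 to +1), the 66/17/17 proportions, and the total of 3637 points come from the text.

```python
import random

def scale_minus1_plus1(column):
    """Linear min-max mapping of one variable onto [-1, +1]."""
    lo, hi = min(column), max(column)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in column]

def split_indices(n, fractions=(0.66, 0.17, 0.17), seed=0):
    """Random split of n sample indices into training, testing and validation subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = round(fractions[0] * n)
    n_test = round(fractions[1] * n)
    # Whatever remains after the first two subsets goes to validation.
    return idx[:n_train], idx[n_train:n_train + n_test], idx[n_train + n_test:]

# Hypothetical temperature column in K.
scaled = scale_minus1_plus1([308.0, 318.0, 328.0])
print(scaled)

train, test, val = split_indices(3637)
print(len(train), len(test), len(val))
```

In practice, the minimum and maximum are often taken from the training subset only and then reused for the other subsets, so that no information from held-out data enters the scaling; the text does not state which convention was followed.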
The best three-layer FFNN developed for the solubility has the structure 9-25-20-1. This means that the ANN model has nine independent variables (thermodynamic properties and descriptors), 25 neurons in the first hidden layer, 20 neurons in the second hidden layer, and one output, which is the solubility for each system. Table 4 shows the structure of the optimised ANN model. Hyperbolic tangent sigmoid (tansig) and linear transfer functions were used. In Fig. 2, the first bisector represents an exact fit between the correlated solubilities and the experimental data, whereas the cross points show the solubility values correlated by the proposed model against the experimental data. The closer the points are to the solid line, the more accurate the correlated solubility data.
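To make the 9-25-20-1 structure concrete, the sketch below builds a network of that shape and runs one forward pass. It is only an illustration: the weights are random rather than trained (no Bayesian regularisation is performed), and the use of a hyperbolic tangent in both hidden layers with a linear output neuron is an assumption about how the transfer functions are assigned. The names `init_network`, `forward` and `count_parameters` are hypothetical.

```python
import math
import random

LAYERS = [9, 25, 20, 1]  # inputs, hidden layer 1, hidden layer 2, output

def init_network(layer_sizes, seed=0):
    """Random (untrained) weights and biases for a fully connected network."""
    rng = random.Random(seed)
    net = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        net.append((weights, biases))
    return net

def forward(net, x):
    """Forward pass: tanh in the hidden layers, linear output (assumed assignment)."""
    a = list(x)
    for k, (weights, biases) in enumerate(net):
        z = [sum(w * v for w, v in zip(row, a)) + b for row, b in zip(weights, biases)]
        is_output = k == len(net) - 1
        a = z if is_output else [math.tanh(v) for v in z]
    return a[0]

def count_parameters(net):
    return sum(len(w) * len(w[0]) + len(b) for w, b in net)

net = init_network(LAYERS)
x_scaled = [0.0] * 9          # nine inputs, already scaled to [-1, +1]
y = forward(net, x_scaled)    # untrained output, shown only for the data flow
n_params = count_parameters(net)
print(n_params)
```

A network of this shape has 9·25 + 25 + 25·20 + 20 + 20·1 + 1 = 791 adjustable weights and biases, all shared across the 145 systems.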
The experimental solubility is plotted against the calculated solubility on a logarithmic scale in Fig. 2. On the basis of the obtained results, it can be concluded that the ANN-QSPR model was able to correlate the solubility of solids in scCO2 with an acceptable deviation (Table 5).

SVR modelling
Support vector regression (SVR) has found many applications in different regression problems [111]. This approach was first proposed by Vapnik [112]. The quality of an SVR model depends on the proper setting of the SVR parameters for a given data set. The selection of these parameters (the kernel width parameter σ, the penalty parameter C, and the error-accuracy parameter ϵ) is investigated in this section.
The same inputs as for the ANN model were adopted to build the SVR model. The data set was randomly split into two subsets: 66 % for the training phase and 33 % for the testing phase of the SVR model. The optimisation procedure was repeated several times in order to find the most probable global optimum of the fitness function.
In this study, the radial basis function (RBF) was selected because it is the kernel function most commonly used in the literature; details of the other selected parameters are summarised in Table 6.
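The roles of σ and ϵ can be illustrated with the basic building blocks of ϵ-SVR with an RBF kernel. The sketch below is written under stated assumptions: the kernel is taken in the form exp(−‖x − z‖² / (2σ²)), which is one common convention and may not be the one behind Table 6, and the support vectors, dual coefficients and bias are made-up numbers rather than fitted values. Training (solving the quadratic programme in which C weights the penalty on deviations larger than ϵ) is not shown.

```python
import math

def rbf_kernel(x, z, sigma):
    """RBF kernel, assumed form exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors smaller than eps cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - eps)

def svr_predict(x, support_vectors, dual_coefs, bias, sigma):
    """Trained SVR prediction: weighted sum of kernels over the support vectors."""
    return sum(c * rbf_kernel(sv, x, sigma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical one-dimensional example with two support vectors.
pred = svr_predict([0.0], [[0.0], [1.0]], [1.0, -1.0], bias=0.5, sigma=1.0)
inside = eps_insensitive_loss(1.0, 1.05, eps=0.1)   # within the eps tube
outside = eps_insensitive_loss(1.0, 1.3, eps=0.1)   # outside the eps tube
print(pred, inside, outside)
```

A small σ makes the kernel very local (each support vector influences only nearby points), while a large ϵ widens the tube inside which errors are ignored and usually reduces the number of support vectors.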

Comparison
A comparative study of the ANN-QSPR and SVR-QSPR models in terms of regression and different error measures was conducted using the whole data set (Fig. 4). The results show that the solubility data of solid solutes in scCO2 are better correlated by the SVR-QSPR model than by the ANN-QSPR model. Statistical analyses show that the support vector regression (SVR) predictions are in excellent agreement with the experimental data set. In addition to the statistical validation of the developed models, their accuracy and regression quality were further evaluated against the most relevant models developed by other researchers (Table 8 and Fig. 5). Compared with numerous previously published models, the models developed in this study showed better performance in correlating the solubility of solid solutes in scCO2 based on a few relevant descriptors. The obtained models were able to correlate 145 systems (3637 data points) with the lowest errors. The QSPR-SVM model was obtained with very acceptable statistical parameters, namely an R² of 0.9857, an RMSE of 0.3393, and a MAARD of 0.9859, as presented in Table 8. Moreover, the global comparison in terms of the number of correlated systems indicated that the proposed model handled 145 systems with low errors and a high correlation coefficient, in contrast to the other correlations, as depicted in Fig. 5. The performance of the SVR-QSPR model was also compared with six previously published density-based models (Table 9).
The parameters of the selected literature models were fitted using a genetic algorithm (the MATLAB ga function).
The global comparison for the 145 compounds was calculated for each of the seven studied correlations. The correlation results for the experimental data are presented in terms of AARD and R² in Table 10. The literature models considered include those reported in [113] and those of Méndez-Santiago and Teja (1999) [114], Hezave and Lashkarbolooki (2013) [5], Keshmiri (2014) [115], Bian (2016) [116], and Si Moussa.
A simple interface may be useful for sharing the results obtained by the SVR model with users. A user-friendly graphical user interface based on MATLAB and using the trained SVR model was therefore developed to evaluate the solubility of the 145 solid solutes in scCO2 (Fig. 7).
Fig. 7 - A MATLAB calculator to evaluate the solubility of 145 solid solutes in scCO2 using the support vector machine model

Global sensitivity analysis
Global sensitivity analysis is a mathematical technique used to identify which input variables are most important for the prediction of the solubility values [124,125]. In this work, the cosine amplitude method (CAM) was employed to assess the effect of each input on the output [126]. The data pairs used form a data array $X$ defined as $X = \{X_1, X_2, X_3, \ldots, X_m\}$, where $X_i$ is a vector of length $m$, expressed as $X_i = \{x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}\}$. The $r_{ij}$ values are defined as the strength of the relationship between the output $x_j$ and the input $x_i$, and can be calculated using Eq. (1):
$$r_{ij} = \frac{\sum_{k=1}^{m} x_{ik}\, x_{jk}}{\sqrt{\sum_{k=1}^{m} x_{ik}^{2} \, \sum_{k=1}^{m} x_{jk}^{2}}} \qquad (1)$$
where i and k are the dimensions of the input matrix (i = 1 to 3637, k = 1 to m), j is the dimension of the output vector (j = 1 to 3637), and m is the number of input parameters (m = 9).
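A minimal Python sketch of the cosine amplitude method is given below. It assumes the usual form of the measure, namely the sum of products of an input column and the output column divided by the square root of the product of their sums of squares, evaluated once per input variable over all data points. The function names and the toy data are hypothetical.

```python
import math

def cam_strength(x_i, x_j):
    """Cosine amplitude strength between one input column x_i and the output column x_j."""
    num = sum(a * b for a, b in zip(x_i, x_j))
    den = math.sqrt(sum(a * a for a in x_i) * sum(b * b for b in x_j))
    return num / den

def cam_ranking(inputs, output):
    """Strength of every input variable, sorted from most to least influential."""
    scores = {name: cam_strength(col, output) for name, col in inputs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: two input variables observed at three data points.
inputs = {"T": [1.0, 2.0, 3.0], "P": [3.0, 2.0, 1.0]}
output = [2.0, 4.0, 6.0]
ranking = cam_ranking(inputs, output)
print(ranking)
```

A strength close to 1 means the input and output columns point in nearly the same direction. With all-positive, similarly scaled data the strengths of different inputs tend to lie close together, which is worth keeping in mind when interpreting near-equal importances.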
The importance of the input parameters in the QSPR-SVM model for solid solute solubility in scCO2 is shown in Fig. 8. As observed, all parameters have approximately the same effect.
These results indicate that all of the input parameters selected in this study have an important effect on the solid solute solubility in scCO2; hence, they were appropriately selected.
Fig. 8 - Sensitivity analysis between the input variables and the calculated -log(y2) in scCO2

Extrapolation capability
In this section, the extrapolation capability of the proposed model is assessed using the experimental measurements reported in [127–134]. The solubility values predicted by the SVR-QSPR model were compared with the experimental ones for eight systems and are also plotted against the scCO2 density. The results of the comparison, shown in Figs. 9–16, indicate that the model performed well, with a high goodness of fit R, in predicting the solubility of the eight systems, and almost all predicted solubility values fall within the range of the experimental ones. These results demonstrate that the performance of the SVR-QSPR based forecasting model is quite satisfactory.

Conclusion
In this study, support vector machine (SVM) and artificial neural network (ANN) models were compared for predicting the non-linear variation of the solubility of 145 different compounds in supercritical carbon dioxide. The results show clearly that coupling the SVR approach with QSPR is more accurate than ANN-QSPR for estimating drug solubility in scCO2, with R² and AARD of 0.9857 and 1.345 %, respectively. The predictive ability of the optimal model was estimated with a leave-one-out cross-validated coefficient (Q²LOO) of 0.9859.
Based on the calculated statistical parameters, the QSPR-SVM model performs better than the QSPR-ANN model. The proposed QSPR-SVM model is therefore a simple approach to forecasting solid solute solubility in scCO2 that can be used in commercial software in the absence of experimental data. The QSPR-SVM model developed in this study also performs better than previously published correlations. A sensitivity analysis was conducted to show the influence of each input parameter on the output variable; all parameters have approximately the same effect.