A Comparison of “Neural Networks and Multiple Linear Regressions” Models to Describe the Rejection of Micropollutants by Membranes

The rejection of organic compounds by nanofiltration and reverse osmosis membranes was modelled using artificial neural networks. Three feed-forward neural networks based on the quantitative structure-activity relationship (QSAR-NN models), all with a similar structure (twelve neurons in the input layer, one hidden layer, and one neuron in the output layer), were constructed with the aim of predicting the rejection of organic compounds. Sets of 1394 data points for QSAR-NN1, 980 data points for QSAR-NN2, and 436 data points for QSAR-NN3 were used to construct the neural networks. Good agreement between the predicted and experimental rejections was obtained with the QSAR-NN models (the correlation coefficients for the total dataset were 0.9191 for QSAR-NN1, 0.9338 for QSAR-NN2, and 0.9709 for QSAR-NN3). A comparison between the feed-forward neural networks and multiple linear regressions based on the quantitative structure-activity relationship “QSAR-MLR” revealed the superiority of the QSAR-NN models: the root mean squared errors for the total dataset were 10.6517 % for QSAR-NN1, 9.1991 % for QSAR-NN2, and 5.8869 % for QSAR-NN3, against 20.1865 % for QSAR-MLR1, 19.3815 % for QSAR-MLR2, and 16.2062 % for QSAR-MLR3.


Introduction
Nanofiltration (NF) and reverse osmosis (RO) are key processes widely used for the removal of ions and organics from contaminated feedwater in process water, drinking water, and wastewater treatment. 1,2 The occurrence of trace organic contaminants in treated and untreated wastewater appears to be a significant environmental health concern. 3,4 The long-term effects of consuming water containing low concentrations of micropollutants will remain an unanswered question for the foreseeable future. 5 Meanwhile, water treatment facilities are implementing monitoring programs, research organizations dealing with water reuse have published reports, and studies have addressed the topic. 6 Fewer studies have examined micropollutant rejection as a function of the interactions between compounds and membrane surfaces. 6–8 In general, three major interactions primarily affect the rejection of solutes by membranes: steric hindrance (sieving effect), electrostatic repulsion (charge effect), and hydrophobic/adsorptive interactions. 9–11 An important step in dealing with the problem has been the identification of the compound physicochemical properties and membrane characteristics that explain the transport and removal of micropollutants by different mechanisms, namely size/steric exclusion, hydrophobic adsorption and partitioning, and electrostatic repulsion. 1,12–18 Nanofiltration membranes can be characterised by the molecular weight cut-off (MWCO) or by the retention of salts, such as the magnesium sulphate salt rejection (SR(MgSO4)) or the sodium chloride salt rejection (SR(NaCl)); reverse osmosis membranes, being dense, are also characterised by salt retention. 19
The use of MWCO has been acceptable for modelling purposes; however, NF membranes with a broad range of MWCO (pore size and distribution) may make it difficult to estimate the rejection of contaminants, so the magnesium sulphate salt rejection SR(MgSO4) may be more appropriate. 5 Moreover, MWCO alone is frequently unable to predict the rejection of charged and uncharged organic compounds by NF/RO membranes. 5,19,20 Hence, in this work, the sodium chloride salt rejection "SR(NaCl)" is considered as an input factor because it is the first parameter giving an indication of the "porosity" of the membrane, measured by the manufacturer under standard conditions. 21 Agenson et al. showed that, in general, the strongest retention rates are obtained for membranes having the highest NaCl retention rate. 22 A number of articles have proposed a mechanistic understanding of the interaction between membranes and organic compounds; others have tried to apply fitting-parameter models to model rejection. 8,23,24 However, there have been few models to "predict" the rejection of compounds. 5 In recent years, there have been various attempts to advance the use of QSAR as a viable approach to developing data-driven models that describe the performance of membrane processes. 5,25–28 In 2009, Yangali-Quintanilla et al. evaluated an MLR based on QSAR to model 161 rejections of 50 neutral organic compounds by six NF and nine RO membranes. They demonstrated that the magnesium sulphate salt rejection (SR(MgSO4)) may be a possible lump parameter that defines the size exclusion capability of NF/RO membranes for neutral organic compounds. This model (Eq. 1) correlates the retention with the properties of neutral organic molecules (size and hydrophobicity) and the properties of the NF/RO membrane (porosity). 28
In this study, we developed feed-forward neural network (FFNN) models and multiple linear regressions "QSAR-MLR" to predict the rejection of organic compounds by NF/RO membranes. To the best of our knowledge, no studies reported in the literature have used feed-forward neural networks to model the rejection of organic compounds by NF/RO membranes while taking into account the molecular weight cut-off and the salt rejections (magnesium sulphate and sodium chloride). The feed-forward neural networks are also compared with the multiple linear regressions.
The aim of our study was to establish which of the QSAR-NN and QSAR-MLR models is superior, and from that to determine whether the rejection of organic compounds by NF/RO membranes is a linear or a nonlinear phenomenon, and to identify the best porosity indicator (MWCO, SR(MgSO4), or SR(NaCl)) for the retention of neutral and ionic organic compounds by NF/RO membranes. The paper is organised as follows. Section 2 presents the database collection and the modelling of the rejection of charged and uncharged organic compounds by NF/RO membranes using QSAR-NN and QSAR-MLR. Section 3 presents results and discussion. Section 4 presents comparisons between QSAR-NN, QSAR-MLR, and other models. Section 5 presents the rejection of selected organic compounds at varying feedwater pH values.
The selection of the input and output variables was based on the interactions between organic compound properties, membrane characteristics, and filtration operating conditions that govern the removal of organic compounds by NF/RO membranes. 11,28,46 The inputs considered in this work were: compound descriptors (molecular weight, compound hydrophobicity "logD", dipole moment, molecular length, and equivalent molecular width), membrane characteristics (molecular weight cut-off (MWCO), magnesium sulphate salt rejection "SR(MgSO4)", sodium chloride salt rejection "SR(NaCl)", surface membrane charge "zeta potential", and membrane hydrophobicity "contact angle"), and operating conditions (pH, pressure, recovery, and temperature).
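The list above names fourteen candidate inputs, while each network has twelve input neurons. One reading that reconciles the two, sketched below, is that every model uses a single porosity indicator together with the eleven remaining descriptors. The pairing of QSAR-NN1 with MWCO is an inference from the text, which only ties SR(MgSO4) to QSAR-NN2 and SR(NaCl) to QSAR-NN3; the variable names and the list layout are illustrative.

```python
# Descriptor groups taken from the input list in the text; identifiers are illustrative.
COMPOUND = ["molecular_weight", "logD", "dipole_moment", "length", "eq_width"]
MEMBRANE_COMMON = ["zeta_potential", "contact_angle"]
OPERATING = ["pH", "pressure", "recovery", "temperature"]

# Assumed pairing of model and porosity indicator (NN1 -> MWCO is inferred, not stated).
POROSITY = {
    "QSAR-NN1": "MWCO",
    "QSAR-NN2": "SR_MgSO4",
    "QSAR-NN3": "SR_NaCl",
}


def input_names(model):
    """Return the twelve input names assumed for one model."""
    return COMPOUND + [POROSITY[model]] + MEMBRANE_COMMON + OPERATING


for model in POROSITY:
    names = input_names(model)
    print(model, len(names), names)
```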
Molecular weight (MW) and logD were obtained from ChemSpider, 47 the dipole moment, another fundamental descriptor, was calculated with Chem3D Ultra 12.0, and ChemAxon 48 was used to compute the molecular size descriptors, namely molecular length, molecular width, and molecular depth. The equivalent molecular width is defined as the geometric mean of width and depth. 5,49 The minimum (min), mean, maximum (max), and standard deviation (STD) values of the input and output data are presented in Table 1.
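The equivalent molecular width can be written out as a one-line calculation. The sketch below assumes width and depth are given in the same length unit; the function name and the example dimensions are hypothetical.

```python
import math


def equivalent_width(width, depth):
    """Geometric mean of molecular width and depth (result is in the input unit)."""
    if width <= 0 or depth <= 0:
        raise ValueError("width and depth must be positive")
    return math.sqrt(width * depth)


# Hypothetical molecular dimensions in nm, for illustration only.
print(equivalent_width(0.9, 0.4))
```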
The samples were randomly split into three subsets: 80 % for the training set, 10 % for the validation set, and 10 % for the test set. Fig. 1 shows the total data as a function of the molecular weight for each database. The rejections according to the molecular weight are indicated by pink circles for DB1 (1394 points), blue circles for DB2 (980 points), and green circles for DB3 (436 points). When the molecular weight varied from 150 to 300 g mol−1, the experimental rejections, which were greater than 60.5 %, clustered together.
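A random 80/10/10 split of this kind can be sketched as follows. The text does not say how the random split was generated, so the seed, the rounding rule, and the function name are assumptions made for illustration.

```python
import random


def split_indices(n_samples, train=0.8, validation=0.1, seed=0):
    """Shuffle sample indices and cut them into training, validation and test subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train * n_samples)
    n_val = round(validation * n_samples)
    # Whatever remains after the training and validation cuts becomes the test subset.
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


# Database sizes quoted in the text.
for name, size in [("DB1", 1394), ("DB2", 980), ("DB3", 436)]:
    tr, va, te = split_indices(size)
    print(name, len(tr), len(va), len(te))
```

Fixing the seed keeps the three subsets reproducible between runs, which matters when different models are compared on the same data.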

Modelling with neural networks
A neural network (NN) is a massively parallel-distributed processor made up of simple processing units known as nodes that perform certain mathematical functions, usually nonlinear. This kind of non-algorithmic computation is characterised by a system resembling the structure of the human brain. One of the great advantages of these models is their ability to learn (store experimental knowledge), generalise (make the knowledge available), or automatically extract rules from complex data. 50 The role of the QSAR-NN is to transform the input information into output information. During the training process, the weights are corrected to produce output values as close as possible to the target values. 51 The procedure for the design and optimisation of the architecture of the "QSAR-NN" neural network is described in Fig. 2.
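The transformation of inputs into an output can be illustrated with a minimal forward pass for the layout used in this paper: twelve inputs, one hidden layer, and one linear output neuron. The sketch assumes a hyperbolic tangent (tansig) hidden layer, which is one of the transfer functions tested here; the hidden-layer size, the random weights, and all names are hypothetical and do not reproduce the trained models.

```python
import math
import random


def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One-hidden-layer network: tanh hidden units followed by a linear output unit."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * z for w, z in zip(w_out, hidden)) + b_out


# Hypothetical sizes and random weights, only to show the shapes involved.
rng = random.Random(0)
n_inputs, n_hidden = 12, 5
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b_out = 0.0

x = [0.5] * n_inputs  # placeholder for twelve scaled descriptors
print(forward(x, w_hidden, b_hidden, w_out, b_out))
```

Training then consists of adjusting `w_hidden`, `b_hidden`, `w_out`, and `b_out` so that the output approaches the measured rejection.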

Modelling with multiple linear regressions
Multiple linear regressions "QSAR-MLR" are the most widely used and best-known modelling methods, and are used as the basis for a number of multivariate methods. 52 A multiple linear regression is a quantitative relationship between a group of predictor variables (X) and a response $y_i$, as shown in Eq. (4):

$$y_i = A_0 + \sum_{k} A_k X_k \qquad (4)$$

where $y_i$ is the output of QSAR-MLR, $X_k$ represents the inputs, $A_k$ represents the coefficients, and $A_0$ is the intercept of the equation.
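To make the linear form concrete, the sketch below estimates the intercept and coefficients by ordinary least squares using the normal equations. It is a dependency-free illustration, not the authors' implementation (the paper reports using the MATLAB function "regress"); the function names and the toy data are hypothetical.

```python
def fit_mlr(X, y):
    """Least-squares estimate of [A0, A1, ..., Ap] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 carries the intercept A0
    p = len(rows[0])
    # Augmented system (X^T X | X^T y).
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(a[k][col]))
        a[col], a[pivot] = a[pivot], a[col]
        if abs(a[col][col]) < 1e-12:
            raise ValueError("inputs are collinear; coefficients are not unique")
        for k in range(p):
            if k != col:
                f = a[k][col] / a[col][col]
                a[k] = [v - f * w for v, w in zip(a[k], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def predict_mlr(coeffs, x):
    """Evaluate A0 + sum_k A_k * X_k for one sample."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], x))


# Toy data generated from y = 1 + 2*x1 + 3*x2 (not taken from the paper's databases).
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, 4, 6, 8]
coeffs = fit_mlr(X, y)
print(coeffs)
```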
3 Results and discussion

Neural networks
The database was divided into three different sets. The training set (80 %) is the major part of the data, because the connection weights of the neurons are adjusted on this set during training so that the network acquires its knowledge. The validation set (10 %) is used to verify the generalisation ability of the network. The test set (10 %) is used to evaluate the NN performance in real situations. The training algorithm used in this work was the BFGS quasi-Newton (trainbfg). Each QSAR-NN had three layers of neurons or nodes: one input layer with twelve neurons for QSAR-NN1, QSAR-NN2, and QSAR-NN3, one hidden layer with a number of active neurons optimised during training, and one output layer with one unit that generated the estimated value of rejection. The number of hidden neurons varied from 3 to 25. The tangent sigmoid (tansig), the log sigmoid (logsig), and the exponential transfer functions were used in the hidden layer. The pure-linear (purelin) transfer function was used in the output layer.
The QSAR-NN modelling of the rejection of charged and uncharged organic compounds by NF/RO membranes was performed using STATISTICA 8 software. Table 2 shows the structure of the optimised QSAR-NN models. From the optimised neural networks (QSAR-NN) shown in Fig. 3, the rejection of organic compounds by NF/RO membranes can be expressed by a mathematical model incorporating all inputs $E_i$ (Mw, logD, dipole moment, length, Eqwidth, MWCO, SR(MgSO4), SR(NaCl), zeta potential, contact angle, pH, pressure, recovery, and temperature): Eq. (2) gives the outputs $Z_j$ of the hidden layer, and Eq. (3) gives the output "Rejection". The combination of Eqs. (2) and (3) leads to a single mathematical formula that describes the rejection by taking into account all inputs $E_i$. Similarly, the rejection "SR(MgSO4)" and the rejection "SR(NaCl)" can be expressed by mathematical equations extracted from the optimised QSAR-NN2 and QSAR-NN3, respectively.
Table 3 shows the vectors of linear regression for the three neural models (QSAR-NN1, QSAR-NN2, and QSAR-NN3). This table shows that the least favourable regression parameters are those of QSAR-NN1. The proposed approach nevertheless gives satisfactory results, with regression vector values approaching the ideal [i.e. α = 1 (slope), β = 0 (intercept), R = 1 (correlation coefficient)] for QSAR-NN1, QSAR-NN2, and QSAR-NN3.
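The regression vector [α, β, R] can be obtained by fitting a straight line to the predicted values plotted against the experimental ones. The sketch below assumes that direction of regression (predicted = α · experimental + β), which the text does not state explicitly; the function name and the sample values are hypothetical.

```python
import math


def regression_vector(experimental, predicted):
    """Return (slope, intercept, R) for the line predicted = slope * experimental + intercept."""
    n = len(experimental)
    mean_x = sum(experimental) / n
    mean_y = sum(predicted) / n
    sxx = sum((x - mean_x) ** 2 for x in experimental)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(experimental, predicted))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical rejections in %, for illustration only.
exp_values = [20.0, 45.0, 60.0, 85.0, 98.0]
cal_values = [24.0, 43.0, 63.0, 82.0, 95.0]
print(regression_vector(exp_values, cal_values))
```

A model that reproduced every measurement exactly would return the ideal vector (1, 0, 1).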

Multiple linear regressions
The multiple linear regression model relates a set of "predictor" variables to the response variable (Rejection). It was used to develop models that predict the rejection of charged and uncharged organic compounds by NF/RO membranes from the organic compound properties, membrane characteristics, and filtration operating conditions. The same datasets used for developing QSAR-NN were used to develop QSAR-MLR. The models were implemented using the MATLAB function "regress". The general QSAR-MLR linear equations for rejection have the form of Eq. (4). The predictions of the rejection of charged and uncharged organic compounds by NF/RO membranes according to the QSAR-MLR models are presented in Fig. 5. The predicted values for the rejection were in moderate agreement with the observed values for the total datasets (R = 0.6642 for QSAR-MLR1, R = 0.6560 for QSAR-MLR2, and R = 0.7522 for QSAR-MLR3). These correlation coefficients are generally considered satisfactory (0.50 ≤ R < 0.90) for the QSAR-MLR1, QSAR-MLR2, and QSAR-MLR3 models, which implies acceptable robustness of the multiple linear regression models and the possibility of predicting the different parameters that characterise the rejection of organic compounds by NF/RO membranes.

Comparison between QSAR-NN and QSAR-MLR models
In order to establish the developed QSAR-NN models as a plausible alternative to the QSAR-MLR models, the two approaches were compared in terms of the Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Standard Error of Prediction (SEP), Residual Predictive Deviation (RPD), and Range Error Ratio (RER).
The equations of those parameters are given below:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_{i,\mathrm{exp}} - Y_{i,\mathrm{cal}}\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(Y_{i,\mathrm{exp}} - Y_{i,\mathrm{cal}}\right)^{2}}$$

$$\mathrm{SEP}\,(\%) = \frac{\mathrm{RMSE}}{Y_{e}} \times 100$$

$$\mathrm{RPD} = \frac{\mathrm{SD}}{\mathrm{RMSE}}$$

$$\mathrm{RER} = \frac{\mathrm{Max} - \mathrm{Min}}{\mathrm{RMSE}}$$

where n is the total number of data, $Y_{i,\mathrm{exp}}$ is the experimental value, $Y_{i,\mathrm{cal}}$ is the value calculated by the model, and $Y_e$ is the mean value of the experimental data. SD is the standard deviation of the experimental data, Min is the minimum of the experimental data, and Max is the maximum of the experimental data.
Fig. 6 shows the comparison between the QSAR-MLR and the QSAR-NN models. The performance of each model was evaluated in terms of root mean squared error, mean absolute error, standard error of prediction, residual predictive deviation, and range error ratio. As expected, the QSAR-NN1 model had the lowest accuracy among the QSAR-NN models. Another conclusion that may be drawn from Fig. 6 is that the QSAR-NN2 and QSAR-NN3 models, which use an experimentally measured salt rejection, could be more robust and more accurate than the QSAR-NN1 model. Fig. 6 also shows that the QSAR-NN3 model is more effective and more accurate than the other QSAR-NN models developed in this study, and that the QSAR-NN models gave lower errors than the QSAR-MLR models. Since the RPD values of QSAR-NN1, QSAR-NN2, and QSAR-NN3 were 2.54, 2.79, and 4.18 respectively, all higher than 2.5, we can say that the QSAR-NN models had excellent levels of prediction for the rejection of micropollutants by NF/RO membranes.
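The five criteria can be computed together from the experimental and calculated rejections. The sketch below assumes the usual definitions: SEP is the RMSE expressed as a percentage of the experimental mean, RPD is SD divided by RMSE, and RER is the experimental range divided by RMSE. The last two are consistent with the RMSE, RPD, and RER values reported in this paper. The use of the sample standard deviation (n − 1) and the function name are assumptions.

```python
import math


def error_metrics(y_exp, y_cal):
    """Return MAE, RMSE, SEP (%), RPD and RER for paired experimental/calculated values."""
    n = len(y_exp)
    residuals = [e - c for e, c in zip(y_exp, y_cal)]
    mean_exp = sum(y_exp) / n
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    # Sample standard deviation of the experimental data (n - 1 is an assumption).
    sd = math.sqrt(sum((e - mean_exp) ** 2 for e in y_exp) / (n - 1))
    return {
        "MAE": mae,
        "RMSE": rmse,
        "SEP": 100.0 * rmse / mean_exp,
        "RPD": sd / rmse,
        "RER": (max(y_exp) - min(y_exp)) / rmse,
    }


# Hypothetical rejections in %, for illustration only.
metrics = error_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
for name, value in metrics.items():
    print(name, round(value, 4))
```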
On the other hand, the RPD values of QSAR-MLR1 and QSAR-MLR2 were 1.34 and 1.33 respectively, lower than 1.40; these models were therefore unsuccessful in predicting the rejection of charged and uncharged organic compounds by NF/RO membranes. However, the RPD of QSAR-MLR3 was 1.52, i.e., between 1.40 and 1.8, so this model showed a possibility of distinguishing high and low values. These results show that QSAR-NN has a higher prediction capacity than QSAR-MLR.
Furthermore, the RER values of QSAR-NN2 and QSAR-NN3 were 10.78 and 16.99 respectively, higher than the recommended threshold (RER > 10), which implies that these models showed a good ability to predict the rejection. The RER values of QSAR-NN1, QSAR-MLR1, QSAR-MLR2, and QSAR-MLR3 (ranging from 4.95 to 9.39) did not exceed the recommended threshold. The obtained results show that these models have an acceptable predicting power.
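The interpretation bands used in this discussion can be collected in a small helper. The thresholds below are the ones quoted in the text (RPD above 2.5, RPD between 1.40 and 1.8, RPD below 1.40, RER above 10); the text does not describe the band between 1.8 and 2.5, so the sketch flags it rather than guessing. Function names and label strings are illustrative.

```python
def rpd_level(rpd):
    """Map an RPD value onto the bands quoted in the text."""
    if rpd > 2.5:
        return "excellent prediction"
    if rpd < 1.40:
        return "unsuccessful prediction"
    if rpd <= 1.8:
        return "can distinguish high and low values"
    return "band not described in the text"  # 1.8 < RPD <= 2.5


def rer_level(rer):
    """Compare an RER value with the recommended threshold of 10."""
    return "good prediction" if rer > 10 else "below the recommended threshold"


# RPD values reported in the text for the six models.
reported_rpd = {
    "QSAR-NN1": 2.54, "QSAR-NN2": 2.79, "QSAR-NN3": 4.18,
    "QSAR-MLR1": 1.34, "QSAR-MLR2": 1.33, "QSAR-MLR3": 1.52,
}
for model, rpd in reported_rpd.items():
    print(model, rpd, rpd_level(rpd))

# RER values reported individually in the text.
for model, rer in {"QSAR-NN2": 10.78, "QSAR-NN3": 16.99}.items():
    print(model, rer, rer_level(rer))
```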
Thus, the good predicting power and the high accuracy of the QSAR-NN models in estimating the rejection of micropollutants by NF/RO membranes are quite clear, which makes these models highly recommendable over the QSAR-MLR models. 28
Table 4 shows a comparison between our QSAR-NN models and other models. In the current study, we used the three QSAR-NN models to predict the rejection of micropollutants by NF/RO membranes, taking into account three porosity indicators (MWCO, SR(MgSO4), and SR(NaCl)). In contrast, Yangali-Quintanilla et al. 28 used only one porosity indicator, SR(MgSO4), for neutral organic compounds, and in our earlier study we used two porosity indicators (MWCO and SR(MgSO4)) for neutral and ionic organic compounds. Our QSAR-NN3 model, with SR(NaCl) as the porosity indicator, gives a more accurate and more robust prediction of the rejection of organic compounds (neutral and ionic) by NF/RO membranes than the other models.
The calculated rejection values are in accordance with the experimental rejection values for the organic compound 2,4-dihydroxybenzoic acid (Fig. 7). These results suggest an excellent validation of the QSAR-NN3 model. The increase in experimental rejection observed at a pH between 3 and 5 for both nanofiltration membranes can likely be attributed to an increase in the negative effective membrane surface charge (as quantified by zeta potential measurements), resulting in an increased degree of electrostatic repulsion. At pH values of 5, 7, and 9, the rejection of 2,4-dihydroxybenzoic acid remained at about 90 % for both nanofiltration membranes tested, although the negative surface charge continued to increase for both membranes in this pH range. 17 The QSAR-NN3 model can be used to predict the rejection of organic compounds over a wide range of pH.

Conclusions
The present paper illustrates the use of three neural network models developed with the aim of predicting the rejection of charged and uncharged organic compounds by NF/RO membranes. The purpose of the current study was to develop three feed-forward neural network models able to summarise the interactions between membrane characteristics, filtration operating conditions, and physicochemical properties of charged and uncharged organic compounds. Sodium chloride salt rejection may be a possible lump parameter for modelling the rejection of charged and uncharged organic compounds by NF/RO membranes.

Supplementary data
ANNEX A. List of the 116 organic compounds used for developing neural networks based on the quantitative structure-activity relationship (QSAR-NN) and multiple linear regressions based on the quantitative structure-activity relationship (QSAR-MLR) models.