Application of gene expression programming, artificial neural network and multilinear regression in predicting hydrochar physicochemical properties

Globally, the provision of energy is becoming an absolute necessity. Biomass resources are abundant and have been described as a potential alternative source of energy. However, it is important to assess the fuel characteristics of the various available biomass sources. In this study, three soft computing techniques, Gene Expression Programming (GEP), multiple-input single-output artificial neural network (MISO-ANN), and multilinear regression (MLR), are used to predict the mass yield (MY), energy yield (EY), and higher heating value (HHV) of hydrothermally carbonized biomass. The coefficient of determination (R²), mean absolute error (MAE), and mean bias error (MBE) were used to evaluate and compare the performance of the models. The MISO-ANN with 5-10-10-1 and 5-15-15-1 network architectures provided the most satisfactory performance of the three proposed models (R² = 0.976, 0.955, 0.996; MAE = 2.24, 2.11, 0.93; MBE = 0.16, 0.37, 0.12 for MY, EY, and HHV, respectively). The GEP technique's ability to predict hydrochar properties based on the input parameters was found to be satisfactory, while MLR provided an unsatisfactory predictive model. A sensitivity analysis revealed that volatile matter (VM) and temperature (Temp) have the greatest influence on the MY, EY, and HHV.


Introduction
The increasing energy demand has led to the need to find alternative energy sources that are affordable, widely available, and environmentally friendly. Biomass is a biological and sustainable material originating from plants and animals, along with their waste and residues (Krylova and Zaitchenko 2018). Biomass is the most available renewable energy source, contributing about 50% of the total global renewable energy as of 2018, providing energy to billions of people, and stimulating economic growth (Pradhan et al. 2018). Tekin et al. (2014) and Rousset et al. (2012) reported that biomass is a potential alternative renewable energy source for power generation because of its low emissions and its low ash and total sulphur contents. Saba et al. (2017) and Perlack et al. (2011) reported that the greenhouse gas emission status of biomass is zero to net negative, as carbon dioxide is absorbed by plants during photosynthesis.
The most commonly used thermochemical pre-treatment techniques include pyrolysis, gasification, torrefaction, and hydrothermal carbonization (HTC) (Wang et al. 2018; Kambo and Dutta 2015). Kubacki et al. (2012) and Makwarela et al. (2017) stated that the co-combustion of biomass and coal allowed coal to ignite and burn out at lower temperatures because of interactions with the early combustion of biomass volatile matter. These studies concluded that the reported emission reductions were due to an improved reaction between coal and biomass volatiles in a hot oxidizing atmosphere. A number of studies have been carried out on the pre-treatment of biomass (Safarian et al. 2019; Zhang and Pang 2019; Kambo and Dutta 2015). The type of feedstock and the preferred end product determine the type of pre-treatment method to be used (Kambo and Dutta 2015).
The hydrochar utilized in this study was produced using the HTC method, which is generally considered to be a more effective technique (Danso-Boateng 2015). Biomass for HTC treatment does not require drying before treatment, so the process uses less energy. In addition, unlike conventional biological treatment techniques, HTC is not affected by the presence of toxic compounds in the biomass. HTC treatment typically takes place at relatively low temperatures (180-260 °C) and under internally generated pressure in an enclosed reactor, and it decreases the oxygen and hydrogen content of the starting material by dehydration and decarboxylation (Libra et al. 2011). The HTC treatment converts the wet biomass into hydrochar, a solid substance with improved carbon content. Hydrochar has a higher heating value than the feedstock and a chemical structure similar to that of coal (Mumme et al. 2011). The process is controlled by parameters such as temperature and residence time, which define the intensity of the biomass treatment (Wiedner et al. 2013; Xu et al. 2013).
Temperature has a significant influence on the HTC process. It is the key determinant of the water properties, which lead to ionic reactions in the subcritical region. A rise in temperature alters the viscosity of the water, making it easier for the water to penetrate the pores of the material and thus further degrade the biomass (Funke and Ziegler 2010). With an increase in temperature, the disintegration of solid residues increases, which further increases the conversion of solids to gas products. In most of the studies reviewed (Wang et al. 2018; Kim et al. 2014; Parshetti et al. 2013; Hwang et al. 2012; Sevilla and Fuertes 2009), an increase in temperature has been reported to result in lower mass yields and a higher HHV of the hydrochars, making them suitable for power generation. Wang et al. (2018) also reported on the significance of residence time for the severity of the HTC process. The residence time at a given temperature influences the degree of decomposition of the feedstock, but it has less impact on hydrochar mass yield than temperature.
The investigation conducted by Zhu et al. (2018) at different temperatures and residence times showed that these parameters do influence the properties of cornstalk hydrochar. The authors reported that the fuel mass yield decreased from 70.57 to 33.40% with an increase in temperature, while the residence time had a lower influence on the mass yield than the temperature. The energy content of the raw cornstalk was increased from 16.35 to 26.31 MJ/kg. Similar results were also observed for the HTC treatment of other biomass such as biogas sludge, barley, and maize silage, starch, municipal solid waste, and sewage sludge (Seyedsadr et al. 2018; Kim et al. 2014; Parshetti et al. 2013; Hwang et al. 2012; Sevilla and Fuertes 2009). Therefore, it can be concluded that temperature and residence time, as well as the raw biomass type, influence the properties of the hydrochar. With this understanding of the effect of temperature and residence time on the mass yield of hydrochar, the physicochemical properties of the hydrochar under the set conditions were used to predict its yield theoretically.
Empirical and semi-empirical correlations have been reported in the literature for the estimation of biomass fuel HHV based on proximate, ultimate, and chemical analyses (Saldarriaga et al. 2015; Saidur et al. 2011; Sheng and Azevedo 2005). Vargas-Moreno et al. (2012) reported on mathematical models used to predict biomass HHV and assessed the performance of the prediction models. The study reported that the R² remained as high as 0.748 for biomass in the 15 univariate and multivariate prediction equations reviewed. Artificial neural networks and, in particular, feed-forward artificial neural networks (FANNs) have been widely used to develop process models over the last 10 years, and their use in industry has evolved rapidly (Onifade et al. 2019; Aladejare et al. 2020; Majumder et al. 2008; Hansen and Meservy 1996; Wasserman 1993).  and (Estiati et al. 2016) used neural networks and regression analysis to predict the HHV of coal and biomass-based fuels from their proximate and ultimate analyses using both experimental and existing data from the literature. The results obtained by these authors show that the adaptive neuro-fuzzy inference system (ANFIS) and artificial neural network-particle swarm optimization (ANN-PSO) models perform better than the MLR models, as reflected in the statistical analysis conducted to assess the performance of the models.
Few studies in the literature assess the influence of temperature, residence time, and the composition of biomass sources on hydrochar properties. The aim of this study is, therefore, to predict the mass yield, energy yield, and higher heating value of hydrochars using the HTC process conditions (temperature and residence time) and the biomass proximate analysis results. The experimental data from this study and data obtained from the literature were utilized in the linear and non-linear empirical models proposed to predict these properties for different biomass sources (as provided in Table 1). The performance of the proposed models was compared using the R², mean absolute error, and mean bias error.

Sample characterization and data generation
To develop the proposed models, data from the proximate analysis, the HTC process conditions (temperature and residence time), and the hydrochar properties (MY, EY, and HHV) of a number of biomass species were used. The woody biomass (Searsia lancea) used in this study was harvested from a phytoremediation trial site polluted with groundwater from the gold and uranium tailings dam at AngloGold Ashanti Limited's West Wits and Vaal River mining operations in South Africa. The different biomass components were milled in a Retsch SM 200 mill to −1 mm and −212 µm size fractions. The −1 mm fraction was used for the hydrothermal carbonization and the −212 µm fraction for the physicochemical characterization. The proximate analysis of these samples was performed according to ASTM D5142, with approximately 1 g used to determine the fixed carbon, moisture, ash, and volatile matter contents. The fixed carbon was calculated by subtracting the sum of the moisture, volatile matter, and ash contents from 100%. A bomb calorimeter (Leco AC500) was used, in accordance with the ASTM D5865-04 standard, to determine the HHV of the samples.
One hundred and fifteen (115) data points, 9 from the experimental investigation and 106 from published articles on several biomass species, were used to obtain the predictive models. The summary statistics of the dataset obtained from the experimental tests and the literature are presented in Table 1, and the details of the data used in the model development are presented in Table 2. The summary statistics in Table 1 show that volatile matter, ash content, fixed carbon, and residence time do not follow normal distributions, based on their respective skewness. To enable the general application of the proposed models, the GEP, MISO-ANN, and MLR models were trained, tested, and validated on this dataset and compared with one another.

Hydrothermal carbonization
The woody Searsia lancea tree species was carbonized in a laboratory-scale high-pressure Berghof BR-1500 reactor. For each experiment, the reactor was loaded with 100 g of air-dried sample and 800 ml of deionized water and pressurized to 20 bar using nitrogen. The hydrothermal tests were conducted at reaction temperatures of 200, 250, and 280 °C and residence times of 30, 60, and 90 min. The mixture was stirred at 200 rpm for the entire experiment. After the holding time, the reactor was allowed to cool to room temperature. The solid hydrochar was collected via filtration and dried in an oven at 105 °C for 24 h. The results from the nine (9) hydrochar samples produced under the set conditions are presented in Table 2. The mass yield and energy yield of the hydrochars were calculated using the following equations:

$$MY\,(\%) = \frac{M_{HC}}{M_{R}} \times 100 \qquad (1)$$

where $MY$ is the mass yield, $M_{HC}$ is the mass of the hydrochar, and $M_{R}$ is the mass of the raw sample:

$$EY\,(\%) = MY \times \frac{HHV_{HC}}{HHV_{R}} \qquad (2)$$

where $EY$ is the energy yield, and $HHV_{HC}$ and $HHV_{R}$ are the higher heating values of the hydrochar and the raw sample, respectively.
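The yield calculations can be illustrated with a short Python sketch. It assumes the conventional definitions, that is, mass yield as the ratio of hydrochar mass to raw sample mass in percent, and energy yield as the mass yield scaled by the ratio of the hydrochar HHV to the raw sample HHV. The function names and the numerical inputs are hypothetical and are not taken from Table 2.

```python
def mass_yield(m_hydrochar: float, m_raw: float) -> float:
    """Mass yield (%) = hydrochar mass / raw sample mass * 100."""
    if m_raw <= 0:
        raise ValueError("raw sample mass must be positive")
    return 100.0 * m_hydrochar / m_raw


def energy_yield(my_percent: float, hhv_hydrochar: float, hhv_raw: float) -> float:
    """Energy yield (%) = mass yield * (HHV of hydrochar / HHV of raw sample)."""
    if hhv_raw <= 0:
        raise ValueError("raw sample HHV must be positive")
    return my_percent * hhv_hydrochar / hhv_raw


# Hypothetical run: 100 g of feedstock gives 40 g of hydrochar,
# and the HHV rises from 18 to 27 MJ/kg.
my = mass_yield(40.0, 100.0)
ey = energy_yield(my, 27.0, 18.0)
print(f"MY = {my:.1f} %, EY = {ey:.1f} %")  # MY = 40.0 %, EY = 60.0 %
```

In this made-up case the hydrochar retains 60% of the energy of the feedstock in 40% of its mass, which is the kind of energy densification the HTC treatment is intended to achieve.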

Gene expression programming (GEP)
GEP is an evolutionary algorithm that combines the genotype concept of the genetic algorithm (GA) with the phenotype concept of genetic programming (GP). Like a living organism, GEP utilizes a simple chromosome of fixed length for keeping and transmitting genetic information, and complex tree structures for learning and adaptation by changing size, shape, and composition. The key advantage of the GEP model is its ability to present its output in the form of an expression tree and a simple relationship between the model parameters and the targeted output, unlike many optimization algorithms, which require the relationship between the model parameters and the output parameter to be suggested in advance. Hence, GEP avoids the rigor required in many optimization algorithms to establish the combination of model parameters that gives optimum results.
In both GA and GP, mutation and crossover operators are the common means of reproduction. They operate according to the respective algorithms, which can increase the computational resources required (Guven and Aytek 2009; Teodorescu and Sherwood 2008). The GEP proposed by Ferreira (2001) exploits the merits of GA and GP while overcoming the demerits of both. It utilizes two entities, the chromosomes and the expression trees. Instead of applying its operators directly to the expression tree, it operates on the chromosome, which reduces the computational resources required (Guven and Aytek 2009; Teodorescu and Sherwood 2008). The flowsheet in Fig. 1 shows the steps involved in GEP.
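The relationship between a fixed-length chromosome and its expression tree can be sketched in Python. The example below assumes the breadth-first (Karva notation) reading usually described for GEP, in which the first symbol is the root and each function takes its arguments from the next unread symbols. The function set, the gene string, and the terminal values are illustrative assumptions and are not the settings used in GeneXproTools for this study.

```python
import math

# Hypothetical function set: symbol -> (arity, implementation). "Q" is square root.
FUNCTIONS = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "Q": (1, lambda a: math.sqrt(a)),
}


def decode(gene: str):
    """Build an expression tree from a gene string, reading it breadth-first."""
    nodes = [[symbol, []] for symbol in gene]
    parent, next_free = 0, 1
    # Each node, in reading order, claims its arguments from the next unread symbols.
    while parent < next_free:
        symbol = nodes[parent][0]
        arity = FUNCTIONS[symbol][0] if symbol in FUNCTIONS else 0
        for _ in range(arity):
            nodes[parent][1].append(nodes[next_free])
            next_free += 1
        parent += 1
    return nodes[0]  # symbols never claimed form the non-coding part of the gene


def evaluate(node, terminals):
    """Evaluate a decoded tree for a given mapping of terminal symbols to values."""
    symbol, children = node
    if symbol in FUNCTIONS:
        return FUNCTIONS[symbol][1](*(evaluate(child, terminals) for child in children))
    return terminals[symbol]


# "Q*+-abcd" decodes to sqrt((a + b) * (c - d)).
tree = decode("Q*+-abcd")
value = evaluate(tree, {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0})
print(value)  # 3.0
```

Because the genetic operators act on the flat string and the tree is only rebuilt for evaluation, any modified string of valid head and tail structure still decodes to a well-formed expression, which is the property that keeps the computational cost low.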

Artificial neural networks (ANN)
ANN belongs to the family of artificial intelligence techniques that imitate the functionality of the human brain. It mimics how the human brain receives, processes, and transforms information. There are different types of ANNs, but the multilayer neural network is the most widely used. In a supervised ANN, the input parameters are supplied together with the targeted output (Jain et al. 1996). The input parameters are multiplied by the connecting weights, and their sum, together with the bias, is fed into the transfer function at the hidden layer. The output of the hidden layer is multiplied by the weights connecting the hidden layer to the output layer, and the sum is added to the bias and fed into the transfer function at the output layer to obtain the final predicted output. The transfer function at the hidden layer is usually non-linear, while that at the output layer can be linear or non-linear. The flow chart explaining the steps involved in the ANN training is illustrated in Fig. 2.
Table 2 shows the proximate analysis of the biomass feedstock (Searsia lancea). The results of the proximate analysis are presented on a moisture-free (dry) basis. Volatile matter, ash content, and fixed carbon are influential constituents of fuel materials and are used to ascertain fuel quality. The content of volatile matter significantly influences the process of combustion (Mierzwa-Hersztek et al. 2019; Sadiku et al. 2016). In addition, Brewer et al. (2014) and Holtmeyer et al. (2013) reported that material with higher volatile matter could be advantageous for combustion processes, because it is easier to ignite, burns out completely at a lower temperature, and gives a stable flame. A high volatile matter content of 75.67% was obtained for the Searsia lancea, making the material a potential feedstock for combustion. An ash content of 4.26% was obtained for the feedstock. He et al. (2018) reported that, with lower ash content, there might be a decrease in fouling and slagging. The fixed carbon content of a material is an indicator of the fuel's heating value (Sadiku et al. 2016). For the biomass feedstock, a fixed carbon content of 20.07% was obtained. The mass yields of the hydrochars from Searsia lancea, calculated using Eq. (1), decrease as the temperature increases at each residence time, reaching yields as low as 34.89% at 250 °C. The reduction in mass yield is a result

GEP model
In developing the GEP model, the same dataset used in the training and testing of the ANN model was used. However, instead of being normalized to the range of −1 to 1, the dataset was normalized to the range of 0 to 1 for the GEP model. The purpose is to ensure dimensional linearity and forestall overfitting. The GEP model was implemented in GeneXproTools 5.0. After loading the data into the software, the number of chromosomes, the head size, the number of genes, and

Artificial neural network
A MISO-ANN is proposed in this study for the prediction of EY, MY, and HHV. To achieve this, single hidden-layer and double hidden-layer architectures were tried for each of the EY, MY, and HHV, as presented in Tables 3, 4. The optimum networks obtained for each of the EY, MY, and HHV are presented in Figs. 7 and 8. In developing the MISO-ANN models, the number of neurons in the input, hidden, and output layers and the respective transfer functions at the hidden and output layers must be defined. In this study, there are five neurons in the input layer, comprising VM, Ash, FC, Temp, and RT. For the MISO-ANN with a single hidden layer, neuron numbers ranging from 3 to 15 were tried, while for the MISO-ANN with a double hidden layer, the neuron combinations tried ranged from 5-3 to 15-15 for each of the targeted variables. The transfer function adopted for the network with a single hidden layer is the hyperbolic tangent for both the hidden and output layers. For the double hidden layer, the hyperbolic tangent was used in the first and second hidden layers, while a linear transfer function (purelin) was used for the output layer. The feedforward backpropagation training algorithm with the Levenberg-Marquardt training function was used for the training of the network. One hundred and fifteen (115) data points were used for model development, divided into 70% for training and 15% each for testing and validation (Fig. 2). The datasets were normalized to the range of −1 to 1 to forestall overfitting and ensure dimensional uniformity. The performance of each of the trained networks using the normalized datasets was evaluated using R², RMSE, ME, and standard deviation (std). The outputs obtained for the various combinations of neurons are presented in Tables 3, 4.
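The data preparation described for the two models can be sketched as follows. This is a minimal illustration, assuming column-wise min-max scaling and a random 70/15/15 partition; the exact normalization formula, the partitioning routine, and the random seed used in the study are not stated, and the input values below are random placeholders rather than the data of Table 2.

```python
import numpy as np


def minmax_scale(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Rescale each column linearly so that its minimum maps to lo and its maximum to hi.

    Assumes every column has distinct minimum and maximum values.
    """
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return lo + (hi - lo) * (x - x_min) / (x_max - x_min)


def split_indices(n: int, seed: int = 0):
    """Shuffle n row indices and cut them into 70% training, 15% testing, and the rest."""
    order = np.random.default_rng(seed).permutation(n)
    n_train = (70 * n) // 100
    n_test = (15 * n) // 100
    return order[:n_train], order[n_train:n_train + n_test], order[n_train + n_test:]


# Placeholder data: 115 rows and 5 inputs (VM, Ash, FC, Temp, RT).
rng = np.random.default_rng(42)
inputs = rng.uniform(0.0, 100.0, size=(115, 5))

ann_inputs = minmax_scale(inputs, -1.0, 1.0)  # range reported for the MISO-ANN
gep_inputs = minmax_scale(inputs, 0.0, 1.0)   # range reported for the GEP model
train_idx, test_idx, val_idx = split_indices(len(inputs))
print(len(train_idx), len(test_idx), len(val_idx))  # 80 17 18
```

With 115 data points the three subsets cannot be exactly 70%, 15%, and 15%, so the sketch assigns the remainder to the last subset; other rounding conventions are equally plausible.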
The best network for the MY prediction is 5-10-10-1, while 5-15-15-1 is the best network for both the EY and the HHV, as shown in bold in Tables 3, 4 and presented in Figs. 6 and 7.
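The forward computation of a double hidden-layer network of this kind can be sketched as below, reading the best MY architecture as five inputs, two hidden layers of ten neurons each, and one output. The weights here are random and untrained, the helper names are hypothetical, and the Levenberg-Marquardt training step is not shown; the sketch only illustrates how hyperbolic tangent hidden layers and a linear output layer combine.

```python
import numpy as np


def init_layers(sizes, seed=0):
    """Random (untrained) weights and zero biases for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(0.0, 0.5, size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, layers):
    """Apply tanh on every hidden layer and a linear transfer on the output layer."""
    activation = x
    for weights, bias in layers[:-1]:
        activation = np.tanh(activation @ weights + bias)
    out_weights, out_bias = layers[-1]
    return activation @ out_weights + out_bias


# Five inputs (VM, Ash, FC, Temp, RT), two hidden layers of ten neurons, one output.
layers = init_layers([5, 10, 10, 1])
x = np.random.default_rng(1).uniform(-1.0, 1.0, size=(4, 5))  # four normalized samples
y = forward(x, layers)
print(y.shape)  # (4, 1)
```

Changing the size list to `[5, 15, 15, 1]` gives the architecture reported as best for EY and HHV; one such single-output network is built per target, which is what makes the model multiple-input single-output.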

Multiple linear regression analysis
Regression analysis is commonly used to establish the relationship between regressors and a targeted variable. When it involves a relationship between the targeted variable and a single regressor, it is known as simple linear regression analysis. When more than one regressor is involved, it is known as multiple linear regression (MLR) analysis. MLR has been used by researchers (Said et al. 2020a, b; Onifade et al. 2019) for prediction purposes. MLR is also adopted in this study to enable comparison with the GEP and ANN models. An MLR model was developed for each of the three predicted parameters: MY, EY, and HHV. The MLR analysis was performed with the Microsoft Excel add-ins using the same datasets used for the GEP and ANN models. The obtained MLR models are presented in Eqs. (9) to (11).
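As an illustration of the general form of such a model, and not of the fitted coefficients of this study, an MLR with five regressors can be estimated by ordinary least squares as sketched below. NumPy is used here in place of the Excel add-ins, and the data and coefficient values are synthetic and made up for the example.

```python
import numpy as np


def fit_mlr(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares fit; returns [intercept, b1, ..., bk]."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_mlr(coef: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Evaluate y = intercept + b1*x1 + ... + bk*xk for each row of x."""
    return coef[0] + x @ coef[1:]


# Synthetic, noise-free data with five regressors standing in for VM, Ash, FC, Temp, RT.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(115, 5))
true_coef = np.array([50.0, -0.2, 0.1, 0.3, -0.15, -0.05])  # made-up values
y = true_coef[0] + x @ true_coef[1:]

coef = fit_mlr(x, y)
print(np.round(coef, 3))
```

Because the synthetic target is exactly linear in the regressors, the fit recovers the made-up coefficients; for real hydrochar data the relationship is non-linear, which is one reason a purely linear model may fit poorly.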

Models comparison
The accuracy of the proposed GEP, MISO-ANN, and MLR models is compared with the laboratory-measured values using the testing and validation datasets. For the MY, the outcome of the comparison is presented in Fig. 8. For the testing datasets, the points predicted with MISO-ANN fall largely within the 3% error line, while many of the points predicted by the GEP and MLR fall outside the error line. This gave rise to an R² of 0.981 for the MISO-ANN, while the R² values recorded for the GEP and MLR models are 0.691 and 0.463, respectively, for the testing datasets. For the validation data points, the R² value recorded for the MISO-ANN is 0.976, while those of GEP and MLR are 0.548 and 0.154, respectively. Of the three proposed models, the MISO-ANN predictions are generally the closest to the experimentally measured values. The performance of MISO-ANN can be attributed to its ability to handle complex non-linearity between the model parameters (Gevrey et al. 2003). The outcome of the MISO-ANN is consistent with most of the previous studies that compared the performance of ANN with regression-based models in predicting the HHV of solid fuels (Onifade et al. 2019; Ghugare et al. 2017; Uzun et al. 2017; Patel et al. 2007). Aside from the HHV of solid fuel, many authors have found that the ANN provides a more reliable predictive model than the regression-based model (Lawal 2020; Lawal et al. 2020; Said et al. 2020a; Saadat et al. 2014; Khandelwal and Singh 2010).
The predictive ability of the three proposed models (GEP, MISO-ANN, and MLR) was also compared for the hydrochar property EY, as presented in Fig. 9. The majority of the data points predicted with MISO-ANN fall within the 3% error lines, while many of the data points predicted using GEP and MLR fall outside the error lines. As a result, the R² values of the MISO-ANN for the testing and validation datasets are 0.965 and 0.955, respectively, while those of the GEP are 0.622 and 0.419. For the MLR model, the R² values for the respective testing and validation datasets are 0.219 and 0.205. Again, the MISO-ANN model outperforms the GEP and MLR models.
Similarly, the accuracy of the proposed models (GEP, MISO-ANN, and MLR) for predicting HHV is evaluated as presented in Fig. 10. All the data points predicted using the MISO-ANN fall within the 3% error lines for both the testing and validation data points, whereas the majority of the data points predicted by the GEP and MLR models fall outside the error lines. Consistent with this, the performance of the MISO-ANN is excellent, with an R² of 0.999 for the testing data points (Fig. 10a) and 0.996 for the validation data points (Fig. 10b). The R² values obtained for the GEP models are 0.810 and 0.717, while those of the MLR models are 0.788 and 0.643 for the respective testing and validation data points. The low R² values observed for the GEP and MLR models in the MY, EY, and HHV predictions reflect the predicted data points that fall outside the error bands. Hence, the MISO-ANN can give reliable predictions of the MY, EY, and HHV, followed by the GEP model, while MLR may not be reliable.

Error analysis
To enable the selection of the best performing model for predicting the MY, EY, and HHV, the mean absolute error and mean bias error were evaluated for each of the three techniques used in developing the proposed models, as defined in Eqs. (12) and (13). The results obtained from the analyses using Eqs. (12) and (13) are presented in Table 5. From Table 5, MAE values of 2.24, 2.11, and 0.93% were obtained for MY, EY, and HHV, respectively, using the MISO-ANN model, while the MBE values obtained are 0.16, 0.37, and 0.12% for MY, EY, and HHV, respectively. Hence, the best model for the prediction of MY, EY, and HHV is the MISO-ANN model, followed by the GEP model, while the MLR will overestimate the values of the MY, EY, and HHV based on the MBE values in Table 5.
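The performance metrics can be computed as in the following sketch. It assumes the conventional definitions: MAE as the mean of the absolute differences between predicted and measured values, MBE as the mean of the signed differences (so that a positive value indicates overestimation), and R² as one minus the ratio of the residual to the total sum of squares. The study may have used a different variant, for example a percentage-based error or the squared correlation coefficient, and the numbers below are made up.

```python
def mean_absolute_error(measured, predicted):
    """Mean of |predicted - measured|."""
    return sum(abs(p - m) for m, p in zip(measured, predicted)) / len(measured)


def mean_bias_error(measured, predicted):
    """Mean of (predicted - measured); positive values indicate overestimation."""
    return sum(p - m for m, p in zip(measured, predicted)) / len(measured)


def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_measured = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot


# Made-up mass yields (%) for four samples.
measured = [50.0, 60.0, 70.0, 80.0]
predicted = [52.0, 59.0, 71.0, 80.0]

print(mean_absolute_error(measured, predicted))  # 1.0
print(mean_bias_error(measured, predicted))      # 0.5
print(round(r_squared(measured, predicted), 3))  # 0.988
```

Reporting MBE alongside MAE is useful because errors of opposite sign cancel in the MBE: a model can have a small bias and still a large absolute error.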

Sensitivity analysis
Sensitivity analysis provides useful information on the contribution of each input parameter to the output predicted by the model. Various techniques have been proposed to perform this task, but the cosine amplitude method (CAM) (Yang and Zhang 1997), as presented in Eq. (14), is adopted in this study:

$$R_{ij} = \frac{\sum_{m=1}^{n} r_m P_m}{\sqrt{\sum_{m=1}^{n} r_m^{2} \, \sum_{m=1}^{n} P_m^{2}}} \qquad (14)$$

where $R_{ij}$ stands for the strength of the input parameter, $r_m$ represents the model regressors, $P$ is the predicted output, and $n$ is the number of data points.
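The cosine amplitude measure can be sketched in Python as below, assuming its usual form: the sum of the products of an input and the predicted output over all data points, divided by the square root of the product of their sums of squares. The input names follow the study, but the data values are invented for illustration and do not reproduce the reported ranking.

```python
import math


def cosine_amplitude(regressor, predicted):
    """Strength of relation between one input series and the predicted output."""
    numerator = sum(r * p for r, p in zip(regressor, predicted))
    denominator = math.sqrt(
        sum(r * r for r in regressor) * sum(p * p for p in predicted)
    )
    return numerator / denominator


# Invented normalized values for three samples.
predicted_output = [0.2, 0.5, 0.9]
model_inputs = {
    "VM": [0.25, 0.55, 0.85],
    "Temp": [0.9, 0.5, 0.3],
    "RT": [0.1, 0.9, 0.2],
}

strengths = {name: cosine_amplitude(values, predicted_output)
             for name, values in model_inputs.items()}
for name, strength in sorted(strengths.items(), key=lambda item: -item[1]):
    print(f"{name}: {strength:.3f}")
```

For non-negative data the measure lies between 0 and 1, and values closer to 1 indicate a stronger influence of that input on the predicted output, which is how the inputs are ranked in a Pareto chart.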
The MISO-ANN model, which was adjudged the best of the three proposed models based on the preceding analysis, is used to perform the sensitivity analysis. The output obtained is presented in the Pareto charts (PC) shown in Figs. 11, 12, 13. The VM and Temp have the highest influence on the MY, EY, and HHV, as presented in Figs. 11, 12, 13. The order of influence in all the figures is VM > Temp > FC > RT > Ash. In addition, based on the Pareto chart analysis, RT and Ash, although the least influential, should not be ignored when predicting MY, EY, and HHV.

Conclusion
Mass yield, energy yield, and higher heating value are important hydrochar properties required for the analysis and design of any bioenergy system. In the present study, Gene Expression Programming, multiple-input single-output artificial neural network, and multilinear regression were applied to predict the MY, EY, and HHV of hydrochars using the composition of the biomass source from proximate analysis and the HTC process conditions (temperature and residence time). Based on R² values and error analysis, MISO-ANN with 5-10-10-1 and 5-15-15-1 network architectures presented the best performance among the proposed models, with R² = 0.976, 0.955, 0.996; MAE = 2.24, 2.11, 0.93; MBE = 0.16, 0.37, 0.12 for the respective MY, EY, and HHV. GEP has been shown to provide a satisfactory predictive alternative to MISO-ANN, with R² = 0.691, 0.622, 0.810; MAE = 12.38, 10.31, 8.58; MBE = 2.95, 0.64, 0.78 for MY, EY, and HHV, respectively. From the sensitivity analysis, volatile matter and temperature were found to be the most influential input variables. This study demonstrated the ability of GEP to satisfactorily model hydrochar properties based on biomass composition and HTC process conditions. Although the accuracy of the GEP models was lower than that of the MISO-ANN models, the GEP models provided much more accurate predictions than the MLR models, which proved unsatisfactory.