Reverse engineering model structures for soil and ecosystem respiration: the potential of gene expression programming

2017 ◽  
Vol 10 (9) ◽  
pp. 3519-3545 ◽  
Author(s):  
Iulia Ilie ◽  
Peter Dittrich ◽  
Nuno Carvalhais ◽  
Martin Jung ◽  
Andreas Heinemeyer ◽  
...  

Abstract. Accurate model representation of land–atmosphere carbon fluxes is essential for climate projections. However, the exact responses of carbon cycle processes to climatic drivers often remain uncertain. Presently, knowledge derived from experiments, complemented by a steadily evolving body of mechanistic theory, provides the main basis for developing such models. The strongly increasing availability of measurements may facilitate new ways of identifying suitable model structures using machine learning. Here, we explore the potential of gene expression programming (GEP) to derive relevant model formulations based solely on the signals present in data by automatically applying various mathematical transformations to potential predictors and repeatedly evolving the resulting model structures. In contrast to most other machine learning regression techniques, the GEP approach generates readable models that allow for prediction and possibly for interpretation. Our study is based on two cases: artificially generated data and real observations. Simulations based on artificial data show that GEP is successful in identifying prescribed functions, with the prediction capacity of the models comparable to four state-of-the-art machine learning methods (random forests, support vector machines, artificial neural networks, and kernel ridge regressions). Based on real observations we explore the responses of the different components of terrestrial respiration at an oak forest in south-eastern England. We find that the GEP-retrieved models are often better in prediction than some established respiration models. Based on their structures, we find previously unconsidered exponential dependencies of respiration on seasonal ecosystem carbon assimilation and water dynamics. We also noticed that the GEP models are only partly portable across respiration components; equifinality issues possibly prevent the identification of a general terrestrial respiration model. Overall, GEP is a promising tool for uncovering new model structures for terrestrial ecology in the data-rich era, complementing more traditional modelling approaches.
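For readers unfamiliar with GEP, the encoding idea behind it can be sketched in a few lines. This is a minimal illustration of how a linear gene is decoded breadth-first into an expression tree (the Karva-style reading commonly described for GEP), not the authors' implementation. The function set, the symbol names `R0` and `T`, and the example gene are assumptions made purely for illustration.

```python
import math
from collections import deque

# Hypothetical function set: symbol -> (arity, implementation).
FUNCTIONS = {
    "+": (2, lambda a, b: a + b),
    "-": (2, lambda a, b: a - b),
    "*": (2, lambda a, b: a * b),
    "exp": (1, math.exp),
}


def decode(gene):
    """Decode a linear gene into a tree of [symbol, children] nodes, breadth-first."""
    symbols = iter(gene)
    root = [next(symbols), []]
    pending = deque([root])
    while pending:
        node = pending.popleft()
        # Terminals (variables) have arity 0 and take no children.
        arity = FUNCTIONS[node[0]][0] if node[0] in FUNCTIONS else 0
        for _ in range(arity):
            child = [next(symbols), []]
            node[1].append(child)
            pending.append(child)
    # Symbols left unread form the non-coding tail of the gene.
    return root


def evaluate(node, variables):
    """Evaluate the decoded tree for a given assignment of variables."""
    symbol, children = node
    if symbol in FUNCTIONS:
        args = [evaluate(child, variables) for child in children]
        return FUNCTIONS[symbol][1](*args)
    return variables[symbol]


def to_string(node):
    """Render the tree as a readable formula."""
    symbol, children = node
    if not children:
        return symbol
    if len(children) == 1:
        return f"{symbol}({to_string(children[0])})"
    return f"({to_string(children[0])} {symbol} {to_string(children[1])})"


# Illustrative gene: the last two symbols are never read (non-coding tail).
gene = ["*", "R0", "exp", "T", "T", "R0"]
tree = decode(gene)
print(to_string(tree))                          # (R0 * exp(T))
print(evaluate(tree, {"R0": 2.0, "T": 0.0}))    # 2.0
```

Because the decoded result is an explicit formula, changing a single symbol in the gene changes the model structure in a way a reader can inspect, which is the sense in which GEP models are described as readable.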

2016 ◽  
Author(s):  
Iulia Ilie ◽  
Peter Dittrich ◽  
Nuno Carvalhais ◽  
Martin Jung ◽  
Andreas Heinemeyer ◽  
...  

Abstract. Accurate modelling of land-atmosphere carbon fluxes is essential for future climate projections. However, the exact responses of carbon cycle processes to climatic drivers often remain uncertain. Presently, knowledge derived from experiments, complemented by a steadily evolving body of mechanistic theory, provides the main basis for developing the respective models. The strongly increasing availability of measurements may complicate the traditional hypothesis-driven path to developing mechanistic models, but it may also facilitate new ways of identifying suitable model structures using machine learning. Here we explore the potential to derive model formulations automatically from data based on gene expression programming (GEP). GEP automatically (re)combines various mathematical operators into model formulations that are further evolved, eventually identifying the most suitable structures. In contrast to most other machine learning regression techniques, the GEP approach generates models that allow for prediction and possibly for interpretation. Our study is based on two cases: artificially generated data and real observations. Simulations based on artificial data show that GEP is successful in identifying prescribed functions, with the prediction capacity of the models comparable to four state-of-the-art machine learning methods (Random Forests, Support Vector Machines, Artificial Neural Networks, and Kernel Ridge Regressions). The case of real observations explores different components of terrestrial respiration at an oak forest in south-east England. We find that GEP-retrieved models are often better in prediction than established respiration models. Furthermore, the structure of the GEP models offers new insights into driver selection and interactions. We find previously unconsidered exponential dependencies of respiration on seasonal ecosystem carbon assimilation and water dynamics. However, we also noticed that the GEP models are only partly portable across respiration components; equifinality issues possibly prevent the identification of a "general" terrestrial respiration model. Overall, GEP is a promising tool to uncover new model structures for terrestrial ecology in the data-rich era, complementing the traditional approach of model building.


Mathematics ◽  
2020 ◽  
Vol 8 (6) ◽  
pp. 913
Author(s):  
Zulfiqar Ahmad ◽  
Hua Zhong ◽  
Amir Mosavi ◽  
Mehreen Sadiq ◽  
Hira Saleem ◽  
...  

The present study emphasizes the efficacy of a biosurfactant-producing bacterial strain, Klebsiella sp. KOD36, in the biodegradation of azo dyes and hexavalent chromium, individually and in a simultaneous system. The strain exhibited considerable potential for biodegradation of chromium and azo dyes in single and combined systems (maximum 97% and 94% in the individual and combined systems, respectively). Simultaneous aerobic biodegradation of azo dyes and hexavalent chromium (SBAHC) was modeled using machine learning methods: gene expression programming (GEP), random forest (RF), support vector regression (SVR), and a hybrid of support vector regression with the fruit fly optimization algorithm (SVR-FOA). The correlation coefficient (CC), the scatter index (SI), and Willmott’s index of agreement (WI) were employed as statistical metrics to assess the performance of each model separately. In addition, a Taylor diagram was used to further compare the methods. The SVR-FOA model, with a CC of 0.644, an SI of 0.374, and a WI of 0.607, performed better than the standalone SVR, GEP, and RF methods. The standalone SVR model, with a CC of 0.146, an SI of 0.473, and a WI of 0.408, ranked second. In summary, the hybrid SVR-FOA method gave the most accurate estimates of SBAHC among the methods tested; in other words, FOA proved to be an effective optimization algorithm for increasing the accuracy of the SVR method.
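The three agreement metrics named in this abstract can be written out explicitly. The sketch below uses the commonly cited definitions (Pearson correlation, RMSE divided by the observed mean, and Willmott's original index of agreement); the abstract does not state the exact formulas the authors used, so these definitions and the sample values are assumptions for illustration only.

```python
import math


def correlation_coefficient(obs, pred):
    """Pearson correlation between observed and predicted values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_pred = sum(pred) / n
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    spread_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs))
    spread_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred))
    return cov / (spread_obs * spread_pred)


def scatter_index(obs, pred):
    """Root-mean-square error normalised by the mean of the observations."""
    n = len(obs)
    rmse = math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / n)
    return rmse / (sum(obs) / n)


def willmott_index(obs, pred):
    """Willmott's index of agreement: 1 means perfect agreement."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((p - o) ** 2 for o, p in zip(obs, pred))
    denominator = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2 for o, p in zip(obs, pred)
    )
    return 1.0 - numerator / denominator


# Illustrative values only: predictions with a constant bias of +1.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(correlation_coefficient(observed, predicted))
print(scatter_index(observed, predicted))
print(willmott_index(observed, predicted))
```

In this toy case the correlation stays at essentially 1 despite the bias, while the scatter index and Willmott's index both penalise it, which is one reason such metrics are usually reported together.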


2021 ◽  
Vol 11 (2) ◽  
pp. 61
Author(s):  
Jiande Wu ◽  
Chindo Hicks

Background: Breast cancer is a heterogeneous disease defined by molecular types and subtypes. Advances in genomic research have enabled the use of precision medicine in the clinical management of breast cancer. A critical unmet medical need is distinguishing triple negative breast cancer, the most aggressive and lethal form of breast cancer, from non-triple negative breast cancer. Here we propose the use of a machine learning (ML) approach for classification of triple negative and non-triple negative breast cancer patients using gene expression data. Methods: We analysed RNA-sequencing data from 110 triple negative and 992 non-triple negative breast cancer tumor samples from The Cancer Genome Atlas to select the features (genes) used in the development and validation of the classification models. We evaluated four classification models (Support Vector Machines, K-nearest neighbor, Naïve Bayes, and Decision tree), using features selected at different threshold levels to train the models for classifying the two types of breast cancer. For performance evaluation and validation, the proposed methods were applied to independent gene expression datasets. Results: Among the four ML algorithms evaluated, the Support Vector Machine algorithm classified breast cancer into triple negative and non-triple negative types more accurately and with fewer misclassification errors than the other three algorithms. Conclusions: The prediction results show that ML algorithms are efficient and can be used for classification of breast cancer into triple negative and non-triple negative types.
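When reading accuracy figures for this kind of two-class problem, it can help to keep the class balance in mind. The short calculation below uses only the sample counts stated in the abstract (110 and 992) and shows what a trivial classifier that always predicts the majority class would score; the helper function and the "always predict non-triple negative" scenario are illustrative assumptions, not part of the study.

```python
# Sample counts taken from the abstract above.
n_triple_negative = 110
n_non_triple_negative = 992
total = n_triple_negative + n_non_triple_negative

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = max(n_triple_negative, n_non_triple_negative) / total
print(f"{baseline_accuracy:.3f}")  # 0.900


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)


# The same trivial classifier never detects a triple negative sample.
sens, spec = sensitivity_specificity(
    tp=0, fn=n_triple_negative, tn=n_non_triple_negative, fp=0
)
print(sens, spec)  # 0.0 1.0
```

Under these assumptions, overall accuracy alone would not distinguish a useful classifier from the trivial one, so per-class error rates are the more informative comparison.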


2016 ◽  
Vol 24 (1) ◽  
pp. 54-65 ◽  
Author(s):  
Stefano Parodi ◽  
Chiara Manneschi ◽  
Damiano Verda ◽  
Enrico Ferrari ◽  
Marco Muselli

This study evaluates the performance of a set of machine learning techniques in predicting the prognosis of Hodgkin’s lymphoma using clinical factors and gene expression data. Analysed samples from 130 Hodgkin’s lymphoma patients included a small set of clinical variables and more than 54,000 gene features. Machine learning classifiers included three black-box algorithms (k-nearest neighbour, Artificial Neural Network, and Support Vector Machine) and two methods based on intelligible rules (Decision Tree and the innovative Logic Learning Machine method). Support Vector Machine clearly outperformed any of the other methods. Among the two rule-based algorithms, Logic Learning Machine performed better and identified a set of simple intelligible rules based on a combination of clinical variables and gene expressions. Decision Tree identified a non-coding gene (XIST) involved in the early phases of X chromosome inactivation that was overexpressed in females and in non-relapsed patients. XIST expression might be responsible for the better prognosis of female Hodgkin’s lymphoma patients.


2020 ◽  
pp. annrheumdis-2020-217840 ◽  
Author(s):  
Kimberly Showalter ◽  
Robert Spiera ◽  
Cynthia Magro ◽  
Phaedra Agius ◽  
Viktor Martyanov ◽  
...  

Objective: We sought to determine histologic and gene expression features of clinical improvement in early diffuse cutaneous systemic sclerosis (dcSSc; scleroderma). Methods: Fifty-eight forearm biopsies were evaluated from 26 individuals with dcSSc in two clinical trials. Histologic/immunophenotypic assessments of global severity, alpha-smooth muscle actin (aSMA), CD34, collagen, inflammatory infiltrate, follicles and thickness were compared with gene expression and clinical data. Support vector machine learning was performed using scleroderma gene expression subset (normal-like, fibroproliferative, inflammatory) as classifiers and histology scores as inputs. Comparison of w-vector mean absolute weights was used to identify histologic features most predictive of gene expression subset. We then tested for differential gene expression according to histologic severity and compared those with clinical improvement (according to the Combined Response Index in Systemic Sclerosis). Results: aSMA was highest and CD34 lowest in samples with highest local Modified Rodnan Skin Score. CD34 and aSMA changed significantly from baseline to 52 weeks in clinical improvers. CD34 and aSMA were the strongest predictors of gene expression subset, with highest CD34 staining in the normal-like subset (p<0.001) and highest aSMA staining in the inflammatory subset (p=0.016). Analysis of gene expression according to CD34 and aSMA binarised scores identified a 47-gene fibroblast polarisation signature that decreases over time only in improvers (vs non-improvers). Pathway analysis of these genes identified gene expression signatures of inflammatory fibroblasts. Conclusion: CD34 and aSMA stains describe distinct fibroblast polarisation states, are associated with gene expression subsets and clinical assessments, and may be useful biomarkers of clinical severity and improvement in dcSSc.
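The idea of ranking inputs by the mean absolute weight of a linear classifier's w-vectors can be sketched as follows. This is a generic illustration under stated assumptions, not the study's analysis: the weight values, the three placeholder feature names, and the assumption of one weight vector per gene expression subset are all hypothetical.

```python
def rank_by_mean_abs_weight(weight_vectors, feature_names):
    """Rank features by the mean absolute weight across linear classifiers."""
    n_vectors = len(weight_vectors)
    scores = {
        name: sum(abs(w[i]) for w in weight_vectors) / n_vectors
        for i, name in enumerate(feature_names)
    }
    # Highest mean absolute weight first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical weights: one vector per class-specific linear classifier.
feature_names = ["feature_a", "feature_b", "feature_c"]
weight_vectors = [
    [0.5, -1.0, 0.25],
    [-1.5, 0.5, 0.25],
    [1.0, 0.0, -0.25],
]

ranking = rank_by_mean_abs_weight(weight_vectors, feature_names)
for name, score in ranking:
    print(name, score)
```

Taking absolute values before averaging matters here: a feature that pushes strongly in opposite directions for different subsets would otherwise average towards zero and look uninformative.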


PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0257343
Author(s):  
Shaoshuo Li ◽  
Baixing Chen ◽  
Hao Chen ◽  
Zhen Hua ◽  
Yang Shao ◽  
...  

Objectives: Smoking is a significant independent risk factor for postmenopausal osteoporosis, leading to genome variations in postmenopausal smokers. This study investigates potential biomarkers and molecular mechanisms of smoking-related postmenopausal osteoporosis (SRPO). Materials and methods: The GSE13850 microarray dataset was downloaded from Gene Expression Omnibus (GEO). Gene modules associated with SRPO were identified using weighted gene co-expression network analysis (WGCNA), protein-protein interaction (PPI) analysis, and pathway and functional enrichment analyses. Feature genes were selected using two machine learning methods: support vector machine-recursive feature elimination (SVM-RFE) and random forest (RF). The diagnostic efficiency of the selected genes was assessed by gene expression analysis and receiver operating characteristic curve. Results: Eight highly conserved modules were detected in the WGCNA network, and the genes in the module that was strongly correlated with SRPO were used for constructing the PPI network. A total of 113 hub genes were identified in the core network using topological network analysis. Enrichment analysis results showed that hub genes were closely associated with the regulation of RNA transcription and translation, ATPase activity, and immune-related signaling. Six genes (HNRNPC, PFDN2, PSMC5, RPS16, TCEB2, and UBE2V2) were selected as genetic biomarkers for SRPO by integrating the feature selection of SVM-RFE and RF. Conclusion: The present study identified potential genetic biomarkers and provided a novel insight into the underlying molecular mechanism of SRPO.


Author(s):  
Ali Rashid Niaghi ◽  
Oveis Hassanijalilian ◽  
Jalal Shiri

The ASCE-EWRI reference evapotranspiration (ETo) equation is recommended as a standardized method for reference crop ETo estimation. However, the many climate variables required as inputs to the standardized ETo method are often unavailable, which restricts its use. This paper assessed the potential of different machine learning (ML) models for ETo estimation using limited meteorological data. The ML models used to estimate daily ETo included Gene Expression Programming (GEP), Support Vector Machine (SVM), Multiple Linear Regression (LR), and Random Forest (RF). Three input combinations were considered: daily maximum and minimum temperature (Tmax and Tmin); wind speed (W) with Tmax and Tmin; and solar radiation (Rs) with Tmax and Tmin. Meteorological data from 2003–2016 at six weather stations in the Red River Valley were used. To understand the performance of the applied models with the various combinations, station-based and year-based tests were assessed with local and spatial approaches. Across the local and spatial analyses, the LR and RF models showed the lowest rate of improvement compared to GEP and SVM. The spatial RF and SVM approaches showed the lowest and highest values of the scatter index, 0.333 and 0.457, respectively. Overall, the radiation-based combination and the RF model showed the best performance for all stations, either locally or spatially, and the spatial SVM and GEP showed the lowest performance among models and approaches.

