Error assessment and optimal cross-validation approaches in machine learning applied to impurity diffusion

2019 ◽  
Vol 169 ◽  
pp. 109075 ◽  
Author(s):  
Hai-Jin Lu ◽  
Nan Zou ◽  
Ryan Jacobs ◽  
Ben Afflerbach ◽  
Xiao-Gang Lu ◽  
...  
2020 ◽  
Vol 25 (40) ◽  
pp. 4296-4302 ◽  
Author(s):  
Yuan Zhang ◽  
Zhenyan Han ◽  
Qian Gao ◽  
Xiaoyi Bai ◽  
Chi Zhang ◽  
...  

Background: β thalassemia is a common monogenic genetic disease that is very harmful to human health. The disease arises from the deletion of or defects in β-globin, which reduces synthesis of the β-globin chain and results in a relative excess of α-chains. The formation of inclusion bodies deposited on the cell membrane decreases the ability of red blood cells to deform, and their massive destruction in the spleen causes a group of hereditary haemolytic diseases. Methods: In this work, machine learning algorithms were employed to build a prediction model for inhibitors against K562 based on 117 inhibitors and 190 non-inhibitors. Results: The overall accuracy (ACC) of a 10-fold cross-validation test and an independent set test using Adaboost were 83.1% and 78.0%, respectively, surpassing Bayes Net, Random Forest, Random Tree, C4.5, SVM, KNN and Bagging. Conclusion: This study indicated that Adaboost could be applied to build a learning model for predicting inhibitors against K562 cells.
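The 10-fold cross-validation mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `k_fold_indices`, the shuffling seed, and the use of plain index lists are hypothetical; only the sample counts (117 inhibitors, 190 non-inhibitors) and the number of folds come from the abstract.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled, disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding through the shuffled list gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


# Sample counts taken from the abstract: 117 inhibitors + 190 non-inhibitors.
n_samples = 117 + 190
folds = k_fold_indices(n_samples, k=10)

splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(n_samples) if i not in held_out]
    # A classifier would be fit on train_idx and scored on test_idx here;
    # the per-fold scores are then averaged into one cross-validation accuracy.
    splits.append((train_idx, test_idx))
```

Each sample is held out exactly once, so the averaged score uses every record for testing while never scoring a model on data it was trained on. The independent test set reported in the abstract would be kept entirely outside this loop.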


2021 ◽  
Vol 13 (3) ◽  
pp. 408
Author(s):  
Charles Nickmilder ◽  
Anthony Tedde ◽  
Isabelle Dufrasne ◽  
Françoise Lessire ◽  
Bernard Tychon ◽  
...  

Accurate information about the available standing biomass on pastures is critical for the adequate management of grazing and its promotion to farmers. In this paper, machine learning models are developed to predict available biomass expressed as compressed sward height (CSH) from readily accessible meteorological, optical (Sentinel-2) and radar satellite data (Sentinel-1). This study assumed that combining heterogeneous data sources, data transformations and machine learning methods would improve the robustness and the accuracy of the developed models. A total of 72,795 records of CSH with a spatial positioning, collected in 2018 and 2019, were used and aggregated according to a pixel-like pattern. The resulting dataset was split into a training one with 11,625 pixellated records and an independent validation one with 4952 pixellated records. The models were trained with a 19-fold cross-validation. A wide range of performances was observed (with mean root mean square error (RMSE) of cross-validation ranging from 22.84 mm of CSH to infinite-like values), and the four best-performing models were a cubist, a glmnet, a neural network and a random forest. These models had an RMSE of independent validation lower than 20 mm of CSH at the pixel-level. To simulate the behavior of the model in a decision support system, performances at the paddock level were also studied. These were computed according to two scenarios: either the predictions were made at a sub-parcel level and then aggregated, or the data were aggregated at the parcel level and the predictions were made for these aggregated data. The results obtained in this study were more accurate than those found in the literature concerning pasture budgeting and grassland biomass evaluation. The training of the 124 models resulting from the described framework was part of the realization of a decision support system to help farmers in their daily decision making.


2021 ◽  
pp. 1-15
Author(s):  
Sung Hoon Kang ◽  
Bo Kyoung Cheon ◽  
Ji-Sun Kim ◽  
Hyemin Jang ◽  
Hee Jin Kim ◽  
...  

Background: Amyloid (Aβ) evaluation in amnestic mild cognitive impairment (aMCI) patients is important for predicting conversion to Alzheimer’s disease. However, Aβ evaluation through amyloid positron emission tomography (PET) is limited due to high cost and safety issues. Objective: We therefore aimed to develop and validate prediction models of Aβ positivity for aMCI using optimal interpretable machine learning (ML) approaches utilizing multimodal markers. Methods: We recruited 529 aMCI patients from multiple centers who underwent Aβ PET. We trained ML algorithms using a training cohort (324 aMCI from Samsung Medical Center) with two-phase modelling: model 1 included age, gender, education, diabetes, hypertension, apolipoprotein E genotype, and neuropsychological test scores; model 2 included the same variables as model 1 with additional MRI features. We used four-fold cross-validation during the modelling and evaluated the models on an external validation cohort (187 aMCI from the other centers). Results: Model 1 showed good accuracy (area under the receiver operating characteristic curve [AUROC] 0.837) in cross-validation, and fair accuracy (AUROC 0.765) in external validation. Model 2 improved on the prediction performance of model 1, with good accuracy (AUROC 0.892) in cross-validation. Apolipoprotein E genotype, delayed recall task scores, and interaction between cortical thickness in the temporal region and hippocampal volume were the most important predictors of Aβ positivity. Conclusion: Our results suggest that ML models are effective in predicting Aβ positivity at the individual level and could help the biomarker-guided diagnosis of prodromal AD.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Carly A. Bobak ◽  
Lili Kang ◽  
Lesley Workman ◽  
Lindy Bateman ◽  
Mohammad S. Khan ◽  
...  

Pediatric tuberculosis (TB) remains a global health crisis. Despite progress, pediatric patients remain difficult to diagnose, with approximately half of all childhood TB patients lacking bacterial confirmation. In this pilot study (n = 31), we identify a 4-compound breathprint and subsequent machine learning model that accurately classifies children with confirmed TB (n = 10) from children with another lower respiratory tract infection (LRTI) (n = 10) with a sensitivity of 80% and specificity of 100% observed across cross validation folds. Importantly, we demonstrate that the breathprint identified an additional nine of eleven patients who had unconfirmed clinical TB and whose symptoms improved while treated for TB. While more work is necessary to validate the utility of using patient breath to diagnose pediatric TB, it shows promise as a triage instrument or paired as part of an aggregate diagnostic scheme.
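As a worked illustration of the two metrics quoted in this abstract, sensitivity and specificity can be computed from confusion-matrix counts. The counts below are an assumption: they are the simplest pooled counts consistent with 80% sensitivity among 10 confirmed TB cases and 100% specificity among 10 LRTI cases, whereas the abstract reports rates observed across cross-validation folds and does not give the counts. The function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true cases correctly flagged
    specificity = tn / (tn + fp)  # share of non-cases correctly cleared
    return sensitivity, specificity


# Assumed counts (not stated in the abstract): 8 of 10 confirmed TB cases
# detected, and none of the 10 LRTI controls misclassified as TB.
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=10, fp=0)
print(f"sensitivity={sens:.0%}, specificity={spec:.0%}")
```

With groups this small, a single reclassified child shifts either rate by ten percentage points, which is one reason the authors call for further validation.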


10.2196/13476 ◽  
2019 ◽  
Vol 7 (3) ◽  
pp. e13476 ◽  
Author(s):  
Jiangpeng Wu ◽  
Xiangyi Zan ◽  
Liping Gao ◽  
Jianhong Zhao ◽  
Jing Fan ◽  
...  

Background: Liquid biopsies based on blood samples have been widely accepted as a diagnostic and monitoring tool for cancers, but extremely high sensitivity is frequently needed due to the very low levels of the specially selected DNA, RNA, or protein biomarkers that are released into blood. However, routine blood indices tests are frequently ordered by physicians, as they are easy to perform and are cost effective. In addition, machine learning is broadly accepted for its ability to decipher complicated connections between multiple sets of test data and diseases. Objective: The aim of this study is to discover the potential association between lung cancer and routine blood indices and thereby help clinicians and patients to identify lung cancer based on these routine tests. Methods: The machine learning method known as Random Forest was adopted to build an identification model between routine blood indices and lung cancer that would determine if they were potentially linked. Ten-fold cross-validation and further tests were utilized to evaluate the reliability of the identification model. Results: In total, 277 patients with 49 types of routine blood indices were included in this study, including 183 patients with lung cancer and 94 patients without lung cancer. A correlation was found between the combination of 19 types of routine blood indices and lung cancer. Lung cancer patients could be identified from other patients, especially those with tuberculosis (which usually has similar clinical symptoms to lung cancer), with a sensitivity, specificity and total accuracy of 96.3%, 94.97% and 95.7% for the cross-validation results, respectively. This identification method is called the routine blood indices model for lung cancer, and it promises to be of help as a tool for both clinicians and patients for the identification of lung cancer based on routine blood indices. Conclusions: Lung cancer can be identified based on the combination of 19 types of routine blood indices, which implies that artificial intelligence can find the connections between a disease and the fundamental indices of blood, which could reduce the necessity of costly, elaborate blood test techniques for this purpose. The combination of multiple indices obtained from routine blood tests may also be connected to other diseases.


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yao Huimin

With the development of cloud computing and distributed cluster technology, the concept of big data has been expanded and extended in terms of capacity and value, and machine learning technology has also received unprecedented attention in recent years. Traditional machine learning algorithms cannot solve the problem of effective parallelization, so a parallelized support vector machine based on the Spark big data platform is proposed. Firstly, the big data platform is designed with Lambda architecture, which is divided into three layers: Batch Layer, Serving Layer, and Speed Layer. Secondly, in order to improve the training efficiency of support vector machines on large-scale data, when merging two support vector machines, the “special points” other than support vectors are considered, that is, the points where the nonsupport vectors in one subset violate the training results of the other subset, and a cross-validation merging algorithm is proposed. Then, a parallelized support vector machine based on cross-validation is proposed, and the parallelization process of the support vector machine is realized on the Spark platform. Finally, experiments on different datasets verify the effectiveness and stability of the proposed method. Experimental results show that the proposed parallelized support vector machine has outstanding performance in speed-up ratio, training time, and prediction accuracy.


2021 ◽  
Vol 15 ◽  
Author(s):  
Jinyu Zang ◽  
Yuanyuan Huang ◽  
Lingyin Kong ◽  
Bingye Lei ◽  
Pengfei Ke ◽  
...  

Recently, machine learning techniques have been widely applied in discriminative studies of schizophrenia (SZ) patients with multimodal magnetic resonance imaging (MRI); however, the effects of brain atlases and machine learning methods remain largely unknown. In this study, we collected MRI data for 61 first-episode SZ patients (FESZ), 79 chronic SZ patients (CSZ) and 205 normal controls (NC) and calculated 4 MRI measurements, including regional gray matter volume (GMV), regional homogeneity (ReHo), amplitude of low-frequency fluctuation and degree centrality. We systematically analyzed the performance of two classifications (SZ vs NC; FESZ vs CSZ) based on the combinations of three brain atlases, five classifiers, two cross validation methods and 3 dimensionality reduction algorithms. Our results showed that the groupwise whole-brain atlas with 268 ROIs outperformed the other two brain atlases. In addition, the leave-one-out cross validation was the best cross validation method to select the best hyperparameter set, but the classification performances by different classifiers and dimensionality reduction algorithms were quite similar. Importantly, the contributions of input features to both classifications were higher with the GMV and ReHo features of brain regions in the prefrontal and temporal gyri. Furthermore, an ensemble learning method was performed to establish an integrated model, in which classification performance was improved. Taken together, these findings indicated the effects of these factors in constructing effective classifiers for psychiatric diseases and showed that the integrated model has the potential to improve the clinical diagnosis and treatment evaluation of SZ.


2022 ◽  
Author(s):  
Leon Faure ◽  
Bastien Mollet ◽  
Wolfram Liebermeister ◽  
Jean-Loup Faulon

Metabolic networks have largely been exploited as mechanistic tools to predict the behavior of microorganisms with a defined genotype in different environments. However, flux predictions by constraint-based modeling approaches are limited in quality unless labor-intensive experiments, including the measurement of media intake fluxes, are performed. Using machine learning instead of an optimization of biomass flux - on which most existing constraint-based methods are based - provides ways to improve flux and growth rate predictions. In this paper, we show how Recurrent Neural Networks can surrogate constraint-based modeling and make metabolic networks suitable for backpropagation and consequently be used as an architecture for machine learning. We refer to our hybrid - mechanistic and neural network - models as Artificial Metabolic Networks (AMN). We showcase AMN and illustrate its performance with an experimental dataset of Escherichia coli growth rates in 73 different media compositions. We reach a regression coefficient of R²=0.78 on cross-validation sets. We expect AMNs to provide easier discovery of metabolic insights and prompt new biotechnological applications.


2021 ◽  
Author(s):  
Lianteng Song ◽  
Zhonghua Liu ◽  
Chaoliu Li ◽  
Congqian Ning ◽  
...  

Geomechanical properties are essential for safe drilling, successful completion, and exploration of both conventional and unconventional reservoirs, e.g. deep shale gas and shale oil. Typically, these properties can be calculated from sonic logs. However, in shale reservoirs, it is time-consuming and challenging to obtain reliable logging data due to borehole complexity and lack of information, which often results in log deficiency and a high recovery cost for incomplete datasets. In this work, we propose using bidirectional long short-term memory (BiLSTM), a supervised neural network algorithm that has been widely used in sequential data-based prediction, to estimate geomechanical parameters. The prediction from log data can be conducted in two different ways: 1) single-well prediction, in which the log data from a single well are divided into training data and testing data for cross-validation; 2) cross-well prediction, in which a group of wells from the same geographical region is divided into a training set and a testing set for cross-validation as well. The logs used in this work were collected from 11 wells in the Jimusaer Shale and include gamma ray, bulk density, resistivity, etc. We employed 5 machine learning algorithms for comparison, among which BiLSTM showed the best performance, with an R-squared of more than 90% and an RMSE of less than 10. The predicted results can be used directly to calculate geomechanical properties, with improved accuracy compared with conventional methods.


2021 ◽  
Author(s):  
Surya Gupta ◽  
Peter Lehmann ◽  
Andreas Papritz ◽  
Tomislav Hengl ◽  
Sara Bonetti ◽  
...  

Saturated soil hydraulic conductivity (Ksat) is a key parameter in many hydrological and climatic modeling applications, as it controls the partitioning between precipitation, infiltration and runoff. Values of Ksat are often deduced from Pedotransfer Functions (PTFs) using maps of soil attributes. To circumvent inherent limitations of present PTFs (heavy reliance on arable land measurements, ignoring soil structure, and geographic bias to temperate regions), we propose a new global Ksat map at 1-km resolution by harnessing technological advances in machine learning and availability of remotely sensed surrogate information (terrain, climate and vegetation). We compiled a comprehensive Ksat data set with 13,258 geo-referenced data points from literature and other sources. The data were standardized and quality-checked in order to provide a global database of soil saturated hydraulic conductivity (SoilKsatDB). The SoilKsatDB was then applied to develop a Covariate-based GeoTransfer Function (CoGTF) model for predicting spatially distributed Ksat values using remotely sensed information on various environmental covariates. The model accuracy assessment based on spatial cross-validation shows a concordance correlation coefficient (CCC) of 0.16 and a root mean square error (RMSE) of 1.18 for log10 Ksat values in cm/day (CCC=0.79 and RMSE=0.72 for non-spatial cross-validation). The generated maps of Ksat represent spatial patterns of soil formation processes more distinctly than previous global maps of Ksat based on soil texture information and bulk density. The validation indicates that Ksat could be modeled without bias using CoGTFs that harness spatially distributed surface and climate attributes, compared to soil information based PTFs. The relatively poor performance of all models in the validation (low CCC and high RMSE) highlights the need for the collection of additional Ksat values to train the model for regions with sparse data.
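The gap between the spatial and non-spatial cross-validation scores in this abstract comes from how records are assigned to folds. The sketch below shows one common way to build spatial folds, by grouping points into grid blocks so that nearby points never straddle the train/test boundary. It is an illustration only: the abstract does not say how its spatial folds were built, and the function name, the 5-degree block size, the fold count, and the sample coordinates are all hypothetical.

```python
import math


def spatial_block_folds(points, block_size_deg=5.0, k=5):
    """Assign each (lat, lon) point to a fold via the grid block it falls in."""

    def block_of(lat, lon):
        return (math.floor(lat / block_size_deg), math.floor(lon / block_size_deg))

    # Number the occupied blocks in a fixed order, then deal them out to folds.
    block_ids = sorted({block_of(lat, lon) for lat, lon in points})
    block_to_fold = {block: i % k for i, block in enumerate(block_ids)}
    return [block_to_fold[block_of(lat, lon)] for lat, lon in points]


# Hypothetical coordinates: three nearby points and two distant ones.
points = [(46.1, 7.2), (46.3, 7.9), (10.0, 20.0), (-33.5, 150.2), (47.0, 8.5)]
fold_of_point = spatial_block_folds(points, block_size_deg=5.0, k=3)
```

Under random (non-spatial) folds, the three nearby points would usually be split between training and test sets, so the model is scored on locations very close to ones it has seen. Block-wise folds hold out whole regions, which typically gives a lower but more realistic estimate for areas with sparse data.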

