Boruta algorithm
Recently Published Documents


Total documents: 20 (five years: 18)
H-index: 4 (five years: 2)

PLoS ONE ◽  
2021 ◽  
Vol 16 (12) ◽  
pp. e0261401
Author(s):  
Christian Blüthgen ◽  
Miriam Patella ◽  
André Euler ◽  
Bettina Baessler ◽  
Katharina Martini ◽  
...  

Objectives To evaluate CT-derived radiomics for machine learning-based classification of thymic epithelial tumor (TET) stage (TNM classification), histology (WHO classification) and the presence of myasthenia gravis (MG). Methods Patients with histologically confirmed TET in the years 2000–2018 were retrospectively included, excluding patients with incompatible imaging or other tumors. CT scans were reformatted uniformly, and gray values were normalized and discretized. Tumors were segmented manually; 15 scans were re-segmented after 2 weeks by two readers. In total, 1316 radiomic features were calculated (pyRadiomics). Features with low intra-/inter-reader agreement (ICC < 0.75) were excluded. Repeated nested cross-validation was used for feature selection (Boruta algorithm), model training, and evaluation (out-of-fold predictions). Shapley additive explanation (SHAP) values were calculated to assess feature importance. Results 105 patients undergoing surgery for TET were identified. After applying exclusion criteria, 62 patients (28 female; mean age, 57±14 years; range, 22–82 years) with 34 low-risk TET (LRT; WHO types A/AB/B1) and 28 high-risk TET (HRT; WHO B2/B3/C) in early stage (49, TNM stage I-II) or advanced stage (13, TNM III-IV) were included. 14 (23%) of the patients had MG. 334 (25%) features were excluded after intra-/inter-reader analysis. Discriminatory performance of the random forest classifiers was good for histology (AUC, 87.6%; 95% CI, 76.3–94.3) and TNM stage (AUC, 83.8%; 95% CI, 66.9–93.4) but poor for the prediction of MG (AUC, 63.9%; 95% CI, 44.8–79.5). Conclusions CT-derived radiomic features may be a useful imaging biomarker for TET histology and TNM stage.
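The nested cross-validation mentioned in the Methods above can be sketched as follows. This is only an illustration under assumptions: the function names (`k_fold_indices`, `nested_cv_splits`), the fold counts and the seed are hypothetical, and the abstract does not report the study's actual configuration. The point of the scheme is that feature selection and tuning only ever see the outer-training samples, so each out-of-fold prediction comes from a model that never saw that patient.

```python
import random

def k_fold_indices(indices, k, rng):
    """Shuffle the indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for one repetition.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    outer_train. Feature selection (e.g. Boruta) and tuning would run there;
    outer_test is touched once, to produce out-of-fold predictions.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        outer_train = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for v, inner_val in enumerate(inner_folds):
            inner_train = [j for f, fold in enumerate(inner_folds) if f != v for j in fold]
            inner_splits.append((inner_train, inner_val))
        yield outer_train, outer_test, inner_splits

# 62 is the cohort size reported in the abstract; fold counts are assumptions.
splits = list(nested_cv_splits(62))
```

Repeating the procedure, as the abstract describes, would amount to calling `nested_cv_splits` with several different seeds and pooling the out-of-fold predictions.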


2021 ◽  
Author(s):  
Tyler L. Weiglein ◽  
Brian D. Strahm ◽  
Maggie M. Bowman ◽  
Adrian C. Gallo ◽  
Jeff A. Hatten ◽  
...  

Abstract Soil organic matter (SOM) is the largest terrestrial pool of organic carbon, and potential carbon-climate feedbacks involving SOM decomposition could exacerbate anthropogenic climate change. However, our understanding of the controls on SOM mineralization is still incomplete, and as such, our ability to predict carbon-climate feedbacks is limited. To improve our understanding of controls on SOM decomposition, A and upper B horizon soil samples from 26 National Ecological Observatory Network (NEON) sites spanning the conterminous U.S. were incubated for 52 weeks under conditions representing site-specific mean summer temperature and sample-specific field capacity (−33 kPa) water potential. Cumulative carbon dioxide respired was periodically measured and normalized by soil organic C content to calculate cumulative specific respiration (CSR), a metric of SOM vulnerability to mineralization. The Boruta algorithm, a feature selection algorithm, was used to select important predictors of CSR from 159 variables. A diverse suite of predictors was selected (12 for A horizons, 7 for B horizons) with predictors falling into three categories corresponding to SOM chemistry, reactive Fe and Al phases, and site moisture availability. The relationship between SOM chemistry predictors and CSR was complex, while sites that had greater concentrations of reactive Fe and Al phases or were wetter had lower CSR. Only three predictors were selected for both horizon types, suggesting dominant controls on SOM decomposition differ by horizon. Our findings contribute to the emerging consensus that a broad array of controls regulates SOM decomposition at large scales and highlight the need to consider changing controls with depth.


Blood ◽  
2021 ◽  
Vol 138 (Supplement 1) ◽  
pp. 3389-3389
Author(s):  
Ibrahim Didi ◽  
David Simoncini ◽  
Francois Vergez ◽  
Pierre-Yves Dumas ◽  
Suzanne Tavitian ◽  
...  

Abstract Introduction In the acute myeloid leukemia (AML) setting, artificial intelligence has mainly been used to facilitate diagnosis or to identify biological subcategories. In this work, we trained and compared machine learning and deep learning predictive models of outcome on the data of 3687 consecutive adult AML patients included in the DATAML registry between 2000 and 2019. We also trained a model to predict the best treatment for patients over 70 years with newly diagnosed AML. Methods Feature engineering and selection were done to keep the most relevant variables among clinical and biological characteristics at diagnosis. We worked with 54 features per patient, as well as information about the treatment received (intensive chemotherapy (IC) or azacitidine (AZA)), response and survival. We compared the performance of a gradient boosting algorithm (XGBoost) and three neural network architectures: a multilayer perceptron (MLP), a neural oblivious decision ensemble model (NODE) and a recurrent relational network (RRN). We calibrated XGBoost with a grid search algorithm, and used 5-fold cross-validation on the dataset to evaluate all the models. The Shapley Additive Explanations method (SHAP) was used to showcase the importance and influence of variables on the predictions. The Boruta algorithm was then used to extract the most important features for prediction. Results In our cohort, 3030 patients (82.2%) received IC and 657 (17.8%) AZA as first-line treatment. Median overall survival (OS) was 18 and 9 months, respectively. We first designed models for OS prediction. In the IC cohort, we achieved an accuracy of 68.5% on predicting OS at the 18-month mark, an improvement of 17.5 percentage points over a naïve predictor. 
The Boruta algorithm selected 13 variables as the most important, in decreasing order of importance: age, cytogenetic risk, WBC, LDH, platelet count, albumin, MPO, mean corpuscular volume, CD117, NPM1 mutation, AML status, multilineage dysmyelopoiesis, ASXL1 mutation (Figure 1). When training with only these 13 variables, we achieved an accuracy of 67.8%. In the AZA cohort, we achieved an accuracy of 62.1% on predicting OS at the 9-month mark, an improvement of 11.1 percentage points over a naïve predictor. Here the Boruta algorithm selected only 7 variables: blood blasts, serum ferritin, CD56, LDH, hemoglobin, CD13 and the presence of disseminated intravascular coagulation. When training with only these 7 variables, we achieved a 61.9% accuracy. We then designed models to predict the best treatment between IC and AZA for the 1032 patients older than 70 years. We achieved an 88.5% accuracy, which is 37.5 percentage points more than a naïve predictor given the distribution of the cohort: 51% having received IC and 49% having received AZA. For this model, 12 features out of 54 were selected by the Boruta algorithm as the most important: age, TP53 mutation, bone marrow blasts, AML status, disseminated intravascular coagulation, blood blasts, cytogenetic risk, IDH2 mutation, IDH1 mutation, presence of an infection at diagnosis, ASXL1 mutation and presence of leukostasis. Conclusion We show that predictive models can be trained on our database to predict, from characteristics at diagnosis, the treatment that would be chosen by an expert hematologist between IC and AZA in newly diagnosed AML, give an indication of OS with each treatment, and outperform classical statistical analysis or naïve predictors. For the task of predicting OS, the improvement over naïve predictors is maximal at the median time of OS. 
We show with the Boruta algorithm that a small number of variables can recapitulate the accuracy of neural networks, which renders this type of model of high interest for routine practice, especially with the advent of targeted therapies. Figure 1.
Disclosures Vergez: Pierre Fabre Laboratory: Research Funding; Roche: Research Funding. Dumas: BMS Celgene: Consultancy; Astellas: Consultancy; Daiichi-Sankyo: Consultancy. Tavitian: Novartis: Consultancy. Delabesse: Astellas: Consultancy; Novartis: Consultancy. Pigneux: Amgen: Consultancy; Sunesis: Consultancy, Research Funding; BMS Celgene: Consultancy, Research Funding; Roche: Consultancy, Research Funding; Novartis: Consultancy, Research Funding. Recher: Pfizer: Honoraria, Membership on an entity's Board of Directors or advisory committees; Novartis: Consultancy, Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Macrogenics: Honoraria, Membership on an entity's Board of Directors or advisory committees; MaatPharma: Research Funding; Jazz: Consultancy, Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Janssen: Honoraria; Incyte: Honoraria; Daiichi Sankyo: Consultancy, Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; BMS/Celgene: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Astellas: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Amgen: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Roche: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; Takeda: Honoraria, Membership on an entity's Board of Directors or advisory committees; Agios: Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding; AbbVie: Consultancy, Honoraria, Membership on an entity's Board of Directors or advisory committees, Research Funding. Bertoli: Astellas: Consultancy; BMS Celgene: Consultancy; AbbVie: Consultancy; Jazz Pharmaceuticals: Consultancy.
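The comparison with a naïve predictor in the abstract above can be reproduced with a few lines of arithmetic. The sketch below assumes that the naïve predictor always outputs the most frequent class, which is consistent with the reported class shares (51% IC, 49% AZA) but is not spelled out in the abstract; the function names are hypothetical.

```python
def majority_baseline_accuracy(class_shares):
    """Accuracy (in percent) of always predicting the most frequent class."""
    return max(class_shares.values())

def improvement_over_baseline(model_accuracy, class_shares):
    """Gain of a model over the majority-class baseline, in percentage points."""
    return round(model_accuracy - majority_baseline_accuracy(class_shares), 1)

# Treatment-choice task: class shares and model accuracy as reported in the abstract.
treatment_shares = {"IC": 51.0, "AZA": 49.0}
treatment_gain = improvement_over_baseline(88.5, treatment_shares)
print(treatment_gain)  # 37.5
```

Read this way, the 68.5% and 62.1% OS accuracies, with their reported gains of 17.5 and 11.1 points, both imply a baseline of about 51%, as expected when the prediction horizon is set at the median OS.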


2021 ◽  
Vol 13 (18) ◽  
pp. 3643
Author(s):  
Yuan Liu ◽  
Qimeng Yue ◽  
Qianyang Wang ◽  
Jingshan Yu ◽  
Yuexin Zheng ◽  
...  

Actual evapotranspiration (AET) is the most direct indicator of drought, so its dynamic assessment and prediction are crucial to regional water resources management. This research aims to develop a framework for regional AET evaluation and prediction based on multiple machine learning methods and multi-source remote sensing data. The framework combines the Boruta algorithm with Random Forest (RF) and Support Vector Regression (SVR) models, employing datasets from CRU, GLDAS, MODIS, GRACE (-FO), and CMIP6 that cover meteorological, vegetation, and hydrological variables. To verify the framework, it is applied to grids of South America (SA) as a case study. The results detail the trend of AET and identify the decisive role of T, P, and NDVI in AET in SA. Regarding the projection, RF performs better across the different input strategies in SA. According to the accuracy of RF and SVR on the pixel scale, the AET prediction dataset is generated by integrating the optimal results of the two models. By using multiple parameter inputs and two models to jointly obtain the optimal output, the results become more reasonable and accurate. The framework can systematically and comprehensively evaluate and forecast AET; although prediction products generated in SA cannot calibrate relevant parameters, it provides a valuable reference for regional drought warning and water allocation.
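One plausible reading of the pixel-scale integration of the two models is sketched below: for each pixel, keep the prediction of whichever model had the lower validation error there. The abstract does not specify the actual rule, so the function name, the error measure and the toy values are all assumptions made for illustration.

```python
def merge_by_pixel_error(rf_pred, svr_pred, rf_error, svr_error):
    """Per pixel, keep the prediction of the model with the lower validation error.

    All arguments are dicts keyed by pixel id. Ties go to RF (an arbitrary choice).
    """
    merged = {}
    for pixel in rf_pred:
        if rf_error[pixel] <= svr_error[pixel]:
            merged[pixel] = rf_pred[pixel]
        else:
            merged[pixel] = svr_pred[pixel]
    return merged

# Hypothetical toy values for three pixels (not data from the study).
rf_pred = {"p1": 80.0, "p2": 95.0, "p3": 60.0}
svr_pred = {"p1": 82.0, "p2": 90.0, "p3": 61.0}
rf_error = {"p1": 4.0, "p2": 9.0, "p3": 3.0}
svr_error = {"p1": 6.0, "p2": 5.0, "p3": 3.0}
merged = merge_by_pixel_error(rf_pred, svr_pred, rf_error, svr_error)
```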


2021 ◽  
Vol 9 ◽  
Author(s):  
Jintao Lei ◽  
Tiankai Sun ◽  
Yongjiang Jiang ◽  
Ping Wu ◽  
Jinjian Fu ◽  
...  

Bronchopulmonary dysplasia (BPD) is one of the most common complications in premature infants. This disease is caused by prolonged use of supplemental oxygen; it seriously affects the lung function of the child and imposes a heavy burden on the family and society. This research uses ensemble learning, combining the Boruta algorithm and the random forest algorithm, to determine the predictors of BPD in premature infants and to establish a predictive model that helps clinicians devise an optimal treatment plan. Data were collected from clinical records of 996 premature infants treated in the neonatology department of Liuzhou Maternal and Child Health Hospital in Western China. Premature infants with congenital anomalies, premature infants who died, and premature infants with incomplete data before the diagnosis of BPD were excluded from the data set. After exclusion, 648 premature infants were included in the study. The Boruta algorithm and 10-fold cross-validation were used for feature selection. Six variables were finally selected from the 26 variables, and the random forest model was established. The area under the curve (AUC) of the model was 0.929, indicating excellent predictive performance. The use of machine learning methods can help clinicians predict the disease so as to formulate the best treatment plan.
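For readers unfamiliar with the AUC figure quoted above, the following sketch computes it from its pairwise definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The function name and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = BPD, 0 = no BPD; scores are made-up model outputs.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

In the toy data three of the four positive-negative pairs are ordered correctly, hence 0.75.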


Energies ◽  
2021 ◽  
Vol 14 (10) ◽  
pp. 2779
Author(s):  
Tomasz Szul ◽  
Sylwester Tabor ◽  
Krzysztof Pancerz

Energy prediction for building heating has attracted particular attention because it is often required in the development of strategies to improve the energy efficiency of buildings, especially those undergoing thermal improvements. The complexity, dynamics, uncertainty, and nonlinearity of existing building energy systems create a great need for modeling techniques. Machine learning models are one such technique; they are based on input data consisting of features that describe the objects under study. The data describing actual buildings used to build the model may be characterized by missing values, duplicate or inconsistent features, noise, and outliers. Therefore, an extremely important aspect of developing a prediction model is the proper selection of features to simplify the prediction of energy consumption for heating. Accordingly, the goal was to evaluate the usefulness of a model describing the final energy demand rate for building heating using groups of features describing actual residential buildings undergoing thermal retrofit. The model was created by combining two algorithms: the BORUTA feature selection algorithm, which prepares the conditional variables corresponding to features, and a prediction model based on rough set theory (RST). The research was conducted on a group of 109 multi-family buildings from the end of the last century (made in large-panel technology), thermomodernized at the beginning of the 21st century. Evaluation metrics such as MAPE, MBE, CV RMSE, and R², which are adopted as statistical calibration standards by ASHRAE, were used to assess the quality of the developed prediction model. The analysis of the results indicated that the RST model built on the features selected by the BORUTA algorithm gives satisfactory prediction quality with a limited number of input variables, and thus allows energy consumption (after thermal improvement) to be predicted with high accuracy for this type of building.
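The four evaluation metrics named in the abstract above can be written out as a minimal sketch. The definitions below are common textbook forms and are assumptions here: the abstract does not give formulas, sign conventions for MBE differ between sources, and ASHRAE guidance uses normalized variants with a degrees-of-freedom correction that this sketch omits. The function names and toy values are hypothetical.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def mbe(actual, predicted):
    """Mean bias error in the units of the target (assumed sign: actual - predicted)."""
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)

def cv_rmse(actual, predicted):
    """Root mean squared error as a percentage of the mean actual value."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return 100.0 * rmse / (sum(actual) / n)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Toy heating-demand values (hypothetical, not data from the study).
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
for metric in (mape, mbe, cv_rmse, r_squared):
    print(metric.__name__, metric(actual, predicted))
```

MBE can be zero even when individual errors are not, because over- and under-predictions cancel, which is why it is usually reported alongside an error-magnitude metric such as CV RMSE.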


2021 ◽  
Author(s):  
Tung Dang ◽  
Hirohisa Kishino

Abstract Background: Random forest (RF) captures complex feature patterns that differentiate groups of samples and is rapidly being adopted in microbiome studies. However, a major challenge is the high dimensionality of microbiome datasets, which include thousands of species or molecular functions of particular biological interest. This high dimensionality significantly reduces the power of random forest approaches for identifying true differences. The widely used Boruta algorithm iteratively removes features that a statistical test shows to be less relevant than random probes. Result: We developed a massively parallel forward variable selection algorithm and coupled it with the RF classifier to maximize the predictive performance. The forward variable selection algorithm adds a new variable to the set of selected variables as long as the prespecified criterion of predictive power improves. At each step, the parameters of the random forest are optimized. We demonstrated the performance of the proposed approach, which we named RF-FVS, by analyzing two published datasets from large-scale case-control studies: (i) 16S rRNA gene amplicon data for Clostridioides difficile infection (CDI) and (ii) shotgun metagenomics data for human colorectal cancer (CRC). The RF-FVS approach further screened the variables that the Boruta algorithm retained and improved the accuracy of the random forest classifier from 81% to 99.01% for CDI and from 75.14% to 90.17% for CRC. Conclusion: Valid variable selection is essential for the analysis of high-dimensional microbiota data. By adopting the Boruta algorithm for pre-screening of the variables, our proposed RF-FVS approach significantly improves the accuracy of random forest with a minimal increase in computational burden. The procedure can be used to identify the functional profiles that differentiate samples between different conditions.
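The idea of comparing features against random probes, as described in the Background above, can be illustrated with a deliberately simplified sketch. It is not the Boruta implementation: to stay within the standard library it uses absolute Pearson correlation as a stand-in for random forest importance, it runs a fixed number of rounds instead of dropping rejected features as it goes, and it applies no multiple-testing correction. The function names, thresholds and toy data are all assumptions.

```python
import math
import random

def abs_pearson(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def upper_tail(hits, n):
    """P(X >= hits) for X ~ Binomial(n, 0.5)."""
    return sum(math.comb(n, k) for k in range(hits, n + 1)) / 2 ** n

def boruta_like(features, target, n_rounds=30, alpha=0.01, seed=0):
    """features: dict of name -> list of values. Returns dict of name -> decision."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_rounds):
        # Shadow features: shuffled copies that keep each feature's
        # distribution but break any link to the target.
        shadow_scores = []
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_scores.append(abs_pearson(shadow, target))
        threshold = max(shadow_scores)
        # A feature scores a hit when it beats the best shadow this round.
        for name, values in features.items():
            if abs_pearson(values, target) > threshold:
                hits[name] += 1
    decisions = {}
    for name, h in hits.items():
        if upper_tail(h, n_rounds) < alpha:
            decisions[name] = "confirmed"
        elif upper_tail(n_rounds - h, n_rounds) < alpha:  # lower tail, by symmetry
            decisions[name] = "rejected"
        else:
            decisions[name] = "tentative"
    return decisions

# Toy data: one feature that drives the target and one unrelated feature.
data_rng = random.Random(1)
signal = [data_rng.gauss(0, 1) for _ in range(200)]
noise = [data_rng.gauss(0, 1) for _ in range(200)]
target = [s + 0.1 * data_rng.gauss(0, 1) for s in signal]
decisions = boruta_like({"signal": signal, "noise": noise}, target)
print(decisions)
```

A feature is confirmed when it beats the best shadow feature significantly more often than chance, rejected when it does so significantly less often, and left tentative otherwise.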


2021 ◽  
Vol 2021 ◽  
pp. 1-22
Author(s):  
Jieqi Jin ◽  
Mengkai Guang ◽  
Anthony Chukwunonso Ogbuehi ◽  
Simin Li ◽  
Kai Zhang ◽  
...  

Objective. To investigate the genetic crosstalk mechanisms that link periodontitis and Alzheimer’s disease (AD). Background. Periodontitis, a common oral infectious disease, is associated with Alzheimer’s disease (AD) and considered a putative contributory factor to its progression. However, a comprehensive investigation of potential shared genetic mechanisms between these diseases has not yet been reported. Methods. Gene expression datasets related to periodontitis were downloaded from the Gene Expression Omnibus (GEO) database, and differential expression analysis was performed to identify differentially expressed genes (DEGs). Genes associated with AD were downloaded from the DisGeNET database. Overlapping genes among the DEGs in periodontitis and the AD-related genes were defined as crosstalk genes between periodontitis and AD. The Boruta algorithm was applied to perform feature selection from these crosstalk genes, and representative crosstalk genes were thus obtained. In addition, a support vector machine (SVM) model was constructed using the scikit-learn library in Python. Next, the crosstalk gene-TF network and crosstalk gene-DEP (differentially expressed pathway) network were each constructed. As a final step, shared genes among the crosstalk genes and periodontitis-related genes in DisGeNET were identified and denoted as the core crosstalk genes. Results. Four datasets (GSE23586, GSE16134, GSE10334, and GSE79705) pertaining to periodontitis were included in the analysis. A total of 48 representative crosstalk genes were identified by using the Boruta algorithm. Three TFs (FOS, MEF2C, and USF2) and several pathways (i.e., JAK-STAT, MAPK, NF-kappa B, and natural killer cell-mediated cytotoxicity) were identified as regulators of these crosstalk genes. 
Among these 48 crosstalk genes and the chronic periodontitis-related genes in DisGeNET, C4A, C4B, CXCL12, FCGR3A, IL1B, and MMP3 were shared and identified as the most pivotal candidate links between periodontitis and AD. Conclusions. Exploration of available transcriptomic datasets revealed C4A, C4B, CXCL12, FCGR3A, IL1B, and MMP3 as the top candidate molecular linkage genes between periodontitis and AD.
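The two overlap steps described in the Methods above, defining crosstalk genes and then core crosstalk genes, are set intersections and can be sketched as below. The gene lists are toy inputs: a few names are borrowed from the abstract for readability, the `GENE_*` entries are placeholders, and the Boruta selection step that sits between the two intersections in the study is skipped. The function names are hypothetical.

```python
def crosstalk_genes(periodontitis_degs, ad_genes):
    """Genes that are both periodontitis DEGs and AD-associated."""
    return set(periodontitis_degs) & set(ad_genes)

def core_crosstalk_genes(crosstalk, periodontitis_known_genes):
    """Crosstalk genes that are also curated periodontitis-related genes."""
    return set(crosstalk) & set(periodontitis_known_genes)

# Toy inputs for illustration only; these are not the study's gene lists.
degs = {"IL1B", "MMP3", "CXCL12", "GENE_A", "GENE_B"}
ad_related = {"IL1B", "MMP3", "CXCL12", "GENE_B", "GENE_C"}
periodontitis_curated = {"IL1B", "MMP3", "GENE_D"}

crosstalk = crosstalk_genes(degs, ad_related)
core = core_crosstalk_genes(crosstalk, periodontitis_curated)
print(sorted(crosstalk))
print(sorted(core))
```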


2021 ◽  
Vol 50 ◽  
pp. 100682
Author(s):  
Hamid Gholami ◽  
Aliakbar Mohammadifar ◽  
Shahram Golzari ◽  
Dimitris G. Kaskaoutis ◽  
Adrian L. Collins
