Accuracy in the prediction of disease epidemics when ensembling simple but highly correlated models

2021 ◽  
Vol 17 (3) ◽  
pp. e1008831
Author(s):  
Denis A. Shah ◽  
Erick D. De Wolf ◽  
Pierce A. Paul ◽  
Laurence V. Madden

Ensembling combines the predictions made by individual component base models with the goal of achieving a predictive accuracy that is better than that of any one of the constituent member models. Diversity among the base models in terms of predictions is a crucial criterion in ensembling. However, there are practical instances when the available base models produce highly correlated predictions, because they may have been developed within the same research group or may have been built from the same underlying algorithm. We investigated, via a case study on Fusarium head blight (FHB) on wheat in the U.S., whether ensembles of simple yet highly correlated models for predicting the risk of FHB epidemics, all generated from logistic regression, provided any benefit to predictive performance, despite relatively low levels of base model diversity. Three ensembling methods were explored: soft voting, weighted averaging of smaller subsets of the base models, and penalized regression as a stacking algorithm. Soft voting and weighted model averages were generally better at classification than the base models, though not universally so. The performances of stacked regressions were superior to those of the other two ensembling methods we analyzed in this study. Ensembling simple yet correlated models is computationally feasible and is therefore worth pursuing for models of epidemic risk.
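As a rough illustration of two of the ensembling methods named above, the sketch below averages predicted probabilities from several base models, first with equal weights (soft voting) and then with unequal weights. The probabilities, the weights, and the 0.5 decision threshold are hypothetical values chosen for the example, not figures from the study, and stacking with penalized regression is not covered here.

```python
from statistics import fmean


def soft_vote(prob_rows):
    """Soft voting: unweighted mean of the base-model probabilities per observation."""
    return [fmean(row) for row in prob_rows]


def weighted_average(prob_rows, weights):
    """Weighted mean of the base-model probabilities; weights need not sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [sum(w * p for w, p in zip(weights, row)) / total for row in prob_rows]


def classify(probs, threshold=0.5):
    """Turn ensemble probabilities into 0/1 epidemic calls (threshold is an assumption)."""
    return [int(p >= threshold) for p in probs]


# Hypothetical data: each row is one observation, each column one base model.
base_probs = [
    [0.80, 0.75, 0.70],
    [0.30, 0.35, 0.55],
]

voted = soft_vote(base_probs)
weighted = weighted_average(base_probs, weights=[3, 2, 1])  # hypothetical weights
calls = classify(voted)
```

Because highly correlated base models tend to err on the same observations, averaging them may change few classifications, which is consistent with the abstract's report that the gains were not universal.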

2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Nicoletta Dessì ◽  
Emanuele Pascariello ◽  
Barbara Pes

Feature selection has become an essential step in biomarker discovery from high-dimensional genomics data. It is recognized that different feature selection techniques may result in different sets of biomarkers, that is, different groups of genes highly correlated to a given pathological condition, but few direct comparisons exist which quantify these differences in a systematic way. In this paper, we propose a general methodology for comparing the outcomes of different selection techniques in the context of biomarker discovery. The comparison is carried out along two dimensions: (i) measuring the similarity/dissimilarity of selected gene sets; (ii) evaluating the implications of these differences in terms of both predictive performance and stability of selected gene sets. As a case study, we considered three benchmarks deriving from DNA microarray experiments and conducted a comparative analysis among eight selection methods, representative of different classes of feature selection techniques. Our results show that the proposed approach can provide useful insight into the pattern of agreement of biomarker discovery techniques.
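The abstract does not say which similarity measure was used, so the sketch below assumes the Jaccard index, one common way to quantify overlap between two selected gene sets. The method names and gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty selections are treated as identical
    return len(a & b) / len(a | b)


def pairwise_similarity(selections):
    """Jaccard index for every pair of selection methods, keyed by sorted method names."""
    return {
        (m1, m2): jaccard(selections[m1], selections[m2])
        for m1, m2 in combinations(sorted(selections), 2)
    }


# Hypothetical gene sets chosen by three hypothetical selection methods.
selections = {
    "method_a": {"G1", "G2", "G3"},
    "method_b": {"G2", "G3", "G4"},
    "method_c": {"G7", "G8", "G9"},
}

similarity = pairwise_similarity(selections)
```

A value of 1 means two methods picked exactly the same genes, and 0 means they share none.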


2021 ◽  
Author(s):  
Ilkin Bayramli ◽  
Victor Castro ◽  
Yuval Barak-Corren ◽  
Emily M Madsen ◽  
Matthew K Nock ◽  
...  

Clinical risk prediction models powered by electronic health records (EHRs) are becoming increasingly widespread in clinical practice. With suicide-related mortality rates rising in recent years, it is becoming increasingly urgent to understand, predict, and prevent suicidal behavior. Here, we compare the predictive value of structured and unstructured EHR data for predicting suicide risk. We find that Naive Bayes Classifier (NBC) and Random Forest (RF) models trained on structured EHR data perform better than those based on unstructured EHR data. An NBC model trained on both structured and unstructured data yields similar performance (AUC = 0.743) to an NBC model trained on structured data alone (0.742, p = 0.668), while an RF model trained on both data types yields significantly better results (AUC = 0.903) than an RF model trained on structured data alone (0.887, p<0.001), likely due to the RF model's ability to capture interactions between the two data types. To investigate these interactions, we propose and implement a general framework for identifying specific structured-unstructured feature pairs whose interactions differ between case and non-case cohorts, and thus have the potential to improve predictive performance and increase understanding of clinical risk. We find that such feature pairs tend to capture heterogeneous pairs of general concepts, rather than homogeneous pairs of specific concepts. These findings and this framework can be used to improve current and future EHR-based clinical modeling efforts.


Processes ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 65
Author(s):  
Zihao Wang ◽  
Zhen Song ◽  
Teng Zhou

In addition to proper physicochemical properties, low toxicity is also desirable when seeking suitable ionic liquids (ILs) for specific applications. In this context, machine learning (ML) models were developed to predict IL toxicity in the leukemia rat cell line (IPC-81) based on an extended experimental dataset. Following a systematic procedure including framework construction, hyper-parameter optimization, model training, and evaluation, the feedforward neural network (FNN) and support vector machine (SVM) algorithms were adopted to predict the toxicity of ILs directly from their molecular structures. Based on the ML structures optimized by five-fold cross validation, two ML models were established and evaluated using IL structural descriptors as inputs. Both models exhibited high predictive accuracy, with the SVM model slightly better than the FNN model. For the SVM model, the determination coefficients were 0.9289 and 0.9202 for the training and test sets, respectively. The satisfactory predictive performance and generalization ability make our models useful for the computer-aided molecular design (CAMD) of environmentally friendly ILs.
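For readers unfamiliar with the two evaluation ingredients mentioned above, the sketch below shows a plain five-fold split and the usual coefficient of determination, R² = 1 - SS_res / SS_tot. It is a generic illustration under stated assumptions: the contiguous, unshuffled folds and the helper names are choices made for the example, and the abstract does not describe how the authors partitioned their data.

```python
def kfold_indices(n, k=5):
    """Yield (train, test) index lists for k contiguous folds over n samples."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    # Spread the remainder over the first n % k folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


folds = list(kfold_indices(10, k=5))
```

Comparing R² on the training and test sets, as the abstract does, gives a quick check on generalization: a large gap between the two would suggest overfitting.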


2021 ◽  
Author(s):  
Adrian Soto-Mota ◽  
Braulio A. Marfil-Garza ◽  
Santiago Castiello de Obeso ◽  
Erick Martínez ◽  
Daniel Alberto Carrillo-Vázquez ◽  
...  

Abstract Background: Most COVID-19 mortality scores were developed in the early months of the pandemic, and evidence-based interventions that are now available have helped reduce the disease's lethality. It has not been evaluated whether the original predictive performance of these scores still holds, nor has it been compared against Clinical Gestalt predictions. We tested the current predictive accuracy of six COVID-19 scores and compared it with Clinical Gestalt predictions. Methods: 200 COVID-19 patients were enrolled in a tertiary hospital in Mexico City between September and December 2020. Clinical Gestalt predictions of death (as a percentage) and LOW-HARM, qSOFA, MSL-COVID-19, NUTRI-CoV and NEWS2 were obtained at admission. We calculated the AUC of each score and compared it against Clinical Gestalt predictions and against their respective originally reported value. Results: 106 men and 60 women aged 56 ± 9 and with confirmed COVID-19 were included in the analysis. The observed AUC of all scores was significantly lower than originally reported: LOW-HARM 0.96 (0.94-0.98) vs 0.76 (0.69-0.84), qSOFA 0.74 (0.65-0.81) vs 0.61 (0.53-0.69), MSL-COVID-19 0.72 (0.69-0.75) vs 0.64 (0.55-0.73), NUTRI-CoV 0.79 (0.76-0.82) vs 0.60 (0.51-0.69), NEWS2 0.84 (0.79-0.90) vs 0.65 (0.56-0.75), Neutrophil-Lymphocyte ratio 0.74 (0.62-0.85) vs 0.65 (0.57-0.73). Clinical Gestalt predictions were non-inferior to mortality scores (AUC=0.68 (0.59-0.77)). Adjusting the LOW-HARM score with locally derived likelihood ratios did not improve its performance. However, some scores performed better than Clinical Gestalt predictions when the clinician's confidence of prediction was <80%. Conclusion: No score was significantly better than Clinical Gestalt predictions. Despite its subjective nature, Clinical Gestalt has relevant advantages for predicting COVID-19 clinical outcomes.
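The AUC values compared above can be read as the probability that a randomly chosen patient who died received a higher score than a randomly chosen survivor. The sketch below computes AUC directly from that definition, counting ties as one half. It is an illustration only: the function name and the toy scores are hypothetical, and the abstract does not describe the authors' software or how the confidence intervals were obtained.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: predicted death probabilities and observed outcomes (1 = died).
toy_scores = [0.9, 0.2, 0.8, 0.1]
toy_labels = [1, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

This pairwise form is quadratic in the number of patients, which is fine for a cohort of a few hundred; rank-based formulas give the same value more efficiently on large datasets.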


2005 ◽  
Vol 51 (12) ◽  
pp. 325-329 ◽  
Author(s):  
X. Wang ◽  
X. Bai ◽  
J. Qiu ◽  
B. Wang

The performance of a pond–constructed wetland system in the treatment of municipal wastewater in Kiaochow city was studied, and a comparison with an oxidation pond system was conducted. In the post-constructed wetland, the removals of COD, TN and TP are 24%, 58.5% and 24.8%, respectively. The treated effluent from the constructed wetland can meet the Chinese National Agricultural and Irrigation Standard. The comparison between the pond–constructed wetland system and the oxidation pond system shows that total nitrogen removal in a constructed wetland is better than that in an oxidation pond, whereas TP removal is inferior. A possible reason is the low dissolved oxygen concentration in the wetland. Constructed wetlands can effectively restrain the growth of algae and can produce clear ecological and economic benefits.


2021 ◽  
pp. 1-22
Author(s):  
Lei Jinyu ◽  
Liu Lei ◽  
Chu Xiumin ◽  
He Wei ◽  
Liu Xinglong ◽  
...  

Abstract The ship safety domain plays a significant role in collision risk assessment. However, few studies take the practical considerations of implementing this method in the vicinity of bridge-waters into account. Therefore, historical automatic identification system data is utilised to construct and analyse ship domains considering ship–ship and ship–bridge collisions. A method for determining the closest boundary is proposed, and the boundary of the ship domain is fitted by the least squares method. The ship domains near bridge-waters are constructed as ellipse models, the characteristics of which are discussed. Novel fuzzy quaternion ship domain models are established respectively for inland ships and bridge piers, which would assist in the construction of a risk quantification model and the calculation of a grid ship collision index. A case study is carried out on the multi-bridge waterway of the Yangtze River in Wuhan, China. The results show that the size of the ship domain is highly correlated with the ship's speed and length, and analysis of collision risk can reflect the real situation near bridge-waters, which is helpful to demonstrate the application of the ship domain in quantifying the collision risk and to characterise the collision risk distribution near bridge-waters.


2021 ◽  
Vol 18 (2) ◽  
pp. 172988142199958
Author(s):  
Larkin Folsom ◽  
Masahiro Ono ◽  
Kyohei Otsu ◽  
Hyoshin Park

Mission-critical exploration of uncertain environments requires reliable and robust mechanisms for achieving information gain. Typical measures of information gain such as Shannon entropy and KL divergence are unable to distinguish between different bimodal probability distributions or introduce bias toward one mode of a bimodal probability distribution. The use of a standard deviation (SD) metric reduces bias while retaining the ability to distinguish between higher and lower risk distributions. Areas of high SD can be safely explored through observation with an autonomous Mars Helicopter, allowing safer and faster path plans for ground-based rovers. First, this study presents a single-agent information-theoretic utility-based path planning method for a highly correlated uncertain environment. Then, an information-theoretic two-stage multiagent rapidly exploring random tree framework is presented, which guides the Mars Helicopter through regions of high SD to reduce uncertainty for the rover. In a Monte Carlo simulation, we compare our information-theoretic framework with a rover-only approach and a naive approach, in which the helicopter scouts ahead of the rover along its planned path. Finally, the model is demonstrated in a case study on the Jezero region of Mars. Results show that the information-theoretic helicopter improves the travel time for the rover on average when compared with the rover alone or with the helicopter scouting ahead along the rover’s initially planned route.
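The claim that entropy cannot separate some bimodal distributions that SD can may be easier to see with a small worked example. Shannon entropy depends only on the probabilities, not on the values they are attached to, so two two-mode distributions with the same mode weights have the same entropy however far apart the modes lie. The traversal-cost values below are hypothetical, and the sketch illustrates this general property rather than the paper's actual utility function.

```python
from math import log2, sqrt


def shannon_entropy(probs):
    """Shannon entropy in bits; it ignores the outcome values entirely."""
    return -sum(p * log2(p) for p in probs if p > 0)


def std_dev(values, probs):
    """Standard deviation of a discrete distribution over the given values."""
    mean = sum(v * p for v, p in zip(values, probs))
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))
    return sqrt(variance)


# Two hypothetical bimodal distributions over traversal cost, same mode weights.
probs = [0.5, 0.5]
narrow_modes = [4.0, 6.0]  # modes close together: lower risk
wide_modes = [1.0, 9.0]    # modes far apart: higher risk

entropy_both = shannon_entropy(probs)    # 1.0 bit for either distribution
sd_narrow = std_dev(narrow_modes, probs)  # 1.0
sd_wide = std_dev(wide_modes, probs)      # 4.0
```

Both distributions carry exactly one bit of entropy, yet the second has four times the SD, which is the kind of difference in risk the abstract says an SD metric retains.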


Author(s):  
Jacqueline Peng ◽  
Mengge Zhao ◽  
James Havrilla ◽  
Cong Liu ◽  
Chunhua Weng ◽  
...  

Abstract Background Natural language processing (NLP) tools can facilitate the extraction of biomedical concepts from unstructured free texts, such as research articles or clinical notes. The NLP software tools CLAMP, cTAKES, and MetaMap are among the most widely used tools to extract biomedical concept entities. However, their performance in extracting disease-specific terminology from literature has not been compared extensively, especially for complex neuropsychiatric disorders with a diverse set of phenotypic and clinical manifestations. Methods We comparatively evaluated these NLP tools using autism spectrum disorder (ASD) as a case study. We collected 827 ASD-related terms based on previous literature as the benchmark list for performance evaluation. Then, we applied CLAMP, cTAKES, and MetaMap on 544 full-text articles and 20,408 abstracts from PubMed to extract ASD-related terms. We evaluated the predictive performance using precision, recall, and F1 score. Results We found that CLAMP has the best performance in terms of F1 score followed by cTAKES and then MetaMap. Our results show that CLAMP has much higher precision than cTAKES and MetaMap, while cTAKES and MetaMap have higher recall than CLAMP. Conclusion The analysis protocols used in this study can be applied to other neuropsychiatric or neurodevelopmental disorders that lack well-defined terminology sets to describe their phenotypic presentations.
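The three evaluation metrics named above can be computed from two sets of terms: those a tool extracted and those in the benchmark list. The sketch below assumes exact matching after lowercasing and trimming whitespace, which is a simplification made for the example; the abstract does not describe how extracted terms were matched to the benchmark. The terms themselves are hypothetical.

```python
def precision_recall_f1(extracted, benchmark):
    """Precision, recall and F1 of extracted terms against a benchmark term list."""
    extracted = {t.strip().lower() for t in extracted}
    benchmark = {t.strip().lower() for t in benchmark}
    true_positives = len(extracted & benchmark)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(benchmark) if benchmark else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical terms: four extracted by a tool, three in the benchmark list.
tool_terms = ["Echolalia", "stimming", "headache", "fever"]
benchmark_terms = ["echolalia", "stimming", "joint attention deficit"]

precision, recall, f1 = precision_recall_f1(tool_terms, benchmark_terms)
```

A tool that extracts few but mostly correct terms scores high on precision and low on recall, and F1 balances the two, which matches the trade-off the abstract reports between CLAMP and the other two tools.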

