Accurate Heart Disease Prediction via Improved Stacking Integration Algorithm

Author(s):  
Hua-ping Jia ◽  
Jun-long Zhao ◽  
Jun Liu ◽  
Min Zhang ◽  
Wei-Xi Sun

The stacking algorithm has better generalization ability than many other learning algorithms and can flexibly handle different tasks. Its base level uses heterogeneous learners (learners of different types), but within the K-fold cross-validation applied to each data set, the learners are homogeneous (of the same type). Because this homogeneous stage ignores differences in prediction accuracy among the base learners, an accuracy-difference weighting method is proposed to improve the traditional stacking algorithm: in the first layer, each base learner's output on the test set is weighted by a weight computed from its measured prediction accuracy, and the weighted result is passed to the meta-learner as a feature. Heart disease is among the diseases with the highest incidence and mortality, so effective prediction can provide an important basis for assisting diagnosis and improving patient survival. In this article, the improved stacking integration algorithm is used to construct a two-layer classifier model to predict heart disease. Experimental results, verified on additional heart disease data sets, show that the algorithm effectively improves prediction accuracy and that the stacking approach has good generalization performance.
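
A minimal sketch of the accuracy-difference weighting idea, assuming scikit-learn and an already-loaded heart disease feature matrix X with binary labels y. The out-of-fold predictions of each first-layer learner are scaled by its cross-validated accuracy before being passed to the meta-learner; the particular base learners and the logistic-regression meta-learner are illustrative assumptions, not the authors' exact configuration.

    import numpy as np
    from sklearn.model_selection import cross_val_predict, cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression

    def accuracy_weighted_stacking(X, y, k=5):
        base_learners = [
            RandomForestClassifier(n_estimators=200, random_state=0),
            SVC(probability=True, random_state=0),
            KNeighborsClassifier(n_neighbors=7),
        ]
        meta_features, weights = [], []
        for clf in base_learners:
            # Out-of-fold probability of the positive class for every sample.
            oof = cross_val_predict(clf, X, y, cv=k, method="predict_proba")[:, 1]
            # Prediction accuracy of the same learner under the same K folds.
            acc = cross_val_score(clf, X, y, cv=k, scoring="accuracy").mean()
            meta_features.append(acc * oof)          # accuracy-difference weighting
            weights.append(acc)
            clf.fit(X, y)                            # refit on all data for later use
        Z = np.column_stack(meta_features)           # weighted first-layer output
        meta_learner = LogisticRegression().fit(Z, y)
        return base_learners, weights, meta_learner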

Polymers ◽  
2021 ◽  
Vol 13 (21) ◽  
pp. 3811
Author(s):  
Iosif Sorin Fazakas-Anca ◽  
Arina Modrea ◽  
Sorin Vlase

This paper proposes a new method for calculating the monomer reactivity ratios in binary copolymerization based on the terminal model. The proposed optimization method couples a numerical integration algorithm with an optimization algorithm based on k-nearest-neighbour non-parametric regression. The calculation method was tested on simulated and experimental data sets at low (<10%), medium (10–35%), and high (>40%) conversions, yielding reactivity ratios in good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös, and the error-in-variables method. The experimental data sets used in this comparative analysis are the copolymerization of 2-(N-phthalimido)ethyl acrylate with 1-vinyl-2-pyrrolidone for low conversion, of isoprene with glycidyl methacrylate for medium conversion, and of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. The possibility of estimating experimental errors from a single experimental data set of n points is also shown.
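
A minimal sketch of the numerical-integration step under the terminal model, assuming SciPy: the Skeist equation df1/dX = (f1 − F1)/(1 − X), with F1 given by the Mayo–Lewis composition equation, is integrated for candidate (r1, r2) pairs and scored against measured (conversion, residual f1) data. A plain grid search stands in for the paper's k-nearest-neighbour non-parametric regression optimizer, which is not reproduced here; function and variable names are illustrative.

    import numpy as np
    from scipy.integrate import solve_ivp

    def f1_vs_conversion(r1, r2, f1_0, X_grid):
        """Residual monomer-1 feed fraction f1 as a function of conversion X."""
        def rhs(X, state):
            f1 = state[0]
            f2 = 1.0 - f1
            # Mayo-Lewis instantaneous copolymer composition.
            F1 = (r1 * f1**2 + f1 * f2) / (r1 * f1**2 + 2.0 * f1 * f2 + r2 * f2**2)
            return [(f1 - F1) / (1.0 - X)]
        sol = solve_ivp(rhs, (0.0, X_grid[-1]), [f1_0], t_eval=X_grid, rtol=1e-8)
        return sol.y[0]

    def fit_reactivity_ratios(X_data, f1_data, f1_0, grid=None):
        """Grid search for the (r1, r2) pair minimising the squared residuals."""
        grid = np.linspace(0.01, 5.0, 60) if grid is None else grid
        best = (None, None, np.inf)
        for r1 in grid:
            for r2 in grid:
                resid = f1_vs_conversion(r1, r2, f1_0, np.asarray(X_data)) - f1_data
                sse = float(np.sum(resid**2))
                if sse < best[2]:
                    best = (r1, r2, sse)
        return best   # (r1, r2, sum of squared residuals)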


2020 ◽  
Vol 10 (8) ◽  
pp. 2725-2739 ◽  
Author(s):  
Diego Jarquin ◽  
Reka Howard ◽  
Jose Crossa ◽  
Yoseph Beyene ◽  
Manje Gowda ◽  
...  

“Sparse testing” refers to reduced multi-environment breeding trials in which not all genotypes of interest are grown in each environment. Using genomic-enabled prediction and a model embracing genotype × environment interaction (GE), the non-observed genotype-in-environment combinations can be predicted. Consequently, the overall costs can be reduced and the testing capacities can be increased. The accuracy of predicting the unobserved data depends on different factors, including (1) how many genotypes overlap between environments, (2) in how many environments each genotype is grown, and (3) which prediction method is used. In this research, we studied the predictive ability obtained when using a fixed number of plots and different sparse testing designs. The considered designs included the extreme cases of (1) no overlap of genotypes between environments and (2) complete overlap of the genotypes between environments; in the latter case, the prediction set consists entirely of genotypes that have not been tested at all. Moreover, we move gradually from one extreme to the other by considering (3) intermediate cases with varying numbers of non-overlapping (NO) and overlapping (O) genotypes, as sketched in the allocation example below. The empirical study is built upon two different maize hybrid data sets consisting of different genotypes crossed to two different testers (T1 and T2), and each data set was analyzed separately. For each set, phenotypic records on yield from three different environments are available. Three different prediction models were implemented: two main-effects models (M1 and M2) and a model (M3) including GE. The results showed that the genome-based model including GE (M3) captured more phenotypic variation than the models that did not include this component. M3 also provided higher prediction accuracy than models M1 and M2 for the different allocation scenarios. Reducing the size of the calibration sets decreased the prediction accuracy under all allocation designs, with M3 being the least affected model; however, with the genome-enabled models (i.e., M2 and M3) the predictive ability is recovered when more genotypes are tested across environments. Our results indicate that a substantial part of the testing resources can be saved when using genome-based models including GE for optimizing sparse testing designs.
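
A minimal sketch of how the O/NO allocation gradient can be constructed, assuming NumPy: a fraction of the genotypes is grown in every environment (O), while the remainder is split into blocks unique to each environment (NO). The genotype count, number of environments, and overlap fraction are illustrative and do not reproduce the designs used in the study.

    import numpy as np

    def sparse_allocation(n_genotypes, n_envs, frac_overlap, seed=0):
        rng = np.random.default_rng(seed)
        geno = rng.permutation(n_genotypes)
        n_overlap = int(round(frac_overlap * n_genotypes))
        overlap, rest = geno[:n_overlap], geno[n_overlap:]
        # Boolean allocation matrix: rows = genotypes, columns = environments.
        alloc = np.zeros((n_genotypes, n_envs), dtype=bool)
        alloc[overlap, :] = True                      # O genotypes: all environments
        for env, block in enumerate(np.array_split(rest, n_envs)):
            alloc[block, env] = True                  # NO genotypes: one environment each
        return alloc

    # Example: 300 hybrids, 3 environments, 40% overlapping genotypes.
    A = sparse_allocation(300, 3, 0.4)
    print(A.sum(axis=0))   # plots used per environment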


Geophysics ◽  
2013 ◽  
Vol 78 (2) ◽  
pp. E79-E94 ◽  
Author(s):  
John Deceuster ◽  
Olivier Kaufmann ◽  
Michel Van Camp

Electrical resistivity tomography (ERT) monitoring experiments are being conducted more often to image spatiotemporal changes in soil properties. When conducting long-term ERT monitoring, the identification of suspicious electrodes in a permanent spread is of major importance because changes in the contact properties of a single electrode may affect the quality of many measurements on each time-slice. An automated methodology was developed to detect these temporal changes in electrode contact properties, based on a Bayesian approach called “weights of evidence.” Contrasts [Formula: see text] and studentized contrasts [Formula: see text] are estimators of the influence of each electrode on the global data quality. A consolidated studentized contrast [Formula: see text] is introduced to take into account the proportion of rejected quadripoles that contain a given electrode. These estimators are computed for each time-slice using [Formula: see text]-factor (coefficient of variation of repeated measurements) threshold values, from 0 to 10%, to discriminate between selected and rejected quadripoles. An automated detection strategy is proposed to identify suspicious electrodes by comparing the [Formula: see text] to the [Formula: see text] (the maximum expected [Formula: see text] values when every electrode is good for the given data set). These [Formula: see text] are computed using Monte Carlo simulations of a hundred random draws in which the distribution of [Formula: see text]-factor values follows a Weibull cumulative distribution, with [Formula: see text] and [Formula: see text], fitted on a background data set filtered using a 5% threshold on absolute reciprocal errors. The efficiency of the methodology and its sensitivity to the selected reciprocal error threshold are assessed on synthetic and field data. Our approach is suitable for detecting suspicious electrodes and slowly changing conditions affecting the galvanic contact resistances, where classical approaches are shown to be inadequate except when the faulty electrode is disconnected. A data-weighting method is finally proposed to ensure that only good data are used in the inversion of ERT monitoring data sets.
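
A minimal, hedged sketch of the Monte Carlo detection idea, assuming NumPy: the observed proportion of rejected quadripoles containing each electrode is compared with the maximum proportion expected when every electrode is good, simulated by drawing repeatability errors from a Weibull distribution fitted to a background data set. The weights-of-evidence contrasts themselves are not reproduced, and the Weibull shape/scale values, thresholds, and names are assumptions.

    import numpy as np

    def suspicious_electrodes(quadripoles, errors, n_electrodes,
                              threshold=0.05, shape=1.5, scale=0.02,
                              n_draws=100, seed=0):
        quadripoles = np.asarray(quadripoles)            # (n_meas, 4) electrode indices
        rejected = np.asarray(errors) > threshold        # observed rejections
        contains = np.zeros((len(quadripoles), n_electrodes), dtype=bool)
        for e in range(n_electrodes):
            contains[:, e] = (quadripoles == e).any(axis=1)
        counts = np.maximum(contains.sum(axis=0), 1)
        obs = (rejected[:, None] & contains).sum(axis=0) / counts

        rng = np.random.default_rng(seed)
        max_expected = []
        for _ in range(n_draws):                         # "all electrodes good" null model
            sim_rejected = rng.weibull(shape, len(quadripoles)) * scale > threshold
            sim = (sim_rejected[:, None] & contains).sum(axis=0) / counts
            max_expected.append(sim.max())
        return np.where(obs > np.percentile(max_expected, 95))[0]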


2018 ◽  
Vol 154 (2) ◽  
pp. 149-155
Author(s):  
Michael Archer

1. Yearly records of worker Vespula germanica (Fabricius) taken in suction traps at Silwood Park (28 years) and at Rothamsted Research (39 years) are examined. 2. Using the autocorrelation function (ACF), a significant negative 1-year lag followed by a lesser, non-significant positive 2-year lag was found in all, or parts of, each data set, indicating an underlying population dynamic of a 2-year cycle with a damped waveform. 3. The minimum number of years before the 2-year cycle with a damped waveform became apparent varied between 17 and 26, and the cycle was not found in some data sets. 4. Ecological factors delaying or preventing the occurrence of the 2-year cycle are considered.
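
A minimal sketch of the ACF diagnostic described in point 2, assuming statsmodels and NumPy; the yearly count series here is synthetic (a damped two-year oscillation plus noise) purely to illustrate the calculation.

    import numpy as np
    from statsmodels.tsa.stattools import acf

    rng = np.random.default_rng(1)
    years = 39
    counts = (100 + 30 * np.cos(np.pi * np.arange(years)) * 0.9 ** np.arange(years)
              + rng.normal(0, 10, years))          # damped 2-year cycle + noise

    r, conf = acf(counts, nlags=5, alpha=0.05, fft=False)
    for lag in (1, 2):
        lo, hi = conf[lag]                          # 95% interval centred on r[lag]
        significant = not (lo <= 0.0 <= hi)
        print(f"lag {lag}: r = {r[lag]:+.2f}, significant = {significant}")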


2018 ◽  
Vol 21 (2) ◽  
pp. 117-124 ◽  
Author(s):  
Bakhtyar Sepehri ◽  
Nematollah Omidikia ◽  
Mohsen Kompany-Zareh ◽  
Raouf Ghavami

Aims & Scope: In this research, 8 variable selection approaches were used to investigate the effect of variable selection on the predictive power and stability of CoMFA models. Materials & Methods: Three data sets, comprising 36 EPAC antagonists, 79 CD38 inhibitors, and 57 ATAD2 bromodomain inhibitors, were modelled by CoMFA. First, for each of the three data sets, a CoMFA model with all CoMFA descriptors was created; then, by applying each variable selection method, a new CoMFA model was developed, so nine CoMFA models were built per data set. The results show that noisy and uninformative variables affect CoMFA results. Based on the created models, applying five variable selection approaches, namely FFD, SRD-FFD, IVE-PLS, SRD-UVE-PLS, and SPA-jackknife, increases the predictive power and stability of CoMFA models significantly. Results & Conclusion: Among them, SPA-jackknife removes most of the variables, while FFD retains most of them. FFD and IVE-PLS are time-consuming, whereas SRD-FFD and SRD-UVE-PLS run in a few seconds. Applying FFD, SRD-FFD, IVE-PLS, and SRD-UVE-PLS also preserves CoMFA contour map information for both fields.
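
A minimal sketch of a jackknife-style stability screen for PLS descriptor variables, assuming scikit-learn and an already-computed CoMFA descriptor matrix X with activities y. This is a generic leave-one-out jackknife on the PLS regression coefficients; it does not reproduce the specific SPA-jackknife or the other selection procedures compared in the paper, and the t-statistic cutoff is an assumption.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import LeaveOneOut

    def jackknife_stable_variables(X, y, n_components=3, t_crit=2.0):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        coefs = []
        for train_idx, _ in LeaveOneOut().split(X):
            pls = PLSRegression(n_components=n_components)
            pls.fit(X[train_idx], y[train_idx])
            coefs.append(np.ravel(pls.coef_))         # one coefficient per variable
        coefs = np.asarray(coefs)
        mean = coefs.mean(axis=0)
        std = coefs.std(axis=0, ddof=1)
        t = np.abs(mean) / np.where(std > 0.0, std, np.inf)
        return np.where(t > t_crit)[0]                # indices of stable variables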


Author(s):  
Kyungkoo Jun

Background & Objective: This paper proposes a Fourier-transform-inspired method to classify human activities from time-series sensor data. Methods: Our method begins by decomposing a 1D input signal into 2D patterns, which is motivated by the Fourier conversion. The decomposition is aided by a Long Short-Term Memory (LSTM) network, which captures the temporal dependency of the signal and produces encoded sequences. The sequences, once arranged into a 2D array, can represent fingerprints of the signals. The benefit of this transformation is that we can exploit recent advances in deep learning models for image classification, such as the Convolutional Neural Network (CNN). Results: The proposed model is therefore a combination of LSTM and CNN. We evaluate the model on two data sets. For the first data set, which is more standardized than the other, our model outperforms previous works or is at least comparable. For the second data set, we devise schemes to generate training and testing data by varying the window size, the sliding size, and the labeling scheme. Conclusion: The evaluation results show that the accuracy exceeds 95% in some cases. We also analyze the effect of these parameters on performance.
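
A minimal sketch of the LSTM-then-CNN idea, assuming TensorFlow/Keras: an LSTM encodes a 1D sensor window into a sequence of feature vectors, which is reshaped into a 2D "fingerprint" and classified by a small CNN. The layer sizes, window length, channel count, and class count are illustrative assumptions rather than the paper's configuration.

    from tensorflow.keras import layers, models

    def build_lstm_cnn(window_len=128, n_channels=3, n_classes=6, units=64):
        inputs = layers.Input(shape=(window_len, n_channels))
        seq = layers.LSTM(units, return_sequences=True)(inputs)   # (window_len, units)
        img = layers.Reshape((window_len, units, 1))(seq)         # 2D fingerprint
        x = layers.Conv2D(16, (3, 3), activation="relu")(img)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Conv2D(32, (3, 3), activation="relu")(x)
        x = layers.GlobalAveragePooling2D()(x)
        outputs = layers.Dense(n_classes, activation="softmax")(x)
        model = models.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    model = build_lstm_cnn()
    model.summary()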


2019 ◽  
Vol 73 (8) ◽  
pp. 893-901
Author(s):  
Sinead J. Barton ◽  
Bryan M. Hennelly

Cosmic ray artifacts may be present in all photo-electric readout systems. In spectroscopy, they present as random unidirectional sharp spikes that distort spectra and may affect post-processing, possibly affecting the results of multivariate statistical classification. A number of methods have previously been proposed to remove cosmic ray artifacts from spectra, but the goal of removing the artifacts while making no other change to the underlying spectrum is challenging. One of the most successful and commonly applied methods for the removal of cosmic ray artifacts involves the capture of two sequential spectra that are compared in order to identify spikes. The disadvantage of this approach is that at least two recordings are necessary, which may be problematic for dynamically changing spectra, and which can reduce the signal-to-noise (S/N) ratio when compared with a single recording of equivalent duration due to the inclusion of two instances of read noise. In this paper, a cosmic ray artifact removal algorithm is proposed that works in a similar way to the double acquisition method but requires only a single capture, so long as a data set of similar spectra is available. The method employs normalized covariance in order to identify a similar spectrum in the data set, from which a direct comparison reveals the presence of cosmic ray artifacts, which are then replaced with the corresponding values from the matching spectrum. The advantage of the proposed method over the double acquisition method is investigated in the context of the S/N ratio and is applied to various data sets of Raman spectra recorded from biological cells.
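
A minimal sketch of the single-capture correction described above, assuming NumPy: the most similar spectrum in a reference set is found via normalized covariance (correlation), and points where the target deviates sharply upward from that match are replaced by the matching values. The spike threshold and the assumption that the target is not itself in the reference set are illustrative.

    import numpy as np

    def remove_cosmic_rays(spectrum, reference_set, n_sigma=5.0):
        refs = np.asarray(reference_set, dtype=float)          # (n_spectra, n_points)
        s = np.asarray(spectrum, dtype=float)
        # Normalized covariance (Pearson correlation) against every reference spectrum.
        s0 = (s - s.mean()) / s.std()
        r0 = (refs - refs.mean(axis=1, keepdims=True)) / refs.std(axis=1, keepdims=True)
        corr = r0 @ s0 / len(s)
        match = refs[np.argmax(corr)]                          # most similar spectrum
        diff = s - match
        spikes = diff > diff.mean() + n_sigma * diff.std()     # unidirectional spikes
        cleaned = s.copy()
        cleaned[spikes] = match[spikes]                        # replace with matching values
        return cleaned, spikes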


2013 ◽  
Vol 756-759 ◽  
pp. 3652-3658
Author(s):  
You Li Lu ◽  
Jun Luo

In the context of kernel methods, this paper puts forward two improved algorithms, called R-SVM and I-SVDD, to cope with imbalanced data sets in closed systems. R-SVM uses the K-means algorithm to cluster the sample space, while I-SVDD improves the performance of the original SVDD through imbalanced sample training. Experiments on two system-call data sets show that both algorithms are more effective and that R-SVM has lower complexity.
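
A minimal sketch of the K-means-plus-SVM idea, assuming scikit-learn: the majority class is condensed to its K-means cluster centres before an SVM is trained, a common way of handling imbalanced data. This illustrates the general approach only and is not the authors' exact R-SVM formulation; the cluster count and kernel are assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    def kmeans_undersampled_svm(X, y, majority_label=0, n_clusters=50):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        X_maj = X[y == majority_label]
        X_min, y_min = X[y != majority_label], y[y != majority_label]
        # Replace the majority class by its K-means cluster centres.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_maj)
        X_bal = np.vstack([km.cluster_centers_, X_min])
        y_bal = np.concatenate([np.full(n_clusters, majority_label), y_min])
        return SVC(kernel="rbf").fit(X_bal, y_bal)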


2019 ◽  
Vol 8 (8) ◽  
pp. 1167 ◽  
Author(s):  
Maria Fe Muñoz-Moreno ◽  
Pablo Ryan ◽  
Alejandro Alvaro-Meca ◽  
Jorge Valencia ◽  
Eduardo Tamayo ◽  
...  

Background: People living with human immunodeficiency virus (HIV) (PLWH) form a vulnerable population for the onset of infective endocarditis (IE). We aimed to analyze the epidemiological trend of IE, as well as its microbiological characteristics, in PLWH during the combined antiretroviral therapy era in Spain. Methods: We performed a retrospective study (1997–2014) in PLWH with data obtained from the Spanish Minimum Basic Data Set. We selected 1800 hospital admissions with an IE diagnosis, which corresponded to 1439 patients. Results: We found significant downward trends in the periods 1997–1999 and 2008–2014 in the rate of hospital admissions with an IE diagnosis (from 21.8 to 3.8 events per 10,000 patients/year; p < 0.001), IE incidence (from 18.2 to 2.9 events per 10,000 patients/year; p < 0.001), and IE mortality (from 23.9 to 5.5 deaths per 100,000 patient-years; p < 0.001). The most frequent microorganisms involved were staphylococci (50%; 42.7% Staphylococcus aureus and 7.3% coagulase-negative staphylococci (CoNS)), followed by streptococci (9.3%), Gram-negative bacilli (8.3%), enterococci (3%), and fungi (1.4%). During the study period, we found a downward trend in the rate of CoNS (p < 0.001) and upward trends in streptococci (p = 0.001), Gram-negative bacilli (p < 0.001), enterococci (p = 0.003), and fungi (p < 0.001) related to IE, mainly in 2008–2014. The rate of community-acquired IE showed a significant upward trend (p = 0.001), while the rate of health-care-associated IE showed a significant downward trend (p < 0.001). Conclusions: The rates of hospital admissions, incidence, and mortality related to an IE diagnosis in PLWH in Spain decreased from 1997 to 2014, while other changes in clinical characteristics, mode of acquisition, and pathogens occurred over this time.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Yahya Albalawi ◽  
Jim Buckley ◽  
Nikola S. Nikolov

This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the traditional machine learning classifiers KNN, SVM, Multinomial NB, and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared with an F1 score of 90.8% but a lower accuracy of 70.89% achieved by Mazajak CBOW for the same architecture. Our results also show that the best traditional classifier we trained performs comparably to the deep learning methods on the first data set, but significantly worse on the second.
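
A minimal sketch of a traditional baseline for the health-related tweet classification task, assuming scikit-learn: TF-IDF features feeding a Logistic Regression classifier, one of the classifier families evaluated above. The pipeline, parameter choices, and the placeholder pre-processing hook are illustrative assumptions and do not reproduce the paper's exact configurations or the Mazajak embeddings.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def build_baseline(preprocess=None):
        """TF-IDF features feeding a regularised Logistic Regression classifier."""
        return Pipeline([
            ("tfidf", TfidfVectorizer(preprocessor=preprocess, ngram_range=(1, 2))),
            ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
        ])

    # Usage sketch (placeholder data and a user-supplied Arabic normaliser):
    # model = build_baseline(preprocess=my_arabic_normaliser)
    # model.fit(train_tweets, train_labels)       # lists of strings / 0-1 labels
    # predictions = model.predict(test_tweets)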

