An Ensemble Learning Based Framework for Traditional Chinese Medicine Data Analysis with ICD-10 Labels

Objective. This study aims to establish a model for analyzing the clinical experience of veteran TCM doctors. We propose an ensemble learning based framework that analyzes clinical records with ICD-10 label information for effective diagnosis and acupoint recommendation. Methods. A set of base learners composed of decision trees (DT) and support vector machines (SVM) is trained by bootstrapping the training dataset. The base learners are sorted by accuracy and diversity through the nondominated sort (NDS) algorithm and combined through a deep ensemble learning strategy. Results. We evaluate the proposed method against two currently successful methods on a clinical diagnosis dataset with manually labeled ICD-10 information. ICD-10 label annotation and acupoint recommendation are evaluated for the three methods. The proposed method achieves an accuracy rate of 88.2% ± 2.8% measured by zero-one loss for the first evaluation session and 79.6% ± 3.6% measured by Hamming loss, which are superior to the other two methods. Conclusion. The proposed ensemble model can effectively model the implied knowledge and experience in historic clinical data records. The computational cost of training a set of base learners is relatively low.
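
The two evaluation measures named in the abstract differ in how strictly they score a multi-label prediction. The sketch below illustrates that difference on made-up label vectors. The data and function names are illustrative assumptions, not taken from the paper, and it assumes the reported accuracy is one minus the corresponding loss.

```python
def zero_one_loss(y_true, y_pred):
    """Fraction of samples whose whole label vector is not an exact match."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if list(t) != list(p))
    return wrong / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    total = sum(len(t) for t in y_true)
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    return wrong / total


# Hypothetical toy data: 4 records, 3 binary labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]

exact_match_accuracy = 1 - zero_one_loss(y_true, y_pred)  # one record of four is wrong
per_label_accuracy = 1 - hamming_loss(y_true, y_pred)     # one label slot of twelve is wrong
```

Zero-one loss penalizes a record as soon as any of its labels is wrong, whereas Hamming loss counts each label slot separately, so the two accuracies are not directly comparable.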

Ensemble Learning Approach with LASSO for Predicting Catalytic Reaction Rates

Synlett ◽

10.1055/a-1304-4878 ◽

2020 ◽

Author(s):

Akira Yada ◽

Kazuhiko Sato ◽

Tarojiro Matsumura ◽

Yasunobu Ando ◽

Kenji Nagata ◽

...

Keyword(s):

Ensemble Learning ◽

Reaction Rates ◽

Initial Reaction Rate ◽

Training Dataset ◽

Initial Reaction ◽

Learning Approach ◽

Learning Framework ◽

Machine Learning Approach ◽

Reasonable Prediction ◽

Epoxidation Of Alkenes

Abstract: The prediction of the initial reaction rate in the tungsten-catalyzed epoxidation of alkenes by using a machine learning approach is demonstrated. The ensemble learning framework used in this study consists of random sampling with replacement from the training dataset, the construction of several predictive models (weak learners), and the combination of their outputs. This approach enables us to obtain a reasonable prediction model that avoids the problem of overfitting, even when analyzing a small dataset.
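
The three steps described above (resampling with replacement, fitting several weak learners, combining their outputs) can be sketched as follows. This is a minimal illustration under stated assumptions: the weak learner is a plain one-variable least-squares line rather than the LASSO models of the paper, the outputs are combined by simple averaging, and the data are invented.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b (stand-in weak learner)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:  # degenerate resample (all x equal): fall back to the mean
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def bagged_predict(xs, ys, x_new, n_models=50, seed=0):
    """Average the predictions of models fitted on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a * x_new + b)
    return sum(preds) / len(preds)


# Hypothetical small dataset, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
prediction = bagged_predict(xs, ys, x_new=7.0)
```

Averaging over resampled fits reduces the variance of any single fit, which is one reason such ensembles are attractive for small datasets.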

An Efficient Ensemble Learning Method for Gene Microarray Classification

BioMed Research International ◽

10.1155/2013/478410 ◽

2013 ◽

Vol 2013 ◽

pp. 1-10 ◽

Cited By ~ 9

Author(s):

Alireza Osareh ◽

Bita Shadgar

Keyword(s):

Feature Selection ◽

Ensemble Learning ◽

Feature Selection Method ◽

Support Vector ◽

Gene Microarray ◽

Ensemble Classifiers ◽

Classifier Ensembles ◽

Rotation Forest ◽

Ensemble Techniques ◽

Effective Diagnosis

Gene microarray analysis and classification have proven effective for the diagnosis of diseases and cancers. However, it has also been shown that basic classification techniques have intrinsic drawbacks in achieving accurate gene classification and cancer diagnosis. Classifier ensembles, on the other hand, have received increasing attention in various applications. Here, we address the gene classification problem using the RotBoost ensemble methodology. This method combines the Rotation Forest and AdaBoost techniques, which preserves both desirable features of an ensemble architecture, that is, accuracy and diversity. To select a concise subset of informative genes, 5 different feature selection algorithms are considered. To assess the efficiency of RotBoost, other nonensemble and ensemble techniques, including Decision Trees, Support Vector Machines, Rotation Forest, AdaBoost, and Bagging, are also deployed. Experimental results show that combining the fast correlation-based feature selection method with the ICA-based RotBoost ensemble is highly effective for gene classification. In fact, the proposed method can create ensemble classifiers that outperform not only the classifiers produced by conventional machine learning but also those generated by two widely used conventional ensemble learning methods, that is, Bagging and AdaBoost.

Infinite AdaBoost and its Application on Fault Diagnosis for Analog Circuits

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.201-203.2070 ◽

2011 ◽

Vol 201-203 ◽

pp. 2070-2074

Author(s):

He Xun Wang ◽

Chong Liu ◽

Yu Qing Sun

Keyword(s):

Support Vector Machine ◽

Fault Diagnosis ◽

Ensemble Learning ◽

Analog Circuits ◽

Classification Accuracy ◽

Experimental Results ◽

Support Vector ◽

Adaboost Algorithm ◽

Learning Framework

The AdaBoost algorithm can achieve better performance by averaging over the predictions of some weak hypotheses. To improve the classification ability of AdaBoost, an infinite ensemble learning framework based on the Support Vector Machine was formulated. The framework can output an infinite AdaBoost by embedding infinite hypotheses into a new kernel of the Support Vector Machine. The stump kernel embodies infinitely many decision stumps. Finally, the algorithm was applied to fault diagnosis for analog circuits. Experimental results show that infinite AdaBoost with the Support Vector Machine is superior to finite AdaBoost with the same base hypothesis set. The goal of enhancing the classification accuracy of the AdaBoost algorithm is thus achieved.
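
For readers unfamiliar with the stump kernel, the sketch below shows one form in which it commonly appears in the infinite-ensemble literature: a constant minus the L1 distance between two points. Whether the paper uses exactly this form, and how it chooses the constant, is an assumption here. The parameter name `delta` is illustrative.

```python
def stump_kernel(x, z, delta):
    """K(x, z) = delta - ||x - z||_1, with delta a user-chosen constant (assumed form)."""
    return delta - sum(abs(a - b) for a, b in zip(x, z))


# Hypothetical two-dimensional feature vectors.
k_xz = stump_kernel([0.0, 1.0], [1.0, 3.0], delta=5.0)
k_zx = stump_kernel([1.0, 3.0], [0.0, 1.0], delta=5.0)
```

Because the kernel depends only on the L1 distance, it is symmetric and needs no scale parameter beyond the constant.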

MoRFPred_en: Sequence-based prediction of MoRFs using an ensemble learning strategy

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720019400158 ◽

2019 ◽

Vol 17 (06) ◽

pp. 1940015

Author(s):

Chun Fang ◽

Yoshitaka Moriwaki ◽

Caihong Li ◽

Kentaro Shimizu

Keyword(s):

Neural Network ◽

Convolutional Neural Network ◽

Ensemble Learning ◽

Intrinsically Disordered Proteins ◽

Learning Strategy ◽

Disordered Proteins ◽

Support Vector ◽

One Dimensional ◽

Intrinsically Disordered ◽

Molecular Recognition Features

Molecular recognition features (MoRFs) usually act as “hub” sites in the interaction networks of intrinsically disordered proteins (IDPs). Because an increasing number of serious diseases have been found to be associated with disordered proteins, identifying MoRFs has become increasingly important. In this study, we propose an ensemble learning strategy, named MoRFPred_en, to predict MoRFs from protein sequences. This approach combines four submodels that utilize different sequence-derived features for the prediction, including a multichannel one-dimensional convolutional neural network (CNN_1D multichannel) based model, two deep two-dimensional convolutional neural network (DCNN_2D) based models, and a support vector machine (SVM) based model. When compared with other methods on the same datasets, the MoRFPred_en approach produced better results than existing state-of-the-art MoRF prediction methods, achieving an AUC of 0.762 on the VALIDATION419 dataset, 0.795 on the TEST45 dataset, and 0.776 on the TEST49 dataset. Availability: http://vivace.bi.a.u-tokyo.ac.jp:8008/fang/MoRFPred_en.php .

A Learning Framework of Nonparallel Hyperplanes Classifier

The Scientific World Journal ◽

10.1155/2015/497617 ◽

2015 ◽

Vol 2015 ◽

pp. 1-12

Author(s):

Zhi-Xia Yang ◽

Yuan-Hai Shao ◽

Yao-Lin Jiang

Keyword(s):

Binary Classification ◽

Computational Cost ◽

Classification Problem ◽

Multiclass Classification ◽

Decision Function ◽

Support Vector ◽

Learning Framework ◽

Hinge Loss ◽

Benchmark Datasets ◽

Nonparallel Hyperplanes

A novel learning framework of nonparallel hyperplanes support vector machines (NPSVMs) is proposed for binary classification and multiclass classification. This framework not only includes twin SVM (TWSVM) and its many variants but also extends them to the multiclass classification problem when different parameters or loss functions are chosen. Concretely, we discuss the linear and nonlinear cases of the framework, taking the hinge loss function as an example. Moreover, we also give the primal problems of several extensions of the TWSVM variants. It is worth mentioning that, in the decision function, the Euclidean distance is replaced by the absolute value $|w^{T}x+b|$, which keeps the decision function consistent with the optimization problem and reduces the computational cost, particularly when the kernel function is introduced. Numerical experiments on several artificial and benchmark datasets indicate that our framework is not only fast but also generalizes well.
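
The decision rule mentioned above can be illustrated with a short sketch: each class has its own hyperplane, and a point is assigned to the class whose hyperplane gives the smallest $|w^{T}x+b|$. The hyperplane parameters and class labels below are made up for illustration, and the sketch covers only the linear case.

```python
def predict(x, hyperplanes):
    """Return the label whose hyperplane (w, b) minimizes |w.x + b|."""
    def score(w, b):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return min(hyperplanes, key=lambda label: score(*hyperplanes[label]))


# Hypothetical nonparallel hyperplanes: x1 = 1 for "pos", x2 = 1 for "neg".
planes = {
    "pos": ([1.0, 0.0], -1.0),
    "neg": ([0.0, 1.0], -1.0),
}

label_a = predict([1.1, 3.0], planes)  # close to the line x1 = 1
label_b = predict([4.0, 0.8], planes)  # close to the line x2 = 1
```

Skipping the division by the norm of w avoids an extra computation per class, which matters most when w is only available implicitly through a kernel.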

A stacking ensemble learning framework for genomic prediction

10.21203/rs.3.rs-52592/v1 ◽

2020 ◽

Author(s):

Mang Liang ◽

Tianpeng Chang ◽

Bingxing An ◽

Xinghai Duan ◽

Lili Du ◽

...

Keyword(s):

Machine Learning ◽

Ensemble Learning ◽

Genomic Prediction ◽

Milk Fat ◽

Prediction Accuracy ◽

Support Vector ◽

Fat Percentage ◽

Prediction Ability ◽

Learning Framework ◽

Better Than

Abstract Background: Machine learning (ML) is perhaps the most useful tool for the interpretation of large genomic datasets. However, the performance of a single machine learning method in genomic selection (GS) was unsatisfactory in existing research. To improve the genomic predictions, we constructed a stacking ensemble learning framework (SELF) that integrates three machine learning methods to predict genomic estimated breeding values (GEBVs). Results: We evaluated the prediction ability of SELF on three real datasets and compared the prediction accuracy of SELF, base learners, GBLUP and BayesB. For each trait, SELF performed better than the base learners, which included support vector regression (SVR), kernel ridge regression (KRR) and elastic net (ENET). The prediction accuracy of SELF showed an average improvement of 7.70% compared with GBLUP in the three datasets. Except for the milk fat percentage (MFP) traits of the German Holstein dairy cattle dataset, SELF was more robust than BayesB in the remaining traits. Conclusions: In this study, we applied a stacking ensemble learning framework (SELF) to genomic prediction, and it performed much better than GBLUP and BayesB in three real datasets with different genetic architectures. Therefore, we believe SELF has the potential to be used to estimate GEBVs in other animals and plants.

A Stacking Ensemble Learning Framework for Genomic Prediction

Frontiers in Genetics ◽

10.3389/fgene.2021.600040 ◽

2021 ◽

Vol 12 ◽

Author(s):

Mang Liang ◽

Tianpeng Chang ◽

Bingxing An ◽

Xinghai Duan ◽

Lili Du ◽

...

Keyword(s):

Machine Learning ◽

Ensemble Learning ◽

Milk Fat ◽

Prediction Accuracy ◽

Support Vector ◽

Fat Percentage ◽

Prediction Ability ◽

Learning Framework ◽

Best Linear Unbiased ◽

Estimated Breeding Values

Machine learning (ML) is perhaps the most useful tool for the interpretation of large genomic datasets. However, the performance of a single machine learning method in genomic selection (GS) is currently unsatisfactory. To improve the genomic predictions, we constructed a stacking ensemble learning framework (SELF), integrating three machine learning methods, to predict genomic estimated breeding values (GEBVs). The present study evaluated the prediction ability of SELF by analyzing three real datasets with different genetic architectures and comparing the prediction accuracy of SELF, base learners, genomic best linear unbiased prediction (GBLUP) and BayesB. For each trait, SELF performed better than the base learners, which included support vector regression (SVR), kernel ridge regression (KRR) and elastic net (ENET). The prediction accuracy of SELF was, on average, 7.70% higher than that of GBLUP in the three datasets. Except for the milk fat percentage (MFP) traits of the German Holstein dairy cattle dataset, SELF was more robust than BayesB in all remaining traits. Therefore, we believe that SELF has the potential to be used to estimate GEBVs in other animals and plants.
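
The general idea of stacking, in which a meta-learner is trained on out-of-fold predictions of the base learners, can be sketched in a few lines. This is a toy under loudly stated assumptions: the two base learners (a least-squares line and a nearest-neighbour average) stand in for SVR, KRR and ENET, the meta-learner is a single convex weight found by grid search, and the data are invented. The abstract does not specify the meta-learner used in SELF.

```python
def fit_line(xs, ys):
    """Least-squares line y = a*x + b, returned as a predictor function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: a * x + (my - a * mx)


def fit_knn(xs, ys, k=2):
    """Mean of the k nearest training targets, returned as a predictor function."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def out_of_fold(fit, xs, ys, n_folds):
    """Predict each sample with a model that did not see it during training."""
    preds = [0.0] * len(xs)
    for f in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != f]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in range(f, len(xs), n_folds):
            preds[i] = model(xs[i])
    return preds


def fit_stack(xs, ys, n_folds=3):
    """Learn a convex weight over the two base learners from out-of-fold predictions."""
    oof_a = out_of_fold(fit_line, xs, ys, n_folds)
    oof_b = out_of_fold(fit_knn, xs, ys, n_folds)

    def sse(w):
        return sum((w * a + (1 - w) * b - y) ** 2 for a, b, y in zip(oof_a, oof_b, ys))

    w = min((i / 20 for i in range(21)), key=sse)  # grid search on [0, 1]
    line, knn = fit_line(xs, ys), fit_knn(xs, ys)  # refit base learners on all data
    return (lambda x: w * line(x) + (1 - w) * knn(x)), w


# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.1, 18.0]
stacked, weight = fit_stack(xs, ys)
```

Training the meta-learner on out-of-fold rather than in-sample predictions keeps it from simply rewarding whichever base learner overfits the training data most.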

ABC-Gly: identifying protein lysine glycation sites with artificial bee colony algorithm

Current Proteomics ◽

10.2174/1570164617666191227120136 ◽

2019 ◽

Vol 17 ◽

Author(s):

Yanqiu Yao ◽

Xiaosa Zhao ◽

Qiao Ning ◽

Junping Zhou

Keyword(s):

Support Vector Machine ◽

Amino Acid ◽

Artificial Bee Colony Algorithm ◽

Artificial Bee Colony ◽

Training Dataset ◽

Support Vector ◽

Supplementary File ◽

Feature Subset ◽

Lipid Molecule ◽

Bee Colony

Background: Glycation is a nonenzymatic post-translational modification in which a sugar molecule is attached to a protein or lipid molecule. It may impair the function and change the characteristics of proteins, which may lead to some metabolic diseases. To understand the underlying molecular mechanisms of glycation, computational prediction methods have been developed because of their convenience and high speed. However, developing a more effective computational tool remains a challenging task in computational biology. Methods: In this study, we present an accurate identification tool named ABC-Gly for predicting lysine glycation sites. First, we used three informative features, namely position-specific amino acid propensity, secondary structure and the composition of k-spaced amino acid pairs, to encode the peptides. Moreover, to fully exploit discriminative features and thereby improve the prediction and generalization ability of the model, we developed a two-step feature selection method that combines the Fisher score with an improved binary artificial bee colony algorithm based on the support vector machine. Finally, based on the optimal feature subset, we constructed the model using a Support Vector Machine on the training dataset. Results: The proposed predictor ABC-Gly achieved a sensitivity of 76.43%, a specificity of 91.10%, a balanced accuracy of 83.76%, an area under the receiver-operating characteristic curve (AUC) of 0.9313 and a Matthews correlation coefficient (MCC) of 0.6861 by 10-fold cross-validation on the training dataset, and a balanced accuracy of 59.05% on the independent dataset. Compared to the state-of-the-art predictors on the training dataset, the proposed predictor improved the AUC by 0.156 and the MCC by 0.336. Conclusion: The detailed analysis results indicate that our predictor may serve as a powerful complementary tool to other existing methods for predicting protein lysine glycation. The source code and datasets of ABC-Gly are provided in Supplementary File 1.
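
The reported metrics are related by standard formulas, which the sketch below computes from confusion-matrix counts. The counts are invented for illustration and are not the paper's data. The last line checks only that the reported balanced accuracy is the mean of the reported sensitivity and specificity.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, balanced accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical counts: 10 positive and 10 negative sites.
m = classification_metrics(tp=8, fn=2, tn=9, fp=1)

# Mean of the reported sensitivity and specificity, in percent (about 83.765).
reported_balanced = (76.43 + 91.10) / 2
```

The mean of the two reported rates agrees with the stated balanced accuracy of 83.76% up to rounding.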

A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers

BMC Medical Research Methodology ◽

10.1186/s12874-021-01299-6 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Yue Jiao ◽

Fabienne Lesueur ◽

Chloé-Agathe Azencott ◽

Maïté Laurent ◽

Noura Mebirouk ◽

...

Keyword(s):

Record Linkage ◽

Gold Standard ◽

Brca2 Mutation ◽

Epidemiological Studies ◽

Supervised Machine Learning ◽

Training Dataset ◽

Support Vector ◽

Genetic Modifiers ◽

Brca1 And Brca2 ◽

Mutation Carriers

Abstract Background: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies, GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. Methods: To identify as many as possible of the individuals participating in the two studies but not registered by a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) combined the candidate matches identified by both approaches. We built the ML model using the gold standard on a first version of the two databases as a training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. Results: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. Therefore, RF was selected to build the ML model since our goal was to identify the maximum number of true matches. Our combined linkage PRL + ML showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). Conclusions: Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.
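
The effect of combining candidate matches on recall can be illustrated with plain sets. The sketch assumes the combination is a set union of the two candidate lists, which is one reading of the abstract and not confirmed by it. All identifiers and pairs below are invented.

```python
def recall(found, gold):
    """Share of gold-standard matches that appear among the found pairs."""
    return len(found & gold) / len(gold)


# Hypothetical record pairs (identifier in study A, identifier in study B).
gold = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
prl_matches = {("a1", "b1"), ("a2", "b2"), ("a5", "b9")}  # includes one false match
ml_matches = {("a2", "b2"), ("a3", "b3")}

combined = prl_matches | ml_matches  # assumed combination rule: set union

recall_prl = recall(prl_matches, gold)
recall_ml = recall(ml_matches, gold)
recall_combined = recall(combined, gold)
```

A union can never lower recall relative to either input, but it also keeps every false match from both lists, so precision has to be checked separately.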

Cohort study of the overall survival of patients with pancreatic cancer in a hospital of specialties of Quito-Ecuador in the period 2007–2017

Innovative Surgical Sciences ◽

10.1515/iss-2020-0030 ◽

2021 ◽

Vol 0 (0) ◽

Author(s):

Andrés Moreno Roca ◽

Luciana Armijos Acurio ◽

Ruth Jimbo Sotomayor ◽

Carlos Céspedes Rivadeneira ◽

Carlos Rosero Reyes ◽

...

Keyword(s):

Pancreatic Cancer ◽

Cohort Study ◽

Survival Time ◽

Univariate Analysis ◽

Rank Test ◽

Kaplan Meier ◽

Significant Difference ◽

Icd 10 ◽

The Difference ◽

Clinical Records

Abstract Objectives: Pancreatic cancers in most patients in Ecuador are diagnosed at an advanced stage of the disease, which is associated with lower survival. This study aimed to determine the characteristics and overall survival of pancreatic cancer patients in a social security hospital in Ecuador between 2007 and 2017. Methods: A retrospective cohort study and a survival analysis were performed using all the available data in the electronic clinical records of patients with a diagnosis of pancreatic cancer in a Hospital of Specialties of Quito-Ecuador between 2007 and 2017. The included patients were those coded between C25.0 and C25.9 according to ICD-10. Our univariate analysis calculated frequencies and measures of central tendency and dispersion. Using the Kaplan-Meier method, we estimated the median survival time and analyzed the difference in survival time among the categories of the included variables. These differences were assessed with the log-rank test. Results: A total of 357 patients diagnosed with pancreatic cancer between 2007 and 2017 were included in the study. More than two-thirds (69.9%) of the patients were diagnosed in late stages of the disease. The median survival time for all patients was 4 months (P25: 2, P75: 8). Conclusions: The statistically significant difference in survival time between types of treatment is the most relevant finding of this study.
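
The Kaplan-Meier method mentioned in the abstract can be sketched as a product over event times of the fraction of at-risk patients who survive each time. The follow-up times and censoring flags below are invented and are not the study's data. The function names are illustrative.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    events[i] is 1 if the patient died at times[i], 0 if censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:  # group ties at time t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed  # deaths and censored patients leave the risk set
    return curve


def median_survival(curve):
    """First time at which the survival estimate drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached


# Hypothetical follow-up in months; 0 marks a censored observation.
times = [2, 3, 4, 4, 6, 8, 10]
events = [1, 1, 1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
median = median_survival(curve)
```

Censored patients contribute to the risk set up to their last follow-up without counting as deaths, which is what distinguishes this estimate from a simple proportion of survivors.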
