An Extensive Approach Towards Heart Stroke Prediction Using Machine Learning with Ensemble Classifier

Color Doppler is used in the clinic for visually assessing the vascularity of breast masses on ultrasound, to aid in determining the likelihood of malignancy. In this study, quantitative color Doppler radiomics features were algorithmically extracted from breast sonograms for machine learning, producing a diagnostic model for breast cancer with higher performance than models based on grayscale and clinical category from the Breast Imaging Reporting and Data System for ultrasound (BI-RADSUS). Ultrasound images of 159 solid masses were analyzed. Algorithms extracted nine grayscale features and two color Doppler features. These features, along with patient age and BI-RADSUS category, were used to train an AdaBoost ensemble classifier. Though training on computer-extracted grayscale features and color Doppler features each significantly increased performance over that of models trained on clinical features, as measured by the area under the receiver operating characteristic (ROC) curve, training on both color Doppler and grayscale further increased the ROC area, from 0.925 ± 0.022 to 0.958 ± 0.013. Pruning low-confidence cases at 20% improved this to 0.986 ± 0.007 with 100% sensitivity, whereas 64% of the cases had to be pruned to reach this performance without color Doppler. Fewer borderline diagnoses and higher ROC performance were both achieved for diagnostic models of breast cancer on ultrasound by machine learning on color Doppler features.
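The pruning step described above can be sketched in a few lines. This is only an illustration under assumptions: it treats "low-confidence" as a classifier score close to 0.5, and the scores and labels are made-up toy values, not data from the study. The function names are hypothetical.

```python
def prune_low_confidence(scores, labels, prune_fraction):
    """Drop the prune_fraction of cases whose score lies closest to 0.5."""
    n_drop = int(len(scores) * prune_fraction)
    # Rank cases from least to most confident (distance from the 0.5 midpoint).
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    kept = sorted(order[n_drop:])
    return [scores[i] for i in kept], [labels[i] for i in kept]


def sensitivity(scores, labels, threshold=0.5):
    """Fraction of true positives (label 1) scored at or above the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    return tp / (tp + fn)


# Toy scores (assumed): 1 = malignant, 0 = benign.
scores = [0.95, 0.90, 0.49, 0.52, 0.10, 0.05, 0.70, 0.85, 0.20, 0.30]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]

kept_scores, kept_labels = prune_low_confidence(scores, labels, 0.2)
print(sensitivity(scores, labels))            # before pruning
print(sensitivity(kept_scores, kept_labels))  # after pruning 20% of cases
```

In this toy data the two borderline cases include the only missed malignancy, so sensitivity rises after pruning. In practice the pruned cases still need a decision by other means, which is why the fraction pruned matters.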


Prediction of novel mouse TLR9 agonists using a random forest approach

BMC Molecular and Cell Biology ◽

10.1186/s12860-019-0241-0 ◽

2019 ◽

Vol 20 (S2) ◽

Author(s):

Varun Khanna ◽

Lei Li ◽

Johnson Fung ◽

Shoba Ranganathan ◽

Nikolai Petrovsky

Keyword(s):

Machine Learning ◽

Random Forest ◽

Correlation Coefficient ◽

Matthews Correlation Coefficient ◽

Learning Algorithms ◽

Ensemble Classifier ◽

Innate Immune ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Algorithm

Abstract. Background: Toll-like receptor 9 (TLR9) is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODN) containing unmethylated cytosine-guanine (CpG) motifs. Due to the considerable number of rotatable bonds in ODNs, high-throughput in silico screening of CpG ODNs for potential TLR9 activity via traditional structure-based virtual screening approaches is challenging. In the current study, we present a machine learning based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including the count and position of motifs, the distance between the motifs, and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house, experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling. Results: Using in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed the other algorithms on our dataset for TLR9 activity prediction. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier in test samples were 0.61 and 80.0%, respectively, with a maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including ‘CC’, ‘GG’, ‘AG’, ‘CCCG’ and ‘CGGC’ were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked, and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay; 91 of the 100 selected ODNs showed high activity, confirming the accuracy of the model in predicting mTLR9 activity.
Conclusion: We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms, including support vector machines, shrinkage discriminant analysis, gradient boosting machine, and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
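As a rough illustration of the two ideas in this abstract, the sketch below builds balanced training subsets by repeated random down-sampling and computes the Matthews correlation coefficient (MCC) and balanced accuracy from confusion-matrix counts. The class sizes (10 active, 90 inactive), the identifiers, and the confusion counts are assumptions chosen for the example; the abstract does not report them. The model-fitting step is omitted.

```python
import math
import random


def balanced_subsets(minority, majority, n_subsets=20, seed=0):
    """Pair the full minority class with a fresh random sample of the majority."""
    rng = random.Random(seed)
    return [(list(minority), rng.sample(majority, len(minority)))
            for _ in range(n_subsets)]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Hypothetical imbalanced dataset: 10 active and 90 inactive ODN identifiers.
actives = [f"active_{i}" for i in range(10)]
inactives = [f"inactive_{i}" for i in range(90)]

# One balanced subset per ensemble member (20 members, as in the abstract).
subsets = balanced_subsets(actives, inactives, n_subsets=20)

# Assumed confusion counts for a single evaluation.
print(mcc(tp=40, tn=40, fp=10, fn=10))
print(balanced_accuracy(tp=40, tn=40, fp=10, fn=10))
```

Each subset would train one model, and the models' predictions would then be combined. Both metrics account for all four confusion-matrix cells, which makes them less misleading than plain accuracy on imbalanced data.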


Predicting protein-membrane interfaces of peripheral membrane proteins using ensemble machine learning

10.1101/2021.06.28.450157 ◽

2021 ◽

Author(s):

Alexios Chatzigoulas ◽

Zoe Cournia

Keyword(s):

Machine Learning ◽

Membrane Proteins ◽

Ensemble Classifier ◽

Binding Domains ◽

Peripheral Membrane Proteins ◽

Ensemble Machine Learning ◽

Protein Interfaces ◽

Protein Membrane ◽

Membrane Interfaces ◽

Membrane Attachment

Motivation: Abnormal protein-membrane attachment is involved in deregulated cellular pathways and in disease. Therefore, the possibility of modulating protein-membrane interactions represents a promising new therapeutic strategy for peripheral membrane proteins that have so far been considered undruggable. A major obstacle in this drug design strategy is that the membrane-binding domains of peripheral membrane proteins are usually not known. The development of fast and efficient algorithms for predicting the protein-membrane interface would shed light on the accessibility of membrane-protein interfaces to drug-like molecules. Results: Herein, we describe an ensemble machine learning methodology and algorithm for predicting membrane-penetrating residues. We utilize available experimental data in the literature for training 21 machine learning classifiers and a voting classifier. Evaluation of the ensemble classifier accuracy produced a macro-averaged F1 score = 0.92 and an MCC = 0.84 for correctly predicting membrane-penetrating residues on unknown proteins of an independent test set. Availability and implementation: The Python code for predicting protein-membrane interfaces of peripheral membrane proteins is available at https://github.com/zoecournia/DREAMM.
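To make the voting and scoring terms concrete, here is a minimal sketch of hard (majority) voting over several classifiers' predictions, together with a macro-averaged F1 score. It is not the DREAMM implementation: the abstract does not say whether voting is hard or soft, and the three prediction lists below are invented toy values standing in for the 21 trained classifiers.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard voting: for each sample, return the label most models predicted."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (assumed): 1 = membrane-penetrating residue, 0 = other residue.
y_true = [1, 1, 1, 0, 0, 0]
model_a = [1, 1, 0, 0, 0, 1]
model_b = [1, 0, 1, 0, 0, 0]
model_c = [1, 1, 1, 0, 1, 0]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
print(macro_f1(y_true, model_a), macro_f1(y_true, ensemble))
```

With an odd number of voters and two classes, ties cannot occur. In this toy case each model makes different mistakes, so the vote corrects all of them.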


Analysis of Coral Reef Health Based on River, Sea, and Residential Area Population Characteristics Using Machine Learning

IJIS - Indonesian Journal On Information System ◽

10.36549/ijis.v5i2.119 ◽

2020 ◽

Vol 5 (2) ◽

Author(s):

Adinda miftahul Ilmi Habiba ◽

Agi Prasetiadi ◽

Cepi Ramdani

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Nearest Neighbor ◽

Ensemble Classifier ◽

Support Vector ◽

Learning Support ◽

K Nearest Neighbor

This study assesses the health of coral reefs in a region of Indonesia using several factors: the number of visiting tourists, latitude, longitude, temperature, year, resident population, number of youth, and number of industries. The method is machine learning with the K-Nearest Neighbor, Support Vector Machine, and Ensemble Classifier algorithms; the ensemble uses random forest to select the tree branches, or decision features, most relevant to the output. The study is intended as a reference so that regions whose coral reefs are in poor condition can emulate regions whose reefs are in good condition, by examining which factors lead a region's reefs to be categorized as good. The results show that, for K-Nearest Neighbor, the factors influencing coral reef health are visiting tourists, latitude, longitude, temperature, year, and resident population; for Support Vector Machine, the influential factors are visiting tourists, latitude, temperature, and year; and for the Ensemble Classifier, they are visiting tourists, latitude, longitude, temperature, and number of industries. In this case, the Support Vector Machine performed better than K-Nearest Neighbor and the Ensemble Classifier. Keywords: Ecosystem, Ensemble Classifier, K-Nearest Neighbor, Machine Learning, Support Vector Machine


Automatic catalog of RR Lyrae from ∼14 million VVV light curves: How far can we go with traditional machine-learning?

Astronomy and Astrophysics ◽

10.1051/0004-6361/202038314 ◽

2020 ◽

Vol 642 ◽

pp. A58

Author(s):

J. B. Cabral ◽

F. Ramos ◽

S. Gurovich ◽

P. M. Granitto

Keyword(s):

Machine Learning ◽

Model Selection ◽

Broad Band ◽

Ensemble Classifier ◽

Light Curves ◽

Ensemble Classifiers ◽

Data Set ◽

Rr Lyrae ◽

Selection Step ◽

Sampling Procedures

Context. The creation of a 3D map of the bulge using RR Lyrae (RRL) is one of the main goals of the VISTA Variables in the Via Lactea Survey (VVV) and VVV(X) surveys. The overwhelming number of sources undergoing analysis undoubtedly requires the use of automatic procedures. In this context, previous studies have introduced the use of machine learning (ML) methods for the task of variable star classification. Aims. Our goal is to develop and test an entirely automatic ML-based procedure for the identification of RRLs in the VVV Survey. This automatic procedure is meant to be used to generate reliable catalogs integrated over several tiles in the survey. Methods. Following the reconstruction of light curves, we extracted a set of period- and intensity-based features, which were already defined in previous works. Also, for the first time, we put a new subset of useful color features to use. We discuss in considerable detail all the appropriate steps needed to define our fully automatic pipeline, namely: the selection of quality measurements; sampling procedures; classifier setup, and model selection. Results. As a result, we were able to construct an ensemble classifier with an average recall of 0.48 and average precision of 0.86 over 15 tiles. We also made all our processed datasets available and we published a catalog of candidate RRLs. Conclusions. Perhaps most interestingly, from a classification perspective based on photometric broad-band data, our results indicate that color is an informative feature type of the RRL objective class that should always be considered in automatic classification methods via ML. We also argue that recall and precision in both tables and curves are high-quality metrics with regard to this highly imbalanced problem. Furthermore, we show for our VVV data-set that to have good estimates, it is important to use the original distribution more abundantly than reduced samples with an artificial balance. 
Finally, we show that the use of ensemble classifiers helps resolve the crucial model selection step and that most errors in the identification of RRLs are related to low-quality observations of some sources or to the increased difficulty in resolving the RRL-C type given the data.
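The point about recall and precision on a highly imbalanced problem can be illustrated with a small sketch. The counts are invented (10 RRL among 1000 sources) and are not taken from the VVV data; the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive class (0.0 when undefined)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy catalog (assumed): 10 RRL (label 1) among 1000 sources.
y_true = [1] * 10 + [0] * 990

# A classifier that never predicts RRL still looks excellent on accuracy.
never_rrl = [0] * 1000

# A classifier that recovers 5 of the 10 RRL with one false alarm.
finds_half = [1] * 5 + [0] * 5 + [1] * 1 + [0] * 989

print(accuracy(y_true, never_rrl), precision_recall(y_true, never_rrl))
print(accuracy(y_true, finds_half), precision_recall(y_true, finds_half))
```

Both classifiers score about 0.99 on accuracy, but only precision and recall separate the useless one from the useful one.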


Forecasting autism gene discovery with machine learning and genome-scale data

10.1101/370601 ◽

2018 ◽

Cited By ~ 3

Author(s):

Leo Brueggeman ◽

Tanner Koomar ◽

Jacob J Michaelson

Keyword(s):

Gene Expression ◽

Machine Learning ◽

De Novo ◽

Gene Discovery ◽

Feature Space ◽

Ensemble Classifier ◽

Gene Level ◽

Increased Sensitivity ◽

Genome Scale ◽

Scale Data

Abstract. Background: Genes are one of the most powerful windows into the biology of autism, and it has been estimated that perhaps a thousand or more genes may confer risk. However, fewer than 100 genes are currently viewed as having robust enough evidence to be considered true "autism genes". Massive genetic studies are underway to produce data to implicate additional genes, but this approach, although necessary, is costly and slow-moving. Methods: We approach autism gene discovery as a machine learning problem, rather than a genetic association problem, and use genome-scale data as predictors for identifying further genes that have similar properties in the feature space compared to established autism risk genes. This approach, which we call forecASD, integrates spatiotemporal gene expression, heterogeneous network data, and previous gene-level predictors of autism association into an ensemble classifier that yields a single score that indexes each gene’s evidence for being involved in the etiology of autism. Results: We demonstrate that forecASD has substantially increased sensitivity and specificity compared to previous gene-level predictors of autism association, including genetic measures such as TADA. On an independent test set, consisting of newly released pilot data from the SPARK Genomics Consortium, we show that forecASD best predicts which genes will have an excess of likely gene disrupting (LGD) de novo mutations. We further use independent data from a recent post mortem study of case/control gene expression to show that forecASD is also a significant predictor of genes implicated in ASD through differential expression. Using forecASD results, we show which molecular pathways are currently under-represented in the autism literature and likely represent under-appreciated biological mechanisms of autism. Finally, forecASD correctly predicted 12 of 16 genes implicated at FDR = 0.2 by the latest ASD gene discovery study, while also identifying the most likely false positives among the candidate genes. Conclusions: These results demonstrate that forecASD bridges the gap between genetic- and expression-based ASD gene discovery, and provides a data-driven replacement for much of the manual filtering and curation that is a critical step in ensuring the robustness of gene discovery studies.


Evaluation of Ensemble Classifier (EC) Machine Learning Methods for Introduction of Breast Cancer Genomic Biomarkers

Multidisciplinary Cancer Investigation ◽

10.21859/mci-supp-36 ◽

2017 ◽

Vol 1 (Supplementary 1) ◽

pp. 0-0

Author(s):

L Mirsadeghi ◽

K Kavousi ◽

R Hajihosseini ◽

A Banaei-Moghaddam

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Ensemble Classifier ◽

Learning Methods ◽

Genomic Biomarkers ◽

Machine Learning Methods


A practical approach for applying Machine Learning in the detection and classification of network devices used in building management

10.22541/au.160689781.19054555/v1 ◽

2020 ◽

Author(s):

Maroun Touma ◽

Shalisha Witherspoon ◽

Shonda Witherspoon ◽

Isabelle Crawford-Eng

Keyword(s):

Machine Learning ◽

Critical Infrastructure ◽

Ensemble Classifier ◽

Essential Elements ◽

Small Sample ◽

Training Data ◽

Feature Engineering ◽

Ensemble Classifiers ◽

Commercial Building ◽

Automation And Control

With the increasing deployment of smart buildings and infrastructure, Supervisory Control and Data Acquisition (SCADA) devices and the underlying IT network have become essential elements for the proper operation of these highly complex systems. With the increase in automation and the proliferation of SCADA devices, the attack surface of critical infrastructure has grown correspondingly. Understanding device behaviors in terms of known and understood or potentially qualified activities versus unknown and potentially nefarious activities in near-real time is a key component of any security solution. In this paper, we investigate the challenges of building robust machine learning models to identify unknowns purely from network traffic both inside and outside firewalls, starting with missing or inconsistent labels across sites, feature engineering and learning, temporal dependencies and analysis, and training data quality (including small sample sizes) for both shallow and deep learning methods. To demonstrate these challenges and the capabilities we have developed, we focus on Building Automation and Control networks (BACnet) from a private commercial building system. Our results show that a “Model Zoo” built from binary classifiers, one per device or behavior, combined with an ensemble classifier that integrates information from all classifiers, provides a reliable methodology for identifying unknown devices as well as determining specific known devices when the device type is in the training set. The capability of the Model Zoo framework is shown to be directly linked to feature engineering and learning, and its dependency on feature selection varies with both the binary and ensemble classifiers.
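The "Model Zoo" idea (one binary classifier per known device, plus a combining step that can also answer "unknown") can be sketched as follows. Everything concrete here is an assumption for illustration: the device names, the single numeric traffic feature, the toy scoring functions, and the max-score-with-threshold combining rule, which stands in for the trained ensemble classifier the abstract mentions.

```python
def make_scorer(center, width):
    """Hypothetical binary scorer: confidence in [0, 1] that a traffic
    feature value belongs to one device, falling off linearly with distance."""
    def score(x):
        return max(0.0, 1.0 - abs(x - center) / width)
    return score


# One binary scorer per known device (names and parameters are made up).
zoo = {
    "thermostat": make_scorer(center=10.0, width=5.0),
    "chiller": make_scorer(center=50.0, width=10.0),
}


def identify(x, zoo, threshold=0.5):
    """Combine all scorers: pick the most confident device, or 'unknown'
    when no scorer reaches the threshold (assumed combining rule)."""
    scores = {name: scorer(x) for name, scorer in zoo.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


for value in (11.0, 48.0, 30.0):
    print(value, identify(value, zoo))
```

The design point is that an input rejected by every per-device model is reported as unknown rather than forced into the nearest known class.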


What’s in a Trauma? Using Machine Learning to Unpack What Makes an Event Traumatic

10.31234/osf.io/yh6wd ◽

2021 ◽

Author(s):

Payton J. Jones

Keyword(s):

Machine Learning ◽

Empirical Work ◽

Ensemble Classifier ◽

Political Orientation ◽

Support Vector ◽

Physical Injury ◽

Lasso Regression ◽

Interaction Terms ◽

Out Of Sample ◽

And Gender

What differentiates a trauma from an event that is merely upsetting? Wildly different definitions of trauma have been used across various settings. Yet there is a dearth of empirical work examining the features of events that individuals use to define an event as a ‘trauma’. First, a group of qualitative coders classified features (e.g., actual physical injury, loss of possessions) of 600 event descriptions (e.g., “was verbally harassed by a boss”, “watched a video of an adult being shot and killed”). Next, across two studies, machine learning was used to predict whether individuals rated event descriptions as ‘trauma’ or ‘traumatic’ in over 100,000 judgment tasks. In Study 1, examining continuous ratings, a cross-validated LASSO regression with interaction terms provided the best out-of-sample predictions (r² = 0.76), outperforming ridge regression, support vector regression, and linear regression. In Study 2, using binary judgments, a random forest model accurately predicted out-of-sample individual responses (AUC = 0.96), outperforming a neural network and an AdaBoost ensemble classifier. The most important event features across the two studies were actual death, threat of death, and the presence of a human perpetrator. The most important human features in predicting judgments were political orientation and gender.
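Two ingredients of "LASSO regression with interaction terms" can be shown in isolation. The sketch below expands a row of binary event features with all pairwise products, and applies the soft-thresholding operator that LASSO solvers based on coordinate descent use to shrink small coefficients to exactly zero. The feature names, values, and penalty are illustrative assumptions; this is not the study's pipeline and does not fit a model.

```python
from itertools import combinations


def add_interactions(row):
    """Append every pairwise product x_i * x_j (i < j) to a feature row."""
    return list(row) + [a * b for a, b in combinations(row, 2)]


def soft_threshold(z, penalty):
    """Shrink z toward zero by penalty; values within the penalty become 0."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0


# Hypothetical coded features for one event description:
# [physical_injury, loss_of_possessions, human_perpetrator]
event = [1, 0, 1]
print(add_interactions(event))

# Assumed unpenalized coefficient estimates and an assumed penalty of 0.5.
raw = [2.0, 0.25, -1.5, -0.5]
print([soft_threshold(z, 0.5) for z in raw])
```

Interaction terms grow quadratically with the number of features, which is one reason a sparsity-inducing penalty is a natural companion to them.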
