Disease Classification and Prediction using Ensemble Machine Learning Classification Algorithm

Disease prediction plays an important role in the medical field today. Early detection of diseases is essential given today's fast-food habits and lifestyle. In my previous study on predicting diseases from radiology test reports, three classifiers, Naïve Bayes (NB), Support Vector Machine (SVM) and Modified Extreme Learning Machine (MELM), were used to classify the disease as positive or negative and to increase the accuracy of the results. To increase the efficiency of disease prediction and to find which disease affects society the most, an ensemble machine learning algorithm is used. The large volume of data from the healthcare industry was preprocessed, categorized and analyzed to predict which patients should be treated and given priority, and which disease hits society the hardest. Ensemble machine learning's popularity in the medical industry is due to a variety of factors. The classifiers used are K Nearest Neighbors, Nearest Mean Classifier, Mean Feature Voting Classifier, KDtree KNN and Random Forest. Automating manual processes in the medical field has become important. Electronic medical records and significant advances in health care have provided an opportunity to find out which patients need to be given more importance. Several methodologies and techniques were used to preprocess the data in order to meet the study's requirements. To improve the performance of the machine learning algorithms, features were selected using Tabu search. When ensemble prediction is combined with the Random Forest algorithm as the combiner, the results are more reliable. The aim of this study is to create a system that classifies medical records as diseased or not and finds out which disease's rate has increased. This research will help individuals in society to get treated easily and to take preventive measures to avoid diseases.
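As a rough illustration of one of the classifiers named above, a nearest mean classifier can be sketched in a few lines of Python. The function names, the toy feature vectors and the labels are assumptions made for illustration only; this is not the study's implementation or data, and it leaves out the Tabu search feature selection and the Random Forest combiner.

```python
from collections import defaultdict
from math import dist


def fit_class_means(samples, labels):
    """Return the mean feature vector (centroid) of each class."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict_nearest_mean(class_means, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(class_means, key=lambda label: dist(class_means[label], features))


# Hypothetical two-feature records: label 1 = diseased, 0 = not diseased.
train_x = [[1.0, 1.0], [1.0, 3.0], [5.0, 5.0], [7.0, 5.0]]
train_y = [0, 0, 1, 1]
means = fit_class_means(train_x, train_y)
```

Calling `predict_nearest_mean(means, record)` on a new record returns the label of the closest class centroid. In an ensemble, a prediction like this would be one input among several to the combiner.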


Comparison of Ensemble Machine Learning Methods for Soil Erosion Pin Measurements

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10010042 ◽

2021 ◽

Vol 10 (1) ◽

pp. 42

Author(s):

Kieu Anh Nguyen ◽

Walter Chen ◽

Bor-Shiun Lin ◽

Uma Seeboonruang

Keyword(s):

Machine Learning ◽

Soil Erosion ◽

Ensemble Methods ◽

Machine Learning Algorithms ◽

Multivariate Adaptive Regression Splines ◽

Gradient Boosting ◽

Support Vector ◽

Ensemble Machine Learning ◽

Boosting Method ◽

Bagging Method

Although machine learning has been extensively used in various fields, it has only recently been applied to soil erosion pin modeling. To improve upon previous methods of quantifying soil erosion based on erosion pin measurements, this study explored the possible application of ensemble machine learning algorithms to the Shihmen Reservoir watershed in northern Taiwan. Three categories of ensemble methods were considered in this study: (a) bagging, (b) boosting, and (c) stacking. The bagging method in this study refers to bagged multivariate adaptive regression splines (bagged MARS) and random forest (RF), and the boosting method includes Cubist and gradient boosting machine (GBM). Finally, the stacking method is an ensemble method that uses a meta-model to combine the predictions of base models. This study used RF and GBM as the meta-models, and decision tree, linear regression, artificial neural network, and support vector machine as the base models. The dataset used in this study was sampled using stratified random sampling to achieve a 70/30 split for the training and test data, and the process was repeated three times. The performance of six ensemble methods in three categories was analyzed based on the average of the three attempts. It was found that GBM performed the best among the ensemble models, with the lowest root-mean-square error (RMSE = 1.72 mm/year), the highest Nash-Sutcliffe efficiency (NSE = 0.54), and the highest index of agreement (d = 0.81). This result was confirmed by the spatial comparison of the absolute differences (errors) between model predictions and observations using GBM and RF in the study area. In summary, the results show that, as a group, the bagging method and the boosting method performed equally well, and the stacking method was third for the erosion pin dataset considered in this study.
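The three goodness-of-fit measures reported above can be written out as follows. This is a minimal sketch, assuming the standard definitions of RMSE, Nash-Sutcliffe efficiency and Willmott's index of agreement; the function names and the toy observed/predicted values are illustrative and are not taken from the study.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root-mean-square error between observations and predictions."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(squared / len(observed))


def nash_sutcliffe(observed, predicted):
    """NSE = 1 - residual sum of squares / variance of the observations."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / variance


def index_of_agreement(observed, predicted):
    """Willmott's d (assumed definition): 1 - residual / potential error."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    potential = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2
        for o, p in zip(observed, predicted)
    )
    return 1.0 - residual / potential


# Hypothetical erosion depths in mm/year, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
scores = (
    rmse(observed, predicted),
    nash_sutcliffe(observed, predicted),
    index_of_agreement(observed, predicted),
)
```

A lower RMSE and higher NSE and d indicate a better fit, which is the sense in which GBM is ranked first in the abstract.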


Prediction and Analysis of Gold Prices using Ensemble Machine Learning Algorithms

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.36028 ◽

2021 ◽

Vol 9 (VI) ◽

pp. 4367-4374

Author(s):

Gudipally Chandrashakar

Keyword(s):

Machine Learning ◽

Time Series ◽

Time Series Data ◽

Gold Price ◽

Machine Learning Algorithms ◽

Series Data ◽

Gradient Boosting ◽

Support Vector ◽

Average Value ◽

Ensemble Machine Learning

In this article, we use historical time series data of gold prices up to the current day. To predict the gold price, we consider a few correlated factors: silver price, copper price, Standard and Poor's 500 value, dollar-rupee exchange rate, and Dow Jones Industrial Average value. The prices of every correlated factor and the gold price data cover dates ranging from January 2008 to February 2021. The machine learning algorithms used to analyze the time-series data are Random Forest Regression, Support Vector Regressor, Linear Regressor, ExtraTrees Regressor and Gradient Boosting Regression. In the results, the ExtraTrees Regressor algorithm predicts gold prices more accurately than the other algorithms.


Feature Selection and Comparison of Machine Learning Algorithms in Classification of Grazing and Rumination Behaviour in Sheep

Sensors ◽

10.3390/s18103532 ◽

2018 ◽

Vol 18 (10) ◽

pp. 3532 ◽

Cited By ~ 16

Author(s):

Nicola Mansbridge ◽

Jurgen Mitsch ◽

Nicola Bollard ◽

Keith Ellis ◽

Giuliana Miguel-Pacheco ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Time Budget ◽

Learning Algorithms ◽

Eating Behaviour ◽

Machine Learning Algorithms ◽

Support Vector ◽

Optimum Number ◽

Eating Behaviours ◽

Adaptive Boosting

Grazing and ruminating are the most important behaviours for ruminants, as they spend most of their daily time budget performing these. Continuous surveillance of eating behaviour is an important means for monitoring ruminant health, productivity and welfare. However, surveillance performed by human operators is prone to human variance, time-consuming and costly, especially on animals kept at pasture or free-ranging. The use of sensors to automatically acquire data, and software to classify and identify behaviours, offers significant potential in addressing such issues. In this work, data collected from sheep by means of an accelerometer/gyroscope sensor attached to the ear and collar, sampled at 16 Hz, were used to develop classifiers for grazing and ruminating behaviour using various machine learning algorithms: random forest (RF), support vector machine (SVM), k nearest neighbour (kNN) and adaptive boosting (Adaboost). Multiple features extracted from the signals were ranked on their importance for classification. Several performance indicators were considered when comparing classifiers as a function of algorithm used, sensor localisation and number of used features. Random forest yielded the highest overall accuracies: 92% for collar and 91% for ear. Gyroscope-based features were shown to have the greatest relative importance for eating behaviours. The optimum number of feature characteristics to be incorporated into the model was 39, from both ear and collar data. The findings suggest that one can successfully classify eating behaviours in sheep with very high accuracy; this could be used to develop a device for automatic monitoring of feed intake in the sheep sector to monitor health and welfare.


Classifications of Breast Cancer Diagnosis using Machine Learning

International Journal of Computers ◽

10.46300/9108.2020.14.13 ◽

2020 ◽

Vol 14 ◽

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Random Forest ◽

Breast Cancer Diagnosis ◽

Performance Comparison ◽

Support Vector ◽

Breast Cancer Dataset ◽

K Nearest Neighbors ◽

Cancer Dataset ◽

Machine Learning Classification

Breast Cancer (BC) is among the most common and leading causes of death in women throughout the world. Classification and data analysis tools are now widely used in the medical field for diagnosis, prognosis and decision making, to help lower the risk of people dying or suffering from diseases. Advanced machine learning methods have given hope to patients, as they have helped doctors detect potentially fatal diseases such as Breast Cancer early and provide accurate outcomes. However, the results depend heavily on the techniques used for feature selection and classification, which determine the strength of the machine learning model. In this paper, a performance comparison is conducted using four classifiers, Multilayer Perceptron (MLP), Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Random Forest, on the Wisconsin Breast Cancer dataset to spot the most effective predictors. The main goal is to apply the best machine learning classification methods to predict Breast Cancer as benign or malignant, evaluated in terms of accuracy, F-measure, precision and recall. Experimental results show that Random Forest achieves the highest accuracy of 99.26% on this dataset and features, while SVM and KNN show 97.78% and 97.04% accuracy respectively. MLP shows the lowest accuracy of 94.07%. All the experiments are conducted using RStudio as the data mining platform.
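The four evaluation terms named above can be derived from a confusion matrix as sketched below. The function name, the choice of "malignant" as the positive class and the toy label lists are assumptions for illustration; the study itself used RStudio, and none of these values come from it.

```python
def binary_metrics(y_true, y_pred, positive="malignant"):
    """Compute accuracy, precision, recall and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical ground truth and predictions for eight cases.
truth = ["malignant"] * 3 + ["benign"] * 5
preds = ["malignant", "malignant", "benign", "malignant"] + ["benign"] * 4
metrics = binary_metrics(truth, preds)
```

Accuracy alone can hide errors on the rarer class, which is one reason precision, recall and F-measure are usually reported alongside it.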


Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detect malware like packet content analysis are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features like packet size, arrival time, source and destination addresses and other such metadata to detect malware. Such information can be used to train machine learning classifiers in order to classify malicious and benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyper parameters of the algorithms, in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.


PSIX-15 Assessment of machine learning algorithms for prediction of Aleutian disease in American mink

Journal of Animal Science ◽

10.1093/jas/skab235.484 ◽

2021 ◽

Vol 99 (Supplement_3) ◽

pp. 264-265

Author(s):

Duy Ngoc Do ◽

Guoyu Hu ◽

Younes Miar

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Models ◽

American Mink ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data ◽

Enzyme Linked Immunosorbent Assay ◽

Linear Discriminant ◽

Machine Learning Classification

Abstract: American mink (Neovison vison) is the major source of fur for the fur industries worldwide, and Aleutian disease (AD) is causing severe financial losses to the mink industry. Different methods have been used to diagnose AD in mink, but a combination of several methods can be the most appropriate approach for the selection of AD-resilient mink. Iodine agglutination test (IAT) and counterimmunoelectrophoresis (CIEP) methods are commonly employed in the test-and-remove strategy, while enzyme-linked immunosorbent assay (ELISA) and packed-cell volume (PCV) methods are complementary. However, using multiple methods is expensive, which hinders the correct use of AD tests in selection. This research presents an assessment of AD classification based on machine learning algorithms. Aleutian disease was tested on 1,830 individuals using these tests in an AD-positive mink farm (Canadian Centre for Fur Animal Research, NS, Canada). The accuracy of classification for CIEP was evaluated based on the sex information and the IAT, ELISA and PCV test results, implemented in seven machine learning classification algorithms (Random Forest, Artificial Neural Networks, C50Tree, Naive Bayes, Generalized Linear Models, Boost, and Linear Discriminant Analysis) using the Caret package in R. The accuracy of prediction varied among the methods. Overall, Random Forest was the best-performing algorithm for the current dataset, with an accuracy of 0.89 in the training data and 0.94 in the testing data. Our work demonstrated the utility and relative ease of using machine learning algorithms to assess the CIEP information, consequently reducing the cost of AD tests. However, further work requires the inclusion of production and reproduction information in the models and the extension of phenotypic collection to increase the accuracy of current methods.


Abstract 15895: Machine Learning Algorithms to Predict Major Adverse Cardiovascular Events in Patients Undergoing Orthotopic Liver Transplantation: A Retrospective Cohort Study

Circulation ◽

10.1161/circ.142.suppl_3.15895 ◽

2020 ◽

Vol 142 (Suppl_3) ◽

Author(s):

Vardhmaan Jain ◽

Vikram Sharma ◽

Agam Bansal ◽

Cerise Kleb ◽

Chirag Sheth ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Cardiovascular Events ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Major Adverse Cardiovascular Events ◽

Support Vector ◽

Post Transplant ◽

Extreme Gradient Boosting ◽

All Cause Mortality

Background: Post-transplant major adverse cardiovascular events (MACE) are among the leading causes of death among orthotopic liver transplant (OLT) recipients. Despite years of guideline-directed therapy, there are limited data on predictors of post-OLT MACE. We assessed whether machine learning algorithms (MLA) can predict MACE and all-cause mortality in patients undergoing OLT. Methods: We compared three MLA (support vector machine, extreme gradient boosting (XG-Boost) and random forest) with traditional logistic regression for prediction of MACE and all-cause mortality in a cohort of consecutive patients undergoing OLT at our center between 2008 and 2019. The cohort was randomly split into a training (80%) and a testing (20%) cohort. Model performance was assessed using the c-statistic or AUC. Results: We included 1,459 consecutive patients who underwent OLT, with mean ± SD age 54.2 ± 13.8 years, 32% female. There were 199 (13.6%) MACE and 289 (20%) deaths at a mean follow-up of 4.56 ± 3.3 years. The random forest MLA was the best-performing model for predicting MACE [AUC: 0.78, 95% CI: 0.70-0.85] as well as mortality [AUC: 0.69, 95% CI: 0.61-0.76], with all models performing better when predicting MACE vs mortality. See Table and Figure. Conclusion: Random forest machine learning algorithms were more predictive and discriminative than traditional regression models for predicting major adverse cardiovascular events and all-cause mortality in patients undergoing OLT. Validation and subsequent incorporation of MLA in clinical decision making for OLT candidacy could help risk stratify patients for post-transplant adverse cardiovascular events.
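For a binary outcome, the c-statistic used above equals the area under the ROC curve. It can be computed as the fraction of (event, non-event) pairs in which the event case receives the higher predicted risk, with ties counted as one half. The sketch below assumes that pairwise definition; the function name, labels and risk scores are hypothetical and unrelated to the study cohort.

```python
def c_statistic(labels, scores):
    """Pairwise concordance (equivalent to AUC) for binary labels 1/0."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    concordant = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                concordant += 1.0   # event ranked above non-event
            elif pos_score == neg_score:
                concordant += 0.5   # ties count as half
    return concordant / (len(positives) * len(negatives))


# Hypothetical test-set outcomes (1 = event) and predicted risks.
outcomes = [1, 1, 1, 0, 0, 0, 0]
risks = [0.9, 0.6, 0.4, 0.7, 0.4, 0.3, 0.1]
auc = c_statistic(outcomes, risks)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect discrimination. The quadratic loop is fine for a sketch, though a rank-based formula scales better on large cohorts.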


Damage Classification of Composites Using Machine Learning

Volume 13: Safety Engineering, Risk, and Reliability Analysis ◽

10.1115/imece2019-11851 ◽

2019 ◽

Author(s):

Shweta Dabetwar ◽

Stephen Ekwaro-Osire ◽

João Paulo Dias

Keyword(s):

Machine Learning ◽

Composite Materials ◽

Random Forest ◽

Condition Monitoring ◽

Machine Learning Algorithms ◽

Support Vector ◽

Damage Classification ◽

Combining Data ◽

Ultrasonic Measurements

Abstract: Composite materials have tremendous and ever-increasing applications in complex engineering systems; thus, it is important to develop non-destructive and efficient condition monitoring methods to improve damage prediction, thereby avoiding catastrophic failures and reducing standby time. Non-destructive condition monitoring techniques, when combined with machine learning applications, can contribute towards these improvements. Thus, the research question considered in this paper is “Can machine learning techniques provide efficient damage classification of composite materials to improve condition monitoring using features extracted from acousto-ultrasonic measurements?” To answer this question, acousto-ultrasonic signals in Carbon Fiber Reinforced Polymer (CFRP) composites for distinct damage levels were taken from the NASA Ames prognostics data repository. Statistical condition indicators of the signals were used as features to train and test four traditional machine learning algorithms (K-nearest neighbors, support vector machine, Decision Tree and Random Forest), and their performance was compared and discussed. Results showed higher accuracy for Random Forest, with a strong dependency on the feature extraction/selection techniques employed. By combining data analysis from acousto-ultrasonic measurements in composite materials with machine learning tools, this work contributes to the development of intelligent damage classification algorithms that can be applied to advanced online diagnostics and health management strategies of composite materials operating under more complex working conditions.


Prediction of novel mouse TLR9 agonists using a random forest approach

BMC Molecular and Cell Biology ◽

10.1186/s12860-019-0241-0 ◽

2019 ◽

Vol 20 (S2) ◽

Author(s):

Varun Khanna ◽

Lei Li ◽

Johnson Fung ◽

Shoba Ranganathan ◽

Nikolai Petrovsky

Keyword(s):

Machine Learning ◽

Random Forest ◽

Correlation Coefficient ◽

Matthews Correlation Coefficient ◽

Learning Algorithms ◽

Ensemble Classifier ◽

Innate Immune ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Algorithm

Background: Toll-like receptor 9 (TLR9) is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODN) containing unmethylated cytosine-guanine (CpG) motifs. Due to the considerable number of rotatable bonds in ODNs, high-throughput in silico screening of CpG ODNs for potential TLR9 activity via traditional structure-based virtual screening approaches is challenging. In the current study, we present a machine learning based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including the count and position of motifs, the distance between the motifs, and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling.
Results: Using in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed other algorithms on our dataset for TLR9 activity prediction. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier in test samples were 0.61 and 80.0%, respectively, with a maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including ‘CC’, ‘GG’, ‘AG’, ‘CCCG’ and ‘CGGC’ were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked, and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay, with 91 of the 100 selected ODNs showing high activity, confirming the accuracy of the model in predicting mTLR9 activity.
Conclusion: We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms including support vector machines, shrinkage discriminant analysis, gradient boosting machine and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
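Repeated random down-sampling and the two reported metrics can be sketched as follows. This is an illustration under stated assumptions: that each ensemble member is trained on one balanced subset (the abstract does not spell this out), that 20 subsets are drawn to mirror the 20 models mentioned, and that labels are coded 1/0. All function names and the toy counts are hypothetical, and the random forest training itself is omitted.

```python
import random
from math import sqrt


def balanced_subsets(labels, n_subsets=20, seed=0):
    """Yield index lists with the majority class down-sampled to the minority size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    for _ in range(n_subsets):
        # Keep every minority sample; draw a fresh random majority sample each time.
        yield sorted(minority + rng.sample(majority, len(minority)))


def mcc_and_balanced_accuracy(tp, tn, fp, fn):
    """Matthews correlation coefficient and balanced accuracy from confusion counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return mcc, (sensitivity + specificity) / 2


# Hypothetical imbalanced label vector: 3 actives, 9 inactives.
toy_labels = [1] * 3 + [0] * 9
subsets = list(balanced_subsets(toy_labels))
mcc, bal_acc = mcc_and_balanced_accuracy(tp=8, tn=7, fp=3, fn=2)
```

Both metrics weigh the two classes evenly, which is why they are preferred over plain accuracy when the classes are imbalanced.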


Can Gut Microbiota Be a Good Predictor for Parkinson’s Disease? A Machine Learning Approach

Brain Sciences ◽

10.3390/brainsci10040242 ◽

2020 ◽

Vol 10 (4) ◽

pp. 242 ◽

Cited By ~ 3

Author(s):

Daniele Pietrucci ◽

Adelaide Teofani ◽

Valeria Unida ◽

Rocco Cerroni ◽

Silvia Biocca ◽

...

Keyword(s):

Machine Learning ◽

Parkinson's Disease ◽

Random Forest ◽

Gut Microbiota ◽

Biological Data ◽

Machine Learning Algorithms ◽

Support Vector ◽

Published Data ◽

Promising Tool

The involvement of the gut microbiota in Parkinson’s disease (PD), investigated in several studies, identified some common alterations of the microbial community, such as a decrease in Lachnospiraceae and an increase in Verrucomicrobiaceae families in PD patients. However, the results of other bacterial families are often contradictory. Machine learning is a promising tool for building predictive models for the classification of biological data, such as those produced in metagenomic studies. We tested three different machine learning algorithms (random forest, neural networks and support vector machines), analyzing 846 metagenomic samples (472 from PD patients and 374 from healthy controls), including our published data and those downloaded from public databases. Prediction performance was evaluated by the area under curve, accuracy, precision, recall and F-score metrics. The random forest algorithm provided the best results. Bacterial families were sorted according to their importance in the classification, and a subset of 22 families has been identified for the prediction of patient status. Although the results are promising, it is necessary to train the algorithm with a larger number of samples in order to increase the accuracy of the procedure.
