Simulation-assisted machine learning

2019 ◽  
Vol 35 (20) ◽  
pp. 4072-4080 ◽  
Author(s):  
Timo M Deist ◽  
Andrew Patti ◽  
Zhaoqi Wang ◽  
David Krane ◽  
Taylor Sorenson ◽  
...  

Abstract Motivation In a predictive modeling setting, if sufficient details of the system behavior are known, one can build and use a simulation for making predictions. When sufficient system details are not known, one typically turns to machine learning, which builds a black-box model of the system using a large dataset of input sample features and outputs. We consider a setting which is between these two extremes: some details of the system mechanics are known but not enough for creating simulations that can be used to make high quality predictions. In this context we propose using approximate simulations to build a kernel for use in kernelized machine learning methods, such as support vector machines. The results of multiple simulations (under various uncertainty scenarios) are used to compute similarity measures between every pair of samples: sample pairs are given a high similarity score if they behave similarly under a wide range of simulation parameters. These similarity values, rather than the original high dimensional feature data, are used to build the kernel. Results We demonstrate and explore the simulation-based kernel (SimKern) concept using four synthetic complex systems—three biologically inspired models and one network flow optimization model. We show that, when the number of training samples is small compared to the number of features, the SimKern approach dominates over no-prior-knowledge methods. This approach should be applicable in all disciplines where predictive models are sought and informative yet approximate simulations are available. Availability and implementation The Python SimKern software, the demonstration models (in MATLAB, R), and the datasets are available at https://github.com/davidcraft/SimKern. Supplementary information Supplementary data are available at Bioinformatics online.
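The kernel construction described above can be sketched in a few lines. The following is a minimal illustration, not the SimKern package itself: each sample is run through a placeholder approximate simulation under many sampled parameter scenarios, pairs of samples are scored by how similarly their simulated readouts behave across scenarios, and the resulting similarity matrix is passed to a precomputed-kernel SVM. The simulation function, the Gaussian similarity, and the toy data are all assumptions made for the example.

```python
# A minimal sketch of the SimKern idea (not the SimKern package itself):
# run a placeholder approximate simulation of every sample under many sampled
# parameter scenarios, score each pair of samples by how similarly their
# simulated readouts behave across scenarios, and use the resulting similarity
# matrix as a precomputed SVM kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def simulate(sample, params):
    """Placeholder approximate simulation returning a scalar system readout."""
    return np.tanh(sample @ params)

def simkern_matrix(X, n_scenarios=50):
    """Similarity from agreement of simulated readouts across scenarios."""
    n, d = X.shape
    readouts = np.empty((n, n_scenarios))
    for k in range(n_scenarios):
        params = rng.normal(size=d)                    # one uncertainty scenario
        readouts[:, k] = [simulate(x, params) for x in X]
    dists = np.linalg.norm(readouts[:, None, :] - readouts[None, :, :], axis=-1)
    return np.exp(-((dists / dists.std()) ** 2))       # Gaussian similarity

# toy setting: few training samples, many features
X_train, y_train = rng.normal(size=(30, 200)), rng.integers(0, 2, 30)
K_train = simkern_matrix(X_train)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
```

Predicting new samples would additionally require simulating them under the same scenarios and computing their similarities to the training samples.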

Author(s):  
Simone Ciccolella ◽  
Giulia Bernardini ◽  
Luca Denti ◽  
Paola Bonizzoni ◽  
Marco Previtali ◽  
...  

Abstract Motivation The latest advances in cancer sequencing, and the availability of a wide range of methods to infer the evolutionary history of tumors, have made it important to evaluate, reconcile and cluster different tumor phylogenies. Recently, several notions of distance or similarity have been proposed in the literature, but none of them has emerged as the gold standard. Moreover, none of the known similarity measures is able to manage mutations occurring multiple times in the tree, a circumstance often occurring in real cases. Results To overcome these limitations, in this article we propose MP3, the first similarity measure for tumor phylogenies able to effectively manage cases where multiple mutations can occur at the same node and mutations can occur multiple times. Moreover, a comparison of MP3 with other measures shows that it correctly classifies similar and dissimilar trees, on both simulated and real data. Availability and implementation An open source implementation of MP3 is publicly available at https://github.com/AlgoLab/mp3treesim. Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Mohammad Hassan Almaspoor ◽  
Ali Safaei ◽  
Afshin Salajegheh ◽  
Behrouz Minaei-Bidgoli

Abstract Classification is one of the most important and widely used tasks in machine learning; its purpose is to create a rule for assigning data to pre-existing categories based on a set of training samples. Employed successfully in many scientific and engineering areas, the Support Vector Machine (SVM) is among the most promising classification methods in machine learning. With the advent of big data, many machine learning methods have been challenged by big data characteristics. The standard SVM was proposed for batch learning, in which all data are available at the same time. The SVM has a high time complexity, i.e., increasing the number of training samples intensifies the need for computational resources and memory. Hence, many attempts have been made to adapt the SVM to online learning conditions and to large-scale data. This paper focuses on the analysis, identification, and classification of existing methods for adapting the SVM to online conditions and large-scale data. These methods can be employed to classify big data, and the paper proposes research areas for future studies. Considering its advantages, the SVM can be among the first options for adaptation to big data and for big data classification. For this purpose, appropriate techniques should be developed for data preprocessing in order to convert data into a form suitable for learning. Existing frameworks for parallel and distributed processing should also be employed so that SVMs can be made scalable and properly online, enabling them to handle big data.
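As a concrete, purely illustrative example of the online-learning direction surveyed here (not a method from the paper), the sketch below trains a linear SVM incrementally with scikit-learn's SGDClassifier (hinge loss), consuming the data chunk by chunk so the full dataset never needs to reside in memory; the chunked data and its size are placeholders.

```python
# Illustrative sketch: a linear SVM trained online with stochastic gradient
# descent, consuming data chunk by chunk so the full dataset never has to fit
# in memory at once.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="hinge", alpha=1e-4)     # hinge loss = linear SVM objective
classes = np.array([0, 1])

for _ in range(100):                              # e.g., chunks streamed from disk
    X = rng.normal(size=(1_000, 50))
    y = (X[:, 0] + 0.1 * rng.normal(size=1_000) > 0).astype(int)
    clf.partial_fit(X, y, classes=classes)        # incremental update, chunk-sized memory

X_test = rng.normal(size=(500, 50))
y_test = (X_test[:, 0] > 0).astype(int)
print("held-out accuracy:", clf.score(X_test, y_test))
```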


2019 ◽  
Vol 36 (1) ◽  
pp. 272-279 ◽  
Author(s):  
Hannah F Löchel ◽  
Dominic Eger ◽  
Theodor Sperlea ◽  
Dominik Heider

Abstract Motivation Classification of protein sequences is a major task in bioinformatics and has many applications. Different machine learning methods exist and are applied to these problems, such as support vector machines (SVM), random forests (RF) and neural networks (NN). All of these methods have in common that protein sequences must first be made machine-readable and comparable, for which different encodings exist. These encodings are typically based on physical or chemical properties of the sequence. However, due to the outstanding performance of deep neural networks (DNN) on image recognition, we used frequency matrix chaos game representation (FCGR) to encode protein sequences into images. In this study, we compare the performance of SVMs, RFs and DNNs trained on FCGR-encoded protein sequences. While the original chaos game representation (CGR) has been used mainly for genome sequence encoding and classification, we modified it to also work for protein sequences, resulting in an n-flakes representation, an image composed of several icosagons. Results We show that all applied machine learning techniques (RF, SVM and DNN) yield promising results compared to state-of-the-art methods on our benchmark datasets, with DNNs outperforming the other methods, and that FCGR is a promising new encoding method for protein sequences. Availability and implementation https://cran.r-project.org/. Supplementary information Supplementary data are available at Bioinformatics online.
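To make the encoding step concrete, here is a minimal sketch of a frequency-matrix chaos game representation for a protein sequence: each amino acid is assigned a vertex of a regular 20-gon and the chaos-game point moves a fixed fraction of the way toward the vertex of the current residue, with visits binned into an image. The contraction ratio below is only a placeholder; the n-flakes construction in the paper chooses the ratio so that sub-flakes do not overlap, and this sketch is not the published CRAN implementation.

```python
# Minimal FCGR sketch for protein sequences (not the published implementation):
# each amino acid maps to a vertex of a regular 20-gon, the chaos-game point
# moves a fixed fraction toward that vertex, and visited points are binned
# into a frequency matrix (image). The contraction ratio is a placeholder.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
ANGLES = 2 * np.pi * np.arange(20) / 20
VERTICES = np.stack([np.cos(ANGLES), np.sin(ANGLES)], axis=1)

def fcgr(sequence, resolution=64, ratio=0.86):
    img = np.zeros((resolution, resolution))
    point = np.zeros(2)
    for aa in sequence:
        if aa not in AA:
            continue
        vertex = VERTICES[AA.index(aa)]
        point = point + ratio * (vertex - point)            # chaos-game step
        # map [-1, 1]^2 coordinates to pixel indices and count the visit
        i = min(int((point[1] + 1) / 2 * resolution), resolution - 1)
        j = min(int((point[0] + 1) / 2 * resolution), resolution - 1)
        img[i, j] += 1
    return img / max(img.sum(), 1)                          # frequency matrix

image = fcgr("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```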


Sensors ◽  
2020 ◽  
Vol 20 (11) ◽  
pp. 3144 ◽  
Author(s):  
Sherif Said ◽  
Ilyes Boulkaibet ◽  
Murtaza Sheikh ◽  
Abdullah S. Karar ◽  
Samer Alkork ◽  
...  

In this paper, a customizable wearable 3D-printed bionic arm is designed, fabricated, and optimized for a right-arm amputee. An experimental test was conducted with the user, in which control of the artificial bionic hand was accomplished successfully using surface electromyography (sEMG) signals acquired by a multi-channel wearable armband. The 3D-printed bionic arm was designed at a low cost of 295 USD and is lightweight at 428 g. To facilitate generic control of the bionic arm, sEMG data were collected for a set of gestures (fist, spread fingers, wave-in, wave-out) from a wide range of participants. The collected data were processed, and features related to the gestures were extracted for the purpose of training a classifier. In this study, several classifiers based on neural networks, support vector machines, and decision trees were constructed, trained, and statistically compared. The support vector machine classifier was found to exhibit an 89.93% success rate. Real-time testing of the bionic arm with the optimum classifier is demonstrated.
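A hedged sketch of this kind of pipeline is shown below: standard time-domain sEMG features (mean absolute value, root mean square, waveform length, zero crossings) are computed per channel from windowed multi-channel recordings and fed to an SVM. The channel count, window length, and feature choice are assumptions for the example, not the authors' exact feature set.

```python
# Illustrative sketch (not the authors' pipeline): extract standard time-domain
# sEMG features per channel from windowed multi-channel recordings and train an
# SVM gesture classifier. Channel count, window length and features are assumed.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def emg_features(window):
    """window: (n_samples, n_channels) sEMG segment -> flat feature vector."""
    mav = np.mean(np.abs(window), axis=0)                       # mean absolute value
    rms = np.sqrt(np.mean(window ** 2, axis=0))                 # root mean square
    wl = np.sum(np.abs(np.diff(window, axis=0)), axis=0)        # waveform length
    zc = np.sum(np.diff(np.sign(window), axis=0) != 0, axis=0)  # zero crossings
    return np.concatenate([mav, rms, wl, zc])

rng = np.random.default_rng(0)
# toy dataset: 200 windows of 200 samples x 8 channels, 4 gesture labels
windows = rng.normal(size=(200, 200, 8))
labels = rng.integers(0, 4, size=200)
X = np.array([emg_features(w) for w in windows])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10)).fit(X, labels)
```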


DYNA ◽  
2019 ◽  
Vol 86 (211) ◽  
pp. 32-41 ◽  
Author(s):  
Juan D. Pineda-Jaramillo

In recent decades, transportation planning researchers have used diverse types of machine learning (ML) algorithms to research a wide range of topics. This review paper starts with a brief explanation of some ML algorithms commonly used for transportation research, specifically Artificial Neural Networks (ANN), Decision Trees (DT), Support Vector Machines (SVM) and Cluster Analysis (CA). Then, the different methodologies used by researchers for modeling travel mode choice are collected and compared with the Multinomial Logit Model (MNL), the most commonly used discrete choice model. Finally, the characterization of ML algorithms is discussed, and Random Forest (RF), an ensemble method built on Decision Trees, is presented as the best methodology for modeling travel mode choice.
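Purely as an illustration of the recommended approach (not code from the review), a Random Forest mode-choice classifier on hypothetical trip attributes might look like this; the feature names and toy data are invented for the example.

```python
# Illustrative sketch only: a Random Forest mode-choice classifier on
# hypothetical trip attributes; the feature names and toy data are invented.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

trips = pd.DataFrame({
    "travel_time_min":   [12, 45, 30, 8, 60, 25, 15, 50],
    "cost":              [2.0, 1.5, 0.0, 0.0, 3.5, 1.5, 0.0, 4.0],
    "distance_km":       [4, 15, 6, 1, 22, 7, 5, 18],
    "cars_in_household": [1, 0, 1, 0, 2, 1, 0, 2],
    "mode":              ["car", "bus", "bike", "walk", "car", "bus", "walk", "car"],
})
X, y = trips.drop(columns="mode"), trips["mode"]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(rf.predict(X.head(2)))                          # toy data, illustration only
print(dict(zip(X.columns, rf.feature_importances_)))  # which attributes drive the choice
```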


2019 ◽  
Author(s):  
Adane Tarekegn ◽  
Fulvio Ricceri ◽  
Giuseppe Costa ◽  
Elisa Ferracin ◽  
Mario Giacobini

BACKGROUND Frailty is one of the most critical age-related conditions in older adults. It is often recognized as a syndrome of physiological decline in late life, characterized by a marked vulnerability to adverse health outcomes. A clear operational definition of frailty, however, has not yet been agreed upon. There is a wide range of studies on the detection of frailty and its association with mortality. Several of these studies have focused on the possible risk factors associated with frailty in the elderly population, while predicting who will be at increased risk of frailty is still overlooked in clinical settings. OBJECTIVE The objective of our study was to develop predictive models for frailty conditions in older people using different machine learning methods based on a database of clinical characteristics and socioeconomic factors. METHODS An administrative health database containing 1,095,612 people aged 65 or older, with 58 input variables and 6 output variables, was used. We first identified and defined six problems/outputs as surrogates of frailty. We then addressed the imbalanced nature of the data through a resampling process, and a comparative study between the different machine learning (ML) algorithms – artificial neural network (ANN), genetic programming (GP), support vector machines (SVM), random forest (RF), logistic regression (LR) and decision tree (DT) – was carried out. The performance of each model was evaluated using a separate unseen dataset. RESULTS Predicting the mortality outcome showed higher performance with ANN (TPR 0.81, TNR 0.76, accuracy 0.78, F1-score 0.79) and SVM (TPR 0.77, TNR 0.80, accuracy 0.79, F1-score 0.78) than predicting the other outcomes. On average, over the six problems, the DT classifier showed the lowest accuracy, while the other models (GP, LR, RF, ANN, and SVM) performed better. All models showed lower accuracy in predicting an emergency admission with red code than in predicting fracture and disability. In predicting urgent hospitalization, only SVM achieved better performance (TPR 0.75, TNR 0.77, accuracy 0.73, F1-score 0.76) with 10-fold cross-validation compared with the other models across all evaluation metrics. CONCLUSIONS We developed machine learning models for predicting frailty conditions (mortality, urgent hospitalization, disability, fracture, and emergency admission). The results show that the prediction performance of machine learning models varies significantly from problem to problem in terms of different evaluation metrics. Through further improvement, the best-performing models can be used as a base for developing decision-support tools to improve early identification and prediction of frail older adults.
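The imbalance handling and the reported metrics can be illustrated with a small sketch (not the study's code or data): the majority class of a binary outcome is randomly undersampled in the training set, two of the compared classifiers are fitted, and TPR, TNR, accuracy and F1 are computed on held-out data. The synthetic data and the choice of undersampling are assumptions for the example.

```python
# Illustrative sketch: rebalance a rare binary outcome by random undersampling
# of the majority class, train two of the compared classifiers, and report
# TPR, TNR, accuracy and F1 on held-out data. Data and settings are synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 20))
y = (X[:, 0] + rng.normal(scale=2, size=5_000) > 2.5).astype(int)   # rare positive class

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# random undersampling of the majority class in the training set
pos, neg = np.where(y_tr == 1)[0], np.where(y_tr == 0)[0]
keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
X_bal, y_bal = X_tr[keep], y_tr[keep]

for name, model in [("SVM", SVC()), ("ANN", MLPClassifier(max_iter=500))]:
    y_hat = model.fit(X_bal, y_bal).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
    print(name, "TPR", tp / (tp + fn), "TNR", tn / (tn + fp),
          "acc", accuracy_score(y_te, y_hat), "F1", f1_score(y_te, y_hat))
```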


Author(s):  
Xiaoming Li ◽  
Yan Sun ◽  
Qiang Zhang

In this paper, we focus on developing a novel method to extract sea ice cover (i.e., discrimination/classification of sea ice and open water) using Sentinel-1 (S1) cross-polarization (vertical-horizontal, VH, or horizontal-vertical, HV) data in extra wide (EW) swath mode, based on the machine learning algorithm support vector machine (SVM). The classification basis includes the S1 radar backscatter coefficients and texture features calculated from S1 data using the gray level co-occurrence matrix (GLCM). Different from previous methods, in which appropriate samples are manually selected to train the SVM to classify sea ice and open water, we propose a method of unsupervised generation of training samples based on two GLCM texture features, i.e., entropy and homogeneity, which have contrasting characteristics over sea ice and open water. This eliminates most of the uncertainty of selecting training samples in machine learning and achieves automatic classification of sea ice and open water from S1 EW data. A comparison shows good agreement between the SAR-derived sea ice cover obtained with the proposed method and visual inspection, with accuracy reaching approximately 90-95% based on a few cases. In addition, compared with the analyzed sea ice cover data of the Ice Mapping System (IMS), based on 728 S1 EW images, the accuracy of the sea ice cover extracted from S1 data is more than 80%.
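A hedged sketch of the feature and seeding steps is given below; it is not the authors' implementation. GLCM homogeneity and entropy are computed over patches of a quantized backscatter image, clearly homogeneous patches are automatically labelled as ice and clearly heterogeneous ones as open water (here via quantile thresholds, a stand-in for the paper's rule), and an SVM is trained on these auto-generated samples. The patch size, quantization, and thresholds are assumptions.

```python
# Illustrative sketch (not the authors' implementation): compute GLCM
# homogeneity and entropy per patch, auto-label the most homogeneous patches
# as ice and the least homogeneous as open water, then train an SVM on these
# auto-generated samples and classify every patch.
import numpy as np
from skimage.feature import graycomatrix, graycoprops
from sklearn.svm import SVC

def glcm_features(patch, levels=32):
    """patch: 2D uint8 array quantized to `levels` gray levels."""
    glcm = graycomatrix(patch, distances=[1], angles=[0], levels=levels,
                        symmetric=True, normed=True)
    homogeneity = graycoprops(glcm, "homogeneity")[0, 0]
    p = glcm[:, :, 0, 0]
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return homogeneity, entropy

rng = np.random.default_rng(0)
# stand-in patches: blocky "ice-like" texture vs. noisy "water-like" texture
smooth = np.repeat(np.repeat(rng.random((250, 4, 4)) * 31, 4, axis=1), 4, axis=2)
noisy = rng.random((250, 16, 16)) * 31
patches = np.concatenate([smooth, noisy]).astype(np.uint8)

feats = np.array([glcm_features(p) for p in patches])

# unsupervised seeding via quantile thresholds on homogeneity (assumption)
hi_q, lo_q = np.quantile(feats[:, 0], [0.8, 0.2])
ice, water = feats[:, 0] >= hi_q, feats[:, 0] <= lo_q
X_train = np.vstack([feats[ice], feats[water]])
y_train = np.concatenate([np.ones(ice.sum()), np.zeros(water.sum())])

clf = SVC(kernel="rbf").fit(X_train, y_train)
labels = clf.predict(feats)                     # ice (1) / open water (0) per patch
```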


Author(s):  
D. Wang ◽  
M. Hollaus ◽  
N. Pfeifer

Classification of wood and leaf components of trees is an essential prerequisite for deriving vital tree attributes, such as wood mass, leaf area index (LAI) and woody-to-total area. Laser scanning has emerged as a promising solution for this task. Intensity-based approaches are widely proposed, as different components of a tree can exhibit discriminative optical properties at the operating wavelengths of a sensor system. For geometry-based methods, machine learning algorithms are often used to separate wood and leaf points, given proper training samples. However, it remains unclear how the chosen machine learning classifier and the features used influence classification results. To this end, we compare four popular machine learning classifiers, namely Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), and Gaussian Mixture Model (GMM), for separating wood and leaf points from terrestrial laser scanning (TLS) data. Two trees, an Erythrophleum fordii and a Betula pendula (silver birch), are used to test the impacts of classifier, feature set, and training samples. Our results show that RF is the best model in terms of accuracy, and that local-density-related features are important. The experimental results confirm the feasibility of machine learning algorithms for the reliable classification of wood and leaf points. It is also noted that our study is based on isolated trees. Further tests should be performed on more tree species and on data from more complex environments.
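The comparison can be illustrated with a small sketch (not the study's code): per-point geometric features (eigenvalue-based shape measures and a local-density proxy from a k-neighbourhood) are computed for a toy point cloud and the supervised classifiers named above are compared; GMM, being unsupervised, is fitted separately. The features, labels, and parameters are assumptions for the example.

```python
# Illustrative sketch: compare classifiers on per-point geometric features for
# a wood-vs-leaf labelling task. Features, toy labels and parameters are assumed.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def point_features(points, k=20):
    tree = cKDTree(points)
    dists, idx = tree.query(points, k=k)
    feats = []
    for i in range(len(points)):
        nbrs = points[idx[i]]
        evals = np.linalg.eigvalsh(np.cov(nbrs.T))[::-1]     # l1 >= l2 >= l3
        evals = np.clip(evals, 1e-9, None)
        linearity = (evals[0] - evals[1]) / evals[0]
        planarity = (evals[1] - evals[2]) / evals[0]
        density = k / (dists[i, -1] ** 3 + 1e-9)             # local density proxy
        feats.append([linearity, planarity, density])
    return np.array(feats)

rng = np.random.default_rng(0)
pts = rng.normal(size=(2_000, 3))
labels = (pts[:, 2] > 0).astype(int)                          # toy wood/leaf labels
X_tr, X_te, y_tr, y_te = train_test_split(point_features(pts), labels, random_state=0)

for name, m in {"SVM": SVC(), "NB": GaussianNB(), "RF": RandomForestClassifier()}.items():
    print(name, m.fit(X_tr, y_tr).score(X_te, y_te))

# GMM is unsupervised: fit two components, then map components to classes afterwards
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_tr)
```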


2020 ◽  
Author(s):  
Mazin Mohammed ◽  
Karrar Hameed Abdulkareem ◽  
Mashael S. Maashi ◽  
Salama A. Mostafa ◽  
Abdullah Baz ◽  
...  

BACKGROUND In recent times, global concern has been caused by the coronavirus disease (COVID-19), which is considered a global health threat due to its rapid spread across the globe. Machine learning (ML) is a computational method that can be used to learn automatically from experience and improve the accuracy of predictions. OBJECTIVE In this study, machine learning was applied to a coronavirus dataset of 50 X-ray images to enable the development of directions and detection modalities with risk causes. The dataset contains a wide range of samples of COVID-19 cases alongside SARS, MERS, and ARDS. The experiment was carried out using a total of 50 X-ray images, of which 25 were positive COVID-19 cases and the other 25 were normal cases. METHODS The Orange tool was used for data manipulation. To classify patients as coronavirus carriers or non-carriers, this tool was employed to develop and analyse seven types of predictive models: artificial neural network (ANN), support vector machine (SVM) with linear kernel and radial basis function (RBF), k-nearest neighbour (k-NN), decision tree (DT), and CN2 rule inducer. Furthermore, the standard InceptionV3 model was used for feature extraction. RESULTS The various machine learning techniques were trained on the coronavirus disease 2019 (COVID-19) dataset with improved ML parameters. The dataset was divided into two parts, training and testing: the models were trained using 70% of the dataset, while the remaining 30% was used for testing. The results show that the improved SVM achieved an F1-score of 97% and an accuracy of 98%. CONCLUSIONS In this study, seven models were developed to aid the detection of coronavirus. In such cases, the learning performance can be improved through knowledge transfer, whereby time-consuming data labelling efforts are not required. The evaluations of all the models were done in terms of different parameters. It can be concluded that all the models performed well, but the SVM demonstrated the best result for the accuracy metric. Future work will compare classical approaches with deep learning ones and try to obtain better results. CLINICALTRIAL None
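The feature-extraction step can be sketched as follows (illustrative only, not the study's Orange workflow): a pretrained InceptionV3 is used as a fixed feature extractor and an SVM is trained on the resulting vectors with a 70/30 split. The image data below are random placeholders standing in for the 50 X-ray images.

```python
# Illustrative sketch: pretrained InceptionV3 as a fixed feature extractor,
# followed by an SVM classifier on the extracted feature vectors.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def features(images):
    """images: array of shape (n, 299, 299, 3) with raw pixel values."""
    return extractor.predict(preprocess_input(images.astype("float32")), verbose=0)

# placeholder data standing in for the 50 X-ray images (25 COVID-19, 25 normal)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(50, 299, 299, 3)).astype("float32")
labels = np.array([1] * 25 + [0] * 25)

X = features(images)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3,
                                          stratify=labels, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```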


2021 ◽  
Vol 143 (11) ◽  
Author(s):  
Zeeshan Tariq ◽  
Mohamed Mahmoud ◽  
Abdulazeez Abdulraheem

Abstract Pressure–volume–temperature (PVT) properties of crude oil are considered the most important properties in petroleum engineering applications, as they are used in virtually every reservoir and production engineering calculation. Determining these properties in the laboratory is the most accurate way to obtain representative values; at the same time, it is very expensive. In the absence of such facilities, other approaches such as analytical solutions and empirical correlations are used to estimate the PVT properties. This study demonstrates the combined use of two machine learning (ML) techniques, viz., a functional network (FN) coupled with particle swarm optimization (PSO), in predicting black oil PVT properties such as bubble point pressure (Pb), oil formation volume factor at Pb, and oil viscosity at Pb. This study also proposes new mathematical models, derived from the coupled FN-PSO model, to estimate these properties. The use of the proposed mathematical models does not need any ML engine for execution. A total of 760 data points collected from different sources were preprocessed and utilized to build and train the machine learning models. The data covered a wide range of values that are quite reasonable in petroleum engineering applications. The performance of the developed models was tested against the most used empirical correlations. The results showed that the proposed PVT models outperformed previous models, demonstrating an error of at most 2%. The proposed FN-PSO models were also compared with other ML techniques such as artificial neural networks, support vector regression, and the adaptive neuro-fuzzy inference system; the results showed that the proposed FN-PSO models outperformed these ML techniques.
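Purely as an illustration of the PSO side of the coupling (not the paper's FN-PSO model), the sketch below uses a bare-bones particle swarm to tune the coefficients of an assumed power-law bubble-point correlation against synthetic data; the functional form, bounds, and data are placeholders.

```python
# Illustrative sketch only: a bare-bones particle swarm optimizer tuning the
# coefficients of an assumed power-law bubble-point correlation
# Pb = a * (Rs / gamma_g)**b * 10**(c*API) * T**d against synthetic data.
import numpy as np

rng = np.random.default_rng(0)

# placeholder dataset: solution GOR, gas gravity, API, temperature, measured Pb
Rs, gg = rng.uniform(100, 1500, 200), rng.uniform(0.6, 1.2, 200)
API, T = rng.uniform(20, 45, 200), rng.uniform(120, 260, 200)
Pb = 18.2 * (Rs / gg) ** 0.83 * 10 ** (-0.00091 * API) * T ** -0.1
Pb = Pb * (1 + 0.05 * rng.normal(size=200))                   # noisy "measurements"

def predict(c):
    a, b, cc, d = c
    return a * (Rs / gg) ** b * 10 ** (cc * API) * T ** d

def loss(c):
    return np.mean((predict(c) - Pb) ** 2)

# particle swarm optimization over the four coefficients within assumed bounds
n_particles, dims, iters = 30, 4, 200
lo_b, hi_b = np.array([0, 0.5, -0.01, -1.0]), np.array([50, 1.5, 0.01, 1.0])
pos = rng.uniform(lo_b, hi_b, (n_particles, dims))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([loss(p) for p in pos])
gbest = pbest[pbest_val.argmin()]

for _ in range(iters):
    r1, r2 = rng.random((2, n_particles, dims))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo_b, hi_b)
    vals = np.array([loss(p) for p in pos])
    better = vals < pbest_val
    pbest[better], pbest_val[better] = pos[better], vals[better]
    gbest = pbest[pbest_val.argmin()]

print("fitted coefficients:", gbest)
```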

