Supervised machine learning using encrypted training data

2017 ◽  
Vol 17 (4) ◽  
pp. 365-377 ◽  
Author(s):  
Francisco-Javier González-Serrano ◽  
Adrián Amor-Martín ◽  
Jorge Casamayón-Antón


2021 ◽  
Vol 13 (3) ◽  
pp. 368
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell ◽  
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variations in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
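The sample-size experiment described above can be sketched with scikit-learn on synthetic data. This is an illustration only: the study used GEOBIA image-object features (spectral, geometric, and texture variables), which are stood in for here by `make_classification` features, and only two of the six algorithms are shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for image-object features (spectral, geometry, texture).
X, y = make_classification(n_samples=12000, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=2000, stratify=y, random_state=0)

results = {}
for n_train in (10000, 315, 40):  # sample sizes highlighted in the abstract
    for name, clf in (("RF", RandomForestClassifier(random_state=0)),
                      ("SVM", SVC(random_state=0))):
        clf.fit(X_pool[:n_train], y_pool[:n_train])
        results[(name, n_train)] = accuracy_score(y_test, clf.predict(X_test))
        print(f"n={n_train:>5}  {name:3s}  OA={results[(name, n_train)]:.3f}")
```

On real GEOBIA data the abstract's finding is that RF's overall accuracy degrades only slightly as `n_train` shrinks, while SVM is more sensitive; a synthetic run will not reproduce those exact margins.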


Author(s):  
Kazuko Fuchi ◽  
Eric M. Wolf ◽  
David S. Makhija ◽  
Nathan A. Wukie ◽  
Christopher R. Schrock ◽  
...  

Abstract A machine learning algorithm that performs multifidelity domain decomposition is introduced. While the design of complex systems can be facilitated by numerical simulations, the determination of appropriate physics couplings and levels of model fidelity can be challenging. The proposed method automatically divides the computational domain into subregions and assigns the required fidelity level to each, using a small number of high-fidelity simulations to generate training data and low-fidelity solutions as input data. Unsupervised and supervised machine learning algorithms are used to correlate features from low-fidelity solutions with fidelity assignments. The effectiveness of the method is demonstrated on a problem of viscous fluid flow around a cylinder at Re ≈ 20. Ling et al. built physics-informed invariance and symmetry properties into machine learning models and demonstrated improved model generalizability. Along these lines, we avoid using problem-dependent features such as coordinates of sample points, object geometry, or flow conditions as explicit inputs to the machine learning model. The use of pointwise flow features generates large data sets from only one or two high-fidelity simulations, and the fidelity predictor model achieved 99.5% accuracy at training points. The trained model was shown to be capable of predicting a fidelity map for a problem with an altered cylinder radius. A significant improvement in prediction performance was seen when the inputs were expanded to include multiscale features that incorporate neighborhood information.
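The closing idea, multiscale features that incorporate neighborhood information, can be illustrated with a small NumPy sketch: a pointwise scalar field on a 2-D grid is augmented with local means at several window sizes, so each sample point carries information about its surroundings. The window sizes and the box-filter construction are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def multiscale_features(field, scales=(1, 3, 7)):
    """Augment a pointwise field on a 2-D grid with neighborhood means
    at several (odd) window sizes, computed with an integral image."""
    feats = [field]                      # scale 1: the raw pointwise value
    for k in scales[1:]:
        pad = k // 2
        padded = np.pad(field, pad, mode="edge")
        s = padded.cumsum(axis=0).cumsum(axis=1)   # integral image
        s = np.pad(s, ((1, 0), (1, 0)))            # zero row/col for sums
        win = (s[k:, k:] - s[:-k, k:] - s[k:, :-k] + s[:-k, :-k]) / (k * k)
        feats.append(win)
    return np.stack(feats, axis=-1)      # shape (H, W, len(scales))

field = np.arange(64, dtype=float).reshape(8, 8)
feats = multiscale_features(field)
print(feats.shape)   # (8, 8, 3)
```

A classifier trained on `feats` instead of `field` alone then sees each point together with its 3×3 and 7×7 neighborhood averages, which is one simple way to give a pointwise model the neighborhood context the abstract credits for the improved predictions.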


eLife ◽  
2021 ◽  
Vol 10 ◽  
Author(s):  
James P Bohnslav ◽  
Nivanthika K Wimalasena ◽  
Kelsey J Clausing ◽  
Yu Y Dai ◽  
David A Yarmolinsky ◽  
...  

Videos of animal behavior are used to quantify researcher-defined behaviors-of-interest to study neural function, gene mutations, and pharmacological therapies. Behaviors-of-interest are often scored manually, which is time-consuming, limited to few behaviors, and variable across researchers. We created DeepEthogram: software that uses supervised machine learning to convert raw video pixels into an ethogram, the behaviors-of-interest present in each video frame. DeepEthogram is designed to be general-purpose and applicable across species, behaviors, and video-recording hardware. It uses convolutional neural networks to compute motion, extract features from motion and images, and classify features into behaviors. Behaviors are classified with above 90% accuracy on single frames in videos of mice and flies, matching expert-level human performance. DeepEthogram accurately predicts rare behaviors, requires little training data, and generalizes across subjects. A graphical interface allows beginning-to-end analysis without end-user programming. DeepEthogram's rapid, automatic, and reproducible labeling of researcher-defined behaviors-of-interest may accelerate and enhance supervised behavior analysis.


2019 ◽  
Vol 141 (12) ◽  
Author(s):  
Conner Sharpe ◽  
Tyler Wiest ◽  
Pingfeng Wang ◽  
Carolyn Conner Seepersad

Abstract Supervised machine learning techniques have proven to be effective tools for engineering design exploration and optimization applications, in which they are especially useful for mapping promising or feasible regions of the design space. The design space mappings can be used to inform early-stage design exploration, provide reliability assessments, and aid convergence in multiobjective or multilevel problems that require collaborative design teams. However, the accuracy of the mappings can vary based on problem factors such as the number of design variables, presence of discrete variables, multimodality of the underlying response function, and amount of training data available. Additionally, there are several useful machine learning algorithms available, and each has its own set of algorithmic hyperparameters that significantly affect accuracy and computational expense. This work elucidates the use of machine learning for engineering design exploration and optimization problems by investigating the performance of popular classification algorithms on a variety of example engineering optimization problems. The results are synthesized into a set of observations to provide engineers with intuition for applying these techniques to their own problems in the future, as well as recommendations based on problem type to aid engineers in algorithm selection and utilization.
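The core use case above, mapping the feasible region of a design space with a classifier, can be sketched as follows. The ring-shaped constraint is invented for illustration and stands in for an expensive engineering analysis; the paper compares several algorithms, of which only an RBF support vector classifier is shown here.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy design space: two design variables in [-1, 1]^2; feasible designs
# lie in a ring (an invented constraint standing in for expensive analyses).
X = rng.uniform(-1, 1, size=(500, 2))
r = np.hypot(X[:, 0], X[:, 1])
y = ((r > 0.4) & (r < 0.9)).astype(int)   # 1 = feasible

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)

# The trained mapping screens new candidate designs without rerunning
# the (notionally expensive) constraint evaluation.
candidates = np.array([[0.0, 0.6], [0.0, 0.0]])
print(clf.predict(candidates))
```

As the abstract notes, the accuracy of such a mapping depends on the number of design variables, multimodality of the response, and the training budget, so the kernel and its hyperparameters would normally be tuned per problem.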


2020 ◽  
Vol 12 (14) ◽  
pp. 2244
Author(s):  
Luis Moya ◽  
Erick Mas ◽  
Shunichi Koshimura

Applications of machine learning on remote sensing data appear to be endless. Its use in damage identification for early response in the aftermath of a large-scale disaster, however, faces a specific issue: the collection of training data right after a disaster is costly, time-consuming, and often impossible. This study analyzes a possible solution: the collection of training data from past disaster events to calibrate a discriminant function, so that the identification of affected areas in a current disaster can be performed in near real-time. This paper reports the performance of a supervised machine learning classifier trained on data collected from the 2018 heavy rainfall in Okayama Prefecture, Japan, and used to identify floods caused by Typhoon Hagibis in eastern Japan on 12 October 2019. The results show moderate agreement with flood maps provided by local governments and public institutions, and support the assumption that information from previous disasters can be used to identify a current disaster in near real-time.
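The transfer idea, calibrating a discriminant function on a past event and applying it to a new one, can be sketched with synthetic features. The Gaussian per-pixel features and the choice of linear discriminant analysis are assumptions for illustration; the paper's actual features come from remote sensing change detection.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

def synthetic_event(n, shift):
    """Synthetic stand-in for per-pixel change features; 'shift'
    separates flooded pixels from intact ones."""
    flooded = rng.normal(loc=shift, scale=1.0, size=(n, 2))
    intact = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
    X = np.vstack([flooded, intact])
    y = np.r_[np.ones(n), np.zeros(n)]
    return X, y

# Calibrate on the "past" event; apply to a "current" event whose
# feature distribution is slightly shifted (imperfect transfer).
X_past, y_past = synthetic_event(400, shift=3.0)
X_now, y_now = synthetic_event(400, shift=2.5)

clf = LinearDiscriminantAnalysis().fit(X_past, y_past)
acc = (clf.predict(X_now) == y_now).mean()
print(f"agreement on the new event: {acc:.2f}")
```

The distribution shift between the two synthetic events mimics the paper's finding of moderate (not perfect) agreement when a discriminant calibrated on one disaster is applied to another.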


Geophysics ◽  
2019 ◽  
Vol 84 (1) ◽  
pp. V67-V79 ◽  
Author(s):  
Yazeed Alaudah ◽  
Motaz Alfarraj ◽  
Ghassan AlRegib

Recently, there has been significant interest in various supervised machine learning techniques that can help reduce the time and effort consumed by manual interpretation workflows. However, most successful supervised machine learning algorithms require huge amounts of annotated training data. Obtaining these labels for large seismic volumes is a very time-consuming and laborious task. We have addressed this problem by presenting a weakly supervised approach for predicting the labels of various seismic structures. By having an interpreter select a very small number of exemplar images for every class of subsurface structures, we use a novel similarity-based retrieval technique to extract thousands of images that contain similar subsurface structures from the seismic volume. By assuming that similar images belong to the same class, we obtain thousands of image-level labels for these images; we validate this assumption. We have evaluated a novel weakly supervised algorithm for mapping these rough image-level labels into more accurate pixel-level labels that localize the different subsurface structures within the image. This approach dramatically simplifies the process of obtaining labeled data for training supervised machine learning algorithms on seismic interpretation tasks. Using our method, we generate thousands of automatically labeled images from the Netherlands Offshore F3 block with reasonably accurate pixel-level labels. We believe that this work will allow for more advances in machine learning-enabled seismic interpretation.
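The retrieval step described above can be sketched in NumPy: given a few exemplar feature vectors chosen by an interpreter, pull the most similar items from a large unlabeled pool and propagate each exemplar's class as a weak image-level label. Euclidean distance on generic feature vectors is an assumption; the paper uses its own similarity-based retrieval on seismic images.

```python
import numpy as np

def retrieve_similar(exemplars, pool, k=5):
    """For each exemplar feature vector, return the indices of the k
    most similar vectors in the unlabeled pool (Euclidean distance).
    Retrieved items inherit the exemplar's class as a weak label."""
    d = np.linalg.norm(pool[None, :, :] - exemplars[:, None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 16))             # features of unlabeled patches
# Exemplars are near-duplicates of pool items 3 and 42, so retrieval
# should find those items first.
exemplars = pool[[3, 42]] + 0.01 * rng.normal(size=(2, 16))

idx = retrieve_similar(exemplars, pool, k=3)
print(idx[:, 0])
```

Assuming similar images belong to the same class, each retrieved index would be assigned the exemplar's label, turning a handful of interpreter picks into thousands of weakly labeled training images.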


2021 ◽  
Vol 23 (Supplement_6) ◽  
pp. vi194-vi194
Author(s):  
Sen Peng ◽  
Matthew Lee ◽  
Nanyun Tang ◽  
Manmeet Ahluwalia ◽  
Ekokobe Fonkem ◽  
...  

Abstract Glioblastoma is characterized by intra- and inter-tumoral heterogeneity. A glioblastoma umbrella signature trial (GUST) posits multiple investigational treatment arms based on corresponding biomarker signatures. A contingency of an efficient umbrella trial is a suite of orthogonal signatures to classify patients into the likely-most-beneficial arm. Assigning optimal thresholds of vulnerability signatures to classify patients as “most-likely responders” for each specific treatment arm is a crucial task. We utilized semi-supervised machine learning, Entropy-Regularized Logistic Regression, to predict vulnerability classification. By applying semi-supervised algorithms to the TCGA GBM cohort, we were able to transform the samples with the highest certainty of predicted response into a self-labeled dataset and thus augment the training data. This yielded a predictive model with a larger sample size and potentially better performance. Our GUST design currently includes four treatment arms for GBM patients: Arsenic Trioxide, Methoxyamine, Selinexor and Pevonedistat. Each treatment arm manifests its own signature developed by the customized machine learning pipelines based on selected gene mutation status and whole transcriptome data. In order to increase the robustness and scalability, we also developed a multi-class/label classification ensemble model that is capable of predicting a probability of “fitness” of each novel therapeutic agent for each patient. Such a multi-class model would also enable us to rank each arm and provide sequential treatment planning. By expansion to four independent treatment arms within a single umbrella trial, a “mock” stratification of TCGA GBM patients labeled 56% of all cases into at least one “high likelihood of response” arm. Predicted vulnerability using genomic data from preclinical PDX models correctly placed 4 out of 6 models into the “responder” group. 
Our utilization of multiple vulnerability signatures in a GUST trial demonstrates how a precision medicine model can support an efficient clinical trial for heterogeneous diseases such as GBM.
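The self-labeling step described above, promoting high-certainty predictions into the training set, can be sketched with scikit-learn's generic self-training wrapper. Note the substitution: scikit-learn does not ship entropy-regularized logistic regression, so a plain L2-regularized logistic regression is used inside `SelfTrainingClassifier` as a stand-in, and the data are synthetic rather than TCGA GBM features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

# Synthetic cohort; most samples are unlabeled (label -1). Predictions
# above the confidence threshold are self-labeled to augment training.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(600) > 0.1] = -1     # hide ~90% of the labels

base = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)  # stand-in
clf = SelfTrainingClassifier(base, threshold=0.9).fit(X, y_partial)

acc = (clf.predict(X) == y).mean()
print(f"accuracy on all samples: {acc:.2f}")
```

The `threshold=0.9` cutoff plays the role of "highest certainty of predicted response" in the abstract: only confident pseudo-labels are added, growing the effective sample size without admitting noisy labels.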


2020 ◽  
Vol 6 (39) ◽  
pp. eaba9338 ◽  
Author(s):  
George W. Ashdown ◽  
Michelle Dimon ◽  
Minjie Fan ◽  
Fernando Sánchez-Román Terán ◽  
Kathrin Witmer ◽  
...  

Drug resistance threatens the effective prevention and treatment of an ever-increasing range of human infections. This highlights an urgent need for new and improved drugs with novel mechanisms of action to avoid cross-resistance. Current cell-based drug screens are, however, restricted to binary live/dead readouts with no provision for mechanism of action prediction. Machine learning methods are increasingly being used to improve information extraction from imaging data. These methods, however, work poorly with heterogeneous cellular phenotypes and generally require time-consuming human-led training. We have developed a semi-supervised machine learning approach, combining human- and machine-labeled training data from mixed human malaria parasite cultures. Designed for high-throughput and high-resolution screening, our semi-supervised approach is robust to natural parasite morphological heterogeneity and correctly orders parasite developmental stages. Our approach also reproducibly detects and clusters drug-induced morphological outliers by mechanism of action, demonstrating the potential power of machine learning for accelerating cell-based drug discovery.


2020 ◽  
Vol 222 (3) ◽  
pp. 1750-1764 ◽  
Author(s):  
Yangkang Chen

SUMMARY Effective and efficient arrival picking plays an important role in microseismic and earthquake data processing and imaging. Widely used short-term-average/long-term-average ratio (STA/LTA) arrival picking algorithms suffer from sensitivity to moderate-to-strong random ambient noise. To make state-of-the-art arrival picking approaches effective, microseismic data need to be first pre-processed, for example, by removing a sufficient amount of noise, and then analysed by arrival pickers. To conquer the noise issue in arrival picking for weak microseismic or earthquake events, I leverage machine learning techniques to help recognize seismic waveforms in microseismic or earthquake data. Because of the dependency of supervised machine learning algorithms on large volumes of well-designed training data, I utilize an unsupervised machine learning algorithm to help cluster the time samples into two groups, that is, waveform points and non-waveform points. The fuzzy clustering algorithm has been demonstrated to be effective for this purpose. A group of synthetic, real microseismic, and earthquake data sets with different levels of complexity show that the proposed method is much more robust than the state-of-the-art STA/LTA method in picking microseismic events, even in the case of moderately strong background noise.
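The clustering step, partitioning time samples into waveform and non-waveform groups with fuzzy c-means, can be sketched in NumPy on a synthetic trace. The smoothed absolute amplitude used as the clustering feature is an assumption for illustration; the paper's exact attributes are not reproduced here.

```python
import numpy as np

def fuzzy_cmeans(x, c=2, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means on samples x of shape (n, d):
    returns the membership matrix (n, c) and cluster centers (c, d)."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(x), c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(iters):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(x[:, None, :] - centers[None], axis=2) + 1e-12
        u = d ** (-2.0 / (m - 1.0))           # standard FCM membership update
        u /= u.sum(axis=1, keepdims=True)
    return u, centers

# Synthetic noisy trace with one event; feature = smoothed |amplitude|.
rng = np.random.default_rng(1)
trace = 0.1 * rng.normal(size=1000)
trace[400:500] += np.sin(np.linspace(0, 20 * np.pi, 100))   # the event
env = np.convolve(np.abs(trace), np.ones(21) / 21, mode="same")

u, centers = fuzzy_cmeans(env[:, None])
wave = int(np.argmax(centers[:, 0]))      # cluster with the larger envelope
labels = u.argmax(axis=1)
print(f"samples flagged as waveform: {(labels == wave).sum()}")
```

Because the grouping is driven by the data itself rather than a fixed STA/LTA trigger threshold, the same two-cluster partition can adapt to traces with different noise levels, which is the robustness the abstract claims for the fuzzy clustering approach.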

