Boosted Supervised Intensional Learning Supported by Unsupervised Learning

Traditionally, supervised machine learning (ML) algorithms rely heavily on large sets of annotated data. This is especially true for deep learning (DL) neural networks, which need huge annotated data sets for good performance. However, large volumes of annotated data are not always readily available. In addition, some of the best performing ML and DL algorithms lack explainability – it is often difficult even for domain experts to interpret the results. This is an important consideration especially in safety-critical applications, such as AI-assisted medical endeavors, in which a DL’s failure mode is not well understood. This lack of explainability also increases the risk of malicious attacks by adversarial actors because these actions can become obscured in the decision-making process that lacks transparency. This paper describes an intensional learning approach which uses boosting to enhance prediction performance while minimizing reliance on availability of annotated data. The intensional information is derived from an unsupervised learning preprocessing step involving clustering. Preliminary evaluation on the MNIST data set has shown encouraging results. Specifically, using the proposed approach, it is now possible to achieve similar accuracy result as extensional learning alone while using only a small fraction of the original training data set.

Download Full-text

Use of artificial intelligence to estimate population health indicators in France

European Journal of Public Health ◽

10.1093/eurpub/ckaa165.267 ◽

2020 ◽

Vol 30 (Supplement_5) ◽

Author(s):

R Haneef ◽

S Fuentes ◽

R Hrzic ◽

S Fosse-Edorh ◽

S Kab ◽

...

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Training Data ◽

Supervised Machine Learning ◽

Data Sets ◽

Data Set ◽

Selection Of

Abstract Background The use of artificial intelligence is increasing to estimate and predict health outcomes from large data sets. The main objectives were to develop two algorithms using machine learning techniques to identify new cases of diabetes (case study I) and to classify type 1 and type 2 (case study II) in France. Methods We selected the training data set from a cohort study linked with French national Health database (i.e., SNDS). Two final datasets were used to achieve each objective. A supervised machine learning method including eight following steps was developed: the selection of the data set, case definition, coding and standardization of variables, split data into training and test data sets, variable selection, training, validation and selection of the model. We planned to apply the trained models on the SNDS to estimate the incidence of diabetes and the prevalence of type 1/2 diabetes. Results For the case study I, 23/3468 and for case study II, 14/3481 SNDS variables were selected based on an optimal balance between variance explained and using the ReliefExp algorithm. We trained four models using different classification algorithms on the training data set. The Linear Discriminant Analysis model performed best in both case studies. The models were assessed on the test datasets and achieved a specificity of 67% and a sensitivity of 62% in case study I, and a specificity of 97 % and sensitivity of 100% in case study II. The case study II model was applied to the SNDS and estimated the prevalence of type 1 diabetes in 2016 in France of 0.3% and for type 2, 4.4%. The case study model I was not applied to the SNDS. Conclusions The case study II model to estimate the prevalence of type 1/2 diabetes has good performance and will be used in routine surveillance. The case study I model to identify new cases of diabetes showed a poor performance due to missing necessary information on determinants of diabetes and will need to be improved for further research.

Download Full-text

Exploiting Class Learnability in Noisy Data

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014082 ◽

2019 ◽

Vol 33 ◽

pp. 4082-4089

Author(s):

Matthew Klawonn ◽

Eric Heim ◽

James Hendler

Keyword(s):

Training Data ◽

Supervised Machine Learning ◽

Data Sets ◽

Generalization Error ◽

Data Set ◽

Class Testing ◽

Model Generalization ◽

Improve Model ◽

Real World Applications ◽

And Training

In many domains, collecting sufficient labeled training data for supervised machine learning requires easily accessible but noisy sources, such as crowdsourcing services or tagged Web data. Noisy labels occur frequently in data sets harvested via these means, sometimes resulting in entire classes of data on which learned classifiers generalize poorly. For real world applications, we argue that it can be beneficial to avoid training on such classes entirely. In this work, we aim to explore the classes in a given data set, and guide supervised training to spend time on a class proportional to its learnability. By focusing the training process, we aim to improve model generalization on classes with a strong signal. To that end, we develop an online algorithm that works in conjunction with classifier and training algorithm, iteratively selecting training data for the classifier based on how well it appears to generalize on each class. Testing our approach on a variety of data sets, we show our algorithm learns to focus on classes for which the model has low generalization error relative to strong baselines, yielding a classifier with good performance on learnable classes.

Download Full-text

Gaze Amplifies Value in Decision Making

Psychological Science ◽

10.1177/0956797618810521 ◽

2018 ◽

Vol 30 (1) ◽

pp. 116-128 ◽

Cited By ~ 20

Author(s):

Stephanie M. Smith ◽

Ian Krajbich

Keyword(s):

Response Times ◽

Data Sets ◽

Choice Data ◽

Data Set ◽

Choice Process ◽

The Gaze ◽

Large Sets ◽

Subjective Value ◽

Amplifying Effect ◽

Familiar Stimuli

When making decisions, people tend to choose the option they have looked at more. An unanswered question is how attention influences the choice process: whether it amplifies the subjective value of the looked-at option or instead adds a constant, value-independent bias. To address this, we examined choice data from six eye-tracking studies ( Ns = 39, 44, 44, 36, 20, and 45, respectively) to characterize the interaction between value and gaze in the choice process. We found that the summed values of the options influenced response times in every data set and the gaze-choice correlation in most data sets, in line with an amplifying role of attention in the choice process. Our results suggest that this amplifying effect is more pronounced in tasks using large sets of familiar stimuli, compared with tasks using small sets of learned stimuli.

Download Full-text

Data Analysis With Shapley Values For Automatic Subject Selection in Alzheimer's Disease Data Sets Using Interpretable Machine Learning

10.21203/rs.3.rs-245707/v1 ◽

2021 ◽

Author(s):

Louise Bloch ◽

Christoph M. Friedrich

Keyword(s):

Alzheimer’S Disease ◽

Alzheimer's Disease ◽

Test Data ◽

Noisy Data ◽

Training Data ◽

Data Sets ◽

Data Set ◽

Model Interpretation ◽

Percentage Points ◽

Shapley Values

Abstract Background: The prediction of whether Mild Cognitive Impaired (MCI) subjects will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML-workow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models which excluded 116 of the 467 subjects from the training data set based on their Logistic Regression (LR) data Shapley values outperformed the models which were trained on the entire training data set and which reached a mean classification accuracy of 58.54 % by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models, which were trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if those 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoEϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.

Download Full-text

clusTCR: a Python interface for rapid clustering of large sets of CDR3 sequences

10.1101/2021.02.22.432291 ◽

2021 ◽

Author(s):

Sebastiaan Valkiers ◽

Max Van Houcke ◽

Kris Laukens ◽

Pieter Meysman

Keyword(s):

T Cell ◽

Large Data ◽

Cell Receptor ◽

Amino Acid Sequences ◽

Large Data Sets ◽

Data Sets ◽

Clustering Methods ◽

Link Type ◽

Large Sets ◽

Similar Accuracy

The T-cell receptor (TCR) determines the specificity of a T-cell towards an epitope. As of yet, the rules for antigen recognition remain largely undetermined. Current methods for grouping TCRs according to their epitope specificity remain limited in performance and scalability. Multiple methodologies have been developed, but all of them fail to efficiently cluster large data sets exceeding 1 million sequences. To account for this limitation, we developed clusTCR, a rapid TCR clustering alternative that efficiently scales up to millions of CDR3 amino acid sequences. Benchmarking comparisons revealed similar accuracy of clusTCR with other TCR clustering methods. clusTCR offers a drastic improvement in clustering speed, which allows clustering of millions of TCR sequences in just a few minutes through efficient similarity searching and sequence hashing.clusTCR was written in Python 3. It is available as an anaconda package (https://anaconda.org/svalkiers/clustcr) and on github (https://github.com/svalkiers/clusTCR).

Download Full-text

Artificially Generated Training Data-sets for Supervised Machine Learning Techniques in Magnetic Resonance Imaging: An Example in Myocardial Segmentation

2019 Computing in Cardiology Conference (CinC) ◽

10.22489/cinc.2019.220 ◽

2019 ◽

Author(s):

Christos Xanthis ◽

Kostas Haris ◽

Dimitrios Filos ◽

Anthony Aletras

Keyword(s):

Magnetic Resonance Imaging ◽

Machine Learning ◽

Magnetic Resonance ◽

Training Data ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Data Sets ◽

Resonance Imaging ◽

Learning Techniques ◽

Myocardial Segmentation

Download Full-text

An Incremental Isomap Method for Hyperspectral Dimensionality Reduction and Classification

Photogrammetric Engineering & Remote Sensing ◽

10.14358/pers.87.7.445 ◽

2021 ◽

Vol 87 (6) ◽

pp. 445-455

Author(s):

Yi Ma ◽

Zezhong Zheng ◽

Yutang Ma ◽

Mingcang Zhu ◽

Ran Huang ◽

...

Keyword(s):

Manifold Learning ◽

Nearest Neighbor ◽

Hyperspectral Image ◽

Hyperspectral Data ◽

Training Data ◽

Support Vector ◽

Data Sets ◽

K Nearest Neighbor ◽

Data Set ◽

Data Points

Many manifold learning algorithms conduct an eigen vector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N2). We pres- ent in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature varia- tion algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k–nearest-neighbor classifier and achieving the second best performance with support vector machine.

Download Full-text

Simple Convolutional-Based Models: Are They Learning the Task or the Data?

Neural Computation ◽

10.1162/neco_a_01446 ◽

2021 ◽

pp. 1-17

Author(s):

Luis Sa-Couto ◽

Andreas Wichert

Keyword(s):

Neural Networks ◽

Pattern Recognition ◽

Training Data ◽

Model Complexity ◽

Data Sets ◽

Simple Task ◽

Data Set ◽

Knowing That ◽

Handwritten Digit ◽

End To End

Abstract Convolutional neural networks (CNNs) evolved from Fukushima's neocognitron model, which is based on the ideas of Hubel and Wiesel about the early stages of the visual cortex. Unlike other branches of neocognitron-based models, the typical CNN is based on end-to-end supervised learning by backpropagation and removes the focus from built-in invariance mechanisms, using pooling not as a way to tolerate small shifts but as a regularization tool that decreases model complexity. These properties of end-to-end supervision and flexibility of structure allow the typical CNN to become highly tuned to the training data, leading to extremely high accuracies on typical visual pattern recognition data sets. However, in this work, we hypothesize that there is a flip side to this capability, a hidden overfitting. More concretely, a supervised, backpropagation based CNN will outperform a neocognitron/map transformation cascade (MTCCXC) when trained and tested inside the same data set. Yet if we take both models trained and test them on the same task but on another data set (without retraining), the overfitting appears. Other neocognitron descendants like the What-Where model go in a different direction. In these models, learning remains unsupervised, but more structure is added to capture invariance to typical changes. Knowing that, we further hypothesize that if we repeat the same experiments with this model, the lack of supervision may make it worse than the typical CNN inside the same data set, but the added structure will make it generalize even better to another one. To put our hypothesis to the test, we choose the simple task of handwritten digit classification and take two well-known data sets of it: MNIST and ETL-1. To try to make the two data sets as similar as possible, we experiment with several types of preprocessing. However, regardless of the type in question, the results align exactly with expectation.

Download Full-text

Lost in Space: Geolocation in Event Data

Political Science Research and Methods ◽

10.1017/psrm.2018.23 ◽

2018 ◽

Vol 7 (04) ◽

pp. 871-888 ◽

Cited By ~ 6

Author(s):

Sophie J. Lee ◽

Howard Liu ◽

Michael D. Ward

Keyword(s):

Learning Algorithm ◽

Text Processing ◽

Contextual Information ◽

Training Data ◽

Supervised Machine Learning ◽

Model Parameters ◽

Event Data ◽

Data Set ◽

N Gram ◽

Automated Text Processing

Improving geolocation accuracy in text data has long been a goal of automated text processing. We depart from the conventional method and introduce a two-stage supervised machine-learning algorithm that evaluates each location mention to be either correct or incorrect. We extract contextual information from texts, i.e., N-gram patterns for location words, mention frequency, and the context of sentences containing location words. We then estimate model parameters using a training data set and use this model to predict whether a location word in the test data set accurately represents the location of an event. We demonstrate these steps by constructing customized geolocation event data at the subnational level using news articles collected from around the world. The results show that the proposed algorithm outperforms existing geocoders even in a case added post hoc to test the generality of the developed algorithm.

Download Full-text

An assessment of Bayesian bias estimator for numerical weather prediction

Nonlinear Processes in Geophysics ◽

10.5194/npg-15-1013-2008 ◽

2008 ◽

Vol 15 (6) ◽

pp. 1013-1022 ◽

Cited By ~ 2

Author(s):

J. Son ◽

D. Hou ◽

Z. Toth

Keyword(s):

Time Series ◽

Numerical Weather Prediction ◽

Sampling Error ◽

Weather Prediction ◽

Training Data ◽

Statistical Characteristics ◽

Forecast Errors ◽

Data Sets ◽

Data Set ◽

Numerical Weather

Abstract. Various statistical methods are used to process operational Numerical Weather Prediction (NWP) products with the aim of reducing forecast errors and they often require sufficiently large training data sets. Generating such a hindcast data set for this purpose can be costly and a well designed algorithm should be able to reduce the required size of these data sets. This issue is investigated with the relatively simple case of bias correction, by comparing a Bayesian algorithm of bias estimation with the conventionally used empirical method. As available forecast data sets are not large enough for a comprehensive test, synthetically generated time series representing the analysis (truth) and forecast are used to increase the sample size. Since these synthetic time series retained the statistical characteristics of the observations and operational NWP model output, the results of this study can be extended to real observation and forecasts and this is confirmed by a preliminary test with real data. By using the climatological mean and standard deviation of the meteorological variable in consideration and the statistical relationship between the forecast and the analysis, the Bayesian bias estimator outperforms the empirical approach in terms of the accuracy of the estimated bias, and it can reduce the required size of the training sample by a factor of 3. This advantage of the Bayesian approach is due to the fact that it is less liable to the sampling error in consecutive sampling. These results suggest that a carefully designed statistical procedure may reduce the need for the costly generation of large hindcast datasets.

Download Full-text