Increasing trust in complex machine learning systems

Machine learning (ML) has become a core technology for many real-world applications. Modern ML models are applied to unprecedentedly complex and difficult challenges, including very large and subjective problems. For instance, applications in multimedia understanding have advanced substantially: it is already common for cultural and artistic objects such as music and videos to be analyzed and served to users according to their preferences by means of ML techniques. One of the most recent breakthroughs in ML is Deep Learning (DL), which has been widely adopted to tackle such complex problems. DL allows for higher learning capacity and makes end-to-end learning possible, which reduces the need for substantial engineering effort while achieving high effectiveness. At the same time, this also makes DL models more complex than conventional ML models. Reports in several domains indicate that such complex ML models may have potentially critical hidden problems: various biases embedded in the training data can emerge in the predictions, and extremely sensitive models can make unaccountable mistakes. Furthermore, the black-box nature of DL models hinders the interpretation of the mechanisms behind them. Such unexpected drawbacks significantly affect the trustworthiness of the systems in which ML models serve as the core component.

In this thesis, a series of studies investigates aspects of trustworthiness for complex ML applications, namely reliability and explainability. Specifically, we focus on music as the primary domain of interest, considering its complexity and subjectivity. Because of this nature of music, ML models for music are necessarily complex in order to achieve meaningful effectiveness. As such, the reliability and explainability of music ML models are crucial in the field.

The first main chapter of the thesis investigates the transferability of neural networks in the Music Information Retrieval (MIR) context. Transfer learning, where pre-trained ML models are used as off-the-shelf modules for the task at hand, has become one of the major ML practices. It is helpful because a substantial amount of information is already encoded in the pre-trained models, which allows the model to achieve high effectiveness even when data for the current task is scarce. However, this may not hold if the "source" task on which the model was pre-trained shares little commonality with the "target" task at hand. An experiment including multiple "source" tasks and "target" tasks was conducted to examine the conditions that have a positive effect on transferability. The results suggest that the number of source tasks is a major factor in transferability. At the same time, it is less evident that any single source task is universally effective across multiple target tasks. Overall, we conclude that considering multiple pre-trained models, or pre-training a model on heterogeneous source tasks, can increase the chance of successful transfer learning.

The second major work investigates the robustness of DL models in the transfer learning context. The hypothesis is that DL models can be susceptible to imperceptible noise in the input. Such noise may drastically shift the analysis of similarity among inputs, which is undesirable for tasks such as information retrieval. Several DL models pre-trained on MIR tasks are examined under a set of perturbations that are plausible in a real-world setup. Based on a proposed sensitivity measure, the experimental results indicate that all the DL models were substantially vulnerable to perturbations compared to a traditional feature encoder. They also suggest that the experimental framework can be used to test the robustness of pre-trained DL models.

In the final main chapter, the explainability of black-box ML models is discussed. In particular, the chapter focuses on the evaluation of explanations derived from model-agnostic explanation methods. With black-box ML models having become common practice, model-agnostic explanation methods have been developed to explain a prediction. However, the evaluation of such explanations is still an open problem. The work introduces an evaluation framework that measures the quality of explanations in terms of fidelity and complexity. Fidelity refers to the coherence of the explained mechanism with the black-box model, while complexity is the length of the explanation.

Throughout the thesis, we gave special attention to the experimental design, so that robust conclusions can be reached. Furthermore, we focused on delivering machine learning and evaluation frameworks. This is crucial, as we intend the experimental design and results to be reusable in general ML practice. Accordingly, we also aim for our findings to be applicable beyond music applications, for example in computer vision or natural language processing. Trustworthiness in ML is not a domain-specific problem. Thus, it is vital for both researchers and practitioners from diverse problem spaces to increase their awareness of the trustworthiness of complex ML systems. We believe the research reported in this thesis provides meaningful stepping stones towards the trustworthiness of ML.
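
The abstract above describes fidelity as the coherence between the explained mechanism and the black-box model, and complexity as the length of the explanation. A minimal sketch of how such measures could be computed follows. It is not the thesis's framework: the agreement-rate definition of fidelity, the count of non-zero weights as "length", and the names `fidelity`, `complexity`, `black_box`, and `surrogate` are all assumptions made for illustration.

```python
from typing import Callable, Sequence

Point = Sequence[float]


def fidelity(black_box: Callable[[Point], int],
             surrogate: Callable[[Point], int],
             inputs: Sequence[Point]) -> float:
    """Fraction of inputs on which the explanation agrees with the black box.

    Assumption: fidelity is taken here as a simple agreement rate.
    """
    if not inputs:
        raise ValueError("inputs must not be empty")
    agree = sum(1 for x in inputs if black_box(x) == surrogate(x))
    return agree / len(inputs)


def complexity(weights: Sequence[float], tol: float = 1e-12) -> int:
    """Length of a linear explanation, counted as its non-zero terms.

    Assumption: "length" is read as the number of features actually used.
    """
    return sum(1 for w in weights if abs(w) > tol)


# Hypothetical black box: depends strongly on x[0], weakly on x[1].
def black_box(x: Point) -> int:
    return int(x[0] + 0.1 * x[1] > 0.5)


# Hypothetical explanation: a linear rule that only uses the first feature.
weights = [1.0, 0.0]


def surrogate(x: Point) -> int:
    return int(sum(w * v for w, v in zip(weights, x)) > 0.5)


samples = [(0.2, 0.9), (0.45, 0.9), (0.7, 0.1), (0.9, 0.5)]
print(fidelity(black_box, surrogate, samples))  # agreement rate on the samples
print(complexity(weights))                      # number of features used
```

Under these assumptions the two numbers pull against each other: dropping the second feature shortens the explanation but costs agreement on inputs near the decision boundary.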

Virtual to Real-World Transfer Learning: A Systematic Review

Electronics ◽

10.3390/electronics10121491 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1491

Author(s):

Mahesh Ranaweera ◽

Qusay H. Mahmoud

Keyword(s):

Machine Learning ◽

Systematic Review ◽

Transfer Learning ◽

Real World ◽

High Performance ◽

Research Area ◽

Training Data ◽

Machine Learning Techniques ◽

Current Status ◽

The Real

Machine learning has become an important research area in many domains and real-world applications. The prevailing assumption in traditional machine learning techniques, that training and testing data should be of the same domain, is a challenge. In the real world, gathering enough training data to create high-performance learning models is not easy. Sometimes data are unavailable, very expensive, or dangerous to collect. In such scenarios, machine learning does not live up to its potential. Transfer learning has recently gained much attention in research because it can create high-performance learners through virtual environments or by using data gathered from other domains. This systematic review (a) defines transfer learning; (b) discusses the recent research conducted; (c) describes the current status of transfer learning; and finally, (d) discusses how transfer learning can bridge the gap between the virtual and the real.

Zero-Shot Feature Selection via Transferring Supervised Knowledge

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2021040101 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-20

Author(s):

Zheng Wang ◽

Qiao Wang ◽

Tingzhang Zhao ◽

Chaokun Wang ◽

Xiaojun Ye

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Dimensionality Reduction ◽

Real World ◽

Rapid Growth ◽

Learning Systems ◽

Training Data ◽

Effective Technique ◽

Supervised Methods ◽

Real World Datasets

Feature selection, an effective technique for dimensionality reduction, plays an important role in many machine learning systems. Supervised knowledge can significantly improve the performance. However, faced with the rapid growth of newly emerging concepts, existing supervised methods might easily suffer from the scarcity and validity of labeled data for training. In this paper, the authors study the problem of zero-shot feature selection (i.e., building a feature selection model that generalizes well to “unseen” concepts with limited training data of “seen” concepts). Specifically, they adopt class-semantic descriptions (i.e., attributes) as supervision for feature selection, so as to utilize the supervised knowledge transferred from the seen concepts. For more reliable discriminative features, they further propose the center-characteristic loss which encourages the selected features to capture the central characteristics of seen concepts. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness of the method.

Glean

Proceedings of the VLDB Endowment ◽

10.14778/3447689.3447703 ◽

2021 ◽

Vol 14 (6) ◽

pp. 997-1005

Author(s):

Sandeep Tata ◽

Navneet Potti ◽

James B. Wendt ◽

Lauro Beltrão Costa ◽

Marc Najork ◽

...

Keyword(s):

Machine Learning ◽

Data Management ◽

Real World ◽

Empirical Studies ◽

Ground Truth ◽

Training Data ◽

Ground Truth Data ◽

Document Type ◽

Machine Learning Model ◽

Structured Information

Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges: (1) managing the quality of ground truth data, (2) generating training data for the machine learning model using labeled documents, and (3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture.

On the utility of dreaming: A general model for how learning in artificial agents can benefit from data hallucination

Adaptive Behavior ◽

10.1177/1059712319896489 ◽

2020 ◽

pp. 105971231989648 ◽

Cited By ~ 2

Author(s):

David Windridge ◽

Henrik Svensson ◽

Serge Thill

Keyword(s):

Machine Learning ◽

Simulated Data ◽

Training Data ◽

Successful Implementation ◽

Artificial Agents ◽

Learning Context ◽

Training Set ◽

Convergence Point ◽

And Training ◽

General Method

We consider the benefits of dream mechanisms – that is, the ability to simulate new experiences based on past ones – in a machine learning context. Specifically, we are interested in learning for artificial agents that act in the world, and operationalize “dreaming” as a mechanism by which such an agent can use its own model of the learning environment to generate new hypotheses and training data. We first show that it is not necessarily a given that such a data-hallucination process is useful, since it can easily lead to a training set dominated by spurious imagined data until an ill-defined convergence point is reached. We then analyse a notably successful implementation of a machine learning-based dreaming mechanism by Ha and Schmidhuber (Ha, D., & Schmidhuber, J. (2018). World models. arXiv e-prints, arXiv:1803.10122). On that basis, we then develop a general framework by which an agent can generate simulated data to learn from in a manner that is beneficial to the agent. This, we argue, then forms a general method for an operationalized dream-like mechanism. We finish by demonstrating the general conditions under which such mechanisms can be useful in machine learning, wherein the implicit simulator inference and extrapolation involved in dreaming act without reinforcing inference error even when inference is incomplete.

Estimating real-world performance of a predictive model: a case-study in predicting mortality

JAMIA Open ◽

10.1093/jamiaopen/ooaa008 ◽

2020 ◽

Vol 3 (2) ◽

pp. 243-251

Author(s):

Vincent J Major ◽

Neil Jethani ◽

Yindalon Aphinyanaphongs

Keyword(s):

Experimental Design ◽

Real World ◽

Model Performance ◽

Assistive Technologies ◽

Training Data ◽

Electronic Health Record Data ◽

Test Set ◽

Model Composite ◽

Temporal Validation ◽

Cohort Selection

Objective: One primary consideration when developing predictive models is downstream effects on future model performance. We conduct experiments to quantify the effects of experimental design choices, namely cohort selection and internal validation methods, on (estimated) real-world model performance.

Materials and Methods: Four years of hospitalizations are used to develop a 1-year mortality prediction model (composite of death or initiation of hospice care). Two common methods to select appropriate patient visits from their encounter history (backwards-from-outcome and forwards-from-admission) are combined with 2 testing cohorts (random and temporal validation). Two models are trained under otherwise identical conditions, and their performances compared. Operating thresholds are selected in each test set and applied to a “real-world” cohort of labeled admissions from another, unused year.

Results: Backwards-from-outcome cohort selection retains 25% of candidate admissions (n = 23 579), whereas forwards-from-admission selection includes many more (n = 92 148). Both selection methods produce similar performances when applied to a random test set. However, when applied to the temporally defined “real-world” set, forwards-from-admission yields higher areas under the ROC and precision-recall curves (88.3% and 56.5% vs. 83.2% and 41.6%).

Discussion: A backwards-from-outcome experiment manipulates raw training data, simplifying the experiment. This manipulated data no longer resembles real-world data, resulting in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance.

Conclusion: Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance.

Lay Summary: The routine care of patients stands to benefit greatly from assistive technologies, including data-driven risk assessment. Already, many different machine learning and artificial intelligence applications are being developed from complex electronic health record data. To overcome challenges that arise from such data, researchers often start with simple experimental approaches to test their work. One key component is how patients (and their healthcare visits) are selected for the study from the pool of all patients seen. Another is how the group of patients used to create the risk estimator differs from the group used to evaluate how well it works. These choices complicate how the experimental setting compares to the real-world application to patients. For example, different selection approaches that depend on each patient’s future outcome can simplify the experiment but are impractical upon implementation as these data are unavailable. We show that this kind of “backwards” experiment optimistically estimates how well the model performs. Instead, our results advocate for experiments that select patients in a “forwards” manner and “temporal” validation that approximates training on past data and implementing on future data. More robust results help gauge the clinical utility of recent works and aid decision-making before implementation into practice.
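
The abstract above contrasts a random test set with a temporally separated one. The difference can be sketched in a few lines. This is only an illustration, not the study's pipeline: the record layout (a dict with an `admitted` date), the function names `random_split` and `temporal_split`, and the synthetic admissions are all hypothetical.

```python
import random
from datetime import date


def random_split(records, test_fraction, seed=0):
    """Shuffle records and hold out a fraction, ignoring admission time."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def temporal_split(records, cutoff):
    """Train on admissions before the cutoff, test on those at or after it."""
    train = [r for r in records if r["admitted"] < cutoff]
    test = [r for r in records if r["admitted"] >= cutoff]
    return train, test


# Hypothetical admissions spread over five calendar years (2014 to 2018).
admissions = [
    {"id": i, "admitted": date(2014 + i % 5, 1 + i % 12, 1)}
    for i in range(20)
]

train_r, test_r = random_split(admissions, test_fraction=0.2)
train_t, test_t = temporal_split(admissions, cutoff=date(2018, 1, 1))

# In the temporal split every test admission is later than every training one,
# which mimics training on past data and deploying on future data.
print(len(train_r), len(test_r))
print(len(train_t), len(test_t))
```

With a random split, test admissions are interleaved in time with training admissions, whereas the temporal split keeps the held-out year strictly after the training period.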

Transfer Learning and Deep Domain Adaptation

Advances and Applications in Deep Learning ◽

10.5772/intechopen.94072 ◽

2020 ◽

Author(s):

Wen Xu ◽

Jing He ◽

Yanfeng Shu

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Transfer Learning ◽

Real World ◽

Deep Neural Networks ◽

Domain Adaptation ◽

Fine Tuning ◽

Real World Applications ◽

Comprehensive Survey ◽

Sample Reconstruction

Transfer learning is an emerging technique in machine learning, by which we can solve a new task with the knowledge obtained from an old task in order to address the lack of labeled data. In particular, deep domain adaptation (a branch of transfer learning) has received the most attention in recently published articles. The intuition behind this is that deep neural networks usually have a large capacity to learn representations from one dataset, and part of that information can be reused for a new task. In this research, we first present the complete scenarios of transfer learning according to the domains and tasks. Second, we conduct a comprehensive survey of deep domain adaptation and categorize the recent advances into three types based on the implementation approach: fine-tuning networks, adversarial domain adaptation, and sample-reconstruction approaches. Third, we discuss the details of these methods and introduce some typical real-world applications. Finally, we conclude our work and explore some potential issues to be further addressed.

Facial Expression Recognition Based on Weighted-Cluster Loss and Deep Transfer Learning Using a Highly Imbalanced Dataset

Sensors ◽

10.3390/s20092639 ◽

2020 ◽

Vol 20 (9) ◽

pp. 2639

Author(s):

Quan T. Ngo ◽

Seokhoon Yoon

Keyword(s):

Facial Expression ◽

Transfer Learning ◽

Loss Function ◽

Real World ◽

Facial Expression Recognition ◽

Training Data ◽

Fine Tuning ◽

Expression Recognition ◽

Recent Success ◽

Deep Cnn

Facial expression recognition (FER) is a challenging problem in the fields of pattern recognition and computer vision. The recent success of convolutional neural networks (CNNs) in object detection and object segmentation tasks has shown promise in building an automatic deep CNN-based FER model. However, in real-world scenarios, performance degrades dramatically owing to the great diversity of factors unrelated to facial expressions, and due to a lack of training data and an intrinsic imbalance in the existing facial emotion datasets. To tackle these problems, this paper not only applies deep transfer learning techniques, but also proposes a novel loss function called weighted-cluster loss, which is used during the fine-tuning phase. Specifically, the weighted-cluster loss function simultaneously improves the intra-class compactness and the inter-class separability by learning a class center for each emotion class. It also takes the imbalance in a facial expression dataset into account by giving each emotion class a weight based on its proportion of the total number of images. In addition, a recent, successful deep CNN architecture, pre-trained in the task of face identification with the VGGFace2 database from the Visual Geometry Group at Oxford University, is employed and fine-tuned using the proposed loss function to recognize eight basic facial emotions from the AffectNet database of facial expression, valence, and arousal computing in the wild. Experiments on an AffectNet real-world facial dataset demonstrate that our method outperforms the baseline CNN models that use either weighted-softmax loss or center loss.
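
The abstract above names two ingredients of its loss: a learned center per emotion class and a per-class weight based on the class's share of the images. The sketch below illustrates only those two ingredients in plain Python. It is not the paper's weighted-cluster loss: the inverse-frequency weighting, the squared-distance penalty, the omission of any inter-class separation term, and the names `class_weights` and `weighted_center_penalty` are assumptions made for illustration.

```python
from collections import Counter


def class_weights(labels):
    """Give rarer classes larger weights.

    Assumption: weight = total / (num_classes * class_count), a common
    inverse-frequency scheme; the paper's exact formula is not given here.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {c: total / (num_classes * n) for c, n in counts.items()}


def weighted_center_penalty(features, labels, centers, weights):
    """Mean class-weighted squared distance of each feature to its class center.

    This only captures intra-class compactness; centers are passed in as fixed
    values rather than learned.
    """
    total = 0.0
    for feature, label in zip(features, labels):
        sq_dist = sum((a - b) ** 2 for a, b in zip(feature, centers[label]))
        total += weights[label] * sq_dist
    return total / len(features)


# Hypothetical, heavily imbalanced toy data with 2-D features.
labels = ["happy", "happy", "happy", "fear"]
features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
centers = {"happy": (0.0, 0.0), "fear": (2.0, 2.0)}

weights = class_weights(labels)
print(weights)
print(weighted_center_penalty(features, labels, centers, weights))
```

In this toy setup the minority class "fear" receives a larger weight than "happy", so errors on its samples would contribute more to the penalty.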

Predicting Fault Slip via Transfer Learning

10.21203/rs.3.rs-700852/v1 ◽

2021 ◽

Author(s):

Kun Wang ◽

Christopher Johnson ◽

Kane Bennett ◽

Paul Johnson

Keyword(s):

Machine Learning ◽

Numerical Simulations ◽

Transfer Learning ◽

Laboratory Experiments ◽

Laboratory Data ◽

Fault Slip ◽

Geophysical Data ◽

Training Data ◽

Data Sets ◽

Earthquake Cycle

Data-driven machine learning for predicting instantaneous and future fault slip in laboratory experiments has recently progressed markedly due to large training data sets. In Earth, however, earthquake interevent times range from tens to hundreds of years, and geophysical data typically exist for only a portion of an earthquake cycle. Sparse data present a serious challenge to training machine learning models. Here we describe a transfer learning approach using numerical simulations to train a convolutional encoder-decoder that predicts fault-slip behavior in laboratory experiments. The model learns a mapping between acoustic emission histories and fault slip from numerical simulations, and generalizes to produce accurate results using laboratory data. Notably, slip predictions markedly improve when using the simulation-trained model and training the latent space on a portion of a single laboratory earthquake cycle. The transfer learning results elucidate the potential of using models trained on numerical simulations and fine-tuned with small geophysical data sets for potential applications to faults in Earth.

Estimating Real World Performance of a Predictive Model: A Case-Study in Predicting End-of-Life

10.1101/19008821 ◽

2019 ◽

Author(s):

Vincent J Major ◽

Neil Jethani ◽

Yindalon Aphinyanaphongs

Keyword(s):

Experimental Design ◽

End Of Life ◽

Real World ◽

Hospital Admissions ◽

Model Performance ◽

Training Data ◽

Real World Data ◽

Test Set ◽

Subsequent Effect ◽

One Year

Objective: The main criterion for choosing how models are built is the subsequent effect on future (estimated) model performance. In this work, we evaluate the effects of experimental design choices on both estimated and actual model performance.

Materials and Methods: Four years of hospital admissions are used to develop a 1-year end-of-life prediction model. Two common methods to select appropriate prediction timepoints (backwards-from-outcome and forwards-from-admission) are introduced and combined with two ways of separating cohorts for training and testing (internal and temporal). Two models are trained in identical conditions, and their performances are compared. Finally, operating thresholds are selected in each test set and applied in a final, ‘real-world’ cohort consisting of one year of admissions.

Results: Backwards-from-outcome cohort selection discards 75% of candidate admissions (n=23,579), whereas forwards-from-admission selection includes many more (n=92,148). Both selection methods produce similar global performances when applied to an internal test set. However, when applied to the temporally defined ‘real-world’ set, forwards-from-admission yields higher areas under the ROC and precision-recall curves (88.3 and 56.5% vs. 83.2 and 41.6%).

Discussion: A backwards-from-outcome experiment effectively transforms the training data such that it no longer resembles real-world data. This results in optimistic estimates of test set performance, especially at high precision. In contrast, a forwards-from-admission experiment with a temporally separated test set consistently and conservatively estimates real-world performance.

Conclusion: Experimental design choices impose bias upon selected cohorts. A forwards-from-admission experiment, validated temporally, can conservatively estimate real-world performance.

Materials Representation and Transfer Learning for Multi-Property Prediction

10.26434/chemrxiv.14612307.v1 ◽

2021 ◽

Author(s):

Shufeng Kong ◽

Dan Guevarra ◽

Carla P. Gomes ◽

John Gregoire

Keyword(s):

Machine Learning ◽

Optical Absorption ◽

Transfer Learning ◽

Materials Science ◽

Training Data ◽

Target Domain ◽

Generative Adversarial Network ◽

Property Prediction ◽

Adversarial Network ◽

Correlation Learning

The adoption of machine learning in materials science has rapidly transformed materials property prediction. Hurdles limiting full capitalization of recent advancements in machine learning include the limited development of methods to learn the underlying interactions of multiple elements, as well as the relationships among multiple properties, to facilitate property prediction in new composition spaces. To address these issues, we introduce the Hierarchical Correlation Learning for Multi-property Prediction (H-CLMP) framework that seamlessly integrates (i) prediction using only a material’s composition, (ii) learning and exploitation of correlations among target properties in multitarget regression, and (iii) leveraging training data from tangential domains via generative transfer learning. The model is demonstrated for prediction of spectral optical absorption of complex metal oxides spanning 69 3-cation metal oxide composition spaces. H-CLMP accurately predicts non-linear composition-property relationships in composition spaces for which no training data is available, which broadens the purview of machine learning to the discovery of materials with exceptional properties. This achievement results from the principled integration of latent embedding learning, property correlation learning, generative transfer learning, and attention models. The best performance is obtained using H-CLMP with Transfer learning (H-CLMP(T)) wherein a generative adversarial network is trained on computational density of states data and deployed in the target domain to augment prediction of optical absorption from composition. H-CLMP(T) aggregates multiple knowledge sources with a framework that is well-suited for multi-target regression across the physical sciences.
