Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency

Feature selection, an effective technique for dimensionality reduction, plays an important role in many machine learning systems. Supervised knowledge can significantly improve the performance. However, faced with the rapid growth of newly emerging concepts, existing supervised methods might easily suffer from the scarcity and validity of labeled data for training. In this paper, the authors study the problem of zero-shot feature selection (i.e., building a feature selection model that generalizes well to “unseen” concepts with limited training data of “seen” concepts). Specifically, they adopt class-semantic descriptions (i.e., attributes) as supervision for feature selection, so as to utilize the supervised knowledge transferred from the seen concepts. For more reliable discriminative features, they further propose the center-characteristic loss which encourages the selected features to capture the central characteristics of seen concepts. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness of the method.

Download Full-text

Rationale Discovery and Explainable AI

10.3233/faia210341 ◽

2021 ◽

Author(s):

Cor Steging ◽

Silja Renooij ◽

Bart Verheij

Keyword(s):

Machine Learning ◽

Feature Detection ◽

State Of The Art ◽

High Accuracy ◽

Learning Systems ◽

Training Data ◽

Relevant Feature ◽

Explainable Ai ◽

The Right ◽

The Impact

The justification of an algorithm’s outcomes is important in many domains, and in particular in the law. However, previous research has shown that machine learning systems can make the right decisions for the wrong reasons: despite high accuracies, not all of the conditions that define the domain of the training data are learned. In this study, we investigate what the system does learn, using state-of-the-art explainable AI techniques. With the use of SHAP and LIME, we are able to show which features impact the decision making process and how the impact changes with different distributions of the training data. However, our results also show that even high accuracy and good relevant feature detection are no guarantee for a sound rationale. Hence these state-of-the-art explainable AI techniques cannot be used to fully expose unsound rationales, further advocating the need for a separate method for rationale evaluation.

Download Full-text

On the relationship between research parasites and fairness in machine learning: challenges and opportunities

GigaScience ◽

10.1093/gigascience/giab086 ◽

2021 ◽

Vol 10 (12) ◽

Author(s):

Nicolás Nieto ◽

Agostina Larrazabal ◽

Victoria Peterson ◽

Diego H Milone ◽

Enzo Ferrante

Keyword(s):

Machine Learning ◽

Data Model ◽

Learning Systems ◽

Training Data ◽

Model Construction ◽

Daily Lives ◽

The Past ◽

Learning Challenges ◽

Challenges And Opportunities ◽

The Relationship

Abstract Machine learning systems influence our daily lives in many different ways. Hence, it is crucial to ensure that the decisions and recommendations made by these systems are fair, equitable, and free of unintended biases. Over the past few years, the field of fairness in machine learning has grown rapidly, investigating how, when, and why these models capture, and even potentiate, biases that are deeply rooted not only in the training data but also in our society. In this Commentary, we discuss challenges and opportunities for rigorous posterior analyses of publicly available data to build fair and equitable machine learning systems, focusing on the importance of training data, model construction, and diversity in the team of developers. The thoughts presented here have grown out of the work we did, which resulted in our winning the annual Research Parasite Award that GigaSciencesponsors.

Download Full-text

Generating Video From Images using GAN and CVAE

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.e6425.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 1401-1404

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Generative Models ◽

Generative Model ◽

Learning Systems ◽

Training Data ◽

Variational Autoencoder ◽

Discriminative Model ◽

To Come

In a given scene, people can often easily predict a lot of quick future occasions that may occur. However generalized pixel-level expectation in Machine Learning systems is difficult in light of the fact that it struggles with the ambiguity inherent in predicting what's to come. However, the objective of the paper is to concentrate on predicting the dense direction of pixels in a scene — what will move in the scene, where it will travel, and how it will deform through the span of one second for which we propose a conditional variational autoencoder as a solution for this issue. We likewise propose another structure for assessing generative models through an adversarial procedure, wherein we simultaneously train two models, a generative model G that catches the information appropriation, and a discriminative model D that gauges the likelihood that an example originated from the training data instead of G. We focus on two uses of GANs semi-supervised learning, and the age of pictures that human's find visually realistic. We present the Moments in Time Dataset, an enormous scale human-clarified assortment of one million short recordings relating to dynamic situations unfolding within three seconds.

Download Full-text

The Conditional Entropy Bottleneck

Entropy ◽

10.3390/e22090999 ◽

2020 ◽

Vol 22 (9) ◽

pp. 999 ◽

Cited By ~ 3

Author(s):

Ian Fischer

Keyword(s):

Machine Learning ◽

Objective Function ◽

Failure Modes ◽

Conditional Entropy ◽

Learning Systems ◽

Training Data ◽

Deterministic Models ◽

Information Bottleneck ◽

Adversarial Examples

Much of the field of Machine Learning exhibits a prominent set of failure modes, including vulnerability to adversarial examples, poor out-of-distribution (OoD) detection, miscalibration, and willingness to memorize random labelings of datasets. We characterize these as failures of robust generalization, which extends the traditional measure of generalization as accuracy or related metrics on a held-out set. We hypothesize that these failures to robustly generalize are due to the learning systems retaining too much information about the training data. To test this hypothesis, we propose the Minimum Necessary Information (MNI) criterion for evaluating the quality of a model. In order to train models that perform well with respect to the MNI criterion, we present a new objective function, the Conditional Entropy Bottleneck (CEB), which is closely related to the Information Bottleneck (IB). We experimentally test our hypothesis by comparing the performance of CEB models with deterministic models and Variational Information Bottleneck (VIB) models on a variety of different datasets and robustness challenges. We find strong empirical evidence supporting our hypothesis that MNI models improve on these problems of robust generalization.

Download Full-text

Algorithms for Detecting and Preventing Attacks on Machine Learning Models in Cyber-Security Problems

Journal of Physics Conference Series ◽

10.1088/1742-6596/2096/1/012099 ◽

2021 ◽

Vol 2096 (1) ◽

pp. 012099

Author(s):

A P Chukhnov ◽

Y S Ivanov

Keyword(s):

Machine Learning ◽

Operating Systems ◽

Cyber Security ◽

Learning Systems ◽

Machine Learning Algorithms ◽

Training Data ◽

Poisoning Effect ◽

The Stability ◽

And Training ◽

Machine Learning Models

Abstract Machine learning algorithms can be vulnerable to many forms of attacks aimed at leading the machine learning systems to make deliberate errors. The article provides an overview of attack technologies on the models and training datasets for the purpose of destructive (poisoning) effect. Experiments have been carried out to implement the existing attacks on various models. A comparative analysis of cyber-resistance of various models, most frequently used in operating systems, to destructive information actions has been prepared. The stability of various models most often used in applied problems to destructive information influences is investigated. The stability of the models is shown in case of poisoning up to 50% of the training data.

Download Full-text

Platform for Analysing and Encouraging Student Activity on Contest and E-learning Systems

OLYMPIADS IN INFORMATICS ◽

10.15388/ioi.2018.07 ◽

2018 ◽

Vol 12 ◽

pp. 85-98

Author(s):

Bojan Kostadinov ◽

Mile Jovanov ◽

Emil STANKOV

Keyword(s):

Machine Learning ◽

Data Collection ◽

Educational Policy ◽

Learning Systems ◽

Data Sources ◽

Or Education ◽

Student Activity ◽

The World ◽

E Learning ◽

Analyse Data

Data collection and machine learning are changing the world. Whether it is medicine, sports or education, companies and institutions are investing a lot of time and money in systems that gather, process and analyse data. Likewise, to improve competitiveness, a lot of countries are making changes to their educational policy by supporting STEM disciplines. Therefore, it’s important to put effort into using various data sources to help students succeed in STEM. In this paper, we present a platform that can analyse student’s activity on various contest and e-learning systems, combine and process the data, and then present it in various ways that are easy to understand. This in turn enables teachers and organizers to recognize talented and hardworking students, identify issues, and/or motivate students to practice and work on areas where they’re weaker.

Download Full-text

Scalable Approach to High Coverages on Oxides via Iterative Training of a Machine-Learning Algorithm

10.26434/chemrxiv.10288514.v1 ◽

2019 ◽

Author(s):

Andrew Medford ◽

Shengchun Yang ◽

Fuzhu Liu

Keyword(s):

Machine Learning ◽

Chemical Potential ◽

Learning Algorithm ◽

Absolute Error ◽

Low Energy ◽

Training Data ◽

High Coverage ◽

Metal Compounds ◽

Adsorption Energies ◽

The Stability

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.

Download Full-text

Optimization of Diabetes Training DATA using Machine Learning Algorithms

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i2.283286 ◽

2018 ◽

Vol 6 (2) ◽

pp. 283-286

Author(s):

M. Samba Siva Rao ◽

◽

M.Yaswanth . ◽

K. Raghavendra Swamy ◽

◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data

Download Full-text

Comparative Analysis of Machine Learning Techniques Using Predictive Modeling

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904164539 ◽

2020 ◽

Vol 13 ◽

Author(s):

Ritu Khandelwal ◽

Hemlata Goyal ◽

Rajveer Singh Shekhawat

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Data Science ◽

Training Data ◽

Machine Learning Techniques ◽

Future Trends ◽

Data Set ◽

Learning Stage ◽

Learning Techniques ◽

Different Types

Introduction: Machine learning is an intelligent technology that works as a bridge between businesses and data science. With the involvement of data science, the business goal focuses on findings to get valuable insights on available data. The large part of Indian Cinema is Bollywood which is a multi-million dollar industry. This paper attempts to predict whether the upcoming Bollywood Movie would be Blockbuster, Superhit, Hit, Average or Flop. For this Machine Learning techniques (classification and prediction) will be applied. To make classifier or prediction model first step is the learning stage in which we need to give the training data set to train the model by applying some technique or algorithm and after that different rules are generated which helps to make a model and predict future trends in different types of organizations. Methods: All the techniques related to classification and Prediction such as Support Vector Machine(SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, Adaboost, and KNN will be applied and try to find out efficient and effective results. All these functionalities can be applied with GUI Based workflows available with various categories such as data, Visualize, Model, and Evaluate. Result: To make classifier or prediction model first step is learning stage in which we need to give the training data set to train the model by applying some technique or algorithm and after that different rules are generated which helps to make a model and predict future trends in different types of organizations Conclusion: This paper focuses on Comparative Analysis that would be performed based on different parameters such as Accuracy, Confusion Matrix to identify the best possible model for predicting the movie Success. By using Advertisement Propaganda, they can plan for the best time to release the movie according to the predicted success rate to gain higher benefits. Discussion: Data Mining is the process of discovering different patterns from large data sets and from that various relationships are also discovered to solve various problems that come in business and helps to predict the forthcoming trends. This Prediction can help Production Houses for Advertisement Propaganda and also they can plan their costs and by assuring these factors they can make the movie more profitable.

Download Full-text