Certifai: A Toolkit for Building Trust in AI Systems

Author(s):  
Jette Henderson ◽  
Shubham Sharma ◽  
Alan Gee ◽  
Valeri Alexiev ◽  
Steve Draper ◽  
...  

As more companies and governments build and use machine learning models to automate decisions, there is an ever-growing need to monitor and evaluate these models' behavior once they are deployed. Our team at CognitiveScale has developed a toolkit called Cortex Certifai to address this need. Cortex Certifai is a framework that assesses aspects of robustness, fairness, and interpretability of any classification or regression model trained on tabular data, without requiring access to its internal workings. Additionally, Cortex Certifai allows users to compare models along these different axes and only requires 1) query access to the model and 2) an “evaluation” dataset. At its foundation, Cortex Certifai generates counterfactual explanations, which are synthetic data points close to input data points but differing in terms of model prediction. The tool then harnesses characteristics of these counterfactual explanations to analyze different aspects of the supplied model and delivers evaluations relevant to a variety of stakeholders (e.g., model developers, risk analysts, compliance officers). Cortex Certifai can be configured and executed using a command-line interface (CLI), within Jupyter notebooks, or on the cloud, and the results are recorded in JSON files that can be visualized in an interactive console. Using these reports, stakeholders can understand, monitor, and build trust in their AI systems. In this paper, we provide a brief overview of a demonstration of Cortex Certifai's capabilities.
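To make the core idea concrete, here is a minimal sketch of model-agnostic counterfactual search by random perturbation. This is not Certifai's actual algorithm (the toolkit's internals are not described here); the stand-in model, the `counterfactual` helper, and all parameters are illustrative assumptions.

```python
# Hedged sketch (not Certifai's algorithm): randomly perturb a query point
# until the black-box model's prediction flips, then keep the closest such
# point as a counterfactual explanation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)   # stand-in for any query-access model

def counterfactual(model, x, n_trials=5000, scale=1.0, seed=0):
    """Return the perturbed point closest to x whose prediction differs."""
    rng = np.random.default_rng(seed)
    original = model.predict(x.reshape(1, -1))[0]
    candidates = x + rng.normal(0.0, scale, size=(n_trials, x.size))
    flipped = candidates[model.predict(candidates) != original]
    if flipped.size == 0:
        return None                       # no counterfactual found in budget
    distances = np.linalg.norm(flipped - x, axis=1)
    return flipped[np.argmin(distances)]

cf = counterfactual(model, X[0])
print("query point:", X[0], "\ncounterfactual:", cf)
```

The distance between a point and its nearest counterfactual is the kind of characteristic a tool can aggregate into robustness and fairness scores.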

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
C J Battey ◽  
Gabrielle C Coffing ◽  
Andrew D Kern

Abstract
Dimensionality reduction is a common tool for visualization and inference of population structure from genotypes, but popular methods either return too many dimensions for easy plotting (PCA) or fail to preserve global geometry (t-SNE and UMAP). Here we explore the utility of variational autoencoders (VAEs)—generative machine learning models in which a pair of neural networks seek to first compress and then recreate the input data—for visualizing population genetic variation. VAEs incorporate nonlinear relationships, allow users to define the dimensionality of the latent space, and in our tests preserve global geometry better than t-SNE and UMAP. Our implementation, which we call popvae, is available as a command-line python program at github.com/kr-colab/popvae. The approach yields latent embeddings that capture subtle aspects of population structure in humans and Anopheles mosquitoes, and can generate artificial genotypes characteristic of a given sample or population.
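The following is a minimal VAE sketch for genotype-like data, illustrating the encoder/decoder pair and the reparameterization trick the abstract alludes to. It is not popvae's code; the `GenotypeVAE` class, layer sizes, and toy data are assumptions for illustration.

```python
# Hedged sketch of a VAE for genotype matrices (not popvae itself): the
# encoder maps genotypes to a 2-D latent space suitable for plotting, and
# the decoder reconstructs them, enabling generation of artificial genotypes.
import torch
import torch.nn as nn

class GenotypeVAE(nn.Module):
    def __init__(self, n_snps, latent_dim=2, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_snps, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_snps), nn.Sigmoid())  # outputs in [0, 1]

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        return self.decoder(z), mu, logvar

vae = GenotypeVAE(n_snps=1000)
x = torch.rand(64, 1000)                  # toy batch of rescaled genotypes
recon, mu, logvar = vae(x)
# Loss = reconstruction error + KL divergence to the standard normal prior.
loss = nn.functional.binary_cross_entropy(recon, x, reduction="sum") \
       - 0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
```

After training, plotting `mu` for each sample gives the 2-D embedding; sampling `z` from the prior and decoding yields artificial genotypes.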


Author(s):  
Shuyuan Lin ◽  
Guobao Xiao ◽  
Yan Yan ◽  
David Suter ◽  
Hanzi Wang

Recently, some hypergraph-based methods have been proposed to deal with the problem of model fitting in computer vision, mainly due to the superior capability of hypergraphs to represent complex relationships between data points. However, a hypergraph becomes extremely complicated when the input includes a large number of data points (usually contaminated with noise and outliers), which significantly increases the computational burden. To overcome this problem, we propose a novel hypergraph optimization-based model fitting (HOMF) method to construct a simple but effective hypergraph. Specifically, HOMF includes two main parts: an adaptive inlier estimation algorithm for vertex optimization and an iterative hyperedge optimization algorithm for hyperedge optimization. The proposed method is highly efficient, and it can obtain accurate model fitting results within a few iterations. Moreover, HOMF can directly apply spectral clustering to achieve good fitting performance. Extensive experimental results show that HOMF outperforms several state-of-the-art model fitting methods on both synthetic data and real images, especially in sampling efficiency and in handling data with severe outliers.
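For readers unfamiliar with robust model fitting, here is a hedged sketch of the generic hypothesize-and-verify loop with adaptive inlier estimation that methods like HOMF build on. It is not the authors' algorithm (no hypergraph construction is shown); the line-fitting task, MAD-based threshold, and all constants are illustrative assumptions.

```python
# Hedged sketch of hypothesize-and-verify robust fitting (not HOMF itself):
# sample minimal subsets, fit a line, and estimate inliers adaptively from
# the residual distribution instead of using a fixed threshold.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
inliers = np.column_stack([x, 2 * x + 1 + rng.normal(0, 0.1, 100)])
outliers = rng.uniform(0, 20, (40, 2))        # gross outliers / clutter
data = np.vstack([inliers, outliers])

best_model, best_count = None, 0
for _ in range(200):                          # hypothesis generation
    p, q = data[rng.choice(len(data), size=2, replace=False)]
    if abs(q[0] - p[0]) < 1e-9:
        continue                              # skip degenerate vertical pairs
    a = (q[1] - p[1]) / (q[0] - p[0])
    b = p[1] - a * p[0]
    residuals = np.abs(data[:, 1] - (a * data[:, 0] + b))
    scale = 1.4826 * np.median(np.abs(residuals - np.median(residuals)))  # MAD
    count = int((residuals < 2.5 * scale).sum())
    if count > best_count:                    # keep best-supported hypothesis
        best_model, best_count = (a, b), count

print("estimated line: y = %.2f x + %.2f (%d inliers)" % (*best_model, best_count))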



1994 ◽  
Vol 05 (05) ◽  
pp. 805-809 ◽  
Author(s):  
SALIM G. ANSARI ◽  
PAOLO GIOMMI ◽  
ALBERTO MICOL

On 3 November 1993, ESIS announced its homepage on the World Wide Web (WWW) to the user community. Since then, ESIS has steadily increased its Web support to the astronomical community to include a bibliographic service, the ESIS catalogue documentation, and the ESIS Data Browser; more functionality will be added in the near future. All these services share a common ESIS structure that is also used by other ESIS user paradigms, such as the ESIS Graphical User Interface (Giommi and Ansari, 1993) and the ESIS Command Line Interface. Following a forms-based paradigm, each ESIS Web application interfaces to the hypertext transfer protocol (HTTP), translating queries to and from the hypertext markup language (HTML) format understood by the NCSA Mosaic interface. In this paper, we discuss the ESIS system and show how each ESIS service works on the World Wide Web client.


2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Lev V. Utkin

A fuzzy classification model is studied in this paper. It is based on the contaminated (robust) model, which produces fuzzy expected risk measures characterizing classification errors. Optimal classification parameters of the model are derived by minimizing the fuzzy expected risk. It is shown that the algorithm for computing the classification parameters reduces to a set of standard support vector machine tasks with weighted data points. Experimental results with synthetic data illustrate the proposed fuzzy model.
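To illustrate the building block the reduction targets, here is a hedged sketch of a standard SVM solved with per-sample weights. It shows only how weights enter a single subproblem (via scikit-learn's `sample_weight`); the weights themselves are hypothetical, not derived from the paper's fuzzy risk model.

```python
# Hedged sketch: one weighted-SVM subproblem of the kind the fuzzy model
# reduces to. The per-sample weights here are arbitrary placeholders, not
# the paper's fuzzy expected-risk weights.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
weights = np.where(y == 1, 2.0, 0.5)   # hypothetical data-point weights
clf = SVC(kernel="rbf").fit(X, y, sample_weight=weights)
print("training accuracy:", clf.score(X, y))
```

Solving a set of such weighted tasks and combining their solutions is what makes the fuzzy model computationally tractable.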


Symmetry ◽  
2019 ◽  
Vol 11 (2) ◽  
pp. 227
Author(s):  
Eckart Michaelsen ◽  
Stéphane Vujasinovic

Representative input data are a necessary requirement for the assessment of machine-vision systems. For symmetry-seeing machines in particular, such imagery should provide symmetries as well as asymmetric clutter. Moreover, reliable ground truth must accompany the data, and it should be possible to estimate recognition performance and computational effort by providing different grades of difficulty and complexity. Recent competitions used real imagery labeled by human subjects with appropriate ground truth. The paper at hand proposes to use synthetic data instead. Such data contain symmetry, clutter, and nothing else. This is preferable because interference with other perceptive capabilities, such as object recognition or prior knowledge, can be avoided. The data are given sparsely, i.e., as sets of primitive objects. However, images can be generated from them, so that the same data can also be fed into machines requiring dense input, such as multilayered perceptrons. Sparse representations are preferred because the author's own system requires such data and because any influence of the primitive-extraction method is thereby excluded. The presented format allows hierarchies of symmetries. This is important because hierarchy constitutes a natural and dominant part of symmetry-seeing. The paper reports some experiments using the author's Gestalt algebra system as the symmetry-seeing machine, and additionally includes a comparative test run with the state-of-the-art symmetry-seeing deep-learning convolutional perceptron of the PSU. The computational efforts and recognition performance are assessed.
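The following is a hedged sketch of how such sparse synthetic data might be generated: primitives mirrored about a known axis, mixed with clutter, with ground-truth labels kept alongside. The record layout is hypothetical, not the paper's published format.

```python
# Hedged sketch of sparse synthetic symmetry data (hypothetical format):
# primitive objects mirrored about a known vertical axis plus random
# asymmetric clutter, with ground-truth membership labels.
import numpy as np

rng = np.random.default_rng(42)
axis_x = 0.5                                  # mirror axis (ground truth)

halves = rng.uniform(0, 0.5, size=(10, 2))    # primitives left of the axis
mirrored = np.column_stack([2 * axis_x - halves[:, 0], halves[:, 1]])
clutter = rng.uniform(0, 1, size=(15, 2))     # asymmetric distractors

primitives = np.vstack([halves, mirrored, clutter])
labels = np.array([1] * 20 + [0] * 15)        # 1 = belongs to the symmetry
print(primitives.shape, "primitives;", labels.sum(), "belong to the symmetry")
```

Because the primitives are explicit, the same scene can be rasterized into an image for machines that require dense input, while the labels provide exact ground truth for scoring.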


Processes ◽  
2022 ◽  
Vol 10 (1) ◽  
pp. 158
Author(s):  
Ain Cheon ◽  
Jwakyung Sung ◽  
Hangbae Jun ◽  
Heewon Jang ◽  
Minji Kim ◽  
...  

The application of a machine learning (ML) model to bio-electrochemical anaerobic digestion (BEAD) is a future-oriented approach for improving process stability by predicting performances that have nonlinear relationships with various operational parameters. Five ML models, including tree-, regression-, and neural-network-based algorithms, were applied to predict the methane yield in a BEAD reactor. The results showed that various 1-step-ahead ML models, which utilize prior BEAD performance data, could enhance prediction accuracy. In addition, the 1-step-ahead algorithm with retraining improved prediction accuracy by 37.3% compared with the conventional multi-step-ahead algorithm. The improvement was particularly noteworthy in tree- and regression-based ML models. Moreover, the 1-step-ahead algorithm with retraining showed high potential for efficient prediction using pH as a single input, which is an easier parameter to monitor than the other parameters required by bioprocess models.
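The sketch below illustrates the 1-step-ahead-with-retraining scheme on a toy series: at each step the model is refit on all data seen so far and predicts only the next value. It is a generic walk-forward loop, not the paper's models or data; the lag count, forest size, and synthetic series are assumptions.

```python
# Hedged sketch of 1-step-ahead prediction with retraining (the scheme,
# not the paper's models): refit on all observed data at every step and
# predict only one step into the future.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)  # toy yield
lags = 3
X = np.column_stack([series[i:len(series) - lags + i] for i in range(lags)])
y = series[lags:]

predictions = []
for t in range(150, len(y)):                  # walk forward through time
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[:t], y[:t])                   # retrain on everything seen so far
    predictions.append(model.predict(X[t:t + 1])[0])

rmse = np.sqrt(np.mean((np.array(predictions) - y[150:]) ** 2))
print(f"walk-forward RMSE: {rmse:.3f}")
```

A multi-step-ahead baseline would instead train once at t=150 and forecast the whole remainder, which is why retraining tends to track regime changes better.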


2021 ◽  
Vol 9 ◽  
Author(s):  
Caio Ribeiro ◽  
Lucas Oliveira ◽  
Romina Batista ◽  
Marcos De Sousa

The use of Ultraconserved Elements (UCEs) as genetic markers in phylogenomics has become popular and has provided promising results. Although UCE data can be easily obtained from targeted enriched sequencing, the protocol for in silico analysis of UCEs consists of the execution of heterogeneous and complex tools, a challenge for scientists without training in bioinformatics. Developing tools that adopt best practices in research software can lessen this problem by improving the execution of computational experiments, thus promoting better reproducibility. We present UCEasy, an easy-to-install and easy-to-use software package with a simple command-line interface that facilitates the computational analysis of UCEs from sequencing samples, following the best practices of research software. UCEasy is a wrapper that standardises, automates, and simplifies the quality control of raw reads, the assembly, and the extraction and alignment of UCEs, finally generating a data matrix with different levels of completeness that can be used to infer phylogenetic trees. We demonstrate the functionalities of UCEasy by reproducing the published results of phylogenomic studies of the bird genus Turdus (Aves) and of Adephaga families (Coleoptera), efficiently extracting UCEs from their genomic datasets.
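As a hedged illustration of the wrapper pattern such tools rely on, the sketch below runs a sequence of pipeline stages as external commands with uniform logging and fail-fast error handling. The stage names and placeholder commands are hypothetical; this is not UCEasy's actual CLI or code.

```python
# Hedged sketch of a pipeline-wrapper pattern (hypothetical stages and
# commands, not UCEasy's real interface): each stage is an external tool
# run with a checked exit status and uniform logging.
import subprocess
import sys

STAGES = [
    ("quality_control", ["echo", "trimming raw reads"]),      # placeholders
    ("assembly",        ["echo", "assembling contigs"]),
    ("uce_extraction",  ["echo", "extracting and aligning UCEs"]),
]

def run_pipeline():
    for name, cmd in STAGES:
        print(f"[pipeline] running stage: {name}", file=sys.stderr)
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:            # fail fast with a uniform error
            sys.exit(f"[pipeline] stage {name} failed:\n{result.stderr}")
        print(result.stdout, end="")

if __name__ == "__main__":
    run_pipeline()
```

Wrapping heterogeneous tools behind one entry point like this is what lets a single command reproduce an entire analysis.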


Author(s):  
Dr. C. K. Gomathy

Abstract: Apache Sqoop is mainly used to efficiently transfer large volumes of data between Apache Hadoop and relational databases. It helps offload certain tasks, such as ETL (extract, transform, load) processing, from an enterprise data warehouse to Hadoop, where they can be executed efficiently at much lower cost. Here, we first import a table residing in a MySQL database with the help of the command-line interface application called Sqoop. Because new rows may be added and existing rows updated, the query would otherwise have to be executed again after every change. Our project removes the need to re-execute queries manually: we define a Sqoop job that holds all the commands for the import, and after the import we retrieve the data from Hive using Java JDBC and convert it to JSON format, an organized and easily accessible representation, using the GSON library. Keywords: Sqoop, JSON, GSON, Maven, JDBC
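The sketch below drives the incremental-import part of this workflow from Python. The `sqoop job` and `sqoop import` flags shown (`--connect`, `--incremental append`, `--check-column`, `--last-value`, `--hive-import`) are standard Sqoop syntax, but the database name, table, and credentials are hypothetical placeholders.

```python
# Hedged sketch of the import workflow (connection details hypothetical):
# a saved Sqoop job with incremental append picks up only new rows on each
# run, so the full import never has to be repeated by hand.
import shlex
import subprocess

CREATE_JOB = (
    "sqoop job --create customer_import -- import "
    "--connect jdbc:mysql://localhost/shop --table customers "
    "--username dbuser -P --hive-import "          # -P prompts for password
    "--incremental append --check-column id --last-value 0"
)
RUN_JOB = "sqoop job --exec customer_import"       # rerun to fetch new rows

for cmd in (CREATE_JOB, RUN_JOB):
    subprocess.run(shlex.split(cmd), check=True)   # fail fast on errors
```

Sqoop stores the job's `--last-value` after each execution, which is what makes repeated runs incremental rather than full re-imports.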


2020 ◽  
Vol 69 ◽  
pp. 1255-1285
Author(s):  
Ricardo Cardoso Pereira ◽  
Miriam Seoane Santos ◽  
Pedro Pereira Rodrigues ◽  
Pedro Henriques Abreu

Missing data is a problem often found in real-world datasets and it can degrade the performance of most machine learning models. Several deep learning techniques have been used to address this issue, and one of them is the Autoencoder and its Denoising and Variational variants. These models are able to learn a representation of the data with missing values and generate plausible new ones to replace them. This study surveys the use of Autoencoders for the imputation of tabular data and considers 26 works published between 2014 and 2020. The analysis is mainly focused on discussing patterns and recommendations for the architecture, hyperparameters and training settings of the network, while providing a detailed discussion of the results obtained by Autoencoders when compared to other state-of-the-art methods, and of the data contexts where they have been applied. The conclusions include a set of recommendations for the technical settings of the network, and show that Denoising Autoencoders outperform their competitors, particularly the often used statistical methods.
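The sketch below shows the generic denoising-autoencoder imputation pattern the survey covers: train the network to reconstruct complete rows from artificially corrupted ones, then fill missing entries with its outputs. It is not any surveyed paper's architecture; the layer sizes, corruption rate, and toy data are assumptions.

```python
# Hedged sketch of denoising-autoencoder imputation (generic pattern, not a
# specific surveyed architecture): learn to reconstruct clean rows from
# corrupted ones, then replace missing entries with the reconstruction.
import torch
import torch.nn as nn

n_features = 20
dae = nn.Sequential(
    nn.Linear(n_features, 10), nn.ReLU(),
    nn.Linear(10, n_features))
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)

X = torch.rand(256, n_features)                  # toy complete training data
for _ in range(200):                             # train with artificial corruption
    mask = (torch.rand_like(X) > 0.2).float()    # randomly drop ~20% of entries
    loss = nn.functional.mse_loss(dae(X * mask), X)
    opt.zero_grad(); loss.backward(); opt.step()

x_row = torch.rand(1, n_features)                # row with missing values
miss = torch.zeros(1, n_features); miss[0, :5] = 1   # first 5 entries missing
x_input = x_row * (1 - miss)                     # zero out the missing entries
with torch.no_grad():
    x_imputed = torch.where(miss.bool(), dae(x_input), x_row)
```

Variational variants replace the deterministic bottleneck with a latent distribution, but the corrupt-reconstruct-replace loop stays the same.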

