A Probabilistic Bag-to-Class Approach to Multiple-Instance Learning

Kajsa Møllersen; Jon Yngve Hardeberg; Fred Godtliebsen

doi:10.3390/data5020056

A Probabilistic Bag-to-Class Approach to Multiple-Instance Learning

Data ◽

10.3390/data5020056 ◽

2020 ◽

Vol 5 (2) ◽

pp. 56

Author(s):

Kajsa Møllersen ◽

Jon Yngve Hardeberg ◽

Fred Godtliebsen

Keyword(s):

Probability Distribution ◽

Dimensional Space ◽

Probability Distributions ◽

Real Data ◽

Multiple Instance Learning ◽

Classification Problems ◽

Distribution Space ◽

Random Vectors ◽

Feature Vectors ◽

Training Sets

Multi-instance (MI) learning is a branch of machine learning, where each object (bag) consists of multiple feature vectors (instances)—for example, an image consisting of multiple patches and their corresponding feature vectors. In MI classification, each bag in the training set has a class label, but the instances are unlabeled. The instances are most commonly regarded as a set of points in a multi-dimensional space. Alternatively, instances are viewed as realizations of random vectors with corresponding probability distribution, where the bag is the distribution, not the realizations. By introducing the probability distribution space to bag-level classification problems, dissimilarities between probability distributions (divergences) can be applied. The bag-to-bag Kullback–Leibler information is asymptotically the best classifier, but the typical sparseness of MI training sets is an obstacle. We introduce bag-to-class divergence to MI learning, emphasizing the hierarchical nature of the random vectors that makes bags from the same class different. We propose two properties for bag-to-class divergences, and an additional property for sparse training sets, and propose a dissimilarity measure that fulfils them. Its performance is demonstrated on synthetic and real data. The probability distribution space is valid for MI learning, both for the theoretical analysis and applications.

Conditional Mixture Model for Correlated Neuronal Spikes

Neural Computation ◽

10.1162/neco.2010.04-08-766 ◽

2010 ◽

Vol 22 (7) ◽

pp. 1718-1736 ◽

Cited By ~ 2

Author(s):

Shun-ichi Amari

Keyword(s):

Probability Distribution ◽

Mixture Models ◽

Mixture Model ◽

Computational Neuroscience ◽

General Model ◽

Probability Distributions ◽

Real Data ◽

Spike Trains ◽

Cell Assemblies ◽

Competitive Model

Analysis of correlated spike trains is a hot topic of research in computational neuroscience. A general model of probability distributions for spikes includes too many parameters to be of use in analyzing real data. Instead, we need a simple but powerful generative model for correlated spikes. We developed a class of conditional mixture models that includes a number of existing models and analyzed its capabilities and limitations. We apply the model to dynamical aspects of neuron pools. When Hebbian cell assemblies coexist in a pool of neurons, the condition is specified by these assemblies such that the probability distribution of spikes is a mixture of those of the component assemblies. The probabilities of activation of the Hebbian assemblies change dynamically. We used this model as a basis for a competitive model governing the states of assemblies.

SpHMC: Spectral Hamiltonian Monte Carlo

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015516 ◽

2019 ◽

Vol 33 ◽

pp. 5516-5524

Author(s):

Haoyi Xiong ◽

Kafeng Wang ◽

Jiang Bian ◽

Zhanxing Zhu ◽

Cheng-Zhong Xu ◽

...

Keyword(s):

Monte Carlo ◽

Probability Distribution ◽

Real World ◽

Dimensional Space ◽

Probability Distributions ◽

Superior Performance ◽

High Dimensional ◽

Hamiltonian Monte Carlo ◽

Real World Datasets ◽

Low Dimensional

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) methods have been widely used to sample from certain probability distributions, incorporating (kernel) density derivatives and/or given datasets. Instead of exploring new samples from kernel spaces, this piece of work proposed a novel SGHMC sampler, namely Spectral Hamiltonian Monte Carlo (SpHMC), that produces the high dimensional sparse representations of given datasets through sparse sensing and SGHMC. Inspired by compressed sensing, we assume all given samples are low-dimensional measurements of certain high-dimensional sparse vectors, while a continuous probability distribution exists in such high-dimensional space. Specifically, given a dictionary for sparse coding, SpHMC first derives a novel likelihood evaluator of the probability distribution from the loss function of LASSO, then samples from the high-dimensional distribution using stochastic Langevin dynamics with derivatives of the logarithm likelihood and Metropolis–Hastings sampling. In addition, new samples in low-dimensional measuring spaces can be regenerated using the sampled high-dimensional vectors and the dictionary. Extensive experiments have been conducted to evaluate the proposed algorithm using real-world datasets. The performance comparisons on three real-world applications demonstrate the superior performance of SpHMC beyond baseline methods.

Real UAV-Bird Image Classification Using CNN with a Synthetic Dataset

Applied Sciences ◽

10.3390/app11093863 ◽

2021 ◽

Vol 11 (9) ◽

pp. 3863

Author(s):

Ali Emre Öztürk ◽

Ergun Erçelebi

Keyword(s):

Deep Learning ◽

Image Classification ◽

Synthetic Data ◽

Real Data ◽

Corner Detection ◽

Batch Size ◽

Test Accuracy ◽

Classification Problems ◽

Auc Value ◽

Classification Test

A large amount of training image data is required for solving image classification problems using deep learning (DL) networks. In this study, we aimed to train DL networks with synthetic images generated by using a game engine and determine the effects of the networks on performance when solving real-image classification problems. The study presents the results of using corner detection and nearest three-point selection (CDNTS) layers to classify bird and rotary-wing unmanned aerial vehicle (RW-UAV) images, provides a comprehensive comparison of two different experimental setups, and emphasizes the significant improvements in the performance in deep learning-based networks due to the inclusion of a CDNTS layer. Experiment 1 corresponds to training the commonly used deep learning-based networks with synthetic data and an image classification test on real data. Experiment 2 corresponds to training the CDNTS layer and commonly used deep learning-based networks with synthetic data and an image classification test on real data. In experiment 1, the best area under the curve (AUC) value for the image classification test accuracy was measured as 72%. In experiment 2, using the CDNTS layer, the AUC value for the image classification test accuracy was measured as 88.9%. A total of 432 different combinations of trainings were investigated in the experimental setups. The experiments were trained with various DL networks using four different optimizers by considering all combinations of batch size, learning rate, and dropout hyperparameters. The test accuracy AUC values for networks in experiment 1 ranged from 55% to 74%, whereas the test accuracy AUC values in experiment 2 networks with a CDNTS layer ranged from 76% to 89.9%. It was observed that the CDNTS layer has considerable effects on the image classification accuracy performance of deep learning-based networks. AUC, F-score, and test accuracy measures were used to validate the success of the networks.

Tag N’ Train: a technique to train improved classifiers on unlabeled data

Journal of High Energy Physics ◽

10.1007/jhep01(2021)153 ◽

2021 ◽

Vol 2021 (1) ◽

Cited By ~ 2

Author(s):

Oz Amram ◽

Cristina Mantilla Suarez

Keyword(s):

Real Data ◽

Unlabeled Data ◽

Machine Learning Techniques ◽

Jet Physics ◽

Classification Problems ◽

Weak Classifier ◽

Potential Applications ◽

Substantial Progress ◽

Resonance Search ◽

Class Labels

Abstract There has been substantial progress in applying machine learning techniques to classification problems in collider and jet physics. But as these techniques grow in sophistication, they are becoming more sensitive to subtle features of jets that may not be well modeled in simulation. Therefore, relying on simulations for training will lead to sub-optimal performance in data, but the lack of true class labels makes it difficult to train on real data. To address this challenge we introduce a new approach, called Tag N’ Train (TNT), that can be applied to unlabeled data that has two distinct sub-objects. The technique uses a weak classifier for one of the objects to tag signal-rich and background-rich samples. These samples are then used to train a stronger classifier for the other object. We demonstrate the power of this method by applying it to a dijet resonance search. By starting with autoencoders trained directly on data as the weak classifiers, we use TNT to train substantially improved classifiers. We show that Tag N’ Train can be a powerful tool in model-agnostic searches and discuss other potential applications.

Kernel Based Data-Adaptive Support Vector Machines for Multi-Class Classification

Mathematics ◽

10.3390/math9090936 ◽

2021 ◽

Vol 9 (9) ◽

pp. 936

Author(s):

Jianli Shao ◽

Xin Liu ◽

Wenqing He

Keyword(s):

Machine Learning ◽

Spatial Association ◽

Class Imbalance ◽

Imbalanced Data ◽

Real Data ◽

Kernel Functions ◽

Support Vector ◽

Classification Problems ◽

Rare Class ◽

Data Adaptive

Imbalanced data exist in many classification problems. The classification of imbalanced data has remarkable challenges in machine learning. The support vector machine (SVM) and its variants are popularly used in machine learning among different classifiers thanks to their flexibility and interpretability. However, the performance of SVMs is impacted when the data are imbalanced, which is a typical data structure in the multi-category classification problem. In this paper, we employ the data-adaptive SVM with scaled kernel functions to classify instances for a multi-class population. We propose a multi-class data-dependent kernel function for the SVM by considering class imbalance and the spatial association among instances so that the classification accuracy is enhanced. Simulation studies demonstrate the superb performance of the proposed method, and a real multi-class prostate cancer image dataset is employed as an illustration. Not only does the proposed method outperform the competitor methods in terms of the commonly used accuracy measures such as the F-score and G-means, but also successfully detects more than 60% of instances from the rare class in the real data, while the competitors can only detect less than 20% of the rare class instances. The proposed method will benefit other scientific research fields, such as multiple region boundary detection.

WIGNER FUNCTION OF PULSED FIELDS RECONSTRUCTED BY DIRECT DETECTION

International Journal of Quantum Information ◽

10.1142/s0219749911006995 ◽

2011 ◽

Vol 09 (supp01) ◽

pp. 39-47

Author(s):

ALESSIA ALLEVI ◽

MARIA BONDANI ◽

ALESSANDRA ANDREONI

Keyword(s):

Probability Distribution ◽

Wigner Function ◽

Beam Splitter ◽

Direct Detection ◽

Probability Distributions ◽

Linear Regime ◽

Wigner Functions ◽

Intensity Measurements ◽

Pulsed Fields

We present the experimental reconstruction of the Wigner function of some optical states. The method is based on direct intensity measurements by non-ideal photodetectors operated in the linear regime. The signal state is mixed at a beam-splitter with a set of coherent probes of known complex amplitudes and the probability distribution of the detected photons is measured. The Wigner function is given by a suitable sum of these probability distributions measured for different values of the probe. For comparison, the same data are analyzed to obtain the number distributions and the Wigner functions for photons.

Co-localization analysis in fluorescence microscopy via maximum entropy copula

The International Journal of Biostatistics ◽

10.1515/ijb-2019-0019 ◽

2020 ◽

Vol 0 (0) ◽

Author(s):

Zahra Amini Farsani ◽

Volker J. Schmid

Keyword(s):

Fluorescence Microscopy ◽

Maximum Entropy ◽

Probability Distributions ◽

Real Data ◽

Bivariate Distribution ◽

Entropy Method ◽

Gaussian Copula ◽

Microscopy Imaging ◽

High Background ◽

Localization Analysis

AbstractCo-localization analysis is a popular method for quantitative analysis in fluorescence microscopy imaging. The localization of marked proteins in the cell nucleus allows a deep insight into biological processes in the nucleus. Several metrics have been developed for measuring the co-localization of two markers, however, they depend on subjective thresholding of background and the assumption of linearity. We propose a robust method to estimate the bivariate distribution function of two color channels. From this, we can quantify their co- or anti-colocalization. The proposed method is a combination of the Maximum Entropy Method (MEM) and a Gaussian Copula, which we call the Maximum Entropy Copula (MEC). This new method can measure the spatial and nonlinear correlation of signals to determine the marker colocalization in fluorescence microscopy images. The proposed method is compared with MEM for bivariate probability distributions. The new colocalization metric is validated on simulated and real data. The results show that MEC can determine co- and anti-colocalization even in high background settings. MEC can, therefore, be used as a robust tool for colocalization analysis.

Analysis of Magnitude and Frequency of Floods in the Damanganga Basin: Western India

Hydrospatial Analysis ◽

10.21523/gcj3.2021050101 ◽

2021 ◽

Vol 5 (1) ◽

pp. 1-11

Author(s):

Vitthal Anwat ◽

Pramodkumar Hire ◽

Uttam Pawar ◽

Rajendra Gunjal

Keyword(s):

Probability Distribution ◽

Probability Distributions ◽

Flood Frequency ◽

Flood Frequency Analysis ◽

Western India ◽

Type I ◽

Return Periods ◽

Pearson Type ◽

Kolmogorov Smirnov ◽

Anderson Darling

Flood Frequency Analysis (FFA) method was introduced by Fuller in 1914 to understand the magnitude and frequency of floods. The present study is carried out using the two most widely accepted probability distributions for FFA in the world namely, Gumbel Extreme Value type I (GEVI) and Log Pearson type III (LP-III). The Kolmogorov-Smirnov (KS) and Anderson-Darling (AD) methods were used to select the most suitable probability distribution at sites in the Damanganga Basin. Moreover, discharges were estimated for various return periods using GEVI and LP-III. The recurrence interval of the largest peak flood on record (Qmax) is 107 years (at Nanipalsan) and 146 years (at Ozarkhed) as per LP-III. Flood Frequency Curves (FFC) specifies that LP-III is the best-fitted probability distribution for FFA of the Damanganga Basin. Therefore, estimated discharges and return periods by LP-III probability distribution are more reliable and can be used for designing hydraulic structures.

Analysis and Synthesis of Mechanical Error in Universal Joints

16th Design Automation Conference: Volume 2 — Optimal Design and Mechanical Systems Analysis ◽

10.1115/detc1990-0090 ◽

1990 ◽

Author(s):

J. L. Cagney ◽

S. S. Rao

Keyword(s):

Probability Distribution ◽

Real World ◽

Probability Distributions ◽

Manufacturing Cost ◽

Universal Joint ◽

Accuracy Requirement ◽

Output Error ◽

Manufacturing Errors ◽

Analysis And Synthesis ◽

Limiting Value

Abstract The modeling of manufacturing errors in mechanisms is a significant task to validate practical designs. The use of probability distributions for errors can simulate manufacturing variations and real world operations. This paper presents the mechanical error analysis of universal joint drivelines. Each error is simulated using a probability distribution, i.e., a design of the mechanism is created by assigning random values to the errors. Each design is then evaluated by comparing the output error with a limiting value and the reliability of the universal joint is estimated. For this, the design is considered a failure whenever the output error exceeds the specified limit. In addition, the problem of synthesis, which involves the allocation of tolerances (errors) for minimum manufacturing cost without violating a specified accuracy requirement of the output, is also considered. Three probability distributions — normal, Weibull and beta distributions — were used to simulate the random values of the errors. The similarity of the results given by the three distributions suggests that the use of normal distribution would be acceptable for modeling the tolerances in most cases.

Field-theoretic density estimation for biological sequence space with applications to 5′ splice site diversity and aneuploidy in cancer

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2025782118 ◽

2021 ◽

Vol 118 (40) ◽

pp. e2025782118

Author(s):

Wei-Chia Chen ◽

Juannan Zhou ◽

Jason M. Sheltzer ◽

Justin B. Kinney ◽

David M. McCandlish

Keyword(s):

Probability Distribution ◽

Maximum Entropy ◽

Density Estimation ◽

Sequence Space ◽

Chromosomal Abnormalities ◽

Fundamental Problem ◽

Probability Distributions ◽

Biological Sequence ◽

Point Estimates ◽

Site Diversity

Density estimation in sequence space is a fundamental problem in machine learning that is also of great importance in computational biology. Due to the discrete nature and large dimensionality of sequence space, how best to estimate such probability distributions from a sample of observed sequences remains unclear. One common strategy for addressing this problem is to estimate the probability distribution using maximum entropy (i.e., calculating point estimates for some set of correlations based on the observed sequences and predicting the probability distribution that is as uniform as possible while still matching these point estimates). Building on recent advances in Bayesian field-theoretic density estimation, we present a generalization of this maximum entropy approach that provides greater expressivity in regions of sequence space where data are plentiful while still maintaining a conservative maximum entropy character in regions of sequence space where data are sparse or absent. In particular, we define a family of priors for probability distributions over sequence space with a single hyperparameter that controls the expected magnitude of higher-order correlations. This family of priors then results in a corresponding one-dimensional family of maximum a posteriori estimates that interpolate smoothly between the maximum entropy estimate and the observed sample frequencies. To demonstrate the power of this method, we use it to explore the high-dimensional geometry of the distribution of 5′ splice sites found in the human genome and to understand patterns of chromosomal abnormalities across human cancers.