scholarly journals Multiple Noisy Label Distribution Propagation for Crowdsourcing

Author(s):  
Hao Zhang ◽  
Liangxiao Jiang ◽  
Wenqiang Xu

Crowdsourcing services provide a fast, efficient, and cost-effective means of obtaining large labeled data for supervised learning. Ground truth inference, also called label integration, designs proper aggregation strategies to infer the unknown true label of each instance from the multiple noisy label set provided by ordinary crowd workers. However, to the best of our knowledge, nearly all existing label integration methods focus solely on the multiple noisy label set itself of the individual instance while totally ignoring the intercorrelation among multiple noisy label sets of different instances. To solve this problem, a multiple noisy label distribution propagation (MNLDP) method is proposed in this study. MNLDP first transforms the multiple noisy label set of each instance into its multiple noisy label distribution and then propagates its multiple noisy label distribution to its nearest neighbors. Consequently, each instance absorbs a fraction of the multiple noisy label distributions from its nearest neighbors and yet simultaneously maintains a fraction of its own original multiple noisy label distribution. Promising experimental results on simulated and real-world datasets validate the effectiveness of our proposed method.

2019 ◽  
Vol 4 (1) ◽  
Author(s):  
Milos Kudelka ◽  
Eliska Ochodkova ◽  
Sarka Zehnalova ◽  
Jakub Plesnik

Abstract The existence of groups of nodes with common characteristics and the relationships between these groups are important factors influencing the structures of social, technological, biological, and other networks. Uncovering such groups and the relationships between them is, therefore, necessary for understanding these structures. Groups can either be found by detection algorithms based solely on structural analysis or identified on the basis of more in-depth knowledge of the processes taking place in networks. In the first case, these are mainly algorithms detecting non-overlapping communities or communities with small overlaps. The latter case is about identifying ground-truth communities, also on the basis of characteristics other than only network structure. Recent research into ground-truth communities shows that in real-world networks, there are nested communities or communities with large and dense overlaps which we are not yet able to detect satisfactorily only on the basis of structural network properties.In our approach, we present a new perspective on the problem of group detection using only the structural properties of networks. Its main contribution is pointing out the existence of large and dense overlaps of detected groups. We use the non-symmetric structural similarity between pairs of nodes, which we refer to as dependency, to detect groups that we call zones. Unlike other approaches, we are able, thanks to non-symmetry, accurately to describe the prominent nodes in the zones which are responsible for large zone overlaps and the reasons why overlaps occur. The individual zones that are detected provide new information associated in particular with the non-symmetric relationships within the group and the roles that individual nodes play in the zone. From the perspective of global network structure, because of the non-symmetric node-to-node relationships, we explore new properties of real-world networks that describe the differences between various types of networks.


2020 ◽  
Vol 34 (04) ◽  
pp. 6837-6844
Author(s):  
Xiaojin Zhang ◽  
Honglei Zhuang ◽  
Shengyu Zhang ◽  
Yuan Zhou

We study a variant of the thresholding bandit problem (TBP) in the context of outlier detection, where the objective is to identify the outliers whose rewards are above a threshold. Distinct from the traditional TBP, the threshold is defined as a function of the rewards of all the arms, which is motivated by the criterion for identifying outliers. The learner needs to explore the rewards of the arms as well as the threshold. We refer to this problem as "double exploration for outlier detection". We construct an adaptively updated confidence interval for the threshold, based on the estimated value of the threshold in the previous rounds. Furthermore, by automatically trading off exploring the individual arms and exploring the outlier threshold, we provide an efficient algorithm in terms of the sample complexity. Experimental results on both synthetic datasets and real-world datasets demonstrate the efficiency of our algorithm.


2021 ◽  
Vol 4 ◽  
Author(s):  
Bradley Butcher ◽  
Vincent S. Huang ◽  
Christopher Robinson ◽  
Jeremy Reffin ◽  
Sema K. Sgaier ◽  
...  

Developing data-driven solutions that address real-world problems requires understanding of these problems’ causes and how their interaction affects the outcome–often with only observational data. Causal Bayesian Networks (BN) have been proposed as a powerful method for discovering and representing the causal relationships from observational data as a Directed Acyclic Graph (DAG). BNs could be especially useful for research in global health in Lower and Middle Income Countries, where there is an increasing abundance of observational data that could be harnessed for policy making, program evaluation, and intervention design. However, BNs have not been widely adopted by global health professionals, and in real-world applications, confidence in the results of BNs generally remains inadequate. This is partially due to the inability to validate against some ground truth, as the true DAG is not available. This is especially problematic if a learned DAG conflicts with pre-existing domain doctrine. Here we conceptualize and demonstrate an idea of a “Causal Datasheet” that could approximate and document BN performance expectations for a given dataset, aiming to provide confidence and sample size requirements to practitioners. To generate results for such a Causal Datasheet, a tool was developed which can generate synthetic Bayesian networks and their associated synthetic datasets to mimic real-world datasets. The results given by well-known structure learning algorithms and a novel implementation of the OrderMCMC method using the Quotient Normalized Maximum Likelihood score were recorded. These results were used to populate the Causal Datasheet, and recommendations could be made dependent on whether expected performance met user-defined thresholds. We present our experience in the creation of Causal Datasheets to aid analysis decisions at different stages of the research process. First, one was deployed to help determine the appropriate sample size of a planned study of sexual and reproductive health in Madhya Pradesh, India. Second, a datasheet was created to estimate the performance of an existing maternal health survey we conducted in Uttar Pradesh, India. Third, we validated generated performance estimates and investigated current limitations on the well-known ALARM dataset. Our experience demonstrates the utility of the Causal Datasheet, which can help global health practitioners gain more confidence when applying BNs.


Author(s):  
Lei Feng ◽  
Bo An

Partial label learning deals with the problem where each training instance is assigned a set of candidate labels, only one of which is correct. This paper provides the first attempt to leverage the idea of self-training for dealing with partially labeled examples. Specifically, we propose a unified formulation with proper constraints to train the desired model and perform pseudo-labeling jointly. For pseudo-labeling, unlike traditional self-training that manually differentiates the ground-truth label with enough high confidence, we introduce the maximum infinity norm regularization on the modeling outputs to automatically achieve this consideratum, which results in a convex-concave optimization problem. We show that optimizing this convex-concave problem is equivalent to solving a set of quadratic programming (QP) problems. By proposing an upper-bound surrogate objective function, we turn to solving only one QP problem for improving the optimization efficiency. Extensive experiments on synthesized and real-world datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art partial label learning approaches.


2019 ◽  
Vol 9 (1) ◽  
pp. 226-242
Author(s):  
Nesma Settouti ◽  
Khalida Douibi ◽  
Mohammed El Amine Bechar ◽  
Mostafa El Habib Daho ◽  
Meryem Saidi

AbstractOver the last few years, Multi-label classification has received significant attention from researchers to solve many issues in many fields. The manual annotation of available datasets is time-consuming and need a huge effort from the expert, especially for Multi-label applications in which each example of learning is associated with many labels at once. To overcome the manual annotation drawback, and to take advantages from the large amounts of unlabeled data, many semi-supervised approaches were proposed in the literature to give more sophisticated and fast solutions to support the automatic labeling of the unlabeled data. In this paper, a Collaborative Bagged Multi-label K-Nearest-Neighbors (CobMLKNN) algorithm is proposed, that extend the co-Training paradigm by a Multi-label K-Nearest-Neighbors algorithm. Experiments on ten real-world Multi-label datasets show the effectiveness of CobMLKNN algorithm to improve the performance of MLKNN to learn from a small number of labeled samples by exploiting unlabeled samples.


Author(s):  
Ning Xu ◽  
An Tao ◽  
Xin Geng

Label distribution is more general than both single-label annotation and multi-label annotation. It covers a certain number of labels, representing the degree to which each label describes the instance. The learning process on the instances labeled by label distributions is called label distribution learning (LDL). Unfortunately, many training sets only contain simple logical labels rather than label distributions due to the difficulty of obtaining the label distributions directly.  To solve the problem, one way is to recover the label distributions from the logical labels in the training set via leveraging the topological information of the feature space and the correlation among the labels. Such process of recovering label distributions from logical labels is defined as label enhancement (LE), which reinforces the supervision information in the training sets. This paper proposes a novel LE algorithm called Graph Laplacian Label Enhancement (GLLE). Experimental results on one artificial dataset and fourteen real-world datasets show clear advantages of GLLE over several existing LE algorithms.


2021 ◽  
Author(s):  
Tarun Kumer Biswas

The Influence Maximization (IM) problem aims at maximizing the diffusion of information or adoption of products among users in a social network by identifying and activating a set of initial users. In real-life applications, it is not unrealistic to have a higher activation cost for a user with higher influence. However, the existing works on IM consider finding the most influential users as the seed set, ignoring either the activation costs of such individual nodes and the total budget or the size of the seed set, which may not be always an optimal solution, particularly from the financial and managerial perspectives, respectively. To address these issues, we propose a more realistic and generalized formulation termed as multi-constraint influence maximization (MCIM) aiming to achieve a cost-effective solution under both budgetary and cardinality constraints. Unlike the existing IM formulations, the proposed MCIM is no longer a monotone but a submodular function. As it is also proved to be an NP-hard problem, we propose a simple additive weighting (SAW) assisted differential evolution (DE) algorithm for solving the large-size real-world problems. Experimental results on four real-world datasets show that the proposed formulation and algorithm are effective in finding a cost-effective seed set.


2020 ◽  
Vol 34 (04) ◽  
pp. 6510-6517 ◽  
Author(s):  
Ning Xu ◽  
Yun-Peng Liu ◽  
Xin Geng

Partial multi-label learning (PML) aims to learn from training examples each associated with a set of candidate labels, among which only a subset are valid for the training example. The common strategy to induce predictive model is trying to disambiguate the candidate label set, such as identifying the ground-truth label via utilizing the confidence of each candidate label or estimating the noisy labels in the candidate label sets. Nonetheless, these strategies ignore considering the essential label distribution corresponding to each instance since the label distribution is not explicitly available in the training set. In this paper, a new partial multi-label learning strategy named Pml-ld is proposed to learn from partial multi-label examples via label enhancement. Specifically, label distributions are recovered by leveraging the topological information of the feature space and the correlations among the labels. After that, a multi-class predictive model is learned by fitting a regularized multi-output regressor with the recovered label distributions. Experimental results on synthetic as well as real-world datasets clearly validate the effectiveness of Pml-ld for solving PML problems.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Madalina Ciortan ◽  
Matthieu Defrance

Abstract Background Single-cell RNA sequencing (scRNA-seq) has emerged has a main strategy to study transcriptional activity at the cellular level. Clustering analysis is routinely performed on scRNA-seq data to explore, recognize or discover underlying cell identities. The high dimensionality of scRNA-seq data and its significant sparsity accentuated by frequent dropout events, introducing false zero count observations, make the clustering analysis computationally challenging. Even though multiple scRNA-seq clustering techniques have been proposed, there is no consensus on the best performing approach. On a parallel research track, self-supervised contrastive learning recently achieved state-of-the-art results on images clustering and, subsequently, image classification. Results We propose contrastive-sc, a new unsupervised learning method for scRNA-seq data that perform cell clustering. The method consists of two consecutive phases: first, an artificial neural network learns an embedding for each cell through a representation training phase. The embedding is then clustered in the second phase with a general clustering algorithm (i.e. KMeans or Leiden community detection). The proposed representation training phase is a new adaptation of the self-supervised contrastive learning framework, initially proposed for image processing, to scRNA-seq data. contrastive-sc has been compared with ten state-of-the-art techniques. A broad experimental study has been conducted on both simulated and real-world datasets, assessing multiple external and internal clustering performance metrics (i.e. ARI, NMI, Silhouette, Calinski scores). Our experimental analysis shows that constastive-sc compares favorably with state-of-the-art methods on both simulated and real-world datasets. Conclusion On average, our method identifies well-defined clusters in close agreement with ground truth annotations. Our method is computationally efficient, being fast to train and having a limited memory footprint. contrastive-sc maintains good performance when only a fraction of input cells is provided and is robust to changes in hyperparameters or network architecture. The decoupling between the creation of the embedding and the clustering phase allows the flexibility to choose a suitable clustering algorithm (i.e. KMeans when the number of expected clusters is known, Leiden otherwise) or to integrate the embedding with other existing techniques.


Author(s):  
Haobo Wang ◽  
Weiwei Liu ◽  
Yang Zhao ◽  
Tianlei Hu ◽  
Ke Chen ◽  
...  

Multi-dimensional classification has attracted huge attention from the community. Though most studies consider fully annotated data, in real practice obtaining fully labeled data in MDC tasks is usually intractable. In this paper, we propose a novel learning paradigm: MultiDimensional Partial Label Learning (MDPL) where the ground-truth labels of each instance are concealed in multiple candidate label sets. We first introduce the partial hamming loss for MDPL that incurs a large loss if the predicted labels are not in candidate label sets, and provide an empirical risk minimization (ERM) framework. Theoretically, we rigorously prove the conditions for ERM learnability of MDPL in both independent and dependent cases. Furthermore, we present two MDPL algorithms under our proposed ERM framework. Comprehensive experiments on both synthetic and real-world datasets validate the effectiveness of our proposals.


Sign in / Sign up

Export Citation Format

Share Document