Leveraging Latent Label Distributions for Partial Label Learning

In partial label learning, each training example is assigned a set of candidate labels, only one of which is the ground-truth label. Existing partial label learning frameworks either assume each candidate label of equal confidence or consider the ground-truth label as a latent variable hidden in the indiscriminate candidate label set, while the different labeling confidence levels of the candidate labels are regrettably ignored. In this paper, we formalize the different labeling confidence levels as the latent label distributions, and propose a novel unified framework to estimate the latent label distributions while training the model simultaneously. Specifically, we present a biconvex formulation with constrained local consistency and adopt an alternating method to solve this optimization problem. The process of alternating optimization exactly facilitates the mutual adaption of the model training and the constrained label propagation. Extensive experimental results on controlled UCI datasets as well as real-world datasets clearly show the effectiveness of the proposed approach.

Download Full-text

Partial Label Learning with Self-Guided Retraining

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33013542 ◽

2019 ◽

Vol 33 ◽

pp. 3542-3549 ◽

Cited By ~ 10

Author(s):

Lei Feng ◽

Bo An

Keyword(s):

Real World ◽

Optimization Problem ◽

State Of The Art ◽

Ground Truth ◽

Learning Approaches ◽

High Confidence ◽

Infinity Norm ◽

Real World Datasets ◽

Partial Label Learning ◽

Optimization Efficiency

Partial label learning deals with the problem where each training instance is assigned a set of candidate labels, only one of which is correct. This paper provides the first attempt to leverage the idea of self-training for dealing with partially labeled examples. Specifically, we propose a unified formulation with proper constraints to train the desired model and perform pseudo-labeling jointly. For pseudo-labeling, unlike traditional self-training that manually differentiates the ground-truth label with enough high confidence, we introduce the maximum infinity norm regularization on the modeling outputs to automatically achieve this consideratum, which results in a convex-concave optimization problem. We show that optimizing this convex-concave problem is equivalent to solving a set of quadratic programming (QP) problems. By proposing an upper-bound surrogate objective function, we turn to solving only one QP problem for improving the optimization efficiency. Extensive experiments on synthesized and real-world datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art partial label learning approaches.

Download Full-text

Learning From Multi-Dimensional Partial Labels

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/407 ◽

2020 ◽

Author(s):

Haobo Wang ◽

Weiwei Liu ◽

Yang Zhao ◽

Tianlei Hu ◽

Ke Chen ◽

...

Keyword(s):

Real World ◽

Ground Truth ◽

Empirical Risk Minimization ◽

Risk Minimization ◽

Large Loss ◽

Empirical Risk ◽

Dimensional Classification ◽

Real World Datasets ◽

Partial Label Learning ◽

Real Practice

Multi-dimensional classification has attracted huge attention from the community. Though most studies consider fully annotated data, in real practice obtaining fully labeled data in MDC tasks is usually intractable. In this paper, we propose a novel learning paradigm: MultiDimensional Partial Label Learning (MDPL) where the ground-truth labels of each instance are concealed in multiple candidate label sets. We first introduce the partial hamming loss for MDPL that incurs a large loss if the predicted labels are not in candidate label sets, and provide an empirical risk minimization (ERM) framework. Theoretically, we rigorously prove the conditions for ERM learnability of MDPL in both independent and dependent cases. Furthermore, we present two MDPL algorithms under our proposed ERM framework. Comprehensive experiments on both synthetic and real-world datasets validate the effectiveness of our proposals.

Download Full-text

G-Tric: generating three-way synthetic datasets with triclustering solutions

BMC Bioinformatics ◽

10.1186/s12859-020-03925-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

João Lobo ◽

Rui Henriques ◽

Sara C. Madeira

Keyword(s):

State Of The Art ◽

Synthetic Data ◽

Ground Truth ◽

Real Data ◽

Three Dimensions ◽

Additional Advantage ◽

Urban Dynamics ◽

Data Generator ◽

Real World Datasets ◽

Synthetic Datasets

Abstract Background Three-way data started to gain popularity due to their increasing capacity to describe inherently multivariate and temporal events, such as biological responses, social interactions along time, urban dynamics, or complex geophysical phenomena. Triclustering, subspace clustering of three-way data, enables the discovery of patterns corresponding to data subspaces (triclusters) with values correlated across the three dimensions (observations $$\times$$ × features $$\times$$ × contexts). With increasing number of algorithms being proposed, effectively comparing them with state-of-the-art algorithms is paramount. These comparisons are usually performed using real data, without a known ground-truth, thus limiting the assessments. In this context, we propose a synthetic data generator, G-Tric, allowing the creation of synthetic datasets with configurable properties and the possibility to plant triclusters. The generator is prepared to create datasets resembling real 3-way data from biomedical and social data domains, with the additional advantage of further providing the ground truth (triclustering solution) as output. Results G-Tric can replicate real-world datasets and create new ones that match researchers needs across several properties, including data type (numeric or symbolic), dimensions, and background distribution. Users can tune the patterns and structure that characterize the planted triclusters (subspaces) and how they interact (overlapping). Data quality can also be controlled, by defining the amount of missing, noise or errors. Furthermore, a benchmark of datasets resembling real data is made available, together with the corresponding triclustering solutions (planted triclusters) and generating parameters. Conclusions Triclustering evaluation using G-Tric provides the possibility to combine both intrinsic and extrinsic metrics to compare solutions that produce more reliable analyses. A set of predefined datasets, mimicking widely used three-way data and exploring crucial properties was generated and made available, highlighting G-Tric’s potential to advance triclustering state-of-the-art by easing the process of evaluating the quality of new triclustering approaches.

Download Full-text

Knowledge-aware Multi-modal Adaptive Graph Convolutional Networks for Fake News Detection

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3451215 ◽

2021 ◽

Vol 17 (3) ◽

pp. 1-23

Author(s):

Shengsheng Qian ◽

Jun Hu ◽

Quan Fang ◽

Changsheng Xu

Keyword(s):

Social Media ◽

Visual Information ◽

Representation Learning ◽

Fake News ◽

Unified Framework ◽

Model Learning ◽

Convolutional Network ◽

Textual Information ◽

Convolutional Networks ◽

Real World Datasets

In this article, we focus on fake news detection task and aim to automatically identify the fake news from vast amount of social media posts. To date, many approaches have been proposed to detect fake news, which includes traditional learning methods and deep learning-based models. However, there are three existing challenges: (i) How to represent social media posts effectively, since the post content is various and highly complicated; (ii) how to propose a data-driven method to increase the flexibility of the model to deal with the samples in different contexts and news backgrounds; and (iii) how to fully utilize the additional auxiliary information (the background knowledge and multi-modal information) of posts for better representation learning. To tackle the above challenges, we propose a novel Knowledge-aware Multi-modal Adaptive Graph Convolutional Networks (KMAGCN) to capture the semantic representations by jointly modeling the textual information, knowledge concepts, and visual information into a unified framework for fake news detection. We model posts as graphs and use a knowledge-aware multi-modal adaptive graph learning principal for the effective feature learning. Compared with existing methods, the proposed KMAGCN addresses challenges from three aspects: (1) It models posts as graphs to capture the non-consecutive and long-range semantic relations; (2) it proposes a novel adaptive graph convolutional network to handle the variability of graph data; and (3) it leverages textual information, knowledge concepts and visual information jointly for model learning. We have conducted extensive experiments on three public real-world datasets and superior results demonstrate the effectiveness of KMAGCN compared with other state-of-the-art algorithms.

Download Full-text

DeepMAsED: evaluating the quality of metagenomic assemblies

Bioinformatics ◽

10.1093/bioinformatics/btaa124 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3011-3017 ◽

Cited By ~ 5

Author(s):

Olga Mineeva ◽

Mateo Rojas-Carulla ◽

Ruth E Ley ◽

Bernhard Schölkopf ◽

Nicholas D Youngblut

Keyword(s):

Large Scale ◽

State Of The Art ◽

Ground Truth ◽

Supplementary Information ◽

Learning Approach ◽

Wide Range ◽

Metagenome Assembly ◽

Model Training ◽

Reference Genomes

Abstract Motivation Methodological advances in metagenome assembly are rapidly increasing in the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straight-forward, as well as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Label-free prediction of Cell Painting from brightfield images

10.1101/2021.11.05.467394 ◽

2021 ◽

Author(s):

Jan Oscar Cross-Zamirski ◽

Elizabeth Mouchet ◽

Guy Williams ◽

Carola-Bibiane Schönlieb ◽

Riku Turkki ◽

...

Keyword(s):

Pearson Correlation ◽

Ground Truth ◽

Free Cell ◽

Label Free ◽

Biological Interest ◽

Genetic Perturbations ◽

Toxicity Analysis ◽

Model Training ◽

Downstream Analysis ◽

Cellular Components

Cell Painting is a high-content image-based assay which can reveal rich cellular morphology and is applied in drug discovery to predict bioactivity, assess toxicity and understand diverse mechanisms of action of chemical and genetic perturbations. In this study, we investigate label-free Cell Painting by predicting the five fluorescent Cell Painting channels from paired brightfield z-stacks using deep learning models. We train and validate the models with a dataset representing 1000s of pan-assay interference compounds sampled from 17 unique batches. The model predictions are evaluated using a test set from two additional batches, treated with compounds comprised from a publicly available phenotypic set. In addition to pixel-level evaluation, we process the label-free Cell Painting images with a segmentation-based feature-extraction pipeline to understand whether the generated images are useful in downstream analysis. The mean Pearson correlation coefficient (PCC) of the images across all five channels is 0.84. Without actually incorporating these features into the model training we achieved a mean correlation of 0.45 from the features extracted from the images. Additionally we identified 30 features which correlated greater than 0.8 to the ground truth. Toxicity analysis on the label-free Cell Painting resulted a sensitivity of 62.5% and specificity of 99.3% on images from unseen batches. Additionally, we provide a breakdown of the feature profiles by channel and feature type to understand the potential and limitation of the approach in morphological profiling. Our findings demonstrate that label-free Cell Painting has potential above the improved visualization of cellular components, and it can be used for downstream analysis. The findings also suggest that label-free Cell Painting could allow for repurposing the imaging channels for other non-generic fluorescent stains of more targeted biological interest, thus increasing the information content of the assay.

Download Full-text

Multiple Noisy Label Distribution Propagation for Crowdsourcing

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/204 ◽

2019 ◽

Cited By ~ 1

Author(s):

Hao Zhang ◽

Liangxiao Jiang ◽

Wenqiang Xu

Keyword(s):

Supervised Learning ◽

Real World ◽

Effective Means ◽

Ground Truth ◽

Cost Effective ◽

Nearest Neighbors ◽

True Label ◽

Real World Datasets ◽

The Individual ◽

Label Distribution

Crowdsourcing services provide a fast, efficient, and cost-effective means of obtaining large labeled data for supervised learning. Ground truth inference, also called label integration, designs proper aggregation strategies to infer the unknown true label of each instance from the multiple noisy label set provided by ordinary crowd workers. However, to the best of our knowledge, nearly all existing label integration methods focus solely on the multiple noisy label set itself of the individual instance while totally ignoring the intercorrelation among multiple noisy label sets of different instances. To solve this problem, a multiple noisy label distribution propagation (MNLDP) method is proposed in this study. MNLDP first transforms the multiple noisy label set of each instance into its multiple noisy label distribution and then propagates its multiple noisy label distribution to its nearest neighbors. Consequently, each instance absorbs a fraction of the multiple noisy label distributions from its nearest neighbors and yet simultaneously maintains a fraction of its own original multiple noisy label distribution. Promising experimental results on simulated and real-world datasets validate the effectiveness of our proposed method.

Download Full-text

Human Consensus-Oriented Image Captioning

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/92 ◽

2020 ◽

Cited By ~ 1

Author(s):

Ziwei Wang ◽

Zi Huang ◽

Yadan Luo

Keyword(s):

Data Quality ◽

Ground Truth ◽

Superior Performance ◽

The Novel ◽

Image Captioning ◽

Visual Objects ◽

Grammatical Errors ◽

Wrong Identification ◽

Model Training ◽

Treat All

Image captioning aims to describe an image with a concise, accurate, and interesting sentence. To build such an automatic neural captioner, the traditional models align the generated words with a number of human-annotated sentences to mimic human-like captions. However, the crowd-sourced annotations inevitably come with data quality issues such as grammatical errors, wrong identification of visual objects and sub-optimal sentence focus. During the model training, existing methods treat all the annotations equally regardless of the data quality. In this work, we explicitly engage human consensus to measure the quality of ground truth captions in advance, and directly encourage the model to learn high quality captions with high priority. Therefore, the proposed consensus-oriented method can accelerate the training process and achieve superior performance with only supervised objective without time-consuming reinforcement learning. The novel consensus loss can be implemented into most of the existing state-of-the-art methods, boosting the BLEU-4 performance by maximum relative 12.47% comparing to the conventional cross-entropy loss. Extensive experiments are conducted on MS-COCO Image Captioning dataset demonstrating the proposed human consensus-oriented training method can significantly improve the training efficiency and model effectiveness.

Download Full-text

On the objectivity, reliability, and validity of deep learning enabled bioimage analyses

eLife ◽

10.7554/elife.59780 ◽

2020 ◽

Vol 9 ◽

Cited By ~ 1

Author(s):

Dennis Segebarth ◽

Matthias Griebel ◽

Nikolai Stein ◽

Cora R von Collenberg ◽

Corinna Martin ◽

...

Keyword(s):

Deep Learning ◽

Signal To Noise Ratio ◽

Biological Effects ◽

Reliability And Validity ◽

Ground Truth ◽

Training Data ◽

Model Organisms ◽

Data Annotation ◽

Bioimage Analysis ◽

Model Training

Bioimage analysis of fluorescent labels is widely used in the life sciences. Recent advances in deep learning (DL) allow automating time-consuming manual image analysis processes based on annotated training data. However, manual annotation of fluorescent features with a low signal-to-noise ratio is somewhat subjective. Training DL models on subjective annotations may be instable or yield biased models. In turn, these models may be unable to reliably detect biological effects. An analysis pipeline integrating data annotation, ground truth estimation, and model training can mitigate this risk. To evaluate this integrated process, we compared different DL-based analysis approaches. With data from two model organisms (mice, zebrafish) and five laboratories, we show that ground truth estimation from multiple human annotators helps to establish objectivity in fluorescent feature annotations. Furthermore, ensembles of multiple models trained on the estimated ground truth establish reliability and validity. Our research provides guidelines for reproducible DL-based bioimage analyses.

Download Full-text

Community Detection Based on a Preferential Decision Model

Information ◽

10.3390/info11010053 ◽

2020 ◽

Vol 11 (1) ◽

pp. 53

Author(s):

Jinfang Sheng ◽

Ben Lu ◽

Bin Wang ◽

Jie Hu ◽

Kai Wang ◽

...

Keyword(s):

Community Structure ◽

Complex Networks ◽

Community Detection ◽

Decision Model ◽

Structural Characteristics ◽

Real Life ◽

Stable State ◽

Ground Truth ◽

Label Propagation ◽

Comparison Algorithms

The research on complex networks is a hot topic in many fields, among which community detection is a complex and meaningful process, which plays an important role in researching the characteristics of complex networks. Community structure is a common feature in the network. Given a graph, the process of uncovering its community structure is called community detection. Many community detection algorithms from different perspectives have been proposed. Achieving stable and accurate community division is still a non-trivial task due to the difficulty of setting specific parameters, high randomness and lack of ground-truth information. In this paper, we explore a new decision-making method through real-life communication and propose a preferential decision model based on dynamic relationships applied to dynamic systems. We apply this model to the label propagation algorithm and present a Community Detection based on Preferential Decision Model, called CDPD. This model intuitively aims to reveal the topological structure and the hierarchical structure between networks. By analyzing the structural characteristics of complex networks and mining the tightness between nodes, the priority of neighbor nodes is chosen to perform the required preferential decision, and finally the information in the system reaches a stable state. In the experiments, through the comparison of eight comparison algorithms, we verified the performance of CDPD in real-world networks and synthetic networks. The results show that CDPD not only has better performance than most recent algorithms on most datasets, but it is also more suitable for many community networks with ambiguous structure, especially sparse networks.

Download Full-text