Model and Method for Contributor’s Quality Assessment in Community Image Tagging Systems

Author(s):  
A. V. Ponomarev

Introduction: Large-scale human-computer systems that involve people of varying skill and motivation in information processing are currently used in a wide range of applications. An acute problem in such systems is assessing the expected quality of each contributor, for example, in order to penalize incompetent or inaccurate contributors and to promote diligent ones. Purpose: To develop a method for assessing the expected quality of a contributor in community tagging systems, using only the generally unreliable and incomplete information provided by the contributors themselves (with ground truth tags unknown). Results: A mathematical model of community image tagging (including a model of a contributor) is proposed, along with a method for assessing the expected quality of a contributor. The method is based on comparing the tag sets provided by different contributors for the same images, and it is a modification of the pairwise comparison method in which the preference relation is replaced by a special domination characteristic. The expected quality of the contributors is evaluated as a positive eigenvector of the matrix of pairwise domination characteristics. A simulation of community tagging confirmed that the proposed method adequately estimates the expected quality of the contributors in a community tagging system (provided that contributor behavior fits the proposed model). Practical relevance: The obtained results can be used in the development of systems based on the coordinated efforts of a community (primarily, community tagging systems).
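
A minimal illustrative sketch of the final step, assuming the pairwise domination characteristics between contributors have already been computed into a positive matrix D: by the Perron-Frobenius theorem, the dominant eigenvector of such a matrix is positive and can be found by power iteration. The function name, tolerance, and toy values below are hypothetical, not taken from the paper.

```python
import numpy as np

def expected_quality(D, tol=1e-9, max_iter=1000):
    """Principal eigenvector of a pairwise domination matrix D (power iteration).

    D[i, j] is assumed to hold the domination characteristic of contributor i
    over contributor j (a hypothetical placeholder for the paper's statistic).
    The returned vector is normalized to sum to 1 and serves as the
    expected-quality estimate for each contributor.
    """
    n = D.shape[0]
    q = np.full(n, 1.0 / n)          # start from a uniform quality vector
    for _ in range(max_iter):
        q_next = D @ q
        q_next /= q_next.sum()       # keep the vector on the simplex
        if np.linalg.norm(q_next - q, 1) < tol:
            break
        q = q_next
    return q_next

# Toy example: 3 contributors with synthetic domination characteristics.
D = np.array([[1.0,  2.0, 4.0],
              [0.5,  1.0, 2.0],
              [0.25, 0.5, 1.0]])
print(expected_quality(D))           # higher score = higher expected quality
```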

2020 ◽  
Vol 36 (10) ◽  
pp. 3011-3017 ◽  
Author(s):  
Olga Mineeva ◽  
Mateo Rojas-Carulla ◽  
Ruth E Ley ◽  
Bernhard Schölkopf ◽  
Nicholas D Youngblut

Abstract Motivation Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Deepak Babu Sam ◽  
Neeraj N Sajjan ◽  
Himanshu Maurya ◽  
R. Venkatesh Babu

We present an unsupervised learning method for dense crowd count estimation. Owing to the large variability in the appearance of people and the extreme overlap in crowds, enumerating people is a difficult task even for humans. Consequently, creating large-scale annotated crowd data is expensive, and the resulting small datasets directly take a toll on the performance of existing CNN-based counting models. Motivated by these challenges, we develop a Grid Winner-Take-All (GWTA) autoencoder to learn several layers of useful filters from unlabeled crowd images. Our GWTA approach divides a convolution layer spatially into a grid of cells. Within each cell, only the maximally activated neuron is allowed to update the filter. Almost 99.9% of the parameters of the proposed model are trained without any labeled data, while the remaining 0.1% are tuned with supervision. The model achieves superior results compared to other unsupervised methods and stays reasonably close to the accuracy of the supervised baseline. Furthermore, we present comparisons and analyses regarding the quality of the features learned across various models.
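
As a rough, hedged PyTorch sketch of the grid winner-take-all mechanism described above: each non-overlapping spatial cell of a convolutional feature map keeps only its maximally activated value, so only that "winner" receives gradient from the reconstruction loss and updates the filters. The cell size, layer shapes, and training setup are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def gwta(features, cell=8):
    """Grid Winner-Take-All sketch: within every cell x cell spatial block,
    keep only the maximally activated value and zero the rest, so only the
    winner contributes to the loss and hence to filter updates."""
    b, c, h, w = features.shape
    # Max-pooling picks the winners; unpooling scatters them back to their
    # original positions, leaving all other activations at zero.
    pooled, idx = F.max_pool2d(features, cell, return_indices=True)
    return F.max_unpool2d(pooled, idx, cell, output_size=(h, w))

# Toy usage: one convolution layer of an autoencoder with GWTA applied.
x = torch.randn(2, 3, 64, 64)                      # unlabeled crowd patches
conv = torch.nn.Conv2d(3, 16, 3, padding=1)
sparse = gwta(F.relu(conv(x)))                     # winners only
recon = torch.nn.Conv2d(16, 3, 3, padding=1)(sparse)
loss = F.mse_loss(recon, x)                        # reconstruction objective
loss.backward()                                    # gradients flow only through winners
```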


2022 ◽  
Vol 31 (1) ◽  
pp. 1-37
Author(s):  
Chao Liu ◽  
Xin Xia ◽  
David Lo ◽  
Zhiwei Liu ◽  
Ahmed E. Hassan ◽  
...  

To accelerate software development, developers frequently search for and reuse existing code snippets from large-scale codebases such as GitHub. Over the years, researchers have proposed many information retrieval (IR)-based models for code search, but they fail to bridge the semantic gap between query and code. An early successful deep learning (DL)-based model, DeepCS, addressed this issue by learning the relationship between pairs of code methods and their natural language descriptions. Two major advantages of DeepCS are its ability to handle irrelevant/noisy keywords and to capture sequential relationships between words in query and code. In this article, we propose an IR-based model, CodeMatcher, that inherits the advantages of DeepCS (i.e., the capability of understanding the sequential semantics of important query words) while leveraging the indexing techniques of IR-based models to substantially reduce search response time. CodeMatcher first collects metadata for query words to identify irrelevant/noisy ones, then iteratively performs fuzzy search with the important query words on a codebase indexed by the Elasticsearch tool, and finally reranks the returned candidate code snippets according to how their tokens sequentially match the important words in the query. We verified its effectiveness on a large-scale codebase with ~41K repositories. Experimental results showed that CodeMatcher achieves an MRR (a widely used accuracy measure for code search) of 0.60, outperforming DeepCS, CodeHow, and UNIF by 82%, 62%, and 46%, respectively. Our proposed model is over 1.2K times faster than DeepCS. Moreover, CodeMatcher outperforms two existing online search engines (GitHub and Google search) by 46% and 33%, respectively, in terms of MRR. We also observed that fusing the advantages of IR-based and DL-based models is promising, and that improving the quality of method naming helps code search, since method names play an important role in connecting query and code.
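
The reranking step can be pictured with a toy scoring rule that rewards candidate code whose tokens match the important query words in order; this is a simplified stand-in for CodeMatcher's actual ranking formula, and the example query and candidates are hypothetical.

```python
def sequential_match_score(query_words, code_tokens):
    """Score how well code tokens match the query words *in order*: walk
    through the code tokens and count query words matched as a subsequence.
    A simplified stand-in for CodeMatcher's reranking rule."""
    score, qi = 0, 0
    for tok in code_tokens:
        if qi < len(query_words) and query_words[qi] in tok.lower():
            score += 1
            qi += 1
    return score / max(len(query_words), 1)

# Toy usage: rerank two hypothetical candidates returned by the index.
query = ["read", "file", "lines"]
candidates = {
    "readLinesFromFile": ["read", "lines", "from", "file"],
    "fileReader":        ["file", "reader"],
}
ranked = sorted(candidates,
                key=lambda name: sequential_match_score(query, candidates[name]),
                reverse=True)
print(ranked)   # candidates whose tokens follow the query order rank first
```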


2018 ◽  
Vol 2018 ◽  
pp. 1-11 ◽  
Author(s):  
Xiaoying Tan ◽  
Yuchun Guo ◽  
Mehmet A. Orgun ◽  
Liyin Xue ◽  
Yishuai Chen

With the surging demand for high-quality mobile video services and the unabated development of new network technologies, including fog computing, there is a need for a generalized quality of user experience (QoE) model that can provide insight for various network optimization designs. Good QoE, especially when measured as engagement, is an important optimization goal for investors and advertisers. Therefore, many works have focused on understanding how various factors, especially quality of service (QoS) factors, impact user engagement. However, the divergence of user interest is usually ignored or deliberately decoupled from QoS and/or other objective factors. With the increasing trend towards personalized applications, it is both necessary and feasible to consider user interest when optimizing engagement, in order to satisfy users' aesthetic and personal needs. We first propose an Extraction-Inference (E-I) algorithm to estimate user interest from easily obtained user behaviors. Based on our empirical analysis of a large-scale dataset, we then build a QoS and user Interest based Engagement (QI-E) regression model. Through experiments on our dataset, we demonstrate that the proposed model achieves a 9.99% improvement in accuracy over the baseline model that considers only QoS factors. The proposed model has potential for designing QoE-oriented scheduling strategies in various network scenarios, especially in the fog computing context.
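
To make the modeling idea concrete, here is a hedged sketch that fits an engagement regression from QoS features alone and from QoS features plus an inferred interest score; the feature names and numbers are purely illustrative and are not drawn from the paper's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical per-session features: QoS factors plus an inferred interest score.
X_qos = np.array([[1.2, 0.05],      # [startup delay (s), rebuffering ratio]
                  [3.0, 0.20],
                  [0.8, 0.00],
                  [2.5, 0.15]])
interest = np.array([[0.9], [0.4], [0.7], [0.2]])   # output of an E-I style estimator
engagement = np.array([0.85, 0.30, 0.75, 0.20])     # e.g., fraction of video watched

# Baseline: QoS-only model vs. a QI-E-style model that also uses interest.
qos_only = LinearRegression().fit(X_qos, engagement)
qi_e = LinearRegression().fit(np.hstack([X_qos, interest]), engagement)
print(qos_only.score(X_qos, engagement))
print(qi_e.score(np.hstack([X_qos, interest]), engagement))
```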


Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1047
Author(s):  
Hawazin Faiz Badawi ◽  
Fedwa Laamarti ◽  
Abdulmotaleb El Saddik

Digital twin (DT) technology has recently gained attention within the research community due to its potential to help build sustainable smart cities. However, there is a gap in the literature: no unified model for city services has yet been proposed that can guarantee interoperability across cities, capture each city's unique characteristics, and act as a base for modeling digital twins. This research aims to fill that gap. In this work, we propose the DT-DNA model, in which we design a city services digital twin with the goal of reflecting the real state of development of a city's services towards enhancing its citizens' quality of life (QoL). As it was designed using ISO 37120, one of the leading international standards for city services, the model guarantees interoperability and allows for easy comparison of services within and across cities. In order to test our model, we built DT-DNA sequences of services in both Quebec City and Boston and then used a DNA alignment tool to determine the matching percentage between them. Results show that the DT-DNA sequences of services in both cities are 46.5% identical. Ground truth comparisons show a similar result, which provides a preliminary proof of concept for the applicability of the proposed model and framework. These results also imply that one city's services are more developed than the other's. Therefore, we propose an algorithm to compare cities based on the proposed DT-DNA and, using Boston and Quebec City as a case study, demonstrate that Boston provides better services towards enhancing its citizens' QoL.
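
A toy sketch of the comparison step, assuming each city's services have already been encoded as a symbol sequence; the letter encoding and the use of Python's difflib are illustrative stand-ins for the DNA alignment tool used in the paper.

```python
from difflib import SequenceMatcher

# Hypothetical encoding: each ISO 37120 service indicator is mapped to one
# symbol describing its development level (letters here are illustrative).
city_a = "ACCGTTAGCA"   # e.g., one city's encoded service sequence
city_b = "ACGGTTACCA"   # e.g., the other city's encoded service sequence

# Percent identity between the two sequences, a simplified stand-in for the
# matching percentage reported by a DNA alignment tool.
identity = SequenceMatcher(None, city_a, city_b).ratio() * 100
print(f"services match: {identity:.1f}%")
```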


2020 ◽  
Vol 24 ◽  
pp. 63-86
Author(s):  
Francisco Mena ◽  
Ricardo Ñanculef ◽  
Carlos Valle

The lack of annotated data is one of the major barriers facing machine learning applications today. Learning from crowds, i.e., collecting ground-truth data from multiple inexpensive annotators, has become a common method to cope with this issue. It has recently been shown that modeling the varying quality of the annotations obtained in this way is fundamental to obtaining satisfactory performance in tasks where inexpert annotators may represent the majority but not the most trusted group. Unfortunately, existing techniques represent the annotation patterns of each annotator individually, making the models difficult to estimate in large-scale scenarios. In this paper, we present two models to address these problems. Both methods are based on the hypothesis that it is possible to learn collective annotation patterns by introducing confusion matrices that involve groups of data points or annotators. The first approach clusters data points with a common annotation pattern, regardless of the annotators from which the labels have been obtained. Implicitly, this method attributes annotation mistakes to the complexity of the data itself rather than to the variable behavior of the annotators. The second approach explicitly maps annotators to latent groups that are collectively parametrized to learn a common annotation pattern. Our experimental results show that, compared with other methods for learning from crowds, both methods have advantages in scenarios with a large number of annotators and a small number of annotations per annotator.
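
A minimal sketch of the second approach (annotators mapped to latent groups that share a confusion matrix), assuming the group assignments and matrices are already known; in the paper both are learned jointly, so all names and numbers below are illustrative.

```python
import numpy as np

# Toy sketch: annotators belong to G latent groups, and each group shares one
# confusion matrix C[g][true, observed]. Group memberships and matrices are
# hypothetical here; in the paper they are estimated from the data.
n_classes = 2
C = np.array([
    [[0.9, 0.1], [0.2, 0.8]],     # group 0: reliable annotators
    [[0.6, 0.4], [0.5, 0.5]],     # group 1: noisy annotators
])
annotator_group = {"ann1": 0, "ann2": 1, "ann3": 1}
labels = {"ann1": 0, "ann2": 1, "ann3": 0}    # labels given to one data point

# Posterior over the true class under a uniform prior: multiply the likelihood
# of each observed label under the annotator's group confusion matrix.
posterior = np.ones(n_classes)
for ann, obs in labels.items():
    posterior *= C[annotator_group[ann]][:, obs]
posterior /= posterior.sum()
print(posterior)                  # probability of each candidate true class
```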


2021 ◽  
Author(s):  
Michael Tarasiou

This paper presents DeepSatData, a pipeline for automatically generating satellite imagery datasets for training machine learning models. We also discuss design considerations, with an emphasis on dense classification tasks, e.g., semantic segmentation. The implementation presented makes use of freely available Sentinel-2 data, which allows the generation of the large-scale datasets required for training deep neural networks (DNNs). We discuss issues faced from the point of view of DNN training and evaluation, such as checking the quality of ground truth data, and comment on the scalability of the approach.


2019 ◽  
Author(s):  
Mateo Rojas-Carulla ◽  
Ruth E. Ley ◽  
Bernhard Schölkopf ◽  
Nicholas D. Youngblut

Abstract Motivation/background Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates close to a 5% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modelling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED.


Author(s):  
Prasenjit Mandal ◽  
Sovan Samanta ◽  
Madhumangal Pal

Abstract To represent the qualitative aspect of uncertainty and imprecise information, the linguistic preference relation (LPR) is a powerful tool for experts to express their opinions in group decision-making (GDM) using linguistic variables (LVs). However, an LV generally implies a membership degree of one, so the non-membership and hesitation degrees of the experts cannot be expressed. Pythagorean linguistic numbers/values (PLNs/PLVs) are a novel choice to address this issue. The aim of this paper is a GDM problem involving a large number of experts, called large-scale GDM (LSGDM), based on the Pythagorean linguistic preference relation (PLPR) with a consensus model. Sometimes the experts do not modify their opinions to achieve consensus. Therefore, proper management of the experts' opinions together with their non-cooperative behaviors (NCBs) is necessary to establish a consensus model. At the same time, it is essential to ensure the proper adjustment of the credibility information. The proposed model uses a grey clustering method to divide the experts into subgroups according to the similarity of their evaluations. Then, we aggregate the experts' evaluations in each cluster. A cluster consensus index (CCI) and a group consensus index (GCI) are presented to measure the consensus level among the clusters. We then provide a mechanism for managing the NCBs of the clusters, which consists of two parts: (1) an NCB degree is defined using CCI and GCI to identify the NCBs of the clusters; and (2) a weight punishment mechanism is applied to the NCB clusters to improve consensus. Finally, an example is offered to demonstrate the usefulness of the proposed approach.
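
As a loose illustration only, a toy sketch of cluster-level and group-level consensus indices computed from numeric preference vectors; the similarity measure and the index definitions are simplified stand-ins for the paper's CCI and GCI.

```python
import numpy as np

def similarity(a, b):
    """Similarity between two preference vectors in [0, 1] (toy definition)."""
    return 1.0 - np.mean(np.abs(a - b))

# Hypothetical numeric preference vectors per expert, already grouped into
# clusters by a grey-clustering step.
clusters = {
    "c1": np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]),
    "c2": np.array([[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]),
}
cluster_means = {k: v.mean(axis=0) for k, v in clusters.items()}
group_mean = np.mean(list(cluster_means.values()), axis=0)

# CCI: average similarity of each expert to its cluster aggregate.
cci = {k: np.mean([similarity(e, cluster_means[k]) for e in v])
       for k, v in clusters.items()}
# GCI: average similarity of the cluster aggregates to the group aggregate.
gci = np.mean([similarity(m, group_mean) for m in cluster_means.values()])
print(cci, gci)   # low values would flag clusters for NCB management
```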


Author(s):  
Lijun Wu ◽  
Li Zhao ◽  
Tao Qin ◽  
Jianhuang Lai ◽  
Tie-Yan Liu

Reinforcement learning (RL), which has been successfully applied to sequence prediction, introduces reward as a sequence-level supervision signal to evaluate the quality of a generated sequence. Existing RL approaches use the ground-truth sequence to define the reward, which limits the application of RL techniques to labeled data. Since labeled data is usually scarce and/or costly to collect, it is desirable to leverage large-scale unlabeled data. In this paper, we extend existing RL methods for sequence prediction to exploit unlabeled data. We propose to learn the reward function from labeled data and use the predicted reward as a pseudo reward for unlabeled data, so that we can learn from unlabeled data using the pseudo reward. To obtain good pseudo rewards on unlabeled data, we propose an RNN-based reward network with an attention mechanism, trained with a purposely biased data distribution. Experiments show that the pseudo reward can provide good supervision and guide the learning process on unlabeled data. We observe significant improvements on both neural machine translation and text summarization.
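
A hedged sketch of the pseudo-reward idea: a small reward network scores sequences sampled from the policy, and the detached score drives a REINFORCE-style update on unlabeled data. The architecture, names, and shapes are placeholders, not the paper's RNN-with-attention design.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Toy reward model: maps a token sequence to a scalar pseudo reward."""
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        _, h = self.rnn(self.emb(tokens))
        return self.out(h[-1]).squeeze(-1)      # scalar score per sequence

reward_net = RewardNet()                        # assumed pre-trained on labeled pairs
generated = torch.randint(0, 1000, (4, 12))     # sequences sampled from the policy
log_probs = torch.randn(4, requires_grad=True)  # stand-in for summed token log-probs

pseudo_reward = reward_net(generated).detach()  # no gradient into the reward model
policy_loss = -(pseudo_reward * log_probs).mean()
policy_loss.backward()                          # REINFORCE-style update signal
```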

