Partial Label Learning with Unlabeled Data

Author(s):  
Qian-Wei Wang ◽  
Yu-Feng Li ◽  
Zhi-Hua Zhou

Partial label learning deals with training examples each associated with a set of candidate labels, among which only one label is valid. Previous studies typically assume that candidate label sets are provided for all training examples. In many real-world applications such as video character classification, however, it is generally difficult to label a large number of instances, and a large amount of data is left unlabeled. We call this kind of problem semi-supervised partial label learning. In this paper, we propose the SSPL method to address this problem. Specifically, an iterative label propagation procedure between partial label examples and unlabeled instances is employed to disambiguate the candidate label sets of partial label examples as well as to assign valid labels to unlabeled instances. The importance of unlabeled instances increases adaptively as the number of iterations grows, since they come to carry richer labeling information. Finally, unseen instances are classified based on the minimum reconstruction error over both partial label and unlabeled instances. Experiments on real-world data sets clearly validate the effectiveness of the proposed SSPL method.
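
A minimal sketch of this kind of iterative propagation is given below (illustrative only, not the authors' SSPL implementation; all function and parameter names are hypothetical): label scores flow over a kNN graph, partial-label rows are clamped to their candidate sets, and the weight given to unlabeled rows grows with the iteration count.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def propagate_labels(X, cand, n_labeled, n_classes, k=10, n_iter=20, alpha0=0.3):
    """Toy propagation between partial-label and unlabeled instances.

    X    : (n, d) features; rows [0, n_labeled) are partial-label examples
    cand : (n_labeled, n_classes) binary candidate-label mask
    Returns an (n, n_classes) matrix of label scores."""
    n = X.shape[0]
    W = kneighbors_graph(X, k, mode="connectivity", include_self=False)
    W = 0.5 * (W + W.T)                                  # symmetrize the kNN graph
    d_inv = 1.0 / np.maximum(np.asarray(W.sum(axis=1)).ravel(), 1e-12)
    P = W.multiply(d_inv[:, None]).tocsr()               # row-normalized transitions

    F = np.zeros((n, n_classes))
    F[:n_labeled] = cand / cand.sum(axis=1, keepdims=True)   # uniform over candidates

    for t in range(n_iter):
        alpha = min(1.0, alpha0 + t / n_iter)            # unlabeled weight grows with t
        F_new = P @ F
        # partial-label rows: clamp scores to the candidate set and renormalize
        F_new[:n_labeled] *= cand
        F_new[:n_labeled] /= np.maximum(F_new[:n_labeled].sum(axis=1, keepdims=True), 1e-12)
        # unlabeled rows: blend in propagated scores with increasing weight
        F[n_labeled:] = alpha * F_new[n_labeled:] + (1 - alpha) * F[n_labeled:]
        F[:n_labeled] = F_new[:n_labeled]
    return F
```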

Author(s):  
Yuguang Yan ◽  
Mingkui Tan ◽  
Yanwu Xu ◽  
Jiezhang Cao ◽  
Michael Ng ◽  
...  

Data imbalance occurs in many real-world applications, especially in medical diagnosis, where normal cases usually far outnumber abnormal cases. One of the most important approaches to alleviating this issue is oversampling, which synthesizes minority-class samples to balance the class sizes. However, existing methods barely consider the global geometric information of the minority-class distribution, and thus may incur a distribution mismatch between real and synthetic samples. In this paper, relying on optimal transport (Villani 2008), we propose an oversampling method that exploits the global geometric information of the data to make synthetic samples follow a distribution similar to that of the minority class. Moreover, we introduce a novel regularization based on synthetic samples and shift the distribution of minority-class samples according to loss information. Experiments on toy and real-world data sets demonstrate the efficacy of our proposed method in terms of multiple metrics.
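
To make the optimal-transport idea concrete, here is a self-contained sketch (assumed names and a from-scratch Sinkhorn solver, not the authors' method): rough synthetic candidates are generated and then pulled toward the minority distribution via the barycentric projection of an entropy-regularized transport plan.

```python
import numpy as np

def sinkhorn(a, b, C, reg=0.05, n_iter=200):
    """Entropy-regularized OT plan between histograms a, b with cost matrix C."""
    K = np.exp(-C / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_oversample(X_min, n_new, reg=0.05, seed=None):
    """Draw rough synthetic candidates, then pull them toward the minority
    distribution via the barycentric projection of an OT plan."""
    rng = np.random.default_rng(seed)
    # crude candidates: jittered copies of random minority points
    idx = rng.integers(0, len(X_min), size=n_new)
    X_syn = X_min[idx] + 0.1 * rng.standard_normal((n_new, X_min.shape[1]))
    # squared-Euclidean cost between candidates and real minority samples
    C = ((X_syn[:, None, :] - X_min[None, :, :]) ** 2).sum(-1)
    a = np.full(n_new, 1.0 / n_new)
    b = np.full(len(X_min), 1.0 / len(X_min))
    plan = sinkhorn(a, b, C / C.max(), reg)
    # barycentric projection: each candidate moves to its plan-weighted target
    return (plan @ X_min) / plan.sum(axis=1, keepdims=True)
```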


2020 ◽  
Vol 34 (04) ◽  
pp. 4691-4698
Author(s):  
Shu Li ◽  
Wen-Tao Li ◽  
Wei Wang

In many real-world applications, the data have several disjoint sets of features, each of which is called a view. Researchers have developed many multi-view learning methods over the past decade. In this paper, we bring Graph Convolutional Networks (GCNs) into multi-view learning and propose Co-GCN, a novel multi-view semi-supervised learning method that adaptively exploits the graph information from the multiple views with combined Laplacians. Experimental results on real-world data sets verify that Co-GCN achieves better performance than state-of-the-art multi-view semi-supervised methods.
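
The core idea of combining per-view Laplacians with learned weights can be sketched in a few lines of PyTorch (an illustrative layer, not the exact Co-GCN formulation; all names are assumptions):

```python
import torch
import torch.nn as nn

class CombinedLaplacianGCN(nn.Module):
    """GCN layer over a learned convex combination of per-view Laplacians.

    `laplacians` is a list of (n, n) normalized graph matrices, one per view;
    the combination weights are learned jointly with the layer parameters."""
    def __init__(self, laplacians, in_dim, out_dim):
        super().__init__()
        self.laps = laplacians
        self.view_logits = nn.Parameter(torch.zeros(len(laplacians)))
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, X):
        w = torch.softmax(self.view_logits, dim=0)       # adaptive view weights
        L = sum(wi * Li for wi, Li in zip(w, self.laps)) # combined Laplacian
        return torch.relu(L @ self.lin(X))
```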


Author(s):  
Chao Qian ◽  
Guiying Li ◽  
Chao Feng ◽  
Ke Tang

The subset selection problem, which selects a few items from a ground set, arises in many applications such as maximum coverage, influence maximization, and sparse regression. The recently proposed POSS algorithm is a powerful approximation solver for this problem. However, POSS requires centralized access to the full ground set, and is thus impractical for large-scale real-world applications where the ground set is too large to store on a single machine. In this paper, we propose a distributed version of POSS (DPOSS) with a bounded approximation guarantee. DPOSS can be easily implemented in the MapReduce framework. Our extensive experiments using Spark, on various real-world data sets with sizes ranging from thousands to millions, show that DPOSS achieves performance competitive with centralized POSS and is almost always better than the state-of-the-art distributed greedy algorithm RandGreeDi.
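
For reference, below is a minimal sketch of the centralized POSS procedure that DPOSS distributes (a toy Python version with simplified dominance checks and no value caching, not the paper's implementation); DPOSS would run such a solver on each data partition and then merge and re-solve over the partial results.

```python
import random

def poss(ground, f, k, T=10000):
    """Toy Pareto Optimization for Subset Selection: maximize a set function f
    under |S| <= k by evolving an archive of solutions that are mutually
    non-dominated on the two objectives (f(S), |S|). f must accept a list
    of items (possibly empty)."""
    n = len(ground)
    def value(s):
        return f([ground[i] for i in range(n) if s[i]])
    archive = [[0] * n]                                   # start from the empty set
    for _ in range(T):
        parent = random.choice(archive)
        s = [b ^ (random.random() < 1.0 / n) for b in parent]  # bit-wise mutation
        if sum(s) >= 2 * k:                               # discard too-large solutions
            continue
        fs, size = value(s), sum(s)
        # skip s if some archived solution weakly dominates it
        if any(value(t) >= fs and sum(t) <= size for t in archive):
            continue
        # otherwise keep s and drop archived solutions it weakly dominates
        archive = [t for t in archive
                   if not (fs >= value(t) and size <= sum(t))] + [s]
    best = max((t for t in archive if sum(t) <= k), key=value)
    return [ground[i] for i in range(n) if best[i]]
```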


Author(s):  
Jun-Peng Fang ◽  
Min-Ling Zhang

In partial multi-label learning (PML), each training example is associated with multiple candidate labels which are only partially valid. The task of PML naturally arises in learning scenarios with inaccurate supervision, and the goal is to induce a multi-label predictor that can assign a proper set of labels to unseen instances. When learning from PML training examples, the training procedure is prone to being misled by the false positive labels concealed in the candidate label sets. In light of this major difficulty, a novel two-stage PML approach is proposed that works by eliciting credible labels from the candidate label set for model induction; in this way, most false positive labels are expected to be excluded from the training procedure. Specifically, in the first stage, the labeling confidence of each candidate label for each PML training example is estimated via iterative label propagation. In the second stage, using credible labels with high labeling confidence, a multi-label predictor is induced via pairwise label ranking with virtual label splitting or maximum a posteriori (MAP) reasoning. Extensive experiments on synthetic as well as real-world data sets clearly validate the effectiveness of credible label elicitation in learning from PML examples.
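
As a hedged illustration of how stage one feeds stage two, the elicitation step could look like the following sketch (hypothetical names and thresholding rule): confidences estimated by propagation are restricted to each candidate set, and only sufficiently confident labels are kept for model induction.

```python
import numpy as np

def elicit_credible_labels(conf, cand, tau=0.5):
    """Illustrative credible-label elicitation.

    conf : (n, q) labeling-confidence matrix from stage-one label propagation
    cand : (n, q) binary candidate-label mask
    tau  : threshold on confidence relative to each example's maximum
    Returns a binary matrix of credible labels; the rest of each candidate
    set is dropped so false positives stay out of model induction."""
    conf = np.where(cand > 0, conf, 0.0)      # void confidences outside candidates
    scale = np.maximum(conf.max(axis=1, keepdims=True), 1e-12)
    rel = conf / scale                        # confidence relative to per-example max
    credible = (rel >= tau) & (cand > 0)
    empty = ~credible.any(axis=1)             # guarantee at least one label each
    credible[empty, rel[empty].argmax(axis=1)] = True
    return credible.astype(int)
```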


Entropy ◽  
2021 ◽  
Vol 23 (5) ◽  
pp. 507
Author(s):  
Piotr Białczak ◽  
Wojciech Mazurczyk

Malicious software uses the HTTP protocol for communication, creating network traffic that is hard to identify because it blends into the traffic generated by benign applications. To aid identification, fingerprinting tools have been developed to help track such traffic by providing a short representation of malicious HTTP requests. However, existing tools either do not analyze all of the information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from parts of the request such as the URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. We have performed an extensive experimental evaluation of the developed solution using real-world data sets and compared Hfinger with the most closely related and popular existing tools, such as FATT, Mercury, and p0f. The effectiveness analysis reveals that, on average, only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8–34 times lower than for existing tools. Moreover, unlike these tools, Hfinger in its default mode does not introduce collisions between malware and benign applications, and achieves this while increasing the number of fingerprints by at most a factor of three. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.
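
A toy example of structure-oriented request fingerprinting in this spirit (illustrative only; Hfinger's actual feature set and fingerprint format differ) captures the shape of the URI, the header names and their order, and the payload size rather than raw values:

```python
import hashlib

def http_request_fingerprint(method, uri, headers, payload=b""):
    """Toy structural fingerprint of an HTTP request.

    headers : ordered list of (name, value) pairs as they appear on the wire.
    (Illustrative only; not Hfinger's real feature set or output format.)"""
    path, _, query = uri.partition("?")
    # URI shape: path depth, path length, number of query parameters
    uri_shape = f"{path.count('/')}|{len(path)}|{len(query.split('&')) if query else 0}"
    # header order and names carry structure; values are deliberately dropped
    header_order = ",".join(name.lower() for name, _ in headers)
    # payload: size plus a truncated content hash
    payload_part = f"{len(payload)}|{hashlib.sha256(payload).hexdigest()[:8]}"
    return "|".join([method, uri_shape, header_order, payload_part])

# e.g. http_request_fingerprint("GET", "/api/v1/ping?id=7",
#                               [("Host", "x.example"), ("User-Agent", "bot")])
```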


Author(s):  
Martyna Daria Swiatczak

This study assesses the extent to which the two main Configurational Comparative Methods (CCMs), i.e. Qualitative Comparative Analysis (QCA) and Coincidence Analysis (CNA), produce different models. It further explains how this non-identity stems from the different algorithms on which the two methods are based, namely QCA's Quine–McCluskey algorithm and the CNA algorithm. I offer an overview of the fundamental differences between QCA and CNA and demonstrate both underlying algorithms on three data sets of ascending proximity to real-world data. Subsequent simulation studies in scenarios of varying sample sizes and degrees of noise in the data show high overall ratios of non-identity between the QCA parsimonious solution and the CNA atomic solution across varying analytical choices, i.e. different consistency and coverage threshold values and different ways of deriving QCA's parsimonious solution. Clarity on the contrasts between the two methods should enable scholars to make more informed decisions on their methodological approaches, enhance their understanding of what happens behind the results generated by the software packages, and better navigate the interpretation of results. Clarity on the non-identity between the underlying algorithms and its consequences for the results should also provide a basis for a methodological discussion about which method, and which variants thereof, are more successful at deriving which search target.
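
For readers unfamiliar with the consistency and coverage thresholds mentioned above, the standard crisp-set definitions can be computed as follows (a small illustrative helper, not part of any QCA or CNA package):

```python
import numpy as np

def consistency_and_coverage(x, y):
    """Crisp-set consistency and coverage for a sufficiency claim X -> Y.

    x, y : binary arrays marking each case's membership in the
           configuration X and the outcome Y.
    consistency = |X & Y| / |X|   (how reliably X is followed by Y)
    coverage    = |X & Y| / |Y|   (how much of Y is accounted for by X)"""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    both = np.sum(x & y)
    return both / max(x.sum(), 1), both / max(y.sum(), 1)

# e.g. a configuration is typically retained as sufficient only if its
# consistency meets a chosen threshold such as 0.8
```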


2016 ◽  
Vol 12 (2) ◽  
pp. 126-149 ◽  
Author(s):  
Masoud Mansoury ◽  
Mehdi Shajari

Purpose
This paper aims to improve recommendation performance for cold-start users and controversial items. Collaborative filtering (CF) generates recommendations on the basis of similarity between users: it uses the opinions of users similar to an active user to generate recommendations for that user. Because the similarity model, or neighbor selection function, is the key element in the effectiveness of CF, many variations of CF have been proposed. However, these methods are not very effective, especially for users who provide few ratings (i.e. cold-start users).

Design/methodology/approach
A new user similarity model is proposed that focuses on improving recommendation performance for cold-start users and controversial items. To show the validity of the similarity model, the authors conducted experiments demonstrating its effectiveness in calculating similarity values between users even when only a few ratings are available. In addition, the authors applied the user similarity model to a recommender system and analyzed its results.

Findings
Experiments on two real-world data sets were run and compared with other CF techniques. The results show that the authors' approach outperforms previous CF techniques on the coverage metric while preserving accuracy for cold-start users and controversial items.

Originality/value
The proposed approach addresses the conditions in which CF is unable to generate accurate recommendations. These conditions affect CF performance adversely, especially for cold-start users. The authors show that their similarity model overcomes these CF weaknesses effectively and improves performance even for cold-start users.
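
As a point of reference for the kind of weakness being addressed, one common baseline remedy for few co-ratings is significance weighting, which shrinks the Pearson similarity when two users share few rated items. The sketch below illustrates that baseline (not the authors' proposed model; parameter names are assumptions):

```python
import numpy as np

def shrunk_pearson(ru, rv, min_overlap=1, shrink=25):
    """Baseline user-user similarity with significance weighting.

    ru, rv : rating vectors over the same items, with np.nan for unrated."""
    mask = ~np.isnan(ru) & ~np.isnan(rv)      # items rated by both users
    n = mask.sum()
    if n < min_overlap:
        return 0.0
    a = ru[mask] - ru[mask].mean()
    b = rv[mask] - rv[mask].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    sim = (a * b).sum() / denom if denom > 0 else 0.0
    return sim * n / (n + shrink)             # shrink toward 0 on small overlap
```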


2020 ◽  
Vol 19 (2) ◽  
pp. 21-35
Author(s):  
Ryan Beal ◽  
Timothy J. Norman ◽  
Sarvapali D. Ramchurn

This paper outlines a novel approach to optimising teams for Daily Fantasy Sports (DFS) contests. To this end, we propose a number of new models and algorithms to solve the team formation problems posed by DFS. Specifically, we focus on the National Football League (NFL) and predict the performance of real-world players to form the optimal fantasy team using mixed-integer programming. We test our solutions using real-world data sets from four seasons (2014–2017). We highlight the advantage that can be gained from using our machine-based methods and show that our solutions outperform existing benchmarks, turning a profit in up to 81.3% of DFS game-weeks over a season.
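
A lineup-selection problem of this kind can be expressed as a small mixed-integer program; the sketch below uses PuLP with an assumed player-dict schema and omits the positional constraints a real DFS contest would add:

```python
import pulp

def optimal_fantasy_team(players, budget=50000, roster=9):
    """Toy DFS lineup optimizer: maximize projected points subject to a
    salary cap and a fixed roster size.

    players : list of dicts with 'name', 'salary', 'points' keys (assumed)."""
    prob = pulp.LpProblem("dfs_lineup", pulp.LpMaximize)
    pick = [pulp.LpVariable(f"pick_{i}", cat="Binary") for i in range(len(players))]
    prob += pulp.lpSum(p["points"] * x for p, x in zip(players, pick))   # objective
    prob += pulp.lpSum(p["salary"] * x for p, x in zip(players, pick)) <= budget
    prob += pulp.lpSum(pick) == roster
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [p["name"] for p, x in zip(players, pick) if x.value() == 1]
```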

