Class Prior Estimation with Biased Positives and Unlabeled Examples

2020 ◽  
Vol 34 (04) ◽  
pp. 4255-4263
Author(s):  
Shantanu Jain ◽  
Justin Delano ◽  
Himanshu Sharma ◽  
Predrag Radivojac

Positive-unlabeled learning is often studied under the assumption that the labeled positive sample is drawn randomly from the true distribution of positives. In many application domains, however, certain regions in the support of the positive class-conditional distribution are over-represented while others are under-represented in the positive sample. Although this introduces problems in all aspects of positive-unlabeled learning, we begin to address this challenge by focusing on the estimation of class priors, quantities central to the estimation of posterior probabilities and the recovery of true classification performance. We start by making a set of assumptions to model the sampling bias. We then extend the identifiability theory of class priors from the unbiased to the biased setting. Finally, we derive an algorithm for estimating the class priors that relies on clustering to decompose the original problem into subproblems of unbiased positive-unlabeled learning. Our empirical investigation suggests feasibility of the correction strategy and overall good performance.
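
The decomposition idea can be illustrated with a minimal sketch. It assumes that the unlabeled sample has already been assigned to clusters and that some unbiased positive-unlabeled estimator has produced a class prior for each cluster; both inputs are hypothetical, and the recombination below is simply the law of total probability, not the authors' published algorithm.

```python
from collections import Counter


def combine_cluster_priors(unlabeled_cluster_ids, cluster_priors):
    """Combine per-cluster positive-class priors into one overall prior.

    unlabeled_cluster_ids: cluster id of every unlabeled example (assumed given).
    cluster_priors: dict mapping cluster id -> estimated proportion of
        positives among the unlabeled examples of that cluster (assumed to
        come from any unbiased PU prior estimator).
    """
    counts = Counter(unlabeled_cluster_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one unlabeled example")
    overall = 0.0
    for cluster, n in counts.items():
        # Weight each cluster's prior by the share of unlabeled data it holds.
        overall += (n / total) * cluster_priors[cluster]
    return overall


# Toy, made-up inputs: 60% of unlabeled data in cluster 0, 40% in cluster 1.
cluster_ids = [0] * 6 + [1] * 4
priors = {0: 0.5, 1: 0.25}
overall_prior = combine_cluster_priors(cluster_ids, priors)
```

Weighting by unlabeled cluster mass keeps the overall estimate tied to the unlabeled distribution, which is the distribution the class prior is defined on.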

2019 ◽  
Vol 121 (11) ◽  
pp. 2937-2950 ◽  
Author(s):  
Nadia Palmieri ◽  
Maria Angela Perito ◽  
Maria Carmela Macrì ◽  
Claudio Lupi

Purpose: The purpose of this paper is to investigate the main factors that may affect Italian consumers’ willingness to eat insects. Italy is a fairly special case among Western countries: many Italian regions have old traditional foods containing insects. Design/methodology/approach: Data come from a sample of 456 consumers living in four Italian regions. The empirical investigation involves several steps: modification of class distributions to obtain a balanced sample; model estimation using the least absolute shrinkage and selection operator; model evaluation using out-of-sample classification performance measures; and estimation of the “effect” of each explanatory variable via average predictive comparisons. The uncertainty associated with the whole procedure is evaluated using the bootstrap. Findings: The interviewed consumers are generally unwilling to eat insect-based food. However, factors such as previous experience, taste expectations and attitude towards both new food experiences and sustainable food play an important role in shaping individual inclination towards eating insects. Research limitations/implications: The sample analysed in this study is not representative of the whole national population, as is the case in most papers dealing with entomophagy. Originality/value: The paper revisits the issue using a relatively large sample and sophisticated statistical methods. The likely average effect of each explanatory variable is estimated and discussed in detail. The results provide interesting insights on how to approach a hypothetical Italian consumer in view of the possible development of a new market for edible insects.
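
The bootstrap step mentioned in the methodology can be sketched as follows. This is a generic percentile bootstrap, assuming the "whole procedure" can be wrapped in a single function (here the hypothetical `procedure` argument); the toy responses are made up and are not the study's data.

```python
import random


def bootstrap_interval(data, procedure, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the output of `procedure`.

    `procedure` is a hypothetical callable that maps a sample to one number;
    it is re-run on each resample so that its full uncertainty is captured.
    """
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Resample the observations with replacement, keeping the sample size.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(procedure(resample))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    low = estimates[int(tail * (n_boot - 1))]
    high = estimates[int((1.0 - tail) * (n_boot - 1))]
    return low, high


def share_of_ones(sample):
    """Toy procedure: proportion of 1-coded responses in the sample."""
    return sum(sample) / len(sample)


# Toy binary responses (made up): 30 ones and 70 zeros.
answers = [1] * 30 + [0] * 70
low, high = bootstrap_interval(answers, share_of_ones)
```

In a study like the one described, `procedure` would wrap the rebalancing, model fitting and effect estimation together, so the interval reflects every step and not just the final one.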


2014 ◽  
Vol 1 (2) ◽  
pp. 169-184 ◽  
Author(s):  
Mani Manavalan

The need for fast gene categorization tools is growing as more genomes are sequenced. To evaluate a newly sequenced genome, the genes must first be identified and translated into amino acid sequences, which are then categorized into structural or functional classes. Protein homology detection using sequence alignment algorithms is the most effective approach to protein categorization. Discriminative approaches such as support vector machines (SVMs) and position-specific scoring matrices (PSSMs) derived from PSI-BLAST have recently been used to improve alignment algorithms. However, alignment algorithms are time-consuming when a new sequence must be compared to a large number of previously known sequences; the same is true for SVMs. Building a PSSM for the new sequence is even more time-consuming. On today's computers, the best-performing approaches would take roughly 25 hours to classify the sequences of a novel genome (20,000 genes) as belonging to a single class, yet there are hundreds of classes to choose from. Another flaw of alignment algorithms is that they do not construct a model of the positive class but instead measure the mutual distance between sequences or profiles. Multiple alignments and hidden Markov models are the only common classification approaches that build a model of the positive class, but they have poor classification performance. The advantage of a model is that it can be analyzed for chemical features shared by all members of the class, yielding fresh insights into protein function and structure. We used LSTM to solve a well-known remote protein homology detection benchmark, in which a protein must be categorized as a member of a SCOP superfamily. LSTM achieves state-of-the-art classification performance while being significantly faster than other algorithms with similar classification performance. LSTM is five orders of magnitude faster than the quickest SVM-based approaches (which, however, have lower classification performance than LSTM) and two orders of magnitude faster than methods that perform somewhat better in classification. We applied LSTM to PROSITE classes and analyzed the derived patterns to test the modeling capabilities of the algorithm. Because it does not require established similarity metrics such as BLOSUM or PAM matrices, LSTM is complementary to alignment-based techniques. LSTM retrieved the PROSITE motif in 8 out of 15 classes. In the remaining seven classes, it derived alternative motifs that, on average, outperform the PROSITE motifs in classification.


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
Richard Zuech ◽  
John Hancock ◽  
Taghi M. Khoshgoftaar

Class rarity is a frequent challenge in cybersecurity. Rarity occurs when the positive (attack) class only has a small number of instances for machine learning classifiers to train upon, thus making it difficult for the classifiers to discriminate and learn from the positive class. To investigate rarity, we examine three individual web attacks in big data from the CSE-CIC-IDS2018 dataset: “Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”. These three individual web attacks are also severely imbalanced, and so we evaluate whether random undersampling (RUS) treatments can improve the classification performance for these three individual web attacks. The following eight different levels of RUS ratios are evaluated: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. For measuring classification performance, Area Under the Receiver Operating Characteristic Curve (AUC) metrics are obtained for the following seven different classifiers: Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Decision Tree (DT), Naive Bayes (NB), and Logistic Regression (LR) (with the first four learners being ensemble learners and for comparison, the last three being single learners). We find that applying random undersampling does improve overall classification performance with the AUC metric in a statistically significant manner. Ensemble learners achieve the top AUC scores after massive undersampling is applied, but the ensemble learners break down and have poor performance (worse than NB and DT) when no sampling is applied to our unique and harsh experimental conditions of severe class imbalance and rarity.
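
A minimal sketch of random undersampling to a target negative-to-positive ratio is shown below. It assumes binary labels coded 1 (attack) and 0 (normal); the function name, its parameters and the toy label vector are illustrative only and are not taken from the study's code.

```python
import random


def random_undersample(labels, neg_per_pos, seed=0):
    """Return sorted indices to keep after undersampling the negative class.

    All positives (label 1) are kept. Negatives (label 0) are sampled without
    replacement until roughly `neg_per_pos` negatives remain per positive,
    e.g. 9 for a 9:1 ratio or 65 / 35 for a 65:35 ratio.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    target = min(len(negatives), round(len(positives) * neg_per_pos))
    kept_negatives = rng.sample(negatives, target)
    return sorted(positives + kept_negatives)


# Toy, made-up labels: 10 attacks among 1000 instances (99:1 before sampling).
labels = [1] * 10 + [0] * 990
kept = random_undersample(labels, neg_per_pos=9)  # 9:1 after sampling
```

Only the training split would normally be undersampled in this way, so that evaluation metrics such as AUC are still computed on data with the original class distribution.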


Author(s):  
Dunstan Brown ◽  
Roger Evans

We describe an empirical method to explore and contrast the roles of default and principal part information in the differentiation of inflectional classes. We use an unsupervised machine learning method to classify Russian nouns into inflectional classes, first with full paradigm information, and then with particular types of information removed. When we remove default information, which is shared across classes, we expect there to be little effect on the classification. In contrast, when we remove principal part information, we expect there to be a more detrimental effect on classification performance. Our data set consists of paradigm listings of the 80 most frequent Russian nouns, generated from a formal theory which allows us to distinguish default and principal part information. Our results show that removal of forms classified as principal parts has a more detrimental effect on the classification than removal of default information. However, we also find that there are differences within the defaults and principal parts, and we suggest that these may in part be attributable to stress patterns.


1979 ◽  
Vol 12 (2) ◽  
pp. 82-86
Author(s):  
Karen Friedel ◽  
Jo-Ida Hansen ◽  
Thomas J. Hummel ◽  
Warren F. Shaffer

Crisis ◽  
2012 ◽  
Vol 33 (2) ◽  
pp. 106-112 ◽  
Author(s):  
Christopher M. Bloom ◽  
Shareen Holly ◽  
Adam M. P. Miller

Background: Historically, the field of self-injury has distinguished between the behaviors exhibited among individuals with a developmental disability (self-injurious behaviors; SIB) and those present within a normative population (nonsuicidal self-injury; NSSI), which typically result as a response to perceived stress. More recently, however, conclusions about NSSI have been drawn from lines of animal research aimed at examining the neurobiological mechanisms of SIB. Despite some functional similarity between SIB and NSSI, no empirical investigation has provided precedent for the application of SIB-targeted animal research as justification for pharmacological interventions in populations demonstrating NSSI. Aims: The present study examined this question directly, by simulating an animal model of SIB in rodents injected with pemoline and systematically manipulating stress conditions in order to monitor rates of self-injury. Methods: Sham controls and experimental animals injected with pemoline (200 mg/kg) were assigned to either a low stress (discriminated positive reinforcement) or high stress (discriminated avoidance) group and compared on the dependent measures of self-inflicted injury prevalence and severity. Results: The manipulation of stress conditions did not impact the rate of self-injury demonstrated by the rats. The results do not support a model of stress-induced SIB in rodents. Conclusions: Current findings provide evidence for caution in the development of pharmacotherapies of NSSI in human populations based on CNS stimulant models. Theoretical implications are discussed with respect to antecedent factors such as preinjury arousal level and environmental stress.


Author(s):  
Diane Pecher ◽  
Inge Boot ◽  
Saskia van Dantzig ◽  
Carol J. Madden ◽  
David E. Huber ◽  
...  

Previous studies (e.g., Pecher, Zeelenberg, & Wagenmakers, 2005) found that semantic classification performance is better for target words with orthographic neighbors that are mostly from the same semantic class (e.g., living) compared to target words with orthographic neighbors that are mostly from the opposite semantic class (e.g., nonliving). In the present study we investigated the contribution of phonology to orthographic neighborhood effects by comparing effects of phonologically congruent orthographic neighbors (book-hook) to phonologically incongruent orthographic neighbors (sand-wand). The prior presentation of a semantically congruent word produced larger effects on subsequent animacy decisions when the previously presented word was a phonologically congruent neighbor than when it was a phonologically incongruent neighbor. In a second experiment, performance differences between target words with versus without semantically congruent orthographic neighbors were larger if the orthographic neighbors were also phonologically congruent. These results support models of visual word recognition that assume an important role for phonology in cascaded access to meaning.


2006 ◽  
Author(s):  
Robyn J. Geelhoed ◽  
Julia C. Phillips ◽  
Ann R. Fischer ◽  
Elaine Shpungin ◽  
Younnjung Gong
