Positive unlabeled learning via wrapper-based adaptive sampling

Author(s):  
Pengyi Yang ◽  
Wei Liu ◽  
Jean Yang

Learning from positive and unlabeled data frequently occurs in applications where only a subset of positive instances is available while the rest of the data are unlabeled. In such scenarios, the goal is often to create a discriminant model that can accurately classify both positive and negative data by learning from labeled and unlabeled instances. In this study, we propose an adaptive sampling (AdaSampling) approach that utilises prediction probabilities from a model to iteratively update the training data. Starting with equal prior probabilities for all unlabeled data, our method "wraps" around a predictive model and iteratively updates these probabilities to distinguish positive and negative instances in the unlabeled data. Subsequently, one or more robust negative sets can be drawn from the unlabeled data, according to the likelihood of each instance being negative, to train a single classification model or an ensemble of models.
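A minimal sketch of this wrapper idea, assuming a scikit-learn logistic regression as the wrapped model; the function name, sampling scheme, and iteration count below are illustrative simplifications rather than the authors' reference implementation:

```python
# Sketch of wrapper-based adaptive sampling: iteratively re-estimate how likely
# each unlabeled instance is to be negative, resample a negative set accordingly,
# and retrain the wrapped classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ada_sampling(X_pos, X_unlabeled, n_iter=10, random_state=0):
    rng = np.random.default_rng(random_state)
    n_u = len(X_unlabeled)
    p_neg = np.full(n_u, 0.5)              # equal prior: unlabeled instances start at 0.5
    clf = None
    for _ in range(n_iter):
        # draw a candidate negative set from the unlabeled data according to p_neg
        neg_idx = rng.choice(n_u, size=len(X_pos), replace=True, p=p_neg / p_neg.sum())
        X_train = np.vstack([X_pos, X_unlabeled[neg_idx]])
        y_train = np.concatenate([np.ones(len(X_pos)), np.zeros(len(neg_idx))])
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        # update the probability of each unlabeled instance being negative
        p_neg = clf.predict_proba(X_unlabeled)[:, 0]
    return clf, p_neg

# Example usage with toy data (placeholders):
# X_pos = np.random.randn(50, 5) + 1.0
# X_unl = np.random.randn(500, 5)
# model, p_neg = ada_sampling(X_pos, X_unl)
```

Any probabilistic classifier could stand in for the logistic regression, and the final probabilities can be used to draw several negative sets to train an ensemble of models.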

Author(s):  
Yinghui Yang ◽  
Balaji Padmanabhan

Classification is a form of data analysis that can be used to extract models to predict categorical class labels (Han & Kamber, 2001). Data classification has proven to be very useful in a wide variety of applications. For example, a classification model can be built to categorize bank loan applications as either safe or risky. In order to build a classification model, training data containing multiple independent variables and a dependent variable (the class label) are needed. If a data record has a known value for its class label, the record is termed "labeled"; if the value is unknown, it is "unlabeled". There are situations with a large amount of unlabeled data and only a small amount of labeled data. Using only the labeled data to build classification models can ignore useful information contained in the unlabeled data. Furthermore, unlabeled data are often much cheaper and more plentiful than labeled data, so if useful information can be extracted from them that reduces the need for labeled examples, this can be a significant benefit (Balcan & Blum, 2005). The default practice is to use only the labeled data to build a classification model and then assign class labels to the unlabeled data. However, when the amount of labeled data is insufficient, a classification model built only from the labeled data can be biased and far from accurate, and the class labels it assigns to the unlabeled data can in turn be inaccurate. How to leverage the information contained in the unlabeled data to help improve the accuracy of the classification model is an important research question. Two streams of research address the challenging issue of how to appropriately use unlabeled data for building classification models; the details are discussed below.
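As a generic illustration of the contrast between the default practice (training on labeled data only) and one common way of leveraging unlabeled data (self-training), the following scikit-learn sketch uses synthetic data; it is not drawn from the two research streams the authors discuss:

```python
# Default practice vs. self-training on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
unlabeled = rng.random(len(y)) < 0.9       # pretend 90% of the labels are unknown
y_train = y.copy()
y_train[unlabeled] = -1                    # scikit-learn's marker for "unlabeled"

# Default practice: fit on the small labeled subset only.
baseline = LogisticRegression(max_iter=1000).fit(X[~unlabeled], y[~unlabeled])

# Self-training: confident predictions on unlabeled data are added iteratively.
self_trained = SelfTrainingClassifier(LogisticRegression(max_iter=1000),
                                      threshold=0.8).fit(X, y_train)
```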


2020 ◽  
Vol 41 (Supplement_2) ◽  
Author(s):  
S Gao ◽  
D Stojanovski ◽  
A Parker ◽  
P Marques ◽  
S Heitner ◽  
...  

Abstract Background Correctly identifying the views acquired in a 2D echocardiographic examination is paramount to the post-processing and quantification steps often performed as part of most clinical workflows. In many exams, particularly in stress echocardiography, microbubble contrast is used, which greatly affects the appearance of the cardiac views. Here we present a bespoke, fully automated convolutional neural network (CNN) which identifies apical 2, 3, and 4 chamber, and short axis (SAX) views acquired with and without contrast. The CNN was tested on a completely independent, external dataset acquired in a different country from that used to train the network. Methods Training data comprising 2D echocardiograms were taken from 1014 subjects in a prospective multi-site, multi-vendor UK trial, with more than 17,500 frames in each view. Prior to view classification model training, images were processed using standard techniques to ensure homogeneous and normalised image inputs to the training pipeline. A bespoke CNN was built using the minimum number of convolutional layers required, with batch normalisation and dropout to reduce overfitting. Before training, the data were split into 90% for model training (211,958 frames) and 10% for validation (23,946 frames). Image frames from each subject were kept entirely within either the training or the validation dataset. Further, a separate trial dataset of 240 studies acquired in the USA was used as an independent test dataset (39,401 frames). Results Figure 1 shows the confusion matrices for both the validation data (left) and the independent test data (right), with an overall accuracy of 96% and 95% for the validation and test datasets respectively. The accuracy of >99% for the non-contrast cardiac views exceeds that reported in other works. The combined datasets included images acquired across ultrasound manufacturers and models from 12 clinical sites. Conclusion We have developed a CNN capable of automatically and accurately identifying all relevant cardiac views used in "real world" echo exams, including views acquired with contrast. Use of the CNN in a routine clinical workflow could improve the efficiency of quantification steps performed after image acquisition. The model was tested on an independent dataset acquired in a different country from that used for training and was found to perform similarly, indicating the generalisability of the model. Figure 1. Confusion matrices Funding Acknowledgement Type of funding source: Private company. Main funding source(s): Ultromics Ltd.
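As a rough, hypothetical illustration of a compact view-classification CNN with batch normalisation and dropout (the actual network architecture, input size, and class count are not specified beyond the abstract; eight view classes and 224x224 inputs are assumptions), one might sketch:

```python
# Illustrative only: a minimal view-classification CNN in tf.keras, in the spirit
# of the architecture described, not the actual Ultromics network.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_view_classifier(input_shape=(224, 224, 1), n_views=8):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(16, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.5),                    # dropout to reduce overfitting
        layers.Dense(n_views, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```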


2021 ◽  
Vol 13 (12) ◽  
pp. 2301
Author(s):  
Zander Venter ◽  
Markus Sydenham

Land cover maps are important tools for quantifying the human footprint on the environment and for facilitating reporting and accounting under international agreements addressing the Sustainable Development Goals. Widely used European land cover maps such as CORINE (Coordination of Information on the Environment) are produced at medium spatial resolution (100 m) and rely on diverse data with complex workflows requiring significant institutional capacity. We present a 10 m resolution land cover map (ELC10) of Europe based on a satellite-driven machine learning workflow that is annually updatable. A random forest classification model was trained on 70K ground-truth points from the LUCAS (Land Use/Cover Area Frame Survey) dataset. Within the Google Earth Engine cloud computing environment, the ELC10 map can be generated from approx. 700 TB of Sentinel imagery within approx. 4 days from a single research user account. The map achieved an overall accuracy of 90% across eight land cover classes and could account for statistical unit land cover proportions within 3.9% (R2 = 0.83) of the actual value. These accuracies are higher than those of CORINE (100 m) and other 10 m land cover maps including S2GLC and FROM-GLC10. Spectro-temporal metrics that capture the phenology of land cover classes were most important in producing high mapping accuracies. We found that atmospheric correction of Sentinel-2 and speckle filtering of Sentinel-1 imagery had a minimal effect on classification accuracy (<1%). However, combining optical and radar imagery increased accuracy by 3% compared to Sentinel-2 alone and by 10% compared to Sentinel-1 alone. The addition of auxiliary data (terrain, climate and night-time lights) increased accuracy by a further 2%. By using the centroid pixels from the LUCAS Copernicus module polygons we increased accuracy by <1%, revealing that random forests are robust against contaminated training data. Furthermore, the model requires very little training data to achieve moderate accuracies; the difference between 5K and 50K LUCAS points is only 3% (86 vs. 89%). This implies that significantly fewer resources are needed to make in situ survey data (such as LUCAS) suitable for satellite-based land cover classification. At 10 m resolution, the ELC10 map can distinguish detailed landscape features like hedgerows and gardens, and therefore holds potential for areal statistics at the city borough level and monitoring of property-level environmental interventions (e.g., tree planting). Due to its reliance on purely satellite-based input data, the ELC10 map can be continuously updated independently of any country-specific geographic datasets.
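A hedged sketch of the kind of Earth Engine workflow described, using the Python API; the LUCAS asset path, the label property name, and the single-year Sentinel-2 composite are placeholders rather than the paper's actual inputs:

```python
# Minimal Google Earth Engine sketch: train a random forest on ground-truth points
# and classify a Sentinel-2 composite at 10 m.
import ee
ee.Initialize()

composite = (ee.ImageCollection("COPERNICUS/S2_SR")
             .filterDate("2018-01-01", "2018-12-31")
             .median())                                        # crude annual composite
lucas = ee.FeatureCollection("users/example/lucas_points")     # placeholder asset path
bands = composite.bandNames()

# Sample the composite at the ground-truth points ("lc_class" is a placeholder label).
training = composite.sampleRegions(collection=lucas,
                                   properties=["lc_class"],
                                   scale=10)

classifier = ee.Classifier.smileRandomForest(numberOfTrees=500).train(
    features=training, classProperty="lc_class", inputProperties=bands)

land_cover = composite.classify(classifier)                    # 10 m land cover map
```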


2016 ◽  
Vol 42 (3) ◽  
pp. 391-419 ◽  
Author(s):  
Weiwei Sun ◽  
Xiaojun Wan

From the perspective of structural linguistics, we explore paradigmatic and syntagmatic lexical relations for Chinese POS tagging, an important and challenging task for Chinese language processing. Paradigmatic lexical relations are explicitly captured by word clustering on large-scale unlabeled data and are used to design new features that enhance a discriminative tagger. Syntagmatic lexical relations are implicitly captured by syntactic parsing in the constituency formalism, and are utilized via system combination. Experiments on the Penn Chinese Treebank demonstrate the importance of both paradigmatic and syntagmatic relations. Our linguistically motivated, hybrid approaches yield a relative error reduction of 18% in total over state-of-the-art baselines. Despite their effectiveness in boosting accuracy, computationally expensive parsers make hybrid systems impractical for many realistic NLP applications. In this article, we are therefore also concerned with improving tagging efficiency at test time. In particular, we exploit unlabeled data to transfer the predictive power of hybrid models to simple sequence models: hybrid systems are used to create large-scale pseudo training data for cheap models. Experimental results illustrate that the re-compiled models not only achieve high accuracy with respect to per-token classification, but also serve well as a front-end to a parser.
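The "re-compilation" step can be sketched schematically as pseudo-labeling followed by training a cheap model; hybrid_tag and FastTagger below are hypothetical placeholders, not the authors' implementations:

```python
# Schematic sketch: use an accurate but slow hybrid tagger to pseudo-label large
# amounts of unlabeled text, then train a fast sequence model on the result.
def build_pseudo_corpus(unlabeled_sentences, hybrid_tag):
    # hybrid_tag: callable mapping a token list to a tag list (slow, accurate)
    return [(sent, hybrid_tag(sent)) for sent in unlabeled_sentences]

def recompile(unlabeled_sentences, hybrid_tag, FastTagger):
    pseudo_corpus = build_pseudo_corpus(unlabeled_sentences, hybrid_tag)
    fast_tagger = FastTagger()          # e.g. a linear-chain CRF or perceptron tagger
    fast_tagger.train(pseudo_corpus)    # cheap model learns from pseudo-labeled data
    return fast_tagger
```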


2013 ◽  
Vol 427-429 ◽  
pp. 2309-2312
Author(s):  
Hai Bin Mei ◽  
Ming Hua Zhang

Alert classifiers built with supervised classification techniques require large amounts of labeled training alerts. Preparing such training data is difficult and expensive, which greatly restricts the accuracy and feasibility of current classifiers. This paper employs semi-supervised learning to build an alert classification model that reduces the number of labeled training alerts needed. Alert context properties are also introduced to improve classification performance. Experiments demonstrate the accuracy and feasibility of our approach.
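A minimal, generic illustration of semi-supervised classification with few labeled examples, here using scikit-learn's label spreading on toy alert features; the paper's actual classifier and alert context properties are not modeled:

```python
# Semi-supervised labeling of alerts: a few labeled alerts propagate their labels
# to the unlabeled ones through feature similarity.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Toy alert feature vectors; columns are placeholder alert properties.
X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1],
              [0.8, 0.2], [0.15, 0.85], [0.85, 0.15]])
y = np.array([1, -1, 0, -1, -1, -1])   # 1 = true alert, 0 = false positive, -1 = unlabeled

model = LabelSpreading(kernel="rbf", gamma=20).fit(X, y)
print(model.transduction_)             # inferred labels for the unlabeled alerts
```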


Author(s):  
Noviah Dwi Putranti ◽  
Edi Winarko

Abstract Sentiment analysis in this research classifies textual documents into two classes: positive and negative sentiment. Opinion data were collected from the social networking site Twitter using queries in Indonesian. The study aims to determine public sentiment toward a particular object expressed on Twitter in Indonesian, thereby supporting market research on public opinion. The collected data were preprocessed and POS-tagged to produce a classification model through a training process. Sentiment words were gathered using a dictionary-based approach, yielding 18,069 words in this study. The Maximum Entropy algorithm was used for POS tagging, and a Support Vector Machine was used to build the classification model on the training data. The features used were unigrams with TF-IDF weighting. The resulting classifier achieved an accuracy of 86.81% under 7-fold cross validation with a sigmoid kernel, while manual class labeling with the POS tagger yielded an accuracy of 81.67%. Keywords: sentiment analysis, classification, maximum entropy POS tagger, support vector machine, Twitter.
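A rough re-creation of the classification setup described (unigram TF-IDF features, an SVM with a sigmoid kernel, 7-fold cross validation), shown with scikit-learn on placeholder tweets rather than the study's Indonesian corpus:

```python
# TF-IDF unigrams + sigmoid-kernel SVM, evaluated with 7-fold cross validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder tweets and labels (1 = positive, 0 = negative sentiment).
tweets = ["pelayanan sangat memuaskan", "produk ini mengecewakan",
          "kualitas bagus sekali", "tidak akan beli lagi"] * 10
labels = [1, 0, 1, 0] * 10

pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)),   # unigram TF-IDF features
                         SVC(kernel="sigmoid"))
scores = cross_val_score(pipeline, tweets, labels, cv=7)
print(scores.mean())
```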


2020 ◽  
Vol 69 ◽  
Author(s):  
Benjamin Fish ◽  
Lev Reyzin

In the problem of learning a class ratio from unlabeled data, which we call CR learning, the training data is unlabeled, and only the ratios, or proportions, of examples receiving each label are given. The goal is to learn a hypothesis that predicts the proportions of labels on the distribution underlying the sample. This model of learning is applicable to a wide variety of settings, including predicting the number of votes for candidates in political elections from polls. In this paper, we formally define this model, resolve foundational questions regarding the computational complexity of CR learning, and characterize its relationship to PAC learning. Among our results, we show, perhaps surprisingly, that for finite VC classes what can be efficiently CR learned is a strict subset of what can be learned efficiently in PAC, under standard complexity assumptions. We also show that there exist classes of functions whose CR learnability is independent of ZFC, the standard set-theoretic axioms. This implies that CR learning cannot be easily characterized (as PAC learning is by VC dimension).


Author(s):  
Aijun An

Generally speaking, classification is the act of assigning an object to a category according to its characteristics. In data mining, classification refers to the task of analyzing a set of pre-classified data objects to learn a model (or a function) that can be used to classify an unseen data object into one of several predefined classes. A data object, referred to as an example, is described by a set of attributes or variables. One of the attributes describes the class that an example belongs to and is thus called the class attribute or class variable. Other attributes are often called independent or predictor attributes (or variables). The set of examples used to learn the classification model is called the training data set. Tasks related to classification include regression, which builds a model from training data to predict numerical values, and clustering, which groups examples to form categories. Classification belongs to the category of supervised learning, as distinguished from unsupervised learning: in supervised learning, the training data consist of pairs of input data (typically vectors) and desired outputs, whereas in unsupervised learning there are no a priori outputs. Classification has various applications, such as learning from a patient database to diagnose a disease based on a patient's symptoms, analyzing credit card transactions to identify fraudulent ones, automatically recognizing letters or digits from handwriting samples, and distinguishing highly active compounds from inactive ones based on compound structure for drug discovery.
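A minimal example of the task as described, with made-up loan data: predictor attributes plus a class attribute form the training set, and the learned model classifies an unseen example:

```python
# Learn a classification model from pre-classified examples and apply it to a new one.
from sklearn.tree import DecisionTreeClassifier

# Predictor attributes: [income_in_thousands, years_employed]; class: 0 = risky, 1 = safe.
X_train = [[25, 1], [40, 3], [60, 5], [80, 10], [30, 2], [90, 12]]
y_train = [0, 0, 1, 1, 0, 1]

model = DecisionTreeClassifier().fit(X_train, y_train)
print(model.predict([[55, 4]]))        # classify an unseen loan application
```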


Information ◽  
2019 ◽  
Vol 10 (6) ◽  
pp. 187
Author(s):  
Rattanawadee Panthong ◽  
Anongnart Srivihok

Liver cancer data typically consist of large multidimensional datasets. A dataset with a huge number of features and multiple classes may contain many features that are irrelevant to pattern classification in machine learning. Hence, feature selection improves the performance of the classification model and helps achieve maximum classification accuracy. The aims of the present study were to find the best feature subset and to evaluate the classification performance of the predictive model. This paper proposes a hybrid feature selection approach combining information gain and sequential forward selection based on the class-dependent technique (IGSFS-CD) for the liver cancer classification model. Two different classifiers (decision tree and naïve Bayes) were used to evaluate feature subsets. The liver cancer datasets were obtained from the Cancer Hospital Thailand database. Three ensemble methods (ensemble classifiers, bagging, and AdaBoost) were applied to improve the classification performance. The IGSFS-CD method achieved good accuracy of 78.36% (sensitivity 0.7841 and specificity 0.9159) on LC_dataset-1. In addition, LC_dataset-2 delivered the best performance, with an accuracy of 84.82% (sensitivity 0.8481 and specificity 0.9437). The IGSFS-CD method achieved better classification performance than the class-independent method. Furthermore, the best feature subset selection helped reduce the complexity of the predictive model.
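A generic sketch of a hybrid information gain plus sequential forward selection pipeline in scikit-learn; it omits the class-dependent step and uses synthetic data, so it approximates the spirit of IGSFS-CD rather than the authors' method:

```python
# Hybrid filter + wrapper selection: rank features by information gain (mutual
# information), then run sequential forward selection over the top-ranked ones.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

def ig_sfs(X, y, n_prefilter=20, n_final=10):
    # 1) filter step: rank features by information gain with the class
    ig = mutual_info_classif(X, y, random_state=0)
    top = np.argsort(ig)[::-1][:n_prefilter]
    # 2) wrapper step: forward selection over the pre-filtered features
    sfs = SequentialFeatureSelector(DecisionTreeClassifier(random_state=0),
                                    n_features_to_select=n_final,
                                    direction="forward", cv=5)
    sfs.fit(X[:, top], y)
    return top[sfs.get_support()]

X, y = make_classification(n_samples=300, n_features=40, n_informative=8, random_state=0)
print(ig_sfs(X, y))                    # indices of the selected feature subset
```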


Author(s):  
Tobias Scheffer

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data.

