Active Classification With Uncertainty Comparison Queries

2022 ◽  
pp. 1-23
Author(s):  
Zhenghang Cui ◽  
Issei Sato

Noisy pairwise comparison feedback has been incorporated to improve the overall query complexity of interactively learning binary classifiers. The positivity comparison oracle is extensively used to provide feedback on which of a pair of data points is more likely to be positive. Because it is impossible to determine accurate labels using this oracle alone without knowing the classification threshold, existing methods still rely on the traditional explicit labeling oracle, which directly answers the label of a given data point. The existing method sorts all data points and then uses the explicit labeling oracle to find the classification threshold. However, it has two drawbacks: (1) it performs unnecessary sorting for label inference, and (2) it naively adapts quicksort to noisy feedback. To avoid these inefficiencies while simultaneously acquiring information about the classification threshold, we propose a new pairwise comparison oracle concerning uncertainties. This oracle answers which of a given pair of data points has higher uncertainty. We then propose an efficient adaptive labeling algorithm that takes advantage of the proposed oracle. In addition, we address the situation where the labeling budget is insufficient compared to the data set size. Finally, we confirm the feasibility of the proposed oracle and the performance of the proposed algorithm both theoretically and empirically.
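A minimal sketch of the two oracle types, assuming a probabilistic scorer `score` that maps a data point to P(y = +1) and a symmetric flip-noise rate; the names, the noise model, and the 0.5 threshold are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def positivity_oracle(score, x1, x2, noise=0.1):
    """Noisy positivity comparison: is x1 more likely positive than x2?"""
    answer = score(x1) > score(x2)
    # The oracle's answer is flipped with probability `noise`.
    return answer if np.random.rand() > noise else not answer

def uncertainty_oracle(score, x1, x2, noise=0.1):
    """Noisy uncertainty comparison: is x1 closer to the (unknown)
    classification threshold than x2, i.e., more uncertain?"""
    answer = abs(score(x1) - 0.5) < abs(score(x2) - 0.5)
    return answer if np.random.rand() > noise else not answer
```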


2016 ◽  
Author(s):  
Lee Naish

Identifying patterns and associations in data is fundamental to discovery in science. This work investigates a very simple instance of the problem, where each data point consists of a vector of binary attributes, and attributes are treated equally. For example, each data point may correspond to a person and the attributes may be their sex, whether they smoke cigarettes, whether they have been diagnosed with lung cancer, etc. Measuring similarity of attributes in the data is equivalent to measuring similarity of sets: an attribute can be mapped to the set of data points that have the attribute. Furthermore, there is one identified base set (or attribute) and only similarity to that set is considered; the other sets are just ranked according to how similar they are to the base set. For example, if the base set is lung cancer sufferers, the set of smokers may well be high in the ranking. Identifying set similarity or correlation has many uses and is often the first step in determining causality. Set similarity is also the basis for comparing binary classifiers such as diagnostic tests for any data set. More than a hundred set similarity measures have been proposed in the literature, but there is very little understanding of how best to choose a similarity measure for a given domain. This work discusses numerous properties that similarity measures can have, weakening some previously proposed definitions so they are no longer incompatible, and identifying important forms of symmetry which have not previously been considered. It defines ordering relations over similarity measures and shows how some properties of a domain can be used to help choose a similarity measure which will perform well for that domain.
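As a concrete illustration of the set-based view, the sketch below maps each attribute to the set of data points that have it and ranks attributes against a base set using one candidate measure (Jaccard); the toy data and attribute names are hypothetical, and the paper surveys far more measures than this one:

```python
from typing import Dict, Set

def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity |A & B| / |A | B|, one of many proposed measures."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_by_similarity(base: Set[int], attrs: Dict[str, Set[int]]):
    """Rank attribute sets by similarity to the base set, most similar first."""
    return sorted(attrs.items(), key=lambda kv: jaccard(base, kv[1]), reverse=True)

# Toy data: points 0-5; each attribute maps to the points that have it.
cancer = {0, 1, 2}                                # base set: lung cancer sufferers
attrs = {"smoker": {0, 1, 3}, "male": {1, 4, 5}}
print(rank_by_similarity(cancer, attrs))          # smokers rank above males here
```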


Author(s):  
Simona Babiceanu ◽  
Sanhita Lahiri ◽  
Mena Lockwood

This study uses a suite of performance measures, developed by taking into consideration various aspects of congestion and reliability, to assess the impacts of safety projects on congestion. Safety projects are necessary to help move Virginia's roadways toward safer operation, but they can contribute to congestion and unreliability during execution and can affect operations after execution. However, safety projects are assessed primarily for safety improvements, not for congestion. This study identifies an appropriate suite of measures and quantifies and compares the congestion and reliability impacts of safety projects on roadways for the periods before, during, and after project execution. The paper presents the performance measures, examines their sensitivity based on operating conditions, defines thresholds for congestion and reliability, and demonstrates the measures using a set of Virginia safety projects. The data set consists of 10 projects totalling 92 mi and more than 1 million data points. The study found that, overall, safety projects tended to have a positive impact on congestion and reliability after completion, and that the congestion variability measures were sensitive to the reliability threshold. The study concludes with practical recommendations for primary measures that may be used to gauge the overall impacts of safety projects: percent vehicle miles traveled (VMT) reliable, with a customized threshold for Virginia; percent VMT delayed; and time to travel 10 mi. However, caution should be used when applying the results directly to other situations because of the limited number of projects used in the study.
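A rough sketch of how the three recommended measures might be computed from segment-level speed data; the travel time index formulation and the threshold values are assumptions, since the study customizes its thresholds for Virginia (inputs are NumPy arrays of equal length):

```python
import numpy as np

def congestion_measures(speed_mph, free_flow_mph, vmt,
                        reliable_tti=1.5, delayed_tti=1.3):
    """Illustrative versions of percent VMT reliable, percent VMT delayed,
    and time to travel 10 mi (threshold values here are hypothetical)."""
    tti = free_flow_mph / speed_mph                  # travel time index per segment
    pct_vmt_reliable = 100 * vmt[tti <= reliable_tti].sum() / vmt.sum()
    pct_vmt_delayed = 100 * vmt[tti > delayed_tti].sum() / vmt.sum()
    # Minutes to cover 10 mi at the VMT-weighted average speed.
    minutes_per_10mi = 60 * 10 / np.average(speed_mph, weights=vmt)
    return pct_vmt_reliable, pct_vmt_delayed, minutes_per_10mi
```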


Algorithms ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 37
Author(s):  
Shixun Wang ◽  
Qiang Chen

Boosting in ensemble learning has made great progress, but most methods boost only a single modality. For this reason, we extend a simple multiclass Boosting framework that uses local similarity as its weak learner to multimodal multiclass Boosting. First, with local similarity as the weak learner, the loss function is used to find a baseline loss, and the data points are binarized under a logarithmic loss. Then the optimal local similarity and its corresponding loss are found; whichever is smaller than the baseline loss is the best so far. Second, the local similarity between two points is calculated, and the loss is then computed from that local similarity. Finally, text and images are retrieved from each other, and the retrieval accuracy in each direction is obtained. Experimental results show that the multimodal multiclass Boosting framework with local similarity as the weak learner, evaluated on standard data sets and compared with other state-of-the-art methods, demonstrates the effectiveness of this approach.
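The procedure is hard to pin down from the abstract alone; the sketch below shows only the general shape of boosting with a local-similarity weak learner (an AdaBoost-style loop over nearest-prototype stumps), as a stand-in rather than the authors' exact algorithm. Labels are assumed to be in {-1, +1}:

```python
import numpy as np

def local_similarity_stump(X, y, w):
    """Weak learner: pick the training point whose local-similarity
    neighborhood best separates the weighted classes (a stand-in for the
    paper's local-similarity learner)."""
    best = None
    for i in range(len(X)):
        sim = -np.linalg.norm(X - X[i], axis=1)        # local similarity to X[i]
        thresh = np.median(sim)
        pred = np.where(sim > thresh, 1, -1)
        err = w[pred != y].sum()
        if best is None or err < best[0]:
            best = (err, X[i], thresh)
    return best

def boost(X, y, rounds=10):
    """Minimal AdaBoost loop over the similarity-based weak learner."""
    w = np.full(len(X), 1 / len(X))
    ensemble = []
    for _ in range(rounds):
        err, proto, thresh = local_similarity_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)          # guard against 0 or 1
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(-np.linalg.norm(X - proto, axis=1) > thresh, 1, -1)
        w *= np.exp(-alpha * y * pred)                 # reweight mistakes upward
        w /= w.sum()
        ensemble.append((alpha, proto, thresh))
    return ensemble
```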


Author(s):  
Jungeui Hong ◽  
Elizabeth A. Cudney ◽  
Genichi Taguchi ◽  
Rajesh Jugulum ◽  
Kioumars Paryani ◽  
...  

The Mahalanobis-Taguchi System is a diagnostic and predictive method for analyzing patterns in multivariate cases. The goal of this study is to compare the ability of the Mahalanobis-Taguchi System and a neural network to discriminate using small data sets. We examine discriminant ability as a function of data set size using an application area where reliable data are publicly available. The study uses the Wisconsin Breast Cancer data set with nine attributes and one class.
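The scoring core of the Mahalanobis-Taguchi System can be sketched as below; note that full MTS additionally uses orthogonal arrays and signal-to-noise ratios for feature selection, which are omitted from this sketch:

```python
import numpy as np

def mahalanobis_scores(normal_group, samples):
    """Scale features using the 'normal' reference group, then score new
    samples by scaled Mahalanobis distance to it; large values flag
    abnormal observations."""
    mu = normal_group.mean(axis=0)
    sigma = normal_group.std(axis=0, ddof=1)
    z_ref = (normal_group - mu) / sigma
    cov_inv = np.linalg.inv(np.cov(z_ref, rowvar=False))
    z = (samples - mu) / sigma
    k = normal_group.shape[1]
    # Scaled MD: values near 1 are typical of the normal group.
    return np.einsum('ij,jk,ik->i', z, cov_inv, z) / k
```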


2021 ◽  
Author(s):  
Ahmed Al-Sabaa ◽  
Hany Gamal ◽  
Salaheldin Elkatatny

The formation porosity of drilled rock is an important parameter that determines the formation storage capacity. The common industrial technique for rock porosity acquisition is the downhole logging tool. Logging while drilling or wireline porosity logging usually provides a complete porosity log for the section of interest; however, operational constraints might preclude the logging job, in addition to its cost. The objective of this study is to provide an intelligent prediction model that predicts porosity from drilling parameters. An artificial neural network (ANN), a tool of artificial intelligence (AI), was employed in this study to build the porosity prediction model based on the following drilling parameters: weight on bit (WOB), drill string rotating speed (RS), drilling torque (T), standpipe pressure (SPP), and mud pumping rate (Q). The novel contribution of this study is to provide a rock porosity model for complex lithology formations using drilling parameters in real time. The model was built using 2,700 data points from Well (A) with a 74:26 training-to-testing ratio. Several sensitivity analyses were performed to optimize the ANN model. The model was validated using an unseen data set (1,000 data points) from Well (B), which is located in the same field and drilled across the same complex lithology. The results showed high model performance in the training, testing, and validation processes. The overall accuracy of the model was determined in terms of the correlation coefficient (R) and the average absolute percentage error (AAPE). Overall, R was higher than 0.91 and AAPE was less than 6.1% for model building and validation. Predicting rock porosity while drilling in real time will save logging costs and, in addition, provide a guide for formation storage capacity and interpretation analysis.
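A compact sketch of the modeling setup and the two reported metrics, using scikit-learn's MLPRegressor on placeholder data; the network architecture and hyperparameters are assumptions, not the paper's optimized configuration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# X columns stand in for [WOB, RS, T, SPP, Q]; y for porosity.
# Random placeholder data, sized like the paper's Well (A) set.
rng = np.random.default_rng(0)
X, y = rng.random((2700, 5)), rng.random(2700)

scaler = StandardScaler().fit(X)
model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
model.fit(scaler.transform(X), y)

def r_and_aape(y_true, y_pred):
    """The paper's two accuracy metrics: correlation coefficient R and
    average absolute percentage error AAPE."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    aape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))
    return r, aape

print(r_and_aape(y, model.predict(scaler.transform(X))))
```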


2022 ◽  
Author(s):  
Nabeel Durrani ◽  
Damjan Vukovic ◽  
Maria Antico ◽  
Jeroen van der Burgt ◽  
Ruud JG van Sloun ◽  
...  

Our automated deep learning-based approach identifies consolidation/collapse in LUS images to aid in the diagnosis of late stages of COVID-19 induced pneumonia, where consolidation/collapse is one of the possible associated pathologies. A common challenge in training such models is that annotating each frame of an ultrasound video requires high labelling effort. This effort in practice becomes prohibitive for large ultrasound datasets. To understand the impact of various degrees of labelling precision, we compare labelling strategies to train fully supervised models (frame-based method, higher labelling effort) and inaccurately supervised models (video-based methods, lower labelling effort), both of which yield binary predictions for LUS videos on a frame-by-frame level. We moreover introduce a novel sampled quaternary method which randomly samples only 10% of the LUS video frames and subsequently assigns (ordinal) categorical labels to all frames in the video based on the fraction of positively annotated samples. This method outperformed the inaccurately supervised video-based method of our previous work on pleural effusions. More surprisingly, this method outperformed the supervised frame-based approach with respect to metrics such as precision-recall area under curve (PR-AUC) and F1 score that are suitable for the class imbalance scenario of our dataset despite being a form of inaccurate learning. This may be due to the combination of a significantly smaller data set size compared to our previous work and the higher complexity of consolidation/collapse compared to pleural effusion, two factors which contribute to label noise and overfitting; specifically, we argue that our video-based method is more robust with respect to label noise and mitigates overfitting in a manner similar to label smoothing. Using clinical expert feedback, separate criteria were developed to exclude data from the training and test sets respectively for our ten-fold cross validation results, which resulted in a PR-AUC score of 73% and an accuracy of 89%. While the efficacy of our classifier using the sampled quaternary method must be verified on a larger consolidation/collapse dataset, when considering the complexity of the pathology, our proposed classifier using the sampled quaternary video-based method is clinically comparable with trained experts and improves over the video-based method of our previous work on pleural effusions.
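A sketch of the sampled quaternary labelling strategy as described: annotate roughly 10% of a video's frames, then give every frame one ordinal label derived from the fraction of sampled frames annotated positive. The four bin edges below are assumptions, since the abstract does not state them:

```python
import numpy as np

def sampled_quaternary_labels(frame_annotations, sample_frac=0.10, seed=0):
    """frame_annotations: per-frame binary ground truth for one LUS video.
    Returns one ordinal category (0-3) replicated across all frames."""
    rng = np.random.default_rng(seed)
    n = len(frame_annotations)
    idx = rng.choice(n, size=max(1, int(sample_frac * n)), replace=False)
    pos_frac = np.mean([frame_annotations[i] for i in idx])
    bins = [0.0, 0.25, 0.5, 0.75, 1.01]            # hypothetical ordinal bins
    label = int(np.digitize(pos_frac, bins) - 1)   # ordinal category in 0..3
    return [label] * n   # same categorical label for every frame in the video
```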


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to get the support of every row. Further, the data object having the largest support is chosen as an initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
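A minimal sketch of the described selection procedure for categorical data; the Hamming distance and the farthest-point rule for subsequent centers are plain readings of the abstract, not necessarily the paper's exact formulation:

```python
from collections import Counter

def support_based_centers(data, k):
    """data: list of equal-length tuples of categorical values.
    Weight each value by its support (frequency), sum weights along each row,
    take the row with the largest support as the first center, then add rows
    farthest from the centers chosen so far."""
    n_attrs = len(data[0])
    support = [Counter(row[j] for row in data) for j in range(n_attrs)]
    row_support = [sum(support[j][row[j]] for j in range(n_attrs)) for row in data]
    centers = [max(range(len(data)), key=row_support.__getitem__)]
    hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
    while len(centers) < k:
        dist = lambda i: min(hamming(data[i], data[c]) for c in centers)
        centers.append(max(range(len(data)), key=dist))
    return [data[i] for i in centers]
```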


2021 ◽  
Vol 50 (1) ◽  
pp. 138-152
Author(s):  
Mujeeb Ur Rehman ◽  
Dost Muhammad Khan

Recently, anomaly detection has received considerable attention from data mining researchers, and its reputation has grown steadily in practical domains such as product marketing, fraud detection, medical diagnosis, and fault detection. High-dimensional data poses exceptional challenges for outlier detection because of the curse of dimensionality and the increasing resemblance of distant and adjacent points. Traditional algorithms and techniques perform outlier detection on the full feature space. Such methods concentrate largely on low-dimensional data and are therefore ineffective at discovering anomalies in data sets with a high number of dimensions. Digging out the anomalies present in a high-dimensional data set becomes very difficult and tedious when all subsets of projections need to be explored. All data points in high-dimensional data behave like similar observations because of an intrinsic property of such data: the distance between observations approaches zero as the number of dimensions tends towards infinity. This research work proposes a novel technique that explores the deviation among all data points and embeds its findings inside well-established density-based techniques. It opens a new breadth of research towards resolving the inherent problems of high-dimensional data, where outliers reside within clusters having different densities. A high-dimensional data set from the UCI Machine Learning Repository is chosen to test the proposed technique, and its results are compared with those of density-based techniques to evaluate its efficiency.
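The abstract does not specify how the deviation scores are embedded into a density-based detector; the sketch below shows one plausible reading, fusing a per-point deviation score with scikit-learn's Local Outlier Factor. The fusion rule is an assumption:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def deviation_weighted_lof(X, n_neighbors=20):
    """Combine a per-point deviation score with a classical density-based
    detector (LOF); larger returned values indicate stronger outliers."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors)
    lof.fit(X)
    density_score = -lof.negative_outlier_factor_       # larger = more outlying
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))    # per-feature deviation
    deviation_score = z.mean(axis=1)
    return density_score * deviation_score              # hypothetical fusion
```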

