binary classification problem
Recently Published Documents

TOTAL DOCUMENTS: 57 (FIVE YEARS: 27)
H-INDEX: 8 (FIVE YEARS: 2)

2021, Vol 7, pp. e757
Author(s): Artem Ryzhikov, Maxim Borisyak, Andrey Ustyuzhanin, Denis Derkach

Anomaly detection is a challenging task that frequently arises in practically all areas of industry and science, from fraud detection and data quality monitoring to finding rare cases of diseases and searching for new physics. Most conventional approaches to anomaly detection, such as the one-class SVM and the Robust Auto-Encoder, are one-class classification methods, i.e., they focus on separating normal data from the rest of the space. Such methods rest on the assumption that the normal and anomalous classes are separable, and consequently ignore any available samples of anomalies. In practical settings, however, some anomalous samples are often available, usually in amounts far lower than required for a balanced classification task, and the separability assumption might not hold. This leads to an important task: incorporating known anomalous samples into the training procedures of anomaly detection models. In this work, we propose a novel model-agnostic training procedure to address this task. We reformulate one-class classification as a binary classification problem in which normal data are distinguished from pseudo-anomalous samples. The pseudo-anomalous samples are drawn from low-density regions of a normalizing flow model by feeding tails of the latent distribution into the model. This approach makes it easy to include known anomalies in the training process of an arbitrary classifier. We demonstrate that our approach shows comparable performance on one-class problems and, most importantly, achieves comparable or superior results on tasks with variable amounts of known anomalies.
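The tail-sampling idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it substitutes the simplest possible flow, an affine (Gaussian) map, for a trained normalizing flow, and the names `sample_tail_latent`, `pseudo_anomalies`, and `affine` are hypothetical.

```python
import random

def sample_tail_latent(dim, threshold=2.5):
    """Rejection-sample a latent vector lying in the tail of the
    standard normal latent distribution (some coordinate exceeds
    `threshold` in absolute value)."""
    while True:
        z = [random.gauss(0.0, 1.0) for _ in range(dim)]
        if max(abs(c) for c in z) > threshold:
            return z

def pseudo_anomalies(flow_forward, n, dim, threshold=2.5):
    """Map tail latents through the flow's generative direction to get
    pseudo-anomalous samples in data space (low-density under the flow)."""
    return [flow_forward(sample_tail_latent(dim, threshold)) for _ in range(n)]

# Toy "flow": an affine map x = mu + sigma * z, which is an exact
# normalizing flow for a factorized Gaussian. A real model would use
# e.g. RealNVP or another invertible network.
mu, sigma = [1.0, -2.0], [0.5, 2.0]
affine = lambda z: [m + s * c for m, s, c in zip(mu, sigma, z)]

random.seed(0)
fakes = pseudo_anomalies(affine, 100, 2)
```

These pseudo-anomalies can then be mixed with any known real anomalies and fed, together with the normal data, to an arbitrary binary classifier.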


Algorithms, 2021, Vol 14 (11), pp. 301
Author(s): Umberto Michelucci, Michela Sperti, Dario Piga, Francesca Venturini, Marco A. Deriu

This paper presents the intrinsic limit determination algorithm (ILD algorithm), a novel technique to determine the best possible performance, measured in terms of the AUC (area under the ROC curve) and accuracy, that can be obtained from a specific dataset in a binary classification problem with categorical features, regardless of the model used. This limit, namely the Bayes error, is completely model-independent and describes an intrinsic property of the dataset. The ILD algorithm thus provides important information regarding the prediction limits of any binary classification algorithm applied to the considered dataset. In this paper, the algorithm is described in detail, its full mathematical framework is presented, and pseudocode is given to facilitate its implementation. Finally, an example with a real dataset is given.
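For purely categorical features the intrinsic limit can be computed directly by counting: within each distinct combination of feature values, the best any classifier can do is predict the majority label, so the unavoidable (Bayes) error is the minority mass in each cell. The sketch below illustrates that counting argument; it is not the authors' pseudocode, and `intrinsic_limit` is an illustrative name.

```python
from collections import Counter

def intrinsic_limit(samples):
    """Best achievable accuracy for a binary problem with categorical
    features. `samples` is an iterable of (features_tuple, label) pairs
    with labels in {0, 1}."""
    counts = Counter()  # (features, label) -> count
    n = 0
    for x, y in samples:
        counts[(tuple(x), y)] += 1
        n += 1
    cells = {}  # features -> {label: count}
    for (x, y), c in counts.items():
        cells.setdefault(x, {})[y] = c
    # Minority mass per cell is the irreducible error in that cell.
    bayes_error = sum(min(d.get(0, 0), d.get(1, 0)) for d in cells.values()) / n
    return 1.0 - bayes_error

# Cell (0,): 8 zeros vs 2 ones -> 2 unavoidable errors; cell (1,): pure.
data = [((0,), 0)] * 8 + [((0,), 1)] * 2 + [((1,), 1)] * 10
acc = intrinsic_limit(data)  # 18 of 20 samples classifiable at best
```

No model can exceed this accuracy on the dataset above, which is the sense in which the limit is an intrinsic property of the data rather than of any algorithm.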


2021, Vol 28 (3), pp. 250-259
Author(s): Ksenia Vladimirovna Lagutina

The article compares character-level, word-level, and rhythm features for the authorship verification of literary texts of the 19th-21st centuries. The text corpora contain fragments of novels; each fragment has a size of about 50,000 characters, with 40 fragments per author. 20 authors who wrote in English, Russian, or French, and 8 Spanish-language authors are considered. The authors of this paper use existing algorithms to calculate low-level features popular in computational linguistics, as well as rhythm features common in literary texts. Low-level features include n-grams of words, frequencies of letters and punctuation marks, average word and sentence lengths, etc. Rhythm features are based on lexico-grammatical figures: anaphora, epiphora, symploce, aposiopesis, epanalepsis, anadiplosis, diacope, epizeuxis, chiasmus, polysyndeton, and repetitive exclamatory and interrogative sentences. These features include the frequency of occurrence of particular rhythm figures per 100 sentences, the number of unique words within rhythm figures, and the percentage of nouns, adjectives, adverbs, and verbs within rhythm figures. Authorship verification is treated as a binary classification problem: whether or not the text belongs to a particular author. AdaBoost and a neural network with an LSTM layer are considered as classification algorithms. The experiments demonstrate the effectiveness of rhythm features in verifying particular authors, and the superiority, on average, of combinations of feature types over single feature types. The best precision, recall, and F-measure for the AdaBoost classifier exceed 90% when all three types of features are combined.
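As an illustration of one rhythm feature, the sketch below computes an anaphora frequency per 100 sentences, approximating anaphora as consecutive sentences that begin with the same word. The detection rule is a deliberate simplification for illustration, not the authors' algorithm, and `anaphora_rate` is a hypothetical name.

```python
def anaphora_rate(sentences, per=100):
    """Frequency (per `per` sentences) of anaphora, approximated as
    consecutive sentences sharing their first word."""
    def first_word(s):
        words = s.strip().split()
        return words[0].lower() if words else ""
    hits = sum(
        1
        for a, b in zip(sentences, sentences[1:])
        if first_word(a) and first_word(a) == first_word(b)
    )
    return hits / len(sentences) * per

text = [
    "We shall fight on the beaches.",
    "We shall fight on the landing grounds.",
    "They shall not pass.",
    "Never was so much owed by so many to so few.",
]
rate = anaphora_rate(text)  # one anaphoric pair in 4 sentences -> 25.0
```

A full rhythm-feature vector would compute analogous per-100-sentence frequencies for each figure (epiphora, symploce, etc.) and feed them to the classifier alongside the low-level features.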


Author(s): Guangxin Su, Weitong Chen, Miao Xu

Positive-unlabeled (PU) learning deals with the binary classification problem when only positive (P) and unlabeled (U) data are available, without negative (N) data. Existing PU methods perform well on balanced datasets. However, in real applications such as financial fraud detection or medical diagnosis, data are almost always imbalanced, and it remains unclear whether existing PU methods can perform well in that regime. In this paper, we explore this problem and propose a general learning objective for PU learning targeted specifically at imbalanced data. Under this general learning objective, state-of-the-art PU methods based on optimizing a consistent risk can be adapted to overcome the imbalance. We show theoretically that, in expectation, optimizing our learning objective is equivalent to learning a classifier on oversampled balanced data with both P and N data available, and we further provide an estimation error bound. Finally, experimental results validate the effectiveness of our proposal compared to state-of-the-art PU methods.
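The "consistent risk" that such methods optimize can be made concrete with the standard non-negative PU (nnPU) risk estimator of Kiryo et al., which the paper's objective builds on; the imbalance-specific reweighting proposed in the paper is not reproduced here. A minimal sketch, with `nnpu_risk` and `sigmoid_loss` as illustrative names:

```python
import math

def nnpu_risk(scores_p, scores_u, prior, loss):
    """Non-negative PU risk estimator: uses only positive and unlabeled
    classifier scores plus the class prior pi = P(y = +1).
    `loss(z, y)` is a surrogate loss of score z against label y in {+1, -1}.
    """
    r_p_pos = sum(loss(z, +1) for z in scores_p) / len(scores_p)
    r_p_neg = sum(loss(z, -1) for z in scores_p) / len(scores_p)
    r_u_neg = sum(loss(z, -1) for z in scores_u) / len(scores_u)
    # The negative-class risk is estimated from U minus the P contribution,
    # clipped at zero so the empirical risk cannot go negative.
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)

# Smooth surrogate for the zero-one loss.
sigmoid_loss = lambda z, y: 1.0 / (1.0 + math.exp(y * z))

risk = nnpu_risk([2.0, 1.5, 3.0], [-1.0, 0.5, -2.0, 1.0],
                 prior=0.3, loss=sigmoid_loss)
```

In a training loop this quantity would be minimized over the classifier's parameters; the paper's contribution is, in effect, to reweight the terms so that minimizing the objective matches training on oversampled balanced data.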


Author(s): Kanae Takahashi, Kouji Yamamoto, Aya Kuchiba, Tatsuki Koyama

Abstract: A binary classification problem is common in the medical field, and we often use sensitivity, specificity, accuracy, and negative and positive predictive values as measures of the performance of a binary predictor. In computer science, a classifier is usually evaluated with precision (positive predictive value) and recall (sensitivity). As a single summary measure of a classifier's performance, the F1 score, defined as the harmonic mean of precision and recall, is widely used in the context of information retrieval and information extraction evaluation, since it possesses favorable characteristics, especially when the prevalence is low. Some statistical methods for inference have been developed for the F1 score in binary classification problems; however, they have not been extended to the problem of multi-class classification. There are three types of F1 scores, and the statistical properties of these F1 scores have hardly been discussed. We propose methods based on the large-sample multivariate central limit theorem for estimating F1 scores with confidence intervals.
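Two of the multi-class F1 variants at stake, macro- and micro-averaged F1, can be computed in a few lines of pure Python (a minimal sketch with illustrative names, not the authors' estimator):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Macro- and micro-averaged F1 for single-label multi-class data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_, fp_, fn_):
        prec = tp_ / (tp_ + fp_) if tp_ + fp_ else 0.0
        rec = tp_ / (tp_ + fn_) if tp_ + fn_ else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

    # Macro: average of per-class F1 scores (each class weighted equally).
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool counts over classes; for single-label classification
    # this equals plain accuracy.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
macro, micro = f1_scores(y_true, y_pred)
```

The two averages generally disagree (here macro weights the weak class 0 as heavily as the others), which is why confidence intervals must be derived separately for each type of F1 score.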


2021, Vol 11 (12), pp. 5632
Author(s): Riku Iikura, Makoto Okada, Naoki Mori

The understanding of narrative stories by computers is an important step toward their automatic generation. To date, high-performance neural-network models such as BERT have been applied to tasks such as the Story Cloze Test and Story Completion. In this study, we focus on the segmentation of novels into paragraphs, an important writing technique that helps readers deepen their understanding of the text. This type of segmentation, which we call "paragraph boundary recognition", can be cast as a binary classification problem: the presence or absence of a boundary, such as a paragraph break, between target sentences. However, in this case, data imbalance becomes a bottleneck, because the number of paragraphs is generally smaller than the number of sentences. To deal with this problem, we introduced several cost-sensitive loss functions, namely focal loss, dice loss, and anchor loss, which are robust to imbalanced classification, into BERT. In addition, introducing the threshold-moving technique into the model was effective in estimating paragraph boundaries. In experiments on three newly created datasets, BERT with dice loss and threshold moving obtained a higher F1 score than the original BERT with cross-entropy loss (76% to 80%, 50% to 54%, and 59% to 63%).
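Both techniques can be sketched standalone, outside any BERT pipeline. Below, `focal_loss` implements the standard binary focal loss and `move_threshold` picks the F1-maximizing decision threshold on held-out predictions; the function names and toy data are illustrative, not the authors' code.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights easy examples so training focuses
    on the rare class. `p` is the predicted probability of class 1."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))

def move_threshold(probs, labels):
    """Threshold moving: choose the decision threshold that maximizes F1
    on held-out data instead of using the default 0.5."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Scores skewed low, as happens when boundaries are the minority class.
probs = [0.1, 0.2, 0.3, 0.35, 0.4, 0.9]
labels = [0, 0, 0, 1, 1, 1]
t, f1 = move_threshold(probs, labels)
```

With the default 0.5 threshold, two of the three true boundaries above would be missed; moving the threshold down recovers them, which mirrors the gains the paper reports on imbalanced boundary data.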


Author(s): Yijian Chuan, Chaoyi Zhao, Zhenrui He, Lan Wu

We develop a novel approach to explaining why AdaBoost is a successful classifier. By introducing a measure of the influence of the noise points (ION) in the training data for the binary classification problem, we prove that there is a strong connection between the ION and the test error. We further show that the ION of AdaBoost decreases as the iteration number or the complexity of the base learners increases. We confirm that it is impossible to obtain a consistent classifier without deep trees as the base learners of AdaBoost in some complicated situations. Finally, we apply AdaBoost to portfolio management via empirical studies in the Chinese market, which corroborate our theoretical propositions.


2021, Vol 11 (1)
Author(s): Pritam Saha, Debadyuti Mukherjee, Pawan Kumar Singh, Ali Ahmadian, Massimiliano Ferrara, ...

Abstract: COVID-19, a viral infection that originated in Wuhan, China, has spread across the world and has currently affected over 115 million people. Although the vaccination process has already started, reaching sufficient availability will take time. Considering the impact of this widespread disease, many research attempts have been made by computer scientists to screen for COVID-19 from chest X-rays (CXRs) or computed tomography (CT) scans. To this end, we propose GraphCovidNet, a Graph Isomorphism Network (GIN) based model used to detect COVID-19 from CT scans and CXRs of affected patients. Our proposed model accepts input data only in the form of a graph, as we follow a GIN-based architecture. Initially, pre-processing is performed to convert an image into an undirected graph so that only the edges are considered instead of the whole image. Our proposed GraphCovidNet model is evaluated on four standard datasets: the SARS-COV-2 Ct-Scan dataset, the COVID-CT dataset, the combination of the covid-chestxray-dataset and the Chest X-Ray Images (Pneumonia) dataset, and the CMSC-678-ML-Project dataset. The model shows an impressive accuracy of 99% for all the datasets, and its prediction becomes 100% accurate for the binary classification problem of detecting COVID-19 scans. Source code of this work can be found at GitHub-link.
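The image-to-graph preprocessing step can be illustrated with a minimal sketch. The exact construction used in the paper is not given in the abstract, so the 4-neighbour intensity-difference rule below is an assumption chosen to capture the stated idea of keeping only edge-like structure; `image_to_graph` is a hypothetical name.

```python
def image_to_graph(img, threshold=30):
    """Convert a grayscale image (2-D list of intensities) into an
    undirected graph: pixels are nodes, and an edge connects two
    4-neighbouring pixels whose intensity difference exceeds `threshold`,
    so only edge-like structure, not the whole image, is retained."""
    h, w = len(img), len(img[0])
    node = lambda r, c: r * w + c
    edges = set()
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbours
                rr, cc = r + dr, c + dc
                if rr < h and cc < w and abs(img[r][c] - img[rr][cc]) > threshold:
                    edges.add((node(r, c), node(rr, cc)))
    return h * w, sorted(edges)

# A tiny image with a sharp vertical boundary down the middle.
img = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
n_nodes, edges = image_to_graph(img)
```

Only the two pixel pairs straddling the sharp boundary become graph edges; the resulting sparse graph is the kind of input a GIN-style model consumes in place of the raw image.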


2021, Vol 11 (1)
Author(s): E. Ghasemian, M. K. Tavassoly

Abstract: We present a theoretical scheme for the generation of stationary entangled states. To achieve this, we consider an open quantum system consisting of two qubits immersed in a thermal bath, as the source of dissipation, and then analytically solve the corresponding quantum master equation. We generate two classes of stationary entangled states, including Werner-like and maximally entangled mixed states. Since the solution of the system depends on its initial state, we can manipulate it and construct a robust Bell-like state. We further obtain analytically the population and coherence of the considered two-qubit system and show that residual coherence can be maintained even at equilibrium. Finally, we successfully encode our two-qubit system to solve a binary classification problem. We demonstrate that the introduced classifiers achieve high accuracy without requiring any iterative method, and that the quantum-based classifiers outperform the classical ones.

