Interval SVM-Based Classification Algorithm Using the Uncertainty Trick

2017 ◽  
Vol 26 (04) ◽  
pp. 1750014 ◽  
Author(s):  
Lev V. Utkin ◽  
Yulia A. Zhuk

A new robust SVM-based algorithm for binary classification is proposed. It relies on the so-called uncertainty trick, in which training data with interval uncertainty are transformed into training data with weight or probabilistic uncertainty. Every interval is replaced by a set of training points with the same class label, such that every point inside the interval has an unknown weight taken from a predefined set of weights. A robust strategy dealing with the upper bound of the interval-valued expected risk produced by the set of weights is used in the SVM. An extension of the algorithm based on the imprecise Dirichlet model is proposed for additional robustification. Numerical examples with synthetic and real interval-valued training data illustrate the proposed algorithm and its extension.
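To make the uncertainty trick concrete, here is a minimal sketch of one way the transformation and the minimax weighting could look: each one-dimensional interval is discretized into a few candidate points, and a weighted SVM is alternately refit while the weight within each interval is shifted to its worst (highest-loss) candidate. The discretization, the weight-update rule, and the iteration count are assumptions made for illustration, not the authors' exact construction.

```python
# Illustrative sketch only, not the authors' algorithm: intervals are replaced
# by a few candidate points, and training alternates between fitting a weighted
# SVM and pushing each interval's weight onto its highest-loss candidate
# (an approximation of the upper bound of the expected risk).
import numpy as np
from sklearn.svm import SVC

def expand_intervals(lo, hi, y, n_points=3):
    """Replace each interval [lo_i, hi_i] with n_points candidates, same label."""
    grids = np.linspace(lo, hi, n_points, axis=1)      # shape (n, n_points)
    X = grids.reshape(-1, 1)
    y_exp = np.repeat(y, n_points)
    group = np.repeat(np.arange(len(y)), n_points)     # which interval a point came from
    return X, y_exp, group

def robust_interval_svm(lo, hi, y, n_points=3, n_iter=10, C=1.0):
    X, y_exp, group = expand_intervals(lo, hi, y, n_points)
    w = np.full(len(y_exp), 1.0 / n_points)            # uniform weights per interval
    clf = SVC(kernel="linear", C=C)
    for _ in range(n_iter):
        clf.fit(X, y_exp, sample_weight=w)
        margin = y_exp * clf.decision_function(X)
        loss = np.maximum(0.0, 1.0 - margin)           # hinge loss per candidate
        # inner maximization: within each interval put all weight on the
        # candidate with the largest loss
        w = np.zeros_like(w)
        for g in np.unique(group):
            idx = np.where(group == g)[0]
            w[idx[np.argmax(loss[idx])]] = 1.0
    return clf

# toy 1-D interval data: class +1 around 2, class -1 around -2
rng = np.random.default_rng(0)
centers = np.concatenate([rng.normal(2, 0.5, 20), rng.normal(-2, 0.5, 20)])
lo, hi = centers - 0.4, centers + 0.4
y = np.concatenate([np.ones(20), -np.ones(20)])
model = robust_interval_svm(lo, hi, y)
```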

Author(s):  
LEV V. UTKIN

One of the most common performance measures in the selection and management of projects is the Net Present Value (NPV). In this paper, we study the case where the initial data about the NPV parameters (cash flows and the discount rate) are represented in the form of intervals supplied by experts. A method for computing the NPV based on random set theory is proposed, and three conditions of independence of the parameters are taken into account. Moreover, the imprecise Dirichlet model is used to obtain more cautious bounds for the NPV. Numerical examples illustrate the proposed approach for computing the NPV.
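As a simple illustration of interval-valued NPV (not the paper's random-set construction), the sketch below computes a naive outer enclosure: for each candidate discount rate on a grid over the rate interval, the NPV is monotone increasing in every cash flow, so the lower and upper cash-flow endpoints bound the NPV for that rate, and the overall bounds are then taken over the grid. The example cash flows and rate bounds are invented.

```python
# Naive outer enclosure of the interval-valued NPV; ignores the independence
# conditions discussed in the paper and is only an illustration.
import numpy as np

def interval_npv(cf_lo, cf_hi, r_lo, r_hi, n_grid=200):
    t = np.arange(len(cf_lo))                   # periods 0, 1, 2, ...
    rates = np.linspace(r_lo, r_hi, n_grid)
    disc = (1.0 + rates[:, None]) ** t          # discount factors, shape (n_grid, T)
    npv_lo = (np.asarray(cf_lo) / disc).sum(axis=1).min()
    npv_hi = (np.asarray(cf_hi) / disc).sum(axis=1).max()
    return npv_lo, npv_hi

# example: interval initial outlay, three interval cash flows,
# discount rate known only to lie in [4%, 7%]
cf_lo = [-105, 30, 35, 40]
cf_hi = [-95, 40, 45, 55]
print(interval_npv(cf_lo, cf_hi, 0.04, 0.07))
```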


Author(s):  
Bapin Mondal ◽  
Md Sadikur Rahman

Interval interpolation formulae play a significant role in finding the value of an unknown function at given points under interval uncertainty. The objective of this paper is to establish Newton's divided-difference interpolation formula for interval-valued functions using the generalized Hukuhara (gH) difference of intervals. For this purpose, interval arithmetic, the Hukuhara difference and some of its properties, and the concept of an interval-valued function are discussed briefly. Using the gH-difference of intervals, the definition of Newton's divided gH-difference for interval-valued functions is introduced. Then Newton's divided gH-difference interpolation formula is derived. Finally, the proposed interpolation formula is illustrated with the help of some numerical examples.
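The following rough sketch shows how the ingredients could fit together for crisp nodes and interval values: the gH-difference A ⊖_gH B = [min(a1 − b1, a2 − b2), max(a1 − b1, a2 − b2)] drives the divided-difference table, and the resulting interval coefficients are combined by interval addition and scalar multiplication. The evaluation step is an assumption about how the formula is assembled, intended only to illustrate the mechanics.

```python
# Sketch of Newton-type divided-difference interpolation for an interval-valued
# function using the gH-difference; intervals are (lo, hi) tuples.
def gh_diff(a, b):
    d1, d2 = a[0] - b[0], a[1] - b[1]
    return (min(d1, d2), max(d1, d2))

def scal_mul(c, a):
    lo, hi = c * a[0], c * a[1]
    return (min(lo, hi), max(lo, hi))

def iv_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def divided_differences(x, f):
    """Interval divided differences f[x_0,...,x_k] built with gH-differences."""
    n = len(x)
    table = [list(f)]
    for k in range(1, n):
        prev = table[-1]
        row = []
        for i in range(n - k):
            num = gh_diff(prev[i + 1], prev[i])
            row.append(scal_mul(1.0 / (x[i + k] - x[i]), num))
        table.append(row)
    return [table[k][0] for k in range(n)]     # coefficients f[x_0..x_k]

def newton_interval_eval(x, coeffs, t):
    """Evaluate the interval Newton polynomial at a crisp point t."""
    result = coeffs[0]
    basis = 1.0
    for k in range(1, len(coeffs)):
        basis *= (t - x[k - 1])
        result = iv_add(result, scal_mul(basis, coeffs[k]))
    return result

# toy data: three nodes with interval values
x = [0.0, 1.0, 2.0]
f = [(1.0, 1.5), (2.0, 2.8), (4.5, 5.5)]
coeffs = divided_differences(x, f)
print(newton_interval_eval(x, coeffs, 1.5))
```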


Author(s):  
LEV V. UTKIN

Cautious reliability estimates of multi-state and continuum-state systems are studied under the condition that initial data about the reliability of components are given in the form of interval-valued observations, measurements or expert judgments. The interval-valued information is processed by means of the imprecise Dirichlet model, which can be regarded as a set of Dirichlet distributions. The developed reliability model provides cautious reliability measures when the number of observations or measurements is rather small. It can be viewed as an extension of models based on random set theory and of robust statistical models. A numerical example illustrates the proposed model and an algorithm for computing the system reliability.
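For intuition, the sketch below shows the basic imprecise Dirichlet model (IDM) bounds (with N observations, n_i of them in state i, and hyperparameter s, the lower and upper probabilities are n_i/(N+s) and (n_i+s)/(N+s)) and combines them for a toy series system of independent components. The series combination is a simplifying assumption of this illustration, not the paper's random-set treatment of interval-valued observations.

```python
# Minimal illustration of cautious (IDM) probability bounds and a toy
# series-system combination; not the full model of the paper.
def idm_bounds(counts, s=1.0):
    N = sum(counts)
    return [(n / (N + s), (n + s) / (N + s)) for n in counts]

def series_reliability(component_bounds):
    """Lower/upper reliability of a series system of independent components."""
    lo, hi = 1.0, 1.0
    for p_lo, p_hi in component_bounds:
        lo *= p_lo
        hi *= p_hi
    return lo, hi

# two components, each observed only a few times in the "working" vs "failed" state
comp1 = idm_bounds([9, 1], s=1.0)[0]    # bounds for P(component 1 works)
comp2 = idm_bounds([7, 1], s=1.0)[0]    # bounds for P(component 2 works)
print(series_reliability([comp1, comp2]))
```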


Author(s):  
L. V. Utkin ◽  
Yu. A. Zhuk

A robust modification of the K-means method is proposed for solving the clustering problem when the elements of the training set are interval-valued. Most existing clustering methods either replace the interval-valued data with point-valued representations, for example the centers of the intervals, or use special distance metrics between hyper-rectangles (multi-dimensional intervals) or between a point and a hyper-rectangle, for example the Hausdorff distance. In contrast to these methods, the first idea underlying the proposed algorithm is to transform the interval uncertainty into a set of distributions of example weights and to extend the training set. The new elements of the training set, which are points of the original intervals, are given imprecise weights assigned so that they neither change the initial structure of the training data nor introduce any additional unjustified information. The second idea is to use the minimax strategy to provide robustness. It is shown that the new algorithm differs from the standard K-means algorithm only by a step in which a simple linear programming problem is solved. It is also shown that in the simplest case, when all elements of the original training set have identical weights, the proposed algorithm reduces to choosing the points of the hyper-rectangles located at the largest distance from the current cluster center. The obtained results can also be considered within the framework of Dempster-Shafer theory. The proposed algorithm is most useful when the data intervals are rather large or the training set is small.
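The simplest-case behaviour described above (equal weights, worst-case corner selection) can be sketched as a small modification of the K-means loop; the midpoint-based assignment, the initialization, and the fixed iteration count below are assumptions of this illustration rather than the authors' full algorithm.

```python
# Sketch: K-means where each hyper-rectangle is represented, at every
# iteration, by its corner farthest from the current center of its cluster.
import numpy as np

def farthest_point(lo, hi, center):
    """Per coordinate, pick the interval endpoint farther from the center."""
    return np.where(np.abs(lo - center) >= np.abs(hi - center), lo, hi)

def robust_interval_kmeans(lo, hi, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    mid = (lo + hi) / 2.0
    centers = mid[rng.choice(len(mid), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each box to the nearest center (midpoints used for assignment)
        labels = np.argmin(
            np.linalg.norm(mid[:, None, :] - centers[None, :, :], axis=2), axis=1
        )
        # replace each box by its worst-case corner w.r.t. its current center
        reps = np.array(
            [farthest_point(lo[i], hi[i], centers[labels[i]]) for i in range(len(mid))]
        )
        # recompute centers from the worst-case representatives
        centers = np.array(
            [reps[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
             for j in range(k)]
        )
    return centers, labels

# toy 2-D interval data: boxes of half-width 0.3 around two well-separated blobs
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(4, 0.5, (15, 2))])
lo, hi = pts - 0.3, pts + 0.3
centers, labels = robust_interval_kmeans(lo, hi, k=2)
```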


Author(s):  
JOAQUÍN ABELLÁN ◽  
ANDRÉS R. MASEGOSA

In this paper, we present the following contributions: (i) an adaptation of a precise classifier to imprecise classification for cost-sensitive problems; (ii) a new measure for checking the performance of an imprecise classifier. The imprecise classifier is based on a method for building simple decision trees that we have modified for imprecise classification. It uses the Imprecise Dirichlet Model (IDM) to represent information, with the upper entropy as the splitting criterion. Our new measure for comparing imprecise classifiers takes errors into account, which so far has not been considered by other measures for classifiers of this type. The measure penalizes wrong predictions using a cost matrix of the errors given by an expert, and it quantifies the success of an imprecise classifier on the basis of the cardinality of the returned set of non-dominated states. To compare the performance of our imprecise classification method and the new measure, we use a second imprecise classifier, the Naive Credal Classifier (NCC), which is a variation of the classic Naive Bayes classifier using the IDM, together with a known measure for imprecise classification.
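As a hedged illustration of the ingredients named above (IDM interval probabilities, interval dominance to obtain the non-dominated states, and a cost matrix), the sketch below returns the non-dominated classes of a tree leaf and applies a toy cost-sensitive score; the exact form of the authors' measure is not reproduced here, and the score function is an invented stand-in.

```python
# Illustrative only: IDM intervals, interval dominance, and a toy cost-aware
# score over the returned set of non-dominated classes.
def idm_intervals(counts, s=1.0):
    N = sum(counts)
    return [(n / (N + s), (n + s) / (N + s)) for n in counts]

def non_dominated(intervals):
    """Class a dominates b if a's lower probability exceeds b's upper probability."""
    keep = []
    for b, (lo_b, hi_b) in enumerate(intervals):
        dominated = any(lo_a > hi_b for a, (lo_a, _) in enumerate(intervals) if a != b)
        if not dominated:
            keep.append(b)
    return keep

def toy_score(pred_set, true_class, cost_matrix):
    """Reward 1/|set| when the true class is returned, else the worst-case cost."""
    if true_class in pred_set:
        return 1.0 / len(pred_set)
    return -max(cost_matrix[true_class][c] for c in pred_set)

# leaf with counts for three classes, observed only a handful of times
intervals = idm_intervals([4, 3, 0], s=1.0)
pred = non_dominated(intervals)          # here classes 0 and 1 remain
cost = [[0, 1, 5], [1, 0, 2], [5, 2, 0]]
print(pred, toy_score(pred, true_class=0, cost_matrix=cost))
```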


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Warattaya Chinnakum ◽  
Laura Berrout Ramos ◽  
Olugbenga Iyiola ◽  
Vladik Kreinovich

Purpose
In real life, we only know the consequences of each possible action with some uncertainty. A typical example is interval uncertainty, when we only know the lower and upper bounds on the expected gain. A usual way to compare such interval-valued alternatives is to use the optimism-pessimism criterion developed by the Nobelist Leo Hurwicz. In this approach, a weighted combination of the worst-case and the best-case gains is maximized. There exist several justifications for this criterion; however, some of the assumptions behind these justifications are not entirely convincing. The purpose of this paper is to find a more convincing explanation.

Design/methodology/approach
The authors used a utility-based approach to decision-making.

Findings
The authors propose new, hopefully more convincing, justifications for Hurwicz's approach.

Originality/value
This is a new, more intuitive explanation of Hurwicz's approach to decision-making under interval uncertainty.
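For reference, Hurwicz's criterion scores an interval-valued alternative [g_lo, g_hi] as alpha * g_hi + (1 - alpha) * g_lo for an optimism coefficient alpha in [0, 1] and selects the maximizer. The toy alternatives and the value of alpha below are invented for illustration.

```python
# Worked example of the Hurwicz optimism-pessimism criterion.
def hurwicz_choice(alternatives, alpha):
    scores = {name: alpha * hi + (1 - alpha) * lo
              for name, (lo, hi) in alternatives.items()}
    return max(scores, key=scores.get), scores

alternatives = {"project A": (10, 80), "project B": (30, 50), "project C": (-5, 120)}
print(hurwicz_choice(alternatives, alpha=0.4))   # moderately pessimistic decision maker
```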


2021 ◽  
Author(s):  
Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially for real-world, unstructured and unlabeled datasets, is to leverage Snorkel, a tool specifically designed to rapidly create, manage, and model training data. Configured properly, Snorkel can temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, subject-matter expertise, or knowledge bases) scripted in Python to generate "noisy labels". Each function traverses the entire dataset, and the resulting labels are fed into a generative, conditionally probabilistic model. The role of this model is to estimate the distribution of each response variable and to predict the conditional probability based on a joint probability distribution. It does so by comparing the labeling functions and measuring the degree to which their outputs agree with each other. A labeling function whose output is highly congruent with the other labeling functions receives a high learned accuracy, that is, a large estimated fraction of correct predictions; conversely, a function with low congruence receives a low learned accuracy. The individual predictions are then combined according to these estimated accuracies, so the predictions of the more accurate functions carry more weight. The result is a transformation from a hard 0/1 label to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data are added, the generative model performs multi-class inference over the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. In this way we generate a ground truth for all further labeling efforts and improve the scalability of our models, since the labeling functions can be applied directly to new unlabeled data.

Once our datasets are labeled and a ground truth is established, we persist the data into our delta lake, since it combines the most performant aspects of a warehouse with the low-cost storage of a data lake. The lake accepts unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw-ingestion, cleaned, and feature-engineered data layers. By sectioning the data sources into these layers, the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The design of the entire ecosystem is to eliminate as much technical debt as possible in machine learning workflows, in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.
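A minimal sketch of the weak-supervision flow described above, assuming the snorkel (>= 0.9) labeling API; the dataframe, the keyword heuristics, and the label names are made up for illustration and are not part of the original abstract.

```python
# Sketch of programmatic labeling with snorkel: labeling functions produce a
# noisy label matrix, and the LabelModel combines them into fuzzy labels.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_contains_refund(x):
    # heuristic: messages mentioning "refund" are treated as negative sentiment
    return NEGATIVE if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_thanks(x):
    # heuristic: messages mentioning "thank" are treated as positive sentiment
    return POSITIVE if "thank" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Thank you for the quick reply!",
    "I want a refund immediately.",
    "Thank you, but I still want a refund.",
    "Where is my order?",
]})

# apply every labeling function to every row -> noisy label matrix L
applier = PandasLFApplier(lfs=[lf_contains_refund, lf_contains_thanks])
L_train = applier.apply(df=df_train)

# the generative label model weighs the functions by their agreement and
# outputs probabilistic ("fuzzy") labels between 0 and 1
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)
print(label_model.predict_proba(L=L_train))
```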

