Binary classification SVM-based algorithms with interval-valued training data using triangular and Epanechnikov kernels

2016 ◽  
Vol 80 ◽  
pp. 53-66 ◽  
Author(s):  
Lev V. Utkin ◽  
Anatoly I. Chekh ◽  
Yulia A. Zhuk


2017 ◽
Vol 26 (04) ◽  
pp. 1750014 ◽  
Author(s):  
Lev V. Utkin ◽  
Yulia A. Zhuk

A new robust SVM-based binary classification algorithm is proposed. It is based on the so-called uncertainty trick, whereby training data with interval uncertainty are transformed into training data with weight or probabilistic uncertainty. Every interval is replaced by a set of training points with the same class label, such that every point inside the interval has an unknown weight from a predefined set of weights. The SVM then follows a robust strategy that deals with the upper bound of the interval-valued expected risk produced by the set of weights. An extension of the algorithm based on the imprecise Dirichlet model is proposed for additional robustification. Numerical examples with synthetic and real interval-valued training data illustrate the proposed algorithm and its extension.
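For illustration, a minimal numpy sketch of the uncertainty trick in one dimension follows. It is not the authors' implementation: it takes the whole probability simplex as the weight set, so the upper bound of the expected risk is attained by putting all weight on the worst point of each interval, and the intervals, labels, and step sizes are invented.

```python
# A minimal 1-D sketch of the "uncertainty trick": each interval is replaced
# by a grid of points with the same label; with the full simplex as weight
# set, the upper-bound risk is the worst-case hinge loss over each interval,
# minimized here by subgradient descent.
import numpy as np

def robust_risk_grad(w, b, intervals, labels, m=11, lam=0.1):
    """Subgradient of the worst-case (upper-bound) regularized hinge risk."""
    gw, gb, risk = lam * w, 0.0, 0.5 * lam * w * w
    for (lo, hi), y in zip(intervals, labels):
        pts = np.linspace(lo, hi, m)                  # points replacing the interval
        losses = np.maximum(0.0, 1.0 - y * (w * pts + b))
        k = int(np.argmax(losses))                    # worst-case point in the interval
        risk += losses[k]
        if losses[k] > 0.0:                           # active-hinge subgradient
            gw -= y * pts[k]
            gb -= y
    return gw, gb, risk

intervals = [(-2.0, -0.5), (-1.5, 0.0), (0.3, 1.8), (0.7, 2.5)]
labels = [-1, -1, 1, 1]
w = b = 0.0
for t in range(500):                                  # subgradient descent on the minimax risk
    gw, gb, risk = robust_risk_grad(w, b, intervals, labels)
    lr = 0.5 / (1.0 + 0.05 * t)
    w, b = w - lr * gw, b - lr * gb
print(f"w={w:.3f}, b={b:.3f}, worst-case risk={risk:.3f}")
```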


2021 ◽  
Author(s):  
Jason Meil

The data preparation process generally consumes up to 80% of a data scientist's time, with 60% of that attributed to cleaning and labeling data [1]. Our solution is to use automated pipelines to prepare, annotate, and catalog data. The first step upon ingestion, especially for real-world, unstructured, and unlabeled datasets, is to leverage Snorkel, a tool specifically designed around a paradigm for rapidly creating, managing, and modeling training data. Configured properly, Snorkel can temper this labeling bottleneck through a process called weak supervision. Weak supervision uses programmatic labeling functions (heuristics, distant supervision, subject-matter expertise, or knowledge bases) scripted in Python to generate "noisy" labels. Each function traverses the entire dataset and feeds the labeled data into a generative (conditionally probabilistic) model. The role of this model is to output the distribution of each response variable and predict the conditional probability based on a joint probability distribution, which it does by comparing the labeling functions and the degree to which their outputs agree. A labeling function that agrees strongly with the other labeling functions will have high learned accuracy, that is, the fraction of predictions the model got right; conversely, a labeling function that agrees poorly with the others will have low learned accuracy. The predictions are then combined, weighted by estimated accuracy, so that the predictions of functions with higher learned accuracy count more. The result is a transformation from a binary label of 0 or 1 to a fuzzy label between 0 and 1: there is probability "x" that, based on heuristic "n", the response variable is "y". As data are added to this generative model, multi-class inference is made on the response variables positive, negative, or abstain, assigning probabilistic labels to potentially millions of data points. Thus, we have generated a discriminative ground truth for all further labeling efforts and improved the scalability of our models. Labeling functions can then be applied to unlabeled data to further machine learning efforts.

Once our datasets are labeled and a ground truth is established, we persist the data into our delta lake, since it combines the most performant aspects of a data warehouse with the low-cost storage of a data lake. In addition, the lake can accept unstructured, semi-structured, or structured data sources, and those sources can be further aggregated into raw-ingestion, cleaned, and feature-engineered data layers. By sectioning the data sources into these "layers", the data engineering portion is abstracted away from the data scientist, who can access model-ready data at any time. Data can be ingested via batch or stream.

The entire ecosystem is designed to eliminate as much technical debt in machine learning paradigms as possible in terms of configuration, data collection, verification, governance, extraction, analytics, process management, resource management, infrastructure, monitoring, and post-verification.
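As a concrete illustration of the workflow above, the following minimal sketch uses Snorkel's v0.9-style labeling API; the texts, keyword heuristics, and column name are hypothetical placeholders.

```python
# A minimal sketch of weak supervision with Snorkel labeling functions:
# noisy programmatic labels are combined by a generative label model into
# probabilistic (fuzzy) labels between 0 and 1.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

@labeling_function()
def lf_contains_great(x):
    # Keyword heuristic: "great" suggests a positive example.
    return POSITIVE if "great" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_contains_awful(x):
    # Keyword heuristic: "awful" suggests a negative example.
    return NEGATIVE if "awful" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": ["great product", "awful service", "it works"]})

# Apply the noisy labeling functions, then let the generative label model
# weigh them by their agreement and emit probabilistic labels.
L_train = PandasLFApplier(lfs=[lf_contains_great, lf_contains_awful]).apply(df_train)
label_model = LabelModel(cardinality=2)
label_model.fit(L_train, n_epochs=500, seed=42)
probs = label_model.predict_proba(L_train)   # each row: P(negative), P(positive)
print(probs)
```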


2021 ◽  
Vol 14 (1) ◽  
pp. 40
Author(s):  
Eftychia Koukouraki ◽  
Leonardo Vanneschi ◽  
Marco Painho

Among natural disasters, earthquakes have recorded the highest rates of human loss over the past 20 years. Their unexpected nature has severe consequences for both human lives and material infrastructure, demanding urgent action. For effective emergency relief, it is necessary to gain awareness of the level of damage in the affected areas. The use of remotely sensed imagery is popular in damage assessment applications; however, it requires a considerable amount of labeled data, which is not always easy to obtain. Taking into consideration recent developments in the fields of Machine Learning and Computer Vision, this study investigates and employs several Few-Shot Learning (FSL) strategies in order to address data insufficiency and imbalance in post-earthquake urban damage classification. While small datasets have been tested against binary classification problems, which usually divide urban structures into collapsed and non-collapsed, the potential of limited training data in multi-class classification has not been fully explored. To tackle this gap, four models were created, following different data balancing methods, namely cost-sensitive learning, oversampling, undersampling, and Prototypical Networks. After a quantitative comparison among them, the best performing model was found to be the one based on Prototypical Networks, and it was used for the creation of damage assessment maps. The contribution of this work is twofold: we show that oversampling is the most suitable data balancing method for training Deep Convolutional Neural Networks (CNNs) when compared to cost-sensitive learning and undersampling, and we demonstrate the appropriateness of Prototypical Networks in the damage classification context.
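For intuition, the classification step of a Prototypical Network reduces to a nearest-prototype rule. The numpy sketch below uses random vectors standing in for the embeddings of a trained encoder; the class and shot counts are illustrative, not those of the study.

```python
# A minimal sketch of the Prototypical Networks classification step: each
# class prototype is the mean of its support embeddings, and a query is
# assigned to the nearest prototype by squared Euclidean distance.
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_support, dim = 4, 5, 64        # e.g. 4 damage grades, 5 shots each

support = rng.normal(size=(n_classes, n_support, dim))  # [class, shot, feature]
queries = rng.normal(size=(10, dim))

prototypes = support.mean(axis=1)           # one mean embedding per class
# Squared Euclidean distance from every query to every prototype.
d2 = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
pred = d2.argmin(axis=1)                    # nearest-prototype label
print(pred)
```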


Author(s):  
Farid Jauhari ◽  
Ahmad Afif Supianto

Students' performance is one of the most important values of educational institutes for their competitiveness. In order to improve it, institutes need to predict students' performance so that they can give special treatment to students predicted to be low performers. In this paper, we apply three boosting algorithms (C5.0, AdaBoost.M1, and AdaBoost.SAMME) to build classifiers for predicting student performance. This research used the UCI student performance datasets¹. There were three evaluation scenarios. The first employed 10-fold cross-validation to compare the performance of the boosting algorithms; its results showed that AdaBoost.SAMME and AdaBoost.M1 outperform the baseline method in binary classification. The second scenario evaluated the boosting algorithms under different amounts of training data; here, AdaBoost.M1 outperformed the other boosting algorithms and the baseline method on binary classification. In the third scenario, we built models from one subject's dataset and tested them on another subject's dataset. The results of the third scenario indicate that a prediction model built on one subject can be used to predict another subject.
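A minimal scikit-learn sketch of the first scenario follows (assuming scikit-learn >= 1.2). A synthetic binary task stands in for the UCI data, and C5.0, which has no scikit-learn implementation, is omitted.

```python
# 10-fold cross-validation comparing a boosted stump ensemble against a
# single-tree baseline on a synthetic stand-in for the student data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=649, n_features=30, random_state=0)

models = {
    "baseline decision tree": DecisionTreeClassifier(random_state=0),
    "AdaBoost (SAMME)": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        algorithm="SAMME",
        n_estimators=100,
        random_state=0,
    ),
}
for name, clf in models.items():
    scores = cross_val_score(clf, X, y, cv=10)    # 10-fold cross-validation
    print(f"{name}: accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```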


2020 ◽  
Vol 20 (S14) ◽  
Author(s):  
Ming Liang ◽  
ZhiXing Zhang ◽  
JiaYing Zhang ◽  
Tong Ruan ◽  
Qi Ye ◽  
...  

Abstract
Background: Laboratory indicator test results in electronic health records have been applied in many clinical big data analyses. However, it is quite common that the same laboratory examination item (i.e., lab indicator) is presented under different Chinese names, owing to translation differences and the naming habits of various hospitals, which distorts analysis results.
Methods: A framework with a recall model and a binary classification model is proposed, which reduces the alignment scale and improves the accuracy of lab indicator normalization. To reduce the alignment scale, TF-IDF is used for candidate selection. To assure the accuracy of the output, we utilize an enhanced sequential inference model for binary classification, and active learning is applied with a selection strategy proposed for reducing annotation cost.
Results: Since our indicator standardization method mainly focuses on Chinese indicator inconsistency, we perform our experiments on data from the Shanghai Hospital Development Center (SHDC), selecting clinical data from 8 hospitals. The method achieves an F1-score of 92.08% in the final binary classification. As for active learning, the proposed strategy performs better than the random baseline and can outperform the result trained on the full data with only 43% of the training data. A case study on heart failure clinical analysis, conducted on a sub-dataset collected from SHDC, shows that the proposed method is practical and performs well in application.
Conclusion: This work demonstrates that the proposed structure can be effectively applied to lab indicator normalization, and that active learning is suitable for reducing the cost of this task. Such a method is also valuable in data cleaning, data mining, text extraction, and entity alignment.
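The TF-IDF recall step can be sketched as follows with scikit-learn, matching each raw indicator name to its top-k standard candidates by cosine similarity over character n-grams; the indicator names are hypothetical examples, not drawn from the SHDC data.

```python
# TF-IDF candidate recall over character n-grams (a common choice for
# Chinese strings): each raw indicator name keeps only its top-2 standard
# candidates, shrinking the alignment scale for the classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

standard = ["血红蛋白", "白细胞计数", "血小板计数"]    # standard indicator names
raw = ["血红蛋白浓度", "白细胞"]                      # hospital-specific variants

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
S = vec.fit_transform(standard)
R = vec.transform(raw)

sim = cosine_similarity(R, S)
top_k = sim.argsort(axis=1)[:, ::-1][:, :2]          # top-2 candidates per raw name
for name, idx in zip(raw, top_k):
    print(name, "->", [standard[i] for i in idx])
```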


Author(s):  
Daniel B. Rubin

AdaBoost is a popular and successful data mining technique for binary classification. However, there is no universally agreed upon extension of the method for problems with more than two classes. Most multiclass generalizations simply reduce the problem to a series of binary classification problems. The statistical interpretation of AdaBoost is that it operates through loss-based estimation: by using an exponential loss function as a surrogate for misclassification loss, it sequentially minimizes empirical risk through fitting a base classifier to iteratively reweighted training data. While there are several extensions using loss-based estimation with multiclass base classifiers, these use multiclass versions of the exponential loss that are not classification calibrated: unless restrictions are placed on conditional class probabilities, it becomes possible to have optimal surrogate risk but poor misclassification risk. In this work, we introduce a new AdaBoost extension called AdaBoost.SL that does not reduce the problem into binary subproblems and that uses a classification-calibrated multiclass exponential loss function. Numerical experiments show the algorithm performs well on benchmark datasets.
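For background, the loss-based view of binary AdaBoost that the extension builds on can be sketched in a few lines of numpy: each round fits a decision stump to reweighted data, and the weights are the normalized exponential losses of the current ensemble. This is the classical binary algorithm, not AdaBoost.SL itself.

```python
# Binary AdaBoost as sequential empirical risk minimization under the
# exponential loss exp(-y * f(x)): the sample weights at each round are the
# normalized exponential losses of the current ensemble scores.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

w = np.full(len(y), 1.0 / len(y))
F = np.zeros(len(y))                       # ensemble scores f(x_i)
for _ in range(20):
    # Best single-feature threshold stump under the current weights.
    best = None
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            for s in (1, -1):
                h = s * np.where(X[:, j] > t, 1, -1)
                err = w[h != y].sum()
                if best is None or err < best[0]:
                    best = (err, h)
    err, h = best
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    F += alpha * h
    w = np.exp(-y * F)                     # exponential loss as new weights
    w /= w.sum()

print("training error:", float(np.mean(np.sign(F) != y)))
```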


Machines ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 39
Author(s):  
Riku-Pekka Nikula ◽  
Mika Ruusunen ◽  
Joni Keski-Rahkonen ◽  
Lars Saarinen ◽  
Fredrik Fagerholm

Drill ships and offshore rigs use azimuth thrusters for propulsion, maneuvering and steering, attitude control and dynamic positioning activities. The versatile operating modes and the challenging marine environment create demand for flexible and practical condition monitoring solutions onboard. This study introduces a condition monitoring algorithm using acceleration and shaft speed data to detect anomalies that give information on the defects in the driveline components of the thrusters. Statistical features of vibration are predicted with linear regression models and the residuals are then monitored relative to multivariate normal distributions. The method includes an automated shaft speed selection approach that identifies the normal distributed operational areas from the training data based on the residuals. During monitoring, the squared Mahalanobis distance to the identified distributions is calculated in the defined shaft speed ranges, providing information on the thruster condition. The performance of the method was validated based on data from two operating thrusters and compared with reference classifiers. The results suggest that the method could detect changes in the condition of the thrusters during online monitoring. Moreover, it had high accuracy in the bearing condition related binary classification tests. In conclusion, the algorithm has practical properties that exhibit suitability for online application.
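The residual-monitoring idea can be illustrated with a minimal numpy sketch: regress a vibration feature on shaft speed, model the training residuals as multivariate normal, and score new samples by squared Mahalanobis distance. The data and model form are synthetic placeholders, not the authors' models.

```python
# Residual monitoring relative to a multivariate normal distribution:
# vibration features are predicted from shaft speed by linear regression,
# and new residuals are scored by squared Mahalanobis distance.
import numpy as np

rng = np.random.default_rng(1)
speed = rng.uniform(50, 120, size=(500, 1))             # shaft speed [rpm]
vib = 0.02 * speed + rng.normal(0, 0.3, size=(500, 2))  # two vibration features

# Linear regression per feature: vib ~ a * speed + b.
A = np.hstack([speed, np.ones_like(speed)])
coef, *_ = np.linalg.lstsq(A, vib, rcond=None)
resid = vib - A @ coef

mu = resid.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(resid, rowvar=False))

def mahalanobis_sq(r):
    # Squared distance of one residual vector to the training distribution.
    d = r - mu
    return d @ cov_inv @ d

# A healthy sample scores low; an anomalous residual scores high.
print(mahalanobis_sq(resid[0]), mahalanobis_sq(np.array([2.0, -2.0])))
```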


Author(s):  
Ruslan Babudzhan ◽  
Konstantyn Isaienkov ◽  
Danilo Krasiy ◽  
Oleksii Vodka ◽  
Ivan Zadorozhny ◽  
...  

The paper investigates the relationship between the vibration acceleration of bearings and their operational state. To determine these dependencies, a test bench was built and 112 experiments were carried out with different bearings: 100 bearings that developed an internal defect during operation and 12 bearings without a defect. From the obtained records, a dataset was formed and used to build classifiers; the dataset is freely available. A method for classifying new and used bearings was proposed, which consists in searching for dependencies and regularities in the signal using descriptive functions: statistical measures, entropy, fractal dimensions, and others. In addition to processing the signal itself, the frequency domain of the bearing operation signal was also used to complement the feature space. The paper considered the possibility of generalizing the classification for application to signals that were not obtained in the course of the laboratory experiments. An external dataset was found in the public domain and used to determine how accurate a classifier is when trained and tested on significantly different signals. Training and validation were carried out using the bootstrapping method to eradicate the effect of randomness, given the small amount of training data available. To estimate the quality of the classifiers, the F1-measure was used as the main metric due to the imbalance of the datasets. The following supervised machine learning methods were chosen as classifier models: logistic regression, support vector machine, random forest, and K-nearest neighbors. The results are presented in the form of density distribution plots and diagrams.
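A rough sketch of such a feature-based pipeline, assuming scipy and scikit-learn are available, is shown below: statistical descriptors are computed from raw vibration windows and a random forest is scored with the F1-measure under out-of-bag bootstrap resampling. The synthetic signals and the 100/12 class split only mimic the paper's imbalance.

```python
# Descriptive-feature classification of vibration windows with bootstrap
# evaluation: train on a resample drawn with replacement, test out-of-bag.
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def features(sig):
    # Descriptive statistics of one vibration window.
    return [sig.mean(), sig.std(), skew(sig), kurtosis(sig), np.abs(sig).max()]

# 12 "healthy" and 100 "defective" (heavier-tailed) windows, mimicking the imbalance.
healthy = [features(rng.normal(0.0, 1.0, 2048)) for _ in range(12)]
defective = [features(rng.standard_t(3, 2048) * 1.5) for _ in range(100)]
X = np.array(healthy + defective)
y = np.array([0] * 12 + [1] * 100)

scores, n = [], len(y)
for b in range(50):                               # bootstrap to tame randomness
    rs = np.random.default_rng(b)
    idx = rs.integers(0, n, n)                    # training sample, with replacement
    oob = np.setdiff1d(np.arange(n), idx)         # out-of-bag samples for testing
    if len(np.unique(y[idx])) < 2 or oob.size == 0:
        continue
    clf = RandomForestClassifier(n_estimators=100, random_state=b).fit(X[idx], y[idx])
    scores.append(f1_score(y[oob], clf.predict(X[oob])))
print(f"F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```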

