Sinkhorn Regression

Author(s):  
Lei Luo ◽  
Jian Pei ◽  
Heng Huang

This paper introduces a novel Robust Regression (RR) model, named Sinkhorn regression, which imposes Sinkhorn distances on both the loss function and the regularization. Traditional RR methods aim to find an element-wise loss function (e.g., an Lp-norm) to characterize the errors so that outlying data have a relatively small influence on the regression estimator. Because they neglect geometric information, they often lead to suboptimal results in practical applications. To address this problem, we use a cross-bin distance function, the Sinkhorn distance, to capture the geometric structure of real data. The Sinkhorn distance is invariant to translation, rotation, and scaling, so our method is more robust to variations in the data than traditional regression models. Meanwhile, we leverage the Kullback-Leibler divergence to relax the proposed model with marginal constraints into an unbalanced formulation that accommodates more types of features. In addition, we propose an efficient algorithm to solve the relaxed model and establish its complete statistical guarantees under mild conditions. Experiments on five publicly available microarray data sets and one mass spectrometry data set demonstrate the effectiveness and robustness of our method.
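
The abstract gives no code, but the Sinkhorn distance it builds on is the entropy-regularized optimal-transport cost computed by Sinkhorn's matrix-scaling iterations. Below is a minimal NumPy sketch of that distance between two histograms; the regularization strength, iteration count, and ground-cost matrix are illustrative choices, not the authors' settings.

```python
import numpy as np

def sinkhorn_distance(a, b, M, reg=0.1, n_iter=200):
    """Entropy-regularized optimal transport (Sinkhorn) between histograms a and b
    with ground-cost matrix M. Returns the transport cost <P, M>."""
    K = np.exp(-M / reg)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):               # Sinkhorn fixed-point iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]       # transport plan
    return float(np.sum(P * M))

# Example: two 1-D histograms on a common grid of 5 bins.
bins = np.arange(5, dtype=float)
M = (bins[:, None] - bins[None, :]) ** 2   # squared ground distance between bins
a = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
b = np.array([0.3, 0.3, 0.2, 0.1, 0.1])
print(sinkhorn_distance(a, b, M))
```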

Author(s):  
Jianping Ju ◽  
Hong Zheng ◽  
Xiaohang Xu ◽  
Zhongyuan Guo ◽  
Zhaohui Zheng ◽  
...  

Although convolutional neural networks have achieved success in image classification, challenges remain in agricultural product quality sorting, such as machine-vision-based jujube defect detection. The performance of jujube defect detection depends mainly on the feature extraction and the classifier used. Because of the diversity of jujube materials and the variability of the testing environment, traditional manually extracted features often fail to meet the requirements of practical applications. This paper proposes a jujube sorting model for small data sets, based on a convolutional neural network and transfer learning, to meet the practical demands of jujube defect detection. First, the original images collected from an actual jujube sorting production line were pre-processed and augmented to build a data set of five categories of jujube defects. The original CNN model was then improved by embedding the SE module and by replacing the softmax loss function with the triplet loss function and the center loss function. Finally, a model pre-trained on the ImageNet data set was trained on the jujube defects data set, so that the pre-trained parameters could fit the parameter distribution of the jujube defect images; this transfer completes the model adaptation and realizes the detection and classification of jujube defects. The classification results are visualized with heatmaps, and classification accuracy and confusion matrices are analyzed against the comparison models. The experimental results show that the SE-ResNet50-CL model improves the fine-grained classification of jujube defects and reaches a test accuracy of 94.15%. The model is stable and achieves high recognition accuracy in complex environments.
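
For readers unfamiliar with the SE module mentioned above, the following is a minimal PyTorch sketch of a standard squeeze-and-excitation block of the kind typically embedded into ResNet-50; the reduction ratio of 16 is the conventional default and not necessarily the paper's setting.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: channel descriptors are squeezed by global
    average pooling and re-weighted by a small two-layer gate."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)        # squeeze: B x C
        w = self.fc(w).view(b, c, 1, 1)    # excitation: per-channel weights
        return x * w                       # re-scale the feature maps

# usage: insert after a residual block's convolution stage
se = SEBlock(64)
out = se(torch.randn(2, 64, 56, 56))       # same shape as the input
```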


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to categorical data sets. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then sums these weights along each row to obtain the support of every row. A data object having the largest support is chosen as the first initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method, and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
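
A rough sketch of the support-based seeding idea as described: score each row by the summed frequency (support) of its attribute values, take the highest-scoring row as the first center, and pick the remaining centers greedily by distance. The Hamming distance and the nearest-center criterion used here are illustrative; the paper's exact distance measure and tie-breaking may differ.

```python
from collections import Counter
import numpy as np

def support_based_centers(X, k):
    """Pick k initial centers for categorical data using value supports."""
    X = np.asarray(X, dtype=object)
    n, m = X.shape
    supports = [Counter(X[:, j]) for j in range(m)]            # value frequencies per attribute
    row_support = np.array([sum(supports[j][X[i, j]] for j in range(m)) for i in range(n)])

    centers = [int(np.argmax(row_support))]                    # largest-support row first
    hamming = lambda a, b: sum(u != v for u, v in zip(a, b))
    while len(centers) < k:
        # farthest-first: distance of each row to its nearest already-chosen center
        d = [min(hamming(X[i], X[c]) for c in centers) for i in range(n)]
        centers.append(int(np.argmax(d)))
    return X[centers]

# toy categorical data set
X = [["red", "s"], ["red", "m"], ["blue", "m"], ["green", "l"], ["red", "s"]]
print(support_based_centers(X, k=2))
```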


2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Suleman Nasiru

The need to develop generalizations of existing statistical distributions that are more flexible in modeling real data sets is vital in parametric statistical modeling and inference. This study therefore develops a new class of distributions, the extended odd Fréchet family of distributions, for modifying existing standard distributions. Two special models, the extended odd Fréchet Nadarajah-Haghighi and the extended odd Fréchet Weibull distributions, are proposed using the developed family. The densities and hazard rate functions of the two special distributions exhibit various monotonic and nonmonotonic shapes. The maximum likelihood method is used to develop estimators for the parameters of the new class of distributions. The application of the special distributions is illustrated with a real data set. The results reveal that the special distributions developed from the new family can provide a reasonable parametric fit to the given data set compared with other existing distributions.
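
The paper fits its special distributions by maximum likelihood. As a generic illustration of that fitting step only, the sketch below fits the simpler (non-extended) odd Fréchet generator applied to a Weibull baseline by numerical maximum likelihood with SciPy; it is a simplified stand-in, and the exact density of the extended odd Fréchet Weibull distribution should be taken from the paper.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

# Baseline Weibull CDF G and PDF g, with shape k and scale lam
def G(x, k, lam): return weibull_min.cdf(x, k, scale=lam)
def g(x, k, lam): return weibull_min.pdf(x, k, scale=lam)

def neg_loglik(params, x):
    """Negative log-likelihood of the (non-extended) odd Frechet-Weibull stand-in:
    F(x) = exp(-u^theta) with u = (1 - G)/G, hence
    f(x) = theta * g * u^(theta - 1) * exp(-u^theta) / G^2."""
    theta, k, lam = params
    if theta <= 0 or k <= 0 or lam <= 0:
        return np.inf
    Gx, gx = G(x, k, lam), g(x, k, lam)
    u = (1.0 - Gx) / Gx
    logf = (np.log(theta) + np.log(gx) - 2.0 * np.log(Gx)
            + (theta - 1.0) * np.log(u) - u ** theta)
    return -np.sum(logf)

# toy data, only to exercise the numerical fit
x = weibull_min.rvs(1.5, scale=2.0, size=200, random_state=0)
res = minimize(neg_loglik, x0=[1.0, 1.0, 1.0], args=(x,), method="Nelder-Mead")
print(res.x)   # fitted (theta, k, lam)
```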


1994 ◽  
Vol 1 (2/3) ◽  
pp. 182-190 ◽  
Author(s):  
M. Eneva

Abstract. Using finite data sets and study volumes of limited size may result in significant spurious effects when estimating the scaling properties of various physical processes. These effects are examined with an example featuring the spatial distribution of induced seismic activity in Creighton Mine (northern Ontario, Canada). The events studied here occurred during a three-month period, March-May 1992, within a volume of approximate size 400 x 400 x 180 m³. Two sets of microearthquake locations are studied: Data Set 1 (14,338 events) and Data Set 2 (1654 events). Data Set 1 includes the more accurately located events and amounts to about 30 per cent of all recorded data. Data Set 2 is the portion of the first data set formed by the most accurately located and the strongest microearthquakes. The spatial distribution of events in the two data sets is examined for scaling behaviour using the method of generalized correlation integrals featuring various moments q. From these, generalized correlation dimensions are estimated using the slope method. Similar estimates are made for randomly generated point sets using the same numbers of events and the same study volumes as for the real data. Uniform and monofractal random distributions are used for these simulations. In addition, samples from the real data are randomly extracted and their dimension spectra are examined as well. The spectra for the uniform and monofractal random generations show spurious multifractality due only to the use of finite numbers of data points and the limited size of the study volume. Comparing these with the dimension spectra for Data Set 1 and Data Set 2 allows us to estimate the bias likely to be present in the estimates for the real data. The strong multifractality suggested by the spectrum for Data Set 2 appears to be largely spurious; the spatial distribution, while different from uniform, could originate from a monofractal process. The spatial distribution of microearthquakes in Data Set 1 is either monofractal as well, or only weakly multifractal. In all similar studies, comparing results from real data with those from simulated point sets may help distinguish between genuine and artificial multifractality, without necessarily resorting to large numbers of data points.
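
A compact sketch of the slope method for generalized correlation dimensions mentioned above: build the generalized correlation integrals C_q(r) from pairwise point counts and estimate D_q from the slope of log C_q(r) versus log r. The radii, point count, and the q != 1 formula below are illustrative; uniform random points in a unit cube should give estimates near 3, with exactly the kind of finite-size deviation the paper warns about.

```python
import numpy as np

def generalized_dimension(points, q, radii):
    """Estimate D_q (q != 1) from the slope of log C_q(r) vs log r,
    with C_q built from pairwise point counts (Grassberger-Procaccia style)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                       # exclude self-pairs
    cq = []
    for r in radii:
        p_i = (d < r).sum(axis=1) / (n - 1)           # local counts per point
        cq.append(np.mean(p_i ** (q - 1)) ** (1.0 / (q - 1)))
    slope, _ = np.polyfit(np.log(radii), np.log(cq), 1)
    return slope

rng = np.random.default_rng(1)
cube = rng.uniform(size=(1500, 3))                    # uniform points in a unit cube
radii = np.logspace(-1.3, -0.5, 10)
print([round(generalized_dimension(cube, q, radii), 2) for q in (2, 3, 4)])
```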


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze the growth of a latent trait measured by ordinal items. Items are nested within time points, and time points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data whose time points are freely specified over subjects. Furthermore, features useful for longitudinal analyses are included: an autoregressive error structure of degree one for the trait residuals, and estimated time scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. It is illustrated with two simulated data sets and one real data set with planned missing items within a scale.
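
To make the model structure concrete, here is a toy NumPy data-generating sketch of the setup described: ordinal items nested within time points nested within subjects, a latent trait growing with a random intercept and slope, and first-order autoregressive trait residuals. All numbers (loadings, thresholds, AR parameter) are illustrative, and this is not the authors' WinBUGS model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_time, n_items = 200, 4, 5
time_scores = np.array([0.0, 1.0, 2.0, 3.0])          # estimated in the real model

# Subject-level growth: random intercept and slope for the latent trait
icept = rng.normal(0.0, 1.0, n_subj)
slope = rng.normal(0.5, 0.3, n_subj)

# Trait residuals with an autoregressive structure of degree one
rho, sd_e = 0.6, 0.5
e = np.zeros((n_subj, n_time))
e[:, 0] = rng.normal(0.0, sd_e, n_subj)
for t in range(1, n_time):
    e[:, t] = rho * e[:, t - 1] + rng.normal(0.0, sd_e * np.sqrt(1 - rho**2), n_subj)

trait = icept[:, None] + slope[:, None] * time_scores[None, :] + e   # latent trait

# Ordinal items (graded-response style): item loadings and common thresholds
loadings = rng.uniform(0.7, 1.3, n_items)
thresholds = np.array([-1.0, 0.0, 1.0])                # 4 ordinal categories
eta = trait[:, :, None] * loadings[None, None, :] + rng.logistic(0, 1, (n_subj, n_time, n_items))
items = (eta[..., None] > thresholds).sum(axis=-1)     # categories 0..3
print(items.shape)                                     # (subjects, time points, items)
```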


Geophysics ◽  
2015 ◽  
Vol 80 (2) ◽  
pp. H13-H22 ◽  
Author(s):  
Saulo S. Martins ◽  
Jandyr M. Travassos

Most data acquisition in ground-penetrating radar is done along fixed-offset profiles, in which velocity is known only at isolated points in the survey area, at the locations of variable-offset gathers such as a common midpoint. We constructed sparse, heavily aliased, variable-offset gathers from several fixed-offset, collinear profiles. We interpolated those gathers to produce properly sampled counterparts, thus pushing the data beyond aliasing. The interpolation methodology estimates nonstationary, adaptive filter coefficients at all trace locations, including the positions of the missing traces, which are filled with zeroed traces. This is followed by an inversion problem that uses the previously estimated filter coefficients to insert the new, interpolated traces between the original ones. We extended this two-step strategy by using filter coefficients from a denser variable-offset gather to interpolate the missing traces on a few independently constructed gathers. We applied the methodology to synthetic and real data sets, the latter acquired in the interior of the Antarctic continent. The interpolated variable-offset data opened the door to prestack processing, making feasible the production of a prestack time-migrated section and a 2D velocity model for the entire profile. Although we used a data set obtained in Antarctica, there is no reason the same methodology could not be used elsewhere.
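
A drastically simplified, stationary toy of the two-step strategy described above: estimate prediction-filter coefficients from a densely sampled signal, then solve a least-squares inversion that fills the zeroed (missing) samples of a related signal so that its prediction error is minimized. Real GPR gathers require nonstationary, adaptive filters; everything below (signal model, filter order, decimation pattern) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 240, 8
t = np.arange(n)

# A densely sampled "gather" and a related one with every other trace zeroed
dense = np.sin(2 * np.pi * t / 25.0) + 0.02 * rng.standard_normal(n)
target = 0.8 * np.sin(2 * np.pi * t / 25.0 + 1.0) + 0.02 * rng.standard_normal(n)
known = np.ones(n, dtype=bool)
known[1::2] = False                                   # "missing traces"

# Step 1: estimate prediction-filter coefficients a from the dense gather:
# dense[i] ~ sum_k a[k] * dense[i - 1 - k]
rows = np.array([dense[i - 1 - np.arange(p)] for i in range(p, n)])
a, *_ = np.linalg.lstsq(rows, dense[p:], rcond=None)

# Step 2: build the prediction-error operator E (E @ x is ~0 for a signal
# consistent with the filter) and solve for the missing samples only.
E = np.zeros((n - p, n))
for i in range(p, n):
    E[i - p, i] = 1.0
    E[i - p, i - 1 - np.arange(p)] = -a
x_known = np.where(known, target, 0.0)
miss, *_ = np.linalg.lstsq(E[:, ~known], -(E @ x_known), rcond=None)

filled = x_known.copy()
filled[~known] = miss
print(np.max(np.abs(filled[~known] - target[~known])))   # interpolation error
```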


2021 ◽  
Vol 3 (1) ◽  
pp. 1-7
Author(s):  
Yadgar Sirwan Abdulrahman

Clustering is one of the essential strategies in data analysis. In classical solutions, all features are assumed to contribute equally to the clustering, but in real data sets some features are more important than others, and these essential features have a greater impact on identifying the optimal clusters. In this article, a fuzzy clustering algorithm with local automatic feature weighting is presented. The proposed algorithm has several advantages: 1) the feature weights are local, meaning that each cluster has its own weight vector; 2) the distance between samples is calculated with a non-Euclidean similarity criterion to reduce the effect of noise; 3) the feature weights are learned adaptively during the clustering process. Mathematical analyses are given to derive the cluster centers and the feature weights. Experiments on a range of data sets demonstrate the efficiency of the proposed algorithm compared with other algorithms that use global and local feature weighting.
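
The following is a sketch of one common locally weighted fuzzy c-means variant, in which each cluster carries its own feature-weight vector updated from the within-cluster dispersion. It is not the authors' algorithm (which also replaces the Euclidean distance with a non-Euclidean similarity criterion), and the exponents m and t are illustrative.

```python
import numpy as np

def local_weighted_fcm(X, k, m=2.0, t=2.0, n_iter=50, seed=0):
    """Alternately update memberships U, centers C, and per-cluster feature weights W."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.dirichlet(np.ones(k), size=n)                 # fuzzy memberships, rows sum to 1
    W = np.full((k, d), 1.0 / d)                          # per-cluster feature weights
    for _ in range(n_iter):
        Um = U ** m
        C = (Um.T @ X) / Um.sum(axis=0)[:, None]          # cluster centers
        diff2 = (X[:, None, :] - C[None, :, :]) ** 2      # (n, k, d) squared deviations
        D2 = np.einsum('kd,nkd->nk', W ** t, diff2) + 1e-12   # weighted distances
        U = D2 ** (-1.0 / (m - 1))
        U /= U.sum(axis=1, keepdims=True)                 # membership update
        S = np.einsum('nk,nkd->kd', U ** m, diff2) + 1e-12    # per-cluster feature dispersion
        W = S ** (-1.0 / (t - 1))
        W /= W.sum(axis=1, keepdims=True)                 # local weight update
    return U, C, W

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, [1.0, 0.1], (100, 2)),       # clusters separated only in feature 2
               rng.normal([0, 3], [1.0, 0.1], (100, 2))])
U, C, W = local_weighted_fcm(X, k=2)
print(np.round(W, 2))   # feature 2 should receive the larger weight in both clusters
```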


2021 ◽  
Vol 37 (1) ◽  
pp. 71-89
Author(s):  
Vu-Tuan Dang ◽  
Viet-Vu Vu ◽  
Hong-Quan Do ◽  
Thi Kieu Oanh Le

During the past few years, semi-supervised clustering has emerged as an interesting new direction in machine learning research. In a semi-supervised clustering algorithm, the clustering results can be significantly improved by using side information that is available or collected from users. Two main kinds of side information can be used in semi-supervised clustering algorithms: class labels, called seeds, and pairwise constraints. The first semi-supervised clustering algorithm was introduced in 2000, and since then many algorithms have been presented in the literature; however, it is not easy to use both types of side information in the same algorithm. To address this problem, this paper proposes a semi-supervised graph-based clustering algorithm, called MCSSGC, that uses both seeds and constraints in the clustering process. Moreover, we introduce a simple but efficient active learning method, named KMMFFQS, to collect constraints that can boost the performance of MCSSGC. To verify the effectiveness of the proposed algorithm, we conducted a series of experiments not only on real data sets from UCI, but also on a document data set used in an information extraction task for Vietnamese documents. The results show that the proposed algorithm can significantly improve the clustering process compared with some recent algorithms.
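
As a generic illustration of combining the two kinds of side information on a graph (not MCSSGC itself): build a k-NN graph, force must-link edges, cut cannot-link edges, and propagate the seed labels through the connected components. Cutting a single cannot-link edge does not forbid indirect paths, which is one of the issues a full algorithm has to handle.

```python
import numpy as np

def seeded_constrained_clustering(X, n_neighbors, seeds, must_link, cannot_link):
    """Label connected components of a constrained k-NN graph from seed labels."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    adj = np.zeros((n, n), dtype=bool)
    nn = np.argsort(d, axis=1)[:, :n_neighbors]
    for i in range(n):
        adj[i, nn[i]] = adj[nn[i], i] = True
    for i, j in must_link:                      # constraints override the graph
        adj[i, j] = adj[j, i] = True
    for i, j in cannot_link:                    # sketch only: removes the direct edge,
        adj[i, j] = adj[j, i] = False           # does not block an indirect path

    parent = list(range(n))                     # union-find -> connected components
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i, j]:
                parent[find(i)] = find(j)

    labels = -np.ones(n, dtype=int)
    for idx, lab in seeds.items():              # propagate seed labels to components
        labels[[find(v) == find(idx) for v in range(n)]] = lab
    return labels

X = np.array([[0, 0], [0.1, 0], [0.2, 0.1], [5, 5], [5.1, 5], [5, 5.2]], dtype=float)
print(seeded_constrained_clustering(X, n_neighbors=1, seeds={0: 0, 3: 1},
                                    must_link=[(1, 2)], cannot_link=[(2, 3)]))
# -> [0 0 0 1 1 1] for this toy layout
```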


2019 ◽  
Vol 9 (18) ◽  
pp. 3801 ◽  
Author(s):  
Hyuk-Yoon Kwon

In this paper, we propose a method to construct a lightweight key-value store based on native Windows features. The main idea is to provide a thin wrapper for the key-value store on top of a built-in Windows storage facility, the Windows registry. First, we define a mapping of the components of the key-value store onto the components of the Windows registry. Then, we present a hash-based multi-level registry index to distribute the key-value data in a balanced way and to access them efficiently. Third, we implement the basic operations of the key-value store (i.e., Get, Put, and Delete) by manipulating the Windows registry through the Windows native APIs. We call the proposed key-value store WR-Store. Finally, we propose an efficient ETL (Extract-Transform-Load) method to migrate data stored in WR-Store into any other environment that supports existing key-value stores. Because the performance of the Windows registry has not been studied much, we perform an empirical study to understand the characteristics of WR-Store and then tune its performance to find the best parameter setting. Through extensive experiments using synthetic and real data sets, we show that the performance of WR-Store is comparable to or even better than that of state-of-the-art systems (i.e., RocksDB, BerkeleyDB, and LevelDB). In particular, we show the scalability of WR-Store: it becomes much more efficient than the other key-value stores as the size of the data set increases. In addition, we show that the performance of WR-Store is maintained even under intensive registry workloads in which 1000 processes actively accessing the registry run concurrently.
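
To make the wrapper idea concrete, here is a toy Python sketch of Get/Put/Delete on top of the Windows registry using the standard-library winreg module. It is not WR-Store: there is no hash-based multi-level index, no ETL, and values are stored as plain strings under a single hypothetical key path.

```python
import winreg  # Windows-only standard-library module

class RegistryKV:
    """Toy key-value store backed by the Windows registry (string values only)."""
    def __init__(self, root=r"Software\KVStoreDemo"):
        self.root = root          # hypothetical subkey under HKEY_CURRENT_USER

    def put(self, key: str, value: str) -> None:
        with winreg.CreateKey(winreg.HKEY_CURRENT_USER, self.root) as h:
            winreg.SetValueEx(h, key, 0, winreg.REG_SZ, value)

    def get(self, key: str) -> str:
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, self.root) as h:
            value, _type = winreg.QueryValueEx(h, key)
            return value

    def delete(self, key: str) -> None:
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, self.root, 0,
                            winreg.KEY_SET_VALUE) as h:
            winreg.DeleteValue(h, key)

store = RegistryKV()
store.put("user:42", "alice")
print(store.get("user:42"))
store.delete("user:42")
```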


Computers ◽  
2019 ◽  
Vol 8 (3) ◽  
pp. 63
Author(s):  
Zhan Wang ◽  
Alain Lambert ◽  
Xun Zhang

Localization is regarded as one of the most fundamental problems in enabling a mobile robot with autonomous capabilities. Probabilistic techniques such as Kalman or particle filtering have long been used to solve the robotic localization and mapping problem. Despite their good performance in practical applications, they can suffer from inconsistency problems. This paper presents an Interval Constraint Satisfaction Problem (ICSP) graph-based methodology for consistent localization of a car-like robot in outdoor environments. The localization problem is cast into a two-stage framework: visual teach and repeat. During the teaching phase, an interval map is built as the robot navigates the environment with GPS support. The map is then used for real-time ego-localization as the robot repeats the path autonomously. By dynamically solving the ICSP graph via Interval Constraint Propagation (ICP) techniques, a consistent and improved localization result is obtained. Both numerical simulation results and real data set experiments are presented, showing the soundness of the proposed method in achieving consistent localization.
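
A minimal sketch of the interval flavour of this approach (not the ICSP graph solver itself): each interval range measurement to a landmark constrains the robot position to a box, and propagating the constraints amounts to intersecting those boxes, which yields a guaranteed, if coarse, enclosure. The landmark positions and ranges below are hypothetical.

```python
import math

def box_from_range(landmark, r_hi):
    """Axis-aligned bounding box of the disc of radius r_hi around a landmark
    (a finer contractor would also exploit the lower range bound)."""
    lx, ly = landmark
    return [lx - r_hi, lx + r_hi], [ly - r_hi, ly + r_hi]

def intersect(b1, b2):
    """Intersection of two axis-aligned boxes ([xlo, xhi], [ylo, yhi])."""
    (x1, y1), (x2, y2) = b1, b2
    return ([max(x1[0], x2[0]), min(x1[1], x2[1])],
            [max(y1[0], y2[0]), min(y1[1], y2[1])])

# Hypothetical landmarks and interval range measurements [r_lo, r_hi]
measurements = [((0.0, 0.0), (4.8, 5.2)),
                ((10.0, 0.0), (5.9, 6.3)),
                ((0.0, 8.0), (6.0, 6.6))]

box = ([-math.inf, math.inf], [-math.inf, math.inf])
for landmark, (r_lo, r_hi) in measurements:
    box = intersect(box, box_from_range(landmark, r_hi))
print(box)   # a guaranteed (if coarse) enclosure of the robot position
```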

