A method for independent estimation of false localisation rate for phosphoproteomics

2021
Author(s): Kerry A Ramsbottom, Ananth A Prakash, Yasset Perez-Riverol, Oscar Martin Camacho, Maria Martin, ...

Phosphoproteomics methods are commonly employed in labs to identify and quantify the sites of phosphorylation on proteins. In recent years, various software tools have been developed that incorporate scores or statistics related to whether a given phosphosite has been correctly identified, or that estimate the global false localisation rate (FLR) for all sites reported within a given data set. These scores have generally been calibrated using synthetic data sets, and their statistical reliability on real data sets is largely unknown. As a result, there is a considerable problem in the field of incorrectly localised phosphosites being reported, due to inadequate statistical control. In this work, we develop the concept of scoring and ranking modifications on a decoy amino acid, i.e. one that cannot be modified, to allow independent estimation of the global FLR. We test a variety of different amino acids to act as the decoy, on both synthetic and real data sets, demonstrating that the choice of amino acid can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys, although we prefer alanine because of the potential for confusion between the leucine and isoleucine amino acids. We propose that the phosphoproteomics field adopt the use of a decoy amino acid, so that there is better control of false reporting in the literature, and in public databases that re-distribute the data.
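The abstract does not spell out the estimator itself; the sketch below illustrates one simple way a decoy residue could be used to estimate a global FLR, assuming the decoy hits are rescaled by proteome-wide residue frequencies. The function name, frequency table and counting rules are illustrative assumptions rather than the published method.

```python
# A minimal, illustrative sketch of decoy-based global FLR estimation.
# The scaling factor and exact counting rules are assumptions; the
# published method may define them differently.

def estimate_global_flr(sites, decoy_residue="A", target_residues="STY",
                        residue_frequency=None):
    """Estimate a global false localisation rate curve from a ranked site list.

    sites: list of (residue, score) tuples, e.g. [("S", 0.99), ("A", 0.41), ...]
    residue_frequency: proteome-wide residue frequencies used to rescale
        decoy counts to the space of modifiable residues.
    """
    if residue_frequency is None:  # rough vertebrate proteome frequencies
        residue_frequency = {"A": 0.082, "S": 0.066, "T": 0.053, "Y": 0.029}

    # Sort reported sites from best to worst localisation score.
    ranked = sorted(sites, key=lambda rs: rs[1], reverse=True)

    # Rescale decoy hits by how often the decoy residue occurs relative to
    # the modifiable residues (assumed correction).
    target_freq = sum(residue_frequency[r] for r in target_residues)
    scale = target_freq / residue_frequency[decoy_residue]

    decoys = targets = 0
    flr_curve = []
    for residue, _score in ranked:
        if residue == decoy_residue:
            decoys += 1
        else:
            targets += 1
        # Estimated FLR among all sites reported at or above this rank.
        flr_curve.append(scale * decoys / max(targets, 1))
    return flr_curve
```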

2013, Vol. 748, pp. 590-594
Author(s): Li Liao, Yong Gang Lu, Xu Rong Chen

We propose a novel density estimation method that uses both the k-nearest neighbor (KNN) graph and the potential field of the data points to capture local and global data distribution information, respectively. Clustering is performed based on the computed density values: a forest of trees is built with each data point as a tree node, and the clusters are formed according to the trees in the forest. The new clustering method is evaluated against three popular clustering methods, K-means++, Mean Shift and DBSCAN. Experiments on two synthetic data sets and one real data set show that our approach can effectively improve the clustering results.
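A minimal sketch of this idea follows, assuming the local (kNN) and global (potential-field) terms are simply multiplied and that each point attaches to its nearest higher-density neighbour to build the forest; the actual combination rule and tree construction in the paper may differ.

```python
# Illustrative density-based forest clustering; the weighting of the two
# density terms and the root_dist cutoff are assumptions for illustration.
import numpy as np
from scipy.spatial import cKDTree

def cluster_by_density_trees(X, k=10, sigma=1.0, root_dist=2.0):
    n = len(X)
    # Local term: inverse of the distance to the k-th nearest neighbour.
    knn_dist, _ = cKDTree(X).query(X, k=k + 1)
    local = 1.0 / (knn_dist[:, -1] + 1e-12)
    # Global term: Gaussian potential field summed over all points.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    potential = np.exp(-sq / (2 * sigma ** 2)).sum(1)
    density = local * (potential / potential.max())

    # Forest: each point links to its nearest higher-density neighbour;
    # points with no sufficiently close higher-density neighbour become roots.
    order = np.argsort(-density)
    parent = np.full(n, -1)
    for rank, i in enumerate(order):
        higher = order[:rank]
        if len(higher):
            j = higher[np.argmin(sq[i, higher])]
            if np.sqrt(sq[i, j]) <= root_dist:
                parent[i] = j

    # Each tree in the forest is one cluster; labels flow down from roots.
    labels = np.full(n, -1)
    next_label = 0
    for i in order:
        if parent[i] == -1:
            labels[i] = next_label
            next_label += 1
        else:
            labels[i] = labels[parent[i]]
    return labels, density
```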


2005, Vol. 01 (01), pp. 173-193
Author(s): Hiroshi Mamitsuka

We consider the problem of mining from noisy unsupervised data sets. The data points we call noise are outliers in the context of data mining, and have generally been defined as those located in low-probability regions of the input space. The purpose of our approach is to detect such outliers and to perform efficient mining from noisy unsupervised data. We propose a new iterative sampling approach that uses both model-based clustering and the likelihood assigned to each example by a trained probabilistic model to find data points lying in low-probability regions of the input space. Our method uses an arbitrary probabilistic model as a component model and alternates between two steps: sampling non-outliers with high likelihoods (computed by previously obtained models) and training the model with the selected examples. In our experiments, we focused on two-mode and co-occurrence data and empirically evaluated the effectiveness of our proposed method against two other methods, using both synthetic and real data sets. From the experiments on the synthetic data sets, we found that the significance of the performance advantage of our method over the two other methods became more pronounced for higher noise ratios, for both medium- and large-sized data sets. From the experiments on a real noisy data set of protein–protein interactions, a typical co-occurrence data set, we further confirmed the performance of our method for detecting outliers from a given data set. Extended abstracts of parts of the work presented in this paper have appeared in Refs. 1 and 2.
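The sketch below illustrates the alternating sample-and-retrain loop using a Gaussian mixture as a stand-in component model; the keep fraction, model choice and number of iterations are assumptions, and the paper works with its own probabilistic models for two-mode and co-occurrence data.

```python
# Illustrative iterate-sample-and-retrain loop for outlier detection.
import numpy as np
from sklearn.mixture import GaussianMixture

def iterative_sampling(X, keep_fraction=0.9, n_components=5, n_iter=5,
                       random_state=0):
    """Alternately train a model and resample high-likelihood examples."""
    kept = np.arange(len(X))
    model = None
    for _ in range(n_iter):
        model = GaussianMixture(n_components=n_components,
                                random_state=random_state).fit(X[kept])
        # Per-example log-likelihood under the model trained so far.
        loglik = model.score_samples(X)
        # Keep the highest-likelihood examples; the rest are outlier candidates.
        cutoff = np.quantile(loglik, 1.0 - keep_fraction)
        kept = np.where(loglik >= cutoff)[0]
    outliers = np.setdiff1d(np.arange(len(X)), kept)
    return model, kept, outliers
```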


2012, Vol. 241-244, pp. 3165-3170
Author(s): Kyung Mi Lee, Keon Myung Lee

This paper introduces a new type of problem, the frequent common family subtree mining problem, for a collection of leaf-labeled trees and presents some characteristics of the problem. It proposes an algorithm to find frequent common families in trees. To demonstrate its applicability, the proposed method has been applied to several synthetic data sets and a real data set.


2017, Vol. 43 (3), pp. 567-592
Author(s): Dong Nguyen, Jacob Eisenstein

Quantifying the degree of spatial dependence for linguistic variables is a key task for analyzing dialectal variation. However, existing approaches have important drawbacks. First, they are based on parametric models of dependence, which limits their power in cases where the underlying parametric assumptions are violated. Second, they are not applicable to all types of linguistic data: Some approaches apply only to frequencies, others to boolean indicators of whether a linguistic variable is present. We present a new method for measuring geographical language variation, which solves both of these problems. Our approach builds on Reproducing Kernel Hilbert Space (RKHS) representations for nonparametric statistics, and takes the form of a test statistic that is computed from pairs of individual geotagged observations without aggregation into predefined geographical bins. We compare this test with prior work using synthetic data as well as a diverse set of real data sets: a corpus of Dutch tweets, a Dutch syntactic atlas, and a data set of letters to the editor in North American newspapers. Our proposed test is shown to support robust inferences across a broad range of scenarios and types of data.
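One concrete example of an RKHS-based dependence statistic computed directly from pairs of geotagged observations is the (biased) HSIC estimate sketched below; it is given only as an illustration of the family of statistics involved, and the kernels, bandwidths and null calibration used in the paper may differ.

```python
# HSIC-style kernel dependence statistic between a linguistic feature and
# geographic coordinates, computed without spatial binning.
import numpy as np

def rbf_gram(Z, bandwidth):
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def hsic(ling, geo, bw_ling=1.0, bw_geo=1.0):
    """Biased HSIC estimate from paired observations.

    ling: (n,) or (n, d) per-observation linguistic features
          (e.g. 0/1 indicators or frequencies).
    geo:  (n, 2) latitude/longitude per observation.
    """
    n = len(ling)
    K = rbf_gram(np.asarray(ling, dtype=float).reshape(n, -1), bw_ling)
    L = rbf_gram(np.asarray(geo, dtype=float).reshape(n, -1), bw_geo)
    H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Significance can be assessed by permuting the geographic coordinates and
# recomputing the statistic to build a null distribution.
```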


2020, Vol. 224 (3), pp. 1505-1522
Author(s): Saeed Parnow, Behrooz Oskooi, Giovanni Florio

We define a two-step procedure to obtain reliable inverse models of the distribution of electrical conductivity at depth from apparent conductivities estimated by electromagnetic instruments such as the GEONICS EM38, EM31 or EM34-3. The first step of our procedure corrects the apparent conductivities to make them consistent with a low-induction-number condition, under which these data are very similar to the true conductivity. We then use a linear inversion approach to obtain a conductivity model. To improve the conductivity estimation at depth, we introduced a depth-weighting function in our regularized weighted minimum-length solution algorithm. We test the whole procedure on two synthetic data sets generated with COMSOL Multiphysics for both the vertical magnetic dipole and horizontal magnetic dipole configurations of the loops. Our technique was also tested on a real data set, and the inversion result was compared with the one obtained using the dipole-dipole DC electrical resistivity (ER) method. Our model not only reproduces all the shallow conductive areas seen in the ER model, but also succeeds in replicating its deeper conductivity structures. In contrast, inversion of the uncorrected data provides a biased model that underestimates the true conductivity.
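The sketch below shows one common form of a depth-weighted minimum-length solution of the kind referred to above; the depth-weighting exponent, damping value and discretisation are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative depth-weighted minimum-length inversion for corrected
# low-induction-number EM data.
import numpy as np

def depth_weighted_inversion(G, d, z, beta=1.5, z0=1.0, damping=1e-2):
    """Solve d = G m with a depth-weighted minimum-length regulariser.

    G: (n_data, n_cells) sensitivity matrix of the linear forward operator.
    d: (n_data,) corrected apparent conductivities.
    z: (n_cells,) cell depths used by the depth-weighting function.
    """
    # Depth weighting penalises shallow cells more, counteracting the decay
    # of sensitivity with depth so that structure can be placed at depth.
    w = (z + z0) ** (-beta / 2.0)
    WtW_inv = np.diag(1.0 / w ** 2)
    # Weighted minimum-length (damped) solution.
    A = G @ WtW_inv @ G.T + damping * np.eye(len(d))
    return WtW_inv @ G.T @ np.linalg.solve(A, d)
```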


Author(s): N. Zhang, F. Nex, N. Kerle, G. Vosselman

Semantic segmentation models are often affected by illumination changes and fail to predict correct labels. Although there has been much research on indoor semantic segmentation, it has not been studied in low-light environments. In this paper we propose a new framework, LISU, for Low-light Indoor Scene Understanding. We first decompose the low-light images into reflectance and illumination components, and then jointly learn reflectance restoration and semantic segmentation. To train and evaluate the proposed framework, we propose a new data set, namely LLRGBD, which consists of a large synthetic low-light indoor data set (LLRGBD-synthetic) and a small real data set (LLRGBD-real). The experimental results show that the illumination-invariant features effectively improve the performance of semantic segmentation. Compared with the baseline model, the mIoU of the proposed LISU framework increases by 11.5%. In addition, pre-training on our synthetic data set increases the mIoU by 7.2%. Our data sets and models are available on our project website.
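The schematic below illustrates only the general two-stage idea of decomposing an image and segmenting the illumination-invariant reflectance, tied together by a Retinex-style reconstruction term; the layer sizes, number of classes and loss terms are placeholders and not the LISU architecture itself.

```python
# Schematic two-stage forward pass: decomposition, then segmentation of the
# reflectance. All layer sizes and the class count are placeholders.
import torch
import torch.nn as nn

class DecomposeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1))   # 3 reflectance + 1 illumination

    def forward(self, img):
        out = torch.sigmoid(self.body(img))
        return out[:, :3], out[:, 3:]          # reflectance, illumination

class SegNet(nn.Module):
    def __init__(self, n_classes=14):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, n_classes, 1))

    def forward(self, reflectance):
        return self.body(reflectance)          # per-pixel class logits

img = torch.rand(1, 3, 240, 320)               # a low-light RGB frame
R, L = DecomposeNet()(img)
logits = SegNet()(R)
# Retinex-style reconstruction term tying the decomposition to the input.
recon_loss = nn.functional.l1_loss(R * L, img)
```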


Geophysics, 2006, Vol. 71 (5), pp. U67-U76
Author(s): Robert J. Ferguson

The possibility of improving regularization/datuming of seismic data is investigated by treating wavefield extrapolation as an inversion problem. Weighted, damped least squares is then used to produce the regularized/datumed wavefield. Regularization/datuming is extremely costly because it requires computing the Hessian, so an efficient approximation is introduced, achieved by computing a limited number of diagonals in the operators involved. Real and synthetic data examples demonstrate the utility of this approach. For synthetic data, regularization/datuming is demonstrated for large extrapolation distances using a highly irregular recording array. Without approximation, regularization/datuming returns a regularized wavefield with reduced operator artifacts when compared to a nonregularizing method such as generalized phase shift plus interpolation (PSPI). Approximate regularization/datuming returns a regularized wavefield at approximately two orders of magnitude lower cost, but it is dip limited, though in a controllable way, compared to the full method. The Foothills structural data set, a freely available data set from the Rocky Mountains of Canada, demonstrates application to real data. The data have highly irregular sampling along the shot coordinate and suffer from significant near-surface effects. Approximate regularization/datuming returns common-receiver data that are superior in appearance compared to conventional datuming.
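A minimal sketch of the weighted, damped least-squares step and of a banded (limited-diagonal) Hessian approximation is given below; the extrapolation operator is a generic placeholder rather than the paper's phase-shift operator, and the band width and damping values are illustrative.

```python
# Illustrative weighted, damped least-squares regularisation/datuming with an
# optional banded approximation of the Hessian.
import numpy as np

def regularise_datum(G, d, W=None, damping=1e-3, n_diagonals=None):
    """Return the regularised/datumed wavefield m from recorded data d.

    G: (n_traces, n_model) extrapolation operator mapping the desired
       regular wavefield to the irregularly recorded traces.
    n_diagonals: if given, keep only this many central diagonals of the
       Hessian G^H W G (the cost-saving approximation).
    """
    if W is None:
        W = np.eye(len(d))
    H = G.conj().T @ W @ G                      # Hessian (the expensive part)
    if n_diagonals is not None:
        idx = np.arange(H.shape[0])
        mask = np.abs(np.subtract.outer(idx, idx)) < n_diagonals
        H = H * mask                            # banded approximation
    rhs = G.conj().T @ W @ d
    return np.linalg.solve(H + damping * np.eye(H.shape[0]), rhs)
```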


2014, Vol. 7 (3), pp. 781-797
Author(s): P. Paatero, S. Eberly, S. G. Brown, G. A. Norris

The EPA PMF (Environmental Protection Agency positive matrix factorization) version 5.0 and the underlying multilinear engine-executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.
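For orientation only, the sketch below shows the classical bootstrap (BS) idea of resampling samples and refitting the factor model, using scikit-learn's NMF as a crude stand-in; EPA PMF and ME-2 implement BS, DISP and BS-DISP internally with proper uncertainty weighting and factor matching, none of which is reproduced here.

```python
# Classical bootstrap of a factor model, with NMF standing in for PMF.
import numpy as np
from sklearn.decomposition import NMF

def bootstrap_profiles(X, n_factors=4, n_boot=50, random_state=0):
    """Resample samples (rows) with replacement and refit the factor model."""
    rng = np.random.default_rng(random_state)
    profiles = []
    for _ in range(n_boot):
        rows = rng.integers(0, len(X), size=len(X))
        model = NMF(n_components=n_factors, init="nndsvda", max_iter=500)
        model.fit(X[rows])
        profiles.append(model.components_)   # factor profiles of this replicate
    # Spread across replicates indicates uncertainty due to random errors.
    # (EPA PMF first maps each replicate's factors to the base-run factors;
    # that matching step is omitted in this sketch.)
    return np.std(np.stack(profiles), axis=0)
```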


2018, Vol. 11 (2), pp. 53-67
Author(s): Ajay Kumar, Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to get the support of every row. Further, the data object having the largest support is chosen as the first center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
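A short sketch of this seeding scheme follows; the plain Hamming mismatch used to pick the remaining centres is an assumption, as the article may use a different categorical distance.

```python
# Support-based initial centre selection for categorical data.
import numpy as np

def support_based_centers(X, k):
    """X: (n, m) array of categorical values; returns k initial centre rows."""
    n, m = X.shape
    # Support of each attribute value = its relative frequency in that column;
    # row support = sum of the supports of the row's values.
    row_support = np.zeros(n)
    for j in range(m):
        values, counts = np.unique(X[:, j], return_counts=True)
        freq = dict(zip(values, counts / n))
        row_support += np.array([freq[v] for v in X[:, j]])

    centers = [int(np.argmax(row_support))]    # row with the largest support
    while len(centers) < k:
        # Hamming distance of every row to its nearest already-chosen centre.
        dist = np.min([np.sum(X != X[c], axis=1) for c in centers], axis=0)
        centers.append(int(np.argmax(dist)))
    return X[centers]
```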


Author(s): Danlei Xu, Lan Du, Hongwei Liu, Penghui Wang

A Bayesian classifier for sparsity-promoting feature selection is developed in this paper, where a set of nonlinear mappings of the original data is applied as a pre-processing step. The linear classification model with such mappings from the original input space to a nonlinear transformation space can not only construct a nonlinear classification boundary, but also realize feature selection for the original data. A zero-mean Gaussian prior with Gamma precision and a finite approximation of a Beta process prior are used to promote sparsity in the utilization of features and nonlinear mappings, respectively. We derive the variational Bayesian (VB) inference algorithm for the proposed linear classifier. Experimental results based on a synthetic data set, a measured radar data set, a high-dimensional gene expression data set, and several benchmark data sets demonstrate the aggressive and robust feature selection capability and comparable classification accuracy of our method compared with some other existing classifiers.
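The structure-only sketch below builds nonlinear mappings (RBF similarities to training points) and fits a sparsity-promoting linear classifier; L1-penalised logistic regression stands in for the paper's Gaussian-Gamma and Beta-process priors and variational Bayesian inference, so it illustrates the shape of the pipeline rather than the method itself.

```python
# Nonlinear mappings followed by a sparse linear classifier (stand-in model).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

def fit_sparse_nonlinear_classifier(X_train, y_train, gamma=0.5, C=1.0):
    # Nonlinear mappings: RBF similarities to the training points.
    Phi = rbf_kernel(X_train, X_train, gamma=gamma)
    # L1 penalty as a crude frequentist analogue of a sparsity-promoting prior.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(Phi, y_train)
    # Zero weights correspond to mappings the sparsity term has switched off.
    selected = np.flatnonzero(clf.coef_[0])
    return clf, selected
```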

