Dissecting random and systematic differences between noisy composite data sets

2017 ◽  
Vol 73 (4) ◽  
pp. 286-293 ◽  
Author(s):  
Kay Diederichs

Composite data sets measured on different objects are usually affected by random errors, but may also be influenced by systematic (genuine) differences in the objects themselves, or the experimental conditions. If the individual measurements forming each data set are quantitative and approximately normally distributed, a correlation coefficient is often used to compare data sets. However, the relations between data sets are not obvious from the matrix of pairwise correlations since the numerical value of the correlation coefficient is lowered by both random and systematic differences between the data sets. This work presents a multidimensional scaling analysis of the pairwise correlation coefficients which places data sets into a unit sphere within low-dimensional space, at a position given by their CC* values [as defined by Karplus & Diederichs (2012), Science, 336, 1030–1033] in the radial direction and by their systematic differences in one or more angular directions. This dimensionality reduction can not only be used for classification purposes, but also to derive data-set relations on a continuous scale. Projecting the arrangement of data sets onto the subspace spanned by systematic differences (the surface of a unit sphere) allows, irrespective of the random-error levels, the identification of clusters of closely related data sets. The method gains power with increasing numbers of data sets. It is illustrated with an example from low signal-to-noise ratio image processing, and an application in macromolecular crystallography is shown, but the approach is completely general and thus should be widely applicable.
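The geometry described above can be sketched numerically. The following is a minimal, illustrative reconstruction (synthetic data, not the paper's implementation): pairwise correlations are converted into chord distances and embedded by classical multidimensional scaling, so data sets sharing a systematic signal land close together regardless of their noise levels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups of "data sets": a shared signal within each group, a different
# signal between groups, plus random noise (all values illustrative).
n = 500
sig_a, sig_b = rng.normal(size=n), rng.normal(size=n)
data = [sig_a + rng.normal(scale=1.0, size=n) for _ in range(4)]
data += [sig_b + rng.normal(scale=1.0, size=n) for _ in range(4)]

cc = np.corrcoef(data)                               # pairwise correlation matrix
dist = np.sqrt(np.clip(2.0 * (1.0 - cc), 0, None))   # chord distance on the sphere

# Classical MDS: double-center the squared distances, keep the top eigenvectors.
d2 = dist ** 2
j = np.eye(len(data)) - np.ones_like(d2) / len(data)
b = -0.5 * j @ d2 @ j
w, v = np.linalg.eigh(b)
coords = v[:, ::-1][:, :2] * np.sqrt(np.clip(w[::-1][:2], 0, None))

# Data sets sharing a signal cluster together despite the random errors.
within = np.linalg.norm(coords[0] - coords[1])
between = np.linalg.norm(coords[0] - coords[4])
print(within < between)
```

Here the within-group correlation (~0.5) is lowered by noise alone, while the between-group correlation is lowered by the systematic difference as well; the low-dimensional embedding separates the two effects.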

2014 ◽  
Vol 7 (3) ◽  
pp. 781-797 ◽  
Author(s):  
P. Paatero ◽  
S. Eberly ◽  
S. G. Brown ◽  
G. A. Norris

Abstract. The EPA PMF (Environmental Protection Agency positive matrix factorization) version 5.0 and the underlying multilinear engine-executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.
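As a minimal illustration of the classical bootstrap (BS) idea, the sketch below resamples the rows of a synthetic nonnegative data matrix, refits a factorization, and records the spread of the factor profiles. It uses scikit-learn's NMF as a stand-in for the PMF/ME-2 engine, with a simple correlation-based factor matching; all names, sizes and parameters are illustrative, not EPA PMF defaults.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)

# Synthetic two-source data set (contributions x profiles), nonnegative.
g_true = rng.uniform(size=(200, 2))          # source contributions
f_true = rng.uniform(size=(2, 10))           # source profiles
x = np.clip(g_true @ f_true + rng.normal(scale=0.01, size=(200, 10)), 0, None)

base = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0).fit(x)

profiles = []
for _ in range(20):                          # classical bootstrap (BS)
    idx = rng.integers(0, len(x), size=len(x))        # resample samples (rows)
    fit = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0).fit(x[idx])
    f = fit.components_
    # Match bootstrap factors to base factors by profile correlation.
    if (np.corrcoef(f[0], base.components_[0])[0, 1]
            < np.corrcoef(f[1], base.components_[0])[0, 1]):
        f = f[::-1]
    profiles.append(f)

spread = np.array(profiles).std(axis=0)      # per-element BS uncertainty estimate
print(spread.shape)
```

DISP-style intervals would instead displace individual factor elements and refit under a Q-value constraint; the bootstrap above captures only the random-error contribution.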


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Gabriele Tosadori ◽  
Dario Di Silvestre ◽  
Fausto Spoto ◽  
Pierluigi Mauri ◽  
Carlo Laudanna ◽  
...  

Abstract. Current trends in biomedical research indicate data integration as a fundamental step towards precision medicine. In this context, network models allow complex biological processes to be represented and analysed. However, although effective in unveiling network properties, these models fail to consider the individual biochemical variations occurring at the molecular level. As a consequence, the analysis of these models partially loses its predictive power. To overcome these limitations, Weighted Nodes Networks (WNNets) were developed. WNNets allow nodes to be weighted easily and effectively using experimental information from multiple conditions. In this study, the characteristics of WNNets were described, and a proteomics data set was modelled and analysed. Results suggested that degree, an established centrality index, may offer a novel perspective on the functional role of nodes in WNNets. Indeed, degree allowed significant differences between experimental conditions to be retrieved, highlighted relevant proteins, and provided a novel interpretation for degree itself, opening new perspectives in experimental data modelling and analysis. Overall, WNNets may be used to model any high-throughput experimental data set requiring weighted nodes. Finally, improving the power of the analysis by using centralities such as betweenness may provide further biological insights and unveil novel, interesting characteristics of WNNets.
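As a toy illustration of the node-weighting idea, the sketch below attaches per-condition abundance values to the nodes of a small graph and ranks nodes by a weighted-degree index. The graph, the abundances and the particular weighting formula are illustrative assumptions, not the WNNets definition from the paper.

```python
# Toy interaction network as adjacency lists; node weights stand in for
# measured abundances in one experimental condition (values illustrative).
edges = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
abundance = {"A": 2.0, "B": 0.5, "C": 1.5, "D": 1.0}

def weighted_degree(node):
    """Sum of neighbour weights, scaled by the node's own weight, so the
    index reflects both topology and the experimental measurement."""
    return abundance[node] * sum(abundance[n] for n in edges[node])

ranking = sorted(edges, key=weighted_degree, reverse=True)
print(ranking[0])   # the well-connected, abundant node "C" scores highest here
```

Under a plain (unweighted) degree, "C" would also rank first here, but the weighted variant can reorder nodes between conditions as abundances change, which is the behaviour the abstract attributes to degree in WNNets.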


2005 ◽  
Vol 59 (3) ◽  
pp. 267-274 ◽  
Author(s):  
Olja Stanimirovic ◽  
Hans F. M. Boelens ◽  
Arjan J. G. Mank ◽  
Huub C. J. Hoefsloot ◽  
Age K. Smilde

Raman spectroscopy is applied to characterize paintable displays. Few options other than Raman spectroscopy exist for doing so because of the liquid nature of the functional materials. The challenge is to develop a method that can be used to estimate the composition of a single display cell on the basis of the collected three-dimensional Raman spectra. A classical least squares (CLS) model is used to model the measured spectra. It is shown that spectral preprocessing is a necessary and critical step for obtaining a good CLS model and reliable compositional profiles. Different kinds of preprocessing are explained. For each data set the type and amount of preprocessing may differ. This is shown using two data sets measured on essentially the same type of display cell, but under different experimental conditions. For model validation, three criteria are introduced: the mean sum of squares of residuals, the percentage of unexplained information (PUN), and the average residual curve. It is shown that the decision about the best combination of preprocessing techniques cannot be based only on overall error indicators (such as the PUN). In addition, local residual analysis must be performed and the feasibility of the extracted profiles should be taken into account.
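The CLS step itself is compact. The sketch below fits hypothetical pure-component spectra to simulated mixture spectra by least squares and evaluates a percentage-of-unexplained-information criterion in the spirit of the PUN; the spectra, noise level and PUN formula are illustrative assumptions, not the paper's data or exact definitions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pure-component spectra S (components x wavenumbers) and
# concentration profiles C; real data would come from the display-cell scans.
s = np.abs(rng.normal(size=(3, 120)))
c_true = np.abs(rng.normal(size=(40, 3)))
d = c_true @ s + rng.normal(scale=0.01, size=(40, 120))   # measured spectra D = CS + E

# CLS estimate of the concentrations: least-squares projection onto S.
c_hat, *_ = np.linalg.lstsq(s.T, d.T, rcond=None)
c_hat = c_hat.T
resid = d - c_hat @ s

# Percentage of unexplained information (illustrative definition).
pun = 100.0 * np.linalg.norm(resid) / np.linalg.norm(d)
print(round(pun, 2))
```

An overall indicator like this can look equally good for two preprocessing choices, which is why the abstract also calls for local residual analysis (inspecting `resid` per spectrum) before accepting the extracted profiles.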


2020 ◽  
Vol 9 (11) ◽  
pp. e7509119806
Author(s):  
Leandro Soares Santos ◽  
Moysés Naves de Moraes ◽  
Julia Dos Santos Lopes ◽  
Luciana Carolina Bauer ◽  
Paulo Bonomo ◽  
...  

Thermophysical properties are important in the design, simulation, optimization, and control of food processing. Their prediction is therefore valuable, but a theoretical basis is difficult to establish, so empirical models are commonly used. In this work, neural network modeling was applied as an alternative to predict the density, thermal conductivity and thermal diffusivity of jackfruit, genipap and umbu from temperature and moisture content. Data sets from the literature were used, combined and individually, to obtain four networks. Supervised multilayer perceptron networks were developed using the back-propagation algorithm. Several configurations of artificial neural networks (ANNs) were evaluated, with one or two hidden layers and a maximum of 21 and 12 neurons in each one, respectively. Data sets were divided into learning (60%) and verification (40%) steps. The best ANNs were chosen based on the correlation coefficient and root mean square error (RMSE), and compared with polynomial models using average absolute deviations (AADs). From the total available data set, the best ANN developed has one hidden layer with 15 neurons and shows the same predictive ability as the ANNs created from the individual fruit data sets, presenting similar RMSE and correlation coefficients. The ANNs developed present AADs close to those of the polynomial models and appear as an alternative to conventional modeling. The results indicate that the ANN created from the total data set can replace nine polynomial models to predict the thermophysical properties of jackfruit, genipap and umbu pulps.
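A minimal sketch of the approach, using scikit-learn's MLPRegressor on synthetic temperature/moisture data in place of the literature data (the functional form and all values are invented for illustration). It mirrors the reported setup of one hidden layer with 15 neurons and a 60/40 learning/verification split.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic stand-in for the pulp data: temperature (degC) and moisture
# content in, density (kg/m3) out; the real networks also predicted
# thermal conductivity and diffusivity, and this relation is invented.
temp = rng.uniform(10, 80, size=300)
moisture = rng.uniform(0.6, 0.95, size=300)
density = 1200 - 300 * moisture - 0.5 * temp + rng.normal(scale=2, size=300)

x = np.column_stack([temp, moisture])
# 60% learning / 40% verification, as in the paper.
x_tr, x_te, y_tr, y_te = train_test_split(x, density, train_size=0.6, random_state=0)

# One hidden layer with 15 neurons, matching the best ANN reported.
ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(15,), solver="lbfgs",
                 max_iter=5000, random_state=0),
).fit(x_tr, y_tr)

rmse = np.sqrt(np.mean((ann.predict(x_te) - y_te) ** 2))
print(round(rmse, 1))
```

Input scaling (the StandardScaler step) matters here: with raw temperatures up to 80 and moisture fractions below 1, an unscaled network converges poorly.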


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multiobjective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, detecting dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and its efficiency outperforms that of PROCLUS.
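The first phase, detecting dense and sparse regions per dimension and discarding outliers, can be sketched with simple per-dimension histograms. The data, bin counts and density threshold below are illustrative assumptions, not the MOSCL procedure itself.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy high-dimensional set: a cluster that is dense in dimensions 0-1 but
# uniform in the remaining 8, plus a few outliers.
cluster = np.column_stack([rng.normal(0, 0.1, (80, 2)),
                           rng.uniform(-1, 1, (80, 8))])
outliers = rng.uniform(-1, 1, (5, 10))
x = np.vstack([cluster, outliers])

# Per-dimension dense-region detection: a bin is "dense" when its count is
# well above the uniform expectation (threshold is an illustrative choice).
relevant_dims, dense_mask = [], np.ones(len(x), bool)
for d in range(x.shape[1]):
    hist, edges = np.histogram(x[:, d], bins=10)
    dense_bins = hist > 3 * len(x) / 10
    if dense_bins.any():
        relevant_dims.append(d)
        bin_idx = np.clip(np.digitize(x[:, d], edges) - 1, 0, 9)
        dense_mask &= dense_bins[bin_idx]          # drop points in sparse bins

print(relevant_dims)        # dimensions where a dense region exists
print(int(dense_mask.sum()))  # points kept after outlier removal
```

Only dimensions 0 and 1 survive the relevance test here, and points falling in sparse bins of those dimensions (including the injected outliers) are removed, mirroring the accuracy benefit the abstract attributes to outlier elimination.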


2011 ◽  
Vol 44 (6) ◽  
pp. 1182-1189 ◽  
Author(s):  
Jarosław A. Kalinowski ◽  
Anna Makal ◽  
Philip Coppens

A new method for determination of the orientation matrix of Laue X-ray data is presented. The method is based on matching of the experimental patterns of central reciprocal lattice rows projected on a unit sphere centered on the origin of the reciprocal lattice with the corresponding pattern of a monochromatic data set on the same material. This technique is applied to the complete data set and thus eliminates problems often encountered when single frames with a limited number of peaks are to be used for orientation matrix determination. Application of the method to a series of Laue data sets on organometallic crystals is described. The corresponding program is available under a Mozilla Public License-like open-source license.
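The core matching step, finding the rotation that superimposes one spherical pattern of directions on another, can be sketched with the Kabsch algorithm on synthetic directions and a known test rotation; the program's actual pattern matching over candidate lattice rows is more involved.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical pattern: unit vectors of central reciprocal-lattice rows
# from the monochromatic reference data set.
ref = rng.normal(size=(30, 3))
ref /= np.linalg.norm(ref, axis=1, keepdims=True)    # project onto the unit sphere

# Simulate the Laue pattern: the same rows in an unknown orientation.
angle = 0.7
c, s = np.cos(angle), np.sin(angle)
r_true = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
laue = ref @ r_true.T

# Recover the orientation matrix by the Kabsch algorithm: SVD of the
# cross-covariance, with a determinant correction to exclude reflections.
h = laue.T @ ref
u, _, vt = np.linalg.svd(h)
d = np.sign(np.linalg.det(u @ vt))
r_est = u @ np.diag([1.0, 1.0, d]) @ vt

print(np.allclose(r_est, r_true, atol=1e-8))
```

Because the whole pattern is matched at once, the estimate does not depend on any single frame having enough indexed peaks, which is the advantage the abstract highlights.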


Author(s):  
Benny Yiu-ming Fung ◽  
Vincent To-yee Ng

When classifying tumors using gene expression data, mining tasks commonly make use of only a single data set. However, classification models based on patterns extracted from a single data set are often not indicative of an entire population and heterogeneous samples subsequently applied to these models may not fit, leading to performance degradation. In short, it is not possible to guarantee that mining results based on a single gene expression data set will be reliable or robust (Miller et al., 2002). This problem can be addressed using classification algorithms capable of handling multiple, heterogeneous gene expression data sets. Apart from improving mining performance, the use of such algorithms would make mining results less sensitive to the variations of different microarray platforms and to experimental conditions embedded in heterogeneous gene expression data sets.


Author(s):  
Toufik Al Khawli ◽  
Hamza Bendemra ◽  
Muddasar Anwar ◽  
Dewald Swart ◽  
Jorge Dias

Purpose
This paper presents a method for extracting the geometric primitives of a circle in three-dimensional space from a discrete point cloud data set obtained by a laser stripe sensor. This paper aims, first, to establish a reference frame for the robotic drilling process by detecting the position and orientation of a reference hole on structural parts in a pre-drilling step and, second, to perform quality inspection of the hole in a post-drilling step.

Design/methodology/approach
The method is divided into the following steps: a plane is initially fitted to the data by evaluating the principal component analysis using singular value decomposition; the data points or measurements are then rotated around an arbitrary axis using the Rodrigues rotation formula such that the normal direction of the estimated plane and the z-axis direction are parallel; the Delaunay triangulation is constructed on the point cloud and a confidence interval is estimated for segmenting the data set located at the circular boundary; and finally, a circular profile is fitted to the extracted set and transformed back to the original position.

Findings
The geometric estimate of the circle in three-dimensional space consists of the position of the center, the diameter and the orientation, which is represented by the normal vector of the plane in which the circle lies. The method is applied to both simulated data sets with several added noise levels and experimental data sets. The main purpose of both tests is to quantify the accuracy of the estimated diameter. The results show good accuracy (mean relative error < 1 per cent) and high robustness to noise.

Research limitations/implications
The proposed method is applied here to estimate the geometric primitives of only one circle (the reference hole). If multiple circles are needed, an additional clustering procedure is required to cluster the segmented data into multiple data sets, each representing a circle. Also, the method does not operate efficiently on sparse data sets. Dense data are required to cover the hole (at least ten scans across the hole diameter).

Practical implications
Researchers and practitioners can integrate this method with several robotic manufacturing applications where high accuracy is required. The extracted position and orientation of the hole are used to minimize the positioning and alignment errors between the mounted tool tip and the workpiece.

Originality/value
The method introduces data analytics for estimating the geometric primitives in the robotic drilling application. The main advantage of the proposed method is that it registers the top surface of the workpiece with respect to the robot base frame with high accuracy. An accurate workpiece registration is extremely necessary in the lateral direction (identifying where to drill), as well as in the vertical direction (identifying how far to drill).
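The pipeline can be sketched end to end on synthetic data: a plane fit by SVD, a Rodrigues rotation aligning the fitted normal with the z-axis, and a simple algebraic circle fit. The Delaunay-based boundary segmentation is omitted and all values are illustrative, so this is a reduced sketch of the method, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic point cloud: a circle of known radius on a tilted plane,
# with small measurement noise.
r_true, center = 5.0, np.array([1.0, -2.0, 3.0])
t = rng.uniform(0, 2 * np.pi, 200)
pts = np.column_stack([r_true * np.cos(t), r_true * np.sin(t), np.zeros_like(t)])
axis_n = np.array([0.3, -0.2, 0.93]); axis_n /= np.linalg.norm(axis_n)

def rodrigues(a, b):
    """Rotation matrix taking unit vector a onto unit vector b."""
    v, c = np.cross(a, b), a @ b
    k = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + k + k @ k / (1 + c)

rot = rodrigues(np.array([0.0, 0.0, 1.0]), axis_n)
cloud = pts @ rot.T + center + rng.normal(scale=0.01, size=(200, 3))

# Step 1: plane fit by PCA/SVD; the least-variance direction is the normal.
centroid = cloud.mean(axis=0)
_, _, vt = np.linalg.svd(cloud - centroid)
normal = vt[2]

# Step 2: rotate so the fitted normal is parallel to the z-axis.
flat = (cloud - centroid) @ rodrigues(normal, np.array([0.0, 0.0, 1.0])).T

# Step 3 (simplified): algebraic (Kasa) circle fit in the plane.
a = np.column_stack([2 * flat[:, 0], 2 * flat[:, 1], np.ones(len(flat))])
b = flat[:, 0] ** 2 + flat[:, 1] ** 2
(cx, cy, c0), *_ = np.linalg.lstsq(a, b, rcond=None)
radius = np.sqrt(c0 + cx ** 2 + cy ** 2)
print(round(radius, 2))
```

The recovered diameter (2 × radius) is the quantity whose accuracy the Findings section quantifies; the center and the fitted normal would then be transformed back to the sensor frame for registration.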


2020 ◽  
Vol 63 (12) ◽  
pp. 3991-3999
Author(s):  
Benjamin van der Woerd ◽  
Min Wu ◽  
Vijay Parsa ◽  
Philip C. Doyle ◽  
Kevin Fung

Objectives: This study aimed to evaluate the fidelity and accuracy of a smartphone microphone and recording environment on acoustic measurements of voice. Method: A prospective cohort proof-of-concept study. Two sets of prerecorded samples, (a) sustained vowels (/a/) and (b) the Rainbow Passage sentence, were played for recording via the internal iPhone microphone and the Blue Yeti USB microphone in two recording environments: a sound-treated booth and a quiet office setting. Recordings were presented using a calibrated mannequin speaker with a fixed signal intensity (69 dBA) at a fixed distance (15 in.). Each set of recordings (iPhone in the audio booth, Blue Yeti in the audio booth, iPhone in the office, and Blue Yeti in the office) was time-windowed to ensure the same signal was evaluated for each condition. Acoustic measures of voice, including fundamental frequency (fo), jitter, shimmer, harmonic-to-noise ratio (HNR), and cepstral peak prominence (CPP), were generated using a widely used analysis program (Praat Version 6.0.50). The data gathered were compared using a repeated measures analysis of variance. Two separate data sets were used. The set of vowel samples included both pathologic (n = 10) and normal (n = 10), male (n = 5) and female (n = 15) speakers. The set of sentence stimuli ranged in perceived voice quality from normal to severely disordered, with an equal number of male (n = 12) and female (n = 12) speakers evaluated. Results: The vowel analyses indicated that jitter, shimmer, HNR, and CPP were significantly different based on microphone choice, and shimmer, HNR, and CPP were significantly different based on the recording environment. Analysis of sentences revealed a statistically significant impact of recording environment and microphone type on HNR and CPP. While statistically significant, the differences across the experimental conditions for a subset of the acoustic measures (viz., jitter and CPP) fell within their respective normative ranges. Conclusions: Both microphone and recording setting resulted in significant differences across several acoustic measurements. However, a subset of the acoustic measures that were statistically significant across the recording conditions showed small overall differences that are unlikely to have clinical significance in interpretation. For these acoustic measures, the present data suggest that, although a sound-treated setting is ideal for voice sample collection, a smartphone microphone can capture acceptable recordings for acoustic signal analysis.
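The perturbation measures at the heart of the comparison have simple definitions. As an illustration on synthetic cycle data (not Praat's full algorithms, which operate on the recorded waveform), local jitter and shimmer are the mean cycle-to-cycle differences of period and peak amplitude relative to their means.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic glottal-cycle data: period lengths (s) and peak amplitudes for a
# steady ~200 Hz voice with small cycle-to-cycle perturbations (illustrative).
periods = 1 / 200 + rng.normal(scale=5e-6, size=100)
amps = 1.0 + rng.normal(scale=0.01, size=100)

f_o = 1 / periods.mean()                                         # fundamental frequency
jitter_local = np.abs(np.diff(periods)).mean() / periods.mean()  # fraction of the period
shimmer_local = np.abs(np.diff(amps)).mean() / amps.mean()       # fraction of amplitude

print(round(f_o), round(100 * jitter_local, 2), round(100 * shimmer_local, 2))
```

Because these measures are ratios of small cycle-to-cycle differences, microphone frequency response and room noise can shift them even when fo itself is captured accurately, which is consistent with the pattern of significant differences reported above.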

