Dissecting random and systematic differences between noisy composite data sets

2017 ◽  
Vol 73 (4) ◽  
pp. 286-293 ◽  
Author(s):  
Kay Diederichs

Composite data sets measured on different objects are usually affected by random errors, but may also be influenced by systematic (genuine) differences in the objects themselves, or the experimental conditions. If the individual measurements forming each data set are quantitative and approximately normally distributed, a correlation coefficient is often used to compare data sets. However, the relations between data sets are not obvious from the matrix of pairwise correlations since the numerical value of the correlation coefficient is lowered by both random and systematic differences between the data sets. This work presents a multidimensional scaling analysis of the pairwise correlation coefficients which places data sets into a unit sphere within low-dimensional space, at a position given by their CC* values [as defined by Karplus & Diederichs (2012), Science, 336, 1030–1033] in the radial direction and by their systematic differences in one or more angular directions. This dimensionality reduction can not only be used for classification purposes, but also to derive data-set relations on a continuous scale. Projecting the arrangement of data sets onto the subspace spanned by systematic differences (the surface of a unit sphere) allows, irrespective of the random-error levels, the identification of clusters of closely related data sets. The method gains power with increasing numbers of data sets. It is illustrated with an example from low signal-to-noise ratio image processing, and an application in macromolecular crystallography is shown, but the approach is completely general and thus should be widely applicable.
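The geometry described above can be sketched numerically. The following is a minimal, illustrative reconstruction (synthetic data, not the paper's implementation): pairwise correlations are converted into chord distances and embedded by classical multidimensional scaling, so data sets sharing a systematic signal land close together regardless of their noise levels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups of "data sets": a shared signal within each group, a different
# signal between groups, plus random noise (all values illustrative).
n = 500
sig_a, sig_b = rng.normal(size=n), rng.normal(size=n)
data = [sig_a + rng.normal(scale=1.0, size=n) for _ in range(4)]
data += [sig_b + rng.normal(scale=1.0, size=n) for _ in range(4)]

cc = np.corrcoef(data)                               # pairwise correlation matrix
dist = np.sqrt(np.clip(2.0 * (1.0 - cc), 0, None))   # chord distance on the sphere

# Classical MDS: double-center the squared distances, keep the top eigenvectors.
d2 = dist ** 2
j = np.eye(len(data)) - np.ones_like(d2) / len(data)
b = -0.5 * j @ d2 @ j
w, v = np.linalg.eigh(b)
coords = v[:, ::-1][:, :2] * np.sqrt(np.clip(w[::-1][:2], 0, None))

# Data sets sharing a signal cluster together despite the random errors.
within = np.linalg.norm(coords[0] - coords[1])
between = np.linalg.norm(coords[0] - coords[4])
print(within < between)
```

Here the within-group correlation (~0.5) is lowered by noise alone, while the between-group correlation is lowered by the systematic difference as well; the low-dimensional embedding separates the two effects.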

2014 ◽  
Vol 7 (3) ◽  
pp. 781-797 ◽  
Author(s):  
P. Paatero ◽  
S. Eberly ◽  
S. G. Brown ◽  
G. A. Norris

Abstract. The EPA PMF (Environmental Protection Agency positive matrix factorization) version 5.0 and the underlying multilinear engine-executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.
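As a minimal illustration of the classical bootstrap (BS) idea, the sketch below resamples the rows of a synthetic nonnegative data matrix, refits a factorization, and records the spread of the factor profiles. It uses scikit-learn's NMF as a stand-in for the PMF/ME-2 engine, with a simple correlation-based factor matching; all names, sizes and parameters are illustrative, not EPA PMF defaults.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)

# Synthetic two-source data set (contributions x profiles), nonnegative.
g_true = rng.uniform(size=(200, 2))          # source contributions
f_true = rng.uniform(size=(2, 10))           # source profiles
x = np.clip(g_true @ f_true + rng.normal(scale=0.01, size=(200, 10)), 0, None)

base = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0).fit(x)

profiles = []
for _ in range(20):                          # classical bootstrap (BS)
    idx = rng.integers(0, len(x), size=len(x))        # resample samples (rows)
    fit = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0).fit(x[idx])
    f = fit.components_
    # Match bootstrap factors to base factors by profile correlation.
    if (np.corrcoef(f[0], base.components_[0])[0, 1]
            < np.corrcoef(f[1], base.components_[0])[0, 1]):
        f = f[::-1]
    profiles.append(f)

spread = np.array(profiles).std(axis=0)      # per-element BS uncertainty estimate
print(spread.shape)
```

DISP-style intervals would instead displace individual factor elements and refit under a Q-value constraint; the bootstrap above captures only the random-error contribution.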


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Gabriele Tosadori ◽  
Dario Di Silvestre ◽  
Fausto Spoto ◽  
Pierluigi Mauri ◽  
Carlo Laudanna ◽  
...  

Abstract. Current trends in biomedical research indicate data integration as a fundamental step towards precision medicine. In this context, network models allow complex biological processes to be represented and analysed. However, although effective in unveiling network properties, these models fail to consider the individual biochemical variations occurring at the molecular level. As a consequence, the analysis of these models partially loses its predictive power. To overcome these limitations, Weighted Nodes Networks (WNNets) were developed. WNNets allow nodes to be weighted easily and effectively using experimental information from multiple conditions. In this study, the characteristics of WNNets were described, and a proteomics data set was modelled and analysed. Results suggested that degree, an established centrality index, may offer a novel perspective on the functional role of nodes in WNNets. Indeed, degree allowed significant differences between experimental conditions to be retrieved, highlighted relevant proteins, and provided a novel interpretation for degree itself, opening new perspectives in experimental data modelling and analysis. Overall, WNNets may be used to model any high-throughput experimental data set requiring weighted nodes. Finally, improving the power of the analysis by using centralities such as betweenness may provide further biological insights and unveil novel, interesting characteristics of WNNets.
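As a toy illustration of the node-weighting idea, the sketch below attaches per-condition abundance values to the nodes of a small graph and ranks nodes by a weighted-degree index. The graph, the abundances and the particular weighting formula are illustrative assumptions, not the WNNets definition from the paper.

```python
# Toy interaction network as adjacency lists; node weights stand in for
# measured abundances in one experimental condition (values illustrative).
edges = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
abundance = {"A": 2.0, "B": 0.5, "C": 1.5, "D": 1.0}

def weighted_degree(node):
    """Sum of neighbour weights, scaled by the node's own weight, so the
    index reflects both topology and the experimental measurement."""
    return abundance[node] * sum(abundance[n] for n in edges[node])

ranking = sorted(edges, key=weighted_degree, reverse=True)
print(ranking[0])   # the well-connected, abundant node "C" scores highest here
```

Under a plain (unweighted) degree, "C" would also rank first here, but the weighted variant can reorder nodes between conditions as abundances change, which is the behaviour the abstract attributes to degree in WNNets.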


2005 ◽  
Vol 59 (3) ◽  
pp. 267-274 ◽  
Author(s):  
Olja Stanimirovic ◽  
Hans F. M. Boelens ◽  
Arjan J. G. Mank ◽  
Huub C. J. Hoefsloot ◽  
Age K. Smilde

Raman spectroscopy is applied to characterize paintable displays. Few options other than Raman spectroscopy exist for doing so because of the liquid nature of the functional materials. The challenge is to develop a method that can be used to estimate the composition of a single display cell on the basis of the collected three-dimensional Raman spectra. A classical least squares (CLS) model is used to model the measured spectra. It is shown that spectral preprocessing is a necessary and critical step for obtaining a good CLS model and reliable compositional profiles. Different kinds of preprocessing are explained. For each data set the type and amount of preprocessing may differ. This is shown using two data sets measured on essentially the same type of display cell, but under different experimental conditions. For model validation, three criteria are introduced: the mean sum of squares of residuals, the percentage of unexplained information (PUN), and the average residual curve. It is shown that the decision about the best combination of preprocessing techniques cannot be based only on overall error indicators (such as the PUN). In addition, local residual analysis must be performed and the feasibility of the extracted profiles should be taken into account.
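The CLS step itself is compact. The sketch below fits hypothetical pure-component spectra to simulated mixture spectra by least squares and evaluates a percentage-of-unexplained-information criterion in the spirit of the PUN; the spectra, noise level and PUN formula are illustrative assumptions, not the paper's data or exact definitions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pure-component spectra S (components x wavenumbers) and
# concentration profiles C; real data would come from the display-cell scans.
s = np.abs(rng.normal(size=(3, 120)))
c_true = np.abs(rng.normal(size=(40, 3)))
d = c_true @ s + rng.normal(scale=0.01, size=(40, 120))   # measured spectra D = CS + E

# CLS estimate of the concentrations: least-squares projection onto S.
c_hat, *_ = np.linalg.lstsq(s.T, d.T, rcond=None)
c_hat = c_hat.T
resid = d - c_hat @ s

# Percentage of unexplained information (illustrative definition).
pun = 100.0 * np.linalg.norm(resid) / np.linalg.norm(d)
print(round(pun, 2))
```

An overall indicator like this can look equally good for two preprocessing choices, which is why the abstract also calls for local residual analysis (inspecting `resid` per spectrum) before accepting the extracted profiles.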


2020 ◽  
Vol 9 (11) ◽  
pp. e7509119806
Author(s):  
Leandro Soares Santos ◽  
Moysés Naves de Moraes ◽  
Julia Dos Santos Lopes ◽  
Luciana Carolina Bauer ◽  
Paulo Bonomo ◽  
...  

Thermophysical properties are important in the design, simulation, optimization, and control of food processing. Their prediction is therefore valuable, but a theoretical basis is difficult to establish, so empirical models are commonly used. In this work, neural network modeling was applied as an alternative to predict the density, thermal conductivity and thermal diffusivity of jackfruit, genipap and umbu from temperature and moisture content. Data sets from the literature were used, combined and individually, to obtain four networks. Supervised multilayer perceptron networks were developed using the back-propagation algorithm. Several configurations of artificial neural networks (ANNs) were evaluated, with one or two hidden layers and a maximum of 21 and 12 neurons in each one, respectively. Data sets were divided into learning (60%) and verification (40%) steps. The best ANNs were chosen based on the correlation coefficient and root mean square error (RMSE), and compared with polynomial models using average absolute deviations (AADs). From the total available data set, the best ANN developed has one hidden layer with 15 neurons and shows the same predictive ability as the ANNs created from the individual fruit data sets, presenting similar RMSE and correlation coefficients. The ANNs developed present AADs close to those of the polynomial models and appear as an alternative to conventional modeling. The results indicate that the ANN created from the total data set can replace nine polynomial models to predict the thermophysical properties of jackfruit, genipap and umbu pulps.
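A minimal sketch of the approach, using scikit-learn's MLPRegressor on synthetic temperature/moisture data in place of the literature data (the functional form and all values are invented for illustration). It mirrors the reported setup of one hidden layer with 15 neurons and a 60/40 learning/verification split.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic stand-in for the pulp data: temperature (degC) and moisture
# content in, density (kg/m3) out; the real networks also predicted
# thermal conductivity and diffusivity, and this relation is invented.
temp = rng.uniform(10, 80, size=300)
moisture = rng.uniform(0.6, 0.95, size=300)
density = 1200 - 300 * moisture - 0.5 * temp + rng.normal(scale=2, size=300)

x = np.column_stack([temp, moisture])
# 60% learning / 40% verification, as in the paper.
x_tr, x_te, y_tr, y_te = train_test_split(x, density, train_size=0.6, random_state=0)

# One hidden layer with 15 neurons, matching the best ANN reported.
ann = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(15,), solver="lbfgs",
                 max_iter=5000, random_state=0),
).fit(x_tr, y_tr)

rmse = np.sqrt(np.mean((ann.predict(x_te) - y_te) ** 2))
print(round(rmse, 1))
```

Input scaling (the StandardScaler step) matters here: with raw temperatures up to 80 and moisture fractions below 1, an unscaled network converges poorly.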


2013 ◽  
Vol 2013 ◽  
pp. 1-12 ◽  
Author(s):  
Singh Vijendra ◽  
Sahoo Laxman

Clustering high-dimensional data has been a major challenge due to the inherent sparsity of the points. Most existing clustering algorithms become substantially inefficient if the required similarity measure is computed between data points in the full-dimensional space. In this paper, we present a robust multiobjective subspace clustering (MOSCL) algorithm for the challenging problem of high-dimensional clustering. The first phase of MOSCL performs subspace relevance analysis by detecting dense and sparse regions and their locations in the data set. After detecting dense regions, it eliminates outliers. MOSCL then discovers subspaces in the dense regions of the data set and produces subspace clusters. In thorough experiments on synthetic and real-world data sets, we demonstrate that MOSCL is superior to the PROCLUS clustering algorithm for subspace clustering. Additionally, we investigate the effect of the first phase, detecting dense regions, on the results of subspace clustering. Our results indicate that removing outliers improves the accuracy of subspace clustering. The clustering results are validated by the clustering error (CE) distance on various data sets. MOSCL can discover clusters in all subspaces with high quality, and its efficiency outperforms that of PROCLUS.
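The first phase, detecting dense and sparse regions per dimension and discarding outliers, can be sketched with simple per-dimension histograms. The data, bin counts and density threshold below are illustrative assumptions, not the MOSCL procedure itself.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy high-dimensional set: a cluster that is dense in dimensions 0-1 but
# uniform in the remaining 8, plus a few outliers.
cluster = np.column_stack([rng.normal(0, 0.1, (80, 2)),
                           rng.uniform(-1, 1, (80, 8))])
outliers = rng.uniform(-1, 1, (5, 10))
x = np.vstack([cluster, outliers])

# Per-dimension dense-region detection: a bin is "dense" when its count is
# well above the uniform expectation (threshold is an illustrative choice).
relevant_dims, dense_mask = [], np.ones(len(x), bool)
for d in range(x.shape[1]):
    hist, edges = np.histogram(x[:, d], bins=10)
    dense_bins = hist > 3 * len(x) / 10
    if dense_bins.any():
        relevant_dims.append(d)
        bin_idx = np.clip(np.digitize(x[:, d], edges) - 1, 0, 9)
        dense_mask &= dense_bins[bin_idx]          # drop points in sparse bins

print(relevant_dims)        # dimensions where a dense region exists
print(int(dense_mask.sum()))  # points kept after outlier removal
```

Only dimensions 0 and 1 survive the relevance test here, and points falling in sparse bins of those dimensions (including the injected outliers) are removed, mirroring the accuracy benefit the abstract attributes to outlier elimination.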


2011 ◽  
Vol 44 (6) ◽  
pp. 1182-1189 ◽  
Author(s):  
Jarosław A. Kalinowski ◽  
Anna Makal ◽  
Philip Coppens

A new method for determination of the orientation matrix of Laue X-ray data is presented. The method is based on matching of the experimental patterns of central reciprocal lattice rows projected on a unit sphere centered on the origin of the reciprocal lattice with the corresponding pattern of a monochromatic data set on the same material. This technique is applied to the complete data set and thus eliminates problems often encountered when single frames with a limited number of peaks are to be used for orientation matrix determination. Application of the method to a series of Laue data sets on organometallic crystals is described. The corresponding program is available under a Mozilla Public License-like open-source license.
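The core matching step, finding the rotation that superimposes one spherical pattern of directions on another, can be sketched with the Kabsch algorithm on synthetic directions and a known test rotation; the program's actual pattern matching over candidate lattice rows is more involved.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical pattern: unit vectors of central reciprocal-lattice rows
# from the monochromatic reference data set.
ref = rng.normal(size=(30, 3))
ref /= np.linalg.norm(ref, axis=1, keepdims=True)    # project onto the unit sphere

# Simulate the Laue pattern: the same rows in an unknown orientation.
angle = 0.7
c, s = np.cos(angle), np.sin(angle)
r_true = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
laue = ref @ r_true.T

# Recover the orientation matrix by the Kabsch algorithm: SVD of the
# cross-covariance, with a determinant correction to exclude reflections.
h = laue.T @ ref
u, _, vt = np.linalg.svd(h)
d = np.sign(np.linalg.det(u @ vt))
r_est = u @ np.diag([1.0, 1.0, d]) @ vt

print(np.allclose(r_est, r_true, atol=1e-8))
```

Because the whole pattern is matched at once, the estimate does not depend on any single frame having enough indexed peaks, which is the advantage the abstract highlights.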


Author(s):  
Benny Yiu-ming Fung ◽  
Vincent To-yee Ng

When classifying tumors using gene expression data, mining tasks commonly make use of only a single data set. However, classification models based on patterns extracted from a single data set are often not indicative of an entire population and heterogeneous samples subsequently applied to these models may not fit, leading to performance degradation. In short, it is not possible to guarantee that mining results based on a single gene expression data set will be reliable or robust (Miller et al., 2002). This problem can be addressed using classification algorithms capable of handling multiple, heterogeneous gene expression data sets. Apart from improving mining performance, the use of such algorithms would make mining results less sensitive to the variations of different microarray platforms and to experimental conditions embedded in heterogeneous gene expression data sets.


Author(s):  
Toufik Al Khawli ◽  
Hamza Bendemra ◽  
Muddasar Anwar ◽  
Dewald Swart ◽  
Jorge Dias

Purpose
This paper presents a method for extracting the geometric primitives of a circle in three-dimensional space from a discrete point cloud data set obtained by a laser stripe sensor. This paper aims, first, to establish a reference frame for the robotic drilling process by detecting the position and orientation of a reference hole on structural parts in a pre-drilling step and, second, to perform quality inspection of the hole in a post-drilling step.

Design/methodology/approach
The method is divided into the following steps: a plane is initially fitted to the data by evaluating the principal component analysis using singular value decomposition; the data points or measurements are then rotated around an arbitrary axis using the Rodrigues rotation formula such that the normal direction of the estimated plane and the z-axis direction are parallel; the Delaunay triangulation is constructed on the point cloud and a confidence interval is estimated for segmenting the data set located at the circular boundary; and finally, a circular profile is fitted to the extracted set and transformed back to the original position.

Findings
The geometric estimate of the circle in three-dimensional space consists of the position of the center, the diameter and the orientation, which is represented by the normal vector of the plane in which the circle lies. The method is applied to both simulated data sets with several added noise levels and experimental data sets. The main purpose of both tests is to quantify the accuracy of the estimated diameter. The results show good accuracy (mean relative error < 1 per cent) and high robustness to noise.

Research limitations/implications
The proposed method is applied here to estimate the geometric primitives of only one circle (the reference hole). If multiple circles are needed, an additional clustering procedure is required to cluster the segmented data into multiple data sets, each representing a circle. Also, the method does not operate efficiently on sparse data sets. Dense data are required to cover the hole (at least ten scans across the hole diameter).

Practical implications
Researchers and practitioners can integrate this method with several robotic manufacturing applications where high accuracy is required. The extracted position and orientation of the hole are used to minimize the positioning and alignment errors between the mounted tool tip and the workpiece.

Originality/value
The method introduces data analytics for estimating the geometric primitives in the robotic drilling application. The main advantage of the proposed method is that it registers the top surface of the workpiece with respect to the robot base frame with high accuracy. An accurate workpiece registration is extremely necessary in the lateral direction (identifying where to drill), as well as in the vertical direction (identifying how far to drill).
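The pipeline can be sketched end to end on synthetic data: a plane fit by SVD, a Rodrigues rotation aligning the fitted normal with the z-axis, and a simple algebraic circle fit. The Delaunay-based boundary segmentation is omitted and all values are illustrative, so this is a reduced sketch of the method, not its implementation.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic point cloud: a circle of known radius on a tilted plane,
# with small measurement noise.
r_true, center = 5.0, np.array([1.0, -2.0, 3.0])
t = rng.uniform(0, 2 * np.pi, 200)
pts = np.column_stack([r_true * np.cos(t), r_true * np.sin(t), np.zeros_like(t)])
axis_n = np.array([0.3, -0.2, 0.93]); axis_n /= np.linalg.norm(axis_n)

def rodrigues(a, b):
    """Rotation matrix taking unit vector a onto unit vector b."""
    v, c = np.cross(a, b), a @ b
    k = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + k + k @ k / (1 + c)

rot = rodrigues(np.array([0.0, 0.0, 1.0]), axis_n)
cloud = pts @ rot.T + center + rng.normal(scale=0.01, size=(200, 3))

# Step 1: plane fit by PCA/SVD; the least-variance direction is the normal.
centroid = cloud.mean(axis=0)
_, _, vt = np.linalg.svd(cloud - centroid)
normal = vt[2]

# Step 2: rotate so the fitted normal is parallel to the z-axis.
flat = (cloud - centroid) @ rodrigues(normal, np.array([0.0, 0.0, 1.0])).T

# Step 3 (simplified): algebraic (Kasa) circle fit in the plane.
a = np.column_stack([2 * flat[:, 0], 2 * flat[:, 1], np.ones(len(flat))])
b = flat[:, 0] ** 2 + flat[:, 1] ** 2
(cx, cy, c0), *_ = np.linalg.lstsq(a, b, rcond=None)
radius = np.sqrt(c0 + cx ** 2 + cy ** 2)
print(round(radius, 2))
```

The recovered diameter (2 × radius) is the quantity whose accuracy the Findings section quantifies; the center and the fitted normal would then be transformed back to the sensor frame for registration.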


2020 ◽  
Vol 63 (12) ◽  
pp. 3991-3999
Author(s):  
Benjamin van der Woerd ◽  
Min Wu ◽  
Vijay Parsa ◽  
Philip C. Doyle ◽  
Kevin Fung

Objectives: This study aimed to evaluate the fidelity and accuracy of a smartphone microphone and recording environment on acoustic measurements of voice. Method: A prospective cohort proof-of-concept study. Two sets of prerecorded samples, (a) sustained vowels (/a/) and (b) the Rainbow Passage sentence, were played for recording via the internal iPhone microphone and the Blue Yeti USB microphone in two recording environments: a sound-treated booth and a quiet office setting. Recordings were presented using a calibrated mannequin speaker with a fixed signal intensity (69 dBA) at a fixed distance (15 in.). Each set of recordings (iPhone in the audio booth, Blue Yeti in the audio booth, iPhone in the office, and Blue Yeti in the office) was time-windowed to ensure the same signal was evaluated for each condition. Acoustic measures of voice, including fundamental frequency (fo), jitter, shimmer, harmonic-to-noise ratio (HNR), and cepstral peak prominence (CPP), were generated using a widely used analysis program (Praat Version 6.0.50). The data gathered were compared using a repeated measures analysis of variance. Two separate data sets were used. The set of vowel samples included both pathologic (n = 10) and normal (n = 10), male (n = 5) and female (n = 15) speakers. The set of sentence stimuli ranged in perceived voice quality from normal to severely disordered, with an equal number of male (n = 12) and female (n = 12) speakers evaluated. Results: The vowel analyses indicated that jitter, shimmer, HNR, and CPP were significantly different based on microphone choice, and shimmer, HNR, and CPP were significantly different based on the recording environment. Analysis of sentences revealed a statistically significant impact of recording environment and microphone type on HNR and CPP. While statistically significant, the differences across the experimental conditions for a subset of the acoustic measures (viz., jitter and CPP) fell within their respective normative ranges. Conclusions: Both microphone and recording setting resulted in significant differences across several acoustic measurements. However, a subset of the acoustic measures that were statistically significant across the recording conditions showed small overall differences that are unlikely to have clinical significance in interpretation. For these acoustic measures, the present data suggest that, although a sound-treated setting is ideal for voice sample collection, a smartphone microphone can capture acceptable recordings for acoustic signal analysis.
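The perturbation measures at the heart of the comparison have simple definitions. As an illustration on synthetic cycle data (not Praat's full algorithms, which operate on the recorded waveform), local jitter and shimmer are the mean cycle-to-cycle differences of period and peak amplitude relative to their means.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic glottal-cycle data: period lengths (s) and peak amplitudes for a
# steady ~200 Hz voice with small cycle-to-cycle perturbations (illustrative).
periods = 1 / 200 + rng.normal(scale=5e-6, size=100)
amps = 1.0 + rng.normal(scale=0.01, size=100)

f_o = 1 / periods.mean()                                         # fundamental frequency
jitter_local = np.abs(np.diff(periods)).mean() / periods.mean()  # fraction of the period
shimmer_local = np.abs(np.diff(amps)).mean() / amps.mean()       # fraction of amplitude

print(round(f_o), round(100 * jitter_local, 2), round(100 * shimmer_local, 2))
```

Because these measures are ratios of small cycle-to-cycle differences, microphone frequency response and room noise can shift them even when fo itself is captured accurately, which is consistent with the pattern of significant differences reported above.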

