Distance Based Measure of Data Quality

2014 ◽  
Vol 11 (2) ◽  
Author(s):  
Pavol Král’ ◽  
Lukáš Sobíšek ◽  
Mária Stachová

Data quality is a very important factor for the validity of information extracted from data sets using statistical or data mining procedures. In this paper we propose a description of data quality that allows us to characterize the quality of the whole data set, as well as of particular variables and individual cases. On the basis of this description, we define a distance based measure of data quality for individual cases as the distance of each case from an ideal one. Such a measure can be used as additional information when preparing a training data set, fitting models, or making decisions based on the results of analyses. It can be utilized in different ways, ranging from a simple weighting function to belief functions.
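
A minimal sketch of the idea, assuming each case carries per-variable quality scores in [0, 1]; the scores, the Euclidean distance, and the weighting rule below are illustrative assumptions, not the authors' exact formulation. The distance of each case from the ideal case (perfect quality on every variable) is computed and turned into a simple case weight.

```python
import numpy as np

# Per-case, per-variable quality scores in [0, 1] (1 = perfect quality).
# In practice these would come from completeness / plausibility checks.
quality = np.array([
    [1.0, 0.9, 1.0],   # case 1
    [0.6, 1.0, 0.8],   # case 2
    [0.2, 0.5, 0.4],   # case 3
])

ideal = np.ones(quality.shape[1])        # the "ideal" case: perfect quality everywhere

# Distance-based quality measure: Euclidean distance from the ideal case.
dist = np.linalg.norm(quality - ideal, axis=1)

# Turn distances into simple case weights in (0, 1]; smaller distance -> larger weight.
max_dist = np.sqrt(quality.shape[1])     # largest possible distance
weights = 1.0 - dist / max_dist

print(dist)     # per-case distance from the ideal case
print(weights)  # could be passed as sample weights to a fitting routine
```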

2019 ◽  
Author(s):  
Jacob Schreiber ◽  
Jeffrey Bilmes ◽  
William Stafford Noble

Abstract Motivation: Recent efforts to describe the human epigenome have yielded thousands of uniformly processed epigenomic and transcriptomic data sets. These data sets characterize a rich variety of biological activity in hundreds of human cell lines and tissues (“biosamples”). Understanding these data sets, and specifically how they differ across biosamples, can help explain many cellular mechanisms, particularly those driving development and disease. However, due primarily to cost, the total number of assays that can be performed is limited. Previously described imputation approaches, such as Avocado, have sought to overcome this limitation by predicting genome-wide epigenomics experiments using learned associations among available epigenomic data sets. However, these previous imputations have focused primarily on measurements of histone modification and chromatin accessibility, despite other biological activity being crucially important. Results: We applied Avocado to a data set of 3,814 tracks of data derived from the ENCODE compendium, spanning 400 human biosamples and 84 assays. The resulting imputations cover measurements of chromatin accessibility, histone modification, transcription, and protein binding. We demonstrate the quality of these imputations by comprehensively evaluating the model’s predictions and by showing significant improvements in protein binding performance compared to the top models in an ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model, achieving high accuracy at predicting protein binding, even with only a single track of training data. Availability: Tutorials and source code are available under an Apache 2.0 license at https://github.com/jmschrei/[email protected] or [email protected]
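
Avocado itself is a deep tensor factorization over genomic positions, assays, and biosamples; the toy sketch below only illustrates the general idea of imputing missing experiments from learned low-rank associations, using a made-up biosample-by-assay matrix. All dimensions, names, and the plain matrix factorization are assumptions, not the Avocado model or its API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy signal matrix: rows = biosamples, cols = assays; NaN = experiment not performed.
signal = rng.random((400, 84))
signal[rng.random(signal.shape) < 0.7] = np.nan   # most (biosample, assay) pairs are missing

mask = ~np.isnan(signal)
target = np.where(mask, signal, 0.0)
k = 10                                            # latent dimension (assumption)
B = 0.1 * rng.standard_normal((signal.shape[0], k))   # biosample embeddings
A = 0.1 * rng.standard_normal((signal.shape[1], k))   # assay embeddings

lr = 0.02
for _ in range(300):
    err = np.where(mask, B @ A.T - target, 0.0)   # error on observed entries only
    B -= lr * err @ A                             # gradient step for biosample factors
    A -= lr * err.T @ B                           # gradient step for assay factors

imputed = B @ A.T   # predictions for the missing (biosample, assay) combinations
obs_mae = np.abs(np.where(mask, imputed - target, 0.0)).sum() / mask.sum()
print(f"mean absolute error on observed entries: {obs_mae:.3f}")
```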


2021 ◽  
Author(s):  
Rishabh Deo Pandey ◽  
Itu Snigdh

Abstract Data quality became significant with the emergence of data warehouse systems. While accuracy is an intrinsic dimension of data quality, the validity of data presents a wider perspective, which is more representational and contextual in nature. In this article we present a different perspective on data collection and collation. We focus on faults experienced in data sets and present validity as a function of allied parameters such as completeness, usability, availability and timeliness for determining data quality. We also analyze the applicability of these metrics and modify them to conform to IoT applications. Another major focus of this article is to verify these metrics on an aggregated data set instead of separate data values. This work focuses on using the different validation parameters for determining the quality of data generated in a pervasive environment. The analysis approach presented is simple and can be employed to test the validity of collected data, isolate faults in the data set and measure the suitability of data before applying algorithms for analysis.
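
A minimal sketch of combining such allied parameters into a single validity score for an aggregated IoT data set; the thresholds, expected sampling interval, and weights below are placeholders, not values from the article.

```python
import numpy as np
import pandas as pd

# Toy aggregated sensor readings; NaN marks missing values, timestamps mark arrival times.
readings = pd.DataFrame({
    "value": [21.5, np.nan, 22.1, 900.0, 21.9],          # 900.0 is an implausible spike
    "ts":    pd.to_datetime(["2021-01-01 00:00", "2021-01-01 00:05",
                             "2021-01-01 00:10", "2021-01-01 00:40",
                             "2021-01-01 00:45"]),
})

completeness = readings["value"].notna().mean()             # share of non-missing values
usability    = readings["value"].between(-40, 60).mean()    # share inside a plausible range
availability = 1.0                                          # assume the source was reachable
expected_gap = pd.Timedelta(minutes=5)
timeliness   = (readings["ts"].diff().dropna() <= expected_gap).mean()

# Validity as a weighted combination of the allied parameters (weights are placeholders).
weights = {"completeness": 0.3, "usability": 0.3, "availability": 0.2, "timeliness": 0.2}
validity = (weights["completeness"] * completeness + weights["usability"] * usability
            + weights["availability"] * availability + weights["timeliness"] * timeliness)
print(round(validity, 3))
```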


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Xiaolan Chen ◽  
Hui Yang ◽  
Guifen Liu ◽  
Yong Zhang

Abstract Background Nucleosome organization is involved in many regulatory activities in various organisms. However, studies integrating nucleosome organization in mammalian genomes are very limited, mainly due to the lack of comprehensive data quality control (QC) assessment and the uneven quality of public data sets. Results The NUCOME is a database focused on filtering qualified nucleosome organization reference landscapes covering various cell types in human and mouse based on QC metrics. The filtering strategy guarantees the quality of nucleosome organization reference landscapes and exempts users from redundant data set selection and processing. The NUCOME database provides a standardized, qualified data source and informative nucleosome organization features at a whole-genome scale and at the level of individual loci. Conclusions The NUCOME provides valuable data resources for integrative analyses focused on nucleosome organization. The NUCOME is freely available at http://compbio-zhanglab.org/NUCOME.


Geophysics ◽  
2013 ◽  
Vol 78 (1) ◽  
pp. E41-E46 ◽  
Author(s):  
Laurens Beran ◽  
Barry Zelt ◽  
Leonard Pasion ◽  
Stephen Billings ◽  
Kevin Kingdon ◽  
...  

We have developed practical strategies for discriminating between buried unexploded ordnance (UXO) and metallic clutter. These methods are applicable to time-domain electromagnetic data acquired with multistatic, multicomponent sensors designed for UXO classification. Each detected target is characterized by dipole polarizabilities estimated via inversion of the observed sensor data. The polarizabilities are intrinsic target features and so are used to distinguish between UXO and clutter. We tested this processing with four data sets from recent field demonstrations, with each data set characterized by metrics of data and model quality. We then developed techniques for building a representative training data set and determined how the variable quality of estimated features affects overall classification performance. Finally, we devised a technique to optimize classification performance by adapting features during target prioritization.
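
A hedged sketch of the final classification step only: a generic statistical classifier is trained on per-target features and targets are ranked by predicted UXO probability. The synthetic features stand in for the estimated dipole polarizabilities; the authors' inversion, feature set, and classification strategy differ in detail.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)

# Toy feature matrix: each detected target is described by features derived from
# its estimated dipole polarizabilities (e.g., amplitudes and decay rates).
n_targets = 200
features = rng.random((n_targets, 6))
labels = (features[:, 0] + 0.1 * rng.standard_normal(n_targets) > 0.5).astype(int)  # 1 = UXO, 0 = clutter

# Generic classifier on the intrinsic target features.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, features, labels, cv=5, scoring="roc_auc")
print(scores.mean())

# In practice, targets would then be ranked by predicted UXO probability
# to build a prioritized dig list.
```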


2003 ◽  
Vol 21 (1) ◽  
pp. 123-135 ◽  
Author(s):  
S. Vignudelli ◽  
P. Cipollini ◽  
F. Reseghetti ◽  
G. Fusco ◽  
G. P. Gasparini ◽  
...  

Abstract. From September 1999 to December 2000, eXpendable Bathy-Thermograph (XBT) profiles were collected along the Genova-Palermo shipping route in the framework of the Mediterranean Forecasting System Pilot Project (MFSPP). The route is virtually coincident with track 0044 of the TOPEX/Poseidon satellite altimeter, crossing the Ligurian and Tyrrhenian basins in an approximate N–S direction. This allows a direct comparison between XBT and altimetry, whose findings are presented in this paper. XBT sections reveal the presence of the major features of the regional circulation, namely the eastern boundary of the Ligurian gyre, the Bonifacio gyre and the Modified Atlantic Water inflow along the Sicily coast. Twenty-two comparisons of steric heights derived from the XBT data set with concurrent realizations of single-pass altimetric heights are made. The overall correlation is around 0.55 with an RMS difference of less than 3 cm. In the Tyrrhenian Sea the spectra are remarkably similar in shape, but in general the altimetric heights contain more energy. This difference is explained in terms of oceanographic signals, which are captured with a different intensity by the satellite altimeter and XBTs, as well as computational errors. On scales larger than 100 km, the data sets are also significantly coherent, with increasing coherence values at longer wavelengths. The XBTs were dropped every 18–20 km along the track: as a consequence, the spacing was unable to adequately resolve the internal radius of deformation (< 20 km). Furthermore, few XBT drops were carried out in the Ligurian Sea, due to the limited north-south extent of this basin, so the comparison is problematic there. On the contrary, the major features observed in the XBT data in the Tyrrhenian Sea are also detected by TOPEX/Poseidon. The manuscript is completed by a discussion on how to integrate the two data sets, in order to extract additional information. In particular, the results emphasize their complementarity in providing a dynamically complete description of the observed structures. Key words: Oceanography: general (descriptive and regional oceanography); Oceanography: physical (sea level variations; instruments and techniques)
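
A minimal sketch of the kind of comparison reported above (correlation and RMS difference between collocated height profiles); the synthetic arrays below only stand in for the XBT-derived steric heights and the concurrent altimetric heights, and none of the numbers reproduce the paper's results.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy collocated along-track profiles (cm): steric height from XBT temperature
# profiles and sea-surface height anomaly from a concurrent altimeter pass.
steric_height = rng.normal(0.0, 5.0, 50)
altimetric_height = steric_height + rng.normal(0.0, 3.0, 50)   # altimetry carries extra signal/noise

corr = np.corrcoef(steric_height, altimetric_height)[0, 1]
rms_diff = np.sqrt(np.mean((steric_height - altimetric_height) ** 2))

print(f"correlation = {corr:.2f}, RMS difference = {rms_diff:.1f} cm")
```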


2021 ◽  
Author(s):  
Louise Bloch ◽  
Christoph M. Friedrich

Abstract Background: The prediction of whether Mild Cognitive Impaired (MCI) subjects will prospectively develop Alzheimer's Disease (AD) is important for the recruitment and monitoring of subjects for therapy studies. Machine Learning (ML) is suitable to improve early AD prediction. The etiology of AD is heterogeneous, which leads to noisy data sets. Additional noise is introduced by multicentric study designs and varying acquisition protocols. This article examines whether an automatic and fair data valuation method based on Shapley values can identify subjects with noisy data. Methods: An ML workflow was developed and trained for a subset of the Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. The validation was executed for an independent ADNI test data set and for the Australian Imaging, Biomarker and Lifestyle Flagship Study of Ageing (AIBL) cohort. The workflow included volumetric Magnetic Resonance Imaging (MRI) feature extraction, subject sample selection using data Shapley, Random Forest (RF) and eXtreme Gradient Boosting (XGBoost) for model training, and Kernel SHapley Additive exPlanations (SHAP) values for model interpretation. This model interpretation enables clinically relevant explanation of individual predictions. Results: The XGBoost models that excluded 116 of the 467 subjects from the training data set, based on their Logistic Regression (LR) data Shapley values, outperformed the models trained on the entire training data set, which reached a mean classification accuracy of 58.54 %, by 14.13 % (8.27 percentage points) on the independent ADNI test data set. The XGBoost models trained on the entire training data set reached a mean accuracy of 60.35 % for the AIBL data set. An improvement of 24.86 % (15.00 percentage points) could be reached for the XGBoost models if those 72 subjects with the smallest RF data Shapley values were excluded from the training data set. Conclusion: The data Shapley method was able to improve the classification accuracies for the test data sets. Noisy data was associated with the number of ApoEϵ4 alleles and volumetric MRI measurements. Kernel SHAP showed that the black-box models learned biologically plausible associations.
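
A hedged sketch of the subject-selection idea only: a Monte Carlo approximation of data Shapley values with a logistic-regression valuation model, followed by exclusion of the lowest-valued training points. Synthetic data stands in for the ADNI features, the permutation count and exclusion quantile are arbitrary, and the authors' full workflow (MRI feature extraction, RF/XGBoost training, Kernel SHAP interpretation) is not reproduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the tabular volumetric MRI features (not ADNI data).
X, y = make_classification(n_samples=150, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n = len(X_train)
shapley = np.zeros(n)
n_permutations = 20            # Monte Carlo approximation of data Shapley values

for _ in range(n_permutations):
    order = rng.permutation(n)
    prev_score = 0.5           # chance-level baseline for a binary task
    for i, idx in enumerate(order, start=1):
        subset = order[:i]
        if len(np.unique(y_train[subset])) < 2:
            continue           # need both classes to fit the valuation model
        model = LogisticRegression(max_iter=1000).fit(X_train[subset], y_train[subset])
        score = model.score(X_val, y_val)
        shapley[idx] += (score - prev_score) / n_permutations   # marginal contribution of this subject
        prev_score = score

# Exclude the subjects with the smallest (most harmful) data Shapley values
# before training the final classifier, analogous to the paper's subject selection.
keep = shapley >= np.quantile(shapley, 0.25)
print(f"kept {keep.sum()} of {n} training subjects")
```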


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are commonly used in the data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization of mixed data (continuous/binary). The learning of weights and prototypes is done simultaneously, ensuring an optimized data clustering. The higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process of the different variables, computing weights which influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
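
A minimal sketch of only the variable-weighted assignment step that such a map relies on: each observation is assigned to the prototype minimizing a variable-weighted squared distance. The data, map size, and fixed weights below are assumptions; the paper learns the weights and prototypes jointly and preserves map topology, which this sketch does not show.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy mixed data: two continuous variables and two binary variables per observation.
X = np.column_stack([
    rng.normal(size=(100, 2)),             # continuous part
    rng.integers(0, 2, size=(100, 2)),     # binary part
]).astype(float)

n_units, n_vars = 9, X.shape[1]            # a tiny 3x3 map, flattened to 9 units
prototypes = rng.normal(size=(n_units, n_vars))
var_weights = np.ones(n_vars) / n_vars     # per-variable weights (learned jointly in the paper)

# Weighted assignment: each observation goes to the unit minimizing
# the variable-weighted squared distance.
d = ((X[:, None, :] - prototypes[None, :, :]) ** 2 * var_weights).sum(axis=2)
bmu = d.argmin(axis=1)                     # best-matching unit per observation
print(np.bincount(bmu, minlength=n_units)) # cluster sizes
```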


2021 ◽  
Author(s):  
Anca Hanea ◽  
David Peter Wilkinson ◽  
Marissa McBride ◽  
Aidan Lyon ◽  
Don van Ravenzwaaij ◽  
...  

Experts are often asked to represent their uncertainty as a subjective probability. Structured protocols offer a transparent and systematic way to elicit and combine probability judgements from multiple experts. As part of this process, experts are asked to individually estimate a probability (e.g., of a future event) which needs to be combined/aggregated into a final group prediction. The experts' judgements can be aggregated behaviourally (by striving for consensus), or mathematically (by using a mathematical rule to combine individual estimates). Mathematical rules (e.g., weighted linear combinations of judgements) provide an objective approach to aggregation. However, the choice of a rule is not straightforward, and the aggregated group probability judgement's quality depends on it. The quality of an aggregation can be defined in terms of accuracy, calibration and informativeness. These measures can be used to compare different aggregation approaches and help decide on which aggregation produces the "best" final prediction. In the ideal case, individual experts' performance (as probability assessors) is scored, these scores are translated into performance-based weights, and a performance-based weighted aggregation is used. When this is not possible though, several other aggregation methods, informed by measurable proxies for good performance, can be formulated and compared. We use several data sets to investigate the relative performance of multiple aggregation methods informed by previous experience and the available literature. Even though the accuracy, calibration, and informativeness of the majority of methods are very similar, two of the aggregation methods distinguish themselves as the best and worst.
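
A minimal sketch of the mathematical aggregation rule mentioned above, a weighted linear combination of individual probability judgements; the expert probabilities and weights are placeholders, and deriving performance-based weights from calibration scores is not shown.

```python
import numpy as np

# Individual experts' probability judgements for the same future event.
expert_probs = np.array([0.60, 0.75, 0.40, 0.55])

# Performance-based weights (e.g., derived from scores on past questions);
# placeholder values that must sum to 1.
weights = np.array([0.4, 0.3, 0.1, 0.2])

# Equal-weight and performance-weighted linear pools.
equal_weight_pool = expert_probs.mean()
weighted_pool = np.dot(weights, expert_probs)

print(f"equal weights: {equal_weight_pool:.3f}, performance weights: {weighted_pool:.3f}")
```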


2021 ◽  
Vol 87 (6) ◽  
pp. 445-455
Author(s):  
Yi Ma ◽  
Zezhong Zheng ◽  
Yutang Ma ◽  
Mingcang Zhu ◽  
Ran Huang ◽  
...  

Many manifold learning algorithms conduct an eigenvector analysis on a data-similarity matrix with a size of N×N, where N is the number of data points. Thus, the memory complexity of the analysis is no less than O(N²). We present in this article an incremental manifold learning approach to handle large hyperspectral data sets for land use identification. In our method, the number of dimensions for the high-dimensional hyperspectral-image data set is obtained with the training data set. A local curvature variation algorithm is utilized to sample a subset of data points as landmarks. Then a manifold skeleton is identified based on the landmarks. Our method is validated on three AVIRIS hyperspectral data sets, outperforming the comparison algorithms with a k-nearest-neighbor classifier and achieving the second best performance with support vector machine.
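
A hedged sketch of the landmark idea only: learn an embedding on a sampled subset of points so the N×N similarity matrix is never built for the full data set, then place the remaining points via their nearest landmark and classify with k-nearest neighbors. The swiss-roll data stands in for hyperspectral pixels, landmarks are sampled at random rather than by the paper's local-curvature criterion, and Isomap is a generic substitute for the authors' method.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Toy stand-in for hyperspectral pixels (not AVIRIS data).
X, t = make_swiss_roll(n_samples=2000, random_state=0)
y = (t > t.mean()).astype(int)                      # two artificial land-use classes

# Sample a subset of points as landmarks (random here; local curvature in the paper).
rng = np.random.default_rng(0)
landmarks = rng.choice(len(X), size=400, replace=False)

# Learn the manifold skeleton on the landmarks only, then place the remaining
# points at the embedding of their nearest landmark.
iso = Isomap(n_neighbors=12, n_components=2).fit(X[landmarks])
nn = NearestNeighbors(n_neighbors=1).fit(X[landmarks])
nearest = nn.kneighbors(X, return_distance=False)[:, 0]
embedding = iso.embedding_[nearest]

# Classify in the low-dimensional space with a k-nearest-neighbor classifier.
clf = KNeighborsClassifier(n_neighbors=5).fit(embedding[landmarks], y[landmarks])
mask = np.ones(len(X), dtype=bool)
mask[landmarks] = False
print(clf.score(embedding[mask], y[mask]))          # accuracy on the non-landmark points
```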


2021 ◽  
pp. 1-17
Author(s):  
Luis Sa-Couto ◽  
Andreas Wichert

Abstract Convolutional neural networks (CNNs) evolved from Fukushima's neocognitron model, which is based on the ideas of Hubel and Wiesel about the early stages of the visual cortex. Unlike other branches of neocognitron-based models, the typical CNN is based on end-to-end supervised learning by backpropagation and removes the focus from built-in invariance mechanisms, using pooling not as a way to tolerate small shifts but as a regularization tool that decreases model complexity. These properties of end-to-end supervision and flexibility of structure allow the typical CNN to become highly tuned to the training data, leading to extremely high accuracies on typical visual pattern recognition data sets. However, in this work, we hypothesize that there is a flip side to this capability, a hidden overfitting. More concretely, a supervised, backpropagation-based CNN will outperform a neocognitron/map transformation cascade (MTCCXC) when trained and tested inside the same data set. Yet if we take both trained models and test them on the same task but on another data set (without retraining), the overfitting appears. Other neocognitron descendants like the What-Where model go in a different direction. In these models, learning remains unsupervised, but more structure is added to capture invariance to typical changes. Knowing this, we further hypothesize that if we repeat the same experiments with this model, the lack of supervision may make it worse than the typical CNN inside the same data set, but the added structure will make it generalize even better to another one. To put our hypotheses to the test, we choose the simple task of handwritten digit classification and take two well-known data sets for it: MNIST and ETL-1. To make the two data sets as similar as possible, we experiment with several types of preprocessing. However, regardless of the type in question, the results align exactly with expectation.
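
A minimal sketch of the evaluation protocol described above: train on one digit data set and evaluate on another without retraining. ETL-1 is not bundled with common Python libraries, so the scikit-learn digits data set stands in as the "other" data set here; the tiny CNN and all settings are illustrative assumptions, not the authors' models or preprocessing.

```python
import tensorflow as tf
from sklearn.datasets import load_digits

# Data set A: MNIST (28x28 grayscale, values 0-255).
(x_a, y_a), (x_a_test, y_a_test) = tf.keras.datasets.mnist.load_data()
x_a = x_a[..., None] / 255.0
x_a_test = x_a_test[..., None] / 255.0

# Data set B: the scikit-learn digits data (8x8, values 0-16), upscaled to 28x28.
digits = load_digits()
x_b = tf.image.resize(digits.images[..., None] / 16.0, (28, 28)).numpy()
y_b = digits.target

# A small CNN trained end to end by backpropagation on data set A only.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_a, y_a, epochs=2, batch_size=128, verbose=0)

# Within-data-set accuracy vs. cross-data-set accuracy without any retraining;
# a large gap is the "hidden overfitting" the abstract refers to.
_, acc_within = model.evaluate(x_a_test, y_a_test, verbose=0)
_, acc_cross = model.evaluate(x_b, y_b, verbose=0)
print(f"within: {acc_within:.3f}, cross: {acc_cross:.3f}")
```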

