Literature Consistency of Bioinformatics Sequence Databases is Effective for Assessing Record Quality

2017 ◽  
Author(s):  
Mohamed Reda Bouadjenek ◽  
Karin Verspoor ◽  
Justin Zobel

Abstract Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness, and inconsistencies with the published literature. In this work, we investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of records that are inconsistent with the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature and then applying query quality predictors. Our analysis shows that the proposed quality indicators and the quality of the records are mutually related. We propose to represent record–literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization using Principal Component Analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that 1 record in 4 is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records.
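To make the representation concrete, the following is a minimal sketch, with synthetic placeholder data, of projecting records encoded as 24-dimensional quality-indicator vectors onto two principal components for visualization; the indicator values and inconsistency labels are assumptions, not the authors' data.

```python
# Sketch: PCA visualization of records represented as quality-indicator vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_records, n_indicators = 500, 24                  # 24 indicators, as in the paper
X = rng.normal(size=(n_records, n_indicators))     # placeholder indicator values
known_inconsistent = rng.random(n_records) < 0.05  # placeholder labels

# Standardize, then reduce to two principal components for plotting.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Records flagged as inconsistent would be inspected for clustering in PC space.
print(X_2d[known_inconsistent][:5])
```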

2017 ◽  
Author(s):  
Mohamed Reda Bouadjenek ◽  
Karin Verspoor ◽  
Justin Zobel

Abstract We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatically detecting data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). We then use these quality indicators to train an anomaly detection algorithm to classify records as “confident” or “suspicious”. Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records with our algorithms is an effective mechanism for assisting curators in data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we expect that there are many more in GenBank that have not yet been identified. By automated comparison with the literature, faulty records can be identified with a precision of up to 10% and a recall of up to 30%, strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited explicitly labelled data available. Overall, the results show promise for a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators to identify inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
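As an illustration of the general workflow, the sketch below trains an off-the-shelf anomaly detector on quality-indicator vectors and evaluates precision and recall against a small set of known-faulty records; the detector choice (IsolationForest), the feature values, and the labels are assumptions rather than the authors' method.

```python
# Sketch: anomaly detection over quality-indicator vectors on imbalanced data.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 24))            # placeholder indicator vectors
y_faulty = rng.random(10_000) < 0.0025       # ~0.25% known-faulty, as reported

detector = IsolationForest(contamination=0.01, random_state=1).fit(X)
suspicious = detector.predict(X) == -1       # -1 marks records scored as anomalous

print("precision:", precision_score(y_faulty, suspicious, zero_division=0))
print("recall:   ", recall_score(y_faulty, suspicious, zero_division=0))
```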


Author(s):  
Gnande Romaric Die ◽  
Kouamé Olivier Chatigre ◽  
Ibrahim Fofana ◽  
N’guessan Verdier Abouo ◽  
Godi Henri Marius Biego

Maize (Zea mays) is a staple food in the traditional diet of rural populations in Côte d'Ivoire and a source of many minerals. However, inefficient and sometimes harmful storage methods hamper its large-scale production in Côte d'Ivoire. In this context, a triple bagging system, with or without biopesticides of plant origin (Lippia multiflora and Hyptis suaveolens leaves), was evaluated in this study for its efficacy in preserving the mineral quality of grains over an 18-month period, following a 3-factor central composite design (CCD). The first CCD factor consisted of 6 observation periods: 0, 1, 4.5, 9.5, 14.5 and 18 months. The second factor, the type of treatment, included 1 control lot in a polypropylene bag (TB0SP) and 9 experimental lots, comprising 1 lot in triple bagging without biopesticides (TB0P) and 8 lots containing variable proportions and/or combinations of biopesticides (TB1 to TB8). The third factor was the combination of the two biopesticides, with % Lippia multiflora as the reference. The results indicate that the shelf life, ratio and combination of biopesticides significantly (P < 0.05) influence the mineral quality of grain maize. Principal component analysis revealed that the addition of at least 1.01% biopesticides (leaves of Lippia multiflora and Hyptis suaveolens) in triple bagging systems improves preservation efficiency and preserves the mineral quality of the grain over a period of 15 months, as opposed to triple bagging without biopesticides, where the mineral elements are preserved only during the first 10 months of storage. This preservation of mineral quality is more pronounced in storage systems with combinations of biopesticides (in proportions greater than or equal to 3.99%) or with 2.5% of an individual biopesticide.
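A minimal sketch of how the three CCD factors described above could be laid out as an observation grid; the treatment codes and observation periods follow the abstract, while the % Lippia multiflora levels are hypothetical placeholders.

```python
# Sketch: enumerate the observation grid implied by the three design factors.
import itertools
import pandas as pd

periods_months = [0, 1, 4.5, 9.5, 14.5, 18]                 # factor 1: observation periods
treatments = ["TB0SP", "TB0P"] + [f"TB{i}" for i in range(1, 9)]  # factor 2: lots
pct_lippia = [0, 25, 50, 75, 100]                            # factor 3: placeholder % levels

design = pd.DataFrame(
    list(itertools.product(periods_months, treatments, pct_lippia)),
    columns=["period_months", "treatment", "pct_lippia_multiflora"],
)
print(design.head())
```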


Author(s):  
Carla Marchetti ◽  
Massimo Mecella ◽  
Monica Scannapieco ◽  
Antonino Virgillito

A Cooperative Information System (CIS) is a large-scale information system that interconnects various systems of different and autonomous organizations, geographically distributed and sharing common objectives (De Michelis et al., 1997). Among the resources that organizations share, data are fundamental; in real-world scenarios, organization A may not request data from organization B if it does not trust B’s data (i.e., if A does not know that the quality of the data that B can provide is high). As an example, in an e-government scenario in which public administrations cooperate in order to fulfill service requests from citizens and enterprises (Batini & Mecella, 2001), administrations very often prefer asking citizens for data rather than requesting them from other administrations that store the same data, because the quality of those data is not known. Therefore, lack of cooperation may occur due to lack of quality certification.


2020 ◽  
Vol 65 (8) ◽  
pp. 27-38
Author(s):  
Iwona Markowicz ◽  
Paweł Baran

In the research carried out to date by the authors of the article, the assessment of the quality of mirror data on the exchange of goods between European Union (EU) countries was based on the value of goods; a similar approach is applied by many researchers. The aim of the research discussed in the article is to assess the quality of data relating to intra-EU trade based not only on the value, but also on the quantity of goods. The analysis of discrepancies in data relating to trade between EU countries, with a particular emphasis on Poland, was based on selected research methods known from the literature. Both the value-based and the quantity-based approach constitute the authors' contribution to the development of research methodology. Data quality indicators were proposed and data pertaining to the weight of goods were used. Information on trade in goods between EU countries in 2017, obtained from Eurostat's Comext database, was analysed. The research relating to the dynamics also covered the years 2005, 2008, 2011, and 2014. The results of the analysis demonstrated that the share of exports of goods from Poland to a given EU country differs between data expressed in value (value of goods) and in quantity (weight of goods). Therefore, both approaches should be applied in studies of the quality of mirror data.
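The sketch below illustrates one simple way such mirror-data indicators could be computed in both value and weight terms; the discrepancy formula, column names, and figures are illustrative assumptions, not the authors' definitions.

```python
# Sketch: a relative mirror-data discrepancy indicator, by value and by weight.
import pandas as pd

# Hypothetical Comext-style extract: Polish export declarations vs partner-reported imports.
mirror = pd.DataFrame({
    "partner":       ["DE", "CZ", "FR"],
    "export_value":  [100.0, 50.0, 80.0],   # declared by Poland (EUR m)
    "import_value":  [ 95.0, 60.0, 80.0],   # declared by the partner (EUR m)
    "export_weight": [ 10.0,  6.0,  9.0],   # declared by Poland (kt)
    "import_weight": [ 11.0,  5.5,  9.0],   # declared by the partner (kt)
})

for basis in ("value", "weight"):
    exp, imp = mirror[f"export_{basis}"], mirror[f"import_{basis}"]
    # Absolute gap relative to the mean of the two declarations.
    mirror[f"discrepancy_{basis}"] = (exp - imp).abs() / ((exp + imp) / 2)

print(mirror[["partner", "discrepancy_value", "discrepancy_weight"]])
```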


2019 ◽  
Vol 8 (3) ◽  
pp. 589-616 ◽  
Author(s):  
Samuel H Zuvekas ◽  
Adam I Biener ◽  
Wendy D Hicks

Abstract It is well established that survey respondents imperfectly recall their health care use. However, careful attention to both survey design and fielding procedures can enhance recall. We examine the effects of a comprehensive, multi-pronged approach to changing field procedures in the Medical Expenditure Panel Survey (MEPS) to improve the quality of health care use reporting. Conducted annually since 1996, the MEPS is the leading large-scale nationally representative health survey with detailed individual and household information on health care use and expenditures. These survey enhancements were undertaken in 2013–2014 because of concerns over a drop in the quality of reporting in 2010 that persisted into 2011–2012. The approach combined focused retraining of field supervisors and interviewers, developing quality metrics and reports for ongoing monitoring of interviewers, and revising advance letters and materials sent to respondents. We seek to determine the extent to which changes in field procedures and trainings improved interviewer and respondent behaviors associated with better reporting and, more importantly, improved reporting accuracy. We use longitudinal MEPS data from 2008 through 2015, combining household-reported use with sociodemographic and health status characteristics, and paradata on the characteristics of the interviews and interviewers. We exploit the longitudinal data and the timing of major trainings and changes in field procedures in regression models, separating out the effects of the trainings and other fielding changes to the extent possible. We find that the 2013–2014 data quality improvement activities substantially improved reporting quality. Positive interviewer behaviors increased substantially to above pre-2010 levels, and utilization reporting has recovered to above pre-2010 levels, returning MEPS to trend. Importantly, these substantial gains occurred in 2013, prior to extensive in-person training for most of the field force. We examine the lessons learned from this data quality initiative both for the MEPS program and for other large household surveys.
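As a rough illustration of the kind of regression described, the following sketch fits a before/after model with a post-2013 indicator and a linear time trend on synthetic data; the variables and specification are assumptions, not the MEPS analysis.

```python
# Sketch: before/after regression around the 2013-2014 field-procedure changes.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
years = np.repeat(np.arange(2008, 2016), 200)
df = pd.DataFrame({
    "year": years,
    "post_2013": (years >= 2013).astype(int),
    # Placeholder outcome: reported utilization events per respondent.
    "reported_visits": rng.poisson(4, size=years.size) + (years >= 2013),
})

# Linear time trend plus a shift for the post-intervention period.
model = smf.ols("reported_visits ~ year + post_2013", data=df).fit()
print(model.params)
```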


2017 ◽  
Author(s):  
Amanda Raine ◽  
Ulrika Liljedahl ◽  
Jessica Nordlund

Abstract The powerful HiSeq X sequencers with their patterned flowcell technology and fast turnaround times are instrumental for many large-scale genomic and epigenomic studies. However, assessment of DNA methylation by sodium bisulfite treatment results in sequencing libraries of low diversity, which may impact data quality and yield. In this report we assess the quality of WGBS data generated on the HiSeq X system in comparison with data generated on the HiSeq 2500 system and the newly released NovaSeq system. We report a systematic issue with low basecall quality scores assigned to guanines in the second read of WGBS when using certain Real Time Analysis (RTA) software versions on the HiSeq X sequencer, reminiscent of an issue that was previously reported with certain HiSeq 2500 software versions. However, with the HD.3.4.0/RTA 2.7.7 software upgrade for the HiSeq X system, we observed an overall improved quality and yield of the WGBS data generated, which in turn empowers cost-effective and high quality DNA methylation studies.
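A hedged sketch of how the reported read-2 guanine quality issue could be checked on a FASTQ file by comparing mean Phred scores of G calls with those of other bases; the file path is a hypothetical placeholder and this is not the authors' pipeline.

```python
# Sketch: compare Phred quality of guanine calls vs other bases in a read-2 FASTQ.
import gzip
from statistics import mean

def mean_quality_by_base(fastq_gz_path, base="G"):
    """Return (mean Phred of `base` calls, mean Phred of all other calls)."""
    base_q, other_q = [], []
    with gzip.open(fastq_gz_path, "rt") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq = fh.readline().strip()
            fh.readline()                      # '+' separator line
            qual = fh.readline().strip()
            for b, q in zip(seq, qual):
                phred = ord(q) - 33            # Phred+33 encoding
                (base_q if b == base else other_q).append(phred)
    return mean(base_q), mean(other_q)

# Hypothetical usage with a read-2 file:
# g_mean, other_mean = mean_quality_by_base("sample_R2.fastq.gz")
# print(f"G: {g_mean:.1f}  other bases: {other_mean:.1f}")
```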


2019 ◽  
Author(s):  
Bita Khalili ◽  
Mattia Tomasoni ◽  
Mirjam Mattei ◽  
Roger Mallol Parera ◽  
Reyhan Sonmez ◽  
...  

Abstract Identification of metabolites in large-scale 1H NMR data from human biofluids remains challenging due to the complexity of the spectra and their sensitivity to pH and ionic concentrations. In this work, we test the capacity of three analysis tools to extract metabolite signatures from 968 NMR profiles of human urine samples. Specifically, we studied sets of co-varying features derived from Principal Component Analysis (PCA), the Iterative Signature Algorithm (ISA) and Averaged Correlation Profiles (ACP), a new method we devised inspired by the STOCSY approach. We used our previously developed metabomatching method to match the sets generated by these algorithms to NMR spectra of individual metabolites available in public databases. Based on the number and quality of the matches we concluded that both ISA and ACP can robustly identify about a dozen metabolites, half of which were shared, while PCA did not produce any signatures with robust matches.
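For orientation, the sketch below extracts co-varying NMR features with PCA from a synthetic profile matrix and treats each component's loadings as a candidate pseudospectrum for matching; the data shapes are assumptions and metabomatching itself is not reproduced.

```python
# Sketch: PCA-derived feature sets from an NMR profile matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n_samples, n_features = 968, 2_000             # 968 urine profiles; placeholder feature count
X = rng.normal(size=(n_samples, n_features))   # placeholder binned NMR intensities

pca = PCA(n_components=20).fit(StandardScaler().fit_transform(X))
pseudospectra = pca.components_                # shape: (20, n_features)

# Top-loading features of the first component, e.g. as input to signature matching.
print(np.argsort(np.abs(pseudospectra[0]))[::-1][:10])
```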


Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1269
Author(s):  
Elmer A. G. Peñaloza ◽  
Vilma A. Oliveira ◽  
Paulo E. Cruvinel

One of the major problems facing humanity in the coming decades is the production of food on a large scale. The production of large quantities of food must be conducted in a manner that is sustainable and responsible towards nature and humans. In this sense, the appropriate application of agricultural pesticides plays a fundamental role, since qualified pesticide application reduces human and environmental risks as well as the costs of food production. Evaluating the quality of application using sprayers is an important issue, and several quality descriptors related to the average diameter and distribution of droplets are used. This paper describes the construction of a data-driven soft sensor using the parametric principal component regression (PCR) method based on principal component analysis (PCA), which works in two configurations: in one, the inputs are the operating conditions of the agricultural boom sprayer and the outputs are predictions of the spray quality descriptors; in the other, the roles are reversed. The soft sensor thus provides, in one configuration, estimates of the quality of pesticide application at a certain time and, in the other, estimates of the appropriate sprayer operating conditions, which can be used for control and optimization of pesticide application processes. Full cone nozzles are used to illustrate a practical application as well as to validate the usefulness of the soft sensor designed with the PCR method. The selection, exploration, and filtering of historical data, as well as the structure and validation of the soft sensor, are presented. For comparison purposes, results with the well-known nonparametric k-Nearest Neighbor (k-NN) regression method are presented. The results of this research reveal the usefulness of soft sensors in the application of agricultural pesticides and as a knowledge base to assist agricultural decision-making.
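A minimal sketch of the two regression approaches named in the abstract, principal component regression (PCA followed by linear regression) and k-NN regression, on synthetic operating-condition data; the variables and dimensions are placeholders, not the authors' experimental setup.

```python
# Sketch: PCR vs k-NN regression for a soft sensor mapping operating conditions
# to a spray quality descriptor (synthetic data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))                                  # e.g. pressure, flow rate, nozzle settings
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=300)   # e.g. a droplet-diameter descriptor

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))

for name, model in [("PCR", pcr), ("k-NN", knn)]:
    print(name, "R^2:", round(model.fit(X_tr, y_tr).score(X_te, y_te), 3))
```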


2017 ◽  
Author(s):  
Marek Ślusarski

The quality of data collected in official spatial databases is crucial for making strategic decisions as well as for the implementation of planning and design works. Awareness of the level of quality of these data is also important for individual users of official spatial data. The author presents methods and models for describing and evaluating the quality of spatial data collected in public registers. Data describing space at the highest level of detail, collected in three databases: the land and buildings registry (EGiB), the geodetic registry of the land infrastructure network (GESUT) and the database of topographic objects (BDOT500), were analyzed. The research concerned selected aspects of spatial data quality, including: assessment of the accuracy of data collected in official spatial databases; determination of the uncertainty of the area of registry parcels; analysis of the risk of damage to the underground infrastructure network due to the quality of spatial data; construction of a quality model for data collected in official databases; and visualization of uncertainty in spatial data. The evaluation of the accuracy of data collected in official, large-scale spatial databases was based on a representative sample of data. The test sample was a set of coordinate deviations with three variables, dX, dY and dl: the deviations in the X and Y coordinates and the length of the offset vector of a point of the test sample in relation to its position regarded as faultless. The compatibility of the empirical accuracy distributions with models (theoretical distributions of random variables) was investigated, and the accuracy of the spatial data was also assessed using methods resistant to outliers. In the process of determining the accuracy of spatial data collected in public registers, the author's own solution, a resistant method of relative frequency, was used. Weight functions were proposed which modify, to a varying degree, the sizes of the vectors dl, the lengths of the offset vectors of the test-sample points in relation to their positions regarded as faultless. Regarding the uncertainty of estimating the area of registry parcels, the impact of the errors of the geodetic network points (reference points and points of higher-class networks) was determined, as well as the effect of the correlation between the coordinates of the same point on the accuracy of the determined parcel area. The scope of the correction of parcel areas in the EGiB database, calculated on the basis of re-measurements performed using techniques of equivalent accuracy, was determined. The analysis of the risk of damage to the underground infrastructure network due to low-quality spatial data is another research topic presented in the paper. Three main factors were identified that influence the value of this risk: incompleteness of spatial data sets and insufficient accuracy in determining the horizontal and vertical positions of the underground infrastructure. A method for estimating the project risk, both quantitative and qualitative, was developed, and the author's risk estimation technique, based on fuzzy logic, was proposed. Maps (2D and 3D) of the risk of damage to the underground infrastructure network were developed in the form of large-scale thematic maps, presenting the design risk in qualitative and quantitative form.
The data quality model is a set of rules used to describe the quality of these data sets. The proposed model defines a standardized approach to assessing and reporting the quality of the EGiB, GESUT and BDOT500 spatial databases. Quantitative and qualitative rules (automatic, office and field) for data set control were defined. The minimum sample size and the number of admissible nonconformities in random samples were determined. The data quality elements were described using the following descriptors: range, measure, result, and type and unit of value. Data quality studies were performed according to users' needs. The values of the impact weights were determined by the analytic hierarchy process (AHP) method. The harmonization of the conceptual models of the EGiB, GESUT and BDOT500 databases with the BDOT10k database was also analysed. It was found that downloading and supplying information from the analyzed registers in BDOT10k creation and update processes are limited. Cartographic visualization techniques are an effective approach to providing users of spatial data sets with information about data uncertainty. Based on the author's own experience and research on examining the quality of official spatial database data, a set of methods for visualizing the uncertainty of the EGiB, GESUT and BDOT500 databases was defined. This set includes visualization techniques designed to present three types of uncertainty: location, attribute values and time. Uncertainty of position was defined (for surface, line, and point objects) using several (three to five) visual variables. Uncertainty of attribute values and time uncertainty, describing (for example) the completeness or timeliness of data sets, are presented by means of three graphical variables. The research problems presented in the paper are of cognitive and practical importance. They indicate the possibility of effectively evaluating the quality of spatial data collected in public registers and may be an important element of an expert system.
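As an illustration of outlier-resistant accuracy assessment from the deviation variables dX, dY and dl described above, the sketch below uses standard median/MAD estimators on synthetic deviations; the author's resistant method of relative frequency is not reproduced here.

```python
# Sketch: robust accuracy measures from coordinate deviations of a test sample.
import numpy as np

rng = np.random.default_rng(5)
dX = rng.normal(scale=0.05, size=1_000)      # placeholder deviations in X [m]
dY = rng.normal(scale=0.05, size=1_000)      # placeholder deviations in Y [m]
dl = np.hypot(dX, dY)                        # length of the point offset vector

def robust_scale(x):
    """Median absolute deviation scaled to be consistent with a standard deviation."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

print("robust sigma dX:", robust_scale(dX))
print("robust sigma dY:", robust_scale(dY))
print("median offset dl:", np.median(dl))
```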

