Quality Data
Recently Published Documents





2021 ◽  
pp. 1-15
Ali Reza Honarvar ◽  
Ashkan Sami

At present, the issue of air quality in populated urban areas is recognized as an environmental crisis. Air pollution affects the sustainability of cities. Air quality data are essential for controlling air pollution and protecting humans from its hazards. However, constructing and maintaining air quality monitoring infrastructure is expensive, and air quality data recorded at one point are not generalizable even a few kilometers away. Some insights can only be gained by integrating multiple data sources and can never be achieved through independent single-source processing. Urban organizations in each city independently produce and record data relevant to their own goals and objectives, creating separate data silos within an urban system. These data vary in model and structure, and integrating them provides an opportunity to discover knowledge useful for urban planning and decision making. This paper aims to show the generality of our previous research, which proposed a novel model for predicting Particulate Matter (PM), the main factor of air quality, in city regions without air quality sensors through the integration of urban big data resources, by extending the model and experiments with various configurations for different smart-city settings. This work extends the evaluation scenarios of the model with the extended dataset of the city of Aarhus, Denmark, and compares the model's performance against several specified baselines. This paper also details how the heterogeneity of multiple data sources is removed in the Multiple Data Set Aggregator & Heterogeneity Remover (MDA&HR) and how the operation of the Train Data Splitter (TDS) part of the model is improved by focusing on finding more similar air quality patterns. The acceptable accuracy of the results demonstrates the generality of the model.
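The abstract does not give the MDA&HR algorithm itself; a minimal sketch of the underlying idea (aligning records from separate urban data silos on a shared key, here a hypothetical timestamp/region pair, into one feature table) might look like this, with all field names being illustrative assumptions rather than the paper's schema:

```python
from collections import defaultdict

def aggregate_sources(*sources):
    """Merge heterogeneous record sets on a shared (timestamp, region) key.

    Each source is a list of dicts whose field names differ; non-key fields
    are flattened into one feature dict per key. A hypothetical stand-in
    for the role of the paper's MDA&HR component, not its actual design.
    """
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            key = (record["timestamp"], record["region"])
            features = {k: v for k, v in record.items()
                        if k not in ("timestamp", "region")}
            merged[key].update(features)
    return dict(merged)

# Two toy "silos" with different schemas:
traffic = [{"timestamp": "2014-08-01T10:00", "region": "A", "vehicle_count": 412}]
weather = [{"timestamp": "2014-08-01T10:00", "region": "A", "wind_speed": 3.1}]
table = aggregate_sources(traffic, weather)
# One row now carries features from both silos, ready for model training.
```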

2021 ◽  
Vol 11 (1) ◽  
M. A. Dakka ◽  
T. V. Nguyen ◽  
J. M. M. Hall ◽  
S. M. Diakiw ◽  
M. VerMilyea ◽  

Abstract: The detection and removal of poor-quality data in a training set is crucial to achieving high-performing AI models. In healthcare, data can be inherently poor-quality due to uncertainty or subjectivity, but, as is often the case, data privacy requirements restrict AI practitioners from accessing raw training data, meaning manual visual verification of private patient data is not possible. Here we describe a novel method for the automated identification of poor-quality data, called Untrainable Data Cleansing. This method is shown to have numerous benefits, including protection of private patient data; improvement in AI generalizability; and reduction in the time, cost, and data needed for training; all while offering a truer reporting of AI performance. Additionally, results show that Untrainable Data Cleansing could be useful as a triage tool to identify difficult clinical cases that may warrant in-depth evaluation or additional testing to support a diagnosis.
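The paper's exact procedure is not reproduced in this abstract; one way to realize the core idea — flag samples that remain misclassified whenever they are held out during cross-validation — is sketched below with a deliberately tiny nearest-centroid classifier. The fold scheme, classifier, and threshold are all illustrative assumptions:

```python
import statistics

def nearest_centroid_predict(train, test_x):
    """Tiny stand-in classifier: assign a point to the closest class mean."""
    by_label = {}
    for x, y in train:
        by_label.setdefault(y, []).append(x)
    centroids = {y: statistics.fmean(xs) for y, xs in by_label.items()}
    return min(centroids, key=lambda y: abs(centroids[y] - test_x))

def untrainable_indices(data, folds=5, threshold=1.0):
    """Flag samples misclassified in at least `threshold` of the folds in
    which they were held out (illustrative, not the paper's specification)."""
    errors = [0] * len(data)
    for f in range(folds):
        held_out = [i for i in range(len(data)) if i % folds == f]
        train = [data[i] for i in range(len(data)) if i % folds != f]
        for i in held_out:
            x, y = data[i]
            if nearest_centroid_predict(train, x) != y:
                errors[i] += 1
    # Each sample is held out exactly once here, so threshold=1.0 means
    # "misclassified whenever evaluated".
    return [i for i, e in enumerate(errors) if e >= threshold]

# Class 0 clusters near 0, class 1 near 10; index 4 is a mislabeled point.
data = [(0.1, 0), (0.3, 0), (9.8, 1), (10.2, 1), (0.2, 1), (9.9, 1)]
print(untrainable_indices(data))  # → [4]: the mislabeled sample is flagged
```

The flagged indices could then be dropped from training or, as the abstract suggests, triaged for expert review.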

2021 ◽  
Vol 34 (3) ◽  
pp. 650-658

ABSTRACT Water scarcity is one of the main problems of the Semiarid region of Brazil, and it can be mitigated by water resource management strategies. The objective of this work was to classify the waters of a watershed in the Semiarid region of Brazil and to select the water attributes that most affect the quality of water used for irrigation (QWI), using multivariate statistics. The study area was the Riacho da Bica watershed, located between the municipalities of Portalegre and Viçosa, Rio Grande do Norte, Brazil. The QWI was determined using water samples from 15 collections carried out from 2016 to 2018 at five specific points of the watershed, starting at the spring and following the water course. The water attributes evaluated were: electrical conductivity (EC), potential of hydrogen (pH), and sodium (Na+), potassium (K+), magnesium (Mg2+), calcium (Ca2+), carbonate (CO32-), chloride (Cl-), and bicarbonate (HCO3-) contents. The water quality data were subjected to multivariate statistics through factor analysis (FA) and principal component analysis (PCA). The application of FA-PCA generated four principal components. The attributes that most explained the QWI variation were potassium, calcium, and pH for Factor 1, and sodium and the sodium adsorption ratio (SAR) for Factor 2. The watershed waters were classified as low salinity risk and medium sodicity risk (C1S2) for irrigation purposes.
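The abstract reports the FA-PCA results only; the PCA step itself reduces to standardizing the attributes and taking eigenvalues of their correlation matrix. For two variables that matrix is [[1, r], [r, 1]] with eigenvalues 1 ± |r|, which allows a dependency-free sketch (the EC and Na+ readings below are synthetic, not the study's data):

```python
import statistics

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    n = len(xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (sx * sy)

def pc1_explained_variance(xs, ys):
    """PCA of two standardized variables: the correlation matrix
    [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|, so the first
    principal component explains (1 + |r|) / 2 of the total variance."""
    r = pearson_r(xs, ys)
    lam1, lam2 = 1 + abs(r), 1 - abs(r)
    return lam1 / (lam1 + lam2)

# Synthetic, strongly correlated EC (dS/m) and Na+ (mmol/L) readings:
ec = [0.4, 0.6, 0.9, 1.3, 1.8]
na = [1.0, 2.0, 2.2, 3.8, 4.5]
print(round(pc1_explained_variance(ec, na), 3))  # close to 1: PC1 dominates
```

With more attributes the same idea applies to the full correlation matrix; strongly co-varying ions (e.g. Na+ and SAR) load onto the same factor, which is how FA-PCA groups the attributes the study reports.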

Electronics ◽  
2021 ◽  
Vol 10 (17) ◽  
pp. 2049
Kennedy Edemacu ◽  
Jong Wook Kim

Nowadays, the internet of things (IoT) is used to generate data in several application domains, and logistic regression, a standard machine learning algorithm with a wide range of applications, is often built on such data. Nevertheless, building a powerful and effective logistic regression model requires large amounts of data, so collaboration between multiple IoT participants has often been the go-to approach. However, privacy concerns and poor data quality are two challenges that threaten the success of such a setting. Several studies have proposed different methods to address the privacy concern, but to the best of our knowledge, little attention has been paid to addressing poor data quality in the multi-party logistic regression model. Thus, in this study, we propose a multi-party privacy-preserving logistic regression framework with poor-quality data filtering for IoT data contributors that addresses both problems. Specifically, we propose a new metric, gradient similarity, computed in a distributed setting, which we employ to filter out parameters from data contributors with poor-quality data. To solve the privacy challenge, we employ homomorphic encryption. Theoretical analysis and experimental evaluations using real-world datasets demonstrate that our proposed framework is privacy-preserving and robust against poor-quality data.
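The abstract does not give the metric's formula; a plausible plaintext sketch of gradient-similarity filtering — drop contributors whose local gradient points away from the consensus of all parties — is shown below. The cosine measure and threshold are assumptions, and the homomorphic-encryption layer the paper uses is omitted entirely:

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def filter_contributors(gradients, threshold=0.5):
    """Keep only contributors whose local gradient is cosine-similar to
    the mean gradient over all parties (illustrative; the paper performs
    the comparable computation under homomorphic encryption)."""
    dim = len(gradients[0])
    mean = [sum(g[i] for g in gradients) / len(gradients) for i in range(dim)]
    return [g for g in gradients if cosine(g, mean) >= threshold]

honest = [[1.0, 2.0], [1.1, 1.9], [0.9, 2.1]]
poisoned = [[-1.0, -2.0]]          # e.g. mislabeled data flips the gradient
kept = filter_contributors(honest + poisoned)
print(len(kept))  # → 3: the flipped gradient is filtered out
```

Only the surviving gradients would then be averaged into the global logistic regression update, which is what makes the aggregate robust to a poor-quality contributor.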

2021 ◽  
David Hauser ◽  
Aaron J Moss ◽  
Cheskie Rosenzweig ◽  
Shalom Noach Jaffe ◽  
Jonathan Robinson ◽  

Maintaining data quality on Amazon Mechanical Turk (MTurk) has always been a concern for researchers. CloudResearch, a third-party website that interfaces with MTurk, assessed ~100,000 MTurkers and categorized them into those who provide high-quality (~65,000, Approved) and low-quality (~35,000, Blocked) data. Here, we examined the predictive validity of CloudResearch's vetting. Participants (N = 900) from the Approved and Blocked groups, along with a Standard MTurk sample, completed an array of data quality measures. Approved participants had better reading comprehension, reliability, honesty, and attentiveness scores; were less likely to cheat and satisfice; and replicated classic experimental effects more reliably than Blocked participants, who performed at chance on multiple outcomes. Data quality of the Standard sample was generally in between that of the Approved and Blocked groups. We discuss the implications of using the Approved group for scientific studies conducted on Mechanical Turk.

Eckhard Kirchner ◽  
Stefan Schork ◽  
Gunnar Vorwerk-Handing ◽  
Sven Vogel

A mandatory requirement for the concept of intelligent systems and of digital or cyber-physical twins is the availability of high-quality data. Therefore, the authors investigate the possibility of integrating sensors, actuators, and information technologies into standardized machine elements such as screws, bearings, and couplings. In this paper, the focus is on sensing machine elements, a sub-category of mechatronic machine elements. Prototypes are needed to gain insights into sensing machine elements during development and to verify and validate their functionality. These prototypes should help the designer gain knowledge about the product in development, and they should preferably be developed with little effort. Therefore, a method is proposed to analyse concepts of mechatronic machine elements, especially sensing machine elements, regarding critical aspects that may interfere with the functionality of the product. The method is based on analysing the flow of the signal used for the measurement, starting from its mechanical origin and ending at the analysis unit. Different examples of sensing machine elements are given in this article, and the respective flow of the usable signal is analysed, leading to the identification of subsystems that can be tested individually. Based on this, prototypes for the subsystems are developed and introduced.

2021 ◽  
Vol 8 ◽  
I. M. G. A. Santman-Berends ◽  
M. H. Mars ◽  
M. F. Weber ◽  
L. van Duijn ◽  
H. W. F. Waldeck ◽  

Within the European Union, infectious cattle diseases are categorized in the Animal Health Law. No strict EU regulations exist for control, evidence of disease freedom, and surveillance of diseases listed other than categories A and B. Consequently, EU member states follow their own varying strategies for disease control. The aim of this study was to provide an overview of the control and eradication programs (CPs) for non-EU regulated cattle diseases in the Netherlands between 2009 and 2019 and to highlight characteristics specific to the Dutch situation. In the Netherlands, CPs are in place for six endemic cattle diseases: bovine viral diarrhea, infectious bovine rhinotracheitis, salmonellosis, paratuberculosis, leptospirosis, and neosporosis. These CPs have been tailored to the specific situation in the Netherlands: a country with a high cattle density, a high rate of animal movements, a strong dependence on export of dairy products, and a high-quality data-infrastructure. The latter specifically applies to the dairy sector, which is the leading cattle sector in the Netherlands. When a herd enters a CP, generally the within-herd prevalence of infection is estimated in an initial assessment. The outcome creates awareness of the infection status of a herd and also provides an indication of the costs and time to achieve the preferred herd status. Subsequently, the herd enrolls in the control phase of the CP to, if present, eliminate the infection from a herd and a surveillance phase to substantiate the free or low prevalence status over time. The high-quality data infrastructure that results in complete and centrally registered census data on cattle movements provides the opportunity to design CPs while minimizing administrative efforts for the farmer. In the CPs, mostly routinely collected samples are used for surveillance. Where possible, requests for proof of the herd status are sent automatically. 
Automated detection of risk factors for the introduction of infection through new animals originating from a herd without the preferred herd status (i.e., free or unsuspected) is in place using centrally registered data. The presented overview may inspire countries that want to develop cost-effective CPs for endemic diseases that are not (yet) regulated at EU level.

2021 ◽  
Rachel Wittenauer ◽  
Spike Nowak ◽  
Nick Luter

Abstract Background: Rapid diagnostic tests (RDTs) for malaria are a vital part of global malaria control. Over the past decade, RDT prices have declined and quality has increased. However, the relationship between price and product quality, and its larger implications for the market, has yet to be characterized. We sought to use purchase data from the Global Fund together with product quality data from the World Health Organization and Foundation for Innovative New Diagnostics (WHO-FIND) Malaria RDT Product Testing Programme to answer three questions: 1) Has the market share by quality of RDTs in the Global Fund's procurement orders changed over time? 2) What is the relationship between unit price and RDT quality? 3) Has the Global Fund procurement market become more concentrated over time?

Methods: We merged data from 10,075 procurement transactions in the Global Fund's database, which includes year, product, volume, and price, with product quality data from all eight rounds of the WHO-FIND Programme, which evaluated 227 unique RDT products. To describe trends in market share by RDT quality level, we used descriptive statistics to analyze market share from 2009 to 2018. We then applied a generalized linear regression model to characterize the relationship between price and panel detection score (PDS), adjusting for order volume, year purchased, product type, and manufacturer. Third, we calculated a Herfindahl-Hirschman Index (HHI) score to characterize the degree of market concentration.

Results: Lower-quality RDTs lost market share between 2009 and 2018, as did higher-quality RDTs. We find no statistically significant relationship between price per test and PDS when adjusting for order volume, product type, and year of purchase. The HHI was 3,570, indicating a highly concentrated market.

Conclusions: Advancements in RDT affordability, quality, and access over the past decade risk stagnation if the health of the RDT market as a whole is neglected. Our results suggest that this market is highly concentrated and that quality is not a distinguishing feature between RDTs. This information adds to previous reports noting concerns about the long-term sustainability of this market. Further research is needed to understand the causes and implications of these trends.
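The HHI reported above is simple arithmetic: the sum of squared market shares expressed in percentage points, ranging from near 0 (atomistic competition) to 10,000 (monopoly), with markets above 2,500 conventionally considered highly concentrated under the U.S. DOJ/FTC Horizontal Merger Guidelines. A sketch (the manufacturer volumes are illustrative, not the study's data):

```python
def hhi(volumes):
    """Herfindahl-Hirschman Index: sum of squared percentage market shares.

    `volumes` can be raw order volumes; shares are derived from the total.
    """
    total = sum(volumes)
    return sum((100 * v / total) ** 2 for v in volumes)

# Hypothetical order volumes for four RDT manufacturers (shares 55/25/15/5%):
volumes = [55, 25, 15, 5]
print(round(hhi(volumes)))  # → 3900, i.e. highly concentrated (> 2,500)
```

An index of 3,570 therefore sits well inside the "highly concentrated" band, consistent with the study's conclusion that the procurement market depends on a small number of suppliers.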
