Ensuring data trustworthiness within SMART Monitoring of environmental processes

Author(s):  
Uta Koedel ◽  
Peter Dietrich ◽  
Philipp Fischer ◽  
Claudia Schuetze

The term SMART Monitoring was defined by the project Digital Earth (DE), a central joint project of eight Helmholtz centers in Earth and Environment. SMART Monitoring in the sense of DE means that measured environmental parameters and values need to be specific/scalable, measurable/modular, accepted/adaptive, relevant/robust, and trackable/transferable (SMART) for sustainable data use and improved data acquisition. SMART Monitoring can be defined as a reliable monitoring approach with machine learning and artificial intelligence (AI) supported procedures for an “as automated as possible” data flow from individual sensors to databases. SMART Monitoring tools must cover various standardized data flows within the entire data lifecycle, e.g., specific sensor solutions, novel approaches for sampling designs, and defined standardized metadata descriptions. One essential component of SMART Monitoring workflows is enhancing metadata with comprehensive information on data quality. At the same time, SMART Monitoring must be highly modular and adaptive so that it can be applied to different monitoring approaches and disciplines in the sciences.

In SMART Monitoring, data quality is crucial, not only with respect to data FAIRness but also to ensure data reliability and representativeness. Comprehensively documented data quality is therefore required to enable meaningful data selection for specific data blending, integration, and joint interpretation. Data integration from different sources is a prerequisite for the parameterization and validation of predictive tools or models, which underlines the importance of implementing the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for sustainable data management (Wilkinson et al. 2016). So far, however, the FAIR data principles do not include a detailed description of data quality and do not cover content-related quality aspects. Even though data may be FAIR in terms of availability, they are not necessarily “good” in terms of accuracy and precision. Unfortunately, there is still considerable confusion in science about what constitutes good or trustworthy data.

An assessment of data quality and data origin is essential to preclude inaccurate, incomplete, or otherwise unsatisfactory data analyses, e.g., when applying machine learning methods, and to avoid poorly derived, misleading, or incorrect conclusions. The terms trustworthiness and representativeness summarise all aspects related to these issues. The central pillars of trustworthiness/representativeness are validity, provenience/provenance, and reliability, which are fundamental features in assessing any data collection or processing step for transparent research. For all kinds of secondary data usage and analysis, a detailed description and assessment of reliability and validity involve an appraisal of the applied data collection methods.

The presentation will give examples to show the importance of evaluating and describing data trustworthiness and representativeness, allowing scientists to find appropriate tools and methods for FAIR data handling and more accurate data interpretation.
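The call for enhancing metadata with quality information can be made concrete with a small sketch. The record below is purely illustrative, assuming hypothetical field names rather than any standard adopted by Digital Earth; it shows how a single measurement could carry provenance and quality fields alongside the measured value.

```python
from dataclasses import dataclass, asdict

# Hypothetical, minimal metadata record illustrating how a measurement
# could carry quality and provenance information alongside the value;
# all field names are illustrative, not a project standard.
@dataclass
class QualityAwareRecord:
    sensor_id: str
    parameter: str
    value: float
    unit: str
    timestamp: str          # ISO 8601
    calibration_date: str   # provenance: last calibration of the sensor
    method: str             # provenance: how the value was obtained
    accuracy: float         # instrument accuracy, same unit as value
    quality_flag: str       # e.g. "good", "suspect", "bad"

record = QualityAwareRecord(
    sensor_id="CTD-042",
    parameter="water_temperature",
    value=11.7,
    unit="degC",
    timestamp="2021-06-01T12:00:00Z",
    calibration_date="2021-05-15",
    method="in-situ thermistor",
    accuracy=0.05,
    quality_flag="good",
)

# A plain dict is easy to serialise into a metadata catalogue or database
print(asdict(record)["quality_flag"])
```

Downstream users can then filter on fields such as `quality_flag` or `accuracy` before blending datasets, which is exactly the kind of content-related quality selection the FAIR principles alone do not provide.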

10.2196/17619 ◽  
2020 ◽  
Vol 9 (8) ◽  
pp. e17619
Author(s):  
Neha Shah ◽  
Diwakar Mohan ◽  
Jean Juste Harisson Bashingwa ◽  
Osama Ummer ◽  
Arpita Chakraborty ◽  
...  

Background Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally relied on quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the growing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in the data collection which could compromise quality. Objective This study aimed to describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics. Methods In the Kilkari impact evaluation’s end-line survey amongst postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning–based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time-stamps, “don’t know” rates, and skip rates. We will also obtain labeled data from self-filled surveys and build models using k-fold cross-validation on a training data set with both supervised and unsupervised learning algorithms.
Based on these models, results will be fed back to the field through various feedback loops. Results Data collection began in late October 2019 and will span through March 2020. We expect to submit quality assurance results by August 2020. Conclusions Machine learning is underutilized as a tool to improve survey data quality in low resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as in the routine collection of health information by health workers. International Registered Report Identifier (IRRID) DERR1-10.2196/17619
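The routine outlier analyses described above can be illustrated with a minimal sketch. The rates and the two-standard-deviation threshold below are hypothetical, not values from the Kilkari protocol; the sketch only shows the general shape of flagging enumerators whose “don’t know” rate deviates from the group.

```python
import statistics

# Hypothetical per-enumerator "don't know" rates (fraction of responses);
# the study's actual features and thresholds are not specified here.
dont_know_rates = {
    "enum_01": 0.02, "enum_02": 0.03, "enum_03": 0.02,
    "enum_04": 0.18, "enum_05": 0.04, "enum_06": 0.03,
}

mean = statistics.mean(dont_know_rates.values())
sd = statistics.stdev(dont_know_rates.values())

# Flag enumerators more than 2 standard deviations above the group mean,
# a common rule of thumb for routine data quality monitoring.
flagged = [e for e, r in dont_know_rates.items() if (r - mean) / sd > 2]
print(flagged)
```

In a real-time pipeline, a report of flagged enumerators would feed into the field-monitoring and feedback-loop components so that retraining or re-interviews can happen while data collection is still under way.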


2018 ◽  
Vol 12 ◽  
pp. 85-98
Author(s):  
Bojan Kostadinov ◽  
Mile Jovanov ◽  
Emil Stankov

Data collection and machine learning are changing the world. Whether it is medicine, sports or education, companies and institutions are investing a lot of time and money in systems that gather, process and analyse data. Likewise, to improve competitiveness, many countries are making changes to their educational policy by supporting STEM disciplines. It is therefore important to put effort into using various data sources to help students succeed in STEM. In this paper, we present a platform that can analyse students’ activity on various contest and e-learning systems, combine and process the data, and then present it in ways that are easy to understand. This in turn enables teachers and organizers to recognize talented and hardworking students, identify issues, and motivate students to practice and work on areas where they are weaker.
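The idea of combining activity data from several systems into one comprehensible view can be sketched as follows. The student names, data sources, and scoring weights are invented for illustration and are not taken from the platform described in the paper.

```python
# Hypothetical activity records from two systems (a contest judge and an
# e-learning platform); names and weights are illustrative only.
contest_solved = {"ana": 12, "boris": 7, "vera": 15}
elearning_hours = {"ana": 20.0, "boris": 35.0, "vera": 10.0}

def composite(student):
    # Simple weighted blend of the two sources; a real platform would
    # normalise and weight these signals far more carefully.
    return 2.0 * contest_solved.get(student, 0) + elearning_hours.get(student, 0.0)

# Rank students by combined activity across both systems
ranking = sorted(contest_solved, key=composite, reverse=True)
print(ranking)
```

Even this toy blend shows the point of combining sources: a student who is middling on one system (here, contest results) can still surface near the top once e-learning effort is taken into account.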


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Suppawong Tuarob ◽  
Poom Wettayakorn ◽  
Ponpat Phetchai ◽  
Siripong Traivijitkhun ◽  
Sunghoon Lim ◽  
...  

Abstract The explosion of online information with the recent advent of digital technology in information processing, information storing, information sharing, natural language processing, and text mining techniques has enabled stock investors to uncover market movement and volatility from heterogeneous content. For example, a typical stock market investor reads the news, explores market sentiment, and analyzes technical details in order to make a sound decision prior to purchasing or selling a particular company’s stock. However, capturing a dynamic stock market trend is challenging owing to high fluctuation and the non-stationary nature of the stock market. Although existing studies have attempted to enhance stock prediction, few have provided a complete decision-support system for investors to retrieve real-time data from multiple sources and extract insightful information for sound decision-making. To address the above challenge, we propose a unified solution for data collection, analysis, and visualization in real-time stock market prediction to retrieve and process relevant financial data from news articles, social media, and company technical information. We aim to provide not only useful information for stock investors but also meaningful visualization that enables investors to effectively interpret storyline events affecting stock prices. Specifically, we utilize an ensemble stacking of diversified machine-learning-based estimators and innovative contextual feature engineering to predict the next day’s stock prices. Experiment results show that our proposed stock forecasting method outperforms a traditional baseline with an average mean absolute percentage error of 0.93. Our findings confirm that leveraging an ensemble scheme of machine learning methods with contextual information improves stock prediction performance.
Finally, our study could be further extended to a wide variety of innovative financial applications that seek to incorporate external insight from contextual information such as large-scale online news articles and social media data.
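The ensemble idea can be illustrated with a deliberately tiny, dependency-free sketch. The two base predictors and the fixed blending weight below are stand-ins; the paper's actual system uses a learned stacking of diversified machine-learning estimators and contextual features.

```python
# A minimal sketch of blending two naive next-day price predictors and
# scoring with mean absolute percentage error (MAPE). Prices, predictors,
# and the blend weight are hypothetical.
prices = [100.0, 102.0, 101.0, 104.0, 106.0, 105.0]

def last_value(history):
    return history[-1]

def moving_average(history, window=3):
    return sum(history[-window:]) / window

def stacked(history, w=0.7):
    # Meta-layer: a fixed weighted blend; a real stacking ensemble would
    # learn these weights from held-out base-model predictions.
    return w * last_value(history) + (1 - w) * moving_average(history)

# Walk forward: predict each day from the history before it
errors = []
for t in range(3, len(prices)):
    pred = stacked(prices[:t])
    errors.append(abs(pred - prices[t]) / prices[t])

mape = 100 * sum(errors) / len(errors)
print(round(mape, 2))
```

The walk-forward loop matters here: because stock series are non-stationary, each prediction uses only past observations, mirroring how a real-time decision-support system would be evaluated.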


2021 ◽  
Vol 5 (3) ◽  
pp. 1-30
Author(s):  
Gonçalo Jesus ◽  
António Casimiro ◽  
Anabela Oliveira

Sensor platforms used in environmental monitoring applications are often subject to harsh environmental conditions while monitoring complex phenomena. Designing dependable monitoring systems is therefore challenging, given the external disturbances affecting sensor measurements. Even the apparently simple task of outlier detection in sensor data becomes a hard problem, amplified by the difficulty of distinguishing true data errors due to sensor faults from deviations due to natural phenomena, which look like data errors. Existing solutions for runtime outlier detection typically assume that the physical processes can be accurately modeled, or that outliers consist of large deviations that are easily detected and filtered by appropriate thresholds. Other solutions assume that it is possible to deploy multiple sensors providing redundant data to support voting-based techniques. In this article, we propose a new methodology for dependable runtime detection of outliers in environmental monitoring systems, aiming to increase data quality by detecting and correcting erroneous measurements. We propose the use of machine learning techniques to model each sensor's behavior, exploiting correlated data provided by other related sensors. Using these models, along with knowledge of past processed measurements, it is possible to obtain accurate estimations of the observed environmental parameters and build failure detectors that use these estimations. When a failure is detected, the estimations also allow one to correct the erroneous measurements and hence improve the overall data quality. Our methodology not only distinguishes truly abnormal measurements from deviations due to complex natural phenomena, but also quantifies the quality of each measurement, which is relevant from a dependability perspective.
We apply the methodology to real datasets from a complex aquatic monitoring system, measuring temperature and salinity, and illustrate the process of building the machine learning prediction models using a technique based on Artificial Neural Networks, denoted ANNODE (ANN Outlier Detection). From this application, we observe the effectiveness of the ANNODE approach for accurate outlier detection in harsh environments. We then validate these positive results by comparing ANNODE with state-of-the-art solutions for outlier detection. The results show that ANNODE improves on existing solutions in terms of outlier detection accuracy.
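The core residual idea behind model-based outlier detection can be sketched without an ANN. Below, a simple linear fit stands in for the learned sensor model, and the readings and threshold are assumed values; ANNODE itself uses Artificial Neural Networks trained on correlated sensor data.

```python
# Sketch of residual-based outlier detection: predict one sensor from a
# correlated one and flag readings whose residual exceeds a threshold.
# Data and threshold are hypothetical; a linear least-squares fit stands
# in for the learned model (a real deployment would fit on validated
# past data rather than on the stream being checked).
temperature = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
salinity    = [30.1, 30.6, 31.0, 31.6, 32.0, 35.0]  # last reading is faulty

n = len(temperature)
mx = sum(temperature) / n
my = sum(salinity) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(temperature, salinity))
         / sum((x - mx) ** 2 for x in temperature))
intercept = my - slope * mx

threshold = 1.0  # assumed acceptable residual, in salinity units
outliers = [i for i, (x, y) in enumerate(zip(temperature, salinity))
            if abs(y - (intercept + slope * x)) > threshold]
print(outliers)
```

Once a reading is flagged, the model prediction `intercept + slope * x` can also serve as the corrected value, which is the quality-improvement step the methodology describes; the residual magnitude itself gives a per-measurement quality score.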


Author(s):  
Christopher D O’Connor ◽  
John Ng ◽  
Dallas Hill ◽  
Tyler Frederick

Policing is increasingly being shaped by data collection and analysis. However, we still know little about the quality of the data police services acquire and utilize. Drawing on a survey of analysts from across Canada, this article examines several data collection, analysis, and quality issues. We argue that as we move towards an era of big data policing it is imperative that police services pay more attention to the quality of the data they collect. We conclude by discussing the implications of ignoring data quality issues and the need to develop a more robust research culture in policing.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Michelle Amri ◽  
Christina Angelakis ◽  
Dilani Logan

Abstract Objective Through collating observations from various studies and complementing these findings with one author’s study, a detailed overview of the benefits and drawbacks of asynchronous email interviewing is provided. Through this overview, it is evident there is great potential for asynchronous email interviews in the broad field of health, particularly for studies drawing on expertise from participants in academia or professional settings, those across varied geographical settings (i.e. potential for global public health research), and/or in circumstances when face-to-face interactions are not possible (e.g. COVID-19). Results Benefits of asynchronous email interviewing and additional considerations for researchers are discussed around: (i) access transcending geographic location and during restricted face-to-face communications; (ii) feasibility and cost; (iii) sampling and inclusion of diverse participants; (iv) facilitating snowball sampling and increased transparency; (v) data collection with working professionals; (vi) anonymity; (vii) verification of participants; (viii) data quality and enhanced data accuracy; and (ix) overcoming language barriers. Similarly, potential drawbacks of asynchronous email interviews are also discussed with suggested remedies, which centre around: (i) time; (ii) participant verification and confidentiality; (iii) technology and sampling concerns; (iv) data quality and availability; and (v) need for enhanced clarity and precision.


2021 ◽  
Vol 13 (6) ◽  
pp. 3320
Author(s):  
Amy R. Villarosa ◽  
Lucie M. Ramjan ◽  
Della Maneze ◽  
Ajesh George

The COVID-19 pandemic has resulted in many changes, including restrictions on indoor gatherings and visitation to residential aged care facilities, hospitals and certain communities. Coupled with potential restrictions imposed by health services and academic institutions, these changes may significantly impact the conduct of population health research. However, the continuance of population health research is beneficial for the provision of health services and sometimes imperative. This paper discusses the impact of COVID-19 restrictions on the conduct of population health research. This discussion unveils important ethical considerations, as well as potential impacts on recruitment methods, face-to-face data collection, data quality and validity. In addition, this paper explores potential recruitment and data collection methods that could replace face-to-face methods. The discussion is accompanied by reflections on the challenges experienced by the authors in their own research at an oral health service during the COVID-19 pandemic and alternative methods that were utilised in place of face-to-face methods. This paper concludes that, although COVID-19 presents challenges to the conduct of population health research, there is a range of alternative methods to face-to-face recruitment and data collection. These alternative methods should be considered in light of project aims to ensure data quality is not compromised.

