Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India (Preprint)

2019 ◽  
Author(s):  
Neha Shah ◽  
Diwakar Mohan ◽  
Jean Juste Harisson Bashingwa ◽  
Osama Ummer ◽  
Arpita Chakraborty ◽  
...  

BACKGROUND Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally relied on quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the growing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in the data collection that could compromise quality. OBJECTIVE This study aimed to describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics. METHODS In the Kilkari impact evaluation’s end-line survey among postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning–based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from timestamps, “don’t know” rates, and skip rates. We will also obtain labeled data from self-filled surveys and build models using k-fold cross-validation on a training data set with both supervised and unsupervised learning algorithms. 
Based on these models, results will be fed back to the field through various feedback loops. RESULTS Data collection began in late October 2019 and will span through March 2020. We expect to submit quality assurance results by August 2020. CONCLUSIONS Machine learning is underutilized as a tool for improving survey data quality in low-resource settings. Study findings are anticipated to improve the overall quality of the Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as for the routine collection of health information by health workers. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) DERR1-10.2196/17619
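The feature construction and k-fold cross-validation described in the methods can be sketched in plain Python; the field names, toy interview records, and the simple fold-splitting scheme below are illustrative assumptions, not the study's actual instrument or pipeline:

```python
# Hypothetical sketch of quality-assurance feature engineering for survey data.
# Field names ("end_ts", "answers", the "DK" code) are assumptions for illustration.

def quality_features(interview):
    """Build features from timestamps, "don't know" rates, and skip rates."""
    duration_min = (interview["end_ts"] - interview["start_ts"]) / 60.0
    n = len(interview["answers"])
    dk_rate = sum(a == "DK" for a in interview["answers"]) / n
    skip_rate = sum(a is None for a in interview["answers"]) / n
    return [duration_min, dk_rate, skip_rate]

def k_fold_indices(n_rows, k=5):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Toy example: the second interview is suspiciously fast with many skips,
# the kind of pattern an outlier analysis would flag for field follow-up.
interviews = [
    {"start_ts": 0, "end_ts": 1800, "answers": ["A", "B", "DK", "C"]},
    {"start_ts": 0, "end_ts": 120,  "answers": [None, None, "DK", None]},
]
X = [quality_features(iv) for iv in interviews]
print(X[1])  # prints [2.0, 0.25, 0.75]: short duration, high DK and skip rates
```

In practice the fold indices would feed a supervised classifier trained on the labeled self-filled surveys, with the same features reused by unsupervised outlier detection.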

10.2196/17619 ◽  
2020 ◽  
Vol 9 (8) ◽  
pp. e17619


2021 ◽  
Author(s):  
Rodrigo Chamusca Machado ◽  
Fabbio Leite ◽  
Cristiano Xavier ◽  
Alberto Albuquerque ◽  
Samuel Lima ◽  
...  

Objectives/Scope This paper presents how a Brazilian drilling contractor and a startup built a partnership to optimize the maintenance window of subsea blowout preventers (BOPs) using condition-based maintenance (CBM). It showcases examples of insights into the operational condition of BOP components, obtained by applying machine learning techniques to real-time and historical data, both structured and unstructured. Methods, Procedures, Process From the structured and unstructured historical data generated daily by BOP operations, a knowledge bank was built and used to develop normal-behavior models. This has been possible even without real-time data, as the approach has been tested on large sets of operational data collected from event-log text files. Software retrieves the data from event loggers and creates a structured database comprising analog variables, warnings, alarms, and system information. Using machine learning algorithms, the historical data is then used to model normal behavior for the target components. Thereby, either event-logger or real-time data can be used to identify abnormal operating moments and detect failure patterns. Critical situations are immediately transmitted to the Real-Time Operations Center (RTOC) and the management team, while less critical alerts are recorded in the system for further investigation. Results, Observations, Conclusions During the implementation period, the drilling contractor was able to identify a BOP failure using the detection algorithms and used 100% of the information generated by the system and its reports to plan equipment maintenance efficiently. The system has also been used intensively for incident investigation, helping to identify root causes through data analytics and feeding results back into the machine learning algorithms for future automated failure prediction. 
This development is expected to significantly reduce the risk of retrieving the BOP during operations for corrective maintenance, increase staff efficiency in maintenance activities, reduce the risk of downtime, improve the scope of maintenance during operational windows, and, finally, reduce the cost of spare-parts replacement during maintenance, all without compromising operational safety. Novel/Additive Information For the near future, the plan is to integrate the system with the Computerized Maintenance Management System (CMMS), checking for historical maintenance, overdue maintenance, and certifications in the same place and at the same time that real-time operational data and insights are obtained. Using real-time data as input, we expect to expand the failure-prediction application to other BOP parts (such as regulators, shuttle valves, SPMs (submounted plate valves), etc.) and increase its applicability to other critical equipment on the rig.
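The normal-behavior modeling and abnormality detection described above can be illustrated with a minimal statistical sketch; the pressure channel, readings, and z-score threshold are hypothetical stand-ins, not the contractor's actual models:

```python
# Illustrative sketch of "normal behavior" modeling from BOP event-log data.
# Channel name, sample values, and the 3-sigma threshold are assumptions.
import statistics

def fit_normal_model(history):
    """Learn a per-channel mean/stdev from historical analog readings."""
    return {"mean": statistics.mean(history), "stdev": statistics.pstdev(history)}

def is_abnormal(model, reading, z_threshold=3.0):
    """Flag readings that deviate strongly from the learned normal band."""
    if model["stdev"] == 0:
        return reading != model["mean"]
    z = abs(reading - model["mean"]) / model["stdev"]
    return z > z_threshold

# Toy data: accumulator pressure readings parsed from event-logger text files.
pressure_history = [3000, 3010, 2995, 3005, 3002, 2998]
model = fit_normal_model(pressure_history)
print(is_abnormal(model, 3004))  # prints False: within the normal band
print(is_abnormal(model, 2500))  # prints True: large deviation, alert the RTOC
```

A production system would replace this univariate band with multivariate models over many channels, but the flow (fit on history, score live or logged data, route alerts by severity) is the same.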


Sensors ◽  
2018 ◽  
Vol 18 (9) ◽  
pp. 2994 ◽  
Author(s):  
Bhagya Silva ◽  
Murad Khan ◽  
Changsu Jung ◽  
Jihun Seo ◽  
Diyan Muhammad ◽  
...  

The Internet of Things (IoT), inspired by the tremendous growth of connected heterogeneous devices, has pioneered the notion of the smart city. The various components integrated within a smart city architecture, i.e., smart transportation, smart community, smart healthcare, smart grid, etc., aim to enrich the quality of life (QoL) of urban citizens. However, real-time processing requirements and exponential data growth hinder the realization of smart cities. Therefore, herein we propose a Big Data analytics (BDA)-embedded experimental architecture for smart cities. The BDA-embedded smart city serves two major aspects. First, it facilitates the exploitation of urban Big Data (UBD) in planning, designing, and maintaining smart cities. Second, it employs BDA to manage and process voluminous UBD to enhance the quality of urban services. The three tiers of the proposed architecture are responsible for data aggregation, real-time data management, and service provisioning. Moreover, offline and online data processing tasks are further expedited by integrating data-normalizing and data-filtering techniques into the proposed work. By analyzing authenticated datasets, we obtained the threshold values required for urban planning and city operation management. Performance metrics for online and offline data processing on the proposed dual-node Hadoop cluster were obtained using the aforementioned authentic datasets. Throughput and processing-time analyses performed against existing works demonstrate the performance superiority of the proposed work. Hence, we can claim the applicability and reliability of implementing the proposed BDA-embedded smart city architecture in the real world.
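The data-normalizing and data-filtering stage mentioned above can be sketched as follows; the sensor field name, valid range, and toy records are illustrative assumptions, not the paper's datasets or thresholds:

```python
# Hedged sketch of a normalize/filter preprocessing step for urban sensor data.
# The "speed_kmh" field and the 0-200 validity range are illustrative only.

def min_max_normalize(values):
    """Scale raw sensor readings into [0, 1] for downstream analytics."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def filter_valid(records, field, lo, hi):
    """Drop malformed or out-of-range records before batch processing."""
    return [r for r in records if lo <= r.get(field, float("nan")) <= hi]

# Toy traffic feed: negative and implausibly high speeds are filtered out.
traffic = [{"speed_kmh": 42}, {"speed_kmh": -5},
           {"speed_kmh": 310}, {"speed_kmh": 60}]
valid = filter_valid(traffic, "speed_kmh", 0, 200)
print(min_max_normalize([r["speed_kmh"] for r in valid]))  # prints [0.0, 1.0]
```

In an architecture like the one described, this kind of cleaning would run in the data-aggregation tier before records reach the Hadoop cluster.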


Sensors ◽  
2021 ◽  
Vol 21 (20) ◽  
pp. 6750
Author(s):  
Mubashir Rehman ◽  
Raza Ali Shah ◽  
Muhammad Bilal Khan ◽  
Syed Aziz Shah ◽  
Najah Abed AbuAli ◽  
...  

The disease caused by the recent severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), known as coronavirus disease 2019 (COVID-19), has emerged as a global pandemic with a high mortality rate. The main complication of COVID-19 is rapid respiratory deterioration, which may lead to life-threatening pneumonia. Global healthcare systems are currently facing a scarcity of resources to assist critical patients simultaneously; indeed, non-critical patients are mostly advised to self-isolate or quarantine at home. However, limited healthcare services are available during self-isolation at home. According to research, nearly 20–30% of COVID-19 patients require hospitalization, while almost 5–12% may require intensive care due to severe health conditions. This pandemic requires global healthcare systems that are intelligent, secure, and reliable. Tremendous efforts have already been made to develop non-contact sensing technologies for the diagnosis of COVID-19. The most significant early indication of COVID-19 is rapid, abnormal breathing. In this research work, RF-based technology is used to collect real-time data on breathing abnormalities. Subsequently, based on these data, a large dataset of simulated breathing abnormalities is generated using a curve-fitting technique to develop a machine learning (ML) classification model. The advantages of generating simulated breathing-abnormality data are two-fold: it counters the daunting and time-consuming task of real-time data collection and improves ML model accuracy. Several ML algorithms are used to classify eight breathing abnormalities: eupnea, bradypnea, tachypnea, Biot, sighing, Kussmaul, Cheyne–Stokes, and central sleep apnea (CSA). The performance of the ML algorithms is evaluated on accuracy, prediction speed, and training time for both real and simulated breathing data. 
The results show that the proposed platform classifies real-time breathing patterns with a maximum accuracy of 97.5%, whereas introducing simulated breathing data increases the accuracy to 99.3%. This work has a notable medical impact, as the introduced method mitigates the challenge of collecting a large, realistic dataset during the pandemic.
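The idea of augmenting real traces with model-generated breathing data can be sketched with a toy sinusoidal simulator and a rule-based rate classifier; the waveform model, sampling parameters, and rate cut-offs below are illustrative assumptions and far simpler than the paper's curve fitting on RF data:

```python
# Minimal sketch of simulating breathing waveforms and classifying by rate.
# The sinusoidal model, fs=10 Hz sampling, and bpm cut-offs are assumptions.
import math

def simulate_breathing(rate_bpm, seconds=60, fs=10, amplitude=1.0):
    """Generate an idealized breathing waveform at a given rate (breaths/min)."""
    f = rate_bpm / 60.0  # breaths per second
    return [amplitude * math.sin(2 * math.pi * f * t / fs)
            for t in range(seconds * fs)]

def breaths_per_minute(signal, fs=10):
    """Estimate breathing rate by counting upward zero crossings."""
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if a < 0 <= b)
    return crossings * 60.0 / (len(signal) / fs)

def classify(rate_bpm):
    """Toy rule over three of the eight abnormalities (rate-based only)."""
    if rate_bpm < 12:
        return "bradypnea"
    if rate_bpm <= 20:
        return "eupnea"
    return "tachypnea"

sig = simulate_breathing(rate_bpm=30)
print(classify(breaths_per_minute(sig)))  # prints tachypnea
```

Patterns such as Biot or Cheyne–Stokes depend on waveform shape, not just rate, so a real classifier would use richer features and a trained ML model rather than fixed thresholds.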


Author(s):  
Christopher D O’Connor ◽  
John Ng ◽  
Dallas Hill ◽  
Tyler Frederick

Policing is increasingly being shaped by data collection and analysis. However, we still know little about the quality of the data police services acquire and utilize. Drawing on a survey of analysts from across Canada, this article examines several data collection, analysis, and quality issues. We argue that as we move towards an era of big data policing it is imperative that police services pay more attention to the quality of the data they collect. We conclude by discussing the implications of ignoring data quality issues and the need to develop a more robust research culture in policing.

