Clinical Text De-identification and Other Large Scale Processing Tasks in Resource Constrained Environments

Author(s):  
Richard Jackson ◽  
Richard Dobson ◽  
Robert Stewart

ABSTRACT Objectives: Clinical text de-identification is a common requirement of the ‘enclave’ governance model of ethical EHR research. However, little consideration is often given to the engineering task required to scale these approaches across the hundreds of millions of clinical documents containing personal identifiers that reside in the data repositories of a typical NHS Trust. Similarly, natural language processing is an increasingly important field of clinical data science, yet it requires fault-tolerant approaches to data processing. This work concerns the development of “turbo-laser”, a distributed document processing architecture based on the popular, ‘battle hardened’ Spring Batch framework, an industry standard for large scale processing tasks. Approach: Using Spring Batch, we developed a highly scalable unstructured data processing framework built on the concept of remote partitioning. Remote partitioning allows processing tasks to be offloaded to any and all computers in a network. With this approach, it is possible to harness the entire compute capacity of an organisation, whether it is an office of 15 desktop PCs that go unused overnight or a compute cluster of a thousand processors. This method is especially valuable in the NHS, where the provision of sufficient compute for large scale analytics is often hindered by a lack of available hardware, or by difficulties in navigating technical governance policies ill-equipped for the demands of modern data science. Results: Turbo-laser was developed with the processing challenges common in the NHS in mind. Currently, four types of ‘job’ are available: de-identification using the Cognition algorithm; generic GATE output; text extraction from binary files such as MS Office, PDF and scanned documents; and a document re-compiler to deal with EHR legacy issues. Examples of turbo-laser usage include processing 9 million binary documents on modest hardware within 48 hours.
Conclusion: Turbo-laser is an enterprise-grade processing tool, in keeping with the software engineering pattern of ‘batch processing’ that has long been at the forefront of the informatics movement. As an open source project, it is hoped that others will contribute to and extend its principles, lowering the barrier to large scale data processing throughout the NHS.
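The remote-partitioning idea described above can be illustrated in miniature. The sketch below is not turbo-laser's actual Spring Batch implementation (which is Java and delegates partitions to remote workers over a message channel); it is a minimal Python analogue, with hypothetical function names, showing how a document ID range is cut into partitions that independent workers can process in parallel:

```python
# Illustrative sketch only - not the Spring Batch implementation.
# A column-range partitioner splits [min_id, max_id] into grid_size
# contiguous ID ranges; each range becomes an independent work unit.
from concurrent.futures import ProcessPoolExecutor


def make_partitions(min_id, max_id, grid_size):
    """Return (start, end) ID ranges that tile [min_id, max_id]."""
    span = (max_id - min_id + grid_size) // grid_size  # ceiling division
    parts = []
    start = min_id
    while start <= max_id:
        end = min(start + span - 1, max_id)
        parts.append((start, end))
        start = end + 1
    return parts


def process_partition(part):
    """Stand-in for a worker step (e.g. de-identifying each document
    whose ID falls in the range); returns the count handled."""
    start, end = part
    return end - start + 1


if __name__ == "__main__":
    partitions = make_partitions(1, 9_000_000, grid_size=8)
    with ProcessPoolExecutor(max_workers=8) as pool:
        processed = sum(pool.map(process_partition, partitions))
    print(processed)  # every document covered exactly once
```

In the real framework the pool of local processes is replaced by remote machines, which is what lets idle desktop PCs or a cluster pick up partitions of the same job.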

2021 ◽  
Author(s):  
Min Chen

Abstract Deep learning (DL) techniques, more specifically Convolutional Neural Networks (CNNs), have become increasingly popular in advancing the field of data science and have had great success in a wide array of applications, including computer vision, speech recognition and natural language processing. However, training CNNs is computationally intensive and costly, especially when the dataset is huge. To overcome these obstacles, this paper takes advantage of distributed frameworks and cloud computing to develop a parallel CNN algorithm. MapReduce is a scalable and fault-tolerant data processing tool developed to provide significant improvements in large-scale data-intensive applications in clusters. A MapReduce-based CNN (MCNN) is developed in this work to tackle the task of image classification. In addition, the proposed MCNN adds dropout layers to the networks to tackle the overfitting problem. The implementation of MCNN, as well as how the proposed algorithm accelerates learning, is examined and demonstrated through experiments. Results reveal high classification accuracy and significant improvements in speedup, scaleup and sizeup compared to the standard algorithms.
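The MapReduce pattern underlying MCNN can be sketched with a toy task (the names and the task below are ours for illustration, not taken from the MCNN implementation): mappers each see one data shard and emit a partial aggregate, and a reducer merges the partials, just as mappers would combine partial gradients in data-parallel CNN training:

```python
# Conceptual MapReduce sketch (illustrative, not the MCNN code):
# mappers emit partial (sum, count) statistics per shard; the reducer
# merges them, so the result equals a single-machine computation.
from functools import reduce


def map_shard(shard):
    # Each mapper sees only its own shard of the data.
    return (sum(shard), len(shard))


def reduce_pair(a, b):
    # Merging partial aggregates is associative, so reduction
    # order (and hence worker scheduling) does not matter.
    return (a[0] + b[0], a[1] + b[1])


def distributed_mean(shards):
    total, count = reduce(reduce_pair, map(map_shard, shards))
    return total / count


shards = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
print(distributed_mean(shards))  # 3.5, identical to the single-machine mean
```

Averaging per-shard gradients in mini-batch training has exactly this shape, which is why the pattern parallelises CNN training without changing the result of each update.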


2008 ◽  
Vol 25 (5) ◽  
pp. 287-300 ◽  
Author(s):  
B. Martin ◽  
A. Al‐Shabibi ◽  
S.M. Batraneanu ◽  
Ciobotaru ◽  
G.L. Darlea ◽  
...  

2014 ◽  
Vol 26 (6) ◽  
pp. 1316-1331 ◽  
Author(s):  
Gang Chen ◽  
Tianlei Hu ◽  
Dawei Jiang ◽  
Peng Lu ◽  
Kian-Lee Tan ◽  
...  

2018 ◽  
Vol 7 (2.31) ◽  
pp. 240
Author(s):  
S Sujeetha ◽  
Veneesa Ja ◽  
K Vinitha ◽  
R Suvedha

In the existing scenario, a patient has to go to the hospital to take the necessary tests, consult a doctor and buy prescribed medicines, or use specified healthcare applications. Hence, time is wasted at hospitals and in medical shops, and in the case of healthcare applications, face-to-face interaction with the doctor is not available. The downsides of the existing scenario can be addressed by Medimate: an ailment diffusion control system with real-time large scale data processing. The purpose of Medimate is to establish a teleconference medical system that can be used in remote areas. Medimate is configured for better diagnosis and medical treatment for rural people. The system is fitted with a heart beat sensor, temperature sensor, ultrasonic sensor and load cell to monitor the patient’s health parameters, and voice instructions are provided for easier access. An application enabling video and voice communication with the doctor through a camera and headphones is installed at both ends. The doctor examines the patient and prescribes the medicines, and the medical dispenser delivers medicine to the patient as per the prescription. A QR code is generated for each prescription by Medimate, and that QR code can be reused for repeat medical conditions in the future. Medical details are updated on the server periodically.
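The abstract does not specify how Medimate encodes a prescription into its QR code, so the following is a hypothetical sketch of one workable scheme: serialise the prescription deterministically and derive a short code that could be embedded in the QR image, so the same prescription is recognised when the code is scanned again for a repeat condition. All names here (`prescription_code`, the field names) are ours, not Medimate's:

```python
# Hypothetical prescription-encoding sketch - the real Medimate
# QR payload format is not described in the source.
import hashlib
import json


def prescription_code(prescription: dict) -> str:
    """Deterministic short code for a prescription: canonical JSON
    (sorted keys) hashed with SHA-256, truncated for QR compactness."""
    payload = json.dumps(prescription, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


rx = {"patient_id": "P-1024", "drug": "paracetamol",
      "dose_mg": 500, "times_per_day": 3}

# Key order does not affect the code, so a prescription re-entered
# on a repeat visit maps to the same QR payload as the original.
assert prescription_code(rx) == prescription_code(dict(sorted(rx.items())))
```

A dispenser could then look the code up against stored prescriptions before releasing medicine, rather than trusting the scanned payload directly.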


2019 ◽  
Vol 12 (12) ◽  
pp. 2290-2299
Author(s):  
Azza Abouzied ◽  
Daniel J. Abadi ◽  
Kamil Bajda-Pawlikowski ◽  
Avi Silberschatz

2016 ◽  
Vol 34 (7_suppl) ◽  
pp. 196-196
Author(s):  
Kathryn S. Egan ◽  
Gary H. Lyman ◽  
Karma L. Kreizenbeck ◽  
Catherine R. Fedorenko ◽  
April Alfiler ◽  
...  

Background: Natural language processing (NLP) has the potential to significantly ease the burden of manually abstracting unstructured electronic text when measuring adherence to national guidelines. We incorporated NLP into standard data processing techniques such as manual abstraction and database queries in order to more efficiently evaluate a regional oncology clinic’s adherence to ASCO’s Choosing Wisely colony stimulating factor (CSF) recommendation, using clinical, billing, and cancer registry data. Methods: Database queries on the clinic’s cancer registry yielded the study population of patients with stage II-IV breast, non-small cell lung, and colorectal cancer. We manually abstracted chemotherapy regimens from paper prescription records. CSF orders were collected through queries on the clinic’s facility billing data, when available, and otherwise through a custom NLP program and manual abstraction of the electronic medical record. The NLP program was designed to identify clinical note text containing CSF information, which was then manually abstracted. Results: Of 31,725 clinical notes for the eligible population, the NLP program identified 1,487 clinical notes with CSF-related language, effectively reducing the number of notes requiring abstraction by up to 95%. Between 1/1/2012 and 12/31/2014, adherence to the ASCO Choosing Wisely CSF recommendation at the regional oncology clinic was 89% for a population of 322 patients. Conclusions: NLP significantly reduced the burden of manual abstraction by singling out relevant clinical text for abstractors. Abstraction is often necessary due to the complexity of data collection tasks or the use of paper records, but NLP is a valuable addition to the suite of data processing techniques traditionally used to measure adherence to national guidelines.
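The filtering step that cut 31,725 notes down to 1,487 can be sketched as follows. The study does not describe its NLP program's internals, so this is an assumed, minimal keyword-matching version (the term list and function names are ours): flag only notes whose text mentions CSF terminology, so that only those reach manual abstractors:

```python
# Hypothetical sketch of the note-filtering idea - the study's actual
# NLP program is not described in detail in the source.
import re

# Assumed CSF vocabulary: the recommendation's wording plus common
# CSF agent names; a real system would use a curated clinical lexicon.
CSF_PATTERN = re.compile(
    r"\b(colony[- ]stimulating factor|g-?csf|filgrastim|pegfilgrastim)\b",
    re.IGNORECASE,
)


def needs_abstraction(note_text: str) -> bool:
    """True if the note mentions CSF language and so needs human review."""
    return bool(CSF_PATTERN.search(note_text))


notes = [
    "Patient tolerated cycle 2 well; no fever reported.",
    "Pegfilgrastim 6 mg administered on day 2 of chemotherapy.",
    "Plan: continue G-CSF support next cycle.",
]
flagged = [n for n in notes if needs_abstraction(n)]
print(f"{len(flagged)} of {len(notes)} notes sent to abstractors")
```

Even this crude filter shows where the 95% reduction comes from: abstractors read only the flagged subset, while the unflagged majority is set aside.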

