Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems

Author(s): Sarp Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, et al.
2015, Vol 101 (4), pp. 392-397
Author(s): Trevor Duke, Edilson Yano, Adrian Hutchinson, Ilomo Hwaihwanje, Jimmy Aipit, et al.

Although the WHO recommends that all countries use International Classification of Diseases (ICD)-10 coding for reporting health data, accurate health facility data are rarely available in developing or low- and middle-income countries. Compliance with ICD-10 is extremely resource intensive, and the lack of real data seriously undermines evidence-based approaches to improving quality of care and to clinical and public health programme management. We developed a simple tool for the collection of accurate admission and outcome data and implemented it in 16 provincial hospitals in Papua New Guinea over 6 years. The programme was low cost and easy for ward clerks and nurses to use. Over 6 years, it gathered data on the causes of 96 998 admissions of children and 7128 deaths. National reports on child morbidity and mortality were produced each year, summarising the incidence and mortality rates for 21 common conditions of children and newborns and the lessons learned for policy and practice. These data informed the National Policy and Plan for Child Health and triggered the implementation of a process of clinical quality improvement and other interventions to reduce mortality in the neediest areas, focusing on the diseases with the highest burdens. It is possible to collect large-scale data on paediatric morbidity and mortality, for use locally by the health workers who gather it and nationally to improve policy and practice, even in very resource-limited settings where ICD-10 coding systems such as those in some high-income countries are not feasible or affordable.


2012, Vol 241-244, pp. 1556-1561
Author(s): Qi Meng Wu, Ke Xie, Ming Fa Zhu, Li Min Xiao, Li Ruan

Parallel file systems deploy multiple metadata servers to distribute the heavy metadata workload generated by clients. As the number of metadata servers grows, metadata-intensive operations face coordination problems among the servers that compromise the performance gain. A file system simulator is therefore valuable for evaluating optimization ideas that address these problems. In this paper, we propose DMFSsim to simulate metadata-intensive operations on large-scale distributed-metadata file systems. DMFSsim can flexibly replay traces of multiple metadata operations, supports several commonly used metadata distribution algorithms, and simulates the file system tree hierarchy and the underlying disk-block management mechanisms found in real systems. Extensive simulations show that DMFSsim is capable of demonstrating the performance of metadata-intensive operations in distributed-metadata file systems.
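For a concrete picture of what such a simulator does, the sketch below replays a toy trace of metadata operations against two assumed placement policies (hash-based and subtree-based) and reports the resulting load per metadata server. This is a minimal illustration only; DMFSsim's actual trace format, distribution algorithms, and disk-block management model are not reproduced here, and the paths and operation names are invented for the example.

```python
# Minimal trace-replay sketch for distributed metadata placement (illustrative
# only; not DMFSsim's real interfaces or cost model).
from collections import Counter
import hashlib

NUM_MDS = 4  # number of simulated metadata servers


def stable_hash(text: str) -> int:
    """Deterministic hash so runs are reproducible across interpreter sessions."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


def place_by_hash(path: str) -> int:
    """Hash-based distribution: spread every path across the metadata servers."""
    return stable_hash(path) % NUM_MDS


def place_by_subtree(path: str) -> int:
    """Subtree partitioning: keep each top-level directory on one metadata server."""
    top = path.strip("/").split("/")[0]
    return stable_hash(top) % NUM_MDS


def replay(trace, place):
    """Replay (operation, path) records and count the load on each server."""
    load = Counter()
    for _op, path in trace:
        load[place(path)] += 1  # every metadata operation costs one unit here
    return dict(load)


if __name__ == "__main__":
    trace = [
        ("create", "/proj/a/file1"), ("stat", "/proj/a/file1"),
        ("mkdir", "/proj/b"), ("create", "/proj/b/file2"),
        ("stat", "/proj/b/file2"), ("unlink", "/proj/a/file1"),
    ]
    print("hash placement   :", replay(trace, place_by_hash))
    print("subtree placement:", replay(trace, place_by_subtree))
```

Comparing per-server counts under different placement policies before touching a production system is exactly the kind of question a simulator of this type is built to answer.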


Author(s): M. Sutter, V. Hartmann, M. Götter, J. van Wezel, A. Trunov, et al.

Author(s): Anthony Kougkas, Hassan Eslami, Xian-He Sun, Rajeev Thakur, William Gropp

Key–value stores are widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architectural differences and performance characteristics of parallel file systems and key–value stores. We propose using key–value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems handle poorly, such as those with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of the two systems using data collected from our experiments, and we provide a predictive method to identify which system offers better I/O performance for a given workload. The results show that I/O performance in HPC systems can be optimized by utilizing key–value stores.
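The abstract does not give the fitted model, but the shape of such a predictive method can be sketched as a simple cost comparison: estimate the time a workload would take on each system from its data volume, metadata-operation count, and synchronization count, then pick the cheaper one. Every constant below is a placeholder assumption, not a measured coefficient from the study.

```python
# Illustrative workload-based chooser between a parallel file system (PFS) and
# a key-value (KV) store. The cost model and all parameters are placeholders.
from dataclasses import dataclass


@dataclass
class Workload:
    total_bytes: float   # bytes moved by the workload
    metadata_ops: int    # opens, creates, stats, ...
    sync_ops: int        # explicit synchronization points


def predict_time(w: Workload, bandwidth: float, metadata_cost: float, sync_cost: float) -> float:
    """Toy linear cost model: transfer time plus per-operation overheads."""
    return (w.total_bytes / bandwidth
            + w.metadata_ops * metadata_cost
            + w.sync_ops * sync_cost)


def choose_storage(w: Workload) -> str:
    # Placeholder parameters (bytes/s and seconds/op). A real predictor would
    # fit these from benchmark measurements, as the study describes.
    t_pfs = predict_time(w, bandwidth=5e9, metadata_cost=2e-3, sync_cost=5e-3)
    t_kv = predict_time(w, bandwidth=1e9, metadata_cost=2e-4, sync_cost=5e-4)
    return "key-value store" if t_kv < t_pfs else "parallel file system"


if __name__ == "__main__":
    metadata_heavy = Workload(total_bytes=1e8, metadata_ops=500_000, sync_ops=50_000)
    bulk_io = Workload(total_bytes=5e11, metadata_ops=1_000, sync_ops=10)
    print(choose_storage(metadata_heavy))  # per-op overheads dominate -> key-value store
    print(choose_storage(bulk_io))         # bandwidth dominates -> parallel file system
```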


2017
Author(s): Julie A McMurry, Nick Juty, Niklas Blomberg, Tony Burdett, Tom Conlin, et al.

In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline ten lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision, and reuse of identifiers; we also outline important considerations for those referencing identifiers in various circumstances, including authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness of how to avoid and manage common identifier problems, especially those related to persistence and web accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
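One recurring theme, the persistence and resolvability of web-based identifiers, can be illustrated with a small sketch that validates compact identifiers (prefix:accession) against per-prefix patterns and builds resolver URLs. The prefixes, the simplified regular expressions, and the identifiers.org-style URL form used here are illustrative assumptions, not prescriptions taken from the paper.

```python
# Illustrative compact-identifier (prefix:accession) helper. The prefix
# registry, the simplified regex patterns, and the resolver URL form are
# assumptions made for this sketch.
import re

# Simplified per-prefix patterns for the accession (local) part.
PREFIX_PATTERNS = {
    "pubmed": r"^\d+$",
    "doi": r"^10\.\d{4,9}/\S+$",
}


def is_valid(curie: str) -> bool:
    """Check a compact identifier such as 'pubmed:12345678' against its pattern."""
    prefix, sep, accession = curie.partition(":")
    pattern = PREFIX_PATTERNS.get(prefix.lower())
    return bool(sep and pattern and re.match(pattern, accession))


def resolve(curie: str) -> str:
    """Build a resolver URL (identifiers.org-style); reject malformed identifiers."""
    if not is_valid(curie):
        raise ValueError(f"unrecognized or malformed identifier: {curie}")
    return f"https://identifiers.org/{curie}"


if __name__ == "__main__":
    print(resolve("pubmed:12345678"))
    print(resolve("doi:10.1000/example"))
```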

