Best Practices and Lessons Learned from Deploying and Operating Large-Scale Data-Centric Parallel File Systems

Author(s): Sarp Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, et al.
2015, Vol 101 (4), pp. 392-397
Author(s): Trevor Duke, Edilson Yano, Adrian Hutchinson, Ilomo Hwaihwanje, Jimmy Aipit, et al.

Although the WHO recommends that all countries use International Classification of Diseases (ICD)-10 coding for reporting health data, accurate health facility data are rarely available in developing or low- and middle-income countries. Compliance with ICD-10 is extremely resource intensive, and the lack of real data seriously undermines evidence-based approaches to improving quality of care and to clinical and public health programme management. We developed a simple tool for the collection of accurate admission and outcome data and implemented it in 16 provincial hospitals in Papua New Guinea over 6 years. The programme was low cost and easy for ward clerks and nurses to use. Over 6 years, it gathered data on the causes of 96 998 admissions of children and 7128 deaths. National reports on child morbidity and mortality were produced each year, summarising the incidence and mortality rates for 21 common conditions of children and newborns and the lessons learned for policy and practice. These data informed the National Policy and Plan for Child Health and triggered the implementation of a process of clinical quality improvement and other interventions to reduce mortality in the neediest areas, focusing on the diseases with the highest burdens. It is possible to collect large-scale data on paediatric morbidity and mortality, for use locally by the health workers who gather it and nationally to improve policy and practice, even in very resource-limited settings where ICD-10 coding systems such as those in some high-income countries are not feasible or affordable.


2012, Vol 241-244, pp. 1556-1561
Author(s): Qi Meng Wu, Ke Xie, Ming Fa Zhu, Li Min Xiao, Li Ruan

Parallel file systems deploy multiple metadata servers to distribute the heavy metadata workload generated by clients. As the number of metadata servers grows, metadata-intensive operations face coordination problems among the servers that compromise the performance gain. A file system simulator is therefore valuable for evaluating optimization ideas that address these problems. In this paper, we propose DMFSsim to simulate metadata-intensive operations on large-scale distributed-metadata file systems. DMFSsim can flexibly replay traces of multiple metadata operations, supports several commonly used metadata distribution algorithms, and simulates the file system tree hierarchy and the underlying disk-block management mechanisms found in real systems. Extensive simulations show that DMFSsim is capable of demonstrating the performance of metadata-intensive operations in distributed-metadata file systems.
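For a concrete picture of what such a simulator does, the sketch below replays a toy trace of metadata operations against two assumed placement policies (hash-based and subtree-based) and reports the resulting load per metadata server. This is a minimal illustration only; DMFSsim's actual trace format, distribution algorithms, and disk-block management model are not reproduced here, and the paths and operation names are invented for the example.

```python
# Minimal trace-replay sketch for distributed metadata placement (illustrative
# only; not DMFSsim's real interfaces or cost model).
from collections import Counter
import hashlib

NUM_MDS = 4  # number of simulated metadata servers


def stable_hash(text: str) -> int:
    """Deterministic hash so runs are reproducible across interpreter sessions."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


def place_by_hash(path: str) -> int:
    """Hash-based distribution: spread every path across the metadata servers."""
    return stable_hash(path) % NUM_MDS


def place_by_subtree(path: str) -> int:
    """Subtree partitioning: keep each top-level directory on one metadata server."""
    top = path.strip("/").split("/")[0]
    return stable_hash(top) % NUM_MDS


def replay(trace, place):
    """Replay (operation, path) records and count the load on each server."""
    load = Counter()
    for _op, path in trace:
        load[place(path)] += 1  # every metadata operation costs one unit here
    return dict(load)


if __name__ == "__main__":
    trace = [
        ("create", "/proj/a/file1"), ("stat", "/proj/a/file1"),
        ("mkdir", "/proj/b"), ("create", "/proj/b/file2"),
        ("stat", "/proj/b/file2"), ("unlink", "/proj/a/file1"),
    ]
    print("hash placement   :", replay(trace, place_by_hash))
    print("subtree placement:", replay(trace, place_by_subtree))
```

Comparing per-server counts under different placement policies before touching a production system is exactly the kind of question a simulator of this type is built to answer.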


Author(s): M. Sutter, V. Hartmann, M. Götter, J. van Wezel, A. Trunov, et al.

Author(s): Anthony Kougkas, Hassan Eslami, Xian-He Sun, Rajeev Thakur, William Gropp

Key–value stores are widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architectural differences and performance characteristics of parallel file systems and key–value stores. We propose using key–value stores to optimize overall Input/Output (I/O) performance, especially for workloads that parallel file systems handle poorly, such as those with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of the two systems using data collected from our experiments, and we provide a predictive method to identify which system offers better I/O performance for a given workload. The results show that I/O performance in HPC systems can be optimized by utilizing key–value stores.
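The abstract does not give the fitted model, but the shape of such a predictive method can be sketched as a simple cost comparison: estimate the time a workload would take on each system from its data volume, metadata-operation count, and synchronization count, then pick the cheaper one. Every constant below is a placeholder assumption, not a measured coefficient from the study.

```python
# Illustrative workload-based chooser between a parallel file system (PFS) and
# a key-value (KV) store. The cost model and all parameters are placeholders.
from dataclasses import dataclass


@dataclass
class Workload:
    total_bytes: float   # bytes moved by the workload
    metadata_ops: int    # opens, creates, stats, ...
    sync_ops: int        # explicit synchronization points


def predict_time(w: Workload, bandwidth: float, metadata_cost: float, sync_cost: float) -> float:
    """Toy linear cost model: transfer time plus per-operation overheads."""
    return (w.total_bytes / bandwidth
            + w.metadata_ops * metadata_cost
            + w.sync_ops * sync_cost)


def choose_storage(w: Workload) -> str:
    # Placeholder parameters (bytes/s and seconds/op). A real predictor would
    # fit these from benchmark measurements, as the study describes.
    t_pfs = predict_time(w, bandwidth=5e9, metadata_cost=2e-3, sync_cost=5e-3)
    t_kv = predict_time(w, bandwidth=1e9, metadata_cost=2e-4, sync_cost=5e-4)
    return "key-value store" if t_kv < t_pfs else "parallel file system"


if __name__ == "__main__":
    metadata_heavy = Workload(total_bytes=1e8, metadata_ops=500_000, sync_ops=50_000)
    bulk_io = Workload(total_bytes=5e11, metadata_ops=1_000, sync_ops=10)
    print(choose_storage(metadata_heavy))  # per-op overheads dominate -> key-value store
    print(choose_storage(bulk_io))         # bandwidth dominates -> parallel file system
```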


2017
Author(s): Julie A McMurry, Nick Juty, Niklas Blomberg, Tony Burdett, Tom Conlin, et al.

In many disciplines, data are highly decentralized across thousands of online databases (repositories, registries, and knowledgebases). Wringing value from such databases depends on the discipline of data science and on the humble bricks and mortar that make integration possible; identifiers are a core component of this integration infrastructure. Drawing on our experience and on work by other groups, we outline ten lessons we have learned about the identifier qualities and best practices that facilitate large-scale data integration. Specifically, we propose actions that identifier practitioners (database providers) should take in the design, provision, and reuse of identifiers; we also outline important considerations for those referencing identifiers in various circumstances, including authors and data generators. While the importance and relevance of each lesson will vary by context, there is a need for increased awareness of how to avoid and manage common identifier problems, especially those related to persistence and web accessibility/resolvability. We focus strongly on web-based identifiers in the life sciences; however, the principles are broadly relevant to other disciplines.
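One recurring theme, the persistence and resolvability of web-based identifiers, can be illustrated with a small sketch that validates compact identifiers (prefix:accession) against per-prefix patterns and builds resolver URLs. The prefixes, the simplified regular expressions, and the identifiers.org-style URL form used here are illustrative assumptions, not prescriptions taken from the paper.

```python
# Illustrative compact-identifier (prefix:accession) helper. The prefix
# registry, the simplified regex patterns, and the resolver URL form are
# assumptions made for this sketch.
import re

# Simplified per-prefix patterns for the accession (local) part.
PREFIX_PATTERNS = {
    "pubmed": r"^\d+$",
    "doi": r"^10\.\d{4,9}/\S+$",
}


def is_valid(curie: str) -> bool:
    """Check a compact identifier such as 'pubmed:12345678' against its pattern."""
    prefix, sep, accession = curie.partition(":")
    pattern = PREFIX_PATTERNS.get(prefix.lower())
    return bool(sep and pattern and re.match(pattern, accession))


def resolve(curie: str) -> str:
    """Build a resolver URL (identifiers.org-style); reject malformed identifiers."""
    if not is_valid(curie):
        raise ValueError(f"unrecognized or malformed identifier: {curie}")
    return f"https://identifiers.org/{curie}"


if __name__ == "__main__":
    print(resolve("pubmed:12345678"))
    print(resolve("doi:10.1000/example"))
```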

