Can Applications Recover from fsync Failures?

Anthony Rebello; Yuvraj Patel; Ramnatthan Alagappan; Andrea C. Arpaci-Dusseau; Remzi H. Arpaci-Dusseau

doi:10.1145/3450338

Can Applications Recover from fsync Failures?

ACM Transactions on Storage ◽

10.1145/3450338 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-30

Author(s):

Anthony Rebello ◽

Yuvraj Patel ◽

Ramnatthan Alagappan ◽

Andrea C. Arpaci-Dusseau ◽

Remzi H. Arpaci-Dusseau

Keyword(s):

File Systems ◽

Data Loss ◽

Data Intensive ◽

Failure Handling ◽

Data Intensive Applications ◽

Failure Reporting

We analyze how file systems and modern data-intensive applications react to fsync failures. First, we characterize how three Linux file systems (ext4, XFS, Btrfs) behave in the presence of failures. We find commonalities across file systems (pages are always marked clean, certain block writes always lead to unavailability) as well as differences (page content and failure reporting are varied). Next, we study how five widely used applications (PostgreSQL, LMDB, LevelDB, SQLite, Redis) handle fsync failures. Our findings show that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption. Our findings have strong implications for the design of file systems and applications that intend to provide strong durability guarantees.
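The observation that pages are marked clean after a failed fsync suggests that a later fsync may have nothing left to flush, so retrying is not a safe recovery strategy. A minimal Python sketch of one conservative response, treating the first failure as fatal, follows. The names `durable_write` and `DurabilityError` are hypothetical and are not taken from the paper or from any of the studied applications.

```python
import os


class DurabilityError(RuntimeError):
    """Hypothetical error type: the data could not be confirmed durable."""


def durable_write(path, data):
    """Write bytes to path and fsync once; never retry a failed fsync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        # os.write may write fewer bytes than requested, so loop until done.
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        try:
            os.fsync(fd)
        except OSError as exc:
            # Assumption: after a failure the dirty pages may already be
            # marked clean, so a second fsync could report success without
            # the data reaching disk. Surface the error to the caller.
            raise DurabilityError(f"fsync failed for {path}") from exc
    finally:
        os.close(fd)
```

A caller would typically stop, discard any in-memory state derived from the write, and recover from a known-good state rather than continue as if the write had succeeded.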

Performance-efficient Recommendation and Prediction Service for Big Data frameworks focusing on Data Compression and In-memory Data Storage Indicators

Scalable Computing Practice and Experience ◽

10.12694/scpe.v22i4.1945 ◽

2021 ◽

Vol 22 (4) ◽

pp. 401-412

Author(s):

Hrachya Astsatryan ◽

Arthur Lalayan ◽

Aram Kocharyan ◽

Daniel Hagimont

Keyword(s):

Big Data ◽

Data Compression ◽

Data Storage ◽

File Systems ◽

Large Datasets ◽

Data Sets ◽

Mapreduce Framework ◽

Data Intensive ◽

Parallel Data ◽

Data Intensive Applications

The MapReduce framework manages Big Data sets by splitting large datasets into a set of distributed blocks and processing them in parallel. Data compression and in-memory file systems are widely used in Big Data processing to reduce resource-intensive I/O operations and correspondingly improve the I/O rate. The article presents a performance-efficient, modular, configurable, and robust decision-making service relying on data compression and in-memory data storage indicators. The service consists of Recommendation and Prediction modules; it predicts the execution time of a given job based on metrics and recommends the best configuration parameters to improve the performance of the Hadoop and Spark frameworks. Several CPU- and data-intensive applications and micro-benchmarks, including Log Analyzer, WordCount, and K-Means, have been evaluated to improve the performance.
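The split-then-process-in-parallel idea can be illustrated with WordCount, one of the benchmarks named above. The sketch below is an in-process stand-in that uses a thread pool; it is not Hadoop or Spark code, and the function names and the `block_size` parameter are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_blocks(lines, block_size):
    """Split the input into fixed-size blocks of lines."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: count the words in a single block."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def word_count(lines, block_size=2, workers=4):
    """Run the map step on each block in parallel, then reduce by summing."""
    blocks = split_blocks(lines, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce(lambda a, b: a + b, partials, Counter())
```

For example, `word_count(["a b a", "b c", "a"])` counts three occurrences of `a`, two of `b`, and one of `c`.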

Towards Data Intensive Many-Task Computing

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Data Intensive Distributed Computing ◽

10.4018/978-1-61520-971-2.ch002 ◽

2012 ◽

pp. 28-73 ◽

Cited By ~ 8

Author(s):

Ioan Raicu ◽

Ian Foster ◽

Yong Zhao ◽

Alex Szalay ◽

Philip Little ◽

...

Keyword(s):

High Performance ◽

File Systems ◽

Data Locality ◽

Resource Provisioning ◽

Parallel File Systems ◽

Data Intensive ◽

Dynamic Resource Provisioning ◽

Rate Of Increase ◽

Parallel File ◽

Data Intensive Applications

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today’s largest systems for data-intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a “data diffusion” approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online caching eviction policy. They also offer many empirical experiments to explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.
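To make the role of data locality concrete, here is a rough Python sketch of locality-aware dispatch with per-node caches. The abstract does not specify the chapter's scheduling or eviction policies, so this sketch swaps in plain LRU eviction and a "prefer a node that already holds the data, else the least-loaded node" rule; the `Node` class and `dispatch` function are hypothetical.

```python
from collections import OrderedDict


class Node:
    """A compute node with a small LRU cache of data items (assumed model)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.cache = OrderedDict()
        self.load = 0  # number of tasks dispatched to this node so far

    def access(self, item):
        """Run a task on item here; return True on a cache hit."""
        hit = item in self.cache
        if hit:
            self.cache.move_to_end(item)  # mark as most recently used
        else:
            self.cache[item] = True
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        self.load += 1
        return hit


def dispatch(nodes, item):
    """Prefer nodes that already cache item; break ties by lowest load."""
    holders = [n for n in nodes if item in n.cache]
    node = min(holders or nodes, key=lambda n: n.load)
    return node, node.access(item)
```

Repeated tasks on the same item land on the node that already holds it, which is the locality effect the chapter argues for; the specific policies here are only placeholders.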

I/O and File Systems for Data-Intensive Applications

Handbook on Data Centers ◽

10.1007/978-1-4939-2092-1_18 ◽

2015 ◽

pp. 561-582

Author(s):

Yanlong Yin ◽

Hui Jin ◽

Xian-He Sun

Keyword(s):

File Systems ◽

Data Intensive ◽

Data Intensive Applications

Understanding performance of distributed data-intensive applications

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2010.0168 ◽

2010 ◽

Vol 368 (1926) ◽

pp. 4089-4102 ◽

Cited By ~ 4

Author(s):

Christopher Miceli ◽

Michael Miceli ◽

Bety Rodriguez-Milla ◽

Shantenu Jha

Keyword(s):

File Systems ◽

Data Placement ◽

Distributed Data ◽

Distributed File Systems ◽

Sequence Matching ◽

Data Intensive ◽

Relative Placement ◽

Level Data ◽

A Genome ◽

Data Intensive Applications

Grids, clouds and cloud-like infrastructures are capable of supporting a broad range of data-intensive applications. There are interesting and unique performance issues that appear as the volume of data and degree of distribution increases. New scalable data-placement and management techniques, as well as novel approaches to determine the relative placement of data and computational workload, are required. We develop and study a genome sequence matching application that is simple to control and deploy, yet serves as a prototype of a data-intensive application. The application uses a SAGA-based implementation of the All-Pairs pattern. This paper aims to understand some of the factors that influence the performance of this application and the interplay of those factors. We also demonstrate how the SAGA approach can enable data-intensive applications to be extensible and interoperable over a range of infrastructure. This capability enables us to compare and contrast two different approaches for executing distributed data-intensive applications—simple application-level data-placement heuristics versus distributed file systems.
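The All-Pairs pattern applies a comparison function to every pair drawn from two sets. A small Python sketch under stated assumptions: `match_score` is a toy similarity measure invented for illustration, and nothing here uses SAGA or reflects the paper's actual implementation.

```python
from itertools import product


def match_score(a, b):
    """Toy similarity: matching positions divided by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / longest


def all_pairs(set_a, set_b, func):
    """Apply func to every (a, b) pair; keys are (index_a, index_b)."""
    return {
        (i, j): func(a, b)
        for (i, a), (j, b) in product(enumerate(set_a), enumerate(set_b))
    }
```

Because each pair is independent, the work can be partitioned across machines, which is where the relative placement of data and computation starts to matter.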

6G Enabled Smart Infrastructure for Sustainable Society: Opportunities, Challenges, and Research Roadmap

Sensors ◽

10.3390/s21051709 ◽

2021 ◽

Vol 21 (5) ◽

pp. 1709

Author(s):

Agbotiname Lucky Imoize ◽

Oluwadara Adedeji ◽

Nistha Tandiya ◽

Sachin Shetty

Keyword(s):

Wireless Communication ◽

Psychological Health ◽

Future Research ◽

Agriculture Education ◽

Social Psychological ◽

Research Issues ◽

Data Intensive ◽

Wireless Communication Network ◽

Data Intensive Applications

The 5G wireless communication network is currently faced with the challenge of limited data speed exacerbated by the proliferation of billions of data-intensive applications. To address this problem, researchers are developing cutting-edge technologies for the envisioned 6G wireless communication standards to satisfy the escalating wireless services demands. Though some of the candidate technologies in the 5G standards will apply to 6G wireless networks, key disruptive technologies that will guarantee the desired quality of physical experience to achieve ubiquitous wireless connectivity are expected in 6G. This article first provides a foundational background on the evolution of different wireless communication standards to have a proper insight into the vision and requirements of 6G. Second, we provide a panoramic view of the enabling technologies proposed to facilitate 6G and introduce emerging 6G applications such as multi-sensory–extended reality, digital replica, and more. Next, the technology-driven challenges, social, psychological, health and commercialization issues posed to actualizing 6G, and the probable solutions to tackle these challenges are discussed extensively. Additionally, we present new use cases of the 6G technology in agriculture, education, media and entertainment, logistics and transportation, and tourism. Furthermore, we discuss the multi-faceted communication capabilities of 6G that will contribute significantly to global sustainability and how 6G will bring about a dramatic change in the business arena. Finally, we highlight the research trends, open research issues, and key take-away lessons for future research exploration in 6G wireless communication.

Exploratory Development of Data-intensive Applications

Proceedings of the International Conference on the Art, Science, and Engineering of Programming - Programming '17 ◽

10.1145/3079368.3079399 ◽

2017 ◽

Cited By ~ 1

Author(s):

Patrick Rein ◽

Marcel Taeumel ◽

Robert Hirschfeld ◽

Michael Perscheid

Keyword(s):

Data Intensive ◽

Data Intensive Applications

EZIOTracer

ACM SIGOPS Operating Systems Review ◽

10.1145/3469379.3469391 ◽

2021 ◽

Vol 55 (1) ◽

pp. 88-98

Author(s):

Mohammed Islam Naas ◽

François Trahay ◽

Alexis Colin ◽

Pierre Olivier ◽

Stéphane Rubini ◽

...

Keyword(s):

Analysis Framework ◽

Comprehensive Understanding ◽

Kernel Space ◽

Data Intensive ◽

Storage Performance ◽

Performance Requirements ◽

Memory Footprint ◽

Extreme Performance ◽

Data Intensive Applications ◽

Kernel Level

Tracing is a popular method for evaluating, investigating, and modeling the performance of today's storage systems. Tracing has become crucial with the increase in complexity of modern storage applications and systems, which manipulate an ever-increasing amount of data and are subject to extreme performance requirements. There exist many tracing tools focusing on either the user level or the kernel level; however, we observe the lack of a unified tracer targeting both levels: this prevents a comprehensive understanding of modern applications' storage performance profiles. In this paper, we present EZIOTracer, a unified I/O tracer for both (Linux) kernel and user spaces, targeting data-intensive applications. EZIOTracer is composed of a userland as well as a kernel space tracer, complemented with a trace analysis framework able to merge the output of the two tracers, and in particular to relate user-level events to kernel-level ones, and vice versa. On the kernel side, EZIOTracer relies on eBPF to offer safe, low-overhead, low memory footprint, and flexible tracing capabilities. Using the FIO benchmark, we demonstrate the ability of EZIOTracer to track down I/O performance issues by relating events recorded at both the kernel and user levels. We show that this can be achieved with a relatively low overhead that ranges from 2% to 26% depending on the I/O intensity.
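Merging two traces and relating user-level events to kernel-level ones can be sketched in a few lines of Python. This is not EZIOTracer's trace format or analysis algorithm: the `Event` fields, the event names in the usage note, and the "most recent user event of the same process" rule are all assumptions made for illustration.

```python
import heapq
from typing import NamedTuple


class Event(NamedTuple):
    ts: float    # timestamp in seconds
    level: str   # "user" or "kernel"
    pid: int     # process that produced the event
    name: str    # event name


def merge_traces(user_events, kernel_events):
    """Merge two traces, each already sorted by timestamp, into one."""
    return list(heapq.merge(user_events, kernel_events, key=lambda e: e.ts))


def relate(merged):
    """Pair each kernel event with the latest user event of the same pid."""
    last_user = {}
    pairs = []
    for ev in merged:
        if ev.level == "user":
            last_user[ev.pid] = ev
        elif ev.pid in last_user:
            pairs.append((last_user[ev.pid].name, ev.name))
    return pairs
```

Given a user trace with `write` at 1.0 s and `fsync` at 5.0 s, and a kernel trace with hypothetical events `vfs_write` at 1.5 s and `blk_flush` at 5.2 s for the same process, `relate(merge_traces(user, kernel))` pairs `write` with `vfs_write` and `fsync` with `blk_flush`.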

Domain Metric Driven Decomposition of Data-Intensive Applications

2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW) ◽

10.1109/issrew51248.2020.00071 ◽

2020 ◽

Author(s):

Matteo Camilli ◽

Carmine Colarusso ◽

Barbara Russo ◽

Eugenio Zimeo

Keyword(s):

Data Intensive ◽

Data Intensive Applications

Energy Efficient Storage Management Cooperated with Large Data Intensive Applications

2012 IEEE 28th International Conference on Data Engineering ◽

10.1109/icde.2012.47 ◽

2012 ◽

Cited By ~ 7

Author(s):

Norifumi Nishikawa ◽

Miyuki Nakano ◽

Masaru Kitsuregawa

Keyword(s):

Energy Efficient ◽

Large Data ◽

Storage Management ◽

Data Intensive ◽

Efficient Storage ◽

Data Intensive Applications

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics ◽

10.3390/electronics10121471 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1471

Author(s):

Jun-Yeong Lee ◽

Moon-Hyun Kim ◽

Syed Asif Raza Shah ◽

Sang-Un Ahn ◽

Heejun Yoon ◽

...

Keyword(s):

Data Storage ◽

Scale Up ◽

File Systems ◽

Performance Evaluations ◽

Distributed File Systems ◽

Data Intensive Computing ◽

Data Intensive ◽

Tremendous Amount ◽

Computing Environments ◽

And Performance

Data are important and ever growing in data-intensive scientific environments. Such research data growth requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology combining multiple disks into a single large logical volume, has been widely used for the purpose of data redundancy and performance improvement. However, this requires RAID-capable hardware or software to build up a RAID-enabled disk array. In addition, it is difficult to scale up RAID-based storage. In order to mitigate such a problem, many distributed file systems have been developed and are being actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data has to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre, and EOS, for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect the read and write performance depending on the features of data, which have to be considered in data-intensive computing environments.