SepStore: Data Storage Accelerator for Distributed File Systems by Separating Small Files from Large Files

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics ◽

10.3390/electronics10121471 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1471

Author(s):

Jun-Yeong Lee ◽

Moon-Hyun Kim ◽

Syed Asif Raza Raza Shah ◽

Sang-Un Ahn ◽

Heejun Yoon ◽

...

Keyword(s):

Data Storage ◽

Scale Up ◽

File Systems ◽

Performance Evaluations ◽

Distributed File Systems ◽

Data Intensive Computing ◽

Data Intensive ◽

Tremendous Amount ◽

Computing Environments ◽

And Performance

Data are important and ever growing in data-intensive scientific environments. Such research data growth requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology combining multiple disks into a single large logical volume, has been widely used for the purpose of data redundancy and performance improvement. However, this requires RAID-capable hardware or software to build up a RAID-enabled disk array. In addition, it is difficult to scale up the RAID-based storage. In order to mitigate such a problem, many distributed file systems have been developed and are being actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data have to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS for data-intensive environments. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect the read and write performance depending on the features of data, which have to be considered in data-intensive computing environments.

Download Full-text

Modeling of distributed file System in big data storage by event- B

MATEC Web of Conferences ◽

10.1051/matecconf/201821004042 ◽

2018 ◽

Vol 210 ◽

pp. 04042

Author(s):

Ammar Alhaj Ali ◽

Pavel Varacha ◽

Said Krayem ◽

Roman Jasek ◽

Petr Zacek ◽

...

Keyword(s):

Big Data ◽

Data Storage ◽

High Performance ◽

File System ◽

Formal Method ◽

File Systems ◽

Distributed File System ◽

Distributed File Systems ◽

Data Systems ◽

Big Data Systems

Nowadays, a wide set of systems and application, especially in high performance computing, depends on distributed environments to process and analyses huge amounts of data. As we know, the amount of data increases enormously, and the goal to provide and develop efficient, scalable and reliable storage solutions has become one of the major issue for scientific computing. The storage solution used by big data systems is Distributed File Systems (DFSs), where DFS is used to build a hierarchical and unified view of multiple file servers and shares on the network. In this paper we will offer Hadoop Distributed File System (HDFS) as DFS in big data systems and we will present an Event-B as formal method that can be used in modeling, where Event-B is a mature formal method which has been widely used in a number of industry projects in a number of domains, such as automotive, transportation, space, business information, medical device and so on, And will propose using the Rodin as modeling tool for Event-B, which integrates modeling and proving as well as the Rodin platform is open source, so it supports a large number of plug-in tools.

Download Full-text

Role of Open Source Software in Big Data Storage

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch005 ◽

2018 ◽

pp. 123-150 ◽

Cited By ~ 1

Author(s):

Rupali Ahuja ◽

Jigyasa Malik ◽

Ronak Tyagi ◽

R. Brinda

Keyword(s):

Big Data ◽

Open Source ◽

Data Storage ◽

Open Source Software ◽

File Systems ◽

Distributed File Systems ◽

The World ◽

Storage Technologies ◽

Big Data Storage

Today, the world is revolving around Big Data. Each organization is trying hard to explore ways for deriving value out of huge pile of data we are generating each moment. Open Source Software are widely being adopted by most academicians, researchers and industrialists to handle various Big Data needs because of their easy availability, flexibility, affordability and interoperability. As a result, several open source Big Data tools have been developed. This chapter discusses the role of Open Source Software in Big Data Storage and how various organizations have benefitted from its use. It provides an overview of popular Open Source Big Data Storage technologies existing today. Distributed File Systems and NoSQL databases meant for storing Big Data have been discussed with their features, applications and comparison.

Download Full-text

Using Hadoop Distributed and Deduplicated File System (HD2FS) in Astronomy

Proceedings of the International Astronomical Union ◽

10.1017/s1743921321000387 ◽

2019 ◽

Vol 15 (S367) ◽

pp. 464-466

Author(s):

Paul Bartus

Keyword(s):

Data Storage ◽

Storage Capacity ◽

File System ◽

Storage Systems ◽

Storage System ◽

File Systems ◽

Distributed File System ◽

Distributed File Systems ◽

Output Performance ◽

Hadoop Distributed File System

AbstractDuring the last years, the amount of data has skyrocketed. As a consequence, the data has become more expensive to store than to generate. The storage needs for astronomical data are also following this trend. Storage systems in Astronomy contain redundant copies of data such as identical files or within sub-file regions. We propose the use of the Hadoop Distributed and Deduplicated File System (HD2FS) in Astronomy. HD2FS is a deduplication storage system that was created to improve data storage capacity and efficiency in distributed file systems without compromising Input/Output performance. HD2FS can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy of data in astronomy and reduce the space needed to store these files in the file systems, thus allowing for more capacity per volume.

Download Full-text

Role of Open Source Software in Big Data Storage

Research Anthology on Usage and Development of Open Source Software ◽

10.4018/978-1-7998-9158-1.ch043 ◽

2021 ◽

pp. 876-904

Author(s):

Rupali Ahuja ◽

Jigyasa Malik ◽

Ronak Tyagi ◽

R. Brinda

Keyword(s):

Big Data ◽

Open Source ◽

Data Storage ◽

Open Source Software ◽

File Systems ◽

Distributed File Systems ◽

The World ◽

Storage Technologies ◽

Big Data Storage

Today, the world is revolving around Big Data. Each organization is trying hard to explore ways for deriving value out of huge pile of data we are generating each moment. Open Source Software are widely being adopted by most academicians, researchers and industrialists to handle various Big Data needs because of their easy availability, flexibility, affordability and interoperability. As a result, several open source Big Data tools have been developed. This chapter discusses the role of Open Source Software in Big Data Storage and how various organizations have benefitted from its use. It provides an overview of popular Open Source Big Data Storage technologies existing today. Distributed File Systems and NoSQL databases meant for storing Big Data have been discussed with their features, applications and comparison.

Download Full-text

A validated performance model for distributed file systems

Journal of Systems and Software ◽

10.1016/0164-1212(89)90030-7 ◽

1989 ◽

Vol 10 (3) ◽

pp. 169-185 ◽

Cited By ~ 1

Author(s):

Anna Hać

Keyword(s):

File Systems ◽

Performance Model ◽

Distributed File Systems

Download Full-text

PABIRS: A data access middleware for distributed file systems

2015 IEEE 31st International Conference on Data Engineering ◽

10.1109/icde.2015.7113277 ◽

2015 ◽

Cited By ~ 1

Author(s):

Sai Wu ◽

Gang Chen ◽

Xianke Zhou ◽

Zhenjie Zhang ◽

Anthony K. H. Tung ◽

...

Keyword(s):

File Systems ◽

Data Access ◽

Distributed File Systems

Download Full-text

Distributed File Systems

Proceedings Thirteenth IEEE Symposium on Mass Storage Systems. Toward Distributed Storage and Data Management Systems ◽

10.1109/mass.1994.373020 ◽

2005 ◽

Author(s):

R. Watson

Keyword(s):

File Systems ◽

Distributed File Systems

Download Full-text

ALDM: Adaptive Loading Data Migration in Distributed File Systems

IEEE Transactions on Magnetics ◽

10.1109/tmag.2013.2251616 ◽

2013 ◽

Vol 49 (6) ◽

pp. 2645-2652 ◽

Cited By ~ 2

Author(s):

Zhipeng Tan ◽

Wei Zhou ◽

Dan Feng ◽

Wenhua Zhang

Keyword(s):

File Systems ◽

Data Migration ◽

Distributed File Systems

Download Full-text

Octopus + : An RDMA-Enabled Distributed Persistent Memory File System

ACM Transactions on Storage ◽

10.1145/3448418 ◽

2021 ◽

Vol 17 (3) ◽

pp. 1-25

Author(s):

Bohong Zhu ◽

Youmin Chen ◽

Qing Wang ◽

Youyou Lu ◽

Jiwu Shu

Keyword(s):

High Speed ◽

High Performance ◽

File System ◽

Direct Memory Access ◽

File Systems ◽

Distributed File Systems ◽

Persistent Memory ◽

Memory Modules ◽

Non Volatile Memory ◽

Volatile Memory

Non-volatile memory and remote direct memory access (RDMA) provide extremely high performance in storage and network hardware. However, existing distributed file systems strictly isolate file system and network layers, and the heavy layered software designs leave high-speed hardware under-exploited. In this article, we propose an RDMA-enabled distributed persistent memory file system, Octopus + , to redesign file system internal mechanisms by closely coupling non-volatile memory and RDMA features. For data operations, Octopus + directly accesses a shared persistent memory pool to reduce memory copying overhead, and actively fetches and pushes data all in clients to rebalance the load between the server and network. For metadata operations, Octopus + introduces self-identified remote procedure calls for immediate notification between file systems and networking, and an efficient distributed transaction mechanism for consistency. Octopus + is enabled with replication feature to provide better availability. Evaluations on Intel Optane DC Persistent Memory Modules show that Octopus + achieves nearly the raw bandwidth for large I/Os and orders of magnitude better performance than existing distributed file systems.

Download Full-text