Efficiently Processing and Storing Library Linked Data using Apache Spark and Parquet

2018 ◽  
Vol 37 (3) ◽  
pp. 29-49
Author(s):  
Kumar Sharma ◽  
Ujjal Marjit ◽  
Utpal Biswas

Resource Description Framework (RDF) is a commonly used data model in the Semantic Web environment. Libraries and various other communities have been using the RDF data model to store valuable data after it is extracted from traditional storage systems. However, because of the large volume of the data, processing and storing it is becoming a nightmare for traditional data-management tools. This challenge demands a scalable and distributed system that can manage data in parallel. In this article, a distributed solution is proposed for efficiently processing and storing the large volume of library linked data extracted from traditional storage systems. Apache Spark is used for parallel processing of large data sets, and a column-oriented schema is proposed for storing RDF data. The storage system is built on top of the Hadoop Distributed File System (HDFS) and uses the Apache Parquet format to store data in compressed form. SPARQL queries are processed with Spark SQL against the compressed data. The experimental evaluation showed that storage requirements were reduced significantly compared with Jena TDB, Sesame, RDF/XML, and N-Triples file formats, and that query response times were good and decreased significantly as the number of worker nodes increased.
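
To make the approach concrete, here is a minimal sketch, assuming PySpark and placeholder HDFS paths, of storing parsed triples as a column-oriented Parquet table and answering a SPARQL-style triple pattern with Spark SQL. The naive whitespace split is a simplification; a real pipeline would use a proper N-Triples parser:

```python
# Minimal sketch: RDF triples as a Parquet table queried with Spark SQL.
# Paths are placeholders; splitting on spaces breaks on literals that
# contain spaces and is used here only for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdf-parquet-sketch").getOrCreate()

# Parse N-Triples lines into subject/predicate/object columns.
triples = spark.read.text("hdfs:///data/library.nt").selectExpr(
    "split(value, ' ')[0] AS subject",
    "split(value, ' ')[1] AS predicate",
    "split(value, ' ')[2] AS object",
)

# Store in compressed, column-oriented Parquet form on HDFS.
triples.write.mode("overwrite").parquet("hdfs:///data/library_triples.parquet")

# Evaluate a SPARQL-like triple pattern as a Spark SQL query.
spark.read.parquet("hdfs:///data/library_triples.parquet") \
     .createOrReplaceTempView("triples")
spark.sql("""
    SELECT subject, object
    FROM triples
    WHERE predicate = '<http://purl.org/dc/terms/title>'
""").show()
```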

Author(s):  
Ahmet Artu Yıldırım ◽  
Dan Watson

Major Internet services are required to process a tremendous amount of data in real time. A closer look shows that distributed object storage systems play an important role at the back end in achieving this success. This chapter surveys current state-of-the-art storage systems used for reliable, high-performance, and scalable storage needs in data centers and the cloud. An experimental distributed object storage system (CADOS) is then introduced for efficiently retrieving large data items, on the order of hundreds of megabytes, through HTML5-enabled web browsers from big data (terabytes of data) in a cloud infrastructure. The objective of the system is to minimize latency and propose a scalable storage system on the cloud using a thin RESTful web service and modern HTML5 capabilities.
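
The retrieval pattern described here can be sketched outside the browser as well; in a real deployment the same ranged requests would be issued with HTML5 fetch/XHR against the thin REST service. The endpoint, chunk size, and header handling below are illustrative placeholders, not CADOS's actual API:

```python
# Hedged sketch of ranged, parallel object retrieval through a thin
# RESTful service; endpoint and chunk size are invented for illustration.
import concurrent.futures
import requests

OBJECT_URL = "https://storage.example.com/objects/dataset-001"  # hypothetical
CHUNK = 8 * 1024 * 1024  # fetch in 8 MiB byte ranges

def fetch_range(start, end):
    resp = requests.get(OBJECT_URL, headers={"Range": f"bytes={start}-{end}"})
    resp.raise_for_status()
    return start, resp.content

# Assumes the server reports Content-Length on HEAD requests.
size = int(requests.head(OBJECT_URL).headers["Content-Length"])
ranges = [(s, min(s + CHUNK, size) - 1) for s in range(0, size, CHUNK)]

# Download ranges in parallel, then reassemble in offset order.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    parts = sorted(pool.map(lambda r: fetch_range(*r), ranges))

blob = b"".join(content for _, content in parts)
```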


2019 ◽  
Vol 135 ◽  
pp. 04076 ◽  
Author(s):  
Marina Bolsunovskaya ◽  
Svetlana Shirokova ◽  
Aleksandra Loginova

This paper is devoted to the development and application of data storage systems (DSS) and tools for managing such systems to predict failures and provide fault-tolerance specifications. Nowadays DSS are widely used for collecting data in Smart Home and Smart City management systems; for example, large data warehouses are utilized in traffic management systems. The results of an analysis of the current data storage market are shown, and a project is presented whose purpose is to develop a hardware and software complex for predicting failures in the storage system.


2021 ◽  
Author(s):  
Marco Kulüke ◽  
Fabian Wachsmann ◽  
Georg Leander Siemund ◽  
Hannes Thiemann ◽  
Stephan Kindermann

This study provides guidance to data providers on how to transfer existing netCDF data from a hierarchical storage system into Zarr on an object storage system.

In recent years, object storage systems have become an alternative to traditional hierarchical file systems because they are easily scalable and offer faster data retrieval than hierarchical storage systems.

The earth system sciences, and climate science in particular, handle large amounts of data. These data are usually represented as multi-dimensional arrays and traditionally stored in netCDF format on hierarchical file systems. However, the current netCDF-4 format is not yet optimized for object storage systems: netCDF data transfers from an object store can only be conducted at the file level, which results in heavy download volumes. The Zarr format can mitigate this problem; because chunks and metadata are accessed directly, it reduces data transfers and increases input/output speed in parallel computing environments.

As one of the largest climate data providers worldwide, the German Climate Computing Center (DKRZ) continuously works toward efficient ways to make data accessible to users. This use case shows the conversion and transfer of a subset of the Coupled Model Intercomparison Project Phase 6 (CMIP6) climate data archive from netCDF on the hierarchical file system into Zarr on the OpenStack object store, known as Swift, using the Zarr Python package. Finally, the study evaluates to what extent Zarr-formatted climate data on an object storage system is a meaningful addition to DKRZ's existing high-performance computing environment.
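
A minimal conversion sketch, assuming xarray and zarr are installed; the paths and chunk sizes are placeholders rather than DKRZ's actual configuration, and a Swift-backed store mapping (e.g. via fsspec) could be substituted for the local target path:

```python
# Minimal netCDF-to-Zarr conversion sketch; paths and chunking are assumptions.
import xarray as xr

# Open a CMIP6-style netCDF file from the hierarchical file system.
ds = xr.open_dataset("/work/cmip6/tas_Amon_model_historical.nc")

# Rechunk so that each Zarr chunk corresponds to one object-store request.
ds = ds.chunk({"time": 120, "lat": 96, "lon": 192})

# Write to a Zarr store; chunks and metadata then become directly
# addressable without downloading whole files.
ds.to_zarr("/work/cmip6_zarr/tas_Amon_model_historical.zarr", mode="w")
```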


Information ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 304 ◽  
Author(s):  
Anas Khan

In this article, we look at the potential for a wide-coverage modelling of etymological information as linked data using the Resource Description Framework (RDF) data model. We begin with a discussion of some of the most typical features of etymological data and the challenges that these might pose to an RDF-based modelling. We then propose a new vocabulary for representing etymological data, the Ontolex-lemon Etymological Extension (lemonETY), based on the ontolex-lemon model. Each of the main elements of our new model is motivated with reference to the preceding discussion.
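
To make the idea of RDF-based etymology concrete, here is a hedged rdflib sketch; the ety: namespace and property names are hypothetical stand-ins, not the published lemonETY vocabulary terms:

```python
# Illustrative RDF sketch of an etymological link between two lexical
# entries. The ety: namespace and properties are hypothetical placeholders.
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
ETY = Namespace("http://example.org/lemonety#")  # placeholder namespace

g = Graph()
g.bind("ontolex", ONTOLEX)
g.bind("ety", ETY)

modern = URIRef("http://example.org/lexicon/en/heart")
ancestor = URIRef("http://example.org/lexicon/ang/heorte")

g.add((modern, RDF.type, ONTOLEX.LexicalEntry))
g.add((ancestor, RDF.type, ONTOLEX.LexicalEntry))
g.add((modern, ETY.etymologicalOrigin, ancestor))  # hypothetical property

print(g.serialize(format="turtle"))
```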


2021 ◽  
Author(s):  
Chiara Forresi ◽  
Enrico Gallinucci ◽  
Matteo Golfarelli ◽  
Hamdi Ben Hamadou

The success of NoSQL DBMSs has pushed the adoption of polyglot storage systems that take advantage of the best characteristics of different technologies and data models. While operational applications benefit greatly from this choice, analytical applications suffer from the absence of schema consistency, not only between different DBMSs but within a single NoSQL system as well. In this context, the discipline of data science is steering analysts away from traditional data warehousing and toward a more flexible and lightweight approach to data analysis. The idea is to perform OLAP analyses in a pay-as-you-go manner across heterogeneous schemas and data models, where the integration is progressively carried out by the user as the available data is explored. In this paper, we propose an approach to support data analysis within a high-variety multistore, with heterogeneous schemas and overlapping records. Our approach supports relational, document, wide-column, and key-value data models by automatically handling both data model and schema heterogeneity through a dataspace layer on top of the underlying DBMSs. The expressiveness we enable corresponds to GPSJ queries, the most common class of queries in OLAP applications. We rely on nested relational algebra to define a cross-database execution plan. The system has been prototyped on Apache Spark.
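
A hedged sketch of the cross-database idea in plain Spark (not the authors' dataspace prototype): load records from a relational-style source and a document-style source, flatten the nested schema, and run a GPSJ-shaped query. The paths and column names are assumptions:

```python
# Hedged sketch of a GPSJ (generalized projection / selection / join)
# query across two stores; the files stand in for the relational and
# document DBMSs a multistore would wrap.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("multistore-gpsj-sketch").getOrCreate()

# Relational-style source with a flat schema (assumed: id, segment).
customers = spark.read.parquet("/data/crm/customers.parquet")

# Document-style source with a nested schema; flatten the address field.
orders = spark.read.json("/data/shop/orders.json")
orders = orders.select(
    "customer_id", "amount", F.col("address.country").alias("country")
)

# Selection, join, and generalized projection with aggregation.
result = (
    orders.filter(F.col("amount") > 100)
          .join(customers, orders.customer_id == customers.id)
          .groupBy("country", "segment")
          .agg(F.sum("amount").alias("revenue"))
)
result.show()
```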


Big data applications play an important role in real-time data processing. Apache Spark is a data processing framework with an in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. However, Spark's in-memory processing cannot share data between applications, and RAM can be insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data-intensive Spark applications across various storage systems. In this work, the performance of applications on Spark alone and on Spark running over Alluxio has been studied with respect to several storage formats (Parquet, ORC, CSV, and JSON) and four types of queries from the Star Schema Benchmark (SSB). A benchmark is developed to assess the suitability of the Spark-Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases larger than 2.6 GB storing data in JSON and CSV formats, while Spark alone is suitable for applications that use storage formats such as Parquet and ORC with database sizes below 2.6 GB.
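
Since Alluxio exposes a Hadoop-compatible file system, Spark can read through it with an alluxio:// URI once the Alluxio client jar is on the classpath. A minimal sketch, assuming placeholder master address and paths, and a simplified SSB-style aggregate:

```python
# Hedged sketch of reading an SSB table through Alluxio with Spark SQL;
# the alluxio:// master address and table path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-alluxio-sketch").getOrCreate()

# Reading through Alluxio lets repeated queries hit memory-speed storage
# instead of going back to the underlying file system each time.
lineorder = spark.read.parquet("alluxio://alluxio-master:19998/ssb/lineorder")
lineorder.createOrReplaceTempView("lineorder")

# Simplified SSB Q1-style aggregate (date-dimension join omitted).
spark.sql("""
    SELECT SUM(lo_extendedprice * lo_discount) AS revenue
    FROM lineorder
    WHERE lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25
""").show()
```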


2011 ◽  
Vol 05 (04) ◽  
pp. 433-462 ◽  
Author(s):  
ANDRÉ FREITAS ◽  
EDWARD CURRY ◽  
JOÃO GABRIEL OLIVEIRA ◽  
SEÁN O'RIAIN

The vision of creating a Linked Data Web brings with it the challenge of allowing queries across highly heterogeneous and distributed datasets. To query Linked Data on the Web today, end users need to be aware of which datasets potentially contain the data and also which data model describes those datasets. Allowing users to expressively query relationships in RDF while abstracting them from the underlying data model is a fundamental problem for Web-scale Linked Data consumption. This article introduces a distributional structured semantic space which enables data-model-independent natural language queries over RDF data. The approach centers on the use of a distributional semantic model to provide the level of semantic interpretation that a data-model-independent approach demands. The article analyzes the geometric aspects of the proposed space, describing it as a distributional structured vector space built upon the Generalized Vector Space Model (GVSM). The final semantic space proved to be flexible and precise under real-world query conditions, achieving a mean reciprocal rank of 0.516, average precision of 0.482, and average recall of 0.491.
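
The core mechanism, matching query terms to RDF relations by distributional similarity rather than exact vocabulary, can be illustrated with a toy sketch; the co-occurrence vectors below are invented for the example and do not reproduce the paper's semantic space:

```python
# Illustrative sketch: rank RDF relation labels against a query term by
# cosine similarity in a toy distributional vector space.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy distributional vectors over an implicit context vocabulary.
vectors = {
    "spouse":     np.array([0.9, 0.1, 0.0]),
    "married to": np.array([0.8, 0.2, 0.1]),
    "birthplace": np.array([0.0, 0.1, 0.9]),
}

query_term = "married to"
ranking = sorted(
    ((cosine(vectors[query_term], vec), rel)
     for rel, vec in vectors.items() if rel != query_term),
    reverse=True,
)
print(ranking)  # "spouse" ranks above "birthplace" for this query
```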


Author(s):  
Anthony Kougkas ◽  
Hassan Eslami ◽  
Xian-He Sun ◽  
Rajeev Thakur ◽  
William Gropp

Key-value stores are widely used as the storage system for large-scale internet services and cloud storage systems. However, they are rarely used in HPC systems, where parallel file systems are the dominant storage solution. In this study, we examine the architectural differences and performance characteristics of parallel file systems and key-value stores. We propose using key-value stores to optimize overall input/output (I/O) performance, especially for workloads that parallel file systems cannot handle well, such as those with intense data synchronization or heavy metadata operations. We conducted experiments with several synthetic benchmarks, an I/O benchmark, and a real application. We modeled the performance of the two systems using data collected from our experiments, and we provide a predictive method to identify which system offers better I/O performance for a given workload. The results show that we can optimize I/O performance in HPC systems by utilizing key-value stores.
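
The kind of predictive choice the study describes might look, in heavily simplified form, like the sketch below; the workload features, thresholds, and decision rule are invented for illustration and are not the paper's fitted model:

```python
# Hypothetical sketch: choose a storage back end from coarse workload
# features. Feature names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Workload:
    metadata_op_ratio: float   # fraction of operations touching metadata
    sync_intensity: float      # synchronizations per MB of I/O
    avg_request_kb: float      # mean request size in KiB

def choose_backend(w: Workload) -> str:
    # Key-value stores tend to win on metadata-heavy, synchronization-heavy,
    # small-request workloads; parallel file systems on large streaming I/O.
    if w.metadata_op_ratio > 0.5 or w.sync_intensity > 1.0:
        return "key-value store"
    if w.avg_request_kb >= 1024:
        return "parallel file system"
    return "key-value store"

print(choose_backend(Workload(0.7, 0.2, 64)))  # metadata-heavy -> KV store
```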


2019 ◽  
Vol 15 (S367) ◽  
pp. 464-466
Author(s):  
Paul Bartus

In recent years, the amount of data has skyrocketed. As a consequence, data has become more expensive to store than to generate, and storage needs for astronomical data are following this trend. Storage systems in astronomy contain redundant copies of data, such as identical files or duplicated sub-file regions. We propose the use of the Hadoop Distributed and Deduplicated File System (HD2FS) in astronomy. HD2FS is a deduplication storage system created to improve data storage capacity and efficiency in distributed file systems without compromising input/output performance. HD2FS can be developed by modifying existing storage system environments such as the Hadoop Distributed File System. By taking advantage of deduplication technology, we can better manage the underlying redundancy of data in astronomy and reduce the space needed to store these files in the file systems, thus allowing for more capacity per volume.
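
The core deduplication idea, storing each unique content chunk once and representing files as lists of chunk references, can be sketched as follows; fixed-size chunking and the in-memory index are simplifications of what a distributed system like HD2FS would do:

```python
# Illustrative content-hash deduplication sketch (not HD2FS itself).
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB fixed-size chunks

chunk_store = {}  # digest -> chunk bytes; unique chunks stored once

def store_file(path):
    """Return the file's recipe: the ordered list of chunk digests."""
    recipe = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest = hashlib.sha256(chunk).hexdigest()
            # Only the first copy of each chunk consumes space.
            chunk_store.setdefault(digest, chunk)
            recipe.append(digest)
    return recipe

def restore_file(recipe, path):
    """Reassemble a file from its recipe of chunk digests."""
    with open(path, "wb") as f:
        for digest in recipe:
            f.write(chunk_store[digest])
```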


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Sujan Saha ◽  
Sukumar Mandal

Purpose: These projects aim to improve library services for future users by combining Linked Open Data (LOD) technology with data visualization, displaying and analysing search results in an intuitive manner. These services are enhanced by integrating various LOD technologies into the authority control system.

Design/methodology/approach: LOD technology is used to access, recycle, share, exchange and disseminate information, among other things. This study evaluates the applicability of Linked Data technologies for the development of library information services.

Findings: Apache Hadoop is used for rapidly storing and processing massive Linked Data datasets. Apache Spark is a free and open-source data processing tool. Hive is a SQL-based data warehouse that enables data scientists to write, read and manage petabytes of data.

Originality/value: Apache HBase is a distributed big-data store that does not use SQL. This study's goal is to search the geographic, authority and bibliographic databases for relevant links found on various websites. When data items are linked together, all of the data bits are linked together as well. The study observed and evaluated the tools and processes and recorded each data item's URL. As a result, data can be combined across silos, enhanced by third-party data sources and contextualized.
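
The silo-combining step might be sketched as below, using Spark to link bibliographic and authority records by a shared URI; the dataset paths and column names are assumptions, not the study's actual implementation:

```python
# Hedged sketch: link bibliographic and authority records on a shared URI.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lod-linking-sketch").getOrCreate()

# Assumed schemas: bibliographic (title, author_uri), authority
# (uri, preferred_name).
biblio = spark.read.json("/data/library/bibliographic.json")
authority = spark.read.json("/data/library/authority.json")

# Joining on the shared URI combines records from separate silos,
# which is the core benefit of Linked Open Data.
linked = biblio.join(authority, biblio.author_uri == authority.uri, "left")
linked.select("title", "preferred_name").show()
```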

