An algebra for distributed Big Data analytics

Author(s):  
Leonidas Fegaras

Abstract: We present an algebra for data-intensive scalable computing based on monoid homomorphisms, consisting of a small set of operations that capture most features supported by current domain-specific languages for data-centric distributed computing. This algebra is being used as the formal basis of MRQL, a query processing and optimization system for large-scale distributed data analysis. The MRQL semantics is given in terms of monoid comprehensions, which support group-by and order-by syntax and can work on heterogeneous collections without requiring any extension to the monoid algebra. We present the syntax and semantics of monoid comprehensions and provide rules to translate them to the monoid algebra. We give evidence of the effectiveness of our algebra by presenting some important optimization rules, such as converting nested queries to joins.
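To make the central idea concrete, the following is a minimal sketch in Scala of a monoid and a monoid homomorphism. The trait and signatures are simplified assumptions for illustration; MRQL's actual algebra is considerably richer. The point it shows is why such computations parallelize: because the merge is associative with an identity, partial results from different data partitions can be combined in any grouping.

```scala
// A minimal sketch of monoid homomorphisms (hypothetical, simplified
// signatures; not the MRQL algebra itself).
object MonoidSketch {
  // A monoid: an associative merge operation with an identity element.
  trait Monoid[M] {
    def zero: M
    def merge(x: M, y: M): M
  }

  val intSum: Monoid[Int] = new Monoid[Int] {
    val zero = 0
    def merge(x: Int, y: Int): Int = x + y
  }

  // A monoid homomorphism over a collection: map each element into the
  // target monoid, then reduce with its merge. Associativity is what
  // lets the reduction be split across distributed partitions.
  def hom[A, M](m: Monoid[M])(f: A => M)(xs: List[A]): M =
    xs.map(f).foldLeft(m.zero)(m.merge)

  def main(args: Array[String]): Unit = {
    val orders = List(("alice", 10), ("bob", 5), ("alice", 7))
    // The comprehension "sum{ price | (cust, price) <- orders }"
    // expressed as a homomorphism into the integer-sum monoid:
    val total = hom(intSum)((o: (String, Int)) => o._2)(orders)
    println(total) // 22
  }
}
```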

2021, Vol. 2021, pp. 1-11
Author(s):  
Zhen Zhang ◽  
Bing Guo ◽  
Yan Shen ◽  
Chengjie Li ◽  
Xinhua Suo ◽  
...  

Bitcoin mining consumes tremendous amounts of electricity to solve the hash problem. At the same time, large-scale applications of artificial intelligence (AI) require efficient and secure computing. Many computing devices are in use, with highly heterogeneous hardware resources, so a mechanism is needed to coordinate cooperation among them, and a sound computation structure is required when data are dispersed. In this paper, we propose an architecture in which devices (also called nodes) can reach a consensus on task results using off-chain smart contracts and private data. The proposed distributed computing architecture can accelerate computing-intensive and data-intensive supervised classification algorithms with limited resources. It can significantly strengthen privacy protection and prevent leakage of distributed data, and it supports heterogeneous data, making computing on each device more efficient. We prove the correctness and robustness of our system mathematically and derive the condition under which a given task stops. In the experiments, we recast Bitcoin hash collision as a distributed computation across several nodes and evaluated training and prediction accuracy on handwritten digit images (MNIST). The experimental results demonstrate the effectiveness of the proposed method.
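As an illustration of the core trick in the experimental setup, here is a hedged sketch of splitting a hash-collision search into disjoint nonce ranges, one per node. The names, the worker count, and the difficulty parameter are illustrative assumptions; the paper's actual consensus protocol and smart-contract layer are not modeled here.

```scala
// A hypothetical sketch: a proof-of-work-style hash search partitioned
// across nodes (simulated with futures), not the paper's protocol.
import java.security.MessageDigest
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object HashSearchSketch {
  def sha256Hex(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  // Each node scans a disjoint nonce range; a hit on any node completes
  // the overall task, so each node's work can be checked independently.
  def searchRange(payload: String, from: Long, to: Long,
                  zeroHexDigits: Int): Option[Long] = {
    val prefix = "0" * zeroHexDigits
    (from until to).find(n => sha256Hex(payload + n).startsWith(prefix))
  }

  def main(args: Array[String]): Unit = {
    val nodes = 4
    val perNode = 100000L
    // Each future stands in for one worker node holding its own slice
    // of the nonce space.
    val workers = (0 until nodes).map { i =>
      Future(searchRange("block-data", i * perNode, (i + 1) * perNode, 4))
    }
    val hits = Await.result(Future.sequence(workers), Duration.Inf).flatten
    println(s"nonces found: $hits")
  }
}
```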


Information, 2019, Vol. 10 (12), p. 360
Author(s):  
Nikos Kefalakis ◽  
Aikaterini Roukounaki ◽  
John Soldatos

One of the main challenges in modern Internet of Things (IoT) systems is the efficient collection, routing, and management of data streams from heterogeneous sources, including sources with high ingestion rates. Despite the existence of various IoT data streaming frameworks, there is still no easy way to collect and route IoT streams efficiently and configurably that is also easy to implement and deploy in realistic environments. In this paper, we introduce a programmable engine for Distributed Data Analytics (DDA), which eases the task of collecting IoT streams from different sources and routing them to the appropriate consumers. The engine also provides the means for preprocessing and analysis of data streams, two of the most important tasks in Big Data analytics applications. At the heart of the engine lies a Domain Specific Language (DSL) that enables the zero-programming definition of data routing and preprocessing tasks. This DSL is outlined in the paper, along with the middleware that supports its runtime execution. We also present the architecture of the engine and the digital models it uses to represent data streams in the digital world, and we discuss the validation of the DDA in several data-intensive IoT use cases in industrial environments, including pilot production lines and several real-life manufacturing settings. The latter demonstrate the configurability, programmability, and flexibility of the DDA engine, as well as its ability to support practical applications.
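The abstract outlines the DSL rather than publishing its grammar, so the following is a hypothetical embedded-DSL sketch in Scala showing only the general shape of such routing-and-preprocessing rules: each rule pairs a predicate over incoming records with a transformation and a named target consumer. All names here (Record, Route, the topic strings) are illustrative, not the DDA engine's API.

```scala
// A hypothetical embedded-DSL sketch of stream routing and
// preprocessing rules; the actual DDA DSL is not shown here.
object RoutingDslSketch {
  final case class Record(source: String, fields: Map[String, Double])

  // A route: a predicate selecting records, a preprocessing step, and
  // the named consumer the result is delivered to.
  final case class Route(when: Record => Boolean,
                         transform: Record => Record,
                         to: String)

  // Dispatch fans one record out to every route whose predicate matches.
  def dispatch(routes: Seq[Route])(r: Record): Seq[(String, Record)] =
    routes.collect { case rt if rt.when(r) => (rt.to, rt.transform(r)) }

  def main(args: Array[String]): Unit = {
    val routes = Seq(
      Route(
        when = _.source == "vibration-sensor",
        // Preprocessing: keep only the amplitude field, rescaled.
        transform = r => r.copy(fields = r.fields.view
          .filterKeys(_ == "amp").mapValues(_ * 9.81).toMap),
        to = "predictive-maintenance"
      ),
      Route(when = _ => true, transform = identity, to = "archive")
    )
    val rec = Record("vibration-sensor", Map("amp" -> 0.3, "tmp" -> 41.0))
    dispatch(routes)(rec).foreach(println)
  }
}
```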


Author(s):  
Sathishkumar S. ◽  
Devi Priya R. ◽  
Karthika K.

Big data computing in clouds is a new paradigm for next-generation analytics development. It enables organizations to share and explore large quantities of ever-increasing data types using cloud computing technology as a back-end. Knowledge exploration and decision-making over this rapidly increasing volume of data demand data organization, access, and timely processing, an evolving trend known as big data computing. This modern paradigm incorporates large-scale computing, new data-intensive techniques, and mathematical models to create data analytics for intrinsic information extraction. Cloud computing emerged as a service-oriented computing model that delivers infrastructure, platforms, and applications as services from providers to consumers while meeting QoS parameters, enabling large volumes of rapidly growing data to be archived and processed faster and more economically.


Author(s):  
Lichao Xu ◽  
Szu-Yun Lin ◽  
Andrew W. Hlynka ◽  
Hao Lu ◽  
Vineet R. Kamat ◽  
...  

Abstract: There has been a strong need for simulation environments capable of modeling the deep interdependencies between complex systems encountered during natural hazards, such as the interactions and coupled effects between the response of civil infrastructure systems, human behavior, and social policies, in order to improve community resilience. Coupling such complex components in an integrated simulation requires continuous data exchange between the different simulators running their separate models throughout the simulation. This can be implemented by means of distributed simulation platforms or data passing tools. To provide a systematic reference for choosing simulation tools and to facilitate the development of compatible distributed simulators for studying deep interdependencies in the context of natural hazards, this article focuses on generic tools suitable for integrating simulators from different fields, rather than platforms used mainly within specific fields. With this aim, the article provides a comprehensive review of the most commonly used generic distributed simulation platforms (Distributed Interactive Simulation (DIS), High Level Architecture (HLA), Test and Training Enabling Architecture (TENA), and the Data Distribution Service (DDS)) and data passing tools (the Robot Operating System (ROS) and Lightweight Communications and Marshalling (LCM)) and compares their advantages and disadvantages. Three specific limitations of existing platforms are identified from the perspective of natural hazard simulation. To mitigate them, two platform design recommendations are provided, namely message exchange wrappers and hybrid communication, to help improve data passing capabilities in existing solutions and to guide the design of a new domain-specific distributed simulation framework.
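To illustrate the first recommendation, here is a minimal sketch of a message exchange wrapper: a common publish/subscribe facade behind which a DDS, ROS, or LCM binding could sit. The trait and the in-memory binding are illustrative assumptions, not existing APIs of those platforms.

```scala
// A minimal sketch of the "message exchange wrapper" idea: simulators
// talk to one facade, and middleware bindings are swapped underneath.
import scala.collection.mutable

object WrapperSketch {
  trait MessageBus {
    def publish(topic: String, payload: Array[Byte]): Unit
    def subscribe(topic: String)(handler: Array[Byte] => Unit): Unit
  }

  // In-memory stand-in; a real wrapper would delegate these two calls
  // to the underlying middleware's client library.
  final class LocalBus extends MessageBus {
    private val subs = mutable.Map.empty[String, List[Array[Byte] => Unit]]
    def publish(topic: String, payload: Array[Byte]): Unit =
      subs.getOrElse(topic, Nil).foreach(_(payload))
    def subscribe(topic: String)(handler: Array[Byte] => Unit): Unit =
      subs.update(topic, handler :: subs.getOrElse(topic, Nil))
  }

  def main(args: Array[String]): Unit = {
    val bus: MessageBus = new LocalBus
    // A structural simulator and a human-behavior simulator exchange
    // state through the same facade regardless of the middleware chosen.
    bus.subscribe("bridge/displacement") { b =>
      println(s"behavior model received: ${new String(b, "UTF-8")}")
    }
    bus.publish("bridge/displacement", "0.12m".getBytes("UTF-8"))
  }
}
```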

