distributed cluster
Recently Published Documents


TOTAL DOCUMENTS: 121 (five years: 29)

H-INDEX: 8 (five years: 2)

Author(s):  
Gautam Pal ◽  
Katie Atkinson ◽  
Gangmin Li

Abstract: This paper presents an approach to analyzing consumers' e-commerce site usage and browsing motifs through pattern mining of surfing behavior. User-generated clickstream data are first stored in the client-side browser. We build an ingestion pipeline to capture this high-velocity data stream from the client-side browser through Apache Storm, Kafka, and Cassandra. Given the consumer's usage pattern, we uncover the user's browsing intent through n-gram and collocation methods. An innovative clustering technique is constructed through the Expectation-Maximization algorithm with a Gaussian Mixture Model. We discuss a framework for predicting a user's clicks from past click sequences through higher-order Markov chains. We developed our model on top of a big data Lambda Architecture, which combines a high-throughput Hadoop batch setup with a low-latency real-time framework over a large distributed cluster. Based on this approach, we developed an experimental setup with an optimized Storm topology and reduced Cassandra database latency to achieve real-time responses. The theoretical claims are corroborated with several evaluations in a Microsoft Azure HDInsight Apache Storm deployment and in the DataStax distribution of Cassandra. The paper demonstrates that the proposed techniques help optimize the user experience, build recently-viewed product lists, support market-driven analyses, and allocate website resources.
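As a rough illustration of the higher-order Markov chain click prediction described above (the paper's actual model and data are not given here; the sessions and page names below are invented), a minimal second-order sketch in Python counts transitions from each two-click history to the next click:

```python
from collections import Counter, defaultdict

def train_markov(sequences, order=2):
    """Count transitions from each length-`order` click history to the next click."""
    model = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - order):
            history = tuple(seq[i:i + order])
            model[history][seq[i + order]] += 1
    return model

def predict_next(model, history):
    """Return the most frequent successor of the given click history, or None."""
    counts = model.get(tuple(history))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

sessions = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product", "product"],
    ["home", "search", "product", "cart"],
]
model = train_markov(sessions, order=2)
print(predict_next(model, ["search", "product"]))  # 'cart' (seen 2 of 3 times)
```

A production system would additionally smooth the counts and fall back to lower-order histories when a history is unseen.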


Author(s):  
Dajun Chang ◽  
Li Li ◽  
Ying Chang ◽  
Zhangquan Qiao

Abstract: Nowadays, with the rapid growth of data volume, massive data has become one of the factors hindering enterprise development. How to process data effectively and reduce the concurrency pressure of data access has become the driving force behind the continuous development of big data solutions. This article mainly studies a MapReduce parallel computing framework based on multiple data-fusion sensors and GPU clusters. The experimental environment uses a fully distributed Hadoop cluster, and the single-source shortest path algorithm based on MapReduce is implemented entirely in Java. Eight ordinary physical machines are used to build the fully distributed cluster, and the configuration of each node is essentially the same. The MapReduce framework divides the submitted job into several map tasks and assigns them to different computing nodes. After the map phase, intermediate files consistent with the final file format are generated. The system then generates several reduce tasks and distributes these files to different cluster nodes for execution. The experiment verifies how the running time of the PSON algorithm changes as the test data set grows, while the hardware and software configuration of the Hadoop platform is kept unchanged. When the number of computing nodes increases from 2 to 4, the running time is significantly reduced; as the number of computing nodes continues to increase, the reduction in running time becomes less and less significant. The results show that NESTOR can complete the basic MapReduce workflow and simplifies the user's development of GPU programs, achieving a significant speedup for computation-intensive applications.
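The map → shuffle → reduce flow described above can be sketched in-process (this is a toy simulation of the MapReduce contract, not the paper's Java/Hadoop implementation; word count stands in for the actual workload):

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    """Apply the mapper to every record, emitting (key, value) pairs."""
    return list(chain.from_iterable(mapper(r) for r in records))

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's grouped values."""
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count as the canonical example
records = ["big data", "big cluster", "data cluster cluster"]
mapper = lambda line: [(w, 1) for w in line.split()]
reducer = lambda k, vs: sum(vs)
result = reduce_phase(shuffle(map_phase(records, mapper)), reducer)
print(result)  # {'big': 2, 'data': 2, 'cluster': 3}
```

In real Hadoop the shuffle stage moves the intermediate files between cluster nodes, which is why its volume dominates the running-time behavior the experiment measures.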


Author(s):  
M. Jalasri ◽  
L. Lakshmanan

Abstract: Fog computing and the Internet of Things (IoT) play a crucial role in storing data on third-party servers. Fog computing provides various resources to collect data while managing data security. However, man-in-the-middle attacks and data sharing create enormous security challenges, such as data privacy, confidentiality, authentication, and integrity issues. Various researchers have introduced cryptographic techniques, yet security remains a significant concern when sharing data in a distributed environment. Therefore, this paper proposes Code-Based Encryption with an Energy Consumption Routing Protocol (CBE-ECR) for managing data security and data transmission, using keyed-hash message authentication. Initially, the data are analyzed, a distributed cluster head is selected, and a stochastically distributed energy clustering protocol is utilized for data transmission. Code-based cryptography relies on the hardness of coding-theory problems such as syndrome decoding; these cryptosystems use error-correcting codes to build a one-way function. The encryption technique minimizes man-in-the-middle attacks, and the data are protected throughout transmission. In addition to data security management, the introduced CBE-ECR reduces unauthorized access and manages the network lifetime successfully, leading to effective data management of 96.17% and lower energy consumption of 21.11% compared to other popular methods. The effectiveness of the system is compared to traditional clustering techniques.
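The keyed-hash message authentication the abstract mentions is standardized as HMAC; a minimal sketch with Python's standard library (the key and message below are placeholders, not values from the paper) shows how a fog node could tag and verify sensor data in transit:

```python
import hmac
import hashlib

def sign(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Constant-time comparison guards against timing attacks."""
    return hmac.compare_digest(sign(key, message), tag)

key = b"shared-cluster-key"          # hypothetical key shared with the cluster head
msg = b"sensor-reading:23.5C"
tag = sign(key, msg)
print(verify(key, msg, tag))         # True
print(verify(key, b"tampered", tag)) # False
```

Note that HMAC provides integrity and authentication only; the confidentiality claims of CBE-ECR come from the code-based encryption layer, which is not sketched here.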


Author(s):  
Shajulin Benedict ◽  
Deepumon Saji ◽  
Rajesh P. Sukumaran ◽  
Bhagyalakshmi M

One of the biggest advances of Machine Learning (ML) in societal applications, including air quality prediction, has been the adoption of novel learning techniques focused on solving privacy and scalability issues, which has captured the inventiveness of tens of thousands of data scientists. Transferring learning models across regions or locations has been a considerable challenge, as suitable technologies were not adopted in the recent past. This paper proposes a Blockchain-enabled Federated Learning Air Quality Prediction (BFL-AQP) framework on a Kubernetes cluster, which transfers the learning-model parameters of ML algorithms across distributed cluster nodes and predicts the air quality parameters of different locations. Experiments were carried out to explore the framework and to transfer learning models of air quality prediction parameters. In addition, the performance implications of increasing the number of Kubernetes cluster nodes running blockchains in the federated learning environment were studied; the paper reports the time taken to establish seven blockchain organizations on top of the Kubernetes cluster and investigates two federated learning algorithms, namely Federated Random Forests (FRF) and Federated Linear Regression (FLR), for air quality predictions.
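The core of Federated Linear Regression is that each node fits on its local data and only model parameters cross the network; a minimal sketch (the synthetic data, node counts, and weighting scheme below are assumptions, not the paper's setup) uses FedAvg-style weighted averaging:

```python
import numpy as np

def local_fit(X, y):
    """Ordinary least squares on one node's local air-quality data."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def federated_average(params, weights):
    """Average parameter vectors, weighted by local sample counts (FedAvg-style)."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * p for w, p in zip(weights, params))

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])        # hypothetical ground-truth coefficients
sizes = [50, 80]                      # samples held by each of two nodes
local_params = []
for n in sizes:
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    local_params.append(local_fit(X, y))
global_w = federated_average(local_params, sizes)
print(np.round(global_w, 1))          # close to [ 2. -1.]
```

In the BFL-AQP setting, the exchange of `local_params` would go through the blockchain organizations rather than a central server, which is what the Kubernetes-node scaling experiments measure.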


2021 ◽  
Vol 13 (9) ◽  
pp. 227
Author(s):  
Mariano Scazzariello ◽  
Lorenzo Ariemma ◽  
Giuseppe Di Battista ◽  
Maurizio Patrignani

We introduce an open-source, scalable, and distributed architecture, called Megalos, that supports the implementation of virtual network scenarios consisting of virtual devices (VDs), where each VD may have several Layer 2 interfaces assigned to virtual LANs. We rely on Docker containers to realize vendor-independent VDs, and we leverage Kubernetes to manage the nodes of a distributed cluster. Our architecture does not require platform-specific configurations and supports seamless interconnection between the virtual environment and the physical one. It also guarantees the segregation of each virtual LAN's traffic from the traffic of other LANs, from the cluster traffic, and from Internet traffic. Further, a packet is only sent to the cluster node containing the recipient VD. We present several example applications in which we emulate large network scenarios with thousands of VDs and LANs. Finally, we experimentally show the scalability potential of Megalos by measuring the overhead of the distributed environment and of its signaling protocols.
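The claim that a packet reaches only the cluster node hosting the recipient VD amounts to a placement lookup instead of flooding; a toy sketch (the placement table, VD names, and function are illustrative assumptions, not Megalos internals):

```python
# Hypothetical placement table: which cluster node hosts each virtual device (VD).
placement = {"vd-a": "node-1", "vd-b": "node-2", "vd-c": "node-2"}

def forward(dst_vd: str, lan_members: set) -> set:
    """Unicast the encapsulated frame only to the node hosting the destination
    VD, and only if the destination is on the same virtual LAN (segregation)."""
    if dst_vd not in lan_members:
        return set()          # different LAN: traffic stays segregated
    return {placement[dst_vd]}

print(forward("vd-b", {"vd-a", "vd-b"}))  # {'node-2'}
print(forward("vd-c", {"vd-a"}))          # set(): not on this LAN
```

The real system realizes both properties at the data-plane level with per-LAN encapsulation rather than an application-level lookup.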


2021 ◽  
Vol 8 (1) ◽  
Author(s):  
El Mehdi Saoudi ◽  
Said Jai-Andaloussi

Abstract: With the rapid growth in the amount of video data, efficient video indexing and retrieval methods have become one of the most critical challenges in multimedia management. To this end, Content-Based Video Retrieval (CBVR) is an active area of research. In this article, a CBVR system is proposed that retrieves similar videos from a large multimedia dataset given a query video. The approach uses motion-vector-based signatures to describe the visual content and machine learning techniques to extract key frames for rapid browsing and efficient video indexing. The proposed method has been implemented on both a single machine and a real-time distributed cluster to evaluate real-time performance, especially when the number and size of videos are large. Experiments were performed using various benchmark action and activity recognition datasets, and the results reveal the effectiveness of the proposed method in both accuracy and processing time compared to previous studies.
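Retrieval over such signatures reduces to ranking stored vectors by similarity to the query's vector; a minimal sketch (the three-dimensional signatures and file names are invented stand-ins for the paper's motion-based descriptors):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two signature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_sig, index, top_k=2):
    """Rank indexed videos by similarity of their signatures to the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query_sig, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

index = {
    "run.mp4":  np.array([0.9, 0.1, 0.0]),
    "walk.mp4": np.array([0.6, 0.4, 0.1]),
    "swim.mp4": np.array([0.1, 0.2, 0.9]),
}
query = np.array([0.8, 0.2, 0.0])
print(retrieve(query, index))  # ['run.mp4', 'walk.mp4']
```

On the distributed cluster, this ranking step is what gets parallelized: each node scores its partition of the index and the per-partition top-k lists are merged.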


2021 ◽  
Vol 13 (10) ◽  
pp. 1969
Author(s):  
Fang Chen ◽  
Ning Wang ◽  
Bo Yu ◽  
Yuchu Qin ◽  
Lei Wang

The volume of remote sensing images continues to grow as image sources become more diversified and spatial and spectral resolution increases. Handling such large-volume datasets, which exceed available CPU memory, in a timely and efficient manner is becoming a challenge for single machines. A distributed cluster provides an effective solution with strong computational power. An increasing number of big data technologies have been adopted to process large images using mature parallel techniques. However, since most commercial big data platforms are not specifically developed for the remote sensing field, two main issues arise when processing large images on a distributed cluster with big data platforms. On the one hand, the number and variety of official algorithms for processing remote sensing images in big data platforms are limited compared to the large body of sequential algorithms. On the other hand, sequential algorithms employed directly to process large images in parallel over a distributed cluster may produce incomplete objects at tile edges and generate large communication volumes at the shuffle stage. It is, therefore, necessary to explore distributed strategies and adapt the sequential algorithms to the distributed cluster. In this research, we employed two seed-based image segmentation algorithms to construct a distributed strategy based on the Spark platform. The proposed strategy focuses on repairing incomplete objects by processing border areas and on reducing the communication volume to a reasonable size by limiting the auxiliary bands and the buffer size to a small range during the shuffle stage. We calculated the F-measure and execution time to evaluate accuracy and execution efficiency. The statistical data reveal that both segmentation algorithms maintained accuracy as high as that achieved on the reference image segmented sequentially. Moreover, the strategy generally took less execution time than configurations with significantly larger auxiliary bands and buffer sizes. The proposed strategy can repair incomplete objects, with execution time twice as fast as strategies that do not employ communication-volume reduction in the distributed cluster.
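The border-area idea can be illustrated by tiling with a small overlap (halo): each tile is read slightly larger than it is written, so objects straddling a tile edge can be completed, while keeping the halo small bounds the shuffle volume. The tile and halo sizes below are arbitrary illustrations, not the paper's parameters:

```python
def tiles_with_halo(width, height, tile, halo):
    """Yield (read_window, write_window) pairs as (x0, y0, x1, y1) tuples.
    The read window includes a halo of neighboring pixels; results are
    only written back for the core window, avoiding duplicate output."""
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            read = (max(0, x0 - halo), max(0, y0 - halo),
                    min(width, x0 + tile + halo), min(height, y0 + tile + halo))
            write = (x0, y0, min(width, x0 + tile), min(height, y0 + tile))
            yield read, write

wins = list(tiles_with_halo(100, 100, 50, 8))
print(len(wins))  # 4 tiles for a 100x100 image with 50-pixel tiles
print(wins[0])    # ((0, 0, 58, 58), (0, 0, 50, 50))
```

In a Spark job, each (read, write) pair would become one task's input and output extent, and the halo width is exactly the "buffer size" whose limitation the strategy exploits to cut shuffle traffic.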


2021 ◽  
Vol 7 (2) ◽  
pp. 71-78
Author(s):  
Timothy Dicky ◽  
Alva Erwin ◽  
Heru Purnomo Ipung

The purpose of this research is to develop a job recommender system based on the Hadoop MapReduce framework, so that the system scales when processing big data. A machine learning algorithm is also implemented inside the job recommender to produce accurate job recommendations. The project begins by collecting sample data to build an accurate job recommender system with a centralized program architecture. A job recommender with a distributed program architecture is then implemented using Hadoop MapReduce and deployed to a Hadoop cluster. After implementation, both systems are tested using a large set of applicant and job data, and the time required for each program to process the data is recorded for analysis. Based on the experiments, we conclude that the recommender produces the most accurate results when the cosine similarity measure is used inside the algorithm. The centralized job recommender system processes the data faster than the distributed cluster job recommender system; but as the data grow, the centralized system eventually lacks the capacity to process them, while the distributed cluster job recommender scales with the size of the data.
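The cosine-similarity matching at the heart of such a recommender can be sketched over sparse skill vectors (the skills, weights, and job names below are invented examples, not the study's dataset):

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse skill-weight vectors."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend(applicant, jobs, top_k=1):
    """Return the top-k jobs whose requirement vectors best match the applicant."""
    ranked = sorted(jobs, key=lambda j: cosine(applicant, jobs[j]), reverse=True)
    return ranked[:top_k]

applicant = {"java": 3, "hadoop": 2, "sql": 1}
jobs = {
    "data-engineer": {"hadoop": 3, "sql": 2, "spark": 1},
    "frontend-dev":  {"javascript": 3, "css": 2},
}
print(recommend(applicant, jobs))  # ['data-engineer']
```

In the MapReduce version, the map tasks would each score one partition of the job postings against the applicant vector, and the reduce step would merge the per-partition rankings.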

