Hadoop Distributed File System
Recently Published Documents


TOTAL DOCUMENTS

211
(FIVE YEARS 96)

H-INDEX

12
(FIVE YEARS 2)

2022 ◽  
Vol 16 (1) ◽  
pp. 0-0

Anomaly detection is a crucial step in building a secure and trustworthy system, and analyzing and detecting failures and anomalies manually is daunting. In this paper, we propose an approach that leverages the pattern-matching capabilities of a Convolutional Neural Network (CNN) for anomaly detection in system logs. Features are extracted from log files using a windowing technique, and from these features a one-dimensional image (1×n) is generated whose pixel values correspond to the log features. A 1D convolution operation followed by max pooling is applied to these images, and a multi-layer feed-forward neural network then acts as a classifier, learning from the representation created by the convolution layers to label logs as normal or abnormal. The model thus learns the variation in log patterns between normal and abnormal behavior. The proposed approach achieves improved accuracy compared to existing approaches for anomaly detection in Hadoop Distributed File System (HDFS) logs.
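A minimal sketch of the kind of 1D-CNN log classifier the abstract describes, assuming feature vectors of a fixed length extracted by a sliding window over parsed logs; the layer sizes and feature length are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative 1D-CNN log-anomaly classifier (sizes are assumptions, not the paper's).
from tensorflow.keras import layers, models

N_FEATURES = 50          # length of the windowed feature vector per log sequence (assumed)

def build_model(n_features: int = N_FEATURES) -> models.Model:
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),        # 1 x n "image" of log features
        layers.Conv1D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling1D(pool_size=2),           # max pooling after the convolution
        layers.Flatten(),
        layers.Dense(64, activation="relu"),        # feed-forward classifier head
        layers.Dense(1, activation="sigmoid"),      # normal (0) vs. abnormal (1)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# X: (num_windows, N_FEATURES, 1) feature "images", y: 0/1 labels
# model = build_model(); model.fit(X, y, epochs=10, validation_split=0.2)
```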


Author(s):  
Pinjari Vali Basha

With the rapid transformation of technology, a huge amount of data (structured and unstructured) is generated every day. With the aid of 5G technology and IoT, the data generated and processed daily is very large, approximately 2.5 quintillion bytes. This big data is stored and processed with the help of the Hadoop framework, which has two core components for storing and retrieving data: the Hadoop Distributed File System (HDFS) and the MapReduce algorithm. The native Hadoop framework has limitations in MapReduce: if the same job is submitted again, all of its steps must be repeated and we must wait for the results, wasting time and resources. Improving the capabilities of the NameNode by maintaining a Common Job Block Table (CJBT) there improves performance, at the cost of maintaining the table. The CJBT contains the metadata of files that are processed repeatedly, which avoids recomputation, reduces the number of computations, saves resources, and speeds up processing. Since the size of the CJBT keeps increasing, its size should be bounded by an algorithm that keeps track of jobs; the optimal CJBT is derived by employing an optimal algorithm at the NameNode.
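The abstract describes the CJBT only at a high level; below is a hypothetical sketch of the idea as a NameNode-side cache that maps a job's signature (input files plus job logic) to previously computed results, with a bounded size. The class name, signature scheme, and LRU eviction policy are all assumptions for illustration.

```python
# Hypothetical Common Job Block Table (CJBT) sketch: cache repeated-job results
# so a resubmitted job can skip recomputation. Not the paper's implementation.
import hashlib
from collections import OrderedDict

class CommonJobBlockTable:
    def __init__(self, max_entries: int = 1024):
        self._table = OrderedDict()              # insertion-ordered for LRU eviction
        self._max = max_entries

    @staticmethod
    def job_signature(input_paths, job_class: str) -> str:
        key = job_class + "|" + "|".join(sorted(input_paths))
        return hashlib.sha256(key.encode()).hexdigest()

    def lookup(self, signature: str):
        """Return the cached output path for a repeated job, or None."""
        if signature in self._table:
            self._table.move_to_end(signature)   # refresh LRU position
            return self._table[signature]
        return None

    def record(self, signature: str, output_path: str):
        """Remember where a finished job stored its results; evict the oldest if full."""
        self._table[signature] = output_path
        self._table.move_to_end(signature)
        if len(self._table) > self._max:
            self._table.popitem(last=False)      # bound the table size
```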


2021 ◽  
Vol 24 ◽  
pp. 26-32
Author(s):  
Fredrick Ishengoma

Vaccine requirements are becoming mandatory in several countries as public health experts and governments grow more concerned about the COVID-19 pandemic and its variants. Meanwhile, as vaccine requirements grow, so does the counterfeiting of vaccination documents: fake vaccination certificates are increasingly being sold online and on the dark web. Due to the nature of the COVID-19 pandemic, there is a need for robust authentication mechanisms that support touch-less technologies such as Near Field Communication (NFC). Thus, in this paper, a blockchain- and NFC-based COVID-19 Digital Immunity Certificate (DIC) system is proposed. The vaccination data are first encrypted with the Advanced Encryption Standard (AES) algorithm on the Hadoop Distributed File System (HDFS) and then uploaded to the blockchain. The proposed system, based on the amalgamation of NFC and blockchain technologies, can mitigate the issue of fake vaccination certificates. Furthermore, emerging issues in employing the proposed system are discussed along with future directions.
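A minimal sketch of the AES step described, assuming a vaccination record is encrypted before being written to HDFS and anchored on the blockchain. It uses AES-GCM from the `cryptography` package; the record fields, key management, and the storage step are assumptions, not details from the paper.

```python
# Sketch: encrypt a vaccination record with AES-GCM before storage (illustrative only).
import json, os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_record(record: dict, key: bytes) -> bytes:
    """Serialize and encrypt a vaccination record; returns nonce || ciphertext."""
    aesgcm = AESGCM(key)
    nonce = os.urandom(12)                       # 96-bit nonce recommended for GCM
    plaintext = json.dumps(record).encode()
    return nonce + aesgcm.encrypt(nonce, plaintext, None)

key = AESGCM.generate_key(bit_length=256)        # assumed to be managed by the issuer
blob = encrypt_record(
    {"holder_id": "…", "vaccine": "…", "dose": 2, "date": "2021-11-30"},  # hypothetical fields
    key,
)
# `blob` would then be written to HDFS and its hash recorded on the blockchain.
```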


Author(s):  
Shubh Goyal

Abstract: In the Hadoop environment, data may be loaded into and searched from local data nodes. Because the dataset may be vast, loading and finding data with a query is often difficult. We suggest a method for handling data in local nodes so that it does not overlap with data already acquired by the script. The main purpose of the query is to store information in a distributed environment and retrieve it quickly. We define a script to eliminate duplicate data redundancy when searching and loading data dynamically, with the Hadoop file system available in the distributed environment. Keywords: HDFS; Hadoop Distributed File System; replica; local; distributed; capacity; SQL; redundancy
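The abstract does not give the script itself; the following is an illustrative sketch of one common way to avoid loading overlapping records, by hashing each record and skipping hashes already seen before the merged file is pushed to HDFS. This is an assumption about the general technique, not the authors' script.

```python
# Illustrative hash-based de-duplication before loading records into HDFS.
import hashlib

def dedup_lines(paths, seen=None):
    """Yield each distinct line across the input files exactly once."""
    seen = set() if seen is None else seen
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                digest = hashlib.md5(line.encode()).hexdigest()
                if digest not in seen:           # skip data already loaded
                    seen.add(digest)
                    yield line

# with open("merged_for_hdfs.txt", "w", encoding="utf-8") as out:
#     out.writelines(dedup_lines(["part1.txt", "part2.txt"]))   # hypothetical inputs
```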


Author(s):  
Dr. K. B. V. Brahma Rao ◽  
◽  
Dr. R Krishnam Raju Indukuri ◽  
Dr. Suresh Varma Penumatsa ◽  
Dr. M. V. Rama Sundari ◽  
...  

The objective of comparing various dimensionality reduction techniques is to reduce feature sets so that attributes can be grouped effectively with less computational processing time and memory utilization. Such reduction algorithms can decrease the dimensionality of a dataset consisting of a huge number of interrelated variables while retaining as much of the variation present in the dataset as possible. In this paper we apply the Standard Deviation, Variance, Principal Component Analysis, Linear Discriminant Analysis, Factor Analysis, Positive Region, Information Entropy and Independent Component Analysis reduction algorithms, using the Hadoop Distributed File System, to massive patient datasets in order to achieve lossless data reduction and acquire the required knowledge. The experimental results demonstrate that the ICA technique can efficiently operate on massive datasets, eliminating irrelevant data without loss of accuracy and reducing both storage space and computation time compared to the other techniques.
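A short sketch of the ICA reduction step using scikit-learn's FastICA, with a PCA baseline for comparison; the stand-in data matrix, its dimensions, and the number of components are placeholders, not the paper's patient dataset or settings.

```python
# Sketch: reduce a high-dimensional feature matrix with ICA (and PCA for comparison).
import numpy as np
from sklearn.decomposition import FastICA, PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 60))            # stand-in for a patient-feature matrix

ica = FastICA(n_components=10, random_state=0)
X_ica = ica.fit_transform(X)                 # 60 interrelated features -> 10 components

pca = PCA(n_components=10, random_state=0)
X_pca = pca.fit_transform(X)                 # PCA baseline on the same data

print(X_ica.shape, X_pca.shape)              # (10000, 10) (10000, 10)
```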


Author(s):  
Viraaji Mothukuri ◽  
Sai S. Cheerla ◽  
Reza M. Parizi ◽  
Qi Zhang ◽  
Kim-Kwang Raymond Choo

Author(s):  
Anisha P Rodrigues ◽  
Roshan Fernandes ◽  
P. Vijaya ◽  
Satish Chander

Hadoop Distributed File System (HDFS) is developed to efficiently store and handle vast quantities of files in a distributed environment over a cluster of computers. The Hadoop cluster is formed from commodity hardware, which is inexpensive and easily available. Storing a large number of small files in HDFS consumes more memory and degrades performance, because small files place a heavy load on the NameNode. Thus, the efficiency of indexing and accessing small files on HDFS is improved by several techniques, such as archive files, New Hadoop Archive (New HAR), CombineFileInputFormat (CFIF), and sequence file generation. The archive file combines the small files into single blocks; the New HAR file combines the smaller files into a single large file; the CFIF module merges multiple files into a single split using the NameNode; and the sequence file combines all the small files into a single sequence. Indexing and accessing of small files in HDFS are evaluated using performance metrics such as processing time and memory usage. The experiments show that the sequence file generation approach is efficient compared to the other approaches, with a file access time of 1.5 s, memory usage of 20 KB in the multi-node setup, and a processing time of 0.1 s.
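A conceptual sketch of the packing idea behind HAR/sequence files: many small files are appended into one large container and located afterwards through an index, so the NameNode only tracks the container. This is a generic illustration in plain Python, not the actual Hadoop SequenceFile or HAR format.

```python
# Conceptual small-files packing: one container file plus an offset index.
import os

def pack(small_files, container_path):
    """Append each small file into one container; return {name: (offset, length)}."""
    index = {}
    with open(container_path, "wb") as out:
        for path in small_files:
            with open(path, "rb") as f:
                data = f.read()
            index[os.path.basename(path)] = (out.tell(), len(data))
            out.write(data)
    return index

def read_one(container_path, index, name):
    """Random access to one packed file via the index, without scanning the rest."""
    offset, length = index[name]
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```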


Author(s):  
Karwan Jameel Merceedi ◽  
Nareen Abdulla Sabry

In recent years, data and internet usage have grown rapidly, giving rise to big data. To address these problems, many software frameworks are used to increase the performance of distributed systems and to provide ample data storage. One of the most beneficial frameworks for utilizing data in distributed systems is Hadoop, which clusters machines and coordinates the work between them. Hadoop consists of two major components: the Hadoop Distributed File System (HDFS) and MapReduce (MR). With Hadoop we can, for example, process a large file in a distributed fashion and count how many times each word occurs in it. HDFS is designed to effectively store and stream colossal data sets to high-bandwidth user applications, and its differences from other file systems are significant: it targets low-cost hardware and is exceptionally fault-tolerant. In a vast cluster, thousands of computers host both directly attached storage and user programs, and the system scales with demand while remaining cost-effective at every size by distributing storage and computation across numerous servers. Building on these characteristics of HDFS, many researchers have worked to enhance the performance and efficiency of the file system, making it one of the most active areas in cloud systems. This paper reviews the essential investigations in this trend for researchers wishing to work on such systems. The basic ideas and features of the investigated experiments were taken into account to provide a robust comparison that simplifies selection for future researchers in this subject. The paper explains what Hadoop is, its architecture, how it works, and its performance in distributed systems, and it assesses each reviewed work and compares them with one another.
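As a concrete illustration of the word-count example the survey mentions, here is a minimal Hadoop Streaming mapper and reducer in Python (two separate scripts shown in one listing); the input/output paths and the streaming jar location are placeholders.

```python
# --- mapper.py ---------------------------------------------------------------
# Emits "<word>\t1" for every word on stdin; Hadoop Streaming feeds file splits here.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

# --- reducer.py --------------------------------------------------------------
# Sums the counts per word; Hadoop Streaming delivers lines sorted by key.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

A typical invocation (paths are placeholders) would pass both scripts to the hadoop-streaming jar with `-mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`.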


2021 ◽  
Vol 17 (2) ◽  
pp. 1-29
Author(s):  
Xiaolu Li ◽  
Zuoru Yang ◽  
Jinhong Li ◽  
Runhui Li ◽  
Patrick P. C. Lee ◽  
...  

We propose repair pipelining, a technique that speeds up the repair performance in general erasure-coded storage. By carefully scheduling the repair of failed data in small-size units across storage nodes in a pipelined manner, repair pipelining reduces the single-block repair time to approximately the same as the normal read time for a single block in homogeneous environments. We further design different extensions of repair pipelining algorithms for heterogeneous environments and multi-block repair operations. We implement a repair pipelining prototype, called ECPipe, and integrate it as a middleware system into two versions of Hadoop Distributed File System (HDFS) (namely, HDFS-RAID and HDFS-3) as well as Quantcast File System. Experiments on a local testbed and Amazon EC2 show that repair pipelining significantly improves the performance of degraded reads and full-node recovery over existing repair techniques.
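The following toy model illustrates why pipelining helps: conventional (star) repair pulls k full blocks through the requestor's single ingress link, whereas pipelining the repair in small slices along a chain of helpers makes the repair time approach one block's worth of transfer. The unit-bandwidth cost model and slice count are simplifying assumptions, not measurements from the ECPipe prototype.

```python
# Toy cost model: repair time in "slice transfers" over unit-bandwidth links
# for a single failed block in a k-of-n erasure code split into `slices` pieces.
def conventional_repair_time(k: int, slices: int) -> int:
    # The requestor downloads k full blocks through one ingress link.
    return k * slices

def pipelined_repair_time(k: int, slices: int) -> int:
    # Helpers form a chain and forward partially combined slices; after the
    # pipeline fills (k - 1 steps), one repaired slice arrives per step.
    return slices + (k - 1)

for k in (4, 6, 12):
    s = 32                                   # block split into 32 slices (assumed)
    print(f"k={k}: star={conventional_repair_time(k, s)}  pipelined={pipelined_repair_time(k, s)}")
```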


TEM Journal ◽  
2021 ◽  
pp. 806-814
Author(s):  
Yordan Kalmukov ◽  
Milko Marinov ◽  
Tsvetelina Mladenova ◽  
Irena Valova

In the age of big data, the amount of data that people generate and use on a daily basis has far exceeded the storage and processing capabilities of a single computer system. That motivates the use of distributed big data storage and processing systems such as Hadoop, which provides a reliable, horizontally scalable, fault-tolerant and efficient service based on the Hadoop Distributed File System (HDFS) and MapReduce. The purpose of this research is to experimentally determine whether (and to what extent) the network communication speed, the file replication factor, the files' sizes and their number, and the location of the HDFS client influence the performance of the HDFS read/write operations.
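A sketch of the kind of measurement such an experiment involves, using the `hdfs` (WebHDFS) Python client to time writes and reads while varying the replication factor. The NameNode URL, user, test path, and file size are placeholders, and a real study would also vary file count, sizes, and client location as the abstract describes.

```python
# Sketch: time HDFS write/read for several replication factors via WebHDFS.
import time
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hdfs")   # assumed endpoint
payload = b"x" * (64 * 1024 * 1024)                            # 64 MB test payload

for replication in (1, 2, 3):
    path = f"/bench/test_r{replication}.bin"                   # hypothetical path
    t0 = time.time()
    client.write(path, data=payload, overwrite=True, replication=replication)
    write_s = time.time() - t0

    t0 = time.time()
    with client.read(path) as reader:
        reader.read()
    read_s = time.time() - t0
    print(f"replication={replication} write={write_s:.2f}s read={read_s:.2f}s")
```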

