A Parallel Apriori-Transaction Reduction Algorithm Using Hadoop-MapReduce in Cloud

Author(s):  
A. L. Sayeth Saabith ◽  
Elankovan Sundararajan ◽  
Azuraliza Abu Bakar

The Apriori algorithm is a classical association rule mining (ARM) algorithm widely used for generating frequent itemsets. However, the original Apriori algorithm has some limitations: it must scan the dataset many times to discover all frequent itemsets, and it generates a huge number of candidate itemsets. To overcome these limitations, researchers have proposed many improvements to Apriori, such as improved candidate generation, mining without candidate generation, transaction reduction, partitioning, and sampling. When it comes to mining massive data, these algorithms fail to deliver efficiency because of limited processing capacity, storage capacity, and main memory. Therefore, parallel and distributed algorithms have been developed to perform large-scale ARM computations on multiple processors. However, most parallel and distributed frameworks suffer from the overhead of managing a distributed system, the lack of a high-level parallel programming language, and node failures. Hadoop-MapReduce is an efficient, scalable, and simplified programming model for massive data processing, and it is also available in cloud environments. Cloud computing offers vast computing resources and capacity to address big data challenges. Recently, many parallel algorithms have been proposed on Hadoop-MapReduce to enhance the performance of Apriori, but they share a drawback: multiple scans over the dataset are needed to generate candidate itemsets, which consumes more execution time. The aim of this study is to propose a parallel Transaction Reduction MapReduce Apriori algorithm (TRMR-Apriori), which removes unnecessary transaction values and transactions from the dataset in a parallel manner to overcome the above problems. The experiments show that TRMR-Apriori achieves better execution time for discovering frequent itemsets than previous sequential ARM algorithms, such as Apriori, AprioriTid, Eclat, and FP-Growth, and previous parallel algorithms, such as PApriori, MRApriori, and Modified Apriori, under different conditions on a homogeneous computing environment using the Hadoop-MapReduce platform in the cloud. Overall, TRMR-Apriori shows the strength to extract frequent itemsets from massive datasets in the cloud.
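
To make the transaction-reduction idea concrete, the following is a minimal, single-machine Python sketch of one pass (not the authors' TRMR-Apriori implementation): items outside the previous pass's frequent itemsets are pruned and transactions that become too short are dropped before candidate counting. Function names and the toy data are illustrative.

# Minimal sketch of one transaction-reduction pass of an Apriori-style
# MapReduce job, written as plain Python map/reduce-like functions.
from itertools import combinations
from collections import defaultdict

def map_phase(transactions, frequent_k_itemsets, k):
    """Prune items that appear in no frequent k-itemset and drop
    transactions that become too short to contain a (k+1)-itemset."""
    keep = set()
    for itemset in frequent_k_itemsets:
        keep.update(itemset)
    reduced = []
    for t in transactions:
        t = [item for item in t if item in keep]   # remove useless items
        if len(t) > k:                             # transaction reduction
            reduced.append(t)
    return reduced

def count_candidates(reduced, k):
    """Count (k+1)-item candidates, combining what the map side would emit
    and the reduce side would sum."""
    counts = defaultdict(int)
    for t in reduced:
        for cand in combinations(sorted(t), k + 1):
            counts[cand] += 1
    return counts

# toy usage
transactions = [["a", "b", "c"], ["a", "c"], ["b", "d"], ["a", "b", "c", "d"]]
frequent_1 = [("a",), ("b",), ("c",)]              # from the previous pass
reduced = map_phase(transactions, frequent_1, 1)
# keep candidate pairs with support >= 2
print({c: n for c, n in count_candidates(reduced, 1).items() if n >= 2})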

2022 ◽  
Vol 16 (3) ◽  
pp. 1-26
Author(s):  
Jerry Chun-Wei Lin ◽  
Youcef Djenouri ◽  
Gautam Srivastava ◽  
Yuanfa Li ◽  
Philip S. Yu

High-utility sequential pattern mining (HUSPM) has been a hot research topic in recent decades since it combines both sequential and utility properties to reveal more information and knowledge than traditional frequent itemset mining or sequential pattern mining. Several HUSPM works have been presented, but most of them rely on main memory to speed up mining performance. However, this assumption is not realistic and not suitable in large-scale environments since, in real industry, the size of the collected data is huge and it is impossible to fit the data into the main memory of a single machine. In this article, we first develop a parallel and distributed three-stage MapReduce model for mining high-utility sequential patterns from large-scale databases. Two properties are then developed to hold the correctness and completeness of the discovered patterns in the developed framework. In addition, two data structures, called the sidset and the utility-linked list, are utilized in the developed framework to accelerate the computation of the required patterns. From the results, we observe that the designed model performs well on large-scale datasets in terms of runtime, memory, efficiency with respect to the number of distributed nodes, and scalability, compared with the serial HUSP-Span approach.
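
As a rough illustration of the role a sidset plays (the article's exact sidset and utility-linked list structures are not reproduced here), the sketch below maps each item to the set of sequence identifiers containing it, so that utility computation only touches those sequences. The toy quantitative sequence database is made up.

# Illustrative sketch only: a "sidset" here is the set of sequence ids that
# contain a pattern, used to restrict utility computation to those sequences.
from collections import defaultdict

# toy quantitative sequence database: sid -> list of (item, utility) events
db = {
    0: [("a", 5), ("b", 2), ("c", 1)],
    1: [("a", 3), ("c", 4)],
    2: [("b", 6), ("c", 2)],
}

def build_sidsets(db):
    """Map each single item to the set of sequence ids containing it."""
    sidsets = defaultdict(set)
    for sid, events in db.items():
        for item, _ in events:
            sidsets[item].add(sid)
    return sidsets

def item_utility(db, item, sidset):
    """Sum the item's utility over only the sequences listed in its sidset."""
    return sum(u for sid in sidset for it, u in db[sid] if it == item)

sidsets = build_sidsets(db)
for item, sidset in sorted(sidsets.items()):
    print(item, sorted(sidset), item_utility(db, item, sidset))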


2018 ◽  
Vol 7 (3.8) ◽  
pp. 16
Author(s):  
Md Tahsir Ahmed Munna ◽  
Shaikh Muhammad Allayear ◽  
Mirza Mohtashim Alam ◽  
Sheikh Shah Mohammad Motiur Rahman ◽  
Md Samadur Rahman ◽  
...  

MapReduce has become a popular programming model for processing large-scale datasets with a parallel, distributed paradigm on a cluster. Hadoop MapReduce is especially needed for large-scale data such as big data processing. In this paper, we modify the Hadoop MapReduce algorithm and implement it to reduce processing time.
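
For readers unfamiliar with the model, the following is a generic, single-process illustration of the MapReduce contract the abstract refers to, not the authors' modified algorithm: a mapper emits (key, value) pairs, the framework groups them by key, and a reducer aggregates each group.

from collections import defaultdict

def mapper(line):
    # emit (word, 1) for every word in an input line
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # aggregate all values seen for one key
    return key, sum(values)

def run(lines):
    groups = defaultdict(list)        # stands in for Hadoop's shuffle phase
    for line in lines:
        for k, v in mapper(line):
            groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(run(["big data needs parallel processing", "big data on hadoop"]))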


2018 ◽  
Vol 26 (4) ◽  
pp. 535-567 ◽  
Author(s):  
Filomena Ferrucci ◽  
Pasquale Salza ◽  
Federica Sarro

The need to improve the scalability of Genetic Algorithms (GAs) has motivated research on Parallel Genetic Algorithms (PGAs), and different technologies and approaches have been used. Hadoop MapReduce represents one of the most mature technologies for developing parallel algorithms. Since parallel algorithms introduce communication overhead, the aim of the present work is to understand if, and possibly when, parallel GA solutions using Hadoop MapReduce show better performance than sequential versions in terms of execution time. Moreover, we are interested in understanding which PGA model is most effective among the global, grid, and island models. We empirically assessed the performance of these three parallel models with respect to a sequential GA on a software engineering problem, evaluating the execution time and the achieved speedup. We also analysed the behaviour of the parallel models in relation to the overhead produced by the use of Hadoop MapReduce and the GAs' computational effort, which gives a more machine-independent measure of these algorithms. We exploited three problem instances to differentiate the computational load and three cluster configurations based on 2, 4, and 8 parallel nodes. Moreover, we estimated the costs of executing the experimentation on a potential cloud infrastructure, based on the pricing of the major commercial cloud providers. The empirical study revealed that the PGA based on the island model outperforms the other parallel models and the sequential GA for all the considered instances and clusters. Using 2, 4, and 8 nodes, the island model achieves an average speedup over the three datasets of 1.8, 3.4, and 7.0 times, respectively. Hadoop MapReduce has a set of constraints that need to be considered during the design and implementation of parallel algorithms. The overhead of data store (i.e., HDFS) accesses, communication, and latency requires solutions that reduce data store operations. For this reason, the island model is more suitable for PGAs than the global and grid models, also in terms of cost when executed on a commercial cloud provider.
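
The sketch below is a minimal, single-process illustration of the island model discussed above (no Hadoop, illustrative parameters and fitness function only): each island evolves its own population independently and periodically migrates its best individual to a neighbour, which is what keeps communication overhead low.

import random

def fitness(ind):                      # OneMax toy problem: count of 1 bits
    return sum(ind)

def evolve(pop, generations):
    """Evolve one island independently for a number of generations."""
    for _ in range(generations):
        pop = sorted(pop, key=fitness, reverse=True)
        parents = pop[: len(pop) // 2]
        children = []
        while len(children) < len(pop):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]  # one-point crossover
            i = random.randrange(len(child))
            child[i] ^= 1              # bit-flip mutation
            children.append(child)
        pop = children
    return pop

random.seed(0)
islands = [[[random.randint(0, 1) for _ in range(20)] for _ in range(10)]
           for _ in range(4)]          # 4 islands, 10 individuals each

for epoch in range(5):                 # each epoch: independent evolution + migration
    islands = [evolve(pop, generations=10) for pop in islands]
    for i, pop in enumerate(islands):  # ring migration of each island's best individual
        best = max(pop, key=fitness)
        islands[(i + 1) % len(islands)].append(best[:])

print(max(fitness(ind) for pop in islands for ind in pop))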


2021 ◽  
pp. 1-18
Author(s):  
Salahaldeen Rababa ◽  
Amer Al-Badarneh

Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has some limitations in processing heterogeneous datasets because of the large number of redundant intermediate records that are transferred through the network. Several filtering techniques have been developed to improve join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced that perform filter creation and redundant-record elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared with state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively, and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
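
As a hedged illustration of the general Bloom-filter join idea underlying such algorithms (not the paper's adaptive variants), the sketch below hashes the join keys of the smaller dataset into a bit array and uses it on the map side to drop records from the larger dataset that cannot possibly join, before any data reach the shuffle.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, key):
        # derive several bit positions per key from salted SHA-256 digests
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):      # false positives possible, no false negatives
        return all(self.bits[p] for p in self._positions(key))

small = [("k1", "dept-A"), ("k3", "dept-B")]
large = [("k1", "rec1"), ("k2", "rec2"), ("k3", "rec3"), ("k4", "rec4")]

bf = BloomFilter()
for key, _ in small:                   # build the filter from the smaller side
    bf.add(key)

# map side over the large dataset: emit only records that may join
intermediate = [(k, v) for k, v in large if bf.might_contain(k)]
print(intermediate)                    # k2 and k4 are filtered out before the shuffle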


Author(s):  
Utsav Upadhyay ◽  
Geeta Sikka

The MapReduce programming model was developed and designed for the Google File System to efficiently process large distributed datasets. The open-source implementation of the Google project was called Apache Hadoop. The Hadoop architecture comprises the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. HDFS provides support to Hadoop for effectively managing large datasets over the cluster, and MapReduce enables efficient processing of large-scale distributed datasets. MapReduce incorporates strategies to re-execute speculative tasks on other nodes in order to finish computation quickly, enhancing the overall Quality of Service (QoS). Several mechanisms have been suggested over Hadoop's default scheduler, such as Longest Approximate Time to End (LATE), the Self-Adaptive MapReduce scheduler (SAMR), and the Enhanced Self-Adaptive MapReduce scheduler (ESAMR), to improve speculative re-execution of tasks over the cluster. This paper presents an efficient speculative task detection algorithm for MapReduce schedulers. Our studies suggest the importance of keeping regular track of each node's performance in order to re-execute speculative tasks more efficiently. We have successfully improved the QoS offered by Hadoop clusters in terms of reducing the detection time of speculative tasks (~15%) and improving the accuracy of correct speculative task detection (~10%).
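
The following is a rough, LATE-style illustration of speculative-task detection (assumed progress values and thresholds, not the paper's algorithm): each task's remaining time is estimated from its progress rate, and tasks progressing unusually slowly are flagged for re-execution.

from statistics import mean, pstdev

# task_id -> (progress in [0, 1], elapsed seconds); illustrative values only
tasks = {"t1": (0.90, 100), "t2": (0.85, 110), "t3": (0.30, 120), "t4": (0.80, 95)}

def estimated_time_left(progress, elapsed):
    rate = progress / elapsed                 # progress per second so far
    return (1.0 - progress) / rate

rates = {tid: p / e for tid, (p, e) in tasks.items()}
slow_cutoff = mean(rates.values()) - pstdev(rates.values())

speculative = [
    tid for tid in tasks
    if rates[tid] < slow_cutoff               # progressing unusually slowly
]
# re-execute the stragglers expected to finish last first
speculative.sort(key=lambda tid: estimated_time_left(*tasks[tid]), reverse=True)
print(speculative)                            # candidates to re-execute on another node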


Computers ◽  
2018 ◽  
Vol 7 (3) ◽  
pp. 44 ◽  
Author(s):  
Thinh Cao ◽  
Koichi Yamada ◽  
Muneyuki Unehara ◽  
Izumi Suzuki ◽  
Do Nguyen

The paper discusses the use of parallel computation to obtain rough set approximations from large-scale information systems where missing data exist in both condition and decision attributes. To date, many studies have focused on missing condition data, but very few have accounted for missing decision data, especially in growing datasets. One of the approaches for dealing with missing data in condition attributes is called twofold rough approximations. The paper aims to extend this approach to deal with missing data in the decision attribute. In addition, computing twofold rough approximations is computationally intensive, so the approach is not suitable when input datasets are large. We propose parallel algorithms to compute twofold rough approximations on large-scale datasets. Our method is based on MapReduce, a distributed programming model for processing large-scale data. We introduce the original sequential algorithm first and then present the parallel version. A comparison between the two approaches through experiments shows that our proposed parallel algorithms are suitable for, and perform efficiently on, large-scale datasets that have missing data in condition and decision attributes.
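
As background, the sketch below computes classical lower and upper rough approximations by grouping objects into equivalence classes, which is naturally a map/group-by step; it ignores the paper's twofold treatment of missing values, and the decision table is made up.

from collections import defaultdict

# object id -> (condition attribute values, decision)
table = {
    1: (("sunny", "hot"), "no"),
    2: (("sunny", "hot"), "yes"),
    3: (("rain", "mild"), "yes"),
    4: (("rain", "mild"), "yes"),
}

# "map" step: group objects by their condition-attribute vector
classes = defaultdict(set)
for obj, (cond, _) in table.items():
    classes[cond].add(obj)

# concept to approximate: objects whose decision is "yes"
target = {obj for obj, (_, d) in table.items() if d == "yes"}

lower = set().union(*(c for c in classes.values() if c <= target))
upper = set().union(*(c for c in classes.values() if c & target))
print("lower:", sorted(lower))   # objects certainly in the concept -> [3, 4]
print("upper:", sorted(upper))   # objects possibly in the concept -> [1, 2, 3, 4]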


Author(s):  
Georgi Derluguian

The author develops ideas about the origin of social inequality during the evolution of human societies and reflects on the possibilities of its overcoming. What makes human beings different from other primates is a high level of egalitarianism and altruism, which contributed to more successful adaptability of human collectives at early stages of the development of society. The transition to agriculture, coupled with substantially increasing population density, was marked by the emergence and institutionalisation of social inequality based on the inequality of tangible assets and symbolic wealth. Then, new institutions of warfare came into existence, and they were aimed at conquering and enslaving the neighbours engaged in productive labour. While exercising control over nature, people also established and strengthened their power over other people. Chiefdom as a new type of polity came into being. Elementary forms of power (political, economic and ideological) served as a basis for the formation of early states. The societies in those states were characterised by social inequality and cruelties, including slavery, mass violence and numerous victims. Nowadays, the old elementary forms of power that are inherent in personalistic chiefdom are still functioning along with modern institutions of public and private bureaucracy. This constitutes the key contradiction of our time, which is the juxtaposition of individual despotic power and public infrastructural one. However, society is evolving towards an ever more efficient combination of social initiatives with the sustainability and viability of large-scale organisations.


Genetics ◽  
2001 ◽  
Vol 159 (4) ◽  
pp. 1765-1778
Author(s):  
Gregory J Budziszewski ◽  
Sharon Potter Lewis ◽  
Lyn Wegrich Glover ◽  
Jennifer Reineke ◽  
Gary Jones ◽  
...  

Abstract We have undertaken a large-scale genetic screen to identify genes with a seedling-lethal mutant phenotype. From screening ~38,000 insertional mutant lines, we identified >500 seedling-lethal mutants, completed cosegregation analysis of the insertion and the lethal phenotype for >200 mutants, molecularly characterized 54 mutants, and provided a detailed description for 22 of them. Most of the seedling-lethal mutants seem to affect chloroplast function because they display altered pigmentation and affect genes encoding proteins predicted to have chloroplast localization. Although a high level of functional redundancy in Arabidopsis might be expected because 65% of genes are members of gene families, we found that 41% of the essential genes found in this study are members of Arabidopsis gene families. In addition, we isolated several interesting classes of mutants and genes. We found three mutants in the recently discovered nonmevalonate isoprenoid biosynthetic pathway and mutants disrupting genes similar to Tic40 and tatC, which are likely to be involved in chloroplast protein translocation. Finally, we directly compared T-DNA and Ac/Ds transposon mutagenesis methods in Arabidopsis on a genome scale. In each population, we found only about one-third of the insertion mutations cosegregated with a mutant phenotype.


1979 ◽  
Vol 6 (2) ◽  
pp. 70-72
Author(s):  
T. A. Coffelt ◽  
F. S. Wright ◽  
J. L. Steele

Abstract A new method of harvesting and curing breeder's seed peanuts in Virginia was initiated that would 1) reduce the labor requirements, 2) maintain a high level of germination, 3) maintain varietal purity at 100%, and 4) reduce the risk of frost damage. Three possible harvesting and curing methods were studied. The traditional stack-pole method satisfied the latter 3 objectives, but not the first. The windrow-combine method satisfied the first 2 objectives, but not the last 2. The direct harvesting method satisfied all four objectives. The experimental equipment and curing procedures for direct harvesting had been developed but not tested on a large scale for seed harvesting. This method has been used in Virginia to produce breeder's seed of 3 peanut varieties (Florigiant, VA 72R, and VA 61R) over five years. Compared with the stack-pole method, labor requirements have been reduced, satisfactory levels of germination and varietal purity have been obtained, and the risk of frost damage has been minimized.

