spark framework

Author(s):  
Bharathi Garimella ◽  
G. V. S. N. R. V. Prasad ◽  
M. H. M. Krishna Prasad

Churn prediction from telecom data has received great attention because of the increasing number of telecom providers, but inconsistent, sparse, and very large data make the prediction complicated and challenging. Hence, this research proposes an effective churn prediction mechanism, the adaptive firefly-spider optimization (adaptive FSO) algorithm, to predict churn from telecom data. The proposed method uses telecom data, a trending research domain for churn prediction, and increases classification accuracy. The adaptive FSO algorithm is designed by integrating spider monkey optimization (SMO), the firefly optimization algorithm (FA), and an adaptive concept. The input data is initially given to the master node of the Spark framework. Feature selection is carried out using Kendall’s correlation to select the appropriate features for further processing. Then, the selected unique features are given to the master node to perform churn prediction. Here, churn prediction is made using a deep convolutional neural network (DCNN), which is trained by the proposed adaptive FSO algorithm. The developed model was evaluated with the Dice coefficient, accuracy, and the Jaccard coefficient while varying the training data percentage and the selected features. The proposed adaptive FSO-based DCNN showed improved results, with a Dice coefficient of 99.76%, an accuracy of 98.65%, and a Jaccard coefficient of 99.52%.
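The feature selection step can be illustrated with a minimal sketch. It assumes the simplest variant of Kendall's correlation (tau-a, which ignores ties) and a hypothetical cutoff of 0.5 on the absolute correlation; the abstract specifies neither, and the column names and values below are invented toy data, not the authors' dataset.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:      # pair ordered the same way in x and y
            score += 1
        elif product < 0:    # pair ordered in opposite ways
            score -= 1
        # product == 0 means a tie, which tau-a counts as neither
    n_pairs = len(x) * (len(x) - 1) // 2
    return score / n_pairs


def select_features(columns, target, threshold=0.5):
    """Keep features whose |tau| with the target reaches the (assumed) threshold."""
    return [name for name, values in columns.items()
            if abs(kendall_tau_a(values, target)) >= threshold]


# Hypothetical toy data, for illustration only.
columns = {
    "monthly_minutes": [10, 20, 30, 40, 50],
    "support_calls": [5, 4, 3, 2, 1],
    "random_noise": [3, 1, 4, 1, 5],
}
churn_score = [1, 2, 3, 4, 5]
selected = select_features(columns, churn_score)
print(selected)
```

Both strongly correlated columns are kept regardless of sign, since a strong negative rank correlation is as informative as a positive one.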


2021 ◽  
Author(s):  
Alan L. Nunes ◽  
Alba Cristina Magalhaes Alves de Melo ◽  
Cristina Boeres ◽  
Daniel de Oliveira ◽  
Lúcia Maria de Assumpção Drummond

In this paper, we developed a Spark application, named Diff Sequences Spark, which compares 540 SARS-CoV-2 sequences from South America in the Amazon EC2 Cloud, generating as output the positions where the differences occur. We analyzed the performance of the proposed application on selected memory-optimized and storage-optimized virtual machines (VMs) in the on-demand and spot markets. The memory-optimized VMs outperformed the storage-optimized ones in both execution time and financial cost. Regarding the markets, Diff Sequences Spark reduced the average execution times and monetary costs when using spot VMs compared to their respective on-demand VMs, even in scenarios with several spot revocations, benefiting from the low-overhead fault tolerance of the Spark framework.
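A minimal sketch of the core comparison, reporting the positions where two sequences differ, is given here in plain Python. It assumes the sequences are already aligned to equal length, and the function names and toy sequences are hypothetical; the abstract does not describe how Diff Sequences Spark handles alignment or distributes the pairs.

```python
from itertools import combinations


def diff_positions(seq_a, seq_b):
    """Return the 0-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def all_pairs_diff(sequences):
    """Compare every unordered pair of named sequences."""
    named = sorted(sequences.items())
    return {(name_a, name_b): diff_positions(seq_a, seq_b)
            for (name_a, seq_a), (name_b, seq_b) in combinations(named, 2)}


# Hypothetical toy sequences, for illustration only.
toy = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGTAC"}
result = all_pairs_diff(toy)
print(result)
```

Each pair is independent of the others, which is what makes this kind of workload easy to spread across the workers of a cluster.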


2021 ◽  
Vol 7 (1) ◽  
pp. 34
Author(s):  
Marco Martínez-Sánchez ◽  
Roberto R. Expósito ◽  
Juan Touriño

Continuous development in Next Generation Sequencing (NGS) technologies has allowed researchers to obtain larger genetic samples in less time, so it is important to improve the existing algorithms that enhance the quality of the generated reads. In this work, we present a Big Data tool built on the open-source Apache Spark framework that executes validated error-correction algorithms with improved performance. The experimental evaluation conducted on a multi-core cluster has shown significant improvements in execution times, providing a maximum speedup of 9.5 over existing error correction tools when processing an NGS dataset with 25 million reads.


2021 ◽  
Author(s):  
Maryam Bagheri ◽  
Shahram Jamali ◽  
Reza Fotohi

With the development of technology and Internet access available to everyone everywhere, interest in getting news from newspapers and other traditional media is decreasing. The popularity of news websites is therefore rising as newspapers move to electronic versions. News websites can be accessed from anywhere, i.e., any country, city, or region. Presenting news according to where the reader is from is thus a possible research area: faced with a wide variety of news topics, readers prefer websites whose home pages more often show the news they are interested in. Based on this idea, we present a technique to find the favorite topics of Twitter users in certain geographical districts, giving news websites a way to increase their popularity. In this work we processed tweets. Tweets appear to be small data, but we found that processing them takes a lot of time because the algorithm is repeated many times and many searches must be performed. We therefore categorized our work as big data. To address this problem, we developed our work in the Spark framework. Our technique includes two phases: a Feature Extraction Phase and a Topic Discovery Phase. Our analysis shows that this technique achieves an accuracy between 68% and 76% across three evaluations: 3-fold, 5-fold, and 10-fold.
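The 3-fold, 5-fold, and 10-fold evaluations presumably refer to k-fold cross-validation, although the abstract does not say so explicitly. Under that assumption, a minimal sketch of how the data would be divided is shown below; the function name and the sample count of 20 are hypothetical, and no shuffling is applied.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) index pairs.

    Every sample appears in exactly one test fold. Fold sizes differ
    by at most one when n_samples is not divisible by k.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    indices = list(range(n_samples))
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


# Hypothetical sample count, for illustration only.
for k in (3, 5, 10):
    sizes = [len(test) for _, test in k_fold_indices(20, k)]
    print(k, sizes)
```

The reported accuracy for each setting would then be the average over the k held-out folds.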


2021 ◽  
Vol 2010 (1) ◽  
pp. 012067
Author(s):  
Changchao Dong ◽  
Yanbin Jiao ◽  
Youyong Chen ◽  
Lanxian Feng

2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Xijun Hong

With the rapid development of wearable and sensing technology, different kinds of sports information can now be recorded as useful big data. Applying big data technology has become a pressing challenge in present-day basketball training, where it can improve the effect of basketball analysis. In this study, we propose using the Spark framework, which is based on in-memory computing, for big data processing. First, we use a new swarm intelligence optimization method, the cuckoo search algorithm, because it has fewer parameters, a powerful global search ability, and fast convergence. Second, we apply the traditional K-clustering algorithm to improve the final output using clustering means in the Spark distributed environment. Last, we examine the aspects that could lead to high-pressure game circumstances to study professional athletes’ defensive performance. Both recruiters and trainers may use our technique to better understand players’ essential qualities and, eventually, to assess and improve a team’s performance. The experimental findings reveal that the suggested approach outperforms previous methods in terms of clustering performance and practical utility. It has the greatest influence on the training effect for shooting while moving, yielding complementary outcomes in the training effect.


2021 ◽  
Vol 18 (3) ◽  
pp. 42-62
Author(s):  
Anilkumar V Brahmane ◽  
Chaitanya B Krishna

The novelty in big data is rising day by day, to the point that existing software tools have difficulty managing it. Furthermore, the rate of imbalanced data in huge datasets is a key constraint for the research industry. Thus, this paper proposes a novel technique for handling big data using the Spark framework. The proposed technique classifies big data in two steps, feature selection and classification, which are performed in the initial nodes of the Spark architecture. The proposed optimization algorithm, named the rider chaotic biogeography optimization (RCBO) algorithm, integrates the rider optimization algorithm (ROA) and the standard chaotic biogeography-based optimization (CBBO). The proposed RCBO-based deep stacked auto-encoder using the Spark framework handles big data effectively to achieve effective big data classification. Here, the proposed RCBO is employed for selecting suitable features from the massive dataset.


Author(s):  
Gangadhara Rao Kommu

TeraSort is one of Hadoop’s widely used benchmarks. Hadoop’s distribution contains both the input generator and the sorting implementation: TeraGen generates the input and TeraSort conducts the sorting. We focus on comparing the TeraSort algorithm on different distributed platforms with different resource configurations. As measures of efficiency we consider Compute Time, Data Read, Data Write, and Speedup. We have conducted experiments using Hadoop MapReduce and Spark (Java). We empirically evaluate the performance of the TeraSort algorithm on Amazon EC2 Machine Images, and demonstrate that it achieves a 3.95× to 2.4× speedup, compared with TeraSort, for typical settings of interest.
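The idea behind TeraSort, sampling keys to pick split points and then sorting range partitions independently, can be sketched in a single process as follows. This is a simplified illustration under stated assumptions, not the Hadoop or Spark implementation: the function name, the sample size, and the fixed seed are hypothetical, and real TeraSort works on key-value records spread across machines.

```python
import bisect
import random


def range_partition_sort(records, num_partitions, sample_size=100, seed=0):
    """Sort by sampling split points, range-partitioning, then sorting each part.

    Returns a list of sorted partitions whose concatenation is globally sorted.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not records:
        return [[] for _ in range(num_partitions)]

    # 1. Sample the keys and take evenly spaced split points from the sample.
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    splits = [sample[len(sample) * i // num_partitions]
              for i in range(1, num_partitions)]

    # 2. Route every record to the partition that owns its key range.
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        partitions[bisect.bisect_right(splits, record)].append(record)

    # 3. Sort each partition independently (in a cluster, one task per range).
    return [sorted(part) for part in partitions]


# Hypothetical toy input, for illustration only.
data = [(i * 37) % 101 for i in range(200)]
parts = range_partition_sort(data, num_partitions=4)
flat = [value for part in parts for value in part]
print(flat == sorted(data))
```

Because the partitions cover disjoint, ordered key ranges, no merge step is needed after the per-partition sorts; the quality of the sample only affects how evenly the work is balanced.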

