Research on Clustering Analysis of Big Data

2012 ◽  
Vol 6-7 ◽  
pp. 82-87 ◽  
Author(s):  
Yuan Ming Yuan ◽  
Chan Le Wu

The volume of Big Data is too large to be processed with traditional clustering analysis techniques: running times are long, and computability becomes a problem. After analyzing the k-means clustering algorithm, a new algorithm is proposed. The parallelizable part of k-means is identified, and the algorithm is improved by redesigning its workflow within the MapReduce framework, which resolves the problems above. Experiments show that the new algorithm is feasible and effective.
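The parallelizable step the abstract refers to can be sketched as one map/reduce round: each mapper independently assigns a point to its nearest centroid, and the reducer averages each group into a new centroid. A minimal pure-Python illustration with toy points, not the paper's implementation:

```python
# Minimal sketch of one k-means iteration in MapReduce style.
# Illustrative only -- the data and centroids are hypothetical.
from collections import defaultdict

def mapper(point, centroids):
    """Emit (nearest-centroid-index, point); runs independently per point."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, centroid))
             for centroid in centroids]
    return dists.index(min(dists)), point

def reducer(pairs):
    """Average the points assigned to each centroid to get new centroids."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    return {idx: tuple(sum(xs) / len(xs) for xs in zip(*pts))
            for idx, pts in groups.items()}

points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.8)]
centroids = [(0.0, 0.0), (9.0, 9.0)]
pairs = [mapper(p, centroids) for p in points]   # map phase
new_centroids = reducer(pairs)                   # reduce phase
print(new_centroids)                             # one mean per cluster
```

In a real Hadoop job the map calls run on separate nodes and the shuffle groups pairs by key, but the arithmetic is exactly this.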

Author(s):  
Zhanqiu Yu

To explore applications in Internet of Things (IoT) logistics systems, an IoT big data clustering analysis algorithm based on K-means is discussed. First, using complex-event relations and processing technology, IoT big data processing is transformed into the extraction and analysis of complex relational schemas, which helps simplify the processing complexity of big data in the IoT. The traditional K-means algorithm is then optimized and improved to fit the demands of large-scale RFID data networks, and a K-means cluster analysis is implemented on a Hadoop cloud cluster platform. In addition, building on the traditional clustering algorithm, a center-point selection technique suited to clustering RFID IoT data is adopted. The results show that clustering efficiency is improved to some extent. Finally, an RFID IoT clustering analysis prototype system is designed and implemented, further demonstrating feasibility.
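The abstract does not specify the RFID-specific center-point selection technique, but a common distance-based way to choose initial centers is k-means++-style seeding, sketched here purely as a generic illustration:

```python
# Hedged sketch of distance-weighted center selection (k-means++-style
# seeding). A generic textbook method, not the paper's RFID technique.
import random

def seed_centers(points, k, rng=random.Random(0)):
    centers = [rng.choice(points)]
    while len(centers) < k:
        # squared distance from each point to its nearest chosen center
        d2 = [min(sum((p - c) ** 2 for p, c in zip(pt, ctr))
                  for ctr in centers) for pt in points]
        # sample the next center with probability proportional to d2,
        # so far-away points are favoured
        r = rng.random() * sum(d2)
        acc = 0.0
        for pt, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(pt)
                break
    return centers

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers = seed_centers(pts, 2)
print(centers)
```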


2014 ◽  
Vol 989-994 ◽  
pp. 2047-2050
Author(s):  
Ying Jie Wang

Data mining is the general methodology for retrieving useful information from big data. Clustering analysis is a mathematical method of classification for unsupervised machine learning and can be adopted for data classification in data mining. This paper combines the clustering process with fuzzy methods and derives a specialized clustering algorithm based on the fast fuzzy c-means (FFCM) method. In summary, the paper illustrates the adoption of a series of fuzzy clustering methods in data mining. These methods improve computational efficiency, as their convergence is fast. The methodology presented is significant for information retrieval over big data.


Author(s):  
Marwan B. Mohammed ◽  
Wafaa AL-Hameed

Clustering analysis techniques play an important role in data mining. Although several clustering techniques already exist, work continues either to improve the efficiency of the clustering process or to propose new techniques that allocate objects to clusters so that two objects in the same cluster are more similar than two objects in different clusters, while avoiding duplicating the same objects across groups and covering the data as completely as possible. This paper makes two contributions. The first is a new algorithm, named the MB Algorithm, that collects unlabeled data and places it into appropriate groups. The second is the construction of a lexical chain of sentences (LCS) based on semantically similar sentences, which differs from the traditional word-based lexical chain (LCW). The results show that the MB algorithm generally outperforms both the hierarchical clustering algorithm and the K-means algorithm.
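Comparing an algorithm against hierarchical clustering and K-means implies an external quality measure. One simple, commonly used score is cluster purity; the sketch below uses hypothetical labels and is not the paper's evaluation code:

```python
# Cluster purity: fraction of items covered by each cluster's majority
# true label. A standard external validation measure, shown on toy data.
from collections import Counter

def purity(clusters, truth):
    correct = 0
    for cid in set(clusters):
        members = [t for c, t in zip(clusters, truth) if c == cid]
        correct += Counter(members).most_common(1)[0][1]  # majority count
    return correct / len(clusters)

score = purity([0, 0, 1, 1, 1], ['a', 'a', 'b', 'b', 'a'])
print(score)  # 0.8: one item sits in a cluster whose majority label differs
```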


2019 ◽  
Vol 29 (1) ◽  
pp. 1496-1513 ◽  
Author(s):  
Omkaresh Kulkarni ◽  
Sudarson Jena ◽  
C. H. Sanjay

Abstract Recent advancements in information technology and the web have increased the volume of data used in day-to-day life. The result is the big data era, and the complexity of big data analysis has made it a key research issue. This paper presents a technique called FPWhale-MRF for big data clustering using the MapReduce framework (MRF), proposing two clustering algorithms. In FPWhale-MRF, the mapper function estimates cluster centroids using the Fractional Tangential-Spherical Kernel clustering algorithm, developed by integrating fractional theory into a Tangential-Spherical Kernel clustering approach. The reducer combines the mapper outputs to find the optimal centroids for clustering using the proposed Particle-Whale (P-Whale) algorithm, which combines the Whale Optimization Algorithm with Particle Swarm Optimization to improve clustering performance. Two datasets, namely the localization and skin segmentation datasets, are used for experimentation, and performance is evaluated with two metrics: clustering accuracy and DB-index. The maximum accuracy attained by the proposed FPWhale-MRF technique is 87.91% and 90% for the localization and skin segmentation datasets, respectively, demonstrating its effectiveness in big data clustering.
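One of the two evaluation metrics named above, the DB-index (Davies-Bouldin index), is a standard measure where lower values indicate tighter, better-separated clusters. A minimal sketch of the standard definition on toy clusters, not the paper's code:

```python
# Davies-Bouldin index: average, over clusters, of the worst-case ratio of
# combined within-cluster scatter to between-centroid distance.
def db_index(clusters):
    """clusters: list of clusters, each a list of points (tuples)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    cents = [tuple(sum(x) / len(pts) for x in zip(*pts)) for pts in clusters]
    scatter = [sum(dist(p, c) for p in pts) / len(pts)
               for pts, c in zip(clusters, cents)]
    k = len(clusters)
    return sum(max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                   for j in range(k) if j != i)
               for i in range(k)) / k

score = db_index([[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]])
print(score)  # small value: tight clusters, far-apart centroids
```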


2018 ◽  
Vol 18 (2) ◽  
pp. 98-111
Author(s):  
M. Senthilkumar

Abstract In modern times there is an increasing trend of applications handling Big Data; however, coping with Big Data remains extremely difficult, and the MapReduce framework has recently received serious attention in this context. The aim of this study is task scheduling over Big Data using Hadoop. Initially, the tasks are prioritized with the help of the k-means clustering algorithm; then the MapReduce framework is employed, with the available resources optimally selected in the map phase using an optimization technique. The proposed method uses a combined FireFly and BAT algorithm (FFABAT) to choose the optimal resource with minimum cost. The bat-inspired algorithm is a meta-heuristic optimization method developed by Xin-She Yang (2010), based on the echolocation behaviour of microbats with variable pulse emission rates and loudness. Finally, the tasks are scheduled to the optimal resource in the reduce phase and stored in the cloud. The performance of the algorithm is analysed in terms of total cost, time and memory utilization.
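The bat algorithm's core update, as described by Yang (2010), adjusts each bat's velocity toward the best solution found, scaled by a random pulse frequency. A stripped-down sketch minimising a toy cost function; the loudness/pulse-rate adaptation and the FFABAT hybridization are omitted, and all parameters here are assumptions:

```python
# Minimal bat-algorithm core loop (frequency-scaled velocity update only).
# Toy illustration of the update equations, not the FFABAT scheduler.
import random

def bat_minimise(cost, dim=2, n_bats=10, iters=100,
                 fmin=0.0, fmax=2.0, rng=random.Random(1)):
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    best = min(x, key=cost)[:]
    for _ in range(iters):
        for i in range(n_bats):
            f = fmin + (fmax - fmin) * rng.random()   # pulse frequency
            for d in range(dim):
                v[i][d] += (x[i][d] - best[d]) * f    # pull relative to best
                x[i][d] += v[i][d]
            if cost(x[i]) < cost(best):               # keep best-so-far
                best = x[i][:]
    return best

best = bat_minimise(lambda p: sum(t * t for t in p))  # sphere cost
print(best)
```

In the scheduling setting of the abstract, the cost function would instead score a task-to-resource assignment; that mapping is not given in the abstract.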


2021 ◽  
Vol 11 (13) ◽  
pp. 5999
Author(s):  
Diego A. Camacho-Hernández ◽  
Victor E. Nieto-Caballero ◽  
José E. León-Burguete ◽  
Julio A. Freyre-González

Identifying groups that share common features among datasets through clustering analysis is a typical problem in many fields of science, particularly in post-omics and systems biology research. In this respect, quantifying how well a measure can cluster or organize intrinsic groups is important, since there is currently no statistical evaluation of how ordered the resulting clustered vector is, or how much noise is embedded in it. Much of the literature focuses on how well the clustering algorithm orders the data, with several measures for external and internal statistical validation; but no score has been developed to statistically quantify the noise in a vector arranged by a clustering algorithm, i.e., how much of the clustering is due to randomness. Here, we present a quantitative methodology, based on autocorrelation, to assess this problem.
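The underlying intuition can be shown with a lag-1 autocorrelation: a label vector arranged by a good clustering has long runs of equal labels and hence high autocorrelation, while a randomly shuffled one scores near zero. A toy sketch of that idea only, not the paper's methodology:

```python
# Lag-1 autocorrelation of a label vector: high when labels form ordered
# runs, near zero when the arrangement is random. Toy illustration.
def autocorr_lag1(xs):
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 1.0                      # constant vector: perfectly ordered
    return sum((xs[i] - mean) * (xs[i + 1] - mean)
               for i in range(n - 1)) / var

ordered = [0] * 10 + [1] * 10           # well-clustered arrangement
print(autocorr_lag1(ordered))           # close to 1; shuffling the same
                                        # labels drives the score toward 0
```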


2021 ◽  
pp. 1-10
Author(s):  
Meng Huang ◽  
Shuai Liu ◽  
Yahao Zhang ◽  
Kewei Cui ◽  
Yana Wen

The integration of artificial intelligence (AI) technology and school education has become a future trend and an important driving force for the development of education. With the advent of the big data era, the relationships in students' learning-status data are increasingly nonlinear, and analysis with AI techniques shows that students' living habits are closely related to their academic performance. In this paper, through an investigation of the living habits and learning conditions of more than 2000 students across 10 cohorts at the Information College of the Institute of Disaster Prevention, we used a hierarchical clustering algorithm to classify the nearly 180,000 records collected; used Echarts + iView + GIS big data visualization technology with JavaScript development to dynamically display students' life tracks and learning information on a map; and applied Three-Dimensional ArcGIS for JS API technology to show the campus network infrastructure. A training model was then established from historical academic results, life trajectories, graduates' salaries, school infrastructure and other information, combined with an artificial-intelligence Back Propagation neural network algorithm. Analysis of the training results showed that students' academic performance was related to reasonable amounts of laboratory study time, dormitory stay time, physical exercise time and social entertainment time. Finally, the system can intelligently predict students' academic performance and give reasonable suggestions according to the established prediction model. The realization of this project can provide technical support for university educators.
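Hierarchical (agglomerative) clustering, used above to classify the records, repeatedly merges the two closest clusters until the desired number remains. A minimal single-linkage sketch on toy one-dimensional data; the actual record features are not given in the abstract:

```python
# Agglomerative clustering with single linkage on toy 1-D values.
# O(n^3) brute-force merging -- fine for a sketch, not for 180,000 records.
def hierarchical(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

result = hierarchical([1.0, 1.1, 5.0, 5.2, 9.9], 3)
print(result)  # the two near-1 values and two near-5 values pair up
```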


2021 ◽  
pp. 016555152110137
Author(s):  
N.R. Gladiss Merlin ◽  
Vigilson Prem. M

Large, complex data have become a valuable resource in biomedical discovery, greatly expanding the scientific resources available for retrieving helpful information. However, indexing and retrieving patient information from disparate big data sources is challenging in biomedical research. Here, indexing and retrieval of patient information from big data are performed using the MapReduce framework. In this research, they are carried out with the proposed Jaya-Sine Cosine Algorithm (Jaya-SCA)-based MapReduce framework. Initially, the input big data are distributed to the mappers at random. The average of each mapper's data is calculated, and these averages are forwarded to the reducer, where the representative data are stored. For each user query, the query is matched against the reducer and then routed to the mapper to retrieve the best-matching result. Bilevel matching is performed while retrieving data from the mapper, based on the distance between the query and the stored data. The similarity measure is computed using the parametric-enabled similarity measure (PESM), cosine similarity and the proposed Jaya-SCA, which is the integration of the Jaya algorithm and the sine cosine algorithm (SCA). The proposed Jaya-SCA algorithm attained maximum F-measure, recall and precision values of 0.5323, 0.4400 and 0.6867, respectively, on the StatLog Heart Disease dataset.
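Of the similarity measures named above, cosine similarity is the standard one; a minimal retrieval sketch ranking hypothetical document vectors against a query (the PESM and Jaya-SCA weighting are not reproduced here):

```python
# Cosine similarity and a toy ranking step, illustrating the matching stage;
# the query and document vectors are hypothetical.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

query = (1.0, 2.0, 0.0)
docs = [(1.0, 2.0, 0.0), (0.0, 0.0, 3.0), (2.0, 1.0, 0.0)]
ranked = sorted(docs, key=lambda d: cosine(query, d), reverse=True)
print(ranked[0])  # the exact match ranks first
```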

