A Distributed Processing Framework of Incremental Text Clustering under the Background of Big Data

In the era of big data, due to the rapid expansion of the data, the existing incremental text clustering algorithm has the drawback that the efficiency of algorithm will sharp decline with the time and data volume increasing. Because of poor timeliness and robustness, the algorithms are hard to be applied in practice. In this paper, we propose a distributed model framework of Single-Pass algorithm based on MapReduce, the experiments result of increment text cluster is accuracy, the algorithm effectively improve the computing efficiency of the algorithm and real-time of result. Algorithm has a great prospect under the background of big data.

Download Full-text

Research on Belt and Road Big Data Visualization Based on Text Clustering Algorithm

2020 6th International Conference on Robotics and Artificial Intelligence ◽

10.1145/3449301.3449322 ◽

2020 ◽

Author(s):

Yana Wen ◽

Tingyue Wei ◽

Kewei Cui ◽

Bai Ling ◽

Yahao Zhang ◽

...

Keyword(s):

Big Data ◽

Data Visualization ◽

Clustering Algorithm ◽

Text Clustering ◽

Big Data Visualization ◽

Belt And Road

Download Full-text

Large-Scale Text Clustering Based on Improved K-Means Algorithm in the Storm Platform

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.543-547.1913 ◽

2014 ◽

Vol 543-547 ◽

pp. 1913-1916

Author(s):

Sheng Hang Wu ◽

Zhe Wang ◽

Ming Yuan He ◽

Huai Lin Dong

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

Distributed Processing ◽

Text Clustering ◽

Research Field ◽

Data Intensive Computing ◽

Operation Mechanism ◽

Data Intensive ◽

And Performance ◽

Algorithm Base

With the web information dramatically increases, Distributed processing of mass data through a cluster have been the focus of research field. An efficient distributed algorithm is the determinant of the scalability and performance in data analyses. This dissertation firstly studies the operation mechanism of Storm, which is a simplified distributed and real-time computation platform. Based on the Storm platform, an improved K-Means algorithm which could be used for data intensive computing is designed and implemented. Finally, the experience results show that the K-Means clustering algorithm base on Storm platform could obtain a higher performance in experience and improve the effectiveness and accuracy in large-scale text clustering.

Download Full-text

K-means text clustering algorithm based on density and nearest neighbor

Journal of Computer Applications ◽

10.3724/sp.j.1087.2010.01933 ◽

2010 ◽

Vol 30 (7) ◽

pp. 1933-1935 ◽

Cited By ~ 6

Author(s):

Wen-ming ZHANG ◽

Jiang WU ◽

Xiao-jiao YUAN

Keyword(s):

Clustering Algorithm ◽

Nearest Neighbor ◽

Text Clustering

Download Full-text

Research on the university intelligent learning analysis system based on AI

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189820 ◽

2021 ◽

pp. 1-10

Author(s):

Meng Huang ◽

Shuai Liu ◽

Yahao Zhang ◽

Kewei Cui ◽

Yana Wen

Keyword(s):

Artificial Intelligence ◽

Big Data ◽

Academic Performance ◽

Clustering Algorithm ◽

Back Propagation ◽

Three Dimensional ◽

Training Model ◽

Future Trend ◽

Artificial Intelligence Technology ◽

Visualization Technology

The integration of Artificial Intelligence technology and school education had become a future trend, and became an important driving force for the development of education. With the advent of the era of big data, although the relationship between students’ learning status data was closer to nonlinear relationship, combined with the application analysis of artificial intelligence technology, it could be found that students’ living habits were closely related to their academic performance. In this paper, through the investigation and analysis of the living habits and learning conditions of more than 2000 students in the past 10 grades in Information College of Institute of Disaster Prevention, we used the hierarchical clustering algorithm to classify the nearly 180000 records collected, and used the big data visualization technology of Echarts + iView + GIS and the JavaScript development method to dynamically display the students’ life track and learning information based on the map, then apply Three Dimensional ArcGIS for JS API technology showed the network infrastructure of the campus. Finally, a training model was established based on the historical learning achievements, life trajectory, graduates’ salary, school infrastructure and other information combined with the artificial intelligence Back Propagation neural network algorithm. Through the analysis of the training resulted, it was found that the students’ academic performance was related to the reasonable laboratory study time, dormitory stay time, physical exercise time and social entertainment time. Finally, the system could intelligently predict students’ academic performance and give reasonable suggestions according to the established prediction model. The realization of this project could provide technical support for university educators.

Download Full-text

Improved K-Means Clustering Algorithm for Big Data Mining under Hadoop Parallel Framework

Journal of Grid Computing ◽

10.1007/s10723-019-09503-0 ◽

2019 ◽

Vol 18 (2) ◽

pp. 239-250 ◽

Cited By ~ 3

Author(s):

Weijia Lu

Keyword(s):

Data Mining ◽

Big Data ◽

Clustering Algorithm ◽

Big Data Mining

Download Full-text

Word2Cluster: A New Multi-Label Text Clustering Algorithm with an Adaptive Clusters Number

2019 IEEE Global Communications Conference (GLOBECOM) ◽

10.1109/globecom38437.2019.9013266 ◽

2019 ◽

Author(s):

Kaili Mao ◽

Jianwei Niu ◽

Xuefeng Liu ◽

Shui Yu ◽

Longbo Zhao

Keyword(s):

Clustering Algorithm ◽

Text Clustering

Download Full-text

Computational storage: an efficient and scalable platform for big data and HPC applications

Journal Of Big Data ◽

10.1186/s40537-019-0265-5 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Mahdi Torabzadehkashi ◽

Siavash Rezaei ◽

Ali HeydariGorji ◽

Hosein Bobarshad ◽

Vladimir Alves ◽

...

Keyword(s):

Big Data ◽

High Performance ◽

Distributed Processing ◽

Data Access ◽

Distributed Applications ◽

Process Data ◽

Storage Devices ◽

Hadoop Mapreduce ◽

Big Data Applications ◽

Application Processor

AbstractIn the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data to the system. This cost has a direct relation with the distance of processing engines from the data. This is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move process closer to data. Computational storage devices (CSDs) push the “move process to data” paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform, that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for the applications. Thus, a vast spectrum of applications can be ported for running on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina CSD is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place without any modifications on the underlying distributed processing framework. For the proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks to investigate the benefits of deploying Catalina CSDs in the distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption, respectively, for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved up to 5.4× and 8.9×, respectively.

Download Full-text

Fuzzy Set Based Clustering Algorithm of Web Text

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.678.19 ◽

2014 ◽

Vol 678 ◽

pp. 19-22

Author(s):

Hong Xin Wan ◽

Yun Peng

Keyword(s):

Key Words ◽

Fuzzy Set ◽

Clustering Algorithm ◽

Text Clustering ◽

Classification Methods ◽

Comparative Experiment ◽

Fuzzy Algorithm ◽

Pattern Clustering ◽

The Web ◽

Computing Accuracy

Web text exists non-certain and non-structure contents ,and it is difficult to cluster the text by normal classification methods. We propose a web text clustering algorithm based on fuzzy set to increase the computing accuracy with the web text. After abstracting the key words of the text, we can look it as attributes and design the fuzzy algorithm to decide the membership of the words. The algorithm can improve the algorithm complexity of time and space, increase the robustness comparing to the normal algorithm. To test the accuracy and efficiency of the algorithm, we take the comparative experiment between pattern clustering and our algorithm. The experiment shows that our method has a better result.

Download Full-text

Research on complex attribute big data classification based on iterative fuzzy clustering algorithm

Web Intelligence ◽

10.3233/web-210463 ◽

2021 ◽

pp. 1-12

Author(s):

Li Qian

Keyword(s):

Big Data ◽

Fuzzy Clustering ◽

Classification Accuracy ◽

Clustering Algorithm ◽

Principal Component ◽

Data Classification ◽

Fisher Discriminant Analysis ◽

Fuzzy Clustering Algorithm ◽

Local Fisher Discriminant Analysis ◽

Big Data Classification

In order to overcome the low classification accuracy of traditional methods, this paper proposes a new classification method of complex attribute big data based on iterative fuzzy clustering algorithm. Firstly, principal component analysis and kernel local Fisher discriminant analysis were used to reduce dimensionality of complex attribute big data. Then, the Bloom Filter data structure is introduced to eliminate the redundancy of the complex attribute big data after dimensionality reduction. Secondly, the redundant complex attribute big data is classified in parallel by iterative fuzzy clustering algorithm, so as to complete the complex attribute big data classification. Finally, the simulation results show that the accuracy, the normalized mutual information index and the Richter’s index of the proposed method are close to 1, the classification accuracy is high, and the RDV value is low, which indicates that the proposed method has high classification effectiveness and fast convergence speed.

Download Full-text

Research on Particle Swarm Optimization Clustering Algorithm for Big Data Based on Cloud Storage Environment

10.1145/3482632.3484002 ◽

2021 ◽

Author(s):

Dan Liu

Keyword(s):

Big Data ◽

Particle Swarm Optimization ◽

Cloud Storage ◽

Clustering Algorithm ◽

Particle Swarm ◽

Swarm Optimization ◽

Cloud Storage Environment

Download Full-text