Big Data Summarization Using Novel Clustering Algorithm and Semantic Feature Approach

2017 ◽  
Vol 4 (3) ◽  
pp. 108-117
Author(s):  
Shilpa G. Kolte ◽  
Jagdish W. Bakal

This paper proposes a method for summarizing big data (i.e., documents and texts) using a novel clustering algorithm and semantic features. The proposed system works in four phases and provides a modular implementation of multi-document summarization. Experimental results on the Iris dataset show that the proposed clustering algorithm performs better than the K-means and K-medoids algorithms. Summarization performance is evaluated using Australian legal cases from the Federal Court of Australia (FCA) database. The experimental results demonstrate that the proposed method summarizes big data documents better than existing systems.

Author(s):  
Joaquín Pérez Ortega ◽  
Nelva Nely Almanza Ortega ◽  
Andrea Vega Villalobos ◽  
Marco A. Aguirre L. ◽  
Crispín Zavala Díaz ◽  
...  

In recent years, the amount of natural-language text in digital format has increased impressively. To obtain useful information from a large volume of data, new specialized techniques and efficient algorithms are required. Text mining consists of extracting meaningful patterns from texts; one of the basic approaches is clustering. The most used clustering algorithm is k-means. This chapter proposes an improvement to the convergence step of the k-means algorithm: the process stops whenever the number of objects that change their assigned cluster in the current iteration is greater than the number that changed in the previous iteration. Experimental results showed a reduction in execution time of up to 93%. Notably, better results are generally obtained as the volume of text increases, particularly for texts in big data environments.
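The stopping rule described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `kmeans_early_stop`, the Euclidean distance, and the choice to update centroids before checking the stopping rule are all assumptions made for the example.

```python
import math

def kmeans_early_stop(points, centroids, max_iter=100):
    """Lloyd-style k-means that also stops once reassignments start to increase.

    Hypothetical helper: stops when no object changes cluster (classical rule)
    or when more objects change cluster than in the previous iteration
    (the rule described in the abstract).
    """
    centroids = list(centroids)          # work on a copy of the caller's list
    assignment = [None] * len(points)
    prev_changes = None
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # Assignment step: count how many objects move to a different cluster.
        changes = 0
        for i, p in enumerate(points):
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                changes += 1
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # Stop on convergence or when the number of changes grows.
        if changes == 0 or (prev_changes is not None and changes > prev_changes):
            break
        prev_changes = changes
    return assignment, centroids, iterations
```

Calling `kmeans_early_stop(points, initial_centroids)` returns the cluster index of each point, the final centroids, and the number of iterations performed. Whether the original method recomputes centroids before stopping is not stated in the abstract.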


2012 ◽  
Vol 532-533 ◽  
pp. 1716-1720 ◽  
Author(s):  
Chun Xia Jin ◽  
Hai Yan Zhou ◽  
Qiu Chan Bai

To solve the problems of sparse keywords and similarity drift in short text segments, this paper proposes a short text clustering algorithm with feature keyword expansion (STCAFKE). The method clusters short texts by expanding feature keywords based on HowNet and combining the K-means algorithm with a density algorithm. Feature keyword expansion increases the number of keywords per text and enriches its semantic features. Experimental results show that the algorithm improves short text clustering quality in terms of precision and recall.


2018 ◽  
Vol 11 (1) ◽  
pp. 98
Author(s):  
Liu Xiang Wei

Society has entered the era of big data, and the growing diversity and volume of data pose great challenges for data storage and processing; Hadoop HDFS and MapReduce address these two problems. The classical K-means algorithm is the most widely used partition-based clustering algorithm. After completing the cluster configuration, this paper describes the operating principle of the k-means algorithm in cluster mode, implements the algorithm in that mode, analyzes the experimental results, and summarizes the strengths and limitations of running k-means on the Hadoop platform.
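To illustrate how k-means decomposes into map and reduce steps, the sketch below emulates one common formulation in plain Python. It does not use the Hadoop API and is not taken from the paper; the function names `map_phase`, `reduce_phase`, and `kmeans_mapreduce` are hypothetical, and the in-memory grouping merely stands in for Hadoop's shuffle step.

```python
import math
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, point) for every input point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
        yield nearest, p

def reduce_phase(pairs, centroids):
    """Reduce: average the points that share a centroid index."""
    groups = defaultdict(list)
    for key, p in pairs:                 # stands in for the shuffle/sort step
        groups[key].append(p)
    new_centroids = list(centroids)      # empty clusters keep their old centroid
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(x) / len(members) for x in zip(*members))
    return new_centroids

def kmeans_mapreduce(points, centroids, max_iter=20):
    """Repeat map + reduce rounds until the centroids stop moving."""
    centroids = list(centroids)
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids
```

On a real cluster each round would be a separate MapReduce job, with the current centroids distributed to every mapper; that per-iteration job overhead is one commonly cited limitation of this approach.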


2012 ◽  
Vol 182-183 ◽  
pp. 1881-1884
Author(s):  
Xiu Fang Xu ◽  
Sen Xu ◽  
Tian Zhou

This paper proposes a novel spectral document clustering algorithm that uses a minimum maximum principle. First, a low-dimensional embedding of the documents is obtained by eigenvalue decomposition; then a minimum maximum principle is used to select the initial seeds for the k-means algorithm. Finally, the K-means algorithm is run to obtain the clustering results. Experimental results show that the clustering results found by this method are better than those of traditional clustering algorithms.
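The abstract does not define its minimum maximum principle. The sketch below assumes it resembles the well-known max-min (farthest-first) heuristic, in which each new seed is the point whose distance to its closest already-chosen seed is largest. The function name `maximin_seeds` and the choice of the first seed are hypothetical, and the eigenvalue-decomposition embedding step is omitted.

```python
import math

def maximin_seeds(points, k, first=0):
    """Pick k initial seeds with the max-min distance heuristic (an assumption).

    Starts from points[first]; each further seed is the point that maximizes
    the distance to its nearest already-selected seed.
    """
    seeds = [points[first]]
    while len(seeds) < k:
        farthest = max(points,
                       key=lambda p: min(math.dist(p, s) for s in seeds))
        seeds.append(farthest)
    return seeds
```

The returned seeds would then serve as the initial centroids for k-means, which tends to spread them across well-separated regions of the embedded space.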


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
HangLin Lu ◽  
XiuYun Peng

With the development of big data, stock price prediction in the financial market has gained many research directions from a big data perspective. Classical time series prediction models cannot adapt to the high-dimensional information of stock data in the era of big data. The development of deep learning provides a new approach to high-dimensional stock data prediction. Four neural network models and three ensemble learning models form different strategy sets, and the opening price at the next timestamp is predicted by looking back over the past 15 days of information on 12 indicators of the stock. The experimental results show that ensemble models based on the average-weight policy and the stacking policy predict better than a single neural network, and that the ensemble model based on the stacking policy has the highest expected prediction accuracy and the minimum expected error. The accuracy was 80.2%, and the mean square error was 0.024. Compared with the single model, the accuracy is increased by 2%~7%, and the error is reduced by 0.01~0.03. The innovation of this article lies in applying traditional machine learning thinking to deep learning: a variety of neural networks are treated as individual learners and fused into an ensemble model through ensemble learning strategies. The experimental results show that the ensemble model outperforms a single model, improving robustness and accuracy, and that its performance is more stable. For the utilization of big data resources, an ensemble of neural networks has a better prediction effect.


2016 ◽  
Vol 43 (1) ◽  
pp. 54-74 ◽  
Author(s):  
Baojun Ma ◽  
Hua Yuan ◽  
Ye Wu

Clustering is a powerful unsupervised tool for sentiment analysis from text. However, the clustering results may be affected by any step of the clustering process, such as the data pre-processing strategy, the term weighting method in the Vector Space Model, and the clustering algorithm. This paper presents the results of an experimental study of some common clustering techniques with respect to the task of sentiment analysis. Unlike previous studies, we investigate the combined effects of these factors through a series of comprehensive experiments. The experimental results indicate that, first, the K-means-type clustering algorithms show clear advantages on balanced review datasets, while performing rather poorly on unbalanced datasets in terms of clustering accuracy. Second, the comparatively newly designed weighting models are better than the traditional weighting models for sentiment clustering on both balanced and unbalanced datasets. Furthermore, a strategy of extracting adjectives and adverbs offers obvious improvements in clustering performance, while stemming and stopword removal have a negative influence on sentiment clustering. The experimental results would be valuable for both the study and usage of clustering methods in online review sentiment analysis.


Author(s):  
Wenhui Zhou ◽  
Lili Lin ◽  
Guangtao Ge

Accurate vertebrae segmentation from CT spinal images is crucial for the clinical tasks of diagnosis, surgical planning, and post-operative assessment. This paper describes an [Formula: see text]-shaped 3D fully convolutional network (FCN) for vertebrae segmentation: [Formula: see text]-net. In this network, a global structure guidance pathway is designed for fusing the high-level semantic features with the global structure information. Moreover, the residual structure and the skip connection are introduced into the traditional 3D FCN framework. These schemes can significantly improve the accuracy of vertebrae segmentation. Experimental results demonstrate the effectiveness and robustness of our method. A high average DICE score of 0.9499 [Formula: see text] 0.02 can be obtained, which is better than those of existing methods.


2012 ◽  
Vol 239-240 ◽  
pp. 1318-1323
Author(s):  
Zu Qiang Meng ◽  
Shi Mo Shen ◽  
Qiu Lian Chen

Text clustering is one of the most popular topic detection techniques. However, existing text clustering approaches require that each document be assigned to one and only one cluster. This is not reasonable in some cases, because some documents should not be used to constitute topics. This paper first models a text document set as a network and designs a method for decomposing such a network, and then proposes an original text clustering algorithm for topic detection, called the network decomposition-based text clustering algorithm for topic detection (NDTCATD). The proposed algorithm ensures that meaningless documents cannot be used to constitute topics. Experimental results show that NDTCATD is much better than the bisecting k-means algorithm in terms of overall similarity and average cluster similarity. The proposed algorithm is therefore reasonable and effective, and is especially suitable for topic detection.


2012 ◽  
Vol 241-244 ◽  
pp. 3209-3212
Author(s):  
Guan Bo ◽  
Liang Xu Liu ◽  
Jian Bo Fan ◽  
Jin Yang Chen

As more and more trajectory datasets are collected on application servers, research on trajectory clustering has become an increasingly important topic. This paper proposes a new mobile object trajectory clustering algorithm (Trajectory Clustering based Improved Minimum Hausdorff Distance under Translation, TraClustMHD). In this framework, an improved Minimum Hausdorff Distance under Translation is presented to measure the similarity between sub-segments. In addition, an R-Tree is employed to improve efficiency. Experimental results show that this algorithm achieves better trajectory clustering performance than algorithms based on the Hausdorff distance and on the line Hausdorff distance.
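For context, the standard Hausdorff distance that the abstract uses as a baseline can be computed as in the sketch below. It covers only the textbook definition between two finite point sets; the improved Minimum Hausdorff Distance under Translation proposed in the paper is not specified in the abstract and is not reproduced here. The function names are chosen for this example.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance: the larger of the two directed distances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))
```

Here `a` and `b` are non-empty sequences of coordinate tuples, for instance the sampled points of two trajectory sub-segments. This brute-force version compares every pair of points, which is the kind of cost a spatial index such as an R-Tree is meant to reduce.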

