A state-of-the-art survey on semantic similarity for document clustering using GloVe and density-based algorithms

Semantic similarity is the task of identifying data that are related in meaning. Traditionally, document similarity has been measured by matching synonymous keywords and syntactic features. Semantic similarity, by contrast, finds similar data using the meanings of words and their semantics. Clustering groups objects that share features and properties into a cluster and separates them from objects whose features and properties differ. In semantic document clustering, documents are clustered using semantic similarity techniques together with similarity measures. One common approach is density-based clustering, which uses the density of data points as its main strategy for measuring the similarity between them. This paper presents a state-of-the-art survey that analyzes density-based algorithms for clustering documents. It also investigates the similarity and evaluation measures used with the selected algorithms to identify the most common ones. The review reveals that the density-based algorithms most used in document clustering are DBSCAN and DPC. The most effective similarity measure used with density-based algorithms, specifically DBSCAN and DPC, is cosine similarity, with F-measure used to evaluate performance and accuracy.
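The combination highlighted in this abstract (DBSCAN, cosine similarity, F-measure) can be illustrated with a minimal pure-Python sketch. It is not taken from the survey: the toy documents, the `eps` and `min_pts` values, the raw term-count vectors (instead of GloVe embeddings), and the pairwise form of the F-measure are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def dbscan(vectors, eps, min_pts):
    """Return one label per vector: a cluster id (0, 1, ...) or -1 for noise.

    Distance is 1 - cosine similarity; a point counts as its own neighbor.
    """
    n = len(vectors)

    def neighbors(i):
        return [j for j in range(n)
                if 1.0 - cosine_similarity(vectors[i], vectors[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1              # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster     # noise reached from a core point: border point
            if labels[j] is not None:
                continue                # already assigned, do not expand again
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable) # j is a core point, keep expanding
    return labels

def pairwise_f_measure(pred, truth):
    """F-measure over document pairs; noise points (-1) pair with nothing."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j] and pred[i] != -1
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy corpus (not from the survey).
docs = [
    "density based clustering algorithm",
    "density based clustering of documents",
    "clustering documents with density algorithm",
    "legal judgement citation network",
    "citation network of legal cases",
    "recipe for tomato soup",
]
vectors = [Counter(d.split()) for d in docs]
labels = dbscan(vectors, eps=0.6, min_pts=2)
print(labels)                                           # [0, 0, 0, 1, 1, -1]
print(pairwise_f_measure(labels, [0, 0, 0, 1, 1, 2]))   # 1.0
```

Treating each noise point as its own singleton in the F-measure is one possible convention; the surveyed papers may define the measure per class rather than per pair.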

State of the art document clustering algorithms based on semantic similarity

Jurnal Informatika ◽

10.26555/jifo.v14i2.a17513 ◽

2020 ◽

Vol 14 (2) ◽

pp. 58

Author(s):

Karwan Jacksi ◽

Niyaz Salih

Keyword(s):

Semantic Similarity ◽

State Of The Art ◽

Clustering Algorithms ◽

Document Clustering

Implementation of Clustering Algorithms for Real Time Large Datasets

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.c2570.0981119 ◽

2019 ◽

Vol 8 (11) ◽

pp. 2303-2304

Keyword(s):

Big Data ◽

Clustering Algorithms ◽

Vital Role ◽

Large Datasets ◽

Similar Data ◽

Data Set ◽

Survey Paper ◽

Density Based Clustering ◽

Geographical Maps ◽

Data Objects

Nowadays clustering plays a vital role in big data, where large volumes of data are very difficult to analyze and cluster. Clustering is a procedure for grouping similar data objects of a data set. The aim is high intra-cluster similarity within each cluster and low inter-cluster similarity between clusters. Clustering is used in statistical analysis, geographical maps, biological cell analysis, and Google Maps. The main approaches are grid-based clustering, density-based clustering, hierarchical methods, and partitioning approaches. This survey paper covers all of these algorithms for large datasets such as big data and reports a comparison among them. The main metric used to differentiate the algorithms is time complexity.

ClusterEnG: An interactive educational web resource for clustering big data

10.1101/120915 ◽

2017 ◽

Author(s):

Mohith Manjunath ◽

Yi Zhang ◽

Steve H. Yeo ◽

Omar Sobh ◽

Nathan Russell ◽

...

Keyword(s):

Big Data ◽

State Of The Art ◽

Clustering Algorithms ◽

Clustering Methods ◽

Web Resource ◽

Interactive Visualizations ◽

Data Points ◽

Similarities And Differences ◽

Intuitive Manner ◽

The Web

Summary: Clustering is one of the most common techniques used in data analysis to discover hidden structures by grouping together data points that are similar in some measure into clusters. Although there are many programs available for performing clustering, a single web resource that provides both state-of-the-art clustering methods and interactive visualizations is lacking. ClusterEnG (acronym for Clustering Engine for Genomics) provides an interface for clustering big data and interactive visualizations including 3D views, cluster selection and zoom features. ClusterEnG also aims at educating the user about the similarities and differences between various clustering algorithms and provides clustering tutorials that demonstrate potential pitfalls of each algorithm. The web resource will be particularly useful to scientists who are not conversant with computing but want to understand the structure of their data in an intuitive manner. Availability: ClusterEnG is part of a bigger project called KnowEnG (Knowledge Engine for Genomics) and is available at http://education.knoweng.org/[email protected]

An Approach of Semantic Similarity Measure between Documents Based on Big Data

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v6i5.10853 ◽

2016 ◽

Vol 6 (5) ◽

pp. 2454 ◽

Cited By ~ 2

Author(s):

Mohammed Erritali ◽

Abderrahim Beni-Hssane ◽

Marouane Birjali ◽

Youness Madani

Keyword(s):

Big Data ◽

Semantic Similarity ◽

Retrieval System ◽

Programming Model ◽

State Of The Art ◽

Distributed Processing ◽

Similarity Measures ◽

The State ◽

Semantic Similarity Measure ◽

Document Similarity

Semantic indexing and document similarity are important information retrieval problems in Big Data with broad applications. In this paper, we investigate the MapReduce programming model as a specific framework for managing distributed processing of a large amount of documents. Then we study the state of the art of different approaches for computing the similarity of documents. Finally, we propose our approach to semantic similarity measurement using WordNet as an external semantic network resource. For evaluation, we compare the proposed approach with previously presented approaches using our new MapReduce algorithm. Experimental results reveal that our proposed approach outperforms state-of-the-art ones in running time and improves the measurement of semantic similarity.
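To make the MapReduce framing concrete, the following is a single-process sketch of how pairwise document overlap might be computed in map, shuffle, and reduce steps. It is an assumption-laden illustration, not the authors' algorithm: plain term-count overlap stands in for the WordNet-based measure, and the function names and sample documents are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(doc_id, text):
    """Map: emit (term, (doc_id, count)) for every distinct term in a document."""
    counts = defaultdict(int)
    for term in text.lower().split():
        counts[term] += 1
    return [(term, (doc_id, c)) for term, c in counts.items()]

def shuffle(pairs):
    """Shuffle: group emitted values by key, as a MapReduce framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(postings):
    """Reduce: for one term, emit a partial dot product for each document pair."""
    return [((d1, d2), c1 * c2)
            for (d1, c1), (d2, c2) in combinations(sorted(postings), 2)]

def pairwise_overlap(docs):
    """Chain two map/reduce rounds to get term-count dot products per document pair."""
    emitted = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    partial = [kv for postings in shuffle(emitted).values()
               for kv in reduce_phase(postings)]
    # Second round: sum the partial products that share a document pair.
    return {pair: sum(values) for pair, values in shuffle(partial).items()}

# Hypothetical sample documents.
docs = {
    "a": "big data similarity",
    "b": "semantic similarity of big data",
    "c": "wordnet semantic resource",
}
print(pairwise_overlap(docs))
```

Keying the first round by term means only documents that share a term are ever compared, which is what lets the work be split across nodes; pairs with no shared term simply never appear in the output.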

An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering

Computational and Mathematical Methods in Medicine ◽

10.1155/2021/7937573 ◽

2021 ◽

Vol 2021 ◽

pp. 1-11

Author(s):

Meijing Li ◽

Tianjie Chen ◽

Keun Ho Ryu ◽

Cheng Hao Jin

Keyword(s):

Semantic Similarity ◽

Clustering Algorithms ◽

Semantic Features ◽

Text Data ◽

Semantic Similarity Measure ◽

Document Similarity ◽

Hadoop Mapreduce ◽

Semantic Mining ◽

Semantic Similarity Measurement ◽

Open Datasets

Semantic mining is always a challenge for big biomedical text data. Ontologies have been widely validated and used to extract semantic information. However, ontology-based semantic similarity calculation is so complex that it cannot scale to big text data. To solve this problem, we propose a parallelized semantic similarity measurement method based on Hadoop MapReduce for big text data. First, we preprocess the documents and extract their semantic features. Then, we calculate document semantic similarity based on the ontology network structure under the MapReduce framework. Finally, based on the resulting semantic document similarity, document clusters are generated via clustering algorithms. To validate the effectiveness of the method, we use two kinds of open datasets. The experimental results show that the traditional methods can hardly work for more than ten thousand biomedical documents. The proposed method remains efficient and accurate for big datasets and offers high parallelism and scalability.

Conventional displays of structures in data compared with interactive projection-based clustering (IPBC)

International Journal of Data Science and Analytics ◽

10.1007/s41060-021-00264-2 ◽

2021 ◽

Author(s):

Michael C. Thrun ◽

Felix Pape ◽

Alfred Ultsch

Keyword(s):

Visual Analytics ◽

Clustering Algorithms ◽

Empirical Evaluation ◽

Similar Data ◽

Qualitative Comparison ◽

Domain Experts ◽

Human In The Loop ◽

3D Displays ◽

Benchmark Datasets ◽

Data Points

Clustering is an important task in knowledge discovery with the goal to identify structures of similar data points in a dataset. Here, the focus lies on methods that use a human-in-the-loop, i.e., incorporate user decisions into the clustering process through 2D and 3D displays of the structures in the data. Some of these interactive approaches fall into the category of visual analytics and emphasize the power of such displays to identify the structures interactively in various types of datasets or to verify the results of clustering algorithms. This work presents a new method called interactive projection-based clustering (IPBC). IPBC is an open-source and parameter-free method using a human-in-the-loop for an interactive 2.5D display and identification of structures in data based on the user’s choice of a dimensionality reduction method. The IPBC approach is systematically compared with accessible visual analytics methods for the display and identification of cluster structures using twelve clustering benchmark datasets and one additional natural dataset. Qualitative comparison of 2D, 2.5D and 3D displays of structures and empirical evaluation of the identified cluster structures show that IPBC outperforms comparable methods. Additionally, IPBC assists in identifying structures previously unknown to domain experts in an application.

An Approach of Semantic Similarity Measure between Documents Based on Big Data

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v6i5.pp2454-2461 ◽

2016 ◽

Vol 6 (5) ◽

pp. 2454 ◽

Cited By ~ 1

Author(s):

Mohammed Erritali ◽

Abderrahim Beni-Hssane ◽

Marouane Birjali ◽

Youness Madani

Keyword(s):

Big Data ◽

Semantic Similarity ◽

Retrieval System ◽

Programming Model ◽

State Of The Art ◽

Distributed Processing ◽

Similarity Measures ◽

The State ◽

Semantic Similarity Measure ◽

Document Similarity

Clustering Algorithms: An Exploratory Review

10.5772/intechopen.100376 ◽

2021 ◽

Author(s):

R.S.M. Lakshmi Patibandla ◽

Veeranjaneyulu N

Keyword(s):

Standard Deviation ◽

Evolutionary Algorithms ◽

Root Mean Square ◽

Optimization Problems ◽

Clustering Algorithms ◽

Similar Data ◽

Mean Square ◽

Data Set ◽

Validation Criteria ◽

Data Points

Data clustering is the process of grouping similar data items. A data set is partitioned into groups based on the resemblance within each group, using various algorithms. The key idea of partition-based algorithms is to split the data points into partitions, each of which represents one cluster. The performance of a partition depends on certain objective functions. Evolutionary algorithms are used for the evolution of social aspects and to provide optimum solutions for huge optimization problems. This paper surveys various partitioning and evolutionary algorithms that can be implemented on a benchmark dataset, and proposes applying validation criteria such as Root-Mean-Square Standard Deviation, R-square, and SSD to algorithms such as Leader, ISODATA, SGO, and PSO.
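One of the validation criteria named here, Root-Mean-Square Standard Deviation (RMSSTD), can be sketched in a few lines. The version below assumes the common pooled definition (within-cluster squared deviations summed over all clusters and dimensions, divided by the pooled degrees of freedom); the chapter itself may use a different variant, and the sample clusters are made up.

```python
import math

def rmsstd(clusters):
    """Pooled root-mean-square standard deviation of a clustering.

    `clusters` is a list of clusters, each a list of equal-length numeric tuples.
    Lower values indicate more compact (homogeneous) clusters.
    """
    squared_dev = 0.0
    degrees_of_freedom = 0
    for points in clusters:
        n = len(points)
        dims = len(points[0])
        for j in range(dims):
            mean_j = sum(p[j] for p in points) / n
            squared_dev += sum((p[j] - mean_j) ** 2 for p in points)
            degrees_of_freedom += n - 1
    if degrees_of_freedom == 0:
        raise ValueError("RMSSTD is undefined when every cluster has one point")
    return math.sqrt(squared_dev / degrees_of_freedom)

# Hypothetical 2-D clustering with two clusters of two points each.
clusters = [[(0, 0), (2, 0)], [(10, 10), (10, 12)]]
print(rmsstd(clusters))  # 1.0
```

Because the index only measures compactness, it is usually read alongside a separation measure such as R-square rather than on its own.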

Refined Spectral Clustering via Embedded Label Propagation

Neural Computation ◽

10.1162/neco_a_01022 ◽

2017 ◽

Vol 29 (12) ◽

pp. 3381-3396 ◽

Cited By ~ 5

Author(s):

Yan-Shuo Chang ◽

Feiping Nie ◽

Zhihui Li ◽

Xiaojun Chang ◽

Heng Huang

Keyword(s):

Spectral Clustering ◽

State Of The Art ◽

Clustering Algorithms ◽

Original Data ◽

Label Propagation ◽

Locally Linear Embedding ◽

Data Sets ◽

Laplacian Matrices ◽

Data Points ◽

Linear Embedding

Spectral clustering is a key research topic in the field of machine learning and data mining. Most of the existing spectral clustering algorithms are built on Gaussian Laplacian matrices, which are sensitive to parameters. We propose a novel parameter-free distance-consistent locally linear embedding (LLE). The proposed distance-consistent LLE ensures that edges between closer data points are heavier. We also propose a novel improved spectral clustering via embedded label propagation. Our algorithm is built on two advancements of the state of the art. The first is label propagation, which propagates a node's labels to neighboring nodes according to their proximity. We perform standard spectral clustering on the original data, assign each cluster its [Formula: see text]-nearest data points, and then propagate labels through dense unlabeled data regions. The second is manifold learning, which has been widely used for its capacity to leverage the manifold structure of data points. Extensive experiments on various data sets validate the superiority of the proposed algorithm compared to state-of-the-art spectral algorithms.
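The label propagation step described in this abstract, where labels spread to neighbors according to proximity, can be illustrated with a generic clamped, iterative scheme. This is a sketch of plain label propagation on a Gaussian affinity graph, not the authors' embedded variant; the 1-D points, seed labels, `sigma`, and iteration count are illustrative assumptions.

```python
import math

def propagate_labels(points, seeds, n_classes, sigma=1.0, iterations=50):
    """Spread seed labels over 1-D points using a row-normalized Gaussian affinity.

    `seeds` maps a point index to its known class; seed labels are re-clamped
    after every iteration so they never change.
    """
    n = len(points)
    # Affinity decays with squared distance; no self-loops.
    w = [[0.0 if i == j
          else math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    # Row-normalize so each row is a distribution over neighbors.
    p = [[wij / sum(row) for wij in row] for row in w]

    f = [[0.0] * n_classes for _ in range(n)]
    for i, c in seeds.items():
        f[i][c] = 1.0

    for _ in range(iterations):
        new_f = [[sum(p[i][j] * f[j][c] for j in range(n))
                  for c in range(n_classes)] for i in range(n)]
        for i, c in seeds.items():      # clamp the known labels
            new_f[i] = [0.0] * n_classes
            new_f[i][c] = 1.0
        f = new_f

    # Each point takes the class with the largest propagated score.
    return [max(range(n_classes), key=lambda c: row[c]) for row in f]

# Hypothetical data: two dense groups, one labeled seed in each.
points = [0.0, 0.5, 1.0, 5.0, 5.5, 6.0]
print(propagate_labels(points, seeds={0: 0, 5: 1}, n_classes=2))
# [0, 0, 0, 1, 1, 1]
```

The Gaussian bandwidth `sigma` is exactly the kind of parameter the abstract describes as a sensitivity problem, which is the motivation for the parameter-free embedding the authors propose.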

Representation Learning and Similarity of Legal Judgements using Citation Networks

10.5121/csit.2021.112302 ◽

2021 ◽

Author(s):

Harshit Jain ◽

Naveen Pundir

Keyword(s):

State Of The Art ◽

Representation Learning ◽

Current Case ◽

Document Similarity ◽

Legal Information ◽

Law System ◽

Legal Documents ◽

Novel Approach ◽

The Common ◽

Dense Embedding

India and many other countries, such as the UK, Australia, and Canada, follow the ‘common law system’, which gives substantial importance to prior related cases in determining the outcome of the current case. Better similarity methods can help in finding earlier similar cases, which in turn helps lawyers searching for precedents. Prior approaches to computing the similarity of legal judgements use a basic representation, either a bag-of-words or a dense embedding learned using only the words present in the document. They, however, either neglect or do not emphasize the vital ‘legal’ information in the judgements, e.g. citations to prior cases, act and article numbers or names. In this paper, we propose a novel approach to learn the embeddings of legal documents using the citation network of documents. Experimental results demonstrate that the learned embedding is on par with the state-of-the-art methods for document similarity on a standard legal dataset.
