Role of Pre-processing Phase in Document Clustering Technique for Gurmukhi Script

Document clustering plays a central role in knowledge discovery and data mining by organizing large data sets into a certain number of groups of data objects called clusters. Each cluster consists of similar data objects, such that objects in the same cluster are highly similar to one another and dissimilar to the objects of other clusters. The document clustering technique for Gurmukhi script consists of two phases: 1) a pre-processing phase and 2) a processing phase. This paper concentrates on the pre-processing phase, whose purpose is to convert unstructured text into a structured format. The sub-phases of the pre-processing phase are segmentation, tokenization, removal of stop words, stemming, and normalization. The purpose of this paper is to present the significant role of the pre-processing phase in the overall performance of the document clustering technique for Gurmukhi script. The experimental results demonstrate this role in terms of the assignment of data objects to the relevant clusters as well as the creation of a meaningful cluster title list.
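The sub-phases named above can be sketched as a simple pipeline. This is a minimal illustration only: the stop-word list and suffix table below are English placeholders, not the actual Gurmukhi resources used in the paper.

```python
# Illustrative pre-processing pipeline: segmentation, tokenization,
# stop-word removal, and a crude suffix-stripping stemmer.
STOP_WORDS = {"the", "is", "of", "and", "a"}   # placeholder stop-word list
SUFFIXES = ("ing", "ed", "s")                  # placeholder suffix table

def segment(text):
    """Split raw text into sentences on full stops."""
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return sentence.lower().split()

def stem(token):
    """Strip the first matching suffix, keeping a minimal stem length."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def preprocess(text):
    """Convert unstructured text into a structured list of stemmed tokens."""
    tokens = []
    for sentence in segment(text):
        for tok in tokenize(sentence):
            if tok not in STOP_WORDS:
                tokens.append(stem(tok))
    return tokens
```

The output token list is the "structured text format" that the processing phase then clusters.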

2019 ◽  
Vol 8 (2) ◽  
pp. 1646-1653

Document clustering is an unsupervised machine learning technique that creates classes of similar objects without prior knowledge of the data sets. These classes are known as clusters; each cluster consists of unlabeled data objects such that objects within the same cluster have maximum similarity to one another and are dissimilar to the objects of other clusters. The purpose of this research work is to develop a domain-independent clustering technique for Gurmukhi script; it is the first such effort, as no prior work has been done on a domain-independent clustering technique for Gurmukhi script. In this paper, a hybrid algorithm for document clustering of Gurmukhi script has been developed. The experimental results reveal that the proposed hybrid technique performs well in defining the number of clusters, creating meaningful cluster titles, and assigning real-time unlabeled data sets to the relevant clusters, as a result of pre-processing steps such as segmentation, stemming, and normalization, along with the extraction of named/noun entities, the creation of cluster titles, and the placement of text documents into relevant clusters using fuzzy term weights.
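The final step, placing a document into the most relevant cluster by comparing term weights, can be sketched as follows. The hybrid algorithm itself is not specified in the abstract; this shows only the generic assignment idea, with term-weight vectors represented as sparse dicts and cosine similarity as an assumed similarity measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign(doc_weights, cluster_centroids):
    """Place a document in the cluster whose centroid it is most similar to."""
    return max(cluster_centroids,
               key=lambda name: cosine(doc_weights, cluster_centroids[name]))
```

A document weighted toward "ball" would land in a sports-like cluster rather than a politics-like one, even with partial term overlap.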


Author(s):  
Afrand Agah ◽  
Mehran Asadi

This article introduces a new method to discover the role of influential people in online social networks and presents an algorithm that recognizes influential users able to reach a target in the network, giving organizations a strategic advantage in directing their digital marketing strategies. Social links among friends play an important role in dictating behavior in online social networks: these links determine the flow of information in the form of wall posts via shares, likes, re-tweets, mentions, etc., which in turn determines the influence of a node. The article first identifies the correlated nodes in large data sets using a customized divide-and-conquer algorithm and then measures the influence of each of these nodes using a linear function. The empirical results show that the users with the highest influence are those whose total number of friends is closest to the network average: the total number of friend links divided by the total number of nodes.
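The closing empirical observation amounts to picking the user whose friend count sits closest to the network's average degree. A minimal sketch of that heuristic, under the assumption that the network is given as a simple user-to-friend-count mapping:

```python
def most_influential(friend_counts):
    """Return the user whose friend count is closest to the average degree
    (total friend links divided by the number of nodes), following the
    article's empirical observation."""
    avg = sum(friend_counts.values()) / len(friend_counts)
    return min(friend_counts, key=lambda u: abs(friend_counts[u] - avg))
```

Note this is only the selection rule suggested by the result, not the article's full divide-and-conquer pipeline.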


2005 ◽  
Vol 13 (4) ◽  
pp. 277-298 ◽  
Author(s):  
Rob Pike ◽  
Sean Dorward ◽  
Robert Griesemer ◽  
Sean Quinlan

Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design – including the separation into two phases, the form of the programming language, and the properties of the aggregators – exploits the parallelism inherent in having data and computation distributed across many machines.
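The two-phase split described above, independent per-record filtering feeding a collated aggregation, can be sketched in miniature. The record fields below are hypothetical network-log fields, and this single-process sketch only mimics the structure that the real system distributes over many machines.

```python
from collections import Counter

def filter_phase(records):
    """Filtering phase: each record is examined independently and emits
    (key, value) pairs, so the work parallelizes trivially across machines."""
    for rec in records:
        if rec["bytes"] > 0:                  # keep only non-empty requests
            yield (rec["domain"], rec["bytes"])

def aggregate_phase(emitted):
    """Aggregation phase: collate emitted values per key (here, a sum table)."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return dict(totals)
```

Because the filter never looks at more than one record and the aggregator only needs commutative addition, both phases can be spread over hundreds of machines and the partial sums merged at the end.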




2013 ◽  
Vol 791-793 ◽  
pp. 1289-1292
Author(s):  
Le Qiang Bai ◽  
Yan Yao Zhou ◽  
Shi Hong Zhang

Aiming at the problem that the K-Means algorithm is sensitive to the selection of initial clustering centers, this paper proposes a method for choosing the initial points of the K-Means algorithm. The algorithm determines the density of each data object by counting the number of similar data objects, and selects the category centers according to these densities. With the number of clusters given, clustering results on UCI standard data sets and random data sets demonstrate that the proposed algorithm has good stability and accuracy.
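The seeding idea can be sketched as follows: rank points by how many neighbors fall within a radius, then take high-density points that are not too close to centers already chosen. This is an illustrative 1-D sketch of density-guided seeding, not the paper's exact procedure; the `radius` parameter is an assumption standing in for the paper's similarity criterion.

```python
def density(points, idx, radius):
    """Density of a point = number of other points within `radius` of it."""
    return sum(1 for j, p in enumerate(points)
               if j != idx and abs(p - points[idx]) <= radius)

def initial_centers(points, k, radius):
    """Pick k high-density points as initial K-Means centers, skipping
    candidates that fall within `radius` of an already-chosen center."""
    ranked = sorted(range(len(points)),
                    key=lambda i: density(points, i, radius), reverse=True)
    centers = []
    for i in ranked:
        if all(abs(points[i] - c) > radius for c in centers):
            centers.append(points[i])
        if len(centers) == k:
            break
    return centers
```

The separation check matters: without it, the densest region would supply every seed, reproducing the instability the method is meant to avoid.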


Acta Numerica ◽  
2001 ◽  
Vol 10 ◽  
pp. 313-355 ◽  
Author(s):  
Markus Hegland

Methods for knowledge discovery in databases (KDD) have been studied for more than a decade. New methods are required owing to the size and complexity of data collections in administration, business and science. They include procedures for data query and extraction, for data cleaning, data analysis, and methods of knowledge representation. The part of KDD dealing with the analysis of the data has been termed data mining. Common data mining tasks include the induction of association rules, the discovery of functional relationships (classification and regression) and the exploration of groups of similar data objects in clustering. This review provides a discussion of and pointers to efficient algorithms for the common data mining tasks in a mathematical framework. Because of the size and complexity of the data sets, efficient algorithms and often crude approximations play an important role.


2014 ◽  
pp. 26-35
Author(s):  
Dan Cvrcek ◽  
Vaclav Matyas ◽  
Marek Kumpost

Many papers and articles attempt to define or even quantify privacy, typically with a major focus on anonymity. A related research exercise in the area of evidence-based trust models for ubiquitous computing environments has given us an impulse to take a closer look at the definition(s) of privacy in the Common Criteria, which we then transcribed in a bit more formal manner. This led us to a further review of unlinkability, and revision of another semi-formal model allowing for expression of anonymity and unlinkability – the Freiburg Privacy Diamond. We propose new means of describing (obviously only observable) characteristics of a system to reflect the role of contexts for profiling – and linking – users with actions in a system. We believe this approach should allow for evaluating privacy in large data sets.


Geophysics ◽  
2003 ◽  
Vol 68 (1) ◽  
pp. 168-180 ◽  
Author(s):  
Valentine Mikhailov ◽  
Armand Galdeano ◽  
Michel Diament ◽  
Alexei Gvishiani ◽  
Sergei Agayan ◽  
...  

Results of Euler deconvolution strongly depend on the selection of viable solutions. Synthetic calculations using multiple causative sources show that Euler solutions cluster in the vicinity of causative bodies even when they do not group densely about the perimeter of the bodies. We have developed a clustering technique to serve as a tool for selecting appropriate solutions. The clustering technique uses a methodology based on artificial intelligence, and it was originally designed to classify large data sets. It is based on a geometrical approach to study object concentration in a finite metric space of any dimension. The method uses a formal definition of cluster and includes free parameters that search for clusters of given properties. Tests on synthetic and real data showed that the clustering technique successfully outlines causative bodies more accurately than other methods used to discriminate Euler solutions. In complex field cases, such as the magnetic field in the Gulf of Saint Malo region (Brittany, France), the method provides dense clusters, which more clearly outline possible causative sources. In particular, it allows one to trace offshore the main inland tectonic structures and to study their interrelationships in the Gulf of Saint Malo. The clusters provide solutions associated with particular bodies, or parts of bodies, allowing the analysis of different clusters of Euler solutions separately. This may allow computation of average parameters for individual causative bodies. Those measurements of the anomalous field that yield clusters also form dense clusters themselves. Application of this clustering technique thus outlines areas where the influence of different causative sources is more prominent. This allows one to focus on these areas for more detailed study, using different window sizes, structural indices, etc.

