Role of Pre-processing Phase in Document Clustering Technique for Gurmukhi Script

Document clustering plays a central role in knowledge discovery and data mining by organizing large data sets into a certain number of groups of data objects called clusters. Each cluster consists of similar data objects, such that objects in the same cluster are highly similar to one another and dissimilar to the objects of other clusters. The document clustering technique for Gurmukhi script consists of two phases: 1) a pre-processing phase and 2) a processing phase. This paper concentrates on the pre-processing phase, whose purpose is to convert unstructured text into a structured format. The sub-phases of the pre-processing phase are segmentation, tokenization, removal of stop words, stemming, and normalization. The purpose of this paper is to present the significant role of the pre-processing phase in the overall performance of the document clustering technique for Gurmukhi script. The experimental results demonstrate this role in terms of the assignment of data objects to the relevant clusters as well as the creation of a meaningful cluster title list.
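The sub-phases named above can be sketched as a simple pipeline. This is a minimal illustration only: the stop-word list and suffix table below are English placeholders, not the actual Gurmukhi resources used in the paper.

```python
# Illustrative pre-processing pipeline: segmentation, tokenization,
# stop-word removal, and a crude suffix-stripping stemmer.
STOP_WORDS = {"the", "is", "of", "and", "a"}   # placeholder stop-word list
SUFFIXES = ("ing", "ed", "s")                  # placeholder suffix table

def segment(text):
    """Split raw text into sentences on full stops."""
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    """Split a sentence into lowercase word tokens."""
    return sentence.lower().split()

def stem(token):
    """Strip the first matching suffix, keeping a minimal stem length."""
    for suf in SUFFIXES:
        if token.endswith(suf) and len(token) > len(suf) + 2:
            return token[: -len(suf)]
    return token

def preprocess(text):
    """Convert unstructured text into a structured list of stemmed tokens."""
    tokens = []
    for sentence in segment(text):
        for tok in tokenize(sentence):
            if tok not in STOP_WORDS:
                tokens.append(stem(tok))
    return tokens
```

The output token list is the "structured text format" that the processing phase then clusters.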

2019 ◽  
Vol 8 (2) ◽  
pp. 1646-1653

Document clustering is an unsupervised machine learning technique that creates classes of similar objects without prior knowledge of the data sets. These classes are known as clusters; each cluster consists of unlabeled data objects such that objects within the same cluster have maximum similarity to one another and are dissimilar to the objects of other clusters. The purpose of this research work is to develop a domain-independent clustering technique for Gurmukhi script; it is the first such effort, as no prior work has been done on a domain-independent clustering technique for Gurmukhi script. In this paper, a hybrid algorithm for document clustering of Gurmukhi script has been developed. The experimental results reveal that the proposed hybrid technique performs well in defining the number of clusters, creating meaningful cluster titles, and assigning real-time unlabeled data sets to the relevant clusters, as a result of pre-processing steps such as segmentation, stemming, and normalization, along with the extraction of named/noun entities, the creation of cluster titles, and the placement of text documents into relevant clusters using fuzzy term weights.
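The final step, placing a document into the most relevant cluster by comparing term weights, can be sketched as follows. The hybrid algorithm itself is not specified in the abstract; this shows only the generic assignment idea, with term-weight vectors represented as sparse dicts and cosine similarity as an assumed similarity measure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign(doc_weights, cluster_centroids):
    """Place a document in the cluster whose centroid it is most similar to."""
    return max(cluster_centroids,
               key=lambda name: cosine(doc_weights, cluster_centroids[name]))
```

A document weighted toward "ball" would land in a sports-like cluster rather than a politics-like one, even with partial term overlap.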


Author(s):  
Afrand Agah ◽  
Mehran Asadi

This article introduces a new method to discover the role of influential people in online social networks and presents an algorithm that recognizes influential users able to reach a target in the network, giving organizations a strategic advantage in directing their digital marketing strategies. Social links among friends play an important role in dictating behavior in online social networks: these links determine the flow of information in the form of wall posts via shares, likes, re-tweets, mentions, etc., which in turn determines the influence of a node. The article first identifies the correlated nodes in large data sets using a customized divide-and-conquer algorithm and then measures the influence of each of these nodes using a linear function. The empirical results show that the users with the highest influence are those whose total number of friends is closest to the network average: the total number of friend links divided by the total number of nodes.
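The closing empirical observation amounts to picking the user whose friend count sits closest to the network's average degree. A minimal sketch of that heuristic, under the assumption that the network is given as a simple user-to-friend-count mapping:

```python
def most_influential(friend_counts):
    """Return the user whose friend count is closest to the average degree
    (total friend links divided by the number of nodes), following the
    article's empirical observation."""
    avg = sum(friend_counts.values()) / len(friend_counts)
    return min(friend_counts, key=lambda u: abs(friend_counts[u] - avg))
```

Note this is only the selection rule suggested by the result, not the article's full divide-and-conquer pipeline.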


2005 ◽  
Vol 13 (4) ◽  
pp. 277-298 ◽  
Author(s):  
Rob Pike ◽  
Sean Dorward ◽  
Robert Griesemer ◽  
Sean Quinlan

Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design – including the separation into two phases, the form of the programming language, and the properties of the aggregators – exploits the parallelism inherent in having data and computation distributed across many machines.
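The two-phase split described above, independent per-record filtering feeding a collated aggregation, can be sketched in miniature. The record fields below are hypothetical network-log fields, and this single-process sketch only mimics the structure that the real system distributes over many machines.

```python
from collections import Counter

def filter_phase(records):
    """Filtering phase: each record is examined independently and emits
    (key, value) pairs, so the work parallelizes trivially across machines."""
    for rec in records:
        if rec["bytes"] > 0:                  # keep only non-empty requests
            yield (rec["domain"], rec["bytes"])

def aggregate_phase(emitted):
    """Aggregation phase: collate emitted values per key (here, a sum table)."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return dict(totals)
```

Because the filter never looks at more than one record and the aggregator only needs commutative addition, both phases can be spread over hundreds of machines and the partial sums merged at the end.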




2013 ◽  
Vol 791-793 ◽  
pp. 1289-1292
Author(s):  
Le Qiang Bai ◽  
Yan Yao Zhou ◽  
Shi Hong Zhang

Aiming at the problem that the K-Means algorithm is sensitive to the selection of initial clustering centers, this paper proposes a method for choosing the initial points of the K-Means algorithm. The algorithm determines the density of each data object by counting the number of similar data objects, and selects the category centers according to these densities. With the number of clusters given, clustering results on UCI standard data sets and random data sets demonstrate that the proposed algorithm has good stability and accuracy.
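The seeding idea can be sketched as follows: rank points by how many neighbors fall within a radius, then take high-density points that are not too close to centers already chosen. This is an illustrative 1-D sketch of density-guided seeding, not the paper's exact procedure; the `radius` parameter is an assumption standing in for the paper's similarity criterion.

```python
def density(points, idx, radius):
    """Density of a point = number of other points within `radius` of it."""
    return sum(1 for j, p in enumerate(points)
               if j != idx and abs(p - points[idx]) <= radius)

def initial_centers(points, k, radius):
    """Pick k high-density points as initial K-Means centers, skipping
    candidates that fall within `radius` of an already-chosen center."""
    ranked = sorted(range(len(points)),
                    key=lambda i: density(points, i, radius), reverse=True)
    centers = []
    for i in ranked:
        if all(abs(points[i] - c) > radius for c in centers):
            centers.append(points[i])
        if len(centers) == k:
            break
    return centers
```

The separation check matters: without it, the densest region would supply every seed, reproducing the instability the method is meant to avoid.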


Acta Numerica ◽  
2001 ◽  
Vol 10 ◽  
pp. 313-355 ◽  
Author(s):  
Markus Hegland

Methods for knowledge discovery in databases (KDD) have been studied for more than a decade. New methods are required owing to the size and complexity of data collections in administration, business and science. They include procedures for data query and extraction, for data cleaning, data analysis, and methods of knowledge representation. The part of KDD dealing with the analysis of the data has been termed data mining. Common data mining tasks include the induction of association rules, the discovery of functional relationships (classification and regression) and the exploration of groups of similar data objects in clustering. This review provides a discussion of and pointers to efficient algorithms for the common data mining tasks in a mathematical framework. Because of the size and complexity of the data sets, efficient algorithms and often crude approximations play an important role.


2014 ◽  
pp. 26-35
Author(s):  
Dan Cvrcek ◽  
Vaclav Matyas ◽  
Marek Kumpost

Many papers and articles attempt to define or even quantify privacy, typically with a major focus on anonymity. A related research exercise in the area of evidence-based trust models for ubiquitous computing environments has given us an impulse to take a closer look at the definition(s) of privacy in the Common Criteria, which we then transcribed in a bit more formal manner. This led us to a further review of unlinkability, and revision of another semi-formal model allowing for expression of anonymity and unlinkability – the Freiburg Privacy Diamond. We propose new means of describing (obviously only observable) characteristics of a system to reflect the role of contexts for profiling – and linking – users with actions in a system. We believe this approach should allow for evaluating privacy in large data sets.


Geophysics ◽  
2003 ◽  
Vol 68 (1) ◽  
pp. 168-180 ◽  
Author(s):  
Valentine Mikhailov ◽  
Armand Galdeano ◽  
Michel Diament ◽  
Alexei Gvishiani ◽  
Sergei Agayan ◽  
...  

Results of Euler deconvolution strongly depend on the selection of viable solutions. Synthetic calculations using multiple causative sources show that Euler solutions cluster in the vicinity of causative bodies even when they do not group densely about the perimeter of the bodies. We have developed a clustering technique to serve as a tool for selecting appropriate solutions. The clustering technique uses a methodology based on artificial intelligence, and it was originally designed to classify large data sets. It is based on a geometrical approach to study object concentration in a finite metric space of any dimension. The method uses a formal definition of cluster and includes free parameters that search for clusters of given properties. Tests on synthetic and real data showed that the clustering technique successfully outlines causative bodies more accurately than other methods used to discriminate Euler solutions. In complex field cases, such as the magnetic field in the Gulf of Saint Malo region (Brittany, France), the method provides dense clusters, which more clearly outline possible causative sources. In particular, it allows one to trace offshore the main inland tectonic structures and to study their interrelationships in the Gulf of Saint Malo. The clusters provide solutions associated with particular bodies, or parts of bodies, allowing the analysis of different clusters of Euler solutions separately. This may allow computation of average parameters for individual causative bodies. Those measurements of the anomalous field that yield clusters also form dense clusters themselves. Application of this clustering technique thus outlines areas where the influence of different causative sources is more prominent. This allows one to focus on these areas for more detailed study, using different window sizes, structural indices, etc.

