Correlation and Probability Based Similarity Measure for Detecting Outliers in Categorical Data

Determining the similarity or distance between data objects is an important part of many research fields, such as statistics, data mining, and machine learning. Many measures are available in the literature to define the distance between two numerical data objects. Defining such a metric for two categorical data objects is difficult because categorical values are not ordered, and only a few measures are available in the literature for finding similarities among categorical data objects. This paper presents a comparative evaluation of various similarity measures for categorical data and introduces a novel similarity measure for categorical data based on occurrence frequency and correlation. We evaluated the performance of these similarity measures in the context of the outlier detection task in data mining, using real-world data sets. Experimental results show that the proposed similarity measure outperforms the existing similarity measures in detecting outliers in categorical data sets.
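The idea of scoring outliers with a categorical similarity measure can be illustrated with a minimal Python sketch. It does not reproduce the measure proposed in the paper, whose definition is not given in the abstract. It shows the simple overlap measure and one commonly cited occurrence-frequency formulation; the mismatch formula, the function names, and the averaging-based outlier score are assumptions made for illustration.

```python
import math
from collections import Counter


def overlap_similarity(x, y):
    """Fraction of attributes on which two categorical records agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def occurrence_frequency_similarity(x, y, data):
    """Occurrence-frequency style similarity (assumed formulation).

    A match contributes 1. A mismatch contributes
    1 / (1 + log(N / f(a)) * log(N / f(b))), so mismatches between two
    frequent values are penalised less than mismatches involving rare values.
    Both records must come from `data`, otherwise a frequency can be zero.
    """
    n = len(data)
    counts = [Counter(row[k] for row in data) for k in range(len(x))]
    total = 0.0
    for k, (a, b) in enumerate(zip(x, y)):
        if a == b:
            total += 1.0
        else:
            total += 1.0 / (
                1.0 + math.log(n / counts[k][a]) * math.log(n / counts[k][b])
            )
    return total / len(x)


def outlier_scores(data, sim):
    """Score each record as 1 minus its mean similarity to all other records."""
    scores = []
    for i, x in enumerate(data):
        others = [sim(x, y) for j, y in enumerate(data) if j != i]
        scores.append(1.0 - sum(others) / len(others))
    return scores


# Hypothetical toy data: the last record shares no value with the others.
records = [("a", "x"), ("a", "x"), ("a", "x"), ("b", "y")]
scores = outlier_scores(records, overlap_similarity)
most_outlying = scores.index(max(scores))  # index of the highest score
```

Any two-argument similarity function can be passed to `outlier_scores`; a measure that needs the full data set, such as the occurrence-frequency variant, can be wrapped in a lambda.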

In data mining, many techniques use distance-based measures for data clustering. Improving clustering performance is the fundamental goal of cluster-related tasks. Many techniques are available for clustering numerical data as well as categorical data. Clustering is an unsupervised learning technique in which objects are grouped based on the similarity among them. This paper proposes a new cluster similarity measure, the cosine-like cluster similarity measure (CLCSM), and uses it for data classification. Extensive experiments are conducted on UCI machine learning datasets. The experimental results show that the proposed cosine-like cluster similarity measure is superior to many existing cluster similarity measures for data classification.


2013 ◽  
Vol 12 (5) ◽  
pp. 3443-3451
Author(s):  
Rajesh Pasupuleti ◽  
Narsimha Gugulothu

Clustering analysis initiates a new direction in data mining that has a major impact on various domains, including machine learning, pattern recognition, image processing, information retrieval, and bioinformatics. Current clustering techniques do not adequately address some of the requirements, and clustering algorithms have not been standardized to support all real applications. Many clustering methods depend on user-specified parameters, and the initial cluster seeds are selected randomly by the user. In this paper, we propose a new clustering method based on linear approximation of a function: we first obtain an overall idea of the behavior of the clustering function, then pick the initial cluster seeds as points on the linear approximation line and perform the clustering operations. This differs from traditional clustering methods, which group data objects into clusters using distance measures, similarity measures, and statistical distributions. Experimental results, illustrated with an example on business data, show that clustering based on linear approximation yields good results in practice. The paper also explains privacy-preserving clustering of sensitive data objects.


Author(s):  
T. Warren Liao

In this chapter, we present genetic algorithm (GA) based methods developed for clustering univariate time series of equal or unequal length as an exploratory step of data mining. These methods essentially implement the k-medoids algorithm. Each chromosome encodes in binary the data objects serving as the k medoids. To compare their performance, both fixed-parameter and adaptive GAs were used. We first employed the synthetic control chart data set to investigate the performance of three fitness functions, two distance measures, and other GA parameters such as population size, crossover rate, and mutation rate. Experiments were also run on two more sets of time series, with and without a known number of clusters: the cylinder-bell-funnel data and the novel battle simulation data. The clustering results are presented and discussed.
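The chromosome encoding described above can be sketched in Python as follows. This is an illustrative reading, not the chapter's implementation: it assumes one bit per data object (1 means the object is a medoid), uses the total distance to the nearest medoid as the cost a GA would minimise, and uses Euclidean distance, which only applies to series of equal length. All names are hypothetical.

```python
def decode(chromosome):
    """Return the indices of the objects flagged as medoids (bit == 1)."""
    return [i for i, bit in enumerate(chromosome) if bit == 1]


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def total_distance(chromosome, series, dist=euclidean):
    """k-medoids cost: sum of each series' distance to its nearest medoid."""
    medoids = decode(chromosome)
    if not medoids:
        raise ValueError("chromosome selects no medoid")
    return sum(min(dist(s, series[m]) for m in medoids) for s in series)


# Hypothetical toy series forming two obvious groups.
series = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = [1, 0, 1, 0]  # one medoid per group
poor = [1, 1, 0, 0]  # both medoids in the same group
good_cost = total_distance(good, series)
poor_cost = total_distance(poor, series)
```

A GA would evolve such bit strings with crossover and mutation, preferring chromosomes with lower cost; a repair step to keep exactly k bits set is omitted here.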


2013 ◽  
Vol 811 ◽  
pp. 547-551 ◽  
Author(s):  
Hong Xu Wang ◽  
Hai Feng Wang ◽  
Kun Zhang ◽  
Hui Wang

To remedy the defects of existing similarity measure formulas between vague sets, a new definition of the similarity measure between vague sets is proposed, and a new formula with higher resolution and highlighted uncertainty is presented on the basis of the data mining vague value method. A general fault diagnosis method of vague sets (GFDMVS) is proposed. The same practical case is studied with three methods, and the results demonstrate the validity and reasonableness of the method proposed in this paper.


2008 ◽  
pp. 371-380
Author(s):  
Takao Ito

One of the most important issues in data mining is discovering implicit relationships between words in a large corpus and labels in a large database. The relationship between words and labels is often expressed as a function of distance measures. An effective measure would be useful not only for achieving high precision in data mining, but also for reducing the time the operation takes. In previous research, many measures for calculating one-to-many relationships have been proposed, such as the complementary similarity measure, mutual information, and the phi coefficient. Some research has shown that the complementary similarity measure is the most effective. In this article, the author reviews previous research on measures for one-to-many relationships and proposes a new idea for obtaining an effective measure, based on a heuristic approach.
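Two of the measures named above can be computed from a 2x2 co-occurrence table, as in the Python sketch below. The table layout is an assumption for illustration: a counts items where both the word and the label occur, b where only the word occurs, c where only the label occurs, and d where neither occurs. Mutual information is shown in its pointwise form, which may differ from the variant used in the article.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient of a 2x2 table; returns 0.0 when a margin is empty."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0


def pointwise_mutual_information(a, b, c, d):
    """log2 of P(word, label) / (P(word) * P(label))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the word and label never co-occur
    return math.log2(a * n / ((a + b) * (a + c)))


# Hypothetical counts: word and label always occur together.
perfect = (10, 0, 0, 10)
# Hypothetical counts: word and label are statistically independent.
independent = (5, 5, 5, 5)
```

Both measures are zero for the independent table and positive for the perfectly associated one.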


2018 ◽  
pp. 972-985
Author(s):  
Lixin Fan

The measurement of uncertainty is an important topic for theories dealing with uncertainty. The definition of a similarity measure between two intuitionistic fuzzy sets (IFSs) is one of the most interesting topics in IFS theory. A similarity measure is defined to compare the information carried by IFSs. Many similarity measures have been proposed, a few of which are derived from well-known distance measures. In this work, a new similarity measure between IFSs is proposed that considers the information carried by the membership degree, the non-membership degree, and the hesitancy degree. To demonstrate the efficiency of the proposed similarity measure, it is compared with various existing similarity measures between IFSs on numerical examples. The comparison shows that the new similarity measure is reasonable and discriminates more strongly than the others. Finally, the similarity measure is applied to pattern recognition and medical diagnosis, and two illustrative examples show its effectiveness in these applications.
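The remark that some similarity measures come from distance measures can be illustrated with a short Python sketch. It is not the measure proposed in this work. It assumes each IFS is a list of (membership, non-membership) pairs over the same universe, derives the hesitancy degree as 1 minus their sum, and defines similarity as 1 minus the normalized Hamming distance over all three degrees. The function names and the toy patterns are hypothetical.

```python
def hesitancy(mu, nu):
    """Hesitancy degree of an intuitionistic fuzzy value."""
    return 1.0 - mu - nu


def hamming_similarity(a, b):
    """1 minus the normalized Hamming distance between two IFSs."""
    total = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(a, b):
        total += (
            abs(mu_a - mu_b)
            + abs(nu_a - nu_b)
            + abs(hesitancy(mu_a, nu_a) - hesitancy(mu_b, nu_b))
        )
    return 1.0 - total / (2 * len(a))


def classify(sample, patterns):
    """Pattern recognition: return the name of the most similar pattern."""
    return max(patterns, key=lambda name: hamming_similarity(sample, patterns[name]))


# Hypothetical known patterns and an unknown sample.
patterns = {
    "P1": [(0.9, 0.1), (0.8, 0.1)],
    "P2": [(0.2, 0.7), (0.1, 0.8)],
}
sample = [(0.8, 0.1), (0.7, 0.2)]
label = classify(sample, patterns)
```

The same nearest-pattern rule applies to the medical diagnosis setting, with diagnoses as patterns and a patient's symptoms as the sample.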


Mathematics ◽  
2020 ◽  
Vol 8 (4) ◽  
pp. 519 ◽  
Author(s):  
Muhammad Jabir Khan ◽  
Poom Kumam ◽  
Wejdan Deebani ◽  
Wiyada Kumam ◽  
Zahir Shah

A new condition on the positive, neutral, and negative membership functions gives a successful extension of the picture fuzzy set and the Pythagorean fuzzy set, called the spherical fuzzy set (SFS). This extends the domain of the positive, neutral, and negative membership functions. Given the importance of similarity measures and their applications in data mining, medical diagnosis, decision making, and pattern recognition, several similarity measures have been proposed in the literature. Some of them, however, do not satisfy the axioms of similarity and produce counter-intuitive results. In this paper, we propose set-theoretic similarity and distance measures. We provide counterexamples for similarity measures already proposed in the literature and show how our proposed method is important and applicable to pattern recognition problems. Finally, we apply the proposed similarity measure to the selection of mega projects in underdeveloped countries.
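The way the spherical condition extends the domain can be shown with a small Python sketch. It assumes the standard definitions, which the abstract does not spell out: a picture fuzzy value requires the three degrees to sum to at most 1, a Pythagorean fuzzy value requires the squares of the positive and negative degrees to sum to at most 1, and a spherical fuzzy value requires the squares of all three degrees to sum to at most 1. The function names are hypothetical.

```python
def in_unit_interval(*degrees):
    """True when every degree lies in [0, 1]."""
    return all(0.0 <= g <= 1.0 for g in degrees)


def is_picture_fuzzy(mu, eta, nu):
    """Picture fuzzy condition: mu + eta + nu <= 1."""
    return in_unit_interval(mu, eta, nu) and mu + eta + nu <= 1.0


def is_pythagorean_fuzzy(mu, nu):
    """Pythagorean fuzzy condition: mu^2 + nu^2 <= 1."""
    return in_unit_interval(mu, nu) and mu ** 2 + nu ** 2 <= 1.0


def is_spherical_fuzzy(mu, eta, nu):
    """Spherical fuzzy condition: mu^2 + eta^2 + nu^2 <= 1."""
    return in_unit_interval(mu, eta, nu) and mu ** 2 + eta ** 2 + nu ** 2 <= 1.0


# A value that is spherical fuzzy but not picture fuzzy:
# the degrees sum to 1.6, while their squares sum to 0.86.
value = (0.6, 0.5, 0.5)
picture_ok = is_picture_fuzzy(*value)
spherical_ok = is_spherical_fuzzy(*value)
```

Every picture fuzzy value also satisfies the spherical condition, because squaring a degree in [0, 1] never increases it.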


Text data analytics has become an integral part of World Wide Web data management and of Internet-based applications, which are growing rapidly all over the world. E-commerce applications are growing exponentially in the business field, and e-commerce competitors are increasingly adopting machine learning techniques to predict business-related operations, with the aim of increasing product sales. The use of similarity measures is unavoidable in modern real-world applications. Cosine similarity plays a dominant role in text data mining applications such as text classification, clustering, querying, and searching. This paper proposes a modified clustering-based cosine similarity measure, called MCS, for data classification. The proposed method is verified experimentally on many UCI machine learning datasets involving categorical attributes, and it produces more accurate classification results in the majority of the experiments conducted.
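For reference, plain cosine similarity on categorical records can be sketched in Python as below. This is the standard measure, not the MCS variant proposed in the paper, whose definition is not given in the abstract. The one-hot encoding of (attribute position, value) pairs and the function names are assumptions made for illustration.

```python
import math


def build_vocab(records):
    """Sorted list of every (attribute index, value) pair seen in the records."""
    return sorted({(k, v) for r in records for k, v in enumerate(r)})


def one_hot(record, vocab):
    """Encode a categorical record as a 0/1 vector over the vocabulary."""
    return [1.0 if record[k] == v else 0.0 for k, v in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0


# Hypothetical records with two categorical attributes each.
records = [("red", "small"), ("red", "large"), ("blue", "large")]
vocab = build_vocab(records)
vectors = [one_hot(r, vocab) for r in records]
sim_01 = cosine_similarity(vectors[0], vectors[1])  # share one of two values
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share no value
```

With this encoding every record vector has the same norm, so the cosine similarity of two records equals the fraction of attributes on which they agree.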

