Adaptive Initialization Method Based on Spatial Local Information for k-Means Algorithm

2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Honghong Liao ◽  
Jinhai Xiang ◽  
Weiping Sun ◽  
Jianghua Dai ◽  
Shengsheng Yu

The k-means algorithm is a widely used clustering algorithm in the data mining and machine learning community. However, the initial guess of cluster centers seriously affects the clustering result, which means that improper initialization may fail to produce a desirable clustering result. How to choose suitable initial centers is an important research issue for the k-means algorithm. In this paper, we propose an adaptive initialization framework based on spatial local information (AIF-SLI), which takes advantage of the local density of the data distribution. As it is difficult to estimate density correctly, we develop two approximate estimations: density by t-nearest neighborhoods (t-NN) and density by ϵ-neighborhoods (ϵ-Ball), leading to two implementations of the proposed framework. Our empirical study on more than 20 datasets shows promising performance of the proposed framework and indicates that it has several advantages: (1) it can find reasonable candidates for initial centers effectively; (2) it can reduce the iterations of k-means methods significantly; (3) it is robust to outliers; and (4) it is easy to implement.
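The density-guided selection of initial centers described above can be illustrated with a minimal sketch. The t-NN density estimate and the greedy density-times-distance picking rule here are illustrative assumptions, not the authors' exact AIF-SLI procedure:

```python
import math

def tnn_density(points, t):
    """Estimate each point's local density as the inverse of its mean
    distance to its t nearest neighbours (a t-NN density estimate)."""
    dens = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points if q is not p)
        dens.append(1.0 / (sum(dists[:t]) / t + 1e-12))
    return dens

def pick_initial_centers(points, k, t=3):
    """Greedily pick k initial centers: the densest point first, then
    points combining high density with distance from chosen centers,
    so that outliers (low density) are unlikely to be selected."""
    dens = tnn_density(points, t)
    centers = [points[max(range(len(points)), key=lambda i: dens[i])]]
    while len(centers) < k:
        def score(i):
            return dens[i] * min(math.dist(points[i], c) for c in centers)
        centers.append(points[max(range(len(points)), key=score)])
    return centers
```

Weighting the distance term by density is what distinguishes this from plain k-means++ seeding: an isolated outlier is far from every chosen center but has low density, so it scores poorly.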

2021 ◽  
Vol 8 (10) ◽  
pp. 43-50
Author(s):  
Truong et al.

Clustering is a fundamental technique in data mining and machine learning. Recently, many researchers have become interested in the problem of clustering categorical data, and several new approaches have been proposed. One of the successful and pioneering clustering algorithms is the Minimum-Minimum Roughness algorithm (MMR), a top-down hierarchical clustering algorithm that can handle the uncertainty in clustering categorical data. However, MMR tends to choose the attribute with fewer values and the leaf node with more objects, leading to undesirable clustering results. To overcome these shortcomings, this paper proposes an improved version of the MMR algorithm for clustering categorical data, called IMMR (Improved Minimum-Minimum Roughness). Experimental results on actual data sets taken from UCI show that the IMMR algorithm outperforms MMR in clustering categorical data.


Author(s):  
Puspalata Sah ◽  
Kandarpa Kumar Sarma

Detection of diabetes using a bloodless technique is an important research issue in the area of machine learning and artificial intelligence (AI). Here we present the working of a system designed to detect abnormality of the eye using a pain- and blood-free method. Typical features of diabetic retinopathy (DR) are used along with certain soft computing techniques to design such a system. The essential components of DR are blood vessels, red lesions visible as microaneurysms and hemorrhages, and whitish lesions, i.e., lipid exudates and cotton wool spots. The chapter reports the use of a unique feature set derived from the retinal image of the eye. The feature set is applied to a Support Vector Machine (SVM), which provides the decision regarding the state of infection of the eye. The classification ability of the proposed system is 91.67% for blood vessels and exudates and 83.33% for the optic disc and microaneurysms.


Author(s):  
Junjie Wu ◽  
Jian Chen ◽  
Hui Xiong

Cluster analysis (Jain & Dubes, 1988) provides insight into data by dividing objects into groups (clusters), such that objects in a cluster are more similar to each other than to objects in other clusters. Cluster analysis has long played an important role in a wide variety of fields, such as psychology, bioinformatics, pattern recognition, information retrieval, machine learning, and data mining. Many clustering algorithms, such as K-means and the Unweighted Pair Group Method with Arithmetic Mean (UPGMA), have been well-established. A recent research focus in cluster analysis is to understand the strengths and weaknesses of various clustering algorithms with respect to data factors. Indeed, people have identified some data characteristics that may strongly affect clustering analysis, including high dimensionality and sparseness, large size, noise, types of attributes and data sets, and scales of attributes (Tan, Steinbach, & Kumar, 2005). However, further investigation is expected to reveal whether and how data distributions can impact the performance of clustering algorithms. Along this line, we study clustering algorithms by answering three questions: 1. What are the systematic differences between the distributions of the resultant clusters produced by different clustering algorithms? 2. How can the distribution of the "true" cluster sizes impact the performance of clustering algorithms? 3. How should an appropriate clustering algorithm be chosen in practice? The answers to these questions can guide us toward a better understanding and use of clustering methods. This is noteworthy since, 1) in theory, people have seldom realized that there are strong relationships between clustering algorithms and cluster size distributions, and 2) in practice, choosing an appropriate clustering algorithm is still a challenging task, especially after the algorithm boom in the data mining area. This chapter thus tries to begin filling this void.
To this end, we carefully select two widely used categories of clustering algorithms, i.e., K-means and Agglomerative Hierarchical Clustering (AHC), as the representative algorithms for illustration. In the chapter, we first show that K-means tends to generate clusters with a relatively uniform distribution of cluster sizes. Then we demonstrate that UPGMA, one of the robust AHC methods, acts in the opposite way to K-means; that is, UPGMA tends to generate clusters with high variation in cluster sizes. Indeed, the experimental results indicate that the variations of the resultant cluster sizes produced by K-means and UPGMA, measured by the Coefficient of Variation (CV), lie in specific intervals, namely [0.3, 1.0] and [1.0, 2.5] respectively. Finally, we compare K-means and UPGMA directly, and propose some rules for a better choice of clustering scheme from the data distribution point of view.
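The Coefficient of Variation used above to summarize cluster-size dispersion is simply the standard deviation of the cluster sizes divided by their mean; a minimal sketch:

```python
import statistics

def cluster_size_cv(labels):
    """Coefficient of Variation (CV) of cluster sizes: the standard
    deviation of the sizes divided by their mean. Values near 0 mean
    uniform cluster sizes (the K-means tendency); larger values mean
    highly skewed sizes (the UPGMA tendency)."""
    sizes = [labels.count(c) for c in set(labels)]
    return statistics.pstdev(sizes) / statistics.mean(sizes)
```

For example, `cluster_size_cv([0] * 50 + [1] * 50)` is 0.0 (perfectly uniform sizes), while `cluster_size_cv([0] * 90 + [1] * 10)` is 0.8 (sizes 90 and 10 around a mean of 50).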


2019 ◽  
Vol 8 (4) ◽  
pp. 6036-6040

Data mining is a vital area of research and is used practically in many different domains. It has become a highly demanding field because huge amounts of data have been collected in various applications. A database can be clustered in many different ways depending on the clustering algorithm used, parameter settings, and other factors. Multiple clustering algorithms can be combined to obtain the final partitioning of the data, which provides better clustering results. In this paper, an ensemble hybrid KMeans and DBSCAN (HDKA) algorithm is proposed to overcome the drawbacks of the DBSCAN and KMeans clustering algorithms. The proposed algorithm improves performance by selecting centroid points through a centroid selection strategy. For experimental results we used two datasets, Colon and Leukemia, from the UCI machine learning repository.


Implementation of data mining techniques in e-learning is a trending research area, due to the increasing popularity of e-learning systems. E-learning systems provide increased portability, convenience, and a better learning experience. In this research, we propose two novel schemes for upgrading e-learning portals based on learners' data to improve the quality of e-learning. The first scheme is Learner History-based E-learning Portal Up-gradation (LHEPU). In this scheme, the web log history data of the learner is acquired and various useful attributes are extracted from it. Data mining techniques like pattern analysis, frequency distribution, correlation analysis, sequential mining, and machine learning are then applied to these attributes. The results of these techniques are used to improve the e-learning portal, e.g., through topic recommendations and learner grade prediction. The second scheme is Learner Assessment-based E-Learning Portal Up-gradation (LAEPU). This scheme is implemented in two phases, namely the development phase and the deployment phase. In the development phase, the learner attends a short pretraining program followed by an assessment test. Based on the learner's performance in this test, learners are clustered into different groups using a clustering algorithm such as K-Means or DBSCAN. The portal is designed to support each group of learners. In the deployment phase, a new learner is mapped to a particular group based on his/her performance in the pretraining program.


2017 ◽  
Vol 79 (7) ◽  
Author(s):  
Chayanan Nawapornanan ◽  
Sarun Intakosum ◽  
Veera Boonjing

Share-frequent pattern mining is more practical than traditional frequent pattern mining because it can reflect useful knowledge such as the total costs and profits of patterns. Mining share-frequent patterns has become one of the most important research issues in data mining. However, previous algorithms extract a large number of candidates and spend a lot of time generating and testing many useless candidates in the mining process. This paper proposes a new efficient method for discovering share-frequent patterns. The new method reduces the number of candidates by generating candidates only from high transaction-measure-value patterns. The downward closure property of transaction-measure-value patterns ensures the correctness of the proposed method. Experimental results on dense and sparse datasets show that the proposed method is very efficient in terms of execution time. It also decreases the number of useless candidates generated in the mining process by at least 70%.


2018 ◽  
Author(s):  
Quazi Abidur Rahman ◽  
Tahir Janmohamed ◽  
Meysam Pirbaglou ◽  
Hance Clarke ◽  
Paul Ritvo ◽  
...  

BACKGROUND Measuring and predicting pain volatility (fluctuation or variability in pain scores over time) can help improve pain management. Perceptions of pain and its consequent disabling effects are often heightened under the conditions of greater uncertainty and unpredictability associated with pain volatility. OBJECTIVE This study aimed to use data mining and machine learning methods to (1) define a new measure of pain volatility and (2) predict future pain volatility levels for users of the pain management app, Manage My Pain, based on demographic, clinical, and app use features. METHODS Pain volatility was defined as the mean of absolute changes between 2 consecutive self-reported pain severity scores within the observation periods. The k-means clustering algorithm was applied to users' pain volatility scores at the first and sixth month of app use to establish a threshold discriminating low from high volatility classes. Subsequently, we extracted 130 demographic, clinical, and app usage features from the first month of app use to predict these 2 volatility classes at the sixth month of app use. Prediction models were developed using 4 methods: (1) logistic regression with ridge estimators; (2) logistic regression with the Least Absolute Shrinkage and Selection Operator; (3) Random Forests; and (4) Support Vector Machines. Overall prediction accuracy and accuracy for both classes were calculated to compare the performance of the prediction models. Training and testing were conducted using 5-fold cross validation. A class imbalance issue was addressed using random subsampling of the training dataset. Users with at least five pain records in both the predictor and outcome periods (N=782 users) were included in the analysis. RESULTS The k-means clustering algorithm was applied to pain volatility scores to establish a threshold of 1.6 to differentiate between low and high volatility classes.
After validating the threshold using random subsamples, 2 classes were created: low volatility (n=611) and high volatility (n=171). In this class-imbalanced dataset, all 4 prediction models achieved 78.1% (611/782) to 79.0% (618/782) overall accuracy. However, all models had a prediction accuracy of less than 18.7% (32/171) for the high volatility class. After addressing the class imbalance issue using random subsampling, results improved across all models for the high volatility class to greater than 59.6% (102/171). The prediction model based on Random Forests performed best, consistently achieving approximately 70% accuracy for both classes across 3 random subsamples. CONCLUSIONS We propose a novel method for measuring pain volatility. Cluster analysis was applied to divide users into subsets of low and high volatility classes. These classes were then predicted at the sixth month of app use with an acceptable degree of accuracy using machine learning methods based on features extracted from demographic, clinical, and app use information from the first month.
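The volatility measure defined in this abstract (the mean of absolute changes between consecutive self-reported pain severity scores) is straightforward to compute; a minimal sketch, with the function name an illustrative choice:

```python
def pain_volatility(scores):
    """Pain volatility as defined above: the mean of absolute changes
    between consecutive self-reported pain severity scores within an
    observation period. Requires at least two records."""
    if len(scores) < 2:
        raise ValueError("need at least two pain records")
    return sum(abs(b - a) for a, b in zip(scores, scores[1:])) / (len(scores) - 1)
```

For example, `pain_volatility([5, 7, 4, 4])` is (2 + 3 + 0) / 3 ≈ 1.67, which would fall in the high volatility class under the 1.6 threshold reported above.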


2020 ◽  
Vol 34 (04) ◽  
pp. 6869-6876
Author(s):  
Yiqun Zhang ◽  
Yiu-ming Cheung

Clustering ordinal data is a common task in the data mining and machine learning fields. As a major type of categorical data, ordinal data is composed of attributes with naturally ordered possible values (also called categories interchangeably in this paper). However, due to the lack of a dedicated distance metric, ordinal categories are usually treated as nominal ones, or coded as consecutive integers and treated as numerical ones. Both of these common approaches define the distances between ordinal categories only roughly, because the former ignores the order relationship and the latter simply assigns identical distances to different pairs of adjacent categories that may have intrinsically unequal distances. As a result, they may produce unsatisfactory ordinal data clustering results. This paper therefore proposes a novel ordinal data clustering algorithm, which iteratively learns: 1) the partition of the ordinal dataset, and 2) the inter-category distances. To the best of our knowledge, this is the first attempt to dynamically adjust inter-category distances during the clustering process to search for a better partition of ordinal data. The proposed algorithm features superior clustering accuracy, low time complexity, and fast convergence, and is parameter-free. Extensive experiments show its efficacy.


2021 ◽  
Vol 6 (2) ◽  
Author(s):  
Vaishnavi Shailesh Manthalkar ◽  
S. R. Barbade

Twitter, Facebook, and Google are well-known social sites that many people use for different purposes. Social sites are the fastest medium for delivering news to users, compared with newspapers and television. Twitter, one such online social network, has quickly gained fame as it provides people with the opportunity to communicate and share messages known as "tweets". Tremendous value lies in the automated analysis and data mining of such vast and diverse data to derive meaningful insights, which carries potential opportunities for businesses, consumers, product surveys, and political surveys. In the proposed system for analyzing tweets, a search query for a topic is provided to extract the required data using a clustering algorithm in machine learning. The unique advantage of using machine learning is that once an algorithm knows what to do with the information, it can do its job automatically. To answer a search query, the proposed system extracts features from the tweets, such as keywords and the number of words in each tweet. The output tweet list can further be used in analyses for business improvement or for surveys.


2014 ◽  
Vol 23 (4) ◽  
pp. 379-389 ◽  
Author(s):  
Jun Ye

Clustering plays an important role in data mining, pattern recognition, and machine learning. Single-valued neutrosophic sets (SVNSs) are useful means to describe and handle the indeterminate and inconsistent information that fuzzy sets and intuitionistic fuzzy sets cannot describe and deal with. To cluster data represented by single-valued neutrosophic information, this article proposes single-valued neutrosophic clustering methods based on similarity measures between SVNSs. First, we define a generalized distance measure between SVNSs and propose two distance-based similarity measures of SVNSs. Then, we present a clustering algorithm based on the similarity measures of SVNSs to cluster single-valued neutrosophic data. Finally, an illustrative example is given to demonstrate the application and effectiveness of the developed clustering methods.
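An SVNS assigns each element of a universe a truth, an indeterminacy, and a falsity degree, so a generalized distance between two SVNSs can be built Minkowski-style from the component-wise differences, and a similarity measure derived as one minus the distance. The exact definitions used in the article may differ; this is a hedged sketch:

```python
def svns_distance(A, B, p=2):
    """Minkowski-style generalized distance between two single-valued
    neutrosophic sets, given as lists of (truth, indeterminacy, falsity)
    triples over the same universe. p=1 is Hamming-like, p=2 is
    Euclidean-like. Illustrative form; the article's measure may differ."""
    n = len(A)
    total = sum(abs(ta - tb) ** p + abs(ia - ib) ** p + abs(fa - fb) ** p
                for (ta, ia, fa), (tb, ib, fb) in zip(A, B))
    return (total / (3 * n)) ** (1 / p)

def svns_similarity(A, B, p=2):
    """Distance-based similarity measure: s(A, B) = 1 - d(A, B), in [0, 1]."""
    return 1.0 - svns_distance(A, B, p)
```

A clustering algorithm can then group elements whose pairwise `svns_similarity` exceeds a chosen confidence level, which is the general pattern of similarity-based clustering the abstract describes.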

