A survey on feature weighting based K-Means algorithms

2020 ◽  
Author(s):  
Renato Cordeiro de Amorim

In a real-world data set there is always the possibility, rather high in our opinion, that different features may have different degrees of relevance. Most machine learning algorithms deal with this fact by either selecting or deselecting features in the data preprocessing phase. However, we maintain that even among relevant features there may be different degrees of relevance, and this should be taken into account during the clustering process. With over 50 years of history, K-Means is arguably the most popular partitional clustering algorithm there is. The first K-Means based clustering algorithm to compute feature weights was designed just over 30 years ago. Various such algorithms have been designed since, but there has not been, to our knowledge, a survey integrating empirical evidence of cluster recovery ability, common flaws, and possible directions for future research. This paper elaborates on the concept of feature weighting and addresses these issues by critically analysing some of the most popular, or innovative, feature weighting mechanisms based on K-Means.
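The feature weighting idea surveyed here can be illustrated with a short sketch. The snippet below is a minimal NumPy implementation of one common feature-weighted K-Means scheme, in which each feature weight is raised to an exponent beta and updated from the within-cluster dispersions; the update rule, parameter names, and default values are illustrative and not taken from the survey.

```python
import numpy as np

def weighted_kmeans(X, k, beta=2.0, n_iter=50, seed=0):
    """Minimal feature-weighted K-Means sketch; beta > 1 controls weight influence."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    weights = np.full(d, 1.0 / d)                       # one weight per feature
    for _ in range(n_iter):
        # Assign each point using feature-weighted squared distances.
        diff = X[:, None, :] - centers[None, :, :]
        dist = (weights ** beta * diff ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update centroids of non-empty clusters.
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
        # Per-feature within-cluster dispersion, then closed-form weight update.
        D = ((X - centers[labels]) ** 2).sum(axis=0) + 1e-12
        weights = 1.0 / ((D[:, None] / D[None, :]) ** (1.0 / (beta - 1.0))).sum(axis=1)
    return labels, centers, weights

# Features with low within-cluster dispersion should end up with larger weights.
labels, centers, weights = weighted_kmeans(np.random.rand(200, 5), k=3)
```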

2018 ◽  
Vol 210 ◽  
pp. 04019 ◽  
Author(s):  
Hyontai SUG

Recent Go matches between humans and the artificial intelligence AlphaGo demonstrated a major advance in machine learning technologies. While AlphaGo was trained using real-world data, AlphaGo Zero was trained using massive amounts of randomly generated data, and the fact that AlphaGo Zero defeated AlphaGo completely revealed that diversity and size of training data are important for better performance of machine learning algorithms, especially deep learning algorithms based on neural networks. On the other hand, artificial neural networks and decision trees are widely accepted machine learning algorithms because of their robustness to errors and their comprehensibility, respectively. In this paper, in order to show empirically that diversity and size of data are important factors for better performance of machine learning algorithms, these two representative algorithms are used in experiments. A real-world data set called breast tissue was chosen because it consists of real-valued features, a property that is well suited to generating artificial random data. The results of the experiment confirm that the diversity and size of data are very important factors for better performance.
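A hedged sketch of the kind of comparison described above: training a decision tree and a small neural network on increasingly large samples and observing how held-out accuracy changes with training-set size. The data here is a synthetic scikit-learn stand-in; neither the actual UCI Breast Tissue data nor the paper's random-data generation procedure is reproduced.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a small real-valued data set such as Breast Tissue.
X, y = make_classification(n_samples=2000, n_features=9, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for n in (50, 200, 800, len(X_train)):          # growing training-set sizes
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[:n], y_train[:n])
    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                        random_state=0).fit(X_train[:n], y_train[:n])
    print(n, round(tree.score(X_test, y_test), 3), round(mlp.score(X_test, y_test), 3))
```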


2011 ◽  
Vol 2011 ◽  
pp. 1-14 ◽  
Author(s):  
Chunzhong Li ◽  
Zongben Xu

The structure of a data set is of critical importance in identifying clusters, especially differences in density. In this paper, we present a clustering algorithm based on density consistency: a filtering process that identifies points sharing the same structural feature and assigns them to the same cluster. The method is not restricted by cluster shape or by high-dimensional data, and it is robust to noise and outliers. Extensive experiments on synthetic and real-world data sets validate the proposed clustering algorithm.


2019 ◽  
Vol 8 (3) ◽  
pp. 7071-7081

Many real-world data sets processed through machine learning are imbalanced by nature. Such imbalanced data presents researchers with a challenging prediction scenario for both machine learning and data mining algorithms. Past research studies show that most imbalanced data sets consist of majority and minority classes, with the majority class dominating the minority class. Several standard and hybrid prediction algorithms have been proposed in various application domains, but most of the real-world data sets analyzed in those studies are imbalanced by nature, which affects prediction accuracy. This paper presents a systematic survey of past research studies to analyze intrinsic data characteristics and techniques used for handling class-imbalanced data. In addition, the study reveals research gaps, trends and patterns in existing studies and briefly discusses future research directions.
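One widely used family of techniques covered by such surveys is cost-sensitive learning, where the minority class receives a larger weight during training. The following is a small, generic scikit-learn sketch (not tied to any particular study reviewed here) contrasting an unweighted and a class-weighted classifier on an imbalanced synthetic data set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced two-class problem: roughly 95% majority, 5% minority.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Balanced accuracy exposes how the unweighted model neglects the minority class.
print("plain   :", balanced_accuracy_score(y_te, plain.predict(X_te)))
print("weighted:", balanced_accuracy_score(y_te, weighted.predict(X_te)))
```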


2020 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Marcus Renatus Johannes Wolkenfelt ◽  
Frederik Bungaran Ishak Situmeang

Purpose
The purpose of this paper is to contribute to the marketing literature and practice by examining the effect of product pricing on consumer behaviours with regard to the assertiveness and the sentiments expressed in their product reviews. In addition, the paper uses new data collection and machine learning tools that can also be extended for other research of online consumer reviewing behaviours.

Design/methodology/approach
Using web crawling techniques, a large data set was extracted from the Google Play Store. Following this, the authors created machine learning algorithms to identify topics from product reviews and to quantify assertiveness and sentiments from the review texts.

Findings
The results indicate that product pricing models affect consumer review sentiment, assertiveness and topics. Removing upfront payment obligations positively impacts the overall and pricing-specific consumer sentiment and reduces assertiveness.

Research limitations/implications
The results reveal new effects of pricing models on the nature of consumer reviews of products and form a basis for future research. The study was conducted in the gaming category of the Google Play Store and the generalisability of the findings for other app segments or marketplaces should be further tested.

Originality/value
The findings can help companies that create digital products in choosing a pricing strategy for their apps. The paper is the first to investigate how pricing modes affect the nature of online reviews written by consumers.
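The review-mining step can be illustrated with a deliberately simple sketch. The authors built their own machine learning models; the toy lexicon scorer below only shows the general shape of turning review texts into per-review sentiment scores that can then be compared across pricing models. The word lists and scoring rule are hypothetical.

```python
# Toy lexicon-based sentiment scoring for app reviews (illustrative only).
POSITIVE = {"great", "love", "fun", "excellent", "smooth"}
NEGATIVE = {"crash", "crashes", "ads", "expensive", "boring", "broken"}

def sentiment_score(review: str) -> float:
    """Return a score in [-1, 1]: fraction of positive minus negative words."""
    words = [w.strip(".,!?") for w in review.lower().split()]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

reviews_free = ["Great game, so much fun", "Too many ads and it crashes"]
reviews_paid = ["Expensive but smooth and excellent"]
print(sum(map(sentiment_score, reviews_free)) / len(reviews_free))
print(sum(map(sentiment_score, reviews_paid)) / len(reviews_paid))
```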


2009 ◽  
Vol 2009 ◽  
pp. 1-16 ◽  
Author(s):  
David J. Miller ◽  
Carl A. Nelson ◽  
Molly Boeka Cannon ◽  
Kenneth P. Cannon

Fuzzy clustering algorithms are helpful when there exists a dataset with subgroupings of points having indistinct boundaries and overlap between the clusters. Traditional methods have been extensively studied and used on real-world data, but require users to have some knowledge of the outcome a priori in order to determine how many clusters to look for. Additionally, iterative algorithms choose the optimal number of clusters based on one of several performance measures. In this study, the authors compare the performance of three algorithms (fuzzy c-means, Gustafson-Kessel, and an iterative version of Gustafson-Kessel) when clustering a traditional data set as well as real-world geophysics data that were collected from an archaeological site in Wyoming. Areas of interest in the data were identified using a crisp cutoff value as well as a fuzzy α-cut to determine which provided better elimination of noise and non-relevant points. Results indicate that the α-cut method eliminates more noise than the crisp cutoff values and that the iterative version of the fuzzy clustering algorithm is able to select an optimum number of subclusters within a point set (in both the traditional and real-world data), leading to proper indication of regions of interest for further expert analysis.
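A compact NumPy sketch of the fuzzy c-means step and the α-cut used to discard weakly assigned points follows; the initialisation, iteration count, fuzzifier value, and α threshold are illustrative choices rather than the exact settings used in the study.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Basic fuzzy c-means: returns cluster centers and an (n x c) membership matrix."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)                   # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]  # fuzzy-weighted centroids
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=2)                     # standard FCM membership update
    return centers, U

# Alpha-cut: keep only points whose strongest membership exceeds alpha; the rest
# are treated as noise or non-relevant points.
centers, U = fuzzy_c_means(np.random.rand(300, 2), c=3)
alpha = 0.6
points_of_interest = U.max(axis=1) >= alpha
```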


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0247059
Author(s):  
Yoshitake Kitanishi ◽  
Masakazu Fujiwara ◽  
Bruce Binkowitz

Health insurance and acute hospital-based claims have recently become available as post-marketing real-world data in Japan, and thus classification and prediction using machine learning approaches can be applied to them. However, the methodology used for the analysis of real-world data is still under debate, and research on visualizing the patient journey remains inconclusive. So far, to classify diseases based on medical histories and patient demographic background and to predict patient prognosis for each disease, the correlation structure of real-world data has been estimated by machine learning. Therefore, we applied association analysis to real-world data to consider a combination of disease events as the patient journey for depression diagnoses. However, association analysis makes it difficult to interpret multiple outcome measures simultaneously and comprehensively. To address this issue, we applied the Topological Data Analysis (TDA) Mapper to sequentially interpret multiple indices, thus obtaining a visual classification of the diseases commonly associated with depression. Under this approach, the visual and continuous classification of related diseases may contribute to precision medicine research and can help pharmaceutical companies provide appropriate personalized medical care.
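The Mapper step can be sketched with the open-source KeplerMapper package, assuming that implementation (the authors may have used a different one). The lens, cover parameters, and clusterer below are generic placeholder choices, and the matrix X stands in for the non-public claims-derived feature table.

```python
import numpy as np
import kmapper as km
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

X = np.random.rand(500, 12)                 # placeholder for patient-level features

mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(X, projection=PCA(n_components=2))   # 2-D lens
graph = mapper.map(lens, X,
                   cover=km.Cover(n_cubes=10, perc_overlap=0.3),
                   clusterer=DBSCAN(eps=0.5, min_samples=5))
mapper.visualize(graph, path_html="mapper_output.html",
                 title="Mapper graph of patient features")
```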


2020 ◽  
Author(s):  
Andrea Cominola ◽  
Marie-Philine Becker ◽  
Riccardo Taormina

As several cities all over the world face the exacerbating challenges posed by climate change, population growth, and urbanization, it becomes clear that increased water security and more resilient urban water systems can be achieved by optimizing the use of water resources and minimizing losses and inefficient usage. In the literature, there is growing evidence about the potential of demand management programs to complement supply-side interventions and foster more efficient water use behaviors. A new boost to demand management is offered by the ongoing digitalization of the water utility sector, which facilitates accurate measuring and estimation of urban water demands down to the scale of individual end-uses of residential water consumers (e.g., showering, watering). This high-resolution data can play a pivotal role in supporting demand-side management programs, fostering more efficient and sustainable water uses, and prompting the detection of anomalous behaviors (e.g., leakages, faulty meters). The problem of deriving individual end-use consumption traces from the composite signal recorded by single-point meters installed at the inlet of each household has been studied for nearly 30 years in the electricity field (Non-Intrusive Load Monitoring). Conversely, the similar disaggregation problem in the water sector - here called Non-Intrusive Water Monitoring (NIWM) - is still a very open research challenge. Most state-of-the-art end-use disaggregation algorithms still need an intrusive calibration or time-consuming expert-based manual processing. Moreover, the limited availability of large-scale open datasets with end-use ground truth data has so far greatly limited the development and benchmarking of NIWM methods.

In this work, we comparatively test the suitability of different machine learning algorithms to perform NIWM. First, we formulate the NIWM problem both as a regression problem, where water consumption traces are processed as continuous time series, and as a classification problem, where individual water use events are associated with one or more end-use labels. Second, a number of algorithms based on the latest trends in Artificial Intelligence and Machine Learning are tested both on synthetic and real-world data, including state-of-the-art tree-based and Deep Learning methods. Synthetic water end-use time series generated with the STREaM stochastic simulation model are considered for algorithm testing, along with labelled real-world data from the Residential End Uses of Water, Version 2, database by the Water Research Foundation. Finally, the performance of the different NIWM algorithms is comparatively assessed with metrics that include (i) NIWM accuracy, (ii) computational cost, and (iii) amount of needed training data.
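As a rough illustration of the classification formulation mentioned above, the sketch below trains a gradient-boosted tree to map simple per-event features (duration, volume, mean and peak flow) to an end-use label. The features, labels, and data are hypothetical placeholders, not the STREaM or Residential End Uses of Water variables.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_events = 2000
# Hypothetical per-event features: duration [s], volume [L], mean and peak flow [L/min].
X = np.column_stack([
    rng.uniform(5, 600, n_events),
    rng.uniform(0.1, 120, n_events),
    rng.uniform(0.1, 15, n_events),
    rng.uniform(0.1, 25, n_events),
])
y = rng.integers(0, 4, n_events)   # 4 placeholder end-use labels (e.g. shower, toilet, tap, irrigation)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))   # near chance here, since these labels are random
```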


Webology ◽  
2021 ◽  
Vol 18 (05) ◽  
pp. 1212-1225
Author(s):  
Siva C ◽  
Maheshwari K.G ◽  
Nalinipriya G ◽  
Priscilla Mary J

In everyday practice, the availability of correctly labelled data and the handling of categorical data are widely acknowledged as two main challenges in dynamic analysis. Clustering techniques are therefore applied to unlabelled data to group records according to their homogeneity. Many prediction methods are popularly used to handle forecasting problems in real-time environments. The outbreak of coronavirus disease 2019 (COVID-19) created a medical emergency of worldwide concern, with a high risk of spreading and striking the entire world. Recently, machine learning prediction models have been used in many real-time applications that require identification and categorization. In the medical field, prediction models play a vital role in providing insight into the spread and consequences of infectious diseases, and machine learning based forecasting mechanisms have shown their importance in supporting decisions on upcoming courses of action. In this work, the K-means and hierarchical clustering algorithms were applied directly to the updated dataset using the R programming language to create clusters of COVID-19 patients. Confirmed COVID-19 patient counts were then passed to the Prophet package to build a forecasting model. This model predicts future COVID-19 case counts, which is essential for clinical and healthcare leaders to take appropriate measures in advance. The experimental results indicate that hierarchical clustering outperforms the K-means clustering algorithm on the structured dataset. The prediction model can also support officials in taking timely actions and making decisions to contain the COVID-19 crisis. This work concludes that the hierarchical clustering algorithm is the best model for clustering the COVID-19 data set obtained from the World Health Organization (WHO).
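The study ran its forecast in R; an equivalent Python sketch using the Prophet package looks roughly as follows, assuming a hypothetical CSV with a date column and a daily confirmed-case count.

```python
import pandas as pd
from prophet import Prophet

# Hypothetical input: one row per day, columns "date" and "confirmed".
df = pd.read_csv("covid_confirmed.csv")
df = df.rename(columns={"date": "ds", "confirmed": "y"})   # Prophet expects ds / y

model = Prophet()
model.fit(df)

future = model.make_future_dataframe(periods=30)   # extend 30 days beyond the data
forecast = model.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```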


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Tinofirei Museba ◽  
Fulufhelo Nelwamondo ◽  
Khmaies Ouahada ◽  
Ayokunle Akinola

For most real-world data streams, the concept about which data is obtained may shift from time to time, a phenomenon known as concept drift. In many real-world applications, such as nonstationary time-series data, concept drift occurs in a cyclic fashion and previously seen concepts reappear, a particular kind of drift known as recurring concepts. A cyclically drifting concept exhibits a tendency to return to previously visited states. Existing machine learning algorithms typically handle recurring concepts by retraining a learning model whenever drift is detected, which discards information even when the concept was well learned and will recur in a later learning phase. A common remedy is to retain and reuse previously learned models, but in nonstationary environments the process of selecting an optimal ensemble classifier capable of accurately adapting to recurring concepts is time-consuming and computationally prohibitive. To learn from streaming data, fast and accurate machine learning algorithms are needed for time-dependent applications. Most existing algorithms designed to handle concept drift do not take into account the presence of recurring concept drift. To accurately and efficiently handle recurring concepts with minimum computational overhead, we propose a novel and evolving ensemble method called Recurrent Adaptive Classifier Ensemble (RACE). The algorithm preserves an archive of previously learned models that are diverse and always trains both new and existing classifiers. The empirical experiments conducted on synthetic and real-world data stream benchmarks show that RACE adapts to recurring concepts significantly more accurately than some state-of-the-art ensemble classifiers based on classifier reuse.
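The archive-and-reuse idea behind approaches like RACE can be sketched generically: keep previously trained classifiers and, when drift is detected, reuse whichever archived model currently scores best on recent data instead of always retraining from scratch. The snippet below is a simplified illustration of that pattern with a placeholder base learner and reuse threshold, not the RACE algorithm itself.

```python
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB

class ArchiveEnsemble:
    """Toy recurring-concept handler: archive past models and reuse the best one."""
    def __init__(self, base=GaussianNB()):
        self.base = base
        self.archive = []                     # previously learned models
        self.current = None

    def fit_new_concept(self, X, y):
        self.current = clone(self.base).fit(X, y)
        self.archive.append(self.current)
        return self

    def on_drift(self, X_recent, y_recent):
        # Prefer an archived model that already fits the (possibly recurring) concept.
        best = max(self.archive, key=lambda m: m.score(X_recent, y_recent))
        if best.score(X_recent, y_recent) >= 0.8:      # illustrative reuse threshold
            self.current = best
        else:
            self.fit_new_concept(X_recent, y_recent)
        return self

    def predict(self, X):
        return self.current.predict(X)
```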

