k-Means Clustering with Outlier Detection, Mixed Variables and Missing Values

Author(s):  
D. Wishart
Author(s):  
Swati Aggarwal ◽  
Shambeel Azim

Reliability is a major concern in qualitative research. Most current research deals with measuring the reliability of data, but little work has been reported on how to improve the reliability of unreliable data. This paper discusses three important aspects of data pre-processing: how to detect outliers, how to deal with missing values, and finally how to increase the reliability of the dataset. The authors suggest a framework for pre-processing inter-judged data that is incomplete and also contains erroneous values. The suggested framework integrates three approaches: Krippendorff's alpha for reliability computation, a frequency-based outlier detection method, and a hybrid fuzzy c-means and multilayer perceptron based imputation technique. The proposed integrated approach increases the reliability of the dataset, which can then be used to draw strong conclusions.
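One building block of the framework above, frequency-based outlier detection, can be sketched as follows. This is a toy illustration, not the paper's implementation: the relative-frequency threshold and the idea of flagging rare rating values across judges are assumptions.

```python
from collections import Counter

def frequency_outliers(ratings, min_freq=0.15):
    """Flag rating values whose relative frequency falls below min_freq.

    Illustrative sketch of frequency-based outlier detection: a value
    entered by very few judges is treated as a likely data-entry error.
    The threshold of 0.15 is an assumption, not from the paper.
    """
    counts = Counter(ratings)
    n = len(ratings)
    return {v for v, c in counts.items() if c / n < min_freq}

# One judge entered 9 where the others agreed on 3 or 4:
ratings = [3, 3, 3, 4, 3, 3, 4, 9, 3, 4]
print(frequency_outliers(ratings))  # -> {9}
```

Flagged values could then be treated as missing and handed to the imputation step of the framework.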


2016 ◽  
Vol 45 (1) ◽  
pp. 3-23 ◽  
Author(s):  
Marc Bill ◽  
Beat Hulliger

The distribution of multivariate quantitative survey data is usually not normal; skewed and semi-continuous distributions occur often. In addition, missing values and non-response are common. Together, this mix of problems makes multivariate outlier detection difficult. Examples of surveys where these problems occur are most business surveys and some household surveys, such as the Statistics on Income and Living Conditions (SILC) survey of the European Union. Several methods for multivariate outlier detection are collected in the R package modi. This paper gives an overview of modi and its functions for outlier detection and corresponding imputation. The use of the methods is explained with a business survey dataset. The discussion covers pre- and post-processing to deal with skewness and zero inflation, the advantages and disadvantages of the methods, and the choice of parameters.
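To make the underlying idea of multivariate outlier detection concrete, here is a minimal sketch of the classical Mahalanobis-distance approach in Python. This is not the modi package's API: modi's estimators are robust and cope with missing values and skewness, whereas this sketch uses the non-robust sample mean and covariance on complete data, and an empirical quantile as a cutoff (a chi-square quantile is the usual choice).

```python
import numpy as np

def mahalanobis_outliers(X, quantile=0.975):
    """Flag rows with unusually large squared Mahalanobis distance.

    Non-robust sketch: mean and covariance are computed on all rows,
    so heavy contamination would mask outliers. The empirical-quantile
    cutoff is an assumption for illustration.
    """
    mu = X.mean(axis=0)
    diff = X - mu
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    # quadratic form diff[i] @ cov_inv @ diff[i] for every row i
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    return d2 > np.quantile(d2, quantile)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[0] = [8.0, 8.0, 8.0]  # plant one gross outlier
flags = mahalanobis_outliers(X)
```

In practice a robust estimator (as provided by modi) should replace the classical mean/covariance before distances are computed.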


2019 ◽  
Vol 29 (1) ◽  
pp. 1416-1424
Author(s):  
M. Rao Batchanaboyina ◽  
Nagaraju Devarakonda

Social media contain abundant information about events and news occurring all over the world. Social media growth has had a great impact on various domains such as marketing, e-commerce, health care, e-governance, and politics. Twitter has developed into one of the most popular social media platforms, with around 1 billion user profiles and millions of active users who post tweets daily. In this research, buzz detection in social media was carried out by a semantic approach using the condensed nearest neighbor (SACNN). The Twitter and Tom's Hardware data are stored in the UC Irvine Machine Learning Repository, and this dataset is used here for outlier detection. The min–max normalization technique is applied to the social media dataset, and missing values are replaced by the normalized value. The condensed nearest neighbor (CNN) is used for semantic analysis of the database, and the threshold is calculated from the optimized value provided by the proposed method. The threshold value is used to classify buzz and non-buzz discussions in the social media database. The results show that SACNN achieved 99% accuracy and a lower relative error than existing methods.
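The pre-processing step described above, min–max normalization with missing values replaced by a normalized value, can be sketched as follows. Filling each gap with the column's normalized mean is an assumption about what "the normalized value" means; the abstract does not specify it.

```python
import numpy as np

def minmax_normalize_fill(X):
    """Min-max normalize each column to [0, 1], ignoring NaNs, then fill
    missing entries with the column's normalized mean.

    Sketch of the pre-processing described in the abstract; the
    mean-fill choice is an assumption, not the paper's stated rule.
    """
    X = np.asarray(X, dtype=float)
    lo = np.nanmin(X, axis=0)
    hi = np.nanmax(X, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
    Z = (X - lo) / span
    fill = np.nanmean(Z, axis=0)
    return np.where(np.isnan(Z), fill, Z)

out = minmax_normalize_fill([[1.0, np.nan], [3.0, 0.0], [5.0, 10.0]])
```

The normalized, gap-free matrix would then feed the CNN-based semantic analysis step.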


Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 21-32
Author(s):  
Dirk Temme ◽  
Sarah Jensen

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches to handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation or full-information maximum likelihood estimation. Thanks to the available software, using these modern missing-data methods does not pose a major obstacle. Still, their application requires a sound understanding of their prerequisites and limitations, as well as a deeper understanding of the processes that led to the missing values in an empirical study. This article is Part 1; it first introduces Rubin's classical definition of missing-data mechanisms and an alternative, variable-based taxonomy that provides a graphical representation. Second, a selection of visualization tools available in different R packages for describing and exploring missing-data structures is presented.
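The loss of statistical power under listwise deletion mentioned above is easy to quantify: a row survives only if every one of its variables is observed, so the complete-case fraction shrinks multiplicatively with the number of variables. A small simulation (parameters chosen for illustration, not from the article):

```python
import numpy as np

# Toy illustration: with p = 5 variables, each missing independently
# with probability 0.10, only about (1 - 0.10)**5 ~ 59% of rows survive
# listwise deletion, even though each variable is 90% observed.
rng = np.random.default_rng(42)
n, p, miss = 1000, 5, 0.10
X = rng.normal(size=(n, p))
X[rng.random((n, p)) < miss] = np.nan  # each cell missing independently
complete_rows = ~np.isnan(X).any(axis=1)
print(f"rows kept after listwise deletion: {complete_rows.sum()} of {n}")
```

Multiple imputation or full-information maximum likelihood avoids this attrition by using all partially observed rows.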


2012 ◽  
Vol 2 (3) ◽  
pp. 98-101 ◽  
Author(s):  
E. Sateesh ◽  
M.L. Prasanthi
