Study on Retail Marketing Sale Data Using R Software: Data Cleaning and Clustering Algorithms

2017 ◽  
Vol 7 (1) ◽  
pp. 135-141
Author(s):  
MARUDACHALAM N ◽  
RAMESH L

Data cleaning solutions are now essential for users and industries that handle large volumes of data. The data for this study were collected from retail marketing sales records in terms of the mentioned attributes. Data cleaning deals with detecting outliers and removing errors and inconsistencies from data in order to improve its quality; a number of frameworks exist to handle noisy and inconsistent data, whereas traditional data integration approaches deal with single data sources at the instance level. Hierarchical and DBSCAN clusters were grouped by related similarities, and the time taken to build the model in different cluster modes was measured experimentally using the WEKA tool. The study also examines different retail marketing inputs through time-based analysis. Clustering is one of the basic techniques used in analyzing data sets; the advantages and disadvantages of hierarchical and DBSCAN clustering are also discussed.
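
For illustration, here is a minimal sketch of the two clustering steps described above, using scikit-learn in Python on a synthetic retail-style table. The column names, parameter values, and synthetic data are assumptions for demonstration, not the attributes or settings used in the study (which was carried out with R and WEKA).

```python
# Hedged sketch: hierarchical (agglomerative) and DBSCAN clustering on
# retail-style sales data. Column names, parameters, and the synthetic data
# are illustrative assumptions, not the paper's actual setup.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "units_sold": rng.poisson(30, 200),
    "unit_price": rng.normal(50, 15, 200).clip(1),
    "discount":   rng.uniform(0, 0.3, 200),
})

X = StandardScaler().fit_transform(sales)            # standardize features before clustering

hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
db_labels   = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)   # -1 marks noise/outliers

print("hierarchical cluster sizes:", np.bincount(hier_labels))
print("DBSCAN noise points:", int((db_labels == -1).sum()))
```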

2016 ◽  
Vol 25 (3) ◽  
pp. 431-440 ◽  
Author(s):  
Archana Purwar ◽  
Sandeep Kumar Singh

The quality of data is an important concern in data mining: the validity of mining algorithms is reduced if the data are not of good quality. Data quality can be assessed in terms of missing values (MV) as well as noise present in the data set. Various imputation techniques have been studied for missing values, but little attention has been given to noise in earlier work. Moreover, to the best of our knowledge, no one has used density-based spatial clustering of applications with noise (DBSCAN) for MV imputation. This paper proposes a novel technique, density-based imputation (DBSCANI), built on density-based clustering to deal with incomplete values in the presence of noise. The density-based clustering algorithm proposed by Kriegel et al. groups objects according to their density in spatial databases: high-density regions form clusters, and low-density regions contain the noise objects in the data set. Extensive experiments have been performed on the Iris data set from the life-science domain and on Jain's (2D) data set from the shape data sets. The performance of the proposed method is evaluated using root mean square error (RMSE) and compared with the existing K-means imputation (KMI). Results show that the proposed method is more noise resistant than KMI on the data sets under study.
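
A compact sketch of one plausible reading of density-based imputation follows: cluster the fully observed rows with DBSCAN, attach each incomplete row to a nearby cluster, fill each missing value with the cluster's feature mean, and score the result with RMSE. The parameters, the neighbor-assignment step, and the noise fallback are assumptions, not the exact DBSCANI procedure.

```python
# Hedged sketch of density-based missing-value imputation on Iris.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X_true = load_iris().data
rng = np.random.default_rng(1)
mask = rng.random(X_true.shape) < 0.05        # inject ~5% missing values
X_missing = X_true.copy()
X_missing[mask] = np.nan

# 1. cluster the fully observed rows with DBSCAN (noise rows keep label -1)
complete = ~np.isnan(X_missing).any(axis=1)
labels = np.full(len(X_true), -1)
labels[complete] = DBSCAN(eps=0.6, min_samples=5).fit_predict(X_missing[complete])

# 2. give each incomplete row the label of its nearest complete row,
#    measured only on that row's observed features (an assumed heuristic)
for i in np.where(~complete)[0]:
    obs = ~np.isnan(X_missing[i])
    d = np.linalg.norm(X_missing[complete][:, obs] - X_missing[i, obs], axis=1)
    labels[i] = labels[complete][np.argmin(d)]

# 3. impute each missing value with its cluster's feature mean
#    (global feature mean as a fallback for noise rows)
X_imputed = X_missing.copy()
for j in range(X_true.shape[1]):
    col = X_missing[:, j]
    for i in np.where(np.isnan(col))[0]:
        in_cluster = (labels == labels[i]) & (labels != -1) & ~np.isnan(col)
        X_imputed[i, j] = col[in_cluster].mean() if in_cluster.any() else np.nanmean(col)

rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"RMSE of the imputed values: {rmse:.3f}")
```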


2011 ◽  
pp. 24-32 ◽  
Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables, yet data sets with mixed types of variables are common in data mining. In this paper we introduce a weighted self-organizing map for clustering, analyzing, and visualizing mixed (continuous/binary) data. The weights and prototypes are learned simultaneously, ensuring an optimized clustering of the data: the higher a variable's weight, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is thus combined with a weighting process over the variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public repository: a handwritten-digit data set, the Zoo data set, and three other mixed data sets. The results show a good quality of the topological ordering and homogeneous clustering.
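
A highly simplified sketch of the idea of coupling SOM training with per-variable weights is given below. The weight-update rule used here (inverse per-variable quantization error, normalized) is an illustrative heuristic, not the authors' simultaneous learning rule, and the toy mixed data are invented.

```python
# Simplified weighted SOM: standard SOM training where each variable carries a
# weight used in the distance, and the weights are re-estimated between epochs.
import numpy as np

def weighted_som(X, grid=(5, 5), epochs=20, lr=0.5, sigma=1.5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    rows, cols = grid
    k = rows * cols
    prototypes = X[rng.choice(n, k, replace=False)].astype(float)
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    w = np.ones(d) / d                                   # per-variable weights

    for _ in range(epochs):
        for x in X[rng.permutation(n)]:
            # best matching unit under the weighted distance
            dist = ((prototypes - x) ** 2 * w).sum(axis=1)
            bmu = dist.argmin()
            # Gaussian neighborhood on the 2-D grid
            h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
            prototypes += lr * h[:, None] * (x - prototypes)
        # heuristic weight update: variables with low quantization error gain weight
        bmus = (((X[:, None, :] - prototypes) ** 2) * w).sum(axis=2).argmin(axis=1)
        err = np.array([((X[:, j] - prototypes[bmus, j]) ** 2).mean() for j in range(d)])
        w = 1.0 / (err + 1e-9)
        w /= w.sum()
        lr *= 0.95
        sigma = max(0.5, sigma * 0.95)
    return prototypes, w, bmus

# toy mixed (continuous + binary) data, purely illustrative
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(200, 3)), rng.integers(0, 2, size=(200, 2))]).astype(float)
protos, weights, assignments = weighted_som(X)
print("learned variable weights:", np.round(weights, 3))
```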


Author(s):  
Jessica Oliveira De Souza ◽  
Jose Eduardo Santarem Segundo

Since the Semantic Web was created to improve the current web user experience, Linked Data is the primary means by which Semantic Web applications are theoretically fully realized, respecting appropriate criteria and requirements. Therefore, the quality of the data and information stored in linked data sets is essential to meeting the basic Semantic Web objectives. Hence, this article aims to describe and present specific quality dimensions and their related quality issues.


2019 ◽  
Vol 16 (2) ◽  
pp. 469-489 ◽  
Author(s):  
Piotr Lasek ◽  
Jarek Gryz

In this paper we present our ic-NBC and ic-DBSCAN algorithms for data clustering with constraints. The algorithms are based on the density-based clustering algorithms NBC and DBSCAN but allow users to incorporate background knowledge into the clustering process by means of instance constraints. Knowledge about anticipated groups can be expressed by specifying so-called must-link and cannot-link relationships between objects or points, and these relationships are then incorporated into the clustering process. In the proposed algorithms this is achieved by properly merging resulting clusters and by introducing the new notion of deferred points, which are temporarily excluded from clustering and assigned to clusters based on their involvement in cannot-link relationships. To examine the algorithms, we carried out a number of experiments on benchmark data sets, testing the efficiency and the quality of the results, and we also measured the efficiency of the algorithms against their original versions. The experiments show that introducing instance constraints improves the quality of the results of both algorithms, while efficiency is only slightly reduced, owing to the extra computation related to the constraints.
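
The following rough sketch illustrates how instance constraints can be layered on top of density-based clustering: clusters connected by a must-link pair are merged, and a point caught in a violated cannot-link pair is deferred. It is a simplified approximation of the idea, not the ic-DBSCAN algorithm itself, and the constraint pairs are arbitrary.

```python
# Rough illustration of instance-level constraints on top of DBSCAN. This is
# a simplified approximation of the idea, not the paper's ic-DBSCAN algorithm.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

must_link   = [(0, 150), (10, 160)]     # illustrative index pairs
cannot_link = [(5, 6)]

# merge clusters joined by a must-link pair (simple relabeling)
for a, b in must_link:
    la, lb = labels[a], labels[b]
    if la != -1 and lb != -1 and la != lb:
        labels[labels == lb] = la

# defer points that end up violating a cannot-link pair
DEFERRED = -2
for a, b in cannot_link:
    if labels[a] != -1 and labels[a] == labels[b]:
        labels[b] = DEFERRED             # simplistic choice: defer one of the two points

print("clusters:", sorted(set(labels) - {-1, DEFERRED}),
      "noise:", int((labels == -1).sum()),
      "deferred:", int((labels == DEFERRED).sum()))
```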


2021 ◽  
Vol 13 (3) ◽  
pp. 1-15
Author(s):  
Rada Chirkova ◽  
Jon Doyle ◽  
Juan Reutter

Assessing and improving the quality of data are fundamental challenges in Big-Data applications. These challenges have given rise to numerous solutions targeting transformation, integration, and cleaning of data. However, while schema design, data cleaning, and data migration are nowadays reasonably well understood in isolation, not much attention has been given to the interplay between standalone tools in these areas. In this article, we focus on the problem of determining whether the available data-transforming procedures can be used together to bring about the desired quality characteristics of the data in business or analytics processes. For example, to help an organization avoid building a data-quality solution from scratch when facing a new analytics task, we ask whether the data quality can be improved by reusing the tools that are already available, and if so, which tools to apply, and in which order, all without presuming knowledge of the internals of the tools, which may be external or proprietary. Toward addressing this problem, we conduct a formal study in which individual data cleaning, data migration, or other data-transforming tools are abstracted as black-box procedures with only some of the properties exposed, such as their applicability requirements, the parts of the data that the procedure modifies, and the conditions that the data satisfy once the procedure has been applied. As a proof of concept, we provide foundational results on sequential applications of procedures abstracted in this way, to achieve prespecified data-quality objectives, for the use case of relational data and for procedures described by standard relational constraints. We show that, while reasoning in this framework may be computationally infeasible in general, there exist well-behaved cases in which these foundational results can be applied in practice for achieving desired data-quality results on Big Data.
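
A toy sketch of the black-box framing follows: each tool exposes only an applicability condition and the quality properties it establishes, and a breadth-first search looks for an order of application that reaches a desired goal. The tools and property names are invented, and this propositional abstraction merely stands in for the paper's relational-constraint machinery.

```python
# Toy sketch of reasoning over black-box data-transforming procedures: each
# tool is abstracted by what it requires and what it establishes, and a BFS
# searches for a sequence that reaches the data-quality goal. Names are invented.
from collections import deque
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    requires: frozenset      # properties the data must already satisfy
    establishes: frozenset   # properties guaranteed after the tool runs

TOOLS = [
    Tool("dedup",       frozenset(),                  frozenset({"no_duplicates"})),
    Tool("schema_map",  frozenset(),                  frozenset({"target_schema"})),
    Tool("null_repair", frozenset({"target_schema"}), frozenset({"no_nulls"})),
]

def plan(initial: frozenset, goal: frozenset):
    """Return a sequence of tool names reaching `goal`, or None if none exists."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, seq = queue.popleft()
        if goal <= state:
            return seq
        for t in TOOLS:
            if t.requires <= state:
                nxt = state | t.establishes
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, seq + [t.name]))
    return None

print(plan(frozenset(), frozenset({"no_duplicates", "no_nulls"})))
# -> ['dedup', 'schema_map', 'null_repair']
```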


2021 ◽  
Author(s):  
Rishabh Deo Pandey ◽  
Itu Snigdh

Data quality became significant with the emergence of data warehouse systems. While accuracy is an intrinsic aspect of data quality, the validity of data offers a wider perspective, one that is more representational and contextual in nature. Through this article we present a different perspective on data collection and collation. We focus on faults experienced in data sets and present validity as a function of allied parameters such as completeness, usability, availability, and timeliness for determining data quality. We also analyze the applicability of these metrics and modify them to suit IoT applications. Another major focus of this article is to verify these metrics on aggregated data sets instead of individual data values. This work uses the different validation parameters to determine the quality of data generated in a pervasive environment. The analysis approach presented is simple and can be employed to test the validity of collected data, to isolate faults in the data set, and to measure the suitability of data before analysis algorithms are applied.
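
As a rough illustration, the sketch below computes completeness, timeliness, and availability as simple ratios over a small aggregated sensor table and combines them into a validity score. The formulas, thresholds, and equal weighting are assumptions for demonstration, not the article's exact definitions.

```python
# Illustrative validity-style metrics on an aggregated IoT sensor data set.
# The formulas and thresholds are assumptions, not the article's definitions.
import numpy as np
import pandas as pd

readings = pd.DataFrame({
    "sensor_id": ["s1", "s1", "s2", "s2", "s3"],
    "value":     [21.5, np.nan, 19.8, 20.1, np.nan],
    "age_s":     [5, 30, 400, 12, 8],       # seconds since the reading was taken
})

MAX_AGE_S = 60                               # assumed freshness threshold
EXPECTED_SENSORS = {"s1", "s2", "s3", "s4"}  # assumed deployment

completeness = readings["value"].notna().mean()
timeliness   = (readings["age_s"] <= MAX_AGE_S).mean()
availability = len(set(readings["sensor_id"]) & EXPECTED_SENSORS) / len(EXPECTED_SENSORS)

# simple aggregate validity score over the allied parameters (equal weights assumed)
validity = np.mean([completeness, timeliness, availability])
print(f"completeness={completeness:.2f} timeliness={timeliness:.2f} "
      f"availability={availability:.2f} validity={validity:.2f}")
```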


Author(s):  
A. Anny Leema ◽  
M. Hemalatha

Radio Frequency Identification (RFID) refers to a wireless technology that uses radio waves to automatically identify items within a certain proximity. It is widely used in various applications, but there is reluctance in deploying RFID because of the high cost involved and the challenges posed by the colossal volume of observed RFID data. The obtained data are of low quality and contain anomalies such as false positives, false negatives, and duplication. Cleaning is therefore an essential task for enhancing data quality, so that the resultant data can be used in high-end applications. This chapter investigates the existing physical, middleware, and deferred approaches to dealing with the anomalies found in RFID data. A novel hybrid approach is developed to solve these data quality issues so that the demand for RFID data can grow to meet user needs.
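
For context, the sketch below shows a simplified middleware-style cleaning pass over a raw RFID read stream: duplicate reads are collapsed, and likely false negatives are recovered from neighboring epochs. It illustrates the kinds of anomalies discussed but is not the chapter's hybrid approach, and the read stream is invented.

```python
# Simplified middleware-style cleaning of a raw RFID read stream: duplicates
# are collapsed per epoch, and single-epoch gaps are filled on the assumption
# that a tag seen before and after a missed epoch was present (false negative).
# Illustrative only; not the chapter's hybrid approach.
from collections import defaultdict

# (epoch, tag_id) raw reads; duplicates and a missed read of TAG1 at epoch 2
raw_reads = [(1, "TAG1"), (1, "TAG1"), (1, "TAG2"),
             (2, "TAG2"),
             (3, "TAG1"), (3, "TAG2"), (3, "TAG2")]

# 1. remove duplicate reads within each epoch
by_epoch = defaultdict(set)
for epoch, tag in raw_reads:
    by_epoch[epoch].add(tag)

# 2. fill single-epoch gaps (possible false negatives)
epochs = sorted(by_epoch)
cleaned = {e: set(tags) for e, tags in by_epoch.items()}
for prev_e, e, next_e in zip(epochs, epochs[1:], epochs[2:]):
    recovered = (by_epoch[prev_e] & by_epoch[next_e]) - by_epoch[e]
    cleaned[e] |= recovered              # assume the tag was present but missed

for e in epochs:
    print(e, sorted(cleaned[e]))
# epoch 2 now also lists TAG1, recovered from the surrounding epochs
```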


Author(s):  
Anh Duy Tran ◽  
Somjit Arch-int ◽  
Ngamnij Arch-int

Conditional functional dependencies (CFDs) have been used to improve the quality of data, including detecting and repairing data inconsistencies. Approximation measures are of significant importance for data dependencies in data mining. To adapt to exceptions in real data, such measures are used to relax the strictness of CFDs into more generalized dependencies, called approximate conditional functional dependencies (ACFDs). This paper analyzes the weaknesses of the dependency degree, confidence, and conviction measures for general CFDs (constant and variable CFDs). A new measure for general CFDs based on incomplete knowledge granularity is proposed to quantify the approximation of these dependencies as well as the distribution of data tuples across the conditional equivalence classes. Finally, the effectiveness of stripped conditional partitions and of this new measure is evaluated on synthetic and real data sets. These results are important to the study of the theory of approximate dependencies and to the improvement of discovery algorithms for CFDs and ACFDs.
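
As a small worked example of one of the classical measures the paper analyzes, the sketch below computes the confidence of a CFD on a toy relation: among tuples matching the condition pattern, it keeps, per left-hand-side group, the most frequent right-hand-side value. The relation and the CFD ([country = 'UK', zip] -> city) are invented for illustration.

```python
# Toy illustration of the classical confidence measure for a CFD.
from collections import Counter, defaultdict

rows = [
    {"country": "UK", "zip": "EH4",   "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4",   "city": "Edinburgh"},
    {"country": "UK", "zip": "EH4",   "city": "Glasgow"},    # violating tuple
    {"country": "UK", "zip": "W1",    "city": "London"},
    {"country": "US", "zip": "10001", "city": "New York"},   # outside the condition
]

def cfd_confidence(rows, condition, lhs, rhs):
    """Max fraction of condition-matching tuples that can be kept so lhs -> rhs holds."""
    matching = [r for r in rows if all(r[k] == v for k, v in condition.items())]
    if not matching:
        return 1.0
    groups = defaultdict(list)
    for r in matching:
        groups[tuple(r[a] for a in lhs)].append(r[rhs])
    kept = sum(Counter(vals).most_common(1)[0][1] for vals in groups.values())
    return kept / len(matching)

print(cfd_confidence(rows, {"country": "UK"}, ["zip"], "city"))  # 3/4 = 0.75
```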


Author(s):  
Prajapati Chirag Rameshbhai ◽  
Solanki Pinal Prathamsinh

2002 ◽  
Vol 35 (1) ◽  
pp. 5-34 ◽  
Author(s):  
Gerardo L. Munck ◽  
Jay Verkuilen

A comprehensive and integrated framework for the analysis of data is offered and used to assess data sets on democracy. The framework first distinguishes among three challenges that are sequentially addressed: conceptualization, measurement, and aggregation. In turn, it specifies distinct tasks associated with these challenges and the standards of assessment that pertain to each task. This framework is applied to the data sets on democracy most frequently used in current statistical research, generating a systematic evaluation of these data sets. The authors’ conclusion is that constructors of democracy indices tend to be quite self-conscious about methodological issues but that even the best indices suffer from important weaknesses. More constructively, the article’s assessment of existing data sets on democracy identifies distinct areas in which attempts to improve the quality of data on democracy might fruitfully be focused.

