Alliance Rules-Based Algorithm on Detecting Duplicate Entry Email

Author(s): Arif Hanafi, Sulaiman Harun, Sofika Enggari, Larissa Navia Rani

Email is of unquestionable importance in modern business communication. Every day, large volumes of email are sent from organizations to clients and suppliers, from employees to their managers, and from one colleague to another. As a result, vast amounts of email data accumulate in the data warehouse. Data cleaning is an activity performed on the data sets of a data warehouse to improve and maintain the quality and consistency of the data. This paper highlights the issues associated with dirty data and the detection of duplicates in the email column. It examines the strategy of data cleaning from a different point of view and provides an algorithm for discovering errors and duplicate entries in the data sets of an existing data warehouse. The paper characterizes alliance rules, based on the concept of mathematical association rules, to determine duplicate entries in the email column of the data sets.
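The abstract stops at the rule formulation, so the following is only an assumed illustration (not the authors' alliance rules) of the kind of canonicalize-then-group check such rules drive when hunting duplicate entries in an email column. The normalization choices here (lower-casing, trimming, dropping "+" sub-address tags, ignoring dots for Gmail) are assumptions introduced for the example.

```python
from collections import defaultdict

def canonical(email: str) -> str:
    """Normalize an email address before duplicate matching.
    The specific rules (lower-casing, trimming, stripping '+tag',
    ignoring dots in Gmail local parts) are illustrative assumptions."""
    local, _, domain = email.strip().lower().partition("@")
    local = local.split("+", 1)[0]           # drop sub-address tags
    if domain in {"gmail.com", "googlemail.com"}:
        local = local.replace(".", "")       # Gmail ignores dots in local parts
    return f"{local}@{domain}"

def find_duplicate_emails(records):
    """Group records whose email column maps to the same canonical form."""
    groups = defaultdict(list)
    for rec in records:
        groups[canonical(rec["email"])].append(rec)
    return {k: v for k, v in groups.items() if len(v) > 1}

rows = [
    {"id": 1, "email": "John.Doe@gmail.com"},
    {"id": 2, "email": "johndoe+invoices@gmail.com"},
    {"id": 3, "email": "jane@example.org"},
]
print(find_duplicate_emails(rows))   # records 1 and 2 collide on the same key
```

In a real warehouse the canonical key would feed whatever rule engine flags the colliding rows for review rather than printing them directly.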

2012, Vol. 490-495, pp. 1878-1882
Author(s): Yu Xiang Song

The alliance rules stated above, based on the principle of data-mining association rules, provide a solution for detecting errors in the data sets. Errors are detected automatically, and manual intervention in the proposed algorithm is negligible, resulting in a high degree of automation and accuracy. Duplicates in the names field of the data warehouse are cleansed remarkably well. Domain independence is achieved using the concept of an integer domain, which also adds to the memory-saving capability of the algorithm.
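The abstract does not spell out how the integer domain is built; one plausible and commonly used reading, shown below purely as a sketch, is to dictionary-encode each column's values as small integers so that the same rule-checking code works for any value domain and the encoded column occupies less memory. The helper and the lower-casing step are assumptions, not the authors' construction.

```python
def encode_column(values):
    """Dictionary-encode arbitrary column values as small integers.
    Rule checking can then operate on integers regardless of the original
    domain (strings, dates, ...) -- one plausible reading of the abstract's
    'integer domain' idea, not the authors' exact construction."""
    mapping = {}
    encoded = []
    for v in values:
        code = mapping.setdefault(v, len(mapping))
        encoded.append(code)
    return encoded, mapping

names = ["Alice", "Bob", "alice", "Alice"]
codes, table = encode_column(n.lower() for n in names)   # case-folding is an assumption
print(codes)   # [0, 1, 0, 0] -> code 0 repeats, flagging the duplicated name
```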


2003, Vol. 15 (6), pp. 1448-1459
Author(s): S.Y. Sung, Zhao Li, C.L. Tan, P.A. Ng

Author(s): Kumar Rahul, Rohitash Kumar Banyal

Every business enterprise requires noise-free, clean data. Dirty data tends to increase as the data warehouse continuously loads and refreshes large quantities of data from various sources. Hence, to avoid wrong conclusions, data cleaning becomes a vital process in data-related projects. This paper introduces a novel data cleaning technique for the effective removal of dirty data. The process involves two steps: (i) dirty data detection and (ii) dirty data cleaning. Dirty data detection comprises data normalization, hashing, clustering, and identification of suspected data. In the clustering step, the optimal selection of centroids is the key concern and is carried out using an optimization approach. Once dirty data detection is complete, the dirty data cleaning step begins. Cleaning likewise comprises several processes, namely a leveling process, Huffman coding, and cleaning of the suspected data; the cleaning of suspected data is also driven by optimization. To solve these optimization problems, a new hybrid algorithm, the so-called Firefly Update Enabled Rider Optimization Algorithm (FU-ROA), is proposed; it hybridizes the Rider Optimization Algorithm (ROA) and the Firefly (FF) algorithm. Finally, the performance of the implemented data cleaning method is compared with that of traditional methods such as Particle Swarm Optimization (PSO), FF, Grey Wolf Optimizer (GWO), and ROA in terms of positive and negative measures. The results show that, at iteration 12, the proposed FU-ROA model was 0.013%, 0.7%, 0.64%, and 0.29% better on test case 1 than the existing PSO, FF, GWO, and ROA models, respectively.
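FU-ROA itself is not reproduced here; the sketch below only illustrates the detection stage the abstract outlines (normalize, hash, cluster, flag suspects), with plain k-means standing in for the optimization-driven centroid selection and ad-hoc thresholds chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

def normalize(rows):
    """Min-max scale numeric columns to [0, 1]."""
    rows = np.asarray(rows, dtype=float)
    span = rows.max(axis=0) - rows.min(axis=0)
    span[span == 0] = 1.0
    return (rows - rows.min(axis=0)) / span

def hash_records(rows, n_buckets=8):
    """Coarse hashing step: bucket each normalized record so near-identical
    records collide (an illustrative stand-in for the paper's hashing)."""
    return [hash(tuple(np.round(r, 1))) % n_buckets for r in rows]

def flag_suspects(rows, n_clusters=3, z=2.0):
    """Cluster normalized records and flag points far from their centroid as
    suspected dirty data. Plain k-means replaces the paper's optimization-driven
    centroid selection; the z-score threshold is an arbitrary example choice."""
    X = normalize(rows)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    return np.where(dist > dist.mean() + z * dist.std())[0]

data = [[1, 10], [1.1, 10.2], [0.9, 9.8], [1.05, 10.1],
        [5, 50], [5.2, 49], [4.9, 50.5], [5.1, 49.5],
        [1.0, 25]]                       # last record has an inconsistent second field
print(hash_records(normalize(data)))     # bucket ids; near-identical rows share a bucket
print(flag_suspects(data, n_clusters=2)) # with this toy data the last record (index 8) is flagged
```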


1976, Vol. 15 (01), pp. 36-42
Author(s): J. Schlörer

From a statistical data bank containing only anonymous records, individual records may sometimes be identified and then retrieved, as personal records, by on-line dialogue. The risk mainly applies to statistical data sets representing populations, or samples with a high ratio n/N. On the other hand, access controls are unsatisfactory as a general means of protection for statistical data banks, which should be open to large user communities. A threat-monitoring scheme is proposed that largely blocks the techniques for retrieving complete records. If combined with additional measures (e.g., slight modifications of output), it may be expected to render intrusion attempts by dialogue valueless from a cost-benefit point of view, if not absolutely impossible. The bona fide user pays with some loss of information, but considerable flexibility in evaluation is retained. The proposal of controlled classification included in the scheme may also be useful for off-line dialogue systems.
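The abstract describes the monitoring scheme only at a high level, so the sketch below does not reproduce it; it merely illustrates the general class of controls at stake with a simple query-set-size restriction, a classic (and much cruder) defence against isolating individual records through statistical queries. The threshold and record fields are invented for the example.

```python
MIN_QUERY_SET = 3   # illustrative threshold; the real choice is a policy decision

def answer_count_query(records, predicate):
    """Answer a COUNT query only if enough records match, otherwise refuse.
    A crude query-set-size restriction, shown purely to illustrate the class of
    controls the paper discusses; it is not the proposed monitoring scheme."""
    matching = [r for r in records if predicate(r)]
    if len(matching) < MIN_QUERY_SET:
        return None          # refuse: the answer could help isolate individuals
    return len(matching)

people = [{"age": 34, "dept": "A"}, {"age": 51, "dept": "B"},
          {"age": 29, "dept": "A"}, {"age": 46, "dept": "A"},
          {"age": 62, "dept": "B"}, {"age": 41, "dept": "A"}]
print(answer_count_query(people, lambda r: r["dept"] == "A"))   # 4 -> answered
print(answer_count_query(people, lambda r: r["age"] == 51))     # too specific -> None
```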


2013, Vol. 756-759, pp. 3652-3658
Author(s): You Li Lu, Jun Luo

Building on the study of kernel methods, this paper puts forward two improved algorithms, called R-SVM and I-SVDD, to cope with imbalanced data sets in closed systems. R-SVM uses the K-means algorithm to cluster samples in the feature space, while I-SVDD improves the performance of the original SVDD through imbalanced sample training. Experiments on two system call data sets show that both algorithms are more effective, and that R-SVM has lower complexity.
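No implementation details are given in the abstract; the following is a generic sketch of coupling K-means with an SVM for imbalanced data (reduce the majority class to cluster centroids, then train on the rebalanced set), which is one common reading of such a design rather than the authors' exact R-SVM. The synthetic data and parameters are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def cluster_reduced_svm(X_major, X_minor, n_clusters=20, **svm_kwargs):
    """Reduce the majority class to K-means centroids, then train an SVM on the
    roughly balanced set. A generic cluster-undersampling sketch, not the paper's
    exact R-SVM formulation."""
    k = min(n_clusters, len(X_minor))            # aim for rough class balance
    centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_major).cluster_centers_
    X = np.vstack([centroids, X_minor])
    y = np.array([0] * len(centroids) + [1] * len(X_minor))
    return SVC(kernel="rbf", **svm_kwargs).fit(X, y)

rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(1000, 2))   # abundant "normal" class
X_minor = rng.normal(3.0, 0.5, size=(30, 2))     # rare class (e.g. anomalous calls)
model = cluster_reduced_svm(X_major, X_minor)
print(model.predict([[0, 0], [3, 3]]))           # expected: [0 1]
```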


2021, Vol. 22 (1)
Author(s): Eleanor F. Miller, Andrea Manica

Background: Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole-genome sequencing is an exciting prospect for the future, for most non-model organisms classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial; there are many complexities associated with data extraction, data quality and data handling. Results: Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions: There is now more genetic information available than ever before, and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.
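mtDNAcombine is an R package and its own functions are not shown here; purely to illustrate the kind of GenBank retrieval step such a pipeline automates, the sketch below fetches a couple of nucleotide records with Biopython. The accession numbers and contact email are placeholders, not values from the paper.

```python
from Bio import Entrez, SeqIO   # Biopython

Entrez.email = "you@example.org"            # NCBI requires a contact address
accessions = ["MN122854", "MN122855"]       # placeholder GenBank accession IDs

# Fetch the records in GenBank format and parse them into SeqRecord objects.
handle = Entrez.efetch(db="nucleotide", id=",".join(accessions),
                       rettype="gb", retmode="text")
records = list(SeqIO.parse(handle, "genbank"))
handle.close()

for rec in records:
    print(rec.id, rec.description, len(rec.seq))
```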


2019, Vol. 4 (1), pp. 697-711
Author(s): Erika Quendler

Tourism is vitally important to the Austrian economy. The number of tourist destinations, both farms and other forms of accommodation, in the different regions of Austria is constantly and considerably changing. This paper discusses the position of the ‘farm holiday’ compared to other forms of tourism. Understanding the resilience of farm holidays is especially important, but empirical research on this matter remains limited. The term ‘farm holiday’ covers staying overnight on a farm that is actively engaged in agriculture and has a maximum of 10 guest beds. The results reported in this paper are based on an analysis of secondary data from 2000 and 2018 using two types of indicator: (i) accommodation capacity (supply side) and (ii) attractiveness of a destination (demand side). The data sets cover Austria and its NUTS3 regions. The results show the evolution of farm holidays vis-à-vis other forms of tourist accommodation. In the form of a quadrant matrix, they also show the relative regional position of farm holidays. While putting the resilience of farm holidays into question, the data also reveal where farm holidays could act to expand this niche, or learn and improve to shift their position relative to the market ‘leaders’. However, there is clearly a need to learn more about farm holidays within the local context. This paper contributes to our knowledge of farm holidays from a regional point of view and elaborates on the need for further research.


1987, Vol. 65 (11), pp. 2822-2824
Author(s): W. A. Montevecchi, J. F. Piatt

We present evidence to indicate that dehydration of prey transported by seabirds from capture sites at sea to chicks at colonies inflates estimates of wet weight energy densities. These findings and a comparison of wet and dry weight energy densities reported in the literature emphasize the importance of (i) accurate measurement of the fresh weight and water content of prey, (ii) use of dry weight energy densities in comparisons among species, seasons, and regions, and (iii) cautious interpretation and extrapolation of existing data sets.


2012, Vol. 132 (2), pp. 485-487
Author(s): Matthew H. Law, Grant W. Montgomery, Kevin M. Brown, Nicholas G. Martin, Graham J. Mann, ...
