Reduced Data Sets and Entropy-Based Discretization

Entropy ◽  
2019 ◽  
Vol 21 (11) ◽  
pp. 1051
Author(s):  
Jerzy W. Grzymala-Busse ◽  
Zdzislaw S. Hippe ◽  
Teresa Mroczek

Results of experiments on numerical data sets discretized using two methods, global versions of Equal Frequency per Interval and Equal Interval Width, are presented. Globalization of both methods is based on entropy. For the discretized data sets, left and right reducts were computed. For each discretized data set and the two data sets based, respectively, on left and right reducts, we applied ten-fold cross-validation using the C4.5 decision tree generation system. Our main objective was to compare the quality of all three types of data sets in terms of error rate. Additionally, we compared the complexity of the generated decision trees. We show that reduction of data sets may only increase the error rate and that the decision trees generated from reduced data sets are not simpler than the decision trees generated from non-reduced data sets.
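The paper's entropy-based globalization of the two discretization methods is not reproduced here; as a rough sketch of the building blocks only, the snippet below bins a single numeric attribute by equal width and by equal frequency and scores each partition by its conditional class entropy, the kind of criterion a global scheme could use to decide which attribute to discretize further. All names and the toy data are illustrative assumptions.

```python
import numpy as np

def equal_width_bins(values, k):
    """Cut the range of `values` into k intervals of equal width."""
    edges = np.linspace(values.min(), values.max(), k + 1)
    return np.digitize(values, edges[1:-1])

def equal_frequency_bins(values, k):
    """Cut `values` into k intervals holding roughly equal numbers of cases."""
    cut_points = np.quantile(values, np.linspace(0, 1, k + 1)[1:-1])
    return np.digitize(values, cut_points)

def conditional_class_entropy(bins, labels):
    """Average class entropy over the blocks of the partition induced by `bins`."""
    n, h = len(labels), 0.0
    for b in np.unique(bins):
        block = labels[bins == b]
        _, counts = np.unique(block, return_counts=True)
        p = counts / counts.sum()
        h += (len(block) / n) * -(p * np.log2(p)).sum()
    return h

# Toy usage: compare the two partitions by their class entropy.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x > 0.3).astype(int)
for name, binner in [("equal width", equal_width_bins), ("equal frequency", equal_frequency_bins)]:
    print(name, conditional_class_entropy(binner(x, 3), y))
```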

2014 ◽  
pp. 215-223
Author(s):  
Dipak V. Patil ◽  
Rajankumar S. Bichkar

The advances in and use of technology in all walks of life result in tremendous growth of the data available for data mining. The large amount of knowledge available can be utilized to improve the decision-making process. The data contains noise or outliers to some extent, which hampers the classification performance of a classifier built on that training data. The learning process on a large data set also becomes very slow, as it has to be done serially on the available data. It has been shown that random data reduction techniques can be used to build optimal decision trees. Thus, data cleaning and data sampling techniques can be integrated to overcome the problems of handling large data sets. In the proposed technique, outlier data is first filtered out to obtain clean data of improved quality, and a random sampling technique is then applied to this clean data set to obtain a reduced data set. The reduced data set is used to construct an optimal decision tree. Experiments performed on several data sets show that the proposed technique builds decision trees with enhanced classification accuracy compared to classification on the complete data set. The classification filter improves the quality of the data, and sampling reduces the size of the data set. Thus, the proposed method constructs more accurate and optimally sized decision trees, and it also avoids problems such as overloading memory and processor with large data sets. In addition, the time required to build a model on the clean data is significantly reduced, providing significant speedup.
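A minimal sketch of the filter-then-sample idea, assuming scikit-learn and a generic classification filter (cross-validated misclassification); the paper's exact filter, sampling scheme and data sets may differ.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Classification filter: flag training instances misclassified under cross-validation.
pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X_train, y_train, cv=10)
keep = pred == y_train
X_clean, y_clean = X_train[keep], y_train[keep]

# 2. Random sampling: keep a fraction of the cleaned data.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_clean), size=int(0.5 * len(X_clean)), replace=False)

# 3. Build the decision tree on the reduced, cleaned sample and evaluate it.
tree = DecisionTreeClassifier(random_state=0).fit(X_clean[idx], y_clean[idx])
print("accuracy on held-out data:", tree.score(X_test, y_test))
```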


2018 ◽  
Vol 11 (2) ◽  
pp. 53-67
Author(s):  
Ajay Kumar ◽  
Shishir Kumar

Several initial center selection algorithms have been proposed in the literature for numerical data, but the values of categorical data are unordered, so these methods are not applicable to a categorical data set. This article investigates the initial center selection process for categorical data and then presents a new support-based initial center selection algorithm. The proposed algorithm measures the weight of the unique data points of an attribute with the help of support and then integrates these weights along the rows to obtain the support of every row. Further, the data object having the largest support is chosen as the initial center, followed by finding other centers that are at the greatest distance from the initially selected center. The quality of the proposed algorithm is compared with the random initial center selection method, Cao's method, Wu's method and the method introduced by Khan and Ahmad. Experimental analysis on real data sets shows the effectiveness of the proposed algorithm.
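A rough sketch of support-based seeding for categorical clustering (e.g. as a k-modes initializer), assuming "support" means the frequency of an attribute value within its column and using Hamming distance for the farthest-point step; the published algorithm may define the weights and distances differently.

```python
import numpy as np

def select_initial_centers(X, k):
    """X: 2-D array of categorical values (strings or codes); returns k row indices."""
    n, m = X.shape
    # Support of each cell = frequency of its value within its attribute (column).
    support = np.zeros((n, m))
    for j in range(m):
        values, counts = np.unique(X[:, j], return_counts=True)
        freq = dict(zip(values, counts))
        support[:, j] = [freq[v] for v in X[:, j]]
    row_support = support.sum(axis=1)

    centers = [int(np.argmax(row_support))]   # most-supported row is the first center
    while len(centers) < k:
        # Hamming distance from every row to its nearest already-chosen center.
        dist = np.array([min((X[i] != X[c]).sum() for c in centers) for i in range(n)])
        dist[centers] = -1                     # never pick the same row twice
        centers.append(int(np.argmax(dist)))
    return centers

X = np.array([["red", "small"], ["red", "big"], ["blue", "big"], ["red", "small"]])
print(select_initial_centers(X, 2))
```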


Author(s):  
Antonia J. Jones ◽  
Dafydd Evans ◽  
Steve Margetts ◽  
Peter J. Durrant

The Gamma Test is a non-linear modelling analysis tool that allows us to quantify the extent to which a numerical input/output data set can be expressed as a smooth relationship. In essence, it allows us to efficiently calculate that part of the variance of the output that cannot be accounted for by the existence of any smooth model based on the inputs, even though this model is unknown. A key aspect of this tool is its speed: the Gamma Test has time complexity O(M log M), where M is the number of data points. For data sets consisting of a few thousand points and a reasonable number of attributes, a single run of the Gamma Test typically takes a few seconds. In this chapter we will show how the Gamma Test can be used in the construction of predictive models and classifiers for numerical data. In doing so, we will demonstrate the use of this technique for feature selection, and for the selection of embedding dimension when dealing with a time series.
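A minimal sketch of the Gamma statistic, assuming the usual near-neighbour formulation: for k = 1..p, delta_k is the mean squared distance to the k-th nearest input neighbour and gamma_k is half the mean squared difference of the corresponding outputs; the intercept of the gamma-versus-delta regression estimates the variance of the noise on the output. Library choices and parameter values here are illustrative, not those of the chapter.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gamma_test(X, y, p=10):
    nn = NearestNeighbors(n_neighbors=p + 1).fit(X)
    dist, idx = nn.kneighbors(X)               # column 0 is the point itself
    deltas = (dist[:, 1:] ** 2).mean(axis=0)
    gammas = 0.5 * ((y[idx[:, 1:]] - y[:, None]) ** 2).mean(axis=0)
    slope, intercept = np.polyfit(deltas, gammas, 1)
    return intercept                            # the Gamma statistic: estimated noise variance

# Toy usage: a smooth function plus noise of known variance 0.01.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=2000)
print(gamma_test(X, y))                         # should come out close to 0.01
```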


2019 ◽  
Vol 16 (9) ◽  
pp. 4008-4014
Author(s):  
Savita Wadhawan ◽  
Gautam Kumar ◽  
Vivek Bhatnagar

This paper presents an analysis of different population-based algorithms for rulebase generation from numerical data sets, as fuzzy rulebase generation is one of the key issues in fuzzy modeling. The algorithms are applied to a rapid Ni–Cd battery charger data set. In this paper, we compare the efficiency of the different algorithms and conclude that SCA algorithms with local search give remarkable efficiency compared to SCA algorithms alone. We also found that the efficiency of SCA with local search is comparable to that of memetic algorithms.
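A compact sketch of the Sine Cosine Algorithm (SCA) with an optional greedy local-search step, assuming the standard position-update rule; the parameter values, the local-search operator and the rule encoding used for fuzzy rulebase tuning in the paper are not reproduced here.

```python
import numpy as np

def sca(objective, dim, bounds, pop=30, iters=200, local_search=True, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(pop, dim))
    fitness = np.apply_along_axis(objective, 1, X)
    best = X[fitness.argmin()].copy()

    for t in range(iters):
        r1 = 2 - 2 * t / iters                          # linearly decreasing amplitude
        r2 = rng.uniform(0, 2 * np.pi, size=(pop, dim))
        r3 = rng.uniform(0, 2, size=(pop, dim))
        r4 = rng.uniform(size=(pop, dim))
        step = np.where(r4 < 0.5,
                        r1 * np.sin(r2) * np.abs(r3 * best - X),
                        r1 * np.cos(r2) * np.abs(r3 * best - X))
        X = np.clip(X + step, lo, hi)
        fitness = np.apply_along_axis(objective, 1, X)

        if local_search:                                # greedy perturbation of the best agent
            trial = np.clip(best + rng.normal(scale=0.01 * (hi - lo), size=dim), lo, hi)
            if objective(trial) < objective(best):
                best = trial
        if fitness.min() < objective(best):
            best = X[fitness.argmin()].copy()
    return best, objective(best)

# Toy usage on a sphere function.
print(sca(lambda v: np.sum(v ** 2), dim=5, bounds=(-5.0, 5.0)))
```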


1999 ◽  
Vol 11 ◽  
pp. 169-198 ◽  
Author(s):  
D. Opitz ◽  
R. Maclin

An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble. Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods on 23 data sets using both neural networks and decision trees as our classification algorithms. Our results clearly indicate a number of conclusions. First, while Bagging is almost always more accurate than a single classifier, it is sometimes much less accurate than Boosting. On the other hand, Boosting can create ensembles that are less accurate than a single classifier, especially when using neural networks. Analysis indicates that the performance of the Boosting methods is dependent on the characteristics of the data set being examined. In fact, further results show that Boosting ensembles may overfit noisy data sets, thus decreasing performance. Finally, consistent with previous studies, our work suggests that most of the gain in an ensemble's performance comes in the first few classifiers combined; however, relatively large gains can be seen up to 25 classifiers when Boosting decision trees.
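An illustrative comparison in the spirit of the study, assuming scikit-learn's Bagging and AdaBoost implementations with decision-tree base learners on a single public data set; the paper itself used its own neural-network and decision-tree implementations across 23 data sets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging (25 trees)": BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0),
    "adaboost (25 stumps)": AdaBoostClassifier(n_estimators=25, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10)   # 10-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```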


2017 ◽  
Vol 10 (2) ◽  
pp. 695-708 ◽  
Author(s):  
Simon Ruske ◽  
David O. Topping ◽  
Virginia E. Foot ◽  
Paul H. Kaye ◽  
Warren R. Stanley ◽  
...  

Abstract. Characterisation of bioaerosols has important implications within the environment and public health sectors. Recent developments in ultraviolet light-induced fluorescence (UV-LIF) detectors such as the Wideband Integrated Bioaerosol Spectrometer (WIBS) and the newly introduced Multiparameter Bioaerosol Spectrometer (MBS) have allowed for the real-time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen.

This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real-world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification.

For unsupervised learning we tested hierarchical agglomerative clustering with various different linkages. For supervised learning, 11 methods were tested, including decision trees, ensemble methods (random forests, gradient boosting and AdaBoost), two implementations of support vector machines (libsvm and liblinear), Gaussian methods (Gaussian naïve Bayesian, quadratic and linear discriminant analysis), the k-nearest-neighbours algorithm and artificial neural networks.

The methods were applied to two different data sets produced using the new MBS, which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. The first data set contained mixed PSLs and the second contained a variety of laboratory-generated aerosol.

Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 67.6 and 91.1 % for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 82.8 and 98.27 % of the testing data, respectively, across the two data sets.

A possible alternative to gradient boosting is neural networks. We do however note that this method requires much more user input than the other methods, and we suggest that further research should be conducted using this method, especially using parallelised hardware such as the GPU, which would allow for larger networks to be trained and could possibly yield better results.

We also saw that some methods, such as clustering, failed to utilise the additional shape information provided by the instrument, whilst for others, such as the decision trees, ensemble methods and neural networks, improved performance could be attained with the inclusion of such information.
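A schematic comparison of the two approaches discussed above, using synthetic stand-in data (the MBS fluorescence, size and shape measurements are not reproduced here) and assuming scikit-learn implementations of agglomerative clustering and gradient boosting.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import cross_val_score

# Stand-in for multichannel UV-LIF signatures: 3 particle classes, 8 "channels".
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

# Unsupervised: hierarchical agglomerative clustering with different linkages,
# scored against the known labels with the adjusted Rand index.
for linkage in ("ward", "complete", "average", "single"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(f"{linkage:8s} linkage, adjusted Rand index: {adjusted_rand_score(y, labels):.2f}")

# Supervised: gradient boosting, evaluated by cross-validated accuracy.
acc = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5)
print(f"gradient boosting accuracy: {acc.mean():.2f}")
```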


The demand for data mining is now unavoidable in the medical industry due to its various applications, including predicting diseases at an early stage. The methods available in data mining make it easy to extract useful patterns and quickly recognize task-based outcomes. In data mining, classification models are useful for building classes from medical data sets for accurate future analysis. Beyond these facilities, association rules are a promising data mining technique for finding hidden patterns in a medical data set and have been successfully applied to market basket data, census data and financial data. The Apriori algorithm, considered a classic algorithm, is useful for mining frequent item sets in a database containing a large number of transactions, and it also derives the relevant association rules. Association rules capture the relationships among items present in data sets; when a data set contains continuous attributes, the existing algorithms may not work, so discretization can be applied before mining the association rules in order to find the relations between the various patterns in the data set. In this paper, Discretized Apriori is used to predict the by-disease in people who are found with diabetic syndrome, and the extracted rules are analyzed. In the discretization step, numerical data is discretized and fed to the Apriori algorithm to obtain better association rules for predicting the diseases.
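A minimal, self-contained sketch of the "discretize, then mine association rules" idea; the attribute names, bin edges and support/confidence thresholds below are made up for illustration, and the actual study would run a full Apriori implementation over many clinical transactions.

```python
from itertools import combinations
import pandas as pd

records = pd.DataFrame({
    "glucose": [148, 85, 183, 89, 137, 116, 78, 197],
    "bmi":     [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 30.5],
    "hypertension": ["yes", "no", "yes", "no", "yes", "no", "no", "yes"],
})

# Discretization step: replace each continuous attribute by a labelled interval (item).
records["glucose"] = pd.cut(records["glucose"], bins=[0, 100, 140, 300],
                            labels=["glucose=normal", "glucose=pre", "glucose=high"])
records["bmi"] = pd.cut(records["bmi"], bins=[0, 25, 30, 100],
                        labels=["bmi=normal", "bmi=over", "bmi=obese"])
records["hypertension"] = "htn=" + records["hypertension"]

transactions = [set(map(str, row)) for row in records.itertuples(index=False)]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# One Apriori-style pass over item pairs: keep frequent pairs and report confident rules.
items = sorted(set().union(*transactions))
for a, b in combinations(items, 2):
    s = support({a, b})
    if s >= 0.25:
        confidence = s / support({a})
        if confidence >= 0.7:
            print(f"{a} -> {b}  (support={s:.2f}, confidence={confidence:.2f})")
```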


2009 ◽  
Vol 28 (2) ◽  
pp. 305-324 ◽  
Author(s):  
Mark J. Nigrini ◽  
Steven J. Miller

SUMMARY: Auditors are required to use analytical procedures to identify the existence of unusual transactions, events, and trends. Benford's Law gives the expected patterns of the digits in numerical data, and has been advocated as a test for the authenticity and reliability of transaction level accounting data. This paper describes a new second-order test that calculates the digit frequencies of the differences between the ordered (ranked) values in a data set. These digit frequencies approximate the frequencies of Benford's Law for most data sets. The second-order test is applied to four sets of transactional data. The second-order test detected errors in data downloads, rounded data, data generated by statistical procedures, and the inaccurate ordering of data. The test can be applied to any data set and nonconformity usually signals an unusual issue related to data integrity that might not have been easily detectable using traditional analytical procedures.
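A sketch of the second-order test described above, assuming the first-two-digits form: sort the data, take differences between consecutive ordered values, and compare the first-two-digit frequencies of those differences with the Benford proportions log10(1 + 1/d). The synthetic data and function names are illustrative, not the paper's transactional data sets.

```python
import numpy as np

def first_two_digits(x):
    """Leading two digits (10..99) of each strictly positive value."""
    x = np.asarray(x, dtype=float)
    x = x[x > 0]                                       # zero differences carry no digits
    mantissa = x / 10 ** np.floor(np.log10(x))         # scale each value into [1, 10)
    return np.floor(mantissa * 10).astype(int)         # 10..99

def second_order_test(data):
    diffs = np.diff(np.sort(np.asarray(data, dtype=float)))
    digits = first_two_digits(diffs)
    observed = np.array([(digits == d).mean() for d in range(10, 100)])
    expected = np.log10(1 + 1 / np.arange(10, 100))    # Benford first-two-digit proportions
    return observed, expected

# Toy usage: differences of ordered draws from a smooth distribution track Benford closely.
rng = np.random.default_rng(0)
obs, exp = second_order_test(rng.lognormal(mean=10, sigma=1, size=50000))
print("mean absolute deviation from Benford:", np.abs(obs - exp).mean())
```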


2016 ◽  
Vol 72 (11) ◽  
pp. 1194-1202 ◽  
Author(s):  
Francesco Manzoni ◽  
Kadhirvel Saraboji ◽  
Janina Sprenger ◽  
Rohit Kumar ◽  
Ann-Louise Noresson ◽  
...  

Galectin-3 is an important protein in molecular signalling events involving carbohydrate recognition, and an understanding of the hydrogen-bonding patterns in the carbohydrate-binding site of its C-terminal domain (galectin-3C) is important for the development of new potent inhibitors. The authors are studying these patterns using neutron crystallography. Here, the production of perdeuterated human galectin-3C and successive improvement in crystal size by the development of a crystal-growth protocol involving feeding of the crystallization drops are described. The larger crystals resulted in improved data quality and reduced data-collection times. Furthermore, protocols for complete removal of the lactose that is necessary for the production of large crystals of apo galectin-3C suitable for neutron diffraction are described. Five data sets have been collected at three different neutron sources from galectin-3C crystals of various volumes. It was possible to merge two of these to generate an almost complete neutron data set for the galectin-3C–lactose complex. These data sets provide insights into the crystal volumes and data-collection times necessary for the same system at sources with different technologies and data-collection strategies, and these insights are applicable to other systems.

