Data Discretization
Recently Published Documents

TOTAL DOCUMENTS: 57 (FIVE YEARS: 15)
H-INDEX: 10 (FIVE YEARS: 2)

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Leonardo Alexandre ◽  
Rafael S. Costa ◽  
Rui Henriques

Abstract
Background: A considerable number of data mining approaches for biomedical data analysis, including state-of-the-art associative models, require a form of data discretization. Although diverse discretization approaches have been proposed, they generally work under a strict set of statistical assumptions which are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although an increasing number of symbolic approaches in bioinformatics are able to assign multiple items to values occurring near discretization boundaries for superior robustness, there are no reference principles on how to perform multi-item discretizations.
Results: In this study, an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions is proposed. Statistical tests applied to assess differences in performance confirm that DI2 generally outperforms well-established discretization methods with statistical significance. Within classification tasks, DI2 displays either competitive or superior levels of predictive accuracy, particularly for classifiers able to accommodate border values.
Conclusions: This work proposes a new unsupervised method for data discretization, DI2, that takes into account the underlying data regularities, the presence of outlier values disrupting expected regularities, as well as the relevance of border values. DI2 is available at https://github.com/JupitersMight/DI2
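A minimal sketch of the multi-item idea described above, assuming equal-frequency cut points and a simple tolerance band around each boundary; DI2 itself derives its cut points from distribution fitting, so the function and parameter names here are illustrative only:

```python
import numpy as np

def multi_item_discretize(values, n_bins=3, border_tol=0.1):
    """Equal-frequency discretization that assigns values lying close to a
    cut point to both adjacent bins (multi-item assignment).

    border_tol is the fraction of an average bin width around each cut point
    that is treated as a border region. Returns one item set per value.
    """
    # Cut points from empirical quantiles (equal-frequency bins).
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    cuts = np.quantile(values, quantiles)
    tol = border_tol * (values.max() - values.min()) / n_bins

    items = []
    for v in values:
        assigned = {int(np.searchsorted(cuts, v))}   # primary bin for this value
        for i, c in enumerate(cuts):
            if abs(v - c) <= tol:                    # value sits on a border
                assigned.update({i, i + 1})
        items.append(sorted(assigned))
    return items

# Example on skewed data: values near the cut points receive two items.
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=20)
print(multi_item_discretize(data, n_bins=3))
```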


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255684
Author(s):  
Xin Liu ◽  
Xuefeng Sang ◽  
Jiaxuan Chang ◽  
Yang Zheng ◽  
Yuping Han

Since water supply association analysis plays an important role in the attribution analysis of water supply fluctuation, how to carry out effective association analysis has become a critical problem. However, the current techniques and methods used for association analysis are not very effective because they are based on continuous data. In general, there are different degrees of monotonic relationship between continuous data, which makes the analysis results easily affected by these relationships. The multicollinearity between continuous data also distorts these analytical methods and may generate incorrect results. Meanwhile, we cannot know the association rules and value intervals between features and water supply. Therefore, the lack of an effective analysis method hinders water supply association analysis. The association rules and value intervals of features obtained from association analysis are helpful for grasping the causes of water supply fluctuation and knowing the fluctuation interval of water supply, so as to provide better support for water supply dispatching. However, the association rules and value intervals between features and water supply are not fully understood. In this study, a data mining method coupling k-means clustering discretization and the Apriori algorithm was proposed. k-means was used for data discretization to obtain a one-hot encoding that can be recognized by Apriori, and the discretization also avoids the influence of monotonic relationships and multicollinearity on the analysis results. All the rules were then validated in order to filter out spurious rules. The results show that the method in this study is an effective association analysis method. The method can not only obtain valid strong association rules between features and water supply, but also reveal whether the association relationship between features and water supply is direct or indirect. Meanwhile, the method can also obtain the value intervals of features, the association degree between features, and the confidence probability of the rules.
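A compact sketch of the pipeline the abstract describes, assuming scikit-learn's KBinsDiscretizer with the k-means strategy and mlxtend's Apriori implementation; the column names and thresholds are hypothetical stand-ins for the study's hydrological features:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
from mlxtend.frequent_patterns import apriori, association_rules

# Toy feature table: each row is one period's observations (hypothetical columns).
df = pd.DataFrame({
    "rainfall":     [12.1, 30.5, 8.2, 25.0, 40.3, 5.5, 33.1, 28.7],
    "temperature":  [18.0, 22.5, 15.2, 21.0, 24.8, 14.1, 23.3, 20.9],
    "water_supply": [1.1, 2.4, 0.9, 2.0, 2.9, 0.7, 2.6, 2.2],
})

# 1. k-means discretization of every continuous column into 3 intervals.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
binned = pd.DataFrame(disc.fit_transform(df), columns=df.columns).astype(int)

# 2. One-hot encode the interval labels so Apriori can consume them.
onehot = pd.get_dummies(binned.astype(str), prefix_sep="=").astype(bool)

# 3. Mine frequent itemsets and extract association rules for later validation.
itemsets = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```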


Symmetry ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 1620
Author(s):  
You-Shyang Chen ◽  
Arun Kumar Sangaiah ◽  
Su-Fen Chen ◽  
Hsiu-Chen Huang

Applied human large-scale data are collected from heterogeneous science or industry databases for the purposes of achieving data utilization in complex application environments, such as financial applications. This has posed great opportunities and challenges to all kinds of scientific data researchers. Thus, finding an intelligent hybrid model that solves financial application problems of the stock market is an important issue for financial analysts. In practice, classification applications that focus on earnings per share (EPS) with financial ratios from an industry database often demonstrate that the data meet the abovementioned standards and have particularly high application value. This study proposes several advanced multicomponential discretization models, named Models A–E, where each model identifies and presents a positive/negative diagnosis based on the latest financial statements from six different industries. The models' varied components are compared on performance measurements by using data preprocessing, data discretization, feature selection, two data-split methods, machine learning, rule-based decision tree knowledge, time-lag effects, different numbers of experimental runs, and two different class types. The experimental dataset had 24 condition features and a decision feature, EPS, that was used to classify the data into two and three classes for comparison. Empirically, the analytical results of this study identified three main determinants: total asset growth rate, operating income per share, and times interest earned. The core components of the approach are data discretization and feature selection, together with several classifiers that achieved significantly better accuracy. Overall, the results demonstrated the following key points: (1) the highest accuracy, 92.46%, occurred in Model C from the use of decision tree learning with a percentage-split method for two classes in one run; (2) the highest accuracy mean, 91.44%, occurred in Models D and E from the use of naïve Bayes learning for cross-validation and percentage-split methods for each class for 10 runs; (3) the highest average accuracy mean, 87.53%, occurred in Models D and E with a cross-validation method for each class; (4) the highest accuracy, 92.46%, occurred in Model C from the use of decision tree learning (C4.5) with the percentage-split method and no time lag for each class. This study concludes that its contribution lies in the managerial implications and technical direction it provides for practical finance, where multicomponential discretization models have seen limited use and are rarely applied to scientific industry data due to various restrictions.
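A hedged sketch of the general workflow the abstract outlines (discretization, feature selection, classification, and the two split methods) on synthetic data; scikit-learn's CART decision tree stands in for C4.5, which that library does not provide, and all parameter choices are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the 24 financial-ratio condition features and a
# binary EPS class label (the real data come from industry databases).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 24))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Discretization -> feature selection -> decision tree classifier.
model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    SelectKBest(mutual_info_classif, k=8),
    DecisionTreeClassifier(random_state=0),
)

# Percentage-split evaluation (e.g., a 66/34 split)...
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=0)
print("split accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# ...and 10-fold cross-validation, mirroring the two split methods compared.
print("10-fold CV mean:", cross_val_score(model, X, y, cv=10).mean())
```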


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Qiong Chen ◽  
Mengxing Huang ◽  
Qiannan Xu ◽  
Hao Wang ◽  
Jinghui Wang

Feature discretization can reduce the complexity of data and improve the efficiency of data mining and machine learning. However, in the process of multidimensional data discretization, limited by the complex correlations among features and the performance bottlenecks of traditional discretization criteria, the schemes obtained by most algorithms are not optimal in specific application scenarios and can even fail to meet the accuracy requirements of the system. Although some swarm intelligence algorithms can achieve better results, it is difficult to formulate appropriate strategies without prior knowledge, which makes the search in multidimensional space inefficient, consumes substantial computing resources, and easily falls into local optima. To solve these problems, this paper proposes a genetic algorithm based on reinforcement learning to optimize the discretization scheme of multidimensional data. We use rough sets to construct the individual fitness function, and we design a control function to dynamically adjust population diversity. In addition, we introduce a reinforcement learning mechanism into crossover and mutation to determine the crossover fragments and mutation points of the discretization scheme to be optimized. We conduct simulation experiments on Landsat 8 and Gaofen-2 images, and we compare our method to the traditional genetic algorithm and state-of-the-art discretization methods. Experimental results show that the proposed optimization method can further reduce the number of intervals and simplify the multidimensional dataset without decreasing data consistency or the classification accuracy of the discretization.
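A minimal genetic-algorithm sketch for optimizing a discretization scheme, assuming a fixed pool of candidate cut points, a rough-set-style consistency term in the fitness, and plain one-point crossover with bit-flip mutation; the paper's reinforcement-learning guidance of crossover and mutation is omitted here, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-D data with class labels; the real inputs are multispectral image features.
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

N_CUTS = 4  # candidate cut points per feature; one genome bit turns each on/off
candidate_cuts = np.quantile(X, np.linspace(0.2, 0.8, N_CUTS), axis=0).T  # (2, N_CUTS)

def consistency(genome):
    """Rough-set style consistency: fraction of objects whose discretized
    description is compatible with exactly one class."""
    labels = []
    for j in range(X.shape[1]):
        cuts = candidate_cuts[j][genome[j * N_CUTS:(j + 1) * N_CUTS].astype(bool)]
        labels.append(np.searchsorted(np.sort(cuts), X[:, j]))
    keys = list(zip(*labels))
    consistent = 0
    for k in set(keys):
        idx = [i for i, kk in enumerate(keys) if kk == k]
        if len(set(y[idx])) == 1:
            consistent += len(idx)
    return consistent / len(y)

def fitness(genome):
    # Reward consistency, penalise the number of active cut points (intervals).
    return consistency(genome) - 0.02 * genome.sum()

# Plain genetic algorithm loop (selection, one-point crossover, bit-flip mutation).
pop = rng.integers(0, 2, size=(30, 2 * N_CUTS))
for gen in range(40):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-15:]]                 # truncation selection
    children = []
    while len(children) < len(pop):
        a, b = parents[rng.integers(0, len(parents), 2)]
        point = rng.integers(1, 2 * N_CUTS)                 # one-point crossover
        child = np.concatenate([a[:point], b[point:]])
        flip = rng.random(child.size) < 0.05                # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(g) for g in pop])]
print("active cut points:", best.sum(), "consistency:", round(consistency(best), 3))
```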


2020 ◽  
pp. 121-146 ◽  
Author(s):  
Julián Luengo ◽  
Diego García-Gil ◽  
Sergio Ramírez-Gallego ◽  
Salvador García ◽  
Francisco Herrera
Keyword(s):  
Big Data ◽  

Association rule mining (ARM) is one of the most essential data mining techniques. It finds attribute patterns, expressed as rules, in a data set. The discovery of these association rules typically follows a two-step procedure: first, frequent itemsets are generated, and then the association rules are extracted from them. In the literature, different nature-inspired algorithms such as BCO, ACO, and PSO have been proposed to find interesting association rules. This article presents the performance of a hybrid ARM approach with wolf-search-based optimization using two different fitness functions. The goal is to discover the most promising rules in the data set while avoiding local optima. The implementation is done on numerical data, which requires data discretization as a preliminary phase before ARM with optimization is applied to generate the best rules.
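A rough sketch of metaheuristic rule mining on discretized numerical data, using a generic population-based search as a stand-in for the wolf-search optimizer and one plausible support/confidence fitness function; all names, weights, and the rule encoding are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy numerical data; discretization into 3 intervals per attribute is the
# preliminary phase before rule mining.
data = rng.normal(size=(150, 4))
cuts = np.quantile(data, [1 / 3, 2 / 3], axis=0)                 # per-column cut points
disc = np.column_stack([np.digitize(data[:, j], cuts[:, j]) for j in range(4)])

N_ATTR = disc.shape[1]
CONSEQ = N_ATTR - 1            # the last attribute plays the consequent role here

def rule_metrics(genome):
    """genome[j] in {-1, 0, 1, 2}: -1 drops attribute j from the antecedent,
    otherwise it requires interval genome[j]. Returns (support, confidence)."""
    antecedent = np.ones(len(disc), dtype=bool)
    for j in range(CONSEQ):
        if genome[j] >= 0:
            antecedent &= disc[:, j] == genome[j]
    both = antecedent & (disc[:, CONSEQ] == genome[CONSEQ])
    support = both.mean()
    confidence = both.sum() / antecedent.sum() if antecedent.sum() else 0.0
    return support, confidence

def fitness(genome):
    # One possible fitness: a weighted mix of support and confidence,
    # as is common in metaheuristic ARM.
    s, c = rule_metrics(genome)
    return 0.4 * s + 0.6 * c

# Generic population-based search: candidates move toward the current best
# rule, with occasional random resets for exploration.
pop = rng.integers(-1, 3, size=(25, N_ATTR))
pop[:, CONSEQ] = rng.integers(0, 3, size=25)      # consequent interval must be set
for step in range(60):
    best = pop[np.argmax([fitness(g) for g in pop])].copy()
    for g in pop:
        move = rng.random(N_ATTR) < 0.3           # copy some genes from the best
        g[move] = best[move]
        reset = rng.random(N_ATTR) < 0.1          # random exploration
        g[reset] = rng.integers(-1, 3, size=reset.sum())
        g[CONSEQ] = max(g[CONSEQ], 0)             # keep the consequent defined

best = pop[np.argmax([fitness(g) for g in pop])]
print("best rule genome:", best, "support/confidence:", rule_metrics(best))
```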

