Data Discretization
Recently Published Documents

TOTAL DOCUMENTS: 57 (FIVE YEARS: 15)
H-INDEX: 10 (FIVE YEARS: 2)

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Leonardo Alexandre ◽  
Rafael S. Costa ◽  
Rui Henriques

Abstract
Background: A considerable number of data mining approaches for biomedical data analysis, including state-of-the-art associative models, require a form of data discretization. Although diverse discretization approaches have been proposed, they generally work under a strict set of statistical assumptions which are arguably insufficient to handle the diversity and heterogeneity of clinical and molecular variables within a given dataset. In addition, although an increasing number of symbolic approaches in bioinformatics are able to assign multiple items to values occurring near discretization boundaries for superior robustness, there are no reference principles on how to perform multi-item discretizations.
Results: In this study, an unsupervised discretization method, DI2, for variables with arbitrarily skewed distributions is proposed. Statistical tests applied to assess differences in performance confirm that DI2 generally outperforms well-established discretization methods with statistical significance. Within classification tasks, DI2 displays either competitive or superior levels of predictive accuracy, particularly for classifiers able to accommodate border values.
Conclusions: This work proposes a new unsupervised method for data discretization, DI2, that takes into account the underlying data regularities, the presence of outlier values disrupting expected regularities, as well as the relevance of border values. DI2 is available at https://github.com/JupitersMight/DI2
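A minimal sketch of the multi-item idea described above, assuming equal-frequency cut points and a simple tolerance band around each boundary; DI2 itself derives its cut points from distribution fitting, so the function and parameter names here are illustrative only:

```python
import numpy as np

def multi_item_discretize(values, n_bins=3, border_tol=0.1):
    """Equal-frequency discretization that assigns values lying close to a
    cut point to both adjacent bins (multi-item assignment).

    border_tol is the fraction of an average bin width around each cut point
    that is treated as a border region. Returns one item set per value.
    """
    # Cut points from empirical quantiles (equal-frequency bins).
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    cuts = np.quantile(values, quantiles)
    tol = border_tol * (values.max() - values.min()) / n_bins

    items = []
    for v in values:
        assigned = {int(np.searchsorted(cuts, v))}   # primary bin for this value
        for i, c in enumerate(cuts):
            if abs(v - c) <= tol:                    # value sits on a border
                assigned.update({i, i + 1})
        items.append(sorted(assigned))
    return items

# Example on skewed data: values near the cut points receive two items.
rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=20)
print(multi_item_discretize(data, n_bins=3))
```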


PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0255684
Author(s):  
Xin Liu ◽  
Xuefeng Sang ◽  
Jiaxuan Chang ◽  
Yang Zheng ◽  
Yuping Han

Since water supply association analysis plays an important role in the attribution analysis of water supply fluctuation, how to carry out effective association analysis has become a critical problem. However, the current techniques and methods used for association analysis are not very effective because they are based on continuous data. In general, there are different degrees of monotonic relationship between continuous data, which makes the analysis results easily affected by these relationships. The multicollinearity between continuous data also distorts these analytical methods and may generate incorrect results. Meanwhile, we cannot know the association rules and value intervals between features and water supply. Therefore, the lack of an effective analysis method hinders water supply association analysis. The association rules and value intervals of features obtained from association analysis are helpful for grasping the causes of water supply fluctuation and knowing the fluctuation interval of water supply, so as to provide better support for water supply dispatching. However, the association rules and value intervals between features and water supply are not fully understood. In this study, a data mining method coupling k-means clustering discretization and the Apriori algorithm was proposed. k-means was used for data discretization to obtain a one-hot encoding that can be recognized by Apriori, and the discretization also avoids the influence of monotonic relationships and multicollinearity on the analysis results. All the rules were then validated in order to filter out spurious rules. The results show that the method in this study is an effective association analysis method. The method can not only obtain valid strong association rules between features and water supply, but also reveal whether the association relationship between features and water supply is direct or indirect. Meanwhile, the method can also obtain the value intervals of features, the association degree between features, and the confidence probability of the rules.
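A compact sketch of the pipeline the abstract describes, assuming scikit-learn's KBinsDiscretizer with the k-means strategy and mlxtend's Apriori implementation; the column names and thresholds are hypothetical stand-ins for the study's hydrological features:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
from mlxtend.frequent_patterns import apriori, association_rules

# Toy feature table: each row is one period's observations (hypothetical columns).
df = pd.DataFrame({
    "rainfall":     [12.1, 30.5, 8.2, 25.0, 40.3, 5.5, 33.1, 28.7],
    "temperature":  [18.0, 22.5, 15.2, 21.0, 24.8, 14.1, 23.3, 20.9],
    "water_supply": [1.1, 2.4, 0.9, 2.0, 2.9, 0.7, 2.6, 2.2],
})

# 1. k-means discretization of every continuous column into 3 intervals.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
binned = pd.DataFrame(disc.fit_transform(df), columns=df.columns).astype(int)

# 2. One-hot encode the interval labels so Apriori can consume them.
onehot = pd.get_dummies(binned.astype(str), prefix_sep="=").astype(bool)

# 3. Mine frequent itemsets and extract association rules for later validation.
itemsets = apriori(onehot, min_support=0.25, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```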


Symmetry ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 1620
Author(s):  
You-Shyang Chen ◽  
Arun Kumar Sangaiah ◽  
Su-Fen Chen ◽  
Hsiu-Chen Huang

Applied human large-scale data are collected from heterogeneous science or industry databases for the purposes of achieving data utilization in complex application environments, such as financial applications. This has posed great opportunities and challenges to all kinds of scientific data researchers. Thus, finding an intelligent hybrid model that solves financial application problems of the stock market is an important issue for financial analysts. In practice, classification applications that focus on earnings per share (EPS) with financial ratios from an industry database often demonstrate that the data meet the abovementioned standards and have particularly high application value. This study proposes several advanced multicomponential discretization models, named Models A–E, where each model identifies and presents a positive/negative diagnosis based on the latest financial statements from six different industries. The models' varied components are compared on performance measurements by using data preprocessing, data discretization, feature selection, two data-split methods, machine learning, rule-based decision tree knowledge, time-lag effects, different numbers of experimental runs, and two different class types. The experimental dataset had 24 condition features and a decision feature, EPS, that was used to classify the data into two and three classes for comparison. Empirically, the analytical results of this study identified three main determinants: total asset growth rate, operating income per share, and times interest earned. The core components of the approach are data discretization and feature selection, together with several classifiers that achieved significantly better accuracy. Overall, the results demonstrated the following key points: (1) the highest accuracy, 92.46%, occurred in Model C from the use of decision tree learning with a percentage-split method for two classes in one run; (2) the highest accuracy mean, 91.44%, occurred in Models D and E from the use of naïve Bayes learning for cross-validation and percentage-split methods for each class for 10 runs; (3) the highest average accuracy mean, 87.53%, occurred in Models D and E with a cross-validation method for each class; (4) the highest accuracy, 92.46%, occurred in Model C from the use of decision tree learning (C4.5) with the percentage-split method and no time lag for each class. This study concludes that its contribution lies in the managerial implications and technical direction it provides for practical finance, where multicomponential discretization models have seen limited use and are rarely applied to scientific industry data due to various restrictions.
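A hedged sketch of the general workflow the abstract outlines (discretization, feature selection, classification, and the two split methods) on synthetic data; scikit-learn's CART decision tree stands in for C4.5, which that library does not provide, and all parameter choices are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the 24 financial-ratio condition features and a
# binary EPS class label (the real data come from industry databases).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 24))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# Discretization -> feature selection -> decision tree classifier.
model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
    SelectKBest(mutual_info_classif, k=8),
    DecisionTreeClassifier(random_state=0),
)

# Percentage-split evaluation (e.g., a 66/34 split)...
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.34, random_state=0)
print("split accuracy:", model.fit(X_tr, y_tr).score(X_te, y_te))

# ...and 10-fold cross-validation, mirroring the two split methods compared.
print("10-fold CV mean:", cross_val_score(model, X, y, cv=10).mean())
```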


2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Qiong Chen ◽  
Mengxing Huang ◽  
Qiannan Xu ◽  
Hao Wang ◽  
Jinghui Wang

Feature discretization can reduce the complexity of data and improve the efficiency of data mining and machine learning. However, in the process of multidimensional data discretization, limited by the complex correlations among features and the performance bottlenecks of traditional discretization criteria, the schemes obtained by most algorithms are not optimal in specific application scenarios and can even fail to meet the accuracy requirements of the system. Although some swarm intelligence algorithms can achieve better results, it is difficult to formulate appropriate strategies without prior knowledge, which makes the search in multidimensional space inefficient, consumes substantial computing resources, and easily falls into local optima. To solve these problems, this paper proposes a genetic algorithm based on reinforcement learning to optimize the discretization scheme of multidimensional data. We use rough sets to construct the individual fitness function, and we design a control function to dynamically adjust population diversity. In addition, we introduce a reinforcement learning mechanism into crossover and mutation to determine the crossover fragments and mutation points of the discretization scheme to be optimized. We conduct simulation experiments on Landsat 8 and Gaofen-2 images, and we compare our method to the traditional genetic algorithm and state-of-the-art discretization methods. Experimental results show that the proposed optimization method can further reduce the number of intervals and simplify the multidimensional dataset without decreasing data consistency or the classification accuracy of the discretization.
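A minimal genetic-algorithm sketch for optimizing a discretization scheme, assuming a fixed pool of candidate cut points, a rough-set-style consistency term in the fitness, and plain one-point crossover with bit-flip mutation; the paper's reinforcement-learning guidance of crossover and mutation is omitted here, and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 2-D data with class labels; the real inputs are multispectral image features.
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

N_CUTS = 4  # candidate cut points per feature; one genome bit turns each on/off
candidate_cuts = np.quantile(X, np.linspace(0.2, 0.8, N_CUTS), axis=0).T  # (2, N_CUTS)

def consistency(genome):
    """Rough-set style consistency: fraction of objects whose discretized
    description is compatible with exactly one class."""
    labels = []
    for j in range(X.shape[1]):
        cuts = candidate_cuts[j][genome[j * N_CUTS:(j + 1) * N_CUTS].astype(bool)]
        labels.append(np.searchsorted(np.sort(cuts), X[:, j]))
    keys = list(zip(*labels))
    consistent = 0
    for k in set(keys):
        idx = [i for i, kk in enumerate(keys) if kk == k]
        if len(set(y[idx])) == 1:
            consistent += len(idx)
    return consistent / len(y)

def fitness(genome):
    # Reward consistency, penalise the number of active cut points (intervals).
    return consistency(genome) - 0.02 * genome.sum()

# Plain genetic algorithm loop (selection, one-point crossover, bit-flip mutation).
pop = rng.integers(0, 2, size=(30, 2 * N_CUTS))
for gen in range(40):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-15:]]                 # truncation selection
    children = []
    while len(children) < len(pop):
        a, b = parents[rng.integers(0, len(parents), 2)]
        point = rng.integers(1, 2 * N_CUTS)                 # one-point crossover
        child = np.concatenate([a[:point], b[point:]])
        flip = rng.random(child.size) < 0.05                # bit-flip mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(g) for g in pop])]
print("active cut points:", best.sum(), "consistency:", round(consistency(best), 3))
```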


2020 ◽  
pp. 121-146 ◽  
Author(s):  
Julián Luengo ◽  
Diego García-Gil ◽  
Sergio Ramírez-Gallego ◽  
Salvador García ◽  
Francisco Herrera
Keyword(s):  
Big Data ◽  

Association rule mining (ARM) is one of the most essential data mining techniques. It finds attribute patterns, expressed as rules, in a data set. The discovery of these association rules typically follows a two-step procedure: first, frequent itemsets are generated, and then the association rules are extracted from them. In the literature, different nature-inspired algorithms such as BCO, ACO, and PSO have been proposed to find interesting association rules. This article presents the performance of a hybrid ARM approach with wolf-search-based optimization using two different fitness functions. The goal is to discover the most promising rules in the data set while avoiding local optima. The implementation is done on numerical data, which requires data discretization as a preliminary phase before ARM with optimization is applied to generate the best rules.
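A rough sketch of metaheuristic rule mining on discretized numerical data, using a generic population-based search as a stand-in for the wolf-search optimizer and one plausible support/confidence fitness function; all names, weights, and the rule encoding are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy numerical data; discretization into 3 intervals per attribute is the
# preliminary phase before rule mining.
data = rng.normal(size=(150, 4))
cuts = np.quantile(data, [1 / 3, 2 / 3], axis=0)                 # per-column cut points
disc = np.column_stack([np.digitize(data[:, j], cuts[:, j]) for j in range(4)])

N_ATTR = disc.shape[1]
CONSEQ = N_ATTR - 1            # the last attribute plays the consequent role here

def rule_metrics(genome):
    """genome[j] in {-1, 0, 1, 2}: -1 drops attribute j from the antecedent,
    otherwise it requires interval genome[j]. Returns (support, confidence)."""
    antecedent = np.ones(len(disc), dtype=bool)
    for j in range(CONSEQ):
        if genome[j] >= 0:
            antecedent &= disc[:, j] == genome[j]
    both = antecedent & (disc[:, CONSEQ] == genome[CONSEQ])
    support = both.mean()
    confidence = both.sum() / antecedent.sum() if antecedent.sum() else 0.0
    return support, confidence

def fitness(genome):
    # One possible fitness: a weighted mix of support and confidence,
    # as is common in metaheuristic ARM.
    s, c = rule_metrics(genome)
    return 0.4 * s + 0.6 * c

# Generic population-based search: candidates move toward the current best
# rule, with occasional random resets for exploration.
pop = rng.integers(-1, 3, size=(25, N_ATTR))
pop[:, CONSEQ] = rng.integers(0, 3, size=25)      # consequent interval must be set
for step in range(60):
    best = pop[np.argmax([fitness(g) for g in pop])].copy()
    for g in pop:
        move = rng.random(N_ATTR) < 0.3           # copy some genes from the best
        g[move] = best[move]
        reset = rng.random(N_ATTR) < 0.1          # random exploration
        g[reset] = rng.integers(-1, 3, size=reset.sum())
        g[CONSEQ] = max(g[CONSEQ], 0)             # keep the consequent defined

best = pop[np.argmax([fitness(g) for g in pop])]
print("best rule genome:", best, "support/confidence:", rule_metrics(best))
```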

