SCI-Tree: An Incremental Algorithm for Computing Support Counts of all Closed Intervals from an Interval Dataset

Interval data mining extracts unknown patterns, hidden rules, and associations from interval-based data. Mining closed intervals is important because, given the set of closed intervals and their support counts, the support count of any interval can be computed easily. Many methods for mining closed intervals are available, but most assume a static data set as input and are therefore non-incremental, whereas real-life data sets are dynamic by nature. An efficient incremental algorithm called CI-Tree has already been proposed for computing the closed intervals present in dynamic interval data; however, it cannot compute their support values. This work proposes an incremental algorithm, called SCI-Tree, that extracts all closed intervals together with their support values from the given interval data. Moreover, all frequent closed intervals can be computed for any user-defined minimum support with a single scan of the SCI-Tree, without revisiting the data set. The proposed method has been tested on real-life and synthetic data sets, and the results are reported.
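For illustration, here is a minimal Python sketch (not the authors' SCI-Tree) of how the support count of an arbitrary query interval could be recovered from a precomputed table of closed intervals. It assumes, by analogy with closed itemset mining, that the support of a query equals the support of the smallest closed interval containing it; the dictionary structure and names are hypothetical.

```python
# Hypothetical sketch: recover support counts of arbitrary intervals
# from a table of closed intervals and their supports. Assumes the
# support of a query interval equals that of the smallest closed
# interval containing it (an assumption, not stated in the abstract).

def support(query, closed_supports):
    """query: (lo, hi); closed_supports: {(lo, hi): count}."""
    covering = [iv for iv in closed_supports
                if iv[0] <= query[0] and query[1] <= iv[1]]
    if not covering:
        return 0
    # the tightest covering closed interval determines the support
    tightest = min(covering, key=lambda iv: iv[1] - iv[0])
    return closed_supports[tightest]

closed = {(1, 5): 4, (2, 4): 7, (1, 9): 2}
print(support((2, 3), closed))  # -> 7, from the tightest cover (2, 4)
```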

2013 · Vol 3 (4) · pp. 1-14
Author(s): S. Sampath, B. Ramya

Cluster analysis is a branch of data mining that plays a vital role in bringing out hidden information in databases. Clustering algorithms help medical researchers identify natural subgroups in a data set. Different types of clustering algorithms are available in the literature; the most popular among them is k-means. Although k-means clustering is widely used, it requires the number of clusters in the data set to be known in advance, and several solutions have been proposed in the literature to overcome this limitation. The k-means method also creates a disjoint and exhaustive partition of the data set, yet in some situations objects may belong to more than one cluster. This paper proposes a clustering algorithm capable of producing rough clusters automatically, without requiring the user to supply the number of clusters as input. The efficiency of the algorithm in detecting the number of clusters has been studied on several real-life data sets. Further, a nonparametric statistical analysis of the experimental results has been carried out, using a rough version of the Davies-Bouldin index, to assess how well the proposed algorithm detects the number of clusters automatically.
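A minimal Python sketch of the rough-cluster idea (in the spirit of Lingras-style rough k-means, not the paper's algorithm): a point clearly closest to one center joins that cluster's lower approximation, while a near-equidistant point joins the boundary (upper approximations) of both contending clusters. The threshold epsilon is an assumption.

```python
import numpy as np

# Illustrative rough assignment step. A point belongs unambiguously to
# its nearest cluster only if the second-nearest center is clearly
# farther away (ratio above epsilon); otherwise it is placed in the
# boundary region of both clusters.

def rough_assign(points, centers, epsilon=1.2):
    lower = {i: [] for i in range(len(centers))}
    boundary = {i: [] for i in range(len(centers))}
    for p in points:
        d = np.linalg.norm(centers - p, axis=1)
        order = np.argsort(d)
        nearest, second = order[0], order[1]
        if d[second] / max(d[nearest], 1e-12) > epsilon:
            lower[nearest].append(p)      # unambiguous membership
        else:
            boundary[nearest].append(p)   # belongs to the upper
            boundary[second].append(p)    # approximations of both
    return lower, boundary
```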


Author(s): Muhammad H. Tahir, Muhammad Adnan Hussain, Gauss Cordeiro, Mahmoud El-Morshedy, Mohammed S. Eliwa

For the bounded unit interval, we propose a new Kumaraswamy generalized (G) family of distributions from a new generator, which can serve as an alternative to the Kumaraswamy-G family proposed by Cordeiro and de Castro in 2011. The new generator can also be used to develop alternative G-classes on the bounded unit interval, such as beta-G, McDonald-G, Topp-Leone-G, Marshall-Olkin-G, and Transmuted-G. Some mathematical properties of the new family are obtained, and the maximum likelihood method is used for estimating the family parameters. We investigate the properties of one special model, the new Kumaraswamy-Weibull (NKwW) distribution. Parameter estimation is addressed, and the maximum likelihood estimators are assessed through a simulation study. Two real-life data sets are analyzed to illustrate the importance and flexibility of this distribution. In fact, this model outperforms several generalized Weibull models, such as the Kumaraswamy-Weibull, McDonald-Weibull, beta-Weibull, exponentiated-generalized Weibull, gamma-Weibull, odd log-logistic-Weibull, Marshall-Olkin-Weibull, transmuted-Weibull, exponentiated-Weibull, and Weibull distributions, when applied to these data sets. The bivariate extension of the family is also proposed, and the estimation of its parameters is given. The usefulness of the bivariate NKwW model is illustrated empirically by means of a real-life data set.
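For reference, the earlier Kumaraswamy-G family of Cordeiro and de Castro (2011), to which the new generator is offered as an alternative, transforms a baseline CDF G(x) with density g(x) as follows; the abstract does not give the new generator's closed form, so only this established baseline is shown.

```latex
F(x) = 1 - \bigl[1 - G(x)^{a}\bigr]^{b}, \qquad
f(x) = a\,b\,g(x)\,G(x)^{a-1}\bigl[1 - G(x)^{a}\bigr]^{b-1}, \qquad a, b > 0.
```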


2021
Author(s): Fatma Zohra Seghier, Halim Zeghdoudi

Abstract: In this paper, a Poisson XLindley distribution (PXLD) is obtained by compounding the Poisson (PD) distribution with a continuous distribution. A general expression for its rth factorial moment about the origin is derived, from which its raw and central moments are obtained. Expressions for its coefficient of variation, skewness, kurtosis, and index of dispersion are also given. In particular, the method of maximum likelihood and the method of moments are discussed for the estimation of its parameters. Finally, real-life data sets on Nipah virus infection, Hemocytometer yeast cell counts, and epileptic seizure counts are analyzed to investigate the suitability of the proposed distribution for modeling real data.
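The compounding construction mentioned above has the standard Poisson-mixture form: if X given lambda is Poisson and lambda follows the XLindley law with density f(lambda; theta), then

```latex
P(X = x) = \int_{0}^{\infty} \frac{e^{-\lambda}\,\lambda^{x}}{x!}\, f(\lambda;\theta)\, d\lambda,
\qquad x = 0, 1, 2, \ldots
```

This is the generic mixture identity only; the closed-form PXLD pmf derived from it is given in the paper itself.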


2017 · Vol 26 (1) · pp. 153-168
Author(s): Vijay Kumar, Jitender Kumar Chhabra, Dinesh Kumar

Abstract: The main problem with classical clustering techniques is that they are easily trapped in local optima. This paper attempts to solve this problem by proposing a grey wolf algorithm (GWA)-based clustering technique, called GWA clustering (GWAC). The search capability of the GWA is used to find optimal cluster centers in the given feature space, and an agent representation is used to encode the cluster centers. The proposed GWAC technique is tested on both artificial and real-life data sets and compared to six well-known metaheuristic-based clustering techniques. The computational results are encouraging and demonstrate that GWAC achieves better values of precision, recall, G-measure, and intracluster distance. GWAC is further applied to a gene expression data set, and its performance is compared to the other techniques. The experimental results reveal the efficiency of GWAC over these techniques.
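A minimal Python sketch of the agent encoding and fitness described above (illustrative, not the paper's GWAC implementation): each agent is a flat vector of k cluster centers, and the metaheuristic minimizes the total intracluster distance the agent induces on the data.

```python
import numpy as np

# Decode an agent vector into k centers and score it by the sum of
# distances from each point to its nearest center. A grey wolf (or any
# other metaheuristic) optimizer would minimize this over agent vectors.

def fitness(agent, data, k):
    centers = agent.reshape(k, data.shape[1])  # agent -> k centers
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).sum()             # total intracluster distance
```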


2018 · Vol 12 (3) · pp. 100-122
Author(s): Benjamin Stark, Heiko Gewald, Heinrich Lautenbacher, Ulrich Haase, Siegmar Ruff

This article describes how information about an individual's personal health is among one's most sensitive and important intangible belongings. When health information is misused, serious irreversible damage can be caused, e.g., by making intimate details public or leaking them to employers, insurers, etc. Health information therefore needs to be treated with the highest degree of confidentiality, but in practice this goal proves difficult to achieve. In a hospital setting, medical staff across departments often need to access patient data without directly obvious reasons, which makes it difficult to distinguish legitimate from illegitimate access. This article provides a mechanism for classifying transactions at a large university medical center into plausible and questionable data accesses, using a real-life data set of more than 60,000 transactions. The classification mechanism works with minimal data requirements and unsupervised data sets. The results were evaluated through manual cross-checks internally and by a group of external experts. As a consequence, the hospital's data protection officer is now able to focus on analyzing questionable transactions instead of checking random samples.
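One illustrative way such a classification could work (a sketch under assumed field names, not the article's actual mechanism) is to score each access by how common its staff-department/patient-ward combination is in the transaction log and flag rare combinations for manual review:

```python
from collections import Counter

# Hypothetical unsupervised plausibility check: combinations of staff
# department and patient ward that occur very rarely in the log are
# flagged as questionable. Field names and the threshold are assumptions.

def classify(transactions, threshold=0.001):
    pairs = [(t["staff_dept"], t["patient_ward"]) for t in transactions]
    freq = Counter(pairs)
    total = len(pairs)
    flagged = [t for t in transactions
               if freq[(t["staff_dept"], t["patient_ward"])] / total < threshold]
    return flagged  # questionable accesses for the data protection officer
```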


Mathematics · 2020 · Vol 8 (11) · pp. 1989
Author(s): Muhammad H. Tahir, Muhammad Adnan Hussain, Gauss M. Cordeiro, M. El-Morshedy, M. S. Eliwa

For the bounded unit interval, we propose a new Kumaraswamy generalized (G) family of distributions through a new generator, which can serve as an alternative to the Kumaraswamy-G family proposed by Cordeiro and de Castro in 2011. The new generator can also be used to develop alternative G-classes on the bounded unit interval, such as beta-G, McDonald-G, Topp-Leone-G, Marshall-Olkin-G, and Transmuted-G. Some mathematical properties of the new family are obtained, and the maximum likelihood method is used for the estimation of the family parameters. We investigate the properties of one special model, the new Kumaraswamy-Weibull (NKwW) distribution. The parameters of the NKwW model are estimated by maximum likelihood, and the performance of these estimators is assessed through a simulation study. Two real-life data sets are analyzed to illustrate the importance and flexibility of the proposed model. In fact, this model outperforms several generalized Weibull models, such as the Kumaraswamy-Weibull, McDonald-Weibull, beta-Weibull, exponentiated-generalized Weibull, gamma-Weibull, odd log-logistic-Weibull, Marshall-Olkin-Weibull, transmuted-Weibull, and exponentiated-Weibull distributions, when applied to these data sets. The bivariate extension of the family is also proposed, and the estimation of its parameters is addressed. The usefulness of the bivariate NKwW model is illustrated empirically by means of a real-life data set.
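As background on the model-comparison workflow described above (illustrative only; the NKwW density is not given in the abstract), the sketch below fits the baseline Weibull by maximum likelihood with SciPy and reports its AIC. In such comparisons, the candidate with the lowest AIC is preferred.

```python
import numpy as np
from scipy import stats

# Fit a two-parameter Weibull by MLE (location fixed at 0) and compute
# its AIC; repeating this for each candidate model and comparing AICs
# is the standard workflow behind "model X outperforms model Y".

def weibull_aic(data):
    c, loc, scale = stats.weibull_min.fit(data, floc=0)   # MLE
    ll = stats.weibull_min.logpdf(data, c, loc, scale).sum()
    k = 2                                                  # shape, scale
    return 2 * k - 2 * ll

data = stats.weibull_min.rvs(1.5, scale=2.0, size=200, random_state=0)
print(weibull_aic(data))
```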


2018 · Vol 33 (2) · pp. 113-124
Author(s): K. K. Jose, Lishamol Tomy, Sophia P. Thomas

Abstract: In this article, a generalization of the Weibull distribution called the Harris extended Weibull distribution is studied, and its properties are discussed. We fit the distribution to a real-life data set to show its applicability in reliability modeling. We also derive a reliability test plan for the acceptance or rejection of a lot of products submitted for inspection when lifetimes follow this distribution. The operating characteristic functions of the sampling plans are obtained, and the producer's risk, minimum sample sizes, and associated characteristics are computed and presented in tables. The results are illustrated using two data sets, on ordered failure times of products and on failure times of ball bearings.
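For a standard single sampling plan with sample size n and acceptance number c, the operating characteristic function mentioned above has the familiar binomial form (shown here as general background, not as the paper's specific derivation):

```latex
L(p) = \sum_{i=0}^{c} \binom{n}{i}\, p^{i} (1-p)^{n-i},
```

where p is the probability that an item fails before the test's censoring time under the Harris extended Weibull lifetime model.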


2015 · Vol 2015 · pp. 1-11
Author(s): Xin Xu, Zhaohua Xiong, Wei Wang

Emitter identification is widely recognized as a crucial issue for communication, electronic reconnaissance, and radar intelligence analysis. However, measurements of emitter signal parameters typically take the form of uncertain intervals rather than precise values, and the measurements are generally accumulated dynamically and continuously. As a result, an imminent task is to carry out discriminant analysis of interval-valued parameters incrementally for emitter identification. Existing machine learning approaches for interval-valued data analysis are unfit for this purpose, as they generally assume a uniform distribution and are usually restricted to static data analysis. To address these problems, we put forward an incremental discriminant analysis method on interval-valued parameters (IDAIP) for emitter identification. Extensive experiments on both synthetic and real-life data sets validate the efficiency and effectiveness of our method.
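For illustration, one simple building block for discriminating interval-valued parameters (not the IDAIP method itself) is a Jaccard-style overlap between a measured interval and a per-emitter reference interval:

```python
# Overlap/union similarity of two measurement intervals in [0, 1];
# a discriminant rule could match a new interval-valued observation
# against stored reference intervals per emitter. Names are illustrative.

def interval_overlap(a, b):
    """a, b: (lo, hi) intervals; returns overlap/union in [0, 1]."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 1.0

print(interval_overlap((9.0, 9.6), (9.4, 10.0)))  # -> ~0.2
```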


Stats · 2021 · Vol 4 (2) · pp. 419-453
Author(s): Alex Ely Kossovsky

Benford's Law predicts that the first significant digit of numbers in real-life data is distributed over the nine possible digits 1 to 9 approximately as log10(1 + 1/digit), so that low digits occur far more frequently than high digits in the leading position. Researchers, data analysts, and statisticians typically rush to apply the chi-square test to verify compliance with, or deviation from, this statistical law. In almost all cases of real-life data this approach is mistaken and without basis in mathematical statistics, yet it has become a dogma, or rather an impulsive ritual, in the field of Benford's Law to apply the chi-square test to whatever data set the researcher is considering, regardless of its true applicability. The mistaken use of the chi-square test has led to much confusion and many errors, and has done much to undermine trust and confidence in the whole discipline of Benford's Law. This article is an attempt to correct course and bring rationality and order to a field that has demonstrated harmony and consistency in all of its results, manifestations, and explanations. The first research question of this article demonstrates that real-life data sets typically do not arise from random and independent selections of data points from some larger universe of parental data, as the chi-square approach supposes; this conclusion is reached by examining how several real-life data sets are formed and obtained. The second research question demonstrates that the chi-square approach is really about the reasonableness of the random selection process and the Benford status of that parental universe of data, not solely about the Benford status of the data set under consideration, since the focus of the chi-square test is exclusively on whether the entire process of data selection was probable or too rare. In addition, the article compares the chi-square statistic with the Sum of Squared Deviations (SSD) measure of distance from Benford, pitting one measure against the other, and concludes with a strong preference for the SSD measure.
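As a concrete illustration of the two measures being contrasted, the sketch below (conventions assumed, not taken from the article) computes the observed first-digit proportions of a positive data set, the chi-square statistic against the Benford expectations, and the SSD with proportions expressed in percent, a common convention for this measure.

```python
import math
from collections import Counter

# Benford expected proportions follow log10(1 + 1/d). SSD sums squared
# deviations of observed from expected digit proportions in percentage
# points; chi-square compares observed and expected digit counts.

def first_digit(x):
    s = str(abs(x)).lstrip("0.")  # assumes nonzero numeric data
    return int(s[0])

def benford_measures(data):
    n = len(data)
    counts = Counter(first_digit(x) for x in data)
    ssd = chi2 = 0.0
    for d in range(1, 10):
        expected_p = math.log10(1 + 1 / d)
        observed_p = counts.get(d, 0) / n
        ssd += (100 * observed_p - 100 * expected_p) ** 2
        chi2 += (counts.get(d, 0) - n * expected_p) ** 2 / (n * expected_p)
    return ssd, chi2
```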


2020 · Vol 13 (10) · pp. 1669-1681
Author(s): Zijing Tan, Ai Ran, Shuai Ma, Sheng Qin

Pointwise order dependencies (PODs) are dependencies that specify ordering semantics on the attributes of tuples. POD discovery refers to the process of identifying the set Σ of valid and minimal PODs on a given data set D. In practice, D is typically large and keeps changing, and it is prohibitively expensive to recompute Σ from scratch every time. In this paper, we make a first effort to study the incremental POD discovery problem, aiming to compute the changes ΔΣ to Σ such that Σ ⊕ ΔΣ is the set of valid and minimal PODs on D after a set ΔD of tuple insertion updates. (1) We first propose a novel indexing technique for the inputs Σ and D. We give algorithms to build and choose indexes for Σ and D, and to update the indexes in response to ΔD. We show that POD violations w.r.t. Σ incurred by ΔD can be efficiently identified by leveraging the proposed indexes, with a cost dependent on log(|D|). (2) We then present an effective algorithm for computing ΔΣ, based on Σ and the identified violations caused by ΔD. The PODs in Σ that become invalid on D + ΔD are efficiently detected with the proposed indexes, and new PODs valid on D + ΔD are identified by refining those invalidated PODs. (3) Finally, using both real-life and synthetic datasets, we experimentally show that our approach outperforms the batch approach that recomputes from scratch, by up to orders of magnitude.
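As a toy illustration of why incremental validation can avoid rescanning D (far simpler than the paper's indexes), the sketch below checks one POD of the form "A ascending orders B ascending" against a single inserted tuple, assuming the existing relation already satisfies the POD; the list-based index and names are hypothetical.

```python
import bisect

# POD checked here: for all tuples s, t, if s.A < t.A then s.B <= t.B.
# Keeping existing tuples sorted by A lets an inserted tuple be
# validated against its predecessors and successors only, rather than
# re-checking every pair in the relation.

def violates(sorted_by_a, new):
    """sorted_by_a: list of (A, B) sorted by A; new: (A, B)."""
    i = bisect.bisect_left(sorted_by_a, new)
    left_ok = all(b <= new[1] for _, b in sorted_by_a[:i])   # predecessors
    right_ok = all(b >= new[1] for _, b in sorted_by_a[i:])  # successors
    return not (left_ok and right_ok)

rel = [(1, 10), (2, 20), (3, 30)]
print(violates(rel, (2.5, 15)))  # -> True: (2, 20) precedes but 20 > 15
```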

