MSJEP Classifier: “Modified Strong Jumping Emerging Patterns” for Fast, Efficient Mining and for Handling Attributes Whose Values Are Associated with Taxonomies

Author(s):  
Mohammed K. Hassan ◽  
Ahmed K. Hassan ◽  
Ali I. Eldesouky ◽  
...  

Modified Strong Jumping Emerging Patterns (MSJEPs) are itemsets whose support increases from zero in one data set to non-zero in the other, subject to a support constraint greater than the minimum support threshold (ζ). The support constraint removes potentially less useful JEPs while retaining those with high discriminating power. The Contrast Pattern (CP)-tree-based discovery algorithm used for SJEP mining is a main-memory method: when the data set is large, it is unrealistic to assume that the CP-tree fits in main memory. The main idea for handling this problem is to first partition the data set into a set of projected data sets and then, for each projected data set, construct and mine its corresponding CP-tree. The trees of the projected data sets are called Separated Contrast Pattern trees (SCP-trees), and the patterns generated from them are called Modified Strong Jumping Emerging Patterns (MSJEPs). Our proposal also investigates the weakness of emerging patterns in handling attributes whose values are associated with taxonomies and shows that the MSJEP classifier achieves better accuracy, better speed, and support for taxonomy-valued attributes.
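
A minimal sketch of the MSJEP support test described above, assuming each data set is simply a list of transactions (sets of items); the CP-tree/SCP-tree construction and the partitioning machinery are not reproduced here.

```python
def support(itemset, dataset):
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in dataset if items <= t) / len(dataset)

def is_msjep(itemset, d_neg, d_pos, zeta):
    """True if `itemset` never occurs in d_neg but occurs in d_pos
    with support at least the minimum support threshold zeta."""
    return support(itemset, d_neg) == 0 and support(itemset, d_pos) >= zeta

# Toy example: the pattern {"a", "b"} jumps from support 0 in d_neg to 0.5 in d_pos.
d_neg = [{"a"}, {"b"}, {"c"}, {"b", "c"}]
d_pos = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]
print(is_msjep({"a", "b"}, d_neg, d_pos, zeta=0.3))  # True
```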

2019 ◽  
Vol 491 (4) ◽  
pp. 5238-5247 ◽  
Author(s):  
X Saad-Olivera ◽  
C F Martinez ◽  
A Costa de Souza ◽  
F Roig ◽  
D Nesvorný

ABSTRACT We characterize the radii and masses of the star and planets in the Kepler-59 system, as well as their orbital parameters. The stellar parameters are determined through a standard spectroscopic analysis, resulting in a mass of $1.359\pm 0.155\, \mathrm{M}_\odot$ and a radius of $1.367\pm 0.078\, \mathrm{R}_\odot$. The obtained planetary radii are $1.5\pm 0.1\, R_\oplus$ for the inner and $2.2\pm 0.1\, R_\oplus$ for the outer planet. The orbital parameters and the planetary masses are determined by the inversion of Transit Timing Variations (TTV) signals. We consider two different data sets: one provided by Holczer et al. (2016), with TTVs only for Kepler-59c, and the other provided by Rowe et al. (2015), with TTVs for both planets. The inversion method applies an algorithm of Bayesian inference (MultiNest) combined with an efficient N-body integrator (Swift). For each data set, we found two possible solutions, both having the same probability according to their corresponding Bayesian evidences. All four solutions appear to be indistinguishable within their 2-σ uncertainties. However, statistical analyses show that the solutions from the Rowe et al. (2015) data set provide a better characterization. The first solution infers masses of $5.3_{-2.1}^{+4.0}~M_{\mathrm{\oplus }}$ and $4.6_{-2.0}^{+3.6}~M_{\mathrm{\oplus }}$ for the inner and outer planet, respectively, while the second solution gives masses of $3.0^{+0.8}_{-0.8}~M_{\mathrm{\oplus }}$ and $2.6^{+0.9}_{-0.8}~M_{\mathrm{\oplus }}$. These values point to a system with an inner super-Earth and an outer mini-Neptune. A dynamical study shows that the planets have almost co-planar orbits with small eccentricities (e < 0.1), close to the 3:2 mean motion resonance. A stability analysis indicates that this configuration is stable over millions of years of evolution.
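
For readers unfamiliar with TTVs, the quantity being inverted is simply the residual of the observed mid-transit times around a linear ephemeris. A tiny illustration with hypothetical transit times (not the Kepler-59 measurements) is sketched below; the Bayesian inversion with MultiNest and the N-body integration are far beyond this snippet.

```python
import numpy as np

# Hypothetical mid-transit times (days) for consecutive epochs.
epochs = np.array([0, 1, 2, 3, 4, 5])
t_obs = np.array([0.00, 11.87, 23.76, 35.62, 47.52, 59.38])

# Fit a linear ephemeris t = t0 + P * n by least squares.
P, t0 = np.polyfit(epochs, t_obs, 1)

# TTVs are the residuals of the observed times around the linear ephemeris.
ttv = t_obs - (t0 + P * epochs)
print(f"period = {P:.3f} d, TTV amplitude ~ {np.ptp(ttv) * 24 * 60:.1f} min")
```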


1998 ◽  
Vol 185 ◽  
pp. 167-168
Author(s):  
T. Appourchaux ◽  
M.C. Rabello-Soares ◽  
L. Gizon

Two different data sets have been used to derive low-degree rotational splittings. One data set comes from the Luminosity Oscillations Imager of VIRGO on board SOHO; the observations start on 27 March 1996 and end on 26 March 1997, and consist of intensity time series of 12 pixels (Appourchaux et al., 1997, Sol. Phys., 170, 27). The other data set was kindly made available by the GONG project; the observations start on 26 August 1995 and end on 21 August 1996, and consist of complex Fourier spectra of velocity time series for l = 0 − 9. For the GONG data, the contamination of l = 1 from the spatial aliases of l = 6 and l = 9 required some cleaning. To achieve this, we applied the inverse of the leakage matrix of l = 1, 6 and 9 to the original Fourier spectra of the same degrees; cleaning of all 3 degrees was achieved simultaneously (Appourchaux and Gizon, 1997, these proceedings).
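
A minimal sketch of the cleaning step described above: the contaminated Fourier spectra of l = 1, 6 and 9 are stacked and multiplied by the inverse of their leakage matrix. The matrix and spectra below are made-up placeholders; the real leakage matrix depends on the instrument's spatial response.

```python
import numpy as np

# Placeholder leakage matrix between degrees l = 1, 6, 9 (row = observed, col = true).
C = np.array([[1.00, 0.15, 0.10],
              [0.05, 1.00, 0.08],
              [0.04, 0.07, 1.00]])

# Contaminated complex Fourier amplitudes of l = 1, 6, 9 at one frequency bin.
observed = np.array([1.2 + 0.3j, 0.4 - 0.1j, 0.2 + 0.2j])

cleaned = np.linalg.inv(C) @ observed  # all three degrees cleaned simultaneously
print(cleaned)
```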


2013 ◽  
Vol 6 (4) ◽  
pp. 7593-7631 ◽  
Author(s):  
P. Paatero ◽  
S. Eberly ◽  
S. G. Brown ◽  
G. A. Norris

Abstract. EPA PMF version 5.0 and the underlying multilinear engine executable ME-2 contain three methods for estimating uncertainty in factor analytic models: classical bootstrap (BS), displacement of factor elements (DISP), and bootstrap enhanced by displacement of factor elements (BS-DISP). The goal of these methods is to capture the uncertainty of PMF analyses due to random errors and rotational ambiguity. It is shown that the three methods complement each other: depending on characteristics of the data set, one method may provide better results than the other two. Results are presented using synthetic data sets, including interpretation of diagnostics, and recommendations are given for parameters to report when documenting uncertainty estimates from EPA PMF or ME-2 applications.
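
To make the classical bootstrap (BS) idea concrete, the sketch below resamples the rows of a synthetic data matrix with replacement, reruns a factorization on each resample, and summarizes the spread of the resulting factor profiles. The `run_pmf` function is only a crude stand-in; an actual EPA PMF / ME-2 run (and the block resampling it uses) is not reproduced here.

```python
import numpy as np

def run_pmf(X, n_factors):
    # Placeholder factorization: magnitudes of the leading right singular vectors.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return np.abs(vt[:n_factors])  # stand-in "factor profiles"

rng = np.random.default_rng(0)
X = rng.random((50, 8))  # 50 samples x 8 species (synthetic)

profiles = []
for _ in range(100):
    rows = rng.integers(0, X.shape[0], size=X.shape[0])  # resample with replacement
    profiles.append(run_pmf(X[rows], n_factors=3))

spread = np.std(profiles, axis=0)  # bootstrap spread of each profile element
print(spread.shape)  # (3, 8)
```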


2019 ◽  
Vol 16 (3) ◽  
pp. 705-731
Author(s):  
Haoze Lv ◽  
Zhaobin Liu ◽  
Zhonglian Hu ◽  
Lihai Nie ◽  
Weijiang Liu ◽  
...  

With the advent of the big data era, data releasing has become a hot topic in the database community, and data privacy has also drawn users' attention. Among the privacy protection models that have been proposed, the differential privacy model is widely used because of its many advantages over other models. However, for the private release of multi-dimensional data sets, existing algorithms usually publish data with low availability, because the noise in the released data grows rapidly as the number of dimensions increases. In view of this issue, we propose algorithms based on regular and irregular marginal tables of frequent item sets to protect privacy and improve availability. The main idea is to reduce the dimensionality of the data set and to achieve differential privacy protection with Laplace noise. First, we propose a marginal table cover algorithm based on frequent items that considers the effectiveness of query cover combinations, obtaining a regular marginal table cover set of smaller size but higher data availability. Then, a differential privacy model with irregular marginal tables is proposed for application scenarios with low data availability and high cover rate. Next, we derive an approximately optimal marginal table cover algorithm to obtain the query cover set that satisfies the multi-level query policy constraint, achieving a balance between privacy protection and data availability. Finally, extensive experiments on synthetic and real databases demonstrate that the proposed method performs better than state-of-the-art methods in most cases.
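
A minimal sketch of the Laplace mechanism applied to a marginal (count) table, assuming a count query with L1 sensitivity 1; the frequent-itemset marginal table cover construction from the paper is not reproduced here.

```python
import numpy as np

def laplace_marginal(counts, epsilon, sensitivity=1.0):
    """Release a noisy marginal table satisfying epsilon-differential privacy."""
    counts = np.asarray(counts, dtype=float)
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon, size=counts.shape)
    return counts + noise

# Example: a 2-D marginal over two attributes, released with epsilon = 0.5.
marginal = np.array([[120, 45],
                     [30, 80]])
print(laplace_marginal(marginal, epsilon=0.5))
```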


1999 ◽  
Vol 11 ◽  
pp. 169-198 ◽  
Author(s):  
D. Opitz ◽  
R. Maclin

An ensemble consists of a set of individually trained classifiers (such as neural networks or decision trees) whose predictions are combined when classifying novel instances. Previous research has shown that an ensemble is often more accurate than any of the single classifiers in the ensemble. Bagging (Breiman, 1996c) and Boosting (Freund & Schapire, 1996; Schapire, 1990) are two relatively new but popular methods for producing ensembles. In this paper we evaluate these methods on 23 data sets using both neural networks and decision trees as our classification algorithms. Our results clearly indicate a number of conclusions. First, while Bagging is almost always more accurate than a single classifier, it is sometimes much less accurate than Boosting. On the other hand, Boosting can create ensembles that are less accurate than a single classifier -- especially when using neural networks. Analysis indicates that the performance of the Boosting methods is dependent on the characteristics of the data set being examined. In fact, further results show that Boosting ensembles may overfit noisy data sets, thus decreasing their performance. Finally, consistent with previous studies, our work suggests that most of the gain in an ensemble's performance comes in the first few classifiers combined; however, relatively large gains can be seen up to 25 classifiers when Boosting decision trees.
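
A small illustration of the two ensemble methods compared above, using scikit-learn's implementations of Bagging and (AdaBoost-style) Boosting over decision trees on a synthetic data set; the original study used 23 benchmark data sets and also trained neural-network ensembles.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
boosting = AdaBoostClassifier(n_estimators=25, random_state=0)

for name, clf in [("Bagging", bagging), ("Boosting", boosting)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())
```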


2019 ◽  
Vol 9 (18) ◽  
pp. 3801 ◽  
Author(s):  
Hyuk-Yoon Kwon

In this paper, we propose a method to construct a lightweight key-value store based on the Windows native features. The main idea is to provide a thin wrapper for the key-value store on top of a built-in storage facility in Windows, called the Windows registry. First, we define a mapping of the components in the key-value store onto the components in the Windows registry. Then, we present a hash-based multi-level registry index so as to distribute the key-value data in a balanced way and to access them efficiently. Third, we implement the basic operations of the key-value store (i.e., Get, Put, and Delete) by manipulating the Windows registry using the Windows native APIs. We call the proposed key-value store WR-Store. Finally, we propose an efficient ETL (Extract-Transform-Load) method to migrate data stored in WR-Store into any other environment that supports existing key-value stores. Because the performance of the Windows registry has not been studied much, we perform an empirical study to understand the characteristics of WR-Store and then tune its performance to find the best parameter setting. Through extensive experiments using synthetic and real data sets, we show that the performance of WR-Store is comparable to or even better than the state-of-the-art systems (i.e., RocksDB, BerkeleyDB, and LevelDB). In particular, we show the scalability of WR-Store: it becomes much more efficient than the other key-value stores as the size of the data set increases. In addition, we show that the performance of WR-Store is maintained even under intensive registry workloads where 1,000 processes actively accessing the registry are running concurrently.
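
A minimal sketch of the wrapper idea described above, mapping Get/Put/Delete onto the Windows registry through Python's standard winreg module (Windows only). The registry path and key names are illustrative, not WR-Store's, and the hash-based multi-level index is omitted.

```python
import winreg

ROOT = r"Software\KVStoreDemo"  # illustrative registry subtree

def put(key, value):
    with winreg.CreateKey(winreg.HKEY_CURRENT_USER, ROOT) as h:
        winreg.SetValueEx(h, key, 0, winreg.REG_SZ, value)

def get(key):
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, ROOT) as h:
        value, _type = winreg.QueryValueEx(h, key)
        return value

def delete(key):
    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, ROOT, 0, winreg.KEY_SET_VALUE) as h:
        winreg.DeleteValue(h, key)

put("user:42", "alice")
print(get("user:42"))  # alice
delete("user:42")
```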


2020 ◽  
Vol 10 (7) ◽  
pp. 2539 ◽  
Author(s):  
Toan Nguyen Mau ◽  
Yasushi Inoguchi

It is challenging to build a real-time information retrieval system, especially for systems with high-dimensional big data. To structure big data, many hashing algorithms that map similar data items to the same bucket to speed up the search have been proposed. Locality-Sensitive Hashing (LSH) is a common approach for reducing the number of dimensions of a data set by using a family of hash functions and a hash table. The LSH hash table is an additional component that supports the indexing of hash values (keys) for the corresponding data/items. We previously proposed the Dynamic Locality-Sensitive Hashing (DLSH) algorithm with a dynamically structured hash table, optimized for storage in main memory and General-Purpose computation on Graphics Processing Units (GPGPU) memory. This supports the handling of constantly updated data sets, such as song, image, or text databases. The DLSH algorithm works effectively with data sets that are updated with high frequency and is compatible with parallel processing. However, a single GPGPU device is inadequate for processing big data, due to the small memory capacity of GPGPU devices. When using multiple GPGPU devices for searching, we need an effective search algorithm to balance the jobs. In this paper, we propose an extension of DLSH for big data sets using multiple GPGPUs, in order to increase the capacity and performance of the information retrieval system. Different search strategies on multiple DLSH clusters are also proposed to suit our parallelized system. With significant results in terms of performance and accuracy, we show that DLSH can be applied to real-life dynamic database systems.
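
A minimal sketch of locality-sensitive hashing with random hyperplanes: similar vectors tend to receive the same bucket key, so a query only has to be compared against the items in its bucket. The DLSH dynamic hash table and the multi-GPGPU distribution layer are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_planes = 128, 16
planes = rng.normal(size=(n_planes, dim))  # one random-hyperplane hash family

def bucket(v):
    bits = (planes @ v) > 0                # sign pattern = hash value
    return bits.astype(int).tobytes()      # hashable, usable as a dict key

table = {}
for i, v in enumerate(rng.normal(size=(1000, dim))):  # index 1000 random items
    table.setdefault(bucket(v), []).append(i)

query = rng.normal(size=dim)
print(len(table.get(bucket(query), [])), "candidates in the query's bucket")
```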


Mathematics ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1606
Author(s):  
Daniela Onita ◽  
Adriana Birlutiu ◽  
Liviu P. Dinu

Images and text are types of content that are used together to convey a message. The process of mapping images to text can provide very useful information and can be included in many applications, from the medical domain to applications for blind people, social networking, etc. In this paper, we investigate an approach for mapping images to text using a Kernel Ridge Regression model. We considered two types of features: simple RGB pixel-value features and image features extracted with deep-learning approaches. We investigated several neural network architectures for image feature extraction: VGG16, Inception V3, ResNet50, and Xception. The experimental evaluation was performed on three data sets from different domains. The texts associated with images are objective descriptions for two of the three data sets and subjective descriptions for the other data set. The experimental results show that the more complex deep-learning approaches used for feature extraction perform better than simple RGB pixel-value approaches. Moreover, the ResNet50 network architecture performs best in comparison to the other three deep network architectures considered for extracting image features: the model error obtained using the ResNet50 network is approximately 0.30 lower than that of the other neural network architectures. We extracted natural language descriptors of images and compared the original and generated descriptive words. Furthermore, we investigated whether there is a difference in performance between the types of text associated with the images: subjective or objective. The proposed model generated descriptions more similar to the original ones for the data set containing objective descriptions, whose vocabulary is simpler, larger, and clearer.
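
A rough sketch of the pipeline described above: ResNet50 features as inputs to a Kernel Ridge Regression model that predicts a vector representation of the text. The image batch, text-embedding dimensionality, and hyperparameters below are illustrative placeholders, not the paper's settings.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input
from sklearn.kernel_ridge import KernelRidge

# Pretrained ResNet50 as a fixed feature extractor (global average pooling -> 2048-d).
extractor = ResNet50(weights="imagenet", include_top=False, pooling="avg")

images = np.random.rand(32, 224, 224, 3) * 255.0   # placeholder image batch
X = extractor.predict(preprocess_input(images))    # (32, 2048) image features
Y = np.random.rand(32, 300)                        # placeholder text embeddings

model = KernelRidge(kernel="rbf", alpha=1.0)
model.fit(X, Y)
print(model.predict(X[:1]).shape)                  # (1, 300)
```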


2011 ◽  
Author(s):  
David Doria

This document presents a GUI application to manually select corresponding points in two data sets. The data sets can each be either an image or a point cloud. If both data sets are images, the functionality is equivalent to Matlab’s ‘cpselect’ function. There are many uses of selecting correspondences. If both data sets are images, the correspondences can be used to compute the fundamental matrix, or to perform registration. If both data sets are point clouds, the correspondences can be used to compute a landmark transformation. If one data set is an image and the other is a point cloud, the camera matrix relating the two can be computed.
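
A small sketch of one use case mentioned above: estimating the fundamental matrix from manually selected point correspondences between two images, here using OpenCV's eight-point implementation. The coordinates below are made-up stand-ins for points that would be picked in the GUI.

```python
import numpy as np
import cv2

# Hypothetical corresponding points selected in image 1 and image 2.
pts1 = np.array([[10, 20], [200, 45], [120, 310], [300, 250],
                 [50, 180], [260, 90], [330, 150], [80, 270]], dtype=np.float32)
pts2 = pts1 + np.float32([[5, -3], [4, -2], [6, -3], [5, -4],
                          [5, -2], [4, -3], [6, -4], [5, -3]])

F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)
print(F)
```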


2008 ◽  
Vol 17 (06) ◽  
pp. 1109-1129 ◽  
Author(s):  
BASILIS BOUTSINAS ◽  
COSTAS SIOTOS ◽  
ANTONIS GEROLIMATOS

One of the most important data mining problems is learning association rules of the form "90% of the customers that purchase product x also purchase product y". Discovering association rules from huge volumes of data requires substantial processing power. In this paper we present an efficient distributed algorithm for mining association rules that reduces the time complexity to a degree that renders it suitable for scaling up to very large data sets. The proposed algorithm is based on partitioning the initial data set into subsets and processing each subset in parallel. By reducing the support threshold while processing the subsets, the proposed algorithm can maintain the set of association rules that would be extracted by applying an association rule mining algorithm to all the data. These claims are confirmed by the empirical tests we present, which also demonstrate the utility of the method.
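
A minimal sketch of the partition idea described above: each subset is mined with a lowered local support threshold, the locally frequent patterns are merged into a global candidate set, and the candidates are then validated against the full data at the original threshold. Only frequent pairs are mined here, and the reduction factor is an illustrative parameter, not the paper's.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """All item pairs whose relative support reaches min_support."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    n = len(transactions)
    return {p for p, c in counts.items() if c / n >= min_support}

data = [{"x", "y", "z"}, {"x", "y"}, {"y", "z"}, {"x", "y", "w"}]
partitions = [data[:2], data[2:]]           # processed in parallel in the real algorithm
global_support, reduction = 0.5, 0.8        # lowered threshold inside each partition

candidates = set()
for part in partitions:
    candidates |= frequent_pairs(part, global_support * reduction)

# Validate the merged candidates against the full data set at the original threshold.
result = candidates & frequent_pairs(data, global_support)
print(result)
```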

