Automatic and semantic pre-selection of features using ontology for data mining on data sets related to cancer

This chapter reviews the fundamentals of inference and motivates Bayesian analysis. The method is illustrated with dependency tests on data sets with categorical variables, using Dirichlet prior distributions. Principles and problems in drawing causal conclusions are reviewed and illustrated with Simpson’s paradox. The selection of decomposable and directed graphical models illustrates the Bayesian approach. Bayesian and EM classification are briefly described. The material is illustrated with two cases, one in personalization of media distribution and one in schizophrenia research. These cases show how to approach problem types that arise in many other application areas.

Privacy-Preserving Hybrid K-Means

Censorship, Surveillance, and Privacy ◽

10.4018/978-1-5225-7113-1.ch049 ◽

2019 ◽

pp. 1009-1026

Author(s):

Zhiqiang Gao ◽

Yixiao Sun ◽

Xiaolong Cui ◽

Yutao Wang ◽

Yanyu Duan ◽

...

Keyword(s):

Data Mining ◽

Differential Privacy ◽

Privacy Preserving ◽

Local Optimum ◽

Data Sets ◽

Swarm Optimization ◽

Second Stage ◽

Private Data ◽

Privacy Budget ◽

Selection Of

This article describes how k-means, the most widely used clustering algorithm, is prone to falling into a local optimum. Moreover, traditional clustering approaches operate directly on private data and cannot withstand malicious attacks in massive data mining tasks when attackers hold arbitrary background knowledge. This can violate individuals' privacy and cause leaks through system resources and clustering outputs. To address these issues, the authors propose an efficient privacy-preserving hybrid k-means under Spark. In the first stage, particle swarm optimization is executed on resilient distributed datasets to select the initial clustering centroids for k-means on Spark. In the second stage, k-means is executed with the privacy budget set to ε/2t and Laplace noise added in each round of iterations. Extensive experimentation on public UCI data sets shows that, while guaranteeing the utility of private data and scalability, their approach outperforms state-of-the-art variants of k-means by combining swarm intelligence with the rigorous paradigm of differential privacy.
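The second stage can be sketched roughly as follows. This is a minimal single-machine illustration, not the authors' Spark implementation: it assumes that t is the number of iterations, that each round spends ε/2t on a noisy cluster count and ε/2t on a noisy coordinate sum, and that every coordinate lies in [-bound, bound]. The names `laplace`, `dp_kmeans`, and `bound` are hypothetical, and the initial centroids are simply passed in rather than produced by particle swarm optimization.

```python
import random

def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is a Laplace(0, scale) sample.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_kmeans(points, centroids, epsilon, rounds, bound=1.0, seed=0):
    """Lloyd iterations with Laplace noise on cluster counts and sums.

    Assumption: every coordinate of every point lies in [-bound, bound].
    """
    rng = random.Random(seed)
    dim = len(centroids[0])
    centroids = [list(c) for c in centroids]
    # Assumed split: one share for the count query, one for the sum query.
    eps_query = epsilon / (2 * rounds)
    for _ in range(rounds):
        counts = [0] * len(centroids)
        sums = [[0.0] * dim for _ in centroids]
        for p in points:
            # Assign the point to its nearest centroid (squared distance).
            j = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        for j in range(len(centroids)):
            # A count changes by at most 1 when one record is added or removed.
            noisy_count = counts[j] + laplace(rng, 1.0 / eps_query)
            if noisy_count < 1.0:
                continue  # too few points to trust; keep the old centroid
            for d in range(dim):
                # The sum vector has L1 sensitivity of at most dim * bound.
                noisy_sum = sums[j][d] + laplace(rng, dim * bound / eps_query)
                centroids[j][d] = max(-bound, min(bound, noisy_sum / noisy_count))
    return centroids
```

Smaller values of `epsilon` give stronger privacy but noisier centroids, which is one reason a good initialization matters: fewer rounds are needed, so each round receives a larger share of the budget.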

Performance comparison of six Data mining models for soft churn customer prediction in Telecom

IJEEC - INTERNATIONAL JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTING ◽

10.7251/ijeec1801029m ◽

2019 ◽

Vol 2 (1) ◽

Author(s):

Marin Mandić ◽

Goran Kraljević ◽

Ivan Boban

Keyword(s):

Data Mining ◽

Prediction Models ◽

Principal Component ◽

Performance Comparison ◽

Data Sets ◽

Network Algorithms ◽

Imbalance Problem ◽

Definition Of ◽

Multiple Conditions ◽

Selection Of

Because of strong competition in the market, telecom operators are affected by churn, so it is very important for them to identify which users are likely to leave and switch to a competing telecom company. This research uses data on user behaviour from telecom systems to identify behavioural patterns and thereby recognize churn. A valuable contribution of this paper is a new definition of prepaid soft churn based on multiple conditions. During data preparation, useful attributes were selected using Principal Component Analysis (PCA). The attribute values were also normalized so that all attributes have a balanced influence. A common problem with telecom churn prediction data is imbalance in the target variable. This is also the case for the data used in this paper, where churners make up 12% of the records. Undersampling and oversampling were compared as methods for resolving the data imbalance problem. The undersampled and oversampled data sets were used to train decision tree, logistic regression, and neural network algorithms, yielding six prediction models for detecting churn among prepaid telecom users. The performance of the six data mining models was then analysed and compared.
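The two rebalancing strategies can be illustrated with a short sketch. It assumes plain random resampling, which the abstract does not confirm, and a binary label in which `1` marks a churner; the function names `undersample` and `oversample` are hypothetical.

```python
import random

def _split_indices(labels, minority):
    # Separate row indices into minority-class and majority-class groups.
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx

def undersample(rows, labels, minority=1, seed=0):
    """Drop random majority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    kept = rng.sample(majority_idx, len(minority_idx))
    idx = minority_idx + kept
    return [rows[i] for i in idx], [labels[i] for i in idx]

def oversample(rows, labels, minority=1, seed=0):
    """Duplicate random minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(labels, minority)
    extra = rng.choices(minority_idx, k=len(majority_idx) - len(minority_idx))
    idx = list(range(len(rows))) + extra
    return [rows[i] for i in idx], [labels[i] for i in idx]
```

Both helpers assume the minority class really is the smaller one. In practice, resampling would be applied to the training split only, so that the evaluation data keeps the original class ratio.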

The economics of selection of mail orders

Drs. Zahavi and Levin are the masterminds behind the development of AMOS, a customized predictive modeling system for the Franklin Mint in Philadelphia, and GainSmarts, a general purpose data mining system that is the two-time winner of the KDD-CUP competition for the best data mining tools (1997 and 1998) sponsored by the American Association for Artificial Intelligence.

Journal of Interactive Marketing ◽

10.1002/dir.1016.abs ◽

2001 ◽

Vol 15 (3) ◽

pp. 53

Author(s):

Nissan Levin ◽

Jacob Zahavi

Keyword(s):

Artificial Intelligence ◽

Data Mining ◽

Predictive Modeling ◽

American Association ◽

General Purpose ◽

Mining System ◽

Data Mining System ◽

Mining Tools ◽

Selection Of

A Survey on Preparing Data Sets for Data Mining Analysis using Horizontal Aggregations in SQL

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i4/0199 ◽

2017 ◽

Vol 7 (5) ◽

pp. 172-176

Author(s):

Prashant B. Rajole ◽

Keyword(s):

Data Mining ◽

Data Sets ◽

Data Mining Analysis

K-MEANS CLUSTERING ALGORITHM FOR SERVICE DATA ANALYSIS BASED ON CUSTOMERS COMBINATION

Unes journal of Information System ◽

10.31933/ujis.3.1.001-007.2018 ◽

2018 ◽

Vol 3 (1) ◽

pp. 001

Author(s):

Zulhendra Zulhendra ◽

Gunadi Widi Nurcahyo ◽

Julius Santony

Keyword(s):

Data Mining ◽

Data Analysis ◽

Clustering Algorithm ◽

Customer Complaints ◽

Using Data ◽

Clustering Data ◽

Service Data ◽

Selection Of

This study applies the data mining technique K-Means clustering. Data mining supports the analysis of fairly large data sets; here it enables Indocomputer to understand and classify service data based on customer complaints, using the Weka software. The K-Means clustering algorithm is used to predict or classify complaints about hardware damage at Indocomputer Payakumbuh. The results also identify the laptop brands most often brought in for service at Indocomputer Payakumbuh, which can serve as one recommendation to consumers selecting a laptop.

Selection of one-dimensional sedimentation: models for on-line use

Water Science & Technology ◽

10.2166/wst.1995.0100 ◽

1995 ◽

Vol 31 (2) ◽

pp. 193-204 ◽

Cited By ~ 7

Author(s):

Koen Grijspeerdt ◽

Peter Vanrolleghem ◽

Willy Verstraete

Keyword(s):

Steady State ◽

Selection Criteria ◽

Data Sets ◽

Concentration Profiles ◽

A Posteriori ◽

One Dimensional ◽

On Line ◽

Dynamic Concentration ◽

Selection Of ◽

Modelling Task

A comparative study of several recently proposed one-dimensional sedimentation models has been made. This has been achieved by fitting these models to steady-state and dynamic concentration profiles obtained in a down-scaled secondary decanter. The models were evaluated with several a posteriori model selection criteria. Since the purpose of the modelling task is to do on-line simulations, the calculation time was used as one of the selection criteria. Finally, the practical identifiability of the models for the available data sets was also investigated. It could be concluded that the model of Takács et al. (1991) gave the most reliable results.

Real-time Approximation of Photometric Polygonal Lights

Proceedings of the ACM on Computer Graphics and Interactive Techniques ◽

10.1145/3384537 ◽

2020 ◽

Vol 3 (1) ◽

pp. 1-18

Author(s):

Christian Luksch ◽

Lukas Prost ◽

Michael Wimmer

Keyword(s):

Real Time ◽

Specular Reflection ◽

Near Field ◽

Measurement Data ◽

Data Sets ◽

Photometric Measurement ◽

Integration Technique ◽

Time Approximation ◽

Light Emitter ◽

Selection Of

We present a real-time rendering technique for photometric polygonal lights. Our method uses a numerical integration technique based on a triangulation to calculate noise-free diffuse shading. We include a dynamic point in the triangulation that provides a continuous near-field illumination resembling the shape of the light emitter and its characteristics. We evaluate the accuracy of our approach with a diverse selection of photometric measurement data sets in a comprehensive benchmark framework. Furthermore, we provide an extension for specular reflection on surfaces with arbitrary roughness that facilitates the use of existing real-time shading techniques. Our technique is easy to integrate into real-time rendering systems and extends the range of possible applications with photometric area lights.

PCA for heterogeneous data sets in a distributed data mining

Proceedings of the Fourth Annual ACM Bangalore Conference on - COMPUTE '11 ◽

10.1145/1980422.1980451 ◽

2011 ◽

Author(s):

E. Chandra ◽

P. Ajitha

Keyword(s):

Data Mining ◽

Heterogeneous Data ◽

Distributed Data Mining ◽

Data Sets ◽

Distributed Data

The predictability of reported drought events and impacts in the Ebro Basin using six different remote sensing data sets

Hydrology and Earth System Sciences ◽

10.5194/hess-21-4747-2017 ◽

2017 ◽

Vol 21 (9) ◽

pp. 4747-4765 ◽

Cited By ~ 8

Author(s):

Clara Linés ◽

Micha Werner ◽

Wim Bastiaanssen

Keyword(s):

Remote Sensing ◽

Remote Sensing Data ◽

Data Sets ◽

Drought Management ◽

Ebro Basin ◽

Sensing Data ◽

Drought Impacts ◽

Management Plans ◽

Drought Indicators ◽

Selection Of

Abstract. The implementation of drought management plans contributes to reducing the wide range of adverse impacts caused by water shortage. A crucial element of the development of drought management plans is the selection of appropriate indicators and their associated thresholds to detect drought events and monitor their evolution. Drought indicators should be able to detect emerging drought processes that will lead to impacts with sufficient anticipation to allow measures to be undertaken effectively. However, in the selection of appropriate drought indicators, the connection to the final impacts is often disregarded. This paper explores the utility of remotely sensed data sets to detect early stages of drought at the river basin scale and to determine how much time can be gained to inform operational land and water management practices. Six different remote sensing data sets with different spectral origins and measurement frequencies are considered, complemented by a group of classical in situ hydrologic indicators. Their predictive power to detect past drought events is tested in the Ebro Basin. Qualitative (binary information based on media records) and quantitative (crop yields) data of drought events and impacts spanning a period of 12 years are used as a benchmark in the analysis. Results show that early signs of drought impacts can be detected up to 6 months before impacts are reported in newspapers, with the best correlation–anticipation relationships for the standard precipitation index (SPI), the normalised difference vegetation index (NDVI) and evapotranspiration (ET). Soil moisture (SM) and land surface temperature (LST) also offer good anticipation but with weaker correlations, while gross primary production (GPP) presents moderate positive correlations only for some of the rain-fed areas. Although classical hydrological information from water levels and water flows provided better anticipation than remote sensing indicators in most of the areas, correlations were found to be weaker. The indicators behave consistently with respect to the different levels of crop yield in rain-fed areas among the analysed years, with SPI, NDVI and ET again providing the stronger correlations. Overall, the results confirm the ability of remote sensing products to anticipate reported drought impacts; these products therefore appear to be a useful source of information to support drought management decisions.
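One way to picture a correlation–anticipation relationship is to correlate an indicator series with an impact series shifted by a number of months. The sketch below assumes two equally long monthly series and uses the Pearson coefficient; the paper's actual statistical procedure is not given in the abstract, and the names `pearson` and `lagged_correlations` are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def lagged_correlations(indicator, impact, max_lag):
    """Correlate indicator[t] with impact[t + lag] for lag = 0..max_lag.

    A strong correlation at lag k suggests the indicator anticipates
    the impact by roughly k time steps (months, in this setting).
    """
    n = len(indicator)
    return {lag: pearson(indicator[:n - lag], impact[lag:])
            for lag in range(max_lag + 1)}
```

The lag with the strongest correlation gives a rough estimate of the anticipation time. For indicators where low values signal drought, the strongest relationship may be a negative correlation, so the absolute value would be compared instead.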