The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing

2006 ◽  
Vol 173 (3) ◽  
pp. 781-800 ◽  
Author(s):  
Sven F. Crone ◽  
Stefan Lessmann ◽  
Robert Stahlbock
2013 ◽  
Vol 20 (1) ◽  
pp. 23-38 ◽  
Author(s):  
Ding-Wen Tan ◽  
William Yeoh ◽  
Yee Ling Boo ◽  
Soung-Yue Liew

Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract: This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and the tests simultaneously and thus, in many situations, improves the prediction and size of the resulting classifiers. However, it is a population-based and iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient utilization of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
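The data-parallel idea described in this abstract can be illustrated with a minimal CPU-only Python sketch. It is not the authors' CUDA implementation: the `Node` class, the accuracy-based `fitness`, and the `devices` parameter are hypothetical names chosen for illustration, and a plain loop stands in for the GPUs. The training set is split into chunks, each chunk yields a partial count of correctly classified instances, and the partial counts are summed to score one candidate tree.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Instance = Tuple[Sequence[float], int]  # (feature vector, class label)


@dataclass
class Node:
    """A univariate decision-tree node; leaves carry a label, others a test."""
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None  # set only on leaves


def predict(node: Node, x: Sequence[float]) -> int:
    """Route one instance from the root down to a leaf."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label


def count_correct(tree: Node, chunk: List[Instance]) -> int:
    """Partial result for one data chunk (the work one device would do)."""
    return sum(1 for x, y in chunk if predict(tree, x) == y)


def split(data: List[Instance], parts: int) -> List[List[Instance]]:
    """Cut the dataset into at most `parts` contiguous chunks."""
    size = max(1, -(-len(data) // parts))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def fitness(tree: Node, data: List[Instance], devices: int = 4) -> float:
    """Accuracy of one tree, assembled from per-chunk partial counts.

    Assumes `data` is non-empty. A real fitness function would likely
    also penalise tree size; that term is omitted here.
    """
    partials = [count_correct(tree, chunk) for chunk in split(data, devices)]
    return sum(partials) / len(data)
```

Because the partial counts are simply added, the result does not depend on how many chunks are used, which is what makes this kind of decomposition straightforward to spread across several devices.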


2010 ◽  
Vol 38 (1) ◽  
pp. 74-84 ◽  
Author(s):  
David Orentlicher

Pharmaceutical companies have long relied on direct marketing of their drugs to physicians through one-on-one meetings with sales representatives. This practice of “detailing” is substantial in its costs and its number of participants. Every year, pharmaceutical companies spend billions of dollars on millions of visits to physicians by tens of thousands of sales representatives. Critics have argued that drug detailing results in sub-optimal prescribing decisions by physicians, compromising patient health and driving up spending on medical care. In this view, physicians often are unduly influenced both by marketing presentations that do not accurately reflect evidence from the medical literature and by the gifts that sales representatives deliver in conjunction with their presentations.


Author(s):  
Suma B. ◽  
Shobha G.

Privacy-preserving data mining has become a focus of attention for government statistical agencies and the database security research community, both of which are concerned with preventing privacy disclosure during data mining. Repositories of large datasets include sensitive rules that need to be concealed from unauthorized access. Hence, association rule hiding has emerged as one of the powerful techniques for hiding sensitive knowledge that exists in data before it is published. In this paper, we present a constraint-based optimization approach for hiding a set of sensitive association rules, using a well-structured integer linear program formulation. The proposed approach reduces the database sanitization problem to an instance of the integer linear programming problem. The solution of the integer linear program determines the transactions that need to be sanitized in order to conceal the sensitive rules while minimizing the impact of sanitization on the non-sensitive rules. We also present a heuristic sanitization algorithm that performs hiding by reducing the support or the confidence of the sensitive rules. The results of the experimental evaluation of the proposed approach on real-life datasets indicate the promising performance of the approach in terms of side effects on the original database.
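To make the notions of support and confidence concrete, the following Python sketch hides a single rule by lowering its confidence. It is a simplified greedy stand-in, not the integer linear program or the heuristic algorithm from the paper: the function names, the `min_conf` threshold, and the choice of which item to delete are all assumptions made for illustration. A rule X → Y counts as hidden here once its confidence falls below `min_conf`.

```python
from typing import List, Set


def count(transactions: List[Set[str]], itemset: Set[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions) if transactions else 0.0


def confidence(transactions: List[Set[str]],
               antecedent: Set[str], consequent: Set[str]) -> float:
    """conf(X -> Y) = count(X and Y together) / count(X)."""
    base = count(transactions, antecedent)
    return count(transactions, antecedent | consequent) / base if base else 0.0


def hide_rule(transactions: List[Set[str]], antecedent: Set[str],
              consequent: Set[str], min_conf: float) -> List[Set[str]]:
    """Greedy sketch: delete one consequent item from supporting
    transactions until the rule's confidence drops below `min_conf`.

    Returns a sanitized copy; the input list is left untouched.
    """
    sanitized = [set(t) for t in transactions]
    victim = sorted(consequent)[0]  # assumption: any consequent item will do
    for t in sanitized:
        if confidence(sanitized, antecedent, consequent) < min_conf:
            break
        if antecedent | consequent <= t:
            t.discard(victim)
    return sanitized
```

This sketch ignores side effects on non-sensitive rules, which is the quantity the optimization formulation in the abstract is designed to minimize.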


2021 ◽  
Author(s):  
Ivana Radojević ◽  
◽  
Aleksandar Ostojić ◽  
Nenad Stefanović

Using data mining techniques, this study analyzes the influence and dependence of bacterial communities that are determined in routine monitoring of open water quality status, such as heterotrophic bacteria (psychrophiles and mesophiles). The SeLaR database was used, which, in addition to various studies of integrated data related to the reservoirs of Serbia, is the basis for advanced data analysis utilizing statistical methods and data mining. Data for reservoirs with different morphometric qualities, positions, trophic status, and dominant bacterial communities were analyzed. In this research, classification and analysis of influential parameters, as well as scenario analysis, were applied. The results indicate that a designed data mining system can analyze the state and influence of bacterial communities with different parameters that are determined both in standard routine analysis and in some more specialized studies. This study showed that a designed data mining system can serve as a flexible, effective, and practical tool for monitoring water quality using bacterial communities in reservoirs.

