Subgroup discovery on big data: Pruning the search space on exhaustive search algorithms

Author(s):  
F. Padillo ◽  
J. M. Luna ◽  
S. Ventura
Author(s):  
Tad Hogg

Phase transitions have long been studied empirically in various combinatorial searches and theoretically in simplified models [91, 264, 301, 490]. The analogy with statistical physics [397], explored throughout this volume, shows how the many local choices made during search relate to global properties such as the resulting search cost. These studies have led to a better understanding of typical search behaviors [514] and improved search methods [195, 247, 261, 432, 433]. Among the current research questions in this field are the range of algorithms exhibiting the transition behavior and the algorithm-independent problem properties associated with the difficult instances concentrated near the transition. Towards this end, the present chapter examines quantum computer [123, 126, 158, 486] algorithms for nondeterministic polynomial (NP) combinatorial search problems [191]. As with many conventional methods, they exhibit the easy-hard-easy pattern of computational cost as the degree of constraint in the problems varies. We describe how properties of the search space affect the algorithms and identify an additional structural property, the energy gap, motivated by one quantum algorithm but applicable to a variety of techniques, both quantum and classical. Thus, the study of quantum search algorithms not only extends the range of algorithms exhibiting phase transitions, but also helps identify underlying structural properties. Specifically, the next two sections describe a class of hard search problems and the form of quantum search algorithms proposed to date. The remainder of the chapter presents algorithm behaviors, relevant problem structure, and an approximate asymptotic analysis of their cost scaling. The final section discusses various open issues in designing and evaluating quantum algorithms, and relating their behavior to problem structure.

The k-satisfiability (k-SAT) problem, as discussed earlier in this volume, consists of n Boolean variables and m clauses. A clause is a logical OR of k variables, each of which may be negated. A solution is an assignment, that is, a value for each variable, TRUE or FALSE, satisfying all the clauses. An assignment is said to conflict with any clause it does not satisfy.
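To make the k-SAT formulation concrete, here is a minimal sketch (the representation and names are our own, not from the chapter) that counts how many clauses an assignment conflicts with:

```python
def count_conflicts(assignment, clauses):
    """Count the clauses not satisfied by the assignment.

    assignment: list of n booleans, one per variable.
    clauses: list of m clauses; each clause is a list of k literals,
             where literal i > 0 requires variable i-1 to be TRUE and
             literal i < 0 requires variable i-1 to be FALSE.
    """
    conflicts = 0
    for clause in clauses:
        satisfied = any(assignment[abs(lit) - 1] == (lit > 0)
                        for lit in clause)
        if not satisfied:
            conflicts += 1  # the assignment conflicts with this clause
    return conflicts

# A 3-SAT instance with n = 4 variables and m = 3 clauses:
# (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x4) AND (x2 OR NOT x3 OR NOT x4)
clauses = [[1, -2, 3], [-1, 2, 4], [2, -3, -4]]
print(count_conflicts([True, True, False, True], clauses))  # 0 => a solution
```

A solution is exactly an assignment with zero conflicts.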


Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 2021
Author(s):  
Ahmad Asrul Ibrahim ◽  
Khairuddin Khalid ◽  
Hussain Shareef ◽  
Nor Azwan Mohamed Kamari

This paper proposes a technique to determine the possible optimal placement of the phasor measurement unit (PMU) in power grids for normal operating conditions. All possible combinations of PMU placement, including infeasible ones, are typically considered when searching for the optimal solution, which can yield a massive search space. An integer search algorithm called the bounded search technique is introduced to reduce the search space when solving for the minimum number of PMU allocations whilst maintaining full system observability. The proposed technique is based on connectivity and symmetry constraints that can be derived from the observability matrix. As the technique is coupled with the exhaustive technique, it is called the bounded exhaustive search (BES) technique. Several IEEE test systems, namely, IEEE 9-bus, IEEE 14-bus, IEEE 24-bus and IEEE 30-bus, are considered to showcase the performance of the proposed technique. An initial Monte Carlo simulation was carried out to evaluate the capability of the bounded search technique in providing a smaller feasible search space. The effectiveness of the BES technique in terms of computational time is compared with the existing exhaustive technique. Results demonstrate that the search space can be reduced tremendously, and the computational burden eased, when finding the optimal PMU placement in power grids.
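As a rough illustration of the exhaustive baseline that BES prunes, the sketch below enumerates PMU combinations by increasing size and stops at the first fully observable one. The simple topological rule used here (a PMU observes its own bus and its neighbours) and the 5-bus system are our own stand-ins; the paper's connectivity and symmetry bounding rules are not reproduced.

```python
from itertools import combinations

def observable(placement, adjacency):
    """Topological observability: a PMU observes its own bus and all
    adjacent buses (a common simplification of the observability matrix)."""
    observed = set()
    for bus in placement:
        observed.add(bus)
        observed.update(adjacency[bus])
    return len(observed) == len(adjacency)

def exhaustive_pmu_placement(adjacency):
    """Smallest PMU set by brute force: try sizes 1, 2, ... and return
    the first feasible combination found."""
    buses = list(adjacency)
    for size in range(1, len(buses) + 1):
        for placement in combinations(buses, size):
            if observable(placement, adjacency):
                return placement
    return None

# Hypothetical 5-bus system given as an adjacency list.
adjacency = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 4], 3: [1], 4: [2]}
print(exhaustive_pmu_placement(adjacency))  # (1, 2)
```

Even on this toy system the enumeration grows combinatorially with the bus count; bounding the feasible region before enumerating is what keeps the approach tractable on the IEEE test systems.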


2011 ◽  
Vol 2 (3) ◽  
pp. 27-44 ◽  
Author(s):  
Nashat Mansour ◽  
Ghia Sleiman-Haidar

University exam timetabling refers to scheduling exams into predefined days, time periods and rooms, given a set of constraints. Exam timetabling is a computationally intractable optimization problem, which requires heuristic techniques for producing adequate solutions within reasonable execution time. For large numbers of exams and students, sequential algorithms are likely to be time consuming. This paper presents parallel scatter search meta-heuristic algorithms for producing good sub-optimal exam timetables in a reasonable time. Scatter search is a population-based approach that generates solutions over a number of iterations and aims to combine diversification and search intensification. The authors propose parallel scatter search algorithms that are based on distributing the population of candidate solutions over a number of processors in a PC cluster environment. The main components of scatter search are computed in parallel and efficient communication techniques are employed. Empirical results show that the proposed parallel scatter search algorithms yield good speed-up. They also show that the parallel algorithms improve solution quality, because they explore larger parts of the search space within reasonable time than the sequential algorithm does.
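The sketch below shows one simple reading of this design (a minimal skeleton, not the authors' implementation): the scatter search loop with the improvement step farmed out to a process pool. The toy numeric objective, midpoint combination operator, and quality-only reference set are our own simplifications; real scatter search mixes quality and diversity in the reference set, and the paper works on exam timetables rather than vectors.

```python
from multiprocessing import Pool
import random

def cost(solution):
    return sum(x * x for x in solution)  # toy objective to minimise

def improve(solution):
    """Local improvement step (toy random nudge).  In the paper this is
    where the timetable-repair heuristics would run, in parallel."""
    return [x + random.uniform(-0.1, 0.1) for x in solution]

def scatter_search(pop_size=20, ref_size=5, iters=50, dim=8, workers=4):
    population = [[random.uniform(-5, 5) for _ in range(dim)]
                  for _ in range(pop_size)]
    with Pool(workers) as pool:
        for _ in range(iters):
            # Reference set: the best ref_size solutions (quality only).
            refset = sorted(population, key=cost)[:ref_size]
            # Combine every pair of reference solutions (midpoints here).
            children = [[(a + b) / 2 for a, b in zip(s1, s2)]
                        for i, s1 in enumerate(refset)
                        for s2 in refset[i + 1:]]
            # Improvement is the expensive phase: distribute it.
            children = pool.map(improve, children)
            population = refset + children
    return min(population, key=cost)

if __name__ == "__main__":
    print(round(cost(scatter_search()), 4))
```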


2015 ◽  
Vol 23 (1) ◽  
pp. 101-129 ◽  
Author(s):  
Antonios Liapis ◽  
Georgios N. Yannakakis ◽  
Julian Togelius

Novelty search is a recent algorithm geared toward exploring search spaces without regard to objectives. When the presence of constraints divides a search space into feasible space and infeasible space, interesting implications arise regarding how novelty search explores such spaces. This paper elaborates on the problem of constrained novelty search and proposes two novelty search algorithms which search within both the feasible and the infeasible space. Inspired by the FI-2pop genetic algorithm, both algorithms maintain and evolve two separate populations, one with feasible and one with infeasible individuals, while each population can use its own selection method. The proposed algorithms are applied to the problem of generating diverse but playable game levels, which is representative of the larger problem of procedural game content generation. Results show that the two-population constrained novelty search methods can create, under certain conditions, larger and more diverse sets of feasible game levels than current methods of novelty search, whether constrained or unconstrained. However, the best algorithm is contingent on the particularities of the search space and the genetic operators used. Additionally, the proposed offspring boosting mechanism is shown to improve performance in all cases of two-population novelty search.
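A minimal sketch of the two-population scheme follows, with a real-valued genome and a box constraint standing in for game levels and playability checks; offspring boosting and the exact FI-2pop migration rules are omitted, and all names are our own.

```python
import random

def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def novelty(individual, archive, k=5):
    """Average distance to the k nearest neighbours in the novelty
    archive -- the standard novelty score."""
    if not archive:
        return float("inf")
    dists = sorted(distance(individual, other) for other in archive)
    return sum(dists[:k]) / min(k, len(dists))

def violation(ind):
    """Toy constraint: every gene must lie in [0, 1]; the violation is
    the total distance outside that box."""
    return sum(max(0.0, -g) + max(0.0, g - 1.0) for g in ind)

def mutate(ind, sigma=0.1):
    return [g + random.gauss(0, sigma) for g in ind]

def two_pop_novelty_search(pop_size=20, dim=8, iters=30):
    """Feasible individuals compete on novelty, infeasible ones on
    closeness to feasibility; each generation re-sorts offspring into
    the two populations."""
    population = [[random.uniform(-1, 2) for _ in range(dim)]
                  for _ in range(pop_size)]
    archive = []
    for _ in range(iters):
        feasible = [p for p in population if violation(p) == 0]
        infeasible = [p for p in population if violation(p) > 0]
        archive.extend(feasible)
        # Each population uses its own selection criterion.
        feasible.sort(key=lambda p: -novelty(p, archive))
        infeasible.sort(key=violation)
        parents = feasible[:pop_size // 2] + infeasible[:pop_size // 2]
        population = [mutate(random.choice(parents))
                      for _ in range(pop_size)]
    return [p for p in population if violation(p) == 0]

if __name__ == "__main__":
    print(len(two_pop_novelty_search()))  # count of feasible individuals
```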


2021 ◽  
Vol 182 (2) ◽  
pp. 111-179
Author(s):  
Zaineb Chelly Dagdia ◽  
Christine Zarges

In the context of big data, granular computing has recently been realized through mathematical tools, most notably Rough Set Theory (RST). As a key topic of rough set theory, feature selection has been investigated to adapt the related granular concepts of RST to deal with large amounts of data, leading to the development of the distributed RST version. However, despite its scalability, the distributed RST version faces a key challenge tied to the partitioning of the feature search space in the distributed environment while guaranteeing data dependency. Therefore, in this manuscript, we propose a new distributed RST version based on Locality Sensitive Hashing (LSH), named LSH-dRST, for big data feature selection. LSH-dRST uses LSH to hash similar features into the same bucket and maps the generated buckets into partitions to enable the splitting of the universe in a more efficient way. More precisely, in this paper, we perform a detailed analysis of the performance of LSH-dRST by comparing it to the standard distributed RST version, which is based on a random partitioning of the universe. We demonstrate that our LSH-dRST is scalable when dealing with large amounts of data. We also demonstrate that LSH-dRST ensures the partitioning of the high dimensional feature search space in a more reliable way; hence better preserving data dependency in the distributed environment and ensuring a lower computational cost.
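As a toy illustration of the bucketing step (random-projection LSH on dense feature columns; the LSH family, data representation, and bucket-to-partition mapping actually used in LSH-dRST may differ):

```python
import random

def lsh_signature(column, hyperplanes):
    """Sign of the dot product with each random hyperplane: similar
    columns tend to agree on most signs (random-projection LSH)."""
    return tuple(sum(h * x for h, x in zip(plane, column)) >= 0
                 for plane in hyperplanes)

def bucket_features(data_columns, n_planes=4, seed=0):
    """Group feature columns by LSH signature.  In LSH-dRST each bucket
    would then be mapped to a partition of the distributed rough-set
    computation; only the bucketing step is shown here."""
    rng = random.Random(seed)
    dim = len(next(iter(data_columns.values())))
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)]
                   for _ in range(n_planes)]
    buckets = {}
    for name, column in data_columns.items():
        buckets.setdefault(lsh_signature(column, hyperplanes), []).append(name)
    return list(buckets.values())

# Hypothetical dataset: four features observed on six objects.
columns = {
    "f1": [1, 0, 1, 1, 0, 1],
    "f2": [1, 0, 1, 0, 0, 1],  # similar to f1
    "f3": [0, 1, 0, 0, 1, 0],
    "f4": [0, 1, 0, 1, 1, 0],  # similar to f3
}
print(bucket_features(columns))  # e.g. [['f1', 'f2'], ['f3', 'f4']]
```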


Author(s):  
Stasinos Konstantopoulos ◽  
Rui Camacho ◽  
Nuno A. Fonseca ◽  
Vítor Santos Costa

This chapter introduces Inductive Logic Programming (ILP) from the perspective of search algorithms in Computer Science. It first briefly considers the Version Spaces approach to induction, and then focuses on Inductive Logic Programming: from its formal definition and main techniques and strategies, to priors used to restrict the search space, and on to optimized sequential, parallel, and stochastic algorithms. The authors hope that this presentation of the theory and applications of Inductive Logic Programming will help the reader understand the theoretical underpinnings of ILP, and also provide a helpful overview of the state of the art in the domain.


2005 ◽  
Vol 24 ◽  
pp. 263-303 ◽  
Author(s):  
V. Bayer-Zubek ◽  
T. G. Dietterich

This paper studies the problem of learning diagnostic policies from training examples. A diagnostic policy is a complete description of the decision-making actions of a diagnostician (i.e., tests followed by a diagnostic decision) for all possible combinations of test results. An optimal diagnostic policy is one that minimizes the expected total cost, which is the sum of measurement costs and misdiagnosis costs. In most diagnostic settings, there is a tradeoff between these two kinds of costs. This paper formalizes diagnostic decision making as a Markov Decision Process (MDP). The paper introduces a new family of systematic search algorithms based on the AO* algorithm to solve this MDP. To make AO* efficient, the paper describes an admissible heuristic that enables AO* to prune large parts of the search space. The paper also introduces several greedy algorithms including some improvements over previously published methods. The paper then addresses the question of learning diagnostic policies from examples. When the probabilities of diseases and test results are computed from training data, there is a great danger of overfitting. To reduce overfitting, regularizers are integrated into the search algorithms. Finally, the paper compares the proposed methods on five benchmark diagnostic data sets. The studies show that in most cases the systematic search methods produce better diagnostic policies than the greedy methods. In addition, the studies show that for training sets of realistic size, the systematic search algorithms are practical on today's desktop computers.
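The cost tradeoff can be made concrete with a small worked sketch of the expected-total-cost recursion (a hypothetical one-disease setting with made-up numbers; this exhaustive recursion is exactly what AO* with an admissible heuristic avoids expanding in full):

```python
def posterior(p, sens, spec, result):
    """Bayes update of P(disease) after a test result (1 = positive)."""
    if result == 1:
        num, den = sens * p, sens * p + (1 - spec) * (1 - p)
    else:
        num, den = (1 - sens) * p, (1 - sens) * p + spec * (1 - p)
    return num / den

def diagnose_cost(p, c_fn, c_fp):
    """Expected misdiagnosis cost of the better terminal decision:
    declare 'disease' (risking a false positive) or 'healthy'
    (risking a false negative)."""
    return min((1 - p) * c_fp, p * c_fn)

def policy_value(p, tests, c_fn=100.0, c_fp=20.0):
    """Expected total cost = measurement costs + misdiagnosis costs,
    minimised by recursing over every ordering of the remaining tests."""
    best = diagnose_cost(p, c_fn, c_fp)  # option 1: diagnose now
    for i, (cost, sens, spec) in enumerate(tests):
        rest = tests[:i] + tests[i + 1:]
        p_pos = sens * p + (1 - spec) * (1 - p)
        value = (cost
                 + p_pos * policy_value(posterior(p, sens, spec, 1), rest, c_fn, c_fp)
                 + (1 - p_pos) * policy_value(posterior(p, sens, spec, 0), rest, c_fn, c_fp))
        best = min(best, value)  # option 2: pay for this test first
    return best

# Hypothetical setting: 30% disease prior, two tests (cost, sens, spec).
print(round(policy_value(0.3, [(1.0, 0.9, 0.95), (4.0, 0.99, 0.99)]), 2))
```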


2018 ◽  
Author(s):  
Hao Chi ◽  
Chao Liu ◽  
Hao Yang ◽  
Wen-Feng Zeng ◽  
Long Wu ◽  
...  

Shotgun proteomics has grown rapidly in recent decades, but a large fraction of tandem mass spectrometry (MS/MS) data in shotgun proteomics are not successfully identified. We have developed a novel database search algorithm, Open-pFind, to efficiently identify peptides even in an ultra-large search space which takes into account unexpected modifications, amino acid mutations, semi- or non-specific digestion and co-eluting peptides. Tested on two metabolically labeled MS/MS datasets, Open-pFind reported 50.5‒117.0% more peptide-spectrum matches (PSMs) than the seven other advanced algorithms. More importantly, the Open-pFind results were more credible as judged by verification experiments using stable isotopic labeling. Tested on four additional large-scale datasets, 70‒85% of the spectra were confidently identified, and high-quality spectra were nearly completely interpreted by Open-pFind. Further, Open-pFind was over 40 times faster than the other three open search algorithms and 2‒3 times faster than three restricted search algorithms. Re-analysis of an entire human proteome dataset consisting of ∼25 million spectra using Open-pFind identified a total of 14,064 proteins encoded by 12,723 genes by requiring at least two uniquely identified peptides. In these search results, Open-pFind also excelled in an independent test for false positives based on the presence or absence of olfactory receptors. Thus, a practical use of the open search strategy has been realized by Open-pFind for the truly global-scale proteomics experiments of today and the future.
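In miniature, the "open" part of an open search widens the precursor tolerance so that candidates carrying unexpected mass shifts (putative modifications) are retained rather than discarded. The sketch below, with toy masses and names of our own, shows only that windowing step; Open-pFind's fragment-ion scoring and indexing are far more sophisticated.

```python
from bisect import bisect_left, bisect_right

def open_search(precursor_mass, peptides, window=500.0):
    """Return every candidate whose mass lies within a wide window of
    the precursor, together with the observed mass shift.  A restricted
    search would instead enforce a tolerance of a few ppm."""
    ranked = sorted(peptides, key=lambda p: p[1])
    mass_list = [m for _, m in ranked]
    lo = bisect_left(mass_list, precursor_mass - window)
    hi = bisect_right(mass_list, precursor_mass + window)
    return [(seq, precursor_mass - m) for seq, m in ranked[lo:hi]]

# Hypothetical candidates: (sequence, monoisotopic mass in Da).
peptides = [("PEPTIDE", 799.36), ("PEPTIDES", 886.39), ("EDITPEP", 799.36)]
# A precursor ~79.97 Da above PEPTIDE's mass suggests phosphorylation.
for seq, delta in open_search(879.33, peptides, window=100.0):
    print(seq, round(delta, 2))
```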

