Blazing Signature Filter: a library for fast pairwise similarity comparisons

2017 ◽  
Author(s):  
Joon-Yong Lee ◽  
Grant M. Fujimoto ◽  
Ryan Wilson ◽  
H. Steven Wiley ◽  
Samuel H. Payne

Abstract Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts, as they allow scientists to explore the wealth of scientific data more completely. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to use computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter unproductive pairwise comparisons. Two bioinformatics applications of the tool are presented to demonstrate its ability to scale to billions of pairwise comparisons and the usefulness of this approach.
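The filtering idea described in this abstract can be illustrated with a minimal Python sketch. It is not the BSF library's API: the function names, the threshold-based binarization, and the shared-bit cutoff are all assumptions chosen for illustration. Each numeric profile is packed into an integer bit signature, and a bitwise AND followed by a bit count gives a coarse similarity used to discard unproductive pairs.

```python
def binarize(values, threshold):
    """Pack a numeric vector into an int: bit i is set when values[i] > threshold.

    The fixed threshold is an assumption; the abstract does not say how
    datasets are binarized.
    """
    bits = 0
    for i, v in enumerate(values):
        if v > threshold:
            bits |= 1 << i
    return bits


def shared_bits(a, b):
    """Coarse similarity: number of positions set in both signatures."""
    return bin(a & b).count("1")


def filter_pairs(signatures, min_shared):
    """Return index pairs (i, j), i < j, that share at least min_shared set bits."""
    keep = []
    for i in range(len(signatures)):
        for j in range(i + 1, len(signatures)):
            if shared_bits(signatures[i], signatures[j]) >= min_shared:
                keep.append((i, j))
    return keep


# Hypothetical expression-like profiles, made up for this example.
profiles = [
    [2.1, 0.1, 3.0, 0.0],
    [1.9, 0.2, 2.5, 0.1],
    [0.0, 4.0, 0.1, 3.3],
]
signatures = [binarize(p, threshold=1.0) for p in profiles]
candidates = filter_pairs(signatures, min_shared=2)
```

Only the pairs left in `candidates` would then be passed to a more expensive similarity calculation.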

Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to the top-down inducers. It searches for the tree structure and tests simultaneously and thus improves the prediction accuracy and size of the resulting classifiers in many situations. However, it is a population-based and iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with the greedy systems for large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient use of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
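The data-parallel decomposition mentioned in this abstract can be sketched on a CPU in plain Python. This is only an analogy, not the authors' CUDA implementation: worker threads stand in for GPUs, a one-test stump stands in for an evolved tree, and fitness is assumed to be plain accuracy. The data are split into chunks, each worker counts misclassifications on its own chunk, and the partial counts are summed by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, parts):
    """Split rows into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def chunk_errors(chunk, predict):
    """Count misclassified rows in one chunk (the work one device would do)."""
    return sum(1 for features, label in chunk if predict(features) != label)


def fitness(rows, predict, workers=4):
    """Accuracy of `predict`, computed from per-chunk error counts."""
    chunks = split(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda c: chunk_errors(c, predict), chunks))
    return 1.0 - sum(partial) / len(rows)


def stump(features):
    """Hypothetical candidate tree with a single test on the first feature."""
    return 1 if features[0] > 0.5 else 0


# Made-up (features, label) rows for illustration.
rows = [([0.1], 0), ([0.9], 1), ([0.7], 1), ([0.2], 1), ([0.4], 0)]
accuracy = fitness(rows, stump, workers=2)
```

Because each chunk is scored independently, adding workers only changes how the rows are partitioned, which is the property that lets this kind of fitness evaluation scale with the number of devices.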


2018 ◽  
Vol 3 (1) ◽  
pp. 1-18
Author(s):  
Kislaya Kunjan ◽  
Huanmei Wu ◽  
Tammy R. Toscos ◽  
Bradley N. Doebbeling

2020 ◽  
Vol 17 (8) ◽  
pp. 3389-3393
Author(s):  
M. S. Roobini ◽  
Soujanya Mulakalapally ◽  
Navyasri Mungamuri ◽  
M. Lakshmi ◽  
Anitha Ponraj ◽  
...  

This report presents the results of applying large-scale data mining techniques to road accident data from Finnish roads. The task is difficult because the collected data contain uncertain, incomplete, and erroneous values, which makes data exploration challenging. The data were collected from Finnish road administration data sets. The main aim of the project is to examine the practicability of robust clustering, of finding associations and frequent item sets, and of applying these methods to the analysis of road accidents. The results show that the selected mining techniques and methods were able to produce understandable patterns. The k-means algorithm, with accident frequency count as a parameter, is used to cluster the locations. Association rule mining is used to characterize the surface conditions. The data mining techniques revealed different environmental factors associated with road accidents. Intersections on highways were identified as dangerous locations for fatal accidents.
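The clustering step described above can be illustrated with a minimal one-dimensional k-means sketch in Python. The accident counts, the number of clusters, and the min/max initialization are made up for this example and are not taken from the report.

```python
def kmeans_1d(values, centers, iterations=20):
    """Cluster scalar values with Lloyd's k-means, starting from given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: each center moves to the mean of its group.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    labels = [min(range(len(centers)), key=lambda i: abs(v - centers[i]))
              for v in values]
    return centers, labels


# Hypothetical accident frequency counts, one per road location.
counts = [1, 2, 3, 20, 22, 25]
centers, labels = kmeans_1d(counts, centers=[min(counts), max(counts)])
```

Here locations with label 0 form the low-frequency cluster and locations with label 1 the high-frequency cluster; association rule mining could then be run separately on each cluster.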


2013 ◽  
Vol 51 (5) ◽  
pp. 412-422 ◽  
Author(s):  
Charles Moseley ◽  
Harold Kleinert ◽  
Kathleen Sheppard-Jones ◽  
Stephen Hall

Abstract The application of scientific data in the development and implementation of sound public policy is a well-established practice, but there appears to be less consensus on the nature of the strategies that can and should be used to incorporate research data into policy decisions. This paper describes the promise and the challenges of using research evidence to inform public policy. Most specifically, we demonstrate how the application of a large-scale data set, the National Core Indicators (NCI), can be systematically used to drive state-level policy decisions, and we describe a case example of one state's application of NCI data to make significant changes to its Intellectual and Developmental Disabilities waiver. The need for continued research in this area is highlighted.


2004 ◽  
Vol 19 (1) ◽  
pp. 147-158 ◽  
Author(s):  
Matthew Ward ◽  
Wei Peng ◽  
Xiaoning Wang
