Spatial Data Sequence Selection Based on a User-Defined Condition Using GPGPU

2021 · Vol 10 (12) · pp. 816
Author(s): Driss En-Nejjary, François Pinet, Myoung-Ah Kang

The size of spatial data is growing rapidly due to the emergence of, and tremendous advances in, technologies such as sensors and the Internet of Things. Supporting high-performance queries on this large volume of data has become essential in many data- and compute-intensive applications. Unfortunately, most existing methods and approaches are based on a traditional uniprocessor computing framework, which makes them neither scalable nor adequate for large-scale data. In this work, we present a high-performance query for massive spatio-temporal data. The query selects fixed-size raster subsequences from a spatio-temporal raster sequence, based on the average over their region of interest, such that the average satisfies a user-defined threshold condition. For simplicity, we take the region of interest to be the entire raster rather than a subregion. Our aim is to speed up execution using parallel primitives and pure CUDA. Furthermore, we propose a new method based on a sorting step that saves computations and boosts the speed of query execution. Test results show that the proposed methods are faster and perform well even on large-scale rasters and datasets.
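A minimal NumPy sketch of the query (not the authors' CUDA code): per-raster means stand in for the parallel reduction, and the sliding-window average is obtained with a prefix sum, the same scan primitive a GPU implementation would use. Function and parameter names are illustrative.

```python
import numpy as np

def select_subsequences(rasters, window, threshold):
    """Return start indices of fixed-size raster subsequences whose
    mean value (over the whole raster, the 'region of interest'
    simplification used in the paper) meets `threshold`.

    rasters: array of shape (T, H, W); window: subsequence length.
    A CUDA version would replace the two steps below with parallel
    reductions and a prefix-sum (scan) primitive.
    """
    # Step 1: one mean per raster -- a parallel reduction on GPU.
    means = rasters.mean(axis=(1, 2))                    # shape (T,)
    # Step 2: sliding-window average via a prefix sum (scan).
    csum = np.concatenate(([0.0], np.cumsum(means)))
    win_avg = (csum[window:] - csum[:-window]) / window  # shape (T-window+1,)
    return np.nonzero(win_avg >= threshold)[0]

# Example: 100 random 64x64 rasters, subsequences of length 5.
starts = select_subsequences(np.random.rand(100, 64, 64), 5, 0.51)
```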

2012 · Vol 209-211 · pp. 252-255
Author(s): Li Guo, Hai Ying Zheng, Yong Hong Wang, Bin Zhang

Data matching is a key technology for spatial data integration and fusion. This paper presents a solution for complex polygon areas: it defines the area overlap rate as a geometric measure and builds a data matching approach on it. The paper then discusses and implements the matching relations of area elements, including one-to-one, many-to-one, and many-to-many. Finally, taking region targets as the study object and large-scale data as the test case, we conclude that the algorithm is efficient.
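A small Shapely-based sketch of the matching idea. The paper's exact definition of the area overlap rate may differ; intersection-over-union is assumed here, and all names are illustrative.

```python
from shapely.geometry import Polygon

def area_overlap_rate(a: Polygon, b: Polygon) -> float:
    """Overlap rate in the geometric-measure sense; taken here as
    intersection area over union area (an assumption, since the
    paper's precise definition is not reproduced in the abstract)."""
    inter = a.intersection(b).area
    union = a.union(b).area
    return inter / union if union > 0 else 0.0

def match_areas(src, dst, threshold=0.5):
    """Collect candidate matches between two polygon datasets.
    Grouping the resulting pairs by source/target index yields the
    one-to-one, many-to-one, and many-to-many relations discussed."""
    return [(i, j)
            for i, a in enumerate(src)
            for j, b in enumerate(dst)
            if area_overlap_rate(a, b) >= threshold]
```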


2018 · Vol 7 (12) · pp. 467
Author(s): Mengyu Ma, Ye Wu, Wenze Luo, Luo Chen, Jun Li, ...

Buffer analysis, a fundamental function in a geographic information system (GIS), identifies the areas within a given distance of specified geographic features. Real-time buffer analysis for large-scale spatial data remains challenging because the computational cost of conventional data-oriented methods grows rapidly with data volume. In this paper, we introduce HiBuffer, a visualization-oriented model for real-time buffer analysis. We propose an efficient buffer generation method that introduces spatial indexes and a corresponding query strategy. Buffer results are organized into a tile-pyramid structure to enable stepless zooming. Moreover, a fully optimized hybrid parallel processing architecture is proposed for real-time buffer analysis of large-scale spatial data. Experiments using real-world datasets show that our approach can reduce computation time by up to several orders of magnitude while preserving superior visualization effects. Additional experiments analyzing the influence of spatial data density, buffer radius, and request rate on HiBuffer performance demonstrate its adaptability and stability, and tests of parallel scalability show that it achieves good parallel acceleration. Overall, the experimental results verify that HiBuffer is capable of handling 10-million-scale data.
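A single-threaded sketch of the visualization-oriented idea, assuming Shapely 2.x: each pixel of a display tile is classified through a spatial-index lookup plus an exact distance test, instead of computing buffer polygons. HiBuffer's hybrid CPU/GPU parallelism and tile pyramid are omitted, and all names are illustrative.

```python
import numpy as np
from shapely import Point, STRtree   # Shapely >= 2.0 assumed

def render_buffer_tile(features, radius, x0, y0, size, step):
    """Rasterize one display tile of a buffer, HiBuffer-style:
    rather than materializing buffer polygons, each pixel is
    classified via an index lookup plus an exact distance test.
    (A sketch of the idea only; the real system runs this in a
    hybrid CPU/GPU parallel architecture.)"""
    tree = STRtree(features)             # spatial index over input features
    tile = np.zeros((size, size), dtype=bool)
    for row in range(size):
        for col in range(size):
            p = Point(x0 + col * step, y0 + row * step)
            # The index narrows candidates to features whose bounding
            # box meets the search square; the exact test confirms.
            for i in tree.query(p.buffer(radius)):
                if features[i].distance(p) <= radius:
                    tile[row, col] = True
                    break
    return tile
```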


2022 · Vol 16 (4) · pp. 1-33
Author(s): Danlu Liu, Yu Li, William Baskett, Dan Lin, Chi-Ren Shyu

Risk patterns are crucial in biomedical research and have served as an important factor in precision health and disease prevention. Despite recent developments in parallel and high-performance computing, existing risk pattern mining methods still struggle with problems caused by large-scale datasets, such as redundant candidate generation, inability to discover long significant patterns, and prolonged post-mining pattern filtering. In this article, we propose a novel dynamic tree structure, the Risk Hierarchical Pattern Tree (RHPTree), and a top-down search method, RHPSearch, which together can efficiently analyze large volumes of data and overcome the limitations of previous works. The dynamic nature of the RHPTree avoids costly tree reconstruction during the iterative search process and dataset updates. We also introduce two specialized search methods, an extended target search (RHPSearch-TS) and a parallel search approach (RHPSearch-SD), to further speed up the retrieval of items of interest. Experiments on both UCI machine learning datasets and sampled datasets from the Simons Foundation Autism Research Initiative (SFARI) Simons Simplex Collection (SSC) demonstrate that our method is not only faster but also more effective than existing works in identifying comprehensive long risk patterns. Moreover, the proposed tree structure is generic and applicable to other pattern mining problems.
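The sketch below is not RHPTree/RHPSearch itself, only a generic top-down risk-pattern search with support-based pruning, to illustrate the search space that the proposed tree structure and its reuse make tractable. The risk score and all names are simplified assumptions.

```python
def risk(pattern, cases, controls):
    """Relative-risk-style score: exposure rate of the item set among
    cases over its rate among controls (the paper's statistical
    criteria are more involved; this is a stand-in)."""
    rate = lambda group: sum(pattern <= row for row in group) / len(group)
    r_case, r_ctrl = rate(cases), rate(controls)
    return r_case / r_ctrl if r_ctrl > 0 else float("inf")

def search(prefix, items, cases, controls, min_support, results):
    """Top-down depth-first growth of candidate risk patterns. A branch
    is cut as soon as case support drops below threshold, since support
    can only shrink as a pattern grows (anti-monotonicity)."""
    for k, item in enumerate(items):
        pat = prefix | {item}
        if sum(pat <= row for row in cases) / len(cases) < min_support:
            continue                      # prune this whole branch
        results.append((pat, risk(pat, cases, controls)))
        search(pat, items[k + 1:], cases, controls, min_support, results)

# Rows are item sets, e.g. {"smoker", "hypertension"}; patterns found
# by search(set(), items, ...) accumulate in `results` with scores.
```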


2018 · Vol 35 (3) · pp. 380-388
Author(s): Wei Zheng, Qi Mao, Robert J Genco, Jean Wactawski-Wende, Michael Buck, ...

Motivation: The rapid development of sequencing technology has led to an explosive accumulation of genomic data. Clustering is often the first step in sequence analysis, yet existing methods scale poorly with the unprecedented growth of input data size. As high-performance computing systems become widely accessible, a clustering method that can easily scale to large sequence datasets by leveraging parallel computing is highly desirable.
Results: In this paper, we introduce SLAD (Separation via Landmark-based Active Divisive clustering), a generic computational framework that can be used to parallelize various de novo operational taxonomic unit (OTU) picking methods and comes with theoretical guarantees on both accuracy and efficiency. The framework was implemented on Apache Spark, which allows easy and efficient use of parallel computing resources. Experiments on various datasets demonstrate that SLAD can significantly speed up a number of popular de novo OTU picking methods while maintaining the same level of accuracy. In particular, an experiment on the Earth Microbiome Project dataset (∼2.2B reads, 437 GB) demonstrated the excellent scalability of the proposed method.
Availability and implementation: Open-source software for the proposed method is freely available at https://www.acsu.buffalo.edu/~yijunsun/lab/SLAD.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
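A rough single-machine sketch of the landmark-based divisive step (names and parameters are illustrative, not SLAD's API): sequences are embedded by their distances to a few random landmarks, split in two, and recursed. SLAD distributes exactly this kind of recursion as independent Spark tasks.

```python
import numpy as np

def divide(seq_ids, distance, n_landmarks=8, leaf_size=100, rng=None):
    """Recursive landmark-based divisive clustering, loosely after
    SLAD: embed each sequence by its distances to a few random
    landmark sequences, 2-means the embedding, and recurse.
    `distance` is any pairwise sequence distance function."""
    rng = rng or np.random.default_rng(0)
    if len(seq_ids) <= leaf_size:
        return [seq_ids]                 # small enough for exact OTU picking
    landmarks = rng.choice(seq_ids, n_landmarks, replace=False)
    emb = np.array([[distance(s, l) for l in landmarks] for s in seq_ids])
    # Crude 2-means on the landmark embedding.
    centers = emb[rng.choice(len(emb), 2, replace=False)]
    for _ in range(10):
        labels = np.argmin(((emb[:, None] - centers) ** 2).sum(-1), axis=1)
        if 0 in np.bincount(labels, minlength=2):
            return [seq_ids]             # degenerate split: stop here
        centers = np.array([emb[labels == k].mean(0) for k in (0, 1)])
    halves = ([s for s, l in zip(seq_ids, labels) if l == k] for k in (0, 1))
    # Each recursive call is independent -- the parallelism Spark exploits.
    return [leaf for half in halves
            for leaf in divide(half, distance, n_landmarks, leaf_size, rng)]
```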


Author(s): Krzysztof Jurczuk, Marcin Czajkowski, Marek Kretowski

Decision trees (DTs) are one of the most popular white-box machine-learning techniques. Traditionally, DTs are induced using a top-down greedy search that may lead to sub-optimal solutions. One emerging alternative is evolutionary induction, inspired by biological evolution, which searches the tree structure and node tests simultaneously and yields less complex DTs with at least comparable prediction performance. However, the evolutionary search is computationally expensive, and its effective application to big data mining requires algorithmic and technological progress. In this paper, noting that many trees or their parts reappear during the evolution, we propose a reuse strategy. A fixed number of recently processed individuals (DTs) is stored in a so-called repository. The part of each repository entry related to fitness calculations is kept on the CPU side to limit CPU/GPU memory transfers, while the tree structures are kept on the GPU side to speed up the search for similar DTs. Since DT evaluation is the most time-consuming task of the induction, the GPU first searches the repository for a similar DT to reuse; only if this fails does it evaluate the DT from scratch. Large artificial and real-life datasets and various repository strategies are tested. Results show that reusing information from previous generations can further accelerate the original GPU-based solution, especially for large-scale data. To give an idea of the overall acceleration, the proposed solution can process even billions of objects in a few hours on a single GPU workstation.
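A minimal sketch of the repository idea, assuming a hypothetical tree.signature() that canonically encodes structure and tests; the paper's CPU/GPU split of repository entries is folded into one in-memory store for clarity.

```python
from collections import OrderedDict

class DTRepository:
    """Fixed-capacity store of recently evaluated decision trees,
    keyed by a canonical encoding of structure and node tests. In the
    paper the fitness part lives on the CPU side and the structures on
    the GPU side; both are folded into one dict here for clarity."""
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.entries = OrderedDict()     # signature -> cached fitness

    def fitness(self, tree, evaluate):
        sig = tree.signature()           # hypothetical canonical key
        if sig in self.entries:          # hit: reuse, skip evaluation
            self.entries.move_to_end(sig)
            return self.entries[sig]
        value = evaluate(tree)           # miss: full (GPU) evaluation
        self.entries[sig] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry
        return value
```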


2018 · Vol 13 (2) · pp. 338-346
Author(s): Yusuke Kawai, Jing Zhao, Kento Sugiura, Yoshiharu Ishikawa, Yukiko Wakita, ...

Today, large-scale simulations are thriving thanks to increases in computing performance and storage capacity. Understanding the results of these simulations is not easy, so support for interactive and exploratory analysis is becoming more important. This study focuses on spatio-temporal simulations and develops an analysis technology to support them, using a database system for interactive analysis of large-scale data. Since data obtained from spatio-temporal simulations are not well suited to management in a relational DBMS (RDBMS), this study uses an array DBMS, a type of DBMS that has attracted increasing attention in recent years. An array DBMS is designed for managing large-scale array data: it provides a logical model for array data while supporting efficient query processing. SciDB is used as the specific array DBMS in this paper. The study targets disaster evacuation simulation data and demonstrates experimentally that the query-processing functions offered by an array DBMS provide effective analysis support.
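For illustration only, a NumPy stand-in for the kind of slice-and-aggregate query an array DBMS evaluates declaratively over stored arrays; SciDB would express the same operation in its own query language rather than in Python, and all numbers below are made up.

```python
import numpy as np

# Hypothetical simulation output as a dense array: one evacuee count
# per (time step, grid x, grid y) cell -- the natural shape for an
# array DBMS such as SciDB, mimicked here with NumPy.
evac = np.random.poisson(2.0, size=(288, 100, 100))  # 24 h at 5-min steps

# "Window + aggregate", the style of query an array DBMS runs directly
# on the stored array: evacuees remaining in one district
# (x in [20, 40), y in [50, 70)) over the first simulated hour.
district_hour = evac[:12, 20:40, 50:70]
remaining_per_step = district_hour.sum(axis=(1, 2))   # shape (12,)
```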


Author(s): Gordon Bell, David H Bailey, Jack Dongarra, Alan H Karp, Kevin Walsh

The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field.

