Repair of Voids in Multi-Labeled Triangular Mesh

2021, Vol. 11 (19), pp. 9275
Author(s): Deyun Zhong, Benyu Li, Tiandong Shi, Zhaopeng Li, Liguan Wang, ...

In this paper, we propose a novel mesh repairing method for repairing voids across several meshes so as to ensure the desired topological correctness. The input to our method is several closed and manifold meshes without labels. The basic idea of the method is to search for and repair voids using a multi-labeled mesh data structure and ideas from graph theory. We propose judgment rules for voids between the input meshes and a void-repairing method based on specified model priorities. The method consists of three steps: (a) converting the input meshes into a multi-labeled mesh; (b) searching for quasi-voids with a breadth-first search and determining true voids via the judgment rules; (c) repairing voids by modifying mesh labels. The method repairs voids accurately, and only a few invalid triangular facets are removed. In general, it can repair meshes with one hundred thousand facets in approximately one second on very modest hardware. Moreover, it can easily be extended to process large-scale polygon models with millions of polygons. Experimental results on several data sets show the reliability and performance of the void repairing method based on the multi-labeled triangular mesh.
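The breadth-first search in step (b) is the heart of the void detection. As a rough illustration only, the following Python sketch groups connected unlabeled facets into candidate quasi-void regions; the `facets`, `adjacency`, and `labels` inputs are hypothetical stand-ins, and the paper's multi-labeled mesh structure and judgment rules are not reproduced here.

```python
from collections import deque

def find_quasi_voids(facets, adjacency, labels):
    """Breadth-first search over the facet adjacency graph, grouping
    connected unlabeled facets into candidate (quasi-) void regions.

    facets    -- iterable of facet ids
    adjacency -- dict: facet id -> list of neighbouring facet ids
    labels    -- dict: facet id -> mesh label, or None if unlabeled
    """
    visited = set()
    regions = []
    for f in facets:
        if f in visited or labels.get(f) is not None:
            continue
        # Grow one candidate void region from this seed facet.
        region, queue = [], deque([f])
        visited.add(f)
        while queue:
            cur = queue.popleft()
            region.append(cur)
            for nb in adjacency[cur]:
                if nb not in visited and labels.get(nb) is None:
                    visited.add(nb)
                    queue.append(nb)
        regions.append(region)
    return regions
```

Each returned region would then be tested against the judgment rules before any labels are modified.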

2014, Vol. 571-572, pp. 497-501
Author(s): Qi Lv, Wei Xie

Real-time log analysis on large scale data is important for applications. Specifically, real-time refers to UI latency within 100ms. Therefore, techniques which efficiently support real-time analysis over large log data sets are desired. MongoDB provides well query performance, aggregation frameworks, and distributed architecture which is suitable for real-time data query and massive log analysis. In this paper, a novel implementation approach for an event driven file log analyzer is presented, and performance comparison of query, scan and aggregation operations over MongoDB, HBase and MySQL is analyzed. Our experimental results show that HBase performs best balanced in all operations, while MongoDB provides less than 10ms query speed in some operations which is most suitable for real-time applications.
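For flavour, a minimal pymongo sketch of the kind of aggregation such a log analyzer might run is shown below; the `logs` collection and its `level`/`source` fields are hypothetical, not taken from the paper.

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (assumed running on the default port).
client = MongoClient("mongodb://localhost:27017")
logs = client.logdb.logs

# Count ERROR-level log entries per source, most frequent first.
pipeline = [
    {"$match": {"level": "ERROR"}},
    {"$group": {"_id": "$source", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10},
]
for row in logs.aggregate(pipeline):
    print(row["_id"], row["count"])
```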


2020, Vol. 223 (3), pp. 1837-1863
Author(s): M C Manassero, J C Afonso, F Zyserman, S Zlotnik, I Fomin

SUMMARY Simulation-based probabilistic inversions of 3-D magnetotelluric (MT) data are arguably the best option to deal with the nonlinearity and non-uniqueness of the MT problem. However, the computational cost associated with the modelling of 3-D MT data has so far precluded the community from adopting and/or pursuing full probabilistic inversions of large MT data sets. In this contribution, we present a novel and general inversion framework, driven by Markov Chain Monte Carlo (MCMC) algorithms, which combines (i) an efficient parallel-in-parallel structure to solve the 3-D forward problem, (ii) a reduced order technique to create fast and accurate surrogate models of the forward problem and (iii) adaptive strategies for both the MCMC algorithm and the surrogate model. In particular, and contrary to traditional implementations, the adaptation of the surrogate is integrated into the MCMC inversion. This circumvents the need for costly offline stages to build the surrogate and further increases the overall efficiency of the method. We demonstrate the feasibility and performance of our approach to invert for large-scale conductivity structures with two numerical examples using different parametrizations and dimensionalities. In both cases, we report staggering gains in computational efficiency compared to traditional MCMC implementations. Our method finally removes the main bottleneck of probabilistic inversions of 3-D MT data and opens up new opportunities for both stand-alone MT inversions and multi-observable joint inversions for the physical state of the Earth’s interior.
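As a schematic sketch of point (iii), interleaving surrogate adaptation with sampling might look like the following Metropolis-Hastings loop. Everything here is a placeholder, not the authors' implementation: `full_forward(x)` stands for the exact (expensive) log-likelihood, `surrogate.loglike(x)` for the reduced-order approximation, and `surrogate.update(x, y)` for refining the surrogate with a new exact evaluation.

```python
import numpy as np

def mcmc_with_adaptive_surrogate(x0, full_forward, surrogate, n_steps,
                                 step=0.1, refine_every=50):
    """Metropolis-Hastings in which an adaptive surrogate stands in for
    the expensive 3-D forward problem (illustrative only)."""
    x, ll = x0, surrogate.loglike(x0)
    chain = [x]
    for i in range(n_steps):
        prop = x + step * np.random.randn(*np.shape(x))
        ll_prop = surrogate.loglike(prop)
        if np.log(np.random.rand()) < ll_prop - ll:
            x, ll = prop, ll_prop
        # Adaptation is interleaved with sampling: periodically run the
        # full solver and fold the result back into the surrogate.
        if i % refine_every == 0:
            surrogate.update(x, full_forward(x))
            ll = surrogate.loglike(x)  # re-evaluate under refined model
        chain.append(x)
    return np.array(chain)
```

The key design point mirrored here is that no separate offline training stage is needed: the surrogate improves exactly where the chain spends its time.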


2020, Vol. 496 (1), pp. 629-637
Author(s): Ce Yu, Kun Li, Shanjiang Tang, Chao Sun, Bin Ma, ...

ABSTRACT Time series data of celestial objects are commonly used to study valuable and unexpected objects such as extrasolar planets and supernovae in time-domain astronomy. Due to the rapid growth of data volume, traditional manual methods are becoming infeasible for continuously analysing the accumulated observation data. To meet such demands, we designed and implemented a special tool named AstroCatR that can efficiently and flexibly reconstruct time series data from large-scale astronomical catalogues. AstroCatR can load original catalogue data from Flexible Image Transport System (FITS) files or databases, match each item to determine which object it belongs to, and finally produce time series data sets. To support high-performance parallel processing of large-scale data sets, AstroCatR uses an extract-transform-load (ETL) pre-processing module to create sky zone files and balance the workload. The matching module uses an overlapped indexing method and an in-memory reference table to improve accuracy and performance. The output of AstroCatR can be stored in CSV files or transformed into other formats as needed. At the same time, the module-based software architecture ensures the flexibility and scalability of AstroCatR. We evaluated AstroCatR with actual observation data from the three Antarctic Survey Telescopes (AST3). The experiments demonstrate that AstroCatR can efficiently and flexibly reconstruct all time series data by setting relevant parameters and configuration files. Furthermore, the tool is approximately 3× faster than methods using relational database management systems at matching massive catalogues.
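To make the matching idea concrete, here is a toy Python sketch of zone-partitioned positional matching against an in-memory reference table. The zone height, radius, flat-sky separation formula, and the `ref_table` layout are all illustrative simplifications, not AstroCatR's actual overlapped indexing method.

```python
import numpy as np

def zone_index(dec, zone_height=0.5):
    """Assign a declination zone id (degrees), the flavour of sky
    partitioning used to split catalogues and balance workload."""
    return int((dec + 90.0) // zone_height)

def match_to_reference(ra, dec, ref_table, radius=1.0 / 3600):
    """Return the id of the reference object within 'radius' degrees
    of (ra, dec), or None. ref_table maps zone id -> list of
    (obj_id, ra, dec) rows; neighbouring zones are also checked so
    objects near a zone boundary are not missed."""
    z = zone_index(dec)
    for zone in (z - 1, z, z + 1):
        for obj_id, r, d in ref_table.get(zone, []):
            # Small-angle separation, adequate for a sketch.
            sep = np.hypot((ra - r) * np.cos(np.radians(dec)), dec - d)
            if sep < radius:
                return obj_id
    return None
```

Checking the two neighbouring zones is a simple analogue of the "overlapped" aspect of the indexing: detections near zone borders still find their counterpart.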


2020
Author(s): Axel Lauer, Fernando Iglesias-Suarez, Veronika Eyring, the ESMValTool development team

The Earth System Model Evaluation Tool (ESMValTool) has been developed with the aim of taking model evaluation to the next level by facilitating analysis of many different ESM components, providing well-documented source code and scientific background for the implemented diagnostics and metrics, and allowing for traceability and reproducibility of results (provenance). This has been made possible by a lively and growing development community that continuously improves the tool, supported by multiple national and European projects. The latest version (2.0) of the ESMValTool has been developed as a large community effort to specifically target the increased data volume of the Coupled Model Intercomparison Project Phase 6 (CMIP6) and the related challenges posed by the analysis and evaluation of output from multiple high-resolution and complex ESMs. For this, the core functionalities have been completely rewritten to take advantage of state-of-the-art computational libraries and methods and to allow for efficient and user-friendly data processing. Common operations on the input data, such as regridding or the computation of multi-model statistics, are now centralized in a highly optimized preprocessor written in Python. The diagnostic part of the ESMValTool includes a large collection of standard recipes for reproducing peer-reviewed analyses of many variables across the atmosphere, ocean, and land domains, with diagnostics and performance metrics focusing on the mean state, trends, variability, important processes and phenomena, as well as emergent constraints. While most of the diagnostics use observational data sets (in particular satellite and ground-based observations) or reanalysis products for model evaluation, some are also based on model-to-model comparisons. This presentation introduces the diagnostics newly implemented in ESMValTool v2.0, including an extended set of large-scale diagnostics for quasi-operational and comprehensive evaluation of ESMs; new diagnostics for extreme events, regional model and impact evaluation, and analysis of ESMs; as well as diagnostics for emergent constraints and analysis of future projections from ESMs. The new diagnostics are illustrated with examples using results from the well-established CMIP5 and the newly available CMIP6 data sets.
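Purely as an illustration of the kind of multi-model statistic such a centralized preprocessor computes (this is generic NumPy, not the ESMValTool preprocessor API), consider:

```python
import numpy as np

def multi_model_stats(fields):
    """Multi-model mean and spread across model output fields that are
    assumed to be already regridded to a common grid (a step the real
    preprocessor would also perform)."""
    stacked = np.stack(fields)  # shape: (model, lat, lon)
    return stacked.mean(axis=0), stacked.std(axis=0)

# Hypothetical usage with three toy model fields on a small grid:
models = [np.random.rand(4, 8) for _ in range(3)]
mm_mean, mm_std = multi_model_stats(models)
```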


2021
Author(s): Murtadha Al-Habib, Yasser Al-Ghamdi

Abstract Extensive computing resources are required to leverage today's advanced geoscience workflows that are used to explore and characterize giant petroleum resources. In these cases, high-performance workstations are often unable to adequately handle the scale of computing required. The workflows typically utilize complex and massive data sets, which require advanced computing resources to store, process, manage, and visualize various forms of the data throughout their lifecycle. This work describes a large-scale geoscience end-to-end interpretation platform customized to run on a cluster-based remote visualization environment. A team of computing infrastructure and geoscience workflow experts was established to collaborate on the deployment, which was broken down into separate phases. Initially, an evaluation and analysis phase was conducted to analyze computing requirements and assess potential solutions. A testing environment was then designed, implemented, and benchmarked. The third phase used the test environment to determine the scale of infrastructure required for the production environment. Finally, the full-scale customized production environment was deployed for end users. During the testing phase, aspects such as connectivity, stability, interactivity, functionality, and performance were investigated using the largest available geoscience datasets. Multiple computing configurations were benchmarked until optimal performance was achieved, under applicable corporate information security guidelines. It was observed that the customized production environment was able to execute workflows that could not run on local user workstations. For example, while conducting connectivity, stability, and interactivity benchmarking, the test environment was operated for extended periods to ensure stability for workflows that require multiple days to run. To estimate the scale of the required production environment, user portfolio categories were determined based on data type, scale, and workflow. Continuous monitoring of system resources and utilization enabled continuous improvements to the final solution. The utilization of a fit-for-purpose, customized remote visualization solution may reduce or ultimately eliminate the need to deploy high-end workstations to all end users. Rather, a shared, scalable, and reliable cluster-based solution can serve a much larger user community in a highly performant manner.


In cloud-based Big Data applications, Hadoop has been widely adopted for distributed processing of large-scale data sets. However, the wasted energy consumption of data centers remains an important axis of research, owing to resource overuse and extra overhead costs. A practical way to overcome this challenge is dynamic scaling of resources in a Hadoop YARN cluster. This paper proposes a dynamic scaling approach for Hadoop YARN (DSHYARN) that adds or removes nodes automatically based on workload. It is built on two algorithms (scaling up and scaling down) that automate the scaling process in the cluster. The aim is to ensure both the energy efficiency and the performance of Hadoop YARN clusters. To validate the effectiveness of DSHYARN, a case study of sentiment analysis on tweets about the COVID-19 vaccine is provided; the goal is to analyze tweets posted by users on Twitter. The results showed improvements in CPU utilization, RAM utilization, and job completion time. In addition, energy consumption was reduced by 16% under an average workload.
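A minimal sketch of a threshold-based scale-up/scale-down decision of the kind DSHYARN automates is shown below; the thresholds, node limits, and the idea of returning a +1/-1/0 action are hypothetical illustrations, not the paper's actual algorithms or values.

```python
# Illustrative thresholds for mean cluster CPU utilization (0..1).
UP_THRESHOLD = 0.80     # above this, grow the cluster
DOWN_THRESHOLD = 0.30   # below this, shrink the cluster
MIN_NODES, MAX_NODES = 2, 20

def scaling_decision(cpu_loads, n_nodes):
    """Return +1 (add a node), -1 (remove a node) or 0 (hold),
    given per-node CPU loads and the current cluster size."""
    mean_load = sum(cpu_loads) / len(cpu_loads)
    if mean_load > UP_THRESHOLD and n_nodes < MAX_NODES:
        return +1
    if mean_load < DOWN_THRESHOLD and n_nodes > MIN_NODES:
        return -1
    return 0
```

In a real deployment the returned action would trigger YARN node commissioning or graceful decommissioning, typically with a cooldown period to avoid oscillation.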


2014
Author(s): R Daniel Kortschak, David L Adelson

bíogo is a framework designed to ease the development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed compiled language with built-in support for concurrent processing and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate manipulation and analysis of large-scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barriers to entry for researchers needing to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.


2020
Author(s): Yannick Spreen, Maximilian Miller

Motivation: The applicability and reproducibility of bioinformatics methods and results often depend on the structure and software architecture of their development. Exponentially growing data sets demand ever more optimization and performance, which conventional computing capacities increasingly fail to deliver. This creates a large overhead for software development in a research area that is primarily interested in solving complex biological problems rather than developing new, performant software solutions. In computer science, new structures in the field of web development have produced more efficient processes for container-based software solutions. The advantages of these structures have rarely been explored at a broader scientific scale. The same holds for the trend of migrating computations from on-premise resources to the cloud. Results: We created Bio-Node, a new platform for large-scale bio data analysis utilizing cloud compute resources (publicly available at https://bio-node.de). Bio-Node enables building complex workflows using a sophisticated web interface. We applied Bio-Node to implement bioinformatic workflows for rapid metagenome function annotation. We further developed "Auto-Clustering", a workflow that automatically extracts the most suitable clustering parameters for specific data types and subsequently enables optimal segregation of unknown samples of the same type. Compared to existing methods and approaches, Bio-Node improves the performance and cost of bioinformatics data analyses while providing an easier and faster development process with a focus on reproducibility and reusability.
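As a generic stand-in for the parameter extraction an "Auto-Clustering" style workflow performs (the paper's actual search space and selection criterion are not reproduced here), one could sweep a clustering parameter and keep the value that maximizes the silhouette score:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def auto_cluster(X, k_range=range(2, 11)):
    """Pick the number of clusters that maximizes the silhouette score;
    a purely illustrative sketch of automatic parameter selection."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```

The selected parameters can then be reused to segregate new, unlabeled samples of the same data type.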


2013, Vol. 9 (4), pp. 19-43
Author(s): Bo Hu, Nuno Carvalho, Takahide Matsutsuka

In light of the challenges of effectively managing Big Data, the authors are witnessing a gradual shift towards the increasingly popular Linked Open Data (LOD) paradigm. LOD aims to impose a machine-readable semantic layer over structured as well as unstructured data, and hence to automate data analysis tasks over content that was not designed for machine consumption. The convergence of Big Data and LOD is, however, not straightforward: the semantic layer of LOD and the large-scale storage of Big Data do not get along easily. Meanwhile, the sheer data size envisioned by Big Data rules out certain computationally expensive semantic technologies, rendering the latter much less efficient than their performance on relatively small data sets. In this paper, the authors propose a mechanism allowing LOD to take advantage of existing large-scale data stores while sustaining its “semantic” nature. The authors demonstrate how RDF-based semantic models can be distributed across multiple storage servers, and they examine how a fundamental semantic operation can be tuned to meet the requirements of distributed and parallel data processing. The authors' future work will focus on stress tests of the platform on the scale of tens of billions of triples, as well as comparative studies in usability and performance against similar offerings.
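A toy illustration of distributing RDF triples across storage servers, in the spirit of (but not identical to) the paper's approach, is to shard by subject so that all triples about one resource land on the same server; the server names and example triple below are invented for the sketch.

```python
import hashlib

# Hypothetical storage servers participating in the distributed store.
SERVERS = ["store-0", "store-1", "store-2", "store-3"]

def server_for(subject):
    """Hash the triple's subject to pick a shard, so every triple about
    a given resource is co-located on the same server."""
    digest = hashlib.sha1(subject.encode("utf-8")).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

triple = ("http://example.org/alice", "foaf:knows", "http://example.org/bob")
print(server_for(triple[0]))  # all of alice's triples hash to one shard
```

Co-locating a subject's triples keeps common graph-pattern lookups on a single node, while the hash spreads distinct resources evenly across the cluster.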

