Exploring Approaches for Large Data in Seismology: User and Data Repository Perspectives

Author(s):  
Javier Quinteros ◽  
Jerry A. Carter ◽  
Jonathan Schaeffer ◽  
Chad Trabant ◽  
Helle A. Pedersen

Abstract. New data acquisition techniques are generating data at much finer temporal and spatial resolution than traditional seismic experiments, posing a challenge for both data centers and users. As the amount of data potentially flowing into data centers increases by one to two orders of magnitude, data management challenges arise throughout all stages of the data flow. The Incorporated Research Institutions for Seismology, Réseau sismologique et géodésique français, and GEOForschungsNetz data centers carried out a survey and conducted interviews of users working with very large datasets to understand their needs and expectations. One of the conclusions is that existing data formats and services are not well suited to users of large datasets. Data centers are exploring storage solutions, data formats, and data delivery options to meet these users' needs. New approaches will need to be discussed within the community to establish standards and best practices for large datasets, perhaps through the participation of stakeholders and users in discussion groups and forums.
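For context, a minimal sketch of the request-per-channel access pattern that today's FDSN web services encourage, using the ObsPy client; the network, station, and channel codes here are illustrative and this example is not taken from the paper itself:

```python
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

# Each call fetches one channel over one time window; scaling this
# pattern to dense deployments with thousands of channels is where
# the per-request service model becomes a bottleneck.
client = Client("IRIS")
t0 = UTCDateTime("2020-01-01T00:00:00")
stream = client.get_waveforms(network="IU", station="ANMO",
                              location="00", channel="BHZ",
                              starttime=t0, endtime=t0 + 3600)
print(stream)
```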

Author(s):  
Trong Dinh Thac Do ◽  
Longbing Cao

Matrix Factorization (MF) is widely used in Recommender Systems (RSs) for estimating missing ratings in the rating matrix. MF faces the major challenges of handling very sparse and large data. Poisson Factorization (PF), an MF variant, addresses these challenges with high efficiency by computing only on the non-missing elements. However, ignoring the missing elements in computation makes PF weak or incapable of dealing with columns or rows that have very few observations (corresponding to sparse items or users). In this work, Metadata-dependent Poisson Factorization (MPF) is proposed to address user/item sparsity by integrating user/item metadata into PF. MPF adds the metadata-based observed entries to the factorized PF matrices. In addition, as with MF, choosing a suitable number of latent components for PF is very expensive on very large datasets. Accordingly, we further extend MPF to Metadata-dependent Infinite Poisson Factorization (MIPF), which integrates a Bayesian Nonparametric (BNP) technique to automatically tune the number of latent components. Our empirical results show that, by integrating metadata, MPF and MIPF significantly outperform the state-of-the-art PF models on sparse and large datasets. MIPF also effectively estimates the number of latent components.
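To illustrate the efficiency argument, here is a minimal NumPy sketch of Poisson factorization via multiplicative updates; this is a stand-in for the variational inference such models typically use, not the paper's algorithm. The point is that the data term touches only the nonzero entries, so each iteration costs O(nnz) rather than O(users × items):

```python
import numpy as np

def poisson_mf(Y, k=10, n_iters=50, eps=1e-10, seed=0):
    """Poisson matrix factorization by multiplicative updates.

    Y is a scipy.sparse matrix of counts/ratings. The data term is
    evaluated only at the observed (nonzero) entries, which is the
    source of PF's efficiency on very sparse matrices.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = Y.shape
    theta = rng.gamma(1.0, 1.0, (n_users, k))  # user factors
    beta = rng.gamma(1.0, 1.0, (n_items, k))   # item factors
    coo = Y.tocoo()
    u, i, y = coo.row, coo.col, coo.data

    def ratio():
        # y / (theta_u . beta_i), evaluated at observed entries only
        return y / (np.einsum('nk,nk->n', theta[u], beta[i]) + eps)

    for _ in range(n_iters):
        r = ratio()
        num = np.zeros_like(theta)
        np.add.at(num, u, r[:, None] * beta[i])            # sum over nonzeros
        theta *= num / (beta.sum(0, keepdims=True) + eps)  # column sums, no per-entry work
        r = ratio()
        num = np.zeros_like(beta)
        np.add.at(num, i, r[:, None] * theta[u])
        beta *= num / (theta.sum(0, keepdims=True) + eps)
    return theta, beta
```

In this picture, MPF's metadata trick amounts to appending metadata-derived observed entries to Y, so sparse users and items still contribute observations to the updates.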


2020 ◽  
Author(s):  
Jonathan M. Lilly ◽  
Paula Perez-Brunius

Abstract. A method for objectively extracting the displacement signals associated with coherent eddies from Lagrangian trajectories is presented, refined, and applied to a large dataset of 3761 surface drifters from the Gulf of Mexico. The method, wavelet ridge analysis, is modified to exclude the possibility of features changing from rotating in the cyclonic sense to rotating in the anticyclonic sense or vice versa, transitions that would be physically unrealistic for a coherent eddy. A means for formally assessing statistical significance is introduced, addressing the issue of "false positives" arising by chance from an unstructured turbulent background, and opening the door to confident application of the method to very large datasets. Significance is measured in a two-dimensional parameter space by comparison with a stochastic dataset having statistical and spectral properties that match the original, but lacking organized oscillations due to eddies or waves. The application to the Gulf of Mexico reveals massive asymmetries between cyclones and anticyclones, with anticyclones dominating at radii larger than about 50 km, but an unexpectedly rich population of highly nonlinear cyclones dominating at smaller radii. Both the method and the Gulf of Mexico eddy dataset are made freely available to the community for use in future research.
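As a sketch of the significance-testing idea (not the authors' exact stochastic model), a surrogate record with a matching power spectrum but no organized oscillations can be built by Fourier phase randomization; ridge detections in many such surrogates then calibrate the false-positive rate:

```python
import numpy as np

def phase_randomized_surrogate(z, rng=None):
    """Return a surrogate time series with the same power spectrum
    as z but uniformly random Fourier phases.

    For a complex-valued trajectory z = x + 1j*y, randomizing the
    phases destroys coherent oscillations (eddies) while preserving
    the second-order statistics of the turbulent background.
    """
    rng = np.random.default_rng() if rng is None else rng
    Z = np.fft.fft(np.asarray(z, dtype=complex))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=Z.shape)
    return np.fft.ifft(np.abs(Z) * np.exp(1j * phases))
```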


2020 ◽  
Vol 196 ◽  
pp. 105777
Author(s):  
Jadson Jose Monteiro Oliveira ◽  
Robson Leonardo Ferreira Cordeiro

Nowadays, various data mining algorithms can extract specific sets of data, known as patterns, from huge data repositories, but there is no infrastructure or system for the persistent storage of the generated patterns. A pattern warehouse provides a foundation for preserving these patterns in a dedicated environment for long-term use. Most organizations are more interested in the information or patterns than in raw, unprocessed data, because extracted knowledge plays a vital role in making the right decisions for the growth of an organization. We have examined the sources of patterns generated from large datasets. In this paper, we discuss the application areas of patterns and the idea of a pattern warehouse, the architecture of a pattern warehouse, the correlation between data warehousing and data mining, and the association between data mining and pattern warehouses, together with a critical evaluation of existing, theoretically published approaches, with particular emphasis on review elements related to association rules. We also analyze pattern warehouses and data warehouses with respect to factors such as storage space, type of storage unit, and characteristics, and identify several open research domains.
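As a toy illustration of persistent pattern storage (the schema and field names are our own assumptions, not taken from the paper), an association rule mined once can be kept queryable long after the raw transactions are discarded:

```python
import sqlite3

# A minimal sketch of a pattern warehouse table for mined
# association rules; schema and sample values are illustrative.
conn = sqlite3.connect("pattern_warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS association_rules (
        rule_id     INTEGER PRIMARY KEY,
        antecedent  TEXT NOT NULL,   -- e.g. 'bread,butter'
        consequent  TEXT NOT NULL,   -- e.g. 'milk'
        support     REAL NOT NULL,   -- fraction of transactions
        confidence  REAL NOT NULL,
        mined_at    TEXT DEFAULT CURRENT_TIMESTAMP,
        source      TEXT             -- dataset the rule came from
    )""")
conn.execute(
    "INSERT INTO association_rules "
    "(antecedent, consequent, support, confidence, source) "
    "VALUES (?, ?, ?, ?, ?)",
    ("bread,butter", "milk", 0.02, 0.71, "retail_2019"),
)
conn.commit()
```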


2018 ◽  
Author(s):  
Hamid Bagher ◽  
Usha Muppiral ◽  
Andrew J Severin ◽  
Hridesh Rajan

Abstract.
Background: Creating a computational infrastructure that scales well to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting, and analyzing the relevant data. Shared Data Science Infrastructures like Boa can be used to more efficiently process and parse data contained in large data repositories. The main features of Boa are inspired by existing languages for data-intensive computing, and it can easily integrate data from biological data repositories.
Results: Here, we present an implementation of Boa for genomic research (BoaG) on a relatively small data repository: RefSeq's 97,716 annotation (GFF) and assembly (FASTA) files and their metadata. We used BoaG to query the entire RefSeq dataset, gained insight into the RefSeq genome assemblies and gene model annotations, and show that assembly quality using the same assembler varies depending on the species.
Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure BoaG can give researchers greater access to efficiently explore data in ways previously possible only for the most well-funded research groups. We demonstrate the efficiency of BoaG in exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation, as a proof of concept for much larger datasets.
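BoaG itself exposes a domain-specific query language over the repository; as a language-agnostic illustration of the kind of aggregate such a query computes (the file layout and naming below are assumptions), here is a plain-Python equivalent that counts gene features per GFF file:

```python
import gzip
from collections import Counter
from pathlib import Path

def gene_counts(gff_dir):
    """Count 'gene' features in each compressed GFF file under gff_dir.

    This mimics, in ordinary Python, the sort of repository-wide
    aggregation a BoaG query would express in a few lines.
    """
    counts = Counter()
    for path in Path(gff_dir).glob("*.gff.gz"):
        with gzip.open(path, "rt") as fh:
            for line in fh:
                if line.startswith("#"):
                    continue  # skip GFF header/comment lines
                cols = line.rstrip("\n").split("\t")
                if len(cols) > 2 and cols[2] == "gene":
                    counts[path.name] += 1
    return counts
```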


2001 ◽  
Vol 27 (11) ◽  
pp. 1457-1478 ◽  
Author(s):  
Michael D Beynon ◽  
Tahsin Kurc ◽  
Umit Catalyurek ◽  
Chialin Chang ◽  
Alan Sussman ◽  
...  
